What You Need to Know about Hadoop and Its Ecosystem


What is the actual relationship between Big Data and Hadoop? Simply put, big data is a huge pool of data of various kinds and formats that holds hidden potential for better business decisions. Hadoop, on the other hand, is an open-source software framework designed to store and process that data cheaply and efficiently.

It can store and process many large files at once, even when the total volume exceeds the capacity of a single server (e.g. your PC), and it can move them across the network quickly. Hadoop is well known for its high computing power, strong fault tolerance and scalability.

Understanding Hadoop Ecosystem

Released under the Apache License, Hadoop sits at the center of a full ecosystem. Hadoop alone cannot do everything, but together with its ‘friends’ it becomes a good match for Big Data. In this post, we introduce some of the most notable components of the Hadoop ecosystem, categorized by function.

Hadoop Ecosystem components

Data Storage

HDFS (Hadoop Distributed File System) is the core storage component of Hadoop. HDFS is used to store and access huge files and is based on a client/server architecture. It also distributes and stores the data across the machines of a Hadoop cluster.
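
To make the idea of distributing one file across a cluster more concrete, here is a minimal Python sketch. It assumes a block size of 128 MB and three copies of every block, and it places copies round-robin over the nodes. The function name plan_blocks and the node names are hypothetical, and the round-robin placement is a toy stand-in rather than the placement policy HDFS actually uses.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed number of copies per block


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, [node, ...]) tuples."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    block_count = math.ceil(file_size / block_size)
    plan = []
    for i in range(block_count):
        # The last block may be shorter than a full block.
        length = min(block_size, file_size - i * block_size)
        # Toy round-robin placement (an assumption, not the real HDFS policy).
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan
```

Calling plan_blocks(300 * 1024 * 1024, ["n1", "n2", "n3", "n4"]) splits a 300 MB file into three blocks, each stored on three different nodes, so losing a single node never loses a block.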

HBase (Hadoop Database) is a column-oriented database built on top of HDFS. Being a file system, HDFS lacks random read and write capability. This is where HBase steps in, providing fast record lookups in large tables.
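
The kind of access this adds can be pictured with a small in-memory sketch in Python. It is not the HBase API: the class ToyTable and its methods are hypothetical, and the sketch only assumes that rows are addressed by a row key, kept in sorted key order, and hold values under "family:qualifier" column names.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a table whose rows are sorted by row key."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> {"family:qualifier": value}

    def put(self, row_key, column, value):
        """Write one cell; creates the row on first use."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Random read of a single row by its key."""
        return self._rows.get(row_key, {})

    def scan(self, start, stop):
        """Yield (row_key, columns) for start <= row_key < stop."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]
```

A single get touches one row directly, and a scan walks only the requested key range, instead of reading a whole file from start to end.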

Data Processing

MapReduce is a framework for processing data in parallel across a cluster. Using MapReduce can save data analysts a lot of time. For example, if a conventional relational database needs around 20 hours to process a large data set, MapReduce might need only around three minutes, depending on the workload and the size of the cluster.
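
The programming model behind MapReduce can be sketched in a few lines of plain Python. This word count runs on a single machine and does not use Hadoop at all; the function names are hypothetical and only illustrate the map, shuffle and reduce steps that a real cluster would spread over many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, both steps can be handed out to many machines at once, which is where the time savings come from.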

YARN (Yet Another Resource Negotiator) is a resource manager. It is often described as the second generation of MapReduce and is a critical advancement over Hadoop 1. YARN plays the role of an operating system for the cluster: its job is to manage and monitor workloads, serve multiple clients and enforce security controls. In addition, YARN supports new processing models that MapReduce does not.

Data Access

Hive provides an SQL-like query language, called HiveQL, on top of Hadoop. It was created to help people who are familiar with traditional databases and SQL leverage Hadoop and MapReduce.

Pig is a platform for analyzing large data sets. It is made up of two components: first, the platform that executes Pig programs; second, a powerful yet simple scripting language called Pig Latin, which is used to write those programs.

Mahout provides a library of the most popular machine learning algorithms written in Java that supports collaborative filtering, clustering, and classification.

Avro is a data serialization system. It uses JSON to define data types and protocols in support of data-driven applications. Avro integrates easily with many different languages, so that Hadoop applications can be written in languages other than Java (e.g. Python, C++).
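
As an illustration of defining a data type in JSON, the Python sketch below builds a small record schema in the shape Avro schemas take ("type", "name" and a list of "fields") and prints it as JSON. The record name User and its fields are hypothetical. The sketch uses only the standard library, not an Avro package, and the matches_schema helper is a hand-rolled check for this example rather than Avro's own validation.

```python
import json

# Hypothetical record schema, written in the JSON shape used for Avro schemas.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
}

# Only the two primitive types used above are covered by this toy check.
PYTHON_TYPES = {"string": str, "int": int}


def matches_schema(record, schema):
    """Return True if the record has exactly the schema's fields and types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        type(record[field["name"]]) is PYTHON_TYPES[field["type"]]
        for field in fields
    )


# The schema itself is plain JSON, so any language can read it.
schema_json = json.dumps(USER_SCHEMA, indent=2)
```

Since the schema is just a JSON document, a Python writer and a Java reader can agree on the same record layout without sharing any code.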

Sqoop (SQL + Hadoop = Sqoop) is a command-line application that transfers data between Hadoop and relational databases (e.g. MySQL or Oracle) or mainframes.

Data Management

Oozie is a workflow scheduler for Hadoop. Oozie streamlines the process of creating workflows and coordinating jobs across Hadoop applications such as MapReduce, Pig, Sqoop, Hive etc. The main responsibilities of Oozie are: first, to define a sequence of actions to be executed; second, to place triggers for those actions.
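
These two responsibilities can be pictured with a toy Python sketch. It is not how Oozie workflows are written, and run_workflow, the action names and the trigger are all hypothetical; the sketch only shows an ordered list of actions that runs once a trigger condition holds.

```python
def run_workflow(actions, trigger):
    """Run the named actions in order if trigger() is true.

    actions: list of (name, callable) pairs, executed in list order.
    trigger: callable returning True when the workflow may start.
    Returns the names of the actions that were executed.
    """
    if not trigger():
        return []
    completed = []
    for name, action in actions:
        action()
        completed.append(name)
    return completed


# Hypothetical usage: start only once the input data has arrived.
log = []
steps = [
    ("import", lambda: log.append("import done")),
    ("transform", lambda: log.append("transform done")),
    ("report", lambda: log.append("report done")),
]
input_ready = True
finished = run_workflow(steps, trigger=lambda: input_ready)
```

Here the trigger is a simple data-availability flag; a time-based trigger would be another callable with the same shape.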

Chukwa is another framework built on top of HDFS and MapReduce. Its purpose is to provide a dynamic and powerful data collection system. Chukwa can monitor and analyze the collected data and present the results, to get the most out of it.

Like Chukwa, Flume is a scalable and reliable system for collecting and moving cluster logs from various sources to a centralized store. However, there are some differences. In Flume, chunks of data are transferred from node to node in a store-and-forward manner, while in Chukwa the agent on each machine determines what data to send.

ZooKeeper is a coordination service for distributed systems. It provides a very simple programming interface and helps reduce management complexity by providing services such as configuration, distributed synchronization, naming, group services etc.

Hadoop keeps evolving day by day. Apart from the projects and systems mentioned above, there are more tools and resources to explore, such as Cassandra, Impala and R connectors. Consider Hadoop carefully as an open option, then let it help you truly personalize your big data experience to meet your business requirements.

Savvycom always stays ahead of the evolution of this technology. Recently, we held an interesting internal training session on Big Data, Hadoop and NoSQL for all technical team members. Check out our slideshare here.


What You Need to Know about Hadoop and Its Ecosystem at: October 2nd, 2017 by admin