Hadoop has become the de facto platform for storing and processing large amounts of data and has found widespread application. In the Hadoop ecosystem, you can store your data in one of the storage managers (for example, HDFS, HBase, Solr, etc.) and then use a processing framework to process the stored data. Hadoop first shipped with only one processing framework: MapReduce. Today, there are many other open source tools in the Hadoop ecosystem that can be used to process data in Hadoop; a few common tools include the following Apache projects: Hive, Pig, Spark, Cascading, Crunch, Tez, and Drill, along with Impala and Presto. Some of these frameworks are built on top of each other. For example, you can write queries in Hive that run on MapReduce or Tez. Another example currently under development is the ability to run Hive queries on Spark.
Amidst all of these options, two key questions arise for Hadoop users:
Which processing frameworks are most commonly used?
How do I choose which framework(s) to use for my specific use case?
This post will help you answer both of these questions, giving you enough context to make an educated decision about the best processing framework for your specific use case.
Categories of processing frameworks
One can broadly classify processing frameworks in Hadoop into the following six categories:
General-purpose processing frameworks — These frameworks allow users to process data in Hadoop using a low-level API. Although these are all batch frameworks, they follow different programming models. Examples include MapReduce and Spark.
Abstraction frameworks — These frameworks allow users to process data using a higher level abstraction. These can be API-based — for example, Crunch and Cascading, or based on a custom DSL, such as Pig. These are typically built on top of a general-purpose processing framework.
SQL frameworks — These frameworks enable querying data in Hadoop using SQL. These can be built on top of a general-purpose framework, such as Hive, or as a stand-alone, special-purpose framework, such as Impala. Technically, SQL frameworks can be considered abstraction frameworks. However, given their high demand and slew of options available in this category, it makes sense to classify SQL frameworks as their own category.
Graph processing frameworks — These frameworks enable graph processing capabilities on Hadoop. They can be built on top of a general-purpose framework, such as Giraph, or as a stand-alone, special-purpose framework, such as GraphLab.
Machine learning frameworks — These frameworks enable machine learning analysis on Hadoop data. These can also be built on top of a general-purpose framework, such as MLlib (on Spark), or as a stand-alone, special-purpose framework, such as Oryx.
Real-time/streaming frameworks — These frameworks provide near real-time processing (several hundred milliseconds to few seconds latency) for data in the Hadoop ecosystem. They can be built on top of a generic framework, such as Spark Streaming (on Spark), or as a stand-alone, special-purpose framework, such as Storm.
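To make the "different programming models" point in the first category concrete, here is a minimal sketch of the MapReduce model (map, shuffle, reduce) in plain Python. It uses no Hadoop APIs and exists only to show the shape of the computation a framework like MapReduce runs for you, distributed across a cluster:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop stores data", "spark processes data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # 2
```

In a real MapReduce job, the map and reduce functions are the only parts you write; the framework handles the shuffle, distribution, and fault tolerance.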
The diagram below organizes common processing frameworks in the Hadoop ecosystem by classifying them into the six categories.
As you can see, some of these frameworks build on top of a general-purpose processing framework, while others don’t. Examples of frameworks that do not build on top of a general-purpose framework include Impala, Drill, and GraphLab. We’ll use the term special-purpose frameworks to refer to them from here on.
Note that there is another way to distinguish processing frameworks: based on their architecture. Frameworks that have active components, like a server (e.g., Hive), can be considered engines, while others that do not have an active component can simply be considered libraries (e.g., MLlib). (This distinction, however, does not impact end users; users who need a solid machine learning framework usually don’t care whether it’s architecturally considered a library or an engine.)
Now comes the million dollar question: which framework(s) should you use?
The answer depends on two major factors:
Your use case
The expertise/experience present in your organization
To decide, you should first pick the category of framework(s) you need, and then choose from the particular frameworks available within those categories. The next section should help you decide which processing framework(s) to use.
When to use each processing framework
General-purpose processing frameworks: You always need a general-purpose framework for your cluster. This is because all of the other kinds of frameworks only solve a specific use case (e.g., graph processing, machine learning, etc.), and by themselves, they are not sufficient for handling the variety of processing needs likely at your organization. Moreover, many of the other frameworks rely on general-purpose frameworks. Even the special-purpose frameworks that don’t build upon general-purpose frameworks often rely on bits and pieces of them.
The common frameworks in this category are MapReduce, Spark, and Tez — and newer frameworks, such as Apache Flink, are now emerging. As of today, MapReduce is almost always installed on clusters, and other general-purpose frameworks rely on bits and pieces from the MapReduce stack, like Input/Output formats. You can still use other frameworks like Tez or Spark, though, without having MapReduce installed on your cluster.
So, the question is: which of the general-purpose processing frameworks should you use? MapReduce is the most mature; however, it is arguably the slowest. Spark and Tez are both DAG frameworks and don't have the overhead of always running a Map followed by a Reduce job; both are more flexible than MapReduce. Spark is one of the most popular projects in the Hadoop ecosystem and has a lot of traction. Many consider it the successor to MapReduce, and I encourage you to use Spark over MapReduce wherever possible.
Notably, MapReduce and Spark have different APIs; this means that, unless you are using an abstraction framework, if you migrate from MapReduce to Spark, you'll have to rewrite your jobs in Spark. It's also worth noting that even though Spark is a general-purpose engine with other abstraction frameworks built upon it, it also provides high-level processing APIs. In this way, the Spark API can itself be seen as an abstraction framework. Consequently, the amount of time and code required for writing a Spark job is usually much less than for an equivalent MapReduce job.
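To illustrate why Spark jobs tend to be shorter, the same word count can be expressed as a chain of high-level operations. The sketch below mimics the style of Spark's RDD API (flatMap, map, reduceByKey) with a toy class in plain Python; it is not the actual Spark API, just an illustration of the chained programming model:

```python
from collections import defaultdict

class MiniRDD:
    """A toy stand-in for Spark's RDD, only to show the chained API style."""
    def __init__(self, data):
        self.data = list(data)

    def flat_map(self, f):
        return MiniRDD(x for item in self.data for x in f(item))

    def map(self, f):
        return MiniRDD(f(item) for item in self.data)

    def reduce_by_key(self, f):
        groups = defaultdict(list)
        for key, value in self.data:
            groups[key].append(value)
        return MiniRDD(
            (key, __import__("functools").reduce(f, values))
            for key, values in groups.items()
        )

lines = MiniRDD(["hadoop stores data", "spark processes data"])
counts = dict(
    lines.flat_map(str.split)        # split lines into words
         .map(lambda w: (w, 1))      # pair each word with a count of 1
         .reduce_by_key(lambda a, b: a + b)  # sum counts per word
         .data
)
print(counts["data"])  # 2
```

Compare this with the explicit mapper/shuffle/reducer structure of a MapReduce job: the chained style expresses the whole pipeline in a few lines.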
At this point, Tez is best suited as a framework to build abstraction frameworks, instead of building applications using its API.
The important thing to note is that just because you have a general-purpose processing framework installed on your cluster doesn’t mean you have to write all of your processing jobs using that framework’s API. In fact, it is recommended to use abstraction frameworks (e.g., Pig, Crunch, Cascading) or SQL frameworks (e.g., Hive and Impala) for writing processing jobs wherever possible (there are two exceptions to this rule, as discussed in the next section).
Abstraction and SQL frameworks: Abstraction frameworks (e.g., Pig, Crunch, and Cascading) and SQL frameworks (e.g., Hive and Impala) reduce the amount of time spent writing jobs directly for the general-purpose frameworks.
Abstraction frameworks: As shown in the above diagram, Pig is an abstraction framework that can run on MapReduce, Spark, or Tez. Apache Crunch provides a higher-level API that can be used to run MapReduce or Spark jobs. Cascading is another API-based abstraction framework that can run on MapReduce or Tez.
SQL frameworks: As far as SQL engines go, Hive can run on top of MapReduce or Tez, and work is being done to make Hive run on Spark. There are several special-purpose SQL engines aimed at faster SQL, including Impala, Presto, and Apache Drill.
Key points on the benefits of using an abstraction or SQL framework:
You can save a lot of time by not having to implement common processing tasks using the low-level APIs of general-purpose frameworks.
You can change underlying general-purpose processing frameworks (as needed and applicable). Coding directly on the framework means you would have to re-write your jobs if you decided to change frameworks. Using an abstraction or SQL framework that builds upon a generic framework abstracts that away.
Writing a job with an abstraction or SQL framework takes just a fraction of the effort needed for an equivalent job written directly against a general-purpose framework's API. Also, running a query on a special-purpose processing framework (e.g., Impala or Presto for SQL) is much faster than running an equivalent MapReduce job, because these frameworks use a completely different execution model, built for running fast SQL queries.
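The time-savings argument is easy to see with plain SQL. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for a SQL-on-Hadoop engine such as Hive or Impala: one declarative query replaces the mapper, shuffle, and reducer you would otherwise write by hand, and the engine plans the execution for you:

```python
import sqlite3

# An in-memory table standing in for a data set stored in Hadoop.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (user TEXT, page TEXT)")
conn.executemany(
    "INSERT INTO clicks VALUES (?, ?)",
    [("alice", "home"), ("bob", "home"), ("alice", "cart")],
)

# One declarative query; the engine handles grouping and aggregation.
rows = conn.execute(
    "SELECT user, COUNT(*) FROM clicks GROUP BY user ORDER BY user"
).fetchall()
print(rows)  # [('alice', 2), ('bob', 1)]
```

The same grouping and counting, written directly against a low-level API, would require you to code the map, shuffle, and reduce steps yourself.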
Two exceptions where you should use a general-purpose framework:
If you have certain information about the data (i.e., metadata) that can't be expressed and taken advantage of in an abstraction or SQL framework. For example, let's say that your data set is partitioned or sorted in a particular way that you cannot express when creating a logical data set in an abstraction or SQL framework, yet making use of that partitioning/sorting metadata would speed up the processing. In such cases, it makes sense to program directly against the low-level API of a general-purpose processing framework; the time saved by running the job over and over again more than pays for the extra development time.
If your use case is particularly suited to a general-purpose framework. This is usually a small percentage of use cases where the analysis is very complex and can’t be easily expressed in a DSL like SQL or Pig Latin. In these cases, Crunch and Cascading should be considered, but oftentimes you might just have to directly program using a general-purpose processing framework.
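As a toy illustration of the first exception: if you know your data is already sorted by key, a hand-written job can aggregate in a single streaming pass, with no shuffle and no hash table over the full key space. The plain-Python sketch below assumes that sortedness as given metadata; nothing here is a Hadoop API:

```python
from itertools import groupby

# Assumed metadata: records arrive already sorted by key on disk.
sorted_records = [("a", 1), ("a", 4), ("b", 2), ("b", 3), ("c", 5)]

# Because the input is sorted, one streaming pass suffices; consecutive
# records with the same key can be summed as they are read.
totals = {key: sum(value for _, value in group)
          for key, group in groupby(sorted_records, key=lambda r: r[0])}
print(totals)  # {'a': 5, 'b': 5, 'c': 5}
```

A generic framework that cannot be told about the sort order would re-shuffle the data to group it, doing work the metadata makes unnecessary.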
Once you have decided on using an abstraction or SQL framework, which particular framework you use usually depends on the expertise and experience you have in-house.
Graph, machine learning, and real-time/streaming frameworks
There is usually no need to convince users to adopt graph, machine learning, and real-time/streaming frameworks. If a specific use case is important to you, you will likely need to use a framework that solves that use case.
Giraph, GraphX, and GraphLab are popular graph processing frameworks.
Apache Giraph is a library that runs on top of MapReduce.
GraphX is a library for graph processing on Spark.
GraphLab is a stand-alone, special-purpose graph processing framework that can now also handle tabular data.
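Giraph and GraphX follow the vertex-centric ("think like a vertex") model popularized by Pregel: in each superstep, every vertex reads its incoming messages, updates its state, and sends messages to its neighbors. A minimal plain-Python sketch of that loop, propagating the maximum value through a small graph (no Giraph or GraphX APIs involved):

```python
# Graph as adjacency lists; each vertex starts with its own id as its value.
edges = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
value = {v: v for v in edges}

# Superstep loop: vertices exchange values until nothing changes.
changed = True
while changed:
    # Each vertex sends its current value to its neighbors.
    inbox = {v: [] for v in edges}
    for v, neighbors in edges.items():
        for n in neighbors:
            inbox[n].append(value[v])
    # Each vertex updates its state from the messages it received.
    changed = False
    for v, messages in inbox.items():
        best = max([value[v]] + messages)
        if best != value[v]:
            value[v] = best
            changed = True

print(value)  # every vertex converges to the maximum value, 4
```

A framework like Giraph runs the same loop across a cluster, partitioning the vertices and handling the message delivery between supersteps.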
Mahout, MLlib, Oryx, and H2O are commonly used machine learning frameworks.
Mahout is a library on top of MapReduce, although there are plans to make Mahout work on Spark.
MLlib is a machine learning library for Spark.
Oryx and H2O are stand-alone, special-purpose machine learning engines.
For near real-time analysis of data, Spark Streaming and Storm + Trident are commonly used frameworks.
Spark Streaming is a library for doing micro-batch streaming analysis, built on top of Spark.
Apache Storm is a special-purpose, distributed, real-time computation engine, with Trident serving as an abstraction layer on top of it.
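Spark Streaming's micro-batch model is easy to sketch: incoming records are buffered briefly, and each buffered batch is then processed as a small batch job. The plain-Python loop below shows the idea only; it is not the Spark Streaming API, and it batches by record count for simplicity, whereas Spark Streaming batches by time interval:

```python
def micro_batches(stream, batch_size):
    """Cut a record stream into small fixed-size batches."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# Each batch is processed as a small batch job, e.g. a running click count.
stream = ["click", "view", "click", "click", "view"]
running_clicks = 0
per_batch = []
for batch in micro_batches(stream, batch_size=2):
    running_clicks += batch.count("click")
    per_batch.append(running_clicks)

print(per_batch)  # [1, 3, 3]
```

This is why micro-batch systems have latencies of hundreds of milliseconds to seconds: results appear once per batch, not per record, which is the key design difference from a per-record engine like Storm.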
The Hadoop ecosystem has evolved to the point where using MapReduce is no longer the only way to query data in Hadoop. With the breadth of options now available, it can be tough to choose which framework to use for processing your Hadoop data.
Most users adopt more than one framework for processing their Hadoop data, and this makes having resource management in your Hadoop cluster extremely important. A common data pipeline begins with ingestion; followed by ETL, which is done by a general-purpose engine, an abstraction engine, or a combination thereof; followed by one of the many special-purpose engines for doing low-latency SQL, machine learning, or graph processing.
I hope this post helps when you’re deciding which processing framework(s) to use. Happy Hadooping!