Do we want to choose a Hadoop
distribution based on the Apache source or do we go with a distribution that has
improved on or modified the Hadoop architecture?
There are some great Big Data
distributions out there that have made their own improvements on the Hadoop
architecture, like MapR. MapR has thrown out the Hadoop Distributed File System
(HDFS) and replaced it with its own file system, which it calls MapR-FS. By
rewriting the file system, MapR has done away with the need for the NameNode
and Secondary NameNode services. MapR
claims that its proprietary file system is much faster than HDFS and that MapR's
Apache Hadoop distribution provides full data protection, no single points of
failure, improved performance, and dramatic ease of use advantages. MapR
offers three editions of its product, known as M3, M5 and M7. M3 is a free
version of the M5 product with reduced availability features. M7 is like M5,
but adds a purpose-built rewrite of HBase that implements the HBase API
directly in the file-system layer.
Another distribution, which keeps HDFS but changes how the NameNode works, is
WANdisco. WANdisco uses three NameNodes to provide High Availability (HA).
WANdisco's implementation removes the need for a Secondary NameNode,
JobTracker, or TaskTracker. WANdisco calls this its Non-Stop NameNode, which
uses the company's patented replication technology to turn the NameNode into an
active-active shared-nothing cluster that, according to WANdisco, delivers
optimum performance,
scalability and availability on a 24-by-7 basis without any downtime or data
loss.
Both Hortonworks and Cloudera use a 100% Open Source Apache Hadoop
Distribution. Cloudera was founded in 2008 by Christophe Bisciglia from Google,
Amr Awadallah from Yahoo, and Jeff Hammerbacher from Facebook. This year
Cloudera announced the release of Impala, an open-source distributed query
engine for Apache Hadoop. Cloudera is a sponsor of the Apache Software
Foundation.
Hortonworks was formed in 2011 by Yahoo and Benchmark Capital. Hortonworks has
strong community involvement. Hortonworks, along with Yahoo, sponsors the Hadoop
Summit. Hadoop Summit is an Apache Hadoop community event that outlines the
evolution of Apache Hadoop into the next-generation enterprise data platform
and features presentations from community developers, experienced users and
administrators, as well as a vast array of ecosystem solution providers.
Like Cloudera, Hortonworks is a sponsor of the Apache Software Foundation. In
response to Cloudera's Impala release, Hortonworks announced its Stinger
Initiative. Stinger represents a concerted effort by Hortonworks and the
broader Apache community to improve Hive performance and better serve business
intelligence use cases such as interactive data exploration, visualization and
parameterized reporting.
So to answer the first question, I think we need to ask what may be the most
important question: which feature is most important in our implementation of
Hadoop? For my company it's support. Why support? Because we want to stand up a
small cluster (6 DataNodes with 100 TB of storage) in our local datacenter
to reduce the time it takes us to load and query large, static, transactional
data, which we currently do in SQL Server. Having easy access to technical
support, blogs, how-tos, mailing lists, knowledge base articles, and white
papers will help ensure the success of our Hadoop implementation. With that said, we
can now ask the original question: do we want to choose a Hadoop distribution
based on the Apache source or do we go with a distribution that has improved on
or modified the Hadoop architecture?
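
As a rough sanity check on the small cluster described above, here is a minimal
sketch of how raw disk translates into usable HDFS space. It assumes the 100 TB
figure is raw capacity, that blocks are stored with the HDFS default
replication factor of 3, and that a quarter of the disk is held back for
temporary and intermediate data. The headroom value and the function name are
my own assumptions for illustration, not numbers from any vendor.

```python
def usable_capacity_tb(raw_tb, replication=3, headroom=0.25):
    """Estimate usable HDFS capacity in TB.

    raw_tb      -- total raw disk across all DataNodes
    replication -- copies kept of each block (3 is the HDFS default)
    headroom    -- fraction of raw disk reserved for non-HDFS use
                   (hypothetical value, tune for your workload)
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    return raw_tb * (1 - headroom) / replication


RAW_TB = 100     # assumed to be raw capacity, not usable capacity
DATANODES = 6

raw_per_node = RAW_TB / DATANODES
usable = usable_capacity_tb(RAW_TB)

print(f"Raw disk per DataNode: {raw_per_node:.1f} TB")
print(f"Estimated usable HDFS capacity: {usable:.1f} TB")
```

If the 100 TB is meant to be usable space rather than raw disk, the same
arithmetic run in reverse gives the raw capacity to budget for.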
I believe for us the answer is to use a 100% Open Source Apache Hadoop
Distribution. We need as many resources available to us as possible so that we
can learn, and we want a solution that is fully backed by the vast Apache Hadoop
community. By using Apache Hadoop, we expect that bugs will be fixed
faster, that new features will arrive in faster release
cycles, and that we will have a larger support base.