Do we want to choose a Hadoop distribution based on the Apache source or do we go with a distribution that has improved on or modified the Hadoop architecture?
There are some great Big Data distributions out there that have made their own improvements on the Hadoop architecture, like MapR. MapR has thrown out the Hadoop file system (HDFS) and replaced it with its own file system, which it calls MapR-FS. By rewriting the file system, MapR has done away with the need for the NameNode and Secondary NameNode services. MapR claims that its proprietary file system is much faster than HDFS and that its Hadoop distribution provides full data protection, no single points of failure, improved performance, and dramatic ease-of-use advantages. MapR offers three editions of its product, known as M3, M5 and M7. M3 is a free version of the M5 product with reduced availability features. M7 is like M5, but adds a purpose-built rewrite of HBase that implements the HBase API directly in the file-system layer.
Another distribution, which keeps HDFS but changes how the NameNode works, is WANdisco. WANdisco uses three NameNodes to provide High Availability (HA). In WANdisco's implementation, they remove the need for a Secondary NameNode, JobTracker, or TaskTracker. WANdisco calls this its Non-Stop NameNode. According to WANdisco, it uses the company's patented replication technology to turn the NameNode into an active-active, shared-nothing cluster that delivers optimum performance, scalability and availability on a 24-by-7 basis without any downtime or data loss.
Both Hortonworks and Cloudera use a 100% open-source Apache Hadoop distribution. Cloudera was founded in 2008 by Christophe Bisciglia from Google, Amr Awadallah from Yahoo, and Jeff Hammerbacher from Facebook. This year Cloudera announced the release of Impala, an open-source distributed query engine for Apache Hadoop. Cloudera is a sponsor of the Apache Software Foundation.
Hortonworks was formed in 2011 by Yahoo and Benchmark Capital. Hortonworks has strong community involvement. Together with Yahoo, it sponsors the Hadoop Summit, an Apache Hadoop community event that outlines the evolution of Apache Hadoop into the next-generation enterprise data platform and features presentations from community developers, experienced users and administrators, as well as a vast array of ecosystem solution providers. And like Cloudera, Hortonworks is a sponsor of the Apache Software Foundation. In response to Cloudera's Impala release, Hortonworks announced its Stinger Initiative. Stinger represents a concerted effort by Hortonworks and the broader Apache community to improve Hive performance and better serve business intelligence use cases such as interactive data exploration, visualization and parameterized reporting.
So to answer the first question, I think we need to ask what may be the most important question: what feature is most important in our implementation of Hadoop? For my company, it's support. Why support? Because we want to stand up a small cluster, six DataNodes with 100TB of storage, in our local datacenter to improve the time it takes us to load and query large, static, transactional-type data, which we currently do in SQL Server. Having easy access to technical support, blogs, how-tos, mailing lists, knowledge base articles, and white papers will help ensure the success of our Hadoop implementation. With that said, we can now ask the original question: do we want to choose a Hadoop distribution based on the Apache source or do we go with a distribution that has improved on or modified the Hadoop architecture?
I believe that, for us, the answer is to use a 100% open-source Apache Hadoop distribution. We need as many resources available to us as possible so that we can learn, and we want a solution that is fully backed by the vast Apache Hadoop community. By using Apache Hadoop, we expect that bugs will be fixed faster, that new features will arrive in shorter release cycles, and that we will have a larger support base to draw on.
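One practical aside on the cluster size mentioned above: because we plan to stay on stock HDFS, it is worth estimating how much of that 100TB is actually available for data. The sketch below is only a rough back-of-the-envelope calculation. It assumes the 100TB figure is raw disk across all six DataNodes, that we keep the HDFS default replication factor of 3 (the `dfs.replication` setting), and that we hold back a fraction of disk for temporary and intermediate job data. The 25% reserve and the function name `usable_capacity_tb` are my own illustrative choices, not figures from any vendor.

```python
def usable_capacity_tb(raw_tb: float,
                       replication: int = 3,
                       reserved_fraction: float = 0.25) -> float:
    """Rough estimate of the space left for user data, in TB.

    raw_tb            -- total raw disk across all DataNodes (assumed)
    replication       -- HDFS block replication factor (HDFS defaults to 3)
    reserved_fraction -- share of raw disk held back for temporary and
                         intermediate data (hypothetical planning figure)
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    if not 0 <= reserved_fraction < 1:
        raise ValueError("reserved_fraction must be in [0, 1)")
    # Every block is stored `replication` times, so divide what is left
    # after the reserve by the replication factor.
    return raw_tb * (1 - reserved_fraction) / replication


if __name__ == "__main__":
    raw_tb = 100.0  # assumption: the 100TB in the post is raw capacity
    print(f"With reserve:    {usable_capacity_tb(raw_tb):.1f} TB")
    print(f"Without reserve: {usable_capacity_tb(raw_tb, reserved_fraction=0.0):.1f} TB")
```

Under those assumptions, roughly a quarter to a third of the raw disk ends up holding unique data, which is worth keeping in mind when comparing the cluster against the SQL Server data volumes we want to move.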