It's been a very long time since my last blog post. I am still very much involved in SQL Server, but more accurately, involved in managing large volumes of data. Not as large as some, but larger than many. In the month of October alone, my company loaded 20 TB into SQL Server. All of it is static transactional data from other companies, which we analyze to answer questions about that data. So how do we do this currently? Well, most of our servers are virtual servers, backed by 3PAR tier 1 Fibre Channel storage. Tables with row counts approaching a billion rows or more are partitioned. We work with our analysts to improve query performance as they analyze the data. We are often under pressure to deliver our analysis to customers in a short amount of time, so we do not have much time to stage the data. Also, because the data is static, we have no use for transaction logs, which are a big drain on performance. If only there were an option to turn off transaction logging.
And so here is where I find myself today. Hadoop. Could Hadoop help us? About a year ago I started reading about Hadoop and attending Hadoop and other NoSQL meet-ups. Over that time, I have become convinced that it can. I needed to set up a test environment, so I followed Michael Noll's Hadoop tutorial, which can be found here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
I installed Hadoop on a single node, though not without a learning curve, as it had been a very long time since I last used Linux. I should add that it's helpful to know a little Linux. After successfully installing Hadoop on a single node, I moved on to Michael's tutorial for installing Hadoop in a cluster. My test environment uses 4 nodes. I learned that the Hadoop logs are the key to correcting errors. And even though I had successfully installed Hadoop on my cluster of Dell desktop boxes, I still needed to learn how I might duplicate in Hadoop the analysis that we run in SQL Server.
Now, many data warehouse and database infrastructures have been developed to sit on top of Hadoop. Some of these are Hive, HBase, Pig, and Accumulo, just to name a few. But which one is best for our test case? Next week I will be driving up to Philly to attend Hadoop developer training provided by Hortonworks. It is a four-day course designed for developers who want to better understand how to create Apache Hadoop solutions. The course consists of 50% hands-on lab exercises and 50% presentation material. It's very exciting to learn new technologies and I guess that's why I love what I do. Hopefully, I'll find answers to my many questions and write a detailed post on how I translated the processes we currently perform in SQL Server to our Hadoop test cluster. So stay tuned!