It's been a very long time since my
last blog post. I am still very much involved in SQL Server, but more
accurately, involved in managing large volumes of data. Not as large
as some, but larger than many. In October alone, my
company loaded 20TB into SQL Server, all of it static
transactional data from other companies that we analyze
to answer questions about that data. So how do we do this currently?
Well, most of our servers are virtual servers on 3PAR tier 1 Fibre
Channel storage. Tables with row counts approaching a billion rows
and greater are partitioned. We work with our analysts to improve
query performance as they analyze the data. We are often
under pressure to deliver our analysis to customers quickly,
so we do not have much time to stage the data. Also,
because the data is static, we have no use for transaction logs,
which are a big drain on performance. If only there were an option to
turn off transaction logging.
And so here is where I find myself
today. Hadoop. Could Hadoop help us? About a year ago I started
reading about Hadoop and attending Hadoop and other NoSQL meet-ups.
Over that time, I have become convinced that it can. I needed to set
up a test environment, so I followed Michael Noll's Hadoop tutorial,
which can be found here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
I installed Hadoop on a single node,
though not without a learning curve, as it has been a very long time since I used Linux. I should add that
it's helpful to know a little Linux. After successfully installing
Hadoop on a single node, I moved on to Michael's tutorial for
installing Hadoop in a cluster. My test environment uses 4 nodes. I
learned that the Hadoop logs are the key to correcting errors. And
even though I had successfully installed Hadoop on my cluster of Dell desktop boxes, I still needed
to learn how I might reproduce in Hadoop the analysis that we run
in SQL Server.
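To make that question a little more concrete, here is a minimal sketch, in
plain Python rather than real Hadoop code, of how a SQL-style
GROUP BY with SUM breaks down into map, shuffle, and reduce steps. The
sample records and the function names are made up for illustration; they
are not our actual data or processes, and a real job would run these steps
across the cluster.

```python
from collections import defaultdict

# Hypothetical sample rows: (customer, amount). Not real data.
records = [("acme", 10.0), ("globex", 5.0), ("acme", 2.5)]


def map_phase(rows):
    """Emit one (key, value) pair per input row, like a mapper would."""
    for customer, amount in rows:
        yield customer, amount


def shuffle(pairs):
    """Group all values by key, standing in for Hadoop's shuffle/sort step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, like a reducer would."""
    return {key: sum(values) for key, values in grouped.items()}


# Roughly equivalent to:
#   SELECT customer, SUM(amount) FROM records GROUP BY customer
totals = reduce_phase(shuffle(map_phase(records)))
print(totals)
```

Everything here runs in a single process, so it only shows the shape of
the computation, not how Hadoop distributes it.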
Now, many data warehouse and
database infrastructures have been developed to sit on top of
Hadoop. Some of these are Hive, HBase, Pig, and Accumulo, just to
name a few. But which one is best for our test case? Next week I will
be driving up to Philly to attend Hadoop developer training provided
by Hortonworks. It is a 4-day course designed for developers who
want to better understand how to create Apache Hadoop solutions. The
course consists of 50% hands-on lab exercises and 50% presentation
material. It's very exciting to learn new technologies, and I guess
that's why I love what I do. Hopefully, I'll find answers to my many
questions and write a detailed post on how I translated the
processes that we currently perform in SQL Server to our Hadoop
test cluster. So stay tuned!