
Monday, November 5, 2012

Hadoop As An Alternative To SQL Server?

It's been a very long time since my last blog post. I am still very much involved in SQL Server, but more accurately, in managing large volumes of data. Not as large as some, but larger than many. In the month of October alone, my company loaded 20TB into SQL Server, all of it static transactional data from other companies that we run analysis on to answer questions about that data. So how do we do this currently? Most of our servers are virtual servers backed by 3PAR tier 1 Fibre Channel storage. Tables with row counts approaching a billion rows and beyond are partitioned. We work with our analysts to improve query performance as they work to analyze the data. We are often under pressure to deliver analysis to customers in a short amount of time, so we do not have much time to stage the data. Also, because the data is static, we have no use for transaction logs, which are a big drain on performance. If only there were an option to turn off transaction logging.

And so here is where I find myself today: Hadoop. Could Hadoop help us? About a year ago I started reading about Hadoop and attending Hadoop and other NoSQL meet-ups, and over that time I have become convinced that it can. I needed to set up a test environment, so I followed Michael Noll's Hadoop tutorial, which can be found here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/


I installed Hadoop on a single node, not without working through an obvious learning curve, as it has been a very long time since I used Linux. I should add that it helps to know a little Linux. After successfully installing Hadoop on a single node, I moved on to Michael's tutorial for installing Hadoop in a cluster. My test environment uses 4 nodes. I learned that the Hadoop logs are the key to correcting errors. And even though I had successfully installed Hadoop on my cluster of Dell desktop boxes, I still needed to learn how I might duplicate in Hadoop the analysis that we run in SQL Server.
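
To give one concrete example of what I mean about the logs: each daemon writes its own log file under the logs directory of the install, and tailing the namenode or a datanode log is usually the fastest way to see why something failed. A rough sketch, assuming the hduser account and the /usr/local/hadoop/hadoop install path from the tutorial (the hostnames are my test boxes):

# On the master, follow the namenode log while bringing HDFS up
tail -f /usr/local/hadoop/hadoop/logs/hadoop-hduser-namenode-bigdata1.log

# On a worker, look at the last exception in the datanode log
tail -n 100 /usr/local/hadoop/hadoop/logs/hadoop-hduser-datanode-bigdata2.log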

Now, many data warehouse and database infrastructures have been developed to sit on top of Hadoop. Some of these are Hive, HBase, Pig, and Accumulo, just to name a few. But which one is best for our test case? Next week I will be driving up to Philly to attend Hadoop developer training provided by Hortonworks. It is a four-day course designed for developers who want to better understand how to create Apache Hadoop solutions, and it consists of 50% hands-on lab exercises and 50% presentation material. It's very exciting to learn new technologies, and I guess that's why I love what I do. Hopefully, I'll find answers to my many questions and write a detailed post on how I translated the processes we currently perform in SQL Server to our Hadoop test cluster. So stay tuned!

Wednesday, February 1, 2012

Free ebook: Introducing Microsoft SQL Server 2012

Microsoft Press is releasing a free ebook: Introducing Microsoft SQL Server 2012 (second DRAFT preview) by Ross Mistry and Stacia Misner. This Draft Preview release is only available in PDF format. For the full details, look here: Microsoft Press.

Sunday, January 8, 2012

Correcting Hadoop's HDFS java.io.IOException Errors

This past Christmas and New Year, like the last three years now, I accompanied my wife to Bogota, Colombia for the holidays. The difference about this trip was that our pet Mickey was going with us, we had no side trips planned, and I was taking my new Lenovo T420 with me. We arrived in Bogota late at night on December 22nd, and after being greeted by family, we took a cab to my mother-in-law's house. There's something both exciting and terrifying about cab rides in Bogota, but after a while, for me at least, it's just fun.

I woke up the next morning well rested and we began our Christmas vacation. I started by setting up a new wireless router that I had brought with me so that I could work remotely from wherever was most comfortable. After installing the router, I connected with my new laptop and set my priorities for the tasks I needed to complete. It was the typical list: update my time, respond to some emails, finalize peer reviews, and check up on database backups. After those tasks were completed, I had time to relax.

Our plans were day to day, but mostly we would go out to have lunch or dinner with friends and family, then return home. On a few nights, we engaged in consuming heavy amounts of adult beverages and dancing, which is the custom in Colombia. Most of our time, however, was spent at home. I would wake up early, before anyone else, and sit at the dining room table next to the window with my laptop, watching the sun rise over the mountains and enjoying some fresh Colombian coffee.

With nothing to do so early in the morning, I decided to log onto my desktop back in DC. When I connected, I saw that I had left open an SSH connection to a small Hadoop cluster I had set up for some testing, except HDFS was not working properly. This was the perfect opportunity to find out what had gone wrong in my install and configuration of the cluster. The only catch was that I would have to do everything from the command shell, no GUI. I had followed Michael Noll's "Running Hadoop on Ubuntu" tutorial, but now I was getting errors in the namenode logs.

I was getting java.io.IOException errors. In Michael Noll's how-to, he describes how he addressed this error by reformatting the cluster: he stopped all running daemons, deleted the /app/hadoop/tmp/dfs/name/data directory, and then ran bin/hadoop namenode -format. Somewhere in troubleshooting my errors and researching online, I found that it was also a good idea to add the following properties to the hdfs-site.xml configuration file.

<!-- Adding dfs.data.dir dfs.name.dir 1/1/2012. -->
<property>
  <name>dfs.data.dir</name>
  <value>/app/hadoop/tmp/dfs/name/data</value>
  <final>true</final>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/app/hadoop/tmp/dfs/name</value>
  <final>true</final>
</property>
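
One detail that is easy to forget on a multi-node setup: hdfs-site.xml has to match on every node. A rough sketch of pushing the edited file out, assuming the same install path on each box (the hostnames are my test nodes):

# Copy the edited config from the master to each worker node
# (assumes /usr/local/hadoop/hadoop is the install path everywhere)
for node in bigdata2 bigdata3 bigdata4; do
  scp /usr/local/hadoop/hadoop/conf/hdfs-site.xml hduser@$node:/usr/local/hadoop/hadoop/conf/
done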

Also, if you get a permission denied (publickey,password) error, you may want to check that the paths for the properties you added to the hdfs-site.xml file are correct. If the problem persists, you might try running the following on all nodes:

sudo chown -R hduser:hadoop /app/hadoop
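
Since "permission denied (publickey,password)" is really an SSH error, it is also worth confirming that passwordless SSH from the master to each node still works for hduser, the way the tutorial sets it up. A quick sketch of what I mean (hostnames are my test nodes):

# From the master, each of these should log in without prompting for a password
ssh hduser@bigdata2 hostname
ssh hduser@bigdata3 hostname
ssh hduser@bigdata4 hostname

# If one of them prompts, re-copy the master's public key to that node
ssh-copy-id hduser@bigdata2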

Some of the other errors that I ran into were as follows (see the note after the list for the last one):

Cannot lock storage /app/hadoop/tmp/dfs/name. The directory is already locked.
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.
ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException: Problem binding to Address already in use
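
For that last BindException, the usual cause is a stale daemon (or something else) still holding the port, and stopping that process before restarting clears it. A rough sketch of how to check (jps ships with the JDK; the netstat flags are for Linux):

# Show which Hadoop daemons are still running on this node
jps

# List listening ports held by java processes, with the owning PIDs
sudo netstat -tlnp | grep java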

I did not do a good job of keeping notes while troubleshooting these issues, but it seemed that whenever I would try to fix one thing, a different error would pop up. I did find that I had made a typo in the hdfs-site.xml file: at the end of each path, I had added a trailing slash. So instead of /app/hadoop/tmp/dfs/name, I had /app/hadoop/tmp/dfs/name/. But once I corrected that, deleted all data in the HDFS directory, and then ran the format, everything worked! So here is how that went.
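
First, everything has to be shut down before touching the data directories. A sketch of what stopping all the daemons looks like with the stock 0.20.x scripts (run from the master, assuming the install path used throughout this post):

# Stop the MapReduce daemons, then the HDFS daemons, across the cluster
cd /usr/local/hadoop/hadoop
bin/stop-mapred.sh
bin/stop-dfs.sh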

After stopping all daemons and correcting the paths in the hdfs-site.xml file, I deleted all data in the HDFS directory on all nodes.

hduser@bigdata1:/app/hadoop/tmp/dfs$ sudo rm -rf *

Then, I ran the format.

hduser@bigdata1:/usr/local/hadoop/hadoop$ bin/hadoop namenode -format

The output looked like this:

12/01/04 07:59:15 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = bigdata1/172.20.10.92
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
12/01/04 07:59:15 INFO namenode.FSNamesystem: fsOwner=hduser,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,netdev,admin,sambashare
12/01/04 07:59:15 INFO namenode.FSNamesystem: supergroup=supergroup
12/01/04 07:59:15 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/01/04 07:59:15 INFO common.Storage: Image file of size 96 saved in 0 seconds.
12/01/04 07:59:15 INFO common.Storage: Storage directory /app/hadoop/tmp/dfs/name has been successfully formatted.
12/01/04 07:59:15 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at bigdata1/172.20.10.92
************************************************************/

Next, I started HDFS.

hduser@bigdata1:/usr/local/hadoop/hadoop$ bin/start-dfs.sh

Its output was:

starting namenode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-namenode-bigdata1.out
bigdata2: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-datanode-bigdata2.out
bigdata3: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-datanode-bigdata3.out
bigdata4: starting datanode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-datanode-bigdata4.out
bigdata1: starting secondarynamenode, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-secondarynamenode-bigdata1.out

I ran the HDFS Admin Report to see the status of my cluster.

hduser@bigdata1:/usr/local/hadoop/hadoop$ bin/hadoop dfsadmin -report

The report displayed the following:

Configured Capacity: 206701436928 (192.51 GB)
Present Capacity: 186873368576 (174.04 GB)
DFS Remaining: 186873294848 (174.04 GB)
DFS Used: 73728 (72 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 172.20.10.127:50010
Decommission Status : Normal
Configured Capacity: 68900478976 (64.17 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 6721228800 (6.26 GB)
DFS Remaining: 62179225600(57.91 GB)
DFS Used%: 0%
DFS Remaining%: 90.24%
Last contact: Wed Jan 04 08:00:20 EST 2012


Name: 172.20.10.128:50010
Decommission Status : Normal
Configured Capacity: 68900478976 (64.17 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 6732939264 (6.27 GB)
DFS Remaining: 62167515136(57.9 GB)
DFS Used%: 0%
DFS Remaining%: 90.23%
Last contact: Wed Jan 04 08:00:20 EST 2012


Name: 172.20.10.48:50010
Decommission Status : Normal
Configured Capacity: 68900478976 (64.17 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 6373900288 (5.94 GB)
DFS Remaining: 62526554112(58.23 GB)
DFS Used%: 0%
DFS Remaining%: 90.75%
Last contact: Wed Jan 04 08:00:17 EST 2012

After this I started MapReduce.

hduser@bigdata1:/usr/local/hadoop/hadoop$ bin/start-mapred.sh

Its output was this:

starting jobtracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-jobtracker-bigdata1.out
bigdata3: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-tasktracker-bigdata3.out
bigdata2: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-tasktracker-bigdata2.out
bigdata4: starting tasktracker, logging to /usr/local/hadoop/hadoop/bin/../logs/hadoop-hduser-tasktracker-bigdata4.out

Success!
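
A quick way to double-check that every daemon actually came up is to run jps on each node (jps ships with the JDK). Roughly, for this cluster:

# On the master (bigdata1), expect NameNode, SecondaryNameNode, and JobTracker
jps

# On each worker (bigdata2-4), expect DataNode and TaskTracker
jps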




Now there are three web interface URLs that you can use to check on your cluster's health; they are:




http://localhost:50030/ – web UI for MapReduce job tracker(s)
http://localhost:50060/ – web UI for task tracker(s)
http://localhost:50070/ – web UI for HDFS name node(s)
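
And if, like me over the holidays, you only have a shell and no browser handy, you can at least confirm that the web UIs are responding from the command line. A rough sketch for this cluster:

# From the master, the JobTracker and NameNode UIs should answer with HTTP headers
curl -sI http://localhost:50030/
curl -sI http://localhost:50070/

# The TaskTracker UIs live on the worker nodes in this setup
curl -sI http://bigdata2:50060/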