Introduction to Hadoop

I am new to Apache Hadoop; these are brief notes from my learning of Hadoop and related technologies. First, I share the key ideas behind Hadoop basics.

Life without Hadoop?

For compute- and storage-intensive tasks, a single-processor or even a multi-processor computer is not enough, because it would take too long to process millions of tasks over petabytes of data. The cheapest way to improve performance is to utilize tens or hundreds of computers connected to a Local Area Network. But to use computers connected over a LAN, we have to write a program that transfers the data to them, and copy the program onto all of those computers too. Then we need to decide which chunk of records should be processed on which computer on the network, which means more programming to manage task distribution. We also have to use socket programming, or approaches like MPI, RMI, EJBs, or DCOM, to make the other computers receive and process the data. It is all doable, but it is low-level programming in the sense that it is not the primary task at hand. Developers should be spending their time writing data processing logic instead of infrastructure code. In the above case, the developer has to take care of the following aspects, in addition to writing the data processing code:
  1. Installing the program on the other computers
  2. Transferring input data to the other computers
  3. Collecting output data from the other computers
  4. Handling failing nodes
  5. Handling slow nodes and terminating halted tasks
  6. Maintaining replicas of data to handle failures
  7. Keeping track of how busy each node is, to better utilize the available resources
  8. Assigning jobs to the networked computers on the go, i.e. as soon as a node finishes its earlier job

Why Hadoop?

Without going into the details of how Hadoop works, you can see from the above list that most of a developer's time would be spent writing infrastructure-related code instead of the actual data processing logic.

Hadoop takes care of all the things mentioned above. We just write our program using the MapReduce approach (explained later). All responsibility for scheduling jobs on processing nodes, monitoring their progress, collecting results, load balancing across nodes, handling node failures, ensuring node availability, replicating data, maintaining status and progress information, etc. goes to the Hadoop components. So, Hadoop is a distributed data processing framework that frees the developer from writing infrastructure code so that he or she can focus on writing data processing logic.
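To give a flavor of what "just writing the data processing logic" looks like, here is a minimal sketch of the classic word count job against Hadoop's Java MapReduce API. The class names and the input/output paths passed on the command line are illustrative, not something prescribed by Hadoop; this is a sketch assuming the newer org.apache.hadoop.mapreduce API.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: for every word in a line of input, emit the pair (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum all the 1s emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    // Configure the job; args[0] is the input directory, args[1] the output directory.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that the code above contains nothing from the infrastructure list in the previous section: splitting the input across nodes, scheduling the map and reduce tasks, shuffling the intermediate pairs, and re-running failed tasks are all handled by Hadoop.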

Idea of Hadoop / MapReduce?

Hadoop is an open source implementation of the MapReduce programming model (along with other related pieces, e.g. a file system, a job scheduler, etc.). So the idea of Hadoop goes back to the idea of the MapReduce programming model. The MapReduce paradigm is a Google invention. During 2000-2002, Google engineers were making fundamental improvements to their search technology. They were frequently processing billions of web pages for different purposes, e.g. counting the pages hosted on a single server, counting the backlinks to a particular URL, analyzing web page access logs, and generating and writing the complete index after improvements (the main resource behind Google search).

Initially they spent time writing infrastructure-related code for each of these tasks. Later they came up with the idea of a general-purpose distributed computing model that runs on commodity hardware. They encapsulated all the infrastructure-related work into the MapReduce library/system and exposed a very simple interface, with guidelines for writing the jobs, i.e. the data processing logic. As long as other engineers followed the suggested paradigm, their code was ready to run on hundreds of servers in parallel. In addition to deploying MapReduce for many other tasks, Google also rewrote the generation of its entire web index using this approach. You may like reading the original research paper, "MapReduce: Simplified Data Processing on Large Clusters", for more details.
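As a rough illustration of how one of the tasks mentioned above, counting backlinks to a URL, fits the same pattern as word count, here is a hypothetical mapper. It assumes (my assumption, not something from the paper) that each input line holds a "sourceUrl targetUrl" pair; paired with a summing reducer like the one in the word count sketch above, it would yield the number of backlinks per URL.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: each input line is assumed to be "sourceUrl targetUrl".
public class BacklinkMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text target = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (targetUrl, 1); a summing reducer then produces backlink counts per URL.
    String[] parts = value.toString().trim().split("\\s+");
    if (parts.length == 2) {
      target.set(parts[1]);
      context.write(target, ONE);
    }
  }
}
```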

The next post shall explain the major Hadoop modules and how they work.
