Sunday, 4 January 2015

How to import data from RDBMS to Hadoop and vice versa?

Hadoop became very popular within a few years because of its robust design, its open-source nature, and its ability to handle large data. Nowadays a lot of RDBMS-to-Hadoop migration projects are happening. Hadoop is not a replacement for the RDBMS, but for certain use cases Hadoop can perform better than an RDBMS. Some projects may require data from an RDBMS along with multiple other sources for finding insights. In these scenarios, we need to transfer data from the RDBMS to the Hadoop environment. This task sounds simple, but it is actually difficult, as it involves a lot of risk. The possible solutions for importing data from an RDBMS to Hadoop are explained below.

1) Using SQOOP
Sqoop is a Hadoop ecosystem component that was developed for importing data from an RDBMS into Hadoop and for exporting data from Hadoop back to an RDBMS. A Sqoop job runs as a MapReduce job, and Sqoop utilizes Hadoop's parallelism for doing the import and export in parallel. Internally, Sqoop runs as a map-only job that uses JDBC. For using Sqoop, we need good network connectivity between the RDBMS environment and the Hadoop environment.
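As a minimal sketch, a typical Sqoop import from a MySQL table might look like the command below. The host, database, table, and directory names are illustrative placeholders, not values from any particular cluster.

sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username dbuser -P \
  --table employees \
  --target-dir /user/hadoop/employees \
  --num-mappers 4

The --num-mappers flag controls how many parallel map tasks (and hence JDBC connections) Sqoop opens against the database. The reverse direction uses sqoop export with an --export-dir pointing at the HDFS data.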

2) By dumping the data from the database and transferring it via portable secondary storage devices
Many companies do not allow direct network connectivity to the RDBMS environment from Hadoop. Another reason for not allowing it is that when a Sqoop job is triggered, the volume of data flowing through the network will be very high, which can affect the performance of other systems connected to the network. In such cases, data is transferred to the Hadoop environment by dumping the data from the database, copying it to some portable secondary storage device or to cloud storage (if allowed), and then loading the data into the Hadoop environment.
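A rough sketch of this manual route, assuming a MySQL source and purely illustrative database, table, and path names, is shown below. mysqldump with --tab writes the table data as a delimited text file, which can then be carried over and loaded into HDFS (note that --tab makes the MySQL server itself write the file, so it needs the FILE privilege and a writable directory on the database host).

# On the database side: dump the table to a delimited file
mysqldump -u dbuser -p --tab=/mnt/usb/dump --fields-terminated-by=',' mydb employees

# On the Hadoop side: load the transported file into HDFS
hdfs dfs -mkdir -p /data/employees
hdfs dfs -put /mnt/usb/dump/employees.txt /data/employees/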

Saturday, 3 January 2015

What happens to a MapReduce job with a reduce class when we set the number of reduce tasks as zero?

When we set the number of reduce tasks to zero, no reduce task will be executed, even if a reduce class is set on the job. The output of the mappers is written directly to HDFS, and that becomes the output of the job. Suppose 10 mappers were spawned for a job; if we set the number of reduce tasks to zero, we will get 10 output files.
The output files will have names like part-m-00000, part-m-00001, ..., part-m-00009.
We can set the number of reduce tasks to zero either from the program or from the command line.

In the program, we can set this by adding the following line in the driver class.
job.setNumReduceTasks(0);

From the command line, we can achieve the same result by passing the property below (on newer Hadoop releases the equivalent property is mapreduce.job.reduces).
-Dmapred.reduce.tasks=0
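For context, a minimal map-only driver sketch is shown below. The class names and paths are illustrative, and the job simply passes its input through unchanged; the only essential line for this discussion is job.setNumReduceTasks(0).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

    // A pass-through mapper: writes each input record out unchanged.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyJob.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        // No reduce phase: mapper output is written directly to HDFS as part-m-* files.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}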

What happens to a MapReduce job if the user sets the number of reduce tasks as one?

When the number of reduce tasks is set to one, only one reduce task will be executed for the entire job. All the intermediate map outputs will be gathered by a single reducer. The single reducer processes the entire map output, and the result is stored in a single file in HDFS, named part-r-00000.
For setting the number of reduce tasks to one, add the following line in the driver class.
job.setNumReduceTasks(1);

How to pass a small number of configuration parameters to a mapper and reducer?

Hadoop has several configurable properties, which live in several XML and properties files. The main configuration files in Hadoop are core-site.xml, mapred-site.xml, hdfs-site.xml, and yarn-site.xml. The parameters in these configuration files are set while installing the cluster, usually by the administrator.

If a developer wants to modify some of these configuration parameters while developing a MapReduce program, he can do it from the program itself. The way to modify these values from the program is to instantiate the Configuration class and set the values as key-value pairs on it.

The syntax is as shown below.
Configuration conf = new Configuration();
conf.set("key1","value1");
conf.set("key2","value2");

Friday, 2 January 2015

What is Hadoop?


Hadoop is a framework designed specifically for handling large data. The intention behind the development of Hadoop was to build a scalable, low-cost framework that can process large data. Hadoop has a distributed file system and a distributed processing layer, both of which reside on top of several commodity machines. The teamwork of these commodity machines is the strength of Hadoop.

The distributed storage layer of Hadoop is called the Hadoop Distributed File System (HDFS) and the distributed processing layer is called MapReduce. The ideas behind HDFS and MapReduce came from Google frameworks, namely the Google File System (GFS) and Google MapReduce.

Hadoop is designed in such a way that it can run on commodity hardware, which reduces the cost. In other data processing setups, the hardware itself handles faults, but in Hadoop, the framework handles hardware failure. Hadoop doesn't require any RAID arrangement of disks; it just requires the disks in a JBOD configuration. JBOD stands for "just a bunch of disks".

Wednesday, 19 November 2014

Hadoop Interview Questions

1) What is the name of Hadoop's file system?
Ans: HDFS

2) What is the full form of HDFS?
Ans: Hadoop Distributed File System

3) What is the processing layer of Hadoop?
Ans: MapReduce

4) In which language is the Hadoop framework written?
Ans: Java

5) What is the licensing cost for Hadoop?
Ans: Hadoop is an open-source technology, so it is free.

6) Who is known as the father of Hadoop?
Ans: Doug Cutting

7) How does Hadoop differ from other data processing technologies?
Ans: Hadoop is a framework that has distributed storage as well as a distributed processing layer. The basic idea behind Hadoop is to bring the processing down to the storage. Hadoop is a horizontally scaling framework, so high-end server-grade hardware is not required; commodity hardware is enough.

8) Is Hadoop good for real-time processing?
Ans: Not directly. Hadoop is a batch processing framework, so it can't be used for real-time processing on its own. But it can work along with other technologies to produce real-time outputs.

9) Is Hadoop a replacement for an RDBMS?
Ans: No. Hadoop is not suitable for processing small or medium amounts of data, and since Hadoop is a batch processing framework, it will not provide fast responses. What Hadoop guarantees is that it will never fail because of large data. For data so large that other data processing technologies can't handle it, Hadoop performs well.

10) If Hadoop is open source and free, who maintains and enhances it?
Ans: Hadoop is an Apache project; people all over the world contribute to it and keep adding enhancements. A lot of companies that use Hadoop also contribute back to the project.

11) Why did Hadoop become so popular?
Ans: Analyzing hidden insights in data has become a very important part of almost every organisation, and the insights become more reliable as the size of the data grows. Nowadays the usage of the internet and social media is very high, so from that data alone we can analyse people's behaviour to some extent; similarly, we can analyse almost anything using historical data. That is one reason. Real-time monitoring and decision making have also become very important, which is another factor. With a licensed tool or product, the licensing cost itself would be very high, whereas Hadoop is open source and free, and it runs on commodity hardware, so the infrastructure cost is also low. This made Hadoop a hot cake in the market.

12) What do you mean by a pseudo-distributed Hadoop cluster?
Ans: If all the Hadoop daemons run on a single node, it is called pseudo-distributed mode. This is not used for production; it is just for development and learning purposes.
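For reference, a minimal sketch of the two properties that the Hadoop 2 single-node setup documentation changes for pseudo-distributed mode (localhost:9000 is the usual default, but treat the values as illustrative):

In core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

In hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>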
