Hadoop is the open source platform which is very much trendy now a days. Every corporates are looking for good Hadoop developer.
The Hadoop was first introduced by Mike Cafarella and Doug Cutting who are the persons trying to create a search engine system that can index 1 billion pages. So they are going to start their research to build that search engine but they found that the cost for preparing such engine will be very high. So, They came across a paper, published in 2003, that described the architecture of Google’s distributed file system, called GFS. Later on in 2004, Google published one more paper that introduced MapReduce to the world. Finally, these two papers led to the foundation of the framework called “Hadoop“.
How It All Started?
As I have already discussed above the foundation of Hadoop and its relevancy in terms of processing the search data. Here the term GFS is something which is mainly used to solve all search strategy for storing very large files that are generated as a part of the web crawl and indexing process in any case.
It also uses the Map Reduce technique which is mainly used to map the large data files and help for better search strategy. It is being considered as the major part framework called “Hadoop“.
So, by considering it the Hadoop is became so powerful.
What is Big Data?
As we have discussed above that the Hadoop is an open source. It is mainly a Java based framework used for storing and processing big data. The data is stored on inexpensive commodity servers that run as clusters. Cafarella, Hadoop uses the MapReduce programming model for faster storage and retrieval of data from its nodes.
Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not a single technique or a tool, rather it has become a complete subject, which involves various tools, techniques and frameworks.
Here the Big data is mainly used to involves the data produced by different devices and applications. Some of the fields that come under the umbrella of Big Data are discussed as below.
- Black Box Data − It is a component which is get used as a part of helicopter, airplanes, and jets, etc. It is mainly used to captures the voices information of the flight crew, all sort of recordings of microphones and earphones etc in the aircraft.
- Social Media Data –Social media is one of the most important factors in the evolution of Big Data as it provides information about people’s behaviour.
- Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made on a share of different companies made by the customers.
- Power Grid Data − The power grid data holds information consumed by a particular node with respect to a base station.
- Transport Data − Transport data includes model, capacity, distance and availability of a vehicle.
- Search Engine Data − Search engines retrieve lots of data from different databases.
Now a day we are using so many social applications. Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe which is a real time example of Big Data.
Big Data & Hadoop – Restaurant Analogy:
Here now Let us take an analogy of a restaurant to understand the problems that is associated with Big Data. Here we will try to learn that how Hadoop solved that problem.
Let us consider that the Bob is a businessman who has opened a small restaurant. Initially, in his restaurant, he used to receive two orders per hour and he had one chef with one food shelf in his restaurant which was sufficient enough to handle all the orders.
Now let us compare the restaurant example with the traditional scenario where data was getting generated at a steady rate and our traditional systems like RDBMS is capable enough to handle it, just like Bob’s chef.
Similarly, to tackle the problem of processing huge data sets, multiple processing units were installed so as to process the data in parallel (just like Bob hired 4 chefs). But even in this case, bringing multiple processing units was not an effective solution because the centralized storage unit became the bottleneck.
In other words, the performance of the whole system is driven by the performance of the central storage unit. Therefore, the moment our central storage goes down, the whole system gets compromised. Hence, again there was a need to resolve this single point of failure.
Hadoop File System was developed using distributed file system design. It is run on commodity hardware. Unlike other distributed systems, HDFS is highly faulttolerant and designed using low-cost hardware.
To solve the storage issue and processing issue, two core components were created in Hadoop – HDFS and YARN. HDFS solves the storage issue as it stores the data in a distributed fashion and is easily scalable. YARN solves the processing issue by reducing the processing time drastically. Moving ahead, let us understand what is Hadoop?
What is Hadoop?
As we have already discussed earlier that the Hadoop is an open-source software framework like Java script and is mainly used for storing and processing Big Data in a distributed manner on large clusters of commodity hardware. Hadoop is licensed under the Apache v2 license.
Hadoop was developed, based on the paper written by Google on the MapReduce system and it applies concepts of functional programming. Hadoop is written in the Java programming language and ranks among the highest-level Apache projects. Hadoop was developed by Doug Cutting and Michael J. Cafarella.
Let’s understand how Hadoop provides a solution to the Big Data problems that we have discussed so far.
The major challenges associated with big data are as follows −
- Capturing data
How to store huge amount of data:
- The Hadoop is mainly have a HDFS system which is provides a distributed way to store Big Data.
- Here the data is stored in blocks in DataNodes and you specify the size of each block.
- For example if you have 512 MB of data and you have configured HDFS such that it will create 128 MB of data blocks.
- The HDFS will divide data into 4 blocks as 512/128=4 and stores it across different DataNodes.
- While storing these data blocks into DataNodes, data blocks are replicated on different DataNodes to provide fault tolerance.
How to store a variety of data:
- The HDFS in Hadoop can capable enough to store all kinds of data whether it is structured, semi-structured or unstructured.
- HDFS, there is no pre-dumping schema validation.
- It also follows write once and read many models.
- Due to this, you can just write any kind of data once and you can read it multiple times for finding insights.
How to process the data faster:
In this case the Hadoop is allowed to move the processing unit to data instead of moving data to the processing unit.
So, what does it mean by moving the computation unit to data?
It means that instead of moving data from different nodes to a single master node for processing, the processing logic is sent to the nodes where data is stored so as that each node can process a part of data in parallel. Finally, all of the intermediary output produced by each node is merged together and the final response is sent back to the client.
Features of Hadoop:
The Hadoop is having the following features which makes it more unique than others. Such as
- It is suitable for the distributed storage and processing.
- Hadoop provides a command interface to interact with HDFS.
- The built-in servers of namenode and datanode help users to easily check the status of cluster.
- Streaming access to file system data.
- HDFS provides file permissions and authentication.
Hadoop infrastructure has been designed in such a unique way so that it has inbuilt fault tolerance features. So that even the failure took place it has a back up strategy to handle the situation. hence, Hadoop is highly reliable.
Hadoop uses commodity hardware (like your PC, laptop). We can easily run it on any environment.
For example, in a small Hadoop cluster, all your DataNodes can have normal configurations like 8-16 GB RAM with 5-10 TB hard disk and Xeon processors.
Also It is easier to maintain a Hadoop environment and is economical as well. Also, Hadoop is open-source software and hence there is no licensing cost.
Hadoop has the inbuilt capability of integrating seamlessly with cloud-based services. So, if you are installing Hadoop on a cloud, you don’t need to worry about the scalability factor.
Hadoop is very flexible in terms of the ability to deal with all kinds of data. The Hadoop can store and process all kind of data, whether it is structured, semi-structured or unstructured data.
Hadoop Core Components:
While setting up a Hadoop cluster, you have an option of choosing a lot of services as part of your
Hadoop platform, but there are two services which are always mandatory for setting up Hadoop.
Let us go ahead with HDFS first. The main components of HDFS are the NameNode and the DataNode. Let us talk about the roles of these two components in detail.
HDFS follows the master-slave architecture and it has the following elements.
- The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software.
- It is a software that can be run on commodity hardware.
- The system having the namenode acts as the master server and it does the following tasks such as Manages the file system namespace, Regulates client’s access to files.
- It also executes file system operations such as renaming, closing, and opening files and directories.
- If a file is deleted in HDFS, the NameNode will immediately record this in the EditLog
- It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are alive
- It keeps a record of all the blocks in the HDFS and DataNode in which they are stored
- The data node is a commodity hardware having the GNU/Linux operating system and datanode software.
- For every node (Commodity hardware/System) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
- Data nodes perform read-write operations on the file systems, as per client request.
- They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
- It is also responsible for creating blocks, deleting blocks and replicating the same based on the decisions taken by the NameNode.
- It sends heartbeats to the NameNode periodically to report the overall health of HDFS, by default, this frequency is set to 3 seconds.
- Generally the user data is stored in the files of HDFS.
- The file in a file system will be divided into one or more segments and/or stored in individual data nodes.
- These file segments are called as blocks. In other words, the minimum amount of data that HDFS can read or write is called a Block.
- The default block size is 64MB, but it can be increased as per the need to change in HDFS configuration.
The YARN of Hadoop is basically contains two major components, such as,
- Resource Manager &
- Node Manager.
- It is a cluster-level (one for each cluster) component and runs on the master machine.
- It manages resources and schedules applications running on top of YARN.
- It has two components: Scheduler & ApplicationManager.
- The Scheduler is responsible for allocating resources to the various running applications.
- The ApplicationManager is responsible for accepting job submissions and negotiating the first container for executing the application.
- It keeps a track of the heartbeats from the Node Manager.
- It is a node-level component (one on each node) and runs on each slave machine.
- It is responsible for managing containers and monitoring resource utilization in each container.
- It also keeps track of node health and log management.
- It continuously communicates with Resource Manager to remain up-to-date.
The Hadoop is a platform or framework which solves Big Data problems. Here it acts as a suite which encompasses a number of services for ingesting, storing and analyzing huge data sets along with tools for configuration management.
Fault detection and recovery − Since HDFS includes a large number of commodity hardware, failure of components is frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection and recovery.
Huge datasets − HDFS should have hundreds of nodes per cluster to manage the applications having huge datasets.
Hardware at data − A requested task can be done efficiently, when the computation takes place near the data. Especially where huge datasets are involved, it reduces the network traffic and increases the throughput.
Last.FM Case Study
- Last.FM is internet radio and community-driven music discovery service founded in 2002.
- Here the User transmit information to Last.FM servers indicating which songs they are listening to.
- The received data is processed and stored so that, the user can access it in the form of charts.
- scrobble: When a user plays a track of his or her own choice and sends the information to Last.FM through a client application.
- radio listen: When the user tunes into a Last.FM radio station and streams a song.
Scope @ NareshIT:
- At Naresh IT you will get a good Experienced faculty who will guide you, mentor you and nurture you to achieve your dream goal.
- Here you will get a good hand on practice in terms of practical industry-oriented environment which will definitely help you a lot to shape your future.
- During the designing process of application, we will let you know about the other aspect of the application too.
- Our Expert trainer will let you know about every in’s and out’s about the problem scenario.
Achieving your dream goal is our motto. Our excellent team is working restlessly for our students to click their target. So, believe on us and our advice, and we assured you about your sure success.