Big data refers to the large sets of data that businesses and other organizations collect to serve specific goals and operations. It can include many different kinds of data in many different formats. For example, a business might put a lot of work into collecting thousands of records on purchases in currency formats, on customer identifiers such as names or Social Security numbers, or on product information in the form of model numbers, sales figures, or inventory counts. All of this, or any other large mass of information, can be called big data. It remains raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. It is an open-source framework for the distributed storage and processing of very large data sets using the MapReduce programming model, released under the Apache license and maintained by a global community of users. The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part based on the MapReduce programming model. Hadoop splits files into large blocks and distributes them across the nodes in a cluster, then ships packaged code to those nodes so they can process the data in parallel. This approach takes advantage of data locality: each node operates on the data it stores locally. As a result, data sets can often be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are connected via high-speed networking.
Hadoop Distributed File System (HDFS)
HDFS follows a distributed file system design and runs on commodity hardware. Unlike many other distributed systems, it is highly fault tolerant even though it is built on low-cost hardware.
HDFS holds very large amounts of data and provides easy access to it. To store such huge volumes, files are spread across multiple machines and stored redundantly, so the system can recover from data loss when a machine fails. HDFS also makes the data available to applications for parallel processing.
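The splitting and redundant storage described above can be sketched in a few lines of Python. The 128 MB block size and replication factor of 3 are HDFS defaults, but the node list and the round-robin placement below are simplifications for illustration, not HDFS's actual placement policy.

```python
# Conceptual sketch of HDFS-style block splitting and replication.
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the HDFS default block size
REPLICATION = 3                  # default replication factor

def place_blocks(file_size, nodes):
    """Split a file of file_size bytes into blocks and assign each
    block to REPLICATION distinct nodes (round-robin for illustration)."""
    n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
    placement = {}
    for b in range(n_blocks):
        replicas = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        placement[b] = replicas
    return placement

nodes = ["node1", "node2", "node3", "node4"]
layout = place_blocks(300 * 1024 * 1024, nodes)  # a 300 MB file
for block, replicas in layout.items():
    print(block, replicas)
```

A 300 MB file yields three blocks, and each block lives on three different nodes, so the loss of any single machine leaves every block recoverable.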
Features of HDFS
a. It is suitable for distributed storage and processing.
b. Hadoop provides a command-line interface to interact with HDFS.
c. The built-in servers of the namenode and datanode help users easily check the status of the cluster.
d. It provides streaming access to file system data.
e. HDFS provides file permissions and authentication.
MapReduce is a processing technique and programming model for distributed computing, originally implemented in Java. A MapReduce algorithm contains two important tasks, namely Map and Reduce. The map task takes a set of data and converts it into another set of data in which individual elements are broken down into key/value tuples. The reduce task takes the output of a map as its input and combines those tuples into a smaller set of tuples. As the order of the name MapReduce implies, the reduce task is always performed after the map task. The idea behind MapReduce is that Hadoop can first map a large data set and then perform a reduction on that content to produce specific results. A reduce function can be thought of as a kind of filter or aggregator for the mapped data.
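The map, shuffle, and reduce phases just described can be sketched in plain Python with a word-count example, the classic MapReduce illustration. Hadoop runs these phases across many nodes; here everything runs in one process so the data flow is easy to follow.

```python
# Minimal in-memory sketch of the MapReduce model: a map phase that
# emits (key, value) tuples, a shuffle that groups values by key, and
# a reduce phase that combines each group into a final result.
from collections import defaultdict

def map_phase(documents):
    """Map: break each document into (word, 1) tuples."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's list of values into a single count."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data is big", "data is data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'is': 2}
```

Because each map call and each reduce call is independent, Hadoop can distribute them across a cluster and still arrive at the same result.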
Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.
To write data analysis programs, Pig provides a high-level language known as Pig Latin. Pig has a component known as the Pig Engine that accepts Pig Latin scripts as input and converts those scripts into MapReduce jobs.
Features of Pig
Pig comes with the following features:
a. Rich set of operators − It provides many operators to perform operations like join, sort, filter, etc.
b. Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script with good knowledge of SQL.
c. Optimization opportunities − The tasks in Pig optimize their execution automatically, so the programmers need to focus only on semantics of the language.
d. Extensibility − Using the existing operators, users can develop their own functions to read, process, and write data.
e. UDFs − Pig provides the facility to create user-defined functions in other programming languages such as Java and invoke or embed them in Pig scripts.
f. Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as well as unstructured. It stores the results in HDFS.
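The data-flow style that Pig Latin encourages can be mimicked in plain Python. The comments below show a rough Pig Latin analogue of each step; the records and field names are invented for this illustration, and a real Pig script would compile these operators into MapReduce jobs over HDFS data.

```python
# Conceptual sketch of the Pig data-flow style: each named step is a
# transformation of the previous relation, as in a Pig Latin script.
students = [
    ("alice", 72), ("bob", 45), ("carol", 88), ("dave", 60),
]

# Pig Latin analogue: passed = FILTER students BY score >= 50;
passed = [(name, score) for name, score in students if score >= 50]

# Pig Latin analogue: ranked = ORDER passed BY score DESC;
ranked = sorted(passed, key=lambda rec: rec[1], reverse=True)

print(ranked)  # [('carol', 88), ('alice', 72), ('dave', 60)]
```

Each intermediate relation (`passed`, `ranked`) has a name, which is exactly how Pig Latin scripts chain operators into a readable pipeline.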
Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize big data, and it makes querying and analysis easy.
Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by many different companies.
Features of Hive
a. It stores schema in a database and processed data in HDFS.
b. It is designed for OLAP.
c. It provides an SQL-type query language called HiveQL or HQL.
d. It is familiar, fast, scalable, and extensible.
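HiveQL reads much like standard SQL. As a rough illustration, the same query shape is shown here against an in-memory SQLite database from Python's standard library; in Hive the query would run over tables backed by data in HDFS, and the table and column names below are invented for this example.

```python
# Illustration of an SQL-style aggregation like those written in HiveQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("tv", 300), ("tv", 200), ("radio", 50)])

# The equivalent HiveQL would look essentially the same:
#   SELECT product, SUM(amount) FROM sales GROUP BY product;
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('radio', 50), ('tv', 500)]
```

This familiarity is the point of Hive: analysts who already know SQL can query big data in Hadoop without writing MapReduce jobs by hand.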