Although Apache Pig today is not the most relevant tool for large data analytics in the Hadoop ecosystem, it is useful to know its basic principles of work and key differences from Hive. Also consider how Hive differs from PIG as a SQL-on-Hadoop tool.
What is Apache Pig
Apache Pig is a high -level procedure for performing data requests on the Hadoop platform. PIG simplifies work with MapredUCE, allowing you to write SQL-like queries to distributed data sets. Being originally designed as a high -level procedure of data streams, PIG makes it possible to paralle the programs written on it. Instead of the time-consuming development of a separate MapredUCE application, you can write a simple script on Pig Latin, which will automatically disperse and distribute between the cluster nodes. This greatly simplifies the work of the data engineer and the developer of distributed applications. Due to the fact that Pig Latin supports work with relational operators, the entrance threshold becomes much lower, because SQL is familiar to almost everyone in the world of Big Data and Data Science.
However, despite the fact that Pig Latin is similar to SQL, this is not the same thing. Unlike the declarative SQL, PIG is a procedural language and contains much less requests optimization. Whereas SQL is designed to work with structured data, PIG does not have a rigid reference to the data structure and is not demanding on the presence of data scheme.
And, unlike MapredUCE, PIG programs are not compiled. However, “under the hood” of pig-scripts are still related to the tasks of MapredUCE: PIG stores intermediate data generated by the tasks of MapredUCE in the temporary directory of HDFS.
Initially, pig-scripts are treated with a syntactic analyzer (Parser), which checks the script syntax, performs types of types and performs other validations. The result of parsing will be DAG (Directed Acyclic Graph), which is Pig Latin operators in the form of nodes, and data flows in the form of ribs. Further, this directed acyclical graph is transmitted to the logical optimizer, which performs logical optimization. Further, the compiler composes the optimized logical plan in the MapreducE series of tasks, which are transmitted to Hadoop in a sorted manner and are executed.
Pig in Hadoop has two execution modes:
- Local mode of operation on one JVM and using a local file system, which is suitable only for analyzing small data sets;
- The MapredUCE mode, where the requests written on Pig Latin are translated into Mapreduce tasks and are executed in the Hadoop cluster, which is suitable for large data sets.
In practice, PIG is used to quickly create prototypes of distributed applications by large data analysts with a changeable structure of such as web professes. PIG can also come in handy in creating search platforms and processing sensitive data from the time. Having got acquainted with Apache Pig, we will figure out how it differs from Hive and his Hiveql requests.
Differences from Hive
Recall that Apache Hive is the SQL-on-Hadoop NOSQL storage, which provides access to data stored in the Hadoop (HDFS) distributed file systems in the internal declarative language of HIVIVEQL. To connect users to HIVE, a command string tool and a JDBC drive is provided. Like PIG, HIVE instead of writing complex MapredUCE programs on Java, it simplifies the appeal to structured and semi-structured data. Hive supports DDL and DML request, as well as UDF function.
However, HIVE uses SQL-like Hiveql requests, and PIG is a procedure of data flow. Most often, Hive is the tools of a date-analytics and data engineers, and application developers write on PIG. Hive is a little slower than PIG and works on the side of the HDFS cluster server. PIG works on the client’s side and does not support ODBC/JDBC drivers, as well as sections in tables and data scheme.