Hadoop has grown in many ways to let technical people of all levels use it. Java programmers have a distinct advantage when it comes to Hadoop programming, but you do not have to know Java or Jython to work with Hadoop. Apache Pig can handle almost any data manipulation, on structured or unstructured data, which is why “Pig” and “Hadoop” are such closely interrelated terms within the Hadoop family. Apache Pig’s purpose is to generate MapReduce jobs for large data sets without the need to write Java code by hand.
Hadoop has changed over the years, driven by growing demand from users in the field for analysis of ever larger amounts of data. Every Hadoop component has shipped new features with each new release, and Apache Pig is no exception. This article takes a closer look at the major areas of change across Apache Pig’s major releases.
Apache Pig in a Few Words
Apache Pig is a high-level scripting platform that makes it easy for Hadoop developers to express complex data transformations. Its language, Pig Latin, is a procedural language with a SQL-like feel. To solve real business problems, Pig’s User Defined Functions (UDF) feature is a great option: Pig can invoke code written in other languages such as Java, JRuby, and Jython, and developers can also embed Pig scripts inside programs written in those languages. A minimal Pig Latin script is sketched below.
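To make this concrete, here is a minimal Pig Latin sketch; the input file, field names, and schema are hypothetical and only stand in for whatever your own data looks like.

    -- Load a tab-separated file of page visits (hypothetical path and schema)
    visits = LOAD 'visits.tsv' USING PigStorage('\t') AS (user_id:chararray, url:chararray, ts:long);

    -- Keep visits to one site and count them per user
    site_visits = FILTER visits BY url == 'https://example.com/home';
    grouped     = GROUP site_visits BY user_id;
    counts      = FOREACH grouped GENERATE group AS user_id, COUNT(site_visits) AS visit_count;

    -- Write the result back out to HDFS
    STORE counts INTO 'visit_counts' USING PigStorage('\t');

When this script runs, Pig compiles the whole pipeline into one or more MapReduce jobs behind the scenes.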
(Image source - https://www.safaribooksonline.com/library/view/hdinsight-essentials/9781849695367/graphics/5367OS_06_04.jpg)

Why is Apache Pig Useful When Hadoop Has Its Own MapReduce?
Both MapReduce and Pig perform data processing. MapReduce, however, works at a low level of abstraction, whereas Pig processes large data sets at a much higher level. A handful of Pig transformations typically compiles down into a number of MapReduce jobs. Beyond abstraction, there are other framework-level differences between MapReduce processing and Pig processing.
Pig Latin offers almost all of the common data-processing operations, such as join, filter, union, and order by, as built-in operators. With raw MapReduce, only grouping comes naturally out of the framework; operations such as projection, filtering, ordering, and joins must be written as custom programs by the user. A short Pig Latin sketch of these operators follows.
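The relation names and input files below are invented for illustration; the point is simply that filter, join, and order by are single statements in Pig Latin rather than hand-written MapReduce programs.

    -- Hypothetical inputs: orders and customers
    orders    = LOAD 'orders.csv'    USING PigStorage(',') AS (order_id:int, cust_id:int, amount:double);
    customers = LOAD 'customers.csv' USING PigStorage(',') AS (cust_id:int, name:chararray, country:chararray);

    -- filter, join, and order by, each as one operator
    big_orders = FILTER orders BY amount > 100.0;
    joined     = JOIN big_orders BY cust_id, customers BY cust_id;
    by_amount  = ORDER joined BY big_orders::amount DESC;

    DUMP by_amount;   -- Pig turns these statements into the necessary MapReduce jobs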
Apache Pig Hadoop Versions Through the Years
Apache Pig has had 24 releases since its incubation.
Apache Pig Evolution in the Hadoop 1.0 Series
Apache Pig’s first release targeted Hadoop 0.18, while the project was still in incubation, and it was not yet stable from Hadoop’s perspective. The next release, a maintenance version, served as Pig’s first version as a Hadoop subproject. The following changes to Pig functionality and performance arrived over the subsequent releases, from Pig 0.1.1 through 0.10.
Features included:
Five-fold performance gain
Multi-query optimization (It allows computation to be shared across multiple queries within one Pig script)
Two new joins introduced – skewed join and merge join (see the sketch after this list)
Performance and memory usage improvements
Adding the Accumulator interface to UDFs
Including a new LoadFunc and StoreFunc interface
Including custom partitioner
Python UDF
Control structures, query parser changes and semantic cleanup
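As a rough sketch of the two new join strategies (the relation names and files here are hypothetical), a skewed join spreads heavily repeated keys across reducers, while a merge join exploits inputs that are already sorted on the join key.

    clicks = LOAD 'clicks' AS (user_id:int, url:chararray);
    users  = LOAD 'users'  AS (user_id:int, name:chararray);

    -- Skewed join: useful when a few user_id values dominate the click data
    j1 = JOIN clicks BY user_id, users BY user_id USING 'skewed';

    -- Merge join: both inputs must already be sorted on the join key
    j2 = JOIN clicks BY user_id, users BY user_id USING 'merge';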
Version 0.10.0 was the most important Apache Pig release of the Hadoop 1.0 era.
Features included:
Boolean datatype
JRuby UDFs
Nested cross/foreach (see the sketch after this list)
LIMIT by expression
UDF
Default split destination (SPLIT ... OTHERWISE)
Map-side aggregation
Support for map/tuple syntax
Source-code-only distribution
Improved support for Apache Hadoop 2 through various Maven artifacts
Better support for Oracle JDK 7
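A brief, hedged sketch of two of the 0.10.0-era additions, the boolean datatype and nested foreach; the file and field names are invented for illustration.

    users = LOAD 'users.tsv' AS (name:chararray, active:boolean, score:int);

    -- The boolean type can be declared in schemas and used in conditions
    active_users = FILTER users BY active == true;

    -- Nested foreach: per-group sort and limit inside one block
    grouped = GROUP active_users BY name;
    top_scores = FOREACH grouped {
        sorted = ORDER active_users BY score DESC;
        top2   = LIMIT sorted 2;
        GENERATE group AS name, top2;
    };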
Apache Pig Evolution in Hadoop 2.x and Beyond
Hadoop 2.x is in many ways significantly different from Hadoop 1.x. It offers:
Better scalability through YARN
The ability to run non-MapReduce jobs
High availability of NameNodes
Native Windows support
Higher cluster utilization
Processing models beyond the batch approach
This puts more demand on utility tools such as Pig to perform better as well.
Apache Pig 0.12.0 was the first major Pig release in the Hadoop 2.x series.
Features included:
ASSERT operator – For data validation
Streaming UDF – for writing UDFs that run outside the JVM
New AvroStorage – now works as a Pig built-in function, and is faster
IN and CASE operators
BigInteger and BigDecimal data types – some applications require high-precision calculations, and these types support exact arithmetic in such cases (see the sketch after this list)
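The following is a rough illustration of the 0.12.0 additions in Pig Latin; the schema, file name, and values are hypothetical.

    -- Hypothetical ledger data; amount uses the new bigdecimal type for exact precision
    txns = LOAD 'transactions.tsv' AS (id:int, region:chararray, amount:bigdecimal);

    -- ASSERT: fail the job if any record violates the condition
    ASSERT txns BY id > 0, 'id must be positive';

    -- IN: shorthand for a chain of equality comparisons
    emea = FILTER txns BY region IN ('UK', 'DE', 'FR');

    -- CASE: inline conditional inside FOREACH
    labeled = FOREACH emea GENERATE id, amount,
              (CASE region WHEN 'UK' THEN 'domestic' ELSE 'export' END) AS market;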
The potential of non-MapReduce execution engines becomes visible with Hadoop 2.x, and Apache Pig 0.13.0 made the changes necessary for Pig to run on Hadoop’s non-MapReduce engines.