We often see that a strategic move from legacy systems to new platforms such as cloud or big data brings new challenges with it. When it comes to data migrations, the biggest challenge is ensuring full-fledged data validation without running into processing or memory issues. Straightforward SQL alone is not enough to validate voluminous data; this is where upskilling to the latest technologies helps you achieve full-fledged data validation.
In the Hadoop ecosystem, several technologies are available for full-fledged data validation, making use of the MapReduce functionality at the heart of this cluster-computing engine. Hive and Pig are two of them, and both are good for validating data between databases and file systems. However, they are not as efficient as the Spark framework, which is often cited as being around 10 times faster than Hadoop MapReduce for disk-based workloads.
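To make the idea of SQL-based migration validation concrete, here is a minimal sketch comparing a "legacy" source table against a "migrated" target table. SQLite stands in for both systems so the example is self-contained; in a real Spark job the same queries would run via `spark.sql()` against Hive or Parquet tables. The table and column names are hypothetical.

```python
# Minimal sketch of SQL-based migration validation: compare row counts,
# an aggregate checksum, and row-level differences between source and target.
# SQLite is a stand-in here; on Spark the same SQL would run via spark.sql().
import sqlite3

def load(conn, table, rows):
    conn.execute(f"CREATE TABLE {table} (id INTEGER, amount REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)

def validate(conn, source, target):
    """Return a dict of check-name -> bool for basic reconciliation checks."""
    checks = {}
    # Check 1: row counts must match between source and target.
    src_cnt = conn.execute(f"SELECT COUNT(*) FROM {source}").fetchone()[0]
    tgt_cnt = conn.execute(f"SELECT COUNT(*) FROM {target}").fetchone()[0]
    checks["row_count"] = src_cnt == tgt_cnt
    # Check 2: aggregate checksum on a numeric column.
    src_sum = conn.execute(f"SELECT SUM(amount) FROM {source}").fetchone()[0]
    tgt_sum = conn.execute(f"SELECT SUM(amount) FROM {target}").fetchone()[0]
    checks["amount_sum"] = src_sum == tgt_sum
    # Check 3: no rows present in the source but missing from the target.
    diff = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT id, amount FROM {source} "
        f"EXCEPT SELECT id, amount FROM {target})").fetchone()[0]
    checks["no_missing_rows"] = diff == 0
    return checks

conn = sqlite3.connect(":memory:")
load(conn, "src_orders", [(1, 10.0), (2, 25.5), (3, 7.25)])
load(conn, "tgt_orders", [(1, 10.0), (2, 25.5), (3, 7.25)])
print(validate(conn, "src_orders", "tgt_orders"))
```

The same three checks (counts, aggregates, set differences) scale up naturally on Spark, where the EXCEPT query runs as a distributed job instead of in a single process.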
Spark's fast in-memory data processing helps validate large volumes of data in a couple of minutes, so full-fledged data validation can be achieved using SQL. In addition to its built-in functions, Spark provides a rich library for creating user-defined functions (UDFs) to implement custom transformations.
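As a sketch of how a UDF fits into a validation query, the example below registers a Python function and calls it from SQL. `sqlite3.create_function` stands in for Spark's `spark.udf.register()` so the snippet runs without a cluster; the `normalize_code` transformation and table name are hypothetical.

```python
# Sketch: register a Python function as a SQL UDF, then call it from a
# query to apply a migration transformation during validation.
# sqlite3.create_function is a local stand-in for Spark's spark.udf.register().
import sqlite3

def normalize_code(value):
    # Hypothetical transformation applied during migration:
    # trim whitespace and upper-case the code.
    return value.strip().upper() if value is not None else None

conn = sqlite3.connect(":memory:")
conn.create_function("normalize_code", 1, normalize_code)
conn.execute("CREATE TABLE legacy_codes (raw TEXT)")
conn.executemany("INSERT INTO legacy_codes VALUES (?)",
                 [(" abc ",), ("Def",), ("GHI",)])
# Apply the UDF in SQL and collect the transformed values, which can then
# be compared against the migrated target table.
rows = conn.execute("SELECT normalize_code(raw) FROM legacy_codes").fetchall()
print([r[0] for r in rows])  # prints ['ABC', 'DEF', 'GHI']
```

On Spark the pattern is the same: register the function once, then reuse it across validation queries so source-side transformations are reproduced exactly when comparing against the target.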
So, ETL testing is no longer limited to writing SQL; we need to upskill and find better ways to validate data quickly, with less processing time. Investing in learning the big data space makes it much easier to validate large volumes of data using the technologies best suited to the job.