Preparing Data Is a Big Part of Big Data

By arvind | Oct 15, 2018

It is another case of the grass looking greener on the other side. Just as people in most professions assume that work in other professions is relatively easy, the outside world assumes the same about big data scientists. Many people still have the wrong idea that all you need to do in big data is load things up and everything will happen automatically, like a magic trick. That is not how things happen. Data scientists are no magicians; a lot of work goes on behind the scenes before the result looks easy to everyone on the outside. A New York Times article that I recall, "For Big-Data Scientists, 'Janitor Work' Is Key Hurdle to Insights", explained how big data platforms have to be loaded with quality data and given structure before they can return value for your work. To be honest, it might not look like much, but it is a lot of work. Take a professional photographer as an example. People ask why he charges so much money for just clicking a button, which anyone can do. Yet they cannot take photos at the same level as he does, because a lot goes on before the button is pressed.
A data scientist has to spend long hours preparing a project, and that phase takes more time than the actual work that gets shown in newspapers, on TV or on the internet. You can call it whatever you want, data munging, data wrangling or data janitor work, but the truth is that 50%-80% of a data scientist's time is spent on data preparation.
Selection of Data
Before starting a project, make sure the data you get is the data you need. Specify what data the project needs instead of adopting a "just throw it all in" attitude. If you end up taking in low-quality data that is not relevant to your business objective, there is a good chance it will show in your results. The more clutter there is in the data, the less likely it is that people will be able to see the important trends. Have a well-defined strategy for the data sources you need and for the specific subset of the data that makes sense for the questions you want to ask.
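As a rough illustration of picking a specific subset instead of throwing everything in, here is a minimal Python sketch. The CSV content, the column names and the question being asked (EU orders in 2018) are all made up for the example; a real project would read from its own sources.

```python
import csv
import io

# Hypothetical raw export; in practice this would be a file or a query result.
RAW_CSV = """order_id,region,order_date,amount,internal_note
1,EU,2018-03-01,120.50,checked
2,US,2018-03-02,80.00,
3,EU,2017-11-15,42.00,legacy
"""

# Only the fields the (assumed) business question needs.
NEEDED_FIELDS = ["order_id", "region", "order_date", "amount"]


def select_rows(csv_text, region, year):
    """Keep only the needed columns, and only rows matching region and year."""
    reader = csv.DictReader(io.StringIO(csv_text))
    selected = []
    for row in reader:
        if row["region"] == region and row["order_date"].startswith(str(year)):
            selected.append({field: row[field] for field in NEEDED_FIELDS})
    return selected


eu_2018 = select_rows(RAW_CSV, "EU", 2018)
```

The point of the sketch is that both the columns and the rows are chosen by the question being asked, so unrelated fields such as the hypothetical `internal_note` never reach the analysis.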
Data relationships should be defined
If you are doing a big data project for a corporation, the business questions often demand a data store made up of a combination of structured, semi-structured and unstructured data. Data scientists may need to organize a set of unstructured or semi-structured documents from SharePoint or a shared drive against master data contained in a set of structured systems. When importing structured data from multiple systems, the data relationships must be defined. Your big data platform will not magically know that "customer no" in one set of data is the same as "cust_id" in another. You must define the relationships between the data sources.
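To make the "customer no" versus "cust_id" point concrete, here is a small Python sketch under stated assumptions: the two record sets, the source names "crm" and "billing" and the helper `join_on_customer` are invented for illustration. The only idea taken from the text is that the key relationship has to be declared by a person before the data can be joined.

```python
# Hypothetical records from two systems that name the customer key differently.
crm_rows = [
    {"customer no": "C001", "name": "Acme Ltd"},
    {"customer no": "C002", "name": "Globex"},
]
billing_rows = [
    {"cust_id": "C001", "invoice_total": 1200.0},
    {"cust_id": "C001", "invoice_total": 300.0},
    {"cust_id": "C003", "invoice_total": 75.0},
]

# The relationship is declared explicitly: source name -> its customer key field.
CUSTOMER_KEY = {"crm": "customer no", "billing": "cust_id"}


def join_on_customer(crm, billing):
    """Inner-join billing rows to CRM rows using the declared key mapping."""
    crm_by_id = {row[CUSTOMER_KEY["crm"]]: row for row in crm}
    joined = []
    for bill in billing:
        customer = crm_by_id.get(bill[CUSTOMER_KEY["billing"]])
        if customer is None:
            continue  # no matching master record; skipped in this sketch
        joined.append({
            "customer_id": bill[CUSTOMER_KEY["billing"]],
            "name": customer["name"],
            "invoice_total": bill["invoice_total"],
        })
    return joined


joined_rows = join_on_customer(crm_rows, billing_rows)
```

In this sketch, billing rows without a matching master record are silently dropped. A real project would more likely log or review them, since unmatched keys are often a sign of a wrongly defined relationship.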
Extraction and organisation of data
As a data scientist, this is the area where you will spend most of your time. One of the major challenges is acquiring the data. You have to determine whether the data is public, whether there is an API, or whether the data needs to be taken from the web. If it is corporate data, who is allowed to provide extracts and documentation on the structure, and what are the security considerations? Data organisation can include many steps: translating system-specific codes into meaningful, usable data; mapping common fields consistently so that they can be related; handling incomplete or erroneous data; and replicating application logic to make the data self-describing.
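A few of these organisation steps can be sketched in plain Python. Everything named here is an assumption made for the example: the status codes, the field names and the rules for what counts as incomplete or erroneous. Real systems will have their own codes and their own rules.

```python
# Hypothetical lookup table: system-specific status codes -> readable labels.
STATUS_LABELS = {"A": "active", "I": "inactive", "P": "pending"}

# Hypothetical field mapping: source column name -> common column name.
FIELD_MAP = {"cust_id": "customer_id", "stat": "status", "amt": "amount"}


def clean_record(raw):
    """Return a cleaned record, or None if the record cannot be used."""
    # Map source field names to the common names; ignore unmapped fields.
    record = {FIELD_MAP[k]: v for k, v in raw.items() if k in FIELD_MAP}

    # Incomplete data: a record without a customer id cannot be related to anything.
    if not record.get("customer_id"):
        return None

    # Translate the status code; unknown codes are flagged rather than guessed.
    record["status"] = STATUS_LABELS.get(record.get("status"), "unknown")

    # Erroneous data: amounts that do not parse as numbers become None.
    try:
        record["amount"] = float(record.get("amount"))
    except (TypeError, ValueError):
        record["amount"] = None
    return record


raw_rows = [
    {"cust_id": "C001", "stat": "A", "amt": "19.99"},
    {"cust_id": "", "stat": "I", "amt": "5"},
    {"cust_id": "C002", "stat": "X", "amt": "n/a"},
]
cleaned = [r for r in (clean_record(row) for row in raw_rows) if r is not None]
```

Of the three sample rows, the one with an empty customer id is dropped, and the one with an unrecognised code and a non-numeric amount is kept but marked, so the problem stays visible downstream instead of being hidden.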

Source: HOB