Focusing on the Data - perspective by Darko Medin
No matter how advanced the methods or models used, many research projects and business development efforts are still limited by the data itself and the insights that can be derived from it. Keeping the focus on the data is essential. One can deploy a state-of-the-art statistical method or machine learning procedure in a data-driven product, but it will still depend heavily on the data quality and the level of information stored in the data. In fact, data will always be the defining 'player' of most data-driven projects.
Further, a focus on data is about much more than just numbers in a table. Conventional wisdom often treats data as a spreadsheet of numbers. This is far from the truth. Data is such a wide concept that one would need around 10 articles like this one to explain most of the pathways around it.
One of the most important steps is knowing whether the data is available, and under which conditions. Assuming it's out there is one of the most common mistakes with data. Making sure the data is available requires a thorough availability analysis. Even if the data is made available in advance for a data-driven project, that doesn't mean more information will not need to be added to the data pool later.
Having a good perspective on the data means having enough information about the data and about its context too. Data without context loses information value and brings uncertainty about its validity. Focusing on where the data came from and how it was collected, adding as much metadata as possible, and making sure the data actually has the information needed to answer the research questions, help businesses make decisions or create state-of-the-art products are all very important parts of Data Science. Observations of the context can be considered a data source too, and as such can contribute to data projects. Very often metadata is an additional source of information that can be used to improve data quality and serve for validation purposes.
We must understand the data to the finest level of detail and virtually learn the story behind it. One way, of course, is to focus on exploratory data analysis (EDA) and data visualization (Data Viz), but this is just the tip of the iceberg. If we want to understand the relevance of the data to the project, we must understand what it is that we are trying to achieve or answer with the data. Knowing which questions can and cannot be answered by the data is clearly one of the important aspects. Instead of running an arbitrary analysis on whatever data is available and then just interpreting the results, every good Statistician should know in advance how to improve the data before the analysis, so that it can answer the questions asked in the study or business model.
Focusing on the data also means extracting the maximal level of information from it, so feature engineering is one of the most important practices in today's data science. Data cleaning and feature selection are also very important.
Distinguishing between information and important information is vital. Irrelevant data can flood the data pipelines without bringing any major improvement to the models. Still, be careful: dropping data based on subjective knowledge is risky. Sometimes variables that make no sense intuitively contain a lot of information, and dropping variables based on intuition or expert opinion is one of the frequent mistakes I've seen in Analytics communities. Intuition and experts don't have to be right. Remember, metrics produced by a good Data Science procedure on good data will get as close as possible to right (never really reaching it, but practically close enough). My advice is always to experiment thoroughly with the value of each variable, using different Data Science procedures, before removing it from or keeping it in the models, products or any research analysis.
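As an illustration of testing a variable's value instead of trusting intuition, here is a minimal sketch in Python. It uses permutation importance, which is only one of many possible procedures and is my choice for this example: shuffle one variable in held-out data and measure how much the prediction error grows. The synthetic data, the tiny k-nearest-neighbours model and the variable names are all hypothetical, chosen so that the variable that "looks" useless is the one that actually carries the signal.

```python
import random

def knn_predict(train_X, train_y, row, k=5):
    # Average the targets of the k nearest training rows (squared Euclidean distance).
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], row)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k

def mean_squared_error(train_X, train_y, test_X, test_y):
    errors = [(knn_predict(train_X, train_y, row) - target) ** 2
              for row, target in zip(test_X, test_y)]
    return sum(errors) / len(errors)

def permutation_importance(train_X, train_y, test_X, test_y, column, rng, repeats=5):
    # Average increase in held-out error after shuffling one column.
    baseline = mean_squared_error(train_X, train_y, test_X, test_y)
    increases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in test_X]
        rng.shuffle(shuffled)
        permuted = [row[:column] + [value] + row[column + 1:]
                    for row, value in zip(test_X, shuffled)]
        increases.append(
            mean_squared_error(train_X, train_y, permuted, test_y) - baseline
        )
    return sum(increases) / repeats

rng = random.Random(0)  # fixed seed so the experiment is repeatable

def make_rows(n):
    # Hypothetical data: the target depends only on the second variable.
    X = [[rng.random(), rng.random()] for _ in range(n)]
    y = [3.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]
    return X, y

train_X, train_y = make_rows(200)
test_X, test_y = make_rows(100)

names = ["plausible_but_uninformative", "implausible_but_informative"]
importance = {
    name: permutation_importance(train_X, train_y, test_X, test_y, col, rng)
    for col, name in enumerate(names)
}
for name, value in importance.items():
    print(f"{name}: {value:.3f}")
```

A variable whose shuffling barely changes the error is a candidate for removal, but only a candidate: repeating the check with other models and other procedures is what the advice above is really about.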
Cleaning the data of invalid observations is only one segment. Many valid types of data might still be of lower quality: they might be mislabeled, less accurate, or contain too much noise, so feature selection can be essential in that case.
Some people say that more is not always better, but I'll tell you, based on my long Machine Learning and Stats experience: more data is better most of the time, with some exceptions. And guess what... collecting more data than is actually needed is a very good idea most of the time, because data can be lost in many filtering and validation procedures. Sometimes data that was planned because it seemed correlated with the labels turns out not to be so in reality, and all these scenarios lead to needing more data.
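The arithmetic behind collecting more than needed can be sketched in a few lines of Python. This is a rough planning aid under a strong assumption: that each filtering or validation step keeps a known, independent fraction of the rows. The function name and the retention rates are hypothetical, not taken from any real project.

```python
import math

def rows_to_collect(rows_needed, retention_rates):
    """Rows to collect so that about `rows_needed` survive every step.

    `retention_rates` holds the fraction of rows each filtering or
    validation step is expected to keep (hypothetical estimates).
    """
    for rate in retention_rates:
        if not 0.0 < rate <= 1.0:
            raise ValueError("each retention rate must be in (0, 1]")
    overall_retention = math.prod(retention_rates)  # 1.0 when there are no steps
    return math.ceil(rows_needed / overall_retention)

# Example: 1000 rows are needed for modeling, cleaning is expected to keep
# 90% of the rows and validation 80% of what remains.
print(rows_to_collect(1000, [0.9, 0.8]))  # prints 1389
```

Even two fairly mild steps raise the requirement by almost 40%, and real retention rates are rarely known in advance, which is one more reason to plan with a margin.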
I like to say: if a research or business project is data-driven and will require Statistical or Data Science modeling, it's generally a good idea to include Statisticians/Data Scientists early on and avoid the potential mistakes around these topics from the start. Making mistakes and then correcting them is many times more costly than putting in the effort early and avoiding them.