Analyze real data, data scientists!
Data is always messy and its analysis requires smart choices. That’s why AI will never replace human market researchers. In both academia and industry, smart consumers of market research therefore demand transparency about the exact steps researchers took with the data. The peer review process in top journals does a decent, though imperfect, job of catching the choices that would change the actionable insights; a Nature article claimed that peer review would have prevented the Theranos fraud.
What is definitely NOT a good way of dealing with data messiness is to just make up data. In academia, we have seen our share of scandals. In his memoir “Faking Science: A True Story of Academic Fraud”, former psychology professor Diederik Stapel explains how he got annoyed with data not supporting his brilliant ideas at the 95% statistical confidence level, and started fabricating data to fit those ideas. In his words, he “became impatient, overambitious, reckless. I wanted to go faster and better and higher and smarter, all the time.” Fortunately, academia has gotten much better at catching and reporting fraud.
How about faking data in industry? This week’s lawsuit from JPMorgan Chase against Frank is likely just the tip of the iceberg. According to the filing, Frank’s founder and CEO Charlie Javice lied “about Frank’s success, Frank’s size, and the depth of Frank’s market penetration in order to induce JPMC to purchase Frank for $175 million”. Specifically, Javice used “synthetic data” techniques to create a list of 4.265 million “students” who did not actually exist. When a Frank engineer declined to do so, Javice “turned to a data science professor at a New York City area college who advertised his ‘creative solutions’ to data problems.” Based on a list of 293,192 actual students who had started or submitted a FAFSA application through Frank, Javice directed the Data Science Professor to use “synthetic data” techniques to create 4.265 million customer names, email addresses, birthdays, and other personal information. Interestingly, the lawsuit has access to emails between Javice and the Data Science Professor showing their understanding of their actions. The Data Science Professor wrote:
(1) “[f]or names, our plan was to sample first name and last name independently and then ensure none of the sampled names are real”, and
(2) “I can’t seem to find addresses in my raw files . . . . Should I attempt to fabricate them?”
Moreover, when reviewing the synthetic data, the Data Science Professor noted that many entries confusingly had customers living, attending high school, and attending college in the same town and state, and concluded that the list “would look fishy to [him] if [he] were to audit it.”
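The professor’s remark points to a simple plausibility check that an auditor could run on such a list. A minimal Python sketch follows, assuming a hypothetical record layout (the filing does not describe the list’s actual fields). It computes the share of entries whose home, high school, and college all fall in the same town and state. A share far above what a trusted reference list shows would be a reason to ask questions, not proof of fabrication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical fields; the lawsuit does not describe the list's real schema.
    home: str         # "Town, ST" where the customer lives
    high_school: str  # "Town, ST" of the high school
    college: str      # "Town, ST" of the college


def _norm(place: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not hide a match."""
    return " ".join(place.lower().split())


def same_place_share(records: list[Record]) -> float:
    """Return the fraction of records whose three locations are identical."""
    if not records:
        return 0.0
    hits = sum(
        1
        for r in records
        if _norm(r.home) == _norm(r.high_school) == _norm(r.college)
    )
    return hits / len(records)


# Made-up illustration rows, not data from the case.
sample = [
    Record("Ithaca, NY", "Ithaca, NY", "Ithaca, NY"),
    Record("Austin, TX", "Austin, TX", "Boston, MA"),
    Record("Dayton, OH", "dayton,  oh", "Dayton, OH"),
    Record("Miami, FL", "Tampa, FL", "Gainesville, FL"),
]
share = same_place_share(sample)
print(f"{share:.0%} of records place home, high school and college in one town")
```

On its own the number means little; it becomes informative only when compared with the same share in data whose provenance is known.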
Beyond such million-dollar monetary consequences, faking data can cost far more, including failed harvests and famine if you, e.g., “changed centrally held figures for a key metric such as soil fertility that many arable farmers use to organize their planting schedules”. It can also cost lives in medical trials, such as when Dr. Werner Bezwoda faked results showing that high-dose chemotherapy was successful in the treatment of women with high-risk breast cancer. Other researchers build on such research and can waste years, sometimes setting the field back by decades. Such frauds also give people an excuse not to take actual science seriously. For instance, a January 11th study debunks the misconception that COVID trials cut clinical corners.
The bottom line? Stay vigilant as a consumer of scientific studies and market research: ask the tough questions and demand answers. Stay patient as a researcher: it is the dynamic dialogue between theory and empirics that drives science forward.
Freelance Community Manager and Graphic Designer | I help SMEs and independent professionals stand out through effective digital strategies and striking designs
(1mo) Hey Koen! It's crucial to shed light on the dark side of data science and the industry issues you're addressing really hit home. How do you reckon we can ensure more integrity and transparency in both academic and industry sectors? The messy nature of data definitely requires sharp minds to tame it, and your insights highlight the human touch that AI can't replace. Looking forward to more of your insights into the evolving landscape and ways we can push boundaries while avoiding the traps of fraud!
Research Professor (Marketing Science), Director Ehrenberg-Bass Institute, Adelaide University of South Australia.
(2y) Never trust a single study, especially a single set of data. Use it as an interesting starting point.
Professor @ Georgetown | Behavioral Economics | Consumer Empowerment | Product Management | Customer Analytics
(2y) Nice overview!
FouAnalytics - "see Fou yourself" with better analytics
(2y) Interesting details. Thank you for posting.