Good data means Good AI

Introduction

The adage "AI + Bad Data = Bad AI" sums up this entire article: an AI system is only as good as its data. The success of any AI system depends heavily on the quality and suitability of the data it processes. In highly regulated industries like life sciences and healthcare, using the right data is not only a best practice but a critical requirement. Adhering to industry standards and regulatory guidelines helps ensure that data used in AI systems is accurate, reliable, and compliant with legal obligations. This article explores why selecting the right data is crucial, what constitutes the right data, and how to ensure data quality, pairing practical steps with the relevant regulatory guidance.

Why Is the Choice of the Right Data Important for Any AI System?

Data serves as the foundation for AI systems, influencing their reliability, accuracy, and compliance. If the data fed into an AI system is flawed, incomplete, or biased, the system’s outputs will reflect these issues, leading to poor decision-making, non-compliance, and potential harm.

For instance, if an AI system in pharmaceutical manufacturing uses incorrect data, it could result in products that do not meet safety standards, putting patients at risk. Regulatory bodies such as the FDA and EMA mandate strict data controls to avoid these risks. Beyond compliance, the right data is essential for maintaining organizational reputation, trust, and operational efficiency.

What Is the Right Data for an AI System?

The right data for an AI system is accurate, relevant, complete, and structured appropriately for the system’s objectives. This data must be free from bias, secure, and aligned with industry-specific standards. Health authorities expect the right data to be attributable, legible, contemporaneous, original, accurate, complete, consistent, enduring, and available (the ALCOA+ principles). NIST expects the right data to have accuracy, completeness, consistency, timeliness, accessibility, relevance, reliability, integrity, validity, and uniqueness. Both expect data to be looked at in context. For example, in predictive maintenance, the right data would include detailed historical failure records rather than general operational data. The volume and scalability of the data also play a critical role, as AI models often require large datasets to perform effectively.

How to Ensure Data Is Accurate, Reliable, Consistent, Integral, and of High Quality for Any AI System

Building a robust AI system starts with a commitment to high-quality data. The success of an AI model hinges not just on sophisticated algorithms but on the integrity, reliability, and quality of the data it processes. Poor-quality data can lead to flawed insights, biased decisions, and significant compliance risks, particularly in regulated industries.

Ensuring that the data fed into an AI system is accurate, reliable, consistent, integral, and of high quality involves a multifaceted approach that spans data governance, security, continuous monitoring, and ethical considerations. By following industry best practices and adhering to regulatory standards, organizations can mitigate risks and enhance the effectiveness of their AI initiatives. Below, I outline the essential steps to achieve this, along with the corresponding regulatory guidelines that ensure compliance and data excellence.

Implement Strong Data Integrity and Governance Practices

Ensuring data integrity and governance prevents unauthorized alterations and maintains data accuracy over time. 21 CFR Part 11 and EU Annex 11 regulations require that electronic records and signatures are trustworthy, reliable, and equivalent to paper records. ISO 9001 emphasizes the need for data that supports continuous improvement in quality management processes. IEC 62304 mandates rigorous data management for software used in medical devices to ensure integrity throughout the software lifecycle.

  • Define clear data governance policies, including roles and responsibilities for data management.
  • Establish and execute procedures to ensure data is accurate, complete, and protected from unauthorized changes.
  • Implement audit trails to track data modifications and access.
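
The audit-trail step above can be sketched in a few lines of Python. This is a minimal illustration, not a Part 11 or Annex 11 compliant implementation: the field names, the `append_entry` and `verify_trail` helpers, and the choice of a SHA-256 hash chain are all assumptions made for the example. Each entry stores who changed what, when, and why, plus the hash of the previous entry, so a later edit to any entry becomes detectable.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(entry: dict) -> str:
    """Hash the entry's content (everything except its own hash)."""
    payload = {k: v for k, v in entry.items() if k != "hash"}
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(trail: list, user: str, action: str, record_id: str,
                 old_value, new_value, reason: str) -> dict:
    """Append a who/what/when/why entry that is chained to the previous one."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "record_id": record_id,
        "old_value": old_value,
        "new_value": new_value,
        "reason": reason,
        "prev_hash": trail[-1]["hash"] if trail else GENESIS_HASH,
    }
    entry["hash"] = _entry_hash(entry)
    trail.append(entry)
    return entry


def verify_trail(trail: list) -> bool:
    """Return True only if the chain is intact.

    Detects entries that were edited, reordered, or removed from the middle.
    It does not detect truncation of the newest entries.
    """
    prev_hash = GENESIS_HASH
    for entry in trail:
        if entry["prev_hash"] != prev_hash or entry["hash"] != _entry_hash(entry):
            return False
        prev_hash = entry["hash"]
    return True
```

In use, every write path to a governed record would call `append_entry`, and a scheduled job would call `verify_trail`. A production system would also need secure storage and access control for the trail itself, which this sketch leaves out.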

Prioritize Data Security and Privacy

Protecting data from breaches and ensuring compliance with privacy regulations is essential for maintaining trust and avoiding legal repercussions. ISO/IEC 27001 provides a framework for managing information security risks. GDPR mandates stringent controls over the processing of personal data, ensuring individuals' privacy rights. HIPAA establishes standards for protecting sensitive patient information in the healthcare sector.

  • Implement robust data encryption and access control mechanisms to protect data from unauthorized access.
  • Ensure compliance with data privacy laws by managing sensitive information securely and transparently.
  • Regularly audit data security measures to identify and address potential vulnerabilities.
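
As a rough illustration of the access-control step above, the sketch below combines a role-based field filter with pseudonymization of a direct identifier. The role names, field names, and the `view_for_role` helper are hypothetical. HMAC-SHA256 is used here as one common way to derive a stable token from an identifier; it is not encryption, and it does not by itself make a dataset compliant with GDPR or HIPAA. Encryption at rest and in transit would rely on a vetted cryptography library and is not shown.

```python
import hashlib
import hmac

# Hypothetical role-to-field permissions; a real system would load these from policy.
ROLE_PERMISSIONS = {
    "data_scientist": {"age_band", "diagnosis_code", "patient_token"},
    "clinician": {"patient_id", "age_band", "diagnosis_code"},
}


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def view_for_role(record: dict, role: str, secret_key: bytes) -> dict:
    """Return only the fields the role may see; unknown roles see nothing."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    # Add the pseudonymous token, then filter down to the permitted fields.
    enriched = dict(record, patient_token=pseudonymize(record["patient_id"], secret_key))
    return {name: value for name, value in enriched.items() if name in allowed}
```

With this layout, a model-training pipeline running under the `data_scientist` role receives a stable token instead of the patient identifier, so records can still be linked without exposing who they belong to. The secret key must be stored and rotated under the same controls as any other credential.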

Continuously Monitor Data Quality and Establish Feedback Loops

Continuous monitoring ensures that data quality remains high, preventing degradation over time and keeping AI models effective. The NIST AI Risk Management Framework (AI RMF) recommends continuous assessment and feedback mechanisms to ensure data quality and model performance. ISO/IEC 27001 ensures that data security is maintained during continuous monitoring. IEC 61508 addresses the functional safety of electronic systems, emphasizing the need for ongoing data quality checks.

  • Set up real-time monitoring tools to continuously assess data quality, identifying errors, inconsistencies, or anomalies as they occur.
  • Implement feedback loops to update and refine AI models based on new data and insights.
  • Regularly review the performance of AI systems and adjust data inputs to maintain accuracy and relevance.
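
The monitoring steps above could start from something as simple as the following sketch, which checks each incoming batch for completeness and for values that drift far from a known-good baseline. The function names, the 95% completeness threshold, and the three-standard-deviation rule are illustrative assumptions, not values taken from any of the cited standards.

```python
from statistics import mean, stdev


def completeness(records: list, required_fields: list) -> float:
    """Fraction of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(r.get(f) not in (None, "") for f in required_fields) for r in records
    )
    return complete / len(records)


def out_of_range(values: list, baseline: list, z_threshold: float = 3.0) -> list:
    """Values more than z_threshold baseline standard deviations from the baseline mean."""
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return [v for v in values if v != mu]
    return [v for v in values if abs(v - mu) / sigma > z_threshold]


def quality_report(records: list, required_fields: list, field: str,
                   baseline: list, min_completeness: float = 0.95) -> dict:
    """Summarize one batch and raise an alert flag if any check fails."""
    values = [r[field] for r in records if isinstance(r.get(field), (int, float))]
    report = {
        "completeness": completeness(records, required_fields),
        "anomalies": out_of_range(values, baseline),
    }
    report["alert"] = report["completeness"] < min_completeness or bool(report["anomalies"])
    return report
```

A batch that raises the `alert` flag would be held back for review rather than fed to the model, which is the feedback loop the bullets describe. The baseline should itself be reviewed periodically so that it does not silently absorb drift.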

Cleanse and Normalize Data Before Use

Clean, consistent data is essential for producing accurate and reliable AI outputs. NIST Data Quality Guidelines stress the importance of eliminating inconsistencies and errors before data is used in AI. IEEE 1012 provides verification and validation (V&V) standards to ensure data meets predefined quality criteria.

  • Conduct thorough data cleansing to remove errors, duplicates, and irrelevant information from datasets.
  • Normalize data to ensure consistency in formatting, units of measurement, and categorization across datasets.
  • Validate cleansed and normalized data to ensure it meets quality standards before feeding it into AI systems.
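
A minimal sketch of the cleanse, normalize, and validate sequence above is shown below. The record layout (`sample_id`, `weight`, `unit`), the conversion table, and the "weight must be positive" rule are assumptions chosen for the example; real pipelines would take these from the dataset's specification.

```python
# Hypothetical conversion factors to a canonical unit (kilograms).
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def normalize(record: dict) -> dict:
    """Standardize identifier formatting and convert weight to kilograms."""
    unit = record["unit"].strip().lower()
    return {
        "sample_id": record["sample_id"].strip().upper(),
        "weight_kg": round(float(record["weight"]) * TO_KG[unit], 6),
    }


def cleanse(records: list) -> tuple:
    """Return (clean, rejected): normalized unique records and those that failed."""
    clean, rejected, seen = [], [], set()
    for record in records:
        try:
            row = normalize(record)
        except (KeyError, ValueError, TypeError, AttributeError):
            rejected.append(record)      # keep rejects for review instead of dropping them
            continue
        if row["weight_kg"] <= 0:        # validation rule: weight must be positive
            rejected.append(record)
            continue
        if row["sample_id"] in seen:     # duplicate of an already accepted sample
            continue
        seen.add(row["sample_id"])
        clean.append(row)
    return clean, rejected
```

Returning the rejected records separately, rather than discarding them, keeps the cleansing step itself auditable: someone can later check why a record never reached the model.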

Ensure Data Provenance and Traceability

Understanding where data comes from and how it has been processed is crucial for maintaining trust and accountability. ICH Q7 and PIC/S guidelines require detailed documentation to ensure data lineage and traceability. ISO/IEC 17025 mandates that testing and calibration laboratories verify and trace their data sources to ensure reliability.

  • Document the origin of data, including its source, collection methods, and any transformations it undergoes.
  • Maintain records that allow data to be traced back to its original source, ensuring transparency and accountability.
  • Implement version control to track changes in data over time, ensuring that all modifications are documented.
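
One lightweight way to approach the provenance steps above is to store a content fingerprint for every version of a dataset together with a description of the transformation that produced it. The `ProvenanceRecord` class and its field names below are hypothetical, and a real deployment would more likely rely on a dedicated data versioning or lineage tool. The sketch only shows the underlying idea.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


def fingerprint(rows: list) -> str:
    """Content hash of a dataset; any change to the rows changes the fingerprint."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode("utf-8")).hexdigest()


@dataclass
class ProvenanceRecord:
    source: str                 # where the data came from
    collection_method: str      # how it was collected
    versions: list = field(default_factory=list)

    def record_version(self, rows: list, transformation: str) -> dict:
        """Log one version of the dataset together with the step that produced it."""
        version = {
            "version": len(self.versions) + 1,
            "transformation": transformation,
            "fingerprint": fingerprint(rows),
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        }
        self.versions.append(version)
        return version

    def matches_latest(self, rows: list) -> bool:
        """Check that rows are identical to the most recently recorded version."""
        return bool(self.versions) and self.versions[-1]["fingerprint"] == fingerprint(rows)
```

Before training, a pipeline can call `matches_latest` on the data it is about to use. A mismatch means the dataset changed without a documented transformation, which is exactly the gap traceability is meant to close.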

Adhere to Ethical Data Use and AI Governance Principles

Ethical data use is crucial for maintaining public trust, avoiding bias, and ensuring fair decision-making by AI systems. The NIST AI RMF provides guidelines for developing AI systems that align with ethical standards. IEEE Standards for AI Ethics emphasize transparency, accountability, and fairness in AI data practices. 21 CFR Part 11 and EU Annex 11 reinforce the importance of adhering to ethical principles in GxP-regulated environments.

  • Establish clear ethical guidelines for the use of data in AI, focusing on fairness, transparency, and accountability.
  • Implement governance frameworks that monitor and enforce these ethical guidelines, particularly in decision-making processes.
  • Regularly assess AI systems for potential biases and adjust data inputs or models to mitigate any identified issues.
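
The bias assessment in the last step can begin with a simple comparison of outcomes across groups. The sketch below computes the share of positive predictions per group and the gap between the highest and lowest share, often called the demographic parity difference. This metric is only one of several fairness measures and is not prescribed by the standards cited above; the function names and the binary 0/1 prediction format are assumptions for the example.

```python
from collections import defaultdict


def selection_rates(predictions: list, groups: list) -> dict:
    """Share of positive (1) predictions within each group."""
    totals, positives = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    return {g: positives[g] / totals[g] for g in totals}


def parity_gap(predictions: list, groups: list) -> float:
    """Difference between the highest and lowest group selection rate."""
    rates = selection_rates(predictions, groups)
    return max(rates.values()) - min(rates.values())
```

A gap near zero means the model selects members of each group at a similar rate. What counts as an acceptable gap is a governance decision that depends on the use case, so the threshold belongs in the ethical guidelines rather than in the code.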

Regularly Review and Audit Data

Regular reviews and audits help sustain long-term data quality, ensuring ongoing compliance and reliability. NIST Guidelines advocate for continuous assessment and periodic review of data quality.

  • Schedule periodic data reviews to ensure that datasets remain accurate, complete, and relevant over time.
  • Conduct regular audits to verify that data management practices align with regulatory requirements and internal policies.
  • Document audit results and take corrective actions where necessary to maintain data quality and integrity.

Conclusion

Ensuring that data used in AI systems is accurate, reliable, consistent, integral, and of high quality requires a comprehensive approach. By implementing strong data governance, prioritizing security and privacy, continuously monitoring data quality, and adhering to ethical principles, organizations can build AI systems that are not only effective but also compliant with regulatory requirements. The steps outlined above, supported by industry standards and regulatory guidelines, help organizations navigate the complexities of data management in AI with confidence, so that their systems deliver accurate, trustworthy, and valuable outcomes.

References

21 CFR Part 11, EU Annex 11

ISO 9001:2015

ISO/IEC 27001:2013, ISO/IEC 17025:2017

IEC 62304:2006, IEC 61508:2010

ICH Q7, PIC/S guidelines

GDPR, HIPAA

NIST AI RMF

IEEE 1012-2016, Data Annotation and AI Ethics


Disclaimer: This article is the author's point of view on the subject, based on his understanding and interpretation of the regulations and their application. Note that AI was used for the article's first draft to build an initial story covering the points provided by the author. The author then reviewed, updated, and added to the draft to ensure accuracy and completeness to the best of his ability. Please review it for your intended purpose before using it. It is free for use by anyone as long as the author is credited for the work.



Christian Schmitz-Moormann

Relentlessly building tomorrow with great people, robust processes and cutting-edge validated systems for strictly regulated industries.

3mo

Hi Ankur, great overview, thanks for putting it out. I would like to point to two topics which are probably complex, but which, if considered well, will support your outlined approach. One is time: in areas like pharmacovigilance, data has been collected and curated carefully for years, but over time the standards have changed, and we still see changes. Data which was considered good a few years ago may not meet today's quality standards. We need to be able to handle this aspect, as the old data may also have high value and should be usable. Two is documents: a lot of valuable information is stored in documents, and we see approaches for making this information accessible and usable as data. Most of these approaches treat documents as monolithic entities without inherent structure. And, frankly, most documents do have an inherent structure. Losing this structural information leads to lower-than-possible quality of data.

Anvesh Jupaka

Validation Lead - IT CSV @ Spotline Inc | Quality Compliance - IT | Cloud Compliance | SaaS Validation | SAP S4 HANA | AI - ML

3mo

Great analysis. Thanks for sharing.
