GenAI Security Checklist: Top 10 Authorization Strategies for Handling Sensitive Data
IMPORTANT: This article is a collaborative effort, co-authored with my colleague and friend, Grant Miller, IBM Distinguished Engineer and CTO for IBM Data Protection. Together, we’ve combined our points of view and experience to shape the insights shared in this piece.
As organizations increasingly adopt Generative AI (GenAI) for applications involving people-related data, ensuring the responsible and secure handling of sensitive information has become paramount. GenAI's versatility is revolutionizing fields such as human resources, finance, healthcare, and customer service. However, its use also presents unique challenges in managing personally identifiable information (PII) and sensitive personal information (SPI). These challenges encompass ethical considerations, regulatory compliance, data governance, and the potential for unintended consequences affecting individuals and society.
This article addresses a critical issue: securing access to PII and SPI in GenAI applications. Drawing on our experience, expertise, and real-world implementations at IBM, we outline the top 10 strategies organizations should use to safeguard sensitive data in GenAI deployments. These recommendations are categorized into three key scenarios of GenAI where sensitive data is typically encountered:
A. Using PII/SPI for training and fine-tuning models
B. Accessing PII/SPI during user interactions
C. Generating PII/SPI as part of user interactions
Each scenario presents unique risks and considerations, requiring tailored approaches to ensure the secure and ethical handling of sensitive information. In this article, we will introduce 10 authorization strategies and explore how to apply these strategies across various GenAI use cases. We will also highlight best practices and provide practical examples, with a particular focus on the HR domain.
A. Using PII/SPI for training and fine-tuning models
Generative AI solutions are predominantly powered by Large Language Models (LLMs), which are sophisticated AI systems trained on massive datasets to comprehend and generate human-like language. Within enterprise contexts, it is highly uncommon to use personally identifiable information (PII) or sensitive personal information (SPI) to train a new model from scratch. Instead, organizations typically leverage pre-trained LLMs and fine-tune them using a much smaller set of domain-specific data. This approach allows the organization to customize the general-purpose LLM to perform specialized tasks while minimizing the need for extensive data.
For example, consider the following two use cases of fine-tuning LLMs for HR:
Using people data for fine-tuning requires a robust strategy to manage sensitive information. Below are the first three strategies for ensuring privacy and security for this scenario:
1/10 - Understand the Distinction Between Confidentiality and Privacy for Proper Classification
Not all sensitive data is the same, and understanding the nuances between confidentiality and privacy is critical. Confidential data, such as general HR policies or organizational processes, can often be accessed by a broader group without compromising privacy. However, personal information that pertains to a specific individual (e.g., performance reviews or sales quota attainment) needs to be handled with strict privacy controls. Misclassification can result in either unnecessary restrictions or the unintentional exposure of sensitive data. For instance, in IBM HR, most of our GenAI solutions are specifically fine-tuned using confidential information like policies and regulations, but not PI/SPI data.
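To make this concrete, below is a minimal Python sketch of how a fine-tuning corpus could be screened so that only confidential (but not personal) content is retained. The labels, the simple pattern-based detector, and the function names are illustrative assumptions, not an IBM tool.

```python
# Illustrative sketch: keep only confidential, non-personal documents (e.g.,
# policies) in a fine-tuning corpus, excluding anything tagged as or looking
# like personal information. Labels and patterns are assumptions for illustration.
import re
from dataclasses import dataclass

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # identifier-like numbers (e.g., SSN format)
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # e-mail addresses
]

@dataclass
class Document:
    doc_id: str
    text: str
    label: str  # "CONFIDENTIAL" (e.g., HR policy) or "PERSONAL" (pertains to an individual)

def looks_personal(doc: Document) -> bool:
    """Treat a document as personal if it is labeled so or matches a PII pattern."""
    return doc.label == "PERSONAL" or any(p.search(doc.text) for p in PII_PATTERNS)

def build_tuning_corpus(docs: list[Document]) -> list[Document]:
    """Keep only confidential, non-personal documents for fine-tuning."""
    return [d for d in docs if not looks_personal(d)]
```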
2/10 - Use Data Already Accessible by the Intended Audience
When fine-tuning models, consider focusing on data that is already accessible to the intended users of the solution. For example, if the model will be used only by the HR Case Management team, and that team already has access to all the data used for tuning, it is relatively safe to fine-tune the model on historical data. Conversely, if the use case is designed to support a broader audience (e.g., all HR personnel or all managers within a company), restricting access to only the relevant subset of data after the model has been fine-tuned with broader PI/SPI data becomes highly complex (if practically possible at all).
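As a sketch of this strategy, the snippet below filters a candidate tuning set down to records that every intended user is already entitled to see. The can_access callback stands in for the source system’s real entitlement check; the data shapes are hypothetical.

```python
# Illustrative sketch: keep only records that the entire intended audience of
# the GenAI solution can already access today. can_access is a placeholder for
# the source system's own entitlement check.
from typing import Callable, Iterable

def audience_safe_records(
    records: Iterable[dict],
    intended_users: list[str],
    can_access: Callable[[str, dict], bool],
) -> list[dict]:
    """Return only records every intended user is already entitled to see."""
    return [
        record for record in records
        if all(can_access(user, record) for user in intended_users)
    ]

# Usage (hypothetical): tuning data for an HR Case Management assistant
# safe = audience_safe_records(case_records, case_team_users, hr_system.can_access)
```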
3/10 – Opt for Very Coarse-Grain Anonymization to Ensure Broad Data Protection
To protect privacy, one might consider applying anonymization or data masking techniques when using people data for model fine-tuning. However, anonymization performed during model tuning needs to be done very carefully. Depending on the amount of anonymized data being passed, it may be possible to infer the identity of an individual given enough context. This problem is not new, but it is difficult to truly control once data is passed to a GenAI model. That said, use cases that rely on a high level of data abstraction, along the lines of what is used in external company reports, can be safely used for tuning.
Be careful: depending on the amount of anonymized data shared, it may be possible to infer an individual’s identity if enough context is provided
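One way to think about coarse-grain abstraction is to aggregate individual-level records up to a grain similar to external reporting before any tuning use, suppressing groups that are too small. The grouping fields and the minimum group size below are illustrative assumptions, not a prescribed threshold.

```python
# Illustrative sketch: roll individual-level records up to a coarse grain
# (e.g., business unit) and drop groups too small to be safely abstracted.
from collections import defaultdict

MIN_GROUP_SIZE = 5  # illustrative cut-off, not an IBM-prescribed value

def coarse_grain(records: list[dict], group_key: str, value_key: str) -> list[dict]:
    groups: dict[str, list[float]] = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return [
        {group_key: name, "count": len(values), "avg": sum(values) / len(values)}
        for name, values in groups.items()
        if len(values) >= MIN_GROUP_SIZE  # suppress small groups that could identify someone
    ]
```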
B. Accessing PII/SPI during user interactions
This second scenario focuses on how sensitive data is accessed during interactions between users and digital assistants powered by GenAI. In these cases, PII/SPI is not used to train or fine-tune models but is dynamically retrieved on-demand to respond to specific user queries.
Let’s consider two examples once again drawn from the HR space:
Given the risks associated with retrieving PII/SPI data during these interactions, organizations must implement stringent access controls and continuously monitor data usage. Below are three more strategies to ensure secure handling of sensitive data for these scenarios:
4/10 - GenAI is not the place to set authorization – Use existing entitlements and APIs instead
GenAI should not be used to set data access controls. Instead, utilize the entitlements and APIs of the systems where the source data is stored. The user's identity should be passed to the source system, which acts as the access decision point and fetches only the data the user is authorized to access. If a digital assistant is used to access the data, the identity of the user interacting with the agent should be propagated along with the agent's identity and used to determine access. A digital assistant may therefore have its own rights against the source data; in that case, apply least privilege between the agent and the user, so the effective access is the more restrictive of the two.
Remember: A digital assistant may have its own rights to PII/SPI data, independent of the rights held by the user operating the assistant.
This approach ensures consistent and efficient access control by implementing it as close to the source as possible. For example, IBM’s AskHR solution does not manage HR Partner identities or their access rights to PI/SPI data. Instead, the user’s identity is transmitted to WF360, IBM’s people data platform, which is responsible for managing user authorization across all reporting and analytics data needs.
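The sketch below shows the general pattern of delegating the access decision to the source system by propagating both identities on every call. The endpoint, header names, and token handling are hypothetical; this is not the AskHR or WF360 API.

```python
# Illustrative sketch: the assistant never decides access itself; it forwards
# both the end-user identity and its own identity to the source system, which
# enforces entitlements and returns only authorized data.
import requests

def fetch_people_data(query: dict, user_token: str, agent_id: str) -> dict:
    response = requests.post(
        "https://people-data.example.com/api/v1/query",  # hypothetical endpoint
        json=query,
        headers={
            "Authorization": f"Bearer {user_token}",  # identity of the person asking
            "X-Agent-Identity": agent_id,             # identity of the digital assistant
        },
        timeout=30,
    )
    response.raise_for_status()
    # The source system applies least privilege between the two identities and
    # returns only rows that both the user and the agent are entitled to see.
    return response.json()
```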
5/10 - Apply Data Masking, Especially for Summary-Level Responses
Data masking is an effective way to protect sensitive information by partially or fully obfuscating data elements based on the context. For example, in business intelligence reports, IBM employs masking techniques that display only aggregated results if a query returns data on fewer than five employees. Although masking cannot eliminate all risks, it greatly reduces the likelihood of exposing identifiable details in summary-level interactions.
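A minimal sketch of this kind of threshold-based masking is shown below; the field names are illustrative, and the five-employee cut-off simply mirrors the reporting rule described above.

```python
# Illustrative sketch: if a query resolves to fewer than five employees,
# suppress detail and return only a coarse aggregate.
SMALL_GROUP_THRESHOLD = 5

def summarize(rows: list[dict], metric: str) -> dict:
    if not rows:
        return {"count": 0}
    if len(rows) < SMALL_GROUP_THRESHOLD:
        # Too few people: expose only that the group is small, nothing identifiable.
        return {"count": f"<{SMALL_GROUP_THRESHOLD}", "values": "suppressed"}
    values = [row[metric] for row in rows]
    return {"count": len(rows), "average": sum(values) / len(values)}
```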
6/10 - Monitor and Audit Continuously as GenAI Evolves
Even with robust controls in place to secure data as described, continuous monitoring and auditing are critical to maintaining data security as GenAI systems evolve. Monitor user interactions, data access patterns, and any anomalies in real time to identify and mitigate new risks as they emerge. Record actions and implement audit trails to track who accessed what data, when, and for what purpose. This proactive approach helps detect potential misuse early and ensures ongoing compliance with organizational policies and regulations.
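As a simple illustration of such an audit trail, the sketch below records who accessed what data, when, through which assistant, and for what purpose. Writing to a local JSON-lines file is only a placeholder; in practice these events would flow to your logging or SIEM platform.

```python
# Illustrative sketch: append one audit record per data access decision.
import json
from datetime import datetime, timezone

AUDIT_LOG = "genai_data_access_audit.jsonl"  # placeholder sink for illustration

def audit_access(user_id: str, agent_id: str, resource: str, purpose: str, granted: bool) -> None:
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user_id,
        "agent": agent_id,
        "resource": resource,
        "purpose": purpose,
        "granted": granted,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(entry) + "\n")
```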
C. Generating PII/SPI as part of user interactions
The third scenario involves PII/SPI data that is shared by users during their interactions with a GenAI solution. This can occur either implicitly, where a digital assistant pulls contextual information to offer personalized experiences, or explicitly, where users input personal details directly into the conversation. Because these interactions happen dynamically, it can be highly challenging to pinpoint which PII/SPI is being shared and for what purpose. However, ensuring proper handling of such data remains critical to maintaining user trust and compliance.
To further illustrate this scenario, let's consider the following two GenAI use cases from the HR domain:
The key is ensuring the use case is approved to pull specific data. Here are another three strategies to support these use cases:
7/10 – Establish a “User–Digital Assistant Confidentiality” Agreement
Conversations between users and digital assistants may contain sensitive PII/SPI that is difficult to classify and has no long-term value to the organization. In use cases like the “Guide on Filing Concerns,” employees will only trust the assistant if they are assured that their personal information is not being stored or used beyond the immediate session. For this, consider implementing a “User–Digital Assistant Confidentiality” policy that guarantees session data, transient or stored, will be immediately deleted after the interaction ends. This approach not only reduces privacy risks but also enhances user trust in the system.
Consider this: a 'User–Digital Assistant Confidentiality' policy to ensure session data, whether transient or stored, is deleted immediately after each interaction. This reduces privacy risks and builds user trust
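The sketch below illustrates the idea of such a confidentiality agreement at the session level: conversation turns are held only in memory and wiped as soon as the interaction ends. It is a simplified assumption; a real deployment must extend the same guarantee to logs, caches, and backups.

```python
# Illustrative sketch: a session object that never persists conversation data
# and deletes everything when the interaction ends.
class ConfidentialSession:
    def __init__(self, user_id: str):
        self.user_id = user_id
        self._turns: list[dict] = []  # transient only; never written to storage

    def add_turn(self, role: str, text: str) -> None:
        self._turns.append({"role": role, "text": text})

    def close(self) -> None:
        """Delete all session data as soon as the interaction ends."""
        self._turns.clear()

# Usage (hypothetical):
# session = ConfidentialSession(user_id="u123")
# session.add_turn("user", "I want to report a concern about ...")
# ... generate assistant responses ...
# session.close()  # nothing is retained after the conversation
```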
8/10 - Apply Standard Data Handling Practices to GenAI-Generated PII/SPI
Any PII/SPI generated during user interactions with GenAI should be treated with the same rigor as sensitive data collected through non-GenAI interactions. This means applying all the standard principles: obtaining user consent, enforcing data retention policies, and ensuring data is only shared on a strict “need-to-know” basis. At IBM, for example, the “Security and Privacy by Design” framework guides all our development, ensuring that privacy and security are integrated into the product lifecycle from the ground up.
9/10 - Prepare for Incident Response
Effective incident response planning is crucial across all GenAI use cases but is particularly important for interactions involving PII/SPI. Organizations should be ready to respond quickly if there is a data breach, unauthorized access, or unexpected data exposure. This includes having designated teams, communication plans, and predefined actions to take if a user’s sensitive data is compromised. Establish clear protocols for detecting, reporting, and mitigating incidents.
As GenAI systems become deeply embedded in enterprise solutions, establishing robust data governance frameworks for handling PII and SPI is not just a technical requirement, but a strategic imperative. Effectively managing sensitive data across the three key scenarios—data used for fine-tuning, data retrieved during interactions, and data generated through interactions—ensures that organizations can safeguard user information, maintain compliance, and foster trust in AI technologies.
Throughout this article, we have outlined nine practical strategies for managing sensitive data in GenAI applications. However, before concluding, it’s important to recognize that not all GenAI use cases carry the same level of risk when it comes to PII/SPI. For this reason, we recommend one final strategy:
Final note: Not all GenAI use cases carry the same risk for PII/SPI. Choose wisely.
10/10 - Tailor Your Strategy Based on Use-Case (Risk) Assessment
When working with GenAI, it’s crucial to understand that different use cases present varying levels of risk depending on how they interact with sensitive data. For example, a chatbot providing generic career guidance may not require the same stringent data handling measures as a tool designed to support confidential employee relations issues. Privacy risks, data quality, robustness of data pipelines, and ethical considerations are key factors that should be part of a comprehensive risk assessment when selecting and designing GenAI use cases. For a deeper dive into these considerations and how to apply them, please refer to our separate article.
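To illustrate how such an assessment might be operationalized, here is a minimal scoring sketch covering the factors listed above; the factor names, weights, and thresholds are purely illustrative and not an IBM methodology.

```python
# Illustrative sketch: a coarse per-use-case risk screen over the factors
# discussed above. Scores, weights, and tiers are assumptions for illustration.
RISK_FACTORS = ["privacy_risk", "data_quality_risk", "pipeline_robustness_risk", "ethical_risk"]

def assess_use_case(scores: dict[str, int]) -> str:
    """Each factor is scored 1 (low) to 5 (high); returns a coarse risk tier."""
    total = sum(scores[factor] for factor in RISK_FACTORS)
    if total >= 16 or max(scores.values()) == 5:
        return "HIGH - apply the strictest controls (masking, approvals, audits)"
    if total >= 10:
        return "MEDIUM - apply standard PII/SPI controls"
    return "LOW - baseline controls may suffice"

# Example: a generic career-guidance chatbot would typically score far lower
# than a confidential employee-relations support tool.
```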
Thank you for your attention. We look forward to hearing which strategies resonate most with you and if there are any critical ones we may have overlooked. Please also consider joining my people data platform newsletter for monthly insights.
IMPORTANT: This article is a collaborative effort, co-authored with my colleague and friend, Grant Miller, IBM Distinguished Engineer and CTO for IBM Data Protection. Thanks, Grant, for the opportunity and the partnership!