Data management is the practice of collecting, processing and using data securely and efficiently for better business outcomes.
72% of top-performing CEOs agree that competitive advantage depends on who has the most advanced generative AI. However, to take advantage of artificial intelligence (AI), organizations must first organize their information architecture to make their data accessible and usable. Fundamental data management challenges include data volumes and data silos spread across multiple locations and cloud providers. New data types and formats, such as documents, images and videos, also present challenges. In addition, complex and inconsistent datasets can limit an organization’s ability to use data for AI.
As a result, an effective data management strategy has become an increasing priority for organizations that need to address the challenges presented by big data. A flexible, modern data management system integrates with an organization’s existing technology to provide high-quality, usable data to data scientists, AI and machine learning (ML) engineers, and the organization’s business users.
A complete data management strategy accounts for various factors, including how to:
While the data management tools for constructing generative AI applications are widely available, the data itself holds the value for both customers and businesses. High volumes of quality data must be properly organized and processed to successfully train models. This approach is a rapidly growing use case for modern data management.
For example, a generative AI-driven commentary was offered during The Championships 2023 at Wimbledon, which accessed information from 130 million documents and 2.7 million pertinent contextual data points in real time. Visitors using the tournament app or website were able to access complete statistics, play-by-play narration and game commentary, as well as a precise prediction of the winner at any moment as matches progressed. Having the correct data management strategy can help ensure that valuable data is always available, integrated, governed, secure and accurate.
Generative AI can give organizations a strong competitive advantage, with their AI strategy relying on the strength of the data that’s used. Many organizations still struggle with fundamental data challenges that are exacerbated by the demand for generative AI, which requires ever more data—leading to yet more data management headaches.
Data might be stored in multiple locations, applications and clouds, often leading to isolated data silos. To add even more complexity, the uses of data have become more varied, with data in varying and complex forms—such as images, videos, documents and audio. More time is required for data cleaning, integration and preparation. These challenges can lead organizations to avoid using their full data estate for analytics and AI purposes.
However, organizations equipped with modern tools for data architecture, governance and security can use their data to gain new insights and make more precise predictions consistently. This capability can enable a deeper understanding of customer preferences and can enhance customer experiences (CX) by delivering insights derived from data analysis. Moreover, it facilitates the development of innovative data-driven business models, such as service offerings reliant on generative AI, which need a foundation of high-quality data for model training.
Data and analytics leaders face major challenges when transforming their organizations due to the increasing complexity of the data landscape across hybrid cloud deployments. Generative AI and AI assistants, machine learning (ML), advanced analytics, Internet of Things (IoT), and automation also all require huge volumes of data to work effectively. This data needs to be stored, integrated, governed, transformed and prepared to form the right data foundation. For AI, that foundation must be open and trusted, which means creating a data management strategy that is centered on openness, trust and collaboration.
The AI requirement was summed up by a Gartner® analyst¹: “AI-ready data means that your data must be representative of the use case, including all patterns, errors, outliers and unexpected emergence that is needed to train or run the AI model for the specific use.”
Data and analytics executives might feel that AI-ready data equals high-quality data, but the standards of high-quality data for purposes other than AI do not necessarily meet the standard for AI readiness. In the realm of analytics, for instance, data is typically refined to eliminate outliers or conform to human expectations. An algorithm being trained, however, needs representative data, including the outliers that an analytics workflow would remove.
Data governance is a subset of data management. This means that when a data governance team identifies commonalities across disparate datasets and wants to integrate them, it needs to partner with a database architecture or engineering team to define the data model and data architecture that facilitate linkages and data flows. Another example pertains to data access. A data governance team might set the policies around access to specific types of data, such as personally identifiable information (PII). Meanwhile, a data management team would either provide direct access or put a mechanism in place to provide access, such as adjusting internally defined user roles to approve access.
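The access example above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: the role names, the column classifications and the `readable_view` helper are all hypothetical. The policy table stands in for what a governance team would define, and the function stands in for the mechanism a data management team would put in place.

```python
# Policy (assumed to be set by governance): classifications each role may read.
POLICY = {
    "analyst": {"public", "internal"},
    "privacy_officer": {"public", "internal", "pii"},
}

# Classification of each column (hypothetical example columns).
COLUMN_CLASS = {"order_id": "internal", "country": "public", "email": "pii"}


def readable_view(row, role):
    """Return only the columns of `row` that `role` is allowed to read."""
    allowed = POLICY.get(role, set())  # unknown roles get no access
    # Columns without a classification are treated as PII (deny by default).
    return {
        col: val
        for col, val in row.items()
        if COLUMN_CLASS.get(col, "pii") in allowed
    }


row = {"order_id": "A-1", "country": "US", "email": "ada@example.com"}
analyst_view = readable_view(row, "analyst")
officer_view = readable_view(row, "privacy_officer")
```

Here `analyst_view` omits the `email` column, while `officer_view` contains the full row. Granting an analyst access to PII would be a change to the policy table rather than to the code, which mirrors the split between setting policy and enforcing it.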
Effective data management, including robust data governance practices, can help with adhering to regulatory compliance. This compliance encompasses both national and global data privacy regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), along with industry-specific privacy and security standards. Establishing comprehensive data management policies and procedures is crucial for demonstrating these protections and for passing the audits that validate them.
Modern data management solutions provide an efficient way to manage data and metadata across diverse datasets. Modern systems are built with the latest data management software and reliable databases or data stores. This can include transactional data lakes, data warehouses or data lakehouses, combined with a data fabric architecture including data ingestion, governance, lineage, observability and master data management. Together, this trusted data foundation can feed quality data to data consumers as data products, business intelligence (BI) and dashboarding, and AI models—both traditional ML and generative AI.
A strong data management strategy typically includes multiple components to streamline strategy and operations throughout an organization.
While data can be stored before or after data processing, the type of data and its purpose will usually dictate the storage repository that is used. Relational databases organize data into a tabular format, whereas nonrelational databases do not have as rigid a database schema.
Relational databases are also typically associated with transactional databases, which run commands or transactions collectively. An example is a bank transfer. A defined amount is withdrawn from one account and then it is deposited within another. But for enterprises to support both structured and unstructured data types, they require purpose-built databases. These databases must also cater to various use cases across analytics, AI and applications. They must span both relational and nonrelational databases, such as key-value, document, wide-column, graph and in-memory. These multimodal databases provide native support for different types of data and the latest development models, and can run many kinds of workloads, including IoT, analytics, ML and AI.
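The bank transfer can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration rather than a production design: the `accounts` table, the account names and the `transfer` helper are assumptions made for the example. What it shows is atomicity: both updates are committed together, or neither is.

```python
import sqlite3


def transfer(conn, source, target, amount):
    """Move `amount` between two accounts as one atomic transaction."""
    # `with conn` commits if the block succeeds and rolls back on an exception.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, source),
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, target),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (source,)
        ).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

transfer(conn, "alice", "bob", 30)
```

If a transfer would overdraw the source account, the raised exception makes the connection roll back both updates, so the deposit never appears without the matching withdrawal.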
Data management best practices suggest that data warehousing be optimized for high-performance analytics on structured data. This requires a defined schema to meet specific data analytics requirements for specific use cases, such as dashboards, data visualization and other business intelligence tasks. These data requirements are usually directed and documented by business users in partnership with data engineers, who will ultimately run queries against the defined data model.
The underlying structure of a data warehouse is typically organized as a relational system that uses a structured data format, sourcing data from transactional databases. Data lakes, by contrast, incorporate data from both relational and nonrelational systems, which lets them hold unstructured and semistructured data as well. Data lakes are often preferred to the other storage options because they are normally a low-cost storage environment that can house petabytes of raw data.
Data lakes benefit data scientists in particular, as they enable them to incorporate both structured and unstructured data into their data science projects. However, data warehouses and data lakes have their own limitations. Proprietary data formats and high storage costs limit AI and ML model collaboration and deployments within a data warehouse environment.
In contrast, data lakes are challenged with extracting insights directly in a governed and performant manner. An open data lakehouse addresses these limitations by handling multiple open formats over cloud object storage and combines data from multiple sources, including existing repositories, to ultimately enable analytics and AI at scale.
Multicloud and hybrid strategies are steadily becoming more popular. AI technologies are powered by massive amounts of data that require modern data stores that reside on cloud-native architectures to provide scalability, cost optimization, enhanced performance and business continuity. According to Gartner², by the end of 2026, "90% of data management tools and platforms that fail to support multi-cloud and hybrid capabilities will be set for decommissioning."
While existing tools aid database administrators (DBAs) in automating numerous conventional management duties, manual involvement remains necessary due to the typically large and intricate nature of database setups. Whenever manual intervention becomes necessary, the likelihood of errors rises. Minimizing the necessity for manual data management stands as a primary goal in operating databases as fully managed services.
Fully managed cloud databases automate time-consuming tasks such as upgrades, backups, patching and maintenance. This approach helps free DBAs from time-consuming manual tasks to spend more time on valuable tasks such as schema optimization, new cloud-native apps and support for new AI use cases. Unlike on-premises deployments, cloud storage providers also enable users to spin up large clusters as needed, often requiring only payment for the storage specified. This means that if an organization needs more compute power to run a job in a few hours (versus a few days), it can do this on a cloud platform by purchasing more compute nodes.
This shift to cloud data platforms is also facilitating the adoption of streaming data processing. Tools such as Apache Kafka enable more real-time data processing, so that consumers can subscribe to topics to receive data in a matter of seconds. However, batch processing still has its advantages, as it’s more efficient at processing large volumes of data. Because batch processing follows a set schedule, such as daily, weekly or monthly, it is ideal for business performance dashboards, which typically do not require real-time data.
More recently, data fabrics have emerged to assist with the complexity of managing these data systems. Data fabrics use intelligent and automated systems to facilitate end-to-end integration of data pipelines and cloud environments. A data fabric also simplifies delivery of quality data and provides a framework for enforcing data governance policies to help ensure that the data used is compliant. This facilitates self-service access to trustworthy data products by connecting to data residing across organizational silos, so that business leaders gain a more holistic view of business performance. The unification of data across HR, marketing, sales, supply chain and other functions gives leaders a better understanding of their customers.
A data mesh might also be useful. Whereas a data fabric is an architecture that facilitates end-to-end integration, a data mesh is a decentralized data architecture that organizes data by business domain, such as marketing, sales or customer service. This approach gives the producers of a dataset more ownership.
Within this stage of the data management lifecycle, raw data is ingested from a range of data sources, such as web APIs, mobile apps, Internet of Things (IoT) devices, forms, surveys and more. After data collection, the data is usually processed or loaded by using data integration techniques, such as extract, transform, load (ETL) or extract, load, transform (ELT). While ETL has historically been the standard method to integrate and organize data across different datasets, ELT has been growing in popularity with the emergence of cloud data platforms and the increasing demand for real-time data.
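The difference between ETL and ELT is mostly a question of where the transformation runs. The sketch below is a toy illustration using Python's built-in `sqlite3` module as a stand-in for a data platform. The sample orders, table names and cleaning rules are assumptions made for the example. In the ETL path the rows are cleaned in application code before loading. In the ELT path the raw rows are loaded first and then transformed with SQL inside the database.

```python
import sqlite3

# Hypothetical raw extract: (order id, amount as text, country code).
RAW_ORDERS = [("A-1", " 19.99 ", "us"), ("A-2", "5.00", "DE"), ("A-3", "n/a", "us")]


def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(conn):
    """ETL: transform in application code, then load only the cleaned rows."""
    cleaned = [
        (order_id, float(amount), country.upper())
        for order_id, amount, country in RAW_ORDERS
        if is_number(amount)
    ]
    conn.execute("CREATE TABLE orders_etl (order_id TEXT, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the raw rows as-is, then transform inside the database."""
    conn.execute("CREATE TABLE orders_raw (order_id TEXT, amount TEXT, country TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT order_id,
               CAST(TRIM(amount) AS REAL) AS amount,
               UPPER(country) AS country
        FROM orders_raw
        WHERE TRIM(amount) GLOB '[0-9]*'
        """
    )


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
```

Both paths produce the same cleaned table here. The ELT path also keeps `orders_raw`, so the transformation can be rerun or changed later without extracting the data again.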
Besides batch processing, data replication is another method of integrating data. It consists of synchronizing data from a source location to one or more target locations, helping ensure data availability, reliability and resilience. Technology such as change data capture (CDC) uses log-based replication to capture changes made to data at the source and propagate those changes to target systems, helping organizations make decisions based on current information.
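The idea behind log-based replication can be shown with a simplified model. The sketch below does not use any real CDC product or database log format: the `Change` record, its fields and the `apply_changes` helper are assumptions made for illustration. Each change carries a sequence number, and the target remembers the last sequence number it applied, so that replaying the log does not apply a change twice.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    seq: int                    # position in the source's change log
    op: str                     # "insert", "update" or "delete"
    key: str                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents (None for deletes)


def apply_changes(target, log, last_seq=0):
    """Replay entries newer than `last_seq` onto `target`; return the new offset."""
    for change in sorted(log, key=lambda c: c.seq):
        if change.seq <= last_seq:
            continue  # already applied, so replaying the log is safe
        if change.op == "delete":
            target.pop(change.key, None)
        else:
            target[change.key] = dict(change.row)
        last_seq = change.seq
    return last_seq


log = [
    Change(1, "insert", "c1", {"name": "Ada", "tier": "basic"}),
    Change(2, "insert", "c2", {"name": "Lin", "tier": "basic"}),
    Change(3, "update", "c1", {"name": "Ada", "tier": "gold"}),
    Change(4, "delete", "c2"),
]

replica = {}
offset = apply_changes(replica, log[:2])               # first sync
offset = apply_changes(replica, log, last_seq=offset)  # later sync, full log
```

After the second call the replica holds only the updated `c1` row, and only changes 3 and 4 were applied during that call because the stored offset skipped the first two.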
Independently of the data integration technique used, the data is usually filtered, merged or aggregated during the data processing stage to meet the requirements for its intended purpose. These applications can range from a business intelligence dashboard to a predictive machine learning algorithm.
Using continuous integration and continuous deployment (CI/CD) for version control can enable data teams to track changes to their code and data assets. Version control enables data teams to collaborate more effectively, as they can work on different parts of a project simultaneously and merge their changes without conflicts.
Data governance promotes the availability and usage of data. To help ensure compliance, governance generally includes processes, policies and tools around data quality, data access, usability and data security. For instance, data governance councils tend to align taxonomies to help ensure that metadata is added consistently across various data sources. A taxonomy can also be further documented through a data catalog to make the data more accessible to users, facilitating data democratization across an organization.
Enriching data with the right business context is critical for the automated enforcement of data governance policies and data quality. This is where service level agreement (SLA) rules come into effect, helping ensure that data is protected and of the required quality. It is also important to understand the provenance of the data and gain transparency into the journey of the data as it moves through pipelines. This calls for robust data lineage capabilities to drive visibility as organizational data makes its way from data sources to the end users. Data governance teams also define roles and responsibilities to help ensure that data access is provided appropriately. This controlled access is particularly important to maintain data privacy.
Data security sets guardrails in place to protect digital information from unauthorized access, corruption or theft. As digital technology becomes an increasing part of our lives, more scrutiny is placed upon the security practices of modern businesses. This scrutiny is important to help protect customer data from cybercriminals or to help prevent incidents that need disaster recovery. While data loss can be devastating to any business, data breaches, in particular, can result in costly consequences from both a financial and brand standpoint. Data security teams can better secure their data by using encryption and data masking within their data security strategy.
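Data masking can be illustrated with the Python standard library alone. The helpers below are a sketch under assumptions: the function names and the masking rules (keep the first character of an email, keep the last four digits of a card number) are chosen for the example and are not a standard. Encryption is deliberately left out, because it should rely on a vetted cryptography library rather than hand-written code. The keyed hash shown is pseudonymization, not encryption.

```python
import hashlib
import hmac


def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}{'*' * max(len(local) - 1, 0)}@{domain}"


def mask_card(number):
    """Show only the last four digits of a card number."""
    digits = [ch for ch in number if ch.isdigit()]
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymize(value, secret):
    """Replace a value with a keyed hash so records stay joinable but unreadable."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


masked_email = mask_email("jane.doe@example.com")
masked_card = mask_card("4111 1111 1111 1111")
token = pseudonymize("jane.doe@example.com", secret=b"example-key-only")
```

Masking suits displays and test datasets, where a human should not see the full value. The keyed hash suits analytics, because the same input always yields the same token and tables can still be joined on it.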
Data observability refers to the practice of monitoring, managing and maintaining data in a way that helps ensure its quality, availability and reliability across various processes, systems and pipelines within an organization. Data observability is about truly understanding the health of an organization’s data and its state across a data ecosystem. It includes various activities that go beyond traditional monitoring, which only describes a problem. Data observability can help identify, troubleshoot and resolve data issues in near-real time.
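A few of the checks that observability tooling automates can be sketched directly. The example below is a simplified illustration: the `check_batch` helper, its thresholds and the `updated_at` field are assumptions made for the example, and real platforms track many more signals. It covers three common ones: volume, freshness and completeness.

```python
from datetime import datetime, timedelta, timezone


def check_batch(rows, now, max_age=timedelta(hours=24), min_rows=1, max_null_rate=0.1):
    """Return a list of human-readable issues found in one batch of records."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"volume: {len(rows)} rows, expected at least {min_rows}")
    if not rows:
        return issues

    # Freshness: how old is the most recently updated record?
    newest = max(row["updated_at"] for row in rows)
    if now - newest > max_age:
        issues.append(f"freshness: newest record is {now - newest} old")

    # Completeness: share of missing values per column.
    columns = sorted({name for row in rows for name in row})
    for name in columns:
        null_rate = sum(row.get(name) is None for row in rows) / len(rows)
        if null_rate > max_null_rate:
            issues.append(f"completeness: {name} is {null_rate:.0%} null")
    return issues


now = datetime(2024, 6, 14, 12, 0, tzinfo=timezone.utc)
stale_batch = [
    {"id": 1, "email": "a@example.com", "updated_at": now - timedelta(days=2)},
    {"id": 2, "email": None, "updated_at": now - timedelta(days=3)},
]
issues = check_batch(stale_batch, now)
```

For `stale_batch` the function reports a freshness issue and a completeness issue for the `email` column. An empty list means that none of the configured thresholds were exceeded, which is not the same as the data being correct.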
Master data management (MDM) focuses on the creation of a single, high-quality view of core business entities including products, customers, employees and suppliers. By delivering accurate views of master data and their relationships, MDM enables faster insights, improved data quality and compliance readiness. With a single 360-degree view of master data across the enterprise, MDM gives businesses the right data to drive business analytics and to determine their most successful products and markets and their highest-valued customers.
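The "single view" idea can be illustrated by merging duplicate customer records into one consolidated record. The sketch below is a toy example: the matching rule (normalized email address), the survivorship rule (the most recent non-empty value wins) and all field names are assumptions made for illustration. Real MDM systems use far more sophisticated matching.

```python
def build_golden_records(records):
    """Merge records that share a normalized email into one record each."""
    golden = {}
    # Process oldest first, so newer non-empty values overwrite older ones.
    for record in sorted(records, key=lambda r: r["updated"]):
        key = record["email"].strip().lower()
        merged = golden.setdefault(key, {"email": key, "sources": []})
        merged["sources"].append(record["source"])
        for field in ("name", "phone"):
            if record.get(field):  # skip empty strings and None
                merged[field] = record[field]
    return golden


records = [
    {"source": "crm", "email": "Ada@Example.com ", "name": "Ada L.",
     "phone": "", "updated": "2024-01-10"},
    {"source": "billing", "email": "ada@example.com", "name": "Ada Lovelace",
     "phone": None, "updated": "2024-03-02"},
    {"source": "support", "email": "ada@example.com", "name": "",
     "phone": "555-0100", "updated": "2024-02-15"},
]
golden = build_golden_records(records)
```

The three source records collapse into one entry keyed by `ada@example.com`. It takes the name from the newest record that has one and the phone number from the only record that has one, and it lists every contributing system under `sources`.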
Organizations experience multiple benefits when starting and maintaining data management initiatives.
Many companies inadvertently create data silos within their organization. Modern data management tools and frameworks, such as data fabrics and data lakes, help to eliminate data silos and dependencies on data owners. For instance, data fabrics help reveal potential integrations across disparate datasets from functions such as human resources, marketing and sales. Data lakes, in turn, ingest raw data from those same functions, removing dependencies on any single owner of a dataset.
Governance councils assist in placing guardrails to protect businesses from fines and negative publicity that can occur due to noncompliance with government regulations and policies. Missteps here can be costly from both a brand and financial perspective.
While this benefit might not be immediately seen, successful proofs of concept can improve the overall user experience, enabling teams to better understand and personalize the customer journey through more holistic analyses.
Data management can help businesses scale, but this largely depends on the technology and processes in place. For example, cloud platforms enable greater flexibility, so that data owners can scale up or scale down their compute power as needed.
Over the last decade, developments within hybrid cloud, artificial intelligence, the Internet of Things (IoT) and edge computing have led to the exponential growth of big data, creating even more complexity for enterprises to manage. New components continue to improve data management capabilities. Here are some of the latest:
To further boost data management capabilities, augmented data management is becoming increasingly popular. This is a branch of augmented intelligence, powered by cognitive technologies, which include AI, ML, data automation, data fabric and data mesh. The benefits of this automation include enabling data owners to create data products such as catalogs of data assets, with the ability to search and find data products, and query visuals and data products by using APIs. In addition, insights from data fabric metadata can help automate tasks by learning from patterns as part of the data product creation process or as part of the data management process of monitoring data products.
A data store for generative AI such as IBM® watsonx.data™ can help organizations unify, curate and prepare data efficiently for AI models and applications. Integrated and vectorized embedding capabilities enable retrieval-augmented generation (RAG) use cases at scale across large sets of trusted, governed data.
A hybrid cloud deployment can simplify application connectivity and security across platforms, clusters and clouds. Applications can be easily deployed and moved between environments because containers and object storage have made computing and data portable.
To accelerate data access and unlock new data insights without SQL, organizations are creating an embeddable, AI-powered semantic layer. This is a metadata and abstraction layer that is built onto the organization’s source data, such as a data lake or warehouse. The metadata can enrich the data model being used and also be sufficiently clear for business users to understand.
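A semantic layer can be pictured as a mapping from business terms to physical SQL. The sketch below is a simplified illustration and leaves out the AI-powered part. The `SEMANTIC_MODEL` structure, the `fact_sales` table and the `compile_query` helper are assumptions made for the example, with Python's built-in `sqlite3` standing in for a lake or warehouse. A business user asks for a metric by a dimension and never sees the underlying column names.

```python
import sqlite3

# Hypothetical semantic model: business terms mapped to physical SQL.
SEMANTIC_MODEL = {
    "table": "fact_sales",
    "metrics": {
        "revenue": "SUM(qty * unit_price)",
        "orders": "COUNT(DISTINCT order_id)",
    },
    "dimensions": {"region": "region_cd"},
}


def compile_query(metric, dimension, model=SEMANTIC_MODEL):
    """Translate 'metric by dimension' into SQL using the semantic model."""
    # Only terms defined in the model are accepted; others raise KeyError.
    metric_sql = model["metrics"][metric]
    dim_sql = model["dimensions"][dimension]
    return (
        f"SELECT {dim_sql} AS {dimension}, {metric_sql} AS {metric} "
        f"FROM {model['table']} GROUP BY {dim_sql} ORDER BY {dim_sql}"
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (order_id TEXT, region_cd TEXT, qty INTEGER, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [("o1", "EU", 2, 10.0), ("o2", "EU", 1, 5.0), ("o3", "US", 3, 4.0)],
)
revenue_by_region = conn.execute(compile_query("revenue", "region")).fetchall()
```

Because every metric is defined once in the model, "revenue" means the same calculation in every dashboard or application that goes through the layer.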
Organizations can access data across a hybrid cloud by connecting storage and analytics environments. This access can be through a single point of entry with a shared metadata layer across clouds and on-premises environments. Multiple query engines can be used to optimize analytics and AI workloads.
Creating a shared metadata layer in a data lakehouse to catalog and share data is a best practice. This speeds discovery and enrichment, the analysis of data across multiple sources, and the running of multiple workloads and use cases.
In addition, a shared metadata management tool speeds the management of objects in a shared repository. It can be used to add a new host system, add a new database or data file, or add a new schema, in addition to deleting items from a shared repository.
1 Wire19.com: “Ways to ensure that your data is AI-ready”, 14 June 2024
2 Gartner: “Strategic Roadmap for Migrating Data Management Solutions to the Cloud”, 27 September 2023