Have You Experienced Shape-Shifting Data?

A business person is looking at a report and thinks one of the numbers is wrong. They go to the architect or the IT person responsible for the pipelines and say: "I don't think the number 3 here is correct." The architect comes back saying: "I checked the source system, and in table ABC column 123 has the number 3 in it, so it's correct."

Can a number be both correct and incorrect at the same time? Yes, it can.

It might be that the 3 is in the database but it's a value that should never have been entered (human error). The number 3 would then be technically correct, but incorrect in its business context.

It might be that the 3 is in the database but the data is time-sensitive, so the number 3 is technically correct but not correct for the person viewing it in their timezone.

It might be that the 3 is in the database and it used to be correct, but the source system logic changed in the last release, so the number 3 is technically correct but outdated, and therefore incorrect in its business context.

The only way to talk about data correctly is to share the same understanding of what it means for the number to be correct. You need an agreement about what "correct" means, and you need the attributes of that agreement recorded somewhere central (with proper ownership).

#nakyvamuutos #bringeyourdatagap

Bonus fun for Friday: can anyone guess where this picture is taken from? :)
Säde Haveri’s Post
More Relevant Posts
-
Are you thinking about building your first data warehouse at work and wondering where to start? I can suggest some strategic steps:

1. Write it down: Why do we need this? Who is the audience? How will it scale over the next 2 years? Will it support reporting/AI easily?
2. Brainstorm with prospective users (analysts, engineers, and business users) and design an architecture that fits all of them, and, most importantly, is understood by all of them.
3. Break the road ahead into small achievements, such as: a) best practices documented, b) logical model created, c) architecture reviewed by everyone, d) tech stack for quality, monitoring, and integration finalized.
4. First table created with all best practices (see the sketch below for one way this could look).
5. Access control for new users sorted out.
6. Data connected to a first report.
7. A data catalog built for the current database, or at least a monitoring tool that shows whether a table load succeeded or failed.

There may be more that I haven't covered; please drop your ideas in the comments below.
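To make step 4 concrete, here is a minimal sketch of what a "first table with all best practices" could look like. The schema, table, and column names are illustrative assumptions, not a prescription, and the syntax is warehouse-flavored (Snowflake-style); adjust types and identity syntax for your platform.

```sql
-- Illustrative first dimension table: surrogate key, business key from the
-- source system, lineage and audit columns, explicit data types.
create schema if not exists dw;

create table dw.dim_customer (
    customer_sk     bigint identity primary key,   -- surrogate key
    customer_id     varchar(50)  not null,         -- business key from the source
    customer_name   varchar(200),
    source_system   varchar(50)  not null,         -- lineage: where the row came from
    load_timestamp  timestamp    not null,         -- when the row was loaded
    record_hash     varchar(64)                    -- for change detection
);
```

The exact columns matter less than the fact that every later table follows the same documented pattern.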
-
𝗪𝗵𝗮𝘁 𝗶𝘀 𝘁𝗵𝗲 𝗱𝗶𝗳𝗳𝗲𝗿𝗲𝗻𝗰𝗲 𝗯𝗲𝘁𝘄𝗲𝗲𝗻 𝗝𝗽𝗮𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆, 𝗖𝗿𝘂𝗱𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆 𝗮𝗻𝗱 𝗟𝗶𝘀𝘁𝗖𝗿𝘂𝗱𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆, 𝗮𝗻𝗱 𝘄𝗵𝗲𝗻 𝘁𝗼 𝘂𝘀𝗲 𝗲𝗮𝗰𝗵?

It is important to choose the right repository interface for database interaction and data handling in a Spring Boot application.

𝗖𝗿𝘂𝗱𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆:
• It is the lightweight base interface for the other repository types in Spring Data.
• It is designed for simplicity and is ideal when you only need basic data access, such as saving, deleting, and finding entities by ID.
• Its collection-returning methods (e.g. findAll()) return Iterable.

𝗟𝗶𝘀𝘁𝗖𝗿𝘂𝗱𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆:
• An extension of CrudRepository that returns List instead of Iterable where applicable. Returning List lets you use the richer methods of the List interface directly.
• Most of the time we combine it with the sorting repositories (PagingAndSortingRepository) for pagination and sorting.
• Since the introduction of ListCrudRepository, the sorting repositories no longer extend the older CRUD repositories. Instead, you can extend either the new List-based interfaces or the Iterable-based interfaces alongside the sorting interfaces.
• 𝗘𝘅𝗮𝗺𝗽𝗹𝗲:

@Repository
public interface PersonPagingAndSortingRepository
        extends PagingAndSortingRepository<Person, Long>, ListCrudRepository<Person, Long> {

    List<Person> findPersonsByName(String name, Pageable pageable);
}

𝗝𝗽𝗮𝗥𝗲𝗽𝗼𝘀𝗶𝘁𝗼𝗿𝘆:
• A more powerful extension of CrudRepository and ListCrudRepository for more advanced, JPA-specific data handling.
• It is commonly used when you need pagination together with batch-style operations and control over the persistence context.
• It adds methods such as flush(), saveAllAndFlush(), and deleteAllInBatch(); deleteAllInBatch(), for example, removes entities with a single query instead of one query per row.
-
🚀 Ensuring Data Integrity After Transformation Changes 💡

Data transformation logic often changes to improve performance or to meet new requirements. But how do you know the changes didn't break your data? With 25+ years in the data warehousing field, I've found that a simple SQL approach can save hours of manual validation.

Here's a pattern using MINUS and UNION ALL that I use to compare two data sets (e.g., "dev" vs. "prd") to ensure consistency:

```
-- rows present in DEV but missing from PRD
(
    select 'DEV-PRD' as set_name, set_a.*
    from dev_edw_db.analytics_base.dim_client as set_a
    minus
    select 'DEV-PRD' as set_name, set_b.*
    from prd_edw_db.analytics_base.dim_client as set_b
)
union all
-- rows present in PRD but missing from DEV
(
    select 'PRD-DEV' as set_name, set_b.*
    from prd_edw_db.analytics_base.dim_client as set_b
    minus
    select 'PRD-DEV' as set_name, set_a.*
    from dev_edw_db.analytics_base.dim_client as set_a
);
```

The result? You quickly see mismatches, if any. Use this technique to validate your transformation jobs and keep your data integrity intact. 🔍💪

#DataWarehouse #Snowflake #SQL #DataEngineering
-
We'd like to reshare this case study: Data Integration Tool Development. In this entry, we share our experience with a client from the Midwest, who we were able to help by designing and developing an optimal solution at the lowest possible cost. https://hubs.la/Q02s20GW0 #247Digitize #DataIntegration #CaseStudy
-
Do you load data with slowly changing dimensions? We added support for Slowly Changing Dimension type 2 (SCD2) as a loading strategy; a generic sketch of what an SCD2 update does is included below. Docs: https://lnkd.in/eSxEpN2a

#DataAnalytics #Developers #SlowlyChangingDimensions #DataManagement
Incremental loading | dlt Docs
dlthub.com
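For readers new to the pattern, here is a hand-written illustration of the type 2 logic in plain SQL. This is not dlt's API or implementation, and the table and column names are made up; it only shows the shape of an SCD2 update:

```sql
-- SCD2 keeps history: a change closes the current row and opens a new one.
-- 1) Close the currently valid version of the record that changed.
update dim_customer
set valid_to = current_timestamp()
where customer_id = 42
  and valid_to is null;

-- 2) Insert the new version as the now-current row (valid_to stays null).
insert into dim_customer (customer_id, city, valid_from, valid_to)
values (42, 'Tampere', current_timestamp(), null);
```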
-
🚀 Ready to level up your data game? Data migration is your ticket to modern tools and enhanced efficiency. But beware: challenges like data loss, quality issues, and other headaches can arise.

💡 Here are some quick tips for a smooth transition:
1️⃣ Plan & prep
2️⃣ Clean your data
3️⃣ Test before you commit
4️⃣ Train your team

Need a reliable partner? MigrateMyCRM makes transferring data between CRMs seamless and secure.

#DataMigration #MigrateMyCRM #TechTransformation
https://bit.ly/3BwrBEr
12 Data Migration Challenges and Solutions for Successful Implementation | MigrateMyCRM
migratemycrm.com
-
A New Approach to Primary Keys

One of the key challenges with exchanging data between systems and aggregating it into data warehouses is that we have no unified approach to primary keys. Teams decide what they like best: some choose incremented integers, some composite keys, and some GUIDs. As a result, the same data has different primary keys in almost every system. This makes moving data from one system to another very difficult, and when you try to aggregate data from many systems you end up creating new primary keys for every record. No amount of aftermarket tools, APIs, and manual effort can resolve this situation.

Instead, we need a primary key mechanism that is efficient and lets us assign a primary key to data once and never have it change, no matter how many systems it is transferred to and regardless of who developed them.

I propose an 8-byte integer primary key generated by a special function that combines a system id with a record id to form the key (a sketch follows at the end of this post).

The SYSTEM ID identifies the system that created the record. The 8-byte id supports 1.5 million systems.

The RECORD ID uniquely identifies a record within a table for a given system. The function leverages the built-in incremented-integer capabilities of SQL databases to ensure the same number is never used twice. The eight-byte key has a capacity of a quadrillion unique values for a given system within a given table.

Advantages of This Key

COMPACT: At 8 bytes, it's small enough to be used everywhere without performance concerns.

SEQUENTIAL STORAGE: The incremented-integer component allows data to be stored sequentially, improving retrieval performance and reducing bloat.

RECORD GOVERNANCE: The inclusion of the system ID enables tracking of ownership, simplifying governance and reducing the need to store exceptions.

SIMPLIFIES THE ARCHITECTURE: Having a single consistent key column on every table simplifies complex functionality such as central audit logging, data exchange, and data aggregation.

USER FRIENDLY: It's simple for users and developers to reference (e.g., 123-1112).

CONCLUSION

This is the first of four principles necessary for easily exchanging data between systems and automatically aggregating it into a data warehouse. For more information on the principles needed to achieve this, please visit www.3d-ess.com/principles.
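For illustration only, here is one way such a key could be minted in SQL. The post does not spell out how the system id and record id are packed into the 8 bytes, so the decimal split below (system id in the upper digits, a sequence-driven record id in the lower twelve digits), the table name, and the sequence name are all assumptions, written in Snowflake/Oracle-style sequence syntax:

```sql
-- Hypothetical composite key: key = system_id * 10^12 + per-table record id.
create sequence client_record_seq;

create table client (
    client_key   bigint not null primary key,   -- 8-byte key carrying system id + record id
    client_name  varchar(200)
);

-- System 123 creating a record; the sequence guarantees the record id is never reused.
insert into client (client_key, client_name)
select 123 * 1000000000000 + client_record_seq.nextval, 'Acme Corp';
```

Under this assumed split, a user-facing reference like 123-1112 maps directly to the two halves of the stored integer.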
-
Managing data schema evolution in large datasets is becoming an essential skill. But how can you handle schema changes effectively? ⬇️

In large datasets, schemas (the structure of your data) often need to evolve over time as new requirements emerge. Without proper management, these changes can lead to inconsistencies, errors, and even data loss. Here's how to manage schema evolution effectively:

1️⃣ Version control
Put your schemas under version control to track changes and roll back if needed, so you always have a clear history of modifications.

2️⃣ Backward compatibility
Design schema changes to be backward compatible, allowing older data to coexist with new data without issues (see the sketch after this list).

3️⃣ Automated migration
Set up automated tools to migrate data to the new schema format, reducing the risk of errors and saving time.

4️⃣ Validation checks
Regularly run validation checks to ensure that schema changes do not introduce errors or inconsistencies in your data.

5️⃣ Communication
Keep all stakeholders informed about schema changes so that everyone understands the impact and can adjust their workflows accordingly.

💡 By managing schema evolution effectively, you keep your data robust and reliable even as your datasets grow and change.

#DataSchema #DataEngineering #BigData #SchemaEvolution #DataManagement #DatabaseDesign #TechLeadership
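As a purely illustrative example of points 2 and 4: a backward-compatible change usually adds rather than rewrites, so older rows and loads that are unaware of the change keep working. The table, column, and allowed values below are made up, and the SQL is generic warehouse syntax:

```sql
-- Backward-compatible change: add a new column with a constant default,
-- so existing rows and loads that do not know about the column keep working.
alter table analytics.dim_customer
    add column loyalty_tier varchar(20) default 'UNKNOWN';

-- Validation check: count rows that violate the expectation for the new column,
-- so the pipeline can fail fast instead of publishing inconsistent data.
select count(*) as invalid_rows
from analytics.dim_customer
where loyalty_tier is null
   or loyalty_tier not in ('UNKNOWN', 'BRONZE', 'SILVER', 'GOLD');
```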
-
I've really appreciated Andrew Jones's perspective on the cost of poor data quality. It made me think about two important aspects that I would like to share:

1️⃣ The "cost" of remediation already assumes that the problem caused by poor quality has been identified, and that is far from trivial: often we don't even realize we have a problem, or finding the cause can be as complex as fixing it, if not more so. The cost of remediation is often just the tip of the iceberg.

2️⃣ Some consequences of not properly addressing data quality are impossible to price, such as the loss of reputation with customers. In other words, there is not just a cost to absorb in order to return to a healthy situation; some consequences can be permanent.

These aspects underline the importance of proactive data quality management. 💼📊

#dataquality #dataproducts #datacontracts #datamanagement
Poor data quality has a cost 💸

The 1:10:100 rule, developed by George Labovitz and Yu Sang Chang back in 1992, is a great way to understand those costs. It states that:
- The cost of preventing poor data quality at source is $1 per record
- The cost of remediation after it is created is $10 per record
- The cost of failure (i.e. doing nothing) is $100 per record

So, the earlier you deal with poor data quality, the cheaper it is.

Tools and techniques such as data observability and data contracts can help you catch data quality issues earlier. Check out my article on Medium for more 👇
https://lnkd.in/eUJ9tHeb

#DataQuality #DataObservability #DataContracts
𝗧𝗵𝗲 𝟭:𝟭𝟬:𝟭𝟬𝟬 𝗿𝘂𝗹𝗲 𝗼𝗳 𝗱𝗮𝘁𝗮 𝗾𝘂𝗮𝗹𝗶𝘁𝘆
andrew-jones.medium.com