Does more data mean more information?
Just ~10% of the data can preserve information from multi-modal sensors
Colloquially, more data means more information. However, there is growing experimental and theoretical evidence against this, especially for real-world sensor data. The reason is redundancy: such data contains low-rank structure that can be exploited to represent the raw data efficiently without losing information.
Let’s look at two examples:
(i) Learning from all the raw data:
In the following figure, we are looking at the raw data (x) of a time-series signal. This raw data can be transformed into a latent representation (s) in the Fourier or some other domain, where the signal can be represented with only a handful of coefficients. The latent representation ‘s’ can then be used to recover the entire signal ‘x’ using computational techniques such as the inverse Fourier transform.
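To make this concrete, here is a minimal sketch in Python (NumPy), using a synthetic two-tone signal as an illustrative assumption: a 1000-sample signal that is sparse in the Fourier domain is described by just a handful of coefficients, and the inverse FFT recovers the full signal from them.

```python
import numpy as np

N = 1000
t = np.arange(N) / N
# Raw data x: two sinusoids, so its Fourier representation is sparse.
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Latent representation s: the Fourier coefficients of x.
s = np.fft.fft(x)

# Only a handful of the 1000 coefficients carry energy; keep the largest ones.
k = 10
keep = np.argsort(np.abs(s))[-k:]
s_sparse = np.zeros_like(s)
s_sparse[keep] = s[keep]

# Recover the full 1000-sample signal from the few retained coefficients.
x_hat = np.real(np.fft.ifft(s_sparse))
print("max reconstruction error:", np.max(np.abs(x - x_hat)))  # should be near machine precision here
```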
Important observation #1: Transforming a 1000x1 array into a latent representation requires at least 1000 operations, since every raw sample must be touched (note the size of the red funnel).
Now, let’s test whether just 10% of the raw data is enough to preserve the full signal ‘x’. For this, consider the following example:
(ii) Learning from a fraction of raw data:
In this case, we undersample ‘x’ to create a new signal ‘y’. One way to proceed is to recover ‘s’ from ‘y’, which can then be used to reconstruct the full signal ‘x’ via the inverse Fourier transform. In the following figure, we specifically use compressed sensing, which exploits signal sparsity to recover ‘x’ from ‘y’ through an iterative optimization technique. Other data-driven approaches can also be formulated to leverage sparsity and recover the raw data from undersampled data.
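As an illustration of this idea (not Lightscline’s actual method), here is a minimal compressed-sensing sketch in Python using NumPy/SciPy. It builds a signal that is sparse in a DCT basis, keeps only ~10% of its samples at random locations, and recovers the full signal with the iterative soft-thresholding algorithm (ISTA) plus a least-squares refit. The signal, sampling fraction, sparsity level, and regularization weight are all illustrative assumptions.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
N, m, k = 1000, 100, 8          # signal length, kept samples (~10%), sparsity level

# Ground-truth signal x: exactly k nonzero coefficients in the DCT basis.
s_true = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
s_true[support] = rng.uniform(0.5, 2.0, size=k) * rng.choice([-1.0, 1.0], size=k)
Psi = idct(np.eye(N), norm="ortho", axis=0)   # columns are DCT basis vectors, so x = Psi @ s
x = Psi @ s_true

# Undersampled measurements y: ~10% of the raw samples at random time locations.
idx = np.sort(rng.choice(N, size=m, replace=False))
y = x[idx]
A = Psi[idx, :]                               # sensing matrix: y = A @ s_true

# ISTA: iteratively minimize 0.5*||A s - y||^2 + lam*||s||_1 to find a sparse s.
lam, n_iters = 0.01, 2000
s_hat = np.zeros(N)
for _ in range(n_iters):
    z = s_hat - A.T @ (A @ s_hat - y)         # gradient step (step size 1, since ||A||_2 <= 1)
    s_hat = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-thresholding step

# Debias: least-squares refit on the recovered support, then reconstruct the full signal.
supp = np.abs(s_hat) > 1e-3
s_hat[supp], *_ = np.linalg.lstsq(A[:, supp], y, rcond=None)
s_hat[~supp] = 0.0
x_hat = Psi @ s_hat
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

In practice, the sampling pattern, sparsifying basis, and solver would all be tuned to the sensor and the task; the point is simply that ~100 samples, plus knowledge of the structure in the data, are enough to reproduce all 1000.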
Important observation #2: Collecting just 1% or 10% of the raw data requires only 10 or 100 operations (note the size of the red tubes, much smaller than the funnel above). This alone saves us from collecting 90-99% of the data, without losing any information, since ‘x’ can still be recovered from ‘y’.
In both approaches, the signal information is preserved. When learning from 100% of the data, all the raw samples are needed to create the latent representation. When learning from a fraction of the data, we obtain the same latent representation and recover the signal from just 1-10% of the raw samples.
In other words, the large funnel carrying 100% of the data holds the same information as the three small tubes carrying just 10% of it. The tubes, however, require roughly 10x less processing power and time to handle!
If the structure in the data is exploited, the information can be preserved with 1-10% of the raw data, saving 90-99% of the data and the associated pre-processing needed to build a latent representation. In such cases, more data does not mean more information: beyond a certain point, the additional raw data is redundant, and collecting it does not lead to better representations.
There are several advantages to getting the same information from a small fraction of the raw data: reduced streaming power, transmission bandwidth, cloud storage and compute, and human capital requirements. By leveraging this counterintuitive insight, we can build lightweight analytics techniques that preserve information with a 10-100x smaller compute, power, and network footprint. This is central to our mission at Lightscline, where we are leveraging AI to automatically identify the ~10% of data that matters for tasks like anomaly detection and classification. You can learn more about us here.