Does more data mean more information?
Just ~10% of the data can preserve information from multi-modal sensors
Colloquially, more data means more information. However, there is growing experimental and theoretical evidence against this, especially for real-world sensor data. The reason is redundancy: such data contains low-rank structure that can be exploited to represent the raw data efficiently without losing information.
Let’s look at two examples:
(i) Learning from all the raw data:
In the following figure, we are looking at the raw data (x) of a time-series signal. This raw data can be transformed into a latent representation (s) in the Fourier or some other domain, where the signal can be represented with only a handful of coefficients. The latent representation ‘s’ can then be used to recover the entire signal ‘x’ using computational techniques such as the inverse Fourier transform.
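To make this concrete, here is a minimal sketch in Python (NumPy), using a synthetic two-tone signal as an illustrative assumption: a 1000-sample signal that is sparse in the Fourier domain is described by just a handful of coefficients, and the inverse FFT recovers the full signal from them.

```python
import numpy as np

N = 1000
t = np.arange(N) / N
# Raw data x: two sinusoids, so its Fourier representation is sparse.
x = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

# Latent representation s: the Fourier coefficients of x.
s = np.fft.fft(x)

# Only a handful of the 1000 coefficients carry energy; keep the largest ones.
k = 10
keep = np.argsort(np.abs(s))[-k:]
s_sparse = np.zeros_like(s)
s_sparse[keep] = s[keep]

# Recover the full 1000-sample signal from the few retained coefficients.
x_hat = np.real(np.fft.ifft(s_sparse))
print("max reconstruction error:", np.max(np.abs(x - x_hat)))  # should be near machine precision here
```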
Important observation #1: Transforming a 1000x1 array into a latent representation requires at least 1000 operations, since every raw sample must be touched (note the size of the red funnel).
Now, let’s test whether just 10% of the raw data is enough to preserve the full signal ‘x’. For this, consider the following example:
(ii) Learning from a fraction of raw data:
In this case, we undersample ‘x’ to create a new signal ‘y’. One way to proceed is to recover ‘s’ from ‘y’, which can then be used to reconstruct the full signal ‘x’ via the inverse Fourier transform. In the following figure, we specifically use compressed sensing, which exploits signal sparsity to recover ‘x’ from ‘y’ through an iterative optimization technique. Other data-driven approaches can also be formulated to leverage sparsity and recover the raw data from undersampled data.
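As an illustration of this idea (not Lightscline’s actual method), here is a minimal compressed-sensing sketch in Python using NumPy/SciPy. It builds a signal that is sparse in a DCT basis, keeps only ~10% of its samples at random locations, and recovers the full signal with the iterative soft-thresholding algorithm (ISTA) plus a least-squares refit. The signal, sampling fraction, sparsity level, and regularization weight are all illustrative assumptions.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
N, m, k = 1000, 100, 8          # signal length, kept samples (~10%), sparsity level

# Ground-truth signal x: exactly k nonzero coefficients in the DCT basis.
s_true = np.zeros(N)
support = rng.choice(N, size=k, replace=False)
s_true[support] = rng.uniform(0.5, 2.0, size=k) * rng.choice([-1.0, 1.0], size=k)
Psi = idct(np.eye(N), norm="ortho", axis=0)   # columns are DCT basis vectors, so x = Psi @ s
x = Psi @ s_true

# Undersampled measurements y: ~10% of the raw samples at random time locations.
idx = np.sort(rng.choice(N, size=m, replace=False))
y = x[idx]
A = Psi[idx, :]                               # sensing matrix: y = A @ s_true

# ISTA: iteratively minimize 0.5*||A s - y||^2 + lam*||s||_1 to find a sparse s.
lam, n_iters = 0.01, 2000
s_hat = np.zeros(N)
for _ in range(n_iters):
    z = s_hat - A.T @ (A @ s_hat - y)         # gradient step (step size 1, since ||A||_2 <= 1)
    s_hat = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)   # soft-thresholding step

# Debias: least-squares refit on the recovered support, then reconstruct the full signal.
supp = np.abs(s_hat) > 1e-3
s_hat[supp], *_ = np.linalg.lstsq(A[:, supp], y, rcond=None)
s_hat[~supp] = 0.0
x_hat = Psi @ s_hat
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

In practice, the sampling pattern, sparsifying basis, and solver would all be tuned to the sensor and the task; the point is simply that ~100 samples, plus knowledge of the structure in the data, are enough to reproduce all 1000.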
Important observation #2: Collecting just 1% or 10% of the raw data requires only 10 or 100 operations (note the size of the red tubes, much smaller than the funnel above). This alone saves us from collecting 90-99% of the data, without losing any information, since ‘x’ can still be recovered from ‘y’.
In both approaches, the signal information is preserved. When learning from 100% of the data, all the raw samples are needed to create the latent representation. When learning from a fraction of the data, we obtain the same latent representation and recover the signal from just 1-10% of the raw samples.
In other words, the large funnel carrying 100% of the data holds the same information as the three small tubes carrying just 10% of it. The tubes, however, require roughly 10x less processing power and time to handle!
If the structure in the data is exploited, the information can be preserved with 1-10% of the raw data, saving 90-99% of the data and the associated pre-processing needed to build a latent representation. In such cases, more data does not mean more information: beyond a certain point, the additional raw data is redundant, and collecting it does not lead to better representations.
There are several advantages to getting the same information from a small fraction of the raw data: reduced streaming power, transmission bandwidth, cloud storage and compute, and human capital requirements. By leveraging this counterintuitive insight, we can build lightweight analytics techniques that preserve information with a 10-100x smaller compute, power, and network footprint. This is central to our mission at Lightscline, where we are leveraging AI to automatically identify the ~10% of data that matters for tasks like anomaly detection and classification. You can learn more about us here.