Count-Min Sketch Data Structure with Implementation
Last Updated: 30 Jun, 2023
The Count-Min Sketch is a probabilistic data structure that summarizes large amounts of frequency data in a small, fixed amount of memory. It keeps approximate counts of things, i.e., how many times an element has appeared in a stream or multiset.
What is Count-Min Sketch?
The Count-Min Sketch was introduced by Graham Cormode and S. Muthukrishnan in 2003, and the same authors later summarized it in the article "Approximating Data with the Count-Min Sketch", published in 2011/12. It is used to estimate the frequency of events in streaming data. Like the Bloom filter, the Count-Min Sketch works with hash codes: it uses multiple hash functions to map each element to counters in a matrix (think of the sketch as a two-dimensional array).
Need for Count-Min Sketch
Since a Count-Min Sketch is used to find the frequency of an element, one might wonder whether such a data structure is actually needed. The answer is yes. Let us see why with the help of an example.
Let us first try to solve this frequency estimation problem using a MultiSet data structure.
Trying MultiSet as an alternative to Count-min sketch
Let’s implement frequency counting using Guava's Multiset with the source code below, and then look at the issues with this approach.
Java
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;

public class MultiSetDemo {
    public static void main(String[] args)
    {
        // A Multiset stores every distinct element together with its exact count
        Multiset<String> blackListedIPs = HashMultiset.create();
        blackListedIPs.add("192.170.0.1");
        blackListedIPs.add("75.245.10.1");
        blackListedIPs.add("10.125.22.20");
        blackListedIPs.add("192.170.0.1");

        // Print the exact frequency of two of the IPs
        System.out.println(blackListedIPs.count("192.170.0.1"));
        System.out.println(blackListedIPs.count("10.125.22.20"));
    }
}
Output:
2
1
Understanding the problem of using MultiSet instead of Count-Min Sketch
Now let’s look at the time and space consumed with this type of approach.
Number of elements | Time consumed (MultiSet)
10 | <25
100 | <25
1,000 | 30
10,000 | 257
100,000 | 1200
1,000,000 | 4244
Let’s have a look at the memory (space) consumed:
Number of elements | Memory consumed (MultiSet)
10 | <2
100 | <2
1,000 | 3
10,000 | 9
100,000 | 39
1,000,000 | 234
As the data grows, the MultiSet approach consumes more and more memory and time, because it has to store every distinct element.
This can be optimised by using a Count-Min Sketch, whose size is fixed when it is created.
How does Count-Min Sketch work?
Let’s look at the example below step by step.
Creating a Count-Min Sketch using Matrix
- Consider a 2D array with 4 rows and 16 columns. The number of rows is equal to the number of hash functions, so we are using four hash functions in this example. Initialize each cell in the matrix to zero.
Note: The more accurate a result you want, the larger the sketch has to be: more hash functions (rows) lower the chance of a badly inflated estimate, and more columns reduce the size of the overcount.
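The matrix described above can be set up in a few lines. This is a minimal sketch, assuming plain `long` counters; the class name `SketchTable` and its field names are hypothetical and not part of any library.

```java
import java.util.Arrays;

// Illustrative container for the counter matrix (names are assumptions).
class SketchTable {
    final int depth;        // number of rows = number of hash functions
    final int width;        // number of counters (columns) in each row
    final long[][] counts;  // the counter matrix

    SketchTable(int depth, int width) {
        this.depth = depth;
        this.width = width;
        // Java zero-initializes numeric arrays, so every cell starts at 0
        this.counts = new long[depth][width];
    }

    public static void main(String[] args) {
        SketchTable table = new SketchTable(4, 16); // 4 rows, 16 columns as in the example
        for (long[] row : table.counts) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

The memory used is fixed at depth x width counters, no matter how many elements are added later.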
Now let’s add some elements (input) to the Count-Min Sketch.
To do so, we pass each element through all four hash functions. For the first input, 192.170.0.1, this gives the following results.
Passing the input through Hash Functions:
- hashFunction1(192.170.0.1): 1
- hashFunction2(192.170.0.1): 6
- hashFunction3(192.170.0.1): 3
- hashFunction4(192.170.0.1): 1
Now visit the index returned by each hash function in its own row and increment that cell from 0 to 1.
Passing the input through Hash Functions:
- hashFunction1(75.245.10.1): 1
- hashFunction2(75.245.10.1): 2
- hashFunction3(75.245.10.1): 4
- hashFunction4(75.245.10.1): 6
Now visit the index returned by each hash function in its own row. If the cell is still 0, set it to 1. If the cell has already been incremented by a previous input, this is called a collision.
In that case, just increment the value in the cell by 1.
In our case, index 1 of row 1 (hashFunction1()) was already set to 1 by the previous input, so it is incremented and now holds 2. The cells visited in the other three rows had no collision, so each of them goes from 0 to 1.
Passing the input through Hash Functions:
- hashFunction1(10.125.22.20): 3
- hashFunction2(10.125.22.20): 4
- hashFunction3(10.125.22.20): 1
- hashFunction4(10.125.22.20): 6
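Let's represent this on the matrix, remembering to increment the count by 1 if a cell already holds an entry. Here, index 6 of row 4 was already set to 1 by 75.245.10.1, so it becomes 2. The other three cells go from 0 to 1.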
Passing the input through Hash Functions:
- hashFunction1(192.170.0.1): 1
- hashFunction2(192.170.0.1): 6
- hashFunction3(192.170.0.1): 3
- hashFunction4(192.170.0.1): 1
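Let's represent this on the matrix, remembering to increment the count by 1 if a cell already holds an entry. Every cell visited here was already incremented when 192.170.0.1 was first added, so the four cells now hold 3, 2, 2 and 2.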
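The four additions above can be replayed in code. This is a minimal sketch, assuming the column indexes from the walkthrough are hard-coded in a lookup table as a stand-in for real hash functions; the class and method names are hypothetical. Rows are numbered from 0 in the code, so row 1 of the walkthrough is `counts[0]`.

```java
import java.util.Map;

class WalkthroughUpdate {
    // Column index per row, copied from the hashFunction1..4 results in the walkthrough.
    static final Map<String, int[]> INDEXES = Map.of(
        "192.170.0.1",  new int[] {1, 6, 3, 1},
        "75.245.10.1",  new int[] {1, 2, 4, 6},
        "10.125.22.20", new int[] {3, 4, 1, 6});

    // Update step: increment one counter in every row.
    static void add(long[][] counts, String item) {
        int[] cols = INDEXES.get(item);
        for (int row = 0; row < counts.length; row++) {
            counts[row][cols[row]]++;
        }
    }

    // Replays the four inputs in the order used in the walkthrough.
    static long[][] replay() {
        long[][] counts = new long[4][16];
        String[] stream = {"192.170.0.1", "75.245.10.1", "10.125.22.20", "192.170.0.1"};
        for (String ip : stream) {
            add(counts, ip);
        }
        return counts;
    }

    public static void main(String[] args) {
        long[][] c = replay();
        // Cells visited by 192.170.0.1: prints "3 2 2 2"
        System.out.println(c[0][1] + " " + c[1][6] + " " + c[2][3] + " " + c[3][1]);
    }
}
```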
Testing Count-Min Sketch data structure against Test cases:
Now let’s test some elements and check how many times they are present.
Test case 1: pass 192.170.0.1 through all four hash functions and take the index numbers they generate.
- hashFunction1(192.170.0.1): 1
- hashFunction2(192.170.0.1): 6
- hashFunction3(192.170.0.1): 3
- hashFunction4(192.170.0.1): 1
Now visit each index and note down the entry present there.
So the entries at these indexes are 3, 2, 2, 2.
Now take the minimum among these entries; that is the result. min(3, 2, 2, 2) is 2, which means the test input was processed 2 times in the above list.
Hence Output (Frequency of 192.170.0.1) = 2.
Test case 2: pass 10.125.22.20 through all four hash functions and take the index numbers they generate.
- hashFunction1(10.125.22.20): 3
- hashFunction2(10.125.22.20): 4
- hashFunction3(10.125.22.20): 1
- hashFunction4(10.125.22.20): 6
Now visit each index and note down the entry present there.
So the entries at these indexes are 1, 1, 1, 2.
Now take the minimum among these entries; that is the result. min(1, 1, 1, 2) is 1, which means the test input was processed only once in the above list.
Hence Output (Frequency of 10.125.22.20) = 1.
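The update and query steps above can be combined into one small class. This is a minimal sketch and not a library implementation: the class name `SimpleCountMinSketch` is hypothetical, and the per-row hash `(a * x + b) mod p mod width` built on `String.hashCode()` is an assumption chosen for illustration. Because real hash functions are used here, the column indexes will differ from those in the walkthrough.

```java
import java.util.Random;

// Minimal Count-Min Sketch for String keys (illustrative, not tuned).
class SimpleCountMinSketch {
    private static final long PRIME = 2147483647L; // 2^31 - 1

    private final int depth;        // rows, one hash function per row
    private final int width;        // counters per row
    private final long[][] counts;  // zero-initialized by Java
    private final long[] a;         // per-row multiplier of the hash function
    private final long[] b;         // per-row offset of the hash function

    SimpleCountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.a = new long[depth];
        this.b = new long[depth];
        Random rng = new Random(seed);
        for (int row = 0; row < depth; row++) {
            a[row] = 1 + rng.nextInt((int) PRIME - 1); // in [1, PRIME - 1]
            b[row] = rng.nextInt((int) PRIME);         // in [0, PRIME - 1]
        }
    }

    // Maps an item to a column of the given row.
    private int column(int row, String item) {
        long x = Math.floorMod((long) item.hashCode(), PRIME); // non-negative key
        return (int) (((a[row] * x + b[row]) % PRIME) % width);
    }

    // Update: add `count` to one counter in every row.
    void add(String item, long count) {
        for (int row = 0; row < depth; row++) {
            counts[row][column(row, item)] += count;
        }
    }

    // Query: the minimum over the item's counters, which is never below the true count.
    long estimate(String item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, counts[row][column(row, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        SimpleCountMinSketch sketch = new SimpleCountMinSketch(4, 16, 1L);
        sketch.add("192.170.0.1", 1);
        sketch.add("75.245.10.1", 1);
        sketch.add("10.125.22.20", 1);
        sketch.add("192.170.0.1", 1);
        System.out.println(sketch.estimate("192.170.0.1"));  // at least 2
        System.out.println(sketch.estimate("10.125.22.20")); // at least 1
    }
}
```

Taking the minimum works because collisions can only add to a counter, so the smallest counter is the one least affected by other elements.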
Implementation of Count-min sketch using the stream-lib library in Java:
We can use the Count-Min Sketch implementation provided by the stream-lib Java library (com.clearspring.analytics). Below is the step-by-step implementation:
- Use the Maven dependency below.
XML
<dependency>
    <groupId>com.clearspring.analytics</groupId>
    <artifactId>stream</artifactId>
    <version>2.9.5</version>
</dependency>
- The detailed Java code is as follows:
Java
import com.clearspring.analytics.stream.frequency.CountMinSketch;

public class CountMinSketchDemo {
    public static void main(String[] args)
    {
        // Arguments: epsilon (error bound), confidence, seed
        CountMinSketch countMinSketch = new CountMinSketch(0.001, 0.99, 1);

        // Record each IP together with the number of times it was seen
        countMinSketch.add("75.245.10.1", 1);
        countMinSketch.add("10.125.22.20", 1);
        countMinSketch.add("192.170.0.1", 2);

        // Estimated frequency of an IP that was added twice
        System.out.println(countMinSketch.estimateCount("192.170.0.1"));
        // Estimated frequency of an IP that was never added
        System.out.println(countMinSketch.estimateCount("999.999.99.99"));
    }
}
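The constructor in the above example takes three arguments:
- 0.001 = the epsilon, i.e., the error bound as a fraction of the total count
- 0.99 = the confidence, i.e., the probability (1 - delta) that an estimate stays within that error bound
- 1 = the seed used to generate the hash functions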
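To see how an error bound and a confidence translate into a matrix size, here is a minimal sketch using the textbook sizing from the original Count-Min Sketch analysis: width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)). The class and method names are hypothetical, and libraries may use different constants, so the dimensions actually allocated by the constructor above can differ from these.

```java
class SketchDimensions {
    // Columns per row so that the overcount stays within epsilon * (total count).
    static int width(double epsilon) {
        return (int) Math.ceil(Math.E / epsilon);
    }

    // Rows (hash functions) so that the bound fails with probability at most delta.
    static int depth(double delta) {
        return (int) Math.ceil(Math.log(1.0 / delta));
    }

    public static void main(String[] args) {
        double epsilon = 0.001;      // error bound from the example above
        double delta = 1.0 - 0.99;   // failure probability = 1 - confidence
        System.out.println(width(epsilon) + " columns x " + depth(delta) + " rows");
    }
}
```

Under these formulas, halving epsilon doubles the number of columns, while a much smaller delta only adds a few rows.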
Output:
Time and Space Complexity of Count-Min Sketch Data Structure
Now let’s look at the time and space consumed with this approach (with respect to the above stream-lib implementation in Java).
Number of elements | Time consumed (MultiSet) | Time consumed (Count-Min Sketch)
10 | <25 | 35
100 | <25 | 30
1,000 | 30 | 69
10,000 | 257 | 246
100,000 | 1200 | 970
1,000,000 | 4244 | 4419
Let’s have a look at the memory (space) consumed:
Number of elements | Memory consumed (MultiSet) | Memory consumed (Count-Min Sketch)
10 | <2 | N/A
100 | <2 | N/A
1,000 | 3 | N/A
10,000 | 9 | N/A
100,000 | 39 | N/A
1,000,000 | 234 | N/A
Applications of Count-min sketch:
- Compressed Sensing
- Networking
- NLP
- Stream Processing
- Frequency tracking
- Extension: Heavy-hitters
- Extension: Range-query
Issue with Count-min sketch and its solution:
What if two or more elements get the same hash value in a row? They all increment the same counter, so that counter ends up larger than the true count of any one of them. When this happens in every row for some element, the Count-Min Sketch overcounts its frequency. It never undercounts.
Taking the minimum across rows limits this effect: the more hash functions (rows) we use, the less likely it is that an element collides in every row, and the more columns each row has, the fewer collisions occur per row. Hence the sketch should be sized with enough rows and columns for the accuracy required.
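The size of the overcount can be bounded. As a sketch of the standard analysis, assume the hash functions are pairwise independent and write a_x for the true count of element x, \hat{a}_x for its estimate, N for the total of all counts added, w for the number of columns and d for the number of rows (these symbols are introduced here for illustration):

```latex
a_x \le \hat{a}_x,
\qquad
\Pr\!\left[\, \hat{a}_x > a_x + \frac{e}{w}\, N \,\right] \le e^{-d}
```

The first inequality says the estimate never falls below the true count. The second says that each extra row multiplies the chance of a large overcount by 1/e, while doubling the number of columns halves the error term (e/w)N.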
Conclusion:
We have observed that the Count-Min Sketch is a good choice when we have to process a large data set with low memory consumption. We also saw that the more accurate a result we want, the more hash functions (rows) and columns the sketch needs.