Version Vectors(I)

Pratik Pandey

Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com

Published Aug 15, 2022

In my last article, we talked about Vector Clocks. We learned that Vector Clocks could help establish the ordering of operations, including identifying whether operations happened concurrently or were causally related. A very similar mechanism is used by Version Vector, but it’s used for a slightly different purpose.

Version Vector

Version Vectors are generally leveraged in distributed data-driven applications, where each data record is tied to a Version Vector. Since it’s a distributed system, a data record can be updated concurrently by multiple nodes. Thus, we can leverage version vectors, to identify if a data record can be reconciled immediately or if it requires a conflict resolution(remember that we can identify concurrent updates, & if an update to a data record is concurrent, it needs a conflict resolution).

Just like Vector Clocks, Version Vectors also maintain a vector per node, but the entries don’t represent the logical times anymore.

Before going further, let’s specify the criteria for identifying a concurrent update. Similar to Vector clocks, we compare two Version Vectors to understand the relationship between them-

A version vector is considered higher than the other if both of the version vectors have a version number for the same nodes and each version number is higher than the one in the other vector and vice versa.
If the version vectors cover different nodes or if not all the version numbers are higher, they are considered concurrent.

Examples(A-D are actors, while the number represents event count just like in Vector Clocks)-

{A: 1, B: 1} is greater than {A: 1, B: 0}
{A: 2, B: 1} is concurrent with {A: 1, B: 2}
{A: 2, B: 1, C: 1} is greater than {A: 2, B: 1}
{A: 2, B: 1, C: 1} is concurrent with {A: 2, B: 1, D:1}

Now, it’s extremely critical to identify where things are different from Vector Clocks. Always remember that Version Vectors exist for each data item, & hence the comparison of vectors in the above example is the comparison of vectors on different Actors for the same data record/item.

Following are the steps followed by the algorithm -

Each actor reads the initial state of the version vector for each key & the value associated with the key. Let’s start with just 1 key for simplicity & the algo can easily be extended to multiple keys.
The actor pushes an update to the value to the data store(which internally implements Version Vectors). The update message includes the updated value, an identifier for the actor & the version vector present with the actor.
Two nodes can synchronize their version vector, allowing them to reconcile the version vector or detect concurrent updates. In case there is a concurrent update, a conflict is detected. But how can we represent conflicts in the system?

We’ll talk about two approaches that we can leverage to identify conflicts, & the advantage/pitfalls of both.

Client As An Actor

As we read in the algorithm above, each write is associated with the Actor, and since we’re taking each client to be an Actor, the client responsible for the write will send its identifier along with the write. So, if in case a conflict is detected, a sibling is added to the version vector corresponding to the write from that particular client.

Let’s try to understand what’s happening in the above diagram -

Let’s assume we have a key K, with value U. We’re assuming that we have an empty version vector, to begin with. Client’s C2 and C3 sync the same state from the system that’s implementing Version Vectors.
C3 updates the value to V & sends a PUT command, with the local state of Version Vector it has(empty VV).
The System on receiving the Version Vector compares it with its local state(again empty). So it creates a new Version Vector for this client(C3) and updates the value of key K to V.
C3 syncs the state from the server and again updates the value to A from V and sends a PUT command.
The System on receiving the Version Vector compares it with its local state. This time the local state has an entry for C3, so it increments the counter to 2 & updates the value to A.
C2 updates the value to W & sends a PUT command, with the local state of Version Vector it has(empty VV).
The System on receiving the Version Vector compares it with its local state((C3, 2)). Since there is no local state for C2, it creates a new state of (C2, 1). If we compare the 2 Version Vectors, {C3, 2} and {C2, 1}, based on what we’ve discussed before, they are concurrent. Since there is a conflict, the System stores both the version vectors(1 per client) as siblings.

Advantage -

Leveraging Client Id’s to identify conflicts, we could easily handle concurrent updates to the same key.

Concerns -

Scalability — This approach suffers from something called Actors Explosion. If your system has a lot of clients, the corresponding version vectors for each data item could potentially grow to the size of the client. Since this will be stored per data item, we’ll have a lot of data to be stored. Also, as the size of the vector grows, the cost of comparing grows along with it.
Needs a unique Id for each client. If there is any overlap in clientId, we could have potential data loss.

This brings us to the end of this article. We learned about Version Vectors, where they differ from Vector Clocks. We looked at how Version Vectors can be used by distributed data stores, relying on eventual consistency to identify concurrent updates. We saw the advantages and pitfalls of using Client as an Actor, and in our next article, we’ll look at other approaches to identify concurrent updates and more!

Thank you for reading! I’ll be posting weekly content on distributed systems & patterns, so please like, share and subscribe to this newsletter for notifications of new posts.

Please comment on the post with your feedback, will help me improve! :)

Until next time, Keep asking questions & Keep learning!

Version Vectors(I)

Pratik Pandey

Senior Software Engineer at Booking.com | AWS Serverless Community Builder | pratikpandey.substack.com

Version Vector

Recommended by LinkedIn

Client As An Actor

Distributed Systems Made Easy

7,941 followers

More articles by Pratik Pandey

Insights from the community

Others also viewed

The measure of Central Tendency

Telemetry Properties of a Digital Twin Model

Makeover walkthrough

"Well, It Depends..."

Taking a Rubik’s Cube Approach to Data-Driven Decision-Making

4) Small Multiples, Huge Advantage

THE NEW ERA OF BIG DATA ....

Cow<T> in Rust

I ran 580 model-dataset experiments to show that, even if you try very hard, it is almost impossible to know that a model is degrading just by looking

Grind 75 - 18 - Reverse Linked List

Explore topics

Version Vector

Recommended by LinkedIn

Client As An Actor

Distributed Systems Made Easy

7,941 followers

More articles by Pratik Pandey

Database Intermediate Series: Change Data Capture(II)

Database Intermediate Series: Change Data Capture(I)

Database Intermediate Series: SQL Isolation Levels Internals

Database Basics Series: Understanding SQL Isolation Levels

Go Concurrency Series: Concurrency Patterns(II)

Go Concurrency Series: Concurrency Patterns

Go Concurrency Series: Deep Dive into Go Scheduler(III)

Go Concurrency Series: Deep Dive into Go Scheduler(II)

Go Concurrency Series: Deep Dive into Go Scheduler(I)

Go Concurrency Series: Introduction to Goroutines

Insights from the community

Others also viewed

The measure of Central Tendency

Telemetry Properties of a Digital Twin Model

Makeover walkthrough

"Well, It Depends..."

Taking a Rubik’s Cube Approach to Data-Driven Decision-Making

4) Small Multiples, Huge Advantage

THE NEW ERA OF BIG DATA ....

Cow<T> in Rust

I ran 580 model-dataset experiments to show that, even if you try very hard, it is almost impossible to know that a model is degrading just by looking

Grind 75 - 18 - Reverse Linked List

Explore topics