Protocol Buffers: Understanding Concepts
Protocol Buffers (often called Protobufs) was developed by Google to tackle inefficiencies in data serialization. Traditional methods, like XML and JSON, were slow and bulky, which affected performance and data handling. Protobuf addresses these issues by using a compact binary format that speeds up serialization and reduces data size.
Note: I encourage you to read this insightful newsletter by Neo Kim, which explains how LinkedIn was able to reduce their latency by 60% simply by replacing JSON with Protobufs.
Additionally, Protobuf is language- and platform-agnostic: it supports multiple programming languages, making it easier to integrate and communicate across diverse systems.
A Protobuf schema is defined in a .proto file, which contains two major components:
1. Messages in Protobufs
Messages define the type and structure of the data that needs to be exchanged. A simple message in a .proto file looks like this:
message User {
  int32 id = 1;
  string email = 2;
  bool is_active = 3;
}
The values 1, 2, and 3 assigned to id, email, and is_active are known as field numbers, and they must be unique within a message. They are used to encode and decode data efficiently.
When a message is serialized, each field is identified by its unique number rather than its name. For instance, in a User message with fields id (1), email (2), and is_active (3), the serialization process writes each value as binary data tagged with its field number.
How Serialization Happens
Suppose a User message contains {id: 123, email: "z@z.com", is_active: true}. Each field is serialized as a tag, which combines the field number with a wire type, followed by the field's value.
Note: The wire type specifies how field values are encoded in binary data, guiding Protobuf on how to interpret the bytes that follow the field’s tag. For example, wire type 0 (varint) is used for encoding integers and booleans, while wire type 2 (length-delimited) is used for strings and other length-prefixed data.
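The binary output after serialization, shown here in hexadecimal, would be as follows. Each tag byte is computed as (field number << 3) | wire type.
Field 1 (id): 08 7B (tag 0x08 = field number 1, wire type 0; value 123)
Field 2 (email): 12 07 7A 40 7A 2E 63 6F 6D (tag 0x12 = field number 2, wire type 2; length 7; value "z@z.com")
Field 3 (is_active): 18 01 (tag 0x18 = field number 3, wire type 0; value true)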
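To make the tag-and-value layout concrete, here is a minimal hand-written encoder for the same User message. It is only a sketch for illustration: the helper names (encode_varint, encode_tag, encode_user) are made up for this example and are not part of the Protobuf library, and it assumes non-negative integers and no default-valued fields. It also prints the size of the equivalent JSON for comparison.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_tag(field_number: int, wire_type: int) -> bytes:
    """A tag packs the field number and the wire type into one varint."""
    return encode_varint((field_number << 3) | wire_type)


def encode_user(user_id: int, email: str, is_active: bool) -> bytes:
    """Hand-encode the three User fields (illustrative, not generated code)."""
    email_bytes = email.encode("utf-8")
    return (
        encode_tag(1, 0) + encode_varint(user_id)                           # id
        + encode_tag(2, 2) + encode_varint(len(email_bytes)) + email_bytes  # email
        + encode_tag(3, 0) + encode_varint(int(is_active))                  # is_active
    )


wire = encode_user(123, "z@z.com", True)
print(wire.hex(" "))  # 08 7b 12 07 7a 40 7a 2e 63 6f 6d 18 01

as_json = json.dumps({"id": 123, "email": "z@z.com", "is_active": True})
print(len(wire), "bytes as Protobuf vs", len(as_json.encode("utf-8")), "bytes as JSON")
```

The field names never appear in the binary output; only the field numbers do, which is a large part of why the encoded message is smaller than its JSON counterpart.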
Nested Message
You can define Protobuf messages within other messages and use types like enums, effectively creating nested types. Here’s an example:
syntax = "proto3";
message User {
  int32 id = 1;
  string email = 2;
  bool is_active = 3;

  enum SocialMediaType {
    FACEBOOK = 0;
    TWITTER = 1;
    LINKEDIN = 2;
    INSTAGRAM = 3;
  }

  message SocialMediaProfile {
    string username = 1;
    SocialMediaType type = 2;
  }

  repeated SocialMediaProfile social_media_profiles = 4;
}
In this example, the User message includes a nested SocialMediaProfile message. The SocialMediaProfile message has two fields: username and type, which are used to represent a user's social media account details. The type field uses an enum called SocialMediaType to categorize different social media platforms.
The User message also contains a repeated field of SocialMediaProfile messages named social_media_profiles. This means that a single User can have multiple social media profile entries.
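On the wire, each entry of a repeated message field is written as its own length-delimited record carrying the parent field's number (4 here). The sketch below hand-encodes two profiles to show this. The helper names and the second username are hypothetical, and the code assumes proto3 behaviour where a field holding its default value (such as FACEBOOK = 0) is left out of the output.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)
        else:
            out.append(low_bits)
            return bytes(out)


def length_delimited(field_number: int, payload: bytes) -> bytes:
    """Wire type 2 record: tag, payload length, then the payload itself."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


def encode_profile(username: str, media_type: int) -> bytes:
    """Hand-encode a SocialMediaProfile (username = 1, type = 2)."""
    body = length_delimited(1, username.encode("utf-8"))
    if media_type != 0:  # default enum value is omitted
        body += encode_varint((2 << 3) | 0) + encode_varint(media_type)
    return body


# (username, SocialMediaType) pairs: FACEBOOK = 0, LINKEDIN = 2
profiles = [("np123", 0), ("np_on_linkedin", 2)]
encoded = b"".join(length_delimited(4, encode_profile(n, t)) for n, t in profiles)
print(encoded.hex(" "))
```

The output contains the tag byte 0x22 (field number 4, wire type 2) once per profile, which is how a decoder knows the field occurred more than once.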
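Field Number Scope
Field numbers are scoped to the message that declares them, which is why the nested SocialMediaProfile message can reuse 1 and 2 even though User already uses them.
Field Number Uniqueness
Within a single message, each field number must be unique; two fields of the same message cannot share a number.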
2. Services in Protobufs
In Protobufs, services define a set of RPC (remote procedure call) methods. These methods are like functions or procedures that you can call over a network. Services help different systems or components communicate by allowing one system (the client) to call methods on another system (the server).
Messages define the structure of the data, while services specify the APIs for accessing and manipulating that data through remote procedure calls (RPCs).
Here's a basic example of defining a service in Protobuf:
syntax = "proto3";
service UserService {
  rpc GetUser(GetUserRequest) returns (GetUserResponse);
}

message GetUserRequest {
  int32 user_id = 1;
}

message GetUserResponse {
  User user = 1;
}

message User {
  // above
}
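To illustrate the request/response shape that this service definition describes, here is a plain-Python stand-in. It is not generated gRPC code and involves no networking: the dataclasses and the InMemoryUserService class are hypothetical names chosen for this sketch, and a real client would call GetUser through a generated stub instead.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class User:
    id: int
    email: str
    is_active: bool


@dataclass
class GetUserRequest:
    user_id: int


@dataclass
class GetUserResponse:
    user: Optional[User]


class InMemoryUserService:
    """Hypothetical server-side logic behind UserService.GetUser."""

    def __init__(self, users: Dict[int, User]) -> None:
        self._users = users

    def GetUser(self, request: GetUserRequest) -> GetUserResponse:
        # One request message in, one response message out.
        return GetUserResponse(user=self._users.get(request.user_id))


service = InMemoryUserService({1234: User(1234, "np@np.com", True)})
response = service.GetUser(GetUserRequest(user_id=1234))
print(response.user.email)
```

The point of the sketch is the contract: every RPC takes exactly one message type and returns exactly one message type, so the request and response can evolve by adding fields without changing the method signature.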
Compiling Protobufs
The .proto files can be compiled into code for multiple languages using the Protocol Buffer compiler, protoc. For example, to generate Python code, you would use the following command:
protoc --python_out=. user.proto
This will generate a Python file named user_pb2.py, which includes the necessary code for creating, manipulating, and serializing the defined messages.
Implementing Protobufs in Code
Here’s how you can use the generated Python code to create a User message, populate its fields, and then serialize the message into a binary string (a bytes object in Python 3):
import user_pb2 # This is the generated file for the User message
# Create a User message
user = user_pb2.User()
# Set the fields
user.id = 1234
user.email = "np@np.com"
user.is_active = True
# Add a social media profile
profile = user.social_media_profiles.add() # Add a new SocialMediaProfile
profile.username = "np123"
profile.type = user_pb2.User.FACEBOOK
# Serialize the message to a binary string
serialized_user = user.SerializeToString()
Similarly, parsing of serialized data can be done as follows:
import user_pb2 # This is the generated file for the User message
# Assume `serialized_user` is the binary data obtained from serialization
user = user_pb2.User()
user.ParseFromString(serialized_user)
# Access the fields of the deserialized User message
print(user.email)
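For a rough idea of what parsing involves, the sketch below walks a serialized message by hand and recovers field numbers and raw values. It is a simplified illustration, not the library's implementation: read_varint and decode_fields are hypothetical helpers, only wire types 0 and 2 are handled, and a repeated field would keep only its last entry. The input bytes are written out by hand for {id: 1234, email: "np@np.com", is_active: true}, without the social media profile.

```python
from typing import Dict, Tuple, Union


def read_varint(data: bytes, pos: int) -> Tuple[int, int]:
    """Return (value, next_position) for the varint starting at pos."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def decode_fields(data: bytes) -> Dict[int, Union[int, bytes]]:
    """Map field numbers to raw values; only wire types 0 and 2 are handled."""
    fields: Dict[int, Union[int, bytes]] = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[field_number] = value
    return fields


serialized_user = bytes.fromhex("08d209") + b"\x12\x09np@np.com" + bytes.fromhex("1801")
fields = decode_fields(serialized_user)
print(fields)  # {1: 1234, 2: b'np@np.com', 3: 1}
```

The decoder sees only numbers and bytes. Turning field 2 into a string named email, or field 3 into a boolean named is_active, requires the schema from the .proto file, which is what the generated user_pb2 module supplies.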
Summary
Protocol Buffers (Protobuf) helps make data handling faster and more efficient compared to older methods like XML and JSON. It uses a compact binary format, which makes data smaller and quicker to work with. Protobuf is great for defining data structures and creating services that allow different systems to communicate with each other. It supports many programming languages, making it versatile and easy to integrate into various projects. That's all for this article.
Stay tuned for more insights on Protobuf and gRPC topics.
For more details and resources, visit my personal website.