C++ Protobuf
- Description: Protobuf in C++: .proto schema, generated message API, binary/text/JSON serialization, oneof/repeated/map, arenas, schema evolution, sharing schemas with Python
- My Notion Note ID: K2A-B2-3
- Created: 2020-01-13
- Updated: 2026-04-30
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. Schema-First Serialization
- 2. Schema (.proto) Basics
- 3. Generating C++ Code
- 4. The Generated Message API
- 5. Serialization and Parsing
- 6. Arenas
- 7. Reflection and Any
- 8. Schema Evolution
- 9. Build Integration Sketches
- 10. Sharing Schemas Across Languages
- 11. References
1. Schema-First Serialization
A .proto file declares the data; protoc compiles it into classes for C++, Python, Go, Java, and other languages. The wire format is binary and unknown-field tolerant: old binaries parsing a message with new fields keep those fields intact and re-serialize them unchanged (the rules are in § 8).
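A minimal sketch of why unknown fields are cheap to tolerate, assuming nothing beyond the published wire format: every field starts with a varint key holding the field number and a wire type, and the wire type alone says how many bytes to skip. ReadVarint and ListFieldNumbers are hypothetical helper names, not libprotobuf APIs; the real parser additionally keeps the skipped bytes so it can write them back out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Reads a base-128 varint starting at `pos` and advances `pos` past it.
// Returns std::nullopt on truncated or over-long input.
std::optional<uint64_t> ReadVarint(const std::string& buf, size_t& pos) {
  uint64_t value = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (pos >= buf.size()) return std::nullopt;
    const uint8_t byte = static_cast<uint8_t>(buf[pos++]);
    value |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return value;
  }
  return std::nullopt;
}

// Walks a serialized message without its schema and collects the field
// numbers it contains, skipping each value based on the wire type alone.
std::optional<std::vector<uint32_t>> ListFieldNumbers(const std::string& buf) {
  std::vector<uint32_t> fields;
  size_t pos = 0;
  while (pos < buf.size()) {
    const auto key = ReadVarint(buf, pos);
    if (!key) return std::nullopt;
    const uint32_t field_number = static_cast<uint32_t>(*key >> 3);
    const uint32_t wire_type = static_cast<uint32_t>(*key & 0x7);
    size_t skip = 0;
    switch (wire_type) {
      case 0:  // varint: read and discard
        if (!ReadVarint(buf, pos)) return std::nullopt;
        break;
      case 1:  // fixed 64-bit
        skip = 8;
        break;
      case 2: {  // length-delimited: string, bytes, sub-message
        const auto len = ReadVarint(buf, pos);
        if (!len || *len > buf.size() - pos) return std::nullopt;
        skip = static_cast<size_t>(*len);
        break;
      }
      case 5:  // fixed 32-bit
        skip = 4;
        break;
      default:  // groups (wire types 3 and 4) are not handled in this sketch
        return std::nullopt;
    }
    if (skip > buf.size() - pos) return std::nullopt;
    pos += skip;
    fields.push_back(field_number);
  }
  return fields;
}
```

Fed the eight bytes 08 2A 12 02 79 75 48 07 (id = 42, name = "yu", plus a varint field 9 that the User schema in § 2 does not declare), ListFieldNumbers returns 1, 2, 9: the undeclared field is stepped over without any schema knowledge.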
In C++, "protobuf" usually means proto3 with Google's libprotobuf runtime. Recent releases introduced protobuf editions as the successor to the proto2/proto3 syntax split, but proto3 syntax is still what most code uses.
2. Schema (.proto) Basics
syntax = "proto3";
package myapp.v1;
import "google/protobuf/timestamp.proto";
message User {
int64 id = 1;
string name = 2;
string email = 3;
google.protobuf.Timestamp created = 4;
}
Each field has a tag number (the = 1, = 2, ...). Tags are what end up on the wire; names are not. Tags 1–15 take one byte to encode and tags 16–2047 take two. Reserve the low range for hot fields.
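The byte counts follow from how the field key is built: the tag number is shifted left by three bits to make room for the wire type, and the result is varint-encoded at seven payload bits per byte. A small sketch of that arithmetic (VarintSize and KeySize are hypothetical helper names, not part of libprotobuf):

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes `value` occupies as a base-128 varint (7 payload bits/byte).
constexpr int VarintSize(uint64_t value) {
  int bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// Size of a field key: (field_number << 3) | wire_type, varint-encoded.
// The wire type only fills the low three bits, so it never changes the size.
constexpr int KeySize(uint32_t field_number) {
  return VarintSize(static_cast<uint64_t>(field_number) << 3);
}

static_assert(KeySize(1) == 1);
static_assert(KeySize(15) == 1);    // 15 << 3 = 120, still below 128
static_assert(KeySize(16) == 2);    // 16 << 3 = 128, needs a second byte
static_assert(KeySize(2047) == 2);
static_assert(KeySize(2048) == 3);
```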
Files in the same package share a namespace; the C++ generator maps package myapp.v1 to namespace myapp::v1.
2.1 Scalar Types
| .proto | C++ | Notes |
|---|---|---|
| double | double | |
| float | float | |
| int32 / int64 | int32_t / int64_t | varint encoding; inefficient for negatives |
| sint32 / sint64 | int32_t / int64_t | zig-zag encoded; use for negative-prone values |
| uint32 / uint64 | uint32_t / uint64_t | varint |
| fixed32 / fixed64 | uint32_t / uint64_t | always 4 / 8 bytes; better for large values |
| sfixed32 / sfixed64 | int32_t / int64_t | signed fixed |
| bool | bool | |
| string | std::string | must be valid UTF-8; passing arbitrary bytes can break clients in other languages, so use bytes for opaque data |
| bytes | std::string | arbitrary bytes |
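To make the int32 versus sint32 trade-off concrete, here is a sketch of the two encodings as the wire-format documentation describes them. The function names are hypothetical; libprotobuf has its own internal equivalents.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes `value` occupies as a base-128 varint.
constexpr int VarintSize(uint64_t value) {
  int bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// int32 / int64: a negative value is sign-extended to 64 bits before
// encoding, so every negative number sets the top bit.
constexpr int Int32FieldSize(int32_t v) {
  return VarintSize(static_cast<uint64_t>(static_cast<int64_t>(v)));
}

// sint32: zig-zag interleaves signs so small magnitudes stay small:
// 0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ...
constexpr uint32_t ZigZagEncode32(int32_t v) {
  return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}

constexpr int32_t ZigZagDecode32(uint32_t n) {
  return static_cast<int32_t>((n >> 1) ^ (~(n & 1) + 1));
}

constexpr int Sint32FieldSize(int32_t v) {
  return VarintSize(ZigZagEncode32(v));
}
```

With these definitions, -1 costs 10 bytes as an int32 but 1 byte as a sint32. The price is paid on the positive side: 64 fits in 1 byte as an int32 but needs 2 as a sint32, because zig-zag spends one bit on the sign.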
2.2 Field Rules: singular, optional, repeated, map
message Post {
string title = 1; // singular (default in proto3)
optional string body = 2; // explicit presence (proto3, since 3.15)
repeated string tags = 3; // 0..N elements
map<string, int32> reactions = 4; // string -> int32
}
In proto3, singular scalar fields always have a value; there's no "is set". A missing scalar reads back as the type's zero value (0, "", false). When you genuinely need to distinguish "unset" from "set to zero", mark the field optional; the generated class then gains has_field() (clear_field() is generated either way). (Message-typed fields always have presence regardless.)
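The difference shows up on the wire. A rough sketch of the serializer's decision for a single int64 field, using hypothetical hand-written helpers in place of the generated code: with implicit presence a zero is simply not written, so a reader cannot tell it from "never set"; with optional the field is written whenever it was set.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Appends `value` to `out` as a base-128 varint.
void AppendVarint(uint64_t value, std::string& out) {
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
}

// Implicit presence (plain proto3 scalar): the zero value is never emitted.
std::string EncodeImplicitInt64(uint32_t field_number, int64_t value) {
  std::string out;
  if (value == 0) return out;
  AppendVarint(uint64_t{field_number} << 3, out);  // wire type 0 = varint
  AppendVarint(static_cast<uint64_t>(value), out);
  return out;
}

// Explicit presence (`optional`): emitted whenever set, even when zero.
std::string EncodeOptionalInt64(uint32_t field_number,
                                std::optional<int64_t> value) {
  std::string out;
  if (!value.has_value()) return out;
  AppendVarint(uint64_t{field_number} << 3, out);
  AppendVarint(static_cast<uint64_t>(*value), out);
  return out;
}
```

For field 1, EncodeImplicitInt64(1, 0) yields no bytes at all, while EncodeOptionalInt64(1, 0) yields the two bytes 08 00; an unset optional yields nothing.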
Proto3 also dropped the required keyword that proto2 had — required fields turn out to be a foot-gun for schema evolution (you can never safely remove one). Validate at the application layer instead.
Tag numbers are forever. They're also the only thing on the wire that identifies a field. Reserve hot fields for tags 1–15 (one byte to encode); never reuse a tag number; use reserved (see § 8) when removing a field.
2.3 Enums and Nested Messages
message Order {
enum Status {
STATUS_UNSPECIFIED = 0; // proto3 enums must have a 0 value
STATUS_PENDING = 1;
STATUS_SHIPPED = 2;
}
message LineItem {
string sku = 1;
int32 count = 2;
}
int64 id = 1;
Status status = 2;
repeated LineItem items = 3;
}
The Google style guide recommends prefixing enum values with the enum name and reserving _UNSPECIFIED = 0 so a default-constructed value is always meaningful.
2.4 oneof
For "at most one of these fields is set":
message Event {
int64 timestamp = 1;
oneof body {
Login login = 2;
Logout logout = 3;
Click click = 4;
}
}
Setting any field in the oneof clears the others. Generated C++ exposes body_case() returning an enum (Event::kLogin, Event::kLogout, ..., or Event::BODY_NOT_SET when no member is set), plus per-field accessors.
3. Generating C++ Code
protoc \
--proto_path=src/proto \
--cpp_out=gen \
src/proto/user.proto
This produces gen/user.pb.h and gen/user.pb.cc mirroring the directory under --proto_path. Add gen/ to your include path and compile the .pb.cc files into your library.
Useful flags:
- -I path: alias for --proto_path. Repeatable for multiple roots.
- --cpp_out=...: emit C++.
- --grpc_out=... (with the gRPC plugin): emit .grpc.pb.{h,cc} for service stubs.
- --descriptor_set_out=foo.desc: emit a binary FileDescriptorSet for runtime/dynamic use.
4. The Generated Message API
For each message, protoc generates a class with:
- A default constructor producing the all-zero message.
- Accessors for each field: getter name(), setter set_name(value), mutable accessor mutable_name() for sub-messages and strings, clear_name(), and (for optional/message fields) has_name().
- Repeated field accessors: name(i), name_size(), add_name(), mutable_name(i), clear_name(), plus name() returning a RepeatedField or RepeatedPtrField for iteration.
- Map field accessors: name() / mutable_name() returning a Map<K,V> you can index.
- Lifecycle: Swap, CopyFrom, MergeFrom, Clear, IsInitialized, ByteSizeLong.
- Reflection hooks: GetDescriptor, GetReflection.
#include "absl/time/clock.h"  // absl::Now
#include "absl/time/time.h"   // absl::ToUnixSeconds
#include "user.pb.h"
myapp::v1::User u;
u.set_id(42);
u.set_name("yu");
u.set_email("[email protected]");
// Sub-message
auto* ts = u.mutable_created();
ts->set_seconds(absl::ToUnixSeconds(absl::Now()));
// Repeated
myapp::v1::Post p;
p.set_title("Hello");
*p.add_tags() = "intro";
p.add_tags("greeting");
// Map
(*p.mutable_reactions())["like"] = 3;
// Read
for (const std::string& tag : p.tags()) { /* ... */ }
Don't keep raw pointers or references across mutable_* calls on repeated/map fields. Adding an element can reallocate the underlying storage and invalidate everything you held. If you need stable references, pull the values out into your own container.
5. Serialization and Parsing
5.1 Binary (Wire Format)
Binary is the default: compact and fast to parse. The output is deterministic enough for byte-by-byte hashing only if you call SetSerializationDeterministic(true) on the CodedOutputStream you serialize into.
#include "user.pb.h"
#include <fstream>
bool WriteBinary(const myapp::v1::User& u, const std::string& path) {
std::ofstream out(path, std::ios::binary | std::ios::trunc);
return u.SerializeToOstream(&out);
}
bool ReadBinary(myapp::v1::User* u, const std::string& path) {
std::ifstream in(path, std::ios::binary);
return u->ParseFromIstream(&in);
}
// Or to/from a string buffer:
std::string buf;
u.SerializeToString(&buf);
u.ParseFromString(buf);
// Or to/from a fixed byte array:
u.SerializeToArray(ptr, size);
u.ParseFromArray(ptr, size);
The SerializeTo* methods require the message to be initialized (no required field missing) and return false on failure. Proto3 has no required fields, so the failure modes left in practice are I/O errors and messages that exceed the 2 GiB size limit.
SerializeAsString() is non-deterministic by default for messages containing maps: hash-table iteration order leaks into the bytes. If you hash, sign, or compare serialized output, route through a CodedOutputStream with SetSerializationDeterministic(true) instead.
5.2 Text Format
A human-readable format useful for golden tests, configs, and debugging.
#include <google/protobuf/text_format.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <fcntl.h>
#include <unistd.h>
bool WriteProtoToTextFile(const google::protobuf::Message& proto,
const std::string& filename) {
int fd = ::open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd == -1) return false;
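google::protobuf::io::FileOutputStream output(fd);
bool ok = google::protobuf::TextFormat::Print(proto, &output);
// FileOutputStream buffers writes. Close() flushes the buffer and then closes
// fd; calling ::close(fd) directly here would drop whatever is still buffered.
const bool closed = output.Close();
return ok && closed;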
}
bool ReadProtoFromTextFile(const std::string& filename,
google::protobuf::Message* proto) {
int fd = ::open(filename.c_str(), O_RDONLY);
if (fd == -1) return false;
google::protobuf::io::FileInputStream input(fd);
bool ok = google::protobuf::TextFormat::Parse(&input, proto);
::close(fd);
return ok;
}
For ad-hoc debug printing, proto.DebugString() and proto.ShortDebugString() produce the text format directly to a std::string.
5.3 JSON
#include <google/protobuf/util/json_util.h>
std::string json;
google::protobuf::util::MessageToJsonString(u, &json);
myapp::v1::User u2;
google::protobuf::util::JsonStringToMessage(json, &u2);
By default, fields with default values are omitted and field names use camelCase. Pass a JsonPrintOptions to keep zero values, preserve snake_case field names, or pretty-print.
6. Arenas
Allocating each message individually with new is slow and fragments the heap. Arenas allocate all messages from one region and free them as a block, which can be a large win on parsing-heavy workloads.
#include <google/protobuf/arena.h>
google::protobuf::Arena arena;
auto* u = google::protobuf::Arena::Create<myapp::v1::User>(&arena);
u->set_id(42);
// All messages allocated on `arena` are freed when `arena` goes out of scope.
// Do NOT delete `u` yourself.
// Note: older code uses Arena::CreateMessage<T>; that's deprecated and
// scheduled for removal in protobuf v30. Use Arena::Create<T> in new code.
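To illustrate the region idea without depending on protobuf, the sketch below uses the standard library's std::pmr::monotonic_buffer_resource as a stand-in. This is an analogy only: it is not how google::protobuf::Arena is implemented, and Record, CountingResource, and UpstreamAllocationsFor are hypothetical names. The point is that many small objects end up costing very few calls to the underlying allocator, and nothing is freed one by one.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <new>

// Counts how often the region has to go back to the general-purpose heap.
class CountingResource : public std::pmr::memory_resource {
 public:
  int allocations() const { return allocations_; }

 private:
  void* do_allocate(std::size_t bytes, std::size_t align) override {
    ++allocations_;
    return std::pmr::new_delete_resource()->allocate(bytes, align);
  }
  void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
    std::pmr::new_delete_resource()->deallocate(p, bytes, align);
  }
  bool do_is_equal(
      const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }
  int allocations_ = 0;
};

// Hypothetical stand-in for a small message.
struct Record {
  long id;
  double score;
};

// Creates `n` records inside one region and reports how many heap
// allocations the region needed to serve them.
int UpstreamAllocationsFor(int n) {
  CountingResource upstream;
  {
    std::pmr::monotonic_buffer_resource region(64 * 1024, &upstream);
    std::pmr::polymorphic_allocator<Record> alloc(&region);
    for (int i = 0; i < n; ++i) {
      Record* r = alloc.allocate(1);  // bump-pointer allocation in the region
      ::new (r) Record{i, 0.0};
    }
  }  // the region's destructor hands every block back at once
  return upstream.allocations();
}
```

Calling UpstreamAllocationsFor(1000) reports a handful of heap allocations at most rather than a thousand, which is the same effect an arena has on a message tree.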
Caveats:
- A message that owns sub-messages must agree on the arena. Mixing arena-allocated and heap-allocated submessages is undefined.
- string and bytes fields are still heap-allocated by default. For zero-allocation parsing you also want [ctype = STRING_PIECE] or the new string_view accessors (recent versions).
- Arena allocation methods are thread-safe, but Reset() and destruction are not: synchronize with all allocating threads before resetting or destroying an arena.
Message::Swap never moves a message between arenas. When both messages live on the same arena (or both on the heap) it swaps their internals cheaply; when they differ it falls back to copying, so a.Swap(&b) can cost as much as deep-copying both messages. Read the Swap docs before assuming it is cheap.
7. Reflection and Any
Every generated message exposes a Descriptor and Reflection. With them you can read/write fields by name without knowing the message type at compile time — useful for generic tooling (config diffs, validators, debuggers).
const auto* desc = u.GetDescriptor();
const auto* refl = u.GetReflection();
const auto* field = desc->FindFieldByName("name");
std::cout << refl->GetString(u, field);
google.protobuf.Any packs an arbitrary message + its type URL. Useful for plugin-style APIs where the runtime type is decided per call:
import "google/protobuf/any.proto";
message Envelope { google.protobuf.Any payload = 1; }
myapp::v1::User user;
user.set_id(1);
Envelope env;
env.mutable_payload()->PackFrom(user);
myapp::v1::User out;
if (env.payload().Is<myapp::v1::User>()) {
env.payload().UnpackTo(&out);
}
8. Schema Evolution
The protobuf wire format is unknown-field tolerant: an old binary parsing a message with new fields preserves them and re-serializes them unchanged. To stay compatible:
- Never reuse a tag number. If you delete a field, reserve its tag and name so future edits can't recycle them:
    message User {
      reserved 4, 7 to 9;
      reserved "old_email", "legacy_id";
    }
- Never change a field's type in a way that changes the wire encoding. Some changes are wire-compatible (int32 ↔ int64 ↔ uint32 ↔ uint64 ↔ bool); most are not. See the official "Updating a Message Type" guide.
- Adding fields is always safe. Old code ignores them.
- Removing optional fields is safe at the wire level, but reserve to prevent later reuse.
- Don't change oneof membership: moving a field in or out of a oneof is wire-compatible but changes presence semantics for old code.
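"Wire-compatible" for the int32/int64/uint32/uint64/bool group does not mean "lossless". All five share the varint wire type, and the official guide describes a mismatched read as behaving like a C++ cast. A sketch of that behavior with hypothetical helper names, assuming the reader receives the already-decoded 64-bit varint:

```cpp
#include <cassert>
#include <cstdint>

// Writer side: a signed value goes on the wire sign-extended to 64 bits.
constexpr uint64_t WriteInt64(int64_t v) { return static_cast<uint64_t>(v); }

// Reader side: each declared type interprets the same 64-bit varint
// as if it had been cast to that type.
constexpr int64_t ReadAsInt64(uint64_t wire) {
  return static_cast<int64_t>(wire);
}
constexpr int32_t ReadAsInt32(uint64_t wire) {
  return static_cast<int32_t>(static_cast<uint32_t>(wire));  // drops high bits
}
constexpr uint32_t ReadAsUint32(uint64_t wire) {
  return static_cast<uint32_t>(wire);
}
constexpr bool ReadAsBool(uint64_t wire) { return wire != 0; }
```

Small values survive any of these changes: 42 written as int64 reads back as 42 under int32. A value such as 2^32 + 5 written as int64 reads back as 5 under int32, and -1 reads back as 4294967295 under uint32, so a type change inside this group is only safe while old data stays within the narrower range.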
9. Build Integration Sketches
CMake (using the upstream protobuf-config.cmake):
find_package(Protobuf CONFIG REQUIRED)
add_library(myapp_proto user.proto post.proto)
target_link_libraries(myapp_proto PUBLIC protobuf::libprotobuf)
protobuf_generate(TARGET myapp_proto LANGUAGE cpp)
target_include_directories(myapp_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
Bazel:
load("@com_google_protobuf//bazel:cc_proto_library.bzl", "cc_proto_library")
load("@com_google_protobuf//bazel:proto_library.bzl", "proto_library")
proto_library(name = "user_proto", srcs = ["user.proto"])
cc_proto_library(name = "user_cc_proto", deps = [":user_proto"])
10. Sharing Schemas Across Languages
The biggest payoff of Protobuf is that the .proto file becomes a wire-format contract every language can speak. A message serialized by C++ parses cleanly in Python, Go, Java, or any other language with a generator — without any custom serializer code per pair of languages.
Compile the same schema to Python alongside C++:
protoc -I=src/proto \
--cpp_out=gen/cpp \
--python_out=gen/py \
src/proto/user.proto
Use it from Python:
# gen/py is on PYTHONPATH
from user_pb2 import User
# Build and serialize on the Python side
u = User(id=42, name="yu", email="[email protected]")
data = u.SerializeToString() # bytes -- identical wire format to C++
# Parse bytes produced by another language
u2 = User()
u2.ParseFromString(data)
print(u2.name, u2.id)
The bytes that Python's SerializeToString produces are byte-for-byte interchangeable with C++'s Message::SerializeToString. So a typical layered system writes the schema once, compiles it for every component's language, and the wire format glues them together:
- C++ ↔ Python: a C++ data pipeline writes Protobuf records to disk or a queue; a Python analytics job reads them back.
- C++ ↔ Go ↔ TypeScript: a Go gRPC service exchanges request/response Protobuf messages with a C++ backend and a TypeScript frontend; the .proto is the single source of truth for all three.
- Versioning: any party can add new fields without breaking older parties (see § 8).
For RPC, the natural companion is gRPC — define a service block in the .proto, run protoc --grpc_out=... per language, get matching client/server stubs:
service UserService {
rpc GetUser(GetUserRequest) returns (User);
rpc StreamUsers(StreamUsersRequest) returns (stream User);
}
The Python tooling lives in the protobuf and grpcio-tools PyPI packages; install both, then python -m grpc_tools.protoc ... is the equivalent of running protoc with the gRPC plugin.
11. References
- Protocol Buffers official site: language guides, runtime docs, downloads.
- Proto3 Language Guide: schema semantics, field rules, evolution.
- C++ Generated Code Reference: exactly what protoc produces for each schema element.
- C++ API Reference: Message, Reflection, TextFormat, Arena, JsonUtil.
- Updating a Message Type: the canonical compatibility rules.
- protobuf GitHub repo: source, releases, issues.