C++ Protobuf
- Description: Protobuf in C++: `.proto` schema, generated message API, binary/text/JSON serialization, `oneof`/`repeated`/`map`, arenas, schema evolution, sharing schemas with Python
- My Notion Note ID: K2A-B2-3
- Created: 2020-01-13
- Updated: 2026-04-30
- License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io
Table of Contents
- 1. Schema-First Serialization
- 2. Schema (`.proto`) Basics
- 3. Generating C++ Code
- 4. The Generated Message API
- 5. Serialization and Parsing
- 6. Arenas
- 7. Reflection and `Any`
- 8. Schema Evolution
- 9. Build Integration Sketches
- 10. Sharing Schemas Across Languages
- 11. References
1. Schema-First Serialization
- `.proto` file declares data; `protoc` compiles it into classes for C++, Python, Go, Java, etc.
- Wire format = binary, unknown-field tolerant: old binaries parsing messages with new fields preserve them + re-serialize unchanged. Rules in § 8.
- "protobuf" in C++ usually = proto3 with Google's `libprotobuf` runtime. Recent releases added protobuf editions as successor to proto2/proto3 numbering; proto3 syntax still most-used.
2. Schema (.proto) Basics
syntax = "proto3";
package myapp.v1;
import "google/protobuf/timestamp.proto";
message User {
int64 id = 1;
string name = 2;
string email = 3;
google.protobuf.Timestamp created = 4;
}
- Each field has a tag number (`= 1`, `= 2`, ...). Tags = what's on wire; names are not.
- Tags 1–15 → 1 byte to encode. Tags 16–2047 → 2 bytes; larger tags take more. Reserve hot fields for low range.
- Files in same package share a namespace. C++ generator: `package myapp.v1` → `namespace myapp::v1`.
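The tag-size rule above follows from how a field key is encoded: the key is the varint `(field_number << 3) | wire_type`, and a varint carries 7 payload bits per byte. The sketch below is hand-rolled arithmetic, not a libprotobuf API; `VarintSize` and `KeySize` are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>

// Wire types from the protobuf encoding spec (groups omitted).
enum WireType : std::uint32_t {
  kVarint = 0,
  kFixed64 = 1,
  kLengthDelimited = 2,
  kFixed32 = 5,
};

// Bytes needed by a base-128 varint: 7 payload bits per byte.
constexpr std::size_t VarintSize(std::uint64_t value) {
  std::size_t bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// A field key is the varint (field_number << 3) | wire_type, so only
// 4 bits of field number fit in the first byte: tags 1..15.
constexpr std::size_t KeySize(std::uint32_t field_number, WireType wire_type) {
  return VarintSize((static_cast<std::uint64_t>(field_number) << 3) | wire_type);
}
```

With these helpers, `KeySize(15, kVarint)` is 1 and `KeySize(16, kVarint)` is 2; the key stays at 2 bytes up to tag 2047.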
2.1 Scalar Types
| `.proto` | C++ | Notes |
|---|---|---|
| `double` | `double` | |
| `float` | `float` | |
| `int32` / `int64` | `int32_t` / `int64_t` | varint encoding; inefficient for negatives (a negative value always takes 10 bytes) |
| `sint32` / `sint64` | `int32_t` / `int64_t` | zig-zag encoded; use for negative-prone values |
| `uint32` / `uint64` | `uint32_t` / `uint64_t` | varint |
| `fixed32` / `fixed64` | `uint32_t` / `uint64_t` | always 4 / 8 bytes; better for large values (often above 2^28 / 2^56) |
| `sfixed32` / `sfixed64` | `int32_t` / `int64_t` | signed fixed |
| `bool` | `bool` | |
| `string` | `std::string` | must be valid UTF-8; passing arbitrary bytes can break clients in other languages; use `bytes` for opaque data |
| `bytes` | `std::string` | arbitrary bytes |
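To make the `int32` vs `sint32` rows concrete, here is a minimal sketch of the two encodings' sizes. It is standalone arithmetic rather than a libprotobuf call, and the function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes needed by a base-128 varint: 7 payload bits per byte.
constexpr std::size_t VarintSize(std::uint64_t value) {
  std::size_t bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// int32: the value is sign-extended to 64 bits before varint encoding,
// so every negative number sets the top bit and costs 10 bytes.
constexpr std::size_t Int32Size(std::int32_t v) {
  return VarintSize(static_cast<std::uint64_t>(static_cast<std::int64_t>(v)));
}

// sint32: zig-zag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ...
// so small magnitudes stay small regardless of sign.
constexpr std::uint32_t ZigZag32(std::int32_t v) {
  return (static_cast<std::uint32_t>(v) << 1) ^ (v < 0 ? 0xFFFFFFFFu : 0u);
}

constexpr std::size_t Sint32Size(std::int32_t v) {
  return VarintSize(ZigZag32(v));
}
```

Under this sketch `Int32Size(-1)` is 10 while `Sint32Size(-1)` is 1. The trade-off is that zig-zag halves the positive range per byte: `Sint32Size(64)` is already 2, where `Int32Size(64)` is 1.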
2.2 Field Rules: singular, optional, repeated, map
message Post {
string title = 1; // singular (default in proto3)
optional string body = 2; // explicit presence (proto3, since 3.15)
repeated string tags = 3; // 0..N elements
map<string, int32> reactions = 4; // string -> int32
}
- Proto3 singular scalar fields always have a value; there is no "is set" check. A missing scalar reads as the type's zero (0, "", false).
- For "unset" vs "set to zero" distinction → mark field `optional`. Generated class then gains `has_field()` (`clear_field()` exists either way).
- Message-typed fields always have presence regardless.
- Proto3 dropped proto2's `required`: foot-gun for schema evolution (can never safely remove). Validate at app layer instead.
- Tag numbers are forever. Only thing on wire identifying a field. Reserve hot fields for tags 1–15. Never reuse a tag number. Use `reserved` (see § 8) when removing a field.
2.3 Enums and Nested Messages
message Order {
enum Status {
STATUS_UNSPECIFIED = 0; // proto3 enums must have a 0 value
STATUS_PENDING = 1;
STATUS_SHIPPED = 2;
}
message LineItem {
string sku = 1;
int32 count = 2;
}
int64 id = 1;
Status status = 2;
repeated LineItem items = 3;
}
- Google style guide: prefix enum values with the enum name, and reserve `_UNSPECIFIED = 0` so the default-constructed value is always meaningful.
2.4 oneof
- For "at most one of these fields is set" (the oneof can also be entirely unset):
message Event {
int64 timestamp = 1;
oneof body {
Login login = 2;
Logout logout = 3;
Click click = 4;
}
}
- Setting any field in oneof clears the others.
- Generated C++: `body_case()` returning an enum (`Event::kLogin`, `Event::kLogout`, ..., or `Event::BODY_NOT_SET` when nothing is set) + per-field accessors.
3. Generating C++ Code
protoc \
--proto_path=src/proto \
--cpp_out=gen \
src/proto/user.proto
- Produces `gen/user.pb.h` + `gen/user.pb.cc`, mirroring the directory layout under `--proto_path`.
- Add `gen/` to include path; compile `.pb.cc` into your library.
Useful flags:
- `-I path`: alias for `--proto_path`. Repeatable for multiple roots.
- `--cpp_out=...`: emit C++.
- `--grpc_out=...` (with gRPC plugin): emit `.grpc.pb.{h,cc}` for service stubs.
- `--descriptor_set_out=foo.desc`: emit binary `FileDescriptorSet` for runtime/dynamic use.
4. The Generated Message API
- For each message, `protoc` generates a class with:
  - Default ctor: all-zero message.
  - Accessors per field: getter `name()`, setter `set_name(value)`, mutable accessor `mutable_name()` for sub-messages/strings, `clear_name()`, (for `optional`/message fields) `has_name()`.
  - Repeated field accessors: `name(i)`, `name_size()`, `add_name()`, `mutable_name(i)`, `clear_name()`, `name()` → `RepeatedField`/`RepeatedPtrField` for iteration.
  - Map field accessors: `name()` / `mutable_name()` → `Map<K,V>` you can index.
  - Lifecycle: `Swap`, `CopyFrom`, `MergeFrom`, `Clear`, `IsInitialized`, `ByteSizeLong`.
  - Reflection hooks: `GetDescriptor`, `GetReflection`.
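#include <string>

#include "absl/time/clock.h"
#include "absl/time/time.h"
#include "post.pb.h"  // assumes Post is defined in post.proto under package myapp.v1
#include "user.pb.h"

void BuildMessages() {
  myapp::v1::User u;
  u.set_id(42);
  u.set_name("yu");
  u.set_email("[email protected]");

  // Sub-message: mutable_created() creates the Timestamp on first use
  // and returns a pointer owned by `u`.
  auto* ts = u.mutable_created();
  ts->set_seconds(absl::ToUnixSeconds(absl::Now()));

  // Repeated: add_tags() returns a pointer to a new element, or takes the value.
  myapp::v1::Post p;
  p.set_title("Hello");
  *p.add_tags() = "intro";
  p.add_tags("greeting");

  // Map: mutable_reactions() exposes the underlying map for indexing.
  (*p.mutable_reactions())["like"] = 3;

  // Read: tags() is iterable with a range-for.
  for (const std::string& tag : p.tags()) { /* ... */ }
}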
- Don't keep raw pointers/references across `mutable_*` calls on repeated/map fields. Adding an element can reallocate underlying storage, invalidating everything you held. For stable refs → pull values into your own container.
5. Serialization and Parsing
5.1 Binary (Wire Format)
- Default. Compact and fast to parse. Deterministic enough for byte-by-byte hashing if you call `CodedOutputStream::SetSerializationDeterministic(true)` on the stream; that guarantee holds for a given binary, not across protobuf versions or languages.
#include "user.pb.h"
#include <fstream>
bool WriteBinary(const myapp::v1::User& u, const std::string& path) {
std::ofstream out(path, std::ios::binary | std::ios::trunc);
return u.SerializeToOstream(&out);
}
bool ReadBinary(myapp::v1::User* u, const std::string& path) {
std::ifstream in(path, std::ios::binary);
return u->ParseFromIstream(&in);
}
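// Or to/from a string buffer (helper name is illustrative):
bool RoundTripString(const myapp::v1::User& u, myapp::v1::User* out) {
  std::string buf;
  return u.SerializeToString(&buf) && out->ParseFromString(buf);
}
// Or to/from a fixed byte array sized with ByteSizeLong():
bool RoundTripArray(const myapp::v1::User& u, myapp::v1::User* out) {
  std::string bytes(u.ByteSizeLong(), '\0');
  const int size = static_cast<int>(bytes.size());
  return u.SerializeToArray(&bytes[0], size) &&
         out->ParseFromArray(bytes.data(), size);
}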
- `SerializeTo*` methods return false if the message is not initialized (required field missing). Proto3 has no required → failure modes in practice = I/O errors or a message over the 2 GiB size limit.
- `SerializeAsString()` is non-deterministic by default for messages with maps: hash-table iteration order leaks into bytes. If you hash, sign, or compare serialized output → route through `CodedOutputStream` with `SetSerializationDeterministic(true)`.
5.2 Text Format
- Human-readable format. Useful for golden tests, configs, debugging.
#include <google/protobuf/text_format.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <fcntl.h>
#include <unistd.h>
bool WriteProtoToTextFile(const google::protobuf::Message& proto,
const std::string& filename) {
int fd = ::open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd == -1) return false;
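google::protobuf::io::FileOutputStream output(fd);
bool ok = google::protobuf::TextFormat::Print(proto, &output);
// FileOutputStream buffers writes. Close() flushes the buffer and then closes
// fd; calling ::close(fd) directly here would drop any unflushed text.
ok = output.Close() && ok;
return ok;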
}
bool ReadProtoFromTextFile(const std::string& filename,
google::protobuf::Message* proto) {
int fd = ::open(filename.c_str(), O_RDONLY);
if (fd == -1) return false;
google::protobuf::io::FileInputStream input(fd);
bool ok = google::protobuf::TextFormat::Parse(&input, proto);
::close(fd);
return ok;
}
- For ad-hoc debug printing → `proto.DebugString()` / `proto.ShortDebugString()` return text-format-style output as `std::string`. Meant for humans; use `TextFormat` when the output must parse back.
5.3 JSON
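#include <google/protobuf/util/json_util.h>
#include <string>
#include "user.pb.h"

// Illustrative helper: both calls return a status object, so check ok().
bool JsonRoundTrip(const myapp::v1::User& u, myapp::v1::User* u2) {
  std::string json;
  if (!google::protobuf::util::MessageToJsonString(u, &json).ok()) return false;
  return google::protobuf::util::JsonStringToMessage(json, u2).ok();
}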
- Default: zero-value fields omitted, field names camelCase.
- Pass `JsonPrintOptions` to keep zero values, preserve `snake_case` names, or pretty-print.
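The default camelCase naming can be approximated as "drop each underscore and upper-case the character after it". The function below is a standalone sketch of that rule, not the library's own code; `SnakeToLowerCamel` and the field name `created_at` are hypothetical. An explicit `json_name` option on a field overrides the derived name.

```cpp
#include <cctype>
#include <string>

// Approximates the default proto3 JSON field naming:
// "created_at" becomes "createdAt"; names without underscores are unchanged.
std::string SnakeToLowerCamel(const std::string& field_name) {
  std::string out;
  bool capitalize_next = false;
  for (unsigned char c : field_name) {
    if (c == '_') {
      capitalize_next = true;  // drop the underscore, remember to capitalize
      continue;
    }
    out.push_back(static_cast<char>(capitalize_next ? std::toupper(c) : c));
    capitalize_next = false;
  }
  return out;
}
```

Parsers generally accept both the camelCase name and the original `.proto` name, so the naming choice mainly affects what gets printed.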
6. Arenas
- Allocating individual messages via `new` = slow + fragmented.
- Arenas: allocate all messages within one region, free as a block. Can be an order-of-magnitude win on parsing-heavy workloads; measure on your own.
#include <google/protobuf/arena.h>
google::protobuf::Arena arena;
auto* u = google::protobuf::Arena::Create<myapp::v1::User>(&arena);
u->set_id(42);
// All messages allocated on `arena` are freed when `arena` goes out of scope.
// Do NOT delete `u` yourself.
// Note: older code uses Arena::CreateMessage<T>; that's deprecated and
// scheduled for removal in protobuf v30. Use Arena::Create<T> in new code.
Caveats:
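- Message owning sub-messages must agree on the arena. The regular `set_allocated_*` / `release_*` accessors handle a mismatch by taking ownership or copying; mixing arena + heap submessages through the `unsafe_arena_*` variants = UB.
- `string`/`bytes` fields still heap-allocated by default. For zero-allocation parsing → `[ctype = STRING_PIECE]` or new `string_view` accessors (recent versions).
- Arena allocation methods are thread-safe; `Reset()` + destruction are not. Sync with all allocating threads before reset/destroy.
- `Message::Swap` does not move a message to another arena. If both sides share an arena (or both live on the heap) it swaps cheaply; if they differ, it falls back to deep copies so each message stays where it was allocated. A swap that looks O(1) can therefore copy. Read the Swap docs before assuming cost or ownership invariants.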
7. Reflection and Any
- Every generated message has `Descriptor` + `Reflection`. Read/write fields by name without knowing message type at compile time. Useful for generic tooling (config diffs, validators, debuggers).
const auto* desc = u.GetDescriptor();
const auto* refl = u.GetReflection();
const auto* field = desc->FindFieldByName("name");
std::cout << refl->GetString(u, field);
- `google.protobuf.Any` packs an arbitrary message + its type URL. Useful for plugin-style APIs where runtime type decided per call:
import "google/protobuf/any.proto";
message Envelope { google.protobuf.Any payload = 1; }
myapp::v1::User user;
user.set_id(1);
Envelope env;
env.mutable_payload()->PackFrom(user);
myapp::v1::User out;
if (env.payload().Is<myapp::v1::User>()) {
env.payload().UnpackTo(&out);
}
8. Schema Evolution
- Wire format is unknown-field tolerant — old binary parsing message with new fields preserves them + re-serializes unchanged.
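Unknown-field tolerance works because every field on the wire carries its own wire type, which tells a parser how many bytes to skip even when it has never heard of the tag. The sketch below hand-decodes a buffer and collects the raw bytes of fields outside a known set; it is a simplified stand-in for what libprotobuf does internally (no group support), and `ReadVarint` / `CollectUnknown` are illustrative names.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

// Reads one base-128 varint at `pos` and advances `pos`.
// Returns false if the buffer ends mid-varint.
bool ReadVarint(const std::string& buf, std::size_t& pos, std::uint64_t& value) {
  value = 0;
  for (int shift = 0; shift < 64 && pos < buf.size(); shift += 7) {
    const std::uint8_t byte = static_cast<std::uint8_t>(buf[pos++]);
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;
}

// Walks every field in `buf`. Fields whose number is not in `known` are
// appended to `unknown` byte-for-byte, ready to be re-emitted unchanged.
bool CollectUnknown(const std::string& buf,
                    const std::set<std::uint32_t>& known,
                    std::string& unknown) {
  std::size_t pos = 0;
  while (pos < buf.size()) {
    const std::size_t start = pos;
    std::uint64_t key = 0;
    if (!ReadVarint(buf, pos, key)) return false;
    const std::uint32_t field_number = static_cast<std::uint32_t>(key >> 3);
    std::uint64_t skip = 0;  // payload bytes still to skip after the key
    switch (key & 0x7) {
      case 0: {  // varint payload
        std::uint64_t ignored = 0;
        if (!ReadVarint(buf, pos, ignored)) return false;
        break;
      }
      case 1: skip = 8; break;  // fixed64
      case 2:                   // length-delimited: length prefix, then bytes
        if (!ReadVarint(buf, pos, skip)) return false;
        break;
      case 5: skip = 4; break;  // fixed32
      default: return false;    // groups and invalid wire types: not handled
    }
    if (skip > buf.size() - pos) return false;  // truncated payload
    pos += static_cast<std::size_t>(skip);
    if (known.count(field_number) == 0) {
      unknown.append(buf, start, pos - start);
    }
  }
  return true;
}
```

For a buffer holding fields 1, 2, and 9, a reader that knows only `{1, 2}` ends up with the exact bytes of field 9 in `unknown`. Writing those bytes back after its own fields is what lets an old binary pass new fields through.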
Compatibility rules:
- Never reuse a tag number. If you delete a field, reserve its tag and name with `reserved`:
message User {
  reserved 4, 7 to 9;
  reserved "old_email", "legacy_id";
}
- Never change field type in a way that changes wire encoding. Some changes wire-compatible (`int32` ↔ `int64` ↔ `uint32` ↔ `uint64` ↔ `bool`); most not. See "Updating a Message Type".
- Adding fields always safe. Old code ignores them.
- Removing optional fields safe at wire level, but reserve the tag to prevent reuse.
- Don't change `oneof` membership: moving in/out of oneof is wire-compatible but changes presence semantics for old code.
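The `int32` / `int64` / `uint32` / `uint64` / `bool` group is wire-compatible because all five use the same varint payload; only the reader's interpretation differs. A minimal sketch with hand-rolled helpers (illustrative names, not libprotobuf calls) shows one payload read three ways, including the truncation a 32-bit reader applies to a 64-bit value.

```cpp
#include <cstdint>
#include <string>

// Encodes a value as a base-128 varint, least-significant group first.
std::string EncodeVarint(std::uint64_t value) {
  std::string out;
  while (value >= 0x80) {
    out.push_back(static_cast<char>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<char>(value));
  return out;
}

// Decodes a varint from the start of `bytes` (assumes a well-formed input).
std::uint64_t DecodeVarint(const std::string& bytes) {
  std::uint64_t value = 0;
  int shift = 0;
  for (unsigned char byte : bytes) {
    if (shift >= 64) break;  // guard against over-long input
    value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) break;
    shift += 7;
  }
  return value;
}

// How each compatible type interprets the same decoded payload.
std::int64_t AsInt64(std::uint64_t raw) { return static_cast<std::int64_t>(raw); }
std::int32_t AsInt32(std::uint64_t raw) {
  return static_cast<std::int32_t>(static_cast<std::uint32_t>(raw));  // keeps low 32 bits
}
bool AsBool(std::uint64_t raw) { return raw != 0; }
```

Small values survive any of these type changes. A value such as `0x100000001` written as `int64` reads back as 1 through `AsInt32`, so a type change is only safe when the data actually fits the narrower type.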
9. Build Integration Sketches
CMake (upstream protobuf-config.cmake):
find_package(Protobuf CONFIG REQUIRED)
add_library(myapp_proto user.proto post.proto)
target_link_libraries(myapp_proto PUBLIC protobuf::libprotobuf)
protobuf_generate(TARGET myapp_proto LANGUAGE cpp)
target_include_directories(myapp_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})
Bazel:
load("@com_google_protobuf//bazel:cc_proto_library.bzl", "cc_proto_library")
load("@com_google_protobuf//bazel:proto_library.bzl", "proto_library")
proto_library(name = "user_proto", srcs = ["user.proto"])
cc_proto_library(name = "user_cc_proto", deps = [":user_proto"])
10. Sharing Schemas Across Languages
- Biggest payoff: `.proto` becomes a wire-format contract every language speaks.
- Message serialized by C++ parses cleanly in Python, Go, Java, etc. No custom serializer per language pair.
protoc -I=src/proto \
--cpp_out=gen/cpp \
--python_out=gen/py \
src/proto/user.proto
From Python:
# gen/py is on PYTHONPATH
from user_pb2 import User
# Build and serialize on the Python side
u = User(id=42, name="yu", email="[email protected]")
data = u.SerializeToString() # bytes -- identical wire format to C++
# Parse bytes produced by another language
u2 = User()
u2.ParseFromString(data)
print(u2.name, u2.id)
- Python's `SerializeToString` bytes = byte-for-byte interchangeable with C++'s `Message::SerializeToString`.
- Typical layered system: write schema once, compile per component language, wire format glues them together.
  - C++ ↔ Python: C++ data pipeline writes Protobuf to disk/queue; Python analytics reads back.
  - C++ ↔ Go ↔ TypeScript: Go gRPC service exchanges Protobuf with C++ backend + TS frontend; `.proto` = single source of truth.
  - Versioning: any party can add fields without breaking older readers (see § 8).
- For RPC → natural companion = gRPC. Define `service` block in `.proto`, run `protoc --grpc_out=...` per language, get matching client/server stubs:
service UserService {
rpc GetUser(GetUserRequest) returns (User);
rpc StreamUsers(StreamUsersRequest) returns (stream User);
}
- Python tooling: `protobuf` + `grpcio-tools` PyPI packages. Install both → `python -m grpc_tools.protoc ...` = `protoc` + gRPC plugin.
11. References
- Protocol Buffers official site: language guides, runtime docs, downloads.
- Proto3 Language Guide: schema semantics, field rules, evolution.
- C++ Generated Code Reference: exactly what `protoc` produces per schema element.
- C++ API Reference: `Message`, `Reflection`, `TextFormat`, `Arena`, `JsonUtil`.
- Updating a Message Type: canonical compatibility rules.
- protobuf GitHub repo: source, releases, issues.