C++ Protobuf


  • Description: Protobuf in C++ — .proto schema, generated message API, binary/text/JSON serialization, oneof/repeated/map, arenas, schema evolution, sharing schemas with Python
  • My Notion Note ID: K2A-B2-3
  • Created: 2020-01-13
  • Updated: 2026-04-30
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. Schema-First Serialization

  • .proto file declares data; protoc compiles into classes for C++, Python, Go, Java, etc.
  • Wire format = binary, unknown-field tolerant: old binaries parsing messages with new fields preserve them + re-serialize unchanged. Rules in § 8.
  • "protobuf" in C++ usually = proto3 with Google libprotobuf runtime. Recent releases added protobuf editions as successor to proto2/proto3 numbering; proto3 syntax still most-used.

2. Schema (.proto) Basics

syntax = "proto3";
package myapp.v1;

import "google/protobuf/timestamp.proto";

message User {
    int64  id           = 1;
    string name         = 2;
    string email        = 3;
    google.protobuf.Timestamp created = 4;
}
  • Each field has a tag number (= 1, = 2, ...). Tags = what's on wire; names are not.
  • Tags 1–15 → 1 byte for the field key (tag + wire type). Tags 16–2047 → 2 bytes; larger tags cost more. Reserve hot fields for low range.
  • Files in same package share a namespace. C++ generator: package myapp.v1 → namespace myapp::v1.
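
The 1-byte / 2-byte split follows from how the field key is written: (field_number << 3) | wire_type, encoded as a base-128 varint. A minimal sketch in plain C++ without libprotobuf; VarintSize and FieldKeySize are hypothetical helper names, not part of any library:

```cpp
#include <cstddef>
#include <cstdint>

// Number of bytes a base-128 varint needs for `value` (7 payload bits per byte).
std::size_t VarintSize(uint64_t value) {
  std::size_t bytes = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++bytes;
  }
  return bytes;
}

// Size of the key written before each field. The wire type sits in the low
// 3 bits, so it never changes the byte count and is left as 0 here.
std::size_t FieldKeySize(uint32_t field_number) {
  return VarintSize(static_cast<uint64_t>(field_number) << 3);
}
```

Tag 15 shifts to 120 and still fits in 7 bits; tag 16 shifts to 128 and needs a second byte.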

2.1 Scalar Types

.proto               C++                    Notes
double               double
float                float
int32 / int64        int32_t / int64_t      varint encoding; inefficient for negatives
sint32 / sint64      int32_t / int64_t      zig-zag encoded; use for negative-prone values
uint32 / uint64      uint32_t / uint64_t    varint
fixed32 / fixed64    uint32_t / uint64_t    always 4 / 8 bytes; better for large values
sfixed32 / sfixed64  int32_t / int64_t      signed fixed
bool                 bool
string               std::string            must be valid UTF-8; passing arbitrary bytes can break clients in other languages; use bytes for opaque data
bytes                std::string            arbitrary bytes
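
Why int32 is "inefficient for negatives" while sint32 is not can be seen by encoding by hand. The sketch below is plain C++ and independent of libprotobuf; the helper names (EncodeVarint, EncodeInt32, ZigZag32, EncodeSint32) are made up for illustration:

```cpp
#include <cstdint>
#include <vector>

// Base-128 varint: 7 payload bits per byte, high bit set on all but the last byte.
std::vector<uint8_t> EncodeVarint(uint64_t value) {
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
  return out;
}

// int32 / int64: a negative value is sign-extended to 64 bits first,
// so it always occupies the maximum 10 bytes.
std::vector<uint8_t> EncodeInt32(int32_t v) {
  return EncodeVarint(static_cast<uint64_t>(static_cast<int64_t>(v)));
}

// sint32: zig-zag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small
// magnitudes stay small regardless of sign.
uint32_t ZigZag32(int32_t v) {
  return (static_cast<uint32_t>(v) << 1) ^ static_cast<uint32_t>(v >> 31);
}

std::vector<uint8_t> EncodeSint32(int32_t v) {
  return EncodeVarint(ZigZag32(v));
}
```

With these helpers, -1 costs 10 bytes as int32 but a single byte as sint32.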

2.2 Field Rules: singular, optional, repeated, map

message Post {
    string title         = 1;             // singular (default in proto3)
    optional string body = 2;             // explicit presence (proto3, since 3.15)
    repeated string tags = 3;             // 0..N elements
    map<string, int32> reactions = 4;     // string -> int32
}
  • Proto3 singular scalar fields always have a value — no "is set". Missing scalar reads as type's zero (0, "", false).

  • For "unset" vs "set to zero" distinction → mark field optional. Generated class then has has_field() + clear_field().

  • Message-typed fields always have presence regardless.

  • Proto3 dropped proto2's required — foot-gun for schema evolution (can never safely remove). Validate at app layer instead.

  • Tag numbers are forever. Only thing on wire identifying a field. Reserve hot fields for tags 1–15. Never reuse a tag number. Use reserved (see § 8) when removing a field.
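
The presence rules show up directly on the wire: an implicit-presence scalar equal to zero is simply not written, while an optional field that was explicitly set to zero is. A hand-rolled sketch under that assumption, in plain C++ with hypothetical helper names (this is not the generated API):

```cpp
#include <cstdint>
#include <string>

void AppendVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Writes key (wire type 0 = varint) followed by the sign-extended value.
void AppendInt32Field(std::string& out, uint32_t field_number, int32_t value) {
  AppendVarint(out, static_cast<uint64_t>(field_number) << 3);
  AppendVarint(out, static_cast<uint64_t>(static_cast<int64_t>(value)));
}

// Implicit presence (plain proto3 scalar): zero means "nothing to send".
void SerializeImplicit(std::string& out, uint32_t field_number, int32_t value) {
  if (value != 0) AppendInt32Field(out, field_number, value);
}

// Explicit presence (`optional`): written whenever set, even when the value is zero.
void SerializeOptional(std::string& out, uint32_t field_number,
                       bool has_value, int32_t value) {
  if (has_value) AppendInt32Field(out, field_number, value);
}
```

A reader of the implicit version cannot tell "never set" from "set to 0"; both produce zero bytes. The optional version emits 0x08 0x00 for field 1 set to 0.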

2.3 Enums and Nested Messages

message Order {
    enum Status {
        STATUS_UNSPECIFIED = 0;     // proto3 enums must have a 0 value
        STATUS_PENDING     = 1;
        STATUS_SHIPPED     = 2;
    }

    message LineItem {
        string sku   = 1;
        int32  count = 2;
    }

    int64 id                 = 1;
    Status status            = 2;
    repeated LineItem items  = 3;
}
  • Google style guide: prefix enum values with enum name, reserve _UNSPECIFIED = 0 so default-constructed value is always meaningful.

2.4 oneof

  • For "at most one of these fields is set" (a oneof can also be empty):
message Event {
    int64 timestamp = 1;
    oneof body {
        Login   login   = 2;
        Logout  logout  = 3;
        Click   click   = 4;
    }
}
  • Setting any field in oneof clears the others.
  • Generated C++: body_case() returning enum (Event::kLogin, Event::kLogout, ..., or Event::BODY_NOT_SET when nothing is set) + per-field accessors.
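
As a mental model only, a oneof behaves much like a std::variant whose first alternative means "not set". The sketch below is an analogy in standard C++17, not the generated protobuf API; the struct fields and Describe are invented for illustration:

```cpp
#include <string>
#include <variant>

struct Login  { std::string user; };
struct Logout { std::string user; };
struct Click  { int x = 0; int y = 0; };

// std::monostate plays the role of BODY_NOT_SET.
using Body = std::variant<std::monostate, Login, Logout, Click>;

// Dispatch on the active alternative, as one would switch on body_case().
std::string Describe(const Body& body) {
  if (std::holds_alternative<Login>(body))  return "login:" + std::get<Login>(body).user;
  if (std::holds_alternative<Logout>(body)) return "logout:" + std::get<Logout>(body).user;
  if (std::holds_alternative<Click>(body))  return "click";
  return "not set";
}
```

Assigning a new alternative replaces the previous one, which mirrors "setting any field in oneof clears the others".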

3. Generating C++ Code

protoc \
    --proto_path=src/proto \
    --cpp_out=gen \
    src/proto/user.proto
  • Produces gen/user.pb.h + gen/user.pb.cc mirroring directory under --proto_path.
  • Add gen/ to include path; compile .pb.cc into your library.

Useful flags:

  • -I path — alias for --proto_path. Repeatable for multiple roots.
  • --cpp_out=... — emit C++.
  • --grpc_out=... (with gRPC plugin) — emit .grpc.pb.{h,cc} for service stubs.
  • --descriptor_set_out=foo.desc — emit binary FileDescriptorSet for runtime/dynamic use.

4. The Generated Message API

  • For each message, protoc generates a class with:

  • Default ctor — all-zero message.

  • Accessors per field — getter name(), setter set_name(value), mutable accessor mutable_name() for sub-messages/strings, clear_name(), (for optional/message fields) has_name().

  • Repeated field accessors: name(i), name_size(), add_name(), mutable_name(i), clear_name(), name() → RepeatedField/RepeatedPtrField for iteration.

  • Map field accessors: name()/mutable_name() → Map<K,V> you can index.

  • Lifecycle: Swap, CopyFrom, MergeFrom, Clear, IsInitialized, ByteSizeLong.

  • Reflection hooks: GetDescriptor, GetReflection.

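#include "absl/time/clock.h"   // absl::Now
#include "absl/time/time.h"    // absl::ToUnixSeconds
#include "post.pb.h"           // assumed file name for the Post message from § 2.2
#include "user.pb.h"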

myapp::v1::User u;
u.set_id(42);
u.set_name("yu");
u.set_email("yu@example.com");  // placeholder address

// Sub-message
auto* ts = u.mutable_created();
ts->set_seconds(absl::ToUnixSeconds(absl::Now()));

// Repeated
myapp::v1::Post p;
p.set_title("Hello");
*p.add_tags() = "intro";
p.add_tags("greeting");

// Map
(*p.mutable_reactions())["like"] = 3;

// Read
for (const std::string& tag : p.tags()) { /* ... */ }
  • Don't keep raw pointers/references across mutable_* calls on repeated/map fields. Adding element can reallocate underlying storage, invalidating everything you held. For stable refs → pull values into your own container.

5. Serialization and Parsing

5.1 Binary (Wire Format)

  • Default. Compact and fast to parse. Output is byte-stable only if you call CodedOutputStream::SetSerializationDeterministic(true) on the stream, and even then only for a given binary / library version; it is not a canonical form across versions or languages.
#include "user.pb.h"
#include <fstream>

bool WriteBinary(const myapp::v1::User& u, const std::string& path) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    return u.SerializeToOstream(&out);
}

bool ReadBinary(myapp::v1::User* u, const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return u->ParseFromIstream(&in);
}

// Or to/from a string buffer:
std::string buf;
u.SerializeToString(&buf);
u.ParseFromString(buf);

// Or to/from a fixed byte array:
u.SerializeToArray(ptr, size);
u.ParseFromArray(ptr, size);
  • SerializeToFoo returns false if message not initialized (required field missing). Proto3 has no required → only failure mode in practice = I/O.

  • SerializeAsString() is non-deterministic by default for messages with maps — hash-table iteration order leaks into bytes. If you hash, sign, or compare serialized output → route through CodedOutputStream with SetSerializationDeterministic(true).
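
What deterministic mode buys for maps can be sketched by hand: a map field goes on the wire as repeated entries {key = 1; value = 2}, and emitting them in sorted key order removes the hash-table order from the bytes. The code below is an illustration in plain C++, not libprotobuf's implementation; SerializeMapSorted is a hypothetical name and it assumes a map<string, int32> field:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

void AppendVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

// Encodes a map<string, int32> field as length-delimited entries, sorted by key.
// This sketch always writes both key and value of every entry.
std::string SerializeMapSorted(uint32_t field_number,
                               const std::unordered_map<std::string, int32_t>& m) {
  std::vector<std::pair<std::string, int32_t>> entries(m.begin(), m.end());
  std::sort(entries.begin(), entries.end());
  std::string out;
  for (const auto& kv : entries) {
    std::string entry;
    AppendVarint(entry, (1u << 3) | 2);  // entry key: field 1, length-delimited
    AppendVarint(entry, kv.first.size());
    entry += kv.first;
    AppendVarint(entry, (2u << 3) | 0);  // entry value: field 2, varint
    AppendVarint(entry, static_cast<uint64_t>(static_cast<int64_t>(kv.second)));
    AppendVarint(out, (static_cast<uint64_t>(field_number) << 3) | 2);
    AppendVarint(out, entry.size());
    out += entry;
  }
  return out;
}
```

Two maps holding the same entries now serialize to the same bytes regardless of insertion order.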

5.2 Text Format

  • Human-readable format. Useful for golden tests, configs, debugging.
#include <google/protobuf/text_format.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <fcntl.h>
#include <unistd.h>

bool WriteProtoToTextFile(const google::protobuf::Message& proto,
                          const std::string& filename) {
    int fd = ::open(filename.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1) return false;
    google::protobuf::io::FileOutputStream output(fd);
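    bool ok = google::protobuf::TextFormat::Print(proto, &output);
    // FileOutputStream buffers writes. Close() flushes the buffer and then
    // closes fd; calling ::close(fd) directly could drop unflushed data.
    bool closed = output.Close();
    return ok && closed;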
}

bool ReadProtoFromTextFile(const std::string& filename,
                           google::protobuf::Message* proto) {
    int fd = ::open(filename.c_str(), O_RDONLY);
    if (fd == -1) return false;
    google::protobuf::io::FileInputStream input(fd);
    bool ok = google::protobuf::TextFormat::Parse(&input, proto);
    ::close(fd);
    return ok;
}
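  • For ad-hoc debug printing → proto.DebugString() / proto.ShortDebugString() return a text-format-style std::string. Treat it as human-only output; if the text must be parsed back, use TextFormat::PrintToString.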

5.3 JSON

#include <google/protobuf/util/json_util.h>

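std::string json;
auto status = google::protobuf::util::MessageToJsonString(u, &json);
if (!status.ok()) { /* handle serialization error */ }

myapp::v1::User u2;
status = google::protobuf::util::JsonStringToMessage(json, &u2);
if (!status.ok()) { /* handle malformed JSON or unknown fields */ }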
  • Default: zero-value fields omitted, field names camelCase.
  • Pass JsonPrintOptions to keep zero values, preserve snake_case names, or pretty-print.
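
The default JSON name of a field is its snake_case name converted to lowerCamelCase. A small stand-alone sketch of that mapping; ToLowerCamel is a hypothetical helper written for this note, not a libprotobuf function:

```cpp
#include <cctype>
#include <string>

// Drops each underscore and upper-cases the character that follows it.
std::string ToLowerCamel(const std::string& field_name) {
  std::string out;
  bool capitalize_next = false;
  for (char c : field_name) {
    if (c == '_') {
      capitalize_next = true;
      continue;
    }
    out.push_back(capitalize_next
                      ? static_cast<char>(std::toupper(static_cast<unsigned char>(c)))
                      : c);
    capitalize_next = false;
  }
  return out;
}
```

So a field declared as created_at appears as createdAt in JSON unless snake_case names are preserved via the print options.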

6. Arenas

  • Allocating individual messages via new = slow + fragmented.
  • Arenas: allocate all messages within one region, free as a block. Can be a large win on parsing-heavy workloads with many small messages; measure for your own case.
#include <google/protobuf/arena.h>

google::protobuf::Arena arena;
auto* u = google::protobuf::Arena::Create<myapp::v1::User>(&arena);
u->set_id(42);

// All messages allocated on `arena` are freed when `arena` goes out of scope.
// Do NOT delete `u` yourself.

// Note: older code uses Arena::CreateMessage<T>; that's deprecated and
// scheduled for removal in protobuf v30. Use Arena::Create<T> in new code.

Caveats:

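  • Message and its sub-messages should live on the same arena. The safe setters (set_allocated_*) cope with a mismatch by copying or transferring ownership, which costs time; the unsafe_arena_* variants skip those checks, and mixing arena + heap objects through them = UB.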

  • string/bytes fields still heap-allocated by default. For zero-allocation parsing → [ctype = STRING_PIECE] or new string_view accessors (recent versions).

  • Arena allocation methods are thread-safe; Reset() + destruction are not. Sync with all allocating threads before reset/destroy.

  • Message::Swap never moves a message to another arena. Both on heap or both on the same arena → cheap pointer swap. Different arenas (or one on heap) → deep copy of the contents, so a.Swap(&b) can cost far more than it looks. Each message stays owned where it was allocated.
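
The core idea of "allocate in one region, free as a block" fits in a few lines. This is a toy bump allocator for illustration only, not how google::protobuf::Arena is implemented: BumpArena is a hypothetical class, it uses a single fixed block, and it only accepts trivially destructible types so it can skip destructor bookkeeping.

```cpp
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>

class BumpArena {
 public:
  explicit BumpArena(std::size_t capacity)
      : buffer_(new unsigned char[capacity]), capacity_(capacity) {}

  // Hands out the next aligned slice of the block, or nullptr when the block
  // is exhausted (a real arena would chain another block instead).
  void* Allocate(std::size_t size, std::size_t align) {
    std::size_t start = (used_ + align - 1) / align * align;
    if (start > capacity_ || size > capacity_ - start) return nullptr;
    used_ = start + size;
    return buffer_.get() + start;
  }

  // Constructs a T inside the block. No per-object free: everything is
  // released together when the arena itself is destroyed.
  template <typename T, typename... Args>
  T* Create(Args&&... args) {
    static_assert(std::is_trivially_destructible<T>::value,
                  "sketch skips destructor bookkeeping");
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "over-aligned types are not handled");
    void* p = Allocate(sizeof(T), alignof(T));
    if (p == nullptr) return nullptr;
    return new (p) T{std::forward<Args>(args)...};
  }

  std::size_t used() const { return used_; }

 private:
  std::unique_ptr<unsigned char[]> buffer_;
  std::size_t capacity_;
  std::size_t used_ = 0;
};
```

Each allocation is a pointer bump rather than a trip through the general-purpose heap, and teardown is one deallocation regardless of how many objects were created.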

7. Reflection and Any

  • Every generated message has Descriptor + Reflection. Read/write fields by name without knowing message type at compile time. Useful for generic tooling (config diffs, validators, debuggers).
const auto* desc = u.GetDescriptor();
const auto* refl = u.GetReflection();
const auto* field = desc->FindFieldByName("name");
std::cout << refl->GetString(u, field);
  • google.protobuf.Any packs arbitrary message + its type URL. Useful for plugin-style APIs where runtime type decided per call:
import "google/protobuf/any.proto";
message Envelope { google.protobuf.Any payload = 1; }

myapp::v1::User user;
user.set_id(1);
Envelope env;
env.mutable_payload()->PackFrom(user);

myapp::v1::User out;
if (env.payload().Is<myapp::v1::User>()) {
    env.payload().UnpackTo(&out);
}
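
Conceptually, an Any is just a type URL plus the serialized bytes, and the type check compares the message's full name with the part of the URL after the last '/'. A simplified model in plain C++; AnyModel, Pack, and IsType are hypothetical names for this sketch, and "type.googleapis.com/" is the default URL prefix:

```cpp
#include <cstddef>
#include <string>

struct AnyModel {
  std::string type_url;  // e.g. "type.googleapis.com/myapp.v1.User"
  std::string value;     // serialized bytes of the packed message
};

// Stores the bytes alongside a URL built from the message's full name.
AnyModel Pack(const std::string& full_name, const std::string& bytes) {
  return AnyModel{"type.googleapis.com/" + full_name, bytes};
}

// True when the text after the last '/' equals the expected full name.
bool IsType(const AnyModel& any, const std::string& full_name) {
  std::size_t slash = any.type_url.rfind('/');
  if (slash == std::string::npos) return false;
  return any.type_url.compare(slash + 1, std::string::npos, full_name) == 0;
}
```

The receiver therefore needs the concrete message type compiled in (or available via descriptors) before the payload bytes mean anything.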

8. Schema Evolution

  • Wire format is unknown-field tolerant — old binary parsing message with new fields preserves them + re-serializes unchanged.

Compatibility rules:

  • Never reuse a tag number. If you delete a field, reserve its tag and name:

    message User {
        reserved 4, 7 to 9;
        reserved "old_email", "legacy_id";
    }
    
  • Never change field type in a way that changes wire encoding. Some changes wire-compatible (int32 ↔ int64 ↔ uint32 ↔ uint64 ↔ bool, with truncation when a value doesn't fit the new type); most not. See "Updating a Message Type".

  • Adding fields with fresh tag numbers is safe. Old code keeps them as unknown fields.

  • Removing optional fields safe at wire level — but reserve to prevent reuse.

  • Don't change oneof membership — moving in/out of oneof is wire-compatible but changes presence semantics for old code.
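
Unknown-field tolerance can be sketched with a tiny hand-written parser: an "old" reader that knows only field 1 keeps the raw bytes of anything else and appends them when re-serializing. This is plain C++ for illustration, not libprotobuf's parser; OldUser and the helper names are hypothetical, and only varint and length-delimited wire types are handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

bool ReadVarint(const std::string& data, std::size_t& pos, uint64_t& out) {
  out = 0;
  for (int shift = 0; shift < 64 && pos < data.size(); shift += 7) {
    uint8_t byte = static_cast<uint8_t>(data[pos++]);
    out |= static_cast<uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;  // truncated or over-long varint
}

void AppendVarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<char>(v));
}

struct OldUser {
  int64_t id = 0;       // field 1: the only field this reader knows
  std::string unknown;  // raw bytes of every field it did not recognise
};

bool ParseOldUser(const std::string& data, OldUser& user) {
  std::size_t pos = 0;
  while (pos < data.size()) {
    std::size_t field_start = pos;
    uint64_t key = 0, value = 0;
    if (!ReadVarint(data, pos, key)) return false;
    uint64_t field_number = key >> 3, wire_type = key & 7;
    if (wire_type == 0) {         // varint
      if (!ReadVarint(data, pos, value)) return false;
    } else if (wire_type == 2) {  // length-delimited: skip over the payload
      if (!ReadVarint(data, pos, value)) return false;
      if (value > data.size() - pos) return false;
      pos += static_cast<std::size_t>(value);
    } else {
      return false;               // fixed32 / fixed64 omitted from the sketch
    }
    if (field_number == 1 && wire_type == 0) {
      user.id = static_cast<int64_t>(value);
    } else {
      user.unknown.append(data, field_start, pos - field_start);
    }
  }
  return true;
}

std::string SerializeOldUser(const OldUser& user) {
  std::string out;
  if (user.id != 0) {
    AppendVarint(out, (1u << 3) | 0);
    AppendVarint(out, static_cast<uint64_t>(user.id));
  }
  return out + user.unknown;  // unknown fields ride along untouched
}
```

Because every field carries its own key and wire type, the reader can skip a field it has never heard of without knowing its schema.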

9. Build Integration Sketches

CMake (upstream protobuf-config.cmake):

find_package(Protobuf CONFIG REQUIRED)

add_library(myapp_proto user.proto post.proto)
target_link_libraries(myapp_proto PUBLIC protobuf::libprotobuf)
protobuf_generate(TARGET myapp_proto LANGUAGE cpp)

target_include_directories(myapp_proto PUBLIC ${CMAKE_CURRENT_BINARY_DIR})

Bazel:

load("@com_google_protobuf//bazel:cc_proto_library.bzl", "cc_proto_library")
load("@com_google_protobuf//bazel:proto_library.bzl", "proto_library")

proto_library(name = "user_proto", srcs = ["user.proto"])
cc_proto_library(name = "user_cc_proto", deps = [":user_proto"])

10. Sharing Schemas Across Languages

  • Biggest payoff — .proto becomes a wire-format contract every language speaks.
  • Message serialized by C++ parses cleanly in Python, Go, Java, etc. No custom serializer per language pair.
protoc -I=src/proto \
    --cpp_out=gen/cpp \
    --python_out=gen/py \
    src/proto/user.proto

From Python:

# gen/py is on PYTHONPATH
from user_pb2 import User

# Build and serialize on the Python side
u = User(id=42, name="yu", email="yu@example.com")  # placeholder address
data = u.SerializeToString()    # bytes -- identical wire format to C++

# Parse bytes produced by another language
u2 = User()
u2.ParseFromString(data)
print(u2.name, u2.id)
  • Python's SerializeToString bytes = byte-for-byte interchangeable with C++'s Message::SerializeToString.

  • Typical layered system: write schema once, compile per component language, wire format glues them together.

  • C++ ↔ Python — C++ data pipeline writes Protobuf to disk/queue; Python analytics reads back.

  • C++ ↔ Go ↔ TypeScript — Go gRPC service exchanges Protobuf with C++ backend + TS frontend; .proto = single source of truth.

  • Versioning — any party can add fields without breaking older (see § 8).

  • For RPC → natural companion = gRPC. Define service block in .proto, run protoc --grpc_out=... per language, get matching client/server stubs:

service UserService {
    rpc GetUser(GetUserRequest) returns (User);
    rpc StreamUsers(StreamUsersRequest) returns (stream User);
}
  • Python tooling: protobuf + grpcio-tools PyPI packages. Install both → python -m grpc_tools.protoc ... = protoc + gRPC plugin.

11. References