Python Protobuf


  • Description: Protobuf in Python, generating _pb2.py, the message API, parsing/serializing binary and text formats, oneof/repeated/map fields, Any, IsInitialized, and JSON via MessageToJson
  • My Notion Note ID: K2A-D2-2
  • Created: 2023-06-28
  • Updated: 2026-05-11
  • License: Reuse is very welcome. Please credit Yu Zhang and link back to the original on yuzhang.io

1. Why Protobuf

  • Google's binary serialization format
  • A .proto file declares the data; protoc compiles it into classes for many languages
  • Wire format: binary, compact, schema-evolving, unknown-field-tolerant
  • Python output: a *_pb2.py module
  • Python API is more dynamic than C++: fields are Python attributes, repeated fields behave like lists, and JSON conversion is built in

2. Generating Python Code

Install runtime + compiler:

pip install protobuf
# protoc itself (one option):
brew install protobuf            # or download from https://github.com/protocolbuffers/protobuf/releases

Compile a .proto:

protoc -I=src --python_out=build src/addressbook.proto
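  • Produces build/addressbook_pb2.py; the build directory must already exist, and it must be on sys.path (or PYTHONPATH) for the import to work
  • For gRPC stubs: install grpcio-tools and run python -m grpc_tools.protoc with the same flags plus --grpc_python_out=build (plain protoc needs the separate gRPC plugin instead)

import addressbook_pb2
person = addressbook_pb2.Person()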

3. The Generated Message API

Given:

syntax = "proto3";

message Person {
  string name  = 1;
  int32  id    = 2;
  string email = 3;
  repeated string phones = 4;
}

The generated class behaves like a regular Python object:

p = addressbook_pb2.Person()
p.name = "Yu"
p.id   = 42
p.email = "yu@example.com"   # placeholder address
p.phones.append("555-1234")
p.phones.extend(["555-5678", "555-9012"])

p.name                # "Yu"
p.id                  # 42

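# Scalar fields default to the proto3 zero value (0, "", False).
# HasField works only on fields with explicit presence: message-typed fields,
# oneof members, and scalars declared `optional`. On a plain proto3 scalar it
# raises ValueError.
sub = addressbook_pb2.Person()
sub.HasField("email")  # valid only if `email` is declared `optional`; ValueError with the schema above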

p.Clear()             # reset all fields
p.ClearField("name")  # reset one field
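
The presence rules above can be checked without generating any code. A minimal sketch, assuming the protobuf runtime is installed; the bundled proto3 message google.protobuf.Type (type_pb2) stands in for Person here, since it has both a plain string field (name) and a message-typed field (source_context):

```python
from google.protobuf import type_pb2

# Type is a proto3 message shipped with the runtime; it stands in for Person.
t = type_pb2.Type(name="demo.Person")

# Message-typed fields track presence.
before = t.HasField("source_context")        # nothing set yet
t.source_context.file_name = "person.proto"  # writing into the submessage sets it
after = t.HasField("source_context")

# Plain proto3 scalars have no presence; HasField raises ValueError.
try:
    t.HasField("name")
    scalar_presence_error = None
except ValueError as exc:
    scalar_presence_error = exc

# ClearField resets the field to its default and drops presence.
t.ClearField("source_context")
cleared = t.HasField("source_context")
```

Here before and cleared are falsy, after is truthy, and scalar_presence_error holds the ValueError.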

Constructor accepts field kwargs:

p = addressbook_pb2.Person(name="Yu", id=42, phones=["555-1234"])
  • Copy a message: dst.CopyFrom(src) (replaces) or dst.MergeFrom(src) (merges)
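
The difference between the two copy methods is easiest to see side by side. A small sketch, assuming the protobuf runtime is installed; it uses the bundled descriptor_pb2.FileDescriptorProto as a stand-in for a generated message, because it has both singular fields (name, package) and a repeated one (dependency):

```python
from google.protobuf import descriptor_pb2

# FileDescriptorProto ships with the runtime; it stands in for a generated message.
Proto = descriptor_pb2.FileDescriptorProto

src = Proto(name="a.proto", dependency=["x.proto"])
dst = Proto(package="demo", dependency=["y.proto"])

# MergeFrom: singular fields set in src overwrite, repeated fields are appended.
merged = Proto()
merged.CopyFrom(dst)
merged.MergeFrom(src)

# CopyFrom: the target is cleared first, then becomes a copy of src.
copied = Proto()
copied.CopyFrom(dst)
copied.CopyFrom(src)
```

After this runs, merged keeps package from dst, takes name from src, and holds both dependencies, while copied is equal to src.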

4. Repeated, Map, and oneof Fields

message Order {
  repeated string items = 1;
  map<string, int32> counts = 2;

  oneof payment {
    string card = 3;
    string bank = 4;
  }
}

import order_pb2                  # assumes order_pb2.py was generated from the Order message above

o = order_pb2.Order()

# repeated: list-like, but direct assignment (o.items = [...]) raises AttributeError
o.items.append("apple")
o.items.extend(["banana", "cherry"])
o.items[:] = ["orange"]           # full replacement

# map: dict-like
o.counts["apple"] = 3
o.counts.update({"banana": 5})

# oneof: setting one field clears the others
o.card = "1234"
o.WhichOneof("payment")           # 'card'
o.bank = "ACME"                   # now WhichOneof is 'bank'; `card` is cleared
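
The oneof and map behavior can also be tried on messages that ship with the runtime. A sketch, assuming protobuf is installed: struct_pb2.Value (which declares oneof kind) stands in for the payment oneof, and struct_pb2.Struct (a map<string, Value> named fields) shows a message-valued map, which the Order example above does not cover:

```python
from google.protobuf import struct_pb2

v = struct_pb2.Value()            # bundled message with `oneof kind { ... }`
unset = v.WhichOneof("kind")      # None while no member is set

v.string_value = "1234"
first = v.WhichOneof("kind")

v.number_value = 3.0              # switching members clears string_value
second = v.WhichOneof("kind")
still_has_string = v.HasField("string_value")

# Message-valued map: reading a missing key creates the entry, then set its fields.
s = struct_pb2.Struct()
s.fields["apple"].number_value = 3.0
keys = sorted(s.fields)
```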

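For repeated message fields, append via .add(). This assumes a variant schema in which phones is declared as repeated PhoneNumber (a message with a string number field) rather than repeated string:

phone = p.phones.add()            # creates, appends, and returns a new PhoneNumber
phone.number = "555-1234"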
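
A runnable sketch of .add() on a repeated message field, assuming the protobuf runtime is installed. The bundled struct_pb2.ListValue (repeated Value values) is used as a stand-in, since the Person schema in this note declares phones as repeated string:

```python
from google.protobuf import struct_pb2

lv = struct_pb2.ListValue()       # declares `repeated Value values = 1;`

item = lv.values.add()            # creates, appends, and returns a new Value
item.string_value = "555-1234"

lv.values.add(string_value="555-5678")   # add() also accepts field kwargs

# append() stores a copy of the message passed in.
lv.values.append(struct_pb2.Value(string_value="555-9012"))

numbers = [value.string_value for value in lv.values]
```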

5. Serialization and Parsing

5.1 Binary (Wire Format)

data = p.SerializeToString()      # bytes
p2 = addressbook_pb2.Person()
p2.ParseFromString(data)          # raises DecodeError on bad input
# or
p2 = addressbook_pb2.Person.FromString(data)

# Read/write files:
with open("person.bin", "wb") as f:
    f.write(p.SerializeToString())

with open("person.bin", "rb") as f:
    p2.ParseFromString(f.read())
  • SerializePartialToString skips the required-field check that SerializeToString performs (only relevant for proto2 messages with required fields)
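
A sketch of a binary round trip with error handling, assuming the protobuf runtime is installed. It uses the bundled timestamp_pb2.Timestamp instead of Person so it runs without generated code; try_parse is a hypothetical helper name:

```python
from google.protobuf import timestamp_pb2
from google.protobuf.message import DecodeError

ts = timestamp_pb2.Timestamp(seconds=1700000000, nanos=5)
data = ts.SerializeToString()                      # bytes
restored = timestamp_pb2.Timestamp.FromString(data)


def try_parse(raw):
    """Return a parsed Timestamp, or None if raw is not valid wire format."""
    msg = timestamp_pb2.Timestamp()
    try:
        msg.ParseFromString(raw)
    except DecodeError:
        return None
    return msg


good = try_parse(data)
truncated = try_parse(b"\x08")    # a field tag whose value is missing
```

Returning None on failure is only one option; re-raising with more context is equally reasonable when bad input should be fatal.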

5.2 Text Format

  • Human-readable; useful for logs, debugging, and golden test fixtures
from google.protobuf import text_format

s = text_format.MessageToString(p)        # str
p2 = addressbook_pb2.Person()
text_format.Parse(s, p2)                  # also: text_format.Merge

# Round-trip via file:
with open("person.txt", "w") as f:
    f.write(text_format.MessageToString(p))

with open("person.txt", "r") as f:
    text_format.Parse(f.read(), p2)
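
A self-contained sketch of the text-format calls, assuming the protobuf runtime is installed; timestamp_pb2.Timestamp stands in for Person. It also shows the as_one_line option and the ParseError raised for a field the schema does not define:

```python
from google.protobuf import text_format, timestamp_pb2

ts = timestamp_pb2.Timestamp(seconds=1, nanos=2)

text = text_format.MessageToString(ts)                       # one field per line
one_line = text_format.MessageToString(ts, as_one_line=True)

# Parse fills the message passed in and also returns it.
parsed = text_format.Parse(text, timestamp_pb2.Timestamp())

try:
    text_format.Parse("no_such_field: 1", timestamp_pb2.Timestamp())
    parse_error = None
except text_format.ParseError as exc:
    parse_error = exc
```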

5.3 JSON

from google.protobuf import json_format

s = json_format.MessageToJson(p, indent=2, preserving_proto_field_name=True)
p2 = addressbook_pb2.Person()
json_format.Parse(s, p2)                  # ParseError on unknown fields (by default)
  • MessageToDict / ParseDict: go through Python dicts instead of JSON strings, useful when bridging to libraries that expect dicts
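
A sketch of the dict-based helpers and the unknown-field behavior, assuming the protobuf runtime is installed. The bundled descriptor_pb2.FileDescriptorProto is used as a stand-in because its public_dependency field makes the lowerCamelCase renaming visible; the "bogus" key is made up for the example:

```python
from google.protobuf import descriptor_pb2, json_format

Proto = descriptor_pb2.FileDescriptorProto
msg = Proto(name="a.proto", public_dependency=[0])

# Default output uses lowerCamelCase JSON names.
camel = json_format.MessageToDict(msg)
# preserving_proto_field_name keeps the names from the .proto file.
snake = json_format.MessageToDict(msg, preserving_proto_field_name=True)

restored = json_format.ParseDict(snake, Proto())

# Unknown keys are rejected unless ignore_unknown_fields=True is passed.
payload = {"name": "a.proto", "bogus": 1}
try:
    json_format.ParseDict(payload, Proto())
    strict_error = None
except json_format.ParseError as exc:
    strict_error = exc
lenient = json_format.ParseDict(payload, Proto(), ignore_unknown_fields=True)
```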

6. Any: Pack, Unpack, Is

  • google.protobuf.Any carries an arbitrary serialized message plus its type URL
from google.protobuf import any_pb2

a = any_pb2.Any()
a.Pack(p)                              # wrap a Person

if a.Is(addressbook_pb2.Person.DESCRIPTOR):
    p2 = addressbook_pb2.Person()
    a.Unpack(p2)
  • Pack records the type URL (type.googleapis.com/<full_name> by default)
  • Is checks the URL against a descriptor
  • Unpack decodes the bytes into the given message; returns True on success
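
A runnable sketch of Pack / Is / Unpack, assuming the protobuf runtime is installed. Timestamp and Duration from the bundled well-known types stand in for Person and an unrelated message type:

```python
from google.protobuf import any_pb2, duration_pb2, timestamp_pb2

ts = timestamp_pb2.Timestamp(seconds=42)

a = any_pb2.Any()
a.Pack(ts)
type_url = a.type_url             # default prefix plus the full message name

# Matching type: Is() is true and Unpack() fills the target.
out = timestamp_pb2.Timestamp()
matched = a.Is(timestamp_pb2.Timestamp.DESCRIPTOR) and a.Unpack(out)

# Mismatched type: Unpack() returns False and leaves the target untouched.
wrong = duration_pb2.Duration()
mismatch = a.Unpack(wrong)
```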

7. IsInitialized and Required Fields

  • required fields only exist in proto2
  • IsInitialized() returns True when all required fields are set, checked recursively through submessages
if not p.IsInitialized():
    missing = p.FindInitializationErrors()
    raise ValueError(f"missing: {missing}")
  • In proto3 it is trivially True, since there are no required fields
  • Escape hatch for presence: proto3 optional (enabled by default since protoc 3.15, early 2021; experimental behind a flag from 3.12) restores HasField for scalars
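
A sketch of the required-field checks on a real proto2 message, assuming the protobuf runtime is installed. The bundled descriptor_pb2.UninterpretedOption.NamePart is used because it declares two required fields (name_part and is_extension):

```python
from google.protobuf import descriptor_pb2
from google.protobuf.message import EncodeError

part = descriptor_pb2.UninterpretedOption.NamePart()

initialized_before = part.IsInitialized()
missing = sorted(part.FindInitializationErrors())   # names of the unset required fields

# SerializeToString refuses to encode an uninitialized message.
try:
    part.SerializeToString()
    encode_error = None
except EncodeError as exc:
    encode_error = exc

# SerializePartialToString skips the check.
partial = part.SerializePartialToString()

part.name_part = "deprecated"
part.is_extension = False         # explicitly set, so it counts as present
initialized_after = part.IsInitialized()
```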

8. Two Python Implementations: Pure vs UPB

  • Pure Python: slow but easy to debug
  • upb, a C extension (the default since the 4.x series, starting with 4.21; earlier releases used the C++-based cpp backend): 5–20× faster

Force a backend:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python    # or "upb" / "cpp"
  • API-compatible across backends
  • Error messages and edge-case behavior differ; test against the backend you ship
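
To confirm which backend a process actually loaded, the runtime can be asked directly. A small sketch, assuming protobuf is installed; note that api_implementation lives under google.protobuf.internal, so it is not a stable public API and may change between releases:

```python
from google.protobuf.internal import api_implementation

# One of "python", "upb", or "cpp", depending on what was available at import
# time and on PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION (read before the first
# protobuf import).
backend = api_implementation.Type()
print(f"protobuf backend: {backend}")
```

Logging this once at startup makes it easier to tell whether a backend-specific behavior difference is in play.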

9. Cross-Language Schema Sharing

  • The .proto file is the contract; generated code is a per-language build artifact

Common layout:

proto/                       # canonical schemas, checked in
    addressbook.proto
generated/
    cpp/addressbook.pb.{h,cc}
    python/addressbook_pb2.py
    go/addressbook.pb.go
  • Build systems (Bazel, CMake + custom rules, buf) regenerate from proto/ to keep languages in sync
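
Without a dedicated build system, the regeneration step can be approximated by a small script. A sketch only: protoc_commands and TARGETS are hypothetical names, the function builds the command lines without running them, and it assumes protoc is on PATH (with the protoc-gen-go plugin installed for --go_out) and that the output directories already exist:

```python
def protoc_commands(proto_dir, out_root, protos, targets):
    """Build one protoc command line per target language (does not run them)."""
    sources = [f"{proto_dir}/{name}" for name in protos]
    return {
        lang: ["protoc", f"-I={proto_dir}", f"--{flag}={out_root}/{lang}", *sources]
        for lang, flag in targets.items()
    }


# Language directory name -> protoc output flag, matching the layout above.
TARGETS = {"cpp": "cpp_out", "python": "python_out", "go": "go_out"}

commands = protoc_commands("proto", "generated", ["addressbook.proto"], TARGETS)
for lang, cmd in commands.items():
    print(lang, " ".join(cmd))
```

Each command could then be handed to subprocess.run(cmd, check=True) from a Makefile target or CI job.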

10. References