Kafka Schema Registry and Avro Managing Message Formats

Kafka stores raw bytes. It has no built-in understanding of what those bytes represent — a JSON object, a sensor reading, a bank transfer record, or random binary data. When a producer and a consumer agree on the format out of band, everything works. But as teams grow, as producers and consumers are owned by different teams, and as message formats evolve over time, informal agreements break down. Schema Registry and Apache Avro solve this systematically by making message schemas an explicit, versioned, centrally managed contract.

The Schema Problem at Scale

Imagine a payment producer at a fintech company. It sends payment events to Kafka in JSON format. Six consumers read those events: the fraud detection service, the accounting system, the mobile notification service, the analytics pipeline, the audit log writer, and the data warehouse loader. All six parse the same JSON structure.

Six months later, the payments team needs to add a new field and rename an existing field. Without a schema management system, this change requires coordinating all six consumers simultaneously — a deployment nightmare. If any consumer misses the change, it breaks at runtime when it tries to parse a field that no longer exists under the old name.

THE SCHEMA CHAOS PROBLEM:

Producer sends:
  {"payment_id": 123, "amount": 89.99, "currency": "USD"}

Team decides to rename "amount" → "total_amount" and add "fee" field.

New format:
  {"payment_id": 123, "total_amount": 89.99, "fee": 0.99, "currency": "USD"}

Consumer A (updated) reads "total_amount" → works fine ✓
Consumer B (not updated yet) reads "amount" → field missing → CRASH ✗
Consumer C (not updated yet) reads "amount" → null → silent wrong result ✗

Without schema management: breaking changes cascade silently.
With Schema Registry: compatibility rules prevent breaking changes from publishing.

What Is a Schema Registry

A Schema Registry is a separate service that stores and manages schemas. Confluent's Schema Registry is the most widely used implementation — it is open-source and integrates directly with Kafka's serializer/deserializer ecosystem.

When a producer sends a message, instead of embedding the full schema in every message (wasteful — the schema might be larger than the data), the producer registers the schema with the Schema Registry and receives a small integer schema ID. It then embeds only this tiny schema ID (4 bytes) at the start of each message. The consumer reads the schema ID, fetches the schema from the registry (and caches it), and deserializes the message using that schema.

SCHEMA REGISTRY FLOW:

PRODUCER SIDE:
  Step 1: Producer serializes record using Avro schema
  Step 2: Producer checks: "Is this schema registered for topic orders?"
  Step 3: If not → registers schema → gets back schema_id=5
  Step 4: Sends message:
          [magic byte: 0x00][schema_id: 5 (4 bytes)][avro-binary-payload]
  (Schema ID adds only 5 bytes overhead per message)

CONSUMER SIDE:
  Step 1: Consumer reads first byte → magic byte confirms schema registry format
  Step 2: Consumer reads next 4 bytes → schema_id=5
  Step 3: Consumer checks local cache: "Do I have schema 5?"
          If not → fetches schema from registry → caches it
  Step 4: Consumer deserializes Avro binary payload using schema 5
  Step 5: Consumer gets back typed Java/Python/Go object

SCHEMA REGISTRY SERVICE:
  Stores: {schema_id: 5, subject: "orders-value", schema: "{ ... Avro JSON ... }"}
  Enforces: compatibility rules before accepting new schema versions

Apache Avro: The Standard Schema Format for Kafka

Apache Avro is the most commonly used serialization format with Kafka Schema Registry, though the registry also supports Protobuf and JSON Schema. Avro uses a JSON-based schema definition language and produces compact binary output — no field names are included in the binary data, just values in schema-defined order. This makes Avro messages 5-10x smaller than equivalent JSON.

An Avro Schema Example

AVRO SCHEMA: Payment Event

{
  "type": "record",
  "name": "Payment",
  "namespace": "com.mycompany.payments",
  "fields": [
    {"name": "payment_id",   "type": "long"},
    {"name": "user_id",      "type": "string"},
    {"name": "amount",       "type": "double"},
    {"name": "currency",     "type": "string"},
    {"name": "timestamp",    "type": "long", "logicalType": "timestamp-millis"},
    {"name": "status",       "type": {"type": "enum",
                                       "name": "PaymentStatus",
                                       "symbols": ["PENDING", "PAID", "FAILED"]}},
    {"name": "metadata",     "type": ["null", "string"], "default": null}
  ]
}

JSON equivalent of one record:
  {"payment_id": 1001, "user_id": "u_7823", "amount": 89.99,
   "currency": "USD", "timestamp": 1709654400000,
   "status": "PAID", "metadata": null}
  JSON size: ~115 bytes

Avro binary:
  [binary encoding of same values]
  Avro size: ~25 bytes
  Savings: ~78% smaller than JSON for this schema

Avro Data Types

AVRO PRIMITIVE TYPES:
  null     → null value
  boolean  → true or false
  int      → 32-bit signed integer
  long     → 64-bit signed integer
  float    → 32-bit IEEE 754 float
  double   → 64-bit IEEE 754 float
  bytes    → sequence of 8-bit unsigned bytes
  string   → Unicode character sequence

AVRO COMPLEX TYPES:
  record   → like a class/struct with named fields
  array    → list of items of one type
  map      → key-value pairs (keys are strings)
  union    → value can be one of several types: ["null","string"] means nullable string
  enum     → set of named constants: {"type":"enum","symbols":["A","B","C"]}
  fixed    → fixed-size sequence of bytes

LOGICAL TYPES (annotations on primitives):
  date             → int, days since epoch
  timestamp-millis → long, milliseconds since epoch
  decimal          → bytes or fixed, arbitrary precision
  uuid             → string, UUID format

Schema Subjects and Naming Conventions

The Schema Registry organizes schemas by subject. A subject is a named scope for schema versions. The most common naming convention is topic-based:

TopicNameStrategy (default): The subject name is the topic name with a "-value" or "-key" suffix.

DEFAULT SUBJECT NAMING:

Topic: orders
  Key schema subject:   orders-key
  Value schema subject: orders-value

Topic: payments
  Value schema subject: payments-value

All producers to the "orders" topic share the same "orders-value" schema subject.
Any producer trying to evolve the schema must be compatible with all existing versions.

RecordNameStrategy: The subject name is the Avro record's fully qualified name. Allows different record types in the same topic, each with their own schema evolution history. More flexible but requires explicit schema selection.

TopicRecordNameStrategy: Combines topic name and record name. Allows different record types per topic with topic-scoped schemas.

Schema Compatibility Rules

The Schema Registry enforces compatibility rules when you try to register a new schema version. These rules determine which schema changes are allowed and which would break existing producers or consumers.

COMPATIBILITY TYPES AND WHAT THEY ALLOW:

BACKWARD (default):
  New schema can read data written with old schema.
  → Safe for consumers to upgrade before producers.
  Allowed changes:  Add optional fields with defaults.
                    Remove fields (old data has them; new schema ignores them).
  Forbidden changes: Add required fields without defaults (old data has no value).
                     Rename fields.
                     Change field types.

FORWARD:
  Old schema can read data written with new schema.
  → Safe for producers to upgrade before consumers.
  Allowed changes:  Add required fields (old consumers ignore unknown fields).
                    Remove optional fields with defaults.

FULL:
  Both backward AND forward compatible.
  → Safest: both producers and consumers can upgrade in any order.
  Allowed changes:  Only add or remove optional fields with defaults.

NONE:
  No compatibility checking. Any change allowed.
  → Dangerous in production. Use only in development.

BACKWARD_TRANSITIVE, FORWARD_TRANSITIVE, FULL_TRANSITIVE:
  Same as above but checked against ALL historical schema versions, not just the previous one.
  Strongest guarantee: safe to skip multiple versions on upgrade.

Compatibility in Practice

BACKWARD COMPATIBLE CHANGE (allowed ✓):

Schema V1:
  {"name": "payment_id", "type": "long"}
  {"name": "amount",     "type": "double"}

Schema V2 (add optional field with default):
  {"name": "payment_id", "type": "long"}
  {"name": "amount",     "type": "double"}
  {"name": "fee",        "type": ["null","double"], "default": null}  ← NEW, optional

Old data (V1): has no "fee" field.
V2 schema reads old data → "fee" gets default value null ✓

BACKWARD INCOMPATIBLE CHANGE (rejected ✗):

Schema V2 (rename field):
  {"name": "payment_id",  "type": "long"}
  {"name": "total_amount","type": "double"}  ← renamed from "amount"

Old data: has "amount" field. V2 schema expects "total_amount".
V2 schema reads old data → "total_amount" not found → FAILS.
Registry rejects this schema: BACKWARD INCOMPATIBILITY.

Setting Up Schema Registry with Confluent

SCHEMA REGISTRY STARTUP (Confluent Platform):

# With Docker Compose:
schema-registry:
  image: confluentinc/cp-schema-registry:latest
  environment:
    SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: "broker:9092"
    SCHEMA_REGISTRY_HOST_NAME: schema-registry
    SCHEMA_REGISTRY_LISTENERS: "http://0.0.0.0:8081"
  ports:
    - "8081:8081"

# REST API for schema management:

# Register a schema:
POST http://schema-registry:8081/subjects/orders-value/versions
Content-Type: application/vnd.schemaregistry.v1+json
{
  "schema": "{\"type\":\"record\",\"name\":\"Order\",...}"
}

# Get latest schema version:
GET http://schema-registry:8081/subjects/orders-value/versions/latest

# List all subjects:
GET http://schema-registry:8081/subjects

# Check compatibility of a new schema:
POST http://schema-registry:8081/compatibility/subjects/orders-value/versions/latest
{
  "schema": "{\"type\":\"record\", ... proposed new schema ...}"
}

Response: {"is_compatible": true}

Producer and Consumer Configuration for Schema Registry

PRODUCER WITH AVRO + SCHEMA REGISTRY (Java):

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer",
          "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://schema-registry:8081");

KafkaProducer producer = new KafkaProducer<>(props);

Payment payment = Payment.newBuilder()
    .setPaymentId(1001L)
    .setUserId("u_7823")
    .setAmount(89.99)
    .setCurrency("USD")
    .setTimestamp(Instant.now().toEpochMilli())
    .setStatus(PaymentStatus.PAID)
    .build();

producer.send(new ProducerRecord<>("payments", "u_7823", payment));

CONSUMER WITH AVRO + SCHEMA REGISTRY (Java):

Properties props = new Properties();
props.put("bootstrap.servers", "broker:9092");
props.put("key.deserializer",
          "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer",
          "io.confluent.kafka.serializers.KafkaAvroDeserializer");
props.put("schema.registry.url", "http://schema-registry:8081");
props.put("specific.avro.reader", "true");  // use generated Avro classes

KafkaConsumer consumer = new KafkaConsumer<>(props);
consumer.subscribe(List.of("payments"));

ConsumerRecords records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord record : records) {
    Payment payment = record.value();
    System.out.println("Received: " + payment.getPaymentId()
                        + " amount=" + payment.getAmount());
}

Protobuf and JSON Schema: Alternative Formats

While Avro is the most widely used format with Schema Registry, Confluent Schema Registry also supports Protobuf (Protocol Buffers) and JSON Schema. Each format has distinct characteristics that make it preferable in certain contexts.

FORMAT COMPARISON:

Feature          Avro          Protobuf        JSON Schema
──────────────────────────────────────────────────────────
Binary encoding  Yes (compact) Yes (compact)   No (text)
Schema in wire   No (ID only)  No (ID only)    No (ID only)
Backward compat  Excellent     Good            Limited
Language support Broad         Excellent       Universal
gRPC support     No            Yes (native)    No
Human-readable   No            No              Yes
Schema evolution Mature        Mature          Basic

Choose Avro when: Data engineering, analytics pipelines, existing Avro usage.
Choose Protobuf when: gRPC services also involved, strong type safety needed.
Choose JSON Schema when: Debugging ease matters, teams prefer human-readable format.

Schema Registry High Availability

The Schema Registry stores schemas in a Kafka topic called _schemas with a single partition. This makes it durable — schemas survive Schema Registry restarts. For high availability, run multiple Schema Registry instances. Only one instance is the leader (handles writes). Others are read replicas. If the leader fails, a new leader is elected from the remaining instances.

SCHEMA REGISTRY HA SETUP:

Primary (leader): http://sr-1:8081
Replica 1:        http://sr-2:8081  (forwards writes to leader)
Replica 2:        http://sr-3:8081  (forwards writes to leader)

Configure a load balancer in front:
  http://schema-registry.internal:8081
  → routes to any instance for reads (cached locally)
  → routes to leader for writes (replicas redirect automatically)

Client configuration:
  schema.registry.url=http://schema-registry.internal:8081

_schemas topic: 1 partition, RF=3 (stores all registered schemas durably)

Key Points

Schema Registry centrally stores and versions message schemas. Producers embed a small schema ID (5 bytes) in each message instead of the full schema, saving bandwidth and storage.
Consumers fetch schemas by ID from the registry and cache them locally. This enables schema-aware deserialization without embedding schema in every message.
Apache Avro is the most common serialization format used with Schema Registry — compact binary encoding with strong schema evolution support.
Compatibility modes (BACKWARD, FORWARD, FULL) enforce that schema changes don't break existing producers or consumers. BACKWARD is the default and safest mode for most workflows.
Schema subjects follow the topic name by default (topic-value, topic-key). All producers to a topic share the same subject and must be compatible with existing versions.
Schema Registry stores schemas in a Kafka topic (_schemas), making them durable across restarts. Run multiple instances for high availability.

Previous lesson

Back to course

Next lesson