Validating Data Consistency Across Service Boundaries

Data Validation & Assertions | 6 min read | July 24, 2026

Your order service says the order total is 142.50. Your billing service received 142.5. Your analytics pipeline stored "142.50" as a string. Nobody filed a bug — the data just silently drifted, and now your monthly revenue report is wrong by rounding errors that compound across 40 million rows. Cross-service data inconsistency is one of the most expensive classes of production defects, and it's almost never caught by unit tests or contract tests alone.

The problem isn't that engineers don't care about consistency — it's that the tooling conversation usually stops at API schema validation. JSON Schema tells you the shape is right; it says nothing about whether the value in user_id on the downstream event actually corresponds to a live record in the upstream service, or whether a status field that's "PENDING" in one service is represented as 0 in another.

By the end of this article you'll have a concrete approach for defining, instrumenting, and asserting cross-boundary data consistency — covering contract validation with Pact, structural assertions with JSON Schema 2020-12, value-level checks with Great Expectations and dbt tests, and a lightweight CI harness that surfaces drift before it hits production.


What Cross-Boundary Data Consistency Actually Means

Cross-boundary data consistency is the guarantee that a data entity — an order, a user record, a payment event — retains semantic equivalence as it moves between services, even when its representation changes. It's distinct from API contract testing (which validates message shape) and from referential integrity (which is a database-level constraint). It lives in the space between: the producer emits a valid message, the consumer receives a valid message, but the meaning has shifted somewhere in transit — a currency field lost precision, an enum was mapped incorrectly, a nullable field became absent rather than null.
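One way to make "semantic equivalence" testable is to reduce every representation to a canonical form before comparing. The sketch below is illustrative only: the integer status codes, the coupon_code field, and the helper names are assumptions rather than part of any real service, and a real mapping has to come from the teams that own each representation.

```python
from decimal import Decimal

# Hypothetical mapping from one service's integer status codes to the
# canonical string names used by the other service.
STATUS_BY_CODE = {0: "PENDING", 1: "CONFIRMED", 2: "CANCELLED"}


def canonical_total(value):
    """Convert a total received as str, int, or float to a two-place Decimal."""
    # Going through str() avoids importing binary float noise into Decimal.
    return Decimal(str(value)).quantize(Decimal("0.01"))


def canonical_status(value):
    """Map an integer status code or a string name to the canonical name."""
    if isinstance(value, int):
        return STATUS_BY_CODE[value]
    return value.upper()


def canonical_order(record):
    """Reduce one representation of an order to a comparable tuple."""
    return (
        canonical_total(record["total"]),
        canonical_status(record["status"]),
        record.get("coupon_code"),  # an absent key and an explicit null compare equal
    )


def same_order(a, b):
    """True when two representations describe the same order semantically."""
    return canonical_order(a) == canonical_order(b)


order_service = {"total": 142.50, "status": "PENDING", "coupon_code": None}
analytics_row = {"total": "142.50", "status": 0}  # string total, coded status, absent key
drifted_row = {"total": "142.49", "status": 0}    # one cent lost in transit

print(same_order(order_service, analytics_row))  # True
print(same_order(order_service, drifted_row))    # False
```

Treating an absent key as equal to null is itself a design choice; if your consumers distinguish the two, the canonical form should preserve that difference instead.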

In a modern architecture this problem appears at every seam: REST responses consumed by a BFF, Kafka events ingested by a data warehouse, gRPC payloads deserialized by a downstream microservice, dbt models reading from Postgres tables written by an application ORM. Each hop is an opportunity for silent transformation. The validation strategy has to be layered — schema contracts at the API boundary, value-level assertions at the storage boundary, and lineage checks at the pipeline boundary — because no single tool covers all three.
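Because every hop can transform a value, a simple cross-hop reconciliation is often a useful first check. The following is a minimal sketch that assumes you can export order totals from each store as a mapping of order_id to total; the source names and sample data are made up for illustration.

```python
from decimal import Decimal


def reconcile_totals(**sources):
    """Return {order_id: {source: total}} for orders whose totals disagree.

    Each keyword argument is one source, given as a mapping of order_id to a
    total in whatever representation that source uses (str, int, or float).
    An order missing from a source shows up as None for that source.
    """
    all_ids = set().union(*(rows.keys() for rows in sources.values()))
    mismatches = {}
    for order_id in sorted(all_ids):
        seen = {
            name: Decimal(str(rows[order_id])) if order_id in rows else None
            for name, rows in sources.items()
        }
        # Decimal("142.5") and Decimal("142.50") are equal and hash alike,
        # so representation differences alone do not count as a mismatch.
        if len(set(seen.values())) > 1:
            mismatches[order_id] = seen
    return mismatches


postgres = {"ORD-00000123": "142.50", "ORD-00000124": "10.00"}
kafka = {"ORD-00000123": 142.5, "ORD-00000124": 10.0}
warehouse = {"ORD-00000123": "142.50", "ORD-00000124": "9.99"}

mismatches = reconcile_totals(postgres=postgres, kafka=kafka, warehouse=warehouse)
print(sorted(mismatches))  # ['ORD-00000124']
```

A check like this compares values rather than shapes, which is the gap that schema contracts leave open.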

Building a Layered Consistency Validation Stack

Start at the API boundary with Pact (v10+ for the Rust core) for consumer-driven contract tests, but extend the pact with value constraints that JSON Schema alone won't enforce. Pact's matching DSL supports decimal type matching and regex constraints — use them aggressively on financial and identifier fields:

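# pact-python classic DSL (Consumer/Provider API), consumer side.
# Matcher names differ in the newer Rust-backed API; check your installed version.
from pact import Consumer, Provider, Like, Term

pact = Consumer("billing-service").has_pact_with(Provider("order-service"))

(pact
  .upon_receiving("a request for an order total")
  .with_request("GET", "/orders/123")
  .will_respond_with(200, body={
      "order_id": Term(r"^ORD-\d{8}$", "ORD-00000123"),  # regex matcher plus example value
      "total":    Like(142.50),     # type matcher: must be a JSON number, not a string
      "currency": "USD",            # exact match
      "status":   Like("PENDING"),  # any string; does not pin the enum values
  })
)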

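The type matcher on total rejects a string such as "142.50" where a number is expected, which stops that kind of drift at the contract layer. It cannot tell 142.5 from 142.50, because both are the same JSON number; if the trailing zero matters, send the amount as a string with a regex constraint or as integer minor units. Run pact verification in GitHub Actions on every PR to both repos, producer and consumer, so neither side can silently break the other.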

For Kafka event streams, Pact doesn't reach far enough. Use JSON Schema 2020-12 with jsonschema (Python, v4.21+) and enforce it in your consumer's deserialization path, not just in tests. The key additions over Draft-07 are prefixItems for tuple validation (new in 2020-12) and unevaluatedProperties (introduced in 2019-09). Setting unevaluatedProperties: false catches additive drift, where producers start emitting new fields that consumers silently ignore until they don't, and unlike additionalProperties it keeps working when the schema is composed with allOf or $ref:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["order_id", "total", "currency", "status"],
  "unevaluatedProperties": false,
  "properties": {
    "order_id": { "type": "string", "pattern": "^ORD-\\d{8}$" },
    "total":    { "type": "number", "minimum": 0, "multipleOf": 0.01 },
    "currency": { "type": "string", "enum": ["USD", "EUR", "GBP"] },
    "status":   { "type": "string", "enum": ["PENDING", "CONFIRMED", "CANCELLED"] }
  }
}

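The multipleOf: 0.01 constraint on total is doing real work here: it rejects a value such as 142.5001, produced by a floating-point serialization bug, before it enters the warehouse. One caveat: validators that evaluate multipleOf with binary floating-point arithmetic can also reject some legitimate two-decimal amounts, so test the constraint against the validator you actually run, or carry amounts as integer minor units.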
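If the schema-level multipleOf check proves unreliable in your validator, the same precision rule can be enforced in decimal arithmetic. This is a minimal sketch using only the standard library; has_at_most_two_places is a hypothetical helper name, and the sample payload is invented.

```python
import json
from decimal import Decimal


def has_at_most_two_places(amount):
    """Return True if the amount has no more than two decimal places.

    Parsing via str() uses the shortest decimal text of a float, so 142.5
    is inspected as "142.5" rather than as its binary expansion.
    """
    d = Decimal(str(amount))
    return d == d.quantize(Decimal("0.01"))


# parse_float=Decimal makes json.loads keep the digits exactly as sent,
# instead of converting them to a binary float first.
event = json.loads('{"order_id": "ORD-00000123", "total": 142.5001}', parse_float=Decimal)

print(has_at_most_two_places(event["total"]))  # False
print(has_at_most_two_places("142.50"))        # True
print(has_at_most_two_places(19.99))           # True
```

Parsing with parse_float=Decimal is a standard json module option; whether your Kafka client lets you control the JSON parser this way is something to confirm for your own stack.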

At the storage and pipeline boundary, bring in Great Expectations (GX Core 1.x) and dbt tests. GX covers value-level checks on stored data, such as confirming that status only ever contains known values. A dbt singular test handles the cross-service referential check that schema validation can't: does every user_id in the billing_events table exist in the users table written by a different service?

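# Great Expectations (GX Core 1.x): accepted-values check on status
import great_expectations as gx

context = gx.get_context()
# The suite name is illustrative.
expectation_suite = context.suites.add(gx.ExpectationSuite(name="billing_events_suite"))
expectation_suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeInSet(
        column="status",
        value_set=["PENDING", "CONFIRMED", "CANCELLED"],
    )
)

-- dbt singular test: cross-service FK integrity
-- tests/assert_billing_user_ids_exist.sql
-- Returns billing rows whose user_id has no match in users; any returned row fails the test.
SELECT b.user_id
FROM {{ ref('billing_events') }} b
LEFT JOIN {{ ref('users') }} u ON b.user_id = u.id
WHERE u.id IS NULL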

Running this dbt test in CI against a seeded Postgres environment (not prod) caught a 3% orphaned-record rate introduced by an async user-deletion job — a bug that had been silently present for six weeks. The fix took two hours; discovery without this test would have been a multi-day incident investigation.
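To see what the anti-join reports without standing up dbt or Postgres, the same query shape can be run against an in-memory SQLite database. This is only a sketch: SQLite stands in for the seeded Postgres environment, and the table layout and seed rows are invented for the example.

```python
import sqlite3

# Same LEFT JOIN ... IS NULL shape as the dbt singular test, minus the ref() macros.
ORPHAN_QUERY = """
SELECT b.user_id
FROM billing_events b
LEFT JOIN users u ON b.user_id = u.id
WHERE u.id IS NULL
"""


def find_orphaned_user_ids(conn):
    """Return user_ids referenced by billing_events that are missing from users."""
    return [row[0] for row in conn.execute(ORPHAN_QUERY)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE billing_events (event_id INTEGER PRIMARY KEY, user_id INTEGER);
INSERT INTO users (id) VALUES (1), (2);
-- user 3 was deleted upstream but still has a billing event
INSERT INTO billing_events (event_id, user_id) VALUES (10, 1), (11, 2), (12, 3);
""")

orphans = find_orphaned_user_ids(conn)
print(orphans)  # [3]
```

An empty result means the check passes, which mirrors how dbt treats a singular test that returns no rows.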

Where Senior Engineers Still Get Burned

Validating at the wrong layer. The most common mistake is running schema validation only at the integration test boundary, not at the deserialization boundary in production code. When a producer ships a breaking change that passes Pact (because the pact wasn't updated), the only safety net is the runtime deserializer — and if that's permissive, the bad data flows silently. Enforce JSON Schema validation inside your Kafka consumer's on_message handler, not only in test fixtures. Log and dead-letter on failure; don't swallow exceptions.
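A rough shape for that runtime safety net is sketched below. It is deliberately library-agnostic: validate_event is a stand-in for a real JSON Schema validator call, and the process and dead_letter callbacks are hypothetical hooks you would wire to your own consumer and dead-letter topic.

```python
import json
import logging

logger = logging.getLogger("order_consumer")

REQUIRED_FIELDS = ("order_id", "total", "currency", "status")


class InvalidEvent(Exception):
    """Raised when a decoded event violates the consumer's contract."""


def validate_event(event):
    """Stand-in validator; a real consumer would call its JSON Schema validator here."""
    if not isinstance(event, dict):
        raise InvalidEvent("event is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in event]
    if missing:
        raise InvalidEvent(f"missing fields: {missing}")
    total = event["total"]
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise InvalidEvent("total must be a JSON number")


def on_message(raw_bytes, process, dead_letter):
    """Validate before processing; log and dead-letter failures instead of swallowing them."""
    try:
        event = json.loads(raw_bytes)
        validate_event(event)
    except (ValueError, InvalidEvent) as exc:  # ValueError covers malformed JSON and bad UTF-8
        logger.warning("rejected event: %s", exc)
        dead_letter(raw_bytes, str(exc))
        return False
    process(event)
    return True


processed, dead = [], []
good = b'{"order_id": "ORD-00000123", "total": 142.5, "currency": "USD", "status": "PENDING"}'
bad = b'{"order_id": "ORD-00000124", "total": "142.50", "currency": "USD", "status": "PENDING"}'

on_message(good, processed.append, lambda raw, why: dead.append((raw, why)))
on_message(bad, processed.append, lambda raw, why: dead.append((raw, why)))
print(len(processed), len(dead))  # 1 1
```

Passing the callbacks in as arguments keeps the handler testable without a broker, so the same function can run in unit tests and in production.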

Treating enum drift as a schema problem. When the order service adds "ON_HOLD" to its status enum, a downstream consumer whose schema only types status as a string still passes its schema test, but the business logic silently routes the new value to the wrong state machine branch. This is a value-contract problem, not a structural one. The fix is to version your enum sets explicitly (in Pact matchers, in JSON Schema enum arrays, and in dbt accepted_values tests) and treat any addition as a breaking change requiring coordinated deployment, not a minor update.
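One possible way to make enum drift loud rather than silent is to keep the accepted set in one versioned place and fail closed on anything outside it. The version counter and the branch names below are hypothetical; only the three status values come from the schema shown earlier.

```python
# Version the accepted enum set explicitly; bump it only with a coordinated deploy.
STATUS_CONTRACT_VERSION = 1  # hypothetical version counter
KNOWN_STATUSES = frozenset({"PENDING", "CONFIRMED", "CANCELLED"})

# Hypothetical state machine branches keyed by status.
BRANCH_BY_STATUS = {
    "PENDING": "await_payment",
    "CONFIRMED": "fulfil",
    "CANCELLED": "refund",
}


def unknown_statuses(observed):
    """Return statuses seen in upstream data that the contract does not list."""
    return sorted(set(observed) - KNOWN_STATUSES)


def route(status):
    """Fail closed: refuse to guess a branch for a status outside the contract."""
    if status not in KNOWN_STATUSES:
        raise ValueError(
            f"status {status!r} is not in contract v{STATUS_CONTRACT_VERSION}"
        )
    return BRANCH_BY_STATUS[status]


print(unknown_statuses(["PENDING", "ON_HOLD", "CONFIRMED", "ON_HOLD"]))  # ['ON_HOLD']
print(route("CONFIRMED"))  # fulfil
```

Running unknown_statuses against a sample of upstream events in CI gives early warning, while the check inside route protects production if the warning is missed.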

Myths That Lead to False Confidence in Cross-Service Validation

"Our contract tests cover consistency." Pact and OpenAPI contract tests verify that a message is structurally acceptable to both parties. They don't verify that the values are semantically consistent across the full data lifecycle — that the total stored in Postgres matches the total on the Kafka event matches the total in the data warehouse. Contract tests are necessary but not sufficient. You need value-level assertions at each persistence boundary, not just at the API handshake.

"We test with production data clones, so coverage is realistic." Production clones give you realistic volume and shape, but they introduce two problems: PII exposure risk (even with masking, re-identification attacks on quasi-identifiers are well-documented), and stale consistency — a prod clone from last Tuesday doesn't reflect the enum additions or schema changes deployed on Wednesday. Synthetic data generated with Faker or Gretel against your current schema definitions is more consistent with what your services will actually emit today. Use prod clones for load testing; use schema-driven synthetic data for consistency validation.

A working consistency validation stack isn't a single tool — it's Pact at the API boundary, JSON Schema 2020-12 at deserialization, and Great Expectations or dbt tests at the storage layer, all wired into CI. Start with the layer where you've had the most production incidents, instrument it first, and expand outward. For deeper reading, the Pact documentation on provider verification and the dbt documentation on singular tests are the most practical next stops.

