How to Create Millions of Test Records Fast

Test Data Generation 6 min read July 24, 2026

Generating five rows of fixture data is a solved problem. Generating five million rows that actually stress your indexes, trigger your edge-case query plans, and reflect the cardinality distribution of production — that's where most test data pipelines quietly fall apart. Teams discover this when a load test that passed locally destroys staging because the test dataset was too uniform to expose a hot-partition bug.

The core technical problem is threefold: volume (row count that makes the database engine work), variety (value distributions that mirror real skew), and velocity (generation fast enough to fit inside a CI pipeline or an on-demand reset workflow). Solving one without the others produces data that's large but useless, or realistic but too slow to recreate.

By the end of this article you'll have a concrete pipeline: Python generators feeding Postgres COPY, a schema-driven approach using Faker and Mimesis, and a set of architectural decisions that keep generating and loading 10 million rows to roughly 90 seconds on a local Postgres instance.

What "At-Scale" Test Data Generation Actually Means

Test data generation at scale is the practice of programmatically producing large, structurally valid, referentially consistent synthetic datasets — fast enough to be part of an automated workflow. It sits between fixture management (small, hand-authored, committed to source control) and data subsetting (slicing a prod clone). Its job is to fill the gap where you need volume and variety but can't or shouldn't touch production data.

In a modern test architecture, generation lives in its own stage: schema definition → generator configuration → bulk load → validation. It feeds load tests, query-plan benchmarks, ML feature-pipeline smoke tests, and database migration dry-runs. The output is ephemeral — created for a run, destroyed after. Treating it as a long-lived artifact is the first sign a team has outgrown their current approach. Tools in this space include Faker 24.x and Mimesis 12.x for value synthesis, factory_boy for ORM-coupled factories, and raw SQL COPY or Kafka producers for bulk ingestion.

Building a Streaming Generation Pipeline That Actually Finishes

The single biggest performance mistake is building records as Python dicts, appending them to a list, then bulk-inserting at the end. At 10M rows, you run out of heap before you run out of patience. Use a generator that streams directly into psycopg3's copy method instead.

import psycopg
from mimesis import Address, Finance, Person
from mimesis.locales import Locale

# Providers are created once and reused for every row.
person = Person(Locale.EN)
address = Address(Locale.EN)
finance = Finance(Locale.EN)

def generate_rows(n: int):
    """Yield one customer tuple at a time so memory use stays flat."""
    for _ in range(n):
        yield (
            person.email(),
            person.full_name(),
            address.postal_code(),
            finance.price(minimum=1.0, maximum=9999.0),
        )

def load(conn_str: str, n: int = 10_000_000):
    with psycopg.connect(conn_str) as conn:
        with conn.cursor() as cur:
            # COPY ... FROM STDIN streams rows to the server as they are generated.
            with cur.copy(
                "COPY customers (email, full_name, postal_code, balance) FROM STDIN"
            ) as copy:
                for row in generate_rows(n):
                    copy.write_row(row)
        conn.commit()

Mimesis is the right choice here over Faker for raw throughput — benchmarks on a 2023 M2 MacBook show Mimesis generating ~650k records/sec versus Faker's ~180k/sec for equivalent field sets, because Mimesis avoids the per-call locale resolution overhead. Use Faker when you need its broader provider ecosystem (credit card BINs, realistic lorem, locale-specific IDs) or when you're already inside a factory_boy factory graph. Use Mimesis when volume is the primary constraint.
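Throughput figures like these depend heavily on hardware and on the exact field set, so it is worth measuring your own generator before choosing a library. The following is a minimal sketch of a timing harness using only the standard library. The names `rows_per_second` and `counter_rows` are hypothetical, and `counter_rows` is a stand-in that you would replace with a Mimesis- or Faker-backed generator.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator


def rows_per_second(make_rows: Callable[[], Iterable[tuple]], sample: int = 100_000) -> float:
    """Consume up to `sample` rows from a fresh generator and return rows per second."""
    start = time.perf_counter()
    consumed = sum(1 for _ in islice(make_rows(), sample))
    elapsed = time.perf_counter() - start
    return consumed / elapsed if elapsed > 0 else float("inf")


def counter_rows() -> Iterator[tuple]:
    # Hypothetical stand-in generator; swap in the real row generator to compare libraries.
    i = 0
    while True:
        yield (i, f"user{i}@example.com")
        i += 1


if __name__ == "__main__":
    print(f"{rows_per_second(counter_rows):,.0f} rows/sec")
```

Measuring the generator in isolation, without the database, separates generation cost from load cost, which makes it clearer where the time in a slow reset is actually going.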

Referential integrity is the next bottleneck. If orders has a foreign key to customers, you need to generate customer IDs first, then sample from them during order generation. Pre-generate the parent IDs into a Python array (or a temp table) and use random.choices with a weighted distribution to simulate realistic skew — most e-commerce datasets have a power-law distribution on order count per customer.

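# Simulate a skewed FK distribution
import random

import numpy as np

customer_ids = list(range(1, 100_001))
# Heavy-tailed weights drawn from a Zipf distribution: a small share of
# customers receives most orders. Tune the exponent against production.
weights = np.random.zipf(1.5, len(customer_ids)).astype(float)
# Precompute cumulative weights once. Passing weights= on every call would
# rebuild them per row, which costs O(len(customer_ids)) each time.
cum_weights = weights.cumsum().tolist()

def generate_orders(n: int):
    # person and finance are the Mimesis providers from the previous snippet.
    for _ in range(n):
        yield (
            random.choices(customer_ids, cum_weights=cum_weights, k=1)[0],
            finance.price(1.0, 500.0),
            person.email(),
        )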

With this pipeline, generation and load of 10M orders into a local Postgres 16 instance (no indexes on the target table, added post-load) takes roughly 90 seconds end-to-end. With indexes present during load, expect 4–6× slower — always load into an unindexed table and build indexes after. That single change dropped one team's nightly reset from 22 minutes to under 4 minutes.
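The load-then-index ordering can be sketched as a single reset function. This is an illustration under assumptions, not a definitive implementation: the index names, the `orders` column names (`customer_id`, `amount`, `contact_email`), and the function name `reload_orders` are all hypothetical, and `conn` is assumed to be a psycopg 3 connection.

```python
from typing import Iterable, Sequence

# Hypothetical index definitions for the orders table; adjust to your schema.
ORDER_INDEXES = {
    "orders_customer_id_idx": "CREATE INDEX orders_customer_id_idx ON orders (customer_id)",
    "orders_email_idx": "CREATE INDEX orders_email_idx ON orders (contact_email)",
}

# Hypothetical column list; it must match the tuples the generator yields.
COPY_SQL = "COPY orders (customer_id, amount, contact_email) FROM STDIN"


def reload_orders(conn, rows: Iterable[Sequence]) -> None:
    """Drop indexes, bulk load, then rebuild indexes and refresh planner statistics.

    `conn` is assumed to be a psycopg 3 connection.
    """
    with conn.cursor() as cur:
        # 1. Remove indexes so COPY does not maintain them row by row.
        for name in ORDER_INDEXES:
            cur.execute(f"DROP INDEX IF EXISTS {name}")
        cur.execute("TRUNCATE orders")
        # 2. Stream rows into the unindexed table.
        with cur.copy(COPY_SQL) as copy:
            for row in rows:
                copy.write_row(row)
        # 3. Build each index once over the full table, then update statistics.
        for ddl in ORDER_INDEXES.values():
            cur.execute(ddl)
        cur.execute("ANALYZE orders")
    conn.commit()
```

Running ANALYZE at the end matters for query-plan tests, since the planner otherwise works from stale statistics on a freshly loaded table.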

Where Senior Engineers Still Burn Time on This

Generating inside the ORM. Using SQLAlchemy or Django ORM to insert millions of rows feels natural if you already have models, but ORM overhead — object instantiation, event hooks, per-row round-trips — makes it 20–50× slower than a raw COPY. Teams reach for factory_boy because it's familiar, then wonder why their seed script takes 40 minutes. Reserve ORM-based factories for unit test fixtures in the hundreds. At the millions scale, drop to psycopg3 COPY or LOAD DATA INFILE in MySQL.

Uniform distributions masquerading as realistic data. random.randint(1, 100) for every numeric field produces flat histograms that don't trigger the query planner paths you care about. Production data almost always has skew: Zipf on user activity, normal distributions on transaction amounts, heavy NULL rates on optional fields. If your generator doesn't model that, your index-scan vs. seq-scan boundary tests are meaningless. A single percentile is not a histogram, so spend 30 minutes pulling several breakpoints from production, for example SELECT percentile_cont(ARRAY[0.5, 0.9, 0.95, 0.99]) WITHIN GROUP (ORDER BY amount) FROM orders, and encode them into your generator weights.
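One way to encode profiled percentiles into a generator is inverse-CDF sampling over the breakpoints, with linear interpolation between them and a configurable NULL rate. The sketch below is one possible approach, not the only one. The class name `PercentileSampler` and the breakpoint values in the usage example are hypothetical; real values would come from your own production query.

```python
import bisect
import random
from typing import List, Optional, Sequence, Tuple


class PercentileSampler:
    """Inverse-CDF sampler over (percentile, value) breakpoints with linear interpolation."""

    def __init__(
        self,
        breakpoints: Sequence[Tuple[float, float]],
        null_rate: float = 0.0,
        seed: Optional[int] = None,
    ) -> None:
        pts = sorted(breakpoints)
        if pts[0][0] != 0.0 or pts[-1][0] != 1.0:
            raise ValueError("breakpoints must include percentiles 0.0 and 1.0")
        self._ps: List[float] = [p for p, _ in pts]
        self._vs: List[float] = [v for _, v in pts]
        self._null_rate = null_rate
        self._rng = random.Random(seed)

    def sample(self) -> Optional[float]:
        # Emit NULL at the configured rate to mimic sparse optional columns.
        if self._rng.random() < self._null_rate:
            return None
        u = self._rng.random()
        # Find the breakpoint segment that contains u, then interpolate within it.
        hi = min(bisect.bisect_right(self._ps, u), len(self._ps) - 1)
        lo = hi - 1
        span = self._ps[hi] - self._ps[lo]
        frac = (u - self._ps[lo]) / span if span else 0.0
        return self._vs[lo] + frac * (self._vs[hi] - self._vs[lo])


if __name__ == "__main__":
    # Hypothetical breakpoints: min, median, p95, max of an amount column.
    amounts = PercentileSampler(
        [(0.0, 1.0), (0.5, 42.0), (0.95, 310.0), (1.0, 9999.0)], null_rate=0.1, seed=7
    )
    print([amounts.sample() for _ in range(5)])
```

More breakpoints give a closer fit to the production shape. Linear interpolation is a simplification and will understate how sharply a long tail curves between the last two breakpoints.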

Myths That Keep Test Data Pipelines Slow and Fragile

"A prod data clone is good enough." It's not; it's a liability. Beyond the obvious PII exposure, prod clones are stale the moment they're taken, they don't let you control cardinality or edge-case density, and restoring a 2TB snapshot for every test run isn't a pipeline. Tools like Tonic and Gretel synthesize statistically similar data with privacy guarantees, but even they require a generation step. The clone is a reference, not a test dataset.

"More randomness equals better coverage." Pure randomness is inefficient coverage. Hypothesis (property-based testing) and structured generation with explicit boundary values (NULLs, max-length strings, zero amounts, duplicate emails) find bugs that random sampling at 10M rows still misses. Randomness fills volume; deliberate boundary modeling finds defects.

"Generation is a one-time setup task." Schema drift makes this false within weeks. A new NOT NULL column, a changed enum, a tightened CHECK constraint — any of these silently breaks a generator that isn't schema-coupled. Define your generators against a JSON Schema 2020-12 or Pydantic model that's generated from the same migration tooling (e.g., a dbt source YAML or a Great Expectations suite). When the schema changes, the generator breaks loudly at CI time, not quietly at 2am during a load test.

The pattern that works at scale: Mimesis or Faker for value synthesis, Python generators for memory efficiency, Postgres COPY (or Kafka producers for streaming systems) for bulk ingestion, post-load index builds, and schema-coupled validation via Pydantic or Great Expectations. Start by profiling one production histogram and encoding that distribution into your generator — that single step makes synthetic data materially more useful than anything uniform. For deeper reading, the Postgres documentation on COPY performance and the Mimesis benchmarks repo are both worth an hour.
