Token-efficient serialization to reduce structured output overhead in production pipelines #2272

@makroumi

Is your feature request related to a problem?

Instructor's core value is structured outputs - getting LLMs to return valid, typed data reliably.

The irony: the JSON wire format used to achieve this structure is itself one of the largest token cost drivers in production pipelines.

Every Instructor call serializes:

  • The JSON schema in the prompt
  • The structured output response
  • Retry attempts with validation errors
  • Patch history across retries

~44% of tokens in typical Instructor payloads
are pure JSON syntax overhead.
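A rough way to check the overhead of a given payload is to count how much of its compact JSON encoding is structural punctuation. The sketch below is an illustration only: it counts characters rather than tokens, the function name `syntax_fraction` is made up for this example, and it is not the method behind the ~44% figure. A real measurement would run the payload through the target model's tokenizer.

```python
import json

# Structural characters that appear outside of string literals.
STRUCTURAL = set("{}[]:,")

def syntax_fraction(obj) -> float:
    """Fraction of characters in compact JSON that are braces, brackets,
    colons, commas, or string-delimiting quotes."""
    text = json.dumps(obj, separators=(",", ":"))
    syntax = 0
    in_string = False
    escaped = False
    for ch in text:
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is content
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
                syntax += 1          # closing quote
        elif ch == '"':
            in_string = True
            syntax += 1              # opening quote
        elif ch in STRUCTURAL:
            syntax += 1
    return syntax / len(text)

if __name__ == "__main__":
    sample = {"name": "Ada", "tags": ["a", "b"], "age": 36}
    print(f"{syntax_fraction(sample):.0%} of characters are JSON syntax")
```

Field names are counted as content here; treating repeated keys as overhead would push the fraction higher.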

At scale this compounds fast:

  • Schema overhead on every single call
  • Repeated field names across every retry
  • Validation error payloads add more JSON
  • Multi-step pipelines multiply the overhead

At 10M Instructor calls on GPT-4o:
~$59K spent on syntax noise. Not intelligence.
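The issue does not state the per-call token count or the price used for this estimate, so the sketch below only shows the shape of the calculation. `overhead_cost_usd` and its parameters are hypothetical names; the one number derived from the text is the implied overhead per call, $59K divided by 10M calls.

```python
def overhead_cost_usd(calls: int,
                      tokens_per_call: float,
                      overhead_fraction: float,
                      usd_per_million_tokens: float) -> float:
    """Cost attributable to syntax overhead under a flat per-token price."""
    overhead_tokens = calls * tokens_per_call * overhead_fraction
    return overhead_tokens / 1_000_000 * usd_per_million_tokens

# Implied by the figures quoted in the issue: ~$59K over 10M calls.
implied_usd_per_call = 59_000 / 10_000_000   # about $0.0059 per call

if __name__ == "__main__":
    print(f"implied overhead per call: ${implied_usd_per_call:.4f}")
    # Placeholder inputs for illustration, not figures from the issue.
    print(overhead_cost_usd(calls=1_000_000, tokens_per_call=1_000,
                            overhead_fraction=0.5, usd_per_million_tokens=2.0))
```

Plugging in a pipeline's own token counts and current pricing gives a figure that can be compared against the estimate above.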

The frustration: Instructor already does the hard work of structured validation. The JSON wire format undermines that efficiency at scale.

Describe the solution you'd like

A pluggable serializer interface that allows token-efficient wire formats as an opt-in replacement for JSON in Instructor pipelines.

I built ULMEN specifically for this problem.

Benchmarks on NVIDIA Tesla T4:

[Image: benchmark results]

The natural fit with Instructor specifically:

Instructor validates structure on the Python side.
ULMEN's Semantic Firewall extends this to the wire:

  • Validates structured output schemas
  • Rejects malformed responses before retry
  • Catches invalid enum states
  • Raises structured errors vs silent failures

This aligns with Instructor's core philosophy:
never pass broken structure downstream.
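To make the listed checks concrete, here is a minimal stdlib sketch of wire-level validation that rejects malformed payloads and invalid enum values with a structured error. It is not ULMEN's Semantic Firewall, whose implementation is not described in this issue; `Sentiment`, `WireValidationError`, and `parse_label` are hypothetical names, and JSON stands in for the wire format.

```python
import json
from enum import Enum

class Sentiment(Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"

class WireValidationError(ValueError):
    """Structured error: carries the offending field and a reason."""
    def __init__(self, field: str, reason: str):
        super().__init__(f"{field}: {reason}")
        self.field = field
        self.reason = reason

def parse_label(raw: str) -> Sentiment:
    """Decode a payload and validate it before anything downstream sees it."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise WireValidationError("<payload>", f"malformed: {exc.msg}") from exc
    if not isinstance(data, dict) or "sentiment" not in data:
        raise WireValidationError("sentiment", "missing field")
    try:
        return Sentiment(data["sentiment"])
    except ValueError as exc:
        raise WireValidationError(
            "sentiment", f"invalid enum value {data['sentiment']!r}"
        ) from exc

if __name__ == "__main__":
    print(parse_label('{"sentiment": "positive"}'))
    try:
        parse_label('{"sentiment": "meh"}')
    except WireValidationError as err:
        print("rejected:", err.field, "|", err.reason)
```

Because the error names the field and the reason, a retry prompt can quote it directly instead of resending the whole failed payload.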

How code might look:

import instructor
from openai import OpenAI

# Current
client = instructor.from_openai(OpenAI())

# Proposed
client = instructor.from_openai(
    OpenAI(),
    serializer="ulmen"
)

# Per-call override
response = client.chat.completions.create(
    model="gpt-4o",
    response_model=MyModel,
    serializer="ulmen",
    messages=[...]
)

Pydantic model definitions unchanged.
ULMEN handles wire format transparently.
Pure Python fallback if Rust unavailable.
BSL license, free under $10M revenue.
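One way a `serializer="..."` argument could be resolved internally is a small registry keyed by name, with JSON as the default. This is a sketch of the proposed interface only: `Serializer`, `JsonSerializer`, `register_serializer`, and `get_serializer` are hypothetical names and do not exist in Instructor today.

```python
import json
from typing import Any, Dict, Protocol

class Serializer(Protocol):
    """Minimal contract a wire format would need to satisfy."""
    def dumps(self, obj: Any) -> str: ...
    def loads(self, text: str) -> Any: ...

class JsonSerializer:
    """Default behaviour: compact JSON."""
    def dumps(self, obj: Any) -> str:
        return json.dumps(obj, separators=(",", ":"))

    def loads(self, text: str) -> Any:
        return json.loads(text)

# Name -> serializer instance; third-party formats would register here.
_REGISTRY: Dict[str, Serializer] = {"json": JsonSerializer()}

def register_serializer(name: str, serializer: Serializer) -> None:
    _REGISTRY[name] = serializer

def get_serializer(name: str = "json") -> Serializer:
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown serializer {name!r}") from None

if __name__ == "__main__":
    s = get_serializer()
    wire = s.dumps({"label": "spam", "score": 0.9})
    print(wire, s.loads(wire))
```

Keeping JSON as the default means existing pipelines behave the same unless a caller opts in by name.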

Reproducible benchmark notebook:
github.com/makroumi/ulmen

Describe alternatives you've considered

  1. orjson: Faster but identical token count. Doesn't address context window overhead.

  2. Manual schema compression: Lossy. Breaks with model changes. Not systematic across pipelines.

  3. Reducing retry attempts: Trades reliability for cost. Wrong tradeoff for production systems.

  4. Smaller models: Reduces capability, not just cost. Wrong lever for this specific problem.

ULMEN addresses the root cause: JSON was designed for web APIs, not LLM context windows.

Additional Context

Instructor users running structured extraction pipelines at scale are the most affected by this problem because:

  1. Schema overhead appears on EVERY call
  2. Retry payloads add compounding JSON overhead
  3. Batch extraction pipelines multiply the cost
  4. Multi-step pipelines chain the overhead

The teams most likely to benefit:

  • Document extraction pipelines at scale
  • Classification systems with high call volume
  • Any production Instructor deployment over 1M calls per month
