Token-efficient serialization to reduce structured output overhead in production pipelines

# Is your feature request related to a problem?
Instructor's core value is structured outputs - getting LLMs to return valid, typed data reliably.

The irony: the JSON wire format used to achieve this structure is itself one of the largest token cost drivers in production pipelines.

Every Instructor call serializes:
- The JSON schema in the prompt
- The structured output response
- Retry attempts with validation errors
- Patch history across retries

~44% of tokens in typical Instructor payloads are pure JSON syntax overhead.
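One rough way to sanity-check an overhead figure like this on your own payloads is sketched below. It is a character-based proxy, not a tokenizer count, and it treats everything except leaf values (including field names) as overhead; both choices are assumptions of this sketch, and `json_syntax_share` is a hypothetical helper name.

```python
import json


def _leaf_texts(node):
    """Yield the serialized text of each leaf value, without keys or quotes."""
    if isinstance(node, dict):
        for child in node.values():
            yield from _leaf_texts(child)
    elif isinstance(node, list):
        for child in node:
            yield from _leaf_texts(child)
    elif isinstance(node, str):
        # Drop the surrounding quotes that json.dumps adds to strings.
        yield json.dumps(node)[1:-1]
    else:
        yield json.dumps(node)


def json_syntax_share(payload):
    """Fraction of characters in compact JSON that are not leaf values."""
    text = json.dumps(payload, separators=(",", ":"))
    value_chars = sum(len(t) for t in _leaf_texts(payload))
    return 1 - value_chars / len(text)


if __name__ == "__main__":
    sample = {"name": "Ada", "age": 36, "tags": ["x", "y"]}
    print(f"{json_syntax_share(sample):.1%} of characters are keys or syntax")
```

Swapping the character counts for counts from the tokenizer of the target model would give a figure directly comparable to the one above.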

At scale this compounds fast:
- Schema overhead on every single call
- Repeated field names across every retry
- Validation error payloads add more JSON
- Multi-step pipelines multiply the overhead

At 10M Instructor calls on GPT-4o:
~$59K spent on syntax noise. Not intelligence.
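The issue does not state the tokens-per-call or price assumptions behind this figure, so the sketch below only makes the arithmetic explicit. `overhead_cost_usd` and `implied_overhead_per_call_usd` are hypothetical helpers, and any values passed to the first one should come from your own traffic and your provider's current price list.

```python
def overhead_cost_usd(calls, tokens_per_call, overhead_fraction, usd_per_million_tokens):
    """Spend attributable to overhead tokens, given explicit assumptions."""
    overhead_tokens = calls * tokens_per_call * overhead_fraction
    return overhead_tokens / 1_000_000 * usd_per_million_tokens


def implied_overhead_per_call_usd(total_overhead_usd, calls):
    """Per-call overhead cost implied by a total figure."""
    return total_overhead_usd / calls


if __name__ == "__main__":
    # The figures quoted above: ~$59K across 10M calls.
    per_call = implied_overhead_per_call_usd(59_000, 10_000_000)
    print(f"implied overhead per call: ${per_call:.4f}")
```

Plugging measured averages into `overhead_cost_usd` lets a team check whether the quoted total is in the right range for their workload.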

The frustration: Instructor already does the hard work of structured validation. The JSON wire format undermines that efficiency at scale.

# Describe the solution you'd like
A pluggable serializer interface that allows token-efficient wire formats as an opt-in replacement for JSON in Instructor pipelines.

I built ULMEN specifically for this problem.

Benchmarks on NVIDIA Tesla T4:

<img width="1861" height="1281" alt="Image" src="https://github.com/user-attachments/assets/f055ff59-1465-4224-8fd3-db89a0335ff9" />

The natural fit with Instructor specifically:

Instructor validates structure on the Python side.
ULMEN's Semantic Firewall extends this to the wire:

- Validates structured output schemas
- Rejects malformed responses before retry
- Catches invalid enum states
- Raises structured errors vs silent failures

This aligns with Instructor's core philosophy:
never pass broken structure downstream.
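To illustrate what "structured errors vs silent failures" could mean for an invalid enum state, here is a minimal stdlib-only sketch. It is not ULMEN's Semantic Firewall or its API; `Sentiment`, `FieldError`, `WireValidationError`, and `check_record` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEGATIVE = "negative"


@dataclass
class FieldError:
    field: str
    value: object
    reason: str


class WireValidationError(ValueError):
    """Carries one FieldError per invalid field instead of a bare message."""

    def __init__(self, errors):
        super().__init__(f"{len(errors)} invalid field(s)")
        self.errors = errors


def check_record(record):
    """Return the record unchanged, or raise with a list of field errors."""
    errors = []
    if not isinstance(record.get("text"), str):
        errors.append(FieldError("text", record.get("text"), "expected a string"))
    try:
        Sentiment(record.get("sentiment"))
    except ValueError:
        allowed = ", ".join(s.value for s in Sentiment)
        errors.append(
            FieldError("sentiment", record.get("sentiment"), f"expected one of: {allowed}")
        )
    if errors:
        raise WireValidationError(errors)
    return record
```

A caller can inspect `exc.errors` to build a targeted retry prompt rather than resending the whole payload.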

How the code might look:

import instructor
from openai import OpenAI

#### Current
client = instructor.from_openai(OpenAI())

#### Proposed
client = instructor.from_openai(
    OpenAI(),
    serializer="ulmen"
)

#### Per-call override
response = client.chat.completions.create(
    model="gpt-4o",
    response_model=MyModel,
    serializer="ulmen",
    messages=[...]
)

Pydantic model definitions unchanged.
ULMEN handles wire format transparently.
Pure Python fallback if Rust unavailable.
BSL license, free under $10M revenue.
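For discussion, one possible shape for the interface that a `serializer="..."` string could resolve to is sketched below. The `serializer` argument is the proposal in this issue, not an existing Instructor parameter, and `Serializer`, `JsonSerializer`, `register_serializer`, and `resolve_serializer` are all hypothetical names.

```python
import json
from typing import Any, Callable, Dict, Protocol


class Serializer(Protocol):
    """Anything that can round-trip Python data through a text wire format."""

    def dumps(self, obj: Any) -> str: ...

    def loads(self, text: str) -> Any: ...


class JsonSerializer:
    """Default behaviour: plain JSON via the standard library."""

    def dumps(self, obj: Any) -> str:
        return json.dumps(obj)

    def loads(self, text: str) -> Any:
        return json.loads(text)


# Maps a serializer name to a zero-argument factory.
_REGISTRY: Dict[str, Callable[[], Serializer]] = {"json": JsonSerializer}


def register_serializer(name: str, factory: Callable[[], Serializer]) -> None:
    """Make a serializer available under a short name."""
    _REGISTRY[name] = factory


def resolve_serializer(name: str = "json") -> Serializer:
    """Look up a serializer by name, failing loudly on unknown names."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown serializer {name!r}; known: {known}") from None
    return factory()
```

Keeping JSON as the registered default would leave existing pipelines untouched, while a third-party package could call `register_serializer` at import time to opt in.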

Reproducible benchmark notebook:
github.com/makroumi/ulmen

# Describe alternatives you've considered
1. orjson: Faster but identical token count. Doesn't address context window overhead.

2. Manual schema compression: Lossy. Breaks with model changes. Not systematic across pipelines.

3. Reducing retry attempts: Trades reliability for cost. Wrong tradeoff for production systems.

4. Smaller models: Reduces capability, not just cost. Wrong lever for this specific problem.

ULMEN addresses the root cause: JSON was designed for web APIs, not LLM context windows.
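As a generic illustration of where repeated field names cost space, the sketch below compares a list of JSON records with a layout that states the field names once. This is not ULMEN's format, only a simple stand-in for the idea; `encode_header_once` and `decode_header_once` are hypothetical names, and sizes are measured in characters rather than tokens.

```python
import json


def encode_header_once(records):
    """Store field names once, then one row of values per record."""
    if not records:
        return {"fields": [], "rows": []}
    fields = list(records[0].keys())
    for record in records:
        if list(record.keys()) != fields:
            raise ValueError("all records must share the same fields in the same order")
    return {"fields": fields, "rows": [[r[f] for f in fields] for r in records]}


def decode_header_once(table):
    """Rebuild the original list of dicts."""
    return [dict(zip(table["fields"], row)) for row in table["rows"]]


if __name__ == "__main__":
    records = [
        {"invoice_id": i, "currency": "USD", "total": 10.0 * i} for i in range(1, 4)
    ]
    plain = json.dumps(records, separators=(",", ":"))
    compact = json.dumps(encode_header_once(records), separators=(",", ":"))
    print(len(plain), "characters as records;", len(compact), "with a single header")
```

The gap widens as the number of records grows, since each additional record no longer repeats its keys.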

# Additional Context
Instructor users running structured extraction pipelines at scale are the most affected by this problem because:

1. Schema overhead appears on EVERY call
2. Retry payloads add compounding JSON overhead
3. Batch extraction pipelines multiply the cost
4. Multi-step pipelines chain the overhead

The teams most likely to benefit:
- Document extraction pipelines at scale
- Classification systems with high call volume
- Any production Instructor deployment over 1M calls per month

