
Setting Logits to Negative Infinity: How LLMs Actually Output JSON

Structured outputs aren't a validation layer; they're a decoding-time intervention. How logit masking actually works, why token boundaries make it hard, and why reordering one field in your Pydantic schema can move accuracy by 90 points.

And why your field names, token boundaries, and schema order matter more than you think.

Here are two Pydantic models. They differ by one word.

from pydantic import BaseModel

class Solution(BaseModel):
    final_choice: int

class Solution(BaseModel):
    answer: int

According to the Instructor docs, swapping final_choice for answer on the same model with the same prompt has been reported to move classification accuracy from 4.5% to 95%.1 Same constraints. Same temperature. Same model weights. One field name.

You may want to verify that on your own task before you cite it. But the fact that it could even plausibly be true says something the rest of this post will spend three thousand words unpacking.

Structured outputs are not a validation layer. They are a decoding-time intervention. When you hand the model a schema, you are not just telling a validator what to accept. You are telling the decoder what tokens it is allowed to emit. Your schema becomes part of your prompt. The fields, their order, their names, even the types: all of them steer the generation as it streams.

That reframe is the whole post. Everything else is mechanism.


Three kinds of valid

Before going further, three definitions are worth keeping straight. Constrained decoding gets confused with things it isn’t, mostly because the word “valid” is overloaded.

  1. Syntactic validity: the bytes parse as JSON. {"x": 1 is invalid (unclosed); {"x": 1} is valid.
  2. Schema validity: it parses and matches your declared types, enums, and required fields. {"age": "old"} parses fine but fails a schema that demands age: int.
  3. Semantic correctness: it actually answers the question. {"capital_of_france": "Berlin"} is impeccably structured and entirely wrong.

Constrained decoding sells you the first two with a hard guarantee. The third stays your problem. This sentence does load-bearing work later, when we discuss the quality debate.

The three ways to get structured output sit on a spectrum:

| Strategy | Guarantee | Requires logit access |
| --- | --- | --- |
| Prompt + retry | None; vibes | No |
| JSON mode | Parseable | Provider-side |
| Structured outputs (true constrained decoding) | Schema-valid | Yes |

The third one is what this post is about. Here is how it actually works.

Logits in, mask, logits out

Normal autoregressive decoding is a tight loop. At each step:

  1. The model produces a vector of logits, one per token in the vocabulary. For a modern LLM that is 100,000 to 200,000 numbers.
  2. Logits get divided by temperature.
  3. Optionally trimmed by top-k and top-p (nucleus) filters.
  4. Softmax turns what remains into a probability distribution.
  5. Sample.
  6. Append the chosen token, go to step 1.

That is text generation.

Constrained decoding inserts one new step between (1) and (2):

Conceptually: for every token that would be illegal in the current grammar state, set its logit to negative infinity. Then run softmax as before.

Because exp(−∞) = 0, the illegal tokens get probability zero. The probability mass that would have gone to them gets redistributed across the legal tokens via the softmax’s normalization. Temperature, top-k, top-p, beam search: all of it composes downstream without modification.

In code:

logits = model(input_ids)                  # [vocab_size]
mask = grammar.allowed_token_mask(state)   # bool [vocab_size]
logits[~mask] = -float('inf')              # illegal tokens now carry zero probability
probs = softmax(logits / temperature)      # mass redistributes onto the legal tokens
next_token = sample(probs)
state = grammar.advance(state, next_token) # grammar tracks where we are for the next step

Six lines. The mechanism is this simple.

The word “conceptually” is doing real work in the description above. The implementation is where most of the engineering hides: how do you efficiently compute grammar.allowed_token_mask(state) for a vocabulary of 200,000 tokens, with grammars that may be context-free, at every decoding step, across batches, on a hot GPU? The rest of this post answers that question. But none of the answer changes the core mechanism. It just makes it fast and correct.

Three things worth pinning down before moving on:

The model is unchanged. No fine-tuning. No new architecture. Constrained decoding is purely an inference-time intervention. The same weights produce different outputs under different constraints.

The model still does the choosing among legal options. Constraints prune the action space; they don’t decide. A schema with five enum options leaves the model to pick one based on what it knows.

The constraint is incremental. The grammar advances one token at a time, in lockstep with generation. You can’t know what’s legal at position 12 without having tracked the state from positions 0 through 11.

That last point is the bridge. To know what is legal at any moment, the engine needs a model of the grammar’s current state.

The state machine inside your schema

The mask depends on state. State depends on what was emitted. So the engine needs to track state. Welcome to formal language theory.

The smallest tool that handles enums, numbers, and bounded strings is a finite state machine (FSM). State equals “where am I in the regex.” Transitions equal “what character did I just see.” For schemas that flatten to a regex, an FSM is enough.

The smallest tool that handles arbitrarily nested JSON is a pushdown automaton (PDA). The “down” part is the stack: every { or [ pushes a frame, every } or ] pops. JSON’s nesting is unbounded in principle, which means no finite-state machine can recognize it. You need the stack.
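To make the stack concrete, here is a toy sketch that tracks only JSON’s bracket nesting and reports which closing character is currently legal. It is not any engine’s real implementation; real engines track the full grammar (keys, strings, commas) at the byte level.

def advance_brackets(stack: list[str], ch: str) -> list[str]:
    # Push a frame on every open, pop the matching frame on every close.
    if ch in "{[":
        return stack + [ch]
    if ch in "}]":
        opener = "{" if ch == "}" else "["
        if not stack or stack[-1] != opener:
            raise ValueError(f"illegal close {ch!r} in this state")
        return stack[:-1]
    return stack  # other characters don't change nesting depth

def legal_closers(stack: list[str]) -> set[str]:
    # The closing characters a mask would allow in the current state.
    if not stack:
        return set()
    return {"}"} if stack[-1] == "{" else {"]"}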

Worth being precise here: the syntax layer of JSON is context-free. The full JSON Schema spec, with uniqueItems, $ref resolution, dependent fields, and conditional schemas, can express constraints that are not cleanly captured by a simple PDA. Not all of those are enforced at decode time; some validators check them after the fact.

Aside: the Chomsky hierarchy snuck into your API. Regex and FSMs handle regular languages. JSON sits in the context-free tier and needs a PDA. Most programming languages are context-sensitive or beyond, which is why “constrained decoding for valid Python” is much harder than “syntactically well-formed Python.” Your tool-call validator is sitting on top of decades of formal language theory and probably nobody on the team has noticed.

The pipeline is now: grammar tracks state, state produces a legal-character set, that set gets translated into a legal-token set, the legal-token set becomes the mask, the mask gets applied to logits. That last translation step, from characters to tokens, is where the trouble starts.
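Here is a deliberately naive sketch of that last translation step, assuming a grammar object with hypothetical accepts/advance_char methods and an HF-style tokenizer.decode. It scans the entire vocabulary at every step, which is exactly what production engines avoid with precomputation and caching; the point is only what “legal-character set becomes legal-token set” means.

import torch

def build_token_mask(grammar, state, tokenizer, vocab_size: int) -> torch.Tensor:
    # A token is legal only if every character it decodes to advances the
    # grammar without hitting a dead state. O(vocab x token length) per step:
    # correct, and far too slow for production.
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    for token_id in range(vocab_size):
        s, ok = state, True
        for ch in tokenizer.decode([token_id]):
            if not grammar.accepts(s, ch):      # hypothetical per-character check
                ok = False
                break
            s = grammar.advance_char(s, ch)     # hypothetical per-character advance
        mask[token_id] = ok
    return mask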

The token boundary swamp

LLMs do not emit characters. They emit tokens chosen by a BPE tokenizer for statistical compression. A single token might be "json", "the", ": ", or ",\n ". These fragments freely mix letters, punctuation, and whitespace, and the boundaries between tokens have nothing to do with the boundaries between grammar states.

Here is {"name": "Adam"} tokenized by a typical BPE tokenizer:

[ {" ] [ name ] [ ": " ] [ Adam ] [ "} ]

Five tokens for sixteen characters. Now compare against the grammar states:

start → in-key → post-key → in-value → in-string → post-value → closed

Seven states and six transitions, none of which align with the five token boundaries.

This misalignment is the source of basically every implementation headache in the field. Three concrete versions:

Token straddling. The token {" is one BPE unit. Emitting it advances the grammar from start through object-open and into in-key. The implementation has to walk the parser by one token while updating multiple grammar states atomically. Every serious engine handles this with a byte-level PDA so that a single token can be matched against a sequence of lexer transitions in one step.

Non-canonical tokenization. Say the grammar requires a " after name_of_the_person. The natural BPE tokenization of that sequence is probably one or two tokens that already include the quote. If the engine forces an isolated " token, you have pushed the model onto a token sequence it almost never saw in training. The model is still producing the right characters, but the probability distribution over them is now conditioned on a sequence far outside its training distribution. Output quality silently degrades. This is the “alignment tax” people are talking about when they say constrained decoding hurts quality.

Token healing. The fix, pioneered by Microsoft Guidance: when a constraint would force a particular string, back up to the last natural token boundary and re-run the model so it can pick the canonical merger across the boundary. Conceptually obvious. Operationally a multi-day engineering project.

The point: “set illegal logits to negative infinity” is correct as a mental model and a lie as a specification. The mechanism is six lines. The implementation is XGrammar’s adaptive token mask cache, llguidance’s Earley parser, byte-level PDAs, lazy lexer automata, and a small amount of swearing. This is why production-grade constrained decoding really only emerged around 2023 and 2024; the tokenization-alignment problem was genuinely unsolved before that.

Does it hurt quality?

A real concern: if the model wants to put 80% of its probability on a token that leads to a syntactically dead end three tokens later, the mask forces the next-most-probable token instead. That token may be the model’s 1%-likely option. Locally legal, globally low-probability, sometimes nonsense.

The concern surfaced loudly in “Let Me Speak Freely?” (Tam et al., 2024), which reported degraded reasoning under format constraints.2

The methodology was contested. The dottxt team’s rebuttal (November 2024) pointed out that the paper’s “structured” condition was JSON-mode prompting (no schema, no real constraint engine), that prompts were not matched across conditions, and that one of the parsers was an LLM, which can hardly be a neutral arbiter.3 On reruns with apples-to-apples conditions, structured generation matched or slightly beat unconstrained.

Independent corroboration: JSONSchemaBench (Geng et al. 2025, co-authored with the Microsoft Guidance team) ran six frameworks across thousands of schemas and found constrained decoding modestly improves accuracy on Last Letters, Shuffle Objects, and GSM8K.4

The legitimate version of the concern is still alive. Park et al. (“Grammar-Aligned Decoding,” NeurIPS 2024) proved that greedy logit masking does not sample from the model’s true conditional distribution given the grammar.5 There is a real distortion. ASAp, CARS, and AWRS are recent attempts to recover the true conditional via importance correction or rejection sampling. Mostly this matters at small model scale and with very tight constraints; at frontier-model scale, the empirical effect on task accuracy is small to favorable.

The honest summary: constrained decoding changes the distribution. In practice, paired with a semantically aware prompt and a well-ordered schema (next section), the change is in your favor. The theoretical concern is real. The practical objection mostly isn’t.

This is why we defined “three kinds of valid” at the top. When we say constraints help quality, we mean (1) syntactic and (2) schema validity rise to 100%, and (3) semantic correctness usually rises with them. We do not mean constraints turn a small model into a smart one.

Schema design is prompt design

If your schema steers generation, then designing your schema is a prompting activity. Eight rules, ordered roughly by how much they have moved numbers in my work.

1. Reasoning fields before answer fields. The single highest-leverage decision in this whole post. LLMs are autoregressive: every token is conditioned on the ones before it, in order. If the JSON streams answer first, the model commits to a value before any of its work gets to influence it. The reasoning field below the answer becomes post-hoc rationalization for whatever it guessed.

Reverse the order and the model now has a scratchpad. It writes the reasoning, and the answer field is conditioned on the reasoning it just produced. Chain-of-thought, smuggled into the schema. For free.

2. Field names are part of the prompt. The model reads them. final_choice: int and answer: int carry different priors. So do notes: list[str] and key_assumptions: list[str]. Use descriptive names.

3. Use description= fields. Pydantic and most schema systems let you attach a description to each field, which gets included in the schema visible to the model. They are free prompt slots.

4. Enums over free strings, for classification. If the legal answers are {"high", "medium", "low"}, declare an enum. The mask collapses to three options. The output cannot fail to be a valid category.

5. Indexes over strings, for picking from known sets. If the model is choosing one of a list of items, have it return the index, not the string. Robust to fuzzy matches, paraphrasing, and the model’s helpful instinct to “improve” the wording.

6. Flat over deep. Compilation cost climbs sharply with schema depth, and so does the cognitive burden on the model: deeply nested structures correlate with what some recent work calls “structure snowballing,” where the model devotes so much attention to bookkeeping that the actual content suffers.

7. additionalProperties: false. Forces the model to commit to your declared schema instead of inventing fields. Required by OpenAI’s strict mode and Anthropic’s structured outputs for good reason.

8. Cap string lengths. Free-text fields without a max length are how you reach the “infinite \n\n\n... until max_tokens” failure mode that gets reported on every provider’s bug tracker. A constr(max_length=500) defuses it.

The collective effect of these rules is larger than the effect of upgrading models a tier. Same model, schema redesigned, can double accuracy on extraction-style tasks. The mechanism is just that the schema is one more thing the model is reading.
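A sketch that applies several of these rules at once, using made-up field names and descriptions for an extraction-style task:

from enum import Enum
from pydantic import BaseModel, Field, constr

class Priority(str, Enum):          # rule 4: enum, not a free string
    high = "high"
    medium = "medium"
    low = "low"

class TicketTriage(BaseModel):
    # Rule 1: reasoning streams first, so the answer is conditioned on it.
    key_evidence: constr(max_length=500) = Field(          # rules 2 and 8: descriptive name, capped length
        description="Quotes from the ticket that justify the priority."  # rule 3: free prompt slot
    )
    priority: Priority = Field(description="Overall priority of the ticket.")
    matched_playbook_index: int = Field(                    # rule 5: index into a known list, not a string
        description="0-based index into the provided playbook list."
    )

    model_config = {"extra": "forbid"}  # rule 7: serializes to additionalProperties: false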

Constraints can make generation faster

You would expect adding constraints to make generation slower. In well-engineered systems it often goes the other way: constrained generation can be faster than unconstrained generation on the same workload. Two reasons.

Coalescence, also called jump-forward decoding. When the grammar dictates the next several tokens with no choice for the model, skip the forward pass. After an object opens, the next characters are almost always "<key>": for some key. The colon, the space, the closing quote after a key, the comma between fields, the closing brace: most of these are deterministic given the grammar. SGLang’s compressed-FSM approach reported around 1.6x throughput on JSON workloads from this optimization alone.6 Some workloads see substantially more.
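A sketch of the jump-forward idea, assuming a grammar object with hypothetical method names (single_legal_token, allowed_token_mask, advance, is_complete); SGLang and XGrammar implement this differently under the hood, but the shape is the same: when the grammar leaves no choice, skip the forward pass.

import torch

def generate_constrained(model, grammar, input_ids, max_new_tokens=256, temperature=1.0):
    state = grammar.initial_state()                    # hypothetical grammar interface
    for _ in range(max_new_tokens):
        forced = grammar.single_legal_token(state)     # hypothetical: token id if only one is legal, else None
        if forced is not None:
            next_token = forced                        # grammar dictates it: no forward pass needed
        else:
            logits = model(input_ids)[0, -1]           # assumes [batch, seq, vocab] logits
            logits[~grammar.allowed_token_mask(state)] = float("-inf")
            probs = torch.softmax(logits / temperature, dim=-1)
            next_token = int(torch.multinomial(probs, 1))
        input_ids = torch.cat([input_ids, torch.tensor([[next_token]])], dim=-1)
        state = grammar.advance(state, next_token)
        if grammar.is_complete(state):
            break
    return input_ids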

Compilation, not masking, is the real cost. Per-token mask generation in modern engines (XGrammar, llguidance, Outlines) runs in tens to hundreds of microseconds, well below model forward-pass time. The cost that bites is compiling the grammar from your JSON Schema into the engine’s internal representation. For simple schemas this is fast. For pathological ones (deep recursion, polymorphic unions, large enums), compilation can take seconds to minutes. JSONSchemaBench reports some Outlines compilations running 40 seconds to 10 minutes on adversarial inputs.

The practical takeaway: cache compiled grammars by schema hash. Treat schema compilation like template compilation: do it once at startup, reuse the artifact. If you find yourself recompiling per request, you have architected the slow path.
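A minimal sketch of that caching pattern; compile_grammar stands in for whatever engine you use, and only the hashing and reuse are the point.

import hashlib
import json

_GRAMMAR_CACHE: dict[str, object] = {}

def get_compiled_grammar(schema: dict):
    # Canonicalize so key order doesn't produce spurious cache misses,
    # then key the cache on a hash of the schema itself.
    key = hashlib.sha256(
        json.dumps(schema, sort_keys=True, separators=(",", ":")).encode()
    ).hexdigest()
    if key not in _GRAMMAR_CACHE:
        _GRAMMAR_CACHE[key] = compile_grammar(schema)  # hypothetical engine call; expensive, do it once
    return _GRAMMAR_CACHE[key]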

This is one of very few situations in computing where adding a constraint makes things faster. That should feel surprising, and it should make you suspicious of the intuition that constraints are pure overhead.

When to skip the constraint

| Use case | Constrain? |
| --- | --- |
| Tool / function calling | Always |
| Data extraction, classification | Yes |
| Multi-step agent step outputs | Yes |
| Reasoning-heavy tasks | Yes, but wrap the answer only; leave reasoning as free text |
| Open-ended creative writing | Light envelope only (constrain a title field, not the body) |
| Dynamic, user-supplied schemas | Carefully; watch compilation cost |
| Long outputs (>4k tokens) | Watch max_tokens truncation; an interrupted JSON is invalid JSON |
| When the model semantically doesn’t know the answer | No constraint will save you |

That last row is the one to internalize. Will Kurt of dottxt put it best: structured generation “can’t magically make a model understand what you want any more than throwing rail road tracks in your backyard will make your home a convenient train stop.”7

A schema is a steering tool, not a knowledge tool. Ask a model to extract a person’s birth year from a document that doesn’t contain it, and constrained decoding will hand you a syntactically pristine integer. The integer will be wrong. Validate semantics with real-data checks downstream. Constrained decoding is not validation; it is only generation.

Other practical hedges: monitor finish_reason and treat length as a first-class error; validate after generation against real data; consider the two-call pattern for reasoning-heavy work, where call one is free-form scratch and call two is a cheap constrained reformatting of the scratch into the schema.
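A sketch of that two-call pattern, with hypothetical complete() and complete_constrained() helpers standing in for your client; the only point is the split between a free-form scratchpad and a cheap constrained reformat.

from pydantic import BaseModel

class Verdict(BaseModel):
    key_reasons: list[str]   # reasoning field still comes before the answer
    answer: str

def two_call_answer(question: str) -> Verdict:
    # Call 1: unconstrained scratchpad. No schema in sight while the model reasons.
    scratch = complete(  # hypothetical free-form completion helper
        f"Think step by step about the following, then state a conclusion:\n{question}"
    )
    # Call 2: constrained reformat of the scratch. Short, cheap, schema-valid by construction.
    return complete_constrained(  # hypothetical structured-output helper returning a Verdict
        f"Summarize this analysis into the schema:\n{scratch}",
        schema=Verdict,
    )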

Three takeaways

The mechanism is logit masking driven by a state machine. Mask first, softmax second, sampling unchanged. Everything fancy in the implementation (FSMs, PDAs, token healing, jump-forward decoding) is engineering around the same single intervention.

The quality debate is mostly settled. Paired with a semantically aware prompt and a well-ordered schema, constrained decoding does not hurt task accuracy and usually helps. The legitimate distribution-distortion concern is real but small at modern model scale.

The highest-leverage decision in any structured-output pipeline is free. Put reasoning fields before answer fields. Rename final_choice to answer. Use enums for categories and indexes for selection. These changes cost nothing at runtime and routinely outperform model upgrades on the tasks where you care most.

If you take one idea from this post, take this: your schema is part of the decoding process. It is the same artifact your validator reads, but it is also actively steering the model as it generates. Design it like a prompt, because that is what it is now.


Footnotes

  1. Liu, J. Bad Schemas could break your LLM Structured Outputs. Instructor blog, September 2024. The 4.5% → 95% figure is striking enough that you should verify on your own task before citing.

  2. Tam et al. Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models. arXiv:2408.02442, 2024.

  3. Kurt, W. Say What You Mean: A Response to “Let Me Speak Freely”. dottxt blog, November 2024.

  4. Geng et al. JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models. arXiv:2501.10868, 2025.

  5. Park et al. Grammar-Aligned Decoding. NeurIPS 2024. arXiv:2405.21047.

  6. Zheng et al. SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS 2024. arXiv:2312.07104.

  7. Kurt, W. Structured Generation Improves LLM performance: GSM8K Benchmark. dottxt blog, 2024.

