When you build a multi-step LLM pipeline, you end up with a bag of strings to tune: a system prompt for the planner, another for the reasoner, a critique prompt for the verifier, maybe a code template the agent emits. Each one is, in effect, a hyperparameter you set by hand. TextGrad is the framework that asks what happens if you treat that whole bag of strings as a computation graph, define a textual loss at the output, and run something that looks exactly like backprop to update every string in the graph.
Each “gradient” in this framework is a short LLM-written critique of one variable. “Descent” means handing the variable and its critique back to an LLM with the instruction to rewrite the variable so it addresses the critique. The API is deliberately PyTorch-shaped: Variable, BlackboxLLM, TextLoss, loss.backward(), optimizer.step(). If you know the autograd loop you already know the shape of TextGrad.
Where this post sits
The Prompts are Hyperparameters post argued the reframe and toured DSPy, MIPROv2 and GEPA. The GEPA walkthrough drilled into reflective prompt evolution specifically. This post is the same family of ideas seen through TextGrad’s lens, which is the most explicit “make this look like PyTorch” attempt of the bunch.
TextGrad is described in Yuksekgonul et al., Optimizing generative AI by backpropagating language model feedback (Nature, 2025; preprint arXiv:2406.07496; code at zou-group/textgrad). It is the framework GEPA names as one of its baselines, and the conceptual lens worth carrying away even if you never run a TextGrad job in production.
Backpropagation, briefly
In a neural net, you compute an output, compare it to a target with a loss function, and ask: for each parameter, in which direction and by how much should it change to reduce the loss? Backprop answers this by applying the chain rule from the loss back through the graph. Each parameter gets a gradient, an optimiser steps along it, and you repeat.
Two things make this work. First, the operations are differentiable, so the chain rule gives you a clean local update signal at every node. Second, the gradient is a precise, actionable answer to the question “what should this parameter look like to do better?”.
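The loop being mirrored can be sketched in a few lines of plain Python. This is a minimal one-parameter example of forward, loss, backward (chain rule), step; no framework, no LLM, just the numeric skeleton TextGrad later swaps strings into.

```python
# Minimal numeric version of the loop TextGrad mirrors:
# forward, loss, backward (chain rule), optimiser step.

def forward(w, x):
    return w * x                      # the "model"

def loss(pred, target):
    return (pred - target) ** 2       # squared error

def backward(w, x, target):
    # Chain rule: dL/dw = dL/dpred * dpred/dw = 2 * (pred - target) * x
    pred = forward(w, x)
    return 2 * (pred - target) * x

w, x, target, lr = 0.0, 1.0, 3.0, 0.1
for _ in range(50):
    w -= lr * backward(w, x, target)  # step along the gradient

print(round(w, 3))  # converges toward 3.0
```

Every piece of this loop has a named counterpart in TextGrad's API; the rest of the post replaces the floats with strings and the derivatives with critiques.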
TextGrad keeps the shape of that loop and replaces each piece with an LLM-driven analog. Operations are still differentiable in the loose sense that each one can produce feedback about its inputs given feedback about its outputs. Gradients are still actionable answers to “what should this look like to do better?”, just written in English instead of stored as floats.
From floats to strings
A TextGrad computation graph has nodes that are strings (Variables) and edges that are LLM calls (BlackboxLLM ops). The forward pass runs the pipeline, threading a question through a system prompt and any intermediate modules until you get a final output. The “loss” is itself an LLM call whose job is to produce a critique of the final output. loss.backward() walks the graph in reverse, asking the backward engine at every node: given this critique of the output, what specific feedback should be sent to each upstream variable? Each variable accumulates its own textual gradient. optimizer.step() hands a variable and its gradient back to an LLM with the instruction: rewrite this variable to address this feedback.
Figure 1. A minimal TextGrad computation graph. Amber arrows are the forward pass (text in, text out). The blue dashed curve is the backward pass: LLM-generated critiques routed back through the graph.
One forward pass, four steps
The four stages below trace what actually moves on the graph during a single TextGrad update, using a small counting example. The worked example in stages two and four is the load-bearing intuition.
In prose, the four moves are:
- Forward. The system prompt and the question flow through a `BlackboxLLM` call and produce an `answer`. No optimisation has happened yet; we just ran the pipeline and stored the intermediate values on the graph.
- Loss. A `TextLoss` is itself an LLM call (or a programmatic check) that examines the output and returns a critique. The critique is the string-valued analogue of a scalar loss.
- Backward. `loss.backward()` walks the graph in reverse. At each node the backward engine asks: given downstream feedback, what should this upstream variable look like to do better? Every variable with `requires_grad=True` accumulates a written critique in its `.grad`.
- Step. `TGD` (Textual Gradient Descent) takes each parameter and its textual gradient, sends them to an LLM with the instruction "rewrite this to address the feedback", and replaces the value. The next forward pass uses the new version.
“Gradients” here are short LLM-written critiques. “Descent” means rewriting a variable to address that critique. The PyTorch metaphor is doing real work rather than just decorating the API.
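The four moves can be re-enacted in plain Python with stubbed LLM calls, which makes the data flow visible without an API key. Everything here (`Variable`, `forward_llm`, `text_loss`, `backward`, `tgd_step`) is an illustrative stand-in, not the real textgrad library.

```python
# Toy re-enactment of forward / loss / backward / step with stub LLMs.

class Variable:
    def __init__(self, value, requires_grad=False):
        self.value, self.requires_grad, self.grad = value, requires_grad, None

def forward_llm(system_prompt, question):
    # Stub model: without explicit instructions it answers tersely (and wrongly).
    if "step by step" in system_prompt.value:
        return "s-t-r-a-w-b-e-r-r-y: three r's. Final answer: 3."
    return "2"

def text_loss(answer):
    # Stub TextLoss: critique the output in natural language.
    if "Final answer" in answer:
        return "Looks good."
    return "The answer shows no working; request step-by-step reasoning."

def backward(critique, prompt):
    # Stub backward engine: map output feedback to prompt feedback.
    if prompt.requires_grad and "step-by-step" in critique:
        prompt.grad = "Prompt should ask the model to reason step by step."

def tgd_step(prompt):
    # Stub TGD: rewrite the variable to address its textual gradient.
    if prompt.grad:
        prompt.value += " Think step by step before answering."

prompt = Variable("Answer the question.", requires_grad=True)
question = "How many r's are in 'strawberry'?"

answer = forward_llm(prompt, question)   # 1. forward
critique = text_loss(answer)             # 2. loss
backward(critique, prompt)               # 3. backward
tgd_step(prompt)                         # 4. step

answer2 = forward_llm(prompt, question)  # next forward pass uses the new prompt
print(answer2)
```

The second forward pass succeeds because the prompt, not the answer, absorbed the feedback: that is the whole trick.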
The API primitives
TextGrad code is built from a small set of primitives:
import textgrad as tg

# 1. Variable: anything you want in the graph
# (prompt, answer, code, SMILES string, ...)
var = tg.Variable(text, role_description="...", requires_grad=True)
# 2. BlackboxLLM: a forward op that turns input variables
# into an output variable
model = tg.BlackboxLLM("gpt-4o", system_prompt=var)
out = model(question)
# 3. TextLoss: a natural-language loss.
# Evaluates a variable, produces a critique.
loss_fn = tg.TextLoss("Critique this answer. Be specific and concise.")
loss = loss_fn(out)
# 4. set_backward_engine: which LLM produces the textual gradients
tg.set_backward_engine("gpt-4o")
# 5. TGD (Textual Gradient Descent): the optimiser
optimizer = tg.TGD(parameters=[var])
loss.backward()
optimizer.step()
You build a graph of Variables and BlackboxLLM ops, evaluate it with a TextLoss, call backward(), call step(). The framework routes critiques through the graph behind the scenes; you rarely write the routing yourself.
Why one loop optimises prompts, code, and molecules
A Variable is just a string with metadata. Nothing in the optimisation loop assumes the string is a prompt. The Nature paper applies the same machinery to four very different domains.
The first is prompt optimisation for downstream reasoning and coding benchmarks, the most obvious application. Yuksekgonul et al. report improvements on GPQA, MMLU, and BIG-Bench Hard tasks against zero-shot baselines, using the system prompt as the only Variable with requires_grad=True.
The second is solution refinement, where the variable being optimised is the answer itself rather than the prompt that produced it. You freeze the system prompt, treat the candidate answer as a learnable string, and let TextGrad rewrite the answer over a few iterations until the critique converges.
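The shape of that refinement loop fits in a few lines. To keep the sketch runnable, a programmatic critic stands in for the LLM loss and a stub stands in for the TGD rewrite; the names and the arithmetic task are illustrative, not from the paper.

```python
# Solution refinement sketch: the answer itself is the learnable string.
# Iterate critique -> rewrite until the critique converges (comes back empty).

def critique(answer):
    # Programmatic stand-in for TextLoss: check the candidate answer.
    try:
        value = int(answer.split("=")[-1])
    except ValueError:
        return "The answer must end with '= <integer>'."
    return "" if value == 42 else f"{value} is wrong; recompute 6 * 7."

def rewrite(answer, feedback):
    # Stub TGD step: in real TextGrad an LLM rewrites the answer to
    # address the feedback; here we patch it directly.
    return "6 * 7 = 42" if feedback else answer

answer = "6 * 7 = 36"
for _ in range(3):                 # a few refinement iterations
    feedback = critique(answer)    # textual gradient on the answer
    if not feedback:
        break                      # critique converged
    answer = rewrite(answer, feedback)

print(answer)  # → 6 * 7 = 42
```

Note that nothing upstream of the answer has `requires_grad=True` here: the prompt is frozen and only the output moves.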
The third is code optimisation, where the variable is a code snippet and the loss combines runtime feedback (test failures, profiling output) with an LLM critique of correctness and style. This is a domain where the gradient is unusually informative because the executor produces precise, locatable failure messages.
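Because the executor's output is so precise, the code-optimisation loop can be sketched with no LLM at all: run the candidate against tests, treat the failure message as the textual gradient, and let a stubbed rewrite step address it. The `median` task and all names here are illustrative.

```python
# Code optimisation sketch: runtime feedback as the textual gradient.

buggy = "def median(xs):\n    return sorted(xs)[len(xs) // 2]"
fixed = ("def median(xs):\n"
         "    xs = sorted(xs)\n"
         "    mid = len(xs) // 2\n"
         "    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2")

def run_tests(code):
    # Executor half of the loss: precise, locatable failure messages.
    ns = {}
    exec(code, ns)
    try:
        assert ns["median"]([1, 2, 3]) == 2
        assert ns["median"]([1, 2, 3, 4]) == 2.5, "fails on even-length input"
    except AssertionError as e:
        return f"Test failure: {e}"   # this string is the gradient
    return ""

def rewrite(code, feedback):
    # Stub for the LLM rewrite; real TextGrad conditions on the feedback.
    return fixed if "even-length" in feedback else code

code = buggy
feedback = run_tests(code)            # forward + loss
if feedback:
    code = rewrite(code, feedback)    # backward + step, collapsed

print(run_tests(code) == "")  # → True
```

The interesting design point is that the "loss" is a hybrid: a deterministic executor supplies the locatable part of the feedback, and in the real framework an LLM critique of correctness and style is layered on top.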
The fourth, and the one most distant from prompt engineering, is molecule design. The variable is a SMILES string. The loss combines docking scores against a target protein with druglikeness critiques. The same backward() + step() machinery proposes new candidate molecules, in much the same way it proposes new candidate prompts. The “gradient” is now structural feedback (“this functional group is likely metabolically unstable; try replacing it”) delivered as text.
Push the molecule example further and the implication gets more interesting. The TextLoss in this framework does not have to be an LLM critique. A docking simulator could play that role, or a synthesis-cost model, or a logP predictor, or even a lab automation rig that runs a real assay and feeds the result back in. Whatever the source of the feedback, the same backward() + step() loop proposes the next candidate molecule conditioned on it: hypothesis (a SMILES string), experiment, structured feedback, updated hypothesis, repeat. Done well, the cycle runs without a human in the middle. That is roughly what people mean by “auto-research”, and TextGrad is one of the cleanest ways currently available to wire one up.
TextGrad therefore blurs the line between prompt engineering, retrieval pipeline tuning, chain-of-thought scaffolding, and even structure search. They are all graphs of strings with critiques flowing backward, and the API does not need to know which one you are doing.
When to use it (and when DSPy is the better bet)
Both the strengths and the limitations of TextGrad track its mechanism directly.
On the strengths side, the API is genuinely intuitive if you already know PyTorch. The learning curve is shallow and the names map cleanly to Variable, backward, step. The same machinery is task-agnostic, so a team that has invested in TextGrad for prompts can reuse the same loop for code search or molecule design with a new Variable definition and a new TextLoss. Critiques are inspectable in a way numerical gradients are not, so you can read what the optimiser is “thinking” at every step. That is useful for debugging, and useful when you have to defend a particular prompt update to a reviewer. The framework also composes cleanly with multi-step pipelines, which is harder in single-prompt frameworks that treat the prompt as the only knob.
The flip side starts with cost. Every step is one or more LLM calls, so cost and latency add up fast across a training set of any size. Gradients are noisy and non-deterministic, so convergence is empirical rather than guaranteed; reruns of the same job can produce different prompts, sometimes better, sometimes worse. Quality is bottlenecked by the backward engine, in that a weaker LLM produces weaker critiques, which produce weaker rewrites, which compound. And DSPy’s optimisers (MIPROv2, GEPA) are further along in compiler-style optimisation and pipeline modularity, so for single-prompt tuning with a clear metric they are usually a more efficient choice.
That narrows the case for actually running TextGrad in production. Use it when your pipeline already has multiple LLM-driven steps and you want one unified loop to improve all of them, particularly when the units being optimised are heterogeneous (some prompts, some code, some retrieval templates). For pure single-prompt tuning with a clear metric, the GEPA walkthrough covers the better choice in detail. There is a separate reason to learn TextGrad even if you would not run it in production: it gives you a clean way to think about LLM-driven compound systems as differentiable graphs, which is a useful lens even when you write the optimisation loop yourself.
References
- Paper. Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Lu, P., Huang, Z., Guestrin, C., & Zou, J. (2025). Optimizing generative AI by backpropagating language model feedback. Nature, 639, 609–616. doi:10.1038/s41586-025-08661-4.
- Preprint. arXiv:2406.07496.
- Code. github.com/zou-group/textgrad.
- Install. `pip install textgrad`.
- Hands-on. The authors ship official Colab notebooks for primitives, solution optimisation, code optimisation with a custom loss, prompt optimisation, and multimodal optimisation. Each needs an OpenAI or Anthropic key. If you want to see the API in motion before using it yourself, these are where to start.