ashashar@github: ~/essays

Reflection as a Training Signal

Oct 6, 2025 · 6 min read

For years, reinforcement learning has driven AI optimization with a simple logic: "try, fail, adjust, and repeat," thousands or millions of times. It works, but it comes at a high cost in compute, engineering effort, and time.

A recent paper, GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning, suggests a different path. Instead of endless trial-and-error with somewhat blunt reward signals, what if models could look back at their own reasoning, identify what went wrong, and propose better instructions in plain language? In other words, what if they could improve the way humans do, by reflecting on their own mistakes?

Think of it like coaching an athlete. I used to play basketball, and I remember how hard it was to improve when all I heard was "Good" or "Bad." I had to guess what needed to change. In reflection-based learning, I would instead watch the replay and notice the details. My elbow drifted. My wrist flicked late. I would adjust, and the feedback loop would tighten. I started learning faster, not by memorizing moves, but by understanding myself.

That principle, reflection as a driver of adaptation, is what makes GEPA interesting. It treats language itself as a training signal, turning vague outcomes into actionable insights. Models go beyond executing instructions: they analyze their own execution traces, propose prompt variations, and test which variants survive Pareto-based selection.
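To make that loop concrete, here is a minimal sketch of a GEPA-style reflective evolution step in Python. It is not the paper's reference implementation: run_task and llm_reflect are hypothetical stubs for an evaluation harness and a reflection model, and the Pareto step is a plain dominance check over per-task scores.

```python
import random

# A minimal sketch of a GEPA-style reflective evolution loop, not the paper's
# reference implementation. `run_task` and `llm_reflect` are hypothetical
# stubs: wire them to your own evaluation harness and reflection model.

def run_task(prompt: str, task: dict) -> tuple[float, str]:
    """Run one task with `prompt`; return (score, execution trace)."""
    raise NotImplementedError  # call your model or agent here

def llm_reflect(prompt: str, failure_traces: list[str]) -> str:
    """Ask an LLM to read failure traces and propose a revised prompt."""
    raise NotImplementedError  # call your reflection model here

def evaluate(prompt: str, tasks: list[dict]) -> dict:
    """Score a candidate prompt on every task, keeping the traces."""
    scores, traces = {}, {}
    for i, task in enumerate(tasks):
        scores[i], traces[i] = run_task(prompt, task)
    return {"prompt": prompt, "scores": scores, "traces": traces}

def pareto_front(pool: list[dict]) -> list[dict]:
    """Keep candidates that no other candidate dominates on all per-task scores."""
    def dominates(a, b):
        return (all(a["scores"][t] >= b["scores"][t] for t in b["scores"])
                and any(a["scores"][t] > b["scores"][t] for t in b["scores"]))
    return [c for c in pool
            if not any(dominates(o, c) for o in pool if o is not c)]

def evolve(seed_prompt: str, tasks: list[dict], budget: int) -> str:
    pool = [evaluate(seed_prompt, tasks)]
    for _ in range(budget):
        parent = random.choice(pareto_front(pool))   # sample from the frontier
        failures = [parent["traces"][t]
                    for t, s in parent["scores"].items() if s < 1.0]
        child = llm_reflect(parent["prompt"], failures)  # plain-language edit
        pool.append(evaluate(child, tasks))              # test the variation
    best = max(pareto_front(pool), key=lambda c: sum(c["scores"].values()))
    return best["prompt"]
```

The design choice worth noticing is that parents are sampled from the Pareto frontier rather than a single best candidate, which keeps diverse lessons alive across tasks, the same diversity the paper's ablations show matters.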

The results from the paper are compelling. On benchmarks like HotpotQA and HoVer, GEPA matched or exceeded reinforcement learning while using up to 35× fewer rollouts. The prompts that emerged were more effective and, just as important, shorter and cheaper to run in production. Ablation studies confirmed the importance of both reflection and diversity: remove either, and performance drops.

There are concrete implications for developers and enterprises:

  • Efficiency: Better results with lower compute cost.
  • Adaptability: Prompts can evolve in place, without retraining the whole system.
  • Interpretability: Edits happen in plain language, making them easier to inspect.

At the same time, we should be clear about limitations. Reflection can be noisy. Some tasks don't yield obvious traces to critique. And when it comes to deep alignment, questions of safety, bias, and values, reinforcement learning and retraining are still necessary. GEPA complements rather than replaces these methods, offering a lighter-weight way to adapt in production.

This is exactly the space we are building into at Microsoft CoreAI. Our focus is not only to make models smarter, but to make systems adaptive: developer tools that learn while you use them, enterprise agents that adjust as they encounter new data, compilers and services that refine their own strategies over time.

Imagine a Copilot CLI that fine-tunes its instructions on the fly, adapting to the developer's workflow, coding style, and intent. Or an enterprise agent that optimizes its grounding strategies in real time based on retrieved context and observed outcomes. Reflection becomes part of the runtime, not just the training process.
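As a thought experiment, here is a hedged sketch of what that runtime loop might look like. agent_call and critique are assumed stand-ins for model calls, not a real Copilot CLI API; the point is that the instructions themselves are the thing that evolves, and every edit is a readable string that can be logged.

```python
# A hypothetical sketch of reflection inside the runtime, not a real Copilot
# CLI API. `agent_call` and `critique` are assumed stand-ins for model calls.

def agent_call(instructions: str, request: str) -> tuple[bool, str]:
    """Execute one agent step; return (succeeded, execution trace)."""
    raise NotImplementedError

def critique(instructions: str, trace: str) -> str:
    """Ask a reflection model to rewrite the instructions given a failure trace."""
    raise NotImplementedError

def run_with_reflection(instructions: str, request: str, retries: int = 3) -> str:
    """On failure, amend the instructions in plain language and retry."""
    for _ in range(retries):
        ok, trace = agent_call(instructions, request)
        if ok:
            break
        instructions = critique(instructions, trace)  # the edit is a readable string
    return instructions  # persist the evolved instructions for the next request
```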

Two vignettes as examples of adaptation in place:

"An Assistant That Learns With You"

A developer asks Copilot CLI to scaffold a script. The first attempt fails due to a missing internal API detail.

Copilot reflects on its reasoning: "Assumed a public API; should handle x-api-key header." It updates its prompt and succeeds.

Later, while debugging a test, Copilot simplifies its own verbose patch after reflection: "The test succeeded, but redundant checks were added. Simplify."

Over time, Copilot's underlying prompts evolve in place, becoming shorter and more accurate without retraining.

Impact: Developers experience an assistant that adapts with them, avoiding recurring mistakes over time.

"Compliance That Adapts in Real Time"

A financial compliance agent mislabels a clause as "non-compliant" under Canadian law.

The agent reflects: "Relied on U.S. rules; adjust prompt to include Canadian disclosure policies." It updates instructions immediately.

Later, deployed in Europe, the agent reflects again: "Need to include GDPR disclosure for retention terms."

Prompts evolve per deployment, without retraining cycles.

Impact: Lower cost of ownership, faster adaptation to regulations, and natural-language logs for auditability.

Instead of tools that stop learning the day they ship, we are building systems that grow through use. GEPA points to that shift, showing that progress does not always need complex retraining or endless data. Sometimes the most powerful signal for improvement comes from within: the system's own ability to reflect.

There's still work to do. Reflections need to become more reliable. Merge strategies need refinement. Guardrails must prevent hallucinated edits. But the direction is clear. We are entering a new era where language models are participants in their own improvement. For Microsoft CoreAI and GitHub, that means moving from static releases to living systems, and from products that are deployed once to platforms that evolve with you.

The future of software development will belong to builders who can design systems that don't just execute logic, but learn from their own reasoning.