Training Agentic Systems for Live Markets

Notes on the iterative development process behind a long-running research project

Most public writing on training AI systems for trading falls into one of two registers. There’s the academic register, which describes formal training methodology in terms that don’t survive contact with live markets. And there’s the marketing register, which describes triumphant results without engaging seriously with what the training process actually involves. Neither register is particularly useful for engineers working on the problem.

This article is an attempt at a third register: an honest description of what iterative training of agentic systems looks like in practice, drawn from nine years of work on the problem. The framing here is general enough to be useful across systematic trading contexts; specifics about any particular implementation are kept deliberately abstract. The work described is part of an internal research project. Nothing in this article describes a product, service, or investment vehicle available to outside parties, and the article does not constitute an offer of any kind.

The basic shape of the work

At a high level, the iterative training cycle for an agentic trading system has four phases that repeat continuously: pushing agents to their limits, identifying failure points, retraining against those failures, and validating that the retraining actually addressed the underlying problem rather than just masking it.

Stated that simply, this sounds straightforward. In practice each phase contains failure modes that defeat most attempts. The reason this work takes years rather than months is that the failure modes are individually subtle, collectively numerous, and only become visible through extended operation in changing conditions.

Pushing agents to their limits

The first phase is about finding out where the system breaks. This is harder than it sounds because the obvious approach — testing against historical extreme conditions — produces results that don’t generalize. Agents trained against historical volatility events learn to recognize the specific shape of those events, not the general phenomenon of extreme conditions. When new extreme conditions arrive with a different shape, the recognition fails.

A more productive approach is to construct synthetic stress conditions that are structurally extreme without being historically specific. This means scenarios with extreme participant behavior, extreme liquidity discontinuities, extreme regime shifts — but not literal replays of 1987 or 2008 or 2020. The goal is to find the boundaries of the agents’ competence in a way that generalizes.
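
One way to make this concrete is a parameterized scenario generator. The sketch below is a minimal illustration, not a description of any actual implementation: the names StressScenario, sample_scenario, and generate_path, the parameter ranges, and the Gaussian log-return model are all assumptions chosen for brevity. It samples the structure of a stress event (a volatility regime shift followed by random liquidity gaps) instead of replaying a dated historical episode.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StressScenario:
    """Hypothetical description of a structurally extreme scenario."""
    vol_multiplier: float   # volatility jump after the regime shift
    shift_step: int         # step at which the regime shift occurs
    gap_probability: float  # per-step chance of a liquidity gap after the shift
    gap_size: float         # size of each gap, as a log-return


def sample_scenario(rng: random.Random, n_steps: int) -> StressScenario:
    """Draw scenario parameters from wide ranges (illustrative values only)."""
    return StressScenario(
        vol_multiplier=rng.uniform(3.0, 10.0),
        shift_step=rng.randrange(n_steps // 4, 3 * n_steps // 4),
        gap_probability=rng.uniform(0.01, 0.05),
        gap_size=rng.uniform(0.02, 0.10),
    )


def generate_path(scenario: StressScenario, rng: random.Random, n_steps: int,
                  base_vol: float = 0.01, start_price: float = 100.0) -> list:
    """Generate one synthetic price path under the given scenario."""
    prices = [start_price]
    for step in range(1, n_steps):
        stressed = step >= scenario.shift_step
        vol = base_vol * (scenario.vol_multiplier if stressed else 1.0)
        log_return = rng.gauss(0.0, vol)
        if stressed and rng.random() < scenario.gap_probability:
            log_return -= scenario.gap_size  # discontinuous downward gap
        prices.append(prices[-1] * math.exp(log_return))
    return prices


rng = random.Random(42)  # fixed seed so a failing scenario can be reproduced
scenario = sample_scenario(rng, n_steps=200)
path = generate_path(scenario, rng, n_steps=200)
```

Seeding the generator matters more than the specific distributions: when an agent fails on a sampled scenario, the same seed reproduces the exact path for diagnosis.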

The discipline that has to be maintained here is to run stress tests that can genuinely fail, rather than simulations that only replay what the agent already knows. There is a strong temptation to declare an agent robust because it passed simulated stress tests, when all those tests demonstrated is that the simulation reproduced the agent’s expected behavior in scenarios it was trained against. Real stress testing requires conditions the agent was not specifically trained for, and a willingness to discover that the agent fails in them.

Identifying failure points

When agents fail under stress, the failures themselves are usually obvious. The question is what caused them. The answer is rarely the most visible factor.

A common pattern: an agent fails to respond appropriately to a volatility spike. The visible failure is in the execution layer. The actual cause is in the perception layer, which misclassified the regime and provided wrong context to the reasoning layer, which made the wrong decision, which the execution layer faithfully executed. Tracing failures to their actual source requires careful logging of every layer’s inputs and outputs, and the discipline to actually trace through the chain rather than fixing the visible failure and moving on.
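
The tracing discipline can be sketched in a few lines. The example below is hypothetical throughout: the layer names, the toy perceive, reason, and execute functions, and the DecisionTrace helper are assumptions made for illustration. It records each layer's input and output, then compares the recorded outputs against what a reviewer judges, in hindsight, each layer should have produced, and reports the earliest layer that diverged.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Mapping, Optional


@dataclass
class LayerRecord:
    layer: str
    inputs: Any
    output: Any


class DecisionTrace:
    """Records every layer's inputs and outputs for one decision."""

    def __init__(self) -> None:
        self.records: List[LayerRecord] = []

    def run(self, layer: str, fn: Callable[[Any], Any], inputs: Any) -> Any:
        output = fn(inputs)
        self.records.append(LayerRecord(layer, inputs, output))
        return output


def first_divergence(trace: DecisionTrace,
                     expected: Mapping[str, Any]) -> Optional[str]:
    """Return the earliest layer whose output differs from the expected one."""
    for record in trace.records:
        if record.layer in expected and record.output != expected[record.layer]:
            return record.layer
    return None


# Toy layers (assumed for illustration). The perception threshold is too high.
def perceive(obs):
    return "calm" if obs["realized_vol"] < 0.5 else "stressed"

def reason(regime):
    return "hold" if regime == "calm" else "reduce"

def execute(decision):
    return {"hold": 0, "reduce": -100}[decision]


trace = DecisionTrace()
regime = trace.run("perception", perceive, {"realized_vol": 0.3})
decision = trace.run("reasoning", reason, regime)
order = trace.run("execution", execute, decision)

# What a reviewer judges each layer should have produced, in hindsight.
hindsight = {"perception": "stressed", "reasoning": "reduce", "execution": -100}
root_cause = first_divergence(trace, hindsight)
```

Here the visible failure is the order of 0 where a reduction was needed, but the earliest divergence is in the perception layer; the reasoning and execution layers behaved correctly given what they were told.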

A second common pattern: an agent fails in a specific market condition and the diagnosis points to the training data not covering that condition. The fix appears obvious — add training examples for that condition and retrain. This is usually wrong. The actual problem is more often that the agent was trained against the wrong representation of the condition, or that the condition exposes a deeper architectural limitation that additional training is unlikely to resolve. Throwing more data at the visible symptom can mask the underlying problem in ways that produce confidence without competence.

The disciplined approach is to treat every failure as potentially diagnostic of architecture rather than data, and to resist the temptation to declare the diagnosis complete once a plausible cause has been identified. The plausible cause is usually the surface; the actual cause is usually a layer or two deeper.

Retraining against failures

Retraining is the phase where the temptation to make the system look better is strongest, and therefore where the most damage can be done.

The technically correct approach is to update the system in ways that address the diagnosed failure without overfitting to the specific scenario that surfaced it. This is harder than it sounds because the natural feedback loop — the agent now performs well in the scenario that previously caused failure — provides immediate positive reinforcement for changes that may actually narrow the agent’s competence rather than broadening it. A retraining update that improves the agent in one specific scenario while subtly degrading its behavior in others looks like a win in the moment and becomes a problem later.

The discipline that addresses this is comprehensive validation across the full range of conditions the agent is expected to handle, not just the scenario that prompted the retraining. This is expensive — it means running validation against everything you can construct, not just the failure case — but it’s the only way to detect regression.
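
A minimal sketch of such a regression check follows. It assumes each scenario yields a single score where higher is better; the scenario names, scores, and tolerance are hypothetical, and a real suite would need a far richer notion of behavior than one number per scenario.

```python
from typing import Dict, Tuple


def find_regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                     tolerance: float = 0.0) -> Dict[str, Tuple[float, float]]:
    """Return scenarios where the candidate scores worse than the baseline.

    Maps scenario name to (baseline_score, candidate_score).
    """
    missing = set(baseline) - set(candidate)
    if missing:
        # Refuse to compare if the candidate skipped part of the suite.
        raise ValueError(f"candidate not evaluated on: {sorted(missing)}")
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }


def accept_update(baseline: Dict[str, float], candidate: Dict[str, float],
                  target_scenario: str, tolerance: float = 0.0) -> bool:
    """Accept only if the target improved and nothing else regressed."""
    improved = candidate[target_scenario] > baseline[target_scenario]
    return improved and not find_regressions(baseline, candidate, tolerance)


# Hypothetical scores: the retrain fixed "vol_spike" but hurt "liquidity_gap".
baseline = {"vol_spike": 0.2, "liquidity_gap": 0.7, "trend": 0.8}
candidate = {"vol_spike": 0.9, "liquidity_gap": 0.5, "trend": 0.8}

regressions = find_regressions(baseline, candidate, tolerance=0.05)
accepted = accept_update(baseline, candidate, "vol_spike", tolerance=0.05)
```

In this example the update is rejected even though it fixed the scenario that prompted it, because the liquidity_gap score fell by more than the tolerance.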

A second discipline is to be willing to back out changes when validation reveals problems, even when the changes addressed the original failure. The natural inclination is to keep the fix and try to address the regression with additional fixes. This rapidly produces systems that are accretions of patches rather than coherent architectures. Sometimes the right answer is to recognize that the original failure cannot be addressed within the current architecture and that more fundamental work is needed.

Validation

The last phase is about confirming that the work actually accomplished something. This sounds trivial and is in fact one of the hardest parts of the cycle.

The challenge is that the conditions an agent will encounter in live operation are not the same as the conditions used to validate it. Validation against known conditions can only describe behavior against those conditions; live operation is where the agent encounters conditions that don’t match anything in the validation set. A system that validates well but behaves poorly in live operation is one of the most demoralizing failure modes in this work, partly because the cause is often that validation was too closely matched to training rather than reflecting the actual generalization the system needs to demonstrate.

The discipline that addresses this is to maintain validation conditions that are deliberately not derived from training data, and to update validation conditions less frequently than training data so that successive training cycles can be compared against a stable reference. This is operationally painful — it means validation results sometimes look worse than training results suggest they should — but it provides actual information about whether the system is improving.
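
A small part of this discipline can be automated. The sketch below assumes scenarios are described by JSON-serializable dictionaries (an assumption made for illustration, as are the function names). It catches only exact overlap between training and validation scenarios, which is a much weaker property than "not derived from training data", and it computes a version identifier so that results from different training cycles are compared only when they ran against the same frozen suite.

```python
import hashlib
import json
from typing import List


def fingerprint(scenario: dict) -> str:
    """Stable hash of a scenario, independent of key order."""
    canonical = json.dumps(scenario, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def leaked_scenarios(training: List[dict], validation: List[dict]) -> List[dict]:
    """Return validation scenarios that also appear verbatim in training."""
    training_prints = {fingerprint(s) for s in training}
    return [s for s in validation if fingerprint(s) in training_prints]


def suite_version(validation: List[dict]) -> str:
    """Short identifier that changes whenever the validation suite changes."""
    joined = "".join(sorted(fingerprint(s) for s in validation))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


# Hypothetical scenario descriptions.
training = [{"kind": "gap", "size": 0.05}]
validation = [{"size": 0.05, "kind": "gap"}, {"kind": "regime_shift", "vol": 4}]

leaks = leaked_scenarios(training, validation)
version = suite_version(validation)
```

Storing the suite version alongside every validation result makes it obvious when two numbers are not comparable because the reference moved between them.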

A second discipline is to weight live behavior more heavily than validation results when they disagree, and to be skeptical of any pattern where validation steadily improves while live behavior does not. This pattern usually indicates that training is moving toward validation rather than toward live generalization, which is a failure mode that gets harder to detect the longer it continues.
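
This pattern can be monitored mechanically, at least in a crude form. The sketch below assumes one validation score and one live score per training cycle, both on a higher-is-better scale; the function names, the example scores, and the use of a simple least-squares slope are illustrative assumptions, not a recommended statistic.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def validation_outpacing_live(validation_scores: Sequence[float],
                              live_scores: Sequence[float],
                              min_improvement: float = 0.0) -> bool:
    """Flag cycles where validation trends up while live behavior does not."""
    return (slope(validation_scores) > min_improvement
            and slope(live_scores) <= 0.0)


# Hypothetical per-cycle scores.
validation_history = [0.60, 0.65, 0.70, 0.75]
live_history = [0.55, 0.54, 0.55, 0.53]

warning = validation_outpacing_live(validation_history, live_history)
```

A flag from a check like this is a prompt to investigate, not a verdict; live scores are noisy, and a short history can produce a flat or negative slope by chance.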

What this cycle is not

It is worth stating a few things this iterative training process is not, because the description above can sound cleaner than the underlying work actually is:

It is not a process that produces a finished system. The system at any given moment is the current state of a continuous development effort. The framing of “training complete, deploy to production” doesn’t apply because the conditions the system operates in continue to evolve and the system has to evolve with them.

It is not a process where each iteration is straightforwardly an improvement on the last. Many iterations make the system worse in ways that aren’t immediately visible. Recognizing and reverting bad iterations is a substantial part of the work.

It is not a process that scales naively with computational resources. More compute applied to the same architecture produces diminishing returns once the architecture has been adequately exercised. The bottleneck is usually conceptual rather than computational: figuring out what to do, not running more iterations of what you already know how to do.

Why this work takes years

The cycle described above takes years to execute well for one fundamental reason: many of the failure modes only become visible through extended operation in changing conditions. A system can pass every validation test for months and then encounter a market condition that exposes a limitation no one anticipated. The only way to find these limitations is to run the system through enough varied conditions for them to surface, and there is no way to compress this timeline through more aggressive testing.

This is the part of systematic trading work that is least appreciated by people who haven’t done it. The intellectual content of the architecture can be developed relatively quickly. The reduction to working code can be done in months. The actual maturation of the system into something that handles real conditions reliably takes years, and the years cannot be skipped.

This is also the reason that public claims about systems built in months should be treated with skepticism. Whatever exists at that point is a prototype. Whether it survives extended live operation is an open question that only time can answer.