Why I built an AI trading system where the AI never picks trades

By Jackson Lai · 2026-05-11 · Open-source release of JacksonBuildsAI/ai-trading-lab v0.1.0

Most "AI trading" projects share a structural mistake: they hand the model the buy/sell decision and bolt safety mechanisms on as an afterthought. The model produces confident, well-formatted theses. The safety mechanisms get overridden the first time someone is "behind schedule" on returns. The end state is a slow bleed — looking rigorous in the audit while losing money to factor exposure and friction.

I just open-sourced a different shape. AI Trading Lab v2 is a defensive trading scaffold where the AI is restricted to two narrow jobs:

  1. Narrator — reads the deterministic decision and writes a monthly summary
  2. Anomaly detector — flags days where prices look statistically weird vs a 60-day baseline

Neither output is fed back into the trading loop. If both AI modules were deleted tomorrow, the trading loop would behave identically. The AI is downstream of the decision, not upstream.

The architecture

Six layers, each able to reject what the layer above asks for:

AI layer        — narrator + anomaly detector ONLY (no signal)
Strategy        — pure rules, deterministic, backtestable
Risk engine     — factor caps, drawdown breaker, reconciler
Execution       — limit orders, slippage budget, partial-fill OK
Broker          — Alpaca paper/live, simulated, backtest
Audit           — SQLite, every event recorded
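The veto chain can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: the `Order` type, `apply_layers`, and `risk_cap` names are hypothetical, standing in for whatever each real layer does.

```python
from dataclasses import dataclass

@dataclass
class Order:
    symbol: str
    qty: float

def apply_layers(orders, layers):
    """Pass proposed orders down the stack; any layer may shrink or veto them."""
    for layer in layers:
        orders = layer(orders)
        if not orders:  # a layer rejected everything: nothing reaches the broker
            return []
    return orders

# Example layer: a risk engine that drops orders over a per-symbol quantity cap.
def risk_cap(orders, max_qty=100):
    return [o for o in orders if abs(o.qty) <= max_qty]

kept = apply_layers([Order("SPY", 50), Order("GLD", 500)],
                    [lambda os: risk_cap(os)])
# The GLD order exceeds the cap and is dropped; only SPY survives.
```

The key property is directional: orders only flow down, and every layer below can say no to the layer above.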

The AI cannot reach below itself. The strategy is pure rules — a trend-filtered tactical asset allocation across 5 ETFs (SPY, EFA, AGG, GLD, BIL). 12-month momentum ranking, 200-day SMA filter, monthly rebalance.

The 7-gate live-trading lock

Going from paper to live capital requires all of:

  1. mode: live in config
  2. risk.allow_live_trading: true
  3. TRADING_LAB_ENABLE_LIVE=1 in environment
  4. --confirm-live CLI flag
  5. --execute CLI flag
  6. Latest backtest report passes: config hash matches the current config (no tuning-then-switching), Sharpe ≥ its floor, max drawdown within its cap, Calmar within 10% of the buy-and-hold benchmark
  7. Reconciliation between audit and broker positions is clean

The config-hash check is the interesting one. An operator who tunes the config to make a backtest pass, runs the backtest, then switches back to the original config ends up with a stale report on disk that no longer matches the deployed config. The hash mismatch refuses to unlock live mode.
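The mechanism needs nothing exotic: hash a canonical serialization of the config, store it in the backtest report, and compare at unlock time. A minimal sketch, with hypothetical function and field names:

```python
import hashlib
import json

def config_hash(cfg: dict) -> str:
    """Hash a canonical serialization so key order can't change the digest."""
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def backtest_report_valid(report: dict, deployed_cfg: dict) -> bool:
    # The report stores the hash of the config it was produced under. A stale
    # report (config tuned, backtest run, config switched back) fails here.
    return report.get("config_hash") == config_hash(deployed_cfg)
```

Sorting the keys matters: two semantically identical configs must produce the same digest, or the gate would trip on harmless reordering.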

The honest result

Backtest over 16.5 years (2008-2024) including 2008, 2020, and the 2022 bond rout:

Metric                      TF-TAA     Buy-Hold SPY
Annualized return           4.48%      11.07%
Max drawdown                -14.34%    -49.70%
Sharpe (rf=4%)              0.054      0.360
Calmar (return / max DD)    0.31       0.22

The strategy beats SPY on Calmar — better risk-adjusted return. It loses on absolute return because the cash sleeve drags during sustained bull runs.
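The Calmar figures follow directly from the other two rows of the table: annualized return divided by the magnitude of max drawdown.

```python
def calmar(annualized_return: float, max_drawdown: float) -> float:
    """Calmar ratio: annualized return over the magnitude of max drawdown."""
    return annualized_return / abs(max_drawdown)

print(round(calmar(0.0448, -0.1434), 2))  # strategy       → 0.31
print(round(calmar(0.1107, -0.4970), 2))  # buy-and-hold   → 0.22
```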

But the Sharpe floor in the live-trading lock is 0.5. This strategy doesn't pass. The CLI refuses to enable live capital. That refusal is exactly what the safety architecture is supposed to do.

The infrastructure is the artifact. The strategy is one variant of many you could plug in. The acceptance criterion for a new strategy is in the README: pass the live-trading floor, beat the benchmark on Calmar, ship a PR.

What the AI actually does

Narrator

After each rebalance, the AI receives the deterministic decision (target weights, current weights, computed orders) and writes a human-readable summary explaining what changed and why, with citations back to the deterministic logic. Costs about $0.50/month using Claude Haiku 4.5 with prompt caching. Output is for the operator's monthly review, never for the trading loop.
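The important property is the direction of data flow: the decision is an input to the model, never an output of it. A minimal sketch of assembling the narrator's prompt (the function name and payload keys are hypothetical, and the actual model call is omitted):

```python
def narrator_prompt(decision: dict) -> str:
    """Build the narrator's input from the already-made deterministic decision.
    The model only describes this payload; nothing it writes flows back into trading."""
    lines = [
        "Summarize this rebalance for the operator. Cite the rule behind each change.",
        f"Target weights:  {decision['target_weights']}",
        f"Current weights: {decision['current_weights']}",
        f"Computed orders: {decision['orders']}",
    ]
    return "\n".join(lines)
```

Delete this function and the trades still happen; only the monthly review gets quieter.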

Anomaly detector

Two-pass design. The fast statistical screen runs first — for each universe symbol on each daily check, it computes today's return vs the trailing 60-day distribution and flags anything more than 4 standard deviations out. If something is flagged, the AI optionally adds a second-pass classification with context. The statistical screen is the authority — AI failure must never silence a real anomaly. Critical anomalies halt the workflow until the operator clears them with --accept-critical-anomaly --reason '...'.
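The first-pass screen is plain statistics. A minimal sketch, assuming a 60-day trailing window and the 4-sigma threshold from the post (the function name and return shape are hypothetical):

```python
import numpy as np

def flag_anomaly(returns: np.ndarray, today_return: float, z_threshold: float = 4.0):
    """First-pass screen: compare today's return to the trailing 60-day distribution.
    Returns (flagged, z_score). The AI second pass runs only when flagged is True,
    and can never un-flag what this screen catches."""
    window = returns[-60:]
    mu, sigma = window.mean(), window.std(ddof=1)
    if sigma == 0:
        return False, 0.0
    z = (today_return - mu) / sigma
    return bool(abs(z) > z_threshold), float(z)
```

Keeping the screen authoritative means an AI outage degrades gracefully: flags still fire, they just arrive without the second-pass classification.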

Why this generalizes

"AI proposes, code disposes" is the right shape for AI in any high-consequence domain — finance, healthcare, infrastructure ops, defense. The trading lab is just one instance.

The pattern is:

  1. A deterministic, backtestable core makes every decision
  2. The AI narrates and flags anomalies; its output never feeds back into the decision path
  3. Multiple independent gates sit between intent and any high-consequence action
  4. The system refuses to act when its own validation fails

That last one is the most important. A trading system that refuses to deploy capital against a strategy that fails its own backtest is a different kind of artifact than one that just trusts the operator. The same code shape applies to a clinical AI that refuses to surface a recommendation when the input data fails an integrity check, or an infra agent that refuses to apply a config change when the canary fails its own SLA.

Try it

If you're working on AI-in-the-loop systems for any high-stakes domain, the "narrator, not decider" pattern transfers directly. Different domain, same problem shape.

If you're hiring for harness or agent engineering and want to compare notes, find me on LinkedIn or @JacksonAIBuilds on X.