Why I built an AI trading system where the AI never picks trades
Most "AI trading" projects share a structural mistake: they hand the model the buy/sell decision and bolt safety mechanisms on as an afterthought. The model produces confident, well-formatted theses. The safety mechanisms get overridden the first time someone is "behind schedule" on returns. The end state is a slow bleed — looking rigorous in the audit while losing money to factor exposure and friction.
I just open-sourced a different shape. AI Trading Lab v2 is a defensive trading scaffold where the AI is restricted to two narrow jobs:
- Narrator — reads the deterministic decision and writes a monthly summary
- Anomaly detector — flags days where prices look statistically weird vs a 60-day baseline
Neither output is fed back into the trading loop. If both AI modules were deleted tomorrow, the trading loop would behave identically. The AI is downstream of the decision, not upstream.
The architecture
Six layers, each able to reject what the layer above asks for:
- AI layer — narrator + anomaly detector ONLY (no signal)
- Strategy — pure rules, deterministic, backtestable
- Risk engine — factor caps, drawdown breaker, reconciler
- Execution — limit orders, slippage budget, partial-fill OK
- Broker — Alpaca paper/live, simulated, backtest
- Audit — SQLite, every event recorded
The AI cannot reach below itself. The strategy is pure rules — a trend-filtered tactical asset allocation across 5 ETFs (SPY, EFA, AGG, GLD, BIL). 12-month momentum ranking, 200-day SMA filter, monthly rebalance.
The 7-gate live-trading lock
Going from paper to live capital requires all of:
- `mode: live` in config
- `risk.allow_live_trading: true` in config
- `TRADING_LAB_ENABLE_LIVE=1` in environment
- `--confirm-live` CLI flag
- `--execute` CLI flag
- Latest backtest report passes: config hash matches the current config (no tuning-then-switching), Sharpe ≥ floor, max drawdown ≤ cap, Calmar within 10% of the buy-and-hold benchmark
- Reconciliation between audit and broker positions is clean
The config-hash check is the interesting one. An operator who tunes config to make a backtest pass, runs the backtest, then switches back to the original config, would have a stale report on disk that no longer matches the deployed config. The hash mismatch refuses to unlock live mode.
The honest result
Backtest over 16.5 years (2008-2024) including 2008, 2020, and the 2022 bond rout:
| Metric | TFTAA | Buy-Hold SPY |
|---|---|---|
| Annualized return | 4.48% | 11.07% |
| Max drawdown | -14.34% | -49.70% |
| Sharpe (rf=4%) | 0.054 | 0.360 |
| Calmar (return / max DD) | 0.31 | 0.22 |
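The Calmar row follows directly from the other two rows; a quick check of the arithmetic:

```python
# Calmar ratio = annualized return / |max drawdown|, using the table's figures.
def calmar(ann_return: float, max_dd: float) -> float:
    return ann_return / abs(max_dd)

calmar(0.0448, -0.1434)  # TFTAA: ~0.31
calmar(0.1107, -0.4970)  # Buy-and-hold SPY: ~0.22
```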
The strategy beats SPY on Calmar — better risk-adjusted return. It loses on absolute return because the cash sleeve drags during sustained bull runs.
But the Sharpe floor in the live-trading lock is 0.5. This strategy doesn't pass. The CLI refuses to enable live capital. That refusal is exactly what the safety architecture is supposed to do.
The infrastructure is the artifact. The strategy is one variant of many you could plug in. The acceptance criterion for a new strategy is in the README: pass the live-trading floor, beat the benchmark on Calmar, ship a PR.
What the AI actually does
Narrator
After each rebalance, the AI receives the deterministic decision (target weights, current weights, computed orders) and writes a human-readable summary explaining what changed and why, with citations back to the deterministic logic. Costs about $0.50/month using Claude Haiku 4.5 with prompt caching. Output is for the operator's monthly review, never for the trading loop.
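The one-way data flow is the whole point, so it's worth sketching. The prompt text, function name, and decision fields below are assumptions for illustration; only the shape — decision in, prose out, nothing returned to the loop — reflects the design described above:

```python
# Illustrative sketch of the narrator's one-way data flow.
# Field names and prompt wording are assumptions, not the repo's code.
import json

def narrate(decision: dict, llm) -> str:
    """decision: the already-executed rebalance (targets, currents, orders).

    llm: any callable taking a prompt string and returning a string,
    e.g. a thin wrapper around a Claude Haiku call with prompt caching.
    """
    prompt = (
        "Summarize this completed rebalance for the operator's monthly "
        "review. Cite the deterministic rule behind each change. Decision:\n"
        + json.dumps(decision, indent=2)
    )
    summary = llm(prompt)
    # The summary is stored for the audit log and monthly review only.
    # Nothing in this function feeds data back to strategy or execution.
    return summary
```

The narrator runs after the orders exist, so a hallucinated summary can mislead a human reader but cannot move money.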
Anomaly detector
Two-pass design. The fast statistical screen runs first — for each universe symbol on each daily check, it computes today's return vs the trailing 60-day distribution and flags anything more than 4 standard deviations out. If something is flagged, the AI optionally adds a second-pass classification with context. The statistical screen is the authority — AI failure must never silence a real anomaly. Critical anomalies halt the workflow until the operator clears them with --accept-critical-anomaly --reason '...'.
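The first-pass screen is plain statistics and can be sketched directly from the description; the function name and exact variance convention are assumptions:

```python
# Sketch of the first-pass statistical screen: today's return vs the
# trailing 60-day distribution, flagged beyond 4 standard deviations.
# Window and threshold come from the text; the API shape is assumed.
import numpy as np

def screen(returns: np.ndarray, window: int = 60, k: float = 4.0) -> bool:
    """returns: daily returns, most recent last. True = anomaly flagged."""
    baseline = returns[-(window + 1):-1]  # trailing window, excluding today
    mu, sigma = baseline.mean(), baseline.std()
    if sigma == 0.0:
        # Degenerate flat baseline: any move at all is anomalous.
        return bool(returns[-1] != mu)
    return bool(abs(returns[-1] - mu) > k * sigma)
```

Because this pass is deterministic, an LLM outage or a bad second-pass classification can add context but can never unflag a day the statistics flagged.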
Why this generalizes
"AI proposes, code disposes" is the right shape for AI in any high-consequence domain — finance, healthcare, infrastructure ops, defense. The trading lab is just one instance.
The pattern is:
- Code makes every consequential decision via deterministic rules you can write down
- AI is restricted to surfaces where being wrong is cheap (summarization, anomaly classification, narrative explanation)
- State of the world is the source of truth, not the AI's prediction of it
- Safety mechanisms have explicit, friction-laden, logged overrides — never silent toggles
- The system can refuse its own operator
That last one is the most important. A trading system that refuses to deploy capital against a strategy that fails its own backtest is a different kind of artifact than one that just trusts the operator. The same code shape applies to a clinical AI that refuses to surface a recommendation when the input data fails an integrity check, or an infra agent that refuses to apply a config change when the canary fails its own SLA.
Try it
- Repo: github.com/JacksonBuildsAI/ai-trading-lab (MIT)
- Live paper-trading dashboard, refreshed daily: jacksonlai.com/projects/ai-trading-lab/dashboard.html
- Architecture + design choices: project page
If you're working on AI-in-the-loop systems for any high-stakes domain, the "narrator, not decider" pattern transfers directly. Different domain, same problem shape.
If you're hiring for harness or agent engineering and want to compare notes, find me on LinkedIn or @JacksonAIBuilds on X.