Open source · MIT · Live paper trading on Alpaca

Defensive scaffolding for an AI in a high-consequence domain.

A trading lab built around one invariant: the AI never decides trades. It writes the monthly narrative and flags weird data — that's the entire surface area. A deterministic risk engine, a 7-gate live-trading lock, an audit-vs-broker reconciler, and a drawdown circuit breaker do all the rest.

Language: Python 3.10+
Broker: Alpaca paper
Tests: 24/24 passing
Lines of code: ~2,000
[Figure: live discipline dashboard for the AI Trading Lab, showing equity curve, process metrics, risk state, and anomalies. Auto-refreshed after every cron run.]
The problem

"Let the AI trade" is the failure mode.

Most retail "AI trading" systems hand the model the buy/sell decision and bolt safety on as an afterthought. Two predictable outcomes: the model produces confident-sounding theses with no real signal, and the safety layers get overridden the first time someone is "behind schedule" on returns. The result is a slow bleed — looking rigorous in the audit while losing money to friction.

This project flips the dependency: code makes every decision, the AI is downstream. If you ripped both AI modules out, the trading loop would behave identically — by design. Removing the AI is allowed; trusting it isn't.

Architecture

Six layers. Each one can reject what the layer above asks for.

The AI sits at the top of the stack and cannot reach below itself. Strategy is pure rules, deterministic, backtestable. Risk caps are explicit. Execution rejects unsafe spreads. Audit records everything.

Layer 1 — AI
Narrator + anomaly detector. Reads the deterministic decision and writes a monthly summary. Flags days that look statistically weird against the trailing 60-day distribution. Never produces trade signals.
Layer 2 — Strategy
Pure rules, no model in the loop. Trend-filtered tactical asset allocation across 5 ETFs (SPY, EFA, AGG, GLD, BIL). 12-month momentum ranking, 200-day SMA filter. Mandatory buy_hold_spy benchmark for comparison.
Layer 3 — Risk engine
Deterministic caps: a per-symbol allocation cap (60%), a cash buffer reservation, and a drawdown circuit breaker. When equity drops 15% or more from the rolling 60-day high-water mark, the breaker overrides the strategy decision and flattens 100% into the cash sleeve for a 30-day cooldown. Audit and broker positions are reconciled before any new orders.
Layer 4 — Execution
Limit orders only. Slippage-budget aware — if the live spread blows past the configured limit, the order is rejected rather than submitted. No market orders, no surprises.
Layer 5 — Broker
Three implementations behind one interface. SimulatedBroker (dev), BacktestBroker (passive bookkeeper for the walk-forward backtester), AlpacaBroker (paper / live). Live and paper share identical code paths.
Layer 6 — Audit
SQLite, every event recorded. Runs, equity snapshots, strategy decisions, intents, orders, positions, anomalies, override-flag uses. The audit is the source of truth — reconciliation runs against it.
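The drawdown circuit breaker described under Layer 3 can be sketched roughly as follows. This is an illustration, not the repository's code: `BreakerState` and `check_breaker` are hypothetical names, and it assumes one equity observation per trading day and a cooldown counted in calendar days. Only the 15% trigger, the 60-day window, and the 30-day cooldown come from the text.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Sequence

TRIGGER_DRAWDOWN = 0.15  # 15% drop from the high-water mark (from the text)
LOOKBACK_DAYS = 60       # rolling high-water-mark window (from the text)
COOLDOWN_DAYS = 30       # cooldown length; calendar days is an assumption


@dataclass
class BreakerState:
    """Hypothetical state holder; None means the breaker is not tripped."""
    cooldown_until: Optional[date] = None


def check_breaker(equity: Sequence[float], today: date, state: BreakerState) -> bool:
    """Return True when the strategy decision should be overridden to cash."""
    if not equity:
        raise ValueError("equity history must not be empty")

    # Still inside an earlier cooldown: keep the override in place.
    if state.cooldown_until is not None and today < state.cooldown_until:
        return True

    # Drawdown of the latest equity value against the rolling high-water mark.
    window = equity[-LOOKBACK_DAYS:]
    high_water = max(window)
    drawdown = 1.0 - window[-1] / high_water

    if drawdown >= TRIGGER_DRAWDOWN:
        # Trip (or re-trip) the breaker and start a fresh cooldown.
        state.cooldown_until = today + timedelta(days=COOLDOWN_DAYS)
        return True

    state.cooldown_until = None
    return False
```

Because the function only reads an equity series and a date, it stays deterministic and easy to unit-test, which fits the "pure rules" framing above. In this sketch a breaker that expires while equity is still 15% or more below the window high simply trips again.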
Live-trading lock

Seven gates. All must pass before real money moves.

Going from paper to live capital is the highest-stakes state transition in the system. So it has the highest friction.

1. Mode flag. mode: live in config.
2. Risk allowance. risk.allow_live_trading: true
3. Environment variable. TRADING_LAB_ENABLE_LIVE=1
4. CLI confirmation. --confirm-live flag
5. Execute flag. --execute required (off by default)
6. Backtest passes floor. Latest report's config hash matches current config (no tuning-then-switching), Sharpe ≥ floor, drawdown ≤ floor, and Calmar within 10% of the buy-and-hold benchmark.
7. Reconciliation clean. Audit positions match broker positions.
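One way to picture the lock is a function that evaluates all seven gates and reports every failure instead of stopping at the first. The sketch below is an assumption about shape only: `LockInputs` and `failed_gates` are hypothetical names, and gates 6 and 7 are passed in as precomputed booleans. The config keys, the environment variable, and the CLI flags are the ones listed above.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class LockInputs:
    """Hypothetical bundle of everything the lock needs to inspect."""
    mode: str                    # gate 1: config "mode"
    allow_live_trading: bool     # gate 2: config "risk.allow_live_trading"
    confirm_live: bool           # gate 4: --confirm-live
    execute: bool                # gate 5: --execute
    backtest_floor_passed: bool  # gate 6: computed elsewhere
    reconciliation_clean: bool   # gate 7: computed elsewhere


def failed_gates(inputs: LockInputs, env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of the gates that fail; an empty list means unlocked."""
    gates = [
        ("mode flag", inputs.mode == "live"),
        ("risk allowance", inputs.allow_live_trading),
        # Gate 3 is read from the environment, not from config or the CLI.
        ("environment variable", env.get("TRADING_LAB_ENABLE_LIVE") == "1"),
        ("CLI confirmation", inputs.confirm_live),
        ("execute flag", inputs.execute),
        ("backtest floor", inputs.backtest_floor_passed),
        ("reconciliation", inputs.reconciliation_clean),
    ]
    return [name for name, ok in gates if not ok]
```

Returning the full list of failures, rather than a single boolean, lets the CLI tell the operator exactly which gates are still closed. The gates are spread across config, environment, and command line so that no single edit can open the lock.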

Numbers

Real backtest, real stress test, real paper position.

24/24 tests passing: risk paths, strategy logic, backtest, reconciliation, metrics
16.5 yr backtest window: 2008-06 → 2024-12, incl. 2008, 2020, 2022 drawdowns
5 stress regimes: bull, crash, sideways, rotation, flash crash, each with safety-property checks
4/4 safety properties: reconciliation halt, anomaly flag, drawdown breaker, live-lock, all confirmed
What the backtest revealed

Better risk-adjusted return than buy-and-hold SPY.

Across 16.5 years (2008-2024) — including the 2008 crisis, COVID 2020, and the 2022 bond rout — the trend filter cut maximum drawdown by roughly 71%: from SPY's -49.7% to -14.3%. Annualized volatility dropped from 19.6% to 8.9%. The drawdown circuit breaker (15% trigger, flatten to cash, 30-day cooldown) fires when the trend filter alone isn't enough.

The trade-off is absolute return: 4.5% annualized vs. SPY's 11.1%. But on a Calmar basis (annualized return divided by the magnitude of maximum drawdown), the trend-filtered tactical asset allocation strategy (TFTAA) wins 0.31 to 0.22, a meaningfully better drawdown-adjusted result. The strategy excels in clean bull and crash regimes; the synthetic stress harness shows it gets chopped in fast leadership rotations. Knowing exactly where the edges are before deploying capital is the entire point of the harness.
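The Calmar and drawdown figures quoted above can be reproduced from the stated returns and drawdowns. The helper name `calmar` is illustrative only; the inputs are the numbers given in the text.

```python
def calmar(annualized_return: float, max_drawdown: float) -> float:
    """Annualized return divided by the magnitude of the maximum drawdown."""
    return annualized_return / abs(max_drawdown)


strategy_calmar = calmar(0.045, -0.143)  # 4.5% return, -14.3% max drawdown
spy_calmar = calmar(0.111, -0.497)       # 11.1% return, -49.7% max drawdown
drawdown_cut = 1 - 0.143 / 0.497         # relative reduction in max drawdown

print(round(strategy_calmar, 2))  # 0.31
print(round(spy_calmar, 2))       # 0.22
print(round(drawdown_cut, 2))     # 0.71
```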

Configurable risk floor, enforced in code.

The live-trading lock holds any candidate strategy to a backtest floor the operator sets. The CLI won't let real money flow until that floor passes — defaults are conservative (Sharpe ≥ 0.5, drawdown ≤ 25%), but a more risk-tolerant operator can raise or lower them deliberately, in writing, never by accident.

Finding a strategy's edges with 16.5 years of historical data and a synthetic stress harness costs 30 seconds of compute. Finding them with real money costs months of slow bleed and a hard lesson. The infrastructure is what makes that gap cheap, and what makes adding the next strategy variant a one-class change instead of a rebuild.

Why this matters

The exemplar is the architecture, not the returns.

🛡️

AI-as-narrator pattern

Demonstrates the right shape for AI in any high-consequence domain — finance, healthcare, infra ops, defense. Code decides; AI explains.

🧪

Mandatory backtest + stress harness

You can't deploy without proof. Both historical and adversarial-synthetic. Forces the operator to confront whether the strategy actually works.

🔍

Reconciler-first state

Every run starts by comparing audit vs. broker. Drift halts the workflow. The class of failures where state silently diverges between systems just can't propagate.

📊

Discipline dashboard

Tracks process metrics — halts, override-flag uses, strategy changes during drawdowns — not return targets. The thing the operator actually controls.
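The reconciler-first idea reduces to a diff between two position maps, followed by a hard stop on any mismatch. A minimal sketch, assuming positions are held as symbol-to-quantity mappings; `position_drift`, `require_clean`, and `ReconciliationError` are hypothetical names, not the project's API.

```python
from typing import Dict, Mapping


class ReconciliationError(RuntimeError):
    """Raised when audit and broker positions disagree (hypothetical type)."""


def position_drift(
    audit: Mapping[str, float],
    broker: Mapping[str, float],
    tolerance: float = 1e-9,
) -> Dict[str, float]:
    """Return broker-minus-audit quantity for every symbol that differs."""
    drift: Dict[str, float] = {}
    # A symbol missing on one side counts as a zero position there.
    for symbol in sorted(set(audit) | set(broker)):
        diff = broker.get(symbol, 0.0) - audit.get(symbol, 0.0)
        if abs(diff) > tolerance:
            drift[symbol] = diff
    return drift


def require_clean(audit: Mapping[str, float], broker: Mapping[str, float]) -> None:
    """Halt the run by raising if any drift exists; otherwise return quietly."""
    drift = position_drift(audit, broker)
    if drift:
        raise ReconciliationError(f"audit and broker positions differ: {drift}")
```

Calling `require_clean` as the first step of every run means no order can be built on top of state that has already diverged.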

One stack. Many strategies.

The architecture is the artifact: every new strategy plugs into the same safety, audit, and reconciliation rails. The repository is MIT-licensed: clone it, fork it, submit a strategy variant that passes the live-trading floor (by default Sharpe ≥ 0.5, drawdown ≤ 25%, and Calmar within 10% of the buy-and-hold benchmark). If you're hiring for harness or agent engineering, or want to compare notes on AI-in-the-loop systems for high-stakes domains, I'd love to walk through the audit DB, the live-trading lock, and the stress harness with you.
