Editorial / v1
The prediction loop.
Same architecture, three domains. What 1,344 stock calls, 504 MLB games, and 8,689 energy forecasts taught me.
I made a bet a few months ago. The bet was that the same architecture could predict stocks, baseball outcomes, and electricity load, and that the architecture itself was the moat. Not the model picks. Not the prompt. The loop.
This essay is the receipt. Three live systems, three domains, three different feedback signals (daily closes, final scores, hourly meter readings), one shared shape. Below are the real numbers I pulled from production a few minutes before I started writing this.
- Vantage: 1,344 predictions, 1,248 graded, ensemble trades hitting a 74.6% win rate.
- Diamond: 504 predictions, 473 graded, beating the Vegas line on 59.6% of graded picks.
- LoadLens: 8,689 forecasts, 4,224 graded, with the system rebalancing its model weights from a hand-tuned 40/35/25 split to 84/8/8 once it figured out which model was carrying the others.
The numbers matter, but the loop matters more. Let's start there.
The loop
Predictive AI is the part of the field that does not get demoed at conferences, because the demos are boring. Nobody claps at a chart of stock predictions or a list of expected MLB scores. The work is plumbing, and the edge is a closed loop, not a bigger model.
The loop has five steps. They are mundane individually, and rare as a unit:
- Predict. Multiple models, each with a different worldview. One looks at price momentum. One looks at sentiment. One looks at temperature. They produce a prediction independently.
- Aggregate. Combine the predictions into one ensemble call, weighted by past performance. The weights are the memory of the system.
- Grade. When the truth lands (close price, final score, actual load), score every model and the ensemble against it. Store every score.
- Adapt. Adjust the weights. Models that were right gain weight. Models that were wrong lose it. Slowly. Bayesian, not whiplash.
- Journal. Write a structured note about what the system learned today, what shifted, what is trending. The note is for the system to read tomorrow, not for the user.
That is the entire architecture. Anyone can copy it. Almost nobody does, because steps three through five are unglamorous, and most "AI prediction" projects end at step one.
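Here is the shape of the loop in code, stripped to a skeleton. The names, the scoring rule, and the learning rate are all illustrative; this is a sketch of the architecture, not the production code behind any of the three systems.

```python
# A minimal sketch of the five-step loop. Everything here is illustrative.
from dataclasses import dataclass, field

@dataclass
class Prediction:
    model: str         # which model made the call
    value: float       # the prediction itself (direction score, runs, megawatts)
    confidence: float  # 0..1

@dataclass
class Loop:
    models: dict    # name -> callable(features) -> Prediction
    weights: dict   # name -> float; the system's memory
    history: list = field(default_factory=list)

    def predict(self, features):                       # 1. Predict
        return [model(features) for model in self.models.values()]

    def aggregate(self, preds):                        # 2. Aggregate
        total = sum(self.weights[p.model] for p in preds)
        return sum(self.weights[p.model] * p.value for p in preds) / total

    def grade(self, preds, truth):                     # 3. Grade
        # Higher is better: 1 / (1 + absolute error). Store every score.
        scores = {p.model: 1.0 / (1.0 + abs(p.value - truth)) for p in preds}
        self.history.append(scores)
        return scores

    def adapt(self, scores, lr=0.05):                  # 4. Adapt, slowly
        total = sum(scores.values())
        for name, score in scores.items():
            target = score / total
            self.weights[name] = max(0.01, (1 - lr) * self.weights[name] + lr * target)
        norm = sum(self.weights.values())
        self.weights = {k: v / norm for k, v in self.weights.items()}

    def journal(self):                                 # 5. Journal
        return {"weights": dict(self.weights), "graded_cycles": len(self.history)}
```

The property that matters is in `adapt`: weights move a small step per cycle and are floored above zero. That is the "Bayesian, not whiplash" part.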
Vantage: stocks, sentiment, and the model that surprised me
Vantage runs a four-model ensemble on equities: momentum, mean reversion, sentiment, and a meta-ensemble that blends them. Each one makes a directional call (up or down) with a confidence score. They are graded against the actual close.
Here is what 1,248 graded predictions tell me as of yesterday's evaluation:
- Sentiment model: 47.1% directional accuracy. Sharpe 2.25 on its calls. On high-confidence calls only, it hits 83.3%.
- Momentum model: 34.3% accuracy. Sharpe negative. The momentum signal, in this market, on this set of tickers, in this period, is not adding information.
- Mean reversion model: 26.9% accuracy. Worse than coin flip. Bad model, full stop.
- Ensemble (the live trader): 34.6% directional accuracy overall. Sharpe slightly negative. But on actual position-sized trades only, the ensemble is hitting a 74.6% win rate with $109.05 in paper P/L on 63 trades.
Two things in those numbers are the actual story.
First: the sentiment model beat all three technical models. I did not expect that. I built Vantage with the implicit assumption that price action would carry the prediction and sentiment would be a tiebreaker. The data said otherwise, and the system reweighted accordingly. I was wrong about the relative value of the signals, and the loop corrected me.
Second: the ensemble has a worse directional accuracy than the sentiment model alone (34.6% vs 47.1%) but a much better trading record. That looks like a contradiction until you read the gap. Directional accuracy counts every call. Trading P/L only counts the calls the system was confident enough to size into a position. The ensemble is correctly not trading on most of its calls. The accuracy of bets actually placed is the number that matters, and it is 74.6%.
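To make the gap between those two numbers concrete: the same prediction log can be scored both ways, and only one of the scores gates on confidence. The threshold and the records below are made up for illustration; this is not Vantage's data.

```python
# Two ways to score the same log: every call vs. only the calls the system
# sized into a trade. Threshold and records are hypothetical.
SIZE_THRESHOLD = 0.75  # assumed confidence cutoff for taking a position

def directional_accuracy(calls):
    # calls: dicts with predicted direction, actual direction, confidence
    correct = sum(1 for c in calls if c["pred_dir"] == c["actual_dir"])
    return correct / len(calls)

def win_rate_on_sized_trades(calls, threshold=SIZE_THRESHOLD):
    traded = [c for c in calls if c["confidence"] >= threshold]
    if not traded:
        return None
    wins = sum(1 for c in traded if c["pred_dir"] == c["actual_dir"])
    return wins / len(traded)

calls = [
    {"pred_dir": "up",   "actual_dir": "up",   "confidence": 0.9},
    {"pred_dir": "up",   "actual_dir": "down", "confidence": 0.4},
    {"pred_dir": "down", "actual_dir": "up",   "confidence": 0.3},
    {"pred_dir": "up",   "actual_dir": "up",   "confidence": 0.8},
]
print(directional_accuracy(calls))      # 0.5 -> counts every call
print(win_rate_on_sized_trades(calls))  # 1.0 -> counts only the sized trades
```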
This is also where the system fails honestly. Down-predictions are weak across all four models (8% to 27% accuracy). The system knows up-moves better than it knows down-moves, in this regime. That is a real limitation, and it is in the journal, and it is shaping where I focus the next round of work.
You can see the live state at vantage.champlinenterprises.com.
Diamond: baseball, the Vegas line, and calibration
Diamond predicts MLB games. Score, winner, totals, run line. The benchmark that matters is the Vegas line, because Vegas has been doing this for a very long time and has more money in the loop than any model I will ever build.
As of this morning, the v2 model has graded 473 of 504 predictions:
- Winner accuracy: 53.7% rolling 7-day, 55.4% lifetime on the v2 model. The 50% baseline is a coin flip on a binary outcome.
- Beat Vegas: 282 of 473 graded picks, or 59.6%. Beating Vegas is a different question than predicting outcomes. You can be right less than half the time and still beat the line, because the line is a probability statement and you are looking for divergence.
- Paper P/L on value bets: $2,207.77 across 378 picks. This is the number a sportsbook would care about. Not the win percent, the dollar yield.
The single most useful thing Diamond does, though, is calibrate. Every confidence bucket has its own measured accuracy:
| Confidence bucket | Predicted accuracy | Actual accuracy |
|---|---|---|
| 50-54% | 50% | 56.5% |
| 55-59% | 57% | 46.2% |
| 60-64% | 62% | 60.6% |
| 65-69% | 67% | 61.2% |
| 70-74% | 72% | 71.4% |
A well-calibrated model has its predicted confidence and its actual accuracy land on the same number. Diamond is well-calibrated in the high-confidence buckets (the 70-74% bucket lands at 71.4% actual) and slightly miscalibrated in the middle (the 55-59% bucket is overconfident). That is data the system uses to flag picks that should be sized differently. A 70%-confidence pick is meaningfully different from a 55%-confidence pick, and the system can prove it.
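The calibration check itself is cheap to build. A sketch, assuming each graded pick stores its stated confidence and whether it was right; the bucket width and data layout are illustrative:

```python
# Bucket graded picks by stated confidence and compare to measured accuracy.
# A sketch of the calibration check, not Diamond's grading code.
from collections import defaultdict

def calibration_table(picks, bucket_width=5):
    # picks: list of (confidence_percent, was_correct) pairs
    buckets = defaultdict(lambda: [0, 0])  # bucket start -> [correct, total]
    for confidence, was_correct in picks:
        start = int(confidence // bucket_width) * bucket_width
        buckets[start][0] += int(was_correct)
        buckets[start][1] += 1
    rows = []
    for start in sorted(buckets):
        correct, total = buckets[start]
        rows.append((f"{start}-{start + bucket_width - 1}%", correct / total, total))
    return rows

# Hypothetical picks: (stated confidence, did it win)
for label, actual, n in calibration_table([(72, True), (56, False), (71, True), (58, True)]):
    print(f"{label}: {actual:.1%} actual over {n} picks")
```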
There is a second-order story in Diamond worth naming. The v2 model replaced a legacy model that hit 46.7% on 92 picks. The new model hits 55.4% on 381 picks. The system promoted itself: the legacy model is still in the data, still graded, but the meta-controller now routes new predictions to v2 because v2 has earned the throughput. There is no human gate on that promotion. The grading layer is the gate.
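The gate can be as small as a function over the graded history. A sketch of the idea with made-up thresholds; the real routing is presumably richer, but the shape is the same: grades in, live model out, no human in the middle.

```python
# A promotion gate as a pure function of graded history. Thresholds illustrative.
def pick_live_model(models, min_graded=200):
    """models: dict of version -> {"graded": int, "accuracy": float}."""
    eligible = {v: m for v, m in models.items() if m["graded"] >= min_graded}
    if not eligible:
        # Not enough evidence yet: keep routing to whatever has the most grades.
        return max(models, key=lambda v: models[v]["graded"])
    return max(eligible, key=lambda v: eligible[v]["accuracy"])

models = {
    "legacy": {"graded": 92,  "accuracy": 0.467},
    "v2":     {"graded": 381, "accuracy": 0.554},
}
print(pick_live_model(models))  # "v2" -- it has both the sample size and the accuracy
```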
Live at diamond.champlinenterprises.com.
LoadLens: energy, and the weight migration that proves the loop works
LoadLens forecasts electricity load on short time horizons. Three models contribute: a trend model (extrapolates from recent load curves), a momentum model (looks at rate of change), and a weather model (correlates load to temperature and humidity).
The starting weights were a reasonable guess: trend at 40%, weather at 35%, momentum at 25%. The hand-tuned mix, before any data came in.
Then 4,224 forecasts got graded. Here is the per-model accuracy:
- Trend: 5.7% mean absolute error. Excellent.
- Momentum: 31.9% MAE. Bad.
- Weather: 32.5% MAE. Bad.
The system reweighted on its own. Today's weights:
- Trend: 84.0%
- Weather: 8.2%
- Momentum: 7.9%
I want to be honest about what this means and what it does not mean. It does not mean the weather model is broken or that weather is irrelevant to load. Weather almost certainly belongs in the system. What it means is that the weather model as currently implemented is adding noise more than signal, and the loop figured that out faster than any human review cycle would have. My next pass at LoadLens is rebuilding the weather model, not removing it. The 8% weight is the system's polite way of saying "don't trust this version."
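For intuition about how per-model error turns into weights, here is one common rule: inverse-error weighting with a floor. It is not necessarily the exact update LoadLens runs, and it does not reproduce the 84/8/8 split precisely; it shows the mechanism, including the floor that keeps weak models in the loop.

```python
# Inverse-error weighting with a floor so no model is ever deleted.
# An illustrative rule, not LoadLens's actual update.
def weights_from_error(mae_by_model, floor=0.05):
    inv = {name: 1.0 / mae for name, mae in mae_by_model.items()}
    total = sum(inv.values())
    raw = {name: v / total for name, v in inv.items()}
    # Apply the floor, then renormalize so weights still sum to 1.
    floored = {name: max(w, floor) for name, w in raw.items()}
    norm = sum(floored.values())
    return {name: w / norm for name, w in floored.items()}

print(weights_from_error({"trend": 5.7, "momentum": 31.9, "weather": 32.5}))
# Trend ends up carrying most of the weight; momentum and weather keep small, nonzero shares.
```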
About 28% of all forecasts (1,183 of 4,224) land within ±5% of actual load. That is the headline accuracy number, and it is honest. The same number a year ago, when LoadLens was just the trend model, was lower, because trend alone overshoots when conditions change. The ensemble, even with two weak models pulled almost to zero, still beats trend alone, because the residual 16% of weight on momentum and weather catches the cases where pure trend extrapolation fails.
This is the architecture earning its keep. A 5.7%-MAE single model would be a great single model. A 5.7%-MAE single model with a structured way to integrate two future better models without rewriting anything is a system.
Live at loadlens.champlinenterprises.com.
What is shared, what is different
The shared shape is the five-step loop above. The differences are forced by the domain.
Feedback latency. Diamond grades on a delay of hours (game ends, score is final). Vantage grades daily (close price). LoadLens grades on minutes-to-hours (meter readings come in fast). Faster feedback means faster adaptation, but also more noise per cycle, so LoadLens uses a longer smoothing window on its weight updates than Diamond does.
What a "model" is. In Vantage, a model is a prompt plus a feature window plus a Claude call. In Diamond, a model is a Python ensemble of statistical components plus a Claude reasoning layer for narrative. In LoadLens, a model is pure numerical: linear trend, gradient, and weather regression. The loop does not care. It treats every model as a black box that takes inputs and emits a prediction with a confidence.
What "right" means. Vantage scores directional correctness and trading P/L (two different right). Diamond scores winner, total, run line, and Vegas-divergence (four different right). LoadLens scores absolute error in megawatts (one right, with a sign). The grading code is the most domain-specific part of each system, and the only part you cannot generalize. Everything else is the loop.
Where Claude lives. Claude is not the predictor. In all three systems, Claude is the reasoning narrator: the layer that takes the numerical prediction and explains why, in language a human can use. The numbers come from the models. The story comes from Claude. People sometimes flip this when they hear "AI prediction" and assume the LLM is the forecaster. It is not, and it should not be. LLMs are bad at numerical extrapolation and good at structured explanation. Use them for explanation.
What this means if you are building something similar
Three things, no more.
One: build the grader before the predictor. This is the inversion that separates working systems from demos. Most AI prediction projects ship a prediction surface, then "we'll add accuracy tracking later." Later never comes, because grading is annoying and there is no dopamine in it. Build it first. Make every prediction land in a row that already has a place for the truth value it will eventually need. Make the grading job run on a cron from day one, even if it only logs "no truth yet" for the first week. The discipline imposed by an empty actual_value column is what turns a guesser into a forecaster.
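Concretely, that can mean creating the table with the empty column before the first prediction exists. A sketch using sqlite3; the schema and names are illustrative, not any of the three systems' actual schemas.

```python
# Build the grader's table first. Every prediction lands in a row that already
# has a place for the truth it will eventually be scored against.
import sqlite3

conn = sqlite3.connect("predictions.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS predictions (
        id           INTEGER PRIMARY KEY,
        made_at      TEXT NOT NULL,
        model        TEXT NOT NULL,
        predicted    REAL NOT NULL,
        confidence   REAL NOT NULL,
        actual_value REAL,           -- empty until the truth lands
        graded_at    TEXT            -- filled by the cron grading job
    )
""")
conn.commit()
```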
Two: weight the models, never delete them. The temptation when a model is performing badly is to remove it. Don't. Reduce its weight. Keep it in the loop. Weak models occasionally light up in regimes where the strong models go cold, and the only way the system catches those moments is if the weak model is still in the data stream when they happen. LoadLens's weather model at 8% weight is not waste, it is insurance.
Three: write the journal, even if no one reads it. Each of these three systems writes a structured journal entry, daily or hourly. Diamond's most recent entry is sitting on my screen as I type this: "v2 55.4% (381 picks) | Legacy 46.7%, alerts: none." Nobody reads these entries. The system does not even read them in the next prediction cycle. They exist so that when something does break in three months, there is a continuous trail of what the system thought was true at every step. It is the equivalent of a flight recorder. The cost is a few hundred bytes a day. The value is the ability to debug a regression that took six weeks to manifest.
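A journal entry does not need to be clever, just structured and append-only. A sketch of what each cycle might write; the field names and values here are illustrative, not any system's actual journal format.

```python
# A journal entry as a flight recorder: one small JSON line per cycle.
import datetime
import json

def write_journal(path, state):
    entry = {
        "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "weights": state["weights"],
        "graded_total": state["graded_total"],
        "accuracy_7d": state["accuracy_7d"],
        "alerts": state.get("alerts", []),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

write_journal("journal.jsonl", {
    "weights": {"trend": 0.84, "weather": 0.08, "momentum": 0.08},
    "graded_total": 4224,
    "accuracy_7d": 0.28,
})
```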
The honest part
None of these systems is making me rich. Vantage's $109 in paper P/L would be eaten by transaction costs in any real account. Diamond's $2,207 paper yield assumes flat-stake bets that no real bettor would size flat. LoadLens does not produce dollar yield at all, it produces operational forecasts.
What they produce is evidence that the loop works. That is the asset. The architecture is now portable. The next domain I point this at, whatever it is, will start with three months of head start because the grading, the weighting, the journaling, and the failure modes are already understood.
That is the move I would make in any prediction-shaped problem, in any business, regardless of vertical. Predictive maintenance on equipment. Pricing models. Inventory forecasting. Lead scoring. Demand sensing. The loop is the same. The grader changes. The predictor changes. The loop is the same.
If you are running a business where one of those problems is unsolved and would change the unit economics if it were solved, that is a Champlin Enterprises conversation. We have built this stack three times now. The fourth time is faster.
— Kevin Champlin, 2026-05-08