I guess Tetsuo isn’t done after all. Quickly:
Two days in, on SIG. I can’t win. I pulled on a thread, found three things wrong, then production handed me a fourth on the way out the door.
First, I’m working on a reporting component (that’s right, it’s called ‘RPT’ lol) that gives me concrete performance visualization so I can start identifying issues over time.
sig_probability wasn’t ranking anything
RPT’s sig_signal report for the 26th showed 1721 UP forecasts across the universe, 1038 of which actually went UP. Fine on its own, 60%. Then I sliced by confidence and probability thresholds. ≥80/≥80: 57.7%. ≥90/≥90: 59.4%. ≥100/≥90: 40%. Flat, or worse, at every threshold. A calibrated probability score has to be rankable (high-confidence calls should hit more often than low-confidence ones). SIG’s wasn’t ranking a damned thing.
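A minimal sketch of that threshold slicing, not RPT’s actual code. The Forecast record and hit_rate helper are hypothetical names, and the toy rows are made up purely to show the shape of the check. Returning n alongside the rate makes it obvious when a band is too thin to read much into.

```python
from dataclasses import dataclass


@dataclass
class Forecast:
    confidence: float   # sig_confidence, 0-100
    probability: float  # sig_probability, 0-100
    went_up: bool       # realized outcome of an UP forecast


def hit_rate(forecasts, min_conf=0.0, min_prob=0.0):
    """Return (hit_rate, n) for UP forecasts at or above both thresholds."""
    kept = [f for f in forecasts
            if f.confidence >= min_conf and f.probability >= min_prob]
    if not kept:
        return None, 0
    hits = sum(f.went_up for f in kept)
    return hits / len(kept), len(kept)


# Toy rows (invented for illustration only).
rows = [
    Forecast(100, 95, False),
    Forecast(90, 92, True),
    Forecast(80, 85, True),
    Forecast(60, 70, False),
]

for conf, prob in [(0, 0), (80, 80), (90, 90), (100, 90)]:
    rate, n = hit_rate(rows, conf, prob)
    print(f">={conf}/>={prob}: rate={rate} n={n}")
```

If the score ranks anything, the rate should climb as the thresholds rise. A flat or falling line is the symptom described above.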
I knew introducing this as a filter would open a can of worms, too: SIG was initially persisting probability_score as the XGBoost classifier’s raw predict_proba output, scaled to 0-100. That number is leaf-purity from the training set passed through the booster’s sigmoid. It’s leaf statistics from data the model just got done fitting against, not an out-of-sample probability of being right. Tree boosters apparently tend to produce uncalibrated probabilities that cluster at the extremes. That’s why every sig_probability in the WIN report was sitting at 90-99 regardless of whether the call was any good. That isn’t usable for what I need it for, and I’m not sure who it would be usable to, outside of maybe someone debugging a tree booster?
The fix was to wrap the trained model with a chronological-tail “isotonic calibrator”. Base XGBoost fits on history[:-60]. The isotonic regressor fits on the last 60 days against the actual binary outcomes. Inference goes through both, so the persisted probability score becomes the observed UP frequency in the holdout window for that raw-score band instead of the leaf-purity from training. That’s a mouthful.
Anyway, chronological tail is used because this is next-day forecasting: KFold with a shuffle leaks future days into earlier calibration folds and validates against past days, which is the wrong validation shape for a time series.
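A minimal sketch of the chronological-tail calibration idea, not SIG’s actual wrapper. It assumes date-ordered rows and raw scores the base model has already produced for the tail days. The names chronological_split, fit_isotonic, and calibrate are hypothetical, and the isotonic fit here is a hand-rolled pool-adjacent-violators pass so the sketch has no dependencies (a library calibrator would be the usual choice).

```python
from bisect import bisect_left
from collections import defaultdict


def chronological_split(rows, tail=60):
    """Split date-ordered rows into (fit, calibration) with no shuffling."""
    if len(rows) <= tail:
        raise ValueError("not enough history for a calibration tail")
    return rows[:-tail], rows[-tail:]


def fit_isotonic(raw_scores, outcomes):
    """Pool-adjacent-violators: non-decreasing map from raw score to the
    observed frequency of outcome == 1 in the calibration window."""
    # Aggregate tied scores first so each score appears once.
    agg = defaultdict(lambda: [0.0, 0])
    for s, y in zip(raw_scores, outcomes):
        agg[s][0] += y
        agg[s][1] += 1
    blocks = []  # each block: [sum_of_outcomes, count, highest_score_in_block]
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds this one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            t, c, top = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    means = [b[0] / b[1] for b in blocks]
    return uppers, means


def calibrate(model, raw_score):
    """Look up the calibrated frequency for a raw score (step function)."""
    uppers, means = model
    i = bisect_left(uppers, raw_score)
    return means[min(i, len(means) - 1)]


# Toy calibration tail: raw scores all look confident, outcomes do not agree.
tail_scores = [0.90, 0.92, 0.94, 0.96, 0.98]
tail_outcomes = [1, 0, 1, 0, 1]
model = fit_isotonic(tail_scores, tail_outcomes)
print(model[1])                      # block means, non-decreasing
print(100 * calibrate(model, 0.91))  # persisted on a 0-100 scale
```

With only a handful of calibration points the step function is coarse, which is also true of a 60-day tail. The calibrated score can only take as many distinct values as the tail supports.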
How do you even explain issues like this to a non-technical audience? I catch a ton of shit sometimes for technical jargon by people trying to understand something I’m doing, and then I see issues like this and I just — how would you even explain this to someone who isn’t already balls deep in it? Maybe that’s why I’d be a terrible developer.
sig_confidence was selection-inflated
After that, another worm crawls out of the can: I then see that even with calibration landed, sig_confidence values across the universe were piling up in 80/90/100 bands. Those are those “whispered promises” I keep talking about. It’s a fugazi, a bugatti, a pulled lazy. The tree booster is eating daisies. It’s clearly not 90% accurate if it’s f*cking wrong 38% of the time.

Anyway, with n=10 for the walking configuration batteries and forecasting, sig_confidence is discrete by construction (10 binary trials, values in 10% buckets), but the clustering at the high end across thousands of symbols was suspicious. Genuine model quality doesn’t distribute itself that way, so that banding was a clear and explainable indicator.
The configurator runs forward feature selection as KEPT/DISCARDED determinations against the same n-trial (again, n=10) battery that produces the persisted accuracy. So the reported accuracy is the in-sample maximum of a feature search whose hypothesis space (any subset of about 50 features) dwarfs 10 trials of evidence. That’s 50 features wide by 10 rows, which isn’t very epistemologically sound. Standard error of binary accuracy at n=10 is around 16pp; a symbol with true 50% accuracy will score 80/90/100 on a single battery about 5% of the time just by chance (imagine flipping a coin 3 times and getting heads all 3 times), and far more often once a search gets to pick the best of many candidates. The filter keeps whichever feature set happened to produce that and drops whatever happened not to (failures happen bidirectionally, which is deceptive with this kind of thing). I thought I’d compensated with daily reconfiguration of half the universe, but that just hands me fresh noise instead of stale noise.
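The arithmetic behind that can be sketched directly from the binomial distribution. The function names are hypothetical, and treating the search as 50 independent no-skill candidates is a simplifying assumption (forward selection’s candidates overlap heavily), so read the last number as a rough guide, not a measurement of SIG.

```python
from math import comb, sqrt


def accuracy_standard_error(p, n):
    """Standard error of an observed hit rate over n binary trials."""
    return sqrt(p * (1 - p) / n)


def prob_at_least(k, n, p):
    """P(at least k hits in n trials) when the true hit rate is p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_any_candidate_clears(k, n, p, candidates):
    """P(at least one of `candidates` feature sets scores >= k/n),
    ASSUMING the candidates are independent (they are not, in practice)."""
    return 1 - (1 - prob_at_least(k, n, p)) ** candidates


se = accuracy_standard_error(0.5, 10)                 # about 0.158
single = prob_at_least(8, 10, 0.5)                    # 56/1024
searched = prob_any_candidate_clears(8, 10, 0.5, 50)  # roughly 0.94

print(f"standard error at n=10: {se:.3f}")
print(f"coin-flip model scores >=80% once: {single:.4f}")
print(f"best of 50 coin-flip candidates scores >=80%: {searched:.2f}")
```

A single no-skill battery lands in the 80/90/100 bands only about one time in eighteen. Keeping the best of many such batteries is what turns that into the norm.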
The fix was to break the dependency. Selection still runs against the most recent 10 days, so the kept feature set keeps adapting to the freshest data. After selection, run the locked-in feature set against a SECOND, non-overlapping 10-day battery, offset to the 10 days immediately preceding the selection window. Persist THAT resulting figure as the “accuracy” score instead of the overfit in-sample score. Now the reported number is the kept feature set’s hit rate on a window the search never saw. Same n on the selection side (no adaptation cost), and the inflation comes off. To support that, the forecaster configuration process grew an offset parameter, so one iteration helper drives both windows from one configured value.
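A minimal sketch of the two-window layout, assuming a date-ordered history indexed 0..N-1 with the most recent day last. The helper names battery_window and selection_and_validation_windows are hypothetical stand-ins for the iteration helper mentioned above, not its real signature.

```python
def battery_window(history_len, n=10, offset=0):
    """Return (start, stop) indices of an n-day battery that ends
    `offset` days before the most recent day. stop is exclusive."""
    stop = history_len - offset
    start = stop - n
    if start < 0:
        raise ValueError("not enough history for this window")
    return start, stop


def selection_and_validation_windows(history_len, n=10):
    """Selection uses the freshest n days; validation uses the n days
    immediately before them, so the two never overlap."""
    selection = battery_window(history_len, n, offset=0)
    validation = battery_window(history_len, n, offset=n)
    return selection, validation


selection, validation = selection_and_validation_windows(100, n=10)
print("selection days: ", selection)
print("validation days:", validation)
```

One design note: the validation window sits before the selection window in time, so it is out-of-sample for the feature search but not a forward test. That matches the description above, where the goal is to remove selection inflation without giving up the freshest days for adaptation.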
What I expect to see over the next few cycles
Once a few configurator refreshes go around, sig_confidence values should start to drop and spread out in a more even distribution among symbols. The high band is what was passing itself off as quality.
Same with sig_probability. It stops clustering at 90-99 and spreads across the range. After a few trading days, slicing RPT’s sig_signal report by probability >= X should monotonically improve hit rate as X rises. If it still doesn’t, the features don’t carry signal for the T+1:T+2 horizon and the next investigation moves out of SIG into the big nasty data_builder where I’ve isolated all the math, and I just do not have the time to invest in that. The calibrated report can distinguish “no signal” from “uncalibrated probabilities”; the old one couldn’t.
MAG’s candidate pool shrinks because its minimum_sig_confidence and minimum_sig_probability filters cut more aggressively against the now-honest, more-spread-out distributions. So there will need to be adjustments to MAG filters. WIN may fall short of max_winners=10 on some days. That’s information, not a bug. The numbers were lying before; now they aren’t.
Nobody knows what the fuck you’re talking about, Chris.
As much as I love this project, I currently hate this project and don’t want to work on it anymore. It’s near a limit of complexity that I can barely stay wrapped around, there aren’t a lot of established best practices for it, it’s almost purely in the theory space, and it’s not building something anyone would use but me. And to top it off, it’s not profitable.
Humaning
It’s 2:30 AM and I’m wound up. More interviews tomorrow. They’re finally pinging, but nothing’s biting yet. This is about as close as I’ve ever been to failure and it’s almost artificially weird how total and submersive it is. The clock is ticking faster than I’d hoped, but something will open up.