The AI investing edge that… wasn't

By Jamie Skella · 25 May 2026 · 6 min read

There is a kind of result that almost never gets written up, because it is the result nobody set out to find. Not a discovery. Not a failure either. An absence, confirmed carefully enough to be trusted.

Over a single weekend I ran the research process a quantitative fund would normally hand to a small team for a month. The pipeline pulled twelve months of ASX small-cap announcements (28,606 of them), retrieved and read the underlying PDFs, scored every one with an LLM, computed market-adjusted returns around each event, and ran the whole thing through a walk-forward backtest with realistic transaction costs and tax. The spend was around AUD 100 in LLM credits and one weekend of compute on a Mac mini. The question was old and plain. Does the text of a company announcement carry a signal an ordinary investor could trade?

No. Three times over, by three independent methods. And the no is the most valuable thing the exercise produced.

The cost line is the story

We have crossed a threshold quietly. A year ago this study was a budget line and a hiring decision. Now it is an afternoon of instruction and a small invoice. What has not become cheaper is the judgment required to believe the result. If anything, that judgment is now the whole job.

Cumulative abnormal returns by horizon

Mean market-adjusted return across all 28,606 announcements, vs the S&P/ASX Small Ordinaries. Negative and drifting wider. The bars run the wrong way.

Source: Phase 1 event study, n=28,191 to 27,887 per window. Walk-forward methodology, pre-registered hurdle, BH-FDR corrected.
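For readers who want the arithmetic behind that chart: a market-adjusted return is the stock's cumulative return over the window minus the benchmark's over the same window. The sketch below is a minimal illustration, not the pipeline's code. The function names and the event dictionary layout are assumptions, and it uses plain compounding of simple daily returns.

```python
from statistics import mean


def cumulative_return(daily_returns):
    """Compound a list of simple daily returns into one cumulative return."""
    total = 1.0
    for r in daily_returns:
        total *= 1.0 + r
    return total - 1.0


def market_adjusted_return(stock_daily, bench_daily, horizon):
    """Stock cumulative return minus benchmark cumulative return over `horizon` days."""
    if len(stock_daily) < horizon or len(bench_daily) < horizon:
        raise ValueError("not enough return history for this horizon")
    return (cumulative_return(stock_daily[:horizon])
            - cumulative_return(bench_daily[:horizon]))


def mean_abnormal_return(events, horizon):
    """Average market-adjusted return over the events that have enough data.

    Each event is a dict with hypothetical keys "stock" and "bench", holding
    post-announcement daily returns. Returns (mean, number of usable events).
    """
    usable = [e for e in events
              if len(e["stock"]) >= horizon and len(e["bench"]) >= horizon]
    if not usable:
        raise ValueError("no event has enough history for this horizon")
    values = [market_adjusted_return(e["stock"], e["bench"], horizon)
              for e in usable]
    return mean(values), len(usable)
```

In this sketch the usable count can fall as the horizon lengthens, because events without enough price history drop out of the longer windows.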

The asymmetry between testing this way and finding out by trading is the entire argument for doing the work. There is no version of this calculus where you skip the test.

The first answer

The first approach was the obvious one. Have a language model read each announcement cold, rate how material it looked, and check whether the material ones outperformed.

They did not. The model flagged 1,422 events as high materiality, and their mean return five days out was negative 0.51 percent. A walk-forward classifier on the LLM-extracted features reached AUC 0.52 against a 0.58 hurdle, which is to say it could barely separate winners from losers better than a coin flip. The backtest produced zero qualifying trades. The market had priced the obvious by the time it opened.
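To make the AUC figure concrete: AUC is the probability that a randomly chosen winner gets a higher score than a randomly chosen loser, so 0.5 is a coin flip. The sketch below computes it by direct pairwise comparison and pairs it with an expanding-window walk-forward split, where the model only ever trains on events earlier than the ones it is tested on. The function names and the fold scheme are assumptions for illustration. Only the 0.58 hurdle comes from the study.

```python
def auc(scores, labels):
    """Probability that a random winner outscores a random loser (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one winner and one loser")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def walk_forward_folds(n_events, n_folds):
    """Yield (train, test) index ranges over time-ordered events.

    Training always ends where testing begins, and the training window
    grows by one block with every fold.
    """
    block = n_events // (n_folds + 1)
    if block == 0:
        raise ValueError("too few events for this many folds")
    for k in range(1, n_folds + 1):
        train_end = k * block
        test_end = n_events if k == n_folds else (k + 1) * block
        yield range(0, train_end), range(train_end, test_end)


HURDLE = 0.58  # the pre-registered hurdle quoted in the study


def clears_hurdle(scores, labels, hurdle=HURDLE):
    """True only if out-of-sample AUC meets the hurdle fixed in advance."""
    return auc(scores, labels) >= hurdle
```

The point of fixing the hurdle before looking is that 0.52 cannot be argued up to a pass afterwards.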

Mean T+5 return by announcement category

Drilling results, capital raises, contract wins, earnings, dividends. The categories an LLM flags as high-materiality. The ones that move on average are the ones that move down.

Source: Phase 1 CAARs, T+5 window, market-adjusted vs ^AXSO. Only the negative drifts cleared statistical significance.
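Testing many categories at once invites false positives, which is why the Phase 1 results were BH-FDR corrected. A minimal sketch of the Benjamini-Hochberg step-up procedure follows. The function name and the 0.05 level are assumptions, and the procedure takes whatever p-values the category tests produce.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Sort the p-values, find the largest rank k whose p-value is at most
    alpha * k / m, and reject every hypothesis ranked k or better.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[i] = True
    return rejected
```

A category with an uncorrected p-value just under 0.05 can fail this test when it is one of many, which is the behaviour you want from a screen that looks at every category at once.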

The mirage

So we inverted the question, and that is where the experiment became a test of discipline rather than modelling.

Instead of asking the model to predict, we found the announcements that actually moved the price. The ones that jumped more than fifteen percent up and the ones that fell as hard. Then we asked the model to reverse-engineer what they had in common. Drilling results above a grade threshold. Capital raises at a steep discount. Distribution deals with a named counterparty. Patterns that were specific, and that made economic sense.

On the discovery data, the features looked alive. Announcements that triggered them beat the rest by more than eight percentage points over five days. In most write-ups, that number becomes the headline, and the piece gets titled something about finding alpha with AI.

It was a mirage, and three controls caught it.

First. The eight points were carried almost entirely by two or three individual announcements with enormous moves (+132%, +63%, +14%). Strip them out and the effect collapses to noise.

Second. The features had been reverse-engineered from those very winners. Measuring them on the same data is circular by construction. The only honest test is data the method has never seen, which is why half the dataset was sealed at the start and never opened.

Third, and this is the move worth keeping. We ran the same discovery process on a control set of deliberately boring announcements, the ones where the price barely moved at all. The model found patterns there too. Confidently. Those are the patterns it invents rather than the ones that exist, and they measure precisely how much of your apparent signal is the model talking to itself. Anyone using an LLM as a feature-discovery engine on financial data and not running this control is reading noise back to themselves with a straight face.
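The first two controls are mechanical enough to sketch. The code below recomputes the spread after dropping the largest winners, and shows one way to seal a holdout deterministically by hashing the announcement ID. Both are illustrations under assumptions. The article does not say how the split was made, so the hashing scheme, the salt and the function names are hypothetical. In the toy data only the three large moves come from the text above, and every other number is made up.

```python
import hashlib
from statistics import mean


def spread(firing, rest):
    """Mean return of feature-firing events minus mean return of the rest."""
    return mean(firing) - mean(rest)


def spread_without_top(firing, rest, k):
    """Control one: recompute the spread after dropping the k largest winners."""
    if not 0 <= k < len(firing):
        raise ValueError("k must leave at least one firing event")
    trimmed = sorted(firing)[: len(firing) - k]
    return spread(trimmed, rest)


def is_sealed(announcement_id, salt="holdout-v1"):
    """Control two: send roughly half of all IDs to the holdout, deterministically.

    The same ID always lands on the same side, so nobody can reshuffle
    the split after seeing results.
    """
    digest = hashlib.sha256(f"{salt}:{announcement_id}".encode("utf-8")).digest()
    return digest[0] % 2 == 1


# Illustrative returns as fractions. 1.32, 0.63 and 0.14 echo the moves
# quoted above; the remaining values are invented for the example.
firing = [1.32, 0.63, 0.14, -0.02, 0.01, -0.03, 0.00]
rest = [0.00, -0.01, 0.01]

full_spread = spread(firing, rest)
trimmed_spread = spread_without_top(firing, rest, k=3)
print(f"all events: {full_spread:+.3f}, top three removed: {trimmed_spread:+.3f}")
```

On this toy data the spread falls from roughly 29 points to slightly negative once the three winners are removed, which is the pattern the first control is built to expose.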

The fast version

Then we tested the version that matched the actual strategy. Execute within minutes of release. Hold to the close. Catch the absorption-period move before the wider retail base catches up.

The separation vanished. The features fired on 7.6 percent of announcements. On the clean pre-open subset, same-day capture for firing announcements was negative 0.31 percent. The first violent reaction lives inside the opening auction, which a trader reacting to the news cannot reach. What was left to capture after the open was, on these numbers, nothing.

In-sample edge vs the cost hurdle

Same-day capture for feature-firing announcements on the clean pre-open subset against a roughly one to three percent round-trip cost hurdle for thin small-caps reacting to news. The bar sits below the line, and the in-sample T+5 number that looked alive is overfitting to three outliers.

Source: Phase 2 diagnostic, n=29 firing announcements in the sampled discovery half. The T+5 figure is in-sample only and was expected to regress to zero on the sealed holdout, which was never opened.
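The hurdle arithmetic is short enough to write out. Using the same-day capture and the one to three percent cost range quoted above, net edge is gross capture minus round-trip cost. The function and constant names here are illustrative assumptions.

```python
def net_edge_pct(gross_capture_pct, round_trip_cost_pct):
    """Edge left after costs, in percentage points."""
    return gross_capture_pct - round_trip_cost_pct


SAME_DAY_CAPTURE_PCT = -0.31             # clean pre-open subset, from the study
COST_LOW_PCT, COST_HIGH_PCT = 1.0, 3.0   # rough round-trip cost range

best_case = net_edge_pct(SAME_DAY_CAPTURE_PCT, COST_LOW_PCT)
worst_case = net_edge_pct(SAME_DAY_CAPTURE_PCT, COST_HIGH_PCT)
print(f"net edge after costs: {worst_case:.2f}% to {best_case:.2f}%")
```

Even at the cheap end of the cost range the strategy starts more than a point underwater, before tax.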

What the no is and is not

So the market prices small-cap announcements into the opening price, and an LLM that read every PDF in full could not get in front of it. That is not a sad outcome. It is a clean one. It closed a question that would otherwise sit open and tempting, and it did so for a cost that rounds to zero against the capital it protected.

This is not "every announcement-based strategy is dead." It is "this particular shape of strategy is dead at this particular latency and universe." Higher-frequency intraday data might tell a different story. A different universe might. Layering announcements with other signals might. Those are different questions, and the apparatus is now in place to test them at the same marginal cost.

The harder lesson

The agents on this project were rigorous in the way that counts. They kept the holdout sealed. Refused to retune the method after seeing the answer. Flagged the eight-point in-sample result as a bug rather than a breakthrough. Ran the boring-announcement control without being asked twice. That posture, not the modelling, is the part worth keeping.

When experiments are cheap, the binding constraint is no longer effort or access. It is the willingness to disbelieve a result you would very much like to be true. Everyone is about to be able to run experiments like this. Far fewer will be willing to publish the nulls, name the mirages, and leave the holdout sealed for next time.

The scarce skill in an agentic world is not generating answers. It is the discipline to throw out the ones that are only telling you what you hoped to hear.
