For this week’s ramble, I’d like to take you on a tour of how cycles tend to begin and end in the development of new scaling technologies, and why I think things might be different for AI. In particular, I’d like to dig into the VC-powered business model behind the current funding structures for big AI companies like OpenAI, and for the smaller companies building on these underlying technologies, and how I foresee things going in the next 12-18 months. I’ll also take a brief look at some important canaries which might suggest that, if you have a lot of money invested in this technology, it might be a good idea to deleverage a bit!
Let’s start with something technical for a change - what’s a Pareto frontier? The concept is a pretty old one, due to the Italian economist Vilfredo Pareto, who in the 1890s was interested in the efficient allocation of resources in economic systems. He looked at this (effectively) from the point of view of finding where collaboration ends and competition begins - the point at which there is no longer a way of getting more for nothing, and we have to start making trade-offs between things. It turns out that this concept is pretty universal - there are lots of places in the world where we need to do multivariate optimisation under constraints, so the idea quickly made its way into finance and operations research. The Pareto frontier is the set of solutions on which you cannot make any one objective better without making another worse - it defines the “optimal” trade-offs for a given system. The whole point is that if you’re on the frontier, you can’t simply do better - all you can do is trade off performance on one objective against another…
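To make that concrete, here is a minimal sketch of the idea in Python - the two objectives and the candidate points are made up purely for illustration (think “model quality” versus “cost-efficiency”, both to be maximised):

```python
# Minimal Pareto frontier sketch. Points and objectives are invented for illustration.

def pareto_front(points):
    """Return the points not dominated by any other point.
    One point dominates another if it is at least as good on every
    objective and strictly better on at least one (here: higher is better)."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(q[i] >= p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (quality, cost-efficiency) for five hypothetical models
candidates = [(0.9, 0.2), (0.7, 0.7), (0.4, 0.9), (0.6, 0.6), (0.3, 0.3)]
print(pareto_front(candidates))  # -> [(0.9, 0.2), (0.7, 0.7), (0.4, 0.9)]
```

The last two candidates are strictly worse than something else on offer; the first three are the frontier, and moving along it means giving up one objective to gain on the other.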
Uber as a salutary lesson
I’m going to start with Uber as an apparently similar case to study, because there are interesting, unexpected, and important parallels with the model underlying how the Gen AI foundation companies (OpenAI, Anthropic, Google) are acting. So before we start to look into it, how did Uber operate? Founded in the fallout from the Great Recession, in March 2009, Uber grew into one of the best-known flagship companies of what became known as the “sharing economy” - a model of passively or actively using assets that you own but don’t use that much as a means of generating income. Other examples include Airbnb, TaskRabbit and Vinted, and the growth of this portion of the economy led Ida Auken to write the controversial phrase:
“you’ll own nothing and be happy”
in an essay later picked up by the World Economic Forum, which caused a portion of the population to panic.
Uber’s initial approach to revolutionising the business model behind transportation was perhaps not that revolutionary. For other examples of similar approaches from big tech, read “Chokepoint Capitalism” by Cory Doctorow and Rebecca Giblin - they will give you the whole playbook. Essentially what Uber did was insert themselves into the market as a broker between people who wanted to get somewhere (“riders” in their language) and people who were available to drive them around (“drivers”). This seems like a great opportunity for people to trade their spare time and underused resources for spare cash - if you get off work at 5pm and don’t have anything else to do, you can take your car out for a couple of hours and ferry people around, earning some extra cash for your labours. On the other side of the equation, it’s also great for riders, because you’re injecting extra competition into the market to reduce prices for them. Everyone wins except the people whose labour was already in the market (taxi drivers), who now had to compete for a smaller pool of work and who had previously relied on limited supply to keep prices workable for them. Guess who we saw protesting at the rise of Uber…
However, these simple dynamics miss an important aspect of how the company intended to work - enshittification. Let me explain. As a chokepoint business, your aim is not to be one of several potential brokers that people go to for a service; you want to be the only viable place, because then you can set your terms and nobody can go anywhere else if you decide to take more profit for yourself. In order to do this, Uber needed to “win” the markets it entered - it needed to be the only place that people went to for last-mile mobility. To do this, it needed to do two things: incentivise riders to join the platform in droves, and incentivise drivers to do the same, thereby squeezing its competition and gaining control of the market. So it ran trips at a loss - it gave drivers bonuses for joining the platform and subsidised journeys for riders, effectively burning billions of dollars in investor money to try to become the only game in town. And to be fair to them, fifteen years on, it seems to have worked - they turned their first operating profit in 2023 and seem to be able to continue operating profitably, albeit with many drivers feeling the pinch.
All good, but what about AI?
On the surface of it, this business model seems to have little to do with AI at all - isn’t AI a pure software business rather than a chokepoint market? So why on earth am I telling you all of this? Patience, dear reader!
The excitement about “scaling laws” and the AI singularity all started fairly mildly with the January 2020 paper “Scaling Laws for Neural Language Models” by a group of OpenAI researchers including Jared Kaplan and Dario Amodei (later CTO and CEO at Anthropic, respectively). The paper itself points out some interesting behaviours of the early stable of language models (they worked with both long short-term memory (LSTM) implementations and the ubiquitous transformer that powers everything now), including that dataset size, model size and compute budget all gave a predictable contribution to the effectiveness of models over a wide range of scales. In particular, they pointed out that larger models were markedly more sample-efficient, reaching lower loss per token from the same amount of data, which suggested that big models were the way to go. People got excited, and suddenly scaling laws were everywhere, e.g. the Chinchilla paper which updated the original fits, and this post on LessWrong. (Interesting side note - Jared Kaplan and I have the same sort of background in theoretical physics, and have friends in common - I suspect that this interest in scaling laws has a lot to do with that influence, because this is a very “physicist” way to think!)
Essentially what researchers had found was the outline of a Pareto frontier for how you should balance model size (N, the number of parameters), dataset size (D, the number of training tokens) and compute budget (C, the number of floating point operations). From the early paper it seemed like for a model with about 8 times as many parameters, you needed roughly 5 times as much data. This relationship held over roughly seven orders of magnitude, and so people took it to represent a robust scaling law. Scaling laws of this nature in the physical world hold right up until the system hits a phase transition, and seven orders of magnitude is a long run for any single regime to endure.
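As a rough illustration of what that rule of thumb implies, here is a toy sketch in Python. The exponent comes straight from the “8x parameters wants ~5x data” figure above; the reference point (a 1B-parameter model trained on 20B tokens) is a hypothetical anchor I have picked for illustration, not a number from the paper:

```python
import math

# If 8x more parameters wants ~5x more data, the implied power law is D ∝ N^k.
k = math.log(5) / math.log(8)   # ≈ 0.77

def data_needed(n_params, ref_params=1e9, ref_tokens=20e9):
    """Training tokens implied for a model with n_params parameters,
    anchored to a hypothetical reference model (1B params, 20B tokens)."""
    return ref_tokens * (n_params / ref_params) ** k

for n in (1e9, 8e9, 64e9, 512e9):
    print(f"{n:.0e} params -> ~{data_needed(n):.1e} tokens")
```

The uncomfortable part is that the left-hand column can keep growing happily, while the right-hand column runs into the hard ceiling of how much text actually exists - which is where the story goes next.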
Why does this matter?
Well, at first we saw a boom in model size, and the performance followed - GPT-2 got released, and was amazing but kind of janky, but by GPT-3 in the summer of 2020, things seemed to be getting to the point of being actually useful. At some point on this worldline, we got very excited that this empirical scaling law would last forever, and that we would always be able to build bigger and better models that would perform optimally. The current LLMs are trained on something around 20Tn tokens of text, which is around 15Tn words using most tokenization techniques. Only around 60Tn words of text have been written in total, across all languages, since human beings started writing, and that number is climbing only slowly. On top of that, a lot of this text is essentially inaccessible to modern machine learning techniques, as it’s privately stored and effectively un-scrapable.
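Just to show how quickly that ceiling approaches, here is the back-of-envelope arithmetic using the round numbers above (all of which are rough estimates, not precise measurements):

```python
# Back-of-envelope: how much of humanity's written output is already in training sets?
tokens_trained_on = 20e12        # ~20Tn tokens for a current frontier model (rough)
words_per_token = 0.75           # rough words-per-token ratio for common tokenizers
words_trained_on = tokens_trained_on * words_per_token   # ~15Tn words

total_words_ever_written = 60e12  # the ~60Tn estimate quoted above

share = words_trained_on / total_words_ever_written
print(f"Share of all text ever written: {share:.0%}")   # -> 25%
```

And that is before you subtract everything that is private, paywalled or otherwise un-scrapable - hence the book-scanning.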
News stories from this year of Anthropic buying second-hand books online, cutting the pages from the bindings and scanning them to add to their training set should convince you of two things I have said before:
We are running out of data that is easy to scrape.
Synthetic data won’t cut it for training LLMs, and the big guys know this.
Which is all to say that the D in our scaling laws is flatlining. This explains two other things that I have been saying for a while:
Model performance is tailing off (the release of GPT-5 this week seems to have convinced a lot of naysayers on this front).
Models have tended to become much the same in terms of performance. There are a few small differences, but on the whole performance has converged (to the point that even the “edgy” model, Grok 4, saw fit to call out Elon’s exaggerations this week!).
Just to emphasise the Pareto nature of this, we have two things we want to optimise for here - model performance, and inference cost. Let’s take them in turn.
Model performance
As I have written previously, we can picture large language models as operating something like a knowledge graph - shallow (easy) ideas can be found near the root node, and more precise material that answers deeper or more specialist questions is found closer to the leaves of the tree. This means that for specific queries, we want a way of getting to the material near the leaf nodes, where the good stuff is.
A clever approach to this, taken by all the major labs, has been to use reinforcement learning on top of the LLMs themselves to “reason” - essentially creating a multi-turn dialog between LLMs to go deeper into the leaves of the tree. I put reasoning in inverted commas because, like “hallucinations”, it’s another example of researchers misusing a word to make things sound better than they are - it turns out that LLMs don’t reason at all! This trick gives the models better human-rated scores of problem-solving ability, and honestly these are the models I mostly reach for these days, but like relying on very smart human beings, it comes with a catch - it seems to push the bullshit (“hallucination”) rate higher; another Pareto frontier.
There’s another component to this too - because even for a small-ish query, we are running multi-turn dialogs, which means that we are consuming many more tokens for the same question; this pushes up the query cost significantly for these models. We are trading significantly more inference-time compute for better human ratings.
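A quick sketch of what that does to the economics of a single query - every number here is an illustrative assumption (a hypothetical blended token rate, a hypothetical token budget per hidden reasoning turn), not anyone’s real pricing:

```python
def query_cost(prompt_tokens, output_tokens, reasoning_turns,
               tokens_per_turn, usd_per_1k_tokens):
    """Total tokens and cost for one query, assuming each hidden reasoning
    turn burns extra tokens on top of the visible prompt and answer."""
    total = prompt_tokens + output_tokens + reasoning_turns * tokens_per_turn
    return total, total / 1000 * usd_per_1k_tokens

for turns in (0, 4, 16):
    tokens, usd = query_cost(prompt_tokens=500, output_tokens=800,
                             reasoning_turns=turns, tokens_per_turn=1500,
                             usd_per_1k_tokens=0.01)
    print(f"{turns:>2} reasoning turns -> {tokens:>6} tokens, ~${usd:.2f}")
```

Same question, roughly twenty times the tokens - the better answers aren’t free.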
Model inference and energy costs
At the same time, model inference costs are pretty predictable, and depend mainly on the total number of operations needed to compute the next tokens. A rough formula for the energy cost is:
E_total = alpha x N^beta x P + gamma x M + delta x C
where in the first term beta is around 1.2-2.0, alpha is a positive constant, N is the model size as before, and P is the numerical precision in bits (e.g. FP4, FP8 etc.). The other two terms capture the cost of moving data to and from memory (M) and the cost of communication between devices (C), which are usually much smaller.
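To make the shape of this concrete, here is a minimal sketch of the formula in Python. Every constant in it is a placeholder I have invented for illustration - the point is only that the compute term dominates and grows super-linearly in model size, so the remaining levers (precision, memory, communication) buy you comparatively little:

```python
def inference_energy(n_params, precision_bits, mem_bytes, comm_bytes,
                     alpha=1e-13, beta=1.3, gamma=1e-10, delta=1e-10):
    """E_total = alpha * N^beta * P  +  gamma * M  +  delta * C
    (arbitrary units; all coefficients are illustrative placeholders)."""
    compute = alpha * (n_params ** beta) * precision_bits
    memory = gamma * mem_bytes
    comms = delta * comm_bytes
    return compute + memory + comms

# Dropping precision from FP8 to FP4 halves the dominant term, but a 10x bigger
# model multiplies it by 10^beta ≈ 20x - the compute term always wins.
print(inference_energy(n_params=7e10, precision_bits=8, mem_bytes=1e9, comm_bytes=1e8))
print(inference_energy(n_params=7e11, precision_bits=4, mem_bytes=1e9, comm_bytes=1e8))
```

The coefficients are the part that hardware vendors fight over; the N^beta term is the part nobody escapes.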
A typical agentic workflow to build a to-do app will consume around 120,000-180,000 tokens in total, which means that its energy costs will be somewhere in the ballpark of $0.15 on standard GPUs down to around $0.007 on highly optimized tensor processing units (TPUs). This is just the running cost of the machines, not the rest of the infrastructure. As you can see, moving to highly optimized custom silicon like TPUs can get your costs down roughly 20-fold, but note that none of the big players are talking about doing more of this apart from Google, who already took the hit - for everyone else, this is a post-winning-the-market optimization.
Realistically, it’s unlikely that companies are going to bear the cost of these sorts of optimisations alone (note that Sam A’s request for $7Tn has made a reappearance), and in any case we’re talking about at most 2 orders of magnitude of improvement, not the typical 10^6 to 10^9 improvements we have seen with pure software - to misquote Richard Feynman, “there’s not so much room at the bottom”.
Where now?
My guess would be that, like Uber, the aim of the game is going to be to continue to burn investor money to get a stranglehold on the market, and then jack up the prices. As I posted on LinkedIn a couple of weeks ago, with this a definite possibility, if you’re building an app on top of ChatGPT or Claude, it would be sensible to build it in such a way that it is relatively immune to the following (a quick stress-test sketch follows this list):
a 20% drop in the performance of your app, as companies degrade model quality to try to improve efficiency.
a 5-10x increase in costs as companies start to charge what the infra really costs, rather than their “best price”.
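Here’s the sort of quick stress test I mean, as a Python sketch - the app, its revenue and its token consumption are all placeholder assumptions, and the base token price is a hypothetical blended rate rather than any provider’s real tariff:

```python
def margin_per_user(revenue, tokens_used, usd_per_1k_tokens, other_costs):
    """Monthly gross margin per user for a hypothetical LLM-backed app."""
    llm_cost = tokens_used / 1000 * usd_per_1k_tokens
    return revenue - llm_cost - other_costs

base_rate = 0.01   # today's (possibly subsidised) blended $/1k tokens - an assumption
for multiplier in (1, 2, 5, 10):
    margin = margin_per_user(revenue=15.0, tokens_used=600_000,
                             usd_per_1k_tokens=base_rate * multiplier,
                             other_costs=4.0)
    print(f"{multiplier:>2}x token price -> margin per user: ${margin:,.2f}")
```

If a 5x price rise turns your margin deeply negative, you are the pizza delivery service in the next paragraph.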
Where you definitely don’t want to be is the pizza delivery service that relies on Uber drivers to deliver your food, and where your profit margins crucially depend on what Uber charges you for this service. By all means do that in the Paul Graham sense of “doing things that don’t scale”, but make sure that you have a viable plan B as soon as possible so you can escape the mercenary clutches of the company that is effectively investing in you right now.
In the data era, there’s an even better play - use these models whilst they are still subsidised to build yourself a beachhead in a market (do things that don’t scale) and then collect the data or users that will allow you to continue adding value even when the taps are turned off. I like Lovable as a concept, but I am unconvinced that their business model is long-term sustainable. Maybe I’m wrong - let’s see.
Canaries
This all sits as uncomfortable context for the grand launch of GPT-5, which turned out mainly to be a UI fix for a problem that OpenAI themselves created - too many models. The lacklustre reception has fed into an environment where everyone is perhaps starting to realise that AI is a bubble (hello, I’ve been telling you all this since 2023), and that maybe the market is overheated. There have been two recent financial signals that point the same way:
ASML had their earnings call last month, in which they announced stronger performance, but also said that they “couldn’t guarantee growth” for the rest of the year. Bearing in mind that they make the equipment used to fabricate Nvidia chips (they are the shovel factory for Nvidia’s shovels in the AI goldrush), this should give us pause. Essentially, it seems like they believe that demand for Nvidia chips isn’t going to grow much more, and therefore that the foundries making them are not going to be opening more chip fabrication plants (“chip fabs”). The big hyperscalers have been trying to dig themselves a physical moat by getting hold of more GPUs to build bigger data centres, but essentially ASML are saying they think that era is coming to an end.
CoreWeave took a serious haircut. The CoreWeave business model is a little strange - they have an agreement to provide compute to OpenAI, but are part-owned by Nvidia, and source their chips from Nvidia. So if CoreWeave manage to get hold of more chips from Nvidia, the value of Nvidia’s stake in them goes up and… honestly, this is such a mess. It also makes it clear how scarily intertwined all of these different companies are, and how fragile the whole system has become.
Where next?
I think the world is starting to realise that this growth-at-all-costs approach to LLMs is not going to bear fruit. This is likely to be painful for all of us as the unwinding of entrenched positions takes a bite out of our pension funds, but there is hope too. As I have often pointed out in these articles, in the Gartner hype cycle the place where the real work happens is once you’re over the peak of inflated expectations. I really believe that the future of AI will look very different to what we see right now, and that there are smart folks out there already inventing it. Maybe the repeated cycles of glaciation and thaw in the AI world are not a bug; maybe they’re a feature!
This post also appears on my Substack.