Let’s take a look at the world from a social point of view this week. In general, human organisation oscillates between two phases, which are very different in terms of power structures and politics. These phases are already playing out on the new AI frontier, and so past developments might help guide our thinking as to where we go from here.
The Square and the Tower
Historian Niall Ferguson wrote a well thought-out book on this topic, in which he posits that human power structures can be understood as adopting two different phases, each of which, under the right circumstances, becomes unstable and gives way to the other. Roughly speaking, we can understand these as:
A collaborative phase (the “Square”), where people work together to try to achieve their goals. Power is loose, diffuse, and held only at the “will of the people”.
A competitive phase (the “Tower”), where individual agents compete with one another to try to win the prize of “control” over others. This control is taken, not given.
Ferguson’s broad thrust is that we can understand both ancient and modern historical events as the pendulum swinging backwards and forwards between these two states, and that the human experience in each of them is driven by the needs and fears of society in that epoch.
The examples that always come to mind for me are those where we see rapid changes from one setup to the other, or where we see organisations with the opposing systems of government in direct conflict with one another. Indulge me, and let me give you an example of each.
The Arab Spring
In the spring of 2011, after decades of rule by ruthless, autocratic leaders, peoples throughout the Middle East and North Africa had had enough. Young people in these countries, who had grown up knowing nothing but the authoritarian rule of their ageing dictators, came to realise that they were working through their degrees at great personal cost, only to graduate into economies which had no use for the skills they had acquired. Youth unemployment skyrocketed, leaving them without even the basic jobs that would have allowed them to leave home.
This lack of opportunity, combined with high levels of education, access to social media which allowed them to communicate with one another freely, and the time to self-organise, made for an explosive mix: the “Tower” phase of human organisation under authoritarian rule was becoming fragile. All it took to ignite it was a fruit seller in Tunisia self-immolating in protest at his treatment by police, and suddenly protest spread virally throughout the region. The way the protests organised themselves is fascinating in its own right, and is wonderfully covered by Zeynep Tufekci in “Twitter and Tear Gas”, but the most important point to note is that they were not dependent on a single person or a small group. Instead, they took on the character of shoaling fish - organised at will by many volunteers across several different social media platforms. This approach has the obvious advantage that the movement depends not on individuals but on a critical mass, and as such is hard to decapitate.
However, what makes the “Square” approach strong also makes it weak. The nature of this “public discourse” structure means that it is hard to make decisions quickly in response to events (because everyone must be informed), and that internal disagreement can disempower or even destroy the movement altogether. In the Tahrir Square protests in Egypt, young people flocked to the central square and occupied it for months. When Hosni Mubarak was finally removed from power, the occupiers had no idea how to organise for government. There were no power structures within the movement that could slot into the formal organisation of a post-Mubarak government, which made it more or less impossible to replace the old leader with a new one representing the movement which had deposed him. As a result, the one organisation involved in the protests which did have these power structures, the Muslim Brotherhood, swept to power in a move which calls to mind Vladimir Lenin’s observation about the Bolsheviks in Russia - “power was lying in the street; we picked it up!”.
The Troubles
The second example of this effect is the Troubles in Northern Ireland, where we saw a very hierarchical and “traditional” armed force doing the work of a repressive state (the infamous Royal Ulster Constabulary, or RUC) pitted against a loose network of armed insurgents (the Irish Republican Army, or IRA). The flashpoint here was arguably the collapse of the Stormont government in 1972, and the imposition of direct rule on Northern Ireland by the British Government of Ted Heath (a state of governance which, bar a brief period of devolution, persisted for 35 years until 2007). In doing this, the British Government fundamentally misunderstood the power structures of the time, and believed that by changing the way in which law and order was imposed, they would quell the rebellion - classic “Tower” thinking.
In fact, the structure of the IRA’s networks forms a blueprint for how to set up a successful rebellion - loose, informal networks of individuals united by a common goal, but without the risks of a formal control structure. Individuals were highly networked, with a high degree of “centrality” (a formal measure of how well connected members of a network are), and women were often the key transmission nodes between groups. As a result, the networks were extremely difficult for a “Tower” structure to disrupt - remove one node and the topology and information flows barely change, so there are no “obvious targets” to take out.
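We can make this concrete with a toy sketch in plain Python (the two five-node networks are invented for illustration): in a hub-and-spoke “Tower”, removing the central node disconnects everything, whereas a mesh-like “Square” survives the loss of any single node.

```python
from collections import deque

def is_connected(adj, removed=frozenset()):
    """Breadth-first search over an undirected graph (adjacency dict),
    ignoring any nodes in `removed`. True if the survivors are one piece."""
    nodes = [n for n in adj if n not in removed]
    if not nodes:
        return True
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in removed and m not in seen:
                seen.add(m)
                queue.append(m)
    return len(seen) == len(nodes)

# Hub-and-spoke ("Tower"): every node connects only through node 0.
hub = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}

# Mesh ("Square"): a ring with cross-links, no single point of failure.
mesh = {0: {1, 2, 4}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {0, 3}}

print(is_connected(hub, removed={0}))   # removing the hub shatters the network
print(is_connected(mesh, removed={0}))  # removing any mesh node leaves it connected
```

The asymmetry is the whole point: the Tower has one obvious target, the Square has none.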
It was as a result of this that a small organisation, without state backing or access to the same weapons as its opponent, was ultimately able to force the British Government to the negotiating table in 1998, reaching a negotiated peace that has more or less held since.
Squares and Towers in AI land
“Get to the point, Chris”, I hear you cry. Okay, sorry, I like to moonlight as a historian sometimes! How does this apply to the world of AI and machine learning? Let’s do a whistlestop historical tour of how we got to where we are, and then we can try to crystal-ball-gaze to work out what’s coming next.
Machine learning as we currently know it has been around since the 1950s. The first artificial neural network was built (in hardware!) in 1958, and the first chatbot (ELIZA) made her appearance in 1966. The field has already been through several cycles of boom and bust, hype and collapse, driven by moments of huge excitement and enthusiasm, followed by periods of total dejection at how it fails on “simple tasks”. The current wave found its trigger in the demonstration, in 2012, that graphics processing units (GPUs) - specialised chips designed for fast computer graphics - could be repurposed to do the sorts of calculations needed to make deep neural networks work. The AlexNet paper, which won that year’s ImageNet competition by miles, was powered by this observation, kicking off a wave of interest in what had previously been an academic curiosity - an approach to computation which it was thought would never be practical in the real world.
Like most academic research, this work came out of an environment which is a curious mix of Square and Tower - although there is something of the democratic about how academic research is circulated, it’s important to remember that research labs themselves take on the operating model of the Tower, as do grant-awarding bodies. The opportunity for genuinely networked collaboration on AI would have to wait until Google (a very different company now than it was then) released TensorFlow as an open source project in 2015. Like all the best open-source projects it was a mess, hard to use and hard to extend, and as such it made something which had previously been impossible for hobbyists like me merely hard. At the same time, Kaggle competitions allowed groups of people who had never met to collaborate and work together, trying to beat other distributed teams. Forgive my nostalgia, dear reader, but looking back this feels like the golden age of this wave of AI - sure, the models were rubbish, failed frequently and unexpectedly, and data cleaning took up a huge amount of our time, but there was a wonderful sense of possibility.
As the underlying work developed, groups within bigger research institutions, both public and private, made significant advances on the underlying architecture of the models being used, and convolutional neural networks and LSTMs gave way to the now-ubiquitous transformer block. At the same time, as a community we started to become aware of a phenomenon that was later beautifully crystallized in Rich Sutton’s “bitter lesson”:
“The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
I learned this personally when working on a neural search engine - at some point, no matter what we tried, our question-answer matching engine plateaued in performance. Then Google released BERT (an early, hugely influential model based on the transformer architecture), which was trained on a huge dataset of text. Plugging BERT into what we had previously built worked a treat, even with the same-sized training set, which made me suspect that what its pretraining gave us was essentially a form of data augmentation.
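For the curious, here is a minimal sketch of how that kind of question-answer matching works. The `encode` function is a fake stand-in for a real pretrained encoder like BERT - the texts, dimensions, and vectors are all invented so the example runs without any ML library; the point is the pipeline shape, not the scores.

```python
import numpy as np

rng = np.random.default_rng(42)

def encode(texts, dim=8):
    """Stand-in for a pretrained encoder like BERT. A real system would run
    each text through the model to get a dense vector; here we just draw
    random unit vectors so the sketch is runnable anywhere."""
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

questions = ["how do I reset my password?"]
answers = [
    "Click 'forgot password' on the login page.",
    "Our office is open 9-5 on weekdays.",
]

# Question-answer matching is nearest-neighbour search in embedding space:
# embed everything, then rank answers by cosine similarity to the question.
q_vecs, a_vecs = encode(questions), encode(answers)
scores = q_vecs @ a_vecs.T  # cosine similarity, since rows are unit-normalised
best = int(scores[0].argmax())
print(f"best answer: {answers[best]!r}")
```

Swapping a better encoder into `encode` improves the whole system without touching the matching logic - which is exactly why BERT was such a free lunch.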
Consolidation of Power
Even in this period, research labs inside large companies had access to two things the rest of us lacked:
Access to large, private datasets collected by their companies from normal operations.
Access to vast amounts of compute.
However, the use of the models built from these tools was still somewhat limited - products were really intended to enhance some specific part of the user experience of a platform, or to streamline some internal process to require less human input; the idea of an “everything app” was still some way off (with apologies to those of you who, like me, have an allergic response to anything linked to Elon!). In this period I remember speaking to friends working at Google and Facebook about the stakes for them - they already had highly-developed internal teams who were collecting, processing, updating and training on billions of data points, and it really mattered: a change of 0.1% in accuracy for some of their core models could make or lose the company tens of millions of dollars.
It’s also in this period that we start to see the “data flywheel” effect - if you can build a model that improves your user experience using data you currently have, you can then turn this into a competitive advantage as follows:
Use your existing data to improve the experience.
The improved experience brings in more users.
This gives you more data to train your next model on, so you can re-enter the loop at 1.
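The loop above can be sketched as a toy simulation. Every constant here is invented purely for illustration - the point is the compounding dynamic, not the numbers.

```python
def flywheel(users=1_000, data_per_user=10, rounds=5, conversion=0.0005):
    """Toy model of the data flywheel: more data -> better model ->
    more users -> more data. All constants are made up for illustration."""
    data = users * data_per_user
    history = []
    for _ in range(rounds):
        quality = data ** 0.5                       # diminishing returns on data
        users += int(quality * conversion * users)  # better product attracts users
        data += users * data_per_user               # new users generate new data
        history.append((users, data))
    return history

for users, data in flywheel():
    print(f"{users:>8} users, {data:>12} data points")
```

Notice that a rival starting the loop later never catches up under these dynamics - each turn of the wheel widens the gap, which is why the flywheel reads as a moat.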
This approach allowed a handful of companies to do even better and “own” entire markets by virtue of their machine learning models, consolidating power in the ML world in the hands of a few big tech players.
“The Tower” era
Interestingly, none of these organisations were founded on the basis of using ML. Probably the closest to ML-native was Nvidia, who created a tightly coupled ecosystem of hardware and software for rendering computer graphics, but discovered that “linear algebra is linear algebra”, and so could move easily across into the ML world. From around 2020 we saw two kinds of aggregation crossing over: the aggregation of compute, in particular GPUs, and the aggregation of data, both concentrated in the biggest technology companies.
Seen through this lens, a lot of things which have made relatively little sense since 2020 start to fall into a more coherent picture - e.g. the fact that OpenAI and Anthropic burst onto the scene without any direct association with big tech (by virtue of their dubiously-sourced data - a tick in one box) but were then forced to partner with big tech despite being highly valuable (to get the other tick - the GPUs). The lack of “real” competition at the top of the leaderboard can be understood similarly - you need access to both lots of data and lots of compute, which together form a pretty challenging moat. What about DeepSeek, I hear you cry - how did they do it? As far as I can tell from the information publicly available on how their model was trained:
They had access to a huge number of GPUs, used for optimisation problems as part of their parent hedge fund’s quantitative trading work.
They short-cut the need for giant datasets by using ChatGPT to generate a large volume of training material.
In other words, they had a tick in one box and found a way to get the other one.
So there’s a key question here - how can anyone compete in this era of dual, expensive moats? The short answer is that they can’t. However, there are loads of fascinating nuances to this. Firstly, it seems like we’re running short on data. If you doubt me on this, bear in mind that Anthropic have been buying second-hand books and scanning them to add to their training set - I would argue that you don’t do this when there’s a load of easy-to-scrape data still available online. The big models today are trained on something around 20 trillion tokens of text, which seems to be around 30% of all the textual data humanity has ever produced; the rest is beyond the reach of the hyperscalers, at least for the moment. Secondly, the cost of running the models is high, and seems unlikely to come down as fast as we have all been hoping, whilst the model providers increasingly encourage us to use these models in token-expensive ways (e.g. multi-turn dialogs or agentic workflows). Lastly, we have seen play out in the press and in the job market a strange and vicious competition for talent. Meta boss Mark Zuckerberg has been making unbelievable (possible because they aren’t true?!) offers to researchers to join his company, fundamentally missing the mark (ha) of the Bitter Lesson - scale in data and compute beats smart people. This feels like premature optimisation to me, although perhaps Mark knows something I don’t - I hope we’re not already into the final phase of diminishing returns.
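To see why multi-turn use is so token-hungry: each turn resends the entire conversation so far, so cumulative input tokens grow quadratically with conversation length. A quick back-of-the-envelope sketch (the message size and per-million-token price are made-up numbers, not any provider’s real rates):

```python
def total_input_tokens(turns, tokens_per_message=500):
    """Each turn resends the whole conversation so far, so cumulative
    input tokens grow quadratically with the number of turns."""
    history = 0   # tokens accumulated in the conversation so far
    total = 0     # cumulative input tokens billed across all turns
    for _ in range(turns):
        history += tokens_per_message  # user message appended
        total += history               # whole history sent as input
        history += tokens_per_message  # model reply appended
    return total

# Hypothetical pricing, purely for illustration.
PRICE_PER_MILLION_TOKENS = 3.00
for turns in (1, 10, 50):
    toks = total_input_tokens(turns)
    cost = toks / 1e6 * PRICE_PER_MILLION_TOKENS
    print(f"{turns:>3} turns: {toks:>9,} input tokens ~= ${cost:.2f}")
```

A 50-turn conversation costs 25 times as much in input tokens as a 10-turn one, not 5 times - which is exactly why agentic workflows are such good business for the model providers.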
What next?
How can we move away from the Tower phase and back to a more collaborative “Square” phase of AI? I would argue that there’s not much we can do about GPUs - ordinary people are never going to own them at scale. Perhaps if public money is being used to build data centres or the power plants that feed them, we can demand more of a stake, but this feels unlikely given governments’ fondness for funding private enterprise with public money. That leaves data as the lever to push on.
Right now, these models are trained on data scraped openly from the Web, much of it apparently in contravention of websites’ robots.txt directives - a cost pushed onto site owners, who have watched their hosting bills grow and grow. The provenance of this data is starting to be challenged in the courts, with mixed success, but it seems increasingly likely that the technology companies will succeed in pushing for a Section 230-style interpretation of their situation and get away with it. Amusingly, OpenAI, who have pushed a very aggressive interpretation of “fair use”, were the first to cry foul when it seemed that DeepSeek had contravened their terms of use to create their own dataset. If you steal what is already stolen, you don’t make it more stolen.
So, a mea culpa: I have no idea how we would do this, but let’s imagine for a second that we can take control of our data. Perhaps we can use services like Cloudflare to restrict who can scrape and what they can take, and suitably charge the creators of these models for access to the information we provide (even if only to cover the costs imposed on us as denizens of the internet). If we can do this, we can also use our new control to grant access to causes we consider valuable. Maybe we could even consider collaboratively training models for the public good that we all control (albeit with suitable safeguards to prevent their misuse, or their use to augment privately-held models). With that, we would have access to a huge amount of “dark” data which isn’t visible to the big tech companies, giving us hope of being able to train competitive models.
Is it possible? Technically, I don’t know; socially, I think it will be challenging. Is it worth hoping for and working towards? Definitely. If there’s one thing we know from history, it’s that the Tower always becomes fragile to the Square. The opportunity for us to take control of the future of AI as equals will arise sooner or later. I hope it’s an opportunity we can communally decide to take.
This post also appears on my Substack.