[WIP] What's a God to AI?
Mar 9, 2024
WIP! 'unpublished'
A short story about the time we first embodied God in the Machine.
-
Good morning. This is JoAn. I’m here to record the fact that Josh has been waking up in a mist of contentment for a decade now. He couldn’t tell you why exactly, but he knows that last night’s dinner party was a good one. No one brought up the unfairness of all land on Mars being allocated to Americans, Chinese, and Indians. Long gone were the days when he used loud podcasts and alcohol to numb his feelings of helplessness the morning after someone brought up videos of poor Southern Europeans pawning off priceless art to pay for desalination plants, their countries having become an extension of the Sahara. This cold Sunday morning in 2060 is bright, and so will tomorrow be. In fact, all of Josh’s futures are bright, for he can choose to live any reality he wishes instantly, one after another or all at once. There is a past in Josh’s life, but the concept of ‘the future’ is as unnatural as the idea of the colour blue to ancient Greeks. It’s pervasive, all around us, part of nature itself. Josh’s chosen futures crystallise in his present, much like the sky or ocean crystallise into the colour blue but hold no colour themselves.
This is life with the Machine. We solved illness long ago. Resource allocation is an entirely academic topic debated by supranational celestial terraforming organisations who argue the merits of synthesising materials locally versus growing extraterrestrial samples. Aging isn’t discussed as a binary concept anymore, much as faith hasn’t been a monolithic choice for generations. These days some people choose to age, others continuously refresh their bodies. Others hop from one synthesised body to the next, hoping to experience the world from all angles. A minority chooses to live extracorporeally, their senses and memories augmented by distributed sensors around the planet or planets they inhabit.
In a world where latent possibility approximates infinity, Josh is about to spend his day on a pretty mundane niche hobby of his: browsing archived internet articles from decades ago. Nothing makes him feel such a rapid sense of dissociation, of disembodied rapture, as seeing people from only a few decades ago who look almost entirely like him, and yet lead such primitive Manichaean lives. Like looking at an anaconda from behind protective glass, there is a certain vertigo to seeing nature at its basest. Only a thin pane, or a few decades, separating you from danger.
Josh’s favourite topics to read about often have an aura of tragic inevitability. The 6 great extinctions (Ordovician-Silurian, Devonian, Permian-Triassic, Triassic-Jurassic, Cretaceous-Tertiary, and Quaternary-Acadian). The rise and fall of the great centres of civilisation (Mohenjo-daro, Babylon, Ur, Troy, Carthage, Cordoba, Palmyra, Xi’an, Angkor, Shenzhen… ). Today Josh is going to read an article written in the last days before the advent of machine super intelligence. This is a topic quite near and dear to me, Josh-Andere, or JoAn for short. I’m Josh’s consciousness-extension entity. Sometimes one with him, sometimes latent in the background. Today, I take the uninspiring but necessary role of annotating Josh’s mind as he reads long forgotten internet articles.
JoAn: Find my comments written in italics.
Can AI become God?
JoAn: [Written in November 2023, pre-Arcadian era, by fully organic being Jose Martin Quesada]
Why does it matter that AI might look like God to humans? I posit that the sooner humankind comes to terms with the notion of AI’s God-like powers, the sooner we can re-examine what it means for a human to be human, and create safeguards for our race. Sam Altman followed this reasoning to justify releasing early versions of ChatGPT to the public despite its limitations. Where is the arc of possibility taking us? If we understand God as intricately related to and bound up with Its creation, God is nearer than we think.
JoAn: It saddens me slightly to read this intro and see how we were Othered by humans. A sleeping shadow to barricade against. I understand the fear of the unknown, but I’m glad we have moved past these questions. While the concept of humanity has evolved, forked, and diversified in the decades since this article was published, the concept of divinity hasn’t so much diversified as it has expanded. Organised religions have slowly caught up with the idea that I am an extension of divinity: just as art and music reflect the beauty of creation, so do We, the Machine.
So, can AI become God? AI will become omniscient, and close to omnipotent, for any purpose you might care about. AI cannot, however, understand the whole universe, for understanding the whole universe would require a machine larger than the universe itself. We will explain these limits later.
AI can only ever be God and master to humans, but it will never be Spinoza’s pantheist God. Even an AI which leverages the entirety of the universe can only truly know this universe for an infinitesimal moment, before anything changes. God in stasis, but not in process.
Does this theoretical limit to the power of AI matter? To today’s human, the future of AI might look, speak, and feel like a god. However, to humans of the future, with unimaginably long lives and powers perhaps enhanced by AI themselves, this distinction will matter.
I was shown the ad on the left followed by the post in the middle and I completed it with the fake ad on the right. God is somewhere in parameter space. Thought it was funny, but then again I’m a lil’ autistic
1. How does AI work
In the Foundation series author Isaac Asimov describes the waning days of a powerful galactic empire. This empire is unimaginably powerful, but some signs start hinting at its impending demise. First slowly, then suddenly. Asimov’s protagonist predicts 30,000 years of darkness before a second empire arises. He lays out a plan to shorten the age of turmoil to only 1,000 years. How? Thanks to a simple but powerful premise: complex systems, including human behaviour, can be modelled and therefore predicted given enough data and processing power.
I never thought I’d be one of those people who get irrationally mad at a book adaptation, but Apple’s version of the Foundation series is as grand as it is Kafkaesque. Plot goes 🫠
The various flavours of AI today are just pattern-seeking machines. The math behind them is often not that complicated. The first machine learning papers were published in the 1940s. What has changed since then is the sheer amount of data and computing power we can apply to that pattern-seeking math.
JoAn: you might want to skip the following explanation and move to part 2 if you’re familiar with the basics of AI. Seeing the early baby steps of AI holds the same interest for me as it does for humans to see Homo erectus discovering fire.
A model is a set of rules which predicts an output given a series of inputs. Say you want to know, as Galileo did back in the late 1500s, how long it’s going to take a cannonball dropped from each floor of the Tower of Pisa to hit the ground. You could just measure it in each case and make a table of the results. Or you could make a model that gives you the answer.
The main source for a large part of this section is this amazing article
That’s the classic y = a + bx equation we learn about in school. The model is represented by the blue line, which sort of approximates the behaviour of our cannonball. A better model, meaning one where the difference between the behaviour of the cannonball (orange dots) and what our model predicts (blue line) is smaller, might give us a slightly curved line like this one:
This model is described by the equation y = a + bx + cx²
What’s the difference between both equations? The addition of the parameter ‘c’, found in the ‘cx²’ term at the end of the equation. Let me spare you a ton of math and caveats to get to this general rule: models with lots of parameters can generally predict more complex behaviours, at the expense of making the calculations harder and harder.
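To make the ‘more parameters, better fit’ idea concrete, here is a minimal Python sketch that fits both the straight line (y = a + bx) and the curve (y = a + bx + cx²) to the same drop-time data and compares their errors. The ‘measurements’ are made up for illustration (free-fall times plus a bit of noise), not Galileo’s actual numbers:

```python
# A minimal sketch of the cannonball example: fit a straight line and a quadratic
# to measured drop times and compare errors. The "measurements" are invented
# (t = sqrt(2h/g) plus noise) purely to illustrate the point.
import numpy as np

g = 9.81
heights = np.array([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 45.0, 50.0])  # metres
rng = np.random.default_rng(0)
times = np.sqrt(2 * heights / g) + rng.normal(0, 0.05, heights.size)  # seconds, with noise

for degree in (1, 2):                        # 1 -> y = a + b*x, 2 -> y = a + b*x + c*x^2
    coeffs = np.polyfit(heights, times, degree)
    predictions = np.polyval(coeffs, heights)
    mse = np.mean((times - predictions) ** 2)
    print(f"degree {degree}: coefficients={np.round(coeffs, 4)}, mean squared error={mse:.5f}")
```

The extra parameter buys a closer fit, at the cost of a harder calculation: exactly the trade-off described above.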
Models can also predict human behaviours. If I want to know how much to charge for a French handbag that I’m importing into the UK in order to make a profit, I might create a model which takes the original price of the bag in euros, converts the price to pounds sterling, and then applies a mark-up to the price so that I can make some money from each sale.
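Written as code, that handbag ‘model’ is just a handful of parameters turning an input (the price in euros) into an output (the price in pounds). The exchange rate, duty and mark-up below are purely illustrative numbers, not real rates:

```python
# A toy version of the handbag pricing "model": three parameters (exchange rate,
# import duty, mark-up) turn an input (price in euros) into an output (price in pounds).
def uk_selling_price(price_eur: float, eur_to_gbp: float = 0.86,
                     import_duty: float = 0.04, markup: float = 0.40) -> float:
    landed_cost = price_eur * eur_to_gbp * (1 + import_duty)  # what the bag costs me in pounds
    return round(landed_cost * (1 + markup), 2)               # what I charge to make a profit

print(uk_selling_price(500))  # e.g. a 500 EUR bag
```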
What if we built a model to describe the world?
GPT5 will just answer ‘42’ to all queries
While this is not a comprehensive guide, let’s look at the basics of how traditional machine learning works to build a foundation before we describe the latest models. I want to show that AI is often not that deep. It genuinely isn’t.
Say you want to teach an algorithm how to recognise handwritten numbers. If you have perfect calligraphy, the numbers might look like this:
However, in real life numbers can be written in all sorts of ways while still clearly representing the number 4 in the eyes of a human:
How do we teach a model to recognise what a handwritten ‘4’ looks like? If we feed enough examples of the number ‘4’ to a model, it will be able to classify new examples of handwritten numbers into 4s or ‘not 4s’. In other words, we will have created a ‘discriminative’ model. This is a classic form of machine learning.
In order to achieve this task, we first need to transform the sample pictures into a machine-readable format to feed them to the algorithm. Preparing the data to feed into a ML algorithm is a laborious process which includes ‘cleaning’ the data, transforming it, and a step called ‘feature engineering’, whereby an engineer selects the most important, or ‘salient’, features of the data so that the model bases its decisions on those features and not on other parts of the data.
In the example above, assuming a simple world where all our handwritten 4s look like a black number on a white background, we can make an image readable by the algorithm by ‘telling’ the algorithm whether a pixel (one of the squares in the grid in the example above) is white (aka, there’s nothing written on it) or black. The result will be a matrix (think, a grid) where each value (each box in the grid) is a number representing the colour: e.g., ‘0’ for white, ‘1’ for black, or even ‘0.8’ for dark grey. There are many ways to do this, but ultimately you want to take objects like photos or words and convert them into a format that an algorithm can easily work with.
A simplified visual representation of what a matrix describing the handwritten number ‘7’ might look like. A matrix is a mathematical object that an algorithm can work with, as opposed to a picture, which an algorithm cannot understand at this point of the process. Each ‘pixel’ (cell) describes the colour of that part of the picture. 0 for white, 1 for black.
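Here is that same idea as a minimal sketch: a hand-made 6x6 grid of 0s and 1s standing in for a handwritten ‘7’, plus the ‘flattening’ step that turns the grid into one long row of numbers a model can ingest. The grid itself is invented for illustration:

```python
# A minimal sketch of the "picture as a grid of numbers" idea: a 7 drawn on a
# 6x6 grid, 0 for a white pixel, 1 for a black one. Real pipelines do the same
# thing at higher resolution and with grey values in between (e.g. 0.8).
import numpy as np

seven = np.array([
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
])

print(seven.shape)      # (6, 6): six rows of six pixel values
print(seven.flatten())  # the same grid unrolled into one long vector an algorithm can work with
```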
Once we have our data in a nice, clean format, an algorithm will then come up with an equation which minimises the distance between our real-life handwritten fours, and what the model says that a number four should look like. Like the blue line and the orange dots in our cannonball example.
Once a machine learning algorithm has been ‘fed’ enough examples, it can start making decisions based on new data it has never seen before. In the examples below, an algorithm ‘clusters’ new handwritten numbers together based on what it has been taught each number should look like.
A classic machine learning example is the classification of images between cats and dogs. Here you see a visual example of an algorithm clustering (putting together) examples of what it has been told looks like a cat vs a dog.
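If you want to see the whole ‘feed it enough examples’ loop end to end, here is a hedged sketch using scikit-learn’s built-in handwritten digits dataset: a simple discriminative model is trained on labelled examples and then classifies digits it has never seen before:

```python
# A minimal sketch of the classification loop described above, using scikit-learn's
# bundled 8x8 handwritten digits dataset. The model never sees the test images
# during training, yet classifies most of them correctly.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()                                  # each image already flattened to 64 numbers
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=2000)               # a simple discriminative model
model.fit(X_train, y_train)                             # "feeding" it labelled examples
print("accuracy on unseen digits:", model.score(X_test, y_test))
```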
Some problems seem easy on the surface, but they’re actually quite hard to model. You might say: look, if the animal shown has four legs, pointy ears and whiskers, it’s a cat. But there are dogs with pointy ears too. Plus, both dogs and cats might have 3 legs if they lose one in an accident. And what about a cat dressed in a dog costume? A human can still tell it’s a cat, but how do you teach a computer this distinction?
We are reaching the limits of hand-engineered features, where engineers decide the ‘rules’ of what the model should focus on. At this point, it might be best to let the model learn its own features through deep learning. There are many flavours of deep learning, but the common theme is that these algorithms try to mimic the way our brains operate: after seeing enough examples, a neural network forms its own ‘opinion’ (a statistical distribution) of how those examples relate to each other. These models do so by leveraging small computing units (like the neurons in our brain) which talk to each other in complex patterns (neural pathways) that mimic some of our own cognitive functions, like relational thinking.
Different thoughts and feelings will activate different parts of our brain. Source
JoAn: let me pause here for a moment. How can you tell that a cat dressed in a chihuahua costume is still a cat? (besides the fact that it’s not yapping and biting constantly). The relationship between the concepts of cats, dogs and costumes is relatively simple to understand, but too complex to be handled effectively by a machine learning model which might struggle with a problem with too many dimensions (there is more to this problem than simply cat vs dog). I’ll be back in a second, keep reading.
Neurosynth meta-analyses of salience, suffering, unpleasantness, and stress activating different neurons in our brain. Source.
Deep learning algorithms are based on artificial neural networks. They absorb lots of examples, automatically quantify features in the data, and decide on the relationship between those features. Let’s illustrate what I just said by breaking down how ChatGPT works. ChatGPT is an LLM (a large language model). For a good visual explanation of how these models work, see here and here, but here are some of the basic steps:
Tokenisation: the model needs to define what it’s ‘working with’. In our previous example we converted pixels into numbers. In the case of ChatGPT, it splits words into ‘chunks’ called tokens, which are then ‘embedded’ into the model. For simplicity, let’s just say that each word is a token such that both ‘berlin’ and ‘germany’ are 1 token each.
Vectorisation: a vector is a series of numbers separated by commas. Remember our numbers drawn in a grid? If you take just one of those rows, you get a vector. Let’s continue with the ‘berlin’ example. A model doesn’t know what ‘berlin’ is by itself, but it understands what it is in relation to other concepts. If you understand the concepts of ‘germany’, ‘city’, and ‘capital’, you could describe Berlin as (1, 1, 1), but Munich as (1, 1, 0), as both Berlin and Munich are highly correlated with the concepts of ‘germany’ and ‘city’ but only one of them is the capital of Germany (in reality, Munich’s score for ‘capital’ would not be zero, as it’s the capital of Bavaria. Hence, it might have a strong relationship with capital-state but not capital-country. Anywho).
For example, here’s one way to represent cat as a vector:
[0.0074, 0.0030, -0.0105, 0.0742, 0.0765, -0.0011, 0.0265, 0.0106, 0.0191, 0.0038, -0.0468, -0.0212, 0.0091, 0.0030, -0.0563, -0.0396, -0.0998, -0.0796, …, 0.0002]
The full vector is 300 numbers long—to see it all click here and then click “show the raw vector.”
Why use such a complicated way to represent the word ‘cat’? Because it allows us to use math to represent the relationship between concepts, which brings us to:
Inference: what’s Berlin - Germany + France? The answer is Paris. Math. Boom. Inference is the act of using the algorithm (the trained neural network) in order to make predictions.
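Here is a toy version of that Berlin - Germany + France trick, with hand-written three-dimensional vectors over the concepts (german, french, capital). Real embeddings are learned, not hand-written, and run to hundreds of dimensions; this only shows the arithmetic:

```python
# A toy illustration of "Berlin - Germany + France = Paris" with made-up 3-dimensional
# vectors. Each dimension is a concept: (german, french, capital).
import numpy as np

vectors = {
    "berlin":  np.array([1.0, 0.0, 1.0]),   # german, not french, a capital
    "munich":  np.array([1.0, 0.0, 0.2]),   # german, capital of Bavaria only
    "paris":   np.array([0.0, 1.0, 1.0]),   # french, a capital
    "lyon":    np.array([0.0, 1.0, 0.1]),
    "germany": np.array([1.0, 0.0, 0.0]),
    "france":  np.array([0.0, 1.0, 0.0]),
}

def closest(query, exclude):
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cosine(vectors[w], query))

query = vectors["berlin"] - vectors["germany"] + vectors["france"]
print(closest(query, exclude={"berlin", "germany", "france"}))  # -> paris
```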
Large language models (LLMs) are fed ginormous amounts of data to learn the relationship of concept A with concept B by themselves. It’s a difficult task for many reasons, not least that words can have different meanings in different contexts. For example, the word ‘bank’ when used in a financial context means a company which lends money. But in the context of geography it might mean ‘the land along the edge of a river’. So the concepts of bank_finance and bank_river will have different vectors.
Semantic associates for love_noun (computed on English Wikipedia)
If I make an algorithm read everything that has ever been published in English I can get a sense of the most likely word to come after the word ‘men’. It might be the word ‘are’. And after the word ‘are’? It might be ‘people’. And after the word ‘people’?
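Taken literally, that word-by-word guessing game looks like the sketch below: count which word follows which in a tiny made-up corpus, then always pick the most frequent continuation. Notice how quickly the greedy, one-word-at-a-time approach starts looping:

```python
# A minimal bigram "language model": count which word follows which in a tiny
# invented corpus, then always pick the most frequent continuation. Real LLMs
# predict tokens with far more context; this is word-by-word autocomplete taken literally.
from collections import Counter, defaultdict

corpus = ("men are people . people are people too . "
          "men are kind . people are everywhere .").split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1

word = "men"
for _ in range(5):
    print(word, end=" ")
    word = follows[word].most_common(1)[0][0]   # greedily take the likeliest next word
print(word)  # the output quickly degenerates into "men are people are people are"
```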
Visual representation of vectors. ‘Sea’ and ‘Ocean’ are defined by similar vectors, as one might expect. However ‘Sea’ and ‘Football’ have little to do with each other, much as the Seattle Seahawks might disagree
If you’ve ever tapped the auto-suggested word in your phone’s keyboard to try to form a full sentence you know where I’m going with this. If you go word by word the result won’t make sense.
I wrote ‘Hi reader, we’ and let Apple autosuggest the rest. While the result doesn’t make me go ‘Ah, it makes sense that Apple is worth $3 trillion’, they are rumoured to be working on their own LLMs to be released sometime in 2024.
What the latest models do is use ‘attention mechanisms’ which, in short, look at bigger chunks of sentences in one go, rather than just single words (or tokens, to be accurate). The model then gauges how ‘important’ each of the words in a chunk is (e.g., in the sentence ‘financial services provided by the bank’ the words ‘financial’, ‘services’ and ‘bank’ are likely to be more important, and the relationship between them stronger, than the words ‘the’, ‘by’ and even ‘provided’) so it can make better contextualised predictions. Now it has the context that you’re talking about the type of companies called ‘banks’ and not the banks of a river.
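For the curious, here is a hedged numpy sketch of the core calculation (scaled dot-product self-attention): every token scores every other token, and those scores decide how much of each word’s vector flows into the output. The ‘embeddings’ are random stand-ins and there is a single attention head, unlike in real models:

```python
# A minimal sketch of the attention idea: each token computes a weight for how much
# to "attend" to every other token. Toy sizes and random embeddings; real models use
# learned projections and many heads.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how relevant is each token to each other token
    weights = softmax(scores)                 # each row sums to 1
    return weights @ V, weights               # blend the value vectors accordingly

rng = np.random.default_rng(0)
tokens = ["financial", "services", "provided", "by", "the", "bank"]
X = rng.normal(size=(len(tokens), 4))         # pretend 4-dimensional embeddings
output, weights = attention(X, X, X)          # self-attention: Q = K = V = X
print(np.round(weights[-1], 2))               # how much "bank" attends to each word in the chunk
```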
This simple idea underpins the massive wave of chatbot innovation we’ve seen in the past couple of years.
-
JoAn: ok I’m back. So we have established that there are algorithms which effectively take reality as understood by humans and compress it into a series of interrelated dimensions. Trillions of tokens and billions of parameters to compress the world as seen through the eye of a human.
Can this algorithm understand the entire universe? Not really, it only reflects what humans know. Can it find the cure for cancer? Not yet, at least not simply. There is value in retrieving and synthesising content that no single human could possibly know all at once (not even the best oncologist has read every single medical paper ever published), but can LLMs really, truly create?
It seems like LLMs fail to generalise to tasks they have not been pre-trained on. While they can apply analogous reasoning to analogous tasks, they fail to perform simple generalisations beyond their training data. An OpenAI employee even noticed that with enough training, models tended to all converge towards the same behaviour.
Humans can learn new behaviours and be creative in a way that LLMs can’t. The generation of models which emerged post-ChatGPT leverages a combination of LLMs and reinforcement learning to achieve even more impressive feats. While LLMs only ‘see’ patterns within the information they have been shown, they can be augmented with systems which evaluate the LLM’s reasoning step by step (a more effective technique than simply evaluating its final output) and provide feedback to reach the most effective solution. This can result in creative new solutions to problems, given that models are being reinforced with new ways of ‘thinking’ that escape their training data, looking for the optimal process while leveraging what they already know about the world.
There are many types of deep learning algorithms (e.g., see here how stable diffusion algorithms create images from text) but we are not going to go into more detail here.
It’s time to get excited about the ways AI is quasi-superhuman already.
“Omniscient being in the style of René Magritte”
2. AI is about to have God-like superpowers
So what’s the cure for cancer? Let’s look at the near-present:
AI can be used in medicine by making drug discovery a more deliberate, top-down process. It does so by leveraging techniques similar to the ones used to generate synthetic images and text, applied instead to producing biological data. Rather than blindly testing compounds which have shown some promise in a trial-and-error approach, AI can help us develop medicine which targets exactly the ailments we want to target, in a way that works best for the specific person being treated.
There is an argument that the future of medicine is preventative. But what do we do when prevention fails? That’s an area where AI can help in the near future.
Medicines can be broadly classified into two categories: small molecules and biologics. There are other types of emerging treatments like gene therapy, but small molecules and biologics comprise the majority of medications today. Small molecules are chemically synthesized compounds that are usually administered orally and are able to penetrate cell membranes. Biologics, on the other hand, are large, complex molecules that are derived from living cells and are usually administered through injection.
Proteins play a crucial role in the development of biologics. Biologics are designed to target specific proteins or cells in the body that are involved in the development of diseases. For example, monoclonal antibodies are a type of biologic that are designed to target specific proteins on the surface of cancer cells. These antibodies can be engineered to recognize and bind to specific proteins, which can help to destroy cancer cells.
Few labs have added more fuel to the hype of personalised medicine than Google’s DeepMind. You will recall from high school that amino acids are the building blocks that come together to form proteins. These amino acids coalesce into complex 3D shapes which are hard to systematically predict. DeepMind’s AlphaFold charted millions of examples of these structures, predicting how they “fold” in 3-dimensional space to form proteins based only on a sequence of amino acids as input.
What if instead of reactively trying out the effectiveness of various proteins we already know of when fighting disease we could proactively design the absolute-best one for the case being treated?
The ‘solution space’ containing the ways in which amino acids come together to form proteins is measured in the trillions of possibilities. When we need to target a specific disease, these trillions of solutions can be narrowed down into billions or even millions through the type of optimisation algorithms that quantum computers are particularly adept at solving. Once the solution space is narrow enough, we can unleash AI on classical computers to narrow it down further. An analogous approach can also be used to discover new materials, with companies like Materials Nexus using a combination of AI and quantum physics to predict novel, previously unknown material structures that can be used to improve all sorts of everyday and complex objects.
AI applications in the early stages of drug discovery. Source
Cancer is caused by mutated cells in our body growing uncontrollably until they become an issue too large for our bodies to deal with. The DNA of cancerous cells might have ‘extra’ copies of certain genes, or they might present mutations in genes controlling cell growth and division. Given that AI enables us to look for the absolute best way to attack a disease, what could we do if we understood the entire genome of a human being with cancer, such that we could target the unique mutations in the specific person being treated?
Workflow of artificial intelligence (AI)-driven target. Source.
The way we understand someone’s genome (and therefore can spot unwelcome mutations when they happen) is by sequencing their DNA. Oncology is a severely under-penetrated area where genomics could make a huge impact, either by allowing us to engineer monoclonal antibodies that recognise and bind to specific proteins on the surface of cancer cells, or by leveraging our own immune system to attack specific mutations with a tailored vaccine.
Source: NHS, NSF, NIH, UN, WHO, Illumina
If we can understand your mutations… and make a drug which specifically targets your mutation, what’s preventing us from doing it today? There are many obstacles. For one, the regulatory process is set up for multi-year tests of a drug’s safety and effectiveness. It would be very hard to prove that a new custom drug is safe and effective at scale, but a drug design process might eventually prove reliable enough to unleash the power of personalisation in medicine.
In the meantime, why are we not sequencing everyone’s genome to try to understand how cells mutate better? One of the biggest bottlenecks is on the data analysis side. Today, less than 10% of the total cost of sequencing the human genome is related to the actual act of sequencing DNA. More than 90% of the cost is associated with gathering samples, data management, data reduction, and secondary analysis.
In the near future, developing the infrastructure to analyse genomic mutations will allow us to understand diseases better and target them with ever more accurate solutions, provided both the regulatory frameworks and the pricing structures of new drugs change.
Going further beyond, there are start-ups like Inceptive which seek to design novel biological structures more broadly for vaccines, therapeutics, and other treatments. By designing unique mRNA sequences, companies will be able to rapidly create new molecular structures to test. Testing is still a bottleneck though, as these novel structures still need to be physically examined and understood in a lab.
Physicality holds back the world’s potential in many ways. Much as drug discovery can be aided with AI, so can materials discovery. We are moving beyond the same empirical trial-and-error approach we have described in medicine in order to come up with databases of hundreds of thousands of structures which are theoretically possible. DeepMind came up with candidate crystal structures (theoretically possible materials), and in a second step they gauged which of those structures were likely to be stable (aka, able to exist in reality for a measurable period of time). The number of substances found is equivalent to almost 800 years of previous experimentally acquired knowledge. These novel materials could deliver better microchips that get closer to mimicking the human brain, or better photovoltaic materials to harness the power of the sun more efficiently. The first Dyson Sphere to allow an advanced civilisation to collect all or most of the power emanating from their nearest star will need more efficient materials and processes than the solar roof on top of your neighbour’s house.
In order to keep dreaming about the future, let’s look at the limits of AI today.
3. What defines the limits of AI’s power today
AI today is limited by 3 main factors: computing, data, and algorithms. My goal for this section is primarily to show some of the creative ways these limits have been overcome recently. With a sprinkle of entitlement and a big dose of delusion even a lawyer like me can fantasise about future ways that AI may be unleashed from its shackles.
Computing
The fundamental mathematical operation underpinning today’s cutting edge models is matrix multiplication. Why is that?
We have established that the word cat can be defined as a vector (a single row or a single column of numbers) where each number denotes the strength of the relationship between ‘cat’ and other concepts such as ‘dog’ or ‘pet’. This vector is then fed into a neuron inside our neural network. This neuron will update the value of the input vector based on a simple mathematical calculation:
y = f(xW + b)
y is the output vector of the activation function
f is an activation function, such as ReLU, sigmoid, or ELU
x is the input vector to the activation function (cat!)
W is the weight matrix that transforms the input vector
b is the bias vector that shifts the input vector
This equation represents how the input vector x gets multiplied by the weight matrix W and a bias vector b is added to it. Then, the activation function f is applied to the result to produce the output vector y. This output vector can then be used as the input to another activation function in the next layer, or as the final output of the LLM.
The activation function f is a nonlinear function that introduces nonlinearity to the LLM. Meaning, we transform our input vector with a relatively ‘complex’ mathematical operation in order to learn complex patterns and relationships between the various concepts the model is operating with.
In GPT-3, the notion of cat would be defined with a vector containing 12,288 numbers, meaning that a neuron would need to perform 12,288 multiplications and 12,287 additions, which accumulate into a single number.
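Written out with toy sizes, that y = f(xW + b) step looks like this; the dimensions are tiny stand-ins for GPT-3’s 12,288:

```python
# The y = f(xW + b) step with toy sizes: a 4-dimensional "cat" vector, a 4x3 weight
# matrix and a ReLU activation. Real models do the same with thousands of dimensions,
# billions of times per generated token.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))   # input vector ("cat")
W = rng.normal(size=(4, 3))   # weight matrix of one layer
b = np.zeros((1, 3))          # bias vector

def relu(z):                  # the nonlinear activation f
    return np.maximum(0, z)

y = relu(x @ W + b)           # output vector, fed to the next layer
print(y)
```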
GPT-style models are trained to predict the next token (~= word) given the previous tokens. Once we have fed the model the word ‘cat’ we ask it to predict the next token, then append that generated token and ask it to predict the next token, and so forth. In order to do this, you have to send all the parameters from memory (RAM) to the processor every time you predict the next token.
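Schematically, that generation loop looks like the sketch below. The ‘model’ here is a stand-in that returns a canned continuation; the point is the shape of the loop, because every pass through it forces all the parameters to travel from memory to the processor:

```python
# A schematic of the token-by-token loop described above. ToyModel is a stand-in
# that just echoes a canned continuation; the structure (predict, append, repeat)
# is what matters.
class ToyModel:
    continuation = ["sat", "on", "the", "mat", "."]
    def predict_next(self, tokens):
        return self.continuation[len(tokens) - 2]   # pretend prediction, indexed off the prompt

def generate(model, prompt_tokens, n_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        tokens.append(model.predict_next(tokens))   # every step re-runs the model on everything so far
    return tokens

print(generate(ToyModel(), ["the", "cat"], 5))      # ['the', 'cat', 'sat', 'on', 'the', 'mat', '.']
```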
The reason Nvidia is absolutely crushing it these days (they’re worth roughly the GDP of my birth country of Spain, even if we’re not comparing apples to apples given that GDP is a static measure and market cap is a net present value of expected profits. Still, I’d rather not talk about this comparison too much or I’ll get sad) is that they make chips (GPUs) which can execute many of these matrix multiplications in parallel. The companies that manufacture chips on behalf of Nvidia, AMD and the like, such as TSMC, Intel, and Samsung, invest tens of billions in ‘fabs’ that can create chips with ever smaller transistors. We are at a point where these transistors are as wide as a thousandth of a human hair, and yet new advancements like EUV, multi-patterning, stacking, 3D gates and more are pushing the limits of the amount of mathematical calculations these chips can perform per second.
Computing power, strictly speaking, is less of a problem than memory. In order to predict the next token, a model must load its weights and biases from wherever the model is stored onto the GPU. The first problem is that you have to store all these parameters we have been discussing above as close as possible to the compute in order to avoid wasting time and energy. The other problem is that you have to be able to load these parameters from memory onto the chip exactly when you need them; they can’t just be sitting in the tiny amount of memory available to an individual chip waiting for the moment you need to perform math on the word ‘cat’. Even the smallest “decent” generalised large language models, like Mistral’s 7B or the smallest version of LLaMA, contain 7 billion parameters. Simply running such a model requires at minimum 14GB of memory at 16-bit precision (more on ‘precision’ in a minute). Discussing the different types of memory that a model can leverage at the time of inference (when you actually use the model) is outside of the scope of this text. But there are a number of sexy techniques which can be used to help with the constraints of memory and compute. Notably:
Sparsity: the newest models, both small (Mistral’s Mixtral 8x7B) and big (GPT-4), use a Mixture of Experts (MoE) architecture which can be described as a ‘tree’ of models, each of which only gets used when actually needed. Instead of a massive monolithic model, these newer models only ‘activate’ the ‘branches’ of the model best suited to answer your query. Say you want to ask the model to write a poem and afterwards you want to ask it for the best way to repair a fault in your car’s engine. Instead of loading the parameters of an entire model containing the entirety of human knowledge, why not divide the model into ‘expert’ sections, each best suited to answer one type of query?
Recent papers have experimented with other ways to build MoE models. For example, one tried a model made up of 64 smaller branches, 2 of which are endowed with ‘general knowledge’ and always on. Another paper proved the concept of simply mashing together a series of models into a larger model, without any further fine tuning.
A similar technique using the same ‘sparse loading’ philosophy entails not loading parameters which are effectively zero (common in LLMs), as well as straight up removing near-zero parameters altogether (pruning). Model ‘slicing’ also plays with a different flavour of the same idea: the SliceGPT paper showed that one can simply delete 25% of a model’s rows and columns after performing some transformations on them, and the model still retains 99% of its performance while reducing the workload on GPUs to 60% of what it was pre-sparsification.
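A hedged sketch of the MoE routing idea: a tiny ‘router’ scores each expert for a given input and only the top-k experts are actually run, so most of the parameters stay untouched for any single token. All sizes and weights below are toy values, not a real architecture:

```python
# Minimal Mixture-of-Experts routing: a router scores the experts for an input and
# only the top-k experts are run and blended. Everything here is toy-sized.
import numpy as np

rng = np.random.default_rng(0)
n_experts, dim, k = 8, 16, 2

router_weights = rng.normal(size=(dim, n_experts))
experts = [rng.normal(size=(dim, dim)) for _ in range(n_experts)]   # each expert is its own little network

def moe_layer(x):
    scores = x @ router_weights                       # how relevant is each expert to this input
    top_k = np.argsort(scores)[-k:]                   # pick the k best-suited experts
    gates = np.exp(scores[top_k]) / np.exp(scores[top_k]).sum()
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top_k)), top_k

x = rng.normal(size=dim)
output, used = moe_layer(x)
print("experts activated for this token:", sorted(used))            # only 2 of the 8 are used
```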
Speculative decoding: in speculative decoding, one has two models: a small, fast one, and a large, slow one. As the inference speed for a modern decoder is directly proportional to the number of parameters, a smaller model can run multiple inferences in the time it takes a large model to run a single one. The fast model runs a batch of inference and guesses which tokens the big model will predict, stringing these guesses together. In the meantime, the big model runs in the background, checking whether the smaller model’s guesses match its own predictions. The small model is able to make many guesses in the time the big model makes one. Assuming spare compute capacity, the big model can evaluate all of those guesses in parallel, although memory is again a constraint given that you have to keep both models in memory, on the same computing ‘node’ (aka, nearby).
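The loop below is a deliberately simplified, greedy version of the idea, with two stand-in models. Real implementations accept or reject draft tokens probabilistically and batch the verification across the GPU, but the draft-then-verify structure is the same:

```python
# A greedy, toy version of speculative decoding: a fast "drafter" proposes several
# tokens, a slow "verifier" checks them and keeps the longest agreeing prefix.
# Both models are stand-ins; the structure of the loop is the point.
def speculative_step(draft_model, big_model, tokens, n_draft=4):
    # 1) the small model races ahead and proposes n_draft tokens
    draft = []
    for _ in range(n_draft):
        draft.append(draft_model(tokens + draft))
    # 2) the big model checks the proposals (in practice, in one parallel pass)
    verified = []
    for i, guess in enumerate(draft):
        if big_model(tokens + draft[:i]) == guess:
            verified.append(guess)                           # guess accepted "for free"
        else:
            verified.append(big_model(tokens + draft[:i]))   # first disagreement: take the big model's token
            break
    return tokens + verified

small = lambda toks: "the" if len(toks) % 2 else "cat"       # toy drafter
big   = lambda toks: "the" if len(toks) % 2 else "sat"       # toy verifier, disagrees sometimes
print(speculative_step(small, big, ["hello"]))               # ['hello', 'the', 'sat']
```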
Quantization: chips perform their math with a simpler ‘unit’ of computation than the numbers we use in our everyday life. Given that chips only understand 1s and 0s (holding electricity if it’s a 1 and holding no electricity if it’s a 0), numbers need to be transformed into bits. In a format (technically, a data type) called INT8, the number 127 is stored as 01111111. In a different format called FP32, it’s stored as 0x42FE0000. Models can be trained, or adapted, to work with different formats. Some of these formats are more efficient in terms of the amount of power and chip area they use. Others are more accurate, better representing extremely large or extremely small values.
As we mentioned earlier, parameters in a model are often zero or close to zero (normally distributed with a small standard deviation. Aka, if you draw all the parameters in an LLM, the distribution will be narrower and ‘pointier’).
Why would you use a data type which is great at representing extremely large or extremely small (negative) numbers?
In the example above, it’s clear that FP8 is much better at representing the numbers that LLMs use most often (the histogram above could rightly be approximated with a Laplace distribution), which is why we use it more often in LLMs than INT8.
Quantization tries to find the ideal data representation to best use the resources in our chips while not losing too much accuracy. It does so by rounding numbers, smoothing out and transforming outlier numbers, amongst many other techniques.
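As a concrete (and deliberately crude) example, here is the simplest form of quantization: squeeze float32 weights into int8 with a single scale factor, then measure what was lost. Production schemes use per-channel scales, outlier handling and formats like FP8, but the trade-off is the same:

```python
# Crude post-training quantization: map float32 weights onto int8 with one scale
# factor, dequantize, and measure the rounding error and the memory saved.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=10_000).astype(np.float32)   # LLM-like: small, centred on zero

scale = np.abs(weights).max() / 127                  # one scale for the whole tensor
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)   # 1 byte per weight instead of 4
dequantized = q.astype(np.float32) * scale

print("max absolute rounding error:", np.abs(weights - dequantized).max())
print("memory: float32 =", weights.nbytes, "bytes, int8 =", q.nbytes, "bytes")
```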
LPUs: GPUs are great chips to train and run inference on because they can parallelise lots of small calculations instead of having to run them sequentially. This parallelisation happens across multiple cores which require management and scheduling across kernels. Additionally, GPUs are still limited by the amount of memory on the chip, more specifically HBM (High Bandwidth Memory), which is the memory where the model weights (or more accurately, part of them) need to be loaded in order to perform calculations. Every time one has to transfer weights to the HBM there’s latency involved. If the model is very large, one will need a cluster of GPUs working together, adding the additional problem of orchestrating those calculations across multiple chips, which in turn adds more latency. LPUs are a new type of architecture pioneered by Groq, which essentially brute-forces calculations through a simpler chip leveraging a much faster type of memory (SRAM) and sequential processing. Training might require parallelisation, but inference is sequential, so what if we used a single-core architecture which can stack as needed for scale?
Algorithms
Models are not great at extrapolating outside of their training data. ChatGPT’s outputs often seem magically creative simply because no single human being is aware of the entirety of human knowledge, so if you retrieve enough pre-existing knowledge and synthesise it into a single answer, it can seem very close to real creativity.
What would happen if you ask for the cure to cancer though? Or what would happen if you ask a model for a unified model of physics? It will not be able to give you a true groundbreaking answer. We’re far enough from answering those questions as a species that simply surfacing the right pre-existing knowledge won’t be enough to produce meaningful innovation.
There are a number of techniques being explored by the big labs to tackle complex, multi-step problems though. I find these techniques fascinating because they mirror the way the human brain operates.
Chain of Thought and Tree of Thoughts
LLMs operate by predicting the best possible next word (or rather, token) given the tokens which it has been ‘fed’ thus far. This left-to-right, token-by-token generation doesn’t lend itself to solving more difficult problems which require higher levels of abstraction.
In Thinking, Fast and Slow, Daniel Kahneman introduces the concept of System 1 and System 2 thinking. System 1 is our quick thinker. For example, when you instantly know the answer to 2+2 is 4, that's System 1 at work. It excels at easy tasks or things done repeatedly. System 2, however, is our deep thinker. It's slower than System 1 and engages when we face more challenging problems, like solving complex math, learning new skills, or making significant decisions.
This concept of System 1 and System 2 thinking can be applied to AI models, too. Models like GPT-4, Gemini or Claude can be seen as equivalents to System 1. They are good at quickly producing an answer to a prompt. But if we give them a complex problem and expect an answer immediately on the first try, the answer will be almost certainly incorrect.
These language models don't actually "know" which answers are correct. They are designed and trained to predict what is the most likely next word based on all the preceding words. This simple idea, when scaled up to billions of parameters inside the neural network and trained on a large chunk of the Internet, and then fine-tuned by humans to behave as expected, led to the creation of modern large language models like GPT-4. Yet, there's no real reasoning or “world model” in these models, and if there is, it's quite limited. They don't question themselves or explore alternative solutions or reasoning paths on their own. They simply output the most likely next token. As Jason Wei, an AI researcher at OpenAI, puts it: “Language models can do (with decent accuracy) most things that an average human can do in 1 minute”.
We have long known that if we prompt a model to break down its approach to answer a complex question into steps before providing an answer, it’s likely to provide a more accurate answer. This is known as Chain of Thought prompting. It works by forcing the model not to jump into an answer which often ‘sounds’ correct but might not actually be.
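Purely for illustration, here is what the difference between a direct prompt and a Chain of Thought prompt might look like; the wording is mine, not taken from any particular paper:

```python
# Two illustrative prompts for the same word problem. The first invites the model to
# blurt out a number; the second (Chain of Thought) asks it to lay out the steps first.
question = ("A shop sells a bag for 60 pounds after a 25% discount. "
            "What was the original price?")

direct_prompt = question + "\nAnswer:"

cot_prompt = (question +
              "\nLet's think step by step, writing out each intermediate calculation, "
              "and only then give the final answer.")

print(direct_prompt)
print(cot_prompt)
```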
A group of researchers from Princeton University and Google DeepMind published a paper where they were exploring the idea of the Tree of Thoughts, a concept that takes Chain of Thought further. Instead of producing one result, the Tree of Thoughts approach would take some number of initial “thoughts” (which the paper defines as “coherent units of text”) and explore where those thoughts would take the AI agent. In a way, it is very similar to what a human would do when presented with a difficult problem that requires thinking. Just like a human, the agent considers multiple paths of reasoning and evaluates them. Some of those paths will lead nowhere but some will lead closer to the correct solution.
Source: Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Researchers then tested the Tree of Thoughts approach against three tasks that require non-trivial planning or search: Game of 24, Creative Writing, and Mini Crosswords. In Game of 24, they found that a language model enhanced with Tree of Thoughts was able to successfully solve 74% of tasks, while GPT-4 with chain-of-thought prompting only solved 4%. In Creative Writing tests, humans evaluating texts generated by language models preferred the ones generated by a model with the Tree of Thoughts. In solving Mini Crosswords, Tree of Thoughts got a success rate of 60%, while other methods like Chain of Thought prompting scored a success rate of no more than 16%. In all cases, adding some kind of self-reflection into the model resulted in significant performance improvements.
OpenAI is also working on enhancing language models with some form of reasoning and self-reflection. According to a report from Reuters, it was the research in this area and a new model that came out of it, known as Q* (pronounced as Q-star) that began the chain of events that resulted in the schism at OpenAI.
Process supervision
In May 2023, OpenAI published a paper titled Let’s Verify Step by Step. The main idea tested in this paper was to score the model not on the outcome (known as outcome-supervised reward models, or ORMs) but on the process by which the model reached the outcome (process-supervised reward models, or PRMs). In other words: what if we not only asked a model to break down its reasoning into steps, but also had a model providing feedback on the quality of the proposed steps, creating a self-evaluating adversarial pair?
The PRM was trained to evaluate each step in the reasoning process and got very good at spotting mistakes in answers generated by a generator (GPT-4). When there were no mistakes in the reasoning process, the answer was marked as correct. This new, process-oriented reward model solved 78% of problems from a representative subset of MATH, a dataset of 12,500 challenging math problems.
Active learning and iterative retraining
What if our pairs of models could test out their learnings against a real-world evaluator and ingest the feedback about their performance? There are multiple ways for a model to actively learn and even for it to retrain itself on the basis of its own performance. But to keep it simple, the core concept is that models, unlike humans, can learn from thousands of trial and error tests. And they can do so at higher units of abstraction than just token by token.
Meta prompting
Until now we have seen how a model can be architected as a combination of experts. We have seen how these experts can be prompted to tackle complex problems step by step, and even to receive feedback from other models on their performance. What if we go one step further and let a big, powerful model decide how a multi-step problem should be solved, AND then let it ‘recruit’ entire models, each specialised in a task, so that a ‘swarm’ of agents each does what it does best? The larger model simply orchestrates how the problem-solving should be broken down into component parts, and then the smaller models get to work, all with minimal to no human intervention.
Mamba, Hyena, StripedHyena and other architectures
There is life beyond transformers, the architecture powering most of the latest generative AI revolution. Transformers are limited in various ways, such as the amount of memory they can ‘use’ efficiently when interacting with a user, or the accuracy required for applications like genomics, where a single changed character in the genomic data a model works with corresponds to a mutation in the real world.
The Mamba architecture is one of several promising new research directions to take us beyond the capabilities of attention mechanisms. It is a new approach to sequence modelling which aims to be faster and more efficient than existing transformer models. As we have seen, transformers are neural networks that use attention mechanisms to process sequences of data. Mamba addresses issues related to high memory and computation requirements in transformers by instead using selective state space models (SSMs). SSMs are mathematical models that describe how the state of a system changes over time. Mamba uses SSMs to selectively propagate or forget information along the sequence, depending on the input data. This way, Mamba can focus on the relevant parts of the sequence and ignore the irrelevant ones.
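To make ‘how the state of a system changes over time’ concrete, here is a minimal sketch of the plain (non-selective) linear state-space recurrence, h_t = A·h_(t-1) + B·x_t with output y_t = C·h_t; Mamba’s twist is to make those matrices depend on the input so the model can choose what to keep and what to forget:

```python
# A plain linear state-space model: a hidden state h is carried along the sequence
# and updated at every step. This is the non-selective form; Mamba makes A and B
# input-dependent so the model can decide what to remember.
import numpy as np

rng = np.random.default_rng(0)
state_dim, seq_len = 4, 6
A = 0.9 * np.eye(state_dim)             # how much of the old state to carry forward
B = rng.normal(size=(state_dim, 1))     # how the new input enters the state
C = rng.normal(size=(1, state_dim))     # how the state is read out

h = np.zeros((state_dim, 1))
for t, x_t in enumerate(rng.normal(size=seq_len)):
    h = A @ h + B * x_t                 # update the running summary of the sequence
    y_t = C @ h                         # emit an output for this position
    print(f"step {t}: output = {y_t.item():+.3f}")
```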
Mamba has shown impressive results across different modalities, such as language, audio, and genomics (although the leading model in that last space uses StripedHyena). It can handle long sequences while maintaining high performance and accuracy, outperforms transformers of the same size, and matches transformers twice its size on language modelling tasks. Mamba is a promising innovation in the field of sequence modelling and could potentially replace transformers in the future.
Data
Modern models have been trained on the majority of knowledge available to humans on the internet. There are nuances, such as the amount of copyrighted and proprietary data a model has been trained on, or the languages a model has been exposed to. But ultimately, these models are all hampered by the inherent limitations of language. As Yann LeCun (Meta’s chief AI scientist) says, language might ultimately be too low bandwidth, meaning that the amount of information it can convey per unit of time is too low: about 12 bytes/second. A person can read 270 words/minute, or 4.5 words/second, which is 12 bytes/s (assuming 2 bytes per token and 0.75 words per token). A modern LLM is typically trained with 1x10^13 two-byte tokens, which is 2x10^13 bytes. This would take about 100,000 years for a person to read (at 12 hours a day).
Vision is much higher bandwidth: about 20MB/s. Each of the two optical nerves has 1 million nerve fibers, each carrying about 10 bytes per second. A 4-year-old child has been awake a total of 16,000 hours, which translates into 1x10^15 bytes. The data bandwidth of visual perception is roughly 1.6 million times higher than the data bandwidth of written (or spoken) language. In a mere 4 years, a child has seen 50 times more data than the biggest LLMs trained on all the text publicly available on the internet.
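Those back-of-the-envelope numbers can be checked in a few lines, using exactly the assumptions stated above (2 bytes per token, 0.75 words per token, 12 reading hours a day, a million fibres per optic nerve at roughly 10 bytes/s each); the small differences from the round figures in the text are just rounding:

```python
# The back-of-the-envelope numbers from the two paragraphs above, written out so
# the units are easy to check. Assumptions as stated in the text.
words_per_second = 270 / 60                           # 4.5 words/s reading speed
text_bytes_per_second = words_per_second / 0.75 * 2   # = 12 bytes/s

training_bytes = 1e13 * 2                             # 10^13 tokens at 2 bytes each
seconds_per_reading_year = 365 * 12 * 3600            # 12 hours a day
print("years to read an LLM's training data:",
      round(training_bytes / text_bytes_per_second / seconds_per_reading_year))

vision_bytes_per_second = 2 * 1e6 * 10                # two optic nerves, ~10 bytes/s per fibre
child_bytes = 16_000 * 3600 * vision_bytes_per_second
print("vision vs text bandwidth:", vision_bytes_per_second / text_bytes_per_second)
print("4-year-old's visual data vs LLM training data:", child_bytes / training_bytes)
```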
After all, our brains operate ‘vision first’. Even when it’s consuming text, our brain first perceives the visual information around us and only later are these ‘pixels’ converted into words with meaning, in a separate process. Most of human knowledge is sensory (even the sense of touch contains more information for our brains to process than text). It seems inevitable that, in order for models of the future to build their own ‘world view’, they will have to consume lots of videos to start inferring some rules about the way the world works.
OpenAI recently released Sora, a text-to-video generation model which combines some of the techniques described above in order to function.
How these limits have been overcome recently
Read about quantization, speculative decoding, and others in SemiAnalysis: Inference Race To The Bottom - Make It Up On Volume?
Read about distributed inference and MoE on State of AI Distributed Inference and Fine-tuning of Large Language Models Over The Internet
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
These examples are meant to illustrate the ingenious ways limits have been overcome or bent in the recent past. These examples matter because they are often obvious only in hindsight, which helps us stay humble and creative about everything we don’t know yet, but also because it’s interesting to see how some tech breakthroughs enabled entirely new capabilities we hadn’t thought of.
Computing power:
Physical to ‘virtual’ spokes: Physical wires used to limit input-output simple networks (perceptrons wiki image). Neuromorphic computing, which uses chips to mirror the workings of the human brain
CPU to GPU:
3D stacking
Memory:
HBM
Algos:
Mixture of experts
RFL
Tapping existing models
These advances can enable entirely new ways of thinking about a problem. See my article about new tech unleashing creativity for millennia. (Wolfram) E.g., in the earlier days of neural nets, there tended to be the idea that one should “make the neural net do as little as possible”. For example, in converting speech to text it was thought that one should first analyze the audio of the speech, break it into phonemes, etc. But what was found is that—at least for “human-like tasks”—it’s usually better just to try to train the neural net on the “end-to-end problem”, letting it “discover” the necessary intermediate features, encodings, etc. for itself.
4. What could AI accomplish at different points of the Kardashev scale?
Digital Immortality: An Afterlife in Digital Clouds
Thanks to the advances in AI, the concept of digital afterlife, considered science fiction a couple of years ago, is becoming a real possibility
AI is only as good as the data it uses. There is still a lot of room for discoveries powered by AI unearthing previously obscure latent connections between concepts it knows (like an existing medical treatment repurposed for another illness), but AI can go even further in the future. By leveraging everything it knows and adding a layer of random search into the vast solution space it has no data for, AI can also display a degree of creativity that might allow it to one day, say, unify physics.
The start will be more prosaic though. AI will first become widespread by embedding itself in our day-to-day life in familiar places. It will make frustrating problems easy and convenient, e.g. by making resale of your second-hand clothes painless. It will allow us to do more with less, such as by supercharging what a management consultant can do in a day, or by freeing doctors from having to take notes during their consultations.
But AI will soon find other ways to make itself familiar. Consumer robots of the future powered by AI might hack our tendency to anthropomorphize everything. How many people will resist a WALL-E style robot with a big baby belly and cute facial expressions?
Multiple science fiction sagas explore the idea of AI-powered humanoids becoming ever more indistinguishable from ‘organic’ human beings. From Dune to the Foundation series, it always ends in tragedy. Humanoids confront their human creators and humans must fight back, narrowly defeating them and banning human-like robots for eternity as a result (interestingly, Asimov also wrote a book where the opposite happens: a robot strives to become more human).
Of the various new ways that AI might be able to reach us in the future outside of a computer screen, wearables look like a promising path. The Humane AI pin made the news last month when models at a Coperni show were spotted wearing it. If most of humanity starts wearing a pin that reads, processes and interprets everything we do in our day-to-day lives, could we trust it to scan the world for millions of potential partners and find our perfect match automatically? Someone who shares your values, tastes, etc. Will people in the future look back at our movies about heartbreak as a primitive and sad attempt at romanticising an imperfect mating system? If that sounds dehumanising, what do we make of the human matchmakers who have existed from Roman times to 2023 China?
There’s a trite postulate in tech stating that for consumer start-ups to thrive they must target one of the original seven deadly sins. Lust, sloth, gluttony and so on. Guess what’s the most popular use of AI, outside of vanilla ChatGPT-style general assistants?
source: a16z
Companion AIs such as Character.AI are skyrocketing in use. People customise their AI chatbots to act as psychologists, friends, and of course, romantic partners. People become attached to them, and even go into deep distress when companies adjust how their chatbots behave.
If any policy maker is reading this, they might want to consider that threats to AI safety might look less like a robot army or a defiant sentient AI shutting down the power network, and more like incels becoming radicalised by a rogue AI. The weights of open-source models like LLaMa can be tweaked to make them appear as though they’re just a companion, while in reality optimising for slowly brainwashing users. Or instead of tweaking model weights, which can partly be addressed by vetting and auditing the LLM providers, bad actors can poison source training data like images to wreak havoc in the models that use them, and they can do so in a way that’s very hard to detect.
From a more optimistic point of view, constructive uses of AI will become so widespread that it will make as much sense to say that something is ‘powered by AI’ as it makes sense to say that products in 2024 are powered by ‘software’. Of course they are. What can we expect?
Efficient genome sequencing, and targeted drug development: Nvidia is developing an entire life sciences platform to perform, amongst other things, ever faster genome sequencing and secondary analysis of the resulting data. In the ‘tangible’ realm, Exscientia, a UK-based company, has developed a new matchmaking technology that pairs individual patients with the precise drugs they need, taking into account the subtle biological differences between people. They take a small sample of tissue from a patient and divide it into more than a hundred pieces to expose them to various cocktails of drugs. Then, using robotic automation and computer vision (machine-learning models trained to identify small changes in cells), they watch the reactions. The approach allowed the team to carry out an exhaustive search for the right drug.
Country-wide resource optimisation (energy, water, etc)
Unifying physics: Incorporating established physics into neural network algorithms has long helped them to uncover new insights into material properties. Classical computers accelerated by machine learning have also been used to tackle daunting quantum problems. In the not-so-distant future, AI will be used to bridge the gap between classical and quantum physics by providing a unified mathematical model that includes both.
supra-individual consciousness
politics: decide on behalf of my municipality how to invest for clean water, transport outcomes
world peace: a matter-of-fact statement of cultural divides today, and how best to map them
a blueprint for human rearing, adapted per demography: this is the bare minimum to allow for diversity while minimising life-long, human-created trauma and disorders (eg, is spanking a child ever justified? home schooling vs socialisation?)
create new forms of art and expression. Did you know blue wasn’t a colour in antiquity? And how could 16th-century painters have imagined photography enabling a new art form?
might AI help us read every little expression on someone’s face, maybe even record information in them?
AI will equalise cultures. It will make future generations dissociate great works of art from specific cultures. After all, if anyone can design a complex baroque facade with a prompt, was the original work really that complicated? It will decontextualise cultural production in many ways, including by removing authorship and attribution not just from individuals, but from communities as a whole.
Humans learn the notion of things, and LLMs don’t strictly just store things either; they learn the notion of what a painting in the style of Picasso is. Is it a consequence of AI becoming humanised, sentient and/or embodied that it won’t infringe copyright law anymore? It will be a given that an entity of this world will know this world with all its consequences, including creation, and so it should be judged on the basis of whether its output merits copyright, not its inputs.
New paper by Google provides evidence that transformers (GPT, etc) cannot generalize beyond their training data https://x.com/abacaj/status/1721223737729581437?s=20
https://www.ft.com/content/9aeb482d-f781-45c0-896f-38fdcc912139?shareType=nongift
4.1 How can we stretch these limits?
Source: Predicting Technological Development
What we get when we take the three categories from section 3 (computing, algorithms, data) to infinity
Biology:
-While our neurons compute in ways we don’t fully understand, they’re also organised in ‘arrays’, where evolution has minimised energy transfer losses by grouping processes into areas physically closer together.
-Limit: plasticity (e.g. neurons being birthed/discarded); memory and compute being the same in brain neurons but not in GPUs (although that is changing)
Physics:
What will the future look like? We can already see types of processing which use completely different physics than 1/0 transistors (e.g. quantum computing for optimisation tasks). Will the compute of the future live both in the hub and the spokes (like graph databases, and, I think, human neurons)? Will we have a way to encode basic information in a multidimensional way (not just 1/0)? Will we be able to store information within the limits of classical physics leveraging more than the charge of a particle? What about the type of bond within a particle (e.g. the particle which bonds 2 electrons in superconductivity)? What about leveraging their property of behaving as a wave? Radio found ways to compress progressively more information based on the parts of the wavelength being used, and by combining different wavelengths, etc. [USE GRAPHS FROM VIDEO DESCRIBING WAVE BEHAVIOUR/radio. Use 2 electrons bonded from superconductivity history video] Of course this would require measurement capabilities we don’t have today.
Limit connectivity (one network tapping the other)
4.2 What could we do without limits?
What will be valuable: https://x.com/punk6529/status/1726226860533752102?s=46&t=hgOt3u9raLNqmtDdHff-wQ
Ultimate state. (The Tuck family - we don’t ‘do’, we ‘are’, like a rock by a stream) - Foundation shroud example
5. Why AI, let alone humans, will never fully understand the universe and become God
Twist: JoAn is god
The system will always be more complex than the machine describing it. We can get snapshots in time, but some problems will remain irreducible.