Generative AI – behind the scenes
What it takes to train and run a generative AI model
In previous newsletters, I wrote mostly about generative AI’s use cases. In this issue, let’s go behind the scenes to see what it takes to build and run a generative AI model.
Text-to-text generative AI models, like ChatGPT, are trained on a large corpus of text. But what text exactly are these models fed? An excellent analysis from the Washington Post looked at Google’s C4 dataset, which was used to train some large language models (LLMs). Here are some results:
The top websites in the dataset were patents.google.com, Wikipedia, and scribd.com. I’m not surprised by Wikipedia and patents, but Scribd, a digital library, did give me pause. I was even more surprised to learn that b-ok.org, a pirated e-book site, ranked highly in the dataset, along with other piracy sites.
Of the top 10 sites, more than half were journalism-related (like the New York Times or Forbes).
When grouped into categories, business websites (fool.com, Kickstarter) ranked first, technology second, and News & Media third. The technology category included millions of blogs (on platforms like medium.com or sites.google.com).
According to the Post article, the dataset reflected a Western view of religion: of the top 20 religious sites, 14 were Christian, two Jewish, one Muslim, one Mormon, and one Jehovah’s Witness.
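(For the curious: C4 is publicly mirrored, so you can peek at this kind of data yourself. Here’s a minimal sketch, assuming the Hugging Face datasets library and its allenai/c4 mirror; streaming avoids downloading the multi-hundred-gigabyte corpus up front.)

```python
# Peek at a few documents from the C4 corpus.
# Assumes: pip install datasets, and the "allenai/c4" mirror
# on the Hugging Face Hub.
from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, doc in enumerate(c4):
    print(doc["url"])               # the website a document came from
    print(doc["text"][:200], "\n")  # first 200 characters of raw text
    if i == 2:                      # stop after three documents
        break
```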
It’s really interesting to see the saying “if it’s not on the internet, it doesn't exist” play out here. Training a model is complex, and its data can come from sources other than the web. But since the web seems to be the most accessible source of raw text, it’s what gets used most.
To be more precise, that’s what is used in the “pre-training” of an LLM. An Ars Technica article does a great job of explaining ChatGPT’s problem of hallucination, i.e., when an AI makes up information. To lessen hallucination, a technique called “reinforcement learning from human feedback” (RLHF) is used: human raters rank the answers given by ChatGPT, and the model learns from those rankings. When ChatGPT hallucinates, it basically has an information-deficit problem that it tries to “solve creatively”. This creativity meter could be dialed back toward accuracy, resulting in ChatGPT answering “I don’t know” more often. Finding the right balance between creativity and accuracy is hard.
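To make the RLHF idea a bit more concrete, here’s a minimal sketch of its reward-model step, where a small network learns to score rater-preferred answers higher than rejected ones. It assumes PyTorch, and the tiny linear model and random tensors are illustrative stand-ins, not OpenAI’s actual setup.

```python
# Sketch of the reward-model step at the heart of RLHF: a human
# rater prefers one answer over another, and the model learns to
# score the preferred answer higher. All shapes and data here are
# illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a (stand-in) answer embedding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, answer_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(answer_embedding).squeeze(-1)

model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Pretend embeddings for pairs of answers; raters chose the first of each pair.
chosen = torch.randn(8, 16)    # batch of preferred answers
rejected = torch.randn(8, 16)  # batch of dispreferred answers

# Pairwise ranking loss: push the chosen score above the rejected one.
loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```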
What’s also hard to find is enough computing power to train LLMs. Video cards, a.k.a. GPUs, are used for this task, as their parallel computing capabilities make them ideal not only for running video games but also for training AI. But you need them in the thousands: 10,000 is the number of GPUs Elon Musk reportedly bought to support Twitter’s AI project, and Google used more than 6,000 chips to train its Pathways Language Model. You can buy Nvidia chips, but it’s even better to develop your own, something Google and Amazon have already done and Microsoft is scrambling to do.
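The reason GPUs fit so well: neural-network training is dominated by large matrix multiplications, exactly the kind of work parallel hardware chews through. A toy comparison, assuming PyTorch and a machine with a CUDA GPU (absolute timings vary wildly by hardware):

```python
# Time the same large matrix multiplication on CPU and (if present) GPU.
# Purely illustrative; the numbers depend entirely on your hardware.
import time
import torch

a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.time()
torch.matmul(a, b)                       # one big matmul on the CPU
print(f"CPU: {time.time() - start:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()             # wait for the copies to finish
    start = time.time()
    torch.matmul(a_gpu, b_gpu)           # the same matmul on the GPU
    torch.cuda.synchronize()             # wait for the kernel to finish
    print(f"GPU: {time.time() - start:.3f}s")
```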
Wait, surely tech companies pay for GPUs, but what about the data? Some sites are asking the same question. Reddit has announced that it will start charging companies for access to its API, mainly for commercial purposes. This will probably include firms like OpenAI or Google, which could use Reddit data to train their AI models. Similarly, Stack Overflow, a large programmer Q&A site, will also start charging for access to its API and is considering sharing some of that income with users. The economy of AI is weird…
It’s weird because it actually costs money to run generative AI models. When ChatGPT effortlessly generates a response to your query, it consumes computing power and electricity in a Microsoft data centre, and that could cost up to $700,000 per day, according to Dylan Patel from SemiAnalysis.
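For a sense of how such a daily figure adds up, here’s a toy back-of-envelope calculation; both numbers below are made up for illustration and are not the SemiAnalysis figures:

```python
# Hypothetical inference-cost arithmetic. Both inputs are invented
# for illustration; they are NOT the SemiAnalysis estimates.
queries_per_day = 10_000_000   # assumed daily query volume
cost_per_query = 0.07          # assumed dollars of GPU time and power per query

print(f"${queries_per_day * cost_per_query:,.0f} per day")  # -> $700,000 per day
```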
So, we have generative AI models that cost a lot to run and plenty to train, yet for some use cases (like ChatGPT or Bing) users aren’t paying. Does this sound familiar? As we have seen with search and social media, ads might be the “solution”. At least, that’s what Microsoft and Google are trying to figure out.
Podcast
This Wired Gadget Lab episode covers the state of voice-generating AI, with examples, products, and safety concerns.
Substack
Instead of a video, I’d like to recommend a Substack post from Ethan Mollick. It shows what results you can get from generative AI with just one prompt. Mollick points out that for the best results you shouldn’t stop at one prompt, and he provides tips for good prompting at the end.
Have ai nice week!



