When will AI run out of internet? Is that the end of pre-training?

Recently, I saw one of those slick AI infographics: “ChatGPT was trained on x% of all Reddit, x% of Wikipedia, etc.” As a hopeless data nerd, I couldn’t let it go. How do they even calculate that? Are those percentages real, or just a clever statistical trick? More importantly, what happens when we run out of public internet data? I went down the rabbit hole to find out. Here’s what I found.

Do all AI models get trained on the same data?

Well, most people, myself included, would assume no, and they would be correct. Every lab gets most of its data from the same publicly available sources, but beyond that each one has data specific to it. I was completely unaware of the following.

Google feeds its models unlimited YouTube transcripts and scanned books nobody else can touch. Meta has the firehose of Instagram captions and Facebook posts. X gives xAI every tweet in real time. Anthropic gets enterprise data most of us will never see.

So maybe 50–70% of the training data is common to everyone, but the rest is each lab’s secret sauce.

How do they get these percentages and this data?

So, first of all: almost everywhere you see percentages, they are based on tokens. Tokens, what are those, you might ask?

A token is roughly 3–4 characters of English text: a short word like “hello” is often a single token, while a long rare word like “supercalifragilistic” gets chopped into several pieces (the exact split depends on the tokenizer). Reddit threads are long, chatty, and full of emojis and repeated slang, so one single comment thread can explode into thousands of tokens.
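If you want to check the rule of thumb yourself, here is a minimal sketch using OpenAI’s open-source tiktoken tokenizer. This is just my pick for illustration; every lab uses its own tokenizer, so the exact counts will differ.

```python
# Sanity-check the "one token is roughly 3-4 characters" rule of thumb
# with tiktoken (pip install tiktoken). Other tokenizers split differently.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding

samples = [
    "hello",
    "supercalifragilistic",
    "lol 😂😂 this thread is wild, saving it for later",
]
for text in samples:
    tokens = enc.encode(text)
    print(f"{text!r}: {len(tokens)} tokens, "
          f"{len(text) / len(tokens):.1f} chars per token")
```

Emoji-heavy slang tends to cost more tokens per character than plain prose, which is part of why chatty platforms balloon in these charts.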

So when the infographic screams “17% of all Reddit,” it means 17% of the token mountain, not 17% of the actual human posts. That’s why Reddit always looks gigantic and other platforms look like crumbs.

They estimate it by prompting the model to spill out exact long sequences, then hashing those sequences and checking which original websites contain the same hash. A cheap trick, and this is how they ultimately cheat the statistics.
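As a toy illustration of that idea (not a claim about how any specific lab does it), here is a rough sketch: hash sliding character windows of what the model emits and check them against hashes built from a known source. The window size and sample texts are made up.

```python
# Toy hash-and-compare check for verbatim memorization. Real studies are
# far more careful; the 50-character window and the texts are invented.
import hashlib

def window_hashes(text: str, n: int = 50) -> set[str]:
    """Hash every n-character sliding window of whitespace-normalized text."""
    text = " ".join(text.split())
    return {
        hashlib.sha1(text[i:i + n].encode()).hexdigest()
        for i in range(max(len(text) - n + 1, 1))
    }

source = "an original reddit comment that ended up in the training data somewhere on the internet"
model_output = "...and an original reddit comment that ended up in the training data somewhere on the internet, word for word."

overlap = window_hashes(model_output) & window_hashes(source)
print(f"matching windows: {len(overlap)}")  # > 0 hints the model has seen this text
```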

How do they know what is high-quality data?

Ok, so if you look at those infographics and read the fine print below, you will usually see an explanation that this is “high-quality data”. That got me thinking: so far everything has been standing on shaky legs, so how can we be sure?

To make sure the data used for training AI models is top-notch, several filtering steps are applied. First, an AI classifier checks for clarity and good writing, removing confusing or incoherent text. Then a spam detector tosses out obvious junk like porn, SEO garbage, and repetitive content. Finally, the pipeline prioritizes sources that historically produce great results (like high-quality papers or code) and increasingly favours data that specifically helps the AI learn to think better, not just memorize facts. At least, this is how I hope it works.
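To make that concrete, here is a toy version of such a filtering pipeline. The spam patterns, the repetition check, and the thresholds are all invented for illustration; real labs use trained classifiers and much bigger rule sets.

```python
# Toy data-quality filter: a few cheap heuristics standing in for the
# proprietary classifiers labs actually use. All rules are illustrative.
import re

SPAM_PATTERNS = [r"buy now", r"click here", r"free \$\$\$"]

def looks_spammy(text: str) -> bool:
    return any(re.search(p, text, re.IGNORECASE) for p in SPAM_PATTERNS)

def too_repetitive(text: str, min_unique_ratio: float = 0.3) -> bool:
    words = text.lower().split()
    return bool(words) and len(set(words)) / len(words) < min_unique_ratio

def keep_document(text: str, source_score: float) -> bool:
    """source_score: how much we historically trust this source (0..1)."""
    if looks_spammy(text) or too_repetitive(text):
        return False
    if len(text.split()) < 10:       # too short to teach the model anything
        return False
    return source_score >= 0.5       # favour sources with a good track record

docs = [
    ("Click here to BUY NOW!!! free $$$ guaranteed", 0.1),
    ("A clear walkthrough of how attention lets a transformer weigh "
     "different parts of its context when predicting the next token.", 0.9),
]
for text, score in docs:
    print(keep_document(text, score), "->", text[:45] + "...")
```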

When will AI learn everything?

[Image: holographic portrait amid a swirling, crumbling vortex of internet data]

The best current estimates say there are roughly 300 trillion “high-quality” public tokens left on Earth (crypto-mining flashback). Frontier labs are burning through them at a rate that grows about 10× every two years. If we do the math, sometime between late 2027 and mid-2029 the biggest players will have seen every clean public token at least once. The truly “good” stuff (books, scientific papers, curated websites) runs out even earlier, basically next year or the year after.
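Here is the back-of-the-envelope version of that math. The starting stock, today’s consumption, and the growth rate are rough public estimates and assumptions on my part, not official figures.

```python
# Rough math behind the "runs out around 2027-2029" claim. All inputs are
# assumptions: ~300T high-quality public tokens, ~15T consumed by a frontier
# run today, and consumption growing ~10x every two years.
import math

stock = 300e12               # estimated high-quality public tokens
used_per_run = 15e12         # assumed tokens a frontier run consumes today
growth_per_2y = 10           # consumption grows ~10x every two years

# Solve used_per_run * growth_per_2y**(t / 2) = stock for t (years from now)
years_left = 2 * math.log(stock / used_per_run) / math.log(growth_per_2y)
print(f"public token stock fully consumed in ~{years_left:.1f} years")  # ~2.6
```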

If you ask me, by the beginning of 2027 we will be flooded with claims that some AI models have been trained on everything and know everything. I can’t wait to watch AI users try to travel without visas and passports to countries where those documents are mandatory, all over again.

What happens after it learns everything?

Now here is where it gets scary, and I’m not talking about doomsday conspiracies or Skynet scenarios. Next on the menu is synthetic data, which means the model writes its own textbooks and proofs: if you don’t like what the data shows you, fix the data yourself to suit the desired output. There are also trillions of new tokens that aren’t text still waiting to be processed, plus real-time continual learning from social platforms, new discoveries and so on.

And the biggest jump won’t even come from more data; it’ll come from burning 100× more compute at inference time: longer thinking chains, tool loops, self-correction, and verification models babysitting the main one.
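As a sketch of what “a verification model babysitting the main one” could look like, here is a minimal generate-verify-retry loop; generate() and verify() are toy stand-ins for real model calls, and the retry-with-feedback pattern is the only point.

```python
# Minimal generate -> verify -> retry loop. The toy functions fake a model
# that gets 2 + 2 wrong on the first try and fixes it after feedback.
def generate(prompt: str, feedback: str | None = None) -> str:
    # placeholder for the main model; pretend the first draft is wrong
    return "4" if feedback else "5"

def verify(prompt: str, answer: str) -> tuple[bool, str]:
    # placeholder verifier; a real one would be a second model or a checker
    if answer == "4":
        return True, ""
    return False, "2 + 2 should equal 4, try again"

def answer_with_verification(prompt: str, max_rounds: int = 3) -> str:
    feedback, draft = None, ""
    for _ in range(max_rounds):
        draft = generate(prompt, feedback)
        ok, critique = verify(prompt, draft)
        if ok:
            return draft
        feedback = critique          # feed the critique back in (self-correction)
    return draft                     # best effort after max_rounds

print(answer_with_verification("What is 2 + 2?"))  # prints "4" on the second round
```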

In short: AI will stop scaling by reading more and start scaling by thinking deeper, rewriting reality to our liking, and plugging directly into the physical world. Fun times?

Final thoughts

I’m not really sure that any AI model will ever learn everything. I think at some point there will be tribes, like iPhone and Android: pick your ecosystem, love its biases, hate the other ones. And we are still waiting on AI to really think as a sentient being (if it isn’t doing so already). So, my final bullet points:

  1. I love the infographics; they are such a great way to visualize complex sets of data. That being said, don’t trust everything you see. Eyewitness testimony is the least reliable kind of proof, even in physics.

  2. AI hallucinates often, and that is based on the numbers published by the AI companies themselves. I’ll write more about it soon.

  3. Because each model uses a unique "secret sauce" of proprietary and high-quality data, they are inherently biased. Choosing an AI ecosystem (Gemini, ChatGPT, Claude) is increasingly like choosing a political party—you're picking its biases.

  4. AI is still just a tool, a powerful one; use it as such.

When you deep seek (pun intended) into the data, what do you think about all the percentages shown to us by AI companies and the people around them? Do you look into the details, or do you blindly accept them as the truth?
