Current language model training leaves large parts of the internet on the table
The Decoder · ~4 min read
2-Minute Brief
According to The Decoder: Large language models learn from web data, but which pages actually make it into training sets depends heavily on a seemingly mundane choice: the HTML extractor. Researchers at Apple, Stanford, and the University of Washington found that three common extraction tools pull surprisingly different content from the same web pages.
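To make the divergence concrete, here is a minimal sketch, not from the article or the underlying paper, that runs two widely used open-source extractors over the same HTML. The excerpt does not name the three tools the researchers compared, so the choice of trafilatura and BeautifulSoup below is purely illustrative.

```python
# Minimal sketch (illustrative library choices, not the study's tools):
# contrast what two extractors keep from the same page.
import trafilatura
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | About | Contact</nav>
  <article>
    <h1>Sample headline</h1>
    <p>First paragraph of the main story, written out at some length so that
       a boilerplate-removal extractor has enough body text to latch onto.</p>
    <p>Second paragraph with more substantive content, again long enough to
       pass typical minimum-length heuristics in content extractors.</p>
  </article>
  <aside>Subscribe to our newsletter!</aside>
  <footer>Copyright 2024 Example Corp</footer>
</body></html>
"""

# trafilatura tries to isolate the main content and drop navigation,
# sidebars, and footers; it returns None when it finds no usable body text.
main_text = trafilatura.extract(html) or ""

# BeautifulSoup's get_text() keeps everything, boilerplate included.
all_text = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)

print("--- trafilatura ---")
print(main_text)
print("--- BeautifulSoup get_text ---")
print(all_text)
```

On a page like this, the first extractor keeps roughly the headline and body paragraphs, while the second also keeps the navigation bar, newsletter plug, and footer. Downstream quality filters then see two different documents, which is how the same page can end up in one training set and be dropped from another.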