The first AI product I worked on was email junkmail detection. Spam detection focuses on catching malicious email; junkmail is the adjacent problem of forgotten newsletters and coupons from marketers. In some ways, this is a harder problem than spam detection: a spam message is spam for everyone, but some users may actually want that J.C. Penney's coupon.
Our approach was to train a model per user that would learn that user's preferences for which email they considered junk. That meant hundreds of millions of ML models. We built a series of complicated infrastructure pieces to experiment with and train the base model, and another series of complicated infrastructure pieces to fine-tune a model for each user. The fine-tuning was driven by the user's natural behavior: deleting or reading messages, among other signals. Yes, this could be considered reinforcement learning. In sum, we were using RL to train hundreds of millions of models on loads of custom infrastructure and pipelines. Incredibly advanced stuff!
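To make the shape of that concrete, here is a minimal sketch of what a per-user update loop can look like. The logistic-regression learner, the feature vector, and the "deleted unread means junk" labeling rule are my own illustrative assumptions, not the actual production system:

```python
import numpy as np

class PerUserJunkModel:
    """One tiny model per user, nudged by that user's own behavior."""

    def __init__(self, n_features: int, lr: float = 0.05):
        # In a real system this would start from a shared base model's weights.
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict_junk_prob(self, features: np.ndarray) -> float:
        # Logistic regression: probability this message is junk for this user.
        return 1.0 / (1.0 + np.exp(-self.w @ features))

    def observe(self, features: np.ndarray, deleted_unread: bool) -> None:
        # Treat "deleted without reading" as a junk label and "read" as
        # not-junk, then take one SGD step on logistic loss per action.
        label = 1.0 if deleted_unread else 0.0
        error = self.predict_junk_prob(features) - label
        self.w -= self.lr * error * features

# Hundreds of millions of users means hundreds of millions of these.
models: dict[str, PerUserJunkModel] = {}
```

The point of the sketch is the shape of the system, not the math: every user action feeds back into that user's own weights, which is why the storage, training, and serving infrastructure ballooned.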
It sucked.
We tried for a long time to improve the results. We updated which signals were used, ran more experiments, and tweaked feature weights. There were executive meetings about whether a user opening and then deleting a message should count as a positive or a negative signal. We burned a lot of compute and even more engineering time.
It turns out the original hypothesis was wrong: no users wanted that J.C. Penney's coupon in their main inbox view. We ripped out the whole stack and filtered junkmail based on the sender's domain and similar information. This was cheaper, faster, more predictable, required no new infrastructure, and had better precision than our fanciest machine learning attempts. It took us years to show an improvement by using ML, and even then it was a thin layer of ML on top of the basic heuristics.
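The replacement was almost embarrassingly simple. A sketch of the idea, with a made-up sender list and a common header heuristic standing in for whatever curated data the real filter used:

```python
# Hypothetical sketch: classify junk by sender domain plus a header
# heuristic, no ML required. The domain list here is invented.
BULK_SENDER_DOMAINS = {"deals.jcpenney.com", "newsletters.example.com"}

def is_junkmail(sender: str, headers: dict[str, str]) -> bool:
    domain = sender.rsplit("@", 1)[-1].lower()
    if domain in BULK_SENDER_DOMAINS:
        return True
    # Bulk mailers almost always include a List-Unsubscribe header.
    return "List-Unsubscribe" in headers
```

A rule like this is trivially debuggable, costs nothing to run, and behaves the same for every message, which is exactly why it beat the per-user models on predictability.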
Beyond Scale Alone
I do not think the transformer architecture behind today's LLMs has hit its final scale wall yet. But I do know that it will! For every ML technique researchers have ever come up with, scale has eventually stopped working. The good news, nay, the great news, is that we've barely scratched the surface of applying this generation's models to the problems in our lives.
We do not need another generation of scale to make a significant improvement to gross world product. There are so many places where we can apply today's AI in our work and personal lives to increase what we can do. I don't even think we need inference-time compute to do it.