
I used to agree, but now I disagree. You don't need to look any further than Google's ubiquitous MobileNetV3 architecture. It needs a lot less compute but outperforms v1 and v2 in almost every way. It also outperforms most other image-recognition encoders at 1% of the FLOPS.

And if you read the paper, there are experienced professionals explaining why they made each change. It's a deliberate, handcrafted design. Sure, they used parameter sweeps too, but that's more the AI equivalent of using Excel instead of paper tables.



Actually, MobileNetV3 is a supporting example of the bitter lesson, not the other way round. The point of Sutton's essay is that it isn't worth adding inductive biases (specific loss functions, handcrafted features, special architectures) to our algorithms. Given lots of data, just feed it into a generic architecture and it will eventually outperform manually tuned ones.

MobileNetV3 uses architecture search, which is a prime example of the above: even the architecture hyperparameters are derived from data. The handcrafted optimizations just concern speed and do not include any inductive biases.
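The basic loop is nothing more elaborate than "propose an architecture, measure it, keep the best". A toy random-search sketch of that idea (not the platform-aware NAS the paper actually uses; train_and_score is a hypothetical helper that trains a candidate and returns validation accuracy):

    import random

    def random_architecture_search(train_and_score, n_trials=50, seed=0):
        # Toy stand-in for NAS: sample architecture hyperparameters at random
        # and keep whichever candidate scores best on held-out data.
        rng = random.Random(seed)
        best_cfg, best_acc = None, float("-inf")
        for _ in range(n_trials):
            cfg = {
                "width_multiplier": rng.choice([0.5, 0.75, 1.0, 1.25]),
                "num_blocks": rng.randint(10, 20),
                "kernel_size": rng.choice([3, 5]),
            }
            acc = train_and_score(cfg)  # hypothetical: train the candidate, score on validation data
            if acc > best_acc:
                best_cfg, best_acc = cfg, acc
        return best_cfg, best_acc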


"The handcrafted optimizations just concern speed"

That is the goal here: efficient execution on mobile hardware. MobileNet v1 and v2 did similar parameter sweeps, but perform much worse. The main novelty of v3 is precisely the handcrafted changes. I'd treat that as an indication that the handcrafted changes in v3 far exceed what could be achieved with lots of compute in v1 and v2.

Also, I don't think any amount of compute can come up with new, efficient non-linearity formulas like h-swish in v3.
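For reference, h-swish as defined in the MobileNetV3 paper is x * ReLU6(x + 3) / 6, a cheap piecewise approximation of swish. A minimal NumPy sketch:

    import numpy as np

    def relu6(x):
        # ReLU capped at 6: cheap, piecewise linear, quantization friendly
        return np.minimum(np.maximum(x, 0.0), 6.0)

    def hswish(x):
        # h-swish from the MobileNetV3 paper: x * ReLU6(x + 3) / 6
        return x * relu6(x + 3.0) / 6.0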


Right, but that’s not a counterexample. The bitter lesson suggests that, eventually, it’ll be difficult to outperform a learning system manually. It doesn’t say that this is true at every point in time. Deep Blue _was_ better than all other chess players at the time. But now, AlphaZero is better.

I believe the same is true for neural network architecture search: at some point, learning systems will be better than all humans. Maybe that’s not true today, but I wouldn’t bet on that _always_ being false.


The article says:

"We have to learn the bitter lesson that building in how we think we think does not work in the long run."

And I would argue: it saves at least 100x in compute. So by hand-designing the relevant parts, I can build an AI today that Moore's law would otherwise only make feasible in about 7 years. Those 7 years are the reason to do it. That's plenty of time to create a startup and cash out.
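Back-of-the-envelope, assuming effective compute per dollar doubles roughly once a year (an assumption, not a measurement): log2(100) ≈ 6.6 doublings, i.e. about 7 years.

    import math

    speedup = 100.0            # assumed saving from hand-design
    doubling_time_years = 1.0  # assumed effective-compute doubling period
    print(math.log2(speedup) * doubling_time_years)  # ~6.64 -> roughly 7 years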


I think the "we" in this case is researchers and scientists trying to advance human knowledge, not startup folks. Startups of course expend lots of effort on doing things that don't end up helping humanity in the long run.


Sutton is talking about a long-term trend. Would Google have been able to achieve this without a lot of computation? I don't think it refutes the essay in any way. If anything, model compression takes even more computation. We can't scale heuristics; we can scale computation.


Link to the paper?




