
Even accounting for speed, TPUs are worse on a cost-benefit basis. People (some of whom are data scientists) pay to use them because they are the shiny new toy and there's entertainment in blogging about them or writing a walkthrough, not because they economically solve any problem.

I manage an enterprise machine learning team and we do tons of stuff on GCP. I’m dead serious: there is not a single use case where it makes sense to choose TPUs unless the main value you’re seeking is just the entertainment factor of using Google’s new thing.



There are a lot of reasons TPUs could be poorly matched for your workload, including model complexity (or lack thereof), the way your model is set up, the inputs to your model (if you're bottlenecked on IO, including host <-> TPU memory bandwidth, well, what can you do?), and how you're training it (including the evaluators used).
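
The IO one in particular bites people: the TPU sits idle while the host decodes and ships examples. As a rough sketch (TF 2.x; the TFRecord layout, bucket path, and parse function below are just stand-ins, not anyone's actual setup), a TPU-friendly input pipeline usually looks something like this:

    import tensorflow as tf

    # Hypothetical TFRecord schema: one JPEG per record under the key "image".
    def parse_example(record):
        feats = tf.io.parse_single_example(
            record, {"image": tf.io.FixedLenFeature([], tf.string)})
        img = tf.io.decode_jpeg(feats["image"], channels=3)
        img = tf.image.resize(img, [224, 224])
        return tf.cast(img, tf.bfloat16) / 255.0      # bfloat16 is native on TPU

    def make_dataset(file_pattern, global_batch_size):
        files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
        ds = files.interleave(tf.data.TFRecordDataset,
                              num_parallel_calls=tf.data.AUTOTUNE)  # parallel reads
        ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(global_batch_size, drop_remainder=True)       # static shapes for XLA
        return ds.prefetch(tf.data.AUTOTUNE)                        # overlap host and TPU work

    train_ds = make_dataset("gs://my-bucket/train-*.tfrecord", 1024)

If throughput is still poor after that, the TensorBoard profiler is usually the quickest way to see whether the TPU is actually waiting on input.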

Since your post didn't actually include any details, I did a search and immediately found an article[1] where TPUs worked better for that team's particular use case. I suspect I could find many such reports (and probably some opposite reports too).

It's unfortunate they didn't work for you; perhaps you should give them another shot with a different model. I'd recommend using Google Cloud's examples as a starting point.

[1] https://medium.com/bigdatarepublic/cost-comparison-of-deep-l...


That... doesn't seem likely? How big is your team? Any chance the person who evaluated TPUs made an error when setting up their test? I'm just a little skeptical that Google would have gone to all the trouble of designing the things, and that so many organizations would be spending so much money to use them, if they weren't better than GPUs in some way.


You're almost certainly using the TPUs wrong. It's very easy to use them wrong, unfortunately.

When you use them right, a TPUv3-8 gets equivalent perf to a cluster of 8 V100s.

I was astounded. I trained StyleGAN 2 from scratch at 1024x1024 in 2.5 days; Nvidia took 7 days for their official model. Granted, I used a v3-32, not a v3-8, but performance seems pretty similar.
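
To make "using them right" concrete: the boilerplate matters. Roughly, you connect through TPUClusterResolver, build the model under TPUStrategy, scale the global batch by the number of cores, and run many steps per host round-trip. This is only a sketch; the TPU name, toy model, and synthetic data below are placeholders, not my StyleGAN setup:

    import tensorflow as tf

    # "my-tpu" is a placeholder; on a Cloud TPU VM an empty string resolves locally.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    per_core_batch = 128
    global_batch = per_core_batch * strategy.num_replicas_in_sync  # 8 cores on a v3-8

    with strategy.scope():                       # variables created once, replicated per core
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            steps_per_execution=100,             # run 100 steps per host round-trip
        )

    # Synthetic data just so the sketch runs on its own; a real job would read
    # TFRecords from GCS with a proper tf.data pipeline.
    images = tf.random.uniform([4096, 64, 64, 3])
    labels = tf.random.uniform([4096], maxval=10, dtype=tf.int32)
    train_ds = (tf.data.Dataset.from_tensor_slices((images, labels))
                .repeat()
                .batch(global_batch, drop_remainder=True))

    model.fit(train_ds, steps_per_epoch=200, epochs=2)

The usual failure modes are dynamic shapes (XLA recompiles whenever tensor shapes change between steps) and per-core batches too small to keep the matrix units busy.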


Wow, you really like dealing with absolutes...


TPUs are pretty much the only thing in engineering I’ve ever encountered where literal absolutes actually apply.


Have you compared TPUs to training on other cloud infra, or with some other model?


Have you written about this in the past? This seems pretty absolute.



