
Even accounting for speed, TPUs are worse on a cost-benefit basis. People (some of whom are data scientists) pay to use them because they are the shiny new toy and there's entertainment in blogging about them or writing a walkthrough, not because they economically solve any problem.

I manage an enterprise machine learning team and we do tons of stuff on GCP. I’m dead serious: there is not a single use case where it makes sense to choose TPUs unless the main value you’re seeking is just the entertainment factor of using Google’s new thing.



There are a lot of reasons TPUs could be poorly matched for your workload, including model complexity (or lack thereof), the way your model is set up, the inputs to your model (if you're bottlenecked on IO, including host <-> TPU memory bandwidth, well, what can you do?), and how you're training it (including the evaluators used).
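
The IO one in particular bites people: the TPU sits idle while the host decodes and ships examples. As a rough sketch (TF 2.x; the TFRecord layout, bucket path, and parse function below are just stand-ins, not anyone's actual setup), a TPU-friendly input pipeline usually looks something like this:

    import tensorflow as tf

    # Hypothetical TFRecord schema: one JPEG per record under the key "image".
    def parse_example(record):
        feats = tf.io.parse_single_example(
            record, {"image": tf.io.FixedLenFeature([], tf.string)})
        img = tf.io.decode_jpeg(feats["image"], channels=3)
        img = tf.image.resize(img, [224, 224])
        return tf.cast(img, tf.bfloat16) / 255.0      # bfloat16 is native on TPU

    def make_dataset(file_pattern, global_batch_size):
        files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
        ds = files.interleave(tf.data.TFRecordDataset,
                              num_parallel_calls=tf.data.AUTOTUNE)  # parallel reads
        ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.batch(global_batch_size, drop_remainder=True)       # static shapes for XLA
        return ds.prefetch(tf.data.AUTOTUNE)                        # overlap host and TPU work

    train_ds = make_dataset("gs://my-bucket/train-*.tfrecord", 1024)

If throughput is still poor after that, the TensorBoard profiler is usually the quickest way to see whether the TPU is actually waiting on input.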

Since your post didn't actually include any details, I did a search and immediately found an article[1] where TPUs worked better for that team's particular use case. I suspect I could find many such reports (and probably some opposite reports too).

It's unfortunate they didn't work for you; perhaps you should give them another shot with a different model. I'd recommend using Google Cloud's examples as a starting point.

[1] https://medium.com/bigdatarepublic/cost-comparison-of-deep-l...


That... doesn't seem likely? How big is your team? Any chance the person who evaluated TPUs made an error when setting up their test? I'm just a little skeptical that Google would have gone to all the trouble of designing the things, and that so many organizations would be spending so much money to use them, if they weren't better than GPUs in some way.


You're almost certainly using the TPUs wrong. It's very easy to use them wrong, unfortunately.

When you use them right, a TPUv3-8 gets equivalent perf to a cluster of 8 V100s.

I was astounded. I trained StyleGAN 2 from scratch at 1024x1024 in 2.5 days; Nvidia took 7 days for their official model. Granted, I used a v3-32, not a v3-8, but performance seems pretty similar.
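
To make "using them right" concrete: the boilerplate matters. Roughly, you connect through TPUClusterResolver, build the model under TPUStrategy, scale the global batch by the number of cores, and run many steps per host round-trip. This is only a sketch; the TPU name, toy model, and synthetic data below are placeholders, not my StyleGAN setup:

    import tensorflow as tf

    # "my-tpu" is a placeholder; on a Cloud TPU VM an empty string resolves locally.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    per_core_batch = 128
    global_batch = per_core_batch * strategy.num_replicas_in_sync  # 8 cores on a v3-8

    with strategy.scope():                       # variables created once, replicated per core
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
            tf.keras.layers.GlobalAveragePooling2D(),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            steps_per_execution=100,             # run 100 steps per host round-trip
        )

    # Synthetic data just so the sketch runs on its own; a real job would read
    # TFRecords from GCS with a proper tf.data pipeline.
    images = tf.random.uniform([4096, 64, 64, 3])
    labels = tf.random.uniform([4096], maxval=10, dtype=tf.int32)
    train_ds = (tf.data.Dataset.from_tensor_slices((images, labels))
                .repeat()
                .batch(global_batch, drop_remainder=True))

    model.fit(train_ds, steps_per_epoch=200, epochs=2)

The usual failure modes are dynamic shapes (XLA recompiles whenever tensor shapes change between steps) and per-core batches too small to keep the matrix units busy.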


Wow, you really like dealing with absolutes...


TPUs are pretty much the only thing in engineering I’ve ever encountered where literal absolutes actually apply.


Have you compared TPUs to training on other cloud infra, or with some other model?


Have you written about this in the past? This seems pretty absolute.



