Last year we reached out to a major GPU vendor because we needed access to a seven-figure dollar amount of compute time.
They contacted (and we spoke with) several of the largest partners they had, including education/research institutions and some private firms, and could not find ANYONE that could accommodate our needs.
AWS did not have the capacity either, at least not for spot instances, which were the only way we could have afforded it.
We ended up rolling our own solution with (more, but lower-end) GPUs we sourced ourselves, which actually came out cheaper than renting a dozen "big iron" boxes for six months.
It sounds like that capacity might actually be available now, but at the time we could not afford to wait another year to start the job.
If you were able to make do with cheaper GPUs, then you didn't need FP64, so you didn't need H100s in the first place, right? Then you made the right choice in buying a drill for your screw work instead of renting a jackhammer, even if the jackhammer would've seemed cooler to you at the time.
I think we're splitting hairs here. It was more about choosing a good combination of least effort, time and money. When you're spending that amount of money, things are not so black and white... rented H100s get the job done faster and easier than whatever we can piece together ourselves. The L40 (cheaper but no FP64) was also brand new at the time. Also, our code was custom OpenCL and could have taken advantage of FP64 to go faster if we had had the devices for it.