You are correct that true H100 ownership costs are far lower. As I mention in the H100 blurb, those numbers are flexible and I don't mind if you halve them.
MFU can certainly be improved beyond 40%, as I mention. But on small models specifically: the paper uses FSDP for all models, and I believe a rigorous experiment should not vary the sharding strategy across model sizes, since that introduces numerical differences. FSDP2 on small models will be slow even with compilation.
The paper does not tie embeddings, as stated. The readout layer does lead to 6DV because it is a linear layer with D*V parameters, which costs 2 FLOPs per parameter on the forward pass and 4 on the backward, i.e. 6DV per token (rough sketch below). I would appreciate it if you could limit your comments to factual errors in the post.
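For concreteness, a back-of-the-envelope version of that accounting; the D and V values here are made up for illustration, not taken from the paper:

    # FLOPs per token for an untied readout (lm_head) layer of shape D x V
    D = 4096                               # hidden size (illustrative)
    V = 50_000                             # vocabulary size (illustrative)
    params = D * V
    forward_flops = 2 * params             # one multiply-add per parameter
    backward_flops = 4 * params            # grads w.r.t. activations and weights
    print(forward_flops + backward_flops)  # = 6 * D * V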
My bad on the 6DV estimate; you are correct that if they do a dense decoding (rather than a hierarchical one, as Google used to do in the old days) the cost is exactly 6DV. I cannot edit the GP comment and I will absorb the shame of my careless words there. I was put off by the subtitle and initial title of this HN post, though the current title is more appropriate and correct.
Even if it's a small model, one could use DDP or FSDP/FSDP2 without slowdowns on a fast interconnect, which certainly adds to the cost. But if you want to reproduce all the work at the cheapest price point, you only need to parallelize to the minimal level that fits in memory (or rather, the one that maximizes MFU), so everything below 2B parameters runs on a single H100 or a single node.
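Rough memory arithmetic behind the single-H100 claim, assuming bf16 weights/grads with fp32 Adam states (my assumption, not something the paper spells out) and ignoring activations:

    # Approximate weight + optimizer memory for a 2B-parameter model
    params = 2e9
    bytes_per_param = 2 + 2 + 4 + 8        # bf16 weights + bf16 grads + fp32 master + fp32 Adam m,v
    print(params * bytes_per_param / 1e9)  # ~32 GB, well under an 80 GB H100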
I think the commenter was thinking about the input embedding layer, where the model fetches an input token's embedding by a simple index lookup, which is constant time.
And the blog post author is talking about the output layer, where the model has to produce a prediction for every possible token in the vocabulary. Each output prediction is a dot product between the transformer hidden state (size D) and that token's embedding (size D, whether shared with the input embedding or not), computed for all V tokens in the vocabulary. That's where the D*V comes from.
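A minimal PyTorch sketch of the two operations (illustrative shapes, not the paper's code):

    import torch
    import torch.nn as nn

    D, V = 512, 32_000                     # illustrative hidden and vocab sizes
    embed = nn.Embedding(V, D)             # input side: table lookup by index
    lm_head = nn.Linear(D, V, bias=False)  # output side: dot product against all V rows

    token_ids = torch.tensor([42, 7])      # two input tokens
    h = embed(token_ids)                   # (2, D) -- pure gather, no matmul over V
    logits = lm_head(h)                    # (2, V) -- ~2*D*V forward FLOPs per token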
It would be great to clarify this in the blog post to make it more accessible, but I understand that there is a tradeoff.