I've been playing around with deploying different large models on various platforms (HF, AWS, etc.) for testing and have been underwhelmed by the inference speeds I've been able to achieve. They're fine (though considerably slower than OpenAI), but nothing like what I'd been led to expect by people who talk about how frighteningly fast their self-hosted models are.
For reference, these are end-to-end times for a full response (timed roughly as in the sketch below):
~1200 ms from gpt-3.5-turbo
~1600 ms from gpt-4o
~5000 ms from llama-70b-instruct on a dedicated HF endpoint
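Just so it's clear what I'm measuring: wall-clock time from sending the request to receiving the complete (non-streamed) response, along these lines. The endpoint URL, token, and prompt here are placeholders, not my actual setup:

```python
import time
import requests

# Placeholders -- substitute your own Inference Endpoint URL and HF token.
HF_ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"
HF_TOKEN = "hf_..."

def time_hf_request(prompt: str) -> float:
    """Time one full (non-streamed) request/response round trip, in ms."""
    start = time.perf_counter()
    resp = requests.post(
        HF_ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_TOKEN}"},
        json={"inputs": prompt, "parameters": {"max_new_tokens": 128}},
        timeout=60,
    )
    resp.raise_for_status()
    return (time.perf_counter() - start) * 1000

print(f"{time_hf_request('Explain KV caching in one sentence.'):.0f} ms")
```

Same idea for the OpenAI numbers, just timing the chat completions call instead.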
I've been running these deployments on a standard 4x NVIDIA A100 instance (320 GB GPU memory total), so I'm now wondering: am I missing something, or were my expectations just unreasonable? Curious to hear your thoughts, experiences, and tips/tricks. Thanks.