
The way the current architecture works, as far as I know, is that your assumed "server caches the generated output" step doesn't exist. What you get in your output is streamed directly from the LLM to your client, which is, in theory, the most efficient way to do it.
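A minimal sketch of that pass-through shape (FastAPI and the /chat route here are just for illustration; upstream_tokens stands in for whatever the model actually streams):

    # Sketch: tokens are relayed to the client as they arrive and are never
    # stored server-side, so a dropped connection simply loses the output.
    import asyncio
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    async def upstream_tokens(prompt: str):
        # Stand-in for the real model stream.
        for token in ["Hello", ", ", "world", "!"]:
            await asyncio.sleep(0.1)
            yield token

    @app.post("/chat")
    async def chat(prompt: str):
        async def relay():
            async for token in upstream_tokens(prompt):
                yield f"data: {token}\n\n"  # straight through, no copy kept
        return StreamingResponse(relay(), media_type="text/event-stream")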

That's why LLM outputs that get cut off mid-stream require the end user to click the "retry" button and not a "re-send me that last output" button (which doesn't exist).

I would imagine a simpler approach would be to make the last prompt idempotent... which would require caching on their servers, something that supposedly isn't happening right now. That way, if the user re-sends the last prompt, the server just responds with the exact same output it just generated. Except LLMs often make mistakes and hallucinate things... so re-sending the last prompt and hoping for a better output isn't an uncommon thing.
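Something like this on the server side, where generate_completion is a made-up stand-in for the actual model call:

    # Sketch: cache the finished output per (conversation, prompt), so a
    # re-send of the same prompt returns the same text instead of re-running
    # the model. generate_completion is a hypothetical placeholder.
    import hashlib

    _cache: dict[str, str] = {}

    def generate_completion(prompt: str) -> str:
        return "...model output..."  # placeholder for the real model call

    def idempotent_reply(conversation_id: str, prompt: str) -> str:
        key = hashlib.sha256(f"{conversation_id}:{prompt}".encode()).hexdigest()
        if key not in _cache:
            _cache[key] = generate_completion(prompt)
        return _cache[key]

A deliberate "give me a different answer" would then have to be a genuinely new request rather than a re-send, which is exactly the tension above.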

Soooo... Back to my suggested workaround in my other comment: Pub/sub over WebSockets :D
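Roughly what I have in mind, sketched with an in-memory broker (FastAPI's WebSocket support is just an example transport; a real version would need per-user channels, auth, eviction, etc.):

    # Sketch: whatever generates tokens calls publish(); each subscriber gets
    # the backlog first (catch-up after a reconnect) and then live tokens.
    import asyncio
    from collections import defaultdict
    from fastapi import FastAPI, WebSocket

    app = FastAPI()
    backlog: dict[str, list[str]] = defaultdict(list)
    subscribers: dict[str, set] = defaultdict(set)

    async def publish(channel: str, token: str):
        backlog[channel].append(token)
        for queue in subscribers[channel]:
            await queue.put(token)

    @app.websocket("/subscribe/{channel}")
    async def subscribe(ws: WebSocket, channel: str):
        await ws.accept()
        queue: asyncio.Queue = asyncio.Queue()
        subscribers[channel].add(queue)
        replay = list(backlog[channel])  # snapshot before any await
        try:
            for token in replay:          # catch up on what was missed
                await ws.send_text(token)
            while True:                   # then stream live
                await ws.send_text(await queue.get())
        finally:
            subscribers[channel].discard(queue)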

The user's last prompt can be sent with an idempotency key that changes each time the user initiates a new request. If the key is the same as before, serve from the cache; if it's new, hit the LLM again.
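On the client side that's just a fresh UUID per new request and the same key reused on a retry; the URL and header name below are made up for the sketch:

    # Sketch: a new UUID per user-initiated request, the same key reused on a
    # retry. The endpoint URL and "Idempotency-Key" header name are illustrative.
    import uuid
    import requests

    def send_prompt(prompt: str, idempotency_key: str | None = None):
        key = idempotency_key or str(uuid.uuid4())  # new key = new generation
        resp = requests.post(
            "https://llm.example.com/v1/chat",
            json={"prompt": prompt},
            headers={"Idempotency-Key": key},
            timeout=60,
        )
        return key, resp.text

    # First attempt:
    #   key, text = send_prompt("Explain SSE")
    # Re-send after a dropped connection, expecting the cached output back:
    #   _, text = send_prompt("Explain SSE", idempotency_key=key)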

The only reason an LLM server responds with partial results instead of waiting and returning everything at once is UX: it's just too slow otherwise. But the problem of slow bulk responses isn't unique to LLMs and can be solved well enough within HTTP/1.1. It doesn't have to be the same server; it can be a caching proxy in front of it. Any privacy concerns can be addressed by giving the user the opportunity to tell the server to cache or not to cache (it could be as easy as submitting with PUT vs. POST requests).
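As a sketch of that proxy idea, with the upstream URL obviously a placeholder and PUT meaning "please cache this one":

    # Sketch: a tiny caching proxy in front of the LLM backend. PUT means
    # "cache this", POST means "don't". The upstream URL is a placeholder.
    import hashlib
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import PlainTextResponse

    app = FastAPI()
    UPSTREAM = "http://llm-backend.internal/v1/chat"  # hypothetical
    cache: dict[str, str] = {}

    @app.api_route("/chat", methods=["POST", "PUT"])
    async def chat(request: Request):
        body = await request.body()
        key = hashlib.sha256(body).hexdigest()
        if request.method == "PUT" and key in cache:
            return PlainTextResponse(cache[key])  # served without touching the LLM
        async with httpx.AsyncClient(timeout=None) as client:
            upstream = await client.post(UPSTREAM, content=body)
        if request.method == "PUT":
            cache[key] = upstream.text
        return PlainTextResponse(upstream.text)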

But adding caching to SSE is trivial compared to completely changing your transfer protocol, so why wouldn't you just do that instead?
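For example, by numbering events and honoring the Last-Event-ID header that SSE clients already send on reconnect (the buffering and stream ids below are illustrative, not anyone's actual API):

    # Sketch: generation writes tokens into a per-stream buffer; the SSE
    # endpoint numbers each event and resumes from whatever Last-Event-ID
    # the client reconnects with. Stream ids and storage are illustrative.
    import asyncio
    from fastapi import FastAPI, Header
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    buffers: dict[str, list[str]] = {}  # stream_id -> tokens produced so far
    done: set[str] = set()              # streams that finished generating

    async def generate(stream_id: str):
        buffers[stream_id] = []
        for token in ["Hello", ", ", "world", "!"]:  # stand-in for the model
            await asyncio.sleep(0.1)
            buffers[stream_id].append(token)
        done.add(stream_id)

    @app.post("/chat/{stream_id}")
    async def start(stream_id: str):
        asyncio.create_task(generate(stream_id))  # kick off generation once
        return {"stream": f"/stream/{stream_id}"}

    @app.get("/stream/{stream_id}")
    async def stream(stream_id: str, last_event_id: str | None = Header(default=None)):
        async def events():
            i = int(last_event_id) + 1 if last_event_id else 0
            while True:
                buf = buffers.get(stream_id, [])
                while i < len(buf):
                    yield f"id: {i}\ndata: {buf[i]}\n\n"
                    i += 1
                if stream_id in done:
                    break
                await asyncio.sleep(0.05)  # wait for more tokens
        return StreamingResponse(events(), media_type="text/event-stream")

The browser's EventSource sends Last-Event-ID automatically when it reconnects, so the client barely has to change.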


