
The way the current architecture works, as far as I know, is that your assumed "server caches the generated output" step doesn't exist. What you get in your output is streamed directly from the LLM to your client, which is, in theory, the most efficient way to do it.
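A minimal sketch of that pass-through shape (FastAPI and the /chat route here are just for illustration; upstream_tokens stands in for whatever the model actually streams):

    # Sketch: tokens are relayed to the client as they arrive and are never
    # stored server-side, so a dropped connection simply loses the output.
    import asyncio
    from fastapi import FastAPI
    from fastapi.responses import StreamingResponse

    app = FastAPI()

    async def upstream_tokens(prompt: str):
        # Stand-in for the real model stream.
        for token in ["Hello", ", ", "world", "!"]:
            await asyncio.sleep(0.1)
            yield token

    @app.post("/chat")
    async def chat(prompt: str):
        async def relay():
            async for token in upstream_tokens(prompt):
                yield f"data: {token}\n\n"  # straight through, no copy kept
        return StreamingResponse(relay(), media_type="text/event-stream")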

That's why LLM outputs that get cut off mid-stream require the end user to click the "retry" button and not a "re-send me that last output" button (which doesn't exist).

I would imagine a simpler approach would be to make the last prompt idempotent... which would require caching on their servers, something that supposedly isn't happening right now. That way, if the user re-sends the last prompt, the server just responds with the exact same output it just generated. Except LLMs often make mistakes and hallucinate things... so re-sending the last prompt and hoping for a better output isn't an uncommon thing.
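Something like this on the server side, where generate_completion is a made-up stand-in for the actual model call:

    # Sketch: cache the finished output per (conversation, prompt), so a
    # re-send of the same prompt returns the same text instead of re-running
    # the model. generate_completion is a hypothetical placeholder.
    import hashlib

    _cache: dict[str, str] = {}

    def generate_completion(prompt: str) -> str:
        return "...model output..."  # placeholder for the real model call

    def idempotent_reply(conversation_id: str, prompt: str) -> str:
        key = hashlib.sha256(f"{conversation_id}:{prompt}".encode()).hexdigest()
        if key not in _cache:
            _cache[key] = generate_completion(prompt)
        return _cache[key]

A deliberate "give me a different answer" would then have to be a genuinely new request rather than a re-send, which is exactly the tension above.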

Soooo... Back to my suggested workaround in my other comment: Pub/sub over WebSockets :D
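Roughly what I have in mind, sketched with an in-memory broker (FastAPI's WebSocket support is just an example transport; a real version would need per-user channels, auth, eviction, etc.):

    # Sketch: whatever generates tokens calls publish(); each subscriber gets
    # the backlog first (catch-up after a reconnect) and then live tokens.
    import asyncio
    from collections import defaultdict
    from fastapi import FastAPI, WebSocket

    app = FastAPI()
    backlog: dict[str, list[str]] = defaultdict(list)
    subscribers: dict[str, set] = defaultdict(set)

    async def publish(channel: str, token: str):
        backlog[channel].append(token)
        for queue in subscribers[channel]:
            await queue.put(token)

    @app.websocket("/subscribe/{channel}")
    async def subscribe(ws: WebSocket, channel: str):
        await ws.accept()
        queue: asyncio.Queue = asyncio.Queue()
        subscribers[channel].add(queue)
        replay = list(backlog[channel])  # snapshot before any await
        try:
            for token in replay:          # catch up on what was missed
                await ws.send_text(token)
            while True:                   # then stream live
                await ws.send_text(await queue.get())
        finally:
            subscribers[channel].discard(queue)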

The user's last prompt can be sent with an idempotency key that changes each time the user initiates a new request. If the key is the same as before, serve from the cache; if it's new, hit the LLM again.
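On the client side that's just a fresh UUID per new request and the same key reused on a retry; the URL and header name below are made up for the sketch:

    # Sketch: a new UUID per user-initiated request, the same key reused on a
    # retry. The endpoint URL and "Idempotency-Key" header name are illustrative.
    import uuid
    import requests

    def send_prompt(prompt: str, idempotency_key: str | None = None):
        key = idempotency_key or str(uuid.uuid4())  # new key = new generation
        resp = requests.post(
            "https://llm.example.com/v1/chat",
            json={"prompt": prompt},
            headers={"Idempotency-Key": key},
            timeout=60,
        )
        return key, resp.text

    # First attempt:
    #   key, text = send_prompt("Explain SSE")
    # Re-send after a dropped connection, expecting the cached output back:
    #   _, text = send_prompt("Explain SSE", idempotency_key=key)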

The only reason an LLM server responds with partial results instead of waiting and returning everything at once is UX: it's just too slow otherwise. But the problem of slow bulk responses isn't unique to LLMs and can be solved well enough within HTTP/1.1. It doesn't have to be the same server; it can be a caching proxy in front of it. Any privacy concerns can be addressed by giving the user the opportunity to tell the server to cache or not to cache (it could be as easy as submitting with PUT vs. POST requests).
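As a sketch of that proxy idea, with the upstream URL obviously a placeholder and PUT meaning "please cache this one":

    # Sketch: a tiny caching proxy in front of the LLM backend. PUT means
    # "cache this", POST means "don't". The upstream URL is a placeholder.
    import hashlib
    import httpx
    from fastapi import FastAPI, Request
    from fastapi.responses import PlainTextResponse

    app = FastAPI()
    UPSTREAM = "http://llm-backend.internal/v1/chat"  # hypothetical
    cache: dict[str, str] = {}

    @app.api_route("/chat", methods=["POST", "PUT"])
    async def chat(request: Request):
        body = await request.body()
        key = hashlib.sha256(body).hexdigest()
        if request.method == "PUT" and key in cache:
            return PlainTextResponse(cache[key])  # served without touching the LLM
        async with httpx.AsyncClient(timeout=None) as client:
            upstream = await client.post(UPSTREAM, content=body)
        if request.method == "PUT":
            cache[key] = upstream.text
        return PlainTextResponse(upstream.text)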

But adding caching to SSE is trivial compared to completely changing your transfer protocol, so why wouldn't you just do that instead?
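For example, by numbering events and honoring the Last-Event-ID header that SSE clients already send on reconnect (the buffering and stream ids below are illustrative, not anyone's actual API):

    # Sketch: generation writes tokens into a per-stream buffer; the SSE
    # endpoint numbers each event and resumes from whatever Last-Event-ID
    # the client reconnects with. Stream ids and storage are illustrative.
    import asyncio
    from fastapi import FastAPI, Header
    from fastapi.responses import StreamingResponse

    app = FastAPI()
    buffers: dict[str, list[str]] = {}  # stream_id -> tokens produced so far
    done: set[str] = set()              # streams that finished generating

    async def generate(stream_id: str):
        buffers[stream_id] = []
        for token in ["Hello", ", ", "world", "!"]:  # stand-in for the model
            await asyncio.sleep(0.1)
            buffers[stream_id].append(token)
        done.add(stream_id)

    @app.post("/chat/{stream_id}")
    async def start(stream_id: str):
        asyncio.create_task(generate(stream_id))  # kick off generation once
        return {"stream": f"/stream/{stream_id}"}

    @app.get("/stream/{stream_id}")
    async def stream(stream_id: str, last_event_id: str | None = Header(default=None)):
        async def events():
            i = int(last_event_id) + 1 if last_event_id else 0
            while True:
                buf = buffers.get(stream_id, [])
                while i < len(buf):
                    yield f"id: {i}\ndata: {buf[i]}\n\n"
                    i += 1
                if stream_id in done:
                    break
                await asyncio.sleep(0.05)  # wait for more tokens
        return StreamingResponse(events(), media_type="text/event-stream")

The browser's EventSource sends Last-Event-ID automatically when it reconnects, so the client barely has to change.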


