You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?
Could not find any specifics on generative AI in your docs. Thanks
This isn't built specifically for generative AI, but generative AI apps typically have architectural issues that are solved by a good queueing system and worker pool. This is particularly true once you start integrating smaller, self-hosted LLMs or other types of models into your pipeline.
> How do you distribute inference across workers?
In Hatchet, "run inference" would be a task. By default, tasks are dequeued in FIFO order and assigned to an arbitrary available worker, but we give you a few options for controlling how tasks get ordered and distributed. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, at most 1 inference task runs at a time. The purpose of this would be fairness across sessions.
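As a rough sketch of how that might look with the TypeScript SDK (the event name, `sessionId` field, and step body are all illustrative, and the exact concurrency API may differ between SDK versions):

```ts
import Hatchet, { Workflow, ConcurrencyLimitStrategy } from "@hatchet-dev/typescript-sdk";

const hatchet = Hatchet.init();

// Hypothetical "run inference" workflow: at most 1 run at a time per session.
const inferenceWorkflow: Workflow = {
  id: "run-inference",
  description: "Runs a single inference call for a user session",
  on: { event: "inference:requested" },
  concurrency: {
    name: "inference-per-session",
    // Group runs by the session id from the event payload...
    key: (ctx) => ctx.workflowInput().sessionId,
    // ...and allow only one in-flight run per group.
    maxRuns: 1,
    limitStrategy: ConcurrencyLimitStrategy.GROUP_ROUND_ROBIN,
  },
  steps: [
    {
      name: "infer",
      run: async (ctx) => {
        const { sessionId, prompt } = ctx.workflowInput();
        // Call your self-hosted model here; return value must be JSON-serializable.
        return { sessionId, completion: `echo: ${prompt}` };
      },
    },
  ],
};

async function main() {
  const worker = await hatchet.worker("inference-worker");
  await worker.registerWorkflow(inferenceWorkflow);
  await worker.start();
}

main();
```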
> Can one use just any protocol
We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.
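For completeness, the producer side is just pushing a JSON-serializable payload; the SDK handles the gRPC connection to the engine under the hood. Again a hedged sketch, reusing the illustrative event name and fields from above (client method names may vary by SDK version):

```ts
import Hatchet from "@hatchet-dev/typescript-sdk";

const hatchet = Hatchet.init();

async function requestInference() {
  // The payload must be JSON-serializable; the SDK serializes it and
  // sends it to the Hatchet engine over gRPC.
  await hatchet.event.push("inference:requested", {
    sessionId: "session-123",
    prompt: "Summarize this document...",
  });
}

requestInference();
```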
Got it, so the underlying infrastructure (the inference nodes, if you wish) would be something to be solved outside of Hatchet, but it would then allow me to schedule inference tasks per user with limits.