You say this is for generative AI. How do you distribute inference across workers? Can one use just any protocol and how does this work together with the queue and fault tolerance?
Could not find any specifics on generative AI in your docs. Thanks
This isn't built specifically for generative AI, but generative AI apps typically have architectural issues that are solved by a good queueing system and worker pool. This is particularly true once you start integrating smaller, self-hosted LLMs or other types of models into your pipeline.
> How do you distribute inference across workers?
In Hatchet, "run inference" would be a task. By default, tasks are dequeued in FIFO order and assigned to an arbitrary available worker, but we give you a few options for controlling how tasks get ordered and distributed. For example, let's say you'd like to limit users to 1 inference task at a time per session. You could do this by setting a concurrency key "<session-id>" and `maxRuns=1` [1]. This means that for each session key, at most 1 inference task runs at a time. The purpose of this would be fairness across sessions.
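As a rough sketch of how that might look with the TypeScript SDK (the event name, `sessionId` field, and step body are all illustrative, and the exact concurrency API may differ between SDK versions):

```ts
import Hatchet, { Workflow, ConcurrencyLimitStrategy } from "@hatchet-dev/typescript-sdk";

const hatchet = Hatchet.init();

// Hypothetical "run inference" workflow: at most 1 run at a time per session.
const inferenceWorkflow: Workflow = {
  id: "run-inference",
  description: "Runs a single inference call for a user session",
  on: { event: "inference:requested" },
  concurrency: {
    name: "inference-per-session",
    // Group runs by the session id from the event payload...
    key: (ctx) => ctx.workflowInput().sessionId,
    // ...and allow only one in-flight run per group.
    maxRuns: 1,
    limitStrategy: ConcurrencyLimitStrategy.GROUP_ROUND_ROBIN,
  },
  steps: [
    {
      name: "infer",
      run: async (ctx) => {
        const { sessionId, prompt } = ctx.workflowInput();
        // Call your self-hosted model here; return value must be JSON-serializable.
        return { sessionId, completion: `echo: ${prompt}` };
      },
    },
  ],
};

async function main() {
  const worker = await hatchet.worker("inference-worker");
  await worker.registerWorkflow(inferenceWorkflow);
  await worker.start();
}

main();
```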
> Can one use just any protocol
We handle the communication between the worker and the queue through a gRPC connection. We assume that you're passing JSON-serializable objects through the queue.
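For completeness, the producer side is just pushing a JSON-serializable payload; the SDK handles the gRPC connection to the engine under the hood. Again a hedged sketch, reusing the illustrative event name and fields from above (client method names may vary by SDK version):

```ts
import Hatchet from "@hatchet-dev/typescript-sdk";

const hatchet = Hatchet.init();

async function requestInference() {
  // The payload must be JSON-serializable; the SDK serializes it and
  // sends it to the Hatchet engine over gRPC.
  await hatchet.event.push("inference:requested", {
    sessionId: "session-123",
    prompt: "Summarize this document...",
  });
}

requestInference();
```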
Got it, so the underlying infrastructure (the inference nodes, if you wish) would be something to be solved outside of Hatchet, but it would then allow me to schedule inference tasks per user with limits.