Dynamic libraries as currently implemented are a dumpster fire, and I'd really prefer everything to be statically linked. But ideally I'd like to see exploration of a hybrid solution, where library code is tagged inside a binary, so that if the OS detects multiple applications using the same version of a library, it isn't duplicated in RAM. Such a design would also allow libraries to be updated if absolutely necessary, either by the runtime or by some kind of package manager.
OSes already typically look for duplicated code pages as opportunities to dedupe. It doesn't need to be special-cased for code pages, because the same mechanism also finds runtime heap duplicates that appear to be read-only (e.g. JIT-compiled JS pages shared between sites).
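For context, here's a minimal sketch of the heap-dedupe side of this on Linux (kernel same-page merging). It assumes Linux, Python 3.8+, and that KSM is actually running; `mmap.MADV_MERGEABLE` is only defined on Linux, and file-backed code pages are already shared via the page cache without any of this.

```python
# Minimal sketch: opting an anonymous mapping into Linux kernel
# same-page merging (KSM). Assumes Linux, Python 3.8+, and that KSM is
# enabled (echo 1 > /sys/kernel/mm/ksm/run). File-backed code pages are
# shared by the page cache automatically; KSM is the analogous dedupe
# for anonymous (heap / JIT-like) pages, and only scans regions that
# were explicitly marked MADV_MERGEABLE.
import mmap

SIZE = 16 * 1024 * 1024            # 16 MiB of anonymous memory

mm = mmap.mmap(-1, SIZE)           # anonymous, private mapping
mm.write(b"\x00" * SIZE)           # fill it with lots of identical pages
mm.seek(0)

# Ask the kernel to consider these pages for merging. KSM scans them in
# the background and collapses byte-identical pages into one
# copy-on-write physical page.
mm.madvise(mmap.MADV_MERGEABLE)
```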
One challenge is that two random binaries are unlikely to have generated identical code pages for a given source library (even if pinned to the exact same source), because linker and compiler options (e.g. dead-code stripping, optimization level differences, LTO, PGO, etc.) all change the emitted code.
The benefit of sharing libraries is generally limited unless you're using a library that nearly every binary ends up linking, and that has become less likely as the software ecosystem has grown more varied and complex.
I believe NixOS-like "build-time binding" is the answer, especially with Rust's "if it compiles, it works". Software still shares code in the form of libraries, but any set of installed software built against a concrete version of a library it depends on keeps using that concrete version forever (until an update replaces it with new builds made against a different concrete version of the lib).
The system you’re proposing wouldn’t work, because without additional effort in the compiler and linker (which AFAIK doesn’t exist) there won’t be perfectly identical pages for the same static library linked into two different executables. And once you can update the libraries independently, you have all the drawbacks of dynamic libraries again.
Outside of embedded, this kind of reuse is a very marginal memory savings for the overall system to begin with. The key benefit of dynamic libraries for a system with gigabytes of RAM is that you can update a common dependency (e.g. OpenSSL) without redownloading every binary on your system.
I wish the standard way of using shared libraries were to ship the .so files a program wants to dynamically link against alongside the program binary (using RUNPATH), instead of expecting them to exist globally (yes, I mean all shared libraries, even glibc; first and foremost glibc, actually).
This way we'd have no portability issues, get the same benefit as with static linking except that it works with glibc out of the box instead of requiring musl, and we could still benefit from filesystem-level deduplication (e.g. with btrfs) to save disk space and memory.
There are LLMs that can process a 1 million token context window; Amazon Nova 2 is one, even though it's definitely not the highest-quality model. You just put the whole book in context and have the LLM answer questions about it. And given that the domain is pretty limited, you can store the KV cache for the most popular books on SSD, eliminating quite a bit of the cost.
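Rough back-of-envelope on why that works (a sketch only; the dimensions below are assumptions for a generic ~70B-class GQA transformer, not Nova's actual architecture, which isn't public):

```python
# Back-of-envelope size of a 1M-token KV cache. Dimensions are
# assumptions for a generic ~70B-class GQA transformer (80 layers,
# 8 KV heads, head dim 128, bf16 cache); NOT Amazon Nova's real
# dimensions, which aren't public.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2          # bf16
tokens = 1_000_000          # a full 1M-token context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = tokens * bytes_per_token / 1e9
print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_gb:.0f} GB per cached context")
# -> ~320 KiB per token, ~328 GB per full context: painful to keep in
#    VRAM for every book, but cheap to park on a multi-TB SSD and load
#    on demand, skipping the expensive prefill each time.
```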
MLA uses way more flops in order to conserve memory bandwidth; the H20 has plenty of memory bandwidth and almost no flops. MLA makes sense on H100/H800, but on H20, GQA-based models are a much better option.
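One way to see the trade-off (a rough sketch; the dimensions are assumptions based on the published DeepSeek-V3 and Llama-3-405B configs, and the per-token projection matmuls are ignored on both sides):

```python
# Rough decode-time arithmetic intensity of attention, per cached token,
# per layer. Dimensions are assumptions from published configs:
# DeepSeek-V3-style MLA (128 query heads, 512-d latent + 64-d RoPE key)
# vs. Llama-3-405B-style GQA (128 query heads, 8 KV heads, head dim 128).
# Per-token projection matmuls are ignored on both sides.
bytes_per_elem = 2  # bf16 KV cache

# --- GQA ---
n_heads, n_kv_heads, d_head = 128, 8, 128
gqa_bytes = n_kv_heads * 2 * d_head * bytes_per_elem       # read K + V
gqa_flops = n_heads * 2 * (d_head + d_head)                # QK^T + PV
print("GQA :", gqa_flops, "flops /", gqa_bytes, "bytes =",
      round(gqa_flops / gqa_bytes, 1), "flops/byte")

# --- MLA (absorbed / decode form) ---
n_heads, d_latent, d_rope = 128, 512, 64
mla_bytes = (d_latent + d_rope) * bytes_per_elem           # read shared latent
mla_flops = n_heads * 2 * ((d_latent + d_rope) + d_latent) # QK^T + PV against latent
print("MLA :", mla_flops, "flops /", mla_bytes, "bytes =",
      round(mla_flops / mla_bytes, 1), "flops/byte")

# GQA lands around ~16 flops/byte, MLA around ~240 flops/byte: MLA spends
# compute (scarce on H20) to save HBM bandwidth (abundant on H20).
```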
Not sure what you're referring to; do you have a pointer to a technical writeup, perhaps? In both training and inference MLA uses far fewer flops than MHA, which is the gold standard, and gets much better accuracy (model quality) than GQA (see the comparisons in the DeepSeek papers, or try DeepSeek models vs. Llama at long context).
More generally, with whatever hardware architecture you use, you can optimize throughput for your main goal (initially training, later inference) by balancing the other parameters of the model architecture. Even if training ends up suboptimal, if you want to make a global impact with a public model, you aim for the next generation of Nvidia inference hardware.
Didn't DeepSeek figure out how to train with mixed precision and so get much more out of the cards, with a lot of the training steps able to run at what were traditionally post-training-quantization precisions (block compressed)?
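For illustration of what "block compressed" means here: the DeepSeek-V3 report describes fp8 matmuls with fine-grained block/tile-wise scaling factors. Below is only a toy numpy sketch of that block-scaling idea, not DeepSeek's actual fp8 kernels or training recipe.

```python
# Toy sketch of block-scaled fp8-style quantization: quantize a weight
# matrix in 128x128 blocks, each with its own scale, clamping to the
# e4m3 dynamic range (max ~448). Simulation only; real fp8 would also
# round each value to a 3-bit mantissa.
import numpy as np

E4M3_MAX = 448.0
BLOCK = 128

def block_quantize(w: np.ndarray, block: int = BLOCK):
    """Return per-block scales and the rescaled copy of w (kept as float
    here; real fp8 would round to e4m3 after scaling)."""
    rows, cols = w.shape
    scales = np.zeros((rows // block, cols // block), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = np.abs(tile).max() / E4M3_MAX + 1e-12
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = np.clip(tile / scale, -E4M3_MAX, E4M3_MAX)
    return scales, q

w = np.random.randn(512, 512).astype(np.float32)
scales, q = block_quantize(w)
# Dequantize: multiply each block back by its own scale.
deq = q * np.repeat(np.repeat(scales, BLOCK, 0), BLOCK, 1)
print("max abs error:", np.abs(w - deq).max())
# ~machine epsilon here, since we only rescale; actual e4m3 rounding adds
# the quantization error that the per-block scales are there to contain.
```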
Because at this point you have to do inference distributed across multiple nodes: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you'd need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, i.e. 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's a cost of $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only about 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get to work.
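The numbers roughly check out from Llama-3.1-405B's published attention dimensions (126 layers, 8 KV heads, head dim 128); the H200 HBM size, the $3/GPU-hour price, and the ~100k generated tokens per hour are taken from the comment above as assumptions:

```python
# Reproducing the rough numbers above from Llama-3.1-405B's published
# attention dims (126 layers, 8 KV heads, head dim 128). H200 HBM
# (141 GB), $3/GPU-hour and ~100k generated tokens/hour are assumptions
# carried over from the comment.
n_layers, n_kv_heads, d_head = 126, 8, 128
ctx, bytes_per_elem = 10_000_000, 2                  # 10M tokens, bf16

kv_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem   # K and V
kv_total_tb = ctx * kv_per_token / 1e12
weights_tb = 405e9 * bytes_per_elem / 1e12

hbm_per_gpu_tb = 141 / 1000                          # H200
gpus_for_kv = kv_total_tb / hbm_per_gpu_tb
gpus_total = (kv_total_tb + weights_tb) / hbm_per_gpu_tb  # before activations/overhead

tokens_per_hour = 100_000
cost_per_mtok = 48 * 3 / tokens_per_hour * 1e6       # 48 GPUs at $3/hr

print(f"KV cache: {kv_total_tb:.1f} TB (~{gpus_for_kv:.1f} H200s just for the KV cache)")
print(f"KV + bf16 weights: {kv_total_tb + weights_tb:.1f} TB "
      f"(~{gpus_total:.1f} H200s before overhead, ~48 in practice)")
print(f"Cost: ~${cost_per_mtok:.0f} per million tokens at $3/GPU-hour")
# -> ~5.2 TB of KV cache, ~$1440 per million tokens.
```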
How so? As far as I can tell, Microsoft has a large equity interest in OpenAI, and OpenAI has a lot of cloud credits usable on Microsoft’s cloud. I don’t think those credits are transferable to other providers.
The value in the proposition is OpenAI's IP. Money and data centers are commodities, easily replaced, especially when you hold the IP everyone wants a piece of.
The arrangement is mutually beneficial, but the owner of the IP holds the cards.
But how many of them have hot data centers to offer? Google is a direct competitor, so Oracle and Amazon are kinda the only other two big options that could offer them what MS does right now.
If MS drops OpenAI, it's not like they can just seamlessly pivot to running their own data centers with no downtime, even with pretty high investment.
A relationship that’s mutually beneficial needn’t be symmetric. Microsoft’s side of the relationship is fairly commoditized - money and GPUs. OpenAI controls the IP that matters.
I’d note that the supplier of the GPUs is Nvidia, which also offers cloud GPU services and doesn’t have a stake in the GCP/Azure/AWS behemoth battle. I’d actually see that as a more natural relationship with less of a middleman.
The real value Azure brings is enterprise compliance chops. However, IMO AWS Bedrock seems to be a more successful enterprise integration point. But these are all commodity products and don’t provide the value OpenAI brings to the relationship.
The overwhelming majority of flops is indeed spent on matmuls, but the softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at flops alone.
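A quick way to see why (a back-of-envelope sketch; the sequence length, head dim, and the H100-class flop/byte ratio mentioned in the comments are assumptions): in an unfused implementation the N x N score matrix has to make a round trip to HBM just to be softmaxed, and that pass has single-digit arithmetic intensity.

```python
# Back-of-envelope: arithmetic intensity of an unfused attention softmax
# vs. the score matmul. N = sequence length, d = head dim, fp16 tensors.
# An H100-class "machine balance" is ~300 flops/byte
# (~1000 TFLOPS fp16 / ~3.35 TB/s HBM), so anything far below that is
# bandwidth-bound.
N, d, bytes_per_elem = 8192, 128, 2

# QK^T matmul: ~2*N*N*d flops; reads Q and K, writes the N x N scores.
matmul_flops = 2 * N * N * d
matmul_bytes = (2 * N * d + N * N) * bytes_per_elem

# Softmax over the N x N scores: a handful of flops per element
# (max, exp, sum, divide), but the whole matrix is read from and
# written back to HBM.
softmax_flops = 5 * N * N
softmax_bytes = 2 * N * N * bytes_per_elem

print("matmul  :", round(matmul_flops / matmul_bytes), "flops/byte")
print("softmax :", round(softmax_flops / softmax_bytes), "flops/byte")
# ~120+ flops/byte for the matmul vs ~1 flop/byte for the softmax pass:
# the softmax is purely bandwidth-bound, which is why fused attention
# kernels avoid materializing the scores in HBM at all.
```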
Oooooh, I forgot that the self-attention layer has a softmax. I thought this was referring to a softmax on the dense feed-forward layer. Thanks!
Next question: does the softmax in the self-attention block cause it to be bandwidth-bound? Won't it have to materialize all N^2 entries of the score matrix either way? Does the softmax cause redundant data reads?
Yes, but as far as I understand, avoiding that is only really practical with FlashAttention. (The main idea is to compute the softmax online with the log-sum-exp / running-max trick: since you can't know the final max up front, you rescale the partial sums whenever a larger max shows up.)
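A minimal numpy sketch of that online-softmax idea (running max plus rescaling of the partial sums, as in FlashAttention-style kernels; the block size and shapes here are arbitrary, and a real kernel works on tiles of queries in SRAM rather than one query in Python):

```python
# Minimal numpy sketch of the online softmax used in FlashAttention-style
# kernels: process the scores in blocks, keep a running max m and running
# denominator l, and rescale the accumulated output whenever the max grows.
import numpy as np

def online_softmax_attention(q, K, V, block=128):
    """Attention output for a single query q against keys/values K, V,
    computed blockwise without materializing the full score row."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores seen so far
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of V

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale old accumulators to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Check against the naive reference implementation.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(online_softmax_attention(q, K, V), ref))  # True
```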