Dynamic libraries as currently implemented are a dumpster fire, and I'd really prefer everything to be statically linked. But ideally I'd like to see exploration of a hybrid solution, where library code is tagged inside a binary, so that if the OS detects multiple applications using the same version of a library, it isn't duplicated in RAM. Such a design would also allow libraries to be updated if absolutely necessary, either by the runtime or by some kind of package manager.
OSes already typically look for duplicated code pages as opportunities to dedupe. It doesn't need to be special-cased for code pages, because the same mechanism also finds runtime heap duplicates that appear to be read-only (e.g. JIT-compiled JS pages shared between sites).
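For context, here's a minimal sketch of the heap-dedupe side of this on Linux (kernel same-page merging). It assumes Linux, Python 3.8+, and that KSM is actually running; `mmap.MADV_MERGEABLE` is only defined on Linux, and file-backed code pages are already shared via the page cache without any of this.

```python
# Minimal sketch: opting an anonymous mapping into Linux kernel
# same-page merging (KSM). Assumes Linux, Python 3.8+, and that KSM is
# enabled (echo 1 > /sys/kernel/mm/ksm/run). File-backed code pages are
# shared by the page cache automatically; KSM is the analogous dedupe
# for anonymous (heap / JIT-like) pages, and only scans regions that
# were explicitly marked MADV_MERGEABLE.
import mmap

SIZE = 16 * 1024 * 1024            # 16 MiB of anonymous memory

mm = mmap.mmap(-1, SIZE)           # anonymous, private mapping
mm.write(b"\x00" * SIZE)           # fill it with lots of identical pages
mm.seek(0)

# Ask the kernel to consider these pages for merging. KSM scans them in
# the background and collapses byte-identical pages into one
# copy-on-write physical page.
mm.madvise(mmap.MADV_MERGEABLE)
```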
One challenge is that two random binaries are unlikely to have generated identical code pages for a given source library (even if pinned to the exact same source), because linker and compiler options (e.g. dead-code stripping, optimization level differences, LTO, PGO, etc.) all change the emitted code.
The benefit of sharing libraries is generally limited unless you're using a library that nearly every binary ends up linking, and that has become less likely as the software ecosystem has grown more varied and complex.
I believe NixOS-like "build-time binding" is the answer, especially with Rust's "if it compiles, it works". Software still shares code in the form of libraries, but any set of installed software built against a concrete version of a library it depends on keeps using that concrete version forever (until an update replaces it with new builds made against a different concrete version of the lib).
The system you’re proposing wouldn’t work, because without additional effort in the compiler and linker (which AFAIK doesn’t exist) there won’t be perfectly identical pages for the same static library linked into two different executables. And once you can update the libraries independently, you have all the drawbacks of dynamic libraries again.
Outside of embedded, this kind of reuse is a very marginal memory savings for the overall system to begin with. The key benefit of dynamic libraries for a system with gigabytes of RAM is that you can update a common dependency (e.g. OpenSSL) without redownloading every binary on your system.
I wish the standard way of using shared libraries were to ship the .so files a program wants to dynamically link against alongside the program binary (using RUNPATH), instead of expecting them to exist globally (yes, I mean all shared libraries, even glibc; first and foremost glibc, actually).
This way we'd have no portability issues, get the same benefit as with static linking except that it works with glibc out of the box instead of requiring musl, and we could still benefit from filesystem-level deduplication (e.g. with btrfs) to save disk space and memory.
There are LLMs that can process a 1 million token context window; Amazon Nova 2 is one, even though it's definitely not the highest-quality model. You just put the whole book in context and have the LLM answer questions about it. And given that the domain is pretty limited, you can store the KV cache for the most popular books on SSD, eliminating quite a bit of the cost.
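Rough back-of-envelope on why that works (a sketch only; the dimensions below are assumptions for a generic ~70B-class GQA transformer, not Nova's actual architecture, which isn't public):

```python
# Back-of-envelope size of a 1M-token KV cache. Dimensions are
# assumptions for a generic ~70B-class GQA transformer (80 layers,
# 8 KV heads, head dim 128, bf16 cache); NOT Amazon Nova's real
# dimensions, which aren't public.
n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 2          # bf16
tokens = 1_000_000          # a full 1M-token context

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
total_gb = tokens * bytes_per_token / 1e9
print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_gb:.0f} GB per cached context")
# -> ~320 KiB per token, ~328 GB per full context: painful to keep in
#    VRAM for every book, but cheap to park on a multi-TB SSD and load
#    on demand, skipping the expensive prefill each time.
```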
MLA uses way more flops in order to conserve memory bandwidth; the H20 has plenty of memory bandwidth and almost no flops. MLA makes sense on H100/H800, but on H20, GQA-based models are a much better option.
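One way to see the trade-off (a rough sketch; the dimensions are assumptions based on the published DeepSeek-V3 and Llama-3-405B configs, and the per-token projection matmuls are ignored on both sides):

```python
# Rough decode-time arithmetic intensity of attention, per cached token,
# per layer. Dimensions are assumptions from published configs:
# DeepSeek-V3-style MLA (128 query heads, 512-d latent + 64-d RoPE key)
# vs. Llama-3-405B-style GQA (128 query heads, 8 KV heads, head dim 128).
# Per-token projection matmuls are ignored on both sides.
bytes_per_elem = 2  # bf16 KV cache

# --- GQA ---
n_heads, n_kv_heads, d_head = 128, 8, 128
gqa_bytes = n_kv_heads * 2 * d_head * bytes_per_elem       # read K + V
gqa_flops = n_heads * 2 * (d_head + d_head)                # QK^T + PV
print("GQA :", gqa_flops, "flops /", gqa_bytes, "bytes =",
      round(gqa_flops / gqa_bytes, 1), "flops/byte")

# --- MLA (absorbed / decode form) ---
n_heads, d_latent, d_rope = 128, 512, 64
mla_bytes = (d_latent + d_rope) * bytes_per_elem           # read shared latent
mla_flops = n_heads * 2 * ((d_latent + d_rope) + d_latent) # QK^T + PV against latent
print("MLA :", mla_flops, "flops /", mla_bytes, "bytes =",
      round(mla_flops / mla_bytes, 1), "flops/byte")

# GQA lands around ~16 flops/byte, MLA around ~240 flops/byte: MLA spends
# compute (scarce on H20) to save HBM bandwidth (abundant on H20).
```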
Not sure what you're referring to; do you have a pointer to a technical writeup, perhaps? In both training and inference MLA uses far fewer flops than MHA, which is the gold standard, and gets much better accuracy (model quality) than GQA (see the comparisons in the DeepSeek papers, or try DeepSeek models vs. Llama at long context).
More generally, with whatever hardware architecture you use, you can optimize throughput for your main goal (initially training, later inference) by balancing the other parameters of the model architecture. Even if training ends up suboptimal, if you want to make a global impact with a public model, you aim for the next generation of Nvidia inference hardware.
Didn't DeepSeek figure out how to train with mixed precision and so get much more out of the cards, with a lot of the training steps able to run at what were traditionally post-training-quantization precisions (block compressed)?
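For illustration of what "block compressed" means here: the DeepSeek-V3 report describes fp8 matmuls with fine-grained block/tile-wise scaling factors. Below is only a toy numpy sketch of that block-scaling idea, not DeepSeek's actual fp8 kernels or training recipe.

```python
# Toy sketch of block-scaled fp8-style quantization: quantize a weight
# matrix in 128x128 blocks, each with its own scale, clamping to the
# e4m3 dynamic range (max ~448). Simulation only; real fp8 would also
# round each value to a 3-bit mantissa.
import numpy as np

E4M3_MAX = 448.0
BLOCK = 128

def block_quantize(w: np.ndarray, block: int = BLOCK):
    """Return per-block scales and the rescaled copy of w (kept as float
    here; real fp8 would round to e4m3 after scaling)."""
    rows, cols = w.shape
    scales = np.zeros((rows // block, cols // block), dtype=np.float32)
    q = np.zeros_like(w, dtype=np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            scale = np.abs(tile).max() / E4M3_MAX + 1e-12
            scales[i // block, j // block] = scale
            q[i:i + block, j:j + block] = np.clip(tile / scale, -E4M3_MAX, E4M3_MAX)
    return scales, q

w = np.random.randn(512, 512).astype(np.float32)
scales, q = block_quantize(w)
# Dequantize: multiply each block back by its own scale.
deq = q * np.repeat(np.repeat(scales, BLOCK, 0), BLOCK, 1)
print("max abs error:", np.abs(w - deq).max())
# ~machine epsilon here, since we only rescale; actual e4m3 rounding adds
# the quantization error that the per-block scales are there to contain.
```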
Because at this point you have to do inference distributed across multiple nodes: for prefill, because prefill is actually quadratic, but also for memory reasons. The KV cache for 405B at 10M context length would take more than 5 terabytes (at bf16). That's 36 H200s just for the KV cache, but you'd need roughly 48 GPUs to serve the bf16 version of the model. Generation speed at that setup would be roughly 30 tokens per second, i.e. 100k tokens per hour, and you can serve only a single user because batching doesn't make sense at these kinds of context lengths. If you pay 3 dollars per hour per GPU, that's a cost of $1,440 per million tokens. For the fp8 version the numbers are a bit better: you need only 24 GPUs and generation speed stays roughly the same, so it's only about 700 dollars per million tokens. There are architectural modifications that will bring that down significantly, but nonetheless it's still really, really expensive, and also quite hard to get to work.
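The numbers roughly check out from Llama-3.1-405B's published attention dimensions (126 layers, 8 KV heads, head dim 128); the H200 HBM size, the $3/GPU-hour price, and the ~100k generated tokens per hour are taken from the comment above as assumptions:

```python
# Reproducing the rough numbers above from Llama-3.1-405B's published
# attention dims (126 layers, 8 KV heads, head dim 128). H200 HBM
# (141 GB), $3/GPU-hour and ~100k generated tokens/hour are assumptions
# carried over from the comment.
n_layers, n_kv_heads, d_head = 126, 8, 128
ctx, bytes_per_elem = 10_000_000, 2                  # 10M tokens, bf16

kv_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_per_elem   # K and V
kv_total_tb = ctx * kv_per_token / 1e12
weights_tb = 405e9 * bytes_per_elem / 1e12

hbm_per_gpu_tb = 141 / 1000                          # H200
gpus_for_kv = kv_total_tb / hbm_per_gpu_tb
gpus_total = (kv_total_tb + weights_tb) / hbm_per_gpu_tb  # before activations/overhead

tokens_per_hour = 100_000
cost_per_mtok = 48 * 3 / tokens_per_hour * 1e6       # 48 GPUs at $3/hr

print(f"KV cache: {kv_total_tb:.1f} TB (~{gpus_for_kv:.1f} H200s just for the KV cache)")
print(f"KV + bf16 weights: {kv_total_tb + weights_tb:.1f} TB "
      f"(~{gpus_total:.1f} H200s before overhead, ~48 in practice)")
print(f"Cost: ~${cost_per_mtok:.0f} per million tokens at $3/GPU-hour")
# -> ~5.2 TB of KV cache, ~$1440 per million tokens.
```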
How so? As far as I can tell, Microsoft has a large equity interest in OpenAI, and OpenAI has a lot of cloud credits usable on Microsoft’s cloud. I don’t think those credits are transferable to other providers.
The value in the proposition is OpenAI's IP. Money and data centers are commodities, easily replaced, especially when you hold the IP everyone wants a piece of.
The arrangement is mutually beneficial, but the owner of the IP holds the cards.
But how many of them have hot data centers to offer? Google is a direct competitor, so Oracle and Amazon are kinda the only other two big options that could offer them what MS does right now.
If MS drops OpenAI, it's not like they can just seamlessly pivot to running their own data centers with no downtime, even with pretty high investment.
A relationship that’s mutually beneficial needn’t be symmetric. Microsoft’s side of the relationship is fairly commoditized - money and GPUs. OpenAI controls the IP that matters.
I’d note that the supplier of the GPUs is Nvidia, which also offers cloud GPU services and doesn’t have a stake in the GCP/Azure/AWS behemoth battle. I’d actually see that as a more natural relationship with less of a middleman.
The real value Azure brings is enterprise compliance chops. However, IMO AWS Bedrock seems to be a more successful enterprise integration point. But these are all commodity products and don’t provide the value OpenAI brings to the relationship.
The overwhelming majority of flops is indeed spent on matmuls, but the softmax disproportionately uses memory bandwidth, so it generally takes much longer than you'd expect from looking at flops alone.
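A quick way to see why (a back-of-envelope sketch; the sequence length, head dim, and the H100-class flop/byte ratio mentioned in the comments are assumptions): in an unfused implementation the N x N score matrix has to make a round trip to HBM just to be softmaxed, and that pass has single-digit arithmetic intensity.

```python
# Back-of-envelope: arithmetic intensity of an unfused attention softmax
# vs. the score matmul. N = sequence length, d = head dim, fp16 tensors.
# An H100-class "machine balance" is ~300 flops/byte
# (~1000 TFLOPS fp16 / ~3.35 TB/s HBM), so anything far below that is
# bandwidth-bound.
N, d, bytes_per_elem = 8192, 128, 2

# QK^T matmul: ~2*N*N*d flops; reads Q and K, writes the N x N scores.
matmul_flops = 2 * N * N * d
matmul_bytes = (2 * N * d + N * N) * bytes_per_elem

# Softmax over the N x N scores: a handful of flops per element
# (max, exp, sum, divide), but the whole matrix is read from and
# written back to HBM.
softmax_flops = 5 * N * N
softmax_bytes = 2 * N * N * bytes_per_elem

print("matmul  :", round(matmul_flops / matmul_bytes), "flops/byte")
print("softmax :", round(softmax_flops / softmax_bytes), "flops/byte")
# ~120+ flops/byte for the matmul vs ~1 flop/byte for the softmax pass:
# the softmax is purely bandwidth-bound, which is why fused attention
# kernels avoid materializing the scores in HBM at all.
```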
Oooooh, I forgot that the self-attention layer has a softmax. I thought this was referring to a softmax on the dense feed-forward layer. Thanks!
Next question: does the softmax in the self-attention block cause it to be bandwidth-bound? Won't it have to materialize all N^2 entries of the score matrix either way? Does the softmax cause redundant data reads?
Yes, but as far as I understand, avoiding that is only really practical with FlashAttention. (The main idea is to compute the softmax online with the log-sum-exp / running-max trick: since you can't know the final max up front, you rescale the partial sums whenever a larger max shows up.)
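A minimal numpy sketch of that online-softmax idea (running max plus rescaling of the partial sums, as in FlashAttention-style kernels; the block size and shapes here are arbitrary, and a real kernel works on tiles of queries in SRAM rather than one query in Python):

```python
# Minimal numpy sketch of the online softmax used in FlashAttention-style
# kernels: process the scores in blocks, keep a running max m and running
# denominator l, and rescale the accumulated output whenever the max grows.
import numpy as np

def online_softmax_attention(q, K, V, block=128):
    """Attention output for a single query q against keys/values K, V,
    computed blockwise without materializing the full score row."""
    d = q.shape[-1]
    m = -np.inf                                   # running max of scores seen so far
    l = 0.0                                       # running sum of exp(score - m)
    acc = np.zeros_like(V[0], dtype=np.float64)   # running weighted sum of V

    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = (k_blk @ q) / np.sqrt(d)              # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # rescale old accumulators to new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / l

# Check against the naive reference implementation.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((1000, 64)), rng.standard_normal((1000, 64))
s = (K @ q) / np.sqrt(64)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
print(np.allclose(online_softmax_attention(q, K, V), ref))  # True
```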