Curiously, almost all of this video is covered by the computer architecture literature of the late 90s / early 00s. At the time, I recall Tom Knight had done most of the analysis in this video, but I don't know if he ever published it. It was extrapolating into the distant future.
To answer your questions:
- Spatial processors are an insanely good fit for async logic
- A matrix of processing engines is a moderately good fit -- it definitely could be done, but I have no clue if it'd be a good idea.
In an SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation will take as much time as it needs, and no more.
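To make the timing point concrete, here's a toy model (mine, not from the video or any real design): treat an asynchronous ripple-carry adder's latency as the length of its longest carry chain, so operands with few carries genuinely finish sooner than operands with many:

    # Toy model of data-dependent latency in an async ripple-carry adder:
    # "completion time" tracks the longest run of consecutive bits that
    # produce a carry-out, instead of always paying for the worst case
    # the way a fixed clock edge does.

    def ripple_add_latency(a: int, b: int, width: int = 32) -> tuple:
        """Return (sum mod 2**width, longest carry-propagation chain)."""
        carry, longest, current = 0, 0, 0
        for i in range(width):
            x, y = (a >> i) & 1, (b >> i) & 1
            carry = (x & y) | (carry & (x ^ y))     # carry out of bit i
            current = current + 1 if carry else 0   # extend or reset the chain
            longest = max(longest, current)
        return (a + b) & ((1 << width) - 1), longest

    print(ripple_add_latency(0b0001, 0b0010))           # (3, 0): no carries at all
    print(ripple_add_latency(0xFFFFFFFF, 0x00000001))   # (0, 32): carry ripples the full width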
Footnote: Personally, I think there are a lot of good ideas in 80s-era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we could fit *many* more of them on a die than we can modern multiply units. Critically, they could be positioned closer to the data, so there would be less data moving around. The early pipelined / scalar / RISC cores especially seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz, and it has roughly the same number of transistors as 32,000 SA-110 processors (which ran at 200MHz on a 350 nanometer process in the mid-90s).
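(Rough sanity check on that last figure, treating both transistor counts as approximate public numbers rather than exact specs:)

    # Back-of-envelope: how many SA-110-sized cores fit in a 4090's transistor budget?
    rtx_4090_transistors = 76e9   # AD102 die, approximately
    sa_110_transistors = 2.5e6    # StrongARM SA-110, approximately
    print(round(rtx_4090_transistors / sa_110_transistors))  # ~30,000 -- same ballpark as 32,000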
TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.
This was sort of the dream of KNL (Knights Landing), but today I noticed this in the GCC 14 release notes:
"Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are marked as deprecated. GCC will emit a warning when using the -mavx5124fmaps, -mavx5124vnniw, -mavx512er, -mavx512pf, -mprefetchwt1, -march=knl, -march=knm, -mtune=knl or -mtune=knm compiler switches. Support will be removed in GCC 15."
The issue was that coordinating across this kind of hierarchy wasted a bunch of time. If you mostly already knew how to coordinate, you could get better performance instead.
You might be surprised, but we're getting to the point that communicating across a supercomputer is on the same order of magnitude as talking across a NUMA node.
I actually wasn't so much talking from that perspective, as simply from the perspective of the design of individual pieces. There were rather clever things done in e.g. older multipliers or adders or similar which, I think, could apply to most modern parallel architectures, be that GPGPU, SP, MPE, FPGA, or whatever, in order to significantly increase density at a cost of slightly reduced serial performance.
For machine learning, that's a good tradeoff.
Indeed, with some of the simpler architectures, I think computation could be moved into the memory itself, as long dreamed of.
(Simply sticking 32,000 SA-110 processors on a die would be very, very limited by interconnect; there's a good reason the architectures we're actually seeing aren't built that way.)
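To make "clever things in older multipliers" concrete, here's a toy sketch of one such trick -- Booth recoding, which lets a single adder plus a shifter act as the whole multiplier and handles a run of 1s in the multiplier with one add and one subtract instead of one add per set bit. This is my illustration of the general flavor (in Python, for readability), not a claim about any particular chip:

    # Radix-2 Booth multiplication. Combined register layout: [ A (width) | Q (width) | Q-1 (1 bit) ].
    # A starts at 0, Q holds the multiplier; each step inspects (Q0, Q-1), optionally
    # adds or subtracts the multiplicand into A, then arithmetic-shifts the whole register right.

    def booth_multiply(a: int, b: int, width: int = 8) -> int:
        """Multiply two signed width-bit integers with radix-2 Booth recoding."""
        mask = (1 << (2 * width + 1)) - 1
        acc = ((b & ((1 << width) - 1)) << 1) & mask   # A = 0, Q = b, Q-1 = 0
        for _ in range(width):
            q0, q_minus1 = (acc >> 1) & 1, acc & 1
            if (q0, q_minus1) == (0, 1):               # end of a run of 1s: A += a
                acc = (acc + (a << (width + 1))) & mask
            elif (q0, q_minus1) == (1, 0):             # start of a run of 1s: A -= a
                acc = (acc - (a << (width + 1))) & mask
            sign = acc >> (2 * width)                  # arithmetic shift right by one bit
            acc = (acc >> 1) | (sign << (2 * width))
        result = acc >> 1                              # drop Q-1; [A|Q] is the 2*width-bit product
        if result >= 1 << (2 * width - 1):             # reinterpret as signed
            result -= 1 << (2 * width)
        return result

    assert booth_multiply(7, -3) == -21
    assert booth_multiply(-8, -8) == 64
    assert booth_multiply(3, 2) == 6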
I assume no one will read this, but good places to look for super-clever ways to reduce transistor count while maintaining good performance:
- Early mainframes / room-sized computers (era of vacuum tubes and discrete transistors), especially at the upper end, where there was enough budget to build pipelined and scalar architectures that look quite modern.
- Cray X-MP and successors
- DEC Alpha / StrongARM (referenced SA-110)
Bad places to look are all the microcoded architectures. These optimized transistor count, often sacrificing massive amounts of performance in order to save on cost. Ditto for some of the minicomputers, where the goal was to make an "affordable" computer. Something like the PDP was super-clever in its cost-cutting, which made sense at the time, but that does much less to maintain performance.
They do what you were talking about, not what I was.
They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."
There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.
The architecture is reminiscent of some ideas from circa-2000 which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; company was acquihired by EZchip for a fraction of the investment which was put into it, which went to Mellanox, and then to NVidia).
Sweet, thanks! It seems like this research ecosystem was incredibly rich, but Moore's law was in full swing, and statically known workloads weren't useful at the compute scales of the day.
So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.
1) If you took 3 years longer to build a SIMD architecture than Intel took to make a CPU, Intel's chips would be 4x faster by the time you shipped.
2) If, as a customer, I then had to code to your architecture, and that took me 3 more years, by that point Intel would be 16x faster.
And any edge would be lost. The world was really fast-paced. Groq was founded in 2016; it's 2024. If this were still the heyday of Moore's Law, you'd be competing with CPUs running 40x as fast as today's.
I'm not sure you'd be so competitive against a 160GHz processor, and I'm not sure I'd be interested knowing a 300+GHz part was just around the corner.
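(For concreteness, the compounding behind those numbers, assuming the idealized 18-month doubling -- real scaling was of course lumpier:)

    # Idealized "2x every 18 months" compounding.
    def moores_law_speedup(years: float, doubling_period_years: float = 1.5) -> float:
        return 2 ** (years / doubling_period_years)

    print(moores_law_speedup(3))       # ~4x:  ship 3 years late, face 4x-faster CPUs
    print(moores_law_speedup(3 + 3))   # ~16x: customers spend 3 more years porting
    print(moores_law_speedup(8))       # ~40x: 2016 -> 2024 at the old cadence
    print(4 * moores_law_speedup(8))   # ~160: a ~4GHz part scaled the same way, in "GHz"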
Good ideas -- lots of them -- lived in academia, where people could prototype neat architectures on ancient processes and benchmark themselves against CPUs of yesteryear built on those same processes.