Curiously, almost all of this video is covered by the computer architecture literature of the late 90s / early 00s. At the time, I recall Tom Knight had done most of the analysis in this video, but I don't know if he ever published it. It was extrapolating into the distant future.
To answer your questions:
- Spatial processors are an insanely good fit for async logic
- A matrix of processing engines is a moderately good fit -- it definitely could be done, but I have no clue if it'd be a good idea.
In an SP, especially in an ASIC, each computation can start as soon as the previous one finishes. If you have a 4-bit layer, an 8-bit layer, and a 32-bit layer, those will take different amounts of time to run. Individual computations can take different amounts of time too (e.g. an ADD with a lot of carries versus one with just a few). In an SP, a computation will take as much time as it needs, and no more.
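To make the timing point concrete, here's a toy model (mine, not from the video or any real design): treat an asynchronous ripple-carry adder's latency as the length of its longest carry chain, so operands with few carries genuinely finish sooner than operands with many:

    # Toy model of data-dependent latency in an async ripple-carry adder:
    # "completion time" tracks the longest run of consecutive bits that
    # produce a carry-out, instead of always paying for the worst case
    # the way a fixed clock edge does.

    def ripple_add_latency(a: int, b: int, width: int = 32) -> tuple:
        """Return (sum mod 2**width, longest carry-propagation chain)."""
        carry, longest, current = 0, 0, 0
        for i in range(width):
            x, y = (a >> i) & 1, (b >> i) & 1
            carry = (x & y) | (carry & (x ^ y))     # carry out of bit i
            current = current + 1 if carry else 0   # extend or reset the chain
            longest = max(longest, current)
        return (a + b) & ((1 << width) - 1), longest

    print(ripple_add_latency(0b0001, 0b0010))           # (3, 0): no carries at all
    print(ripple_add_latency(0xFFFFFFFF, 0x00000001))   # (0, 32): carry ripples the full width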
Footnote: Personally, I think there are a lot of good ideas in 80s-era and earlier processors for the design of individual compute units which have been forgotten. The basic move in architectures up through 2005 was optimizing serial computation speed at the cost of power and die size (Netburst went up to 3.8GHz two decades ago). With much simpler old-school compute units, we could fit *many* more of them on a die than we can modern multiply units. Critically, they could be positioned closer to the data, so there would be less data moving around. The early pipelined / scalar / RISC cores especially seem very relevant. As a point of reference, a 4090 has 16k CUDA cores running at just north of 2GHz, and it has roughly the same number of transistors as 32,000 SA-110 processors (which ran at 200MHz on a 350 nanometer process in the mid-90s).
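(Rough sanity check on that last figure, treating both transistor counts as approximate public numbers rather than exact specs:)

    # Back-of-envelope: how many SA-110-sized cores fit in a 4090's transistor budget?
    rtx_4090_transistors = 76e9   # AD102 die, approximately
    sa_110_transistors = 2.5e6    # StrongARM SA-110, approximately
    print(round(rtx_4090_transistors / sa_110_transistors))  # ~30,000 -- same ballpark as 32,000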
TL;DR: I'm getting old and either nostalgic or grumpy. Dunno which.
This was sort of the dream of KNL (Knights Landing), but today I noticed this in the GCC 14 release notes:
"Xeon Phi CPUs support (a.k.a. Knight Landing and Knight Mill) are marked as deprecated. GCC will emit a warning when using the -mavx5124fmaps, -mavx5124vnniw, -mavx512er, -mavx512pf, -mprefetchwt1, -march=knl, -march=knm, -mtune=knl or -mtune=knm compiler switches. Support will be removed in GCC 15."
The issue was that coordinating across this kind of hierarchy wasted a bunch of time. If you mostly already knew how to coordinate, you could get better performance instead.
You might be surprised, but we're getting to the point that communicating across a supercomputer is on the same order of magnitude as talking across a NUMA node.
I actually wasn't so much talking from that perspective, as simply from the perspective of the design of individual pieces. There were rather clever things done in e.g. older multipliers or adders or similar which, I think, could apply to most modern parallel architectures, be that GPGPU, SP, MPE, FPGA, or whatever, in order to significantly increase density at a cost of slightly reduced serial performance.
For machine learning, that's a good tradeoff.
Indeed, with some of the simpler architectures, I think computation could be moved into the memory itself, as long dreamed of.
(Simply sticking 32,000 SA-110 processors on a die would be very, very limited by interconnect; there's a good reason the architectures we're actually seeing aren't built that way.)
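To make "clever things in older multipliers" concrete, here's a toy sketch of one such trick -- Booth recoding, which lets a single adder plus a shifter act as the whole multiplier and handles a run of 1s in the multiplier with one add and one subtract instead of one add per set bit. This is my illustration of the general flavor (in Python, for readability), not a claim about any particular chip:

    # Radix-2 Booth multiplication. Combined register layout: [ A (width) | Q (width) | Q-1 (1 bit) ].
    # A starts at 0, Q holds the multiplier; each step inspects (Q0, Q-1), optionally
    # adds or subtracts the multiplicand into A, then arithmetic-shifts the whole register right.

    def booth_multiply(a: int, b: int, width: int = 8) -> int:
        """Multiply two signed width-bit integers with radix-2 Booth recoding."""
        mask = (1 << (2 * width + 1)) - 1
        acc = ((b & ((1 << width) - 1)) << 1) & mask   # A = 0, Q = b, Q-1 = 0
        for _ in range(width):
            q0, q_minus1 = (acc >> 1) & 1, acc & 1
            if (q0, q_minus1) == (0, 1):               # end of a run of 1s: A += a
                acc = (acc + (a << (width + 1))) & mask
            elif (q0, q_minus1) == (1, 0):             # start of a run of 1s: A -= a
                acc = (acc - (a << (width + 1))) & mask
            sign = acc >> (2 * width)                  # arithmetic shift right by one bit
            acc = (acc >> 1) | (sign << (2 * width))
        result = acc >> 1                              # drop Q-1; [A|Q] is the 2*width-bit product
        if result >= 1 << (2 * width - 1):             # reinterpret as signed
            result -= 1 << (2 * width)
        return result

    assert booth_multiply(7, -3) == -21
    assert booth_multiply(-8, -8) == 64
    assert booth_multiply(3, 2) == 6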
I assume no one will read this, but good places to look for super-clever ways to reduce transistor count while maintaining good performance:
- Early mainframes / room-sized computers (era of vacuum tubes and discrete transistors), especially at the upper end, where there was enough budget to build pipelined and scalar architectures that look quite modern.
- Cray X-MP and successors
- DEC Alpha / StrongARM (referenced SA-110)
Bad places to look are all the microcoded architectures. These optimized transistor count, often sacrificing massive amounts of performance in order to save on cost. Ditto for some of the minicomputers, where the goal was to make an "affordable" computer. Something like the PDP was super-clever in its cost-cutting, which made sense at the time, but that does much less to maintain performance.
They do what you were talking about, not what I was.
They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."
There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.
The architecture is reminiscent of some ideas from circa-2000 which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; company was acquihired by EZchip for a fraction of the investment which was put into it, which went to Mellanox, and then to NVidia).
Sweet, thanks! It seems like this research ecosystem was incredibly rich, but Moore's law was in full swing, and statically known workloads weren't useful at the compute scales of the day.
So these specialized approaches never stood a chance next to CPUs. Nowadays the ground is... more fertile.
1) If you took 3 years longer to build a SIMD architecture than Intel took to make a CPU, Intel's chips would be 4x faster by the time you shipped.
2) If, as a customer, I then had to code to your architecture, and that took me 3 more years, by that point Intel would be 16x faster.
And any edge would be lost. The world was really fast-paced. Groq was founded in 2016; it's 2024. If this were still the heyday of Moore's Law, you'd be competing with CPUs running 40x as fast as today's.
I'm not sure you'd be so competitive against a 160GHz processor, and I'm not sure I'd be interested knowing a 300+GHz part was just around the corner.
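(For concreteness, the compounding behind those numbers, assuming the idealized 18-month doubling -- real scaling was of course lumpier:)

    # Idealized "2x every 18 months" compounding.
    def moores_law_speedup(years: float, doubling_period_years: float = 1.5) -> float:
        return 2 ** (years / doubling_period_years)

    print(moores_law_speedup(3))       # ~4x:  ship 3 years late, face 4x-faster CPUs
    print(moores_law_speedup(3 + 3))   # ~16x: customers spend 3 more years porting
    print(moores_law_speedup(8))       # ~40x: 2016 -> 2024 at the old cadence
    print(4 * moores_law_speedup(8))   # ~160: a ~4GHz part scaled the same way, in "GHz"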
Good ideas -- lots of them -- lived in academia, where people could prototype neat architectures on ancient processes and benchmark themselves against CPUs of yesteryear built on those same processes.