I assume no one will read this, but here are good places to look for super-clever ways to reduce transistor count while maintaining good performance:
- Early mainframes / room-sized computers (era of vacuum tubes and discrete transistors), especially at the high end, where there was enough budget to have modern pipelined and superscalar architectures.
- Cray X-MP and successors
- DEC Alpha / StrongARM (referenced SA-110)
Bad places to look are all the microcoded architectures. These optimized for transistor count, often sacrificing massive amounts of performance in order to save on cost. Ditto for some of the minicomputers, where the goal was to make an "affordable" computer. Something like the PDP was super-clever in cost-cutting, which made sense at the time, but it did much less to maintain performance.
They do what you were talking about, not what I was.
They seem annoying. "The IPU has a unique memory architecture consisting of large amounts of In-Processor-Memory™ within the IPU made up of SRAM (organised as a set of smaller independent distributed memory units) and a set of attached DRAM chips which can transfer to the In-Processor-Memory via explicit copies within the software. The memory contained in the external DRAM chips is referred to as Streaming Memory™."
There's a ™ every few words. Those seem like pretty generic terms. That's their technical documentation.
The architecture is reminiscent of some ideas from circa 2000 which didn't pan out. It reminds me of Tilera (the guy who ran it was the Donald Trump of computer architectures; the company was acquihired by EZchip for a fraction of the investment put into it, EZchip then went to Mellanox, and Mellanox went to Nvidia).