Yes, Intel also takes a less than "full" approach to moving from 256b to 512b.
Though I think it is fair to say the Intel implementation represents an intermediate state between the AMD approach (essentially no increase in execution or datapath resources outside of the shuffle unit) and a full doubling of every resource to 512 bits.
Essentially, an SKX chip behaves as if it had 2 full-width 512-bit execution ports: p01 (via fusion) and p5. For 256b it has three ports (p0, p1 and p5). Not all ports can do everything, so the comparison is sometimes 3 vs 2 or 2 vs 1, but also sometimes 2 vs 2 (FMA operations on 2-FMA chips come to mind).
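To make the port picture concrete, here's a toy dot-product kernel (my own sketch, nothing from the original discussion) annotated with where the FMA uops go:

    #include <immintrin.h>
    #include <stddef.h>

    /* Toy dot-product kernel. On a 2-FMA SKX part, 512-bit FMAs can
       issue on p0 and p5, the same 2-per-cycle rate as 256-bit FMAs
       on p0/p1, so widening to zmm doubles flops per uop rather than
       adding ports. Note this simple version is load-bound anyway
       (4 loads : 2 FMAs per iteration, vs 2 loads/cycle), which is
       exactly the load/store ceiling discussed next. Assumes n is a
       multiple of 16. */
    __m512d dot_accum(const double *a, const double *b, size_t n) {
        __m512d acc0 = _mm512_setzero_pd();
        __m512d acc1 = _mm512_setzero_pd();
        for (size_t i = 0; i < n; i += 16) {
            acc0 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i),
                                   _mm512_loadu_pd(b + i), acc0);
            acc1 = _mm512_fmadd_pd(_mm512_loadu_pd(a + i + 8),
                                   _mm512_loadu_pd(b + i + 8), acc1);
        }
        return _mm512_add_pd(acc0, acc1);
    }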
Critically, however, the load and store units were also extended to 512 bits: SKX can do 2 loads (1024 bits) and 1 store (512 bits) per cycle. This puts a hard cap on the performance of load- and store-heavy AVX methods, which includes some fairly trivial but important integer loops like memcpy, memset and memchr type stuff, which are fast enough to hit the load or store limits.
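As a minimal sketch of my own (not the poster's code), a memcpy-style inner loop runs into that store limit immediately:

    #include <immintrin.h>
    #include <stddef.h>

    /* 64 bytes per iteration: one 512-bit load + one 512-bit store.
       On SKX the one-512-bit-store-per-cycle limit caps this loop at
       ~64 bytes/cycle no matter what the surrounding code does, even
       though the two load ports could source 128 bytes/cycle.
       Assumes n is a multiple of 64; the loadu/storeu forms tolerate
       unaligned pointers. */
    void copy512(char *dst, const char *src, size_t n) {
        for (size_t i = 0; i < n; i += 64) {
            __m512i v = _mm512_loadu_si512((const void *)(src + i));
            _mm512_storeu_si512((void *)(dst + i), v);
        }
    }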
Maxing out memBW requires multiple threads because Intel cores are relatively limited in line fill buffers. I've seen around 12 GB/s per SKX core with AVX-512.
You usually don't even need AVX-512 to sustain enough loads/stores at the core to max out memory bandwidth "in theory": even with 256-bit loads, and assuming 2 loads and 1 store per cycle (ICL/Zen 3 and newer can do more), that's 256 GB/s of read bandwidth or 128 GB/s of write bandwidth (or both, at once!) at 4 GHz.
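Spelling out that arithmetic (same numbers as above, just computed explicitly, with 4 GHz as the assumed clock):

    #include <stdio.h>

    /* Peak load/store bandwidth with 256-bit accesses at 4 GHz:
       2 loads/cycle and 1 store/cycle, 32 bytes each.
       bytes/cycle * Gcycles/s = GB/s. */
    int main(void) {
        double ghz = 4.0;
        printf("read:  %.0f GB/s\n", 2 * 32 * ghz);  /* 256 GB/s */
        printf("write: %.0f GB/s\n", 1 * 32 * ghz);  /* 128 GB/s */
        return 0;
    }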
Indeed, you can reach these numbers if you always hit in L1, and come close if you always hit in L2. The load number especially is higher than almost any single-socket memory bandwidth until perhaps very recently*: an 8-channel chip with the fastest DDR4-3200 would get 25.6 x 8 = 204.8 GB/s of max theoretical bandwidth. Most chips have fewer channels and lower max theoretical bandwidth.
However, and as a sibling comment alludes to, you generally cannot in practice sustain enough outstanding misses from a single core to actually achieve this number. E.g., with 16 outstanding misses, 64-byte cache lines and 100 ns memory latency, you can only demand-fetch at ~10 GB/s from one core. Actual numbers are higher due to prefetching, which both decreases the effective latency (since the prefetch is initiated from a component closer to the memory controller) and makes more outstanding misses available (since there are more miss buffers at the L2 than at the core), but this only roughly doubles the bandwidth: it's hard to get more than 20-30 GB/s from a single core on Intel.
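The back-of-the-envelope here is just Little's law, bandwidth = outstanding misses x line size / latency (my framing of the arithmetic in the text):

    #include <stdio.h>

    /* Little's law for demand misses: concurrency * line size / latency.
       16 in-flight misses, 64-byte lines, 100 ns latency -> ~10 GB/s.
       Prefetch cutting effective latency and adding L2-level concurrency
       roughly doubles this, matching the 20-30 GB/s ceiling above. */
    int main(void) {
        double misses = 16, line_bytes = 64, latency_ns = 100;
        /* bytes per ns == GB/s */
        printf("demand BW: %.1f GB/s\n", misses * line_bytes / latency_ns);
        return 0;
    }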
This isn't a fundamental limitation that applies to every CPU, however: Apple chips can extract the entire bandwidth from a single core, despite having much narrower 128-bit (perhaps 256-bit if you count load-pair) load and store instructions.
---
* Not really sure about this one: are there 16-channel DDR5 setups out there yet? (16 DDR5 channels correspond to 8 independent DIMMs, so it's similar to an 8-channel DDR4 setup, since DDR5 has 2 channels per DIMM.)