I'm curious what you did with the "active sorting range" after a push/pop event. Since it's a vector underneath, I don't see any option other than to sort the entire range after each event, O(N). This would surely destroy performance, right?
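For what it's worth, a full re-sort isn't the only option for a vector: after a single push you can restore order with a binary search plus a rotate, which costs O(log N) comparisons and O(N) element moves rather than an O(N log N) sort. A minimal sketch of that idea (assuming a plain std::vector; I have no idea what the library actually does):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// After push_back, move the new element into its sorted position with a
// rotate: O(log N) comparisons plus O(N) element moves, instead of
// re-sorting the whole range.
void push_sorted(std::vector<int>& v, int value) {
    v.push_back(value);
    auto pos = std::lower_bound(v.begin(), v.end() - 1, value);
    std::rotate(pos, v.end() - 1, v.end());
}
```

Whether the O(N) move is acceptable obviously depends on how hot the push/pop path is, which I think is the heart of your question.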
> To Koreans, they looked more like sauce bowls, leading them to conclude that the Japanese had starved themselves to stretch out the siege.
As a Bengali man, that's exactly how I felt when I came to the USA and first visited Japanese restaurants. Part of the reason we consume so much rice is that rice is the main dish (not a side): it takes up the central and largest part of your plate.
A typical Japanese person will work through their small rice bowl until not a single grain is left, since they're taught from a very young age not to waste food.
Most other Asian nationalities will not finish their rice completely. Even with their most delicious biryani there are always many rice grains left on the plate. I think the small bowl makes it much easier to finish the rice completely, unlike a big bowl or plate.
The Japanese mostly eat sticky rice, which is very easy to eat and "clean up" even with chopsticks.
The Indian subcontinent eats long-grain Basmati or similar rice, which fluffs up into individual grains on the plate. It doesn't make sense to individually pick out single leftover grains.
Nearly every culture has the idea of "Annapurna", or a god of food, and wasting food is generally frowned upon and considered bad table manners. I was scolded plenty of times as a child in Nepal for not cleaning my plate.
I wouldn't attribute it to small bowls at least. The Japanese instilling good virtues into their children almost institutionally perhaps plays some part in it, but also some of it is just physics.
Having had grandparents who lived through WWII (or any other war, to be fair) also helps instill this attitude. I can barely imagine the kind of famines they had to endure.
Sure. In detail, and abstracted slightly, the byte table problem:
Maybe you're remapping RGB values [0..255] with a tone curve in graphics, or doing a mapping lookup of IDs to indexes in a set, or a permutation table, or... well, there are a lot of use cases, right? This is essentially an arbitrary function lookup where the domain and range are both bytes.
It looks like this in scalar code:
void transform_lut(byte* dest, const byte* src, int size, const byte* lut) {
    for (int i = 0; i < size; i++) {
        dest[i] = lut[src[i]];
    }
}
The function above is basically load/store limited - it's doing negligible arithmetic, just loading a byte from the source, using that to index a load into the table, and then storing the result to the destination. So two loads and a store per element. Zen5 has 4 load pipes and 2 store pipes, so our CPU can do two elements per cycle in scalar code. (Zen4 has only 1 store pipe, so 1 per cycle there)
Here's a snippet of the AVX512 version.
You load the 256-byte lookup table into 4 registers (p0 through p3) outside the loop.
Then, for each SIMD vector of 64 elements, use each lane's value as an index into the lookup table, just like the scalar version. Since each permute can only index 128 bytes of table, we DO have to do it twice, once for the lower half and again for the upper half, and use a mask to choose between them appropriately on a per-element basis.
auto tLow = _mm512_permutex2var_epi8(p0, x, p1);
auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
You can use _mm512_movepi8_mask to load the mask register. That instruction sets a lane active if the high bit of its byte is set, which perfectly matches our table split. You could use the mask register directly on the second shuffle instruction or in a later blend instruction; it doesn't really matter.
For every 64 bytes, the AVX512 version does one load, one store, and two permutes, and Zen5 can execute two permutes per cycle. So 64 elements per cycle.
So our theoretical speedup here is ~32x over the scalar code! You could pull tricks like this with SSE and pshufb, but its 16-byte lookup table is too small to be really useful. Being able to do an arbitrary, super-fast byte-to-byte transform is incredibly useful.
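For completeness, here's the whole thing put together as I understand it. The vector path assumes AVX512-VBMI (for _mm512_permutex2var_epi8); compiled without it, this sketch just falls back to the scalar loop, and the choice of a blend versus using the mask on the second permute is the judgment call mentioned above:

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__AVX512VBMI__)
#include <immintrin.h>
#endif

using byte = std::uint8_t;

void transform_lut(byte* dest, const byte* src, std::size_t size, const byte* lut) {
    std::size_t i = 0;
#if defined(__AVX512VBMI__)
    // Load the 256-byte table into four 64-byte registers, once, outside the loop.
    auto p0 = _mm512_loadu_si512(lut);
    auto p1 = _mm512_loadu_si512(lut + 64);
    auto p2 = _mm512_loadu_si512(lut + 128);
    auto p3 = _mm512_loadu_si512(lut + 192);
    for (; i + 64 <= size; i += 64) {
        auto x = _mm512_loadu_si512(src + i);
        // Each permute indexes 128 bytes of the table (low 7 bits of each lane).
        auto tLow  = _mm512_permutex2var_epi8(p0, x, p1);
        auto tHigh = _mm512_permutex2var_epi8(p2, x, p3);
        // The high bit of each input byte selects which half of the table to keep.
        __mmask64 hi = _mm512_movepi8_mask(x);
        _mm512_storeu_si512(dest + i, _mm512_mask_blend_epi8(hi, tLow, tHigh));
    }
#endif
    for (; i < size; i++)  // scalar tail (or full scalar fallback)
        dest[i] = lut[src[i]];
}
```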
One piece of feedback about the examples, from someone interested in using this: I have looked at several, and they seem too high-level to give a sense of the actual API (i.e. the expected benefit of using this library versus the development complexity of using it).
For example, the cloth bending simulation is almost entirely: in __init__, call a function to add a cloth mesh to a model builder object, then pass the built model to the initializer of a solver class; and at each timestep, call a collide-model function, then another function called solver.step. That's really it.
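To make the point concrete, the whole usage pattern described above reduces to roughly this shape (the class and method names here are stand-ins I made up to show the call pattern, not the library's real API):

```python
# Stand-in stubs, NOT the library's real classes: they exist only to show
# how little of the API the example actually exercises.
class ModelBuilder:
    def __init__(self):
        self.meshes = []

    def add_cloth_mesh(self, vertices, faces):
        self.meshes.append((vertices, faces))

    def finalize(self):
        return {"meshes": self.meshes}


class Solver:
    def __init__(self, model):
        self.model = model
        self.time = 0.0

    def collide(self):
        pass  # collision handling happens inside the library, hidden from the user

    def step(self, dt):
        self.time += dt  # one integration step, also hidden


# __init__-time setup: build the model, hand it to the solver.
builder = ModelBuilder()
builder.add_cloth_mesh(vertices=[(0, 0, 0), (1, 0, 0)], faces=[(0, 1)])
solver = Solver(builder.finalize())

# Per-timestep loop: collide, then step. That really is the whole surface.
for _ in range(10):
    solver.collide()
    solver.step(dt=1 / 60)
```

Which is exactly the problem: the example shows that the calls are few, but not what using the API actually costs or buys you.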
"Earlier this year, World Liberty, the crypto firm run by the Trumps and Witkoffs, announced an agreement with an investment firm backed by the ruling family of the U.A.E. The Emirati firm would conduct a $2 billion transaction using World Liberty’s digital coins, a deal that would provide a windfall to the Trump and Witkoff families."
One of NYT's recent podcasts (The Daily) covered this: basically, the Biden administration was reluctant to give the UAE access to Nvidia chips because of its close dealings with China. Two weeks after this crypto investment, the White House agreed to give the UAE access to the chips.
I'm a big, big fan of the Acquired podcast; I've listened to every single one of their episodes.
But I remember their entire run of Microsoft episodes felt like a lengthy defense of Steve Ballmer. There were too many instances of "here's why this bad decision of Steve's made sense given the circumstances" or "here is how people underestimate Steve's contribution to this good decision." They were all well-argued points, of course, but so numerous that I found myself wondering whether the hosts have a relationship with Steve.
The existence of this interview does not help with that suspicion.
I think the big thing was that Steve did make a lot of great decisions, some of the best the company could have made at the time in those respective fields, but he completely missed on everything that Apple did: portable media players, smartphones, and tablets. Those are the three huge misses, and that is really where it counted.
The old three envelope joke.
You become CEO and there are three envelopes on your office desk, with a note that says "Every time there is an issue, open them in order and do what is inside." The first says "Blame your predecessor." The second says "Blame yourself." The third says "Prepare three envelopes."
This is the problem with podcasts, but also with modern media in general. You have to play softball or be ideologically homogeneous to get access. Anything else has a negative k for a variety of reasons.
I haven't listened to all their MSFT coverage but it's possible they genuinely feel Ballmer's gotten a worse rap than he deserves and they're trying to contextualize some of the decisions and circumstances.
Yes, but as sibling comment says there's that thing about softball.
I think Ballmer was better than how he was perceived, so I did expect some justification, etc. in their MS episodes. But these points seemed, to use your word, numerous. I think they must have done this because, in preparation for those MS episodes, they talked to Ballmer and expected him to listen to the episodes. Comparatively, their takes in episodes like Bernard Arnault/LVMH and Hermès seemed somewhat balanced.
People don't understand that other countries (the primary suppliers of STEM graduate students) do have lots of research positions; it's just that they don't usually get first-rate talent, because the USA is far more attractive to those people. Now they will.
In one view, the fact that it's a software problem is actually a weakness of (GPU) hardware design.
In the olden, serial-computing days, our algorithms were standard, and CPU designers did all sorts of behind-the-scenes tricks to improve performance without burdening software developers. It wasn't perfect abstraction, but they tried. Algorithms led the way; hardware had to follow.
CUDA threw all that away and exposed lots of ugly details of GPU hardware design that developers _had to_ take into account. This is why, for a long time, CUDA's primary customers (the HPC community and national labs) refused to adopt it.
It's interesting, now that CUDA has become a legitimate, widely adopted computing paradigm, how much our view on this has shifted.
I don't believe you really can in the GPU world. With a CPU, if you ignore something important like the cache hierarchy, the performance penalty is likely to be in the double-digit percentages, something people can and do often ignore. With a GPU, there are many, many things (memory coalescing, warps, SRAM) that can have a triple-digit percentage impact, hell, maybe even more than that.
> Iran is the principle destabilising element in the middle east
Says Israel, the nation that tore up every single international law, directly led a campaign against the UN and the ICC, and whose right wing (the ones in power now) has been dreaming about a Greater Israel that threatens the territorial integrity of some 10 different Middle Eastern countries.
I understand Iran is a headache for Israel, but did it have to become an enemy of the USA? Aren't Iran's ambitions, and its proxies, all regional in nature? Have they ever attempted to harm an American living in America?
Israel has led an amazingly successful campaign in presenting its problems (often arising out of its territorial ambitions) as a problem for the entire West.
I agree, which is why we need to get all these evangelical nuts, who are actively trying to destroy the world so that Jesus comes back, out of power. No more death cults!