Where it mattered for me was on an ARM core managing a much larger DSP. The DSP consumed most of the memory bandwidth, so fetching a cache line of instructions or an MMU mapping into the ARM had long and variable latency, because each fetch had to wait for the DSP to finish a large burst to or from the shared memory.