the "streaming" mode optimization which skips cacheline allocation
for fully-dirty lines is frequently defeated when coherent processors
perfom stores simultaneously
this results in cachelines being allocated in SMP which are not
allocated when run in uniprocessor, resulting in a significant
reduction in aggregate write bandwidth. for example, on Tegra 2
systems with 300MHz DDR main memory, running memset over a large
buffer (i.e., L2 miss) on a single processor will achieve ~2GB/sec
of write bandwidth, but if the same operation is run in parallel on
both CPUs, the aggregate write bandwidth is just 500MB/sec
changing the cache allocation policy to read-allocate reduces some
of this performance loss on SMP systems.
Change-Id: Ice47ab0a15f2490b7e9a007b4b37800566ed7be1
Signed-off-by: Gary King <gking@nvidia.com>
* NOS = PRRR[24+n] = 1 - not outer shareable
*/
ldr r5, =0xff0a81a8 @ PRRR
- ldr r6, =0x40e040e0 @ NMRR
+#ifdef CONFIG_SMP
+ ldr r6, =0xc0e0c0e0 @ NMRR
+#else
+ ldr r6, =0x40e040e0
+#endif
mcr p15, 0, r5, c10, c2, 0 @ write PRRR
mcr p15, 0, r6, c10, c2, 1 @ write NMRR
#endif