2004-03-13 04:14:21

by Bryan Rittmeyer

Subject: [PATCH] ppc32 copy_to_user dcbt fixup

copy_tofrom_user and copy_page use dcbt to prefetch source data [1].
Since at least 2.4.17, these functions have been prefetching
beyond the end of the source buffer, leading to two problems:

1. Subtly broken software cache coherency. If the area following src
was invalidate_dcache_range'd prior to submitting for DMA,
an out-of-bounds dcbt from copy_to_user of a separate slab object
may read in the area before DMA completion. When the DMA does complete,
data will not be loaded from RAM because stale data is already in cache.
Thus you get a corrupt network packet, bogus audio capture, etc.
(a rough sketch of this sequence follows problem 2 below).

This problem probably does not affect hardware coherent systems
(all Apple machines?). However:

2. The extra 'dcbt' wastes bus bandwidth. Worst case: on a 128-byte copy,
we currently dcbt 256 bytes. These extra loads trash the cache, potentially
causing writeback of more useful data.
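
To make problem 1 concrete, here is a rough sketch of the failing
sequence (my_start_rx_dma() and rx_buf are made-up names;
invalidate_dcache_range() is the arch/ppc routine mentioned above):

/* software-coherent system: driver receive buffer, to be filled by DMA */
static char rx_buf[512];

static void my_start_rx_dma(void)
{
	/* throw away any cached copy before the device writes RAM */
	invalidate_dcache_range((unsigned long)rx_buf,
				(unsigned long)rx_buf + sizeof(rx_buf));

	/* ... program the DMA engine to fill rx_buf ... */
}

/*
 * Meanwhile, copy_to_user() of an unrelated object that happens to sit
 * just below rx_buf in memory prefetches one cache line past the end of
 * its src: that stray dcbt pulls the first line of rx_buf back into L1
 * while the DMA is still running.  When the DMA completes, the CPU keeps
 * hitting the stale cached line instead of re-reading RAM, so the start
 * of the packet (or audio sample block) is garbage.
 */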

The attached patch attempts to rein in dcbt prefetching at the end of
copies such that we do not read beyond the src area. This change fixes
DMA data corruption on software coherent systems and improves
performance slightly in my lame microbenchmark [2].

[1] csum_partial_copy_generic does not use dcbt/dcbz despite being
scorching hot in TCP workloads. I'm cooking up another patch to
dcb?ize it.

[2] http://staidm.org/linux/ppc/copy_dcbt/copyuser-microbench.tar.bz2
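
For reference, here is the idea of the fix as a standalone C sketch
(illustrative only -- the real change is the attached asm; LINE,
MAX_PREFETCH and prefetch() are stand-ins for the kernel's
CACHELINE_BYTES, MAX_COPY_PREFETCH and the dcbt instruction):

#include <string.h>

#define LINE		32
#define MAX_PREFETCH	4
#define prefetch(p)	__builtin_prefetch(p)	/* plays the role of dcbt */

static void copy_lines(char *dst, const char *src, unsigned long nlines)
{
	unsigned long ahead = (nlines > 1) ? MAX_PREFETCH : 0;
	unsigned long i;

	if (ahead > nlines)
		ahead = nlines;

	/* prime the prefetch window; never touches memory past src */
	for (i = 0; i < ahead; i++)
		prefetch(src + i * LINE);

	for (i = 0; i < nlines; i++) {
		/* keep prefetching only while the line is still inside src */
		if (i + ahead < nlines)
			prefetch(src + (i + ahead) * LINE);
		memcpy(dst + i * LINE, src + i * LINE, LINE);
	}
}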

Comments?

-Bryan


Attachment: dcbt.patch (2.08 kB)

2004-03-13 04:55:07

by Benjamin Herrenschmidt

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup


> [1] csum_partial_copy_generic does not use dcbt/dcbz despite being
> scorching hot in TCP workloads. I'm cooking up another patch to
> dcb?ize it.
>
> [2] http://staidm.org/linux/ppc/copy_dcbt/copyuser-microbench.tar.bz2

I'll have a good look at it asap. Also, keep in mind that between the
dcbz and the time you can actually write to that cache line, some CPUs
may need some time to complete the L1 dcbz operation, which can lead
to poor performance; I've been told this is the case on POWER3 at
least, maybe others. It would be wise to issue the dcbz as far in
advance of the actual write to the cache line as possible.
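
Roughly, I mean something like this (userspace-style sketch, untested;
a 32-byte line and a cacheline-aligned, whole-lines dst are assumed):

#include <string.h>

#define LINE	32

/* zero a destination cache line without reading it from RAM first */
static inline void dcbz_line(void *p)
{
	__asm__ __volatile__("dcbz 0,%0" : : "r" (p) : "memory");
}

static void copy_zero_ahead(char *dst, const char *src, unsigned long nlines)
{
	unsigned long i;

	if (!nlines)
		return;

	dcbz_line(dst);			/* establish the first line early */
	for (i = 0; i < nlines; i++) {
		/* dcbz the *next* line before filling this one, so the
		   zeroing has time to complete before its stores arrive */
		if (i + 1 < nlines)
			dcbz_line(dst + (i + 1) * LINE);
		memcpy(dst + i * LINE, src + i * LINE, LINE);
	}
}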

Ben.


2004-03-13 07:47:50

by Bryan Rittmeyer

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

On Sat, Mar 13, 2004 at 03:50:03PM +1100, Benjamin Herrenschmidt wrote:
> It would be wise to issue the dcbz as far in advance of the actual
> write to the cache line as possible.

I guess we could try "pre-dcbz" à la the dcbt prefetch code.
Even dcbz right before the stw is probably cheaper than loading
data from RAM that will be totally overwritten. It's hard to
lose by eliminating bus I/O (especially reads).

Any comments on the dcbt prefetch patch?

-Bryan

2004-03-13 08:41:33

by Benjamin Herrenschmidt

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

On Sat, 2004-03-13 at 18:49, Bryan Rittmeyer wrote:
> On Sat, Mar 13, 2004 at 03:50:03PM +1100, Benjamin Herrenschmidt wrote:
> > It would be wise to issue the dcbz as far in advance of the actual
> > write to the cache line as possible.
>
> I guess we could try "pre-dcbz" à la the dcbt prefetch code.
> Even dcbz right before the stw is probably cheaper than loading
> data from RAM that will be totally overwritten. It's hard to
> lose by eliminating bus I/O (especially reads).

Except if the target is already in the cache... difficult choice.

> Any comments on the dcbt prefetch patch?

Didn't have time to look at it in detail yet; maybe not before
Monday or Tuesday.

Ben


2004-03-13 09:11:16

by Eugene Surovegin

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

On Fri, Mar 12, 2004 at 08:15:47PM -0800, Bryan Rittmeyer wrote:
> copy_tofrom_user and copy_page use dcbt to prefetch source data [1].
> Since at least 2.4.17, these functions have been prefetching
> beyond the end of the source buffer, leading to two problems:
>
> 1. Subtly broken software cache coherency. If the area following src
> was invalidate_dcache_range'd prior to submitting for DMA,
> an out-of-bounds dcbt from copy_to_user of a separate slab object
> may read in the area before DMA completion. When the DMA does complete,
> data will not be loaded from RAM because stale data is already in cache.
> Thus you get a corrupt network packet, bogus audio capture, etc.
>

I reported this problem on the -embedded list half a year ago.

This is already fixed in the 2.4 tree; not sure about 2.6.

Eugene.

2004-03-13 14:39:58

by Danjel McGougan

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

Bryan Rittmeyer wrote:
> copy_tofrom_user and copy_page use dcbt to prefetch source data [1].
> Since at least 2.4.17, these functions have been prefetching
> beyond the end of the source buffer, leading to two problems:
>
> 1. Subtly broken software cache coherency. If the area following src
> was invalidate_dcache_range'd prior to submitting for DMA,
> an out-of-bounds dcbt from copy_to_user of a separate slab object
> may read in the area before DMA completion. When the DMA does complete,
> data will not be loaded from RAM because stale data is already in cache.
> Thus you get a corrupt network packet, bogus audio capture, etc.
>
[snip]

I am no expert on the ppc arch, but in my humble opinion it seems
strange to invalidate the dcache *before* the memory-writing
DMA transaction. The obvious solution is to invalidate the dcache
*after* DMA completion. It seems hard to guarantee that nobody will
touch the memory area in question during the DMA.

Just some clue-less comments from a linux-kernel lurker.

Regards,
Danjel

2004-03-14 22:34:03

by Bryan Rittmeyer

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

On Sat, Mar 13, 2004 at 03:39:57PM +0100, Danjel McGougan wrote:
> The obvious solution is to invalidate the dcache
> *after* DMA completion. It seems hard to guarantee that nobody will
> touch the memory area in question during the DMA.

You need to invalidate prior to submitting for DMA. Otherwise, if the
region is dirty, the CPU may write it back after DMA has begun, causing
a data-corrupting race.

Invalidating before _and_ after is expensive; better to fix the misreads.
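
In other words the receive path has to look like this (rough sketch;
my_kick_dma() stands in for whatever starts the transfer):

static void my_start_dma_from_device(void *buf, unsigned long len)
{
	/* drop any (possibly dirty) cached copy *before* the device
	   starts writing; a dirty line written back mid-DMA would
	   overwrite device data in RAM */
	invalidate_dcache_range((unsigned long)buf,
				(unsigned long)buf + len);

	my_kick_dma(buf, len);		/* device now owns buf */

	/* nothing may touch buf until the completion interrupt --
	   which is exactly the rule the stray dcbt breaks */
}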

-Bryan

2004-03-15 08:39:14

by Segher Boessenkool

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

> It would be wise to issue the dcbz as far in advance of the actual
> write to the cache line as possible.

And use dcbzl instead, if available...


Segher

2004-03-16 02:02:22

by Bryan Rittmeyer

Subject: Re: [PATCH] ppc32 copy_to_user dcbt fixup

On Sat, Mar 13, 2004 at 01:11:10AM -0800, Eugene Surovegin wrote:
> I reported this problem on the -embedded list half a year ago.
> This is already fixed in the 2.4 tree; not sure about 2.6.

Thanks. The fix in linuxppc-2.4 is cleaner than my prior patch.
Here's a forward port to linuxppc-2.5-benh:

--- linuxppc-2.5-benh/arch/ppc/lib/string.S.orig 2004-02-19 18:08:13.000000000 -0800
+++ linuxppc-2.5-benh/arch/ppc/lib/string.S 2004-03-15 17:05:38.000000000 -0800
@@ -436,48 +436,57 @@
73: stwu r9,4(r6)
bdnz 72b

+ .section __ex_table,"a"
+ .align 2
+ .long 70b,100f
+ .long 71b,101f
+ .long 72b,102f
+ .long 73b,103f
+ .text
+
58: srwi. r0,r5,LG_CACHELINE_BYTES /* # complete cachelines */
clrlwi r5,r5,32-LG_CACHELINE_BYTES
li r11,4
beq 63f

-#if !defined(CONFIG_8xx)
+#ifdef CONFIG_8xx
+ /* Don't use prefetch on 8xx */
+ mtctr r0
+53: COPY_16_BYTES_WITHEX(0)
+ bdnz 53b
+
+#else /* not CONFIG_8xx */
/* Here we decide how far ahead to prefetch the source */
+ li r3,4
+ cmpwi r0,1
+ li r7,0
+ ble 114f
+ li r7,1
#if MAX_COPY_PREFETCH > 1
/* Heuristically, for large transfers we prefetch
MAX_COPY_PREFETCH cachelines ahead. For small transfers
we prefetch 1 cacheline ahead. */
cmpwi r0,MAX_COPY_PREFETCH
- li r7,1
- li r3,4
- ble 111f
+ ble 112f
li r7,MAX_COPY_PREFETCH
-111: mtctr r7
-112: dcbt r3,r4
+112: mtctr r7
+111: dcbt r3,r4
addi r3,r3,CACHELINE_BYTES
- bdnz 112b
-#else /* MAX_COPY_PREFETCH == 1 */
- li r3,CACHELINE_BYTES + 4
- dcbt r11,r4
-#endif /* MAX_COPY_PREFETCH */
-#endif /* CONFIG_8xx */
-
- mtctr r0
-53:
-#if !defined(CONFIG_8xx)
+ bdnz 111b
+#else
dcbt r3,r4
+ addi r3,r3,CACHELINE_BYTES
+#endif /* MAX_COPY_PREFETCH > 1 */
+
+114: subf r8,r7,r0
+ mr r0,r7
+ mtctr r8
+
+53: dcbt r3,r4
54: dcbz r11,r6
-#endif
-/* had to move these to keep extable in order */
.section __ex_table,"a"
.align 2
- .long 70b,100f
- .long 71b,101f
- .long 72b,102f
- .long 73b,103f
-#if !defined(CONFIG_8xx)
.long 54b,105f
-#endif
.text
/* the main body of the cacheline loop */
COPY_16_BYTES_WITHEX(0)
@@ -495,6 +504,11 @@
#endif
#endif
bdnz 53b
+ cmpwi r0,0
+ li r3,4
+ li r7,0
+ bne 114b
+#endif /* CONFIG_8xx */

63: srwi. r0,r5,2
mtctr r0
--- linuxppc-2.5-benh/arch/ppc/kernel/misc.S.orig 2004-03-15 17:20:13.000000000 -0800
+++ linuxppc-2.5-benh/arch/ppc/kernel/misc.S 2004-03-15 17:35:22.000000000 -0800
@@ -783,9 +783,18 @@
_GLOBAL(copy_page)
addi r3,r3,-4
addi r4,r4,-4
+
+#ifdef CONFIG_8xx
+ /* don't use prefetch on 8xx */
+ li r0,4096/L1_CACHE_LINE_SIZE
+ mtctr r0
+1: COPY_16_BYTES
+ bdnz 1b
+ blr
+
+#else /* not 8xx, we can prefetch */
li r5,4

-#ifndef CONFIG_8xx
#if MAX_COPY_PREFETCH > 1
li r0,MAX_COPY_PREFETCH
li r11,4
@@ -793,19 +802,17 @@
11: dcbt r11,r4
addi r11,r11,L1_CACHE_LINE_SIZE
bdnz 11b
-#else /* MAX_L1_COPY_PREFETCH == 1 */
+#else /* MAX_COPY_PREFETCH == 1 */
dcbt r5,r4
li r11,L1_CACHE_LINE_SIZE+4
-#endif /* MAX_L1_COPY_PREFETCH */
-#endif /* CONFIG_8xx */
-
- li r0,4096/L1_CACHE_LINE_SIZE
+#endif /* MAX_COPY_PREFETCH */
+ li r0,4096/L1_CACHE_LINE_SIZE - MAX_COPY_PREFETCH
+ crclr 4*cr0+eq
+2:
mtctr r0
1:
-#ifndef CONFIG_8xx
dcbt r11,r4
dcbz r5,r3
-#endif
COPY_16_BYTES
#if L1_CACHE_LINE_SIZE >= 32
COPY_16_BYTES
@@ -821,6 +828,12 @@
#endif
#endif
bdnz 1b
+ beqlr
+ crnot 4*cr0+eq,4*cr0+eq
+ li r0,MAX_COPY_PREFETCH
+ li r11,4
+ b 2b
+#endif /* CONFIG_8xx */
blr

/*


-Bryan