Basically, the following execution:
dd if=/dev/zero of=/tmp/testfile
will slowly fill _all_ available RAM without taking memory pressure into
account.
This is because the dirty page accounting is incorrect in
redirty_page_for_writepage.
This patch adds the missing dirty page accounting in redirty_page_for_writepage().
This should fix a _lot_ of issues involving machines becoming slow under heavy
write I/O. No surprise: eventually the system starts swapping.
Linux kernel 2.6.30-rc2
The /proc/meminfo picture I had before applying this patch, after filling my
memory with the dd run, was:
MemTotal: 16433732 kB
MemFree: 10919700 kB
Buffers: 12492 kB
Cached: 5262508 kB
SwapCached: 0 kB
Active: 37096 kB
Inactive: 5254384 kB
Active(anon): 16716 kB
Inactive(anon): 0 kB
Active(file): 20380 kB
Inactive(file): 5254384 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 19535024 kB
SwapFree: 19535024 kB
Dirty: 2125956 kB
Writeback: 50476 kB
AnonPages: 16660 kB
Mapped: 9560 kB
Slab: 189692 kB
SReclaimable: 166688 kB
SUnreclaim: 23004 kB
PageTables: 3396 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 27751888 kB
Committed_AS: 53904 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 10764 kB
VmallocChunk: 34359726963 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3456 kB
DirectMap2M: 16773120 kB
After applying my patch, the same test case leaves between 8 and 500 MB of
RAM free in the steady state (once memory pressure is reached).
MemTotal: 16433732 kB
MemFree: 85144 kB
Buffers: 23148 kB
Cached: 15766280 kB
SwapCached: 0 kB
Active: 51500 kB
Inactive: 15755140 kB
Active(anon): 15540 kB
Inactive(anon): 1824 kB
Active(file): 35960 kB
Inactive(file): 15753316 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 19535024 kB
SwapFree: 19535024 kB
Dirty: 2501644 kB
Writeback: 33280 kB
AnonPages: 17280 kB
Mapped: 9272 kB
Slab: 505524 kB
SReclaimable: 485596 kB
SUnreclaim: 19928 kB
PageTables: 3396 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 27751888 kB
Committed_AS: 54508 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 10764 kB
VmallocChunk: 34359726715 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 3456 kB
DirectMap2M: 16773120 kB
The pressure pattern I see with the patch applied is:
(16GB RAM total)
- Inactive(file) fills up to 15.7GB.
- Dirty fills up to 1.7GB.
- Writeback varies between 0 and 600MB.
sync() behavior:
- Dirty goes down to ~6MB.
- Writeback increases to 1.6GB, then shrinks down to ~0MB.
References:
This insanely huge bug report:
http://bugzilla.kernel.org/show_bug.cgi?id=12309
[Bug 12309] Large I/O operations result in slow performance and high iowait times
(yes, I've been in CC all along)
Special thanks to Linus Torvalds, Nick Piggin and Thomas Pi for their
suggestions on previous patch iterations.
Special thanks to the LTTng community, which helped me get LTTng up to its
current usability level. It's been tremendously useful in understanding these
problematic I/O workloads and generating fio test cases.
Signed-off-by: Mathieu Desnoyers <[email protected]>
CC: Linus Torvalds <[email protected]>
CC: [email protected]
CC: Nick Piggin <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: [email protected]
CC: Yuriy Lalym <[email protected]>
---
mm/page-writeback.c | 6 ++++++
1 file changed, 6 insertions(+)
Index: linux-2.6-lttng/mm/page-writeback.c
===================================================================
--- linux-2.6-lttng.orig/mm/page-writeback.c 2009-04-29 18:14:48.000000000 -0400
+++ linux-2.6-lttng/mm/page-writeback.c 2009-04-29 18:23:59.000000000 -0400
@@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
if (!mapping)
return 1;
+ /*
+ * Take care of setting back page accounting correctly.
+ */
+ inc_zone_page_state(page, NR_FILE_DIRTY);
+ inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+
spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
if (mapping2) { /* Race with truncate? */
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers ([email protected]) wrote:
> Basically, the following execution :
>
> dd if=/dev/zero of=/tmp/testfile
>
> will slowly fill _all_ ram available without taking into account memory
> pressure.
>
> This is because the dirty page accounting is incorrect in
> redirty_page_for_writepage.
>
> This patch adds missing dirty page accounting in redirty_page_for_writepage().
> This should fix a _lot_ of issues involving machines becoming slow under heavy
> write I/O. No surprise : eventually the system starts swapping.
>
> Linux kernel 2.6.30-rc2
>
> The /proc/meminfo picture I had before applying this patch after filling my
> memory with the dd execution was :
>
> MemTotal: 16433732 kB
> MemFree: 10919700 kB
Darn, I did not take this meminfo snapshot at the appropriate moment.
I actually have to double-check whether 2.6.30-rc still shows the bogus
behavior I identified in the 2.6.28-2.6.29 days. Then I'll check with
earlier 2.6.29.x. I know there has been some improvement on the ext3
side since then. I'll come back when I have that information.
Sorry.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 29 Apr 2009 19:25:46 -0400
Mathieu Desnoyers <[email protected]> wrote:
> Basically, the following execution :
>
> dd if=/dev/zero of=/tmp/testfile
>
> will slowly fill _all_ ram available without taking into account memory
> pressure.
>
> This is because the dirty page accounting is incorrect in
> redirty_page_for_writepage.
>
> This patch adds missing dirty page accounting in redirty_page_for_writepage().
The patch changes __set_page_dirty_nobuffers(), not
redirty_page_for_writepage().
__set_page_dirty_nobuffers() has a huge number of callers.
> --- linux-2.6-lttng.orig/mm/page-writeback.c 2009-04-29 18:14:48.000000000 -0400
> +++ linux-2.6-lttng/mm/page-writeback.c 2009-04-29 18:23:59.000000000 -0400
> @@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
> if (!mapping)
> return 1;
>
> + /*
> + * Take care of setting back page accounting correctly.
> + */
> + inc_zone_page_state(page, NR_FILE_DIRTY);
> + inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> +
> spin_lock_irq(&mapping->tree_lock);
> mapping2 = page_mapping(page);
> if (mapping2) { /* Race with truncate? */
>
But __set_page_dirty_nobuffers() calls account_page_dirtied(), which
already does the above two operations. afaict we're now
double-accounting.
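For reference, account_page_dirtied() in 2.6.30 looks roughly like this
(paraphrased, not an exact copy of mm/page-writeback.c):
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
        if (mapping_cap_account_dirty(mapping)) {
                /* The same two counters the proposed hunk bumps again: */
                __inc_zone_page_state(page, NR_FILE_DIRTY);
                __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
                task_dirty_inc(current);
                task_io_account_write(PAGE_CACHE_SIZE);
        }
}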
Now, it's possible that the accounting goes wrong very occasionally in
the "/* Race with truncate? */" case. If the truncate path clears the
page's dirty bit then it will decrement the dirty-page accounting, but
this code path will fail to perform the increment of the dirty-page
accounting. IOW, once this function has set PG_dirty, it is committed
to altering some or all of the page-dirty accounting.
But afaict your test case will not trigger the race-with-truncate anyway?
Can you determine at approximately what frequency (pages-per-second)
this accounting leak is occurring in your test?
On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
>
> This patch adds missing dirty page accounting in redirty_page_for_writepage().
> This should fix a _lot_ of issues involving machines becoming slow under heavy
> write I/O. No surprise : eventually the system starts swapping.
That patch (and description) is odd.
The patch actually adds the dirty page accounting not to
redirty_page_for_writepage(), but to __set_page_dirty_nobuffers().
And __set_page_dirty_nobuffers() will later (just a few lines down) call
down to account_page_dirtied(), which in turn does all that
same accounting (assuming the "mapping" is marked to account for dirty).
So the description seems to be wrong, but so does the patch.
Did you attach the wrong patch (explaining both problems)?
Or if the patch is what you really meant to do, then you need to fix your
explanation, and also explain why the double-dirty accounting is a good
idea.
Or is the real problem perhaps that your /tmp is ramdisk, and not marked
to do dirty accounting?
Linus
* Andrew Morton ([email protected]) wrote:
> On Wed, 29 Apr 2009 19:25:46 -0400
> Mathieu Desnoyers <[email protected]> wrote:
>
> > Basically, the following execution :
> >
> > dd if=/dev/zero of=/tmp/testfile
> >
> > will slowly fill _all_ ram available without taking into account memory
> > pressure.
> >
> > This is because the dirty page accounting is incorrect in
> > redirty_page_for_writepage.
> >
> > This patch adds missing dirty page accounting in redirty_page_for_writepage().
>
> The patch changes __set_page_dirty_nobuffers(), not
> redirty_page_for_writepage().
>
> __set_page_dirty_nobuffers() has a huge number of callers.
>
Right.
> > --- linux-2.6-lttng.orig/mm/page-writeback.c 2009-04-29 18:14:48.000000000 -0400
> > +++ linux-2.6-lttng/mm/page-writeback.c 2009-04-29 18:23:59.000000000 -0400
> > @@ -1237,6 +1237,12 @@ int __set_page_dirty_nobuffers(struct pa
> > if (!mapping)
> > return 1;
> >
> > + /*
> > + * Take care of setting back page accounting correctly.
> > + */
> > + inc_zone_page_state(page, NR_FILE_DIRTY);
> > + inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > +
> > spin_lock_irq(&mapping->tree_lock);
> > mapping2 = page_mapping(page);
> > if (mapping2) { /* Race with truncate? */
> >
>
> But __set_page_dirty_nobuffers() calls account_page_dirtied(), which
> already does the above two operations. afacit we're now
> double-accounting.
>
Yes, you are right.
> Now, it's possible that the accounting goes wrong very occasionally in
> the "/* Race with truncate? */" case. If the truncate path clears the
> page's dirty bit then it will decrement the dirty-page accounting, but
> this code path will fail to perform the increment of the dirty-page
> accounting. IOW, once this function has set PG_Dirty, it is committed
> to altering some or all of the page-dirty accounting.
>
> But afacit your test case will not trigger the race-with-truncate anyway?
>
> Can you determine at approximately what frequency (pages-per-second)
> this accounting leak is occurring in your test?
>
0 per minute, actually. I tried adding a printk when the
if (mapping2) {
} else {
<--
}
case is hit, and it never triggered in my tests.
I am currently trying to figure out if I can reproduce the OOM problems
I experienced with 2.6.29-rc3. I am investigating memory accounting by
turning the memory accounting code into a slow, cache-line-bouncing
version and by adding assertions that the per-zone global counters must
never go below zero. Having unbalanced accounting could have some nasty
long-term effects on memory pressure accounting.
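Something along these lines (a sketch only, not the actual debug patch;
the helper name is made up):
static inline void check_zone_counter_sane(struct zone *zone,
                                           enum zone_stat_item item)
{
        /* Per-zone counters should never be driven below zero. */
        WARN_ON_ONCE(atomic_long_read(&zone->vm_stat[item]) < 0);
}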
But so far the memory accounting code looks solid. It's my bad then. I
cannot reproduce the behavior I noticed with 2.6.29-rc3, so I guess we
should consider this a non-issue (or code 9 if you prefer). ;)
Thanks for looking into this.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Linus Torvalds ([email protected]) wrote:
>
>
> On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> >
> > This patch adds missing dirty page accounting in redirty_page_for_writepage().
> > This should fix a _lot_ of issues involving machines becoming slow under heavy
> > write I/O. No surprise : eventually the system starts swapping.
>
> That patch (and description) is odd.
>
> The patch actually adds the dirty page accounting not to
> redirty_page_for_writepage(), but to __set_page_dirty_nobuffers().
>
> And __set_page_dirty_nobuffers() will later (just a few lines down) call
> down to account_page_dirtied(), which in turn does all that
> same accounting (assuming the "mapping" is marked to account for dirty.
>
> So the description seems to be wrong, but so does the patch.
>
> Did you attach the wrong patch (explaining both problems)?
>
Nope, just tried to go back into the OOM problem I experienced when
testing the "heavy I/O latency" bug. But it looks like I can't reproduce
it anyway.
> Or if the patch is what you really meant to do, then you need to fix your
> explanation, and also explain why the double-dirty accounting is a good
> idea.
Nope, I just "messed this one up" completely. :-)
>
> Or is the real problem perhaps that your /tmp is ramdisk, and not marked
> to do dirty accounting?
Nope, I don't think so. /tmp is on an ext3 partition, on RAID-1 SATA
disks.
And thanks for the review! This exercise only convinced me that the
kernel memory accounting works as expected. All this gave me the chance
to have a good look at the memory accounting code. We could probably
benefit from Christoph Lameter's cpu ops (using segment registers to
address per-cpu variables with atomic inc/dec) in there. Or at least
removing interrupt disabling, by using preempt disable and local_t
variables for the per-cpu counters, could bring some benefit.
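A minimal sketch of that last idea (the counter name is made up; this is
not a real patch):
#include <linux/percpu.h>
#include <asm/local.h>
/* Hypothetical per-cpu event counter, for illustration only. */
static DEFINE_PER_CPU(local_t, nr_foo_events);
static void count_foo_event(void)
{
        preempt_disable();
        /* On x86, local_inc() is a single incl: irq- and NMI-safe. */
        local_inc(&__get_cpu_var(nr_foo_events));
        preempt_enable();
}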
Mathieu
>
> Linus
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers <[email protected]> wrote:
> And thanks for the review! This excercise only convinced me that
> the kernel memory accounting works as expected. All this gave me
> the chance to have a good look at the memory accounting code. We
> could probably benefit of Christoph Lameter's cpu ops (using
> segment registers to address per-cpu variables with atomic
> inc/dec) in there. Or at least removing interrupt disabling by
> using preempt disable and local_t variables for the per-cpu
> counters could bring some benefit.
Note, optimized per cpu ops are already implemented upstream, by
Tejun Heo's percpu patches in .30:
#define percpu_read(var) percpu_from_op("mov", per_cpu__##var)
#define percpu_write(var, val) percpu_to_op("mov", per_cpu__##var, val)
#define percpu_add(var, val) percpu_to_op("add", per_cpu__##var, val)
#define percpu_sub(var, val) percpu_to_op("sub", per_cpu__##var, val)
#define percpu_and(var, val) percpu_to_op("and", per_cpu__##var, val)
#define percpu_or(var, val) percpu_to_op("or", per_cpu__##var, val)
#define percpu_xor(var, val) percpu_to_op("xor", per_cpu__##var, val)
See:
6dbde35: percpu: add optimized generic percpu accessors
From the changelog:
[...]
The advantage is that for example to read a local percpu variable,
instead of this sequence:
return __get_cpu_var(var);
ffffffff8102ca2b: 48 8b 14 fd 80 09 74 mov -0x7e8bf680(,%rdi,8),%rdx
ffffffff8102ca32: 81
ffffffff8102ca33: 48 c7 c0 d8 59 00 00 mov $0x59d8,%rax
ffffffff8102ca3a: 48 8b 04 10 mov (%rax,%rdx,1),%rax
We can get a single instruction by using the optimized variants:
return percpu_read(var);
ffffffff8102ca3f: 65 48 8b 05 91 8f fd mov %gs:0x7efd8f91(%rip),%rax
[...]
So if you want to make use of it, percpu_add()/percpu_sub() would be
the place to start.
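For instance, a counter declared with DEFINE_PER_CPU() could then be
updated like this (illustrative name only, not an existing kernel
counter):
#include <linux/percpu.h>
/* Hypothetical per-cpu event counter, for illustration only. */
DEFINE_PER_CPU(unsigned long, nr_sample_events);
static inline void count_sample_event(void)
{
        /* On x86 this compiles to a single %gs-relative addl. */
        percpu_add(nr_sample_events, 1);
}
static inline unsigned long read_sample_events(void)
{
        return percpu_read(nr_sample_events);
}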
Ingo
* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > And thanks for the review! This excercise only convinced me that
> > the kernel memory accounting works as expected. All this gave me
> > the chance to have a good look at the memory accounting code. We
> > could probably benefit of Christoph Lameter's cpu ops (using
> > segment registers to address per-cpu variables with atomic
> > inc/dec) in there. Or at least removing interrupt disabling by
> > using preempt disable and local_t variables for the per-cpu
> > counters could bring some benefit.
>
> Note, optimized per cpu ops are already implemented upstream, by
> Tejun Heo's percpu patches in .30:
>
> #define percpu_read(var) percpu_from_op("mov", per_cpu__##var)
> #define percpu_write(var, val) percpu_to_op("mov", per_cpu__##var, val)
> #define percpu_add(var, val) percpu_to_op("add", per_cpu__##var, val)
> #define percpu_sub(var, val) percpu_to_op("sub", per_cpu__##var, val)
> #define percpu_and(var, val) percpu_to_op("and", per_cpu__##var, val)
> #define percpu_or(var, val) percpu_to_op("or", per_cpu__##var, val)
> #define percpu_xor(var, val) percpu_to_op("xor", per_cpu__##var, val)
>
> See:
>
> 6dbde35: percpu: add optimized generic percpu accessors
>
> From the changelog:
>
> [...]
> The advantage is that for example to read a local percpu variable,
> instead of this sequence:
>
> return __get_cpu_var(var);
>
> ffffffff8102ca2b: 48 8b 14 fd 80 09 74 mov -0x7e8bf680(,%rdi,8),%rdx
> ffffffff8102ca32: 81
> ffffffff8102ca33: 48 c7 c0 d8 59 00 00 mov $0x59d8,%rax
> ffffffff8102ca3a: 48 8b 04 10 mov (%rax,%rdx,1),%rax
>
> We can get a single instruction by using the optimized variants:
>
> return percpu_read(var);
>
> ffffffff8102ca3f: 65 48 8b 05 91 8f fd mov %gs:0x7efd8f91(%rip),%rax
> [...]
>
> So if you want to make use of it, percpu_add()/percpu_sub() would be
> the place to start.
>
Great!
I see, however, that it's only guaranteed to be atomic wrt preemption.
What would be even better would be to have the atomic ops wrt local irqs
(as local.h does) available in this percpu flavor. By doing this, we
could have interrupt- and NMI-safe per-cpu counters, without even the
need to disable preemption.
In terms of counters, except maybe for tri-values on some
architectures, I don't see how we could manage synchronization in a
better way.
Mathieu
> Ingo
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers <[email protected]> wrote:
> * Ingo Molnar ([email protected]) wrote:
> >
> > * Mathieu Desnoyers <[email protected]> wrote:
> >
> > > And thanks for the review! This excercise only convinced me that
> > > the kernel memory accounting works as expected. All this gave me
> > > the chance to have a good look at the memory accounting code. We
> > > could probably benefit of Christoph Lameter's cpu ops (using
> > > segment registers to address per-cpu variables with atomic
> > > inc/dec) in there. Or at least removing interrupt disabling by
> > > using preempt disable and local_t variables for the per-cpu
> > > counters could bring some benefit.
> >
> > Note, optimized per cpu ops are already implemented upstream, by
> > Tejun Heo's percpu patches in .30:
> >
> > #define percpu_read(var) percpu_from_op("mov", per_cpu__##var)
> > #define percpu_write(var, val) percpu_to_op("mov", per_cpu__##var, val)
> > #define percpu_add(var, val) percpu_to_op("add", per_cpu__##var, val)
> > #define percpu_sub(var, val) percpu_to_op("sub", per_cpu__##var, val)
> > #define percpu_and(var, val) percpu_to_op("and", per_cpu__##var, val)
> > #define percpu_or(var, val) percpu_to_op("or", per_cpu__##var, val)
> > #define percpu_xor(var, val) percpu_to_op("xor", per_cpu__##var, val)
> >
> > See:
> >
> > 6dbde35: percpu: add optimized generic percpu accessors
> >
> > From the changelog:
> >
> > [...]
> > The advantage is that for example to read a local percpu variable,
> > instead of this sequence:
> >
> > return __get_cpu_var(var);
> >
> > ffffffff8102ca2b: 48 8b 14 fd 80 09 74 mov -0x7e8bf680(,%rdi,8),%rdx
> > ffffffff8102ca32: 81
> > ffffffff8102ca33: 48 c7 c0 d8 59 00 00 mov $0x59d8,%rax
> > ffffffff8102ca3a: 48 8b 04 10 mov (%rax,%rdx,1),%rax
> >
> > We can get a single instruction by using the optimized variants:
> >
> > return percpu_read(var);
> >
> > ffffffff8102ca3f: 65 48 8b 05 91 8f fd mov %gs:0x7efd8f91(%rip),%rax
> > [...]
> >
> > So if you want to make use of it, percpu_add()/percpu_sub() would be
> > the place to start.
> >
>
> Great !
>
> I see however that it's only guaranteed to be atomic wrt preemption.
That's really only true for the non-x86 fallback defines. If we so
decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
...
> What would be even better would be to have the atomic ops wrt local irqs
> (as local.h does) available in this percpu flavor. By doing this, we
> could have interrupt and nmi-safe per-cpu counters, without even the
> need to disable preemption.
nmi-safe isn't a big issue (we have no NMI code that interacts with
MM counters) - and we could make them irq-safe by fixing the
wrapper. (and on x86 they are NMI-safe too.)
Ingo
On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> to have a good look at the memory accounting code. We could probably
> benefit of Christoph Lameter's cpu ops (using segment registers to
> address per-cpu variables with atomic inc/dec) in there. Or at least
> removing interrupt disabling by using preempt disable and local_t
> variables for the per-cpu counters could bring some benefit.
Guess we are ready for atomic per cpu ops now that the new per cpu
allocator is in? Segment register issues with the PDA are also solved
right?
* Christoph Lameter <[email protected]> wrote:
> On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
>
> > to have a good look at the memory accounting code. We could
> > probably benefit of Christoph Lameter's cpu ops (using segment
> > registers to address per-cpu variables with atomic inc/dec) in
> > there. Or at least removing interrupt disabling by using preempt
> > disable and local_t variables for the per-cpu counters could
> > bring some benefit.
>
> Guess we are ready for atomic per cpu ops now that the new per cpu
> allocator is in? Segment register issues with the PDA are also
> solved right?
it's all done, implemented and upstream already. You are a bit late
to the party ;-)
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> > I see however that it's only guaranteed to be atomic wrt preemption.
>
> That's really only true for the non-x86 fallback defines. If we so
> decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
The fallbacks have different semantics and therefore we cannot rely on
irq safeness in the core code when using the x86 cpu ops.
> nmi-safe isnt a big issue (we have no NMI code that interacts with
> MM counters) - and we could make them irq-safe by fixing the
> wrapper. (and on x86 they are NMI-safe too.)
There are also contexts in which you already are preempt-safe and where the
per cpu ops do not need to go through the preemption hoops.
This means it would be best to have 3 variants for 3 different contexts in
the core code:
1. Need irq safety
2. Need preempt safety
3. We know the operation is safe because preemption has already been
disabled or irqs are not enabled.
The 3 variants on x86 generate the same instructions. On other platforms
they would need to be able to fall back in various ways depending on the
availability of instructions that are atomic vs. preempt or irqs.
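Roughly like this (hypothetical naming, generic fallbacks only; on x86
all three would map to the same single instruction):
/* 1. Need irq safety: the generic fallback disables interrupts. */
#define percpu_add_irqsafe(var, val)                    \
do {                                                    \
        unsigned long flags;                            \
        local_irq_save(flags);                          \
        __get_cpu_var(var) += (val);                    \
        local_irq_restore(flags);                       \
} while (0)
/* 2. Need preempt safety: the generic fallback disables preemption. */
#define percpu_add_preemptsafe(var, val)                \
do {                                                    \
        get_cpu_var(var) += (val);                      \
        put_cpu_var(var);                               \
} while (0)
/* 3. Caller already guarantees safety: plain access. */
#define __percpu_add(var, val)                          \
do {                                                    \
        __get_cpu_var(var) += (val);                    \
} while (0)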
http://thread.gmane.org/gmane.linux.kernel.cross-arch/1124
http://lwn.net/Articles/284526/
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> it's all done, implemented and upstream already. You are a bit late
> to the party ;-)
Just looked over it. Yep, now we just need to fix the things that were
messed up.
* Ingo Molnar ([email protected]) wrote:
>
> * Christoph Lameter <[email protected]> wrote:
>
> > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> >
> > > to have a good look at the memory accounting code. We could
> > > probably benefit of Christoph Lameter's cpu ops (using segment
> > > registers to address per-cpu variables with atomic inc/dec) in
> > > there. Or at least removing interrupt disabling by using preempt
> > > disable and local_t variables for the per-cpu counters could
> > > bring some benefit.
> >
> > Guess we are ready for atomic per cpu ops now that the new per cpu
> > allocator is in? Segment register issues with the PDA are also
> > solved right?
>
> it's all done, implemented and upstream already. You are a bit late
> to the party ;-)
>
> Ingo
Or way too early, depending on the point of view. :-)
e.g.
http://lkml.org/lkml/2008/5/30/3
I think Christoph deserves credit for pioneering this area with fresh
ideas.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> I think Christoph deserves credits for pioneering this area with fresh
> ideas.
Thanks. I like ideas but not so much the follow-through on them (which
also got a bit problematic in this case because I was changing companies
while this was going on). The work that Tejun has done on this is quite
good as far as I can see.
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > > I see however that it's only guaranteed to be atomic wrt preemption.
> >
> > That's really only true for the non-x86 fallback defines. If we so
> > decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
>
> The fallbacks have different semantics and therefore we cannot
> rely on irq safeness in the core code when using the x86 cpu ops.
Well, it's irq- and preempt-safe on x86.
It's preempt-safe on other architectures - but the fallback is not
irq-safe on other architectures. That is remedied easily via the
patch below. (Note: totally untested)
Ingo
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1581ff2..6b3984a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -139,17 +139,23 @@ static inline void free_percpu(void *p)
#ifndef percpu_read
# define percpu_read(var) \
({ \
+ unsigned long flags; \
typeof(per_cpu_var(var)) __tmp_var__; \
- __tmp_var__ = get_cpu_var(var); \
- put_cpu_var(var); \
+ \
+ local_irq_save(flags); \
+ __tmp_var__ = __get_cpu_var(var); \
+ local_irq_restore(flags); \
__tmp_var__; \
})
#endif
#define __percpu_generic_to_op(var, val, op) \
do { \
- get_cpu_var(var) op val; \
- put_cpu_var(var); \
+ unsigned long flags; \
+ \
+ local_irq_save(flags); \
+ op val; \
+ local_irq_restore(flags); \
} while (0)
#ifndef percpu_write
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > it's all done, implemented and upstream already. You are a bit late
> > to the party ;-)
>
> Just looked over it. Yep, now we just need to fix the things that
> were messed up.
Could we please skip over the lengthy flamewar and bad-mouthing of
other people's work and go straight to the constructive portion of
the discussion? ;-)
The patch below makes the fallback/slowpath irq safe.
Ingo
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 1581ff2..6b3984a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -139,17 +139,23 @@ static inline void free_percpu(void *p)
#ifndef percpu_read
# define percpu_read(var) \
({ \
+ unsigned long flags; \
typeof(per_cpu_var(var)) __tmp_var__; \
- __tmp_var__ = get_cpu_var(var); \
- put_cpu_var(var); \
+ \
+ local_irq_save(flags); \
+ __tmp_var__ = __get_cpu_var(var); \
+ local_irq_restore(flags); \
__tmp_var__; \
})
#endif
#define __percpu_generic_to_op(var, val, op) \
do { \
- get_cpu_var(var) op val; \
- put_cpu_var(var); \
+ unsigned long flags; \
+ \
+ local_irq_save(flags); \
+ op val; \
+ local_irq_restore(flags); \
} while (0)
#ifndef percpu_write
* Christoph Lameter ([email protected]) wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > > I see however that it's only guaranteed to be atomic wrt preemption.
> >
> > That's really only true for the non-x86 fallback defines. If we so
> > decide, we could make the fallbacks in asm-generic/percpu.h irq-safe
>
> The fallbacks have different semantics and therefore we cannot rely on
> irq safeness in the core code when using the x86 cpu ops.
>
> > nmi-safe isnt a big issue (we have no NMI code that interacts with
> > MM counters) - and we could make them irq-safe by fixing the
> > wrapper. (and on x86 they are NMI-safe too.)
>
> There are also context in which you alrady are preempt safe and where the
> per cpu ops do not need to go through the prremption hoops.
>
> This means it would be best to have 3 variants for 3 different contexts in
> the core code:
>
> 1. Need irq safety
> 2. Need preempt safety
> 3. We know the operation is safe due to preemption already having been
> disabled or irqs are not enabled.
>
> The 3 variants on x86 generate the same instructions. On other platforms
> they would need to be able to fallback in various way depending on the
> availability of instructions that are atomic vs. preempt or irqs.
>
The problem here, as we figured out with the atomic slub work a while
ago, is that if we have the following code:
local_irq_save
var++
var++
local_irq_restore
that we would like to turn into an irq-safe percpu variant with this
semantic:
percpu_add_irqsafe(var)
percpu_add_irqsafe(var)
we are generating two irq save/restore pairs in the fallback, which will
be slow.
However, we could do the following trick:
percpu_irqsave(flags);
percpu_add_irq(var);
percpu_add_irq(var);
percpu_irqrestore(flags);
And we could require that percpu_*_irq operations be placed within an
irq-safe section. The fallback would disable interrupts, but
arch-specific irq-safe atomic implementations would replace this with
nops.
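Something like this (a sketch only; the config symbol name is made up):
#ifdef CONFIG_ARCH_HAS_IRQSAFE_PERCPU_OPS       /* hypothetical symbol */
# define percpu_irqsave(flags)          do { (flags) = 0; } while (0)
# define percpu_irqrestore(flags)       do { (void)(flags); } while (0)
#else
# define percpu_irqsave(flags)          local_irq_save(flags)
# define percpu_irqrestore(flags)       local_irq_restore(flags)
#endif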
And if interrupts are already disabled, percpu_add_irq could be used
directly. There is no need to duplicate the primitives (no
_percpu_add_irq() needed). The same could apply to preempt-safety:
percpu_preempt_disable();
percpu_add(var);
percpu_add(var);
percpu_preempt_enable();
The requirement on percpu_add would be that it is called within a
percpu_preempt_disable()/percpu_preempt_enable() section, or with
preemption already known to be disabled.
The same thing could apply to bh. But I don't see any difference between
percpu_add_bh and percpu_add_irq, except maybe on architectures which
would use tri-values:
percpu_bh_disable();
percpu_add_bh(var);
percpu_add_bh(var);
percpu_bh_enable();
Thoughts?
Mathieu
> http://thread.gmane.org/gmane.linux.kernel.cross-arch/1124
> http://lwn.net/Articles/284526/
>
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> > The 3 variants on x86 generate the same instructions. On other platforms
> > they would need to be able to fallback in various way depending on the
> > availability of instructions that are atomic vs. preempt or irqs.
> >
>
> The problem here, as we did figure out a while ago with the atomic
> slub we worked on a while ago, is that if we have the following code :
>
> local_irq_save
> var++
> var++
> local_irq_restore
>
> that we would like to turn into irq-safe percpu variant with this
> semantic :
>
> percpu_add_irqsafe(var)
> percpu_add_irqsafe(var)
>
> We are generating two irq save/restore in the fallback, which will be
> slow.
>
> However, we could do the following trick :
>
> percpu_irqsave(flags);
> percpu_add_irq(var);
> percpu_add_irq(var);
> percpu_irqrestore(flags);
Hmmm. I do not remember any of those double ops in the patches that I did a
while back for this. It does not make sense either, because atomic per cpu
ops are only atomic for a single instruction. You are trying to extend
that so that multiple "atomic" instructions are now atomic.
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> The patch below makes the fallback/slowpath irq safe.
Yes, but sometimes you are already irq-safe, and such a fallback would
create significant irq enable/disable and stack-operation overhead for
architectures that are using the fallback.
I think we really need another __xxx op here, especially since these
operations are often in critical code paths.
* Mathieu Desnoyers <[email protected]> wrote:
> * Ingo Molnar ([email protected]) wrote:
> >
> > * Christoph Lameter <[email protected]> wrote:
> >
> > > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> > >
> > > > to have a good look at the memory accounting code. We could
> > > > probably benefit of Christoph Lameter's cpu ops (using segment
> > > > registers to address per-cpu variables with atomic inc/dec) in
> > > > there. Or at least removing interrupt disabling by using preempt
> > > > disable and local_t variables for the per-cpu counters could
> > > > bring some benefit.
> > >
> > > Guess we are ready for atomic per cpu ops now that the new per cpu
> > > allocator is in? Segment register issues with the PDA are also
> > > solved right?
> >
> > it's all done, implemented and upstream already. You are a bit late
> > to the party ;-)
> >
> > Ingo
>
> Or way too early, depending on the point of view. :-)
>
> e.g.
> http://lkml.org/lkml/2008/5/30/3
>
> I think Christoph deserves credits for pioneering this area with fresh
> ideas.
Ok, I didn't want to go there - but let me correct this version of
history.
Christoph's zero-based x86 percpu patches were incomplete and never
worked reliably - Christoph unfortunately never addressed the
bugs/crashes I reported (and Mike Travis and I injected quite a
bit of testing into them). There were two failed attempts to productize
them and the patches just bitrotted for more than a year.
Tejun, on the other hand, fixed those problems (four of Christoph's
patches survived more or less and were credited to Christoph) and
did more than 50 highly delicate patches of far larger complexity to
solve the _whole_ problem range - within a two-month timeframe.
Ideas and half-done patches covering <10% of the work needed are not
enough. Being able to implement it and productize it is the real
deal, in my book.
Thanks go to Christoph (and Rusty) for coming up with the idea,
but it would be manifestly unfair not to send 90% of the kudos to
Tejun for turning it all into reality, fixing all the other
problems and redesigning almost all the x86 percpu code in the
process! ;-)
Ingo
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > The patch below makes the fallback/slowpath irq safe.
>
> Yes but sometimes you are already irq safe and such a fallback
> would create significant irq/enable/disable stack operations etc
> overhead for architectures that are using the fallback.
It's a fallback slowpath - non-x86 architectures should still fill
in a real implementation of course.
> I think we really need another __xxx op here. Especially since
> these operations are often in critical code paths.
That's a recipe for fragility: using __xxx will still be
irq-safe on x86, and 95% of the testing is done on x86, so this
opens up the path to non-x86 bugs.
So we first have to see the list of architectures that _cannot_
implement an irq-safe op here via a single machine instruction.
x86, ia64 and powerpc should be fine.
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> Christoph's zero-based x86 percpu patches were incomplete and never
> worked reliably - Christoph unfortunately never addressed the
> bugs/crashes i reported. (and Mike Travis and me injected quite a
> bit of testing into it) There were two failed attempts to productize
> them and the patches just bitrotted for more than a year.
Sorry, not my issue. Mike took over that work on the x86 arch.
I am fine with how things developed. The core patches fell off the
priority list after my job change. Just get the things that are not right
straightened out now.
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> > Yes but sometimes you are already irq safe and such a fallback
> > would create significant irq/enable/disable stack operations etc
> > overhead for architectures that are using the fallback.
>
> It's a fallback slowpath - non-x86 architectures should still fill
> in a real implementation of course.
Arch code cannot provide an effective implementation, since it
always has to assume that interrupts need to be disabled, if we stay with
the current implementation.
> So we first have to see the list of architectures that _cannot_
> implement an irq-safe op here via a single machine instruction.
> x86, ia64 and powerpc should be fine.
Look at ia64, sparc, s390, powerpc. They can fall back to atomic ops, but
those are very inefficient on some of these platforms. Since these are
performance critical, they will need to be optimized depending on the
context of their use in the core.
* Christoph Lameter <[email protected]> wrote:
> I am fine with how things developed. The core patches fell off the
> priority list after job change. Just get the things that are not
> right now straightened out.
Regarding the __xxx ops, I'm of two minds, really. If there's a
significant architecture that really cannot live without them (if
it does not have the proper machine instructions _and_ local irq
disabling is unreasonably expensive) then I guess we could (and
should) add them.
Otherwise we should just try to be as simple as possible.
Ingo
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > > Yes but sometimes you are already irq safe and such a fallback
> > > would create significant irq/enable/disable stack operations etc
> > > overhead for architectures that are using the fallback.
> >
> > It's a fallback slowpath - non-x86 architectures should still fill
> > in a real implementation of course.
>
> Arch code cannot provide an effective implementation since they
> always have to assume that interupts need to be disabled if we stay with
> the current implementation.
>
> > So we first have to see the list of architectures that _cannot_
> > implement an irq-safe op here via a single machine instruction.
> > x86, ia64 and powerpc should be fine.
>
> Look at Ia64, sparc, s/390, powerpc. They can fall back to atomic
> ops but those are very ineffective on some of these platforms.
> Since these are performance critical they will need to be
> optimized depending on the context of their use in the core.
Could you cite a specific example / situation where you'd use __xxx
ops?
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> Could you cite a specific example / situation where you'd use __xxx
> ops?
Well, the cpu alloc v3 patchset (already cited) included numerous
locations. I sure wish we could move much of the core stats over to percpu
ops. Here is a list of some stats patches that do not need irq disabling.
http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter
http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > Could you cite a specific example / situation where you'd use __xxx
> > ops?
>
> Well the cpu alloc v3 patchset (already cited) included numerous
> locations. Sure wish we could move much of the core stats over to
> percpu ops. Here is a list of some stats patches that do not need
> irq disable.
>
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU
nice. Do these all get called with irqs off, all the time?
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> nice. Do these all get called with irqs off, all the time?
In most cases, yes. The vmstat patch includes an example where we may not
care too much about an event counter missing a beat once in a while on
platforms not supporting atomic per cpu ops. I know this affects IA64. The
cost of an atomic operation for an event counter update (which would have
avoided the potential of a concurrent update) was not justifiable.
* Christoph Lameter <[email protected]> wrote:
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1163 Socket inuse counter
btw., this one is already converted upstream, via:
4e69489: socket: use percpu_add() while updating sockets_in_use
done by Eric Dumazet.
Ingo
* Ingo Molnar ([email protected]) wrote:
>
> * Mathieu Desnoyers <[email protected]> wrote:
>
> > * Ingo Molnar ([email protected]) wrote:
> > >
> > > * Christoph Lameter <[email protected]> wrote:
> > >
> > > > On Wed, 29 Apr 2009, Mathieu Desnoyers wrote:
> > > >
> > > > > to have a good look at the memory accounting code. We could
> > > > > probably benefit of Christoph Lameter's cpu ops (using segment
> > > > > registers to address per-cpu variables with atomic inc/dec) in
> > > > > there. Or at least removing interrupt disabling by using preempt
> > > > > disable and local_t variables for the per-cpu counters could
> > > > > bring some benefit.
> > > >
> > > > Guess we are ready for atomic per cpu ops now that the new per cpu
> > > > allocator is in? Segment register issues with the PDA are also
> > > > solved right?
> > >
> > > it's all done, implemented and upstream already. You are a bit late
> > > to the party ;-)
> > >
> > > Ingo
> >
> > Or way too early, depending on the point of view. :-)
> >
> > e.g.
> > http://lkml.org/lkml/2008/5/30/3
> >
> > I think Christoph deserves credits for pioneering this area with fresh
> > ideas.
>
> Ok, i didnt want to go there - but let me correct this version of
> history.
>
> Christoph's zero-based x86 percpu patches were incomplete and never
> worked reliably - Christoph unfortunately never addressed the
> bugs/crashes i reported. (and Mike Travis and me injected quite a
> bit of testing into it) There were two failed attempts to productize
> them and the patches just bitrotted for more than a year.
>
> Tejun on the other hand fixed those problems (four of Christoph's
> patches survived more or less and were credited to Christoph) and
> did more than 50 highly delicate patches of far larger complexity to
> solve the _whole_ problem range - within a two months timeframe.
>
> Ideas and half-done patches covering <10% of the work needed are not
> enough. Being able to implement it and productize it is the real
> deal, in my book.
>
> Thanks goes to Christoph (and Rusty) for coming up with the idea,
> but it would be manifestly unfair to not send 90% of the kudos to
> Tejun for turning it all into reality and fixing all the other
> problems and redesigning almost all the x86 percpu code in the
> process! ;-)
>
Then thanks to Christoph, Rusty and Tejun for their respective efforts.
My goal is simply to recognise everyone's respective contribution.
Mathieu
> Ingo
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Christoph Lameter <[email protected]> wrote:
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1128
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1132
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1134
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1138
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1139
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1145 VM stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1160 NFS stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1161 Genhd stats
> http://article.gmane.org/gmane.linux.kernel.cross-arch/1164 SRCU
The new percpu APIs could be used in most of these places already,
straight away. This is a really good TODO list for places to
enhance.
Then a second set of patches could convert percpu_add() / etc. uses
to __percpu_add() ... but that should be done by those architectures
that need it (and to the extent they need it), because it's not
really testable on x86.
I don't really like the PER_CPU / CPU_INC etc. type of all-capitals
APIs you introduced in the patches above:
+ __CPU_INC(bt->sequence);
+ CPU_FREE(bt->sequence);
Was there any strong reason to go outside the well-established
percpu_* namespace and call these primitives as if they were
macros?
Ingo
* Christoph Lameter <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> > nice. Do these all get called with irqs off, all the time?
>
> In most cases yes. The VMstat patch includes an example were we
> may not care too much about a event counter missing a beat once in
> a while for platforms not supporting atomic per cpu ops. I know
> this affects IA64. The cost of an atomic operations for an event
> counter update (which would have avoided the potential of a
> concurrent update) was not justifiable.
when you say "atomics", do you mean the classic meaning of atomics?
Because there are no classic atomics involved. This is the
before/after disassembly from Eric's commit 4e69489a0:
before:
c0436274: b8 01 00 00 00 mov $0x1,%eax
c0436279: e8 42 40 df ff call c022a2c0 <add_preempt_count>
c043627e: bb 20 4f 6a c0 mov $0xc06a4f20,%ebx
c0436283: e8 18 ca f0 ff call c0342ca0 <debug_smp_processor_id>
c0436288: 03 1c 85 60 4a 65 c0 add -0x3f9ab5a0(,%eax,4),%ebx
c043628f: ff 03 incl (%ebx)
c0436291: b8 01 00 00 00 mov $0x1,%eax
c0436296: e8 75 3f df ff call c022a210 <sub_preempt_count>
c043629b: 89 e0 mov %esp,%eax
c043629d: 25 00 e0 ff ff and $0xffffe000,%eax
c04362a2: f6 40 08 08 testb $0x8,0x8(%eax)
c04362a6: 75 07 jne c04362af <sock_alloc+0x7f>
c04362a8: 8d 46 d8 lea -0x28(%esi),%eax
c04362ab: 5b pop %ebx
c04362ac: 5e pop %esi
c04362ad: c9 leave
c04362ae: c3 ret
c04362af: e8 cc 5d 09 00 call c04cc080 <preempt_schedule>
c04362b4: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi
c04362b8: eb ee jmp c04362a8 <sock_alloc+0x78>
after:
c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20
There's no atomic instructions at all - the counters here are only
accessed locally. They are local-irq-atomic, but not
cacheline-atomic.
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> The new percpu APIs could be used in most of these places already,
> straight away. This is a really good TODO list for places to
> enhance.
Please look at the full list in the cpu alloc v3 patchset and not only
those that I listed here.
> Then a second set of patches could convert percpu_add() / etc. uses
> to __percpu_add() ... but that should be done by those architectures
> that need it (and to the extent they need it), because it's not
> really testable on x86.
Ok. So we convert it and wait until the arch maintainers complain? I
definitely know that there is an IA64 issue with vm statistics.
> I dont really like the PER_CPU / CPU_INC etc. type of all-capitals
> APIs you introduced in the patches above:
I know. Patches would have to be redone against whatever API we agree on.
>
> + __CPU_INC(bt->sequence);
> + CPU_FREE(bt->sequence);
>
> was there any strong reason to go outside the well-established
> percpu_* name space and call these primitives as if they were
> macros?
They are macros and may do weird things with the variables. This goes back
to our disagreement last year on caps/lower case. I still think this kind
of preprocessor magic should be uppercase.
The reason not to use the percpu_* names was that they were x86 arch
specific (and thus not available) and did not differentiate in terms of
the irq/preemption context.
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> > may not care too much about a event counter missing a beat once in
> > a while for platforms not supporting atomic per cpu ops. I know
> > this affects IA64. The cost of an atomic operations for an event
> > counter update (which would have avoided the potential of a
> > concurrent update) was not justifiable.
>
> when you say "atomics", do you mean the classic meaning of atomics?
> Because there are no classic atomics involved. This is the
> before/after disassembly from Eric's commit 4e69489a0:
The fallback for IA64 would be to use full (classic) atomic operations
(fetchadd) instead of the fast interrupt-atomic ops available on x86.
> c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20
>
> There's no atomic instructions at all - the counters here are only
> accessed locally. They are local-irq-atomic, but not
> cacheline-atomic.
Right, but that is not available on IA64. So one must choose between
manually disabling interrupts and then incrementing the counter (a long
code sequence) and a classic atomic operation for the fallback.
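i.e., roughly this kind of generic fallback (illustrative counter name,
not actual kernel code):
#include <linux/percpu.h>
/* Hypothetical event counter, for illustration only. */
static DEFINE_PER_CPU(atomic_long_t, nr_bar_events);
static void count_bar_event(void)
{
        /*
         * A real atomic RMW (e.g. fetchadd on ia64): no update is lost
         * even if we migrate after reading the cpu number, but it is far
         * more expensive than the single %gs-relative addl on x86.
         */
        atomic_long_inc(&per_cpu(nr_bar_events, raw_smp_processor_id()));
}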
On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> nice. Do these all get called with irqs off, all the time?
There are lots of non-irq-off cases where non-atomic counters are safe. A
couple of reasons:
- counters that just don't care enough. Some statistics are very
important to always be exact, others are "helpful" but performance is
more important than being exactly right.
- counters simply not accessed (or at least changed) from interrupts at
all. This is very common for some cases, notably "user event" stats.
They may need preemption support, but nothing fancier.
- counters that are "idempotent" wrt interrupts. For example, anything
that will always count up and then down again in a paired fashion
(think kernel lock counter, preemption counter etc) is inherently safe
as a per-cpu counter without needing any protection at all, since any
interrupts may _change_ them, but will always change them back, so even
if a non-atomic sequence gets interrupted, it doesn't matter.
In fact, I'd argue that for a per-cpu counter, the whole _point_ of the
exercise is almost always performance, so locked sequences would be bad to
assume. The fact that x86 can do "atomic" per-cpu accesses with basically
zero cost (by virtue of its rmw memory ops) is unusual.
Most other architectures will have a very hard time making such counters
cheap, since for them there is either "irq atomicity" (an expensive irq
disable or a store-conditional) or "SMP atomicity" (the same fairly
expensive store-conditional, or something even worse).
Linus
On Thu, 30 Apr 2009, Ingo Molnar wrote:
>
> c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20
>
> There's no atomic instructions at all - the counters here are only
> accessed locally. They are local-irq-atomic, but not
> cacheline-atomic.
On other architectures, you need the whole "disable preemption,
load-locked, store-conditional, test-and-loop, enable preemption" thing.
Or "disable interrupts, load, store, restore interrupts".
There really aren't very many architectures that can do almost
unrestricted ALU ops in a single instruction (and thus automatically safe
from preemption and interrupts).
Linus
* Linus Torvalds <[email protected]> wrote:
> On Thu, 30 Apr 2009, Ingo Molnar wrote:
> >
> > c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20
> >
> > There's no atomic instructions at all - the counters here are
> > only accessed locally. They are local-irq-atomic, but not
> > cacheline-atomic.
>
> On other architectures, you need the whole "disable preemption,
> load-locked, store-conditional, test-and-loop, enable preemption"
> thing.
>
> Or "disable interrupts, load, store, restore interrupts".
>
> There really aren't very many architectures that can do almost
> unrestricted ALU ops in a single instruction (and thus
> automatically safe from preemption and interrupts).
Maybe then what we should do is go back to the very first version of commit
6dbde35308: declaredly make percpu_arith_op() non-irq-atomic (and
non-preempt-atomic) everywhere. The commit's internal changelog
still says:
* made generic percpu ops atomic against preemption
So we introduced preemption-safety in the v2 version of that commit.
This non-atomicity will 1) either not matter, 2) be irq-atomic
by virtue of being within a critical section, or 3) be made atomic
explicitly in the few remaining cases.
And maybe, at most, introduce an opt-in API: percpu_add_irqsafe().
Right?
Ingo
On Thu, 30 Apr 2009, Ingo Molnar wrote:
> So we introduced preemption-safety in the v2 version of that commit.
>
> This non-atomicity will 1) either not matter 2) will be irq-atomic
> by virtue of being within a critical section 3) can be made atomic
> in the few remaining cases.
>
> And maybe, at most, introduce an opt-in API: percpu_add_irqsafe().
Plus percpu_add_preemptsafe().
* Christoph Lameter ([email protected]) wrote:
> On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
>
> > > The 3 variants on x86 generate the same instructions. On other platforms
> > > they would need to be able to fallback in various way depending on the
> > > availability of instructions that are atomic vs. preempt or irqs.
> > >
> >
> > The problem here, as we did figure out a while ago with the atomic
> > slub we worked on a while ago, is that if we have the following code :
> >
> > local_irq_save
> > var++
> > var++
> > local_irq_restore
> >
> > that we would like to turn into irq-safe percpu variant with this
> > semantic :
> >
> > percpu_add_irqsafe(var)
> > percpu_add_irqsafe(var)
> >
> > We are generating two irq save/restore in the fallback, which will be
> > slow.
> >
> > However, we could do the following trick :
> >
> > percpu_irqsave(flags);
> > percpu_add_irq(var);
> > percpu_add_irq(var);
> > percpu_irqrestore(flags);
>
> Hmmm. I do not remember any of those double ops in the patches that I did a
> while back for this. It does not make sense either because atomic per cpu
> ops are only atomic for a single instruction. You are trying to extend
> that so that multiple "atomic" instructions are now atomic.
>
Hrm, not exactly. So I probably chose the naming of the primitives
poorly here if my idea seems unclear. Here is what I am trying to do :
On architectures with irq-safe percpu_add :
- No need to disable interrupts at all
On archs lacking such irq-safe percpu_add :
- disabling interrupts only once for a sequence of percpu counter operations.
I tried to come up with an example in vmstat where multiple percpu ops
would be required, but I figured out that the code needs to be changed
to support percpu ops correctly. However separating
percpu_irqsave/restore from percpu_add_return_irq lets us express
__inc_zone_state and inc_zone_state cleanly, which would be difficult
otherwise.
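For clarity, here is a minimal sketch of the percpu_irqsave()/
percpu_irqrestore() pair used below (editor's illustration;
ARCH_HAS_IRQSAFE_PERCPU_OPS is a made-up symbol standing for "this
architecture has irq-safe percpu ops"):

#ifdef ARCH_HAS_IRQSAFE_PERCPU_OPS
/* The percpu_*_irq ops are already irq-safe: nothing to disable. */
#define percpu_irqsave(flags)           do { (void)(flags); } while (0)
#define percpu_irqrestore(flags)        do { (void)(flags); } while (0)
#else
/* Fallback: one irq disable/enable protects a whole sequence of ops. */
#define percpu_irqsave(flags)           local_irq_save(flags)
#define percpu_irqrestore(flags)        local_irq_restore(flags)
#endif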
Let's assume we change the stat_threshold values in mm/vmstat.c so they
become powers of 2 (then we don't care when the u8 overflows, and it
becomes a free-running counter). This model does not support
"overstep" (yet).
Then assume :
u8 stat_threshold_mask = pcp->stat_threshold - 1;
mm/vmstat.c :
void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
... assuming p references percpu "u8" counters ...
u8 p_new;
p_new = percpu_add_return_irq(p, 1);
if (unlikely(!(p_new & pcp->stat_threshold_mask)))
zone_page_state_add(pcp->stat_threshold, zone, item);
}
void inc_zone_state(struct zone *zone, enum zone_stat_item item)
{
unsigned long flags;
/*
* Disabling interrupts _only_ on architectures lacking atomic
* percpu_*_irq ops.
*/
percpu_irqsave(flags);
__inc_zone_state(zone, item);
percpu_irqrestore(flags);
}
void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
... assuming p references percpu "u8" counters ...
u8 p_new;
p_new = percpu_sub_return_irq(p, 1);
if (unlikely(!(p_new & pcp->stat_threshold_mask)))
zone_page_state_add(-(pcp->stat_threshold), zone, item);
}
void dec_zone_state(struct zone *zone, enum zone_stat_item item)
{
unsigned long flags;
/*
* Disabling interrupts _only_ on architectures lacking atomic
* percpu_*_irq ops.
*/
percpu_irqsave(flags);
__dec_zone_state(zone, item);
percpu_irqrestore(flags);
}
void __mod_zone_state(struct zone *zone, enum zone_stat_item item, long delta)
{
... assuming p references percpu "u8" counters ...
u8 p_new;
long overflow_delta;
p_new = percpu_add_return_irq(p, delta);
/*
* We must count the number of threshold overflow generated by
* "delta". I know, this looks rather odd.
*/
overflow_delta = ((long)p_new & ~(long)pcp->stat_threshold_mask)
- (((long)p_new - delta)
& ~(long)pcp->stat_threshold_mask);
if (unlikely(abs(overflow_delta) > pcp->stat_threshold_mask))
zone_page_state_add(overflow_delta, zone, item);
}
void mod_zone_state(struct zone *zone, enum zone_stat_item item, long delta)
{
unsigned long flags;
/*
* Disabling interrupts _only_ on architectures lacking atomic
* percpu_*_irq ops.
*/
percpu_irqsave(flags);
__mod_zone_state(zone, item, delta);
percpu_irqrestore(flags);
}
Note that all the fast paths would execute with preemption enabled if the
architecture supports irqsave percpu atomic ops.
So as we can see, if percpu ops are used on _different_ counters, a
sequence of multiple percpu ops may be required. However, in the vmstat
case, given that the version currently in mainline uses a sequence of
operations on the same variable, this requires re-engineering the
structure, because otherwise races with preemption would occur.
disclaimer : the code above has been written in an email client and may
not compile/work/etc.
Mathieu
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
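As a quick sanity check of the free-running-counter idea above, this
stand-alone program (editor's sketch, plain user-space C) shows that with a
power-of-2 threshold the "!(new & mask)" test fires exactly once every
"threshold" increments, even across u8 wrap-around:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        const uint8_t threshold = 32;   /* must be a power of 2 */
        const uint8_t mask = threshold - 1;
        uint8_t counter = 0;            /* free-running, wraps at 256 */
        int crossings = 0;
        int i;

        for (i = 0; i < 1000; i++) {
                uint8_t new = ++counter;
                if (!(new & mask))      /* same test as in __inc_zone_state() */
                        crossings++;
        }
        printf("crossings = %d (expected %d)\n", crossings, 1000 / threshold);
        return 0;
}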
On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
> On architectures with irq-safe percpu_add :
> - No need to disable interrupts at all
They may have other options, like falling back to classic atomic ops.
> Let's assume we change the stat_threshold values in mm/vmstat.c so they
> become power of 2 (so we don't care when the u8 overflow occurs so it
> becomes a free running counter). This model does not support
> "overstep" (yet).
The problem with such an approach is that the counters may wrap if
sufficiently many processors are in a ZVC update. Let's say a new interrupt
arrives, while interrupts are held off, that increments the same counter.
When interrupts are re-enabled, we increment again from the new interrupt.
We then check for overflow only after the new interrupt that modified the
counter has completed. The differentials must be constant while the ZVC
update functions are running and with your modifications that is no longer
guaranteed.
> Note that all the fast-path would execute with preemption enabled if the
> architecture supports irqsave percpu atomic ops.
What would work is to use a percpu cmpxchg for the ZVC updates.
I wrote a patch like that over a year ago. There is no percpu_cmpxchg
coming with the percpu ops as of today so we'd need to add that first.
Not sure if this is a performance win though. Complexity increases
somewhat.
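For reference, a generic percpu_cmpxchg() could presumably be built from
cmpxchg_local() plus preemption protection, roughly as in this sketch
(editor's assumption; my_percpu_cmpxchg() and my_pcp_slot are made-up names,
not an existing interface):

#include <linux/percpu.h>
#include <linux/preempt.h>

static DEFINE_PER_CPU(unsigned long, my_pcp_slot);

static inline unsigned long my_percpu_cmpxchg(unsigned long old,
                                              unsigned long new)
{
        unsigned long ret;

        /*
         * preempt_disable() pins us to one CPU so that cmpxchg_local(),
         * which is provided by the architecture (or its asm-generic
         * fallback), really operates on our own per-cpu slot.
         */
        preempt_disable();
        ret = cmpxchg_local(&__get_cpu_var(my_pcp_slot), old, new);
        preempt_enable();
        return ret;
}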
---
include/linux/vmstat.h | 17 ++++--
mm/vmstat.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 127 insertions(+), 12 deletions(-)
Index: linux-2.6/include/linux/vmstat.h
===================================================================
--- linux-2.6.orig/include/linux/vmstat.h 2007-11-07 18:36:03.000000000 -0800
+++ linux-2.6/include/linux/vmstat.h 2007-11-07 18:38:27.000000000 -0800
@@ -202,15 +202,22 @@ extern void inc_zone_state(struct zone *
void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
void __inc_zone_page_state(struct page *, enum zone_stat_item);
void __dec_zone_page_state(struct page *, enum zone_stat_item);
+void __inc_zone_state(struct zone *, enum zone_stat_item);
+void __dec_zone_state(struct zone *, enum zone_stat_item);
+#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+#define inc_zone_page_state __inc_zone_page_state
+#define dec_zone_page_state __dec_zone_page_state
+#define mod_zone_page_state __mod_zone_page_state
+#define inc_zone_state __inc_zone_state
+#define dec_zone_state __dec_zone_state
+#else
void mod_zone_page_state(struct zone *, enum zone_stat_item, int);
void inc_zone_page_state(struct page *, enum zone_stat_item);
void dec_zone_page_state(struct page *, enum zone_stat_item);
-
-extern void inc_zone_state(struct zone *, enum zone_stat_item);
-extern void __inc_zone_state(struct zone *, enum zone_stat_item);
-extern void dec_zone_state(struct zone *, enum zone_stat_item);
-extern void __dec_zone_state(struct zone *, enum zone_stat_item);
+void inc_zone_state(struct zone *, enum zone_stat_item);
+void dec_zone_state(struct zone *, enum zone_stat_item);
+#endif
void refresh_cpu_vm_stats(int);
#else /* CONFIG_SMP */
Index: linux-2.6/mm/vmstat.c
===================================================================
--- linux-2.6.orig/mm/vmstat.c 2007-11-07 17:20:16.000000000 -0800
+++ linux-2.6/mm/vmstat.c 2007-11-07 18:55:57.000000000 -0800
@@ -151,8 +151,109 @@ static void refresh_zone_stat_thresholds
}
}
+#ifdef CONFIG_FAST_CMPXCHG_LOCAL
+void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
+ int delta)
+{
+ struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ s8 *p = pcp->vm_stat_diff + item;
+ s8 old;
+ unsigned long new;
+ unsigned long add;
+
+ do {
+ add = 0;
+ old = *p;
+ new = old + delta;
+
+ if (unlikely(new > pcp->stat_threshold ||
+ new < -pcp->stat_threshold)) {
+ add = new;
+ new = 0;
+ }
+
+ } while (cmpxchg_local(p, old, new) != old);
+
+ if (add)
+ zone_page_state_add(add, zone, item);
+}
+EXPORT_SYMBOL(__mod_zone_page_state);
+
+/*
+ * Optimized increment and decrement functions implemented using
+ * cmpxchg_local. These do not require interrupts to be disabled
+ * and enabled.
+ */
+void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ s8 *p = pcp->vm_stat_diff + item;
+ int add;
+ unsigned long old;
+ unsigned long new;
+
+ do {
+ add = 0;
+ old = *p;
+ new = old + 1;
+
+ if (unlikely(new > pcp->stat_threshold)) {
+ add = new + pcp->stat_threshold / 2;
+ new = -(pcp->stat_threshold / 2);
+ }
+ } while (cmpxchg_local(p, old, new) != old);
+
+ if (add)
+ zone_page_state_add(add, zone, item);
+}
+
+void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ __inc_zone_state(page_zone(page), item);
+}
+EXPORT_SYMBOL(__inc_zone_page_state);
+
+void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
+{
+ struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
+ s8 *p = pcp->vm_stat_diff + item;
+ int sub;
+ unsigned long old;
+ unsigned long new;
+
+ do {
+ sub = 0;
+ old = *p;
+ new = old - 1;
+
+ if (unlikely(new < - pcp->stat_threshold)) {
+ sub = new - pcp->stat_threshold / 2;
+ new = pcp->stat_threshold / 2;
+ }
+ } while (cmpxchg_local(p, old, new) != old);
+
+ if (sub)
+ zone_page_state_add(sub, zone, item);
+}
+
+void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
+{
+ __dec_zone_state(page_zone(page), item);
+}
+EXPORT_SYMBOL(__dec_zone_page_state);
+
+static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
+{
+ /*
+ * xchg_local() would be useful here but that does not exist.
+ */
+ zone_page_state_add(xchg(&p->vm_stat_diff[i], 0), zone, i);
+}
+
+#else /* CONFIG_FAST_CMPXCHG_LOCAL */
+
/*
- * For use when we know that interrupts are disabled.
+ * Functions that do not rely on cmpxchg_local
*/
void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
int delta)
@@ -281,6 +382,17 @@ void dec_zone_page_state(struct page *pa
}
EXPORT_SYMBOL(dec_zone_page_state);
+static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ zone_page_state_add(p->vm_stat_diff[i], zone, i);
+ p->vm_stat_diff[i] = 0;
+ local_irq_restore(flags);
+}
+#endif /* !CONFIG_FAST_CMPXCHG_LOCAL */
+
/*
* Update the zone counters for one cpu.
*
@@ -299,7 +411,6 @@ void refresh_cpu_vm_stats(int cpu)
{
struct zone *zone;
int i;
- unsigned long flags;
for_each_zone(zone) {
struct per_cpu_pageset *p;
@@ -311,15 +422,12 @@ void refresh_cpu_vm_stats(int cpu)
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
if (p->vm_stat_diff[i]) {
- local_irq_save(flags);
- zone_page_state_add(p->vm_stat_diff[i],
- zone, i);
- p->vm_stat_diff[i] = 0;
+ sync_diff(zone, p, i);
+
#ifdef CONFIG_NUMA
/* 3 seconds idle till flush */
p->expire = 3;
#endif
- local_irq_restore(flags);
}
#ifdef CONFIG_NUMA
/*
* Christoph Lameter ([email protected]) wrote:
> On Thu, 30 Apr 2009, Mathieu Desnoyers wrote:
>
> > On architectures with irq-safe percpu_add :
> > - No need to disable interrupts at all
>
> They may have other options, like falling back to classic atomic ops.
>
> > Let's assume we change the stat_threshold values in mm/vmstat.c so they
> > become power of 2 (so we don't care when the u8 overflow occurs so it
> > becomes a free running counter). This model does not support
> > "overstep" (yet).
>
> The problem with such an approach is that the counters may wrap if
> sufficiently many processors are in a ZVC update.
By ZVC update, you mean Zone ... Counter update ? (which code exactly ?)
> Let's say a new interrupt
> arrives, while interrupts are held off, that increments the same counter.
> When interrupts are re-enabled, we increment again from the new interrupt.
> We then check for overflow only after the new interrupt that modified the
> counter has completed. The differentials must be constant while the ZVC
> update functions are running and with your modifications that is no longer
> guaranteed.
>
Hrm, I must admit I'm not sure I follow how your reasoning applies to my
code. I am using a percpu_add_return_irq() exactly for this reason : it
only ever touches the percpu variable once and atomically. The test for
overflow is done on the value returned by percpu_add_return_irq().
Therefore, an interrupt scenario that would be close to what I
understand from your concerns would be :
* Thread A
inc_zone_page_state()
p_ret = percpu_add_return(p, 1); (let's suppose this increment
overflows the threshold, therefore
(p_ret & mask) == 0)
----> interrupt comes in, preempts the current thread, execution in a
different thread context (* Thread B) :
inc_zone_page_state()
p_ret = percpu_add_return(p, 1); ((p_ret & mask) == 1)
if (!(p_ret & mask))
increment global zone count. (not executed)
----> interrupt comes in, preempts the current thread, execution back to
the original thread context (Thread A), on the same or on a
different CPU :
if (!(p_ret & mask))
increment global zone count. -----> will therefore increment the
global zone count only after
scheduling back the original
thread.
So I guess what you say here is that if Thread A stays preempted for too
long, we will have to wait until it gets scheduled back before the
global count is incremented. Do we really need that degree of precision
for those counters ?
(I fear I'm not understanding your concern fully though)
> > Note that all the fast-path would execute with preemption enabled if the
> > architecture supports irqsave percpu atomic ops.
>
> What would work is to use a percpu cmpxchg for the ZVC updates.
> I wrote a patch like that over a year ago. There is no percpu_cmpxchg
> coming with the percpu ops as of today so we'd need to add that first.
>
> Not sure if this is a performance win though. Complexity increases
> somewhat.
>
hrm, yes, the patch below clearly does not win the low-complexity
contest. :)
Mathieu
> ---
> include/linux/vmstat.h | 17 ++++--
> mm/vmstat.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 127 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/include/linux/vmstat.h
> ===================================================================
> --- linux-2.6.orig/include/linux/vmstat.h 2007-11-07 18:36:03.000000000 -0800
> +++ linux-2.6/include/linux/vmstat.h 2007-11-07 18:38:27.000000000 -0800
> @@ -202,15 +202,22 @@ extern void inc_zone_state(struct zone *
> void __mod_zone_page_state(struct zone *, enum zone_stat_item item, int);
> void __inc_zone_page_state(struct page *, enum zone_stat_item);
> void __dec_zone_page_state(struct page *, enum zone_stat_item);
> +void __inc_zone_state(struct zone *, enum zone_stat_item);
> +void __dec_zone_state(struct zone *, enum zone_stat_item);
>
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +#define inc_zone_page_state __inc_zone_page_state
> +#define dec_zone_page_state __dec_zone_page_state
> +#define mod_zone_page_state __mod_zone_page_state
> +#define inc_zone_state __inc_zone_state
> +#define dec_zone_state __dec_zone_state
> +#else
> void mod_zone_page_state(struct zone *, enum zone_stat_item, int);
> void inc_zone_page_state(struct page *, enum zone_stat_item);
> void dec_zone_page_state(struct page *, enum zone_stat_item);
> -
> -extern void inc_zone_state(struct zone *, enum zone_stat_item);
> -extern void __inc_zone_state(struct zone *, enum zone_stat_item);
> -extern void dec_zone_state(struct zone *, enum zone_stat_item);
> -extern void __dec_zone_state(struct zone *, enum zone_stat_item);
> +void inc_zone_state(struct zone *, enum zone_stat_item);
> +void dec_zone_state(struct zone *, enum zone_stat_item);
> +#endif
>
> void refresh_cpu_vm_stats(int);
> #else /* CONFIG_SMP */
> Index: linux-2.6/mm/vmstat.c
> ===================================================================
> --- linux-2.6.orig/mm/vmstat.c 2007-11-07 17:20:16.000000000 -0800
> +++ linux-2.6/mm/vmstat.c 2007-11-07 18:55:57.000000000 -0800
> @@ -151,8 +151,109 @@ static void refresh_zone_stat_thresholds
> }
> }
>
> +#ifdef CONFIG_FAST_CMPXCHG_LOCAL
> +void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> + int delta)
> +{
> + struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + s8 *p = pcp->vm_stat_diff + item;
> + s8 old;
> + unsigned long new;
> + unsigned long add;
> +
> + do {
> + add = 0;
> + old = *p;
> + new = old + delta;
> +
> + if (unlikely(new > pcp->stat_threshold ||
> + new < -pcp->stat_threshold)) {
> + add = new;
> + new = 0;
> + }
> +
> + } while (cmpxchg_local(p, old, new) != old);
> +
> + if (add)
> + zone_page_state_add(add, zone, item);
> +}
> +EXPORT_SYMBOL(__mod_zone_page_state);
> +
> +/*
> + * Optimized increment and decrement functions implemented using
> + * cmpxchg_local. These do not require interrupts to be disabled
> + * and enabled.
> + */
> +void __inc_zone_state(struct zone *zone, enum zone_stat_item item)
> +{
> + struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + s8 *p = pcp->vm_stat_diff + item;
> + int add;
> + unsigned long old;
> + unsigned long new;
> +
> + do {
> + add = 0;
> + old = *p;
> + new = old + 1;
> +
> + if (unlikely(new > pcp->stat_threshold)) {
> + add = new + pcp->stat_threshold / 2;
> + new = -(pcp->stat_threshold / 2);
> + }
> + } while (cmpxchg_local(p, old, new) != old);
> +
> + if (add)
> + zone_page_state_add(add, zone, item);
> +}
> +
> +void __inc_zone_page_state(struct page *page, enum zone_stat_item item)
> +{
> + __inc_zone_state(page_zone(page), item);
> +}
> +EXPORT_SYMBOL(__inc_zone_page_state);
> +
> +void __dec_zone_state(struct zone *zone, enum zone_stat_item item)
> +{
> + struct per_cpu_pageset *pcp = zone_pcp(zone, smp_processor_id());
> + s8 *p = pcp->vm_stat_diff + item;
> + int sub;
> + unsigned long old;
> + unsigned long new;
> +
> + do {
> + sub = 0;
> + old = *p;
> + new = old - 1;
> +
> + if (unlikely(new < - pcp->stat_threshold)) {
> + sub = new - pcp->stat_threshold / 2;
> + new = pcp->stat_threshold / 2;
> + }
> + } while (cmpxchg_local(p, old, new) != old);
> +
> + if (sub)
> + zone_page_state_add(sub, zone, item);
> +}
> +
> +void __dec_zone_page_state(struct page *page, enum zone_stat_item item)
> +{
> + __dec_zone_state(page_zone(page), item);
> +}
> +EXPORT_SYMBOL(__dec_zone_page_state);
> +
> +static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
> +{
> + /*
> + * xchg_local() would be useful here but that does not exist.
> + */
> + zone_page_state_add(xchg(&p->vm_stat_diff[i], 0), zone, i);
> +}
> +
> +#else /* CONFIG_FAST_CMPXCHG_LOCAL */
> +
> /*
> - * For use when we know that interrupts are disabled.
> + * Functions that do not rely on cmpxchg_local
> */
> void __mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
> int delta)
> @@ -281,6 +382,17 @@ void dec_zone_page_state(struct page *pa
> }
> EXPORT_SYMBOL(dec_zone_page_state);
>
> +static inline void sync_diff(struct zone *zone, struct per_cpu_pageset *p, int i)
> +{
> + unsigned long flags;
> +
> + local_irq_save(flags);
> + zone_page_state_add(p->vm_stat_diff[i], zone, i);
> + p->vm_stat_diff[i] = 0;
> + local_irq_restore(flags);
> +}
> +#endif /* !CONFIG_FAST_CMPXCHG_LOCAL */
> +
> /*
> * Update the zone counters for one cpu.
> *
> @@ -299,7 +411,6 @@ void refresh_cpu_vm_stats(int cpu)
> {
> struct zone *zone;
> int i;
> - unsigned long flags;
>
> for_each_zone(zone) {
> struct per_cpu_pageset *p;
> @@ -311,15 +422,12 @@ void refresh_cpu_vm_stats(int cpu)
>
> for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
> if (p->vm_stat_diff[i]) {
> - local_irq_save(flags);
> - zone_page_state_add(p->vm_stat_diff[i],
> - zone, i);
> - p->vm_stat_diff[i] = 0;
> + sync_diff(zone, p, i);
> +
> #ifdef CONFIG_NUMA
> /* 3 seconds idle till flush */
> p->expire = 3;
> #endif
> - local_irq_restore(flags);
> }
> #ifdef CONFIG_NUMA
> /*
>
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68