2006-02-23 09:31:11

by Arjan van de Ven

[permalink] [raw]
Subject: [Patch 0/3] threaded mmap tweaks

Hi,

I've been looking at a micro-benchmark that basically starts a few
threads which then each allocate some memory (via mmap), use it briefly
and then free it again (munmap). During this benchmark the mmap_sem gets
contended and as a result things go less well than expected. The patches
in this series improved the benchmark by 3% on a wallclock time basis.

Greetings,
Arjan van de Ven


2006-02-23 09:31:32

by Arjan van de Ven

[permalink] [raw]
Subject: [Patch 2/3] fast VMA recycling

This patch adds a per task-struct cache of a free vma.

In normal operation, it is a really common action during userspace mmap
or malloc to first allocate a vma, and then find out that it can be merged,
and thus free it again. In fact this is the case roughly 95% of the time.

In addition, this patch allows code to "prepopulate" the cache, and
this is done as example for the x86_64 mmap codepath. The advantage of this
prepopulation is that the memory allocation (which is a sleeping operation
due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
reduces lock hold time (and thus the contention potential)

The cache is only allowed to be accessed for "current", and not from IRQ
context. This allows for lockless access, making it a cheap cache.

One could argue that this should be a generic slab feature (the preloading) but
that only gives some of the gains and not all, and vma's are small creatures,
with a high "recycling" rate in typical cases.

Signed-off-by: Arjan van de Ven <[email protected]>

---
arch/x86_64/kernel/sys_x86_64.c | 2 +
include/linux/mm.h | 2 +
include/linux/sched.h | 2 +
kernel/exit.c | 4 +++
kernel/fork.c | 2 +
mm/mmap.c | 50 ++++++++++++++++++++++++++++++++--------
6 files changed, 52 insertions(+), 10 deletions(-)

Index: linux-work/arch/x86_64/kernel/sys_x86_64.c
===================================================================
--- linux-work.orig/arch/x86_64/kernel/sys_x86_64.c
+++ linux-work/arch/x86_64/kernel/sys_x86_64.c
@@ -55,6 +55,8 @@ asmlinkage long sys_mmap(unsigned long a
if (!file)
goto out;
}
+
+ prepopulate_vma();
down_write(&current->mm->mmap_sem);
error = do_mmap_pgoff(file, addr, len, prot, flags, off >> PAGE_SHIFT);
up_write(&current->mm->mmap_sem);
Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h
+++ linux-work/include/linux/mm.h
@@ -1051,6 +1051,8 @@ int shrink_slab(unsigned long scanned, g
void drop_pagecache(void);
void drop_slab(void);

+extern void prepopulate_vma(void);
+
extern int randomize_va_space;

#endif /* __KERNEL__ */
Index: linux-work/include/linux/sched.h
===================================================================
--- linux-work.orig/include/linux/sched.h
+++ linux-work/include/linux/sched.h
@@ -838,6 +838,8 @@ struct task_struct {
/* VM state */
struct reclaim_state *reclaim_state;

+ struct vm_area_struct *free_vma_cache; /* keep 1 free vma around as cache */
+
struct dentry *proc_dentry;
struct backing_dev_info *backing_dev_info;

Index: linux-work/kernel/exit.c
===================================================================
--- linux-work.orig/kernel/exit.c
+++ linux-work/kernel/exit.c
@@ -878,6 +878,10 @@ fastcall NORET_TYPE void do_exit(long co
*/
mutex_debug_check_no_locks_held(tsk);

+ if (tsk->free_vma_cache)
+ kmem_cache_free(vm_area_cachep, tsk->free_vma_cache);
+ tsk->free_vma_cache = NULL;
+
/* PF_DEAD causes final put_task_struct after we schedule. */
preempt_disable();
BUG_ON(tsk->flags & PF_DEAD);
Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c
+++ linux-work/kernel/fork.c
@@ -179,6 +179,8 @@ static struct task_struct *dup_task_stru
/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
atomic_set(&tsk->fs_excl, 0);
+ tsk->free_vma_cache = NULL;
+
return tsk;
}

Index: linux-work/mm/mmap.c
===================================================================
--- linux-work.orig/mm/mmap.c
+++ linux-work/mm/mmap.c
@@ -65,6 +65,36 @@ int sysctl_overcommit_ratio = 50; /* def
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
atomic_t vm_committed_space = ATOMIC_INIT(0);

+
+static void free_vma(struct vm_area_struct *vma)
+{
+ struct vm_area_struct *oldvma;
+
+ oldvma = current->free_vma_cache;
+ current->free_vma_cache = vma;
+ if (oldvma)
+ kmem_cache_free(vm_area_cachep, oldvma);
+}
+
+static struct vm_area_struct *alloc_vma(void)
+{
+ if (current->free_vma_cache) {
+ struct vm_area_struct *vma;
+ vma = current->free_vma_cache;
+ current->free_vma_cache = NULL;
+ return vma;
+ }
+ return kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+}
+
+void prepopulate_vma(void)
+{
+ if (!current->free_vma_cache)
+ current->free_vma_cache =
+ kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+}
+
+
/*
* Check that a process has enough memory to allocate a new virtual
* mapping. 0 means there is enough memory for the allocation to
@@ -206,7 +236,7 @@ static struct vm_area_struct *remove_vma
if (vma->vm_file)
fput(vma->vm_file);
mpol_free(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma(vma);
return next;
}

@@ -593,7 +623,7 @@ again: remove_next = 1 + (end > next->
fput(file);
mm->map_count--;
mpol_free(vma_policy(next));
- kmem_cache_free(vm_area_cachep, next);
+ free_vma(next);
/*
* In mprotect's case 6 (see comments on vma_merge),
* we must remove another next too. It would clutter
@@ -1048,7 +1078,7 @@ munmap_back:
* specific mapper. the address has already been validated, but
* not unmapped, but the maps are removed from the list.
*/
- vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+ vma = alloc_vma();
if (!vma) {
error = -ENOMEM;
goto unacct_error;
@@ -1113,7 +1143,7 @@ munmap_back:
fput(file);
}
mpol_free(vma_policy(vma));
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma(vma);
}
out:
mm->total_vm += len >> PAGE_SHIFT;
@@ -1140,7 +1170,7 @@ unmap_and_free_vma:
unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
free_vma:
- kmem_cache_free(vm_area_cachep, vma);
+ free_vma(vma);
unacct_error:
if (charged)
vm_unacct_memory(charged);
@@ -1711,7 +1741,7 @@ int split_vma(struct mm_struct * mm, str
if (mm->map_count >= sysctl_max_map_count)
return -ENOMEM;

- new = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+ new = alloc_vma();
if (!new)
return -ENOMEM;

@@ -1727,7 +1757,7 @@ int split_vma(struct mm_struct * mm, str

pol = mpol_copy(vma_policy(vma));
if (IS_ERR(pol)) {
- kmem_cache_free(vm_area_cachep, new);
+ free_vma(new);
return PTR_ERR(pol);
}
vma_set_policy(new, pol);
@@ -1904,7 +1934,7 @@ unsigned long do_brk(unsigned long addr,
/*
* create a vma struct for an anonymous mapping
*/
- vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+ vma = alloc_vma();
if (!vma) {
vm_unacct_memory(len >> PAGE_SHIFT);
return -ENOMEM;
@@ -2024,12 +2054,12 @@ struct vm_area_struct *copy_vma(struct v
vma_start < new_vma->vm_end)
*vmap = new_vma;
} else {
- new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+ new_vma = alloc_vma();
if (new_vma) {
*new_vma = *vma;
pol = mpol_copy(vma_policy(vma));
if (IS_ERR(pol)) {
- kmem_cache_free(vm_area_cachep, new_vma);
+ free_vma(new_vma);
return NULL;
}
vma_set_policy(new_vma, pol);

2006-02-23 09:31:33

by Arjan van de Ven

[permalink] [raw]
Subject: [Patch 3/3] prepopulate/cache cleared pages

This patch adds an entry for a cleared page to the task struct. The main
purpose of this patch is to be able to pre-allocate and clear a page in a
pagefault scenario before taking any locks (esp mmap_sem),
opportunistically. Allocating+clearing a page is an very expensive
operation that currently increases lock hold times quite bit (in a threaded
environment that allocates/use/frees memory on a regular basis, this leads
to contention).

This is probably the most controversial patch of the 3, since there is
a potential to take up 1 page per thread in this cache. In practice it's
not as bad as it sounds (a large degree of the pagefaults are anonymous
and thus immediately use up the page). One could argue "let the VM reap
these" but that has a few downsides; it increases locking needs but more,
clearing a page is relatively expensive, if the VM reaps the page again
in case it wasn't needed, the work was just wasted.


Signed-off-by: Arjan van de Ven <[email protected]>

---
arch/x86_64/mm/fault.c | 2 ++
include/linux/mm.h | 1 +
include/linux/sched.h | 1 +
kernel/exit.c | 4 ++++
kernel/fork.c | 1 +
mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++++++
6 files changed, 46 insertions(+)

Index: linux-work/arch/x86_64/mm/fault.c
===================================================================
--- linux-work.orig/arch/x86_64/mm/fault.c
+++ linux-work/arch/x86_64/mm/fault.c
@@ -375,6 +375,8 @@ asmlinkage void __kprobes do_page_fault(
goto bad_area_nosemaphore;

again:
+ prepare_cleared_page();
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunatly, in the case of an
Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h
+++ linux-work/include/linux/mm.h
@@ -1052,6 +1052,7 @@ void drop_pagecache(void);
void drop_slab(void);

extern void prepopulate_vma(void);
+extern void prepopulate_cleared_page(void);

extern int randomize_va_space;

Index: linux-work/include/linux/sched.h
===================================================================
--- linux-work.orig/include/linux/sched.h
+++ linux-work/include/linux/sched.h
@@ -839,6 +839,7 @@ struct task_struct {
struct reclaim_state *reclaim_state;

struct vm_area_struct *free_vma_cache; /* keep 1 free vma around as cache */
+ struct page *cleared_page; /* optionally keep 1 cleared page around */

struct dentry *proc_dentry;
struct backing_dev_info *backing_dev_info;
Index: linux-work/kernel/exit.c
===================================================================
--- linux-work.orig/kernel/exit.c
+++ linux-work/kernel/exit.c
@@ -882,6 +882,10 @@ fastcall NORET_TYPE void do_exit(long co
kmem_cache_free(vm_area_cachep, tsk->free_vma_cache);
tsk->free_vma_cache = NULL;

+ if (tsk->cleared_page)
+ __free_page(tsk->cleared_page);
+ tsk->cleared_page = NULL;
+
/* PF_DEAD causes final put_task_struct after we schedule. */
preempt_disable();
BUG_ON(tsk->flags & PF_DEAD);
Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c
+++ linux-work/kernel/fork.c
@@ -180,6 +180,7 @@ static struct task_struct *dup_task_stru
atomic_set(&tsk->usage,2);
atomic_set(&tsk->fs_excl, 0);
tsk->free_vma_cache = NULL;
+ tsk->cleared_page = NULL;

return tsk;
}
Index: linux-work/mm/mempolicy.c
===================================================================
--- linux-work.orig/mm/mempolicy.c
+++ linux-work/mm/mempolicy.c
@@ -1231,6 +1231,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);

+ if ( (gfp & __GFP_ZERO) && current->cleared_page) {
+ struct page *addr;
+ addr = current->cleared_page;
+ current->cleared_page = NULL;
+ return addr;
+ }
+
cpuset_update_task_memory_state();

if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
@@ -1242,6 +1249,36 @@ alloc_page_vma(gfp_t gfp, struct vm_area
return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
}

+
+/**
+ * prepare_cleared_page - populate the per-task zeroed-page cache
+ *
+ * This function populates the per-task cache with one zeroed page
+ * (if there wasn't one already)
+ * The idea is that this (expensive) clearing is done before any
+ * locks are taken, speculatively, and that when the page is actually
+ * needed under a lock, it is ready for immediate use
+ */
+
+void prepare_cleared_page(void)
+{
+ struct mempolicy *pol = current->mempolicy;
+
+ if (current->cleared_page)
+ return;
+
+ cpuset_update_task_memory_state();
+
+ if (!pol)
+ pol = &default_policy;
+ if (pol->policy == MPOL_INTERLEAVE)
+ current->cleared_page = alloc_page_interleave(
+ GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
+ current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
+ 0, zonelist_policy(GFP_USER, pol));
+}
+
+
/**
* alloc_pages_current - Allocate pages.
*

2006-02-23 09:43:22

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> This patch adds an entry for a cleared page to the task struct. The main
> purpose of this patch is to be able to pre-allocate and clear a page in a
> pagefault scenario before taking any locks (esp mmap_sem),
> opportunistically. Allocating+clearing a page is an very expensive
> operation that currently increases lock hold times quite bit (in a threaded
> environment that allocates/use/frees memory on a regular basis, this leads
> to contention).
>
> This is probably the most controversial patch of the 3, since there is
> a potential to take up 1 page per thread in this cache. In practice it's
> not as bad as it sounds (a large degree of the pagefaults are anonymous
> and thus immediately use up the page). One could argue "let the VM reap
> these" but that has a few downsides; it increases locking needs but more,
> clearing a page is relatively expensive, if the VM reaps the page again
> in case it wasn't needed, the work was just wasted.

Looks like an incredible bad hack. What workload was that again where it helps?
And how much? I think before we can consider adding that ugly code you would a far better
rationale.

-Andi

2006-02-23 09:42:59

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> This patch adds a per task-struct cache of a free vma.
>
> In normal operation, it is a really common action during userspace mmap
> or malloc to first allocate a vma, and then find out that it can be merged,
> and thus free it again. In fact this is the case roughly 95% of the time.
>
> In addition, this patch allows code to "prepopulate" the cache, and
> this is done as example for the x86_64 mmap codepath. The advantage of this
> prepopulation is that the memory allocation (which is a sleeping operation
> due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> reduces lock hold time (and thus the contention potential)

The slab fast path doesn't sleep. And if you're short of memory
then you probably have other issues.

> The cache is only allowed to be accessed for "current", and not from IRQ
> context. This allows for lockless access, making it a cheap cache.
>
> One could argue that this should be a generic slab feature (the preloading) but
> that only gives some of the gains and not all, and vma's are small creatures,
> with a high "recycling" rate in typical cases.

Numbers numbers numbers. What workload? How much did it help?

Also looks like a very narrow purpose hard to maintain fragile hack to me.

-Andi

2006-02-23 09:48:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > This patch adds a per task-struct cache of a free vma.
> >
> > In normal operation, it is a really common action during userspace mmap
> > or malloc to first allocate a vma, and then find out that it can be merged,
> > and thus free it again. In fact this is the case roughly 95% of the time.
> >
> > In addition, this patch allows code to "prepopulate" the cache, and
> > this is done as example for the x86_64 mmap codepath. The advantage of this
> > prepopulation is that the memory allocation (which is a sleeping operation
> > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > reduces lock hold time (and thus the contention potential)
>
> The slab fast path doesn't sleep.

it does via might_sleep()

> Numbers numbers numbers. What workload? How much did it help?

see post 0/3

3% on a threaded allocation benchmark (which resembles a webserver with
cgis apparently)

2006-02-23 10:09:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > This patch adds a per task-struct cache of a free vma.
> > >
> > > In normal operation, it is a really common action during userspace mmap
> > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > and thus free it again. In fact this is the case roughly 95% of the time.
> > >
> > > In addition, this patch allows code to "prepopulate" the cache, and
> > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > prepopulation is that the memory allocation (which is a sleeping operation
> > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > > reduces lock hold time (and thus the contention potential)
> >
> > The slab fast path doesn't sleep.
>
> it does via might_sleep()

Hmm? That shouldn't sleep.

If it takes any time in a real workload then it should move into DEBUG_KERNEL
too. But I doubt it. Something with your analysis is wrong.

-Andi

P.S.: Your email address bounces.

2006-02-23 10:15:16

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thu, 2006-02-23 at 11:05 +0100, Andi Kleen wrote:
> On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > This patch adds a per task-struct cache of a free vma.
> > > >
> > > > In normal operation, it is a really common action during userspace mmap
> > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > >
> > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > > > reduces lock hold time (and thus the contention potential)
> > >
> > > The slab fast path doesn't sleep.
> >
> > it does via might_sleep()
>
> Hmm? That shouldn't sleep.

see voluntary preempt.


> If it takes any time in a real workload then it should move into DEBUG_KERNEL
> too. But I doubt it. Something with your analysis is wrong.

well I'm seeing contention; and this is one of the things that can be
moved out of the lock easily, and especially given the high recycle rate
of these things... looks worth it to me.



> P.S.: Your email address bounces.

sorry about that; made a typo when trying to figure out how to set up
non-corrupting patches lkml

2006-02-23 11:00:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thursday 23 February 2006 11:15, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 11:05 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> > > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > > This patch adds a per task-struct cache of a free vma.
> > > > >
> > > > > In normal operation, it is a really common action during userspace mmap
> > > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > > >
> > > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > > > > reduces lock hold time (and thus the contention potential)
> > > >
> > > > The slab fast path doesn't sleep.
> > >
> > > it does via might_sleep()
> >
> > Hmm? That shouldn't sleep.
>
> see voluntary preempt.

Only when its time slice is used up but then it would sleep a bit later
in user space. But it should be really a unlikely case and nothing
to optimize for.

>
> > If it takes any time in a real workload then it should move into DEBUG_KERNEL
> > too. But I doubt it. Something with your analysis is wrong.
>
> well I'm seeing contention; and this is one of the things that can be
> moved out of the lock easily, and especially given the high recycle rate
> of these things... looks worth it to me.

I think you need a better analysis of what is actually happening
instead of trying all kind of weird quick fragile hacks.

-Andi

2006-02-23 11:22:38

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

Andi Kleen wrote:
>> see voluntary preempt.
>
> Only when its time slice is used up

or if some other thread gets a higher dynamic prio
> but then it would sleep a bit later
> in user space.

... but that is without the semaphore held! (and that is the entire
point of this patch, move the sleep moments to outside the lock holding
area as much as possible, to reduce lock hold times)


2006-02-23 11:57:53

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thursday 23 February 2006 12:22, Arjan van de Ven wrote:
> Andi Kleen wrote:
> >> see voluntary preempt.
> >
> > Only when its time slice is used up
>
> or if some other thread gets a higher dynamic prio
> > but then it would sleep a bit later
> > in user space.
>
> ... but that is without the semaphore held! (and that is the entire
> point of this patch, move the sleep moments to outside the lock holding
> area as much as possible, to reduce lock hold times)

And you verified this happens often in your workload?

Anyways, how about adding a down_no_preempt() or similar instead
that won't voluntarily preempt?

-Andi

2006-02-23 12:43:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages


* Andi Kleen <[email protected]> wrote:

> On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > This patch adds an entry for a cleared page to the task struct. The main
> > purpose of this patch is to be able to pre-allocate and clear a page in a
> > pagefault scenario before taking any locks (esp mmap_sem),
> > opportunistically. Allocating+clearing a page is an very expensive
> > operation that currently increases lock hold times quite bit (in a threaded
> > environment that allocates/use/frees memory on a regular basis, this leads
> > to contention).
> >
> > This is probably the most controversial patch of the 3, since there is
> > a potential to take up 1 page per thread in this cache. In practice it's
> > not as bad as it sounds (a large degree of the pagefaults are anonymous
> > and thus immediately use up the page). One could argue "let the VM reap
> > these" but that has a few downsides; it increases locking needs but more,
> > clearing a page is relatively expensive, if the VM reaps the page again
> > in case it wasn't needed, the work was just wasted.
>
> Looks like an incredible bad hack. What workload was that again where
> it helps? And how much? I think before we can consider adding that
> ugly code you would a far better rationale.

yes, the patch is controversial technologically, and Arjan pointed it
out above. This is nothing new - and Arjan probably submitted this to
lkml so that he can get contructive feedback.

What Arjan did is quite nifty, as it moves the page clearing out from
under the mmap_sem-held critical section. How that is achieved is really
secondary, it's pretty clear that it could be done in some nicer way.
Furthermore, we havent seen any significant advance in that area of the
kernel [threaded mmap() performance] in quite some time, so
contributions are both welcome and encouraged.

But Andi, regarding the tone of your reply - it is really getting out of
hand! Please lean back and take a deep breath. Maybe you should even
sleep over it. And when you come back, please start being _much_ more
respectful of other people's work. You are not doing Linux _any_ favor
by flaming away so mindlessly ... Arjan is a longtime contributor and he
can probably take your flames just fine (as I took your flames just fine
the other day), but we should really use a _much_ nicer tone on lkml.

You might not realize it, but replies like this can scare away novice
contributors forever! You could scare away the next DaveM. Or the next
Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare
away even longtime contributors: see Rusty Russell's comments from a
couple of days ago ...

And no, i dont accept the lame "dont come into the kitchen if you cant
stand the flames" excuse: your reply was totally uncalled for, was
totally undeserved and was totally unnecessary. It was incredibly mean
spirited, and the only effect it can possibly have on the recipient is
harm. And it's not like this happened in the heat of a longer discussion
or due to some misunderstanding: you flamed away _right at the
beginning_, right as the first reply to the patch submission. Shame on
you! Is it that hard to reply:

"Hm, nice idea, I wish it were cleaner though because
<insert explanation here>. How about <insert nifty idea here>."

[ Btw., i could possibly have come up with a good solution for this
during the time i had to waste on this reply. ]

Ingo

2006-02-23 13:06:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Thursday 23 February 2006 13:41, Ingo Molnar wrote:

>
> What Arjan did is quite nifty, as it moves the page clearing out from
> under the mmap_sem-held critical section.

So that was the point not the rescheduling under lock? Or both?

BTW since it touches your area of work you could comment what
you think about not using voluntary preempt points for fast sleep locks
like I later proposed.

> How that is achieved is really
> secondary, it's pretty clear that it could be done in some nicer way.

Great we agree then.
>
> And no, i dont accept the lame "dont come into the kitchen if you cant
> stand the flames" excuse: your reply was totally uncalled for, was
> totally undeserved

Well he didn't supply any data so I asked for more.

> and was totally unnecessary. It was incredibly mean
> spirited,

Sorry, but I don't think that's true. Mean spirited would be
"we don't care, go away". When I think that I generally don't
answer the email.

I could have perhaps worded it a bit nicer, agreed, but I think
the core of my reply - we need more analysis for that - was
quite constructive. At least for one of the subproblems I even
proposed a better solution. If there is more analysis of the
problem maybe I can help even for more of it.

Also it's probably quite clear that added lots of special purpose caches
to task_struct for narrow purpose isn't a good way to do optimization.

The mail was perhaps a bit harsher than it should have been
because I think Arjan should have really known better...

> You might not realize it, but replies like this can scare away novice
> contributors forever! You could scare away the next DaveM. Or the next
> Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare
> away even longtime contributors: see Rusty Russell's comments from a
> couple of days ago ...

I must say I'm feeling a bit unfairly attacked here because I think I generally
try to help new patch submitters (at least if their basic ideas are sound even
if some details are wrong)

e.g. you might notice that a lot of patches from new contributors
go smoother into x86-64 than into i386.

-Andi

2006-02-23 13:21:16

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Andi Kleen wrote:
> On Thursday 23 February 2006 13:41, Ingo Molnar wrote:
>
>
>>What Arjan did is quite nifty, as it moves the page clearing out from
>>under the mmap_sem-held critical section.
>
>
> So that was the point not the rescheduling under lock? Or both?
>
> BTW since it touches your area of work you could comment what
> you think about not using voluntary preempt points for fast sleep locks
> like I later proposed.
>
>
>>How that is achieved is really
>>secondary, it's pretty clear that it could be done in some nicer way.
>
>
> Great we agree then.
>

I'm worried about the situation where we allocate but don't use the
new page: it blows quite a bit of cache. Then, when we do get around
to using it, it will be cold(er).

Good to see someone trying to improve this area, though.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-23 13:31:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages


* Nick Piggin <[email protected]> wrote:

> I'm worried about the situation where we allocate but don't use the
> new page: it blows quite a bit of cache. Then, when we do get around
> to using it, it will be cold(er).

couldnt the new pte be flipped in atomically via cmpxchg? That way we
could do the page clearing close to where we are doing it now, but
without holding the mmap_sem.

to solve the pte races we could use a bit in the [otherwise empty] pte
to signal "this pte can be flipped in from now on", which bit would
automatically be cleared if mprotect() or munmap() is called over that
range (without any extra changes to those codepaths). (in the rare case
if the cmpxchg() fails, we go into a slowpath that drops the newly
allocated page, re-lookups the vma and the pte, etc.)

Ingo

2006-02-23 16:42:25

by Benjamin LaHaise

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thu, Feb 23, 2006 at 10:30:29AM +0100, Arjan van de Ven wrote:
> One could argue that this should be a generic slab feature (the preloading) but
> that only gives some of the gains and not all, and vma's are small creatures,
> with a high "recycling" rate in typical cases.

I think this is wrong, and far too narrow in useful scope to be merged.
It's unnecessary bloat of task_struct and if this is really a problem, then
there is a [performance] bug in the slab allocator. Please at least provide
the slab statistics for what is going on, as the workload might well be
thrashing the creation and destruction of slabs, which would be a much better
thing to fix.

-ben

2006-02-23 18:25:58

by Paul Jackson

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Just a random idea, offered with little real understanding
of what's going on ...

Instead of a per-task clear page, how about a per-cpu clear page,
or short queue of clear pages?

This lets the number of clear pages be throttled to whatever
is worth it. And it handles such cases as a few threads using
the clear pages rapidly, while many other threads don't need any,
with a much higher "average usefulness" per clear page (meaning
the average time a cleared page sits around wasting memory prior
to its being used is much shorter.)

Some locking would still be needed, but per-cpu locking is
a separate, quicker beast than something like mmap_sem.

Mind you, I am not commenting one way or the other on whether any
of this is a good idea. Not my expertise ...

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-23 20:08:12

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

In-Reply-To: <[email protected]>

On Thu, 23 Feb 2006 at 14:06:43 +0100, Andi Kleen wrote:
> e.g. you might notice that a lot of patches from new contributors
> go smoother into x86-64 than into i386.

That's because, strangely enough, i386 doesn't have an official maintainer.

--
Chuck
"Equations are the Devil's sentences." --Stephen Colbert

2006-02-23 21:12:16

by Chuck Ebbert

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

In-Reply-To: <[email protected]>

On Thu, 23 Feb 2006 at 10:29:54 +0100, Arjan van de Ven wrote:

> Index: linux-work/mm/mempolicy.c
> ===================================================================
> --- linux-work.orig/mm/mempolicy.c
> +++ linux-work/mm/mempolicy.c
> @@ -1231,6 +1231,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area
> {
> struct mempolicy *pol = get_vma_policy(current, vma, addr);
>
> + if ( (gfp & __GFP_ZERO) && current->cleared_page) {
> + struct page *addr;
> + addr = current->cleared_page;
> + current->cleared_page = NULL;
> + return addr;
> + }
> +
> cpuset_update_task_memory_state();
>
> if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
> @@ -1242,6 +1249,36 @@ alloc_page_vma(gfp_t gfp, struct vm_area
> return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
> }
>
> +
> +/**
> + * prepare_cleared_page - populate the per-task zeroed-page cache
> + *
> + * This function populates the per-task cache with one zeroed page
> + * (if there wasn't one already)
> + * The idea is that this (expensive) clearing is done before any
> + * locks are taken, speculatively, and that when the page is actually
> + * needed under a lock, it is ready for immediate use
> + */
> +
> +void prepare_cleared_page(void)
> +{
> + struct mempolicy *pol = current->mempolicy;
> +
> + if (current->cleared_page)
> + return;
> +
> + cpuset_update_task_memory_state();
> +
> + if (!pol)
> + pol = &default_policy;
> + if (pol->policy == MPOL_INTERLEAVE)
> + current->cleared_page = alloc_page_interleave(
> + GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));

======> else ???

> + current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
> + 0, zonelist_policy(GFP_USER, pol));
> +}
> +
> +
> /**
> * alloc_pages_current - Allocate pages.
> *

--
Chuck
"Equations are the Devil's sentences." --Stephen Colbert

2006-02-23 21:19:33

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Thu, 2006-02-23 at 16:10 -0500, Chuck Ebbert wrote:
> > + if (pol->policy == MPOL_INTERLEAVE)
> > + current->cleared_page = alloc_page_interleave(
> > + GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
>
> ======> else ???
>
> > + current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
> > + 0, zonelist_policy(GFP_USER, pol));
> > +}
> > +

good catch, thanks!

2006-02-24 06:36:49

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>I'm worried about the situation where we allocate but don't use the
>>new page: it blows quite a bit of cache. Then, when we do get around
>>to using it, it will be cold(er).
>
>
> couldnt the new pte be flipped in atomically via cmpxchg? That way we
> could do the page clearing close to where we are doing it now, but
> without holding the mmap_sem.
>

We have nothing to pin the pte page with if we're not holding the
mmap_sem.

> to solve the pte races we could use a bit in the [otherwise empty] pte
> to signal "this pte can be flipped in from now on", which bit would
> automatically be cleared if mprotect() or munmap() is called over that
> range (without any extra changes to those codepaths). (in the rare case
> if the cmpxchg() fails, we go into a slowpath that drops the newly
> allocated page, re-lookups the vma and the pte, etc.)
>

Page still isn't pinned. You might be able to do something wild like
disable preemption and interrupts (to stop the TLB IPI) to get a pin
on the pte pages.

But even in that case, there is nothing in the mmu gather / tlb flush
interface that guarantees an architecture cannot free the page table
pages immediately (ie without waiting for the flush IPI). This would
make sense on architectures that don't walk the page tables in hardware.

Arjan, just to get an idea of your workload: obviously it is a mix of
read and write on the mmap_sem (read only will not really benefit from
reducing lock width because cacheline transfers will still be there).
Is it coming from brk() from the allocator? Someone told me a while ago
that glibc doesn't have a decent amount of hysteresis in its allocator
and tends to enter the kernel quite a lot... that might be something
to look into.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-24 06:50:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages


* Nick Piggin <[email protected]> wrote:

> > couldnt the new pte be flipped in atomically via cmpxchg? That way
> > we could do the page clearing close to where we are doing it now,
> > but without holding the mmap_sem.
>
> We have nothing to pin the pte page with if we're not holding the
> mmap_sem.

why does it have to be pinned? The page is mostly private to this thread
until it manages to flip it into the pte. Since there's no pte presence,
there's no swapout possible [here i'm assuming anonymous malloc()
memory, which is the main focus of Arjan's patch]. Any parallel
unmapping of that page will be caught and the installation of the page
will be prevented by the 'bit-spin-lock' embedded in the pte.

> But even in that case, there is nothing in the mmu gather / tlb flush
> interface that guarantees an architecture cannot free the page table
> pages immediately (ie without waiting for the flush IPI). This would
> make sense on architectures that don't walk the page tables in
> hardware.

but the page wont be found by any other CPU, so it wont be freed! It is
private to this CPU. The page has no pte presence. It will only be
present and lookupable as a result of the cmpxchg() flipping the page
into the pte.

Ingo

2006-02-24 07:01:43

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
>
>>>couldnt the new pte be flipped in atomically via cmpxchg? That way
>>>we could do the page clearing close to where we are doing it now,
>>>but without holding the mmap_sem.
>>
>>We have nothing to pin the pte page with if we're not holding the
>>mmap_sem.
>
>
> why does it have to be pinned? The page is mostly private to this thread
> until it manages to flip it into the pte. Since there's no pte presence,
> there's no swapout possible [here i'm assuming anonymous malloc()
> memory, which is the main focus of Arjan's patch]. Any parallel
> unmapping of that page will be caught and the installation of the page
> will be prevented by the 'bit-spin-lock' embedded in the pte.
>

No, I was talking about page table pages, rather than the newly
allocated page.

But I didn't realise you wanted the bit lock to go the other way
as well (ie. a real bit spinlock). Seems like that would have to
add overhead somewhere.

>
>>But even in that case, there is nothing in the mmu gather / tlb flush
>>interface that guarantees an architecture cannot free the page table
>>pages immediately (ie without waiting for the flush IPI). This would
>>make sense on architectures that don't walk the page tables in
>>hardware.
>
>
> but the page wont be found by any other CPU, so it wont be freed! It is
> private to this CPU. The page has no pte presence. It will only be
> present and lookupable as a result of the cmpxchg() flipping the page
> into the pte.
>

Yeah, as I said above, the newly allocated page is fine, it is the
page table pages I'm worried about.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-24 09:15:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages


> Arjan, just to get an idea of your workload: obviously it is a mix of
> read and write on the mmap_sem (read only will not really benefit from
> reducing lock width because cacheline transfers will still be there).

yeah it's threads that each allocate, use and then free memory with
mmap()

> Is it coming from brk() from the allocator? Someone told me a while ago
> that glibc doesn't have a decent amount of hysteresis in its allocator
> and tends to enter the kernel quite a lot... that might be something
> to look into.

we already are working on that angle; I just posted the kernel stuff as
a side effect basically


2006-02-24 09:26:06

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Arjan van de Ven wrote:
>>Arjan, just to get an idea of your workload: obviously it is a mix of
>>read and write on the mmap_sem (read only will not really benefit from
>>reducing lock width because cacheline transfers will still be there).
>
>
> yeah it's threads that each allocate, use and then free memory with
> mmap()
>
>
>>Is it coming from brk() from the allocator? Someone told me a while ago
>>that glibc doesn't have a decent amount of hysteresis in its allocator
>>and tends to enter the kernel quite a lot... that might be something
>>to look into.
>
>
> we already are working on that angle; I just posted the kernel stuff as
> a side effect basically
>

OK.

[aside]
Actually I have a scalability improvement for rwsems, that moves the
actual task wakeups out from underneath the rwsem spinlock in the up()
paths. This was useful exactly on a mixed read+write workload on mmap_sem.

The difference was quite large for the "generic rwsem" algorithm because
it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
think x86-64 uses the former, which is what I presume you're testing with?

Obviously this is a slightly different issue from the one you're trying to
address here, but I'll dig out patch when I get some time and you can see
if it helps.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-24 12:33:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Friday 24 February 2006 08:01, Nick Piggin wrote:

>
> Yeah, as I said above, the newly allocated page is fine, it is the
> page table pages I'm worried about.

page tables are easy because we zero them on free (as a side effect
of all the pte_clears)

I did a experimental hack some time ago to set a new struct page
flag when a page is known to be zeroed on freeing and use that for
a GFP_ZERO allocation (basically skip the clear_page when that
flag was set)

The idea was to generalize the old page table reuse caches which
Ingo removed at some point.

It only works of course if the allocations and freeing
of page tables roughly matches up. In theory on could have
split the lists of the buddy allocator too into zero/non
zero pages to increase the hit rate, but I didn't attempt this.

I unfortunately don't remember the outcome, dropped it for some reason.

-Andi

2006-02-24 12:33:32

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Friday 24 February 2006 10:26, Nick Piggin wrote:

> [aside]
> Actually I have a scalability improvement for rwsems, that moves the
> actual task wakeups out from underneath the rwsem spinlock in the up()
> paths. This was useful exactly on a mixed read+write workload on mmap_sem.
>
> The difference was quite large for the "generic rwsem" algorithm because
> it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
> think x86-64 uses the former, which is what I presume you're testing with?

I used the generic algorithm because Andrea originally expressed some doubts
on the correctness of the xadd algorithms and after trying to understand them
myself I wasn't sure myself. Generic was the safer choice.

But if someone can show convincing numbers that XADD rwsems are faster
for some workload we can switch. I guess they are tested well enough now
on i386.

Or would your scalability improvement remove that difference (if it really exists)?


-Andi

2006-02-24 12:55:10

by Hugh Dickins

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Fri, 24 Feb 2006, Andi Kleen wrote:
> On Friday 24 February 2006 08:01, Nick Piggin wrote:
> >
> > Yeah, as I said above, the newly allocated page is fine, it is the
> > page table pages I'm worried about.
>
> page tables are easy because we zero them on free (as a side effect
> of all the pte_clears)

But once the page table is freed, it's likely to get used for something
else, whether for another page table or something different doesn't
matter: this mm can no longer blindly mess with the entries within in.

Nick's point is that it's mmap_sem (or mm_users 0) guarding against
the page table being freed.

Hugh

2006-02-24 15:31:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Fri, Feb 24, 2006 at 01:27:26PM +0100, Andi Kleen wrote:
> I used the generic algorithm because Andrea originally expressed some doubts
> on the correctness of the xadd algorithms and after trying to understand them
> myself I wasn't sure myself. Generic was the safer choice.

Amittedly we never had bugreports for the xadd ones, but trust me that's
not a good reason to assume that they must be correct. I'd be more
confortable if somebody would provide a demonstration of their
correctnes. Overall I gave up also because I felt that the small gain
was by far not worth the risk.

2006-02-24 18:52:35

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Thu, Feb 23, 2006 at 10:48:50AM +0100, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > This patch adds a per task-struct cache of a free vma.
> > >
> > > In normal operation, it is a really common action during userspace mmap
> > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > and thus free it again. In fact this is the case roughly 95% of the time.
> > >
> > > In addition, this patch allows code to "prepopulate" the cache, and
> > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > prepopulation is that the memory allocation (which is a sleeping operation
> > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > > reduces lock hold time (and thus the contention potential)
> >
> > The slab fast path doesn't sleep.
>
> it does via might_sleep()

so turn of the voluntary preempt bullshit.

2006-02-24 19:05:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Friday 24 February 2006 19:52, Christoph Hellwig wrote:
> On Thu, Feb 23, 2006 at 10:48:50AM +0100, Arjan van de Ven wrote:
> > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > This patch adds a per task-struct cache of a free vma.
> > > >
> > > > In normal operation, it is a really common action during userspace mmap
> > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > >
> > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a
> > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus
> > > > reduces lock hold time (and thus the contention potential)
> > >
> > > The slab fast path doesn't sleep.
> >
> > it does via might_sleep()
>
> so turn of the voluntary preempt bullshit.

I think voluntary preempt is generally a good idea, but we should make it optional
for down() since it can apparently cause bad side effects (like holding
semaphores/mutexes for too long)

There would be two possible ways:

have default mutex_lock()/down() do a might_sleep()
with preemption and have a separate variant that doesn't preempt
or have the default non preempt and a separate variant
that does preempt.

I think I would prefer the later because for preemption
it probably makes often more sense to do the preemption
on the up() not the down.

On the other hand one could argue that it's safer to only
change it for mmap_sem, which would suggest a special variant
without preemption and keep the current default.

Actually the non preempt variants probably should still have a might_sleep()
for debugging, but in a variant that doesn't do preemption (might_sleep_no_preempt()?)

Ingo, what do you think?

-Andi

2006-02-24 19:09:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [Patch 2/3] fast VMA recycling

On Fri, Feb 24, 2006 at 08:05:16PM +0100, Andi Kleen wrote:
> I think voluntary preempt is generally a good idea, but we should make it optional
> for down() since it can apparently cause bad side effects (like holding
> semaphores/mutexes for too long)
>
> There would be two possible ways:
>
> have default mutex_lock()/down() do a might_sleep()
> with preemption and have a separate variant that doesn't preempt
> or have the default non preempt and a separate variant
> that does preempt.

better remove the stupid cond_resched() from might_sleep which from it's
naming already is a debug thing and add it to those places where it makes
sense.

2006-02-25 16:48:58

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Move wakeups out from under the rwsem's wait_lock spinlock.

This reduces that lock's contention by a factor of around
10 on a 16-way NUMAQ running volanomark, however cacheline
contention on the rwsem's "activity" drowns out these
small improvements when using the i386 "optimised" rwsem:

unpatched:
55802519 total 32.3097
23325323 default_idle 364458.1719
22084349 .text.lock.futex 82404.2873
2369107 queue_me 24678.1979
1875296 unqueue_me 9767.1667
1202258 .text.lock.rwsem 46240.6923
941801 finish_task_switch 7357.8203
787101 __wake_up 12298.4531
645252 drop_key_refs 13442.7500
362789 futex_wait 839.7894
333294 futex_wake 1487.9196
146797 rwsem_down_read_failed 436.8958
82788 .text.lock.dev 221.3583
81221 try_to_wake_up 133.5872

+rwsem-scale:
58120260 total 33.6458
25482132 default_idle 398158.3125
22774675 .text.lock.futex 84980.1306
2517797 queue_me 26227.0521
1953424 unqueue_me 10174.0833
1063068 finish_task_switch 8305.2188
834793 __wake_up 13043.6406
674570 drop_key_refs 14053.5417
371811 futex_wait 860.6736
343398 futex_wake 1533.0268
155419 try_to_wake_up 255.6234
114704 .text.lock.rwsem 4411.6923

The rwsem-spinlock implementation, however, is improved
significantly more, and now gets volanomark performance
similar to the xadd rwsem:

unpatched:
30850964 total 18.1787
18986006 default_idle 296656.3438
3989183 .text.lock.rwsem_spinlock 40294.7778
2990161 .text.lock.futex 32501.7500
549707 finish_task_switch 4294.5859
535327 __down_read 3717.5486
452721 queue_me 4715.8438
439725 __up_read 9160.9375
396273 __wake_up 6191.7656
326595 unqueue_me 1701.0156

+rwsem-scale:
25378268 total 14.9537
13325514 default_idle 208211.1562
3675634 .text.lock.futex 39952.5435
2908629 .text.lock.rwsem_spinlock 28239.1165
628115 __down_read 4361.9097
607417 finish_task_switch 4745.4453
588031 queue_me 6125.3229
571169 __up_read 11899.3542
436795 __wake_up 6824.9219
416788 unqueue_me 2170.7708


Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c 2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem.c 2006-02-26 03:37:04.000000000 +1100
@@ -36,14 +36,15 @@ void rwsemtrace(struct rw_semaphore *sem
* - the spinlock must be held by the caller
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if downgrading is false
+ *
+ * The spinlock will be dropped by this function.
*/
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int downgrading, unsigned long flags)
{
+ LIST_HEAD(wake_list);
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
- struct list_head *next;
- signed long oldcount, woken, loop;
+ signed long oldcount, woken;

rwsemtrace(sem, "Entering __rwsem_do_wake");

@@ -72,12 +73,8 @@ __rwsem_do_wake(struct rw_semaphore *sem
* It is an allocated on the waiter's stack and may become invalid at
* any time after that point (due to a wakeup from another source).
*/
- list_del(&waiter->list);
- tsk = waiter->task;
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
goto out;

/* don't want to wake any writers */
@@ -94,41 +91,37 @@ __rwsem_do_wake(struct rw_semaphore *sem
readers_only:
woken = 0;
do {
- woken++;
-
- if (waiter->list.next == &sem->wait_list)
+ list_move_tail(&waiter->list, &wake_list);
+ waiter->flags = 0;
+ woken += RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+ if (list_empty(&sem->wait_list))
break;

- waiter = list_entry(waiter->list.next,
+ waiter = list_entry(sem->wait_list.next,
struct rwsem_waiter, list);
-
} while (waiter->flags & RWSEM_WAITING_FOR_READ);

- loop = woken;
- woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
if (!downgrading)
/* we'd already done one increment earlier */
woken -= RWSEM_ACTIVE_BIAS;

rwsem_atomic_add(woken, sem);

- next = sem->wait_list.next;
- for (; loop > 0; loop--) {
- waiter = list_entry(next, struct rwsem_waiter, list);
- next = waiter->list.next;
+out:
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
+ while (!list_empty(&wake_list)) {
+ struct task_struct *tsk;
+ waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
+ list_del(&waiter->list);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
}

- sem->wait_list.next = next;
- next->prev = &sem->wait_list;
-
- out:
rwsemtrace(sem, "Leaving __rwsem_do_wake");
- return sem;
+ return;

/* undo the change to count, but check for a transition 1->0 */
undo:
@@ -145,12 +138,13 @@ rwsem_down_failed_common(struct rw_semap
struct rwsem_waiter *waiter, signed long adjustment)
{
struct task_struct *tsk = current;
+ unsigned long flags;
signed long count;

set_task_state(tsk, TASK_UNINTERRUPTIBLE);

/* set up my own style of waitqueue */
- spin_lock_irq(&sem->wait_lock);
+ spin_lock_irqsave(&sem->wait_lock, flags);
waiter->task = tsk;
get_task_struct(tsk);

@@ -161,14 +155,12 @@ rwsem_down_failed_common(struct rw_semap

/* if there are no active locks, wake the front queued process(es) up */
if (!(count & RWSEM_ACTIVE_MASK))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irq(&sem->wait_lock);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

/* wait to be given the lock */
- for (;;) {
- if (!waiter->task)
- break;
+ while (waiter->task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -227,9 +219,9 @@ struct rw_semaphore fastcall *rwsem_wake

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving rwsem_wake");

@@ -251,9 +243,9 @@ struct rw_semaphore fastcall *rwsem_down

/* do nothing if list empty */
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 1, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving rwsem_downgrade_wake");
return sem;
Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c 2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem-spinlock.c 2006-02-26 03:37:42.000000000 +1100
@@ -49,11 +49,11 @@ void fastcall init_rwsem(struct rw_semap
* - woken process blocks are discarded from the list after having task zeroed
* - writers are only woken if wakewrite is non-zero
*/
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite, unsigned long flags)
{
+ LIST_HEAD(wake_list);
struct rwsem_waiter *waiter;
- struct task_struct *tsk;
int woken;

rwsemtrace(sem, "Entering __rwsem_do_wake");
@@ -73,46 +73,46 @@ __rwsem_do_wake(struct rw_semaphore *sem
*/
if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
sem->activity = -1;
- list_del(&waiter->list);
- tsk = waiter->task;
- /* Don't touch waiter after ->task has been NULLed */
- smp_mb();
- waiter->task = NULL;
- wake_up_process(tsk);
- put_task_struct(tsk);
+ list_move_tail(&waiter->list, &wake_list);
goto out;
}

/* grant an infinite number of read locks to the front of the queue */
dont_wake_writers:
woken = 0;
- while (waiter->flags & RWSEM_WAITING_FOR_READ) {
- struct list_head *next = waiter->list.next;
+ do {
+ list_move_tail(&waiter->list, &wake_list);
+ woken++;
+ if (list_empty(&sem->wait_list))
+ break;
+ waiter = list_entry(sem->wait_list.next,
+ struct rwsem_waiter, list);
+ } while (waiter->flags & RWSEM_WAITING_FOR_READ);
+
+ sem->activity += woken;

+out:
+ spin_unlock_irqrestore(&sem->wait_lock, flags);
+ while (!list_empty(&wake_list)) {
+ struct task_struct *tsk;
+ waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
list_del(&waiter->list);
tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
- woken++;
- if (list_empty(&sem->wait_list))
- break;
- waiter = list_entry(next, struct rwsem_waiter, list);
}

- sem->activity += woken;
-
- out:
rwsemtrace(sem, "Leaving __rwsem_do_wake");
- return sem;
}

/*
* wake a single writer
+ * called with wait_lock locked and returns with it unlocked.
*/
-static inline struct rw_semaphore *
-__rwsem_wake_one_writer(struct rw_semaphore *sem)
+static inline void
+__rwsem_wake_one_writer(struct rw_semaphore *sem, unsigned long flags)
{
struct rwsem_waiter *waiter;
struct task_struct *tsk;
@@ -121,13 +121,13 @@ __rwsem_wake_one_writer(struct rw_semaph

waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
list_del(&waiter->list);
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

tsk = waiter->task;
- smp_mb();
waiter->task = NULL;
+ smp_mb();
wake_up_process(tsk);
put_task_struct(tsk);
- return sem;
}

/*
@@ -163,9 +163,7 @@ void fastcall __sched __down_read(struct
spin_unlock_irq(&sem->wait_lock);

/* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
+ while (waiter.task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -234,9 +232,7 @@ void fastcall __sched __down_write(struc
spin_unlock_irq(&sem->wait_lock);

/* wait to be given the lock */
- for (;;) {
- if (!waiter.task)
- break;
+ while (waiter.task) {
schedule();
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
}
@@ -283,9 +279,9 @@ void fastcall __up_read(struct rw_semaph
spin_lock_irqsave(&sem->wait_lock, flags);

if (--sem->activity == 0 && !list_empty(&sem->wait_list))
- sem = __rwsem_wake_one_writer(sem);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_wake_one_writer(sem, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __up_read");
}
@@ -303,9 +299,9 @@ void fastcall __up_write(struct rw_semap

sem->activity = 0;
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 1);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 1, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __up_write");
}
@@ -324,9 +320,9 @@ void fastcall __downgrade_write(struct r

sem->activity = 1;
if (!list_empty(&sem->wait_list))
- sem = __rwsem_do_wake(sem, 0);
-
- spin_unlock_irqrestore(&sem->wait_lock, flags);
+ __rwsem_do_wake(sem, 0, flags);
+ else
+ spin_unlock_irqrestore(&sem->wait_lock, flags);

rwsemtrace(sem, "Leaving __downgrade_write");
}


Attachments:
rwsem-scale.patch (11.96 kB)

2006-02-25 17:22:58

by Nick Piggin

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

Nick Piggin wrote:

> Here is a forward port of my scalability improvement, untested.
>
> Unfortunately I didn't include absolute performance results, but the
> changelog indicates that there was some noticable delta between the
> rwsem implementations (and that's what I vaguely remember).
>
> Note: this was with volanomark, which can be quite variable at the
> best of times IIRC.
>
> Note2: this was on a 16-way NUMAQ, which had the interconnect
> performance of a string between two cans compared to a modern x86-64
> of smaller size, so it may be harder to notice an improvement.
>

Oh, I remember one performance caveat of this code on preempt kernels:
at very high loads, (ie. lots of tasks being woken in the up path),
the wakeup code would end up doing a lot of context switches to and
from the tasks it is waking up.

This could be easily solved by disabling preempt (and would be still
better in terms of interrupt latency and lock hold times), however I
never got around to implementing that.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-02-28 22:32:04

by Pavel Machek

[permalink] [raw]
Subject: Re: [Patch 3/3] prepopulate/cache cleared pages

On Čt 23-02-06 13:41:53, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > > This patch adds an entry for a cleared page to the task struct. The main
> > > purpose of this patch is to be able to pre-allocate and clear a page in a
> > > pagefault scenario before taking any locks (esp mmap_sem),
> > > opportunistically. Allocating+clearing a page is an very expensive
> > > operation that currently increases lock hold times quite bit (in a threaded
> > > environment that allocates/use/frees memory on a regular basis, this leads
> > > to contention).
> > >
> > > This is probably the most controversial patch of the 3, since there is
> > > a potential to take up 1 page per thread in this cache. In practice it's
> > > not as bad as it sounds (a large degree of the pagefaults are anonymous
> > > and thus immediately use up the page). One could argue "let the VM reap
> > > these" but that has a few downsides; it increases locking needs but more,
> > > clearing a page is relatively expensive, if the VM reaps the page again
> > > in case it wasn't needed, the work was just wasted.
> >
> > Looks like an incredible bad hack. What workload was that again where
> > it helps? And how much? I think before we can consider adding that
> > ugly code you would a far better rationale.
>
> yes, the patch is controversial technologically, and Arjan pointed it
> out above. This is nothing new - and Arjan probably submitted this to
> lkml so that he can get contructive feedback.

Actually, I think I have to back Andi here. This looked like patch for
inclusion (signed-off, cc-ed Andrew). And yes, Arjan pointed out that
it is controversial, but the way patch was worded I could imagine
Andrew merging it...
Pavel
--
Web maintainer for suspend.sf.net (http://www.sf.net/projects/suspend) wanted...