2006-02-23 09:31:12

by Arjan van de Ven

Subject: [Patch 1/3] prefetch the mmap_sem in the fault path

In a micro-benchmark that stresses the pagefault path, the down_read_trylock
on the mmap_sem showed up quite high on the profile. Turns out this lock is
bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
prefetches the lock (for write) as early as possible (and before some other
somewhat expensive operations). With this patch, the down_read_trylock
basically fell out of the top of profile.

Signed-off-by: Arjan van de Ven <[email protected]>

---
arch/x86_64/mm/fault.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-work/arch/x86_64/mm/fault.c
===================================================================
--- linux-work.orig/arch/x86_64/mm/fault.c
+++ linux-work/arch/x86_64/mm/fault.c
@@ -312,6 +312,10 @@ asmlinkage void __kprobes do_page_fault(
 	unsigned long flags;
 	siginfo_t info;

+	tsk = current;
+	mm = tsk->mm;
+	prefetchw(&mm->mmap_sem);
+
 	/* get the address */
 	__asm__("movq %%cr2,%0":"=r" (address));
 	if (notify_die(DIE_PAGE_FAULT, "page fault", regs, error_code, 14,
@@ -325,8 +329,6 @@ asmlinkage void __kprobes do_page_fault(
 		printk("pagefault rip:%lx rsp:%lx cs:%lu ss:%lu address %lx error %lx\n",
 		       regs->rip,regs->rsp,regs->cs,regs->ss,address,error_code);

-	tsk = current;
-	mm = tsk->mm;
 	info.si_code = SEGV_MAPERR;
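
The idea in the patch can be tried outside the kernel with a small userspace sketch. This is an illustration under stated assumptions, not the kernel code: `__builtin_prefetch` is the GCC/Clang builtin (a second argument of 1 requests the line for writing), while `struct shared_state`, `read_trylock`, `expensive_setup` and `fault_like_path` are hypothetical names standing in for the mm, the trylock, the work done between the prefetch and the lock, and the fault handler. In this sketch even a reader has to modify the lock word, which is why a write prefetch is the relevant kind.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a structure that embeds a contended lock. */
struct shared_state {
    atomic_long readers;   /* >= 0: number of readers, -1: held by a writer */
    long value;            /* data protected by the lock */
};

/* Non-blocking read lock: succeeds unless a writer holds the lock. */
static bool read_trylock(struct shared_state *s)
{
    long old = atomic_load_explicit(&s->readers, memory_order_relaxed);
    while (old >= 0) {
        /* Taking the lock for reading still writes to the lock word. */
        if (atomic_compare_exchange_weak_explicit(&s->readers, &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}

static void read_unlock(struct shared_state *s)
{
    atomic_fetch_sub_explicit(&s->readers, 1, memory_order_release);
}

/* Hypothetical stand-in for the work done before the lock is needed. */
static long expensive_setup(long x)
{
    long acc = 0;
    for (long i = 0; i < 64; i++)
        acc += (x ^ i) * 31;
    return acc;
}

/* Returns true and fills *out if the read lock was taken, false otherwise. */
static bool fault_like_path(struct shared_state *s, long addr, long *out)
{
    /* Request the lock's cache line for writing before the unrelated work,
     * so the miss can overlap with that work instead of stalling the trylock. */
    __builtin_prefetch(&s->readers, 1, 3);

    long v = expensive_setup(addr);

    if (!read_trylock(s))
        return false;
    *out = v + s->value;
    read_unlock(s);
    return true;
}
```

Whether the prefetch pays off depends on how much work sits between it and the lock acquisition and on how often the line is actually held by another CPU; it is a hint and never changes the result.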




2006-02-23 09:43:00

by Andi Kleen

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> In a micro-benchmark that stresses the pagefault path, the down_read_trylock
> on the mmap_sem showed up quite high on the profile. Turns out this lock is
> bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
> prefetches the lock (for write) as early as possible (and before some other
> somewhat expensive operations). With this patch, the down_read_trylock
> basically fell out of the top of profile.

It is hard to believe because you effectively didn't do the prefetch
very early

(e.g. the path from your prefetch to taking the lock is quite short)

-Andi

2006-02-23 09:47:45

by Arjan van de Ven

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

On Thu, 2006-02-23 at 10:39 +0100, Andi Kleen wrote:
> On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > In a micro-benchmark that stresses the pagefault path, the down_read_trylock
> > on the mmap_sem showed up quite high on the profile. Turns out this lock is
> > bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
> > prefetches the lock (for write) as early as possible (and before some other
> > somewhat expensive operations). With this patch, the down_read_trylock
> > basically fell out of the top of profile.
>
> It is hard to believe because you effectively didn't do the prefetch
> very early

all you need is a few dozen cycles though; there's a cr2 move and the
entire notifier in between... neither of those is really cheap.


(and after patch 3/3 also a page allocation/clear)

2006-02-23 10:15:42

by Andi Kleen

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

On Thursday 23 February 2006 10:47, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 10:39 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > In a micro-benchmark that stresses the pagefault path, the down_read_trylock
> > > on the mmap_sem showed up quite high on the profile. Turns out this lock is
> > > bouncing between cpus quite a bit and thus is cache-cold a lot. This patch
> > > prefetches the lock (for write) as early as possible (and before some other
> > > somewhat expensive operations). With this patch, the down_read_trylock
> > > basically fell out of the top of profile.
> >
> > It is hard to believe because you effectively didn't do the prefetch
> > very early
>
> all you need is a few dozen cycles though; there's a cr2 move and the
> entire notifier in between... neither of those is really cheap.

Ok. I added that patch.

-Andi

2006-02-23 12:29:04

by Jes Sorensen

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

>>>>> "Arjan" == Arjan van de Ven <[email protected]> writes:

Arjan> In a micro-benchmark that stresses the pagefault path, the
Arjan> down_read_trylock on the mmap_sem showed up quite high on the
Arjan> profile. Turns out this lock is bouncing between cpus quite a
Arjan> bit and thus is cache-cold a lot. This patch prefetches the
Arjan> lock (for write) as early as possible (and before some other
Arjan> somewhat expensive operations). With this patch, the
Arjan> down_read_trylock basically fell out of the top of profile.

Out of curiosity, how big was the box used for testing? It might be
worth investigating if anything can be done to reduce the number of
times that lock is taken in the first place.

After all, what's a pain on a 4-way tends to be an utter nightmare on
a 16-way ;(

Cheers,
Jes

2006-02-23 12:39:30

by Arjan van de Ven

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

On Thu, 2006-02-23 at 07:29 -0500, Jes Sorensen wrote:
> >>>>> "Arjan" == Arjan van de Ven <[email protected]> writes:
>
> Arjan> In a micro-benchmark that stresses the pagefault path, the
> Arjan> down_read_trylock on the mmap_sem showed up quite high on the
> Arjan> profile. Turns out this lock is bouncing between cpus quite a
> Arjan> bit and thus is cache-cold a lot. This patch prefetches the
> Arjan> lock (for write) as early as possible (and before some other
> Arjan> somewhat expensive operations). With this patch, the
> Arjan> down_read_trylock basically fell out of the top of profile.
>
> Out of curiosity, how big was the box used for testing? It might be
> worth investigating if anything can be done to reduce the number of
> times that lock is taken in the first place.
>
> After all, what's a pain on a 4-way tends to be an utter nightmare on
> a 16-way ;(

most of it was done on a 2 way, but some tests were done on a 4-way.

2006-02-23 16:16:04

by Ray Bryant

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

On Thursday 23 February 2006 06:39, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 07:29 -0500, Jes Sorensen wrote:
> > >>>>> "Arjan" == Arjan van de Ven <[email protected]> writes:
> >
> > Arjan> In a micro-benchmark that stresses the pagefault path, the
> > Arjan> down_read_trylock on the mmap_sem showed up quite high on the
> > Arjan> profile. Turns out this lock is bouncing between cpus quite a
> > Arjan> bit and thus is cache-cold a lot. This patch prefetches the
> > Arjan> lock (for write) as early as possible (and before some other
> > Arjan> somewhat expensive operations). With this patch, the
> > Arjan> down_read_trylock basically fell out of the top of profile.
> >
> > Out of curiosity, how big was the box used for testing? It might be
> > worth investigating if anything can be done to reduce the number of
> > times that lock is taken in the first place.
> >
> > After all, what's a pain on a 4-way tends to be an utter nightmare on
> > a 16-way ;(
>
> most of it was done on a 2 way, but some tests were done on a 4-way.

Could you share your microbenchmark with us (or point to the source) and we
can give this a try on larger systems?

Thanks,
--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

2006-03-01 00:51:28

by Valerie Henson

Subject: Re: [Patch 1/3] prefetch the mmap_sem in the fault path

Sorry for the broken threading...

On Thursday 23 February 2006 11:13:50 EST, Ray Bryant wrote:
> On Thursday 23 February 2006 06:39, Arjan van de Ven wrote:
> > On Thu, 2006-02-23 at 07:29 -0500, Jes Sorensen wrote:
> > > >>>>> "Arjan" == Arjan van de Ven <arjan@xxxxxxxxxxxxxxx> writes:
> > >
> > > Arjan> In a micro-benchmark that stresses the pagefault path, the
> > > Arjan> down_read_trylock on the mmap_sem showed up quite high on the
> > > Arjan> profile. Turns out this lock is bouncing between cpus quite a
> > > Arjan> bit and thus is cache-cold a lot. This patch prefetches the
> > > Arjan> lock (for write) as early as possible (and before some other
> > > Arjan> somewhat expensive operations). With this patch, the
> > > Arjan> down_read_trylock basically fell out of the top of profile.
> > >
> > > > Out of curiosity, how big was the box used for testing? It might be
> > > worth investigating if anything can be done to reduce the number of
> > > times that lock is taken in the first place.
> > >
> > > After all, what's a pain on a 4-way tends to be an utter nightmare on
> > > a 16-way ;(
> >
> > most of it was done on a 2 way, but some tests were done on a 4-way.
>
> Could you share your microbenchmark with us (or point to the source) and we
> can give this a try on larger systems?

I would be ecstatic to share this benchmark; however I just started
working at Intel and did not realize how long it would take to open
source a program written solely by an Intel employee (me). I'm
getting the paperwork done as fast as I can.

A quick description of the benchmark is:

* Allocate memory
* Write a pattern to it
* Spawn sufficient threads to keep your cpus busy

Each thread does:

* Allocate a little more memory and copy part of memory to it
* Search for a key within its copy
* Free the memory
* Repeat
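
The steps above can be sketched roughly as follows. This is not the actual benchmark, which had not been released at the time of writing; it is a minimal reconstruction from the description only. The buffer sizes, the iteration count, the byte pattern, and every name (`write_pattern`, `worker_main`, `run_benchmark`, `struct worker`) are assumptions chosen for illustration.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Assumed sizes; the description does not give any numbers. */
enum { SRC_SIZE = 1 << 20, CHUNK = 64 * 1024, ITERATIONS = 1000 };

struct worker {
    const unsigned char *src;  /* shared, read-only pattern buffer */
    size_t offset;             /* which part of src this thread copies */
    unsigned char key;         /* byte value to search for */
    long hits;                 /* iterations in which the key was found */
};

/* Fill buf with a repeating 0..250 pattern, so values 251..255 never occur. */
static void write_pattern(unsigned char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (unsigned char)(i % 251);
}

/* Per-thread loop: allocate, copy part of the source, search, free, repeat. */
static void *worker_main(void *arg)
{
    struct worker *w = arg;
    for (int i = 0; i < ITERATIONS; i++) {
        unsigned char *copy = malloc(CHUNK);
        if (!copy)
            break;
        memcpy(copy, w->src + w->offset, CHUNK);
        if (memchr(copy, w->key, CHUNK))
            w->hits++;
        free(copy);
    }
    return NULL;
}

/* Runs nthreads workers; returns the total hit count, or -1 on setup failure. */
static long run_benchmark(int nthreads, unsigned char key)
{
    unsigned char *src = malloc(SRC_SIZE);
    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct worker *ws = calloc((size_t)nthreads, sizeof *ws);
    long total = -1;
    int started = 0;

    if (!src || !tids || !ws)
        goto out;
    write_pattern(src, SRC_SIZE);

    for (int t = 0; t < nthreads; t++) {
        ws[t] = (struct worker){
            .src = src,
            .offset = (size_t)t * CHUNK % (SRC_SIZE - CHUNK),
            .key = key,
        };
        if (pthread_create(&tids[t], NULL, worker_main, &ws[t]) != 0)
            break;
        started++;
    }

    total = 0;
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        total += ws[t].hits;
    }
out:
    free(ws);
    free(tids);
    free(src);
    return total;
}
```

A caller would time something like `run_benchmark(ncpus, 7)` and build with `-pthread`. How hard this exercises the page-fault path depends on whether the allocator hands back freshly mapped memory on each iteration, which in turn depends on the chunk size and the allocator's settings.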

The patches Arjan submitted make a small improvement, but the big win
turns out to be in tuning malloc() parameters, which we are currently
experimenting with.

-VAL (not subscribed to l-k as yet)
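
The message does not say which malloc() parameters were being tuned, or for which allocator. Purely as an illustration, and assuming glibc, the sketch below uses `mallopt()` with three documented options that influence how often the allocator maps and unmaps memory. The function name `tune_malloc_for_fewer_faults` and all values are hypothetical and not taken from the thread.

```c
#include <malloc.h>   /* glibc-specific: mallopt() and the M_* constants */

/* Returns 1 if every setting was accepted, 0 otherwise.
 * All values are illustrative assumptions, not measured recommendations. */
static int tune_malloc_for_fewer_faults(void)
{
    int ok = 1;

    /* Serve requests below 512 KiB from the heap rather than a fresh mmap(). */
    ok &= mallopt(M_MMAP_THRESHOLD, 512 * 1024);

    /* Keep up to 128 MiB free at the top of the heap before trimming it. */
    ok &= mallopt(M_TRIM_THRESHOLD, 128 * 1024 * 1024);

    /* Grow the heap in 16 MiB steps so fewer extensions are needed. */
    ok &= mallopt(M_TOP_PAD, 16 * 1024 * 1024);

    return ok;
}
```

Such a call would go near the start of `main()`, before the allocations being measured. Settings like these trade a larger resident heap for fewer map and unmap operations, so any gain needs to be confirmed by measurement on the workload in question.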