2009-03-27 20:32:15

by Jeremy Fitzhardinge

Subject: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
between clearing and setting a pte, and before freeing a pagetable page.
It usually does this by disabling interrupts to hold off IPIs, but
some tlb flush implementations don't use IPIs for tlb flushes, and
must use another mechanism.

In this change, add in_gup_cpumask, which is a cpumask of cpus currently
performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
tlb flush function can use this to determine whether it should hold off
on the flush until the gup_fast has finished.
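
For illustration only (not part of this patch): a non-IPI cross-cpu tlb
flush implementation might consult the mask along the lines below. The
function name and the spin-wait policy are hypothetical.

#include <linux/cpumask.h>
#include <asm/processor.h>	/* cpu_relax() */
#include <asm/mmu.h>		/* in_gup_cpumask */

static void wait_for_gup_fast(const struct cpumask *flush_mask)
{
	unsigned int cpu;

	/*
	 * A cpu whose bit is set is inside the irq-disabled pagetable
	 * walk in get_user_pages_fast(); wait until it clears its bit,
	 * so the flush acts as the same barrier an IPI would.
	 */
	for_each_cpu(cpu, flush_mask) {
		while (cpumask_test_cpu(cpu, in_gup_cpumask))
			cpu_relax();
	}
}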

Signed-off-by: Jeremy Fitzhardinge <[email protected]>

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 80a1dee..b2e23e2 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -23,4 +23,6 @@ static inline void leave_mm(int cpu)
}
#endif

+extern cpumask_var_t in_gup_cpumask;
+
#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index be54176..a937b46 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -4,13 +4,17 @@
* Copyright (C) 2008 Nick Piggin
* Copyright (C) 2008 Novell Inc.
*/
+#include <linux/init.h>
#include <linux/sched.h>
+#include <linux/cpumask.h>
#include <linux/mm.h>
#include <linux/vmstat.h>
#include <linux/highmem.h>

#include <asm/pgtable.h>

+cpumask_var_t in_gup_cpumask;
+
static inline pte_t gup_get_pte(pte_t *ptep)
{
#ifndef CONFIG_X86_PAE
@@ -227,6 +231,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
unsigned long next;
pgd_t *pgdp;
int nr = 0;
+ int cpu;

start &= PAGE_MASK;
addr = start;
@@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
* address down to the page and take a ref on it.
*/
local_irq_disable();
+
+ cpu = smp_processor_id();
+ cpumask_set_cpu(cpu, in_gup_cpumask);
+
pgdp = pgd_offset(mm, addr);
do {
pgd_t pgd = *pgdp;
@@ -265,6 +274,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
goto slow;
} while (pgdp++, addr = next, addr != end);
+
+ cpumask_clear_cpu(cpu, in_gup_cpumask);
+
local_irq_enable();

VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
@@ -274,6 +286,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
int ret;

slow:
+ cpumask_clear_cpu(cpu, in_gup_cpumask);
local_irq_enable();
slow_irqon:
/* Try to get the remaining pages with get_user_pages */
@@ -296,3 +309,13 @@ slow_irqon:
return ret;
}
}
+
+static int __init gup_mask_init(void)
+{
+	/* alloc_cpumask_var() does not zero the mask, so clear it */
+	if (!alloc_cpumask_var(&in_gup_cpumask, GFP_KERNEL))
+		return -ENOMEM;
+	cpumask_clear(in_gup_cpumask);
+	return 0;
+}
+core_initcall(gup_mask_init);


2009-03-28 03:56:29

by Avi Kivity

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

Jeremy Fitzhardinge wrote:
> get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
> between clearing and setting a pte, and before freeing a pagetable page.
> It usually does this by disabling interrupts to hold off IPIs, but
> some tlb flush implementations don't use IPIs for tlb flushes, and
> must use another mechanism.
>
> In this change, add in_gup_cpumask, which is a cpumask of cpus currently
> performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
> tlb flush function can use this to determine whether it should hold off
> on the flush until the gup_fast has finished.
>
> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> * address down to the page and take a ref on it.
> */
> local_irq_disable();
> +
> + cpu = smp_processor_id();
> + cpumask_set_cpu(cpu, in_gup_cpumask);
> +

This will bounce a cacheline, every time. Please wrap in CONFIG_XEN and
skip at runtime if Xen is not enabled.
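
Something along these lines, say (a sketch only; it assumes xen_domain()
as the runtime predicate, with a matching guarded cpumask_clear_cpu() on
the exit paths):

#include <asm/xen/hypervisor.h>	/* xen_domain() */

static inline void gup_mark_cpu(int cpu)
{
#ifdef CONFIG_XEN
	/* only touch the shared cacheline when actually running on Xen */
	if (xen_domain())
		cpumask_set_cpu(cpu, in_gup_cpumask);
#endif
}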


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-28 05:01:24

by Jeremy Fitzhardinge

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

Avi Kivity wrote:
> Jeremy Fitzhardinge wrote:
>> get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
>> between clearing and setting a pte, and before freeing a pagetable page.
>> It usually does this by disabling interrupts to hold off IPIs, but
>> some tlb flush implementations don't use IPIs for tlb flushes, and
>> must use another mechanism.
>>
>> In this change, add in_gup_cpumask, which is a cpumask of cpus currently
>> performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
>> tlb flush function can use this to determine whether it should hold off
>> on the flush until the gup_fast has finished.
>>
>> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>> * address down to the page and take a ref on it.
>> */
>> local_irq_disable();
>> +
>> + cpu = smp_processor_id();
>> + cpumask_set_cpu(cpu, in_gup_cpumask);
>> +
>
> This will bounce a cacheline, every time. Please wrap in CONFIG_XEN
> and skip at runtime if Xen is not enabled.

Every time? Only when running successive gup_fasts on different cpus,
and only twice per gup_fast. (What's the typical page count? I see that
kvm and lguest are page-at-a-time users, but presumably direct IO has
larger batches.)

Alternatively, it could have per-cpu flags and the other side could
construct the mask (I originally had that, but this was simpler).
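
A rough sketch of that per-cpu-flag variant (hypothetical names, not the
posted patch; memory-ordering details elided for brevity):

#include <linux/percpu.h>
#include <linux/cpumask.h>

static DEFINE_PER_CPU(int, in_gup);

/*
 * gup_fast side: runs with irqs disabled and only writes its own
 * cpu's flag, so no shared cacheline is dirtied.
 */
static inline void enter_gup(void)
{
	__get_cpu_var(in_gup) = 1;
}

static inline void leave_gup(void)
{
	__get_cpu_var(in_gup) = 0;
}

/* flush side: construct the mask from the per-cpu flags on demand. */
static void build_gup_mask(struct cpumask *mask)
{
	unsigned int cpu;

	cpumask_clear(mask);
	for_each_online_cpu(cpu)
		if (per_cpu(in_gup, cpu))
			cpumask_set_cpu(cpu, mask);
}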

J

2009-03-28 08:16:23

by Eric Dumazet

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
>> Jeremy Fitzhardinge wrote:
>>> get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
>>> between clearing and setting a pte, and before freeing a pagetable page.
>>> It usually does this by disabling interrupts to hold off IPIs, but
>>> some tlb flush implementations don't use IPIs for tlb flushes, and
>>> must use another mechanism.
>>>
>>> In this change, add in_gup_cpumask, which is a cpumask of cpus currently
>>> performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
>>> tlb flush function can use this to determine whether it should hold off
>>> on the flush until the gup_fast has finished.
>>>
>>> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>> * address down to the page and take a ref on it.
>>> */
>>> local_irq_disable();
>>> +
>>> + cpu = smp_processor_id();
>>> + cpumask_set_cpu(cpu, in_gup_cpumask);
>>> +
>>
>> This will bounce a cacheline, every time. Please wrap in CONFIG_XEN
>> and skip at runtime if Xen is not enabled.
>
> Every time? Only when running successive gup_fasts on different cpus,
> and only twice per gup_fast. (What's the typical page count? I see that
> kvm and lguest are page-at-a-time users, but presumably direct IO has
> larger batches.)

If I am not mistaken, shared futexes were hitting the mm semaphore hard.
Then gup_fast was introduced in kernel/futex.c to remove this contention point.
Yet, this contention point was process-specific, not a global one :)

And now you want to add a global hot spot that would slow down unrelated
processes, only because they use shared futexes, thousands of times per
second...

>
> Alternatively, it could have per-cpu flags and the other side could
> construct the mask (I originally had that, but this was simpler).

Simpler, but it would be a regression for legacy applications still using
shared futexes (because they are statically linked against an old libc).

2009-03-28 09:54:55

by Avi Kivity

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

Jeremy Fitzhardinge wrote:
>>> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>>> * address down to the page and take a ref on it.
>>> */
>>> local_irq_disable();
>>> +
>>> + cpu = smp_processor_id();
>>> + cpumask_set_cpu(cpu, in_gup_cpumask);
>>> +
>>
>> This will bounce a cacheline, every time. Please wrap in CONFIG_XEN
>> and skip at runtime if Xen is not enabled.
>
> Every time? Only when running successive gup_fasts on different cpus,
> and only twice per gup_fast. (What's the typical page count? I see
> that kvm and lguest are page-at-a-time users, but presumably direct IO
> has larger batches.)

Databases will often issue I/Os of 1 or 2 pages. But not regressing kvm
should be sufficient motivation.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

2009-03-28 12:28:49

by Peter Zijlstra

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

On Fri, 2009-03-27 at 22:01 -0700, Jeremy Fitzhardinge wrote:
> Avi Kivity wrote:
> > Jeremy Fitzhardinge wrote:
> >> get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
> >> between clearing and setting a pte, and before freeing a pagetable page.
> >> It usually does this by disabling interrupts to hold off IPIs, but
> >> some tlb flush implementations don't use IPIs for tlb flushes, and
> >> must use another mechanism.
> >>
> >> In this change, add in_gup_cpumask, which is a cpumask of cpus currently
> >> performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
> >> tlb flush function can use this to determine whether it should hold off
> >> on the flush until the gup_fast has finished.
> >>
> >> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >> * address down to the page and take a ref on it.
> >> */
> >> local_irq_disable();
> >> +
> >> + cpu = smp_processor_id();
> >> + cpumask_set_cpu(cpu, in_gup_cpumask);
> >> +
> >
> > This will bounce a cacheline, every time. Please wrap in CONFIG_XEN
> > and skip at runtime if Xen is not enabled.
>
> Every time? Only when running successive gup_fasts on different cpus,
> and only twice per gup_fast. (What's the typical page count? I see that
> kvm and lguest are page-at-a-time users, but presumably direct IO has
> larger batches.)

The larger the batch, the longer the irq-off latency; I've just proposed
adding a batch mechanism to gup_fast() to limit this.
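
For illustration, one caller-side way to bound the irq-off window
(hypothetical; the actual proposal is not reproduced here):

#include <linux/mm.h>
#include <linux/kernel.h>	/* min() */

#define GUP_BATCH	32	/* arbitrary bound on the irq-off walk */

static int gup_fast_batched(unsigned long start, int nr_pages, int write,
			    struct page **pages)
{
	int done = 0;

	while (done < nr_pages) {
		int chunk = min(nr_pages - done, GUP_BATCH);
		int ret = get_user_pages_fast(start + done * PAGE_SIZE,
					      chunk, write, pages + done);

		if (ret < 0)
			return done ? done : ret;
		done += ret;
		if (ret < chunk)	/* partial result: give up batching */
			break;
	}
	return done;
}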



2009-03-28 12:38:35

by Peter Zijlstra

Subject: Re: [PATCH 1/2] x86/mm: maintain a percpu "in get_user_pages_fast" flag

On Sat, 2009-03-28 at 08:54 +0100, Eric Dumazet wrote:
> Jeremy Fitzhardinge wrote:
> > Avi Kivity wrote:
> >> Jeremy Fitzhardinge wrote:
> >>> get_user_pages_fast() relies on cross-cpu tlb flushes being a barrier
> >>> between clearing and setting a pte, and before freeing a pagetable page.
> >>> It usually does this by disabling interrupts to hold off IPIs, but
> >>> some tlb flush implementations don't use IPIs for tlb flushes, and
> >>> must use another mechanism.
> >>>
> >>> In this change, add in_gup_cpumask, which is a cpumask of cpus currently
> >>> performing a get_user_pages_fast traversal of a pagetable. A cross-cpu
> >>> tlb flush function can use this to determine whether it should hold off
> >>> on the flush until the gup_fast has finished.
> >>>
> >>> @@ -255,6 +260,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> >>> * address down to the page and take a ref on it.
> >>> */
> >>> local_irq_disable();
> >>> +
> >>> + cpu = smp_processor_id();
> >>> + cpumask_set_cpu(cpu, in_gup_cpumask);
> >>> +
> >>
> >> This will bounce a cacheline, every time. Please wrap in CONFIG_XEN
> >> and skip at runtime if Xen is not enabled.
> >
> > Every time? Only when running successive gup_fasts on different cpus,
> > and only twice per gup_fast. (What's the typical page count? I see that
> > kvm and lguest are page-at-a-time users, but presumably direct IO has
> > larger batches.)
>
> If I am not mistaken, shared futexes were hitting the mm semaphore hard.
> Then gup_fast was introduced in kernel/futex.c to remove this contention point.
> Yet, this contention point was process-specific, not a global one :)
>
> And now you want to add a global hot spot that would slow down unrelated
> processes, only because they use shared futexes, thousands of times per
> second...

Yet another reason to turn all this virt muck off :-) I just wish I
could turn off the paravirt code impact; it makes finding functions in
the x86 code a terrible pain.

> > Alternatively, it could have per-cpu flags and the other side could
> > construct the mask (I originally had that, but this was simpler).
>
> Simpler, but it would be a regression for legacy applications still using
> shared futexes (because they are statically linked against an old libc).

Still doesn't help those apps that really use shared futexes.