LinuxLists.cc - DRM lock ordering fix series

2009-03-26 19:59:39

On Fri, 2009-03-27 at 10:34 +0100, Andi Kleen wrote:
> Eric Anholt <[email protected]> writes:
>
> > Here's hopefully the final attempt at the lock ordering fix for GEM. The
> > problem was introduced in .29 with the GTT mapping support. We hashed out
> > a few potential fixes on the mailing list and at OSTS. Peter's plan was
> > to use get_user_pages, but it has significant CPU overhead (10% cost to text
> > rendering, though part of that is due to some dumb userland code. But it's
> > dumb userland code we're all running).
>
>
> You are aware that there is a fast path now (get_user_pages_fast) which
> is significantly faster? (but has some limitations)

In the code I have, get_user_pages_fast is just a wrapper that calls the
get_user_pages in the way that I'm calling it from the DRM.

I'm assuming that that's changing. Can you explain what the "some
limitations" is? (Though, of course, in a comment in mm/util.c would be
better than email).

--
Eric Anholt
[email protected] [email protected]

Attachments:

signature.asc (197.00 B)
This is a digitally signed message part

2009-03-27 16:36:57

by Eric Anholt

[permalink] [raw]

Subject: Re: DRM lock ordering fix series

On Fri, 2009-03-27 at 09:19 -0700, Eric Anholt wrote:
> On Fri, 2009-03-27 at 10:34 +0100, Andi Kleen wrote:
> > Eric Anholt <[email protected]> writes:
> >
> > > Here's hopefully the final attempt at the lock ordering fix for GEM. The
> > > problem was introduced in .29 with the GTT mapping support. We hashed out
> > > a few potential fixes on the mailing list and at OSTS. Peter's plan was
> > > to use get_user_pages, but it has significant CPU overhead (10% cost to text
> > > rendering, though part of that is due to some dumb userland code. But it's
> > > dumb userland code we're all running).
> >
> >
> > You are aware that there is a fast path now (get_user_pages_fast) which
> > is significantly faster? (but has some limitations)
>
> In the code I have, get_user_pages_fast is just a wrapper that calls the
> get_user_pages in the way that I'm calling it from the DRM.

Ah, I see: that's a weak stub, and there is a real implementation. I
didn't know we could do weak stubs.

Still, needs docs badly.

--
Eric Anholt
[email protected] [email protected]

Attachments:

signature.asc (197.00 B)
This is a digitally signed message part

2009-03-27 16:56:19

On Fri, 2009-03-27 at 19:10 +0100, Andi Kleen wrote:
> On Fri, Mar 27, 2009 at 09:36:45AM -0700, Eric Anholt wrote:
> > > > You are aware that there is a fast path now (get_user_pages_fast) which
> > > > is significantly faster? (but has some limitations)
> > >
> > > In the code I have, get_user_pages_fast is just a wrapper that calls the
> > > get_user_pages in the way that I'm calling it from the DRM.
> >
> > Ah, I see: that's a weak stub, and there is a real implementation. I
> > didn't know we could do weak stubs.
>
> The main limitation is that it only works for your current process,
> not another one. For more details you can check the git changelog
> that added it (8174c430e445a93016ef18f717fe570214fa38bf)
>
> And yes it's only faster for architectures that support it, that's
> currently x86 and ppc.

OK. I'm not too excited here -- 10% of 2% of the CPU time doesn't get
me to the 10% loss that the slow path added up to. Most of the cost is
in k{un,}map_atomic of the returned pages. If the gup somehow filled in
the user's PTEs, I'd be happy and always use that (since then I'd have
the mapping already in place and just use that). But I think I can see
why that can't be done.

I suppose I could rework this so that we get_user_pages_fast outside the
lock, then walk doing copy_from_user_inatomic, and fall back to
kmap_atomic of the page list if we fault on the user's address. It's
still going to be a cost in our hot path, though, so I'd rather not.

I'm working on a set of tests and microbenchmarks for GEM, so other
people will be able to play with this easily soon.

--
Eric Anholt
[email protected] [email protected]

Attachments:

signature.asc (197.00 B)
This is a digitally signed message part

2009-03-27 21:04:35

by Andi Kleen

[permalink] [raw]

Subject: Re: DRM lock ordering fix series

> OK. I'm not too excited here -- 10% of 2% of the CPU time doesn't get
> me to the 10% loss that the slow path added up to. Most of the cost is
> in k{un,}map_atomic of the returned pages. If the gup somehow filled in
> the user's PTEs, I'd be happy and always use that (since then I'd have

On x86 the user PTEs are already there if it's your current process
context so you could just use them. And it's even safe to use as long as its
locked. But that would be a x86 specific hack, not working on all platforms
that have split kernel user address spaces (that includes UML)

-Andi

--
[email protected] -- Speaking for myself only.

2009-03-28 00:54:47

by Peter Zijlstra

[permalink] [raw]

On Saturday 28 March 2009 23:46:14 Peter Zijlstra wrote:
> On Sat, 2009-03-28 at 13:22 +0100, Peter Zijlstra wrote:
> > I'm not really trusting my brain today, but something like the below
> > should work I think.
> >
> > Nick, any thoughts?
> >
> > Not-Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > arch/x86/mm/gup.c | 24 +++++++++++++++++++++---
> > 1 files changed, 21 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> > index be54176..4ded5c3 100644
> > --- a/arch/x86/mm/gup.c
> > +++ b/arch/x86/mm/gup.c
> > @@ -11,6 +11,8 @@
> >
> > #include <asm/pgtable.h>
> >
> > +#define GUP_BATCH 32
> > +
> > static inline pte_t gup_get_pte(pte_t *ptep)
> > {
> > #ifndef CONFIG_X86_PAE
> > @@ -91,7 +93,8 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned
> > long addr, get_page(page);
> > pages[*nr] = page;
> > (*nr)++;
> > -
> > + if (*nr > GUP_BATCH)
> > + break;
> > } while (ptep++, addr += PAGE_SIZE, addr != end);
> > pte_unmap(ptep - 1);
> >
> > @@ -157,6 +160,8 @@ static int gup_pmd_range(pud_t pud, unsigned long
> > addr, unsigned long end, if (!gup_pte_range(pmd, addr, next, write,
> > pages, nr))
> > return 0;
> > }
> > + if (*nr > GUP_BATCH)
> > + break;
> > } while (pmdp++, addr = next, addr != end);
> >
> > return 1;
> > @@ -214,6 +219,8 @@ static int gup_pud_range(pgd_t pgd, unsigned long
> > addr, unsigned long end, if (!gup_pmd_range(pud, addr, next, write,
> > pages, nr))
> > return 0;
> > }
> > + if (*nr > GUP_BATCH)
> > + break;
> > } while (pudp++, addr = next, addr != end);
> >
> > return 1;
> > @@ -226,7 +233,7 @@ int get_user_pages_fast(unsigned long start, int
> > nr_pages, int write, unsigned long addr, len, end;
> > unsigned long next;
> > pgd_t *pgdp;
> > - int nr = 0;
> > + int batch = 0, nr = 0;
> >
> > start &= PAGE_MASK;
> > addr = start;
> > @@ -254,6 +261,7 @@ int get_user_pages_fast(unsigned long start, int
> > nr_pages, int write, * (which we do on x86, with the above PAE
> > exception), we can follow the * address down to the the page and take a
> > ref on it.
> > */
> > +again:
> > local_irq_disable();
> > pgdp = pgd_offset(mm, addr);
> > do {
> > @@ -262,11 +270,21 @@ int get_user_pages_fast(unsigned long start, int
> > nr_pages, int write, next = pgd_addr_end(addr, end);
> > if (pgd_none(pgd))
> > goto slow;
> > - if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
> > + if (!gup_pud_range(pgd, addr, next, write, pages, &batch))
> > goto slow;
> > + if (batch > GUP_BATCH) {
> > + local_irq_enable();
> > + addr += batch << PAGE_SHIFT;
> > + nr += batch;
> > + batch = 0;
> > + if (addr != end)
> > + goto again;
> > + }
> > } while (pgdp++, addr = next, addr != end);
> > local_irq_enable();
> >
> > + nr += batch;
> > +
> > VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
> > return nr;
>
> Would also need the following bit:
>
> @@ -274,6 +292,7 @@ int get_user_pages_fast(unsigned long start, int
> nr_pages, int write, int ret;
>
> slow:
> + nr += batch;
> local_irq_enable();
> slow_irqon:
> /* Try to get the remaining pages with get_user_pages */

Yeah something like this would be fine (and welcome). And we can
remove the XXX comment in there too. I would suggest 64 being a
reasonable value simply because that's what direct IO does.

Implementation-wise, why not just break "len" into chunks in the
top level function rather than add branches all down the call
chain?

2009-06-24 13:46:19

by Brice Goglin

[permalink] [raw]

Subject: Re: [RFC] x86: gup_fast() batch limit

Any news about this patch?

Brice

Nick Piggin wrote:
> On Saturday 28 March 2009 23:46:14 Peter Zijlstra wrote:
>
>> On Sat, 2009-03-28 at 13:22 +0100, Peter Zijlstra wrote:
>>
>>> I'm not really trusting my brain today, but something like the below
>>> should work I think.
>>>
>>> Nick, any thoughts?
>>>
>>> Not-Signed-off-by: Peter Zijlstra <[email protected]>
>>> ---
>>> arch/x86/mm/gup.c | 24 +++++++++++++++++++++---
>>> 1 files changed, 21 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
>>> index be54176..4ded5c3 100644
>>> --- a/arch/x86/mm/gup.c
>>> +++ b/arch/x86/mm/gup.c
>>> @@ -11,6 +11,8 @@
>>>
>>> #include <asm/pgtable.h>
>>>
>>> +#define GUP_BATCH 32
>>> +
>>> static inline pte_t gup_get_pte(pte_t *ptep)
>>> {
>>> #ifndef CONFIG_X86_PAE
>>> @@ -91,7 +93,8 @@ static noinline int gup_pte_range(pmd_t pmd, unsigned
>>> long addr, get_page(page);
>>> pages[*nr] = page;
>>> (*nr)++;
>>> -
>>> + if (*nr > GUP_BATCH)
>>> + break;
>>> } while (ptep++, addr += PAGE_SIZE, addr != end);
>>> pte_unmap(ptep - 1);
>>>
>>> @@ -157,6 +160,8 @@ static int gup_pmd_range(pud_t pud, unsigned long
>>> addr, unsigned long end, if (!gup_pte_range(pmd, addr, next, write,
>>> pages, nr))
>>> return 0;
>>> }
>>> + if (*nr > GUP_BATCH)
>>> + break;
>>> } while (pmdp++, addr = next, addr != end);
>>>
>>> return 1;
>>> @@ -214,6 +219,8 @@ static int gup_pud_range(pgd_t pgd, unsigned long
>>> addr, unsigned long end, if (!gup_pmd_range(pud, addr, next, write,
>>> pages, nr))
>>> return 0;
>>> }
>>> + if (*nr > GUP_BATCH)
>>> + break;
>>> } while (pudp++, addr = next, addr != end);
>>>
>>> return 1;
>>> @@ -226,7 +233,7 @@ int get_user_pages_fast(unsigned long start, int
>>> nr_pages, int write, unsigned long addr, len, end;
>>> unsigned long next;
>>> pgd_t *pgdp;
>>> - int nr = 0;
>>> + int batch = 0, nr = 0;
>>>
>>> start &= PAGE_MASK;
>>> addr = start;
>>> @@ -254,6 +261,7 @@ int get_user_pages_fast(unsigned long start, int
>>> nr_pages, int write, * (which we do on x86, with the above PAE
>>> exception), we can follow the * address down to the the page and take a
>>> ref on it.
>>> */
>>> +again:
>>> local_irq_disable();
>>> pgdp = pgd_offset(mm, addr);
>>> do {
>>> @@ -262,11 +270,21 @@ int get_user_pages_fast(unsigned long start, int
>>> nr_pages, int write, next = pgd_addr_end(addr, end);
>>> if (pgd_none(pgd))
>>> goto slow;
>>> - if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
>>> + if (!gup_pud_range(pgd, addr, next, write, pages, &batch))
>>> goto slow;
>>> + if (batch > GUP_BATCH) {
>>> + local_irq_enable();
>>> + addr += batch << PAGE_SHIFT;
>>> + nr += batch;
>>> + batch = 0;
>>> + if (addr != end)
>>> + goto again;
>>> + }
>>> } while (pgdp++, addr = next, addr != end);
>>> local_irq_enable();
>>>
>>> + nr += batch;
>>> +
>>> VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
>>> return nr;
>>>
>> Would also need the following bit:
>>
>> @@ -274,6 +292,7 @@ int get_user_pages_fast(unsigned long start, int
>> nr_pages, int write, int ret;
>>
>> slow:
>> + nr += batch;
>> local_irq_enable();
>> slow_irqon:
>> /* Try to get the remaining pages with get_user_pages */
>>
>
>
> Yeah something like this would be fine (and welcome). And we can
> remove the XXX comment in there too. I would suggest 64 being a
> reasonable value simply because that's what direct IO does.
>
> Implementation-wise, why not just break "len" into chunks in the
> top level function rather than add branches all down the call
> chain?
>

2009-06-24 17:12:43

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] x86: gup_fast() batch limit

On Wed, 2009-06-24 at 15:46 +0200, Brice Goglin wrote:
> Any news about this patch?

I knew I was forgetting something, I'll try and find a hole to
ressurect/finish this ;-)

2009-06-24 19:55:35

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [RFC] x86: gup_fast() batch limit

On Wed, 2009-06-24 at 15:46 +0200, Brice Goglin wrote:
> Any news about this patch?

Compile tested on x86_64 and ppc64.

---
Implement the batching mentioned in the gup_fast comment.

Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/powerpc/mm/gup.c | 28 +++++++++++++---------------
arch/x86/mm/gup.c | 46 ++++++++++++++++++++--------------------------
2 files changed, 33 insertions(+), 41 deletions(-)

diff --git a/arch/powerpc/mm/gup.c b/arch/powerpc/mm/gup.c
index bc400c7..cf535bf 100644
--- a/arch/powerpc/mm/gup.c
+++ b/arch/powerpc/mm/gup.c
@@ -146,11 +146,13 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}

+#define CHUNK_SIZE (64 * PAGE_SIZE)
+
int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
struct mm_struct *mm = current->mm;
- unsigned long addr, len, end;
+ unsigned long addr, len, end, chunk;
unsigned long next;
pgd_t *pgdp;
int nr = 0;
@@ -191,16 +193,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
}
#endif /* CONFIG_HUGETLB_PAGE */

- /*
- * XXX: batch / limit 'nr', to avoid large irq off latency
- * needs some instrumenting to determine the common sizes used by
- * important workloads (eg. DB2), and whether limiting the batch size
- * will decrease performance.
- *
- * It seems like we're in the clear for the moment. Direct-IO is
- * the main guy that batches up lots of get_user_pages, and even
- * they are limited to 64-at-a-time which is not so many.
- */
+again:
+ chunk = min(addr + CHUNK_SIZE, end);
+
/*
* This doesn't prevent pagetable teardown, but does prevent
* the pagetables from being freed on powerpc.
@@ -235,10 +230,10 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, a)].shift);
ptep = huge_pte_offset(mm, a);
pr_debug(" %016lx: huge ptep %p\n", a, ptep);
- if (!ptep || !gup_huge_pte(ptep, hstate, &a, end, write, pages,
+ if (!ptep || !gup_huge_pte(ptep, hstate, &a, chunk, write, pages,
&nr))
goto slow;
- } while (a != end);
+ } while (a != chunk);
} else
#endif /* CONFIG_HUGETLB_PAGE */
{
@@ -251,15 +246,18 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
#endif
pr_debug(" %016lx: normal pgd %p\n", addr,
(void *)pgd_val(pgd));
- next = pgd_addr_end(addr, end);
+ next = pgd_addr_end(addr, chunk);
if (pgd_none(pgd))
goto slow;
if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
goto slow;
- } while (pgdp++, addr = next, addr != end);
+ } while (pgdp++, addr = next, addr != chunk);
}
local_irq_enable();

+ if (addr != end)
+ goto again;
+
VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
return nr;

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 71da1bc..9e0552b 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -219,6 +219,8 @@ static int gup_pud_range(pgd_t pgd, unsigned long addr, unsigned long end,
return 1;
}

+#define CHUNK_SIZE (64 * PAGE_SIZE)
+
/*
* Like get_user_pages_fast() except its IRQ-safe in that it won't fall
* back to the regular GUP.
@@ -227,7 +229,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
struct mm_struct *mm = current->mm;
- unsigned long addr, len, end;
+ unsigned long addr, len, end, chunk;
unsigned long next;
unsigned long flags;
pgd_t *pgdp;
@@ -241,16 +243,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
(void __user *)start, len)))
return 0;

- /*
- * XXX: batch / limit 'nr', to avoid large irq off latency
- * needs some instrumenting to determine the common sizes used by
- * important workloads (eg. DB2), and whether limiting the batch size
- * will decrease performance.
- *
- * It seems like we're in the clear for the moment. Direct-IO is
- * the main guy that batches up lots of get_user_pages, and even
- * they are limited to 64-at-a-time which is not so many.
- */
+again:
+ chunk = min(addr + CHUNK_SIZE, end);
+
/*
* This doesn't prevent pagetable teardown, but does prevent
* the pagetables and pages from being freed on x86.
@@ -264,14 +259,17 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
do {
pgd_t pgd = *pgdp;

- next = pgd_addr_end(addr, end);
+ next = pgd_addr_end(addr, chunk);
if (pgd_none(pgd))
break;
if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
break;
- } while (pgdp++, addr = next, addr != end);
+ } while (pgdp++, addr = next, addr != chunk);
local_irq_restore(flags);

+ if (addr != end)
+ goto again;
+
return nr;
}

@@ -295,7 +293,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages)
{
struct mm_struct *mm = current->mm;
- unsigned long addr, len, end;
+ unsigned long addr, len, end, chunk;
unsigned long next;
pgd_t *pgdp;
int nr = 0;
@@ -313,16 +311,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
goto slow_irqon;
#endif

- /*
- * XXX: batch / limit 'nr', to avoid large irq off latency
- * needs some instrumenting to determine the common sizes used by
- * important workloads (eg. DB2), and whether limiting the batch size
- * will decrease performance.
- *
- * It seems like we're in the clear for the moment. Direct-IO is
- * the main guy that batches up lots of get_user_pages, and even
- * they are limited to 64-at-a-time which is not so many.
- */
+again:
+ chunk = min(addr + CHUNK_SIZE, end);
+
/*
* This doesn't prevent pagetable teardown, but does prevent
* the pagetables and pages from being freed on x86.
@@ -336,14 +327,17 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
do {
pgd_t pgd = *pgdp;

- next = pgd_addr_end(addr, end);
+ next = pgd_addr_end(addr, chunk);
if (pgd_none(pgd))
goto slow;
if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
goto slow;
- } while (pgdp++, addr = next, addr != end);
+ } while (pgdp++, addr = next, addr != chunk);
local_irq_enable();

+ if (addr != end)
+ goto again;
+
VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
return nr;