2005-10-19 07:54:04

by Andrea Arcangeli

[permalink] [raw]
Subject: [PATCH] .text page fault SMP scalability optimization

We had a problem on ppc64 where with more than 4 threads a large system
wouldn't scale well while faulting in the .text (most of the time was
spent in the kernel despite it was an userland compute intensive app).
The reason is the useless overwrite of the same pte from all cpu.

I fixed it this way (verified on an older kernel but the forward port is
almost identical). This will benefit all archs not just ppc64.

Thanks.

From: Andrea Arcangeli <[email protected]>
Subject: make the .text parallel fault from multiple threads scale in smp

Signed-off-by: Andrea Arcangeli <[email protected]>

memory.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/memory.c
--- linux-2.6/mm/memory.c.~1~ 2005-09-21 17:38:48.000000000 +0200
+++ linux-2.6/mm/memory.c 2005-10-18 14:20:25.000000000 +0200
@@ -2000,7 +2000,7 @@ static inline int handle_pte_fault(struc
struct vm_area_struct * vma, unsigned long address,
int write_access, pte_t *pte, pmd_t *pmd)
{
- pte_t entry;
+ pte_t entry, young_entry;

entry = *pte;
if (!pte_present(entry)) {
@@ -2021,10 +2021,12 @@ static inline int handle_pte_fault(struc
return do_wp_page(mm, vma, address, pte, pmd, entry);
entry = pte_mkdirty(entry);
}
- entry = pte_mkyoung(entry);
- ptep_set_access_flags(vma, address, pte, entry, write_access);
- update_mmu_cache(vma, address, entry);
- lazy_mmu_prot_update(entry);
+ young_entry = pte_mkyoung(entry);
+ if (!pte_same(young_entry, entry)) {
+ ptep_set_access_flags(vma, address, pte, young_entry, write_access);
+ update_mmu_cache(vma, address, young_entry);
+ lazy_mmu_prot_update(young_entry);
+ }
pte_unmap(pte);
spin_unlock(&mm->page_table_lock);
return VM_FAULT_MINOR;


2005-10-19 08:15:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] .text page fault SMP scalability optimization

Andrea Arcangeli <[email protected]> wrote:
>
> We had a problem on ppc64 where with more than 4 threads a large system
> wouldn't scale well while faulting in the .text (most of the time was
> spent in the kernel despite it was an userland compute intensive app).
> The reason is the useless overwrite of the same pte from all cpu.
>
> I fixed it this way (verified on an older kernel but the forward port is
> almost identical). This will benefit all archs not just ppc64.

How strange. Do you mena that all CPUs were entering the pagefault handler
on behalf of the same pte all the time? That they're staying in lockstep?

If so, there must be a bunch of page_table_lock contention too?

2005-10-19 09:07:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] .text page fault SMP scalability optimization

On Wed, Oct 19, 2005 at 01:14:20AM -0700, Andrew Morton wrote:
> How strange. Do you mena that all CPUs were entering the pagefault handler
> on behalf of the same pte all the time? That they're staying in lockstep?

Yes.

> If so, there must be a bunch of page_table_lock contention too?

Not really as much. Note also that with latest mainline the ppc64 kernel
was going well even without this patch, it was some older codebase
falling apart, primarly because it was still doing pte_establish there
see:

young_entry = pte_mkyoung(entry);
if (!pte_same(young_entry, entry)) {
ptep_establish(vma, address, pte, young_entry);
update_mmu_cache(vma, address, young_entry);
lazy_mmu_prot_update(young_entry);
}

So those flush actions were a bottleneck when executed by all cpus at
the same time. But some archs still implement it like with the old
codebase, and the patch is quite an obvious optimization that can
clearly avoid useless tlb flushes (and that fixed the problem completely
with the older codebase still dong ptep_establish even on ppc64).

2005-10-30 06:18:46

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: [PATCH] .text page fault SMP scalability optimization

On Wed, 2005-10-19 at 11:06 +0200, Andrea Arcangeli wrote:
> On Wed, Oct 19, 2005 at 01:14:20AM -0700, Andrew Morton wrote:
> > How strange. Do you mena that all CPUs were entering the pagefault handler
> > on behalf of the same pte all the time? That they're staying in lockstep?
>
> Yes.

Nice catch Andrea !

> > If so, there must be a bunch of page_table_lock contention too?
>
> Not really as much. Note also that with latest mainline the ppc64 kernel
> was going well even without this patch, it was some older codebase
> falling apart, primarly because it was still doing pte_establish there
> see:
>
> young_entry = pte_mkyoung(entry);
> if (!pte_same(young_entry, entry)) {
> ptep_establish(vma, address, pte, young_entry);
> update_mmu_cache(vma, address, young_entry);
> lazy_mmu_prot_update(young_entry);
> }

Yes, that's one reason why I introduced ptep_set_access_flags() to be
used when relaxing access permissions to a PTE.

> So those flush actions were a bottleneck when executed by all cpus at
> the same time. But some archs still implement it like with the old
> codebase, and the patch is quite an obvious optimization that can
> clearly avoid useless tlb flushes (and that fixed the problem completely
> with the older codebase still dong ptep_establish even on ppc64).

Note that we should really pass more than just "write_access" from the
arch code. We could make good use of "execute" in some cases as well,
also knowing wether this is a real fault or the result of
get_user_pages() (in some case, the former could use more agressive TLB
pre-loading, not the later). Finally, those infos should be passed to
update_mmu_cache().

Ben.


2005-10-30 14:45:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] .text page fault SMP scalability optimization

On Sun, Oct 30, 2005 at 05:17:31PM +1100, Benjamin Herrenschmidt wrote:
> Note that we should really pass more than just "write_access" from the
> arch code. We could make good use of "execute" in some cases as well,

Indeed. Bitflags sounds the most efficient way to do that. Currently
write_access is 0/1 value only.

> also knowing wether this is a real fault or the result of
> get_user_pages() (in some case, the former could use more agressive TLB
> pre-loading, not the later). Finally, those infos should be passed to
> update_mmu_cache().

Interesting.