Hi,
This (still incomplete) patch-set makes part of the mm a lot more preemptible.
It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
also makes mmu_gather preemptible.
The main motivation was making mm_take_all_locks() preemptible, since it
appears people are nesting hundreds of spinlocks there.
The side-effects are that we can finally make mmu_gather preemptible, something
which lots of people have wanted to do for a long time.
It also gets us anon_vma refcounting which seems to be wanted by KSM as well as
Mel's compaction work.
This patch-set seems to build and boot on my x86_64 machines and even builds a
kernel. I've also attempted powerpc and sparc, which I've compile tested with
their respective defconfigs; remaining are (afaict the rest uses the generic
tlb bits):
- s390
- ia64
- arm
- superh
- um
From those, s390 and ia64 look 'interesting', arm and superh seem very similar
and should be relatively easy (-rt has a patchlet for arm iirc).
What kind of performance tests would people have me run on this to satisfy
their need for numbers? I've done a kernel build on x86_64 and if anything that
was slightly faster with these patches, but it was well within the noise
levels so it might be heat noise I'm looking at ;-)
From: Peter Zijlstra <[email protected]>
Date: Thu, 08 Apr 2010 21:17:37 +0200
> This patch-set seems to build and boot on my x86_64 machines and even builds a
> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs; remaining are (afaict the rest uses the generic
> tlb bits):
Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
work basically fine.
On Thu, 2010-04-08 at 13:29 -0700, David Miller wrote:
> From: Peter Zijlstra <[email protected]>
> Date: Thu, 08 Apr 2010 21:17:37 +0200
>
> > This patch-set seems to build and boot on my x86_64 machines and even builds a
> > kernel. I've also attempted powerpc and sparc, which I've compile tested with
> > their respective defconfigs; remaining are (afaict the rest uses the generic
> > tlb bits):
>
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.
Wheee, thanks Dave!
From: David Miller <[email protected]>
Date: Thu, 08 Apr 2010 13:29:35 -0700 (PDT)
> From: Peter Zijlstra <[email protected]>
> Date: Thu, 08 Apr 2010 21:17:37 +0200
>
>> This patch-set seems to build and boot on my x86_64 machines and even builds a
>> kernel. I've also attempted powerpc and sparc, which I've compile tested with
>> their respective defconfigs; remaining are (afaict the rest uses the generic
>> tlb bits):
>
> Did a build and test boot of this on my 128-cpu Niagara2 box, seems to
> work basically fine.
Here comes a set of 4 patches which build on top of your
work by:
1) Killing quicklists on sparc64
2) Using the generic RCU page table liberation code on sparc64
3) Implement pte_special() et al. on sparc64
4) Implement get_user_pages_fast() on sparc64
Please add them to your patch set. If you change the RCU generic code
enabler CPP define to be controlled via Kconfig (as we discussed on
IRC) it should be easy to propagate that change into patch #2 here.
Thanks!
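The Kconfig-controlled enable David mentions is conventionally done with a hidden bool that interested architectures select; a sketch (the symbol name here is a guess, not something from this thread):

```
# mm/Kconfig: no prompt, so the symbol can only be selected by arch code
config HAVE_RCU_TABLE_FREE
	bool

# arch/sparc/Kconfig
config SPARC64
	select HAVE_RCU_TABLE_FREE
```

The generic code then tests CONFIG_HAVE_RCU_TABLE_FREE instead of requiring each architecture to define a CPP symbol by hand.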
On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> Hi,
>
> This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> also makes mmu_gather preemptible.
>
> The main motivation was making mm_take_all_locks() preemptible, since it
> appears people are nesting hundreds of spinlocks there.
>
> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.
What's the straight-line performance impact of all this? And how about
concurrency, I wonder. mutexes of course are double the atomics, and
you've added a refcount which is two more again for those paths using
it.
Page faults are very important. We unfortunately have some databases
doing a significant amount of mmap/munmap activity too. I'd like to
see microbenchmark numbers for each of those (both anon and file backed
for page faults).
kbuild does quite a few page faults; that would be an easy thing to
test. Not sure what reasonable kinds of cases exercise parallelism.
> What kind of performance tests would people have me run on this to satisfy
> their need for numbers? I've done a kernel build on x86_64 and if anything that
> was slightly faster with these patches, but it was well within the noise
> levels so it might be heat noise I'm looking at ;-)
Is it because you're reducing the number of TLB flushes, or what
(kbuild isn't multi threaded so on x86 TLB flushes should be really
fast anyway).
On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > Hi,
> >
> > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> > also makes mmu_gather preemptible.
> >
> > The main motivation was making mm_take_all_locks() preemptible, since it
> > appears people are nesting hundreds of spinlocks there.
> >
> > The side-effects are that we can finally make mmu_gather preemptible, something
> > which lots of people have wanted to do for a long time.
>
> What's the straight-line performance impact of all this? And how about
> concurrency, I wonder. mutexes of course are double the atomics, and
> you've added a refcount which is two more again for those paths using
> it.
>
> Page faults are very important. We unfortunately have some databases
> doing a significant amount of mmap/munmap activity too.
You think this would affect the mmap/munmap times in any significant
way? It seems to me those are relatively heavy ops to begin with.
> I'd like to
> see microbenchmark numbers for each of those (both anon and file backed
> for page faults).
OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
a while back and modify it to also do file backed faults.
> kbuild does quite a few page faults; that would be an easy thing to
> test. Not sure what reasonable kinds of cases exercise parallelism.
>
>
> > What kind of performance tests would people have me run on this to satisfy
> > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > was slightly faster with these patches, but it was well within the noise
> > levels so it might be heat noise I'm looking at ;-)
>
> Is it because you're reducing the number of TLB flushes, or what
> (kbuild isn't multi threaded so on x86 TLB flushes should be really
> fast anyway).
I'll try and get some perf stat runs to get some insight into this. But
the numbers were:
time make O=defconfig -j48 bzImage (5x, cache hot)
without: avg: 39.2018s +- 0.3407
with: avg: 38.9886s +- 0.1814
On Fri, Apr 09, 2010 at 10:35:31AM +0200, Peter Zijlstra wrote:
> On Fri, 2010-04-09 at 14:14 +1000, Nick Piggin wrote:
> > On Thu, Apr 08, 2010 at 09:17:37PM +0200, Peter Zijlstra wrote:
> > > Hi,
> > >
> > > This (still incomplete) patch-set makes part of the mm a lot more preemptible.
> > > It converts i_mmap_lock and anon_vma->lock to mutexes. On the way there it
> > > also makes mmu_gather preemptible.
> > >
> > > The main motivation was making mm_take_all_locks() preemptible, since it
> > > appears people are nesting hundreds of spinlocks there.
> > >
> > > The side-effects are that we can finally make mmu_gather preemptible, something
> > > which lots of people have wanted to do for a long time.
> >
> > What's the straight-line performance impact of all this? And how about
> > concurrency, I wonder. mutexes of course are double the atomics, and
> > you've added a refcount which is two more again for those paths using
> > it.
> >
> > Page faults are very important. We unfortunately have some databases
> > doing a significant amount of mmap/munmap activity too.
>
> You think this would affect the mmap/munmap times in any significant
> way? It seems to me those are relatively heavy ops to begin with.
They're actually not _too_ heavy because they just setup and tear
down vmas. No flushing or faulting required (well, in order to make
any _use_ of them you need flushing and faulting of course).
I have some microbenchmarks like this and the page fault test I
could try.
> > I'd like to
> > see microbenchmark numbers for each of those (both anon and file backed
> > for page faults).
>
> OK, I'll dig out that fault test used in the whole mmap_sem/rwsem thread
> a while back and modify it to also do file backed faults.
That'd be good. Anonymous as well of course, for non-databases :)
> > kbuild does quite a few page faults; that would be an easy thing to
> > test. Not sure what reasonable kinds of cases exercise parallelism.
> >
> >
> > > What kind of performance tests would people have me run on this to satisfy
> > > their need for numbers? I've done a kernel build on x86_64 and if anything that
> > > was slightly faster with these patches, but it was well within the noise
> > > levels so it might be heat noise I'm looking at ;-)
> >
> > Is it because you're reducing the number of TLB flushes, or what
> > (kbuild isn't multi threaded so on x86 TLB flushes should be really
> > fast anyway).
>
> I'll try and get some perf stat runs to get some insight into this. But
> the numbers were:
>
> time make O=defconfig -j48 bzImage (5x, cache hot)
>
> without: avg: 39.2018s +- 0.3407
> with: avg: 38.9886s +- 0.1814
Well that's interesting. Nice if it is an improvement and not just some
anomaly. I'd start by looking at TLB flushes maybe? For testing, it
would be nice to make the flush sizes equal so you get more of a
comparison of the straight line code.
Other than this, I don't have a good suggestion of what to test. I mean,
how far can you go? :) Some threaded workloads would probably be a good
idea, though. Java?
Hi Peter,
On Thu, 08 Apr 2010 21:17:37 +0200
Peter Zijlstra <[email protected]> wrote:
> The side-effects are that we can finally make mmu_gather preemptible, something
> which lots of people have wanted to do for a long time.
Yes, that is a good thing. With the preemptible mmu_gather, s390 will
use fewer IDTEs (IDTE is the instruction that flushes all TLB entries
for a given address space) on a full flush. Good :-)
> This patch-set seems to build and boot on my x86_64 machines and even builds a
> kernel. I've also attempted powerpc and sparc, which I've compile tested with
> their respective defconfigs; remaining are (afaict the rest uses the generic
> tlb bits):
>
> - s390
> - ia64
> - arm
> - superh
> - um
>
> From those, s390 and ia64 look 'interesting', arm and superh seem very similar
> and should be relatively easy (-rt has a patchlet for arm iirc).
To get the 'interesting' TLB flushing on s390 working again you need
this patch:
--
[PATCH] s390: preemptible mmu_gather
From: Martin Schwidefsky <[email protected]>
Adapt the stand-alone s390 mmu_gather implementation to the new
preemptible mmu_gather interface.
Signed-off-by: Martin Schwidefsky <[email protected]>
---
arch/s390/include/asm/tlb.h | 43 +++++++++++++++++++++++++------------------
1 file changed, 25 insertions(+), 18 deletions(-)
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -28,45 +28,51 @@
 #include <asm/smp.h>
 #include <asm/tlbflush.h>
 
-#ifndef CONFIG_SMP
-#define TLB_NR_PTRS	1
-#else
-#define TLB_NR_PTRS	508
-#endif
-
 struct mmu_gather {
 	struct mm_struct *mm;
 	unsigned int fullmm;
 	unsigned int nr_ptes;
 	unsigned int nr_pxds;
-	void *array[TLB_NR_PTRS];
+	unsigned int max;
+	void **array;
+	void *local[8];
 };
 
-DECLARE_PER_CPU(struct mmu_gather, mmu_gathers);
-
-static inline struct mmu_gather *tlb_gather_mmu(struct mm_struct *mm,
-						unsigned int full_mm_flush)
+static inline void __tlb_alloc_pages(struct mmu_gather *tlb)
 {
-	struct mmu_gather *tlb = &get_cpu_var(mmu_gathers);
+	unsigned long addr = __get_free_pages(GFP_ATOMIC, 0);
+
+	if (addr) {
+		tlb->array = (void *) addr;
+		tlb->max = PAGE_SIZE / sizeof(void *);
+	}
+}
+
+static inline void tlb_gather_mmu(struct mmu_gather *tlb,
+				  struct mm_struct *mm,
+				  unsigned int full_mm_flush)
+{
 	tlb->mm = mm;
+	tlb->max = ARRAY_SIZE(tlb->local);
+	tlb->array = tlb->local;
 	tlb->fullmm = full_mm_flush || (num_online_cpus() == 1) ||
 		(atomic_read(&mm->mm_users) <= 1 && mm == current->active_mm);
-	tlb->nr_ptes = 0;
-	tlb->nr_pxds = TLB_NR_PTRS;
 	if (tlb->fullmm)
 		__tlb_flush_mm(mm);
-	return tlb;
+	else
+		__tlb_alloc_pages(tlb);
+	tlb->nr_ptes = 0;
+	tlb->nr_pxds = tlb->max;
 }
 
 static inline void tlb_flush_mmu(struct mmu_gather *tlb,
 				 unsigned long start, unsigned long end)
 {
-	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < TLB_NR_PTRS))
+	if (!tlb->fullmm && (tlb->nr_ptes > 0 || tlb->nr_pxds < tlb->max))
 		__tlb_flush_mm(tlb->mm);
 	while (tlb->nr_ptes > 0)
 		pte_free(tlb->mm, tlb->array[--tlb->nr_ptes]);
-	while (tlb->nr_pxds < TLB_NR_PTRS)
+	while (tlb->nr_pxds < tlb->max)
 		/* pgd_free frees the pointer as region or segment table */
 		pgd_free(tlb->mm, tlb->array[tlb->nr_pxds++]);
 }
@@ -79,7 +85,8 @@ static inline void tlb_finish_mmu(struct
 	/* keep the page table cache within bounds */
 	check_pgt_cache();
 
-	put_cpu_var(mmu_gathers);
+	if (tlb->array != tlb->local)
+		free_pages((unsigned long) tlb->array, 0);
 }
 
 /*
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Fri, 2010-04-09 at 18:50 +1000, Nick Piggin wrote:
> I have some microbenchmarks like this and the page fault test I
> could try.
>
If you don't mind, please. If you send them my way I'll run them on my
machines too.
On Fri, 2010-04-09 at 10:03 +0100, David Howells wrote:
> Have you tried compiling this for NOMMU?
No, I will eventually, I'm sure there's something to fix.
On Fri, 2010-04-09 at 10:58 +0200, Martin Schwidefsky wrote:
> [PATCH] s390: preemptible mmu_gather
>
> From: Martin Schwidefsky <[email protected]>
>
> Adapt the stand-alone s390 mmu_gather implementation to the new
> preemptible mmu_gather interface.
>
> Signed-off-by: Martin Schwidefsky <[email protected]>
Thanks a lot Martin!