2013-07-31 21:43:43

by Rik van Riel

[permalink] [raw]
Subject: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

Don Zickus and Joe Mario have been working on improvements to
perf, and noticed heavy cache line contention on the mm_cpumask,
running linpack on a 60 core / 120 thread system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.
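
To make the difference concrete, here is a minimal sketch (not the
patch itself) of the two access patterns; the cpumask helpers are the
real ones, reload_cr3_and_ldt() is just a stand-in for the
load_cr3()/load_LDT_nolock() pair in the real code:

/* Before: always a locked RMW; the cacheline is pulled in exclusive
 * state on every context switch, even when the bit is already set. */
if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next)))
        reload_cr3_and_ldt();

/* After: a plain read in the common case; the locked set_bit only
 * runs when the bit was actually clear (lazy TLB re-entry). */
if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
        cpumask_set_cpu(cpu, mm_cpumask(next));
        reload_cr3_and_ldt();
}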

Here is a summary of Joe's test results:

* The __schedule function dropped from 24% of all program cycles down
to 5.5%.
* The cacheline contention/hotness for accesses to that bitmask went
from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
all shared misses which is now quite cold)
* The average load latency for the bit-test-n-set instruction in
__schedule dropped from 10k-15k cycles down to an average of 600 cycles.
* The linpack program results improved from 133 GFlops to 144 GFlops.
Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <[email protected]>
Reported-by: Joe Mario <[email protected]>
Tested-by: Joe Mario <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..987eb3d 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -59,11 +59,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

- if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
/* We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables.
*/
+ cpumask_set_cpu(cpu, mm_cpumask(next));
load_cr3(next->pgd);
load_LDT_nolock(&next->context);
}
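
For readers without the full file at hand: the hunk above sits in the
prev == next branch of switch_mm(). Heavily trimmed, the function
looks roughly like this (details elided):

static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
                             struct task_struct *tsk)
{
        unsigned cpu = smp_processor_id();

        if (likely(prev != next)) {
                /* Real mm switch: set our bit in next, load_cr3(),
                 * stop flush IPIs for prev, reload the LDT if needed. */
        }
#ifdef CONFIG_SMP
        else {
                /* Same mm again (e.g. two threads of one process, or
                 * coming back from lazy TLB): the path patched above. */
        }
#endif
}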


2013-07-31 21:47:07

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <[email protected]> wrote:
> Don Zickus and Joe Mario have been working on improvements to
> perf, and noticed heavy cache line contention on the mm_cpumask,
> running linpack on a 60 core / 120 thread system.
>
> The cause turned out to be unnecessary atomic accesses to the
> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> the mm_cpumask if there is a TLB flush event.
>
> Most of the time, no such TLB flush happens, and the kernel
> skips the TLB reload. It can also skip the atomic memory
> set & test.
>
> Here is a summary of Joe's test results:
>
> * The __schedule function dropped from 24% of all program cycles down
> to 5.5%.
> * The cacheline contention/hotness for accesses to that bitmask went
> from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
> all shared misses which is now quite cold)
> * The average load latency for the bit-test-n-set instruction in
> __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
> * The linpack program results improved from 133 GFlops to 144 GFlops.
> Peak GFlops rose from 133 to 153.
>
> Reported-by: Don Zickus <[email protected]>
> Reported-by: Joe Mario <[email protected]>
> Tested-by: Joe Mario <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> arch/x86/include/asm/mmu_context.h | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> index cdbf367..987eb3d 100644
> --- a/arch/x86/include/asm/mmu_context.h
> +++ b/arch/x86/include/asm/mmu_context.h
> @@ -59,11 +59,12 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>
> - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
> + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> /* We were in lazy tlb mode and leave_mm disabled
> * tlb flush IPI delivery. We must reload CR3
> * to make sure to use no freed page tables.
> */
> + cpumask_set_cpu(cpu, mm_cpumask(next));
> load_cr3(next->pgd);
> load_LDT_nolock(&next->context);
> }

We're carrying the *exact* same patch for the *exact* same reason. I've
been meaning to send it out but wasn't sure of a good external
workload for this.

Reviewed-by: Paul Turner <[email protected]>

2013-07-31 22:08:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <[email protected]> wrote:
>
> The cause turned out to be unnecessary atomic accesses to the
> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> the mm_cpumask if there is a TLB flush event.
>
> Most of the time, no such TLB flush happens, and the kernel
> skips the TLB reload. It can also skip the atomic memory
> set & test.

The patch looks obvious, and I'm not seeing any very clear reasons for
why we would want that test-and-set to be atomic. That said, I'd like
to have some explicit comments about exactly why it doesn't need the
atomicity. Because afaik, there actually are concurrent readers _and_
writers of that mask, and the accesses are not locked by anything
here.

I _think_ the reason for this all being safe is simply that the only
real race is "We need to set the bit before we load the page table,
and we're protected against that bit being cleared because the TLB
state is TLBSTATE_OK and thus TLB flushing will no longer leave that
mm".

But damn, it all looks subtle as hell. That code does:

this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {

and I'm wondering if we need a barrier to make sure that that
TLBSTATE_OK write happens *before* we test the cpumask. With
test_and_set(), we have the barrier in the test-and-set. But with just
test_bit, I'm not seeing why the compiler couldn't re-order them. I
suspect it won't, but...

So I think the patch is _probably_ fine, but I think this is a hell of
a lot more subtle than it all looks, and I'd like people to show that
they have thought this through. And I don't see any indication of that
in the changelog - it only talks about the performance issues, not
about the *correctness* issues.

Please?

Linus

2013-07-31 22:16:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On 07/31/2013 06:07 PM, Linus Torvalds wrote:
> On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <[email protected]> wrote:
>>
>> The cause turned out to be unnecessary atomic accesses to the
>> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
>> the mm_cpumask if there is a TLB flush event.
>>
>> Most of the time, no such TLB flush happens, and the kernel
>> skips the TLB reload. It can also skip the atomic memory
>> set & test.
>
> The patch looks obvious, and I'm not seeing any very clear reasons for
> why we would want that test-and-set to be atomic. That said, I'd like
> to have some explicit comments about exactly why it doesn't need the
> atomicity. Because afaik, there actually are concurrent readers _and_
> writers of that mask, and the accesses are not locked by anything
> here.
>
> I _think_ the reason for this all being safe is simply that the only
> real race is "We need to set the bit before we load the page table,
> and we're protected against that bit being cleared because the TLB
> state is TLBSTATE_OK and thus TLB flushing will no longer leave that
> mm".
>
> But damn, it all looks subtle as hell. That code does:
>
> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>
> if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
>
> and I'm wondering if we need a barrier to make sure that that
> TLBSTATE_OK write happens *before* we test the cpumask. With
> test_and_set(), we have the barrier in the test-and-set. But with just
> test_bit, I'm not seeing why the compiler couldn't re-order them. I
> suspect it won't, but...

cpumask_set_cpu expands to set_bit, which is atomic.

Do we need any explicit compiler barrier in addition to the
atomic operation performed by set_bit?
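
For reference, cpumask_set_cpu() is only a thin wrapper; from
include/linux/cpumask.h it is essentially:

static inline void cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
        /* atomic RMW on the bit for this cpu */
        set_bit(cpumask_check(cpu), cpumask_bits(dstp));
}

and on x86 set_bit() compiles to a lock-prefixed instruction, which is
also a full memory barrier on that architecture.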

I would be happy to rewrite the comment right above the
cpumask_set_cpu call if you want.

--
All rights reversed

2013-07-31 22:39:29

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On 07/31/2013 06:21 PM, Linus Torvalds wrote:
> Ummm.. The race is to the testing of the bit, not setting. The testing
> of the bit is not valid before we have set the tlb state, AFAIK.

I believe the bit is cleared and set by the current CPU.

Clearing is done from the TLB flush IPI handler, or by directly
calling leave_mm from ptep_clear_flush if the flush originated
locally. The exception is clear_tasks_mm_cpumask, which may
only be called for an already offlined CPU.
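
For reference, the clear path (leave_mm(), roughly as it looks in
arch/x86/mm/tlb.c, trimmed) only ever touches the bit of the CPU it is
running on:

void leave_mm(int cpu)
{
        struct mm_struct *active_mm = this_cpu_read(cpu_tlbstate.active_mm);

        if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
                BUG();

        if (cpumask_test_cpu(cpu, mm_cpumask(active_mm))) {
                /* drop ourselves from the mask and point CR3 away
                 * from the (possibly about-to-be-freed) page tables */
                cpumask_clear_cpu(cpu, mm_cpumask(active_mm));
                load_cr3(swapper_pg_dir);
        }
}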

I believe setting is only ever done in switch_mm.

Interrupts are blocked inside switch_mm, so I think we
are safe.

Would you like a comment to this effect in the code, or
are there other things we need to check first?

> On Jul 31, 2013 3:16 PM, "Rik van Riel" <[email protected]> wrote:
>
> On 07/31/2013 06:07 PM, Linus Torvalds wrote:
>
> On Wed, Jul 31, 2013 at 2:43 PM, Rik van Riel <[email protected]> wrote:
>
>
> The cause turned out to be unnecessary atomic accesses to the
> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> the mm_cpumask if there is a TLB flush event.
>
> Most of the time, no such TLB flush happens, and the kernel
> skips the TLB reload. It can also skip the atomic memory
> set & test.
>
>
> The patch looks obvious, and I'm not seeing any very clear reasons for
> why we would want that test-and-set to be atomic. That said, I'd like
> to have some explicit comments about exactly why it doesn't need the
> atomicity. Because afaik, there actually are concurrent readers _and_
> writers of that mask, and the accesses are not locked by anything
> here.
>
> I _think_ the reason for this all being safe is simply that the only
> real race is "We need to set the bit before we load the page table,
> and we're protected against that bit being cleared because the TLB
> state is TLBSTATE_OK and thus TLB flushing will no longer leave that
> mm".
>
> But damn, it all looks subtle as hell. That code does:
>
> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>
> if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
>
> and I'm wondering if we need a barrier to make sure that that
> TLBSTATE_OK write happens *before* we test the cpumask. With
> test_and_set(), we have the barrier in the test-and-set. But with just
> test_bit, I'm not seeing why the compiler couldn't re-order them. I
> suspect it won't, but...
>
>
> cpumask_set_cpu expands to set_bit, which is atomic.
>
> Do we need any explicit compiler barrier in addition to the
> atomic operation performed by set_bit?
>
> I would be happy to rewrite the comment right above the
> cpumask_set_cpu call if you want.
>
> --
> All rights reversed
>


--
All rights reversed

2013-07-31 23:12:37

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On 07/31/2013 06:46 PM, Linus Torvalds wrote:
>
> On Jul 31, 2013 3:39 PM, "Rik van Riel" <[email protected]> wrote:
> >
> > On 07/31/2013 06:21 PM, Linus Torvalds wrote:
> >>
> >> Ummm.. The race is to the testing of the bit, not setting. The testing
> >> of the bit is not valid before we have set the tlb state, AFAIK.
> >
> >
> > I believe the bit is cleared and set by the current CPU.
>
> Yes, but we need to be careful with interrupts.
>
> > Interrupts are blocked inside switch_mm, so I think we
> > are safe.
>
> Are they? I thought we removed all that.

context_switch() shows that the runqueue lock (which is an irq
lock) is released, and irqs re-enabled, by the next task, after
switch_to(), in finish_lock_switch(), called from finish_task_switch().

> Note that switch_mm gets called for activate_mm too, or something.

Good catch, though it looks like activate_mm is only called from
exec_mmap, with the new mm as its argument. While the new mm may
already have pages in memory, it has never been run, so there should
be nothing in the TLB yet for the new mm.

This is subtler than I thought, but it does appear to be safe.
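
For reference, on x86 activate_mm() is just a macro around switch_mm()
(arch/x86/include/asm/mmu_context.h, slightly paraphrased):

#define activate_mm(prev, next)                         \
do {                                                    \
        paravirt_activate_mm((prev), (next));           \
        switch_mm((prev), (next), NULL);                \
} while (0)

so exec_mmap() goes through the exact same code path, just with a
freshly created mm as next.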

--
All rights reversed

2013-07-31 23:14:43

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

We attached the following explanatory comment to our version of the patch:

/*
* In the common case (two user threads sharing mm
* switching) the bit will be set; avoid doing a write
* (via atomic test & set) unless we have to. This is
* safe, because no other CPU ever writes to our bit
* in the mask, and interrupts are off (so we can't
* take a TLB IPI here.) If we don't do this, then
* switching threads will pingpong the cpumask
* cacheline.
*/

On Wed, Jul 31, 2013 at 4:12 PM, Rik van Riel <[email protected]> wrote:
> On 07/31/2013 06:46 PM, Linus Torvalds wrote:
>>
>>
>> On Jul 31, 2013 3:39 PM, "Rik van Riel" <[email protected]> wrote:
>> >
>> > On 07/31/2013 06:21 PM, Linus Torvalds wrote:
>> >>
>> >> Ummm.. The race is to the testing of the bit, not setting. The testing
>> >> of the bit is not valid before we have set the tlb state, AFAIK.
>> >
>> >
>> > I believe the bit is cleared and set by the current CPU.
>>
>> Yes, but we need to be careful with interrupts.
>>
>>
>> > Interrupts are blocked inside switch_mm, so I think we
>> > are safe.
>>
>> Are they? I thought we removed all that.
>
>
> context_switch() shows that the runqueue lock (which is an irq
> lock) is released, and irqs re-enabled, by the next task, after
> switch_to(), in finish_lock_switch(), called from finish_task_switch()
>
>> Note that switch_mm gets called for activate_mm too, or something.
>
>
> Good catch, though it looks like activate_mm is only called from
> exec_mmap, with the new mm as its argument. While the new mm may
> already have pages in memory, it has never been run, so there should
> be nothing in the TLB yet for the new mm.
>
> This is subtler than I thought, but it does appear to be safe.
>
>
> --
> All rights reversed

2013-08-01 00:41:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner <[email protected]> wrote:
> We attached the following explanatory comment to our version of the patch:
>
> /*
> * In the common case (two user threads sharing mm
> * switching) the bit will be set; avoid doing a write
> * (via atomic test & set) unless we have to. This is
> * safe, because no other CPU ever writes to our bit
> * in the mask, and interrupts are off (so we can't
> * take a TLB IPI here.) If we don't do this, then
> * switching threads will pingpong the cpumask
> * cacheline.
> */

So as mentioned, the "interrupts will be off" is actually dubious.
It's true for the context switch case, but not for the activate_mm().

However, as Rik points out, activate_mm() is different in that we
shouldn't have any preexisting MMU state anyway. And besides, that
should never trigger the "prev == next" case.

But it does look a bit messy, and even your comment is a bit
misleading (it might make somebody think that all of switch_mm() is
protected from interrupts).

Anyway, I'm perfectly ok with the patch itself, but I just wanted to
make sure people had thought about these things.

Linus

2013-08-01 01:58:25

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On 07/31/2013 08:41 PM, Linus Torvalds wrote:
> On Wed, Jul 31, 2013 at 4:14 PM, Paul Turner <[email protected]> wrote:
>> We attached the following explanatory comment to our version of the patch:
>>
>> /*
>> * In the common case (two user threads sharing mm
>> * switching) the bit will be set; avoid doing a write
>> * (via atomic test & set) unless we have to. This is
>> * safe, because no other CPU ever writes to our bit
>> * in the mask, and interrupts are off (so we can't
>> * take a TLB IPI here.) If we don't do this, then
>> * switching threads will pingpong the cpumask
>> * cacheline.
>> */
>
> So as mentioned, the "interrupts will be off" is actually dubious.
> It's true for the context switch case, but not for the activate_mm().
>
> However, as Rik points out, activate_mm() is different in that we
> shouldn't have any preexisting MMU state anyway. And besides, that
> should never trigger the "prev == next" case.
>
> But it does look a bit messy, and even your comment is a bit
> misleading (it might make somebody think that all of switch_mm() is
> protected from interrupts).
>
> Anyway, I'm perfectly ok with the patch itself, but I just wanted to
> make sure people had thought about these things.

Would you like me to document the things we found in the comment,
and resend a patch, or is the patch good as-is?

--
All rights reversed

2013-08-01 02:14:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, Jul 31, 2013 at 6:58 PM, Rik van Riel <[email protected]> wrote:
>
> Would you like me to document the things we found in the comment,
> and resend a patch, or is the patch good as-is?

Please re-send with commentary..

Linus

2013-08-01 02:14:33

by Rik van Riel

[permalink] [raw]
Subject: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, 31 Jul 2013 17:41:39 -0700
Linus Torvalds <[email protected]> wrote:

> However, as Rik points out, activate_mm() is different in that we
> shouldn't have any preexisting MMU state anyway. And besides, that
> should never trigger the "prev == next" case.
>
> But it does look a bit messy, and even your comment is a bit
> misleading (it might make somebody think that all of switch_mm() is
> protected from interrupts).

Is this better? Not that I really care which version gets applied :)

---8<---

Subject: sched,x86: optimize switch_mm for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on improvements
to perf, and noticed heavy cache line contention on the mm_cpumask,
running linpack on a 60 core / 120 thread system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

* The __schedule function dropped from 24% of all program cycles down
to 5.5%.
* The cacheline contention/hotness for accesses to that bitmask went
from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
all shared misses which is now quite cold)
* The average load latency for the bit-test-n-set instruction in
__schedule dropped from 10k-15k cycles down to an average of 600 cycles.
* The linpack program results improved from 133 GFlops to 144 GFlops.
Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <[email protected]>
Reported-by: Joe Mario <[email protected]>
Tested-by: Joe Mario <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..3ac6089 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -59,7 +59,13 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

- if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /* On established mms, the mm_cpumask is only changed
+ * from irq context, from ptep_clear_flush while in
+ * lazy tlb mode, and here. Irqs are blocked during
+ * schedule, protecting us from simultaneous changes.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
/* We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables.

2013-08-01 02:25:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel <[email protected]> wrote:
>
> Is this better? Not that I really care which version gets applied :)

I'm ok with this.

That said, I'm assuming this goes in through Ingo, since it is both
x86 and scheduler-related. I can take it directly, but see no real
reason to. Ingo?

Talking about x86-specific - I'm assuming other architectures have
similar patterns?

Linus

2013-08-01 07:04:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH -v2] sched,x86: optimize switch_mm for multi-threaded workloads


* Linus Torvalds <[email protected]> wrote:

> On Wed, Jul 31, 2013 at 7:14 PM, Rik van Riel <[email protected]> wrote:
> >
> > Is this better? Not that I really care which version gets applied :)
>
> I'm ok with this.
>
> That said, I'm assuming this goes in through Ingo, since it is both
> x86 and scheduler-related. I can take it directly, but see no real
> reason to. Ingo?

Yeah, I'm picking it up.

Thanks,

Ingo

2013-08-01 17:09:51

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Wed, 31 July 2013 19:12:30 -0400, Rik van Riel wrote:
> On 07/31/2013 06:46 PM, Linus Torvalds wrote:
> > > On 07/31/2013 06:21 PM, Linus Torvalds wrote:

Looks like two of Linus' mails didn't make it to vger.kernel.org.
They are missing from my inbox and seem to be missing from lkml.org as
well. I only now noticed them because Rik seemingly replied to
himself while quoting Linus.

Message-IDs in question are:
CA+55aFwj+6P4y8MgGTGbiK_EtfY7LJ_bL_7k9zFYLx4S8F0rJQ@mail.gmail.com
CA+55aFxKSEHSkdsCkTvjwwo4MnEpN0TwJrek2jd1QJCyUTb-=Q@mail.gmail.com

I find this disturbing. Many "modern" mailers seem to be broken in
more ways than I care to enumerate. When I tell people, including
otherwise capable programmers, that they use a broken mailer I rarely
get more than a shrug. But if Linus is incapable of sending mail to
the kernel mailing list, we are in trouble indeed. Linus, can you get
this sorted out?

Jörn

--
There are three principal ways to lose money: wine, women, and engineers.
While the first two are more pleasant, the third is by far the more certain.
-- Baron Rothschild

2013-08-01 17:45:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Thu, Aug 1, 2013 at 8:37 AM, Jörn Engel <[email protected]> wrote:
>
> Looks like two of Linus' mails didn't make it to vger.kernel.org.
> They are missing from my inbox and seem to be missing from lkml.org as
> well. I only now noticed them because Rik seemingly replied to
> himself while quoting Linus.

It's normal. I occasionally write quick replies while on my phone, and
the f*cking android gmail app cannot do plain-text mode. There are
open bug tickets about it:

https://code.google.com/p/android/issues/detail?id=8712

and I've talked in private to my google contacts too, but some
ass-wipe inside the android team thinks that html-only mail is the
right thing to do.

The sad part is that there is absolutely no *technical* reason for it,
since the gmail android app actually generates a multipart/alternative
email where the first part is perfectly fine text, properly wrapped
and everything. So the code to generate the email in proper format is
actually there, it's just that there is no setting to say "don't
generate the other parts". It's literally one added setting, and
probably a few lines of code to test it and not generate the html
part.

So there's some internal politics going on, probably one or two
internal people who are just ass-hats and are lying to their managers
about how hard it is to do. Don't ask me why. Some people just want to
watch the world burn.

And no, don't tell me about the other mailers. I've tried many of
them. They tend to be badly integrated or have their own crazy issues.
I'm not willing to read email over imap any more. The one big feature
of gmail is how you can search for things sanely on the server side,
and despite the imap server-side searching technically being a
feature, I've never met any server and MUA that does it reasonably.
Gmail does, but sadly the android app has this one big glaring fault.

Which is sad, because in other ways the android gmail app is actually
quite good these days (that didn't use to be true, but they've fixed
much bigger problems over the years).

Feel free to pester the android people, *especially* if you know
somebody inside the gmail app team. Call them out on the bullshit if
somebody says "that's hard". Sadly, the features they *have* added
seem to be about things big companies asked for, not trivial technical
improvements.

Linus

2013-08-01 19:26:46

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] sched,x86: optimize switch_mm for multi-threaded workloads

On Thu, 1 August 2013 10:45:15 -0700, Linus Torvalds wrote:
> On Thu, Aug 1, 2013 at 8:37 AM, Jörn Engel <[email protected]> wrote:
> >
> > Looks like two of Linus' mails didn't make it to vger.kernel.org.
> > They are missing from my inbox and seem to be missing from lkml.org as
> > well. I only now noticed them because Rik seemingly replied to
> > himself while quoting Linus.
>
> It's normal.

I agree with everything in your email except this bit. Even though
technically it is normal now that all email clients are broken in
impressive ways, your words imply that this might even remotely be an
acceptable state of the world. It is not.

Four-letter words directed at Google (and Apple, and...) seem more
appropriate.

Jörn

--
"Security vulnerabilities are here to stay."
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001

Subject: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads

Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
Author: Rik van Riel <[email protected]>
AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 1 Aug 2013 09:10:26 +0200

sched/x86: Optimize switch_mm() for multi-threaded workloads

Dick Fowles, Don Zickus and Joe Mario have been working on
improvements to perf, and noticed heavy cache line contention
on the mm_cpumask, running linpack on a 60 core / 120 thread
system.

The cause turned out to be unnecessary atomic accesses to the
mm_cpumask. When in lazy TLB mode, the CPU is only removed from
the mm_cpumask if there is a TLB flush event.

Most of the time, no such TLB flush happens, and the kernel
skips the TLB reload. It can also skip the atomic memory
set & test.

Here is a summary of Joe's test results:

* The __schedule function dropped from 24% of all program cycles down
to 5.5%.

* The cacheline contention/hotness for accesses to that bitmask went
from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
all shared misses which is now quite cold)

* The average load latency for the bit-test-n-set instruction in
__schedule dropped from 10k-15k cycles down to an average of 600 cycles.

* The linpack program results improved from 133 GFlops to 144 GFlops.
Peak GFlops rose from 133 to 153.

Reported-by: Don Zickus <[email protected]>
Reported-by: Joe Mario <[email protected]>
Tested-by: Joe Mario <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Reviewed-by: Paul Turner <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Made the comments consistent around the modified code. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/mmu_context.h | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index cdbf367..be12c53 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -45,22 +45,28 @@ static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
/* Re-load page tables */
load_cr3(next->pgd);

- /* stop flush ipis for the previous mm */
+ /* Stop flush ipis for the previous mm */
cpumask_clear_cpu(cpu, mm_cpumask(prev));

- /*
- * load the LDT, if the LDT is different:
- */
+ /* Load the LDT, if the LDT is different: */
if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT_nolock(&next->context);
}
#ifdef CONFIG_SMP
- else {
+ else {
this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

- if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
- /* We were in lazy tlb mode and leave_mm disabled
+ if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
+ /*
+ * On established mms, the mm_cpumask is only changed
+ * from irq context, from ptep_clear_flush() while in
+ * lazy tlb mode, and here. Irqs are blocked during
+ * schedule, protecting us from simultaneous changes.
+ */
+ cpumask_set_cpu(cpu, mm_cpumask(next));
+ /*
+ * We were in lazy tlb mode and leave_mm disabled
* tlb flush IPI delivery. We must reload CR3
* to make sure to use no freed page tables.
*/

2013-08-02 09:12:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads

* tip-bot for Rik van Riel <[email protected]> wrote:

> Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
> Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
> Author: Rik van Riel <[email protected]>
> AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Thu, 1 Aug 2013 09:10:26 +0200
>
> sched/x86: Optimize switch_mm() for multi-threaded workloads
>
> Dick Fowles, Don Zickus and Joe Mario have been working on
> improvements to perf, and noticed heavy cache line contention
> on the mm_cpumask, running linpack on a 60 core / 120 thread
> system.
>
> The cause turned out to be unnecessary atomic accesses to the
> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> the mm_cpumask if there is a TLB flush event.
>
> Most of the time, no such TLB flush happens, and the kernel
> skips the TLB reload. It can also skip the atomic memory
> set & test.
>
> Here is a summary of Joe's test results:
>
> * The __schedule function dropped from 24% of all program cycles down
> to 5.5%.
>
> * The cacheline contention/hotness for accesses to that bitmask went
> from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
> all shared misses which is now quite cold)
>
> * The average load latency for the bit-test-n-set instruction in
> __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
>
> * The linpack program results improved from 133 GFlops to 144 GFlops.
> Peak GFlops rose from 133 to 153.
>
> Reported-by: Don Zickus <[email protected]>
> Reported-by: Joe Mario <[email protected]>
> Tested-by: Joe Mario <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> Reviewed-by: Paul Turner <[email protected]>
> Acked-by: Linus Torvalds <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> [ Made the comments consistent around the modified code. ]
> Signed-off-by: Ingo Molnar <[email protected]>

> + else {
> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>
> - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
> + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> + /*
> + * On established mms, the mm_cpumask is only changed
> + * from irq context, from ptep_clear_flush() while in
> + * lazy tlb mode, and here. Irqs are blocked during
> + * schedule, protecting us from simultaneous changes.
> + */
> + cpumask_set_cpu(cpu, mm_cpumask(next));

Note, I marked this for v3.12 with no -stable backport tag as it's not a
regression fix.

Nevertheless if it's a real issue in production (and +20% of linpack
performance is certainly significant) feel free to forward it to -stable
once this hits Linus's tree in the v3.12 merge window - by that time the
patch will be reasonably well tested and it's a relatively simple change.

Thanks,

Ingo

2013-08-02 12:44:23

by Joe Mario

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads

On 08/02/2013 05:12 AM, Ingo Molnar wrote:
> * tip-bot for Rik van Riel <[email protected]> wrote:
>
>> Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
>> Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
>> Author: Rik van Riel <[email protected]>
>> AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400
>> Committer: Ingo Molnar <[email protected]>
>> CommitDate: Thu, 1 Aug 2013 09:10:26 +0200
>>
>> sched/x86: Optimize switch_mm() for multi-threaded workloads
>>
>> Dick Fowles, Don Zickus and Joe Mario have been working on
>> improvements to perf, and noticed heavy cache line contention
>> on the mm_cpumask, running linpack on a 60 core / 120 thread
>> system.
>>
>> The cause turned out to be unnecessary atomic accesses to the
>> mm_cpumask. When in lazy TLB mode, the CPU is only removed from
>> the mm_cpumask if there is a TLB flush event.
>>
>> Most of the time, no such TLB flush happens, and the kernel
>> skips the TLB reload. It can also skip the atomic memory
>> set & test.
>>
>> Here is a summary of Joe's test results:
>>
>> * The __schedule function dropped from 24% of all program cycles down
>> to 5.5%.
>>
>> * The cacheline contention/hotness for accesses to that bitmask went
>> from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
>> all shared misses which is now quite cold)
>>
>> * The average load latency for the bit-test-n-set instruction in
>> __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
>>
>> * The linpack program results improved from 133 GFlops to 144 GFlops.
>> Peak GFlops rose from 133 to 153.
>>
>> Reported-by: Don Zickus <[email protected]>
>> Reported-by: Joe Mario <[email protected]>
>> Tested-by: Joe Mario <[email protected]>
>> Signed-off-by: Rik van Riel <[email protected]>
>> Reviewed-by: Paul Turner <[email protected]>
>> Acked-by: Linus Torvalds <[email protected]>
>> Link: http://lkml.kernel.org/r/[email protected]
>> [ Made the comments consistent around the modified code. ]
>> Signed-off-by: Ingo Molnar <[email protected]>
>> + else {
>> this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
>> BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
>>
>> - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
>> + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
>> + /*
>> + * On established mms, the mm_cpumask is only changed
>> + * from irq context, from ptep_clear_flush() while in
>> + * lazy tlb mode, and here. Irqs are blocked during
>> + * schedule, protecting us from simultaneous changes.
>> + */
>> + cpumask_set_cpu(cpu, mm_cpumask(next));
> Note, I marked this for v3.12 with no -stable backport tag as it's not a
> regression fix.
>
> Nevertheless if it's a real issue in production (and +20% of linpack
> performance is certainly significant)

The cacheline contention on the mm->cpu_vm_mask_var cpulist bitmask
skyrockets on NUMA systems as soon as the number of threads exceeds the
available cpus - and it's all from the locked bts instruction in
cpumask_test_and_set_cpu().
The OMP version of Linpack uses clone() with the CLONE_VM flag set, so
all the clones are tugging at the same mm->cpu_vm_mask_var.
With Rik's patch, the number of accesses to mm->cpu_vm_mask_var plummets.
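
To illustrate the sharing (a userspace sketch, not linpack itself):
every thread created like this shares one mm_struct, so every context
switch of any of them pokes at the same per-mm cpumask:

#include <pthread.h>
#include <stddef.h>

#define NTHREADS 240    /* comfortably more threads than cpus */

static void *worker(void *arg)
{
        volatile long i;

        (void)arg;
        /* Burn cycles so the workers keep getting scheduled around. */
        for (i = 0; i < 100000000L; i++)
                ;
        return NULL;
}

int main(void)
{
        pthread_t tid[NTHREADS];
        int i;

        /* pthread_create() boils down to clone(CLONE_VM | ...), so all
         * the workers share a single mm and its cpu_vm_mask_var. */
        for (i = 0; i < NTHREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}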

> feel free to forward it to -stable
> once this hits Linus's tree in the v3.12 merge window - by that time the
> patch will be reasonably well tested and it's a relatively simple change.
>
> Thanks,
>
> Ingo

2013-08-03 01:26:21

by Greg KH

[permalink] [raw]
Subject: Re: [tip:sched/core] sched/x86: Optimize switch_mm() for multi-threaded workloads

On Fri, Aug 02, 2013 at 11:12:47AM +0200, Ingo Molnar wrote:
> * tip-bot for Rik van Riel <[email protected]> wrote:
>
> > Commit-ID: 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
> > Gitweb: http://git.kernel.org/tip/8f898fbbe5ee5e20a77c4074472a1fd088dc47d1
> > Author: Rik van Riel <[email protected]>
> > AuthorDate: Wed, 31 Jul 2013 22:14:21 -0400
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Thu, 1 Aug 2013 09:10:26 +0200
> >
> > sched/x86: Optimize switch_mm() for multi-threaded workloads
> >
> > Dick Fowles, Don Zickus and Joe Mario have been working on
> > improvements to perf, and noticed heavy cache line contention
> > on the mm_cpumask, running linpack on a 60 core / 120 thread
> > system.
> >
> > The cause turned out to be unnecessary atomic accesses to the
> > mm_cpumask. When in lazy TLB mode, the CPU is only removed from
> > the mm_cpumask if there is a TLB flush event.
> >
> > Most of the time, no such TLB flush happens, and the kernel
> > skips the TLB reload. It can also skip the atomic memory
> > set & test.
> >
> > Here is a summary of Joe's test results:
> >
> > * The __schedule function dropped from 24% of all program cycles down
> > to 5.5%.
> >
> > * The cacheline contention/hotness for accesses to that bitmask went
> > from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
> > all shared misses which is now quite cold)
> >
> > * The average load latency for the bit-test-n-set instruction in
> > __schedule dropped from 10k-15k cycles down to an average of 600 cycles.
> >
> > * The linpack program results improved from 133 GFlops to 144 GFlops.
> > Peak GFlops rose from 133 to 153.
> >
> > Reported-by: Don Zickus <[email protected]>
> > Reported-by: Joe Mario <[email protected]>
> > Tested-by: Joe Mario <[email protected]>
> > Signed-off-by: Rik van Riel <[email protected]>
> > Reviewed-by: Paul Turner <[email protected]>
> > Acked-by: Linus Torvalds <[email protected]>
> > Link: http://lkml.kernel.org/r/[email protected]
> > [ Made the comments consistent around the modified code. ]
> > Signed-off-by: Ingo Molnar <[email protected]>
>
> > + else {
> > this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
> > BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);
> >
> > - if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
> > + if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
> > + /*
> > + * On established mms, the mm_cpumask is only changed
> > + * from irq context, from ptep_clear_flush() while in
> > + * lazy tlb mode, and here. Irqs are blocked during
> > + * schedule, protecting us from simultaneous changes.
> > + */
> > + cpumask_set_cpu(cpu, mm_cpumask(next));
>
> Note, I marked this for v3.12 with no -stable backport tag as it's not a
> regression fix.
>
> Nevertheless if it's a real issue in production (and +20% of linpack
> performance is certainly significant) feel free to forward it to -stable
> once this hits Linus's tree in the v3.12 merge window - by that time the
> patch will be reasonably well tested and it's a relatively simple change.

I'll watch for this as well and try to remember to pick it up for
-stable once it hits Linus's tree, as those type of benchmark
improvements are good to have in stable releases.

thanks,

greg k-h