2010-11-03 06:44:43

by Shaohua Li

Subject: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

This one isn't related to the previous patch. If the number of online
cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
comment in the code says the check is probably not worth it for a
cache-hot lock, but even a hot lock still needs an atomic operation and
is expensive, so add the check here.

Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/mm/tlb.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/tlb.c
===================================================================
--- linux.orig/arch/x86/mm/tlb.c 2010-11-02 10:31:51.000000000 +0800
+++ linux/arch/x86/mm/tlb.c 2010-11-02 14:53:27.000000000 +0800
@@ -174,17 +174,16 @@ static void flush_tlb_others_ipi(const s
{
unsigned int sender;
union smp_flush_state *f;
+ bool do_lock = false;

/* Caller has disabled preemption */
sender = this_cpu_read(tlb_vector_offset);
f = &flush_state[sender];

- /*
- * Could avoid this lock when
- * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
- * probably not worth checking this for a cache-hot lock.
- */
- raw_spin_lock(&f->tlbstate_lock);
+ if (num_online_cpus() > NUM_INVALIDATE_TLB_VECTORS) {
+ do_lock = true;
+ raw_spin_lock(&f->tlbstate_lock);
+ }

f->flush_mm = mm;
f->flush_va = va;
@@ -202,7 +201,8 @@ static void flush_tlb_others_ipi(const s

f->flush_mm = NULL;
f->flush_va = 0;
- raw_spin_unlock(&f->tlbstate_lock);
+ if (do_lock)
+ raw_spin_unlock(&f->tlbstate_lock);
}

void native_flush_tlb_others(const struct cpumask *cpumask,


2010-11-03 06:59:48

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wednesday 03 November 2010 at 14:44 +0800, Shaohua Li wrote:
> This one isn't related to the previous patch. If the number of online
> cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
> comment in the code says the check is probably not worth it for a
> cache-hot lock, but even a hot lock still needs an atomic operation and
> is expensive, so add the check here.
>
> Signed-off-by: Shaohua Li <[email protected]>
> ---
> arch/x86/mm/tlb.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> Index: linux/arch/x86/mm/tlb.c
> ===================================================================
> --- linux.orig/arch/x86/mm/tlb.c 2010-11-02 10:31:51.000000000 +0800
> +++ linux/arch/x86/mm/tlb.c 2010-11-02 14:53:27.000000000 +0800
> @@ -174,17 +174,16 @@ static void flush_tlb_others_ipi(const s
> {
> unsigned int sender;
> union smp_flush_state *f;
> + bool do_lock = false;
>
> /* Caller has disabled preemption */
> sender = this_cpu_read(tlb_vector_offset);
> f = &flush_state[sender];
>
> - /*
> - * Could avoid this lock when
> - * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
> - * probably not worth checking this for a cache-hot lock.
> - */
> - raw_spin_lock(&f->tlbstate_lock);
> + if (num_online_cpus() > NUM_INVALIDATE_TLB_VECTORS) {

Ouch, you remove a comment that pretty well explained the problem.

Last time I checked, num_online_cpus() was pretty expensive on a
4096-cpu machine, since a 4096-bit array is 512 bytes long.

Are you sure you didn't want to use nr_cpu_ids here?
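For scale: num_online_cpus() is essentially a population count over the
whole online-cpu bitmap, while nr_cpu_ids is a plain integer read. A
minimal userspace sketch of the cost difference (illustrative only, not
the kernel's implementation):

#include <stdio.h>

#define NR_CPUS 4096
#define NR_LONGS (NR_CPUS / (8 * sizeof(unsigned long)))

/* 512 bytes on 64-bit, i.e. eight 64-byte cache lines. */
static unsigned long online_bits[NR_LONGS];

/* Roughly what num_online_cpus() pays: a popcount walk over
 * the entire bitmap every time it is called. */
static int count_online(void)
{
	size_t i;
	int n = 0;

	for (i = 0; i < NR_LONGS; i++)
		n += __builtin_popcountl(online_bits[i]);
	return n;
}

int main(void)
{
	online_bits[0] = 0xffUL;	/* pretend cpus 0-7 are online */
	printf("online cpus: %d\n", count_online());
	return 0;
}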

> + do_lock = true;
> + raw_spin_lock(&f->tlbstate_lock);
> + }
>
> f->flush_mm = mm;
> f->flush_va = va;
> @@ -202,7 +201,8 @@ static void flush_tlb_others_ipi(const s
>
> f->flush_mm = NULL;
> f->flush_va = 0;
> - raw_spin_unlock(&f->tlbstate_lock);
> + if (do_lock)
> + raw_spin_unlock(&f->tlbstate_lock);
> }
>
> void native_flush_tlb_others(const struct cpumask *cpumask,
>

2010-11-03 07:06:42

by Shaohua Li

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wed, 2010-11-03 at 14:59 +0800, Eric Dumazet wrote:
> On Wednesday 03 November 2010 at 14:44 +0800, Shaohua Li wrote:
> > This one isn't related to the previous patch. If the number of online
> > cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
> > comment in the code says the check is probably not worth it for a
> > cache-hot lock, but even a hot lock still needs an atomic operation and
> > is expensive, so add the check here.
> >
> > Signed-off-by: Shaohua Li <[email protected]>
> > ---
> > arch/x86/mm/tlb.c | 14 +++++++-------
> > 1 file changed, 7 insertions(+), 7 deletions(-)
> >
> > Index: linux/arch/x86/mm/tlb.c
> > ===================================================================
> > --- linux.orig/arch/x86/mm/tlb.c 2010-11-02 10:31:51.000000000 +0800
> > +++ linux/arch/x86/mm/tlb.c 2010-11-02 14:53:27.000000000 +0800
> > @@ -174,17 +174,16 @@ static void flush_tlb_others_ipi(const s
> > {
> > unsigned int sender;
> > union smp_flush_state *f;
> > + bool do_lock = false;
> >
> > /* Caller has disabled preemption */
> > sender = this_cpu_read(tlb_vector_offset);
> > f = &flush_state[sender];
> >
> > - /*
> > - * Could avoid this lock when
> > - * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
> > - * probably not worth checking this for a cache-hot lock.
> > - */
> > - raw_spin_lock(&f->tlbstate_lock);
> > + if (num_online_cpus() > NUM_INVALIDATE_TLB_VECTORS) {
>
> Ouch, you remove a comment that pretty well explained the problem.
>
> Last time I checked, num_online_cpus() was pretty expensive on a
> 4096-cpu machine, since a 4096-bit array is 512 bytes long.
ok
> Are you sure you didn't want to use nr_cpu_ids here?
I just don't want to include the non-present cpus here. I wonder why we
don't have a variable recording the number of online cpus.

Thanks,
Shaohua

2010-11-03 07:12:19

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wednesday 03 November 2010 at 15:06 +0800, Shaohua Li wrote:
> I just don't want to include the non-present cpus here. I wonder why we
> don't have a variable recording the number of online cpus.

What prevents a 256-cpu machine from having 8 online cpus that all use
the same TLB vector?

(Max 32 vectors, so 8 cpus share each vector, settled at boot time)

Forget about 'online', and think 'possible' ;)


2010-11-03 07:19:59

by Shaohua Li

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wed, 2010-11-03 at 15:12 +0800, Eric Dumazet wrote:
> On Wednesday 03 November 2010 at 15:06 +0800, Shaohua Li wrote:
> > I just don't want to include the non-present cpus here. I wonder why we
> > don't have a variable recording the number of online cpus.
>
> What prevents a 256-cpu machine from having 8 online cpus that all use
> the same TLB vector?
>
> (Max 32 vectors, so 8 cpus share each vector, settled at boot time)
>
> Forget about 'online', and think 'possible' ;)
Hmm, the patch that spreads vectors across nodes is already merged; how
could the 8 cpus share a vector?

2010-11-03 07:25:30

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wednesday 03 November 2010 at 15:19 +0800, Shaohua Li wrote:
> On Wed, 2010-11-03 at 15:12 +0800, Eric Dumazet wrote:
> > On Wednesday 03 November 2010 at 15:06 +0800, Shaohua Li wrote:
> > > I just don't want to include the non-present cpus here. I wonder why we
> > > don't have a variable recording the number of online cpus.
> >
> > What prevents a 256-cpu machine from having 8 online cpus that all use
> > the same TLB vector?
> >
> > (Max 32 vectors, so 8 cpus share each vector, settled at boot time)
> >
> > Forget about 'online', and think 'possible' ;)
> Hmm, the patch that spreads vectors across nodes is already merged; how
> could the 8 cpus share a vector?
>

You boot a machine with 256 cpus.

They are all online and working fine.

Each vector is shared by at least 8 cpus, because 256/32 = 8. OK?

Now you offline 248 (256-8) cpus, because you have HOTPLUG capability in
your kernel and you have some policy to bring them up later if needed.

What happens? Do you rebalance TLB vectors to make sure each cpu has
its own vector?
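To make the sharing concrete, here is a small standalone sketch, assuming
a naive boot-time mapping of cpu N to vector N % 32 (the real mapping is
computed in calculate_tlb_offset() and may differ):

#include <stdio.h>

#define NUM_INVALIDATE_TLB_VECTORS 32

int main(void)
{
	/* Assumed boot-time mapping on a 256-cpu box: cpu N uses
	 * vector N % 32, so each vector starts out shared by 8 cpus. */
	int online[] = { 0, 32, 64, 96, 128, 160, 192, 224 };
	int i;

	/* Offline everything but these 8 cpus: without rebalancing
	 * they still all collide on vector 0, even though only 8
	 * cpus remain online. */
	for (i = 0; i < 8; i++)
		printf("cpu %3d -> vector %d\n", online[i],
		       online[i] % NUM_INVALIDATE_TLB_VECTORS);
	return 0;
}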


2010-11-03 07:31:34

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wednesday 03 November 2010 at 08:25 +0100, Eric Dumazet wrote:
> On Wednesday 03 November 2010 at 15:19 +0800, Shaohua Li wrote:
> > On Wed, 2010-11-03 at 15:12 +0800, Eric Dumazet wrote:
> > > On Wednesday 03 November 2010 at 15:06 +0800, Shaohua Li wrote:
> > > > I just don't want to include the non-present cpus here. I wonder why we
> > > > don't have a variable recording the number of online cpus.
> > >
> > > What prevents a 256-cpu machine from having 8 online cpus that all use
> > > the same TLB vector?
> > >
> > > (Max 32 vectors, so 8 cpus share each vector, settled at boot time)
> > >
> > > Forget about 'online', and think 'possible' ;)
> > Hmm, the patch that spreads vectors across nodes is already merged; how
> > could the 8 cpus share a vector?
> >
>
> You boot a machine with 256 cpus.
>
> They are all online and working fine.
>
> Each vector is shared by at least 8 cpus, because 256/32 = 8. OK?
>
> Now you offline 248 (256-8) cpus, because you have HOTPLUG capability in
> your kernel and you have some policy to bring them up later if needed.
>
> What happens? Do you rebalance TLB vectors to make sure each cpu has
> its own vector?
>
>

It seems you have done that since commit 932967202182743, in
calculate_tlb_offset().

So just add some logic in that function to tell whether each TLB vector
is used by one cpu or by several.

We can avoid the lock for each TLB vector used by exactly one cpu.
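A rough sketch of that idea, meant to sit in arch/x86/mm/tlb.c next to
calculate_tlb_offset(); the tlb_vector_shared array and the helper name
are hypothetical, not existing kernel code:

/* Hypothetical: true when more than one online cpu maps to the
 * vector. Recomputed wherever calculate_tlb_offset() rebalances. */
static bool tlb_vector_shared[NUM_INVALIDATE_TLB_VECTORS];

static void update_tlb_vector_sharing(void)
{
	int users[NUM_INVALIDATE_TLB_VECTORS] = { 0 };
	int cpu, i;

	for_each_online_cpu(cpu)
		users[per_cpu(tlb_vector_offset, cpu)]++;
	for (i = 0; i < NUM_INVALIDATE_TLB_VECTORS; i++)
		tlb_vector_shared[i] = users[i] > 1;
}

flush_tlb_others_ipi() could then test tlb_vector_shared[sender] instead
of comparing cpu counts, taking the lock only for vectors that are
actually shared.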


2010-11-03 09:08:11

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wednesday 03 November 2010 at 16:41 +0800, Shaohua Li wrote:

> Yes, this is OK. We might need to avoid some cpu hotplug races too.
> I'll post a new patch later.
>
>

Hmm, maybe only set the variable "must take the lock", never unset it.
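That is, make the flag one-way: a cpu coming online can only turn
locking on, never off, so the flag cannot flip underneath a flush
already in flight. A minimal sketch, with a hypothetical flag name:

/* Hypothetical one-way flag: set the first time enough cpus are
 * online to share vectors, never cleared again. */
static bool tlbstate_lock_needed;

static void tlb_note_cpu_online(void)
{
	if (num_online_cpus() > NUM_INVALIDATE_TLB_VECTORS)
		tlbstate_lock_needed = true;	/* set only, never unset */
}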


2010-11-04 05:21:23

by Shaohua Li

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Wed, 2010-11-03 at 17:08 +0800, Eric Dumazet wrote:
> On Wednesday 03 November 2010 at 16:41 +0800, Shaohua Li wrote:
>
> > Yes, this is OK. We might need to avoid some cpu hotplug races too.
> > I'll post a new patch later.
> >
> >
>
> Hmm, maybe only set the variable "must take the lock", never unset it.
I followed your suggestion to use nr_cpu_ids; it should be good enough.

Thanks,
Shaohua


This one isn't related to the previous patch. If the number of possible
cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
comment in the code says the check is probably not worth it for a
cache-hot lock, but even a hot lock still needs an atomic operation and
is expensive, so add the check here.

Use nr_cpu_ids here, as suggested by Eric Dumazet.

Signed-off-by: Shaohua Li <[email protected]>
---
arch/x86/mm/tlb.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/tlb.c
===================================================================
--- linux.orig/arch/x86/mm/tlb.c 2010-11-04 10:59:09.000000000 +0800
+++ linux/arch/x86/mm/tlb.c 2010-11-04 13:17:51.000000000 +0800
@@ -179,12 +179,8 @@ static void flush_tlb_others_ipi(const s
sender = this_cpu_read(tlb_vector_offset);
f = &flush_state[sender];

- /*
- * Could avoid this lock when
- * num_online_cpus() <= NUM_INVALIDATE_TLB_VECTORS, but it is
- * probably not worth checking this for a cache-hot lock.
- */
- raw_spin_lock(&f->tlbstate_lock);
+ if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+ raw_spin_lock(&f->tlbstate_lock);

f->flush_mm = mm;
f->flush_va = va;
@@ -202,7 +198,8 @@ static void flush_tlb_others_ipi(const s

f->flush_mm = NULL;
f->flush_va = 0;
- raw_spin_unlock(&f->tlbstate_lock);
+ if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
+ raw_spin_unlock(&f->tlbstate_lock);
}

void native_flush_tlb_others(const struct cpumask *cpumask,

2010-11-04 05:43:02

by Eric Dumazet

Subject: Re: [RFC 4/4]x86: avoid tlbstate lock if not enough cpus

On Thursday 04 November 2010 at 13:21 +0800, Shaohua Li wrote:
> On Wed, 2010-11-03 at 17:08 +0800, Eric Dumazet wrote:
> > On Wednesday 03 November 2010 at 16:41 +0800, Shaohua Li wrote:
> >
> > > Yes, this is OK. We might need to avoid some cpu hotplug races too.
> > > I'll post a new patch later.
> > >
> > >
> >
> > Hmm, maybe only set the variable "must take the lock", never unset it.
> I followed your suggestion to use nr_cpu_ids; it should be good enough.
>

Yes, unfortunately not on HP machines, because of their tendency to have
holes in CPU numbering :)

This can probably be improved later.
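nr_cpu_ids is one more than the highest possible cpu id, not a count of
cpus, so a sparse numbering overstates the cpu count and the check errs
on the safe (locking) side. An illustrative standalone sketch, with
invented numbers:

#include <stdio.h>

#define NUM_INVALIDATE_TLB_VECTORS 32

int main(void)
{
	/* Invented sparse numbering: only 8 possible cpus, but the
	 * firmware scattered their ids over 0-3 and 60-63. */
	int possible[] = { 0, 1, 2, 3, 60, 61, 62, 63 };
	int n = sizeof(possible) / sizeof(possible[0]);
	int nr_cpu_ids = possible[n - 1] + 1;	/* 64 = highest id + 1 */

	/* 8 cpus can never contend on 32 vectors, yet 64 > 32 means
	 * the lock is still taken: safe, just pessimistic. */
	printf("possible cpus: %d, nr_cpu_ids: %d, lock taken: %s\n",
	       n, nr_cpu_ids,
	       nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS ? "yes" : "no");
	return 0;
}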

Thanks

> Thanks,
> Shaohua
>
>
> This one isn't related to the previous patch. If the number of possible
> cpus is below NUM_INVALIDATE_TLB_VECTORS, we don't need the lock. The
> comment in the code says the check is probably not worth it for a
> cache-hot lock, but even a hot lock still needs an atomic operation and
> is expensive, so add the check here.
>
> Use nr_cpu_ids here, as suggested by Eric Dumazet.
>
> Signed-off-by: Shaohua Li <[email protected]>
> ---

Acked-by: Eric Dumazet <[email protected]>