Hi,
On x86_64 flush tlb data is stored in per_cpu variables. This is
unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
are accessed.
This patch aims at making the code less confusing (there's nothing
really "per_cpu") by using a plain array. It also would save some memory
on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
Regards,
Frederik
Signed-off-by: Frederik Deweerdt <[email protected]>
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index f8be6f1..c177a1f 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -33,7 +33,7 @@
* To avoid global state use 8 different call vectors.
* Each CPU uses a specific vector to trigger flushes on other
* CPUs. Depending on the received vector the target CPUs look into
- * the right per cpu variable for the flush data.
+ * the right array slot for the flush data.
*
* With more than 8 CPUs they are hashed to the 8 available
* vectors. The limited global vector space forces us to this right now.
@@ -54,7 +54,7 @@ union smp_flush_state {
/* State is put into the per CPU data section, but padded
to a full cache line because other CPUs can access it and we don't
want false sharing in the per cpu data segment. */
-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
/*
* We cannot call mmdrop() because we are in interrupt context,
@@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
* Use that to determine where the sender put the data.
*/
sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
- f = &per_cpu(flush_state, sender);
+ f = &flush_state[sender];
if (!cpu_isset(cpu, f->flush_cpumask))
goto out;
@@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
/* Caller has disabled preemption */
sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
- f = &per_cpu(flush_state, sender);
+ f = &flush_state[sender];
/*
* Could avoid this lock when
@@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
{
int i;
- for_each_possible_cpu(i)
- spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
+ for (i = 0; i < ARRAY_SIZE(flush_state); i++)
+ spin_lock_init(&flush_state[i].tlbstate_lock);
return 0;
}
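For readers without the file at hand: the union this patch indexes looks
roughly like the sketch below before the change. It is reconstructed from
the hunks above and from the commit further down the thread, so the exact
field order and the flush_mm member are assumptions, not a quote of the
source.

union smp_flush_state {
	struct {
		cpumask_t flush_cpumask;	/* CPUs that still have to ack this flush */
		struct mm_struct *flush_mm;	/* mm being flushed (assumed field) */
		unsigned long flush_va;		/* single address, or a flush-all marker */
		spinlock_t tlbstate_lock;	/* serializes senders sharing this slot */
	};
	char pad[SMP_CACHE_BYTES];		/* pad the whole union to one cache line */
} ____cacheline_aligned;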
On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> Hi,
>
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
Nope it doesn't save memory on most systems because per cpu is only allocated
based on the CPUs that are actually there. And if you have more than 8
cores you can likely afford a few bytes per CPU.
You would need to cache line pad each entry then, otherwise you risk
false sharing. That would make the array 1K on a 128-byte cache line
system. This means on small systems this would actually waste
much more memory.
per cpu avoids that problem completely.
-Andi
* Frederik Deweerdt <[email protected]> wrote:
> Hi,
>
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
>
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
indeed, well spotted!
This construct was indeed an abuse of the per_cpu facilities.
Note that beyond the obvious memory savings, your patch should make the
code a bit faster and a bit smaller as well in the SMP TLB flush codepath:
the smp_flush_state data structure is 64 (or 128) bytes, so static array
arithmetic will be faster than the per_cpu indirection.
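To make the comparison concrete, here is the before/after lookup again
with the address computation spelled out in comments. This is a sketch of
the work involved, not an exact instruction count, since the real per_cpu()
expansion is arch-specific:

	/* before: per_cpu(flush_state, sender) -- find CPU 'sender's per-CPU
	 * base (an extra load through the per-CPU offset lookup on x86-64),
	 * then add flush_state's offset within the per-CPU section */
	f = &per_cpu(flush_state, sender);

	/* after: flush_state[sender] -- a single base + sender * sizeof()
	 * computation against a static array */
	f = &flush_state[sender];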
I have applied your patch to tip/x86/mm, thanks Frederik!
Ingo
On Mon, Jan 12, 2009 at 10:57:02PM +0100, Andi Kleen wrote:
> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > Hi,
> >
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> > This patch aims at making the code less confusing (there's nothing
> > really "per_cpu") by using a plain array. It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>
> Nope it doesn't save memory on most systems because per cpu is only allocated
> based on the CPUs that are actually there. And if you have more than 8
> cores you can likely afford a few bytes per CPU.
I did not understand that, thanks for clarifying
>
> You would need to cache line pad each entry then, otherwise you risk
> false sharing. That would make the array 1K on 128 bytes cache line
> system. This means on small systems this would actually waste
> much more memory.
>
> per cpu avoids that problem completely.
It is also slower (or so percpu.h says), and confusing I'd say.
Regards,
Frederik
On Mon, Jan 12, 2009 at 11:10:23PM +0100, Frederik Deweerdt wrote:
> On Mon, Jan 12, 2009 at 10:57:02PM +0100, Andi Kleen wrote:
> > On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > > Hi,
> > >
> > > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > > are accessed.
> > > This patch aims at making the code less confusing (there's nothing
> > > really "per_cpu") by using a plain array. It also would save some memory
> > > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> >
> > Nope it doesn't save memory on most systems because per cpu is only allocated
> > based on the CPUs that are actually there. And if you have more than 8
> > cores you can likely afford a few bytes per CPU.
> I did not understand that, thanks for clarifying
> >
> > You would need to cache line pad each entry then, otherwise you risk
> > false sharing. That would make the array 1K on 128 bytes cache line
> > system. This means on small systems this would actually waste
> > much more memory.
> >
> > per cpu avoids that problem completely.
> It is also slower (or so percpu.h says), and confusing I'd say.
Well it's something like 3 instructions versus one. You would
have a hard time benchmarking it unless you run it in a very tight
loop. It will be lost in the noise compared to all the other costs
of the IPI.
Also, the reason I don't like this patch is that on the typical small
single/dual core system running a 128-byte cache line distro kernel you
always pay the 1K cost now, while with per cpu it only needed one/two entries.
Admittedly it could have been better commented.
Not that it matters now; unfortunately it's already applied. Sometimes I
wonder why I still bother to do patch review...
-Andi
--
[email protected] -- Speaking for myself only.
Hi Frederik,
On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
>
>Signed-off-by: Frederik Deweerdt <[email protected]>
>
>diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
>index f8be6f1..c177a1f 100644
>--- a/arch/x86/kernel/tlb_64.c
>+++ b/arch/x86/kernel/tlb_64.c
>@@ -33,7 +33,7 @@
> * To avoid global state use 8 different call vectors.
> * Each CPU uses a specific vector to trigger flushes on other
> * CPUs. Depending on the received vector the target CPUs look into
>- * the right per cpu variable for the flush data.
>+ * the right array slot for the flush data.
> *
> * With more than 8 CPUs they are hashed to the 8 available
> * vectors. The limited global vector space forces us to this right now.
>@@ -54,7 +54,7 @@ union smp_flush_state {
> /* State is put into the per CPU data section, but padded
> to a full cache line because other CPUs can access it and we don't
> want false sharing in the per cpu data segment. */
>-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
>+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
Since flush_state has a spinlock, an array of flush_states would end up
having multiple spinlocks on the same L1 cacheline. However, I see that
the union smp_flush_state is already padded to SMP_CACHE_BYTES.
With per-cpu areas, locks belonging to different array elements did
not end up on the same internode cacheline due to per-node allocation of
per-cpu areas, but with a simple array, two locks could end up on the same
internode cacheline.
Hence, can you please change the padding in smp_flush_state to
INTERNODE_CACHE_BYTES (you have to derive this off INTERNODE_CACHE_SHIFT)
and align it using ____cacheline_internodealigned_in_smp?
* Andi Kleen <[email protected]> wrote:
> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> > Hi,
> >
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> >
> > This patch aims at making the code less confusing (there's nothing
> > really "per_cpu") by using a plain array. It also would save some
> > memory on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by
> > default).
>
> Nope it doesn't save memory on most systems because per cpu is only
> allocated based on the CPUs that are actually there. And if you have
> more than 8 cores you can likely afford a few bytes per CPU.
No distro kernel will build with less than 8 CPUs anyway so this point is
moot.
> You would need to cache line pad each entry then, otherwise you risk
> false sharing. [...]
They are already cache line padded.
> [...] That would make the array 1K on 128 bytes cache line system.
512 bytes.
> [...] This means on small systems this would actually waste much more
> memory.
Really small systems will be UP and wont do cross-CPU TLB flushes, so if
they are a worry the flush code can be UP optimized. (Nobody bothered so
far.)
Ingo
* Andi Kleen <[email protected]> wrote:
> > It is also slower (or so percpu.h says), and confusing I'd say.
>
> Well it's something like 3 instructions versus one. [...]
That's enough - micro-optimizations are done like that, instruction by
instruction.
> [...] You would have a hard time benchmarking it unless you run it in a
> very tight loop. It will be lost in the noise compared to all the other
> costs of the IPI.
>
> Also why i don't like this patch is that on the typical small
> single/dual core system running a 128 byte cache line distro kernel you
> always pay the 1K cost now, while with per cpu it only needed one/two
> entries.
4 or 8 cores are the norm these days - by the time this change hits real
Linux computers en masse 8 cores will be quite common.
> Admittedly it could have been better commented.
>
> Not that it matters now unfortunately it's already applied. Sometimes
> wonder why I still bother to do patch review...
Whether patches are already applied or not has no relevance - patches can
still be undone or reverted of course, should your review feedback be
correct.
Ingo
Frederik Deweerdt wrote:
> Hi,
>
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>
> Regards,
> Frederik
>
> Signed-off-by: Frederik Deweerdt <[email protected]>
Here is the net change in memory usage with this patch on an allyesconfig
with NR_CPUS=4096.
====== Data (-l 500)
1 - initial
2 - change-flush-tlb
.1. .2. .delta.
0 5120 +5120 . flush_state(.bss)
====== Sections (-l 500)
.1. .2. .delta.
12685496 12693688 +8192 +0.06% .bss
1910176 1909408 -768 -0.04% .data.percpu
>
> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> index f8be6f1..c177a1f 100644
> --- a/arch/x86/kernel/tlb_64.c
> +++ b/arch/x86/kernel/tlb_64.c
> @@ -33,7 +33,7 @@
> * To avoid global state use 8 different call vectors.
> * Each CPU uses a specific vector to trigger flushes on other
> * CPUs. Depending on the received vector the target CPUs look into
> - * the right per cpu variable for the flush data.
> + * the right array slot for the flush data.
> *
> * With more than 8 CPUs they are hashed to the 8 available
> * vectors. The limited global vector space forces us to this right now.
> @@ -54,7 +54,7 @@ union smp_flush_state {
> /* State is put into the per CPU data section, but padded
> to a full cache line because other CPUs can access it and we don't
> want false sharing in the per cpu data segment. */
> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>
> /*
> * We cannot call mmdrop() because we are in interrupt context,
> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
> * Use that to determine where the sender put the data.
> */
> sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> - f = &per_cpu(flush_state, sender);
> + f = &flush_state[sender];
>
> if (!cpu_isset(cpu, f->flush_cpumask))
> goto out;
> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>
> /* Caller has disabled preemption */
> sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> - f = &per_cpu(flush_state, sender);
> + f = &flush_state[sender];
>
> /*
> * Could avoid this lock when
> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
> {
> int i;
>
> - for_each_possible_cpu(i)
> - spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> + for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> + spin_lock_init(&flush_state[i].tlbstate_lock);
>
> return 0;
> }
Ingo Molnar wrote:
>> Nope it doesn't save memory on most systems because per cpu is only
>> allocated based on the CPUs that are actually there. And if you have
>> more than 8 cores you can likely afford a few bytes per CPU.
>
> No distro kernel will build with less than 8 CPUs anyway so this point is
> moot.
Except for embedded, which will build for exactly what's in the box.
-hpa
* Ravikiran G Thirumalai <[email protected]> wrote:
> Hi Frederik,
>
> On Mon, Jan 12, 2009 at 10:35:42PM +0100, Frederik Deweerdt wrote:
> >
> >Signed-off-by: Frederik Deweerdt <[email protected]>
> >
> >diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> >index f8be6f1..c177a1f 100644
> >--- a/arch/x86/kernel/tlb_64.c
> >+++ b/arch/x86/kernel/tlb_64.c
> >@@ -33,7 +33,7 @@
> > * To avoid global state use 8 different call vectors.
> > * Each CPU uses a specific vector to trigger flushes on other
> > * CPUs. Depending on the received vector the target CPUs look into
> >- * the right per cpu variable for the flush data.
> >+ * the right array slot for the flush data.
> > *
> > * With more than 8 CPUs they are hashed to the 8 available
> > * vectors. The limited global vector space forces us to this right now.
> >@@ -54,7 +54,7 @@ union smp_flush_state {
> > /* State is put into the per CPU data section, but padded
> > to a full cache line because other CPUs can access it and we don't
> > want false sharing in the per cpu data segment. */
> >-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> >+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>
> Since flush_state has a spinlock, an array of flush_states would end up
> having multiple spinlocks on the same L1 cacheline. However, I see
> that the union smp_flush_state is already padded to SMP_CACHE_BYTES.
Correct.
> With per-cpu areas, locks belonging to different array elements did not
> end up on the same internode cacheline due to per-node allocation of
> per-cpu areas, but with a simple array, two locks could end up on the
> same internode cacheline.
>
> Hence, can you please change the padding in smp_flush_state to
> INTERNODE_CACHE_BYTES (you have to derive this off
> INTERNODE_CACHE_SHIFT) and align it using
> ____cacheline_internodealigned_in_smp?
Yes, that matters to vsmp, which has ... 4K node cache-lines.
Note that there's already CONFIG_X86_INTERNODE_CACHE_BYTES which is set
properly on vsmp.
So ... something like the commit below?
Ingo
--------------->
From 23d9dc8bffc759c131b09a48b5215cc2b37a5ac3 Mon Sep 17 00:00:00 2001
From: Frederik Deweerdt <[email protected]>
Date: Mon, 12 Jan 2009 22:35:42 +0100
Subject: [PATCH] x86, tlb flush_data: replace per_cpu with an array
Impact: micro-optimization, memory reduction
On x86_64 flush tlb data is stored in per_cpu variables. This is
unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
are accessed.
This patch aims at making the code less confusing (there's nothing
really "per_cpu") by using a plain array. It also would save some memory
on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
[ Ravikiran G Thirumalai also pointed out that the correct alignment
is ____cacheline_internodealigned_in_smp, so that there's no
bouncing on vsmp. ]
Signed-off-by: Frederik Deweerdt <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/tlb_64.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
index f8be6f1..c5a6c6f 100644
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -33,7 +33,7 @@
* To avoid global state use 8 different call vectors.
* Each CPU uses a specific vector to trigger flushes on other
* CPUs. Depending on the received vector the target CPUs look into
- * the right per cpu variable for the flush data.
+ * the right array slot for the flush data.
*
* With more than 8 CPUs they are hashed to the 8 available
* vectors. The limited global vector space forces us to this right now.
@@ -48,13 +48,13 @@ union smp_flush_state {
unsigned long flush_va;
spinlock_t tlbstate_lock;
};
- char pad[SMP_CACHE_BYTES];
-} ____cacheline_aligned;
+ char pad[X86_INTERNODE_CACHE_BYTES];
+} ____cacheline_internodealigned_in_smp;
/* State is put into the per CPU data section, but padded
to a full cache line because other CPUs can access it and we don't
want false sharing in the per cpu data segment. */
-static DEFINE_PER_CPU(union smp_flush_state, flush_state);
+static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
/*
* We cannot call mmdrop() because we are in interrupt context,
@@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
* Use that to determine where the sender put the data.
*/
sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
- f = &per_cpu(flush_state, sender);
+ f = &flush_state[sender];
if (!cpu_isset(cpu, f->flush_cpumask))
goto out;
@@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
/* Caller has disabled preemption */
sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
- f = &per_cpu(flush_state, sender);
+ f = &flush_state[sender];
/*
* Could avoid this lock when
@@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
{
int i;
- for_each_possible_cpu(i)
- spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
+ for (i = 0; i < ARRAY_SIZE(flush_state); i++)
+ spin_lock_init(&flush_state[i].tlbstate_lock);
return 0;
}
* Ingo Molnar <[email protected]> wrote:
> + char pad[X86_INTERNODE_CACHE_BYTES];
s/CONFIG_X86_INTERNODE_CACHE_BYTES
Ingo
On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
> Frederik Deweerdt wrote:
> > Hi,
> >
> > On x86_64 flush tlb data is stored in per_cpu variables. This is
> > unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> > are accessed.
> > This patch aims at making the code less confusing (there's nothing
> > really "per_cpu") by using a plain array. It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> >
> > Regards,
> > Frederik
> >
> > Signed-off-by: Frederik Deweerdt <[email protected]>
>
> Here is the net change in memory usage with this patch on a allyesconfig
> with NR_CPUS=4096.
Yes, this point wrt. memory was based on my flawed understanding of how
per_cpu actually allocates the data. There is, however, 1) a confusing use
of per_cpu removed, and 2) faster access to the flush data.
>
> ====== Data (-l 500)
>
> 1 - initial
> 2 - change-flush-tlb
>
> .1. .2. .delta.
> 0 5120 +5120 . flush_state(.bss)
>
> ====== Sections (-l 500)
>
> .1. .2. .delta.
> 12685496 12693688 +8192 +0.06% .bss
> 1910176 1909408 -768 -0.04% .data.percpu
>
I get:
Initial
size ./arch/x86/kernel/tlb_64.o
text data bss dec hex filename
1667 136 8 1811 713 ./arch/x86/kernel/tlb_64.o
After
size ./arch/x86/kernel/tlb_64.o
text data bss dec hex filename
1598 8 1088 2694 a86 ./arch/x86/kernel/tlb_64.o
-69 -128 +1080 +883
But I'm not sure those numbers are that relevant, as the percpu part
will be allocated at runtime.
Regards,
Frederik
> >
> > diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> > index f8be6f1..c177a1f 100644
> > --- a/arch/x86/kernel/tlb_64.c
> > +++ b/arch/x86/kernel/tlb_64.c
> > @@ -33,7 +33,7 @@
> > * To avoid global state use 8 different call vectors.
> > * Each CPU uses a specific vector to trigger flushes on other
> > * CPUs. Depending on the received vector the target CPUs look into
> > - * the right per cpu variable for the flush data.
> > + * the right array slot for the flush data.
> > *
> > * With more than 8 CPUs they are hashed to the 8 available
> > * vectors. The limited global vector space forces us to this right now.
> > @@ -54,7 +54,7 @@ union smp_flush_state {
> > /* State is put into the per CPU data section, but padded
> > to a full cache line because other CPUs can access it and we don't
> > want false sharing in the per cpu data segment. */
> > -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> > +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
> >
> > /*
> > * We cannot call mmdrop() because we are in interrupt context,
> > @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
> > * Use that to determine where the sender put the data.
> > */
> > sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> > - f = &per_cpu(flush_state, sender);
> > + f = &flush_state[sender];
> >
> > if (!cpu_isset(cpu, f->flush_cpumask))
> > goto out;
> > @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
> >
> > /* Caller has disabled preemption */
> > sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> > - f = &per_cpu(flush_state, sender);
> > + f = &flush_state[sender];
> >
> > /*
> > * Could avoid this lock when
> > @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
> > {
> > int i;
> >
> > - for_each_possible_cpu(i)
> > - spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> > + for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> > + spin_lock_init(&flush_state[i].tlbstate_lock);
> >
> > return 0;
> > }
>
Frederik Deweerdt wrote:
> On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
>> Frederik Deweerdt wrote:
>>> Hi,
>>>
>>> On x86_64 flush tlb data is stored in per_cpu variables. This is
>>> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
>>> are accessed.
>>> This patch aims at making the code less confusing (there's nothing
>>> really "per_cpu") by using a plain array. It also would save some memory
>>> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>>>
>>> Regards,
>>> Frederik
>>>
>>> Signed-off-by: Frederik Deweerdt <[email protected]>
>> Here is the net change in memory usage with this patch on a allyesconfig
>> with NR_CPUS=4096.
> Yes, this point wrt. memory was based on my flawed understanding of how
> per_cpu actually allocates the data. There is however 1) a confusing use
> of per_cpu removed, 2) faster access to the flush data.
Is this true? On a widely separated NUMA system, requiring all CPUs to
access memory on NODE 0 for every tlb flush would seem expensive. That's
another benefit of per_cpu data: it's local to the node's cpus.
(And was it determined yet, that a cacheline has to be tossed around as well?)
Thanks,
Mike
>
>> ====== Data (-l 500)
>>
>> 1 - initial
>> 2 - change-flush-tlb
>>
>> .1. .2. .delta.
>> 0 5120 +5120 . flush_state(.bss)
>>
>> ====== Sections (-l 500)
>>
>> .1. .2. .delta.
>> 12685496 12693688 +8192 +0.06% .bss
>> 1910176 1909408 -768 -0.04% .data.percpu
>>
> I get :
> Initial
> size ./arch/x86/kernel/tlb_64.o
> text data bss dec hex filename
> 1667 136 8 1811 713 ./arch/x86/kernel/tlb_64.o
>
> After
> size ./arch/x86/kernel/tlb_64.o
> text data bss dec hex filename
> 1598 8 1088 2694 a86 ./arch/x86/kernel/tlb_64.o
>
> -69 -128 +1080 +883
>
> But I'm not sure those numbers are that relevant, as the percpu part
> will be allocated at runtime.
>
> Regards,
> Frederik
>
>>> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
>>> index f8be6f1..c177a1f 100644
>>> --- a/arch/x86/kernel/tlb_64.c
>>> +++ b/arch/x86/kernel/tlb_64.c
>>> @@ -33,7 +33,7 @@
>>> * To avoid global state use 8 different call vectors.
>>> * Each CPU uses a specific vector to trigger flushes on other
>>> * CPUs. Depending on the received vector the target CPUs look into
>>> - * the right per cpu variable for the flush data.
>>> + * the right array slot for the flush data.
>>> *
>>> * With more than 8 CPUs they are hashed to the 8 available
>>> * vectors. The limited global vector space forces us to this right now.
>>> @@ -54,7 +54,7 @@ union smp_flush_state {
>>> /* State is put into the per CPU data section, but padded
>>> to a full cache line because other CPUs can access it and we don't
>>> want false sharing in the per cpu data segment. */
>>> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
>>> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>>>
>>> /*
>>> * We cannot call mmdrop() because we are in interrupt context,
>>> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
>>> * Use that to determine where the sender put the data.
>>> */
>>> sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
>>> - f = &per_cpu(flush_state, sender);
>>> + f = &flush_state[sender];
>>>
>>> if (!cpu_isset(cpu, f->flush_cpumask))
>>> goto out;
>>> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>>>
>>> /* Caller has disabled preemption */
>>> sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
>>> - f = &per_cpu(flush_state, sender);
>>> + f = &flush_state[sender];
>>>
>>> /*
>>> * Could avoid this lock when
>>> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
>>> {
>>> int i;
>>>
>>> - for_each_possible_cpu(i)
>>> - spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
>>> + for (i = 0; i < ARRAY_SIZE(flush_state); i++)
>>> + spin_lock_init(&flush_state[i].tlbstate_lock);
>>>
>>> return 0;
>>> }
* Mike Travis <[email protected]> wrote:
> Frederik Deweerdt wrote:
> > On Mon, Jan 12, 2009 at 02:54:25PM -0800, Mike Travis wrote:
> >> Frederik Deweerdt wrote:
> >>> Hi,
> >>>
> >>> On x86_64 flush tlb data is stored in per_cpu variables. This is
> >>> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> >>> are accessed.
> >>> This patch aims at making the code less confusing (there's nothing
> >>> really "per_cpu") by using a plain array. It also would save some memory
> >>> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
> >>>
> >>> Regards,
> >>> Frederik
> >>>
> >>> Signed-off-by: Frederik Deweerdt <[email protected]>
> >> Here is the net change in memory usage with this patch on a allyesconfig
> >> with NR_CPUS=4096.
>
> > Yes, this point wrt. memory was based on my flawed understanding of how
> > per_cpu actually allocates the data. There is however 1) a confusing use
> > of per_cpu removed, 2) faster access to the flush data.
>
> Is this true? On a widely separated NUMA system, requiring all CPU's to
> access memory on NODE 0 for every tlb flush would seem expensive.
> That's another benefit of per_cpu data, it's local to the node's cpus.
That's irrelevant here: we only have 8 IPI slots for the TLB flush, so
we'll use the first 8 per_cpu areas (note how all access is cpu-nr modulo
8 in essence). That is going to be node 0 anyway in most NUMA setups.
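A short sketch of that point, condensed from the sending side shown in the
patch (the elided steps are summarized in comments):

	/* the slot is purely a function of the sending CPU's number */
	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;	/* 0..7 */
	f = &flush_state[sender];

	/* CPUs 8, 16, 24, ... hash back to slot 0, so the lock arbitrates
	 * between senders that happen to share a slot */
	spin_lock(&f->tlbstate_lock);
	f->flush_va = va;
	/* ... fill f->flush_cpumask, send the IPI on vector
	 * INVALIDATE_TLB_VECTOR_START + sender, wait for the acks ... */
	spin_unlock(&f->tlbstate_lock);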
> (And was it determined yet, that a cacheline has to be tossed around as
> well?)
No, that assertion of Andi is wrong.
Ingo
On Tue, Jan 13, 2009 at 12:00:52AM +0100, Ingo Molnar wrote:
>
>
>> With per-cpu areas, locks belonging to different array elements did not
>> end up on the same internode cacheline due to per-node allocation of
>> per-cpu areas, but with a simple array, two locks could end up on the
>> same internode cacheline.
>>
>> Hence, can you please change the padding in smp_flush_state to
>> INTERNODE_CACHE_BYTES (you have to derive this off
>> INTERNODE_CACHE_SHIFT) and align it using
>> ____cacheline_internodealigned_in_smp?
>
>Yes, that matters to vsmp, which has ... 4K node cache-lines.
>
>Note that there's already CONFIG_X86_INTERNODE_CACHE_BYTES which is set
>properly on vsmp.
>
>So ... something like the commit below?
>
Yep. Looks good!
Acked-by: Ravikiran Thirumalai <[email protected]>
Thanks,
Kiran
> No distro kernel will build with less than 8 CPUs anyway so this point is
> moot.
It has nothing to do with what the distro kernel builds with. As I stated
clearly in my review the per cpu data is sized based on the possible map,
which is discovered from the BIOS at runtime. So if your system has two cores
only you will only have two copies of per cpu data.
With this patch on the other hand you will always have 8 copies
of this data; aka 1K no matter how many CPUs you have.
So the description that it saves memory is flat-out
wrong on any system with fewer than 8 threads (which
is by far the majority of systems currently
and in the foreseeable future).
> > You would need to cache line pad each entry then, otherwise you risk
> > false sharing. [...]
>
> They are already cache line padded.
Yes that's the problem here.
>
> > [...] That would make the array 1K on 128 bytes cache line system.
>
> 512 bytes.
8 * 128 = 1024
OK, the real waste is a little less because you need at least one copy,
but it's still considerable (896 bytes on UP, 768 bytes on 2C).
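Spelling those numbers out (still assuming 8 vectors and a 128-byte
configured cache line):

	array:                8 * 128 = 1024 bytes, regardless of CPU count
	per_cpu, 1 CPU:       1 * 128 =  128 bytes  ->  waste 1024 - 128 = 896
	per_cpu, 2 CPUs:      2 * 128 =  256 bytes  ->  waste 1024 - 256 = 768
	per_cpu, N > 8 CPUs:  N * 128 bytes         ->  the array saves (N - 8) * 128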
>
> > [...] This means on small systems this would actually waste much more
> > memory.
>
> Really small systems will be UP and wont do cross-CPU TLB flushes, so if
> they are a worry the flush code can be UP optimized. (Nobody bothered so
> far.)
The SMP flush code shouldn't be called at all on UP because the "other cpu"
mask is always empty.
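For context, the caller-side check being referred to looks roughly like
this; it is paraphrased from the 2.6.28-era flush_tlb_*() callers, so take
the exact shape as an approximation rather than a quote:

	/* each flush_tlb_*() caller computes the set of *other* CPUs that
	 * currently have this mm loaded ... */
	cpumask_t cpu_mask = mm->cpu_vm_mask;
	cpu_clear(smp_processor_id(), cpu_mask);

	/* ... flushes its own TLB locally, and only takes the IPI path when
	 * some other CPU is actually affected */
	if (!cpus_empty(cpu_mask))
		flush_tlb_others(cpu_mask, mm, va);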
Just talking about the memory. Sure it's only 1K (or 896 bytes), but if you
add up a lot of little 896-byte wastes you eventually get a lot of waste all over.
Anyways if you wanted to do this without using per cpu data you
could use alloc_percpu(), but that would be much more complicated
code.
-Andi
--
[email protected] -- Speaking for myself only.
> > (And was it determined yet, that a cacheline has to be tossed around as
> > well?)
>
> No, that assertion of Andi is wrong.
For the IPI you always have to toss one cache line around to pass
the information.
Anyways the padding is just to avoid false sharing.
-Andi
--
[email protected] -- Speaking for myself only.
On Tue, 2009-01-13 at 00:00 +0100, Ingo Molnar wrote:
> From 23d9dc8bffc759c131b09a48b5215cc2b37a5ac3 Mon Sep 17 00:00:00 2001
> From: Frederik Deweerdt <[email protected]>
> Date: Mon, 12 Jan 2009 22:35:42 +0100
> Subject: [PATCH] x86, tlb flush_data: replace per_cpu with an array
>
> Impact: micro-optimization, memory reduction
>
> On x86_64 flush tlb data is stored in per_cpu variables. This is
> unnecessary because only the first NUM_INVALIDATE_TLB_VECTORS entries
> are accessed.
>
> This patch aims at making the code less confusing (there's nothing
> really "per_cpu") by using a plain array. It also would save some memory
> on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>
> [ Ravikiran G Thirumalai also pointed out that the correct alignment
> is ____cacheline_internodealigned_in_smp, so that there's no
> bouncing on vsmp. ]
>
> Signed-off-by: Frederik Deweerdt <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/kernel/tlb_64.c | 16 ++++++++--------
> 1 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
> index f8be6f1..c5a6c6f 100644
> --- a/arch/x86/kernel/tlb_64.c
> +++ b/arch/x86/kernel/tlb_64.c
> @@ -33,7 +33,7 @@
> * To avoid global state use 8 different call vectors.
> * Each CPU uses a specific vector to trigger flushes on other
> * CPUs. Depending on the received vector the target CPUs look into
> - * the right per cpu variable for the flush data.
> + * the right array slot for the flush data.
> *
> * With more than 8 CPUs they are hashed to the 8 available
> * vectors. The limited global vector space forces us to this right now.
> @@ -48,13 +48,13 @@ union smp_flush_state {
> unsigned long flush_va;
> spinlock_t tlbstate_lock;
> };
> - char pad[SMP_CACHE_BYTES];
> -} ____cacheline_aligned;
> + char pad[X86_INTERNODE_CACHE_BYTES];
> +} ____cacheline_internodealigned_in_smp;
That will make the below array 8*4096 bytes for VSMP, which pushes the
limit for memory savings up to 256 cpus.
I'm really dubious this patch is really worth it.
> /* State is put into the per CPU data section, but padded
> to a full cache line because other CPUs can access it and we don't
> want false sharing in the per cpu data segment. */
> -static DEFINE_PER_CPU(union smp_flush_state, flush_state);
> +static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];
>
> /*
> * We cannot call mmdrop() because we are in interrupt context,
> @@ -129,7 +129,7 @@ asmlinkage void smp_invalidate_interrupt(struct pt_regs *regs)
> * Use that to determine where the sender put the data.
> */
> sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
> - f = &per_cpu(flush_state, sender);
> + f = &flush_state[sender];
>
> if (!cpu_isset(cpu, f->flush_cpumask))
> goto out;
> @@ -169,7 +169,7 @@ void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
>
> /* Caller has disabled preemption */
> sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
> - f = &per_cpu(flush_state, sender);
> + f = &flush_state[sender];
>
> /*
> * Could avoid this lock when
> @@ -205,8 +205,8 @@ static int __cpuinit init_smp_flush(void)
> {
> int i;
>
> - for_each_possible_cpu(i)
> - spin_lock_init(&per_cpu(flush_state, i).tlbstate_lock);
> + for (i = 0; i < ARRAY_SIZE(flush_state); i++)
> + spin_lock_init(&flush_state[i].tlbstate_lock);
>
> return 0;
> }
* Andi Kleen <[email protected]> wrote:
> > > (And was it determined yet, that a cacheline has to be tossed around
> > > as well?)
> >
> > No, that assertion of Andi is wrong.
>
> For the IPI you always have to toss one cache line around to pass the
> information.
>
> Anyways the padding is just to avoid false sharing.
Which is a red herring - the padding was there before the patch, and is
there after the patch as well.
Ingo
* Andi Kleen <[email protected]> wrote:
> > No distro kernel will build with less than 8 CPUs anyway so this point
> > is moot.
>
> It has nothing to do with what the distro kernel builds with. As I
> stated clearly in my review the per cpu data is sized based on the
> possible map, which is discovered from the BIOS at runtime. So if your
> system has two cores only you will only have two copies of per cpu data.
You ignored my observation that by the time this hits distro kernels the
usual SMP hw size will be 8 or more cores.
> With this patch on the other hand you will always have 8 copies of this
> data; aka 1K no matter how many CPUs you have.
Firstly, it's 512 bytes (see below); secondly, with the percpu approach we
waste far more total memory as time goes on and the average core count
increases.
> So the description that it saves memory is flat out wrong on any system
> with less than 8 threads (which is by far the biggest majority of
> systems currently and in the forseeable future)
>
> > > You would need to cache line pad each entry then, otherwise you risk
> > > false sharing. [...]
> >
> > They are already cache line padded.
>
> Yes that's the problem here.
I fail to see a problem. It has to be padded and aligned no matter where
it is - and it is padded and aligned both before and after the patch. I
don't know why you keep insisting that there's a problem where there is
none.
> > > [...] That would make the array 1K on 128 bytes cache line system.
> >
> > 512 bytes.
>
> 8 * 128 = 1024
The default cacheline size in the x86 tree (for generic cpus) is 64 bytes,
not 128 bytes - so the default size of the array is 512 bytes, not 1024
bytes. (This is a change we made yesterday, so you can not have known
about it unless you follow the x86 tree closely.)
Ingo
* Peter Zijlstra <[email protected]> wrote:
> > - char pad[SMP_CACHE_BYTES];
> > -} ____cacheline_aligned;
> > + char pad[X86_INTERNODE_CACHE_BYTES];
> > +} ____cacheline_internodealigned_in_smp;
>
> That will make the below array 8*4096 bytes for VSMP, which pushes the
> limit for memory savings up to 256 cpus.
VSMP is a clustering solution (default-disabled) that pushes
L1_CACHE_BYTES to 4096 bytes (4K). That is an extremely large alignment
that pushes up the BSS size ten-fold (!), so no generic Linux distribution
enables it. 32K compared to 9MB bloat caused by 4K cachelines is a drop in
the ocean.
Ingo
On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
>
>
>VSMP is a clustering solution (default-disabled)
Small correction :)
Well, it is a virtualization solution creating a single VM across multiple
systems. Since it is shared memory, we would rather refer to it as SMP or
Virtual SMP rather than a clustering solution.
* Ravikiran G Thirumalai <[email protected]> wrote:
> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
> >
> >
> >VSMP is a clustering solution (default-disabled)
>
> Small correction :)
> Well, it is a virtualization solution creating single VM across multiple
> systems. Since it is shared memory, we would rather refer to it as SMP or
> Virtual SMP rather than a clustering solution.
You do have hardware help too to do efficient cross-node memory traffic,
don't you?
Ingo
On Tue, Jan 13, 2009 at 07:34:20PM +0100, Ingo Molnar wrote:
>
>* Ravikiran G Thirumalai <[email protected]> wrote:
>
>> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
>> >
>> >
>> >VSMP is a clustering solution (default-disabled)
>>
>> Small correction :)
>> Well, it is a virtualization solution creating single VM across multiple
>> systems. Since it is shared memory, we would rather refer to it as SMP or
>> Virtual SMP rather than a clustering solution.
>
>You do have hardware help too to do efficient cross-node memory traffic,
>dont you?
>
Of course. A fast backplane interconnect.
* Ravikiran G Thirumalai <[email protected]> wrote:
> On Tue, Jan 13, 2009 at 07:34:20PM +0100, Ingo Molnar wrote:
> >
> >* Ravikiran G Thirumalai <[email protected]> wrote:
> >
> >> On Tue, Jan 13, 2009 at 01:33:17PM +0100, Ingo Molnar wrote:
> >> >
> >> >
> >> >VSMP is a clustering solution (default-disabled)
> >>
> >> Small correction :)
> >> Well, it is a virtualization solution creating single VM across multiple
> >> systems. Since it is shared memory, we would rather refer to it as SMP or
> >> Virtual SMP rather than a clustering solution.
> >
> > You do have hardware help too to do efficient cross-node memory
> > traffic, dont you?
>
> Of course. A fast backplane interconnect.
... so it's PCs clustered together, and called vSMP, right? ;-)
a neat hack in any case.
Ingo
> > It also would save some memory
> > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
As pointed out several times, this claim in the changelog is incorrect.
Please fix it.
A correct description would be:
- This patch wastes one configured cache line for each array slot beyond
  the CPUs actually present, on systems with fewer than 8 cores (e.g. 896
  bytes on a 1-core system with CONFIG_GENERIC_CPU / 128-byte cache lines).
  It saves some memory on systems with more than 8 cores (one cache
  line for each core above 8).
-Andi
On Wed, Jan 14, 2009 at 10:08:11AM +0100, Andi Kleen wrote:
>
> > > It also would save some memory
> > > on most distros out there (Ubuntu x86_64 has NR_CPUS=64 by default).
>
> As pointed out several times this claim in the changelog is incorrect.
> Please fix it.
>
> A correct description would be:
>
> - This patch wastes memory of one configured cache line for each CPU
> on systems with less than 8 cores (e.g. 896 bytes on a 1 core system
> with CONFIG_GENERIC_CPU/ cache line size 128)
> It saves some memory on systems with more than 8 cores (one cache
> line size for each core above 8)
To be fair, it also depends on whether HOTPLUG_CPU is on or not (if it
is we allocate NR_CPUS percpu sections). In that case, which is Ubuntu's
and Fedora's, we do save memory.
This would make for a fairly long changelog, how about a link to this
thread, and removing the NR_CPUS bit?
Regards,
Frederik
>
> -Andi
>
> To be fair, it also depends on whenever HOTPLUG_CPU is on or not (if it
> is we allocate NR_CPUS percpu sections).
not true to my knowledge. Even a !HOTPLUG_CPU kernel doesn't
allocate more than what is in the possible map. That would be dumb
if it was true, but it isn't fortunately.
I don't think it does.
> In that case, which is Ubuntu's
> and Fedora's,
You're saying that Ubuntu and Fedora do not support suspend to ram?
(suspend to ram requires CPU hotplug). At least my Fedora box
does S3 just fine so that cannot be correct.
> we do save memory.
> This would make for a fairly long changelog, how about a link to this
> thread, and removing the NR_CPUS bit?
My original paragraph is correct and it's not long. At least your
claim is just wrong.
-Andi
--
[email protected] -- Speaking for myself only.
On Wed, Jan 14, 2009 at 04:20:01PM +0100, Andi Kleen wrote:
> > To be fair, it also depends on whenever HOTPLUG_CPU is on or not (if it
> > is we allocate NR_CPUS percpu sections).
>
> not true to my knowledge. Even a !HOTPLUG_CPU kernel doesn't
> allocate more than what is in the possible map. That would be dumb
> if it was true, but it isn't fortunately.
My point is that possible map == NR_CPUS if HOTPLUG_CPU
>
> I don't think it does.
> > In that case, which is Ubuntu's
> > and Fedora's,
>
> You're saying that Ubuntu and Fedora do not support suspend to ram?
> (suspend to ram requires CPU hotplug). At least my Fedora box
> does S3 just fine so that cannot be correct.
>
> > we do save memory.
> > This would make for a fairly long changelog, how about a link to this
> > thread, and removing the NR_CPUS bit?
>
> My original paragraph is correct and it's not long. At least your
> claim is just wrong.
>
> -Andi
>
> --
> [email protected] -- Speaking for myself only.
On Wed, Jan 14, 2009 at 08:31:46AM +0100, Ingo Molnar wrote:
>
>* Ravikiran G Thirumalai <[email protected]> wrote:
>
>> > You do have hardware help too to do efficient cross-node memory
>> > traffic, dont you?
>>
>> Of course. A fast backplane interconnect.
>
>... so it's PCs clustered together, and called vSMP, right? ;-)
>
Well, it depends on how you define a cluster. You can even call an
SMP motherboard a cluster of cpus :). In that case even the Sequent NUMA
machines were clusters, and so are some of the large core count shared
memory systems made of multiple similar boards/bricks/books (based on
vendor terminology) with a custom interconnect for cache coherency.
A vSMP system is multiple similar cpu motherboards, usually dual socket,
each board having multiple cpus and a commodity fast interconnect for cache
coherency. There is a VMM which, among other tasks, also takes care of the
coherency, and each 'board' or 'node' can access physical memory from
another board. There is one instance of the OS running on the entire system,
unlike the classical cluster where there are multiple instances of the OS
and the apps under the OS have to work together as a cluster.