Make some variables have correct alignment.
Signed-off-by: Shaohua Li <[email protected]>
---
mm/mmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2011-03-24 10:59:39.000000000 +0800
+++ linux/mm/mmap.c 2011-03-24 10:59:42.000000000 +0800
@@ -84,10 +84,10 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */
+int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-struct percpu_counter vm_committed_as;
+struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
/*
* Check that a process has enough memory to allocate a new virtual
Shaohua Li <[email protected]> writes:
> Make some variables have correct alignment.
Nit: __read_mostly doesn't change alignment, just the section.
Please fix the description. Other than that it looks good.
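For reference, on x86 __read_mostly expands to nothing more than a
section attribute, roughly like this (a sketch of
arch/x86/include/asm/cache.h; the exact section name varies by
architecture and tree):

/* the linker groups all __read_mostly variables into one section */
#define __read_mostly __attribute__((__section__(".data..read_mostly")))

That keeps read-mostly variables packed together, away from
write-frequently data, so the annotation changes placement, not
alignment.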
-Andi
--
[email protected] -- Speaking for myself only
On Tue, 2011-03-29 at 00:55 +0800, Andi Kleen wrote:
> Shaohua Li <[email protected]> writes:
>
> > Make some variables have correct alignment.
>
> Nit: __read_mostly doesn't change alignment, just the section.
> Please fix the description. Other than that it looks good.
sure.
Make some variables have the correct alignment/section to avoid false
cacheline sharing. In workloads that heavily do mmap/munmap, these
variables are accessed frequently.
Signed-off-by: Shaohua Li <[email protected]>
---
mm/mmap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2011-03-29 08:30:12.000000000 +0800
+++ linux/mm/mmap.c 2011-03-29 08:30:54.000000000 +0800
@@ -84,10 +84,10 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */
+int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-struct percpu_counter vm_committed_as;
+struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
/*
* Check that a process has enough memory to allocate a new virtual
On Tue, 29 Mar 2011 08:54:14 +0800
Shaohua Li <[email protected]> wrote:
> -struct percpu_counter vm_committed_as;
> +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
Why ____cacheline_internodealigned_in_smp? That's pretty aggressive.
afaict the main benefit from this will occur if the read-only
vm_committed_as.counters lands in the same cacheline as some
write-frequently storage.
But that's a complete mad guess and I'd prefer not to have to guess.
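To illustrate the guess (a hypothetical layout; made_up_hot_counter is
invented, and nothing says the linker actually produces this):

/* without alignment, unrelated globals may be packed into one line: */
struct percpu_counter vm_committed_as;	/* read-mostly ->counters */
atomic_long_t made_up_hot_counter;	/* write-hot neighbour, same line */

Every store to made_up_hot_counter would then bounce the cacheline
holding vm_committed_as between CPUs, and readers of ->counters pay for
it.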
On Tue, 29 Mar 2011, Andrew Morton wrote:
> > -struct percpu_counter vm_committed_as;
> > +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
>
> Why ____cacheline_internodealigned_in_smp? That's pretty aggressive.
>
> afaict the main benefit from this will occur if the read-only
> vm_committed_as.counters lands in the same cacheline as some
> write-frequently storage.
>
> But that's a complete mad guess and I'd prefer not to have to guess.
It would be useful to have some functionality that lets us hint which
variables are accessed together and therefore belong in the same
cacheline, thus avoiding things like the read-mostly section and the
above aberration.
Andi had a special pda area in earlier versions, before the merger of
the 32-bit and 64-bit x86 code, which placed the most performance
critical variables near one another. I am afraid they are all spread
out now.
So maybe something that allows us to define multiple pdas? Or just structs
that are cacheline aligned?
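Something like this, perhaps (a hypothetical sketch of the idea only;
mmap_hot_vars is a made-up name and the sysctl tables would of course
need to be repointed at the members):

/* keep globals that the mmap/munmap path reads in one cacheline */
static struct mmap_hot_vars {
	int overcommit_memory;
	int overcommit_ratio;
	int max_map_count;
} mmap_hot ____cacheline_aligned_in_smp;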
On Wed, 2011-03-30 at 06:24 +0800, Andrew Morton wrote:
> On Tue, 29 Mar 2011 08:54:14 +0800
> Shaohua Li <[email protected]> wrote:
>
> > -struct percpu_counter vm_committed_as;
> > +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
>
> Why ____cacheline_internodealigned_in_smp? That's pretty aggressive.
>
> afaict the main benefit from this will occur if the read-only
> vm_committed_as.counters lands in the same cacheline as some
> write-frequently storage.
vm_committed_as can be frequently updated in some workloads too.
> But that's a complete mad guess and I'd prefer not to have to guess.
Is the updated patch below better for you?
Make some variables have the correct alignment/section to avoid false
cacheline sharing. In workloads that heavily do mmap/munmap, these
variables are accessed frequently.
Signed-off-by: Shaohua Li <[email protected]>
---
mm/mmap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2011-03-30 08:45:05.000000000 +0800
+++ linux/mm/mmap.c 2011-03-30 08:59:23.000000000 +0800
@@ -84,10 +84,14 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */
+int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-struct percpu_counter vm_committed_as;
+/*
+ * Make sure vm_committed_as sits in its own cacheline and is not shared
+ * with other variables; it can be updated frequently by several CPUs.
+ */
+struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
/*
* Check that a process has enough memory to allocate a new virtual
On Wed, 30 Mar 2011 09:01:22 +0800 Shaohua Li <[email protected]> wrote:
> +/*
> + * Make sure vm_committed_as sits in its own cacheline and is not shared
> + * with other variables; it can be updated frequently by several CPUs.
> + */
> +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
The mystery deepens. The only cross-cpu writeable fields in there are
percpu_counter.lock and its companion percpu_counter.count. If CPUs
are contending for the lock then that itself is a problem - how does
adding some padding to the struct help anything?
On Wed, 2011-03-30 at 09:06 +0800, Andrew Morton wrote:
> On Wed, 30 Mar 2011 09:01:22 +0800 Shaohua Li <[email protected]> wrote:
>
> > +/*
> > + * Make sure vm_committed_as sits in its own cacheline and is not shared
> > + * with other variables; it can be updated frequently by several CPUs.
> > + */
> > +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
>
> The mystery deepens. The only cross-cpu writeable fields in there are
> percpu_counter.lock and its companion percpu_counter.count. If CPUs
> are contending for the lock then that itself is a problem - how does
> adding some padding to the struct help anything?
I have another patch that tries to address the lock contention (for the
OVERCOMMIT_GUESS case), which I'll send out soon. But I thought it
better to have the correct alignment for the OVERCOMMIT_NEVER case anyway.
Thanks,
Shaohua
On Wed, 30 Mar 2011 09:17:23 +0800 Shaohua Li <[email protected]> wrote:
> On Wed, 2011-03-30 at 09:06 +0800, Andrew Morton wrote:
> > On Wed, 30 Mar 2011 09:01:22 +0800 Shaohua Li <[email protected]> wrote:
> >
> > > +/*
> > > + * Make sure vm_committed_as sits in its own cacheline and is not shared
> > > + * with other variables; it can be updated frequently by several CPUs.
> > > + */
> > > +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
> >
> > The mystery deepens. The only cross-cpu writeable fields in there are
> > percpu_counter.lock and its companion percpu_counter.count. If CPUs
> > are contending for the lock then that itself is a problem - how does
> > adding some padding to the struct help anything?
> I have another patch that tries to address the lock contention (for the
> OVERCOMMIT_GUESS case), which I'll send out soon. But I thought it
> better to have the correct alignment for the OVERCOMMIT_NEVER case anyway.
I still don't understand why adding
____cacheline_internodealigned_in_smp to vm_committed_as improves
anything.
Here it is:
struct percpu_counter {
spinlock_t lock;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;
};
and your patch effectively converts this to
struct percpu_counter {
spinlock_t lock;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;
+ char large_waste_of_space[lots];
};
how is it that this improves things?
On Wed, 2011-03-30 at 09:25 +0800, Andrew Morton wrote:
> On Wed, 30 Mar 2011 09:17:23 +0800 Shaohua Li <[email protected]> wrote:
>
> > On Wed, 2011-03-30 at 09:06 +0800, Andrew Morton wrote:
> > > On Wed, 30 Mar 2011 09:01:22 +0800 Shaohua Li <[email protected]> wrote:
> > >
> > > > +/*
> > > > + * Make sure vm_committed_as sits in its own cacheline and is not shared
> > > > + * with other variables; it can be updated frequently by several CPUs.
> > > > + */
> > > > +struct percpu_counter vm_committed_as ____cacheline_internodealigned_in_smp;
> > >
> > > The mystery deepens. The only cross-cpu writeable fields in there are
> > > percpu_counter.lock and its companion percpu_counter.count. If CPUs
> > > are contending for the lock then that itself is a problem - how does
> > > adding some padding to the struct help anything?
> > I had another patch trying to address the lock contention (for case
> > OVERCOMMIT_GUESS), will send out soon. But thought better to have the
> > correct alignment for OVERCOMMIT_NEVER case.
>
> I still don't understand why adding
> ____cacheline_internodealigned_in_smp to vm_committed_as improves
> anything.
>
> Here it is:
>
> struct percpu_counter {
> spinlock_t lock;
> s64 count;
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> #endif
> s32 __percpu *counters;
> };
>
> and your patch effectively converts this to
>
> struct percpu_counter {
> spinlock_t lock;
> s64 count;
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> #endif
> s32 __percpu *counters;
> + char large_waste_of_space[lots];
> };
>
> how is it that this improves things?
Hmm, it actually is:
struct percpu_counter {
spinlock_t lock;
s64 count;
#ifdef CONFIG_HOTPLUG_CPU
struct list_head list; /* All percpu_counters are on a list */
#endif
s32 __percpu *counters;
} __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
so lock and count are in one cache line.
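Rough field offsets on 64-bit without lock debugging (assuming a 4-byte
spinlock_t and CONFIG_HOTPLUG_CPU), just to show the whole object fits
in one 128-byte line once its start is aligned:

	spinlock_t lock;	/* offset  0, size  4 */
	s64 count;		/* offset  8, size  8 */
	struct list_head list;	/* offset 16, size 16 */
	s32 __percpu *counters;	/* offset 32, size  8 => ~40 bytes total */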
On Wed, 30 Mar 2011 09:36:40 +0800 Shaohua Li <[email protected]> wrote:
> > how is it that this improves things?
> Hmm, it actually is:
> struct percpu_counter {
> spinlock_t lock;
> s64 count;
> #ifdef CONFIG_HOTPLUG_CPU
> struct list_head list; /* All percpu_counters are on a list */
> #endif
> s32 __percpu *counters;
> } __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
> so lock and count are in one cache line.
____cacheline_aligned_in_smp would achieve that?
On Wed, 2011-03-30 at 09:41 +0800, Andrew Morton wrote:
> On Wed, 30 Mar 2011 09:36:40 +0800 Shaohua Li <[email protected]> wrote:
>
> > > how is it that this improves things?
> > Hmm, it actually is:
> > struct percpu_counter {
> > spinlock_t lock;
> > s64 count;
> > #ifdef CONFIG_HOTPLUG_CPU
> > struct list_head list; /* All percpu_counters are on a list */
> > #endif
> > s32 __percpu *counters;
> > } __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
> > so lock and count are in one cache line.
>
> ____cacheline_aligned_in_smp would achieve that?
____cacheline_aligned_in_smp can't guarantee cacheline alignment across
multiple nodes, because the variable can be updated by multiple
nodes/CPUs.
Thanks,
Shaohua
On Wed, 30 Mar 2011 09:54:01 +0800 Shaohua Li <[email protected]> wrote:
> On Wed, 2011-03-30 at 09:41 +0800, Andrew Morton wrote:
> > On Wed, 30 Mar 2011 09:36:40 +0800 Shaohua Li <[email protected]> wrote:
> >
> > > > how is it that this improves things?
> > > Hmm, it actually is:
> > > struct percpu_counter {
> > > spinlock_t lock;
> > > s64 count;
> > > #ifdef CONFIG_HOTPLUG_CPU
> > > struct list_head list; /* All percpu_counters are on a list */
> > > #endif
> > > s32 __percpu *counters;
> > > } __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
> > > so lock and count are in one cache line.
> >
> > ____cacheline_aligned_in_smp would achieve that?
> ____cacheline_aligned_in_smp can't guarantee cacheline alignment across
> multiple nodes, because the variable can be updated by multiple
> nodes/CPUs.
Confused. If an object is aligned at a multiple-of-128 address on one
node, it is aligned at a multiple-of-128 address when viewed from other
nodes, surely?
Even if the cache alignment to which you're referring is the internode
cache, can a 34-byte, L1-cache-aligned structure ever span multiple
internode cachelines?
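For comparison, the two macros differ only in the alignment value (a
rough sketch of include/linux/cache.h; the real header has more #ifdef
guards):

#define ____cacheline_aligned_in_smp \
	__attribute__((__aligned__(SMP_CACHE_BYTES)))
#define ____cacheline_internodealigned_in_smp \
	__attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))

INTERNODE_CACHE_SHIFT defaults to L1_CACHE_SHIFT and only grows on
configurations such as x86 vSMP, where it can reach page size, which is
what made the internode variant look so aggressive here.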
On Wed, 2011-03-30 at 10:10 +0800, Andrew Morton wrote:
> On Wed, 30 Mar 2011 09:54:01 +0800 Shaohua Li <[email protected]> wrote:
>
> > On Wed, 2011-03-30 at 09:41 +0800, Andrew Morton wrote:
> > > On Wed, 30 Mar 2011 09:36:40 +0800 Shaohua Li <[email protected]> wrote:
> > >
> > > > > how is it that this improves things?
> > > > Hmm, it actually is:
> > > > struct percpu_counter {
> > > > spinlock_t lock;
> > > > s64 count;
> > > > #ifdef CONFIG_HOTPLUG_CPU
> > > > struct list_head list; /* All percpu_counters are on a list */
> > > > #endif
> > > > s32 __percpu *counters;
> > > > } __attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
> > > > so lock and count are in one cache line.
> > >
> > > ____cacheline_aligned_in_smp would achieve that?
> > ____cacheline_aligned_in_smp can't guarantee cacheline alignment across
> > multiple nodes, because the variable can be updated by multiple
> > nodes/CPUs.
>
> Confused. If an object is aligned at a multiple-of-128 address on one
> node, it is aligned at a multiple-of-128 address when viewed from other
> nodes, surely?
>
> Even if the cache alignment to which you're referring is the internode
> cache, can a 34-byte, L1-cache-aligned structure ever span multiple
> internode cachelines?
Ah, you are right, ____cacheline_aligned_in_smp is enough here. Thanks
for correcting me.
Make some variables have the correct alignment/section to avoid false
cacheline sharing. In workloads that heavily do mmap/munmap, these
variables are accessed frequently.
Signed-off-by: Shaohua Li <[email protected]>
---
mm/mmap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c 2011-03-30 08:45:05.000000000 +0800
+++ linux/mm/mmap.c 2011-03-30 10:34:36.000000000 +0800
@@ -84,10 +84,14 @@ pgprot_t vm_get_page_prot(unsigned long
}
EXPORT_SYMBOL(vm_get_page_prot);
-int sysctl_overcommit_memory = OVERCOMMIT_GUESS; /* heuristic overcommit */
-int sysctl_overcommit_ratio = 50; /* default is 50% */
+int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS; /* heuristic overcommit */
+int sysctl_overcommit_ratio __read_mostly = 50; /* default is 50% */
int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
-struct percpu_counter vm_committed_as;
+/*
+ * Make sure vm_committed_as sits in its own cacheline and is not shared
+ * with other variables; it can be updated frequently by several CPUs.
+ */
+struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
/*
* Check that a process has enough memory to allocate a new virtual