When booting on a large memory system, the kernel spends
considerable time in memmap_init_zone() setting up memory zones.
Analysis shows significant time spent in __early_pfn_to_nid().
The routine memmap_init_zone() checks each PFN to verify the
nid is valid. __early_pfn_to_nid() sequentially scans the list of
pfn ranges to find the right range and returns the nid. This does
not scale well. On a 4 TB (single rack) system there are 308
memory ranges to scan. The higher the PFN the more time spent
sequentially spinning through memory ranges.
Since memmap_init_zone() increments pfn, it will almost always be
looking for the same range as the previous pfn, so check that
range first. If it is in the same range, return that nid.
If not, scan the list as before.
A 4 TB (single rack) UV1 system takes 512 seconds to get through
the zone code. This performance optimization reduces the time
by 189 seconds, a 36% improvement.
A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
a 112.9 second (53%) reduction.
Signed-off-by: Russ Anderson <[email protected]>
---
mm/page_alloc.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
+++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
@@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
{
unsigned long start_pfn, end_pfn;
int i, nid;
+ static unsigned long last_start_pfn, last_end_pfn;
+ static int last_nid;
+
+ if (last_start_pfn <= pfn && pfn < last_end_pfn)
+ return last_nid;
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
- if (start_pfn <= pfn && pfn < end_pfn)
+ if (start_pfn <= pfn && pfn < end_pfn) {
+ last_nid = nid;
+ last_start_pfn = start_pfn;
+ last_end_pfn = end_pfn;
return nid;
+ }
/* This is a memory hole */
return -1;
}
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
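[Editor's note: for readability, here is roughly what __early_pfn_to_nid() in
mm/page_alloc.c looks like with the hunk above applied (reconstructed from the
diff, with a comment added on the fast path; context lines outside the hunk are
unchanged):

    int __meminit __early_pfn_to_nid(unsigned long pfn)
    {
            unsigned long start_pfn, end_pfn;
            int i, nid;
            static unsigned long last_start_pfn, last_end_pfn;
            static int last_nid;

            /* Fast path: consecutive pfns almost always hit the cached range. */
            if (last_start_pfn <= pfn && pfn < last_end_pfn)
                    return last_nid;

            for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
                    if (start_pfn <= pfn && pfn < end_pfn) {
                            last_nid = nid;
                            last_start_pfn = start_pfn;
                            last_end_pfn = end_pfn;
                            return nid;
                    }
            /* This is a memory hole */
            return -1;
    }
]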
On Mon, 18 Mar 2013, Russ Anderson wrote:
> When booting on a large memory system, the kernel spends
> considerable time in memmap_init_zone() setting up memory zones.
> Analysis shows significant time spent in __early_pfn_to_nid().
>
> The routine memmap_init_zone() checks each PFN to verify the
> nid is valid. __early_pfn_to_nid() sequentially scans the list of
> pfn ranges to find the right range and returns the nid. This does
> not scale well. On a 4 TB (single rack) system there are 308
> memory ranges to scan. The higher the PFN the more time spent
> sequentially spinning through memory ranges.
>
> Since memmap_init_zone() increments pfn, it will almost always be
> looking for the same range as the previous pfn, so check that
> range first. If it is in the same range, return that nid.
> If not, scan the list as before.
>
> A 4 TB (single rack) UV1 system takes 512 seconds to get through
> the zone code. This performance optimization reduces the time
> by 189 seconds, a 36% improvement.
>
> A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> a 112.9 second (53%) reduction.
>
> Signed-off-by: Russ Anderson <[email protected]>
Acked-by: David Rientjes <[email protected]>
Very nice improvement!
On Mon, 18 Mar 2013 10:56:19 -0500 Russ Anderson <[email protected]> wrote:
> When booting on a large memory system, the kernel spends
> considerable time in memmap_init_zone() setting up memory zones.
> Analysis shows significant time spent in __early_pfn_to_nid().
>
> The routine memmap_init_zone() checks each PFN to verify the
> nid is valid. __early_pfn_to_nid() sequentially scans the list of
> pfn ranges to find the right range and returns the nid. This does
> not scale well. On a 4 TB (single rack) system there are 308
> memory ranges to scan. The higher the PFN the more time spent
> sequentially spinning through memory ranges.
>
> Since memmap_init_zone() increments pfn, it will almost always be
> looking for the same range as the previous pfn, so check that
> range first. If it is in the same range, return that nid.
> If not, scan the list as before.
>
> A 4 TB (single rack) UV1 system takes 512 seconds to get through
> the zone code. This performance optimization reduces the time
> by 189 seconds, a 36% improvement.
>
> A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> a 112.9 second (53%) reduction.
>
> ...
>
> --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> + static unsigned long last_start_pfn, last_end_pfn;
> + static int last_nid;
> +
> + if (last_start_pfn <= pfn && pfn < last_end_pfn)
> + return last_nid;
>
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
> - if (start_pfn <= pfn && pfn < end_pfn)
> + if (start_pfn <= pfn && pfn < end_pfn) {
> + last_nid = nid;
> + last_start_pfn = start_pfn;
> + last_end_pfn = end_pfn;
> return nid;
> + }
> /* This is a memory hole */
> return -1;
lol. And yes, it seems pretty safe to assume that the kernel is
running single-threaded at this time.
arch/ia64/mm/numa.c's __early_pfn_to_nid might benefit from the same
treatment. In fact if this had been implemented as a caching wrapper
around an unchanged __early_pfn_to_nid(), no ia64 edits would be
needed?
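[Editor's note: one way to picture Andrew's wrapper idea, purely as a sketch.
The helper name and out-parameters below are invented for illustration; the
cache needs the matching range bounds back from the callee, which is presumably
why Russ reports further down that he could not add a wrapper without
significantly changing both versions of __early_pfn_to_nid():

    /*
     * Hypothetical caching front end.  __early_pfn_to_nid_range() is an
     * assumed variant of __early_pfn_to_nid() that also reports the pfn
     * range it matched; without that, the wrapper has nothing to cache.
     */
    static int __meminit early_pfn_to_nid_cached(unsigned long pfn)
    {
            static unsigned long last_start_pfn, last_end_pfn;
            static int last_nid;
            unsigned long start_pfn, end_pfn;
            int nid;

            if (last_start_pfn <= pfn && pfn < last_end_pfn)
                    return last_nid;

            nid = __early_pfn_to_nid_range(pfn, &start_pfn, &end_pfn);
            if (nid >= 0) {
                    last_start_pfn = start_pfn;
                    last_end_pfn = end_pfn;
                    last_nid = nid;
            }
            return nid;
    }
]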
* Russ Anderson <[email protected]> wrote:
> When booting on a large memory system, the kernel spends
> considerable time in memmap_init_zone() setting up memory zones.
> Analysis shows significant time spent in __early_pfn_to_nid().
>
> The routine memmap_init_zone() checks each PFN to verify the
> nid is valid. __early_pfn_to_nid() sequentially scans the list of
> pfn ranges to find the right range and returns the nid. This does
> not scale well. On a 4 TB (single rack) system there are 308
> memory ranges to scan. The higher the PFN the more time spent
> sequentially spinning through memory ranges.
>
> Since memmap_init_zone() increments pfn, it will almost always be
> looking for the same range as the previous pfn, so check that
> range first. If it is in the same range, return that nid.
> If not, scan the list as before.
>
> A 4 TB (single rack) UV1 system takes 512 seconds to get through
> the zone code. This performance optimization reduces the time
> by 189 seconds, a 36% improvement.
>
> A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> a 112.9 second (53%) reduction.
Nice speedup!
A minor nit, in addition to Andrew's suggestion about wrapping
__early_pfn_to_nid():
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> + static unsigned long last_start_pfn, last_end_pfn;
> + static int last_nid;
Please move these globals out of function local scope, to make it more
apparent that they are not on-stack. I only noticed it in the second pass.
Thanks,
Ingo
On Thu 21-03-13 11:55:16, Ingo Molnar wrote:
>
> * Russ Anderson <[email protected]> wrote:
>
> > When booting on a large memory system, the kernel spends
> > considerable time in memmap_init_zone() setting up memory zones.
> > Analysis shows significant time spent in __early_pfn_to_nid().
> >
> > The routine memmap_init_zone() checks each PFN to verify the
> > nid is valid. __early_pfn_to_nid() sequentially scans the list of
> > pfn ranges to find the right range and returns the nid. This does
> > not scale well. On a 4 TB (single rack) system there are 308
> > memory ranges to scan. The higher the PFN the more time spent
> > sequentially spinning through memory ranges.
> >
> > Since memmap_init_zone() increments pfn, it will almost always be
> > looking for the same range as the previous pfn, so check that
> > range first. If it is in the same range, return that nid.
> > If not, scan the list as before.
> >
> > A 4 TB (single rack) UV1 system takes 512 seconds to get through
> > the zone code. This performance optimization reduces the time
> > by 189 seconds, a 36% improvement.
> >
> > A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> > a 112.9 second (53%) reduction.
>
> Nice speedup!
>
> A minor nit, in addition to Andrew's suggestion about wrapping
> __early_pfn_to_nid():
>
> > Index: linux/mm/page_alloc.c
> > ===================================================================
> > --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> > +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> > {
> > unsigned long start_pfn, end_pfn;
> > int i, nid;
> > + static unsigned long last_start_pfn, last_end_pfn;
> > + static int last_nid;
>
> Please move these globals out of function local scope, to make it more
> apparent that they are not on-stack. I only noticed it in the second pass.
Wouldn't this just add more confusion with other _pfn variables? (e.g.
{min,max}_low_pfn and others)
IMO the local scope is more obvious as this is and should only be used
for caching purposes.
--
Michal Hocko
SUSE Labs
* Michal Hocko <[email protected]> wrote:
> On Thu 21-03-13 11:55:16, Ingo Molnar wrote:
> >
> > * Russ Anderson <[email protected]> wrote:
> >
> > > When booting on a large memory system, the kernel spends
> > > considerable time in memmap_init_zone() setting up memory zones.
> > > Analysis shows significant time spent in __early_pfn_to_nid().
> > >
> > > The routine memmap_init_zone() checks each PFN to verify the
> > > nid is valid. __early_pfn_to_nid() sequentially scans the list of
> > > pfn ranges to find the right range and returns the nid. This does
> > > not scale well. On a 4 TB (single rack) system there are 308
> > > memory ranges to scan. The higher the PFN the more time spent
> > > sequentially spinning through memory ranges.
> > >
> > > Since memmap_init_zone() increments pfn, it will almost always be
> > > looking for the same range as the previous pfn, so check that
> > > range first. If it is in the same range, return that nid.
> > > If not, scan the list as before.
> > >
> > > A 4 TB (single rack) UV1 system takes 512 seconds to get through
> > > the zone code. This performance optimization reduces the time
> > > by 189 seconds, a 36% improvement.
> > >
> > > A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> > > a 112.9 second (53%) reduction.
> >
> > Nice speedup!
> >
> > A minor nit, in addition to Andrew's suggestion about wrapping
> > __early_pfn_to_nid():
> >
> > > Index: linux/mm/page_alloc.c
> > > ===================================================================
> > > --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> > > +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> > > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> > > {
> > > unsigned long start_pfn, end_pfn;
> > > int i, nid;
> > > + static unsigned long last_start_pfn, last_end_pfn;
> > > + static int last_nid;
> >
> > Please move these globals out of function local scope, to make it more
> > apparent that they are not on-stack. I only noticed it in the second pass.
>
> Wouldn't this just add more confusion with other _pfn variables? (e.g.
> {min,max}_low_pfn and others)
I don't think so.
> IMO the local scope is more obvious as this is and should only be used
> for caching purposes.
It's a pattern we actively avoid in kernel code.
Thanks,
Ingo
On Thu, 21 Mar 2013, Ingo Molnar wrote:
> > Index: linux/mm/page_alloc.c
> > ===================================================================
> > --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> > +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> > {
> > unsigned long start_pfn, end_pfn;
> > int i, nid;
> > + static unsigned long last_start_pfn, last_end_pfn;
> > + static int last_nid;
>
> Please move these globals out of function local scope, to make it more
> apparent that they are not on-stack. I only noticed it in the second pass.
>
The way they're currently defined places these in meminit.data as
appropriate; if they are moved out, please make sure to annotate their
definitions with __meminitdata.
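[Editor's note: if the cache variables were hoisted out of the function as Ingo
asks, the definitions would presumably need to look roughly like this to stay
in .meminit.data; illustrative only:

    /* Cache for __early_pfn_to_nid(); only touched during early, single-threaded boot. */
    static unsigned long __meminitdata last_start_pfn;
    static unsigned long __meminitdata last_end_pfn;
    static int __meminitdata last_nid;
]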
* David Rientjes <[email protected]> wrote:
> On Thu, 21 Mar 2013, Ingo Molnar wrote:
>
> > > Index: linux/mm/page_alloc.c
> > > ===================================================================
> > > --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> > > +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> > > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> > > {
> > > unsigned long start_pfn, end_pfn;
> > > int i, nid;
> > > + static unsigned long last_start_pfn, last_end_pfn;
> > > + static int last_nid;
> >
> > Please move these globals out of function local scope, to make it more
> > apparent that they are not on-stack. I only noticed it in the second pass.
>
> The way they're currently defined places these in meminit.data as
> appropriate; if they are moved out, please make sure to annotate their
> definitions with __meminitdata.
I'm fine with having them within the function as well in this special
case, as long as a heavy /* NOTE: ... */ warning is put before them -
which explains why these SMP-unsafe globals are safe.
( That warning will also act as a visual delimiter that breaks the
normally confusing and misleading 'globals mixed amongst stack
variables' pattern. )
Thanks,
Ingo
On Fri, Mar 22, 2013 at 08:25:32AM +0100, Ingo Molnar wrote:
>
> * David Rientjes <[email protected]> wrote:
>
> > On Thu, 21 Mar 2013, Ingo Molnar wrote:
> >
> > > > Index: linux/mm/page_alloc.c
> > > > ===================================================================
> > > > --- linux.orig/mm/page_alloc.c 2013-03-18 10:52:11.510988843 -0500
> > > > +++ linux/mm/page_alloc.c 2013-03-18 10:52:14.214931348 -0500
> > > > @@ -4161,10 +4161,19 @@ int __meminit __early_pfn_to_nid(unsigne
> > > > {
> > > > unsigned long start_pfn, end_pfn;
> > > > int i, nid;
> > > > + static unsigned long last_start_pfn, last_end_pfn;
> > > > + static int last_nid;
> > >
> > > Please move these globals out of function local scope, to make it more
> > > apparent that they are not on-stack. I only noticed it in the second pass.
> >
> > The way they're currently defined places these in meminit.data as
> > appropriate; if they are moved out, please make sure to annotate their
> > definitions with __meminitdata.
>
> I'm fine with having them within the function as well in this special
> case, as long as a heavy /* NOTE: ... */ warning is put before them -
> which explains why these SMP-unsafe globals are safe.
>
> ( That warning will also act as a visual delimiter that breaks the
> normally confusing and misleading 'globals mixed amongst stack
> variables' pattern. )
Thanks Ingo. Here is an updated patch with heavy warning added.
As for the wrapper function, I was unable to find an obvious
way to add a wrapper without significantly changing both
versions of __early_pfn_to_nid(). It seems cleaner to add
the change in both versions. I'm sure someone will point
out if this conclusion is wrong. :-)
------------------------------------------------------------
When booting on a large memory system, the kernel spends
considerable time in memmap_init_zone() setting up memory zones.
Analysis shows significant time spent in __early_pfn_to_nid().
The routine memmap_init_zone() checks each PFN to verify the
nid is valid. __early_pfn_to_nid() sequentially scans the list of
pfn ranges to find the right range and returns the nid. This does
not scale well. On a 4 TB (single rack) system there are 308
memory ranges to scan. The higher the PFN the more time spent
sequentially spinning through memory ranges.
Since memmap_init_zone() increments pfn, it will almost always be
looking for the same range as the previous pfn, so check that
range first. If it is in the same range, return that nid.
If not, scan the list as before.
A 4 TB (single rack) UV1 system takes 512 seconds to get through
the zone code. This performance optimization reduces the time
by 189 seconds, a 36% improvement.
A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
a 112.9 second (53%) reduction.
Signed-off-by: Russ Anderson <[email protected]>
---
arch/ia64/mm/numa.c | 15 ++++++++++++++-
mm/page_alloc.c | 15 ++++++++++++++-
2 files changed, 28 insertions(+), 2 deletions(-)
Index: linux/mm/page_alloc.c
===================================================================
--- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
+++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
@@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
{
unsigned long start_pfn, end_pfn;
int i, nid;
+ /*
+ NOTE: The following SMP-unsafe globals are only used early
+ in boot when the kernel is running single-threaded.
+ */
+ static unsigned long last_start_pfn, last_end_pfn;
+ static int last_nid;
+
+ if (last_start_pfn <= pfn && pfn < last_end_pfn)
+ return last_nid;
for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
- if (start_pfn <= pfn && pfn < end_pfn)
+ if (start_pfn <= pfn && pfn < end_pfn) {
+ last_start_pfn = start_pfn;
+ last_end_pfn = end_pfn;
+ last_nid = nid;
return nid;
+ }
/* This is a memory hole */
return -1;
}
Index: linux/arch/ia64/mm/numa.c
===================================================================
--- linux.orig/arch/ia64/mm/numa.c 2013-02-25 15:49:44.000000000 -0600
+++ linux/arch/ia64/mm/numa.c 2013-03-22 16:09:44.662268239 -0500
@@ -61,13 +61,26 @@ paddr_to_nid(unsigned long paddr)
int __meminit __early_pfn_to_nid(unsigned long pfn)
{
int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
+ /*
+ NOTE: The following SMP-unsafe globals are only used early
+ in boot when the kernel is running single-threaded.
+ */
+ static unsigned long last_start_pfn, last_end_pfn;
+ static int last_nid;
+
+ if (section >= last_ssec && section < last_esec)
+ return last_nid;
for (i = 0; i < num_node_memblks; i++) {
ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
esec = (node_memblk[i].start_paddr + node_memblk[i].size +
((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
- if (section >= ssec && section < esec)
+ if (section >= ssec && section < esec) {
+ last_ssec = ssec;
+ last_esec = esec;
+ last_nid = node_memblk[i].nid
return node_memblk[i].nid;
+ }
}
return -1;
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
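[Editor's note: as posted, the ia64 hunk above does not compile: the cache
variables are declared as last_start_pfn/last_end_pfn but used as
last_ssec/last_esec (Yinghai notes this below), and the last_nid assignment is
missing its semicolon. A corrected sketch of the resulting function would be:

    int __meminit __early_pfn_to_nid(unsigned long pfn)
    {
            int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
            /*
             * NOTE: The following SMP-unsafe globals are only used early
             * in boot when the kernel is running single-threaded.
             */
            static int last_ssec, last_esec;
            static int last_nid;

            if (section >= last_ssec && section < last_esec)
                    return last_nid;

            for (i = 0; i < num_node_memblks; i++) {
                    ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
                    esec = (node_memblk[i].start_paddr + node_memblk[i].size +
                            ((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
                    if (section >= ssec && section < esec) {
                            last_ssec = ssec;
                            last_esec = esec;
                            last_nid = node_memblk[i].nid;
                            return node_memblk[i].nid;
                    }
            }

            return -1;
    }
]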
On Sat, Mar 23, 2013 at 8:29 AM, Russ Anderson <[email protected]> wrote:
> On Fri, Mar 22, 2013 at 08:25:32AM +0100, Ingo Molnar wrote:
> ------------------------------------------------------------
> When booting on a large memory system, the kernel spends
> considerable time in memmap_init_zone() setting up memory zones.
> Analysis shows significant time spent in __early_pfn_to_nid().
>
> The routine memmap_init_zone() checks each PFN to verify the
> nid is valid. __early_pfn_to_nid() sequentially scans the list of
> pfn ranges to find the right range and returns the nid. This does
> not scale well. On a 4 TB (single rack) system there are 308
> memory ranges to scan. The higher the PFN the more time spent
> sequentially spinning through memory ranges.
>
> Since memmap_init_zone() increments pfn, it will almost always be
> looking for the same range as the previous pfn, so check that
> range first. If it is in the same range, return that nid.
> If not, scan the list as before.
>
> A 4 TB (single rack) UV1 system takes 512 seconds to get through
> the zone code. This performance optimization reduces the time
> by 189 seconds, a 36% improvement.
>
> A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> a 112.9 second (53%) reduction.
Interesting, but there are only 308 entries in memblock...
Did you try extending memblock_search() to look up the nid as well?
Something like the attached patch. That should save more time.
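[Editor's note: the attachment itself is not reproduced in this thread. Judging
from the fragment Lin Feng quotes further down, the idea is roughly the
following; this is a sketch only, with details such as the pfn conversion
assumed, not the actual attached patch:

    #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    /*
     * Binary-search memblock.memory (via the existing memblock_search()
     * helper) for the region containing pfn, and hand back its nid together
     * with the region's pfn bounds so the caller can cache them.
     */
    int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
                            unsigned long *start_pfn, unsigned long *end_pfn)
    {
            struct memblock_type *type = &memblock.memory;
            int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);

            if (mid == -1)
                    return -1;

            *start_pfn = type->regions[mid].base >> PAGE_SHIFT;
            *end_pfn = (type->regions[mid].base + type->regions[mid].size)
                            >> PAGE_SHIFT;

            return memblock_get_region_node(&type->regions[mid]);
    }
    #endif
]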
>
> Signed-off-by: Russ Anderson <[email protected]>
> ---
> arch/ia64/mm/numa.c | 15 ++++++++++++++-
> mm/page_alloc.c | 15 ++++++++++++++-
> 2 files changed, 28 insertions(+), 2 deletions(-)
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
> +++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
> @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
> {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> + /*
> + NOTE: The following SMP-unsafe globals are only used early
> + in boot when the kernel is running single-threaded.
> + */
> + static unsigned long last_start_pfn, last_end_pfn;
> + static int last_nid;
> +
> + if (last_start_pfn <= pfn && pfn < last_end_pfn)
> + return last_nid;
>
> for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid)
> - if (start_pfn <= pfn && pfn < end_pfn)
> + if (start_pfn <= pfn && pfn < end_pfn) {
> + last_start_pfn = start_pfn;
> + last_end_pfn = end_pfn;
> + last_nid = nid;
> return nid;
> + }
> /* This is a memory hole */
> return -1;
> }
> Index: linux/arch/ia64/mm/numa.c
> ===================================================================
> --- linux.orig/arch/ia64/mm/numa.c 2013-02-25 15:49:44.000000000 -0600
> +++ linux/arch/ia64/mm/numa.c 2013-03-22 16:09:44.662268239 -0500
> @@ -61,13 +61,26 @@ paddr_to_nid(unsigned long paddr)
> int __meminit __early_pfn_to_nid(unsigned long pfn)
> {
> int i, section = pfn >> PFN_SECTION_SHIFT, ssec, esec;
> + /*
> + NOTE: The following SMP-unsafe globals are only used early
> + in boot when the kernel is running single-threaded.
> + */
> + static unsigned long last_start_pfn, last_end_pfn;
last_ssec, last_esec?
> + static int last_nid;
> +
> + if (section >= last_ssec && section < last_esec)
> + return last_nid;
>
> for (i = 0; i < num_node_memblks; i++) {
> ssec = node_memblk[i].start_paddr >> PA_SECTION_SHIFT;
> esec = (node_memblk[i].start_paddr + node_memblk[i].size +
> ((1L << PA_SECTION_SHIFT) - 1)) >> PA_SECTION_SHIFT;
> - if (section >= ssec && section < esec)
> + if (section >= ssec && section < esec) {
> + last_ssec = ssec;
> + last_esec = esec;
> + last_nid = node_memblk[i].nid
> return node_memblk[i].nid;
> + }
> }
>
> return -1;
>
Also, it looks like you forgot to put the ia64 maintainers in the To list.
Maybe just put the ia64 part in a separate patch?
Thanks
Yinghai
> --- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
> +++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
> @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
> {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> + /*
> + NOTE: The following SMP-unsafe globals are only used early
> + in boot when the kernel is running single-threaded.
> + */
> + static unsigned long last_start_pfn, last_end_pfn;
> + static int last_nid;
Why don't you mark them __meminitdata? They seem freeable.
* Russ Anderson <[email protected]> wrote:
> --- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
> +++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
> @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
> {
> unsigned long start_pfn, end_pfn;
> int i, nid;
> + /*
> + NOTE: The following SMP-unsafe globals are only used early
> + in boot when the kernel is running single-threaded.
> + */
> + static unsigned long last_start_pfn, last_end_pfn;
> + static int last_nid;
I guess I'm the nitpicker of the week:
please use the customary (multi-line) comment style:
/*
* Comment .....
* ...... goes here.
*/
specified in Documentation/CodingStyle.
Thanks,
Ingo
On Sat, 23 Mar 2013, KOSAKI Motohiro wrote:
> > --- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
> > +++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
> > @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
> > {
> > unsigned long start_pfn, end_pfn;
> > int i, nid;
> > + /*
> > + NOTE: The following SMP-unsafe globals are only used early
> > + in boot when the kernel is running single-threaded.
> > + */
> > + static unsigned long last_start_pfn, last_end_pfn;
> > + static int last_nid;
>
> Why don't you mark them __meminitdata? They seem freeable.
>
Um, defining them in a __meminit function places them in .meminit.data
already.
On 03/24/2013 04:37 AM, Yinghai Lu wrote:
> +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> +int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
> + unsigned long *start_pfn, unsigned long *end_pfn)
> +{
> + struct memblock_type *type = &memblock.memory;
> + int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);
I'm really eager to see how much time we can save using binary search compared to
linear search in this case :)
(quote)
> A 4 TB (single rack) UV1 system takes 512 seconds to get through
> the zone code. This performance optimization reduces the time
> by 189 seconds, a 36% improvement.
>
> A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> a 112.9 second (53%) reduction.
(quote)
thanks,
linfeng
On Thu, 21 Mar 2013 19:03:21 +0100 Ingo Molnar <[email protected]> wrote:
> > IMO the local scope is more obvious as this is and should only be used
> > for caching purposes.
>
> It's a pattern we actively avoid in kernel code.
On the contrary, I always encourage people to move the static
definitions into function scope if possible. So the reader can see the
identifier's scope without having to search the whole file.
Unnecessarily giving the identifier file-scope seems weird.
On Sun, 24 Mar 2013 17:28:12 -0700 (PDT) David Rientjes <[email protected]> wrote:
> On Sat, 23 Mar 2013, KOSAKI Motohiro wrote:
>
> > > --- linux.orig/mm/page_alloc.c 2013-03-19 16:09:03.736450861 -0500
> > > +++ linux/mm/page_alloc.c 2013-03-22 17:07:43.895405617 -0500
> > > @@ -4161,10 +4161,23 @@ int __meminit __early_pfn_to_nid(unsigne
> > > {
> > > unsigned long start_pfn, end_pfn;
> > > int i, nid;
> > > + /*
> > > + NOTE: The following SMP-unsafe globals are only used early
> > > + in boot when the kernel is running single-threaded.
> > > + */
> > > + static unsigned long last_start_pfn, last_end_pfn;
> > > + static int last_nid;
> >
> > Why don't you mark them __meminitdata? They seem freeable.
> >
>
> Um, defining them in a __meminit function places them in .meminit.data
> already.
I wish it did, but it doesn't.
On Mon, Mar 25, 2013 at 10:11:27AM +0800, Lin Feng wrote:
> On 03/24/2013 04:37 AM, Yinghai Lu wrote:
> > +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> > +int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
> > + unsigned long *start_pfn, unsigned long *end_pfn)
> > +{
> > + struct memblock_type *type = &memblock.memory;
> > + int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);
>
> I'm really eager to see how much time we can save using binary search compared to
> linear search in this case :)
I have machine time tonight to measure the difference.
Based on earlier testing, a system with 9TB memory calls
__early_pfn_to_nid() 2,377,198,300 times while booting, but
only 6815 times does it not find that the memory range is
the same as previous and search the table. Caching the
previous range avoids searching the table 2,377,191,485 times,
saving a significant amount of time.
Of the remaining 6815 times when it searches the table, a binary
search may help, but with relatively few calls it may not
make much of an overall difference. Testing will show how much.
> (quote)
> > A 4 TB (single rack) UV1 system takes 512 seconds to get through
> > the zone code. This performance optimization reduces the time
> > by 189 seconds, a 36% improvement.
> >
> > A 2 TB (single rack) UV2 system goes from 212.7 seconds to 99.8 seconds,
> > a 112.9 second (53%) reduction.
> (quote)
>
> thanks,
> linfeng
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
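[Editor's note: for scale, the figures above amount to a cache hit rate of
2,377,191,485 / 2,377,198,300, or roughly 99.9997%, so only about 0.0003% of
the calls (the 6,815 misses) ever reach the table scan that a binary search
could speed up.]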
On Mon, Mar 25, 2013 at 2:56 PM, Russ Anderson <[email protected]> wrote:
> On Mon, Mar 25, 2013 at 10:11:27AM +0800, Lin Feng wrote:
>> On 03/24/2013 04:37 AM, Yinghai Lu wrote:
>> > +#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
>> > +int __init_memblock memblock_search_pfn_nid(unsigned long pfn,
>> > + unsigned long *start_pfn, unsigned long *end_pfn)
>> > +{
>> > + struct memblock_type *type = &memblock.memory;
>> > + int mid = memblock_search(type, (phys_addr_t)pfn << PAGE_SHIFT);
>>
>> I'm really eager to see how much time we can save using binary search compared to
>> linear search in this case :)
>
> I have machine time tonight to measure the difference.
>
> Based on earlier testing, a system with 9TB memory calls
> __early_pfn_to_nid() 2,377,198,300 times while booting, but
> only 6815 times does it not find that the memory range is
> the same as previous and search the table. Caching the
> previous range avoids searching the table 2,377,191,485 times,
> saving a significant amount of time.
>
> Of the remaining 6815 times when it searches the table, a binary
> search may help, but with relatively few calls it may not
> make much of an overall difference. Testing will show how much.
Please check the attached patch, which can be applied on top of your patch
in -mm.
Thanks
Yinghai
On Mon, 25 Mar 2013, Andrew Morton wrote:
> > Um, defining them in a __meminit function places them in .meminit.data
> > already.
>
> I wish it did, but it doesn't.
>
$ objdump -t mm/page_alloc.o | grep last_start_pfn
0000000000000240 l O .meminit.data 0000000000000008 last_start_pfn.34345
What version of gcc are you using?
On Mon, 25 Mar 2013 15:36:54 -0700 (PDT) David Rientjes <[email protected]> wrote:
> On Mon, 25 Mar 2013, Andrew Morton wrote:
>
> > > Um, defining them in a __meminit function places them in .meminit.data
> > > already.
> >
> > I wish it did, but it doesn't.
> >
>
> $ objdump -t mm/page_alloc.o | grep last_start_pfn
> 0000000000000240 l O .meminit.data 0000000000000008 last_start_pfn.34345
>
> What version of gcc are you using?
4.4.4
* Andrew Morton <[email protected]> wrote:
> On Thu, 21 Mar 2013 19:03:21 +0100 Ingo Molnar <[email protected]> wrote:
>
> > > IMO the local scope is more obvious as this is and should only be
> > > used for caching purposes.
> >
> > It's a pattern we actively avoid in kernel code.
>
> On the contrary, I always encourage people to move the static
> definitions into function scope if possible. So the reader can see the
> identifier's scope without having to search the whole file.
> Unnecessarily giving the identifier file-scope seems weird.
A common solution I use is to move such variables right before the
function itself. That makes the "this function's scope only" aspect pretty
apparent - without the risks of hiding globals amongst local variables.
The other approach is to comment the variables very clearly that they are
really globals as the 'static' keyword is easy to miss while reading
email.
Both solutions are basically just as visible as the solution you prefer -
but more robust.
Anyway, I guess we have to agree to disagree on that; we probably already
spent more energy on discussing this than any worst-case problem the
placement of these variables could ever cause in the future ;-)
Thanks,
Ingo