Hello.
I have made patches that merge the Nonlinear and Node style
memory hotplug code. I hope these patches will work
for both SMP style and NUMA style memory hotplug.
The patches are here:
http://osdn.dl.sourceforge.net/sourceforge/lhms/20040621.tgz
These patches are for Linux 2.6.5.
Please comment.
Bye.
------------------
Note:
These patches are based on Takahashi-san's and Iwamoto-san's
patches, and they include the nonlinear code.
The modifications from them are:
- The CONFIG definitions were divided like this:
    CONFIG_HOTPLUG_MEMORY (common part)
    CONFIG_HOTPLUG_MEMORY_OF_NODE (only node style hotplug)
    CONFIG_HOTPLUG_MEMORY_NONLINEAR (only mem_section style
    hotplug)
- Some structure members were added to mem_section[] to
  unify the nonlinear and node style hotplug code.
- A basic implementation of mem_section hot-remove. (See below.)
- The NUMA code of build_zonelist() is used for NUMA style
  memory hotplug.
- The memory hot-add code for NUMA style was cleaned up.
The following issues remain:
- The hot-removable attribute is a vague concept in these patches.
  I don't have a suitable solution for it yet.
- Memory hot-add for mem_section.
- rmap was changed in 2.6.7. These patches should
  reflect those changes.
These patches are experimental, so the system might crash.
(Especially after executing the nonlinear hot-remove code.)
-------------------------------------------------
About hot-remove for mem_section
(using PG_booked)
Hot-remove needs three features:
1. Prohibiting the page allocator from reallocating pages in
   the area being removed.
2. Migrating pages out of the area being removed.
3. Repeating step 2 until all of the memory in the area is freed.
   So, the system has to know when all of the memory has been freed.
Features 1 and 3 are problematic for mem_section hot-remove.
In the node-removal case, the system can avoid reallocation
by removing the zone from the zonelist, but that approach can't
be used for a mem_section. The system can count freed pages using
free_pages in the zone, but it still has to know which freed pages
lie inside the mem_section being removed.
Takahashi-san proposed that I use the PG_booked page flag
to prohibit reallocation within the mem_section. (In these patches,
this flag was used to reserve the destination of huge-page
migration.)
- The system does not allocate booked pages.
- The system does not return a booked page to the per-cpu page
  lists; it returns it to the buddy allocator immediately.
- When all pages in the section are booked and freed,
  the system knows that the section can be removed.
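A minimal userspace C sketch of the booking scheme described above. The names (PG_BOOKED, book_section(), the booked_and_free counter) and the section size are illustrative, not the actual patch code:

```c
#include <assert.h>
#include <stdbool.h>

#define PAGES_PER_SECTION 8

/* Illustrative page flags, not the real kernel definitions. */
#define PG_BOOKED 0x1   /* page is reserved against reallocation */
#define PG_FREE   0x2   /* page has been freed back to the allocator */

struct page { unsigned flags; };

struct mem_section_sim {
    struct page page[PAGES_PER_SECTION];
    int booked_and_free;    /* counter kept in the section, per the patch */
};

/* Step 1: book every page so the allocator will skip them. */
static void book_section(struct mem_section_sim *sec)
{
    for (int i = 0; i < PAGES_PER_SECTION; i++)
        sec->page[i].flags |= PG_BOOKED;
}

/* The allocator must refuse booked pages. */
static bool can_allocate(const struct page *p)
{
    return !(p->flags & PG_BOOKED);
}

/* Steps 2-3: when a booked page is freed (e.g. after migration), it
 * bypasses the per-cpu lists and the section counts it immediately.
 * (Each page is assumed to be freed at most once here.) */
static void free_page_sim(struct mem_section_sim *sec, int idx)
{
    sec->page[idx].flags |= PG_FREE;
    if (sec->page[idx].flags & PG_BOOKED)
        sec->booked_and_free++;
}

/* The section is removable once every page is booked and freed. */
static bool section_removable(const struct mem_section_sim *sec)
{
    return sec->booked_and_free == PAGES_PER_SECTION;
}
```

This shows why a per-section counter answers "is all memory in this area freed?" in O(1) instead of rescanning every page's flags.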
--
Yasunori Goto <ygoto at us.fujitsu.com>
First of all, thank you for merging nonlinear on top of your current
work. It looks very promising.
On Tue, 2004-06-22 at 12:00, Yasunori Goto wrote:
> - Some structure members were added to mem_section[] to
>   unify the nonlinear and node style hotplug code.
This quadruples the size of the mem_section[] array, and makes each
mem_section entry take up a whole cache line. Are you sure all of these
structure members are needed? Can they be allocated elsewhere, instead
of directly inside the translation tables, or otherwise derived? Also,
why does the booked/free_count have to be kept here? Can't that be
determined by simply looping through and looking at all the pages'
flags?
Also, can you provide a patch which is just your modifications to Dave's
original nonlinear patch?
Instead of remove_from_freelist(unsigned int section), I'd hope that we
could support a much more generic interface in the page allocator:
allocate by physical address. remove_from_freelist() has some intimate
knowledge of the buddy allocator that I think is a bit complex.
That also brings up a more important issue. I see nonlinear as a
back-end for only the page_to_pfn() and pfn_to_page(), and that's all.
There are no real exposures of the nonlinear section size or the
structures to any other part of the kernel because they're all wrapped
up in those functions. It may be possible to keep the entire kernel
oblivious of nonlinear, but I think it's a worthy goal. That's why I'd
like to see the buddy allocator modifications be limited to
currently-existing concepts like physical addresses.
Brad, do you have anything that you can post to demonstrate your
approach for doing allocation by address?
-- Dave
> This quadruples the size of the mem_section[] array, and makes each
> mem_section entry take up a whole cache line. Are you sure all of these
> structure members are needed?
No, I'm not sure. In particular, I have not yet determined whether
the hot-removable attribute is necessary in mem_section.
> Can they be allocated elsewhere, instead
> of directly inside the translation tables, or otherwise derived? Also,
> why does the booked/free_count have to be kept here? Can't that be
> determined by simply looping through and looking at all the pages'
> flags?
That might be OK, but it might have other side effects
(for example, a new lock might become necessary in alloc_pages()...).
I think measurement is needed to find which implementation
is the fastest. If I have time, I would like to try it.
> Also, can you provide a patch which is just your modifications to Dave's
> original nonlinear patch?
No, I don't have one (at least not now).
The base of my patches is Iwamoto-san's patch, which is for 2.6.5,
but Dave-san's patch is for linux-2.6.5-mm. So, I had to adapt
Dave-san's patch to it.
Another difference is in the handling of mem_map.
Dave-san's patch divides the virtual address of mem_map per
mem_section, but that increases the number of steps in the
buddy allocator like this:
- buddy1 = base + (page_idx ^ -mask);
- buddy2 = base + page_idx;
+ buddy1 = pfn_to_page(base + (page_idx ^ -mask));
+ buddy2 = pfn_to_page(base + page_idx);
So, I would still like to try a contiguous virtual mem_map,
if it is possible.
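For reference, the buddy arithmetic in that hunk works on page indexes: with `mask = (~0UL) << order`, the expression `page_idx ^ -mask` is the same as `page_idx ^ (1UL << order)`, flipping one bit to find the buddy. A small userspace sketch of just that arithmetic (not the kernel code itself):

```c
/* In the 2.6-era buddy freeing path, mask = (~0UL) << order, so -mask
 * equals 1UL << order and page_idx ^ -mask flips the order-th bit. */
static unsigned long buddy_idx(unsigned long page_idx, unsigned int order)
{
    unsigned long mask = (~0UL) << order;
    return page_idx ^ -mask;    /* same as page_idx ^ (1UL << order) */
}
```

With a contiguous mem_map, this index can be added straight to the mem_map base; with a per-section mem_map, each index must go through pfn_to_page(), which is the extra step referred to above.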
> Instead of remove_from_freelist(unsigned int section), I'd hope that we
> could support a much more generic interface in the page allocator:
> allocate by physical address. remove_from_freelist() has some intimate
> knowledge of the buddy allocator that I think is a bit complex.
I don't think I understand your idea at this point.
If booked pages remain on the free_list, the page allocator has to
search through many pages, including the booked ones.
remove_from_freelist() is there to avoid this.
Thank you for your response.
Bye.
--
Yasunori Goto <ygoto at us.fujitsu.com>
On Wed, 2004-06-23 at 20:04, Yasunori Goto wrote:
> > Can they be allocated elsewhere, instead
> > of directly inside the translation tables, or otherwise derived? Also,
> > why does the booked/free_count have to be kept here? Can't that be
> > determined by simply looping through and looking at all the pages'
> > flags?
>
> That might be OK, but it might have other side effects
> (for example, a new lock might become necessary in alloc_pages()...).
> I think measurement is needed to find which implementation
> is the fastest. If I have time, I would like to try it.
I'm pretty sure your current implementation is the faster of the two
approaches. However, I think we should take a performance hit during
the hotplug operations in order to decrease the storage overhead. It's
not like we'll be publishing benchmarks during remove operations...
> > Also, can you provide a patch which is just your modifications to Dave's
> > original nonlinear patch?
>
> No, I don't have one (at least not now).
> The base of my patches is Iwamoto-san's patch, which is for 2.6.5,
> but Dave-san's patch is for linux-2.6.5-mm. So, I had to adapt
> Dave-san's patch to it.
>
> Another difference is in the handling of mem_map.
> Dave-san's patch divides the virtual address of mem_map per
> mem_section, but that increases the number of steps in the
> buddy allocator like this:
>
> - buddy1 = base + (page_idx ^ -mask);
> - buddy2 = base + page_idx;
> + buddy1 = pfn_to_page(base + (page_idx ^ -mask));
> + buddy2 = pfn_to_page(base + page_idx);
>
> So, I would still like to try a contiguous virtual mem_map,
> if it is possible.
I'd love to see an implementation, but I'm not horribly sure how
feasible it is. I think they use that approach on ia64 right now, and I
don't think it's very popular.
But, it could be interesting. I'll be curious to see how it turns out.
> > Instead of remove_from_freelist(unsigned int section), I'd hope that we
> > could support a much more generic interface in the page allocator:
> > allocate by physical address. remove_from_freelist() has some intimate
> > knowledge of the buddy allocator that I think is a bit complex.
>
> I don't think I understand your idea at this point.
> If booked pages remain on the free_list, the page allocator has to
> search through many pages, including the booked ones.
> remove_from_freelist() is there to avoid this.
Oh, I like *that* part. The first step in a "removal" is to allocate
the pages. I'd just like to see that allocation be more based on pfns
or physical addresses than sections. That's a much more generic
interface, and would be applicable to things outside of
CONFIG_NONLINEAR. I'll post an example of this in a day or two.
-- Dave
Some more comments on the first patch:
+#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
+ if (node_online(nid)) {
+ allocate_pgdat(nid);
+ printk ("node %d will remap to vaddr %08lx\n", nid,
+ (ulong) node_remap_start_vaddr[nid]);
+ }else
+ NODE_DATA(nid)=NULL;
+#else
allocate_pgdat(nid);
printk ("node %d will remap to vaddr %08lx - %08lx\n", nid,
(ulong) node_remap_start_vaddr[nid],
(ulong) pfn_to_kaddr(highstart_pfn
- node_remap_offset[nid] + node_remap_size[nid]));
+#endif
I don't think this chunk is very necessary. The 'NODE_DATA(nid)=NULL;'
is superfluous because the node_data[] is zeroed at boot:
NUMA:
#define NODE_DATA(nid) (node_data[nid])
non-NUMA:
#define NODE_DATA(nid) (&contig_page_data)
Why not just make it:
+ if (!node_online(nid))
+ continue;
That should at least get rid of the ifdef.
- bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
+ bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0,
+ (system_max_low_pfn > node_end_pfn[0]) ?
+ node_end_pfn[0] : system_max_low_pfn);
- register_bootmem_low_pages(system_max_low_pfn);
+ register_bootmem_low_pages((system_max_low_pfn > node_end_pfn[0]) ?
+ node_end_pfn[0] : system_max_low_pfn);
How about using a temp variable here instead of those nasty conditionals?
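A sketch of the temp-variable version being suggested, as self-contained userspace C; the helper and variable names are mine, not the patch's, and the pfn values are made up:

```c
/* Hypothetical stand-ins for the kernel globals in this patch chunk. */
static unsigned long system_max_low_pfn = 0x38000;
static unsigned long node_end_pfn0      = 0x30000;  /* node_end_pfn[0] */

/* One precomputed limit replaces both repeated ternaries:
 * the lowmem end for node 0, clamped to system_max_low_pfn. */
static unsigned long node0_low_pfn_limit(void)
{
    return system_max_low_pfn > node_end_pfn0
           ? node_end_pfn0 : system_max_low_pfn;
}
```

Both init_bootmem_node() and register_bootmem_low_pages() would then just take the precomputed limit.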
+
+#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
+ if (node_online(nid)){
+ if (nid)
+ memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ NODE_DATA(nid)->pgdat_next = pgdat_list;
+ pgdat_list = NODE_DATA(nid);
+ NODE_DATA(nid)->enabled = 1;
+ }
+#else
if (nid)
memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
NODE_DATA(nid)->pgdat_next = pgdat_list;
pgdat_list = NODE_DATA(nid);
+#endif
I'd just take the ifdef out. Wouldn't this work instead?
- if (nid)
- memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
- NODE_DATA(nid)->pgdat_next = pgdat_list;
- pgdat_list = NODE_DATA(nid);
+ if (node_online(nid)){
+ if (nid)
+ memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ NODE_DATA(nid)->pgdat_next = pgdat_list;
+ pgdat_list = NODE_DATA(nid);
+ NODE_DATA(nid)->enabled = 1;
+ }
+void set_max_mapnr_init(void)
+{
...
+ struct page *hsp=0;
Should just be 'struct page *hsp = NULL;'
+ for(i = 0; i < numnodes; i++) {
+ if (!NODE_DATA(i))
+ continue;
+ pgdat = NODE_DATA(i);
+ size = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
+ if (!size)
+ continue;
+ hsp = pgdat->node_zones[ZONE_HIGHMEM].zone_mem_map;
+ if (hsp)
+ break;
+ }
Doesn't this just find the lowest-numbered node's highmem? Are you sure
that no NUMA systems have memory at lower physical addresses on
higher-numbered nodes? I'm not sure that this is true.
+ if (hsp)
+ highmem_start_page = hsp;
+ else
+ highmem_start_page = (struct page *)-1;
Why not just BUG() here? Do you check for 'highmem_start_page == -1' somewhere?
@@ -478,12 +482,35 @@ void __init mem_init(void)
totalram_pages += __free_all_bootmem();
reservedpages = 0;
+
+#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
+ for (nid = 0; nid < numnodes; nid++){
+ int start, end;
+
+ if ( !node_online(nid))
+ continue;
+ if ( node_start_pfn[nid] >= max_low_pfn )
+ break;
+
+ start = node_start_pfn[nid];
+ end = ( node_end_pfn[nid] < max_low_pfn) ?
+ node_end_pfn[nid] : max_low_pfn;
+
+ for ( tmp = start; tmp < end; tmp++)
+ /*
+ * Only count reserved RAM pages
+ */
+ if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
+ reservedpages++;
+ }
+#else
Again, I don't see what this loop is used for. You appear to be trying
to detect which nodes have lowmem. Is there currently any x86 NUMA
architecture that has lowmem on any node but node 0?
-- Dave
Dave-san.
Probably all of your advice is right.
I was confused between my emulation environment and a true NUMA machine.
I will modify them. Thanks a lot.
BTW, I have a question about the nonlinear patch.
It is about the difference between phys_section[] and mem_section[].
phys_section[] looks meaningless to me now.
If it isn't necessary, the __va() and __pa() translations can be simpler.
What is the purpose of phys_section[]? Is it for ppc64?
Bye.
> Some more comments on the first patch:
>
> +#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
> + if (node_online(nid)) {
> + allocate_pgdat(nid);
> + printk ("node %d will remap to vaddr %08lx\n", nid,
> + (ulong) node_remap_start_vaddr[nid]);
> + }else
> + NODE_DATA(nid)=NULL;
> +#else
> allocate_pgdat(nid);
> printk ("node %d will remap to vaddr %08lx - %08lx\n", nid,
> (ulong) node_remap_start_vaddr[nid],
> (ulong) pfn_to_kaddr(highstart_pfn
> - node_remap_offset[nid] + node_remap_size[nid]));
> +#endif
>
> I don't think this chunk is very necessary. The 'NODE_DATA(nid)=NULL;'
> is superfluous because the node_data[] is zeroed at boot:
>
> NUMA:
> #define NODE_DATA(nid) (node_data[nid])
> non-NUMA:
> #define NODE_DATA(nid) (&contig_page_data)
>
> Why not just make it:
>
> + if (!node_online(nid))
> + continue;
>
> That should at least get rid of the ifdef.
>
> - bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0, system_max_low_pfn);
> + bootmap_size = init_bootmem_node(NODE_DATA(0), min_low_pfn, 0,
> + (system_max_low_pfn > node_end_pfn[0]) ?
> + node_end_pfn[0] : system_max_low_pfn);
>
> - register_bootmem_low_pages(system_max_low_pfn);
> + register_bootmem_low_pages((system_max_low_pfn > node_end_pfn[0]) ?
> + node_end_pfn[0] : system_max_low_pfn);
>
> How about using a temp variable here instead of those nasty conditionals?
>
> +
> +#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
> + if (node_online(nid)){
> + if (nid)
> + memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> + NODE_DATA(nid)->pgdat_next = pgdat_list;
> + pgdat_list = NODE_DATA(nid);
> + NODE_DATA(nid)->enabled = 1;
> + }
> +#else
> if (nid)
> memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> NODE_DATA(nid)->pgdat_next = pgdat_list;
> pgdat_list = NODE_DATA(nid);
> +#endif
>
> I'd just take the ifdef out. Wouldn't this work instead?
>
> - if (nid)
> - memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> - NODE_DATA(nid)->pgdat_next = pgdat_list;
> - pgdat_list = NODE_DATA(nid);
> + if (node_online(nid)){
> + if (nid)
> + memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
> + NODE_DATA(nid)->pgdat_next = pgdat_list;
> + pgdat_list = NODE_DATA(nid);
> + NODE_DATA(nid)->enabled = 1;
> + }
>
> +void set_max_mapnr_init(void)
> +{
> ...
> + struct page *hsp=0;
>
> Should just be 'struct page *hsp = NULL;'
>
> + for(i = 0; i < numnodes; i++) {
> + if (!NODE_DATA(i))
> + continue;
> + pgdat = NODE_DATA(i);
> + size = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
> + if (!size)
> + continue;
> + hsp = pgdat->node_zones[ZONE_HIGHMEM].zone_mem_map;
> + if (hsp)
> + break;
> + }
>
> Doesn't this just find the lowest-numbered node's highmem? Are you sure
> that no NUMA systems have memory at lower physical addresses on
> higher-numbered nodes? I'm not sure that this is true.
>
> + if (hsp)
> + highmem_start_page = hsp;
> + else
> + highmem_start_page = (struct page *)-1;
>
> Why not just BUG() here? Do you check for 'highmem_start_page == -1' somewhere?
>
> @@ -478,12 +482,35 @@ void __init mem_init(void)
> totalram_pages += __free_all_bootmem();
>
> reservedpages = 0;
> +
> +#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
> + for (nid = 0; nid < numnodes; nid++){
> + int start, end;
> +
> + if ( !node_online(nid))
> + continue;
> + if ( node_start_pfn[nid] >= max_low_pfn )
> + break;
> +
> + start = node_start_pfn[nid];
> + end = ( node_end_pfn[nid] < max_low_pfn) ?
> + node_end_pfn[nid] : max_low_pfn;
> +
> + for ( tmp = start; tmp < end; tmp++)
> + /*
> + * Only count reserved RAM pages
> + */
> + if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
> + reservedpages++;
> + }
> +#else
>
> Again, I don't see what this loop is used for. You appear to be trying
> to detect which nodes have lowmem. Is there currently any x86 NUMA
> architecture that has lowmem on any node but node 0?
>
>
>
> -- Dave
--
Yasunori Goto <ygoto at us.fujitsu.com>
On Thu, 2004-06-24 at 15:19, Yasunori Goto wrote:
> BTW, I have a question about the nonlinear patch.
> It is about the difference between phys_section[] and mem_section[].
> phys_section[] looks meaningless to me now.
> If it isn't necessary, the __va() and __pa() translations can be simpler.
> What is the purpose of phys_section[]? Is it for ppc64?
This is the fun (read: confusing) part of nonlinear.
The mem_section[] array is where the pointer to the mem_map for the
section is stored, obviously. It's indexed virtually, so that something
at a virtual address is in section number (address >> SECTION_SHIFT).
So, that makes it easy to go from a virtual address to a 'struct page'
inside of the mem_map[].
But, given a physical address (or a pfn for that matter), you sometimes
also need to get to a 'struct page'. It is for that reason that we have
the phys_section[] array. Each entry in the phys_section[] points back
to a mem_section[], which then contains the mem_map[].
pfn_to_page(unsigned long pfn)
{
return
&mem_section[phys_section[pfn_to_section(pfn)]].mem_map[section_offset_pfn(pfn)];
}
pfn_to_section(pfn) does a (pfn >> (SECTION_SHIFT - PAGE_SHIFT)), then
uses that section number to index into the phys_section[] array, which
gives an index into the mem_section[] array, from which you can get the
'struct page'.
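The one-liner above can be broken into named steps. Here it is as a self-contained userspace sketch; the shifts, array sizes, and `_sim` names are illustrative, not the patch's real values:

```c
#include <stddef.h>

/* Illustrative: 4 KB pages, 1 GB sections. */
#define PAGE_SHIFT        12
#define SECTION_SHIFT     30
#define PFN_SECTION_SHIFT (SECTION_SHIFT - PAGE_SHIFT)
#define MAX_SECTIONS      4

struct page { unsigned long dummy; };

struct mem_section_sim {
    struct page *mem_map;    /* base of this section's page array */
};

static struct mem_section_sim mem_section[MAX_SECTIONS]; /* virtual index */
static int phys_section[MAX_SECTIONS];   /* physical index -> virtual index */

/* Physical section number of a pfn. */
static unsigned long pfn_to_section(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

/* Offset of a pfn within its section. */
static unsigned long section_offset_pfn(unsigned long pfn)
{
    return pfn & ((1UL << PFN_SECTION_SHIFT) - 1);
}

/* The pfn_to_page() chain, broken up into named steps. */
static struct page *pfn_to_page_sim(unsigned long pfn)
{
    unsigned long phys_idx = pfn_to_section(pfn); /* physical section no. */
    int mem_idx = phys_section[phys_idx];         /* virtual section no.  */
    struct page *map = mem_section[mem_idx].mem_map;
    return &map[section_offset_pfn(pfn)];
}
```

Reading it bottom-up: pfn → physical section via a shift, physical section → virtual section via phys_section[], and virtual section → 'struct page' via that section's mem_map.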
-- Dave
I understand this idea at last.
The section size of DLPAR on PPC is only 16MB,
but the kmalloc area of virtual address space has to be contiguous
even if the area is divided into 16MB chunks physically.
Dave-san's implementation (which was for IA32) used the same index
for phys_section and mem_section, so I was confused.
> pfn_to_page(unsigned long pfn)
> {
> return
> &mem_section[phys_section[pfn_to_section(pfn)]].mem_map[section_offset_pfn(pfn)];
> }
>
But I suppose this translation might be too complex.
I worry that many people won't like it because it causes
performance deterioration.
Should this translation be in common code?
Bye.
--
Yasunori Goto <ygoto at us.fujitsu.com>
On Thu, 2004-06-24 at 20:11, Yasunori Goto wrote:
> I understand this idea at last.
> The section size of DLPAR on PPC is only 16MB,
> but the kmalloc area of virtual address space has to be contiguous
> even if the area is divided into 16MB chunks physically.
> Dave-san's implementation (which was for IA32) used the same index
> for phys_section and mem_section, so I was confused.
>
> > pfn_to_page(unsigned long pfn)
> > {
> > return
> > &mem_section[phys_section[pfn_to_section(pfn)]].mem_map[section_offset_pfn(pfn)];
> > }
> >
>
> But I suppose this translation might be too complex.
It certainly doesn't look pretty, but I think it's manageable with a
comment, or maybe breaking the operation up into a few lines instead.
> I worry that many people won't like it because it causes
> performance deterioration.
There is some precedent in the kernel for a table such as this. Take a
look at the NUMA page_to_pfn() and page_zone() functions. They use a
zone_table array to do that same kind of thing.
Are you worried about the pfn_to_page() function itself, that it will
pull in 2 cachelines of data: 1 for phys_section[] and another for
mem_section[]?
> Should this translation be in common code?
What do you mean by common code? It should be shared by all
architectures.
-- Dave
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Yasunori Goto
> Sent: Thursday, June 24, 2004 15:20
> To: Dave Hansen
>
> > Some more comments on the first patch:
> > + for(i = 0; i < numnodes; i++) {
> > + if (!NODE_DATA(i))
> > + continue;
> > + pgdat = NODE_DATA(i);
> > + size = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
> > + if (!size)
> > + continue;
> > + hsp = pgdat->node_zones[ZONE_HIGHMEM].zone_mem_map;
> > + if (hsp)
> > + break;
> > + }
> >
> > Doesn't this just find the lowest-numbered node's highmem? Are you sure
> > that no NUMA systems have memory at lower physical addresses on
> > higher-numbered nodes? I'm not sure that this is true.
In addition, I'm involved in a NUMA-related project that might have
ZONE_NORMAL on nodes other than node 0. I also think that in some cases it
might be useful to have the code above and below for AMD machines
that have less than 1GB per processor (or at least less than 1GB on the
FIRST processor).
> > +
> > +#ifdef CONFIG_HOTPLUG_MEMORY_OF_NODE
> > + for (nid = 0; nid < numnodes; nid++){
> > + int start, end;
> > +
> > + if ( !node_online(nid))
> > + continue;
> > + if ( node_start_pfn[nid] >= max_low_pfn )
> > + break;
> > +
> > + start = node_start_pfn[nid];
> > + end = ( node_end_pfn[nid] < max_low_pfn) ?
> > + node_end_pfn[nid] : max_low_pfn;
> > +
> > + for ( tmp = start; tmp < end; tmp++)
> > + /*
> > + * Only count reserved RAM pages
> > + */
> > + if (page_is_ram(tmp) &&
> PageReserved(pfn_to_page(tmp)))
> > + reservedpages++;
> > + }
> > +#else
> >
> > Again, I don't see what this loop is used for. You appear to be trying
> > to detect which nodes have lowmem. Is there currently any x86 NUMA
> > architecture that has lowmem on any node but node 0?
> >
> >
> >
> > -- Dave
As noted above, this is possible, the cost of this code is not much, so I
would keep it in.
--shai
On Thu, 2004-06-24 at 21:49, Shai Fultheim wrote:
> > > Doesn't this just find the lowest-numbered node's highmem? Are you sure
> > > that no NUMA systems have memory at lower physical addresses on
> > > higher-numbered nodes? I'm not sure that this is true.
>
> In addition, I'm involved in a NUMA-related project that might have
> ZONE_NORMAL on nodes other than node 0. I also think that in some cases it
> might be useful to have the code above and below for AMD machines
> that have less than 1GB per processor (or at least less than 1GB on the
> FIRST processor).
But, this code is just for i386 processors. Do you have a NUMA AMD i386
system?
> > > Again, I don't see what this loop is used for. You appear to be trying
> > > to detect which nodes have lowmem. Is there currently any x86 NUMA
> > > architecture that has lowmem on any node but node 0?
>
> As noted above, this is possible, the cost of this code is not much, so I
> would keep it in.
OK, I'll revise and say that it's impossible for all of the in-tree NUMA
systems. I'd heavily encourage you to post your code so that we can
more easily understand what kind of system you have. It's very hard to
analyze impact on systems that we've never seen code for.
In any case, I believe that the original loop should be kept pretty
close to what is there now:
for (tmp = 0; tmp < max_low_pfn; tmp++)
/*
* Only count reserved RAM pages
*/
if (page_is_ram(tmp) && PageReserved(pfn_to_page(tmp)))
reservedpages++;
If you do, indeed, have non-ram pages between pfns 0 and max_low_pfn,
I'd suggest doing something like this:
if (page_is_ram(tmp) &&
node_online(page_to_nid(tmp)) &&
PageReserved(pfn_to_page(tmp)))
reservedpages++;
That's a lot cleaner and more likely to work than replacing the entire
loop with an ifdef.
-- Dave
>
> > Should this translation be in common code?
>
> What do you mean by common code? It should be shared by all
> architectures.
If the physical memory chunk size is larger than the area that
must be contiguous (like IA32's kmalloc area),
there is no merit in this code.
So, I thought mem_section alone would be enough.
But I don't know about the other architectures yet, and I'm not sure.
Are you sure that all architectures need phys_section?
Bye.
--
Yasunori Goto <ygoto at us.fujitsu.com>
On Fri, 2004-06-25 at 11:48, Yasunori Goto wrote:
> >
> > > Should this translation be in common code?
> >
> > What do you mean by common code? It should be shared by all
> > architectures.
>
> If the physical memory chunk size is larger than the area that
> must be contiguous (like IA32's kmalloc area),
> there is no merit in this code.
> So, I thought mem_section alone would be enough.
> But I don't know about the other architectures yet, and I'm not sure.
>
> Are you sure that all architectures need phys_section?
You don't *need* it, but the alternative is a scan of the mem_section[]
array, which would be much, much slower.
Do you have an idea for an alternate implementation?
-- Dave
> > Are you sure that all architectures need phys_section?
>
> You don't *need* it, but the alternative is a scan of the mem_section[]
> array, which would be much, much slower.
>
> Do you have an idea for an alternate implementation?
I didn't see why a scan of mem_section[] would be necessary.
I thought the mem_section index simply equals the phys_section index.
May I ask why a scan of mem_section[] is necessary?
I might still have misunderstood something.
--
Yasunori Goto <ygoto at us.fujitsu.com>
On Fri, 2004-06-25 at 13:45, Yasunori Goto wrote:
> > > Are you sure that all architectures need phys_section?
> >
> > You don't *need* it, but the alternative is a scan of the mem_section[]
> > array, which would be much, much slower.
> >
> > Do you have an idea for an alternate implementation?
>
> I didn't see why a scan of mem_section[] would be necessary.
> I thought the mem_section index simply equals the phys_section index.
> May I ask why a scan of mem_section[] is necessary?
> I might still have misunderstood something.
For now, the indexes happen to be the same. However, for discontiguous
memory systems, this will not be the case:
mem | phys
----+-----
|
|
|
|
-- Dave
Whoops. Hit CTRL-Enter during my ASCII art :)
On Fri, 2004-06-25 at 13:45, Yasunori Goto wrote:
> > > Are you sure that all architectures need phys_section?
> >
> > You don't *need* it, but the alternative is a scan of the mem_section[]
> > array, which would be much, much slower.
> >
> > Do you have an idea for an alternate implementation?
>
> I didn't see why a scan of mem_section[] would be necessary.
> I thought the mem_section index simply equals the phys_section index.
> May I ask why a scan of mem_section[] is necessary?
> I might still have misunderstood something.
For now, the indexes happen to be the same. However, for discontiguous
memory systems, this will not be the case.
Consider a system with 3 GB of RAM, 2 GB @ 0x00000000 and 1 GB @ 0xC0000000,
with 1 GB sections. The arrays would look like this:
mem | phys
----+-----
0 | 0
1 | 1
2 | 3
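The same example as a small runnable sketch; the NO_SECTION marker and the hard-coded array contents are illustrative:

```c
/* Dave's example: 3 GB of RAM as 2 GB @ 0x00000000 plus 1 GB @ 0xC0000000,
 * with 1 GB sections. Physical sections 0, 1 and 3 are populated; the
 * virtually-indexed mem_section[] packs them densely into slots 0, 1, 2. */
#define NO_SECTION (-1)

static const int phys_section[4] = { 0, 1, NO_SECTION, 2 };

/* Physical address -> index into the dense mem_section[] array. */
static int paddr_to_mem_index(unsigned long long paddr)
{
    return phys_section[paddr >> 30];   /* 1 GB section shift */
}
```

The hole at physical section 2 never consumes a mem_section[] slot, which is how sparse physical addresses get represented in a linear fashion.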
See the B-lpfn patch that I posted today for why this is important. It
basically allows us to represent sparse physical addresses in a much
more linear fashion.
-- Dave