2006-11-23 01:34:50

by Rohit Seth

Subject: [Patch1/4]: fake numa for x86_64 patch

This patch adds a function that returns the IO hole size in a given address range.

Signed-off-by: David Rientjes <[email protected]>
Signed-off-by: Paul Menage <[email protected]>
Signed-off-by: Rohit Seth <[email protected]>

--- linux-2.6.19-rc5-mm2.org/include/asm-x86_64/e820.h 2006-11-22 12:20:39.000000000 -0800
+++ linux-2.6.19-rc5-mm2/include/asm-x86_64/e820.h 2006-11-22 12:17:25.000000000 -0800
@@ -46,6 +46,7 @@ extern void e820_mark_nosave_regions(voi
extern void e820_print_map(char *who);
extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
+extern unsigned long e820_hole_size(unsigned long start, unsigned long end);

extern void e820_setup_gap(void);
extern void e820_register_active_regions(int nid,
--- linux-2.6.19-rc5-mm2.org/arch/x86_64/kernel/e820.c 2006-11-22 12:20:55.000000000 -0800
+++ linux-2.6.19-rc5-mm2/arch/x86_64/kernel/e820.c 2006-11-21 18:48:15.000000000 -0800
@@ -184,6 +184,38 @@ unsigned long __init e820_end_of_ram(voi
}

/*
+ * Find the hole size in the range.
+ */
+unsigned long __init e820_hole_size(unsigned long start, unsigned long end)
+{
+	unsigned long ram = 0;
+	int i;
+
+	for (i = 0; i < e820.nr_map; i++) {
+		struct e820entry *ei = &e820.map[i];
+		unsigned long last, addr;
+
+		if (ei->type != E820_RAM ||
+		    ei->addr+ei->size <= start ||
+		    ei->addr >= end)
+			continue;
+
+		addr = round_up(ei->addr, PAGE_SIZE);
+		if (addr < start)
+			addr = start;
+
+		last = round_down(ei->addr + ei->size, PAGE_SIZE);
+		if (last >= end)
+			last = end;
+
+		if (last > addr)
+			ram += last - addr;
+	}
+	return ((end - start) - ram);
+}
+
+
+/*
* Mark e820 reserved areas as busy for the resource manager.
*/
void __init e820_reserve_resources(void)
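
The arithmetic in e820_hole_size() can be tried outside the kernel. The following is only a userspace sketch of the same loop: struct entry, page_round_up(), page_round_down(), hole_size() and sample_map are stand-ins invented for this illustration (they are not the kernel's definitions), and the memory map is made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE     4096ULL
#define E820_RAM      1
#define E820_RESERVED 2

/* Userspace stand-in for an e820 entry; an assumption for this sketch. */
struct entry {
	uint64_t addr;
	uint64_t size;
	unsigned type;
};

static uint64_t page_round_up(uint64_t x)
{
	return (x + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
}

static uint64_t page_round_down(uint64_t x)
{
	return x & ~(PAGE_SIZE - 1);
}

/* Bytes of [start, end) that are not covered by whole pages of RAM. */
static uint64_t hole_size(const struct entry *map, size_t nr,
			  uint64_t start, uint64_t end)
{
	uint64_t ram = 0;
	size_t i;

	assert(start <= end);
	for (i = 0; i < nr; i++) {
		uint64_t addr, last;

		/* Skip non-RAM entries and entries outside the range. */
		if (map[i].type != E820_RAM ||
		    map[i].addr + map[i].size <= start ||
		    map[i].addr >= end)
			continue;

		/* Clamp the page-aligned entry to [start, end). */
		addr = page_round_up(map[i].addr);
		if (addr < start)
			addr = start;
		last = page_round_down(map[i].addr + map[i].size);
		if (last >= end)
			last = end;

		if (last > addr)
			ram += last - addr;
	}
	return (end - start) - ram;
}

/* Made-up map: low RAM, a reserved sliver, RAM up to 3 GiB, RAM above 4 GiB. */
static const struct entry sample_map[] = {
	{ 0x0ULL,         0x9fc00ULL,    E820_RAM },
	{ 0x9fc00ULL,     0x400ULL,      E820_RESERVED },
	{ 0x100000ULL,    0xbff00000ULL, E820_RAM },
	{ 0x100000000ULL, 0x40000000ULL, E820_RAM },
};
```

With this sample map, the range from 3 GiB to 4 GiB is reported as entirely hole, while the range from 1 MiB to 3 GiB has no hole at all.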



2006-11-27 13:18:40

by Mel Gorman

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

On Wed, 22 Nov 2006, Rohit Seth wrote:

> This patch adds a function that returns the IO hole size in a given address range.
>

Hi,

This patch reintroduces a function that doubles up what
absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
you are interested in hole sizes before add_active_range() is called.
However, what is not clear is why these patches are so specific to x86_64.

It looks possible to do the work of functions like split_nodes_equal() in
an architecture-independent manner using early_node_map rather than
dealing with the arch-specific nodes array. That would open the
possibility of providing fake nodes on more than one architecture in the
future.

What I think can be done is that you register memory as normal and then
split up the nodes into fake nodes. This would remove the need for having
e820_hole_size() reintroduced.
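
As a rough illustration of that idea (not code from this patch set), the split could work on ranges that have already been registered, so holes never need to be measured separately. In the sketch below, struct range, split_into_fake_nodes() and the registered[] sample are hypothetical names for this example only; it does not use the real early_node_map[] interface.

```c
#include <assert.h>

/* Hypothetical registered range [start_pfn, end_pfn) owned by node nid. */
struct range {
	unsigned long start_pfn;
	unsigned long end_pfn;
	int nid;
};

/*
 * Re-label sorted, non-overlapping ranges so that each of `nodes` fake
 * nodes gets an equal share of present pages; the last node absorbs any
 * remainder. Returns the number of ranges written to out, or -1 if out
 * is too small or there are fewer pages than nodes.
 */
static int split_into_fake_nodes(const struct range *in, int nr_in, int nodes,
				 struct range *out, int max_out)
{
	unsigned long total = 0, per_node, left;
	int i, n = 0, nid = 0;

	assert(nodes > 0);
	for (i = 0; i < nr_in; i++)
		total += in[i].end_pfn - in[i].start_pfn;
	per_node = total / nodes;
	if (per_node == 0)
		return -1;

	left = per_node;
	for (i = 0; i < nr_in; i++) {
		unsigned long pfn = in[i].start_pfn;

		while (pfn < in[i].end_pfn) {
			unsigned long take = in[i].end_pfn - pfn;

			/* Cut the range where the current node is full. */
			if (nid < nodes - 1 && take > left)
				take = left;
			if (n == max_out)
				return -1;
			out[n].start_pfn = pfn;
			out[n].end_pfn = pfn + take;
			out[n].nid = nid;
			n++;
			pfn += take;

			if (nid < nodes - 1) {
				left -= take;
				if (left == 0) {
					nid++;
					left = per_node;
				}
			}
		}
	}
	return n;
}

/* Made-up layout: two ranges on node 0 with a hole between PFN 768 and 1024. */
static const struct range registered[] = {
	{ 0, 768, 0 },
	{ 1024, 2048, 0 },
};
```

Because only present pages are handed out, a fake node may end up spanning the hole (here node 1 covers the end of the first range and the start of the second), but no fake node can consist of hole alone.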

> Signed-off-by: David Rientjes <[email protected]>
> Signed-off-by: Paul Menage <[email protected]>
> Signed-off-by: Rohit Seth <[email protected]>
>
> --- linux-2.6.19-rc5-mm2.org/include/asm-x86_64/e820.h 2006-11-22 12:20:39.000000000 -0800
> +++ linux-2.6.19-rc5-mm2/include/asm-x86_64/e820.h 2006-11-22 12:17:25.000000000 -0800
> @@ -46,6 +46,7 @@ extern void e820_mark_nosave_regions(voi
> extern void e820_print_map(char *who);
> extern int e820_any_mapped(unsigned long start, unsigned long end, unsigned type);
> extern int e820_all_mapped(unsigned long start, unsigned long end, unsigned type);
> +extern unsigned long e820_hole_size(unsigned long start, unsigned long end);
>
> extern void e820_setup_gap(void);
> extern void e820_register_active_regions(int nid,
> --- linux-2.6.19-rc5-mm2.org/arch/x86_64/kernel/e820.c 2006-11-22 12:20:55.000000000 -0800
> +++ linux-2.6.19-rc5-mm2/arch/x86_64/kernel/e820.c 2006-11-21 18:48:15.000000000 -0800
> @@ -184,6 +184,38 @@ unsigned long __init e820_end_of_ram(voi
> }
>
> /*
> + * Find the hole size in the range.
> + */
> +unsigned long __init e820_hole_size(unsigned long start, unsigned long end)
> +{
> +	unsigned long ram = 0;
> +	int i;
> +
> +	for (i = 0; i < e820.nr_map; i++) {
> +		struct e820entry *ei = &e820.map[i];
> +		unsigned long last, addr;
> +
> +		if (ei->type != E820_RAM ||
> +		    ei->addr+ei->size <= start ||
> +		    ei->addr >= end)
> +			continue;
> +
> +		addr = round_up(ei->addr, PAGE_SIZE);
> +		if (addr < start)
> +			addr = start;
> +
> +		last = round_down(ei->addr + ei->size, PAGE_SIZE);
> +		if (last >= end)
> +			last = end;
> +
> +		if (last > addr)
> +			ram += last - addr;
> +	}
> +	return ((end - start) - ram);
> +}
> +
> +
> +/*
> * Mark e820 reserved areas as busy for the resource manager.
> */
> void __init e820_reserve_resources(void)
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-27 18:23:29

by Rohit Seth

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

Hi Mel,

On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:
> On Wed, 22 Nov 2006, Rohit Seth wrote:
>
> > This patch adds a function that returns the IO hole size in a given address range.
> >
>
> Hi,
>
> This patch reintroduces a function that doubles up what
> absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
> you are interested in hole sizes before add_active_range() is called.

Right.

>
> However, what is not clear is why these patches are so specific to x86_64.
>

Specifically in the fake numa case, we want to make sure that we don't
carve fake nodes that only have IO holes in them. Unlike the real NUMA
case, here we don't have SRAT etc. to know the memory layout beforehand.
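
To make the failure mode concrete, here is a small userspace sketch (an illustration only, not kernel code) of what happens when the address space is cut into equal slices without looking at the holes. struct ram_range, ram_in(), naive_node_ram() and the sample layout are all made up for this example; sizes are in MiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RAM range [start, end) in MiB; a hypothetical stand-in for the firmware map. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/* MiB of RAM inside [start, end); ranges are assumed not to overlap. */
static uint64_t ram_in(const struct ram_range *r, size_t nr,
		       uint64_t start, uint64_t end)
{
	uint64_t ram = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t lo = r[i].start > start ? r[i].start : start;
		uint64_t hi = r[i].end < end ? r[i].end : end;

		if (hi > lo)
			ram += hi - lo;
	}
	return ram;
}

/* RAM landing in slice idx when [0, top) is cut into equal address slices. */
static uint64_t naive_node_ram(const struct ram_range *r, size_t nr,
			       uint64_t top, unsigned nodes, unsigned idx)
{
	uint64_t len;

	assert(nodes > 0 && idx < nodes);
	len = top / nodes;
	return ram_in(r, nr, idx * len, (idx + 1) * len);
}

/* Made-up machine: 7 GiB of RAM with an IO hole between 3 GiB and 4 GiB. */
static const struct ram_range sample_ram[] = {
	{ 0, 3072 },
	{ 4096, 8192 },
};
```

With eight fake nodes over this layout, the slice covering 3 GiB to 4 GiB contains no RAM at all, which is the kind of node we want to avoid creating.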


> It looks possible to do the work of functions like split_nodes_equal() in
> an architecture-independent manner using early_node_map rather than
> dealing with the arch-specific nodes array. That would open the
> possibility of providing fake nodes on more than one architecture in the
> future.

The functions like split_nodes_equal etc. can be abstracted out to the arch
independent part. I think the only API they need from the arch dependent
part is to find out how much real RAM is present in a range without having
to first do add_active_range.

Though as a first step, let us fix x86_64 (as it doesn't boot when
you have a sizeable chunk of IO hole and nodes > 4).

I'm also not sure if other archs actually want to have this
functionality.

> What I think can be done is that you register memory as normal and then
> split up the nodes into fake nodes. This would remove the need for having
> e820_hole_size() reintroduced.

Are you saying first let the system find out real numa topology and then
build fake numa on top of it?

-rohit

2006-11-28 13:24:54

by Mel Gorman

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

On Mon, 27 Nov 2006, Rohit Seth wrote:

> Hi Mel,
>
> On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:
>> On Wed, 22 Nov 2006, Rohit Seth wrote:
>>
>>> This patch adds a function that returns the IO hole size in a given address range.
>>>
>>
>> Hi,
>>
>> This patch reintroduces a function that doubles up what
>> absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
>> you are interested in hole sizes before add_active_range() is called.
>
> Right.
>
>>
>> However, what is not clear is why these patches are so specific to x86_64.
>>
>
> Specifically in the fake numa case, we want to make sure that we don't
> carve fake nodes that only have IO holes in them. Unlike the real NUMA
> case, here we don't have SRAT etc. to know the memory layout beforehand.
>
>
>> It looks possible to do the work of functions like split_nodes_equal() in
>> an architecture-independent manner using early_node_map rather than
>> dealing with the arch-specific nodes array. That would open the
>> possibility of providing fake nodes on more than one architecture in the
>> future.
>
> The functions like split_nodes_equal etc. can be abstracted out to the arch
> independent part. I think the only API they need from the arch dependent
> part is to find out how much real RAM is present in a range without having
> to first do add_active_range.
>

That is a problem because the ranges must be registered with
add_active_range() to work out how much real RAM is present.

> Though as a first step, let us fix x86_64 (as it doesn't boot when
> you have a sizeable chunk of IO hole and nodes > 4).
>

Ok.

> I'm also not sure if other archs actually want to have this
> functionality.
>

It's possible that the containers people are interested in the possibility
of setting up fake nodes as part of a memory controller.

>> What I think can be done is that you register memory as normal and then
>> split up the nodes into fake nodes. This would remove the need for having
>> e820_hole_size() reintroduced.
>
> Are you saying first let the system find out real numa topology and then
> build fake numa on top of it?
>

Yes, there is nothing stopping you altering the early_node_map[] before
free_area_init_node() initialises the node_mem_map. If you do hit a
problem, it'll be because x86_64 allocates its own node_mem_map when
CONFIG_FLAT_NODE_MEM_MAP is set. Is that set when setting up fake nodes?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-28 18:07:34

by Rohit Seth

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

On Tue, 2006-11-28 at 13:24 +0000, Mel Gorman wrote:
> On Mon, 27 Nov 2006, Rohit Seth wrote:
>
> > Hi Mel,
> >
> > On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:
> >> On Wed, 22 Nov 2006, Rohit Seth wrote:
> >>
> >>> This patch adds a function that returns the IO hole size in a given address range.
> >>>
> >>
> >> Hi,
> >>
> >> This patch reintroduces a function that doubles up what
> >> absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
> >> you are interested in hole sizes before add_active_range() is called.
> >
> > Right.
> >
> >>
> >> However, what is not clear is why these patches are so specific to x86_64.
> >>
> >
> > Specifically in the fake numa case, we want to make sure that we don't
> > carve fake nodes that only have IO holes in them. Unlike the real NUMA
> > case, here we don't have SRAT etc. to know the memory layout beforehand.
> >
> >
> >> It looks possible to do the work of functions like split_nodes_equal() in
> >> an architecture-independent manner using early_node_map rather than
> >> dealing with the arch-specific nodes array. That would open the
> >> possibility of providing fake nodes on more than one architecture in the
> >> future.
> >
> > The functions like split_nodes_equal etc. can be abstracted out to the arch
> > independent part. I think the only API they need from the arch dependent
> > part is to find out how much real RAM is present in a range without having
> > to first do add_active_range.
> >
>
> That is a problem because the ranges must be registered with
> add_active_range() to work out how much real RAM is present.
>

Right. And that is why I need e820_hole_size functionality. BTW, what
is the concern in having that function?

> > Though as a first step, let us fix x86_64 (as it doesn't boot when
> > you have a sizeable chunk of IO hole and nodes > 4).
> >
>
> Ok.
>
> > I'm also not sure if other archs actually want to have this
> > functionality.
> >
>
> It's possible that the containers people are interested in the possibility
> of setting up fake nodes as part of a memory controller.
>
That is precisely why I'm doing it :-)

> >> What I think can be done is that you register memory as normal and then
> >> split up the nodes into fake nodes. This would remove the need for having
> >> e820_hole_size() reintroduced.
> >
> > Are you saying first let the system find out real numa topology and then
> > build fake numa on top of it?
> >
>
> Yes, there is nothing stopping you altering the early_node_map[] before
> free_area_init_node() initialises the node_mem_map. If you do hit a
> problem, it'll be because x86_64 allocates its own node_mem_map when
> CONFIG_FLAT_NODE_MEM_MAP is set. Is that set when setting up fake nodes?
>

I thought they both (real numa + fake numa) operate on the same data
structures. I'll have to double check.

-rohit

2006-11-28 21:34:48

by Mel Gorman

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

On Tue, 28 Nov 2006, Rohit Seth wrote:

> On Tue, 2006-11-28 at 13:24 +0000, Mel Gorman wrote:
>> On Mon, 27 Nov 2006, Rohit Seth wrote:
>>
>>> Hi Mel,
>>>
>>> On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:
>>>> On Wed, 22 Nov 2006, Rohit Seth wrote:
>>>>
>>>>> This patch adds a function that returns the IO hole size in a given address range.
>>>>>
>>>>
>>>> Hi,
>>>>
>>>> This patch reintroduces a function that doubles up what
>>>> absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
>>>> you are interested in hole sizes before add_active_range() is called.
>>>
>>> Right.
>>>
>>>>
>>>> However, what is not clear is why these patches are so specific to x86_64.
>>>>
>>>
>>> Specifically in the fake numa case, we want to make sure that we don't
>>> carve fake nodes that only have IO holes in them. Unlike the real NUMA
>>> case, here we don't have SRAT etc. to know the memory layout beforehand.
>>>
>>>
>>>> It looks possible to do the work of functions like split_nodes_equal() in
>>>> an architecture-independent manner using early_node_map rather than
>>>> dealing with the arch-specific nodes array. That would open the
>>>> possibility of providing fake nodes on more than one architecture in the
>>>> future.
>>>
>>> The functions like split_nodes_equal etc. can be abstracted out to the arch
>>> independent part. I think the only API they need from the arch dependent
>>> part is to find out how much real RAM is present in a range without having
>>> to first do add_active_range.
>>>
>>
>> That is a problem because the ranges must be registered with
>> add_active_range() to work out how much real RAM is present.
>>
>
> Right. And that is why I need e820_hole_size functionality. BTW, what
> is the concern in having that function?
>

Because it provides almost identical functionality to another function. If
that can be avoided, it's preferable.

>>> Though as a first step, let us fix x86_64 (as it doesn't boot when
>>> you have a sizeable chunk of IO hole and nodes > 4).
>>>
>>
>> Ok.
>>
>>> I'm also not sure if other archs actually want to have this
>>> functionality.
>>>
>>
>> It's possible that the containers people are interested in the possibility
>> of setting up fake nodes as part of a memory controller.
>>
> That is precisely why I'm doing it :-)
>
>>>> What I think can be done is that you register memory as normal and then
>>>> split up the nodes into fake nodes. This would remove the need for having
>>>> e820_hole_size() reintroduced.
>>>
>>> Are you saying first let the system find out real numa topology and then
>>> build fake numa on top of it?
>>>
>>
>> Yes, there is nothing stopping you altering the early_node_map[] before
>> free_area_init_node() initialises the node_mem_map. If you do hit a
>> problem, it'll be because x86_64 allocates its own node_mem_map when
>> CONFIG_FLAT_NODE_MEM_MAP is set. Is that set when setting up fake nodes?
>>
>
> I thought they both (real numa + fake numa) operate on the same data
> structures. I'll have to double check.
>
> -rohit
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2006-11-28 22:21:25

by Rohit Seth

Subject: Re: [Patch1/4]: fake numa for x86_64 patch

On Tue, 2006-11-28 at 21:34 +0000, Mel Gorman wrote:
> On Tue, 28 Nov 2006, Rohit Seth wrote:
>
> > On Tue, 2006-11-28 at 13:24 +0000, Mel Gorman wrote:
> >> On Mon, 27 Nov 2006, Rohit Seth wrote:
> >>
> >>> Hi Mel,
> >>>
> >>> On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:
> >>>> On Wed, 22 Nov 2006, Rohit Seth wrote:
> >>>>
> >>>>> This patch adds a function that returns the IO hole size in a given address range.
> >>>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> This patch reintroduces a function that doubles up what
> >>>> absent_pages_in_range(start_pfn, end_pfn) does. I recognise you do this because
> >>>> you are interested in hole sizes before add_active_range() is called.
> >>>
> >>> Right.
> >>>
> >>>>
> >>>> However, what is not clear is why these patches are so specific to x86_64.
> >>>>
> >>>
> >>> Specifically in the fake numa case, we want to make sure that we don't
> >>> carve fake nodes that only have IO holes in them. Unlike the real NUMA
> >>> case, here we don't have SRAT etc. to know the memory layout beforehand.
> >>>
> >>>
> >>>> It looks possible to do the work of functions like split_nodes_equal() in
> >>>> an architecture-independent manner using early_node_map rather than
> >>>> dealing with the arch-specific nodes array. That would open the
> >>>> possibility of providing fake nodes on more than one architecture in the
> >>>> future.
> >>>
> >>> The functions like split_nodes_equal etc. can be abstracted out to the arch
> >>> independent part. I think the only API they need from the arch dependent
> >>> part is to find out how much real RAM is present in a range without having
> >>> to first do add_active_range.
> >>>
> >>
> >> That is a problem because the ranges must be registered with
> >> add_active_range() to work out how much real RAM is present.
> >>
> >
> > Right. And that is why I need e820_hole_size functionality. BTW, what
> > is the concern in having that function?
> >
>
> Because it provides almost identical functionality to another function. If
> that can be avoided, it's preferable.
>

There are subtle differences in the way the two functions can be used. They
operate in two different environments. absent_pages_in_range() works once
the memory layout is already registered. e820_hole_size() is the (low
level, arch dependent) function that will be used to work out how the
memory layout is going to be set up in the cases where the kernel itself
has to decide on the layout.
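
As a sketch of that second use (an illustration under assumptions, not the code from the later patches in this series), a hole-size query lets the kernel pick a fake node's end address before anything is registered: grow the node until the span minus its holes reaches the wanted amount of RAM. struct ram_range, hole_in(), node_end() and the sample layout are hypothetical; sizes are in MiB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RAM range [start, end) in MiB; stands in for the unregistered firmware map. */
struct ram_range {
	uint64_t start;
	uint64_t end;
};

/* MiB of [start, end) not backed by RAM; ranges are assumed not to overlap. */
static uint64_t hole_in(const struct ram_range *r, size_t nr,
			uint64_t start, uint64_t end)
{
	uint64_t ram = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t lo = r[i].start > start ? r[i].start : start;
		uint64_t hi = r[i].end < end ? r[i].end : end;

		if (hi > lo)
			ram += hi - lo;
	}
	return (end - start) - ram;
}

/*
 * End address of a node that begins at start and should hold `want` MiB
 * of RAM. The end is pushed out in `step` MiB increments while holes eat
 * into the span, and is capped at top.
 */
static uint64_t node_end(const struct ram_range *r, size_t nr, uint64_t start,
			 uint64_t want, uint64_t step, uint64_t top)
{
	uint64_t end = start + want;

	assert(step > 0 && start < top);
	if (end > top)
		end = top;
	while (end < top && (end - start) - hole_in(r, nr, start, end) < want) {
		end += step;
		if (end > top)
			end = top;
	}
	return end;
}

/* Made-up machine: 7 GiB of RAM with an IO hole between 3 GiB and 4 GiB. */
static const struct ram_range sample_ram[] = {
	{ 0, 3072 },
	{ 4096, 8192 },
};
```

For a node that starts at the 3 GiB mark and wants 1 GiB of RAM, the end is pushed out to 5 GiB so that the node skips over the hole instead of consisting of it.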

-rohit