2006-09-10 02:42:10

by Andy Whitcroft

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

keith mannthey wrote:
> Hello,
> In the current i386 numa code, the numa_kva space (the area used to remap
> node-local data in lowmem) is acquired by adjusting the end of low
> memory during boot.
>
> (from setup_memory)
> reserve_pages = calculate_numa_remap_pages();
> (then)
> system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
>
> The problem with this is that initrds can be trampled over (the kva can
> adjust system_max_low_pfn into the initrd area). This results in the
> kernel throwing away the initrd and a failed boot. This is a
> long-standing issue (it has been like this for at least the last few years).
>
> This patch keeps the numa kva code from adjusting the end of memory and
> converts it to just use the reserve_bootmem call to reserve the large
> amount of space needed for the numa_kva. It is mindful of initrds when
> present.
>
> This patch was built against 2.6.17-rc1 originally but applies and boots
> against 2.6.17 just fine. I have only tested this against the summit
> subarch (I don't have other i386 numa hw).
>
> all feedback welcome!
>
> Signed-off-by: Keith Mannthey <[email protected]>
>
>
> ------------------------------------------------------------------------
>
> diff -urN linux-2.6.17/arch/i386/kernel/setup.c linux-2.6.17-work/arch/i386/kernel/setup.c
> --- linux-2.6.17/arch/i386/kernel/setup.c 2006-06-17 18:49:35.000000000 -0700
> +++ linux-2.6.17-work/arch/i386/kernel/setup.c 2006-06-20 23:04:37.000000000 -0700
> @@ -1210,6 +1210,9 @@
> extern void zone_sizes_init(void);
> #endif /* !CONFIG_NEED_MULTIPLE_NODES */
>
> +#ifdef CONFIG_NUMA
> +extern void numa_kva_reserve(void);
> +#endif
> void __init setup_bootmem_allocator(void)
> {
> unsigned long bootmap_size;
> @@ -1265,7 +1268,9 @@
> */
> find_smp_config();
> #endif
> -
> +#ifdef CONFIG_NUMA
> + numa_kva_reserve();
> +#endif
> #ifdef CONFIG_BLK_DEV_INITRD
> if (LOADER_TYPE && INITRD_START) {
> if (INITRD_START + INITRD_SIZE <= (max_low_pfn << PAGE_SHIFT)) {
> diff -urN linux-2.6.17/arch/i386/mm/discontig.c linux-2.6.17-work/arch/i386/mm/discontig.c
> --- linux-2.6.17/arch/i386/mm/discontig.c 2006-06-17 18:49:35.000000000 -0700
> +++ linux-2.6.17-work/arch/i386/mm/discontig.c 2006-06-20 23:11:49.000000000 -0700
> @@ -118,7 +118,8 @@
>
> void *node_remap_end_vaddr[MAX_NUMNODES];
> void *node_remap_alloc_vaddr[MAX_NUMNODES];
> -
> +static unsigned long kva_start_pfn;
> +static unsigned long kva_pages;
> /*
> * FLAT - support for basic PC memory model with discontig enabled, essentially
> * a single node with all available processors in it with a flat
> @@ -287,7 +288,6 @@
> {
> int nid;
> unsigned long system_start_pfn, system_max_low_pfn;
> - unsigned long reserve_pages;
>
> /*
> * When mapping a NUMA machine we allocate the node_mem_map arrays
> @@ -299,14 +299,23 @@
> find_max_pfn();
> get_memcfg_numa();
>
> - reserve_pages = calculate_numa_remap_pages();
> + kva_pages = calculate_numa_remap_pages();
>
> /* partially used pages are not usable - thus round upwards */
> system_start_pfn = min_low_pfn = PFN_UP(init_pg_tables_end);
>
> - system_max_low_pfn = max_low_pfn = find_max_low_pfn() - reserve_pages;
> - printk("reserve_pages = %ld find_max_low_pfn() ~ %ld\n",
> - reserve_pages, max_low_pfn + reserve_pages);
> + kva_start_pfn = find_max_low_pfn() - kva_pages;
> +
> +#ifdef CONFIG_BLK_DEV_INITRD
> + /* Numa kva area is below the initrd */
> + if (LOADER_TYPE && INITRD_START)
> + kva_start_pfn = PFN_DOWN(INITRD_START) - kva_pages;
> +#endif
> + kva_start_pfn -= kva_start_pfn & (PTRS_PER_PTE-1);
> +
> + system_max_low_pfn = max_low_pfn = find_max_low_pfn();
> + printk("kva_start_pfn ~ %ld find_max_low_pfn() ~ %ld\n",
> + kva_start_pfn, max_low_pfn);
> printk("max_pfn = %ld\n", max_pfn);
> #ifdef CONFIG_HIGHMEM
> highstart_pfn = highend_pfn = max_pfn;
> @@ -324,7 +333,7 @@
> (ulong) pfn_to_kaddr(max_low_pfn));
> for_each_online_node(nid) {
> node_remap_start_vaddr[nid] = pfn_to_kaddr(
> - highstart_pfn + node_remap_offset[nid]);
> + kva_start_pfn + node_remap_offset[nid]);
> /* Init the node remap allocator */
> node_remap_end_vaddr[nid] = node_remap_start_vaddr[nid] +
> (node_remap_size[nid] * PAGE_SIZE);
> @@ -339,7 +348,6 @@
> }
> printk("High memory starts at vaddr %08lx\n",
> (ulong) pfn_to_kaddr(highstart_pfn));
> - vmalloc_earlyreserve = reserve_pages * PAGE_SIZE;
> for_each_online_node(nid)
> find_max_pfn_node(nid);
>
> @@ -349,6 +357,12 @@
> return max_low_pfn;
> }
>
> +void __init numa_kva_reserve (void)
> +{
> + reserve_bootmem(PFN_PHYS(kva_start_pfn),PFN_PHYS(kva_pages));
> +
> +}
> +
> void __init zone_sizes_init(void)
> {
> int nid;

The primary reason that the mem_map is cut from the end of ZONE_NORMAL
is so that the memory that would back that stolen KVA gets pushed out
into ZONE_HIGHMEM; the boundary between them is moved down. By using
reserve_bootmem we will mark the pages which are currently backing the
KVA you are 'reusing' as reserved and prevent their release; we pay
double for the mem_map.

If the initrds are falling into this space, can we not allocate some
bootmem for them and move them out of our way? As filesystem images
they are essentially location neutral, so this should be safe?

-apw


2006-09-11 18:50:34

by Keith Mannthey

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

On Sun, 2006-09-10 at 03:41 +0100, Andy Whitcroft wrote:
> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
> is so that memory that would back that stolen KVA gets pushed out into
> ZONE_HIGHMEM, the boundary between them is moved down. By using
> reserve_bootmem we will mark the pages which are currently backing the
> KVA you are 'reusing' as reserved and prevent their release; we pay
> double for the mem_map.

Perhaps just freeing the reserved pages and remapping them at an
appropriate time could accomplish this? Sorry, I don't know the KVA
"freeing" path; can you describe it a little more? When are these pages
returned to the system? It was my understanding that the KVA pages
were lost (the original way shrinks ZONE_NORMAL and creates a hole
between the zones).

> If the initrd's are falling into this space, can we not allocate some
> bootmem for those and move them out of our way? As filesystem images
> they are essentially location neutral so this should be safe?

AFAIK bootloaders choose where to map initrds. Grub seems to put it around
the top of ZONE_NORMAL but it is pretty free to map it where it wants. I
suppose INITRD_START, INITRD_END and all that could be dynamic and moved
around a bit, but it seems a little messy. I would rather see the special
case (i386 numa, the rare beast it is) jump through a few extra hoops
than muck with the initrd code.

Thanks,
Keith

2006-09-11 21:45:29

by Andy Whitcroft

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

keith mannthey wrote:
> On Sun, 2006-09-10 at 03:41 +0100, Andy Whitcroft wrote:
>> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
>> is so that memory that would back that stolen KVA gets pushed out into
>> ZONE_HIGHMEM, the boundary between them is moved down. By using
>> reserve_bootmem we will mark the pages which are currently backing the
>> KVA you are 'reusing' as reserved and prevent their release; we pay
>> double for the mem_map.
>
> Perhaps just freeing the reserve pages and remapping them at an
> appropriate time could accomplish this? Sorry I don't know the KVA
> "freeing" path can you describe it a little more? When are these pages
> returned to the system? It was my understanding that that KVA pages
> were lost (the original wayu shrinks ZONE_NORMAL and creates a hole
> between the zones).


No, it does seem like we lose the memory at the end of NORMAL when we
shrink it, but what really happens is we move the boundary down. Any page
above the boundary is then in HIGHMEM and available to be allocated.
>
>> If the initrd's are falling into this space, can we not allocate some
>> bootmem for those and move them out of our way? As filesystem images
>> they are essentially location neutral so this should be safe?
>
> AFAIK bootloaders choose where to map initrds. Grub seems to put it around
> the top of ZONE_NORMAL but it is pretty free to map it where it wants. I
> suppose INITRD_START, INITRD_END and all that could be dynamic and moved
> around a bit, but it seems a little messy. I would rather see the special
> case (i386 numa, the rare beast it is) jump through a few extra hoops
> than muck with the initrd code.

Right, we can't change where grub puts it. But doesn't it tell us where
it is as part of the kernel parameterisation? That would allow us to
move it out of our way and change the parameters to that new location,
allowing normal processing to find it in the new location.

I'd be interested to see the layout during boot on one of these boxes :).

-apw

2006-09-11 23:30:10

by Keith Mannthey

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

On Mon, 2006-09-11 at 22:44 +0100, Andy Whitcroft wrote:


> >> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
> >> is so that memory that would back that stolen KVA gets pushed out into
> >> ZONE_HIGHMEM, the boundary between them is moved down. By using
> >> reserve_bootmem we will mark the pages which are currently backing the
> >> KVA you are 'reusing' as reserved and prevent their release; we pay
> >> double for the mem_map.
> > Perhaps just freeing the reserved pages and remapping them at an
> > appropriate time could accomplish this? Sorry, I don't know the KVA
> > "freeing" path; can you describe it a little more? When are these pages
> > returned to the system? It was my understanding that the KVA pages
> > were lost (the original way shrinks ZONE_NORMAL and creates a hole
> > between the zones).
>
>
> No, it does seem like we lose the memory at the end of NORMAL when we
> shrink it, but what really happens is we move the boundary down. Any page
> above the boundary is then in HIGHMEM and available to be allocated.

How is it available for allocation? I see it is in highmem, but the
pmds for the kva area are set with node-local information. I don't see
any special code to reclaim the kva area or extend ZONE_HIGHMEM... How
does having the KVA area in ZONE_HIGHMEM allow you to reclaim it?
(Sorry if this is an easy question, but I am still sorting out how it is
"reclaimed" in the original implementation and why it can't be reclaimed
as part of ZONE_NORMAL.)

> >
> >> If the initrd's are falling into this space, can we not allocate some
> >> bootmem for those and move them out of our way? As filesystem images
> >> they are essentially location neutral so this should be safe?
> >
> > AFAIK bootloaders choose where to map initrds. Grub seems to put it around
> > the top of ZONE_NORMAL but it is pretty free to map it where it wants. I
> > suppose INITRD_START, INITRD_END and all that could be dynamic and moved
> > around a bit, but it seems a little messy. I would rather see the special
> > case (i386 numa, the rare beast it is) jump through a few extra hoops
> > than muck with the initrd code.
>
> Right, we can't change where grub puts it. But doesn't it tell us where
> it is as part of the kernel parameterisation? That would allow us to
> move it out of our way and change the parameters to that new location,
> allowing normal processing to find it in the new location.

Yea, we know right where the initrd is. All this code is running
before the bootmem allocator is even set up; in fact this function is
setting everything up to call setup_bootmem_allocator (at the end of the
function)...

Are you sure there isn't another way to reclaim these pages?

> Be interested to see the layout during boot on one of these boxes :).

It is as easy as booting with an initrd :) I can post some initrd
locations in a little while.

Thanks,
Keith




2006-09-12 08:26:39

by Andy Whitcroft

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

keith mannthey wrote:
> On Mon, 2006-09-11 at 22:44 +0100, Andy Whitcroft wrote:
>
>
>>>> The primary reason that the mem_map is cut from the end of ZONE_NORMAL
>>>> is so that memory that would back that stolen KVA gets pushed out into
>>>> ZONE_HIGHMEM, the boundary between them is moved down. By using
>>>> reserve_bootmem we will mark the pages which are currently backing the
>>>> KVA you are 'reusing' as reserved and prevent their release; we pay
>>>> double for the mem_map.
>>> Perhaps just freeing the reserved pages and remapping them at an
>>> appropriate time could accomplish this? Sorry, I don't know the KVA
>>> "freeing" path; can you describe it a little more? When are these pages
>>> returned to the system? It was my understanding that the KVA pages
>>> were lost (the original way shrinks ZONE_NORMAL and creates a hole
>>> between the zones).
>>
>> No, it does seem like we lose the memory at the end of NORMAL when we
>> shrink it, but what really happens is we move the boundary down. Any page
>> above the boundary is then in HIGHMEM and available to be allocated.
>
> How is it available for allocation? I see it is in highmem, but the
> pmds for the kva area are set with node-local information. I don't see
> any special code to reclaim the kva area or extend ZONE_HIGHMEM... How
> does having the KVA area in ZONE_HIGHMEM allow you to reclaim it?
> (Sorry if this is an easy question, but I am still sorting out how it is
> "reclaimed" in the original implementation and why it can't be reclaimed
> as part of ZONE_NORMAL.)

If the boundary is at page 1024 then pages 0-1024 are direct mapped into
KVA, page 1024 is not mapped at all and is part of HIGHMEM; when that zone
is freed up later those pages will be in the HIGHMEM zone and
allocatable. Move the boundary down to 1000 and 24 pages of KVA will
no longer be used as direct maps, and pages from 1000 onwards are in
zone HIGHMEM, not direct mapped, but available to be allocated. Cut a
section out of the middle of NORMAL and you just have to bin the pages
'behind' that KVA, as you can't persuade them to be part of HIGHMEM. You
can't adjust the boundary to allow for it.

The key here is that moving the end of NORMAL downwards moves the
beginning of HIGHMEM downwards also, pulling the pages into HIGHMEM
implicitly; that cannot be done anywhere but at the end of NORMAL.

>>>> If the initrd's are falling into this space, can we not allocate some
>>>> bootmem for those and move them out of our way? As filesystem images
>>>> they are essentially location neutral so this should be safe?
>>> AFAIK bootloaders choose where to map initrds. Grub seems to put it around
>>> the top of ZONE_NORMAL but it is pretty free to map it where it wants. I
>>> suppose INITRD_START, INITRD_END and all that could be dynamic and moved
>>> around a bit, but it seems a little messy. I would rather see the special
>>> case (i386 numa, the rare beast it is) jump through a few extra hoops
>>> than muck with the initrd code.
>> Right, we can't change where grub puts it. But doesn't it tell us where
>> it is as part of the kernel parameterisation? That would allow us to
>> move it out of our way and change the parameters to that new location,
>> allowing normal processing to find it in the new location.
>
> Yea, we know right where the initrd is. All this code is running
> before the bootmem allocator is even set up; in fact this function is
> setting everything up to call setup_bootmem_allocator (at the end of the
> function)...
>

Right, so the instant we detect it at the location grub specifies, can we
not just move it somewhere else random, like 20Mb up or something, and
change the pointer, then carry on?

> Are you sure there isn't another way to reclaim these pages?

Not that I am aware of; if a page is in NORMAL it is expected to be
direct mapped, and you have stolen their mapping for KVA, thus they have
to be junked.

>> Be interested to see the layout during boot on one of these boxes :).
>
> It is as easy as booting with an initrd :) I can post some initrd
> locations in a little while.

Cool.

-apw

2006-09-13 23:10:19

by Keith Mannthey

Subject: Re: [RFC] patch[1/1] i386 numa kva conversion to use bootmem reserve

On Tue, 2006-09-12 at 09:26 +0100, Andy Whitcroft wrote:
> >> Be interested to see the layout during boot on one of these boxes :).
> >
> > It is as easy as booting with an initrd :) I can post some initrd
> > locations in a little while.

kernel /vmlinuz-2.6.18-km ro root=/dev/VolGroup00/LogVol00
console=ttyS0,115200
console=tty0
[Linux-bzImage, setup=0x1e00, size=0x1ca562]
initrd /initrd-18km
[Linux-initrd @ 0x37ddf000, 0x21008a bytes]

Linux version 2.6.17 (root@elm3a25) (gcc version 4.1.1 20060817 (Red Hat
4.1.1-18)) #1 SMP Wed Sep 13 01:42:30 PDT 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009c400 (usable)
BIOS-e820: 000000000009c400 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 00000000eff91840 (usable)
BIOS-e820: 00000000eff91840 - 00000000eff9c340 (ACPI data)
BIOS-e820: 00000000eff9c340 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 00000001d0000000 (usable)
Node: 0, start_pfn: 0, end_pfn: 156
Node: 0, start_pfn: 256, end_pfn: 982929
Node: 0, start_pfn: 1048576, end_pfn: 1900544

My box isn't booting numa right now. My SRAT isn't getting loaded right;
I am looking at that first.

Thanks,
Keith