2006-03-17 08:23:34

by Yasunori Goto

Subject: [PATCH: 010/017]Memory hotplug for new nodes v.4.(allocate wait table)


The wait_table is sized according to the zone size at boot time. But when
memory hotplug is enabled, we cannot know the maximum zone size in advance:
it can change, and resizing the wait_table at runtime is too hard.

So the kernel allocates and initializes the wait_table at its maximum size
(4096 entries, i.e. 4096 * sizeof(wait_queue_head_t) bytes per zone).

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Yasunori Goto <[email protected]>

mm/page_alloc.c | 45 +++++++++++++++++++++++++++++++++++++++------
1 files changed, 39 insertions(+), 6 deletions(-)

Index: pgdat8/mm/page_alloc.c
===================================================================
--- pgdat8.orig/mm/page_alloc.c 2006-03-17 11:13:18.466550152 +0900
+++ pgdat8/mm/page_alloc.c 2006-03-17 11:19:32.371677992 +0900
@@ -1788,6 +1788,7 @@ void __init build_all_zonelists(void)
  */
 #define PAGES_PER_WAITQUEUE	256
 
+#ifndef CONFIG_MEMORY_HOTPLUG
 static inline unsigned long wait_table_size(unsigned long pages)
 {
 	unsigned long size = 1;
@@ -1806,6 +1807,17 @@ static inline unsigned long wait_table_s
 
 	return max(size, 4UL);
 }
+#else
+/*
+ * Because the zone size might be changed by hot-add, we can't
+ * determine a suitable size for the wait_table as we do at boot.
+ * So we use the maximum size.
+ */
+static inline unsigned long wait_table_size(unsigned long pages)
+{
+	return 4096UL;
+}
+#endif

/*
* This is an integer logarithm so that shifts can be used later
@@ -2074,7 +2086,7 @@ void __init setup_per_cpu_pageset(void)
#endif

 static __meminit
-void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
+int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
 {
 	int i;
 	struct pglist_data *pgdat = zone->zone_pgdat;
@@ -2085,12 +2097,37 @@ void zone_wait_table_init(struct zone *z
 	 */
 	zone->wait_table_size = wait_table_size(zone_size_pages);
 	zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
-	zone->wait_table = (wait_queue_head_t *)
-		alloc_bootmem_node(pgdat, zone->wait_table_size
-					* sizeof(wait_queue_head_t));
+	if (system_state == SYSTEM_BOOTING) {
+		zone->wait_table = (wait_queue_head_t *)
+			alloc_bootmem_node(pgdat, zone->wait_table_size
+						* sizeof(wait_queue_head_t));
+	} else {
+		int table_size;
+		/*
+		 * XXX: This is the case where a new node is hot-added.
+		 * kmalloc() cannot take memory from the new node yet,
+		 * because this wait_table must be initialized before the
+		 * new node itself can be used.  Making it use the new
+		 * node's own memory will need further consideration.
+		 */
+		do {
+			table_size = zone->wait_table_size
+					* sizeof(wait_queue_head_t);
+			zone->wait_table = kmalloc(table_size, GFP_KERNEL);
+			if (!zone->wait_table) {
+				/* try half size */
+				zone->wait_table_size >>= 1;
+				zone->wait_table_bits =
+					wait_table_bits(zone->wait_table_size);
+			}
+		} while (zone->wait_table_size && !zone->wait_table);
+	}
+	if (!zone->wait_table)
+		return -ENOMEM;
 
 	for(i = 0; i < zone->wait_table_size; ++i)
 		init_waitqueue_head(zone->wait_table + i);
+	return 0;
 }

 static __meminit void zone_pcp_init(struct zone *zone)
@@ -2116,8 +2153,10 @@ __meminit int init_currently_empty_zone(
 					unsigned long size)
 {
 	struct pglist_data *pgdat = zone->zone_pgdat;
-
-	zone_wait_table_init(zone, size);
+	int ret;
+	ret = zone_wait_table_init(zone, size);
+	if (ret)
+		return ret;
 	pgdat->nr_zones = zone_idx(zone) + 1;
 
 	zone->zone_start_pfn = zone_start_pfn;

--
Yasunori Goto



2006-03-17 17:54:37

by Dave Hansen

Subject: Re: [PATCH: 010/017]Memory hotplug for new nodes v.4.(allocate wait table)

On Fri, 2006-03-17 at 17:22 +0900, Yasunori Goto wrote:
> +#ifndef CONFIG_MEMORY_HOTPLUG
>  static inline unsigned long wait_table_size(unsigned long pages)
>  {
>  	unsigned long size = 1;
> @@ -1806,6 +1807,17 @@ static inline unsigned long wait_table_s
> 
>  	return max(size, 4UL);
>  }
> +#else
> +/*
> + * Because the zone size might be changed by hot-add, we can't
> + * determine a suitable size for the wait_table as we do at boot.
> + * So we use the maximum size.
> + */
> +static inline unsigned long wait_table_size(unsigned long pages)
> +{
> +	return 4096UL;
> +}
> +#endif

Ick. Is there really _no_ way to resize this at runtime? I know it
isn't an immediately easy thing to do, but we've really tried not to do
these kinds of things with memory hotplug in the past. The whole thing
would have been really easy if we could just preallocate everything
really big in the first place.

I don't think this has to be a super-fast, efficient, implementation.
Once the code has gone into the actual waitqueue code, it is already in
a slow path.

We could do something like this:

void fastcall wait_on_page_bit(struct page *page, int bit_nr)
{
	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);

	if (!test_bit(bit_nr, &page->flags))
		return;

	while (__wait_on_bit(page_waitqueue(page), &wait, sync_page,
				TASK_UNINTERRUPTIBLE));
}

And have a special case inside sync_page() to return -EAGAIN when a
waitqueue resize is going on. There is a race there if zone->wait_table
and zone->wait_table_bits do not hold matching values.

So, to do the update, you'd need to do something like this:

	set_waitqueue_resize_start(zone);
	// now all of the waiters will spin
	zone->wait_table = kmalloc();
	smp_wmb(); // make sure all the cpus see the kmalloc
	zone->wait_table_bits = new_bits;
	set_waitqueue_resize_done(zone);

Putting a seqlock next to wait_table_bits might also do the trick. I
need to think about it some more. BTW, I think this only works as-is
for the waiter side, not the wakers, but I think it can be made to
work in both cases.
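
For illustration, here is a minimal, untested sketch of the seqlock
variant; "wait_table_seq" is a hypothetical new field in struct zone,
not something in the tree today:

	/*
	 * Readers retry until wait_table and wait_table_bits are seen
	 * as a consistent pair, even if a resize runs underneath them.
	 */
	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		struct zone *zone = page_zone(page);
		wait_queue_head_t *wqh;
		unsigned seq;

		do {
			seq = read_seqbegin(&zone->wait_table_seq);
			wqh = &zone->wait_table[hash_ptr(page,
						zone->wait_table_bits)];
		} while (read_seqretry(&zone->wait_table_seq, seq));

		return wqh;
	}

	/* The resize side publishes both values under the write lock: */
	write_seqlock(&zone->wait_table_seq);
	zone->wait_table = new_table;
	zone->wait_table_bits = new_bits;
	write_sequnlock(&zone->wait_table_seq);

The remaining hole is the waker problem above: a task already asleep on
a head in the old table can be missed by a waker hashing into the new
one, so the resize would still have to wake or migrate the old table's
waiters before freeing it.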

>  /*
>   * This is an integer logarithm so that shifts can be used later
> @@ -2074,7 +2086,7 @@ void __init setup_per_cpu_pageset(void)
>  #endif
> 
>  static __meminit
> -void zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
> +int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
>  {
>  	int i;
>  	struct pglist_data *pgdat = zone->zone_pgdat;
> @@ -2085,12 +2097,37 @@ void zone_wait_table_init(struct zone *z
>  	 */
>  	zone->wait_table_size = wait_table_size(zone_size_pages);
>  	zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
> -	zone->wait_table = (wait_queue_head_t *)
> -		alloc_bootmem_node(pgdat, zone->wait_table_size
> -					* sizeof(wait_queue_head_t));
> +	if (system_state == SYSTEM_BOOTING) {
> +		zone->wait_table = (wait_queue_head_t *)
> +			alloc_bootmem_node(pgdat, zone->wait_table_size
> +						* sizeof(wait_queue_head_t));
> +	} else {
> +		int table_size;
> +		/*
> +		 * XXX: This is the case where a new node is hot-added.
> +		 * kmalloc() cannot take memory from the new node yet,
> +		 * because this wait_table must be initialized before the
> +		 * new node itself can be used.  Making it use the new
> +		 * node's own memory will need further consideration.
> +		 */
> +		do {
> +			table_size = zone->wait_table_size
> +					* sizeof(wait_queue_head_t);
> +			zone->wait_table = kmalloc(table_size, GFP_KERNEL);
> +			if (!zone->wait_table) {
> +				/* try half size */
> +				zone->wait_table_size >>= 1;
> +				zone->wait_table_bits =
> +					wait_table_bits(zone->wait_table_size);
> +			}
> +		} while (zone->wait_table_size && !zone->wait_table);
> +	}
> +	if (!zone->wait_table)
> +		return -ENOMEM;
> 
>  	for(i = 0; i < zone->wait_table_size; ++i)
>  		init_waitqueue_head(zone->wait_table + i);
> +	return 0;
>  }

Why do you need those retries to shrink the size? Are you actually
getting common failures? Is it best to shrink the size, or try
something like vmalloc? This seems a bit hackish to me.

-- Dave

2006-03-20 08:42:57

by Yasunori Goto

Subject: Re: [PATCH: 010/017]Memory hotplug for new nodes v.4.(allocate wait table)

> Ick. Is there really _no_ way to resize this at runtime? I know it
> isn't an immediately easy thing to do, but we've really tried not to do
> these kinds of things with memory hotplug in the past. The whole thing
> would have been really easy if we could just preallocate everything
> really big in the first place.
>
> I don't think this has to be a super-fast, efficient, implementation.
> Once the code has gone into the actual waitqueue code, it is already in
> a slow path.
>
> We could do something like this:
>
> void fastcall wait_on_page_bit(struct page *page, int bit_nr)
> {
> 	DEFINE_WAIT_BIT(wait, &page->flags, bit_nr);
> 
> 	if (!test_bit(bit_nr, &page->flags))
> 		return;
> 
> 	while (__wait_on_bit(page_waitqueue(page), &wait, sync_page,
> 				TASK_UNINTERRUPTIBLE));
> }
>
> And have a special case inside sync_page() to return -EAGAIN when a
> waitqueue resize is going on. There is a race there if zone->wait_table
> and zone->wait_table_bits do not hold matching values.
>
> So, to do the update, you'd need to do something like this:
> 
> 	set_waitqueue_resize_start(zone);
> 	// now all of the waiters will spin
> 	zone->wait_table = kmalloc();
> 	smp_wmb(); // make sure all the cpus see the kmalloc
> 	zone->wait_table_bits = new_bits;
> 	set_waitqueue_resize_done(zone);
>
> Putting a seqlock next to wait_table_bits might also do the trick. I
> need to think about it some more. BTW, I think this only works as-is
> for the waiter side, not the wakers, but I think it can be made to
> work in both cases.

Hmmmmm.
I'm not sure this would work well.
Resizing the hash table would probably take a lot of work.

But I don't think my patch set is the final form of node-style
hotplug anyway.
(At least, the pgdat is still allocated on another node, not on the
new node.)
So I would like to leave this issue as future work too.
The Linux style is not to solve everything at once; it is a
step-by-step approach.

BTW, this issue is not only about node hotplug; it affects the
SMP-style hotplug case too.
When memory is hotplugged, one of the zone sizes changes, but the
current memory hotplug code does nothing when the wait_table becomes
too small: it just keeps the boot-time size.

When a node is hot-added, there is no wait_table at all because the
zone size was 0, so a new wait_table has to be allocated at that
point.


>
> Why do you need those retries to shrink the size? Are you actually
> getting common failures? Is it best to shrink the size, or try
> something like vmalloc? This seems a bit hackish to me.

I just thought the wait_table might be larger than order-3 pages,
but I don't remember why I used kmalloc(). :-(
OK, I'll change this to vmalloc(). It will be simpler.
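
Something like this, as an untested sketch of the vmalloc() version
of zone_wait_table_init() (it needs linux/vmalloc.h):

	zone->wait_table_size = wait_table_size(zone_size_pages);
	zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
	if (system_state == SYSTEM_BOOTING) {
		zone->wait_table = (wait_queue_head_t *)
			alloc_bootmem_node(pgdat, zone->wait_table_size
						* sizeof(wait_queue_head_t));
	} else {
		/*
		 * vmalloc() only needs order-0 pages, so the halving
		 * retry loop can go away entirely.
		 */
		zone->wait_table = vmalloc(zone->wait_table_size
						* sizeof(wait_queue_head_t));
	}
	if (!zone->wait_table)
		return -ENOMEM;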

Thanks.

--
Yasunori Goto