2005-11-23 00:10:20

by Rohit Seth

Subject: [PATCH]: Free pages from local pcp lists under tight memory conditions

Andrew, Linus,

[PATCH]: This patch frees pages (pcp->batch from each list at a time) from
the local pcp lists when a higher-order allocation request cannot be
serviced from the global free list.

This should help fix some of the earlier failures seen with order 1 allocations.

I will send separate patches for:

1- Reducing the remote cpus pcp
2- Clean up page_alloc.c for CONFIG_HOTPLUG_CPU to use this code appropriately

Signed-off-by: Rohit Seth <[email protected]>


--- a/mm/page_alloc.c 2005-11-22 07:03:40.000000000 -0800
+++ linux-2.6.15-rc2/mm/page_alloc.c 2005-11-22 07:17:48.000000000 -0800
@@ -827,6 +827,35 @@
return page;
}

+static int
+reduce_cpu_pcp(void )
+{
+ struct zone *zone;
+ unsigned long flags;
+ unsigned int cpu = get_cpu();
+ int i, ret=0;
+
+ local_irq_save(flags);
+ for_each_zone(zone) {
+ struct per_cpu_pageset *pset;
+
+ pset = zone_pcp(zone, cpu);
+ for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
+ struct per_cpu_pages *pcp;
+
+ pcp = &pset->pcp[i];
+ if (pcp->count == 0)
+ continue;
+ pcp->count -= free_pages_bulk(zone, pcp->batch,
+ &pcp->list, 0);
+ ret++;
+ }
+ }
+ local_irq_restore(flags);
+ put_cpu();
+ return ret;
+}
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -887,6 +916,7 @@
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
+try_again:
page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
if (page)
goto got_pg;
@@ -911,8 +941,15 @@
}

/* Atomic allocations - we can't balance anything */
- if (!wait)
- goto nopage;
+ if (!wait) {
+ /* Check if there are pages available on pcp lists that can be
+ * moved to global page list to satisfy higher order allocations.
+ */
+ if ((order > 0) && (reduce_cpu_pcp()))
+ goto try_again;
+ else
+ goto nopage;
+ }

rebalance:
cond_resched();
@@ -950,6 +987,14 @@
goto restart;
}

+ if (order > 0)
+ while (reduce_cpu_pcp()) {
+ if (get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags))
+ goto got_pg;
+ }
+ /* FIXME: Add the support for reducing/draining the remote pcps.
+ */
+
/*
* Don't let big-order allocations loop unless the caller explicitly
* requests that. Wait for some write requests to complete then retry.


2005-11-23 05:36:29

by Andrew Morton

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Rohit Seth <[email protected]> wrote:
>
> Andrew, Linus,
>
> [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> local pcp lists when a higher order allocation request is not able to
> get serviced from global free_list.
>
> This should help fix some of the earlier failures seen with order 1 allocations.
>
> I will send separate patches for:
>
> 1- Reducing the remote cpus pcp
> 2- Clean up page_alloc.c for CONFIG_HOTPLUG_CPU to use this code appropiately
>
> +static int
> +reduce_cpu_pcp(void )
> +{
> + struct zone *zone;
> + unsigned long flags;
> + unsigned int cpu = get_cpu();
> + int i, ret=0;
> +
> + local_irq_save(flags);
> + for_each_zone(zone) {
> + struct per_cpu_pageset *pset;
> +
> + pset = zone_pcp(zone, cpu);
> + for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
> + struct per_cpu_pages *pcp;
> +
> + pcp = &pset->pcp[i];
> + if (pcp->count == 0)
> + continue;
> + pcp->count -= free_pages_bulk(zone, pcp->batch,
> + &pcp->list, 0);
> + ret++;
> + }
> + }
> + local_irq_restore(flags);
> + put_cpu();
> + return ret;
> +}

This significantly duplicates the existing drain_local_pages().

>
> + if (order > 0)
> + while (reduce_cpu_pcp()) {
> + if (get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags))

This forgot to assign to local variable `page'! It'll return NULL and will
leak memory.

The `while' loop worries me for some reason, so I wimped out and just tried
the remote drain once.

> + goto got_pg;
> + }
> + /* FIXME: Add the support for reducing/draining the remote pcps.

This is easy enough to do.

I wanted to call the all-CPU drainer `drain_remote_pages' but that's
already taken by some rather poorly-named NUMA thing which also duplicates
most of __drain_pages().

This patch is against a random selection of the enormous number of mm/
patches in -mm. I haven't runtime-tested it yet.

We need to verify that this patch actually does something useful.



include/linux/gfp.h | 2 +
include/linux/suspend.h | 1
mm/page_alloc.c | 85 ++++++++++++++++++++++++++++++++++++------------
3 files changed, 66 insertions(+), 22 deletions(-)

diff -puN include/linux/gfp.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions include/linux/gfp.h
--- devel/include/linux/gfp.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:32:47.000000000 -0800
+++ devel-akpm/include/linux/gfp.h 2005-11-22 21:32:47.000000000 -0800
@@ -109,6 +109,8 @@ static inline struct page *alloc_pages_n
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}

+extern int drain_local_pages(void);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

diff -puN include/linux/suspend.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions include/linux/suspend.h
--- devel/include/linux/suspend.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:32:47.000000000 -0800
+++ devel-akpm/include/linux/suspend.h 2005-11-22 21:32:47.000000000 -0800
@@ -40,7 +40,6 @@ extern dev_t swsusp_resume_device;
extern int shrink_mem(void);

/* mm/page_alloc.c */
-extern void drain_local_pages(void);
extern void mark_free_pages(struct zone *zone);

#ifdef CONFIG_PM
diff -puN mm/page_alloc.c~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions mm/page_alloc.c
--- devel/mm/page_alloc.c~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:32:47.000000000 -0800
+++ devel-akpm/mm/page_alloc.c 2005-11-22 21:32:47.000000000 -0800
@@ -578,32 +578,71 @@ void drain_remote_pages(void)
}
#endif

-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
-static void __drain_pages(unsigned int cpu)
+/*
+ * Drain any cpu-local pages into the buddy lists. Must be called under
+ * local_irq_disable().
+ */
+static int __drain_pages(unsigned int cpu)
{
- unsigned long flags;
struct zone *zone;
- int i;
+ int ret = 0;

for_each_zone(zone) {
struct per_cpu_pageset *pset;
+ int i;

pset = zone_pcp(zone, cpu);
for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
struct per_cpu_pages *pcp;

pcp = &pset->pcp[i];
- local_irq_save(flags);
+ if (!pcp->count)
+ continue;
pcp->count -= free_pages_bulk(zone, pcp->count,
&pcp->list, 0);
- local_irq_restore(flags);
+ ret++;
}
}
+ return ret;
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */

-#ifdef CONFIG_PM
+/*
+ * Spill all of this CPU's per-cpu pages back into the buddy allocator.
+ */
+int drain_local_pages(void)
+{
+ unsigned long flags;
+ int ret;
+
+ local_irq_save(flags);
+ ret = __drain_pages(smp_processor_id());
+ local_irq_restore(flags);
+ return ret;
+}
+
+static void drainer(void *p)
+{
+ atomic_add(drain_local_pages(), p);
+}
+
+/*
+ * Drain the per-cpu pages on all CPUs. If called from interrupt context we
+ * can only drain the local CPU's pages, since cross-CPU calls are deadlocky
+ * from interrupt context.
+ */
+static int drain_all_local_pages(void)
+{
+ if (in_interrupt()) {
+ return drain_local_pages();
+ } else {
+ atomic_t ret = ATOMIC_INIT(0);
+
+ on_each_cpu(drainer, &ret, 0, 1);
+ return atomic_read(&ret);
+ }
+}

+#ifdef CONFIG_PM
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
@@ -629,17 +668,6 @@ void mark_free_pages(struct zone *zone)
spin_unlock_irqrestore(&zone->lock, flags);
}

-/*
- * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- */
-void drain_local_pages(void)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __drain_pages(smp_processor_id());
- local_irq_restore(flags);
-}
#endif /* CONFIG_PM */

static void zone_statistics(struct zonelist *zonelist, struct zone *z)
@@ -913,8 +941,16 @@ nofail_alloc:
}

/* Atomic allocations - we can't balance anything */
- if (!wait)
- goto nopage;
+ if (!wait) {
+ /*
+ * Check if there are pages available on pcp lists that can be
+ * moved to global page list to satisfy higher order allocations
+ */
+ if (order > 0 && drain_all_local_pages())
+ goto restart;
+ else
+ goto nopage;
+ }

rebalance:
cond_resched();
@@ -952,6 +988,13 @@ rebalance:
goto restart;
}

+ if (order > 0 && drain_all_local_pages()) {
+ page = get_page_from_freelist(gfp_mask, order, zonelist,
+ alloc_flags);
+ if (page)
+ goto got_pg;
+ }
+
/*
* Don't let big-order allocations loop unless the caller explicitly
* requests that. Wait for some write requests to complete then retry.
_

2005-11-23 05:58:53

by Andrew Morton

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Andrew Morton <[email protected]> wrote:
>
> The `while' loop worries me for some reason, so I wimped out and just tried
> the remote drain once.

Even the `goto restart' which is in this patch worries me from a livelock
POV. Perhaps we should only ever run drain_all_local_pages() once per
__alloc_pages() invocation.

And perhaps we should run drain_all_local_pages() for GFP_ATOMIC or
PF_MEMALLOC attempts too.


--- devel/include/linux/gfp.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:47:33.000000000 -0800
+++ devel-akpm/include/linux/gfp.h 2005-11-22 21:57:22.000000000 -0800
@@ -109,6 +109,8 @@ static inline struct page *alloc_pages_n
NODE_DATA(nid)->node_zonelists + gfp_zone(gfp_mask));
}

+extern void drain_local_pages(void);
+
#ifdef CONFIG_NUMA
extern struct page *alloc_pages_current(gfp_t gfp_mask, unsigned order);

diff -puN include/linux/suspend.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions include/linux/suspend.h
--- devel/include/linux/suspend.h~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:47:33.000000000 -0800
+++ devel-akpm/include/linux/suspend.h 2005-11-22 21:47:33.000000000 -0800
@@ -40,7 +40,6 @@ extern dev_t swsusp_resume_device;
extern int shrink_mem(void);

/* mm/page_alloc.c */
-extern void drain_local_pages(void);
extern void mark_free_pages(struct zone *zone);

#ifdef CONFIG_PM
diff -puN mm/page_alloc.c~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions mm/page_alloc.c
--- devel/mm/page_alloc.c~mm-free-pages-from-local-pcp-lists-under-tight-memory-conditions 2005-11-22 21:47:33.000000000 -0800
+++ devel-akpm/mm/page_alloc.c 2005-11-22 21:58:01.000000000 -0800
@@ -578,32 +578,65 @@ void drain_remote_pages(void)
}
#endif

-#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
-static void __drain_pages(unsigned int cpu)
+/*
+ * Drain any cpu-local pages into the buddy lists. Must be called under
+ * local_irq_disable().
+ */
+static int __drain_pages(unsigned int cpu)
{
- unsigned long flags;
struct zone *zone;
- int i;
+ int ret = 0;

for_each_zone(zone) {
struct per_cpu_pageset *pset;
+ int i;

pset = zone_pcp(zone, cpu);
for (i = 0; i < ARRAY_SIZE(pset->pcp); i++) {
struct per_cpu_pages *pcp;

pcp = &pset->pcp[i];
- local_irq_save(flags);
+ if (!pcp->count)
+ continue;
pcp->count -= free_pages_bulk(zone, pcp->count,
&pcp->list, 0);
- local_irq_restore(flags);
+ ret++;
}
}
+ return ret;
}
-#endif /* CONFIG_PM || CONFIG_HOTPLUG_CPU */

-#ifdef CONFIG_PM
+/*
+ * Spill all of this CPU's per-cpu pages back into the buddy allocator.
+ */
+void drain_local_pages(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __drain_pages(smp_processor_id());
+ local_irq_restore(flags);
+}

+static void drainer(void *p)
+{
+ drain_local_pages();
+}
+
+/*
+ * Drain the per-cpu pages on all CPUs. If called from interrupt context we
+ * can only drain the local CPU's pages, since cross-CPU calls are deadlocky
+ * from interrupt context.
+ */
+static void drain_all_local_pages(void)
+{
+ if (in_interrupt())
+ drain_local_pages();
+ else
+ on_each_cpu(drainer, NULL, 0, 1);
+}
+
+#ifdef CONFIG_PM
void mark_free_pages(struct zone *zone)
{
unsigned long zone_pfn, flags;
@@ -629,17 +662,6 @@ void mark_free_pages(struct zone *zone)
spin_unlock_irqrestore(&zone->lock, flags);
}

-/*
- * Spill all of this CPU's per-cpu pages back into the buddy allocator.
- */
-void drain_local_pages(void)
-{
- unsigned long flags;
-
- local_irq_save(flags);
- __drain_pages(smp_processor_id());
- local_irq_restore(flags);
-}
#endif /* CONFIG_PM */

static void zone_statistics(struct zonelist *zonelist, struct zone *z)
@@ -889,6 +911,10 @@ restart:
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_DIP_LESS;

+ if (order > 0 || (!wait && (gfp_mask & __GFP_HIGH)) ||
+ (p->flags & PF_MEMALLOC))
+ drain_all_local_pages();
+
page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
if (page)
goto got_pg;
_

2005-11-23 06:36:47

by Christoph Lameter

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Tue, 22 Nov 2005, Andrew Morton wrote:

> > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > local pcp lists when a higher order allocation request is not able to
> > get serviced from global free_list.
> >
> > This should help fix some of the earlier failures seen with order 1 allocations.
> >
> > I will send separate patches for:
> >
> > 1- Reducing the remote cpus pcp

That is already partially done by drain_remote_pages(). However, that
draining is specific to this processor's remote pagesets in remote
zones.

> This significantly duplicates the existing drain_local_pages().

We need to extract __drain_pcp from all these functions and clearly
document how they differ. Seth probably needs to call __drain_pages for
each processor.

2005-11-23 06:42:26

by Christoph Lameter

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Tue, 22 Nov 2005, Andrew Morton wrote:

> +extern int drain_local_pages(void);

drain_cpu_pcps?

The naming scheme is a bit confusing right now. We drain the pcp
structures, not pages, so maybe switch to pcp and then name each function
so that the functions can be distinguished clearly?

> +static int drain_all_local_pages(void)

drain_all_pcps?

2005-11-23 16:35:48

by Linus Torvalds

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions



On Tue, 22 Nov 2005, Christoph Lameter wrote:

> On Tue, 22 Nov 2005, Andrew Morton wrote:
>
> > +extern int drain_local_pages(void);
>
> drain_cpu_pcps?

Please no.

If there is something I _hate_ it's bad naming. And "pcps" is a totally
unintelligible name.

Write it out. If a function is so trivial that you can't be bothered to
write out what the name means, that function shouldn't exist at all.
Conversely, if it's worth doing, it's worth writing out a name.

Linus

2005-11-23 17:03:40

by Christoph Lameter

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 23 Nov 2005, Linus Torvalds wrote:

> On Tue, 22 Nov 2005, Christoph Lameter wrote:
> > On Tue, 22 Nov 2005, Andrew Morton wrote:
> > > +extern int drain_local_pages(void);
> > drain_cpu_pcps?
>
> Please no.
>
> If there is something I _hate_ it's bad naming. And "pcps" is a totally
> unintelligible name.
>
> Write it out. If a function is so trivial that you can't be bothered to
> write out what the name means, that function shouldn't exist at all.
> Conversely, if it's worth doing, it's worth writing out a name.


drain_one_cpus_pages_from_per_cpu_pagesets()

drain_one_cpus_remote_pages_from_per_cpu_pagesets()

drain_all_per_cpu_pagesets()

?

2005-11-23 17:48:38

by Rohit Seth

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Tue, 2005-11-22 at 21:36 -0800, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> > Andrew, Linus,
> >
> > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > local pcp lists when a higher order allocation request is not able to
> > get serviced from global free_list.
> >
> > This should help fix some of the earlier failures seen with order 1 allocations.
> >
> > I will send separate patches for:
> >
> > 1- Reducing the remote cpus pcp
> > 2- Clean up page_alloc.c for CONFIG_HOTPLUG_CPU to use this code appropiately
> >
> > +static int
> > +reduce_cpu_pcp(void )
> >
> This significantly duplicates the existing drain_local_pages().

Yes. The main change in this new function is that I'm only freeing a
batch's worth of pages from each pcp rather than draining all of them out
(even under a little memory pressure). IMO, we should be more
opportunistic here in alloc_pages about moving pages back to the global
page pool. Thoughts?

As said earlier, I will be cleaning up the existing drain_local_pages in
the next follow-up patch.

>
> >
> > + if (order > 0)
> > + while (reduce_cpu_pcp()) {
> > + if (get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags))
>
> This forgot to assign to local variable `page'! It'll return NULL and will
> leak memory.
>
My bad. Will fix it.

> The `while' loop worries me for some reason, so I wimped out and just tried
> the remote drain once.
>
Even after direct reclaim, it probably makes sense to see how minimal a
drain can be while still servicing a higher-order request.

> > + goto got_pg;
> > + }
> > + /* FIXME: Add the support for reducing/draining the remote pcps.
>
> This is easy enough to do.
>

The couple of options I wanted to think about a little more (before
attempting this part) were:

1- Whether to use an IPI to get the remote CPUs to free pages from their
pcps, or to do it lazily (using work_pending or such), since at this
point in execution we can definitely afford to get scheduled out.

2- Whether to drain the whole pcp on remote processors or again follow
the stepped approach (but maybe with a steeper slope).


> We need to verify that this patch actually does something useful.
>
>
I'm working on this. Will let you know later today if I can come up with
some workload that easily hits this additional logic.

Thanks,
-rohit

2005-11-23 18:06:56

by Mel Gorman

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 23 Nov 2005, Rohit Seth wrote:

> On Tue, 2005-11-22 at 21:36 -0800, Andrew Morton wrote:
> > Rohit Seth <[email protected]> wrote:
> > >
> > > Andrew, Linus,
> > >
> > > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > > local pcp lists when a higher order allocation request is not able to
> > > get serviced from global free_list.
> > >
> > > This should help fix some of the earlier failures seen with order 1 allocations.
> > >
> > > I will send separate patches for:
> > >
> > > 1- Reducing the remote cpus pcp
> > > 2- Clean up page_alloc.c for CONFIG_HOTPLUG_CPU to use this code appropiately
> > >
> > > +static int
> > > +reduce_cpu_pcp(void )
> > >
> > This significantly duplicates the existing drain_local_pages().
>
> Yes. The main change in this new function is I'm only freeing batch
> number of pages from each pcp rather than draining out all of them (even
> under a little memory pressure). IMO, we should be more opportunistic
> here in alloc_pages in moving pages back to global page pool list.
> Thoughts?
>

I doubt you gain a whole lot by releasing them in batches. There is no way
to determine whether freeing a few will result in contiguous blocks or not,
and the overhead of being cautious will likely exceed the cost of simply
refilling them on the next order-0 allocation. Your worst case is where
the buddies you need are in different per-cpu caches.

As it's easy to refill a per-cpu cache, it would be easier, clearer and
probably faster to just purge the per-cpu cache and have it refilled on
the next order-0 allocation. The release-in-batch approach would only be
worthwhile if you expect an order-1 allocation to be very rare.

> As said earlier, I will be cleaning up the existing drain_local_pages in
> next follow up patch.
>
> >
> > >
> > > + if (order > 0)
> > > + while (reduce_cpu_pcp()) {
> > > + if (get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags))
> >
> > This forgot to assign to local variable `page'! It'll return NULL and will
> > leak memory.
> >
> My bad. Will fix it.
>
> > The `while' loop worries me for some reason, so I wimped out and just tried
> > the remote drain once.
> >
> Even after direct reclaim it probably does make sense to see how
> minimally we can service a higher order request.
>

After direct reclaim, things are already very slow. The cost of refilling
a per-cpu cache is nowhere near that of direct reclaim.

> > > + goto got_pg;
> > > + }
> > > + /* FIXME: Add the support for reducing/draining the remote pcps.
> >
> > This is easy enough to do.
> >
>
> The couple of options that I wanted to think little more were (before
> attempting to do this part):
>
> 1- Whether use the IPI to get the remote CPUs to free pages from pcp or
> do it lazily (using work_pending or such). As at this point in
> execution we can definitely afford to get scheduled out.
>

In 005_drainpercpu.patch from the last version of anti-defrag, I used
smp_call_function() and it did not seem to slow the system down.
Certainly, by the time it was called, the system was already low on
memory and thrashing a bit, so it just wasn't noticeable.

> 2- Do we drain the whole pcp on remote processors or again follow the
> stepped approach (but may be with a steeper slope).
>

I would say do the same on the remote case as you do locally to keep
things consistent.

>
> > We need to verify that this patch actually does something useful.
> >
> >
> I'm working on this. Will let you know later today if I can come with
> some workload easily hitting this additional logic.
>

I found it hard to generate reliable workloads which hit these sorts of
situations, although a fork-heavy workload with 8k stacks will put pressure
on order-1 allocations. You can artificially force high-order allocations
using vmregress by doing something like this;

./configure --with-linux=/usr/src/linux-yourtree
make
insmod kernel_src/core/vmregress_core.ko
insmod kernel_src/core/buddyinfo.ko
insmod kernel_src/test/highalloc.ko
echo 1 100 > /proc/vmregress/test_highalloc

That will allocate one order-1 page every tenth of a second until it has
tried 100 allocations. When it completes, the success/failure report can
be read by catting /proc/vmregress/test_highalloc. Obviously, this is very
artificial.

--
Mel Gorman
Part-time Phd Student Java Applications Developer
University of Limerick IBM Dublin Software Lab

2005-11-23 18:10:51

by Rohit Seth

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Tue, 2005-11-22 at 21:58 -0800, Andrew Morton wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > The `while' loop worries me for some reason, so I wimped out and just tried
> > the remote drain once.
>
> Even the `goto restart' which is in this patch worries me from a livelock
> POV. Perhaps we should only ever run drain_all_local_pages() once per
> __alloc_pages() invokation.
> And perhaps we should run drain_all_local_pages() for GFP_ATOMIC or
> PF_MEMALLOC attempts too.

Good point for PF_MEMALLOC scenario.

-rohit

2005-11-23 19:30:20

by Christoph Lameter

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Tue, 22 Nov 2005, Rohit Seth wrote:

> [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> local pcp lists when a higher order allocation request is not able to
> get serviced from global free_list.

Ummm.. One controversial idea: how about removing the complete pcp
subsystem? Last time we disabled pcps, the effect was within the noise
on AIM7. The lru lock taken without pcp is in the local zone and thus
rarely contended.

2005-11-23 19:35:33

by Rohit Seth

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 18:06 +0000, Mel Gorman wrote:
> On Wed, 23 Nov 2005, Rohit Seth wrote:
>
> >
> I doubt you gain a whole lot by releasing them in batches. There is no way
> to determine if freeing a few will result in contiguous blocks or not and
> the overhead of been cautious will likely exceed the cost of simply
> refilling them on the next order-0 allocation.

It depends. If most of the higher-order allocations are only order 1
(and maybe order 2), then it is possible that we may gain by freeing in
batches.

> Your worst case is where
> the buddies you need are in different per-cpu caches.
>

That is why we need another patch that tries to allocate physically
contiguous pages in each per_cpu_pagelist. Actually, such a patch was in
Andrew's tree for some time (2.6.14) before a couple of corner cases came
up where order-1 allocations were unsuccessful.

> As it's easy to refill a per-cpu cache, it would be easier, clearer and
> probably faster to just purge the per-cpu cache and have it refilled on
> the next order-0 allocation. The release-in-batch approach would only be
> worthwhile if you expect an order-1 allocation to be very rare.
>

Well, my only fear is if this shunting happens too often...

> In 005_drainpercpu.patch from the last version of the anti-defrag, I used
> the smp_call_function() and it did not seem to slow up the system.
> Certainly, by the time it was called, the system was already low on
> memory and trashing a bit so it just wasn't noticable.
>

I agree that at this point in the allocation, speed probably does not
matter too much. I definitely want to first see, for simple workloads,
how much this extra logic helps (and how deep we have to go into
deallocations).

> > 2- Do we drain the whole pcp on remote processors or again follow the
> > stepped approach (but may be with a steeper slope).
> >
>
> I would say do the same on the remote case as you do locally to keep
> things consistent.
>

Well, I think in the bigger scope these allocations/deallocations will
get automatically balanced.

> >
> > > We need to verify that this patch actually does something useful.
> > >
> > >
> > I'm working on this. Will let you know later today if I can come with
> > some workload easily hitting this additional logic.
> >
>
> I found it hard to generate reliable workloads which hit these sort of
> situations although a fork-heavy workload with 8k stacks will put pressure
> on order-1 allocations. You can artifically force high order allocations
> using vmregress by doing something like this;

Need something more benign/stupid to kick into this logic.

-rohit

2005-11-23 19:40:53

by Rohit Seth

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 11:30 -0800, Christoph Lameter wrote:
> On Tue, 22 Nov 2005, Rohit Seth wrote:
>
> > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > local pcp lists when a higher order allocation request is not able to
> > get serviced from global free_list.
>
> Ummm.. One controversial idea: How about removing the complete pcp
> subsystem? Last time we disabled pcps we saw that the effect
> that it had was within noise ratio on AIM7. The lru lock taken without
> pcp is in the local zone and thus rarely contended.

Oh please stop.

The per_cpu_pagelist is a great piece of logic that has been added to the
allocator. Besides providing pages without the need to acquire the zone
lock, it is the single main reason the page-coloring effect is drastically
reduced in 2.6-based kernels (over 2.4).

-rohit

2005-11-23 19:59:18

by Andrew Morton

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Rohit Seth <[email protected]> wrote:
>
> On Wed, 2005-11-23 at 11:30 -0800, Christoph Lameter wrote:
> > On Tue, 22 Nov 2005, Rohit Seth wrote:
> >
> > > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > > local pcp lists when a higher order allocation request is not able to
> > > get serviced from global free_list.
> >
> > Ummm.. One controversial idea: How about removing the complete pcp
> > subsystem? Last time we disabled pcps we saw that the effect
> > that it had was within noise ratio on AIM7. The lru lock taken without
> > pcp is in the local zone and thus rarely contended.
>
> Oh please stop.
>
> This per_cpu_pagelist is one great logic that has got added in
> allocator. Besides providing pages without the need to acquire the zone
> lock, it is one single main reason the coloring effect is drastically
> reduced in 2.6 (over 2.4) based kernels.
>

hm. Before it was merged in 2.5.x, the feature was very marginal from a
performance POV in my testing on 4-way.

I was able to demonstrate a large (~60%?) speedup in one microbenchmark
which consisted of four processes writing 16k to a file and truncating it
back to zero again. That gain came from the cache-warmth effect, which is
the other benefit these cpu-local pages are supposed to provide.

I don't think Martin was able to demonstrate much benefit from the lock
contention reduction on 16-way NUMAQ either.

So I dithered for months and it was a marginal merge, so it's appropriate
to justify the continued presence of the code.

We didn't measure for any coloring effects though. In fact, I didn't know
that this feature actually provided any benefit in that area.

2005-11-23 20:53:24

by Rohit Seth

Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 11:55 -0800, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> > On Wed, 2005-11-23 at 11:30 -0800, Christoph Lameter wrote:
> > > On Tue, 22 Nov 2005, Rohit Seth wrote:
> > >
> > > > [PATCH]: This patch free pages (pcp->batch from each list at a time) from
> > > > local pcp lists when a higher order allocation request is not able to
> > > > get serviced from global free_list.
> > >
> > > Ummm.. One controversial idea: How about removing the complete pcp
> > > subsystem? Last time we disabled pcps we saw that the effect
> > > that it had was within noise ratio on AIM7. The lru lock taken without
> > > pcp is in the local zone and thus rarely contended.
> >
> > Oh please stop.
> >
> > This per_cpu_pagelist is one great logic that has got added in
> > allocator. Besides providing pages without the need to acquire the zone
> > lock, it is one single main reason the coloring effect is drastically
> > reduced in 2.6 (over 2.4) based kernels.
> >
>
> hm. Before it was merged in 2.5.x, the feature was very marginal from a
> performance POV in my testing on 4-way.
>

One less lock to worry about in page allocation should help somewhere,
I would assume.

> I was able to demonstrate a large (~60%?) speedup in one microbenckmark
> which consisted of four processes writing 16k to a file and truncating it
> back to zero again. That gain came from the cache warmth effect, which is
> the other benefit which these cpu-local pages are supposed to provide.
>

That is right.

> I don't think Martin was able to demonstrate much benefit from the lock
> contention reduction on 16-way NUMAQ either.
>
> So I dithered for months and it was a marginal merge, so it's appropriate
> to justify the continued presence of the code.
>

Maybe the limits on the number of pages hanging on the per_cpu_pagelist
were (or even now are) too small for them to give any meaningful gain.
Maybe we should have more physical contiguity in each of these pcps to
give better cache spread.

> We didn't measure for any coloring effects though. In fact, I didn't know
> that this feature actually provided any benefit in that area.

I thought Nick et al. came up with some of the constant values, like the
batch size, to tackle the page-coloring issue specifically. In any case,
I think one of the key differences between the 2.4 and 2.6 allocators is
the pcp list. And even with the minuscule batch and high watermarks, this
is helping ordinary benchmarks (by reducing the variation from run to run).

-rohit

2005-11-23 21:26:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 23 Nov 2005, Rohit Seth wrote:

> I thought Nick et al. came up with some of the constant values, like the
> batch size, to tackle the page coloring issue specifically. In any case, I
> think one of the key differences between the 2.4 and 2.6 allocators is the
> pcp list. And even with the minuscule batch and high watermarks, this is
> helping ordinary benchmarks (by reducing the variation from run to run).

Could you share some benchmark results?

2005-11-23 21:26:31

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Rohit Seth <[email protected]> wrote:
>
> > I don't think Martin was able to demonstrate much benefit from the lock
> > contention reduction on 16-way NUMAQ either.
> >
> > So I dithered for months and it was a marginal merge, so it's appropriate
> > to justify the continued presence of the code.
> >
>
> Maybe the limits on the number of pages hanging off the per_cpu_pagelist
> were (or even now are) too small for them to give any meaningful gain.
> Maybe we should have more physical contiguity in each of these pcps to
> give better cache spread.

Could be. The initial settings were pretty arbitrary - I assumed that
someone would get in and tune them up, but nothing much happened. Perhaps
we should expose the thresholds in /proc/sys/vm so they're easier to play
with.

> > We didn't measure for any coloring effects though. In fact, I didn't know
> > that this feature actually provided any benefit in that area.
>
> I thought Nick et al. came up with some of the constant values, like the
> batch size, to tackle the page coloring issue specifically. In any case, I
> think one of the key differences between the 2.4 and 2.6 allocators is the
> pcp list. And even with the minuscule batch and high watermarks, this is
> helping ordinary benchmarks (by reducing the variation from run to run).

OK, useful info, thanks.

2005-11-23 21:34:07

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 13:26 -0800, Andrew Morton wrote:
> Rohit Seth <[email protected]> wrote:
> >
> > > I don't think Martin was able to demonstrate much benefit from the lock
> > > contention reduction on 16-way NUMAQ either.
> > >
> > > So I dithered for months and it was a marginal merge, so it's appropriate
> > > to justify the continued presence of the code.
> > >
> >
> > Maybe the limits on the number of pages hanging off the per_cpu_pagelist
> > were (or even now are) too small for them to give any meaningful gain.
> > Maybe we should have more physical contiguity in each of these pcps to
> > give better cache spread.
>
> Could be. The initial settings were pretty arbitrary - I assumed that
> someone would get in and tune them up, but nothing much happened. Perhaps
> we should expose the thresholds in /proc/sys/vm so they're easier to play
> with.

Most certainly. If I had a patch ready, I would have given you one
right away :-) Though I will work on it...

It surely is unfortunate that we have not dug deeper into this area
(in terms of optimizations)...

-rohit

2005-11-23 22:02:08

by Christoph Lameter

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 23 Nov 2005, Andrew Morton wrote:

> I was able to demonstrate a large (~60%?) speedup in one microbenchmark
> which consisted of four processes writing 16k to a file and truncating it
> back to zero again. That gain came from the cache warmth effect, which is
> the other benefit which these cpu-local pages are supposed to provide.

Maybe we can cut the pcp logic back to only put cache warm pages into a
single per_cpu pcp list for the local processor that contains node local
pages? Return all remote pages and cold pages directly to the buddy lists.

That way we can remove the logic to flush remote pages and remove those
pcp lists for remote nodes that are mostly empty anyways.

2005-11-23 22:22:23

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 13:25 -0800, Christoph Lameter wrote:
> On Wed, 23 Nov 2005, Rohit Seth wrote:
>
> > I thought Nick et al. came up with some of the constant values, like the
> > batch size, to tackle the page coloring issue specifically. In any case, I
> > think one of the key differences between the 2.4 and 2.6 allocators is the
> > pcp list. And even with the minuscule batch and high watermarks, this is
> > helping ordinary benchmarks (by reducing the variation from run to run).
>
> Could you share some benchmark results?
>

Some components of cpu2k on 2.4-based kernels show in excess of 40-50%
variation from run to run. The same variation came down to about 10%
on 2.6-based kernels.

-rohit

2005-11-23 23:20:15

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions


On Wed, 23 Nov 2005, Rohit Seth wrote:
>
> > On Tue, 2005-11-22 at 21:36 -0800, Andrew Morton wrote:

> > > We need to verify that this patch actually does something useful.
> > >
> > >
> > I'm working on this. Will let you know later today if I can come up with
> > some workload easily hitting this additional logic.
> >
>

I'm able to trigger the reduce_cpu_pcp() logic (I'll change its name in the
next update of the patch) after direct reclaim, using a small test case
hogging memory and a bash loop spawning processes one at a time, each using
very little memory.

I added a single printk after the direct-reclaim point where we reduce the
per-cpu pagelists (in my patch), just to record the order and how many
iterations were needed to service the request. The order is always 1
(coming from alloc_thread_info for the 8K stack size).

This is on i386 with 8K stack size.

	if (order > 0) {
		int i = 0;

		while (reduce_cpu_pcp()) {
			i++;
			page = get_page_from_freelist(gfp_mask, order,
						zonelist, alloc_flags);
			if (page) {
				printk("Got page %d order iteration %d\n",
						order, i);
				goto got_pg;
			}
		}
	}

And I got about 30 of those in a couple of hours:

[17179885.360000] Got page 1 order iteration 1



2005-11-24 03:02:53

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Rohit wrote:
> I thought Nick et al. came up with some of the constant values, like the
> batch size, to tackle the page coloring issue specifically.

I think this came about on a linux-ia64 thread started by Jack Steiner:

http://www.gelato.unsw.edu.au/archives/linux-ia64/0504/13668.html
Subject: per_cpu_pagesets degrades MPI performance
From: Jack Steiner <steiner_at_sgi.com>
Date: 2005-04-05 05:28:27

Jack reported that per_cpu_pagesets were degrading some MPI benchmarks due
to adverse page coloring. Nick responded, recommending a non-power of two
batch size. Jack found that this helped nicely. This thread trails off,
but seems to be the origins of the 2**n-1 batch size in:

mm/page_alloc.c:zone_batchsize()
* Clamp the batch to a 2^n - 1 value. Having a power ...
batch = (1 << fls(batch + batch/2)) - 1;

I don't see here evidence that "per_cpu_pagelist is ... one single main
reason the coloring effect is drastically reduced in 2.6 (over 2.4)
based kernels." Rather, in this case at least, a batch size that is not a
power of two was apparently needed to keep per_cpu_pagesets from hurting
performance due to page coloring effects on some workloads.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-11-24 09:25:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 23 Nov 2005, Rohit Seth wrote:

> On Wed, 2005-11-23 at 18:06 +0000, Mel Gorman wrote:
> > On Wed, 23 Nov 2005, Rohit Seth wrote:
> >
> > >
> > I doubt you gain a whole lot by releasing them in batches. There is no way
> > to determine whether freeing a few will result in contiguous blocks or not,
> > and the overhead of being cautious will likely exceed the cost of simply
> > refilling them on the next order-0 allocation.
>
> It depends. If most of the higher order allocations are only order 1
> (and maybe order 2), then it is possible that we may gain by freeing in
> batches.
>

Possible, but if you are draining, it's just as handy to drain them all
and avoid >0 order failures in the near future.

> > Your worst case is where
> > the buddies you need are in different per-cpu caches.
> >
>
> That is why we need another patch that tries to allocate physically
> contiguous pages in each per_cpu_pagelist.

That will only delay the problem. Pages end up on the per-cpu lists from
either an allocation or a free. While the allocation would make sure the
pages were contiguous and on the same list, the frees will not. The test
case you are running is a tight loop of one process allocating and freeing
order-1 pages. This is probably staying on the one CPU so draining in
batches on just the local CPU will appear successful. On long lived loads,
it will not be as successful. Draining all the pages on all lists would
cover all cases while using the existing code.

When I was testing my version of drain-percpu for anti-defrag and order-10
allocations, I found that draining just the local CPU made little
difference but draining all of them made a massive difference. This is an
extreme case, but it still applies to the smaller orders.

> Actually, this patch used to be in Andrew's tree for some time (2.6.14)
> before a couple of corner cases came up where order-1 allocations were
> unsuccessful.
>
> > As it's easy to refill a per-cpu cache, it would be easier, clearer and
> > probably faster to just purge the per-cpu cache and have it refilled on
> > the next order-0 allocation. The release-in-batch approach would only be
> > worthwhile if you expect an order-1 allocation to be very rare.
> >
>
> Well, my only fear is if this shunting happens too often...
>

Measure it by counting how often you drain the pages and add it to
frag_show(). This is a hack obviously and not a permanent solution, but
it's the easiest way to find out what's going on.

> > In 005_drainpercpu.patch from the last version of the anti-defrag work,
> > I used smp_call_function() and it did not seem to slow the system down.
> > Certainly, by the time it was called, the system was already low on
> > memory and thrashing a bit, so it just wasn't noticeable.
> >
>
> I agree that at this point in the allocation, speed probably does not
> matter too much. I definitely want to first see, for simple workloads,
> how much this extra logic helps (and how deep we have to go into
> deallocations).
>
> > > 2- Do we drain the whole pcp on remote processors or again follow the
> > > stepped approach (but may be with a steeper slope).
> > >
> >
> > I would say do the same on the remote case as you do locally to keep
> > things consistent.
> >
>
> Well, I think in bigger scope these allocations/deallocations will get
> automatically balanced.
>

Depends on whether your workload involves one process or many. If the load
is multiple processes on multiple CPUs, the per-cpu pages will be spread
out a lot.

> > >
> > > > We need to verify that this patch actually does something useful.
> > > >
> > > >
> > > I'm working on this. Will let you know later today if I can come up with
> > > some workload easily hitting this additional logic.
> > >
> >
> > I found it hard to generate reliable workloads which hit these sorts of
> > situations, although a fork-heavy workload with 8k stacks will put pressure
> > on order-1 allocations. You can artificially force high order allocations
> > using vmregress by doing something like this;
>
> Need something more benign/stupid to kick into this logic.
>

If CIFS still needs high order allocations, you could try -jN kernel
compiles over the network filesystem. Network benchmarks running over a
loopback device with a large MTU while another load consumes memory might
also trigger it.

--
Mel Gorman
Part-time PhD Student, University of Limerick
Java Applications Developer, IBM Dublin Software Lab

2005-11-29 23:12:19

by Rohit Seth

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

On Wed, 2005-11-23 at 19:02 -0800, Paul Jackson wrote:
> Rohit wrote:
> > I thought Nick et al. came up with some of the constant values, like the
> > batch size, to tackle the page coloring issue specifically.
>
> I think this came about on a linux-ia64 thread started by Jack Steiner:
>
> http://www.gelato.unsw.edu.au/archives/linux-ia64/0504/13668.html
> Subject: per_cpu_pagesets degrades MPI performance
> From: Jack Steiner <steiner_at_sgi.com>
> Date: 2005-04-05 05:28:27
>
> Jack reported that per_cpu_pagesets were degrading some MPI benchmarks due
> to adverse page coloring. Nick responded, recommending a non-power of two
> batch size. Jack found that this helped nicely. This thread trails off,
> but seems to be the origins of the 2**n-1 batch size in:
>
> mm/page_alloc.c:zone_batchsize()
> * Clamp the batch to a 2^n - 1 value. Having a power ...
> batch = (1 << fls(batch + batch/2)) - 1;
>
> I don't see here evidence that "per_cpu_pagelist is ... one single main
> reason the coloring effect is drastically reduced in 2.6 (over 2.4)
> based kernels." Rather, in this case at least, a batch size that is not a
> power of two was apparently needed to keep per_cpu_pagesets from hurting
> performance due to page coloring effects on some workloads.
>

Well, the batch size of a list (plus the high watermark) is an integral
part of the per_cpu_pagelist infrastructure. Tuning is always required. I
don't think, though, that one fixed set of values fixes all the cases.

Can you please comment on the performance delta on the MPI workload
because of this change in batch values? And what were the numbers
before per_cpu_pagelists were introduced?

thanks,
-rohit


2005-12-01 14:45:04

by Paul Jackson

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Rohit wrote:
> Can you please comment on the performance delta on the MPI workload
> because of this change in batch values.

I can't -- all I know is what I read in Jack Steiner's posts
of April 5, 2005, referenced earlier in this thread.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2005-12-02 00:32:34

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH]: Free pages from local pcp lists under tight memory conditions

Paul Jackson wrote:
> Rohit wrote:
>
>>Can you please comment on the performance delta on the MPI workload
>>because of this change in batch values.
>
>
> I can't -- all I know is what I read in Jack Steiner's posts
> of April 5, 2005, referenced earlier in this thread.
>

It was something fairly large. Basically, having a power-of-2 batch size
meant that two concurrent allocators (presumably setting up the working
area) would alternately pull in power-of-2 chunks of memory, which
caused each CPU to get pages of only about half of its cache's possible
colours.

The fix is not by any means a single value for all workloads; it simply
avoids power-of-2 batch sizes. Note this will have very little effect
on single-threaded allocators and will do nothing for cache colouring
there; however, it is important for concurrent allocators.

Nick

--
SUSE Labs, Novell Inc.
