2006-05-18 04:36:05

by Paul Jackson

Subject: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

From: Paul Jackson <[email protected]>

Fix a couple of infrequently encountered 'sleeping function
called from invalid context' warnings in the cpuset hooks in
__alloc_pages(), which could sleep while interrupts are disabled.

The routine cpuset_zone_allowed() is called by code in
mm/page_alloc.c __alloc_pages() to determine if a zone is
allowed in the current task's cpuset. This routine can sleep,
for certain GFP_KERNEL allocations, if the zone is on a memory
node not allowed in the current cpuset, but might be allowed
in a parent cpuset.

But we can't sleep in __alloc_pages() if in interrupt, nor
if called for a GFP_ATOMIC request (__GFP_WAIT not set in
gfp_flags).

The rule was intended to be:
Don't call cpuset_zone_allowed() if you can't sleep, unless you
pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
the code that might scan up ancestor cpusets and sleep.
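That calling convention can be modeled as a sketch in plain userspace C; the flag values below are illustrative stand-ins, not the real kernel bit definitions:

```c
#include <assert.h>
#include <stdbool.h>

#define __GFP_WAIT      0x10u   /* illustrative values, not the real bits */
#define __GFP_HARDWALL  0x20u

/* May a caller legally invoke cpuset_zone_allowed() here?
 * Rule: if you can't sleep, you must set __GFP_HARDWALL, which
 * disables the ancestor-cpuset scan that might take a mutex. */
static bool call_is_legal(unsigned gfp_mask, bool caller_can_sleep)
{
    return caller_can_sleep || (gfp_mask & __GFP_HARDWALL);
}
```

In the kernel the rule is enforced only by convention (and, in PATCH 03, by a debug check); the sketch just states the rule as a predicate.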

This rule was being violated in a couple of places, due to a
bogus change made (by myself, pj) to __alloc_pages() as part
of the November 2005 effort to cleanup its logic, and also due
to a later fix to constrain which swap daemons were awoken.

The bogus change can be seen at:
http://linux.derkeiler.com/Mailing-Lists/Kernel/2005-11/4691.html
[PATCH 01/05] mm fix __alloc_pages cpuset ALLOC_* flags

This was first noticed on a tight memory system, in code that
was disabling interrupts and doing allocation requests with
__GFP_WAIT not set, which resulted in __might_sleep() writing
complaints to the log "Debug: sleeping function called ...",
when the code in cpuset_zone_allowed() tried to take the
callback_sem cpuset semaphore.

We haven't seen a system hang on this 'might_sleep' yet, but
we are at decent risk of seeing it fairly soon, especially
since the additional cpuset_zone_allowed() check was added,
conditioning wakeup_kswapd(), in March 2006.

Special thanks to Dave Chinner, for figuring this out,
and a tip of the hat to Nick Piggin who warned me of this
back in Nov 2005, before I was ready to listen.

Signed-off-by: Paul Jackson <[email protected]>

---

Andrew - this patch is worth pushing along a little faster. -pj

mm/page_alloc.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--- 2.6.17-rc4-mm1.orig/mm/page_alloc.c 2006-05-17 18:57:58.400484681 -0700
+++ 2.6.17-rc4-mm1/mm/page_alloc.c 2006-05-17 19:04:59.177414192 -0700
@@ -1038,7 +1038,7 @@ restart:
goto got_pg;

do {
- if (cpuset_zone_allowed(*z, gfp_mask))
+ if (cpuset_zone_allowed(*z, gfp_mask|__GFP_HARDWALL))
wakeup_kswapd(*z, order);
} while (*(++z));

@@ -1057,7 +1057,8 @@ restart:
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
- alloc_flags |= ALLOC_CPUSET;
+ if (wait)
+ alloc_flags |= ALLOC_CPUSET;

/*
* Go through the zonelist again. Let __GFP_HIGH and allocations

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373


2006-05-18 04:36:28

by Paul Jackson

Subject: [PATCH 03/03] Cpuset: might_sleep_if check in cpuset_zones_allowed

From: Paul Jackson <[email protected]>

It's too easy to incorrectly call cpuset_zone_allowed() in an
atomic context without __GFP_HARDWALL set, and when that happens,
the mistake goes unnoticed until a tight memory situation forces
allocations to be tried outside the current cpuset.

Add a 'might_sleep_if()' check, to catch this earlier on, instead
of waiting for a similar check in the mutex_lock() code, which
is only rarely invoked.
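A rough userspace model of what the check buys, where `atomic_context` and `warned` are invented stand-ins for the kernel's preempt/irq accounting and its log complaint:

```c
#include <assert.h>
#include <stdbool.h>

#define __GFP_HARDWALL 0x20u    /* illustrative value, not the real bit */

static bool atomic_context;     /* stand-in: "caller cannot sleep here" */
static bool warned;             /* stand-in: kernel would log a backtrace */

/* Sketch of might_sleep_if(): if the condition says we may sleep and
 * the caller is in atomic context, complain at the call site itself,
 * not later inside a rarely-contended mutex_lock(). */
static void might_sleep_if_model(bool cond)
{
    if (cond && atomic_context)
        warned = true;
}

/* The patched entry check: only a hardwall request is exempt,
 * because only the ancestor-cpuset scan can sleep. */
static void zone_allowed_entry_check(unsigned gfp_mask)
{
    might_sleep_if_model(!(gfp_mask & __GFP_HARDWALL));
}
```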

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 1 +
1 file changed, 1 insertion(+)

--- 2.6.17-rc4-mm1.orig/kernel/cpuset.c 2006-05-17 20:31:22.577065566 -0700
+++ 2.6.17-rc4-mm1/kernel/cpuset.c 2006-05-17 20:31:36.261218192 -0700
@@ -2261,6 +2261,7 @@ int __cpuset_zone_allowed(struct zone *z
if (in_interrupt())
return 1;
node = z->zone_pgdat->node_id;
+ might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
if (node_isset(node, current->mems_allowed))
return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-05-18 04:36:07

by Paul Jackson

Subject: [PATCH 02/03] Cpuset: update cpuset_zones_allowed comment

From: Paul Jackson <[email protected]>

Update the kernel/cpuset.c:cpuset_zone_allowed() comment.

The rule for when mm/page_alloc.c should call cpuset_zone_allowed()
was intended to be:
Don't call cpuset_zone_allowed() if you can't sleep, unless you
pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
the code that might scan up ancestor cpusets and sleep.

The explanation of this rule in the comment above cpuset_zone_allowed()
was stale, as a result of a restructuring of some __alloc_pages() code
in November 2005.

Rewrite that comment ...

Signed-off-by: Paul Jackson <[email protected]>

---

kernel/cpuset.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

--- 2.6.17-rc4-mm1.orig/kernel/cpuset.c 2006-05-17 20:10:43.555129178 -0700
+++ 2.6.17-rc4-mm1/kernel/cpuset.c 2006-05-17 20:31:22.577065566 -0700
@@ -2231,19 +2231,25 @@ static const struct cpuset *nearest_excl
* So only GFP_KERNEL allocations, if all nodes in the cpuset are
* short of memory, might require taking the callback_mutex mutex.
*
- * The first loop over the zonelist in mm/page_alloc.c:__alloc_pages()
- * calls here with __GFP_HARDWALL always set in gfp_mask, enforcing
- * hardwall cpusets - no allocation on a node outside the cpuset is
- * allowed (unless in interrupt, of course).
- *
- * The second loop doesn't even call here for GFP_ATOMIC requests
- * (if the __alloc_pages() local variable 'wait' is set). That check
- * and the checks below have the combined affect in the second loop of
- * the __alloc_pages() routine that:
+ * The first call here from mm/page_alloc:get_page_from_freelist()
+ * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets, so
+ * no allocation on a node outside the cpuset is allowed (unless in
+ * interrupt, of course).
+ *
+ * The second pass through get_page_from_freelist() doesn't even call
+ * here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
+ * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
+ * in alloc_flags. That logic and the checks below have the combined
+ * affect that:
* in_interrupt - any node ok (current task context irrelevant)
* GFP_ATOMIC - any node ok
* GFP_KERNEL - any node in enclosing mem_exclusive cpuset ok
* GFP_USER - only nodes in current tasks mems allowed ok.
+ *
+ * Rule:
+ * Don't call cpuset_zone_allowed() if you can't sleep, unless you
+ * pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
+ * the code that might scan up ancestor cpusets and sleep.
**/

int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2006-05-18 05:25:51

by Andrew Morton

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

Paul Jackson <[email protected]> wrote:
>
> Fix a couple of infrequently encountered 'sleeping function
> called from invalid context' in the cpuset hooks in
> __alloc_pages. Could sleep while interrupts disabled.

I'd have thought that if all the callers get their __GFP_HARDWALLS correct
then that fishy-looking in_interrupt() test in __cpuset_zone_allowed()
could be removed?

2006-05-18 05:48:18

by David Chinner

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

On Wed, May 17, 2006 at 10:25:43PM -0700, Andrew Morton wrote:
> Paul Jackson <[email protected]> wrote:
> >
> > Fix a couple of infrequently encountered 'sleeping function
> > called from invalid context' in the cpuset hooks in
> > __alloc_pages. Could sleep while interrupts disabled.
>
> I'd have thought that if all the callers get their __GFP_HARDWALLS correct
> then that fishy-looking in_interrupt() test in __cpuset_zone_allowed()
> could be removed?

I suggested to Paul that __cpuset_zone_allowed() should check for
__GFP_WAIT and allow the allocation if it is not set. Any allocation
from interrupt context has to be GFP_ATOMIC so that would kill
the need for the in_interrupt() check as well. I'm probably missing
something, but that seemed like the obvious fix to me...

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-05-19 00:48:26

by Paul Jackson

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

David wrote:
> I suggested to Paul that __cpuset_zone_allowed() should check for
> __GFP_WAIT and allow the allocation if it is not set. Any allocation
> from interrupt context has to be GFP_ATOMIC so that would kill
> the need for the in_interrupt() check as well.

A couple of things going on here ...

First of all, the mm/page_alloc.c:__alloc_pages() code has been the
place where __GFP_WAIT was checked, with its local variable 'wait'.

Internal kernel code is not like network interface code. In network
interface code, one should be conservative in what one generates
and liberal in what one accepts. If this means both ends of a link
do a similar check, that's ok. Within the kernel, it's best to do
things just once, to have a place to do everything, and everything done
in its place. A good example of this was the recent discussion of
whether "kfree(p)" should check for "p == NULL", or whether the callers
of it should check. Most seemed to agree that the check should not be
done twice, and the debate was over how to accomplish that in the
various cases.

The current implementation of the cpuset hooks in __alloc_pages()
is designed to have the __GFP_WAIT check done in alloc_pages(), not
in cpuset_zone_allowed().

Putting the check you suggest in cpuset_zone_allowed() would 'obviously'
fix this particular bug, but it would partially duplicate existing
checks on 'wait' in __alloc_pages(). Not good.

It's sort of like deciding to hold up your pants with a belt, but when
a couple of belt loops break, adding suspenders, instead of fixing the
loops. We don't want both (partially broken) belts and suspenders in
the kernel internals.

The second thing is that the current code is designed to distinguish
between the memory allocations requested in the following situations:
[A] in interrupt,
[B] GFP_ATOMIC (or !__GFP_WAIT, in general) in process context, and
[C] __GFP_WAIT ok, in process context.

In situation [A], cpusets are ignored and the node closest to the
requesting ndoe with a free page is used.

In situation [B], we try every node -within- the allowed cpuset
before dropping the cpuset constraint, and allowing all nodes.

In situation [C], we stay within the cpuset, of course.
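That three-way split can be sketched as a small decision table in userspace C; the enum names here are invented for illustration only:

```c
#include <assert.h>

enum alloc_ctx {
    CTX_INTERRUPT,   /* [A] in interrupt */
    CTX_ATOMIC,      /* [B] process context, __GFP_WAIT clear */
    CTX_CAN_WAIT     /* [C] process context, __GFP_WAIT set */
};

enum node_policy {
    ANY_NODE,        /* cpusets ignored entirely */
    CPUSET_THEN_ANY, /* try every cpuset node first, then fall back */
    CPUSET_ONLY      /* stay within the cpuset */
};

/* The three-way placement decision described in the text. */
static enum node_policy placement(enum alloc_ctx ctx)
{
    switch (ctx) {
    case CTX_INTERRUPT: return ANY_NODE;        /* [A] */
    case CTX_ATOMIC:    return CPUSET_THEN_ANY; /* [B] */
    default:            return CPUSET_ONLY;     /* [C] */
    }
}
```

Note that CPUSET_ONLY glosses over GFP_KERNEL's ability, discussed elsewhere in the thread, to escape to the nearest mem_exclusive ancestor cpuset.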

Since, unlike the rest of the __alloc_pages() code, the
cpuset_zone_allowed() check does distinguish between in_interrupt
and GFP_ATOMIC (!wait, in general), it does need to examine the
"in_interrupt()" state, to know which applies.

Your proposal above, Dave, and what I suspect Andrew's proposal
would be, if he bothered to waste more time thinking about this,
amount to changing the above design from the three cases [A], [B],
and [C], to just the two cases:

[D] __GFP_WAIT not ok, such as GFP_ATOMIC and/or in_interrupt, and
[E] __GFP_WAIT ok.

You're suggesting we ignore cpusets in [D], and honor them in [E].

And you are both confident that just checking 'wait' covers all the
cases where one can't wait, as any correctly coded interrupt allocation
will have !__GFP_WAIT in the gfp_flag passed in.

I suspect we could do this, and it might be a good idea. There may
well not be good enough reason to be making a special case of [B] above.

If we do this, then I think that such a change should be a separate
patch, as it intentionally changes the design, and it should sit in
*-mm for a few weeks, unlike this PATCH 01/03, which is a focused bug
fix to the current design, and which I would like to see on a faster
track.

One should separate out patches that fix bugs in the current
implementation from design changes that result in different behavior,
unless the old one is so broken it is hard to fix, which I hope is
not the case here.

One non-obvious (to me at least, for now) detail of such a design change
would be what to do with the __alloc_pages() code lines:

do {
if (cpuset_zone_allowed(*z, gfp_mask|__GFP_HARDWALL))
wakeup_kswapd(*z, order);
} while (*(++z));

If 'wait' is set, for allocations in the current task context that
can sleep, then it's obvious enough - just wake up the kswapd's on
the nodes in the current task's cpuset.

But what do we do if 'wait' is not set, such as when in interrupt or
for GFP_ATOMIC requests. Calling cpuset_zone_allowed() is no longer
allowed in that case.

I can imagine doing any of the following:

We could wake up all the swappers in the system, on the grounds
that any node in the system is allowed to satisfy this request.

We could just wake up the swappers in the current task's cpuset,
even though, if this is an interrupt, the current task is
irrelevant?

Maybe if (!wait) we could not wake up -any- swappers, since we
certainly won't be able to wait around long enough for that to
help us any on the current allocation.

Or maybe we could just kick the current node's swapper
(essentially, just call wakeup_kswapd() for just the first zone
z in the zonelist.)

Though I can't coherently explain why, that last choice appeals to my
intuition the most - waking kswapd on just the first zone if (!wait).

Here's a draft patch, totally untested and probably unbuildable, that
does all the above:


---

kernel/cpuset.c | 43 +++++++++++++++----------------------------
mm/page_alloc.c | 32 +++++++++++++++++---------------
2 files changed, 32 insertions(+), 43 deletions(-)

--- 2.6.17-rc4-mm1.orig/kernel/cpuset.c 2006-05-17 21:32:27.474553766 -0700
+++ 2.6.17-rc4-mm1/kernel/cpuset.c 2006-05-18 17:24:36.316549527 -0700
@@ -2224,32 +2224,21 @@ static const struct cpuset *nearest_excl
* GFP_KERNEL allocations are not so marked, so can escape to the
* nearest mem_exclusive ancestor cpuset.
*
- * Scanning up parent cpusets requires callback_mutex. The __alloc_pages()
- * routine only calls here with __GFP_HARDWALL bit _not_ set if
- * it's a GFP_KERNEL allocation, and all nodes in the current tasks
- * mems_allowed came up empty on the first pass over the zonelist.
- * So only GFP_KERNEL allocations, if all nodes in the cpuset are
- * short of memory, might require taking the callback_mutex mutex.
- *
- * The first call here from mm/page_alloc:get_page_from_freelist()
- * has __GFP_HARDWALL set in gfp_mask, enforcing hardwall cpusets, so
- * no allocation on a node outside the cpuset is allowed (unless in
- * interrupt, of course).
- *
- * The second pass through get_page_from_freelist() doesn't even call
- * here for GFP_ATOMIC calls. For those calls, the __alloc_pages()
- * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
- * in alloc_flags. That logic and the checks below have the combined
- * affect that:
- * in_interrupt - any node ok (current task context irrelevant)
- * GFP_ATOMIC - any node ok
- * GFP_KERNEL - any node in enclosing mem_exclusive cpuset ok
+ * Scanning up parent cpusets requires callback_mutex, which might
+ * sleep. Sleeping is not allowed if in interrupt or interrupts
+ * are disabled. In such cases, all memory allocations -must- have
+ * the __GFP_WAIT -not- set in gfp_mask.
+ *
+ * So to keep it simple here:
+ *
+ * ==> Don't call this routine from __alloc_pages() if __GFP_WAIT is
+ * not set in gfp_mask. Don't call here from anywhere else if
+ * one can't sleep. Just assume that the zone is allowed.
+ *
+ * For calls from __alloc_pages, the above has the affect that:
+ * !__GFP_WAIT (including GFP_ATOMIC, in_interrupt) - any node ok.
+ * GFP_KERNEL - any node in enclosing mem_exclusive cpuset ok.
* GFP_USER - only nodes in current tasks mems allowed ok.
- *
- * Rule:
- * Don't call cpuset_zone_allowed() if you can't sleep, unless you
- * pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
- * the code that might scan up ancestor cpusets and sleep.
**/

int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
@@ -2258,10 +2247,8 @@ int __cpuset_zone_allowed(struct zone *z
const struct cpuset *cs; /* current cpuset ancestors */
int allowed; /* is allocation in zone z allowed? */

- if (in_interrupt())
- return 1;
+ might_sleep();
node = z->zone_pgdat->node_id;
- might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
if (node_isset(node, current->mems_allowed))
return 1;
if (gfp_mask & __GFP_HARDWALL) /* If hardwall request, stop here */
--- 2.6.17-rc4-mm1.orig/mm/page_alloc.c 2006-05-17 21:32:27.474553766 -0700
+++ 2.6.17-rc4-mm1/mm/page_alloc.c 2006-05-18 17:08:47.509689158 -0700
@@ -877,7 +877,6 @@ failed:
#define ALLOC_WMARK_HIGH 0x08 /* use pages_high watermark */
#define ALLOC_HARDER 0x10 /* try to alloc harder */
#define ALLOC_HIGH 0x20 /* __GFP_HIGH set */
-#define ALLOC_CPUSET 0x40 /* check for correct cpuset */

/*
* Return 1 if free pages are above 'mark'. This takes into account the order
@@ -916,7 +915,7 @@ int zone_watermark_ok(struct zone *z, in
*/
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
- struct zonelist *zonelist, int alloc_flags)
+ struct zonelist *zonelist, int alloc_flags, int wait)
{
struct zone **z = zonelist->zones;
struct page *page = NULL;
@@ -927,8 +926,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
do {
- if ((alloc_flags & ALLOC_CPUSET) &&
- !cpuset_zone_allowed(*z, gfp_mask))
+ if (wait && !cpuset_zone_allowed(*z, gfp_mask))
continue;

if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
@@ -1033,14 +1031,19 @@ restart:
}

page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_LOW|ALLOC_CPUSET);
+ zonelist, ALLOC_WMARK_LOW, wait);
if (page)
goto got_pg;

- do {
- if (cpuset_zone_allowed(*z, gfp_mask|__GFP_HARDWALL))
- wakeup_kswapd(*z, order);
- } while (*(++z));
+ if (wait) {
+ do {
+ if (cpuset_zone_allowed(*z, __GFP_HARDWALL))
+ wakeup_kswapd(*z, order);
+ } while (*(++z));
+ } else {
+ /* wake up first nodes kswapd in interrupt or GFP_ATOMIC */
+ wakeup_kswapd(*z, order);
+ }

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -1057,8 +1060,6 @@ restart:
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
- if (wait)
- alloc_flags |= ALLOC_CPUSET;

/*
* Go through the zonelist again. Let __GFP_HIGH and allocations
@@ -1068,7 +1069,8 @@ restart:
* Ignore cpuset if GFP_ATOMIC (!wait) rather than fail alloc.
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
- page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags);
+ page = get_page_from_freelist(gfp_mask, order, zonelist, alloc_flags,
+ wait);
if (page)
goto got_pg;

@@ -1080,7 +1082,7 @@ restart:
nofail_alloc:
/* go through the zonelist yet again, ignoring mins */
page = get_page_from_freelist(gfp_mask, order,
- zonelist, ALLOC_NO_WATERMARKS);
+ zonelist, ALLOC_NO_WATERMARKS, wait);
if (page)
goto got_pg;
if (gfp_mask & __GFP_NOFAIL) {
@@ -1113,7 +1115,7 @@ rebalance:

if (likely(did_some_progress)) {
page = get_page_from_freelist(gfp_mask, order,
- zonelist, alloc_flags);
+ zonelist, alloc_flags, wait);
if (page)
goto got_pg;
} else if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
@@ -1124,7 +1126,7 @@ rebalance:
* under heavy pressure.
*/
page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, order,
- zonelist, ALLOC_WMARK_HIGH|ALLOC_CPUSET);
+ zonelist, ALLOC_WMARK_HIGH, wait);
if (page)
goto got_pg;



--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-05-19 00:58:55

by Paul Jackson

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

Andrew wrote:
> I'd have thought that if all the callers get their __GFP_HARDWALLS correct
> then that fishy-looking in_interrupt() test in __cpuset_zone_allowed()
> could be removed?

The in_interrupt() is needed because the cpuset code really does give a
different answer for the two cases of being in an interrupt, and being
in the current task context with __GFP_WAIT not set.

Interrupts get any node they want, totally ignoring cpusets.

Context code with __GFP_WAIT not set tries every node within
the current tasks context, before giving up and allowing any
node.

See my reply to Dave Chinner for a much more long winded answer.

It may well be that this distinction between interrupt code and
process-context code that can't sleep is not worth it, and should be
simplified out, which would get rid of this fishy in_interrupt() check.

If that's worth doing, it would be a (subtle) semantics change, and
likely separate from this current PATCH's bug fix.

My recommendation is to expedite this current PATCH fix, and allow any
such (minor) design changes to follow along behind as a separate patch,
on a more leisurely track.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-05-19 01:10:48

by Christoph Lameter

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

I think this is another case demonstrating that alloc_pages() behavior is
way too complex right now. Dave, Nick and I talked about how to fix
this a couple of days ago and what we came up with for a solution was to
implement some sort of layering for alloc_pages(). Vaguely we thought the
following may work:

First have a lower layer that is simply allocating a page from a zone
without regards to the policies and cpuset constraints etc.

Second a middle layer where one must explicitly specify cpuset and policy
constraint.

And thirdly a higher layer that obtains policies and cpuset constraints
from the context.

Allocations can then use the proper layer for their allocations.
F.e. cpusets could use the middle and lower layer without getting into
problems with recursion.

Preferably there would be some 3rd-level policy/cpusets engine that can
work on any lower allocator, so that we could use this engine to allocate
from unusual allocators (huge pages, uncached pages and slab objects)
following the proper cpuset/policy constraints.
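A hypothetical sketch of that three-layer split, as userspace C; every name below is invented for illustration and nothing here is a real kernel interface:

```c
#include <assert.h>
#include <stdbool.h>

#define NNODES 4
static bool node_has_free_page[NNODES];   /* stand-in per-node state */

/* Layer 1: allocate from one node, no policy or cpuset checks.
 * Returns the node on success, -1 on failure. */
static int raw_alloc_node(int node)
{
    return node_has_free_page[node] ? node : -1;
}

/* Layer 2: caller supplies the allowed-node mask explicitly, so
 * cpuset code itself could call here without recursing into policy. */
static int alloc_with_mask(const bool allowed[NNODES])
{
    for (int n = 0; n < NNODES; n++)
        if (allowed[n]) {
            int got = raw_alloc_node(n);
            if (got >= 0)
                return got;
        }
    return -1;
}

/* Layer 3: derive the mask from "current context" (stubbed here as a
 * fixed two-node cpuset), then defer to layer 2. */
static int alloc_from_context(void)
{
    bool allowed[NNODES] = { true, true, false, false };
    return alloc_with_mask(allowed);
}
```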

2006-05-19 01:27:12

by Paul Jackson

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

Christoph wrote:
> I think this is another case demonstrating that alloc_pages() behavior is
> way too complex right now.

One could certainly use this thread as evidence to support that case ;).

And your thoughts are all the more reason to fix this bug we have now,
with a small, focused fix (to avoid possibly sleeping with interrupts
off), and then take our time with any rework, perhaps in the context of
the bigger changes you, Dave and Nick were discussing.

Good luck.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-05-19 02:22:20

by David Chinner

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

On Thu, May 18, 2006 at 05:48:00PM -0700, Paul Jackson wrote:
> David wrote:
> > I suggested to Paul that __cpuset_zone_allowed() should check for
> > __GFP_WAIT and allow the allocation if it is not set. Any allocation
> > from interrupt context has to be GFP_ATOMIC so that would kill
> > the need for the in_interrupt() check as well.

....

> The current implementation of the cpuset hooks in __alloc_pages()
> is designed to have the __GFP_WAIT check done in alloc_pages(), not
> in cpuset_zone_allowed().

Which results in complex and fragile logic in __alloc_pages().

> Putting the check you suggest in cpuset_zone_allowed() would 'obviously'
> fix this particular bug, but it would partially duplicate existing
> checks on 'wait' in __alloc_pages(). Not good.

But if cpuset_zone_allowed() did this check, then the logic
in __alloc_pages() gets simpler because you don't have to juggle
all these flags in just the right way to prevent cpuset_zone_allowed()
from sleeping when it shouldn't.

> The second thing is that the current code is designed to distinguish
> between the memory allocations requested in the following situations:
> [A] in interrupt,
> [B] GFP_ATOMIC (or !__GFP_WAIT, in general) in process context, and
> [C] __GFP_WAIT ok, in process context.

.....

> Your proposal above, Dave, and what I suspect Andrew's proposal
> would be, if he bothered to waste more time thinking about this,
> amount to changing the above design from the three cases [A], [B],
> and [C], to just the two cases:
>
> [D] __GFP_WAIT not ok, such as GFP_ATOMIC and/or in_interrupt, and
> [E] __GFP_WAIT ok.
>
> You're suggesting we ignore cpusets in [D], and honor them in [E].

Effectively. There are two atomic allocation cases - one is from
interrupt where you want pages from the node local to the allocation
request, and the other is from process/kthread context where we want
pages as close as possible to the node we are currently running on.

In the first case, cpusets simply don't apply.

In the second case, we are typically already running on a cpu
inside the cpuset we're trying to allocate in. If there's
no memory inside the cpuset, we go outside it on the second
attempt.

That is, we fail the first get_page_from_freelist() due to __GFP_HARDWALL,
then we wake kswapd, then we call get_page_from_freelist() without the
__GFP_HARDWALL to allow allocation outside the cpuset.
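That two-pass sequence can be modeled as a sketch in userspace C; `node_in_cpuset[]` and `node_has_free_page[]` are invented stand-ins, and the kswapd wakeup between the passes is elided:

```c
#include <assert.h>
#include <stdbool.h>

#define NNODES 4

/* Stand-in state: which nodes are in the cpuset, which have free pages. */
static bool node_in_cpuset[NNODES];
static bool node_has_free_page[NNODES];

/* Pass 1 is hardwalled to the cpuset; if it fails, pass 2 allows
 * any node. Returns the node allocated from, or -1 on failure. */
static int two_pass_alloc(void)
{
    for (int n = 0; n < NNODES; n++)        /* pass 1: hardwall */
        if (node_in_cpuset[n] && node_has_free_page[n])
            return n;
    for (int n = 0; n < NNODES; n++)        /* pass 2: any node */
        if (node_has_free_page[n])
            return n;
    return -1;
}
```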

This hunk of your patch:

@@ -1057,7 +1057,8 @@ restart:
alloc_flags |= ALLOC_HARDER;
if (gfp_mask & __GFP_HIGH)
alloc_flags |= ALLOC_HIGH;
- alloc_flags |= ALLOC_CPUSET;
+ if (wait)
+ alloc_flags |= ALLOC_CPUSET;

/*
* Go through the zonelist again. Let __GFP_HIGH and allocations

Effectively disables the cpuset checks on atomic allocations
for the second call to get_page_from_freelist().

So the semantics your patch introduces for the second case
is:
- attempt allocation within the cpuset
- if that fails, allocate from anywhere

Basically, Case B falls back to case A when the cpuset is
full. So my question really is whether we need to attempt
to allocate within the cpuset for GFP_ATOMIC because
most of the time the local node will be within the cpuset
anyway....

So that's what led to me asking this - is there really a
noticeable distinction between A and B, or is it just
cluttering up the code with needless complex logic?

> One non-obvious (to me at least, for now) detail of such a design change
> would be what to do with the __alloc_pages() code lines:
>
> do {
> if (cpuset_zone_allowed(*z, gfp_mask|__GFP_HARDWALL))
> wakeup_kswapd(*z, order);
> } while (*(++z));
>
> If 'wait' is set, for allocations in the current task context that
> can sleep, then it's obvious enough - just wake up the kswapd's on
> the nodes in the current task's cpuset.
>
> But what do we do if 'wait' is not set, such as when in interrupt or
> for GFP_ATOMIC requests. Calling cpuset_zone_allowed() is no longer
> allowed in that case.

Sorry, I don't follow why you'd think that this would be not
allowed. Can you explain this further?

> + * So to keep it simple here:
> + *
> + * ==> Don't call this routine from __alloc_pages() if __GFP_WAIT is
> + * not set in gfp_mask. Don't call here from anywhere else if
> + * one can't sleep. Just assume that the zone is allowed.

Why not simply check this in __cpuset_zone_allowed() and return
true? We shouldn't put the burden of getting this right on the
callers when it is something internal to the cpuset workings....

> @@ -916,7 +915,7 @@ int zone_watermark_ok(struct zone *z, in
> */
> static struct page *
> get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
> - struct zonelist *zonelist, int alloc_flags)
> + struct zonelist *zonelist, int alloc_flags, int wait)
> {
> struct zone **z = zonelist->zones;
> struct page *page = NULL;
> @@ -927,8 +926,7 @@ get_page_from_freelist(gfp_t gfp_mask, u
> * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> */
> do {
> - if ((alloc_flags & ALLOC_CPUSET) &&
> - !cpuset_zone_allowed(*z, gfp_mask))
> + if (wait && !cpuset_zone_allowed(*z, gfp_mask))
> continue;

Why push another wait flag around when there's already one in the
gfp_mask?

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group

2006-05-19 03:12:28

by Paul Jackson

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

David wrote:
> Basically, Case B falls back to case A when the cpuset is
> full. So my question really is whether we need to attempt
> to allocate within the cpuset for GFP_ATOMIC because
> most of the time the local node will be within the cpuset
> anyway....
>
> So that's what led to me asking this - is there really a
> noticeable distinction between A and B, or is it just
> cluttering up the code with needless complex logic?

Perhaps I'm missing something, but that's what I thought you were
asking, and that's what I tried to answer, in my last post, saying:

pj wrote:
> I suspect we could do this, and it might be a good idea. There may
> well not be good enough reason to be making a special case of [B] above.


You sound frustrated that I am not understanding your question,
and I am feeling a little frustrated that you don't seem to have
realized that I thought I already recognized and responded to your
question, with an answer sympathetic to your concerns, and a possible
patch to address them.


David wrote:
> Why not simply check this in __cpuset_zone_allowed() and return
> true? We shouldn't put the burden of getting this right on the
> callers when it is something internal to the cpuset workings....

The callers are already conscious of whether or not they can wait.
For all of the callers of cpuset_zone_allowed() except __alloc_pages,
they can very well wait, and such a check is noise. For __alloc_pages,
it is quite consciously managing what actions it takes based on what
can wait and what can't.

Please see my belts and suspenders metaphor in the previous message,
or the kfree analogy.

In some programming contexts, I add redundancy for robustness, and
in some contexts I minimize redundancy for lean and mean code. The
kernel tends to be the latter, especially on important code paths. In
particular, I have spent quite a bit of effort over the last year or
two, reducing to a minimum the number of instructions, cache lines,
locks and conditional jumps imposed on the memory allocation code path
by cpusets.


> > But what do we do if 'wait' is not set, such as when in interrupt or
> > for GFP_ATOMIC requests. Calling cpuset_zone_allowed() is no longer
> > allowed in that case.
>
> Sorry, I don't follow why you'd think that this would be not
> allowed. Can you explain this further?

I've probably already written enough words for today on why I suggested
having cpuset_zone_allowed() not be called if it could not wait.

And even if it could be called, say with the "__GFP_WAIT" check you
suggest, there is still the question of what kswapd daemons to wake up
when a memory allocation off an interrupt comes up short of memory on
all nodes. Picking off the nodes in the interrupted task's cpuset seems
to be rather arbitrary, at best.


> Why push another wait flag around when there's already one in the
> gfp_mask?

Good point - I was just in the habit of using the local variable
'wait' in the __alloc_pages code, and not re-extracting it from
gfp_mask.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-05-19 08:54:45

by David Chinner

Subject: Re: [PATCH 01/03] Cpuset: might sleep checking zones allowed fix

On Thu, May 18, 2006 at 08:12:07PM -0700, Paul Jackson wrote:
> David wrote:
> > Basically, Case B falls back to case A when the cpuset is
> > full. So my question really is whether we need to attempt
> > to allocate within the cpuset for GFP_ATOMIC because
> > most of the time the local node will be within the cpuset
> > anyway....
> >
> > So that's what led to me asking this - is there really a
> > noticeable distinction between A and B, or is it just
> > cluttering up the code with needless complex logic?
>
> Perhaps I'm missing something, but that's what I thought you were
> asking, and that's what I tried to answer, in my last post, saying:

....

> realized that I thought I already recognized and responded to your
> question, with an answer sympathetic to your concerns, and a possible
> patch to address them.

Perhaps we both are :/

email:
a communication medium where two people can agree
with each other but be unaware of this fact.

;)

FWIW, I don't play with cpusets much; I mostly come across them when
a cpuset goes OOM and the I/O subsystem hangs somewhere in a
filesystem, block or driver layer and they are typically due to
problems with GFP_ATOMIC allocations. So my POV is probably
different to yours.

> David wrote:
> > Why not simply check this in __cpuset_zone_allowed() and return
> > true? We shouldn't put the burden of getting this right on the
> > callers when it is something internal to the cpuset workings....
>
> The callers are already conscious of whether or not they can wait.
> For all of the callers of cpuset_zone_allowed() except __alloc_pages,
> they can very well wait, and such a check is noise.

5 of the 6 other callers use __GFP_HARDWALL so won't ever sleep.
Given that 4 of the 5 are in memory reclaim paths, that's probably
a good thing. To an outsider, this appears like __GFP_HARDWALL is
being used to ensure we don't sleep (i.e. GFP_ATOMIC semantics),
and this is one of the angles my reasoning comes from.

> In some programming contexts, I add redundancy for robustness, and
> in some contexts I minimize redundancy for lean and mean code. The
> kernel tends to be the latter, especially on important code paths. In

I often wish for better robustness while deep in the guts of
a crash dump from a machine that's gone OOM and hung.

Cheers,

Dave.
--
Dave Chinner
R&D Software Engineer
SGI Australian Software Group