Changelog since V1
o kswapd should sleep if need_resched
o Remove __GFP_REPEAT from GFP flags when speculatively using high
orders so direct/compaction exits earlier
o Remove __GFP_NORETRY for correctness
o Correct logic in sleeping_prematurely
o Leave SLUB using the default slub_max_order
There are a few reports of people experiencing hangs when copying
large amounts of data, with kswapd using a large amount of CPU. These
appear to be due to recent reclaim changes.
SLUB using high orders is the trigger but not the root cause as SLUB
has been using high orders for a while. The following four patches
aim to fix the problems in reclaim while reducing the cost for SLUB
using those high orders.
Patch 1 corrects logic introduced by commit [1741c877: mm:
kswapd: keep kswapd awake for high-order allocations until
a percentage of the node is balanced] to allow kswapd to
go to sleep when balanced for high orders.
Patch 2 prevents kswapd waking up in response to SLUB's speculative
use of high orders.
Patch 3 further reduces the cost by preventing SLUB from entering the
direct compaction or reclaim paths on the grounds that falling
back to order-0 should be cheaper.
Patch 4 notes that even when kswapd is failing to keep up with
allocation requests, it should still go to sleep when its
quota has expired to prevent it spinning.
My own data on this is not great. I haven't really been able to
reproduce the same problem locally.
The test case is simple. "download tar" wgets a large tar file and
stores it locally. "unpack" is expanding it (15 times physical RAM
in this case) and "delete source dirs" is the tarfile being deleted
again. I also experimented with having the tar copied numerous times
and into deeper directories to increase the size but the results were
not particularly interesting so I left it as one tar.
In the background, applications are launched and timed to vaguely
simulate activity on the desktop and to measure how long it takes
applications to start.
Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
LARGE COPY AND UNTAR
vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep
download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%)
unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%)
copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%)
delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26
Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35
MMTests Statistics: application launch
evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08
gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96
iceweasel-table mean 7.93 7.81 7.73 7.65 7.88
evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15
gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02
iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11
Having SLUB avoid expensive steps in reclaim improves performance
by quite a bit with the overall test completing 1.5 minutes
faster. Application launch times were not really affected but it's
not something my test machine was suffering from in the first place
so it's not really conclusive. The kswapd patches also did not appear
to help but again, the test machine wasn't suffering that problem.
These patches are against 2.6.39-rc7. Again, testing would be
appreciated.
Documentation/vm/slub.txt | 2 +-
mm/page_alloc.c | 3 ++-
mm/slub.c | 5 +++--
3 files changed, 6 insertions(+), 4 deletions(-)
mm/page_alloc.c | 3 ++-
mm/slub.c | 3 ++-
mm/vmscan.c | 6 +++++-
3 files changed, 9 insertions(+), 3 deletions(-)
--
1.7.3.4
Johannes Weiner pointed out that the logic in commit [1741c877: mm:
kswapd: keep kswapd awake for high-order allocations until a percentage
of the node is balanced] is backwards. Instead of allowing kswapd to go
to sleep when balancing for high order allocations, it keeps kswapd
running uselessly.
From-but-was-not-signed-off-by: Johannes Weiner <[email protected]>
Will-sign-off-after-Johannes: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b435c..af24d1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
* must be balanced
*/
if (order)
- return pgdat_balanced(pgdat, balanced, classzone_idx);
+ return !pgdat_balanced(pgdat, balanced, classzone_idx);
else
return !all_zones_ok;
}
--
1.7.3.4
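As a side note for readers less familiar with this path, below is a minimal userspace sketch of the intent of the corrected check. It is not kernel code: pgdat_balanced_stub(), sleeping_prematurely_stub() and the 25% threshold are illustrative placeholders. Returning true means kswapd woke prematurely and should go back to work, which for a high-order request should only be the case while the node is not yet balanced, hence the negation above.

#include <stdbool.h>
#include <stdio.h>

/* Placeholder for the kernel's pgdat_balanced(): treat the node as balanced
 * once roughly 25% of its pages sit in zones that meet their watermarks. */
static bool pgdat_balanced_stub(unsigned long balanced_pages,
				unsigned long managed_pages)
{
	return balanced_pages * 4 >= managed_pages;
}

/* Returns true when kswapd woke prematurely and should keep reclaiming. */
static bool sleeping_prematurely_stub(int order, unsigned long balanced_pages,
				      unsigned long managed_pages,
				      bool all_zones_ok)
{
	if (order)
		/* keep running only while the node is NOT yet balanced */
		return !pgdat_balanced_stub(balanced_pages, managed_pages);
	return !all_zones_ok;
}

int main(void)
{
	/* order-3 request, node mostly balanced: kswapd may sleep (prints 0) */
	printf("%d\n", sleeping_prematurely_stub(3, 300, 1000, true));
	/* order-3 request, node far from balanced: keep working (prints 1) */
	printf("%d\n", sleeping_prematurely_stub(3, 100, 1000, true));
	return 0;
}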
To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail. However, by simply trying to allocate, kswapd is woken up to
start reclaiming at that order. On a desktop system, two users report
that the system is getting locked up with kswapd using large amounts
of CPU. Using SLAB instead of SLUB made this problem go away.
This patch prevents kswapd being woken up for high-order allocations.
Testing indicated that with this patch applied, the system was much
harder to hang and even when it did, it eventually recovered.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/slub.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/slub.c b/mm/slub.c
index 9d2e5e4..98c358d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Let the initial higher-order allocation fail under memory pressure
* so we fall-back to the minimum order allocation.
*/
- alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
page = alloc_slab_page(alloc_gfp, node, oo);
if (unlikely(!page)) {
--
1.7.3.4
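For illustration only, here is a small userspace sketch of the "speculative high order with fallback to the minimum order" pattern that allocate_slab() follows. alloc_pages_stub(), allocate_slab_like() and the order values are made-up placeholders, not the kernel API; in the real code, the first speculative attempt is where __GFP_NO_KSWAPD now keeps kswapd asleep.

#include <stdio.h>
#include <stdlib.h>

/* Pretend page allocator: fail anything above order-1 to mimic pressure. */
static void *alloc_pages_stub(int order)
{
	if (order > 1)
		return NULL;
	return malloc(4096UL << order);
}

static void *allocate_slab_like(int preferred_order, int min_order)
{
	/* Speculative attempt at the preferred (high) order. In the kernel
	 * this attempt is made with relaxed flags so it fails quickly and,
	 * with this series, without waking kswapd. */
	void *page = alloc_pages_stub(preferred_order);

	if (!page)
		/* Fall back to the minimum order the cache can use. */
		page = alloc_pages_stub(min_order);
	return page;
}

int main(void)
{
	void *p = allocate_slab_like(3, 0);

	printf("%s\n", p ? "fell back to a smaller allocation" : "failed");
	free(p);
	return 0;
}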
To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail. However, by simply trying to allocate, the caller can enter
compaction or reclaim - both of which are likely to cost more than the
benefit of using high-order pages in SLUB. On a desktop system, two
users report that the system is getting stalled with kswapd using large
amounts of CPU.
This patch prevents SLUB taking any expensive steps when trying to use
high-order allocations. Instead, it is expected to fall back to smaller
orders more aggressively. Testing was somewhat inconclusive on how much
this helped but it makes sense that falling back to order-0 allocations
is faster than entering compaction or direct reclaim.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
mm/slub.c | 3 ++-
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
{
int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
const gfp_t wait = gfp_mask & __GFP_WAIT;
+ const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
*/
alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
- if (!wait) {
+ if (!wait && can_wake_kswapd) {
/*
* Not worth trying to allocate harder for
* __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..c5797ab 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Let the initial higher-order allocation fail under memory pressure
* so we fall-back to the minimum order allocation.
*/
- alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
+ ~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
page = alloc_slab_page(alloc_gfp, node, oo);
if (unlikely(!page)) {
--
1.7.3.4
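To show what the flag arithmetic in the slub.c hunk above achieves, here is a small standalone sketch. The bit values are invented purely for illustration (the real __GFP_* definitions live in include/linux/gfp.h and differ); the point is that the speculative attempt ends up without __GFP_WAIT, __GFP_REPEAT and __GFP_NOFAIL, so it cannot enter direct reclaim or compaction, while __GFP_NO_KSWAPD keeps kswapd from being woken.

#include <stdio.h>

typedef unsigned int gfp_t;

/* Invented bit values for illustration only; NOT the kernel's constants. */
#define STUB_GFP_WAIT       0x01u
#define STUB_GFP_NOWARN     0x02u
#define STUB_GFP_REPEAT     0x04u
#define STUB_GFP_NOFAIL     0x08u
#define STUB_GFP_NO_KSWAPD  0x10u

int main(void)
{
	/* Whatever the slab caller passed in. */
	gfp_t flags = STUB_GFP_WAIT | STUB_GFP_NOFAIL;

	/* Same shape as the patched line in allocate_slab(). */
	gfp_t alloc_gfp = (flags | STUB_GFP_NOWARN | STUB_GFP_NO_KSWAPD) &
			  ~(STUB_GFP_NOFAIL | STUB_GFP_WAIT | STUB_GFP_REPEAT);

	printf("may sleep/direct reclaim: %s\n",
	       (alloc_gfp & STUB_GFP_WAIT) ? "yes" : "no");        /* no */
	printf("may wake kswapd:          %s\n",
	       (alloc_gfp & STUB_GFP_NO_KSWAPD) ? "no" : "yes");   /* no */
	return 0;
}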
Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
unsigned long balanced = 0;
bool all_zones_ok = true;
+ /* If kswapd has been running too long, just sleep */
+ if (need_resched())
+ return false;
+
/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return true;
--
1.7.3.4
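As a rough userspace illustration of the idea (not kernel code; need_resched_stub(), the 1000-pass budget and the loop structure are assumptions made for the example), the check treats "the scheduler wants the CPU back" as a signal to stop reporting a premature wakeup and go to sleep instead of spinning:

#include <stdbool.h>
#include <stdio.h>

static int passes;

/* Stand-in for need_resched(): pretend the scheduler asks for the CPU back
 * after an arbitrary amount of work. */
static bool need_resched_stub(void)
{
	return passes > 1000;
}

/* Mirrors the shape of the patched check: if we have been running too long,
 * report that the wakeup was not premature so kswapd goes to sleep. */
static bool keep_reclaiming(bool node_balanced)
{
	if (need_resched_stub())
		return false;
	return !node_balanced;
}

int main(void)
{
	bool balanced = false;	/* constant pressure: the node never balances */

	while (keep_reclaiming(balanced))
		passes++;	/* one reclaim pass */

	printf("slept after %d passes\n", passes);
	return 0;
}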
On Fri, May 13, 2011 at 03:03:21PM +0100, Mel Gorman wrote:
> Johannes Weiner poined out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps it kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <[email protected]>
Thanks for picking it up, Mel.
Signed-off-by: Johannes Weiner <[email protected]>
> Will-sign-off-after-Johannes: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b435c..af24d1e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2286,7 +2286,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> * must be balanced
> */
> if (order)
> - return pgdat_balanced(pgdat, balanced, classzone_idx);
> + return !pgdat_balanced(pgdat, balanced, classzone_idx);
> else
> return !all_zones_ok;
> }
> --
> 1.7.3.4
On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> Changelog since V1
> o kswapd should sleep if need_resched
> o Remove __GFP_REPEAT from GFP flags when speculatively using high
> orders so direct/compaction exits earlier
> o Remove __GFP_NORETRY for correctness
> o Correct logic in sleeping_prematurely
> o Leave SLUB using the default slub_max_order
>
> There are a few reports of people experiencing hangs when copying
> large amounts of data with kswapd using a large amount of CPU which
> appear to be due to recent reclaim changes.
>
> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
>
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until
> a percentage of the node is balanced] to allow kswapd to
> go to sleep when balanced for high orders.
>
> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> use of high orders.
>
> Patch 3 further reduces the cost by prevent SLUB entering direct
> compaction or reclaim paths on the grounds that falling
> back to order-0 should be cheaper.
>
> Patch 4 notes that even when kswapd is failing to keep up with
> allocation requests, it should still go to sleep when its
> quota has expired to prevent it spinning.
This all works fine for me ... three untar runs and no kswapd hangs or
pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%
This isn't as good as the kswapd sleeping_prematurely() throttling
patch. For total CPU time on a three 90GB untar run, it's about 64s of
CPU time with your patch rather than 6s, but that's vastly better than
the 15 minutes of CPU time kswapd was taking even under PREEMPT.
James
On Fri, 13 May 2011, Mel Gorman wrote:
> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
>
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until
> a percentage of the node is balanced] to allow kswapd to
> go to sleep when balanced for high orders.
The above looks good.
> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> use of high orders.
Not sure that is necessary since it seems we triggered kswapd
before. Why not continue to do it? Once kswapd has enough higher-order
pages, kswapd should no longer be triggered, right?
> Patch 3 further reduces the cost by prevent SLUB entering direct
> compaction or reclaim paths on the grounds that falling
> back to order-0 should be cheaper.
It's cheaper for the reclaim path, true, but more expensive in terms of
SLUB's management costs for the data and it also increases the memory wasted.
A higher order means denser packing of objects and less page management
overhead. Fallback is not free. A reasonable effort should be made to
allocate the page order requested.
> Patch 4 notes that even when kswapd is failing to keep up with
> allocation requests, it should still go to sleep when its
> quota has expired to prevent it spinning.
Looks good too.
Overall, it looks like the compaction logic and the modifications to
reclaim introduced recently with the intent of increasing the amount of
physically contiguous memory are not working as expected.
SLUB's chance of getting higher-order pages should be *increasing* as a
result of these changes. The above looks like the chances are decreasing
now.
This is a matter of future concern. The metadata management overhead
in the kernel is continually increasing since memory sizes keep growing
and we typically manage memory in 4k chunks. Through larger allocation
sizes we can reduce that management overhead, but we can only do this if we
have an effective way of defragmenting memory to get longer contiguous
chunks that can be managed with a single page struct.
Please make sure that compaction and related measures really work properly.
The patches suggest that the recent modifications are not improving the
situation.
On Fri, May 13, 2011 at 10:21:46AM -0500, Christoph Lameter wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> >
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > kswapd: keep kswapd awake for high-order allocations until
> > a percentage of the node is balanced] to allow kswapd to
> > go to sleep when balanced for high orders.
>
> The above looks good.
>
Ok.
> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > use of high orders.
>
> Not sure if that is necessary since it seems that we triggered kswapd
> before? Why not continue to do it? Once kswapd has enough higher order
> pages kswapd should no longer be triggered right?
>
Because kswapd waking up isn't cheap and we are reclaiming pages
just so SLUB may get high-order pages in the future. As it's for
PAGE_ALLOC_COSTLY_ORDER, we are not entering lumpy reclaim and are just
selecting a few random order-0 pages which may or may not help. There
is very little control over how many pages get freed if kswapd
is being woken frequently.
> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > compaction or reclaim paths on the grounds that falling
> > back to order-0 should be cheaper.
>
> Its cheaper for reclaim path true but more expensive in terms of SLUBs
> management costs of the data and it also increases the memory wasted.
Surely the reclaim cost exceeds SLUB management cost?
> A
> higher order means denser packing of objects less page management
> overhead. Fallback is not for free.
Neither is reclaiming a large bunch of pages. Worse, reclaiming
pages so that SLUB gets a high-order page means it's likely to be stealing
MIGRATE_MOVABLE blocks, which eventually gives diminishing returns but
may not be noticeable for weeks. From a fragmentation perspective,
it's better if SLUB uses order-0 allocations when memory is low so
that SLUB pages continue to get packed into as few MIGRATE_UNMOVABLE
and MIGRATE_UNRECLAIMABLE blocks as possible.
> Reasonable effort should be made to
> allocate the page order requested.
>
> > Patch 4 notes that even when kswapd is failing to keep up with
> > allocation requests, it should still go to sleep when its
> > quota has expired to prevent it spinning.
>
> Looks good too.
>
> Overall, it looks like the compaction logic and the modifications to
> reclaim introduced recently with the intend to increase the amount of
> physically contiguous memory is not working as expected.
>
The reclaim and kswapd damage was unintended and this is my fault
but reclaim/compaction still makes a lot more sense than lumpy
reclaim. Testing showed it disrupted the system a lot less and
allocated high-order pages faster with fewer pages reclaimed.
> SLUBs chance of getting higher order pages should be *increasing* as a
> result of these changes. The above looks like the chances are decreasing
> now.
>
Patches 2 and 3 may mean that SLUB gets fewer high-order pages when
memory is low, instead depending on high-order pages being freed naturally
as SLUB recycles slabs of old objects. On the flip side,
fewer pages will be reclaimed. I'd expect the latter option to be
cheaper overall.
> This is a matter of future concern. The metadata management overhead
> in the kernel is continually increasing since memory sizes keep growing
> and we typically manage memory in 4k chunks. Through large allocation
> sizes we can reduce that management overhead but we can only do this if we
> have an effective way of defragmenting memory to get longer contiguous
> chunks that can be managed to a single page struct.
>
> Please make sure that compaction and related measures really work properly.
>
Local testing still shows them to be behaving as expected but then
again, I haven't reproduced the simple problem reported by Chris
and James despite using a few different laptops and two different
low-end servers.
> The patches suggest that the recent modifications are not improving the
> situation.
>
--
Mel Gorman
SUSE Labs
On Fri, May 13, 2011 at 10:19:44AM -0500, James Bottomley wrote:
> On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > Changelog since V1
> > o kswapd should sleep if need_resched
> > o Remove __GFP_REPEAT from GFP flags when speculatively using high
> > orders so direct/compaction exits earlier
> > o Remove __GFP_NORETRY for correctness
> > o Correct logic in sleeping_prematurely
> > o Leave SLUB using the default slub_max_order
> >
> > There are a few reports of people experiencing hangs when copying
> > large amounts of data with kswapd using a large amount of CPU which
> > appear to be due to recent reclaim changes.
> >
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> >
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > kswapd: keep kswapd awake for high-order allocations until
> > a percentage of the node is balanced] to allow kswapd to
> > go to sleep when balanced for high orders.
> >
> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > use of high orders.
> >
> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > compaction or reclaim paths on the grounds that falling
> > back to order-0 should be cheaper.
> >
> > Patch 4 notes that even when kswapd is failing to keep up with
> > allocation requests, it should still go to sleep when its
> > quota has expired to prevent it spinning.
>
> This all works fine for me ... three untar runs and no kswapd hangs or
> pegging the CPU at 99% ... in fact, kswapd rarely gets over 20%
>
Good stuff, thanks.
> This isn't as good as the kswapd sleeping_prematurely() throttling
> patch. For total CPU time on a three 90GB untar run, it's about 64s of
> CPU time with your patch rather than 6s, but that's vastly better than
> the 15 minutes of CPU time kswapd was taking even under PREEMPT.
>
The throttling patch is unfortunately a bit hand-wavy, based on the number
of times it's entered and the time passed. It'll be even harder to debug
problems related to this in the future, particularly as it uses global
information (a static) for kswapd, which is per-node (and could be worse in
the future depending on what memcg does).
However, as you are testing against stable, can you also apply this
patch? [2876592f: mm: vmscan: stop reclaim/compaction earlier due to
insufficient progress if !__GFP_REPEAT]. It makes a difference as to
when reclaimers give up on high-orders and go to sleep.
--
Mel Gorman
SUSE Labs
On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> Changelog since V1
> o kswapd should sleep if need_resched
> o Remove __GFP_REPEAT from GFP flags when speculatively using high
> orders so direct/compaction exits earlier
> o Remove __GFP_NORETRY for correctness
> o Correct logic in sleeping_prematurely
> o Leave SLUB using the default slub_max_order
>
> There are a few reports of people experiencing hangs when copying
> large amounts of data with kswapd using a large amount of CPU which
> appear to be due to recent reclaim changes.
>
> SLUB using high orders is the trigger but not the root cause as SLUB
> has been using high orders for a while. The following four patches
> aim to fix the problems in reclaim while reducing the cost for SLUB
> using those high orders.
>
> Patch 1 corrects logic introduced by commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until
> a percentage of the node is balanced] to allow kswapd to
> go to sleep when balanced for high orders.
>
> Patch 2 prevents kswapd waking up in response to SLUBs speculative
> use of high orders.
>
> Patch 3 further reduces the cost by prevent SLUB entering direct
> compaction or reclaim paths on the grounds that falling
> back to order-0 should be cheaper.
>
> Patch 4 notes that even when kswapd is failing to keep up with
> allocation requests, it should still go to sleep when its
> quota has expired to prevent it spinning.
>
> My own data on this is not great. I haven't really been able to
> reproduce the same problem locally.
>
> The test case is simple. "download tar" wgets a large tar file and
> stores it locally. "unpack" is expanding it (15 times physical RAM
> in this case) and "delete source dirs" is the tarfile being deleted
> again. I also experimented with having the tar copied numerous times
> and into deeper directories to increase the size but the results were
> not particularly interesting so I left it as one tar.
>
> In the background, applications are being launched to time to vaguely
> simulate activity on the desktop and to measure how long it takes
> applications to start.
>
> Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> LARGE COPY AND UNTAR
> vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep
> download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%)
> unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%)
> copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%)
> delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26
> Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35
>
> MMTests Statistics: application launch
> evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08
> gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96
> iceweasel-table mean 7.93 7.81 7.73 7.65 7.88
>
> evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15
> gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02
> iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11
>
> Having SLUB avoid expensive steps in reclaim improves performance
> by quite a bit with the overall test completing 1.5 minutes
> faster. Application launch times were not really affected but it's
> not something my test machine was suffering from in the first place
> so it's not really conclusive. The kswapd patches also did not appear
> to help but again, the test machine wasn't suffering that problem.
>
> These patches are against 2.6.39-rc7. Again, testing would be
> appreciated.
These patches solve the problem for me. I've been soak testing the file
copy test for 3.5 hours with nearly 400 test cycles and observed no lockups
at all - rock solid. From my observations of the vmstat output, the
system is behaving sanely.
Thanks for finding a solution - much appreciated!
>
> Documentation/vm/slub.txt | 2 +-
> mm/page_alloc.c | 3 ++-
> mm/slub.c | 5 +++--
> 3 files changed, 6 insertions(+), 4 deletions(-)
>
> mm/page_alloc.c | 3 ++-
> mm/slub.c | 3 ++-
> mm/vmscan.c | 6 +++++-
> 3 files changed, 9 insertions(+), 3 deletions(-)
>
On Fri, May 13, 2011 at 11:03 PM, Mel Gorman <[email protected]> wrote:
> Johannes Weiner poined out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps it kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner <[email protected]>
> Will-sign-off-after-Johannes: Mel Gorman <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Nice catch! Hannes.
--
Kind regards,
Minchan Kim
(2011/05/13 23:03), Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
>
> Signed-off-by: Mel Gorman<[email protected]>
> ---
> mm/vmscan.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> unsigned long balanced = 0;
> bool all_zones_ok = true;
>
> + /* If kswapd has been running too long, just sleep */
> + if (need_resched())
> + return false;
> +
Hmm... I don't like this patch so much, because this code does:
- don't sleep if kswapd got a context switch at shrink_inactive_list
- sleep if kswapd didn't
It seems to be semi-random behavior.
On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> >
> > Signed-off-by: Mel Gorman<[email protected]>
> > ---
> > mm/vmscan.c | 4 ++++
> > 1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > unsigned long balanced = 0;
> > bool all_zones_ok = true;
> >
> > + /* If kswapd has been running too long, just sleep */
> > + if (need_resched())
> > + return false;
> > +
>
> Hmm... I don't like this patch so much. because this code does
>
> - don't sleep if kswapd got context switch at shrink_inactive_list
This isn't entirely true: need_resched() will be false, so we'll follow
the normal path for determining whether to sleep or not, in effect
leaving the current behaviour unchanged.
> - sleep if kswapd didn't
This also isn't entirely true: whether need_resched() is true at this
point depends on a whole lot more than whether we did a context switch
in shrink_inactive. It mostly depends on how long we've been running
without giving up the CPU. Generally that will mean we've been round
the shrinker loop hundreds to thousands of times without sleeping.
> It seems to be semi random behavior.
Well, we have to do something. Chris Mason first suspected the hang was
a kswapd rescheduling problem a while ago. We tried putting
cond_rescheds() in several places in the vmscan code, but to no avail.
The need_resched() in sleeping_prematurely() seems to be about the best
option. The other option might be just to put a cond_resched() in
kswapd_try_to_sleep(), but that will really have about the same effect.
James
On Mon, May 16, 2011 at 1:21 PM, James Bottomley
<[email protected]> wrote:
> On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> (2011/05/13 23:03), Mel Gorman wrote:
>> > Under constant allocation pressure, kswapd can be in the situation where
>> > sleeping_prematurely() will always return true even if kswapd has been
>> > running a long time. Check if kswapd needs to be scheduled.
>> >
>> > Signed-off-by: Mel Gorman<[email protected]>
>> > ---
>> > mm/vmscan.c | 4 ++++
>> > 1 files changed, 4 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index af24d1e..4d24828 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > unsigned long balanced = 0;
>> > bool all_zones_ok = true;
>> >
>> > + /* If kswapd has been running too long, just sleep */
>> > + if (need_resched())
>> > + return false;
>> > +
>>
>> Hmm... I don't like this patch so much. because this code does
>>
>> - don't sleep if kswapd got context switch at shrink_inactive_list
>
> This isn't entirely true: need_resched() will be false, so we'll follow
> the normal path for determining whether to sleep or not, in effect
> leaving the current behaviour unchanged.
>
>> - sleep if kswapd didn't
>
> This also isn't entirely true: whether need_resched() is true at this
> point depends on a whole lot more that whether we did a context switch
> in shrink_inactive. It mostly depends on how long we've been running
> without giving up the CPU. Generally that will mean we've been round
> the shrinker loop hundreds to thousands of times without sleeping.
>
>> It seems to be semi random behavior.
>
> Well, we have to do something. Chris Mason first suspected the hang was
> a kswapd rescheduling problem a while ago. We tried putting
> cond_rescheds() in several places in the vmscan code, but to no avail.
Is that the result of a test with Hannes' patch (ie, !pgdat_balanced)?
If it isn't, it would be a nop regardless of where cond_resched() is put in vmscan.c.
Because, although we complete zone balancing, kswapd doesn't sleep as
pgdat_balanced returns the wrong result. And at last the VM calls
balance_pgdat. In this case, balance_pgdat returns without doing any work as
kswapd couldn't find zones which don't have enough free pages and goes to
out. kswapd could repeat this work infinitely, so you don't have a
chance to call cond_resched.
But if your test was with Hannes' patch, I am very curious how kswapd
comes to consume so much CPU.
> The need_resched() in sleeping_prematurely() seems to be about the best
> option. The other option might be just to put a cond_resched() in
> kswapd_try_to_sleep(), but that will really have about the same effect.
I don't oppose it, but before that I think we have to know why kswapd
consumes so much CPU even though we applied Hannes' patch.
>
> James
>
>
--
Kind regards,
Minchan Kim
On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote:
> On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > Changelog since V1
> > o kswapd should sleep if need_resched
> > o Remove __GFP_REPEAT from GFP flags when speculatively using high
> > orders so direct/compaction exits earlier
> > o Remove __GFP_NORETRY for correctness
> > o Correct logic in sleeping_prematurely
> > o Leave SLUB using the default slub_max_order
> >
> > There are a few reports of people experiencing hangs when copying
> > large amounts of data with kswapd using a large amount of CPU which
> > appear to be due to recent reclaim changes.
> >
> > SLUB using high orders is the trigger but not the root cause as SLUB
> > has been using high orders for a while. The following four patches
> > aim to fix the problems in reclaim while reducing the cost for SLUB
> > using those high orders.
> >
> > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > kswapd: keep kswapd awake for high-order allocations until
> > a percentage of the node is balanced] to allow kswapd to
> > go to sleep when balanced for high orders.
> >
> > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > use of high orders.
> >
> > Patch 3 further reduces the cost by prevent SLUB entering direct
> > compaction or reclaim paths on the grounds that falling
> > back to order-0 should be cheaper.
> >
> > Patch 4 notes that even when kswapd is failing to keep up with
> > allocation requests, it should still go to sleep when its
> > quota has expired to prevent it spinning.
> >
> > My own data on this is not great. I haven't really been able to
> > reproduce the same problem locally.
> >
> > The test case is simple. "download tar" wgets a large tar file and
> > stores it locally. "unpack" is expanding it (15 times physical RAM
> > in this case) and "delete source dirs" is the tarfile being deleted
> > again. I also experimented with having the tar copied numerous times
> > and into deeper directories to increase the size but the results were
> > not particularly interesting so I left it as one tar.
> >
> > In the background, applications are being launched to time to vaguely
> > simulate activity on the desktop and to measure how long it takes
> > applications to start.
> >
> > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> > LARGE COPY AND UNTAR
> > vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep
> > download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%)
> > unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%)
> > copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%)
> > delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26
> > Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35
> >
> > MMTests Statistics: application launch
> > evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08
> > gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96
> > iceweasel-table mean 7.93 7.81 7.73 7.65 7.88
> >
> > evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15
> > gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02
> > iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11
> >
> > Having SLUB avoid expensive steps in reclaim improves performance
> > by quite a bit with the overall test completing 1.5 minutes
> > faster. Application launch times were not really affected but it's
> > not something my test machine was suffering from in the first place
> > so it's not really conclusive. The kswapd patches also did not appear
> > to help but again, the test machine wasn't suffering that problem.
> >
> > These patches are against 2.6.39-rc7. Again, testing would be
> > appreciated.
>
> These patches solve the problem for me. I've been soak testing the file
> copy test
> for 3.5 hours with nearly 400 test cycles and observed no lockups at all
> - rock solid. From my observations from the output from vmstat the
> system is behaving sanely.
> Thanks for finding a solution - much appreciated!
>
Can you tell me if just patches 1 and 4 fix the problem please? It'd be good
to know if this was only a reclaim-related problem. Thanks.
--
Mel Gorman
SUSE Labs
On Sun, May 15, 2011 at 07:27:12PM +0900, KOSAKI Motohiro wrote:
> (2011/05/13 23:03), Mel Gorman wrote:
> > Under constant allocation pressure, kswapd can be in the situation where
> > sleeping_prematurely() will always return true even if kswapd has been
> > running a long time. Check if kswapd needs to be scheduled.
> >
> > Signed-off-by: Mel Gorman<[email protected]>
> > ---
> > mm/vmscan.c | 4 ++++
> > 1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..4d24828 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> > unsigned long balanced = 0;
> > bool all_zones_ok = true;
> >
> > + /* If kswapd has been running too long, just sleep */
> > + if (need_resched())
> > + return false;
> > +
>
> Hmm... I don't like this patch so much. because this code does
>
> - don't sleep if kswapd got context switch at shrink_inactive_list
> - sleep if kswapd didn't
>
> It seems to be semi random behavior.
>
It's possible to keep kswapd awake simply by allocating fast enough that
the watermarks are never balanced, making kswapd appear to consume 100%
of the CPU. This check causes kswapd to sleep in that case. The processes
doing the allocations will enter direct reclaim and probably stall, while
processes that are not allocating will get some CPU time.
--
Mel Gorman
SUSE Labs
On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> <[email protected]> wrote:
> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >
> >> > Signed-off-by: Mel Gorman<[email protected]>
> >> > ---
> >> >  mm/vmscan.c |    4 ++++
> >> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index af24d1e..4d24828 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > 	unsigned long balanced = 0;
> >> > 	bool all_zones_ok = true;
> >> >
> >> > +	/* If kswapd has been running too long, just sleep */
> >> > +	if (need_resched())
> >> > +		return false;
> >> > +
> >>
> >> Hmm... I don't like this patch so much. because this code does
> >>
> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >
> > This isn't entirely true: need_resched() will be false, so we'll follow
> > the normal path for determining whether to sleep or not, in effect
> > leaving the current behaviour unchanged.
> >
> >> - sleep if kswapd didn't
> >
> > This also isn't entirely true: whether need_resched() is true at this
> > point depends on a whole lot more that whether we did a context switch
> > in shrink_inactive. It mostly depends on how long we've been running
> > without giving up the CPU. Generally that will mean we've been round
> > the shrinker loop hundreds to thousands of times without sleeping.
> >
> >> It seems to be semi random behavior.
> >
> > Well, we have to do something. Chris Mason first suspected the hang was
> > a kswapd rescheduling problem a while ago. We tried putting
> > cond_rescheds() in several places in the vmscan code, but to no avail.
>
> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
>
> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> Because, although we complete zone balancing, kswapd doesn't sleep as
> pgdat_balance returns wrong result. And at last VM calls
> balance_pgdat. In this case, balance_pgdat returns without any work as
> kswap couldn't find zones which have not enough free pages and goto
> out. kswapd could repeat this work infinitely. So you don't have a
> chance to call cond_resched.
>
> But if your test was with Hanne's patch, I am very curious how come
> kswapd consumes CPU a lot.
>
> > The need_resched() in sleeping_prematurely() seems to be about the best
> > option. The other option might be just to put a cond_resched() in
> > kswapd_try_to_sleep(), but that will really have about the same effect.
>
> I don't oppose it but before that, I think we have to know why kswapd
> consumes CPU a lot although we applied Hannes' patch.
>
Because it's still possible for processes to allocate pages at the same
rate kswapd is freeing them, leading to a situation where kswapd does not
consider the zone balanced for prolonged periods of time.
--
Mel Gorman
SUSE Labs
On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <[email protected]> wrote:
> On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> <[email protected]> wrote:
>> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >
>> >> > Signed-off-by: Mel Gorman<[email protected]>
>> >> > ---
>> >> > mm/vmscan.c | 4 ++++
>> >> > 1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >
>> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> > index af24d1e..4d24828 100644
>> >> > --- a/mm/vmscan.c
>> >> > +++ b/mm/vmscan.c
>> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> > unsigned long balanced = 0;
>> >> > bool all_zones_ok = true;
>> >> >
>> >> > + /* If kswapd has been running too long, just sleep */
>> >> > + if (need_resched())
>> >> > + return false;
>> >> > +
>> >>
>> >> Hmm... I don't like this patch so much. because this code does
>> >>
>> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> >
>> > This isn't entirely true: need_resched() will be false, so we'll follow
>> > the normal path for determining whether to sleep or not, in effect
>> > leaving the current behaviour unchanged.
>> >
>> >> - sleep if kswapd didn't
>> >
>> > This also isn't entirely true: whether need_resched() is true at this
>> > point depends on a whole lot more that whether we did a context switch
>> > in shrink_inactive. It mostly depends on how long we've been running
>> > without giving up the CPU. Generally that will mean we've been round
>> > the shrinker loop hundreds to thousands of times without sleeping.
>> >
>> >> It seems to be semi random behavior.
>> >
>> > Well, we have to do something. Chris Mason first suspected the hang was
>> > a kswapd rescheduling problem a while ago. We tried putting
>> > cond_rescheds() in several places in the vmscan code, but to no avail.
>>
>> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
>>
>> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> Because, although we complete zone balancing, kswapd doesn't sleep as
>> pgdat_balance returns wrong result. And at last VM calls
>> balance_pgdat. In this case, balance_pgdat returns without any work as
>> kswap couldn't find zones which have not enough free pages and goto
>> out. kswapd could repeat this work infinitely. So you don't have a
>> chance to call cond_resched.
>>
>> But if your test was with Hanne's patch, I am very curious how come
>> kswapd consumes CPU a lot.
>>
>> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > option. The other option might be just to put a cond_resched() in
>> > kswapd_try_to_sleep(), but that will really have about the same effect.
>>
>> I don't oppose it but before that, I think we have to know why kswapd
>> consumes CPU a lot although we applied Hannes' patch.
>>
>
> Because it's still possible for processes to allocate pages at the same
> rate kswapd is freeing them leading to a situation where kswapd does not
> consider the zone balanced for prolonged periods of time.
We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
So I think kswapd can be scheduled out, although it gets scheduled back in
after a short time since the tasks that were scheduled also need page reclaim.
If all tasks in the system need reclaim, kswapd consuming 99% of the CPU is a
natural result, I think.
Am I missing something?
>
> --
> Mel Gorman
> SUSE Labs
>
--
Kind regards,
Minchan Kim
On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <[email protected]> wrote:
> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> <[email protected]> wrote:
> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> >> >
> >> >> > Signed-off-by: Mel Gorman<[email protected]>
> >> >> > ---
> >> >> >  mm/vmscan.c |    4 ++++
> >> >> >  1 files changed, 4 insertions(+), 0 deletions(-)
> >> >> >
> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> >> > index af24d1e..4d24828 100644
> >> >> > --- a/mm/vmscan.c
> >> >> > +++ b/mm/vmscan.c
> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> >> > 	unsigned long balanced = 0;
> >> >> > 	bool all_zones_ok = true;
> >> >> >
> >> >> > +	/* If kswapd has been running too long, just sleep */
> >> >> > +	if (need_resched())
> >> >> > +		return false;
> >> >> > +
> >> >>
> >> >> Hmm... I don't like this patch so much. because this code does
> >> >>
> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> >
> >> > This isn't entirely true: need_resched() will be false, so we'll follow
> >> > the normal path for determining whether to sleep or not, in effect
> >> > leaving the current behaviour unchanged.
> >> >
> >> >> - sleep if kswapd didn't
> >> >
> >> > This also isn't entirely true: whether need_resched() is true at this
> >> > point depends on a whole lot more that whether we did a context switch
> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> > without giving up the CPU. Generally that will mean we've been round
> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> >
> >> >> It seems to be semi random behavior.
> >> >
> >> > Well, we have to do something. Chris Mason first suspected the hang was
> >> > a kswapd rescheduling problem a while ago. We tried putting
> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >>
> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
> >>
> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> pgdat_balance returns wrong result. And at last VM calls
> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> kswap couldn't find zones which have not enough free pages and goto
> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> chance to call cond_resched.
> >>
> >> But if your test was with Hanne's patch, I am very curious how come
> >> kswapd consumes CPU a lot.
> >>
> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> > option. The other option might be just to put a cond_resched() in
> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >>
> >> I don't oppose it but before that, I think we have to know why kswapd
> >> consumes CPU a lot although we applied Hannes' patch.
> >>
> >
> > Because it's still possible for processes to allocate pages at the same
> > rate kswapd is freeing them leading to a situation where kswapd does not
> > consider the zone balanced for prolonged periods of time.
>
> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> So I think kswapd can be scheduled out although it's scheduled in
> after a short time as task scheduled also need page reclaim. Although
> all task in system need reclaim, kswapd cpu 99% consumption is a
> natural result, I think.
> Do I miss something?
>
Let's see:
shrink_page_list() only applies if inactive pages were isolated
which in turn may not happen if all_unreclaimable is set in
shrink_zones(). If, for whatever reason, all_unreclaimable is
set on all zones, we can miss calling cond_resched().
shrink_slab only applies if we are reclaiming slab pages. If the first
shrinker returns -1, we do not call cond_resched(). If that
first shrinker is dcache and __GFP_FS is not set, direct
reclaimers will not shrink at all. However, if there are
enough of them running or if one of the other shrinkers
is running for a very long time, kswapd could be starved
acquiring the shrinker_rwsem and never reaching the
cond_resched().
balance_pgdat() only calls cond_resched if the zones are not
balanced. For a high-order allocation that is balanced, it
checks order-0 again. During that window, order-0 might have
become unbalanced so it loops again for order-0 and returns to
kswapd() reporting that it was reclaiming for order-0. It can then find
that a caller has rewoken kswapd for a high-order and re-enters
balance_pgdat() without ever having called cond_resched().
While it appears unlikely, there are bad conditions which can result
in cond_resched() being avoided.
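Purely as a toy illustration of that failure mode (the function names and conditions below are placeholders for the kernel paths listed above, not real code), every yield point sits behind a condition, so a pass where none of the conditions hold never gives up the CPU:

#include <stdbool.h>
#include <stdio.h>

static unsigned long yields;

static void cond_resched_stub(void)
{
	yields++;	/* in the kernel this is where we would schedule() */
}

/* One reclaim pass: each cond_resched() call is guarded by a condition. */
static void reclaim_pass_stub(bool isolated_pages, bool zones_unbalanced)
{
	if (isolated_pages)
		cond_resched_stub();	/* e.g. the shrink_page_list() path */
	if (zones_unbalanced)
		cond_resched_stub();	/* e.g. the balance_pgdat() path */
	/* if neither holds, this pass never yields the CPU */
}

int main(void)
{
	for (int i = 0; i < 1000; i++)
		reclaim_pass_stub(false, false);	/* the bad case */

	printf("yields after 1000 passes: %lu\n", yields);	/* prints 0 */
	return 0;
}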
--
Mel Gorman
SUSE Labs
On Mon, 2011-05-16 at 09:37 +0100, Mel Gorman wrote:
> On Sat, May 14, 2011 at 10:34:33AM +0200, Colin Ian King wrote:
> > On Fri, 2011-05-13 at 15:03 +0100, Mel Gorman wrote:
> > > Changelog since V1
> > > o kswapd should sleep if need_resched
> > > o Remove __GFP_REPEAT from GFP flags when speculatively using high
> > > orders so direct/compaction exits earlier
> > > o Remove __GFP_NORETRY for correctness
> > > o Correct logic in sleeping_prematurely
> > > o Leave SLUB using the default slub_max_order
> > >
> > > There are a few reports of people experiencing hangs when copying
> > > large amounts of data with kswapd using a large amount of CPU which
> > > appear to be due to recent reclaim changes.
> > >
> > > SLUB using high orders is the trigger but not the root cause as SLUB
> > > has been using high orders for a while. The following four patches
> > > aim to fix the problems in reclaim while reducing the cost for SLUB
> > > using those high orders.
> > >
> > > Patch 1 corrects logic introduced by commit [1741c877: mm:
> > > kswapd: keep kswapd awake for high-order allocations until
> > > a percentage of the node is balanced] to allow kswapd to
> > > go to sleep when balanced for high orders.
> > >
> > > Patch 2 prevents kswapd waking up in response to SLUBs speculative
> > > use of high orders.
> > >
> > > Patch 3 further reduces the cost by prevent SLUB entering direct
> > > compaction or reclaim paths on the grounds that falling
> > > back to order-0 should be cheaper.
> > >
> > > Patch 4 notes that even when kswapd is failing to keep up with
> > > allocation requests, it should still go to sleep when its
> > > quota has expired to prevent it spinning.
> > >
> > > My own data on this is not great. I haven't really been able to
> > > reproduce the same problem locally.
> > >
> > > The test case is simple. "download tar" wgets a large tar file and
> > > stores it locally. "unpack" is expanding it (15 times physical RAM
> > > in this case) and "delete source dirs" is the tarfile being deleted
> > > again. I also experimented with having the tar copied numerous times
> > > and into deeper directories to increase the size but the results were
> > > not particularly interesting so I left it as one tar.
> > >
> > > In the background, applications are being launched to time to vaguely
> > > simulate activity on the desktop and to measure how long it takes
> > > applications to start.
> > >
> > > Test server, 4 CPU threads, x86_64, 2G of RAM, no PREEMPT, no COMPACTION, X running
> > > LARGE COPY AND UNTAR
> > > vanilla fixprematurely kswapd-nowwake slub-noexstep kswapdsleep
> > > download tar 95 ( 0.00%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%) 94 ( 1.06%)
> > > unpack tar 654 ( 0.00%) 649 ( 0.77%) 655 (-0.15%) 589 (11.04%) 598 ( 9.36%)
> > > copy source files 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%) 0 ( 0.00%)
> > > delete source dirs 327 ( 0.00%) 334 (-2.10%) 318 ( 2.83%) 325 ( 0.62%) 320 ( 2.19%)
> > > MMTests Statistics: duration
> > > User/Sys Time Running Test (seconds) 1139.7 1142.55 1149.78 1109.32 1113.26
> > > Total Elapsed Time (seconds) 1341.59 1342.45 1324.90 1271.02 1247.35
> > >
> > > MMTests Statistics: application launch
> > > evolution-wait30 mean 34.92 34.96 34.92 34.92 35.08
> > > gnome-terminal-find mean 7.96 7.96 8.76 7.80 7.96
> > > iceweasel-table mean 7.93 7.81 7.73 7.65 7.88
> > >
> > > evolution-wait30 stddev 0.96 1.22 1.27 1.20 1.15
> > > gnome-terminal-find stddev 3.02 3.09 3.51 2.99 3.02
> > > iceweasel-table stddev 1.05 0.90 1.09 1.11 1.11
> > >
> > > Having SLUB avoid expensive steps in reclaim improves performance
> > > by quite a bit with the overall test completing 1.5 minutes
> > > faster. Application launch times were not really affected but it's
> > > not something my test machine was suffering from in the first place
> > > so it's not really conclusive. The kswapd patches also did not appear
> > > to help but again, the test machine wasn't suffering that problem.
> > >
> > > These patches are against 2.6.39-rc7. Again, testing would be
> > > appreciated.
> >
> > These patches solve the problem for me. I've been soak testing the file
> > copy test
> > for 3.5 hours with nearly 400 test cycles and observed no lockups at all
> > - rock solid. From my observations from the output from vmstat the
> > system is behaving sanely.
> > Thanks for finding a solution - much appreciated!
> >
>
> Can you tell me if just patches 1 and 4 fix the problem please? It'd be good
> to know if this was only a reclaim-related problem. Thanks.
Hi Mel,
Soak tested just patches 1 + 4 and it works fine. Did 250 cycles for ~2
hours, no lockups, and the output from vmstat looked sane.
Colin
>
On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Johannes Weiner poined out that the logic in commit [1741c877: mm:
> kswapd: keep kswapd awake for high-order allocations until a percentage
> of the node is balanced] is backwards. Instead of allowing kswapd to go
> to sleep when balancing for high order allocations, it keeps it kswapd
> running uselessly.
>
> From-but-was-not-signed-off-by: Johannes Weiner<[email protected]>
> Will-sign-off-after-Johannes: Mel Gorman<[email protected]>
Reviewed-by: Rik van Riel<[email protected]>
--
All rights reversed
On 05/13/2011 10:03 AM, Mel Gorman wrote:
> Under constant allocation pressure, kswapd can be in the situation where
> sleeping_prematurely() will always return true even if kswapd has been
> running a long time. Check if kswapd needs to be scheduled.
>
> Signed-off-by: Mel Gorman<[email protected]>
Acked-by: Rik van Riel<[email protected]>
--
All rights reversed
On Fri, 13 May 2011, Mel Gorman wrote:
> To avoid locking and per-cpu overhead, SLUB optimisically uses
> high-order allocations and falls back to lower allocations if they
> fail. However, by simply trying to allocate, kswapd is woken up to
> start reclaiming at that order. On a desktop system, two users report
> that the system is getting locked up with kswapd using large amounts
> of CPU. Using SLAB instead of SLUB made this problem go away.
>
> This patch prevents kswapd being woken up for high-order allocations.
> Testing indicated that with this patch applied, the system was much
> harder to hang and even when it did, it eventually recovered.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Fri, 13 May 2011, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> const gfp_t wait = gfp_mask & __GFP_WAIT;
> + const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>
> /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> */
> alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>
> - if (!wait) {
> + if (!wait && can_wake_kswapd) {
> /*
> * Not worth trying to allocate harder for
> * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..c5797ab 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> * Let the initial higher-order allocation fail under memory pressure
> * so we fall-back to the minimum order allocation.
> */
> - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> + alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> + ~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
>
> page = alloc_slab_page(alloc_gfp, node, oo);
> if (unlikely(!page)) {
It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations
will immediately fail.
alloc_gfp would probably benefit from having a comment about why
__GFP_WAIT should be masked off here: that we don't want to do compaction
or direct reclaim or retry the allocation more than once (so both
__GFP_NORETRY and __GFP_REPEAT are no-ops).
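[Aside: to make the suggestion above concrete, here is a sketch of the kind of
comment being asked for. The wording is illustrative only and not from any
posted patch; it only restates what is established in this thread, namely that
without __GFP_WAIT the allocation can neither reclaim nor compact, so the
retry-related flags have no effect.]

	/*
	 * Speculative high-order attempt: do not warn on failure, do not
	 * wake kswapd and, by clearing __GFP_WAIT, never enter direct
	 * reclaim or compaction. Without __GFP_WAIT, flags such as
	 * __GFP_NORETRY and __GFP_REPEAT are no-ops, so the allocation
	 * either succeeds immediately or we fall back to the minimum
	 * order below.
	 */
	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
			~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);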
On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <[email protected]> wrote:
> On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
>> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <[email protected]> wrote:
>> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> >> <[email protected]> wrote:
>> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> >> >> (2011/05/13 23:03), Mel Gorman wrote:
>> >> >> > Under constant allocation pressure, kswapd can be in the situation where
>> >> >> > sleeping_prematurely() will always return true even if kswapd has been
>> >> >> > running a long time. Check if kswapd needs to be scheduled.
>> >> >> >
>> >> >> > Signed-off-by: Mel Gorman<[email protected]>
>> >> >> > ---
>> >> >> > mm/vmscan.c | 4 ++++
>> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-)
>> >> >> >
>> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> >> > index af24d1e..4d24828 100644
>> >> >> > --- a/mm/vmscan.c
>> >> >> > +++ b/mm/vmscan.c
>> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> >> >> > unsigned long balanced = 0;
>> >> >> > bool all_zones_ok = true;
>> >> >> >
>> >> >> > + /* If kswapd has been running too long, just sleep */
>> >> >> > + if (need_resched())
>> >> >> > + return false;
>> >> >> > +
>> >> >>
>> >> >> Hmm... I don't like this patch so much. because this code does
>> >> >>
>> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> >> >
>> >> > This isn't entirely true: need_resched() will be false, so we'll follow
>> >> > the normal path for determining whether to sleep or not, in effect
>> >> > leaving the current behaviour unchanged.
>> >> >
>> >> >> - sleep if kswapd didn't
>> >> >
>> >> > This also isn't entirely true: whether need_resched() is true at this
>> >> > point depends on a whole lot more that whether we did a context switch
>> >> > in shrink_inactive. It mostly depends on how long we've been running
>> >> > without giving up the CPU. Generally that will mean we've been round
>> >> > the shrinker loop hundreds to thousands of times without sleeping.
>> >> >
>> >> >> It seems to be semi random behavior.
>> >> >
>> >> > Well, we have to do something. Chris Mason first suspected the hang was
>> >> > a kswapd rescheduling problem a while ago. We tried putting
>> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
>> >>
>> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
>> >>
>> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> >> Because, although we complete zone balancing, kswapd doesn't sleep as
>> >> pgdat_balance returns wrong result. And at last VM calls
>> >> balance_pgdat. In this case, balance_pgdat returns without any work as
>> >> kswap couldn't find zones which have not enough free pages and goto
>> >> out. kswapd could repeat this work infinitely. So you don't have a
>> >> chance to call cond_resched.
>> >>
>> >> But if your test was with Hanne's patch, I am very curious how come
>> >> kswapd consumes CPU a lot.
>> >>
>> >> > The need_resched() in sleeping_prematurely() seems to be about the best
>> >> > option. The other option might be just to put a cond_resched() in
>> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
>> >>
>> >> I don't oppose it but before that, I think we have to know why kswapd
>> >> consumes CPU a lot although we applied Hannes' patch.
>> >>
>> >
>> > Because it's still possible for processes to allocate pages at the same
>> > rate kswapd is freeing them leading to a situation where kswapd does not
>> > consider the zone balanced for prolonged periods of time.
>>
>> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
>> So I think kswapd can be scheduled out although it's scheduled in
>> after a short time as task scheduled also need page reclaim. Although
>> all task in system need reclaim, kswapd cpu 99% consumption is a
>> natural result, I think.
>> Do I miss something?
>>
>
> Lets see;
>
> shrink_page_list() only applies if inactive pages were isolated
> which in turn may not happen if all_unreclaimable is set in
> shrink_zones(). If for whatever reason, all_unreclaimable is
> set on all zones, we can miss calling cond_resched().
>
> shrink_slab only applies if we are reclaiming slab pages. If the first
> shrinker returns -1, we do not call cond_resched(). If that
> first shrinker is dcache and __GFP_FS is not set, direct
> reclaimers will not shrink at all. However, if there are
> enough of them running or if one of the other shrinkers
> is running for a very long time, kswapd could be starved
> acquiring the shrinker_rwsem and never reaching the
> cond_resched().
Don't we have to move cond_resched?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..633e761 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
if (scanned == 0)
scanned = SWAP_CLUSTER_MAX;
- if (!down_read_trylock(&shrinker_rwsem))
- return 1; /* Assume we'll be able to shrink next time */
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ ret = 1;
+ goto out; /* Assume we'll be able to shrink next time */
+ }
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
@@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
count_vm_events(SLABS_SCANNED, this_scan);
total_scan -= this_scan;
- cond_resched();
}
shrinker->nr += total_scan;
+ cond_resched();
}
up_read(&shrinker_rwsem);
+out:
+ cond_resched();
return ret;
}
>
> balance_pgdat() only calls cond_resched if the zones are not
> balanced. For a high-order allocation that is balanced, it
> checks order-0 again. During that window, order-0 might have
> become unbalanced so it loops again for order-0 and returns
> that was reclaiming for order-0 to kswapd(). It can then find
> that a caller has rewoken kswapd for a high-order and re-enters
> balance_pgdat() without ever have called cond_resched().
If kswapd reclaims order-0 followed by high order, it would have a
chance to call cond_resched in shrink_page_list. But if all zones have
all_unreclaimable set, balance_pgdat could return any work. Okay.
It does make sense.
By your scenario, someone wakes up kswapd with a higher order again,
so it re-enters balance_pgdat without ever having called cond_resched.
But if someone wakes it up for a higher order again, we never get a
chance to call kswapd_try_to_sleep, so your patch's effect would be a
nop, too.
Would it be better to put cond_resched after balance_pgdat?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..61c45d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2753,6 +2753,7 @@ static int kswapd(void *p)
if (!ret) {
trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
order = balance_pgdat(pgdat, order, &classzone_idx);
+ cond_resched();
}
}
return 0;
>
> While it appears unlikely, there are bad conditions which can result
> in cond_resched() being avoided.
>
> --
> Mel Gorman
> SUSE Labs
>
--
Kind regards,
Minchan Kim
On Tue, May 17, 2011 at 8:50 AM, Minchan Kim <[email protected]> wrote:
> If kswapd reclaims order-0 followed by high order, it would have a
> chance to call cond_resched in shrink_page_list. But if all zones have
> all_unreclaimable set, balance_pgdat could return any work.
Typo : without any work.
--
Kind regards,
Minchan Kim
On Mon, May 16, 2011 at 02:16:46PM -0700, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > {
> > int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > + const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >
> > /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > */
> > alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >
> > - if (!wait) {
> > + if (!wait && can_wake_kswapd) {
> > /*
> > * Not worth trying to allocate harder for
> > * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..c5797ab 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > * Let the initial higher-order allocation fail under memory pressure
> > * so we fall-back to the minimum order allocation.
> > */
> > - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > + alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > + ~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> >
> > page = alloc_slab_page(alloc_gfp, node, oo);
> > if (unlikely(!page)) {
>
> It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations
> will immediately fail.
>
We can enter direct compaction or direct reclaim
at least once. If compaction is enabled and we enter
reclaim/compaction, the presence of __GFP_REPEAT makes a difference
in should_continue_reclaim(). With compaction disabled, the presence
of the flag is relevant in should_alloc_retry() with it being possible
to loop in the allocator instead of failing the SLUB allocation and
dropping back.
Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL, which makes
more sense. In that case, we clear both flags because
__GFP_REPEAT && !__GFP_WAIT is a senseless combination of flags.
If for whatever reason __GFP_WAIT was re-added, the presence of
__GFP_REPEAT could cause problems in reclaim that would be hard to
spot again.
> alloc_gfp would probably benefit from having a comment about why
> __GFP_WAIT should be masked off here: that we don't want to do compaction
> or direct reclaim or retry the allocation more than once (so both
> __GFP_NORETRY and __GFP_REPEAT are no-ops).
That would have been helpful all right. I should have caught that
and explained it properly. In the event there is a new version of
the patch, I'll add one. For the moment, I'm dropping this patch
entirely. Christoph wants to maintain historic behaviour of SLUB to
maximise the number of high-order pages it uses and at the end of the
day, which option performs better depends entirely on the workload
and machine configuration.
--
Mel Gorman
SUSE Labs
On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <[email protected]> wrote:
> > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> >>
> >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> >> So I think kswapd can be scheduled out although it's scheduled in
> >> after a short time as task scheduled also need page reclaim. Although
> >> all task in system need reclaim, kswapd cpu 99% consumption is a
> >> natural result, I think.
> >> Do I miss something?
> >>
> >
> > Lets see;
> >
> > shrink_page_list() only applies if inactive pages were isolated
> > which in turn may not happen if all_unreclaimable is set in
> > shrink_zones(). If for whatever reason, all_unreclaimable is
> > set on all zones, we can miss calling cond_resched().
> >
> > shrink_slab only applies if we are reclaiming slab pages. If the first
> > shrinker returns -1, we do not call cond_resched(). If that
> > first shrinker is dcache and __GFP_FS is not set, direct
> > reclaimers will not shrink at all. However, if there are
> > enough of them running or if one of the other shrinkers
> > is running for a very long time, kswapd could be starved
> > acquiring the shrinker_rwsem and never reaching the
> > cond_resched().
>
> Don't we have to move cond_resched?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..633e761 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> if (scanned == 0)
> scanned = SWAP_CLUSTER_MAX;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 1; /* Assume we'll be able to shrink next time */
> + if (!down_read_trylock(&shrinker_rwsem)) {
> + ret = 1;
> + goto out; /* Assume we'll be able to shrink next time */
> + }
>
> list_for_each_entry(shrinker, &shrinker_list, list) {
> unsigned long long delta;
> @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> count_vm_events(SLABS_SCANNED, this_scan);
> total_scan -= this_scan;
>
> - cond_resched();
> }
>
> shrinker->nr += total_scan;
> + cond_resched();
> }
> up_read(&shrinker_rwsem);
> +out:
> + cond_resched();
> return ret;
> }
>
This makes some sense for the exit path but if one or more of the
shrinkers takes a very long time without sleeping (extremely long
list searches for example) then kswapd will not call cond_resched()
between shrinkers and still consume a lot of CPU.
> >
> > balance_pgdat() only calls cond_resched if the zones are not
> > balanced. For a high-order allocation that is balanced, it
> > checks order-0 again. During that window, order-0 might have
> > become unbalanced so it loops again for order-0 and returns
> > that was reclaiming for order-0 to kswapd(). It can then find
> > that a caller has rewoken kswapd for a high-order and re-enters
> > balance_pgdat() without ever have called cond_resched().
>
> If kswapd reclaims order-o followed by high order, it would have a
> chance to call cond_resched in shrink_page_list. But if all zones are
> all_unreclaimable is set, balance_pgdat could return any work. Okay.
> It does make sense.
> By your scenario, someone wakes up kswapd with higher order, again.
> So re-enters balance_pgdat without ever have called cond_resched.
> But if someone wakes up higher order again, we can't have a chance to
> call kswapd_try_to_sleep. So your patch effect would be nop, too.
>
> It would be better to put cond_resched after balance_pgdat?
>
Which will leave kswapd runnable instead of going to sleep but
guarantees a scheduling point. Let's see if the problem is that a
cond_resched() is being missed, although if that were the case then
patch 4 would truly be a no-op; but Colin has already reported that
patch 1 on its own didn't fix his problem. If the problem is
sandybridge-specific, where kswapd remains runnable and consumes large
amounts of CPU in turbo mode, then we know that there are other
cond_resched() decisions that will need to be revisited.
Colin or James, would you be willing to test with patch 1 from this
series and Minchan's patch below? Thanks.
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> if (!ret) {
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> order = balance_pgdat(pgdat, order, &classzone_idx);
> + cond_resched();
> }
> }
> return 0;
>
> >
> > While it appears unlikely, there are bad conditions which can result
> > in cond_resched() being avoided.
>
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
>
>
>
> --
> Kind regards,
> Minchan Kim
--
Mel Gorman
SUSE Labs
On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> Colin or James, would you be willing to test with patch 1 from this
> series and Minchan's patch below? Thanks.
This works fine. Ran 250 test cycles for about 2 hours.
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..61c45d0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> > if (!ret) {
> > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> > + cond_resched();
> > }
> > }
> > return 0;
On Tue, 17 May 2011, Mel Gorman wrote:
> entirely. Christoph wants to maintain historic behaviour of SLUB to
> maximise the number of high-order pages it uses and at the end of the
> day, which option performs better depends entirely on the workload
> and machine configuration.
That is not what I meant. I would like more higher order allocations to
succeed. That does not mean that slub's allocation methods and the flags
passed have to stay the same. You can change the slub behavior if it helps.
I am just suspicious of compaction. If these mods are needed to reduce the
amount of higher order pages then compaction does not have the
beneficial effect that it should have. It does not actually
increase the available higher order pages. Fix that first.
It has been reported on some laptops that kswapd is consuming large
amounts of CPU and not being scheduled when SLUB is enabled during
large amounts of file copying. It is expected that this is due to
kswapd missing every cond_resched() point because:
shrink_page_list() calls cond_resched() if inactive pages were isolated
which in turn may not happen if all_unreclaimable is set in
shrink_zones(). If for whatever reason, all_unreclaimable is
set on all zones, we can miss calling cond_resched().
balance_pgdat() only calls cond_resched if the zones are not
balanced. For a high-order allocation that is balanced, it
checks order-0 again. During that window, order-0 might have
become unbalanced so it loops again for order-0 and returns
that it was reclaiming for order-0 to kswapd(). It can then
find that a caller has rewoken kswapd for a high-order and
re-enters balance_pgdat() without ever calling cond_resched().
shrink_slab only calls cond_resched() if we are reclaiming slab
pages. If there are a large number of direct reclaimers, the
shrinker_rwsem can be contended and prevent kswapd calling
cond_resched().
This patch modifies the shrink_slab() case. If the semaphore is
contended, the caller will still call cond_resched(). After each
successful call into a shrinker, the call to cond_resched() is still
necessary in case one shrinker call is particularly slow.
This patch replaces
mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
in -mm.
[[email protected]: Preserve call to cond_resched after each call into shrinker]
From: Minchan Kim <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..0bed248 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
if (scanned == 0)
scanned = SWAP_CLUSTER_MAX;
- if (!down_read_trylock(&shrinker_rwsem))
- return 1; /* Assume we'll be able to shrink next time */
+ if (!down_read_trylock(&shrinker_rwsem)) {
+ /* Assume we'll be able to shrink next time */
+ ret = 1;
+ goto out;
+ }
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
@@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
shrinker->nr += total_scan;
}
up_read(&shrinker_rwsem);
+out:
+ cond_resched();
return ret;
}
On Tue, May 17, 2011 at 08:51:47AM -0500, Christoph Lameter wrote:
> On Tue, 17 May 2011, Mel Gorman wrote:
>
> > entirely. Christoph wants to maintain historic behaviour of SLUB to
> > maximise the number of high-order pages it uses and at the end of the
> > day, which option performs better depends entirely on the workload
> > and machine configuration.
>
> That is not what I meant. I would like more higher order allocations to
> succeed. That does not mean that slubs allocation methods and flags passed
> have to stay the same. You can change the slub behavior if it helps.
>
In this particular patch, the success rate for high order allocations
would likely decrease in low memory conditions, although the latency when
calling the page allocator will be lower and the disruption to the
system will be less (no copying or reclaim of pages). My expectation
would be that it's cheaper for SLUB to fall back than to compact memory
or reclaim pages, even if this means a slab page is smaller until more
memory is free. However, if the "goodness" criterion is high order
allocation success rate, the patch shouldn't be merged.
> I am just suspicious of compaction. If these mods are needed to reduce the
> amount of higher order pages then compaction does not have the
> beneficial effect that it should have. It does not actually
> increase the available higher order pages. Fix that first.
>
The problem being addressed was the machine being hung at worst and in
other cases having kswapd pinned at 99-100% CPU. It's now been shown
that modifying SLUB is not necessary to fix this because the bug was
in page reclaim. The high-order allocation success rate didn't come
into it.
--
Mel Gorman
SUSE Labs
On Tue, 17 May 2011, Mel Gorman wrote:
> > That is not what I meant. I would like more higher order allocations to
> > succeed. That does not mean that slubs allocation methods and flags passed
> > have to stay the same. You can change the slub behavior if it helps.
> >
>
> In this particular patch, the success rate for high order allocations
> would likely decrease in low memory conditions albeit the latency when
> calling the page allocator will be lower and the disruption to the
> system will be less (no copying or reclaim of pages). My expectation
> would be that it's cheaper for SLUB to fall back than compact memory
> or reclaim pages even if this means a slab page is smaller until more
> memory is free. However, if the "goodness" criteria is high order
> allocation success rate, the patch shouldn't be merged.
The criterion is certainly overall system performance and not a high order
allocation rate.
On Tue, 17 May 2011, Mel Gorman wrote:
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 9f8a97b..057f1e2 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > {
> > > int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> > > const gfp_t wait = gfp_mask & __GFP_WAIT;
> > > + const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> > >
> > > /* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> > > BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > > */
> > > alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> > >
> > > - if (!wait) {
> > > + if (!wait && can_wake_kswapd) {
> > > /*
> > > * Not worth trying to allocate harder for
> > > * __GFP_NOMEMALLOC even if it can't schedule.
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..c5797ab 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > > * Let the initial higher-order allocation fail under memory pressure
> > > * so we fall-back to the minimum order allocation.
> > > */
> > > - alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > + alloc_gfp = (flags | __GFP_NOWARN | __GFP_NO_KSWAPD) &
> > > + ~(__GFP_NOFAIL | __GFP_WAIT | __GFP_REPEAT);
> > >
> > > page = alloc_slab_page(alloc_gfp, node, oo);
> > > if (unlikely(!page)) {
> >
> > It's unnecessary to clear __GFP_REPEAT, these !__GFP_NOFAIL allocations
> > will immediately fail.
> >
>
> We can enter enter direct compaction or direct reclaim
> at least once. If compaction is enabled and we enter
> reclaim/compaction, the presense of __GFP_REPEAT makes a difference
> in should_continue_reclaim(). With compaction disabled, the presense
> of the flag is relevant in should_alloc_retry() with it being possible
> to loop in the allocator instead of failing the SLUB allocation and
> dropping back.
>
You've cleared __GFP_WAIT, so it cannot enter direct compaction or direct
reclaim, so clearing __GFP_REPEAT here doesn't actually do anything.
That's why I suggested adding a comment about why you're clearing
__GFP_WAIT: to make it obvious that these allocations will immediately
fail if the alloc is unsuccessful and we don't need to add __GFP_NORETRY
or remove __GFP_REPEAT.
> Maybe you meant !__GFP_WAIT instead of !__GFP_NOFAIL which makes
> more sense.
No, I meant !__GFP_NOFAIL since the high priority allocations (if
PF_MEMALLOC or TIF_MEMDIE) will not loop forever looking for a page
without that bit. That allows this !__GFP_WAIT allocation to immediately
fail. __GFP_NORETRY and __GFP_REPEAT are no-ops unless you have
__GFP_WAIT.
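[Aside: a heavily simplified paraphrase of the allocator slow path being
described above, for illustration only - this is not the actual 2.6.39 code,
just the control flow relevant to the discussion:]

	/* sketch of the slow path as it relates to this thread */
	alloc_flags = gfp_to_alloc_flags(gfp_mask);
	page = get_page_from_freelist(...);       /* one more attempt with adjusted flags */
	if (page)
		return page;
	if (!(gfp_mask & __GFP_WAIT))             /* cannot sleep: no compaction, no reclaim */
		goto nopage;                      /* fails here; __GFP_NORETRY and
						   * __GFP_REPEAT are never consulted */
	/*
	 * Only __GFP_WAIT allocations continue to compaction, direct
	 * reclaim and the should_alloc_retry() loop where __GFP_REPEAT
	 * would make a difference.
	 */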
On Tue, 17 May 2011, Christoph Lameter wrote:
> > In this particular patch, the success rate for high order allocations
> > would likely decrease in low memory conditions albeit the latency when
> > calling the page allocator will be lower and the disruption to the
> > system will be less (no copying or reclaim of pages). My expectation
> > would be that it's cheaper for SLUB to fall back than compact memory
> > or reclaim pages even if this means a slab page is smaller until more
> > memory is free. However, if the "goodness" criteria is high order
> > allocation success rate, the patch shouldn't be merged.
>
> The criteria is certainly overall system performance and not a high order
> allocation rate.
>
SLUB definitely depends on these higher order allocations being successful
for performance; dropping back to the min order is a last resort as
opposed to failing the kmalloc(). If it's the last resort, then it makes
sense that we'd want to try both compaction and reclaim while we're
already in the page allocator as we go down the slub slowpath. Why not
try just a little harder (compaction and/or reclaim) to alloc the cache's
preferred order?
> Lets see;
>
> shrink_page_list() only applies if inactive pages were isolated
> which in turn may not happen if all_unreclaimable is set in
> shrink_zones(). If for whatever reason, all_unreclaimable is
> set on all zones, we can miss calling cond_resched().
>
> shrink_slab only applies if we are reclaiming slab pages. If the first
> shrinker returns -1, we do not call cond_resched(). If that
> first shrinker is dcache and __GFP_FS is not set, direct
> reclaimers will not shrink at all. However, if there are
> enough of them running or if one of the other shrinkers
> is running for a very long time, kswapd could be starved
> acquiring the shrinker_rwsem and never reaching the
> cond_resched().
OK.
>
> balance_pgdat() only calls cond_resched if the zones are not
> balanced. For a high-order allocation that is balanced, it
> checks order-0 again. During that window, order-0 might have
> become unbalanced so it loops again for order-0 and returns
> that was reclaiming for order-0 to kswapd(). It can then find
> that a caller has rewoken kswapd for a high-order and re-enters
> balance_pgdat() without ever have called cond_resched().
Then, shouldn't balance_pgdat() call cond_resched() unconditionally?
The problem is NOT the 100% CPU consumption itself - if kswapd slept,
other processes would have to reclaim old pages anyway. The problem is
that kswapd never invokes a context switch and other tasks hang.
> While it appears unlikely, there are bad conditions which can result
> in cond_resched() being avoided.
>
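[Aside: a purely hypothetical sketch of the unconditional scheduling point
being suggested above - illustrative only, not a posted patch. It assumes the
balance_pgdat() priority loop of this era:]

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		/* ... scan zones and reclaim at this priority ... */

		/*
		 * Unconditionally yield once per priority pass so that
		 * kswapd can never monopolise a CPU, even if the node
		 * never becomes balanced.
		 */
		cond_resched();
	}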
(2011/05/18 1:15), Mel Gorman wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because;
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
> which in turn may not happen if all_unreclaimable is set in
> shrink_zones(). If for whatever reason, all_unreclaimable is
> set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
> balanced. For a high-order allocation that is balanced, it
> checks order-0 again. During that window, order-0 might have
> become unbalanced so it loops again for order-0 and returns
> that it was reclaiming for order-0 to kswapd(). It can then
> find that a caller has rewoken kswapd for a high-order and
> re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
> pages. If there are a large number of direct reclaimers, the
> shrinker_rwsem can be contended and prevent kswapd calling
> cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [[email protected]: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim<[email protected]>
> Signed-off-by: Mel Gorman<[email protected]>
Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>
> It would be better to put cond_resched after balance_pgdat?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> if (!ret) {
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> order = balance_pgdat(pgdat, order,&classzone_idx);
> + cond_resched();
> }
> }
> return 0;
>
>> While it appears unlikely, there are bad conditions which can result
>> in cond_resched() being avoided.
Calling cond_resched() at every reclaim priority decrease, or at every
shrink_zone() call, would make preemption more fine grained, I think.
On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> Colin or James, would you be willing to test with patch 1 from this
> series and Minchan's patch below? Thanks.
Yes, but unfortunately I'm on the road at the moment. I won't get back
to the laptop showing the problem until late on Tuesday (24th). If it
works for Colin, I'd assume it's OK.
James
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 292582c..61c45d0 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> > if (!ret) {
> > trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> > order = balance_pgdat(pgdat, order, &classzone_idx);
> > + cond_resched();
> > }
> > }
> > return 0;
> >
Hello Colin,
On Tue, May 17, 2011 at 10:50 PM, Colin Ian King
<[email protected]> wrote:
> On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
>> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
>> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <[email protected]> wrote:
>> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
>> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <[email protected]> wrote:
>> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
>> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
>> > >> >> <[email protected]> wrote:
>> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
>> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
>> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
>> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
>> > >> >> >> > running a long time. Check if kswapd needs to be scheduled.
>> > >> >> >> >
>> > >> >> >> > Signed-off-by: Mel Gorman<[email protected]>
>> > >> >> >> > ---
>> > >> >> >> > mm/vmscan.c | 4 ++++
>> > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-)
>> > >> >> >> >
>> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > >> >> >> > index af24d1e..4d24828 100644
>> > >> >> >> > --- a/mm/vmscan.c
>> > >> >> >> > +++ b/mm/vmscan.c
>> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>> > >> >> >> > unsigned long balanced = 0;
>> > >> >> >> > bool all_zones_ok = true;
>> > >> >> >> >
>> > >> >> >> > + /* If kswapd has been running too long, just sleep */
>> > >> >> >> > + if (need_resched())
>> > >> >> >> > + return false;
>> > >> >> >> > +
>> > >> >> >>
>> > >> >> >> Hmm... I don't like this patch so much. because this code does
>> > >> >> >>
>> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
>> > >> >> >
>> > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow
>> > >> >> > the normal path for determining whether to sleep or not, in effect
>> > >> >> > leaving the current behaviour unchanged.
>> > >> >> >
>> > >> >> >> - sleep if kswapd didn't
>> > >> >> >
>> > >> >> > This also isn't entirely true: whether need_resched() is true at this
>> > >> >> > point depends on a whole lot more that whether we did a context switch
>> > >> >> > in shrink_inactive. It mostly depends on how long we've been running
>> > >> >> > without giving up the CPU. Generally that will mean we've been round
>> > >> >> > the shrinker loop hundreds to thousands of times without sleeping.
>> > >> >> >
>> > >> >> >> It seems to be semi random behavior.
>> > >> >> >
>> > >> >> > Well, we have to do something. Chris Mason first suspected the hang was
>> > >> >> > a kswapd rescheduling problem a while ago. We tried putting
>> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
>> > >> >>
>> > >> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
>> > >> >>
>> > >> >> If it isn't, it would be nop regardless of putting cond_reshed at vmscan.c.
>> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
>> > >> >> pgdat_balance returns wrong result. And at last VM calls
>> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
>> > >> >> kswap couldn't find zones which have not enough free pages and goto
>> > >> >> out. kswapd could repeat this work infinitely. So you don't have a
>> > >> >> chance to call cond_resched.
>> > >> >>
>> > >> >> But if your test was with Hannes' patch, I am very curious how come
>> > >> >> kswapd consumes CPU a lot.
>> > >> >>
>> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
>> > >> >> > option. The other option might be just to put a cond_resched() in
>> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
>> > >> >>
>> > >> >> I don't oppose it but before that, I think we have to know why kswapd
>> > >> >> consumes CPU a lot although we applied Hannes' patch.
>> > >> >>
>> > >> >
>> > >> > Because it's still possible for processes to allocate pages at the same
>> > >> > rate kswapd is freeing them leading to a situation where kswapd does not
>> > >> > consider the zone balanced for prolonged periods of time.
>> > >>
>> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
>> > >> So I think kswapd can be scheduled out, although it gets scheduled back in
>> > >> after a short time since the tasks that are scheduled also need page reclaim. If
>> > >> all tasks in the system need reclaim, kswapd consuming 99% of a CPU is a
>> > >> natural result, I think.
>> > >> Do I miss something?
>> > >>
>> > >
>> > > Lets see;
>> > >
>> > > shrink_page_list() only applies if inactive pages were isolated
>> > > which in turn may not happen if all_unreclaimable is set in
>> > > shrink_zones(). If for whatever reason, all_unreclaimable is
>> > > set on all zones, we can miss calling cond_resched().
>> > >
>> > > shrink_slab only applies if we are reclaiming slab pages. If the first
>> > > shrinker returns -1, we do not call cond_resched(). If that
>> > > first shrinker is dcache and __GFP_FS is not set, direct
>> > > reclaimers will not shrink at all. However, if there are
>> > > enough of them running or if one of the other shrinkers
>> > > is running for a very long time, kswapd could be starved
>> > > acquiring the shrinker_rwsem and never reaching the
>> > > cond_resched().
>> >
>> > Don't we have to move cond_resched?
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 292582c..633e761 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>> > if (scanned == 0)
>> > scanned = SWAP_CLUSTER_MAX;
>> >
>> > - if (!down_read_trylock(&shrinker_rwsem))
>> > - return 1; /* Assume we'll be able to shrink next time */
>> > + if (!down_read_trylock(&shrinker_rwsem)) {
>> > + ret = 1;
>> > + goto out; /* Assume we'll be able to shrink next time */
>> > + }
>> >
>> > list_for_each_entry(shrinker, &shrinker_list, list) {
>> > unsigned long long delta;
>> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
>> > count_vm_events(SLABS_SCANNED, this_scan);
>> > total_scan -= this_scan;
>> >
>> > - cond_resched();
>> > }
>> >
>> > shrinker->nr += total_scan;
>> > + cond_resched();
>> > }
>> > up_read(&shrinker_rwsem);
>> > +out:
>> > + cond_resched();
>> > return ret;
>> > }
>> >
>>
>> This makes some sense for the exit path but if one or more of the
>> shrinkers takes a very long time without sleeping (extremely long
>> list searches for example) then kswapd will not call cond_resched()
>> between shrinkers and still consume a lot of CPU.
>>
>> > >
>> > > balance_pgdat() only calls cond_resched if the zones are not
>> > > balanced. For a high-order allocation that is balanced, it
>> > > checks order-0 again. During that window, order-0 might have
>> > > become unbalanced so it loops again for order-0 and returns
>> > > that it was reclaiming for order-0 to kswapd(). It can then find
>> > > that a caller has rewoken kswapd for a high-order and re-enters
>> > > balance_pgdat() without ever having called cond_resched().
>> >
>> > If kswapd reclaims order-0 followed by high order, it would have a
>> > chance to call cond_resched in shrink_page_list. But if all_unreclaimable
>> > is set on all zones, balance_pgdat could return without doing any work. Okay.
>> > It does make sense.
>> > By your scenario, someone wakes up kswapd with higher order, again.
>> > So it re-enters balance_pgdat without ever having called cond_resched.
>> > But if someone wakes up higher order again, we can't have a chance to
>> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
>> >
>> > It would be better to put cond_resched after balance_pgdat?
>> >
>>
>> Which will leave kswapd runnable instead of going to sleep but
>> guarantees a scheduling point. Lets see if the problem is that
>> cond_resched is being missed although if this was the case then patch
>> 4 would truly be a no-op but Colin has already reported that patch 1 on
>> its own didn't fix his problem. If the problem is sandybridge-specific
>> where kswapd remains runnable and consuming large amounts of CPU in
>> turbo mode then we know that there are other cond_resched() decisions
>> that will need to be revisited.
>>
>> Colin or James, would you be willing to test with patch 1 from this
>> series and Minchan's patch below? Thanks.
>
> This works OK fine. Ran 250 test cycles for about 2 hours.
Thanks for the testing!
I would like to know exactly which patch you applied.
My modification inserts cond_resched() in two places:
1) the shrink_slab function
2) kswapd, right after balance_pgdat.
1) or 2)?
Or
Both?
Thanks
--
Kind regards,
Minchan Kim
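(For readers following the thread: the two options Minchan lists map onto the two diffs quoted above. As a loose illustration only, here is a small userspace sketch of option 1, assuming nothing beyond POSIX threads; pthread_rwlock_tryrdlock() and sched_yield() stand in for the kernel's down_read_trylock(&shrinker_rwsem) and cond_resched(), and the function name is invented. The point is that the early return on a failed trylock used to skip the scheduling point entirely, while routing every exit through a common label guarantees one yield per call.)

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_rwlock_t shrinker_rwsem_model = PTHREAD_RWLOCK_INITIALIZER;

/* Mirrors the structure of the proposed shrink_slab() change:
 * even the "trylock failed" path now passes a scheduling point. */
static unsigned long shrink_slab_model(void)
{
	unsigned long ret = 0;

	if (pthread_rwlock_tryrdlock(&shrinker_rwsem_model) != 0) {
		ret = 1;	/* assume we'll be able to shrink next time */
		goto out;
	}

	/* ... walk the shrinker list, yielding after each shrinker ... */
	sched_yield();

	pthread_rwlock_unlock(&shrinker_rwsem_model);
out:
	sched_yield();		/* guaranteed yield, like cond_resched() */
	return ret;
}

int main(void)
{
	printf("shrink_slab_model() returned %lu\n", shrink_slab_model());
	return 0;
}

(Build with "cc -pthread"; the model is only meant to show the control flow, not any reclaim behaviour.)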
On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
<[email protected]> wrote:
>> It would be better to put cond_resched after balance_pgdat?
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 292582c..61c45d0 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>> if (!ret) {
>> trace_mm_vmscan_kswapd_wake(pgdat->node_id,
>> order);
>> order = balance_pgdat(pgdat,
>> order,&classzone_idx);
>> + cond_resched();
>> }
>> }
>> return 0;
>>
>>>>> While it appears unlikely, there are bad conditions which can result
>>>>> in cond_resched() being avoided.
>
> Every reclaim priority decreasing or every shrink_zone() calling makes more
> fine grained preemption. I think.
It could be.
But in direct reclaim case, I have a concern about losing pages
reclaimed to other tasks by preemption.
Hmm, anyway, we also need to test.
Hmm, how long should we keep bothering them (Colin and James)?
First of all, let's settle on one just between us, ask them to test it, and
send the last patch to akpm.
1. shrink_slab
2. right after balance_pgdat
3. shrink_zone
4. reclaim priority decreasing routine.
Now, I vote 1) and 2).
Mel, KOSAKI?
--
Kind regards,
Minchan Kim
>>>>>> While it appears unlikely, there are bad conditions which can result
>>>>>> in cond_resched() being avoided.
>>
>> Every reclaim priority decreasing or every shrink_zone() calling makes more
>> fine grained preemption. I think.
>
> It could be.
> But in direct reclaim case, I have a concern about losing pages
> reclaimed to other tasks by preemption.
Nope, I proposed to add cond_resched() into balance_pgdat().
> Hmm, anyway, we also need to test.
> Hmm, how long should we keep bothering them (Colin and James)?
> First of all, let's settle on one just between us, ask them to test it, and
> send the last patch to akpm.
>
> 1. shrink_slab
> 2. right after balance_pgdat
> 3. shrink_zone
> 4. reclaim priority decreasing routine.
>
> Now, I vote 1) and 2).
>
> Mel, KOSAKI?
I think the following patch provides enough preemption points.
Thanks.
From e7d88be1916184ea7c93a6f2746b15c7a32d1973 Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Wed, 18 May 2011 15:00:39 +0900
Subject: [PATCH] vmscan: balance_pgdat() call cond_resched() unconditionally
Under constant allocation pressure, kswapd can be in the situation where
sleeping_prematurely() will always return true even if kswapd has been
running a long time. Check if kswapd needs to be scheduled.
Signed-off-by: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: James Bottomley <[email protected]>
Cc: Colin King <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/vmscan.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 19e179b..87c88fd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2449,6 +2449,7 @@ loop_again:
sc.nr_reclaimed += reclaim_state->reclaimed_slab;
total_scanned += sc.nr_scanned;
+ cond_resched();
if (zone->all_unreclaimable)
continue;
if (nr_slab == 0 &&
@@ -2518,8 +2519,6 @@ out:
* for the node to be balanced
*/
if (!(all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))) {
- cond_resched();
-
try_to_freeze();
/*
--
1.7.3.1
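(A loose userspace model of the behavioural difference this patch aims at, for illustration only: the zone state is reduced to a single flag, sched_yield() stands in for cond_resched(), and none of the names below are kernel code. If the yield only happens when the node looks unbalanced, a node that always looks balanced or unreclaimable lets the loop spin without a single scheduling point; yielding unconditionally on every iteration guarantees one.)

#include <sched.h>
#include <stdbool.h>
#include <stdio.h>

static bool zones_look_balanced = true;	/* worst case for a conditional yield */

/* Loosely modelled on balance_pgdat(): iterate the zones of one node
 * and optionally yield the CPU on each iteration. */
static void balance_pgdat_model(bool yield_unconditionally)
{
	for (int zone = 0; zone < 4; zone++) {
		/* ... scan and reclaim this zone ... */
		if (yield_unconditionally || !zones_look_balanced)
			sched_yield();	/* stands in for cond_resched() */
	}
}

int main(void)
{
	balance_pgdat_model(false);	/* old behaviour: may never yield */
	balance_pgdat_model(true);	/* patched behaviour: always yields */
	puts("done");
	return 0;
}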
On 5/17/11 12:10 AM, David Rientjes wrote:
> On Fri, 13 May 2011, Mel Gorman wrote:
>
>> To avoid locking and per-cpu overhead, SLUB optimistically uses
>> high-order allocations and falls back to lower allocations if they
>> fail. However, by simply trying to allocate, kswapd is woken up to
>> start reclaiming at that order. On a desktop system, two users report
>> that the system is getting locked up with kswapd using large amounts
>> of CPU. Using SLAB instead of SLUB made this problem go away.
>>
>> This patch prevents kswapd being woken up for high-order allocations.
>> Testing indicated that with this patch applied, the system was much
>> harder to hang and even when it did, it eventually recovered.
>>
>> Signed-off-by: Mel Gorman<[email protected]>
> Acked-by: David Rientjes<[email protected]>
Christoph? I think this patch is sane although the original rationale
was to work around kswapd problems.
Pekka
On Wed, 2011-05-18 at 13:19 +0900, Minchan Kim wrote:
> Hello Colin,
>
> On Tue, May 17, 2011 at 10:50 PM, Colin Ian King
> <[email protected]> wrote:
> > On Tue, 2011-05-17 at 11:38 +0100, Mel Gorman wrote:
> >> On Tue, May 17, 2011 at 08:50:44AM +0900, Minchan Kim wrote:
> >> > On Mon, May 16, 2011 at 7:27 PM, Mel Gorman <[email protected]> wrote:
> >> > > On Mon, May 16, 2011 at 05:58:59PM +0900, Minchan Kim wrote:
> >> > >> On Mon, May 16, 2011 at 5:45 PM, Mel Gorman <[email protected]> wrote:
> >> > >> > On Mon, May 16, 2011 at 02:04:00PM +0900, Minchan Kim wrote:
> >> > >> >> On Mon, May 16, 2011 at 1:21 PM, James Bottomley
> >> > >> >> <[email protected]> wrote:
> >> > >> >> > On Sun, 2011-05-15 at 19:27 +0900, KOSAKI Motohiro wrote:
> >> > >> >> >> (2011/05/13 23:03), Mel Gorman wrote:
> >> > >> >> >> > Under constant allocation pressure, kswapd can be in the situation where
> >> > >> >> >> > sleeping_prematurely() will always return true even if kswapd has been
> >> > >> >> >> > running a long time. Check if kswapd needs to be scheduled.
> >> > >> >> >> >
> >> > >> >> >> > Signed-off-by: Mel Gorman<[email protected]>
> >> > >> >> >> > ---
> >> > >> >> >> > mm/vmscan.c | 4 ++++
> >> > >> >> >> > 1 files changed, 4 insertions(+), 0 deletions(-)
> >> > >> >> >> >
> >> > >> >> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > >> >> >> > index af24d1e..4d24828 100644
> >> > >> >> >> > --- a/mm/vmscan.c
> >> > >> >> >> > +++ b/mm/vmscan.c
> >> > >> >> >> > @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
> >> > >> >> >> > unsigned long balanced = 0;
> >> > >> >> >> > bool all_zones_ok = true;
> >> > >> >> >> >
> >> > >> >> >> > + /* If kswapd has been running too long, just sleep */
> >> > >> >> >> > + if (need_resched())
> >> > >> >> >> > + return false;
> >> > >> >> >> > +
> >> > >> >> >>
> >> > >> >> >> Hmm... I don't like this patch so much. because this code does
> >> > >> >> >>
> >> > >> >> >> - don't sleep if kswapd got context switch at shrink_inactive_list
> >> > >> >> >
> >> > >> >> > This isn't entirely true: need_resched() will be false, so we'll follow
> >> > >> >> > the normal path for determining whether to sleep or not, in effect
> >> > >> >> > leaving the current behaviour unchanged.
> >> > >> >> >
> >> > >> >> >> - sleep if kswapd didn't
> >> > >> >> >
> >> > >> >> > This also isn't entirely true: whether need_resched() is true at this
> >> > >> >> > point depends on a whole lot more than whether we did a context switch
> >> > >> >> > in shrink_inactive. It mostly depends on how long we've been running
> >> > >> >> > without giving up the CPU. Generally that will mean we've been round
> >> > >> >> > the shrinker loop hundreds to thousands of times without sleeping.
> >> > >> >> >
> >> > >> >> >> It seems to be semi random behavior.
> >> > >> >> >
> >> > >> >> > Well, we have to do something. Chris Mason first suspected the hang was
> >> > >> >> > a kswapd rescheduling problem a while ago. We tried putting
> >> > >> >> > cond_rescheds() in several places in the vmscan code, but to no avail.
> >> > >> >>
> >> > >> >> Is it a result of test with patch of Hannes(ie, !pgdat_balanced)?
> >> > >> >>
> >> > >> >> If it isn't, it would be a nop regardless of putting cond_resched in vmscan.c.
> >> > >> >> Because, although we complete zone balancing, kswapd doesn't sleep as
> >> > >> >> pgdat_balanced returns the wrong result. And at last VM calls
> >> > >> >> balance_pgdat. In this case, balance_pgdat returns without any work as
> >> > >> >> kswap couldn't find zones which have not enough free pages and goto
> >> > >> >> out. kswapd could repeat this work infinitely. So you don't have a
> >> > >> >> chance to call cond_resched.
> >> > >> >>
> >> > >> >> But if your test was with Hannes' patch, I am very curious how come
> >> > >> >> kswapd consumes CPU a lot.
> >> > >> >>
> >> > >> >> > The need_resched() in sleeping_prematurely() seems to be about the best
> >> > >> >> > option. The other option might be just to put a cond_resched() in
> >> > >> >> > kswapd_try_to_sleep(), but that will really have about the same effect.
> >> > >> >>
> >> > >> >> I don't oppose it but before that, I think we have to know why kswapd
> >> > >> >> consumes CPU a lot although we applied Hannes' patch.
> >> > >> >>
> >> > >> >
> >> > >> > Because it's still possible for processes to allocate pages at the same
> >> > >> > rate kswapd is freeing them leading to a situation where kswapd does not
> >> > >> > consider the zone balanced for prolonged periods of time.
> >> > >>
> >> > >> We have cond_resched in shrink_page_list, shrink_slab and balance_pgdat.
> >> > >> So I think kswapd can be scheduled out, although it gets scheduled back in
> >> > >> after a short time since the tasks that are scheduled also need page reclaim. If
> >> > >> all tasks in the system need reclaim, kswapd consuming 99% of a CPU is a
> >> > >> natural result, I think.
> >> > >> Do I miss something?
> >> > >>
> >> > >
> >> > > Lets see;
> >> > >
> >> > > shrink_page_list() only applies if inactive pages were isolated
> >> > > which in turn may not happen if all_unreclaimable is set in
> >> > > shrink_zones(). If for whatever reason, all_unreclaimable is
> >> > > set on all zones, we can miss calling cond_resched().
> >> > >
> >> > > shrink_slab only applies if we are reclaiming slab pages. If the first
> >> > > shrinker returns -1, we do not call cond_resched(). If that
> >> > > first shrinker is dcache and __GFP_FS is not set, direct
> >> > > reclaimers will not shrink at all. However, if there are
> >> > > enough of them running or if one of the other shrinkers
> >> > > is running for a very long time, kswapd could be starved
> >> > > acquiring the shrinker_rwsem and never reaching the
> >> > > cond_resched().
> >> >
> >> > Don't we have to move cond_resched?
> >> >
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 292582c..633e761 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -231,8 +231,10 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >> > if (scanned == 0)
> >> > scanned = SWAP_CLUSTER_MAX;
> >> >
> >> > - if (!down_read_trylock(&shrinker_rwsem))
> >> > - return 1; /* Assume we'll be able to shrink next time */
> >> > + if (!down_read_trylock(&shrinker_rwsem)) {
> >> > + ret = 1;
> >> > + goto out; /* Assume we'll be able to shrink next time */
> >> > + }
> >> >
> >> > list_for_each_entry(shrinker, &shrinker_list, list) {
> >> > unsigned long long delta;
> >> > @@ -280,12 +282,14 @@ unsigned long shrink_slab(struct shrink_control *shrink,
> >> > count_vm_events(SLABS_SCANNED, this_scan);
> >> > total_scan -= this_scan;
> >> >
> >> > - cond_resched();
> >> > }
> >> >
> >> > shrinker->nr += total_scan;
> >> > + cond_resched();
> >> > }
> >> > up_read(&shrinker_rwsem);
> >> > +out:
> >> > + cond_resched();
> >> > return ret;
> >> > }
> >> >
> >>
> >> This makes some sense for the exit path but if one or more of the
> >> shrinkers takes a very long time without sleeping (extremely long
> >> list searches for example) then kswapd will not call cond_resched()
> >> between shrinkers and still consume a lot of CPU.
> >>
> >> > >
> >> > > balance_pgdat() only calls cond_resched if the zones are not
> >> > > balanced. For a high-order allocation that is balanced, it
> >> > > checks order-0 again. During that window, order-0 might have
> >> > > become unbalanced so it loops again for order-0 and returns
> >> > > that it was reclaiming for order-0 to kswapd(). It can then find
> >> > > that a caller has rewoken kswapd for a high-order and re-enters
> >> > > balance_pgdat() without ever having called cond_resched().
> >> >
> >> > If kswapd reclaims order-0 followed by high order, it would have a
> >> > chance to call cond_resched in shrink_page_list. But if all_unreclaimable
> >> > is set on all zones, balance_pgdat could return without doing any work. Okay.
> >> > It does make sense.
> >> > By your scenario, someone wakes up kswapd with higher order, again.
> >> > So it re-enters balance_pgdat without ever having called cond_resched.
> >> > But if someone wakes up higher order again, we can't have a chance to
> >> > call kswapd_try_to_sleep. So your patch effect would be nop, too.
> >> >
> >> > It would be better to put cond_resched after balance_pgdat?
> >> >
> >>
> >> Which will leave kswapd runnable instead of going to sleep but
> >> guarantees a scheduling point. Lets see if the problem is that
> >> cond_resched is being missed although if this was the case then patch
> >> 4 would truly be a no-op but Colin has already reported that patch 1 on
> >> its own didn't fix his problem. If the problem is sandybridge-specific
> >> where kswapd remains runnable and consuming large amounts of CPU in
> >> turbo mode then we know that there are other cond_resched() decisions
> >> that will need to be revisited.
> >>
> >> Colin or James, would you be willing to test with patch 1 from this
> >> series and Minchan's patch below? Thanks.
> >
> > This works OK fine. Ran 250 test cycles for about 2 hours.
>
> Thanks for the testing!
> I would like to know exactly which patch you applied.
> My modification inserts cond_resched() in two places:
>
> 1) the shrink_slab function
> 2) kswapd, right after balance_pgdat.
>
> 1) or 2)?
> Or
> Both?
>
I just followed Mel's request, so it was patch 1 from the series and *just*
the following:
>Colin or James, would you be willing to test with patch 1 from this
>series and Minchan's patch below? Thanks.
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 292582c..61c45d0 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> if (!ret) {
> trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
> order = balance_pgdat(pgdat, order, &classzone_idx);
> + cond_resched();
> }
> }
> return 0;
Colin
> Thanks
On Wed, May 18, 2011 at 09:26:09AM +0900, KOSAKI Motohiro wrote:
> >Lets see;
> >
> >shrink_page_list() only applies if inactive pages were isolated
> > which in turn may not happen if all_unreclaimable is set in
> > shrink_zones(). If for whatever reason, all_unreclaimable is
> > set on all zones, we can miss calling cond_resched().
> >
> >shrink_slab only applies if we are reclaiming slab pages. If the first
> > shrinker returns -1, we do not call cond_resched(). If that
> > first shrinker is dcache and __GFP_FS is not set, direct
> > reclaimers will not shrink at all. However, if there are
> > enough of them running or if one of the other shrinkers
> > is running for a very long time, kswapd could be starved
> > acquiring the shrinker_rwsem and never reaching the
> > cond_resched().
>
> OK.
>
>
> >
> >balance_pgdat() only calls cond_resched if the zones are not
> > balanced. For a high-order allocation that is balanced, it
> > checks order-0 again. During that window, order-0 might have
> > become unbalanced so it loops again for order-0 and returns
> > that it was reclaiming for order-0 to kswapd(). It can then find
> > that a caller has rewoken kswapd for a high-order and re-enters
> > balance_pgdat() without ever having called cond_resched().
>
> Then, shouldn't balance_pgdat() call cond_resched() unconditionally?
> The problem is NOT 100% CPU consumption. If kswapd sleeps, other
> processes need to reclaim old pages themselves. The problem is that kswapd doesn't
> invoke a context switch and other tasks hang.
>
Which the shrink_slab patch does (either version). What's the gain from
sprinkling more cond_resched() around? If you think there is, submit
another pair of patches (include patch 1 from this series) but I'm not
seeing the advantage myself.
>
> >While it appears unlikely, there are bad conditions which can result
> >in cond_resched() being avoided.
> >
>
>
--
Mel Gorman
SUSE Labs
On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote:
> On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
> <[email protected]> wrote:
> >> It would be better to put cond_resched after balance_pgdat?
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 292582c..61c45d0 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
> >> if (!ret) {
> >> trace_mm_vmscan_kswapd_wake(pgdat->node_id,
> >> order);
> >> order = balance_pgdat(pgdat,
> >> order,&classzone_idx);
> >> + cond_resched();
> >> }
> >> }
> >> return 0;
> >>
> >>>>> While it appears unlikely, there are bad conditions which can result
> >>>>> in cond_resched() being avoided.
> >
> > Every reclaim priority decreasing or every shrink_zone() calling makes more
> > fine grained preemption. I think.
>
> It could be.
> But in direct reclaim case, I have a concern about losing pages
> reclaimed to other tasks by preemption.
>
> Hmm, anyway, we also need to test.
> Hmm, how long should we keep bothering them (Colin and James)?
> First of all, let's settle on one just between us, ask them to test it, and
> send the last patch to akpm.
>
> 1. shrink_slab
> 2. right after balance_pgdat
> 3. shrink_zone
> 4. reclaim priority decreasing routine.
>
> Now, I vote 1) and 2).
>
I've already submitted a pair of patches for option 1. I don't think
option 2 gains us anything. I think it's more likely we should worry
about all_unreclaimable being set when shrink_slab is returning 0 and we
are encountering so many dirty pages that pages_scanned is high enough.
--
Mel Gorman
SUSE Labs
On Wed, 18 May 2011, Pekka Enberg wrote:
> On 5/17/11 12:10 AM, David Rientjes wrote:
> > On Fri, 13 May 2011, Mel Gorman wrote:
> >
> > > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > > high-order allocations and falls back to lower allocations if they
> > > fail. However, by simply trying to allocate, kswapd is woken up to
> > > start reclaiming at that order. On a desktop system, two users report
> > > that the system is getting locked up with kswapd using large amounts
> > > of CPU. Using SLAB instead of SLUB made this problem go away.
> > >
> > > This patch prevents kswapd being woken up for high-order allocations.
> > > Testing indicated that with this patch applied, the system was much
> > > harder to hang and even when it did, it eventually recovered.
> > >
> > > Signed-off-by: Mel Gorman<[email protected]>
> > Acked-by: David Rientjes<[email protected]>
>
> Christoph? I think this patch is sane although the original rationale was to
> work around kswapd problems.
I am mostly fine with it. The concern that I have is that if there is a
large series of high-order allocs, then at some point you would want
kswapd to be triggered instead of high-order allocs constantly failing.
Can we have a "trigger once in a while" functionality?
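(One way to read the "trigger once in a while" idea is as a rate limit on kswapd wakeups for speculative high-order attempts. The sketch below is purely hypothetical plain C, not kernel code and not part of any patch in this thread; the counter, the interval of 64 and the function name are all invented for illustration.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define SPECULATIVE_WAKE_INTERVAL 64	/* invented value, purely illustrative */

static atomic_uint speculative_attempts;

/* Return true for one in every SPECULATIVE_WAKE_INTERVAL speculative
 * high-order attempts, so kswapd would still be poked occasionally
 * without being woken for every opportunistic allocation. */
static bool should_wake_for_speculative_order(void)
{
	unsigned int n = atomic_fetch_add(&speculative_attempts, 1) + 1;

	return (n % SPECULATIVE_WAKE_INTERVAL) == 0;
}

int main(void)
{
	int wakeups = 0;

	for (int i = 0; i < 1000; i++)
		if (should_wake_for_speculative_order())
			wakeups++;

	printf("%d kswapd wakeups for 1000 speculative attempts\n", wakeups);
	return 0;
}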
On Wed, May 18, 2011 at 6:58 PM, Mel Gorman <[email protected]> wrote:
> On Wed, May 18, 2011 at 02:44:48PM +0900, Minchan Kim wrote:
>> On Wed, May 18, 2011 at 10:05 AM, KOSAKI Motohiro
>> <[email protected]> wrote:
>> >> It would be better to put cond_resched after balance_pgdat?
>> >>
>> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> >> index 292582c..61c45d0 100644
>> >> --- a/mm/vmscan.c
>> >> +++ b/mm/vmscan.c
>> >> @@ -2753,6 +2753,7 @@ static int kswapd(void *p)
>> >> if (!ret) {
>> >> trace_mm_vmscan_kswapd_wake(pgdat->node_id,
>> >> order);
>> >> order = balance_pgdat(pgdat,
>> >> order,&classzone_idx);
>> >> + cond_resched();
>> >> }
>> >> }
>> >> return 0;
>> >>
>> >>>>> While it appears unlikely, there are bad conditions which can result
>> >>>>> in cond_resched() being avoided.
>> >
>> > Every reclaim priority decreasing or every shrink_zone() calling makes more
>> > fine grained preemption. I think.
>>
>> It could be.
>> But in direct reclaim case, I have a concern about losing pages
>> reclaimed to other tasks by preemption.
>>
>> Hmm, anyway, we also need to test.
>> Hmm, how long should we keep bothering them (Colin and James)?
>> First of all, let's settle on one just between us, ask them to test it, and
>> send the last patch to akpm.
>>
>> 1. shrink_slab
>> 2. right after balance_pgdat
>> 3. shrink_zone
>> 4. reclaim priority decreasing routine.
>>
>> Now, I vote 1) and 2).
>>
>
> I've already submitted a pair of patches for option 1. I don't think
> option 2 gains us anything. I think it's more likely we should worry
> about all_unreclaimable being set when shrink_slab is returning 0 and we
> are encountering so many dirty pages that pages_scanned is high enough.
Okay.
Colin reported he had no problem with patch 1 in this series and
mine (ie, just cond_resched right after the balance_pgdat call, without the
shrink_slab patch).
If Colin's test is successful, I don't insist on mine.
(I don't want to drag this on for days :( )
If KOSAKI agrees, let's ask Colin to run the test and confirm our final version.
KOSAKI, could you post your opinion?
--
Kind regards,
Minchan Kim
>> I've already submitted a pair of patches for option 1. I don't think
>> option 2 gains us anything. I think it's more likely we should worry
>> about all_unreclaimable being set when shrink_slab is returning 0 and we
>> are encountering so many dirty pages that pages_scanned is high enough.
>
> Okay.
>
> Colin reported he had no problem with patch 1 in this series and
> mine (ie, just cond_resched right after the balance_pgdat call, without the
> shrink_slab patch).
>
> If Colin's test is successful, I don't insist on mine.
> (I don't want to drag this on for days :( )
> If KOSAKI agrees, let's ask Colin to run the test and confirm our final version.
>
> KOSAKI, could you post your opinion?
Yeah.
I also don't have any motivation to ignore Colin's test result.
On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <[email protected]> wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because:
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
> which in turn may not happen if all_unreclaimable is set in
> shrink_zones(). If for whatever reason, all_unreclaimable is
> set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
> balanced. For a high-order allocation that is balanced, it
> checks order-0 again. During that window, order-0 might have
> become unbalanced so it loops again for order-0 and returns
> that it was reclaiming for order-0 to kswapd(). It can then
> find that a caller has rewoken kswapd for a high-order and
> re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
> pages. If there are a large number of direct reclaimers, the
> shrinker_rwsem can be contended and prevent kswapd calling
> cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [[email protected]: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
--
Kind regards,
Minchan Kim
Hi Colin.
Sorry for bothering you. :(
I hope this test is the last one.
We (Mel, KOSAKI and I) have finalized our opinion.
Could you test the patch below with patch [1/4] of Mel's series (ie,
the !pgdat_balanced change in sleeping_prematurely)?
If it is successful, we will try to merge this version instead of
the various cond_resched sprinkling versions.
On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <[email protected]> wrote:
> It has been reported on some laptops that kswapd is consuming large
> amounts of CPU and not being scheduled when SLUB is enabled during
> large amounts of file copying. It is expected that this is due to
> kswapd missing every cond_resched() point because:
>
> shrink_page_list() calls cond_resched() if inactive pages were isolated
> which in turn may not happen if all_unreclaimable is set in
> shrink_zones(). If for whatever reason, all_unreclaimable is
> set on all zones, we can miss calling cond_resched().
>
> balance_pgdat() only calls cond_resched if the zones are not
> balanced. For a high-order allocation that is balanced, it
> checks order-0 again. During that window, order-0 might have
> become unbalanced so it loops again for order-0 and returns
> that it was reclaiming for order-0 to kswapd(). It can then
> find that a caller has rewoken kswapd for a high-order and
> re-enters balance_pgdat() without ever calling cond_resched().
>
> shrink_slab only calls cond_resched() if we are reclaiming slab
> pages. If there are a large number of direct reclaimers, the
> shrinker_rwsem can be contended and prevent kswapd calling
> cond_resched().
>
> This patch modifies the shrink_slab() case. If the semaphore is
> contended, the caller will still check cond_resched(). After each
> successful call into a shrinker, the check for cond_resched() is
> still necessary in case one shrinker call is particularly slow.
>
> This patch replaces
> mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> in -mm.
>
> [[email protected]: Preserve call to cond_resched after each call into shrinker]
> From: Minchan Kim <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..0bed248 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> if (scanned == 0)
> scanned = SWAP_CLUSTER_MAX;
>
> - if (!down_read_trylock(&shrinker_rwsem))
> - return 1; /* Assume we'll be able to shrink next time */
> + if (!down_read_trylock(&shrinker_rwsem)) {
> + /* Assume we'll be able to shrink next time */
> + ret = 1;
> + goto out;
> + }
>
> list_for_each_entry(shrinker, &shrinker_list, list) {
> unsigned long long delta;
> @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> shrinker->nr += total_scan;
> }
> up_read(&shrinker_rwsem);
> +out:
> + cond_resched();
> return ret;
> }
>
>
--
Kind regards,
Minchan Kim
On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote:
> Hi Colin.
>
> Sorry for bothering you. :(
No problem at all, I'm very happy to re-test.
> I hope this test is the last one.
>
> We (Mel, KOSAKI and I) have finalized our opinion.
>
> Could you test the patch below with patch [1/4] of Mel's series (ie,
> the !pgdat_balanced change in sleeping_prematurely)?
> If it is successful, we will try to merge this version instead of
> the various cond_resched sprinkling versions.
tested with the patch below + patch[1/4] of Mel's series. 300 cycles,
2.5 hrs of soak testing: works OK.
Colin
>
>
> On Wed, May 18, 2011 at 1:15 AM, Mel Gorman <[email protected]> wrote:
> > It has been reported on some laptops that kswapd is consuming large
> > amounts of CPU and not being scheduled when SLUB is enabled during
> > large amounts of file copying. It is expected that this is due to
> > kswapd missing every cond_resched() point because:
> >
> > shrink_page_list() calls cond_resched() if inactive pages were isolated
> > which in turn may not happen if all_unreclaimable is set in
> > shrink_zones(). If for whatever reason, all_unreclaimable is
> > set on all zones, we can miss calling cond_resched().
> >
> > balance_pgdat() only calls cond_resched if the zones are not
> > balanced. For a high-order allocation that is balanced, it
> > checks order-0 again. During that window, order-0 might have
> > become unbalanced so it loops again for order-0 and returns
> > that it was reclaiming for order-0 to kswapd(). It can then
> > find that a caller has rewoken kswapd for a high-order and
> > re-enters balance_pgdat() without ever calling cond_resched().
> >
> > shrink_slab only calls cond_resched() if we are reclaiming slab
> > pages. If there are a large number of direct reclaimers, the
> > shrinker_rwsem can be contended and prevent kswapd calling
> > cond_resched().
> >
> > This patch modifies the shrink_slab() case. If the semaphore is
> > contended, the caller will still check cond_resched(). After each
> > successful call into a shrinker, the check for cond_resched() is
> > still necessary in case one shrinker call is particularly slow.
> >
> > This patch replaces
> > mm-vmscan-if-kswapd-has-been-running-too-long-allow-it-to-sleep.patch
> > in -mm.
> >
> > [[email protected]: Preserve call to cond_resched after each call into shrinker]
> > From: Minchan Kim <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 9 +++++++--
> > 1 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index af24d1e..0bed248 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -230,8 +230,11 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> > if (scanned == 0)
> > scanned = SWAP_CLUSTER_MAX;
> >
> > - if (!down_read_trylock(&shrinker_rwsem))
> > - return 1; /* Assume we'll be able to shrink next time */
> > + if (!down_read_trylock(&shrinker_rwsem)) {
> > + /* Assume we'll be able to shrink next time */
> > + ret = 1;
> > + goto out;
> > + }
> >
> > list_for_each_entry(shrinker, &shrinker_list, list) {
> > unsigned long long delta;
> > @@ -282,6 +285,8 @@ unsigned long shrink_slab(unsigned long scanned, gfp_t gfp_mask,
> > shrinker->nr += total_scan;
> > }
> > up_read(&shrinker_rwsem);
> > +out:
> > + cond_resched();
> > return ret;
> > }
> >
> >
>
>
>
On Thu, May 19, 2011 at 8:36 PM, Colin Ian King
<[email protected]> wrote:
> On Thu, 2011-05-19 at 09:09 +0900, Minchan Kim wrote:
>> Hi Colin.
>>
>> Sorry for bothering you. :(
>
> No problem at all, I'm very happy to re-test.
>
>> I hope this test is the last one.
>>
>> We (Mel, KOSAKI and I) have finalized our opinion.
>>
>> Could you test the patch below with patch [1/4] of Mel's series (ie,
>> the !pgdat_balanced change in sleeping_prematurely)?
>> If it is successful, we will try to merge this version instead of
>> the various cond_resched sprinkling versions.
>
> tested with the patch below + patch[1/4] of Mel's series. 300 cycles,
> 2.5 hrs of soak testing: works OK.
>
> Colin
Thanks, Colin.
Thanks to your help, we are approaching a conclusion. :)
Mel, KOSAKI:
I will ask Andrew Lutomirski to test.
If he doesn't hit a problem, let's go ahead.
--
Kind regards,
Minchan Kim