2009-11-12 19:30:38

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

Sorry for the long delay in posting another version. Testing is extremely
time-consuming and I wasn't getting to work on this as much as I'd have liked.

Changelog since V2
o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
testing, it made latencies even worse as kswapd slept more on high-order
congestion causing order-0 direct reclaims.
o Added changes to how congestion_wait() works
o Added a number of new patches altering the behaviour of reclaim

Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and while they are generally brushed away, there has been a large
increase in them recently and there are a number of possible areas the
problem could be in - core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are;

[Bug #14141] order 2 page allocation failures in iwlagn
[Bug #14141] order 2 page allocation failures (generic)
[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)

The following are a series of patches that bring the behaviour of reclaim
and the page allocator more in line with 2.6.30.

Patches 1-3 should be tested first. The testing I've done shows that the
page allocator and the behaviour of congestion_wait() are more in line with
2.6.30 than in the vanilla kernels.

It'd be nice to have two more tests, applying each patch on top and noting
any behaviour change, i.e. ideally there would be results for

o patches 1+2+3
o patches 1+2+3+4
o patches 1+2+3+4+5

Of course, any test results are welcome. The rest of this mail is the
results of my own tests.

I've tested against 2.6.31 and 2.6.32-rc6. I've somewhat replicated the
problem in Bug #14141 and believe the other bugs are variations of the same
style of problem. The basic reproduction case was;

1. X86-64 AMD Phenom and X86 P4 booted with mem=512MB. The expectation is
that any machine will do as long as it has 512MB for the size of workload
involved.

2. An encrypted work partition and swap partition were created. On my
own setup, I gave no passphrase so it'd be easier to activate without
interaction but there are multiple options. I should have taken better
notes but the setup goes something like this;

cryptsetup create -y crypt-partition /dev/sda5
pvcreate /dev/mapper/crypt-partition
vgcreate crypt-volume /dev/mapper/crypt-partition
lvcreate -L 5G -n crypt-logical crypt-volume
lvcreate -L 2G -n crypt-swap crypt-volume
mkfs -t ext3 /dev/crypt-volume/crypt-logical
mkswap /dev/crypt-volume/crypt-swap

3. With the partition mounted on /scratch, I
cd /scratch
mkdir music
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git linux-2.6

4. On a normal partition, I expand a tarball containing test scripts available at
http://www.csn.ul.ie/~mel/postings/latency-20091112/latency-tests-with-results.tar.gz

There are two helper programs that run as part of the test - a fake
music player and a fake gitk.

The fake music player uses rsync with bandwidth limits to start
downloading a music folder from another machine. It's bandwidth
limited to simulate playing music over NFS. I believe it generates
similar, if not identical, traffic to a music player. It occurred to me
afterwards that if one patched ogg123 to print a line when 1/10th
of a second's worth of music was played, it could be used as an
indirect measure of desktop interactivity and help pin down pesky
"audio skips" bug reports.

The fake gitk is based on observing roughly what gitk does using
strace. It loads all the logs into a large buffer and then builds a
very basic hash map of parent to child commits. The data is stored
because it was insufficient just to read the logs. It had to be
kept in an in-memory buffer to generate swap. It then discards the
data and does it over again in a loop for a small number of times
so the test is finite. When it processes a large number of commits,
it outputs a line to stdout so that stalls can be observed. Ideal
behaviour is that commits are read at a constant rate and latencies
look flat.

Output from the two programs is piped through another script -
latency-output. It records how far into the test it was when each
line was output and the latency since the previous line
appeared. The latency should always be very smooth. Because pipes
buffer IO, everything is run by expect_unbuffered, which is available
from expect-dev on Debian at least.

All the tests are driven via run-test.sh. While the tests run,
it records the kern.log to track page allocation failures, records
nr_writeback at regular intervals and tracks Page IO and Swap IO.

5. For running an actual test, a kernel is built, booted, the
crypted partition activated, lvm restarted,
/dev/crypt-volume/crypt-logical mounted on /scratch, all
swap partitions turned off and then the swap partition on
/dev/crypt-volume/crypt-swap activated. I then run run-test.sh from
the tarball.

6. Run the test script

To evaluate the patches, I considered three basic metrics.

o The length of time it takes fake-gitk to complete on average
o How often and how long fake-gitk stalled for
o How long was spent in congestion_wait

All generated data is in the tarball.

On X86, the results I got were

2.6.30-0000000-force-highorder Elapsed:10:59.095 Failures:0

2.6.31-0000000-force-highorder Elapsed:11:53.505 Failures:0
2.6.31-revert-8aa7e847 Elapsed:14:01.595 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:13:32.237 Failures:0
2.6.31-0000123-congestion-both Elapsed:12:44.170 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:10:35.327 Failures:0
2.6.31-0012345-adjust-priority Elapsed:11:02.995 Failures:0

2.6.32-rc6-0000000-force-highorder Elapsed:18:18.562 Failures:0
2.6.32-rc6-revert-8aa7e847 Elapsed:10:29.278 Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:13:32.393 Failures:0
2.6.32-rc6-0000123-congestion-both Elapsed:14:55.265 Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:13:35.628 Failures:0
2.6.32-rc6-0012345-adjust-priority Elapsed:12:41.278 Failures:0

The 0000000-force-highorder kernels are vanilla kernels patched so that
network receive always results in an order-2 allocation. This machine wasn't
suffering page allocation failures even under these circumstances. However,
note how slow 2.6.32-rc6 is and how much the revert helps. With the patches
applied, performance is comparable.

Latencies were generally reduced with the patches applied. 2.6.32-rc6 was
particularly crazy, with long stalls measured over the duration of the test,
but it has latencies comparable to 2.6.30 with the patches applied.

congestion_wait() behaviour is more in line with 2.6.30 after the
patches, with similar amounts of time being spent waiting. In general,
2.6.32-rc6-0012345-adjust-priority waits for longer than 2.6.30 or the
reverted kernels did. It also waits in more places, such as inside
shrink_inactive_list(), where it didn't before. Forcing behaviour like 2.6.30
resulted in good figures but I couldn't justify the patches with anything
more solid than "in tests, it behaves well even though it doesn't make a
lot of sense".

On X86-64, the results I got were

2.6.30-0000000-force-highorder Elapsed:09:48.545 Failures:0

2.6.31-0000000-force-highorder Elapsed:09:13.020 Failures:0
2.6.31-revert-8aa7e847 Elapsed:09:02.120 Failures:0
2.6.31-0000012-pgalloc-2.6.30 Elapsed:08:52.742 Failures:0
2.6.31-0000123-congestion-both Elapsed:08:59.375 Failures:0
2.6.31-0001234-kswapd-quick-recheck Elapsed:09:19.208 Failures:0
2.6.31-0012345-adjust-priority Elapsed:09:39.225 Failures:0

2.6.32-rc6-0000000-force-highorder Elapsed:19:38.585 Failures:5
2.6.32-rc6-revert-8aa7e847 Elapsed:17:21.257 Failures:0
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:18:56.682 Failures:1
2.6.32-rc6-0000123-congestion-both Elapsed:16:08.340 Failures:0
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:18:11.200 Failures:7
2.6.32-rc6-0012345-adjust-priority Elapsed:21:33.158 Failures:0

Failures were down and my impression was that it was much harder to cause
failures. Performance on mainline is still not as good as 2.6.30. On
this particular machine, I was able to force performance to be in line
but not with any patch I could justify in the general case.

Latencies were slightly reduced by applying the patches against 2.6.31.
Against 2.6.32-rc6, applying the patches significantly reduced the latencies
but they are still significant. I'll continue to investigate what can be
done to improve this further.

Again, congestion_wait() is more in line with 2.6.30 when the patches
are applied. Similarly to X86, almost identical behaviour can be forced
by waiting on BLK_RW_BOTH for each caller to congestion_wait() in the
reclaim and allocator paths.

Bottom line, the patches made triggering allocation failures much harder
in a number of instances, and latencies are reduced when the system
is under load. I will keep looking around this area - particularly the
performance under load on 2.6.32-rc6 - but with 2.6.32 almost out the door,
I am releasing what I have now.


2009-11-12 19:30:42

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as often as it was previously.

This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdcedf6..250d055 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;

+restart:
wake_all_kswapd(order, zonelist, high_zoneidx);

-restart:
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
--
1.6.5

2009-11-12 19:30:44

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER

Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.

[[email protected]: Spotted the problem]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 250d055..2bc2ac6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
alloc_flags &= ~ALLOC_CPUSET;
- } else if (unlikely(rt_task(p)))
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;

if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
--
1.6.5

2009-11-12 19:31:44

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

Testing by Frans Pop indicated that, in the 2.6.30..2.6.31 window at least,
commits 373c0a7e and 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures occurring within a wireless driver. Reverting
them seemed to help a lot even though it was pointed out that the
congestion changes were very far away from high-order atomic allocations.

The key to why the revert makes such a big difference is down to timing and
how long direct reclaimers wait versus kswapd. With the patch reverted,
the congestion_wait() is on the SYNC queue instead of the ASYNC one. As a
significant part of the workload involved reads, it makes sense that the
SYNC list is what was truly congested and, with the revert, processes were
waiting on congestion as expected. Hence, direct reclaimers stalled
properly and kswapd was able to do its job with fewer stalls.

This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
for direct reclaimers. Instead of moving the congestion_wait() to the SYNC
queue, which would only fix a particular type of workload, this patch adds a
third type of congestion_wait - BLK_RW_BOTH - which first waits on the ASYNC
queue and then the SYNC queue if the timeout has not been reached. In tests,
this counter-intuitively results in kswapd stalling less and freeing up
pages, resulting in fewer allocation failures and fewer
direct-reclaim-orientated stalls.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/backing-dev.h | 1 +
mm/backing-dev.c | 25 ++++++++++++++++++++++---
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 2 +-
4 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b449e73..b35344c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -276,6 +276,7 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
enum {
BLK_RW_ASYNC = 0,
BLK_RW_SYNC = 1,
+ BLK_RW_BOTH = 2,
};

void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 1065b71..ea9ffc3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -736,22 +736,41 @@ EXPORT_SYMBOL(set_bdi_congested);

/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @sync: SYNC, ASYNC or BOTH IO
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int sync_request, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
+ int sync;
+ wait_queue_head_t *wqh;
+
+ /* If requested to sync both, wait on ASYNC first, then SYNC */
+ if (sync_request == BLK_RW_BOTH)
+ sync = BLK_RW_ASYNC;
+ else
+ sync = sync_request;
+
+again:
+ wqh = &congestion_wqh[sync];

prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
+
+ if (sync_request == BLK_RW_BOTH) {
+ sync_request = 0;
+ sync = BLK_RW_SYNC;
+ timeout = ret;
+ if (timeout)
+ goto again;
+ }
+
return ret;
}
EXPORT_SYMBOL(congestion_wait);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc2ac6..f6ed41c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1727,7 +1727,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(BLK_RW_BOTH, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));

return page;
@@ -1898,7 +1898,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(BLK_RW_BOTH, HZ/50);
goto rebalance;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 777af57..190bae1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1793,7 +1793,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(BLK_RW_BOTH, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
--
1.6.5

2009-11-12 19:31:41

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

After kswapd balances all zones in a pgdat, it goes to sleep. In the event
of no IO congestion, kswapd can go to sleep very shortly after the high
watermark was reached. If there is a constant stream of allocations from
parallel processes, it can mean that kswapd went to sleep too quickly and
the high watermark is not being maintained for a sufficient length of time.

This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues working. Otherwise it goes fully to sleep.

This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd went to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 1 +
mm/vmscan.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
mm/vmstat.c | 2 ++
3 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2d0f222..9716003 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
+ KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 190bae1..ffa1766 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
}
#endif

+/* is kswapd sleeping prematurely? */
+static int sleeping_prematurely(int order, long remaining)
+{
+ struct zone *zone;
+
+ /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
+ if (remaining)
+ return 1;
+
+ /* If after HZ/10, a zone is below the high mark, it's premature */
+ for_each_populated_zone(zone)
+ if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
+ 0, 0))
+ return 1;
+
+ return 0;
+}
+
/*
* For kswapd, balance_pgdat() will work across all this node's zones until
* they are all at high_wmark_pages(zone).
@@ -2184,8 +2202,30 @@ static int kswapd(void *p)
*/
order = new_order;
} else {
- if (!freezing(current))
- schedule();
+ if (!freezing(current)) {
+ long remaining = 0;
+
+ /* Try to sleep for a short interval */
+ if (!sleeping_prematurely(order, remaining)) {
+ remaining = schedule_timeout(HZ/10);
+ finish_wait(&pgdat->kswapd_wait, &wait);
+ prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
+ }
+
+ /*
+ * After a short sleep, check if it was a
+ * premature sleep. If not, then go fully
+ * to sleep until explicitly woken up
+ */
+ if (!sleeping_prematurely(order, remaining))
+ schedule();
+ else {
+ if (remaining)
+ count_vm_event(KSWAPD_PREMATURE_FAST);
+ else
+ count_vm_event(KSWAPD_PREMATURE_SLOW);
+ }
+ }

order = pgdat->kswapd_max_order;
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c81321f..90b11e4 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,6 +683,8 @@ static const char * const vmstat_text[] = {
"slabs_scanned",
"kswapd_steal",
"kswapd_inodesteal",
+ "kswapd_slept_prematurely_fast",
+ "kswapd_slept_prematurely_slow",
"pageoutrun",
"allocstall",

--
1.6.5

2009-11-12 19:30:52

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, on systems with large numbers of high-order atomics due to
crappy network cards, it's important that kswapd keep working in
parallel to save their sorry ass.

This patch takes into account the order kswapd is reclaiming at before
waiting on congestion. The higher the order, the longer it is before
kswapd considers itself to be in trouble. The impact is that kswapd
works harder in parallel rather than depending on direct reclaimers or
atomic allocations to fail.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..5e200f1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
{
int all_zones_ok;
- int priority;
+ int priority, congestion_priority;
int i;
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
*/
int temp_priority[MAX_NR_ZONES];

+ /*
+ * When priority reaches congestion_priority, kswapd will sleep
+ * for a short time while congestion clears. The higher the
+ * order being reclaimed, the less likely kswapd will go to
+ * sleep as high-order allocations are harder to reclaim and
+ * stall direct reclaimers longer
+ */
+ congestion_priority = DEF_PRIORITY - 2;
+ congestion_priority -= min(congestion_priority, sc.order);
+
loop_again:
total_scanned = 0;
sc.nr_reclaimed = 0;
@@ -2092,7 +2102,7 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
+ if (total_scanned && priority < congestion_priority)
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
--
1.6.5

2009-11-13 05:23:26

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> application afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
>
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Christoph Lameter <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>

Umm,
my mailbox has a carbon copy of akpm sending this patch to Linus
(subject and date below).

Does this version have any updates?

-----------------------------------------------------------------
Subject: [patch 07/35] page allocator: always wake kswapd when restarting an allocation attempt after direct reclaim failed
To: [email protected]
Cc: [email protected],
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
From: [email protected]
Date: Wed, 11 Nov 2009 14:26:14 -0800


2009-11-13 05:24:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER

> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.

Ditto.
AFAIK, this patch has already been sent to Linus.


>
> [[email protected]: Spotted the problem]
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
> Reviewed-by: KOSAKI Motohiro <[email protected]>
> ---
> mm/page_alloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 250d055..2bc2ac6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> */
> alloc_flags &= ~ALLOC_CPUSET;
> - } else if (unlikely(rt_task(p)))
> + } else if (unlikely(rt_task(p)) && !in_interrupt())
> alloc_flags |= ALLOC_HARDER;
>
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> --
> 1.6.5
>


2009-11-13 09:04:27

by Frans Pop

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thursday 12 November 2009, Mel Gorman wrote:
> Changelog since V2
> o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> testing, it made latencies even worse as kswapd slept more on
> high-order congestion causing order-0 direct reclaims.
> o Added changes to how congestion_wait() works
> o Added a number of new patches altering the behaviour of reclaim

I have tested this series on top of .32-rc7. First impression is that it
does seem to improve my test case, but does not yet completely solve it.

My last gitk instance now loads more smoothly for most of the time it takes
to complete, but I still see a choke point where things freeze for a while
and where I get SKB allocation errors from my wireless.
However, that choke point does seem to happen later and to be shorter than
without the patches.

I'll try to do additional tests (with .31). If you'd like me to run this
set with your instrumentation patch for congestion_wait, then please let
me know.

Chris Mason's analysis regarding dm-crypt workqueues in reply to your other
mail looks very interesting.

Cheers,
FJP

2009-11-13 09:54:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
> However, on systems with large numbers of high-order atomics due to
> crappy network cards, it's important that kswapd keep working in
> parallel to save their sorry ass.
>
> This patch takes into account the order kswapd is reclaiming at before
> waiting on congestion. The higher the order, the longer it is before
> kswapd considers itself to be in trouble. The impact is that kswapd
> works harder in parallel rather than depending on direct reclaimers or
> atomic allocations to fail.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 14 ++++++++++++--
> 1 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffa1766..5e200f1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
> static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> {
> int all_zones_ok;
> - int priority;
> + int priority, congestion_priority;
> int i;
> unsigned long total_scanned;
> struct reclaim_state *reclaim_state = current->reclaim_state;
> @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> */
> int temp_priority[MAX_NR_ZONES];
>
> + /*
> + * When priority reaches congestion_priority, kswapd will sleep
> + * for a short time while congestion clears. The higher the
> + * order being reclaimed, the less likely kswapd will go to
> + * sleep as high-order allocations are harder to reclaim and
> + * stall direct reclaimers longer
> + */
> + congestion_priority = DEF_PRIORITY - 2;
> + congestion_priority -= min(congestion_priority, sc.order);

This calculation means

sc.order   congestion_priority   scan-pages
---------------------------------------------------
    0              10            1/1024 * zone-mem
    1               9            1/512  * zone-mem
    2               8            1/256  * zone-mem
    3               7            1/128  * zone-mem
    4               6            1/64   * zone-mem
    5               5            1/32   * zone-mem
    6               4            1/16   * zone-mem
    7               3            1/8    * zone-mem
    8               2            1/4    * zone-mem
    9               1            1/2    * zone-mem
   10               0            1      * zone-mem
   11+              0            1      * zone-mem

I feel this is too aggressive. The intention of this congestion_wait()
is to prevent kswapd from using 100% CPU time, but the change above seems
to break that.

For example, ia64 has 256MB hugepages (i.e. order=14), which would mean
kswapd never sleeps.

As a second example, order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes for some
of the most inefficient reclaim because it doesn't use lumpy reclaim.
I've seen a 128GB zone; 1/128 of that is 1GB. Oh well, kswapd would
definitely waste 100% of a CPU.


> +
> loop_again:
> total_scanned = 0;
> sc.nr_reclaimed = 0;
> @@ -2092,7 +2102,7 @@ loop_again:
> * OK, kswapd is getting into trouble. Take a nap, then take
> * another pass across the zones.
> */
> - if (total_scanned && priority < DEF_PRIORITY - 2)
> + if (total_scanned && priority < congestion_priority)
> congestion_wait(BLK_RW_ASYNC, HZ/10);

Instead, How about this?



---
mm/vmscan.c | 13 ++++++++++++-
1 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..937e90d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1938,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
* free_pages == high_wmark_pages(zone).
*/
int temp_priority[MAX_NR_ZONES];
+ int has_under_min_watermark_zone = 0;

loop_again:
total_scanned = 0;
@@ -2057,6 +2058,15 @@ loop_again:
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;
+
+ /*
+ * We are still under the min watermark. That means there is a
+ * risk of GFP_ATOMIC allocation failure. Hurry up!
+ */
+ if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ end_zone, 0))
+ has_under_min_watermark_zone = 1;
+
}
if (all_zones_ok)
break; /* kswapd: all done */
@@ -2064,7 +2074,8 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
+ if (total_scanned && (priority < DEF_PRIORITY - 2) &&
+ !has_under_min_watermark_zone)
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
--
1.6.2.5


2009-11-13 10:43:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

> After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> of no IO congestion, kswapd can go to sleep very shortly after the high
> watermark was reached. If there are a constant stream of allocations from
> parallel processes, it can mean that kswapd went to sleep too quickly and
> the high watermark is not being maintained for sufficient length time.
>
> This patch makes kswapd go to sleep as a two-stage process. It first
> tries to sleep for HZ/10. If it is woken up by another process or the
> high watermark is no longer met, it's considered a premature sleep and
> kswapd continues work. Otherwise it goes fully to sleep.
>
> This adds more counters to distinguish between fast and slow breaches of
> watermarks. A "fast" premature sleep is one where the low watermark was
> hit in a very short time after kswapd going to sleep. A "slow" premature
> sleep indicates that the high watermark was breached after a very short
> interval.
>
> Signed-off-by: Mel Gorman <[email protected]>

Why do you submit this patch to mainline? This is a debugging patch,
no more and no less.


> ---
> include/linux/vmstat.h | 1 +
> mm/vmscan.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> mm/vmstat.c | 2 ++
> 3 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 2d0f222..9716003 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> PGSCAN_ZONE_RECLAIM_FAILED,
> #endif
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> + KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,

Please don't use the words "premature" and "fast"; they are too hard to understand.
Also, please use per-zone statistics (like NUMA_HIT).

>
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 190bae1..ffa1766 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> }
> #endif
>
> +/* is kswapd sleeping prematurely? */
> +static int sleeping_prematurely(int order, long remaining)
> +{
> + struct zone *zone;
> +
> + /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> + if (remaining)
> + return 1;
> +
> + /* If after HZ/10, a zone is below the high mark, it's premature */
> + for_each_populated_zone(zone)
> + if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> + 0, 0))
> + return 1;

for_each_populated_zone() iterates over all populated zones, but kswapd shouldn't look at zones on other nodes.

> +
> + return 0;
> +}
> +
> /*
> * For kswapd, balance_pgdat() will work across all this node's zones until
> * they are all at high_wmark_pages(zone).
> @@ -2184,8 +2202,30 @@ static int kswapd(void *p)
> */
> order = new_order;
> } else {
> - if (!freezing(current))
> - schedule();
> + if (!freezing(current)) {
> + long remaining = 0;
> +
> + /* Try to sleep for a short interval */
> + if (!sleeping_prematurely(order, remaining)) {
> + remaining = schedule_timeout(HZ/10);
> + finish_wait(&pgdat->kswapd_wait, &wait);
> + prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> + }
> +
> + /*
> + * After a short sleep, check if it was a
> + * premature sleep. If not, then go fully
> + * to sleep until explicitly woken up
> + */
> + if (!sleeping_prematurely(order, remaining))
> + schedule();
> + else {
> + if (remaining)
> + count_vm_event(KSWAPD_PREMATURE_FAST);
> + else
> + count_vm_event(KSWAPD_PREMATURE_SLOW);
> + }
> + }
>
> order = pgdat->kswapd_max_order;
> }
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c81321f..90b11e4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -683,6 +683,8 @@ static const char * const vmstat_text[] = {
> "slabs_scanned",
> "kswapd_steal",
> "kswapd_inodesteal",
> + "kswapd_slept_prematurely_fast",
> + "kswapd_slept_prematurely_slow",
> "pageoutrun",
> "allocstall",
>
> --
> 1.6.5
>
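The control flow of the two-stage sleep can be separated from the kernel plumbing. The sketch below models it with callbacks: `short_nap` stands in for schedule_timeout(HZ/10) and returns the unexpired jiffies, and `premature` stands in for sleeping_prematurely(). All names are illustrative; this is the shape of the logic, not kernel code:

```c
static long kswapd_premature_fast; /* woken within the short nap */
static long kswapd_premature_slow; /* watermark fell during the nap */

/* Returns 1 if kswapd committed to a full sleep, 0 if the sleep was
 * judged premature and it should resume reclaiming. */
static int kswapd_try_to_sleep(int order,
                               long (*short_nap)(void),
                               int (*premature)(int order, long remaining))
{
    long remaining = 0;

    /* Stage one: nap for a short interval, but only if things look
     * balanced right now. */
    if (!premature(order, remaining))
        remaining = short_nap();

    /* Stage two: re-check after the nap; commit to a full sleep only
     * if nobody woke us and the watermarks still hold. */
    if (!premature(order, remaining))
        return 1;

    if (remaining)
        kswapd_premature_fast++;  /* a waker arrived within HZ/10 */
    else
        kswapd_premature_slow++;  /* watermark breached during the nap */
    return 0;
}

/* Demo stubs for the two callbacks. */
static long nap_uninterrupted(void) { return 0; }
static long nap_interrupted(void)   { return 5; }
static int premature_if_woken(int order, long remaining)
{
    (void)order;
    return remaining != 0;
}
static int watermark_already_low(int order, long remaining)
{
    (void)order; (void)remaining;
    return 1;
}
```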


2009-11-13 11:20:39

by KOSAKI Motohiro

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

(cc to Jens)

> Testing by Frans Pop indicated that, in the 2.6.30..2.6.31 window at least,
> the commits 373c0a7e and 8aa7e847 dramatically increased the number of
> GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> this patch seemed to help a lot even though it was pointed out that the
> congestion changes were very far away from high-order atomic allocations.
>
> The key to why the revert makes such a big difference is down to timing and
> how long direct reclaimers wait versus kswapd. With the patch reverted,
> the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> significant part of the workload involved reads, it makes sense that the
> SYNC list is what was truly congested and with the revert processes were
> waiting on congestion as expected. Hence, direct reclaimers stalled
> properly and kswapd was able to do its job with fewer stalls.
>
> This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> queue which would only fix a particular type of workload, this patch adds a
> third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> and then the SYNC queue if the timeout has not been reached. In tests, this
> counter-intuitively results in kswapd stalling less and freeing up pages
> resulting in fewer allocation failures and fewer direct-reclaim-orientated
> stalls.

Honestly, I don't like this patch. The page allocator is not related to
the sync block queue, and vmscan doesn't issue read operations.
This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.

Please don't add mysterious heuristic code.


Side note: I doubt this regression was caused by the page allocator.
We probably need to confirm what changed in the callers....



>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/backing-dev.h | 1 +
> mm/backing-dev.c | 25 ++++++++++++++++++++++---
> mm/page_alloc.c | 4 ++--
> mm/vmscan.c | 2 +-
> 4 files changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index b449e73..b35344c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -276,6 +276,7 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> enum {
> BLK_RW_ASYNC = 0,
> BLK_RW_SYNC = 1,
> + BLK_RW_BOTH = 2,
> };
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 1065b71..ea9ffc3 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -736,22 +736,41 @@ EXPORT_SYMBOL(set_bdi_congested);
>
> /**
> * congestion_wait - wait for a backing_dev to become uncongested
> - * @sync: SYNC or ASYNC IO
> + * @sync: SYNC, ASYNC or BOTH IO
> * @timeout: timeout in jiffies
> *
> * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> * write congestion. If no backing_devs are congested then just wait for the
> * next write to be completed.
> */
> -long congestion_wait(int sync, long timeout)
> +long congestion_wait(int sync_request, long timeout)
> {
> long ret;
> DEFINE_WAIT(wait);
> - wait_queue_head_t *wqh = &congestion_wqh[sync];
> + int sync;
> + wait_queue_head_t *wqh;
> +
> + /* If requested to sync both, wait on ASYNC first, then SYNC */
> + if (sync_request == BLK_RW_BOTH)
> + sync = BLK_RW_ASYNC;
> + else
> + sync = sync_request;
> +
> +again:
> + wqh = &congestion_wqh[sync];
>
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> finish_wait(wqh, &wait);
> +
> + if (sync_request == BLK_RW_BOTH) {
> + sync_request = 0;
> + sync = BLK_RW_SYNC;
> + timeout = ret;
> + if (timeout)
> + goto again;
> + }
> +
> return ret;
> }
> EXPORT_SYMBOL(congestion_wait);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 2bc2ac6..f6ed41c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1727,7 +1727,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> preferred_zone, migratetype);
>
> if (!page && gfp_mask & __GFP_NOFAIL)
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + congestion_wait(BLK_RW_BOTH, HZ/50);
> } while (!page && (gfp_mask & __GFP_NOFAIL));
>
> return page;
> @@ -1898,7 +1898,7 @@ rebalance:
> pages_reclaimed += did_some_progress;
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + congestion_wait(BLK_RW_BOTH, HZ/50);
> goto rebalance;
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 777af57..190bae1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1793,7 +1793,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>
> /* Take a nap, wait for some writeback to complete */
> if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + congestion_wait(BLK_RW_BOTH, HZ/10);
> }
> /* top priority shrink_zones still had more to do? don't OOM, then */
> if (!sc->all_unreclaimable && scanning_global_lru(sc))
> --
> 1.6.5
>
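Mechanically, the BLK_RW_BOTH path quoted above is a carry-over-timeout loop: wait on the ASYNC queue first, then spend whatever jiffies remain on the SYNC queue. A userspace model of just that flow, where `wait_on_queue` stands in for the kernel's prepare_to_wait()/io_schedule_timeout() pair (all names here are illustrative):

```c
enum { RW_ASYNC = 0, RW_SYNC = 1, RW_BOTH = 2 };

/* Returns the jiffies left over after waiting, mirroring the patched
 * congestion_wait(): for RW_BOTH, wait on ASYNC first, then spend any
 * unexpired part of the timeout on SYNC. */
static long congestion_wait_model(int sync_request, long timeout,
                                  long (*wait_on_queue)(int queue, long timeout))
{
    int sync = (sync_request == RW_BOTH) ? RW_ASYNC : sync_request;
    long ret;

again:
    ret = wait_on_queue(sync, timeout);

    if (sync_request == RW_BOTH) {
        sync_request = RW_ASYNC;   /* run the loop at most twice */
        sync = RW_SYNC;
        timeout = ret;             /* carry the unexpired jiffies over */
        if (timeout)
            goto again;
    }
    return ret;
}

/* Demo stubs: one queue that consumes half the timeout before waking
 * the waiter, one that sleeps out the full period. */
static long queue_consumes_half(int queue, long timeout)
{
    (void)queue;
    return timeout / 2;
}
static long queue_sleeps_fully(int queue, long timeout)
{
    (void)queue; (void)timeout;
    return 0;
}
```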


2009-11-13 11:55:58

by Jens Axboe

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> (cc to Jens)
>
> > Testing by Frans Pop indicated that, in the 2.6.30..2.6.31 window at least,
> > the commits 373c0a7e and 8aa7e847 dramatically increased the number of
> > GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> > this patch seemed to help a lot even though it was pointed out that the
> > congestion changes were very far away from high-order atomic allocations.
> >
> > The key to why the revert makes such a big difference is down to timing and
> > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > significant part of the workload involved reads, it makes sense that the
> > SYNC list is what was truly congested and with the revert processes were
> > waiting on congestion as expected. Hence, direct reclaimers stalled
> > properly and kswapd was able to do its job with fewer stalls.
> >
> > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > queue which would only fix a particular type of workload, this patch adds a
> > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > and then the SYNC queue if the timeout has not been reached. In tests, this
> > counter-intuitively results in kswapd stalling less and freeing up pages
> > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > stalls.
>
> Honestly, I don't like this patch. The page allocator is not related to
> the sync block queue, and vmscan doesn't issue read operations.
> This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.
>
> Please don't add mysterious heuristic code.
>
>
> Side note: I doubt this regression was caused by the page allocator.
> We probably need to confirm what changed in the callers....

See the email from Chris from yesterday, he nicely explains why this
change made a difference with dm-crypt. dm-crypt needs fixing, not a
hack like this added.

The vm needs to drop congestion hints and usage, not increase it. The
above changelog is mostly hand-wavy nonsense, imho.

--
Jens Axboe

2009-11-13 12:28:25

by Mel Gorman

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > (cc to Jens)
> >
> > > Testing by Frans Pop indicated that, in the 2.6.30..2.6.31 window at least,
> > > the commits 373c0a7e and 8aa7e847 dramatically increased the number of
> > > GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> > > this patch seemed to help a lot even though it was pointed out that the
> > > congestion changes were very far away from high-order atomic allocations.
> > >
> > > The key to why the revert makes such a big difference is down to timing and
> > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > significant part of the workload involved reads, it makes sense that the
> > > SYNC list is what was truly congested and with the revert processes were
> > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > properly and kswapd was able to do its job with fewer stalls.
> > >
> > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > queue which would only fix a particular type of workload, this patch adds a
> > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > and then the SYNC queue if the timeout has not been reached. In tests, this
> > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > stalls.
> >
> > Honestly, I don't like this patch. The page allocator is not related to
> > the sync block queue, and vmscan doesn't issue read operations.
> > This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.
> >
> > Please don't add mysterious heuristic code.
> >
> >
> > Side note: I doubt this regression was caused by the page allocator.

Probably not. As noted, the major change is really in how long callers
are waiting on congestion_wait. The tarball includes graphs from an
instrumented kernel that shows how long callers are waiting due to
congestion_wait(). This has changed significantly.

I'll queue up tests over the weekend that test without dm-crypt being involved.

> > We probably need to confirm what changed in the callers....
>
> See the email from Chris from yesterday, he nicely explains why this
> change made a difference with dm-crypt.

Indeed.

But bear in mind that it is also possible that direct reclaimers are
congesting the queue due to swap-in.

> dm-crypt needs fixing, not a hack like this added.
>

As noted by Chris in the same mail, dm-crypt has not changed. What has
changed is how long callers wait in congestion_wait.


> The vm needs to drop congestion hints and usage, not increase it. The
> above changelog is mostly hand-wavy nonsense, imho.
>

Suggest an alternative that brings congestion_wait() more in line with
2.6.30 behaviour then.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 12:41:33

by Minchan Kim

Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

On Fri, Nov 13, 2009 at 4:30 AM, Mel Gorman <[email protected]> wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
> However, on systems with large numbers of high-order atomics due to
> crappy network cards, it's important that kswapd keep working in
> parallel to save their sorry ass.
>
> This patch takes into account the order kswapd is reclaiming at before
> waiting on congestion. The higher the order, the longer it is before
> kswapd considers itself to be in trouble. The impact is that kswapd
> works harder in parallel rather than depending on direct reclaimers or
> atomic allocations to fail.
>
> Signed-off-by: Mel Gorman <[email protected]>

It makes sense to me.
It can help high-order atomic allocation, which is a big problem for the allocator. :)

Thanks Mel.

Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-11-13 12:47:37

by Tobias Oetiker

Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

Hi Mel,

Yesterday Mel Gorman wrote:

> Sorry for the long delay in posting another version. Testing is extremely
> time-consuming and I wasn't getting to work on this as much as I'd have liked.
>
> Changelog since V2
> o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> testing, it made latencies even worse as kswapd slept more on high-order
> congestion causing order-0 direct reclaims.
> o Added changes to how congestion_wait() works
> o Added a number of new patches altering the behaviour of reclaim

so is there anything promising for the order-5 allocation problems
in this set?

cheers
tobi


--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-11-13 13:32:11

by Jens Axboe

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13 2009, Mel Gorman wrote:
> On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > (cc to Jens)
> > >
> > > > Testing by Frans Pop indicated that, in the 2.6.30..2.6.31 window at least,
> > > > the commits 373c0a7e and 8aa7e847 dramatically increased the number of
> > > > GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> > > > this patch seemed to help a lot even though it was pointed out that the
> > > > congestion changes were very far away from high-order atomic allocations.
> > > >
> > > > The key to why the revert makes such a big difference is down to timing and
> > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > significant part of the workload involved reads, it makes sense that the
> > > > SYNC list is what was truly congested and with the revert processes were
> > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > properly and kswapd was able to do its job with fewer stalls.
> > > >
> > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > queue which would only fix a particular type of workload, this patch adds a
> > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > and then the SYNC queue if the timeout has not been reached. In tests, this
> > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > stalls.
> > >
> > > Honestly, I don't like this patch. The page allocator is not related to
> > > the sync block queue, and vmscan doesn't issue read operations.
> > > This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.
> > >
> > > Please don't add mysterious heuristic code.
> > >
> > >
> > > Side note: I doubt this regression was caused by the page allocator.
>
> Probably not. As noted, the major change is really in how long callers
> are waiting on congestion_wait. The tarball includes graphs from an
> instrumented kernel that shows how long callers are waiting due to
> congestion_wait(). This has changed significantly.
>
> I'll queue up tests over the weekend that test without dm-crypt being involved.
>
> > > We probably need to confirm what changed in the callers....
> >
> > See the email from Chris from yesterday, he nicely explains why this
> > change made a difference with dm-crypt.
>
> Indeed.
>
> But bear in mind that it is also possible that direct reclaimers are
> congesting the queue due to swap-in.

Are you speculating, or has this been observed? While I don't contest
that that could happen, it's also not a new thing. And it should be an
unlikely event.

> > dm-crypt needs fixing, not a hack like this added.
> >
>
> As noted by Chris in the same mail, dm-crypt has not changed. What has
> changed is how long callers wait in congestion_wait.

Right, dm-crypt didn't change; it WAS ALREADY BUGGY.

> > The vm needs to drop congestion hints and usage, not increase it. The
> > above changelog is mostly hand-wavy nonsense, imho.
> >
>
> Suggest an alternative that brings congestion_wait() more in line with
> 2.6.30 behaviour then.

I don't have a good explanation as to why the delays have changed,
unfortunately. Are we sure that they have between .30 and .31? The
dm-crypt case is overly complex and lots of changes could have broken
that house of cards.

--
Jens Axboe

2009-11-13 13:37:46

by Mel Gorman

Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Fri, Nov 13, 2009 at 01:47:35PM +0100, Tobias Oetiker wrote:
> Hi Mel,
>
> Yesterday Mel Gorman wrote:
>
> > Sorry for the long delay in posting another version. Testing is extremely
> > time-consuming and I wasn't getting to work on this as much as I'd have liked.
> >
> > Changelog since V2
> > o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> > testing, it made latencies even worse as kswapd slept more on high-order
> > congestion causing order-0 direct reclaims.
> > o Added changes to how congestion_wait() works
> > o Added a number of new patches altering the behaviour of reclaim
>
> so is there anything promising for the order-5 allocation problems
> in this set?
>

Yes. While the change in timing of direct reclaimers might be less
important when dm-crypt is not involved, kswapd is more pro-active about
maintaining the watermarks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 13:41:48

by Pekka Enberg

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

Hi Jens,

On Fri, Nov 13, 2009 at 3:32 PM, Jens Axboe <[email protected]> wrote:
>> Suggest an alternative that brings congestion_wait() more in line with
>> 2.6.30 behaviour then.
>
> I don't have a good explanation as to why the delays have changed,
> unfortunately. Are we sure that they have between .30 and .31? The
> dm-crypt case is overly complex and lots of changes could have broken
> that house of cards.

Hand-waving or not, we have end user reports stating that reverting
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0 ("Fix
congestion_wait() sync/async vs read/write confusion") fixes their
(rather serious) OOM regression. The commit in question _does_
introduce a functional change and if this was your average regression,
people would be kicking and screaming to get it reverted.

So is there a reason we shouldn't send a partial revert of the commit
(switching to BLK_RW_SYNC) to Linus until the "real" issue gets
resolved? Yes, I realize it's ugly voodoo magic but dammit, it used to
work!

Pekka

2009-11-13 13:54:45

by Mel Gorman

Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

On Fri, Nov 13, 2009 at 06:54:29PM +0900, KOSAKI Motohiro wrote:
> > If reclaim fails to make sufficient progress, the priority is raised.
> > Once the priority is higher, kswapd starts waiting on congestion.
> > However, on systems with large numbers of high-order atomics due to
> > crappy network cards, it's important that kswapd keep working in
> > parallel to save their sorry ass.
> >
> > This patch takes into account the order kswapd is reclaiming at before
> > waiting on congestion. The higher the order, the longer it is before
> > kswapd considers itself to be in trouble. The impact is that kswapd
> > works harder in parallel rather than depending on direct reclaimers or
> > atomic allocations to fail.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 14 ++++++++++++--
> > 1 files changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ffa1766..5e200f1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
> > static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > {
> > int all_zones_ok;
> > - int priority;
> > + int priority, congestion_priority;
> > int i;
> > unsigned long total_scanned;
> > struct reclaim_state *reclaim_state = current->reclaim_state;
> > @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > */
> > int temp_priority[MAX_NR_ZONES];
> >
> > + /*
> > + * When priority reaches congestion_priority, kswapd will sleep
> > + * for a short time while congestion clears. The higher the
> > + * order being reclaimed, the less likely kswapd will go to
> > + * sleep as high-order allocations are harder to reclaim and
> > + * stall direct reclaimers longer
> > + */
> > + congestion_priority = DEF_PRIORITY - 2;
> > + congestion_priority -= min(congestion_priority, sc.order);
>
> This calculation means:
>
> sc.order congestion_priority scan-pages
> ---------------------------------------------------------
> 0 10 1/1024 * zone-mem
> 1 9 1/512 * zone-mem
> 2 8 1/256 * zone-mem
> 3 7 1/128 * zone-mem
> 4 6 1/64 * zone-mem
> 5 5 1/32 * zone-mem
> 6 4 1/16 * zone-mem
> 7 3 1/8 * zone-mem
> 8 2 1/4 * zone-mem
> 9 1 1/2 * zone-mem
> 10 0 1 * zone-mem
> 11+ 0 1 * zone-mem
>
> I feel this is too aggressive. The intention of this congestion_wait()
> is to prevent kswapd from using 100% CPU time.

Ok, I thought the intention might be to avoid dumping too many pages on
the queue but it was already waiting on congestion elsewhere.

> but the above promotion seems to
> break it.
>
> example,
> ia64 has 256MB hugepages (i.e. order=14). That means kswapd would never sleep.
>
> example2,
> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes for some of the most inefficient
> reclaim, because it doesn't use lumpy reclaim.
> I've seen a 128GB zone; 1/128 of that is 1GB. Oh well, kswapd definitely
> wastes 100% CPU time.
>
>
> > +
> > loop_again:
> > total_scanned = 0;
> > sc.nr_reclaimed = 0;
> > @@ -2092,7 +2102,7 @@ loop_again:
> > * OK, kswapd is getting into trouble. Take a nap, then take
> > * another pass across the zones.
> > */
> > - if (total_scanned && priority < DEF_PRIORITY - 2)
> > + if (total_scanned && priority < congestion_priority)
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Instead, How about this?
>

This makes a lot of sense. Tests look good and I added stats to make sure
the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
times and X86-64 avoided it 5084 times. I think we should hold onto the
stats temporarily until all these bugs are ironed out.

Would you like to sign off the following?

If you are ok to sign off, this patch should replace my patch 5 in
the series.

==== CUT HERE ====

vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion. However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

[[email protected]: Add stats to track how relevant the logic is]
Needs-signed-off-by-original-author

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+ KSWAPD_NO_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70967e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1966,6 +1966,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
* free_pages == high_wmark_pages(zone).
*/
int temp_priority[MAX_NR_ZONES];
+ int has_under_min_watermark_zone = 0;

loop_again:
total_scanned = 0;
@@ -2085,6 +2086,15 @@ loop_again:
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;
+
+ /*
+ * We are still under the min watermark. That means there is a
+ * risk of GFP_ATOMIC allocation failure. Hurry up!
+ */
+ if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ end_zone, 0))
+ has_under_min_watermark_zone = 1;
+
}
if (all_zones_ok)
break; /* kswapd: all done */
@@ -2092,8 +2102,13 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+
+ if (!has_under_min_watermark_zone)
+ count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
+ else
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }

/*
* We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
"kswapd_inodesteal",
"kswapd_slept_prematurely_fast",
"kswapd_slept_prematurely_slow",
+ "kswapd_no_congestion_wait",
"pageoutrun",
"allocstall",

2009-11-13 13:55:39

by Mel Gorman

Subject: Re: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

On Fri, Nov 13, 2009 at 02:23:21PM +0900, KOSAKI Motohiro wrote:
> > If a direct reclaim makes no forward progress, it considers whether it
> > should go OOM or not. Whether OOM is triggered or not, it may retry the
> > application afterwards. In times past, this would always wake kswapd as well
> > but currently, kswapd is not woken up after direct reclaim fails. For order-0
> > allocations, this makes little difference but if there is a heavy mix of
> > higher-order allocations that direct reclaim is failing for, it might mean
> > that kswapd is not rewoken for higher orders as much as it did previously.
> >
> > This patch wakes up kswapd when an allocation is being retried after a direct
> > reclaim failure. It would be expected that kswapd is already awake, but
> > this has the effect of telling kswapd to reclaim at the higher order as well.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Christoph Lameter <[email protected]>
> > Reviewed-by: Pekka Enberg <[email protected]>
> > Reviewed-by: KOSAKI Motohiro <[email protected]>
>
> Umm,
> My mailbox has a carbon copy of akpm sending this patch to Linus
> (subject and date below).
>
> Does this have any update?
>

No, it doesn't. I included it in the series for the benefit of testers.

> -----------------------------------------------------------------
> Subject: [patch 07/35] page allocator: always wake kswapd when restarting an allocation attempt after direct reclaim failed
> To: [email protected]
> Cc: [email protected],
> [email protected],
> [email protected],
> [email protected],
> [email protected],
> [email protected]
> From: [email protected]
> Date: Wed, 11 Nov 2009 14:26:14 -0800
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 13:56:20

by Mel Gorman

Subject: Re: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER

On Fri, Nov 13, 2009 at 02:24:13PM +0900, KOSAKI Motohiro wrote:
> > Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> > slightly by allowing rt_tasks that are handling an interrupt to set
> > ALLOC_HARDER. This patch brings the watermark logic more in line with
> > 2.6.30.
>
> Ditto.
> AFAIK, this patch has been sent to Linus.
>

And ditto, it's there for testers, particularly when patching against
2.6.31.

>
> >
> > [[email protected]: Spotted the problem]
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Pekka Enberg <[email protected]>
> > Reviewed-by: Rik van Riel <[email protected]>
> > Reviewed-by: KOSAKI Motohiro <[email protected]>
> > ---
> > mm/page_alloc.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 250d055..2bc2ac6 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> > */
> > alloc_flags &= ~ALLOC_CPUSET;
> > - } else if (unlikely(rt_task(p)))
> > + } else if (unlikely(rt_task(p)) && !in_interrupt())
> > alloc_flags |= ALLOC_HARDER;
> >
> > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > --
> > 1.6.5
> >
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 14:13:09

by Mel Gorman

Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > of no IO congestion, kswapd can go to sleep very shortly after the high
> > watermark was reached. If there are a constant stream of allocations from
> > parallel processes, it can mean that kswapd went to sleep too quickly and
> > the high watermark is not being maintained for a sufficient length of time.
> >
> > This patch makes kswapd go to sleep as a two-stage process. It first
> > tries to sleep for HZ/10. If it is woken up by another process or the
> > high watermark is no longer met, it's considered a premature sleep and
> > kswapd continues work. Otherwise it goes fully to sleep.
> >
> > This adds more counters to distinguish between fast and slow breaches of
> > watermarks. A "fast" premature sleep is one where the low watermark was
> > hit in a very short time after kswapd going to sleep. A "slow" premature
> > sleep indicates that the high watermark was breached after a very short
> > interval.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Why do you submit this patch to mainline? This is a debugging patch,
> no more and no less.
>

Do you mean the stats part? The stats are included until such time as the page
allocator failure reports stop or are significantly reduced. In the event a
report is received, the value of the counters help determine if kswapd was
struggling or not. They should be removed once this mess is ironed out.

If there is a preference, I can split out the stats part and send it to
people with page allocator failure reports for retesting.

>
> > ---
> > include/linux/vmstat.h | 1 +
> > mm/vmscan.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> > mm/vmstat.c | 2 ++
> > 3 files changed, 45 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 2d0f222..9716003 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -40,6 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > PGSCAN_ZONE_RECLAIM_FAILED,
> > #endif
> > PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> > + KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>
> Please don't use the words "premature" and "fast"; they are too hard to understand.

How about KSWAPD_LOW_WMARK_HIT_QUICKLY and
KSWAPD_HIGH_WMARK_HIT_QUICKLY?


> Plus, please use per-zone statistics (like NUMA_HIT).
>

Is that not overkill? The intention is just to track in general terms if
kswapd is struggling or not. I hadn't been considering the problem on a
per-zone basis to date.

> >
> > #ifdef CONFIG_HUGETLB_PAGE
> > HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 190bae1..ffa1766 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1904,6 +1904,24 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > }
> > #endif
> >
> > +/* is kswapd sleeping prematurely? */
> > +static int sleeping_prematurely(int order, long remaining)
> > +{
> > + struct zone *zone;
> > +
> > + /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > + if (remaining)
> > + return 1;
> > +
> > + /* If after HZ/10, a zone is below the high mark, it's premature */
> > + for_each_populated_zone(zone)
> > + if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> > + 0, 0))
> > + return 1;
>
> for_each_populated_zone() iterates over all populated zones, but kswapd shouldn't look at other nodes.
>

Good spot.

How about the following?

==== CUT HERE ====
vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

This patch is a fix and a clarification to the patch "vmscan: Have
kswapd sleep for a short interval and double check it should be asleep".
The fix is for kswapd to only check zones in the node it is responsible
for. The clarification is to rename two counters to better explain what is
being counted.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 2 +-
mm/vmscan.c | 20 +++++++++++++-------
mm/vmstat.c | 4 ++--
3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 7d66695..0591a48 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_ZONE_RECLAIM_FAILED,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
- KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+ KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_NO_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 70967e1..5557555 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
#endif

/* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(int order, long remaining)
+static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
{
- struct zone *zone;
+ int i;

/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
if (remaining)
return 1;

/* If after HZ/10, a zone is below the high mark, it's premature */
- for_each_populated_zone(zone)
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ struct zone *zone = pgdat->node_zones + i;
+
+ if (!populated_zone(zone))
+ continue;
+
if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
0, 0))
return 1;
+ }

return 0;
}
@@ -2221,7 +2227,7 @@ static int kswapd(void *p)
long remaining = 0;

/* Try to sleep for a short interval */
- if (!sleeping_prematurely(order, remaining)) {
+ if (!sleeping_prematurely(pgdat, order, remaining)) {
remaining = schedule_timeout(HZ/10);
finish_wait(&pgdat->kswapd_wait, &wait);
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2232,13 +2238,13 @@ static int kswapd(void *p)
* premature sleep. If not, then go fully
* to sleep until explicitly woken up
*/
- if (!sleeping_prematurely(order, remaining))
+ if (!sleeping_prematurely(pgdat, order, remaining))
schedule();
else {
if (remaining)
- count_vm_event(KSWAPD_PREMATURE_FAST);
+ count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
else
- count_vm_event(KSWAPD_PREMATURE_SLOW);
+ count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
}
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index bc09547..6cc8dc6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
"slabs_scanned",
"kswapd_steal",
"kswapd_inodesteal",
- "kswapd_slept_prematurely_fast",
- "kswapd_slept_prematurely_slow",
+ "kswapd_low_wmark_hit_quickly",
+ "kswapd_high_wmark_hit_quickly",
"kswapd_no_congestion_wait",
"pageoutrun",
"allocstall",

2009-11-13 14:16:29

by Mel Gorman

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13, 2009 at 02:32:12PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, Mel Gorman wrote:
> > On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > > (cc to Jens)
> > > >
> > > > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > > > GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> > > > > this patch seemed to help a lot even though it was pointed out that the
> > > > > congestion changes were very far away from high-order atomic allocations.
> > > > >
> > > > > The key to why the revert makes such a big difference is down to timing and
> > > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > > significant part of the workload involved reads, it makes sense that the
> > > > > SYNC list is what was truly congested and with the revert processes were
> > > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > > properly and kswapd was able to do its job with fewer stalls.
> > > > >
> > > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > > queue which would only fix a particular type of workload, this patch adds a
> > > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > > and then the SYNC queue if the timeout has not been reached. In tests, this
> > > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > > stalls.
> > > >
> > > > Honestly, I don't like this patch. The page allocator is not related to
> > > > the sync block queue, and vmscan doesn't issue read operations.
> > > > This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.
> > > >
> > > > Please don't make mysterious heuristic code.
> > > >
> > > >
> > > > Sidenote: I doubt this regression was caused by the page allocator.
> >
> > Probably not. As noted, the major change is really in how long callers
> > are waiting on congestion_wait. The tarball includes graphs from an
> > instrumented kernel that shows how long callers are waiting due to
> > congestion_wait(). This has changed significantly.
> >
> > I'll queue up tests over the weekend that test without dm-crypt being involved.
> >
> > > > Probably we need to confirm caller change....
> > >
> > > See the email from Chris from yesterday, he nicely explains why this
> > > change made a difference with dm-crypt.
> >
> > Indeed.
> >
> > But bear in mind that it also possible that direct reclaimers are also
> > congesting the queue due to swap-in.
>
> Are you speculating, or has this been observed?

Speculating. I'll need to adjust the test and instrumentation to know
for sure.

> While I don't contest
> that that could happen, it's also not a new thing. And it should be an
> unlikely event.
>
> > > dm-crypt needs fixing, not a hack like this added.
> > >
> >
> > As noted by Chris in the same mail, dm-crypt has not changed. What has
> > changed is how long callers wait in congestion_wait.
>
> Right, dm-crypt didn't change; it WAS ALREADY BUGGY.
>

Fair point.

> > > The vm needs to drop congestion hints and usage, not increase it. The
> > > above changelog is mostly hand-wavy nonsense, imho.
> > >
> >
> > Suggest an alternative that brings congestion_wait() more in line with
> > 2.6.30 behaviour then.
>
> I don't have a good explanation as to why the delays have changed,
> unfortunately. Are we sure that they have between .30 and .31?

Fairly sure. The original reporter first reported that the problem was
in this range and reverting one of the congestion_wait() patches
appeared to help. What the revert was really doing was making
congestion_wait() on SYNC instead of ASYNC in a number of cases.

> The
> dm-crypt case is overly complex and lots of changes could have broken
> that house of cards.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 14:38:33

by Minchan Kim

Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

Hi, Kosaki.

I missed this patch.
I noticed it only after Mel replied to your patch.

On Fri, Nov 13, 2009 at 6:54 PM, KOSAKI Motohiro
<[email protected]> wrote:
>> If reclaim fails to make sufficient progress, the priority is raised.
>> Once the priority is higher, kswapd starts waiting on congestion.
>> However, on systems with large numbers of high-order atomics due to
>> crappy network cards, it's important that kswapd keep working in
>> parallel to save their sorry ass.
>>
>> This patch takes into account the order kswapd is reclaiming at before
>> waiting on congestion. The higher the order, the longer it is before
>> kswapd considers itself to be in trouble. The impact is that kswapd
>> works harder in parallel rather than depending on direct reclaimers or
>> atomic allocations to fail.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>> ---
>>  mm/vmscan.c |   14 ++++++++++++--
>>  1 files changed, 12 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index ffa1766..5e200f1 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
>>  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>>  {
>>      int all_zones_ok;
>> -    int priority;
>> +    int priority, congestion_priority;
>>      int i;
>>      unsigned long total_scanned;
>>      struct reclaim_state *reclaim_state = current->reclaim_state;
>> @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>>       */
>>      int temp_priority[MAX_NR_ZONES];
>>
>> +    /*
>> +     * When priority reaches congestion_priority, kswapd will sleep
>> +     * for a short time while congestion clears. The higher the
>> +     * order being reclaimed, the less likely kswapd will go to
>> +     * sleep as high-order allocations are harder to reclaim and
>> +     * stall direct reclaimers longer
>> +     */
>> +    congestion_priority = DEF_PRIORITY - 2;
>> +    congestion_priority -= min(congestion_priority, sc.order);
>
> This calculation means:
>
>       sc.order        congestion_priority     scan-pages
>       ---------------------------------------------------------
>       0               10                      1/1024 * zone-mem
>       1               9                       1/512  * zone-mem
>       2               8                       1/256  * zone-mem
>       3               7                       1/128  * zone-mem
>       4               6                       1/64   * zone-mem
>       5               5                       1/32   * zone-mem
>       6               4                       1/16   * zone-mem
>       7               3                       1/8    * zone-mem
>       8               2                       1/4    * zone-mem
>       9               1                       1/2    * zone-mem
>       10              0                       1      * zone-mem
>       11+             0                       1      * zone-mem
>
> I feel this is too aggressive. The intention of this congestion_wait()
> is to prevent kswapd from using 100% CPU time, but the above promotion
> seems to break it.

I can't understand your point.
Mel didn't change the number of scanned pages.
That depends on priority.
He just added another variable to prevent frequent congestion_wait.
Still, shrink_zone is called with priority, not congestion_priority.

> example,
> ia64 has a 256MB hugepage (i.e. order=14). It means kswapd would never sleep.

Indeed. Good catch.

> example2,
> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes for one of the most inefficient
> forms of reclaim, because it doesn't use lumpy reclaim.
> I've seen a 128GB zone; that means 1/128 = 1GB scanned. Oh well, kswapd
> would definitely waste 100% CPU time.

As I said above, it depends on priority, not congestion_priority.

>
>> +
>>  loop_again:
>>      total_scanned = 0;
>>      sc.nr_reclaimed = 0;
>> @@ -2092,7 +2102,7 @@ loop_again:
>>               * OK, kswapd is getting into trouble.  Take a nap, then take
>>               * another pass across the zones.
>>               */
>> -            if (total_scanned && priority < DEF_PRIORITY - 2)
>> +            if (total_scanned && priority < congestion_priority)
>>                      congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Instead, How about this?
>
>
>
> ---
>  mm/vmscan.c |   13 ++++++++++++-
>  1 files changed, 12 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64e4388..937e90d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1938,6 +1938,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>        * free_pages == high_wmark_pages(zone).
>        */
>       int temp_priority[MAX_NR_ZONES];
> +     int has_under_min_watermark_zone = 0;

Let's make it shorter.
How about "under_min_watermark"?

>
>  loop_again:
>       total_scanned = 0;
> @@ -2057,6 +2058,15 @@ loop_again:
>                       if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
>                           total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
>                               sc.may_writepage = 1;
> +
> +                     /*
> +                      * We are still under min water mark. it mean we have
> +                      * GFP_ATOMIC allocation failure risk. Hurry up!
> +                      */
> +                     if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
> +                                            end_zone, 0))
> +                             has_under_min_watermark_zone = 1;
> +
>               }
>               if (all_zones_ok)
>                       break;          /* kswapd: all done */
> @@ -2064,7 +2074,8 @@ loop_again:
>                * OK, kswapd is getting into trouble.  Take a nap, then take
>                * another pass across the zones.
>                */
> -             if (total_scanned && priority < DEF_PRIORITY - 2)
> +             if (total_scanned && (priority < DEF_PRIORITY - 2) &&
> +                 !has_under_min_watermark_zone)
>                       congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>               /*
> --
> 1.6.2.5

Anyway, looks good to me.
It's more straightforward than Mel's version, I think.

Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-11-13 14:48:58

by Minchan Kim

Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

On Fri, Nov 13, 2009 at 10:54 PM, Mel Gorman <[email protected]> wrote:
> On Fri, Nov 13, 2009 at 06:54:29PM +0900, KOSAKI Motohiro wrote:
>> > If reclaim fails to make sufficient progress, the priority is raised.
>> > Once the priority is higher, kswapd starts waiting on congestion.
>> > However, on systems with large numbers of high-order atomics due to
>> > crappy network cards, it's important that kswapd keep working in
>> > parallel to save their sorry ass.
>> >
>> > This patch takes into account the order kswapd is reclaiming at before
>> > waiting on congestion. The higher the order, the longer it is before
>> > kswapd considers itself to be in trouble. The impact is that kswapd
>> > works harder in parallel rather than depending on direct reclaimers or
>> > atomic allocations to fail.
>> >
>> > Signed-off-by: Mel Gorman <[email protected]>
>> > ---
>> >  mm/vmscan.c |   14 ++++++++++++--
>> >  1 files changed, 12 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index ffa1766..5e200f1 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
>> >  static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>> >  {
>> >     int all_zones_ok;
>> > -   int priority;
>> > +   int priority, congestion_priority;
>> >     int i;
>> >     unsigned long total_scanned;
>> >     struct reclaim_state *reclaim_state = current->reclaim_state;
>> > @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
>> >      */
>> >     int temp_priority[MAX_NR_ZONES];
>> >
>> > +   /*
>> > +    * When priority reaches congestion_priority, kswapd will sleep
>> > +    * for a short time while congestion clears. The higher the
>> > +    * order being reclaimed, the less likely kswapd will go to
>> > +    * sleep as high-order allocations are harder to reclaim and
>> > +    * stall direct reclaimers longer
>> > +    */
>> > +   congestion_priority = DEF_PRIORITY - 2;
>> > +   congestion_priority -= min(congestion_priority, sc.order);
>>
>> This calculation means:
>>
>>      sc.order        congestion_priority     scan-pages
>>      ---------------------------------------------------------
>>      0               10                      1/1024 * zone-mem
>>      1               9                       1/512  * zone-mem
>>      2               8                       1/256  * zone-mem
>>      3               7                       1/128  * zone-mem
>>      4               6                       1/64   * zone-mem
>>      5               5                       1/32   * zone-mem
>>      6               4                       1/16   * zone-mem
>>      7               3                       1/8    * zone-mem
>>      8               2                       1/4    * zone-mem
>>      9               1                       1/2    * zone-mem
>>      10              0                       1      * zone-mem
>>      11+             0                       1      * zone-mem
>>
>> I feel this is too aggressive. The intention of this congestion_wait()
>> is to prevent kswapd from using 100% CPU time.

As I said in my reply to Kosaki's patch, I can't understand the point.

> Ok, I thought the intention might be to avoid dumping too many pages on
> the queue but it was already waiting on congestion elsewhere.
>
>> but the above promotion seems
>> break it.
>>
>> example,
>> ia64 has a 256MB hugepage (i.e. order=14). It means kswapd would never sleep.
>> example2,

But this is a real problem that I missed in my review.
Thanks, Kosaki.

>> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes for one of the most inefficient
>> forms of reclaim, because it doesn't use lumpy reclaim.
>> I've seen a 128GB zone; that means 1/128 = 1GB scanned. Oh well, kswapd
>> would definitely waste 100% CPU time.
>>
>>
>> > +
>> >  loop_again:
>> >     total_scanned = 0;
>> >     sc.nr_reclaimed = 0;
>> > @@ -2092,7 +2102,7 @@ loop_again:
>> >              * OK, kswapd is getting into trouble.  Take a nap, then take
>> >              * another pass across the zones.
>> >              */
>> > -           if (total_scanned && priority < DEF_PRIORITY - 2)
>> > +           if (total_scanned && priority < congestion_priority)
>> >                     congestion_wait(BLK_RW_ASYNC, HZ/10);
>>
>> Instead, How about this?
>>
>
> This makes a lot of sense. Tests look good and I added stats to make sure
> the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
> times and X86-64 avoided it 5084 times. I think we should hold onto the
> stats temporarily until all these bugs are ironed out.
>
> Would you like to sign off the following?
>
> If you are ok to sign off, this patch should replace my patch 5 in
> the series.

I agree Kosaki's patch is more straightforward.

You can add my Reviewed-by, too.
Thanks for good patch, Kosaki. :)

--
Kind regards,
Minchan Kim

2009-11-13 15:25:27

by Chris Mason

Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13, 2009 at 03:41:46PM +0200, Pekka Enberg wrote:
> Hi Jens,
>
> On Fri, Nov 13, 2009 at 3:32 PM, Jens Axboe <[email protected]> wrote:
> >> Suggest an alternative that brings congestion_wait() more in line with
> >> 2.6.30 behaviour then.
> >
> > I don't have a good explanation as to why the delays have changed,
> > unfortunately. Are we sure that they have between .30 and .31? The
> > dm-crypt case is overly complex and lots of changes could have broken
> > that house of cards.
>
> Hand-waving or not, we have end user reports stating that reverting
> commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0 ("Fix
> congestion_wait() sync/async vs read/write confusion") fixes their
> (rather serious) OOM regression. The commit in question _does_
> introduce a functional change and if this was your average regression,
> people would be kicking and screaming to get it reverted.
>
> So is there a reason we shouldn't send a partial revert of the commit
> (switching to BLK_RW_SYNC) to Linus until the "real" issue gets
> resolved? Yes, I realize it's ugly voodoo magic but dammit, it used to
> work!

If the workload didn't involve dm-crypt everything would make more sense.
With dm-crypt, things look like this:

VM:
Someone submits a write (sync or not) to clean a page
out of pages, I'll wait for the disk

dm-crypt:
Async threads wakeup and start encrypting submitted writes

Async threads finish and hand off to other async threads to submit new writes
or
dm encryption thread submits the write directly (without my patch)

The problem with all of this is that waiting for congestion doesn't
actually have very much to do with waiting for dm-crypt IO. We might
be checking congestion long before dm-cryptd actually gets around to
sending the IO to the disk (so we'll just wait for the full timeout),
or the congestion check could be waiting for entirely unrelated IO.

Because dm-crypt uses the same queue for encrypting and decrypting reads
and writes, if writepage needs to read a block (say a metadata mapping
block), waiting for that block to become unencrypted will have to wait
for the entire queue of both encrypted and decrypted blocks to finish.

[ Unless I completely misread dm-crypt, which is always possible ]

So, cleaning pages can take an arbitrary amount of time, based on how
much CPU time is available, how many other pages are in the worker
threads and how fast the disk is once we finally hand the pages over to
it.

Tuning the VM timeouts based on this seems like a bad idea to me. If
the VM is waiting for pages to leave writeback, we should just add a
sequence counter and wait queue that gets bumped as pages leave
writeback (or something that wouldn't do horrible things to cachelines).

Otherwise we're going to end up waiting way too long for the sync
congestion on configs that don't intermix sync and async in such complex
ways.

-chris

2009-11-13 18:02:14

by KOSAKI Motohiro

Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble


> This makes a lot of sense. Tests look good and I added stats to make sure
> the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
> times and X86-64 avoided it 5084 times. I think we should hold onto the
> stats temporarily until all these bugs are ironed out.
>
> Would you like to sign off the following?
>
> If you are ok to sign off, this patch should replace my patch 5 in
> the series.

I'm sorry, I found my bug.
Please see below.

>
> ==== CUT HERE ====
>
> vmscan: Stop kswapd waiting on congestion when the min watermark is not being met
>
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion. However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
>
> [[email protected]: Add stats to track how relevant the logic is]
> Needs-signed-off-by-original-author
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 9716003..7d66695 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> #endif
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> + KSWAPD_NO_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> #ifdef CONFIG_HUGETLB_PAGE
> HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ffa1766..70967e1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1966,6 +1966,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> * free_pages == high_wmark_pages(zone).
> */
> int temp_priority[MAX_NR_ZONES];
> + int has_under_min_watermark_zone = 0;

This is the wrong place for the declaration. It must change to:

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ int has_under_min_watermark_zone = 0;


because has_under_min_watermark_zone should be reinitialized for every priority level.


> loop_again:
> total_scanned = 0;
> @@ -2085,6 +2086,15 @@ loop_again:
> if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
> sc.may_writepage = 1;
> +
> + /*
> + * We are still under min water mark. it mean we have
> + * GFP_ATOMIC allocation failure risk. Hurry up!
> + */
> + if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
> + end_zone, 0))
> + has_under_min_watermark_zone = 1;
> +
> }
> if (all_zones_ok)
> break; /* kswapd: all done */
> @@ -2092,8 +2102,13 @@ loop_again:
> * OK, kswapd is getting into trouble. Take a nap, then take
> * another pass across the zones.
> */
> - if (total_scanned && priority < DEF_PRIORITY - 2)
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> + if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> +

This blank line is unnecessary.

> + if (!has_under_min_watermark_zone)

Probably "if (has_under_min_watermark_zone)" is correct.


> + count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
> + else
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> + }

Everything else looks pretty good to me. Please feel free to add my s-o-b or Reviewed-by.

Thanks.



2009-11-13 18:02:19

by KOSAKI Motohiro

Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

> On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > watermark was reached. If there are a constant stream of allocations from
> > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > the high watermark is not being maintained for a sufficient length of time.
> > >
> > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > high watermark is no longer met, it's considered a premature sleep and
> > > kswapd continues work. Otherwise it goes fully to sleep.
> > >
> > > This adds more counters to distinguish between fast and slow breaches of
> > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > sleep indicates that the high watermark was breached after a very short
> > > interval.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> >
> > Why do you submit this patch to mainline? This is a debugging patch,
> > no more and no less.
> >
>
> Do you mean the stats part? The stats are included until such time as the page
> allocator failure reports stop or are significantly reduced. In the event a
> report is received, the value of the counters help determine if kswapd was
> struggling or not. They should be removed once this mess is ironed out.
>
> If there is a preference, I can split out the stats part and send it to
> people with page allocator failure reports for retesting.

I'm sorry my last mail didn't have enough explanation.
I agreed that these stats help to solve this issue, but after the issue is solved,
I can't imagine how an administrator would use them. If KSWAPD_PREMATURE_FAST or
KSWAPD_PREMATURE_SLOW increases significantly, what should the admin do?
Or can LKML folk give any advice to the admin?

If the kernel doesn't have any bug, the kswapd wakeup rate is not very useful information, IMHO.
The additional code itself looks good to me, but...


> ==== CUT HERE ====
> vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>
> This patch is a fix and a clarification to the patch "vmscan: Have
> kswapd sleep for a short interval and double check it should be asleep".
> The fix is for kswapd to only check zones in the node it is responsible
> for. The clarification is to rename two counters to better explain what is
> being counted.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/vmstat.h | 2 +-
> mm/vmscan.c | 20 +++++++++++++-------
> mm/vmstat.c | 4 ++--
> 3 files changed, 16 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 7d66695..0591a48 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> PGSCAN_ZONE_RECLAIM_FAILED,
> #endif
> PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> - KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> + KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> KSWAPD_NO_CONGESTION_WAIT,
> PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> #ifdef CONFIG_HUGETLB_PAGE
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 70967e1..5557555 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> #endif
>
> /* is kswapd sleeping prematurely? */
> -static int sleeping_prematurely(int order, long remaining)
> +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> {
> - struct zone *zone;
> + int i;
>
> /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> if (remaining)
> return 1;
>
> /* If after HZ/10, a zone is below the high mark, it's premature */
> - for_each_populated_zone(zone)
> + for (i = 0; i < pgdat->nr_zones; i++) {
> + struct zone *zone = pgdat->node_zones + i;
> +
> + if (!populated_zone(zone))
> + continue;
> +
> if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> 0, 0))
> return 1;
> + }
>
> return 0;
> }
> @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
> long remaining = 0;
>
> /* Try to sleep for a short interval */
> - if (!sleeping_prematurely(order, remaining)) {
> + if (!sleeping_prematurely(pgdat, order, remaining)) {
> remaining = schedule_timeout(HZ/10);
> finish_wait(&pgdat->kswapd_wait, &wait);
> prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
> * premature sleep. If not, then go fully
> * to sleep until explicitly woken up
> */
> - if (!sleeping_prematurely(order, remaining))
> + if (!sleeping_prematurely(pgdat, order, remaining))
> schedule();
> else {
> if (remaining)
> - count_vm_event(KSWAPD_PREMATURE_FAST);
> + count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> else
> - count_vm_event(KSWAPD_PREMATURE_SLOW);
> + count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> }
> }
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index bc09547..6cc8dc6 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
> "slabs_scanned",
> "kswapd_steal",
> "kswapd_inodesteal",
> - "kswapd_slept_prematurely_fast",
> - "kswapd_slept_prematurely_slow",
> + "kswapd_low_wmark_hit_quickly",
> + "kswapd_high_wmark_hit_quickly",
> "kswapd_no_congestion_wait",
> "pageoutrun",
> "allocstall",


2009-11-13 18:16:00

by Mel Gorman

Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion. However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

This patch replaces
vmscan-take-order-into-consideration-when-deciding-if-kswapd-is-in-trouble.patch .

[[email protected]: Add stats to track how relevant the logic is]
From: KOSAKI Motohiro <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vmstat.h | 1 +
mm/vmscan.c | 18 ++++++++++++++++--
mm/vmstat.c | 1 +
3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+ KSWAPD_NO_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70a2322 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1979,6 +1979,7 @@ loop_again:
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ int has_under_min_watermark_zone = 0;

/* The swap token gets in the way of swapout... */
if (!priority)
@@ -2085,6 +2086,15 @@ loop_again:
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;
+
+ /*
+ * We are still under min water mark. it mean we have
+ * GFP_ATOMIC allocation failure risk. Hurry up!
+ */
+ if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ end_zone, 0))
+ has_under_min_watermark_zone = 1;
+
}
if (all_zones_ok)
break; /* kswapd: all done */
@@ -2092,8 +2102,12 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+ if (!has_under_min_watermark_zone)
+ count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
+ else
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }

/*
* We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
"kswapd_inodesteal",
"kswapd_slept_prematurely_fast",
"kswapd_slept_prematurely_slow",
+ "kswapd_no_congestion_wait",
"pageoutrun",
"allocstall",

2009-11-13 18:17:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > watermark was reached. If there are a constant stream of allocations from
> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > the high watermark is not being maintained for sufficient length time.
> > > >
> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > >
> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > sleep indicates that the high watermark was breached after a very short
> > > > interval.
> > > >
> > > > Signed-off-by: Mel Gorman <[email protected]>
> > >
> > > Why do you submit this patch to mainline? this is debugging patch
> > > no more and no less.
> > >
> >
> > Do you mean the stats part? The stats are included until such time as the page
> > allocator failure reports stop or are significantly reduced. In the event a
> > report is received, the value of the counters help determine if kswapd was
> > struggling or not. They should be removed once this mess is ironed out.
> >
> > If there is a preference, I can split out the stats part and send it to
> > people with page allocator failure reports for retesting.
>
> I'm sorry my last mail didn't have enough explanation.
> This stats help to solve this issue. I agreed. but after solving this issue,
> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?

One possible workaround would be to raise min_free_kbytes while a fix is
being worked on.

> Or, Can LKML folk make any advise to admin?
>

Work with them to fix the bug :/

> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
> following your additional code itself looks good to me. but...
>
>
> > ==== CUT HERE ====
> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> >
> > This patch is a fix and a clarification to the patch "vmscan: Have
> > kswapd sleep for a short interval and double check it should be asleep".
> > The fix is for kswapd to only check zones in the node it is responsible
> > for. The clarification is to rename two counters to better explain what is
> > being counted.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/vmstat.h | 2 +-
> > mm/vmscan.c | 20 +++++++++++++-------
> > mm/vmstat.c | 4 ++--
> > 3 files changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> > index 7d66695..0591a48 100644
> > --- a/include/linux/vmstat.h
> > +++ b/include/linux/vmstat.h
> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> > PGSCAN_ZONE_RECLAIM_FAILED,
> > #endif
> > PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> > - KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> > + KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> > KSWAPD_NO_CONGESTION_WAIT,
> > PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> > #ifdef CONFIG_HUGETLB_PAGE
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 70967e1..5557555 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> > #endif
> >
> > /* is kswapd sleeping prematurely? */
> > -static int sleeping_prematurely(int order, long remaining)
> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> > {
> > - struct zone *zone;
> > + int i;
> >
> > /* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> > if (remaining)
> > return 1;
> >
> > /* If after HZ/10, a zone is below the high mark, it's premature */
> > - for_each_populated_zone(zone)
> > + for (i = 0; i < pgdat->nr_zones; i++) {
> > + struct zone *zone = pgdat->node_zones + i;
> > +
> > + if (!populated_zone(zone))
> > + continue;
> > +
> > if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> > 0, 0))
> > return 1;
> > + }
> >
> > return 0;
> > }
> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
> > long remaining = 0;
> >
> > /* Try to sleep for a short interval */
> > - if (!sleeping_prematurely(order, remaining)) {
> > + if (!sleeping_prematurely(pgdat, order, remaining)) {
> > remaining = schedule_timeout(HZ/10);
> > finish_wait(&pgdat->kswapd_wait, &wait);
> > prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
> > * premature sleep. If not, then go fully
> > * to sleep until explicitly woken up
> > */
> > - if (!sleeping_prematurely(order, remaining))
> > + if (!sleeping_prematurely(pgdat, order, remaining))
> > schedule();
> > else {
> > if (remaining)
> > - count_vm_event(KSWAPD_PREMATURE_FAST);
> > + count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> > else
> > - count_vm_event(KSWAPD_PREMATURE_SLOW);
> > + count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> > }
> > }
> >
> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> > index bc09547..6cc8dc6 100644
> > --- a/mm/vmstat.c
> > +++ b/mm/vmstat.c
> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
> > "slabs_scanned",
> > "kswapd_steal",
> > "kswapd_inodesteal",
> > - "kswapd_slept_prematurely_fast",
> > - "kswapd_slept_prematurely_slow",
> > + "kswapd_low_wmark_hit_quickly",
> > + "kswapd_high_wmark_hit_quickly",
> > "kswapd_no_congestion_wait",
> > "pageoutrun",
> > "allocstall",
>
>
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-13 18:26:36

by Frans Pop

Subject: Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

On Friday 13 November 2009, Mel Gorman wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion.
> However, if the zone is below the min watermark then kswapd needs to
> continue working without delay as there is a danger of an increased rate
> of GFP_ATOMIC allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
>
> This patch replaces
> vmscan-take-order-into-consideration-when-deciding-if-kswapd-is-in-trouble.patch .
>
> [[email protected]: Add stats to track how relevant the logic is]
> From: KOSAKI Motohiro <[email protected]>

For this to work with git-am, the From: line has to be _above_ the patch
description (it must even be the first line of the mail), AFAIK.

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Cheers,
FJP

2009-11-13 18:33:18

by KOSAKI Motohiro

Subject: Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

> @@ -2092,8 +2102,12 @@ loop_again:
> 		 * OK, kswapd is getting into trouble.  Take a nap, then take
> 		 * another pass across the zones.
> 		 */
> -		if (total_scanned && priority < DEF_PRIORITY - 2)
> -			congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		if (total_scanned && (priority < DEF_PRIORITY - 2)) {
> +			if (!has_under_min_watermark_zone)

If I am correct, we must remove the "!".


> +				count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
> +			else
> +				congestion_wait(BLK_RW_ASYNC, HZ/10);
> +		}

2009-11-13 18:38:29

by Rik van Riel

Subject: Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

On 11/13/2009 01:15 PM, Mel Gorman wrote:
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion. However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.
>
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 90b11e4..bc09547 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
> "kswapd_inodesteal",
> "kswapd_slept_prematurely_fast",
> "kswapd_slept_prematurely_slow",
> + "kswapd_no_congestion_wait",
>
>
Perhaps better named "kswapd_skip_congestion_wait"?

Other than that, the patch looks good to me.

Reviewed-by: Rik van Riel <[email protected]>

2009-11-13 20:04:00

by Mel Gorman

Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2

The last version has a stupid bug in it. Sorry.

Changelog since V1
o Fix incorrect negation
o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
suggested by Rik

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion. However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

[[email protected]: Add stats to track how relevant the logic is]
From: KOSAKI Motohiro <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
---
include/linux/vmstat.h | 1 +
mm/vmscan.c | 18 ++++++++++++++++--
mm/vmstat.c | 1 +
3 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+ KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70a2322 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1979,6 +1979,7 @@ loop_again:
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
+ int has_under_min_watermark_zone = 0;

/* The swap token gets in the way of swapout... */
if (!priority)
@@ -2085,6 +2086,15 @@ loop_again:
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;
+
+ /*
+ * We are still under min water mark. it mean we have
+ * GFP_ATOMIC allocation failure risk. Hurry up!
+ */
+ if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ end_zone, 0))
+ has_under_min_watermark_zone = 1;
+
}
if (all_zones_ok)
break; /* kswapd: all done */
@@ -2092,8 +2102,12 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+ if (has_under_min_watermark_zone)
+ count_vm_event(KSWAPD_SKIP_CONGESTION_WAIT);
+ else
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }

/*
* We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
"kswapd_inodesteal",
"kswapd_slept_prematurely_fast",
"kswapd_slept_prematurely_slow",
+ "kswapd_skip_congestion_wait",
"pageoutrun",
"allocstall",

2009-11-14 09:34:22

by KOSAKI Motohiro

Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

2009/11/14 Mel Gorman <[email protected]>:
> On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
>> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
>> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
>> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
>> > > > watermark was reached. If there are a constant stream of allocations from
>> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
>> > > > the high watermark is not being maintained for sufficient length time.
>> > > >
>> > > > This patch makes kswapd go to sleep as a two-stage process. It first
>> > > > tries to sleep for HZ/10. If it is woken up by another process or the
>> > > > high watermark is no longer met, it's considered a premature sleep and
>> > > > kswapd continues work. Otherwise it goes fully to sleep.
>> > > >
>> > > > This adds more counters to distinguish between fast and slow breaches of
>> > > > watermarks. A "fast" premature sleep is one where the low watermark was
>> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
>> > > > sleep indicates that the high watermark was breached after a very short
>> > > > interval.
>> > > >
>> > > > Signed-off-by: Mel Gorman <[email protected]>
>> > >
>> > > Why do you submit this patch to mainline? this is debugging patch
>> > > no more and no less.
>> > >
>> >
>> > Do you mean the stats part? The stats are included until such time as the page
>> > allocator failure reports stop or are significantly reduced. In the event a
>> > report is received, the value of the counters help determine if kswapd was
>> > struggling or not. They should be removed once this mess is ironed out.
>> >
>> > If there is a preference, I can split out the stats part and send it to
>> > people with page allocator failure reports for retesting.
>>
>> I'm sorry my last mail didn't have enough explanation.
>> This stats help to solve this issue. I agreed. but after solving this issue,
>> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
>> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
>
> One possible workaround would be to raise min_free_kbytes while a fix is
> being worked on.

Please correct me if I'm saying something wrong.

If I were an admin, I wouldn't watch these stats, because frequent kswapd
wakeups don't necessarily mean trouble. Instead, I would watch the number
of allocation failures.

[see include/linux/vmstat.h]

umm...
Why don't we have an nr-allocation-failure vmstat? I don't remember one.
A GFP_NOWARN failure doesn't produce syslog error logging, but frequent
GFP_NOWARN failures imply that the system is under memory pressure or
heavy fragmentation. That would be a good opportunity to change min_free_kbytes.



>> Or, Can LKML folk make any advise to admin?
>>
>
> Work with them to fix the bug :/
>
>> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
>> following your additional code itself looks good to me. but...
>>
>>
>> > ==== CUT HERE ====
>> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>> >
>> > This patch is a fix and a clarification to the patch "vmscan: Have
>> > kswapd sleep for a short interval and double check it should be asleep".
>> > The fix is for kswapd to only check zones in the node it is responsible
>> > for. The clarification is to rename two counters to better explain what is
>> > being counted.
>> >
>> > Signed-off-by: Mel Gorman <[email protected]>
>> > ---
>> >  include/linux/vmstat.h |    2 +-
>> >  mm/vmscan.c            |   20 +++++++++++++-------
>> >  mm/vmstat.c            |    4 ++--
>> >  3 files changed, 16 insertions(+), 10 deletions(-)
>> >
>> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
>> > index 7d66695..0591a48 100644
>> > --- a/include/linux/vmstat.h
>> > +++ b/include/linux/vmstat.h
>> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>> >  		PGSCAN_ZONE_RECLAIM_FAILED,
>> >  #endif
>> >  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
>> > -		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
>> > +		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
>> >  		KSWAPD_NO_CONGESTION_WAIT,
>> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
>> >  #ifdef CONFIG_HUGETLB_PAGE
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 70967e1..5557555 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
>> >  #endif
>> >
>> >  /* is kswapd sleeping prematurely? */
>> > -static int sleeping_prematurely(int order, long remaining)
>> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
>> >  {
>> > -	struct zone *zone;
>> > +	int i;
>> >
>> >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
>> >  	if (remaining)
>> >  		return 1;
>> >
>> >  	/* If after HZ/10, a zone is below the high mark, it's premature */
>> > -	for_each_populated_zone(zone)
>> > +	for (i = 0; i < pgdat->nr_zones; i++) {
>> > +		struct zone *zone = pgdat->node_zones + i;
>> > +
>> > +		if (!populated_zone(zone))
>> > +			continue;
>> > +
>> >  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
>> >  								0, 0))
>> >  			return 1;
>> > +	}
>> >
>> >  	return 0;
>> >  }
>> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
>> >  			long remaining = 0;
>> >
>> >  			/* Try to sleep for a short interval */
>> > -			if (!sleeping_prematurely(order, remaining)) {
>> > +			if (!sleeping_prematurely(pgdat, order, remaining)) {
>> >  				remaining = schedule_timeout(HZ/10);
>> >  				finish_wait(&pgdat->kswapd_wait, &wait);
>> >  				prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
>> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
>> >  			 * premature sleep. If not, then go fully
>> >  			 * to sleep until explicitly woken up
>> >  			 */
>> > -			if (!sleeping_prematurely(order, remaining))
>> > +			if (!sleeping_prematurely(pgdat, order, remaining))
>> >  				schedule();
>> >  			else {
>> >  				if (remaining)
>> > -					count_vm_event(KSWAPD_PREMATURE_FAST);
>> > +					count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
>> >  				else
>> > -					count_vm_event(KSWAPD_PREMATURE_SLOW);
>> > +					count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
>> >  			}
>> >  		}
>> >
>> >
>> > diff --git a/mm/vmstat.c b/mm/vmstat.c
>> > index bc09547..6cc8dc6 100644
>> > --- a/mm/vmstat.c
>> > +++ b/mm/vmstat.c
>> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
>> >  	"slabs_scanned",
>> >  	"kswapd_steal",
>> >  	"kswapd_inodesteal",
>> > -	"kswapd_slept_prematurely_fast",
>> > -	"kswapd_slept_prematurely_slow",
>> > +	"kswapd_low_wmark_hit_quickly",
>> > +	"kswapd_high_wmark_hit_quickly",
>> >  	"kswapd_no_congestion_wait",
>> >  	"pageoutrun",
>> >  	"allocstall",
>>
>>
>>
>
> --
> Mel Gorman
> Part-time Phd Student                         Linux Technology Center
> University of Limerick                        IBM Dublin Software Lab
>

2009-11-14 15:46:36

by Mel Gorman

Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> 2009/11/14 Mel Gorman <[email protected]>:
> > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> >> > > > watermark was reached. If there are a constant stream of allocations from
> >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> >> > > > the high watermark is not being maintained for sufficient length time.
> >> > > >
> >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> >> > > > high watermark is no longer met, it's considered a premature sleep and
> >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> >> > > >
> >> > > > This adds more counters to distinguish between fast and slow breaches of
> >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> >> > > > sleep indicates that the high watermark was breached after a very short
> >> > > > interval.
> >> > > >
> >> > > > Signed-off-by: Mel Gorman <[email protected]>
> >> > >
> >> > > Why do you submit this patch to mainline? this is debugging patch
> >> > > no more and no less.
> >> > >
> >> >
> >> > Do you mean the stats part? The stats are included until such time as the page
> >> > allocator failure reports stop or are significantly reduced. In the event a
> >> > report is received, the value of the counters help determine if kswapd was
> >> > struggling or not. They should be removed once this mess is ironed out.
> >> >
> >> > If there is a preference, I can split out the stats part and send it to
> >> > people with page allocator failure reports for retesting.
> >>
> >> I'm sorry my last mail didn't have enough explanation.
> >> This stats help to solve this issue. I agreed. but after solving this issue,
> >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> >
> > One possible workaround would be to raise min_free_kbytes while a fix is
> > being worked on.
>
> Please correct me, if I said wrong thing.
>

You didn't.

> if I was admin, I don't watch this stats because kswapd frequently
> wakeup doesn't mean any trouble. instead I watch number of allocation
> failure.
>

The stats are not tracking when kswapd wakes up. It helps track how
quickly the high or low watermarks are going under once kswapd tries to
go back to sleep.

> [see include/linux/vmstat.h]
>
> umm...
> Why don't we have nr-allocation-failure vmstat?

There isn't a stat for allocation failures, but there are tracepoints that
can be used to find out that information if necessary. For the cases I'm
looking at the moment though, GFP_NOWARN is not a factor so just looking
through the kernel log has been sufficient.

> I don't remember it.
> GFP_NOWARN failure doesn't make syslog error logging. but frequently
> GFP_NOWARN failure imply the system is under memory pressure or under
> heavy fragmentation. It is good opportunity to change min_free_kbytes.
>
>
>
> >> Or, Can LKML folk make any advise to admin?
> >>
> >
> > Work with them to fix the bug :/
> >
> >> if kernel doesn't have any bug, kswapd wakeup rate is not so worth information imho.
> >> following your additional code itself looks good to me. but...
> >>
> >>
> >> > ==== CUT HERE ====
> >> > vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> >> >
> >> > This patch is a fix and a clarification to the patch "vmscan: Have
> >> > kswapd sleep for a short interval and double check it should be asleep".
> >> > The fix is for kswapd to only check zones in the node it is responsible
> >> > for. The clarification is to rename two counters to better explain what is
> >> > being counted.
> >> >
> >> > Signed-off-by: Mel Gorman <[email protected]>
> >> > ---
> >> >  include/linux/vmstat.h |    2 +-
> >> >  mm/vmscan.c            |   20 +++++++++++++-------
> >> >  mm/vmstat.c            |    4 ++--
> >> >  3 files changed, 16 insertions(+), 10 deletions(-)
> >> >
> >> > diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> >> > index 7d66695..0591a48 100644
> >> > --- a/include/linux/vmstat.h
> >> > +++ b/include/linux/vmstat.h
> >> > @@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> >> >  		PGSCAN_ZONE_RECLAIM_FAILED,
> >> >  #endif
> >> >  		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
> >> > -		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
> >> > +		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
> >> >  		KSWAPD_NO_CONGESTION_WAIT,
> >> >  		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
> >> >  #ifdef CONFIG_HUGETLB_PAGE
> >> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> > index 70967e1..5557555 100644
> >> > --- a/mm/vmscan.c
> >> > +++ b/mm/vmscan.c
> >> > @@ -1905,19 +1905,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
> >> >  #endif
> >> >
> >> >  /* is kswapd sleeping prematurely? */
> >> > -static int sleeping_prematurely(int order, long remaining)
> >> > +static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
> >> >  {
> >> > -	struct zone *zone;
> >> > +	int i;
> >> >
> >> >  	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
> >> >  	if (remaining)
> >> >  		return 1;
> >> >
> >> >  	/* If after HZ/10, a zone is below the high mark, it's premature */
> >> > -	for_each_populated_zone(zone)
> >> > +	for (i = 0; i < pgdat->nr_zones; i++) {
> >> > +		struct zone *zone = pgdat->node_zones + i;
> >> > +
> >> > +		if (!populated_zone(zone))
> >> > +			continue;
> >> > +
> >> >  		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
> >> >  								0, 0))
> >> >  			return 1;
> >> > +	}
> >> >
> >> >  	return 0;
> >> >  }
> >> > @@ -2221,7 +2227,7 @@ static int kswapd(void *p)
> >> >  			long remaining = 0;
> >> >
> >> >  			/* Try to sleep for a short interval */
> >> > -			if (!sleeping_prematurely(order, remaining)) {
> >> > +			if (!sleeping_prematurely(pgdat, order, remaining)) {
> >> >  				remaining = schedule_timeout(HZ/10);
> >> >  				finish_wait(&pgdat->kswapd_wait, &wait);
> >> >  				prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
> >> > @@ -2232,13 +2238,13 @@ static int kswapd(void *p)
> >> >  			 * premature sleep. If not, then go fully
> >> >  			 * to sleep until explicitly woken up
> >> >  			 */
> >> > -			if (!sleeping_prematurely(order, remaining))
> >> > +			if (!sleeping_prematurely(pgdat, order, remaining))
> >> >  				schedule();
> >> >  			else {
> >> >  				if (remaining)
> >> > -					count_vm_event(KSWAPD_PREMATURE_FAST);
> >> > +					count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
> >> >  				else
> >> > -					count_vm_event(KSWAPD_PREMATURE_SLOW);
> >> > +					count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
> >> >  			}
> >> >  		}
> >> >
> >> > diff --git a/mm/vmstat.c b/mm/vmstat.c
> >> > index bc09547..6cc8dc6 100644
> >> > --- a/mm/vmstat.c
> >> > +++ b/mm/vmstat.c
> >> > @@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
> >> >  	"slabs_scanned",
> >> >  	"kswapd_steal",
> >> >  	"kswapd_inodesteal",
> >> > -	"kswapd_slept_prematurely_fast",
> >> > -	"kswapd_slept_prematurely_slow",
> >> > +	"kswapd_low_wmark_hit_quickly",
> >> > +	"kswapd_high_wmark_hit_quickly",
> >> >  	"kswapd_no_congestion_wait",
> >> >  	"pageoutrun",
> >> >  	"allocstall",
> >>
> >>
> >>
> >

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-15 12:07:24

by Karol Lewandowski

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Thu, Nov 12, 2009 at 07:30:30PM +0000, Mel Gorman wrote:

> [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100

> Patches 1-3 should be tested first. The testing I've done shows that the
> page allocator and behaviour of congestion_wait() is more in line with
> 2.6.30 than the vanilla kernels.
>
> It'd be nice to have 2 more tests, applying each patch on top noting any
> behaviour change. i.e. ideally there would be results for
>
> o patches 1+2+3
> o patches 1+2+3+4
> o patches 1+2+3+4+5
>
> Of course, any tests results are welcome. The rest of the mail is the
> results of my own tests.

I've tried testing 3+4+5 against 2.6.32-rc7 (1+2 seem to be in
mainline) and got a failure. I've noticed something strange, I think:
I was unable to trigger failures while the system was under heavy memory
pressure (i.e. my testing - gitk, firefoxes, etc.). When I killed
almost all the memory hogs, put the system to sleep and resumed -- it
failed. free(1) showed:

             total       used       free     shared    buffers     cached
Mem:        255240     194052      61188          0       4040      49364
-/+ buffers/cache:      140648     114592
Swap:       514040      72712     441328


Is that ok? Wild guess -- maybe kswapd doesn't take fragmentation (or
other factors) into account as hard as it used to in 2.6.30?

Thanks.

2009-11-16 09:52:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> On Thu, Nov 12, 2009 at 07:30:30PM +0000, Mel Gorman wrote:
>
> > [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
>
> > Patches 1-3 should be tested first. The testing I've done shows that the
> > page allocator and behaviour of congestion_wait() is more in line with
> > 2.6.30 than the vanilla kernels.
> >
> > It'd be nice to have 2 more tests, applying each patch on top noting any
> > behaviour change. i.e. ideally there would be results for
> >
> > o patches 1+2+3
> > o patches 1+2+3+4
> > o patches 1+2+3+4+5
> >
> > Of course, any tests results are welcome. The rest of the mail is the
> > results of my own tests.
>
> I've tried testing 3+4+5 against 2.6.32-rc7 (1+2 seem to be in
> mainline) and got a failure. I've noticed something strange, I think:
> I was unable to trigger failures while the system was under heavy memory
> pressure (i.e. my testing - gitk, firefoxes, etc.). When I killed
> almost all the memory hogs, put the system to sleep and resumed -- it
> failed. free(1) showed:
>
> total used free shared buffers cached
> Mem: 255240 194052 61188 0 4040 49364
> -/+ buffers/cache: 140648 114592
> Swap: 514040 72712 441328
>
>
> Is that ok? Wild guess -- maybe kswapd doesn't take fragmentation (or
> other factors) into account as hard as it used to in 2.6.30?
>

That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
failed. What was the dmesg for it please?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-16 12:08:47

by Karol Lewandowski

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Mon, Nov 16, 2009 at 09:52:58AM +0000, Mel Gorman wrote:
> On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> > total used free shared buffers cached
> > Mem: 255240 194052 61188 0 4040 49364
> > -/+ buffers/cache: 140648 114592
> > Swap: 514040 72712 441328
> >
> >
> > Is that ok? Wild guess -- maybe kswapd doesn't take fragmentation (or
> > other factors) into account as hard as it used to in 2.6.30?

> That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
> failed. What was the dmesg for it please?

Sure, it's attached below.

Hmm, "Normal free"/"DMA free" are much lower than the 61MB shown above.
The free(1) output was collected right after the resume finished successfully.

Thanks.


e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
e100 0000:00:03.0: PME# disabled
e100: eth0: e100_probe: addr 0xe8120000, irq 9, MAC addr 00:10:a4:89:e8:84
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 7339, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0373362>] ? do_page_fault+0x33b/0x351
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 89
active_anon:15632 inactive_anon:17534 isolated_anon:0
active_file:12373 inactive_file:12884 isolated_file:0
unevictable:0 dirty:15 writeback:62 unstable:0
free:2072 slab_reclaimable:849 slab_unreclaimable:1080
mapped:7343 shmem:233 pagetables:505 bounce:0
DMA free:1440kB min:120kB low:148kB high:180kB active_anon:224kB inactive_anon:1516kB active_file:3440kB inactive_file:5140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:404kB shmem:44kB slab_reclaimable:16kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:6848kB min:1912kB low:2388kB high:2868kB active_anon:62304kB inactive_anon:68620kB active_file:46052kB inactive_file:46396kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:60kB writeback:248kB mapped:28968kB shmem:888kB slab_reclaimable:3380kB slab_unreclaimable:4304kB kernel_stack:464kB pagetables:2020kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:128 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 4*4kB 8*8kB 33*16kB 22*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1440kB
Normal: 486*4kB 271*8kB 139*16kB 12*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 6848kB
29040 total pagecache pages
3548 pages in swap cache
Swap cache stats: add 72385, delete 68837, find 70772/74308
Free swap = 483200kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
20970 pages shared
50976 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 7339, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0189370>] ? vfs_write+0xf4/0x105
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 82
active_anon:15563 inactive_anon:17593 isolated_anon:0
active_file:12294 inactive_file:12787 isolated_file:0
unevictable:0 dirty:18 writeback:155 unstable:0
free:2267 slab_reclaimable:847 slab_unreclaimable:1080
mapped:7271 shmem:229 pagetables:505 bounce:0
DMA free:1440kB min:120kB low:148kB high:180kB active_anon:224kB inactive_anon:1516kB active_file:3440kB inactive_file:5140kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:404kB shmem:44kB slab_reclaimable:16kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:7628kB min:1912kB low:2388kB high:2868kB active_anon:62028kB inactive_anon:68856kB active_file:45736kB inactive_file:46008kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:72kB writeback:620kB mapped:28680kB shmem:872kB slab_reclaimable:3372kB slab_unreclaimable:4304kB kernel_stack:464kB pagetables:2020kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 4*4kB 8*8kB 33*16kB 22*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1440kB
Normal: 559*4kB 290*8kB 152*16kB 16*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7628kB
29016 total pagecache pages
3702 pages in swap cache
Swap cache stats: add 72550, delete 68848, find 70774/74310
Free swap = 482540kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
20826 pages shared
50895 pages non-shared
e100 0000:00:03.0: firmware: requesting e100/d101s_ucode.bin
e100: eth0 NIC Link is Up 100 Mbps Full Duplex
crashreporter[9491]: segfault at b67701a0 ip b67701a0 sp bfa0fc2c error 4 in libnss_files-2.7.so[b67b6000+a000]
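An aside on reading these reports: an order-5 request needs a single physically contiguous block of 2^5 pages = 128kB, so the per-order free lists matter more than the zone totals. A small Python sketch (the parser and helper names are mine; the sample line is copied from the DMA zone in the first report above) shows how a zone can report plenty of free memory while holding no block large enough:

```python
import re

def parse_buddy_line(line):
    """Parse 'DMA: 4*4kB 8*8kB ... = 1440kB' into [(count, size_kb), ...]."""
    # Each free-list entry is 'count*sizekB'; the trailing total has no '*'
    # and is therefore skipped by the pattern.
    return [(int(c), int(s)) for c, s in re.findall(r"(\d+)\*(\d+)kB", line)]

# Copied verbatim from the first failure report in this thread.
dma = parse_buddy_line(
    "DMA: 4*4kB 8*8kB 33*16kB 22*32kB 2*64kB 0*128kB 0*256kB "
    "0*512kB 0*1024kB 0*2048kB 0*4096kB = 1440kB")

total_kb = sum(count * size for count, size in dma)
# An order-5 allocation can only be satisfied by a free block of >= 128kB.
order5_blocks = sum(count for count, size in dma if size >= 128)
print(total_kb, order5_blocks)  # -> 1440 0
```

So 1440kB of the DMA zone is free, yet there is not one 128kB block among it; the memory is there but fragmented into small pieces, which is exactly the situation an order-5 GFP_ATOMIC allocation cannot recover from.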

2009-11-16 14:32:45

by Karol Lewandowski

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Mon, Nov 16, 2009 at 01:08:45PM +0100, Karol Lewandowski wrote:
> On Mon, Nov 16, 2009 at 09:52:58AM +0000, Mel Gorman wrote:
> > On Sun, Nov 15, 2009 at 01:07:21PM +0100, Karol Lewandowski wrote:
> > > total used free shared buffers cached
> > > Mem: 255240 194052 61188 0 4040 49364
> > > -/+ buffers/cache: 140648 114592
> > > Swap: 514040 72712 441328
> > >
> > >
> > > Is that ok? Wild guess -- maybe kswapd doesn't take fragmentation (or
> > > other factors) into account as hard as it used to in 2.6.30?
>
> > That's a lot of memory free. I take it the order-5 GFP_ATOMIC allocation
> > failed. What was the dmesg for it please?
>
> Sure, it's attached below.
>
> Hmm, "Normal free"/"DMA free" are much lower than 61MB as shown above.
> free(1) output was collected right after resume successfully finished.

Replying to myself, but well, it happened again:

$ free
             total       used       free     shared    buffers     cached
Mem:        255240     189400      65840          0       3968      58916
-/+ buffers/cache:      126516     128724
Swap:       514040      98460     415580


$ dmesg

e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
e100 0000:00:03.0: PME# disabled
e100: eth0: e100_probe: addr 0xe8120000, irq 9, MAC addr 00:10:a4:89:e8:84
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10402, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0373362>] ? do_page_fault+0x33b/0x351
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 81
active_anon:10734 inactive_anon:19229 isolated_anon:0
active_file:14360 inactive_file:14515 isolated_file:0
unevictable:0 dirty:7 writeback:53 unstable:0
free:1101 slab_reclaimable:1334 slab_unreclaimable:1143
mapped:2686 shmem:44 pagetables:522 bounce:0
DMA free:1496kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3964kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:216kB slab_unreclaimable:36kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:2908kB min:1912kB low:2388kB high:2868kB active_anon:42212kB inactive_anon:75396kB active_file:53688kB inactive_file:54096kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:28kB writeback:212kB mapped:10420kB shmem:176kB slab_reclaimable:5120kB slab_unreclaimable:4536kB kernel_stack:444kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 26*4kB 48*8kB 53*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1496kB
Normal: 275*4kB 58*8kB 50*16kB 13*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2908kB
35091 total pagecache pages
6168 pages in swap cache
Swap cache stats: add 367741, delete 361573, find 144185/185303
Free swap = 438964kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
13452 pages shared
53168 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10402, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0189370>] ? vfs_write+0xf4/0x105
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 93
active_anon:10715 inactive_anon:19236 isolated_anon:0
active_file:14312 inactive_file:14473 isolated_file:0
unevictable:0 dirty:7 writeback:52 unstable:0
free:1191 slab_reclaimable:1334 slab_unreclaimable:1144
mapped:2686 shmem:44 pagetables:522 bounce:0
DMA free:1496kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3964kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:216kB slab_unreclaimable:36kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:3268kB min:1912kB low:2388kB high:2868kB active_anon:42136kB inactive_anon:75424kB active_file:53496kB inactive_file:53928kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:28kB writeback:208kB mapped:10420kB shmem:176kB slab_reclaimable:5120kB slab_unreclaimable:4540kB kernel_stack:444kB pagetables:2072kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 26*4kB 48*8kB 53*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1496kB
Normal: 291*4kB 73*8kB 63*16kB 12*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3268kB
35039 total pagecache pages
6209 pages in swap cache
Swap cache stats: add 367794, delete 361585, find 144185/185304
Free swap = 438756kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
13445 pages shared
53086 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10440, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0373362>] ? do_page_fault+0x33b/0x351
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 27
active_anon:9587 inactive_anon:19792 isolated_anon:0
active_file:12298 inactive_file:12256 isolated_file:0
unevictable:0 dirty:13 writeback:442 unstable:0
free:6098 slab_reclaimable:1287 slab_unreclaimable:1142
mapped:2436 shmem:37 pagetables:519 bounce:0
DMA free:1948kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3524kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:208kB slab_unreclaimable:32kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:22444kB min:1912kB low:2388kB high:2868kB active_anon:37624kB inactive_anon:77648kB active_file:45440kB inactive_file:45500kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:52kB writeback:1768kB mapped:9420kB shmem:148kB slab_reclaimable:4940kB slab_unreclaimable:4536kB kernel_stack:444kB pagetables:2060kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 115*4kB 54*8kB 56*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1948kB
Normal: 2005*4kB 829*8kB 379*16kB 50*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22444kB
32732 total pagecache pages
8128 pages in swap cache
Swap cache stats: add 370349, delete 362221, find 144309/185439
Free swap = 428812kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
10771 pages shared
50488 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 10440, comm: ifconfig Not tainted 2.6.32-rc7-dirty #3
Call Trace:
[<c0168704>] ? __alloc_pages_nodemask+0x45d/0x4eb
[<c01050b3>] ? dma_generic_alloc_coherent+0x4a/0xa7
[<c0105069>] ? dma_generic_alloc_coherent+0x0/0xa7
[<d0935ba0>] ? e100_alloc_cbs+0xc2/0x16f [e100]
[<d0936d23>] ? e100_up+0x1b/0xf5 [e100]
[<d0936e14>] ? e100_open+0x17/0x41 [e100]
[<c03245cc>] ? dev_open+0x8f/0xc5
[<c0323d54>] ? dev_change_flags+0xa2/0x155
[<c03576b0>] ? devinet_ioctl+0x22a/0x51e
[<c0316ebc>] ? sock_ioctl+0x1b6/0x1da
[<c0316d06>] ? sock_ioctl+0x0/0x1da
[<c0193f9b>] ? vfs_ioctl+0x1c/0x5f
[<c0194565>] ? do_vfs_ioctl+0x4bf/0x4fb
[<c0101683>] ? __switch_to+0xf/0x14c
[<c011d24b>] ? finish_task_switch+0x2e/0x7d
[<c0370423>] ? schedule+0x4cc/0x4f3
[<c01945cd>] ? sys_ioctl+0x2c/0x42
[<c0102834>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 30
active_anon:9587 inactive_anon:19792 isolated_anon:0
active_file:12298 inactive_file:12270 isolated_file:0
unevictable:0 dirty:13 writeback:434 unstable:0
free:6083 slab_reclaimable:1287 slab_unreclaimable:1143
mapped:2449 shmem:37 pagetables:519 bounce:0
DMA free:1948kB min:120kB low:148kB high:180kB active_anon:724kB inactive_anon:1520kB active_file:3752kB inactive_file:3524kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15808kB mlocked:0kB dirty:0kB writeback:0kB mapped:324kB shmem:0kB slab_reclaimable:208kB slab_unreclaimable:32kB kernel_stack:4kB pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:22384kB min:1912kB low:2388kB high:2868kB active_anon:37624kB inactive_anon:77648kB active_file:45440kB inactive_file:45556kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:243776kB mlocked:0kB dirty:52kB writeback:1724kB mapped:9472kB shmem:148kB slab_reclaimable:4940kB slab_unreclaimable:4540kB kernel_stack:444kB pagetables:2060kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 115*4kB 54*8kB 56*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1948kB
Normal: 1990*4kB 829*8kB 379*16kB 50*32kB 0*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22384kB
32740 total pagecache pages
8128 pages in swap cache
Swap cache stats: add 370349, delete 362221, find 144309/185439
Free swap = 428812kB
Total swap = 514040kB
65520 pages RAM
1726 pages reserved
10844 pages shared
50429 pages non-shared

2009-11-16 17:57:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/5] Reduce GFP_ATOMIC allocation failures, candidate fix V3

On Fri, Nov 13, 2009 at 10:04:21AM +0100, Frans Pop wrote:
> On Thursday 12 November 2009, Mel Gorman wrote:
> > Changelog since V2
> > o Dropped the kswapd-quickly-notice-high-order patch. In more detailed
> > testing, it made latencies even worse as kswapd slept more on
> > high-order congestion causing order-0 direct reclaims.
> > o Added changes to how congestion_wait() works
> > o Added a number of new patches altering the behaviour of reclaim
>
> I have tested this series on top of .32-rc7. First impression is that it
> does seem to improve my test case, but does not yet completely solve it.
>
> My last gitk instance now loads more smoothly for most of the time it takes
> to complete, but I still see a choke point where things freeze for a while
> and where I get SKB allocation errors from my wireless.
> However, that choke point does seem to happen later and to be shorter than
> without the patches.
>

I haven't fully figured out why this makes a difference yet, but with
.32-rc7 and these patches, could you retry the test, except beforehand do:

cd /sys
for SYS in `find -name low_latency`; do
echo 0 > $SYS
done

I believe the low_latency logic might be interfering with the number of
clean pages available for kswapd to reclaim.

Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-17 10:34:21

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

When checking if kswapd is sleeping prematurely, all populated zones are
checked instead of the zones the instance of kswapd is responsible for.
The counters for kswapd going to sleep prematurely are also named poorly.
This patch makes kswapd only check its own zones and renames the relevant
counters.

This is a fix to the patch
vmscan-have-kswapd-sleep-for-a-short-interval-and-double-check-it-should-be-asleep.patch
and is based on top of mmotm-2009-11-13-19-59. It would be preferable if
Kosaki Motohiro signed off on it as he had comments on the patch but the
discussion petered out without a solid conclusion.

Signed-off-by: Mel Gorman <[email protected]>
---
 include/linux/vmstat.h |    2 +-
 mm/vmscan.c            |   20 +++++++++++++-------
 mm/vmstat.c            |    4 ++--
 3 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 70c8093..117f0dd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -40,7 +40,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		PGSCAN_ZONE_RECLAIM_FAILED,
 #endif
 		PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
-		KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+		KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
 		KSWAPD_SKIP_CONGESTION_WAIT,
 		PAGEOUTRUN, ALLOCSTALL, PGROTATED,
 #ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 81ef29b..da6cf42 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1907,19 +1907,25 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
 #endif

 /* is kswapd sleeping prematurely? */
-static int sleeping_prematurely(int order, long remaining)
+static int sleeping_prematurely(pg_data_t *pgdat, int order, long remaining)
 {
-	struct zone *zone;
+	int i;

 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return 1;

 	/* If after HZ/10, a zone is below the high mark, it's premature */
-	for_each_populated_zone(zone)
+	for (i = 0; i < pgdat->nr_zones; i++) {
+		struct zone *zone = pgdat->node_zones + i;
+
+		if (!populated_zone(zone))
+			continue;
+
 		if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
 								0, 0))
 			return 1;
+	}

 	return 0;
 }
@@ -2227,7 +2233,7 @@ static int kswapd(void *p)
 			long remaining = 0;

 			/* Try to sleep for a short interval */
-			if (!sleeping_prematurely(order, remaining)) {
+			if (!sleeping_prematurely(pgdat, order, remaining)) {
 				remaining = schedule_timeout(HZ/10);
 				finish_wait(&pgdat->kswapd_wait, &wait);
 				prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -2238,13 +2244,13 @@ static int kswapd(void *p)
 			 * premature sleep. If not, then go fully
 			 * to sleep until explicitly woken up
 			 */
-			if (!sleeping_prematurely(order, remaining))
+			if (!sleeping_prematurely(pgdat, order, remaining))
 				schedule();
 			else {
 				if (remaining)
-					count_vm_event(KSWAPD_PREMATURE_FAST);
+					count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
 				else
-					count_vm_event(KSWAPD_PREMATURE_SLOW);
+					count_vm_event(KSWAPD_HIGH_WMARK_HIT_QUICKLY);
 			}
 		}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 889254f..6051fba 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -683,8 +683,8 @@ static const char * const vmstat_text[] = {
 	"slabs_scanned",
 	"kswapd_steal",
 	"kswapd_inodesteal",
-	"kswapd_slept_prematurely_fast",
-	"kswapd_slept_prematurely_slow",
+	"kswapd_low_wmark_hit_quickly",
+	"kswapd_high_wmark_hit_quickly",
 	"kswapd_skip_congestion_wait",
 	"pageoutrun",
 	"allocstall",
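For anyone retesting with this patch applied: the renamed counters appear under their new names in /proc/vmstat, which is plain "name value" pairs. A minimal Python sketch of how they might be watched (the file format is real; the sample text and every value in it are invented so the sketch is self-contained rather than reading a live system):

```python
def read_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict."""
    return {name: int(value)
            for name, value in (line.split() for line in text.strip().splitlines())}

# Invented sample standing in for open("/proc/vmstat").read() on a
# kernel carrying this patch; the counter names match the rename above.
sample = """\
kswapd_steal 120034
kswapd_inodesteal 512
kswapd_low_wmark_hit_quickly 3
kswapd_high_wmark_hit_quickly 17
kswapd_skip_congestion_wait 2
pageoutrun 842
"""

stats = read_vmstat(sample)
# A steadily rising high-wmark counter means the high watermark was
# breached almost immediately after kswapd tried to go back to sleep.
print(stats["kswapd_high_wmark_hit_quickly"])  # -> 17
```

Sampling the two `*_hit_quickly` counters before and after a failure reproduces the "was kswapd struggling?" question from the thread without needing extra instrumentation.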

2009-11-17 11:03:21

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

I'm sorry for the long delay.

> On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > 2009/11/14 Mel Gorman <[email protected]>:
> > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > >> > > > watermark was reached. If there are a constant stream of allocations from
> > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > >> > > > the high watermark is not being maintained for sufficient length time.
> > >> > > >
> > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > >> > > >
> > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > >> > > > sleep indicates that the high watermark was breached after a very short
> > >> > > > interval.
> > >> > > >
> > >> > > > Signed-off-by: Mel Gorman <[email protected]>
> > >> > >
> > >> > > Why do you submit this patch to mainline? this is debugging patch
> > >> > > no more and no less.
> > >> > >
> > >> >
> > >> > Do you mean the stats part? The stats are included until such time as the page
> > >> > allocator failure reports stop or are significantly reduced. In the event a
> > >> > report is received, the value of the counters help determine if kswapd was
> > >> > struggling or not. They should be removed once this mess is ironed out.
> > >> >
> > >> > If there is a preference, I can split out the stats part and send it to
> > >> > people with page allocator failure reports for retesting.
> > >>
> > >> I'm sorry my last mail didn't have enough explanation.
> > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > >
> > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > being worked on.
> >
> > Please correct me, if I said wrong thing.
>
> You didn't.
>
> > if I was admin, I don't watch this stats because kswapd frequently
> > wakeup doesn't mean any trouble. instead I watch number of allocation
> > failure.
>
> The stats are not tracking when kswapd wakes up. It helps track how
> quickly the high or low watermarks are going under once kswapd tries to
> go back to sleep.

Umm, honestly I'm still puzzled. Probably we need to go back one step.
kswapd wakes up when the amount of free memory falls below the low watermark
and sleeps when it rises above the high watermark. We need an early
sign of GFP_ATOMIC failure.

My point is, a kswapd wakeup can only happen after kswapd has slept. But if the system is
under heavy pressure and the amount of free memory bounces between the low watermark
and the high watermark, these stats don't increase at all. IOW, these stats are strongly tied to
the high watermark.

Probably, stats tied to the min or low watermark would be more useful for us.

The number of wake_all_kswapd() calls is related to the low watermark, and it increases
continuously while the system is under strong memory pressure. I'm ok with that.
KSWAPD_NO_CONGESTION_WAIT is related to the min watermark. I'm ok with that too.
The number of page allocation failures is related to the min watermark too. I'm ok with that as well.

IOW, my only objection is that this stat stops increasing under strong memory pressure (as explained above).
Can you please tell me why you think kswapd's sleep time is so important?


2009-11-17 11:44:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> I'm sorry for the long delay.
>
> > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > 2009/11/14 Mel Gorman <[email protected]>:
> > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > >> > > >
> > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > >> > > >
> > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > >> > > > interval.
> > > >> > > >
> > > >> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > >> > >
> > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > >> > > no more and no less.
> > > >> > >
> > > >> >
> > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > >> > report is received, the value of the counters help determine if kswapd was
> > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > >> >
> > > >> > If there is a preference, I can split out the stats part and send it to
> > > >> > people with page allocator failure reports for retesting.
> > > >>
> > > >> I'm sorry my last mail didn't have enough explanation.
> > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > >
> > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > being worked on.
> > >
> > > Please correct me, if I said wrong thing.
> >
> > You didn't.
> >
> > > if I was admin, I don't watch this stats because kswapd frequently
> > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > failure.
> >
> > The stats are not tracking when kswapd wakes up. It helps track how
> > quickly the high or low watermarks are going under once kswapd tries to
> > go back to sleep.
>
> Umm, honestly I'm still puzlled. probably we need go back one step at once.
> kswapd wake up when memory amount less than low watermark and sleep
> when memory amount much than high watermask. We need to know
> GFP_ATOMIC failure sign.
>
> My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> under heavy pressure and memory amount go up and down between low watermark
> and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> high watermark.
>

Yes, this is true, but as long as kswapd is awake and doing its job, it
will continue taking direction from processes that failed the low watermark
test on what order it should be reclaiming at. The GFP_ATOMIC allocations
will be allowed to go below this low watermark but will have informed kswapd
what order it should be reclaiming at, so it stays working.

A stat that increases between the low and high watermark would indicate
that memory pressure is there or that the reclaim algorithm is not
working as expected but that's checking for a different problem.

What I was looking at was kswapd going to sleep and the low or min watermarks
being hit very quickly after that so that kswapd pre-emptively kicks in
before allocations start failing again.

> Probaby, min watermark or low watermark are more useful for us.
>

Why? kswapd is awake between those points.

> # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> increase although the system have strong memroy pressure. I'm ok.
> KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> # of page allocation failure is related to min watermark too. I'm ok too.
>
> IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> Can you please tell me why you think kswapd slept time is so important?
>

I don't think the amount of time it has slept is important. I think it's
important to know if the system is getting back into watermark trouble very
shortly after kswapd reached the high watermark.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-17 12:18:10

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

> On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > I'm sorry for the long delay.
> >
> > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > 2009/11/14 Mel Gorman <[email protected]>:
> > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > >> > > >
> > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > >> > > >
> > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > >> > > > interval.
> > > > >> > > >
> > > > >> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > >> > >
> > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > >> > > no more and no less.
> > > > >> > >
> > > > >> >
> > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > >> >
> > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > >> > people with page allocator failure reports for retesting.
> > > > >>
> > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > >
> > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > being worked on.
> > > >
> > > > Please correct me, if I said wrong thing.
> > >
> > > You didn't.
> > >
> > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > failure.
> > >
> > > The stats are not tracking when kswapd wakes up. It helps track how
> > > quickly the high or low watermarks are going under once kswapd tries to
> > > go back to sleep.
> >
> > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > kswapd wake up when memory amount less than low watermark and sleep
> > when memory amount much than high watermask. We need to know
> > GFP_ATOMIC failure sign.
> >
> > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > under heavy pressure and memory amount go up and down between low watermark
> > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > high watermark.
> >
>
> Yes, this is true but as long as kswapd is awake and doing its job, it
> will continue taking direction on what order it should be reclaiming from
> processes that failed the low_watermark test. The GFP_ATOMIC allocations
> will be allowed to go under this low watermark but will have informed kswapd
> what order it should be reclaiming at so it stays working.
>
> A stat that increases between the low and high watermark would indicate
> that memory pressure is there or that the reclaim algorithm is not
> working as expected but that's checking for a different problem.
>
> What I was looking at was kswapd going to sleep and the low or min watermarks
> being hit very quickly after that so that kswapd pre-emptively kicks in
> before allocations start failing again.
>
> > Probaby, min watermark or low watermark are more useful for us.
>
> Why? kswapd is awake between those points.

What's the difference between (1) and (2) below?

1. kswapd() run 100ms and sleep 10ms and run 100ms.
2. kswapd() run 200ms

(1) represents free memory going above the high watermark very briefly and dropping below the
low watermark right after. (2) represents free memory getting very close to the high watermark
without touching it, and probably dropping below the low watermark right after.
The memory pressure is almost the same, but (1) increases KSWAPD_HIGH_WMARK_HIT_QUICKLY and
(2) doesn't increase any stat.

Thus, we can't take KSWAPD_HIGH_WMARK_HIT_QUICKLY as an indicator of memory pressure.


> > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > increase although the system have strong memroy pressure. I'm ok.
> > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > # of page allocation failure is related to min watermark too. I'm ok too.
> >
> > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > Can you please tell me why you think kswapd slept time is so important?
>
> I don't think the amount of time it has slept is important. I think it's
> important to know if the system is getting back into watermark trouble very
> shortly after kswapd reached the high watermark.

Probably my last mail didn't explain this well.
My point is that whether or not free memory goes above the high watermark doesn't indicate any
meaningful phenomenon.

So I'd prefer stats tied to the low or min watermark.



2009-11-17 12:25:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

On Tue, Nov 17, 2009 at 09:18:11PM +0900, KOSAKI Motohiro wrote:
> > On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > > I'm sorry for the long delay.
> > >
> > > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > > 2009/11/14 Mel Gorman <[email protected]>:
> > > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > > >> > > >
> > > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > > >> > > >
> > > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > > >> > > > interval.
> > > > > >> > > >
> > > > > >> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > > >> > >
> > > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > > >> > > no more and no less.
> > > > > >> > >
> > > > > >> >
> > > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > > >> >
> > > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > > >> > people with page allocator failure reports for retesting.
> > > > > >>
> > > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > > >
> > > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > > being worked on.
> > > > >
> > > > > Please correct me, if I said wrong thing.
> > > >
> > > > You didn't.
> > > >
> > > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > > failure.
> > > >
> > > > The stats are not tracking when kswapd wakes up. It helps track how
> > > > quickly the high or low watermarks are going under once kswapd tries to
> > > > go back to sleep.
> > >
> > > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > > kswapd wake up when memory amount less than low watermark and sleep
> > > when memory amount much than high watermask. We need to know
> > > GFP_ATOMIC failure sign.
> > >
> > > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > > under heavy pressure and memory amount go up and down between low watermark
> > > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > > high watermark.
> > >
> >
> > Yes, this is true but as long as kswapd is awake and doing its job, it
> > will continue taking direction on what order it should be reclaiming from
> > processes that failed the low_watermark test. The GFP_ATOMIC allocations
> > will be allowed to go under this low watermark but will have informed kswapd
> > what order it should be reclaiming at so it stays working.
> >
> > A stat that increases between the low and high watermark would indicate
> > that memory pressure is there or that the reclaim algorithm is not
> > working as expected but that's checking for a different problem.
> >
> > What I was looking at was kswapd going to sleep and the low or min watermarks
> > being hit very quickly after that so that kswapd pre-emptively kicks in
> > before allocations start failing again.
> >
> > > Probaby, min watermark or low watermark are more useful for us.
> >
> > Why? kswapd is awake between those points.
>
> What's difference below (1) and (2)?
>
> 1. kswapd() run 100ms and sleep 10ms and run 100ms.
> 2. kswapd() run 200ms
>

Because prior to the patch, once kswapd went to sleep, it would
not wake again until the low watermark was reached. There appeared
to be a timing issue where congestion_wait() would block kswapd
just long enough before checking the high watermark to mean that
it stayed awake. This wasn't happening, hence the approach of
briefly-sleep-after-high-watermark-is-reached-and-double-check-watermarks-are-ok.

> (1) represent amount memory go beyond high-watermark very shortly and go below
> low-watermark right after . (2) represent amount memory neared high-watermark closely, but don't touched,
> and probably go blow low-watermark right after.
> It's almost same memory pressure. but (1) increase KSWAPD_HIGH_WMARK_HIT_QUICKLY and
> (2) don't increase any stat.
>
> Thus, We can't think KSWAPD_HIGH_WMARK_HIT_QUICKLY indicate memory pressure.
>

It indicates mild memory pressure because the high watermark was only
reached for a short period of time.

> > > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > > increase although the system have strong memroy pressure. I'm ok.
> > > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > > # of page allocation failure is related to min watermark too. I'm ok too.
> > >
> > > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > > Can you please tell me why you think kswapd slept time is so important?
> >
> > I don't think the amount of time it has slept is important. I think it's
> > important to know if the system is getting back into watermark trouble very
> > shortly after kswapd reached the high watermark.
>
> Probably, My last mail doesn't take kindly explanation.
> My point is, beyond high-watermark or not doesn't indicate any meaningful
> phenomenon.
>
> Then, I'd prefer low or min-watermark related stats.
>

What would that have to do with the kswapd-briefly-sleep logic? i.e. the
stats you are suggesting, while meaningful, are for looking at a
different type of problem. If you want those stats, then I should revert
the stats part of this patch altogether. Does that then mean you also
want the patch that makes kswapd double check watermarks to be dropped
or just have no associated stats to see what it's doing in the event of
allocation failure?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-18 05:20:51

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

> On Tue, Nov 17, 2009 at 09:18:11PM +0900, KOSAKI Motohiro wrote:
> > > On Tue, Nov 17, 2009 at 08:03:21PM +0900, KOSAKI Motohiro wrote:
> > > > I'm sorry for the long delay.
> > > >
> > > > > On Sat, Nov 14, 2009 at 06:34:23PM +0900, KOSAKI Motohiro wrote:
> > > > > > 2009/11/14 Mel Gorman <[email protected]>:
> > > > > > > On Sat, Nov 14, 2009 at 03:00:57AM +0900, KOSAKI Motohiro wrote:
> > > > > > >> > On Fri, Nov 13, 2009 at 07:43:09PM +0900, KOSAKI Motohiro wrote:
> > > > > > >> > > > After kswapd balances all zones in a pgdat, it goes to sleep. In the event
> > > > > > >> > > > of no IO congestion, kswapd can go to sleep very shortly after the high
> > > > > > >> > > > watermark was reached. If there are a constant stream of allocations from
> > > > > > >> > > > parallel processes, it can mean that kswapd went to sleep too quickly and
> > > > > > >> > > > the high watermark is not being maintained for sufficient length time.
> > > > > > >> > > >
> > > > > > >> > > > This patch makes kswapd go to sleep as a two-stage process. It first
> > > > > > >> > > > tries to sleep for HZ/10. If it is woken up by another process or the
> > > > > > >> > > > high watermark is no longer met, it's considered a premature sleep and
> > > > > > >> > > > kswapd continues work. Otherwise it goes fully to sleep.
> > > > > > >> > > >
> > > > > > >> > > > This adds more counters to distinguish between fast and slow breaches of
> > > > > > >> > > > watermarks. A "fast" premature sleep is one where the low watermark was
> > > > > > >> > > > hit in a very short time after kswapd going to sleep. A "slow" premature
> > > > > > >> > > > sleep indicates that the high watermark was breached after a very short
> > > > > > >> > > > interval.
> > > > > > >> > > >
> > > > > > >> > > > Signed-off-by: Mel Gorman <[email protected]>
> > > > > > >> > >
> > > > > > >> > > Why do you submit this patch to mainline? this is debugging patch
> > > > > > >> > > no more and no less.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > Do you mean the stats part? The stats are included until such time as the page
> > > > > > >> > allocator failure reports stop or are significantly reduced. In the event a
> > > > > > >> > report is received, the value of the counters help determine if kswapd was
> > > > > > >> > struggling or not. They should be removed once this mess is ironed out.
> > > > > > >> >
> > > > > > >> > If there is a preference, I can split out the stats part and send it to
> > > > > > >> > people with page allocator failure reports for retesting.
> > > > > > >>
> > > > > > >> I'm sorry my last mail didn't have enough explanation.
> > > > > > >> This stats help to solve this issue. I agreed. but after solving this issue,
> > > > > > >> I don't imagine administrator how to use this stats. if KSWAPD_PREMATURE_FAST or
> > > > > > >> KSWAPD_PREMATURE_SLOW significantly increased, what should admin do?
> > > > > > >
> > > > > > > One possible workaround would be to raise min_free_kbytes while a fix is
> > > > > > > being worked on.
> > > > > >
> > > > > > Please correct me, if I said wrong thing.
> > > > >
> > > > > You didn't.
> > > > >
> > > > > > if I was admin, I don't watch this stats because kswapd frequently
> > > > > > wakeup doesn't mean any trouble. instead I watch number of allocation
> > > > > > failure.
> > > > >
> > > > > The stats are not tracking when kswapd wakes up. It helps track how
> > > > > quickly the high or low watermarks are going under once kswapd tries to
> > > > > go back to sleep.
> > > >
> > > > Umm, honestly I'm still puzlled. probably we need go back one step at once.
> > > > kswapd wake up when memory amount less than low watermark and sleep
> > > > when memory amount much than high watermask. We need to know
> > > > GFP_ATOMIC failure sign.
> > > >
> > > > My point is, kswapd wakeup only happen after kswapd sleeping. but if the system is
> > > > under heavy pressure and memory amount go up and down between low watermark
> > > > and high watermark, this stats don't increase at all. IOW, this stats is strong related to
> > > > high watermark.
> > > >
> > >
> > > Yes, this is true but as long as kswapd is awake and doing its job, it
> > > will continue taking direction on what order it should be reclaiming from
> > > processes that failed the low_watermark test. The GFP_ATOMIC allocations
> > > will be allowed to go under this low watermark but will have informed kswapd
> > > what order it should be reclaiming at so it stays working.
> > >
> > > A stat that increases between the low and high watermark would indicate
> > > that memory pressure is there or that the reclaim algorithm is not
> > > working as expected but that's checking for a different problem.
> > >
> > > What I was looking at was kswapd going to sleep and the low or min watermarks
> > > being hit very quickly after that so that kswapd pre-emptively kicks in
> > > before allocations start failing again.
> > >
> > > > Probaby, min watermark or low watermark are more useful for us.
> > >




> > > Why? kswapd is awake between those points.
> >
> > What's difference below (1) and (2)?
> >
> > 1. kswapd() run 100ms and sleep 10ms and run 100ms.
> > 2. kswapd() run 200ms
>
> Because prior to the patch, once kswapd went to sleep, it would
> not wake again until the low watermark was reached. There appeared
> to be a timing issue where congestion_wait() would block kswapd
> just long enough before checking the high watermark to mean that
> it stayed awake. This wasn't happening hence the approach of
> briefly-sleep-after-high-watermark-is-reached-and-double-check-watermarks-are-ok.

You are right.

> > (1) represent amount memory go beyond high-watermark very shortly and go below
> > low-watermark right after . (2) represent amount memory neared high-watermark closely, but don't touched,
> > and probably go blow low-watermark right after.
> > It's almost same memory pressure. but (1) increase KSWAPD_HIGH_WMARK_HIT_QUICKLY and
> > (2) don't increase any stat.
> >
> > Thus, We can't think KSWAPD_HIGH_WMARK_HIT_QUICKLY indicate memory pressure.
> >
>
> It indicates mild memory pressure because the high watermark was only
> reached for a short period of time.
>
> > > > # of called wake_all_kswapd() is related to low watermark. and It's conteniously
> > > > increase although the system have strong memroy pressure. I'm ok.
> > > > KSWAPD_NO_CONGESTION_WAIT is related to min watermark. I'm ok too..
> > > > # of page allocation failure is related to min watermark too. I'm ok too.
> > > >
> > > > IOW, I only dislike this stat stop increase strong memory pressure (above explanation).
> > > > Can you please tell me why you think kswapd slept time is so important?
> > >
> > > I don't think the amount of time it has slept is important. I think it's
> > > important to know if the system is getting back into watermark trouble very
> > > shortly after kswapd reached the high watermark.
> >
> > Probably, My last mail doesn't take kindly explanation.
> > My point is, beyond high-watermark or not doesn't indicate any meaningful
> > phenomenon.
> >
> > Then, I'd prefer low or min-watermark related stats.
>
> What would that have to do with the kswapd-briefly-sleep logic? i.e. the
> stats you are suggesting, while meaningful, are for looking at a
> different type of problem. If you want those stats, then I should revert
> the stats part of this patch altogether. Does that then mean you also
> want the patch that makes kswapd double check watermarks to be dropped
> or just have no associated stats to see what it's doing in the event of
> allocation failure?

Yes, you talked right thing.

After a while thinking (and a good night's sleep), I've concluded that I shouldn't push
my personal preference any further. I'm sorry.

Honestly, I still haven't understood how these stats will be used. But the stats are not the core
piece of this patch's concept; we shouldn't get stuck on a small issue.

Then, I'll review your latest patch soon.

Thanks.



2009-11-18 05:27:16

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

> When checking if kswapd is sleeping prematurely, all populated zones are
> checked instead of the zones the instance of kswapd is responsible for.
> The counters for kswapd going to sleep prematurely are also named poorly.
> This patch makes kswapd only check its own zones and renames the relevant
> counters.
>
> This is a fix to the patch
> vmscan-have-kswapd-sleep-for-a-short-interval-and-double-check-it-should-be-asleep.patch
> and is based on top of mmotm-2009-11-13-19-59. It would be preferable if
> Kosaki Motohiro signed off on it as he had comments on the patch but the
> discussion petered out without a solid conclusion.
>
> Signed-off-by: Mel Gorman <[email protected]>

Looks good to me.
I apologize for bothering you with nits for so long.

Reviewed-by: KOSAKI Motohiro <[email protected]>


2009-11-20 14:56:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

On Fri, Nov 13, 2009 at 02:32:12PM +0100, Jens Axboe wrote:
> On Fri, Nov 13 2009, Mel Gorman wrote:
> > On Fri, Nov 13, 2009 at 12:55:58PM +0100, Jens Axboe wrote:
> > > On Fri, Nov 13 2009, KOSAKI Motohiro wrote:
> > > > (cc to Jens)
> > > >
> > > > > Testing by Frans Pop indicated that in the 2.6.30..2.6.31 window at least
> > > > > that the commits 373c0a7e 8aa7e847 dramatically increased the number of
> > > > > GFP_ATOMIC failures that were occurring within a wireless driver. Reverting
> > > > > this patch seemed to help a lot even though it was pointed out that the
> > > > > congestion changes were very far away from high-order atomic allocations.
> > > > >
> > > > > The key to why the revert makes such a big difference is down to timing and
> > > > > how long direct reclaimers wait versus kswapd. With the patch reverted,
> > > > > the congestion_wait() is on the SYNC queue instead of the ASYNC. As a
> > > > > significant part of the workload involved reads, it makes sense that the
> > > > > SYNC list is what was truly congested and with the revert processes were
> > > > > waiting on congestion as expected. Hence, direct reclaimers stalled
> > > > > properly and kswapd was able to do its job with fewer stalls.
> > > > >
> > > > > This patch aims to fix the congestion_wait() behaviour for SYNC and ASYNC
> > > > > for direct reclaimers. Instead of making the congestion_wait() on the SYNC
> > > > > queue which would only fix a particular type of workload, this patch adds a
> > > > > third type of congestion_wait - BLK_RW_BOTH which first waits on the ASYNC
> > > > > and then the SYNC queue if the timeout has not been reached. In tests, this
> > > > > counter-intuitively results in kswapd stalling less and freeing up pages
> > > > > resulting in fewer allocation failures and fewer direct-reclaim-orientated
> > > > > stalls.
> > > >
> > > > Honestly, I don't like this patch. The page allocator is not related
> > > > to the sync block queue; vmscan doesn't issue read operations.
> > > > This patch has nearly the same effect as s/congestion_wait/io_schedule_timeout/.
> > > >
> > > > Please don't make mysterious heuristic code.
> > > >
> > > >
> > > > Sidenode: I doubt this regression was caused from page allocator.
> >
> > Probably not. As noted, the major change is really in how long callers
> > are waiting on congestion_wait. The tarball includes graphs from an
> > instrumented kernel that shows how long callers are waiting due to
> > congestion_wait(). This has changed significantly.
> >
> > I'll queue up tests over the weekend that test without dm-crypt being involved.
> >
> > > > Probably we need to confirm caller change....
> > >
> > > See the email from Chris from yesterday, he nicely explains why this
> > > change made a difference with dm-crypt.
> >
> > Indeed.
> >
> > But bear in mind that it also possible that direct reclaimers are also
> > congesting the queue due to swap-in.
>
> Are you speculating, or has this been observed? While I don't contest
> that that could happen, it's also not a new thing. And it should be an
> unlikely event.
>
> > > dm-crypt needs fixing, not a hack like this added.
> > >
> >
> > As noted by Chris in the same mail, dm-crypt has not changed. What has
> > changed is how long callers wait in congestion_wait.
>
> Right dm-crypt didn't change, it WAS ALREADY BUGGY.
>

On a different note, I tried without dm-crypt and found by far the
greatest different to page allocator success or failure was the value of
the low_latency tunable.

low_latency == 0
2.6.32-rc6-0000000-force-highorder Elapsed:17:50.935(stddev:002.535) Failures:0
2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:18:44.610(stddev:002.236) Failures:1
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:17:18.330(stddev:002.258) Failures:2
2.6.32-rc6-0000123-congestion-both Elapsed:16:17.370(stddev:002.167) Failures:1
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:15:54.880(stddev:002.234) Failures:0
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:19:26.417(stddev:002.237) Failures:1
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:20:19.135(stddev:001.516) Failures:0

low_latency == 1
2.6.32-rc6-0000000-force-highorder Elapsed:20:04.755(stddev:078.551) Failures:22
2.6.32-rc6-0000006-dm-crypt-unplug Elapsed:25:05.608(stddev:053.224) Failures:12
2.6.32-rc6-0000012-pgalloc-2.6.30 Elapsed:19:01.530(stddev:002.146) Failures:14
2.6.32-rc6-0000123-congestion-both Elapsed:18:01.938(stddev:002.171) Failures:2
2.6.32-rc6-0001234-kswapd-quick-recheck Elapsed:19:52.833(stddev:064.168) Failures:7
2.6.32-rc6-0012345-kswapd-stay-awake-when-min Elapsed:26:25.767(stddev:050.827) Failures:1
2.6.32-rc6-0123456-dm-crypt-unplug Elapsed:22:56.850(stddev:053.914) Failures:1

Enabling low_latency both regresses performance of the test and causes a
boatload of allocation failures. Note also the deviations. With
low_latency == 0, gitk performs predictably, +/- around 2 seconds. With
low_latency == 1, there are huge variances, +/- about a minute.

Sampling writeback, I found that the number of pages in writeback with
low_latency == 1 was higher for longer.

Any theories as to why enabling low_latency has such a negative impact?
Should it be disabled by default?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-11-26 14:45:11

by Tobias Oetiker

[permalink] [raw]
Subject: Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2

Hi Mel,

Nov 13 Mel Gorman wrote:

> The last version has a stupid bug in it. Sorry.
>
> Changelog since V1
> o Fix incorrect negation
> o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> suggested by Rik
>
> If reclaim fails to make sufficient progress, the priority is raised.
> Once the priority is higher, kswapd starts waiting on congestion. However,
> if the zone is below the min watermark then kswapd needs to continue working
> without delay as there is a danger of an increased rate of GFP_ATOMIC
> allocation failure.
>
> This patch changes the conditions under which kswapd waits on
> congestion by only going to sleep if the min watermarks are being met.

I finally got around to testing this together with the whole series on
2.6.31.6. After running it for a day I have not yet seen a single
order:5 allocation problem ... (while I had several an hour before)

for the record, my kernel is now running with the following
patches:

patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed

patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER

patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim

patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep

patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2

patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-11-29 07:42:08

by Tobias Oetiker

[permalink] [raw]
Subject: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

Hi Mel,

Thursday Tobias Oetiker wrote:
> Hi Mel,
>
> Nov 13 Mel Gorman wrote:
>
> > The last version has a stupid bug in it. Sorry.
> >
> > Changelog since V1
> > o Fix incorrect negation
> > o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> > suggested by Rik
> >
> > If reclaim fails to make sufficient progress, the priority is raised.
> > Once the priority is higher, kswapd starts waiting on congestion. However,
> > if the zone is below the min watermark then kswapd needs to continue working
> > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > allocation failure.
> >
> > This patch changes the conditions under which kswapd waits on
> > congestion by only going to sleep if the min watermarks are being met.
>
> I finally got around to test this together with the whole series on
> 2.6.31.6. after running it for a day I have not yet seen a single
> order:5 allocation problem ... (while I had several an hour before)

> for the record, my kernel is now running with the following
> patches:
>
> patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
>
> patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
>
> patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
>
> patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
>
> patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
>
> patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>
I have now been running the new kernel for a few days, and I am
sorry to report that about a day after booting the allocation
failures started showing up again, now order:4 instead of order:5 ...

Nov 29 07:16:17 johan kernel: [261565.598627] nfsd: page allocation failure. order:4, mode:0x4020 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598638] Pid: 6956, comm: nfsd Tainted: G D 2.6.31.6-oep #1 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598641] Call Trace: [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598646] <IRQ> [<ffffffff810cb730>] __alloc_pages_nodemask+0x570/0x690 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598672] [<ffffffff8142aba5>] ? tcp_tso_segment+0x265/0x2e0 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598680] [<ffffffff810faf08>] kmalloc_large_node+0x68/0xc0 [kern.warning]
Nov 29 07:16:17 johan kernel: [261565.598692] [<ffffffff810febfa>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]

The fact that there were no problems right after booting might give
some indication as to where the problem is. This is on a machine
with 24 GB of memory; as you can see from /proc/meminfo below, it has
plenty of room to spare and should never run into allocation problems
anyway. The only thing that would have changed after running it for
about a day is that more memory is used for caching ...

# cat /proc/meminfo
MemTotal: 24746268 kB
MemFree: 128012 kB
Buffers: 6255988 kB
Cached: 10600780 kB
SwapCached: 23668 kB
Active: 13078944 kB
Inactive: 9229172 kB
Active(anon): 5455164 kB
Inactive(anon): 508616 kB
Active(file): 7623780 kB
Inactive(file): 8720556 kB
Unevictable: 291768 kB
Mlocked: 291768 kB
SwapTotal: 83886056 kB
SwapFree: 83846520 kB
Dirty: 2028 kB
Writeback: 0 kB
AnonPages: 5723748 kB
Mapped: 152240 kB
Slab: 1339176 kB
SReclaimable: 1270156 kB
SUnreclaim: 69020 kB
PageTables: 31208 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 101208440 kB
Committed_AS: 8146696 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 115084 kB
VmallocChunk: 34359518011 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 10176 kB
DirectMap2M: 25155584 kB

hth
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-12-02 11:32:41

by Mel Gorman

[permalink] [raw]
Subject: Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
> Hi Mel,
>
> Thursday Tobias Oetiker wrote:
> > Hi Mel,
> >
> > Nov 13 Mel Gorman wrote:
> >
> > > The last version has a stupid bug in it. Sorry.
> > >
> > > Changelog since V1
> > > o Fix incorrect negation
> > > o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> > > suggested by Rik
> > >
> > > If reclaim fails to make sufficient progress, the priority is raised.
> > > Once the priority is higher, kswapd starts waiting on congestion. However,
> > > if the zone is below the min watermark then kswapd needs to continue working
> > > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > > allocation failure.
> > >
> > > This patch changes the conditions under which kswapd waits on
> > > congestion by only going to sleep if the min watermarks are being met.
> >
> > I finally got around to test this together with the whole series on
> > 2.6.31.6. after running it for a day I have not yet seen a single
> > order:5 allocation problem ... (while I had several an hour before)
>
> > for the record, my kernel is now running with the following
> > patches:
> >
> > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> >
> > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
> >
> > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> >
> > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> >
> > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
> >
> > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> >
> I have now been running the new kernel for a few days and I am
> sorry to report that about a day after booting the allocation
> failures started showing again. More order:4 instead of order:5 ...
>

Why has the order changed?

Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
SLUB? Did you happen to change them when upgrading the kernel?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2009-12-02 21:30:26

by Tobias Oetiker

[permalink] [raw]
Subject: Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

Hi Mel,

Today Mel Gorman wrote:

> On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
> > Hi Mel,
> >
> > Thursday Tobias Oetiker wrote:
> > > Hi Mel,
> > >
> > > Nov 13 Mel Gorman wrote:
> > >
> > > > The last version has a stupid bug in it. Sorry.
> > > >
> > > > Changelog since V1
> > > > o Fix incorrect negation
> > > > o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
> > > > suggested by Rik
> > > >
> > > > If reclaim fails to make sufficient progress, the priority is raised.
> > > > Once the priority is higher, kswapd starts waiting on congestion. However,
> > > > if the zone is below the min watermark then kswapd needs to continue working
> > > > without delay as there is a danger of an increased rate of GFP_ATOMIC
> > > > allocation failure.
> > > >
> > > > This patch changes the conditions under which kswapd waits on
> > > > congestion by only going to sleep if the min watermarks are being met.
> > >
> > > I finally got around to test this together with the whole series on
> > > 2.6.31.6. after running it for a day I have not yet seen a single
> > > order:5 allocation problem ... (while I had several an hour before)
> >
> > > for the record, my kernel is now running with the following
> > > patches:
> > >
> > > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
> > > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> > >
> > > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
> > > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
> > >
> > > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
> > > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
> > >
> > > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
> > > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
> > >
> > > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
> > > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
> > >
> > > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
> > > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
> > >
> > I have now been running the new kernel for a few days and I am
> > sorry to report that about a day after booting the allocation
> > failures started showing again. More order:4 instead of order:5 ...
> >
>
> Why has the order changed?

No idea ... the order changed after applying the patches
cited above.

> Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
> SLUB? Did you happen to change them when upgrading the kernel?

I have been and still am using SLUB ...

cheers
tobi


--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-12-03 20:25:56

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

Hi Tobias,
does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
high order allocation problems?
It seems that you have a lot of memory, but high-order pages do not show up.
The patch should make them more likely to appear.
On my machine (that has much less ram than yours), with the patch, I
always have order-10 pages available.

Corrado

On Wed, Dec 2, 2009 at 10:30 PM, Tobias Oetiker <[email protected]> wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
>> On Sun, Nov 29, 2009 at 08:42:09AM +0100, Tobi Oetiker wrote:
>> > Hi Mel,
>> >
>> > Thursday Tobias Oetiker wrote:
>> > > Hi Mel,
>> > >
>> > > Nov 13 Mel Gorman wrote:
>> > >
>> > > > The last version has a stupid bug in it. Sorry.
>> > > >
>> > > > Changelog since V1
>> > > >   o Fix incorrect negation
>> > > >   o Rename kswapd_no_congestion_wait to kswapd_skip_congestion_wait as
>> > > >     suggested by Rik
>> > > >
>> > > > If reclaim fails to make sufficient progress, the priority is raised.
>> > > > Once the priority is higher, kswapd starts waiting on congestion.  However,
>> > > > if the zone is below the min watermark then kswapd needs to continue working
>> > > > without delay as there is a danger of an increased rate of GFP_ATOMIC
>> > > > allocation failure.
>> > > >
>> > > > This patch changes the conditions under which kswapd waits on
>> > > > congestion by only going to sleep if the min watermarks are being met.
>> > >
>> > > I finally got around to test this together with the whole series on
>> > > 2.6.31.6. after running it for a day I have not yet seen a single
>> > > order:5 allocation problem ... (while I had several an hour before)
>> >
>> > > for the record, my kernel is now running with the following
>> > > patches:
>> > >
>> > > patch1:Date: Thu, 12 Nov 2009 19:30:31 +0000
>> > > patch1:Subject: [PATCH 1/5] page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
>> > >
>> > > patch2:Date: Thu, 12 Nov 2009 19:30:32 +0000
>> > > patch2:Subject: [PATCH 2/5] page allocator: Do not allow interrupts to use ALLOC_HARDER
>> > >
>> > > patch3:Date: Thu, 12 Nov 2009 19:30:33 +0000
>> > > patch3:Subject: [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
>> > >
>> > > patch4:Date: Thu, 12 Nov 2009 19:30:34 +0000
>> > > patch4:Subject: [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
>> > >
>> > > patch5:Date: Fri, 13 Nov 2009 20:03:57 +0000
>> > > patch5:Subject: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2
>> > >
>> > > patch6:Date: Tue, 17 Nov 2009 10:34:21 +0000
>> > > patch6:Subject: [PATCH] vmscan: Have kswapd sleep for a short interval and double check it should be asleep fix 1
>> > >
>> > I have now been running the new kernel for a few days and I am
>> > sorry to report that about a day after booting the allocation
>> > failures started showing again. More order:4 instead of order:5 ...
>> >
>>
>> Why has the order changed?
>
> ? no idea ... the order has changed after applying the patches
> cited above.
>
>> Also, what allocator were you using in 2.6.30 and 2.6.31.6, SLAB or
>> SLUB? Did you happen to change them when upgrading the kernel?
>
> I have been and still am using SLUB  ...
>
> cheers
> tobi
>
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda

2009-12-14 05:59:13

by Tobias Oetiker

[permalink] [raw]
Subject: Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

Hi Corrado,

Dec 3 Corrado Zoccolo wrote:

> Hi Tobias,
> does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
> high order allocation problems?
> It seems that you have lot of memory, but high order pages do not show up.
> The patch should make them more likely to appear.
> On my machine (that has much less ram than yours), with the patch, I
> always have order-10 pages available.

I have tried it and ... it does not work; the page allocation
failure still shows. BUT while testing it on two machines I found that it
only shows on one machine. The workload on the two machines is
similar (they both run virtualbox), as is the available memory.

Could it be caused by a hardware driver?

cheers
tobi

--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900

2009-12-14 08:49:35

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: still getting allocation failures (was Re: [PATCH] vmscan: Stop kswapd waiting on congestion when the min watermark is not being met V2)

Hi Tobi,
On Mon, Dec 14, 2009 at 6:59 AM, Tobias Oetiker <[email protected]> wrote:
> Hi Corrado,
>
> Dec 3 Corrado Zoccolo wrote:
>
>> Hi Tobias,
>> does the patch in http://lkml.org/lkml/2009/11/30/301 help with your
>> high order allocation problems?
>> It seems that you have lot of memory, but high order pages do not show up.
>> The patch should make them more likely to appear.
>> On my machine (that has much less ram than yours), with the patch, I
>> always have order-10 pages available.
>
> I have tried it and ... it does not work, the  page allocation
> failure still shows. BUT while testing it on two machines I found that it
> only shows on on machine. The workload on the two machines is
> similar (they both run virtualbox) and also the available memory.

Were those both failing before the patch?
Did the order of failure change?

> Could it be caused by a hardware driver ?
It should be something that is taking more time to release pages, but
I don't know what it could be. What happens if you drop the caches when
you are getting failures? Does the failure rate drop as if you had
just rebooted?
Can you log the contents of /proc/buddyinfo at regular intervals, and
try to correlate the moments when pages of the requested order become
scarce with some other event?

Thanks,
Corrado
>
> cheers
> tobi
>
> --
> Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
> http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
Tales of Power - C. Castaneda