Sorry for the large cc list. Variations of this bug have cropped up in a
number of different places and so there are a fair few people that should
be vaguely aware of what's going on.
Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
failures. A significant number of these have been high-order GFP_ATOMIC
failures and, while they are generally brushed away, there has been a large
increase in them recently. There are a number of possible areas the
problem could be in - core vm, page writeback and a specific driver. The
bugs affected by this that I am aware of are:
[Bug #14141] order 2 page allocation failures in iwlagn
Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
allocations within the wireless driver. This has caused large numbers
of failure reports to occur as reported by Frans Pop. Fixing this
requires changes to the driver if it wants to use GFP_ATOMIC which
is in the hands of Mohamed Abbas and Reinette Chatre. However,
it is very likely that it has been compounded by core mm changes
that this series is aimed at.
[Bug #14141] order 2 page allocation failures (generic)
This problem is being tracked under bug #14141 but chances are it's
unrelated to the wireless change. Tobi Oetiker has reported that a
virtualised machine using a bridged interface is reporting a small
number of order-5 GFP_ATOMIC failures. He has reported that the
errors can be suppressed with kswapd patches in this series. However,
I would like to confirm they are necessary.
[Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
Karol Lewandowski reported that e100 fails an order-5
GFP_ATOMIC allocation when loading firmware during resume. This has
started happening relatively recently.
[No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
This is apparently easily reproducible, particularly in comparison to
the other reports. The point of greatest interest is that these are
order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
will be able to follow the tests below as you are the most likely
person to have an easily reproducible situation.
[No BZ ID] page allocation failure message kernel 2.6.31.4 (tty-related)
Reported at http://lkml.org/lkml/2009/10/20/139. It looks the same
as the order-2 failures.
There are 5 patches in this series. For people affected by this bug,
I'm afraid there is a lot of legwork involved to help pin down which of
these patches are relevant. These patches are all against 2.6.32-rc5 and
have been tested on X86 and X86-64 by running the sysbench benchmark to
completion. I'll post versions against 2.6.31.4 where necessary.
Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
Test 2: Apply the following two patches and test again
1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
These patches correct problems introduced by me during the 2.6.31-rc1
merge window. The patches in that window were not meant to introduce
any functional changes, but two functional changes slipped through.
If your problem goes away with just these two patches applied,
please tell me.
Test 3: If you are getting allocation failures, try with the following patch
3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
This is a functional change that causes kswapd to notice sooner
when high-order watermarks have been hit. There have been a number
of changes in page reclaim since 2.6.30 that might have delayed
when kswapd kicks in for higher orders.
If your problem goes away with these three patches applied, please
tell me.
Test 4: If you are still getting failures, apply the following
4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
This patch is very heavy handed and pre-emptively kicks kswapd when
watermarks are hit. It should only be necessary if there have been
significant changes in the timing and density of page allocations
from an unknown source. Tobias, this patch is largely aimed at you.
You reported that with patches 3+4 applied your problems went
away. I need to know if patch 3 on its own is enough or if both
are required.
If your problem goes away with these four patches applied, please
tell me.
Test 5: If things are still screwed, apply the following
5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
Frans Pop reports that the bulk of his problems go away when this
patch is reverted on 2.6.31. There has been some confusion on why
exactly this patch was wrong but apparently the conversion was not
complete and further work was required. It's unknown if all the
necessary work exists in 2.6.32-rc5 or not. If there are still
allocation failures and applying this patch fixes the problem,
there are still snags that need to be ironed out.
Test 6: If only testing 2.6.31.4, test with patches 1, 2 and 5 as posted for that kernel
Even if patches 3, 4 or both are necessary against mainline, I'm
hoping they are unnecessary against -stable.
Thanks to all who reported problems and are testing this. The bulk of
the work was done by Frans Pop, so a big thanks to him in particular. I/we owe
him beers.
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 41 ++++++++++++++++++++++++++---------------
mm/vmscan.c | 17 +++++++++++++----
16 files changed, 75 insertions(+), 58 deletions(-)
If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it did previously.
This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..dfa4362 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
+restart:
wake_all_kswapd(order, zonelist, high_zoneidx);
-restart:
/*
* OK, we're below the kswapd watermark and have kicked background
* reclaim. Now things get more complex, so set up alloc_flags according
--
1.6.3.3
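(Editorial note: for readers not staring at the full function, the control
flow this patch affects looks roughly like the following. This is paraphrased
from memory of the 2.6.32-rc5 __alloc_pages_slowpath(), not a verbatim quote.)

restart:
	/* retries now pass through the kswapd wakeup again */
	wake_all_kswapd(order, zonelist, high_zoneidx);
	alloc_flags = gfp_to_alloc_flags(gfp_mask);
	/* ... get_page_from_freelist(), direct reclaim ... */

	/* direct reclaim made no forward progress: consider going OOM */
	if (!did_some_progress) {
		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
			page = __alloc_pages_may_oom(gfp_mask, order,
					zonelist, high_zoneidx, nodemask,
					preferred_zone, migratetype);
			if (page)
				goto got_pg;
			goto restart;	/* kswapd is rewoken at this order */
		}
	}

Previously the restart label sat below the wakeup, so the "goto restart"
taken after a failed direct reclaim skipped the wake_all_kswapd() call.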
Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
slightly by allowing rt_tasks that are handling an interrupt to set
ALLOC_HARDER. This patch brings the watermark logic more in line with
2.6.30.
[[email protected]: Spotted the problem]
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Pekka Enberg <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dfa4362..7f2aa3e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
* See also cpuset_zone_allowed() comment in kernel/cpuset.c.
*/
alloc_flags &= ~ALLOC_CPUSET;
- } else if (unlikely(rt_task(p)))
+ } else if (unlikely(rt_task(p)) && !in_interrupt())
alloc_flags |= ALLOC_HARDER;
if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
--
1.6.3.3
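(Editorial aside: the reason in_interrupt() matters here is that current
inside an interrupt handler is simply whatever task happened to be running
when the interrupt arrived. A sketch of the failure mode, with hypothetical
driver details:)

	/* An RT task is running when a NIC interrupt arrives. The handler
	 * makes an unrelated GFP_ATOMIC allocation: */
	skb = dev_alloc_skb(len);

	/* gfp_to_alloc_flags() then sees rt_task(current) == true and grants
	 * ALLOC_HARDER, letting an ordinary device allocation dig further
	 * into the watermark reserves purely because of what the interrupted
	 * task was. The !in_interrupt() check closes that hole. */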
When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order, both to avoid stalling direct reclaimers and to help
GFP_ATOMIC allocations. Something has changed in recent kernels that affects
the timing such that high-order GFP_ATOMIC allocations are now failing with
more frequency, particularly under pressure. This patch forces kswapd to
notice sooner that high-order allocations are occurring.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..cd68109 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2016,6 +2016,15 @@ loop_again:
priority != DEF_PRIORITY)
continue;
+ /*
+ * Exit quickly to restart if it has been indicated
+ * that higher orders are required
+ */
+ if (pgdat->kswapd_max_order > order) {
+ all_zones_ok = 1;
+ goto out;
+ }
+
if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone), end_zone, 0))
all_zones_ok = 0;
--
1.6.3.3
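(For context: pgdat->kswapd_max_order is raised by wakeup_kswapd() whenever
an allocation finds a zone below its low watermark at a given order. Roughly
the following, paraphrased from 2.6.31-era mm/vmscan.c rather than quoted
verbatim.)

	void wakeup_kswapd(struct zone *zone, int order)
	{
		pg_data_t *pgdat;

		if (!populated_zone(zone))
			return;

		pgdat = zone->zone_pgdat;
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
			return;
		/* remember the highest order recently asked for */
		if (pgdat->kswapd_max_order < order)
			pgdat->kswapd_max_order = order;
		/* (cpuset check elided) */
		if (!waitqueue_active(&pgdat->kswapd_wait))
			return;
		wake_up_interruptible(&pgdat->kswapd_wait);
	}

The hunk above makes balance_pgdat() bail out and restart at the new order
as soon as that value rises, instead of completing the current pass at the
old, lower order.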
When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order, both to avoid stalling direct reclaimers and to help
GFP_ATOMIC allocations. Something has changed in recent kernels that affects
the timing such that high-order GFP_ATOMIC allocations are now failing with
more frequency, particularly under pressure.
This patch pre-emptively checks if watermarks have been hit after a
high-order allocation completes successfully. If the watermarks have been
reached, kswapd is woken in the hope it fixes the watermarks before the
next GFP_ATOMIC allocation fails.
Warning: this patch is somewhat of a band-aid. If it makes a difference,
it still implies that something has changed that is either causing more
GFP_ATOMIC allocations to occur (as is the case with the iwlagn wireless
driver) or making them more likely to fail.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
1 files changed, 22 insertions(+), 11 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7f2aa3e..851df40 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1596,6 +1596,17 @@ try_next_zone:
return page;
}
+static inline
+void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
+ enum zone_type high_zoneidx)
+{
+ struct zoneref *z;
+ struct zone *zone;
+
+ for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
+ wakeup_kswapd(zone, order);
+}
+
static inline int
should_alloc_retry(gfp_t gfp_mask, unsigned int order,
unsigned long pages_reclaimed)
@@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
- return page;
-}
-
-static inline
-void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
- enum zone_type high_zoneidx)
-{
- struct zoneref *z;
- struct zone *zone;
+ /*
+ * If after a high-order allocation we are now below watermarks,
+ * pre-emptively kick kswapd rather than having the next allocation
+ * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
+ * allocations or entering direct reclaim
+ */
+ if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
+ preferred_zone->watermark[ALLOC_WMARK_LOW],
+ zone_idx(preferred_zone), ALLOC_WMARK_LOW))
+ wake_all_kswapd(order, zonelist, high_zoneidx);
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- wakeup_kswapd(zone, order);
+ return page;
}
static inline int
--
1.6.3.3
Testing by Frans Pop indicates that, at least in the 2.6.30..2.6.31 window,
commits 373c0a7e and 8aa7e847 dramatically increased the
number of GFP_ATOMIC failures that were occurring within a wireless
driver. It was never isolated which of the changes was the exact problem
and it's possible it has been fixed since. If problems are still
occurring with GFP_ATOMIC in 2.6.32-rc5, then this patch should be
applied to determine if the congestion_wait() callers are still broken.
Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 2 +-
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 8 ++++----
16 files changed, 42 insertions(+), 45 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto survive;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 2ddf03a..d69bf9c 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
wakeup = (pd->write_congestion_on > 0
&& pd->bio_queue_size <= pd->write_congestion_off);
spin_unlock(&pd->lock);
- if (wakeup) {
- clear_bdi_congested(&pd->disk->queue->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ if (wakeup)
+ clear_bdi_congested(&pd->disk->queue->backing_dev_info, WRITE);
pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
pkt_set_state(pkt, PACKET_WAITING_STATE);
@@ -2594,10 +2592,10 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
spin_lock(&pd->lock);
if (pd->write_congestion_on > 0
&& pd->bio_queue_size >= pd->write_congestion_on) {
- set_bdi_congested(&q->backing_dev_info, BLK_RW_ASYNC);
+ set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(BLK_RW_ASYNC, HZ);
+ congestion_wait(WRITE, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..c72a8dd 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
* But don't wait if split was due to the io size restriction
*/
if (unlikely(out_of_pages))
- congestion_wait(BLK_RW_ASYNC, HZ/100);
+ congestion_wait(WRITE, HZ/100);
/*
* With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index e8c159d..ef60a65 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
return 0;
}
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 51d9e33..b152761 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -286,8 +286,8 @@ __releases(&fc->lock)
}
if (fc->num_background == fc->congestion_threshold &&
fc->connected && fc->bdi_initialized) {
- clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ clear_bdi_congested(&fc->bdi, READ);
+ clear_bdi_congested(&fc->bdi, WRITE);
}
fc->num_background--;
fc->active_background--;
@@ -414,8 +414,8 @@ static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
fc->blocked = 1;
if (fc->num_background == fc->congestion_threshold &&
fc->bdi_initialized) {
- set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ set_bdi_congested(&fc->bdi, READ);
+ set_bdi_congested(&fc->bdi, WRITE);
}
list_add_tail(&req->list, &fc->bg_queue);
flush_bg_queue(fc);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53eb26c..bb9cc66 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -202,10 +202,8 @@ static int nfs_set_page_writeback(struct page *page)
struct nfs_server *nfss = NFS_SERVER(inode);
if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ NFS_CONGESTION_ON_THRESH)
+ set_bdi_congested(&nfss->backing_dev_info, WRITE);
}
return ret;
}
@@ -217,7 +215,7 @@ static void nfs_end_page_writeback(struct page *page)
end_page_writeback(page);
if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+ clear_bdi_congested(&nfss->backing_dev_info, WRITE);
}
static struct nfs_page *nfs_find_and_lock_request(struct page *page)
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 9062220..77f5bb7 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
if (atomic_read(&j->j_async_throttle))
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
return 0;
}
diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 2d3f90a..1cd3b55 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..178c20c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
XFS_STATS_INC(xb_page_retries);
xfsbufd_wakeup(0, gfp_mask);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto retry;
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b449e73..58f5d0c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -273,14 +273,9 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << BDI_async_congested));
}
-enum {
- BLK_RW_ASYNC = 0,
- BLK_RW_SYNC = 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-long congestion_wait(int sync, long timeout);
+void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
+void set_bdi_congested(struct backing_dev_info *bdi, int rw);
+long congestion_wait(int rw, long timeout);
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 221cecd..51a6320 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -70,6 +70,11 @@ enum rq_cmd_type_bits {
REQ_TYPE_ATA_PC,
};
+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
/*
* For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
* sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -784,18 +789,18 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
* congested queues, and wake up anyone who was waiting for requests to be
* put back.
*/
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
+static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
{
- clear_bdi_congested(&q->backing_dev_info, sync);
+ clear_bdi_congested(&q->backing_dev_info, rw);
}
/*
* A queue has just entered congestion. Flag that in the queue's VM-visible
* state flags and increment the global gounter of congested queues.
*/
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
+static inline void blk_set_queue_congested(struct request_queue *q, int rw)
{
- set_bdi_congested(&q->backing_dev_info, sync);
+ set_bdi_congested(&q->backing_dev_info, rw);
}
extern void blk_start_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5a37e20..d68d6e4 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -696,6 +696,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
@@ -720,18 +721,18 @@ EXPORT_SYMBOL(set_bdi_congested);
/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @rw: READ or WRITE
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int rw, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
+ wait_queue_head_t *wqh = &congestion_wqh[rw];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f99f599..f92ee06 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2430,7 +2430,7 @@ try_to_free:
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2c5d792..b300954 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -671,7 +671,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The caller might hold locks which can prevent IO completion
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 851df40..2cd0fbb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1738,7 +1738,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
/*
@@ -1909,7 +1909,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto rebalance;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd68109..3805e59 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1174,7 +1174,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (nr_freed < nr_taken && !current_is_kswapd() &&
lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The attempt at page out may have made some
@@ -1783,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -2074,7 +2074,7 @@ loop_again:
* another pass across the zones.
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* We do this so kswapd doesn't build up large priorities for
@@ -2378,7 +2378,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
goto out;
if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
}
}
--
1.6.3.3
This is a version that applies against 2.6.31.4
==== CUT HERE ====
page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
If a direct reclaim makes no forward progress, it considers whether it
should go OOM or not. Whether OOM is triggered or not, it may retry the
allocation afterwards. In times past, this would always wake kswapd as well
but currently, kswapd is not woken up after direct reclaim fails. For order-0
allocations, this makes little difference but if there is a heavy mix of
higher-order allocations that direct reclaim is failing for, it might mean
that kswapd is not rewoken for higher orders as much as it did previously.
This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b3c6cb..239677a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1763,6 +1763,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
+restart:
wake_all_kswapd(order, zonelist, high_zoneidx);
/*
@@ -1772,7 +1773,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
*/
alloc_flags = gfp_to_alloc_flags(gfp_mask);
-restart:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
--
1.6.3.3
This is a clean revert against 2.6.31.4
==== CUT HERE ====
Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
Testing by Frans Pop indicates that, at least in the 2.6.30..2.6.31 window,
commits 373c0a7e and 8aa7e847 dramatically increased the number of
GFP_ATOMIC failures that were occurring within a wireless driver. It was
never isolated which of the changes was the exact problem and it's possible
it has been fixed since.
However, the fixes, if they exist in mainline, have not been back-ported to
-stable, so for the -stable series it might be best just to revert.
Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 8 ++++----
mm/page_alloc.c | 4 ++--
mm/vmscan.c | 8 ++++----
16 files changed, 45 insertions(+), 48 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto survive;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 99a506f..83650e0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
wakeup = (pd->write_congestion_on > 0
&& pd->bio_queue_size <= pd->write_congestion_off);
spin_unlock(&pd->lock);
- if (wakeup) {
- clear_bdi_congested(&pd->disk->queue->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ if (wakeup)
+ clear_bdi_congested(&pd->disk->queue->backing_dev_info, WRITE);
pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
pkt_set_state(pkt, PACKET_WAITING_STATE);
@@ -2594,10 +2592,10 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
spin_lock(&pd->lock);
if (pd->write_congestion_on > 0
&& pd->bio_queue_size >= pd->write_congestion_on) {
- set_bdi_congested(&q->backing_dev_info, BLK_RW_ASYNC);
+ set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(BLK_RW_ASYNC, HZ);
+ congestion_wait(WRITE, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..c72a8dd 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
* But don't wait if split was due to the io size restriction
*/
if (unlikely(out_of_pages))
- congestion_wait(BLK_RW_ASYNC, HZ/100);
+ congestion_wait(WRITE, HZ/100);
/*
* With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..b28ea64 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
return 0;
}
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6484eb7..f58ecbc 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -286,8 +286,8 @@ __releases(&fc->lock)
}
if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
fc->connected && fc->bdi_initialized) {
- clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ clear_bdi_congested(&fc->bdi, READ);
+ clear_bdi_congested(&fc->bdi, WRITE);
}
fc->num_background--;
fc->active_background--;
@@ -414,8 +414,8 @@ static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
fc->blocked = 1;
if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
fc->bdi_initialized) {
- set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ set_bdi_congested(&fc->bdi, READ);
+ set_bdi_congested(&fc->bdi, WRITE);
}
list_add_tail(&req->list, &fc->bg_queue);
flush_bg_queue(fc);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a34fae2..5693fcd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -200,10 +200,8 @@ static int nfs_set_page_writeback(struct page *page)
struct nfs_server *nfss = NFS_SERVER(inode);
if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ NFS_CONGESTION_ON_THRESH)
+ set_bdi_congested(&nfss->backing_dev_info, WRITE);
}
return ret;
}
@@ -215,7 +213,7 @@ static void nfs_end_page_writeback(struct page *page)
end_page_writeback(page);
if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+ clear_bdi_congested(&nfss->backing_dev_info, WRITE);
}
/*
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 9062220..77f5bb7 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
if (atomic_read(&j->j_async_throttle))
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
return 0;
}
diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 2d3f90a..1cd3b55 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..178c20c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
XFS_STATS_INC(xb_page_retries);
xfsbufd_wakeup(0, gfp_mask);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto retry;
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..0ec2c59 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -229,14 +229,9 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << BDI_async_congested));
}
-enum {
- BLK_RW_ASYNC = 0,
- BLK_RW_SYNC = 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-long congestion_wait(int sync, long timeout);
+void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
+void set_bdi_congested(struct backing_dev_info *bdi, int rw);
+long congestion_wait(int rw, long timeout);
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..998c8e0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -70,6 +70,11 @@ enum rq_cmd_type_bits {
REQ_TYPE_ATA_PC,
};
+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
/*
* For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
* sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -775,18 +780,18 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
* congested queues, and wake up anyone who was waiting for requests to be
* put back.
*/
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
+static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
{
- clear_bdi_congested(&q->backing_dev_info, sync);
+ clear_bdi_congested(&q->backing_dev_info, rw);
}
/*
* A queue has just entered congestion. Flag that in the queue's VM-visible
* state flags and increment the global gounter of congested queues.
*/
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
+static inline void blk_set_queue_congested(struct request_queue *q, int rw)
{
- set_bdi_congested(&q->backing_dev_info, sync);
+ set_bdi_congested(&q->backing_dev_info, rw);
}
extern void blk_start_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..493b468 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -283,6 +283,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
@@ -307,18 +308,18 @@ EXPORT_SYMBOL(set_bdi_congested);
/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @rw: READ or WRITE
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int rw, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
+ wait_queue_head_t *wqh = &congestion_wqh[rw];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..834509f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1990,7 +1990,7 @@ try_to_free:
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..7687879 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
if (pages_written >= write_chunk)
break; /* We've done our duty */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The caller might hold locks which can prevent IO completion
@@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
else
break;
}
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f7e52af..68f75da 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1684,7 +1684,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
/*
@@ -1855,7 +1855,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto rebalance;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a7eafdb..0e66a6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1109,7 +1109,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (nr_freed < nr_taken && !current_is_kswapd() &&
lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The attempt at page out may have made some
@@ -1726,7 +1726,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -1974,7 +1974,7 @@ loop_again:
* another pass across the zones.
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* We do this so kswapd doesn't build up large priorities for
@@ -2247,7 +2247,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
goto out;
if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
}
}
--
1.6.3.3
On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <[email protected]> wrote:
> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> allocation afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
>
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
>
> Signed-off-by: Mel Gorman <[email protected]>
You seem to have dropped the Reviewed-by tags from me and Christoph
for this patch.
> mm/page_alloc.c |    2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf72055..dfa4362 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> 		goto nopage;
>
> +restart:
> 	wake_all_kswapd(order, zonelist, high_zoneidx);
>
> -restart:
> 	/*
> 	 * OK, we're below the kswapd watermark and have kicked background
> 	 * reclaim. Now things get more complex, so set up alloc_flags according
> --
> 1.6.3.3
>
On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <[email protected]> wrote:
> Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
>
> Test 2: Apply the following two patches and test again
>
> 1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
> Test 5: If things are still screwed, apply the following
> 5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
>
>        Frans Pop reports that the bulk of his problems go away when this
>        patch is reverted on 2.6.31. There has been some confusion on why
>        exactly this patch was wrong but apparently the conversion was not
>        complete and further work was required. It's unknown if all the
>        necessary work exists in 2.6.32-rc5 or not. If there are still
>        allocation failures and applying this patch fixes the problem,
>        there are still snags that need to be ironed out.
As explained by Jens Axboe, this changes timing but is not the source
of the OOMs so the revert is bogus even if it "helps" on some
workloads. IIRC the person who reported the revert to help things did
report that the OOMs did not go away, they were simply harder to
trigger with the revert.
On Thu, 2009-10-22 at 07:22 -0700, Mel Gorman wrote:
> [Bug #14141] order 2 page allocation failures in iwlagn
> Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
> allocations within the wireless driver. This has caused large numbers
> of failure reports to occur as reported by Frans Pop. Fixing this
> requires changes to the driver if it wants to use GFP_ATOMIC which
> is in the hands of Mohamed Abbas and Reinette Chatre. However,
> it is very likely that it has been compounded by core mm changes
> that this series is aimed at.
The driver has been changed to allocate paged skbs for its receive buffers.
This reduces the amount of memory needed from order-2 to order-1. This work
is significant and will thus be in 2.6.33.
Reinette
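(Editorial aside: "paged" receive buffers lower the allocation order because
only each fragment, not the whole frame, needs to be physically contiguous.
A rough illustrative sketch, not the actual iwlagn code; RX_BUF_SIZE, hdr_len
and rx_page_order are hypothetical names:)

	/* Linear skb: the whole receive buffer is one contiguous
	 * allocation, so a large buffer forces a high-order allocation. */
	skb = alloc_skb(RX_BUF_SIZE, GFP_ATOMIC);

	/* Paged skb: a small linear header plus page fragments, so each
	 * piece is a lower-order allocation. */
	skb = alloc_skb(hdr_len, GFP_ATOMIC);
	page = alloc_pages(GFP_ATOMIC, rx_page_order);	/* e.g. order-1 */
	skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE << rx_page_order);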
On Thu, Oct 22, 2009 at 05:41:53PM +0300, Pekka Enberg wrote:
> On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <[email protected]> wrote:
> > If a direct reclaim makes no forward progress, it considers whether it
> > should go OOM or not. Whether OOM is triggered or not, it may retry the
> > allocation afterwards. In times past, this would always wake kswapd as well
> > but currently, kswapd is not woken up after direct reclaim fails. For order-0
> > allocations, this makes little difference but if there is a heavy mix of
> > higher-order allocations that direct reclaim is failing for, it might mean
> > that kswapd is not rewoken for higher orders as much as it did previously.
> >
> > This patch wakes up kswapd when an allocation is being retried after a direct
> > reclaim failure. It would be expected that kswapd is already awake, but
> > this has the effect of telling kswapd to reclaim at the higher order as well.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> You seem to have dropped the Reviewed-by tags from me and Christoph
> for this patch.
>
My apologies. I missed them when going through the old mails.
> > mm/page_alloc.c |    2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bf72055..dfa4362 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> > 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> > 		goto nopage;
> >
> > +restart:
> > 	wake_all_kswapd(order, zonelist, high_zoneidx);
> >
> > -restart:
> > 	/*
> > 	 * OK, we're below the kswapd watermark and have kicked background
> > 	 * reclaim. Now things get more complex, so set up alloc_flags according
> > --
> > 1.6.3.3
> >
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Oct 22, 2009 at 05:47:10PM +0300, Pekka Enberg wrote:
> On Thu, Oct 22, 2009 at 5:22 PM, Mel Gorman <[email protected]> wrote:
> > Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
> >
> > Test 2: Apply the following two patches and test again
> >
> > 1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> > 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
>
> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
>
Agreed, but I wanted to pin down exactly where we stand with this
problem before sending patches in any direction for merging.
> > Test 5: If things are still screwed, apply the following
> > 5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs read/write confusion
> >
> >        Frans Pop reports that the bulk of his problems go away when this
> >        patch is reverted on 2.6.31. There has been some confusion on why
> >        exactly this patch was wrong but apparently the conversion was not
> >        complete and further work was required. It's unknown if all the
> >        necessary work exists in 2.6.32-rc5 or not. If there are still
> >        allocation failures and applying this patch fixes the problem,
> >        there are still snags that need to be ironed out.
>
> As explained by Jens Axboe, this changes timing but is not the source
> of the OOMs so the revert is bogus even if it "helps" on some
> workloads. IIRC the person who reported the revert to help things did
> report that the OOMs did not go away, they were simply harder to
> trigger with the revert.
>
IIRC, there were mixed reports as to how much the revert helped. I'm hoping
that patches 1+2 cover the bases, which is why I asked for them to be tested
on their own. Patch 2 in particular might be responsible for watermarks being
impacted enough to cause timing problems. I left the revert as patch 5, a
standalone test, to see how much of a factor the introduced timing changes
are if there are still allocation problems.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, 22 Oct 2009 15:22:33 +0100
Mel Gorman <[email protected]> wrote:
> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.
>
> [[email protected]: Spotted the problem]
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> ---
> mm/page_alloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfa4362..7f2aa3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> */
> alloc_flags &= ~ALLOC_CPUSET;
> - } else if (unlikely(rt_task(p)))
> + } else if (unlikely(rt_task(p)) && !in_interrupt())
> alloc_flags |= ALLOC_HARDER;
>
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> --
> 1.6.3.3
>
Is it correct that this one applies with an offset of -54 lines on 2.6.31.4?
--
Regards,
Stephan
On Thu, Oct 22, 2009 at 06:33:03PM +0200, Stephan von Krawczynski wrote:
> On Thu, 22 Oct 2009 15:22:33 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> > slightly by allowing rt_tasks that are handling an interrupt to set
> > ALLOC_HARDER. This patch brings the watermark logic more in line with
> > 2.6.30.
> >
> > [[email protected]: Spotted the problem]
> > Signed-off-by: Mel Gorman <[email protected]>
> > Reviewed-by: Pekka Enberg <[email protected]>
> > ---
> > mm/page_alloc.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index dfa4362..7f2aa3e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> > * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> > */
> > alloc_flags &= ~ALLOC_CPUSET;
> > - } else if (unlikely(rt_task(p)))
> > + } else if (unlikely(rt_task(p)) && !in_interrupt())
> > alloc_flags |= ALLOC_HARDER;
> >
> > if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> > --
> > 1.6.3.3
> >
>
> Is it correct that this one applies offset -54 lines in 2.6.31.4 ?
>
In this case, it's ok. It's just a harmless heads-up that the kernel
looks slightly different than expected. I posted a 2.6.31.4 version of
the two patches that cause real problems.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, 22 Oct 2009, Mel Gorman wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f2aa3e..851df40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1596,6 +1596,17 @@ try_next_zone:
> return page;
> }
>
> +static inline
> +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> + enum zone_type high_zoneidx)
> +{
> + struct zoneref *z;
> + struct zone *zone;
> +
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> + wakeup_kswapd(zone, order);
> +}
> +
> static inline int
> should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> @@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (!page && (gfp_mask & __GFP_NOFAIL));
>
> - return page;
> -}
> -
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> - enum zone_type high_zoneidx)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> + /*
> + * If after a high-order allocation we are now below watermarks,
> + * pre-emptively kick kswapd rather than having the next allocation
> + * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
> + * allocations or entering direct reclaim
> + */
> + if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
> + preferred_zone->watermark[ALLOC_WMARK_LOW],
> + zone_idx(preferred_zone), ALLOC_WMARK_LOW))
> + wake_all_kswapd(order, zonelist, high_zoneidx);
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + return page;
> }
>
> static inline int
Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
By the patch description I was expecting kswapd to be woken up
preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
we're known to have just allocated at a higher order, not just when
current was oom killed (when we should already be freeing a _lot_ of
memory soon) or is doing a higher order allocation during direct reclaim.
For the best coverage, it would have to be add the branch to the fastpath.
That seems fine as a debugging aid and to see if progress is being made
on the GFP_ATOMIC allocation issues, but it doesn't seem like it should make
its way to mainline: the subsequent GFP_ATOMIC allocation could already be
happening, and be in the page allocator's slowpath, by the point that this
wakeup occurs, making it unnecessary.
If this is moved to the fastpath, why is this wake_all_kswapd() and not
wakeup_kswapd(preferred_zone, order)? Do we need to kick kswapd in all
zones even though they may be free just because preferred_zone is now
below the watermark?
Wouldn't it be better to do this on page_zone(page) instead of
preferred_zone anyway?
On Thu, Oct 22 2009, Mel Gorman wrote:
> Testing by Frans Pop indicates that in the 2.6.30..2.6.31 window at
> least that the commits 373c0a7e 8aa7e847 dramatically increased the
> number of GFP_ATOMIC failures that were occuring within a wireless
> driver. It was never isolated which of the changes was the exact problem
> and it's possible it has been fixed since. If problems are still
> occuring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
> applied to determine if the congestion_wait() callers are still broken.
I still think this is a complete red herring.
--
Jens Axboe
Mel Gorman wrote on Thursday, 22 October at 16:22:
> [No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
> This apparently is easily reproducible, particular in comparison to
> the other reports. The point of greatest interest is that this is
> order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
> will be able to follow the tests below as you are the most likely
> person to have an easily reproducible situation.
I will see what I can do on the weekend. Unfortunately the crash happens on
a somewhat important machine and afterwards the Software-RAID needs a resync
which takes a few hours.
Sven
--
"Those who do not understand Unix are condemned to reinvent it, poorly"
(Henry Spencer)
/me is giggls@ircnet, http://sven.gegg.us/ on the Web
On Thu, Oct 22, 2009 at 12:41:42PM -0700, David Rientjes wrote:
> On Thu, 22 Oct 2009, Mel Gorman wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f2aa3e..851df40 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1596,6 +1596,17 @@ try_next_zone:
> > return page;
> > }
> >
> > +static inline
> > +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> > + enum zone_type high_zoneidx)
> > +{
> > + struct zoneref *z;
> > + struct zone *zone;
> > +
> > + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > + wakeup_kswapd(zone, order);
> > +}
> > +
> > static inline int
> > should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > unsigned long pages_reclaimed)
> > @@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > } while (!page && (gfp_mask & __GFP_NOFAIL));
> >
> > - return page;
> > -}
> > -
> > -static inline
> > -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> > - enum zone_type high_zoneidx)
> > -{
> > - struct zoneref *z;
> > - struct zone *zone;
> > + /*
> > + * If after a high-order allocation we are now below watermarks,
> > + * pre-emptively kick kswapd rather than having the next allocation
> > + * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
> > + * allocations or entering direct reclaim
> > + */
> > + if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
> > + preferred_zone->watermark[ALLOC_WMARK_LOW],
> > + zone_idx(preferred_zone), ALLOC_WMARK_LOW))
> > + wake_all_kswapd(order, zonelist, high_zoneidx);
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - wakeup_kswapd(zone, order);
> > + return page;
> > }
> >
> > static inline int
>
> Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
> By the patch description I was expecting kswapd to be woken up
> preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
> we're known to have just allocated at a higher order, not just when
> current was oom killed (when we should already be freeing a _lot_ of
> memory soon) or is doing a higher order allocation during direct reclaim.
>
It was a somewhat arbitrary choice to have it trigger in the event that
high-priority allocations were happening frequently.
> For the best coverage, it would have to be add the branch to the fastpath.
Agreed - specifically at the end of __alloc_pages_nodemask()
> That seems fine for a debugging aid and to see if progress is being made
> on the GFP_ATOMIC allocation issues, but doesn't seem like it should make
> its way to mainline, the subsequent GFP_ATOMIC allocation could already be
> happening and in the page allocator's slowpath at this point that this
> wakeup becomes unnecessary.
>
> If this is moved to the fastpath, why is this wake_all_kswapd() and not
> wakeup_kswapd(preferred_zone, order)? Do we need to kick kswapd in all
> zones even though they may be free just because preferred_zone is now
> below the watermark?
>
It probably makes no difference as zones are checked for their watermarks
before any real work happens. However, even if this patch makes a difference,
I don't want to see it merged. At best, it is an extremely heavy-handed
hack, which is why I asked for it to be tested in isolation. It shouldn't
be necessary at all because this sort of pre-emptive waking of kswapd was
never necessary before.
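(For reference, waking every zone's kswapd is close to free when watermarks
are fine -- paraphrased from the 2.6.31-era wakeup_kswapd() in mm/vmscan.c,
not an exact copy:)

	void wakeup_kswapd(struct zone *zone, int order)
	{
		pg_data_t *pgdat;

		if (!populated_zone(zone))
			return;

		pgdat = zone->zone_pgdat;
		/* Nothing to do if the watermark for this order is met */
		if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, 0))
			return;
		/* Remember the highest order anyone has asked for */
		if (pgdat->kswapd_max_order < order)
			pgdat->kswapd_max_order = order;
		if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
			return;
		if (!waitqueue_active(&pgdat->kswapd_wait))
			return;
		wake_up_interruptible(&pgdat->kswapd_wait);
	}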
> Wouldn't it be better to do this on page_zone(page) instead of
> preferred_zone anyway?
>
No. The preferred_zone is the zone we should be allocating from. If we
failed to allocate from it, it implies the watermarks are not being met
so we want to wake it.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, 23 Oct 2009, Mel Gorman wrote:
> > Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
> > By the patch description I was expecting kswapd to be woken up
> > preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
> > we're known to have just allocated at a higher order, not just when
> > current was oom killed (when we should already be freeing a _lot_ of
> > memory soon) or is doing a higher order allocation during direct reclaim.
> >
>
> It was a somewhat arbitrary choice to have it trigger in the event high
> priority allocations were happening frequently.
>
I don't quite understand, users of PF_MEMALLOC shouldn't be doing these
higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom
killer, we should be freeing a substantial amount of memory imminently
when it exits, such that waking up kswapd would be irrelevant.
> > If this is moved to the fastpath, why is this wake_all_kswapd() and not
> > wakeup_kswapd(preferred_zone, order)? Do we need to kick kswapd in all
> > zones even though they may be free just because preferred_zone is now
> > below the watermark?
> >
>
> It probably makes no difference as zones are checked for their watermarks
> before any real work happens. However, even if this patch makes a difference,
> I don't want to see it merged. At best, it is an extremely heavy-handed
> hack, which is why I asked for it to be tested in isolation. It shouldn't
> be necessary at all because this sort of pre-emptive waking of kswapd was
> never necessary before.
>
Ahh, that makes a ton more sense: this particular patch is a debugging
effort while the first two are candidates for 2.6.32 and -stable. Gotcha.
> > Wouldn't it be better to do this on page_zone(page) instead of
> > preferred_zone anyway?
> >
>
> No. The preferred_zone is the zone we should be allocating from. If we
> failed to allocate from it, it implies the watermarks are not being met
> so we want to wake it.
>
Oops, I'm even more confused now :) I thought the existing
wake_all_kswapd() in the slowpath was doing that and that this patch was
waking them prematurely because it speculates that a subsequent high
order allocation will fail unless memory is reclaimed. I thought we'd
want to reclaim from the zone we just did a high order allocation from so
that the fastpath could find the memory next time with ALLOC_WMARK_LOW.
On Thu, 22 Oct 2009 17:37:52 +0100
Mel Gorman <[email protected]> wrote:
> In this case, it's ok. It's just a harmless heads-up that the kernel
> looks slightly different than expected. I posted a 2.6.31.4 version of
> the two patches that cause real problems.
After around 12 hours of runtime with patch 1/5 and 2/5 on 2.6.31.4 I can see
no page allocation failure messages so far. I'll keep you informed.
--
Regards,
Stephan
On Fri, Oct 23, 2009 at 02:36:53AM -0700, David Rientjes wrote:
> On Fri, 23 Oct 2009, Mel Gorman wrote:
>
> > > Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
> > > By the patch description I was expecting kswapd to be woken up
> > > preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
> > > we're known to have just allocated at a higher order, not just when
> > > current was oom killed (when we should already be freeing a _lot_ of
> > > memory soon) or is doing a higher order allocation during direct reclaim.
> > >
> >
> > It was a somewhat arbitrary choice to have it trigger in the event high
> > priority allocations were happening frequently.
> >
>
> I don't quite understand, users of PF_MEMALLOC shouldn't be doing these
> higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom
> killer, we should be freeing a substantial amount of memory imminently
> when it exits, such that waking up kswapd would be irrelevant.
>
I agree. I think it's highly unlikely this patch will make any
difference but I wanted to eliminate it as a possibility. Patches 3 and 4
were previously one patch and were tested together.
> > > If this is moved to the fastpath, why is this wake_all_kswapd() and not
> > > wakeup_kswapd(preferred_zone, order)? Do we need to kick kswapd in all
> > > zones even though they may be free just because preferred_zone is now
> > > below the watermark?
> > >
> >
> > It probably makes no difference as zones are checked for their watermarks
> > before any real work happens. However, even if this patch makes a difference,
> > I don't want to see it merged. At best, it is an extremely heavy-handed
> > hack, which is why I asked for it to be tested in isolation. It shouldn't
> > be necessary at all because this sort of pre-emptive waking of kswapd was
> > never necessary before.
> >
>
> Ahh, that makes a ton more sense: this particular patch is a debugging
> effort while the first two are candidates for 2.6.32 and -stable. Gotcha.
>
Yep.
> > > Wouldn't it be better to do this on page_zone(page) instead of
> > > preferred_zone anyway?
> > >
> >
> > No. The preferred_zone is the zone we should be allocating from. If we
> > failed to allocate from it, it implies the watermarks are not being met
> > so we want to wake it.
> >
>
> Oops, I'm even more confused now :) I thought the existing
> wake_all_kswapd() in the slowpath was doing that and that this patch was
> waking them prematurely because it speculates that a subsequent high
> order allocation will fail unless memory is reclaimed.
It should be doing that. This patch should be junk but because it was tested
previously, I needed to be sure it was actually junk.
> I thought we'd
> want to reclaim from the zone we just did a high order allocation from so
> that the fastpath could find the memory next time with ALLOC_WMARK_LOW.
The fastpath should be getting the pages it needs from the
preferred_zone. If it's not, we still want to get pages back both in that
zone and in the zone we actually ended up getting pages from.
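(For context, preferred_zone in the 2.6.31-era fastpath is simply the first
usable zone of the zonelist -- excerpt from __alloc_pages_nodemask(), trimmed:)

	/* The preferred zone is used for statistics later */
	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
	if (!preferred_zone)
		return NULL;

	/* First allocation attempt */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
			preferred_zone, migratetype);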
It's probably best to ignore this patch except in the unlikely event Tobias
says it makes a difference to his testing. I'm hoping he's covered by patches
1+2 and maybe 3 and that patches 4 and 5 of this set get consigned to
the bit bucket.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel,
Today Mel Gorman wrote:
> On Fri, Oct 23, 2009 at 02:36:53AM -0700, David Rientjes wrote:
> > On Fri, 23 Oct 2009, Mel Gorman wrote:
> >
> > > > Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
> > > > By the patch description I was expecting kswapd to be woken up
> > > > preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
> > > > we're known to have just allocated at a higher order, not just when
> > > > current was oom killed (when we should already be freeing a _lot_ of
> > > > memory soon) or is doing a higher order allocation during direct reclaim.
> > > >
> > >
> > > It was a somewhat arbitrary choice to have it trigger in the event high
> > > priority allocations were happening frequently.
> > >
> >
> > I don't quite understand, users of PF_MEMALLOC shouldn't be doing these
> > higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom
> > killer, we should be freeing a substantial amount of memory imminently
> > when it exits, such that waking up kswapd would be irrelevant.
> >
>
> I agree. I think it's highly unlikely this patch will make any
> difference but I wanted to eliminate it as a possibility. Patches 3 and 4
> were previously one patch and were tested together.
hi hi ... I have tested '3 only' this morning, and the allocation
problems started again ... so for me 3 alone does not work while
3+4 does.
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Fri, Oct 23, 2009 at 01:31:10PM +0200, Tobias Oetiker wrote:
> Mel,
>
> Today Mel Gorman wrote:
>
> > On Fri, Oct 23, 2009 at 02:36:53AM -0700, David Rientjes wrote:
> > > On Fri, 23 Oct 2009, Mel Gorman wrote:
> > >
> > > > > Hmm, is this really supposed to be added to __alloc_pages_high_priority()?
> > > > > By the patch description I was expecting kswapd to be woken up
> > > > > preemptively whenever the preferred zone is below ALLOC_WMARK_LOW and
> > > > > we're known to have just allocated at a higher order, not just when
> > > > > current was oom killed (when we should already be freeing a _lot_ of
> > > > > memory soon) or is doing a higher order allocation during direct reclaim.
> > > > >
> > > >
> > > > It was a somewhat arbitrary choice to have it trigger in the event high
> > > > priority allocations were happening frequently.
> > > >
> > >
> > > I don't quite understand, users of PF_MEMALLOC shouldn't be doing these
> > > higher order allocations and if ALLOC_NO_WATERMARKS is by way of the oom
> > > killer, we should be freeing a substantial amount of memory imminently
> > > when it exits, such that waking up kswapd would be irrelevant.
> > >
> >
> > I agree. I think it's highly unlikely this patch will make any
> > difference but I wanted to eliminate it as a possibility. Patches 3 and 4
> > were previously one patch and were tested together.
>
> hi hi ... I have tested '3 only' this morning, and the allocation
> problems started again ... so for me 3 alone does not work while
> 3+4 does.
>
Hi,
What was the outcome of 1+2?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote:
[Cut everything but my bug]
> [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
> Karol Lewandows reported that e100 fails to allocate order-5
> GFP_ATOMIC when loading firmware during resume. This has started
> happening relatively recent.
> Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
Yes, bug is still there.
> Test 2: Apply the following two patches and test again
>
> 1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
>
>
> These patches correct problems introduced by me during the 2.6.31-rc1
> merge window. The patches were not meant to introduce any functional
> changes but two were missed.
>
> If your problem goes away with just these two patches applied,
> please tell me.
Likewise.
> Test 3: If you are getting allocation failures, try with the following patch
>
> 3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
>
> This is a functional change that causes kswapd to notice sooner
> when high-order watermarks have been hit. There have been a number
> of changes in page reclaim since 2.6.30 that might have delayed
> when kswapd kicks in for higher orders
>
> If your problem goes away with these three patches applied, please
> tell me
No, problem doesn't go away with these patches (1+2+3). However, from
my testing this particular patch makes it way, way harder to trigger
allocation failures (but these are still present).
This bothers me - should I test the following patches with or without
the above patch? This patch makes the bug harder to trigger; IMVHO it
doesn't fix the real problem.
(Rest not tested yet.)
Thanks.
On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
> On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote:
> > Test 3: If you are getting allocation failures, try with the following patch
> >
> > 3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
> No, problem doesn't go away with these patches (1+2+3). However, from
> my testing this particular patch makes it way, way harder to trigger
> allocation failures (but these are still present).
>
> This bothers me - should I test the following patches with or without
> the above patch? This patch makes the bug harder to trigger; IMVHO it
> doesn't fix the real problem.
..
> Test 4: If you are still getting failures, apply the following
> 4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
Ok, I've tested patches 1+2+4 and the bug, while very hard to trigger, is
still present. I'll test complete 1-4 patchset as time permits.
Thanks.
On Thu, 22 Oct 2009, Pekka Enberg wrote:
> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
Bug fixes go into mainline, not linux-next. Let's make sure these fixes really
work and then merge.
There are now rt dependencies in the page allocator that screw things up?
And an rt flag causes the page allocator to try harder meaning it adds
latency.
?
On Thu, 22 Oct 2009, Pekka Enberg wrote:
>> These are pretty obvious bug fixes and should go to linux-next ASAP IMHO.
Christoph Lameter wrote:
> Bug fixes go into mainline, not linux-next. Let's make sure these fixes really
> work and then merge.
Regardless, patches 1-2 should _really_ go to Linus' tree (and
eventually -stable) while we figure out the rest of the problems. They
fix obvious regressions in the code paths and we have reports from
people that they help. Yes, they don't fix everything for everyone, but
there's no upside in holding back fixes that are simple one-line
fixes to regressions.
Pekka
Hi,
This is the same Mel as [email protected]. The mail server the address is
on has no power until Tuesday so I'm going to be very unresponsive
until then. Monday is also a public holiday here and apparently they
are upgrading the power transformers near the building.
On 10/23/09, Karol Lewandowski <[email protected]> wrote:
> On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
>> On Thu, Oct 22, 2009 at 03:22:31PM +0100, Mel Gorman wrote:
>> > Test 3: If you are getting allocation failures, try with the following
>> > patch
>> >
>> > 3/5 vmscan: Force kswapd to take notice faster when high-order
>> > watermarks are being hit
>
>> No, problem doesn't go away with these patches (1+2+3). However, from
>> my testing this particular patch makes it way, way harder to trigger
>> allocation failures (but these are still present).
>>
>> This bothers me - should I test the following patches with or without
>> the above patch? This patch makes the bug harder to trigger; IMVHO it
>> doesn't fix the real problem.
> ..
>
>> Test 4: If you are still getting failures, apply the following
>> 4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks
>> are hit
>
> Ok, I've tested patches 1+2+4 and the bug, while very hard to trigger, is
> still present. I'll test complete 1-4 patchset as time permits.
>
And also patch 5 please which is the revert. Patch 5 as pointed out is
probably a red herring. However, it has changed the timing and made a
difference for some testing so I'd like to know if it helps yours as
well.
As things stand, it looks like patches 1+2 should certainly go ahead.
I need to give more thought on patches 3 and 4 as to why they help
Tobias but not anyone else's testing.
Thanks
On Thursday 22 October 2009, Mel Gorman wrote:
> Since 2.6.31-rc1, there have been an increasing number of GFP_ATOMIC
> failures. A significant number of these have been high-order GFP_ATOMIC
> failures and while they are generally brushed away, there has been a
> large increase in them recently and there are a number of possible areas
> the problem could be in - core vm, page writeback and a specific driver.
I needed a break and have thus been off-line for a few days. Good to see
there's been progress. I'll try to do some testing tomorrow.
Cheers,
FJP
Mel Gorman schrieb am Donnerstag, den 22. Oktober um 16:22 Uhr:
> Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
Problem persists. RAID resync in progress :(
Sven
--
"linux is evolution, not intelligent design"
(Linus Torvalds)
/me is giggls@ircnet, http://sven.gegg.us/ on the Web
On Thu, 22 Oct 2009 17:37:52 +0100
Mel Gorman <[email protected]> wrote:
> In this case, it's ok. It's just a harmless heads-up that the kernel
> looks slightly different than expected. I posted a 2.6.31.4 version of
> the two patches that cause real problems.
After around 12 hours of runtime with patch 1/5 and 2/5 on 2.6.31.4 I can see
no page allocation failure messages so far. I'll keep you informed.
[Update]
After 2 days I still see no failure messages.
[/update]
--
Regards,
Stephan
> If a direct reclaim makes no forward progress, it considers whether it
> should go OOM or not. Whether OOM is triggered or not, it may retry the
> application afterwards. In times past, this would always wake kswapd as well
> but currently, kswapd is not woken up after direct reclaim fails. For order-0
> allocations, this makes little difference but if there is a heavy mix of
> higher-order allocations that direct reclaim is failing for, it might mean
> that kswapd is not rewoken for higher orders as much as it did previously.
>
> This patch wakes up kswapd when an allocation is being retried after a direct
> reclaim failure. It would be expected that kswapd is already awake, but
> this has the effect of telling kswapd to reclaim at the higher order as well.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/page_alloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf72055..dfa4362 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
> goto nopage;
>
> +restart:
> wake_all_kswapd(order, zonelist, high_zoneidx);
>
> -restart:
> /*
> * OK, we're below the kswapd watermark and have kicked background
> * reclaim. Now things get more complex, so set up alloc_flags according
I think this patch is correct. Personally, I'd like to add some comment
at the restart label, but it isn't a big matter.
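For instance, something along these lines (hypothetical wording):

	restart:
		/*
		 * kswapd may have gone back to sleep while direct reclaim was
		 * failing, so make sure it is awake for this order as well
		 */
		wake_all_kswapd(order, zonelist, high_zoneidx);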
Reviewed-by: KOSAKI Motohiro <[email protected]>
However, I have a question. The __alloc_pages_slowpath() retry logic is:
1. try_to_free_pages() reclaimed some pages:
-> wait awhile and goto rebalance
2. try_to_free_pages() didn't reclaim any page:
-> call out_of_memory() and goto restart
Then, shouldn't case-1 be fixed too?
I mean,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..5a27896 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1899,6 +1899,12 @@ rebalance:
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+ /*
+ * While we were in congestion_wait(), the amount of free memory can
+ * change dramatically. Thus, we kick kswapd again.
+ */
+ wake_all_kswapd(order, zonelist, high_zoneidx);
goto rebalance;
}
-------------------------------------------
?
> Commit 341ce06f69abfafa31b9468410a13dbd60e2b237 altered watermark logic
> slightly by allowing rt_tasks that are handling an interrupt to set
> ALLOC_HARDER. This patch brings the watermark logic more in line with
> 2.6.30.
>
> [[email protected]: Spotted the problem]
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Pekka Enberg <[email protected]>
> ---
> mm/page_alloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dfa4362..7f2aa3e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1769,7 +1769,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
> */
> alloc_flags &= ~ALLOC_CPUSET;
> - } else if (unlikely(rt_task(p)))
> + } else if (unlikely(rt_task(p)) && !in_interrupt())
> alloc_flags |= ALLOC_HARDER;
>
> if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> --
> 1.6.3.3
good catch.
Reviewed-by: KOSAKI Motohiro <[email protected]>
On Mon, 26 Oct 2009, KOSAKI Motohiro wrote:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf72055..5a27896 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1899,6 +1899,12 @@ rebalance:
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> +
> + /*
> + * While we were in congestion_wait(), the amount of free memory can
> + * change dramatically. Thus, we kick kswapd again.
> + */
> + wake_all_kswapd(order, zonelist, high_zoneidx);
> goto rebalance;
> }
>
We're blocking to finish writeback of the directly reclaimed memory, why
do we need to wake kswapd afterwards?
Hi Mel,
I have now done additional tests ... and can report the following
Thursday Mel Gorman wrote:
> 1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
>
>
> These patches correct problems introduced by me during the 2.6.31-rc1
> merge window. The patches were not meant to introduce any functional
> changes but two were missed.
>
> If your problem goes away with just these two patches applied,
> please tell me.
1+2 do not help
> Test 3: If you are getting allocation failures, try with the following patch
>
> 3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
>
> This is a functional change that causes kswapd to notice sooner
> when high-order watermarks have been hit. There have been a number
> of changes in page reclaim since 2.6.30 that might have delayed
> when kswapd kicks in for higher orders
>
> If your problem goes away with these three patches applied, please
> tell me
1+2+3 do not help either
> Test 4: If you are still getting failures, apply the following
> 4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
>
> This patch is very heavy handed and pre-emptively kicks kswapd when
> watermarks are hit. It should only be necessary if there has been
> significant changes in the timing and density of page allocations
> from an unknown source. Tobias, this patch is largely aimed at you.
> You reported that with patches 3+4 applied that your problems went
> away. I need to know if patch 3 on its own is enough or if both
> are required
>
> If your problem goes away with these four patches applied, please
> tell me
3 alone does not help
3+4 does ...
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Thursday 22 October 2009, Mel Gorman wrote:
> Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
I've tested against 2.6.31.1 as it's easier for me to compare behaviors
with that than with .32. All patches applied without problems against .31.
I've also tested 2.6.31.1 with SLAB instead of SLUB, but that does not seem
to make a significant difference for my test.
> Test 2: Apply the following two patches and test again
> 1/5 page allocator: Always wake kswapd when restarting an allocation
> attempt after direct reclaim failed
> 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
Does not look to make any difference. Possibly causes more variation in the
duration of the test (increases timing effects)?
> Test 3: If you are getting allocation failures, try with the following
> patch
> 3/5 vmscan: Force kswapd to take notice faster when high-order
> watermarks are being hit
Applied on top of patches 1-2. Does not look to make any difference.
> Test 4: If you are still getting failures, apply the following
> 4/5 page allocator: Pre-emptively wake kswapd when high-order
> watermarks are hit
Applied on top of patches 1-3. Does not look to make any difference.
> Test 5: If things are still screwed, apply the following
> 5/5 Revert 373c0a7e, 8aa7e847: Fix congestion_wait() sync/async vs
> read/write confusion
Applied on top of patches 1-4. Despite Jens' scepticism, this is still the
patch that makes the most significant difference in my test.
The reading of commits in gitk is much smoother and music skips are a
lot less severe. But most important is that there is no long total freeze
of the system halfway through the reading of commits, and gitk loads
fastest. It also gives by far the most consistent results.
The likelihood of SKB allocation errors during the test is a lot smaller.
See also http://lkml.org/lkml/2009/10/26/455.
Detailed test results follow. I've done 2 test runs with each kernel (3 for
the last).
The columns below give the following info:
- time at which all commits have been read by gitk
- time at which gitk fills in "branch", "follows" and "precedes" data for
the current commit
- time at which there's no longer any disk activity, i.e. when gitk is
fully loaded and all swapping is done
- total number of SKB allocation errors during the test
A "freeze" during the reading of commits is indicated by an "f" (short
freeze) or "F" (long "hard" freeze). An "S" shows when there were SKB
allocation errors.
              end commits  show branch  done  SKB errs
1) vanilla .31.1
   run 1:     1:20 fFS     2:10 S       2:30     44  a)
   run 2:     1:35 FS      1:45         2:10     13
2) .31.1 + patches 1-2
   run 1:     2:30 fFS     2:45         3:00     58
   run 2:     1:15 fS      2:00         2:20      2  a)
3) .31.1 + patches 1-3
   run 1:     1:00 fS      1:15         1:45      1  *)
   run 2:     3:00 fFS     3:15         3:30     33
   *) unexpected; fortunate timing?
4) .31.1 + patches 1-4
   run 1:     1:10 ffS     1:55 S       2:20     35  a)
   run 2:     3:05 fFS     3:15         3:25     36
5) .31.1 + patches 1-5
   run 1:     1:00         1:15         1:35      0
   run 2:     0:50         1:15 S       1:45     45  *)
   run 3:     1:00         1:15         1:45      0
   *) unexpected; unfortunate timing?
a) fast in 1st phase; slow in 2nd and 3rd
Note that without the congestion_wait() reverts, the occurrence of SKB
errors, the long freezes and the time it takes for gitk to load all seem
roughly related; with the reverts, total time is not affected even with
many SKB errors.
Cheers,
FJP
On Monday 26 October 2009, Frans Pop wrote:
> Detailed test results follow. I've done 2 test runs with each kernel (3
> for the last).
Forgot to mention that each run was after a reboot, so they are not
interdependent.
> When a high-order allocation fails, kswapd is kicked so that it reclaims
> at a higher-order to avoid direct reclaimers stalling and to help GFP_ATOMIC
> allocations. Something has changed in recent kernels that affects the timing
> where high-order GFP_ATOMIC allocations are now failing with more frequency,
> particularly under pressure. This patch forces kswapd to notice sooner that
> high-order allocations are occurring.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 9 +++++++++
> 1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64e4388..cd68109 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2016,6 +2016,15 @@ loop_again:
> priority != DEF_PRIORITY)
> continue;
>
> + /*
> + * Exit quickly to restart if it has been indicated
> + * that higher orders are required
> + */
> + if (pgdat->kswapd_max_order > order) {
> + all_zones_ok = 1;
> + goto out;
> + }
> +
> if (!zone_watermark_ok(zone, order,
> high_wmark_pages(zone), end_zone, 0))
> all_zones_ok = 0;
this is the simplest patch and seems reasonable.
Reviewed-by: KOSAKI Motohiro <[email protected]>
btw, balance_pgdat() now has too complex a flow; at least Vincent was
confused by it.
Then, I think the kswapd_max_order handling should move into balance_pgdat()
in a later release.
The following patch illustrates my proposed concept.
From 2c5be772f6db25a5ef82975960d0b5788736ec2b Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <[email protected]>
Date: Mon, 26 Oct 2009 23:25:29 +0900
Subject: [PATCH] kswapd_max_order handling move into balance_pgdat()
Signed-off-by: KOSAKI Motohiro <[email protected]>
---
mm/vmscan.c | 45 +++++++++++++++++++++------------------------
1 files changed, 21 insertions(+), 24 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64e4388..49001d3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1915,7 +1915,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
* interoperates with the page allocator fallback scheme to ensure that aging
* of pages is balanced across the zones.
*/
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
+static unsigned long balance_pgdat(pg_data_t *pgdat)
{
int all_zones_ok;
int priority;
@@ -1928,7 +1928,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.may_swap = 1,
.swap_cluster_max = SWAP_CLUSTER_MAX,
.swappiness = vm_swappiness,
- .order = order,
+ .order = 0,
.mem_cgroup = NULL,
.isolate_pages = isolate_pages_global,
};
@@ -1938,6 +1938,8 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
* free_pages == high_wmark_pages(zone).
*/
int temp_priority[MAX_NR_ZONES];
+ int order = 0;
+ int new_order;
loop_again:
total_scanned = 0;
@@ -1945,6 +1947,11 @@ loop_again:
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);
+ new_order = pgdat->kswapd_max_order;
+ pgdat->kswapd_max_order = 0;
+ if (order < new_order)
+ order = sc.order = new_order;
+
for (i = 0; i < pgdat->nr_zones; i++)
temp_priority[i] = DEF_PRIORITY;
@@ -2087,11 +2094,17 @@ out:
zone->prev_priority = temp_priority[i];
}
- if (!all_zones_ok) {
- cond_resched();
- try_to_freeze();
+ cond_resched();
+ try_to_freeze();
+ /*
+ * restart if someone wants a larger 'order' allocation
+ */
+ if (order < pgdat->kswapd_max_order)
+ goto loop_again;
+
+ if (!all_zones_ok) {
/*
* Fragmentation may mean that the system cannot be
* rebalanced for high-order allocations in all zones.
@@ -2130,7 +2143,6 @@ out:
*/
static int kswapd(void *p)
{
- unsigned long order;
pg_data_t *pgdat = (pg_data_t*)p;
struct task_struct *tsk = current;
DEFINE_WAIT(wait);
@@ -2160,32 +2172,17 @@ static int kswapd(void *p)
tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
set_freezable();
- order = 0;
for ( ; ; ) {
- unsigned long new_order;
-
prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
- new_order = pgdat->kswapd_max_order;
- pgdat->kswapd_max_order = 0;
- if (order < new_order) {
- /*
- * Don't sleep if someone wants a larger 'order'
- * allocation
- */
- order = new_order;
- } else {
- if (!freezing(current))
- schedule();
-
- order = pgdat->kswapd_max_order;
- }
+ if (!freezing(current))
+ schedule();
finish_wait(&pgdat->kswapd_wait, &wait);
if (!try_to_freeze()) {
/* We can speed up thawing tasks if we don't call
* balance_pgdat after returning from the refrigerator
*/
- balance_pgdat(pgdat, order);
+ balance_pgdat(pgdat);
}
}
return 0;
--
1.6.2.5
> On Mon, 26 Oct 2009, KOSAKI Motohiro wrote:
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bf72055..5a27896 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1899,6 +1899,12 @@ rebalance:
> > if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > /* Wait for some write requests to complete then retry */
> > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > +
> > + /*
> > + * While we were in congestion_wait(), the amount of free memory can
> > + * change dramatically. Thus, we kick kswapd again.
> > + */
> > + wake_all_kswapd(order, zonelist, high_zoneidx);
> > goto rebalance;
> > }
> >
>
> We're blocking to finish writeback of the directly reclaimed memory, why
> do we need to wake kswapd afterwards?
The same reason as the "goto restart" case; that's my intention.
If the following scenario occurs, it is equivalent to not having called wake_all_kswapd():
1. call congestion_wait()
2. kswapd reclaims lots of memory and sleeps
3. another task consumes lots of memory
4. wakeup from congestion_wait()
IOW, if we have fallen into __alloc_pages_slowpath(), we naturally expect
the next page_alloc() not to fall into the slowpath. However, if kswapd ends
its work too early, this assumption isn't true.
Is this too pessimistic an assumption?
> When a high-order allocation fails, kswapd is kicked so that it reclaims
> at a higher-order to avoid direct reclaimers stalling and to help GFP_ATOMIC
> allocations. Something has changed in recent kernels that affects the timing
> where high-order GFP_ATOMIC allocations are now failing with more frequency,
> particularly under pressure.
>
> This patch pre-emptively checks if watermarks have been hit after a
> high-order allocation completes successfully. If the watermarks have been
> reached, kswapd is woken in the hope it fixes the watermarks before the
> next GFP_ATOMIC allocation fails.
>
> Warning, this patch is somewhat of a band-aid. If this makes a difference,
> it still implies that something has changed that is either causing more
> GFP_ATOMIC allocations to occur (such as the case with iwlagn wireless
> driver) or making them more likely to fail.
Hmm, I'm confused. This description addresses generic high-order allocation,
but,
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
> 1 files changed, 22 insertions(+), 11 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7f2aa3e..851df40 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1596,6 +1596,17 @@ try_next_zone:
> return page;
> }
>
> +static inline
> +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> + enum zone_type high_zoneidx)
> +{
> + struct zoneref *z;
> + struct zone *zone;
> +
> + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> + wakeup_kswapd(zone, order);
> +}
> +
> static inline int
> should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> unsigned long pages_reclaimed)
> @@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
__alloc_pages_high_priority() is only called if ALLOC_NO_WATERMARKS is set.
ALLOC_NO_WATERMARKS means PF_MEMALLOC or TIF_MEMDIE, and GFP_ATOMIC doesn't make
nested alloc_pages() calls (i.e. doesn't create the PF_MEMALLOC case).
Then, I don't understand why this patch improves the iwlagn GFP_ATOMIC case.
hmm, maybe I missed something. I'll look at the code again tomorrow.
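(For reference, the only place ALLOC_NO_WATERMARKS gets set -- excerpt from
the 2.6.31-era gfp_to_alloc_flags(), lightly trimmed:)

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		/* Only direct reclaimers (PF_MEMALLOC) and the OOM-killed
		 * task (TIF_MEMDIE) may ignore watermarks entirely */
		if (!in_interrupt() &&
		    ((p->flags & PF_MEMALLOC) ||
		     unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}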
> congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (!page && (gfp_mask & __GFP_NOFAIL));
>
> - return page;
> -}
> -
> -static inline
> -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> - enum zone_type high_zoneidx)
> -{
> - struct zoneref *z;
> - struct zone *zone;
> + /*
> + * If after a high-order allocation we are now below watermarks,
> + * pre-emptively kick kswapd rather than having the next allocation
> + * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
> + * allocations or entering direct reclaim
> + */
> + if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
> + preferred_zone->watermark[ALLOC_WMARK_LOW],
> + zone_idx(preferred_zone), ALLOC_WMARK_LOW))
> + wake_all_kswapd(order, zonelist, high_zoneidx);
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - wakeup_kswapd(zone, order);
> + return page;
> }
>
> static inline int
> --
> 1.6.3.3
>
> Testing by Frans Pop indicates that, in the 2.6.30..2.6.31 window at
> least, the commits 373c0a7e and 8aa7e847 dramatically increased the
> number of GFP_ATOMIC failures that were occurring within a wireless
> driver. It was never isolated which of the changes was the exact problem
> and it's possible it has been fixed since. If problems are still
> occurring with GFP_ATOMIC in 2.6.31-rc5, then this patch should be
> applied to determine if the congestion_wait() callers are still broken.
Oops. no, please no.
8aa7e847 is a regression-fixing commit; this revert would make the regression
occur again.
If we really need to revert it, we need to revert 1faa16d2287 too.
However, I doubt this commit really caused the regression in iwlagn. IOW,
I agree with Jens.
I hope to try to reproduce this problem in my test environment. Can anyone
please explain how to reproduce it?
Is special hardware necessary?
----------------------------------------------------
commit 8aa7e847d834ed937a9ad37a0f2ad5b8584c1ab0
Author: Jens Axboe <[email protected]>
Date: Thu Jul 9 14:52:32 2009 +0200
Fix congestion_wait() sync/async vs read/write confusion
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <[email protected]>
On Tuesday 27 October 2009, KOSAKI Motohiro wrote:
> Oops. no, please no.
> 8aa7e847 is regression fixing commit. this revert indicate the
> regression occur again.
> if we really need to revert it, we need to revert 1faa16d2287 too.
> however, I doubt this commit really cause regression to iwlan. IOW,
> I agree Jens.
This is not intended as a patch for mainline, but just as a test to see if
it improves things. It may be a regression fix, but it also creates a
significant change in behavior during swapping in my test case.
If a fix is needed, it will probably be different from this revert.
Please read: http://lkml.org/lkml/2009/10/26/510.
This mail has some data: http://lkml.org/lkml/2009/10/26/455.
> I hope to try reproduce this problem on my test environment. Can anyone
> please explain reproduce way?
Please see my mails in this thread for bug #14141:
http://thread.gmane.org/gmane.linux.kernel/896714
You will probably need to read some of them to understand the context of
the two mails linked above.
The most relevant ones are (all from the same thread; not sure why gmane
gives such weird links):
http://article.gmane.org/gmane.linux.kernel.mm/39909
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7228
http://article.gmane.org/gmane.linux.kernel.kernel-testers/7165
> Is special hardware necessary?
Not special hardware, but you may need an encrypted partition and NFS; the
test may need to be modified according to the amount of memory you have.
I think it should be possible to reproduce the freezes I see while ignoring
the SKB allocation errors as IMO those are just a symptom, not the cause.
So you should not need wireless.
The severity of the freezes during my test often increases if the test is
repeated (without rebooting).
Cheers,
FJP
On Thu, Oct 22, 2009 at 08:43:38AM -0700, reinette chatre wrote:
> On Thu, 2009-10-22 at 07:22 -0700, Mel Gorman wrote:
> > [Bug #14141] order 2 page allocation failures in iwlagn
> > Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
> > allocations within the wireless driver. This has caused large numbers
> > of failure reports to occur as reported by Frans Pop. Fixing this
> > requires changes to the driver if it wants to use GFP_ATOMIC which
> > is in the hands of Mohamed Abbas and Reinette Chatre. However,
> > it is very likely that it has being compounded by core mm changes
> > that this series is aimed at.
>
> Driver has been changed to allocate paged skb for its receive buffers.
> This reduces amount of memory needed from order-2 to order-1. This work
> is significant and will thus be in 2.6.33.
>
What do you want to do for -stable in 2.6.31?
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Oct 27, 2009 at 11:42:55AM +0900, KOSAKI Motohiro wrote:
> > On Mon, 26 Oct 2009, KOSAKI Motohiro wrote:
> >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index bf72055..5a27896 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -1899,6 +1899,12 @@ rebalance:
> > > if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > > /* Wait for some write requests to complete then retry */
> > > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > > +
> > > + /*
> > > + * While we were in congestion_wait(), the amount of free memory can
> > > + * change dramatically. Thus, we kick kswapd again.
> > > + */
> > > + wake_all_kswapd(order, zonelist, high_zoneidx);
> > > goto rebalance;
> > > }
> > >
> >
> > We're blocking to finish writeback of the directly reclaimed memory, why
> > do we need to wake kswapd afterwards?
>
> The same reason as the "goto restart" case; that's my intention.
> If the following scenario occurs, it is equivalent to not having called wake_all_kswapd():
>
> 1. call congestion_wait()
> 2. kswapd reclaims lots of memory and sleeps
> 3. another task consumes lots of memory
> 4. wakeup from congestion_wait()
>
> IOW, if we have fallen into __alloc_pages_slowpath(), we naturally expect
> the next page_alloc() not to fall into the slowpath. However, if kswapd ends
> its work too early, this assumption isn't true.
>
> Is this too pessimistic an assumption?
>
hmm.
The reason it's not woken a second time in both cases was to match the
behaviour of 2.6.30. If the direct reclaimer goes to sleep and another task
consumes the memory the direct reclaimer freed, then the greedy process should
kick kswapd back awake again as free memory goes below the low watermark.
However, if the greedy process was allocating order-0, it's possible that
the order-0 watermarks are being met, leaving kswapd alone, while the
high-order ones are not, leaving kswapd to go back to sleep or to reclaim at
the wrong order.
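(The order-dependence is visible in the per-order loop of the 2.6.31-era
zone_watermark_ok(), paraphrased below; an order-0 check can pass while an
order-2 check of the same zone fails:)

	for (o = 0; o < order; o++) {
		/* At the next order, this order's pages become unavailable */
		free_pages -= z->free_area[o].nr_free << o;

		/* Require fewer higher-order pages to be free */
		min >>= 1;

		if (free_pages <= min)
			return 0;
	}
	return 1;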
It's a functional change but I can add it to the list of things to
consider. Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Sat, Oct 24, 2009 at 04:02:09PM +0200, Sven Geggus wrote:
> Mel Gorman schrieb am Donnerstag, den 22. Oktober um 16:22 Uhr:
>
> > Test 1: Verify your problem occurs on 2.6.32-rc5 if you can
>
> Problem persists. RAID resync in progress :(
>
What about the rest of the patches, any luck?
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Fri, Oct 23, 2009 at 10:03:05PM -0400, Christoph Lameter wrote:
> There are now rt dependencies in the page allocator that screw things up?
>
The rt logic was present before.
> And an rt flag causes the page allocator to try harder meaning it adds
> latency.
>
The "harder" term in the flag is a bit misleading. The effect of ALLOC_HARDER
is that the watermark is lower for the task, i.e. an rt task is less likely
to enter direct reclaim than normal tasks, so it should have less latency.
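(For reference, the relevant lines, paraphrased from the 2.6.31-era
zone_watermark_ok() in mm/page_alloc.c -- the flags scale the required
watermark down rather than adding reclaim work:)

	long min = mark;

	if (alloc_flags & ALLOC_HIGH)
		min -= min / 2;		/* __GFP_HIGH: halve the watermark */
	if (alloc_flags & ALLOC_HARDER)
		min -= min / 4;		/* rt tasks: a further quarter off */

	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;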
> ?
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Oct 27, 2009 at 11:42:58AM +0900, KOSAKI Motohiro wrote:
> > When a high-order allocation fails, kswapd is kicked so that it reclaims
> > at a higher-order to avoid direct reclaimers stalling and to help GFP_ATOMIC
> > allocations. Something has changed in recent kernels that affects the timing
> > where high-order GFP_ATOMIC allocations are now failing with more frequency,
> > particularly under pressure.
> >
> > This patch pre-emptively checks if watermarks have been hit after a
> > high-order allocation completes successfully. If the watermarks have been
> > reached, kswapd is woken in the hope it fixes the watermarks before the
> > next GFP_ATOMIC allocation fails.
> >
> > Warning, this patch is somewhat of a band-aid. If this makes a difference,
> > it still implies that something has changed that is either causing more
> > GFP_ATOMIC allocations to occur (such as the case with iwlagn wireless
> > driver) or making them more likely to fail.
>
> Hmm, I'm confused. This description addresses generic high-order allocation,
> but,
>
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/page_alloc.c | 33 ++++++++++++++++++++++-----------
> > 1 files changed, 22 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 7f2aa3e..851df40 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1596,6 +1596,17 @@ try_next_zone:
> > return page;
> > }
> >
> > +static inline
> > +void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> > + enum zone_type high_zoneidx)
> > +{
> > + struct zoneref *z;
> > + struct zone *zone;
> > +
> > + for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > + wakeup_kswapd(zone, order);
> > +}
> > +
> > static inline int
> > should_alloc_retry(gfp_t gfp_mask, unsigned int order,
> > unsigned long pages_reclaimed)
> > @@ -1730,18 +1741,18 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
>
> __alloc_pages_high_priority() is only called if ALLOC_NO_WATERMARKS is set.
> ALLOC_NO_WATERMARKS means PF_MEMALLOC or TIF_MEMDIE, and GFP_ATOMIC doesn't make
> nested alloc_pages() calls (i.e. doesn't create the PF_MEMALLOC case).
> Then, I don't understand why this patch improves the iwlagn GFP_ATOMIC case.
>
> hmm, maybe I missed something. I'll look at the code again tomorrow.
>
The description is misleading but, in the patch's current form, it makes
a difference to Tobias's testing. I still haven't figured out why.
>
> > congestion_wait(BLK_RW_ASYNC, HZ/50);
> > } while (!page && (gfp_mask & __GFP_NOFAIL));
> >
> > - return page;
> > -}
> > -
> > -static inline
> > -void wake_all_kswapd(unsigned int order, struct zonelist *zonelist,
> > - enum zone_type high_zoneidx)
> > -{
> > - struct zoneref *z;
> > - struct zone *zone;
> > + /*
> > + * If after a high-order allocation we are now below watermarks,
> > + * pre-emptively kick kswapd rather than having the next allocation
> > + * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
> > + * allocations or entering direct reclaim
> > + */
> > + if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
> > + preferred_zone->watermark[ALLOC_WMARK_LOW],
> > + zone_idx(preferred_zone), ALLOC_WMARK_LOW))
> > + wake_all_kswapd(order, zonelist, high_zoneidx);
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - wakeup_kswapd(zone, order);
> > + return page;
> > }
> >
> > static inline int
> > --
> > 1.6.3.3
> >
>
>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Mon, Oct 26, 2009 at 06:37:36PM +0100, Tobias Oetiker wrote:
> Hi Mel,
>
> I have now done additional tests ... and can report the following
>
> Thursday Mel Gorman wrote:
>
> > 1/5 page allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
> > 2/5 page allocator: Do not allow interrupts to use ALLOC_HARDER
> >
> >
> > These patches correct problems introduced by me during the 2.6.31-rc1
> > merge window. The patches were not meant to introduce any functional
> > changes but two were missed.
> >
> > If your problem goes away with just these two patches applied,
> > please tell me.
>
> 1+2 do not help
>
> > Test 3: If you are getting allocation failures, try with the following patch
> >
> > 3/5 vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
> >
> > This is a functional change that causes kswapd to notice sooner
> > when high-order watermarks have been hit. There have been a number
> > of changes in page reclaim since 2.6.30 that might have delayed
> > when kswapd kicks in for higher orders
> >
> > If your problem goes away with these three patches applied, please
> > tell me
>
> 1+2+3 do not help either
>
> > Test 4: If you are still getting failures, apply the following
> > 4/5 page allocator: Pre-emptively wake kswapd when high-order watermarks are hit
> >
> > This patch is very heavy handed and pre-emptively kicks kswapd when
> > watermarks are hit. It should only be necessary if there has been
> > significant changes in the timing and density of page allocations
> > from an unknown source. Tobias, this patch is largely aimed at you.
> > You reported that with patches 3+4 applied that your problems went
> > away. I need to know if patch 3 on its own is enough or if both
> > are required
> >
> > If your problem goes away with these four patches applied, please
> > tell me
>
> 3 alone does not help
> 3+4 does ...
>
This is a bit surprising.....
Tell me, do you have an Intel IO-MMU on your system by any chance? It should
be mentioned in either dmesg or lspci -v (please send the full output of
both). If you do have one of these things, I notice they abuse PF_MEMALLOC
which would explain why this patch makes a difference to your testing.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, 2009-10-27 at 03:40 -0700, Mel Gorman wrote:
> On Thu, Oct 22, 2009 at 08:43:38AM -0700, reinette chatre wrote:
> > On Thu, 2009-10-22 at 07:22 -0700, Mel Gorman wrote:
> > > [Bug #14141] order 2 page allocation failures in iwlagn
> > > Commit 4752c93c30441f98f7ed723001b1a5e3e5619829 introduced GFP_ATOMIC
> > > allocations within the wireless driver. This has caused large numbers
> > > of failure reports to occur as reported by Frans Pop. Fixing this
> > > requires changes to the driver if it wants to use GFP_ATOMIC which
> > > is in the hands of Mohamed Abbas and Reinette Chatre. However,
> > > it is very likely that it has being compounded by core mm changes
> > > that this series is aimed at.
> >
> > Driver has been changed to allocate paged skb for its receive buffers.
> > This reduces amount of memory needed from order-2 to order-1. This work
> > is significant and will thus be in 2.6.33.
> >
>
> What do you want to do for -stable in 2.6.31?
>
I have just posted two patches to stable. The first is to address a bug
in which buffer loss occurs when there is an allocation failure and the
second is what Frans has been testing that reduces the noise when these
allocations fail. They are:
iwlwifi: fix potential rx buffer loss
de0bd50845eb5935ce3d503c5d2f565d6cb9ece1 in linux-2.6
iwlwifi: reduce noise when skb allocation fails
f82a924cc88a5541df1d4b9d38a0968cd077a051 in linux-2.6
Reinette
On Sat, Oct 24, 2009 at 02:46:56PM +0100, Mel LKML wrote:
> Hi,
Hi,
> On 10/23/09, Karol Lewandowski <[email protected]> wrote:
> > On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
> > Ok, I've tested patches 1+2+4 and the bug, while very hard to trigger, is
> > still present. I'll test complete 1-4 patchset as time permits.
Sorry for silence, I've been quite busy lately.
> And also patch 5 please which is the revert. Patch 5 as pointed out is
> probably a red herring. However, it has changed the timing and made a
> difference for some testing so I'd like to know if it helps yours as
> well.
I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
suspend, do work, suspend, ...) and it failed today after 4 days (== 4
suspend-resume cycles).
I'll test 1-5 now.
Thanks.
On Wed, Oct 28, 2009 at 12:42:08PM +0100, Karol Lewandowski wrote:
> On Sat, Oct 24, 2009 at 02:46:56PM +0100, Mel LKML wrote:
> > Hi,
>
> Hi,
>
> > On 10/23/09, Karol Lewandowski <[email protected]> wrote:
> > > On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
>
> > > Ok, I've tested patches 1+2+4 and the bug, while very hard to trigger, is
> > > still present. I'll test complete 1-4 patchset as time permits.
>
> Sorry for silence, I've been quite busy lately.
>
>
> > And also patch 5 please which is the revert. Patch 5 as pointed out is
> > probably a red herring. However, it has changed the timing and made a
> > difference for some testing so I'd like to know if it helps yours as
> > well.
>
> I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
> suspend, do work, suspend, ...) and it failed today after 4 days (== 4
> suspend-resume cycles).
>
> I'll test 1-5 now.
>
I was digging through commits for suspend-related changes. Rafael, is
there any chance that some change to suspend is responsible for this
regression? This commit for example is a vague possibility;
c6f37f12197ac3bd2e5a35f2f0e195ae63d437de: PM/Suspend: Do not shrink memory before suspend
I say vague because FREE_PAGE_NUMBER is so small.
Also, what was the behaviour of the e100 driver when suspending before
this commit?
6905b1f1a03a48dcf115a2927f7b87dba8d5e566: Net / e100: Fix suspend of devices that cannot be power managed
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Today Karol Lewandowski wrote:
> On Sat, Oct 24, 2009 at 02:46:56PM +0100, Mel LKML wrote:
> > Hi,
>
> Hi,
>
> > On 10/23/09, Karol Lewandowski <[email protected]> wrote:
> > > On Fri, Oct 23, 2009 at 06:58:10PM +0200, Karol Lewandowski wrote:
>
> > > Ok, I've tested patches 1+2+4 and the bug, while very hard to trigger, is
> > > still present. I'll test complete 1-4 patchset as time permits.
>
> Sorry for silence, I've been quite busy lately.
>
>
> > And also patch 5 please which is the revert. Patch 5 as pointed out is
> > probably a red herring. However, it has changed the timing and made a
> > difference for some testing so I'd like to know if it helps yours as
> > well.
>
> I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
> suspend, do work, suspend, ...) and it failed today after 4 days (== 4
> suspend-resume cycles).
I have been testing 1+2, 1+2+3 as well as 3+4 and have been of the
assumption that 3+4 does help ... I have now been running a modified
version of 4 which prints a warning instead of doing anything ... I
have now seen the allocation issue again without the warning being
printed. So in other words:
1+2+3 make the problem less severe, but do not solve it
4 seems to be a red herring.
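(A reconstruction of what that diagnostic variant could look like -- not the
actual diff used; patch 4's check with the wakeup swapped for a printk:)

	if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
			preferred_zone->watermark[ALLOC_WMARK_LOW],
			zone_idx(preferred_zone), ALLOC_WMARK_LOW))
		printk(KERN_WARNING
			"order-%u low watermark hit after allocation\n", order);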
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Wed, Oct 28, 2009 at 11:59:26AM +0000, Mel Gorman wrote:
> On Wed, Oct 28, 2009 at 12:42:08PM +0100, Karol Lewandowski wrote:
> > On Sat, Oct 24, 2009 at 02:46:56PM +0100, Mel LKML wrote:
> > I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
> > suspend, do work, suspend, ...) and it failed today after 4 days (== 4
> > suspend-resume cycles).
> >
> > I'll test 1-5 now.
2.6.32-rc5 with patches 1-5 fails too.
> Also, what was the behaviour of the e100 driver when suspending before
> this commit?
>
> 6905b1f1a03a48dcf115a2927f7b87dba8d5e566: Net / e100: Fix suspend of devices that cannot be power managed
This was discussed before with e100 maintainers and Rafael. Reverting
this patch didn't change anything.
Thanks.
On Fri, Oct 30, 2009 at 03:23:50PM +0100, Karol Lewandowski wrote:
> On Wed, Oct 28, 2009 at 11:59:26AM +0000, Mel Gorman wrote:
> > On Wed, Oct 28, 2009 at 12:42:08PM +0100, Karol Lewandowski wrote:
> > > On Sat, Oct 24, 2009 at 02:46:56PM +0100, Mel LKML wrote:
> > > I've tested patches 1+2+3+4 in my normal usage scenario (do some work,
> > > suspend, do work, suspend, ...) and it failed today after 4 days (== 4
> > > suspend-resume cycles).
> > >
> > > I'll test 1-5 now.
>
> 2.6.32-rc5 with patches 1-5 fails too.
>
>
> > Also, what was the behaviour of the e100 driver when suspending before
> > this commit?
> >
> > 6905b1f1a03a48dcf115a2927f7b87dba8d5e566: Net / e100: Fix suspend of devices that cannot be power managed
>
> This was discussed before with e100 maintainers and Rafael. Reverting
> this patch didn't change anything.
>
Does applying the following on top make any difference?
==== CUT HERE ====
PM: Shrink memory before suspend
This is a partial revert of c6f37f12197ac3bd2e5a35f2f0e195ae63d437de. It
is an outside possibility for fixing the e100 bug where an order-5
allocation is failing during resume. The commit notes that the shrinking
of memory should be unnecessary but maybe it is in error.
Signed-off-by: Mel Gorman <[email protected]>
diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
index 6f10dfc..4f6ae64 100644
--- a/kernel/power/suspend.c
+++ b/kernel/power/suspend.c
@@ -23,6 +23,9 @@ const char *const pm_states[PM_SUSPEND_MAX] = {
[PM_SUSPEND_MEM] = "mem",
};
+/* This is just an arbitrary number */
+#define FREE_PAGE_NUMBER (100)
+
static struct platform_suspend_ops *suspend_ops;
/**
@@ -78,6 +81,7 @@ static int suspend_test(int level)
static int suspend_prepare(void)
{
int error;
+ unsigned int free_pages;
if (!suspend_ops || !suspend_ops->enter)
return -EPERM;
@@ -92,10 +96,24 @@ static int suspend_prepare(void)
if (error)
goto Finish;
- error = suspend_freeze_processes();
+ if (suspend_freeze_processes()) {
+ error = -EAGAIN;
+ goto Thaw;
+ }
+
+ free_pages = global_page_state(NR_FREE_PAGES);
+ if (free_pages < FREE_PAGE_NUMBER) {
+ pr_debug("PM: free some memory\n");
+ shrink_all_memory(FREE_PAGE_NUMBER - free_pages);
+ if (nr_free_pages() < FREE_PAGE_NUMBER) {
+ error = -ENOMEM;
+ printk(KERN_ERR "PM: Not enough memory\n");
+ }
+ }
if (!error)
return 0;
+ Thaw:
suspend_thaw_processes();
usermodehelper_enable();
Finish:
On Mon, Nov 02, 2009 at 08:30:34PM +0000, Mel Gorman wrote:
> Does applying the following on top make any difference?
>
> ==== CUT HERE ====
> PM: Shrink memory before suspend
No, this patch didn't change anything either.
IIRC I get failures while free(1) shows as much as 20MB free RAM
(i.e. without buffers/caches). Additionally, nr_free_pages (from
/proc/vmstat) stays at about 800-1000 under heavy memory pressure
(gitk on a full Linux repository).
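For reference, a trivial poller along these lines is enough to watch
that counter (the path and field name are as in mainline; the rest is
just illustrative glue):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("fopen /proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			/* print only the nr_free_pages line */
			if (strncmp(line, "nr_free_pages ", 14) == 0)
				fputs(line, stdout);
		}
		fclose(f);
		sleep(1);
	}
}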
--- babbling follows ---
Hmm, I wonder: if it's really a timing issue, then wouldn't lowering
the swappiness sysctl make the problem more visible?
I have vm.swappiness=15; would testing with a higher value make any sense?
Thanks.
Mel Gorman wrote:
> [No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
> This apparently is easily reproducible, particular in comparison to
> the other reports. The point of greatest interest is that this is
> order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
> will be able to follow the tests below as you are the most likely
> person to have an easily reproducible situation.
I've also seen order-0 failures on 2.6.31.5:
Note that this is with one process hogging and mlocking memory, and
min_free_kbytes reduced to 100 to reproduce the problem more easily.
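The hog is nothing fancy; something along the following lines is
enough (this is just an illustrative sketch, with an arbitrary 160MB
default; pick a size close to your RAM, and note mlock() needs
CAP_IPC_LOCK or a raised RLIMIT_MEMLOCK):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* size in MB from argv[1], default 160 */
	size_t len = (argc > 1 ? atol(argv[1]) : 160) * 1024 * 1024;
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (mlock(mem, len) != 0) {
		perror("mlock");
		return 1;
	}
	memset(mem, 0xaa, len);	/* dirty the pages for good measure */
	printf("locked %zu MB, sleeping; Ctrl-C to release\n", len >> 20);
	pause();
	return 0;
}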
I tried bisecting the issue, but in the end without memory pressure
I can't reproduce it reliably and with the above mentioned pressure
I get allocation failures even on 2.6.30.
Initially the issue was that the machine hung after the allocation
failure, but that seems to be a netconsole-related issue, since I
didn't get a hang on 2.6.31 compiled without netconsole.
http://lkml.org/lkml/2009/11/1/66
http://lkml.org/lkml/2009/11/5/100
[ 375.398423] swapper: page allocation failure. order:0, mode:0x20
[ 375.398483] Pid: 0, comm: swapper Not tainted 2.6.31.5-nokmem-tomodachi #3
[ 375.398519] Call Trace:
[ 375.398566] [<c10395a8>] ? __alloc_pages_nodemask+0x40f/0x453
[ 375.398613] [<c104e988>] ? cache_alloc_refill+0x1f3/0x382
[ 375.398648] [<c104eb76>] ? __kmalloc+0x5f/0x97
[ 375.398690] [<c1228003>] ? __alloc_skb+0x44/0x101
[ 375.398723] [<c12289b5>] ? dev_alloc_skb+0x11/0x25
[ 375.398760] [<c11a53a1>] ? tulip_refill_rx+0x3c/0x115
[ 375.398793] [<c11a57f7>] ? tulip_poll+0x37d/0x416
[ 375.398832] [<c122cf56>] ? net_rx_action+0x3a/0xdb
[ 375.398874] [<c101d8b0>] ? __do_softirq+0x5b/0xcb
[ 375.398908] [<c101d855>] ? __do_softirq+0x0/0xcb
[ 375.398937] <IRQ> [<c1003e0b>] ? do_IRQ+0x66/0x76
[ 375.398989] [<c1002d70>] ? common_interrupt+0x30/0x38
[ 375.399026] [<c100725c>] ? default_idle+0x25/0x38
[ 375.399058] [<c1001a1e>] ? cpu_idle+0x64/0x7a
[ 375.399102] [<c14105ff>] ? start_kernel+0x251/0x258
[ 375.399133] Mem-Info:
[ 375.399159] DMA per-cpu:
[ 375.399184] CPU 0: hi: 0, btch: 1 usd: 0
[ 375.399214] Normal per-cpu:
[ 375.399241] CPU 0: hi: 90, btch: 15 usd: 28
[ 375.399276] Active_anon:6709 active_file:1851 inactive_anon:6729
[ 375.399278] inactive_file:2051 unevictable:40962 dirty:998 writeback:914 unstable:0
[ 375.399281] free:232 slab:1843 mapped:1404 pagetables:613 bounce:0
[ 375.399391] DMA free:924kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:15028kB present:15872kB pages_scanned:0 all_unreclaimable? yes
[ 375.399465] lowmem_reserve[]: 0 230 230
[ 375.399537] Normal free:4kB min:92kB low:112kB high:136kB active_anon:26836kB inactive_anon:26916kB active_file:7404kB inactive_file:8204kB unevictable:148820kB present:235648kB pages_scanned:32 all_unreclaimable? no
[ 375.399615] lowmem_reserve[]: 0 0 0
[ 375.399681] DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 924kB
[ 375.399850] Normal: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4kB
[ 375.400016] 4412 total pagecache pages
[ 375.400041] 494 pages in swap cache
[ 375.400041] Swap cache stats: add 8110, delete 7616, find 139/205
[ 375.400041] Free swap = 494212kB
[ 375.400041] Total swap = 524280kB
[ 375.400041] 63472 pages RAM
[ 375.400041] 1742 pages reserved
[ 375.400041] 14429 pages shared
[ 375.400041] 56917 pages non-shared
[ 378.306631] swapper: page allocation failure. order:0, mode:0x20
[ 378.306686] Pid: 0, comm: swapper Not tainted 2.6.31.5-nokmem-tomodachi #3
[ 378.306722] Call Trace:
[ 378.306769] [<c10395a8>] ? __alloc_pages_nodemask+0x40f/0x453
[ 378.306817] [<c104e988>] ? cache_alloc_refill+0x1f3/0x382
[ 378.306853] [<c104eb76>] ? __kmalloc+0x5f/0x97
[ 378.306894] [<c1228003>] ? __alloc_skb+0x44/0x101
[ 378.306927] [<c12289b5>] ? dev_alloc_skb+0x11/0x25
[ 378.306964] [<c11a53a1>] ? tulip_refill_rx+0x3c/0x115
[ 378.306996] [<c11a57f7>] ? tulip_poll+0x37d/0x416
[ 378.307035] [<c122cf56>] ? net_rx_action+0x3a/0xdb
[ 378.307079] [<c101d8b0>] ? __do_softirq+0x5b/0xcb
[ 378.307112] [<c101d855>] ? __do_softirq+0x0/0xcb
[ 378.307142] <IRQ> [<c1003e0b>] ? do_IRQ+0x66/0x76
[ 378.307193] [<c1002d70>] ? common_interrupt+0x30/0x38
[ 378.307232] [<c100725c>] ? default_idle+0x25/0x38
[ 378.307263] [<c1001a1e>] ? cpu_idle+0x64/0x7a
[ 378.307306] [<c14105ff>] ? start_kernel+0x251/0x258
[ 378.307338] Mem-Info:
[ 378.307364] DMA per-cpu:
[ 378.307389] CPU 0: hi: 0, btch: 1 usd: 0
[ 378.307420] Normal per-cpu:
[ 378.307446] CPU 0: hi: 90, btch: 15 usd: 70
[ 378.307480] Active_anon:6231 active_file:2340 inactive_anon:6283
[ 378.307483] inactive_file:2445 unevictable:40962 dirty:989 writeback:1024 unstable:0
[ 378.307485] free:232 slab:1872 mapped:1408 pagetables:613 bounce:0
[ 378.307595] DMA free:924kB min:4kB low:4kB high:4kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:15028kB present:15872kB pages_scanned:0 all_unreclaimable? yes
[ 378.307668] lowmem_reserve[]: 0 230 230
[ 378.307739] Normal free:4kB min:92kB low:112kB high:136kB active_anon:24924kB inactive_anon:25132kB active_file:9360kB inactive_file:9780kB unevictable:148820kB present:235648kB pages_scanned:32 all_unreclaimable? no
[ 378.307816] lowmem_reserve[]: 0 0 0
[ 378.307882] DMA: 1*4kB 1*8kB 1*16kB 0*32kB 0*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 924kB
[ 378.308052] Normal: 1*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4kB
[ 378.308219] 5866 total pagecache pages
[ 378.308247] 1065 pages in swap cache
[ 378.308277] Swap cache stats: add 9651, delete 8586, find 152/220
[ 378.308308] Free swap = 488152kB
[ 378.308334] Total swap = 524280kB
[ 378.310030] 63472 pages RAM
[ 378.310030] 1742 pages reserved
[ 378.310030] 15289 pages shared
[ 378.310030] 56019 pages non-shared
--
Tobias PGP: http://8ef7ddba.uguu.de
On Fri, Nov 06, 2009 at 07:03:23AM +0100, Tobias Diedrich wrote:
> Mel Gorman wrote:
> > [No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
> > This apparently is easily reproducible, particular in comparison to
> > the other reports. The point of greatest interest is that this is
> > order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
> > will be able to follow the tests below as you are the most likely
> > person to have an easily reproducible situation.
>
> I've also seen order-0 failures on 2.6.31.5:
> Note that this is with one process hogging and mlocking memory, and
> min_free_kbytes reduced to 100 to reproduce the problem more easily.
>
Is that vanilla, with patches 1-3 applied, or both?
> I tried bisecting the issue, but in the end without memory pressure
> I can't reproduce it reliably and with the above mentioned pressure
> I get allocation failures even on 2.6.30.
>
To be honest, it's not entirely unexpected with min_free_kbytes set that
low. The system should cope with a certain amount of pressure, but with
pressure and a low min_free_kbytes it will simply be reacting too late
to free memory in the non-atomic paths.
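To put rough numbers on that: the zone watermarks are all derived from
min_free_kbytes (if I remember the 2.6.31-era
setup_per_zone_pages_min() correctly, each zone gets low = min + min/4
and high = min + min/2, with min split between zones in proportion to
their size). A quick aggregate estimate can be had from userspace:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long page_kb = sysconf(_SC_PAGESIZE) / 1024;
	long min_kb, min_pages;
	FILE *f = fopen("/proc/sys/vm/min_free_kbytes", "r");

	if (!f || fscanf(f, "%ld", &min_kb) != 1) {
		perror("min_free_kbytes");
		return 1;
	}
	fclose(f);

	/* system-wide totals; real values are split per zone */
	min_pages = min_kb / page_kb;
	printf("min:  %ld pages (%ld kB)\n", min_pages, min_kb);
	printf("low:  ~%ld pages\n", min_pages + min_pages / 4);
	printf("high: ~%ld pages\n", min_pages + min_pages / 2);
	return 0;
}

With min_free_kbytes=100 that is a min watermark of only ~25 4K pages
system-wide, which is why the atomic paths run dry so easily.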
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Mel Gorman wrote:
> On Fri, Nov 06, 2009 at 07:03:23AM +0100, Tobias Diedrich wrote:
> > Mel Gorman wrote:
> > > [No BZ ID] Kernel crash on 2.6.31.x (kcryptd: page allocation failure..)
> > > This apparently is easily reproducible, particular in comparison to
> > > the other reports. The point of greatest interest is that this is
> > > order-0 GFP_ATOMIC failures. Sven, I'm hoping that you in particular
> > > will be able to follow the tests below as you are the most likely
> > > person to have an easily reproducible situation.
> >
> > I've also seen order-0 failures on 2.6.31.5:
> > Note that this is with one process hogging and mlocking memory, and
> > min_free_kbytes reduced to 100 to reproduce the problem more easily.
> >
>
> Is that vanilla, with patches 1-3 applied, or both?
That was on vanilla 2.6.31.5.
I tried 2.6.31.5 before with patches 1+2 and netconsole enabled and
still got the order-1 failures (apparently I get order-1 failures
with netconsole and order-0 failures without).
> > I tried bisecting the issue, but in the end without memory pressure
> > I can't reproduce it reliably and with the above mentioned pressure
> > I get allocation failures even on 2.6.30.
>
> To be honest, it's not entirely unexpected with min_free_kbytes set that
> low. The system should cope with a certain amount of pressure, but with
> pressure and a low min_free_kbytes it will simply be reacting too late
> to free memory in the non-atomic paths.
Maybe I should try again on 2.6.30 without netconsole, increase
min_free_kbytes until the allocation failures disappear, and then try
to bisect again with that setting...
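Something as simple as this would do for setting it before each run
(just a convenience wrapper around writing the value by hand; run as
root):

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	FILE *f;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <kbytes>\n", argv[0]);
		return 1;
	}
	f = fopen("/proc/sys/vm/min_free_kbytes", "w");
	if (!f) {
		perror("/proc/sys/vm/min_free_kbytes");
		return 1;
	}
	fprintf(f, "%ld\n", atol(argv[1]));
	return fclose(f) == 0 ? 0 : 1;
}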
--
Tobias PGP: http://8ef7ddba.uguu.de