2009-07-08 18:47:17

by Jens Axboe

Subject: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

Hi,

This one isn't great: we currently have broken congestion wait logic in
the kernel. 2.6.30 is impacted as well, so this patch should go to
stable too once it's in -git. I'll let this one simmer until tomorrow,
then ask Linus to pull it. The offending commit is
1faa16d22877f4839bd433547d770c676d1d964c.

Meanwhile, it could cause buffered writeout slowdowns in the
kernel. Perhaps the 2.6.30 regression in that area is caused by this?
Would be interesting if the submitter could test. I can't find the list,
so I'm CC'ing Rafael.

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 7c8ca91..1f118d4 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:

if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto survive;
}

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 83650e0..f7ebe74 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2595,7 +2595,7 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(WRITE, HZ);
+ congestion_wait(BLK_RW_ASYNC, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 9933eb8..529e2ba 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
* But don't wait if split was due to the io size restriction
*/
if (unlikely(out_of_pages))
- congestion_wait(WRITE, HZ/100);
+ congestion_wait(BLK_RW_ASYNC, HZ/100);

/*
* With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index b28ea64..f042b96 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}
return 0;
}
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 77f5bb7..9062220 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
if (atomic_read(&j->j_async_throttle))
- congestion_wait(WRITE, HZ / 10);
+ congestion_wait(BLK_RW_ASYNC, HZ / 10);
return 0;
}

diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 1cd3b55..2d3f90a 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (1);
}

@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (1);
}

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1418b91..0c93c7e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(

XFS_STATS_INC(xb_page_retries);
xfsbufd_wakeup(0, gfp_mask);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto retry;
}

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..c86edd2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -283,7 +283,6 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};

-
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
@@ -308,18 +307,18 @@ EXPORT_SYMBOL(set_bdi_congested);

/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @rw: READ or WRITE
+ * @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int rw, long timeout)
+long congestion_wait(int sync, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &congestion_wqh[sync];

prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..e717964 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1973,7 +1973,7 @@ try_to_free:
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}

}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7687879..81627eb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
if (pages_written >= write_chunk)
break; /* We've done our duty */

- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}

if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* The caller might hold locks which can prevent IO completion
@@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
else
break;
}
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
else
break; /* All the old data is written */
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0f2cdf..2862bcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1666,7 +1666,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));

return page;
@@ -1831,7 +1831,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto rebalance;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5415526..dea7abd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1104,7 +1104,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (nr_freed < nr_taken && !current_is_kswapd() &&
lumpy_reclaim) {
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* The attempt at page out may have made some
@@ -1721,7 +1721,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -1960,7 +1960,7 @@ loop_again:
* another pass across the zones.
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* We do this so kswapd doesn't build up large priorities for
@@ -2233,7 +2233,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
goto out;

if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
+ congestion_wait(BLK_RW_ASYNC, HZ / 10);
}
}


--
Jens Axboe


2009-07-08 19:13:15

by Chris Mason

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wed, Jul 08, 2009 at 08:47:03PM +0200, Jens Axboe wrote:
> Hi,
>
> This one isn't great: we currently have broken congestion wait logic in
> the kernel. 2.6.30 is impacted as well, so this patch should go to
> stable too once it's in -git. I'll let this one simmer until tomorrow,
> then ask Linus to pull it. The offending commit is
> 1faa16d22877f4839bd433547d770c676d1d964c.
>
> Meanwhile, it could cause buffered writeout slowdowns in the
> kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> Would be interesting if the submitter could test. I can't find the list,
> so I'm CC'ing Rafael.

Even if this does slow down some workloads, the bug is not in using the
correct flag ;) So, I'd ack this one.

Jan Kara was able to reproduce the tiobench 2.6.30 regression, so I've
cc'd him and kept the patch below.

-chris

diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 7c8ca91..1f118d4 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:

if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto survive;
}

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 83650e0..f7ebe74 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2595,7 +2595,7 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(WRITE, HZ);
+ congestion_wait(BLK_RW_ASYNC, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 9933eb8..529e2ba 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
* But don't wait if split was due to the io size restriction
*/
if (unlikely(out_of_pages))
- congestion_wait(WRITE, HZ/100);
+ congestion_wait(BLK_RW_ASYNC, HZ/100);

/*
* With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index b28ea64..f042b96 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}
return 0;
}
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 77f5bb7..9062220 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
if (atomic_read(&j->j_async_throttle))
- congestion_wait(WRITE, HZ / 10);
+ congestion_wait(BLK_RW_ASYNC, HZ / 10);
return 0;
}

diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 1cd3b55..2d3f90a 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (1);
}

@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (1);
}

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 1418b91..0c93c7e 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(

XFS_STATS_INC(xb_page_retries);
xfsbufd_wakeup(0, gfp_mask);
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto retry;
}

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..c86edd2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -283,7 +283,6 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};

-
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
@@ -308,18 +307,18 @@ EXPORT_SYMBOL(set_bdi_congested);

/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @rw: READ or WRITE
+ * @sync: SYNC or ASYNC IO
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int rw, long timeout)
+long congestion_wait(int sync, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[rw];
+ wait_queue_head_t *wqh = &congestion_wqh[sync];

prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e2fa20d..e717964 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1973,7 +1973,7 @@ try_to_free:
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}

}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7687879..81627eb 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
if (pages_written >= write_chunk)
break; /* We've done our duty */

- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}

if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* The caller might hold locks which can prevent IO completion
@@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
else
break;
}
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
else
break; /* All the old data is written */
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e0f2cdf..2862bcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1666,7 +1666,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);

if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));

return page;
@@ -1831,7 +1831,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(WRITE, HZ/50);
+ congestion_wait(BLK_RW_ASYNC, HZ/50);
goto rebalance;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5415526..dea7abd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1104,7 +1104,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (nr_freed < nr_taken && !current_is_kswapd() &&
lumpy_reclaim) {
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* The attempt at page out may have made some
@@ -1721,7 +1721,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,

/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -1960,7 +1960,7 @@ loop_again:
* another pass across the zones.
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ/10);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
* We do this so kswapd doesn't build up large priorities for
@@ -2233,7 +2233,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
goto out;

if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(WRITE, HZ / 10);
+ congestion_wait(BLK_RW_ASYNC, HZ / 10);
}
}


--
Jens Axboe

2009-07-08 19:15:11

by Rafael J. Wysocki

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wednesday 08 July 2009, Jens Axboe wrote:
> Hi,
>
> This one isn't great: we currently have broken congestion wait logic in
> the kernel. 2.6.30 is impacted as well, so this patch should go to
> stable too once it's in -git. I'll let this one simmer until tomorrow,
> then ask Linus to pull it. The offending commit is
> 1faa16d22877f4839bd433547d770c676d1d964c.
>
> Meanwhile, it could cause buffered writeout slowdowns in the
> kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> Would be interesting if the submitter could test. I can't find the list,
> so I'm CC'ing Rafael.

Thanks, but I'm not sure which one you mean in particular.
http://bugzilla.kernel.org/show_bug.cgi?id=13408 looks like it might be
somewhat related.

The complete list is at http://lkml.org/lkml/2009/7/6/383 .

Best,
Rafael

2009-07-08 22:35:28

by Trond Myklebust

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wed, 2009-07-08 at 20:47 +0200, Jens Axboe wrote:
> Hi,
>
> This one isn't great: we currently have broken congestion wait logic in
> the kernel. 2.6.30 is impacted as well, so this patch should go to
> stable too once it's in -git. I'll let this one simmer until tomorrow,
> then ask Linus to pull it. The offending commit is
> 1faa16d22877f4839bd433547d770c676d1d964c.
>
> Meanwhile, it could cause buffered writeout slowdowns in the
> kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> Would be interesting if the submitter could test. I can't find the list,
> so I'm CC'ing Rafael.
>
> diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
> index 7c8ca91..1f118d4 100644
> --- a/arch/x86/lib/usercopy_32.c
> +++ b/arch/x86/lib/usercopy_32.c
> @@ -751,7 +751,7 @@ survive:
>
> if (retval == -ENOMEM && is_global_init(current)) {
> up_read(&current->mm->mmap_sem);
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto survive;
> }
>
> diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> index 83650e0..f7ebe74 100644
> --- a/drivers/block/pktcdvd.c
> +++ b/drivers/block/pktcdvd.c
> @@ -2595,7 +2595,7 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
> set_bdi_congested(&q->backing_dev_info, WRITE);
> do {
> spin_unlock(&pd->lock);
> - congestion_wait(WRITE, HZ);
> + congestion_wait(BLK_RW_ASYNC, HZ);
> spin_lock(&pd->lock);
> } while(pd->bio_queue_size > pd->write_congestion_off);
> }
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 9933eb8..529e2ba 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
> * But don't wait if split was due to the io size restriction
> */
> if (unlikely(out_of_pages))
> - congestion_wait(WRITE, HZ/100);
> + congestion_wait(BLK_RW_ASYNC, HZ/100);
>
> /*
> * With async crypto it is unsafe to share the crypto context
> diff --git a/fs/fat/file.c b/fs/fat/file.c
> index b28ea64..f042b96 100644
> --- a/fs/fat/file.c
> +++ b/fs/fat/file.c
> @@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
> if ((filp->f_mode & FMODE_WRITE) &&
> MSDOS_SB(inode->i_sb)->options.flush) {
> fat_flush_inodes(inode->i_sb, inode, NULL);
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
> return 0;
> }
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 77f5bb7..9062220 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
> DEFINE_WAIT(wait);
> struct reiserfs_journal *j = SB_JOURNAL(s);
> if (atomic_read(&j->j_async_throttle))
> - congestion_wait(WRITE, HZ / 10);
> + congestion_wait(BLK_RW_ASYNC, HZ / 10);
> return 0;
> }
>
> diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
> index 1cd3b55..2d3f90a 100644
> --- a/fs/xfs/linux-2.6/kmem.c
> +++ b/fs/xfs/linux-2.6/kmem.c
> @@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
> printk(KERN_ERR "XFS: possible memory allocation "
> "deadlock in %s (mode:0x%x)\n",
> __func__, lflags);
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (1);
> }
>
> @@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
> printk(KERN_ERR "XFS: possible memory allocation "
> "deadlock in %s (mode:0x%x)\n",
> __func__, lflags);
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (1);
> }
>
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index 1418b91..0c93c7e 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c
> @@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
>
> XFS_STATS_INC(xb_page_retries);
> xfsbufd_wakeup(0, gfp_mask);
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto retry;
> }
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 493b468..c86edd2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -283,7 +283,6 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
>
> -
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> enum bdi_state bit;
> @@ -308,18 +307,18 @@ EXPORT_SYMBOL(set_bdi_congested);
>
> /**
> * congestion_wait - wait for a backing_dev to become uncongested
> - * @rw: READ or WRITE
> + * @sync: SYNC or ASYNC IO
> * @timeout: timeout in jiffies
> *
> * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> * write congestion. If no backing_devs are congested then just wait for the
> * next write to be completed.
> */
> -long congestion_wait(int rw, long timeout)
> +long congestion_wait(int sync, long timeout)
> {
> long ret;
> DEFINE_WAIT(wait);
> - wait_queue_head_t *wqh = &congestion_wqh[rw];
> + wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = io_schedule_timeout(timeout);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2fa20d..e717964 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1973,7 +1973,7 @@ try_to_free:
> if (!progress) {
> nr_retries--;
> /* maybe some writeback is necessary */
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7687879..81627eb 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> if (pages_written >= write_chunk)
> break; /* We've done our duty */
>
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
>
> if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> @@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> if (global_page_state(NR_UNSTABLE_NFS) +
> global_page_state(NR_WRITEBACK) <= dirty_thresh)
> break;
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> * The caller might hold locks which can prevent IO completion
> @@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
> if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> /* Wrote less than expected */
> if (wbc.encountered_congestion || wbc.more_io)
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> else
> break;
> }
> @@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
> writeback_inodes(&wbc);
> if (wbc.nr_to_write > 0) {
> if (wbc.encountered_congestion || wbc.more_io)
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> else
> break; /* All the old data is written */
> }
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e0f2cdf..2862bcf 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1666,7 +1666,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> preferred_zone, migratetype);
>
> if (!page && gfp_mask & __GFP_NOFAIL)
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> } while (!page && (gfp_mask & __GFP_NOFAIL));
>
> return page;
> @@ -1831,7 +1831,7 @@ rebalance:
> pages_reclaimed += did_some_progress;
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(WRITE, HZ/50);
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> }
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 5415526..dea7abd 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1104,7 +1104,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
> */
> if (nr_freed < nr_taken && !current_is_kswapd() &&
> lumpy_reclaim) {
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> * The attempt at page out may have made some
> @@ -1721,7 +1721,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>
> /* Take a nap, wait for some writeback to complete */
> if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);
> }
> /* top priority shrink_zones still had more to do? don't OOM, then */
> if (!sc->all_unreclaimable && scanning_global_lru(sc))
> @@ -1960,7 +1960,7 @@ loop_again:
> * another pass across the zones.
> */
> if (total_scanned && priority < DEF_PRIORITY - 2)
> - congestion_wait(WRITE, HZ/10);
> + congestion_wait(BLK_RW_ASYNC, HZ/10);

Oh, great...

This particular change will affect _all_ users of
set_bdi_congested(WRITE)/clear_bdi_congested(WRITE). If you're going to
do this, then you had better be prepared to change them all. There's one
in fs/nfs/write.c...

Trond

> /*
> * We do this so kswapd doesn't build up large priorities for
> @@ -2233,7 +2233,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
> goto out;
>
> if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
> - congestion_wait(WRITE, HZ / 10);
> + congestion_wait(BLK_RW_ASYNC, HZ / 10);
> }
> }
>
>

2009-07-08 22:45:06

by Trond Myklebust

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wed, 2009-07-08 at 18:35 -0400, Trond Myklebust wrote:
> On Wed, 2009-07-08 at 20:47 +0200, Jens Axboe wrote:
> > Hi,
> >
> > This one isn't great: we currently have broken congestion wait logic in
> > the kernel. 2.6.30 is impacted as well, so this patch should go to
> > stable too once it's in -git. I'll let this one simmer until tomorrow,
> > then ask Linus to pull it. The offending commit is
> > 1faa16d22877f4839bd433547d770c676d1d964c.
> >
> > Meanwhile, it could cause buffered writeout slowdowns in the
> > kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> > Would be interesting if the submitter could test. I can't find the list,
> > so I'm CC'ing Rafael.
> >
> > diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
> > index 7c8ca91..1f118d4 100644
> > --- a/arch/x86/lib/usercopy_32.c
> > +++ b/arch/x86/lib/usercopy_32.c
> > @@ -751,7 +751,7 @@ survive:
> >
> > if (retval == -ENOMEM && is_global_init(current)) {
> > up_read(&current->mm->mmap_sem);
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > goto survive;
> > }
> >
> > diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
> > index 83650e0..f7ebe74 100644
> > --- a/drivers/block/pktcdvd.c
> > +++ b/drivers/block/pktcdvd.c
> > @@ -2595,7 +2595,7 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
> > set_bdi_congested(&q->backing_dev_info, WRITE);
> > do {
> > spin_unlock(&pd->lock);
> > - congestion_wait(WRITE, HZ);
> > + congestion_wait(BLK_RW_ASYNC, HZ);
> > spin_lock(&pd->lock);
> > } while(pd->bio_queue_size > pd->write_congestion_off);
> > }
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 9933eb8..529e2ba 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
> > * But don't wait if split was due to the io size restriction
> > */
> > if (unlikely(out_of_pages))
> > - congestion_wait(WRITE, HZ/100);
> > + congestion_wait(BLK_RW_ASYNC, HZ/100);
> >
> > /*
> > * With async crypto it is unsafe to share the crypto context
> > diff --git a/fs/fat/file.c b/fs/fat/file.c
> > index b28ea64..f042b96 100644
> > --- a/fs/fat/file.c
> > +++ b/fs/fat/file.c
> > @@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
> > if ((filp->f_mode & FMODE_WRITE) &&
> > MSDOS_SB(inode->i_sb)->options.flush) {
> > fat_flush_inodes(inode->i_sb, inode, NULL);
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
> > return 0;
> > }
> > diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> > index 77f5bb7..9062220 100644
> > --- a/fs/reiserfs/journal.c
> > +++ b/fs/reiserfs/journal.c
> > @@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
> > DEFINE_WAIT(wait);
> > struct reiserfs_journal *j = SB_JOURNAL(s);
> > if (atomic_read(&j->j_async_throttle))
> > - congestion_wait(WRITE, HZ / 10);
> > + congestion_wait(BLK_RW_ASYNC, HZ / 10);
> > return 0;
> > }
> >
> > diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
> > index 1cd3b55..2d3f90a 100644
> > --- a/fs/xfs/linux-2.6/kmem.c
> > +++ b/fs/xfs/linux-2.6/kmem.c
> > @@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
> > printk(KERN_ERR "XFS: possible memory allocation "
> > "deadlock in %s (mode:0x%x)\n",
> > __func__, lflags);
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > } while (1);
> > }
> >
> > @@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
> > printk(KERN_ERR "XFS: possible memory allocation "
> > "deadlock in %s (mode:0x%x)\n",
> > __func__, lflags);
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > } while (1);
> > }
> >
> > diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> > index 1418b91..0c93c7e 100644
> > --- a/fs/xfs/linux-2.6/xfs_buf.c
> > +++ b/fs/xfs/linux-2.6/xfs_buf.c
> > @@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
> >
> > XFS_STATS_INC(xb_page_retries);
> > xfsbufd_wakeup(0, gfp_mask);
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > goto retry;
> > }
> >
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 493b468..c86edd2 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -283,7 +283,6 @@ static wait_queue_head_t congestion_wqh[2] = {
> > __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> > };
> >
> > -
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > enum bdi_state bit;
> > @@ -308,18 +307,18 @@ EXPORT_SYMBOL(set_bdi_congested);
> >
> > /**
> > * congestion_wait - wait for a backing_dev to become uncongested
> > - * @rw: READ or WRITE
> > + * @sync: SYNC or ASYNC IO
> > * @timeout: timeout in jiffies
> > *
> > * Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
> > * write congestion. If no backing_devs are congested then just wait for the
> > * next write to be completed.
> > */
> > -long congestion_wait(int rw, long timeout)
> > +long congestion_wait(int sync, long timeout)
> > {
> > long ret;
> > DEFINE_WAIT(wait);
> > - wait_queue_head_t *wqh = &congestion_wqh[rw];
> > + wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> > ret = io_schedule_timeout(timeout);
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e2fa20d..e717964 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1973,7 +1973,7 @@ try_to_free:
> > if (!progress) {
> > nr_retries--;
> > /* maybe some writeback is necessary */
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
> >
> > }
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 7687879..81627eb 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
> > if (pages_written >= write_chunk)
> > break; /* We've done our duty */
> >
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
> >
> > if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
> > @@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> > if (global_page_state(NR_UNSTABLE_NFS) +
> > global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > break;
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
> > /*
> > * The caller might hold locks which can prevent IO completion
> > @@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
> > if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > /* Wrote less than expected */
> > if (wbc.encountered_congestion || wbc.more_io)
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > else
> > break;
> > }
> > @@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
> > writeback_inodes(&wbc);
> > if (wbc.nr_to_write > 0) {
> > if (wbc.encountered_congestion || wbc.more_io)
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > else
> > break; /* All the old data is written */
> > }
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e0f2cdf..2862bcf 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1666,7 +1666,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
> > preferred_zone, migratetype);
> >
> > if (!page && gfp_mask & __GFP_NOFAIL)
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > } while (!page && (gfp_mask & __GFP_NOFAIL));
> >
> > return page;
> > @@ -1831,7 +1831,7 @@ rebalance:
> > pages_reclaimed += did_some_progress;
> > if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> > /* Wait for some write requests to complete then retry */
> > - congestion_wait(WRITE, HZ/50);
> > + congestion_wait(BLK_RW_ASYNC, HZ/50);
> > goto rebalance;
> > }
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 5415526..dea7abd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1104,7 +1104,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
> > */
> > if (nr_freed < nr_taken && !current_is_kswapd() &&
> > lumpy_reclaim) {
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> >
> > /*
> > * The attempt at page out may have made some
> > @@ -1721,7 +1721,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >
> > /* Take a nap, wait for some writeback to complete */
> > if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > }
> > /* top priority shrink_zones still had more to do? don't OOM, then */
> > if (!sc->all_unreclaimable && scanning_global_lru(sc))
> > @@ -1960,7 +1960,7 @@ loop_again:
> > * another pass across the zones.
> > */
> > if (total_scanned && priority < DEF_PRIORITY - 2)
> > - congestion_wait(WRITE, HZ/10);
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Oh, great...
>
> This particular change will affect _all_ users of
> set_bdi_congested(WRITE)/clear_bdi_congested(WRITE). If you're going to
> do this, then you had better be prepared to change them all. There's one
> in fs/nfs/write.c...

More specifically, you need to audit and fix:

git grep '\(set\|clear\)_bdi_congested *(.*, *\(READ\|WRITE\) *)'
drivers/block/pktcdvd.c: clear_bdi_congested(&pd->disk->queue->ba
drivers/block/pktcdvd.c: set_bdi_congested(&q->backing_dev_info,
fs/fuse/dev.c: clear_bdi_congested(&fc->bdi, READ);
fs/fuse/dev.c: clear_bdi_congested(&fc->bdi, WRITE);
fs/fuse/dev.c: set_bdi_congested(&fc->bdi, READ);
fs/fuse/dev.c: set_bdi_congested(&fc->bdi, WRITE);
fs/nfs/write.c: set_bdi_congested(&nfss->backing_dev_info, WRITE
fs/nfs/write.c: clear_bdi_congested(&nfss->backing_dev_info, WRITE);


2009-07-09 12:48:16

by Jens Axboe

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wed, Jul 08 2009, Trond Myklebust wrote:
> On Wed, 2009-07-08 at 18:35 -0400, Trond Myklebust wrote:
> > On Wed, 2009-07-08 at 20:47 +0200, Jens Axboe wrote:
> > > Hi,
> > >
> > > This one isn't great, we currently have broken congestion wait logic in
> > > the kernel. 2.6.30 is impacted as well, so this patch should go to
> > > stable too once it's in -git. I'll let this one simmer until tomorrow,
> > > then ask Linus to pull it. The offending commit breaking this is
> > > 1faa16d22877f4839bd433547d770c676d1d964c.
> > >
> > > Meanwhile, it could potentially cause buffered writeout slowdowns in the
> > > kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> > > Would be interesting if the submitter could test. I can't find the list,
> > > CC'ing Rafael.
> > >
> > > [~250 lines of the patch quoted verbatim snipped; see the original posting above]
> >
> > Oh, great...
> >
> > This particular change will affect _all_ users of
> > set_bdi_congested(WRITE)/clear_bdi_congested(WRITE). If you're going to
> > do this, then you had better be prepared to change them all. There's one
> > in fs/nfs/write.c...
>
> More specifically, you need to audit and fix:
>
> git grep '\(set\|clear\)_bdi_congested *(.*, *\(READ\|WRITE\) *)'
> drivers/block/pktcdvd.c: clear_bdi_congested(&pd->disk->queue->ba
> drivers/block/pktcdvd.c: set_bdi_congested(&q->backing_dev_info,
> fs/fuse/dev.c: clear_bdi_congested(&fc->bdi, READ);
> fs/fuse/dev.c: clear_bdi_congested(&fc->bdi, WRITE);
> fs/fuse/dev.c: set_bdi_congested(&fc->bdi, READ);
> fs/fuse/dev.c: set_bdi_congested(&fc->bdi, WRITE);
> fs/nfs/write.c: set_bdi_congested(&nfss->backing_dev_info, WRITE
> fs/nfs/write.c: clear_bdi_congested(&nfss->backing_dev_info, WRITE);

Thanks, not sure why those were missed. I'll go over everything again,
thanks for checking!

--
Jens Axboe

2009-07-09 14:02:18

by Christoph Hellwig

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

> > Oh, great...
> >
> > This particular change will affect _all_ users of
> > set_bdi_congested(WRITE)/clear_bdi_congested(WRITE). If you're going to
> > do this, then you had better be prepared to change them all. There's one
> > in fs/nfs/write.c...
>
> More specifically, you need to audit and fix:

Please stop quoting 250+ lines of patch for a couple of lines of reply.
Seriously, that's just extremely rude and annoying.

2009-07-14 11:44:33

by Jan Kara

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Wed 08-07-09 15:12:38, Chris Mason wrote:
> On Wed, Jul 08, 2009 at 08:47:03PM +0200, Jens Axboe wrote:
> > Hi,
> >
> > This one isn't great, we currently have broken congestion wait logic in
> > the kernel. 2.6.30 is impacted as well, so this patch should go to
> > stable too once it's in -git. I'll let this one simmer until tomorrow,
> > then ask Linus to pull it. The offending commit breaking this is
> > 1faa16d22877f4839bd433547d770c676d1d964c.
> >
> > Meanwhile, it could potentially cause buffered writeout slowdowns in the
> > kernel. Perhaps the 2.6.30 regression in that area is caused by this?
> > Would be interesting if the submitter could test. I can't find the list,
> > CC'ing Rafael.
>
> Even if this does slow down some workloads, the bug is not in using the
> correct flag ;) So, I'd ack this one.
>
> Jan Kara was able to reproduce the tiobench 2.6.30 regression, so I've
> cc'd him and kept the patch below.
Thanks for the patch, Chris. I've remeasured tiobench with 2.6.30 plus the
fix, but it didn't help (not too surprising, since what I observe is most
likely CFQ-related; there's no regression with the NOOP scheduler).
Just to recall:
                      Run1  Run2  Run3      Avg   StdDev
2.6.29 (CFQ)
 8                   38.01 40.26 39.69 -> 39.32  0.955092
16                   40.09 38.18 40.05 -> 39.44  0.891104

2.6.30-rc8 (CFQ)
 8                   36.67 36.81 38.20 -> 37.23  0.69062
16                   37.45 36.47 37.46 -> 37.13  0.464351

2.6.30-rc8+fix (CFQ)
 8                   37.56 37.38 37.98 -> 37.64  0.251396
16                   38.11 36.71 37.18 -> 37.33  0.581741

So with the fix there's no statistically significant difference and we
are still below 2.6.29 results. I'm now going to retest with the WRITE_SYNC
changes reverted.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-07-14 13:12:47

by Chris Mason

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Tue, Jul 14, 2009 at 01:44:19PM +0200, Jan Kara wrote:
> On Wed 08-07-09 15:12:38, Chris Mason wrote:
> > On Wed, Jul 08, 2009 at 08:47:03PM +0200, Jens Axboe wrote:
> > > [Jens's original patch description snipped]
> >
> > Even if this does slow down some workloads, the bug is not in using the
> > correct flag ;) So, I'd ack this one.
> >
> > Jan Kara was able to reproduce the tiobench 2.6.30 regression, so I've
> > cc'd him and kept the patch below.
> Thanks for the patch Chris. I've remeasured tiobench with the 2.6.30 +
> the fix but it didn't help (which is not too surprising as what I observe
> is most likely CFQ related as there's no regression with NOOP scheduler).
> Just to recall:
> 2.6.29 (CFQ) Avg StdDev
> 8 38.01 40.26 39.69 -> 39.32 0.955092
> 16 40.09 38.18 40.05 -> 39.44 0.891104
>
> 2.6.30-rc8 (CFQ)
> 8 36.67 36.81 38.20 -> 37.23 0.69062
> 16 37.45 36.47 37.46 -> 37.13 0.464351
>
> 2.6.30-rc8+fix (CFQ)
> 8 37.56 37.38 37.98 -> 37.64 0.251396
> 16 38.11 36.71 37.18 -> 37.33 0.581741
>
> So with the fix there's no statistically significant difference and we
> are still below 2.6.29 results. I'm now going to retest with the WRITE_SYNC
> changes reverted.

Well, it's good the patch didn't make things worse ;) I didn't have the
highest hopes that it would resolve the regression, but thanks for
testing!

-chris

2009-07-14 14:40:11

by Jan Kara

Subject: Re: [PATCH] Fix congestion_wait() sync/async vs read/write confusion

On Tue 14-07-09 09:12:15, Chris Mason wrote:
> On Tue, Jul 14, 2009 at 01:44:19PM +0200, Jan Kara wrote:
> > On Wed 08-07-09 15:12:38, Chris Mason wrote:
> > > On Wed, Jul 08, 2009 at 08:47:03PM +0200, Jens Axboe wrote:
> > > > [Jens's original patch description snipped]
> > >
> > > Even if this does slow down some workloads, the bug is not in using the
> > > correct flag ;) So, I'd ack this one.
> > >
> > > Jan Kara was able to reproduce the tiobench 2.6.30 regression, so I've
> > > cc'd him and kept the patch below.
> > Thanks for the patch Chris. I've remeasured tiobench with the 2.6.30 +
> > the fix but it didn't help (which is not too surprising as what I observe
> > is most likely CFQ related as there's no regression with NOOP scheduler).
> > Just to recall:
> > [tiobench numbers snipped; see the previous message]
> >
> > So with the fix there's no statistically significant difference and we
> > are still below 2.6.29 results. I'm now going to retest with the WRITE_SYNC
> > changes reverted.
>
> Well, it's good the patch didn't make things worse ;) I didn't have the
> highest hopes that it would resolve the regression, but thanks for
> testing!
I've now tried reverting everything that looked WRITE_SYNC-related, but it
didn't help either. Now I'm basically bisecting the CFQ changes, and I'll
see where that leads...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR