Hello,
When writeback is executed asynchronously (e.g. for compression), bios
are bounced to and issued by worker pool shared by all cgroups. This
leads to significant priority inversions when cgroup IO control is in
use - IOs for a low priority cgroup can tie down the workers forcing
higher priority IOs to wait behind them.
This patchset adds an bio punt mechanism to blkcg. A bio tagged with
REQ_CGROUP_PUNT flag is bounced to the asynchronous issue context of
the associated blkcg on bio_submit(). As the bios are issued from
per-blkcg work items, there's no concern for priority inversions and
it doesn't require invasive changes to the filesystems. The mechanism
should be generally useful for IO control support across different
filesystems.
btrfs async path is simplified and is updated to use REQ_CGROUP_PUNT
for submitting asynchronous bios. It's also updated to always and
only account the original writeback bios against wbc so that cgroup
inode writeback ownership arbitration isn't confused by downstream
bios.
This patchset contains the following 8 patches. The first three are
my blkcg patches to implement the needed mechanisms. The latter five
are Chris Mason's btrfs cleanup and update patches. Please let me
know how the patches should be routed. Given that there currently
aren't other users, it's likely the easiest to route all through btrfs
tree.
0001-blkcg-writeback-Add-wbc-no_wbc_acct.patch
0002-blkcg-writeback-Implement-wbc_blkcg_css.patch
0003-blkcg-implement-REQ_CGROUP_PUNT.patch
0004-Btrfs-stop-using-btrfs_schedule_bio.patch
0005-Btrfs-delete-the-entire-async-bio-submission-framewo.patch
0006-Btrfs-only-associate-the-locked-page-with-one-async_.patch
0007-Btrfs-use-REQ_CGROUP_PUNT-for-worker-thread-submitte.patch
0008-Btrfs-extent_write_locked_range-should-attach-inode-.patch
0001-0003 implement wbc->no_wbc_acct, wbc_blkcg_css() and
REQ_CGROUP_PUNT.
0004-0006 are prep patches to simplify / improve async bio submission.
0007 makes btrfs use REQ_CGROUP_PUNT for async bios.
0008 fixes wbc writeback accounting for IOs issued through
extent_write_locked_range().
This patchset is also available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-btrfs-cgroup-updates
Thanks, diffstat follows.
block/blk-cgroup.c | 54 +++++++++
block/blk-core.c | 3
fs/btrfs/compression.c | 16 +-
fs/btrfs/compression.h | 3
fs/btrfs/ctree.h | 1
fs/btrfs/disk-io.c | 25 +---
fs/btrfs/extent_io.c | 15 +-
fs/btrfs/inode.c | 61 ++++++++--
fs/btrfs/super.c | 1
fs/btrfs/volumes.c | 264 --------------------------------------------
fs/btrfs/volumes.h | 10 -
fs/fs-writeback.c | 2
include/linux/backing-dev.h | 1
include/linux/blk-cgroup.h | 16 ++
include/linux/blk_types.h | 10 +
include/linux/writeback.h | 24 +++-
16 files changed, 194 insertions(+), 312 deletions(-)
--
tejun
When writeback IOs are bounced through async layers, the IOs should
only be accounted against the wbc from the original bdi writeback to
avoid confusing cgroup inode ownership arbitration. Add
wbc->no_wbc_acct to allow disabling wbc accounting. This will be used
make btfs compression work well with cgroup IO control.
Signed-off-by: Tejun Heo <[email protected]>
---
fs/fs-writeback.c | 2 +-
include/linux/writeback.h | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..6c8061a49ca0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -721,7 +721,7 @@ void wbc_account_io(struct writeback_control *wbc, struct page *page,
* behind a slow cgroup. Ultimately, we want pageout() to kick off
* regular writeback instead of writing things out itself.
*/
- if (!wbc->wb)
+ if (!wbc->wb || wbc->no_wbc_acct)
return;
id = mem_cgroup_css_from_page(page)->id;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 738a0c24874f..b8f5f000cde4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -68,6 +68,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
+ unsigned no_wbc_acct:1; /* skip wbc IO accounting */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
struct inode *inode; /* inode being written out */
--
2.17.1
From: Chris Mason <[email protected]>
btrfs_schedule_bio() hands IO off to a helper thread to do the actual
submit_bio() call. This has been used to make sure async crc and
compression helpers don't get stuck on IO submission. To maintain good
performance, over time the IO submission threads duplicated some IO
scheduler characteristics such as high and low priority IOs and they
also made some ugly assumptions about request allocation batch sizes.
All of this cost at least one extra context switch during IO submission,
and doesn't fit well with the modern blkmq IO stack. So, this commit stops
using btrfs_schedule_bio(). We may need to adjust the number of async
helper threads for crcs and compression, but long term it's a better
path.
Signed-off-by: Chris Mason <[email protected]>
---
fs/btrfs/compression.c | 8 +++---
fs/btrfs/disk-io.c | 6 ++---
fs/btrfs/inode.c | 6 ++---
fs/btrfs/volumes.c | 55 +++---------------------------------------
fs/btrfs/volumes.h | 2 +-
5 files changed, 15 insertions(+), 62 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4ec1df369e47..873261b932b8 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -355,7 +355,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
BUG_ON(ret); /* -ENOMEM */
}
- ret = btrfs_map_bio(fs_info, bio, 0, 1);
+ ret = btrfs_map_bio(fs_info, bio, 0);
if (ret) {
bio->bi_status = ret;
bio_endio(bio);
@@ -385,7 +385,7 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
BUG_ON(ret); /* -ENOMEM */
}
- ret = btrfs_map_bio(fs_info, bio, 0, 1);
+ ret = btrfs_map_bio(fs_info, bio, 0);
if (ret) {
bio->bi_status = ret;
bio_endio(bio);
@@ -638,7 +638,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
sums += DIV_ROUND_UP(comp_bio->bi_iter.bi_size,
fs_info->sectorsize);
- ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
if (ret) {
comp_bio->bi_status = ret;
bio_endio(comp_bio);
@@ -662,7 +662,7 @@ blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
BUG_ON(ret); /* -ENOMEM */
}
- ret = btrfs_map_bio(fs_info, comp_bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, comp_bio, mirror_num);
if (ret) {
comp_bio->bi_status = ret;
bio_endio(comp_bio);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 663efce22d98..b34240406f36 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -800,7 +800,7 @@ static void run_one_async_done(struct btrfs_work *work)
}
ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
- async->mirror_num, 1);
+ async->mirror_num);
if (ret) {
async->bio->bi_status = ret;
bio_endio(async->bio);
@@ -901,12 +901,12 @@ static blk_status_t btree_submit_bio_hook(struct inode *inode, struct bio *bio,
BTRFS_WQ_ENDIO_METADATA);
if (ret)
goto out_w_error;
- ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, bio, mirror_num);
} else if (!async) {
ret = btree_csum_one_bio(bio);
if (ret)
goto out_w_error;
- ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, bio, mirror_num);
} else {
/*
* kthread helpers are used to submit writes so that
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index d519c3520e87..91b161fb1521 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2032,7 +2032,7 @@ static blk_status_t btrfs_submit_bio_hook(struct inode *inode, struct bio *bio,
}
mapit:
- ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, bio, mirror_num);
out:
if (ret) {
@@ -7764,7 +7764,7 @@ static inline blk_status_t submit_dio_repair_bio(struct inode *inode,
if (ret)
return ret;
- ret = btrfs_map_bio(fs_info, bio, mirror_num, 0);
+ ret = btrfs_map_bio(fs_info, bio, mirror_num);
return ret;
}
@@ -8295,7 +8295,7 @@ static inline blk_status_t btrfs_submit_dio_bio(struct bio *bio,
goto err;
}
map:
- ret = btrfs_map_bio(fs_info, bio, 0, 0);
+ ret = btrfs_map_bio(fs_info, bio, 0);
err:
return ret;
}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 1c2a6e4b39da..72326cc23985 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6386,52 +6386,8 @@ static void btrfs_end_bio(struct bio *bio)
}
}
-/*
- * see run_scheduled_bios for a description of why bios are collected for
- * async submit.
- *
- * This will add one bio to the pending list for a device and make sure
- * the work struct is scheduled.
- */
-static noinline void btrfs_schedule_bio(struct btrfs_device *device,
- struct bio *bio)
-{
- struct btrfs_fs_info *fs_info = device->fs_info;
- int should_queue = 1;
- struct btrfs_pending_bios *pending_bios;
-
- /* don't bother with additional async steps for reads, right now */
- if (bio_op(bio) == REQ_OP_READ) {
- btrfsic_submit_bio(bio);
- return;
- }
-
- WARN_ON(bio->bi_next);
- bio->bi_next = NULL;
-
- spin_lock(&device->io_lock);
- if (op_is_sync(bio->bi_opf))
- pending_bios = &device->pending_sync_bios;
- else
- pending_bios = &device->pending_bios;
-
- if (pending_bios->tail)
- pending_bios->tail->bi_next = bio;
-
- pending_bios->tail = bio;
- if (!pending_bios->head)
- pending_bios->head = bio;
- if (device->running_pending)
- should_queue = 0;
-
- spin_unlock(&device->io_lock);
-
- if (should_queue)
- btrfs_queue_work(fs_info->submit_workers, &device->work);
-}
-
static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
- u64 physical, int dev_nr, int async)
+ u64 physical, int dev_nr)
{
struct btrfs_device *dev = bbio->stripes[dev_nr].dev;
struct btrfs_fs_info *fs_info = bbio->fs_info;
@@ -6449,10 +6405,7 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
btrfs_bio_counter_inc_noblocked(fs_info);
- if (async)
- btrfs_schedule_bio(dev, bio);
- else
- btrfsic_submit_bio(bio);
+ btrfsic_submit_bio(bio);
}
static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
@@ -6473,7 +6426,7 @@ static void bbio_error(struct btrfs_bio *bbio, struct bio *bio, u64 logical)
}
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
- int mirror_num, int async_submit)
+ int mirror_num)
{
struct btrfs_device *dev;
struct bio *first_bio = bio;
@@ -6542,7 +6495,7 @@ blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
bio = first_bio;
submit_stripe_bio(bbio, bio, bbio->stripes[dev_nr].physical,
- dev_nr, async_submit);
+ dev_nr);
}
btrfs_bio_counter_dec(fs_info);
return BLK_STS_OK;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index b8a0e8d0672d..8c7bd79b234a 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -415,7 +415,7 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans, u64 type);
void btrfs_mapping_init(struct btrfs_mapping_tree *tree);
void btrfs_mapping_tree_free(struct btrfs_mapping_tree *tree);
blk_status_t btrfs_map_bio(struct btrfs_fs_info *fs_info, struct bio *bio,
- int mirror_num, int async_submit);
+ int mirror_num);
int btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
fmode_t flags, void *holder);
struct btrfs_device *btrfs_scan_one_device(const char *path,
--
2.17.1
From: Chris Mason <[email protected]>
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Signed-off-by: Chris Mason <[email protected]>
---
fs/btrfs/extent_io.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index d7b57341ff1a..afb916a69c30 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4180,6 +4180,7 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
.no_wbc_acct = 1,
};
+ wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
while (start <= end) {
page = find_get_page(mapping, start >> PAGE_SHIFT);
if (clear_page_dirty_for_io(page))
@@ -4194,11 +4195,12 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
}
ASSERT(ret <= 0);
- if (ret < 0) {
+ if (ret == 0)
+ ret = flush_write_bio(&epd);
+ else
end_write_bio(&epd, ret);
- return ret;
- }
- ret = flush_write_bio(&epd);
+
+ wbc_detach_inode(&wbc_writepages);
return ret;
}
--
2.17.1
From: Chris Mason <[email protected]>
Async CRCs and compression submit IO through helper threads, which
means they have IO priority inversions when cgroup IO controllers are
in use.
This flags all of the writes submitted by btrfs helper threads as
REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
work items to avoid the priority inversion.
For the compression code, we take a reference on the wbc's blkg css and
pass it down to the async workers.
For the async crcs, the bio already has the correct css, we just need to
tell the block layer to use REQ_CGROUP_PUNT.
Signed-off-by: Chris Mason <[email protected]>
Modified-and-reviewed-by: Tejun Heo <[email protected]>
---
fs/btrfs/compression.c | 8 +++++++-
fs/btrfs/compression.h | 3 ++-
fs/btrfs/disk-io.c | 6 ++++++
fs/btrfs/extent_io.c | 3 +++
fs/btrfs/inode.c | 30 +++++++++++++++++++++++++++---
5 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 873261b932b8..138479a9576c 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -289,7 +289,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
unsigned long compressed_len,
struct page **compressed_pages,
unsigned long nr_pages,
- unsigned int write_flags)
+ unsigned int write_flags,
+ struct cgroup_subsys_state *blkcg_css)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct bio *bio = NULL;
@@ -323,6 +324,11 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
bio->bi_opf = REQ_OP_WRITE | write_flags;
bio->bi_private = cb;
bio->bi_end_io = end_compressed_bio_write;
+
+ if (blkcg_css) {
+ bio->bi_opf |= REQ_CGROUP_PUNT;
+ bio_associate_blkg_from_css(bio, blkcg_css);
+ }
refcount_set(&cb->pending_bios, 1);
/* create and submit bios for the compressed pages */
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 9976fe0f7526..7cbefab96ecf 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -93,7 +93,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
unsigned long compressed_len,
struct page **compressed_pages,
unsigned long nr_pages,
- unsigned int write_flags);
+ unsigned int write_flags,
+ struct cgroup_subsys_state *blkcg_css);
blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
int mirror_num, unsigned long bio_flags);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9dbe4ba3995d..a5ebbf3d0833 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -799,6 +799,12 @@ static void run_one_async_done(struct btrfs_work *work)
return;
}
+ /*
+ * All of the bios that pass through here are from async helpers.
+ * Use REQ_CGROUP_PUNT to issue them from the owning cgroup's
+ * context. This changes nothing when cgroups aren't in use.
+ */
+ async->bio->bi_opf |= REQ_CGROUP_PUNT;
ret = btrfs_map_bio(btrfs_sb(inode->i_sb), async->bio,
async->mirror_num);
if (ret) {
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 9f223d7d78c0..d7b57341ff1a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4175,6 +4175,9 @@ int extent_write_locked_range(struct inode *inode, u64 start, u64 end,
.nr_to_write = nr_pages * 2,
.range_start = start,
.range_end = end + 1,
+ /* we're called from an async helper function */
+ .punt_to_cgroup = 1,
+ .no_wbc_acct = 1,
};
while (start <= end) {
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index df5527cc07b9..3f9b35bc0455 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -357,6 +357,7 @@ struct async_extent {
};
struct async_chunk {
+ struct cgroup_subsys_state *blkcg_css;
struct inode *inode;
struct page *locked_page;
u64 start;
@@ -846,7 +847,8 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
ins.objectid,
ins.offset, async_extent->pages,
async_extent->nr_pages,
- async_chunk->write_flags)) {
+ async_chunk->write_flags,
+ async_chunk->blkcg_css)) {
struct page *p = async_extent->pages[0];
const u64 start = async_extent->start;
const u64 end = start + async_extent->ram_size - 1;
@@ -1170,6 +1172,8 @@ static noinline void async_cow_free(struct btrfs_work *work)
async_chunk = container_of(work, struct async_chunk, work);
if (async_chunk->inode)
btrfs_add_delayed_iput(async_chunk->inode);
+ if (async_chunk->blkcg_css)
+ css_put(async_chunk->blkcg_css);
/*
* Since the pointer to 'pending' is at the beginning of the array of
* async_chunk's, freeing it ensures the whole array has been freed.
@@ -1178,12 +1182,15 @@ static noinline void async_cow_free(struct btrfs_work *work)
kvfree(async_chunk->pending);
}
-static int cow_file_range_async(struct inode *inode, struct page *locked_page,
+static int cow_file_range_async(struct inode *inode,
+ struct writeback_control *wbc,
+ struct page *locked_page,
u64 start, u64 end, int *page_started,
unsigned long *nr_written,
unsigned int write_flags)
{
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
+ struct cgroup_subsys_state *blkcg_css = wbc_blkcg_css(wbc);
struct async_cow *ctx;
struct async_chunk *async_chunk;
unsigned long nr_pages;
@@ -1251,12 +1258,29 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
* to unlock it.
*/
if (locked_page) {
+ /*
+ * Depending on the compressibility, the pages
+ * might or might not go through async. We want
+ * all of them to be accounted against @wbc once.
+ * Let's do it here before the paths diverge. wbc
+ * accounting is used only for foreign writeback
+ * detection and doesn't need full accuracy. Just
+ * account the whole thing against the first page.
+ */
+ wbc_account_io(wbc, locked_page, cur_end - start);
async_chunk[i].locked_page = locked_page;
locked_page = NULL;
} else {
async_chunk[i].locked_page = NULL;
}
+ if (blkcg_css != blkcg_root_css) {
+ css_get(blkcg_css);
+ async_chunk[i].blkcg_css = blkcg_css;
+ } else {
+ async_chunk[i].blkcg_css = NULL;
+ }
+
btrfs_init_work(&async_chunk[i].work,
btrfs_delalloc_helper,
async_cow_start, async_cow_submit,
@@ -1653,7 +1677,7 @@ int btrfs_run_delalloc_range(struct inode *inode, struct page *locked_page,
} else {
set_bit(BTRFS_INODE_HAS_ASYNC_EXTENT,
&BTRFS_I(inode)->runtime_flags);
- ret = cow_file_range_async(inode, locked_page, start, end,
+ ret = cow_file_range_async(inode, wbc, locked_page, start, end,
page_started, nr_written,
write_flags);
}
--
2.17.1
From: Chris Mason <[email protected]>
Now that we're not using btrfs_schedule_bio() anymore, delete all the
code that supported it.
Signed-off-by: Chris Mason <[email protected]>
---
fs/btrfs/ctree.h | 1 -
fs/btrfs/disk-io.c | 13 +--
fs/btrfs/super.c | 1 -
fs/btrfs/volumes.c | 209 ---------------------------------------------
fs/btrfs/volumes.h | 8 --
5 files changed, 1 insertion(+), 231 deletions(-)
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b81c331b28fa..2a5ba0f85ed3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -989,7 +989,6 @@ struct btrfs_fs_info {
struct btrfs_workqueue *endio_meta_write_workers;
struct btrfs_workqueue *endio_write_workers;
struct btrfs_workqueue *endio_freespace_worker;
- struct btrfs_workqueue *submit_workers;
struct btrfs_workqueue *caching_workers;
struct btrfs_workqueue *readahead_workers;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b34240406f36..9dbe4ba3995d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2028,7 +2028,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
btrfs_destroy_workqueue(fs_info->rmw_workers);
btrfs_destroy_workqueue(fs_info->endio_write_workers);
btrfs_destroy_workqueue(fs_info->endio_freespace_worker);
- btrfs_destroy_workqueue(fs_info->submit_workers);
btrfs_destroy_workqueue(fs_info->delayed_workers);
btrfs_destroy_workqueue(fs_info->caching_workers);
btrfs_destroy_workqueue(fs_info->readahead_workers);
@@ -2194,16 +2193,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
fs_info->caching_workers =
btrfs_alloc_workqueue(fs_info, "cache", flags, max_active, 0);
- /*
- * a higher idle thresh on the submit workers makes it much more
- * likely that bios will be send down in a sane order to the
- * devices
- */
- fs_info->submit_workers =
- btrfs_alloc_workqueue(fs_info, "submit", flags,
- min_t(u64, fs_devices->num_devices,
- max_active), 64);
-
fs_info->fixup_workers =
btrfs_alloc_workqueue(fs_info, "fixup", flags, 1, 0);
@@ -2246,7 +2235,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info,
max_active), 8);
if (!(fs_info->workers && fs_info->delalloc_workers &&
- fs_info->submit_workers && fs_info->flush_workers &&
+ fs_info->flush_workers &&
fs_info->endio_workers && fs_info->endio_meta_workers &&
fs_info->endio_meta_write_workers &&
fs_info->endio_repair_workers &&
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 2c66d9ea6a3b..3fb86a7bfdf7 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1668,7 +1668,6 @@ static void btrfs_resize_thread_pool(struct btrfs_fs_info *fs_info,
btrfs_workqueue_set_max(fs_info->workers, new_pool_size);
btrfs_workqueue_set_max(fs_info->delalloc_workers, new_pool_size);
- btrfs_workqueue_set_max(fs_info->submit_workers, new_pool_size);
btrfs_workqueue_set_max(fs_info->caching_workers, new_pool_size);
btrfs_workqueue_set_max(fs_info->endio_workers, new_pool_size);
btrfs_workqueue_set_max(fs_info->endio_meta_workers, new_pool_size);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 72326cc23985..fc3a16d87869 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -509,212 +509,6 @@ btrfs_get_bdev_and_sb(const char *device_path, fmode_t flags, void *holder,
return ret;
}
-static void requeue_list(struct btrfs_pending_bios *pending_bios,
- struct bio *head, struct bio *tail)
-{
-
- struct bio *old_head;
-
- old_head = pending_bios->head;
- pending_bios->head = head;
- if (pending_bios->tail)
- tail->bi_next = old_head;
- else
- pending_bios->tail = tail;
-}
-
-/*
- * we try to collect pending bios for a device so we don't get a large
- * number of procs sending bios down to the same device. This greatly
- * improves the schedulers ability to collect and merge the bios.
- *
- * But, it also turns into a long list of bios to process and that is sure
- * to eventually make the worker thread block. The solution here is to
- * make some progress and then put this work struct back at the end of
- * the list if the block device is congested. This way, multiple devices
- * can make progress from a single worker thread.
- */
-static noinline void run_scheduled_bios(struct btrfs_device *device)
-{
- struct btrfs_fs_info *fs_info = device->fs_info;
- struct bio *pending;
- struct backing_dev_info *bdi;
- struct btrfs_pending_bios *pending_bios;
- struct bio *tail;
- struct bio *cur;
- int again = 0;
- unsigned long num_run;
- unsigned long batch_run = 0;
- unsigned long last_waited = 0;
- int force_reg = 0;
- int sync_pending = 0;
- struct blk_plug plug;
-
- /*
- * this function runs all the bios we've collected for
- * a particular device. We don't want to wander off to
- * another device without first sending all of these down.
- * So, setup a plug here and finish it off before we return
- */
- blk_start_plug(&plug);
-
- bdi = device->bdev->bd_bdi;
-
-loop:
- spin_lock(&device->io_lock);
-
-loop_lock:
- num_run = 0;
-
- /* take all the bios off the list at once and process them
- * later on (without the lock held). But, remember the
- * tail and other pointers so the bios can be properly reinserted
- * into the list if we hit congestion
- */
- if (!force_reg && device->pending_sync_bios.head) {
- pending_bios = &device->pending_sync_bios;
- force_reg = 1;
- } else {
- pending_bios = &device->pending_bios;
- force_reg = 0;
- }
-
- pending = pending_bios->head;
- tail = pending_bios->tail;
- WARN_ON(pending && !tail);
-
- /*
- * if pending was null this time around, no bios need processing
- * at all and we can stop. Otherwise it'll loop back up again
- * and do an additional check so no bios are missed.
- *
- * device->running_pending is used to synchronize with the
- * schedule_bio code.
- */
- if (device->pending_sync_bios.head == NULL &&
- device->pending_bios.head == NULL) {
- again = 0;
- device->running_pending = 0;
- } else {
- again = 1;
- device->running_pending = 1;
- }
-
- pending_bios->head = NULL;
- pending_bios->tail = NULL;
-
- spin_unlock(&device->io_lock);
-
- while (pending) {
-
- rmb();
- /* we want to work on both lists, but do more bios on the
- * sync list than the regular list
- */
- if ((num_run > 32 &&
- pending_bios != &device->pending_sync_bios &&
- device->pending_sync_bios.head) ||
- (num_run > 64 && pending_bios == &device->pending_sync_bios &&
- device->pending_bios.head)) {
- spin_lock(&device->io_lock);
- requeue_list(pending_bios, pending, tail);
- goto loop_lock;
- }
-
- cur = pending;
- pending = pending->bi_next;
- cur->bi_next = NULL;
-
- BUG_ON(atomic_read(&cur->__bi_cnt) == 0);
-
- /*
- * if we're doing the sync list, record that our
- * plug has some sync requests on it
- *
- * If we're doing the regular list and there are
- * sync requests sitting around, unplug before
- * we add more
- */
- if (pending_bios == &device->pending_sync_bios) {
- sync_pending = 1;
- } else if (sync_pending) {
- blk_finish_plug(&plug);
- blk_start_plug(&plug);
- sync_pending = 0;
- }
-
- btrfsic_submit_bio(cur);
- num_run++;
- batch_run++;
-
- cond_resched();
-
- /*
- * we made progress, there is more work to do and the bdi
- * is now congested. Back off and let other work structs
- * run instead
- */
- if (pending && bdi_write_congested(bdi) && batch_run > 8 &&
- fs_info->fs_devices->open_devices > 1) {
- struct io_context *ioc;
-
- ioc = current->io_context;
-
- /*
- * the main goal here is that we don't want to
- * block if we're going to be able to submit
- * more requests without blocking.
- *
- * This code does two great things, it pokes into
- * the elevator code from a filesystem _and_
- * it makes assumptions about how batching works.
- */
- if (ioc && ioc->nr_batch_requests > 0 &&
- time_before(jiffies, ioc->last_waited + HZ/50UL) &&
- (last_waited == 0 ||
- ioc->last_waited == last_waited)) {
- /*
- * we want to go through our batch of
- * requests and stop. So, we copy out
- * the ioc->last_waited time and test
- * against it before looping
- */
- last_waited = ioc->last_waited;
- cond_resched();
- continue;
- }
- spin_lock(&device->io_lock);
- requeue_list(pending_bios, pending, tail);
- device->running_pending = 1;
-
- spin_unlock(&device->io_lock);
- btrfs_queue_work(fs_info->submit_workers,
- &device->work);
- goto done;
- }
- }
-
- cond_resched();
- if (again)
- goto loop;
-
- spin_lock(&device->io_lock);
- if (device->pending_bios.head || device->pending_sync_bios.head)
- goto loop_lock;
- spin_unlock(&device->io_lock);
-
-done:
- blk_finish_plug(&plug);
-}
-
-static void pending_bios_fn(struct btrfs_work *work)
-{
- struct btrfs_device *device;
-
- device = container_of(work, struct btrfs_device, work);
- run_scheduled_bios(device);
-}
-
static bool device_path_matched(const char *path, struct btrfs_device *device)
{
int found;
@@ -6599,9 +6393,6 @@ struct btrfs_device *btrfs_alloc_device(struct btrfs_fs_info *fs_info,
else
generate_random_uuid(dev->uuid);
- btrfs_init_work(&dev->work, btrfs_submit_helper,
- pending_bios_fn, NULL, NULL);
-
return dev;
}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 8c7bd79b234a..231f50dd107d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -18,10 +18,6 @@ extern struct mutex uuid_mutex;
#define BTRFS_STRIPE_LEN SZ_64K
struct buffer_head;
-struct btrfs_pending_bios {
- struct bio *head;
- struct bio *tail;
-};
/*
* Use sequence counter to get consistent device stat data on
@@ -55,10 +51,6 @@ struct btrfs_device {
spinlock_t io_lock ____cacheline_aligned;
int running_pending;
- /* regular prio bios */
- struct btrfs_pending_bios pending_bios;
- /* sync bios */
- struct btrfs_pending_bios pending_sync_bios;
struct block_device *bdev;
--
2.17.1
When a shared kthread needs to issue a bio for a cgroup, doing so
synchronously can lead to priority inversions as the kthread can be
trapped waiting for that cgroup. This patch implements
REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
to a dedicated per-blkcg work item to avoid such priority inversions.
This will be used to fix priority inversions in btrfs compression and
should be generally useful as we grow filesystem support for
comprehensive IO control.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Chris Mason <[email protected]>
---
block/blk-cgroup.c | 56 +++++++++++++++++++++++++++++++++++--
block/blk-core.c | 3 ++
include/linux/backing-dev.h | 1 +
include/linux/blk-cgroup.h | 16 ++++++++++-
include/linux/blk_types.h | 10 +++++++
include/linux/writeback.h | 13 +++++++--
6 files changed, 92 insertions(+), 7 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 617a2b3f7582..64d80a661205 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -46,12 +46,10 @@ struct blkcg blkcg_root;
EXPORT_SYMBOL_GPL(blkcg_root);
struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
-
static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];
-
static LIST_HEAD(all_blkcgs); /* protected by blkcg_pol_mutex */
-
static bool blkcg_debug_stats = false;
+static struct workqueue_struct *blkcg_punt_bio_wq;
static bool blkcg_policy_enabled(struct request_queue *q,
const struct blkcg_policy *pol)
@@ -87,6 +85,8 @@ static void __blkg_release(struct rcu_head *rcu)
percpu_ref_exit(&blkg->refcnt);
+ WARN_ON(!bio_list_empty(&blkg->async_bios));
+
/* release the blkcg and parent blkg refs this blkg has been holding */
css_put(&blkg->blkcg->css);
if (blkg->parent)
@@ -112,6 +112,23 @@ static void blkg_release(struct percpu_ref *ref)
call_rcu(&blkg->rcu_head, __blkg_release);
}
+static void blkg_async_bio_workfn(struct work_struct *work)
+{
+ struct blkcg_gq *blkg = container_of(work, struct blkcg_gq,
+ async_bio_work);
+ struct bio_list bios = BIO_EMPTY_LIST;
+ struct bio *bio;
+
+ /* as long as there are pending bios, @blkg can't go away */
+ spin_lock_bh(&blkg->async_bio_lock);
+ bio_list_merge(&bios, &blkg->async_bios);
+ bio_list_init(&blkg->async_bios);
+ spin_unlock_bh(&blkg->async_bio_lock);
+
+ while ((bio = bio_list_pop(&bios)))
+ submit_bio(bio);
+}
+
/**
* blkg_alloc - allocate a blkg
* @blkcg: block cgroup the new blkg is associated with
@@ -137,6 +154,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
blkg->q = q;
INIT_LIST_HEAD(&blkg->q_node);
+ spin_lock_init(&blkg->async_bio_lock);
+ bio_list_init(&blkg->async_bios);
+ INIT_WORK(&blkg->async_bio_work, blkg_async_bio_workfn);
blkg->blkcg = blkcg;
for (i = 0; i < BLKCG_MAX_POLS; i++) {
@@ -1582,6 +1602,25 @@ void blkcg_policy_unregister(struct blkcg_policy *pol)
}
EXPORT_SYMBOL_GPL(blkcg_policy_unregister);
+bool __blkcg_punt_bio_submit(struct bio *bio)
+{
+ struct blkcg_gq *blkg = bio->bi_blkg;
+
+ /* consume the flag first */
+ bio->bi_opf &= ~REQ_CGROUP_PUNT;
+
+ /* never bounce for the root cgroup */
+ if (!blkg->parent)
+ return false;
+
+ spin_lock_bh(&blkg->async_bio_lock);
+ bio_list_add(&blkg->async_bios, bio);
+ spin_unlock_bh(&blkg->async_bio_lock);
+
+ queue_work(blkcg_punt_bio_wq, &blkg->async_bio_work);
+ return true;
+}
+
/*
* Scale the accumulated delay based on how long it has been since we updated
* the delay. We only call this when we are adding delay, in case it's been a
@@ -1782,5 +1821,16 @@ void blkcg_add_delay(struct blkcg_gq *blkg, u64 now, u64 delta)
atomic64_add(delta, &blkg->delay_nsec);
}
+static int __init blkcg_init(void)
+{
+ blkcg_punt_bio_wq = alloc_workqueue("blkcg_punt_bio",
+ WQ_MEM_RECLAIM | WQ_FREEZABLE |
+ WQ_UNBOUND | WQ_SYSFS, 0);
+ if (!blkcg_punt_bio_wq)
+ return -ENOMEM;
+ return 0;
+}
+subsys_initcall(blkcg_init);
+
module_param(blkcg_debug_stats, bool, 0644);
MODULE_PARM_DESC(blkcg_debug_stats, "True if you want debug stats, false if not");
diff --git a/block/blk-core.c b/block/blk-core.c
index a55389ba8779..5879c1ec044d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1165,6 +1165,9 @@ EXPORT_SYMBOL_GPL(direct_make_request);
*/
blk_qc_t submit_bio(struct bio *bio)
{
+ if (blkcg_punt_bio_submit(bio))
+ return BLK_QC_T_NONE;
+
/*
* If it's a regular read/write or a barrier with data attached,
* go through the normal accounting stuff before submission.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f9b029180241..35b31d176f74 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -48,6 +48,7 @@ extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
extern struct workqueue_struct *bdi_wq;
+extern struct workqueue_struct *bdi_async_bio_wq;
static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
{
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 76c61318fda5..ffb2f88e87c6 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -134,13 +134,17 @@ struct blkcg_gq {
struct blkg_policy_data *pd[BLKCG_MAX_POLS];
- struct rcu_head rcu_head;
+ spinlock_t async_bio_lock;
+ struct bio_list async_bios;
+ struct work_struct async_bio_work;
atomic_t use_delay;
atomic64_t delay_nsec;
atomic64_t delay_start;
u64 last_delay;
int last_use;
+
+ struct rcu_head rcu_head;
};
typedef struct blkcg_policy_data *(blkcg_pol_alloc_cpd_fn)(gfp_t gfp);
@@ -763,6 +767,15 @@ static inline bool blk_throtl_bio(struct request_queue *q, struct blkcg_gq *blkg
struct bio *bio) { return false; }
#endif
+bool __blkcg_punt_bio_submit(struct bio *bio);
+
+static inline bool blkcg_punt_bio_submit(struct bio *bio)
+{
+ if (bio->bi_opf & REQ_CGROUP_PUNT)
+ return __blkcg_punt_bio_submit(bio);
+ else
+ return false;
+}
static inline void blkcg_bio_issue_init(struct bio *bio)
{
@@ -910,6 +923,7 @@ static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
static inline void blkg_get(struct blkcg_gq *blkg) { }
static inline void blkg_put(struct blkcg_gq *blkg) { }
+static inline bool blkcg_punt_bio_submit(struct bio *bio) { return false; }
static inline void blkcg_bio_issue_init(struct bio *bio) { }
static inline bool blkcg_bio_issue_check(struct request_queue *q,
struct bio *bio) { return true; }
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 791fee35df88..e8b42a786315 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -321,6 +321,14 @@ enum req_flag_bits {
__REQ_RAHEAD, /* read ahead, can fail anytime */
__REQ_BACKGROUND, /* background IO */
__REQ_NOWAIT, /* Don't wait if request will block */
+ /*
+ * When a shared kthread needs to issue a bio for a cgroup, doing
+ * so synchronously can lead to priority inversions as the kthread
+ * can be trapped waiting for that cgroup. CGROUP_PUNT flag makes
+ * submit_bio() punt the actual issuing to a dedicated per-blkcg
+ * work item to avoid such priority inversions.
+ */
+ __REQ_CGROUP_PUNT,
/* command specific flags for REQ_OP_WRITE_ZEROES: */
__REQ_NOUNMAP, /* do not free blocks when zeroing */
@@ -347,6 +355,8 @@ enum req_flag_bits {
#define REQ_RAHEAD (1ULL << __REQ_RAHEAD)
#define REQ_BACKGROUND (1ULL << __REQ_BACKGROUND)
#define REQ_NOWAIT (1ULL << __REQ_NOWAIT)
+#define REQ_CGROUP_PUNT (1ULL << __REQ_CGROUP_PUNT)
+
#define REQ_NOUNMAP (1ULL << __REQ_NOUNMAP)
#define REQ_HIPRI (1ULL << __REQ_HIPRI)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 1c85563f035d..be602c42aab8 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -11,6 +11,7 @@
#include <linux/flex_proportions.h>
#include <linux/backing-dev-defs.h>
#include <linux/blk_types.h>
+#include <linux/blk-cgroup.h>
struct bio;
@@ -69,6 +70,7 @@ struct writeback_control {
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned no_wbc_acct:1; /* skip wbc IO accounting */
+ unsigned punt_to_cgroup:1; /* cgrp punting, see __REQ_CGROUP_PUNT */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *wb; /* wb this writeback is issued under */
struct inode *inode; /* inode being written out */
@@ -85,12 +87,17 @@ struct writeback_control {
static inline int wbc_to_write_flags(struct writeback_control *wbc)
{
+ int flags = 0;
+
+ if (wbc->punt_to_cgroup)
+ flags = REQ_CGROUP_PUNT;
+
if (wbc->sync_mode == WB_SYNC_ALL)
- return REQ_SYNC;
+ flags |= REQ_SYNC;
else if (wbc->for_kupdate || wbc->for_background)
- return REQ_BACKGROUND;
+ flags |= REQ_BACKGROUND;
- return 0;
+ return flags;
}
static inline struct cgroup_subsys_state *
--
2.17.1
Add a helper to determine the target blkcg from wbc.
Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/writeback.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b8f5f000cde4..1c85563f035d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,6 +93,16 @@ static inline int wbc_to_write_flags(struct writeback_control *wbc)
return 0;
}
+static inline struct cgroup_subsys_state *
+wbc_blkcg_css(struct writeback_control *wbc)
+{
+#ifdef CONFIG_CGROUP_WRITEBACK
+ if (wbc->wb)
+ return wbc->wb->blkcg_css;
+#endif
+ return blkcg_root_css;
+}
+
/*
* A wb_domain represents a domain that wb's (bdi_writeback's) belong to
* and are measured against each other in. There always is one global
--
2.17.1
From: Chris Mason <[email protected]>
The btrfs writepages function collects a large range of pages flagged
for delayed allocation, and then sends them down through the COW code
for processing. When compression is on, we allocate one async_cow
structure for every 512K, and then run those pages through the
compression code for IO submission.
writepages starts all of this off with a single page, locked by
the original call to extent_write_cache_pages(), and it's important to
keep track of this page because it has already been through
clear_page_dirty_for_io().
The btrfs async_cow struct has a pointer to the locked_page, and when
we're redirtying the page because compression had to fallback to
uncompressed IO, we use page->index to decide if a given async_cow
struct really owns that page.
But, this is racey. If a given delalloc range is broken up into two
async_cows (cow_A and cow_B), we can end up with something like this:
compress_file_range(cowA)
submit_compress_extents(cowA)
submit compressed bios(cowA)
put_page(locked_page)
compress_file_range(cowB)
...
The end result is that cowA is completed and cleaned up before cowB even
starts processing. This means we can free locked_page() and reuse it
elsewhere. If we get really lucky, it'll have the same page->index in
its new home as it did before.
While we're processing cowB, we might decide we need to fall back to
uncompressed IO, and so compress_file_range() will call
__set_page_dirty_nobufers() on cowB->locked_page.
Without cgroups in use, this creates as a phantom dirty page, which
isn't great but isn't the end of the world. With cgroups in use, we
might crash in the accounting code because page->mapping->i_wb isn't
set.
[ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
[ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
[ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
[ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
[ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
[ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
[ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
[ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
[ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
[ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
[ 8308.582520] FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
[ 8308.585440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
[ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 8308.594469] Call Trace:
[ 8308.595149] account_page_cleaned+0x15b/0x1f0
[ 8308.596340] __cancel_dirty_page+0x146/0x200
[ 8308.599395] truncate_cleanup_page+0x92/0xb0
[ 8308.600480] truncate_inode_pages_range+0x202/0x7d0
[ 8308.617392] btrfs_evict_inode+0x92/0x5a0
[ 8308.619108] evict+0xc1/0x190
[ 8308.620023] do_unlinkat+0x176/0x280
[ 8308.621202] do_syscall_64+0x63/0x1a0
[ 8308.623451] entry_SYSCALL_64_after_hwframe+0x42/0xb7
The fix here is to make asyc_cow->locked_page NULL everywhere but the
one async_cow struct that's allowed to do things to the locked page.
Signed-off-by: Chris Mason <[email protected]>
Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
---
fs/btrfs/extent_io.c | 2 +-
fs/btrfs/inode.c | 25 +++++++++++++++++++++----
2 files changed, 22 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 13fca7bfc1f2..9f223d7d78c0 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
if (page_ops & PAGE_SET_PRIVATE2)
SetPagePrivate2(pages[i]);
- if (pages[i] == locked_page) {
+ if (locked_page && pages[i] == locked_page) {
put_page(pages[i]);
pages_locked++;
continue;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 91b161fb1521..df5527cc07b9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -666,10 +666,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
* to our extent and set things up for the async work queue to run
* cow_file_range to do the normal delalloc dance.
*/
- if (page_offset(async_chunk->locked_page) >= start &&
- page_offset(async_chunk->locked_page) <= end)
+ if (async_chunk->locked_page &&
+ (page_offset(async_chunk->locked_page) >= start &&
+ page_offset(async_chunk->locked_page)) <= end) {
__set_page_dirty_nobuffers(async_chunk->locked_page);
/* unlocked later on in the async handlers */
+ }
if (redirty)
extent_range_redirty_for_io(inode, start, end);
@@ -759,7 +761,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
async_extent->start +
async_extent->ram_size - 1,
WB_SYNC_ALL);
- else if (ret)
+ else if (ret && async_chunk->locked_page)
unlock_page(async_chunk->locked_page);
kfree(async_extent);
cond_resched();
@@ -1236,10 +1238,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
async_chunk[i].inode = inode;
async_chunk[i].start = start;
async_chunk[i].end = cur_end;
- async_chunk[i].locked_page = locked_page;
async_chunk[i].write_flags = write_flags;
INIT_LIST_HEAD(&async_chunk[i].extents);
+ /*
+ * The locked_page comes all the way from writepage and its
+ * the original page we were actually given. As we spread
+ * this large delalloc region across multiple async_cow
+ * structs, only the first struct needs a pointer to locked_page
+ *
+ * This way we don't need racey decisions about who is supposed
+ * to unlock it.
+ */
+ if (locked_page) {
+ async_chunk[i].locked_page = locked_page;
+ locked_page = NULL;
+ } else {
+ async_chunk[i].locked_page = NULL;
+ }
+
btrfs_init_work(&async_chunk[i].work,
btrfs_delalloc_helper,
async_cow_start, async_cow_submit,
--
2.17.1
On Thu, Jun 13, 2019 at 05:33:44PM -0700, Tejun Heo wrote:
> Add a helper to determine the target blkcg from wbc.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:43PM -0700, Tejun Heo wrote:
> When writeback IOs are bounced through async layers, the IOs should
> only be accounted against the wbc from the original bdi writeback to
> avoid confusing cgroup inode ownership arbitration. Add
> wbc->no_wbc_acct to allow disabling wbc accounting. This will be used
> make btfs compression work well with cgroup IO control.
>
> Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:45PM -0700, Tejun Heo wrote:
> When a shared kthread needs to issue a bio for a cgroup, doing so
> synchronously can lead to priority inversions as the kthread can be
> trapped waiting for that cgroup. This patch implements
> REQ_CGROUP_PUNT flag which makes submit_bio() punt the actual issuing
> to a dedicated per-blkcg work item to avoid such priority inversions.
>
> This will be used to fix priority inversions in btrfs compression and
> should be generally useful as we grow filesystem support for
> comprehensive IO control.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Chris Mason <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:46PM -0700, Tejun Heo wrote:
> From: Chris Mason <[email protected]>
>
> btrfs_schedule_bio() hands IO off to a helper thread to do the actual
> submit_bio() call. This has been used to make sure async crc and
> compression helpers don't get stuck on IO submission. To maintain good
> performance, over time the IO submission threads duplicated some IO
> scheduler characteristics such as high and low priority IOs and they
> also made some ugly assumptions about request allocation batch sizes.
>
> All of this cost at least one extra context switch during IO submission,
> and doesn't fit well with the modern blkmq IO stack. So, this commit stops
> using btrfs_schedule_bio(). We may need to adjust the number of async
> helper threads for crcs and compression, but long term it's a better
> path.
>
> Signed-off-by: Chris Mason <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:47PM -0700, Tejun Heo wrote:
> From: Chris Mason <[email protected]>
>
> Now that we're not using btrfs_schedule_bio() anymore, delete all the
> code that supported it.
>
> Signed-off-by: Chris Mason <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:48PM -0700, Tejun Heo wrote:
> From: Chris Mason <[email protected]>
>
> The btrfs writepages function collects a large range of pages flagged
> for delayed allocation, and then sends them down through the COW code
> for processing. When compression is on, we allocate one async_cow
> structure for every 512K, and then run those pages through the
> compression code for IO submission.
>
> writepages starts all of this off with a single page, locked by
> the original call to extent_write_cache_pages(), and it's important to
> keep track of this page because it has already been through
> clear_page_dirty_for_io().
>
> The btrfs async_cow struct has a pointer to the locked_page, and when
> we're redirtying the page because compression had to fallback to
> uncompressed IO, we use page->index to decide if a given async_cow
> struct really owns that page.
>
> But, this is racey. If a given delalloc range is broken up into two
> async_cows (cow_A and cow_B), we can end up with something like this:
>
> compress_file_range(cowA)
> submit_compress_extents(cowA)
> submit compressed bios(cowA)
> put_page(locked_page)
>
> compress_file_range(cowB)
> ...
>
> The end result is that cowA is completed and cleaned up before cowB even
> starts processing. This means we can free locked_page() and reuse it
> elsewhere. If we get really lucky, it'll have the same page->index in
> its new home as it did before.
>
> While we're processing cowB, we might decide we need to fall back to
> uncompressed IO, and so compress_file_range() will call
> __set_page_dirty_nobufers() on cowB->locked_page.
>
> Without cgroups in use, this creates as a phantom dirty page, which
> isn't great but isn't the end of the world. With cgroups in use, we
> might crash in the accounting code because page->mapping->i_wb isn't
> set.
>
> [ 8308.523110] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
> [ 8308.531084] IP: percpu_counter_add_batch+0x11/0x70
> [ 8308.538371] PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
> [ 8308.541750] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 8308.551948] CPU: 16 PID: 2172 Comm: rm Not tainted
> [ 8308.566883] RIP: 0010:percpu_counter_add_batch+0x11/0x70
> [ 8308.567891] RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
> [ 8308.568986] RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
> [ 8308.570734] RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
> [ 8308.572543] RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
> [ 8308.573856] R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
> [ 8308.580099] R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
> [ 8308.582520] FS: 00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
> [ 8308.585440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 8308.587951] CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
> [ 8308.590707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 8308.592865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 8308.594469] Call Trace:
> [ 8308.595149] account_page_cleaned+0x15b/0x1f0
> [ 8308.596340] __cancel_dirty_page+0x146/0x200
> [ 8308.599395] truncate_cleanup_page+0x92/0xb0
> [ 8308.600480] truncate_inode_pages_range+0x202/0x7d0
> [ 8308.617392] btrfs_evict_inode+0x92/0x5a0
> [ 8308.619108] evict+0xc1/0x190
> [ 8308.620023] do_unlinkat+0x176/0x280
> [ 8308.621202] do_syscall_64+0x63/0x1a0
> [ 8308.623451] entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> The fix here is to make asyc_cow->locked_page NULL everywhere but the
> one async_cow struct that's allowed to do things to the locked page.
>
> Signed-off-by: Chris Mason <[email protected]>
> Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
> ---
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:49PM -0700, Tejun Heo wrote:
> From: Chris Mason <[email protected]>
>
> Async CRCs and compression submit IO through helper threads, which
> means they have IO priority inversions when cgroup IO controllers are
> in use.
>
> This flags all of the writes submitted by btrfs helper threads as
> REQ_CGROUP_PUNT. submit_bio() will punt these to dedicated per-blkcg
> work items to avoid the priority inversion.
>
> For the compression code, we take a reference on the wbc's blkg css and
> pass it down to the async workers.
>
> For the async crcs, the bio already has the correct css, we just need to
> tell the block layer to use REQ_CGROUP_PUNT.
>
> Signed-off-by: Chris Mason <[email protected]>
> Modified-and-reviewed-by: Tejun Heo <[email protected]>
> ---
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:50PM -0700, Tejun Heo wrote:
> From: Chris Mason <[email protected]>
>
> extent_write_locked_range() is used when we're falling back to buffered
> IO from inside of compression. It allocates its own wbc and should
> associate it with the inode's i_wb to make sure the IO goes down from
> the correct cgroup.
>
> Signed-off-by: Chris Mason <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Jun 13, 2019 at 05:33:49PM -0700, Tejun Heo wrote:
> @@ -1251,12 +1258,29 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
> * to unlock it.
> */
> if (locked_page) {
> + /*
> + * Depending on the compressibility, the pages
> + * might or might not go through async. We want
> + * all of them to be accounted against @wbc once.
> + * Let's do it here before the paths diverge. wbc
> + * accounting is used only for foreign writeback
> + * detection and doesn't need full accuracy. Just
> + * account the whole thing against the first page.
> + */
> + wbc_account_io(wbc, locked_page, cur_end - start);
> async_chunk[i].locked_page = locked_page;
> locked_page = NULL;
> } else {
> async_chunk[i].locked_page = NULL;
> }
>
> + if (blkcg_css != blkcg_root_css) {
> + css_get(blkcg_css);
fs/btrfs/inode.c: In function ‘cow_file_range_async’:
fs/btrfs/inode.c:1278:4: error: implicit declaration of function ‘css_get’; did you mean ‘css_put’? [-Werror=implicit-function-declaration]
1278 | css_get(blkcg_css);
| ^~~~~~~
| css_put
I don't have CONFIG_CGROUPS enabled in the testing kernel so this probably
needs a wrapper so the ifdef is not in the middle of the function.
extent_write_locked_range() is used when we're falling back to buffered
IO from inside of compression. It allocates its own wbc and should
associate it with the inode's i_wb to make sure the IO goes down from
the correct cgroup.
Also, export the needed symbols from fs-writeback.c.
Signed-off-by: Chris Mason <[email protected]>
Build-Breakage-Reported-by: kbuild test robot <[email protected]>
---
v2: Module build fix.
fs/btrfs/extent_io.c | 10 ++++++----
fs/fs-writeback.c | 3 +++
2 files changed, 9 insertions(+), 4 deletions(-)
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4178,6 +4178,7 @@ int extent_write_locked_range(struct ino
.no_wbc_acct = 1,
};
+ wbc_attach_fdatawrite_inode(&wbc_writepages, inode);
while (start <= end) {
page = find_get_page(mapping, start >> PAGE_SHIFT);
if (clear_page_dirty_for_io(page))
@@ -4192,11 +4193,12 @@ int extent_write_locked_range(struct ino
}
ASSERT(ret <= 0);
- if (ret < 0) {
+ if (ret == 0)
+ ret = flush_write_bio(&epd);
+ else
end_write_bio(&epd, ret);
- return ret;
- }
- ret = flush_write_bio(&epd);
+
+ wbc_detach_inode(&wbc_writepages);
return ret;
}
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -270,6 +270,7 @@ void __inode_attach_wb(struct inode *ino
if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
wb_put(wb);
}
+EXPORT_SYMBOL_GPL(__inode_attach_wb);
/**
* locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
@@ -582,6 +583,7 @@ void wbc_attach_and_unlock_inode(struct
if (unlikely(wb_dying(wbc->wb)))
inode_switch_wbs(inode, wbc->wb_id);
}
+EXPORT_SYMBOL_GPL(wbc_attach_and_unlock_inode);
/**
* wbc_detach_inode - disassociate wbc from inode and perform foreign detection
@@ -701,6 +703,7 @@ void wbc_detach_inode(struct writeback_c
wb_put(wbc->wb);
wbc->wb = NULL;
}
+EXPORT_SYMBOL_GPL(wbc_detach_inode);
/**
* wbc_account_io - account IO issued during writeback
Hello, David.
On Fri, Jun 14, 2019 at 05:15:24PM +0200, David Sterba wrote:
> fs/btrfs/inode.c: In function ‘cow_file_range_async’:
> fs/btrfs/inode.c:1278:4: error: implicit declaration of function ‘css_get’; did you mean ‘css_put’? [-Werror=implicit-function-declaration]
> 1278 | css_get(blkcg_css);
> | ^~~~~~~
> | css_put
>
> I don't have CONFIG_CGROUPS enabled in the testing kernel so this probably
> needs a wrapper so the ifdef is not in the middle of the function.
Yeah, got a bunch of kbuild warnings last night. I'll fix them up and
send v2 patchset.
Thanks.
--
tejun
Hello,
Added a separate patch to add dummy css_get() and EXPORT's for the
build failures. It's in the following branch. I'll wait for a while
and send out if nothing else is broken.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-btrfs-cgroup-updates-v2
Thanks.
--
tejun