= Problem =
When sync() is executed from a high-priority cgroup, the process is forced to
wait for the completion of the entire outstanding writeback I/O, including the
I/O that was originally generated by low-priority cgroups.
This may cause massive latencies for unrelated processes (even those running in
the root cgroup) that shouldn't be I/O-throttled at all, similar to a classic
priority inversion problem.
This topic has been previously discussed here:
https://patchwork.kernel.org/patch/10804489/
[ Thanks to Josef for the suggestions ]
= Solution =
Here's a slightly more detailed description of the solution, as suggested by
Josef and Tejun (let me know if I misunderstood or missed anything):
- track the submitter of wb work (when issuing sync()) and the cgroup that
originally dirtied any inode, then use this information to determine the
proper "sync() domain" and decide if the I/O speed needs to be boosted or
not in order to prevent priority-inversion problems
- by default when sync() is issued, all the outstanding writeback I/O is
boosted to maximum speed to prevent priority inversion problems
- if sync() is issued by the same throttled cgroup that generated the dirty
pages, the corresponding writeback I/O is still throttled normally
- add a new per-cgroup flag (io.sync_isolation) that restricts sync() issued
  from that cgroup to writing out only the dirty pages that belong to the
  cgroup itself (see the sketch after this list)
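A rough, illustrative-only sketch of the two decisions above follows; the
helpers blkcg_wb_waiters_on_bdi() and blkcg_same_sync_domain() are the ones
introduced by the patches below, while the wrapper function names are just
placeholders, and locking/error handling is omitted:

static bool bio_may_skip_throttle(struct blkcg *blkcg,
				  struct backing_dev_info *bdi)
{
	/* boost writeback: don't throttle while another blkcg waits in sync() */
	return blkcg_wb_waiters_on_bdi(blkcg, bdi);
}

static bool inode_in_sync_domain(struct wb_writeback_work *work,
				 struct inode *inode)
{
	/* io.sync_isolation: only write out inodes owned by the sync() caller */
	return blkcg_same_sync_domain(work, inode->i_mapping);
}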
= Test =
Here's a trivial example to trigger the problem:
- create 2 cgroups: cg1 and cg2
# mkdir /sys/fs/cgroup/unified/cg1
# mkdir /sys/fs/cgroup/unified/cg2
- set an I/O limit of 1MB/s on cg1/io.max:
# echo "8:0 rbps=1048576 wbps=1048576" > /sys/fs/cgroup/unified/cg1/io.max
- run a write-intensive workload in cg1
# cat /proc/self/cgroup
0::/cg1
# fio --rw=write --bs=1M --size=32M --numjobs=16 --name=writer --time_based --runtime=30
- run sync in cg2 and measure time
== Vanilla kernel ==
# cat /proc/self/cgroup
0::/cg2
# time sync
real 9m32,618s
user 0m0,000s
sys 0m0,018s
Ideally "sync" should complete almost immediately, because cg2 is unlimited and
it's not doing any I/O at all. Instead, the entire system is totally sluggish,
waiting for the throttled writeback I/O to complete, and it also triggers many
hung task timeout warnings.
== With this patch set applied and io.sync_isolation=0 (default) ==
# cat /proc/self/cgroup
0::/cg2
# time sync
real 0m2,044s
user 0m0,009s
sys 0m0,000s
[ Measured times range from 2s to 4s across multiple runs ]
== With this patch set applied and io.sync_isolation=1 ==
# cat /proc/self/cgroup
0::/cg2
# time sync
real 0m0,768s
user 0m0,001s
sys 0m0,008s
[ Measured times range from 0.7s to 1.6s across multiple runs ]
Changes in v2:
- fix: properly keep track of sync waiters when a blkcg is writing to
many block devices at the same time
Andrea Righi (3):
blkcg: prevent priority inversion problem during sync()
blkcg: introduce io.sync_isolation
blkcg: implement sync() isolation
Documentation/admin-guide/cgroup-v2.rst | 9 +++
block/blk-cgroup.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
block/blk-throttle.c | 48 ++++++++++++++-
fs/fs-writeback.c | 57 +++++++++++++++++-
fs/inode.c | 1 +
fs/sync.c | 8 ++-
include/linux/backing-dev-defs.h | 2 +
include/linux/blk-cgroup.h | 52 +++++++++++++++++
include/linux/fs.h | 4 ++
mm/backing-dev.c | 2 +
mm/page-writeback.c | 1 +
11 files changed, 355 insertions(+), 7 deletions(-)
Prevent a priority inversion problem when a high-priority blkcg issues a
sync() and is forced to wait for the completion of all the writeback I/O
generated by any other low-priority blkcg, causing massive latencies to
processes that shouldn't be I/O-throttled at all.
The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.
Then, when I/O is throttled, if there's a blkcg waiting for writeback
different from the current blkcg, no throttling is applied (we can
probably refine this logic later, e.g., a better policy could be to
adjust the throttling I/O rate using the blkcg with the highest speed
from the list of waiters - a form of priority inheritance).
Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-cgroup.c | 131 +++++++++++++++++++++++++++++++
block/blk-throttle.c | 11 ++-
fs/fs-writeback.c | 5 ++
fs/sync.c | 8 +-
include/linux/backing-dev-defs.h | 2 +
include/linux/blk-cgroup.h | 23 ++++++
mm/backing-dev.c | 2 +
7 files changed, 178 insertions(+), 4 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..4305e78d1bb2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
};
EXPORT_SYMBOL_GPL(io_cgrp_subsys);
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct blkcg_wb_sleeper {
+ struct backing_dev_info *bdi;
+ struct blkcg *blkcg;
+ refcount_t refcnt;
+ struct list_head node;
+};
+
+static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
+static LIST_HEAD(blkcg_wb_sleeper_list);
+
+static struct blkcg_wb_sleeper *
+blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+ struct blkcg_wb_sleeper *bws;
+
+ list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
+ if (bws->blkcg == blkcg && bws->bdi == bdi)
+ return bws;
+ return NULL;
+}
+
+static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
+{
+ list_add(&bws->node, &blkcg_wb_sleeper_list);
+}
+
+static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
+{
+ list_del_init(&bws->node);
+}
+
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg different than the current one is waiting for
+ * writeback on the target block device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+ struct blkcg_wb_sleeper *bws;
+ bool ret = false;
+
+ spin_lock(&blkcg_wb_sleeper_lock);
+ list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
+ if (bws->bdi == bdi && bws->blkcg != blkcg) {
+ ret = true;
+ break;
+ }
+ spin_unlock(&blkcg_wb_sleeper_lock);
+
+ return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+ struct blkcg_wb_sleeper *new_bws, *bws;
+ struct blkcg *blkcg;
+
+ new_bws = kzalloc(sizeof(*new_bws), GFP_KERNEL);
+ if (unlikely(!new_bws))
+ return;
+
+ rcu_read_lock();
+ blkcg = blkcg_from_current();
+ if (likely(blkcg)) {
+ /* Check if blkcg is already sleeping on bdi */
+ spin_lock(&blkcg_wb_sleeper_lock);
+ bws = blkcg_wb_sleeper_find(blkcg, bdi);
+ if (bws) {
+ refcount_inc(&bws->refcnt);
+ } else {
+ /* Add current blkcg as a new wb sleeper on bdi */
+ css_get(&blkcg->css);
+ new_bws->blkcg = blkcg;
+ new_bws->bdi = bdi;
+ refcount_set(&new_bws->refcnt, 1);
+ blkcg_wb_sleeper_add(new_bws);
+ new_bws = NULL;
+ }
+ spin_unlock(&blkcg_wb_sleeper_lock);
+ }
+ rcu_read_unlock();
+
+ kfree(new_bws);
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+ struct blkcg_wb_sleeper *bws = NULL;
+ struct blkcg *blkcg;
+
+ rcu_read_lock();
+ blkcg = blkcg_from_current();
+ if (!blkcg) {
+ rcu_read_unlock();
+ return;
+ }
+ spin_lock(&blkcg_wb_sleeper_lock);
+ bws = blkcg_wb_sleeper_find(blkcg, bdi);
+ if (unlikely(!bws)) {
+ /* blkcg_start/stop_wb_wait_on_bdi() mismatch */
+ WARN_ON(1);
+ goto out_unlock;
+ }
+ if (refcount_dec_and_test(&bws->refcnt)) {
+ blkcg_wb_sleeper_del(bws);
+ css_put(&blkcg->css);
+ kfree(bws);
+ }
+out_unlock:
+ spin_unlock(&blkcg_wb_sleeper_lock);
+ rcu_read_unlock();
+}
+#endif
+
/**
* blkcg_activate_policy - activate a blkcg policy on a request_queue
* @q: request_queue of interest
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..da817896cded 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,13 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
{
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+ struct throtl_data *td = tg->td;
+ struct request_queue *q = td->queue;
+ struct backing_dev_info *bdi = q->backing_dev_info;
+ struct blkcg_gq *blkg = tg_to_blkg(tg);
/*
- * Currently whole state machine of group depends on first bio
+ * Currently whole state machine of group depends on first bio
* queued in the group bio list. So one should not be calling
* this function with a different bio if there are other bios
* queued.
@@ -981,8 +985,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
/* If tg->bps = -1, then BW is unlimited */
- if (tg_bps_limit(tg, rw) == U64_MAX &&
- tg_iops_limit(tg, rw) == UINT_MAX) {
+ if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi) ||
+ (tg_bps_limit(tg, rw) == U64_MAX &&
+ tg_iops_limit(tg, rw) == UINT_MAX)) {
if (wait)
*wait = 0;
return true;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..77c039a0ec25 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -28,6 +28,7 @@
#include <linux/tracepoint.h>
#include <linux/device.h>
#include <linux/memcontrol.h>
+#include <linux/blk-cgroup.h>
#include "internal.h"
/*
@@ -2446,6 +2447,8 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
+ blkcg_start_wb_wait_on_bdi(bdi);
+
/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
bdi_down_write_wb_switch_rwsem(bdi);
bdi_split_work_to_wbs(bdi, &work, false);
@@ -2453,6 +2456,8 @@ void sync_inodes_sb(struct super_block *sb)
bdi_up_write_wb_switch_rwsem(bdi);
wait_sb_inodes(sb);
+
+ blkcg_stop_wb_wait_on_bdi(bdi);
}
EXPORT_SYMBOL(sync_inodes_sb);
diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..3958b8f98b85 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/quotaops.h>
#include <linux/backing-dev.h>
+#include <linux/blk-cgroup.h>
#include "internal.h"
#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
@@ -76,8 +77,13 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg)
static void sync_fs_one_sb(struct super_block *sb, void *arg)
{
- if (!sb_rdonly(sb) && sb->s_op->sync_fs)
+ struct backing_dev_info *bdi = sb->s_bdi;
+
+ if (!sb_rdonly(sb) && sb->s_op->sync_fs) {
+ blkcg_start_wb_wait_on_bdi(bdi);
sb->s_op->sync_fs(sb, *(int *)arg);
+ blkcg_stop_wb_wait_on_bdi(bdi);
+ }
}
static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 07e02d6df5ad..095e4dd0427b 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,8 @@ struct backing_dev_info {
struct rb_root cgwb_congested_tree; /* their congested states */
struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */
struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
+ struct list_head cgwb_waiters; /* list of all waiters for writeback */
+ spinlock_t cgwb_waiters_lock; /* protect cgwb_waiters list */
#else
struct bdi_writeback_congested *wb_congested;
#endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 76c61318fda5..0f7dcb70e922 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -56,6 +56,7 @@ struct blkcg {
struct list_head all_blkcgs_node;
#ifdef CONFIG_CGROUP_WRITEBACK
+ struct list_head cgwb_wait_node;
struct list_head cgwb_list;
refcount_t cgwb_refcnt;
#endif
@@ -252,6 +253,12 @@ static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
return css ? container_of(css, struct blkcg, css) : NULL;
}
+static inline struct blkcg *blkcg_from_current(void)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return css_to_blkcg(blkcg_css());
+}
+
/**
* __bio_blkcg - internal, inconsistent version to get blkcg
*
@@ -454,6 +461,10 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
blkcg_destroy_blkgs(blkcg);
}
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi);
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi);
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi);
+
#else
static inline void blkcg_cgwb_get(struct blkcg *blkcg) { }
@@ -464,6 +475,14 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
blkcg_destroy_blkgs(blkcg);
}
+static inline bool
+blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+ return false;
+}
+static inline void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+static inline void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+
#endif
/**
@@ -772,6 +791,7 @@ static inline void blkcg_bio_issue_init(struct bio *bio)
static inline bool blkcg_bio_issue_check(struct request_queue *q,
struct bio *bio)
{
+ struct backing_dev_info *bdi = q->backing_dev_info;
struct blkcg_gq *blkg;
bool throtl = false;
@@ -788,6 +808,9 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
blkg = bio->bi_blkg;
+ if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi))
+ bio_set_flag(bio, BIO_THROTTLED);
+
throtl = blk_throtl_bio(q, blkg, bio);
if (!throtl) {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..8848d26e8bf6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -686,10 +686,12 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
{
int ret;
+ INIT_LIST_HEAD(&bdi->cgwb_waiters);
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
mutex_init(&bdi->cgwb_release_mutex);
init_rwsem(&bdi->wb_switch_rwsem);
+ spin_lock_init(&bdi->cgwb_waiters_lock);
ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
--
2.19.1
Keep track of the inodes that have been dirtied by each blkcg and make
sure that a blkcg issuing a sync() triggers the writeback and the wait
of only those pages that belong to the cgroup itself.
This behavior is applied only when io.sync_isolation is enabled in the
cgroup; otherwise the old behavior is applied: sync() triggers the
writeback of any dirty page.
Signed-off-by: Andrea Righi <[email protected]>
---
block/blk-cgroup.c | 47 ++++++++++++++++++++++++++++++++++
fs/fs-writeback.c | 52 +++++++++++++++++++++++++++++++++++---
fs/inode.c | 1 +
include/linux/blk-cgroup.h | 22 ++++++++++++++++
include/linux/fs.h | 4 +++
mm/page-writeback.c | 1 +
6 files changed, 124 insertions(+), 3 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4305e78d1bb2..7d3b26ba4575 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
spin_unlock(&blkcg_wb_sleeper_lock);
rcu_read_unlock();
}
+
+/**
+ * blkcg_set_mapping_dirty - set owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Set the current blkcg as the owner of the address space @mapping (the first
+ * blkcg that dirties @mapping becomes the owner).
+ */
+void blkcg_set_mapping_dirty(struct address_space *mapping)
+{
+ struct blkcg *curr_blkcg, *blkcg;
+
+ if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
+ mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+ return;
+
+ rcu_read_lock();
+ curr_blkcg = blkcg_from_current();
+ blkcg = blkcg_from_mapping(mapping);
+ if (curr_blkcg != blkcg) {
+ if (blkcg)
+ css_put(&blkcg->css);
+ css_get(&curr_blkcg->css);
+ rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * blkcg_set_mapping_clean - clear the owner of a dirty mapping
+ * @mapping: target address space
+ *
+ * Unset the owner of @mapping when it becomes clean.
+ */
+
+void blkcg_set_mapping_clean(struct address_space *mapping)
+{
+ struct blkcg *blkcg;
+
+ rcu_read_lock();
+ blkcg = rcu_dereference(mapping->i_blkcg);
+ if (blkcg) {
+ css_put(&blkcg->css);
+ RCU_INIT_POINTER(mapping->i_blkcg, NULL);
+ }
+ rcu_read_unlock();
+}
#endif
/**
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 77c039a0ec25..d003d0593f41 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -58,6 +58,9 @@ struct wb_writeback_work {
struct list_head list; /* pending work list */
struct wb_completion *done; /* set if the caller waits */
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct blkcg *blkcg;
+#endif
};
/*
@@ -916,6 +919,29 @@ static int __init cgroup_writeback_init(void)
}
fs_initcall(cgroup_writeback_init);
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+ rcu_read_lock();
+ work->blkcg = blkcg_from_current();
+ rcu_read_unlock();
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+ struct address_space *mapping)
+{
+ struct blkcg *blkcg;
+
+ if (!work->blkcg || work->blkcg == &blkcg_root)
+ return true;
+ if (!test_bit(BLKCG_SYNC_ISOLATION, &work->blkcg->flags))
+ return true;
+ rcu_read_lock();
+ blkcg = blkcg_from_mapping(mapping);
+ rcu_read_unlock();
+
+ return blkcg == work->blkcg;
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */
static void bdi_down_write_wb_switch_rwsem(struct backing_dev_info *bdi) { }
@@ -959,6 +985,15 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
}
}
+static void blkcg_set_sync_domain(struct wb_writeback_work *work)
+{
+}
+
+static bool blkcg_same_sync_domain(struct wb_writeback_work *work,
+ struct address_space *mapping)
+{
+ return true;
+}
#endif /* CONFIG_CGROUP_WRITEBACK */
/*
@@ -1131,7 +1166,7 @@ static int move_expired_inodes(struct list_head *delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
- struct inode *inode;
+ struct inode *inode, *next;
int do_sb_sort = 0;
int moved = 0;
@@ -1141,11 +1176,12 @@ static int move_expired_inodes(struct list_head *delaying_queue,
expire_time = jiffies - (dirtytime_expire_interval * HZ);
older_than_this = &expire_time;
}
- while (!list_empty(delaying_queue)) {
- inode = wb_inode(delaying_queue->prev);
+ list_for_each_entry_safe(inode, next, delaying_queue, i_io_list) {
if (older_than_this &&
inode_dirtied_after(inode, *older_than_this))
break;
+ if (!blkcg_same_sync_domain(work, inode->i_mapping))
+ continue;
list_move(&inode->i_io_list, &tmp);
moved++;
if (flags & EXPIRE_DIRTY_ATIME)
@@ -1560,6 +1596,15 @@ static long writeback_sb_inodes(struct super_block *sb,
break;
}
+ /*
+ * Only write out inodes that belong to the blkcg that issued
+ * the sync().
+ */
+ if (!blkcg_same_sync_domain(work, inode->i_mapping)) {
+ redirty_tail(inode, wb);
+ continue;
+ }
+
/*
* Don't bother with new inodes or inodes being freed, first
* kind does not need periodic writeout yet, and for the latter
@@ -2447,6 +2492,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
+ blkcg_set_sync_domain(&work);
blkcg_start_wb_wait_on_bdi(bdi);
/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
diff --git a/fs/inode.c b/fs/inode.c
index e9d97add2b36..b9659aaa8546 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -564,6 +564,7 @@ static void evict(struct inode *inode)
bd_forget(inode);
if (S_ISCHR(inode->i_mode) && inode->i_cdev)
cd_forget(inode);
+ blkcg_set_mapping_clean(&inode->i_data);
remove_inode_hash(inode);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 6ac5aa049334..a2bcc83c8c3e 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -441,6 +441,15 @@ extern void blkcg_destroy_blkgs(struct blkcg *blkcg);
#ifdef CONFIG_CGROUP_WRITEBACK
+static inline struct blkcg *blkcg_from_mapping(struct address_space *mapping)
+{
+ WARN_ON_ONCE(!rcu_read_lock_held());
+ return rcu_dereference(mapping->i_blkcg);
+}
+
+void blkcg_set_mapping_dirty(struct address_space *mapping);
+void blkcg_set_mapping_clean(struct address_space *mapping);
+
/**
* blkcg_cgwb_get - get a reference for blkcg->cgwb_list
* @blkcg: blkcg of interest
@@ -474,6 +483,19 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi);
#else
+static inline struct blkcg *blkcg_from_mapping(struct address_space *mapping)
+{
+ return NULL;
+}
+
+static inline void blkcg_set_mapping_dirty(struct address_space *mapping)
+{
+}
+
+static inline void blkcg_set_mapping_clean(struct address_space *mapping)
+{
+}
+
static inline void blkcg_cgwb_get(struct blkcg *blkcg) { }
static inline void blkcg_cgwb_put(struct blkcg *blkcg)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 08f26046233e..19e99b4a9fa2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -420,6 +420,7 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
* @nrpages: Number of page entries, protected by the i_pages lock.
* @nrexceptional: Shadow or DAX entries, protected by the i_pages lock.
* @writeback_index: Writeback starts here.
+ * @i_blkcg: blkcg owner (that dirtied the address_space)
* @a_ops: Methods.
* @flags: Error bits and flags (AS_*).
* @wb_err: The most recent error which has occurred.
@@ -438,6 +439,9 @@ struct address_space {
unsigned long nrexceptional;
pgoff_t writeback_index;
const struct address_space_operations *a_ops;
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct blkcg __rcu *i_blkcg;
+#endif
unsigned long flags;
errseq_t wb_err;
spinlock_t private_lock;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9f61dfec6a1f..e16574f946a7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2418,6 +2418,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
inode_attach_wb(inode, page);
wb = inode_to_wb(inode);
+ blkcg_set_mapping_dirty(mapping);
__inc_lruvec_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
__inc_node_page_state(page, NR_DIRTIED);
--
2.19.1
Add a per-blkcg flag that makes sync() issued from a cgroup write out
only the pages that have been dirtied by the cgroup itself.
This flag is disabled by default (meaning that the previous behavior is
unchanged by default).
When this flag is enabled a cgroup can write out only the dirty pages
that belong to the cgroup itself (except for the root cgroup, which is
still able to write out all pages globally).
Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 9 ++++++
block/blk-throttle.c | 37 +++++++++++++++++++++++++
include/linux/blk-cgroup.h | 7 +++++
3 files changed, 53 insertions(+)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 53d3288c328b..17fff0ee97b8 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1448,6 +1448,15 @@ IO Interface Files
Shows pressure stall information for IO. See
Documentation/accounting/psi.txt for details.
+ io.sync_isolation
+ A flag (0|1) that determines whether a cgroup is allowed to write out
+ only pages that have been dirtied by the cgroup itself. This option is
+ set to false (0) by default, meaning that any cgroup would try to write
+ out dirty pages globally, even those that have been dirtied by other
+ cgroups.
+
+ Setting this option to true (1) provides better isolation across
+ cgroups that are doing intense write I/O activity.
Writeback
~~~~~~~~~
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index da817896cded..4bc3b40a4d93 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1704,6 +1704,35 @@ static ssize_t tg_set_limit(struct kernfs_open_file *of,
return ret ?: nbytes;
}
+#ifdef CONFIG_CGROUP_WRITEBACK
+static int sync_isolation_show(struct seq_file *sf, void *v)
+{
+ struct blkcg *blkcg = css_to_blkcg(seq_css(sf));
+
+ seq_printf(sf, "%d\n", test_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags));
+ return 0;
+}
+
+static ssize_t sync_isolation_write(struct kernfs_open_file *of,
+ char *buf, size_t nbytes, loff_t off)
+{
+ struct blkcg *blkcg = css_to_blkcg(of_css(of));
+ unsigned long val;
+ int err;
+
+ buf = strstrip(buf);
+ err = kstrtoul(buf, 0, &val);
+ if (err)
+ return err;
+ if (val)
+ set_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+ else
+ clear_bit(BLKCG_SYNC_ISOLATION, &blkcg->flags);
+
+ return nbytes;
+}
+#endif
+
static struct cftype throtl_files[] = {
#ifdef CONFIG_BLK_DEV_THROTTLING_LOW
{
@@ -1721,6 +1750,14 @@ static struct cftype throtl_files[] = {
.write = tg_set_limit,
.private = LIMIT_MAX,
},
+#ifdef CONFIG_CGROUP_WRITEBACK
+ {
+ .name = "sync_isolation",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = sync_isolation_show,
+ .write = sync_isolation_write,
+ },
+#endif
{ } /* terminate */
};
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 0f7dcb70e922..6ac5aa049334 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -44,6 +44,12 @@ enum blkg_rwstat_type {
struct blkcg_gq;
+/* blkcg->flags */
+enum {
+ /* sync()'ers allowed to write out pages dirtied by the blkcg */
+ BLKCG_SYNC_ISOLATION,
+};
+
struct blkcg {
struct cgroup_subsys_state css;
spinlock_t lock;
@@ -55,6 +61,7 @@ struct blkcg {
struct blkcg_policy_data *cpd[BLKCG_MAX_POLS];
struct list_head all_blkcgs_node;
+ unsigned long flags;
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_wait_node;
struct list_head cgwb_list;
--
2.19.1
On Thu, Mar 07, 2019 at 07:08:34PM +0100, Andrea Righi wrote:
> Keep track of the inodes that have been dirtied by each blkcg cgroup and
> make sure that a blkcg issuing a sync() can trigger the writeback + wait
> of only those pages that belong to the cgroup itself.
>
> This behavior is applied only when io.sync_isolation is enabled in the
> cgroup, otherwise the old behavior is applied: sync() triggers the
> writeback of any dirty page.
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> block/blk-cgroup.c | 47 ++++++++++++++++++++++++++++++++++
> fs/fs-writeback.c | 52 +++++++++++++++++++++++++++++++++++---
> fs/inode.c | 1 +
> include/linux/blk-cgroup.h | 22 ++++++++++++++++
> include/linux/fs.h | 4 +++
> mm/page-writeback.c | 1 +
> 6 files changed, 124 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 4305e78d1bb2..7d3b26ba4575 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
> spin_unlock(&blkcg_wb_sleeper_lock);
> rcu_read_unlock();
> }
> +
> +/**
> + * blkcg_set_mapping_dirty - set owner of a dirty mapping
> + * @mapping: target address space
> + *
> + * Set the current blkcg as the owner of the address space @mapping (the first
> + * blkcg that dirties @mapping becomes the owner).
> + */
> +void blkcg_set_mapping_dirty(struct address_space *mapping)
> +{
> + struct blkcg *curr_blkcg, *blkcg;
> +
> + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
> + mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> + return;
> +
> + rcu_read_lock();
> + curr_blkcg = blkcg_from_current();
> + blkcg = blkcg_from_mapping(mapping);
> + if (curr_blkcg != blkcg) {
> + if (blkcg)
> + css_put(&blkcg->css);
> + css_get(&curr_blkcg->css);
> + rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
> + }
> + rcu_read_unlock();
> +}
> +
> +/**
> + * blkcg_set_mapping_clean - clear the owner of a dirty mapping
> + * @mapping: target address space
> + *
> + * Unset the owner of @mapping when it becomes clean.
> + */
> +
> +void blkcg_set_mapping_clean(struct address_space *mapping)
> +{
> + struct blkcg *blkcg;
> +
> + rcu_read_lock();
> + blkcg = rcu_dereference(mapping->i_blkcg);
> + if (blkcg) {
> + css_put(&blkcg->css);
> + RCU_INIT_POINTER(mapping->i_blkcg, NULL);
> + }
> + rcu_read_unlock();
> +}
> #endif
>
Why do we need this? We already have the inode_attach_wb(), which has the
blkcg_css embedded in it for whoever dirtied the inode first. Can we not just
use that? Thanks,
Josef
On Thu, Mar 07, 2019 at 07:08:32PM +0100, Andrea Righi wrote:
> Prevent priority inversion problem when a high-priority blkcg issues a
> sync() and it is forced to wait the completion of all the writeback I/O
> generated by any other low-priority blkcg, causing massive latencies to
> processes that shouldn't be I/O-throttled at all.
>
> The idea is to save a list of blkcg's that are waiting for writeback:
> every time a sync() is executed the current blkcg is added to the list.
>
> Then, when I/O is throttled, if there's a blkcg waiting for writeback
> different than the current blkcg, no throttling is applied (we can
> probably refine this logic later, i.e., a better policy could be to
> adjust the throttling I/O rate using the blkcg with the highest speed
> from the list of waiters - priority inheritance, kinda).
>
> Signed-off-by: Andrea Righi <[email protected]>
> ---
> block/blk-cgroup.c | 131 +++++++++++++++++++++++++++++++
> block/blk-throttle.c | 11 ++-
> fs/fs-writeback.c | 5 ++
> fs/sync.c | 8 +-
> include/linux/backing-dev-defs.h | 2 +
> include/linux/blk-cgroup.h | 23 ++++++
> mm/backing-dev.c | 2 +
> 7 files changed, 178 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 2bed5725aa03..4305e78d1bb2 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
> };
> EXPORT_SYMBOL_GPL(io_cgrp_subsys);
>
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +struct blkcg_wb_sleeper {
> + struct backing_dev_info *bdi;
> + struct blkcg *blkcg;
> + refcount_t refcnt;
> + struct list_head node;
> +};
> +
> +static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
> +static LIST_HEAD(blkcg_wb_sleeper_list);
> +
> +static struct blkcg_wb_sleeper *
> +blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
> +{
> + struct blkcg_wb_sleeper *bws;
> +
> + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> + if (bws->blkcg == blkcg && bws->bdi == bdi)
> + return bws;
> + return NULL;
> +}
> +
> +static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
> +{
> + list_add(&bws->node, &blkcg_wb_sleeper_list);
> +}
> +
> +static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
> +{
> + list_del_init(&bws->node);
> +}
> +
> +/**
> + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
> + * @blkcg: current blkcg cgroup
> + * @bdi: block device to check
> + *
> + * Return true if any other blkcg different than the current one is waiting for
> + * writeback on the target block device, false otherwise.
> + */
> +bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
> +{
> + struct blkcg_wb_sleeper *bws;
> + bool ret = false;
> +
> + spin_lock(&blkcg_wb_sleeper_lock);
> + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> + if (bws->bdi == bdi && bws->blkcg != blkcg) {
> + ret = true;
> + break;
> + }
> + spin_unlock(&blkcg_wb_sleeper_lock);
> +
> + return ret;
> +}
No global lock please, add something to the bdi I think? Also have a fast path
of
if (list_empty(blkcg_wb_sleeper_list))
return false;
we don't need to be super accurate here. Thanks,
Josef
On Thu, Mar 07, 2019 at 07:08:33PM +0100, Andrea Righi wrote:
> Add a flag to the blkcg cgroups to make sync()'ers in a cgroup only be
> allowed to write out pages that have been dirtied by the cgroup itself.
>
> This flag is disabled by default (meaning that we are not changing the
> previous behavior by default).
>
> When this flag is enabled any cgroup can write out only dirty pages that
> belong to the cgroup itself (except for the root cgroup that would still
> be able to write out all pages globally).
>
> Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Thanks,
Josef
On Thu, Mar 07, 2019 at 05:10:53PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:32PM +0100, Andrea Righi wrote:
> > Prevent priority inversion problem when a high-priority blkcg issues a
> > sync() and it is forced to wait the completion of all the writeback I/O
> > generated by any other low-priority blkcg, causing massive latencies to
> > processes that shouldn't be I/O-throttled at all.
> >
> > The idea is to save a list of blkcg's that are waiting for writeback:
> > every time a sync() is executed the current blkcg is added to the list.
> >
> > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > different than the current blkcg, no throttling is applied (we can
> > probably refine this logic later, i.e., a better policy could be to
> > adjust the throttling I/O rate using the blkcg with the highest speed
> > from the list of waiters - priority inheritance, kinda).
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > block/blk-cgroup.c | 131 +++++++++++++++++++++++++++++++
> > block/blk-throttle.c | 11 ++-
> > fs/fs-writeback.c | 5 ++
> > fs/sync.c | 8 +-
> > include/linux/backing-dev-defs.h | 2 +
> > include/linux/blk-cgroup.h | 23 ++++++
> > mm/backing-dev.c | 2 +
> > 7 files changed, 178 insertions(+), 4 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 2bed5725aa03..4305e78d1bb2 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1351,6 +1351,137 @@ struct cgroup_subsys io_cgrp_subsys = {
> > };
> > EXPORT_SYMBOL_GPL(io_cgrp_subsys);
> >
> > +#ifdef CONFIG_CGROUP_WRITEBACK
> > +struct blkcg_wb_sleeper {
> > + struct backing_dev_info *bdi;
> > + struct blkcg *blkcg;
> > + refcount_t refcnt;
> > + struct list_head node;
> > +};
> > +
> > +static DEFINE_SPINLOCK(blkcg_wb_sleeper_lock);
> > +static LIST_HEAD(blkcg_wb_sleeper_list);
> > +
> > +static struct blkcg_wb_sleeper *
> > +blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
> > +{
> > + struct blkcg_wb_sleeper *bws;
> > +
> > + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> > + if (bws->blkcg == blkcg && bws->bdi == bdi)
> > + return bws;
> > + return NULL;
> > +}
> > +
> > +static void blkcg_wb_sleeper_add(struct blkcg_wb_sleeper *bws)
> > +{
> > + list_add(&bws->node, &blkcg_wb_sleeper_list);
> > +}
> > +
> > +static void blkcg_wb_sleeper_del(struct blkcg_wb_sleeper *bws)
> > +{
> > + list_del_init(&bws->node);
> > +}
> > +
> > +/**
> > + * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
> > + * @blkcg: current blkcg cgroup
> > + * @bdi: block device to check
> > + *
> > + * Return true if any other blkcg different than the current one is waiting for
> > + * writeback on the target block device, false otherwise.
> > + */
> > +bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
> > +{
> > + struct blkcg_wb_sleeper *bws;
> > + bool ret = false;
> > +
> > + spin_lock(&blkcg_wb_sleeper_lock);
> > + list_for_each_entry(bws, &blkcg_wb_sleeper_list, node)
> > + if (bws->bdi == bdi && bws->blkcg != blkcg) {
> > + ret = true;
> > + break;
> > + }
> > + spin_unlock(&blkcg_wb_sleeper_lock);
> > +
> > + return ret;
> > +}
>
> No global lock please, add something to the bdi I think? Also have a fast path
> of
OK, I'll add a list per-bdi and a lock as well.
>
> if (list_empty(blkcg_wb_sleeper_list))
> return false;
OK.
>
> we don't need to be super accurate here. Thanks,
>
> Josef
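For reference, an untested sketch of what I have in mind, reusing the
cgwb_waiters list and cgwb_waiters_lock that this patch already adds to
struct backing_dev_info, plus the list_empty() fast path (it assumes
blkcg_start/stop_wb_wait_on_bdi() are converted to the per-bdi list too):

bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
{
	struct blkcg_wb_sleeper *bws;
	bool ret = false;

	/* fast path: don't take the lock if nobody is waiting on this bdi */
	if (list_empty(&bdi->cgwb_waiters))
		return false;

	spin_lock(&bdi->cgwb_waiters_lock);
	list_for_each_entry(bws, &bdi->cgwb_waiters, node)
		if (bws->blkcg != blkcg) {
			ret = true;
			break;
		}
	spin_unlock(&bdi->cgwb_waiters_lock);

	return ret;
}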
Thanks,
-Andrea
On Thu, Mar 07, 2019 at 05:07:01PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:34PM +0100, Andrea Righi wrote:
> > Keep track of the inodes that have been dirtied by each blkcg cgroup and
> > make sure that a blkcg issuing a sync() can trigger the writeback + wait
> > of only those pages that belong to the cgroup itself.
> >
> > This behavior is applied only when io.sync_isolation is enabled in the
> > cgroup, otherwise the old behavior is applied: sync() triggers the
> > writeback of any dirty page.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > ---
> > block/blk-cgroup.c | 47 ++++++++++++++++++++++++++++++++++
> > fs/fs-writeback.c | 52 +++++++++++++++++++++++++++++++++++---
> > fs/inode.c | 1 +
> > include/linux/blk-cgroup.h | 22 ++++++++++++++++
> > include/linux/fs.h | 4 +++
> > mm/page-writeback.c | 1 +
> > 6 files changed, 124 insertions(+), 3 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 4305e78d1bb2..7d3b26ba4575 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1480,6 +1480,53 @@ void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
> > spin_unlock(&blkcg_wb_sleeper_lock);
> > rcu_read_unlock();
> > }
> > +
> > +/**
> > + * blkcg_set_mapping_dirty - set owner of a dirty mapping
> > + * @mapping: target address space
> > + *
> > + * Set the current blkcg as the owner of the address space @mapping (the first
> > + * blkcg that dirties @mapping becomes the owner).
> > + */
> > +void blkcg_set_mapping_dirty(struct address_space *mapping)
> > +{
> > + struct blkcg *curr_blkcg, *blkcg;
> > +
> > + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK) ||
> > + mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > + return;
> > +
> > + rcu_read_lock();
> > + curr_blkcg = blkcg_from_current();
> > + blkcg = blkcg_from_mapping(mapping);
> > + if (curr_blkcg != blkcg) {
> > + if (blkcg)
> > + css_put(&blkcg->css);
> > + css_get(&curr_blkcg->css);
> > + rcu_assign_pointer(mapping->i_blkcg, curr_blkcg);
> > + }
> > + rcu_read_unlock();
> > +}
> > +
> > +/**
> > + * blkcg_set_mapping_clean - clear the owner of a dirty mapping
> > + * @mapping: target address space
> > + *
> > + * Unset the owner of @mapping when it becomes clean.
> > + */
> > +
> > +void blkcg_set_mapping_clean(struct address_space *mapping)
> > +{
> > + struct blkcg *blkcg;
> > +
> > + rcu_read_lock();
> > + blkcg = rcu_dereference(mapping->i_blkcg);
> > + if (blkcg) {
> > + css_put(&blkcg->css);
> > + RCU_INIT_POINTER(mapping->i_blkcg, NULL);
> > + }
> > + rcu_read_unlock();
> > +}
> > #endif
> >
>
> Why do we need this? We already have the inode_attach_wb(), which has the
> blkcg_css embedded in it for whoever dirtied the inode first. Can we not just
> use that? Thanks,
>
> Josef
I'm realizing only now that inode_attach_wb() also has the blkcg embedded
in it, in addition to the memcg. I think I can use that and drop these
blkcg_set_mapping_dirty/clean() helpers; something like the untested sketch below.
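[ Illustrative only: blkcg_from_inode_wb() is just a tentative name, it
assumes CONFIG_CGROUP_WRITEBACK and an inode that already went through
inode_attach_wb(); the inode_to_wb() locking requirements are ignored here. ]

static struct blkcg *blkcg_from_inode_wb(struct inode *inode)
{
	/* use the wb (and its blkcg_css) attached to whoever dirtied the inode */
	struct bdi_writeback *wb = inode_to_wb(inode);

	return wb ? css_to_blkcg(wb->blkcg_css) : NULL;
}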
Thanks,
-Andrea
On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> = Problem =
>
> When sync() is executed from a high-priority cgroup, the process is forced to
> wait the completion of the entire outstanding writeback I/O, even the I/O that
> was originally generated by low-priority cgroups potentially.
>
> This may cause massive latencies to random processes (even those running in the
> root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> priority inversion problem.
>
> This topic has been previously discussed here:
> https://patchwork.kernel.org/patch/10804489/
>
Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
this some more offline.
We don't want cgroup to become the arbiter of correctness/behavior here. We
just want it to be isolating things.
For you that means you can drop the per-cgroup flag stuff, and only do the
priority boosting for multiple sync(2) waiters. That is a real priority
inversion that needs to be fixed. io.latency and io.max are capable of noticing
that a low priority group is going above their configured limits and putting
pressure elsewhere accordingly.
Tejun said he'd rather see the sync(2) isolation be done at the namespace level.
That way if you have fs namespacing you are already isolated to your namespace.
If you feel like tackling that then hooray, but that's a separate dragon to slay
so don't feel like you have to right now.
This way we keep cgroup doing its job, controlling resources. Then we allow
namespacing to do its thing, isolating resources. Thanks,
Josef
On Fri, Mar 08, 2019 at 12:22:20PM -0500, Josef Bacik wrote:
> On Thu, Mar 07, 2019 at 07:08:31PM +0100, Andrea Righi wrote:
> > = Problem =
> >
> > When sync() is executed from a high-priority cgroup, the process is forced to
> > wait the completion of the entire outstanding writeback I/O, even the I/O that
> > was originally generated by low-priority cgroups potentially.
> >
> > This may cause massive latencies to random processes (even those running in the
> > root cgroup) that shouldn't be I/O-throttled at all, similarly to a classic
> > priority inversion problem.
> >
> > This topic has been previously discussed here:
> > https://patchwork.kernel.org/patch/10804489/
> >
>
> Sorry to move the goal posts on you again Andrea, but Tejun and I talked about
> this some more offline.
>
> We don't want cgroup to become the arbiter of correctness/behavior here. We
> just want it to be isolating things.
>
> For you that means you can drop the per-cgroup flag stuff, and only do the
> priority boosting for multiple sync(2) waiters. That is a real priority
> inversion that needs to be fixed. io.latency and io.max are capable of noticing
> that a low priority group is going above their configured limits and putting
> pressure elsewhere accordingly.
Alright, so IIUC that means we just need patch 1/3 for now (with the
per-bdi lock instead of the global lock). If that's the case I'll focus
on that patch then.
>
> Tejun said he'd rather see the sync(2) isolation be done at the namespace level.
> That way if you have fs namespacing you are already isolated to your namespace.
> If you feel like tackling that then hooray, but that's a separate dragon to slay
> so don't feel like you have to right now.
Makes sense. I can take a look and see what I can do after posting the
new patch with the priority inversion fix only.
>
> This way we keep cgroup doing its job, controlling resources. Then we allow
> namespacing to do its thing, isolating resources. Thanks,
>
> Josef
Looks like a good plan to me. Thanks for the update.
-Andrea