2019-02-09 14:08:30

by Andrea Righi

Subject: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

This is an attempt to mitigate the priority inversion problem of a
high-priority blkcg issuing a sync() and being forced to wait for the
completion of all the writeback I/O generated by any other low-priority
blkcg, causing massive latencies for processes that shouldn't be
I/O-throttled at all.
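
As an illustration of the problem, the inversion can be reproduced with a
small program along the lines of the sketch below. The program is not part
of the patch; creating the two cgroups and putting a writeback limit on the
slow one (e.g. via io.max) is assumed to be done separately:

/*
 * Reproduction sketch (not part of the patch).  First run "./repro dirty"
 * from a blkcg with a low write limit, then run "./repro" from an
 * unthrottled cgroup and see how long the sync() takes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define CHUNK   (1 << 20)       /* 1 MiB per write() */
#define NCHUNKS 1024            /* ~1 GiB of dirty page cache in total */

int main(int argc, char **argv)
{
        if (argc > 1 && !strcmp(argv[1], "dirty")) {
                /* run this from the throttled cgroup: just dirty memory */
                char *buf = malloc(CHUNK);
                int fd = open("dirty-data", O_CREAT | O_WRONLY | O_TRUNC, 0644);

                if (!buf || fd < 0) {
                        perror("setup");
                        return 1;
                }
                memset(buf, 0xaa, CHUNK);
                for (int i = 0; i < NCHUNKS; i++) {
                        if (write(fd, buf, CHUNK) != CHUNK) {
                                perror("write");
                                return 1;
                        }
                }
                close(fd);
        } else {
                /* run this from an unthrottled (e.g. root) cgroup */
                struct timespec t0, t1;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                sync();
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("sync() took %.2f seconds\n",
                       (t1.tv_sec - t0.tv_sec) +
                       (t1.tv_nsec - t0.tv_nsec) / 1e9);
        }
        return 0;
}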

The idea is to save a list of blkcg's that are waiting for writeback:
every time a sync() is executed the current blkcg is added to the list.

Then, when I/O is throttled, if there's a blkcg waiting for writeback
other than the current blkcg, no throttling is applied (we can probably
refine this logic later; e.g., a better policy could be to adjust the
throttling I/O rate using the fastest blkcg from the list of waiters -
a sort of priority inheritance; see the sketch below).
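
The refinement mentioned in the parenthesis could look roughly like the
sketch below. It is NOT what this patch implements: blkg_lookup(),
blkg_to_tg() and tg_bps_limit() are existing helpers, while the function
itself is invented here purely for illustration:

/*
 * Hypothetical sketch only: instead of skipping throttling when someone
 * else is waiting for writeback, throttle against the bps limit of the
 * fastest waiter found on the bdi's waiters list.
 */
static u64 max_waiter_bps_limit(struct backing_dev_info *bdi,
                                struct request_queue *q, int rw)
{
        struct blkcg *waiter;
        u64 bps = 0;

        rcu_read_lock();
        list_for_each_entry_rcu(waiter, &bdi->cgwb_waiters, cgwb_wait_node) {
                struct blkcg_gq *blkg = blkg_lookup(waiter, q);

                if (blkg)
                        bps = max(bps, tg_bps_limit(blkg_to_tg(blkg), rw));
        }
        rcu_read_unlock();

        /*
         * tg_may_dispatch() could then compute the wait against this rate
         * instead of returning "no wait" unconditionally.
         */
        return bps;
}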

This topic has been discussed here:
https://lwn.net/ml/cgroups/[email protected]/

But we didn't come up with any definitive solution.

This patch is not a definitive solution either, but it's an attempt to
keep making progress on this issue and to handle the priority inversion
problem with sync() in a better way.

Signed-off-by: Andrea Righi <[email protected]>
---
Changes in v2:
- fix: use the proper current blkcg in blkcg_wb_waiters_on_bdi()

 block/blk-cgroup.c               | 69 ++++++++++++++++++++++++++++++++
 block/blk-throttle.c             | 11 +++--
 fs/fs-writeback.c                |  4 ++
 include/linux/backing-dev-defs.h |  2 +
 include/linux/blk-cgroup.h       | 18 ++++++++-
 mm/backing-dev.c                 |  2 +
 6 files changed, 102 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..21f14148a9c6 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1635,6 +1635,75 @@ static void blkcg_scale_delay(struct blkcg_gq *blkg, u64 now)
}
}

+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg is waiting for writeback on the target block
+ * device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+ struct blkcg *wait_blkcg;
+ bool ret = false;
+
+ if (unlikely(!bdi))
+ return false;
+
+ rcu_read_lock();
+ list_for_each_entry_rcu(wait_blkcg, &bdi->cgwb_waiters, cgwb_wait_node)
+ if (wait_blkcg != blkcg) {
+ ret = true;
+ break;
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+ struct blkcg *blkcg;
+
+ rcu_read_lock();
+ blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+ if (blkcg) {
+ spin_lock(&bdi->cgwb_waiters_lock);
+ list_add_rcu(&blkcg->cgwb_wait_node, &bdi->cgwb_waiters);
+ spin_unlock(&bdi->cgwb_waiters_lock);
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+ struct blkcg *blkcg;
+
+ rcu_read_lock();
+ blkcg = css_to_blkcg(task_css(current, io_cgrp_id));
+ if (blkcg) {
+ spin_lock(&bdi->cgwb_waiters_lock);
+ list_del_rcu(&blkcg->cgwb_wait_node);
+ spin_unlock(&bdi->cgwb_waiters_lock);
+ }
+ rcu_read_unlock();
+ synchronize_rcu();
+}
+
/*
* This is called when we want to actually walk up the hierarchy and check to
* see if we need to throttle, and then actually throttle if there is some
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..da817896cded 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,13 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
{
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+ struct throtl_data *td = tg->td;
+ struct request_queue *q = td->queue;
+ struct backing_dev_info *bdi = q->backing_dev_info;
+ struct blkcg_gq *blkg = tg_to_blkg(tg);

/*
- * Currently whole state machine of group depends on first bio
+ * Currently whole state machine of group depends on first bio
* queued in the group bio list. So one should not be calling
* this function with a different bio if there are other bios
* queued.
@@ -981,8 +985,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
bio != throtl_peek_queued(&tg->service_queue.queued[rw]));

/* If tg->bps = -1, then BW is unlimited */
- if (tg_bps_limit(tg, rw) == U64_MAX &&
- tg_iops_limit(tg, rw) == UINT_MAX) {
+ if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi) ||
+ (tg_bps_limit(tg, rw) == U64_MAX &&
+ tg_iops_limit(tg, rw) == UINT_MAX)) {
if (wait)
*wait = 0;
return true;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..13880774af3c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2446,6 +2446,8 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

+ blkcg_start_wb_wait_on_bdi(bdi);
+
/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
bdi_down_write_wb_switch_rwsem(bdi);
bdi_split_work_to_wbs(bdi, &work, false);
@@ -2453,6 +2455,8 @@ void sync_inodes_sb(struct super_block *sb)
bdi_up_write_wb_switch_rwsem(bdi);

wait_sb_inodes(sb);
+
+ blkcg_stop_wb_wait_on_bdi(bdi);
}
EXPORT_SYMBOL(sync_inodes_sb);

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 07e02d6df5ad..095e4dd0427b 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,8 @@ struct backing_dev_info {
struct rb_root cgwb_congested_tree; /* their congested states */
struct mutex cgwb_release_mutex; /* protect shutdown of wb structs */
struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
+ struct list_head cgwb_waiters; /* list of all waiters for writeback */
+ spinlock_t cgwb_waiters_lock; /* protect cgwb_waiters list */
#else
struct bdi_writeback_congested *wb_congested;
#endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 76c61318fda5..5ea6b31c1df4 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -56,6 +56,7 @@ struct blkcg {

struct list_head all_blkcgs_node;
#ifdef CONFIG_CGROUP_WRITEBACK
+ struct list_head cgwb_wait_node;
struct list_head cgwb_list;
refcount_t cgwb_refcnt;
#endif
@@ -454,6 +455,10 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
blkcg_destroy_blkgs(blkcg);
}

+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi);
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi);
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi);
+
#else

static inline void blkcg_cgwb_get(struct blkcg *blkcg) { }
@@ -464,6 +469,14 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
blkcg_destroy_blkgs(blkcg);
}

+static inline bool
+blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+ return false;
+}
+static inline void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+static inline void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+
#endif

/**
@@ -772,6 +785,7 @@ static inline void blkcg_bio_issue_init(struct bio *bio)
static inline bool blkcg_bio_issue_check(struct request_queue *q,
struct bio *bio)
{
+ struct backing_dev_info *bdi = q->backing_dev_info;
struct blkcg_gq *blkg;
bool throtl = false;

@@ -785,9 +799,11 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
bio_devname(bio, b));
bio_associate_blkg(bio);
}
-
blkg = bio->bi_blkg;

+ if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi))
+ bio_set_flag(bio, BIO_THROTTLED);
+
throtl = blk_throtl_bio(q, blkg, bio);

if (!throtl) {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..8848d26e8bf6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -686,10 +686,12 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
{
int ret;

+ INIT_LIST_HEAD(&bdi->cgwb_waiters);
INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
bdi->cgwb_congested_tree = RB_ROOT;
mutex_init(&bdi->cgwb_release_mutex);
init_rwsem(&bdi->wb_switch_rwsem);
+ spin_lock_init(&bdi->cgwb_waiters_lock);

ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
if (!ret) {
--
2.17.1



2019-02-11 15:40:26

by Josef Bacik

Subject: Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> This is an attempt to mitigate the priority inversion problem of a
> high-priority blkcg issuing a sync() and being forced to wait the
> completion of all the writeback I/O generated by any other low-priority
> blkcg, causing massive latencies to processes that shouldn't be
> I/O-throttled at all.
>
> The idea is to save a list of blkcg's that are waiting for writeback:
> every time a sync() is executed the current blkcg is added to the list.
>
> Then, when I/O is throttled, if there's a blkcg waiting for writeback
> different than the current blkcg, no throttling is applied (we can
> probably refine this logic later, i.e., a better policy could be to
> adjust the throttling I/O rate using the blkcg with the highest speed
> from the list of waiters - priority inheritance, kinda).
>
> This topic has been discussed here:
> https://lwn.net/ml/cgroups/[email protected]/
>
> But we didn't come up with any definitive solution.
>
> This patch is not a definitive solution either, but it's an attempt to
> continue addressing this issue and handling the priority inversion
> problem with sync() in a better way.
>
> Signed-off-by: Andrea Righi <[email protected]>

Talked with Tejun about this some and we agreed the following is probably the
best way forward:

1) Track the submitter of the wb work down to the writeback code.
2) Sync() defaults to the root cg, and it writes all the things as the root
cg.
3) Add a flag to the cgroups that would make sync()'ers in that group only be
allowed to write out things that belong to their group (a rough sketch of
this follows below).
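
To illustrate (3): the flag name and the helper in the sketch below are
invented here and are not an agreed interface; only bdi_split_work_to_wbs(),
wb->blkcg_css and blkcg->css exist today.

/*
 * Hypothetical sketch: "sync_isolation" is an invented per-cgroup flag.
 * Something like bdi_split_work_to_wbs() could consult this helper when
 * queueing writeback for a sync(), so a flagged cgroup only writes out
 * (and waits for) its own wb's, while everyone else keeps today's
 * behaviour of writing everything back as the root cg.
 */
static bool wb_belongs_to_syncer(struct bdi_writeback *wb,
                                 struct blkcg *sync_blkcg)
{
        /* default: sync() is attributed to the root cg and writes it all */
        if (!sync_blkcg || !sync_blkcg->sync_isolation)
                return true;

        /* flagged cgroups only touch inodes attached to their own wb's */
        return wb->blkcg_css == &sync_blkcg->css;
}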

This way we avoid the priority inversion of having things like systemd or a
random logged-in user doing sync() and having to wait, and we keep low-prio
cgroups from causing big IO storms by syncing out stuff and getting upgraded
to root priority just to avoid the inversion.

Obviously by default we want this flag to be off since it's such a big change,
but people/setups really worried about this behavior (Facebook for instance
would likely use this flag) can go ahead and set it and be sure we're getting
good isolation while still avoiding the priority inversion associated with
running sync() from a high-priority context. Thanks,

Josef

2019-02-11 20:47:19

by Andrea Righi

Subject: Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

On Mon, Feb 11, 2019 at 10:39:34AM -0500, Josef Bacik wrote:
> On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> > This is an attempt to mitigate the priority inversion problem of a
> > high-priority blkcg issuing a sync() and being forced to wait the
> > completion of all the writeback I/O generated by any other low-priority
> > blkcg, causing massive latencies to processes that shouldn't be
> > I/O-throttled at all.
> >
> > The idea is to save a list of blkcg's that are waiting for writeback:
> > every time a sync() is executed the current blkcg is added to the list.
> >
> > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > different than the current blkcg, no throttling is applied (we can
> > probably refine this logic later, i.e., a better policy could be to
> > adjust the throttling I/O rate using the blkcg with the highest speed
> > from the list of waiters - priority inheritance, kinda).
> >
> > This topic has been discussed here:
> > https://lwn.net/ml/cgroups/[email protected]/
> >
> > But we didn't come up with any definitive solution.
> >
> > This patch is not a definitive solution either, but it's an attempt to
> > continue addressing this issue and handling the priority inversion
> > problem with sync() in a better way.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
>
> Talked with Tejun about this some and we agreed the following is probably the
> best way forward

First of all thanks for the update!

>
> 1) Track the submitter of the wb work to the writeback code.

Are we going to track the cgroup that originated the dirty pages (or
maybe the dirty inodes), or do you have a particular approach in mind?

> 2) Sync() defaults to the root cg, and and it writes all the things as the root
> cg.

OK.

> 3) Add a flag to the cgroups that would make sync()'ers in that group only be
> allowed to write out things that belong to its group.

So, IIUC, when this flag is enabled, a cgroup that is doing sync() would
trigger writeback only for the pages that belong to that cgroup, and it
would wait only for those pages to be synced, right? In this case
writeback can still go at the cgroup's own speed.

When the flag is disabled, instead, sync() would trigger writeback I/O
globally, as usual, and it would go at full speed (the root cgroup's
speed).

Am I understanding correctly?

>
> This way we avoid the priority inversion of having things like systemd or random
> logged in user doing sync() and having to wait, and we keep low prio cgroups
> from causing big IO storms by syncing out stuff and getting upgraded to root
> priority just to avoid the inversion.
>
> Obviously by default we want this flag to be off since its such a big change,
> but people/setups really worried about this behavior (Facebook for instance
> would likely use this flag) can go ahead and set it and be sure we're getting
> good isolation and still avoiding the priority inversion associated with running
> sync from a high priority context. Thanks,
>
> Josef

Thanks,
-Andrea

2019-02-11 21:35:11

by Josef Bacik

Subject: Re: [RFC PATCH v2] blkcg: prevent priority inversion problem during sync()

On Mon, Feb 11, 2019 at 09:40:29PM +0100, Andrea Righi wrote:
> On Mon, Feb 11, 2019 at 10:39:34AM -0500, Josef Bacik wrote:
> > On Sat, Feb 09, 2019 at 03:07:49PM +0100, Andrea Righi wrote:
> > > This is an attempt to mitigate the priority inversion problem of a
> > > high-priority blkcg issuing a sync() and being forced to wait the
> > > completion of all the writeback I/O generated by any other low-priority
> > > blkcg, causing massive latencies to processes that shouldn't be
> > > I/O-throttled at all.
> > >
> > > The idea is to save a list of blkcg's that are waiting for writeback:
> > > every time a sync() is executed the current blkcg is added to the list.
> > >
> > > Then, when I/O is throttled, if there's a blkcg waiting for writeback
> > > different than the current blkcg, no throttling is applied (we can
> > > probably refine this logic later, i.e., a better policy could be to
> > > adjust the throttling I/O rate using the blkcg with the highest speed
> > > from the list of waiters - priority inheritance, kinda).
> > >
> > > This topic has been discussed here:
> > > https://lwn.net/ml/cgroups/[email protected]/
> > >
> > > But we didn't come up with any definitive solution.
> > >
> > > This patch is not a definitive solution either, but it's an attempt to
> > > continue addressing this issue and handling the priority inversion
> > > problem with sync() in a better way.
> > >
> > > Signed-off-by: Andrea Righi <[email protected]>
> >
> > Talked with Tejun about this some and we agreed the following is probably the
> > best way forward
>
> First of all thanks for the update!
>
> >
> > 1) Track the submitter of the wb work to the writeback code.
>
> Are we going to track the cgroup that originated the dirty pages (or
> maybe dirty inodes) or do you have any idea in particular?
>

The guy doing the sync(), so that we can accomplish #3. But really this is
an implementation detail; however you want to accomplish it is fine by me.

> > 2) Sync() defaults to the root cg, and and it writes all the things as the root
> > cg.
>
> OK.
>
> > 3) Add a flag to the cgroups that would make sync()'ers in that group only be
> > allowed to write out things that belong to its group.
>
> So, IIUC, when this flag is enabled a cgroup that is doing sync() would
> trigger the writeback of the pages that belong to that cgroup only and
> it waits only for these pages to be sync-ed, right? In this case
> writeback can still go at cgroup's speed.
>
> Instead when the flag is disabled, sync() would trigger writeback I/O
> globally, as usual, and it goes at full speed (root cgroup's speed).
>
> Am I understanding correctly?
>

Yup that's exactly it. Thanks,

Josef