Date: Fri, 8 Mar 2019 22:28:06 +0100
From: Andrea Righi
To: Josef Bacik, Tejun Heo
Cc: Li Zefan, Paolo Valente, Johannes Weiner, Jens Axboe, Vivek Goyal,
    Dennis Zhou, cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: [PATCH v3] blkcg: prevent priority inversion problem during sync()
Message-ID: <20190308212806.GA1172@xps-13>
X-Mailing-List: linux-kernel@vger.kernel.org

When sync(2) is executed from a high-priority cgroup, the process is
forced to wait for the completion of all outstanding writeback I/O,
potentially including I/O that was originally generated by low-priority
cgroups.

This may cause massive latencies for random processes (even those
running in the root cgroup) that shouldn't be I/O-throttled at all,
similar to a classic priority inversion problem.

Prevent this problem by saving a list of blkcgs that are waiting for
writeback: every time sync(2) is executed, the current blkcg is added
to the list.
Then, when I/O is throttled, if a blkcg other than the current one is
waiting for writeback, no throttling is applied (we can probably refine
this logic later, e.g., a better policy could be to adjust the I/O rate
using the blkcg with the highest speed from the list of waiters).

See also: https://lkml.org/lkml/2019/3/7/640

Signed-off-by: Andrea Righi
---
Changes in v3:
 - drop sync(2) isolation patches (this will be addressed by another
   patch, potentially operating at the fs namespace level)
 - use a per-bdi lock and a per-bdi list instead of a global lock and a
   global list to save the list of sync(2) waiters

 block/blk-cgroup.c               | 130 +++++++++++++++++++++++++++++++
 block/blk-throttle.c             |  11 ++-
 fs/fs-writeback.c                |   5 ++
 fs/sync.c                        |   8 +-
 include/linux/backing-dev-defs.h |   2 +
 include/linux/blk-cgroup.h       |  25 ++++++
 mm/backing-dev.c                 |   2 +
 7 files changed, 179 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2bed5725aa03..b380d678cfc2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1351,6 +1351,136 @@ struct cgroup_subsys io_cgrp_subsys = {
 };
 EXPORT_SYMBOL_GPL(io_cgrp_subsys);
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+struct blkcg_wb_sleeper {
+	struct blkcg *blkcg;
+	refcount_t refcnt;
+	struct list_head node;
+};
+
+static struct blkcg_wb_sleeper *
+blkcg_wb_sleeper_find(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+	struct blkcg_wb_sleeper *bws;
+
+	list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+		if (bws->blkcg == blkcg)
+			return bws;
+	return NULL;
+}
+
+static void
+blkcg_wb_sleeper_add(struct backing_dev_info *bdi, struct blkcg_wb_sleeper *bws)
+{
+	list_add(&bws->node, &bdi->cgwb_waiters);
+}
+
+static void
+blkcg_wb_sleeper_del(struct backing_dev_info *bdi, struct blkcg_wb_sleeper *bws)
+{
+	list_del_init(&bws->node);
+}
+
+/**
+ * blkcg_wb_waiters_on_bdi - check for writeback waiters on a block device
+ * @blkcg: current blkcg cgroup
+ * @bdi: block device to check
+ *
+ * Return true if any other blkcg different than the current one is waiting for
+ * writeback on the target block device, false otherwise.
+ */
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+	struct blkcg_wb_sleeper *bws;
+	bool ret = false;
+
+	if (likely(list_empty(&bdi->cgwb_waiters)))
+		return false;
+	spin_lock(&bdi->cgwb_waiters_lock);
+	list_for_each_entry(bws, &bdi->cgwb_waiters, node)
+		if (bws->blkcg != blkcg) {
+			ret = true;
+			break;
+		}
+	spin_unlock(&bdi->cgwb_waiters_lock);
+
+	return ret;
+}
+
+/**
+ * blkcg_start_wb_wait_on_bdi - add current blkcg to writeback waiters list
+ * @bdi: target block device
+ *
+ * Add current blkcg to the list of writeback waiters on target block device.
+ */
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+	struct blkcg_wb_sleeper *new_bws, *bws;
+	struct blkcg *blkcg;
+
+	new_bws = kzalloc(sizeof(*new_bws), GFP_KERNEL);
+	if (unlikely(!new_bws))
+		return;
+
+	rcu_read_lock();
+	blkcg = blkcg_from_current();
+	if (likely(blkcg)) {
+		/* Check if blkcg is already sleeping on bdi */
+		spin_lock_bh(&bdi->cgwb_waiters_lock);
+		bws = blkcg_wb_sleeper_find(blkcg, bdi);
+		if (bws) {
+			refcount_inc(&bws->refcnt);
+		} else {
+			/* Add current blkcg as a new wb sleeper on bdi */
+			css_get(&blkcg->css);
+			new_bws->blkcg = blkcg;
+			refcount_set(&new_bws->refcnt, 1);
+			blkcg_wb_sleeper_add(bdi, new_bws);
+			new_bws = NULL;
+		}
+		spin_unlock_bh(&bdi->cgwb_waiters_lock);
+	}
+	rcu_read_unlock();
+
+	kfree(new_bws);
+}
+
+/**
+ * blkcg_stop_wb_wait_on_bdi - remove current blkcg from writeback waiters list
+ * @bdi: target block device
+ *
+ * Remove current blkcg from the list of writeback waiters on target block
+ * device.
+ */
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi)
+{
+	struct blkcg_wb_sleeper *bws = NULL;
+	struct blkcg *blkcg;
+
+	rcu_read_lock();
+	blkcg = blkcg_from_current();
+	if (!blkcg) {
+		rcu_read_unlock();
+		return;
+	}
+	spin_lock_bh(&bdi->cgwb_waiters_lock);
+	bws = blkcg_wb_sleeper_find(blkcg, bdi);
+	if (unlikely(!bws)) {
+		/* blkcg_start/stop_wb_wait_on_bdi() mismatch */
+		WARN_ON(1);
+		goto out_unlock;
+	}
+	if (refcount_dec_and_test(&bws->refcnt)) {
+		blkcg_wb_sleeper_del(bdi, bws);
+		css_put(&blkcg->css);
+		kfree(bws);
+	}
+out_unlock:
+	spin_unlock_bh(&bdi->cgwb_waiters_lock);
+	rcu_read_unlock();
+}
+#endif
+
 /**
  * blkcg_activate_policy - activate a blkcg policy on a request_queue
  * @q: request_queue of interest
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 1b97a73d2fb1..da817896cded 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -970,9 +970,13 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 {
 	bool rw = bio_data_dir(bio);
 	unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
+	struct throtl_data *td = tg->td;
+	struct request_queue *q = td->queue;
+	struct backing_dev_info *bdi = q->backing_dev_info;
+	struct blkcg_gq *blkg = tg_to_blkg(tg);
 
 	/*
-	 * Currently whole state machine of group depends on first bio
+	 * Currently whole state machine of group depends on first bio
 	 * queued in the group bio list. So one should not be calling
 	 * this function with a different bio if there are other bios
 	 * queued.
@@ -981,8 +985,9 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (tg_bps_limit(tg, rw) == U64_MAX &&
-	    tg_iops_limit(tg, rw) == UINT_MAX) {
+	if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi) ||
+	    (tg_bps_limit(tg, rw) == U64_MAX &&
+	    tg_iops_limit(tg, rw) == UINT_MAX)) {
 		if (wait)
 			*wait = 0;
 		return true;
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 36855c1f8daf..77c039a0ec25 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -28,6 +28,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 
 /*
@@ -2446,6 +2447,8 @@ void sync_inodes_sb(struct super_block *sb)
 		return;
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
+	blkcg_start_wb_wait_on_bdi(bdi);
+
 	/* protect against inode wb switch, see inode_switch_wbs_work_fn() */
 	bdi_down_write_wb_switch_rwsem(bdi);
 	bdi_split_work_to_wbs(bdi, &work, false);
@@ -2453,6 +2456,8 @@ void sync_inodes_sb(struct super_block *sb)
 	bdi_up_write_wb_switch_rwsem(bdi);
 
 	wait_sb_inodes(sb);
+
+	blkcg_stop_wb_wait_on_bdi(bdi);
 }
 EXPORT_SYMBOL(sync_inodes_sb);
 
diff --git a/fs/sync.c b/fs/sync.c
index b54e0541ad89..3958b8f98b85 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -16,6 +16,7 @@
 #include
 #include
 #include
+#include
 #include "internal.h"
 
 #define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
@@ -76,8 +77,13 @@ static void sync_inodes_one_sb(struct super_block *sb, void *arg)
 
 static void sync_fs_one_sb(struct super_block *sb, void *arg)
 {
-	if (!sb_rdonly(sb) && sb->s_op->sync_fs)
+	struct backing_dev_info *bdi = sb->s_bdi;
+
+	if (!sb_rdonly(sb) && sb->s_op->sync_fs) {
+		blkcg_start_wb_wait_on_bdi(bdi);
 		sb->s_op->sync_fs(sb, *(int *)arg);
+		blkcg_stop_wb_wait_on_bdi(bdi);
+	}
 }
 
 static void fdatawrite_one_bdev(struct block_device *bdev, void *arg)
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 07e02d6df5ad..095e4dd0427b 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,8 @@ struct backing_dev_info {
 	struct rb_root cgwb_congested_tree; /* their congested states */
 	struct mutex cgwb_release_mutex;  /* protect shutdown of wb structs */
 	struct rw_semaphore wb_switch_rwsem; /* no cgwb switch while syncing */
+	struct list_head cgwb_waiters; /* list of all waiters for writeback */
+	spinlock_t cgwb_waiters_lock; /* protect cgwb_waiters list */
 #else
 	struct bdi_writeback_congested *wb_congested;
 #endif
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 76c61318fda5..66d7b6901c77 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -56,6 +56,7 @@ struct blkcg {
 
 	struct list_head		all_blkcgs_node;
 #ifdef CONFIG_CGROUP_WRITEBACK
+	struct list_head		cgwb_wait_node;
 	struct list_head		cgwb_list;
 	refcount_t			cgwb_refcnt;
 #endif
@@ -252,6 +253,12 @@ static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
 	return css ? container_of(css, struct blkcg, css) : NULL;
 }
 
+static inline struct blkcg *blkcg_from_current(void)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	return css_to_blkcg(blkcg_css());
+}
+
 /**
  * __bio_blkcg - internal, inconsistent version to get blkcg
  *
@@ -454,6 +461,10 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
 		blkcg_destroy_blkgs(blkcg);
 }
 
+bool blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi);
+void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi);
+void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi);
+
 #else
 
 static inline void blkcg_cgwb_get(struct blkcg *blkcg) { }
@@ -464,6 +475,14 @@ static inline void blkcg_cgwb_put(struct blkcg *blkcg)
 	blkcg_destroy_blkgs(blkcg);
 }
 
+static inline bool
+blkcg_wb_waiters_on_bdi(struct blkcg *blkcg, struct backing_dev_info *bdi)
+{
+	return false;
+}
+static inline void blkcg_start_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+static inline void blkcg_stop_wb_wait_on_bdi(struct backing_dev_info *bdi) { }
+
 #endif
 
 /**
@@ -772,6 +791,7 @@ static inline void blkcg_bio_issue_init(struct bio *bio)
 static inline bool blkcg_bio_issue_check(struct request_queue *q,
 					 struct bio *bio)
 {
+	struct backing_dev_info *bdi = q->backing_dev_info;
 	struct blkcg_gq *blkg;
 	bool throtl = false;
 
@@ -788,6 +808,11 @@ static inline bool blkcg_bio_issue_check(struct request_queue *q,
 
 	blkg = bio->bi_blkg;
 
+	local_bh_disable();
+	if (blkcg_wb_waiters_on_bdi(blkg->blkcg, bdi))
+		bio_set_flag(bio, BIO_THROTTLED);
+	local_bh_enable();
+
 	throtl = blk_throtl_bio(q, blkg, bio);
 
 	if (!throtl) {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 72e6d0c55cfa..8848d26e8bf6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -686,10 +686,12 @@ static int cgwb_bdi_init(struct backing_dev_info *bdi)
 {
 	int ret;
 
+	INIT_LIST_HEAD(&bdi->cgwb_waiters);
 	INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
 	bdi->cgwb_congested_tree = RB_ROOT;
 	mutex_init(&bdi->cgwb_release_mutex);
 	init_rwsem(&bdi->wb_switch_rwsem);
+	spin_lock_init(&bdi->cgwb_waiters_lock);
 
 	ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL);
 	if (!ret) {
-- 
2.20.1
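
As a rough illustration of the waiter-list idea described in the commit
message, here is a minimal userspace C sketch. It is not the kernel code
and makes several assumptions: `struct waiter`, `struct device`,
`start_wait()`, `stop_wait()`, `other_waiters()` and `should_throttle()`
are hypothetical names, an integer id stands in for a blkcg, and the
per-bdi spinlock, RCU and css reference counting used by the patch are
left out, so the sketch is only meaningful single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct blkcg_wb_sleeper. */
struct waiter {
	int group_id;		/* stands in for a struct blkcg pointer */
	unsigned int refcnt;	/* concurrent sync() callers in this group */
	struct waiter *next;
};

/* Hypothetical stand-in for the per-bdi waiter list. */
struct device {
	struct waiter *waiters;
};

static struct waiter *waiter_find(struct device *dev, int group_id)
{
	for (struct waiter *w = dev->waiters; w; w = w->next)
		if (w->group_id == group_id)
			return w;
	return NULL;
}

/* A group starts waiting for writeback; returns false if allocation fails. */
bool start_wait(struct device *dev, int group_id)
{
	struct waiter *w = waiter_find(dev, group_id);

	if (w) {
		/* Group is already on the list: just count one more waiter. */
		w->refcnt++;
		return true;
	}
	w = calloc(1, sizeof(*w));
	if (!w)
		return false;
	w->group_id = group_id;
	w->refcnt = 1;
	w->next = dev->waiters;
	dev->waiters = w;
	return true;
}

/* A group stops waiting; returns false on a start/stop mismatch. */
bool stop_wait(struct device *dev, int group_id)
{
	for (struct waiter **pp = &dev->waiters; *pp; pp = &(*pp)->next) {
		struct waiter *w = *pp;

		if (w->group_id != group_id)
			continue;
		assert(w->refcnt > 0);
		if (--w->refcnt == 0) {
			/* Last waiter of this group: unlink and free. */
			*pp = w->next;
			free(w);
		}
		return true;
	}
	return false;
}

/* True if any group other than group_id is waiting on dev. */
bool other_waiters(const struct device *dev, int group_id)
{
	for (const struct waiter *w = dev->waiters; w; w = w->next)
		if (w->group_id != group_id)
			return true;
	return false;
}

/* Throttle only if the group has a limit and nobody else is waiting. */
bool should_throttle(const struct device *dev, int group_id, bool has_limit)
{
	return has_limit && !other_waiters(dev, group_id);
}
```

In this sketch, a rate-limited group 1 is throttled as usual until
group 2 calls `start_wait()`; from then until the matching `stop_wait()`,
`should_throttle()` returns false for group 1, which mirrors the
"no throttling while another blkcg waits" policy of the patch.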