From: Jinke Han
To: tj@kernel.org, josef@toxicpanda.com, axboe@kernel.dk
Cc: cgroups@vger.kernel.org, linux-block@vger.kernel.org,
    linux-kernel@vger.kernel.org, yinxin.x@bytedance.com, Jinke Han
Subject: [PATCH v3] blk-throtl: Introduce sync and async queues for blk-throtl
Date: Mon, 26 Dec 2022 21:05:05 +0800
Message-Id: <20221226130505.7186-1-hanjinke.666@bytedance.com>
X-Mailer: git-send-email 2.32.0 (Apple Git-132)

From: Jinke Han

Now, blk-throtl does not distinguish sync write ios from normal buffered
write ios. A bio tagged with REQ_SYNC usually means that it will be waited
on for completion soon after it is submitted, so it is reasonable for sync
ios to complete as soon as possible.

In our test, fio writes a 100g file sequentially with a 4k block size in a
container with a low bps limit configured (wbps=10M). More than 1200 ios
were throttled in the blk-throtl queue, and the average throttle time of
each io was 140s. At the same time, saving a small file with vim was
blocked for almost 140s: vim issues an fsync, and the sync ios of that
fsync are queued behind a huge number of buffered write ios ahead of them.
This is also a priority inversion problem within one cgroup. In database
scenarios, things get really bad with blk-throtl enabled, as fsync is
called very often.

This patch splits the bio queue into sync and async queues for blk-throtl
and gives sync write ios a much higher priority. The sync queue only makes
sense for write ios, as all read ios are treated as sync ios; I think this
is a nice fit for the semantics of REQ_SYNC. Bios with REQ_META or
REQ_PRIO get the same priority, as they are important to the filesystem.
This may avoid some potential priority inversion problems.

With this patch, in the same test as above, the duration of the fsync
issued by vim drops to several hundred milliseconds.

Signed-off-by: Jinke Han
Signed-off-by: Xin Yin
---
Changes in v2
- Simplify the code.

Changes in v3
- Fix a mismatch between the waiting bio and the next dispatched bio.
- Rename dispatch_sync_cnt to disp_sync_cnt.
- Add more comments.

 block/blk-throttle.c | 135 ++++++++++++++++++++++++++++++++++++++++---
 block/blk-throttle.h |  13 ++++-
 2 files changed, 139 insertions(+), 9 deletions(-)
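To illustrate the dispatch pattern, here is a minimal userspace model of
the pop policy (a sketch, not kernel code: queued[] and pop() are invented
for the illustration, while THROTL_SYNC_FACTOR, SYNC, ASYNC and
disp_sync_cnt mirror the patch). Sync ios are dispatched in bursts of up
to THROTL_SYNC_FACTOR with one async io in between, falling back to the
other queue when the preferred one is empty:

#include <stdio.h>

#define THROTL_SYNC_FACTOR 4
#define SYNC  0
#define ASYNC 1

static int queued[2] = { 6, 6 };	/* pending "bios" in each queue */
static unsigned int disp_sync_cnt;

static int pop(void)
{
	/* prefer sync; every THROTL_SYNC_FACTOR-th dispatch takes an async bio */
	int from = (disp_sync_cnt == THROTL_SYNC_FACTOR) ? ASYNC : SYNC;

	if (!queued[from])
		from = 1 - from;	/* preferred queue empty: try the other */
	if (!queued[from])
		return -1;		/* both queues empty */
	queued[from]--;

	if (disp_sync_cnt < THROTL_SYNC_FACTOR && from == SYNC)
		disp_sync_cnt++;	/* count consecutive sync dispatches */
	else
		disp_sync_cnt = 0;	/* reset after an async dispatch */
	return from;
}

int main(void)
{
	int from;

	/* with 6 sync + 6 async bios queued, prints: S S S S A S S A A A A A */
	while ((from = pop()) >= 0)
		printf("%s ", from == SYNC ? "S" : "A");
	printf("\n");
	return 0;
}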
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 847721dc2b2b..c4db7873dfa8 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -21,6 +21,13 @@
 /* Total max dispatch from all groups in one round */
 #define THROTL_QUANTUM 32
 
+/* For write ios, dispatch 4 sync ios and 1 normal io in one loop */
+#define THROTL_SYNC_FACTOR 4
+
+/* Only makes sense for write ios; all read ios are treated as SYNC */
+#define SYNC 0
+#define ASYNC 1
+
 /* Throttling is performed over a slice and after that slice is renewed */
 #define DFL_THROTL_SLICE_HD (HZ / 10)
 #define DFL_THROTL_SLICE_SSD (HZ / 50)
@@ -88,6 +95,7 @@ struct throtl_data
 };
 
 static void throtl_pending_timer_fn(struct timer_list *t);
+static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn);
 
 static inline struct blkcg_gq *tg_to_blkg(struct throtl_grp *tg)
 {
@@ -241,11 +249,28 @@ static inline unsigned int throtl_bio_data_size(struct bio *bio)
 	return bio->bi_iter.bi_size;
 }
 
-static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
+static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg,
+			      bool rw)
 {
 	INIT_LIST_HEAD(&qn->node);
-	bio_list_init(&qn->bios);
+	bio_list_init(&qn->bios[SYNC]);
+	bio_list_init(&qn->bios[ASYNC]);
 	qn->tg = tg;
+	qn->disp_sync_cnt = (rw == READ) ? UINT_MAX : 0;
+	qn->next_to_disp = NULL;
+}
+
+#define BLK_THROTL_SYNC(bio) (bio->bi_opf & (REQ_SYNC | REQ_META | REQ_PRIO))
+
+static inline void throtl_qnode_add_bio_list(struct throtl_qnode *qn,
+					     struct bio *bio)
+{
+	bool rw = bio_data_dir(bio);
+
+	if ((rw == READ) || BLK_THROTL_SYNC(bio))
+		bio_list_add(&qn->bios[SYNC], bio);
+	else
+		bio_list_add(&qn->bios[ASYNC], bio);
 }
 
 /**
@@ -261,13 +286,45 @@ static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg)
 static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn,
 				 struct list_head *queued)
 {
-	bio_list_add(&qn->bios, bio);
+	throtl_qnode_add_bio_list(qn, bio);
 	if (list_empty(&qn->node)) {
 		list_add_tail(&qn->node, queued);
 		blkg_get(tg_to_blkg(qn->tg));
 	}
 }
 
+/**
+ * throtl_qnode_bio_peek - peek a bio for a qn
+ * @qn: the qnode to peek from
+ *
+ * For a read qn, just peek a bio from the SYNC queue and return it.
+ * For a write qn, ask next_to_disp first and pop a bio to fill it if it
+ * is NULL. next_to_disp pins the bio that will be dispatched next, which
+ * is necessary: during dispatch, a peeked bio may not be dispatchable for
+ * lack of budget and has to wait; the dispatching process may then give
+ * up and release the request queue's spinlock, and new bios may be queued
+ * while the lock is released. When it is time to dispatch the waiting
+ * bio, a different bio could otherwise be selected to check the limit and
+ * be dispatched, and if that bio is smaller than the waiting one, the
+ * configured bandwidth may be hard to satisfy, as we may trim the slice
+ * after each dispatch.
+ * So pin next_to_disp to make sure that the waiting bio and the bio
+ * dispatched later are always the same one, even if the queue's spinlock
+ * was released and re-acquired in between.
+ */
+static inline struct bio *throtl_qnode_bio_peek(struct throtl_qnode *qn)
+{
+	/* qn for read ios */
+	if (qn->disp_sync_cnt == UINT_MAX)
+		return bio_list_peek(&qn->bios[SYNC]);
+
+	/* qn for write ios */
+	if (!qn->next_to_disp)
+		qn->next_to_disp = throtl_qnode_bio_list_pop(qn);
+
+	return qn->next_to_disp;
+}
+
 /**
  * throtl_peek_queued - peek the first bio on a qnode list
  * @queued: the qnode list to peek
@@ -281,11 +338,73 @@ static struct bio *throtl_peek_queued(struct list_head *queued)
 		return NULL;
 
 	qn = list_first_entry(queued, struct throtl_qnode, node);
-	bio = bio_list_peek(&qn->bios);
+	bio = throtl_qnode_bio_peek(qn);
 	WARN_ON_ONCE(!bio);
 	return bio;
 }
 
+/**
+ * throtl_qnode_bio_list_pop - pop a bio from the sync/async queues
+ * @qn: the qnode to pop a bio from
+ *
+ * Only used for write qns. The queue to pop from is determined by
+ * disp_sync_cnt: try to pop a bio from the target queue and return it if
+ * that queue is not empty; if the target queue is empty, pop a bio from
+ * the other queue instead.
+ */
+static inline struct bio *throtl_qnode_bio_list_pop(struct throtl_qnode *qn)
+{
+	struct bio *bio;
+	int from = SYNC;
+
+	if (qn->disp_sync_cnt == THROTL_SYNC_FACTOR)
+		from = ASYNC;
+
+	bio = bio_list_pop(&qn->bios[from]);
+	if (!bio) {
+		from = 1 - from;
+		bio = bio_list_pop(&qn->bios[from]);
+	}
+
+	if ((qn->disp_sync_cnt < THROTL_SYNC_FACTOR) &&
+	    (from == SYNC))
+		qn->disp_sync_cnt++;
+	else
+		qn->disp_sync_cnt = 0;
+
+	return bio;
+}
+
+/**
+ * throtl_qnode_bio_pop - pop a bio from a qnode
+ * @qn: the qnode to pop a bio from
+ */
+static inline struct bio *throtl_qnode_bio_pop(struct throtl_qnode *qn)
+{
+	struct bio *bio;
+
+	/* qn for read ios */
+	if (qn->disp_sync_cnt == UINT_MAX)
+		return bio_list_pop(&qn->bios[SYNC]);
+
+	/* qn for write ios */
+	if (qn->next_to_disp) {
+		bio = qn->next_to_disp;
+		qn->next_to_disp = NULL;
+		return bio;
+	}
+
+	return throtl_qnode_bio_list_pop(qn);
+}
+
+static inline bool throtl_qnode_empty(struct throtl_qnode *qn)
+{
+	if (!qn->next_to_disp &&
+	    bio_list_empty(&qn->bios[SYNC]) &&
+	    bio_list_empty(&qn->bios[ASYNC]))
+		return true;
+	return false;
+}
+
 /**
  * throtl_pop_queued - pop the first bio form a qnode list
  * @queued: the qnode list to pop a bio from
@@ -310,10 +429,10 @@ static struct bio *throtl_pop_queued(struct list_head *queued,
 		return NULL;
 
 	qn = list_first_entry(queued, struct throtl_qnode, node);
-	bio = bio_list_pop(&qn->bios);
+	bio = throtl_qnode_bio_pop(qn);
 	WARN_ON_ONCE(!bio);
 
-	if (bio_list_empty(&qn->bios)) {
+	if (throtl_qnode_empty(qn)) {
 		list_del_init(&qn->node);
 		if (tg_to_put)
 			*tg_to_put = qn->tg;
@@ -355,8 +474,8 @@ static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp,
 	throtl_service_queue_init(&tg->service_queue);
 
 	for (rw = READ; rw <= WRITE; rw++) {
-		throtl_qnode_init(&tg->qnode_on_self[rw], tg);
-		throtl_qnode_init(&tg->qnode_on_parent[rw], tg);
+		throtl_qnode_init(&tg->qnode_on_self[rw], tg, rw);
+		throtl_qnode_init(&tg->qnode_on_parent[rw], tg, rw);
 	}
 
 	RB_CLEAR_NODE(&tg->rb_node);
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index ef4b7a4de987..bd4fff65b6d9 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -28,8 +28,19 @@
  */
 struct throtl_qnode {
 	struct list_head node;		/* service_queue->queued[] */
-	struct bio_list bios;		/* queued bios */
+	struct bio_list bios[2];	/* queued bios */
 	struct throtl_grp *tg;		/* tg this qnode belongs to */
+
+	struct bio *next_to_disp;	/* pinned as the next bio to dispatch */
+	/*
+	 * 1) for a write throtl_qnode:
+	 *    [0, THROTL_SYNC_FACTOR-1]: dispatch a sync io
+	 *    [THROTL_SYNC_FACTOR]: dispatch an async io
+	 *
+	 * 2) for a read throtl_qnode:
+	 *    always UINT_MAX
+	 */
+	unsigned int disp_sync_cnt;	/* sync io dispatch counter */
 };
 
 struct throtl_service_queue {
-- 
2.20.1
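For reference, the fsync latency reported above can be measured with a
small probe like the one below (a sketch, not part of the patch: it
assumes it is run inside a cgroup with a low wbps limit while a
buffered-write workload such as the fio job above saturates the limit;
the file path is arbitrary):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = argc > 1 ? argv[1] : "fsync-probe.tmp";
	char buf[4096];
	struct timespec t0, t1;
	int fd;

	memset(buf, 0, sizeof(buf));
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* queue one small buffered write behind the throttled workload */
	if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
		perror("write");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	fsync(fd);		/* the sync io that the patch prioritizes */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("fsync took %.1f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1000.0 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	close(fd);
	return 0;
}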