Subject: Re: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done
To: Jens Axboe, Anchal Agarwal, "van der Linden, Frank"
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org
References: <1535029718-17259-1-git-send-email-jianchao.w.wang@oracle.com>
 <20180823210144.GB5624@kaos-source-ops-60001.pdx1.amazon.com>
 <3eaa20ce-0599-c405-d979-87d91ea331d2@kernel.dk>
 <969389e7-b1bc-0559-6cc9-9461b034a24f@kernel.dk>
From: "jianchao.wang"
Message-ID: <8af76974-08b2-f4ef-91b9-7bd42291b8d9@oracle.com>
Date: Fri, 24 Aug 2018 13:55:58 +0800
In-Reply-To: <969389e7-b1bc-0559-6cc9-9461b034a24f@kernel.dk>

On 08/24/2018 07:14 AM, Jens Axboe wrote:
> On 8/23/18 5:03 PM, Jens Axboe wrote:
>>> Hi Jens, This patch looks much cleaner for sure, as Frank pointed out
>>> too. Basically this looks similar to wake_up_nr, only making sure that
>>> those woken up requests won't get reordered. This does solve the
>>> thundering herd issue. However, I tested the patch against my
>>> application and lock contention numbers rose to around 10 times from
>>> what I had with your last 3 patches. Again, this did add a drop in
>>> total files read of 0.12% and in the rate at which they were read of
>>> 0.02%, but this is not a very significant drop. Is lock contention
>>> worth the tradeoff? I also added the missing
>>> __set_current_state(TASK_RUNNING) to the patch for testing.
>>
>> Can you try this variant? I don't think we need a
>> __set_current_state() after io_schedule(), it should be fine as-is.
>>
>> I'm not surprised this will raise contention a bit, since we're now
>> waking N tasks potentially, if N can queue. With the initial change,
>> we'd always just wake one. That is arguably incorrect. You say it's
>> 10 times higher contention; how does that compare to before your
>> patch?
>>
>> Is it possible to run something that looks like your workload?
>
> Additionally, is the contention you are seeing the wait queue, or the
> atomic counter? When you say lock contention, I'm inclined to think it's
> the rqw->wait.lock.

I guess the increased lock contention is due to this: while a wake-up is
in progress with the wait head lock held, there are still waiters on the
wait queue, and __wbt_wait has to go to sleep and therefore tries to
acquire the wait head lock as well. This is necessary to keep the order
on the rqw->wait queue.

The attached patch does the following to try to avoid the scenario above:
"
Introduce a wait queue rqw->delayed. Try to lock rqw->wait.lock first;
if that fails, add the waiter to rqw->delayed. __wbt_done will pick up
the waiters on rqw->delayed and queue them on the tail of rqw->wait
before it does the wake-up.
"

Thanks
Jianchao
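
To make the trylock-or-park idea above concrete outside the kernel, here is a
small self-contained userspace sketch. It is an illustration only, not code
from the attached patch: every name in it (add_waiter(), wake_all(),
wait_lock, delayed_lock) is invented, and it pushes waiters onto the head of
the list instead of preserving the FIFO order that the real patch keeps with
list_splice_tail_init(). A waiter first tries the primary lock with trylock
and, if that fails because a wake-up sweep holds it, parks itself on a
secondary "delayed" list; the waker splices the delayed list back before
sweeping the primary list.

/* demo of trylock-or-park; illustrative only, all names invented */
#include <pthread.h>
#include <stdio.h>

struct waiter {
        struct waiter *next;
        int id;
};

static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;    /* role of rqw->wait.lock */
static pthread_mutex_t delayed_lock = PTHREAD_MUTEX_INITIALIZER; /* role of rqw->delayed.lock */
static struct waiter *wait_list;     /* primary queue */
static struct waiter *delayed_list;  /* overflow queue used when trylock fails */

/* Waiter side: prefer the primary queue, fall back to the delayed queue. */
static void add_waiter(struct waiter *w)
{
        if (pthread_mutex_trylock(&wait_lock) == 0) {
                w->next = wait_list;
                wait_list = w;
                pthread_mutex_unlock(&wait_lock);
        } else {
                /* a wake-up sweep owns wait_lock: park on the delayed queue */
                pthread_mutex_lock(&delayed_lock);
                w->next = delayed_list;
                delayed_list = w;
                pthread_mutex_unlock(&delayed_lock);
        }
}

/* Waker side: splice delayed waiters back first, then sweep the primary queue. */
static void wake_all(void)
{
        struct waiter *w;

        pthread_mutex_lock(&wait_lock);
        pthread_mutex_lock(&delayed_lock);
        while ((w = delayed_list) != NULL) {
                delayed_list = w->next;
                w->next = wait_list;    /* head insert for brevity; the patch */
                wait_list = w;          /* splices to the tail to keep FIFO   */
        }
        pthread_mutex_unlock(&delayed_lock);

        while ((w = wait_list) != NULL) {
                wait_list = w->next;
                printf("waking waiter %d\n", w->id);
        }
        pthread_mutex_unlock(&wait_lock);
}

int main(void)
{
        struct waiter a = { .next = NULL, .id = 1 };
        struct waiter b = { .next = NULL, .id = 2 };

        add_waiter(&a);
        add_waiter(&b);
        wake_all();
        return 0;
}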

[Attachment: 0001-blk-wbt-get-back-the-missed-wakeup-from-__wbt_done.patch]

From c410983b85d5624a9b4ff579240818dfaab2f5db Mon Sep 17 00:00:00 2001
From: Jianchao Wang
Date: Fri, 24 Aug 2018 02:10:21 +0800
Subject: [PATCH] blk-wbt: get back the missed wakeup from __wbt_done

2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in
wbt_wait) introduces two cases that can miss a wakeup:

- __wbt_done only wakes up one waiter at a time, even though there may
  be multiple waiters and (limit - inflight) > 1 at that moment.

- When a waiter is woken up, it is still on the wait queue and sets
  itself back to TASK_UNINTERRUPTIBLE immediately, so it can be woken
  up one more time. If another __wbt_done comes in and wakes it again,
  that wakeup is wasted on the previous waiter.

To fix this, introduce our own wake function, wbt_wake_function, in
__wbt_wait and use wake_up_all in __wbt_done. wbt_wake_function first
tries to get wbt budget; if it succeeds, it wakes up the process,
otherwise it returns -1 to stop the wake-up loop. To avoid reordering
on rqw->wait, __wbt_wait first checks whether there is a sleeper on the
queue before trying to get wbt budget; if there is, it goes to sleep
directly.

Because the wake-up operation holds rqw->wait.lock and can take a
relatively long time, extra lock contention is introduced when
__wbt_wait tries to add itself to rqw->wait. To avoid this, introduce
the wait queue rqw->delayed: try to lock rqw->wait.lock first and, if
that fails, add the waiter to rqw->delayed instead. __wbt_done picks up
the waiters on rqw->delayed and queues them on the tail of rqw->wait
before it does the wake-up.

Signed-off-by: Jianchao Wang
Fixes: 2887e41 (blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait)
Cc: Anchal Agarwal
Cc: Frank van der Linden
---
 block/blk-rq-qos.h |   2 +
 block/blk-wbt.c    | 130 ++++++++++++++++++++++++++++++++++++++++++-----------
 2 files changed, 105 insertions(+), 27 deletions(-)
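
As a rough, self-contained illustration of the budget-limited wake sweep
described above (userspace analogue only, not part of the patch; the names
inc_below() and wake_with_budget() are invented, with inc_below() standing in
for rq_wait_inc_below() and the loop standing in for the wake_up_all() sweep
that a negative return from the wake function cuts short):

/* demo of a budget-limited wake sweep; illustrative only, names invented */
#include <stdbool.h>
#include <stdio.h>

#define LIMIT 2         /* stand-in for get_limit() */

static int inflight;    /* stand-in for rqw->inflight */

/* Mirrors the rq_wait_inc_below() idea: take budget only below the limit. */
static bool inc_below(int limit)
{
        if (inflight >= limit)
                return false;
        inflight++;
        return true;
}

struct waiter {
        int id;
        bool woken;
};

/* Walk the waiters and wake each only while budget is available; stop the
 * sweep as soon as one waiter fails to get budget, leaving the rest un-woken,
 * analogous to the wake function returning -1 to __wake_up_common(). */
static void wake_with_budget(struct waiter *waiters, int nr)
{
        for (int i = 0; i < nr; i++) {
                if (!inc_below(LIMIT))
                        break;
                waiters[i].woken = true;
                printf("woke waiter %d\n", waiters[i].id);
        }
}

int main(void)
{
        struct waiter w[4] = { {1, false}, {2, false}, {3, false}, {4, false} };

        wake_with_budget(w, 4); /* only the first LIMIT waiters get woken */
        return 0;
}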

diff --git a/block/blk-rq-qos.h b/block/blk-rq-qos.h
index 32b02ef..a6111eb 100644
--- a/block/blk-rq-qos.h
+++ b/block/blk-rq-qos.h
@@ -14,6 +14,7 @@ enum rq_qos_id {
 
 struct rq_wait {
         wait_queue_head_t wait;
+        wait_queue_head_t delayed;
         atomic_t inflight;
 };
@@ -70,6 +71,7 @@ static inline void rq_wait_init(struct rq_wait *rq_wait)
 {
         atomic_set(&rq_wait->inflight, 0);
         init_waitqueue_head(&rq_wait->wait);
+        init_waitqueue_head(&rq_wait->delayed);
 }
 
 static inline void rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
diff --git a/block/blk-wbt.c b/block/blk-wbt.c
index c9358f1..7fa61fe 100644
--- a/block/blk-wbt.c
+++ b/block/blk-wbt.c
@@ -111,6 +111,23 @@ static inline struct rq_wait *get_rq_wait(struct rq_wb *rwb,
         return &rwb->rq_wait[WBT_RWQ_BG];
 }
 
+static void wbt_check_delayed(struct rq_wait *rqw)
+{
+        unsigned long flags;
+
+        if (!wq_has_sleeper(&rqw->delayed))
+                return;
+        /*
+         * Avoid acquiring wait->lock while holding delayed->lock, as that
+         * would introduce undesired contention on delayed->lock.
+         */
+        spin_lock_irqsave(&rqw->wait.lock, flags);
+        spin_lock(&rqw->delayed.lock);
+        list_splice_tail_init(&rqw->delayed.head, &rqw->wait.head);
+        spin_unlock(&rqw->delayed.lock);
+        spin_unlock_irqrestore(&rqw->wait.lock, flags);
+}
+
 static void rwb_wake_all(struct rq_wb *rwb)
 {
         int i;
@@ -118,8 +135,11 @@ static void rwb_wake_all(struct rq_wb *rwb)
         for (i = 0; i < WBT_NUM_RWQ; i++) {
                 struct rq_wait *rqw = &rwb->rq_wait[i];
 
-                if (wq_has_sleeper(&rqw->wait))
+                if (wq_has_sleeper(&rqw->wait) ||
+                    waitqueue_active(&rqw->delayed)) {
+                        wbt_check_delayed(rqw);
                         wake_up_all(&rqw->wait);
+                }
         }
 }
@@ -162,11 +182,14 @@ static void __wbt_done(struct rq_qos *rqos, enum wbt_flags wb_acct)
         if (inflight && inflight >= limit)
                 return;
 
-        if (wq_has_sleeper(&rqw->wait)) {
+        if (wq_has_sleeper(&rqw->wait) ||
+            waitqueue_active(&rqw->delayed)) {
                 int diff = limit - inflight;
 
-                if (!inflight || diff >= rwb->wb_background / 2)
-                        wake_up(&rqw->wait);
+                if (!inflight || diff >= rwb->wb_background / 2) {
+                        wbt_check_delayed(rqw);
+                        wake_up_all(&rqw->wait);
+                }
         }
 }
@@ -481,6 +504,64 @@ static inline unsigned int get_limit(struct rq_wb *rwb, unsigned long rw)
         return limit;
 }
 
+struct wbt_wait_data {
+        struct task_struct *curr;
+        struct rq_wb *rwb;
+        struct rq_wait *rqw;
+        unsigned long rw;
+};
+
+static int wbt_wake_function(wait_queue_entry_t *curr, unsigned int mode,
+                             int wake_flags, void *key)
+{
+        struct wbt_wait_data *data = curr->private;
+
+        /*
+         * If we fail to get budget, return -1 to interrupt the wake-up
+         * loop in __wake_up_common.
+         */
+        if (!rq_wait_inc_below(data->rqw, get_limit(data->rwb, data->rw)))
+                return -1;
+
+        wake_up_process(data->curr);
+
+        list_del_init(&curr->entry);
+        return 1;
+}
+
+static inline void wbt_init_wait(struct wait_queue_entry *wait,
+                                 struct wbt_wait_data *data)
+{
+        INIT_LIST_HEAD(&wait->entry);
+        wait->flags = 0;
+        wait->func = wbt_wake_function;
+        wait->private = data;
+}
+
+static void wbt_add_wait(struct rq_wait *rqw, struct wait_queue_entry *wq_entry)
+{
+        unsigned long flags;
+
+        wq_entry->flags |= WQ_FLAG_EXCLUSIVE;
+
+        if (spin_trylock_irqsave(&rqw->wait.lock, flags)) {
+                if (list_empty(&wq_entry->entry))
+                        __add_wait_queue_entry_tail(&rqw->wait, wq_entry);
+                set_current_state(TASK_UNINTERRUPTIBLE);
+                spin_unlock_irqrestore(&rqw->wait.lock, flags);
+        } else {
+                /*
+                 * __wbt_done will pick the delayed waiters up and add them
+                 * to the tail of the rqw->wait queue.
+                 */
+                spin_lock_irqsave(&rqw->delayed.lock, flags);
+                if (list_empty(&wq_entry->entry))
+                        __add_wait_queue_entry_tail(&rqw->delayed, wq_entry);
+                set_current_state(TASK_UNINTERRUPTIBLE);
+                spin_unlock_irqrestore(&rqw->delayed.lock, flags);
+        }
+}
+
 /*
  * Block if we will exceed our limit, or if we are currently waiting for
  * the timer to kick off queuing again.
@@ -491,31 +572,26 @@ static void __wbt_wait(struct rq_wb *rwb, enum wbt_flags wb_acct,
         __acquires(lock)
 {
         struct rq_wait *rqw = get_rq_wait(rwb, wb_acct);
-        DECLARE_WAITQUEUE(wait, current);
-        bool has_sleeper;
-
-        has_sleeper = wq_has_sleeper(&rqw->wait);
-        if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
+        struct wait_queue_entry wait;
+        struct wbt_wait_data data = {
+                .curr = current,
+                .rwb = rwb,
+                .rqw = rqw,
+                .rw = rw,
+        };
+
+        if (!wq_has_sleeper(&rqw->wait) &&
+            rq_wait_inc_below(rqw, get_limit(rwb, rw)))
                 return;
 
-        add_wait_queue_exclusive(&rqw->wait, &wait);
-        do {
-                set_current_state(TASK_UNINTERRUPTIBLE);
-
-                if (!has_sleeper && rq_wait_inc_below(rqw, get_limit(rwb, rw)))
-                        break;
-
-                if (lock) {
-                        spin_unlock_irq(lock);
-                        io_schedule();
-                        spin_lock_irq(lock);
-                } else
-                        io_schedule();
-                has_sleeper = false;
-        } while (1);
-
-        __set_current_state(TASK_RUNNING);
-        remove_wait_queue(&rqw->wait, &wait);
+        wbt_init_wait(&wait, &data);
+        wbt_add_wait(rqw, &wait);
+        if (lock) {
+                spin_unlock_irq(lock);
+                io_schedule();
+                spin_lock_irq(lock);
+        } else
+                io_schedule();
 }
 
 static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
-- 
2.7.4