Subject: Re: [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags
To: Gabriel Krisman Bertazi, axboe@kernel.dk
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org, Hugh Dickins, Keith Busch, Liu Song, Jan Kara, "yukuai (C)"
References: <20221105231055.25953-1-krisman@suse.de>
From: Yu Kuai
Message-ID: <2a445c5c-fd15-c0bf-8655-2fb5bde3fe67@huaweicloud.com>
Date: Thu, 10 Nov 2022 17:42:49 +0800

Hi,

On 2022/11/06 7:10, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake ups on nested I/O completion. The latter happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero. This two-step process can eventually miss an update when a
> nested completion happens to interrupt the CPU in between the wait_cnt
> updates. This is very hard to fix, as shown by the recent changes to
> this code.
>
> The code complexity arises mostly from the corner cases to avoid missed
> wakes in this scenario. In addition, the handling of wake_batch
> recalculation plus the synchronization with sbitmap_queue_wake_up is
> non-trivial.
>
> This patchset implements the idea originally proposed by Jan [1], which
> removes the need for the two-step updates of wait_cnt. This is done by
> tracking the number of completions and wakeups in always-increasing,
> per-bitmap counters. Instead of having to reset the wait_cnt when it
> reaches zero, we simply keep counting, and attempt to wake up N threads
> in a single wait queue whenever there is enough space for a batch.
> Waking up fewer than wake_batch threads shouldn't be a problem, because
> we haven't changed the conditions for wake up, and the existing batch
> calculation guarantees at least enough remaining completions to wake up
> a batch for each queue at any time.
>
> Performance-wise, one should expect very similar performance to the
> original algorithm for the case where there is no queueing. In both the
> old algorithm and this implementation, the first thing is to check
> ws_active, which bails out if there is no queueing to be managed.
> In the new code, we took care to avoid accounting completions and wakeups
> when there is no queueing, so as not to pay the cost of atomic operations
> unnecessarily, since skipping them doesn't skew the numbers.
>
> For more interesting cases, where there is queueing, we need to take
> into account the cross-communication of the atomic operations. I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
>
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
>
> queue size                2                4                8               16               32               64
> 6.1-rc2      1681.1K (1.6K)  2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
> patched     1721.8K (15.1K)   3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>
> The following is a similar experiment, run against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
> parallel fio jobs operating on the same device.
>
> queue size                2                4                8               16               32               64
> 6.1-rc2      1081.0K (2.3K)    957.2K (1.5K)   1699.1K (5.7K)  6178.2K (124.6K)  12227.9K (37.7K)  13286.6K (92.9K)
> patched      1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)  11893.6K (17.5K)  12385.6K (18.4K)
>
> It has also survived blktests and a 12h stress run against nullb. I also
> ran the code against NVMe and a SCSI SSD, and I didn't observe
> performance regression in those. If there are other tests you think I
> should run, please let me know and I will follow up with results.
>
> [1] https://lore.kernel.org/all/aef9de29-e9f5-259a-f8be-12d1b734e72@google.com/
>
> Cc: Hugh Dickins
> Cc: Keith Busch
> Cc: Liu Song
> Suggested-by: Jan Kara
> Signed-off-by: Gabriel Krisman Bertazi
> ---
>  include/linux/sbitmap.h |  16 ++++--
>  lib/sbitmap.c           | 122 +++++++++-------------------------------
>  2 files changed, 37 insertions(+), 101 deletions(-)
>
> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
> index 4d2d5205ab58..d662cf136021 100644
> --- a/include/linux/sbitmap.h
> +++ b/include/linux/sbitmap.h
> @@ -86,11 +86,6 @@ struct sbitmap {
>   * struct sbq_wait_state - Wait queue in a &struct sbitmap_queue.
>   */
>  struct sbq_wait_state {
> -	/**
> -	 * @wait_cnt: Number of frees remaining before we wake up.
> -	 */
> -	atomic_t wait_cnt;
> -
>  	/**
>  	 * @wait: Wait queue.
>  	 */
> @@ -138,6 +133,17 @@ struct sbitmap_queue {
>  	 * sbitmap_queue_get_shallow()
>  	 */
>  	unsigned int min_shallow_depth;
> +
> +	/**
> +	 * @completion_cnt: Number of bits cleared passed to the
> +	 * wakeup function.
> +	 */
> +	atomic_t completion_cnt;
> +
> +	/**
> +	 * @wakeup_cnt: Number of thread wake ups issued.
> +	 */
> +	atomic_t wakeup_cnt;
>  };
>  
>  /**
> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index 7280ae8ca88c..eca462cba398 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -434,6 +434,8 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
>  	sbq->wake_batch = sbq_calc_wake_batch(sbq, depth);
>  	atomic_set(&sbq->wake_index, 0);
>  	atomic_set(&sbq->ws_active, 0);
> +	atomic_set(&sbq->completion_cnt, 0);
> +	atomic_set(&sbq->wakeup_cnt, 0);
>  
>  	sbq->ws = kzalloc_node(SBQ_WAIT_QUEUES * sizeof(*sbq->ws), flags, node);
>  	if (!sbq->ws) {
> @@ -441,40 +443,21 @@ int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
>  		return -ENOMEM;
>  	}
>  
> -	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
> +	for (i = 0; i < SBQ_WAIT_QUEUES; i++)
>  		init_waitqueue_head(&sbq->ws[i].wait);
> -		atomic_set(&sbq->ws[i].wait_cnt, sbq->wake_batch);
> -	}
>  
>  	return 0;
>  }
>  EXPORT_SYMBOL_GPL(sbitmap_queue_init_node);
>  
> -static inline void __sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
> -					    unsigned int wake_batch)
> -{
> -	int i;
> -
> -	if (sbq->wake_batch != wake_batch) {
> -		WRITE_ONCE(sbq->wake_batch, wake_batch);
> -		/*
> -		 * Pairs with the memory barrier in sbitmap_queue_wake_up()
> -		 * to ensure that the batch size is updated before the wait
> -		 * counts.
> -		 */
> -		smp_mb();
> -		for (i = 0; i < SBQ_WAIT_QUEUES; i++)
> -			atomic_set(&sbq->ws[i].wait_cnt, 1);
> -	}
> -}
> -
>  static void sbitmap_queue_update_wake_batch(struct sbitmap_queue *sbq,
>  					    unsigned int depth)
>  {
>  	unsigned int wake_batch;
>  
>  	wake_batch = sbq_calc_wake_batch(sbq, depth);
> -	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
> +	if (sbq->wake_batch != wake_batch)
> +		WRITE_ONCE(sbq->wake_batch, wake_batch);
>  }
>  
>  void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
> @@ -488,7 +471,8 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
>  
>  	wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
>  			min_batch, SBQ_WAKE_BATCH);
> -	__sbitmap_queue_update_wake_batch(sbq, wake_batch);
> +
> +	WRITE_ONCE(sbq->wake_batch, wake_batch);
>  }
>  EXPORT_SYMBOL_GPL(sbitmap_queue_recalculate_wake_batch);
>  
> @@ -587,7 +571,7 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>  	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>  		struct sbq_wait_state *ws = &sbq->ws[wake_index];
>  
> -		if (waitqueue_active(&ws->wait) && atomic_read(&ws->wait_cnt)) {
> +		if (waitqueue_active(&ws->wait)) {
>  			if (wake_index != atomic_read(&sbq->wake_index))
>  				atomic_set(&sbq->wake_index, wake_index);
>  			return ws;
> @@ -599,83 +583,31 @@ static struct sbq_wait_state *sbq_wake_ptr(struct sbitmap_queue *sbq)
>  	return NULL;
>  }
>  
> -static bool __sbq_wake_up(struct sbitmap_queue *sbq, int *nr)
> +void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
>  {
> -	struct sbq_wait_state *ws;
> -	unsigned int wake_batch;
> -	int wait_cnt, cur, sub;
> -	bool ret;
> +	unsigned int wake_batch = READ_ONCE(sbq->wake_batch);
> +	struct sbq_wait_state *ws = NULL;
> +	unsigned int wakeups;
>  
> -	if (*nr <= 0)
> -		return false;
> +	if (!atomic_read(&sbq->ws_active))
> +		return;
>  
> -	ws = sbq_wake_ptr(sbq);
> -	if (!ws)
> -		return false;
> +	atomic_add(nr, &sbq->completion_cnt);
> +	wakeups = atomic_read(&sbq->wakeup_cnt);
>  
> -	cur = atomic_read(&ws->wait_cnt);
>  	do {
> -		/*
> -		 * For concurrent callers of this, callers should call this
> -		 * function again to wakeup a new batch on a different 'ws'.
> -		 */
> -		if (cur == 0)
> -			return true;
> -		sub = min(*nr, cur);
> -		wait_cnt = cur - sub;
> -	} while (!atomic_try_cmpxchg(&ws->wait_cnt, &cur, wait_cnt));
> -
> -	/*
> -	 * If we decremented queue without waiters, retry to avoid lost
> -	 * wakeups.
> -	 */
> -	if (wait_cnt > 0)
> -		return !waitqueue_active(&ws->wait);
> +		if (atomic_read(&sbq->completion_cnt) - wakeups < wake_batch)
> +			return;

Should we consider the case where completion_cnt overflows and becomes
negative? (A small wrap-around sketch follows the quoted patch below.)

Thanks,
Kuai

>
> -	*nr -= sub;
> -
> -	/*
> -	 * When wait_cnt == 0, we have to be particularly careful as we are
> -	 * responsible to reset wait_cnt regardless whether we've actually
> -	 * woken up anybody. But in case we didn't wakeup anybody, we still
> -	 * need to retry.
> -	 */
> -	ret = !waitqueue_active(&ws->wait);
> -	wake_batch = READ_ONCE(sbq->wake_batch);
> +		if (!ws) {
> +			ws = sbq_wake_ptr(sbq);
> +			if (!ws)
> +				return;
> +		}
> +	} while (!atomic_try_cmpxchg(&sbq->wakeup_cnt,
> +				&wakeups, wakeups + wake_batch));
>
> -	/*
> -	 * Wake up first in case that concurrent callers decrease wait_cnt
> -	 * while waitqueue is empty.
> -	 */
>  	wake_up_nr(&ws->wait, wake_batch);
> -
> -	/*
> -	 * Pairs with the memory barrier in sbitmap_queue_resize() to
> -	 * ensure that we see the batch size update before the wait
> -	 * count is reset.
> -	 *
> -	 * Also pairs with the implicit barrier between decrementing wait_cnt
> -	 * and checking for waitqueue_active() to make sure waitqueue_active()
> -	 * sees result of the wakeup if atomic_dec_return() has seen the result
> -	 * of atomic_set().
> -	 */
> -	smp_mb__before_atomic();
> -
> -	/*
> -	 * Increase wake_index before updating wait_cnt, otherwise concurrent
> -	 * callers can see valid wait_cnt in old waitqueue, which can cause
> -	 * invalid wakeup on the old waitqueue.
> -	 */
> -	sbq_index_atomic_inc(&sbq->wake_index);
> -	atomic_set(&ws->wait_cnt, wake_batch);
> -
> -	return ret || *nr;
> -}
> -
> -void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
> -{
> -	while (__sbq_wake_up(sbq, &nr))
> -		;
>  }
>  EXPORT_SYMBOL_GPL(sbitmap_queue_wake_up);
>  
> @@ -792,9 +724,7 @@ void sbitmap_queue_show(struct sbitmap_queue *sbq, struct seq_file *m)
>  	seq_puts(m, "ws={\n");
>  	for (i = 0; i < SBQ_WAIT_QUEUES; i++) {
>  		struct sbq_wait_state *ws = &sbq->ws[i];
> -
> -		seq_printf(m, "\t{.wait_cnt=%d, .wait=%s},\n",
> -			   atomic_read(&ws->wait_cnt),
> +		seq_printf(m, "\t{.wait=%s},\n",
>  			   waitqueue_active(&ws->wait) ? "active" : "inactive");
>  	}
>  	seq_puts(m, "}\n");
>
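To make the wrap-around question above a bit more concrete, here is a minimal
user-space sketch (my own illustration, not code from this patch) of how a
"completions minus wakeups" comparison behaves once a counter wraps, assuming
both operands of the subtraction end up as unsigned 32-bit values:

#include <stdio.h>
#include <limits.h>

/*
 * Illustration only: two monotonically increasing 32-bit counters,
 * compared the way the new check compares completion_cnt and wakeup_cnt.
 */
int main(void)
{
	unsigned int wakeups = UINT_MAX - 2;	/* just before wrap-around */
	unsigned int completions = wakeups;
	unsigned int wake_batch = 8;

	completions += 6;	/* wraps past UINT_MAX; value is now 3 */

	/*
	 * Unsigned modular arithmetic still reports a distance of 6, so a
	 * "fewer than wake_batch pending" check keeps working across the wrap.
	 */
	printf("pending = %u, wake a batch? %s\n",
	       completions - wakeups,
	       (completions - wakeups >= wake_batch) ? "yes" : "no");
	return 0;
}

This only illustrates the arithmetic such a check would rely on; whether the
signed value returned by atomic_read() is in fact treated as unsigned in every
such comparison in the patch is the part I am asking about.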