Message-ID: <97a7eafc-c77f-6282-c914-7e66712ee47b@kernel.dk>
Date: Wed, 9 Nov 2022 20:25:53 -0700
Subject: Re: [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags
To: Gabriel Krisman Bertazi
Cc: linux-kernel@vger.kernel.org,
    linux-block@vger.kernel.org, Hugh Dickins, Keith Busch, Liu Song, Jan Kara
References: <20221105231055.25953-1-krisman@suse.de> <87wn83eod3.fsf@suse.de>
From: Jens Axboe
In-Reply-To: <87wn83eod3.fsf@suse.de>

On 11/9/22 3:48 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe writes:
>
>> On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
>>> Performance-wise, one should expect very similar performance to the
>>> original algorithm for the case where there is no queueing. In both the
>>> old algorithm and this implementation, the first thing is to check
>>> ws_active, which bails out if there is no queueing to be managed. In the
>>> new code, we took care to avoid accounting completions and wakeups when
>>> there is no queueing, to not pay the cost of atomic operations
>>> unnecessarily, since it doesn't skew the numbers.
>>>
>>> For more interesting cases, where there is queueing, we need to take
>>> into account the cross-communication of the atomic operations. I've
>>> been benchmarking by running parallel fio jobs against a single-hctx
>>> nullb in different hardware queue depth scenarios, and verifying both
>>> IOPS and queueing.
>>>
>>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>>> varying only the hardware queue length per test.
>>>
>>> queue size  2                4                8                16                32                64
>>> 6.1-rc2     1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
>>> patched     1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>>>
>>> The following is a similar experiment, run against a nullb with a single
>>> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
>>> parallel fio jobs operating on the same device.
>>>
>>> queue size  2                4                8                16                32                 64
>>> 6.1-rc2     1081.0K (2.3K)   957.2K (1.5K)    1699.1K (5.7K)   6178.2K (124.6K)  12227.9K (37.7K)   13286.6K (92.9K)
>>> patched     1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)   11893.6K (17.5K)   12385.6K (18.4K)
>>
>> What's the queue depth of these devices? That's the interesting question
>> here, as it'll tell us if any of these are actually hitting the slower
>> path where you made changes.
>
> Hi Jens,
>
> The hardware queue depth is a parameter being varied in this experiment.
> Each column of the tables has a different queue depth; its value is the
> first line (queue size) of both tables. For instance, looking at the
> first table, for a device with hardware queue depth=2, 6.1-rc2 gave
> 1681K IOPS and the patched version gave 1721.8K IOPS.
>
> As mentioned, I monitored the size of the sbitmap wqs during the
> benchmark execution to confirm it was indeed hitting the slow path and
> queueing. Indeed, I observed less queueing on higher QDs (16, 32) and
> even less for QD=64. For QD<=8, there was extensive queueing present
> throughout the execution.

Gotcha, this makes a lot of sense. I just misunderstood the queue size
numbers to be queue depth on the IO generation side.
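For anyone wanting to reproduce this kind of run, a minimal sketch of the
setup described above, assuming a null_blk device and io_uring as the fio
engine (block size, runtime, and engine are illustrative assumptions, not
taken from the thread):

  # single-hctx null_blk device with a small hardware queue depth
  modprobe null_blk submit_queues=1 hw_queue_depth=2

  # 20 parallel jobs issuing fixed-size random writes at iodepth 64
  fio --name=sbitmap-bench --filename=/dev/nullb0 --direct=1 \
      --ioengine=io_uring --rw=randwrite --bs=4k --iodepth=64 \
      --numjobs=20 --group_reporting --time_based --runtime=30

Sweeping hw_queue_depth over 2, 4, 8, 16, 32, and 64 would produce the
columns of the tables above.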
Any idea what is reducing performance at the higher end of the queue size
spectrum? Not that I think it's _really_ that interesting, just curious.

> I should provide the queue size over time alongside the latency numbers.
> I have to rerun the benchmarks anyway to collect the information
> Chaitanya requested.

Thanks.

>> I suspect you are for the second set of numbers, but not for the first
>> one?
>
> No. Both tables show some level of queueing. The shared bitmap in
> table 2 surely has way more intensive queueing, though.

Yep, got it now.

>> Anything that isn't hitting the wait path for tags isn't a very useful
>> test, as I would not expect any changes there.
>
> Even when there is little to no queueing (QD=64 in this data), we still
> enter sbitmap_queue_wake_up and bail out on the first line,
> !ws_active. This is why I think it is important to include QD=64
> here. It is less interesting data, as I mentioned, but it shows no
> regressions of the fast path.

Yes, the check for whether we need to hit the slow path should be fast. I
will try and run some numbers here too for the pure fast path, no
sleeping at all. For most fast storage this is where we'll be for the
majority of the time, as queue depth on NVMe is generally pretty deep.

FWIW, I did run my usual quick peak testing and didn't see any
regressions there. I'll do a quick per-core benchmark tomorrow; that's
more sensitive to cycle issues. But I like the change in general, and I
think your numbers look sane. Would be nice to turn sbitmap queueing into
something that's actually sane.

-- 
Jens Axboe
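A minimal sketch of the fast path discussed above, assuming the atomic
ws_active waiter count from lib/sbitmap.c (simplified for illustration;
not the exact kernel source):

  /*
   * Wakeup entry point: if no wait queues are active, return before
   * touching any of the completion/wakeup accounting, so the
   * no-queueing case pays only this single atomic read.
   */
  void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
  {
          if (!atomic_read(&sbq->ws_active))
                  return;         /* fast path: nobody is waiting */

          /* slow path: account completions and wake queued waiters */
          __sbitmap_queue_wake_up(sbq, nr);
  }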