Date: Sat, 11 Mar 2023 20:53:33 +0000
Subject: Re: [RFC 0/2] optimise local-tw task rescheduling
From: Pavel Begunkov
To: Jens Axboe, io-uring@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

On 3/11/23 20:45, Pavel Begunkov wrote:
> On 3/11/23 17:24, Jens Axboe wrote:
>> On 3/10/23 12:04 PM, Pavel Begunkov wrote:
>>> io_uring extensively uses task_work, but when a task is waiting
>>> for multiple CQEs it causes lots of rescheduling. This series
>>> is an attempt to optimise it and be a base for future improvements.
>>>
>>> For some zc network tests that eventually wait for a portion of
>>> buffers, I got a 10x decrease in the number of context switches,
>>> which reduced CPU consumption by more than half (17% -> 8%).
>>> It also helps storage cases: running fio/t/io_uring against a
>>> low-performance drive, it got a 2x decrease in the number of
>>> context switches at QD8 and ~4x at QD32.
>>>
>>> Not for inclusion yet, I want to add an optimisation for when
>>> waiting for 1 CQE.
>>
>> Ran this on the usual peak benchmark, using IRQ. IOPS is around ~70M
>> for that, and I see context switch rates of around 8.1-8.3M/sec with
>> the current kernel.
>>
>> Applied the two patches, but didn't see much of a change? Performance
>> is about the same, and the context switch rate likewise. Confused...
>> As you probably know, this test waits for 32 IOs at a time.
>
> If I had to guess, it already has perfect batching, in which case
> the patch does nothing. Maybe it's due to SSD coalescing +
> small ro I/O + the consistency and small latencies of Optanes,
> or it might be that the scheduling / kernel side is slow to react.

And if that's the case, I have to note that it's quite a sterile case;
the last time I asked, the usual batching we're currently getting
for networking cases is 1-2.

-- 
Pavel Begunkov
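For readers following the thread, the "wait for a batch of CQEs" pattern
discussed above looks roughly like the sketch below in userspace liburing
terms. It is illustrative only: the NOP requests, the queue depth of 32
(mirroring the benchmark's wait count), and the DEFER_TASKRUN/SINGLE_ISSUER
setup flags that enable local task_work are assumptions for the example,
not details taken from the actual test.

/* Illustrative sketch only: a waiter submits a batch of requests and
 * sleeps until all of them complete -- the scenario in which per-CQE
 * task_work wakeups cause the rescheduling discussed in the thread. */
#include <liburing.h>
#include <stdio.h>

#define QD 32

int main(void)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;
	int ret, i;

	/* DEFER_TASKRUN selects "local" task_work (the mode this series
	 * targets); it requires SINGLE_ISSUER and a recent kernel. */
	ret = io_uring_queue_init(QD, &ring,
				  IORING_SETUP_SINGLE_ISSUER |
				  IORING_SETUP_DEFER_TASKRUN);
	if (ret < 0)
		return 1;

	for (i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		io_uring_prep_nop(sqe);	/* stand-in for real reads/sends */
	}

	/* Submit and block until QD completions are available.  With good
	 * batching the task is woken once for the whole batch instead of
	 * being rescheduled for every completion. */
	ret = io_uring_submit_and_wait(&ring, QD);
	if (ret < 0)
		return 1;

	io_uring_for_each_cqe(&ring, head, cqe)
		seen++;
	io_uring_cq_advance(&ring, seen);

	printf("reaped %u CQEs\n", seen);
	io_uring_queue_exit(&ring);
	return 0;
}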