To: Olivier Langlois, Jens Axboe, io-uring@vger.kernel.org,
 linux-kernel@vger.kernel.org
References: <60c13bec.1c69fb81.73967.f06dSMTPIN_ADDED_MISSING@mx.google.com>
 <84e42313-d738-fb19-c398-08a4ed0e0d9c@gmail.com>
 <4b5644bff43e072a98a19d7a5ca36bb5e11497ec.camel@trillion01.com>
 <9938f22a0bb09f344fa5c9c5c1b91f0d12e7566f.camel@trillion01.com>
 <7d9a481b-ae8c-873e-5c61-ab0a57243905@gmail.com>
From: Pavel Begunkov
Subject: Re: [PATCH] io_uring: reduce latency by reissueing the operation
Date: Sun, 20 Jun 2021 21:55:46 +0100

On 6/18/21 11:45 PM, Olivier Langlois wrote:
> On Thu, 2021-06-17 at 19:10 +0100, Pavel Begunkov wrote:
>>>
>>> For the patch performance testing I did use the simplest config:
>>> single thread, 1 TCP connection, no sqpoll.
>>
>> Queue depth (QD) 1, right?
>
> Since my io_uring usage is wrapped into a library and some parameters
> are fixed and adjusted to be able to handle the largest use-case
> scenario, QD was set to 256 for that test.
>
> There are also a few accessory fds, such as 2 eventfds, that are
> polled to interrupt the framework event loop, but in practice they
> were silent during the whole testing period.
>
> No big batch submissions for sure, though. At most 2 sqes per
> submission: 1 to provide back the buffer and the other to reinsert
> the read operation.

I see.
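To make that 2-sqe pattern concrete, a rough liburing sketch of such a
submission follows. Illustrative only: BGID, BUF_SZ, the function name
and the variables are invented here, not taken from Olivier's library,
and the initial registration of the buffer group plus all error
handling are omitted.

#include <liburing.h>

#define BGID	0	/* buffer group id, assumed registered earlier */
#define BUF_SZ	20480	/* the 20KB read size mentioned below */

/* Recycle the consumed provided buffer and reinsert the read. */
static void recycle_and_rearm(struct io_uring *ring, int sockfd,
			      void *buf, int bid)
{
	struct io_uring_sqe *sqe;

	/* sqe 1: hand the consumed buffer back to the kernel's pool */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_provide_buffers(sqe, buf, BUF_SZ, 1, BGID, bid);

	/* sqe 2: reinsert the read; with IOSQE_BUFFER_SELECT the
	 * kernel picks a buffer from group BGID when data arrives,
	 * so no buffer is passed in the sqe itself */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, sockfd, NULL, BUF_SZ, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;

	io_uring_submit(ring);
}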
[...]

>> 2) Do they return an equivalent number of bytes? And what is the
>> read/recv size (e.g. buffer size)?
>
> Nothing escapes your eagle-eyed attention, Pavel...

It may all sound over-scrutinising, but I'm just used to analysing
performance magic and seeing how the edges may be polished. Not a
requirement for the patch.

> I set my read size to an arbitrarily big size (20KB) just to be sure
> that, most of the time, I never end up with partial reads and perform
> more syscalls than I could get away with given a big enough buffer
> size.
>
> TBH, I didn't pay that much attention to this detail. Off the top of
> my head, I would say that the average size is all over the place. It
> can go from 150 bytes up to 15KB, but I would think that the average
> must be between 1-2 MTUs (around 2500 bytes).
>
> That being said, the average read size must spread equally between
> the packets going through the regular path and those taking the new
> shortcut, so I believe that the conclusion should still hold despite
> not having considered this aspect in the test.

>> Because in theory it can be that during the somewhat small delay
>> for punting to io-wq, more data has arrived, and so the async
>> completion pulls more data, which takes more time. In that case the
>> time difference should also account for the difference in the
>> amount of data that it reads.
>
> Good point. It did not even occur to me to consider this aspect, but
> how many more packets would the network stack have had time to
> receive in an extra 16us period? (I am not on one of those crazy
> fiber-optic 200Gbps Mellanox cards...) 1, 2, 3, 4? We aren't talking
> multiple extra MBs to copy here...

Say 1-2. Need to check, but I think that while processing them and
copying to userspace another one might arrive, and so you get the full
20KB instead of the 4KB that would have been copied inline. Plus io-wq
overhead, so 16us wouldn't be unreasonable. But that's "what if"
thinking.

>> 3) Curious, why read but not recv, as you're working with sockets?
>
> I learned network programming from the classic Stevens book. As far
> as I remember from the book, the only benefit of recv() over read()
> is if you need to specify one of the funky flags that recv() allows
> you to provide; read() doesn't give access to that functionality.
>
> If there is a performance benefit to using recv() over read() for
> TCP fds, that is something I am not aware of, and if you confirm that
> it is the case, it would be very easy for me to change my read calls
> to recv() ones...
>
> Now that you ask the question, maybe read() is implemented with
> recv(),

It all sinks into the common code rather early.

> but AFAIK the native network functions are sendmsg and recvmsg, so
> neither read() nor recv() would have an edge over the other in that
> department, AFAIK...

On the io_uring side, recv is slimmer than recvmsg, e.g. it doesn't
need to copy extra state. read can be more expensive there because it
may copy/alloc extra stuff, plus there is additional logic in
io_read() for the sake of generality. But I don't expect much of a
difference; never tested it, though.
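Against the sketch earlier in this mail, the switch would be a
hypothetical one-liner; the last argument is the file offset for
read and the flags for recv:

-	io_uring_prep_read(sqe, sockfd, NULL, BUF_SZ, 0);
+	io_uring_prep_recv(sqe, sockfd, NULL, BUF_SZ, 0);

IOSQE_BUFFER_SELECT works with both opcodes, so the provided-buffers
handling would stay as is.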
> While we are talking about read() vs recv(), I am curious about
> something too. While working on my other patch (store back the
> buffer in case of failure), I noticed that the buffer address and
> bid aren't stored in the same fields:
>
> io_put_recv_kbuf() vs io_put_rw_kbuf()
>
> I didn't figure out why those values aren't stored in the same
> io_kiocb fields for recv operations...
>
> Why is that?

Just because of how it was done. It could use some cleaning up, e.g. I
don't like rewriting req->rw.addr with a selected buffer.

In general, the first 64B (cacheline) of io_kiocb (i.e. the request)
is taken by per-opcode data, and we try to fit everything related to a
particular opcode there and not spill into the generic parts of the
struct. Another general concern is keeping everything tight enough and
shuffled right, so that the hot path doesn't read extra cachelines.

>> 4) Did you do any userspace measurements? And a question to
>> everyone in general: do we have any good net benchmarking tool
>> that works with io_uring? Like netperf? Hopefully spitting
>> out a latency distribution.
>
> No, I haven't.

With what was said, I'd expect about the same mean and the elevated
~99th percentile reduced by the patch, which is also great. Latency is
always the hardest part.

>> Also, not particularly about the reissue stuff, but a note to
>> myself: 59us is a lot, so I wonder where the overhead comes from.
>> Definitely not the io-wq queueing (i.e. putting into a list).
>> - waking a worker?
>> - creating a new worker? Do we manage workers sanely? e.g.
>>   don't keep them constantly recreated and dying back.
>> - scheduling a worker?
>
> Creating a new worker is for sure not free, but I would remove that
> cause from the suspect list since, in my scenario, it was a one-shot
> event.

Not sure what you mean, but speculating: io-wq may have a suboptimal
policy for recycling worker threads, leading to more recreating and
removing of them than needed. Depends on bugs, use cases, and so on.

> The first measurement was not even significantly higher than all the
> other measurements.

You get a huge max for the io-wq case. Obviously nothing can be
concluded from the max alone. We'd need a latency distribution and
probably longer runs, but I'm still curious where it's coming from.
Just keeping an eye on it in general.

>> Olivier, for how long did you run the test? >1 min?
>
> Much more than 1 minute. I would say something between 20-25 minutes.
>
> I wanted a big enough sample size for those 2.5% special-path events
> so that the conclusion could be statistically significant.

Great. If io-wq worker creation doesn't work right, then it's because
of policies and so on.

-- 
Pavel Begunkov