2019-11-22 11:22:35

by David Laight

Subject: epoll_wait() performance

I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
one message every 20ms on the others (RTP).
However there can be more than one message on each socket, and they all need to be read.
Since the code processing the data runs every 10ms, the message receiving code
also runs every 10ms (a massive gain when using poll()).

While using recvmmsg() to read multiple messages might seem a good idea, it is much
slower than recv() when there is only one message (even recvmsg() is a lot slower).
(I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
and faffing with the user iov[].)

So using poll() we repoll the fd after calling recv() to find if there is a second message.
However the second poll has a significant performance cost (but less than using recvmmsg()).

If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
indicate that there is more data.
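The behaviour described above can be sketched in a few lines (a Linux-only Python sketch using the stdlib `select.epoll` wrapper; the loopback sockets and payloads are illustrative, not from the original setup):

```python
import select
import socket
import time

# Two UDP sockets on loopback; queue two datagrams on the receiver.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"one", rx.getsockname())
tx.sendto(b"two", rx.getsockname())
time.sleep(0.05)  # let loopback delivery complete (test scaffolding)

ep = select.epoll()
ep.register(rx.fileno(), select.EPOLLIN)  # level-triggered by default

assert ep.poll(0)          # ready: two datagrams queued
rx.recv(2048)              # consume the first
assert ep.poll(0)          # level-triggered: still ready, a second message waits
rx.recv(2048)
assert ep.poll(0) == []    # queue drained, fd no longer reported
```

In level-triggered mode the fd stays on the ready list as long as data is queued, so the zero-timeout re-poll after each recv() is what detects the occasional second message.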

For poll() it doesn't make much difference how many fd are supplied to each system call.
The overall performance is much the same for 32, 64 or 500 (all the sockets).

For epoll_wait() that isn't true.
Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
With a buffer long enough for all the events epoll() is somewhat faster than poll().
But with a 64 entry buffer it is much slower.
I've looked at the code and can't see why splicing the unread events back is expensive.

I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
This would mean I'd have to run it in edge mode and each thread reading a smallish
block of events.
Any suggestions on how to efficiently read the 'unusual' additional messages from
the sockets?

FWIW the fastest way to read 1 RTP message every 20ms is to do non-blocking recv() every 10ms.
The failing recv() is actually faster than either epoll() or two poll() actions.
(Although something is needed to pick up the occasional second message.)
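The "failing recv()" pattern is simply a non-blocking read per tick, where EAGAIN means "no packet yet" (a minimal Python sketch; the socket setup is illustrative):

```python
import select
import socket

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setblocking(False)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"rtp-payload", rx.getsockname())
select.select([rx], [], [], 1)  # wait for loopback delivery (test scaffolding)

def recv_once(sock):
    """One non-blocking recv() per 10ms tick."""
    try:
        return sock.recv(2048)
    except BlockingIOError:   # EAGAIN: no packet this tick
        return None

assert recv_once(rx) == b"rtp-payload"
assert recv_once(rx) is None   # the cheap failing recv()
```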

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2019-11-27 09:52:26

by Marek Majkowski

Subject: Re: epoll_wait() performance

On Fri, Nov 22, 2019 at 12:18 PM David Laight <[email protected]> wrote:
> I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
> The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
> one message every 20ms on the others (RTP).
> However there can be more than one message on each socket, and they all need to be read.
> Since the code processing the data runs every 10ms, the message receiving code
> also runs every 10ms (a massive gain when using poll()).

How many sockets are we talking about? More like 500 or 500k? We had a very
bad experience with connected UDP sockets: the RX path is super slow,
mostly consumed by udp_lib_lookup()
https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/udp.c#L445

Then one might argue that opening thousands of unconnected UDP sockets - like
192.0.2.1:1234, 192.0.2.2:1234, etc - creates little value. I guess the only
reasonable case for a large number of UDP sockets is when you need a
large number of source ports.

In that case we experimented with abusing TPROXY:
https://web.archive.org/web/20191115081000/https://blog.cloudflare.com/how-we-built-spectrum/

> While using recvmmsg() to read multiple messages might seem a good idea, it is much
> slower than recv() when there is only one message (even recvmsg() is a lot slower).
> (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> and faffing with the user iov[].)
>
> So using poll() we repoll the fd after calling recv() to find is there is a second message.
> However the second poll has a significant performance cost (but less than using recvmmsg()).

That sounds wrong. Single recvmmsg(), even when receiving only a
single message, should be faster than two syscalls - recv() and
poll().

> If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
> indicate that there is more data.
>
> For poll() it doesn't make much difference how many fd are supplied to each system call.
> The overall performance is much the same for 32, 64 or 500 (all the sockets).
>
> For epoll_wait() that isn't true.
> Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
> With a buffer long enough for all the events epoll() is somewhat faster than poll().
> But with a 64 entry buffer it is much slower.
> I've looked at the code and can't see why splicing the unread events back is expensive.

Again, this is surprising.

> I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
> This would mean I'd have to run it in edge mode and each thread reading a smallish
> block of events.
> Any suggestions on how to efficiently read the 'unusual' additional messages from
> the sockets?

Random ideas:
1. Perhaps reducing the number of sockets could help - with iptables or TPROXY.
TPROXY has some performance impact though, so be careful.

2. I played with io_submit for syscall batching, but in my experiments I wasn't
able to show performance boost:
https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
Perhaps the newer io_uring with networking support could help:
https://twitter.com/axboe/status/1195047335182524416

3. SO_BUSY_POLL drastically reduces latency, but I've only used it with
a single socket.

4. If you want to get the number of outstanding packets, there are SIOCINQ
and SO_MEMINFO.
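Option 4 costs one extra ioctl() per query; on Linux, SIOCINQ is the same value as FIONREAD and, for a UDP socket, reports the payload size of the next pending datagram (a Python sketch; the loopback sockets are illustrative):

```python
import fcntl
import select
import socket
import struct
import termios

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"hello", rx.getsockname())
select.select([rx], [], [], 1)  # wait for loopback delivery (test scaffolding)

# SIOCINQ == FIONREAD on Linux: size of the next queued datagram.
raw = fcntl.ioctl(rx.fileno(), termios.FIONREAD, struct.pack("i", 0))
assert struct.unpack("i", raw)[0] == 5   # next datagram is 5 bytes
```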

My older writeups:
https://blog.cloudflare.com/how-to-receive-a-million-packets/
https://blog.cloudflare.com/how-to-achieve-low-latency/

Cheers,
Marek

> FWIW the fastest way to read 1 RTP message every 20ms is to do non-blocking recv() every 10ms.
> The failing recv() is actually faster than either epoll() or two poll() actions.
> (Although something is needed to pick up the occasional second message.)
>
> David
>

2019-11-27 10:41:45

by David Laight

Subject: RE: epoll_wait() performance

From: Marek Majkowski
> Sent: 27 November 2019 09:51
> On Fri, Nov 22, 2019 at 12:18 PM David Laight <[email protected]> wrote:
> > I'm trying to optimise some code that reads UDP messages (RTP and RTCP) from a lot of sockets.
> > The 'normal' data pattern is that there is no data on half the sockets (RTCP) and
> > one message every 20ms on the others (RTP).
> > However there can be more than one message on each socket, and they all need to be read.
> > Since the code processing the data runs every 10ms, the message receiving code
> > also runs every 10ms (a massive gain when using poll()).
>
> How many sockets we are talking about? More like 500 or 500k? We had very
> bad experience with UDP connected sockets, so if you are using UDP connected
> sockets, the RX path is super slow, mostly consumed by udp_lib_lookup()
> https://elixir.bootlin.com/linux/v5.4/source/net/ipv4/udp.c#L445

For my tests I have 900, but that is nothing like the limit for the application.
The test system is over 50% idle and running at its minimal clock speed.
The sockets are all unconnected; I believe the remote application is allowed
to change the source IP mid-flow!

...
> > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > and faffing with the user iov[].)
> >
> > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > However the second poll has a significant performance cost (but less than using recvmmsg()).
>
> That sounds wrong. Single recvmmsg(), even when receiving only a
> single message, should be faster than two syscalls - recv() and
> poll().

My suspicion is the extra two copy_from_user() needed for each recvmsg are a
significant overhead, most likely due to the crappy code that tries to stop
the kernel buffer being overrun.
I need to run the tests on a system with a 'home built' kernel to see how much
difference this makes (by seeing how much slower duplicating the copy makes it).

The system call cost of poll() gets factored over a reasonable number of sockets.
So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
even allowing for looking up the fd.

This could be fixed by an extra flag to recvmmsg() to indicate that you only really
expect one message and to call the poll() function before each subsequent receive.

There is also the 'reschedule' that Eric added to the loop in recvmmsg.
I don't know how much that actually costs.
In this case the process is likely to be running at an RT priority and pinned to a cpu.
In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.

We really do want to receive all these UDP packets in a timely manner.
Although very low latency isn't itself an issue.
The data is telephony audio with (typically) one packet every 20ms.
The code only looks for packets every 10ms - that helps no end since, in principle,
only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.

> > If we use epoll() in level triggered mode a second epoll_wait() call (after the recv()) will
> > indicate that there is more data.
> >
> > For poll() it doesn't make much difference how many fd are supplied to each system call.
> > The overall performance is much the same for 32, 64 or 500 (all the sockets).
> >
> > For epoll_wait() that isn't true.
> > Supplying a buffer that is shorter than the list of 'ready' fds gives a massive penalty.
> > With a buffer long enough for all the events epoll() is somewhat faster than poll().
> > But with a 64 entry buffer it is much slower.
> > I've looked at the code and can't see why splicing the unread events back is expensive.
>
> Again, this is surprising.

Yep, but repeatedly measurable.
If no one else has seen this I'll have to try to instrument it in the kernel somehow.
I'm pretty sure it isn't a userspace issue.

> > I'd like to be able to change the code so that multiple threads are reading from the epoll fd.
> > This would mean I'd have to run it in edge mode and each thread reading a smallish
> > block of events.
> > Any suggestions on how to efficiently read the 'unusual' additional messages from
> > the sockets?
>
> Random ideas:
> 1. Perhaps reducing the number of sockets could help - with iptables or TPROXY.
> TPROXY has some performance impact though, so be careful.

We'd then have to use recvmsg() - which is measurably slower than recv().

> 2. I played with io_submit for syscall batching, but in my experiments I wasn't
> able to show performance boost:
> https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/
> Perhaps the newer io_uring with networking support could help:
> https://twitter.com/axboe/status/1195047335182524416

You need an OS that actually does async IO - like RSX-11M or Windows.
Just deferring the request to a kernel thread can mean you get stuck
behind other processes blocking reads.

> 3. SO_BUSYPOLL drastically reduces latency, but I've only used it with
> a single socket..

We need to minimise the cpu cost more than the absolute latency.

> 4. If you want to get number of outstanding packets, there is SIOCINQ
> and SO_MEMINFO.

That's another system call.
poll() can tell us whether there is any data on a lot of sockets more quickly.

David


2019-11-27 15:50:15

by Jesper Dangaard Brouer

Subject: Re: epoll_wait() performance


On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <[email protected]> wrote:

> ...
> > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > and faffing with the user iov[].)
> > >
> > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> >
> > That sounds wrong. Single recvmmsg(), even when receiving only a
> > single message, should be faster than two syscalls - recv() and
> > poll().
>
> My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> significant overhead, most likely due to the crappy code that tries to stop
> the kernel buffer being overrun.
>
> I need to run the tests on a system with a 'home built' kernel to see how much
> difference this make (by seeing how much slower duplicating the copy makes it).
>
> The system call cost of poll() gets factored over a reasonable number of sockets.
> So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> even allowing for looking up the fd.
>
> This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> expect one message and to call the poll() function before each subsequent receive.
>
> There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> I don't know how much that actually costs.
> In this case the process is likely to be running at a RT priority and pinned to a cpu.
> In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
>
> We really do want to receive all these UDP packets in a timely manner.
> Although very low latency isn't itself an issue.
> The data is telephony audio with (typically) one packet every 20ms.
> The code only looks for packets every 10ms - that helps no end since, in principle,
> only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.

I have a simple udp_sink tool[1] that cycles through the different
receive socket system calls. I gave it a quick spin on an F31 kernel
5.3.12-300.fc31.x86_64 on an mlx5 100G interface, and I'm very surprised
to see a significant regression/slowdown for recvMmsg.

$ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
run count ns/pkt pps cycles payload
recvMmsg/32 run: 0 10000000 1461.41 684270.96 5261 18 demux:1
recvmsg run: 0 10000000 889.82 1123824.84 3203 18 demux:1
read run: 0 10000000 974.81 1025841.68 3509 18 demux:1
recvfrom run: 0 10000000 1056.51 946513.44 3803 18 demux:1

Normal recvmsg has almost double the performance of recvmmsg.
recvMmsg/32 = 684,270 pps
recvmsg = 1,123,824 pps

[1] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

For a connected UDP socket:

$ sudo ./udp_sink --port 9 --repeat 1 --connect
run count ns/pkt pps cycles payload
recvMmsg/32 run: 0 1000000 1240.06 806411.73 4464 18 demux:1 c:1
recvmsg run: 0 1000000 768.80 1300724.75 2767 18 demux:1 c:1
read run: 0 1000000 823.40 1214478.40 2964 18 demux:1 c:1
recvfrom run: 0 1000000 889.19 1124616.11 3201 18 demux:1 c:1


Found some old results (approx v4.10-rc1):

[brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
recvMmsg/32 run: 0 10000000 537.89 1859106.74 2155 21559353816
recvmsg run: 0 10000000 552.69 1809344.44 2215 22152468673
read run: 0 10000000 476.65 2097970.76 1910 19104864199
recvfrom run: 0 10000000 450.76 2218492.60 1806 18066972794


2019-11-27 16:07:05

by David Laight

Subject: RE: epoll_wait() performance

From: Jesper Dangaard Brouer
> Sent: 27 November 2019 15:48
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <[email protected]> wrote:
>
> > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > >
> > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > >
> > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().
> >
> > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> >
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this make (by seeing how much slower duplicating the copy makes it).
> >
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > even allowing for looking up the fd.
> >
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> >
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> >
> > We really do want to receive all these UDP packets in a timely manner.
> > Although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
>
> I have a simple udp_sink tool[1] that cycle through the different
> receive socket system calls. I gave it a quick spin on a F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
>
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> run count ns/pkt pps cycles payload
> recvMmsg/32 run: 0 10000000 1461.41 684270.96 5261 18 demux:1
> recvmsg run: 0 10000000 889.82 1123824.84 3203 18 demux:1
> read run: 0 10000000 974.81 1025841.68 3509 18 demux:1
> recvfrom run: 0 10000000 1056.51 946513.44 3803 18 demux:1
>
> Normal recvmsg almost have double performance that recvmmsg.
> recvMmsg/32 = 684,270 pps
> recvmsg = 1,123,824 pps

Can you test recv() as well?
I think it might be faster than read().

...
> Found some old results (approx v4.10-rc1):
>
> [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> recvMmsg/32 run: 0 10000000 537.89 1859106.74 2155 21559353816
> recvmsg run: 0 10000000 552.69 1809344.44 2215 22152468673
> read run: 0 10000000 476.65 2097970.76 1910 19104864199
> recvfrom run: 0 10000000 450.76 2218492.60 1806 18066972794

That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
recvmmsg() and recvmsg() are similar - but both a lot slower than recv().

David


2019-11-27 16:29:58

by Paolo Abeni

Subject: Re: epoll_wait() performance

On Wed, 2019-11-27 at 16:48 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <[email protected]> wrote:
>
> > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > >
> > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > >
> > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().
> >
> > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> >
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this make (by seeing how much slower duplicating the copy makes it).
> >
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > even allowing for looking up the fd.
> >
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> >
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> >
> > We really do want to receive all these UDP packets in a timely manner.
> > Although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
>
> I have a simple udp_sink tool[1] that cycle through the different
> receive socket system calls. I gave it a quick spin on a F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
>
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> run count ns/pkt pps cycles payload
> recvMmsg/32 run: 0 10000000 1461.41 684270.96 5261 18 demux:1
> recvmsg run: 0 10000000 889.82 1123824.84 3203 18 demux:1
> read run: 0 10000000 974.81 1025841.68 3509 18 demux:1
> recvfrom run: 0 10000000 1056.51 946513.44 3803 18 demux:1
>
> Normal recvmsg almost have double performance that recvmmsg.

For stream tests, the above is true if the BH is able to push the
packets to the socket fast enough. Otherwise recvmmsg() will make
user space even faster: the BH will find the user-space process
sleeping more often and will have to spend more time waking it up.

If a single receive queue is in use this condition is not easy to meet.

Before Spectre/Meltdown and the other mitigations, using connected sockets
and removing ct/nf was usually sufficient - at least in my scenarios -
to make the BH fast enough.

But that's no longer the case, and I have to use 2 or more different
receive queues.

@David: If I read your message correctly, the pkt rate you are dealing
with is quite low... are we talking about throughput or latency? I guess
latency could be measurably higher with recvmmsg() compared to the other
syscalls. How do you measure the relative performance of recvmmsg()
and recv()? With a micro-benchmark/rdtsc()? Am I right that you are
usually getting a single packet per recvmmsg() call?

Thanks,

Paolo

2019-11-27 17:34:55

by David Laight

Subject: RE: epoll_wait() performance

From: Paolo Abeni
> Sent: 27 November 2019 16:27
...
> @David: If I read your message correctly, the pkt rate you are dealing
> with is quite low... are we talking about tput or latency? I guess
> latency could be measurably higher with recvmmsg() in respect to other
> syscall. How do you measure the releative performances of recvmmsg()
> and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
> usually getting a single packet per recvmmsg() call?

The packet rate per socket is low, typically one packet every 20ms.
This is RTP, so telephony audio.
However we have a lot of audio channels and hence a lot of sockets.
So there can be 1000s of sockets we need to receive the data from.
The test system I'm using has 16 E1 TDM links each of which can handle
31 audio channels.
Forwarding all these to/from RTP (one of the things it might do) is 496
audio channels - so 496 RTP sockets and 496 RTCP ones.
Although the test I'm doing is pure RTP and doesn't use TDM.

What I'm measuring is the total time taken to receive all the packets
(on all the sockets) that are available to be read every 10ms.
So poll + recv + add_to_queue.
(The data processing is done by other threads.)
I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
to generate a 64-entry (self-scaling) histogram of the elapsed times.
Then look for the histogram's peak value.
(I need to work on the max value, but that is a different (more important!) problem.)
Depending on the poll/recv method used this takes 1.5 to 2ms
in each 10ms period.
(It is faster if I run the cpu at full speed, but it usually idles along
at 800MHz.)
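The measurement loop above can be sketched roughly as follows (a Python sketch; the fixed 50us bucket width and the 1000 iterations are assumptions for illustration - the histogram described is self-scaling, and the timed body would be the real poll + recv + add_to_queue pass):

```python
import time

BUCKETS, WIDTH_NS = 64, 50_000
hist = [0] * BUCKETS

for _ in range(1000):
    t0 = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    # ... poll/epoll_wait + recv + add_to_queue would go here ...
    t1 = time.clock_gettime_ns(time.CLOCK_MONOTONIC)
    # Bucket the elapsed time, clamping outliers into the last bucket.
    hist[min((t1 - t0) // WIDTH_NS, BUCKETS - 1)] += 1

# The peak bucket gives the typical per-pass cost; the max is separate.
peak_bucket = max(range(BUCKETS), key=hist.__getitem__)
assert sum(hist) == 1000
```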

If I use recvmmsg() I only expect to see one packet because there
is (almost always) only one packet on each socket every 20ms.
However there might be more than one, and if there is they
all need to be read (well at least 2 of them) in that block of receives.
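Picking up those occasional extra messages amounts to an optimistic drain: a non-blocking recv() loop that stops at EAGAIN (a Python sketch; the loopback sockets are illustrative):

```python
import socket
import time

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
rx.setblocking(False)
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"pkt1", rx.getsockname())
tx.sendto(b"pkt2", rx.getsockname())
time.sleep(0.05)  # let loopback delivery complete (test scaffolding)

# Usually one recv() succeeds and the second fails; when a burst has
# queued more, the loop reads them all in the same pass.
msgs = []
while True:
    try:
        msgs.append(rx.recv(2048))
    except BlockingIOError:  # EAGAIN: socket drained
        break

assert len(msgs) == 2
```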

The outbound traffic goes out through a small number of raw sockets.
Annoyingly we have to work out the local IPv4 address that will be used
for each destination in order to calculate the UDP checksum.
(I've a pending patch to speed up the x86 checksum code on a lot of
cpus.)

David


2019-11-27 17:50:33

by Eric Dumazet

Subject: Re: epoll_wait() performance



On 11/27/19 9:30 AM, David Laight wrote:
> From: Paolo Abeni
>> Sent: 27 November 2019 16:27
> ...
>> @David: If I read your message correctly, the pkt rate you are dealing
>> with is quite low... are we talking about tput or latency? I guess
>> latency could be measurably higher with recvmmsg() in respect to other
>> syscall. How do you measure the releative performances of recvmmsg()
>> and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
>> usually getting a single packet per recvmmsg() call?
>
> The packet rate per socket is low, typically one packet every 20ms.
> This is RTP, so telephony audio.
> However we have a lot of audio channels and hence a lot of sockets.
> So there are can be 1000s of sockets we need to receive the data from.
> The test system I'm using has 16 E1 TDM links each of which can handle
> 31 audio channels.
> Forwarding all these to/from RTP (one of the things it might do) is 496
> audio channels - so 496 RTP sockets and 496 RTCP ones.
> Although the test I'm doing is pure RTP and doesn't use TDM.
>
> What I'm measuring is the total time taken to receive all the packets
> (on all the sockets) that are available to be read every 10ms.
> So poll + recv + add_to_queue.
> (The data processing is done by other threads.)
> I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
> to generate a 64 entry (self scaling) histogram of the elapsed times.
> Then look for the histograms peak value.
> (I need to work on the max value, but that is a different (more important!) problem.)
> Depending on the poll/recv method used this takes 1.5 to 2ms
> in each 10ms period.
> (It is faster if I run the cpu at full speed, but it usually idles along
> at 800MHz.)
>
> If I use recvmmsg() I only expect to see one packet because there
> is (almost always) only one packet on each socket every 20ms.
> However there might be more than one, and if there is they
> all need to be read (well at least 2 of them) in that block of receives.
>
> The outbound traffic goes out through a small number of raw sockets.
> Annoyingly we have to work out the local IPv4 address that will be used
> for each destination in order to calculate the UDP checksum.
> (I've a pending patch to speed up the x86 checksum code on a lot of
> cpus.)
>
> David

A QUIC server handles hundreds of thousands of 'UDP flows', all using only one UDP socket
per cpu.

This is really the only way to scale, and it avoids needing kernel changes to efficiently
organize millions of UDP sockets (a huge memory footprint, even if we get their
management right).

Given that UDP has no state, there is really no point trying to have one UDP
socket per flow, and having to deal with epoll()/poll() overhead.
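One common way to get "one UDP socket per cpu" on a single port is SO_REUSEPORT, where the kernel hashes each flow onto one member of the group (a Python sketch of the mechanism, not necessarily what a QUIC server does internally; the loopback address and group size of 4 are illustrative):

```python
import socket

group = []
first = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
first.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
first.bind(("127.0.0.1", 0))
port = first.getsockname()[1]
group.append(first)

for _ in range(3):               # e.g. one socket per worker cpu
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))  # same port: joins the reuseport group
    group.append(s)

# All four sockets share one port; the kernel spreads flows across them,
# so there are no per-flow sockets and no big epoll sets to scan.
assert len(group) == 4
assert all(s.getsockname()[1] == port for s in group)
```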




2019-11-27 17:51:54

by Paolo Abeni

Subject: Re: epoll_wait() performance

Hi,

Thanks for the additional details.

On Wed, 2019-11-27 at 17:30 +0000, David Laight wrote:
> From: Paolo Abeni
> > Sent: 27 November 2019 16:27
> ...
> > @David: If I read your message correctly, the pkt rate you are dealing
> > with is quite low... are we talking about tput or latency? I guess
> > latency could be measurably higher with recvmmsg() in respect to other
> > syscall. How do you measure the releative performances of recvmmsg()
> > and recv() ? with micro-benchmark/rdtsc()? Am I right that you are
> > usually getting a single packet per recvmmsg() call?
>
> The packet rate per socket is low, typically one packet every 20ms.
> This is RTP, so telephony audio.
> However we have a lot of audio channels and hence a lot of sockets.
> So there are can be 1000s of sockets we need to receive the data from.
> The test system I'm using has 16 E1 TDM links each of which can handle
> 31 audio channels.
> Forwarding all these to/from RTP (one of the things it might do) is 496
> audio channels - so 496 RTP sockets and 496 RTCP ones.
> Although the test I'm doing is pure RTP and doesn't use TDM.

Ok, I think this is not exactly the preferred recvmmsg() use case ;)

> What I'm measuring is the total time taken to receive all the packets
> (on all the sockets) that are available to be read every 10ms.
> So poll + recv + add_to_queue.
> (The data processing is done by other threads.)
> I use the time difference (actually CLOCK_MONOTONIC - from rdtsc)
> to generate a 64 entry (self scaling) histogram of the elapsed times.
> Then look for the histograms peak value.
> (I need to work on the max value, but that is a different (more important!) problem.)
> Depending on the poll/recv method used this takes 1.5 to 2ms
> in each 10ms period.
> (It is faster if I run the cpu at full speed, but it usually idles along
> at 800MHz.)
>
> If I use recvmmsg() I only expect to see one packet because there
> is (almost always) only one packet on each socket every 20ms.
> However there might be more than one, and if there is they
> all need to be read (well at least 2 of them) in that block of receives.

My wild guess is that recvmmsg() would be faster than 2 recv() calls when
there are exactly 2 packets to read and user space provides exactly 2
msg entries, but that is likely not very relevant for the overall scenario.

Sorry, I don't have any good suggestion here.

Cheers,

Paolo

2019-11-27 20:16:23

by Willem de Bruijn

Subject: Re: epoll_wait() performance

On Wed, Nov 27, 2019 at 11:04 AM David Laight <[email protected]> wrote:
>
> From: Jesper Dangaard Brouer
> > Sent: 27 November 2019 15:48
> > On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <[email protected]> wrote:
> >
> > > ...
> > > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > > and faffing with the user iov[].)
> > > > >
> > > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > > >
> > > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > > single message, should be faster than two syscalls - recv() and
> > > > poll().
> > >
> > > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > > significant overhead, most likely due to the crappy code that tries to stop
> > > the kernel buffer being overrun.
> > >
> > > I need to run the tests on a system with a 'home built' kernel to see how much
> > > difference this make (by seeing how much slower duplicating the copy makes it).
> > >
> > > The system call cost of poll() gets factored over a reasonable number of sockets.
> > > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > > even allowing for looking up the fd.
> > >
> > > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > > expect one message and to call the poll() function before each subsequent receive.
> > >
> > > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > > I don't know how much that actually costs.
> > > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > >
> > > We really do want to receive all these UDP packets in a timely manner.
> > > Although very low latency isn't itself an issue.
> > > The data is telephony audio with (typically) one packet every 20ms.
> > > The code only looks for packets every 10ms - that helps no end since, in principle,
> > > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
> >
> > I have a simple udp_sink tool[1] that cycle through the different
> > receive socket system calls. I gave it a quick spin on a F31 kernel
> > 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> > to see a significant regression/slowdown for recvMmsg.
> >
> > $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> > run count ns/pkt pps cycles payload
> > recvMmsg/32 run: 0 10000000 1461.41 684270.96 5261 18 demux:1
> > recvmsg run: 0 10000000 889.82 1123824.84 3203 18 demux:1
> > read run: 0 10000000 974.81 1025841.68 3509 18 demux:1
> > recvfrom run: 0 10000000 1056.51 946513.44 3803 18 demux:1
> >
> > Normal recvmsg almost have double performance that recvmmsg.
> > recvMmsg/32 = 684,270 pps
> > recvmsg = 1,123,824 pps
>
> Can you test recv() as well?
> I think it might be faster than read().
>
> ...
> > Found some old results (approx v4.10-rc1):
> >
> > [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> > recvMmsg/32 run: 0 10000000 537.89 1859106.74 2155 21559353816
> > recvmsg run: 0 10000000 552.69 1809344.44 2215 22152468673
> > read run: 0 10000000 476.65 2097970.76 1910 19104864199
> > recvfrom run: 0 10000000 450.76 2218492.60 1806 18066972794
>
> That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
> recvmmsg() and recvmsg() are similar - but both a lot slower then recv().

Indeed, surprising that recv(from) would be less efficient than recvmsg.

Are the latest numbers with CONFIG_HARDENED_USERCOPY?

I assume that the poll() after recv() is non-blocking. If using
recvmsg, that extra syscall could be avoided by implementing a cmsg
inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.

More outlandish would be to abuse the mmsghdr->msg_len field to pass
file descriptors and amortize the kernel page-table isolation cost
across sockets. Blocking semantics would be weird, for starters.

2019-11-28 10:19:27

by David Laight

[permalink] [raw]
Subject: RE: epoll_wait() performance

From: Eric Dumazet
> Sent: 27 November 2019 17:47
...
> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
> per cpu.
>
> This is really the only way to scale, and does not need kernel changes to efficiently
> organize millions of UDP sockets (huge memory footprint even if we get right how
> we manage them)
>
> Given that UDP has no state, there is really no point trying to have one UDP
> socket per flow, and having to deal with epoll()/poll() overhead.

How can you do that when all the UDP flows have different destination port numbers?
These are message flows not idempotent requests.
I don't really want to collect the packets before they've been processed by IP.

I could write a driver that uses kernel udp sockets to generate a single message queue
than can be efficiently processed from userspace - but it is a faff compiling it for
the systems kernel version.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2019-11-28 11:16:56

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: epoll_wait() performance

On Wed, 27 Nov 2019 16:04:12 +0000
David Laight <[email protected]> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 27 November 2019 15:48
> > On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <[email protected]> wrote:
> >
> > > ...
> > > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > > (I'm not sure why the code paths are so slow, I suspect it is all the copy_from_user()
> > > > > and faffing with the user iov[].)
> > > > >
> > > > > So using poll() we repoll the fd after calling recv() to find is there is a second message.
> > > > > However the second poll has a significant performance cost (but less than using recvmmsg()).
> > > >
> > > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > > single message, should be faster than two syscalls - recv() and
> > > > poll().
> > >
> > > My suspicion is the extra two copy_from_user() needed for each recvmsg are a
> > > significant overhead, most likely due to the crappy code that tries to stop
> > > the kernel buffer being overrun.
> > >
> > > I need to run the tests on a system with a 'home built' kernel to see how much
> > > difference this make (by seeing how much slower duplicating the copy makes it).
> > >
> > > The system call cost of poll() gets factored over a reasonable number of sockets.
> > > So doing poll() on a socket with no data is a lot faster that the setup for recvmsg
> > > even allowing for looking up the fd.
> > >
> > > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > > expect one message and to call the poll() function before each subsequent receive.
> > >
> > > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > > I don't know how much that actually costs.
> > > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > >
> > > We really do want to receive all these UDP packets in a timely manner.
> > > Although very low latency isn't itself an issue.
> > > The data is telephony audio with (typically) one packet every 20ms.
> > > The code only looks for packets every 10ms - that helps no end since, in principle,
> > > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
> >
> > I have a simple udp_sink tool[1] that cycle through the different
> > receive socket system calls. I gave it a quick spin on a F31 kernel
> > 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> > to see a significant regression/slowdown for recvMmsg.
> >
> > $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
> > run count ns/pkt pps cycles payload
> > recvMmsg/32 run: 0 10000000 1461.41 684270.96 5261 18 demux:1
> > recvmsg run: 0 10000000 889.82 1123824.84 3203 18 demux:1
> > read run: 0 10000000 974.81 1025841.68 3509 18 demux:1
> > recvfrom run: 0 10000000 1056.51 946513.44 3803 18 demux:1
> >
> > Normal recvmsg almost have double performance that recvmmsg.
> > recvMmsg/32 = 684,270 pps
> > recvmsg = 1,123,824 pps
>
> Can you test recv() as well?

Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662

$ sudo taskset -c 1 ./udp_sink --port 9 --count $((10**6*2))
run count ns/pkt pps cycles payload
recvMmsg/32 run: 0 2000000 653.29 1530704.29 2351 18 demux:1
recvmsg run: 0 2000000 631.01 1584760.06 2271 18 demux:1
read run: 0 2000000 582.24 1717518.16 2096 18 demux:1
recvfrom run: 0 2000000 547.26 1827269.12 1970 18 demux:1
recv run: 0 2000000 547.37 1826930.39 1970 18 demux:1

> I think it might be faster than read().

Slightly, but same speed as recvfrom.

Strangely recvMmsg is not that bad in this test run, and it is on the
same kernel 5.3.12-300.fc31.x86_64 and hardware. I have CPU pinned
udp_sink, as if it jumps to the CPU doing RX-NAPI it will be fighting
for CPU time with the softirq (which Eric has mitigated a bit), and the
results are bad and look like this:

[broadwell src]$ sudo taskset -c 5 ./udp_sink --port 9 --count $((10**6*2))
run count ns/pkt pps cycles payload
recvMmsg/32 run: 0 2000000 1252.44 798439.60 4508 18 demux:1
recvmsg run: 0 2000000 1917.65 521470.72 6903 18 demux:1
read run: 0 2000000 1817.31 550263.37 6542 18 demux:1
recvfrom run: 0 2000000 1742.44 573909.46 6272 18 demux:1
recv run: 0 2000000 1741.51 574213.08 6269 18 demux:1


> [...]
> > Found some old results (approx v4.10-rc1):
> >
> > [brouer@skylake src]$ sudo taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --connect
> > recvMmsg/32 run: 0 10000000 537.89 1859106.74 2155 21559353816
> > recvmsg run: 0 10000000 552.69 1809344.44 2215 22152468673
> > read run: 0 10000000 476.65 2097970.76 1910 19104864199
> > recvfrom run: 0 10000000 450.76 2218492.60 1806 18066972794
>
> That is probably nearer what I am seeing on a 4.15 Ubuntu 18.04 kernel.
> recvmmsg() and recvmsg() are similar - but both a lot slower then recv().

Notice the tool can also test connected UDP sockets, which is what is done above.
I did a quick run with --connect:

$ sudo taskset -c 1 ./udp_sink --port 9 --count $((10**6*2)) --connect
run count ns/pkt pps cycles payload
recvMmsg/32 run: 0 2000000 500.72 1997107.02 1802 18 demux:1 c:1
recvmsg run: 0 2000000 662.52 1509380.46 2385 18 demux:1 c:1
read run: 0 2000000 613.46 1630103.14 2208 18 demux:1 c:1
recvfrom run: 0 2000000 577.71 1730974.34 2079 18 demux:1 c:1
recv run: 0 2000000 578.27 1729305.35 2081 18 demux:1 c:1

And now, recvMmsg is actually the fastest...?!


p.s.
DISCLAIMER: Do notice that this udp_sink tool is a network-overload
micro-benchmark, which does not represent the use-case you are
describing.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2019-11-28 16:28:07

by David Laight

[permalink] [raw]
Subject: RE: epoll_wait() performance

From: Willem de Bruijn
> Sent: 27 November 2019 19:48
...
> Are the latest numbers with CONFIG_HARDENED_USERCOPY?

According to /boot/config-`uname -r` it is enabled on my system.
I suspect it has a measurable effect on these tests.

> I assume that the poll() after recv() is non-blocking. If using
> recvmsg, that extra syscall could be avoided by implementing a cmsg
> inq hint for udp sockets analogous to TCP_CM_INQ/tcp_inq_hint.

All the poll() calls are non-blocking.
The first poll() has all the sockets in it.
The second poll() covers only those that returned data the first time around.
The code then sleeps elsewhere for the rest of the 10ms interval.
(Actually the polls are done in blocks of 64, filling up the pfd[] each time.)

This avoids the problem of repeatedly setting up and tearing down the
per-fd data for poll().

> More outlandish would be to abuse the mmsghdr->msg_len field to pass
> file descriptors and amortize the kernel page-table isolation cost
> across sockets. Blocking semantics would be weird, for starters.

It would be better to allow a single UDP socket be bound to multiple ports.
And then use the cmsg data to sort out the actual destination port.

David


2019-11-28 16:39:22

by David Laight

[permalink] [raw]
Subject: RE: epoll_wait() performance

From: Jesper Dangaard Brouer
> Sent: 28 November 2019 11:12
...
> > Can you test recv() as well?
>
> Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662
>
> $ sudo taskset -c 1 ./udp_sink --port 9 --count $((10**6*2))
> run count ns/pkt pps cycles payload
> recvMmsg/32 run: 0 2000000 653.29 1530704.29 2351 18 demux:1
> recvmsg run: 0 2000000 631.01 1584760.06 2271 18 demux:1
> read run: 0 2000000 582.24 1717518.16 2096 18 demux:1
> recvfrom run: 0 2000000 547.26 1827269.12 1970 18 demux:1
> recv run: 0 2000000 547.37 1826930.39 1970 18 demux:1
>
> > I think it might be faster than read().
>
> Slightly, but same speed as recvfrom.

I notice that your recvfrom() code doesn't request the source address,
so it is probably identical to recv().

My test system tends to increase its clock rate when busy.
(The fans speed up immediately, the cpu has a passive heatsink and all the
case fans are connected (via buffers) to the motherboard 'cpu fan' header.)
I could probably work out how to lock the frequency, but for some tests I run:
$ while :; do :; done
Putting 1 cpu into a userspace infinite loop makes them all run flat out
(until thermally throttled).

David


2019-11-28 16:56:09

by Willy Tarreau

[permalink] [raw]
Subject: Re: epoll_wait() performance

On Thu, Nov 28, 2019 at 04:37:01PM +0000, David Laight wrote:
> My test system tends to increase its clock rate when busy.
> (The fans speed up immediately, the cpu has a passive heatsink and all the
> case fans are connected (via buffers) to the motherboard 'cpu fan' header.)
> I could probably work out how to lock the frequency, but for some tests I run:
> $ while :; do :; done
> Putting 1 cpu into a userspace infinite loop make them all run flat out
> (until thermally throttled).

It would be way more efficient to only make the CPUs spin in the idle
loop. I wrote a small module a few years ago for this, which allows me
to do the equivalent of "idle=poll" at runtime. It's very convenient
in VMs as it significantly reduces your latency and jitter by preventing
them from sleeping. It's also quite effective at stabilizing CPUs that
have a large difference between their highest and lowest frequencies.

I'm attaching the patch here, it's straightforward, it was made on
3.14 and still worked unmodified on 4.19, I'm sure it still does with
more recent kernels.

Hoping this helps,
Willy

---

From 22d67389c2b28d924260b8ced78111111006ed94 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Wed, 27 Jan 2016 17:24:54 +0100
Subject: staging: add a new "idle_poll" module to disable idle loop

Sometimes it's convenient to be able to switch to polled mode for the
idle loop. This module does just that, and reverts back to the original
mode once unloaded.
---
drivers/staging/Kconfig | 2 ++
drivers/staging/Makefile | 1 +
drivers/staging/idle_poll/Kconfig | 8 ++++++++
drivers/staging/idle_poll/Makefile | 7 +++++++
drivers/staging/idle_poll/idle_poll.c | 22 ++++++++++++++++++++++
kernel/cpu/idle.c | 1 +
6 files changed, 41 insertions(+)
create mode 100644 drivers/staging/idle_poll/Kconfig
create mode 100644 drivers/staging/idle_poll/Makefile
create mode 100644 drivers/staging/idle_poll/idle_poll.c

diff --git a/drivers/staging/Kconfig b/drivers/staging/Kconfig
index 9594f204d4fc..936a2721b0f7 100644
--- a/drivers/staging/Kconfig
+++ b/drivers/staging/Kconfig
@@ -148,4 +148,6 @@ source "drivers/staging/dgnc/Kconfig"

source "drivers/staging/dgap/Kconfig"

+source "drivers/staging/idle_poll/Kconfig"
+
endif # STAGING
diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 6ca1cf3dbcd4..d3d45aff73d2 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -66,3 +66,4 @@ obj-$(CONFIG_XILLYBUS) += xillybus/
obj-$(CONFIG_DGNC) += dgnc/
obj-$(CONFIG_DGAP) += dgap/
obj-$(CONFIG_MTD_SPINAND_MT29F) += mt29f_spinand/
+obj-$(CONFIG_IDLE_POLL) += idle_poll/
diff --git a/drivers/staging/idle_poll/Kconfig b/drivers/staging/idle_poll/Kconfig
new file mode 100644
index 000000000000..4c96a21f66aa
--- /dev/null
+++ b/drivers/staging/idle_poll/Kconfig
@@ -0,0 +1,8 @@
+config IDLE_POLL
+ tristate "IDLE_POLL enabler"
+ help
+ This module automatically enables polling-based idle loop.
+ It is convenient in certain situations to simply load the
+ module to disable the idle loop, or unload it to re-enable
+ it.
+
diff --git a/drivers/staging/idle_poll/Makefile b/drivers/staging/idle_poll/Makefile
new file mode 100644
index 000000000000..60ad176f11f6
--- /dev/null
+++ b/drivers/staging/idle_poll/Makefile
@@ -0,0 +1,7 @@
+# This rule extracts the directory part from the location where this Makefile
+# is found, strips last slash and retrieves the last component which is used
+# to make a file name. It is a generic way of building modules which always
+# have the name of the directory they're located in. $(lastword) could have
+# been used instead of $(word $(words)) but it's bogus on some versions.
+
+obj-m += $(notdir $(patsubst %/,%,$(dir $(word $(words $(MAKEFILE_LIST)),$(MAKEFILE_LIST))))).o
diff --git a/drivers/staging/idle_poll/idle_poll.c b/drivers/staging/idle_poll/idle_poll.c
new file mode 100644
index 000000000000..6f39f85cc61d
--- /dev/null
+++ b/drivers/staging/idle_poll/idle_poll.c
@@ -0,0 +1,22 @@
+#include <linux/module.h>
+#include <linux/cpu.h>
+
+static int __init modinit(void)
+{
+ cpu_idle_poll_ctrl(true);
+ return 0;
+}
+
+static void __exit modexit(void)
+{
+ cpu_idle_poll_ctrl(false);
+ return;
+}
+
+module_init(modinit);
+module_exit(modexit);
+
+MODULE_DESCRIPTION("idle_poll enabler");
+MODULE_AUTHOR("Willy Tarreau");
+MODULE_VERSION("0.0.1");
+MODULE_LICENSE("GPL");
diff --git a/kernel/cpu/idle.c b/kernel/cpu/idle.c
index 277f494c2a9a..fbf648bc52b2 100644
--- a/kernel/cpu/idle.c
+++ b/kernel/cpu/idle.c
@@ -22,6 +22,7 @@ void cpu_idle_poll_ctrl(bool enable)
WARN_ON_ONCE(cpu_idle_force_poll < 0);
}
}
+EXPORT_SYMBOL(cpu_idle_poll_ctrl);

#ifdef CONFIG_GENERIC_IDLE_POLL_SETUP
static int __init cpu_idle_poll_setup(char *__unused)
--
2.20.1

2019-11-30 01:15:48

by Eric Dumazet

[permalink] [raw]
Subject: Re: epoll_wait() performance



On 11/28/19 2:17 AM, David Laight wrote:
> From: Eric Dumazet
>> Sent: 27 November 2019 17:47
> ...
>> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
>> per cpu.
>>
>> This is really the only way to scale, and does not need kernel changes to efficiently
>> organize millions of UDP sockets (huge memory footprint even if we get right how
>> we manage them)
>>
>> Given that UDP has no state, there is really no point trying to have one UDP
>> socket per flow, and having to deal with epoll()/poll() overhead.
>
> How can you do that when all the UDP flows have different destination port numbers?
> These are message flows not idempotent requests.
> I don't really want to collect the packets before they've been processed by IP.
>
> I could write a driver that uses kernel udp sockets to generate a single message queue
> than can be efficiently processed from userspace - but it is a faff compiling it for
> the systems kernel version.

Well, if the destination ports are not under your control,
you could also use AF_PACKET sockets; there is no need for 'UDP sockets' to receive UDP traffic,
especially if the rate is small.

2019-11-30 13:31:44

by Jakub Sitnicki

[permalink] [raw]
Subject: Re: epoll_wait() performance

On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> On 11/28/19 2:17 AM, David Laight wrote:
>> From: Eric Dumazet
>>> Sent: 27 November 2019 17:47
>> ...
>>> A QUIC server handles hundred of thousands of ' UDP flows' all using only one UDP socket
>>> per cpu.
>>>
>>> This is really the only way to scale, and does not need kernel changes to efficiently
>>> organize millions of UDP sockets (huge memory footprint even if we get right how
>>> we manage them)
>>>
>>> Given that UDP has no state, there is really no point trying to have one UDP
>>> socket per flow, and having to deal with epoll()/poll() overhead.
>>
>> How can you do that when all the UDP flows have different destination port numbers?
>> These are message flows not idempotent requests.
>> I don't really want to collect the packets before they've been processed by IP.
>>
>> I could write a driver that uses kernel udp sockets to generate a single message queue
>> than can be efficiently processed from userspace - but it is a faff compiling it for
>> the systems kernel version.
>
> Well if destinations ports are not under your control,
> you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> especially it the rate is small.

Alternatively, you could steer UDP flows coming to a certain port range
to one UDP socket using TPROXY [0, 1].

TPROXY has the same downside as AF_PACKET, meaning that it requires at
least CAP_NET_RAW to create/set up the socket.

OTOH, with TPROXY you can gracefully co-reside with other services,
filtering on just the destination addresses you want in iptables/nftables.

Fan-out / load-balancing with reuseport to have one socket per CPU is
not possible, though. You would need to do that with Netfilter.

-Jakub

[0] https://www.kernel.org/doc/Documentation/networking/tproxy.txt
[1] https://blog.cloudflare.com/how-we-built-spectrum/

2019-12-02 12:27:57

by David Laight

[permalink] [raw]
Subject: RE: epoll_wait() performance

From: Jakub Sitnicki <[email protected]>
> Sent: 30 November 2019 13:30
> On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> > On 11/28/19 2:17 AM, David Laight wrote:
...
> >> How can you do that when all the UDP flows have different destination port numbers?
> >> These are message flows not idempotent requests.
> >> I don't really want to collect the packets before they've been processed by IP.
> >>
> >> I could write a driver that uses kernel udp sockets to generate a single message queue
> >> than can be efficiently processed from userspace - but it is a faff compiling it for
> >> the systems kernel version.
> >
> > Well if destinations ports are not under your control,
> > you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> > especially it the rate is small.
>
> Alternatively, you could steer UDP flows coming to a certain port range
> to one UDP socket using TPROXY [0, 1].

I don't think that can work, we don't really know the list of valid UDP port
numbers ahead of time.

> TPROXY has the same downside as AF_PACKET, meaning that it requires at
> least CAP_NET_RAW to create/set up the socket.

CAP_NET_RAW wouldn't be a problem - we already send from a 'raw' socket.
(Which is a PITA for IPv4 because we have to determine the local IP address
in order to calculate the UDP checksum - so we have to keep a local copy
of what (hopefully) matches the kernel routing tables.)

> OTOH, with TPROXY you can gracefully co-reside with other services,
> filtering on just the destination addresses you want in iptables/nftables.

Yes, the packets need to be extracted from normal processing - otherwise
the UDP code would send ICMP port unreachable errors.

>
> Fan-out / load-balancing with reuseport to have one socket per CPU is
> not possible, though. You would need to do that with Netfilter.

Or put different port ranges onto different sockets.

> [0] https://www.kernel.org/doc/Documentation/networking/tproxy.txt
> [1] https://blog.cloudflare.com/how-we-built-spectrum/

The latter explains it a bit better than the former....

I think I'll try to understand the 'ftrace' documentation enough to add some
extra trace points to those displayed by schedviz.
So far I've only found some trivial example uses, not any actual docs.
Hopefully there is a faster way than reading the kernel sources :-(

David


2019-12-02 16:49:12

by Willem de Bruijn

[permalink] [raw]
Subject: Re: epoll_wait() performance

On Mon, Dec 2, 2019 at 7:24 AM David Laight <[email protected]> wrote:
>
> From: Jakub Sitnicki <[email protected]>
> > Sent: 30 November 2019 13:30
> > On Sat, Nov 30, 2019 at 02:07 AM CET, Eric Dumazet wrote:
> > > On 11/28/19 2:17 AM, David Laight wrote:
> ...
> > >> How can you do that when all the UDP flows have different destination port numbers?
> > >> These are message flows not idempotent requests.
> > >> I don't really want to collect the packets before they've been processed by IP.
> > >>
> > >> I could write a driver that uses kernel udp sockets to generate a single message queue
> > >> than can be efficiently processed from userspace - but it is a faff compiling it for
> > >> the systems kernel version.
> > >
> > > Well if destinations ports are not under your control,
> > > you also could use AF_PACKET sockets, no need for 'UDP sockets' to receive UDP traffic,
> > > especially it the rate is small.
> >
> > Alternatively, you could steer UDP flows coming to a certain port range
> > to one UDP socket using TPROXY [0, 1].
>
> I don't think that can work, we don't really know the list of valid UDP port
> numbers ahead of time.

How about -j REDIRECT? That does not require all ports to be known
ahead of time.

> > TPROXY has the same downside as AF_PACKET, meaning that it requires at
> > least CAP_NET_RAW to create/set up the socket.
>
> CAP_NET_RAW wouldn't be a problem - we already send from a 'raw' socket.

One other issue when comparing udp and packet sockets is ip
defragmentation. That is critical code that is not at all trivial to
duplicate in userspace.

Even when choosing packet sockets, which normally would not
defragment, there is a trick. A packet socket with fanout and flag
PACKET_FANOUT_FLAG_DEFRAG will defragment before fanout.

2019-12-19 07:58:43

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: epoll_wait() performance

On Thu, 28 Nov 2019 16:37:01 +0000
David Laight <[email protected]> wrote:

> From: Jesper Dangaard Brouer
> > Sent: 28 November 2019 11:12
> ...
> > > Can you test recv() as well?
> >
> > Sure: https://github.com/netoptimizer/network-testing/commit/9e3c8b86a2d662
> >
> > $ sudo taskset -c 1 ./udp_sink --port 9 --count $((10**6*2))
> > run count ns/pkt pps cycles payload
> > recvMmsg/32 run: 0 2000000 653.29 1530704.29 2351 18 demux:1
> > recvmsg run: 0 2000000 631.01 1584760.06 2271 18 demux:1
> > read run: 0 2000000 582.24 1717518.16 2096 18 demux:1
> > recvfrom run: 0 2000000 547.26 1827269.12 1970 18 demux:1
> > recv run: 0 2000000 547.37 1826930.39 1970 18 demux:1
> >
> > > I think it might be faster than read().
> >
> > Slightly, but same speed as recvfrom.
>
> I notice that you recvfrom() code doesn't request the source address.
> So is probably identical to recv().

Created a GitHub issue/bug on this:
https://github.com/netoptimizer/network-testing/issues/5

Feel free to fix this and send a patch/PR.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer