2024-04-02 21:04:42

by Michael Clark

Subject: user-space concurrent pipe buffer scheduler interactions

Folks,

I am working on a low latency cross-platform concurrent pipe buffer
using C11 threads and atomics. It is portable code, using a <stdatomic.h>
polyfill on Windows that wraps the intrinsics Microsoft provides (a rough
sketch of the idea follows the link). There is a detailed write-up with
implementation details, source code, tests, and benchmark results at:

- https://github.com/michaeljclark/cpipe/
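
To give a flavour of the polyfill, here is a rough sketch of the kind of
shim I mean: the MSVC _InterlockedCompareExchange intrinsic wrapped behind
a strong compare-exchange with C11-style semantics. The names here are
hypothetical; the actual polyfill in the repository is more complete.

```c
/* Hypothetical sketch of a <stdatomic.h>-style shim over the MSVC
 * intrinsics; the actual polyfill in the cpipe repository differs. */
#include <intrin.h>

typedef volatile long atomic_long_shim;

/* compare-and-swap with atomic_compare_exchange_strong semantics:
 * returns non-zero on success, and on failure writes the value that
 * was actually observed back into *expected. */
static int shim_compare_exchange(atomic_long_shim *obj, long *expected, long desired)
{
    long prev = _InterlockedCompareExchange(obj, desired, *expected);
    if (prev == *expected) return 1;
    *expected = prev;
    return 0;
}
```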

I have been eagerly following Jens's work on io_uring, which is why I am
including him; he may be interested in these scheduler findings, because
I am currently using busy memory polling for synchronization.

The reason I am writing here is that I think I now have a pretty decent
test case for comparing the Windows and Linux schedulers side by side.
Let's just say it has been an eye-opening process, and I think folks here
might be interested in what I am seeing versus what we would predict
based on Amdahl's Law and low-level cache ping-pong on atomics.

Let me cut to the chase. What I am observing is that when I add threads
on Windows, performance increases, but when I add threads on Linux,
performance decreases. I don't know exactly why. I am wondering whether
Windows is doing some topology-aware scheduling, or whether it is using
performance counters to inform scheduling decisions. I have checked the
codegen and the hot path is basically two LOCK CMPXCHG instructions.
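
To make that concrete, the shape of the hot path I am describing is
roughly the following (a simplified sketch with hypothetical head/tail
counters, not the literal code in buffer.h): one LOCK CMPXCHG claims a
slot, and a second LOCK op, elided here as a comment, publishes it.

```c
/* Simplified sketch of the two-CAS hot path I am describing; the layout
 * and counters here are hypothetical, not the literal cpipe code. */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uint64_t head;   /* consumer counter */
    _Atomic uint64_t tail;   /* producer counter */
} queue_sketch;

static int try_push(queue_sketch *q, uint64_t capacity)
{
    uint64_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (;;) {
        uint64_t head = atomic_load_explicit(&q->head, memory_order_acquire);
        if (tail - head >= capacity)
            return 0;                     /* full: the caller spins and retries */
        /* first LOCK CMPXCHG: claim slot (tail % capacity) by advancing tail */
        if (atomic_compare_exchange_weak_explicit(&q->tail, &tail, tail + 1,
                memory_order_acq_rel, memory_order_relaxed))
            break;
        /* on failure, 'tail' now holds the freshly observed value */
    }
    /* ... copy the payload into the claimed slot, then a second LOCK op
     * publishes it so the consumer only ever sees a fully written record ... */
    return 1;
}
```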

I ran bare metal tests on Kaby Lake and Skylake processors on both OSes:

- `Windows 11 Version 23H2 Build 22631.3296`
- `Linux 6.5.0-25-generic #25~22.04.1-Ubuntu`

In any case, here are the numbers. I will let them speak for themselves:

# Minimum Latency (nanoseconds)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | ~219ns | ~362ns | ~7692ns |
| Skylake (i9-7980XE) | ~404ns | ~425ns | ~9183ns |

# Message Rate (messages per second)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 4.55M | 2.71M | 129.62K |
| Skylake (i9-7980XE) | 2.47M | 2.35M | 108.89K |

# Bandwidth, 32KB buffer (1 thread)

| | cpipe win11 | cpipe linux | linux pipes |
|:---------------------|------------:|------------:|------------:|
| Kaby Lake (i7-8550U) | 2.91GB/sec | 1.36GB/sec | 1.72GB/sec |
| Skylake (i9-7980XE) | 2.98GB/sec | 1.44GB/sec | 1.67GB/sec |

# Bandwidth, 32KB buffer (4 threads)

| | cpipe win11 | cpipe linux |
|:---------------------|------------:|------------:|
| Kaby Lake (i7-8550U) | 5.56GB/sec | 0.79GB/sec |
| Skylake (i9-7980XE) | 7.11GB/sec | 0.89GB/sec |

I think we have a very useful test case here for the Linux scheduler. I
have been working on a generalization of a memory-polled user-space queue,
and this is about the 5th iteration, in which I have been very careful to
treat modulo arithmetic and counter overflow as the normal case.
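
By "overflow as the normal case" I mean free-running unsigned counters
that are allowed to wrap, with the slot index derived by masking only at
the last moment. A tiny self-contained illustration, using hypothetical
32-bit counters rather than the actual cpipe fields, is below.

```c
/* Sketch of "overflow as the normal case": hypothetical free-running
 * 32-bit counters that are never masked until a slot index is needed. */
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t capacity = 1u << 16;           /* must be a power of two */
    uint32_t head = UINT32_MAX - 2;         /* producer counter, about to wrap */
    uint32_t tail = UINT32_MAX - 7;         /* consumer counter */

    head += 5;                              /* wraps past zero: well defined */

    assert((uint32_t)(head - tail) == 10);  /* occupancy survives the wrap */
    assert((head & (capacity - 1)) == 2);   /* slot index via power-of-two mask */
    return 0;
}
```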

I know it is a little unfair to compare latency with Linux pipes, and we
also waste a lot of time spinning when the queue is full. This is where
we would really like to use something like SENDUIPI, UMONITOR, and
UMWAIT, but I don't have access to silicon that supports those yet.

Regards,
Michael Clark


2024-04-03 16:56:58

by Linus Torvalds

Subject: Re: user-space concurrent pipe buffer scheduler interactions

On Tue, 2 Apr 2024 at 13:54, Michael Clark <[email protected]> wrote:
>
> I am working on a low latency cross-platform concurrent pipe buffer
> using C11 threads and atomics.

You will never get good performance doing spinlocks in user space
unless you actually tell the scheduler about the spinlocks, and have
some way to actually sleep on contention.

Which I don't see you as having.

Linus

2024-04-03 20:53:05

by Michael Clark

Subject: Re: user-space concurrent pipe buffer scheduler interactions

On 4/4/24 05:56, Linus Torvalds wrote:
> On Tue, 2 Apr 2024 at 13:54, Michael Clark <[email protected]> wrote:
>>
>> I am working on a low latency cross-platform concurrent pipe buffer
>> using C11 threads and atomics.
>
> You will never get good performance doing spinlocks in user space
> unless you actually tell the scheduler about the spinlocks, and have
> some way to actually sleep on contention.
>
> Which I don't see you as having.

We can work on this.

So maybe it is possible to look at how many LOCK instructions were
retired in the last scheduler quantum, ideally split into retired-success
and retired-failed for interlocked compare-and-swap. Maybe it is just a
performance counter and doesn't require perf tracing to be switched on?
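
Something like the following is what I have in mind from user space. A
rough sketch only: the raw event code is my guess at
MEM_INST_RETIRED.LOCK_LOADS (event 0xd0, umask 0x21) on Skylake-family
cores, and there is no success/failure split there, so treat the event
selection as an assumption.

```c
/* Rough sketch: count locked loads retired in the current thread via
 * perf_event_open. The raw event code is an assumption (see above). */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x21d0;            /* assumed: MEM_INST_RETIRED.LOCK_LOADS */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the contended section here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = 0;
    printf("locked loads retired: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```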

Then you could probably build a queue of processes in lock contention,
but the hard part is deducing who had contention with whom. I will need
to think about this for a while. We know the latency when things are not
contended, because these critical sections are usually small. It is about
~200-400ns, and you can measure these numbers in a loop at boot.

But I don't know how we can see spinning on acquires. It makes me think
that the early relaxed/acquire comparison before the LOCK op is a bad
idea. It gave me a very minor performance boost, but it would break the
strategy I just mentioned, because we wouldn't have a LOCK CMPXCHG in our
spin loop. Without the early check, we would know for certain "that"
process had a failed LOCK CMPXCHG.

So I would need to delete this line and other lines like this:

https://github.com/michaeljclark/cpipe/blob/13c0ad1a865b9cc0174fc8f61d76f37bdbf11d4d/include/buffer.h#L317
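
For reference, the pattern I would be deleting is the usual
test-and-test-and-set shape, roughly like this (a hypothetical sketch,
not the literal line above):

```c
/* The shape of the early check I am talking about deleting (hypothetical
 * sketch): a relaxed load filters out obviously stale attempts before
 * the LOCK CMPXCHG is issued, so a failed acquire leaves no LOCK op to count. */
#include <stdatomic.h>
#include <stdint.h>

static int try_claim(_Atomic uint64_t *counter, uint64_t expected, uint64_t desired)
{
    /* the cheap pre-check: no LOCK op, so no retired-failed CAS to observe */
    if (atomic_load_explicit(counter, memory_order_relaxed) != expected)
        return 0;
    /* the real claim: this is the LOCK CMPXCHG a scheduler could count */
    return atomic_compare_exchange_strong_explicit(counter, &expected, desired,
            memory_order_acq_rel, memory_order_relaxed);
}
```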

I also want a user-space wrapper for futexes, for a waitlist_t that
rechecks conditions and uses cond_timedwait on old broken POSIX systems,
so that we won't deadlock due to a missed wake-up. FreeBSD, macOS, and
Windows are starting to look like they might have something we can use.

WaitOnAddress on Windows compares for not-equal and supports 8-, 16-,
32-, and 64-bit words, but when used to construct wait-until-equal, which
is what I need, or less-than or greater-than, it could suffer from
thundering herd if used in a decentralized way in user-space. Maybe we
would need a per-address waiter list stashed in a CPU struct for the lead
waiter, then centrally recheck the condition and, when appropriate,
reschedule those sleeping in the queue for events on that address?

Sorry I don't know how the Linux scheduler and futexes work internally.
I just want to use this stuff in user-space. I want a POSIX waitlist_t.
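
To make the shape concrete, here is roughly the primitive I am imagining
underneath a waitlist_t. A sketch under my assumptions: FUTEX_WAIT /
FUTEX_WAKE on Linux and WaitOnAddress / WakeByAddressAll on Windows, with
the caller always rechecking its condition in a loop after waking.

```c
/* Sketch of a wait-until-changed primitive of the kind I want under a
 * waitlist_t. Callers must recheck their condition after every wake-up. */
#include <stdint.h>

#if defined(_WIN32)
#include <windows.h>                       /* link with synchronization.lib */

static void wait_while_equal(volatile uint32_t *addr, uint32_t undesired)
{
    /* sleeps while *addr == undesired; returns when it has changed */
    WaitOnAddress((volatile VOID *)addr, (PVOID)&undesired,
                  sizeof(uint32_t), INFINITE);
}

static void wake_all(volatile uint32_t *addr)
{
    WakeByAddressAll((PVOID)addr);
}

#else
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void wait_while_equal(volatile uint32_t *addr, uint32_t undesired)
{
    /* sleeps only if *addr still equals 'undesired'; spurious wake-ups are
     * fine because the caller loops and rechecks the condition */
    syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, undesired, NULL, NULL, 0);
}

static void wake_all(volatile uint32_t *addr)
{
    syscall(SYS_futex, addr, FUTEX_WAKE_PRIVATE, INT32_MAX, NULL, NULL, 0);
}
#endif
```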

I am working on a tiny emulator for the Windows Hypervisor, which is how
I justify the present "embedded" version that spins. This pipe buffer is
for a tiny test kernel, to get rid of a janky lock around printf.

- https://github.com/michaeljclark/emu

Thanks,
Michael.

2024-04-03 21:05:39

by Linus Torvalds

Subject: Re: user-space concurrent pipe buffer scheduler interactions

On Wed, 3 Apr 2024 at 13:52, Michael Clark <[email protected]> wrote:
>
> On 4/4/24 05:56, Linus Torvalds wrote:
> > On Tue, 2 Apr 2024 at 13:54, Michael Clark <[email protected]> wrote:
> >>
> >> I am working on a low latency cross-platform concurrent pipe buffer
> >> using C11 threads and atomics.
> >
> > You will never get good performance doing spinlocks in user space
> > unless you actually tell the scheduler about the spinlocks, and have
> > some way to actually sleep on contention.
> >
> > Which I don't see you as having.
>
> We can work on this.

It's been tried.

Nobody ever found a use-case that is sufficiently convincing, but see
the write-up at

https://lwn.net/Articles/944895/

for a pointer to at least attempts.

Linus

2024-04-03 21:22:55

by Michael Clark

Subject: Re: user-space concurrent pipe buffer scheduler interactions

On 4/4/24 09:39, Michael Clark wrote:
> So maybe it is possible to look at how many LOCK instructions were
> retired in the last scheduler quantum, ideally split into retired-success
> and retired-failed for interlocked compare-and-swap. Maybe it is just a
> performance counter and doesn't require perf tracing to be switched on?

It just occurred to me that you could stash the address and width of the
last failed interlocked compare-and-swap to deduce a wait-on-address, or
even, in some circumstances, ask the processor to generate a precise
interrupt so that you could reschedule the waiter. An idle thought: if
the operation is going to fail, we might as well jump straight to the
scheduler.

2024-04-04 03:00:36

by Michael Clark

Subject: Re: user-space concurrent pipe buffer scheduler interactions

On 4/4/24 09:57, Linus Torvalds wrote:
> On Wed, 3 Apr 2024 at 13:52, Michael Clark <[email protected]> wrote:
>>
>> On 4/4/24 05:56, Linus Torvalds wrote:
>>> On Tue, 2 Apr 2024 at 13:54, Michael Clark <[email protected]> wrote:
>>>>
>>>> I am working on a low latency cross-platform concurrent pipe buffer
>>>> using C11 threads and atomics.
>>>
>>> You will never get good performance doing spinlocks in user space
>>> unless you actually tell the scheduler about the spinlocks, and have
>>> some way to actually sleep on contention.
>>>
>>> Which I don't see you as having.
>>
>> We can work on this.
>
> It's been tried.
>
> Nobody ever found a use-case that is sufficiently convincing, but see
> the write-up at
>
> https://lwn.net/Articles/944895/

It's certainly interesting!

I wouldn't throw in the towel so soon, not until folks have tried a bit
harder. Even if we can't make it faster, it would be neat to have an
Austin Group meeting to give us user-space devels a waitlist_t, so we
don't hit that brain-damaged deadlocky cond_wait foot-gun, realize it is
totally broken, and have to fall back to cond_timedwait with some stupid
delay before we can recheck the condition again in user-space in our
"portable code".
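
By that I mean the familiar defensive pattern below (a sketch with
hypothetical names), where the timeout exists only to recover from a
possibly missed wake-up:

```c
/* The defensive pattern I mean: cond_timedwait in a recheck loop, where
 * the timeout exists only to recover from a possible missed wake-up.
 * (Sketch with hypothetical names.) */
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static bool ready;

static void wait_until_ready(void)
{
    pthread_mutex_lock(&lock);
    while (!ready) {
        struct timespec ts;
        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 1;                     /* the "stupid delay" */
        pthread_cond_timedwait(&cond, &lock, &ts);
        /* loop and recheck the condition whether we woke or timed out */
    }
    pthread_mutex_unlock(&lock);
}
```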

Michael.