LinuxLists.cc - [PATCH] rfc: threaded epoll

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Fri, 4 May 2007, Davi Arnaut wrote:

> Hi,
>
> If multiple threads are parked on epoll_wait (on a single epoll fd) and
> events become available, epoll performs a wake up of all threads of the
> poll wait list, causing a thundering herd of processes trying to grab
> the eventpoll lock.
>
> This patch addresses this by using exclusive waiters (wake one). Once
> the exclusive thread finishes transferring it's events, a new thread
> is woken if there are more events available.
>
> Makes sense?

Theorically, make sense. I said theorically because all the use
epoll_wait MT use cases I've heard of, use a single thread that does the
epoll_wait, and then dispatch to worker threads. So thundering herd is not
in the picture. OTOH, it does not hurt either.
But, that code is completely changed with the new single-pass epoll delivery
code that is in -mm. So, I'd either wait for that code to go in, or I
(or you, if you like) can make a patch against -mm.

- Davide

2007-05-05 21:42:21

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

Davide Libenzi wrote:
> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring it's events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theorically, make sense. I said theorically because all the use
> epoll_wait MT use cases I've heard of, use a single thread that does the
> epoll_wait, and then dispatch to worker threads. So thundering herd is not
> in the picture. OTOH, it does not hurt either.

A google search turns up a few users. It also addresses some complaints
from Drepper.

> But, that code is completely changed with the new single-pass epoll delivery
> code that is in -mm. So, I'd either wait for that code to go in, or I
> (or you, if you like) can make a patch against -mm.

Fell free to update the patch for -mm and to include my signed-of-by.
Thanks.

--
Davi Arnaut

2007-05-07 15:47:05

by Chase Venters

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Sat, 5 May 2007, Davide Libenzi wrote:

> On Fri, 4 May 2007, Davi Arnaut wrote:
>
>> Hi,
>>
>> If multiple threads are parked on epoll_wait (on a single epoll fd) and
>> events become available, epoll performs a wake up of all threads of the
>> poll wait list, causing a thundering herd of processes trying to grab
>> the eventpoll lock.
>>
>> This patch addresses this by using exclusive waiters (wake one). Once
>> the exclusive thread finishes transferring it's events, a new thread
>> is woken if there are more events available.
>>
>> Makes sense?
>
> Theorically, make sense. I said theorically because all the use
> epoll_wait MT use cases I've heard of, use a single thread that does the
> epoll_wait, and then dispatch to worker threads. So thundering herd is not
> in the picture. OTOH, it does not hurt either.
> But, that code is completely changed with the new single-pass epoll delivery
> code that is in -mm. So, I'd either wait for that code to go in, or I
> (or you, if you like) can make a patch against -mm.
>

*raises hand*

I'm working on event handling code for multiple projects right now, and my
method of calling epoll_wait() is to do so from several threads. I've
glanced at the epoll code but obviously haven't noticed the wake-all
behavior... good to know. I suppose I'm going to have to hack around this
problem by wrapping epoll_wait() calls in a mutex. That sucks - it means
other threads won't be able to 'get ahead' by preparing their wait before
it is their turn to dequeue events.

In any case, I think having multiple threads blocking on epoll_wait() is a
much saner idea than one thread which then passes out events, so I must
voice my support for fixing this case. Why this is the exception instead
of the norm is a little baffling, but I've seen so many perverse things in
multi-threaded code...

>
> - Davide
>

Thanks,
Chase

2007-05-07 17:18:46

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Mon, 7 May 2007, Chase Venters wrote:

> I'm working on event handling code for multiple projects right now, and my
> method of calling epoll_wait() is to do so from several threads. I've glanced
> at the epoll code but obviously haven't noticed the wake-all behavior... good
> to know. I suppose I'm going to have to hack around this problem by wrapping
> epoll_wait() calls in a mutex. That sucks - it means other threads won't be
> able to 'get ahead' by preparing their wait before it is their turn to dequeue
> events.
>
> In any case, I think having multiple threads blocking on epoll_wait() is a
> much saner idea than one thread which then passes out events, so I must voice
> my support for fixing this case. Why this is the exception instead of the norm
> is a little baffling, but I've seen so many perverse things in multi-threaded
> code...

The problem that you can have with multiple threads calling epoll_wait()
on an SMP system, is that if you sweep 100 events in one thread, and this
thread goes alone in processing those, you may have other CPUs idle while
the other thread is handling those. Either you call epoll_wait() from
multiple thread by keeping the event buffer passed to epoll_wait() farily
limited, on you use a single epoll_wait() fetcher with a queue(s) from
which worker threads pull from.
Davi's patch will be re-factored against 22-rc1 and submitted in any case
though.

- Davide

2007-05-07 18:18:27

by Chase Venters

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Mon, 7 May 2007, Davide Libenzi wrote:

> On Mon, 7 May 2007, Chase Venters wrote:
>
>> I'm working on event handling code for multiple projects right now, and my
>> method of calling epoll_wait() is to do so from several threads. I've glanced
>> at the epoll code but obviously haven't noticed the wake-all behavior... good
>> to know. I suppose I'm going to have to hack around this problem by wrapping
>> epoll_wait() calls in a mutex. That sucks - it means other threads won't be
>> able to 'get ahead' by preparing their wait before it is their turn to dequeue
>> events.
>>
>> In any case, I think having multiple threads blocking on epoll_wait() is a
>> much saner idea than one thread which then passes out events, so I must voice
>> my support for fixing this case. Why this is the exception instead of the norm
>> is a little baffling, but I've seen so many perverse things in multi-threaded
>> code...
>
> The problem that you can have with multiple threads calling epoll_wait()
> on an SMP system, is that if you sweep 100 events in one thread, and this
> thread goes alone in processing those, you may have other CPUs idle while
> the other thread is handling those. Either you call epoll_wait() from
> multiple thread by keeping the event buffer passed to epoll_wait() farily
> limited, on you use a single epoll_wait() fetcher with a queue(s) from
> which worker threads pull from.

Working with smaller quantums is indeed the right thing to do.

In any case, let's consider why you're getting 100 events from one
epoll_wait():

1. You have a single thread doing the dequeue, and it is taking a long
time (perhaps due to the time it is taking to requeue the work in other
threads).

2. Your load is so high that you are taking lots and lots of events, in
which case the other epoll_wait() threads are going to be woken up very
soon with work anyway. In this scenario you will be "scheduling" work at
"odd" times based on its arrival, but that's just another argument to use
smaller quantums.

I'm referring specifically to edge-triggered behavior, btw. I find
edge-triggered development far easier and saner in a multi-threaded
environment, and doing level-triggered and multi-threaded at the same time
certainly seems like the wrong thing to do.

In any case, I see little point in a thread whose job is simply to move
something from queue A (epoll ready list) to queue B (thread work list).
My latest code basically uses epoll_wait() as a load balancing mechanism
to pass out work. The quantums are fairly small. There may be situations
where you get a burst of traffic that one thread handles while others are
momentarily idle, but handling that traffic is a very quick operation (and
everything is non-blocking). You really only need the other threads to
participate when the load starts to get to the point where the
epoll_wait() calls will be constantly returning anyway.

> Davi's patch will be re-factored against 22-rc1 and submitted in any case
> though.

Great. I'm just glad I saw this mail -- I probably would have burned quite
some time in the coming weeks trying to figure out why my epoll code
wasn't running quite smoothly.

>
> - Davide
>

Thanks,
Chase

2007-05-07 21:01:01

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On 5/5/07, Davi Arnaut <[email protected]> wrote:
> A google search turns up a few users. It also addresses some complaints
> from Drepper.

There is a huge problem with this approach and we're back at the
inadequate interface.

select/poll/epoll are thread cancellation points. I.e., the thread
can be canceled before returning to the user. If this cancellation
happens between the kernel deciding to give this thread the event (and
no other thread) and the thread testing for cancellation in the libc
wrapper around the syscall, then the event is lost and the process(es)
might hang.

With kevent we in the end fixed the problem by requiring that part of
the cancellation handling the thread tries to wake up another thread
waiting for the event queue. This is easily possible since the event
data is in the shared memory segment and it's just purely the thread
wakeup that is needed.

To make something like this work for poll you'd have to push back the
revents fields of the result back to the kernel which might then cause
another thread to be woken up. I find this too ugly to consider. You
guys will not believe this but I really thought all these things
through before writing the OLS paper. poll cannot be salvaged.

There is another thing about this selective wakeup: do I assume it
correctly that if more than one file descriptor is reported ready more
than one thread is woken? I think nothing else can be justified.
Will in this case both threads get the same set of descriptors
reported or will they see disjunct sets?

2007-05-07 21:34:40

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

Ulrich Drepper wrote:
> On 5/5/07, Davi Arnaut <[email protected]> wrote:
>> A google search turns up a few users. It also addresses some complaints
>> from Drepper.
>
> There is a huge problem with this approach and we're back at the
> inadequate interface.
>
> select/poll/epoll are thread cancellation points. I.e., the thread
> can be canceled before returning to the user. If this cancellation
> happens between the kernel deciding to give this thread the event (and
> no other thread) and the thread testing for cancellation in the libc
> wrapper around the syscall, then the event is lost and the process(es)
> might hang.
>
> With kevent we in the end fixed the problem by requiring that part of
> the cancellation handling the thread tries to wake up another thread
> waiting for the event queue. This is easily possible since the event
> data is in the shared memory segment and it's just purely the thread
> wakeup that is needed.
>
> To make something like this work for poll you'd have to push back the
> revents fields of the result back to the kernel which might then cause
> another thread to be woken up. I find this too ugly to consider. You
> guys will not believe this but I really thought all these things
> through before writing the OLS paper. poll cannot be salvaged.

See Linus's message on this same thread.

> There is another thing about this selective wakeup: do I assume it
> correctly that if more than one file descriptor is reported ready more
> than one thread is woken? I think nothing else can be justified.

Correct.

> Will in this case both threads get the same set of descriptors
> reported or will they see disjunct sets?

Disjunct. In reality, only if the event is edge triggered.

--
Davi Arnaut

2007-05-07 22:19:27

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On 5/7/07, Davi Arnaut <[email protected]> wrote:
> See Linus's message on this same thread.

No. I'm talking about the userlevel side, not kernel side.

If a thread is canceled *after* it returns from the syscall but before
it reports the event to the call (i.e., while still in the syscall
wrapper, thread cancellation rules require a check there) the event is
lost.

Linus was only talking about the kernel side and how in the exit path
such events are not lost.

2007-05-07 22:35:46

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Mon, 7 May 2007, Ulrich Drepper wrote:

> On 5/7/07, Davi Arnaut <[email protected]> wrote:
> > See Linus's message on this same thread.
>
> No. I'm talking about the userlevel side, not kernel side.
>
> If a thread is canceled *after* it returns from the syscall but before
> it reports the event to the call (i.e., while still in the syscall
> wrapper, thread cancellation rules require a check there) the event is
> lost.

read(2) is a cancellation point too. So if the fine userspace code issues
a random pthread_cancel() to a thread handling that, data is lost together
with the session that thread was handling. Hmm, I wonder how the world
could have functioned so far.
Bottom line is, if you really want to throw random cancels to your worker
threads, you better wrap them into pthread_cleanup_push(). Because
otherwise, no matter where your cancel hits, you end up with a broken
system.

- Davide

2007-05-07 22:47:28

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Mon, 7 May 2007, Ulrich Drepper wrote:

> On 5/5/07, Davi Arnaut <[email protected]> wrote:
> > A google search turns up a few users. It also addresses some complaints
> > from Drepper.
>
> There is a huge problem with this approach and we're back at the
> inadequate interface.
>
> select/poll/epoll are thread cancellation points. I.e., the thread
> can be canceled before returning to the user. If this cancellation
> happens between the kernel deciding to give this thread the event (and
> no other thread) and the thread testing for cancellation in the libc
> wrapper around the syscall, then the event is lost and the process(es)
> might hang.
>
> With kevent we in the end fixed the problem by requiring that part of
> the cancellation handling the thread tries to wake up another thread
> waiting for the event queue. This is easily possible since the event
> data is in the shared memory segment and it's just purely the thread
> wakeup that is needed.

So, by the same logic, every API that 1) returns something to userspace
by canceling its internal kernel state 2) is not based on shared
kernel/userspace memory, will break under your assumptions.
Scary, because there's a pretty long list.

- Davide

2007-05-07 23:15:59

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <[email protected]> wrote:
>> See Linus's message on this same thread.
>
> No. I'm talking about the userlevel side, not kernel side.

So you probably knew the answer before asking the question.

> If a thread is canceled *after* it returns from the syscall but before
> it reports the event to the call (i.e., while still in the syscall
> wrapper, thread cancellation rules require a check there) the event is
> lost.

Exactly. The same happens with sigwaitinfo(), and various other.

> Linus was only talking about the kernel side and how in the exit path
> such events are not lost.

Anyway, we could extend epoll to be mmapable...

--
Davi Arnaut

2007-05-08 02:32:52

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On 5/7/07, Davi Arnaut <[email protected]> wrote:
> Anyway, we could extend epoll to be mmapable...

Welcome to kevent, well, except with a lot more ballast and awkward interfaces.

2007-05-08 02:49:12

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On 5/7/07, Davide Libenzi <[email protected]> wrote:
> read(2) is a cancellation point too. So if the fine userspace code issues
> a random pthread_cancel() to a thread handling that, data is lost together
> with the session that thread was handling.

This is absolutely not comparable. When read/write is canceled no
data is lost. Some other thread might have to pick up the slack but
that's it.

With your poll extensions where one thread gets notified and then the
kernel forgets about the event information. Now this information is
only available in the thread itself. How can a cancellation handler
communication this to all the other threads waiting in poll()? Every
event which is reported and forgotten must be communication. To make
things more fun, the various waiters can be in different processes.

We went over this for kevent discussions. I really am not willing to
do it all again especially since there is no hope to achieve a
satisfying result with poll. So, don't count my silence as agreement,
it isn't.

2007-05-08 03:24:20

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

Ulrich Drepper wrote:
> On 5/7/07, Davi Arnaut <[email protected]> wrote:
>> Anyway, we could extend epoll to be mmapable...
>
> Welcome to kevent, well, except with a lot more ballast and awkward interfaces.

So an mmapable epoll is equivalent to kevent.. great! Well, expect
without a whole new giant bloated monolithic interface.

And this thread was about something else... :-)

--
Davi Arnaut

2007-05-08 03:57:21

by Kyle Moffett

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On May 07, 2007, at 22:49:04, Ulrich Drepper wrote:
> On 5/7/07, Davide Libenzi <[email protected]> wrote:
>> read(2) is a cancellation point too. So if the fine userspace code
>> issues a random pthread_cancel() to a thread handling that, data
>> is lost together with the session that thread was handling.
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack
> but that's it.

Uhh, how can you claim "no data is lost" in this scenario:

A bunch of threads shares a common fd and fpos, reading fixed-size
records:

thread1: read(myfd, per_thread_buf, 512);
thread2: pthread_cancel(thread1);
thread3: read(myfd, per_thread_buf, 512);

This is *exactly* analogous to the epoll() scenario: A bunch of
threads are attempting to get data (either events or records) from a
given object. If thread1 gets cancelled after the completion of its
read() syscall then the file position will have been advanced but the
data in the per-thread buffer will be discarded. The third thread
won't (reliably) get the same data as thread1. It might get it
randomly on occasion if file position updates race, but you can't
rely on that behavior.

Here's how you make both of those models safe against thread
cancellation (assuming, of course, that process_one_buf has no
cancellation points until after it saves the buffer state somewhere):

char per_thread_buf[512];
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
while (read(fd, per_thread_buf, 512) == 512) {
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
process_one_buffer(per_thread_buf);
pthread_testcancel();
pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, NULL);
}
pthread_exit(NULL);

Look how trivial that is! Isn't it fun? Of course, once you look at
it this way you could do the same verbose crap that pthread_cancel/
pthread_setcancelstate/pthread_testcancel are doing with a simple int
protected by a pthread_mutex.

Cheers,
Kyle Moffett

2007-05-08 04:36:10

by Linus Torvalds

[permalink] [raw]

Subject: Re: [PATCH] rfc: threaded epoll_wait thundering herd

On Mon, 7 May 2007, Ulrich Drepper wrote:
>
> This is absolutely not comparable. When read/write is canceled no
> data is lost. Some other thread might have to pick up the slack but
> that's it.

That's bullsh*t, Uli, and you know it.

Whatever the thread read() into it's buffer is effectively gone. You don't
know how much of the buffer was updated, so other threads cannot use the
data.

In fact, the exact *reverse* of what you claim is true. With "poll()" or
"select()", other threads *can* actually look at the result buffer, since
it has a known size and format that is independent of the return value,
unlike read().

But the real issue is that if you start cancellign threads without any
other synchronization, you are going to lose data anyway. Claiming
anything else is just silly. The whole scenario you talk about is
nonsensical, never mind that read() actually handles it *less* well rather
than better as you claim.

Linus

2007-05-08 06:30:31