Subject: Re: epoll and multiple processes - eliminate unneeded process wake-ups
From: Jason Baron
To: Madars Vitolins
Cc: Eric Wong, linux-kernel@vger.kernel.org
Date: Mon, 30 Nov 2015 14:45:40 -0500
Message-ID: <565CA764.1090805@akamai.com>

Hi Madars,

On 11/28/2015 05:54 PM, Madars Vitolins wrote:
> Hi Jason,
>
> I recently ran tests with multiprocessing and epoll() on POSIX queues.
> You were right about "EP_MAX_NESTS": it is not related to how many
> processes are woken up when multiple processes are waiting in
> epoll_wait() on one event source.
>
> With epoll, every process is added to the wait queue of every monitored
> event source. So when a message is sent to some queue (for example), all
> processes polling on it are woken up during the mq_timedsend() ->
> __do_notify() -> wake_up(&info->wait_q) kernel processing.
>
> So for one message to be processed by only one of the processes in
> epoll_wait(), the process needs to be added to the event source's wait
> queue with the exclusive flag set.
>
> I could create a kernel patch adding a new EPOLLEXCL flag, which would
> result in the following functionality:
>
> - fs/eventpoll.c
> ================================================================================
> /*
>  * This is the callback that is used to add our wait queue to the
>  * target file wakeup lists.
>  */
> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>                                  poll_table *pt)
> {
>         struct epitem *epi = ep_item_from_epqueue(pt);
>         struct eppoll_entry *pwq;
>
>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
>                 if (epi->event.events & EPOLLEXCL) {    /* <<<< New functionality here!!! */
>                         add_wait_queue_exclusive(whead, &pwq->wait);
>                 } else {
>                         add_wait_queue(whead, &pwq->wait);
>                 }
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
>                 /* We have to signal that an error occurred */
>                 epi->nwait = -1;
>         }
> }
> ================================================================================
>
> After testing with EPOLLEXCL set in my multiprocessing application
> framework (now open source: http://www.endurox.org/ :) ), the results
> were good: there were no extra wakeups, so processing is more efficient.
>

Cool. If you have any performance numbers to share, that would be more
supportive.

> Jason, what do you think - would mainline accept such a patch with the
> new flag? Or are there any concerns about this? Also, this would mean
> that the new flag needs to be added to the GNU C Library
> (/usr/include/sys/epoll.h).
>

This has come up several times - so imo I think it would be a reasonable
addition - but I'm only speaking for myself.
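For concreteness, here is a minimal user-space sketch of how such a flag
might be consumed on a POSIX queue. This is purely illustrative: EPOLLEXCL
(with the (1<<28) value Madars suggests below) exists only in this
proposal, not in the uapi or glibc headers, and the queue name and error
handling are placeholders.

================================================================================
/* Hypothetical consumer of the proposed EPOLLEXCL flag on a POSIX queue.
 * Build with: cc -o excl_demo excl_demo.c -lrt
 */
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCL
#define EPOLLEXCL (1u << 28)            /* proposed value, per this thread */
#endif

int main(void)
{
        struct mq_attr attr;
        struct epoll_event ev, out;
        char *buf;
        int ep;

        /* Non-blocking, so a process that loses the race just sees EAGAIN. */
        mqd_t mq = mq_open("/demo_q", O_RDONLY | O_CREAT | O_NONBLOCK, 0600, NULL);
        if (mq == (mqd_t)-1) {
                perror("mq_open");
                exit(1);
        }
        mq_getattr(mq, &attr);
        buf = malloc(attr.mq_msgsize);

        ep = epoll_create1(0);
        ev.events = EPOLLIN | EPOLLEXCL;        /* request exclusive wakeup */
        ev.data.fd = mq;
        if (epoll_ctl(ep, EPOLL_CTL_ADD, mq, &ev) < 0) {
                perror("epoll_ctl");
                exit(1);
        }

        for (;;) {
                if (epoll_wait(ep, &out, 1, -1) < 1)
                        continue;               /* interrupted */
                ssize_t n = mq_receive(mq, buf, attr.mq_msgsize, NULL);
                if (n < 0 && errno == EAGAIN)
                        continue;               /* another process got the message */
                if (n >= 0)
                        printf("got %zd bytes\n", n);
        }
}
================================================================================

With the exclusive flag, only one waiter should return from epoll_wait()
per message, so the EAGAIN path becomes the exception rather than the rule.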
In terms of implementation, it might make sense to return 0 from
ep_poll_callback() in case ep->wq is empty, so that we continue to search
for an active waiter. That way we service wakeups in a more timely manner
when some threads are busy. We probably also don't want to allow the flag
for nested epoll descriptors.

Thanks,

-Jason

> Or maybe somebody else who is familiar with the kernel's epoll
> functionality can comment on this?
>
> Regarding the flag's bitmask, it seems (1<<28) needs to be taken for
> EPOLLEXCL, since epoll_event.events is a 32-bit integer and the last
> bit, 1<<31, is used by EPOLLET (in include/uapi/linux/eventpoll.h).
>
> Thanks a lot in advance,
> Madars
>
>
> Jason Baron @ 2015-08-05 15:32 wrote:
>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>> Jason Baron @ 2015-08-04 18:02 wrote:
>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>> Madars Vitolins wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I am developing a kind of open-systems application, which uses
>>>>>> multiple processes/executables, each of which monitors some set
>>>>>> of resources (in this case POSIX queues) via the epoll interface.
>>>>>> For example, when 10 processes are in epoll_wait() on the same
>>>>>> queue and one message arrives, all 10 processes get woken up and
>>>>>> all of them try to read the message from the queue. One succeeds,
>>>>>> the others get EAGAIN. The problem is with those others, which
>>>>>> generate extra context switches - useless CPU usage. With more
>>>>>> processes the inefficiency gets higher.
>>>>>>
>>>>>> I tried to use EPOLLONESHOT, but it did not help. It seems
>>>>>> suitable for multi-threaded applications and not for
>>>>>> multi-process applications.
>>>>>
>>>>> Correct. Most FDs are not shared across processes.
>>>>>
>>>>>> The ideal mechanism for this would be:
>>>>>> 1. If multiple epoll sets in the kernel match the same event and
>>>>>> one or more processes are in epoll_wait() - then send the event
>>>>>> only to one waiter.
>>>>>> 2. If none of the processes are waiting, then send the event to
>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>> will grab the event.
>>>>>
>>>>> Jason Baron was working on this (search the LKML archives for
>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE).
>>>>>
>>>>> However, I was unconvinced about modifying epoll.
>>>>>
>>>>> Perhaps I may be more easily convinced about your mqueue case than
>>>>> his case for listen sockets, though [*].
>>>>>
>>>>
>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>> essentially walks the list of waiters, wakes up the first thread
>>>> that is actively in epoll_wait(), stops, and moves the woken-up
>>>> epoll set to the end of the list. So it attempts to balance
>>>> the wakeups among the epoll sets, I think in the way that you
>>>> were describing.
>>>>
>>>> Here is the patchset:
>>>>
>>>> https://lkml.org/lkml/2015/2/24/667
>>>>
>>>> The test program shows how to use the API. Essentially, you
>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>> which you then attach to your shared wakeup source and
>>>> then to your epoll sets. Please let me know if it's unclear.
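For comparison, here is a rough sketch of the 'dummy' epoll fd setup
described just above. It only applies to a kernel carrying the unmerged
EPOLL_ROTATE patchset linked earlier; the flag value and exact semantics
below are inferred from that description, so treat every detail as
illustrative rather than as the patchset's actual API.

================================================================================
#include <sys/epoll.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 2  /* placeholder; the real value comes from the patchset */
#endif

/*
 * Done once, before starting the workers: create the 'dummy' rotate epoll
 * fd and attach the shared wakeup source (listen socket, mqueue fd, ...)
 * to it.
 */
static int make_rotate_fd(int source_fd)
{
        struct epoll_event ev = { 0 };
        int rotate_ep = epoll_create1(EPOLL_ROTATE);

        ev.events = EPOLLIN;
        ev.data.fd = source_fd;
        epoll_ctl(rotate_ep, EPOLL_CTL_ADD, source_fd, &ev);
        return rotate_ep;
}

/*
 * Done in each worker: attach the shared rotate fd to this worker's own
 * epoll set.  Wakeups then rotate among the sets as described above.
 */
static int make_worker_set(int rotate_ep)
{
        struct epoll_event ev = { 0 };
        int worker_ep = epoll_create1(0);

        ev.events = EPOLLIN;
        ev.data.fd = rotate_ep;
        epoll_ctl(worker_ep, EPOLL_CTL_ADD, rotate_ep, &ev);
        return worker_ep;       /* the worker calls epoll_wait() on this */
}
================================================================================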
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>
>>> In my particular case I need to work with multiple
>>> processes/executables (not threads) listening on the same queues.
>>> This concept allows a sysadmin to easily manage those processes
>>> (start new ones for balancing or stop them without service
>>> interruption), and if any process dies for some reason (signal, core,
>>> etc.), the whole application does not get killed; only one
>>> transaction is lost.
>>>
>>> Recently I did tests and found that the kernel's epoll currently
>>> sends notifications to 4 processes (I think it is the EP_MAX_NESTS
>>> constant) waiting on the same resource (the other 6 from my example
>>> stay asleep). So it is not as bad as I thought before.
>>> It would be nice if EP_MAX_NESTS could be configurable, but I guess 4
>>> is fine too.
>>>
>>
>> hmmm... EP_MAX_NESTS is about the level of 'nesting' of epoll sets,
>> i.e. you can do ep1->ep2->ep3->ep4, but you can't add in an 'ep5'.
>> The 'epN' above represent epoll file descriptors that are attached
>> together via EPOLL_CTL_ADD.
>>
>> The nesting does not affect how wakeups are done. All epoll fds
>> that are attached to the event source fd are going to get wakeups.
>>
>>
>>> Jason, does your patch work for a multi-process application? How hard
>>> would it be to implement this for such a scenario?
>>
>> I don't think it would be too hard, but it requires:
>>
>> 1) adding the patches
>> 2) re-compiling and running the new kernel
>> 3) modifying your app to the new API
>>
>> Thanks,
>>
>> -Jason
>>
>>
>>>
>>> Madars
>>>
>>>>
>>>>> Typical applications have few (probably only one) listen sockets or
>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>
>>>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>>>> herds.
>>>>>
>>>>>> What do you think - would it be realistic to implement this? How
>>>>>> about concurrency?
>>>>>> Can you please give me some hints on which points in the code to
>>>>>> start from to implement these changes?
>>>>>
>>>>> For now, I suggest dedicating a thread in each process to do
>>>>> mq_timedreceive/mq_receive, assuming you only have a small number
>>>>> of queues in your system.
>>>>>
>>>>>
>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>> staying on the same CPU as much as possible.
>>>>> By contrast, accept4 only creates a client socket. With a C10K+
>>>>> socket server (e.g. http/memcached/DB), a typical new client
>>>>> socket spends a fair amount of time idle. Thus I don't believe
>>>>> memory locality inside the kernel is much of a concern when there
>>>>> are thousands of accepted client sockets.
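Finally, a minimal sketch of the dedicated-receiver-thread approach Eric
suggests above, assuming a single queue; the queue name and the
handle_message() stub are illustrative. Because the thread blocks in
mq_receive() rather than polling, the wakeup is exclusive (per Eric's
point about blocking syscalls) and the EAGAIN churn disappears without
any epoll changes.

================================================================================
/* Build with: cc -o mq_worker mq_worker.c -lrt -pthread */
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Application-specific message handler (illustrative stub). */
static void handle_message(const char *msg, size_t len)
{
        (void)msg;
        printf("got %zu bytes\n", len);
}

static void *receiver(void *arg)
{
        const char *qname = arg;
        struct mq_attr attr;
        char *buf;

        mqd_t mq = mq_open(qname, O_RDONLY);    /* blocking descriptor */
        if (mq == (mqd_t)-1) {
                perror("mq_open");
                return NULL;
        }
        mq_getattr(mq, &attr);
        buf = malloc(attr.mq_msgsize);

        for (;;) {
                /* Blocks until a message arrives; only one blocked
                 * receiver is handed each message. */
                ssize_t n = mq_receive(mq, buf, attr.mq_msgsize, NULL);
                if (n >= 0)
                        handle_message(buf, (size_t)n);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;

        /* One receiver thread per process; the process can keep using
         * epoll for its other, non-shared descriptors. */
        pthread_create(&tid, NULL, receiver, "/demo_q");
        pthread_join(tid, NULL);
        return 0;
}
================================================================================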