Subject: Re: epoll and multiple processes - eliminate unneeded process wake-ups
From: Jason Baron
To: Madars Vitolins
Cc: Eric Wong, linux-kernel@vger.kernel.org
Date: Mon, 30 Nov 2015 14:45:40 -0500
Message-ID: <565CA764.1090805@akamai.com>

Hi Madars,

On 11/28/2015 05:54 PM, Madars Vitolins wrote:
> Hi Jason,
>
> I recently ran tests with multiprocessing and epoll() on POSIX queues.
> You were right about "EP_MAX_NESTS": it is not related to how many
> processes are woken up when multiple processes are waiting in
> epoll_wait() on one event source.
>
> With epoll, every process is added to the wait queue of every monitored
> event source. So when a message is sent to some queue (for example), all
> processes polling on it are woken up during the mq_timedsend() ->
> __do_notify() -> wake_up(&info->wait_q) kernel processing.
>
> So for one message to be processed by only one of the processes in
> epoll_wait(), the process needs to be added to the event source's wait
> queue with the exclusive flag set.
>
> I could create a kernel patch adding a new EPOLLEXCL flag, which would
> result in the following functionality:
>
> - fs/eventpoll.c
> ================================================================================
> /*
>  * This is the callback that is used to add our wait queue to the
>  * target file wakeup lists.
>  */
> static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
>                                  poll_table *pt)
> {
>         struct epitem *epi = ep_item_from_epqueue(pt);
>         struct eppoll_entry *pwq;
>
>         if (epi->nwait >= 0 && (pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL))) {
>                 init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
>                 pwq->whead = whead;
>                 pwq->base = epi;
>                 if (epi->event.events & EPOLLEXCL) {    /* <<<< New functionality here!!! */
>                         add_wait_queue_exclusive(whead, &pwq->wait);
>                 } else {
>                         add_wait_queue(whead, &pwq->wait);
>                 }
>                 list_add_tail(&pwq->llink, &epi->pwqlist);
>                 epi->nwait++;
>         } else {
>                 /* We have to signal that an error occurred */
>                 epi->nwait = -1;
>         }
> }
> ================================================================================
>
> After testing with EPOLLEXCL set in my multiprocessing application
> framework (now open source: http://www.endurox.org/ :) ), the results
> were good: there were no extra wakeups, so processing is more efficient.
>

Cool. If you have any performance numbers to share, that would be more
supportive.

> Jason, what do you think - would mainline accept such a patch with the
> new flag? Or are there any concerns about this? Also, this would mean
> that the new flag needs to be added to the GNU C Library
> (/usr/include/sys/epoll.h).
>

This has come up several times - so imo I think it would be a reasonable
addition - but I'm only speaking for myself.
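For concreteness, here is a minimal user-space sketch of how such a flag
might be consumed on a POSIX queue. This is purely illustrative: EPOLLEXCL
(with the (1<<28) value Madars suggests below) exists only in this
proposal, not in the uapi or glibc headers, and the queue name and error
handling are placeholders.

================================================================================
/* Hypothetical consumer of the proposed EPOLLEXCL flag on a POSIX queue.
 * Build with: cc -o excl_demo excl_demo.c -lrt
 */
#include <errno.h>
#include <fcntl.h>
#include <mqueue.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>

#ifndef EPOLLEXCL
#define EPOLLEXCL (1u << 28)            /* proposed value, per this thread */
#endif

int main(void)
{
        struct mq_attr attr;
        struct epoll_event ev, out;
        char *buf;
        int ep;

        /* Non-blocking, so a process that loses the race just sees EAGAIN. */
        mqd_t mq = mq_open("/demo_q", O_RDONLY | O_CREAT | O_NONBLOCK, 0600, NULL);
        if (mq == (mqd_t)-1) {
                perror("mq_open");
                exit(1);
        }
        mq_getattr(mq, &attr);
        buf = malloc(attr.mq_msgsize);

        ep = epoll_create1(0);
        ev.events = EPOLLIN | EPOLLEXCL;        /* request exclusive wakeup */
        ev.data.fd = mq;
        if (epoll_ctl(ep, EPOLL_CTL_ADD, mq, &ev) < 0) {
                perror("epoll_ctl");
                exit(1);
        }

        for (;;) {
                if (epoll_wait(ep, &out, 1, -1) < 1)
                        continue;               /* interrupted */
                ssize_t n = mq_receive(mq, buf, attr.mq_msgsize, NULL);
                if (n < 0 && errno == EAGAIN)
                        continue;               /* another process got the message */
                if (n >= 0)
                        printf("got %zd bytes\n", n);
        }
}
================================================================================

With the exclusive flag, only one waiter should return from epoll_wait()
per message, so the EAGAIN path becomes the exception rather than the rule.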
In terms of implementation, it might make sense to return 0 from
ep_poll_callback() in case ep->wq is empty, so that we continue to search
for an active waiter. That way we service wakeups in a more timely manner
when some threads are busy. We probably also don't want to allow the flag
for nested epoll descriptors.

Thanks,

-Jason

> Or maybe somebody else who is familiar with the kernel's epoll
> functionality can comment on this?
>
> Regarding the flag's bitmask, it seems (1<<28) needs to be taken for
> EPOLLEXCL, since epoll_event.events is a 32-bit integer and the last
> bit, 1<<31, is used by EPOLLET (in include/uapi/linux/eventpoll.h).
>
> Thanks a lot in advance,
> Madars
>
>
> Jason Baron @ 2015-08-05 15:32 wrote:
>> On 08/05/2015 07:06 AM, Madars Vitolins wrote:
>>> Jason Baron @ 2015-08-04 18:02 wrote:
>>>> On 08/03/2015 07:48 PM, Eric Wong wrote:
>>>>> Madars Vitolins wrote:
>>>>>> Hi Folks,
>>>>>>
>>>>>> I am developing a kind of open-systems application, which uses
>>>>>> multiple processes/executables, each of which monitors some set
>>>>>> of resources (in this case POSIX queues) via the epoll interface.
>>>>>> For example, when 10 processes are in epoll_wait() on the same
>>>>>> queue and one message arrives, all 10 processes get woken up and
>>>>>> all of them try to read the message from the queue. One succeeds,
>>>>>> the others get EAGAIN. The problem is with those others, which
>>>>>> generate extra context switches - useless CPU usage. With more
>>>>>> processes the inefficiency gets higher.
>>>>>>
>>>>>> I tried to use EPOLLONESHOT, but it did not help. It seems
>>>>>> suitable for multi-threaded applications and not for
>>>>>> multi-process applications.
>>>>>
>>>>> Correct. Most FDs are not shared across processes.
>>>>>
>>>>>> The ideal mechanism for this would be:
>>>>>> 1. If multiple epoll sets in the kernel match the same event and
>>>>>> one or more processes are in epoll_wait() - then send the event
>>>>>> only to one waiter.
>>>>>> 2. If none of the processes are waiting, then send the event to
>>>>>> all epoll sets (as it is currently). Then the first free process
>>>>>> will grab the event.
>>>>>
>>>>> Jason Baron was working on this (search the LKML archives for
>>>>> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE).
>>>>>
>>>>> However, I was unconvinced about modifying epoll.
>>>>>
>>>>> Perhaps I may be more easily convinced about your mqueue case than
>>>>> his case for listen sockets, though [*].
>>>>>
>>>>
>>>> Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
>>>> multiple epoll fds (or epoll sets) attached to the same wakeup source,
>>>> and have the wakeups 'rotate' among the epoll sets. The wakeup
>>>> essentially walks the list of waiters, wakes up the first thread
>>>> that is actively in epoll_wait(), stops, and moves the woken-up
>>>> epoll set to the end of the list. So it attempts to balance
>>>> the wakeups among the epoll sets, I think in the way that you
>>>> were describing.
>>>>
>>>> Here is the patchset:
>>>>
>>>> https://lkml.org/lkml/2015/2/24/667
>>>>
>>>> The test program shows how to use the API. Essentially, you
>>>> have to create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag,
>>>> which you then attach to your shared wakeup source and
>>>> then to your epoll sets. Please let me know if it's unclear.
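For comparison, here is a rough sketch of the 'dummy' epoll fd setup
described just above. It only applies to a kernel carrying the unmerged
EPOLL_ROTATE patchset linked earlier; the flag value and exact semantics
below are inferred from that description, so treat every detail as
illustrative rather than as the patchset's actual API.

================================================================================
#include <sys/epoll.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 2  /* placeholder; the real value comes from the patchset */
#endif

/*
 * Done once, before starting the workers: create the 'dummy' rotate epoll
 * fd and attach the shared wakeup source (listen socket, mqueue fd, ...)
 * to it.
 */
static int make_rotate_fd(int source_fd)
{
        struct epoll_event ev = { 0 };
        int rotate_ep = epoll_create1(EPOLL_ROTATE);

        ev.events = EPOLLIN;
        ev.data.fd = source_fd;
        epoll_ctl(rotate_ep, EPOLL_CTL_ADD, source_fd, &ev);
        return rotate_ep;
}

/*
 * Done in each worker: attach the shared rotate fd to this worker's own
 * epoll set.  Wakeups then rotate among the sets as described above.
 */
static int make_worker_set(int rotate_ep)
{
        struct epoll_event ev = { 0 };
        int worker_ep = epoll_create1(0);

        ev.events = EPOLLIN;
        ev.data.fd = rotate_ep;
        epoll_ctl(worker_ep, EPOLL_CTL_ADD, rotate_ep, &ev);
        return worker_ep;       /* the worker calls epoll_wait() on this */
}
================================================================================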
>>>>
>>>> Thanks,
>>>>
>>>> -Jason
>>>
>>> In my particular case I need to work with multiple
>>> processes/executables (not threads) listening on the same queues.
>>> This concept allows a sysadmin to easily manage those processes
>>> (start new ones for balancing or stop them without service
>>> interruption), and if any process dies for some reason (signal, core,
>>> etc.), the whole application does not get killed; only one
>>> transaction is lost.
>>>
>>> Recently I did tests and found that the kernel's epoll currently
>>> sends notifications to 4 processes (I think it is the EP_MAX_NESTS
>>> constant) waiting on the same resource (the other 6 from my example
>>> stay asleep). So it is not as bad as I thought before.
>>> It would be nice if EP_MAX_NESTS could be configurable, but I guess 4
>>> is fine too.
>>>
>>
>> hmmm... EP_MAX_NESTS is about the level of 'nesting' of epoll sets,
>> i.e. you can do ep1->ep2->ep3->ep4, but you can't add in an 'ep5'.
>> The 'epN' above represent epoll file descriptors that are attached
>> together via EPOLL_CTL_ADD.
>>
>> The nesting does not affect how wakeups are done. All epoll fds
>> that are attached to the event source fd are going to get wakeups.
>>
>>
>>> Jason, does your patch work for a multi-process application? How hard
>>> would it be to implement this for such a scenario?
>>
>> I don't think it would be too hard, but it requires:
>>
>> 1) adding the patches
>> 2) re-compiling and running the new kernel
>> 3) modifying your app to the new API
>>
>> Thanks,
>>
>> -Jason
>>
>>
>>>
>>> Madars
>>>
>>>>
>>>>> Typical applications have few (probably only one) listen sockets or
>>>>> POSIX mqueues; so I would rather use dedicated threads to issue
>>>>> blocking syscalls (accept4 or mq_timedreceive).
>>>>>
>>>>> Making blocking syscalls allows exclusive wakeups to avoid thundering
>>>>> herds.
>>>>>
>>>>>> What do you think - would it be realistic to implement this? How
>>>>>> about concurrency?
>>>>>> Can you please give me some hints on which points in the code to
>>>>>> start from to implement these changes?
>>>>>
>>>>> For now, I suggest dedicating a thread in each process to do
>>>>> mq_timedreceive/mq_receive, assuming you only have a small number
>>>>> of queues in your system.
>>>>>
>>>>>
>>>>> [*] mq_timedreceive may copy a largish buffer which benefits from
>>>>> staying on the same CPU as much as possible.
>>>>> By contrast, accept4 only creates a client socket. With a C10K+
>>>>> socket server (e.g. http/memcached/DB), a typical new client
>>>>> socket spends a fair amount of time idle. Thus I don't believe
>>>>> memory locality inside the kernel is much of a concern when there
>>>>> are thousands of accepted client sockets.
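Finally, a minimal sketch of the dedicated-receiver-thread approach Eric
suggests above, assuming a single queue; the queue name and the
handle_message() stub are illustrative. Because the thread blocks in
mq_receive() rather than polling, the wakeup is exclusive (per Eric's
point about blocking syscalls) and the EAGAIN churn disappears without
any epoll changes.

================================================================================
/* Build with: cc -o mq_worker mq_worker.c -lrt -pthread */
#include <fcntl.h>
#include <mqueue.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Application-specific message handler (illustrative stub). */
static void handle_message(const char *msg, size_t len)
{
        (void)msg;
        printf("got %zu bytes\n", len);
}

static void *receiver(void *arg)
{
        const char *qname = arg;
        struct mq_attr attr;
        char *buf;

        mqd_t mq = mq_open(qname, O_RDONLY);    /* blocking descriptor */
        if (mq == (mqd_t)-1) {
                perror("mq_open");
                return NULL;
        }
        mq_getattr(mq, &attr);
        buf = malloc(attr.mq_msgsize);

        for (;;) {
                /* Blocks until a message arrives; only one blocked
                 * receiver is handed each message. */
                ssize_t n = mq_receive(mq, buf, attr.mq_msgsize, NULL);
                if (n >= 0)
                        handle_message(buf, (size_t)n);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;

        /* One receiver thread per process; the process can keep using
         * epoll for its other, non-shared descriptors. */
        pthread_create(&tid, NULL, receiver, "/demo_q");
        pthread_join(tid, NULL);
        return 0;
}
================================================================================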