Message-ID: <55C0D400.3070904@akamai.com>
Date: Tue, 04 Aug 2015 11:02:24 -0400
From: Jason Baron
To: Eric Wong, Madars Vitolins
Cc: linux-kernel@vger.kernel.org
Subject: Re: epoll and multiple processes - eliminate unneeded process wake-ups
In-Reply-To: <20150803234842.GA21995@dcvr.yhbt.net>

On 08/03/2015 07:48 PM, Eric Wong wrote:
> Madars Vitolins wrote:
>> Hi Folks,
>>
>> I am developing a kind of open systems application, which uses
>> multiple processes/executables, each of which monitors some set of
>> resources (in this case POSIX queues) via the epoll interface. For
>> example, when 10 processes are in epoll_wait() on the same queue and
>> one message arrives, all 10 processes get woken up and all of them
>> try to read the message from the queue. One succeeds, the others get
>> EAGAIN. The problem is with those others, which generate extra
>> context switches - useless CPU usage. With more processes the
>> inefficiency gets higher.
>>
>> I tried to use EPOLLONESHOT, but it did not help. It seems to be
>> suitable for multi-threaded applications and not for multi-process
>> applications.
>
> Correct. Most FDs are not shared across processes.
>
>> The ideal mechanism for this would be:
>> 1. If multiple epoll sets in the kernel match the same event and one
>> or more processes are in epoll_wait() - then send the event to only
>> one waiter.
>> 2. If no process is waiting, then send the event to all epoll sets
>> (as it is currently). Then the first free process will grab the
>> event.
>
> Jason Baron was working on this (search the LKML archives for
> EPOLLEXCLUSIVE, EPOLLROUNDROBIN, EPOLL_ROTATE).
>
> However, I was unconvinced about modifying epoll.
>
> Perhaps I may be more easily convinced about your mqueue case than
> his case for listen sockets, though [*].
>

Yeah, so I implemented an 'EPOLL_ROTATE' mode, where you could have
multiple epoll fds (or epoll sets) attached to the same wakeup source,
and have the wakeups 'rotate' among the epoll sets. The wakeup
essentially walks the list of waiters, wakes up the first thread that
is actively in epoll_wait(), stops, and moves the woken-up epoll set
to the end of the list. So it attempts to balance the wakeups among
the epoll sets, I think in the way that you were describing.

Here is the patchset:

https://lkml.org/lkml/2015/2/24/667

The test program shows how to use the API. Essentially, you have to
create a 'dummy' epoll fd with the 'EPOLL_ROTATE' flag, which you then
attach to your shared wakeup source and then to your epoll sets.
Please let me know if it's unclear; a rough usage sketch based on this
description is also appended at the end of this message.

Thanks,

-Jason

> Typical applications have few (probably only one) listen sockets or
> POSIX mqueues; so I would rather use dedicated threads to issue
> blocking syscalls (accept4 or mq_timedreceive).
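
For the mqueue case, that dedicated-thread approach could look roughly
like the sketch below: one blocking receiver thread per process. The
function names, the queue name "/myq", the buffer size, and the
five-second timeout are illustrative assumptions, not something taken
from this thread:

#include <mqueue.h>
#include <pthread.h>
#include <fcntl.h>      /* O_* constants for mq_open() */
#include <time.h>
#include <errno.h>
#include <stdio.h>

/* Runs in a dedicated thread in each process. */
static void *mq_worker(void *arg)
{
    mqd_t mq = *(mqd_t *)arg;
    char buf[8192];     /* must be >= the queue's mq_msgsize */

    for (;;) {
        struct timespec ts;

        clock_gettime(CLOCK_REALTIME, &ts);
        ts.tv_sec += 5; /* wake periodically to check for shutdown, etc. */

        ssize_t n = mq_timedreceive(mq, buf, sizeof(buf), NULL, &ts);
        if (n >= 0) {
            /* process the message in buf[0..n) */
        } else if (errno != ETIMEDOUT && errno != EINTR) {
            perror("mq_timedreceive");
            break;
        }
    }
    return NULL;
}

/* Called once at process startup; "/myq" is an illustrative queue name. */
static int start_mq_worker(pthread_t *tid)
{
    static mqd_t mq;    /* static so the pointer stays valid for the thread */

    mq = mq_open("/myq", O_RDONLY);
    if (mq == (mqd_t)-1)
        return -1;
    return pthread_create(tid, NULL, mq_worker, &mq);
}

Only one of the threads blocked in mq_timedreceive() is woken per
message, so the other processes stay asleep instead of racing and
hitting EAGAIN. (Build with -pthread and, on older glibc, -lrt for the
mq_* functions.)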
>
> Making blocking syscalls allows exclusive wakeups to avoid thundering
> herds.
>
>> Do you think it would be feasible to implement this? How about
>> concurrency?
>> Can you please give me some hints about where in the code to start
>> implementing these changes?
>
> For now, I suggest dedicating a thread in each process to do
> mq_timedreceive/mq_receive, assuming you only have a small number of
> queues in your system.
>
>
> [*] mq_timedreceive may copy a largish buffer, which benefits from
> staying on the same CPU as much as possible.
> In contrast, accept4 only creates a client socket. With a C10K+
> socket server (e.g. http/memcached/DB), a typical new client socket
> spends a fair amount of time idle. Thus I don't believe memory
> locality inside the kernel is much of a concern when there are
> thousands of accepted client sockets.
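
For reference, here is a rough sketch of how the EPOLL_ROTATE usage
described earlier might look. It is reconstructed only from the
description in this message: EPOLL_ROTATE comes from the linked
(non-mainline) patchset, the placeholder flag value and the exact
epoll_ctl() plumbing below are assumptions, and the patchset's test
program is the authoritative example.

#include <sys/epoll.h>
#include <mqueue.h>

#ifndef EPOLL_ROTATE
#define EPOLL_ROTATE 0x2    /* placeholder; the real value is defined by the patchset */
#endif

/*
 * Create the 'dummy' epoll fd with the proposed EPOLL_ROTATE flag,
 * attach the shared wakeup source (here a POSIX mqueue) to it, and
 * then attach the dummy fd to this process's own epoll set (my_epfd).
 * Wakeups should then rotate among the epoll sets sharing the dummy fd.
 */
static int attach_rotating_source(int my_epfd, mqd_t mq)
{
    struct epoll_event ev = { .events = EPOLLIN };
    int rotate_fd;

    rotate_fd = epoll_create1(EPOLL_ROTATE);
    if (rotate_fd < 0)
        return -1;

    ev.data.fd = (int)mq;   /* on Linux, mqd_t is a file descriptor */
    if (epoll_ctl(rotate_fd, EPOLL_CTL_ADD, (int)mq, &ev) < 0)
        return -1;

    ev.data.fd = rotate_fd;
    if (epoll_ctl(my_epfd, EPOLL_CTL_ADD, rotate_fd, &ev) < 0)
        return -1;

    return rotate_fd;
}

Each process would build its own epoll set as usual and call this once
against the shared queue; which fd gets passed to epoll_ctl() at each
step is exactly the kind of detail to take from the patchset's test
program rather than from this sketch.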