Date: Mon, 07 Oct 2019 20:30:59 +0200
From: Roman Penyaev <rpenyaev@suse.de>
To: Jason Baron
Cc: Andrew Morton, hev, linux-fsdevel@vger.kernel.org, Al Viro,
 Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Eric Wong,
 Linus Torvalds, Sridhar Samudrala, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
On 2019-10-07 18:42, Jason Baron wrote:
> On 10/7/19 6:54 AM, Roman Penyaev wrote:
>> On 2019-10-03 18:13, Jason Baron wrote:
>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>>>> On 2019-09-28 04:29, Andrew Morton wrote:
>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev wrote:
>>>>>
>>>>>> From: Heiher
>>>>>>
>>>>>> Take the case where we have:
>>>>>>
>>>>>>         t0
>>>>>>          | (ew)
>>>>>>         e0
>>>>>>          | (et)
>>>>>>         e1
>>>>>>          | (lt)
>>>>>>         s0
>>>>>>
>>>>>> t0: thread 0
>>>>>> e0: epoll fd 0
>>>>>> e1: epoll fd 1
>>>>>> s0: socket fd 0
>>>>>> ew: epoll_wait
>>>>>> et: edge-trigger
>>>>>> lt: level-trigger
>>>>>>
>>>>>> We only need to wake up nested epoll fds if something has been
>>>>>> queued to the overflow list, since ep_poll() traverses the
>>>>>> rdllist during a recursive poll and thus events on the overflow
>>>>>> list may not be visible yet.
>>>>>>
>>>>>> Test code:
>>>>>
>>>>> Looks sane to me.  Do you have any performance testing results
>>>>> which show a benefit?
>>>>>
>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>>>> Roman, would you please have time to review this?
>>>>
>>>> So here is my observation: the current patch does not fix the
>>>> described problem (double wakeup) for the case when a new event
>>>> comes exactly to the ->ovflist.  According to the patch this is
>>>> the desired intention:
>>>>
>>>>    /*
>>>>     * We only need to wakeup nested epoll fds if something has
>>>>     * been queued to the overflow list, since the ep_poll()
>>>>     * traverses the rdllist during recursive poll and thus events
>>>>     * on the overflow list may not be visible yet.
>>>>     */
>>>>     if (nepi != NULL)
>>>>        pwake++;
>>>>
>>>>     ....
>>>>
>>>>     if (pwake == 2)
>>>>        ep_poll_safewake(&ep->poll_wait);
>>>>
>>>> but this actually means that we repeat the same behavior (double
>>>> wakeup), only now for the case when the event comes to the
>>>> ->ovflist.
>>>>
>>>> How to reproduce? It can be done (ok, not so easily, but it is
>>>> possible to try): to the given userspace test we need to add one
>>>> more socket and immediately fire the event:
>>>>
>>>>     e.events = EPOLLIN;
>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>>>        goto out;
>>>>
>>>>     /*
>>>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list()
>>>>      * in order to "catch" it there and add the new event to
>>>>      * ->ovflist.
>>>>      */
>>>>      if (write(s2fd[1], "w", 1) != 1)
>>>>         goto out;
>>>>
>>>> That is done in order to let the following epoll_wait() call
>>>> invoke ep_scan_ready_list(), where we can "catch" and insert the
>>>> new event exactly into the ->ovflist. In order to insert the event
>>>> into the correct list I introduce an artificial delay.
>>>>
>>>> Modified test and kernel patch are below.  Here is the output of
>>>> the testing tool with some debug lines from the kernel:
>>>>
>>>>   # ~/devel/test/edge-bug
>>>>   [   59.263178] ### sleep 2
>>>>   >> write to sock
>>>>   [   61.318243] ### done sleep
>>>>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait);
>>>> events_in_rdllist=1, events_in_ovflist=1
>>>>   [   61.321204] ### sleep 2
>>>>   [   63.398325] ### done sleep
>>>>   error: What?! Again?!
>>>>
>>>> First epoll_wait() call (ep_scan_ready_list()) observes 2 events
>>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what
>>>> we wanted to achieve, so eventually ep_poll_safewake() is called
>>>> again, which leads to the double wakeup.
>>>>
>>>> In my opinion the current patch as it is should be dropped; it
>>>> does not fix the described problem but just hides it.
>>>>
>>>> --
>>
>> Hi Jason,
>>
>>>
>>> Yes, there are 2 wakeups in the test case you describe, but if the
>>> second event (write to s1fd) gets queued after the first call to
>>> epoll_wait(), we are going to get 2 wakeups anyway.
>>
>> Yes, exactly. For this reason I print out the number of events
>> observed on the first wait: there should be 1 (rdllist) and 1
>> (ovflist). Otherwise this is another case, when the second event
>> comes just after the first wait, which is a legitimate wakeup.
>>
>>> So yes, there may be a slightly bigger window with this patch for 2
>>> wakeups, but it's small, and I tried to be conservative with the
>>> patch - I'd rather get an occasional 2nd wakeup than miss one.
>>> Trying to debug missing wakeups isn't always fun...
>>>
>>> That said, the reason for propagating events that end up on the
>>> overflow list was to prevent the race of the wakee not seeing
>>> events because they were still on the overflow list. In the
>>> testcase, imagine if there was a thread doing epoll_wait() on
>>> efd[0], and then a write happens on s1fd. I thought it was possible
>>> then that a 2nd thread doing epoll_wait() on efd[1] wakes up,
>>> checks efd[0] and sees no events because they are still potentially
>>> on the overflow list. However, I think that case is not possible,
>>> because the thread doing epoll_wait() on efd[0] is going to hold
>>> the ep->mtx, and thus when the thread wakes up on efd[1], it's
>>> going to be ordered, because it's also grabbing the ep->mtx
>>> associated with efd[0].
>>>
>>> So I think it's safe to do the following if we want to go further
>>> than the proposed patch, which is what you suggested earlier in the
>>> thread (minus keeping the wakeup on ep->wq).
>>
>> Then I do not understand why we need to keep the ep->wq wakeup.
>> @wq and @poll_wait are almost the same, with only one difference:
>> @wq is used when you sleep on it inside epoll_wait(), and the other
>> is used when you nest an epoll fd inside an epoll fd.  Either you
>> wake both up, or you don't do it at all.
>>
>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
>> ep_modify() do the wakeup explicitly, so what are the cases when we
>> need to do wakeups from ep_scan_ready_list()?
>
> Hi Roman,
>
> So the reason I was saying not to drop the ep->wq wakeup was that I
> was thinking about a usecase where you have multiple threads, say
> thread A and thread B, which are doing epoll_wait() on the same epfd.
> Now, the threads both call epoll_wait() and are added as exclusive to
> ep->wq. Now a bunch of events happen and thread A is woken up.
> However, thread A may only process a subset of the events due to its
> 'maxevents' parameter. In that case, I was thinking that the wakeup
> on ep->wq might be helpful, because in the absence of subsequent
> events, thread B could start processing the rest immediately, instead
> of waiting for the next event to be queued.
>
> However, I was thinking about the state of things before
> 86c0517 ("fs/epoll: deal with wait_queue only once").
>
> Before that patch, thread A would have been removed from ep->wq
> before the wakeup call, thus waking up thread B. However, now that
> thread A stays on the queue during the call to ep_send_events(), I
> believe the wakeup would only target thread A, which doesn't help,
> since it's already checking for events. So given the state of things
> I think you are right in that it's not needed. However, I wonder if
> not removing from the ep->wq affects the multi-threaded case I
> described. It's been around since 5.0, so probably not, but it would
> be a more subtle performance difference.

Now I understand what you mean. You want to prevent "idling" of
events while thread A is busy with a small portion of the events (the
portion size is equal to maxevents). On the next iteration thread A
will pick up the rest, no doubt, but it would be nice to give thread
B a chance to deal with the rest immediately. Ok, makes sense.

But what if we make this wakeup explicit when there are more events
to process? (nothing is tested, just a guess)

@@ -255,6 +255,7 @@ struct ep_pqueue {
 struct ep_send_events_data {
 	int maxevents;
 	struct epoll_event __user *events;
+	bool have_more;
 	int res;
 };

@@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
 }

 static int ep_send_events(struct eventpoll *ep,
-			  struct epoll_event __user *events, int maxevents)
+			  struct epoll_event __user *events, int maxevents,
+			  bool *have_more)
 {
-	struct ep_send_events_data esed;
-
-	esed.maxevents = maxevents;
-	esed.events = events;
+	struct ep_send_events_data esed = {
+		.maxevents = maxevents,
+		.events = events,
+	};

 	ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
+	*have_more = esed.have_more;
+
 	return esed.res;
 }

@@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 {
 	int res = 0, eavail, timed_out = 0;
 	u64 slack = 0;
-	bool waiter = false;
+	bool waiter = false, have_more;
 	wait_queue_entry_t wait;
 	ktime_t expires, *to = NULL;

@@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 	 * more luck.
 	 */
 	if (!res && eavail &&
-	    !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
+	    !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
+	    !timed_out)
 		goto fetch_events;

 	if (waiter) {
@@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 		__remove_wait_queue(&ep->wq, &wait);
 		spin_unlock_irq(&ep->wq.lock);
 	}
+	/*
+	 * We were not able to process all the events, so immediately
+	 * wake up another waiter.
+	 */
+	if (res > 0 && have_more && waitqueue_active(&ep->wq))
+		wake_up(&ep->wq);

 	return res;
 }

PS. So what do we decide about the original patch? Remove the whole
branch?
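P.P.S. For anyone who wants to play with the plain topology from the
top of the thread (t0 -> e0 -(et)-> e1 -(lt)-> s0), below is a rough
userspace sketch of the double-wakeup symptom. This is only an
untested illustration, not the test code from the original patch
(which is trimmed in the quotes above); the fd names and error
message are made up to mirror the discussion:

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
	struct epoll_event e;
	int efd[2], sfd[2];

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
		return 1;

	efd[0] = epoll_create1(0);	/* e0, waited on by t0 */
	efd[1] = epoll_create1(0);	/* e1, nested inside e0 */
	if (efd[0] < 0 || efd[1] < 0)
		return 1;

	/* e1 watches s0 level-triggered */
	e.events = EPOLLIN;
	e.data.fd = sfd[0];
	if (epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) < 0)
		return 1;

	/* e0 watches e1 edge-triggered */
	e.events = EPOLLIN | EPOLLET;
	e.data.fd = efd[1];
	if (epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) < 0)
		return 1;

	/* fire a single event on the socket: one edge for e0 */
	if (write(sfd[1], "w", 1) != 1)
		return 1;

	/* the first wait must see exactly that one event */
	if (epoll_wait(efd[0], &e, 1, 1000) != 1) {
		fprintf(stderr, "error: first event lost\n");
		return 1;
	}

	/*
	 * e1 is edge-triggered inside e0 and nothing new has happened,
	 * so this wait should time out; with the spurious nested
	 * wakeup the same edge is reported again.
	 */
	if (epoll_wait(efd[0], &e, 1, 0) == 1) {
		fprintf(stderr, "error: What?! Again?!\n");
		return 1;
	}
	return 0;
}

--
Roman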