From: Heiher
Date: Wed, 9 Oct 2019 14:03:43 +0800
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
To: Roman Penyaev
Cc: Jason Baron, Andrew Morton, linux-fsdevel@vger.kernel.org, Al Viro,
 Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Eric Wong,
 Linus Torvalds, Sridhar Samudrala, linux-kernel@vger.kernel.org

Hi,

On Tue, Oct 8, 2019 at 3:10 AM Roman Penyaev wrote:
>
> On 2019-10-07 20:43, Jason Baron wrote:
> > On 10/7/19 2:30 PM, Roman Penyaev wrote:
> >> On 2019-10-07 18:42, Jason Baron wrote:
> >>> On 10/7/19 6:54 AM, Roman Penyaev wrote:
> >>>> On 2019-10-03 18:13, Jason Baron wrote:
> >>>>> On 9/30/19 7:55 AM, Roman Penyaev wrote:
> >>>>>> On 2019-09-28 04:29, Andrew Morton wrote:
> >>>>>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev wrote:
> >>>>>>>
> >>>>>>>> From: Heiher
> >>>>>>>>
> >>>>>>>> Take the case where we have:
> >>>>>>>>
> >>>>>>>>         t0
> >>>>>>>>          | (ew)
> >>>>>>>>         e0
> >>>>>>>>          | (et)
> >>>>>>>>         e1
> >>>>>>>>          | (lt)
> >>>>>>>>         s0
> >>>>>>>>
> >>>>>>>> t0: thread 0
> >>>>>>>> e0: epoll fd 0
> >>>>>>>> e1: epoll fd 1
> >>>>>>>> s0: socket fd 0
> >>>>>>>> ew: epoll_wait
> >>>>>>>> et: edge-trigger
> >>>>>>>> lt: level-trigger
> >>>>>>>>
> >>>>>>>> We only need to wake up nested epoll fds if something has been
> >>>>>>>> queued to the overflow list, since ep_poll() traverses the rdllist
> >>>>>>>> during recursive poll and thus events on the overflow list may not
> >>>>>>>> be visible yet.
> >>>>>>>>
> >>>>>>>> Test code:
> >>>>>>>
> >>>>>>> Looks sane to me.  Do you have any performance testing results which
> >>>>>>> show a benefit?
> >>>>>>>
> >>>>>>> epoll maintainership isn't exactly a hive of activity nowadays :(
> >>>>>>> Roman, would you please have time to review this?
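The test code itself is snipped in the quote above. As a minimal,
untested sketch of the nesting in that diagram (my assumption here: s0
is one end of a socketpair, which the real test may not use):

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
        struct epoll_event e = { 0 };
        int sfd[2], efd0, efd1;

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) < 0)
                return 1;

        efd0 = epoll_create1(0);                /* e0 */
        efd1 = epoll_create1(0);                /* e1 */

        e.events = EPOLLIN;                     /* lt */
        if (epoll_ctl(efd1, EPOLL_CTL_ADD, sfd[0], &e) < 0)    /* e1 -> s0 */
                return 1;

        e.events = EPOLLIN | EPOLLET;           /* et */
        if (epoll_ctl(efd0, EPOLL_CTL_ADD, efd1, &e) < 0)      /* e0 -> e1 */
                return 1;

        if (write(sfd[1], "w", 1) != 1)         /* fire a single event */
                return 1;

        /* t0 (ew): the single event should surface through e1 to e0 */
        if (epoll_wait(efd0, &e, 1, -1) == 1)
                printf("woke up on e0\n");
        return 0;
}

One write, one wakeup of t0 is the behavior the patch is after, since
e0 watches e1 edge-triggered.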
> >>>>>>
> >>>>>> So here is my observation: the current patch does not fix the
> >>>>>> described problem (double wakeup) for the case when a new event comes
> >>>>>> exactly to the ->ovflist. According to the patch this is the desired
> >>>>>> intention:
> >>>>>>
> >>>>>>     /*
> >>>>>>      * We only need to wakeup nested epoll fds if something has been
> >>>>>>      * queued to the overflow list, since the ep_poll() traverses the
> >>>>>>      * rdllist during recursive poll and thus events on the overflow
> >>>>>>      * list may not be visible yet.
> >>>>>>      */
> >>>>>>     if (nepi != NULL)
> >>>>>>             pwake++;
> >>>>>>
> >>>>>>     ....
> >>>>>>
> >>>>>>     if (pwake == 2)
> >>>>>>             ep_poll_safewake(&ep->poll_wait);
> >>>>>>
> >>>>>> but this actually means that we repeat the same behavior (double
> >>>>>> wakeup), only now for the case when the event comes to the ->ovflist.
> >>>>>>
> >>>>>> How to reproduce? Can be easily done (ok, not so easy, but it is
> >>>>>> possible to try): to the given userspace test we need to add one more
> >>>>>> socket and immediately fire the event:
> >>>>>>
> >>>>>>     e.events = EPOLLIN;
> >>>>>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
> >>>>>>             goto out;
> >>>>>>
> >>>>>>     /*
> >>>>>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list()
> >>>>>>      * in order to "catch" it there and add the new event to
> >>>>>>      * ->ovflist.
> >>>>>>      */
> >>>>>>     if (write(s2fd[1], "w", 1) != 1)
> >>>>>>             goto out;
> >>>>>>
> >>>>>> That is done in order to let the following epoll_wait() call invoke
> >>>>>> ep_scan_ready_list(), where we can "catch" the new event and insert
> >>>>>> it exactly into the ->ovflist. In order to insert the event into
> >>>>>> exactly the right list I introduce an artificial delay.
> >>>>>>
> >>>>>> Modified test and kernel patch are below. Here is the output of the
> >>>>>> testing tool with some debug lines from the kernel:
> >>>>>>
> >>>>>>   # ~/devel/test/edge-bug
> >>>>>>   [   59.263178] ### sleep 2
> >>>>>>   >> write to sock
> >>>>>>   [   61.318243] ### done sleep
> >>>>>>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
> >>>>>>   [   61.321204] ### sleep 2
> >>>>>>   [   63.398325] ### done sleep
> >>>>>>   error: What?! Again?!
> >>>>>>
> >>>>>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
> >>>>>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
> >>>>>> wanted to achieve, so eventually ep_poll_safewake() is called again,
> >>>>>> which leads to a double wakeup.
> >>>>>>
> >>>>>> In my opinion the current patch as it is should be dropped; it does
> >>>>>> not fix the described problem but just hides it.
> >>>>>>
> >>>>>> --
> >>>>
> >>>> Hi Jason,
> >>>>
> >>>>>
> >>>>> Yes, there are 2 wakeups in the test case you describe, but if the
> >>>>> second event (write to s1fd) gets queued after the first call to
> >>>>> epoll_wait(), we are going to get 2 wakeups anyways.
> >>>>
> >>>> Yes, exactly. For this reason I print out the number of events observed
> >>>> on the first wait: there should be 1 (rdllist) and 1 (ovflist);
> >>>> otherwise this is another case, where the second event comes exactly
> >>>> after the first wait, which is a legitimate wakeup.
> >>>>
> >>>>> So yes, there may be a slightly bigger window with this patch for 2
> >>>>> wakeups, but it's small, and I tried to be conservative with the
> >>>>> patch - I'd rather get an occasional 2nd wakeup than miss one. Trying
> >>>>> to debug missing wakeups isn't always fun...
> >>>>>
> >>>>> That said, the reason for propagating events that end up on the
> >>>>> overflow list was to prevent the race of the wakee not seeing events
> >>>>> because they were still on the overflow list. In the testcase, imagine
> >>>>> there was a thread doing epoll_wait() on efd[0], and then a write
> >>>>> happens on s1fd. I thought it was possible then that a 2nd thread
> >>>>> doing epoll_wait() on efd[1] wakes up, checks efd[0] and sees no
> >>>>> events, because they are still potentially on the overflow list.
> >>>>> However, I think that case is not possible, because the thread doing
> >>>>> epoll_wait() on efd[0] is going to hold the ep->mtx, and thus when the
> >>>>> thread wakes up on efd[1], it's going to be ordered, because it's also
> >>>>> grabbing the ep->mtx associated with efd[0].
> >>>>>
> >>>>> So I think it's safe to do the following if we want to go further than
> >>>>> the proposed patch, which is what you suggested earlier in the thread
> >>>>> (minus keeping the wakeup on ep->wq).
> >>>>
> >>>> Then I do not understand why we need to keep the ep->wq wakeup.
> >>>> @wq and @poll_wait are almost the same, with only one difference:
> >>>> @wq is used when you sleep on it inside epoll_wait(), and the other
> >>>> is used when you nest an epoll fd inside another epoll fd. Either you
> >>>> wake both up or you don't do this at all.
> >>>>
> >>>> ep_poll_callback() does the wakeup explicitly, ep_insert() and
> >>>> ep_modify() do the wakeup explicitly, so what are the cases when we
> >>>> need to do wakeups from ep_scan_ready_list()?
> >>>
> >>> Hi Roman,
> >>>
> >>> So the reason I was saying not to drop the ep->wq wakeup was that I was
> >>> thinking about a usecase where you have multiple threads, say thread A
> >>> and thread B, which are doing epoll_wait() on the same epfd. Now, the
> >>> threads both call epoll_wait() and are added as exclusive to ep->wq.
> >>> Now a bunch of events happen and thread A is woken up. However, thread
> >>> A may only process a subset of the events due to its 'maxevents'
> >>> parameter. In that case, I was thinking that the wakeup on ep->wq might
> >>> be helpful, because in the absence of subsequent events, thread B can
> >>> now start processing the rest, instead of waiting for the next event to
> >>> be queued.
> >>>
> >>> However, I was thinking about the state of things before:
> >>> 86c0517 fs/epoll: deal with wait_queue only once
> >>>
> >>> Before that patch, thread A would have been removed from ep->wq before
> >>> the wakeup call, thus waking up thread B. However, now that thread A
> >>> stays on the queue during the call to ep_send_events(), I believe the
> >>> wakeup would only target thread A, which doesn't help since it's
> >>> already checking for events. So given the state of things, I think you
> >>> are right that it's not needed. However, I wonder if not removing from
> >>> ep->wq affects the multi-threaded case I described. It has been around
> >>> since 5.0, so probably not, but it would be a more subtle performance
> >>> difference.
> >>
> >> Now I understand what you mean. You want to prevent "idling" of events
> >> while thread A is busy with a small portion of the events (the portion
> >> is equal to maxevents). On the next iteration thread A will pick up the
> >> rest, no doubt, but it would be nice to give thread B a chance to deal
> >> with the rest immediately. Ok, makes sense.
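To make the thread A / thread B scenario concrete, here is a
hypothetical sketch (untested, not code from this thread) of two
waiters sharing one epfd, with maxevents smaller than the number of
ready events; the eventfd-based event source is just my stand-in:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

static int epfd;

static void *waiter(void *name)
{
        struct epoll_event evs[4];      /* maxevents smaller than 8 */
        int n = epoll_wait(epfd, evs, 4, 1000);

        printf("thread %s reaped %d events\n", (char *)name, n);
        return NULL;
}

int main(void)
{
        struct epoll_event e = { .events = EPOLLIN };
        int i, efds[8];
        pthread_t a, b;

        epfd = epoll_create1(0);
        for (i = 0; i < 8; i++) {
                efds[i] = eventfd(0, 0);
                e.data.fd = efds[i];
                epoll_ctl(epfd, EPOLL_CTL_ADD, efds[i], &e);
        }

        pthread_create(&a, NULL, waiter, "A");
        pthread_create(&b, NULL, waiter, "B");
        usleep(100000);                 /* let both block in ep->wq */

        for (i = 0; i < 8; i++) {
                uint64_t one = 1;
                write(efds[i], &one, sizeof(one));      /* 8 ready events */
        }

        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}

With exclusive waiters, thread A is woken and reaps at most 4 events;
whether thread B is woken promptly for the remaining 4, or only sees
them on its next wakeup or timeout, is exactly the behavior discussed
above.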
> >
> > Exactly. I don't believe it's racy as is - but it seems like it would be
> > good to wake up other threads that may be waiting. That said, this logic
> > was removed as I pointed out. So I'm not sure we need to tie this change
> > to the current one - but it may be a nice addition.
> >
> >>
> >> But what if we make this wakeup explicit when we have more events to
> >> process? (nothing is tested, just a guess)
> >>
> >> @@ -255,6 +255,7 @@ struct ep_pqueue {
> >>  struct ep_send_events_data {
> >>         int maxevents;
> >>         struct epoll_event __user *events;
> >> +       bool have_more;
> >>         int res;
> >>  };
> >>
> >> @@ -1783,14 +1768,17 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
> >>  }
> >>
> >>  static int ep_send_events(struct eventpoll *ep,
> >> -                         struct epoll_event __user *events, int maxevents)
> >> +                         struct epoll_event __user *events, int maxevents,
> >> +                         bool *have_more)
> >>  {
> >> -       struct ep_send_events_data esed;
> >> -
> >> -       esed.maxevents = maxevents;
> >> -       esed.events = events;
> >> +       struct ep_send_events_data esed = {
> >> +               .maxevents = maxevents,
> >> +               .events = events,
> >> +       };
> >>
> >>        ep_scan_ready_list(ep, ep_send_events_proc, &esed, 0, false);
> >> +       *have_more = esed.have_more;
> >> +
> >>        return esed.res;
> >>  }
> >>
> >> @@ -1827,7 +1815,7 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>  {
> >>        int res = 0, eavail, timed_out = 0;
> >>        u64 slack = 0;
> >> -       bool waiter = false;
> >> +       bool waiter = false, have_more;
> >>        wait_queue_entry_t wait;
> >>        ktime_t expires, *to = NULL;
> >>
> >> @@ -1927,7 +1915,8 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>         * more luck.
> >>         */
> >>        if (!res && eavail &&
> >> -          !(res = ep_send_events(ep, events, maxevents)) && !timed_out)
> >> +          !(res = ep_send_events(ep, events, maxevents, &have_more)) &&
> >> +          !timed_out)
> >>                goto fetch_events;
> >>
> >>        if (waiter) {
> >> @@ -1935,6 +1924,12 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
> >>                __remove_wait_queue(&ep->wq, &wait);
> >>                spin_unlock_irq(&ep->wq.lock);
> >>        }
> >> +       /*
> >> +        * We were not able to process all the events, so immediately
> >> +        * wake up another waiter.
> >> +        */
> >> +       if (res > 0 && have_more && waitqueue_active(&ep->wq))
> >> +               wake_up(&ep->wq);
> >>
> >>        return res;
> >>  }
> >>
> >
> > Ok, yeah, I like making it explicit. Looks like you are missing the
> > changes to ep_scan_ready_list(), but I think the general approach makes
> > sense.
>
> Yeah, missed the hunk:
>
> @@ -1719,8 +1704,10 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
>         lockdep_assert_held(&ep->mtx);
>
>         list_for_each_entry_safe(epi, tmp, head, rdllink) {
> -               if (esed->res >= esed->maxevents)
> +               if (esed->res >= esed->maxevents) {
> +                       esed->have_more = true;
>                         break;
> +               }
>
> > Although I really didn't have a test case that motivated this - I was
> > just sort of noting this change in behavior while reviewing the current
> > change.
> >
> >> PS. So what do we decide with the original patch? Remove the whole
> >> branch?
> >>
> >
> > Fwiw, I'm ok with removing the whole branch as you proposed.
>
> Then probably Heiher could resend once more. Heiher, can you, please?

Sorry for the delay. Thank you for your help! That's ok, I will re-send
the patch and add unit tests to kselftests in later patches.
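For the record, a rough sketch of the shape such a kselftest could take
(a hypothetical sketch, not the final test; the fd names mirror the
test code earlier in the thread). It asserts that a single write
produces exactly one wakeup of the edge-triggered outer epoll fd:

#include <assert.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
        struct epoll_event e = { 0 };
        int sfd[2], efd[2];

        assert(socketpair(AF_UNIX, SOCK_STREAM, 0, sfd) == 0);
        efd[0] = epoll_create1(0);      /* outer, watches efd[1] et */
        efd[1] = epoll_create1(0);      /* inner, watches sfd[0] lt */
        assert(efd[0] >= 0 && efd[1] >= 0);

        e.events = EPOLLIN;
        assert(epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd[0], &e) == 0);
        e.events = EPOLLIN | EPOLLET;
        assert(epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e) == 0);

        assert(write(sfd[1], "w", 1) == 1);     /* a single event */

        /* the first wait must see the event ... */
        assert(epoll_wait(efd[0], &e, 1, 0) == 1);
        /* ... and with nothing new queued, the edge-triggered outer fd
         * must not report it again: expect a timeout here. */
        assert(epoll_wait(efd[0], &e, 1, 100) == 0);
        return 0;
}

The second wait returning an event again, instead of timing out, would
be exactly the double wakeup described above.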
> > And I think the above change can go in separately (if we decide we want
> > it). I don't think they need to be tied together. I also want to make
> > sure this change gets a full linux-next cycle, so I think it should
> > target 5.5 at this point.
>
> Sure, this explicit ->wq wakeup is a separate patch, which should be
> covered with some benchmarks. I can try to cook something up in order
> to get numbers.
>
> --
> Roman

--
Best regards!
Hev
https://hev.cc