Date: Mon, 07 Oct 2019 12:54:32 +0200
From: Roman Penyaev
To: Jason Baron
Cc: Andrew Morton, hev, linux-fsdevel@vger.kernel.org, Al Viro,
    Davide Libenzi, Davidlohr Bueso, Dominik Brodowski, Eric Wong,
    Linus Torvalds, Sridhar Samudrala, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND v4] fs/epoll: Remove unnecessary wakeups of nested epoll that in ET mode
In-Reply-To: <9ca02c9b-85b7-dced-9c82-1fc453c49b8a@akamai.com>
References: <20190925015603.10939-1-r@hev.cc>
 <20190927192915.6ec24ad706258de99470a96e@linux-foundation.org>
 <9ca02c9b-85b7-dced-9c82-1fc453c49b8a@akamai.com>
Message-ID: <9a82925ff7dfc314d36b3d36e54316a8@suse.de>
On 2019-10-03 18:13, Jason Baron wrote:
> On 9/30/19 7:55 AM, Roman Penyaev wrote:
>> On 2019-09-28 04:29, Andrew Morton wrote:
>>> On Wed, 25 Sep 2019 09:56:03 +0800 hev wrote:
>>>
>>>> From: Heiher
>>>>
>>>> Take the case where we have:
>>>>
>>>>         t0
>>>>          | (ew)
>>>>         e0
>>>>          | (et)
>>>>         e1
>>>>          | (lt)
>>>>         s0
>>>>
>>>> t0: thread 0
>>>> e0: epoll fd 0
>>>> e1: epoll fd 1
>>>> s0: socket fd 0
>>>> ew: epoll_wait
>>>> et: edge-trigger
>>>> lt: level-trigger
>>>>
>>>> We only need to wakeup nested epoll fds if something has been queued
>>>> to the overflow list, since the ep_poll() traverses the rdllist during
>>>> recursive poll and thus events on the overflow list may not be
>>>> visible yet.
>>>>
>>>> Test code:
>>>
>>> Looks sane to me.  Do you have any performance testing results which
>>> show a benefit?
>>>
>>> epoll maintainership isn't exactly a hive of activity nowadays :(
>>> Roman, would you please have time to review this?
>>
>> So here is my observation: the current patch does not fix the described
>> problem (double wakeup) for the case when a new event arrives exactly
>> on the ->ovflist.  According to the patch this is the desired intention:
>>
>>    /*
>>     * We only need to wakeup nested epoll fds if something has been queued
>>     * to the overflow list, since the ep_poll() traverses the rdllist
>>     * during recursive poll and thus events on the overflow list may not be
>>     * visible yet.
>>     */
>>     if (nepi != NULL)
>>        pwake++;
>>
>>     ....
>>
>>     if (pwake == 2)
>>        ep_poll_safewake(&ep->poll_wait);
>>
>> but this actually means that we repeat the same behavior (double wakeup),
>> only now for the case when the event comes to the ->ovflist.
>>
>> How to reproduce?  It can be done (ok, not so easily, but it is possible
>> to try): to the given userspace test we need to add one more socket and
>> immediately fire an event on it:
>>
>>     e.events = EPOLLIN;
>>     if (epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e) < 0)
>>        goto out;
>>
>>     /*
>>      * Signal any fd to let epoll_wait() call ep_scan_ready_list(),
>>      * in order to "catch" it there and add the new event to ->ovflist.
>>      */
>>      if (write(s2fd[1], "w", 1) != 1)
>>         goto out;
>>
>> That is done in order to let the following epoll_wait() call invoke
>> ep_scan_ready_list(), where we can "catch" it and insert the new event
>> exactly into the ->ovflist.  In order to insert the event into the
>> correct list I introduce an artificial delay.
>>
>> The modified test and kernel patch are below.  Here is the output of the
>> testing tool with some debug lines from the kernel:
>>
>>   # ~/devel/test/edge-bug
>>   [   59.263178] ### sleep 2
>>   >> write to sock
>>   [   61.318243] ### done sleep
>>   [   61.318991] !!!!!!!!!!! ep_poll_safewake(&ep->poll_wait); events_in_rdllist=1, events_in_ovflist=1
>>   [   61.321204] ### sleep 2
>>   [   63.398325] ### done sleep
>>   error: What?! Again?!
>>
>> The first epoll_wait() call (ep_scan_ready_list()) observes 2 events
>> (see the "!!!!!!!!!!! ep_poll_safewake" output line), exactly what we
>> wanted to achieve, so eventually ep_poll_safewake() is called again,
>> which leads to the double wakeup.
>>
>> In my opinion the current patch as it is should be dropped; it does not
>> fix the described problem but just hides it.
>>
>> --
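For readers trying to picture the topology above, a rough, self-contained
sketch of the nested-epoll setup being discussed follows.  It is not the
exact "edge-bug" test from the thread: only the fd names (efd, s1fd, s2fd)
follow the quoted snippets, while the socketpair usage, timeouts and the
missing error handling are simplifications, and the artificial kernel-side
delay is not reproduced here.

#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

int main(void)
{
        int efd[2], s1fd[2], s2fd[2];
        struct epoll_event e;

        socketpair(AF_UNIX, SOCK_STREAM, 0, s1fd);   /* s0 in the diagram */
        socketpair(AF_UNIX, SOCK_STREAM, 0, s2fd);   /* the added socket */

        efd[0] = epoll_create1(0);                   /* e0, waited on by t0 */
        efd[1] = epoll_create1(0);                   /* e1, nested inside e0 */

        e.events = EPOLLIN;                          /* lt: s0 on e1 */
        e.data.fd = s1fd[0];
        epoll_ctl(efd[1], EPOLL_CTL_ADD, s1fd[0], &e);

        e.events = EPOLLIN;                          /* the extra socket */
        e.data.fd = s2fd[0];
        epoll_ctl(efd[1], EPOLL_CTL_ADD, s2fd[0], &e);

        e.events = EPOLLIN | EPOLLET;                /* et: e1 on e0 */
        e.data.fd = efd[1];
        epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e);

        /* Fire an event before waiting, so that the first epoll_wait()
         * finds work queued when it runs ep_scan_ready_list(). */
        write(s2fd[1], "w", 1);

        /* With edge trigger on e1, only the first wait should report an
         * event; a second wakeup without a new event is the behaviour
         * under discussion. */
        printf("first  wait: %d\n", epoll_wait(efd[0], &e, 1, 2000));
        printf("second wait: %d\n", epoll_wait(efd[0], &e, 1, 2000));
        return 0;
}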
Hi Jason,

> Yes, there are 2 wakeups in the test case you describe, but if the
> second event (write to s1fd) gets queued after the first call to
> epoll_wait(), we are going to get 2 wakeups anyways.

Yes, exactly.  For this reason I print out the number of events observed
on the first wait: there should be 1 on the rdllist and 1 on the ovflist.
Otherwise this is another case, where the second event comes only after
the first wait, and the wakeup is legitimate.

> So yes, there may be a slightly bigger window with this patch for 2
> wakeups, but it's small, and I tried to be conservative with the patch -
> I'd rather get an occasional 2nd wakeup than miss one.  Trying to debug
> missing wakeups isn't always fun...
>
> That said, the reason for propagating events that end up on the overflow
> list was to prevent the race of the wakee not seeing events because they
> were still on the overflow list.  In the testcase, imagine if there was a
> thread doing epoll_wait() on efd[0], and then a write happens on s1fd.
> I thought it was possible then that a 2nd thread doing epoll_wait() on
> efd[1] wakes up, checks efd[0] and sees no events because they are
> still potentially on the overflow list.  However, I think that case is
> not possible, because the thread doing epoll_wait() on efd[0] is going
> to hold the ep->mtx, and thus when the thread wakes up on efd[1], it's
> going to be ordered, because it's also grabbing the ep->mtx associated
> with efd[0].
>
> So I think it's safe to do the following if we want to go further than
> the proposed patch, which is what you suggested earlier in the thread
> (minus keeping the wakeup on ep->wq).

Then I do not understand why we need to keep the ep->wq wakeup.  @wq and
@poll_wait are almost the same, with only one difference: @wq is the queue
you sleep on inside epoll_wait(), and @poll_wait is used when you nest an
epoll fd inside another epoll fd.  Either you wake both up or you do not
do it at all.  ep_poll_callback() does the wakeup explicitly, and so do
ep_insert() and ep_modify(), so in which cases do we need to do wakeups
from ep_scan_ready_list()?

I would still remove the whole branch:

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -671,7 +671,6 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 			      void *priv, int depth, bool ep_locked)
 {
 	__poll_t res;
-	int pwake = 0;
 	struct epitem *epi, *nepi;
 	LIST_HEAD(txlist);
 
@@ -738,26 +737,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 */
 	list_splice(&txlist, &ep->rdllist);
 	__pm_relax(ep->ws);
-
-	if (!list_empty(&ep->rdllist)) {
-		/*
-		 * Wake up (if active) both the eventpoll wait list and
-		 * the ->poll() wait list (delayed after we release the lock).
-		 */
-		if (waitqueue_active(&ep->wq))
-			wake_up(&ep->wq);
-		if (waitqueue_active(&ep->poll_wait))
-			pwake++;
-	}
 	write_unlock_irq(&ep->lock);
 
 	if (!ep_locked)
 		mutex_unlock(&ep->mtx);
 
-	/* We have to call this outside the lock */
-	if (pwake)
-		ep_poll_safewake(&ep->poll_wait);
-
 	return res;
 }

--
Roman
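To make the last point concrete, here is a small user-space sketch
(illustrative only, not part of the original thread; the socketpair,
thread and timeout details are invented): even with the branch above
removed, a plain, non-nested epoll_wait() is still woken, because the
wakeup for a newly arriving event is done by ep_poll_callback() at the
moment the event fires, not by ep_scan_ready_list().

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/socket.h>

static int sv[2];

static void *writer(void *arg)
{
        usleep(100 * 1000);       /* let the main thread block first */
        write(sv[1], "x", 1);     /* wakeup happens in ep_poll_callback() */
        return NULL;
}

int main(void)
{
        struct epoll_event e = { .events = EPOLLIN };
        pthread_t t;
        int efd, n;

        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        efd = epoll_create1(0);
        e.data.fd = sv[0];
        epoll_ctl(efd, EPOLL_CTL_ADD, sv[0], &e);

        pthread_create(&t, NULL, writer, NULL);

        /* Sleeps on ep->wq; returns 1 once the writer fires. */
        n = epoll_wait(efd, &e, 1, 5000);
        printf("epoll_wait returned %d\n", n);

        pthread_join(t, NULL);
        return 0;
}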