Date: Wed, 04 Sep 2019 11:57:45 +0200
From: Roman Penyaev <rpenyaev@suse.de>
To: Jason Baron
Cc: hev, linux-fsdevel@vger.kernel.org, e@80x24.org, Al Viro,
 Andrew Morton, Davide Libenzi, Davidlohr Bueso, Dominik Brodowski,
 Linus Torvalds, Sridhar Samudrala, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND] fs/epoll: fix the edge-triggered mode for nested epoll
In-Reply-To: <7075dd44-feea-a52f-ddaa-087d7bb2c4f6@akamai.com>
References: <20190902052034.16423-1-r@hev.cc>
 <0cdc9905efb9b77b159e09bee17d3ad4@suse.de>
 <7075dd44-feea-a52f-ddaa-087d7bb2c4f6@akamai.com>
Message-ID: <23659bc3e5f80efe9746aefd4d6791e8@suse.de>
On 2019-09-03 23:08, Jason Baron wrote:
> On 9/2/19 11:36 AM, Roman Penyaev wrote:
>> Hi,
>>
>> This is indeed a bug. (Quick side note: could you please remove
>> efd[1] from your test, because it is not related to the reproduction
>> of the current bug.)
>>
>> Your patch lacks a good description of what exactly you've fixed.
>> Let me think out loud, and please correct me if I'm wrong; my
>> understanding of epoll internals has become a bit rusty: when epoll
>> fds are nested, an attempt to harvest events (an
>> ep_scan_ready_list() call) produces a second (repeated) event from
>> an internal fd up to an external fd:
>>
>>      epoll_wait(efd[0], ...):
>>        ep_send_events():
>>           ep_scan_ready_list(depth=0):
>>             ep_send_events_proc():
>>                 ep_item_poll():
>>                   ep_scan_ready_list(depth=1):
>>                     ep_poll_safewake():
>>                       ep_poll_callback()
>>                         list_add_tail(&epi, &epi->rdllist);
>>                         ^^^^^^
>>                         repeated event
>>
>> In your patch you forbid the wakeup for the cases where depth != 0,
>> i.e. for all nested cases. That seems clear. But what if we go
>> further and remove the whole chunk, which seems excessive:
>>
>> @@ -885,26 +886,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
>>
>> -
>> -       if (!list_empty(&ep->rdllist)) {
>> -               /*
>> -                * Wake up (if active) both the eventpoll wait list and
>> -                * the ->poll() wait list (delayed after we release the lock).
>> -                */
>> -               if (waitqueue_active(&ep->wq))
>> -                       wake_up(&ep->wq);
>> -               if (waitqueue_active(&ep->poll_wait))
>> -                       pwake++;
>> -       }
>>         write_unlock_irq(&ep->lock);
>>
>>         if (!ep_locked)
>>                 mutex_unlock(&ep->mtx);
>>
>> -       /* We have to call this outside the lock */
>> -       if (pwake)
>> -               ep_poll_safewake(&ep->poll_wait);
>>
>> My reasoning is this: by the time we've reached the point of
>> scanning events for readiness, all wakeups from ep_poll_callback()
>> have already fired, and new events have already been accounted for
>> in the ready list (ep_poll_callback() calls the same
>> ep_poll_safewake()). Here, frankly, I'm not 100% sure and am
>> probably missing some corner cases.
>>
>> Thoughts?
>
> So the 'wake_up(&ep->wq);' part, I think, is about waking up other
> threads that may be waiting in epoll_wait(). For example, there may
> be multiple threads doing epoll_wait() on the same epoll fd, and the
> logic above seems to say thread 1 may have processed, say, N events
> and is now going to go off to work on those, so let's wake up thread
> 2 now to handle the next chunk.

Not quite. The thread which calls ep_scan_ready_list() processes all
the events and, while processing them, removes them one by one from
the ready list. But if the event mask is non-zero and the event
belongs to a level-triggered descriptor (the default mode), it tails
the event back onto the list (since we are in level mode, the event
should stay there). So at the end of this traversal loop the ready
list is likely not empty, and if so, a wakeup is issued once more for
the nested epoll fds.
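To make the scenario concrete, here is a minimal user-space sketch of
such a nested, edge-triggered setup (my own illustration rather than
the reproducer from the original report; error handling omitted):

#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>

int main(void)
{
        int efd[2], sfd;
        struct epoll_event e = { 0 };
        uint64_t v = 1;

        sfd = eventfd(0, EFD_NONBLOCK);    /* the actual event source */
        efd[1] = epoll_create1(0);         /* inner epoll fd */
        efd[0] = epoll_create1(0);         /* outer epoll fd */

        /* inner watches the source, level-triggered (default) */
        e.events = EPOLLIN;
        e.data.fd = sfd;
        epoll_ctl(efd[1], EPOLL_CTL_ADD, sfd, &e);

        /* outer watches the inner epoll fd, edge-triggered */
        e.events = EPOLLIN | EPOLLET;
        e.data.fd = efd[1];
        epoll_ctl(efd[0], EPOLL_CTL_ADD, efd[1], &e);

        write(sfd, &v, sizeof(v));         /* produce exactly one event */

        /* first wait consumes the edge on the outer fd */
        printf("1st: %d\n", epoll_wait(efd[0], &e, 1, 0));
        /* with correct ET semantics this should print 0; the repeated
         * wakeup described above makes it report the event again */
        printf("2nd: %d\n", epoll_wait(efd[0], &e, 1, 0));
        return 0;
}

On a kernel with the repeated-wakeup behavior the second epoll_wait()
reports the same event again even though no new edge occurred; with
the fix it should return 0.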
But those nested epoll fds should already get all the notifications
from the main event callback, ep_poll_callback(), regardless of which
thread traverses the events. I suppose this logic has existed for
decades, since Davide (the author) was reshuffling the code here and
there. But I am not confident enough to state that this extra wakeup
is bogus; I just have a gut feeling that it looks excessive.

> So I think removing all that even for the depth 0 case is going to
> change some behavior here. So perhaps it should be removed for all
> depths except 0? And if so, it may be better to make 2 patches here
> to separate these changes.
>
> For the nested wakeups, I agree that the extra wakeups seem
> unnecessary and it may make sense to remove them for all depths. I
> don't think the nested epoll semantics are particularly well spelled
> out, and afaict nested epoll has behaved this way for quite some
> time. And the current behavior is not bad in the way that a missing
> wakeup or false negative would be.

That's 100% true! For edge mode the extra wakeup is not a bug: not
optimal for userspace, yes, but it can't lead to any lost wakeups.

--
Roman