Date: Sun, 03 May 2020 12:24:30 +0200
From: Roman Penyaev
To: Jason Baron
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    Alexander Viro, Heiher, Khazhismel Kumykov, Davidlohr Bueso,
    stable@vger.kernel.org
Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events
In-Reply-To: <81612721-9448-83fa-4efe-603996d56b9a@akamai.com>
References: <1588360533-11828-1-git-send-email-jbaron@akamai.com>
 <930c565705249d2b6264a31f1be6529e@suse.de>
 <81612721-9448-83fa-4efe-603996d56b9a@akamai.com>
X-Sender: rpenyaev@suse.de
List-ID: linux-kernel@vger.kernel.org

On 2020-05-02 00:09, Jason Baron wrote:
> On 5/1/20 5:02 PM, Roman Penyaev wrote:
>> Hi Jason,
>>
>> That is indeed a nice catch.
>> Seems we need an smp_rmb() pair between list_empty_careful(&ep->rdllist)
>> and READ_ONCE(ep->ovflist) for ep_events_available(), don't we?
>
> Hi Roman,
>
> Good point, but even if we order those reads it's still racy, since the
> read of the ready list could come after it's been cleared, and the read
> of the overflow list could again come after it's been cleared.

You mean the second chunk? True. Sigh.

> So I'm afraid we might need instead something like this to make
> sure they are read together:

No, impossible, I can't believe that :) We can't give up. All we need is
to keep a mark that ep->rdllist is not empty, even if we've just spliced
it. ep_poll_callback() always takes the ->ovflist path if ->ovflist is
not EP_UNACTIVE_PTR, but ep_events_available() does not need to observe
->ovflist at all, just ->rdllist.

Take a look at this, do I miss something?:

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index aba03ee749f8..a8770f9a917e 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -376,8 +376,7 @@ static void ep_nested_calls_init(struct nested_calls *ncalls)
  */
 static inline int ep_events_available(struct eventpoll *ep)
 {
-	return !list_empty_careful(&ep->rdllist) ||
-		READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR;
+	return !list_empty_careful(&ep->rdllist);
 }
 
 #ifdef CONFIG_NET_RX_BUSY_POLL
@@ -683,7 +682,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 {
 	__poll_t res;
 	struct epitem *epi, *nepi;
-	LIST_HEAD(txlist);
+	LIST_HEAD(rdllist);
+	LIST_HEAD(ovflist);
 
 	lockdep_assert_irqs_enabled();
 
@@ -704,14 +704,22 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 * in a lockless way.
 	 */
 	write_lock_irq(&ep->lock);
-	list_splice_init(&ep->rdllist, &txlist);
+	/*
+	 * We do not call list_splice_init() because for the lockless
+	 * ep_events_available() ->rdllist is still "not empty".
+	 * Otherwise the fact that there is something left in the list
+	 * can be lost, which causes a missed wakeup.
+	 */
+	list_splice(&ep->rdllist, &rdllist);
+	/*
+	 * If ->rdllist was empty we should pretend it was not,
+	 * because after the unlock ->ovflist comes into play,
+	 * which is invisible to the lockless ep_events_available().
+	 */
+	ep->rdllist.next = LIST_POISON1;
 	WRITE_ONCE(ep->ovflist, NULL);
 	write_unlock_irq(&ep->lock);
 
 	/*
 	 * Now call the callback function.
 	 */
-	res = (*sproc)(ep, &txlist, priv);
+	res = (*sproc)(ep, &rdllist, priv);
 
 	write_lock_irq(&ep->lock);
 	/*
@@ -724,7 +732,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 		/*
 		 * We need to check if the item is already in the list.
 		 * During the "sproc" callback execution time, items are
-		 * queued into ->ovflist but the "txlist" might already
+		 * queued into ->ovflist but the "rdllist" might already
		 * contain them, and the list_splice() below takes care of them.
 		 */
 		if (!ep_is_linked(epi)) {
@@ -732,7 +740,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 			 * ->ovflist is LIFO, so we have to reverse it in order
 			 * to keep in FIFO.
 			 */
-			list_add(&epi->rdllink, &ep->rdllist);
+			list_add(&epi->rdllink, &ovflist);
 			ep_pm_stay_awake(epi);
 		}
 	}
@@ -743,10 +751,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep,
 	 */
 	WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR);
 
-	/*
-	 * Quickly re-inject items left on "txlist".
-	 */
-	list_splice(&txlist, &ep->rdllist);
+	/* Events from ->ovflist happened later, thus splice to the tail */
+	list_splice_tail(&ovflist, &rdllist);
+	/* Just replace list */
+	list_replace(&rdllist, &ep->rdllist);
+
 	__pm_relax(ep->ws);
 	write_unlock_irq(&ep->lock);
 
@@ -1763,13 +1772,13 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head
 			 * Trigger mode, we need to insert back inside
 			 * the ready list, so that the next call to
 			 * epoll_wait() will check again the events
-			 * availability. At this point, no one can insert
-			 * into ep->rdllist besides us. The epoll_ctl()
-			 * callers are locked out by
-			 * ep_scan_ready_list() holding "mtx" and the
-			 * poll callback will queue them in ep->ovflist.
+			 * availability. What we do here is simply
+			 * return the epi to the same position where
+			 * it was, the ep_scan_ready_list() will
+			 * re-inject the leftovers to the ->rdllist
+			 * under the proper lock.
 			 */
-			list_add_tail(&epi->rdllink, &ep->rdllist);
+			list_add_tail(&epi->rdllink, &tmp->rdllink);
 			ep_pm_stay_awake(epi);
 		}
 	}

--
Roman