Received: by 2002:a25:1985:0:0:0:0:0 with SMTP id 127csp3253277ybz; Sun, 3 May 2020 22:06:56 -0700 (PDT) X-Google-Smtp-Source: APiQypINODxg492m6iKhBSNvJPqudqq2DZiGJfTylmI79jCbih4qinkQf4OZrndHewNb/oRr5XOR X-Received: by 2002:a05:6402:14ce:: with SMTP id f14mr13029401edx.244.1588568816427; Sun, 03 May 2020 22:06:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1588568816; cv=none; d=google.com; s=arc-20160816; b=eDE6vlnNh2RQVtzec7MYK3E4dFpyfD2bFnpX9Ur7a1dPeNiJPn+wcobPHjK78UN+pH IeV216xb/1yEEYraSL9Z9hCuhF7Xj2kxIij8jH32NirE5MGxMx1nLokdSaVY17z7qM/O j9GfOdJ9qGBd7ylrsITt6m4QbF2nG/esUbGJ5nZTcWhp/mANx2/YHCq17v73IlxHRFEy 2gjEAS3VQk4mu76mQ2zi36Mt+MKsOKnXvIhG6rlC5UWGIefxzgcXXyiKe4qq0ErQkc4C le/WvJu/nSMakSvG4fSgdjDgE1u1DPdFPcnJkukp6XGcNgNeoosWlAwDD5y1u1DCAMN2 vCjg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:references:cc:to:from:subject:dkim-signature; bh=fmpig0yn0d2DMHi1ULj0wEL9TmtVthnvE+yUNwWMGJk=; b=G3zfkS8SXNww7RRztYd54ii7HiYVylcf80w8TzCLyRfqEwOXj6WnZIcmd/GVeXzfFd rg0h6Vtrp464zvOp7Tuhff5RXgMPEkcg958lnTytZg41nIB0JHICN+wxmiaI8ayPEP7Z k8E7lGiPXpBvZ5cHxIs2Hp75mJkNYBJ8NuGGGjou2mVeN+qE7iCy/1edXdE/b7EbZFwl LXMPXW6B8S1Mb3NKn4UvE0CcffdaKRH9zOHl8XkAAK5PimWLxhe26dFs2/iFBDP/6CvN ztqMQydSNIofvxllCf5n6EAukdqH+eJ/gSpSFhPSt+mnCo/v9lko6PlgHp7kWMi2QIu4 HEDQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@akamai.com header.s=jan2016.eng header.b=UMUyvbIM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=NONE dis=NONE) header.from=akamai.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id pv20si6423250ejb.295.2020.05.03.22.06.33; Sun, 03 May 2020 22:06:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@akamai.com header.s=jan2016.eng header.b=UMUyvbIM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=NONE dis=NONE) header.from=akamai.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726770AbgEDFAc (ORCPT + 99 others); Mon, 4 May 2020 01:00:32 -0400 Received: from mx0a-00190b01.pphosted.com ([67.231.149.131]:13652 "EHLO mx0a-00190b01.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725859AbgEDFAb (ORCPT ); Mon, 4 May 2020 01:00:31 -0400 Received: from pps.filterd (m0050095.ppops.net [127.0.0.1]) by m0050095.ppops.net-00190b01. (8.16.0.42/8.16.0.42) with SMTP id 0444wakD003458; Mon, 4 May 2020 05:59:27 +0100 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=akamai.com; h=subject : from : to : cc : references : message-id : date : mime-version : in-reply-to : content-type : content-transfer-encoding; s=jan2016.eng; bh=fmpig0yn0d2DMHi1ULj0wEL9TmtVthnvE+yUNwWMGJk=; b=UMUyvbIM0p5qqHL4U0JnKbcSSqKk+bVZ6jNe/Gkp0gTfI48jgnDCBGS6WvI9Jvghx8q3 wuEWukcK3kYzvHNC3WxBQSiGu5ijr/Jmu7OXVdc4igAUy4frdI0pT9ZfQ6O6kbEM2uVt pdlprazMjtdOO3YoV2D4InfVOJJH0iss7y/TbjUEjdMuAyyclQLGZvJ/Ly00ihS3ymV7 TNHOiqCPL+76CS6ieiNQTxkXhz71Zonsu8sio9H2RDtFJpNzrjgyLzqIu4s+pqrEWvXT x5cYq6zwxsMe2XDEEfJ8/2u0EYrWBj+IQ6m248mUrx87vfW+sDTjXiZPlo8i1bWjW3WQ 0g== Received: from prod-mail-ppoint7 (a72-247-45-33.deploy.static.akamaitechnologies.com [72.247.45.33] (may be forged)) by m0050095.ppops.net-00190b01. with ESMTP id 30s0d6j3x6-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); Mon, 04 May 2020 05:59:27 +0100 Received: from pps.filterd (prod-mail-ppoint7.akamai.com [127.0.0.1]) by prod-mail-ppoint7.akamai.com (8.16.0.27/8.16.0.27) with SMTP id 0444lcji031818; Mon, 4 May 2020 00:59:26 -0400 Received: from prod-mail-relay15.akamai.com ([172.27.17.40]) by prod-mail-ppoint7.akamai.com with ESMTP id 30t9u5gy17-1; Mon, 04 May 2020 00:59:25 -0400 Received: from [0.0.0.0] (prod-ssh-gw01.bos01.corp.akamai.com [172.27.119.138]) by prod-mail-relay15.akamai.com (Postfix) with ESMTP id 329B9230B2; Mon, 4 May 2020 04:59:25 +0000 (GMT) Subject: Re: [PATCH] epoll: ensure ep_poll() doesn't miss wakeup events From: Jason Baron To: Roman Penyaev Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, Alexander Viro , Heiher , Khazhismel Kumykov , Davidlohr Bueso , stable@vger.kernel.org References: <1588360533-11828-1-git-send-email-jbaron@akamai.com> <930c565705249d2b6264a31f1be6529e@suse.de> <81612721-9448-83fa-4efe-603996d56b9a@akamai.com> <842aa331-b650-bf99-8ea9-b5d3e0866054@akamai.com> Message-ID: <8632df62-7475-3328-4a38-95462fbc410d@akamai.com> Date: Mon, 4 May 2020 00:59:24 -0400 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.7.0 MIME-Version: 1.0 In-Reply-To: <842aa331-b650-bf99-8ea9-b5d3e0866054@akamai.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.138,18.0.676 definitions=2020-05-04_02:2020-05-01,2020-05-04 signatures=0 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 suspectscore=0 malwarescore=0 phishscore=0 bulkscore=0 spamscore=0 mlxscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-2002250000 definitions=main-2005040040 X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.138,18.0.676 definitions=2020-05-04_02:2020-05-01,2020-05-04 signatures=0 X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 clxscore=1015 spamscore=0 bulkscore=0 mlxlogscore=999 priorityscore=1501 mlxscore=0 malwarescore=0 lowpriorityscore=0 adultscore=0 phishscore=0 suspectscore=0 impostorscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.12.0-2003020000 definitions=main-2005040041 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5/4/20 12:29 AM, Jason Baron wrote: > > > On 5/3/20 6:24 AM, Roman Penyaev wrote: >> On 2020-05-02 00:09, Jason Baron wrote: >>> On 5/1/20 5:02 PM, Roman Penyaev wrote: >>>> Hi Jason, >>>> >>>> That is indeed a nice catch. >>>> Seems we need smp_rmb() pair between list_empty_careful(&rp->rdllist) and >>>> READ_ONCE(ep->ovflist) for ep_events_available(), do we? >>>> >>> >>> Hi Roman, >>> >>> Good point, even if we order those reads its still racy, since the >>> read of the ready list could come after its been cleared and the >>> read of the overflow could again come after its been cleared. >> >> You mean the second chunk? True. Sigh. >> >>> So I'm afraid we might need instead something like this to make >>> sure they are read together: >> >> No, impossible, I can't believe in that :) We can't give up. >> >> All we need is to keep a mark, that ep->rdllist is not empty, >> even we've just spliced it.  ep_poll_callback() always takes >> the ->ovflist path, if ->ovflist is not EP_UNACTIVE_PTR, but >> ep_events_available() does not need to observe ->ovflist at >> all, just a ->rdllist. >> >> Take a look at that, do I miss something? : >> >> diff --git a/fs/eventpoll.c b/fs/eventpoll.c >> index aba03ee749f8..a8770f9a917e 100644 >> --- a/fs/eventpoll.c >> +++ b/fs/eventpoll.c >> @@ -376,8 +376,7 @@ static void ep_nested_calls_init(struct nested_calls *ncalls) >>   */ >>  static inline int ep_events_available(struct eventpoll *ep) >>  { >> -       return !list_empty_careful(&ep->rdllist) || >> -               READ_ONCE(ep->ovflist) != EP_UNACTIVE_PTR; >> +       return !list_empty_careful(&ep->rdllist); >>  } >> >>  #ifdef CONFIG_NET_RX_BUSY_POLL >> @@ -683,7 +682,8 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, >>  { >>         __poll_t res; >>         struct epitem *epi, *nepi; >> -       LIST_HEAD(txlist); >> +       LIST_HEAD(rdllist); >> +       LIST_HEAD(ovflist); >> >>         lockdep_assert_irqs_enabled(); >> >> @@ -704,14 +704,22 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, >>          * in a lockless way. >>          */ >>         write_lock_irq(&ep->lock); >> -       list_splice_init(&ep->rdllist, &txlist); >> +       /* >> +        * We do not call list_splice_init() because for lockless >> +        * ep_events_available() ->rdllist is still "not empty". >> +        * Otherwise the feature that there is something left in >> +        * the list can be lost which causes missed wakeup. >> +        */ >> +       list_splice(&ep->rdllist, &rdllist); >> +       /* >> +        * If ->rdllist was empty we should pretend it was not, >> +        * because after the unlock ->ovflist comes into play, >> +        * which is invisible for lockless ep_events_available(). >> +        */ >> +       ep->rdllist.next = LIST_POISON1; >>         WRITE_ONCE(ep->ovflist, NULL); >>         write_unlock_irq(&ep->lock); >> >>         /* >>          * Now call the callback function. >>          */ >> -       res = (*sproc)(ep, &txlist, priv); >> +       res = (*sproc)(ep, &rdllist, priv); >> >>         write_lock_irq(&ep->lock); >>         /* >> @@ -724,7 +732,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, >>                 /* >>                  * We need to check if the item is already in the list. >>                  * During the "sproc" callback execution time, items are >> -                * queued into ->ovflist but the "txlist" might already >> +                * queued into ->ovflist but the "rdllist" might already >>                  * contain them, and the list_splice() below takes care of them. >>                  */ >>                 if (!ep_is_linked(epi)) { >> @@ -732,7 +740,7 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, >>                          * ->ovflist is LIFO, so we have to reverse it in order >>                          * to keep in FIFO. >>                          */ >> -                       list_add(&epi->rdllink, &ep->rdllist); >> +                       list_add(&epi->rdllink, &ovflist); >>                         ep_pm_stay_awake(epi); >>                 } >>         } >> @@ -743,10 +751,11 @@ static __poll_t ep_scan_ready_list(struct eventpoll *ep, >>          */ >>         WRITE_ONCE(ep->ovflist, EP_UNACTIVE_PTR); >> >> -       /* >> -        * Quickly re-inject items left on "txlist". >> -        */ >> -       list_splice(&txlist, &ep->rdllist); >> +       /* Events from ->ovflist happened later, thus splice to the tail */ >> +       list_splice_tail(&ovflist, &rdllist); >> +       /* Just replace list */ >> +       list_replace(&rdllist, &ep->rdllist); >> + >>         __pm_relax(ep->ws); >>         write_unlock_irq(&ep->lock); >> >> @@ -1763,13 +1772,13 @@ static __poll_t ep_send_events_proc(struct eventpoll *ep, struct list_head *head >>                          * Trigger mode, we need to insert back inside >>                          * the ready list, so that the next call to >>                          * epoll_wait() will check again the events >> -                        * availability. At this point, no one can insert >> -                        * into ep->rdllist besides us. The epoll_ctl() >> -                        * callers are locked out by >> -                        * ep_scan_ready_list() holding "mtx" and the >> -                        * poll callback will queue them in ep->ovflist. >> +                        * availability. What we do here is simply >> +                        * return the epi to the same position where >> +                        * it was, the ep_scan_ready_list() will >> +                        * re-inject the leftovers to the ->rdllist >> +                        * under the proper lock. >>                          */ >> -                       list_add_tail(&epi->rdllink, &ep->rdllist); >> +                       list_add_tail(&epi->rdllink, &tmp->rdllink); >>                         ep_pm_stay_awake(epi); >>                 } >>         } >> >> >> -- >> Roman >> > > > Hi Roman, > > I think this misses an important case - the initial ep_poll_callback() > may queue to the overflow list. In this case ep_poll has no visibility > into the event since its only checking ep->rdllist. Ok, my mistake - I see it sets: ep->rdllist.next = LIST_POISON1; for that case. Ok I think this approach makes sense then. Thanks, -Jason