Date: Tue, 04 Dec 2018 12:50:58 +0100
From: Roman Penyaev
To: Linus Torvalds
Cc: Al Viro, Paul McKenney, linux-fsdevel, Linux List Kernel Mailing
Subject: Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
References: <20181203110237.14787-1-rpenyaev@suse.de>
Message-ID: <83edf06ce9db540495b53527eca3248c@suse.de>

On 2018-12-03 18:34, Linus Torvalds wrote:
> On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev wrote:
>>
>> Also I'm not quite sure where to put very special lockless variant
>> of adding element to the list (list_add_tail_lockless() in this
>> patch). Seems keeping it locally is safer.
>
> That function is scary, and can be mis-used so easily that I
> definitely don't want to see it anywhere else.
>
> Afaik, it's *really* important that only "add_tail" operations can be
> done in parallel.

True: adding elements either only to the head or only to the tail can
work in parallel; any mix will corrupt the list. I tried to reflect this
in the comment of list_add_tail_lockless(), although I am not sure
whether that makes it clear enough to a reader.

> This also ends up making the memory ordering of "xchg()" very very
> important. Yes, we've documented it as being an ordering op, but I'm
> not sure we've relied on it this directly before.

It seems exit_mm() does exactly the same in the following chunk:

	up_read(&mm->mmap_sem);

	self.task = current;
	self.next = xchg(&core_state->dumper.next, &self);

At least the code pattern looks similar.

> I also note that now we do more/different locking in the waitqueue
> handling, because the code now takes both that rwlock _and_ the
> waitqueue spinlock for wakeup. That also makes me worried that the
> "waitqueue_active()" games are no longer reliable. I think they're
> fine (looks like they are only done under the write-lock, so it's
> effectively the same serialization anyway),

The only difference in waking up is that the same epollitem waitqueue
can be observed as active from different CPUs. The real wake up happens
only once (wake_up() takes wq.lock, so it should be fine to call it
multiple times), but 1 is returned to all callers of ep_poll_callback()
that have seen the wq as active.

If the epollitem is created with the EPOLLEXCLUSIVE flag, then the 1
returned from ep_poll_callback() indicates "break the loop, exclusive
wake up has happened" (the loop is in __wake_up_common()). Even in this
exclusive wake up case this seems totally fine, because wake up events
are not lost: the epollitem will scan all ready fds and will eventually
observe all of the callers (those that returned 1 from
ep_poll_callback()) as ready.

I hope I did not miss anything.

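To make the xchg() point above more concrete, here is a minimal
user-space sketch of a tail-only lockless add. It assumes C11 atomics
and pthreads in place of the kernel primitives, and the names
tail_list, add_tail_lockless() and demo_parallel_add_tail() are
hypothetical. It is not the list_add_tail_lockless() from the patch,
only an illustration of why the atomic exchange hands every adder a
unique predecessor, and why walking the list is only safe once all
adders have finished.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical types for illustration; not the kernel's struct list_head. */
struct node {
	struct node *next;
};

struct tail_list {
	struct node head;            /* sentinel, never freed */
	_Atomic(struct node *) tail; /* last node, or &head when empty */
};

static void tail_list_init(struct tail_list *l)
{
	l->head.next = NULL;
	atomic_init(&l->tail, &l->head);
}

/* Safe only against other concurrent add_tail callers, nothing else. */
static void add_tail_lockless(struct tail_list *l, struct node *n)
{
	struct node *prev;

	n->next = NULL;
	/* Fully ordered exchange: every caller gets a distinct predecessor. */
	prev = atomic_exchange(&l->tail, n);
	/* The link appears after the exchange, so walkers must wait for adders. */
	prev->next = n;
}

struct worker_arg {
	struct tail_list *list;
	int count;
};

static void *worker(void *p)
{
	struct worker_arg *a = p;

	for (int i = 0; i < a->count; i++) {
		struct node *n = malloc(sizeof(*n));

		assert(n != NULL);
		add_tail_lockless(a->list, n);
	}
	return NULL;
}

/* Adds from nthreads threads in parallel, then walks and frees the list.
 * Returns the number of nodes found on the list. */
int demo_parallel_add_tail(int nthreads, int per_thread)
{
	struct tail_list list;
	struct worker_arg arg = { &list, per_thread };
	pthread_t *tids = malloc(sizeof(*tids) * (size_t)(nthreads > 0 ? nthreads : 1));
	int found = 0;

	assert(tids != NULL);
	tail_list_init(&list);

	for (int i = 0; i < nthreads; i++) {
		int rc = pthread_create(&tids[i], NULL, worker, &arg);

		assert(rc == 0);
		(void)rc;
	}
	/* Joining all adders is what makes the plain walk below safe. */
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);

	for (struct node *n = list.head.next; n != NULL; ) {
		struct node *next = n->next;

		free(n);
		found++;
		n = next;
	}
	free(tids);
	return found;
}
```

Calling demo_parallel_add_tail(4, 1000) should find every node that was
added. Mixing in a concurrent add to the head, or a removal, would break
the assumption that the tail pointer is the only contended location.
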
> but the upshot of all of
> this is that I *really* want others to look at this patch too. A lot
> of small subtle things here.

It would be great if someone could look through it; eventpoll.c looks a
bit abandoned.

--
Roman