Date: Wed, 05 Dec 2018 12:16:46 +0100
From: Roman Penyaev
To: Jason Baron
Cc: Alexander Viro, "Paul E. McKenney", Linus Torvalds,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention
In-Reply-To: <45bce871-edfd-c402-acde-2e57e80cc522@akamai.com>
References: <20181203110237.14787-1-rpenyaev@suse.de> <45bce871-edfd-c402-acde-2e57e80cc522@akamai.com>
Message-ID: <38cc83144a2ec332dead4e21214ea068@suse.de>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2018-12-04 18:23, Jason Baron wrote:
> On 12/3/18 6:02 AM, Roman Penyaev wrote:

[...]
>>
>>          ep_set_busy_poll_napi_id(epi);
>>
>> @@ -1156,8 +1187,8 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>           */
>>          if (unlikely(ep->ovflist != EP_UNACTIVE_PTR)) {
>>                  if (epi->next == EP_UNACTIVE_PTR) {
>> -                        epi->next = ep->ovflist;
>> -                        ep->ovflist = epi;
>> +                        /* Atomically exchange tail */
>> +                        epi->next = xchg(&ep->ovflist, epi);
>
> This also relies on the fact that the same epi can't be added to the
> list in parallel, because the wait queue doing the wakeup will have the
> wait_queue_head lock. That was a little confusing for me b/c we only had
> the read_lock() above.

Yes, that is indeed not an obvious path, but the wq is locked by the
wake_up_*_poll() call, or the caller of wake_up_locked_poll() has to be
sure that wq.lock is taken.

I'll add an explicit comment here, thanks for pointing it out.

>
>>                          if (epi->ws) {
>>                                  /*
>>                                   * Activate ep->ws since epi->ws may get
>> @@ -1172,7 +1203,7 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>
>>          /* If this file is already in the ready list we exit soon */
>>          if (!ep_is_linked(epi)) {
>> -                list_add_tail(&epi->rdllink, &ep->rdllist);
>> +                list_add_tail_lockless(&epi->rdllink, &ep->rdllist);
>>                  ep_pm_stay_awake_rcu(epi);
>>          }
>
> same for this.

... and an explicit comment here.

>
>>
>> @@ -1197,13 +1228,13 @@ static int ep_poll_callback(wait_queue_entry_t *wait, unsigned mode, int sync, v
>>                                  break;
>>                          }
>>                  }
>> -                wake_up_locked(&ep->wq);
>> +                wake_up(&ep->wq);
>
> why the switch here to the locked() variant? Shouldn't the 'reader'
> side, in this case which takes the rwlock for write see all updates in a
> coherent state at this point?

lockdep inside __wake_up_common expects wq_head->lock to be taken, and
it does not seem like a good idea to leave the wq naked on the wake-up
path, when several CPUs can enter the wake function. Although
default_wake_function is protected by a spinlock inside
try_to_wake_up(), autoremove_wake_function(), for example, can't be
called concurrently for the same wq (it removes the wq entry from the
list). Also, in the case of bookmarks, __wake_up_common adds an entry to
the list, and thus can't be called without any locks.

I understand your concern, and you are right that the read side sees wq
entries as stable, but that works only if __wake_up_common does not
modify anything. That seems true now, but of course it is too scary to
rely on that in the future.

>
>>          }
>>          if (waitqueue_active(&ep->poll_wait))
>>                  pwake++;
>>
>>  out_unlock:
>> -        spin_unlock_irqrestore(&ep->wq.lock, flags);
>> +        read_unlock_irqrestore(&ep->lock, flags);
>>
>>          /* We have to call this outside the lock */
>>          if (pwake)
>> @@ -1489,7 +1520,7 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>                  goto error_remove_epi;
>>
>>          /* We have to drop the new item inside our item list to keep track of it */
>> -        spin_lock_irq(&ep->wq.lock);
>> +        write_lock_irq(&ep->lock);
>>
>>          /* record NAPI ID of new item if present */
>>          ep_set_busy_poll_napi_id(epi);
>> @@ -1501,12 +1532,12 @@ static int ep_insert(struct eventpoll *ep, const struct epoll_event *event,
>>
>>                  /* Notify waiting tasks that events are available */
>>                  if (waitqueue_active(&ep->wq))
>> -                        wake_up_locked(&ep->wq);
>> +                        wake_up(&ep->wq);
>
> is this necessary to switch as well? Is this to make lockdep happy?
> Looks like there are few more conversions too below...

Yes, necessary, wq.lock should be taken.

--
Roman
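
For reference, the xchg() push from the first hunk can be sketched in userspace C, with C11 atomic_exchange() standing in for the kernel's xchg(). This is only an illustration under stated assumptions, not code from the patch: struct node, list_head, push_lockless() and list_length() are hypothetical stand-ins for struct epitem, ep->ovflist and the surrounding epoll code, and the sketch does not exercise real concurrency.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct epitem: only the link matters here. */
struct node {
	struct node *next;
	int id;
};

/* Hypothetical stand-in for ep->ovflist: the head, updated atomically. */
static _Atomic(struct node *) list_head;

/*
 * Push n onto the list. atomic_exchange() installs n as the new head and
 * returns the previous head in a single step, so producers pushing
 * different nodes cannot overwrite each other's head update.
 *
 * Assumption carried over from the thread: the same node is never pushed
 * by two producers at once, and nobody walks the list while producers
 * run, because n->next is written only after n is already visible.
 */
static void push_lockless(struct node *n)
{
	n->next = atomic_exchange(&list_head, n);
}

/* Count the nodes; only meaningful once all producers are quiescent. */
static size_t list_length(void)
{
	size_t count = 0;
	struct node *cur;

	for (cur = atomic_load(&list_head); cur != NULL; cur = cur->next)
		count++;
	return count;
}
```

Pushing nodes a, b and then c leaves c at the head, with c.next pointing to b and b.next pointing to a. In the patch, exclusion of walkers is what the write side of ep->lock provides, and the wait queue head lock is what keeps one epi from being pushed twice in parallel.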