Received: by 2002:a05:7412:e794:b0:fa:551:50a7 with SMTP id o20csp600270rdd; Tue, 9 Jan 2024 13:43:52 -0800 (PST) X-Google-Smtp-Source: AGHT+IEznA/7tdJspFaypjMBfc39ix+yZTIhS2YI6Z+f2HNpwddBsrLokio9OLOlDC/HBNgMQuRq X-Received: by 2002:a17:902:7844:b0:1d4:3e87:d446 with SMTP id e4-20020a170902784400b001d43e87d446mr15486pln.127.1704836632523; Tue, 09 Jan 2024 13:43:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1704836632; cv=none; d=google.com; s=arc-20160816; b=Nt5Tdpn6G00scRNYH1krX680RPJ8tSjbzNHEP533Pzx5fTclF+mGbUrJeoh850Kc1s OtQiBaR2N8P9suS+SYYlKhdhylFPzsJw25/VC55nA/4KfmksqYrKn74U3jCNuV8KA0on kLzBq4O+c7vBU8mudAOmzdKMvujSLxOTSMsk+C4+43xswEDV/CLELFMzT5T8npCgsjcn PHTL8wqMUDJXi97N4nw3w9SCfS+oLcAyVc7R2vURt4oO5rk1U8gO7s04oEohYj1ZjIjG vMCAe15c/dNKJYKxJ2JPcFPySDUzJMmQKljpPCB+0oY1bngXyCZ9jCUfBf/i4VB6fJeA fXQQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-disposition:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:message-id:subject:cc :to:from:date:dkim-signature; bh=ZzkJt17lxJqZBgSEhiHS6vPyJu6wWmn4ZJ+SIdSIktU=; fh=bKWFTNSJ8z1PVqjMj20EC21PsensvVZYbmZjtAszKMc=; b=h9180FVZQASY4nlWCq737LY0hB0YT+U2PXYTyAfaRpJ426XErUMYmh9/KCUbFqeGoL v6rSDnxVzQN5mCxC9d9RImq2vLuoFl3Y8LEcypeGIr7fQ1Gnz1olR3uWQI3PU0MtD5E2 rBWwdifF90sddTjBv9cnEEXqD7IJjXlpl8VHJFhiZqOahF9KnWGYoIhc+z8mnC59oSg7 XxDJzpQoNJoG7nQCY6MeYhdPYPqd2dxEhvWs4DoHcvsbqHjY+DO6yTVpibVGhvH2Bf6v jQW37yh3vzS/TpJTjKiuXfOgMK87adyq0Qo4i8bUsnHZGqmuJv/kEqHMRMIBKaVTed+b oVtQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=neutral (body hash did not verify) header.i=@infradead.org header.s=casper.20170209 header.b=RNWn93Rj; spf=pass (google.com: domain of linux-kernel+bounces-21437-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-21437-linux.lists.archive=gmail.com@vger.kernel.org" Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id x1-20020a170902820100b001d3e0015960si2194744pln.143.2024.01.09.13.43.52 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 09 Jan 2024 13:43:52 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-21437-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@infradead.org header.s=casper.20170209 header.b=RNWn93Rj; spf=pass (google.com: domain of linux-kernel+bounces-21437-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-21437-linux.lists.archive=gmail.com@vger.kernel.org" Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 333FF28892C for ; Tue, 9 Jan 2024 21:43:52 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C279E3F8F9; Tue, 9 Jan 2024 21:42:12 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="RNWn93Rj" Received: from casper.infradead.org (casper.infradead.org [90.155.50.34]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 43D5A3F8F1 for ; Tue, 9 Jan 2024 21:42:10 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dmarc=none (p=none dis=none) header.from=infradead.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=In-Reply-To:Content-Type:MIME-Version: References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description; bh=D47kuEUuvY/SrXunoNEbi3Kkq7SL6IKph7l39rRXOkA=; b=RNWn93RjQaoOeXeMsl8H+OA4py NXoKNEhNNHWWuomGV6O8Wr2hOdVYNb7AJxYDZCw6AXMhIj71E1JaZYENde6tEe8+OXgQjpyYiFiok avPWrWxjLKnrJdyKsoUWbpEdQZgQA3R4/pTyTIyo8KBQENc1bDN30WVklnA1woOQ00FBYVp6LRVV9 bHBbDJN050IGQmSdZ1jeu+hwWDQDlG1Mq6Ti4Q7YToN7o+6WmquSGmIAqtTNJUbqMF0XAsIzSCreX +LJnKXMBRISm8WJjS+T8K6MHVdp2K+460zdXI9aOLFtdgjKhEXVv7A7AJMj1EAUjdqnwHuSaLX2cV Kzk1+xsQ==; Received: from willy by casper.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1rNJrH-00AMnq-PL; Tue, 09 Jan 2024 21:42:03 +0000 Date: Tue, 9 Jan 2024 21:42:03 +0000 From: Matthew Wilcox To: Waiman Long Cc: Peter Zijlstra , Ingo Molnar , Will Deacon , linux-kernel@vger.kernel.org, Suren Baghdasaryan , "Liam R. Howlett" , "Paul E. McKenney" Subject: Re: [RFC] Sleep waiting for an rwsem to be unlocked Message-ID: References: <04d23913-fb0e-4f47-9e6c-dae50ee28a3a@redhat.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <04d23913-fb0e-4f47-9e6c-dae50ee28a3a@redhat.com> On Tue, Jan 09, 2024 at 04:04:08PM -0500, Waiman Long wrote: > On 1/9/24 12:12, Matthew Wilcox wrote: > > The problem we're trying to solve is a lock-free walk of > > /proc/$pid/maps. If the process is modifying the VMAs at the same time > > the reader is walking them, it can see garbage. For page faults, we > > handle this by taking the mmap_lock for read and retrying the page fault > > (excluding any further modifications). > > > > We don't want to take that approach for the maps file. The monitoring > > task may have a significantly lower process priority, and so taking > > the mmap_lock for read can block it for a significant period of time. > > The obvious answer is to do some kind of backoff+sleep. But we already > > have a wait queue, so why not use it? > > > > I haven't done the rwbase version; this is just a demonstration of what > > we could do. It's also untested other than by compilation. It might > > well be missing something. > > It is not clear what exactly is the purpose of this new API. Are you just .. really? I wrote it out in the part you quoted, and I wrote it out differently in the kernel-doc for the function: + * rwsem_wait_killable - Wait for current write lock holder to release lock + * @sem: The semaphore to wait on. + * + * This is equivalent to calling down_read(); up_read() but avoids the + * possibility that the thread will be preempted while holding the lock + * causing threads that want to take the lock for writes to block. The + * intended use case is for lockless readers who notice an inconsistent + * state and want to wait for the current writer to finish. Something I forgot to add was that we only guarantee that _a_ writer finished; another writer may have the lock when the function returns. > waiting in the rwsem wait queue until it gets waken up without taking a read > or write lock? I see two issues at the moment. > > 1) The handoff processing should exclude the new RWSEM_WAITING_FOR_RELEASE > waiter types. Hmm. I thought I'd done that by only incrementing 'woken' for RWSEM_WAITING_FOR_READ types. > 2) If the rwsem is free, it should call rwsem_wake() again to wake up the > next waiter, like what is being done in up_write(). because the wait queue might have a waiter followed by a writer? I think calling rwsem_wake() again is probably a bad idea as it will defeat the MAX_READERS_WAKEUP limit. Probably rwsem_mark_wake() needs to handle that case itself; maybe something like this? +++ b/kernel/locking/rwsem.c @@ -419,6 +419,7 @@ static void rwsem_mark_wake(struct rw_semaphore *sem, lockdep_assert_held(&sem->wait_lock); +again: /* * Take a peek at the queue head waiter such that we can determine * the wakeup(s) to perform. @@ -542,6 +543,12 @@ static void rwsem_mark_wake(struct rw_semaphore *sem, */ if (oldcount & RWSEM_FLAG_HANDOFF) adjustment -= RWSEM_FLAG_HANDOFF; + } else { + /* + * Everybody we woke was a waiter, not a reader. Wake the + * first writer instead. + */ + goto again; } if (adjustment)