Date: Thu, 11 Jan 2024 03:22:05 -0800
From: "Paul E. McKenney"
To: Matthew Wilcox
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
        linux-kernel@vger.kernel.org, Suren Baghdasaryan, "Liam R. Howlett"
Subject: Re: [RFC] Sleep waiting for an rwsem to be unlocked
Reply-To: paulmck@kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 09, 2024 at 05:12:06PM +0000, Matthew Wilcox wrote:
> The problem we're trying to solve is a lock-free walk of
> /proc/$pid/maps.  If the process is modifying the VMAs at the same time
> the reader is walking them, it can see garbage.  For page faults, we
> handle this by taking the mmap_lock for read and retrying the page fault
> (excluding any further modifications).
>
> We don't want to take that approach for the maps file.
> The monitoring task may have a significantly lower process priority,
> and so taking the mmap_lock for read can block it for a significant
> period of time.  The obvious answer is to do some kind of backoff+sleep.
> But we already have a wait queue, so why not use it?
>
> I haven't done the rwbase version; this is just a demonstration of what
> we could do.  It's also untested other than by compilation.  It might
> well be missing something.
>
> Signed-off-by: Matthew Wilcox (Oracle)

At first glance, this is good and sufficient for this use case.

I do have one question that would be important if anyone were to want
to rely on the "This is equivalent to calling down_read(); up_read()"
statement in the header comment; please see below.

                                                        Thanx, Paul

> ---
>  include/linux/rwsem.h  |   6 +++
>  kernel/locking/rwsem.c | 104 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 108 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
> index 4f1c18992f76..e7bf9dfc471a 100644
> --- a/include/linux/rwsem.h
> +++ b/include/linux/rwsem.h
> @@ -250,6 +250,12 @@ DEFINE_GUARD_COND(rwsem_write, _try, down_write_trylock(_T))
>   */
>  extern void downgrade_write(struct rw_semaphore *sem);
>
> +/*
> + * wait for current writer to be finished
> + */
> +void rwsem_wait(struct rw_semaphore *sem);
> +int __must_check rwsem_wait_killable(struct rw_semaphore *sem);
> +
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>  /*
>   * nested locking. NOTE: rwsems are not allowed to recurse
> diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> index 2340b6d90ec6..7c8096c5586f 100644
> --- a/kernel/locking/rwsem.c
> +++ b/kernel/locking/rwsem.c
> @@ -332,7 +332,8 @@ EXPORT_SYMBOL(__init_rwsem);
>
>  enum rwsem_waiter_type {
>          RWSEM_WAITING_FOR_WRITE,
> -        RWSEM_WAITING_FOR_READ
> +        RWSEM_WAITING_FOR_READ,
> +        RWSEM_WAITING_FOR_RELEASE,
>  };
>
>  struct rwsem_waiter {
> @@ -511,7 +512,8 @@ static void rwsem_mark_wake(struct rw_semaphore *sem,
>                  if (waiter->type == RWSEM_WAITING_FOR_WRITE)
>                          continue;
>
> -                woken++;
> +                if (waiter->type == RWSEM_WAITING_FOR_READ)
> +                        woken++;
>                  list_move_tail(&waiter->list, &wlist);
>
>                  /*
> @@ -1401,6 +1403,67 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
>          preempt_enable();
>  }
>
> +static inline int __wait_read_common(struct rw_semaphore *sem, int state)
> +{
> +        int ret = 0;
> +        long adjustment = 0;
> +        struct rwsem_waiter waiter;
> +        DEFINE_WAKE_Q(wake_q);
> +
> +        waiter.task = current;
> +        waiter.type = RWSEM_WAITING_FOR_RELEASE;
> +        waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> +        waiter.handoff_set = false;
> +
> +        preempt_disable();
> +        raw_spin_lock_irq(&sem->wait_lock);
> +        if (list_empty(&sem->wait_list)) {
> +                if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
> +                        /* Provide lock ACQUIRE */
> +                        smp_acquire__after_ctrl_dep();
> +                        raw_spin_unlock_irq(&sem->wait_lock);
> +                        goto done;

If we take this path, we are ordered against the prior writer's release
courtesy of the acquire ordering on ->count.  But we are not ordered
against the next writer's acquisition if that writer takes the fastpath
because rwsem_write_trylock() only does acquire semantics.

Again, this does not matter for your use case, and it all just works on
strongly ordered systems such as x86.

Assuming I am not just confused here, as far as I am concerned, this
could be fixed by adjusting the guarantees in the rwsem_wait_killable()
function's header comment.  But it might be good to avoid the sharp
edges that would be provided by weakening that guarantee.  To that end,
I -think- that a fix that would save that header comment's current
wording would insert an smp_mb() before the above atomic_long_read(),
but I could easily be wrong.  Plus there might well need to be similar
adjustments later in the code.  (I don't immediately see any, but it
has been a good long while since I have stared at this code.)

Thoughts from people more familiar with this code?
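To make the placement concrete, the early-exit check with such a barrier
would look roughly like the snippet below.  This is only an illustration
of the idea sketched above, not a tested change, and (per the caveats
above) it may or may not actually provide the missing ordering:

        if (list_empty(&sem->wait_list)) {
                /*
                 * Hypothetical full barrier, per the suggestion above,
                 * intended to order the ->count check below against a
                 * later writer's fastpath acquisition as well as against
                 * the prior writer's release.
                 */
                smp_mb();
                if (!(atomic_long_read(&sem->count) & RWSEM_WRITER_MASK)) {
                        /* Provide lock ACQUIRE */
                        smp_acquire__after_ctrl_dep();
                        raw_spin_unlock_irq(&sem->wait_lock);
                        goto done;
                }
                adjustment = RWSEM_FLAG_WAITERS;
        }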
> +                }
> +                adjustment = RWSEM_FLAG_WAITERS;
> +        }
> +        rwsem_add_waiter(sem, &waiter);
> +        if (adjustment) {
> +                long count = atomic_long_add_return(adjustment, &sem->count);
> +                rwsem_cond_wake_waiter(sem, count, &wake_q);
> +        }
> +        raw_spin_unlock_irq(&sem->wait_lock);
> +
> +        if (!wake_q_empty(&wake_q))
> +                wake_up_q(&wake_q);
> +
> +        for (;;) {
> +                set_current_state(state);
> +                if (!smp_load_acquire(&waiter.task)) {
> +                        /* Matches rwsem_mark_wake()'s smp_store_release(). */
> +                        break;
> +                }
> +                if (signal_pending_state(state, current)) {
> +                        raw_spin_lock_irq(&sem->wait_lock);
> +                        if (waiter.task)
> +                                goto out_nolock;
> +                        raw_spin_unlock_irq(&sem->wait_lock);
> +                        /* Ordered by sem->wait_lock against rwsem_mark_wake(). */
> +                        break;
> +                }
> +                schedule_preempt_disabled();
> +        }
> +
> +        __set_current_state(TASK_RUNNING);
> +done:
> +        preempt_enable();
> +        return ret;
> +out_nolock:
> +        rwsem_del_wake_waiter(sem, &waiter, &wake_q);
> +        __set_current_state(TASK_RUNNING);
> +        ret = -EINTR;
> +        goto done;
> +}
> +
>  #else /* !CONFIG_PREEMPT_RT */
>
>  #define RT_MUTEX_BUILD_MUTEX
> @@ -1500,6 +1563,11 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
>          rwbase_write_downgrade(&sem->rwbase);
>  }
>
> +static inline int __wait_read_killable(struct rw_semaphore *sem)
> +{
> +        return rwbase_wait_lock(&sem->rwbase, TASK_KILLABLE);
> +}
> +
>  /* Debug stubs for the common API */
>  #define DEBUG_RWSEMS_WARN_ON(c, sem)
>
> @@ -1643,6 +1711,38 @@ void downgrade_write(struct rw_semaphore *sem)
>  }
>  EXPORT_SYMBOL(downgrade_write);
>
> +/**
> + * rwsem_wait_killable - Wait for current write lock holder to release lock
> + * @sem: The semaphore to wait on.
> + *
> + * This is equivalent to calling down_read(); up_read() but avoids the
> + * possibility that the thread will be preempted while holding the lock
> + * causing threads that want to take the lock for writes to block.  The
> + * intended use case is for lockless readers who notice an inconsistent
> + * state and want to wait for the current writer to finish.
> + */
> +int rwsem_wait_killable(struct rw_semaphore *sem)
> +{
> +        might_sleep();
> +
> +        rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
> +        rwsem_release(&sem->dep_map, _RET_IP_);
> +
> +        return __wait_read_common(sem, TASK_KILLABLE);
> +}
> +EXPORT_SYMBOL(rwsem_wait_killable);
> +
> +void rwsem_wait(struct rw_semaphore *sem)
> +{
> +        might_sleep();
> +
> +        rwsem_acquire_read(&sem->dep_map, 0, 0, _RET_IP_);
> +        rwsem_release(&sem->dep_map, _RET_IP_);
> +
> +        __wait_read_common(sem, TASK_UNINTERRUPTIBLE);
> +}
> +EXPORT_SYMBOL(rwsem_wait);
> +
>  #ifdef CONFIG_DEBUG_LOCK_ALLOC
>
>  void down_read_nested(struct rw_semaphore *sem, int subclass)
> --
> 2.43.0
>
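For illustration, the usage pattern described in the cover letter and in
the rwsem_wait_killable() kernel-doc might look roughly like the sketch
below.  The walk_vmas_lockless() helper and the direct use of
&mm->mmap_lock are hypothetical stand-ins, not part of this patch:

/*
 * Illustrative only: a lockless /proc/$pid/maps-style walker.  On
 * detecting a concurrent modification it does not take mmap_lock for
 * read; it merely waits for the current writer to finish and retries.
 */
static int walk_maps_lockless(struct mm_struct *mm)
{
        int err;

        for (;;) {
                /* Hypothetical helper: returns true on a consistent walk. */
                if (walk_vmas_lockless(mm))
                        return 0;

                /* A writer got in the way; sleep until it releases the lock. */
                err = rwsem_wait_killable(&mm->mmap_lock);
                if (err)
                        return err;     /* fatal signal: -EINTR */
        }
}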