Date: Wed, 17 Nov 2021 20:11:32 +0000
From: Eric Wong
To: Frederic Weisbecker
Cc: linux-rt-users@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Ingo Molnar, Peter Zijlstra,
 Thomas Gleixner, Steven Rostedt, Mike Galbraith,
 Sebastian Andrzej Siewior, John Ogness, Roman Penyaev,
 Davidlohr Bueso, Jason Baron, Al Viro, "Paul E. McKenney",
 Mathieu Desnoyers
Subject: Re: [RFC] How to fix eventpoll rwlock based priority inversion on PREEMPT_RT?
Message-ID: <20211117201132.M259904@dcvr>
References: <20211116140252.GA348770@lothringen>
In-Reply-To: <20211116140252.GA348770@lothringen>
X-Mailing-List: linux-kernel@vger.kernel.org

Frederic Weisbecker wrote:
> Hi,
>
> I'm iterating again on this topic, this time with the author of
> the patch Cc'ed.
>
> The following commit:
>
> a218cc491420 (epoll: use rwlock in order to reduce ep_poll
> callback() contention)
>
> has changed the ep->lock into an rwlock.
> This can cause priority inversion on PREEMPT_RT. Here is an example:
>
>
> 1) High priority task A waits for events on epoll_wait(), nothing shows up so
> it goes to sleep for new events in the ep_poll() loop.
>
> 2) Lower prio task B brings new events in ep_poll_callback(), waking up A
> while still holding read_lock(ep->lock)
>
> 3) Task A wakes up immediately, tries to grab write_lock(ep->lock) but it has
> to wait for task B to release read_lock(ep->lock). Unfortunately there is
> no priority inheritance when write_lock() is called on an rwlock that is
> already read_lock'ed. So back to task B that may even be preempted by
> yet another task before releasing read_lock(ep->lock).
>
>
> Now how to solve this? Several possibilities:
>
> == Delay the wake up after releasing the read_lock()? ==
>
> That solves part of the problem only. If another event comes up
> concurrently we are back to the original issue.
>
> == Make rwlock more fair ? ==
>
> Currently read_lock() only acquires the rtmutex if the lock is already
> write-held (or write_lock() is waiting to acquire). So if read_lock() happens
> after write_lock(), fairness is observed but if write_lock() happens after
> read_lock(), priority inheritance doesn't happen.
>
> I think there has been attempts to solve this by the past but some issues
> arised (don't know the exact details, comments on rwbase_rt.c bring some clues).
>
> == Convert the rwlock to RCU ? ==
>
> Traditionally, we try to convert rwlocks bringing issues to RCU. I'm not sure the
> situation fits here because the rwlock is used the other way around:
> the epoll consumer does the write_lock() and the producers do read_lock(). Then
> concurrent producers use ad-hoc concurrent list add (see list_add_tail_lockless)
> to handle racy modifications.
>
> There are also list modifications on both side. There are added from the
> producers and read and deleted (even re-added sometimes) on the consumer side.
>
> Perhaps RCU could be used with keeping locking on the consumer side...

+CC linux-fsdevel and Mathieu Desnoyers

I proposed using wfcqueue many years ago, but ran out of
time/hardware/funding to work on it:

https://yhbt.net/lore/lkml/20130401183118.GA9968@dcvr.yhbt.net/

wfcqueue is used internally by Userspace-RCU, but wfcqueue
itself doesn't rely on RCU. I'm not sure if wfcqueue helps
PREEMPT_RT, but Mathieu + Paul might.

> == Convert to llist ? ==
>
> It's a possibility but some operations like single element deletion may be
> costly because only llist_add() and llist_del_all() are atomic on llist.
> !CONFIG_PREEMPT_RT might not be happy about it.
>
> == Consider epoll not PREEMPT_RT friendly? ==
>
> A last resort is to simply consider epoll is not RT-friendly and suggest
> using more simple alternatives like poll()....
>
> Any thoughts?
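
To make the llist cost concern above concrete, here is a minimal userspace
sketch using C11 atomics. It is not kernel code and not the kernel's llist
implementation; struct lockless_list, list_push(), list_take_all() and
list_remove_one() are hypothetical names chosen for illustration. Only the
push and the take-all are single atomic operations, in the spirit of
llist_add() and llist_del_all(). Removing one element has to detach the
whole list and re-add everything else:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace analogue of a lock-less singly linked list. */
struct node {
	struct node *next;
	int value;
};

struct lockless_list {
	_Atomic(struct node *) first;
};

/* Producer side: lock-free push at the head (compare-and-swap loop). */
static void list_push(struct lockless_list *l, struct node *n)
{
	struct node *old;

	assert(n != NULL);
	old = atomic_load_explicit(&l->first, memory_order_relaxed);
	do {
		n->next = old;	/* old is refreshed when the CAS fails */
	} while (!atomic_compare_exchange_weak_explicit(&l->first, &old, n,
							memory_order_release,
							memory_order_relaxed));
}

/* Consumer side: detach the whole list with one atomic exchange. */
static struct node *list_take_all(struct lockless_list *l)
{
	return atomic_exchange_explicit(&l->first, (struct node *)NULL,
					memory_order_acquire);
}

/*
 * Single element removal: O(n), because the only way to reach an interior
 * node safely is to take everything and push the survivors back.  The
 * survivors end up in reversed order, and concurrent observers see a
 * temporarily empty list.  Returns 1 if victim was found, 0 otherwise.
 */
static int list_remove_one(struct lockless_list *l, struct node *victim)
{
	struct node *cur = list_take_all(l);
	int found = 0;

	while (cur) {
		struct node *next = cur->next;

		if (cur == victim)
			found = 1;
		else
			list_push(l, cur);
		cur = next;
	}
	return found;
}
```

The extra walk, the re-push of every surviving node and the window where the
list looks empty are the kind of overhead the quoted text worries about for
the !CONFIG_PREEMPT_RT case.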