Date: Fri, 26 Jan 2024 13:53:13 +0100
From: Frederic Weisbecker
To: Anna-Maria Behnsen
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, John Stultz,
	Thomas Gleixner, Eric Dumazet, "Rafael J. Wysocki", Arjan van de Ven,
	"Paul E. McKenney", Rik van Riel, Steven Rostedt, Sebastian Siewior,
	Giovanni Gherdovich, Lukasz Luba, "Gautham R. Shenoy",
	Srinivas Pandruvada, K Prateek Nayak
Subject: Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model
References: <20240115143743.27827-1-anna-maria@linutronix.de>
	<20240115143743.27827-19-anna-maria@linutronix.de>
In-Reply-To: <20240115143743.27827-19-anna-maria@linutronix.de>

On Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen wrote:
> + * Protection of the tmigr group state information:
> + * ------------------------------------------------
> + *
> + * The state information with the list of active children and migrator needs to
> + * be protected by a sequence counter. It prevents a race when updates in child
> + * groups are propagated in changed order. The state update is performed
> + * lockless and group wise. The following scenario describes what happens
> + * without updating the sequence counter:
> + *
> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
> + * as GRP0:1 will not change during the scenario):
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                    /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU0           migrator = CPU2
> + *           active   = CPU0           active   = CPU2
> + *              /    \                    /    \
> + *    CPUs     0      1                  2      3
> + *           active  idle              active  idle
> + *
> + *
> + * 1. CPU0 goes idle. As the update is performed group wise, in the first step
> + *    only GRP0:0 is updated. The update of GRP1:0 is pending as CPU0 has to
> + *    walk the hierarchy.
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                    /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = TMIGR_NONE     migrator = CPU2
> + *       --> active   =                active   = CPU2
> + *              /    \                    /    \
> + *    CPUs     0      1                  2      3
> + *       -->  idle   idle              active  idle
> + *
> + * 2. While CPU0 goes idle and continues to update the state, CPU1 comes out of
> + *    idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending as CPU1 also
> + *    has to walk the hierarchy. Both CPUs (CPU0 and CPU1) now walk the
> + *    hierarchy to perform the needed update from their point of view. The
> + *    currently visible state looks like the following:
> + *
> + *    LVL 1            [GRP1:0]
> + *                     migrator = GRP0:1
> + *                     active   = GRP0:0, GRP0:1
> + *                    /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *       --> migrator = CPU1           migrator = CPU2
> + *       --> active   = CPU1           active   = CPU2
> + *              /    \                    /    \
> + *    CPUs     0      1                  2      3
> + *            idle --> active          active  idle
> + *
> + * 3. Here is the race condition: CPU1 managed to propagate its changes (from
> + *    step 2) through the hierarchy to GRP1:0 before CPU0 (step 1) did. The
> + *    active members of GRP1:0 remain unchanged after the update since it is
> + *    still valid from CPU1's current point of view:
> + *
> + *    LVL 1            [GRP1:0]
> + *       --> migrator = GRP0:1
> + *       --> active   = GRP0:0, GRP0:1
> + *                    /                \
> + *    LVL 0  [GRP0:0]                  [GRP0:1]
> + *           migrator = CPU1           migrator = CPU2
> + *           active   = CPU1           active   = CPU2
> + *              /    \                    /    \
> + *    CPUs     0      1                  2      3
> + *            idle   active            active  idle

So let's take this scenario and suppose we are at this stage: CPU 1 has
propagated the state up to [GRP1:0], and CPU 0 is about to do the same but
hasn't yet.
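(For reference, the group state juggled below is a single 32-bit word that is
read and cmpxchg'ed atomically. From my reading of the series — so take the
exact field layout as my assumption rather than something this hunk shows —
it is roughly:

	union tmigr_state {
		u32	state;		/* the whole word, for atomic_read()/cmpxchg() */
		struct {
			u8	active;		/* bitmask of active children */
			u8	migrator;	/* childmask of the migrator, or TMIGR_NONE */
			u16	seq;		/* sequence counter bumped on each update */
		} __packed;
	};

i.e. "active", "migrator" and "seq" are all facets of one atomic word.)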
> +static bool tmigr_inactive_up(struct tmigr_group *group,
> +			      struct tmigr_group *child,
> +			      void *ptr)
> +{
> +	union tmigr_state curstate, newstate, childstate;
> +	struct tmigr_walk *data = ptr;
> +	bool walk_done;
> +	u8 childmask;
> +
> +	childmask = data->childmask;
> +	curstate.state = atomic_read(&group->migr_state);

And now suppose CPU 0 arrives here and sees the group->migr_state change
performed by CPU 1. So it's all good, right? The below atomic_try_cmpxchg()
will succeed on the first take.

> +	childstate.state = 0;
> +
> +	do {
> +		if (child)
> +			childstate.state = atomic_read(&child->migr_state);

But then how do you guarantee that CPU 0 will load here the version of
child->migr_state modified by CPU 1? What prevents it from loading the stale
value instead, the one last written by CPU 0? Nothing, because the two reads
above are unordered. As a result, CPU 0 may ignore the fact that CPU 1 is up
and wrongly report GRP0:0 as active up to GRP1:0.

One way to solve this is to change the above atomic_read(&group->migr_state)
into atomic_read_acquire(&group->migr_state). It's cheap and pairs with the
ordering enforced by the successful cmpxchg calls on the way up.

> +
> +		newstate = curstate;
> +		walk_done = true;
> +
> +		/* Reset active bit when the child is no longer active */
> +		if (!childstate.active)
> +			newstate.active &= ~childmask;
> +
> +		if (newstate.migrator == childmask) {
> +			/*
> +			 * Find a new migrator for the group, because the child
> +			 * group is idle!
> +			 */
> +			if (!childstate.active) {
> +				unsigned long new_migr_bit, active = newstate.active;
> +
> +				new_migr_bit = find_first_bit(&active, BIT_CNT);
> +
> +				if (new_migr_bit != BIT_CNT) {
> +					newstate.migrator = BIT(new_migr_bit);
> +				} else {
> +					newstate.migrator = TMIGR_NONE;
> +
> +					/* Changes need to be propagated */
> +					walk_done = false;
> +				}
> +			}
> +		}
> +
> +		newstate.seq++;
> +
> +		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
> +
> +	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));

Similarly, I seem to remember that a failing cmpxchg() doesn't imply a full
memory barrier. If that's the case, you may need to reload &group->migr_state
using an acquire barrier on failure. But lemme check that...

Thanks.
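P.S. To make the acquire suggestion concrete, the change I have in mind is
roughly the following (untested sketch; the comment wording is mine, so
double-check the barrier pairing):

 	childmask = data->childmask;
-	curstate.state = atomic_read(&group->migr_state);
+	/*
+	 * Pairs with the ordering enforced by the successful cmpxchg of a
+	 * remote CPU propagating its state up: if the new group state is
+	 * visible here, that CPU's child state update must be visible in
+	 * the childstate read below as well.
+	 */
+	curstate.state = atomic_read_acquire(&group->migr_state);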