Received: by 2002:a05:7412:8d10:b0:f3:1519:9f41 with SMTP id bj16csp4734824rdb; Tue, 12 Dec 2023 08:00:01 -0800 (PST) X-Google-Smtp-Source: AGHT+IFucz0t8IE+7sukPIdyu41l8VyMc2nz3ehysL9UqfzUD0N7WfHsfl7bZUC4q9nI0zcU8fC6 X-Received: by 2002:a17:902:c405:b0:1d3:3f66:81d2 with SMTP id k5-20020a170902c40500b001d33f6681d2mr624824plk.120.1702396801005; Tue, 12 Dec 2023 08:00:01 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1702396800; cv=none; d=google.com; s=arc-20160816; b=TWj1asIPMGkjf2OrQv1YALlAqSikoGO5hI4FF51KC49bUbsA5fODkzpzGIHmZQMeM7 bb/tSG0PlZl6W0GhQ4RyBcVXk5Jch658G2hrOPYriSMTCAkamo72dfc9U18j0TIPb+hA Ml839lKZ/gS6o4szb4QGwT0P1Ublbr0f07IXWurMTdh2BtvGDp3/ghABPFeDlOf8vxc6 fZTaTM8t0mHoTSuP7hkFwQP8ZWkyGdkt6u8PsmTeJkLjSpUnsryxXVQrAMRMLgnkrG6I EtpZQGaD9k725hZd7kDpdgMUqIuaqzgdUoPr5ZMV1SL3/fitYAgw0pbW/SlvptPA4Bpg i/xQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:dkim-signature:dkim-signature:date; bh=CZYUGGhGIcq4pE6W68er3NNyNlAUozlYglpukm412Vs=; fh=5cMh8CKYPGN1F0fzuaj1xrlN5+dBfa6bPK4XIprQ5s0=; b=ga/mo8vlQnE6f1HcVb6fE6V9RFLpqx4j9IheTFjq/3RdTVyo2z+BQBZdSOviWt64lS zzyOQQhAuplm4dDi1+CzmspRTHzvKRxNumqYBFk/by8kLG96+Zsr0j8cCu5XPFAIr7lO aY0utDr/yWPkDMvnkRNEDpcw4itwXoLNYiwUgMHVQ/f1xlEnA8qI6AIfuBVk6snGhb7o a1nz+1FvcqZTVeIxyAykHvRgz7xy66e6+oobBIxhqcBTNa0ALvppDTHXXXrP1B7M9Gi2 4LRKPbqueL20xexYD3P5yFHKmpZ7LnfolRDdgLUtBfjUvOcXiEYDFQHU5oUZGDeLqDaI TJxg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=KnPTuFe2; dkim=neutral (no key) header.i=@linutronix.de; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from howler.vger.email (howler.vger.email. [2620:137:e000::3:4]) by mx.google.com with ESMTPS id f10-20020a170902ce8a00b001d0ac4424cesi8294579plg.138.2023.12.12.08.00.00 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Tue, 12 Dec 2023 08:00:00 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) client-ip=2620:137:e000::3:4; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=KnPTuFe2; dkim=neutral (no key) header.i=@linutronix.de; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:4 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by howler.vger.email (Postfix) with ESMTP id D28FF808BC80; Tue, 12 Dec 2023 07:59:57 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at howler.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376757AbjLLP7j (ORCPT + 99 others); Tue, 12 Dec 2023 10:59:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53592 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1376763AbjLLP7i (ORCPT ); Tue, 12 Dec 2023 10:59:38 -0500 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CAA3CF2 for ; Tue, 12 Dec 2023 07:59:43 -0800 (PST) Date: Tue, 12 Dec 2023 16:59:39 +0100 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1702396781; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CZYUGGhGIcq4pE6W68er3NNyNlAUozlYglpukm412Vs=; b=KnPTuFe2hjj3bt5IFJKNFhFoExaBxz+/BsbAHWxGFsoomXhhaDtRrC5+FfvASby/ocJpsK c8dDvKfu1DWBDSuJuISf2XBJ3KwTszByLPMQQvCpp+D+KMFLASG58WlMW/eLuq5fwubYCy h8kXONN+9AiYMSyDLuMYA2meHpvzde9uqeZVlWHJyLaUOMIFFkl8FNJN6ukBCc24JqH5Vt bKlXXvef195wKXZuPIQcf/yI826WZnZU3FYNoC64ioZunWiJpIKvDFFAjjVEXks/HGfZQd FWzK+jip4rnPPP4f7QYPwHPU1U1Orqykqfk5pt0hklmFIs46PHWcF/ZEK9Jo5w== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1702396781; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=CZYUGGhGIcq4pE6W68er3NNyNlAUozlYglpukm412Vs=; b=pU6km5Og1ShqVbKq4usrIy3bV33YC05BrFhVOLFDGXe7hWN27pwxTcAZfp4mpT87/Q2otK NouffezMEhhViMAw== From: Sebastian Siewior To: Anna-Maria Behnsen Cc: linux-kernel@vger.kernel.org, Peter Zijlstra , John Stultz , Thomas Gleixner , Eric Dumazet , "Rafael J . Wysocki" , Arjan van de Ven , "Paul E . McKenney" , Frederic Weisbecker , Rik van Riel , Steven Rostedt , Giovanni Gherdovich , Lukasz Luba , "Gautham R . Shenoy" , Srinivas Pandruvada , K Prateek Nayak Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model Message-ID: <20231212155939.Z_YQl91J@linutronix.de> References: <20231201092654.34614-1-anna-maria@linutronix.de> <20231201092654.34614-31-anna-maria@linutronix.de> <20231211180417.XG9G3ryS@linutronix.de> <87h6knmzco.fsf@somnus> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable In-Reply-To: <87h6knmzco.fsf@somnus> X-Spam-Status: No, score=-0.9 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on howler.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (howler.vger.email [0.0.0.0]); Tue, 12 Dec 2023 07:59:58 -0800 (PST) On 2023-12-12 12:31:19 [+0100], Anna-Maria Behnsen wrote: > Sebastian Siewior writes: >=20 > > On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote: > >> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migrati= on.c > >> new file mode 100644 > >> index 000000000000..05cd8f1bc45d > >> --- /dev/null > >> +++ b/kernel/time/timer_migration.c > >> @@ -0,0 +1,1636 @@ > > =E2=80=A6 > >> +static bool tmigr_active_up(struct tmigr_group *group, > >> + struct tmigr_group *child, > >> + void *ptr) > >> +{ > >> + union tmigr_state curstate, newstate; > >> + struct tmigr_walk *data =3D ptr; > >> + bool walk_done; > >> + u8 childmask; > >> + > >> + childmask =3D data->childmask; > >> + newstate =3D curstate =3D data->groupstate; > >> + > >> +retry: > >> + walk_done =3D true; > >> + > >> + if (newstate.migrator =3D=3D TMIGR_NONE) { > >> + newstate.migrator =3D childmask; > >> + > >> + /* Changes need to be propagated */ > >> + walk_done =3D false; > >> + } > >> + > >> + newstate.active |=3D childmask; > >> + > >> + newstate.seq++; > >> + > >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstat= e.state)) { > >> + newstate.state =3D curstate.state; > > > > This does not look right. If > > group->migr_state !=3D curstate.state > > then > > curstate.state =3D newstate.state > > > > does not make a difference since curstate is on stack. So this should > > become > > > > | if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate= =2Estate)) { > > | curstate.state =3D newstate.state =3D atomic_read(&group->parent->mi= gr_state); > > > > and now I question the existence of tmigr_walk::groupstate. It does not > > match the comment for the struct saying it will be re-read if the > > cmpxchg() fails because it does not happen (at least here). Also why do > > we have it? Is it avoid atomic_read() and have it "cached"? >=20 > atomic_try_cmpxchg() updates curstate.state with the actual > group->migr_state when those values do not match. So it is reread by > atomic_try_cmpxchg() and still matches the description. (This at least > tells the function description of atomic_try_cmpxchg()). Ach. Indeed. That part slipped my mind. Could still replace it with: newstate =3D curstate to match the assignment at the top of the function? Or do something like: | childmask =3D data->childmask; | curstate =3D data->groupstate; | retry: | newstate =3D curstate; | | walk_done =3D true; =E2=80=A6 | if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, news= tate.state)) | goto retry; So gcc can save a branch and recycle the upper the cooking the code. gcc-13 does not recognise this, clang-16 does. > But beside of this, why do I want to update curstate.state with > group->parent->migr_state when cmpxchg of this group wasn't successful > yet or was it a copy paste error? It was an error. > >> + data->groupstate.state =3D atomic_read(&group->parent->migr_state); > >> + data->childmask =3D group->childmask; > > > > We don't re-read in case the cmpxchg failed assuming someone else is > > updating the state. Wouldn't it make sense to read the childmask at top > > of the function from the child pointer. There is no need to keep around > > in the data pointer, right? >=20 > When we are in lvl0, then @child is NULL as the child is a tmigr_cpu and > not a tmigr_group. This is the reason why I decided to store it inside > the tmigr_walk struct. But it supposed to be group->migr_state for the cmpxchg. So considering the previous bit, why not: | childmask =3D data->childmask; | curstate =3D atomic_read(&group->migr_state); |=20 | do { | newstate =3D curstate; | walk_done =3D true; |=20 | if (newstate.migrator =3D=3D TMIGR_NONE) { | newstate.migrator =3D childmask; |=20 | /* Changes need to be propagated */ | walk_done =3D false; | } |=20 | newstate.active |=3D childmask; |=20 | newstate.seq++; |=20 | } while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state,= newstate.state)); this seems nice. > >> + } > >> + > >> + /* > >> + * The group is active and the event will be ignored - the ignore fl= ag is > >> + * updated without holding the lock. In case the bit is set while > >> + * another CPU already handles remote events, nothing happens, becau= se > >> + * it is clear that the CPU became active just in this moment, or in > >> + * worst case the event is handled remote. Nothing to worry about. > >> + */ > > > > The CPU is becoming active, so is the group. The ignore flag for the > > group is updated lock less to reflect this. Another CPU might also > > set it true while becoming active. In worst case the migrator > > observes it too late and expires remotely timer belonging to this > > group. The lock is held while going idle (and setting ignore to > > false) so the state is not lost. > > >=20 > This is what I wanted to explain: >=20 > /* > * The group is active (again). The group event might be still queued > * into the parent group's timerqueue but can now be handled by the the s/the$@@ > * migrator of this group. Therefore the ignore flag for the group event > * is updated to reflect this. > * > * The update of the ignore flag in the active path is done lock lockless > * less. In worst case the migrator of the parent group observes the > * change too late and expires remotely timer belonging to this a timer? > * group. The lock is held while updating the ignore flag in idle > * path. So this state change will not be lost. > */ >=20 > >> + group->groupevt.ignore =3D true; > >> + > >> + return walk_done; > >> +} =E2=80=A6 > >> + if (!tmc->online || tmc->remote || tmc->cpuevt.ignore || > >> + now < tmc->cpuevt.nextevt.expires) { > >> + raw_spin_unlock_irq(&tmc->lock); > >> + return next; > > > > Looking at the last condition where the timerqueue has been forwarded by > > a jiffy+, shouldn't we return _that_ values next instead of KTIME_MAX? >=20 > No. Because the event is already queued into the hierarchy and the > migrator takes care. If hiererachy is completely idle, the CPU which > updated the event takes care. I'll add this to the comment above. So another CPU took care of it and we set tmc->wakeup to KTIME_MAX=E2=80=A6 One confusing part is that this return value (if not aborted early but after completing this function) is used to set tmc->wakeup based on the next pending timer for the CPU that was expired remotely. We could have expired four CPUs and the next timer of the last CPU may not be the earliest timer for the fist CPU in the group. And this fine because it can be stale and only valid if the CPU goes idle? > >> + /* > >> + * When the CPU is idle, check whether the recalculation of @tmc->wa= keup > >> + * is required. @tmc->wakeup_recalc is set by a remote CPU which is > >> + * about to go offline, was the last active CPU in the whole timer > >> + * migration hierarchy and now delegates handling of the hierarchy to > >> + * this CPU. > > > > I'm failing here=E2=80=A6 >=20 > If the CPU is idle, check whether the recalculation of @tmc->wakeup > is required. @tmc->wakeup_recalc is set, when the last active CPU > went offline. The last active CPU delegated the handling of the timer > migration hierarchy to another (this) CPU by updating this flag and > sending a reschedule. >=20 > Better? So the last CPU going offline had to be the migrator because otherwise it wouldn't matter? =E2=80=A6 > >> + > >> + if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstat= e.state)) { > > > > This one appears wrong, too. The curstate is not getting updated during > > retry. >=20 > See the answer above. Yes, and I think the do { } while should work here, too. > >> + * data->nextexp was set by tmigr_update_events() and contains the > >> + * expiry of the first global event which needs to be handled > >> + */ > >> + if (data->nextexp !=3D KTIME_MAX) { > >> + WARN_ON_ONCE(group->parent); > >> + /* > >> + * Top level path - If this CPU is about going offline, wake > >> + * up some random other CPU so it will take over the > >> + * migrator duty and program its timer properly. Ideally > >> + * wake the CPU with the closest expiry time, but that's > >> + * overkill to figure out. > >> + * > >> + * Set wakeup_recalc of remote CPU, to make sure the complete > >> + * idle hierarchy with enqueued timers is reevaluated. > >> + */ > >> + if (!(this_cpu_ptr(&tmigr_cpu)->online)) { > >> + struct tmigr_cpu *tmc =3D this_cpu_ptr(&tmigr_cpu); > >> + unsigned int cpu =3D smp_processor_id(); > >> + struct tmigr_cpu *tmc_resched; > >> + > >> + cpu =3D cpumask_any_but(cpu_online_mask, cpu); > >> + tmc_resched =3D per_cpu_ptr(&tmigr_cpu, cpu); > >> + > >> + raw_spin_unlock(&tmc->lock); > >> + > >> + raw_spin_lock(&tmc_resched->lock); > >> + tmc_resched->wakeup_recalc =3D true; > >> + raw_spin_unlock(&tmc_resched->lock); > >> + > >> + raw_spin_lock(&tmc->lock); > >> + smp_send_reschedule(cpu); > > > > This whole thing confuses me. > > If the CPU goes offline, it needs to get removed from the migration > > hierarchy and this is it. Everything else is handled by the migrator. If > > the migrator is going offline then it needs wake a random CPU and make > > sure it takes the migrator role. I am confused by the whole ::wakeup and > > ::wakeup_recalc thingy. > > >=20 > wakeup_recalc is required to indicate, that the CPU was chosen as the > new migrator CPU when the last active CPU in timer migration hierarchy > went offline. Aha! I suspected this. So this is more like need_new_migrator. =E2=80=A6 > Hopefully this is now clear? yes. > Thanks, >=20 > Anna-Maria Sebastian