From: Anna-Maria Behnsen
To: Sebastian Siewior
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, John Stultz,
	Thomas Gleixner, Eric Dumazet, "Rafael J. Wysocki",
	Arjan van de Ven, "Paul E. McKenney", Frederic Weisbecker,
	Rik van Riel, Steven Rostedt, Giovanni Gherdovich, Lukasz Luba,
	"Gautham R. Shenoy", Srinivas Pandruvada, K Prateek Nayak
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model
In-Reply-To: <20231211180417.XG9G3ryS@linutronix.de>
References: <20231201092654.34614-1-anna-maria@linutronix.de>
	<20231201092654.34614-31-anna-maria@linutronix.de>
	<20231211180417.XG9G3ryS@linutronix.de>
Date: Tue, 12 Dec 2023 12:31:19 +0100
Message-ID: <87h6knmzco.fsf@somnus>

Sebastian Siewior writes:

> On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
>> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
>> new file mode 100644
>> index 000000000000..05cd8f1bc45d
>> --- /dev/null
>> +++ b/kernel/time/timer_migration.c
>> @@ -0,0 +1,1636 @@
> …
>> +static bool tmigr_active_up(struct tmigr_group *group,
>> +			    struct tmigr_group *child,
>> +			    void *ptr)
>> +{
>> +	union tmigr_state curstate, newstate;
>> +	struct tmigr_walk *data = ptr;
>> +	bool walk_done;
>> +	u8 childmask;
>> +
>> +	childmask = data->childmask;
>> +	newstate = curstate = data->groupstate;
>> +
>> +retry:
>> +	walk_done = true;
>> +
>> +	if (newstate.migrator == TMIGR_NONE) {
>> +		newstate.migrator = childmask;
>> +
>> +		/* Changes need to be propagated */
>> +		walk_done = false;
>> +	}
>> +
>> +	newstate.active |= childmask;
>> +
>> +	newstate.seq++;
>> +
>> +	if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
>> +		newstate.state = curstate.state;
>
> This does not look right. If
>	group->migr_state != curstate.state
> then
>	curstate.state = newstate.state
>
> does not make a difference since curstate is on stack. So this should
> become
>
> |	if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
> |		curstate.state = newstate.state = atomic_read(&group->parent->migr_state);
>
> and now I question the existence of tmigr_walk::groupstate. It does not
> match the comment for the struct saying it will be re-read if the
> cmpxchg() fails because it does not happen (at least here). Also why do
> we have it? Is it to avoid atomic_read() and have it "cached"?

atomic_try_cmpxchg() updates curstate.state with the actual value of
group->migr_state when the two do not match. So curstate is reread by
atomic_try_cmpxchg() itself and still matches the description (at least
that is what the function description of atomic_try_cmpxchg() says).

But besides that: why would I want to update curstate.state with
group->parent->migr_state when the cmpxchg of this group wasn't
successful yet - or was that a copy-paste error?

>> +		goto retry;
>> +	}
>> +
>> +	if (group->parent && (walk_done == false)) {
>
> The group's parent doesn't change so it can be accessed lock-less. It is
> just that the topmost group has no parent so we need this check. I would
> move the walk_done check to the left so it can be evaluated first.

Will change it.
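For reference, regarding the re-read discussion above: this is roughly
how the generic atomic_try_cmpxchg() fallback behaves - a minimal
sketch for illustration, not code from the patch:

	static inline bool try_cmpxchg_sketch(atomic_t *v, int *old, int new)
	{
		int r, o = *old;

		r = atomic_cmpxchg(v, o, new);
		if (unlikely(r != o))
			*old = r;	/* failure: caller's copy is refreshed */
		return likely(r == o);
	}

So after a failed atomic_try_cmpxchg(&group->migr_state, ...),
curstate.state already holds the current group->migr_state and the retry
path can work with it directly.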
>> +		data->groupstate.state = atomic_read(&group->parent->migr_state);
>> +		data->childmask = group->childmask;
>
> We don't re-read in case the cmpxchg failed assuming someone else is
> updating the state. Wouldn't it make sense to read the childmask at top
> of the function from the child pointer? There is no need to keep it
> around in the data pointer, right?

When we are in lvl0, @child is NULL because the child is a tmigr_cpu and
not a tmigr_group. This is the reason why I decided to store the
childmask inside the tmigr_walk struct.

>> +	}
>> +
>> +	/*
>> +	 * The group is active and the event will be ignored - the ignore flag is
>> +	 * updated without holding the lock. In case the bit is set while
>> +	 * another CPU already handles remote events, nothing happens, because
>> +	 * it is clear that the CPU became active just in this moment, or in
>> +	 * worst case the event is handled remote. Nothing to worry about.
>> +	 */
>
> The CPU is becoming active, so is the group. The ignore flag for the
> group is updated lock less to reflect this. Another CPU might also
> set it true while becoming active. In worst case the migrator
> observes it too late and expires remotely a timer belonging to this
> group. The lock is held while going idle (and setting ignore to
> false) so the state is not lost.
>

This is what I wanted to explain:

	/*
	 * The group is active (again). The group event might still be queued
	 * into the parent group's timerqueue but can now be handled by the
	 * migrator of this group. Therefore the ignore flag for the group
	 * event is updated to reflect this.
	 *
	 * The update of the ignore flag in the active path is done lockless.
	 * In the worst case the migrator of the parent group observes the
	 * change too late and expires remotely a timer belonging to this
	 * group. The lock is held while updating the ignore flag in the idle
	 * path. So this state change will not be lost.
	 */

>> +	group->groupevt.ignore = true;
>> +
>> +	return walk_done;
>> +}

[...]

>> +
>> +static u64 tmigr_handle_remote_cpu(unsigned int cpu, u64 now,
>> +				   unsigned long jif)
>> +{
>> +	struct timer_events tevt;
>> +	struct tmigr_walk data;
>> +	struct tmigr_cpu *tmc;
>> +	u64 next = KTIME_MAX;
>> +
>> +	tmc = per_cpu_ptr(&tmigr_cpu, cpu);
>> +
>> +	raw_spin_lock_irq(&tmc->lock);
>> +
>> +	/*
>> +	 * The remote CPU is offline or the CPU event does not have to be
>> +	 * handled (the CPU is active or there is no longer an event to
>> +	 * expire) or another CPU handles the CPU timers already or the next
>> +	 * event was already expired - return!
>> +	 */
>
> The comment is the english version of the C code below. The *why* is
> usually the thing we care about. This basically sums up to:
>
>	If the remote CPU is offline then the timer events have been
>	migrated away.
>	If tmigr_cpu::remote is set then another CPU takes care of this.
>	If tmigr_event::ignore is set then the CPU is returning from
>	idle and takes care of its timers.
>	If the next event expires in the future then the event node has
>	been updated and there are no timers to expire right now.

I'll change it. Thanks.

>> +	if (!tmc->online || tmc->remote || tmc->cpuevt.ignore ||
>> +	    now < tmc->cpuevt.nextevt.expires) {
>> +		raw_spin_unlock_irq(&tmc->lock);
>> +		return next;
>
> Looking at the last condition where the timerqueue has been forwarded by
> a jiffy+, shouldn't we return _that_ value as next instead of KTIME_MAX?

No. Because the event is already queued into the hierarchy and the
migrator takes care of it.
If the hierarchy is completely idle, the CPU which updated the event
takes care of it. I'll add this to the comment above.

>> +	}
>> +
>> +	tmc->remote = true;
>> +	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
>> +
>> +	/* Drop the lock to allow the remote CPU to exit idle */
>> +	raw_spin_unlock_irq(&tmc->lock);
>> +
>> +	if (cpu != smp_processor_id())
>> +		timer_expire_remote(cpu);
>> +
>> +	/*
>> +	 * Lock ordering needs to be preserved - timer_base locks before tmigr
>> +	 * related locks (see section "Locking rules" in the documentation at
>> +	 * the top). During fetching the next timer interrupt, also tmc->lock
>> +	 * needs to be held. Otherwise there is a possible race window against
>> +	 * the CPU itself when it comes out of idle, updates the first timer in
>> +	 * the hierarchy and goes back to idle.
>> +	 *
>> +	 * timer base locks are dropped as fast as possible: After checking
>> +	 * whether the remote CPU went offline in the meantime and after
>> +	 * fetching the next remote timer interrupt. Dropping the locks as fast
>> +	 * as possible keeps the locking region small and prevents holding
>> +	 * several (unnecessary) locks during walking the hierarchy for updating
>> +	 * the timerqueue and group events.
>> +	 */
>> +	local_irq_disable();
>> +	timer_lock_remote_bases(cpu);
>> +	raw_spin_lock(&tmc->lock);
>> +
>> +	/*
>> +	 * When the CPU went offline in the meantime, no hierarchy walk has to
>> +	 * be done for updating the queued events, because the walk was
>> +	 * already done during marking the CPU offline in the hierarchy.
>> +	 *
>> +	 * When the CPU is no longer idle, the CPU takes care of the timers and
>
>	If the CPU is no longer idle then it takes care of its timers and
>	also of the timers in its hierarchy.
>
>> +	 * also of the timers in the path to the top.
>> +	 *
>> +	 * (See also section "Required event and timerqueue update after
>> +	 * remote expiry" in the documentation at the top)
>
> Perfect. The reasoning why it is safe to skip walk + expire timers, has
> been explained. :)

>> +	 */
>> +	if (!tmc->online || !tmc->idle) {
>> +		timer_unlock_remote_bases(cpu);
>> +		goto unlock;
>> +	} else {
>> +		/* next event of CPU */
>> +		fetch_next_timer_interrupt_remote(jif, now, &tevt, cpu);
>> +	}
>> +
>> +	timer_unlock_remote_bases(cpu);
>> +
>> +	data.evt = &tmc->cpuevt;
>> +	data.nextexp = tevt.global;
>> +	data.remote = true;
>> +
>> +	/*
>> +	 * The update is done even when there is no 'new' global timer pending
>> +	 * on the remote CPU (see section "Required event and timerqueue update
>> +	 * after remote expiry" in the documentation at the top)
>> +	 */
>> +	walk_groups(&tmigr_new_timer_up, &data, tmc);
>> +
>> +	next = data.nextexp;
>> +
>> +unlock:
>> +	tmc->remote = false;
>> +	raw_spin_unlock_irq(&tmc->lock);
>> +
>> +	return next;
>> +}

[...]

>> +/**
>> + * tmigr_requires_handle_remote() - Check the need of remote timer handling
>> + *
>> + * Must be called with interrupts disabled.
>> + */
>> +int tmigr_requires_handle_remote(void)
>> +{
>> +	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>> +	struct tmigr_remote_data data;
>> +	unsigned int ret = 0;
>> +	unsigned long jif;
>> +
>> +	if (tmigr_is_not_available(tmc))
>> +		return ret;
>> +
>> +	data.now = get_jiffies_update(&jif);
>> +	data.childmask = tmc->childmask;
>> +	data.nextexp = KTIME_MAX;
>> +	data.tmc_active = !tmc->idle;
>> +	data.check = false;
>> +
>> +	/*
>> +	 * When the CPU is active, walking the hierarchy to check whether a
>> +	 * remote expiry is required.
>
> s/When/If
> s/walking/walk
>
>> +	 *
>> +	 * Check is done lockless as interrupts are disabled and @tmc->idle is
>> +	 * set only by the local CPU.
>> +	 */
>> +	if (!tmc->idle) {
>> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> +		if (data.nextexp != KTIME_MAX)
>> +			ret = 1;
>> +
>> +		return ret;
>> +	}
>> +
>> +	/*
>> +	 * When the CPU is idle, check whether the recalculation of @tmc->wakeup
>> +	 * is required. @tmc->wakeup_recalc is set by a remote CPU which is
>> +	 * about to go offline, was the last active CPU in the whole timer
>> +	 * migration hierarchy and now delegates handling of the hierarchy to
>> +	 * this CPU.
>
> I'm failing here…

	If the CPU is idle, check whether the recalculation of
	@tmc->wakeup is required. @tmc->wakeup_recalc is set when the
	last active CPU went offline. The last active CPU delegated the
	handling of the timer migration hierarchy to another (this) CPU
	by updating this flag and sending a reschedule.

Better?

>> +	 * Racy lockless check is valid:
>> +	 * - @tmc->wakeup_recalc is set by the remote CPU before it issues
>> +	 *   reschedule IPI.
>> +	 * - As interrupts are disabled here this CPU will either observe
>> +	 *   @tmc->wakeup_recalc set before the reschedule IPI can be handled or
>> +	 *   it will observe it when this function is called again on return
>> +	 *   from handling the reschedule IPI.
>> +	 */
>> +	if (tmc->wakeup_recalc) {
>> +		raw_spin_lock(&tmc->lock);
>> +
>> +		__walk_groups(&tmigr_requires_handle_remote_up, &data, tmc);
>> +
>> +		if (data.nextexp != KTIME_MAX)
>> +			ret = 1;
>> +
>> +		WRITE_ONCE(tmc->wakeup, data.nextexp);
>> +		tmc->wakeup_recalc = false;
>> +		raw_spin_unlock(&tmc->lock);
>> +
>> +		return ret;
>> +	}
>> +
>> +	/*
>> +	 * When the CPU is idle and @tmc->wakeup is reliable, compare it with
>> +	 * @data.now. On 64 bit it is valid to do this lockless. On 32 bit
>> +	 * systems, holding the lock is required to get valid data on concurrent
>> +	 * writers.
>
> The wakeup value is reliable when the CPU is idle?

Yes, and when wakeup_recalc is not set.

> See the comments regarding 64bit.

Updated it accordingly.

>
>> +	 */
>> +	if (IS_ENABLED(CONFIG_64BIT)) {
>> +		if (data.now >= READ_ONCE(tmc->wakeup))
>> +			ret = 1;
>> +	} else {
>> +		raw_spin_lock(&tmc->lock);
>> +		if (data.now >= tmc->wakeup)
>> +			ret = 1;
>> +		raw_spin_unlock(&tmc->lock);
>> +	}
>> +
>> +	return ret;
>> +}
>
> …
>
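A side note on the 32bit case discussed above - a small sketch of why
the lock is needed there (tmigr_read_wakeup_sketch() is a made-up helper
for illustration against the patch's struct tmigr_cpu, not part of the
patch):

	/*
	 * On 32bit a u64 load is done in two halves, so a reader could
	 * observe a mix of old and new bits of tmc->wakeup while a
	 * remote CPU updates it. Taking tmc->lock (which the writer
	 * holds as well) rules that out; on 64bit a plain READ_ONCE()
	 * of the u64 is sufficient.
	 */
	static u64 tmigr_read_wakeup_sketch(struct tmigr_cpu *tmc)
	{
		u64 wakeup;

		if (IS_ENABLED(CONFIG_64BIT))
			return READ_ONCE(tmc->wakeup);

		raw_spin_lock(&tmc->lock);
		wakeup = tmc->wakeup;
		raw_spin_unlock(&tmc->lock);

		return wakeup;
	}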
>> +static bool tmigr_inactive_up(struct tmigr_group *group,
>> +			      struct tmigr_group *child,
>> +			      void *ptr)
>> +{
>> +	union tmigr_state curstate, newstate;
>> +	struct tmigr_walk *data = ptr;
>> +	bool walk_done;
>> +	u8 childmask;
>> +
>> +	childmask = data->childmask;
>> +	newstate = curstate = data->groupstate;
>> +
>> +retry:
>> +	walk_done = true;
>> +
>> +	/* Reset active bit when the child is no longer active */
>> +	if (!data->childstate.active)
>> +		newstate.active &= ~childmask;
>> +
>> +	if (newstate.migrator == childmask) {
>> +		/*
>> +		 * Find a new migrator for the group, because the child group is
>> +		 * idle!
>> +		 */
>> +		if (!data->childstate.active) {
>> +			unsigned long new_migr_bit, active = newstate.active;
>> +
>> +			new_migr_bit = find_first_bit(&active, BIT_CNT);
>> +
>> +			if (new_migr_bit != BIT_CNT) {
>> +				newstate.migrator = BIT(new_migr_bit);
>> +			} else {
>> +				newstate.migrator = TMIGR_NONE;
>> +
>> +				/* Changes need to be propagated */
>> +				walk_done = false;
>> +			}
>> +		}
>> +	}
>> +
>> +	newstate.seq++;
>> +
>> +	WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
>> +
>> +	if (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
>
> This one appears wrong, too. The curstate is not getting updated during
> retry.

See the answer above.

>> +		newstate.state = curstate.state;
>> +
>> +		/*
>> +		 * Something changed in the child/parent group in the meantime,
>> +		 * reread the state of the child and parent; Update of
>> +		 * data->childstate is required for event handling;
>> +		 */
>> +		if (child)
>> +			data->childstate.state = atomic_read(&child->migr_state);
>> +
>> +		goto retry;
>> +	}
>> +
>> +	data->groupstate = newstate;
>> +	data->remote = false;
>> +
>> +	/* Event Handling */
>> +	tmigr_update_events(group, child, data);
>> +
>> +	if (group->parent && (walk_done == false)) {
>> +		data->childmask = group->childmask;
>> +		data->childstate = newstate;
>> +		data->groupstate.state = atomic_read(&group->parent->migr_state);
>> +	}
>> +
>> +	/*
>> +	 * data->nextexp was set by tmigr_update_events() and contains the
>> +	 * expiry of the first global event which needs to be handled
>> +	 */
>> +	if (data->nextexp != KTIME_MAX) {
>> +		WARN_ON_ONCE(group->parent);
>> +		/*
>> +		 * Top level path - If this CPU is about to go offline, wake
>> +		 * up some random other CPU so it will take over the
>> +		 * migrator duty and program its timer properly. Ideally
>> +		 * wake the CPU with the closest expiry time, but that's
>> +		 * overkill to figure out.
>> +		 *
>> +		 * Set wakeup_recalc of remote CPU, to make sure the complete
>> +		 * idle hierarchy with enqueued timers is reevaluated.
>> +		 */
>> +		if (!(this_cpu_ptr(&tmigr_cpu)->online)) {
>> +			struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
>> +			unsigned int cpu = smp_processor_id();
>> +			struct tmigr_cpu *tmc_resched;
>> +
>> +			cpu = cpumask_any_but(cpu_online_mask, cpu);
>> +			tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);
>> +
>> +			raw_spin_unlock(&tmc->lock);
>> +
>> +			raw_spin_lock(&tmc_resched->lock);
>> +			tmc_resched->wakeup_recalc = true;
>> +			raw_spin_unlock(&tmc_resched->lock);
>> +
>> +			raw_spin_lock(&tmc->lock);
>> +			smp_send_reschedule(cpu);
>
> This whole thing confuses me.
> If the CPU goes offline, it needs to get removed from the migration
> hierarchy and this is it. Everything else is handled by the migrator. If
> the migrator is going offline then it needs to wake a random CPU and
> make sure it takes the migrator role. I am confused by the whole
> ::wakeup and ::wakeup_recalc thingy.
>

wakeup_recalc is required to indicate that the CPU was chosen as the new
migrator CPU when the last active CPU in the timer migration hierarchy
went offline.

When a CPU goes idle and it is the last active CPU in the timer
migration hierarchy, it has to make sure that it wakes up in time to
handle the first event of the hierarchy. On the normal idle path this is
not a problem, as the value of the first event of the hierarchy is
returned. But when an IRQ occurs on this idle CPU, the timers are
revisited again. Then it is also required that the first event of the
timer migration hierarchy is still considered, as the CPU cannot make
sure another CPU will handle it. So the value is stored on the idle path
in tmc->wakeup. Otherwise every idle CPU would have to walk the
hierarchy again after an IRQ to make sure nothing is lost, as the CPU
doesn't know what happened in the meantime.

I'm aware that it is possible for several CPUs to have the same wakeup
value when there is no new first event, and when the hierarchy is idle
and some CPUs become active and go idle directly again. But if we want
to cover this, we need a single point with the first event and with the
last migrator information, and I'm quite sure that it will not scale.

Hopefully this is now clear?
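Maybe it helps to condense the two pieces into code. A rough sketch of
what is described above (the tmigr_sketch_* helpers are made up and
simplified for illustration, they are not the patch code):

	/* Idle path: cache the first global event of the hierarchy. */
	static void tmigr_sketch_idle(struct tmigr_cpu *tmc, u64 first_event)
	{
		WRITE_ONCE(tmc->wakeup, first_event);
	}

	/*
	 * Last active CPU goes offline: the cached wakeup of some other
	 * online CPU is no longer sufficient, so mark that CPU for a
	 * re-walk of the hierarchy and kick it with a reschedule IPI.
	 */
	static void tmigr_sketch_offline_handoff(unsigned int this_cpu)
	{
		unsigned int cpu = cpumask_any_but(cpu_online_mask, this_cpu);
		struct tmigr_cpu *tmc_resched = per_cpu_ptr(&tmigr_cpu, cpu);

		raw_spin_lock(&tmc_resched->lock);
		tmc_resched->wakeup_recalc = true;
		raw_spin_unlock(&tmc_resched->lock);

		smp_send_reschedule(cpu);
	}

The chosen CPU then observes wakeup_recalc in
tmigr_requires_handle_remote(), walks the hierarchy and refreshes its
own tmc->wakeup, as in the hunk quoted further up.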
>> +		}
>> +	}
>> +
>> +	return walk_done;
>> +}
>
> Sebastian

Thanks,

	Anna-Maria