Date: Tue, 12 Dec 2023 13:14:04 +0100
From: Sebastian Siewior
To: Anna-Maria Behnsen
Cc: linux-kernel@vger.kernel.org, Peter Zijlstra, John Stultz,
 Thomas Gleixner, Eric Dumazet, "Rafael J . Wysocki", Arjan van de Ven,
 "Paul E . McKenney", Frederic Weisbecker, Rik van Riel, Steven Rostedt,
 Giovanni Gherdovich, Lukasz Luba, "Gautham R .
 Shenoy", Srinivas Pandruvada, K Prateek Nayak
Subject: Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model
Message-ID: <20231212121404.C2R9VWj1@linutronix.de>
References: <20231201092654.34614-1-anna-maria@linutronix.de>
 <20231201092654.34614-31-anna-maria@linutronix.de>
In-Reply-To: <20231201092654.34614-31-anna-maria@linutronix.de>

On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@
…
> +static int __init tmigr_init(void)
> +{
> +	unsigned int cpulvl, nodelvl, cpus_per_node, i;
> +	unsigned int nnodes = num_possible_nodes();
> +	unsigned int ncpus = num_possible_cpus();
> +	int ret = -ENOMEM;
> +
> +	/* Nothing to do if running on UP */
> +	if (ncpus == 1)
> +		return 0;
> +
> +	/*
> +	 * Calculate the required hierarchy levels. Unfortunately there is no
> +	 * reliable information available, unless all possible CPUs have been
> +	 * brought up and all numa nodes are populated.
                              NUMA
> +	 *
> +	 * Estimate the number of levels with the number of possible nodes and
> +	 * the number of possible CPUs. Assume CPUs are spread evenly across
> +	 * nodes. We cannot rely on cpumask_of_node() because there only already
> +	 * online CPUs are considered.
> +	 */

We cannot rely on cpumask_of_node() because it only works for online CPUs.

> +	cpus_per_node = DIV_ROUND_UP(ncpus, nnodes);
> +
> +	/* Calc the hierarchy levels required to hold the CPUs of a node */
> +	cpulvl = DIV_ROUND_UP(order_base_2(cpus_per_node),
> +			      ilog2(TMIGR_CHILDREN_PER_GROUP));
> +
> +	/* Calculate the extra levels to connect all nodes */
> +	nodelvl = DIV_ROUND_UP(order_base_2(nnodes),
> +			       ilog2(TMIGR_CHILDREN_PER_GROUP));
> +
> +	tmigr_hierarchy_levels = cpulvl + nodelvl;
> +
> +	/*
> +	 * If a numa node spawns more than one CPU level group then the next
                NUMA
> +	 * level(s) of the hierarchy contains groups which handle all CPU groups
> +	 * of the same numa node. The level above goes across numa nodes. Store
                       NUMA
> +	 * this information for the setup code to decide when node matching is
> +	 * not longer required.

s/not longer/no longer ?

> +	 */
> +	tmigr_crossnode_level = cpulvl;
> +
> +	tmigr_level_list = kcalloc(tmigr_hierarchy_levels, sizeof(struct list_head), GFP_KERNEL);
> +	if (!tmigr_level_list)
> +		goto err;
> +
> +	for (i = 0; i < tmigr_hierarchy_levels; i++)
> +		INIT_LIST_HEAD(&tmigr_level_list[i]);
> +
> +	pr_info("Timer migration: %d hierarchy levels; %d children per group;"
> +		" %d crossnode level\n",
> +		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
> +		tmigr_crossnode_level);
> +
> +	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
> +				tmigr_cpu_online, tmigr_cpu_offline);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +
> +err:
> +	pr_err("Timer migration setup failed\n");
> +	return ret;
> +}
> +late_initcall(tmigr_init);
> diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
> new file mode 100644
> index 000000000000..260b87e5708d
> --- /dev/null
> +++ b/kernel/time/timer_migration.h
> @@ -0,0 +1,144 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef _KERNEL_TIME_MIGRATION_H
> +#define _KERNEL_TIME_MIGRATION_H
> +
> +/* Per group capacity. Must be a power of 2! */
> +#define TMIGR_CHILDREN_PER_GROUP 8

BUILD_BUG_ON_NOT_POWER_OF_2(TMIGR_CHILDREN_PER_GROUP)

Maybe in the .c file.

> +/**
> + * struct tmigr_event - a timer event associated to a CPU
> + * @nextevt:	The node to enqueue an event in the parent group queue
> + * @cpu:	The CPU to which this event belongs
> + * @ignore:	Hint whether the event could be ignored; it is set when
> + *		CPU or group is active;
> + */
> +struct tmigr_event {
> +	struct timerqueue_node	nextevt;
> +	unsigned int		cpu;
> +	bool			ignore;
> +};
> +
> +/**
> + * struct tmigr_group - timer migration hierarchy group
> + * @lock:		Lock protecting the event information and group hierarchy
> + *			information during setup
> + * @migr_state:	State of the group (see union tmigr_state)

So the lock does not protect migr_state? Mind moving it a little down the
road? Not only would it be more obvious what is protected by the lock but it
would also move migr_state in another/later cache line.

> + * @parent:		Pointer to the parent group
> + * @groupevt:		Next event of the group which is only used when the
> + *			group is !active. The group event is then queued into
> + *			the parent timer queue.
> + *			Ignore bit of @groupevt is set when the group is active.
> + * @next_expiry:	Base monotonic expiry time of the next event of the
> + *			group; It is used for the racy lockless check whether a
> + *			remote expiry is required; it is always reliable
> + * @events:		Timer queue for child events queued in the group
> + * @childmask:		childmask of the group in the parent group; is set
> + *			during setup and will never change; could be read

_can_ be read lockless.
> + *			lockless
> + * @level:		Hierarchy level of the group; Required during setup
> + * @list:		List head that is added to the per level
> + *			tmigr_level_list; is required during setup when a
> + *			new group needs to be connected to the existing
> + *			hierarchy groups
> + * @numa_node:		Is set to numa node when level < tmigr_crossnode_level;

NUMA (as long as the group level is per NUMA node).

> + *			otherwise it is set to NUMA_NO_NODE; Required for
> + *			setup only to make sure CPUs and groups are per
> + *			numa node as long as level < tmigr_crossnode_level

… to make sure CPU and group information is NUMA local. This is true until the
top most hierarchy level (level < tmigr_crossnode_level).

> + * @num_children:	Counter of group children to make sure the group is only
> + *			filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
> + *			only
> + */
> +struct tmigr_group {
> +	raw_spinlock_t		lock;
> +	atomic_t		migr_state;
> +	struct tmigr_group	*parent;
> +	struct tmigr_event	groupevt;
> +	u64			next_expiry;
> +	struct timerqueue_head	events;
> +	u8			childmask;
> +	unsigned int		level;
> +	struct list_head	list;
> +	int			numa_node;
> +	unsigned int		num_children;
> +};
> +
> +/**
> + * struct tmigr_cpu - timer migration per CPU group
> + * @lock:		Lock protecting the tmigr_cpu group information
> + * @online:		Indicates whether the CPU is online; In deactivate path
> + *			it is required to know whether the migrator in the top
> + *			level group is on the way to go offline when a timer is

level group, which is to be set offline, while a timer is pending.

> + *			pending. Then another online CPU needs to be rescheduled
> + *			to make sure the timers are handled properly;

Then another online CPU needs to be notified to take over the migrator role.

The "rescheduled" part sounds like the current implementation.

> + *			Furthermore the information is required in CPU hotplug
> + *			path as the CPU is able to go idle before the timer
> + *			migration hierarchy hotplug AP is reached. During this
> + *			phase, the CPU has to handle the global timers by its

s/by its own/on its own/

> + *			own and does not act as a migrator.

s/does not/must not

> + * @idle:		Indicates whether the CPU is idle in the timer migration
> + *			hierarchy
> + * @remote:		Is set when timers of the CPU are expired remote

s/remote/remotely

> + * @wakeup_recalc:	Indicates, whether a recalculation of the @wakeup value
> + *			is required. It is only used when the CPU is marked idle
> + *			in the timer migration hierarchy.

What does `It' refer to? Is it `wakeup_recalc' or `wakeup'?

> + * @tmgroup:		Pointer to the parent group
> + * @childmask:		childmask of tmigr_cpu in the parent group
> + * @wakeup:		Stores the first timer when the timer migration
> + *			hierarchy is completely idle and remote expiry was done;
> + *			is returned to timer code in the idle path; it is only

is used in the idle path only (what is the idle path (probably obvious))

> + *			valid, when @wakeup_recalc is not set
> + * @cpuevt:		CPU event which could be queued into the parent group

I don't know why but it feels like s/queued/enqueued/g
But it might be a British vs American thing.

Sebastian