Received: by 2002:ab2:6857:0:b0:1ef:ffd0:ce49 with SMTP id l23csp2245230lqp; Sun, 24 Mar 2024 09:28:48 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCXqrqDq8/VOcDJ1Ds+YB2YmTlwzpgj2loW1ZOIyvnH+dMEckpGTosBQF+t/ZD7wKbu7qGmeCzCv6Wq57axuveoZPwM/hLaVtEKpL8zdQg== X-Google-Smtp-Source: AGHT+IFjUQtwBFiEH3MKFbNxZ7K9m/Pu6vmJoqOQaEiwf7QeMG1t8OVA2ivHBv1zUxFCbIR4jgtv X-Received: by 2002:a05:6808:13c8:b0:3c3:c98c:8d25 with SMTP id d8-20020a05680813c800b003c3c98c8d25mr3693639oiw.15.1711297728744; Sun, 24 Mar 2024 09:28:48 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1711297728; cv=pass; d=google.com; s=arc-20160816; b=gZoWwl2DabybJESld0MzWn/27TLkaZE2LDYKwoxotrSsfnlw3Aw6wLSQ2uDacxtCx6 KnGAZ8QI2s9Y/Nzxdm5yZzaMyGWzIVHhR26HFKZoct1YAh1SwBf6dGMyqEj+jq7WGF6z Y8aPyYvz9tq8qVLGf8QfBNHfT6GmXRSahDDP7QBqyPKug1fDm56hQR28RgG2fm6+/lN/ JFlCQX8Kb4tVsNn2IUrs4T8HdMoUp1fdfagUF72l6RpKm2Bd2S2QVsBChULLFNt7d5By En/iQ9V5WfKow+xwQlYc1zJfP/6+rgAsoeC2TG6YHOs/QczqJac04lcP8a9oMfAFqsBV 8iGw== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=user-agent:in-reply-to:content-disposition:mime-version :list-unsubscribe:list-subscribe:list-id:precedence:references :message-id:subject:cc:to:from:date:dkim-signature:dkim-filter; bh=L7aJR+/O/U52wmp8nXcyTTBjlAy8uzaOy6DxR8fMbvg=; fh=ddfnFSFXeoFGQDLQbYk9PZ1HDzJabwsTONDjtiJFIIs=; b=dN3IrJMPrGUteK7irNRpVajamchEEEVfb7wNULWcQBUbEdrYsYQQ3PNCIHrrLXzrl7 3JmjKPD/+wi04qvYHDoWFpCQz3U/jyo4s8rCLCJJ6zXMdLbnhsLo+vX1vbCsFaeHZ9du A42OraXuWinnlh4FDcdtumDMiZzLtVQGpf39W3rFF7lvB0cqEMuG2iQUXPKI38gEvNB8 WtzEaE57rc9MtnwzuKciptFeVHqeyWzgvP/rcg5GqDLGEUge7s2a5v0rkfMHOkYJLnFA W0vn5APG9PdqpPERSB7hhFGmoTlX37ju130S8svtqeN1kd4qSYiVoVyjBMnzHLR3T14z 5nFg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b="NHi5D1H/"; arc=pass (i=1 spf=pass spfdomain=linux.microsoft.com dkim=pass dkdomain=linux.microsoft.com dmarc=pass fromdomain=linux.microsoft.com); spf=pass (google.com: domain of linux-kernel+bounces-112734-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-112734-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [2604:1380:45d1:ec00::1]) by mx.google.com with ESMTPS id fy17-20020a05622a5a1100b004312fb8b511si3865905qtb.257.2024.03.24.09.28.48 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sun, 24 Mar 2024 09:28:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-112734-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b="NHi5D1H/"; arc=pass (i=1 spf=pass spfdomain=linux.microsoft.com dkim=pass dkdomain=linux.microsoft.com dmarc=pass fromdomain=linux.microsoft.com); spf=pass (google.com: domain of linux-kernel+bounces-112734-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-112734-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id 6562F1C209E5 for ; Sun, 24 Mar 2024 16:28:48 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 31F0F18EBF; Sun, 24 Mar 2024 16:28:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (1024-bit key) header.d=linux.microsoft.com header.i=@linux.microsoft.com header.b="NHi5D1H/" Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 8EBB610A19 for ; Sun, 24 Mar 2024 16:28:39 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=13.77.154.182 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711297721; cv=none; b=PIcQ2qcdmKOO1NOlKiTYslMZy12RDC90eUrp+l+wlFIpJ3QGx6gBaVq/5HmJFe9SBSF8t0I86ECSvZv+r8Q6upZIUMm8jy5hYdqcVNr0QrtRALsJnix3tqX5Eky9DcwVTSQyhAFpve+H5r/+3verKCMr779myhqTZNN801kS0/A= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1711297721; c=relaxed/simple; bh=VVxT1+30kfFtynlw5txLe5LXptKYOBoBmsWHltYGTU0=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=bnjFKCXZt+fpXCanJ0DhpwhO46nJkXExvXpABoQl3KYejjFg+bV26msF9erGyAymQhe9bvmwnO5eFLddnW6PgpEE00EvLplfMPqF9LcPjOxUugIILrvSyYrCTR6PjZBEVoo1+Mz+Q3XLgPr9P3cbkneIBmU21DPk0bMAoKy5ohE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.microsoft.com; spf=pass smtp.mailfrom=linux.microsoft.com; dkim=pass (1024-bit key) header.d=linux.microsoft.com header.i=@linux.microsoft.com header.b=NHi5D1H/; arc=none smtp.client-ip=13.77.154.182 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linux.microsoft.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linux.microsoft.com Received: by linux.microsoft.com (Postfix, from userid 1127) id 87C1D20B74C0; Sun, 24 Mar 2024 09:28:33 -0700 (PDT) DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com 87C1D20B74C0 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com; s=default; t=1711297713; bh=L7aJR+/O/U52wmp8nXcyTTBjlAy8uzaOy6DxR8fMbvg=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=NHi5D1H/nl3xhC4xFnWkToLc6k2tX03KxmPP45asdxiiWuALs7QL2DbMuCWfkgtNy SbameL55W851rQeHRK0xK1OgR0rD0AOcw/ov4LXV46sDmtxb6saDF3a7A3GWtpk4Jh vJ7XNvC906SDyGSFVZQbOUgBaJlbwh4uxdhOlktA= Date: Sun, 24 Mar 2024 09:28:33 -0700 From: Saurabh Singh Sengar To: Thomas Gleixner Cc: dave.hansen@linux.intel.com, luto@kernel.org, peterz@infradead.org, mingo@redhat.com, bp@alien8.de, x86@kernel.org, hpa@zytor.com, linux-kernel@vger.kernel.org, ssengar@microsoft.com, sgeorgejohn@microsoft.com Subject: Re: [PATCH] x86/numa: Map NUMA node to CPUs as per DeviceTree Message-ID: <20240324162833.GA18417@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net> References: <1710265404-23146-1-git-send-email-ssengar@linux.microsoft.com> <87v85bfzg0.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <87v85bfzg0.ffs@tglx> User-Agent: Mutt/1.5.21 (2010-09-15) On Sun, Mar 24, 2024 at 03:59:27PM +0100, Thomas Gleixner wrote: > On Tue, Mar 12 2024 at 10:43, Saurabh Sengar wrote: > > Currently for DeviceTree bootup, x86 code does the default mapping of > > CPUs to NUMA, which is wrong. This can cause incorrect mapping and WARN > > on a SMT enabled system like below: > > > > [0.417551] ------------[ cut here ]------------ > > [0.417551] Saurabh sched: CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. > > [0.417551] WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d > > [0.417551] Modules linked in: > > [0.417551] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.1.71-microsoft-hcl+ #4 > > [0.417551] RIP: 0010:topology_sane.isra.0+0x5c/0x6d > > [0.417551] Code: 41 39 dc 74 27 80 3d 32 ae 2d 00 00 75 1e 41 89 d9 45 89 e0 44 89 d6 48 c7 c7 00 a6 4a 88 c6 05 19 ae 2d 00 01 e8 6e 1f cb ff <0f> 0b 41 39 dc 5b 41 5c 0f 94 c0 5d c3 cc cc cc cc 55 48 8b 05 05 > > [0.417551] RSP: 0000:ffffc9000013feb0 EFLAGS: 00010086 > > [0.417551] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 > > [0.417551] RDX: 0000000000000003 RSI: 0000000000000086 RDI: 00000000ffffffff > > [0.417551] RBP: ffffc9000013fec0 R08: ffffffff88778160 R09: ffffffff88778160 > > [0.417551] R10: ffff888227fe26da R11: ffff888227fe26c1 R12: 0000000000000001 > > [0.417551] R13: 0000000000000000 R14: ffff888216415040 R15: 0000000000000000 > > [0.417551] FS: 0000000000000000(0000) GS:ffff888216400000(0000) knlGS:0000000000000000 > > [0.417551] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [0.417551] CR2: 0000000000000000 CR3: 0000000208809001 CR4: 0000000000330ea0 > > [0.417551] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > [0.417551] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 > > [0.417551] Call Trace: > > [0.417551] > > [0.417551] ? show_regs.cold+0x1a/0x1f > > [0.417551] ? __warn+0x6e/0xc0 > > [0.417551] ? report_bug+0x101/0x1a0 > > [0.417551] ? handle_bug+0x40/0x70 > > [0.417551] ? exc_invalid_op+0x19/0x70 > > [0.417551] ? asm_exc_invalid_op+0x1b/0x20 > > [0.417551] ? topology_sane.isra.0+0x5c/0x6d > > [0.417551] match_smt+0xf6/0xfc > > [0.417551] set_cpu_sibling_map.cold+0x24f/0x512 > > [0.417551] start_secondary+0x5c/0x110 > > [0.417551] secondary_startup_64_no_verify+0xcd/0xdb > > [0.417551] > > [0.417551] ---[ end trace 0000000000000000 ]--- > > Can you please trim the backtrace like documented. 95% of the pasted > information above is completely irrelevant to understand the issue. > > CPU #1's smt-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency. > WARNING: CPU: 1 PID: 0 at topology_sane.isra.0+0x5c/0x6d > match_smt+0xf6/0xfc > set_cpu_sibling_map.cold+0x24f/0x512 > start_secondary+0x5c/0x110 > > is more than sufficient, no? > > > +static void __init of_parse_and_init_cpus(void) > > +{ > > + struct device_node *dn; > > + int cpuid = 0; > > + int nid; > > + > > + for_each_of_cpu_node(dn) { > > + if (cpuid >= NR_CPUS) { > > This condition is wrong. nr_cpu_ids != NR_CPUS. > > > + pr_warn("NR_CPUS too small for %d cpuid\n", cpuid); > > What's this for? The APIC enumeration code warns about this already. > > > + return; > > + } > > + nid = of_node_to_nid(dn); > > + numa_set_node(cpuid, nid); > > + cpuid++; > > + } > > +} > > + > > static int __init numa_init(int (*init_func)(void)) > > { > > int i; > > @@ -645,6 +662,9 @@ static int __init numa_init(int (*init_func)(void)) > > if (ret < 0) > > return ret; > > > > + if (acpi_disabled) > > + of_parse_and_init_cpus(); > > numa_init() is invoked from x86_numa_init() with the various NUMA init > functions as argument and x86_numa_init() already has OF NUMA logic: > > #ifdef CONFIG_ACPI_NUMA > if (!numa_init(x86_acpi_numa_init)) > return; > #endif > #ifdef CONFIG_AMD_NUMA > if (!numa_init(amd_numa_init)) > return; > #endif > if (acpi_disabled && !numa_init(of_numa_init)) > return; > > of_numa_init() does not do the numa_set_node() part, but that's not a > justification to glue this into numa_init() which is firmware > independent except for the firmware specific callback argument. > > Also x86_numa_init() is invoked _before_ the APIC ID to Linux CPU number > association happens, so doing the CPU number to node mapping at this > point is just wrong for any CPU number != 0. > > It "works" for OF just by chance as the actual APIC enumeration which > associates Linux CPU numbers works in the same order, but that does not > make it correct in any way. > > x86_acpi_numa_init() and amd_numa_init() set up the nodes like > of_numa_init() does and aside of that save the APIC ID to node mapping. > > Aside of that the numa_set_node() variant happens to "work" for Intel > CPUs as srat_detect_node() falls back to cpu_to_node() when > numa_cpu_node() returns NO_NUMA_NODE. > > Though the AMD variant falls back to cpu_info.topo.llc_id which is not > necessarily the same result as what the device tree enumerated. > > I can see why you glued it into numa_init(): > > of_node_to_nid() requires node_possible_map to be initialized, which > happens in numa_register_memblks() invoked from numa_init() if the > firmware specific callback which enumerates the nodes was successful. > > Of course the change log is silent about this. > > But there is no reason to scan this right in numa_init() as nothing > needs the information to be set at this point, unless I'm missing > something obscure, which might be the case when staring at this NUMA > enumeration maze. > > The CPU to node mapping based on the APIC ID to node mapping happens in > init_cpu_to_node() which is after the APIC enumeration and the > finalizing of cpu_possible_map completed. At this point the CPU number > to APIC ID mapping is stable. > > So the uncompiled below should just work, no? > > Thanks, > > tglx > --- > --- a/arch/x86/kernel/devicetree.c > +++ b/arch/x86/kernel/devicetree.c > @@ -24,6 +24,7 @@ > #include > #include > #include > +#include > #include > > __initdata u64 initial_dtb; > @@ -137,6 +138,7 @@ static void __init dtb_cpu_setup(void) > continue; > } > topology_register_apic(apic_id, CPU_ACPIID_INVALID, true); > + set_apicid_to_node(apic_id, of_node_to_nid(dn)); > } > } > Thanks for your suggestion. I use this approach because ARM64 and riscv platforms were having a function of_parse_and_init_cpus to do the same. But I agree in x86 DeviceTree is handled differently, and we can restrict the DT related code from numa_init. I will look in to making this approach work for our platform and send the new patch. Few thoughts related to recent change wrt removal of x86_dtb_init: I recognize that due to recent changes, each dtb platform will now need to set a pointer for x86_init.mpparse.early_parse_smp_cfg to get the dtb_cpu_setup executed. This was not the requirement before because earlier x86_dtb_init was anyway getting called. Do you think we should improve this as well by setting x86_init.mpparse.early_parse_smp_cfg to x86_dtb_parse_smp_config for all the dtb platforms by default. I see the ce4100 platform is setting the parse_smp_cfg, shouldn't the early_parse_smp_cfg be more accurate there ? Let me know if my understanding is correct, I can accommodate these improvements as well in my patch series. - Saurabh