Date: Fri, 16 Sep 2022 17:14:41 +0100
From: Ionela Voinescu
To: Yicong Yang
Cc: Darren Hart, yangyicong@hisilicon.com, Sudeep Holla, Dietmar Eggemann,
    "Rafael J. Wysocki", Catalin Marinas, Will Deacon, Peter Zijlstra,
    Vincent Guittot, Greg Kroah-Hartman, "D. Scott Phillips",
    Ilkka Koskinen, stable@vger.kernel.org, LKML, Linux Arm,
    Barry Song <21cnbao@gmail.com>, Jonathan Cameron
Subject: Re: [PATCH v5] topology: make core_mask include at least cluster_siblings
Scott Phillips" , Ilkka Koskinen , stable@vger.kernel.org, LKML , Linux Arm , Barry Song <21cnbao@gmail.com>, Jonathan Cameron Subject: Re: [PATCH v5] topology: make core_mask include at least cluster_siblings Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-6.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_HI, SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, On Friday 16 Sep 2022 at 15:59:34 (+0800), Yicong Yang wrote: > On 2022/9/16 1:56, Darren Hart wrote: > > On Thu, Sep 15, 2022 at 08:01:18PM +0800, Yicong Yang wrote: > >> Hi Darren, > >> > > > > Hi Yicong, > > > > ... > > > >>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > >>> index 1d6636ebaac5..5497c5ab7318 100644 > >>> --- a/drivers/base/arch_topology.c > >>> +++ b/drivers/base/arch_topology.c > >>> @@ -667,6 +667,15 @@ const struct cpumask *cpu_coregroup_mask(int cpu) > >>> core_mask = &cpu_topology[cpu].llc_sibling; > >>> } > >>> > >>> + /* > >>> + * For systems with no shared cpu-side LLC but with clusters defined, > >>> + * extend core_mask to cluster_siblings. The sched domain builder will > >>> + * then remove MC as redundant with CLS if SCHED_CLUSTER is enabled. > >>> + */ > >>> + if (IS_ENABLED(CONFIG_SCHED_CLUSTER) && > >>> + cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling)) > >>> + core_mask = &cpu_topology[cpu].cluster_sibling; > >>> + > >>> return core_mask; > >>> } > >>> > >> > >> Is this patch still necessary for Ampere after Ionela's patch [1], which > >> will limit the cluster's span within coregroup's span. > > > > Yes, see: > > https://lore.kernel.org/lkml/YshYAyEWhE4z%2FKpB@fedora/ > > > > Both patches work together to accomplish the desired sched domains for the > > Ampere Altra family. > > > > Thanks for the link. From my understanding, on the Altra machine we'll get > the following results: > > with your patch alone: > Scheduler will get a weight of 2 for both CLS and MC level and finally the > MC domain will be squashed. The lowest domain will be CLS. > > with both your patch and Ionela's: > CLS will have a weight of 1 and MC will have a weight of 2. CLS won't be > built and the lowest domain will be MC. > > with Ionela's patch alone: > Both CLS and MC will have a weight of 1, which is incorrect. > This would happen with or without my patch. My patch only breaks the tie between CLS and MC. And the above outcome is "incorrect" for Ampere Altra where there's no cache spanning multiple cores, but ACPI presents clusters. With Darren's patch this information on clusters is used instead to build the MC domain. > So your patch is still necessary for Amphere Altra. Then we need to limit > MC span to DIE/NODE span, according to the scheduler's definition for > topology level, for the issue below. 
> >>
> >> I found an issue that the NUMA domains are not built on qemu with:
> >>
> >> qemu-system-aarch64 \
> >>  -kernel ${Image} \
> >>  -smp 8 \
> >>  -cpu cortex-a72 \
> >>  -m 32G \
> >>  -object memory-backend-ram,id=node0,size=8G \
> >>  -object memory-backend-ram,id=node1,size=8G \
> >>  -object memory-backend-ram,id=node2,size=8G \
> >>  -object memory-backend-ram,id=node3,size=8G \
> >>  -numa node,memdev=node0,cpus=0-1,nodeid=0 \
> >>  -numa node,memdev=node1,cpus=2-3,nodeid=1 \
> >>  -numa node,memdev=node2,cpus=4-5,nodeid=2 \
> >>  -numa node,memdev=node3,cpus=6-7,nodeid=3 \
> >>  -numa dist,src=0,dst=1,val=12 \
> >>  -numa dist,src=0,dst=2,val=20 \
> >>  -numa dist,src=0,dst=3,val=22 \
> >>  -numa dist,src=1,dst=2,val=22 \
> >>  -numa dist,src=1,dst=3,val=24 \
> >>  -numa dist,src=2,dst=3,val=12 \
> >>  -machine virt,iommu=smmuv3 \
> >>  -net none \
> >>  -initrd ${Rootfs} \
> >>  -nographic \
> >>  -bios QEMU_EFI.fd \
> >>  -append "rdinit=/init console=ttyAMA0 earlycon=pl011,0x9000000 sched_verbose loglevel=8"
> >>
> >> I can see the sched domain build stops at the MC level since we reach all
> >> the CPUs in the system:
> >>
> >> [    2.141316] CPU0 attaching sched-domain(s):
> >> [    2.142558]  domain-0: span=0-7 level=MC
> >> [    2.145364]   groups: 0:{ span=0 cap=964 }, 1:{ span=1 cap=914 }, 2:{ span=2 cap=921 }, 3:{ span=3 cap=964 }, 4:{ span=4 cap=925 }, 5:{ span=5 cap=964 }, 6:{ span=6 cap=967 }, 7:{ span=7 cap=967 }
> >> [    2.158357] CPU1 attaching sched-domain(s):
> >> [    2.158964]  domain-0: span=0-7 level=MC
> >> [...]

It took me a bit to reproduce this, as it requires "QEMU emulator version
7.1.0"; otherwise there won't be a PPTT table. With this, the cache
hierarchy is not really "healthy", so it's not a topology I'd expect to
see in practice. But I suppose we should try to fix it.

root@debian-arm64-buster:/sys/devices/system/cpu/cpu0/cache# grep . */*
index0/level:1
index0/shared_cpu_list:0-7
index0/shared_cpu_map:ff
index0/type:Data
index1/level:1
index1/shared_cpu_list:0-7
index1/shared_cpu_map:ff
index1/type:Instruction
index2/level:2
index2/shared_cpu_list:0-7
index2/shared_cpu_map:ff
index2/type:Unified
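To spell out how the masks interact on this emulated topology, here is a
throwaway user-space sketch of the subset checks in cpu_coregroup_mask();
plain bitmasks stand in for struct cpumask, and the cluster span is my
assumption, chosen to match the MC span reported above:

#include <stdio.h>

/* CPU 0's view of the emulated 8-CPU, 4-node topology. */
#define NODE_MASK	0x03	/* cpu_cpu_mask(): node 0, CPUs 0-1 */
#define CORE_SIBLING	0x03	/* package siblings */
#define LLC_SIBLING	0xff	/* caches shared by all 8 CPUs */
#define CLUSTER_SIBLING	0xff	/* assumed: PPTT cluster spans CPUs 0-7 */

static int subset(unsigned int a, unsigned int b)
{
	return (a & ~b) == 0;	/* like cpumask_subset(a, b) */
}

int main(void)
{
	unsigned int core_mask = NODE_MASK;

	if (subset(CORE_SIBLING, core_mask))
		core_mask = CORE_SIBLING;	/* still CPUs 0-1 */
	if (subset(LLC_SIBLING, core_mask))
		core_mask = LLC_SIBLING;	/* not taken: 0-7 exceeds 0-1 */

	/* Darren's patch: 0-1 is a subset of the cluster, so MC grows to 0-7 */
	if (subset(core_mask, CLUSTER_SIBLING))
		core_mask = CLUSTER_SIBLING;

	/* Yicong's proposed clamp pulls MC back within the node span */
	if (subset(NODE_MASK, core_mask))
		core_mask = NODE_MASK;

	printf("MC span for CPU0: %#x\n", core_mask);	/* prints 0x3 */
	return 0;
}

Without the clamp, the cluster extension takes MC to 0-7 here; with it, MC
ends up back at the node span, 0-1.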
Thanks,
Ionela.

> >> Without this the NUMA domains are built correctly:
> >>
> >
> > Without which? My patch, Ionela's patch, or both?
> >
>
> Reverting only your patch gives the result below; sorry for the ambiguity.
> Before reverting, for CPU 0, MC should span CPUs 0-1, but with your patch
> it's extended to 0-7 and the sched domain build stops at the MC level
> because it has reached all the CPUs.
>
> >> [    2.008885] CPU0 attaching sched-domain(s):
> >> [    2.009764]  domain-0: span=0-1 level=MC
> >> [    2.012654]   groups: 0:{ span=0 cap=962 }, 1:{ span=1 cap=925 }
> >> [    2.016532]  domain-1: span=0-3 level=NUMA
> >> [    2.017444]   groups: 0:{ span=0-1 cap=1887 }, 2:{ span=2-3 cap=1871 }
> >> [    2.019354]  domain-2: span=0-5 level=NUMA
> >
> > I'm not following this topology - what in the description above should
> > result in a domain with span=0-5?
> >
>
> It emulates a 3-hop NUMA machine, and the NUMA domains are built according
> to the NUMA distances:
>
> node   0   1   2   3
>   0:  10  12  20  22
>   1:  12  10  22  24
>   2:  20  22  10  12
>   3:  22  24  12  10
>
> So for CPU 0 the NUMA domains will look like:
> NUMA domain 0 for the local node (squashed into the MC domain), CPUs 0-1
> NUMA domain 1 for nodes within distance 12, CPUs 0-3
> NUMA domain 2 for nodes within distance 20, CPUs 0-5
> NUMA domain 3 for all nodes, CPUs 0-7
>
> Thanks.
>
> >
> >> [    2.019983]   groups: 0:{ span=0-3 cap=3758 }, 4:{ span=4-5 cap=1935 }
> >> [    2.021527]  domain-3: span=0-7 level=NUMA
> >> [    2.022516]   groups: 0:{ span=0-5 mask=0-1 cap=5693 }, 6:{ span=4-7 mask=6-7 cap=3978 }
> >> [...]
> >>
> >> Hope to see your comments, since I have no Ampere machine and I don't
> >> know how to emulate its topology on qemu.
> >>
> >> [1] bfcc4397435d ("arch_topology: Limit span of cpu_clustergroup_mask()")
> >>
> >> Thanks,
> >> Yicong
> >
> > Thanks,
> >
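P.S. For completeness, Yicong's four spans fall out mechanically from the
distance table; a throwaway sketch (illustration only, not the scheduler's
actual level-building code, which works from the set of unique distances):

#include <stdio.h>

/*
 * Reproduce Yicong's NUMA domain spans for CPU 0 from the qemu distance
 * table: each level pulls in every node whose distance from node 0 is at
 * or below that level's threshold.
 */
static const int dist[4][4] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 24 },
	{ 20, 22, 10, 12 },
	{ 22, 24, 12, 10 },
};

int main(void)
{
	const int threshold[] = { 10, 12, 20, 22 };
	int l, n;

	for (l = 0; l < 4; l++) {
		printf("NUMA domain %d (distance <= %d): CPUs", l, threshold[l]);
		/* node n holds CPUs 2n and 2n+1 in this qemu setup */
		for (n = 0; n < 4; n++)
			if (dist[0][n] <= threshold[l])
				printf(" %d-%d", 2 * n, 2 * n + 1);
		printf("\n");
	}
	return 0;
}

The unions of the printed ranges (0-1, 0-3, 0-5, 0-7) match the four
domains in the sched_verbose output above.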