Date: Wed, 14 Mar 2018 13:05:09 +0000
From: Morten Rasmussen
To: Jeremy Linton
Cc: mark.rutland@arm.com, vkilari@codeaurora.org, lorenzo.pieralisi@arm.com,
    catalin.marinas@arm.com, tnowicki@caviumnetworks.com,
    gregkh@linuxfoundation.org, will.deacon@arm.com, dietmar.eggemann@arm.com,
    rjw@rjwysocki.net, linux-kernel@vger.kernel.org, ahs3@redhat.com,
    linux-acpi@vger.kernel.org, palmer@sifive.com, hanjun.guo@linaro.org,
    sudeep.holla@arm.com, austinwc@codeaurora.org,
    linux-riscv@lists.infradead.org, john.garry@huawei.com,
    wangxiongfeng2@huawei.com, linux-arm-kernel@lists.infradead.org,
    lenb@kernel.org
Subject: Re: [PATCH v7 13/13] arm64: topology: divorce MC scheduling domain from core_siblings
Message-ID: <20180314130509.GM4589@e105550-lin.cambridge.arm.com>
References: <20180228220619.6992-1-jeremy.linton@arm.com>
 <20180228220619.6992-14-jeremy.linton@arm.com>
 <20180301155216.GI4589@e105550-lin.cambridge.arm.com>
 <5d6bf4cf-2f6d-d123-f17f-d47d8e74c16c@arm.com>
 <20180306160721.GJ4589@e105550-lin.cambridge.arm.com>
 <8ac3567c-9fd8-4b0c-121c-287a027b5156@arm.com>
 <20180307130623.GK4589@e105550-lin.cambridge.arm.com>
 <4e7fd404-fb7b-ee38-4a78-8ee091ae07cf@arm.com>
In-Reply-To: <4e7fd404-fb7b-ee38-4a78-8ee091ae07cf@arm.com>

On Wed, Mar 07, 2018 at 10:19:50AM -0600, Jeremy Linton wrote:
> Hi,
> 
> On 03/07/2018 07:06 AM, Morten Rasmussen wrote:
> >On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote:
> >>>>>>To do this correctly, we should really base that on the cache
> >>>>>>topology immediately below the NUMA node (for NUMA in socket)
> >>>>>>or below the physical package for normal NUMA configurations.
> >>>>>
> >>>>>That means we wouldn't support multi-die NUMA nodes?
> >>>>
> >>>>You mean a bottom level NUMA domain that crosses multiple sockets/dies? That
> >>>>should work. This patch is picking the widest cache layer below the smallest
> >>>>of the package or numa grouping. What actually happens depends on the
> >>>>topology. Given a case where there are multiple dies in a socket, and the
> >>>>numa domain is at the socket level, the MC is going to reflect the caching
> >>>>topology immediately below the socket. In the case of multiple dies with a
> >>>>cache that crosses them in socket, the MC is basically going to be the
> >>>>socket; otherwise, if the widest cache is per die, or some narrower grouping
> >>>>(cluster?), then that is what ends up in the MC. (This is easier with some
> >>>>pictures.)
> >>>
> >>>That is more or less what I meant. I think I got confused with the role
> >>>of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level
> >>>cpumask spans exactly the NUMA node, so IIUC we have three scenarios:
> >>>
> >>>1. Multi-die/socket/physical package NUMA node
> >>>   Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level
> >>>   spans multiple multi-package nodes. The MC mask reflects the last-level
> >>>   cache within the NUMA node, which is most likely per-die or per-cluster
> >>>   (inside each die).
> >>>
> >>>2. physical package == NUMA node
> >>>   The top non-NUMA (DIE) mask is the same as the core sibling mask.
> >>>   If there is a cache spanning the entire node, the scheduler topology
> >>>   will eliminate a layer (DIE?), so the bottom NUMA level would be right on
> >>>   top of MC, spanning multiple physical packages. If there is no
> >>>   node-wide last level cache, DIE is preserved and MC matches the span
> >>>   of the last level cache.
> >>>
> >>>3. numa-in-package
> >>>   Top non-NUMA (DIE) mask is not reflecting the actual die, but is
> >>>   reflecting the NUMA node. MC has a span equal to the largest shared
> >>>   cache span smaller than or equal to the NUMA node. If it is
> >>>   equal, DIE level is eliminated; otherwise DIE is preserved, but
> >>>   doesn't really represent a die. Bottom non-NUMA level spans multiple
> >>>   in-package NUMA nodes.
> >>>
> >>>As you said, multi-die nodes should work.
> >>>However, I'm not sure if shrinking MC to match a cache could cause us
> >>>trouble, or if it should just be shrunk to be the smaller of the node
> >>>mask and core siblings.
> >>
> >>Shrinking to the smaller of the numa or package is a fairly trivial change,
> >>I'm good with that change too. I discounted it because there might be an
> >>advantage in case 2 if the internal hardware is actually a case 3 (or just
> >>multiple rings/whatever, each with an L3). In those cases the firmware vendor
> >>could play around with whatever representation serves them the best.
> >
> >Agreed. Distributed last level caches and interconnect speeds make it
> >virtually impossible to define MC in a way that works well for everyone
> >based on the topology information we have at hand.
> >
> >>
> >>>Unless you have a node-wide last level cache, DIE level won't be
> >>>eliminated in scenarios 2 and 3, and could cause trouble. For
> >>>numa-in-package, you can end up with a DIE level inside the node where
> >>>the default flags don't favour aggressive spreading of tasks. The same
> >>>could be the case for per-package nodes (scenario 2).
> >>>
> >>>Don't we end up redefining physical package to be last level cache
> >>>instead of using the PPTT flag for scenarios 2 and 3?
> >>
> >>I'm not sure I understand; core_siblings isn't changing (it's still per
> >>package), only the MC mapping, which normally is just core_siblings. For all
> >>intents, right now this patch is the same as v6, except for the
> >>numa-in-package case, where the MC domain is being shrunk to the node
> >>siblings. I'm just trying to set up the code for potential future cases
> >>where the LLC isn't equal to the node or package.
> >
> >Right, core_siblings remains the same. The scheduler topology just looks
> >a bit odd, as we can have core_siblings spanning the full true physical
> >package and have DIE level as a subset of that, with an MC level where
> >the MC siblings are a much smaller subset of cpus than core_siblings.
> >
> >IOW, it would lead to having one topology used by the scheduler and
> >another used by the users of topology_core_cpumask() (which is not
> >many, I think).
> >
> >Is there a good reason for diverging instead of adjusting the
> >core_sibling mask? On x86 the core_siblings mask is defined by the last
> >level cache span, so they don't have this issue.
> 
> I'm overwhelmingly convinced we are doing the right thing WRT the core
> siblings at the moment. It's exported to user space, and the general
> understanding is that it's a socket. Even with numa in package/on die, if
> you run lscpu, lstopo, etc., they all understand the system topology
> correctly doing it this way (AFAIK).

Right. As said in my reply to Brice, I thought MC level and sysfs were
aligned, but they clearly aren't. I agree that treating them differently is
the right thing to do.

> >I would prefer this simpler solution as it should eliminate DIE level
> >for all numa-in-package configurations. Although, I think we should consider
> >just shrinking the core_sibling mask instead of having a different MC
> >mask (cpu_coregroup_mask). Do you see any problems in doing that?
> 
> My strongest opinion is leaning toward core_siblings being correct as it
> stands. How the scheduler deals with that is less clear. I will toss the
> above as a separate patch, and we can forget this one. I see dropping DIE as
> a larger patch set defining an arch-specific scheduler topology and tweaking
> the individual scheduler level/flags/tuning.
> OTOH, unless there is something particularly creative there, I don't see
> how to avoid NUMA domains pushing deeper into the cache/system topology,
> which means filling the MC layer (and possibly others) similarly to the
> above snippet.

Agreed that core_siblings is correct. With the simple solution, DIE
shouldn't show up for any numa-in-package configuration, allowing NUMA to
sit directly on top of MC, which should mean that the flags should be
roughly okay.
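
For readers following the thread, a minimal sketch of the "pick the smaller
of the NUMA node span and the package span" approach discussed above could
look roughly like the following. This is not the text of the v7 patch: the
cpu_topology[].core_sibling field and the cpumask helpers are assumed to
match the arm64 arch/arm64/kernel/topology.c code of this era, and the extra
LLC handling the patch adds on top is left out.

#include <linux/cpumask.h>
#include <linux/topology.h>
#include <asm/topology.h>

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	/*
	 * Start from the NUMA node span; for numa-in-package this is
	 * already narrower than the physical package.
	 */
	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

	/*
	 * If the package (core_sibling) is contained in the node span,
	 * i.e. the node is at or above package level, fall back to the
	 * package siblings so MC keeps its usual meaning.
	 */
	if (cpumask_subset(&cpu_topology[cpu].core_sibling, core_mask))
		core_mask = &cpu_topology[cpu].core_sibling;

	return core_mask;
}

With something along these lines MC never spans more than a NUMA node, so
for numa-in-package the DIE level collapses and NUMA sits directly on top of
MC, as described above, while core_siblings itself is left untouched.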
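
Separately, the point that core_siblings stays package-wide in sysfs (what
lscpu/lstopo consume) while the scheduler's MC span may be narrower can be
checked from user space. The short program below is an illustrative sketch,
not part of the patch; it simply prints CPU0's core_siblings_list next to
each node's cpulist so the two views can be compared on a numa-in-package
system.

#include <stdio.h>

static void print_file(const char *label, const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;		/* node or file not present, skip */
	if (fgets(buf, sizeof(buf), f))
		printf("%-28s %s", label, buf);
	fclose(f);
}

int main(void)
{
	char path[128], label[64];
	int node;

	/* Package-wide sibling mask exported via topology_core_cpumask(). */
	print_file("cpu0 core_siblings_list:",
		   "/sys/devices/system/cpu/cpu0/topology/core_siblings_list");

	/* NUMA node spans; narrower than the package for numa-in-package. */
	for (node = 0; node < 16; node++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/cpulist", node);
		snprintf(label, sizeof(label), "node%d cpulist:", node);
		print_file(label, path);
	}
	return 0;
}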