Date: Mon, 19 Oct 2020 17:00:20 +0100
From: Jonathan Cameron
To: Valentin Schneider
CC: Morten Rasmussen, Peter Zijlstra, Len Brown, Greg Kroah-Hartman,
 Sudeep Holla, Will Deacon, Brice Goglin, Jeremy Linton, Jerome Glisse
Subject: Re: [RFC PATCH] topology: Represent clusters of CPUs within a die.
Message-ID: <20201019160020.000013d6@Huawei.com>
References: <20201016152702.1513592-1-Jonathan.Cameron@huawei.com>
 <20201019103522.GK2628@hirez.programming.kicks-ass.net>
 <20201019123226.00006705@Huawei.com>
 <20201019131052.GC8004@e123083-lin>
 <20201019142715.00005fb1@huawei.com>
Organization: Huawei Technologies Research and Development (UK) Ltd.
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 19 Oct 2020 16:51:06 +0100
Valentin Schneider wrote:

> On 19/10/20 15:27, Jonathan Cameron wrote:
> > On Mon, 19 Oct 2020 14:48:02 +0100
> > Valentin Schneider wrote:
> >>
> >> That's my cue to paste some of that stuff I've been rambling on and off
> >> about!
> >>
> >> With regard to cache / interconnect layout, I do believe that if we
> >> want to support it in the scheduler itself then we should leverage some
> >> distance table rather than create X extra scheduler topology levels.
> >>
> >> I had a chat with Jeremy on the ACPI side of that some time ago. IIRC, given
> >> that SLIT gives us a distance value between any two PXMs, we could directly
> >> express core-to-core distance in that table. With that (and if that still
> >> lets us properly discover NUMA node spans), we could let the scheduler
> >> build dynamic NUMA-like topology levels representing the inner quirks of
> >> the cache / interconnect layout.
> >
> > You would rapidly run into the problem SLIT had for NUMA node description.
> > There is no consistent description of distance, and except in the vaguest
> > sense of 'nearer' it wasn't any use for anything. That is why HMAT
> > came along. It's far from perfect, but it is a step up.
> >
>
> I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT
> / PPTT, so thanks for pointing this out.
>
> > I can't see how you'd generalize those particular tables to do anything
> > for inter-core comms without breaking their use for NUMA, but something
> > a bit similar might work.
> >
>
> Right, there's the issue of still being able to determine NUMA node
> boundaries.

Backwards compatibility will break you there. I'd definitely look at a
separate table. The problem with SLIT etc. is that, as static tables, we
can't play games with OSC bits to negotiate what the OS and the firmware
both understand.

>
> > A lot of thought has gone in (and meeting time) to try and improve the
> > situation for complex topology around NUMA. Whilst there are differences
> > in representing the internal interconnects and caches, it seems like a
> > somewhat similar problem. The issue there is that it is really, really hard
> > to describe this stuff with enough detail to be useful, but simply enough
> > to be usable.
> >
> > https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/
> >
>
> Thanks for the link!
>
> >>
> >> It's mostly pipe dreams for now, but there seems to be more and more
> >> hardware where that would make sense; somewhat recently the PowerPC guys
> >> added something to their arch-specific code in that regard.
> >
> > Pipe dream == something to work on ;)
> >
> > ACPI has a nice code-first model of updating the spec now, so we can discuss
> > this one in public, and propose spec changes only once we have a proven
> > implementation.
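As a rough (and untested) illustration of what a distance-table consumer
already sees today: the SLIT-derived node distances are exported to
userspace, one row per node, under /sys/devices/system/node/nodeX/distance.
A separate core-to-core or cluster-to-cluster table would presumably be
dumped much the same way, just over a finer-grained ID space than PXMs:

/*
 * Sketch only: walk the NUMA nodes and print each SLIT-style
 * distance row as exposed by sysfs today.
 */
#include <stdio.h>

int main(void)
{
	int node;

	for (node = 0; ; node++) {
		char path[64], buf[256];
		FILE *f;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/distance", node);
		f = fopen(path, "r");
		if (!f)
			break;	/* no more nodes */
		if (fgets(buf, sizeof(buf), f))
			/* one row of the distance matrix, e.g. "10 20" */
			printf("node%d: %s", node, buf);
		fclose(f);
	}
	return 0;
}

Whether a single number per pair is expressive enough is of course exactly
the problem HMAT was trying to get away from.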
>
> FWIW I blabbered about a "generalization" of NUMA domains & distances
> within the scheduler at LPC19 (and have been pasting that occasionally,
> apologies for the broken record):
>
> https://linuxplumbersconf.org/event/4/contributions/484/
>
> I've only pondered the implementation, but if (big if; also I really
> despise advertising "the one solution that will solve all your issues",
> which this is starting to sound like) it would help, I could cobble together
> an RFC leveraging a separate distance table.

It would certainly be interesting.

>
> It doesn't solve the "funneling cache properties into a single number"
> issue, which, as you just pointed out in a parallel email, is a separate
> discussion altogether.
>
> > Note I'm not proposing we put the cluster stuff in the scheduler, just
> > provide it as a hint to userspace.
> >
>
> The goal being to tweak tasks' affinities, right? Other than CPU pinning
> and rare cases, IMO if userspace has to mess around with affinities it
> is due to the failings of the underlying scheduler. Restricted CPU
> affinities are also something the load-balancer struggles with; I have
> been fighting over such issues where just a single per-CPU kworker
> waking up at the wrong time can mess up load-balance for quite some time.
> I tend to phrase it as: "if you're rude to the scheduler, it can and will
> respond in kind".
>
> Now yes, it's not the same timescale nor amount of work, but this is
> something the scheduler itself should leverage, not userspace.

Ideally I absolutely agree, but then we get into the game of trying to
classify the types of workload which would benefit. Much like with NUMA
spreading, it is going to be hard to come up with one true solution
(nice though that would be!)

Not getting regressions with anything in this area is going to be really
tricky.

J

>
> > Jonathan
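P.S. To make the "hint to userspace" idea slightly more concrete, here is a
rough, untested sketch of the sort of consumer I have in mind. The attribute
name (cluster_cpus_list) is only what the RFC currently proposes and may
well change; the point is just that a runtime which already manages its own
placement can discover the cluster span on its own, without the scheduler
needing to act on it:

/*
 * Sketch only: read the proposed cluster_cpus_list hint for a given
 * CPU and pin the calling task to that cluster.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;
	char path[128], buf[256], *tok, *save;
	cpu_set_t set;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list", cpu);
	f = fopen(path, "r");
	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror(path);
		return 1;
	}
	fclose(f);

	CPU_ZERO(&set);
	/* cpulist format: comma-separated single CPUs or "a-b" ranges */
	for (tok = strtok_r(buf, ",\n", &save); tok;
	     tok = strtok_r(NULL, ",\n", &save)) {
		int lo, hi, i;

		if (sscanf(tok, "%d-%d", &lo, &hi) != 2)
			hi = lo = atoi(tok);
		for (i = lo; i <= hi; i++)
			CPU_SET(i, &set);
	}

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}
	printf("pinned to the cluster containing cpu%d\n", cpu);
	return 0;
}

Obviously that only helps workloads which already manage their own affinity;
it doesn't address the load-balancer interaction you describe.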