Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters
 and add cluster scheduler
To: "Song Bao Hua (Barry Song)", Morten Rasmussen, Tim Chen
Cc: "valentin.schneider@arm.com", "catalin.marinas@arm.com",
 "will@kernel.org", "rjw@rjwysocki.net", "vincent.guittot@linaro.org",
 "lenb@kernel.org", "gregkh@linuxfoundation.org", Jonathan Cameron,
 "mingo@redhat.com", "peterz@infradead.org", "juri.lelli@redhat.com",
 "rostedt@goodmis.org", "bsegall@google.com", "mgorman@suse.de",
 "mark.rutland@arm.com", "sudeep.holla@arm.com", "aubrey.li@linux.intel.com",
"linux-arm-kernel@lists.infradead.org" , "linux-kernel@vger.kernel.org" , "linux-acpi@vger.kernel.org" , "linuxarm@openeuler.org" , "xuwei (O)" , "Zengtao (B)" , "tiantao (H)" References: <20210106083026.40444-1-song.bao.hua@hisilicon.com> <737932c9-846a-0a6b-08b8-e2d2d95b67ce@linux.intel.com> <20210108151241.GA47324@e123083-lin> From: Dietmar Eggemann Message-ID: Date: Tue, 12 Jan 2021 13:53:26 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/01/2021 22:30, Song Bao Hua (Barry Song) wrote: > >> -----Original Message----- >> From: Morten Rasmussen [mailto:morten.rasmussen@arm.com] >> Sent: Saturday, January 9, 2021 4:13 AM >> To: Tim Chen >> Cc: Song Bao Hua (Barry Song) ; >> valentin.schneider@arm.com; catalin.marinas@arm.com; will@kernel.org; >> rjw@rjwysocki.net; vincent.guittot@linaro.org; lenb@kernel.org; >> gregkh@linuxfoundation.org; Jonathan Cameron ; >> mingo@redhat.com; peterz@infradead.org; juri.lelli@redhat.com; >> dietmar.eggemann@arm.com; rostedt@goodmis.org; bsegall@google.com; >> mgorman@suse.de; mark.rutland@arm.com; sudeep.holla@arm.com; >> aubrey.li@linux.intel.com; linux-arm-kernel@lists.infradead.org; >> linux-kernel@vger.kernel.org; linux-acpi@vger.kernel.org; >> linuxarm@openeuler.org; xuwei (O) ; Zengtao (B) >> ; tiantao (H) >> Subject: Re: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and >> add cluster scheduler >> >> On Thu, Jan 07, 2021 at 03:16:47PM -0800, Tim Chen wrote: >>> On 1/6/21 12:30 AM, Barry Song wrote: >>>> ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and each >>>> cluster has 4 cpus. All clusters share L3 cache data while each cluster >>>> has local L3 tag. On the other hand, each cluster will share some >>>> internal system bus. This means cache is much more affine inside one cluster >>>> than across clusters. >>> >>> There is a similar need for clustering in x86. Some x86 cores could share >> L2 caches that >>> is similar to the cluster in Kupeng 920 (e.g. on Jacobsville there are 6 clusters >>> of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing >> L3). >>> Having a sched domain at the L2 cluster helps spread load among >>> L2 domains. This will reduce L2 cache contention and help with >>> performance for low to moderate load scenarios. >> >> IIUC, you are arguing for the exact opposite behaviour, i.e. balancing >> between L2 caches while Barry is after consolidating tasks within the >> boundaries of a L3 tag cache. One helps cache utilization, the other >> communication latency between tasks. Am I missing something? > > Morten, this is not true. > > we are both actually looking for the same behavior. My patch also > has done the exact same behavior of spreading with Tim's patch. That's the case for the load-balance path because of the extra Sched Domain (SD) (CLS/MC_L2) below MC. But in wakeup you add code which leads to a different packing strategy. It looks like that Tim's workload (SPECrate mcf) shows a performance boost solely because of the changes the additional MC_L2 SD introduces in load balance. The wakeup path is unchanged, i.e. llc-packing. IMHO we have to carefully distinguish between packing vs. spreading in wakeup and load-balance here. > Considering the below two cases: > Case 1. 
> Consider the below two cases:
>
> Case 1. We have two tasks without any relationship running in a system
> with 2 clusters and 8 cpus.
>
> Without the sched_domain of cluster, these two tasks might be put as
> below:
>
> +-------------------+  +-----------------+
> | +----+  +----+    |  |                 |
> | |task|  |task|    |  |                 |
> | |1   |  |2   |    |  |                 |
> | +----+  +----+    |  |                 |
> |                   |  |                 |
> |     cluster1      |  |     cluster2    |
> +-------------------+  +-----------------+
>
> With the sched_domain of cluster, load balance will spread them as
> below:
>
> +-------------------+  +-----------------+
> | +----+            |  | +----+          |
> | |task|            |  | |task|          |
> | |1   |            |  | |2   |          |
> | +----+            |  | +----+          |
> |                   |  |                 |
> |     cluster1      |  |     cluster2    |
> +-------------------+  +-----------------+
>
> Then task1 and task2 get more cache and cache contention decreases.
> They will get better performance.
>
> That is what my original patch can also achieve, and what Tim's patch
> is also doing. Once we add a sched_domain, load balance will get
> involved.
>
>
> Case 2. We have 8 tasks running in a system with 2 clusters and 8 cpus,
> but they are working in 4 groups:
> Task1 wakes up task4
> Task2 wakes up task5
> Task3 wakes up task6
> Task7 wakes up task8
>
> With my change in select_idle_sibling(), the WAKE_AFFINE mechanism will
> try to put task1 and 4, task2 and 5, task3 and 6, task7 and 8 in the
> same clusters rather than putting all of them on random ones of the 8
> cpus. However, the 8 tasks are still spread among the 8 cpus with my
> change in select_idle_sibling() as load balance is still working.
>
> +---------------------------+  +----------------------+
> | +----+   +-----+          |  | +----+   +-----+     |
> | |task|   |task |          |  | |task|   |task |     |
> | |1   |   |4    |          |  | |2   |   |5    |     |
> | +----+   +-----+          |  | +----+   +-----+     |
> |                           |  |                      |
> |         cluster1          |  |       cluster2       |
> |                           |  |                      |
> |                           |  |                      |
> | +-----+  +------+         |  | +-----+  +------+    |
> | |task |  |task  |         |  | |task |  |task  |    |
> | |3    |  |6     |         |  | |7    |  |8     |    |
> | +-----+  +------+         |  | +-----+  +------+    |
> +---------------------------+  +----------------------+

Your use case (#tasks, runtime/period) seems to be perfectly crafted to
show the benefit of your patch on your specific system (cluster size =
4). IMHO, this extra infrastructure, especially in the wakeup path,
should show benefits over a range of different benchmarks.

> Let's consider the 3rd case, which would be more tricky:
>
> task1 and task2 have a close relationship and they are a waker-wakee
> pair. With my current patch, select_idle_sibling() wants to put them in
> one cluster, while load balance wants to put them in two clusters. Load
> balance will win. Then maybe we need some mechanism like the one
> adjusting the NUMA imbalance:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/sched/fair.c?id=b396f52326de20
> If we permit a light imbalance between clusters, select_idle_sibling()
> will win, and task1 and task2 get better cache affinity.

It would look weird to allow this kind of imbalance on the CLS (MC_L2)
and NUMA domains but not on the MC domain, for example.
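
For reference, what "permit a light imbalance between clusters" would
roughly look like (a sketch only, mirroring the shape of the NUMA
imbalance logic from the commit linked above; the name
adjust_cluster_imbalance() and its threshold are made up, not existing
kernel code):

/*
 * Tolerate a small task imbalance between cluster groups while the
 * busiest group is almost idle, so that a waker/wakee pair placed into
 * one cluster by select_idle_sibling() is not immediately pulled apart
 * by load balance. The threshold of 2 (one communicating pair) is
 * purely illustrative.
 */
static inline long adjust_cluster_imbalance(long imbalance,
					    unsigned int nr_running)
{
	unsigned int imbalance_min = 2;

	if (nr_running <= imbalance_min)
		return 0;

	return imbalance;
}

And that is the problem: if the load-balance path tolerates such an
imbalance for CLS (and NUMA) but not for MC, that special casing is hard
to justify.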