From: Barry Song
Subject: [RFC PATCH v6 0/4] scheduler: expose the topology of clusters and add cluster scheduler
Date: Tue, 20 Apr 2021 12:18:40 +1200
Message-ID: <20210420001844.9116-1-song.bao.hua@hisilicon.com>
X-Mailing-List: linux-kernel@vger.kernel.org

ARM64 server chip Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each cluster has 4 cpus. All clusters share L3 cache data while each cluster has local L3 tag.
On the other hand, the CPUs in each cluster share some internal system bus. This means cache affinity is much stronger within one cluster than across clusters.

+-----------------------------------+                      +---------+
|  +------+  +------+               |    +-----------+     |         |
|  | CPU0 |  | CPU1 |               |    |           |     |         |
|  +------+  +------+    cluster    +----+  L3 tag   +-----+         |
|  +------+  +------+               |    |           |     |         |
|  | CPU2 |  | CPU3 |               |    +-----------+     |         |
|  +------+  +------+               |                      |         |
+-----------------------------------+                      |   L3    |
+-----------------------------------+                      |  data   |
|  +------+  +------+               |    +-----------+     |         |
|  |      |  |      |               |    |           |     |         |
|  +------+  +------+    cluster    +----+  L3 tag   +-----+         |
|  +------+  +------+               |    |           |     |         |
|  |      |  |      |               |    +-----------+     |         |
|  +------+  +------+               |                      |         |
+-----------------------------------+                      +---------+

(The remaining clusters in the node follow the same layout.)

There is a similar need for clustering in x86. Some x86 cores share an L2 cache, which is similar to the cluster in Kunpeng 920 (e.g. on Jacobsville there are 6 clusters of 4 Atom cores, each cluster sharing a separate L2, and 24 cores sharing L3).

Having a sched_domain for clusters brings two improvements:

1.
spreading unrelated tasks among clusters, which decreases resource contention and improves throughput.
Without a cluster sched_domain, unrelated tasks might be placed randomly:

+-------------------+     +-----------------+
| +----+    +----+  |     |                 |
| |task|    |task|  |     |                 |
| |1   |    |2   |  |     |                 |
| +----+    +----+  |     |                 |
|                   |     |                 |
|     cluster1      |     |    cluster2     |
+-------------------+     +-----------------+

but with a cluster sched_domain, they are likely to be spread by load balancing (LB):

+-------------------+     +-----------------+
| +----+            |     | +----+          |
| |task|            |     | |task|          |
| |1   |            |     | |2   |          |
| +----+            |     | +----+          |
|                   |     |                 |
|     cluster1      |     |    cluster2     |
+-------------------+     +-----------------+

2. gathering related tasks within a cluster, which improves the cache affinity of tasks talking with each other.
Without a cluster sched_domain, related tasks might be placed randomly. Suppose tasks 1-8 are related as follows:

Task1 talks with task5
Task2 talks with task6
Task3 talks with task7
Task4 talks with task8

With select_idle_cpu() tuned to scan the local cluster first, those tasks get a chance to be gathered like:

+---------------------------+     +----------------------+
| +----+     +----+         |     | +----+     +----+    |
| |task|     |task|         |     | |task|     |task|    |
| |1   |     |5   |         |     | |3   |     |7   |    |
| +----+     +----+         |     | +----+     +----+    |
|                           |     |                      |
|         cluster1          |     |       cluster2       |
|                           |     |                      |
| +----+     +----+         |     | +----+     +----+    |
| |task|     |task|         |     | |task|     |task|    |
| |2   |     |6   |         |     | |4   |     |8   |    |
| +----+     +----+         |     | +----+     +----+    |
+---------------------------+     +----------------------+

Otherwise, the result might be:

+---------------------------+     +----------------------+
| +----+     +----+         |     | +----+     +----+    |
| |task|     |task|         |     | |task|     |task|    |
| |1   |     |2   |         |     | |5   |     |6   |    |
| +----+     +----+         |     | +----+     +----+    |
|                           |     |                      |
|         cluster1          |     |       cluster2       |
|                           |     |                      |
| +----+     +----+         |     | +----+     +----+    |
| |task|     |task|         |     | |task|     |task|    |
| |3   |     |4   |         |     | |7   |     |8   |    |
| +----+     +----+         |     | +----+     +----+    |
+---------------------------+     +----------------------+

-v6:
  * added topology_cluster_cpumask() for x86, code provided by Tim.
  * emulated a two-level spreading/packing heuristic by only scanning the
    cluster in the wake_affine path for tasks running in the same LLC (also NUMA).
    This partially addresses Dietmar's comment in RFC v3:
    "In case we would like to further distinguish between llc-packing and
    even narrower (cluster or MC-L2)-packing, we would introduce a 2. level
    packing vs. spreading heuristic further down in sis().
    IMHO, Barry's current implementation doesn't do this right now. Instead
    he's trying to pack on cluster first and if not successful look further
    among the remaining llc CPUs for an idle CPU."
  * adjusted the hackbench parameters to produce relatively low and high load.
    Previous patchsets with "-f 10" ran under an extremely high load with
    hundreds of threads, which does not seem to be a real use case.
    This also addresses Vincent's question in RFC v4:
    "In particular, I'm still not convinced that the modification of the wakeup
    path is the root of the hackbench improvement; especially with g=14 where
    there should not be much idle CPUs with 14*40 tasks on at most 32 CPUs."

-v5:
  * split "add scheduler level for clusters" into two patches to evaluate
    the impact of spreading and gathering separately;
  * add a tracepoint of select_idle_cpu for debug purpose; add bcc script
    in commit log;
  * add cluster_id = -1 in reset_cpu_topology()
  * rebased to tip/sched/core

-v4:
  * rebased to tip/sched/core with the latest unified code of select_idle_cpu
  * added Tim's patch for x86 Jacobsville
  * also added benchmark data of spreading unrelated tasks
  * avoided the iteration of sched_domain by moving to static_key (addressing
    Vincent's comment)
  * used acpi_cpu_id for acpi_find_processor_node (addressing Masa's comment)

Barry Song (2):
  scheduler: add scheduler level for clusters
  scheduler: scan idle cpu in cluster for tasks within one LLC

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++--
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 +
 arch/x86/Kconfig                          |  8 ++++
 arch/x86/include/asm/smp.h                |  7 ++++
 arch/x86/include/asm/topology.h           |  2 +
 arch/x86/kernel/cpu/cacheinfo.c           |  1 +
 arch/x86/kernel/cpu/common.c              |  3 ++
 arch/x86/kernel/smpboot.c                 | 43 ++++++++++++++++++++-
 block/blk-mq.c                            |  2 +-
 drivers/acpi/pptt.c                       | 63 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 15 ++++++++
 drivers/base/topology.c                   | 10 +++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/cluster.h             | 19 ++++++++++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            | 12 +++++-
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/core.c                       | 29 ++++++++++++--
 kernel/sched/fair.c                       | 51 +++++++++++++++----------
 kernel/sched/sched.h                      |  4 ++
 kernel/sched/topology.c                   | 18 +++++++++
 23 files changed, 324 insertions(+), 30 deletions(-)
 create mode 100644 include/linux/sched/cluster.h

--
1.8.3.1
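
As a rough illustration of the cluster-first scan described in Dietmar's quoted comment (try the target's cluster first, then look among the remaining LLC CPUs), here is a small userspace Python model. It is only a sketch and not the kernel code: pick_idle_cpu(), cluster_of() and the assumption that each cluster is a contiguous block of 4 CPU ids are hypothetical, chosen for illustration.

```python
from typing import Optional, Sequence, Set

CLUSTER_SIZE = 4  # Kunpeng 920 has 4 CPUs per cluster (from the text above)


def cluster_of(cpu: int, cluster_size: int = CLUSTER_SIZE) -> range:
    """Return the CPU ids sharing a cluster with `cpu`.

    Assumption: clusters are contiguous, aligned blocks of CPU ids.
    Real hardware numbering may differ.
    """
    first = (cpu // cluster_size) * cluster_size
    return range(first, first + cluster_size)


def pick_idle_cpu(target: int, llc_cpus: Sequence[int], idle: Set[int],
                  cluster_size: int = CLUSTER_SIZE) -> Optional[int]:
    """Toy two-level scan: the target's cluster first, then the rest of the LLC."""
    llc = set(llc_cpus)
    cluster = [c for c in cluster_of(target, cluster_size) if c in llc]

    # Level 1: prefer an idle CPU inside the target's own cluster.
    for cpu in cluster:
        if cpu in idle:
            return cpu

    # Level 2: fall back to the remaining CPUs that share the LLC.
    for cpu in llc_cpus:
        if cpu not in cluster and cpu in idle:
            return cpu

    return None  # no idle CPU in this LLC
```

With an 8-CPU LLC and target CPU 1, an idle CPU 3 (same cluster) is preferred over an idle CPU 5 (neighbouring cluster); CPU 5 is picked only when the local cluster has no idle CPU.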