From: Barry Song
Subject: [RFC PATCH v3 0/2] scheduler: expose the topology of clusters and add cluster scheduler
Date: Wed, 6 Jan 2021 21:30:24 +1300
Message-ID: <20210106083026.40444-1-song.bao.hua@hisilicon.com>
X-Mailing-List: linux-kernel@vger.kernel.org

The ARM64 server chip Kunpeng 920 has 6 clusters in each NUMA node, and
each cluster has 4 CPUs. All clusters share the L3 cache data, while each
cluster has its own local L3 tag. In addition, the CPUs within each
cluster share some internal system bus.
This means cache affinity is much stronger inside one cluster than
across clusters.

+-----------------------------------+                          +---------+
|  +------+    +------+             +--------------------------+         |
|  | CPU0 |    | CPU1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |   L3    |
|  +------+    +------+             |    |           |         |   data  |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
                ...                                            |   ...   |

The following small program shows the performance impact of running in
one cluster versus across two clusters:

    #include <pthread.h>

    /*
     * x and y are adjacent, so they normally sit in the same cache line:
     * thread1 keeps reading x while thread2 keeps writing y.
     * Build without optimization, otherwise the loops may be optimized away.
     */
    struct foo {
            int x;
            int y;
    } f;

    void *thread1_fun(void *param)
    {
            int s = 0;
            for (int i = 0; i < 0xfffffff; i++)
                    s += f.x;
            return NULL;
    }

    void *thread2_fun(void *param)
    {
            for (int i = 0; i < 0xfffffff; i++)
                    f.y++;
            return NULL;
    }

    int main(int argc, char **argv)
    {
            pthread_t tid1, tid2;
            pthread_create(&tid1, NULL, thread1_fun, NULL);
            pthread_create(&tid2, NULL, thread2_fun, NULL);
            pthread_join(tid1, NULL);
            pthread_join(tid2, NULL);
            return 0;
    }

While running this program in one cluster, it takes:

$ time taskset -c 0,1 ./a.out
real    0m0.832s
user    0m1.649s
sys     0m0.004s

In contrast, it takes much more time when the same program runs across
two clusters:

$ time taskset -c 0,4 ./a.out
real    0m1.133s
user    0m1.960s
sys     0m0.000s

0.832/1.133 = 73%: the single-cluster run needs only 73% of the time of
the cross-cluster run, which is a huge difference.

hackbench running on 4 CPUs in a single cluster versus 4 CPUs in
different clusters also shows a large contrast:

* inside a cluster:
root@ubuntu:~# taskset -c 0,1,2,3 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 4.285

* across clusters:
root@ubuntu:~# taskset -c 0,4,8,12 hackbench -p -T -l 20000 -g 1
Running in threaded mode with 1 groups using 40 file descriptors each
(== 40 tasks)
Each sender will pass 20000 messages of 100 bytes
Time: 5.524

The score is 4.285 vs 5.524; a shorter time means better performance.

All of these tests imply that the Linux scheduler should use this
topology to make better load-balancing and WAKE_AFFINE decisions.
However, the current scheduler has no notion of clusters.

This patchset first exposes the cluster topology, then adds a sched
domain for clusters. Although it is named "cluster", architectures and
machines can define the exact meaning of a cluster, as long as some
resource is shared below the LLC and the affinity of that resource can
be leveraged to achieve better scheduling performance.

-v3:
  - rebased against 5.11-rc2
  - with respect to the comments of Valentin Schneider, Peter Zijlstra,
    Vincent Guittot and Mel Gorman etc.
    * moved the scheduler changes from arm64 to the common place for all
      architectures.
    * added the SD_SHARE_CLS_RESOURCES sd_flag, which specifies the
      sched_domain where select_idle_cpu() should begin to scan from
    * removed the redundant select_idle_cluster() function since all the
      code is in select_idle_cpu() now; this also avoids scanning the
      cluster CPUs twice, as the v2 code did
    * reran hackbench within one NUMA node after the above changes

Valentin suggested that select_idle_cpu() could begin to scan from the
domain with SD_SHARE_PKG_RESOURCES. Changing it like this might be too
aggressive and limit the spreading of tasks. Thus, this patch lets
architectures and machines decide where to start by adding a new
SD_SHARE_CLS_RESOURCES flag.

Barry Song (1):
  scheduler: add scheduler level for clusters

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die.

 Documentation/admin-guide/cputopology.rst | 26 +++++++++++---
 arch/arm64/Kconfig                        |  7 ++++
 arch/arm64/kernel/topology.c              |  2 ++
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++++
 drivers/base/topology.c                   | 10 ++++++
 include/linux/acpi.h                      |  5 +++
 include/linux/arch_topology.h             |  5 +++
 include/linux/sched/sd_flags.h            |  9 +++++
 include/linux/sched/topology.h            |  7 ++++
 include/linux/topology.h                  | 13 +++++++
 kernel/sched/fair.c                       | 27 ++++++++++----
 kernel/sched/topology.c                   |  6 ++++
 13 files changed, 181 insertions(+), 10 deletions(-)

--
2.7.4
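
As a reading aid for the CPU lists passed to taskset in the tests above,
here is a minimal sketch of the cluster arithmetic. It assumes CPUs are
numbered contiguously cluster by cluster, so CPU n sits in cluster n / 4.
That matches the masks used in the tests (0,1 inside one cluster; 0,4 and
0,4,8,12 across clusters), but it is an assumption and not something the
patches guarantee. cluster_of(), same_cluster() and extra_time_percent()
are hypothetical helper names for illustration only; they are not part
of this patchset.

```c
#include <assert.h>
#include <stdbool.h>

/* From the cover letter: each Kunpeng 920 cluster has 4 CPUs. */
#define CPUS_PER_CLUSTER 4

/*
 * Map a CPU number to its cluster index.
 * Assumption: CPUs are numbered contiguously, cluster by cluster.
 */
int cluster_of(int cpu)
{
	assert(cpu >= 0);
	return cpu / CPUS_PER_CLUSTER;
}

/* True if both CPUs fall into the same cluster under that assumption. */
bool same_cluster(int cpu_a, int cpu_b)
{
	return cluster_of(cpu_a) == cluster_of(cpu_b);
}

/* How much longer, in percent, the slow run takes than the fast run. */
double extra_time_percent(double fast_seconds, double slow_seconds)
{
	assert(fast_seconds > 0.0);
	return (slow_seconds / fast_seconds - 1.0) * 100.0;
}
```

Under this numbering, same_cluster(0, 1) is true and same_cluster(0, 4)
is false, which is why "taskset -c 0,1" and "taskset -c 0,4" exercise
the two cases. Applied to the measurements above,
extra_time_percent(0.832, 1.133) is roughly 36 and
extra_time_percent(4.285, 5.524) is roughly 29, i.e. the cross-cluster
runs took about 36% and 29% longer respectively.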