From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
To: "Aneesh Kumar K.V", Srikar Dronamraju, Michael Ellerman,
    Benjamin Herrenschmidt, Michael Neuling, Vaidyanathan Srinivasan,
    Akshay Adiga, Shilpasri G Bhat, "Oliver O'Halloran",
    Nicholas Piggin, Murilo Opsfelder Araujo, Anton Blanchard
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
    "Gautham R. Shenoy"
Subject: [PATCH v8 0/3] powerpc: Detection and scheduler optimization for POWER9 bigcore
Date: Thu, 20 Sep 2018 22:52:36 +0530
Message-Id: <1537464159-25919-1-git-send-email-ego@linux.vnet.ibm.com>

Hi,

This is the eighth iteration of the patchset to add support for
big-cores on POWER9. This patchset also optimizes task placement on
such big-core systems.

The previous versions can be found here:

v7: https://lkml.org/lkml/2018/8/20/52
v6: https://lkml.org/lkml/2018/8/9/119
v5: https://lkml.org/lkml/2018/8/6/587
v4: https://lkml.org/lkml/2018/7/24/79
v3: https://lkml.org/lkml/2018/7/6/255
v2: https://lkml.org/lkml/2018/7/3/401
v1: https://lkml.org/lkml/2018/5/11/245

Changes: v7 --> v8:
 - Reorganized the patch series into three patches:
   - The first patch discovers the big-cores and initializes a
     per-cpu cpumask with the small-core siblings of each CPU.
   - The second patch uses the small-core siblings at the SMT level
     sched-domains on big-core systems, and also activates the CACHE
     domain that corresponds to the big-core, where all the threads
     share the L2 cache.
   - The third patch creates a pair of sysfs attributes named
     /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings
     and
     /sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings_list.
     This patch addresses Michael Neuling's review comment on the
     previous iteration.

Description:
~~~~~~~~~~~~

A pair of IBM POWER9 SMT4 cores can be fused together to form a
big-core with 8 SMT threads. This can be discovered via the
"ibm,thread-groups" CPU property in the device tree, which indicates
which groups of threads share the L1 cache, the translation cache and
the instruction data flow. If there are multiple such groups of
threads, then the core is a big-core. Furthermore, on POWER9 the
thread-ids of such a big-core are obtained by interleaving the
thread-ids of the component SMT4 cores.

Eg: The threads in the pair of component SMT4 cores of an interleaved
big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.

           -------------------------
           |       L1 Cache        |
       ----------------------------------
       |L2|     |     |     |     |
       |  |  0  |  2  |  4  |  6  | Small Core0
       |C |     |     |     |     |
 Big   |a -------------------------
 Core  |c |     |     |     |     |
       |h |  1  |  3  |  5  |  7  | Small Core1
       |e |     |     |     |     |
       ----------------------------------
           |       L1 Cache        |
           -------------------------

On such a big-core system, when multiple tasks are scheduled to run
on the big-core, we get the best performance when the tasks are
spread across the pair of SMT4 cores.
Eg: Suppose 4 tasks {p1, p2, p3, p4} are run on a big-core.

An example of optimal task placement:

              --------------------------
              |     |     |     |      |
              |  0  |  2  |  4  |  6   | Small Core0
              | (p1)| (p2)|     |      |
   Big Core   --------------------------
              |     |     |     |      |
              |  1  |  3  |  5  |  7   | Small Core1
              |     | (p3)|     | (p4) |
              --------------------------

An example of suboptimal task placement:

              --------------------------
              |     |     |     |      |
              |  0  |  2  |  4  |  6   | Small Core0
              | (p1)| (p2)|     | (p4) |
   Big Core   --------------------------
              |     |     |     |      |
              |  1  |  3  |  5  |  7   | Small Core1
              |     | (p3)|     |      |
              --------------------------

In order to achieve optimal task placement on big-core systems, we
define the SMT level sched-domain to consist of the threads belonging
to a small core. The CACHE level sched-domain consists of all the
threads belonging to the big-core. With this, the Linux kernel
load-balancer will ensure that the tasks are spread across all the
component small cores in the system, thereby yielding optimum
performance.

Furthermore, this solution works correctly across all SMT modes
(8,4,2), since the interleaved thread-ids ensure that when we go to
the lower SMT modes (4,2) the threads are offlined in descending
order, leaving an equal number of threads from each component small
core online, as illustrated below.
With patches (ppc64_cpu --smt=on): SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
 domain-0: span=0,2,4,6 level=SMT
  groups: 0:{ span=0 cap=294 }, 2:{ span=2 cap=294 },
          4:{ span=4 cap=294 }, 6:{ span=6 cap=294 }
CPU1 attaching sched-domain(s):
 domain-0: span=1,3,5,7 level=SMT
  groups: 1:{ span=1 cap=294 }, 3:{ span=3 cap=294 },
          5:{ span=5 cap=294 }, 7:{ span=7 cap=294 }

Optimal task placement (SMT 8):

              --------------------------
              |     |     |     |      |
              |  0  |  2  |  4  |  6   | Small Core0
              | (p1)| (p2)|     |      |
   Big Core   --------------------------
              |     |     |     |      |
              |  1  |  3  |  5  |  7   | Small Core1
              |     | (p3)|     | (p4) |
              --------------------------

With patches (ppc64_cpu --smt=4): SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
 domain-0: span=0,2 level=SMT
  groups: 0:{ span=0 cap=589 }, 2:{ span=2 cap=589 }
CPU1 attaching sched-domain(s):
 domain-0: span=1,3 level=SMT
  groups: 1:{ span=1 cap=589 }, 3:{ span=3 cap=589 }

Optimal task placement (SMT 4):

              --------------------------
              |     |     |     |      |
              |  0  |  2  |  4  |  6   | Small Core0
              | (p1)| (p2)| Off | Off  |
   Big Core   --------------------------
              |     |     |     |      |
              |  1  |  3  |  5  |  7   | Small Core1
              | (p4)| (p3)| Off | Off  |
              --------------------------

With patches (ppc64_cpu --smt=2): SMT domain ceases to exist.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Optimal task placement (SMT 2):

              --------------------------
              | (p2)|     |     |      |
              |  0  |  2  |  4  |  6   | Small Core0
              | (p1)| Off | Off | Off  |
   Big Core   --------------------------
              | (p3)|     |     |      |
              |  1  |  3  |  5  |  7   | Small Core1
              | (p4)| Off | Off | Off  |
              --------------------------

Thus, as an added advantage in SMT=2 mode, we will have only 3 levels
in the sched-domain topology (CACHE, DIE and NUMA).

The SMT levels without the patches are as follows.
Without patches (ppc64_cpu --smt=on): SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
 domain-0: span=0-7 level=SMT
  groups: 0:{ span=0 cap=147 }, 1:{ span=1 cap=147 },
          2:{ span=2 cap=147 }, 3:{ span=3 cap=147 },
          4:{ span=4 cap=147 }, 5:{ span=5 cap=147 },
          6:{ span=6 cap=147 }, 7:{ span=7 cap=147 }
CPU1 attaching sched-domain(s):
 domain-0: span=0-7 level=SMT
  groups: 1:{ span=1 cap=147 }, 2:{ span=2 cap=147 },
          3:{ span=3 cap=147 }, 4:{ span=4 cap=147 },
          5:{ span=5 cap=147 }, 6:{ span=6 cap=147 },
          7:{ span=7 cap=147 }, 0:{ span=0 cap=147 }

Without patches (ppc64_cpu --smt=4): SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
 domain-0: span=0-3 level=SMT
  groups: 0:{ span=0 cap=294 }, 1:{ span=1 cap=294 },
          2:{ span=2 cap=294 }, 3:{ span=3 cap=294 }
CPU1 attaching sched-domain(s):
 domain-0: span=0-3 level=SMT
  groups: 1:{ span=1 cap=294 }, 2:{ span=2 cap=294 },
          3:{ span=3 cap=294 }, 0:{ span=0 cap=294 }

Without patches (ppc64_cpu --smt=2): SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
 domain-0: span=0-1 level=SMT
  groups: 0:{ span=0 cap=589 }, 1:{ span=1 cap=589 }
CPU1 attaching sched-domain(s):
 domain-0: span=0-1 level=SMT
  groups: 1:{ span=1 cap=589 }, 0:{ span=0 cap=589 }

This patchset contains three patches which, on detecting the presence
of big-cores, define the SMT level sched-domain to correspond to the
threads of the small cores.

Patch 1: Adds support to detect the presence of big-cores and
initializes a per-cpu cpumask with the small-core siblings of each CPU.

Patch 2: Defines the SMT level sched-domain to correspond to the
threads of the small cores.

Patch 3: Reports the small-core siblings of each CPU N via the sysfs
attributes
/sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings and
/sys/devices/system/cpu/cpuN/topology/smallcore_thread_siblings_list.

Results:
~~~~~~~~

1) 2 thread ebizzy
~~~~~~~~~~~~~~~~~~~~~~
Experimental results for ebizzy with 2 threads, bound to a single
big-core, show a marked improvement with this patchset over the
4.19-rc4 vanilla kernel.
The results of 100 such runs for the 4.19-rc4 kernel and
4.19-rc4 + big-core-smt-patches are as follows:

4.19.0-rc4 vanilla
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s           : # samples : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[0000000 - 1000000] :     0     : #
[1000000 - 2000000] :     1     : #
[2000000 - 3000000] :     2     : #
[3000000 - 4000000] :    17     : ####
[4000000 - 5000000] :     9     : ##
[5000000 - 6000000] :     5     : ##
[6000000 - 7000000] :    66     : ##############
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

4.19-rc4 + big-core-patches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s           : # samples : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[0000000 - 1000000] :     0     : #
[1000000 - 2000000] :     0     : #
[2000000 - 3000000] :     5     : ##
[3000000 - 4000000] :     9     : ##
[4000000 - 5000000] :     0     : #
[5000000 - 6000000] :     2     : #
[6000000 - 7000000] :    84     : #################
=================================================

2) Hackbench (perf bench sched pipe)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
500 iterations of hackbench were run on both the 4.19-rc4 vanilla
kernel and 4.19-rc4 + big-core-smt-patches. There isn't a significant
difference between the two. The values for Min, Max, Median and Avg
below are in seconds; lower is better.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4.19-rc4 vanilla
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
N    Min    Max    Median  Avg       Stddev
500  4.603  9.438  6.165   5.921446  0.47448034
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4.19-rc4 + big-core-patches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
N    Min    Max    Median  Avg       Stddev
500  4.532  6.476  6.224   5.982098  0.43021891
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Gautham R.
Shenoy (3):
  powerpc: Detect the presence of big-cores via "ibm,thread-groups"
  powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores
  powerpc/sysfs: Add topology/smallcore_thread_siblings[_list]

 Documentation/ABI/testing/sysfs-devices-system-cpu |  14 ++
 arch/powerpc/include/asm/cputhreads.h              |   2 +
 arch/powerpc/include/asm/smp.h                     |   6 +
 arch/powerpc/kernel/smp.c                          | 240 ++++++++++++++++++++-
 arch/powerpc/kernel/sysfs.c                        |  88 ++++++++
 5 files changed, 349 insertions(+), 1 deletion(-)

-- 
1.9.4