From: "Gautham R. Shenoy"
To: Srikar Dronamraju, Michael Ellerman, Benjamin Herrenschmidt,
 Michael Neuling, Vaidyanathan Srinivasan, Akshay Adiga,
 Shilpasri G Bhat, "Oliver O'Halloran", Nicholas Piggin,
 Murilo Opsfelder Araujo, Anton Blanchard
Cc: linuxppc-dev@lists.ozlabs.org, linux-kernel@vger.kernel.org,
 "Gautham R.
 Shenoy"
Subject: [PATCH v6 0/2] powerpc: Detection and scheduler optimization for POWER9 bigcore
Date: Thu, 9 Aug 2018 11:02:06 +0530
Message-Id: <1533792728-6304-1-git-send-email-ego@linux.vnet.ibm.com>

From: "Gautham R. Shenoy"

Hi,

This is the sixth iteration of the patchset to add support for
big-core on POWER9. The patchset also optimizes task placement on
such big-core systems.

The previous versions can be found here:
v5: https://lkml.org/lkml/2018/8/6/587
v4: https://lkml.org/lkml/2018/7/24/79
v3: https://lkml.org/lkml/2018/7/6/255
v2: https://lkml.org/lkml/2018/7/3/401
v1: https://lkml.org/lkml/2018/5/11/245

Changes :

v5 --> v6:
 - Fixed the code to build without warnings for !CONFIG_SCHED_SMT.
 - While checking for shared caches on a big-core system, compare the
   smallcore_sibling_mask with the l2_cache_mask, which ensures that
   the CACHE level sched-domain is created.
 - Added benchmark results with hackbench to demonstrate the
   benefits of having the CACHE level sched-domain.

v4 --> v5:
 - Patch 2 is entirely different: Instead of using the
   CPU_FTR_ASYM_SMT feature, use the small core siblings at the SMT
   level sched-domain. This was suggested by Nicholas Piggin and
   Michael Ellerman.
 - A more detailed description follows below.

v3 --> v4:
 - Build fix for powerpc-g5 : Enable CPU_FTR_ASYM_SMT only on
   CONFIG_PPC_POWERNV and CONFIG_PPC_PSERIES.
 - Fixed a minor error in the ABI description.

v2 --> v3:
 - Set sane values in tg->property and tg->nr_groups inside
   parse_thread_groups before returning due to an error.
 - Define a helper function to determine whether a CPU device node
   is a big-core or not.
 - Updated the comments around the functions to describe the
   arguments passed to them.

v1 --> v2:
 - Added comments explaining the "ibm,thread-groups" device tree
   property.
 - Uses cleaner device-tree parsing functions to parse the u32 arrays.
 - Adds a sysfs file listing the small-core siblings for every CPU.
 - Enables the scheduler optimization by setting the CPU_FTR_ASYM_SMT
   bit in the cur_cpu_spec->cpu_features on detecting the presence of
   an interleaved big-core.
 - Handles the corner case where there is only a single thread-group
   or when there is a single thread in a thread-group.

Description:
~~~~~~~~~~~~~~~~~~~~
A pair of IBM POWER9 SMT4 cores can be fused together to form a
big-core with 8 SMT threads. This can be discovered via the
"ibm,thread-groups" CPU property in the device tree, which indicates
which groups of threads share the L1 cache, translation cache and
instruction data flow. If there are multiple such groups of threads,
then the core is a big-core. Furthermore, on POWER9 the thread-ids of
such a big-core are obtained by interleaving the thread-ids of the
component SMT4 cores.

Eg: Threads in the pair of component SMT4 cores of an interleaved
big-core are numbered {0,2,4,6} and {1,3,5,7} respectively.
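
As a rough illustration of the interleaving (a sketch only, not code
from the patches), the small-core siblings of a CPU can be derived
from its thread id. The function name small_core_siblings() and its
parameters are hypothetical, and the sketch assumes that big-cores
occupy consecutive blocks of 8 thread ids; the patches themselves
obtain the groups from "ibm,thread-groups" in the kernel.

```python
def small_core_siblings(cpu, threads_per_big_core=8, small_cores_per_big_core=2):
    """Return the thread ids that share a small core with `cpu`.

    Assumes (hypothetically) that thread ids of a big-core are
    contiguous and interleaved across its component small cores.
    """
    # First thread id of the big-core that contains `cpu`.
    first = cpu - (cpu % threads_per_big_core)
    # Index of the small core within the big-core (0 or 1 on POWER9).
    small_core = (cpu - first) % small_cores_per_big_core
    return [
        t
        for t in range(first, first + threads_per_big_core)
        if (t - first) % small_cores_per_big_core == small_core
    ]
```

For CPU 0 this yields the threads of Small Core0 and for CPU 3 the
threads of Small Core1, matching the numbering in the example above.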

           --------------------------
           |        L1 Cache        |
        -----------------------------
        |L2|     |     |     |      |
        |  |  0  |  2  |  4  |  6   |  Small Core0
        |C |     |     |     |      |
  Big   |a --------------------------
  Core  |c |     |     |     |      |
        |h |  1  |  3  |  5  |  7   |  Small Core1
        |e |     |     |     |      |
        -----------------------------
           |        L1 Cache        |
           --------------------------

On such a big-core system, when multiple tasks are scheduled to run on
the big-core, we get the best performance when the tasks are spread
across the pair of SMT4 cores.

Eg: Suppose 4 tasks {p1, p2, p3, p4} are run on a big core, then

An example of optimal task placement:

           --------------------------
           |     |     |     |      |
           |  0  |  2  |  4  |  6   |   Small Core0
           | (p1)| (p2)|     |      |
 Big Core  --------------------------
           |     |     |     |      |
           |  1  |  3  |  5  |  7   |   Small Core1
           |     | (p3)|     | (p4) |
           --------------------------

An example of suboptimal task placement:

           --------------------------
           |     |     |     |      |
           |  0  |  2  |  4  |  6   |   Small Core0
           | (p1)| (p2)|     | (p4) |
 Big Core  --------------------------
           |     |     |     |      |
           |  1  |  3  |  5  |  7   |   Small Core1
           |     | (p3)|     |      |
           --------------------------

In order to achieve optimal task placement on big-core systems, we
define the SMT level sched-domain to consist of the threads belonging
to the small cores. The CACHE level sched-domain will consist of all
the threads belonging to the big-core. With this, the Linux kernel
load-balancer will ensure that the tasks are spread across all the
component small cores in the system, thereby yielding optimum
performance.

Furthermore, this solution works correctly across all SMT modes
(8,4,2), as the interleaved thread-ids ensure that when we go to lower
SMT modes (4,2) the threads are offlined in descending order, thereby
leaving an equal number of threads from the component small cores
online, as illustrated below.
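
The offlining behaviour described above can be sketched as follows.
This is only an illustration for a single big-core whose threads are
numbered 0-7; the function name online_threads_by_small_core() and its
parameters are hypothetical and do not appear in the patches.

```python
def online_threads_by_small_core(smt_mode, threads_per_big_core=8, small_cores=2):
    """Group the online threads of one big-core by small core.

    Assumes threads are offlined in descending order of thread id, so
    in SMT mode N only threads 0..N-1 of the big-core stay online, and
    that thread ids are interleaved across the small cores.
    """
    if not 1 <= smt_mode <= threads_per_big_core:
        raise ValueError("smt_mode must be between 1 and threads_per_big_core")
    groups = {core: [] for core in range(small_cores)}
    for thread in range(smt_mode):
        # Interleaving: even ids belong to small core 0, odd ids to small core 1.
        groups[thread % small_cores].append(thread)
    return groups
```

With these assumptions, SMT modes 8, 4 and 2 leave 4, 2 and 1 threads
online in each small core respectively, so the two small cores stay
balanced in every mode.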

With Patches: (ppc64_cpu --smt=on) : SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
  domain-0: span=0,2,4,6 level=SMT
   groups: 0:{ span=0 cap=294 }, 2:{ span=2 cap=294 },
           4:{ span=4 cap=294 }, 6:{ span=6 cap=294 }
CPU1 attaching sched-domain(s):
  domain-0: span=1,3,5,7 level=SMT
   groups: 1:{ span=1 cap=294 }, 3:{ span=3 cap=294 },
           5:{ span=5 cap=294 }, 7:{ span=7 cap=294 }

Optimal Task placement (SMT 8)

           --------------------------
           |     |     |     |      |
           |  0  |  2  |  4  |  6   |   Small Core0
           | (p1)| (p2)|     |      |
 Big Core  --------------------------
           |     |     |     |      |
           |  1  |  3  |  5  |  7   |   Small Core1
           |     | (p3)|     | (p4) |
           --------------------------

With Patches : (ppc64_cpu --smt=4) : SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
  domain-0: span=0,2 level=SMT
   groups: 0:{ span=0 cap=589 }, 2:{ span=2 cap=589 }
CPU1 attaching sched-domain(s):
  domain-0: span=1,3 level=SMT
   groups: 1:{ span=1 cap=589 }, 3:{ span=3 cap=589 }

Optimal Task placement (SMT 4)

           --------------------------
           |     |     |     |      |
           |  0  |  2  |  4  |  6   |   Small Core0
           | (p1)| (p2)| Off | Off  |
 Big Core  --------------------------
           |     |     |     |      |
           |  1  |  3  |  5  |  7   |   Small Core1
           | (p4)| (p3)| Off | Off  |
           --------------------------

With Patches : (ppc64_cpu --smt=2) : SMT domain ceases to exist.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Optimal Task placement (SMT 2)

           --------------------------
           | (p2)|     |     |      |
           |  0  |  2  |  4  |  6   |   Small Core0
           | (p1)| Off | Off | Off  |
 Big Core  --------------------------
           | (p3)|     |     |      |
           |  1  |  3  |  5  |  7   |   Small Core1
           | (p4)| Off | Off | Off  |
           --------------------------

Thus, as an added advantage in SMT=2 mode, we will only have 3 levels
in the sched-domain topology (CACHE, DIE and NUMA).

The SMT levels, without the patches are as follows.

Without Patches: (ppc64_cpu --smt=on) : SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
  domain-0: span=0-7 level=SMT
   groups: 0:{ span=0 cap=147 }, 1:{ span=1 cap=147 },
           2:{ span=2 cap=147 }, 3:{ span=3 cap=147 },
           4:{ span=4 cap=147 }, 5:{ span=5 cap=147 },
           6:{ span=6 cap=147 }, 7:{ span=7 cap=147 }
CPU1 attaching sched-domain(s):
  domain-0: span=0-7 level=SMT
   groups: 1:{ span=1 cap=147 }, 2:{ span=2 cap=147 },
           3:{ span=3 cap=147 }, 4:{ span=4 cap=147 },
           5:{ span=5 cap=147 }, 6:{ span=6 cap=147 },
           7:{ span=7 cap=147 }, 0:{ span=0 cap=147 }

Without Patches: (ppc64_cpu --smt=4) : SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
  domain-0: span=0-3 level=SMT
   groups: 0:{ span=0 cap=294 }, 1:{ span=1 cap=294 },
           2:{ span=2 cap=294 }, 3:{ span=3 cap=294 },
CPU1 attaching sched-domain(s):
  domain-0: span=0-3 level=SMT
   groups: 1:{ span=1 cap=294 }, 2:{ span=2 cap=294 },
           3:{ span=3 cap=294 }, 0:{ span=0 cap=294 }

Without Patches: (ppc64_cpu --smt=2) : SMT domain
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CPU0 attaching sched-domain(s):
  domain-0: span=0-1 level=SMT
   groups: 0:{ span=0 cap=589 }, 1:{ span=1 cap=589 },
CPU1 attaching sched-domain(s):
  domain-0: span=0-1 level=SMT
   groups: 1:{ span=1 cap=589 }, 0:{ span=0 cap=589 },

This patchset contains two patches which, on detecting the presence of
big-cores, define the SMT level sched-domain to correspond to the
threads of the small cores.

Patch 1: Adds support to detect the presence of big-cores and reports
         the small-core siblings of each CPU X via the sysfs file
         "/sys/devices/system/cpu/cpuX/small_core_siblings".

Patch 2: Defines the SMT level sched-domain to correspond to the
         threads of the small cores.

Results:
~~~~~~~~~~~~~~~~~
1) 2 thread ebizzy
~~~~~~~~~~~~~~~~~~~~~~
Experimental results for ebizzy with 2 threads bound to a single
big-core show a marked improvement with this patchset over the
4.18-rc7 vanilla kernel.
The results of 100 such runs for the 4.18-rc7 kernel and for
4.18-rc7 + big-core-smt-patches are as follows:

4.18.0-rc7 vanilla
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s            :  # samples  : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[      0 - 1000000]  :      0      : #
[1000000 - 2000000]  :      3      : #
[2000000 - 3000000]  :      7      : ##
[3000000 - 4000000]  :     26      : ######
[4000000 - 5000000]  :      4      : #
[5000000 - 6000000]  :     60      : #############

4.18.0-rc7 + big-core-smt-patches
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
records/s            :  # samples  : Histogram
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[      0 - 1000000]  :      0      : #
[1000000 - 2000000]  :      0      : #
[2000000 - 3000000]  :     11      : ###
[3000000 - 4000000]  :      0      : #
[4000000 - 5000000]  :      0      : #
[5000000 - 6000000]  :     89      : ##################

2) Hackbench (perf bench sched pipe)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
100 iterations of hackbench were run on both the 4.18-rc7 vanilla
kernel and 4.18-rc7 + big-core-smt-patches. All values are times in
seconds (lower is better).

4.18.0-rc7 vanilla
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    N           Min           Max        Median           Avg        Stddev
x 100         4.225         9.754         6.174       6.00402    0.88311027

4.18.0-rc7 + big-core-smt-patches (v6 : the present version)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    N           Min           Max        Median           Avg        Stddev
x 100         4.069         6.745          6.08       5.72414    0.73853727

The presence of the CACHE level sched-domain in v6, which was absent
in v5 of the patches, seems to be making a difference, as both the
median and the average times taken by hackbench drop.

4.18.0-rc7 + big-core-smt-patches (v5 : the previous version)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    N           Min           Max        Median           Avg        Stddev
x 100         4.972        10.123         6.177         6.323    0.68728617

Gautham R. Shenoy (2):
  powerpc: Detect the presence of big-cores via "ibm,thread-groups"
  powerpc: Use cpu_smallcore_sibling_mask at SMT level on bigcores

 Documentation/ABI/testing/sysfs-devices-system-cpu |   8 ++
 arch/powerpc/include/asm/cputhreads.h              |  22 +++
 arch/powerpc/include/asm/smp.h                     |   6 +
 arch/powerpc/kernel/setup-common.c                 | 154 +++++++++++++++++++++
 arch/powerpc/kernel/smp.c                          |  62 ++++++++-
 arch/powerpc/kernel/sysfs.c                        |  35 +++++
 6 files changed, 282 insertions(+), 5 deletions(-)

-- 
1.9.4