From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Andrew Morton
Cc: Srikar Dronamraju, linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Michal Hocko, Mel Gorman,
    Vlastimil Babka, "Kirill A. Shutemov", Christopher Lameter,
    Michael Ellerman, Linus Torvalds, Gautham R Shenoy,
    Satheesh Rajendran, David Hildenbrand
Subject: [PATCH v5 0/3] Offline memoryless cpuless node 0
Date: Wed, 24 Jun 2020 14:58:43 +0530
Message-Id: <20200624092846.9194-1-srikar@linux.vnet.ibm.com>

Changelog v4->v5:
 - Rebased to v5.8-rc2
Link v4:
http://lore.kernel.org/lkml/20200512132937.19295-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v3->v4:
 - Resolved comments from Christopher.
Link v3: http://lore.kernel.org/lkml/20200501031128.19584-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v2->v3:
 - Resolved comments from Gautham.
Link v2: https://lore.kernel.org/linuxppc-dev/20200428093836.27190-1-srikar@linux.vnet.ibm.com/t/#u

Changelog v1->v2:
 - Rebased to v5.7-rc3
 - Updated the changelog.
Link v1: https://lore.kernel.org/linuxppc-dev/20200311110237.5731-1-srikar@linux.vnet.ibm.com/t/#u

A Linux kernel configured with CONFIG_NUMA on a system with multiple
possible nodes marks node 0 as online at boot. In practice, however,
there are systems where node 0 is memoryless and cpuless. This can
cause:

1. numa_balancing to be enabled on systems that really have only one
   node with cpus and memory.
2. A dummy (cpuless and memoryless) online node, which can confuse
   users/scripts looking at the output of lscpu/numactl.

This patchset corrects that anomaly. It should only affect systems that
have CONFIG_HAVE_MEMORYLESS_NODES; currently only two architectures,
ia64 and powerpc, have this config.

Note: patch 3 in this series depends on patches 1 and 2. Without
patches 1 and 2, patch 3 might crash powerpc.
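As a quick sanity check, both symptoms can be read straight from
procfs/sysfs. The following is only an illustrative sketch (not part of
this series); the paths are the standard ones quoted in the dumps later
in this cover letter:

```shell
#!/bin/sh
# Illustrative sketch, not part of this series: report the online NUMA
# nodes and whether automatic NUMA balancing is currently enabled.
N=/sys/devices/system/node

# Either file may be absent (e.g. !CONFIG_NUMA), so fall back gracefully.
online=$(cat "$N/online" 2>/dev/null || echo "unknown")
balancing=$(cat /proc/sys/kernel/numa_balancing 2>/dev/null || echo "unknown")

echo "online nodes:   $online"
echo "numa_balancing: $balancing"
```

On an affected system this shows two online nodes and numa_balancing=1
even though only one node has any cpus or memory.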
v5.8-rc2
--------
available: 2 nodes (0,2)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31490 MB
node distances:
node   0   2
  0:  10  20
  2:  20  10

proc and sys files
------------------
/sys/devices/system/node/online:            0,2
/proc/sys/kernel/numa_balancing:            1
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

v5.8-rc2 + patches
------------------
available: 1 nodes (2)
node 2 cpus: 0 1 2 3 4 5 6 7
node 2 size: 32625 MB
node 2 free: 31487 MB
node distances:
node   2
  2:  10

proc and sys files
------------------
/sys/devices/system/node/online:            2
/proc/sys/kernel/numa_balancing:            0
/sys/devices/system/node/has_cpu:           2
/sys/devices/system/node/has_memory:        2
/sys/devices/system/node/has_normal_memory: 2
/sys/devices/system/node/possible:          0-31

1. User-space applications like numactl and lscpu that parse sysfs tend
   to believe there is an extra online node. This confuses users and
   applications; some conclude that the system failed to use all of its
   resources (i.e. resources are missing) or was not set up correctly.

2. The existence of the dummy node also leads to inconsistent
   information: the number of online nodes disagrees with the
   information in the device tree and resource dump.

3. When the dummy node is present, single-node non-NUMA systems show up
   as NUMA systems and numa_balancing gets enabled, so we take the hit
   from unnecessary NUMA hinting faults.

On a machine with just one node whose node number is not 0, the current
setup ends up showing 2 online nodes; and whenever more than one node is
online, numa_balancing gets enabled.
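The dummy node in the dump above can also be spotted mechanically: an
online node whose sysfs node directory exposes no cpus and reports zero
memory. A rough sketch, not part of this series, assuming the standard
/sys/devices/system/node layout and the per-node meminfo format:

```shell
#!/bin/sh
# Illustrative sketch, not part of this series: flag any online node
# that has neither cpus nor memory (a "dummy" node).  Walks the
# per-node sysfs directories rather than the aggregate list files.
N=/sys/devices/system/node

checked=0
for d in "$N"/node[0-9]*; do
    [ -d "$d" ] || continue            # no NUMA sysfs on this system
    checked=$((checked + 1))
    n=${d##*node}
    # A node with cpus has cpu<N> symlinks in its directory.
    cpus=$(ls -d "$d"/cpu[0-9]* 2>/dev/null | wc -l)
    # Per-node meminfo lines look like "Node 0 MemTotal: 0 kB".
    memkb=$(awk '/MemTotal/ { print $4 }' "$d/meminfo" 2>/dev/null)
    if [ "$cpus" -eq 0 ] && [ "${memkb:-0}" -eq 0 ]; then
        echo "node $n is online but cpuless and memoryless (dummy)"
    fi
done
echo "checked $checked online node(s)"
```

On the v5.8-rc2 system above this would flag node 0; with the patches
applied there is nothing left to flag.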
Without patch
-------------
$ grep numa /proc/vmstat
numa_hit 95179
numa_miss 0
numa_foreign 0
numa_interleave 3764
numa_local 95179
numa_other 0
numa_pte_updates 1206973        <----------
numa_huge_pte_updates 4654      <----------
numa_hint_faults 19560          <----------
numa_hint_faults_local 19560    <----------
numa_pages_migrated 0

With patch
----------
$ grep numa /proc/vmstat
numa_hit 322338756
numa_miss 0
numa_foreign 0
numa_interleave 3790
numa_local 322338756
numa_other 0
numa_pte_updates 0              <----------
numa_huge_pte_updates 0         <----------
numa_hint_faults 0              <----------
numa_hint_faults_local 0        <----------
numa_pages_migrated 0
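The counters flagged with arrows above can be pulled out of /proc/vmstat
with a small filter; an illustrative sketch (not part of this series):

```shell
#!/bin/sh
# Illustrative sketch, not part of this series: print only the NUMA
# hinting related counters from /proc/vmstat (the arrowed lines above).
hints=$(grep -E '^numa_(pte_updates|huge_pte_updates|hint_faults)' \
        /proc/vmstat 2>/dev/null)
echo "${hints:-numa hinting counters not exported by this kernel}"
```

With the patches applied, all of these counters stay at 0 on the
affected machine, since numa_balancing is never enabled.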
Here are 2 sample numa programs:

numa01.sh is a set of 2 processes, each running as many threads as there
are cpus; each thread does 50 loops of operations on 3GB of
process-shared memory.

numa02.sh is a single process with as many threads as there are cpus;
each thread does 800 loops of operations on 32MB of thread-local memory.

Without patch
-------------
Testcase     Time:  Min      Max      Avg      StdDev
./numa01.sh  Real:  149.62   149.66   149.64   0.02
./numa01.sh  Sys:   3.21     3.71     3.46     0.25
./numa01.sh  User:  4755.13  4758.15  4756.64  1.51
./numa02.sh  Real:  24.98    25.02    25.00    0.02
./numa02.sh  Sys:   0.51     0.59     0.55     0.04
./numa02.sh  User:  790.28   790.88   790.58   0.30

With patch
----------
Testcase     Time:  Min      Max      Avg      StdDev   %Change
./numa01.sh  Real:  149.44   149.46   149.45   0.01     0.127133%
./numa01.sh  Sys:   0.71     0.89     0.80     0.09     332.5%
./numa01.sh  User:  4754.19  4754.48  4754.33  0.15     0.0485873%
./numa02.sh  Real:  24.97    24.98    24.98    0.00     0.0800641%
./numa02.sh  Sys:   0.26     0.41     0.33     0.08     66.6667%
./numa02.sh  User:  789.75   790.28   790.01   0.27     0.072151%

numa01.sh
param                   no_patch  with_patch  %Change
-----                   --------  ----------  -------
numa_hint_faults        1131164   0           -100%
numa_hint_faults_local  1131164   0           -100%
numa_hit                213696    214244      0.256439%
numa_local              213696    214244      0.256439%
numa_pte_updates        1131294   0           -100%
pgfault                 1380845   241424      -82.5162%
pgmajfault              75        60          -20%

numa02.sh
param                   no_patch  with_patch  %Change
-----                   --------  ----------  -------
numa_hint_faults        111878    0           -100%
numa_hint_faults_local  111878    0           -100%
numa_hit                41854     43220       3.26373%
numa_local              41854     43220       3.26373%
numa_pte_updates        113926    0           -100%
pgfault                 163662    51210       -68.7099%
pgmajfault              56        52          -7.14286%

Observations:
The real time and user time don't change much; however, the system time
changes to some extent. The reason is the number of NUMA hinting faults:
with the patches applied, we no longer see any NUMA hinting faults.

Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org
Cc: Michal Hocko
Cc: Mel Gorman
Cc: Vlastimil Babka
Cc: "Kirill A. Shutemov"
Cc: Christopher Lameter
Cc: Michael Ellerman
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Gautham R Shenoy
Cc: Satheesh Rajendran
Cc: David Hildenbrand

Srikar Dronamraju (3):
  powerpc/numa: Set numa_node for all possible cpus
  powerpc/numa: Prefer node id queried from vphn
  mm/page_alloc: Keep memoryless cpuless node 0 offline

 arch/powerpc/mm/numa.c | 35 +++++++++++++++++++++++++----------
 mm/page_alloc.c        |  4 +++-
 2 files changed, 28 insertions(+), 11 deletions(-)

--
2.18.1