From: Scott Cheloha
To: linux-kernel@vger.kernel.org, "Rafael J. Wysocki", Greg Kroah-Hartman,
    David Hildenbrand, Andrew Morton
Cc: nathanl@linux.ibm.com, ricklind@linux.vnet.ibm.com, mhocko@suse.com,
    Scott Cheloha
Subject: [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
Date: Tue, 21 Jan 2020 17:10:28 -0600
Message-Id: <20200121231028.13699-1-cheloha@linux.ibm.com>
X-Mailer: git-send-email 2.24.1
In-Reply-To: <20200109212516.17849-1-cheloha@linux.vnet.ibm.com>
References: <20200109212516.17849-1-cheloha@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

From: Scott Cheloha

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block
in an xarray. This time complexity improvement is significant on
systems with many memory blocks. For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks. With this
   change memory_dev_init() completes ~12ms faster and
   walk_memory_blocks() completes ~12ms faster.

Before:
[    0.005042] memory_dev_init: adding memory blocks
[    0.021591] memory_dev_init: added memory blocks
[    0.022699] walk_memory_blocks: walking memory blocks
[    0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[    0.005057] memory_dev_init: adding memory blocks
[    0.009415] memory_dev_init: added memory blocks
[    0.010519] walk_memory_blocks: walking memory blocks
[    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks. With this
   change memory_dev_init() completes ~88ms faster and
   walk_memory_blocks() completes ~87ms faster.
Before:
[    0.252246] memory_dev_init: adding memory blocks
[    0.395469] memory_dev_init: added memory blocks
[    0.409413] walk_memory_blocks: walking memory blocks
[    0.433028] walk_memory_blocks: walked memory blocks 0-511
[    0.433094] walk_memory_blocks: walking memory blocks
[    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[    0.245063] memory_dev_init: adding memory blocks
[    0.299539] memory_dev_init: added memory blocks
[    0.313609] walk_memory_blocks: walking memory blocks
[    0.315287] walk_memory_blocks: walked memory blocks 0-511
[    0.315349] walk_memory_blocks: walking memory blocks
[    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks. With
   this change we complete memory_dev_init() ~37 minutes faster and
   walk_memory_blocks() at least ~30 minutes faster. The exact timing
   for walk_memory_blocks() is missing, though I observed that the
   soft lockups in walk_memory_blocks() disappeared with the change,
   suggesting that lower bound.

Before:
[   13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[   13.696898] memory_dev_init: adding blocks
[   15.660035] memory_dev_init: added all blocks

(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks. A sparse xarray has a small footprint and an O(log n)
lookup is negligibly slower than an O(n) lookup for only the smallest
number of memory blocks.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks. With this
   change memory_dev_init() completes ~300us faster and
   walk_memory_blocks() completes no faster or slower. The improvement
   is pretty close to noise.

Before:
[    0.224752] memory_dev_init: adding memory blocks
[    0.227116] memory_dev_init: added memory blocks
[    0.227183] walk_memory_blocks: walking memory blocks
[    0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[    0.224911] memory_dev_init: adding memory blocks
[    0.226935] memory_dev_init: added memory blocks
[    0.227089] walk_memory_blocks: walking memory blocks
[    0.227089] walk_memory_blocks: walked memory blocks 0-131

Signed-off-by: Scott Cheloha
Acked-by: David Hildenbrand
Acked-by: Nathan Lynch
Acked-by: Michal Hocko
---
v2 incorporates suggestions from David Hildenbrand.

v3 changes:
 - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"
 - Be conservative: don't use radix_tree_for_each_slot() in
   walk_memory_blocks() yet. It introduces RCU which could change
   behavior. Walking the tree "by hand" with find_memory_block_by_id()
   is slower but keeps the patch simple (see the illustrative sketch
   after this note).
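For illustration only, not part of this patch: a minimal sketch of what
an iterator-based walk over the cache could look like, assuming the
existing walk_memory_blocks_func_t callback type; the function name and
simplified signature are hypothetical, and per-block reference counting
is omitted. xa_for_each() iterates under RCU internally, which is the
behavioral change the note above avoids, so the patch keeps the simpler
find_memory_block_by_id() loop.

	/*
	 * Illustrative sketch only (not in this patch): visit every
	 * cached memory block by iterating the xarray directly instead
	 * of looking up each block id in turn. xa_for_each() walks the
	 * structure under RCU, which is the behavior change the v3 note
	 * deliberately avoids. Reference counting on each block is
	 * omitted for brevity.
	 */
	static int walk_cached_memory_blocks(walk_memory_blocks_func_t func,
					     void *arg)
	{
		struct memory_block *mem;
		unsigned long block_id;
		int ret = 0;

		xa_for_each(&memory_blocks, block_id, mem) {
			ret = func(mem, arg);
			if (ret)
				break;
		}
		return ret;
	}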
v4 changes:
 - Rewrite commit message to explicitly note the time complexity
   improvements.
 - Provide anecdotal accounts of time-savings in the changelog.

v5 changes:
 - Switch from the radix_tree API to the xarray API to conform to
   current kernel preferences.
 - Move the time savings accounts into the commit message itself.
   Remeasure performance changes on the machines I had available.

It should be noted that I was not able to get time on the 32TB machine
to remeasure the improvements for v5. The quoted traces are from v4 of
the patch. However, the xarray API is used to implement the radix_tree
API, so I expect the performance changes will be identical. I did test
v5 of the patch on the other machines mentioned in the commit message
to ensure there were no regressions.

 drivers/base/memory.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..2178d3e6d063 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -21,6 +21,7 @@
 #include <linux/mm.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
+#include <linux/xarray.h>
 
 #include <linux/atomic.h>
 #include <linux/uaccess.h>
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static DEFINE_XARRAY(memory_blocks);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
 /* A reference for the returned memory block device is acquired. */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = xa_load(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
 	unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,16 @@ int register_memory(struct memory_block *memory)
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+			      GFP_KERNEL));
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -688,6 +697,8 @@ static void unregister_memory(struct memory_block *memory)
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
-- 
2.24.1