Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices
From: Dan Williams
To: "Elliott, Robert (Server Storage)"
Cc: Jeff Moyer, linux-nvdimm, "Rafael J. Wysocki",
 "linux-kernel@vger.kernel.org", Linux ACPI
Date: Wed, 10 Jun 2015 09:37:59 -0700
In-Reply-To: <94D0CD8314A33A4D9D801C0FE68B40295A97B1CB@G9W0745.americas.hpqcorp.net>
References: <1433891440-3515-1-git-send-email-toshi.kani@hp.com>
 <94D0CD8314A33A4D9D801C0FE68B40295A97B1CB@G9W0745.americas.hpqcorp.net>

On Wed, Jun 10, 2015 at 9:20 AM, Elliott, Robert (Server Storage) wrote:
>> -----Original Message-----
>> From: Linux-nvdimm [mailto:linux-nvdimm-bounces@lists.01.org] On Behalf Of
>> Dan Williams
>> Sent: Wednesday, June 10, 2015 9:58 AM
>> To: Jeff Moyer
>> Cc: linux-nvdimm; Rafael J. Wysocki; linux-kernel@vger.kernel.org; Linux
>> ACPI
>> Subject: Re: [PATCH v2 0/3] Add NUMA support for NVDIMM devices
>>
>> On Wed, Jun 10, 2015 at 8:54 AM, Jeff Moyer wrote:
>> > Toshi Kani writes:
>> >
>> >> Since NVDIMMs are installed on memory slots, they expose the NUMA
>> >> topology of a platform.  This patchset adds support of sysfs
>> >> 'numa_node' to I/O-related NVDIMM devices under /sys/bus/nd/devices.
>> >> This enables numactl(8) to accept 'block:' and 'file:' paths of
>> >> pmem and btt devices as shown in the examples below.
>> >>   numactl --preferred block:pmem0 --show
>> >>   numactl --preferred file:/dev/pmem0s --show
>> >>
>> >> numactl can be used to bind an application to the locality of
>> >> a target NVDIMM for better performance.  Here is a result of fio
>> >> benchmark to ext4/dax on an HP DL380 with 2 sockets for local and
>> >> remote settings.
>> >>
>> >>   Local [1] : 4098.3MB/s
>> >>   Remote [2]: 3718.4MB/s
>> >>
>> >> [1] numactl --preferred block:pmem0 --cpunodebind block:pmem0 fio <... on-pmem0>
>> >> [2] numactl --preferred block:pmem1 --cpunodebind block:pmem1 fio <... on-pmem0>
>> >
>> > Did you post the patches to numactl somewhere?
>> >
>>
>> numactl already supports this today.
>
> numactl does have a bug handling partitions under these devices,
> because it assumes all storage devices have "/devices/pci"
> in their path as it tries to find the parent device for the
> partition.  I think we'll propose a numactl patch for that;
> I don't think the drivers can fool it.
>
> Details (from an earlier version of the patch series
> in which btt devices were named /dev/nd1, etc.):
>
> strace shows that numactl is trying to find numa_node in very
> different locations for /dev/nd1p1 vs. /dev/sda1.
>
> strace for /dev/nd1p1
> =====================
> open("/sys/class/block/nd1p1/dev", O_RDONLY) = 4
> read(4, "259:1\n", 4095) = 6
> close(4) = 0
> close(3) = 0
> readlink("/sys/class/block/nd1p1", "../../devices/LNXSYSTM:00/LNXSYB"..., 1024) = 77
> open("/sys/class/block/nd1p1/device/numa_node", O_RDONLY) = -1 ENOENT (No such file or directory)
>
> strace for /dev/sda1
> ====================
> open("/sys/class/block/sda1/dev", O_RDONLY) = 4
> read(4, "8:1\n", 4095) = 4
> close(4) = 0
> close(3) = 0
> readlink("/sys/class/block/sda1", "../../devices/pci0000:00/0000:00"..., 1024) = 91
> open("/sys//devices/pci0000:00/0000:00:01.0//numa_node", O_RDONLY) = 3
> read(3, "0\n", 4095) = 2
> close(3) = 0
>
> The "sys/class/block/xxx" paths link to:
> lrwxrwxrwx. 1 root root 0 May 20 20:42 /sys/class/block/nd1p1 -> ../../devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/btt1/block/nd1/nd1p1
> lrwxrwxrwx. 1 root root 0 May 20 20:41 /sys/class/block/sda1 -> ../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1
>
> For /dev/sda1, numactl recognizes "/devices/pci" as
> a special path, and strips off everything after the
> numbers.  Faced with:
>   ../../devices/pci0000:00/0000:00:01.0/0000:03:00.0/host6/target6:0:0/6:0:0:0/block/sda/sda1
>
> it ends up with this (leaving a sloppy "//" in the path):
>   /sys/devices/pci0000:00/0000:00:01.0//numa_node
>
> It would also succeed if it ended up with this:
>   /sys/devices/pci0000:00/0000:00:01.0/0000:03:00.0/numa_node
>
> For /dev/nd1p1 it does not see that string, so it just
> tries to open "/sys/class/block/nd1p1/device/numa_node".
>
> There are no "device/" subdirectories in the tree for
> partition devices (for either sda1 or nd1p1), so this
> fails.
>
> From http://oss.sgi.com/projects/libnuma/
> numactl affinity.c:
>
>     /* Somewhat hackish: extract device from symlink path.
>        Better would be a direct backlink. This knows slightly too
>        much about the actual sysfs layout. */
>     char path[1024];
>     char *fn = NULL;
>     if (asprintf(&fn, "/sys/class/%s/%s", cls, dev) > 0 &&
>         readlink(fn, path, sizeof path) > 0) {
>             regex_t re;
>             regmatch_t match[2];
>             char *p;
>
>             regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/",
>                     REG_EXTENDED);
>             ret = regexec(&re, path, 2, match, 0);
>             regfree(&re);
>             if (ret == 0) {
>                     free(fn);
>                     assert(match[0].rm_so > 0);
>                     assert(match[0].rm_eo > 0);
>                     path[match[1].rm_eo + 1] = 0;
>                     p = path + match[0].rm_so;
>                     ret = sysfs_node_read(mask, "/sys/%s/numa_node", p);
>                     if (ret < 0)
>                             return node_parse_failure(ret, NULL, p);
>                     return ret;
>             }
>     }
>     free(fn);
>
>     ret = sysfs_node_read(mask, "/sys/class/%s/%s/device/numa_node",
>                           cls, dev);

I think it is broken to try to go from /sys/class down; it should go from
the device node up.  I.e. start from the resolved path of
/sys/dev/block/<major>:<minor> and walk up the directory tree to the
parent of "block".

$ readlink -f /sys/dev/block/8\:1/
/sys/devices/pci0000:00/0000:00:1f.2/ata1/host0/target0:0:0/0:0:0:0/block/sda/sda1
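
As a rough sketch of that bottom-up walk (an illustration only, not the
actual numactl fix: the helper name blkdev_numa_node, the default
/dev/sda1 argument, and stopping the walk at /sys/devices are assumptions
made here), one could stat the device node, resolve
/sys/dev/block/<maj>:<min>, and probe each parent directory for a
numa_node attribute:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: return the NUMA node of a block device node, or -1 if unknown. */
static int blkdev_numa_node(const char *devname)
{
	struct stat st;
	char link[PATH_MAX], path[PATH_MAX];

	if (stat(devname, &st) < 0 || !S_ISBLK(st.st_mode))
		return -1;

	/* /sys/dev/block/<maj>:<min> resolves to the device directory,
	   e.g. .../block/sda/sda1 for a partition. */
	snprintf(link, sizeof(link), "/sys/dev/block/%u:%u",
		 major(st.st_rdev), minor(st.st_rdev));
	if (!realpath(link, path))
		return -1;

	/* Walk up toward /sys/devices, probing for a numa_node file. */
	while (strcmp(path, "/sys/devices") != 0) {
		char attr[PATH_MAX + 16];
		char *slash;
		FILE *f;

		snprintf(attr, sizeof(attr), "%s/numa_node", path);
		f = fopen(attr, "r");
		if (f) {
			int node = -1;

			if (fscanf(f, "%d", &node) != 1)
				node = -1;
			fclose(f);
			return node;
		}

		slash = strrchr(path, '/');
		if (!slash)
			break;
		*slash = '\0';		/* drop the last path component */
	}
	return -1;
}

int main(int argc, char **argv)
{
	printf("numa_node: %d\n",
	       blkdev_numa_node(argc > 1 ? argv[1] : "/dev/sda1"));
	return 0;
}

Run against /dev/sda1 this reaches the PCI parent's numa_node with no
"/devices/pci" pattern matching, and the same walk would cover a pmem/btt
partition once the nd devices export numa_node as in this series.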