From: Dan Williams
Date: Fri, 23 Nov 2018 09:15:45 -0800
Subject: Re: [PATCH 0/7] ACPI HMAT memory sysfs representation
To: anshuman.khandual@arm.com
Cc: Dave Hansen, Keith Busch, Linux Kernel Mailing List, Linux ACPI, Linux MM, Greg KH, "Rafael J. Wysocki"

On Thu, Nov 22, 2018 at 11:11 PM Anshuman Khandual wrote:
>
> On 11/22/2018 11:38 PM, Dan Williams wrote:
> > On Thu, Nov 22, 2018 at 3:52 AM Anshuman Khandual wrote:
> >>
> >> On 11/19/2018 11:07 PM, Dave Hansen wrote:
> >>> On 11/18/18 9:44 PM, Anshuman Khandual wrote:
> >>>> IIUC NUMA re-work in principle involves these functional changes
> >>>>
> >>>> 1. Enumerating compute and memory nodes in heterogeneous environment (short/medium term)
> >>>
> >>> This patch set _does_ that, though.
> >>>
> >>>> 2. Enumerating memory node attributes as seen from the compute nodes (short/medium term)
> >>>
> >>> It does that as well (a subset at least).
> >>>
> >>> It sounds like the subset that's being exposed is insufficient for you.
> >>> We did that because we think doing anything but a subset in sysfs will
> >>> just blow up sysfs: MAX_NUMNODES is as high as 1024, so if we have 4
> >>> attributes, that's at _least_ 1024*1024*4 files if we expose *all*
> >>> combinations.
> >> Each permutation need not be a separate file inside all possible NODE X
> >> (/sys/devices/system/node/nodeX) directories. It can be a top level file
> >> enumerating various attribute values for a given (X, Y) node pair based
> >> on an offset something like /proc/pid/pagemap.
> >>
> >>>
> >>> Do we agree that sysfs is unsuitable for exposing attributes in this manner?
> >>>
> >>
> >> Yes, for individual files. But this can be worked around with an offset
> >> based access from a top level global attributes file as mentioned above.
> >> Is there any particular advantage of using individual files for each
> >> given attribute ? I was wondering that a single unsigned long (u64) will
> >> be able to pack 8 different attributes where each individual attribute
> >> values can be abstracted out in 8 bits.
> >
> > sysfs has a 4K limit, and in general I don't think there is much
> > incremental value to go describe the entirety of the system from sysfs
> > or anywhere else in the kernel for that matter. It's simply too much
> > information to reasonably consume. Instead the kernel can describe the
>
> I agree that it may be some amount of information to parse but is crucial
> for any task on a heterogeneous system to evaluate (probably re-evaluate
> if the task moves around) its memory and CPU binding at runtime to make
> sure it has got the right one.

Can you provide some more evidence for this statement? It seems that
not many applications even care about basic NUMA, let alone specific
memory targeting, at least according to libnumactl users.

    dnf repoquery --whatrequires numactl-libs

The kernel is the arbiter of memory; something is broken if
applications *need* to take on this responsibility. Yes, there will be
applications that want to tune and override the default kernel
behavior, but this is the exception, not the rule. The applications
that tend to care about specific memories also tend to be purpose
built for a given platform, and that lessens their reliance on the
kernel to enumerate all properties.

> > coarse boundaries and some semblance of "best" access initiator for a
> > given target. That should cover the "80%" case of what applications
>
> The current proposal just assumes that the best one is the nearest one.
> This may be true for bandwidth and latency but may not be true for some
> other properties. This assumptions should not be there while defining
> new ABI.

In fact, I tend to agree with you, but in my opinion that's an
argument to expose even less, not more. If we start with something
minimal that can be extended over time we lessen the risk of
overexposing details that don't matter in practice.

We're in the middle of a bit of a disaster with the VmFlags export in
/proc/$pid/smaps precisely because the implementation was too
comprehensive and applications started depending on details that the
kernel does not want to guarantee going forward. So there is a real
risk of being too descriptive in an interface design.

> > want to discover, for the other "20%" we likely need some userspace
> > library that can go parse these platform specific information sources
> > and supplement the kernel view. I also think a simpler kernel starting
> > point gives us room to go pull in more commonly used attributes if it
> > turns out they are useful, and avoid going down the path of exporting
> > attributes that have questionable value in practice.
> >
>
> Applications can just query platform information right now and just use
> them for mbind() without requiring this new interface.

No, they can't today, at least not for the topology details that HMAT
is describing. The platform-firmware to numa-node translation is
currently not complete. At a minimum we need a listing of initiator
ids and target ids. For an ACPI platform that is the proximity-domain
to numa-node-id translation information. Once that translation is in
place then a userspace library can consult the platform-specific
information sources to translate the platform-firmware view to the
Linux handles for those memories. Am I missing the library that does
this today?

> We are not even
> changing any core MM yet. So if it's just about identifying the node's
> memory properties it can be scanned from platform itself. But I agree
> we would like the kernel to start adding interfaces for multi attribute
> memory but all I am saying is that it has to be comprehensive. Some of
> the attributes have more usefulness now and some have less but the new
> ABI interface has to accommodate exporting all of these.

I get the sense we are talking past each other. Can you give the next
level of detail on that "has to be comprehensive" statement?
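
To make the sizing argument in the quoted exchange concrete, here is a
minimal Python sketch of the two numbers being debated: the per-file
count for a full node-pair matrix, and the size of a single packed
table addressed by offset. It is only an illustration under stated
assumptions. The function names, the row-major (initiator, target)
layout, and the little-endian byte order inside the u64 are all
hypothetical; nothing here is an existing kernel interface or part of
the patch set.

```python
MAX_NUMNODES = 1024   # upper bound quoted in the thread
ENTRY_BYTES = 8       # one u64 per (initiator, target) pair
SYSFS_LIMIT = 4096    # the "4K limit" on a sysfs attribute mentioned above


def per_file_count(nodes=MAX_NUMNODES, attrs=4):
    """Files needed if every (X, Y, attribute) combination is its own file."""
    return nodes * nodes * attrs


def pack_attrs(values):
    """Pack eight 8-bit attribute values into one 64-bit word (hypothetical layout)."""
    if len(values) != 8:
        raise ValueError("expected exactly 8 attribute values")
    word = 0
    for i, value in enumerate(values):
        if not 0 <= value <= 0xFF:
            raise ValueError("each attribute must fit in 8 bits")
        word |= value << (8 * i)   # attribute i occupies bits 8*i .. 8*i+7
    return word


def unpack_attrs(word):
    """Inverse of pack_attrs: split a 64-bit word into eight 8-bit values."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def entry_offset(initiator, target, nodes=MAX_NUMNODES):
    """Byte offset of the (initiator, target) entry in a row-major table."""
    return (initiator * nodes + target) * ENTRY_BYTES


def table_bytes(nodes=MAX_NUMNODES):
    """Total size of the packed table covering every node pair."""
    return nodes * nodes * ENTRY_BYTES


if __name__ == "__main__":
    print("separate files:", per_file_count())
    print("packed table bytes:", table_bytes())
    print("multiples of the 4K sysfs limit:", table_bytes() // SYSFS_LIMIT)
```

Under these assumptions the packed form removes the file explosion,
but the full table is still far larger than a single 4K sysfs
attribute, so it would need a binary, offset-addressed file in the
style of /proc/pid/pagemap rather than an ordinary sysfs attribute.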