From: "Huang, Ying"
To: Gregory Price
Cc: Gregory Price, Johannes Weiner, "Hasan Al Maruf", Hao Wang, Dan Williams, Michal Hocko, Zhongkun He, Frank van der Linden, "John Groves", Jonathan Cameron
Subject: Re: [PATCH v2 00/11] mempolicy2, mbind2, and weighted interleave
Date: Tue, 12 Dec 2023 15:08:24 +0800
Message-ID: <87plzbx5hz.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: (Gregory Price's message of "Mon, 11 Dec 2023 11:42:11 -0500")
References: <20231209065931.3458-1-gregory.price@memverge.com> <87r0jtxp23.fsf@yhuang6-desk2.ccr.corp.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Gregory Price writes:

> On Mon, Dec 11, 2023 at 01:53:40PM +0800, Huang, Ying wrote:
>> Hi, Gregory,
>>
>> Thanks for the updated version!
>>
>> Gregory Price writes:
>>
>> > v2:
>> > changes / adds:
>> > - flattened weight matrix to an array at the request of Ying Huang
>> > - Updated ABI docs per Davidlohr Bueso's request
>> > - change uapi structure to use aligned/fixed-length members as
>> >   Suggested-by: Arnd Bergmann
>> > - Implemented weight fetch logic in get_mempolicy2
>> > - mbind2 was changed to take (iovec,len) as function arguments
>> >   rather than add them to the uapi structure, since they describe
>> >   where to apply the mempolicy, as opposed to being part of it.
>> >
>> > The sysfs structure is designed as follows.
>> >
>> > $ tree /sys/kernel/mm/mempolicy/
>> > /sys/kernel/mm/mempolicy/
>> > ├── possible_nodes
>> > └── weighted_interleave
>> >     ├── nodeN
>> >     │   └── weight
>> >     └── nodeN+X
>> >         └── weight
>> >
>> > 'mempolicy' is added to '/sys/kernel/mm/' as a control group for
>> > the mempolicy subsystem.
>>
>> Would it be better to put 'mempolicy' under '/sys/kernel/mm/numa'? The
>> advantage is that 'mempolicy' here is in fact "NUMA mempolicy". The
>> disadvantage is one more level of directory nesting. I have no strong
>> opinion here.
>>
>
> I don't have a strong opinion here either.
>
>> > 'possible_nodes' is added to 'mm/mempolicy' to help describe the
>> > expected structures under mempolicy directories. For example,
>> > possible_nodes describes which nodeN directories will exist under
>> > the weighted_interleave directory.
>>
>> We already have '/sys/devices/system/node/possible'. Is this just a
>> duplicate? If so, why? And the possible nodes can also be obtained from
>> the contents of 'weighted_interleave'.
>>
>
> I'll remove it.
>
>> And it appears unnecessary to make 'weighted_interleave/nodeN' a
>> directory. Why not just make it a file?
>>
>
> Originally I wasn't sure whether there would be more attributes, but
> this is probably fine. I'll change it.
>
>> And, can we add a way to reset a weight to its default value? For
>> example, `echo > nodeN/weight` or `echo > nodeN`.
>>
>
> Seems reasonable.
>
>> > ==================================================================
>> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>> >
>> > These interfaces are the 'extended' counterparts to their relatives.
>> > They use the userland 'struct mpol_args' structure to communicate a
>> > complete mempolicy configuration to the kernel. This structure
>> > looks very much like the kernel-internal 'struct mempolicy_args':
>> >
>> > struct mpol_args {
>> >         /* Basic mempolicy settings */
>> >         __u16 mode;
>> >         __u16 mode_flags;
>> >         __s32 home_node;
>> >         __aligned_u64 pol_nodes;
>> >         __u64 pol_maxnodes;
>> >         __u64 addr;
>> >         __s32 policy_node;
>> >         __s32 addr_node;
>> >         __aligned_u64 *il_weights;      /* of size pol_maxnodes */
>> > };
>>
>> This looks unnecessarily complex. I don't think it's a good idea to
>> use exactly the same parameter for all 3 syscalls.
>>
>
> It is exactly as complex as mempolicy is. Everything here is already
> described in the existing interfaces (except il_weights).
>
>> For example, can we use something like the following?
>>
>> long set_mempolicy2(int mode, const unsigned long *nodemask, unsigned int *il_weights,
>>                     unsigned long maxnode, unsigned long home_node,
>>                     unsigned long flags);
>>
>> long mbind2(unsigned long start, unsigned long len,
>>             int mode, const unsigned long *nodemask, unsigned int *il_weights,
>>             unsigned long maxnode, unsigned long home_node,
>>             unsigned long flags);
>>
>
> Your definition of mbind2 is impossible.
>
> Neither of these interfaces solves the extensibility issue. If a new
> policy that requires a new data format arrives, we can look forward
> to set_mempolicy3 and mbind3.

IIUC, we should not over-engineer too much. It's hard to predict
future requirements.

>> A struct may be defined to hold the mempolicy itself.
>>
>> struct mpol {
>>         int mode;
>>         unsigned int home_node;
>>         const unsigned long *nodemask;
>>         unsigned int *il_weights;
>>         unsigned int maxnode;
>> };
>>
>
> addr could be pulled out for get_mempolicy2, so I will do that.
>
> 'addr_node' and 'policy_node' are warts that came from the original
> get_mempolicy. Removing them increases the complexity of handling
> arguments in the common get_mempolicy code.
>
> I could probably just drop support for retrieving the addr_node from
> get_mempolicy2, since it's already possible with get_mempolicy. So I
> will do that.

If it's necessary, we can add another struct for get_mempolicy2(). But
I don't think it's necessary to add get_mempolicy2()-specific
parameters to set_mempolicy2() or mbind2().

--
Best Regards,
Huang, Ying