From: "Huang, Ying"
To: Gregory Price
Cc: Gregory Price, Johannes Weiner, Hasan Al Maruf, Hao Wang,
	Dan Williams, Michal Hocko, Zhongkun He, Frank van der Linden,
	John Groves, Jonathan Cameron, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 00/11] mempolicy2, mbind2, and weighted interleave
In-Reply-To: (Gregory Price's message of "Tue, 26 Dec 2023 02:45:18 -0500")
References: <20231223181101.1954-1-gregory.price@memverge.com>
	<87frzqg1jp.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Tue, 02 Jan 2024 12:27:42 +0800
Message-ID: <87le98e4w1.fsf@yhuang6-desk2.ccr.corp.intel.com>

Gregory Price writes:

> On Mon, Dec 25, 2023 at 03:54:18PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > For example, the stream benchmark demonstrates that default interleave
>> > is actively harmful, where weighted interleave is beneficial.
>> >
>> > Hardware: 1-socket 8 channel DDR5 + 1 CXL expander in PCIe x16
>> >   Default interleave : -78% (slower than DRAM)
>> >   Global weighting   : -6% to +4% (workload dependent)
>> >   Targeted weights   : +2.5% to +4% (consistently better than DRAM)
>> >
>> > If nothing else, this shows how awful round-robin interleave is.
>>
>> I guess the performance of the default policy, local (fast memory)
>> first, may be even better in some situations?  For example, before
>> the bandwidth of DRAM is saturated?
>>
>
> Yes - but it's more complicated than that.
>
> Global weighting here means we did `numactl -w --interleave ...`, which
> means *all* memory regions will be interleaved: code, stack, heap, etc.
>
> Targeted weights means we used mbind2() with local weights, which only
> targeted specific heap regions.
>
> The default policy was likely better than global weighting because
> distributing things like stack and code to higher-latency memory
> produced a measurable overhead.
>
> To avoid this, we applied weights only to bandwidth-driving regions,
> and as a result we demonstrated a measurable performance increase.
>
> So yes, the default policy may be better in some situations - but that
> will be true of any policy.

Yes.  Some memory areas may be more sensitive to memory latency than
others.

Per my understanding, memory latency will increase with the actual
memory throughput, and it increases quickly when the throughput nears
the maximum memory bandwidth, as in the figures at the following URL.

https://mahmoudhatem.wordpress.com/2017/11/07/memory-bandwidth-vs-latency-response-curve/

If the memory latency of DRAM will not increase much, it's better to
always place the hot data in DRAM.  But if the memory throughput nears
the maximum memory bandwidth, so that the memory latency of DRAM
increases greatly - perhaps even beyond that of CXL memory - it's
better to put some hot data in CXL memory to reduce the overall memory
latency.

If that's right, I suggest adding something like the above to the
patch description.

>> I understand that you may want to limit the memory usage of the fast
>> memory too.  But IMHO, that is another requirement.  That should be
>> enforced by something like a per-node memory limit.
>>
>
> This interface does not limit memory usage of a particular node, it
> distributes data according to the requested policy.
>
> Nuanced distinction, but important.  If nodes become exhausted, tasks
> are still free to allocate memory from any node in the nodemask, even
> if it violates the requested mempolicy.
>
> This is consistent with the existing behavior of mempolicy.

Good.
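As an aside, a back-of-the-envelope model makes the numbers above
easier to see.  The sketch below is purely illustrative - the
bandwidth figures are assumptions, not measurements, and it ignores
the latency effects discussed above - but it shows why a 1:1
round-robin split is gated by the slower device while a
bandwidth-proportional split is not:

/*
 * Model: stream a unit of data with a fraction f interleaved to DRAM
 * and (1 - f) to CXL, the two streams served in parallel.  All
 * bandwidth numbers are ASSUMED for illustration only.
 */
#include <stdio.h>

int main(void)
{
	double dram = 300.0;	/* assumed GB/s: 8-channel DDR5 */
	double cxl  = 60.0;	/* assumed GB/s: one CXL x16 expander */
	double f[]  = { 1.0, 0.5, dram / (dram + cxl) };
	const char *name[] = { "DRAM only", "1:1 round-robin",
			       "bandwidth-matched" };

	for (int i = 0; i < 3; i++) {
		double t_dram = f[i] / dram;		/* DRAM share drain time */
		double t_cxl  = (1.0 - f[i]) / cxl;	/* CXL share drain time */
		/* completion is gated by the slower of the two streams */
		double t = t_dram > t_cxl ? t_dram : t_cxl;
		printf("%-18s %6.1f GB/s effective\n", name[i], 1.0 / t);
	}
	return 0;
}

With these assumed numbers, 1:1 interleave yields 120 GB/s effective
(-60% vs. DRAM-only, the same shape as the -78% measured above), while
a 5:1 bandwidth-matched split yields 360 GB/s.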
>> > =====================================================================
>> > (Patches 3-6) Refactoring mempolicy for code-reuse
>> >
>> > To avoid multiple paths of mempolicy creation, we should refactor the
>> > existing code to enable the designed extensibility, and refactor
>> > existing users to utilize the new interface (while retaining the
>> > existing userland interface).
>> >
>> > This set of patches introduces a new mempolicy_args structure, which
>> > is used to more fully describe a requested mempolicy - to include
>> > existing and future extensions.
>> >
>> > /*
>> >  * Describes settings of a mempolicy during set/get syscalls and
>> >  * kernel internal calls to do_set_mempolicy()
>> >  */
>> > struct mempolicy_args {
>> >         unsigned short mode;            /* policy mode */
>> >         unsigned short mode_flags;      /* policy mode flags */
>> >         int home_node;                  /* mbind: use MPOL_MF_HOME_NODE */
>> >         nodemask_t *policy_nodes;       /* get/set/mbind */
>> >         unsigned char *il_weights;      /* for mode MPOL_WEIGHTED_INTERLEAVE */
>> > };
>>
>> According to
>>
>> https://www.geeksforgeeks.org/difference-between-argument-and-parameter-in-c-c-with-examples/
>>
>> it appears that "parameter" is better than "argument" for the struct
>> name here.  The current kernel source appears to support this too.
>>
>> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_param' -r include/linux | wc -l
>> 411
>> $ grep 'struct[\t ]\+[a-zA-Z0-9]\+_arg' -r include/linux | wc -l
>> 25
>>
>
> Will change.
>
>> > This arg structure will eventually be utilized by the following
>> > interfaces:
>> >     mpol_new()          - new mempolicy creation
>> >     do_get_mempolicy()  - acquiring information about mempolicy
>> >     do_set_mempolicy()  - setting the task mempolicy
>> >     do_mbind()          - setting a vma mempolicy
>> >
>> > do_get_mempolicy() is completely refactored to break it out into
>> > separate functionality based on the flags provided by get_mempolicy(2)
>> >     MPOL_F_MEMS_ALLOWED: acquires task->mems_allowed
>> >     MPOL_F_ADDR: acquires information on vma policies
>> >     MPOL_F_NODE: changes the output for the policy arg to node info
>> >
>> > We refactor the get_mempolicy syscall to flatten the logic based on
>> > these flags, and allow set_mempolicy2() to re-use the underlying
>> > logic.
>> >
>> > The result of this refactor, and the new mempolicy_args structure, is
>> > that extensions like 'sys_set_mempolicy_home_node' can now be directly
>> > integrated into the initial call to 'set_mempolicy2', and that more
>> > complete information about a mempolicy can be returned with a single
>> > call to 'get_mempolicy2', rather than multiple calls to 'get_mempolicy'
>> >
>> >
>> > =====================================================================
>> > (Patches 7-10) set_mempolicy2, get_mempolicy2, mbind2
>> >
>> > These interfaces are the 'extended' counterpart to their relatives.
>> > They use the userland 'struct mpol_args' structure to communicate a
>> > complete mempolicy configuration to the kernel.  This structure
>> > looks very much like the kernel-internal 'struct mempolicy_args':
>> >
>> > struct mpol_args {
>> >         /* Basic mempolicy settings */
>> >         __u16 mode;
>> >         __u16 mode_flags;
>> >         __s32 home_node;
>> >         __u64 pol_maxnodes;
>>
>> I understand that we want to avoid holes in the struct.  But I still
>> feel uncomfortable using __u64 for such a small value.  I don't have
>> a better solution either, though.  Anyone else have an idea?
>>
>
> maxnode has been an `unsigned long` in every other interface for quite
> some time.  Seems better to keep this consistent rather than have it
> suddenly become `unsigned long` over here and `unsigned short` over
> there.

I don't think that it matters.  The actual maximum node number will be
less than the maximum `unsigned short`.
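To make the layout point concrete, here is a quick userspace sketch -
hypothetical, mirroring the fields quoted above rather than the actual
uapi header - checking the property that motivates the __u64:

/*
 * Sketch only: with a u64 pol_maxnodes, the 2+2+4 bytes of the
 * leading fields pack to an 8-byte boundary, so the structure has no
 * padding holes and the same size on 32- and 64-bit ABIs.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mpol_args_sketch {
	uint16_t mode;
	uint16_t mode_flags;
	int32_t  home_node;
	uint64_t pol_maxnodes;
	uint64_t pol_nodes;	/* user pointer, passed as a u64 */
	uint64_t il_weights;	/* user pointer, passed as a u64 */
};

static_assert(offsetof(struct mpol_args_sketch, pol_maxnodes) == 8,
	      "no hole before pol_maxnodes");
static_assert(sizeof(struct mpol_args_sketch) == 32,
	      "identical size on 32- and 64-bit");

int main(void) { return 0; }

Shrinking pol_maxnodes to a __u16 would leave a 6-byte hole before the
first 8-byte-aligned member unless the fields were reordered.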
>> > __aligned_u64 pol_nodes;
>> > __aligned_u64 *il_weights;      /* of size pol_maxnodes */
>>
>> Typo?  Should be,
>>
>
> derp derp
>
>> >
>> > The 'flags' argument for mbind2 is the same as for 'mbind', except
>> > with the addition of MPOL_MF_HOME_NODE to denote whether the
>> > 'home_node' field should be utilized.
>> >
>> > The 'flags' argument for get_mempolicy2 accepts MPOL_F_ADDR to
>> > allow operating on VMA policies, but MPOL_F_NODE and
>> > MPOL_F_MEMS_ALLOWED behavior has been omitted, since get_mempolicy()
>> > provides this already.
>>
>> I still think that it's a good idea to make it possible to deprecate
>> get_mempolicy().  How about using a union as follows?
>>
>> struct mpol_mems_allowed {
>>         __u64 maxnodes;
>>         __aligned_u64 nodemask;
>> };
>>
>> union mpol_info {
>>         struct mpol_args args;
>>         struct mpol_mems_allowed mems_allowed;
>>         __s32 node;
>> };
>>
>
> See my other email.  I've come around to seeing mems_allowed as a wart
> that needs to be removed.  The same information is already available
> via the sysfs cpusets.mems and cpusets.mems_effective interfaces.
>
> Additionally, mems_allowed isn't even technically part of the
> mempolicy, so if we did want an interface to acquire the information,
> you'd prefer to just implement a stand-alone syscall.
>
> The sysfs interface seems sufficient, though.
>
> `policy_node` is a similar "why does this even exist" type feature,
> except that it can still be used from get_mempolicy(), and if there is
> an actual reason to extend it to get_mempolicy2() it can be added to
> mpol_params.

OK.

--
Best Regards,
Huang, Ying