From: "Huang, Ying"
To: Gregory Price
Cc: Gregory Price, ..., Srinivasulu Thanneeru
Subject: Re: [PATCH v5 02/11] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for weighted interleaving
In-Reply-To: (Gregory Price's message of "Tue, 2 Jan 2024 15:30:36 -0500")
References: <20231223181101.1954-1-gregory.price@memverge.com>
 <20231223181101.1954-3-gregory.price@memverge.com>
 <8734vof3kq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <878r58dt31.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Wed, 03 Jan 2024 13:46:56 +0800
Message-ID: <87mstnc6jz.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
X-Mailing-List: linux-kernel@vger.kernel.org

Gregory Price writes:

> On Tue, Jan 02, 2024 at 04:42:42PM +0800, Huang, Ying wrote:
>> Gregory Price writes:
>>
>> > On Wed, Dec 27, 2023 at 04:32:37PM +0800, Huang, Ying wrote:
>> >> Gregory Price writes:
>> >>
>> >> > +static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>> >> > +{
>> >> > +	nodemask_t nodemask = pol->nodes;
>> >> > +	unsigned int target, weight_total = 0;
>> >> > +	int nid;
>> >> > +	unsigned char weights[MAX_NUMNODES];
>> >>
>> >> MAX_NUMNODES could be as large as 1024.  1KB of stack space may be
>> >> too large?
>> >>
>> >
>> > I've been struggling with a good solution to this.  We need a local
>> > copy of the weights to prevent them from changing out from under us
>> > during allocation (which may take quite some time), but it seemed
>> > unwise to allocate 1KB on the heap in this particular path.
>> >
>> > Is my concern unfounded?  If so, I can go ahead and add the
>> > allocation code.
>>
>> Please take a look at NODEMASK_ALLOC().
>>
>
> This is not my question.  NODEMASK_ALLOC calls kmalloc/kfree.
>
> Some of the allocations on the stack can be replaced with a scratch
> allocation; that's no big deal.
>
> I'm specifically concerned about:
>
>   weighted_interleave_nid
>   alloc_pages_bulk_array_weighted_interleave
>
> I'm unsure whether kmalloc/kfree is safe (and non-offensive) in those
> contexts.  If kmalloc/kfree is safe there, this problem is trivial.
>
> If not, there is no good solution to this short of pre-allocating a
> scratch area per task.

You need to audit whether it's safe for all callers.  I guess that you
need to allocate pages after the call, so you can use the same GFP
flags here.
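
For example, something like the following untested sketch.  (The
GFP_KERNEL flag and the first_node() fallback are only placeholders
here; which flags are actually usable depends on the caller audit
above.)

	unsigned char *weights;

	/*
	 * Scratch copy on the heap instead of the stack.  The GFP
	 * flags should be no stricter than the ones used by the page
	 * allocation that follows.
	 */
	weights = kmalloc(MAX_NUMNODES, GFP_KERNEL);
	if (!weights)
		return first_node(pol->nodes);	/* placeholder fallback */

	/* ... snapshot iw_table[] and pick the target nid as before ... */

	kfree(weights);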

>> >> I don't think barrier() is needed to wait for memory operations on
>> >> the stack.  It's usually used for cross-processor memory ordering.
>> >>
>> >
>> > This is present in the old interleave code.  To the best of my
>> > understanding, the concern is the mempolicy->nodemask rebinding that
>> > can occur when cgroups.cpusets.mems_allowed changes.
>> >
>> > So we can't iterate over (mempolicy->nodemask); we have to take a
>> > local copy.
>> >
>> > My *best* understanding of the barrier here is that it prevents the
>> > compiler from reordering operations such that it optimizes out the
>> > local copy (or does a lazy fetch).
>> >
>> > It is present in the original interleave code, so I pulled it forward
>> > into this, but I have not tested whether it is a bit paranoid or not.
>> >
>> > From `interleave_nid`:
>> >
>> >   /*
>> >    * The barrier will stabilize the nodemask in a register or on
>> >    * the stack so that it will stop changing under the code.
>> >    *
>> >    * Between first_node() and next_node(), pol->nodes could be changed
>> >    * by other threads. So we put pol->nodes in a local stack.
>> >    */
>> >   barrier();
>>
>> Got it.  This is kind of a READ_ONCE() for the nodemask.  To avoid
>> adding comments all over the place, can we implement a wrapper for it?
>> For example, memcpy_once().  __read_once_size() in
>> tools/include/linux/compiler.h can be used as a reference.
>>
>> Because node_weights[] may be changed simultaneously too, we may need
>> to consider a similar issue for it as well.  But RCU seems more
>> appropriate for node_weights[].
>>
>
> Weights are collected individually onto the stack because we have to
> sum them up before we actually apply them.
>
> A stale weight is not offensive.  RCU is not needed and doesn't help.

When you copy weights from iw_table[] to the stack, it's possible for
the compiler to cache its contents in registers, or to merge or split
the memory operations.  At the same time, iw_table[] may be changed
simultaneously via the sysfs interface.  So we need a mechanism to
guarantee that we read the latest contents consistently.

> The reason the barrier is needed is not the weights, it's the nodemask.

Yes.  That is why I said we need similar handling for the weights.

> So you basically just want to replace barrier() with this and drop the
> copy/pasted comments:
>
> static void read_once_policy_nodemask(struct mempolicy *pol, nodemask_t *mask)
> {
> 	/*
> 	 * The barrier will stabilize the nodemask in a register or on
> 	 * the stack so that it will stop changing under the code.
> 	 *
> 	 * Between first_node() and next_node(), pol->nodes could be changed
> 	 * by other threads. So we put pol->nodes in a local stack.
> 	 */
> 	barrier();
> 	__builtin_memcpy(mask, &pol->nodes, sizeof(nodemask_t));
> 	barrier();
> }
>
> -	nodemask_t nodemask = pol->nodes;
> -	barrier();
> +	nodemask_t nodemask;
> +	read_once_policy_nodemask(pol, &nodemask);
>
> Is that right?

Yes, something like that.  Or even something more general (does it need
to be optimized?):

static inline void memcpy_once(void *dst, const void *src, size_t n)
{
	barrier();
	memcpy(dst, src, n);
	barrier();
}

memcpy_once(&nodemask, &pol->nodes, sizeof(nodemask));

The comments can be based on those of READ_ONCE().

--
Best Regards,
Huang, Ying