Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.34 as permitted sender) client-ip=23.128.96.34;
From:   "Huang, Ying" <ying.huang@intel.com>
To:     Michal Hocko <mhocko@suse.com>
Cc:     Johannes Weiner <hannes@cmpxchg.org>,
        Gregory Price <gourry.memverge@gmail.com>,
        linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
        linux-mm@kvack.org, akpm@linux-foundation.org,
        aneesh.kumar@linux.ibm.com, weixugc@google.com, apopple@nvidia.com,
        tim.c.chen@intel.com, dave.hansen@intel.com, shy828301@gmail.com,
        gregkh@linuxfoundation.org, rafael@kernel.org,
        Gregory Price <gregory.price@memverge.com>
Subject: Re: [RFC PATCH v3 0/4] Node Weights and Weighted Interleave
In-Reply-To: <jgh5b5bm73qe7m3qmnsjo3drazgfaix3ycqmom5u6tfp6hcerj@ij4vftrutvrt>
        (Michal Hocko's message of "Tue, 31 Oct 2023 16:56:27 +0100")
References: <20231031003810.4532-1-gregory.price@memverge.com>
        <rm43wgtlvwowjolzcf6gj4un4qac4myngxqnd2jwt5yqxree62@t66scnrruttc>
        <20231031152142.GA3029315@cmpxchg.org>
        <jgh5b5bm73qe7m3qmnsjo3drazgfaix3ycqmom5u6tfp6hcerj@ij4vftrutvrt>
Date:   Wed, 01 Nov 2023 10:21:47 +0800
Message-ID: <87msvy6wn8.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Precedence: bulk

Michal Hocko <mhocko@suse.com> writes:

> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>> > > This patchset implements weighted interleave and adds a new sysfs
>> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>> > > 
>> > > The il_weight of a node is used by mempolicy to implement weighted
>> > > interleave when `numactl --interleave=...` is invoked.  By default
>> > > il_weight for a node is always 1, which preserves the default round
>> > > robin interleave behavior.
>> > > 
>> > > Interleave weights may be set from 0-100, and denote the number of
>> > > pages that should be allocated from the node when interleaving
>> > > occurs.
>> > > 
>> > > For example, if a node's interleave weight is set to 5, 5 pages
>> > > will be allocated from that node before the next node is scheduled
>> > > for allocations.
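
(For readers following along: a minimal userspace sketch of the semantic
described above, not the patchset's kernel code. The function name
weighted_interleave and the two-node weights are hypothetical, and
skipping nodes whose weight is 0 is an assumption, since the cover
letter does not say what a weight of 0 means.)

```python
from itertools import islice


def weighted_interleave(weights):
    """Yield node ids forever.

    Node n is handed out weights[n] times in a row before the next node
    is scheduled. Assumption: a node with weight 0 is never picked.
    """
    if not any(w > 0 for w in weights.values()):
        raise ValueError("at least one node needs a non-zero weight")
    while True:
        for node, weight in weights.items():
            # Allocate `weight` consecutive pages from this node.
            for _ in range(weight):
                yield node


# Hypothetical example: node 0 has il_weight 5, node 1 keeps the default 1.
first_pages = list(islice(weighted_interleave({0: 5, 1: 1}), 12))
print(first_pages)  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

With every weight left at 1 the same generator degenerates to plain
round robin, which matches the default behavior described earlier in
the cover letter.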
>> > 
>> > I find this semantic rather weird TBH. First of all why do you think it
>> > makes sense to have those weights global for all users? What if
>> > different applications have different views on how to spread their
>> > interleaved memory?
>> > 
>> > I do get that you might have different tiers with largely different
>> > runtime characteristics but why would you want to interleave them into a
>> > single mapping and have hard to predict runtime behavior?
>> > 
>> > [...]
>> > > In this way it becomes possible to set an interleaving strategy
>> > > that fits the available bandwidth for the devices available on
>> > > the system. An example system:
>> > > 
>> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
>> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
>> > > 
>> > > In this setup, the effective weights for nodes 0-3 for a task
>> > > running on Node 0 may be [60, 20, 10, 10].
>> > > 
>> > > This spreads memory out across devices which all have different
>> > > latency and bandwidth attributes in a way that can maximize the
>> > > available resources.
>> > 
>> > OK, so why is this any better than not using any memory policy and
>> > relying on demotion to push cold memory down the tier hierarchy?
>> > 
>> > What is the actual real life usecase and what kind of benefits you can
>> > present?
>> 
>> There are two things CXL gives you: additional capacity and additional
>> bus bandwidth.
>> 
>> The promotion/demotion mechanism is good for the capacity usecase,
>> where you have a nice hot/cold gradient in the workingset and want
>> placement accordingly across faster and slower memory.
>> 
>> The interleaving is useful when you have a flatter workingset
>> distribution and poorer access locality. In that case, the CPU caches
>> are less effective and the workload can be bus-bound. The workload
>> might fit entirely into DRAM, but concentrating it there is
>> suboptimal. Fanning it out in proportion to the relative performance
>> of each memory tier gives better results.
>> 
>> We experimented with datacenter workloads on such machines last year
>> and found significant performance benefits:
>> 
>> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
>
> Thanks, this is a useful insight.
>  
>> This hopefully also explains why it's a global setting. The usecase is
>> different from conventional NUMA interleaving, which is used as a
>> locality measure: spread shared data evenly between compute
>> nodes. This one isn't about locality - the CXL tier doesn't have local
>> compute. Instead, the optimal spread is based on hardware parameters,
>> which is a global property rather than a per-workload one.
>
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.

For "pay", per my understanding, we need some cgroup-based
per-memory-tier (or per-node) usage limit.  The following patchset is
the first step toward that.

https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/

--
Best Regards,
Huang, Ying