Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
From:   "Huang, Ying" <ying.huang@intel.com>
To:     Bharata B Rao <bharata@amd.com>
Cc:     Aneesh Kumar K V <aneesh.kumar@linux.ibm.com>,
        <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Alistair Popple <apopple@nvidia.com>,
        Dan Williams <dan.j.williams@intel.com>,
        Dave Hansen <dave.hansen@intel.com>,
        "Davidlohr Bueso" <dave@stgolabs.net>,
        Hesham Almatary <hesham.almatary@huawei.com>,
        Jagdish Gediya <jvgediya.oss@gmail.com>,
        Johannes Weiner <hannes@cmpxchg.org>,
        Jonathan Cameron <Jonathan.Cameron@huawei.com>,
        "Michal Hocko" <mhocko@kernel.org>,
        Tim Chen <tim.c.chen@intel.com>, Wei Xu <weixugc@google.com>,
        Yang Shi <shy828301@gmail.com>
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers
References: <20221027065925.476955-1-ying.huang@intel.com>
        <578c9b89-10eb-1e23-8868-cdd6685d8d4e@linux.ibm.com>
        <877d0kk5uf.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <59291b98-6907-0acf-df11-6d87681027cc@linux.ibm.com>
        <8735b8jy9k.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <0d938c9f-c810-b10a-e489-c2b312475c52@amd.com>
        <87tu3oibyr.fsf@yhuang6-desk2.ccr.corp.intel.com>
        <07912a0d-eb91-a6ef-2b9d-74593805f29e@amd.com>
Date:   Mon, 31 Oct 2022 09:33:49 +0800
In-Reply-To: <07912a0d-eb91-a6ef-2b9d-74593805f29e@amd.com> (Bharata B. Rao's
        message of "Fri, 28 Oct 2022 19:23:33 +0530")
Message-ID: <87leowepz6.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.1 (gnu/linux)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Precedence: bulk

Bharata B Rao <bharata@amd.com> writes:

> On 10/28/2022 2:03 PM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>> 
>>> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>>>> If my understanding were correct, you think the latency / bandwidth of
>>>> these NUMA nodes will near each other, but may be different.
>>>>
>>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>>>> we should deal with that in memory types instead of memory tiers.
>>>> There's only one abstract distance for each memory type.
>>>>
>>>> So, I still believe we will not have many memory tiers with my proposal.
>>>>
>>>> I don't care too much about the exact number, but want to discuss some
>>>> general design choice,
>>>>
>>>> a) Avoid to group multiple memory types into one memory tier by default
>>>>    at most times.
>>>
>>> Do you expect the abstract distances of two different types to be
>>> close enough in real life (like you showed in your example with
>>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
>>> most times?
>>>
>>> Are you foreseeing that abstract distance that get mapped by sources
>>> like HMAT would run into this issue?
>> 
>> Only if we set abstract distance chunk size large.  So, I think that
>> it's better to set chunk size as small as possible to avoid potential
>> issue.  What is the downside to set the chunk size small?
>
> I don't see anything in particular. However
>
> - With just two memory types (default_dram_type and dax_slowmem_type
> with adistance values of 576 and 576*5 respectively) defined currently,
> - With no interface yet to set/change adistance value of a memory type,
> - With no defined way to convert the performance characteristics info
> (bw and latency) from sources like HMAT into a adistance value,
>
> I find it a bit difficult to see how a chunk size of 10 against the
> existing 128 could be more useful.

OK.  Maybe we are paying too much attention to the specific numbers.  My
goal isn't to push this specific RFC into the kernel.  I just want to
discuss the design choices with the community.

My basic idea is NOT to group memory types into memory tiers by
customizing the abstract distance chunk size, because that is hard to
use and to implement.  So far, it appears that nobody objects to this.
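
To make the grouping effect concrete, here is a minimal sketch.  It
assumes that the tier of a memory type is simply its abstract distance
rounded down to a chunk boundary; tier_index() is a hypothetical helper,
not a kernel function, and 5000 / 5100 are the example CXL / PMEM values
quoted earlier in this thread.

```python
def tier_index(adistance: int, chunk_size: int) -> int:
    """Hypothetical helper: index of the chunk that contains adistance."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return adistance // chunk_size


# Example abstract distances taken from the thread, not measured values.
CXL_ADISTANCE = 5000
PMEM_ADISTANCE = 5100

for chunk_size in (128, 100, 10):
    cxl = tier_index(CXL_ADISTANCE, chunk_size)
    pmem = tier_index(PMEM_ADISTANCE, chunk_size)
    verdict = "same tier" if cxl == pmem else "different tiers"
    print(f"chunk size {chunk_size}: CXL -> {cxl}, PMEM -> {pmem} ({verdict})")
```

Under this assumption the two types share a tier with a chunk size of
128 and are separated with a chunk size of 100 or 10.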

Then, it's even better to avoid adjusting the abstract distance chunk
size in the kernel as much as possible.  This will make life easier for
user space tools/scripts.  One solution is to define more than enough
possible tiers below DRAM (we already have an unlimited number of tiers
above DRAM).

In the upstream implementation, 4 tiers are possible below DRAM.  That's
enough for now.  But in the long run, it may be better to define more.
100 possible tiers below DRAM may be too extreme.  How about defining
the abstract distance of DRAM to be 1050 and the chunk size to be 100?
Then we will have 10 possible tiers below DRAM.  That may be more than
enough even in the long run.
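
The arithmetic behind these counts can be sketched as follows.  This
assumes that "below DRAM" means the chunks whose abstract distance is
smaller than that of the chunk holding DRAM, so the count is the DRAM
abstract distance divided by the chunk size, rounded down.
tiers_below_dram() is a hypothetical helper for illustration only.

```python
def tiers_below_dram(dram_adistance: int, chunk_size: int) -> int:
    """Hypothetical helper: number of whole chunks that lie entirely
    below the chunk containing the DRAM abstract distance."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return dram_adistance // chunk_size


# Values mentioned in this thread: DRAM at 576 with chunk size 128
# (current), and DRAM at 1050 with chunk size 100 (suggested here).
current = tiers_below_dram(576, 128)
suggested = tiers_below_dram(1050, 100)

print(f"DRAM=576,  chunk=128: {current} possible tiers below DRAM")
print(f"DRAM=1050, chunk=100: {suggested} possible tiers below DRAM")
```

Placing DRAM at 1050 rather than 1000 keeps it in the middle of its
chunk, in the same way that 576 sits in the middle of a 128-wide chunk.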

Again, the specific numbers aren't so important to me.  So please
suggest your own numbers if necessary.

Best Regards,
Huang, Ying