From: "Huang, Ying"
To: "Aneesh Kumar K.V"
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, Wei Xu, Yang Shi,
	Davidlohr Bueso, Tim C Chen, Michal Hocko,
	Linux Kernel Mailing List, Hesham Almatary, Dave Hansen,
	Jonathan Cameron, Alistair Popple, Dan Williams,
	Johannes Weiner, jvgediya.oss@gmail.com, Jagdish Gediya
Subject: Re: [PATCH v9 1/8] mm/demotion: Add support for explicit memory tiers
References: <20220714045351.434957-1-aneesh.kumar@linux.ibm.com>
	<20220714045351.434957-2-aneesh.kumar@linux.ibm.com>
	<87bktq4xs7.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<3659f1bb-a82e-1aad-f297-808a2c17687d@linux.ibm.com>
	<87sfn2u0vy.fsf@linux.ibm.com>
Date: Mon, 18 Jul 2022 14:08:11 +0800
In-Reply-To: <87sfn2u0vy.fsf@linux.ibm.com> (Aneesh Kumar K. V.'s message of
	"Fri, 15 Jul 2022 15:57:13 +0530")
Message-ID: <87y1wr2bsk.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Aneesh Kumar K.V" writes:

> Aneesh Kumar K V writes:
>
> ....
>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface. For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface. So I suggest you
>>> implement with the new interface in mind. What do you think about
>>> the following design?
>>>
>>
>> Sorry, I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead, it allows
>> creation of memory tiers with a specific tier ID from within the
>> kernel/device driver. The default tier ID is 200, and dax kmem creates
>> a memory tier with tier ID 100.
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>> corresponds to an "abstract distance", so each NUMA node corresponds
>>> to a "distance". For simplicity, we can start with static distances,
>>> for example, DRAM (default): 150, PMEM: 250. The distance of each
>>> NUMA node can be recorded in a global array,
>>>
>>>   int node_distances[MAX_NUMNODES];
>>>
>>> or, just
>>>
>>>   pgdat->distance
>>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you could write this in the form of a patch?
>>
>>> - Each memory tier corresponds to a range of distances, for example,
>>> 0-100, 100-200, 200-300, >300; we can start with static ranges too.
>>>
>>> - The core API of the memory tier could be
>>>
>>>   struct memory_tier *find_create_memory_tier(int distance);
>>>
>>> it will find the memory tier which covers "distance" in the memory
>>> tier list, or create a new memory tier if not found.
>>>
>> I was expecting this to be internal to dax kmem: how dax kmem maps
>> "abstract distance" to a memory tier. At this point, this patchset is
>> keeping all that for a future patchset.
>>
> This shows how I was expecting "abstract distance" to be integrated.

Thanks! To make the first version as simple as possible, I think we can
just use a static "abstract distance" for dax_kmem, e.g., 250, because
we use it for PMEM only now. We can enhance dax_kmem later.
IMHO, we should make the core framework correct first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via an abstract distance. This
  can be done via some global data structure (e.g. node_distances[]), at
  least in the first version.

- The memory tier core determines the mapping from abstract distance to
  memory tier via abstract distance ranges, and allocates the struct
  memory_tier when necessary. That is, the memory tier core, not the
  device drivers, determines whether to allocate a new memory tier or
  reuse an existing one for a NUMA node.

- It's better to place the NUMA node in the correct memory tier in the
  first place. We should avoid placing the PMEM node in the default tier
  and then moving it to the correct memory tier. That is, device drivers
  should report the abstract distance before onlining NUMA nodes.

Please check my reply to Wei too for my other suggestions for the first
version.

Best Regards,
Huang, Ying