2022-10-27 07:47:14

by Huang, Ying

Subject: [RFC] memory tiering: use small chunk size and more tiers

We need some way to override the system default memory tiers. Take
the following example system,

type     abstract distance
----     -----------------
HBM      300
DRAM     1000
CXL_MEM  5000
PMEM     5100

Given the memory tier chunk size is 100, the default memory tiers
could be,

tier   abstract distance   types
       range
----   -----------------   -----
   3   300-400             HBM
  10   1000-1100           DRAM
  50   5000-5100           CXL_MEM
  51   5100-5200           PMEM

If we want to group CXL MEM and PMEM into one tier, we have 2 choices.

1) Override the abstract distance of CXL_MEM or PMEM. For example, if
we change the abstract distance of PMEM to 5050, the memory tiers
become,

tier   abstract distance   types
       range
----   -----------------   -----
   3   300-400             HBM
  10   1000-1100           DRAM
  50   5000-5100           CXL_MEM, PMEM

2) Override the memory tier chunk size. For example, if we change the
memory tier chunk size to 200, the memory tiers become,

tier   abstract distance   types
       range
----   -----------------   -----
   1   200-400             HBM
   5   1000-1200           DRAM
  25   5000-5200           CXL_MEM, PMEM

After some thought, though, I think choice 2) may not be good. The
problem is that even if 2 abstract distances are almost the same, they
may be put in 2 tiers if they sit on different sides of a tier
boundary. For example, suppose the abstract distance of CXL_MEM is
4990, while the abstract distance of PMEM is 5010. Although the
difference between the abstract distances is only 20, CXL_MEM and PMEM
will be put in different tiers for any tier chunk size of 50, 100,
200, 250, 500, .... This makes choice 2) hard to use; it may become
tricky to find an appropriate tier chunk size that satisfies all
requirements.

So I suggest abandoning choice 2) and using choice 1) only. This
makes the overall design and user-space interface simpler and easier
to use. The overall design of the abstract distance could be,

1. Use decimal for the abstract distance and its chunk size. This
makes them more user-friendly.

2. Make the tier chunk size as small as possible, for example, 10.
By default this will put different memory types in one memory tier
only if their performance is almost the same. And we will not provide
an interface to override the chunk size.

3. Make the abstract distance of normal DRAM large enough, for
example, 1000. Then 100 tiers can be defined below DRAM, which is
more than enough in practice.

4. If we want to override the default memory tiers, just override
the abstract distances of some memory types with a per-memory-type
interface.

This patch applies the design choices above to the existing code.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Aneesh Kumar K.V <[email protected]>
Cc: Alistair Popple <[email protected]>
Cc: Bharata B Rao <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Hesham Almatary <[email protected]>
Cc: Jagdish Gediya <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jonathan Cameron <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Yang Shi <[email protected]>
---
include/linux/memory-tiers.h | 7 +++----
mm/memory-tiers.c | 7 +++----
2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index 965009aa01d7..2e39d9a6c8ce 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -7,17 +7,16 @@
#include <linux/kref.h>
#include <linux/mmzone.h>
/*
- * Each tier cover a abstrace distance chunk size of 128
+ * Each tier covers an abstract distance chunk size of 10
*/
-#define MEMTIER_CHUNK_BITS 7
-#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS)
+#define MEMTIER_CHUNK_SIZE 10
/*
* Smaller abstract distance values imply faster (higher) memory tiers. Offset
* the DRAM adistance so that we can accommodate devices with a slightly lower
* adistance value (slightly faster) than default DRAM adistance to be part of
* the same memory tier.
*/
-#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
+#define MEMTIER_ADISTANCE_DRAM ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))
#define MEMTIER_HOTPLUG_PRIO 100

struct memory_tier;
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fa8c9d07f9ce..e03011428fa5 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
bool found_slot = false;
struct memory_tier *memtier, *new_memtier;
int adistance = memtype->adistance;
- unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;

lockdep_assert_held_once(&memory_tier_lock);

- adistance = round_down(adistance, memtier_adistance_chunk_size);
+ adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE);
/*
* If the memtype is already part of a memory tier,
* just return that.
@@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
else
list_add_tail(&new_memtier->list, &memory_tiers);

- new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
+ new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE;
new_memtier->dev.bus = &memory_tier_subsys;
new_memtier->dev.release = memory_tier_device_release;
new_memtier->dev.groups = memtier_dev_groups;
@@ -641,7 +640,7 @@ static int __init memory_tier_init(void)
#endif
mutex_lock(&memory_tier_lock);
/*
- * For now we can have 4 faster memory tiers with smaller adistance
+ * For now we can have 100 faster memory tiers with smaller adistance
* than default DRAM tier.
*/
default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);
--
2.35.1



2022-10-27 11:01:47

by Aneesh Kumar K.V

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On 10/27/22 12:29 PM, Huang Ying wrote:
> We need some way to override the system default memory tiers. For
> the example system as follows,
>
> type     abstract distance
> ----     -----------------
> HBM      300
> DRAM     1000
> CXL_MEM  5000
> PMEM     5100
>
> Given the memory tier chunk size is 100, the default memory tiers
> could be,
>
> tier   abstract distance   types
>        range
> ----   -----------------   -----
>    3   300-400             HBM
>   10   1000-1100           DRAM
>   50   5000-5100           CXL_MEM
>   51   5100-5200           PMEM
>
> If we want to group CXL MEM and PMEM into one tier, we have 2 choices.
>
> 1) Override the abstract distance of CXL_MEM or PMEM. For example, if
> we change the abstract distance of PMEM to 5050, the memory tiers
> become,
>
> tier   abstract distance   types
>        range
> ----   -----------------   -----
>    3   300-400             HBM
>   10   1000-1100           DRAM
>   50   5000-5100           CXL_MEM, PMEM
>
> 2) Override the memory tier chunk size. For example, if we change the
> memory tier chunk size to 200, the memory tiers become,
>
> tier   abstract distance   types
>        range
> ----   -----------------   -----
>    1   200-400             HBM
>    5   1000-1200           DRAM
>   25   5000-5200           CXL_MEM, PMEM
>
> But after some thoughts, I think choice 2) may be not good. The
> problem is that even if 2 abstract distances are almost same, they may
> be put in 2 tier if they sit in the different sides of the tier
> boundary. For example, if the abstract distance of CXL_MEM is 4990,
> while the abstract distance of PMEM is 5010. Although the difference
> of the abstract distances is only 20, CXL_MEM and PMEM will put in
> different tiers if the tier chunk size is 50, 100, 200, 250, 500, ....
> This makes choice 2) hard to be used, it may become tricky to find out
> the appropriate tier chunk size that satisfying all requirements.
>

Shouldn't we wait and gain experience w.r.t. how we end up mapping
devices with different latencies and bandwidths before tuning these
values?

> So I suggest to abandon choice 2) and use choice 1) only. This makes
> the overall design and user space interface to be simpler and easier
> to be used. The overall design of the abstract distance could be,
>
> 1. Use decimal for abstract distance and its chunk size. This makes
> them more user friendly.
>
> 2. Make the tier chunk size as small as possible. For example, 10.
> This will put different memory types in one memory tier only if their
> performance is almost same by default. And we will not provide the
> interface to override the chunk size.
>

This could also mean we end up with lots of memory tiers with
relatively small performance differences between them. Again, it
depends on how HMAT attributes will be mapped to abstract distance.



> 3. Make the abstract distance of normal DRAM large enough. For
> example, 1000, then 100 tiers can be defined below DRAM, this is
> more than enough in practice.

Why 100? Will we really have that many tiers below/faster than DRAM? As of now
I see only HBM below it.

>
> 4. If we want to override the default memory tiers, just override the
> abstract distances of some memory types with a per memory type
> interface.
>
> This patch is to apply the design choices above in the existing code.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Aneesh Kumar K.V <[email protected]>
> Cc: Alistair Popple <[email protected]>
> Cc: Bharata B Rao <[email protected]>
> Cc: Dan Williams <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Hesham Almatary <[email protected]>
> Cc: Jagdish Gediya <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Jonathan Cameron <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Wei Xu <[email protected]>
> Cc: Yang Shi <[email protected]>
> ---
> include/linux/memory-tiers.h | 7 +++----
> mm/memory-tiers.c | 7 +++----
> 2 files changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index 965009aa01d7..2e39d9a6c8ce 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -7,17 +7,16 @@
> #include <linux/kref.h>
> #include <linux/mmzone.h>
> /*
> - * Each tier cover a abstrace distance chunk size of 128
> + * Each tier cover a abstrace distance chunk size of 10
> */
> -#define MEMTIER_CHUNK_BITS 7
> -#define MEMTIER_CHUNK_SIZE (1 << MEMTIER_CHUNK_BITS)
> +#define MEMTIER_CHUNK_SIZE 10
> /*
> * Smaller abstract distance values imply faster (higher) memory tiers. Offset
> * the DRAM adistance so that we can accommodate devices with a slightly lower
> * adistance value (slightly faster) than default DRAM adistance to be part of
> * the same memory tier.
> */
> -#define MEMTIER_ADISTANCE_DRAM ((4 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE >> 1))
> +#define MEMTIER_ADISTANCE_DRAM ((100 * MEMTIER_CHUNK_SIZE) + (MEMTIER_CHUNK_SIZE / 2))
> #define MEMTIER_HOTPLUG_PRIO 100
>
> struct memory_tier;
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index fa8c9d07f9ce..e03011428fa5 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -165,11 +165,10 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
> bool found_slot = false;
> struct memory_tier *memtier, *new_memtier;
> int adistance = memtype->adistance;
> - unsigned int memtier_adistance_chunk_size = MEMTIER_CHUNK_SIZE;
>
> lockdep_assert_held_once(&memory_tier_lock);
>
> - adistance = round_down(adistance, memtier_adistance_chunk_size);
> + adistance = rounddown(adistance, MEMTIER_CHUNK_SIZE);
> /*
> * If the memtype is already part of a memory tier,
> * just return that.
> @@ -204,7 +203,7 @@ static struct memory_tier *find_create_memory_tier(struct memory_dev_type *memty
> else
> list_add_tail(&new_memtier->list, &memory_tiers);
>
> - new_memtier->dev.id = adistance >> MEMTIER_CHUNK_BITS;
> + new_memtier->dev.id = adistance / MEMTIER_CHUNK_SIZE;
> new_memtier->dev.bus = &memory_tier_subsys;
> new_memtier->dev.release = memory_tier_device_release;
> new_memtier->dev.groups = memtier_dev_groups;
> @@ -641,7 +640,7 @@ static int __init memory_tier_init(void)
> #endif
> mutex_lock(&memory_tier_lock);
> /*
> - * For now we can have 4 faster memory tiers with smaller adistance
> + * For now we can have 100 faster memory tiers with smaller adistance
> * than default DRAM tier.
> */
> default_dram_type = alloc_memory_type(MEMTIER_ADISTANCE_DRAM);


2022-10-28 03:13:48

by Huang, Ying

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Hi, Aneesh,

Aneesh Kumar K V <[email protected]> writes:

> On 10/27/22 12:29 PM, Huang Ying wrote:
>
> Shouldn't we wait for gaining experience w.r.t how we would end up
> mapping devices with different latencies and bandwidth before tuning these values?

Just want to discuss the overall design.

>> So I suggest to abandon choice 2) and use choice 1) only. This makes
>> the overall design and user space interface to be simpler and easier
>> to be used. The overall design of the abstract distance could be,
>>
>> 1. Use decimal for abstract distance and its chunk size. This makes
>> them more user friendly.
>>
>> 2. Make the tier chunk size as small as possible. For example, 10.
>> This will put different memory types in one memory tier only if their
>> performance is almost same by default. And we will not provide the
>> interface to override the chunk size.
>>
>
> this could also mean we can end up with lots of memory tiers with relative
> smaller performance difference between them. Again it depends how HMAT
> attributes will be used to map to abstract distance.

Per my understanding, there will not be many memory types in a
system, so there will not be many memory tiers either. Most systems
have only 2 or 3 memory tiers, for example, HBM, DRAM, CXL, etc. Do
you know of systems with many memory types? The basic idea is to put
different memory types in different memory tiers by default. If users
want to group them, they can do that via overriding the abstract
distance of some memory type.

>
>> 3. Make the abstract distance of normal DRAM large enough. For
>> example, 1000, then 100 tiers can be defined below DRAM, this is
>> more than enough in practice.
>
> Why 100? Will we really have that many tiers below/faster than DRAM? As of now
> I see only HBM below it.

Yes. 100 is more than enough. We just want to avoid grouping
different memory types by default.

Best Regards,
Huang, Ying


2022-10-28 05:21:00

by Aneesh Kumar K.V

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On 10/28/22 8:33 AM, Huang, Ying wrote:
> Hi, Aneesh,
>
> Aneesh Kumar K V <[email protected]> writes:
>
>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>
>> Shouldn't we wait for gaining experience w.r.t how we would end up
>> mapping devices with different latencies and bandwidth before tuning these values?
>
> Just want to discuss the overall design.
>
>>> So I suggest to abandon choice 2) and use choice 1) only. This makes
>>> the overall design and user space interface to be simpler and easier
>>> to be used. The overall design of the abstract distance could be,
>>>
>>> 1. Use decimal for abstract distance and its chunk size. This makes
>>> them more user friendly.
>>>
>>> 2. Make the tier chunk size as small as possible. For example, 10.
>>> This will put different memory types in one memory tier only if their
>>> performance is almost same by default. And we will not provide the
>>> interface to override the chunk size.
>>>
>>
>> this could also mean we can end up with lots of memory tiers with relative
>> smaller performance difference between them. Again it depends how HMAT
>> attributes will be used to map to abstract distance.
>
> Per my understanding, there will not be many memory types in a system.
> So, there will not be many memory tiers too. In most systems, there are
> only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL,
> etc.

So we don't need the chunk size to be 10, because we don't foresee
needing to group devices into that many tiers.

> Do you know systems with many memory types? The basic idea is to
> put different memory types in different memory tiers by default. If
> users want to group them, they can do that via overriding the abstract
> distance of some memory type.
>

With a small chunk size, and depending on how we are going to derive
abstract distance, I am wondering whether we would end up with lots of
memory tiers with no real value. Hence my suggestion to wait on making
a change like this till we have code that maps HMAT/CDAT attributes to
abstract distance.






2022-10-28 06:00:12

by Huang, Ying

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Aneesh Kumar K V <[email protected]> writes:

> On 10/28/22 8:33 AM, Huang, Ying wrote:
>> Hi, Aneesh,
>>
>> Aneesh Kumar K V <[email protected]> writes:
>>
>>> On 10/27/22 12:29 PM, Huang Ying wrote:
>>>
>>> Shouldn't we wait for gaining experience w.r.t how we would end up
>>> mapping devices with different latencies and bandwidth before tuning these values?
>>
>> Just want to discuss the overall design.
>>
>>>> So I suggest to abandon choice 2) and use choice 1) only. This makes
>>>> the overall design and user space interface to be simpler and easier
>>>> to be used. The overall design of the abstract distance could be,
>>>>
>>>> 1. Use decimal for abstract distance and its chunk size. This makes
>>>> them more user friendly.
>>>>
>>>> 2. Make the tier chunk size as small as possible. For example, 10.
>>>> This will put different memory types in one memory tier only if their
>>>> performance is almost same by default. And we will not provide the
>>>> interface to override the chunk size.
>>>>
>>>
>>> this could also mean we can end up with lots of memory tiers with relative
>>> smaller performance difference between them. Again it depends how HMAT
>>> attributes will be used to map to abstract distance.
>>
>> Per my understanding, there will not be many memory types in a system.
>> So, there will not be many memory tiers too. In most systems, there are
>> only 2 or 3 memory tiers in the system, for example, HBM, DRAM, CXL,
>> etc.
>
> So we don't need the chunk size to be 10 because we don't forsee us needing
> to group devices into that many tiers.

I suggest using a small chunk size to avoid accidentally grouping 2
memory types into one memory tier.

>> Do you know systems with many memory types? The basic idea is to
>> put different memory types in different memory tiers by default. If
>> users want to group them, they can do that via overriding the abstract
>> distance of some memory type.
>>
>
> with small chunk size and depending on how we are going to derive abstract distance,
> I am wondering whether we would end up with lots of memory tiers with no
> real value. Hence my suggestion to wait making a change like this till we have
> code that map HMAT/CDAT attributes to abstract distance.

Per my understanding, the NUMA nodes of the same memory type/tier will
have the exact same latency and bandwidth in HMAT/CDAT for the CPU in
the same socket.

If my understanding is correct, you think the latency / bandwidth of
these NUMA nodes will be near each other, but may differ.

Even if the latency / bandwidth of these NUMA nodes isn't exactly the
same, we should deal with that in memory types instead of memory
tiers. There's only one abstract distance for each memory type.

So, I still believe we will not have many memory tiers with my proposal.

I don't care too much about the exact number, but want to discuss
some general design choices:

a) Avoid grouping multiple memory types into one memory tier by
default most of the time.

b) Abandon customization of the abstract distance chunk size.

Best Regards,
Huang, Ying


2022-10-28 08:09:31

by Bharata B Rao

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On 10/28/2022 11:16 AM, Huang, Ying wrote:
> If my understanding were correct, you think the latency / bandwidth of
> these NUMA nodes will near each other, but may be different.
>
> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
> we should deal with that in memory types instead of memory tiers.
> There's only one abstract distance for each memory type.
>
> So, I still believe we will not have many memory tiers with my proposal.
>
> I don't care too much about the exact number, but want to discuss some
> general design choice,
>
> a) Avoid to group multiple memory types into one memory tier by default
> at most times.

Do you expect the abstract distances of two different types to be
close enough in real life (like you showed in your example with
CXL - 5000 and PMEM - 5100) that they will get assigned to the same
tier most of the time?

Are you foreseeing that abstract distances mapped from sources like
HMAT would run into this issue?

Regards,
Bharata.

2022-10-28 09:35:43

by Huang, Ying

Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Bharata B Rao <[email protected]> writes:

> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>> If my understanding were correct, you think the latency / bandwidth of
>> these NUMA nodes will near each other, but may be different.
>>
>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>> we should deal with that in memory types instead of memory tiers.
>> There's only one abstract distance for each memory type.
>>
>> So, I still believe we will not have many memory tiers with my proposal.
>>
>> I don't care too much about the exact number, but want to discuss some
>> general design choice,
>>
>> a) Avoid to group multiple memory types into one memory tier by default
>> at most times.
>
> Do you expect the abstract distances of two different types to be
> close enough in real life (like you showed in your example with
> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
> most times?
>
> Are you foreseeing that abstract distance that get mapped by sources
> like HMAT would run into this issue?

Only if we set the abstract distance chunk size large. So I think
it's better to set the chunk size as small as possible to avoid
potential issues. What is the downside of setting the chunk size
small?

Best Regards,
Huang, Ying

2022-10-28 13:58:04

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On 10/28/2022 2:03 PM, Huang, Ying wrote:
> Bharata B Rao <[email protected]> writes:
>
>> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>>> If my understanding were correct, you think the latency / bandwidth of
>>> these NUMA nodes will near each other, but may be different.
>>>
>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>>> we should deal with that in memory types instead of memory tiers.
>>> There's only one abstract distance for each memory type.
>>>
>>> So, I still believe we will not have many memory tiers with my proposal.
>>>
>>> I don't care too much about the exact number, but want to discuss some
>>> general design choice,
>>>
>>> a) Avoid to group multiple memory types into one memory tier by default
>>> at most times.
>>
>> Do you expect the abstract distances of two different types to be
>> close enough in real life (like you showed in your example with
>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
>> most times?
>>
>> Are you foreseeing that abstract distance that get mapped by sources
>> like HMAT would run into this issue?
>
> Only if we set abstract distance chunk size large. So, I think that
> it's better to set chunk size as small as possible to avoid potential
> issue. What is the downside to set the chunk size small?

I don't see anything in particular. However,

- With just two memory types (default_dram_type and dax_slowmem_type
with adistance values of 576 and 576*5 respectively) defined currently,
- With no interface yet to set/change the adistance value of a memory type,
- With no defined way to convert the performance characteristics info
(bw and latency) from sources like HMAT into an adistance value,

I find it a bit difficult to see how a chunk size of 10, as against the
existing 128, could be more useful.

Regards,
Bharata.

2022-10-31 02:19:45

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Bharata B Rao <[email protected]> writes:

> On 10/28/2022 2:03 PM, Huang, Ying wrote:
>> Bharata B Rao <[email protected]> writes:
>>
>>> On 10/28/2022 11:16 AM, Huang, Ying wrote:
>>>> If my understanding were correct, you think the latency / bandwidth of
>>>> these NUMA nodes will near each other, but may be different.
>>>>
>>>> Even if the latency / bandwidth of these NUMA nodes isn't exactly same,
>>>> we should deal with that in memory types instead of memory tiers.
>>>> There's only one abstract distance for each memory type.
>>>>
>>>> So, I still believe we will not have many memory tiers with my proposal.
>>>>
>>>> I don't care too much about the exact number, but want to discuss some
>>>> general design choice,
>>>>
>>>> a) Avoid to group multiple memory types into one memory tier by default
>>>> at most times.
>>>
>>> Do you expect the abstract distances of two different types to be
>>> close enough in real life (like you showed in your example with
>>> CXL - 5000 and PMEM - 5100) that they will get assigned into same tier
>>> most times?
>>>
>>> Are you foreseeing that abstract distance that get mapped by sources
>>> like HMAT would run into this issue?
>>
>> Only if we set abstract distance chunk size large. So, I think that
>> it's better to set chunk size as small as possible to avoid potential
>> issue. What is the downside to set the chunk size small?
>
> I don't see anything in particular. However
>
> - With just two memory types (default_dram_type and dax_slowmem_type
> with adistance values of 576 and 576*5 respectively) defined currently,
> - With no interface yet to set/change adistance value of a memory type,
> - With no defined way to convert the performance characteristics info
> (bw and latency) from sources like HMAT into a adistance value,
>
> I find it a bit difficult to see how a chunk size of 10 against the
> existing 128 could be more useful.

OK. Maybe we pay too much attention to the specific numbers. My target
isn't to push this specific RFC into the kernel; I just want to discuss
the design choices with the community.

My basic idea is NOT to group memory types into memory tiers by
customizing the abstract distance chunk size, because that's hard to
use and to implement. So far, it appears that nobody objects to this.

Then, it's even better to avoid adjusting the abstract distance chunk
size in the kernel as much as possible. This will make life easier for
user space tools/scripts. One solution is to define more than enough
possible tiers below DRAM (we already have an unlimited number of tiers
above DRAM).

In the upstream implementation, 4 tiers are possible below DRAM. That's
enough for now. But in the long run, it may be better to define more.
100 possible tiers below DRAM may be too extreme. How about defining the
abstract distance of DRAM to be 1050 and the chunk size to be 100? Then
we will have 10 possible tiers below DRAM. That may be more than enough
even in the long run.

Again, the specific number isn't so important for me. So please suggest
your number if necessary.

Best Regards,
Huang, Ying

2022-11-01 14:47:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On Mon 31-10-22 09:33:49, Huang, Ying wrote:
[...]
> In the upstream implementation, 4 tiers are possible below DRAM. That's
> enough for now. But in the long run, it may be better to define more.
> 100 possible tiers below DRAM may be too extreme.

I am just curious. Are any configurations with more than a couple of
tiers even manageable? I mean, applications have been struggling even
with regular NUMA systems for years, and the vast majority of them are
largely NUMA unaware. How are they going to configure for a more complex
system when a) there is no resource access control, so whatever you aim
for might not be available, and b) in which situations is there going to
be demand for only a subset of tiers (GPU memory?)?

Thanks!

--
Michal Hocko
SUSE Labs

2022-11-02 01:10:52

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Michal Hocko <[email protected]> writes:

> On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> [...]
>> In the upstream implementation, 4 tiers are possible below DRAM. That's
>> enough for now. But in the long run, it may be better to define more.
>> 100 possible tiers below DRAM may be too extreme.
>
> I am just curious. Is any configurations with more than couple of tiers
> even manageable? I mean applications have been struggling even with
> regular NUMA systems for years and vast majority of them is largerly
> NUMA unaware. How are they going to configure for a more complex system
> when a) there is no resource access control so whatever you aim for
> might not be available and b) in which situations there is going to be a
> demand only for subset of tears (GPU memory?) ?

Sorry for the confusion. I think that there will only be several (fewer
than 10) tiers in a system in practice. Yes, here I suggested defining
100 (10 in the later text) POSSIBLE tiers below DRAM. My intention isn't
to manage a system with tens of memory tiers. Instead, my intention is
to avoid putting 2 memory types into one memory tier by accident, by
making the abstract distance range of each memory tier as small as
possible. The more possible memory tiers, the smaller the abstract
distance range of each memory tier.

Best Regards,
Huang, Ying

2022-11-02 07:56:03

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> > [...]
> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
> >> enough for now. But in the long run, it may be better to define more.
> >> 100 possible tiers below DRAM may be too extreme.
> >
> > I am just curious. Is any configurations with more than couple of tiers
> > even manageable? I mean applications have been struggling even with
> > regular NUMA systems for years and vast majority of them is largerly
> > NUMA unaware. How are they going to configure for a more complex system
> > when a) there is no resource access control so whatever you aim for
> > might not be available and b) in which situations there is going to be a
> > demand only for subset of tears (GPU memory?) ?
>
> Sorry for confusing. I think that there are only several (less than 10)
> tiers in a system in practice. Yes, here, I suggested to define 100 (10
> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
> manage a system with tens memory tiers. Instead, my intention is to
> avoid to put 2 memory types into one memory tier by accident via make
> the abstract distance range of each memory tier as small as possible.
> More possible memory tiers, smaller abstract distance range of each
> memory tier.

TBH I do not really understand how tweaking ranges helps anything.
IIUC drivers are free to assign any abstract distance so they will clash
without any higher level coordination.
--
Michal Hocko
SUSE Labs

2022-11-02 08:35:40

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Michal Hocko <[email protected]> writes:

> On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> > [...]
>> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
>> >> enough for now. But in the long run, it may be better to define more.
>> >> 100 possible tiers below DRAM may be too extreme.
>> >
>> > I am just curious. Is any configurations with more than couple of tiers
>> > even manageable? I mean applications have been struggling even with
>> > regular NUMA systems for years and vast majority of them is largerly
>> > NUMA unaware. How are they going to configure for a more complex system
>> > when a) there is no resource access control so whatever you aim for
>> > might not be available and b) in which situations there is going to be a
>> > demand only for subset of tears (GPU memory?) ?
>>
>> Sorry for confusing. I think that there are only several (less than 10)
>> tiers in a system in practice. Yes, here, I suggested to define 100 (10
>> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
>> manage a system with tens memory tiers. Instead, my intention is to
>> avoid to put 2 memory types into one memory tier by accident via make
>> the abstract distance range of each memory tier as small as possible.
>> More possible memory tiers, smaller abstract distance range of each
>> memory tier.
>
> TBH I do not really understand how tweaking ranges helps anything.
> IIUC drivers are free to assign any abstract distance so they will clash
> without any higher level coordination.

Yes, that's possible. Each memory tier corresponds to one abstract
distance range. The larger the range is, the higher the possibility of
clashing. So I suggest making the abstract distance range smaller to
reduce the possibility of clashing.

Best Regards,
Huang, Ying

2022-11-02 08:46:11

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Michal Hocko <[email protected]> writes:

> On Wed 02-11-22 16:02:54, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> >> Michal Hocko <[email protected]> writes:
>> >>
>> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> >> > [...]
>> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
>> >> >> enough for now. But in the long run, it may be better to define more.
>> >> >> 100 possible tiers below DRAM may be too extreme.
>> >> >
>> >> > I am just curious. Is any configurations with more than couple of tiers
>> >> > even manageable? I mean applications have been struggling even with
>> >> > regular NUMA systems for years and vast majority of them is largerly
>> >> > NUMA unaware. How are they going to configure for a more complex system
>> >> > when a) there is no resource access control so whatever you aim for
>> >> > might not be available and b) in which situations there is going to be a
>> >> > demand only for subset of tears (GPU memory?) ?
>> >>
>> >> Sorry for confusing. I think that there are only several (less than 10)
>> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10
>> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
>> >> manage a system with tens memory tiers. Instead, my intention is to
>> >> avoid to put 2 memory types into one memory tier by accident via make
>> >> the abstract distance range of each memory tier as small as possible.
>> >> More possible memory tiers, smaller abstract distance range of each
>> >> memory tier.
>> >
>> > TBH I do not really understand how tweaking ranges helps anything.
>> > IIUC drivers are free to assign any abstract distance so they will clash
>> > without any higher level coordination.
>>
>> Yes. That's possible. Each memory tier corresponds to one abstract
>> distance range. The larger the range is, the higher the possibility of
>> clashing is. So I suggest to make the abstract distance range smaller
>> to reduce the possibility of clashing.
>
> I am sorry but I really do not understand how the size of the range
> actually addresses a fundamental issue that each driver simply picks
> what it wants. Is there any enumeration defining basic characteristic of
> each tier? How does a driver developer knows which tear to assign its
> driver to?

The smaller range size will not guarantee anything; it just helps the
default behavior.

The drivers are expected to assign the abstract distance based on
memory latency/bandwidth, etc., and the abstract distance range of a
memory tier corresponds to a memory latency/bandwidth range too. So, if
the abstract distance range is smaller, the possibility of two types of
memory with different latency/bandwidth clashing in the same range is
lower.

Clashing isn't a total disaster. We plan to provide a per-memory-type
knob to offset the abstract distance provided by the driver. Then we can
move clashing memory types away if necessary.

Best Regards,
Huang, Ying

2022-11-02 09:03:39

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On Wed 02-11-22 16:02:54, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> >> Michal Hocko <[email protected]> writes:
> >>
> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> >> > [...]
> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
> >> >> enough for now. But in the long run, it may be better to define more.
> >> >> 100 possible tiers below DRAM may be too extreme.
> >> >
> >> > I am just curious. Is any configurations with more than couple of tiers
> >> > even manageable? I mean applications have been struggling even with
> >> > regular NUMA systems for years and vast majority of them is largerly
> >> > NUMA unaware. How are they going to configure for a more complex system
> >> > when a) there is no resource access control so whatever you aim for
> >> > might not be available and b) in which situations there is going to be a
> >> > demand only for subset of tears (GPU memory?) ?
> >>
> >> Sorry for confusing. I think that there are only several (less than 10)
> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10
> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
> >> manage a system with tens memory tiers. Instead, my intention is to
> >> avoid to put 2 memory types into one memory tier by accident via make
> >> the abstract distance range of each memory tier as small as possible.
> >> More possible memory tiers, smaller abstract distance range of each
> >> memory tier.
> >
> > TBH I do not really understand how tweaking ranges helps anything.
> > IIUC drivers are free to assign any abstract distance so they will clash
> > without any higher level coordination.
>
> Yes. That's possible. Each memory tier corresponds to one abstract
> distance range. The larger the range is, the higher the possibility of
> clashing is. So I suggest to make the abstract distance range smaller
> to reduce the possibility of clashing.

I am sorry, but I really do not understand how the size of the range
actually addresses the fundamental issue that each driver simply picks
what it wants. Is there any enumeration defining the basic
characteristics of each tier? How does a driver developer know which
tier to assign their driver to?

--
Michal Hocko
SUSE Labs

2022-11-02 09:03:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

On Wed 02-11-22 16:28:08, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Wed 02-11-22 16:02:54, Huang, Ying wrote:
> >> Michal Hocko <[email protected]> writes:
> >>
> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
> >> >> Michal Hocko <[email protected]> writes:
> >> >>
> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
> >> >> > [...]
> >> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
> >> >> >> enough for now. But in the long run, it may be better to define more.
> >> >> >> 100 possible tiers below DRAM may be too extreme.
> >> >> >
> >> >> > I am just curious. Is any configurations with more than couple of tiers
> >> >> > even manageable? I mean applications have been struggling even with
> >> >> > regular NUMA systems for years and vast majority of them is largerly
> >> >> > NUMA unaware. How are they going to configure for a more complex system
> >> >> > when a) there is no resource access control so whatever you aim for
> >> >> > might not be available and b) in which situations there is going to be a
> >> >> > demand only for subset of tears (GPU memory?) ?
> >> >>
> >> >> Sorry for confusing. I think that there are only several (less than 10)
> >> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10
> >> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
> >> >> manage a system with tens memory tiers. Instead, my intention is to
> >> >> avoid to put 2 memory types into one memory tier by accident via make
> >> >> the abstract distance range of each memory tier as small as possible.
> >> >> More possible memory tiers, smaller abstract distance range of each
> >> >> memory tier.
> >> >
> >> > TBH I do not really understand how tweaking ranges helps anything.
> >> > IIUC drivers are free to assign any abstract distance so they will clash
> >> > without any higher level coordination.
> >>
> >> Yes. That's possible. Each memory tier corresponds to one abstract
> >> distance range. The larger the range is, the higher the possibility of
> >> clashing is. So I suggest to make the abstract distance range smaller
> >> to reduce the possibility of clashing.
> >
> > I am sorry but I really do not understand how the size of the range
> > actually addresses a fundamental issue that each driver simply picks
> > what it wants. Is there any enumeration defining basic characteristic of
> > each tier? How does a driver developer knows which tear to assign its
> > driver to?
>
> The smaller range size will not guarantee anything. It just tries to
> help the default behavior.
>
> The drivers are expected to assign the abstract distance based on the
> memory latency/bandwidth, etc.

Would it be possible/feasible to have a canonical way to calculate the
abstract distance from these characteristics in the core kernel, so that
drivers do not even have to fall into that trap?

--
Michal Hocko
SUSE Labs

2022-11-02 09:04:47

by Huang, Ying

[permalink] [raw]
Subject: Re: [RFC] memory tiering: use small chunk size and more tiers

Michal Hocko <[email protected]> writes:

> On Wed 02-11-22 16:28:08, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Wed 02-11-22 16:02:54, Huang, Ying wrote:
>> >> Michal Hocko <[email protected]> writes:
>> >>
>> >> > On Wed 02-11-22 08:39:49, Huang, Ying wrote:
>> >> >> Michal Hocko <[email protected]> writes:
>> >> >>
>> >> >> > On Mon 31-10-22 09:33:49, Huang, Ying wrote:
>> >> >> > [...]
>> >> >> >> In the upstream implementation, 4 tiers are possible below DRAM. That's
>> >> >> >> enough for now. But in the long run, it may be better to define more.
>> >> >> >> 100 possible tiers below DRAM may be too extreme.
>> >> >> >
>> >> >> > I am just curious. Is any configurations with more than couple of tiers
>> >> >> > even manageable? I mean applications have been struggling even with
>> >> >> > regular NUMA systems for years and vast majority of them is largerly
>> >> >> > NUMA unaware. How are they going to configure for a more complex system
>> >> >> > when a) there is no resource access control so whatever you aim for
>> >> >> > might not be available and b) in which situations there is going to be a
>> >> >> > demand only for subset of tears (GPU memory?) ?
>> >> >>
>> >> >> Sorry for confusing. I think that there are only several (less than 10)
>> >> >> tiers in a system in practice. Yes, here, I suggested to define 100 (10
>> >> >> in the later text) POSSIBLE tiers below DRAM. My intention isn't to
>> >> >> manage a system with tens memory tiers. Instead, my intention is to
>> >> >> avoid to put 2 memory types into one memory tier by accident via make
>> >> >> the abstract distance range of each memory tier as small as possible.
>> >> >> More possible memory tiers, smaller abstract distance range of each
>> >> >> memory tier.
>> >> >
>> >> > TBH I do not really understand how tweaking ranges helps anything.
>> >> > IIUC drivers are free to assign any abstract distance so they will clash
>> >> > without any higher level coordination.
>> >>
>> >> Yes. That's possible. Each memory tier corresponds to one abstract
>> >> distance range. The larger the range is, the higher the possibility of
>> >> clashing is. So I suggest to make the abstract distance range smaller
>> >> to reduce the possibility of clashing.
>> >
>> > I am sorry but I really do not understand how the size of the range
>> > actually addresses a fundamental issue that each driver simply picks
>> > what it wants. Is there any enumeration defining basic characteristic of
>> > each tier? How does a driver developer knows which tear to assign its
>> > driver to?
>>
>> The smaller range size will not guarantee anything. It just tries to
>> help the default behavior.
>>
>> The drivers are expected to assign the abstract distance based on the
>> memory latency/bandwidth, etc.
>
> Would it be possible/feasible to have a canonical way to calculate the
> abstract distance from these characteristics by the core kernel so that
> drivers do not even have fall into that trap?

Yes, that sounds like a good idea. We can provide a function that maps
memory latency/bandwidth to an abstract distance for the drivers.

Best Regards,
Huang, Ying