2023-08-11 13:29:31

by zhangzekun (A)

[permalink] [raw]
Subject: [RESEND PATCH 0/2] iommu/iova: optimize the iova rcache

The number of iova_cpu_rcaches can grow with the number of cpus, but the
size of rcache->depot will not. A deeper rcache->depot helps iova_rcache
cache more iovas, and helps iova_rcache better deal with scenarios in
which drivers allocate and free iovas on different cpu cores. We only let
the size of rcache->depot grow with the number of cpus when there are
more than 32 cpus, to avoid a potential performance decrease on machines
without many cpus.

Also, it is unsafe to directly free the cpu rcache magazines in
free_iova_rcaches(), so add a check before freeing them.

Zhang Zekun (2):
iommu/iova: Add check for cpu_rcache in free_iova_rcaches
iommu/iova: allocate iova_rcache->depot dynamically

drivers/iommu/iova.c | 36 +++++++++++++++++++++++++++++++-----
1 file changed, 31 insertions(+), 5 deletions(-)

--
2.17.1



2023-08-11 14:01:21

by zhangzekun (A)

[permalink] [raw]
Subject: [RESEND PATCH 1/2] iommu/iova: Add check for cpu_rcache in free_iova_rcaches

free_iova_rcaches() needs to check whether cpu_rcache->loaded and
cpu_rcache->prev are NULL before freeing them. This is because
iova_domain_init_rcaches() may fail to allocate the magazines for
cpu_rcache->loaded and cpu_rcache->prev, yet they are freed
for all cpus.

Fixes: 32e92d9f6f87 ("iommu/iova: Separate out rcache init")
Signed-off-by: Zhang Zekun <[email protected]>
---
drivers/iommu/iova.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 10b964600948..3c784a28e9ed 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -746,8 +746,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)

spin_lock_init(&cpu_rcache->lock);
cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
+ if (!cpu_rcache->loaded) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
- if (!cpu_rcache->loaded || !cpu_rcache->prev) {
+ if (!cpu_rcache->prev) {
ret = -ENOMEM;
goto out_err;
}
@@ -903,7 +907,11 @@ static void free_iova_rcaches(struct iova_domain *iovad)
break;
for_each_possible_cpu(cpu) {
cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
+ if (!cpu_rcache->loaded)
+ break;
iova_magazine_free(cpu_rcache->loaded);
+ if (!cpu_rcache->prev)
+ break;
iova_magazine_free(cpu_rcache->prev);
}
free_percpu(rcache->cpu_rcaches);
--
2.17.1


2023-08-11 14:01:45

by Robin Murphy

[permalink] [raw]
Subject: Re: [RESEND PATCH 1/2] iommu/iova: Add check for cpu_rcache in free_iova_rcaches

On 2023-08-11 14:02, Zhang Zekun wrote:
> free_iova_rcaches() needs to check whether cpu_rcache->loaded and
> cpu_rcache->prev are NULL before freeing them.

Why? iova_magazine_free() is just kfree(), and kfree(NULL) is perfectly
valid, specifically to avoid having to make cleanup paths all fiddly and
overcomplicated like this.

Thanks,
Robin.

> This is because
> iova_domain_init_rcaches() may fail to allocate the magazines for
> cpu_rcache->loaded and cpu_rcache->prev, yet they are freed
> for all cpus.
>
> Fixes: 32e92d9f6f87 ("iommu/iova: Separate out rcache init")
> Signed-off-by: Zhang Zekun <[email protected]>
> ---
> drivers/iommu/iova.c | 10 +++++++++-
> 1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 10b964600948..3c784a28e9ed 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -746,8 +746,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)
>
> spin_lock_init(&cpu_rcache->lock);
> cpu_rcache->loaded = iova_magazine_alloc(GFP_KERNEL);
> + if (!cpu_rcache->loaded) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> cpu_rcache->prev = iova_magazine_alloc(GFP_KERNEL);
> - if (!cpu_rcache->loaded || !cpu_rcache->prev) {
> + if (!cpu_rcache->prev) {
> ret = -ENOMEM;
> goto out_err;
> }
> @@ -903,7 +907,11 @@ static void free_iova_rcaches(struct iova_domain *iovad)
> break;
> for_each_possible_cpu(cpu) {
> cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
> + if (!cpu_rcache->loaded)
> + break;
> iova_magazine_free(cpu_rcache->loaded);
> + if (!cpu_rcache->prev)
> + break;
> iova_magazine_free(cpu_rcache->prev);
> }
> free_percpu(rcache->cpu_rcaches);

2023-08-11 14:05:32

by zhangzekun (A)

[permalink] [raw]
Subject: [RESEND PATCH 2/2] iommu/iova: allocate iova_rcache->depot dynamically

In an fio test with 4k, read, and allowed cpus set to 0-255, we observe a
performance decrease in IOPS. The normal IOPS can reach up to 1980k, but
we can only get about 1600k.

abnormal IOPS:
Jobs: 12 (f=12): [R(12)][99.3%][r=6220MiB/s][r=1592k IOPS][eta 00m:12s]
Jobs: 12 (f=12): [R(12)][99.4%][r=6215MiB/s][r=1591k IOPS][eta 00m:11s]
Jobs: 12 (f=12): [R(12)][99.4%][r=6335MiB/s][r=1622k IOPS][eta 00m:10s]
Jobs: 12 (f=12): [R(12)][99.5%][r=6194MiB/s][r=1586k IOPS][eta 00m:09s]
Jobs: 12 (f=12): [R(12)][99.6%][r=6173MiB/s][r=1580k IOPS][eta 00m:08s]
Jobs: 12 (f=12): [R(12)][99.6%][r=5984MiB/s][r=1532k IOPS][eta 00m:07s]
Jobs: 12 (f=12): [R(12)][99.7%][r=6374MiB/s][r=1632k IOPS][eta 00m:06s]
Jobs: 12 (f=12): [R(12)][99.7%][r=6343MiB/s][r=1624k IOPS][eta 00m:05s]

normal IOPS:
Jobs: 12 (f=12): [R(12)][99.3%][r=7736MiB/s][r=1980k IOPS][eta 00m:12s]
Jobs: 12 (f=12): [R(12)][99.4%][r=7744MiB/s][r=1982k IOPS][eta 00m:11s]
Jobs: 12 (f=12): [R(12)][99.4%][r=7737MiB/s][r=1981k IOPS][eta 00m:10s]
Jobs: 12 (f=12): [R(12)][99.5%][r=7735MiB/s][r=1980k IOPS][eta 00m:09s]
Jobs: 12 (f=12): [R(12)][99.6%][r=7741MiB/s][r=1982k IOPS][eta 00m:08s]
Jobs: 12 (f=12): [R(12)][99.6%][r=7740MiB/s][r=1982k IOPS][eta 00m:07s]
Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1981k IOPS][eta 00m:06s]
Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1980k IOPS][eta 00m:05s]

The current iova_rcache structure has an iova_cpu_rcache for every
cpu, and these iova_cpu_rcaches use a common buffer, iova_rcache->depot,
to exchange iovas among themselves. A machine with 256 cpus will have
256 iova_cpu_rcaches and one iova_rcache->depot per iova_domain. However,
the max size of iova_rcache->depot is fixed at MAX_GLOBAL_MAGS (32) and
cannot grow with the number of cpus, which can cause problems under some
conditions.

Some drivers only free iovas in their irq callback functions. The driver
in this case has 16 threaded irqs to free iovas, but these irq callbacks
only free iovas on 16 particular cpus (cpu{0,16,32,...,240}). A threaded
irq whose smp affinity is 0-15 will only free iovas on cpu 0. However,
the driver allocates iovas on all cpus (cpu{0-255}), so cpus without free
iovas in their local cpu_rcache need to get them from iova_rcache->depot.
The current max size of iova_rcache->depot is 32, which seems too small
for 256 users (16 cpus put iovas into iova_rcache->depot and 240 cpus try
to get iovas from it). Let iova_rcache->depot grow with the number of
possible cpus, and the decrease in IOPS disappears.

Signed-off-by: Zhang Zekun <[email protected]>
---
drivers/iommu/iova.c | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 3c784a28e9ed..df37a4501e98 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -238,6 +238,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,

static struct kmem_cache *iova_cache;
static unsigned int iova_cache_users;
+static unsigned int max_global_mags;
static DEFINE_MUTEX(iova_cache_mutex);

static struct iova *alloc_iova_mem(void)
@@ -625,7 +626,6 @@ EXPORT_SYMBOL_GPL(reserve_iova);
* will be wasted.
*/
#define IOVA_MAG_SIZE 127
-#define MAX_GLOBAL_MAGS 32 /* magazines per bin */

struct iova_magazine {
unsigned long size;
@@ -641,7 +641,7 @@ struct iova_cpu_rcache {
struct iova_rcache {
spinlock_t lock;
unsigned long depot_size;
- struct iova_magazine *depot[MAX_GLOBAL_MAGS];
+ struct iova_magazine **depot;
struct iova_cpu_rcache __percpu *cpu_rcaches;
};

@@ -722,6 +722,13 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)
unsigned int cpu;
int i, ret;

+ /*
+ * the size of max global mags should growth with the num of
+ * cpus
+ */
+ if (!max_global_mags)
+ max_global_mags = max_t(unsigned int, 32, num_possible_cpus());
+
iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE,
sizeof(struct iova_rcache),
GFP_KERNEL);
@@ -733,6 +740,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)
struct iova_rcache *rcache;

rcache = &iovad->rcaches[i];
+ rcache->depot = kcalloc(max_global_mags, sizeof(struct iova_magazine *),
+ GFP_KERNEL);
+ if (!rcache->depot) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
spin_lock_init(&rcache->lock);
rcache->depot_size = 0;
rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache),
@@ -798,7 +811,7 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,

if (new_mag) {
spin_lock(&rcache->lock);
- if (rcache->depot_size < MAX_GLOBAL_MAGS) {
+ if (rcache->depot_size < max_global_mags) {
rcache->depot[rcache->depot_size++] =
cpu_rcache->loaded;
} else {
@@ -903,8 +916,12 @@ static void free_iova_rcaches(struct iova_domain *iovad)

for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
rcache = &iovad->rcaches[i];
- if (!rcache->cpu_rcaches)
+ if (!rcache->depot)
+ break;
+ if (!rcache->cpu_rcaches) {
+ kfree(rcache->depot);
break;
+ }
for_each_possible_cpu(cpu) {
cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
if (!cpu_rcache->loaded)
@@ -917,6 +934,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
free_percpu(rcache->cpu_rcaches);
for (j = 0; j < rcache->depot_size; ++j)
iova_magazine_free(rcache->depot[j]);
+ kfree(rcache->depot);
}

kfree(iovad->rcaches);
--
2.17.1


2023-08-11 14:17:15

by Robin Murphy

[permalink] [raw]
Subject: Re: [RESEND PATCH 0/2] iommu/iova: optimize the iova rcache

On 2023-08-11 14:02, Zhang Zekun wrote:
> The number of iova_cpu_rcaches can grow with the number of cpus, but the
> size of rcache->depot will not. A deeper rcache->depot helps iova_rcache
> cache more iovas, and helps iova_rcache better deal with scenarios in
> which drivers allocate and free iovas on different cpu cores. We only let
> the size of rcache->depot grow with the number of cpus when there are
> more than 32 cpus, to avoid a potential performance decrease on machines
> without many cpus.

Oh, that reminds me I had started working on this again just before I
was off, only my plan was to get rid of the depot allocation
altogether[1]. However I wanted to finish the second patch on that
branch (to reclaim wasted memory from the depot if the workload changes)
before posting anything - I was feeling really pleased that I'd made it
work until I realised it was leaking all the actual IOVAs... :(

Thanks,
Robin.

[1]
https://gitlab.arm.com/linux-arm/linux-rm/-/commit/2f37ebe93eb282b534bf9e0fd4adc66cfe4b6693

>
> Also, it is unsafe to directly free the cpu rcache magazines in
> free_iova_rcaches(), so add a check before freeing them.
>
> Zhang Zekun (2):
> iommu/iova: Add check for cpu_rcache in free_iova_rcaches
> iommu/iova: allocate iova_rcache->depot dynamically
>
> drivers/iommu/iova.c | 36 +++++++++++++++++++++++++++++++-----
> 1 file changed, 31 insertions(+), 5 deletions(-)
>

2023-08-11 15:42:22

by John Garry

[permalink] [raw]
Subject: Re: [RESEND PATCH 2/2] iommu/iova: allocate iova_rcache->depot dynamically

On 11/08/2023 14:02, Zhang Zekun wrote:
> In an fio test with 4k, read, and allowed cpus set to 0-255, we observe a
> performance decrease in IOPS. The normal IOPS

What do you mean by normal IOPS? Describe this "normal" scenario.

> can reach up to 1980k, but we can only
> get about 1600k.
>
> abnormal IOPS:
> Jobs: 12 (f=12): [R(12)][99.3%][r=6220MiB/s][r=1592k IOPS][eta 00m:12s]
> Jobs: 12 (f=12): [R(12)][99.4%][r=6215MiB/s][r=1591k IOPS][eta 00m:11s]
> Jobs: 12 (f=12): [R(12)][99.4%][r=6335MiB/s][r=1622k IOPS][eta 00m:10s]
> Jobs: 12 (f=12): [R(12)][99.5%][r=6194MiB/s][r=1586k IOPS][eta 00m:09s]
> Jobs: 12 (f=12): [R(12)][99.6%][r=6173MiB/s][r=1580k IOPS][eta 00m:08s]
> Jobs: 12 (f=12): [R(12)][99.6%][r=5984MiB/s][r=1532k IOPS][eta 00m:07s]
> Jobs: 12 (f=12): [R(12)][99.7%][r=6374MiB/s][r=1632k IOPS][eta 00m:06s]
> Jobs: 12 (f=12): [R(12)][99.7%][r=6343MiB/s][r=1624k IOPS][eta 00m:05s]
>
> normal IOPS:
> Jobs: 12 (f=12): [R(12)][99.3%][r=7736MiB/s][r=1980k IOPS][eta 00m:12s]
> Jobs: 12 (f=12): [R(12)][99.4%][r=7744MiB/s][r=1982k IOPS][eta 00m:11s]
> Jobs: 12 (f=12): [R(12)][99.4%][r=7737MiB/s][r=1981k IOPS][eta 00m:10s]
> Jobs: 12 (f=12): [R(12)][99.5%][r=7735MiB/s][r=1980k IOPS][eta 00m:09s]
> Jobs: 12 (f=12): [R(12)][99.6%][r=7741MiB/s][r=1982k IOPS][eta 00m:08s]
> Jobs: 12 (f=12): [R(12)][99.6%][r=7740MiB/s][r=1982k IOPS][eta 00m:07s]
> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1981k IOPS][eta 00m:06s]
> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1980k IOPS][eta 00m:05s]
>
> The current iova_rcache structure has an iova_cpu_rcache for every
> cpu, and these iova_cpu_rcaches use a common buffer, iova_rcache->depot,
> to exchange iovas among themselves. A machine with 256 cpus will have
> 256 iova_cpu_rcaches and one iova_rcache->depot per iova_domain. However,
> the max size of iova_rcache->depot is fixed at MAX_GLOBAL_MAGS (32) and
> cannot grow with the number of cpus, which can cause problems under some
> conditions.
>
> Some drivers only free iovas in their irq callback functions. The driver
> in this case has 16 threaded irqs to free iovas, but these irq callbacks
> only free iovas on 16 particular cpus (cpu{0,16,32,...,240}). A threaded
> irq whose smp affinity is 0-15 will only free iovas on cpu 0. However,
> the driver allocates iovas on all cpus (cpu{0-255}), so cpus without free
> iovas in their local cpu_rcache need to get them from iova_rcache->depot.
> The current max size of iova_rcache->depot is 32, which seems too small
> for 256 users (16 cpus put iovas into iova_rcache->depot and 240 cpus try
> to get iovas from it). Let iova_rcache->depot grow with the number of
> possible cpus, and the decrease in IOPS disappears.

Doesn't it take a long time for all the depots to fill for you? From the
description, this sounds like the hisilicon SAS controller which you are
experimenting with, and I found there that it took a long time for the
depots to fill in high-throughput testing.

Thanks,
John

>
> Signed-off-by: Zhang Zekun <[email protected]>
> ---
> drivers/iommu/iova.c | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
> index 3c784a28e9ed..df37a4501e98 100644
> --- a/drivers/iommu/iova.c
> +++ b/drivers/iommu/iova.c
> @@ -238,6 +238,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
>
> static struct kmem_cache *iova_cache;
> static unsigned int iova_cache_users;
> +static unsigned int max_global_mags;
> static DEFINE_MUTEX(iova_cache_mutex);
>
> static struct iova *alloc_iova_mem(void)
> @@ -625,7 +626,6 @@ EXPORT_SYMBOL_GPL(reserve_iova);
> * will be wasted.
> */
> #define IOVA_MAG_SIZE 127
> -#define MAX_GLOBAL_MAGS 32 /* magazines per bin */
>
> struct iova_magazine {
> unsigned long size;
> @@ -641,7 +641,7 @@ struct iova_cpu_rcache {
> struct iova_rcache {
> spinlock_t lock;
> unsigned long depot_size;
> - struct iova_magazine *depot[MAX_GLOBAL_MAGS];
> + struct iova_magazine **depot;
> struct iova_cpu_rcache __percpu *cpu_rcaches;
> };
>
> @@ -722,6 +722,13 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)
> unsigned int cpu;
> int i, ret;
>
> + /*
> + * the size of max global mags should growth with the num of
> + * cpus
> + */
> + if (!max_global_mags)
> + max_global_mags = max_t(unsigned int, 32, num_possible_cpus());
> +
> iovad->rcaches = kcalloc(IOVA_RANGE_CACHE_MAX_SIZE,
> sizeof(struct iova_rcache),
> GFP_KERNEL);
> @@ -733,6 +740,12 @@ int iova_domain_init_rcaches(struct iova_domain *iovad)
> struct iova_rcache *rcache;
>
> rcache = &iovad->rcaches[i];
> + rcache->depot = kcalloc(max_global_mags, sizeof(struct iova_magazine *),
> + GFP_KERNEL);
> + if (!rcache->depot) {
> + ret = -ENOMEM;
> + goto out_err;
> + }
> spin_lock_init(&rcache->lock);
> rcache->depot_size = 0;
> rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache),
> @@ -798,7 +811,7 @@ static bool __iova_rcache_insert(struct iova_domain *iovad,
>
> if (new_mag) {
> spin_lock(&rcache->lock);
> - if (rcache->depot_size < MAX_GLOBAL_MAGS) {
> + if (rcache->depot_size < max_global_mags) {
> rcache->depot[rcache->depot_size++] =
> cpu_rcache->loaded;
> } else {
> @@ -903,8 +916,12 @@ static void free_iova_rcaches(struct iova_domain *iovad)
>
> for (i = 0; i < IOVA_RANGE_CACHE_MAX_SIZE; ++i) {
> rcache = &iovad->rcaches[i];
> - if (!rcache->cpu_rcaches)
> + if (!rcache->depot)
> + break;
> + if (!rcache->cpu_rcaches) {
> + kfree(rcache->depot);
> break;
> + }
> for_each_possible_cpu(cpu) {
> cpu_rcache = per_cpu_ptr(rcache->cpu_rcaches, cpu);
> if (!cpu_rcache->loaded)
> @@ -917,6 +934,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
> free_percpu(rcache->cpu_rcaches);
> for (j = 0; j < rcache->depot_size; ++j)
> iova_magazine_free(rcache->depot[j]);
> + kfree(rcache->depot);
> }
>
> kfree(iovad->rcaches);


2023-08-12 02:57:53

by zhangzekun (A)

[permalink] [raw]
Subject: Re: [RESEND PATCH 1/2] iommu/iova: Add check for cpu_rcache in free_iova_rcaches

On 2023/8/11 21:32, Robin Murphy wrote:
> On 2023-08-11 14:02, Zhang Zekun wrote:
>> free_iova_rcaches() needs to check whether cpu_rcache->loaded and
>> cpu_rcache->prev are NULL before freeing them.
>
> Why? iova_magazine_free() is just kfree(), and kfree(NULL) is
> perfectly valid, specifically to avoid having to make cleanup paths
> all fiddly and overcomplicated like this.
>
> Thanks,
> Robin.
>
Hi, Robin
Thanks for your review. I had missed that kfree() can handle NULL. It is
also safe to iterate through all cpus, because __alloc_percpu() allocates
a zero-filled area, so the pointers passed to kfree() will be either NULL
or valid ones. There is no need to add checks before freeing these
pointers.

Thanks,
Zekun

2023-08-12 07:46:54

by zhangzekun (A)

[permalink] [raw]
Subject: Re: [RESEND PATCH 2/2] iommu/iova: allocate iova_rcache->depot dynamically



On 2023/8/11 22:14, John Garry wrote:
> On 11/08/2023 14:02, Zhang Zekun wrote:
>> In an fio test with 4k, read, and allowed cpus set to 0-255, we observe a
>> performance
>> decrease in IOPS. The normal IOPS
>
> What do you mean by normal IOPS? Describe this "normal" scenario.
>
Hi, John

The reason why I think 1980K is normal is that I have tested the
iova_rcache hit rate with all iova sizes; the average iova cache hit rate
can reach up to around 99% (varying with workload and iova size), and I
think iova_rcache behaves well in our test workloads. Besides, the IOPS
behaves as expected, which was acked by our test group.

>> can reach up to 1980k, but we can only
>> get about 1600k.
>>
>> abnormal IOPS:
>> Jobs: 12 (f=12): [R(12)][99.3%][r=6220MiB/s][r=1592k IOPS][eta 00m:12s]
>> Jobs: 12 (f=12): [R(12)][99.4%][r=6215MiB/s][r=1591k IOPS][eta 00m:11s]
>> Jobs: 12 (f=12): [R(12)][99.4%][r=6335MiB/s][r=1622k IOPS][eta 00m:10s]
>> Jobs: 12 (f=12): [R(12)][99.5%][r=6194MiB/s][r=1586k IOPS][eta 00m:09s]
>> Jobs: 12 (f=12): [R(12)][99.6%][r=6173MiB/s][r=1580k IOPS][eta 00m:08s]
>> Jobs: 12 (f=12): [R(12)][99.6%][r=5984MiB/s][r=1532k IOPS][eta 00m:07s]
>> Jobs: 12 (f=12): [R(12)][99.7%][r=6374MiB/s][r=1632k IOPS][eta 00m:06s]
>> Jobs: 12 (f=12): [R(12)][99.7%][r=6343MiB/s][r=1624k IOPS][eta 00m:05s]
>>
>> normal IOPS:
>> Jobs: 12 (f=12): [R(12)][99.3%][r=7736MiB/s][r=1980k IOPS][eta 00m:12s]
>> Jobs: 12 (f=12): [R(12)][99.4%][r=7744MiB/s][r=1982k IOPS][eta 00m:11s]
>> Jobs: 12 (f=12): [R(12)][99.4%][r=7737MiB/s][r=1981k IOPS][eta 00m:10s]
>> Jobs: 12 (f=12): [R(12)][99.5%][r=7735MiB/s][r=1980k IOPS][eta 00m:09s]
>> Jobs: 12 (f=12): [R(12)][99.6%][r=7741MiB/s][r=1982k IOPS][eta 00m:08s]
>> Jobs: 12 (f=12): [R(12)][99.6%][r=7740MiB/s][r=1982k IOPS][eta 00m:07s]
>> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1981k IOPS][eta 00m:06s]
>> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1980k IOPS][eta 00m:05s]
>>
>> The current struct of iova_rcache will have iova_cpu_rcache for every
>> cpu, and these iova_cpu_rcaches use a common buffer
>> iova_rcache->depot to
>> exchange iovas among iova_cpu_rcaches. A machine with 256 cpus will have
>> 256 iova_cpu_rcaches and 1 iova_rcache->depot per iova_domain. However,
>> the max size of iova_rcache->depot is fixed to MAX_GLOBAL_MAGS which
>> equals
>> to 32, and can't grow with the number of cpus, and this can cause
>> problem
>> in some condition.
>>
>> Some drivers will only free iovas in their irq call back function. For
>> the driver in this case it has 16 thread irqs to free iova, but these
>> irq call back function will only free iovas on 16 certain cpus(cpu{0,16,
>> 32...,240}). Thread irq which smp affinity is 0-15, will only free
>> iova on
>> cpu 0. However, the driver will alloc iova on all cpus(cpu{0-255}), cpus
>> without free iova in local cpu_rcache need to get free iovas from
>> iova_rcache->depot. The current size of iova_rcache->depot max size
>> is 32,
>> and it seems to be too small for 256 users (16 cpus will put iovas to
>> iova_rcache->depot and 240 cpus will try to get iova from it). Set
>> iova_rcache->depot grow with the num of possible cpus, and the decrease
>> of IOPS disappear.
>
> Doesn't it take a long time for all the depots to fill for you? From
> the description, this sounds like the hisilicon SAS controller which
> you are experimenting with and I found there that it took a long time
> for the depots to fill for high throughput testing.
>
> Thanks,
> John

Yes, it will take more time to fill rcache->depot, but we don't need to
fill all depots before using the iovas in them. We can use iovas in
rcache->depot whenever the local cpu_rcache is empty and there are iova
magazines in rcache->depot, which means "rcache->depot_size > 0". A
larger depot size helps cache more iova magazines freed in
__iova_rcache_insert(), which means more potential memory cost for
iova_rcache, but should not introduce performance issues.

2023-08-19 21:16:55

by John Garry

[permalink] [raw]
Subject: Re: [RESEND PATCH 2/2] iommu/iova: allocate iova_rcache->depot dynamically

On 12/08/2023 08:18, zhangzekun (A) wrote:
>
>
> On 2023/8/11 22:14, John Garry wrote:
>> On 11/08/2023 14:02, Zhang Zekun wrote:
>>> In an fio test with 4k, read, and allowed cpus set to 0-255, we observe a
>>> performance
>>> decrease in IOPS.
> The normal IOPS
>>
>> What do you mean by normal IOPS? Describe this "normal" scenario.
>>
> Hi, John
>
> The reason why I think 1980K is normal is that I have tested the iova_rcache
> hit rate with all iova sizes; the average iova cache hit rate can reach
> up to

Sorry to say, but I still don't know what you mean by the terms "normal"
and "abnormal" here. Is "normal" prior to the drop in IOPS which you
mention above? If so, at what time does this occur?

> around 99% (varys from diffent work loads and iova size), and I think
> iova_rcache behave well in our test work loads. Besides, the IOPS is
> behaving as expect which is acked by our test group.
>
>>> can reach up to 1980k, but we can only
>>> get about 1600k.
>>>
>>> abnormal IOPS:
>>> Jobs: 12 (f=12): [R(12)][99.3%][r=6220MiB/s][r=1592k IOPS][eta 00m:12s]
>>> Jobs: 12 (f=12): [R(12)][99.4%][r=6215MiB/s][r=1591k IOPS][eta 00m:11s]
>>> Jobs: 12 (f=12): [R(12)][99.4%][r=6335MiB/s][r=1622k IOPS][eta 00m:10s]
>>> Jobs: 12 (f=12): [R(12)][99.5%][r=6194MiB/s][r=1586k IOPS][eta 00m:09s]
>>> Jobs: 12 (f=12): [R(12)][99.6%][r=6173MiB/s][r=1580k IOPS][eta 00m:08s]
>>> Jobs: 12 (f=12): [R(12)][99.6%][r=5984MiB/s][r=1532k IOPS][eta 00m:07s]
>>> Jobs: 12 (f=12): [R(12)][99.7%][r=6374MiB/s][r=1632k IOPS][eta 00m:06s]
>>> Jobs: 12 (f=12): [R(12)][99.7%][r=6343MiB/s][r=1624k IOPS][eta 00m:05s]
>>>
>>> normal IOPS:
>>> Jobs: 12 (f=12): [R(12)][99.3%][r=7736MiB/s][r=1980k IOPS][eta 00m:12s]
>>> Jobs: 12 (f=12): [R(12)][99.4%][r=7744MiB/s][r=1982k IOPS][eta 00m:11s]
>>> Jobs: 12 (f=12): [R(12)][99.4%][r=7737MiB/s][r=1981k IOPS][eta 00m:10s]
>>> Jobs: 12 (f=12): [R(12)][99.5%][r=7735MiB/s][r=1980k IOPS][eta 00m:09s]
>>> Jobs: 12 (f=12): [R(12)][99.6%][r=7741MiB/s][r=1982k IOPS][eta 00m:08s]
>>> Jobs: 12 (f=12): [R(12)][99.6%][r=7740MiB/s][r=1982k IOPS][eta 00m:07s]
>>> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1981k IOPS][eta 00m:06s]
>>> Jobs: 12 (f=12): [R(12)][99.7%][r=7736MiB/s][r=1980k IOPS][eta 00m:05s]
>>>
>>> The current struct of iova_rcache will have iova_cpu_rcache for every
>>> cpu, and these iova_cpu_rcaches use a common buffer
>>> iova_rcache->depot to
>>> exchange iovas among iova_cpu_rcaches. A machine with 256 cpus will have
>>> 256 iova_cpu_rcaches and 1 iova_rcache->depot per iova_domain. However,
>>> the max size of iova_rcache->depot is fixed to MAX_GLOBAL_MAGS which
>>> equals
>>> to 32, and can't grow with the number of cpus, and this can cause
>>> problem
>>> in some condition.
>>>
>>> Some drivers will only free iovas in their irq call back function. For
>>> the driver in this case it has 16 thread irqs to free iova, but these
>>> irq call back function will only free iovas on 16 certain cpus(cpu{0,16,
>>> 32...,240}). Thread irq which smp affinity is 0-15, will only free
>>> iova on
>>> cpu 0. However, the driver will alloc iova on all cpus(cpu{0-255}), cpus
>>> without free iova in local cpu_rcache need to get free iovas from
>>> iova_rcache->depot. The current size of iova_rcache->depot max size
>>> is 32,
>>> and it seems to be too small for 256 users (16 cpus will put iovas to
>>> iova_rcache->depot and 240 cpus will try to get iova from it). Set
>>> iova_rcache->depot grow with the num of possible cpus, and the decrease
>>> of IOPS disappear.
>>
>> Doesn't it take a long time for all the depots to fill for you? From
>> the description, this sounds like the hisilicon SAS controller which
>> you are experimenting with and I found there that it took a long time
>> for the depots to fill for high throughput testing.
>>
>> Thanks,
>> John
>
> Yes, it will take more time to fill rcache->depot, but we don't need to
> fill all depots before using the iovas in them. We can use iovas in
> rcache->depot whenever the local cpu_rcache is empty and there are iova
> magazines in rcache->depot, which means "rcache->depot_size > 0". A
> larger depot size helps cache more iova magazines freed in
> __iova_rcache_insert(), which means more potential memory cost for
> iova_rcache, but should not introduce performance issues.

Thanks,
John


2023-08-20 03:35:13

by zhangzekun (A)

[permalink] [raw]
Subject: Re: [RESEND PATCH 2/2] iommu/iova: allocate iova_rcache->depot dynamically



On 2023/8/15 23:15, John Garry wrote:
> On 12/08/2023 08:18, zhangzekun (A) wrote:
>>
>>
>> On 2023/8/11 22:14, John Garry wrote:
>>> On 11/08/2023 14:02, Zhang Zekun wrote:
>>>> In fio test with 4k,read,and allowed cpus to 0-255, we observe a
>>>> performance
>>>> decrease of IOPS.
>> The normal IOPS
>>>
>>> What do you mean by normal IOPS? Describe this "normal" scenario.
>>>
>> Hi, John
>>
>> The reason why I think 1980K is normal is that I have tested the
>> iova_rcache
>> hit rate with all iova sizes; the average iova cache hit rate can
>> reach up to
>
> Sorry to say, but I still don't know what you mean by the terms
> "normal" and "abnormal" here. Is "normal" prior to the drop in IOPS
> which you mention above? If so, at what time does this occur?
Hi, John

The decrease was first observed in a high-concurrency fio test based on
the openEuler 5.10 kernel, where the IOPS was about 300-400K, which is
quite abnormal given that some lower-concurrency fio tests can reach
more than 1000K IOPS. The flame graph showed that iovas allocated from
the slow path alloc_iova() took a high percentage, and I found that the
current iova_rcache structure can behave poorly on a machine with 256
cores when the device driver does not free and allocate iovas evenly
under a heavy workload. After optimizing the iova_rcache, the IOPS came
to more than 1900K on the openEuler 5.10 kernel. The mainline kernel has
the same problem as openEuler 5.10, but the IOPS does not improve by
such a large percentage. I used the terms "normal" and "abnormal" here;
maybe this pair of words should be replaced by "before optimization" and
"after optimization".

Thanks,
Zekun