2018-05-10 23:32:49

by Qing Huang

Subject: [PATCH] mlx4_core: allocate 4KB ICM chunks

When a system is under memory pressure (high usage with fragmentation),
the original 256KB ICM chunk allocations will likely force the kernel
memory management into its slow path, doing memory compaction/migration
in order to satisfy the high-order allocations.

When that happens, user processes calling uverbs APIs can easily get
stuck for more than 120s, even though plenty of free pages in smaller
chunks are still available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With a 4KB ICM chunk size, the above issue is fixed.

However, in order to support a 4KB ICM chunk size, we need to fix another
issue with large kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G MTT entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 MTT entries (8 bytes per MTT
entry). So we need a 16MB allocation for the table->icm pointer array to
hold 2M pointers, which can easily cause kcalloc to fail.
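
Spelled out (illustrative arithmetic only, assuming 8-byte MTT entries and
8-byte pointers on a 64-bit kernel):

  nobj          = 1 << 30                      (1G MTT entries)
  obj_per_chunk = 4096 / 8        = 512        (MTT entries per 4KB chunk)
  num_icm       = (1 << 30) / 512 = 1 << 21    (2M ICM chunks)
  array size    = (1 << 21) * 8   = 16 MB      (table->icm pointer array)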

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver metadata structure (no need
of DMA ops).

Signed-off-by: Qing Huang <[email protected]>
Acked-by: Daniel Jurgens <[email protected]>
---
drivers/net/ethernet/mellanox/mlx4/icm.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..2b17a4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -43,12 +43,12 @@
#include "fw.h"

/*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in 4KB page size chunks to avoid high order memory
+ * allocations in fragmented/high usage memory situation.
*/
enum {
- MLX4_ICM_ALLOC_SIZE = 1 << 18,
- MLX4_TABLE_CHUNK_SIZE = 1 << 18
+ MLX4_ICM_ALLOC_SIZE = 1 << 12,
+ MLX4_TABLE_CHUNK_SIZE = 1 << 12
};

static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +400,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

- table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+ table->icm = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +446,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}

- kfree(table->icm);
+ vfree(table->icm);

return -ENOMEM;
}
@@ -462,5 +462,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}

- kfree(table->icm);
+ vfree(table->icm);
}
--
2.9.3



2018-05-11 00:14:47

by Zhu Yanjun

Subject: Re: [PATCH] mlx4_core: allocate 4KB ICM chunks



On 2018/5/11 7:31, Qing Huang wrote:
> [...]
>
> The solution is to use vzalloc to replace kcalloc. There is no need
> for contiguous memory pages for a driver metadata structure (no need
Hi,

If contiguous memory pages are replaced with vmalloc'ed memory, is there
any performance loss?

Zhu Yanjun
> of DMA ops).


2018-05-11 01:38:14

by Qing Huang

Subject: Re: [PATCH] mlx4_core: allocate 4KB ICM chunks

Thank you for reviewing it!


On 5/10/2018 6:23 PM, Yanjun Zhu wrote:
>
>
>
> On 2018/5/11 9:15, Qing Huang wrote:
>>
>>
>>
>> On 5/10/2018 5:13 PM, Yanjun Zhu wrote:
>>>
>>>
>>> On 2018/5/11 7:31, Qing Huang wrote:
>>>> [...]
>>>>
>>>> The solution is to use vzalloc to replace kcalloc. There is no need
>>>> for contiguous memory pages for a driver metadata structure (no need
>>> Hi,
>>>
>>> If contiguous memory pages are replaced with vmalloc'ed memory, is there
>>> any performance loss?
>>
>> Not really. "table->icm" will be accessed as individual pointer
>> variables randomly. Kcalloc
>
> Sure. Thanks. If "table->icm" is accessed as individual pointer
> variables randomly, the performance loss caused by non-contiguous
> memory will be negligible.
>
> Reviewed-by: Zhu Yanjun <[email protected]>
>
>> also returns a virtual address, except that its mapped pages are
>> guaranteed to be physically contiguous, which provides little advantage
>> over vzalloc for individual pointer variable accesses.
>>
>> Qing
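
For illustration only (this is not what the posted patch does): the kernel
also provides kvzalloc()/kvfree(), which try a physically contiguous
kmalloc first and fall back to vmalloc when that fails. A minimal sketch of
the same allocation written with those helpers, reusing num_icm from the
patch, would look like:

  /* sketch only: try kmalloc first, fall back to vmalloc on failure */
  table->icm = kvzalloc(num_icm * sizeof(*table->icm), GFP_KERNEL);
  if (!table->icm)
          return -ENOMEM;
  ...
  kvfree(table->icm);  /* frees either a kmalloc'ed or vmalloc'ed buffer */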


2018-05-11 10:28:54

by Håkon Bugge

Subject: Re: [PATCH] mlx4_core: allocate 4KB ICM chunks



> On 11 May 2018, at 01:31, Qing Huang <[email protected]> wrote:
>
> [...]
>
> @@ -43,12 +43,12 @@
> #include "fw.h"
>
> /*
> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
> - * per chunk.
> + * We allocate in 4KB page size chunks to avoid high order memory
> + * allocations in fragmented/high usage memory situation.
> */
> enum {
> - MLX4_ICM_ALLOC_SIZE = 1 << 18,
> - MLX4_TABLE_CHUNK_SIZE = 1 << 18
> + MLX4_ICM_ALLOC_SIZE = 1 << 12,
> + MLX4_TABLE_CHUNK_SIZE = 1 << 12

Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB.


Thxs, Håkon



2018-05-11 19:18:01

by Qing Huang

Subject: Re: [PATCH] mlx4_core: allocate 4KB ICM chunks


On 5/11/2018 3:27 AM, Håkon Bugge wrote:
>> On 11 May 2018, at 01:31, Qing Huang <[email protected]> wrote:
>>
>> [...]
>>
>> @@ -43,12 +43,12 @@
>> #include "fw.h"
>>
>> /*
>> - * We allocate in as big chunks as we can, up to a maximum of 256 KB
>> - * per chunk.
>> + * We allocate in 4KB page size chunks to avoid high order memory
>> + * allocations in fragmented/high usage memory situation.
>> */
>> enum {
>> - MLX4_ICM_ALLOC_SIZE = 1 << 18,
>> - MLX4_TABLE_CHUNK_SIZE = 1 << 18
>> + MLX4_ICM_ALLOC_SIZE = 1 << 12,
>> + MLX4_TABLE_CHUNK_SIZE = 1 << 12
> Shouldn’t these be the arch’s page size order? E.g., if running on SPARC, the hw page size is 8KiB.

Good point on supporting a wider range of architectures. I had tunnel
vision while fixing this on our x64 lab machines.
Will send a v2 patch.

Thanks,
Qing
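
For reference, a minimal sketch of an arch-independent variant (illustrative
only; the actual v2 may differ) would derive the chunk size from the
kernel's PAGE_SIZE macro instead of hard-coding 4KB:

  /* sketch: follow the architecture's page size instead of a fixed 4KB */
  #define MLX4_ICM_ALLOC_SIZE     PAGE_SIZE
  #define MLX4_TABLE_CHUNK_SIZE   PAGE_SIZE

On x86-64 this stays at 4KB; on SPARC with 8KiB pages, or on arm64/ppc64
configs using 64KiB pages, the chunks follow the page size automatically.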
