2018-05-17 20:56:08

by Qing Huang

[permalink] [raw]
Subject: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks

When a system is under memory presure (high usage with fragments),
the original 256KB ICM chunk allocations will likely trigger kernel
memory management to enter slow path doing memory compact/migration
ops in order to complete high order memory allocations.

When that happens, user processes calling uverb APIs may get stuck
for more than 120s easily even though there are a lot of free pages
in smaller chunks available in the system.

Syslog:
...
Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
oracle_205573_e:205573 blocked for more than 120 seconds.
...

With 4KB ICM chunk size on x86_64 arch, the above issue is fixed.

However in order to support smaller ICM chunk size, we need to fix
another issue in large size kcalloc allocations.

E.g.
Setting log_num_mtt=30 requires 1G mtt entries. With the 4KB ICM chunk
size, each ICM chunk can only hold 512 mtt entries (8 bytes for each mtt
entry). So we need a 16MB allocation for a table->icm pointer array to
hold 2M pointers which can easily cause kcalloc to fail.

The solution is to use vzalloc to replace kcalloc. There is no need
for contiguous memory pages for a driver meta data structure (no need
of DMA ops).

Signed-off-by: Qing Huang <[email protected]>
Acked-by: Daniel Jurgens <[email protected]>
Reviewed-by: Zhu Yanjun <[email protected]>
---
v3: use PAGE_SIZE instead of PAGE_SHIFT
add comma to the end of enum variables
include vmalloc.h header file to avoid build issues on Sparc

v2: adjuste chunk size to reflect different architectures

drivers/net/ethernet/mellanox/mlx4/icm.c | 15 ++++++++-------
1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c b/drivers/net/ethernet/mellanox/mlx4/icm.c
index a822f7a..3705f4e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -33,6 +33,7 @@

#include <linux/errno.h>
#include <linux/mm.h>
+#include <linux/vmalloc.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

@@ -43,12 +44,12 @@
#include "fw.h"

/*
- * We allocate in as big chunks as we can, up to a maximum of 256 KB
- * per chunk.
+ * We allocate in page size (default 4KB on many archs) chunks to avoid high
+ * order memory allocations in fragmented/high usage memory situation.
*/
enum {
- MLX4_ICM_ALLOC_SIZE = 1 << 18,
- MLX4_TABLE_CHUNK_SIZE = 1 << 18
+ MLX4_ICM_ALLOC_SIZE = PAGE_SIZE,
+ MLX4_TABLE_CHUNK_SIZE = PAGE_SIZE,
};

static void mlx4_free_icm_pages(struct mlx4_dev *dev, struct mlx4_icm_chunk *chunk)
@@ -400,7 +401,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
obj_per_chunk = MLX4_TABLE_CHUNK_SIZE / obj_size;
num_icm = (nobj + obj_per_chunk - 1) / obj_per_chunk;

- table->icm = kcalloc(num_icm, sizeof(*table->icm), GFP_KERNEL);
+ table->icm = vzalloc(num_icm * sizeof(*table->icm));
if (!table->icm)
return -ENOMEM;
table->virt = virt;
@@ -446,7 +447,7 @@ int mlx4_init_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table,
mlx4_free_icm(dev, table->icm[i], use_coherent);
}

- kfree(table->icm);
+ vfree(table->icm);

return -ENOMEM;
}
@@ -462,5 +463,5 @@ void mlx4_cleanup_icm_table(struct mlx4_dev *dev, struct mlx4_icm_table *table)
mlx4_free_icm(dev, table->icm[i], table->coherent);
}

- kfree(table->icm);
+ vfree(table->icm);
}
--
2.9.3



2018-05-17 21:16:12

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks



On 05/17/2018 01:53 PM, Qing Huang wrote:
> When a system is under memory presure (high usage with fragments),
> the original 256KB ICM chunk allocations will likely trigger kernel
> memory management to enter slow path doing memory compact/migration
> ops in order to complete high order memory allocations.
>
> When that happens, user processes calling uverb APIs may get stuck
> for more than 120s easily even though there are a lot of free pages
> in smaller chunks available in the system.
>
> Syslog:
> ...
> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
> oracle_205573_e:205573 blocked for more than 120 seconds.
> ...
>


NACK on this patch.

You have been asked repeatedly to use kvmalloc()

This is not a minor suggestion.

Take a look at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

And you'll understand some people care about this.

Strongly.

Thanks.


2018-05-17 21:47:34

by Qing Huang

[permalink] [raw]
Subject: Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks



On 5/17/2018 2:14 PM, Eric Dumazet wrote:
> On 05/17/2018 01:53 PM, Qing Huang wrote:
>> When a system is under memory presure (high usage with fragments),
>> the original 256KB ICM chunk allocations will likely trigger kernel
>> memory management to enter slow path doing memory compact/migration
>> ops in order to complete high order memory allocations.
>>
>> When that happens, user processes calling uverb APIs may get stuck
>> for more than 120s easily even though there are a lot of free pages
>> in smaller chunks available in the system.
>>
>> Syslog:
>> ...
>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>> oracle_205573_e:205573 blocked for more than 120 seconds.
>> ...
>>
> NACK on this patch.
>
> You have been asked repeatedly to use kvmalloc()
>
> This is not a minor suggestion.
>
> Take a look athttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d

Would you please take a look at how table->icm is being used in the mlx4
driver? It's a meta data used for individual pointer variable referencing,
not as data frag or in/out buffer. It has no need for contiguous phy.
memory.

Thanks.

> And you'll understand some people care about this.
>
> Strongly.
>
> Thanks.
>


2018-05-22 15:35:37

by Tariq Toukan

[permalink] [raw]
Subject: Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks



On 18/05/2018 12:45 AM, Qing Huang wrote:
>
>
> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>> When a system is under memory presure (high usage with fragments),
>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>> memory management to enter slow path doing memory compact/migration
>>> ops in order to complete high order memory allocations.
>>>
>>> When that happens, user processes calling uverb APIs may get stuck
>>> for more than 120s easily even though there are a lot of free pages
>>> in smaller chunks available in the system.
>>>
>>> Syslog:
>>> ...
>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>> ...
>>>
>> NACK on this patch.
>>
>> You have been asked repeatedly to use kvmalloc()
>>
>> This is not a minor suggestion.
>>
>> Take a look
>> athttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>
>
> Would you please take a look at how table->icm is being used in the mlx4
> driver? It's a meta data used for individual pointer variable referencing,
> not as data frag or in/out buffer. It has no need for contiguous phy.
> memory.
>
> Thanks.
>

NACK.

This would cause a degradation when iterating the entries of table->icm.
For example, in mlx4_table_get_range.

Thanks,
Tariq

>> And you'll understand some people care about this.
>>
>> Strongly.
>>
>> Thanks.
>>
>

2018-05-23 01:43:39

by Qing Huang

[permalink] [raw]
Subject: Re: [PATCH v3] mlx4_core: allocate ICM memory in page size chunks



On 5/22/2018 8:33 AM, Tariq Toukan wrote:
>
>
> On 18/05/2018 12:45 AM, Qing Huang wrote:
>>
>>
>> On 5/17/2018 2:14 PM, Eric Dumazet wrote:
>>> On 05/17/2018 01:53 PM, Qing Huang wrote:
>>>> When a system is under memory presure (high usage with fragments),
>>>> the original 256KB ICM chunk allocations will likely trigger kernel
>>>> memory management to enter slow path doing memory compact/migration
>>>> ops in order to complete high order memory allocations.
>>>>
>>>> When that happens, user processes calling uverb APIs may get stuck
>>>> for more than 120s easily even though there are a lot of free pages
>>>> in smaller chunks available in the system.
>>>>
>>>> Syslog:
>>>> ...
>>>> Dec 10 09:04:51 slcc03db02 kernel: [397078.572732] INFO: task
>>>> oracle_205573_e:205573 blocked for more than 120 seconds.
>>>> ...
>>>>
>>> NACK on this patch.
>>>
>>> You have been asked repeatedly to use kvmalloc()
>>>
>>> This is not a minor suggestion.
>>>
>>> Take a look
>>> athttps://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d8c13f2271ec5178c52fbde072ec7b562651ed9d
>>>
>>
>> Would you please take a look at how table->icm is being used in the
>> mlx4 driver? It's a meta data used for individual pointer variable
>> referencing,
>> not as data frag or in/out buffer. It has no need for contiguous phy.
>> memory.
>>
>> Thanks.
>>
>
> NACK.
>
> This would cause a degradation when iterating the entries of table->icm.
> For example, in mlx4_table_get_range.
E.g.
int mlx4_table_get_range(struct mlx4_dev *dev, struct mlx4_icm_table *table,
                         u32 start, u32 end)
{
        int inc = MLX4_TABLE_CHUNK_SIZE / table->obj_size;
        int err;
        u32 i;

        for (i = start; i <= end; i += inc) {
                err = mlx4_table_get(dev, table, i);
                if (err)
                        goto fail;
        }

        return 0;
...
}

E.g. mtt obj is 8 bytes, so a 4KB ICM block would have 512 mtt objects.
So you will have to allocate
more 512 mtt objects in order to have table->icm pointer to increment by
1 to fetch next pointer
value.  So 256K mtt objects are needed in order to traverse table->icm
pointer across a page boundary
in the call stacks.

Considering mlx4_table_get_range() is only used in control path, there
is no significant gain by using kvzalloc
vs. vzalloc for table->icm.

Anyway, if a user makes sure mlx4 driver to be loaded very early and
doesn't remove and reload it afterwards,
we should have enough (and not wasting) contiguous phy mem for
table->icm allocation. I will use kvzalloc to
replace vzalloc and send a V4 patch.

Thanks,
Qing


>
> Thanks,
> Tariq
>
>>> And you'll understand some people care about this.
>>>
>>> Strongly.
>>>
>>> Thanks.
>>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html