To update the xlt (during mlx5_ib_reg_user_mr()), the driver can request up
to 1 MB (order-8) of memory, depending on the size of the MR. This costly
allocation can sometimes take a very long time to return (a few seconds),
especially if the system is fragmented and does not have any free chunks
for orders >= 3. This causes the calling application to hang for a long
time. To avoid these long latency spikes, limit the max order of allocation
to order 3, and reuse that buffer for populate_xlt() for that MR. This will
increase the latency slightly (on the order of microseconds) for each
mlx5_ib_update_xlt() call, especially for larger MRs (since we're making
multiple calls to populate_xlt()), but it's a small price to pay to avoid
the large latency spikes of higher order allocations. The __GFP_NORETRY
flag is used while fetching the free pages to ensure that there are no
long compaction stalls when the system's memory is fragmented.
Signed-off-by: Praveen Kumar Kannoju <[email protected]>
---
drivers/infiniband/hw/mlx5/mr.c | 22 +++-------------------
1 file changed, 3 insertions(+), 19 deletions(-)
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index db05b0e..dac19f0 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1004,9 +1004,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
return mr;
}
-#define MLX5_MAX_UMR_CHUNK ((1 << (MLX5_MAX_UMR_SHIFT + 4)) - \
- MLX5_UMR_MTT_ALIGNMENT)
-#define MLX5_SPARE_UMR_CHUNK 0x10000
+#define MLX5_SPARE_UMR_CHUNK 0x8000
/*
* Allocate a temporary buffer to hold the per-page information to transfer to
@@ -1028,30 +1026,16 @@ static void *mlx5_ib_alloc_xlt(size_t *nents, size_t ent_size, gfp_t gfp_mask)
*/
might_sleep();
- gfp_mask |= __GFP_ZERO;
+ gfp_mask |= __GFP_ZERO | __GFP_NORETRY;
- /*
- * If the system already has a suitable high order page then just use
- * that, but don't try hard to create one. This max is about 1M, so a
- * free x86 huge page will satisfy it.
- */
size = min_t(size_t, ent_size * ALIGN(*nents, xlt_chunk_align),
- MLX5_MAX_UMR_CHUNK);
+ MLX5_SPARE_UMR_CHUNK);
*nents = size / ent_size;
res = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN,
get_order(size));
if (res)
return res;
- if (size > MLX5_SPARE_UMR_CHUNK) {
- size = MLX5_SPARE_UMR_CHUNK;
- *nents = get_order(size) / ent_size;
- res = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN,
- get_order(size));
- if (res)
- return res;
- }
-
*nents = PAGE_SIZE / ent_size;
res = (void *)__get_free_page(gfp_mask);
if (res)
--
1.8.3.1
Ping.
Requesting the reviewers to please go through the patch and let us know if you have any queries about it.
-
Praveen Kumar Kannoju.
-----Original Message-----
From: Praveen Kannoju [mailto:[email protected]]
Sent: 16 March 2021 06:39 PM
To: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]
Cc: Rajesh Sivaramasubramaniom <[email protected]>; Rama Nichanamatlu <[email protected]>; Aruna Ramakrishna <[email protected]>; Jeffery Yoder <[email protected]>; Praveen Kannoju <[email protected]>
Subject: [PATCH v2] IB/mlx5: Reduce max order of memory allocated for xlt update
On Tue, Mar 16, 2021 at 01:09:01PM +0000, Praveen Kumar Kannoju wrote:
> To update xlt (during mlx5_ib_reg_user_mr()), the driver can request up to
> 1 MB (order-8) memory, depending on the size of the MR. This costly
> allocation can sometimes take very long to return (a few seconds),
> especially if the system is fragmented and does not have any free chunks
> for orders >= 3. This causes the calling application to hang for a long
> time. To avoid these long latency spikes, limit max order of allocation to
> order 3, and reuse that buffer to populate_xlt() for that MR. This will
> increase the latency slightly (in the order of microseconds) for each
> mlx5_ib_update_xlt() call, especially for larger MRs (since we're making
> multiple calls to populate_xlt()), but it's a small price to pay to avoid
> the large latency spikes with higher order allocations. The flag
> __GFP_NORETRY is used while fetching the free pages to ensure that there
> are no long compaction stalls when the system's memory is in fragmented
> condition.
>
> Signed-off-by: Praveen Kumar Kannoju <[email protected]>
> drivers/infiniband/hw/mlx5/mr.c | 22 +++-------------------
> 1 file changed, 3 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index db05b0e..dac19f0 100644
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1004,9 +1004,7 @@ static struct mlx5_ib_mr *alloc_cacheable_mr(struct ib_pd *pd,
> return mr;
> }
>
> -#define MLX5_MAX_UMR_CHUNK ((1 << (MLX5_MAX_UMR_SHIFT + 4)) - \
> - MLX5_UMR_MTT_ALIGNMENT)
> -#define MLX5_SPARE_UMR_CHUNK 0x10000
> +#define MLX5_SPARE_UMR_CHUNK 0x8000
>
> /*
> * Allocate a temporary buffer to hold the per-page information to transfer to
> @@ -1028,30 +1026,16 @@ static void *mlx5_ib_alloc_xlt(size_t *nents, size_t ent_size, gfp_t gfp_mask)
> */
> might_sleep();
>
> - gfp_mask |= __GFP_ZERO;
> + gfp_mask |= __GFP_ZERO | __GFP_NORETRY;
>
> - /*
> - * If the system already has a suitable high order page then just use
> - * that, but don't try hard to create one. This max is about 1M, so a
> - * free x86 huge page will satisfy it.
> - */
> size = min_t(size_t, ent_size * ALIGN(*nents, xlt_chunk_align),
> - MLX5_MAX_UMR_CHUNK);
> + MLX5_SPARE_UMR_CHUNK);
> *nents = size / ent_size;
> res = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN,
> get_order(size));
> if (res)
> return res;
>
> - if (size > MLX5_SPARE_UMR_CHUNK) {
> - size = MLX5_SPARE_UMR_CHUNK;
> - *nents = get_order(size) / ent_size;
> - res = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN,
> - get_order(size));
> - if (res)
> - return res;
> - }
Why did you delete this and make the size smaller? Isn't GFP_NORETRY
enough?
Jason
On Tue, Mar 23, 2021 at 12:41:51PM -0700, Aruna Ramakrishna wrote:
> There is a far greater possibility of an order-8 allocation failing,
> esp. with the addition of __GFP_NORETRY , and the code would have to
> fall back to a lower order allocation more often than not (esp. on a
> long running system). Unless the performance gains from using order-8
> pages is significant (and it does not seem that way to me), we can just
> skip this step and directly go to the lower order allocation.
Do not send HTML mails.
Do you have benchmarks that show the performance of the high order
pages is not relevant? I'm a bit surprised to hear that.
This code really needs some attention to use a proper
scatter/gather. I understand the chip can do it, just some of the
software layers need to be stripped away so it can form the right SGL
in the HW.
Jason
> On Mar 23, 2021, at 4:13 PM, Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Mar 23, 2021 at 12:41:51PM -0700, Aruna Ramakrishna wrote:
>> There is a far greater possibility of an order-8 allocation failing,
>> esp. with the addition of __GFP_NORETRY , and the code would have to
>> fall back to a lower order allocation more often than not (esp. on a
>> long running system). Unless the performance gains from using order-8
>> pages is significant (and it does not seem that way to me), we can just
>> skip this step and directly go to the lower order allocation.
>
> Do not send HTML mails.
I apologize; I’ve fixed the setting now.
>
> Do you have benchmarks that show the performance of the high order
> pages is not relevant? I'm a bit surprised to hear that.
>
I guess my point was more to the effect that an order-8 alloc will fail more often than not, in this flow. For instance, when we were debugging the latency spikes here, this was the typical buddyinfo output on that system:
Node 0, zone DMA 0 1 1 2 3 0 1 0 1 1 3
Node 0, zone DMA32 7 7 7 6 10 2 6 7 6 2 306
Node 0, zone Normal 3390 51354 17574 6556 1586 26 2 1 0 0 0
Node 1, zone Normal 11519 23315 23306 9738 73 2 0 1 0 0 0
I think this level of fragmentation is pretty normal on long running systems. Here, in the reg_mr flow, the first try (order-8) alloc will probably fail 9 times out of 10 (esp. after the addition of GFP_NORETRY flag), and then as fallback, the code tries to allocate a lower order, and if that too fails, it allocates a page. I think it makes sense to just avoid trying an order-8 alloc here.
Thanks,
Aruna
> This code really needs some attention to use a proper
> scatter/gather. I understand the chip can do it, just some of the
> software layers need to be stripped away so it can form the right SGL
> in the HW.
>
> Jason
On Tue, Mar 23, 2021 at 09:27:38PM -0700, Aruna Ramakrishna wrote:
> > Do you have benchmarks that show the performance of the high order
> > pages is not relevant? I'm a bit surprised to hear that.
> >
>
> I guess my point was more to the effect that an order-8 alloc will
> fail more often than not, in this flow. For instance, when we were
> debugging the latency spikes here, this was the typical buddyinfo
> output on that system:
>
> Node 0, zone DMA 0 1 1 2 3 0 1 0 1 1 3
> Node 0, zone DMA32 7 7 7 6 10 2 6 7 6 2 306
> Node 0, zone Normal 3390 51354 17574 6556 1586 26 2 1 0 0 0
> Node 1, zone Normal 11519 23315 23306 9738 73 2 0 1 0 0 0
>
> I think this level of fragmentation is pretty normal on long running
> systems. Here, in the reg_mr flow, the first try (order-8) alloc
> will probably fail 9 times out of 10 (esp. after the addition of
> GFP_NORETRY flag), and then as fallback, the code tries to allocate
> a lower order, and if that too fails, it allocates a page. I think
> it makes sense to just avoid trying an order-8 alloc here.
But a system like this won't get THPs either, so I'm not sure it is
relevant. The function was designed as it is to consume a "THP" if it
is available.
Jason
On Thu, Mar 25, 2021 at 11:39:28AM -0300, Jason Gunthorpe wrote:
> On Tue, Mar 23, 2021 at 09:27:38PM -0700, Aruna Ramakrishna wrote:
>
> > > Do you have benchmarks that show the performance of the high order
> > > pages is not relevant? I'm a bit surprised to hear that.
> > >
> >
> > I guess my point was more to the effect that an order-8 alloc will
> > fail more often than not, in this flow. For instance, when we were
> > debugging the latency spikes here, this was the typical buddyinfo
> > output on that system:
> >
> > Node 0, zone DMA 0 1 1 2 3 0 1 0 1 1 3
> > Node 0, zone DMA32 7 7 7 6 10 2 6 7 6 2 306
> > Node 0, zone Normal 3390 51354 17574 6556 1586 26 2 1 0 0 0
> > Node 1, zone Normal 11519 23315 23306 9738 73 2 0 1 0 0 0
> >
> > I think this level of fragmentation is pretty normal on long running
> > systems. Here, in the reg_mr flow, the first try (order-8) alloc
> > will probably fail 9 times out of 10 (esp. after the addition of
> > GFP_NORETRY flag), and then as fallback, the code tries to allocate
> > a lower order, and if that too fails, it allocates a page. I think
> > it makes sense to just avoid trying an order-8 alloc here.
>
> But a system like this won't get THPs either, so I'm not sure it is
> relevant. The function was designed as it is to consume a "THP" if it
> is available.
So can we do this with just the addition of __GFP_NORETRY?
Jason
Hi Jason,
We will go ahead with just adding the "__GFP_NORETRY" flag, to reduce the time it takes for higher order memory allocations to fail when higher order pages are not available.
Will send out the corresponding patch.
Thank you very much for your inputs.
-
Praveen Kumar Kannoju.
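Per the resolution above, the follow-up would presumably reduce to a one-line change in mlx5_ib_alloc_xlt(), keeping MLX5_MAX_UMR_CHUNK, the order-8 first attempt, and the existing fallback chain intact (a sketch of the expected v3, not the posted patch):

```diff
-	gfp_mask |= __GFP_ZERO;
+	gfp_mask |= __GFP_ZERO | __GFP_NORETRY;
```

With __GFP_NORETRY set, a fragmented system fails the high-order attempt quickly instead of stalling in compaction, while systems with a free huge page still get the large buffer.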
-----Original Message-----
From: Jason Gunthorpe [mailto:[email protected]]
Sent: 31 March 2021 11:23 PM
To: Aruna Ramakrishna <[email protected]>
Cc: Praveen Kannoju <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; Rajesh Sivaramasubramaniom <[email protected]>; Rama Nichanamatlu <[email protected]>; Jeffery Yoder <[email protected]>
Subject: Re: [PATCH v2] IB/mlx5: Reduce max order of memory allocated for xlt update