2021-02-03 22:07:43

by Joao Martins

Subject: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

Use the newly added unpin_user_page_range_dirty_lock()
for more quickly unpinning a consecutive range of pages
represented as compound pages. This will also calculate the
number of pages to unpin (for the tail pages which match the
head page) and thus batch the refcount update.

Running a test program which calls MR reg/unreg on a 1G region
and measures the cost of both operations together (in a guest using rxe)
with THP and hugetlbfs:

Before:
590 rounds in 5.003 sec: 8480.335 usec / round
6898 rounds in 60.001 sec: 8698.367 usec / round

After:
2631 rounds in 5.001 sec: 1900.618 usec / round
31625 rounds in 60.001 sec: 1897.267 usec / round

Signed-off-by: Joao Martins <[email protected]>
---
drivers/infiniband/core/umem.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index 2dde99a9ba07..ea4ebb3261d9 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -47,17 +47,17 @@

static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
{
- struct sg_page_iter sg_iter;
- struct page *page;
+ bool make_dirty = umem->writable && dirty;
+ struct scatterlist *sg;
+ int i;

if (umem->nmap > 0)
ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
DMA_BIDIRECTIONAL);

- for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
- page = sg_page_iter_page(&sg_iter);
- unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
- }
+ for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
+ unpin_user_page_range_dirty_lock(sg_page(sg),
+ DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);

sg_free_table(&umem->sg_head);
}
--
2.17.1


2021-02-04 00:19:30

by John Hubbard

Subject: Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

On 2/3/21 2:00 PM, Joao Martins wrote:
> Use the newly added unpin_user_page_range_dirty_lock()
> for more quickly unpinning a consecutive range of pages
> represented as compound pages. This will also calculate the
> number of pages to unpin (for the tail pages which match the
> head page) and thus batch the refcount update.
>
> Running a test program which calls MR reg/unreg on a 1G region
> and measures the cost of both operations together (in a guest using rxe)
> with THP and hugetlbfs:

In the patch subject line:

s/__ib_mem_release/__ib_umem_release/

>
> Before:
> 590 rounds in 5.003 sec: 8480.335 usec / round
> 6898 rounds in 60.001 sec: 8698.367 usec / round
>
> After:
> 2631 rounds in 5.001 sec: 1900.618 usec / round
> 31625 rounds in 60.001 sec: 1897.267 usec / round
>
> Signed-off-by: Joao Martins <[email protected]>
> ---
> drivers/infiniband/core/umem.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 2dde99a9ba07..ea4ebb3261d9 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -47,17 +47,17 @@
>
> static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
> {
> - struct sg_page_iter sg_iter;
> - struct page *page;
> + bool make_dirty = umem->writable && dirty;
> + struct scatterlist *sg;
> + int i;

Maybe unsigned int is better, so as to perfectly match the scatterlist.length.

>
> if (umem->nmap > 0)
> ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
> DMA_BIDIRECTIONAL);
>
> - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
> - page = sg_page_iter_page(&sg_iter);
> - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
> - }
> + for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)

The change from umem->sg_nents to umem->nmap looks OK, although we should get
IB people to verify that there is not some odd bug or reason to leave it as is.

> + unpin_user_page_range_dirty_lock(sg_page(sg),
> + DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);

Is it really OK to refer directly to sg->length? The scatterlist library goes
to some effort to avoid having callers directly access the struct member variables.

Actually, the for_each_sg() code and its behavior with sg->length and sg_page(sg)
confuses me because I'm new to it, and I don't quite understand how this works.
Especially with SG_CHAIN. I'm assuming that you've monitored /proc/vmstat for
nr_foll_pin* ?

>
> sg_free_table(&umem->sg_head);
> }
>

thanks,
--
John Hubbard
NVIDIA

2021-02-05 00:06:37

by Joao Martins

Subject: Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()



On 2/4/21 12:15 AM, John Hubbard wrote:
> On 2/3/21 2:00 PM, Joao Martins wrote:
>> Use the newly added unpin_user_page_range_dirty_lock()
>> for more quickly unpinning a consecutive range of pages
>> represented as compound pages. This will also calculate the
>> number of pages to unpin (for the tail pages which match the
>> head page) and thus batch the refcount update.
>>
>> Running a test program which calls MR reg/unreg on a 1G region
>> and measures the cost of both operations together (in a guest using rxe)
>> with THP and hugetlbfs:
>
> In the patch subject line:
>
> s/__ib_mem_release/__ib_umem_release/
>
Ah, yes.

>>
>> Before:
>> 590 rounds in 5.003 sec: 8480.335 usec / round
>> 6898 rounds in 60.001 sec: 8698.367 usec / round
>>
>> After:
>> 2631 rounds in 5.001 sec: 1900.618 usec / round
>> 31625 rounds in 60.001 sec: 1897.267 usec / round
>>
>> Signed-off-by: Joao Martins <[email protected]>
>> ---
>> drivers/infiniband/core/umem.c | 12 ++++++------
>> 1 file changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>> index 2dde99a9ba07..ea4ebb3261d9 100644
>> --- a/drivers/infiniband/core/umem.c
>> +++ b/drivers/infiniband/core/umem.c
>> @@ -47,17 +47,17 @@
>>
>> static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>> {
>> - struct sg_page_iter sg_iter;
>> - struct page *page;
>> + bool make_dirty = umem->writable && dirty;
>> + struct scatterlist *sg;
>> + int i;
>
> Maybe unsigned int is better, so as to perfectly match the scatterlist.length.
>
Will fix.

>>
>> if (umem->nmap > 0)
>> ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
>> DMA_BIDIRECTIONAL);
>>
>> - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>> - page = sg_page_iter_page(&sg_iter);
>> - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>> - }
>> + for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
>
> The change from umem->sg_nents to umem->nmap looks OK, although we should get
> IB people to verify that there is not some odd bug or reason to leave it as is.
>
/me nods

FWIW, this was suggested by Jason :) as the way I had done it was unnecessarily allocating a
page just to unpin pages.

>> + unpin_user_page_range_dirty_lock(sg_page(sg),
>> + DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
>
> Is it really OK to refer directly to sg->length? The scatterlist library goes
> to some effort to avoid having callers directly access the struct member variables.
>
> Actually, the for_each_sg() code and its behavior with sg->length and sg_page(sg)
> confuses me because I'm new to it, and I don't quite understand how this works.

So IIUC this can be done given how ib_umem_get() allocates scatterlists (i.e. see the call
to __sg_alloc_table_from_pages()). It builds a scatterlist whose segments are capped at the
device's DMA max segment size (e.g. 64K, 2G, etc., depending on what the device sets it to),
and once built, each scatterlist entry I am iterating over represents a contiguous range of
PFNs with a starting page. And if you keep pinning contiguous amounts of memory, it keeps
coalescing them into the previously allocated sgl.
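
To make the coalescing point concrete, here is a standalone userspace
sketch (illustration only, not kernel code; the PFN values are made up)
of how consecutive page-frame numbers fold into one (start, npages)
range, which is what lets the unpin loop hand a whole range to
unpin_user_page_range_dirty_lock() in a single call:

#include <stdio.h>

int main(void)
{
	/* Hypothetical PFNs in pinning order: two contiguous runs. */
	unsigned long pfns[] = { 100, 101, 102, 103, 200, 201 };
	unsigned long start = pfns[0], npages = 1;
	size_t i;

	for (i = 1; i < sizeof(pfns) / sizeof(pfns[0]); i++) {
		if (pfns[i] == start + npages) {
			npages++;	/* contiguous: extend current range */
			continue;
		}
		printf("range: start pfn %lu, %lu pages\n", start, npages);
		start = pfns[i];
		npages = 1;
	}
	printf("range: start pfn %lu, %lu pages\n", start, npages);
	return 0;
}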

> Especially with SG_CHAIN. I'm assuming that you've monitored /proc/vmstat for
> nr_foll_pin* ?
>
Yeap I did. I see no pages left unpinned.

>>
>> sg_free_table(&umem->sg_head);
>> }
>>
>
> thanks,
>

2021-02-05 01:47:24

by Jason Gunthorpe

Subject: Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()

On Wed, Feb 03, 2021 at 04:15:53PM -0800, John Hubbard wrote:
> > diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> > index 2dde99a9ba07..ea4ebb3261d9 100644
> > +++ b/drivers/infiniband/core/umem.c
> > @@ -47,17 +47,17 @@
> > static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
> > {
> > - struct sg_page_iter sg_iter;
> > - struct page *page;
> > + bool make_dirty = umem->writable && dirty;
> > + struct scatterlist *sg;
> > + int i;
>
> Maybe unsigned int is better, so as to perfectly match the scatterlist.length.

Yes please

> > if (umem->nmap > 0)
> > ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
> > DMA_BIDIRECTIONAL);
> > - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
> > - page = sg_page_iter_page(&sg_iter);
> > - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
> > - }
> > + for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
>
> The change from umem->sg_nents to umem->nmap looks OK, although we should get
> IB people to verify that there is not some odd bug or reason to leave it as is.

No, nmap wouldn't be right here. nmap is the number of dma mapped SGLs
in the list and should only be used by things doing sg_dma* stuff.

umem->sg_nents is the number of CPU SGL entries and is the correct
thing here.

> > + unpin_user_page_range_dirty_lock(sg_page(sg),
> > + DIV_ROUND_UP(sg->length, PAGE_SIZE), make_dirty);
>
> Is it really OK to refer directly to sg->length? The scatterlist library goes
> to some effort to avoid having callers directly access the struct member variables.

Yes, only the DMA length has accessors
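
For reference, a minimal kernel-C sketch (illustration only; the helper
names are made up) of the distinction: the CPU length is read straight
from the struct member, while the DMA length goes through the
sg_dma_len() accessor:

#include <linux/scatterlist.h>

/* CPU-side length: plain struct member access, as the patch does. */
static inline unsigned int sg_cpu_len(struct scatterlist *sg)
{
	return sg->length;
}

/* DMA-side length: always via the accessor, never the raw member. */
static inline unsigned int sg_mapped_len(struct scatterlist *sg)
{
	return sg_dma_len(sg);
}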

Jason

2021-02-05 19:45:50

by Joao Martins

Subject: Re: [PATCH 4/4] RDMA/umem: batch page unpin in __ib_mem_release()



On 2/4/21 8:00 PM, Jason Gunthorpe wrote:
> On Wed, Feb 03, 2021 at 04:15:53PM -0800, John Hubbard wrote:
>>> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
>>> index 2dde99a9ba07..ea4ebb3261d9 100644
>>> +++ b/drivers/infiniband/core/umem.c
>>> @@ -47,17 +47,17 @@
>>> static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int dirty)
>>> {
>>> - struct sg_page_iter sg_iter;
>>> - struct page *page;
>>> + bool make_dirty = umem->writable && dirty;
>>> + struct scatterlist *sg;
>>> + int i;
>>
>> Maybe unsigned int is better, so as to perfectly match the scatterlist.length.
>
> Yes please
>
Fixed in v2.

>>> if (umem->nmap > 0)
>>> ib_dma_unmap_sg(dev, umem->sg_head.sgl, umem->sg_nents,
>>> DMA_BIDIRECTIONAL);
>>> - for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>>> - page = sg_page_iter_page(&sg_iter);
>>> - unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
>>> - }
>>> + for_each_sg(umem->sg_head.sgl, sg, umem->nmap, i)
>>
>> The change from umem->sg_nents to umem->nmap looks OK, although we should get
>> IB people to verify that there is not some odd bug or reason to leave it as is.
>
> No, nmap wouldn't be right here. nmap is the number of dma mapped SGLs
> in the list and should only be used by things doing sg_dma* stuff.
>
> umem->sg_nents is the number of CPU SGL entries and is the correct
> thing here.
>

And this was fixed in v2 as well.