2020-09-03 15:14:14

by Leon Romanovsky

Subject: [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function

From: Leon Romanovsky <[email protected]>

From Maor:

This series adds a new constructor for a scatter gather table. Like the
sg_alloc_table_from_pages function, the new function merges all contiguous
chunks of the pages into a single scatter gather entry.

In contrast to sg_alloc_table_from_pages, the new API allows chaining of
new pages to an already initialized SG table.

This allows drivers to benefit from the optimization of merging contiguous
pages without needing to pre-allocate all the pages and hold them in a very
large temporary buffer before initializing the SG table.

The first two patches refactor the code of sg_alloc_table_from_pages to
enable code sharing, and add an sg_alloc_next function that allows dynamic
allocation of additional entries in the SG table.

The third patch introduces the new API.
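
For illustration, a minimal usage sketch of the new API, based on the umem
conversion in the last patch, is shown below. The variables cur_base, npages,
page_list, gup_flags, batch_sz and max_seg_size are assumed to be set up by
the caller, and error unwinding of already-pinned pages is omitted:

        struct sg_append append = {};
        struct scatterlist *sg = NULL;
        struct sg_table sgt = {};
        int ret;

        while (npages) {
                /* Pin the next batch of user pages into page_list */
                ret = pin_user_pages_fast(cur_base,
                                          min_t(unsigned long, npages, batch_sz),
                                          gup_flags, page_list);
                if (ret < 0)
                        return ret;
                cur_base += ret * PAGE_SIZE;
                npages -= ret;

                /*
                 * Append the batch to the table; pages that are physically
                 * contiguous with the tail of the table are merged into the
                 * last SG entry instead of consuming new entries.
                 */
                append.left_pages = npages;
                append.prv = sg;
                sg = sg_alloc_table_append(&sgt, page_list, ret, 0,
                                           ret << PAGE_SHIFT, max_seg_size,
                                           GFP_KERNEL, &append);
                if (IS_ERR(sg))
                        return PTR_ERR(sg);
        }
        sg_mark_end(sg);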

The last patch changes the InfiniBand driver to use the new API. It
removes duplicated functionality from the code and benefits from the
optimization of dynamically allocating the SG table from pages.

On a system using 2MB huge pages, without this change the SG table
would contain 512x more SG entries.
E.g. for a 100GB memory registration:

        Number of entries    Size
Before  26214400             600.0MB
After   51200                1.2MB
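
(For reference, the entry counts follow from the page sizes: 100GB / 4KB =
26214400 entries versus 100GB / 2MB = 51200 entries; the table sizes
correspond to the roughly 24 bytes per scatterlist entry implied above.)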

Thanks

Maor Gottlieb (4):
lib/scatterlist: Refactor sg_alloc_table_from_pages
lib/scatterlist: Add support in dynamically allocation of SG entries
lib/scatterlist: Add support in dynamic allocation of SG table from
pages
RDMA/umem: Move to allocate SG table from pages

drivers/infiniband/core/umem.c | 93 ++--------
include/linux/scatterlist.h | 39 +++--
lib/scatterlist.c | 302 +++++++++++++++++++++++++--------
3 files changed, 271 insertions(+), 163 deletions(-)

--
2.26.2


2020-09-03 15:14:32

by Leon Romanovsky

Subject: [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages

From: Maor Gottlieb <[email protected]>

Remove the implementation of ib_umem_add_sg_table and instead call
sg_alloc_table_append, which already has the logic to merge
contiguous pages.

Besides removing duplicated functionality, this significantly reduces
the memory consumption of the SG table. Prior to this patch, the SG
table was allocated in advance without taking contiguous pages into
consideration.

On a system using 2MB huge pages, without this change the SG table
would contain 512x more SG entries.
E.g. for a 100GB memory registration:

        Number of entries    Size
Before  26214400             600.0MB
After   51200                1.2MB

Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
drivers/infiniband/core/umem.c | 93 +++++-----------------------------
1 file changed, 14 insertions(+), 79 deletions(-)

diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
index be889e99cfac..9eb946f665ec 100644
--- a/drivers/infiniband/core/umem.c
+++ b/drivers/infiniband/core/umem.c
@@ -62,73 +62,6 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
sg_free_table(&umem->sg_head);
}

-/* ib_umem_add_sg_table - Add N contiguous pages to scatter table
- *
- * sg: current scatterlist entry
- * page_list: array of npage struct page pointers
- * npages: number of pages in page_list
- * max_seg_sz: maximum segment size in bytes
- * nents: [out] number of entries in the scatterlist
- *
- * Return new end of scatterlist
- */
-static struct scatterlist *ib_umem_add_sg_table(struct scatterlist *sg,
- struct page **page_list,
- unsigned long npages,
- unsigned int max_seg_sz,
- int *nents)
-{
- unsigned long first_pfn;
- unsigned long i = 0;
- bool update_cur_sg = false;
- bool first = !sg_page(sg);
-
- /* Check if new page_list is contiguous with end of previous page_list.
- * sg->length here is a multiple of PAGE_SIZE and sg->offset is 0.
- */
- if (!first && (page_to_pfn(sg_page(sg)) + (sg->length >> PAGE_SHIFT) ==
- page_to_pfn(page_list[0])))
- update_cur_sg = true;
-
- while (i != npages) {
- unsigned long len;
- struct page *first_page = page_list[i];
-
- first_pfn = page_to_pfn(first_page);
-
- /* Compute the number of contiguous pages we have starting
- * at i
- */
- for (len = 0; i != npages &&
- first_pfn + len == page_to_pfn(page_list[i]) &&
- len < (max_seg_sz >> PAGE_SHIFT);
- len++)
- i++;
-
- /* Squash N contiguous pages from page_list into current sge */
- if (update_cur_sg) {
- if ((max_seg_sz - sg->length) >= (len << PAGE_SHIFT)) {
- sg_set_page(sg, sg_page(sg),
- sg->length + (len << PAGE_SHIFT),
- 0);
- update_cur_sg = false;
- continue;
- }
- update_cur_sg = false;
- }
-
- /* Squash N contiguous pages into next sge or first sge */
- if (!first)
- sg = sg_next(sg);
-
- (*nents)++;
- sg_set_page(sg, first_page, len << PAGE_SHIFT, 0);
- first = false;
- }
-
- return sg;
-}
-
/**
* ib_umem_find_best_pgsz - Find best HW page size to use for this MR
*
@@ -205,7 +138,8 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,
struct mm_struct *mm;
unsigned long npages;
int ret;
- struct scatterlist *sg;
+ struct scatterlist *sg = NULL;
+ struct sg_append append = {};
unsigned int gup_flags = FOLL_WRITE;

/*
@@ -255,15 +189,9 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

cur_base = addr & PAGE_MASK;

- ret = sg_alloc_table(&umem->sg_head, npages, GFP_KERNEL);
- if (ret)
- goto vma;
-
if (!umem->writable)
gup_flags |= FOLL_FORCE;

- sg = umem->sg_head.sgl;
-
while (npages) {
cond_resched();
ret = pin_user_pages_fast(cur_base,
@@ -276,10 +204,18 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

cur_base += ret * PAGE_SIZE;
npages -= ret;
-
- sg = ib_umem_add_sg_table(sg, page_list, ret,
- dma_get_max_seg_size(device->dma_device),
- &umem->sg_nents);
+ append.left_pages = npages;
+ append.prv = sg;
+ sg = sg_alloc_table_append(&umem->sg_head, page_list, ret, 0,
+ ret << PAGE_SHIFT,
+ dma_get_max_seg_size(device->dma_device),
+ GFP_KERNEL, &append);
+ umem->sg_nents = umem->sg_head.nents;
+ if (IS_ERR(sg)) {
+ unpin_user_pages_dirty_lock(page_list, ret, 0);
+ ret = PTR_ERR(sg);
+ goto umem_release;
+ }
}

sg_mark_end(sg);
@@ -301,7 +237,6 @@ static struct ib_umem *__ib_umem_get(struct ib_device *device,

umem_release:
__ib_umem_release(device, umem, 0);
-vma:
atomic64_sub(ib_umem_num_pages(umem), &mm->pinned_vm);
out:
free_page((unsigned long) page_list);
--
2.26.2

2020-09-03 15:56:31

by Leon Romanovsky

Subject: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages

From: Maor Gottlieb <[email protected]>

Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
SG entries. Therefore it requires the user to allocate all the pages in
advance and hold them in a large buffer. Such a buffer consumes a lot of
temporary memory in HPC systems that perform very large memory
registrations.

The next patches introduce an API for dynamic allocation from pages,
which requires the following:
* Extract the code into alloc_from_pages_common.
* Change the build of the table to iterate over the chunks and not over
  the SGEs. This will allow dynamic allocation of more SGEs.

Since sg_alloc_table_from_pages allocates exactly the number of chunks,
the number of chunks is equal to the number of SG entries.

Signed-off-by: Maor Gottlieb <[email protected]>
Signed-off-by: Leon Romanovsky <[email protected]>
---
lib/scatterlist.c | 75 ++++++++++++++++++++++++++++-------------------
1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 5d63a8857f36..292e785d21ee 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -365,38 +365,18 @@ int sg_alloc_table(struct sg_table *table, unsigned int nents, gfp_t gfp_mask)
}
EXPORT_SYMBOL(sg_alloc_table);

-/**
- * __sg_alloc_table_from_pages - Allocate and initialize an sg table from
- * an array of pages
- * @sgt: The sg table header to use
- * @pages: Pointer to an array of page pointers
- * @n_pages: Number of pages in the pages array
- * @offset: Offset from start of the first page to the start of a buffer
- * @size: Number of valid bytes in the buffer (after offset)
- * @max_segment: Maximum size of a scatterlist node in bytes (page aligned)
- * @gfp_mask: GFP allocation mask
- *
- * Description:
- * Allocate and initialize an sg table from a list of pages. Contiguous
- * ranges of the pages are squashed into a single scatterlist node up to the
- * maximum size specified in @max_segment. An user may provide an offset at a
- * start and a size of valid data in a buffer specified by the page array.
- * The returned sg table is released by sg_free_table.
- *
- * Returns:
- * 0 on success, negative error on failure
- */
-int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
- unsigned int n_pages, unsigned int offset,
- unsigned long size, unsigned int max_segment,
- gfp_t gfp_mask)
+static struct scatterlist *
+alloc_from_pages_common(struct sg_table *sgt, struct page **pages,
+ unsigned int n_pages, unsigned int offset,
+ unsigned long size, unsigned int max_segment,
+ gfp_t gfp_mask)
{
unsigned int chunks, cur_page, seg_len, i;
+ struct scatterlist *prv, *s = NULL;
int ret;
- struct scatterlist *s;

if (WARN_ON(!max_segment || offset_in_page(max_segment)))
- return -EINVAL;
+ return ERR_PTR(-EINVAL);

/* compute number of contiguous chunks */
chunks = 1;
@@ -412,11 +392,12 @@ int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,

ret = sg_alloc_table(sgt, chunks, gfp_mask);
if (unlikely(ret))
- return ret;
+ return ERR_PTR(ret);

/* merging chunks and putting them into the scatterlist */
cur_page = 0;
- for_each_sg(sgt->sgl, s, sgt->orig_nents, i) {
+ s = sgt->sgl;
+ for (i = 0; i < chunks; i++) {
unsigned int j, chunk_size;

/* look for the end of the current chunk */
@@ -435,9 +416,43 @@ int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
size -= chunk_size;
offset = 0;
cur_page = j;
+ prv = s;
+ s = sg_next(s);
}
+ return prv;
+}

- return 0;
+/**
+ * __sg_alloc_table_from_pages - Allocate and initialize an sg table from
+ * an array of pages
+ * @sgt: The sg table header to use
+ * @pages: Pointer to an array of page pointers
+ * @n_pages: Number of pages in the pages array
+ * @offset: Offset from start of the first page to the start of a buffer
+ * @size: Number of valid bytes in the buffer (after offset)
+ * @max_segment: Maximum size of a scatterlist node in bytes (page aligned)
+ * @gfp_mask: GFP allocation mask
+ *
+ * Description:
+ * Allocate and initialize an sg table from a list of pages. Contiguous
+ * ranges of the pages are squashed into a single scatterlist node up to the
+ * maximum size specified in @max_segment. A user may provide an offset at a
+ * start and a size of valid data in a buffer specified by the page array.
+ * The returned sg table is released by sg_free_table.
+ *
+ * Returns:
+ * 0 on success, negative error on failure
+ */
+int __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
+ unsigned int n_pages, unsigned int offset,
+ unsigned long size, unsigned int max_segment,
+ gfp_t gfp_mask)
+{
+ struct scatterlist *sg;
+
+ sg = alloc_from_pages_common(sgt, pages, n_pages, offset, size,
+ max_segment, gfp_mask);
+ return PTR_ERR_OR_ZERO(sg);
}
EXPORT_SYMBOL(__sg_alloc_table_from_pages);

--
2.26.2

2020-09-03 16:00:38

by Leon Romanovsky

Subject: Re: [PATCH rdma-next 0/4] scatterlist: add sg_alloc_table_append function

On Thu, Sep 03, 2020 at 05:32:17PM +0200, Christoph Hellwig wrote:
> Patch 1 never made it through.

Thanks, I sent it now and the patch is visible on the ML:

https://lore.kernel.org/linux-rdma/[email protected]/T/#t

2020-09-07 07:30:18

by Christoph Hellwig

Subject: Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages

On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
> From: Maor Gottlieb <[email protected]>
>
> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
> SG entries. Therefore it requires the user to allocate all the pages in
> advance and hold them in a large buffer. Such a buffer consumes a lot of
> temporary memory in HPC systems that perform very large memory
> registrations.
>
> The next patches introduce an API for dynamic allocation from pages,
> which requires the following:
> * Extract the code into alloc_from_pages_common.
> * Change the build of the table to iterate over the chunks and not over
>   the SGEs. This will allow dynamic allocation of more SGEs.
>
> Since sg_alloc_table_from_pages allocates exactly the number of chunks,
> the number of chunks is equal to the number of SG entries.

Given how few users __sg_alloc_table_from_pages has, what about just
switching it to your desired calling conventions without another helper?
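
Roughly, keep the existing name but give it the calling convention that
alloc_from_pages_common already has in this patch (just a sketch):

        struct scatterlist *
        __sg_alloc_table_from_pages(struct sg_table *sgt, struct page **pages,
                                    unsigned int n_pages, unsigned int offset,
                                    unsigned long size, unsigned int max_segment,
                                    gfp_t gfp_mask);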

2020-09-07 07:31:26

by Christoph Hellwig

Subject: Re: [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages

On Thu, Sep 03, 2020 at 03:18:53PM +0300, Leon Romanovsky wrote:
> From: Maor Gottlieb <[email protected]>
>
> Remove the implementation of ib_umem_add_sg_table and instead call
> sg_alloc_table_append, which already has the logic to merge
> contiguous pages.
>
> Besides removing duplicated functionality, this significantly reduces
> the memory consumption of the SG table. Prior to this patch, the SG
> table was allocated in advance without taking contiguous pages into
> consideration.
>
> On a system using 2MB huge pages, without this change the SG table
> would contain 512x more SG entries.
> E.g. for a 100GB memory registration:
>
>         Number of entries    Size
> Before  26214400             600.0MB
> After   51200                1.2MB
>
> Signed-off-by: Maor Gottlieb <[email protected]>
> Signed-off-by: Leon Romanovsky <[email protected]>

Looks sensible for now, but the real fix is of course to avoid
the scatterlist here entirely, and provide a bvec based
pin_user_pages_fast. I'll need to finally get that done..

2020-09-07 12:40:33

by Maor Gottlieb

Subject: Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages


On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
>> From: Maor Gottlieb <[email protected]>
>>
>> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
>> SG entries. Therefore it requires the user to allocate all the pages in
>> advance and hold them in a large buffer. Such a buffer consumes a lot of
>> temporary memory in HPC systems that perform very large memory
>> registrations.
>>
>> The next patches introduce an API for dynamic allocation from pages,
>> which requires the following:
>> * Extract the code into alloc_from_pages_common.
>> * Change the build of the table to iterate over the chunks and not over
>>   the SGEs. This will allow dynamic allocation of more SGEs.
>>
>> Since sg_alloc_table_from_pages allocates exactly the number of chunks,
>> the number of chunks is equal to the number of SG entries.
> Given how few users __sg_alloc_table_from_pages has, what about just
> switching it to your desired calling conventions without another helper?

I tried it now. It didn't save a lot.  Please give me your decision and
if needed I will update accordingly.

2020-09-08 19:23:14

by Jason Gunthorpe

Subject: Re: [PATCH rdma-next 4/4] RDMA/umem: Move to allocate SG table from pages

On Mon, Sep 07, 2020 at 09:29:26AM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:18:53PM +0300, Leon Romanovsky wrote:
> > From: Maor Gottlieb <[email protected]>
> >
> > Remove the implementation of ib_umem_add_sg_table and instead call
> > sg_alloc_table_append, which already has the logic to merge
> > contiguous pages.
> >
> > Besides removing duplicated functionality, this significantly reduces
> > the memory consumption of the SG table. Prior to this patch, the SG
> > table was allocated in advance without taking contiguous pages into
> > consideration.
> >
> > On a system using 2MB huge pages, without this change the SG table
> > would contain 512x more SG entries.
> > E.g. for a 100GB memory registration:
> >
> >         Number of entries    Size
> > Before  26214400             600.0MB
> > After   51200                1.2MB
> >
> > Signed-off-by: Maor Gottlieb <[email protected]>
> > Signed-off-by: Leon Romanovsky <[email protected]>
>
> Looks sensible for now, but the real fix is of course to avoid
> the scatterlist here entirely, and provide a bvec based
> pin_user_pages_fast. I'll need to finally get that done..

I'm working on cleaning up the DMA handling in all the RDMA drivers
using ib_umem to the point where doing something like this would become
fairly simple.

pin_user_pages_fast_bvec/whatever would be a huge improvement here;
calling in a loop like this just to get a partial page list to copy to
an SGL is horrifically slow due to all the extra overheads. Going
directly to the bvec/sgl/etc inside all the locks will be a lot faster.

Jason

2020-09-08 19:44:38

by Christoph Hellwig

Subject: Re: [PATCH rdma-next 1/4] lib/scatterlist: Refactor sg_alloc_table_from_pages

On Mon, Sep 07, 2020 at 03:32:31PM +0300, Maor Gottlieb wrote:
>
> On 9/7/2020 10:29 AM, Christoph Hellwig wrote:
>> On Thu, Sep 03, 2020 at 06:54:34PM +0300, Leon Romanovsky wrote:
>>> From: Maor Gottlieb <[email protected]>
>>>
>>> Currently, sg_alloc_table_from_pages doesn't support dynamic chaining of
>>> SG entries. Therefore it requires the user to allocate all the pages in
>>> advance and hold them in a large buffer. Such a buffer consumes a lot of
>>> temporary memory in HPC systems that perform very large memory
>>> registrations.
>>>
>>> The next patches introduce an API for dynamic allocation from pages,
>>> which requires the following:
>>> * Extract the code into alloc_from_pages_common.
>>> * Change the build of the table to iterate over the chunks and not over
>>>   the SGEs. This will allow dynamic allocation of more SGEs.
>>>
>>> Since sg_alloc_table_from_pages allocates exactly the number of chunks,
>>> the number of chunks is equal to the number of SG entries.
>> Given how few users __sg_alloc_table_from_pages has, what about just
>> switching it to your desired calling conventions without another helper?
>
> I tried it now. It didn't save a lot.  Please give me your decision and if
> needed I will update accordingly.

Feel free to keep it for now, we can sort this out later.