2022-05-18 00:57:27

by Jason Gunthorpe

Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size

On Tue, May 17, 2022 at 08:04:58PM +0000, Long Li wrote:
> > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page size
> >
> > On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > > > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page
> > > > size
> > > >
> > > > On Tue, May 17, 2022 at 02:04:29AM -0700, [email protected]
> > wrote:
> > > > > From: Long Li <[email protected]>
> > > > >
> > > > > The system chooses the default 64K page size if the device does
> > > > > not specify the max page size it can handle for DMA. This does
> > > > > not work well when the device is registering a large chunk of
> > > > > memory, where a larger page size is more efficient.
> > > > >
> > > > > Set it to the maximum hardware supported page size.
> > > >
> > > > For RDMA devices this should be set to the largest segment size an
> > > > ib_sge can take in when posting work. It should not be the page size
> > > > of MR. 2M is a weird number for that, are you sure it is right?
> > >
> > > Yes, this is the maximum page size used in hardware page tables.
> >
> > As I said, it should be the size of the sge in the WQE, not the "hardware page
> > tables"
>
> This driver uses the following code to figure out the largest page
> size for memory registration with hardware:
>
> page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
>
> In this function, mr->umem is created with ib_dma_max_seg_size() as
> its max segment size when creating its sgtable.
>
> The purpose of setting DMA page size to 2M is to make sure this
> function returns the largest possible MR size that the hardware can
> take. Otherwise, this function will return 64k: the default DMA
> size.

As I've already said, you are supposed to set the value that limits to
ib_sge and *NOT* the value that is related to
ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's
typically work on a 32 bit length.

Jason


2022-05-18 06:14:41

by Ajay Sharma

Subject: RE: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

Thanks Long.
Hello Jason,
I am the author of the patch.
To your comment below :
" As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length."

The ib_sge is limited by __sg_alloc_table_from_pages(), which uses ib_dma_max_seg_size(), which is what the eth driver sets via dma_set_max_seg_size(). Currently our HW does not support PTEs larger than 2M.

So ib_umem_find_best_pgsz() takes PAGE_SZ_BM as an input. The bitmap has a bit set for each page size supported by the HW.

#define PAGE_SZ_BM (SZ_4K | SZ_8K | SZ_16K | SZ_32K | SZ_64K | SZ_128K \
| SZ_256K | SZ_512K | SZ_1M | SZ_2M)

Are you suggesting we are too restrictive in the bitmap we are passing, or that we should not set this bitmap and let the function choose the default?

Regards,
Ajay

-----Original Message-----
From: Jason Gunthorpe <[email protected]>
Sent: Tuesday, May 17, 2022 5:04 PM
To: Long Li <[email protected]>
Cc: Ajay Sharma <[email protected]>; KY Srinivasan <[email protected]>; Haiyang Zhang <[email protected]>; Stephen Hemminger <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Leon Romanovsky <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]
Subject: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

On Tue, May 17, 2022 at 08:04:58PM +0000, Long Li wrote:
> > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max page
> > size
> >
> > On Tue, May 17, 2022 at 07:32:51PM +0000, Long Li wrote:
> > > > Subject: Re: [PATCH 05/12] net: mana: Set the DMA device max
> > > > page size
> > > >
> > > > On Tue, May 17, 2022 at 02:04:29AM -0700,
> > > > [email protected]
> > wrote:
> > > > > From: Long Li <[email protected]>
> > > > >
> > > > > The system chooses the default 64K page size if the device
> > > > > does not specify the max page size it can handle for DMA.
> > > > > This does not work well when the device is registering a large
> > > > > chunk of memory, where a larger page size is more efficient.
> > > > >
> > > > > Set it to the maximum hardware supported page size.
> > > >
> > > > For RDMA devices this should be set to the largest segment size
> > > > an ib_sge can take in when posting work. It should not be the
> > > > page size of MR. 2M is a weird number for that, are you sure it is right?
> > >
> > > Yes, this is the maximum page size used in hardware page tables.
> >
> > As I said, it should be the size of the sge in the WQE, not the
> > "hardware page tables"
>
> This driver uses the following code to figure out the largest page
> size for memory registration with hardware:
>
> page_sz = ib_umem_find_best_pgsz(mr->umem, PAGE_SZ_BM, iova);
>
> In this function, mr->umem is created with ib_dma_max_seg_size() as
> its max segment size when creating its sgtable.
>
> The purpose of setting DMA page size to 2M is to make sure this
> function returns the largest possible MR size that the hardware can
> take. Otherwise, this function will return 64k: the default DMA size.

As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length.

Jason

2022-05-18 16:08:10

by Jason Gunthorpe

Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

On Wed, May 18, 2022 at 05:59:00AM +0000, Ajay Sharma wrote:
> Thanks Long.
> Hello Jason,
> I am the author of the patch.
> To your comment below :
> " As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length."
>
> The ib_sge is limited by __sg_alloc_table_from_pages(), which uses
> ib_dma_max_seg_size(), which is what the eth driver sets via
> dma_set_max_seg_size(). Currently our HW does not support
> PTEs larger than 2M.

*sigh* again it has nothing to do with *PTEs* in the HW.

Jason

2022-05-18 21:12:03

by Ajay Sharma

Subject: RE: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

Sorry, I am not able to follow. Below is the reference efa driver implementation:

static int efa_device_init(struct efa_com_dev *edev, struct pci_dev *pdev)
{
	int dma_width;
	int err;

	err = efa_com_dev_reset(edev, EFA_REGS_RESET_NORMAL);
	if (err)
		return err;

	err = efa_com_validate_version(edev);
	if (err)
		return err;

	dma_width = efa_com_get_dma_width(edev);
	if (dma_width < 0) {
		err = dma_width;
		return err;
	}

	err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(dma_width));
	if (err) {
		dev_err(&pdev->dev, "dma_set_mask_and_coherent failed %d\n", err);
		return err;
	}

	dma_set_max_seg_size(&pdev->dev, UINT_MAX);
	return 0;
}

static int efa_register_mr(struct ib_pd *ibpd, struct efa_mr *mr, u64 start,
			   u64 length, u64 virt_addr, int access_flags)
{
	struct efa_dev *dev = to_edev(ibpd->device);
	struct efa_com_reg_mr_params params = {};
	struct efa_com_reg_mr_result result = {};
	struct pbl_context pbl;
	unsigned int pg_sz;
	int inline_size;
	int err;

	params.pd = to_epd(ibpd)->pdn;
	params.iova = virt_addr;
	params.mr_length_in_bytes = length;
	params.permissions = access_flags;

	pg_sz = ib_umem_find_best_pgsz(mr->umem,
				       dev->dev_attr.page_size_cap,
				       virt_addr);
	....
}

Ideally we would like to read it from HW, but currently we are hardcoding the bitmap. I can change the commit message if you feel it is misleading.
Something along the lines of:
RDMA/mana: Use API to get contiguous memory blocks aligned to device supported page size

Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
registering an MR instead of coding it in the driver.

ib_umem_find_best_pgsz() is used to find the best suitable page size
which replaces the existing efa_cont_pages() implementation.
rdma_for_each_block() is used to iterate the umem in aligned contiguous memory blocks.


Ajay


-----Original Message-----
From: Jason Gunthorpe <[email protected]>
Sent: Wednesday, May 18, 2022 9:05 AM
To: Ajay Sharma <[email protected]>
Cc: Long Li <[email protected]>; KY Srinivasan <[email protected]>; Haiyang Zhang <[email protected]>; Stephen Hemminger <[email protected]>; Wei Liu <[email protected]>; Dexuan Cui <[email protected]>; David S. Miller <[email protected]>; Jakub Kicinski <[email protected]>; Paolo Abeni <[email protected]>; Leon Romanovsky <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

On Wed, May 18, 2022 at 05:59:00AM +0000, Ajay Sharma wrote:
> Thanks Long.
> Hello Jason,
> I am the author of the patch.
> To your comment below :
> " As I've already said, you are supposed to set the value that limits to ib_sge and *NOT* the value that is related to ib_umem_find_best_pgsz. It is usually 2G because the ib_sge's typically work on a 32 bit length."
>
> The ib_sge is limited by __sg_alloc_table_from_pages(), which uses
> ib_dma_max_seg_size(), which is what the eth driver sets via
> dma_set_max_seg_size(). Currently our HW does not support PTEs
> larger than 2M.

*sigh* again it has nothing to do with *PTEs* in the HW.

Jason

2022-05-19 01:51:56

by Jason Gunthorpe

Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

On Wed, May 18, 2022 at 09:05:22PM +0000, Ajay Sharma wrote:

> Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
> registering an MR instead of coding it in the driver.

The dma_set_max_seg_size() has *nothing* to do with
ib_umem_find_best_pgsz() other than its value should be larger than
the largest set bit.

Again, it is supposed to be the maximum value the HW can support in a
ib_sge length field, which is usually 2G.

Jason

2022-05-19 03:00:42

by Long Li

Subject: RE: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max page size

> Subject: Re: [EXTERNAL] Re: [PATCH 05/12] net: mana: Set the DMA device max
> page size
>
> On Wed, May 18, 2022 at 09:05:22PM +0000, Ajay Sharma wrote:
>
> > Use the ib_umem_find_best_pgsz() and rdma_for_each_block() API when
> > registering an MR instead of coding it in the driver.
>
> The dma_set_max_seg_size() has *nothing* to do with
> ib_umem_find_best_pgsz() other than its value should be larger than the largest
> set bit.
>
> Again, it is supposed to be the maximum value the HW can support in a ib_sge
> length field, which is usually 2G.

Will fix this in v2.

Long