Subject: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE

This patch series implements the On-Demand Paging (ODP) feature in the
SoftRoCE (rxe) driver. So far, the feature has been available only in the
mlx5 driver[1].

The first patch of this series is provided for testing purposes, and it
should be dropped in the end. It converts the three tasklets to use a
workqueue in order to let them sleep during page faults. Bob Pearson says
he will post a patch to do this, and I think we can adopt that one. The
other patches in this series are, I believe, completed works.

For simplicity, I omitted some content, such as the motivation behind this
series.
Please see the cover letter of v3 for more details[2].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
its pages so that their physical addresses never change during RDMA
communication. This requires the whole MR to fit in physical memory and
inevitably leads to memory pressure. On-Demand Paging (ODP), by contrast,
allows applications to register MRs without pinning pages: they are paged
in when the driver needs them and paged out when the OS reclaims memory.
As a result, it is possible to register a large MR that does not fit in
physical memory without consuming much of it.
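
For illustration, below is a minimal userspace sketch of registering an
ODP-enabled MR with libibverbs. The buffer size is arbitrary and error
handling is abbreviated; this is not part of the patchset itself.

#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register a 1 GiB ODP MR; its pages are not pinned at registration. */
int register_odp_mr(struct ibv_context *ctx, struct ibv_pd *pd)
{
        struct ibv_device_attr_ex attr = {};
        size_t len = 1UL << 30;
        void *buf;
        struct ibv_mr *mr;

        /* Check that the device reports ODP support before using it. */
        if (ibv_query_device_ex(ctx, NULL, &attr))
                return -1;
        if (!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
                return -1;

        buf = malloc(len);
        if (!buf)
                return -1;

        /* IBV_ACCESS_ON_DEMAND requests an ODP MR. */
        mr = ibv_reg_mr(pd, buf, len,
                        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                        IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_ON_DEMAND);
        return mr ? 0 : -1;
}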

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR on its registration. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table. DMA addresses and
PFNs are stored in the driver page table. They are updated on page-in and
page-out, both of which use the common interfaces in the ib_uverbs layer.
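
For reference, the relevant part of the struct in
include/rdma/ib_umem_odp.h looks roughly like this (abridged; see the tree
for the exact layout):

struct ib_umem_odp {
        struct ib_umem umem;
        struct mmu_interval_notifier notifier;  /* for page invalidation */

        /* Driver page table: one entry per page in the umem. */
        unsigned long *pfn_list;        /* PFNs faulted in via hmm_range_fault() */
        dma_addr_t *dma_list;           /* mapped DMA addresses */

        /* Allows only one thread at a time to map/unmap pages. */
        struct mutex umem_mutex;
        ...
};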

Page-in can occur when the requester, responder, or completer accesses an
MR in order to process RDMA operations. If it finds that the pages being
accessed are not present in physical memory, or that the requisite
permissions are not set on them, it triggers a page fault to make the
pages present with the proper permissions and, at the same time, updates
the driver page table. After confirming the presence of the pages, it
performs the memory access, i.e. a read, write, or atomic operation.
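
As a rough sketch of that page-in path (the helper name rxe_odp_fault_in()
is hypothetical; the actual patches differ in detail), the driver can
fault the range in with the common ODP interface:

/* Fault in [iova, iova + length) before accessing the MR.
 * ib_umem_odp_map_dma_and_lock() faults the pages in, updates
 * pfn_list/dma_list, and returns with umem_odp->umem_mutex held,
 * so the pages cannot be invalidated while the driver accesses them.
 */
static int rxe_odp_fault_in(struct rxe_mr *mr, u64 iova, int length,
                            bool for_write)
{
        struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
        u64 access_mask = ODP_READ_ALLOWED_BIT;
        int np;

        if (for_write)
                access_mask |= ODP_WRITE_ALLOWED_BIT;

        np = ib_umem_odp_map_dma_and_lock(umem_odp, iova, length,
                                          access_mask, true);
        return np < 0 ? np : 0;
}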

Page-out is triggered by page reclaim or by filesystem events (e.g. a
metadata update of a file that is being used as an MR). When creating an
ODP-enabled MR, the driver registers an MMU notifier callback. When the
kernel issues a page invalidation notification, the callback is invoked to
unmap the DMA addresses and update the driver page table. After that, the
kernel releases the pages.
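
A minimal sketch of such a callback, following the conventions the common
MMU interval notifier API imposes (the name rxe_ib_invalidate_range() is
hypothetical):

static bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
                                    const struct mmu_notifier_range *range,
                                    unsigned long cur_seq)
{
        struct ib_umem_odp *umem_odp =
                container_of(mni, struct ib_umem_odp, notifier);

        if (!mmu_notifier_range_blockable(range))
                return false;

        mutex_lock(&umem_odp->umem_mutex);
        mmu_interval_set_seq(mni, cur_seq);
        /* Unmap DMA and clear the driver page table for the range. */
        ib_umem_odp_unmap_dma_pages(umem_odp, range->start, range->end);
        mutex_unlock(&umem_odp->umem_mutex);

        return true;
}

static const struct mmu_interval_notifier_ops rxe_mn_ops = {
        .invalidate = rxe_ib_invalidate_range,
};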

[Supported operations]
All traditional operations are supported on RC connections. The new Atomic
Write[3] and RDMA Flush[4] operations are not included in this patchset; I
will post them after this patchset is merged. On UD connections, Send,
Recv, and SRQ-Recv are supported.

[How to test ODP?]
There are only a few resources available for testing. The pyverbs
testcases in rdma-core and perftest[5] are the recommended ones. Besides
those, the ibv_rc_pingpong command can also be used for testing; see the
example below. Note that you may have to build perftest from upstream
because older versions do not handle ODP capabilities correctly.
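
For example, assuming an rxe device named rxe0 on both nodes (the device
name and GID index are just illustrations), something like the following
exercises the ODP paths; both tools accept an --odp option that requests
ODP MRs:

server$ ib_write_bw -d rxe0 --odp
client$ ib_write_bw -d rxe0 --odp <server_addr>

server$ ibv_rc_pingpong -d rxe0 -g 0 --odp
client$ ibv_rc_pingpong -d rxe0 -g 0 --odp <server_addr>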

The tree is available from github:
https://github.com/daimatsuda/linux/tree/odp_v4
This series is based on commit f605f26ea196, but the tree includes an
additional bugfix that has yet to be merged as of today (Apr 19th, 2023):
https://lore.kernel.org/linux-rdma/[email protected]/

[Future work]
My next work is to enable the new Atomic Write[3] and RDMA Flush[4]
operations with ODP. After that, I am going to implement the prefetch
feature, which allows applications to trigger page faults using
ibv_advise_mr(3) to optimize performance. Some existing software, like
librpma[6], uses this feature. Additionally, I think we can also add the
implicit ODP feature in the future.
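
For reference, prefetching part of an ODP MR with ibv_advise_mr(3) looks
roughly like this from userspace (the mr/addr/length values are
placeholders):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Ask the driver to fault in part of an ODP MR in advance, so a later
 * RDMA access does not pay the page-fault latency. */
static int prefetch_range(struct ibv_pd *pd, struct ibv_mr *mr,
                          void *addr, uint32_t length)
{
        struct ibv_sge sge = {
                .addr = (uintptr_t)addr,
                .length = length,
                .lkey = mr->lkey,
        };

        return ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
                             IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);
}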

[1] [RFC 00/20] On demand paging
https://www.spinics.net/lists/linux-rdma/msg18906.html

[2] [PATCH for-next v3 0/7] On-Demand Paging on SoftRoCE
https://lore.kernel.org/lkml/[email protected]/

[3] [PATCH v7 0/8] RDMA/rxe: Add atomic write operation
https://lore.kernel.org/linux-rdma/[email protected]/

[4] [for-next PATCH 00/10] RDMA/rxe: Add RDMA FLUSH operation
https://lore.kernel.org/lkml/[email protected]/

[5] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

[6] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

v3->v4:
1) Re-designed functions that access MRs to use the MR xarray.
2) Rebased onto the latest jgg-for-next tree.

v2->v3:
1) Removed a patch that changes the common ib_uverbs layer.
2) Re-implemented patches for conversion to workqueue.
3) Fixed compile errors (happened when CONFIG_INFINIBAND_ON_DEMAND_PAGING=n).
4) Fixed some functions that returned incorrect errors.
5) Temporarily disabled ODP for RDMA Flush and Atomic Write.

v1->v2:
1) Fixed a crash issue reported by Haris Iqbal.
2) Tried to make locking patterns clearer, as pointed out by Leon Romanovsky.
3) Minor clean ups and fixes.

Daisuke Matsuda (8):
RDMA/rxe: Tentative workqueue implementation
RDMA/rxe: Always schedule works before accessing user MRs
RDMA/rxe: Make MR functions accessible from other rxe source code
RDMA/rxe: Move resp_states definition to rxe_verbs.h
RDMA/rxe: Add page invalidation support
RDMA/rxe: Allow registering MRs for On-Demand Paging
RDMA/rxe: Add support for Send/Recv/Write/Read with ODP
RDMA/rxe: Add support for the traditional Atomic operations with ODP

drivers/infiniband/sw/rxe/Makefile | 2 +
drivers/infiniband/sw/rxe/rxe.c | 27 ++-
drivers/infiniband/sw/rxe/rxe.h | 37 ---
drivers/infiniband/sw/rxe/rxe_comp.c | 12 +-
drivers/infiniband/sw/rxe/rxe_loc.h | 49 +++-
drivers/infiniband/sw/rxe/rxe_mr.c | 27 +--
drivers/infiniband/sw/rxe/rxe_odp.c | 311 ++++++++++++++++++++++++++
drivers/infiniband/sw/rxe/rxe_recv.c | 4 +-
drivers/infiniband/sw/rxe/rxe_resp.c | 32 ++-
drivers/infiniband/sw/rxe/rxe_task.c | 84 ++++---
drivers/infiniband/sw/rxe/rxe_task.h | 6 +-
drivers/infiniband/sw/rxe/rxe_verbs.c | 5 +-
drivers/infiniband/sw/rxe/rxe_verbs.h | 39 ++++
13 files changed, 535 insertions(+), 100 deletions(-)
create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

base-commit: f605f26ea196a3b49bea249330cbd18dba61a33e

--
2.39.1


2023-04-19 16:11:46

by Pearson, Robert B

Subject: RE: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE

The work queue patch has been submitted and is waiting for some action. -- Bob

by Daisuke Matsuda (Fujitsu)

Subject: RE: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE

On Thu, April 20, 2023 1:07 AM Pearson, Robert B wrote:
>
> The work queue patch has been submitted and is waiting for some action. -- Bob

Hi,
Could you tell me which one it is? I am willing to review it.

This seems to be your latest work queue patch:
https://lore.kernel.org/all/TYCPR01MB8455A2D0B3303FD90B3BB6F1E58B9@TYCPR01MB8455.jpnprd01.prod.outlook.com/
I cannot find any newer one on the mailing list or on Patchwork.

Daisuke


2023-04-27 15:55:12

by Bob Pearson

Subject: Re: [PATCH for-next v4 0/8] On-Demand Paging on SoftRoCE

On 4/19/23 19:28, Daisuke Matsuda (Fujitsu) wrote:
> On Thu, April 20, 2023 1:07 AM Pearson, Robert B wrote:
>>
>> The work queue patch has been submitted and is waiting for some action. -- Bob
>
> Hi,
> Could you tell me which one it is? I am willing to review it.
>
> This seems to be your latest work queue patch:
> https://lore.kernel.org/all/TYCPR01MB8455A2D0B3303FD90B3BB6F1E58B9@TYCPR01MB8455.jpnprd01.prod.outlook.com/
> I cannot find any newer one on the mailing list or on Patchwork.
>
> Daisuke

Daisuke,

Sorry for the delay. I've been on another project for a few days.
I can't find one either. After the fix to qp counting in task.c, the work
queue patch is almost trivial.
I'll send it again.

Bob