2008-10-03 21:33:50

by Tom Tucker

Subject: [PATCH 00/12] svcrdma: Fast memory registration support


Bruce:

This updated patchset implements support for Fast Memory Registration
in the NFS server. The second to last patch has been updated to
provide better guidance on whether the connection is safe or
not. Basically the code gives you a qualification string instead of
expecting the administrator to know how to interpret the memory
registration strategy and transport idiosyncrasies. The documentation
has also been updated given your suggestions and the change to the
informative log message. The string is certainly informative, but may
be a little long. For this reason, there is a proc file entry to
disable it.

Fast Memory Registration is the ability to quickly map a kernel memory
page list as a logically contiguous memory region from the perspective
of the adapter. This mapping is created and invalidated using work
requests posted on the SQ.
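The effect of such a mapping can be sketched in plain user-space C: the
adapter treats the page list as one logically contiguous buffer and
translates a region-relative address into a backing page address using
the page shift. This is a toy model for illustration only; the names
(fastreg_region, fastreg_translate) are invented here and are not part
of the kernel or verbs API.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Toy model of a fast-registered region: a list of (possibly
 * non-contiguous) pages presented to the adapter as one logically
 * contiguous buffer starting at iova_start. */
struct fastreg_region {
	uint64_t iova_start;     /* virtual base of the region */
	unsigned int page_shift; /* log2 of the page size, e.g. 12 */
	uint64_t *page_list;     /* backing address of each page */
	size_t page_list_len;
};

/* Translate a region-relative virtual address into the backing page
 * address, as the adapter would when serving an RDMA op against the
 * region. Returns 0 if the address is outside the registered range. */
static uint64_t fastreg_translate(const struct fastreg_region *r,
				  uint64_t iova)
{
	uint64_t off = iova - r->iova_start;
	size_t idx = off >> r->page_shift;
	uint64_t page_off = off & ((1ULL << r->page_shift) - 1);

	if (iova < r->iova_start || idx >= r->page_list_len)
		return 0;
	return r->page_list[idx] + page_off;
}
```

In the real implementation this translation happens in the adapter; the
kernel only posts the registration and invalidation work requests on
the SQ.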

There are two benefits of this new memory registration strategy:

- It allows for large amounts of data to be transferred between the
client and server with a single work request

- It improves security since the effective lifetime and scope of an
RKEY is a single RPC and WR.
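The second point can be made concrete with a toy model: the RKEY is
produced by the fast-reg WR, honored while the RPC is in flight, and
rejected once the invalidate WR completes, so a stale or guessed key no
longer reaches server memory. This is an illustrative sketch only; the
names (toy_mr, toy_fast_reg, toy_invalidate) are invented and do not
correspond to the verbs API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy RKEY lifecycle: valid only between the fast-reg and the
 * invalidate work request, i.e. for the span of a single RPC. */
struct toy_mr {
	uint32_t rkey;
	bool valid;
};

/* Fast-reg WR: bind the MR and advance the key, so an old RKEY value
 * can never match a newly registered region. */
static void toy_fast_reg(struct toy_mr *mr)
{
	mr->rkey++;      /* the consumed key is retired */
	mr->valid = true;
}

/* Invalidate WR: the adapter stops honoring the key. */
static void toy_invalidate(struct toy_mr *mr)
{
	mr->valid = false;
}

/* Remote access check, as the adapter would perform it. */
static bool toy_access_ok(const struct toy_mr *mr, uint32_t rkey)
{
	return mr->valid && rkey == mr->rkey;
}
```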

The documentation file Documentation/filesystems/nfs-rdma.txt has
been updated with information about NFSRDMA security in general and
how it is affected by the new Fast Memory Registration capability.

This new capability is only enabled if the underlying device advertises
that it is supported. It is not necessary that both the client and
server use the same strategy; using Fast Memory Registration, however,
improves performance.

These patches are also available here:
git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git

Signed-off-by: Tom Tucker <[email protected]>


[PATCH 01/12] svcrdma: Add Fast Reg MR Data Types

include/linux/sunrpc/svc_rdma.h | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)

[PATCH 02/12] svcrdma: Add FRMR get/put services

include/linux/sunrpc/svc_rdma.h | 3 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 122 ++++++++++++++++++++++++++++-
2 files changed, 120 insertions(+), 5 deletions(-)

[PATCH 03/12] svcrdma: Query device for Fast Reg support during connection setup

net/sunrpc/xprtrdma/svc_rdma_transport.c | 76 +++++++++++++++++++++++++++--
1 files changed, 70 insertions(+), 6 deletions(-)

[PATCH 04/12] svcrdma: Add a service to register a Fast Reg MR with the device

include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 118 +++++++++++++++++++++---------
2 files changed, 84 insertions(+), 35 deletions(-)

[PATCH 05/12] svcrdma: Modify post recv path to use local dma key

net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)

[PATCH 06/12] svcrdma: Add support to svc_rdma_send to handle chained WR

net/sunrpc/xprtrdma/svc_rdma_transport.c | 29 +++++++++++++++++++++--------
1 files changed, 21 insertions(+), 8 deletions(-)

[PATCH 07/12] svcrdma: Modify the RPC recv path to use FRMR when available

include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 187 ++++++++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +-
3 files changed, 171 insertions(+), 22 deletions(-)

[PATCH 08/12] svcrdma: Modify the RPC reply path to use FRMR when available

net/sunrpc/xprtrdma/svc_rdma_sendto.c | 263 +++++++++++++++++++++++++-----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +
2 files changed, 225 insertions(+), 40 deletions(-)

[PATCH 09/12] svcrdma: Update svc_rdma_send_error to use DMA LKEY

net/sunrpc/xprtrdma/svc_rdma_transport.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)

[PATCH 10/12] svcrdma: Fix IRD/ORD polarity

net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

[PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma.c | 8 ++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 63 ++++++++++++++++++-----------
3 files changed, 48 insertions(+), 24 deletions(-)

[PATCH 12/12] svcrdma: Documentation update for the FastReg memory model

Documentation/filesystems/nfs-rdma.txt | 126 ++++++++++++++++++++++++++++++++
1 files changed, 126 insertions(+), 0 deletions(-)



2008-10-22 20:23:48

by J. Bruce Fields

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

On Sun, Oct 12, 2008 at 09:20:04PM -0500, Tom Tucker wrote:
> Tom Tucker wrote:
>> J. Bruce Fields wrote:
>>> On Thu, Oct 09, 2008 at 01:46:43PM -0500, Tom Tucker wrote:
>>>
>>>> Coming up with a name for the command is probably
>>>> harder than writing it.
>>>>
>>>> Here it is...
>>>>
>>>
>>> Neat-o, thanks.
>>>
>>> Just for fun, I installed libibverbs from Fedora 9, modprobe'd
>>> ib_uverbs, and tried running this, and got a "libibverbs: Warning:
>>> couldn't open config directory '/etc/libibverbs.d'." Is there a HOWTO
>>> somewhere that I should know about?
>>>
>> Hmm. Sounds like the Fedora RPM didn't do all the necessary bits. I
>> have a Fedora 9 system,
>> I'll see what I get. I typically use the latest OFED distro.
>>
>
> I should qualify this. I use the OFED distro for the user-mode bits. For
> kernel bits, I use top of
> tree.

I poked around the redhat bugzilla for any reports and didn't find
anything.... Is there a package from the OFED distro that I should look
at to get the idea?

--b.

2008-10-22 21:37:04

by Tom Tucker

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

J. Bruce Fields wrote:
> On Sun, Oct 12, 2008 at 09:20:04PM -0500, Tom Tucker wrote:
>> Tom Tucker wrote:
>>> J. Bruce Fields wrote:
>>>> On Thu, Oct 09, 2008 at 01:46:43PM -0500, Tom Tucker wrote:
>>>>
>>>>> Coming up with a name for the command is probably
>>>>> harder than writing it.
>>>>>
>>>>> Here it is...
>>>>>
>>>> Neat-o, thanks.
>>>>
>>>> Just for fun, I installed libibverbs from Fedora 9, modprobe'd
>>>> ib_uverbs, and tried running this, and got a "libibverbs: Warning:
>>>> couldn't open config directory '/etc/libibverbs.d'." Is there a HOWTO
>>>> somewhere that I should know about?
>>>>
>>> Hmm. Sounds like the Fedora RPM didn't do all the necessary bits. I
>>> have a Fedora 9 system,
>>> I'll see what I get. I typically use the latest OFED distro.
>>>
>> I should qualify this. I use the OFED distro for the user-mode bits. For
>> kernel bits, I use top of
>> tree.
>
> I poked around the redhat bugzilla for any reports and didn't find
> anything.... Is there a package from the OFED distro that I should look
> at to get the idea?
>

http://www.openfabrics.org/downloads/OFED/ofed-1.4/OFED-1.4-rc3.tgz

Untar and run ./install.pl

Notes:

1. It has a crippled NFSRDMA port someone did that you definitely don't
want to install. It should not be installed by default, so if you select
option 2 (HPC), you should be OK.

2. The install is dependent on the running kernel, so you'll want to
install it with the test kernel running.

3. I _think_ this thing will overwrite your test kernel's drivers, so
you may need to redo 'make modules_install' after installing OFED. I do
anyway.

Feel free to ping me if you run into trouble.

Tom

> --b.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2008-10-10 21:02:48

by J. Bruce Fields

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

On Thu, Oct 09, 2008 at 01:46:43PM -0500, Tom Tucker wrote:
> Coming up with a name for the command is probably
> harder than writing it.
>
> Here it is...

Neat-o, thanks.

Just for fun, I installed libibverbs from Fedora 9, modprobe'd
ib_uverbs, and tried running this, and got a "libibverbs: Warning:
couldn't open config directory '/etc/libibverbs.d'." Is there a HOWTO
somewhere that I should know about?

--b.

>
> #include <stdlib.h>
> #include <stdio.h>
> #include <infiniband/verbs.h>
>
> #define FAST_REG (1<<21) /* This will be in infiniband/verbs.h in the future */
>
> static char *safety_string(struct ibv_device_attr *a, struct ibv_device *dev)
> {
>     if (a->device_cap_flags & FAST_REG
>         || dev->transport_type == IBV_TRANSPORT_IB)
>         return "Safe. NFSRDMA exposes only RPC memory.\n";
>     else
>         return "Unsafe. NFSRDMA exposes Server memory.\n";
> }
>
> int main(int argc, char *argv[])
> {
>     struct ibv_device **dev_list;
>     struct ibv_context *context;
>     struct ibv_device_attr attr;
>     int dev_count;
>     int i;
>
>     dev_list = ibv_get_device_list(&dev_count);
>     for (i = 0; dev_list && i < dev_count; i++) {
>         printf("%-20s: ", ibv_get_device_name(dev_list[i]));
>         context = ibv_open_device(dev_list[i]);
>         if (!context) {
>             printf("could not open device\n");
>             continue;
>         }
>         if (!ibv_query_device(context, &attr))
>             printf("%s\n", safety_string(&attr, dev_list[i]));
>         else
>             printf("could not query device\n");
>
>         ibv_close_device(context);
>     }
>     if (dev_list)
>         ibv_free_device_list(dev_list);
>
>     exit(0);
> }
>
>> The one drawback is that it wouldn't be able to tell whether the
>> currently running kernel actually supported fast registration. Do you
>> think a guess based on kernel version would be good enough for that?
>>
>
> I do, yes.
>
>>> This code makes devices more secure than they used to be. So there is
>>> no negative security regression here. This patchset simply improves
>>> the security for newer devices that support the new features.
>>
>> Yes, agreed. Just to be clear, I *have* queued up all but these last
>> two patches (the printk and documentation patches) for 2.6.28.
>>
>
> Ok, thanks.
>
>> --b.

2008-10-13 02:18:08

by Tom Tucker

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

J. Bruce Fields wrote:
> On Thu, Oct 09, 2008 at 01:46:43PM -0500, Tom Tucker wrote:
>
>> Coming up with a name for the command is probably
>> harder than writing it.
>>
>> Here it is...
>>
>
> Neat-o, thanks.
>
> Just for fun, I installed libibverbs from Fedora 9, modprobe'd
> ib_uverbs, and tried running this, and got a "libibverbs: Warning:
> couldn't open config directory '/etc/libibverbs.d'." Is there a HOWTO
> somewhere that I should know about?
>
Hmm. Sounds like the Fedora RPM didn't do all the necessary bits. I have
a Fedora 9 system,
I'll see what I get. I typically use the latest OFED distro.

Tom
> --b.
>
>
>> #include <stdlib.h>
>> #include <stdio.h>
>> #include <infiniband/verbs.h>
>>
>> #define FAST_REG (1<<21) /* This will be in infiniband/verbs.h in the future */
>>
>> static char *safety_string(struct ibv_device_attr *a, struct ibv_device *dev)
>> {
>> if (a->device_cap_flags & FAST_REG
>> || dev->transport_type == IBV_TRANSPORT_IB)
>> return "Safe. NFSRDMA exposes only RPC memory.\n";
>> else
>> return "Unsafe. NFSRDMA exposes Server memory.\n";
>> }
>>
>> int main(int argc, char *argv[])
>> {
>> struct ibv_device **dev_list;
>> struct ibv_context *context;
>> struct ibv_device_attr attr;
>> int dev_count;
>> int i;
>>
>> dev_list = ibv_get_device_list(&dev_count);
>> for (i = 0; dev_list && i < dev_count; i++) {
>> printf("%-20s: ", ibv_get_device_name(dev_list[i]));
>> context = ibv_open_device(dev_list[i]);
>> if (!context) {
>> printf("could not open device\n");
>> continue;
>> }
>> if (!ibv_query_device(context, &attr))
>> printf("%s\n", safety_string(&attr, dev_list[i]));
>> else
>> printf("could not query device\n");
>>
>> ibv_close_device(context);
>> }
>> if (dev_list)
>> ibv_free_device_list(dev_list);
>>
>> exit(0);
>> }
>>
>>
>>> The one drawback is that it wouldn't be able to tell whether the
>>> currently running kernel actually supported fast registration. Do you
>>> think a guess based on kernel version would be good enough for that?
>>>
>>>
>> I do, yes.
>>
>>
>>>> This code makes devices more secure than they used to be. So there is
>>>> no negative security regression here. This patchset simply improves
>>>> the security for newer devices that support the new features.
>>>>
>>> Yes, agreed. Just to be clear, I *have* queued up all but these last
>>> two patches (the printk and documentation patches) for 2.6.28.
>>>
>>>
>> Ok, thanks.
>>
>>
>>> --b.


2008-10-13 02:20:05

by Tom Tucker

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

Tom Tucker wrote:
> J. Bruce Fields wrote:
>> On Thu, Oct 09, 2008 at 01:46:43PM -0500, Tom Tucker wrote:
>>
>>> Coming up with a name for the command is probably
>>> harder than writing it.
>>>
>>> Here it is...
>>>
>>
>> Neat-o, thanks.
>>
>> Just for fun, I installed libibverbs from Fedora 9, modprobe'd
>> ib_uverbs, and tried running this, and got a "libibverbs: Warning:
>> couldn't open config directory '/etc/libibverbs.d'." Is there a HOWTO
>> somewhere that I should know about?
>>
> Hmm. Sounds like the Fedora RPM didn't do all the necessary bits. I
> have a Fedora 9 system,
> I'll see what I get. I typically use the latest OFED distro.
>

I should qualify this. I use the OFED distro for the user-mode bits. For
kernel bits, I use top of
tree.

Thanks,
Tom

> Tom
>> --b.
>>
>>
>>> #include <stdlib.h>
>>> #include <stdio.h>
>>> #include <infiniband/verbs.h>
>>>
>>> #define FAST_REG (1<<21) /* This will be in infiniband/verbs.h in
>>> the future */
>>>
>>> static char *safety_string(struct ibv_device_attr *a, struct
>>> ibv_device *dev)
>>> {
>>> if (a->device_cap_flags & FAST_REG
>>> || dev->transport_type == IBV_TRANSPORT_IB)
>>> return "Safe. NFSRDMA exposes only RPC memory.\n";
>>> else
>>> return "Unsafe. NFSRDMA exposes Server memory.\n";
>>> }
>>>
>>> int main(int argc, char *argv[])
>>> {
>>> struct ibv_device **dev_list;
>>> struct ibv_context *context;
>>> struct ibv_device_attr attr;
>>> int dev_count;
>>> int i;
>>>
>>> dev_list = ibv_get_device_list(&dev_count);
>>> for (i = 0; dev_list && i < dev_count; i++) {
>>> printf("%-20s: ", ibv_get_device_name(dev_list[i]));
>>> context = ibv_open_device(dev_list[i]);
>>> if (!context) {
>>> printf("could not open device\n");
>>> continue;
>>> }
>>> if (!ibv_query_device(context, &attr))
>>> printf("%s\n", safety_string(&attr,
>>> dev_list[i]));
>>> else
>>> printf("could not query device\n");
>>>
>>> ibv_close_device(context);
>>> }
>>> if (dev_list)
>>> ibv_free_device_list(dev_list);
>>>
>>> exit(0);
>>> }
>>>
>>>
>>>> The one drawback is that it wouldn't be able to tell whether the
>>>> currently running kernel actually supported fast registration. Do you
>>>> think a guess based on kernel version would be good enough for that?
>>>>
>>>>
>>> I do, yes.
>>>
>>>
>>>>> This code makes devices more secure than they used to be. So there
>>>>> is no negative security regression here. This patchset simply
>>>>> improves the security for newer devices that support the new
>>>>> features.
>>>>>
>>>> Yes, agreed. Just to be clear, I *have* queued up all but these last
>>>> two patches (the printk and documentation patches) for 2.6.28.
>>>>
>>> Ok, thanks.
>>>
>>>
>>>> --b.


2008-10-08 22:26:51

by J. Bruce Fields

Subject: Re: [PATCH 00/12] svcrdma: Fast memory registration support

On Fri, Oct 03, 2008 at 04:33:37PM -0500, Tom Tucker wrote:
>
> Bruce:
>
> This updated patchset implements support for Fast Memory Registration
> in the NFS server. The second to last patch has been updated to
> provide better guidance on whether the connection is safe or
> not. Basically the code gives you a qualification string instead of
> expecting the administrator to know how to interpret the memory
> registration strategy and transport idiosyncrasies. The documentation
> has also been updated given your suggestions and the change to the
> informative log message. The string is certainly informative, but may
> be a little long. For this reason, there is a proc file entry to
> disable it.

OK, thanks, I've applied the first 10 patches (actually, merged them
from the for-bfields branch of your git tree).

I'm still a little iffy on patch #11.

--b.

>
> Fast Memory Regstration is the ability to quickly map a kernel memory
> page list as a logically contiguous memory region from the perspective
> of the adapter. This mapping is created and invalidated using work
> requests posted on the SQ.
>
> There are two benefits of this new memory registration strategy:
>
> - It allows for large amounts of data to be transferred between the
> client and server with a single work request
>
> - It improves security since the effective lifetime and scope of an
> RKEY is a single RPC and WR.
>
> The documentation file Documentation/filesystems/nfs-rdma.txt file has
> been updated with information about NFSRDMA security in general and
> how it is affected by the new Fast Memory Registration capability.
>
> This new capability is only enabled if the underlying device advertises
> that it is supported. It is not necessary that both the client and
> server use the same strategy, however, for Fast Memory Registration,
> it improves performance.
>
> These patches are also available here:
> git://git.linux-nfs.org/projects/tomtucker/xprt-switch-2.6.git
>
> Signed-off-by: Tom Tucker <[email protected]>
>
>
> [PATCH 01/12] svcrdma: Add Fast Reg MR Data Types
>
> include/linux/sunrpc/svc_rdma.h | 22 +++++++++++++++++++++-
> 1 files changed, 21 insertions(+), 1 deletions(-)
>
> [PATCH 02/12] svcrdma: Add FRMR get/put services
>
> include/linux/sunrpc/svc_rdma.h | 3 +
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 122 ++++++++++++++++++++++++++++-
> 2 files changed, 120 insertions(+), 5 deletions(-)
>
> [PATCH 03/12] svcrdma: Query device for Fast Reg support during connection setup
>
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 76 +++++++++++++++++++++++++++--
> 1 files changed, 70 insertions(+), 6 deletions(-)
>
> [PATCH 04/12] svcrdma: Add a service to register a Fast Reg MR with the device
>
> include/linux/sunrpc/svc_rdma.h | 1 +
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 118 +++++++++++++++++++++---------
> 2 files changed, 84 insertions(+), 35 deletions(-)
>
> [PATCH 05/12] svcrdma: Modify post recv path to use local dma key
>
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 +++++++++---
> 1 files changed, 9 insertions(+), 3 deletions(-)
>
> [PATCH 06/12] svcrdma: Add support to svc_rdma_send to handle chained WR
>
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 29 +++++++++++++++++++++--------
> 1 files changed, 21 insertions(+), 8 deletions(-)
>
> [PATCH 07/12] svcrdma: Modify the RPC recv path to use FRMR when available
>
> include/linux/sunrpc/svc_rdma.h | 1 +
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 187 ++++++++++++++++++++++++++----
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +-
> 3 files changed, 171 insertions(+), 22 deletions(-)
>
> [PATCH 08/12] svcrdma: Modify the RPC reply path to use FRMR when available
>
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 263 +++++++++++++++++++++++++-----
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +
> 2 files changed, 225 insertions(+), 40 deletions(-)
>
> [PATCH 09/12] svcrdma: Update svc_rdma_send_error to use DMA LKEY
>
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 11 +++++++++--
> 1 files changed, 9 insertions(+), 2 deletions(-)
>
> [PATCH 10/12] svcrdma: Fix IRD/ORD polarity
>
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used
>
> include/linux/sunrpc/svc_rdma.h | 1 +
> net/sunrpc/xprtrdma/svc_rdma.c | 8 ++++
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 63 ++++++++++++++++++-----------
> 3 files changed, 48 insertions(+), 24 deletions(-)
>
> [PATCH 12/12] svcrdma: Documentation update for the FastReg memory model
>
> Documentation/filesystems/nfs-rdma.txt | 126 ++++++++++++++++++++++++++++++++
> 1 files changed, 126 insertions(+), 0 deletions(-)
>

2008-10-08 22:48:43

by J. Bruce Fields

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

On Fri, Oct 03, 2008 at 04:33:48PM -0500, Tom Tucker wrote:
> Add a log message that allows the administrator to determine if server memory
> is exposed on a particular client connection. This message can be disabled
> by writing 0 to the /proc/sys/sunrpc/svc_rdma/show_conn_info file.

I could grudgingly live with the idea of a log message here as a
temporary fix while we figure out something better, but I'm not happy
about making it bigger and adding a sysctl.

If we just want a hack for now, I'd be inclined to leave this printk a
dprintk, add the extra information to the dprintk, and tell people to
turn on transport debugging and watch a client connect. Ugly, but at
least it'll be obvious it's not an api that's going to stick around.

Beyond the short-term: is there some other way of getting this
information from userspace? Since this is a property of the interface
device, it'd seem natural to communicate the information in something
like a sysfs file for the device, or whatever's used to query properties
of network interfaces.

I'm a bit surprised the information isn't already there. Aren't
userspace applications eventually also supposed to be able to use rdma?
Don't they need to query network interfaces for their capabilities?

--b.

>
> Signed-off-by: Tom Tucker <[email protected]>
>
> ---
> include/linux/sunrpc/svc_rdma.h | 1 +
> net/sunrpc/xprtrdma/svc_rdma.c | 8 ++++
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 63 ++++++++++++++++++-----------
> 3 files changed, 48 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index c14fe86..7ee0a76 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -52,6 +52,7 @@
> extern unsigned int svcrdma_ord;
> extern unsigned int svcrdma_max_requests;
> extern unsigned int svcrdma_max_req_size;
> +extern unsigned int svcrdma_show_conn_info;
>
> extern atomic_t rdma_stat_recv;
> extern atomic_t rdma_stat_read;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
> index 8710117..493e243 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
> @@ -58,6 +58,7 @@ static unsigned int max_max_requests = 16384;
> unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE;
> static unsigned int min_max_inline = 4096;
> static unsigned int max_max_inline = 65536;
> +unsigned int svcrdma_show_conn_info = 1;
>
> atomic_t rdma_stat_recv;
> atomic_t rdma_stat_read;
> @@ -145,6 +146,13 @@ static ctl_table svcrdma_parm_table[] = {
> .extra1 = &min_ord,
> .extra2 = &max_ord,
> },
> + {
> + .procname = "show_conn_info",
> + .data = &svcrdma_show_conn_info,
> + .maxlen = sizeof(svcrdma_show_conn_info),
> + .mode = 0644,
> + .proc_handler = &proc_dointvec,
> + },
>
> {
> .procname = "rdma_stat_read",
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index bd17023..3d97032 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -833,6 +833,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> struct rdma_conn_param conn_param;
> struct ib_qp_init_attr qp_attr;
> struct ib_device_attr devattr;
> + char *transport_string;
> + char *security_string;
> int dma_mr_acc;
> int need_dma_mr;
> int ret;
> @@ -981,10 +983,13 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> /*
> * Determine if a DMA MR is required and if so, what privs are required
> */
> + security_string = "Safe. Only RPC memory exposed.";
> switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
> case RDMA_TRANSPORT_IWARP:
> + transport_string = "iWARP";
> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
> if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
> + security_string = "Unsafe: Server memory exposed.";
> need_dma_mr = 1;
> dma_mr_acc =
> (IB_ACCESS_LOCAL_WRITE |
> @@ -996,6 +1001,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> need_dma_mr = 0;
> break;
> case RDMA_TRANSPORT_IB:
> + transport_string = "IB";
> if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
> need_dma_mr = 1;
> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
> @@ -1052,30 +1058,39 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> goto errout;
> }
>
> - dprintk("svcrdma: new connection %p accepted with the following "
> - "attributes:\n"
> - " local_ip : %d.%d.%d.%d\n"
> - " local_port : %d\n"
> - " remote_ip : %d.%d.%d.%d\n"
> - " remote_port : %d\n"
> - " max_sge : %d\n"
> - " sq_depth : %d\n"
> - " max_requests : %d\n"
> - " ord : %d\n",
> - newxprt,
> - NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
> - route.addr.src_addr)->sin_addr.s_addr),
> - ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
> - route.addr.src_addr)->sin_port),
> - NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
> - route.addr.dst_addr)->sin_addr.s_addr),
> - ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
> - route.addr.dst_addr)->sin_port),
> - newxprt->sc_max_sge,
> - newxprt->sc_sq_depth,
> - newxprt->sc_max_requests,
> - newxprt->sc_ord);
> -
> + if (!svcrdma_show_conn_info)
> + goto out;
> +
> + printk(KERN_INFO "svcrdma: New connection accepted with the following "
> + "attributes:\n"
> + " transport : %s\n"
> + " local_ip : %d.%d.%d.%d\n"
> + " local_port : %d\n"
> + " remote_ip : %d.%d.%d.%d\n"
> + " remote_port : %d\n"
> + " max_sge : %d\n"
> + " sq_depth : %d\n"
> + " max_requests : %d\n"
> + " ord : %d\n"
> + " using fastreg? : %c\n"
> + " memory exposure : %s\n",
> + transport_string,
> + NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
> + route.addr.src_addr)->sin_addr.s_addr),
> + ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
> + route.addr.src_addr)->sin_port),
> + NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
> + route.addr.dst_addr)->sin_addr.s_addr),
> + ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
> + route.addr.dst_addr)->sin_port),
> + newxprt->sc_max_sge,
> + newxprt->sc_sq_depth,
> + newxprt->sc_max_requests,
> + newxprt->sc_ord,
> + newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG ? 'Y' : 'N',
> + security_string);
> +
> + out:
> return &newxprt->sc_xprt;
>
> errout:

2008-10-09 06:37:07

by Tom Tucker

Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

J. Bruce Fields wrote:
> On Fri, Oct 03, 2008 at 04:33:48PM -0500, Tom Tucker wrote:
>
>> Add a log message that allows the administrator to determine if server memory
>> is exposed on a particular client connection. This message can be disabled
>> by writing 0 to the /proc/sys/sunrpc/svc_rdma/show_conn_info file.
>>
>
> I could grudgingly live with the idea of a log message here as a
> temporary fix while we figure out something better, but I'm not happy
> about making it bigger and adding a sysctl.
>
>
I would happily remove all of this. I believed you thought it important
that we actively inform the administrator. Maybe I over-reacted.
> If we just want a hack for now, I'd be inclined to leave this printk a
> dprintk, add the extra information to the dprintk, and tell people to
> turn on transport debugging and watch a client connect. Ugly, but at
> least it'll be obvious it's not an api that's going to stick around.
>
>
I'd love to get rid of it...
> Beyond the short-term: is there some other way of getting this
> information from userspace? Since this is a property of the interface
> device, it'd seem natural to communicate the information in something
> like a sysfs file for the device, or whatever's used to query properties
> of network interfaces.
>
>
> I'm a bit surprised the information isn't already there. Aren't
> userspace applications eventually also supposed to be able to use rdma?
> Don't they need to query network interfaces for their capabilities?
>
>
All of this information is available from a full-function user-mode API.

This code makes devices more secure than they used to be. So there is no
negative security regression here.
This patchset simply improves the security for newer devices that
support the new features.

If you tell me to yank this out, it will make my day.

Tom
> --b.
>
>
>> Signed-off-by: Tom Tucker <[email protected]>
>>
>> ---
>> include/linux/sunrpc/svc_rdma.h | 1 +
>> net/sunrpc/xprtrdma/svc_rdma.c | 8 ++++
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 63 ++++++++++++++++++-----------
>> 3 files changed, 48 insertions(+), 24 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
>> index c14fe86..7ee0a76 100644
>> --- a/include/linux/sunrpc/svc_rdma.h
>> +++ b/include/linux/sunrpc/svc_rdma.h
>> @@ -52,6 +52,7 @@
>> extern unsigned int svcrdma_ord;
>> extern unsigned int svcrdma_max_requests;
>> extern unsigned int svcrdma_max_req_size;
>> +extern unsigned int svcrdma_show_conn_info;
>>
>> extern atomic_t rdma_stat_recv;
>> extern atomic_t rdma_stat_read;
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
>> index 8710117..493e243 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma.c
>> @@ -58,6 +58,7 @@ static unsigned int max_max_requests = 16384;
>> unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE;
>> static unsigned int min_max_inline = 4096;
>> static unsigned int max_max_inline = 65536;
>> +unsigned int svcrdma_show_conn_info = 1;
>>
>> atomic_t rdma_stat_recv;
>> atomic_t rdma_stat_read;
>> @@ -145,6 +146,13 @@ static ctl_table svcrdma_parm_table[] = {
>> .extra1 = &min_ord,
>> .extra2 = &max_ord,
>> },
>> + {
>> + .procname = "show_conn_info",
>> + .data = &svcrdma_show_conn_info,
>> + .maxlen = sizeof(svcrdma_show_conn_info),
>> + .mode = 0644,
>> + .proc_handler = &proc_dointvec,
>> + },
>>
>> {
>> .procname = "rdma_stat_read",
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index bd17023..3d97032 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -833,6 +833,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> struct rdma_conn_param conn_param;
>> struct ib_qp_init_attr qp_attr;
>> struct ib_device_attr devattr;
>> + char *transport_string;
>> + char *security_string;
>> int dma_mr_acc;
>> int need_dma_mr;
>> int ret;
>> @@ -981,10 +983,13 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> /*
>> * Determine if a DMA MR is required and if so, what privs are required
>> */
>> + security_string = "Safe. Only RPC memory exposed.";
>> switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
>> case RDMA_TRANSPORT_IWARP:
>> + transport_string = "iWARP";
>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
>> if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
>> + security_string = "Unsafe: Server memory exposed.";
>> need_dma_mr = 1;
>> dma_mr_acc =
>> (IB_ACCESS_LOCAL_WRITE |
>> @@ -996,6 +1001,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> need_dma_mr = 0;
>> break;
>> case RDMA_TRANSPORT_IB:
>> + transport_string = "IB";
>> if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
>> need_dma_mr = 1;
>> dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
>> @@ -1052,30 +1058,39 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> goto errout;
>> }
>>
>> - dprintk("svcrdma: new connection %p accepted with the following "
>> - "attributes:\n"
>> - " local_ip : %d.%d.%d.%d\n"
>> - " local_port : %d\n"
>> - " remote_ip : %d.%d.%d.%d\n"
>> - " remote_port : %d\n"
>> - " max_sge : %d\n"
>> - " sq_depth : %d\n"
>> - " max_requests : %d\n"
>> - " ord : %d\n",
>> - newxprt,
>> - NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> - route.addr.src_addr)->sin_addr.s_addr),
>> - ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> - route.addr.src_addr)->sin_port),
>> - NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> - route.addr.dst_addr)->sin_addr.s_addr),
>> - ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> - route.addr.dst_addr)->sin_port),
>> - newxprt->sc_max_sge,
>> - newxprt->sc_sq_depth,
>> - newxprt->sc_max_requests,
>> - newxprt->sc_ord);
>> -
>> + if (!svcrdma_show_conn_info)
>> + goto out;
>> +
>> + printk(KERN_INFO "svcrdma: New connection accepted with the following "
>> + "attributes:\n"
>> + " transport : %s\n"
>> + " local_ip : %d.%d.%d.%d\n"
>> + " local_port : %d\n"
>> + " remote_ip : %d.%d.%d.%d\n"
>> + " remote_port : %d\n"
>> + " max_sge : %d\n"
>> + " sq_depth : %d\n"
>> + " max_requests : %d\n"
>> + " ord : %d\n"
>> + " using fastreg? : %c\n"
>> + " memory exposure : %s\n",
>> + transport_string,
>> + NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> + route.addr.src_addr)->sin_addr.s_addr),
>> + ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> + route.addr.src_addr)->sin_port),
>> + NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> + route.addr.dst_addr)->sin_addr.s_addr),
>> + ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
>> + route.addr.dst_addr)->sin_port),
>> + newxprt->sc_max_sge,
>> + newxprt->sc_sq_depth,
>> + newxprt->sc_max_requests,
>> + newxprt->sc_ord,
>> + newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG ? 'Y' : 'N',
>> + security_string);
>> +
>> + out:
>> return &newxprt->sc_xprt;
>>
>> errout:
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


2008-10-09 16:26:26

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

On Thu, Oct 09, 2008 at 01:37:05AM -0500, Tom Tucker wrote:
> J. Bruce Fields wrote:
>> On Fri, Oct 03, 2008 at 04:33:48PM -0500, Tom Tucker wrote:
>>
>>> Add a log message that allows the administrator to determine if server memory
>>> is exposed on a particular client connection. This message can be disabled
>>> by writing 0 to the /proc/sys/sunrpc/svc_rdma/show_conn_info file.
>>>
>>
>> I could grudgingly live with the idea of a log message here as a
>> temporary fix while we figure out something better, but I'm not happy
>> about making it bigger and adding a sysctl.
>>
>>
> I would happily remove all of this. I believed you thought it important
> that we actively informed the administrator.
> Maybe I over-reacted.

Well, I was probably unclear and/or confused, apologies. What I think
would make sense for now is just some easy way an administrator could
answer the question "what kind of security model are my rdma interfaces
using?"

A cautious administrator will want to answer these questions before
turning on nfs over rdma, so a log message on client connect isn't as
useful. Also, messages to the log are fine for debugging, for notices
about exceptional events, etc., but they aren't reliable for this kind
of use (they get stored in distro-specific locations for varying amounts
of time; wording may change across kernel versions, making them harder
to grep for; etc).

So these log messages might serve as a stopgap, but I'd prefer something
that could be queried reliably at any time.

If in addition we wanted to warn about the riskier case, maybe we could
print a message *just* in that case (and print it only once), but I
don't feel strongly about that.

>> If we just want a hack for now, I'd be inclined to leave this printk a
>> dprintk, add the extra information to the dprintk, and tell people to
>> turn on transport debugging and watch a client connect. Ugly, but at
>> least it'll be obvious it's not an api that's going to stick around.
>>
>>
> I'd love to get rid of it...
>> Beyond the short-term: is there some other way of getting this
>> information from userspace? Since this is a property of the interface
>> device, it'd seem natural to communicate the information in something
>> like a sysfs file for the device, or whatever's used to query properties
>> of network interfaces.
>>
>> I'm a bit surprised the information isn't already there. Aren't
>> userspace applications eventually also supposed to be able to use rdma?
>> Don't they need to query network interfaces for their capabilities?
>>
>>
> All of this information is available from a full-function user-mode API.

Oh, cool. So would it be difficult to write a C program that just
printed out some basic information about the rdma-capable interfaces on
the system? If it didn't have a lot of dependencies, we could even
consider adding it to nfs-utils.

The one drawback is that it wouldn't be able to tell whether the
currently running kernel actually supported fast registration. Do you
think a guess based on kernel version would be good enough for that?

> This code makes devices more secure than they used to be. So there is no
> negative security regression here. This patchset simply improves the
> security for newer devices that support the new features.

Yes, agreed. Just to be clear, I *have* queued up all but these last
two patches (the printk and documentation patches) for 2.6.28.

--b.

2008-10-09 18:46:43

by Tom Tucker

[permalink] [raw]
Subject: Re: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

J. Bruce Fields wrote:
> On Thu, Oct 09, 2008 at 01:37:05AM -0500, Tom Tucker wrote:
>> J. Bruce Fields wrote:
>>> On Fri, Oct 03, 2008 at 04:33:48PM -0500, Tom Tucker wrote:
>>>
>>>> Add a log message that allows the administrator to determine if server memory
>>>> is exposed on a particular client connection. This message can be disabled
>>>> by writing 0 to the /proc/sys/sunrpc/svc_rdma/show_conn_info file.
>>>>
>>> I could grudgingly live with the idea of a log message here as a
>>> temporary fix while we figure out something better, but I'm not happy
>>> about making it bigger and adding a sysctl.
>>>
>>>
>> I would happily remove all of this. I believed you thought it important
>> that we actively informed the administrator.
>> Maybe I over-reacted.
>
> Well, I was probably unclear and/or confused, apologies. What I think
> would make sense for now is just some easy way an administrator could
> answer the question "what kind of security model are my rdma interfaces
> using?"
>
> A cautious administrator will want to answer these questions before
> turning on nfs over rdma, so a log message on client connect isn't as
> useful. Also, messages to the log are fine for debugging, for notices
> about exceptional events, etc., but they aren't reliable for this kind
> of use (they get stored in distro-specific locations for varying amounts
> of time; wording may change across kernel versions, making them harder
> to grep for; etc).
>
> So these log messages might serve as a stopgap, but I'd prefer something
> that could be queried reliably at any time.
>
> If in addition we wanted to warn about the riskier case, maybe we could
> print a message *just* in that case (and print it only once), but I
> don't feel strongly about that.
>
>>> If we just want a hack for now, I'd be inclined to leave this printk a
>>> dprintk, add the extra information to the dprintk, and tell people to
>>> turn on transport debugging and watch a client connect. Ugly, but at
>>> least it'll be obvious it's not an api that's going to stick around.
>>>
>>>
>> I'd love to get rid of it...
>>> Beyond the short-term: is there some other way of getting this
>>> information from userspace? Since this is a property of the interface
>>> device, it'd seem natural to communicate the information in something
>>> like a sysfs file for the device, or whatever's used to query properties
>>> of network interfaces.
>>>
>>> I'm a bit surprised the information isn't already there. Aren't
>>> userspace applications eventually also supposed to be able to use rdma?
>>> Don't they need to query network interfaces for their capabilities?
>>>
>>>
>> All of this information is available from a full-function user-mode API.
>
> Oh, cool. So would it be difficult to write a C program that just
> printed out some basic information about the rdma-capable interfaces on
> the system? If it didn't have a lot of dependencies, we could even
> consider adding it to nfs-utils.
>

Coming up with a name for the command is probably
harder than writing it.

Here it is...

#include <stdlib.h>
#include <stdio.h>
#include <infiniband/verbs.h>

#define FAST_REG (1<<21) /* This will be in infiniband/verbs.h in the future */

static char *safety_string(struct ibv_device_attr *a, struct ibv_device *dev)
{
	if (a->device_cap_flags & FAST_REG ||
	    dev->transport_type == IBV_TRANSPORT_IB)
		return "Safe. NFSRDMA exposes only RPC memory.";
	else
		return "Unsafe. NFSRDMA exposes Server memory.";
}

int main(int argc, char *argv[])
{
	struct ibv_device **dev_list;
	struct ibv_context *context;
	struct ibv_device_attr attr;
	int dev_count;
	int i;

	dev_list = ibv_get_device_list(&dev_count);
	for (i = 0; dev_list && i < dev_count; i++) {
		printf("%-20s: ", ibv_get_device_name(dev_list[i]));
		context = ibv_open_device(dev_list[i]);
		if (!context) {
			printf("could not open device\n");
			continue;
		}
		if (!ibv_query_device(context, &attr))
			printf("%s\n", safety_string(&attr, dev_list[i]));
		else
			printf("could not query device\n");

		ibv_close_device(context);
	}
	if (dev_list)
		ibv_free_device_list(dev_list);

	exit(0);
}

> The one drawback is that it wouldn't be able to tell whether the
> currently running kernel actually supported fast registration. Do you
> think a guess based on kernel version would be good enough for that?
>

I do, yes.

>> This code makes devices more secure than they used to be. So there is no
>> negative security regression here. This patchset simply improves the
>> security for newer devices that support the new features.
>
> Yes, agreed. Just to be clear, I *have* queued up all but these last
> two patches (the printk and documentation patches) for 2.6.28.
>

Ok, thanks.

> --b.


2008-10-03 21:33:50

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 01/12] svcrdma: Add Fast Reg MR Data Types

Add data types to track Fast Reg Memory Regions. The core data type is
svc_rdma_fastreg_mr, which associates a device MR with a host kva and
page list. A field is added to the WR context to keep track of the FRMR
used to map the local memory for an RPC.

An FRMR list and spin lock are added to the transport instance to keep
track of all FRMRs allocated for the transport. Also added are device
capability flags that record the memory registration capabilities of the
underlying device, including whether fast memory registration is
supported.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index dc05b54..49e458d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -72,6 +72,7 @@ extern atomic_t rdma_stat_sq_prod;
*/
struct svc_rdma_op_ctxt {
struct svc_rdma_op_ctxt *read_hdr;
+ struct svc_rdma_fastreg_mr *frmr;
int hdr_count;
struct xdr_buf arg;
struct list_head dto_q;
@@ -103,16 +104,30 @@ struct svc_rdma_chunk_sge {
int start; /* sge no for this chunk */
int count; /* sge count for this chunk */
};
+struct svc_rdma_fastreg_mr {
+ struct ib_mr *mr;
+ void *kva;
+ struct ib_fast_reg_page_list *page_list;
+ int page_list_len;
+ unsigned long access_flags;
+ unsigned long map_len;
+ enum dma_data_direction direction;
+ struct list_head frmr_list;
+};
struct svc_rdma_req_map {
+ struct svc_rdma_fastreg_mr *frmr;
unsigned long count;
union {
struct kvec sge[RPCSVC_MAXPAGES];
struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
};
};
-
+#define RDMACTXT_F_FAST_UNREG 1
#define RDMACTXT_F_LAST_CTXT 2

+#define SVCRDMA_DEVCAP_FAST_REG 1 /* fast mr registration */
+#define SVCRDMA_DEVCAP_READ_W_INV 2 /* read w/ invalidate */
+
struct svcxprt_rdma {
struct svc_xprt sc_xprt; /* SVC transport structure */
struct rdma_cm_id *sc_cm_id; /* RDMA connection id */
@@ -136,6 +151,11 @@ struct svcxprt_rdma {
struct ib_cq *sc_rq_cq;
struct ib_cq *sc_sq_cq;
struct ib_mr *sc_phys_mr; /* MR for server memory */
+ u32 sc_dev_caps; /* distilled device caps */
+ u32 sc_dma_lkey; /* local dma key */
+ unsigned int sc_frmr_pg_list_len;
+ struct list_head sc_frmr_q;
+ spinlock_t sc_frmr_q_lock;

spinlock_t sc_lock; /* transport lock */
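
The two SVCRDMA_DEVCAP bits introduced above are meant to be OR'd into
the new sc_dev_caps word during connection setup (which happens later in
this series). As a toy userspace illustration of that distillation — the
helper name and its inputs are purely illustrative, not kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Copies of the capability bits defined in the patch */
#define SVCRDMA_DEVCAP_FAST_REG    1	/* fast mr registration */
#define SVCRDMA_DEVCAP_READ_W_INV  2	/* read w/ invalidate */

/*
 * Distill raw device/transport properties into a flags word, the way
 * sc_dev_caps is populated: fastreg when the device advertises it,
 * and read-with-invalidate unconditionally on iWARP.
 */
static uint32_t distill_dev_caps(int has_mem_mgt_extensions, int is_iwarp)
{
	uint32_t caps = 0;

	if (has_mem_mgt_extensions)
		caps |= SVCRDMA_DEVCAP_FAST_REG;
	if (is_iwarp)
		caps |= SVCRDMA_DEVCAP_READ_W_INV;
	return caps;
}
```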


2008-10-03 21:33:50

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 02/12] svcrdma: Add FRMR get/put services

Add services for the allocating, freeing, and unmapping Fast Reg MR. These
services will be used by the transport connection setup, send and receive
routines.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 3 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 122 ++++++++++++++++++++++++++++-
2 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 49e458d..3425268 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -214,6 +214,9 @@ extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
+extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
+extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
+ struct svc_rdma_fastreg_mr *);
extern void svc_sq_reap(struct svcxprt_rdma *);
extern void svc_rq_reap(struct svcxprt_rdma *);
extern struct svc_xprt_class svc_rdma_class;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 900cb69..dd170af 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -100,6 +100,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
+ ctxt->frmr = NULL;
atomic_inc(&xprt->sc_ctxt_used);
return ctxt;
}
@@ -109,11 +110,19 @@ static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
struct svcxprt_rdma *xprt = ctxt->xprt;
int i;
for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
- atomic_dec(&xprt->sc_dma_used);
- ib_dma_unmap_single(xprt->sc_cm_id->device,
- ctxt->sge[i].addr,
- ctxt->sge[i].length,
- ctxt->direction);
+ /*
+ * Unmap the DMA addr in the SGE if the lkey matches
+ * the sc_dma_lkey, otherwise, ignore it since it is
+ * an FRMR lkey and will be unmapped later when the
+ * last WR that uses it completes.
+ */
+ if (ctxt->sge[i].lkey == xprt->sc_dma_lkey) {
+ atomic_dec(&xprt->sc_dma_used);
+ ib_dma_unmap_single(xprt->sc_cm_id->device,
+ ctxt->sge[i].addr,
+ ctxt->sge[i].length,
+ ctxt->direction);
+ }
}
}

@@ -150,6 +159,7 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
schedule_timeout_uninterruptible(msecs_to_jiffies(500));
}
map->count = 0;
+ map->frmr = NULL;
return map;
}

@@ -425,10 +435,12 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
+ INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
init_waitqueue_head(&cma_xprt->sc_send_wait);

spin_lock_init(&cma_xprt->sc_lock);
spin_lock_init(&cma_xprt->sc_rq_dto_lock);
+ spin_lock_init(&cma_xprt->sc_frmr_q_lock);

cma_xprt->sc_ord = svcrdma_ord;

@@ -686,6 +698,103 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
return ERR_PTR(ret);
}

+static int rdma_alloc_frmr(struct svcxprt_rdma *xprt)
+{
+ struct ib_mr *mr;
+ struct ib_fast_reg_page_list *pl;
+ struct svc_rdma_fastreg_mr *frmr;
+
+ mr = ib_alloc_fast_reg_mr(xprt->sc_pd, RPCSVC_MAXPAGES);
+ if (!mr)
+ goto errout;
+ pl = ib_alloc_fast_reg_page_list(xprt->sc_cm_id->device,
+ RPCSVC_MAXPAGES);
+ if (!pl) {
+ ib_dereg_mr(mr);
+ goto errout;
+ }
+ frmr = kmalloc(sizeof(*frmr), GFP_KERNEL);
+ if (!frmr) {
+ ib_dereg_mr(mr);
+ ib_free_fast_reg_page_list(pl);
+ goto errout;
+ }
+ frmr->mr = mr;
+ frmr->page_list = pl;
+ INIT_LIST_HEAD(&frmr->frmr_list);
+ spin_lock_bh(&xprt->sc_frmr_q_lock);
+ list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
+ spin_unlock_bh(&xprt->sc_frmr_q_lock);
+
+ return 0;
+
+ errout:
+ return -ENOMEM;
+}
+
+static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
+{
+ struct svc_rdma_fastreg_mr *frmr;
+
+ while (!list_empty(&xprt->sc_frmr_q)) {
+ frmr = list_entry(xprt->sc_frmr_q.next,
+ struct svc_rdma_fastreg_mr, frmr_list);
+ list_del_init(&frmr->frmr_list);
+ ib_dereg_mr(frmr->mr);
+ ib_free_fast_reg_page_list(frmr->page_list);
+ kfree(frmr);
+ }
+}
+
+struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
+{
+ struct svc_rdma_fastreg_mr *frmr = NULL;
+
+ while (1) {
+ spin_lock_bh(&rdma->sc_frmr_q_lock);
+ if (!list_empty(&rdma->sc_frmr_q)) {
+ frmr = list_entry(rdma->sc_frmr_q.next,
+ struct svc_rdma_fastreg_mr, frmr_list);
+ list_del_init(&frmr->frmr_list);
+ }
+ spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ if (frmr)
+ break;
+ if (rdma_alloc_frmr(rdma))
+ return ERR_PTR(-ENOMEM);
+ }
+ frmr->map_len = 0;
+ frmr->page_list_len = 0;
+
+ return frmr;
+}
+
+static void frmr_unmap_dma(struct svcxprt_rdma *xprt,
+ struct svc_rdma_fastreg_mr *frmr)
+{
+ int page_no;
+ for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
+ dma_addr_t addr = frmr->page_list->page_list[page_no];
+ if (ib_dma_mapping_error(frmr->mr->device, addr))
+ continue;
+ atomic_dec(&xprt->sc_dma_used);
+ ib_dma_unmap_single(frmr->mr->device, addr, PAGE_SIZE,
+ frmr->direction);
+ }
+}
+
+void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
+ struct svc_rdma_fastreg_mr *frmr)
+{
+ if (frmr) {
+ frmr_unmap_dma(rdma, frmr);
+ spin_lock_bh(&rdma->sc_frmr_q_lock);
+ BUG_ON(!list_empty(&frmr->frmr_list));
+ list_add(&frmr->frmr_list, &rdma->sc_frmr_q);
+ spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ }
+}
+
/*
* This is the xpo_recvfrom function for listening endpoints. Its
* purpose is to accept incoming connections. The CMA callback handler
@@ -961,6 +1070,9 @@ static void __svc_rdma_free(struct work_struct *work)
WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0);
WARN_ON(atomic_read(&rdma->sc_dma_used) != 0);

+ /* De-allocate fastreg mr */
+ rdma_dealloc_frmr_q(rdma);
+
/* Destroy the QP if present (not a listener) */
if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
ib_destroy_qp(rdma->sc_qp);
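
The get/put pair above is a lock-protected free list that allocates on
demand and recycles on put. A minimal userspace sketch of the same
pattern, with a pthread mutex standing in for the kernel spinlock and
all names illustrative:

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative stand-in for svc_rdma_fastreg_mr */
struct toy_frmr {
	struct toy_frmr *next;
	unsigned long map_len;
};

struct toy_frmr_pool {
	pthread_mutex_t lock;
	struct toy_frmr *free_list;
};

/* Pop a recycled FRMR, or allocate a fresh one when the pool is empty */
static struct toy_frmr *toy_frmr_get(struct toy_frmr_pool *p)
{
	struct toy_frmr *f = NULL;

	pthread_mutex_lock(&p->lock);
	if (p->free_list) {
		f = p->free_list;
		p->free_list = f->next;
	}
	pthread_mutex_unlock(&p->lock);
	if (!f)
		f = calloc(1, sizeof(*f));	/* kernel code also allocates the MR */
	if (f)
		f->map_len = 0;			/* reset per-use state, as the patch does */
	return f;
}

/* Return an FRMR to the pool for reuse (kernel code first unmaps DMA) */
static void toy_frmr_put(struct toy_frmr_pool *p, struct toy_frmr *f)
{
	pthread_mutex_lock(&p->lock);
	f->next = p->free_list;
	p->free_list = f;
	pthread_mutex_unlock(&p->lock);
}
```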

2008-10-03 21:33:50

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 03/12] svcrdma: Query device for Fast Reg support during connection setup

Query the device capabilities in the svc_rdma_accept function to determine
what advanced memory management capabilities are supported by the device.
Based on the query, select the most secure model available given the
requirements of the transport and capabilities of the adapter.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 76 +++++++++++++++++++++++++++--
1 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index dd170af..5918285 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -813,6 +813,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
struct ib_device_attr devattr;
+ int dma_mr_acc;
+ int need_dma_mr;
int ret;
int i;

@@ -928,15 +930,77 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
}
newxprt->sc_qp = newxprt->sc_cm_id->qp;

- /* Register all of physical memory */
- newxprt->sc_phys_mr = ib_get_dma_mr(newxprt->sc_pd,
- IB_ACCESS_LOCAL_WRITE |
- IB_ACCESS_REMOTE_WRITE);
- if (IS_ERR(newxprt->sc_phys_mr)) {
- dprintk("svcrdma: Failed to create DMA MR ret=%d\n", ret);
+ /*
+ * Use the most secure set of MR resources based on the
+ * transport type and available memory management features in
+ * the device. Here's the table implemented below:
+ *
+ * Fast Global DMA Remote WR
+ * Reg LKEY MR Access
+ * Sup'd Sup'd Needed Needed
+ *
+ * IWARP N N Y Y
+ * N Y Y Y
+ * Y N Y N
+ * Y Y N -
+ *
+ * IB N N Y N
+ * N Y N -
+ * Y N Y N
+ * Y Y N -
+ *
+ * NB: iWARP requires remote write access for the data sink
+ * of an RDMA_READ. IB does not.
+ */
+ if (devattr.device_cap_flags & IB_DEVICE_MEM_MGT_EXTENSIONS) {
+ newxprt->sc_frmr_pg_list_len =
+ devattr.max_fast_reg_page_list_len;
+ newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
+ }
+
+ /*
+ * Determine if a DMA MR is required and if so, what privs are required
+ */
+ switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
+ case RDMA_TRANSPORT_IWARP:
+ newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
+ if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
+ need_dma_mr = 1;
+ dma_mr_acc =
+ (IB_ACCESS_LOCAL_WRITE |
+ IB_ACCESS_REMOTE_WRITE);
+ } else if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
+ need_dma_mr = 1;
+ dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
+ } else
+ need_dma_mr = 0;
+ break;
+ case RDMA_TRANSPORT_IB:
+ if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
+ need_dma_mr = 1;
+ dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
+ } else
+ need_dma_mr = 0;
+ break;
+ default:
goto errout;
}

+ /* Create the DMA MR if needed, otherwise, use the DMA LKEY */
+ if (need_dma_mr) {
+ /* Register all of physical memory */
+ newxprt->sc_phys_mr =
+ ib_get_dma_mr(newxprt->sc_pd, dma_mr_acc);
+ if (IS_ERR(newxprt->sc_phys_mr)) {
+ dprintk("svcrdma: Failed to create DMA MR ret=%d\n",
+ ret);
+ goto errout;
+ }
+ newxprt->sc_dma_lkey = newxprt->sc_phys_mr->lkey;
+ } else
+ newxprt->sc_dma_lkey =
+ newxprt->sc_cm_id->device->local_dma_lkey;
+
/* Post receive buffers */
for (i = 0; i < newxprt->sc_max_requests; i++) {
ret = svc_rdma_post_recv(newxprt);
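
The comment table in the patch can be read as a pure function of three
inputs. Here is a small userspace rendering of that table; the names are
illustrative and the access-flag details are reduced to a single
remote-write boolean:

```c
#include <assert.h>

/* Illustrative constants, not the kernel's values */
enum toy_transport { TOY_IWARP, TOY_IB };

struct toy_mr_policy {
	int need_dma_mr;	/* global DMA MR required? */
	int remote_write;	/* DMA MR needs remote write access? */
};

/*
 * Distill the table: a global DMA MR is needed only when the device
 * lacks a local DMA LKEY, except on iWARP without fastreg, where it
 * is always needed and must also grant remote write, because the
 * data sink of an iWARP RDMA_READ requires remote write access.
 */
static struct toy_mr_policy pick_mr_policy(enum toy_transport t,
					   int has_fast_reg,
					   int has_local_dma_lkey)
{
	struct toy_mr_policy p = { 0, 0 };

	if (t == TOY_IWARP && !has_fast_reg) {
		p.need_dma_mr = 1;
		p.remote_write = 1;
	} else if (!has_local_dma_lkey) {
		p.need_dma_mr = 1;	/* local access only */
	}
	return p;
}
```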

2008-10-03 21:33:50

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 04/12] svcrdma: Add a service to register a Fast Reg MR with the device

Fast Reg MR introduces a new WR type. Add a service to register the
region with the adapter and update the completion handling to support
completions with a NULL WR context.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 111 ++++++++++++++++++++---------
2 files changed, 77 insertions(+), 35 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 3425268..1402d19 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -214,6 +214,7 @@ extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
+extern int svc_rdma_fastreg(struct svcxprt_rdma *, struct svc_rdma_fastreg_mr *);
extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
struct svc_rdma_fastreg_mr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5918285..4b70282 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -326,6 +326,45 @@ static void rq_cq_reap(struct svcxprt_rdma *xprt)
}

/*
+ * Process a completion context
+ */
+static void process_context(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt)
+{
+ svc_rdma_unmap_dma(ctxt);
+
+ switch (ctxt->wr_op) {
+ case IB_WR_SEND:
+ svc_rdma_put_context(ctxt, 1);
+ break;
+
+ case IB_WR_RDMA_WRITE:
+ svc_rdma_put_context(ctxt, 0);
+ break;
+
+ case IB_WR_RDMA_READ:
+ if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
+ struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
+ BUG_ON(!read_hdr);
+ spin_lock_bh(&xprt->sc_rq_dto_lock);
+ set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
+ list_add_tail(&read_hdr->dto_q,
+ &xprt->sc_read_complete_q);
+ spin_unlock_bh(&xprt->sc_rq_dto_lock);
+ svc_xprt_enqueue(&xprt->sc_xprt);
+ }
+ svc_rdma_put_context(ctxt, 0);
+ break;
+
+ default:
+ printk(KERN_ERR "svcrdma: unexpected completion type, "
+ "opcode=%d\n",
+ ctxt->wr_op);
+ break;
+ }
+}
+
+/*
* Send Queue Completion Handler - potentially called on interrupt context.
*
* Note that caller must hold a transport reference.
@@ -337,17 +376,12 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
struct ib_cq *cq = xprt->sc_sq_cq;
int ret;

-
if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
return;

ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
atomic_inc(&rdma_stat_sq_poll);
while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
- ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
- xprt = ctxt->xprt;
-
- svc_rdma_unmap_dma(ctxt);
if (wc.status != IB_WC_SUCCESS)
/* Close the transport */
set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
@@ -356,35 +390,10 @@ static void sq_cq_reap(struct svcxprt_rdma *xprt)
atomic_dec(&xprt->sc_sq_count);
wake_up(&xprt->sc_send_wait);

- switch (ctxt->wr_op) {
- case IB_WR_SEND:
- svc_rdma_put_context(ctxt, 1);
- break;
-
- case IB_WR_RDMA_WRITE:
- svc_rdma_put_context(ctxt, 0);
- break;
-
- case IB_WR_RDMA_READ:
- if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
- struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
- BUG_ON(!read_hdr);
- spin_lock_bh(&xprt->sc_rq_dto_lock);
- set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
- list_add_tail(&read_hdr->dto_q,
- &xprt->sc_read_complete_q);
- spin_unlock_bh(&xprt->sc_rq_dto_lock);
- svc_xprt_enqueue(&xprt->sc_xprt);
- }
- svc_rdma_put_context(ctxt, 0);
- break;
+ ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
+ if (ctxt)
+ process_context(xprt, ctxt);

- default:
- printk(KERN_ERR "svcrdma: unexpected completion type, "
- "opcode=%d, status=%d\n",
- wc.opcode, wc.status);
- break;
- }
svc_xprt_put(&xprt->sc_xprt);
}

@@ -1190,6 +1199,40 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
return 1;
}

+/*
+ * Attempt to register the kvec representing the RPC memory with the
+ * device.
+ *
+ * Returns:
+ * 0 : The fastreg WR was successfully posted.
+ * <0 : An error was encountered attempting to post the WR.
+ */
+int svc_rdma_fastreg(struct svcxprt_rdma *xprt,
+ struct svc_rdma_fastreg_mr *frmr)
+{
+ struct ib_send_wr fastreg_wr;
+ u8 key;
+
+ /* Bump the key */
+ key = (u8)(frmr->mr->lkey & 0x000000FF);
+ ib_update_fast_reg_key(frmr->mr, ++key);
+
+ /* Prepare FASTREG WR */
+ memset(&fastreg_wr, 0, sizeof fastreg_wr);
+ fastreg_wr.opcode = IB_WR_FAST_REG_MR;
+ fastreg_wr.send_flags = IB_SEND_SIGNALED;
+ fastreg_wr.wr.fast_reg.iova_start = (unsigned long)frmr->kva;
+ fastreg_wr.wr.fast_reg.page_list = frmr->page_list;
+ fastreg_wr.wr.fast_reg.page_list_len = frmr->page_list_len;
+ fastreg_wr.wr.fast_reg.page_shift = PAGE_SHIFT;
+ fastreg_wr.wr.fast_reg.length = frmr->map_len;
+ fastreg_wr.wr.fast_reg.access_flags = frmr->access_flags;
+ fastreg_wr.wr.fast_reg.rkey = frmr->mr->lkey;
+ return svc_rdma_send(xprt, &fastreg_wr);
+}
+
int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
{
struct ib_send_wr *bad_wr;
@@ -1199,8 +1242,6 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
return -ENOTCONN;

BUG_ON(wr->send_flags != IB_SEND_SIGNALED);
- BUG_ON(((struct svc_rdma_op_ctxt *)(unsigned long)wr->wr_id)->wr_op !=
- wr->opcode);
/* If the SQ is full, wait until an SQ entry is available */
while (1) {
spin_lock_bh(&xprt->sc_lock);
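
The "bump the key" step above relies on the low byte of the lkey/rkey
being an owner-controlled key that is rotated before each registration,
so any stale RKEY a peer still holds no longer matches. A toy model of
the byte arithmetic behind ib_update_fast_reg_key() (the real verb
updates the MR object; this only shows the key rotation):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rotate the low byte of an lkey/rkey, wrapping at 0xFF, while
 * leaving the upper 24 bits (the MR index) untouched.
 */
static uint32_t bump_fast_reg_key(uint32_t lkey)
{
	uint8_t key = (uint8_t)(lkey & 0xFF);	/* current key byte */

	key++;					/* wraps 0xFF -> 0x00 */
	return (lkey & 0xFFFFFF00u) | key;
}
```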

2008-10-03 21:33:51

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 10/12] svcrdma: Fix IRD/ORD polarity

The initiator/responder resources in the event have been swapped. They
now represent what the local peer would set its values to in order to
match the peer. Note that iWARP does not exchange these on the wire and
the provider is simply putting in the local device max.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 36e2cc4..bd17023 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -598,7 +598,7 @@ static int rdma_listen_handler(struct rdma_cm_id *cma_id,
dprintk("svcrdma: Connect request on cma_id=%p, xprt = %p, "
"event=%d\n", cma_id, cma_id->context, event->event);
handle_connect_req(cma_id,
- event->param.conn.responder_resources);
+ event->param.conn.initiator_depth);
break;

case RDMA_CM_EVENT_ESTABLISHED:

2008-10-03 21:33:51

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 07/12] svcrdma: Modify the RPC recv path to use FRMR when available

RPCRDMA requests that specify a read-list are fetched with RDMA_READ. Using
an FRMR to map the data sink improves NFSRDMA security on transports that
place the RDMA_READ data sink LKEY on the wire because the valid lifetime
of the MR is only the duration of the RDMA_READ. The LKEY is invalidated
when the last RDMA_READ WR completes.

Mapping the data sink also allows for very large amounts of data to be
fetched with a single WR, so if the client is also using FRMR, the entire
RPC read-list can be fetched with a single WR.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 187 ++++++++++++++++++++++++++----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 5 +-
3 files changed, 171 insertions(+), 22 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 1402d19..c14fe86 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -212,6 +212,7 @@ extern int svc_rdma_post_recv(struct svcxprt_rdma *);
extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
+extern void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt);
extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
extern int svc_rdma_fastreg(struct svcxprt_rdma *, struct svc_rdma_fastreg_mr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 74de31a..a475657 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -116,7 +116,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
*
* Assumptions:
* - chunk[0]->position points to pages[0] at an offset of 0
- * - pages[] is not physically or virtually contigous and consists of
+ * - pages[] is not physically or virtually contiguous and consists of
* PAGE_SIZE elements.
*
* Output:
@@ -125,7 +125,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
* chunk in the read list
*
*/
-static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
+static int map_read_chunks(struct svcxprt_rdma *xprt,
struct svc_rqst *rqstp,
struct svc_rdma_op_ctxt *head,
struct rpcrdma_msg *rmsgp,
@@ -211,26 +211,128 @@ static int rdma_rcl_to_sge(struct svcxprt_rdma *xprt,
return sge_no;
}

-static void rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
- struct svc_rdma_op_ctxt *ctxt,
- struct kvec *vec,
- u64 *sgl_offset,
- int count)
+/* Map a read-chunk-list to an XDR and fast register the page-list.
+ *
+ * Assumptions:
+ * - chunk[0] position points to pages[0] at an offset of 0
+ * - pages[] will be made physically contiguous by creating a one-off memory
+ * region using the fastreg verb.
+ * - byte_count is # of bytes in read-chunk-list
+ * - ch_count is # of chunks in read-chunk-list
+ *
+ * Output:
+ * - sge array pointing into pages[] array.
+ * - chunk_sge array specifying sge index and count for each
+ * chunk in the read list
+ */
+static int fast_reg_read_chunks(struct svcxprt_rdma *xprt,
+ struct svc_rqst *rqstp,
+ struct svc_rdma_op_ctxt *head,
+ struct rpcrdma_msg *rmsgp,
+ struct svc_rdma_req_map *rpl_map,
+ struct svc_rdma_req_map *chl_map,
+ int ch_count,
+ int byte_count)
+{
+ int page_no;
+ int ch_no;
+ u32 offset;
+ struct rpcrdma_read_chunk *ch;
+ struct svc_rdma_fastreg_mr *frmr;
+ int ret = 0;
+
+ frmr = svc_rdma_get_frmr(xprt);
+ if (IS_ERR(frmr))
+ return -ENOMEM;
+
+ head->frmr = frmr;
+ head->arg.head[0] = rqstp->rq_arg.head[0];
+ head->arg.tail[0] = rqstp->rq_arg.tail[0];
+ head->arg.pages = &head->pages[head->count];
+ head->hdr_count = head->count; /* save count of hdr pages */
+ head->arg.page_base = 0;
+ head->arg.page_len = byte_count;
+ head->arg.len = rqstp->rq_arg.len + byte_count;
+ head->arg.buflen = rqstp->rq_arg.buflen + byte_count;
+
+ /* Fast register the page list */
+ frmr->kva = page_address(rqstp->rq_arg.pages[0]);
+ frmr->direction = DMA_FROM_DEVICE;
+ frmr->access_flags = (IB_ACCESS_LOCAL_WRITE|IB_ACCESS_REMOTE_WRITE);
+ frmr->map_len = byte_count;
+ frmr->page_list_len = PAGE_ALIGN(byte_count) >> PAGE_SHIFT;
+ for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
+ frmr->page_list->page_list[page_no] =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ page_address(rqstp->rq_arg.pages[page_no]),
+ PAGE_SIZE, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ frmr->page_list->page_list[page_no]))
+ goto fatal_err;
+ atomic_inc(&xprt->sc_dma_used);
+ head->arg.pages[page_no] = rqstp->rq_arg.pages[page_no];
+ }
+ head->count += page_no;
+
+ /* rq_respages points one past arg pages */
+ rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+
+ /* Create the reply and chunk maps */
+ offset = 0;
+ ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
+ for (ch_no = 0; ch_no < ch_count; ch_no++) {
+ rpl_map->sge[ch_no].iov_base = frmr->kva + offset;
+ rpl_map->sge[ch_no].iov_len = ch->rc_target.rs_length;
+ chl_map->ch[ch_no].count = 1;
+ chl_map->ch[ch_no].start = ch_no;
+ offset += ch->rc_target.rs_length;
+ ch++;
+ }
+
+ ret = svc_rdma_fastreg(xprt, frmr);
+ if (ret)
+ goto fatal_err;
+
+ return ch_no;
+
+ fatal_err:
+ printk("svcrdma: error fast registering xdr for xprt %p", xprt);
+ svc_rdma_put_frmr(xprt, frmr);
+ return -EIO;
+}
+
+static int rdma_set_ctxt_sge(struct svcxprt_rdma *xprt,
+ struct svc_rdma_op_ctxt *ctxt,
+ struct svc_rdma_fastreg_mr *frmr,
+ struct kvec *vec,
+ u64 *sgl_offset,
+ int count)
{
int i;

ctxt->count = count;
ctxt->direction = DMA_FROM_DEVICE;
for (i = 0; i < count; i++) {
- atomic_inc(&xprt->sc_dma_used);
- ctxt->sge[i].addr =
- ib_dma_map_single(xprt->sc_cm_id->device,
- vec[i].iov_base, vec[i].iov_len,
- DMA_FROM_DEVICE);
+ ctxt->sge[i].length = 0; /* in case map fails */
+ if (!frmr) {
+ ctxt->sge[i].addr =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ vec[i].iov_base,
+ vec[i].iov_len,
+ DMA_FROM_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ ctxt->sge[i].addr))
+ return -EINVAL;
+ ctxt->sge[i].lkey = xprt->sc_dma_lkey;
+ atomic_inc(&xprt->sc_dma_used);
+ } else {
+ ctxt->sge[i].addr = (unsigned long)vec[i].iov_base;
+ ctxt->sge[i].lkey = frmr->mr->lkey;
+ }
ctxt->sge[i].length = vec[i].iov_len;
- ctxt->sge[i].lkey = xprt->sc_phys_mr->lkey;
*sgl_offset = *sgl_offset + vec[i].iov_len;
}
+ return 0;
}

static int rdma_read_max_sge(struct svcxprt_rdma *xprt, int sge_count)
@@ -278,6 +380,7 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
struct svc_rdma_op_ctxt *hdr_ctxt)
{
struct ib_send_wr read_wr;
+ struct ib_send_wr inv_wr;
int err = 0;
int ch_no;
int ch_count;
@@ -301,9 +404,20 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
svc_rdma_rcl_chunk_counts(ch, &ch_count, &byte_count);
if (ch_count > RPCSVC_MAXPAGES)
return -EINVAL;
- sge_count = rdma_rcl_to_sge(xprt, rqstp, hdr_ctxt, rmsgp,
- rpl_map, chl_map,
- ch_count, byte_count);
+
+ if (!xprt->sc_frmr_pg_list_len)
+ sge_count = map_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
+ rpl_map, chl_map, ch_count,
+ byte_count);
+ else
+ sge_count = fast_reg_read_chunks(xprt, rqstp, hdr_ctxt, rmsgp,
+ rpl_map, chl_map, ch_count,
+ byte_count);
+ if (sge_count < 0) {
+ err = -EIO;
+ goto out;
+ }
+
sgl_offset = 0;
ch_no = 0;

@@ -312,13 +426,16 @@ static int rdma_read_xdr(struct svcxprt_rdma *xprt,
next_sge:
ctxt = svc_rdma_get_context(xprt);
ctxt->direction = DMA_FROM_DEVICE;
+ ctxt->frmr = hdr_ctxt->frmr;
+ ctxt->read_hdr = NULL;
clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
+ clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);

/* Prepare READ WR */
memset(&read_wr, 0, sizeof read_wr);
- ctxt->wr_op = IB_WR_RDMA_READ;
read_wr.wr_id = (unsigned long)ctxt;
read_wr.opcode = IB_WR_RDMA_READ;
+ ctxt->wr_op = read_wr.opcode;
read_wr.send_flags = IB_SEND_SIGNALED;
read_wr.wr.rdma.rkey = ch->rc_target.rs_handle;
read_wr.wr.rdma.remote_addr =
@@ -327,10 +444,15 @@ next_sge:
read_wr.sg_list = ctxt->sge;
read_wr.num_sge =
rdma_read_max_sge(xprt, chl_map->ch[ch_no].count);
- rdma_set_ctxt_sge(xprt, ctxt,
- &rpl_map->sge[chl_map->ch[ch_no].start],
- &sgl_offset,
- read_wr.num_sge);
+ err = rdma_set_ctxt_sge(xprt, ctxt, hdr_ctxt->frmr,
+ &rpl_map->sge[chl_map->ch[ch_no].start],
+ &sgl_offset,
+ read_wr.num_sge);
+ if (err) {
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 0);
+ goto out;
+ }
if (((ch+1)->rc_discrim == 0) &&
(read_wr.num_sge == chl_map->ch[ch_no].count)) {
/*
@@ -339,6 +461,29 @@ next_sge:
* the client and the RPC needs to be enqueued.
*/
set_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
+ if (hdr_ctxt->frmr) {
+ set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
+ /*
+ * Invalidate the local MR used to map the data
+ * sink.
+ */
+ if (xprt->sc_dev_caps &
+ SVCRDMA_DEVCAP_READ_W_INV) {
+ read_wr.opcode =
+ IB_WR_RDMA_READ_WITH_INV;
+ ctxt->wr_op = read_wr.opcode;
+ read_wr.ex.invalidate_rkey =
+ ctxt->frmr->mr->lkey;
+ } else {
+ /* Prepare INVALIDATE WR */
+ memset(&inv_wr, 0, sizeof inv_wr);
+ inv_wr.opcode = IB_WR_LOCAL_INV;
+ inv_wr.send_flags = IB_SEND_SIGNALED;
+ inv_wr.ex.invalidate_rkey =
+ hdr_ctxt->frmr->mr->lkey;
+ read_wr.next = &inv_wr;
+ }
+ }
ctxt->read_hdr = hdr_ctxt;
}
/* Post the read */
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5ea7901..30778a5 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -105,7 +105,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
return ctxt;
}

-static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
+void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
{
struct svcxprt_rdma *xprt = ctxt->xprt;
int i;
@@ -343,9 +343,12 @@ static void process_context(struct svcxprt_rdma *xprt,
break;

case IB_WR_RDMA_READ:
+ case IB_WR_RDMA_READ_WITH_INV:
if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
BUG_ON(!read_hdr);
+ if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
+ svc_rdma_put_frmr(xprt, ctxt->frmr);
spin_lock_bh(&xprt->sc_rq_dto_lock);
set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
list_add_tail(&read_hdr->dto_q,

2008-10-03 21:33:50

by Tom Tucker

Subject: [PATCH 05/12] svcrdma: Modify post recv path to use local dma key

Update the svc_rdma_post_recv routine to use the adapter's global LKEY
instead of sc_phys_mr, which is only valid when a DMA MR is in use.
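Besides the LKEY change, the patch reorders the DMA accounting so that sc_dma_used is bumped only after the mapping is checked for errors. A toy model of that "map, check, then account" ordering (names here are illustrative, not the kernel API):

```c
/* BAD_ADDR stands in for a DMA address that ib_dma_mapping_error
 * would reject. */
#define BAD_ADDR 0UL

static unsigned long toy_map(int fail) { return fail ? BAD_ADDR : 0x1000UL; }

/* Map one receive SGE; increment the DMA-use counter only after a
 * successful mapping, so the error path never leaks a reference. */
static int post_recv_sge(int fail, int *dma_used)
{
    unsigned long pa = toy_map(fail);
    if (pa == BAD_ADDR)
        return -1;        /* error path: counter untouched */
    (*dma_used)++;        /* account only after a good mapping */
    return 0;
}
```

The pre-patch code incremented the counter before mapping, so a mapping failure left sc_dma_used permanently inflated.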

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 12 +++++++++---
1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 4b70282..a144fab 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -483,7 +483,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
struct ib_recv_wr recv_wr, *bad_recv_wr;
struct svc_rdma_op_ctxt *ctxt;
struct page *page;
- unsigned long pa;
+ dma_addr_t pa;
int sge_no;
int buflen;
int ret;
@@ -495,13 +495,15 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
BUG_ON(sge_no >= xprt->sc_max_sge);
page = svc_rdma_get_page();
ctxt->pages[sge_no] = page;
- atomic_inc(&xprt->sc_dma_used);
pa = ib_dma_map_page(xprt->sc_cm_id->device,
page, 0, PAGE_SIZE,
DMA_FROM_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device, pa))
+ goto err_put_ctxt;
+ atomic_inc(&xprt->sc_dma_used);
ctxt->sge[sge_no].addr = pa;
ctxt->sge[sge_no].length = PAGE_SIZE;
- ctxt->sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
+ ctxt->sge[sge_no].lkey = xprt->sc_dma_lkey;
buflen += PAGE_SIZE;
}
ctxt->count = sge_no;
@@ -517,6 +519,10 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt)
svc_rdma_put_context(ctxt, 1);
}
return ret;
+
+ err_put_ctxt:
+ svc_rdma_put_context(ctxt, 1);
+ return -ENOMEM;
}

/*

2008-10-03 21:33:51

by Tom Tucker

Subject: [PATCH 06/12] svcrdma: Add support to svc_rdma_send to handle chained WR

WRs can be submitted as linked lists. Update the svc_rdma_send
routine to handle WR chains. This will be used to submit a WR that
uses an FRMR together with a WR that invalidates it.
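The core of the change is counting the WRs in the chain up front and reserving SQ space for all of them at once. A minimal sketch of those two steps, with a cut-down stand-in for struct ib_send_wr (the real structure has many more members):

```c
#include <stddef.h>

/* Minimal stand-in for struct ib_send_wr's chaining field. */
struct send_wr {
    struct send_wr *next;
};

/* Count the WRs in a chain, as svc_rdma_send must do before it can
 * reserve enough SQ slots for the whole post. */
static int wr_chain_len(struct send_wr *wr)
{
    int count = 0;
    for (; wr; wr = wr->next)
        count++;
    return count;
}

/* Nonzero if a chain of wr_count WRs fits in the SQ given the current
 * outstanding count -- the patch changes the old "is the SQ full?"
 * test into this "is there room for the whole chain?" test. */
static int sq_has_room(int sq_depth, int sq_count, int wr_count)
{
    return sq_depth >= sq_count + wr_count;
}
```

Reserving the whole chain atomically matters because ib_post_send posts the entire list; checking for a single free slot could oversubscribe the SQ when a SEND is chained with a LOCAL_INV.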

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 29 +++++++++++++++++++++--------
1 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index a144fab..5ea7901 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1241,17 +1241,23 @@ int svc_rdma_fastreg(struct svcxprt_rdma *xprt,

int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
{
- struct ib_send_wr *bad_wr;
+ struct ib_send_wr *bad_wr, *n_wr;
+ int wr_count;
+ int i;
int ret;

if (test_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags))
return -ENOTCONN;

BUG_ON(wr->send_flags != IB_SEND_SIGNALED);
+ wr_count = 1;
+ for (n_wr = wr->next; n_wr; n_wr = n_wr->next)
+ wr_count++;
+
/* If the SQ is full, wait until an SQ entry is available */
while (1) {
spin_lock_bh(&xprt->sc_lock);
- if (xprt->sc_sq_depth == atomic_read(&xprt->sc_sq_count)) {
+ if (xprt->sc_sq_depth < atomic_read(&xprt->sc_sq_count) + wr_count) {
spin_unlock_bh(&xprt->sc_lock);
atomic_inc(&rdma_stat_sq_starve);

@@ -1266,19 +1272,26 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
return 0;
continue;
}
- /* Bumped used SQ WR count and post */
- svc_xprt_get(&xprt->sc_xprt);
+ /* Take a transport ref for each WR posted */
+ for (i = 0; i < wr_count; i++)
+ svc_xprt_get(&xprt->sc_xprt);
+
+ /* Bump used SQ WR count and post */
+ atomic_add(wr_count, &xprt->sc_sq_count);
ret = ib_post_send(xprt->sc_qp, wr, &bad_wr);
- if (!ret)
- atomic_inc(&xprt->sc_sq_count);
- else {
- svc_xprt_put(&xprt->sc_xprt);
+ if (ret) {
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ atomic_sub(wr_count, &xprt->sc_sq_count);
+ for (i = 0; i < wr_count; i ++)
+ svc_xprt_put(&xprt->sc_xprt);
dprintk("svcrdma: failed to post SQ WR rc=%d, "
"sc_sq_count=%d, sc_sq_depth=%d\n",
ret, atomic_read(&xprt->sc_sq_count),
xprt->sc_sq_depth);
}
spin_unlock_bh(&xprt->sc_lock);
+ if (ret)
+ wake_up(&xprt->sc_send_wait);
break;
}
return ret;

2008-10-03 21:33:51

by Tom Tucker

Subject: [PATCH 09/12] svcrdma: Update svc_rdma_send_error to use DMA LKEY

Update the svc_rdma_send_error code to use the DMA LKEY which is valid
regardless of the memory registration strategy in use.

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 360ff97..36e2cc4 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1320,10 +1320,14 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);

/* Prepare SGE for local address */
- atomic_inc(&xprt->sc_dma_used);
sge.addr = ib_dma_map_page(xprt->sc_cm_id->device,
p, 0, PAGE_SIZE, DMA_FROM_DEVICE);
- sge.lkey = xprt->sc_phys_mr->lkey;
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device, sge.addr)) {
+ put_page(p);
+ return;
+ }
+ atomic_inc(&xprt->sc_dma_used);
+ sge.lkey = xprt->sc_dma_lkey;
sge.length = length;

ctxt = svc_rdma_get_context(xprt);
@@ -1344,6 +1348,9 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
if (ret) {
dprintk("svcrdma: Error %d posting send for protocol error\n",
ret);
+ ib_dma_unmap_page(xprt->sc_cm_id->device,
+ sge.addr, PAGE_SIZE,
+ DMA_FROM_DEVICE);
svc_rdma_put_context(ctxt, 1);
}
}

2008-10-03 21:33:51

by Tom Tucker

Subject: [PATCH 08/12] svcrdma: Modify the RPC reply path to use FRMR when available

Use FRMR to map local RPC reply data. This allows RDMA_WRITE to send reply
data using a single WR. The FRMR is invalidated by linking the LOCAL_INV WR
to the RDMA_SEND message used to complete the reply.
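One detail of fast_reg_xdr below: the XDR tail only needs its own page-list entry when it lives on a different page than the head. A small sketch of that same-page test, assuming 4 KB pages (not kernel code):

```c
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Nonzero when two buffer addresses fall in the same page -- the test
 * fast_reg_xdr uses to decide whether the XDR tail can reuse the
 * mapping already created for the head, or must be mapped separately. */
static int same_page(unsigned long a, unsigned long b)
{
    return (a & PAGE_MASK) == (b & PAGE_MASK);
}
```

Skipping the second mapping avoids registering the same page twice in the FRMR page list.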

Signed-off-by: Tom Tucker <[email protected]>

---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 255 +++++++++++++++++++++++++-----
net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +
2 files changed, 217 insertions(+), 40 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 84d3283..9a7a8e7 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -69,9 +69,127 @@
* array is only concerned with the reply we are assured that we have
* on extra page for the RPCRMDA header.
*/
-static void xdr_to_sge(struct svcxprt_rdma *xprt,
- struct xdr_buf *xdr,
- struct svc_rdma_req_map *vec)
+int fast_reg_xdr(struct svcxprt_rdma *xprt,
+ struct xdr_buf *xdr,
+ struct svc_rdma_req_map *vec)
+{
+ int sge_no;
+ u32 sge_bytes;
+ u32 page_bytes;
+ u32 page_off;
+ int page_no = 0;
+ u8 *frva;
+ struct svc_rdma_fastreg_mr *frmr;
+
+ frmr = svc_rdma_get_frmr(xprt);
+ if (IS_ERR(frmr))
+ return -ENOMEM;
+ vec->frmr = frmr;
+
+ /* Skip the RPCRDMA header */
+ sge_no = 1;
+
+ /* Map the head. */
+ frva = (void *)((unsigned long)(xdr->head[0].iov_base) & PAGE_MASK);
+ vec->sge[sge_no].iov_base = xdr->head[0].iov_base;
+ vec->sge[sge_no].iov_len = xdr->head[0].iov_len;
+ vec->count = 2;
+ sge_no++;
+
+ /* Build the FRMR */
+ frmr->kva = frva;
+ frmr->direction = DMA_TO_DEVICE;
+ frmr->access_flags = 0;
+ frmr->map_len = PAGE_SIZE;
+ frmr->page_list_len = 1;
+ frmr->page_list->page_list[page_no] =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ (void *)xdr->head[0].iov_base,
+ PAGE_SIZE, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ frmr->page_list->page_list[page_no]))
+ goto fatal_err;
+ atomic_inc(&xprt->sc_dma_used);
+
+ page_off = xdr->page_base;
+ page_bytes = xdr->page_len + page_off;
+ if (!page_bytes)
+ goto encode_tail;
+
+ /* Map the pages */
+ vec->sge[sge_no].iov_base = frva + frmr->map_len + page_off;
+ vec->sge[sge_no].iov_len = page_bytes;
+ sge_no++;
+ while (page_bytes) {
+ struct page *page;
+
+ page = xdr->pages[page_no++];
+ sge_bytes = min_t(u32, page_bytes, (PAGE_SIZE - page_off));
+ page_bytes -= sge_bytes;
+
+ frmr->page_list->page_list[page_no] =
+ ib_dma_map_page(xprt->sc_cm_id->device, page, 0,
+ PAGE_SIZE, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ frmr->page_list->page_list[page_no]))
+ goto fatal_err;
+
+ atomic_inc(&xprt->sc_dma_used);
+ page_off = 0; /* reset for next time through loop */
+ frmr->map_len += PAGE_SIZE;
+ frmr->page_list_len++;
+ }
+ vec->count++;
+
+ encode_tail:
+ /* Map tail */
+ if (0 == xdr->tail[0].iov_len)
+ goto done;
+
+ vec->count++;
+ vec->sge[sge_no].iov_len = xdr->tail[0].iov_len;
+
+ if (((unsigned long)xdr->tail[0].iov_base & PAGE_MASK) ==
+ ((unsigned long)xdr->head[0].iov_base & PAGE_MASK)) {
+ /*
+ * If head and tail use the same page, we don't need
+ * to map it again.
+ */
+ vec->sge[sge_no].iov_base = xdr->tail[0].iov_base;
+ } else {
+ void *va;
+
+ /* Map another page for the tail */
+ page_off = (unsigned long)xdr->tail[0].iov_base & ~PAGE_MASK;
+ va = (void *)((unsigned long)xdr->tail[0].iov_base & PAGE_MASK);
+ vec->sge[sge_no].iov_base = frva + frmr->map_len + page_off;
+
+ frmr->page_list->page_list[page_no] =
+ ib_dma_map_single(xprt->sc_cm_id->device, va, PAGE_SIZE,
+ DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ frmr->page_list->page_list[page_no]))
+ goto fatal_err;
+ atomic_inc(&xprt->sc_dma_used);
+ frmr->map_len += PAGE_SIZE;
+ frmr->page_list_len++;
+ }
+
+ done:
+ if (svc_rdma_fastreg(xprt, frmr))
+ goto fatal_err;
+
+ return 0;
+
+ fatal_err:
+ printk("svcrdma: Error fast registering memory for xprt %p\n", xprt);
+ svc_rdma_put_frmr(xprt, frmr);
+ return -EIO;
+}
+
+static int map_xdr(struct svcxprt_rdma *xprt,
+ struct xdr_buf *xdr,
+ struct svc_rdma_req_map *vec)
{
int sge_max = (xdr->len+PAGE_SIZE-1) / PAGE_SIZE + 3;
int sge_no;
@@ -83,6 +201,9 @@ static void xdr_to_sge(struct svcxprt_rdma *xprt,
BUG_ON(xdr->len !=
(xdr->head[0].iov_len + xdr->page_len + xdr->tail[0].iov_len));

+ if (xprt->sc_frmr_pg_list_len)
+ return fast_reg_xdr(xprt, xdr, vec);
+
/* Skip the first sge, this is for the RPCRDMA header */
sge_no = 1;

@@ -116,9 +237,12 @@ static void xdr_to_sge(struct svcxprt_rdma *xprt,

BUG_ON(sge_no > sge_max);
vec->count = sge_no;
+ return 0;
}

/* Assumptions:
+ * - We are using FRMR
+ * - or -
* - The specified write_len can be represented in sc_max_sge * PAGE_SIZE
*/
static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
@@ -158,30 +282,35 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
sge_no = 0;

/* Copy the remaining SGE */
- while (bc != 0 && xdr_sge_no < vec->count) {
- sge[sge_no].lkey = xprt->sc_phys_mr->lkey;
- sge_bytes = min((size_t)bc,
- (size_t)(vec->sge[xdr_sge_no].iov_len-sge_off));
+ while (bc != 0) {
+ sge_bytes = min_t(size_t,
+ bc, vec->sge[xdr_sge_no].iov_len-sge_off);
sge[sge_no].length = sge_bytes;
- atomic_inc(&xprt->sc_dma_used);
- sge[sge_no].addr =
- ib_dma_map_single(xprt->sc_cm_id->device,
- (void *)
- vec->sge[xdr_sge_no].iov_base + sge_off,
- sge_bytes, DMA_TO_DEVICE);
- if (dma_mapping_error(xprt->sc_cm_id->device->dma_device,
- sge[sge_no].addr))
- goto err;
+ if (!vec->frmr) {
+ sge[sge_no].addr =
+ ib_dma_map_single(xprt->sc_cm_id->device,
+ (void *)
+ vec->sge[xdr_sge_no].iov_base + sge_off,
+ sge_bytes, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device,
+ sge[sge_no].addr))
+ goto err;
+ atomic_inc(&xprt->sc_dma_used);
+ sge[sge_no].lkey = xprt->sc_dma_lkey;
+ } else {
+ sge[sge_no].addr = (unsigned long)
+ vec->sge[xdr_sge_no].iov_base + sge_off;
+ sge[sge_no].lkey = vec->frmr->mr->lkey;
+ }
+ ctxt->count++;
+ ctxt->frmr = vec->frmr;
sge_off = 0;
sge_no++;
- ctxt->count++;
xdr_sge_no++;
+ BUG_ON(xdr_sge_no > vec->count);
bc -= sge_bytes;
}

- BUG_ON(bc != 0);
- BUG_ON(xdr_sge_no > vec->count);
-
/* Prepare WRITE WR */
memset(&write_wr, 0, sizeof write_wr);
ctxt->wr_op = IB_WR_RDMA_WRITE;
@@ -226,7 +355,10 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
res_ary = (struct rpcrdma_write_array *)
&rdma_resp->rm_body.rm_chunks[1];

- max_write = xprt->sc_max_sge * PAGE_SIZE;
+ if (vec->frmr)
+ max_write = vec->frmr->map_len;
+ else
+ max_write = xprt->sc_max_sge * PAGE_SIZE;

/* Write chunks start at the pagelist */
for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
@@ -297,7 +429,10 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
res_ary = (struct rpcrdma_write_array *)
&rdma_resp->rm_body.rm_chunks[2];

- max_write = xprt->sc_max_sge * PAGE_SIZE;
+ if (vec->frmr)
+ max_write = vec->frmr->map_len;
+ else
+ max_write = xprt->sc_max_sge * PAGE_SIZE;

/* xdr offset starts at RPC message */
for (xdr_off = 0, chunk_no = 0;
@@ -307,7 +442,6 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
ch = &arg_ary->wc_array[chunk_no].wc_target;
write_len = min(xfer_len, ch->rs_length);

-
/* Prepare the reply chunk given the length actually
* written */
rs_offset = get_unaligned(&(ch->rs_offset));
@@ -366,6 +500,7 @@ static int send_reply(struct svcxprt_rdma *rdma,
int byte_count)
{
struct ib_send_wr send_wr;
+ struct ib_send_wr inv_wr;
int sge_no;
int sge_bytes;
int page_no;
@@ -385,27 +520,45 @@ static int send_reply(struct svcxprt_rdma *rdma,
/* Prepare the context */
ctxt->pages[0] = page;
ctxt->count = 1;
+ ctxt->frmr = vec->frmr;
+ if (vec->frmr)
+ set_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);
+ else
+ clear_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags);

/* Prepare the SGE for the RPCRDMA Header */
- atomic_inc(&rdma->sc_dma_used);
ctxt->sge[0].addr =
ib_dma_map_page(rdma->sc_cm_id->device,
page, 0, PAGE_SIZE, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(rdma->sc_cm_id->device, ctxt->sge[0].addr))
+ goto err;
+ atomic_inc(&rdma->sc_dma_used);
+
ctxt->direction = DMA_TO_DEVICE;
+
ctxt->sge[0].length = svc_rdma_xdr_get_reply_hdr_len(rdma_resp);
- ctxt->sge[0].lkey = rdma->sc_phys_mr->lkey;
+ ctxt->sge[0].lkey = rdma->sc_dma_lkey;

/* Determine how many of our SGE are to be transmitted */
for (sge_no = 1; byte_count && sge_no < vec->count; sge_no++) {
sge_bytes = min_t(size_t, vec->sge[sge_no].iov_len, byte_count);
byte_count -= sge_bytes;
- atomic_inc(&rdma->sc_dma_used);
- ctxt->sge[sge_no].addr =
- ib_dma_map_single(rdma->sc_cm_id->device,
- vec->sge[sge_no].iov_base,
- sge_bytes, DMA_TO_DEVICE);
+ if (!vec->frmr) {
+ ctxt->sge[sge_no].addr =
+ ib_dma_map_single(rdma->sc_cm_id->device,
+ vec->sge[sge_no].iov_base,
+ sge_bytes, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(rdma->sc_cm_id->device,
+ ctxt->sge[sge_no].addr))
+ goto err;
+ atomic_inc(&rdma->sc_dma_used);
+ ctxt->sge[sge_no].lkey = rdma->sc_dma_lkey;
+ } else {
+ ctxt->sge[sge_no].addr = (unsigned long)
+ vec->sge[sge_no].iov_base;
+ ctxt->sge[sge_no].lkey = vec->frmr->mr->lkey;
+ }
ctxt->sge[sge_no].length = sge_bytes;
- ctxt->sge[sge_no].lkey = rdma->sc_phys_mr->lkey;
}
BUG_ON(byte_count != 0);

@@ -417,11 +570,16 @@ static int send_reply(struct svcxprt_rdma *rdma,
ctxt->pages[page_no+1] = rqstp->rq_respages[page_no];
ctxt->count++;
rqstp->rq_respages[page_no] = NULL;
- /* If there are more pages than SGE, terminate SGE list */
+ /*
+ * If there are more pages than SGE, terminate SGE
+ * list so that svc_rdma_unmap_dma doesn't attempt to
+ * unmap garbage.
+ */
if (page_no+1 >= sge_no)
ctxt->sge[page_no+1].length = 0;
}
BUG_ON(sge_no > rdma->sc_max_sge);
+ BUG_ON(sge_no > ctxt->count);
memset(&send_wr, 0, sizeof send_wr);
ctxt->wr_op = IB_WR_SEND;
send_wr.wr_id = (unsigned long)ctxt;
@@ -429,12 +587,26 @@ static int send_reply(struct svcxprt_rdma *rdma,
send_wr.num_sge = sge_no;
send_wr.opcode = IB_WR_SEND;
send_wr.send_flags = IB_SEND_SIGNALED;
+ if (vec->frmr) {
+ /* Prepare INVALIDATE WR */
+ memset(&inv_wr, 0, sizeof inv_wr);
+ inv_wr.opcode = IB_WR_LOCAL_INV;
+ inv_wr.send_flags = IB_SEND_SIGNALED;
+ inv_wr.ex.invalidate_rkey =
+ vec->frmr->mr->lkey;
+ send_wr.next = &inv_wr;
+ }

ret = svc_rdma_send(rdma, &send_wr);
if (ret)
- svc_rdma_put_context(ctxt, 1);
+ goto err;

- return ret;
+ return 0;
+
+ err:
+ svc_rdma_put_frmr(rdma, vec->frmr);
+ svc_rdma_put_context(ctxt, 1);
+ return -EIO;
}

void svc_rdma_prep_reply_hdr(struct svc_rqst *rqstp)
@@ -477,8 +649,9 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
ctxt = svc_rdma_get_context(rdma);
ctxt->direction = DMA_TO_DEVICE;
vec = svc_rdma_get_req_map();
- xdr_to_sge(rdma, &rqstp->rq_res, vec);
-
+ ret = map_xdr(rdma, &rqstp->rq_res, vec);
+ if (ret)
+ goto err0;
inline_bytes = rqstp->rq_res.len;

/* Create the RDMA response header */
@@ -498,7 +671,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
if (ret < 0) {
printk(KERN_ERR "svcrdma: failed to send write chunks, rc=%d\n",
ret);
- goto error;
+ goto err1;
}
inline_bytes -= ret;

@@ -508,7 +681,7 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
if (ret < 0) {
printk(KERN_ERR "svcrdma: failed to send reply chunks, rc=%d\n",
ret);
- goto error;
+ goto err1;
}
inline_bytes -= ret;

@@ -517,9 +690,11 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_req_map(vec);
dprintk("svcrdma: send_reply returns %d\n", ret);
return ret;
- error:
+
+ err1:
+ put_page(res_page);
+ err0:
svc_rdma_put_req_map(vec);
svc_rdma_put_context(ctxt, 0);
- put_page(res_page);
return ret;
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 30778a5..360ff97 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -335,6 +335,8 @@ static void process_context(struct svcxprt_rdma *xprt,

switch (ctxt->wr_op) {
case IB_WR_SEND:
+ if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
+ svc_rdma_put_frmr(xprt, ctxt->frmr);
svc_rdma_put_context(ctxt, 1);
break;


2008-10-03 21:33:51

by Tom Tucker

Subject: [PATCH 12/12] svcrdma: Documentation update for the FastReg memory model

This patch adds security-related documentation to the nfs-rdma.txt
file. It describes the memory registration model and the potential
security exploits, and compares these exploits to a similar threat
when TCP is used as the transport.

Signed-off-by: Tom Tucker <[email protected]>

---
Documentation/filesystems/nfs-rdma.txt | 126 ++++++++++++++++++++++++++++++++
1 files changed, 126 insertions(+), 0 deletions(-)

diff --git a/Documentation/filesystems/nfs-rdma.txt b/Documentation/filesystems/nfs-rdma.txt
index 44bd766..6abdc6d 100644
--- a/Documentation/filesystems/nfs-rdma.txt
+++ b/Documentation/filesystems/nfs-rdma.txt
@@ -269,3 +269,129 @@ NFS/RDMA Setup
the "proto" field for the given mount.

Congratulations! You're using NFS/RDMA!
+
+Security
+--------
+
+ NFSRDMA exploits the RDMA capabilities of the IB and iWARP
+ transports to more efficiently exchange RPC data between the client
+ and the server. This section discusses the security implications of
+ the exchange of memory information on the wire when the wire may be
+ monitorable by an untrusted application. The identifier that
+ encapsulates this memory information is called an RKEY.
+
+ A principal exploit is that a node on the local network could snoop
+ RDMA packets containing an RKEY and then forge a packet with that
+ RKEY to read and/or write the memory of the peer to which the RKEY
+ refers.
+
+ If the underlying RDMA device is capable of Fast Memory Registration
+ or the transport is IB, then NFSRDMA is no less secure than TCP.
+ However, if the transport is iWARP and the device does not support
+ Fast Memory Registration, then the node above could write anywhere
+ in the server's memory using the method described.
+
+ When the client connects, NFSRDMA sends a string to the message log
+ providing information about the new connection. Examples of this
+ string are shown below:
+
+ svcrdma: New connection accepted with the following attributes:
+ transport : iWARP
+ local_ip : 192.168.70.121
+ local_port : 20049
+ remote_ip : 192.168.70.122
+ remote_port : 44286
+ max_sge : 4
+ sq_depth : 128
+ max_requests : 16
+ ord : 8
+ using fastreg? : Y
+ memory exposure : Safe. Only RPC memory exposed.
+
+ This connection is safe: although the transport is iWARP, the
+ device supports Fast Memory Registration.
+
+ svcrdma: New connection accepted with the following attributes:
+ transport : iWARP
+ local_ip : 192.168.70.121
+ local_port : 20049
+ remote_ip : 192.168.70.122
+ remote_port : 44286
+ max_sge : 4
+ sq_depth : 128
+ max_requests : 16
+ ord : 8
+ using fastreg? : N
+ memory exposure : Unsafe. Server memory exposed.
+
+ This connection is unsafe: the transport is iWARP and Fast Memory
+ Registration is not supported.
+
+ svcrdma: New connection accepted with the following attributes:
+ transport : IB
+ local_ip : 192.168.79.121
+ local_port : 20049
+ remote_ip : 192.168.79.122
+ remote_port : 44287
+ max_sge : 27
+ sq_depth : 128
+ max_requests : 16
+ ord : 4
+ using fastreg? : N
+ memory exposure : Safe. Only RPC memory exposed.
+
+ This connection is safe because although the adapter does not
+ support Fast Memory Registration, the transport is IB.
+
+ These messages can be disabled as follows:
+
+ # echo 0 > /proc/sys/sunrpc/svc_rdma/show_conn_info
+
+ The text below provides additional information on this issue.
+
+ The NFSRDMA protocol is defined such that a) only the server
+ initiates RDMA, and b) only the client's memory is exposed via
+ RKEY. This is why the server reads to fetch RPC data from the client
+ even though it would be more efficient for the client to write the
+ data to the server's memory. This design goal is not entirely
+ realized with iWARP, however, because the RKEY (called an STag on
+ iWARP) for the data sink of an RDMA_READ is actually placed on the
+ wire, and this RKEY has Remote Write permission. This means that the
+ server's memory is exposed by virtue of having placed the RKEY for
+ its local memory on the wire in order to receive the result of the
+ RDMA_READ.
+
+ By contrast, IB uses an opaque transaction ID# to associate the
+ READ_RPL with the READ_REQ, and the data sink of a READ_REQ does
+ not require remote access. That said, the byzantine node in question
+ could still forge a packet with this transaction ID and corrupt the
+ target memory; however, the scope of the exploit is bounded to the
+ lifetime of this single RDMA_READ request and to the memory mapped
+ by the data sink of the READ_REQ.
+
+ The newer RDMA adapters (both iWARP and IB) support "Fast Memory
+ Registration". This capability allows memory to be quickly
+ registered (i.e. made available for remote access) and de-registered
+ by submitting WR on the SQ. These capabilities provide a mechanism
+ to reduce the exposure discussed above by limiting the scope of the
+ exploit. The idea is to create an RKEY that only maps the single RPC
+ and whose effective lifetime is only the exchange of this single
+ RPC. This is the default memory model that is employed by the server
+ when supported by the adapter and by the client when the
+ rdma_memreg_strategy is set to 6. Note that the client and server
+ may use different memory registration strategies, however,
+ performance is better when both the client and server use the
+ FastReg memory registration strategy.
+
+ This approach has two benefits: a) it restricts the domain of the
+ exploit to the memory of a single RPC, and b) it limits the duration
+ of the exploit to the time it takes to satisfy the RDMA_READ.
+
+ It is arguable that a one-shot STag/RKEY is no less secure than RPC
+ on the TCP transport. Consider that the same byzantine application
+ could more easily corrupt a TCP RPC payload by simply forging a
+ packet with the correct TCP sequence number; in fact, this is easier
+ than the RDMA exploit, because the RDMA exploit requires correctly
+ forging both the TCP packet and the RDMA payload. In addition, the
+ duration of the TCP exploit is the lifetime of the connection, not
+ the lifetime of a single WR/RPC data transfer.

2008-10-03 21:33:51

by Tom Tucker

[permalink] [raw]
Subject: [PATCH 11/12] svcrdma: Add a message log string to indicate if FastReg is being used

Add a log message that allows the administrator to determine if server memory
is exposed on a particular client connection. This message can be disabled
by writing 0 to the /proc/sys/sunrpc/svc_rdma/show_conn_info file.
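For example, an administrator can toggle the message at runtime through the sysctl file added by this patch:

```shell
# Disable the per-connection info message (enabled by default, value 1)
echo 0 > /proc/sys/sunrpc/svc_rdma/show_conn_info

# Re-enable it
echo 1 > /proc/sys/sunrpc/svc_rdma/show_conn_info
```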

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma.c | 8 ++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 63 ++++++++++++++++++-----------
3 files changed, 48 insertions(+), 24 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index c14fe86..7ee0a76 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -52,6 +52,7 @@
extern unsigned int svcrdma_ord;
extern unsigned int svcrdma_max_requests;
extern unsigned int svcrdma_max_req_size;
+extern unsigned int svcrdma_show_conn_info;

extern atomic_t rdma_stat_recv;
extern atomic_t rdma_stat_read;
diff --git a/net/sunrpc/xprtrdma/svc_rdma.c b/net/sunrpc/xprtrdma/svc_rdma.c
index 8710117..493e243 100644
--- a/net/sunrpc/xprtrdma/svc_rdma.c
+++ b/net/sunrpc/xprtrdma/svc_rdma.c
@@ -58,6 +58,7 @@ static unsigned int max_max_requests = 16384;
unsigned int svcrdma_max_req_size = RPCRDMA_MAX_REQ_SIZE;
static unsigned int min_max_inline = 4096;
static unsigned int max_max_inline = 65536;
+unsigned int svcrdma_show_conn_info = 1;

atomic_t rdma_stat_recv;
atomic_t rdma_stat_read;
@@ -145,6 +146,13 @@ static ctl_table svcrdma_parm_table[] = {
.extra1 = &min_ord,
.extra2 = &max_ord,
},
+ {
+ .procname = "show_conn_info",
+ .data = &svcrdma_show_conn_info,
+ .maxlen = sizeof(svcrdma_show_conn_info),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },

{
.procname = "rdma_stat_read",
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index bd17023..3d97032 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -833,6 +833,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
struct ib_device_attr devattr;
+ char *transport_string;
+ char *security_string;
int dma_mr_acc;
int need_dma_mr;
int ret;
@@ -981,10 +983,13 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
/*
* Determine if a DMA MR is required and if so, what privs are required
*/
+ security_string = "Safe. Only RPC memory exposed.";
switch (rdma_node_get_transport(newxprt->sc_cm_id->device->node_type)) {
case RDMA_TRANSPORT_IWARP:
+ transport_string = "iWARP";
newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_READ_W_INV;
if (!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG)) {
+ security_string = "Unsafe: Server memory exposed.";
need_dma_mr = 1;
dma_mr_acc =
(IB_ACCESS_LOCAL_WRITE |
@@ -996,6 +1001,7 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
need_dma_mr = 0;
break;
case RDMA_TRANSPORT_IB:
+ transport_string = "IB";
if (!(devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)) {
need_dma_mr = 1;
dma_mr_acc = IB_ACCESS_LOCAL_WRITE;
@@ -1052,30 +1058,39 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
goto errout;
}

- dprintk("svcrdma: new connection %p accepted with the following "
- "attributes:\n"
- " local_ip : %d.%d.%d.%d\n"
- " local_port : %d\n"
- " remote_ip : %d.%d.%d.%d\n"
- " remote_port : %d\n"
- " max_sge : %d\n"
- " sq_depth : %d\n"
- " max_requests : %d\n"
- " ord : %d\n",
- newxprt,
- NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
- route.addr.src_addr)->sin_addr.s_addr),
- ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
- route.addr.src_addr)->sin_port),
- NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
- route.addr.dst_addr)->sin_addr.s_addr),
- ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
- route.addr.dst_addr)->sin_port),
- newxprt->sc_max_sge,
- newxprt->sc_sq_depth,
- newxprt->sc_max_requests,
- newxprt->sc_ord);
-
+ if (!svcrdma_show_conn_info)
+ goto out;
+
+ printk(KERN_INFO "svcrdma: New connection accepted with the following "
+ "attributes:\n"
+ " transport : %s\n"
+ " local_ip : %d.%d.%d.%d\n"
+ " local_port : %d\n"
+ " remote_ip : %d.%d.%d.%d\n"
+ " remote_port : %d\n"
+ " max_sge : %d\n"
+ " sq_depth : %d\n"
+ " max_requests : %d\n"
+ " ord : %d\n"
+ " using fastreg? : %c\n"
+ " memory exposure : %s\n",
+ transport_string,
+ NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
+ route.addr.src_addr)->sin_addr.s_addr),
+ ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
+ route.addr.src_addr)->sin_port),
+ NIPQUAD(((struct sockaddr_in *)&newxprt->sc_cm_id->
+ route.addr.dst_addr)->sin_addr.s_addr),
+ ntohs(((struct sockaddr_in *)&newxprt->sc_cm_id->
+ route.addr.dst_addr)->sin_port),
+ newxprt->sc_max_sge,
+ newxprt->sc_sq_depth,
+ newxprt->sc_max_requests,
+ newxprt->sc_ord,
+ newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG ? 'Y' : 'N',
+ security_string);
+
+ out:
return &newxprt->sc_xprt;

errout:

2008-10-04 01:05:32

by Tom Tucker

[permalink] [raw]
Subject: Re: [PATCH 02/12] svcrdma: Add FRMR get/put services

Bruce:

This patch is missing your simplification. I'll add it and repost.

Tom

Tom Tucker wrote:
> Add services for allocating, freeing, and unmapping Fast Reg MRs. These
> services will be used by the transport connection setup, send, and
> receive routines.
>
> Signed-off-by: Tom Tucker <[email protected]>
>
> ---
> include/linux/sunrpc/svc_rdma.h | 3 +
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 122 ++++++++++++++++++++++++++++-
> 2 files changed, 120 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 49e458d..3425268 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -214,6 +214,9 @@ extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
> extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
> extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
> extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
> +extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
> +extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
> + struct svc_rdma_fastreg_mr *);
> extern void svc_sq_reap(struct svcxprt_rdma *);
> extern void svc_rq_reap(struct svcxprt_rdma *);
> extern struct svc_xprt_class svc_rdma_class;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 900cb69..dd170af 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -100,6 +100,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
> ctxt->xprt = xprt;
> INIT_LIST_HEAD(&ctxt->dto_q);
> ctxt->count = 0;
> + ctxt->frmr = NULL;
> atomic_inc(&xprt->sc_ctxt_used);
> return ctxt;
> }
> @@ -109,11 +110,19 @@ static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
> struct svcxprt_rdma *xprt = ctxt->xprt;
> int i;
> for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
> - atomic_dec(&xprt->sc_dma_used);
> - ib_dma_unmap_single(xprt->sc_cm_id->device,
> - ctxt->sge[i].addr,
> - ctxt->sge[i].length,
> - ctxt->direction);
> + /*
> + * Unmap the DMA addr in the SGE if the lkey matches
> + * the sc_dma_lkey, otherwise, ignore it since it is
> + * an FRMR lkey and will be unmapped later when the
> + * last WR that uses it completes.
> + */
> + if (ctxt->sge[i].lkey == xprt->sc_dma_lkey) {
> + atomic_dec(&xprt->sc_dma_used);
> + ib_dma_unmap_single(xprt->sc_cm_id->device,
> + ctxt->sge[i].addr,
> + ctxt->sge[i].length,
> + ctxt->direction);
> + }
> }
> }
>
> @@ -150,6 +159,7 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
> schedule_timeout_uninterruptible(msecs_to_jiffies(500));
> }
> map->count = 0;
> + map->frmr = NULL;
> return map;
> }
>
> @@ -425,10 +435,12 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
> INIT_LIST_HEAD(&cma_xprt->sc_dto_q);
> INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
> INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
> + INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
> init_waitqueue_head(&cma_xprt->sc_send_wait);
>
> spin_lock_init(&cma_xprt->sc_lock);
> spin_lock_init(&cma_xprt->sc_rq_dto_lock);
> + spin_lock_init(&cma_xprt->sc_frmr_q_lock);
>
> cma_xprt->sc_ord = svcrdma_ord;
>
> @@ -686,6 +698,103 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
> return ERR_PTR(ret);
> }
>
> +static int rdma_alloc_frmr(struct svcxprt_rdma *xprt)
> +{
> + struct ib_mr *mr;
> + struct ib_fast_reg_page_list *pl;
> + struct svc_rdma_fastreg_mr *frmr;
> +
> + mr = ib_alloc_fast_reg_mr(xprt->sc_pd, RPCSVC_MAXPAGES);
> + if (!mr)
> + goto errout;
> + pl = ib_alloc_fast_reg_page_list(xprt->sc_cm_id->device,
> + RPCSVC_MAXPAGES);
> + if (!pl) {
> + ib_dereg_mr(mr);
> + goto errout;
> + }
> + frmr = kmalloc(sizeof(*frmr), GFP_KERNEL);
> + if (!frmr) {
> + ib_dereg_mr(mr);
> + ib_free_fast_reg_page_list(pl);
> + goto errout;
> + }
> + frmr->mr = mr;
> + frmr->page_list = pl;
> + INIT_LIST_HEAD(&frmr->frmr_list);
> + spin_lock_bh(&xprt->sc_frmr_q_lock);
> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
> + spin_unlock_bh(&xprt->sc_frmr_q_lock);
> +
> + return 0;
> +
> + errout:
> + return -ENOMEM;
> +}
> +
> +static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
> +{
> + struct svc_rdma_fastreg_mr *frmr;
> +
> + while (!list_empty(&xprt->sc_frmr_q)) {
> + frmr = list_entry(xprt->sc_frmr_q.next,
> + struct svc_rdma_fastreg_mr, frmr_list);
> + list_del_init(&frmr->frmr_list);
> + ib_dereg_mr(frmr->mr);
> + ib_free_fast_reg_page_list(frmr->page_list);
> + kfree(frmr);
> + }
> +}
> +
> +struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
> +{
> + struct svc_rdma_fastreg_mr *frmr = NULL;
> +
> + while (1) {
> + spin_lock_bh(&rdma->sc_frmr_q_lock);
> + if (!list_empty(&rdma->sc_frmr_q)) {
> + frmr = list_entry(rdma->sc_frmr_q.next,
> + struct svc_rdma_fastreg_mr, frmr_list);
> + list_del_init(&frmr->frmr_list);
> + }
> + spin_unlock_bh(&rdma->sc_frmr_q_lock);
> + if (frmr)
> + break;
> + if (rdma_alloc_frmr(rdma))
> + return ERR_PTR(-ENOMEM);
> + }
> + frmr->map_len = 0;
> + frmr->page_list_len = 0;
> +
> + return frmr;
> +}
> +
> +static void frmr_unmap_dma(struct svcxprt_rdma *xprt,
> + struct svc_rdma_fastreg_mr *frmr)
> +{
> + int page_no;
> + for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
> + dma_addr_t addr = frmr->page_list->page_list[page_no];
> + if (ib_dma_mapping_error(frmr->mr->device, addr))
> + continue;
> + atomic_dec(&xprt->sc_dma_used);
> + ib_dma_unmap_single(frmr->mr->device, addr, PAGE_SIZE,
> + frmr->direction);
> + }
> +}
> +
> +void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
> + struct svc_rdma_fastreg_mr *frmr)
> +{
> + if (frmr) {
> + frmr_unmap_dma(rdma, frmr);
> + spin_lock_bh(&rdma->sc_frmr_q_lock);
> + BUG_ON(!list_empty(&frmr->frmr_list));
> + list_add(&frmr->frmr_list, &rdma->sc_frmr_q);
> + spin_unlock_bh(&rdma->sc_frmr_q_lock);
> + }
> +}
> +
> /*
> * This is the xpo_recvfrom function for listening endpoints. Its
> * purpose is to accept incoming connections. The CMA callback handler
> @@ -961,6 +1070,9 @@ static void __svc_rdma_free(struct work_struct *work)
> WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0);
> WARN_ON(atomic_read(&rdma->sc_dma_used) != 0);
>
> + /* De-allocate fastreg mr */
> + rdma_dealloc_frmr_q(rdma);
> +
> /* Destroy the QP if present (not a listener) */
> if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
> ib_destroy_qp(rdma->sc_qp);


2008-10-06 20:02:57

by Tom Tucker

[permalink] [raw]
Subject: Re: [PATCH 02/12] svcrdma: Add FRMR get/put services

Bruce:

I believe that this version of the patch is what you were expecting.
Thanks for catching this. This is the version that is in my git tree
as well.

BTW, I've tested these patches with dbench, iozone and Connectathon.

Tom

From: Tom Tucker <[email protected]>
Date: Mon, 6 Oct 2008 14:45:18 -0500
Subject: [PATCH 02/12] svcrdma: Add FRMR get/put services

Add services for allocating, freeing, and unmapping Fast Reg MRs. These
services will be used by the transport connection setup, send, and
receive routines.

Signed-off-by: Tom Tucker <[email protected]>

---
include/linux/sunrpc/svc_rdma.h | 3 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 116 ++++++++++++++++++++++++++++--
2 files changed, 114 insertions(+), 5 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 49e458d..3425268 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -214,6 +214,9 @@ extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
extern struct svc_rdma_req_map *svc_rdma_get_req_map(void);
extern void svc_rdma_put_req_map(struct svc_rdma_req_map *);
+extern struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *);
+extern void svc_rdma_put_frmr(struct svcxprt_rdma *,
+ struct svc_rdma_fastreg_mr *);
extern void svc_sq_reap(struct svcxprt_rdma *);
extern void svc_rq_reap(struct svcxprt_rdma *);
extern struct svc_xprt_class svc_rdma_class;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 900cb69..f0b5c5f 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -100,6 +100,7 @@ struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *xprt)
ctxt->xprt = xprt;
INIT_LIST_HEAD(&ctxt->dto_q);
ctxt->count = 0;
+ ctxt->frmr = NULL;
atomic_inc(&xprt->sc_ctxt_used);
return ctxt;
}
@@ -109,11 +110,19 @@ static void svc_rdma_unmap_dma(struct svc_rdma_op_ctxt *ctxt)
struct svcxprt_rdma *xprt = ctxt->xprt;
int i;
for (i = 0; i < ctxt->count && ctxt->sge[i].length; i++) {
- atomic_dec(&xprt->sc_dma_used);
- ib_dma_unmap_single(xprt->sc_cm_id->device,
- ctxt->sge[i].addr,
- ctxt->sge[i].length,
- ctxt->direction);
+ /*
+ * Unmap the DMA addr in the SGE if the lkey matches
+ * the sc_dma_lkey, otherwise, ignore it since it is
+ * an FRMR lkey and will be unmapped later when the
+ * last WR that uses it completes.
+ */
+ if (ctxt->sge[i].lkey == xprt->sc_dma_lkey) {
+ atomic_dec(&xprt->sc_dma_used);
+ ib_dma_unmap_single(xprt->sc_cm_id->device,
+ ctxt->sge[i].addr,
+ ctxt->sge[i].length,
+ ctxt->direction);
+ }
}
}

@@ -150,6 +159,7 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
schedule_timeout_uninterruptible(msecs_to_jiffies(500));
}
map->count = 0;
+ map->frmr = NULL;
return map;
}

@@ -425,10 +435,12 @@ static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
INIT_LIST_HEAD(&cma_xprt->sc_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_rq_dto_q);
INIT_LIST_HEAD(&cma_xprt->sc_read_complete_q);
+ INIT_LIST_HEAD(&cma_xprt->sc_frmr_q);
init_waitqueue_head(&cma_xprt->sc_send_wait);

spin_lock_init(&cma_xprt->sc_lock);
spin_lock_init(&cma_xprt->sc_rq_dto_lock);
+ spin_lock_init(&cma_xprt->sc_frmr_q_lock);

cma_xprt->sc_ord = svcrdma_ord;

@@ -686,6 +698,97 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
return ERR_PTR(ret);
}

+static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
+{
+ struct ib_mr *mr;
+ struct ib_fast_reg_page_list *pl;
+ struct svc_rdma_fastreg_mr *frmr;
+
+ frmr = kmalloc(sizeof(*frmr), GFP_KERNEL);
+ if (!frmr)
+ goto err;
+
+ mr = ib_alloc_fast_reg_mr(xprt->sc_pd, RPCSVC_MAXPAGES);
+ if (!mr)
+ goto err_free_frmr;
+
+ pl = ib_alloc_fast_reg_page_list(xprt->sc_cm_id->device,
+ RPCSVC_MAXPAGES);
+ if (!pl)
+ goto err_free_mr;
+
+ frmr->mr = mr;
+ frmr->page_list = pl;
+ INIT_LIST_HEAD(&frmr->frmr_list);
+ return frmr;
+
+ err_free_mr:
+ ib_dereg_mr(mr);
+ err_free_frmr:
+ kfree(frmr);
+ err:
+ return ERR_PTR(-ENOMEM);
+}
+
+static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
+{
+ struct svc_rdma_fastreg_mr *frmr;
+
+ while (!list_empty(&xprt->sc_frmr_q)) {
+ frmr = list_entry(xprt->sc_frmr_q.next,
+ struct svc_rdma_fastreg_mr, frmr_list);
+ list_del_init(&frmr->frmr_list);
+ ib_dereg_mr(frmr->mr);
+ ib_free_fast_reg_page_list(frmr->page_list);
+ kfree(frmr);
+ }
+}
+
+struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
+{
+ struct svc_rdma_fastreg_mr *frmr = NULL;
+
+ spin_lock_bh(&rdma->sc_frmr_q_lock);
+ if (!list_empty(&rdma->sc_frmr_q)) {
+ frmr = list_entry(rdma->sc_frmr_q.next,
+ struct svc_rdma_fastreg_mr, frmr_list);
+ list_del_init(&frmr->frmr_list);
+ frmr->map_len = 0;
+ frmr->page_list_len = 0;
+ }
+ spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ if (frmr)
+ return frmr;
+
+ return rdma_alloc_frmr(rdma);
+}
+
+static void frmr_unmap_dma(struct svcxprt_rdma *xprt,
+ struct svc_rdma_fastreg_mr *frmr)
+{
+ int page_no;
+ for (page_no = 0; page_no < frmr->page_list_len; page_no++) {
+ dma_addr_t addr = frmr->page_list->page_list[page_no];
+ if (ib_dma_mapping_error(frmr->mr->device, addr))
+ continue;
+ atomic_dec(&xprt->sc_dma_used);
+ ib_dma_unmap_single(frmr->mr->device, addr, PAGE_SIZE,
+ frmr->direction);
+ }
+}
+
+void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
+ struct svc_rdma_fastreg_mr *frmr)
+{
+ if (frmr) {
+ frmr_unmap_dma(rdma, frmr);
+ spin_lock_bh(&rdma->sc_frmr_q_lock);
+ BUG_ON(!list_empty(&frmr->frmr_list));
+ list_add(&frmr->frmr_list, &rdma->sc_frmr_q);
+ spin_unlock_bh(&rdma->sc_frmr_q_lock);
+ }
+}
+
/*
* This is the xpo_recvfrom function for listening endpoints. Its
* purpose is to accept incoming connections. The CMA callback handler
@@ -961,6 +1064,9 @@ static void __svc_rdma_free(struct work_struct *work)
WARN_ON(atomic_read(&rdma->sc_ctxt_used) != 0);
WARN_ON(atomic_read(&rdma->sc_dma_used) != 0);

+ /* De-allocate fastreg mr */
+ rdma_dealloc_frmr_q(rdma);
+
/* Destroy the QP if present (not a listener) */
if (rdma->sc_qp && !IS_ERR(rdma->sc_qp))
ib_destroy_qp(rdma->sc_qp);
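As a usage sketch (a hypothetical caller, not code from the patchset), the get/put pair brackets the life of one fast registration on a transport; the DMA mapping and WR-posting steps in the middle are elided:

```c
/* Hypothetical caller sketch: obtain an FRMR from the per-transport
 * free list (or allocate one), use it, and return it. */
struct svc_rdma_fastreg_mr *frmr;

frmr = svc_rdma_get_frmr(xprt);
if (IS_ERR(frmr))
	return PTR_ERR(frmr);

/* ... DMA-map the RPC pages into frmr->page_list, set
 * frmr->page_list_len, frmr->map_len and frmr->direction, then
 * post the fast-reg WR on the SQ ... */

/* After the last WR using this MR completes, return it to the
 * free list; this also unmaps its DMA mappings. */
svc_rdma_put_frmr(xprt, frmr);
```

Note that svc_rdma_get_frmr() returns ERR_PTR(-ENOMEM) on allocation failure, so callers check with IS_ERR() rather than for NULL.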