LinuxLists.cc - Re: [RFC,PATCH 7/15] knfsd: create RDMA transport in nfssvc

2007-05-22 06:22:02

Subject: Re: [RFC,PATCH 7/15] knfsd: create RDMA transport in nfssvc

I must say that I am hitting acronym-overload here.
RDMA IB iWARP OFA-API SQ-CQ SQ-WR sge max_qp_rd_atom ....

But to the topic of registering the RDMA listening point....

I now understand the point of port 2050 I think. RDMA adds to the
protocol. As well as all the bytes of the RPC request, there is
information about different ..uhm... regions (?) of the message.
This is like a scatter-gather list?
It lets you put the "write" data correctly aligned into a page,
so that we could eventually use the 'splice' technology to achieve
zero-copy write.

But we still have this concept of a different transport to handle
properly.

A bit of an aside: You mention that with "IB", IP is not used, so there
is no number. I assume you mean no IP address of the client? In that
situation, how do we identify the client for authorisation purposes?

More on-topic, we need to consider how this interacts with
/proc/fs/nfsd/portlist

This file can be written to and read from.
When writing, you write a decimal number of a file descriptor.
That fd should be a socket on which to expect incoming requests -
either a UDP socket or a TCP socket that is listening.
How can we extend that to RDMA? What sort of handle does user-space
use for talking over one of these DDP interfaces?
We could arrange that writing e.g.
RDMA TCP 2050
did what you want, but I would much rather avoid that sort of stuff.

When reading from a file you get one line per active transport:
ipv4 tcp 0.0.0.0 2049
ipv4 udp 0.0.0.0 2049

What would we read for RDMA? You say that it uses TCP. Can it use
UDP instead? Might it make sense to listen on only one interface?
Is there an IPv6 version of RDMA??

It seems like a real pity that it couldn't get shoe-horned into a
socket interface.
It would seem that the msg_control part of sendmsg/recvmsg would be
ideal for managing the details of data placement.

NeilBrown

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs

2007-05-22 15:13:17

by Tom Tucker

Subject: Re: [RFC,PATCH 7/15] knfsd: create RDMA transport in nfssvc

On Tue, 2007-05-22 at 16:21 +1000, Neil Brown wrote:
> I must say that I am hitting acronym-overload here.
> RDMA IB iWARP OFA-API SQ-CQ SQ-WR sge max_qp_rd_atom ....
>

Yes, it's a bit daunting.

> But to the topic of registering the RDMA listening point....
>
> I now understand the point of port 2050 I think. RDMA adds to the
> protocol. As well as all the bytes of the RPC request, there is
> information about different ..uhm... regions (?) of the message.

Yes, they call them "chunks" and the RPCRDMA header contains a
"chunk-list". The chunk-list (built by the client and used by the
server) tells the transport where in the XDR to place data.

> This is like a scatter-gather list?

Yes, three in fact: one for reading data from the client's memory and
placing it in the request (read-list), one for writing data into the
client's memory as part of the REPLY (write-list), one for the REPLY
header itself (reply-list).
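
As a rough illustration of the three lists (a sketch only, not the wire
format from the draft), they could be modelled as below. The class and
field names (Segment, ReadChunk, ChunkLists, handle, offset, length,
xdr_position) and the helper bytes_to_pull are assumptions made for this
example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Segment:
    """One registered region of client memory (field names are illustrative)."""
    handle: int   # hypothetical identifier for the registered region
    offset: int   # start of the region in the client's memory
    length: int   # number of bytes in the region


@dataclass
class ReadChunk:
    """A segment plus the position in the XDR stream where its data belongs."""
    xdr_position: int
    segment: Segment


@dataclass
class ChunkLists:
    # read-list: the server reads this data from the client's memory
    # and places it in the request.
    read_list: List[ReadChunk] = field(default_factory=list)
    # write-list: the server writes reply data into the client's memory.
    write_list: List[Segment] = field(default_factory=list)
    # reply-list: client memory set aside for the REPLY header itself.
    reply_list: List[Segment] = field(default_factory=list)


def bytes_to_pull(lists: ChunkLists) -> int:
    """Total number of bytes the server would read from the client."""
    return sum(chunk.segment.length for chunk in lists.read_list)
```

For example, a request whose payload is described by two 4096-byte read
chunks would give bytes_to_pull(...) == 8192, with the write-list and
reply-list left empty.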

> It lets you put the "write" data correctly aligned into a page,
> so that we could eventually use the 'splice' technology to achieve
> zero-copy write.
>
Yes.

> But we still have this concept of a different transport to handle
> properly.
>
> A bit of an aside: You mention that with "IB", IP is not used, so there
> is no number. I assume you mean no IP address of the client? In that
> situation, how do we identify the client for authorisation purposes?

The RDMA CMA (connection management agent) has a "sockets-like" API for
transport independent connection management. This API uses IP addresses
and port numbers for both IB and iWARP. To implement this on IB, IPoIB
and IP addresses are used for connection management, but not for data.
So for the purposes of authorization, it all works the same for all
transports.

For the purpose of transport selection, we just need a unique id. I was
simply pointing out that IP protocol numbers don't uniquely identify
transports for RDMA.

> More on-topic, we need to consider how this interacts with
> /proc/fs/nfsd/portlist

> This file can be written to and read from.
> When writing, you write a decimal number of a file descriptor.
> That fd should be a socket on which to expect incoming requests -
> either a UDP socket or a TCP socket that is listening.
> How can we extend that to RDMA? What sort of handle does user-space
> use for talking over one of these DDP interfaces?
> We could arrange that writing e.g.
> RDMA TCP 2050
> did what you want, but I would much rather avoid that sort of stuff.
>
> When reading from a file you get one line per active transport:
> ipv4 tcp 0.0.0.0 2049
> ipv4 udp 0.0.0.0 2049
>
> What would we read for RDMA?

I think this makes sense:

ipv4 rdma 0.0.0.0 2050
ipv6 rdma 0.0.0.0 2050

> You say that it uses TCP.

It currently uses TCP or IB, but could ultimately use SCTP, etc...

> Can it use
> UDP instead?

No.

> Might it make sense to listen on only one interface?

This has been discussed, but it runs against this "converged NIC" or
"universal NIC" idea that there is one NIC, one interface and one IP
address that solves ALL your problems. They even eat acronyms :-)

> Is there an IPv6 version of RDMA??

Eventually.

>
> It seems like a real pity that it couldn't get shoe-horned into a
> socket interface.

I think it is a matter of the application of sufficient force ;-) In
fairness, the I/O model is very different and the buffer management is
very-very different. This coupled with the extreme performance pressures
put on these transports led to the creation of a new API. Just getting
enough abstraction to combine iWARP and IB into a single API took a lot
of convincing.

> It would seem that the msg_control part of sendmsg/recvmsg would be
> ideal for managing the details of data placement.
>

Er, probably not enough (or too much) there, but connection management
-- absolutely.

> NeilBrown

2007-05-23 15:01:26

by Talpey, Thomas

Subject: Re: [RFC,PATCH 7/15] knfsd: create RDMA transport in nfssvc

Sorry, I've been travelling until this morning. Replies to port 2050 questions
and other transport issues below.

At 11:59 AM 5/22/2007, Tom Tucker wrote:
>On Tue, 2007-05-22 at 16:21 +1000, Neil Brown wrote:
>> I now understand the point of port 2050 I think. RDMA adds to the
>> protocol. As well as all the bytes of the RPC request, there is

I don't want to confuse things again, but I do want to add something about
the "protocol" question. As I mentioned before, the notion of an RPC transport
is actually two things, an API and a set of protocol semantics, e.g. sockets/TCP,
which we somewhat conflate into a single designator.

In the case of RDMA, the underlying protocol is actually not important. The
RDMA semantics are well-defined for any transport (you can read about
the RPC/RDMA protocol assumptions in the internet draft). Basically, the
RDMA abstraction hides the protocol almost completely.

[RDMA requirements are on page 3 ff]
<http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-rpcrdma-05.txt>

The exception, which turns out to be not much of an exception at all,
is the addressing. The reason it's not an issue is the Infiniband connection
manager, which magically allows IP addresses (IPv4 currently) to be passed
to connect. As a result, there's no need for an RDMA protocol selector,
the IP connect routing handles hardware selection and hides any notion
that IB or iWARP is handling the connection. This is a Good Thing: it makes
RPC totally RDMA-implementation agnostic.

As for why port 2050 exists, it's because the server needs to know whether
to expect RPC/TCP framing or RPC/RDMA on an incoming connection over
iWARP. Over Infiniband there's no ambiguity since connections are always
in RDMA mode, but over iWARP there is an optional "step-up" negotiation.
We need a new port to know whether the negotiation has been bypassed.
There is discussion of this in the nfsdirect Internet Draft.
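
The port rule described above can be sketched roughly as follows. This
is only an illustration: the function name expected_framing, the link
labels "ib" and "iwarp", and the Framing enum are assumptions made for
this example, and the step-up negotiation itself is not modelled.

```python
from enum import Enum

NFS_RDMA_PORT = 2050   # the port discussed in this thread for RPC/RDMA


class Framing(Enum):
    RPC_TCP = "rpc/tcp"
    RPC_RDMA = "rpc/rdma"


def expected_framing(link: str, port: int) -> Framing:
    """Guess the framing of the first bytes on a new connection.

    link is "ib" or "iwarp" (labels invented for this sketch).
    """
    if link == "ib":
        # Infiniband connections are always in RDMA mode.
        return Framing.RPC_RDMA
    if link == "iwarp":
        # On the RDMA port the step-up negotiation has been bypassed,
        # so the connection starts out as RPC/RDMA. On any other port
        # it starts with ordinary RPC/TCP framing.
        if port == NFS_RDMA_PORT:
            return Framing.RPC_RDMA
        return Framing.RPC_TCP
    raise ValueError(f"unknown link type: {link!r}")
```

The only ambiguous case is iWARP, which is why the port number is the
sole input that changes the answer there.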

[Port discussion on page 6 ff]
<http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-nfsdirect-05.txt>

>For the purpose of transport selection, we just need a unique id. I was
>simply pointing out that IP protocol numbers don't uniquely identify
>transports for RDMA.

I'll pile on here. This is important, it goes to the heart of how we "name" RPC
transports. Currently, the client switch uses a protocol id (UDP=17, TCP=6),
and this number is the de facto name of the transport. The address family
of the server's address also comes into play.

For now, we've simply stolen either 255 or 256 to register the RDMA transport
(we changed the numbers at one point IIRC). This may be okay, or maybe not.
Personally, I'd prefer a string-based transport naming, but numbers are fine
too, as long as everyone agrees.
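
To make the naming question concrete, here is a small sketch contrasting
the two schemes. It is not the kernel's transport switch: the
TransportRegistry class, its methods, and the placeholder factory values
are all hypothetical.

```python
import socket

# Numeric naming: the IP protocol number doubles as the transport name.
# RDMA has no IP protocol number of its own, so it cannot be named this
# way without borrowing an unused value.
TRANSPORT_BY_PROTO = {
    socket.IPPROTO_UDP: "udp",   # 17
    socket.IPPROTO_TCP: "tcp",   # 6
}


class TransportRegistry:
    """String-based naming: any transport can register under its own name."""

    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        if name in self._factories:
            raise ValueError(f"transport {name!r} already registered")
        self._factories[name] = factory

    def lookup(self, name):
        return self._factories[name]


registry = TransportRegistry()
# The factories are placeholders; real ones would create transport objects.
registry.register("udp", lambda: "udp transport")
registry.register("tcp", lambda: "tcp transport")
registry.register("rdma", lambda: "rdma transport")
```

With strings, "rdma" needs no borrowed number, and a duplicate
registration is caught by name.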

>> When reading from a file you get one line per active transport:
>> ipv4 tcp 0.0.0.0 2049
>> ipv4 udp 0.0.0.0 2049
>>
>> What would we read for RDMA?
>
>I think this makes sense:
>
>ipv4 rdma 0.0.0.0 2050
>ipv6 rdma 0.0.0.0 2050

Like that - strings. The normal TCP transport would be "ipv4-tcp" for
instance.

For RDMA however, the "ipv4" is only significant to the RDMA connection
manager. It serves only to describe the format of the following address.
So while it makes sense, it sort of doesn't matter.

By the way, "ipv6 rdma 0.0.0.0 2050" doesn't make sense, because "0.0.0.0"
isn't an ipv6 address.
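
A quick way to catch that kind of mismatch is to check the address
against the family named at the start of the line. The sketch below
assumes the four-field "family transport address port" layout shown in
this thread; parse_portlist_line is a made-up helper, not an existing
tool.

```python
import ipaddress

FAMILY_VERSION = {"ipv4": 4, "ipv6": 6}


def parse_portlist_line(line):
    """Split a 'family transport address port' line and check the address."""
    family, transport, addr, port = line.split()
    if family not in FAMILY_VERSION:
        raise ValueError(f"unknown address family: {family!r}")
    ip = ipaddress.ip_address(addr)   # raises ValueError if malformed
    if ip.version != FAMILY_VERSION[family]:
        raise ValueError(f"{addr!r} is not an {family} address")
    return family, transport, str(ip), int(port)
```

Under this check "ipv4 rdma 0.0.0.0 2050" is accepted and
"ipv6 rdma 0.0.0.0 2050" is rejected; the IPv6 wildcard would be
written "::" instead.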

>> You say that it uses TCP.
>
>It currently uses TCP or IB, but could ultimately use SCTP, etc...

It makes absolutely no difference whether TCP is used as the RDMA transport,
from the perspective of the upper layer. The semantics would be identical
if SCTP were used. So, I would not include "tcp" in the rdma transport selector
at all.

Tom.
