2007-03-26 23:10:50

by Chuck Lever

Subject: cel's patches under development

2007-04-10 15:24:09

by Olaf Kirch

Subject: Re: cel's patches under development

Hi,

I've been working a bit on some changes to clean up the xdr_buf
handling. Patches can be found here:
http://oss.oracle.com/~okir/xdr-buf-rewrite/

The patch set aims to

- factor out common send/recv code on the client and server side, and
put it into socklib.c
- get rid of some of the messy and duplicated code in xdr.c
- move the data copies out of the data_ready callbacks in xprtsock.c.
(In the long run, one may want to think about per-CPU rpciod threads,
so that data remains inside the cache of the CPU that processes it,
rather than bouncing around)

The current code tries to guess the correct alignment of the returned data in
READ, READLINK and READDIR replies, and inlines a vector of pages "just so"
into the XDR buffer so that the data ends up in the right place. If the guess
is wrong, we jump through 15 hoops to realign it.
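
Schematically, the current READ path does something like this (a
simplified sketch modelled on the NFSv3 READ argument encoder; field
spellings are approximate for this tree):

/* Simplified sketch: the encoder guesses where the bulk data will
 * start in the reply (RPC header + auth verifier + READ result
 * words) and splices the caller's pages into the receive xdr_buf
 * at that offset. */
static int nfs3_xdr_readargs(struct rpc_rqst *req, __be32 *p,
                             struct nfs_readargs *args)
{
        struct rpc_auth *auth = req->rq_task->tk_auth;
        unsigned int replen;

        p = xdr_encode_fhandle(p, args->fh);
        p = xdr_encode_hyper(p, args->offset);
        *p++ = htonl(args->count);
        req->rq_slen = xdr_adjust_iovec(&req->rq_svec[0], p);

        /* Guess the reply offset and inline the page vector there.
         * If the reply doesn't match the guess, the received data
         * has to be shifted into place afterwards. */
        replen = (RPC_REPHDRSIZE + auth->au_rslack + NFS3_readres_sz) << 2;
        xdr_inline_pages(&req->rq_rcv_buf, replen,
                         args->pages, args->pgbase, args->count);
        return 0;
}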

The approach I'm trying is to leave the data in skb's for as long as it's
practical, and copy data to the page cache in the XDR decode routines.
In order to hide the skb mechanics, there's an rpc_data object that
encapsulates this. It looks a little awkward, but Chuck complained
about layering violations so I tried this instead.
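
Very roughly, what I have in mind looks like this (a hypothetical
sketch; the names and the exact operations are illustrative, not the
final patch):

/* Hypothetical rpc_data: a transport-specific receive buffer whose
 * mechanics (skb list, RDMA region, ...) hide behind two methods. */
struct rpc_data {
        /* Copy 'len' bytes starting at 'offset' in the received
         * message into the pages supplied by the decode routine. */
        int     (*copy_to_pages)(struct rpc_data *data,
                                 unsigned int offset,
                                 struct page **pages,
                                 unsigned int pgbase,
                                 unsigned int len);

        /* Release the underlying resources (e.g. free the skbs). */
        void    (*release)(struct rpc_data *data);

        void    *private;       /* e.g. the skb list for TCP/UDP */
};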

The patch set is based on Chuck's tree (drop-20070409). It's
far from complete; in particular, I want to get rid of any code that
uses xdr_inline_pages, and I want to fully convert svcsock.c to use
the new generic recv functions as well.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2007-04-10 19:32:31

by Trond Myklebust

Subject: Re: cel's patches under development

On Tue, 2007-04-10 at 17:22 +0200, Olaf Kirch wrote:
> Hi,
> The approach I'm trying is to leave the data in skb's for as long as it's
> practical, and copy data to the page cache in the XDR decode routines.
> In order to hide the skb mechanics, there's an rpc_data object that
> encapsulates this. It looks a little awkward, but Chuck complained
> about layering violations so I tried this instead.

I'm a bit wary of this. skbs are usually allocated from the ATOMIC pool
which is a very limited resource: their lifetime really wants to be as
short as possible. Won't this basically end up aggravating an already
nasty situation w.r.t. our behaviour when under memory pressure?

Also, what about stuff like RDMA, which doesn't need this sort of
mechanism in order to get things right?

Finally, will we need to keep writing these very complex handlers for
every new protocol that we want to add (e.g. IPv6, IPoIB, ...)?

Cheers
Trond


2007-04-11 06:43:25

by Olaf Kirch

Subject: Re: cel's patches under development

On Tuesday 10 April 2007 21:32, Trond Myklebust wrote:
> I'm a bit wary of this. skbs are usually allocated from the ATOMIC pool
> which is a very limited resource: their lifetime really wants to be as
> short as possible. Won't this basically end up aggravating an already
> nasty situation w.r.t. our behaviour when under memory pressure?

I don't think so. We're talking about an additional delay caused by
moving the memcpy from the BH to some process (user or rpciod).
I agree that it would be a bad idea to keep these skbs around longer
than needed.

On the server side, the impact of this delay will probably be even
lower - skbs usually spend some time on the TCP socket's receive
queue anyway, and nfsd pulls the received packet in via recvmsg.

> Also, what about stuff like RDMA, which doesn't need this sort of
> mechanism in order to get things right?

But RDMA may benefit from the proposed interface for transport
specific receive buffers (rpc_data objects). How that buffer works is
entirely up to the transport. For TCP and UDP it's skb_lists, but for
RDMA it would probably be something very different.

Here's the mode of operation - XDR functions that expect to receive
data into a pagevec, such as READ, READLINK etc., call
xprt_rcvbuf_alloc(xprt, pages, pgbase, pglen) to allocate a
transport-specific buffer object. Transports such as TCP or UDP
ignore the page vector, but the RDMA transport could use it
to set up its buffers. From the implementation point of view,
it's probably not much different from extracting the pagevec
information from the xdr_buf receive buffer, but it looks cleaner
to me.
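
In a READ call path, that would look roughly like this (hypothetical
sketch; only xprt_rcvbuf_alloc is the interface proposed above, the
rest is schematic):

/* Hypothetical: set up a transport-specific receive buffer for the
 * READ pagevec instead of inlining the pages into the xdr_buf. */
static int nfs_read_setup_rcvbuf(struct rpc_xprt *xprt,
                                 struct nfs_readargs *args)
{
        struct rpc_data *data;

        /* TCP/UDP ignore the page vector and keep the received data
         * in skbs until decode time; RDMA could register the pages
         * for direct data placement instead. */
        data = xprt_rcvbuf_alloc(xprt, args->pages, args->pgbase,
                                 args->count);
        if (data == NULL)
                return -ENOMEM;

        /* Stash 'data' with the request (details elided); the XDR
         * decode routine later moves the bulk data into the pages
         * via data->copy_to_pages(). */
        return 0;
}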

Likewise, it is possible to create transport-specific rpc_data
objects for sending data (and e.g. have RDMA do a DMA Send
for these). This would let us get rid of the pagevec inside the
xdr_buf altogether.

> Finally, will we need to keep writing these very complex handlers for
> every new protocol that we want to add (e.g. IPv6, IPoIB, ...)?

No. TCP and UDP share the same skb_list handling code,
regardless of address family and link-layer protocol. A new
transport such as DCCP would not need a new handler
either.

I believe the net result of this proposed restructuring will be
less complexity. Right now we have half a dozen or so
functions that walk through an xdr_buf (head, pagevec, tail),
whereas with the proposed changes the complexity is
confined to one place.

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2007-04-11 11:39:41

by Talpey, Thomas

Subject: Re: cel's patches under development

At 02:41 AM 4/11/2007, Olaf Kirch wrote:
>On Tuesday 10 April 2007 21:32, Trond Myklebust wrote:
>> Also, what about stuff like RDMA, which doesn't need this sort of
>> mechanism in order to get things right?
>
>But RDMA may benefit from the proposed interface for transport
>specific receive buffers (rpc_data objects). How that buffer works is
>entirely up to the transport. For TCP and UDP it's skb_lists, but for
>RDMA it would probably be something very different.

FWIW, I'm still looking at this from the RDMA perspective. I like the
abstraction of the data representation, but I don't like the
protocol-specific switch statements in the data movement path. These
are cumbersome and don't lend themselves to dynamic loading. Also,
RDMA isn't a new protocol family, so it doesn't lend itself to an
IPv4/IPv6 and UDP/TCP kind of switch.

>Here's the mode of operation - XDR functions that expect to receive
>data into a pagevec, such as READ, READLINK etc., call
>xprt_rcvbuf_alloc(xprt, pages, pgbase, pglen) to allocate a
>transport-specific buffer object. Transports such as TCP or UDP
>ignore the page vector, but the RDMA transport could use it
>to set up its buffers. From the implementation point of view,

The RDMA code turns out not to need this, because the xdr iovecs
have a clear encoding for pagelists (head->pages[]->tail). The only
ambiguity comes from the upper layer - NFS READDIR, for example,
sends down a pagelist but doesn't expect to cache the result
directly the way READ does. We use a different RDMA operation for it,
and there's a guess made at the lower layer. However, moving this
function to XDR doesn't fix that. So I'm semi-neutral on the proposed
change from the RDMA perspective.
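
(For reference, the encoding I mean is the xdr_buf layout from
include/linux/sunrpc/xdr.h, abridged:)

struct xdr_buf {
        struct kvec     head[1],        /* RPC header + inline data */
                        tail[1];        /* data following the pages */

        struct page **  pages;          /* bulk data, e.g. READ payload */
        unsigned int    page_base,      /* offset into the first page */
                        page_len;       /* length of the page data */

        unsigned int    buflen,         /* total capacity of the buffer */
                        len;            /* length of the encoded message */
};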

>I believe the net result of this proposed restructuring will be
>less complexity. Right now we have half a dozen or so
>functions that walk through an xdr_buf (head, pagevec, tail),
>whereas with the proposed changes the complexity is
>confined to one place.

This is good, but I'd like to see a stronger benefit, personally. More
comments after I get a better look.

Tom.


2007-04-11 12:03:45

by Olaf Kirch

Subject: Re: cel's patches under development

On Wednesday 11 April 2007 13:39, Talpey, Thomas wrote:
> FWIW, I'm still looking at this from the RDMA perspective. I like the
> abstraction of the data representation, but I don't like the
> protocol-specific switch statements in the data movement path. These

Right now, the proposed rpc_data thing has two function pointers,
so no switch is needed.

> are cumbersome and don't lend themselves to dynamic loading. Also,
> RDMA isn't a new protocol family, so it doesn't lend itself to an
> IPv4/IPv6 and UDP/TCP kind of switch.

How is it integrated on the client side? Wouldn't this be another
transport in terms of Chuck's transport switch? If so, it would just
provide a function for creating an RDMA specific rpc_data object
that does the right thing.

> The RDMA code turns out not to need this, because the xdr iovecs
> have a clear encoding for pagelists (head->pages[]->tail).

But that means yet another non-obvious user of xdr_buf that makes
it even harder to rewrite that whole code :-) Wouldn't it be better
to make it explicit "here is a bunch of pages that I want you to
put the bulk data into"?

Olaf
--
Olaf Kirch | --- o --- Nous sommes du soleil we love when we play
[email protected] | / | \ sol.dhoop.naytheet.ah kin.ir.samse.qurax

2007-04-11 12:32:23

by Talpey, Thomas

Subject: Re: cel's patches under development

At 08:02 AM 4/11/2007, Olaf Kirch wrote:
>> are cumbersome and don't lend themselves to dynamic loading. Also,
>> RDMA isn't a new protocol family, so it doesn't lend itself to an
>> IPv4/IPv6 and UDP/TCP kind of switch.
>
>How is it integrated on the client side? Wouldn't this be another
>transport in terms of Chuck's transport switch? If so, it would just
>provide a function for creating an RDMA specific rpc_data object
>that does the right thing.

Sure, it's switched, though frankly it's imperfect since the switch
selects based on address family / protocol naming just like others.
But once we hack^H^H^H^Hwork around that it just jumps to the
rdma transport setup.

>> The RDMA code turns out not to need this, because the xdr iovecs
>> have a clear encoding for pagelists (head->pages[]->tail).
>
>But that means yet another non-obvious user of xdr_buf that makes
>it even harder to rewrite that whole code :-) Wouldn't it be better
>to make it explicit "here is a bunch of pages that I want you to
>put the bulk data into"?

Yep, and I agree. I'm not sure about the "non-obvious" part, though;
it seems like a direct application of the structure to me.

What I really want is some kind of upper layer hint that follows the xdr
buffer, perhaps as explicit as "this is a read or write", or at least "this is
bulk data". Basically, something from xdr_inline_pages/xdr_encode_pages
that indicates the purpose. How it's accounted for and represented in
iovecs isn't as much of a concern.
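
To make it concrete: even a flag on the xdr_buf, set when the pages
are attached, would be enough (hypothetical sketch; the flag names
and the 'flags' field are illustrative, not existing code):

/* Hypothetical upper-layer hint: tag the pagevec with its purpose
 * when it is attached, so the transport can pick the right bulk
 * operation without guessing.  Assumes a new 'flags' field were
 * added to struct xdr_buf. */
#define XDRBUF_READ     0x01    /* pages will receive bulk data */
#define XDRBUF_WRITE    0x02    /* pages carry outgoing bulk data */

        xdr_inline_pages(&req->rq_rcv_buf, replen,
                         args->pages, args->pgbase, count);
        req->rq_rcv_buf.flags |= XDRBUF_READ;   /* hint: bulk read data */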

I think a second part of your proposal is that there'd be a new callout
to set up these buffers for I/O prior to send marshalling. For RDMA we
basically do that in the transport switch send routine, where we insert
the per-transport headers (RDMA inserts chunk headers; TCP just inserts
its record marker). I'm still thinking about whether there's some
advantage to a separate callout.

Tom.

