Hi,
I've been giving the fast memory registration NFS RDMA
patches a spin, and I've got a couple questions.
AFAICS the default xprtrdma memory registration model
is still RPCRDMA_ALLPHYSICAL; I had to
"echo 6 > /proc/sys/sunrpc/rdma_memreg_strategy"
prior to a mount to get fast registration. Given that fast
registration has better security properties for iWARP, and
the fallback is RPCRDMA_ALLPHYSICAL if fast registration is
not supported, is it more appropriate to have RPCRDMA_FASTREG
be the default?
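For reference, here's roughly what I'm doing now to get fast registration
(the mount line is just a sketch along the lines of
Documentation/filesystems/nfs-rdma.txt; the server name and export path
are placeholders):

  # 6 selects the fast registration mode; must be set before the mount
  echo 6 > /proc/sys/sunrpc/rdma_memreg_strategy
  mount -o rdma,port=20049 server:/export /mnt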
Second, it seems that the number of pages in a client fast
memory registration is still limited to RPCRDMA_MAX_DATA_SEGS.
So on a client write, without fast registration I get
RPCRDMA_MAX_DATA_SEGS RDMA reads of 1 page each, whereas with
fast registration I get 1 RDMA read of RPCRDMA_MAX_DATA_SEGS
pages.
In either case my maximum rsize, wsize for an RDMA mount
is still 32 KiB.
My understanding is that, e.g., a Chelsio T3 with the
2.6.27-rc driver can support 24 pages in a fast registration
request. So, what I was hoping to see with a T3 were RPCs with
RPCRDMA_MAX_DATA_SEGS chunks, each for a fast registration of
24 pages, making possible an RDMA mount with 768 KiB for
rsize, wsize.
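(Spelling out the arithmetic, assuming 4 KiB pages, which is what makes
the current limit come out to RPCRDMA_MAX_DATA_SEGS == 8:

  echo $(( 8 * 1 * 4 ))    # today:  8 segs   x  1 page   x 4 KiB = 32 KiB
  echo $(( 8 * 24 * 4 ))   # hoped:  8 chunks x 24 pages  x 4 KiB = 768 KiB
)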
Is something like that possible? If so, do you have any
work in progress along those lines?
-- Jim
Jim Schutt wrote:
> Hi,
>
> I've been giving the fast memory registration NFS RDMA
> patches a spin, and I've got a couple questions.
>
> AFAICS the default xprtrdma memory registration model
> is still RPCRDMA_ALLPHYSICAL; I had to
> "echo 6 > /proc/sys/sunrpc/rdma_memreg_strategy"
> prior to a mount to get fast registration. Given that fast
> registration has better security properties for iWARP, and
> the fallback is RPCRDMA_ALLPHYSICAL if fast registration is
> not supported, is it more appropriate to have RPCRDMA_FASTREG
???
> be the default?
I'm not sure I parsed this right, but I think you're asking if FASTREG
should be the default if it _is_ supported by the HW. IMO yes.
>
> Second, it seems that the number of pages in a client fast
> memory registration is still limited to RPCRDMA_MAX_DATA_SEGS.
> So on a client write, without fast registration I get
> RPCRDMA_MAX_DATA_SEGS RDMA reads of 1 page each, whereas with
> fast registration I get 1 RDMA read of RPCRDMA_MAX_DATA_SEGS
> pages.
>
> In either case my maximum rsize, wsize for an RDMA mount
> is still 32 KiB.
Sure, big data was not the purpose of the patch.
>
> My understanding is that, e.g., a Chelsio T3 with the
> 2.6.27-rc driver can support 24 pages in a fast registration
> request. So, what I was hoping to see with a T3 were RPCs with
> RPCRDMA_MAX_DATA_SEGS chunks, each for a fast registration of
> 24 pages each, making possible an RDMA mount with 768 KiB for
> rsize, wsize.
>
> Is something like that possible? If so, do you have any
> work in progress along those lines?
>
I have nothing in the works along those lines -- sorry.
> -- Jim
>
>
On Thu, 2008-09-25 at 14:29 -0600, Tom Tucker wrote:
> Jim Schutt wrote:
> > Hi,
> >
> > I've been giving the fast memory registration NFS RDMA
> > patches a spin, and I've got a couple questions.
> >
> > AFAICS the default xprtrdma memory registration model
> > is still RPCRDMA_ALLPHYSICAL; I had to
> > "echo 6 > /proc/sys/sunrpc/rdma_memreg_strategy"
> > prior to a mount to get fast registration. Given that fast
> > registration has better security properties for iWARP, and
> > the fallback is RPCRDMA_ALLPHYSICAL if fast registration is
> > not supported, is it more appropriate to have RPCRDMA_FASTREG
> ???
> > be the default?
>
> I'm not sure I parsed this right, but I think you're asking if FASTREG
> should be the default if it _is_ supported by the HW. IMO yes.
Yes, that's it.
>
> >
> > Second, it seems that the number of pages in a client fast
> > memory registration is still limited to RPCRDMA_MAX_DATA_SEGS.
> > So on a client write, without fast registration I get
> > RPCRDMA_MAX_DATA_SEGS RDMA reads of 1 page each, whereas with
> > fast registration I get 1 RDMA read of RPCRDMA_MAX_DATA_SEGS
> > pages.
> >
> > In either case my maximum rsize, wsize for an RDMA mount
> > is still 32 KiB.
>
> Sure, Big data was not the purpose of the patch.
OK.
>
> >
> > My understanding is that, e.g., a Chelsio T3 with the
> > 2.6.27-rc driver can support 24 pages in a fast registration
> > request. So, what I was hoping to see with a T3 were RPCs with
> > RPCRDMA_MAX_DATA_SEGS chunks, each for a fast registration of
> > 24 pages each, making possible an RDMA mount with 768 KiB for
> > rsize, wsize.
> >
> > Is something like that possible? If so, do you have any
> > work in progress along those lines?
> >
>
> I have nothing in the works along those lines -- sorry.
OK - thanks for letting me know.
>
> > -- Jim
> >
> >
>
>
At 11:39 AM 9/25/2008, Jim Schutt wrote:
>Hi,
>
>I've been giving the fast memory registration NFS RDMA
>patches a spin, and I've got a couple questions.
Your questions are mainly about the client, so I'll jump in here too...
>
>AFAICS the default xprtrdma memory registration model
>is still RPCRDMA_ALLPHYSICAL; I had to
> "echo 6 > /proc/sys/sunrpc/rdma_memreg_strategy"
>prior to a mount to get fast registration. Given that fast
>registration has better security properties for iWARP, and
>the fallback is RPCRDMA_ALLPHYSICAL if fast registration is
>not supported, is it more appropriate to have RPCRDMA_FASTREG
>be the default?
Possibly. At this point we don't have enough experience with FASTREG
to know whether it's better. For large-footprint memory on the server
with a Chelsio interconnect, it's required, but on InfiniBand adapters,
there are more degrees of freedom and historically ALLPHYS works best.
Also, at this point we don't know that FASTREG is really FASTer. :-)
Frankly, I hate calling things "fast" or "new"; there's always something
"faster" or "newer". But the OFA code uses this name. In any case,
the codepath still needs testing and performance evaluation before
we make it a default.
>Second, it seems that the number of pages in a client fast
>memory registration is still limited to RPCRDMA_MAX_DATA_SEGS.
>So on a client write, without fast registration I get
>RPCRDMA_MAX_DATA_SEGS RDMA reads of 1 page each, whereas with
>fast registration I get 1 RDMA read of RPCRDMA_MAX_DATA_SEGS
>pages.
Yes, the client is currently limited to this many segments. You can raise
the number by recompiling, but I don't recommend it; the client gets rather
greedy with per-mount memory. I do plan to remedy this.
In the meantime, let me offer the observation that multiple RDMA Reads
are not a penalty, since they are able to stream up to the IRD max offered
by the client, which is in turn more than sufficient to maintain bandwidth
usage. Are you seeing a bottleneck? If so, I'd like to see the output from
the client with RPCDBG_TRANS turned on; it prints the IRD at connect time.
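If it helps, something like this should toggle the transport debugging
(rpcdebug ships with nfs-utils and just flips the RPCDBG_TRANS bit in
/proc/sys/sunrpc/rpc_debug for you):

  rpcdebug -m rpc -s trans   # enable RPCDBG_TRANS
  # remount / reconnect, then look for the IRD line in the kernel log
  rpcdebug -m rpc -c trans   # turn it back off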
>In either case my maximum rsize, wsize for an RDMA mount
>is still 32 KiB.
Yes. But here's the deal: write throughput is almost never a network
problem. Instead, it's either a server ordering problem, or a congestion/
latency issue. The rub is that large I/Os help the former (by cramming lots
of writes together in a single request), but they hurt the latter (by
cramming large chunks into the pipe).
In other words, small I/Os on low-latency networks can be good.
However, the Linux NFS server has a rather clumsy interface to the
backing filesystem, and if you're using ext, its ability to handle many
32KB writes in arbitrary order is somewhat poor. What type
of storage are you exporting? Are you using async on the server?
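(By "async" I mean the export option, i.e. something along these lines on
the server; the export path and client spec below are only placeholders:

  echo '/export  192.168.0.0/24(rw,async)' >> /etc/exports
  exportfs -ra
)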
>
>My understanding is that, e.g., a Chelsio T3 with the
>2.6.27-rc driver can support 24 pages in a fast registration
>request. So, what I was hoping to see with a T3 were RPCs with
>RPCRDMA_MAX_DATA_SEGS chunks, each for a fast registration of
>24 pages each, making possible an RDMA mount with 768 KiB for
>rsize, wsize.
You can certainly try raising MAX_DATA_SEGS to this value and building
a new sunrpc module. I do not recommend such a large write size, however;
you won't be able to do many mounts, due to resource issues on both client
and server.
If you're seeing throughput problems, I would suggest trying a 64KB write
size first (MAX_DATA_SEGS==16), and if that improves things, then maybe 128KB (32).
128KB is generally more than enough to make ext happy (well, happi*er*).
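As a rough sketch only, and assuming the constant still lives in
net/sunrpc/xprtrdma/xprt_rdma.h in your tree, the experiment would look
something like:

  # edit net/sunrpc/xprtrdma/xprt_rdma.h:
  #   #define RPCRDMA_MAX_DATA_SEGS  (16)   /* 16 x 4 KiB pages = 64 KiB */
  make M=net/sunrpc modules
  # install the rebuilt modules, reload them, then remount with
  # rsize=65536,wsize=65536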
>
>Is something like that possible? If so, do you have any
>work in progress along those lines?
I do. But I'd be very interested to see more data before committing to
the large-I/O approach. Can you help?
Tom.