2008-09-26 19:16:15

by Scott Atchley

Subject: Congestion window or other reason?

Hi all,

I have bulk writing and reading working on RPCMX, sort of.

Reading seems to work fine. Writing seems to work fine if
RPCMX_MAX_DATA_SEGS is set to <=32 (versus the default of 8 in
RPCRDMA). Above that value, I get an oops on the server.

Reading seems to work fine with values up to 128. I have not tried
256, because the code then adds 2 to create RPCMX_MAX_SEGS and MX is
limited to 256 total segments. I tried 254, but the client seems to
prefer power of 2 values, so it uses 128 pages (512 KB).

As best I can tell when reading a 1 MB file, the first transfer is 32
pages, the second transfer is 128 pages, and the final transfer is the
remaining 96 pages.

When writing, the client tries to send 128 pages twice. The SVC MX
layer receives them correctly (it parses the request correctly and
posts 128 pages), but I get an oops in svc_process().

Is this related to the congestion window or something else? It doesn't
come up in RPCRDMA since it is limited to 8 pages.

Any advice would be great.

Thanks,

Scott


--
Scott Atchley
Myricom Inc.
http://www.myri.com




2008-09-26 20:08:53

by Talpey, Thomas

Subject: Re: Congestion window or other reason?

At 03:16 PM 9/26/2008, Scott Atchley wrote:
>Hi all,
>
>I have bulk writing and reading working on RPCMX, sort of.

I'd love to hear more about RPCMX! What is it?

>
>Reading seems to work fine. Writing seems to work fine if
>RPCMX_MAX_DATA_SEGS is set to <=32 (versus the default of 8 in
>RPCRDMA). Above that value, I get an oops on the server.
>
>Reading seems to work fine with values up to 128. I have not tried
>256, because the code then adds 2 to create RPCMX_MAX_SEGS and MX is
>limited to 256 total segments. I tried 254, but the client seems to
>prefer power of 2 values, so it uses 128 pages (512 KB).
>
>As best I can tell when reading a 1 MB file, the first transfer is 32
>pages, the second transfer is 128 pages, and the final transfer is the
>remaining 96 pages.
>
>When writing, the client tries to send 128 pages twice. The SVC MX
>layer receives them correctly (it parses the request correctly and
>posts 128 pages), but I get an oops in svc_process().
>
>Is this related to the congestion window or something else? It doesn't
>come up in RPCRDMA since it is limited to 8 pages.

The congestion window is all about the number of concurrent RPC
requests, and isn't dependent on the number of segments or even the
size of each message. Congestion is a client-side thing; the server
never delays its replies.

The RPC/RDMA code uses the congestion window to manage its flow control
window with the server. There is a second, somewhat hidden congestion
window that the RDMA adapters use between one another for RDMA Read
requests, the IRD/ORD. But those aren't visible outside the lowest layer.
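
To make the credit idea concrete, here is a minimal userspace sketch of
that accounting (the names are made up for illustration, not the actual
xprtrdma symbols): the client gates only on the count of outstanding
requests, never on their size.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical credit window: the server grants "credits" in each
     * reply; the client may not have more than that many RPCs in
     * flight, regardless of how large each individual message is. */
    struct credit_window {
        unsigned int granted;    /* last grant received from the server */
        unsigned int in_flight;  /* requests sent but not yet answered  */
    };

    static bool can_send(const struct credit_window *w)
    {
        return w->in_flight < w->granted;
    }

    static void on_send(struct credit_window *w)
    {
        w->in_flight++;
    }

    static void on_reply(struct credit_window *w, unsigned int new_grant)
    {
        w->in_flight--;
        w->granted = new_grant;  /* the server may raise or lower this */
    }

    int main(void)
    {
        struct credit_window w = { .granted = 32, .in_flight = 0 };

        while (can_send(&w))     /* nothing here depends on message size */
            on_send(&w);
        printf("%u requests in flight when the window closes\n", w.in_flight);
        on_reply(&w, 16);        /* a busy server can shrink the grant */
        printf("grant now %u, in flight %u\n", w.granted, w.in_flight);
        return 0;
    }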

I would be surprised if you can manage hundreds of pages times dozens
of active requests without some significant resource issues at the
server. Perhaps your problems are related to those?
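
(For a rough sense of scale, assuming 4 KB pages: 128 pages is 512 KB
per request, so 32 outstanding requests is on the order of 16 MB of
buffering for one client, and a few hundred such clients would put you
into the multi-GB range at the server.)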

Tom.


2008-09-26 20:33:25

by Scott Atchley

Subject: Re: Congestion window or other reason?

On Sep 26, 2008, at 4:08 PM, Talpey, Thomas wrote:

> I'd love to hear more about RPCMX! What is it?

It is based on the RPCRDMA code using MX. MX is Myricom's second-
generation zero-copy, kernel-bypass API (GM was the first). Unlike IB/
iWarp, MX provides only a two-sided interface (send/recv) and is
closely modeled after MPI-1 semantics.

I wrote the MX ports for Lustre and PVFS2. I am finding this to be
more challenging than either of those.

> The congestion window is all about the number of concurrent RPC
> requests, and isn't dependent on the number of segments or even the
> size of each message. Congestion is a client-side thing; the server
> never delays its replies.

Interesting. The client does not have a global view, unfortunately,
and has no idea how busy the server is (i.e. how many other clients it
is servicing).

> The RPC/RDMA code uses the congestion window to manage its flow
> control window with the server. There is a second, somewhat hidden
> congestion window that the RDMA adapters use between one another for
> RDMA Read requests, the IRD/ORD. But those aren't visible outside the
> lowest layer.

Is this due to the fact that IB uses queue pairs (QP) and one peer
cannot send a message to another unless a slot is available in the QP?
If so, we do not have this limitation in MX (no QPs).

> I would be surprised if you can manage hundreds of pages times dozens
> of active requests without some significant resource issues at the
> server. Perhaps your problems are related to those?
>
> Tom.

In Lustre and PVFS2, the network MTU is 1 MB (and optionally 4 MB in
PVFS2). We do not have issues in MX scaling to hundreds or thousands
of peers (again no QPs). As for handling a few hundred MBs from a few
hundred clients, it should be no problem. Whether the filesystem back-
end can handle it is another question.

When using TCP with rsize=wsize=1MB, is there anything in RPC besides
TCP that restricts how much data is sent over (or received at the
server) initially? That is, does a client start by sending a smaller
amount, then increase up to the 1 MB limit? Or, does it simply try to
write() 1 MB? Or does the server read a smaller amount and then
subsequently larger amounts?

Thanks,

Scott

2008-09-26 21:25:15

by Talpey, Thomas

Subject: Re: Congestion window or other reason?

At 04:33 PM 9/26/2008, Scott Atchley wrote:
>On Sep 26, 2008, at 4:08 PM, Talpey, Thomas wrote:
>
>> I'd love to hear more about RPCMX! What is it?
>
>It is based on the RPCRDMA code using MX. MX is Myricom's second-
>generation zero-copy, kernel-bypass API (GM was the first). Unlike IB/
>iWarp, MX provides only a two-sided interface (send/recv) and is
>closely modeled after MPI-1 semantics.

Ok, you've got my attention! Is the code visible somewhere btw?

>
>I wrote the MX ports for Lustre and PVFS2. I am finding this to be
>more challenging than either of those.
>
>> The congestion window is all about the number of concurrent RPC
>> requests, and isn't dependent on the number of segments or even the
>> size of each message. Congestion is a client-side thing; the server
>> never delays its replies.
>
>Interesting. The client does not have a global view, unfortunately,
>and has no idea how busy the server is (i.e. how many other clients it
>is servicing).

Correct, because the NFS protocol is not designed this way. However,
the server can manage clients via the RPCRDMA credit mechanism, by
allowing them to send more or fewer messages in response to its own
load.
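
As a sketch of what that management can look like (hypothetical names,
not the actual svcrdma code), the grant the server writes into each
reply header is simply clamped to what it is currently willing to
commit:

    #include <stdint.h>

    /* Hypothetical server-side grant: clamp what the client asked for
     * to what this server currently has free, and return that value in
     * the credit field of the RPC/RDMA reply header. */
    uint32_t grant_credits(uint32_t client_requested,
                           uint32_t free_recv_contexts,
                           uint32_t max_per_client)
    {
        uint32_t grant = client_requested;

        if (grant > max_per_client)
            grant = max_per_client;
        if (grant > free_recv_contexts)
            grant = free_recv_contexts;
        if (grant == 0)
            grant = 1;   /* never starve a client completely */
        return grant;
    }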

>
>> The RPC/RDMA code uses the congestion window to manage its flow
>> control window with the server. There is a second, somewhat hidden
>> congestion window that the RDMA adapters use between one another for
>> RDMA Read requests, the IRD/ORD. But those aren't visible outside
>> the lowest layer.
>
>Is this due to the fact that IB uses queue pairs (QP) and one peer
>cannot send a message to another unless a slot is available in the QP?
>If so, we do not have this limitation in MX (no QPs).

RPCRDMA credits are primarily used for this; it's not so much the fact
that there's a queue pair as the number of posted receives. If the
client sends more than the server has available, then the connection
will fail. However, the server can implement something called a
"shared receive queue", which permits a sort of oversubscription.


>
>> I would be surprised if you can manage hundreds of pages times dozens
>> of active requests without some significant resource issues at the
>> server. Perhaps your problems are related to those?
>>
>> Tom.
>
>In Lustre and PVFS2, the network MTU is 1 MB (and optionally 4 MB in
>PVFS2). We do not have issues in MX scaling to hundreds or thousands
>of peers (again no QPs). As for handling a few hundred MBs from a few
>hundred clients, it should be no problem. Whether the filesystem back-
>end can handle it is another question.

Yes, and dedicating that much memory to clients is another. With the
IB and iWARP protocols and the current Linux server, these buffers are
not shared. This enhances integrity and protection, but it limits the
maximum scaling. I take it this is not a concern for you?

>
>When using TCP with rsize=wsize=1MB, is there anything in RPC besides
>TCP that restricts how much data is sent over (or received at the
>server) initially? That is, does a client start by sending a smaller
>amount, then increase up to the 1 MB limit? Or, does it simply try to
>write() 1 MB? Or does the server read a smaller amount and then
>subsequently larger amounts?

RPC is purely a request/response mechanism, with rules for discovering
endpoints and formatting requests and replies. RPCRDMA adds framing
for RDMA networks, and mechanisms for managing RDMA networks such
as credits and rules on when to use RDMA. Finally, the NFS/RDMA transport
binding makes requirements for sending messages. Since there are several
NFS protocol versions, the answer to your question depends on that.
There is no congestion control (slow start, message sizes) in the RPC
protocol itself; however, many RPC implementations do provide it.

I'm not certain if your question is purely about TCP, or if it's about RDMA
with TCP as an example. However, in both cases the answer is the same:
it's not about the size of a message, it's about the message itself. If
the client and server have agreed that a 1MB write is ok, then yes the
client may immediately send 1MB.

Tom.


2008-09-27 00:35:28

by Scott Atchley

Subject: Re: Congestion window or other reason?

On Sep 26, 2008, at 5:24 PM, Talpey, Thomas wrote:

> Ok, you've got my attention! Is the code visible somewhere btw?

No, it is in our internal CVS. I can send you a tarball if you want to
take a look.

>> Interesting. The client does not have a global view, unfortunately,
>> and has no idea how busy the server is (i.e. how many other clients
>> it is servicing).
>
> Correct, because the NFS protocol is not designed this way. However,
> the server can manage clients via the RPCRDMA credit mechanism, by
> allowing them to send more or fewer messages in response to its own
> load.

I believe that I am duplicating the RPCRDMA usage of credits. I need
to check.

> RPCRDMA credits are primarily used for this; it's not so much the
> fact that there's a queue pair as the number of posted receives. If
> the client sends more than the server has available, then the
> connection will fail. However, the server can implement something
> called a "shared receive queue", which permits a sort of
> oversubscription.

MX's behavior is more like the shared receive queue. Unexpected
messages <=32KB are stored in a temp buffer until the matching receive
has been posted. Once it is posted, the data is copied to the receive
buffers and the app can complete the request by testing (polling) or
waiting (blocking).

MX also gives an app the ability to supply a function to handle
unexpected messages. Instead of pre-posting receives like RPCRDMA, I
allocate the ctxts and hang them on an idle queue (a doubly-linked
list). In the unexpected handler, I dequeue a ctxt and post the
matching receive. MX can then place the data in the proper buffer
without an additional copy.

I chose not to pre-post the receives for the client's request messages
since they could overwhelm the MX posted receive list. By using the
unexpected handler, only bulk IO are pre-posted (i.e. after the
request has come in).
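
For reference, a rough userspace sketch of that pattern; the rpcmx_*
names and the receive-posting call are placeholders rather than the
real kernel MX API, and the idle queue is shown singly linked for
brevity:

    #include <stddef.h>
    #include <pthread.h>

    /* Placeholder types and calls -- the real kernel MX API differs. */
    struct rpcmx_ctxt {
        struct rpcmx_ctxt *next;
        unsigned long      match_bits;   /* tag from the unexpected header */
        size_t             length;
    };

    static struct rpcmx_ctxt *idle_list;
    static pthread_mutex_t    idle_lock = PTHREAD_MUTEX_INITIALIZER;

    static struct rpcmx_ctxt *ctxt_get(void)
    {
        pthread_mutex_lock(&idle_lock);
        struct rpcmx_ctxt *c = idle_list;
        if (c)
            idle_list = c->next;
        pthread_mutex_unlock(&idle_lock);
        return c;
    }

    /* Stand-in for posting a receive that matches this sender/tag. */
    static void post_matching_recv(struct rpcmx_ctxt *c) { (void)c; }

    /* Called for each unexpected message (an RPC request).  Rather than
     * pre-posting one receive per possible request, dequeue an idle
     * ctxt now and post a receive that matches this message, so the
     * data can land in the right buffer without an extra copy. */
    int rpcmx_unexpected_handler(unsigned long match_bits, size_t length)
    {
        struct rpcmx_ctxt *c = ctxt_get();

        if (!c)
            return -1;               /* out of contexts: defer or drop */
        c->match_bits = match_bits;
        c->length     = length;
        post_matching_recv(c);
        return 0;
    }

    int main(void)
    {
        static struct rpcmx_ctxt pool[4];

        /* seed the idle queue, as the connection-setup path would */
        for (int i = 0; i < 4; i++) {
            pool[i].next = idle_list;
            idle_list    = &pool[i];
        }
        return rpcmx_unexpected_handler(0x1234, 128 * 1024) ? 1 : 0;
    }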

> Yes, and dedicating that much memory to clients is another. With the
> IB and iWARP protocols and the current Linux server, these buffers are
> not shared. This enhances integrity and protection, but it limits the
> maximum scaling. I take it this is not a concern for you?

I am not sure about what you mean by integrity and protection. A
buffer is only used by one request at a time.

> RPC is purely a request/response mechanism, with rules for discovering
> endpoints and formatting requests and replies. RPCRDMA adds framing
> for RDMA networks, and mechanisms for managing RDMA networks such as
> credits and rules on when to use RDMA. Finally, the NFS/RDMA transport
> binding makes requirements for sending messages. Since there are
> several NFS protocol versions, the answer to your question depends on
> that. There is no congestion control (slow start, message sizes) in
> the RPC protocol itself; however, many RPC implementations do provide
> it.

I am trying to duplicate all of the above from RPCRDMA. I am curious
why a client read of 256 pages with an rsize of 128 pages arrives in
three transfers of 32, 128, and then 96 pages. I assume the same
reason explains why client writes succeed only if the max pages is 32.

> I'm not certain if your question is purely about TCP, or if it's
> about RDMA with TCP as an example. However, in both cases the answer
> is the same: it's not about the size of a message, it's about the
> message itself. If the client and server have agreed that a 1MB write
> is ok, then yes the client may immediately send 1MB.
>
> Tom.

Hmmm, I will try to debug the svc_process code to find the oops.

I am on vacation next week. I will take a look once I get back.

Thanks!

Scott

2009-01-14 20:29:05

by Scott Atchley

Subject: Re: Congestion window or other reason?

Reviving an old thread....

Hi Tom Talpey and Tom Tucker, it was good to meet you at SC08. :-)

On Sep 30, 2008, at 8:34 AM, Talpey, Thomas wrote:

>> I believe that I am duplicating the RPCRDMA usage of credits. I need
>> to check.
>
> If you are passing credits, then be sure you're managing them
> correctly and that you have at least as many client RPC slots
> configured as the server can optimally handle. It's very important to
> throughput. The NFSv4.1 protocol will manage these explicitly, btw -
> the session "slot table" is basically the same interaction. The
> RPC/RDMA credits will basically be managed by them, so the whole
> stack will benefit.

A quick glance seems to indicate that I am using credits. The client
never sends more than 32 requests in my tests.

>>> RPCRDMA credits are primarily used for this; it's not so much the
>>> fact that there's a queue pair as the number of posted receives.
>>> If the client sends more than the server has available, then the
>>> connection will fail. However, the server can implement something
>>> called a "shared receive queue", which permits a sort of
>>> oversubscription.
>>
>> MX's behavior is more like the shared receive queue. Unexpected
>> messages <=32KB are stored in a temp buffer until the matching
>> receive has been posted. Once it is posted, the data is copied to
>> the receive buffers and the app can complete the request by testing
>> (polling) or waiting (blocking).
>
> Ouch. I guess that's convenient for the upper layer, but it costs
> quite a bit of NIC memory, or if host memory is used, makes latency
> and bus traffic quite indeterminate. I would strongly suggest fully
> provisioning each server endpoint, and using the protocol's credits
> to manage resources.

Host memory. In the kernel, we limit the unexpected queue to 2 MB.
Ideally, the only unexpected messages are RPC requests, and I have
already allocated 32 per client.

>> I chose not to pre-post the receives for the client's request
>> messages since they could overwhelm the MX posted receive list. By
>> using the unexpected handler, only bulk IO are pre-posted (i.e.
>> after the request has come in).
>
> The client never posts more than the max_inline_write size, which is
> fully configurable. By default, it's only 1KB, and there are normally
> just 32 credits. Bulk data is handled by RDMA, which can be scheduled
> at the server's convenience - this is a key design point of the
> RPC/RDMA protocol. Only 32KB per client is "overwhelm" territory?
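
A tiny sketch of that inline-versus-chunks decision (illustrative
names, not the xprtrdma code): payloads under the inline threshold
travel in the send buffer itself, while anything larger is only
described by a chunk list of registered pages so the server can move
the bulk data with RDMA at its own pace.

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative marshaling choice -- names are not the xprtrdma
     * ones.  Small payloads are copied inline into the RPC call
     * message; larger payloads are only described (registered and
     * advertised as chunks) so the server can move the bulk data with
     * RDMA when it is ready. */
    #define MAX_INLINE_WRITE 1024   /* the 1KB default mentioned above */

    enum marshal_kind { SEND_INLINE, SEND_CHUNKS };

    enum marshal_kind choose_marshaling(size_t payload_bytes)
    {
        if (payload_bytes <= MAX_INLINE_WRITE)
            return SEND_INLINE;   /* data rides in the send buffer */
        return SEND_CHUNKS;       /* advertise registered pages; the server
                                     schedules the RDMA at its convenience */
    }

    int main(void)
    {
        printf("512 B write: %s\n",
               choose_marshaling(512) == SEND_INLINE ? "inline" : "chunks");
        printf("1 MB write:  %s\n",
               choose_marshaling(1 << 20) == SEND_INLINE ? "inline" : "chunks");
        return 0;
    }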

I upped my inline size to 3072 bytes (each context gets a full page,
but I can't use all of it since the header needs to go in there).

32 KB is not overwhelm territory. Posting 32 identical, small recvs
for RPC request messages per client (e.g. 1000 clients) would mean
that, to match a single large IO, MX would have to walk a linked list
of potentially 32,000 small messages before finding the correct large
message. Using the unexpected handler to manage RPC requests in an
active-message manner keeps the posted-receive list populated only
with large IO messages.

I could instead have RPC requests and IO messages on separate
completion queues which would do the same thing. I use the former out
of habit.

>>> Yes, and dedicating that much memory to clients is another. With
>>> the IB and iWARP protocols and the current Linux server, these
>>> buffers are not shared. This enhances integrity and protection, but
>>> it limits the maximum scaling. I take it this is not a concern for
>>> you?
>>
>> I am not sure about what you mean by integrity and protection. A
>> buffer is only used by one request at a time.
>
> Correct - and that's precisely the goal. The issue is whether there
> are data paths which can expose the buffer(s) outside of the scope of
> a single request, for example to allow a buggy server to corrupt
> messages which are being processed at the client, or to allow attacks
> on clients or servers from foreign hosts. Formerly, with IB and iWARP
> we had to choose between performance and protection. With the new
> iWARP "FRMR" facility, we (finally) have a scheme that protects well,
> without costing a large per-io penalty.

Hmmm. When using MX over Myrinet, such an attack is not feasible. When
using MX over Ethernet, it is still probably not feasible since MX
traffic is not viewable within the kernel (via wireshark, etc.). Could
someone use a non-Myricom NIC to craft a bogus Myrinet-over-Ethernet
frame? It is theoretically possible.

>>> RPC is purely a request/response mechanism, with rules for
>>> discovering endpoints and formatting requests and replies. RPCRDMA
>>> adds framing for RDMA networks, and mechanisms for managing RDMA
>>> networks such as credits and rules on when to use RDMA. Finally,
>>> the NFS/RDMA transport binding makes requirements for sending
>>> messages. Since there are several NFS protocol versions, the answer
>>> to your question depends on that. There is no congestion control
>>> (slow start, message sizes) in the RPC protocol itself; however,
>>> many RPC implementations do provide it.
>>
>> I am trying to duplicate all of the above from RPCRDMA. I am curious
>> why a client read of 256 pages with an rsize of 128 pages arrives in
>> three transfers of 32, 128, and then 96 pages. I assume the same
>> reason explains why client writes succeed only if the max pages is 32.
>
> Usually, this is because the server's filesystem delivered the
> results in these chunks. For example, yours may have had a 128-page
> extent size, which the client was reading on a 96-page offset.
> Therefore the first read yielded the last 32 pages of the first
> extent, followed by a full 128 and a 96 to finish up. Or perhaps, it
> was simply convenient for it to return them in such a way.
>
> You can maybe recode the server to perform full-sized IO, but I don't
> recommend it. You'll be performing synchronous filesystem ops in
> order to avoid a few network transfers. That is, in all likelihood, a
> very bad trade. But I don't know your server.

I was just curious. Thanks for the clear explanation. We will behave
like the others and service what NFS hands us.

>>> I'm not certain if your question is purely about TCP, or if it's
>>> about RDMA with TCP as an example. However, in both cases the
>>> answer is the same: it's not about the size of a message, it's
>>> about the message itself. If the client and server have agreed that
>>> a 1MB write is ok, then yes the client may immediately send 1MB.
>>>
>>> Tom.
>>
>> Hmmm, I will try to debug the svc_process code to find the oops.

I found several bugs and I think I have fixed them. I seem to have it
working correctly with 32 KB messages (reading, writing, [un]mounting,
etc.). On a few reads or writes out of 1,000, I will get an NFS stale
handle error. I need to track this down.

Also, when using more than 8 pages (32 KB), reads and writes complete
but the data is corrupted. This is clearly a bug in my code and I am
looking into it.

Scott