2013-08-15 00:14:32

by Wendy Cheng

Subject: Helps to Decode rpc_debug Output

The I/O on top of an NFS-mounted directory was hanging, so I forced a
(client-side) "rpc_debug" dump via the proc entry.

<6>[ 4311.590317] 1676 0001 -11 ffff8801e0bde400 ffff8801d18b1248
0 ffffffff81420d40 nfsv3 WRITE a:call_status q:xprt_resend
... (similar lines) ...
<6>[ 4311.590435] 1682 0001 -11 ffff8801e0bde400 ffff8801d18b0e10
0 ffffffff81420d40 nfsv3 WRITE a:call_connect_status
q:xprt_sending
... (similar lines) ...

Could someone give me an educational statement on what the above two
lines mean? More specifically, what does "call_connect_status" do, and
what does "xprt_sending" mean? Is there any way to force a re-connect
(i.e. hack the code to invoke "connect" from rpc_xprt_ops)?
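
For anyone else staring at these dumps: the columns appear to come from
rpc_show_tasks() in net/sunrpc/clnt.c. A hypothetical decoder - the field
names below are my reading of that function for this kernel era, not
authoritative:

```python
# Assumed column order from rpc_show_tasks() (2.6.x era): task pid, flags,
# tk_status, rpc_clnt pointer, rpc_rqst pointer, timeout, rpc_call_ops
# pointer, program+version, procedure, current action (a:) and the wait
# queue the task sleeps on (q:).
FIELDS = ["pid", "flags", "status", "clnt", "rqstp",
          "timeout", "ops", "prog", "proc", "action", "queue"]

def decode_rpc_task(line):
    parts = line.split()
    task = dict(zip(FIELDS, parts))
    task["status"] = int(task["status"])  # negative errno; -11 == -EAGAIN
    task["action"] = task["action"].removeprefix("a:")
    task["queue"] = task["queue"].removeprefix("q:")
    return task

task = decode_rpc_task("1676 0001 -11 ffff8801e0bde400 ffff8801d18b1248 "
                       "0 ffffffff81420d40 nfsv3 WRITE a:call_status q:xprt_resend")
# Reading: this WRITE got -EAGAIN and is parked on the xprt_resend wait
# queue until the transport becomes usable again.
```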

Longer version of the question:
I'm trying to enable NFS-RDMA on an embedded system (based on a 2.6.38
kernel) as a client. The IB stack is taken from OFED 1.5.4. The NFS
server is a RHEL 6.3 Xeon box. The connection uses the Mellanox mlx4
driver. Memory registration is "RPCRDMA_ALLPHYSICAL". There are many
issues so far, but I did manage to get the NFS mount working. Simple
file operations (such as "ls", file read/write, "scp", etc.) seem to
work as well. While trying to run iozone to see whether the performance
gain can justify the development effort, the program runs until it
reaches a 2MB file size - at that point, RDMA CM sends out a
"TIMEWAIT_EXIT" event, the xprt is disconnected, and all I/O on that
share hangs. IPoIB still works, though. Not sure what would be the best
way to debug this.

Thanks,
Wendy


2013-08-26 13:22:52

by Tom Talpey

Subject: Re: Helps to Decode rpc_debug Output

On 8/21/2013 11:55 AM, Wendy Cheng wrote:
> On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng <[email protected]> wrote:
>> On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey <[email protected]> wrote:
>>> On 8/14/2013 8:14 PM, Wendy Cheng wrote:
>>>>
>>>> Longer version of the question:
>>>> I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
>>>> kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
>>>> server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
>>>> Memory registration is "RPCRDMA_ALLPHYSICAL". There are many issues so
>>>> far but I do manage to get nfs mount working. Simple file operations
>>>> (such as "ls", file read/write, "scp", etc) seem to work as well.
>>>
>
> Yay ... got this up .. amazingly on a uOS that does not have much of
> the conventional kernel debug facilities.

Congrats!

> One thing I'm still scratching my head is that ... by looking at the
> raw IOPS, I don't see dramatic difference between NFS-RDMA vs. NFS
> over IPOIB (TCP).

Sounds like your bottleneck lies in some other component. What's the
storage, for example? RDMA won't do a thing to improve a slow disk.
Or, what kind of IOPS rate are you seeing? If these systems aren't
generating enough load to push a CPU limit, then shifting the protocol
on the same link might not yield much.

> However, the total run time differs greatly. NFS
> over RDMA seems to take a much longer time to finish (vs. NFS over
> IPOIB). Not sure why is that .... Maybe by the constant
> connect/disconnect triggered by reestablish_timeout ? The connection
> re-establish is known to be expensive on this uOS.

Um, yes, of course. Fix that before drawing any conclusions.

Tom.

2013-08-21 15:55:50

by Wendy Cheng

Subject: Re: Helps to Decode rpc_debug Output

On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng <[email protected]> wrote:
> On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey <[email protected]> wrote:
>> On 8/14/2013 8:14 PM, Wendy Cheng wrote:
>>>
>>> Longer version of the question:
>>> I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
>>> kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
>>> server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
>>> Memory registration is "RPCRDMA_ALLPHYSICAL". There are many issues so
>>> far but I do manage to get nfs mount working. Simple file operations
>>> (such as "ls", file read/write, "scp", etc) seem to work as well.
>>

Yay ... got this up ... amazingly on a uOS that does not have much of
the conventional kernel debug facilities.

The hang was caused by auto-disconnect, triggered by xprt->timer. The
task is carried out by xprt_init_autodisconnect(), which silently
disconnects the xprt without any sensible warning. The uOS runs on
small-core (slower) hardware. Instead of a hard-coded number, this
timeout value needs to be at least a "proc" tunable. I will check newer
kernels to see whether it's been improved, and/or draft a patch later.
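
For concreteness, the hard number in question is the transport's
idle_timeout - for xprtrdma in this kernel it appears to be a fixed five
minutes. A small sketch of the decision a "proc" tunable would replace;
the HZ value here is illustrative, not taken from the source:

```python
# Sketch of the autodisconnect arithmetic (assumed from net/sunrpc/xprt.c):
# xprt->timer fires once jiffies pass last_used + idle_timeout, and
# idle_timeout is a compile-time constant (5*60*HZ) for xprtrdma.
HZ = 1000  # jiffies per second; illustrative config value

def idle_deadline(last_used, idle_timeout_secs=5 * 60):
    """Jiffy at which xprt_init_autodisconnect() would fire."""
    return last_used + idle_timeout_secs * HZ

# With a proc tunable, idle_timeout_secs would come from a sysctl instead
# of the compile-time constant, so slow hardware could raise it.
```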

One thing I'm still scratching my head over: looking at the raw IOPS, I
don't see a dramatic difference between NFS-RDMA and NFS over IPoIB
(TCP). However, the total run time differs greatly. NFS over RDMA seems
to take a much longer time to finish (vs. NFS over IPoIB). Not sure why
that is ... maybe the constant connect/disconnect triggered by
reestablish_timeout? Connection re-establishment is known to be
expensive on this uOS. Why do we need two sets of timeouts, where
1. xprt->timer disconnects (without reconnecting), and
2. reestablish_timeout constantly disconnects/re-connects?

-- Wendy

2013-08-26 17:08:20

by Wendy Cheng

Subject: Re: Helps to Decode rpc_debug Output

On Mon, Aug 26, 2013 at 6:22 AM, Tom Talpey <[email protected]> wrote:
> On 8/21/2013 11:55 AM, Wendy Cheng wrote:
>>
>> On Thu, Aug 15, 2013 at 11:08 AM, Wendy Cheng <[email protected]>
>> wrote:
>>>
>>> On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey <[email protected]> wrote:
>>>>
>>>> On 8/14/2013 8:14 PM, Wendy Cheng wrote:
>>>>>
>>>>>
>>>>> Longer version of the question:
>>>>> I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
>>>>> kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
>>>>> server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
>>>>> Memory registration is "RPCRDMA_ALLPHYSICAL". There are many issues so
>>>>> far but I do manage to get nfs mount working. Simple file operations
>>>>> (such as "ls", file read/write, "scp", etc) seem to work as well.
>>>>

>> One thing I'm still scratching my head is that ... by looking at the
>> raw IOPS, I don't see dramatic difference between NFS-RDMA vs. NFS
>> over IPOIB (TCP).
>
>
> Sounds like your bottleneck lies in some other component. What's the
> storage, for example? RDMA won't do a thing to improve a slow disk.
> Or, what kind of IOPS rate are you seeing? If these systems aren't
> generating enough load to push a CPU limit, then shifting the protocol
> on the same link might not yield much.

There is no kernel profiling tool on this uOS (yet), so it is hard to
identify the bottleneck. On the surface, the slowdown seems to come from
SUNRPC's Van Jacobson congestion control (xprt_reserve_xprt_cong()):
either it creates a race condition that causes the transmissions
(write/commit) to miss their wake-up(s), or the algorithm itself is not
the right choice for this client system, which consists of many (244 on
my system) slower cores (CPUs).
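
To spell out the suspicion: a rough sketch of that congestion window, as
I read xprt_adjust_cwnd()/__xprt_get_cong() in net/sunrpc/xprt.c (the
constants and guard details are my reading, not authoritative):

```python
# SUNRPC's Van Jacobson-style window: additive increase on a successful
# reply, halving on timeout. Slots are fixed-point, RPC_CWNDSCALE == one
# outstanding request.
RPC_CWNDSHIFT = 8
RPC_CWNDSCALE = 1 << RPC_CWNDSHIFT   # one request "slot"
RPC_INITCWND = RPC_CWNDSCALE

def adjust_cwnd(cwnd, result, max_reqs):
    max_cwnd = max_reqs << RPC_CWNDSHIFT
    if result >= 0:
        # additive increase: roughly one slot per full window of successes
        cwnd += (RPC_CWNDSCALE * RPC_CWNDSCALE + (cwnd >> 1)) // cwnd
        cwnd = min(cwnd, max_cwnd)
    elif result == -110:  # -ETIMEDOUT
        cwnd >>= 1        # multiplicative decrease
        cwnd = max(cwnd, RPC_CWNDSCALE)
    return cwnd

def may_transmit(cong, cwnd):
    # __xprt_get_cong(): a task only gets the transport if there is room
    return cong + RPC_CWNDSCALE <= cwnd

# Every task that fails may_transmit() sleeps and must be woken later;
# with 244 slow cores contending, one missed wake-up stalls the whole
# run even though per-op latency (raw IOPS) looks fine.
```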

Solid state drives are used on the RHEL server.

-- Wendy

2013-08-15 18:08:46

by Wendy Cheng

Subject: Re: Helps to Decode rpc_debug Output

On Thu, Aug 15, 2013 at 5:46 AM, Tom Talpey <[email protected]> wrote:
> On 8/14/2013 8:14 PM, Wendy Cheng wrote:
>>
>> Longer version of the question:
>> I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
>> kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
>> server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
>> Memory registration is "RPCRDMA_ALLPHYSICAL". There are many issues so
>> far but I do manage to get nfs mount working. Simple file operations
>> (such as "ls", file read/write, "scp", etc) seem to work as well.
>
[snip]

> Why did you replace the Linux IB stack with OFED? Did you also take the
> NFS/RDMA from that package, and if so are you sure that it all is
> is working properly? Doesn't 2.6.38 already have all this?
>

The other part of the cluster runs OFED 1.5.4 on top of RHEL 6.3 - it
was a product decision. Ditto for the 2.6.38-based uOS.

OFED 1.5.4's NFS/RDMA (i.e. xprtrdma) did not run out of the box on
either platform. It took a while to understand the setup. I believe the
issues on the RHEL boxes have been fixed - at least iozone runs through
(with RHEL as both client and server) without trouble. Now the issue is
with this 2.6.38 uOS (as client) talking to RHEL 6.3 (as server). I
don't know much about NFS v4, so the focus is on v3.

-- Wendy

2013-08-15 12:53:18

by Tom Talpey

Subject: Re: Helps to Decode rpc_debug Output

On 8/14/2013 8:14 PM, Wendy Cheng wrote:
> Longer version of the question:
> I'm trying to enable NFS-RDMA on an embedded system (based on 2.6.38
> kernel) as a client. The IB stacks are taken from OFED 1.5.4. NFS
> server is a RHEL 6.3 Xeon box. The connection uses mellox-4 driver.
> Memory registration is "RPCRDMA_ALLPHYSICAL". There are many issues so
> far but I do manage to get nfs mount working. Simple file operations
> (such as "ls", file read/write, "scp", etc) seem to work as well.

You're probably seeing connection loss from bad RDMA handles, which
come into play when you send large read/write traffic. The fact that
small I/O (such as a simple "ls") and non-NFS traffic (such as scp)
work means the network itself is OK.

> While trying to run iozone to see whether the performance gain can be
> justified for the development efforts, the program runs until it
> reaches 2MB file size - at that point, RDMA CM sends out
> "TIMEWAIT_EXIT" event, the xprt is disconnected, and all IOs on that
> share hang. IPOIB still works though. Not sure what would be the best
> way to debug this.

I would suggest enabling RPC "transport" debugging, and any tracing
in the IB stack itself, looking to see if you can find any patterns.
You may want to look at the server side, too.

Unfortunately, because you are using Infiniband, packet capture is
going to be next to impossible. You might try using an iWARP adapter,
or one which can be sniffed, if you suspect a traffic issue.

Why did you replace the Linux IB stack with OFED? Did you also take the
NFS/RDMA from that package, and if so, are you sure that it all is
working properly? Doesn't 2.6.38 already have all this?

Tom.