Date: Thu, 21 May 2015 11:09:51 +0100
From: Benjamin ESTRABAUD
To: "J. Bruce Fields"
CC: "linux-nfs@vger.kernel.org", "bc@mpstor.com", Christoph Hellwig
Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 20/05/15 20:40, J. Bruce Fields wrote:
> On Wed, May 20, 2015 at 05:27:26PM +0100, Benjamin ESTRABAUD wrote:
>> On 15/05/15 20:20, J. Bruce Fields wrote:
>>> On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
>>>> I've recently started using pNFS, and I am very pleased with its
>>>> overall stability and performance.
>>>>
>>>> A pNFS MDS server was set up with SAN storage in the backend (a RAID0
>>>> built on top of multiple LUNs). Clients were given access to the same
>>>> RAID0 using the same LUNs on the same SAN.
>>>>
>>>> However, I've been noticing a small issue with it that prevents me
>>>> from using pNFS to its full potential: if I run non-direct IOs (for
>>>> instance "dd" without the "oflag=direct" option), IOs run excessively
>>>> slowly (3-4MB/sec) and the dd process hangs until forcefully
>>>> terminated.
>>>
>> Sorry for the late reply, I was unavailable for the past few days. I
>> had time to look at the problem further.
>>
>>> And that's reproducible every time?
>>>
>> Hi Bruce,
>
> Thanks for the detailed report. Quick questions:
>
>> It is, and here is what is happening in more detail:
>>
>> On the client, "/mnt/pnfs1" is the pNFS mount point. We use NFS v4.1.
>>
>> * Running dd with bs=512 and no "direct" set on the client:
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000
>>
>> => Here we get variable performance; dd's average is 100MB/sec, and
>> we can see all the IOs going to the SAN block device. nfsstat
>> confirms that no IOs are going through the NFS server (no "writes"
>> are recorded, only "layoutcommit"). Performance is admittedly low,
>> but at this block size we don't really care.
>>
>> * Running dd with bs=512 and "direct" set:
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000 oflag=direct
>>
>> => Here, funnily enough, all the IOs are sent over NFS. The
>> "nfsstat" command shows writes increasing, while the SAN block
>> device activity on the client is idle. The performance is about
>> 13MB/sec, but again that is expected with such a small IO size. The
>> only unexpected thing is that these small 512-byte IOs are not going
>> through the iSCSI SAN.
>>
>> * Running dd with bs=1M and no "direct" set on the client:
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000
>>
>> => Here the IOs "work" and go through the SAN (no "write" counter
>> increasing in "nfsstat", and I can see disk statistics on the block
>> device on the client increasing).
>> However the speed at which the IOs go through is really slow (the
>> actual speed recorded on the SAN device fluctuates a lot, from
>> 3MB/sec to a lot more). Overall dd is not really happy, and
>> "Ctrl-C"ing it takes a long time; the last try actually caused a
>> kernel panic (see http://imgur.com/YpXjvQ3 - sorry about the picture
>> format, I did not have dmesg output capture set up and only had
>> access to the VGA console). When "dd" finally comes around and
>> terminates, the average speed is 200MB/sec.
>> Again the SAN block device shows IOs being submitted and "nfsstat"
>> shows no "writes" but a few "layoutcommits", showing that the writes
>> are not going through the "regular" NFS server.
>>
>> * Running dd with bs=1M and no "direct" set on the client:
>
> I think you meant to leave out the "no" there?
>
Exactly, that's what I meant, sorry, I was confused.

>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000 oflag=direct
>>
>> => Here the IOs work much faster (almost twice as fast as with
>> "direct" set, or 350+MB/sec) and dd is much more responsive (I can
>> "Ctrl-C" it almost instantly). Again the SAN block device shows IOs
>> being submitted and "nfsstat" shows no "writes" but a few
>> "layoutcommits", showing that the writes are not going through the
>> "regular" NFS server.
>>
>> This shows that somehow running with "oflag=direct" causes
>> instability and lower performance, at least on this version.
>
> And I think you mean "running without", not "running with"?
>
> Assuming those are just typos, unless I'm missing something.
>
Also right, I meant that without oflag=direct I get lower performance.
Well, actually, as my later mail shows, it does so only for a specific
file size. I'm going to be running more tests to narrow it down; I've
sketched the rough test sequence below.

In the meantime I tried looking into network traces, but couldn't
capture nice traces as Wireshark was losing input. I'm running
Wireshark remotely, with the tcpdump input coming over a slow SSH
session, so I'll try to capture a few seconds' worth of traffic on the
client instead, scp the file back to me and open that locally (see the
P.S. at the bottom of this mail).

Ben.

> --b.
>
>> Both clients are running Linux 4.1.0-rc2 on CentOS 7.0 and the
>> server is running Linux 4.1.0-rc2 on CentOS 7.1.
>>
>>> Can you get network captures and figure out (for example) whether
>>> the slow writes are going over iSCSI or NFS, and if they're
>>> returning errors in either case?
>>>
>> I'm going to do that now (try and locate errors). However, "nfsstat"
>> does indicate that the slower writes are going through iSCSI.
>>
>>>> The same behaviour can be observed laying out an IO file with FIO
>>>> for instance, or using some applications which do not use the
>>>> O_DIRECT flag. When using direct IO I can observe lots of iSCSI
>>>> traffic at extremely good performance (the same performance as the
>>>> SAN gets on "raw" block devices).
>>>>
>>>> All the systems are running CentOS 7.0 with a custom 4.1-rc2 kernel
>>>> (pNFS enabled), apart from the storage nodes, which are running a
>>>> custom minimal Linux distro with kernel 3.18.
>>>>
>>>> The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
>>>> driver anywhere (everything is "standard" upstream Linux).
>>>
>>> What's the non-SAN network (that the NFS traffic goes over)?
>>>
>> The NFS traffic also goes through the same SAN actually; both the
>> iSCSI LUNs and the NFS server are accessible over the same 40G/sec
>> Ethernet fabric.
>>
>> Regards,
>> Ben.
>>
>>> --b.
>>>
>>>> Would anybody have any ideas where this issue could be coming from?
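
To narrow down the file-size dependency, the rough test sequence I have
in mind is something like this (just a sketch of the plan; "sdX" stands
for whatever the SAN block device is on the client, and the 4 GiB size
is only an example):

  # snapshot the NFS client op counts and the block-device counters
  nfsstat -c > /tmp/nfsstat.before
  cat /sys/block/sdX/stat > /tmp/blkstat.before

  # buffered 1M writes; vary count to vary the total file size
  dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=4096 conv=fsync

  # snapshot again: a growing "write" count in nfsstat means the data
  # went over NFS, a growing sectors-written field in the block stats
  # means it went out over iSCSI
  nfsstat -c > /tmp/nfsstat.after
  cat /sys/block/sdX/stat > /tmp/blkstat.after

  # then repeat the same size with oflag=direct for comparison
  dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=4096 oflag=direct

(conv=fsync is only there so the buffered run waits for the data to be
flushed before dd reports its timing, which keeps the two runs
comparable.)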
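
The same comparison can also be driven with fio, which makes sweeping
file sizes easier; roughly along these lines (job names and sizes are
again just examples):

  # buffered sequential writes through the page cache
  fio --name=buffered --filename=/mnt/pnfs1/fio-test --rw=write \
      --bs=1M --size=4g --ioengine=psync --direct=0 --end_fsync=1

  # identical job with O_DIRECT
  fio --name=direct --filename=/mnt/pnfs1/fio-test --rw=write \
      --bs=1M --size=4g --ioengine=psync --direct=1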
>>>> Regards,
>>>> Ben - MPSTOR.
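
P.S. For the network capture, rather than streaming tcpdump output over
the slow SSH session into Wireshark, the plan is roughly the following
("eth2" is just a placeholder for the 40G interface on the client that
carries both the NFS and the iSCSI traffic, and "client" for its
hostname):

  # on the client: grab ~15 seconds of traffic while the buffered dd runs
  timeout 15 tcpdump -i eth2 -s 0 -w /tmp/pnfs-buffered.pcap

  # then from my workstation: copy the capture back and open it locally
  scp root@client:/tmp/pnfs-buffered.pcap .

That should make it possible to see whether the slow buffered writes are
going out as iSCSI or as NFS WRITEs, and whether either side is
returning errors, as Bruce suggested.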