2015-05-15 17:49:43

by Benjamin ESTRABAUD

Subject: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

Hi!

I've been using pNFS for a little while now, and I am very pleased with its overall stability and performance.

A pNFS MDS server was set up with SAN storage on the backend (a RAID0 built on top of multiple LUNs). Clients were given access to the same RAID0 using the same LUNs on the same SAN.

However, I've been noticing a small issue that prevents me from using pNFS to its full potential: if I run non-direct IOs (for instance "dd" without the "oflag=direct" option), IOs run excessively slowly (3-4MB/sec) and the dd process hangs until forcefully terminated. The same behaviour can be observed when laying out an IO file with FIO, for instance, or when using applications which do not set the O_DIRECT flag. When using direct IO I can observe lots of iSCSI traffic, at extremely good performance (the same performance the SAN gets on "raw" block devices).
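For reference, the two cases boil down to something like this (the file
path and sizes here are just an example):

# buffered writes: excessively slow and dd ends up hanging
dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=10000
# direct IO: fast, and the traffic goes over iSCSI as expected
dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=10000 oflag=direct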

All the systems are running CentOS 7.0 with a custom 4.1-rc2 kernel (pNFS enabled), apart from the storage nodes, which are running a custom minimal Linux distro with kernel 3.18.

The SAN is all 40G Mellanox Ethernet, and we are not using the OFED driver anywhere (everything is "standard" upstream Linux).

Would anybody have any ideas where this issue could be coming from?

Regards,
Ben - MPSTOR.


2015-05-15 19:20:39

by J. Bruce Fields

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
> I've been using pNFS for a while since recently, and I am very pleased
> with its overall stability and performance.
>
> A pNFS MDS server was setup with SAN storage in the backend (a RAID0
> built ontop of multiple LUNs). Clients were given access to the same
> RAID0 using the same LUNs on the same SAN.
>
> However, I've been noticing a small issue with it that prevents me
> from using pNFS to its full potential: If I run non-direct IOs (for
> instance "dd" without the "oflag=direct" option), IOs run excessively
> slowly (3-4MB/sec) and the dd process hangs until forcefully
> terminated.

And that's reproducible every time?

Can you get network captures and figure out (for example), whether the
slow writes are going over iSCSI or NFS, and if they're returning errors
in either case?
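For example, something along these lines on the client should make it
obvious which transport the slow writes are actually using (the
interface name is a placeholder):

tcpdump -i ethX -n 'tcp port 2049'   # NFS
tcpdump -i ethX -n 'tcp port 3260'   # iSCSI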

> The same behaviour can be observed laying out an IO file
> with FIO for instance, or using some applications which do not use the
> ODIRECT flag. When using direct IO I can observe lots of iSCSI
> traffic, at extremely good performance (same performance as the SAN
> gets on "raw" block devices).
>
> All the systems are running CentOS 7.0 with a custom kernel 4.1-rc2
> (pNFS enabled) apart from the storage nodes which are running a custom
> minimal Linux distro with Kernel 3.18.
>
> The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
> driver anywhere (Everything is only "standard" upstream Linux).

What's the non-SAN network (that the NFS traffic goes over)?

--b.

>
> Would anybody have any ideas where this issue could be coming from?
>
> Regards, Ben - MPSTOR.

2015-05-17 16:39:03

by Christoph Hellwig

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

Hi Benjamin,

do you also see the issue with a Linux 4.0 client, or with the latest
Linux tree (including commit 869a249123ac117b9995dc9e534644084b8c6321)?


2015-05-20 16:27:29

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 15/05/15 20:20, J. Bruce Fields wrote:
> On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
>> I've been using pNFS for a while since recently, and I am very pleased
>> with its overall stability and performance.
>>
>> A pNFS MDS server was setup with SAN storage in the backend (a RAID0
>> built ontop of multiple LUNs). Clients were given access to the same
>> RAID0 using the same LUNs on the same SAN.
>>
>> However, I've been noticing a small issue with it that prevents me
>> from using pNFS to its full potential: If I run non-direct IOs (for
>> instance "dd" without the "oflag=direct" option), IOs run excessively
>> slowly (3-4MB/sec) and the dd process hangs until forcefully
>> terminated.
>
Sorry for the late reply, I was unavailable for the past few days. I had
time to look at the problem further.

> And that's reproduceable every time?
>
It is, and here is what is happening in more detail:

On the client, "/mnt/pnfs1" is the pNFS mount point. We use NFS v4.1.
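For reference, the export is mounted roughly like this (the server name
and export path below are placeholders):

mount -t nfs -o vers=4.1 mds1:/export /mnt/pnfs1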

* Running dd with bs=512 and no "direct" set on the client:

dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000

=> Here we get variable performance; dd's average is 100MB/sec, and we
can see all the IOs going to the SAN block device. nfsstat confirms that
no IOs are going through the NFS server (no "writes" are recorded, only
"layoutcommit"s). Performance is maybe low, but at this block size we
don't really care.
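(We simply watch the server-side op counters while the test runs, along
the lines of "watch -n 1 'nfsstat -s -4'", and look at the "writes" and
"layoutcommit" counters.)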

* Running dd with bs=512 and "direct" set:

dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000 oflag=direct

=> Here, funnily enough, all the IOs are sent over NFS. The "nfsstat"
command shows writes increasing, and the SAN block device on the client
is idle. The performance is about 13MB/sec, but again that is expected
with such a small IO size. The only unexpected part is that these small
512-byte IOs are not going through the iSCSI SAN.

* Running dd with bs=1M and no "direct" set on the client:

dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000

=> Here the IOs "work" and go through the SAN (no "write" counter
increases in "nfsstat", and I can see the disk statistics on the client's
block device increasing). However, the speed at which the IOs go through
is really slow (the actual speed recorded on the SAN device fluctuates a
lot, from 3MB/sec to a lot more). Overall dd is not really happy:
"Ctrl-C"ing it takes a long time, and on the last try it actually caused
a kernel panic (see http://imgur.com/YpXjvQ3 - sorry about the picture
format, I did not have dmesg output being captured and only had access
to the VGA console).
When "dd" finally comes around and terminates, the average speed is
200MB/sec.
Again, the SAN block device shows IOs being submitted and "nfsstat" shows
no "writes" but a few "layoutcommits", showing that the writes are not
going through the "regular" NFS server.


* Running dd with bs=1M and no "direct" set on the client:

dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000 oflag=direct

=> Here the IOs work much faster (almost twice as fast as with "direct"
set, or 350+MB/sec) and dd is much more responsive (can "Ctrl-C" it
almost instantly). Again the SAN block device shows IOs being submitted
and "nfsstat" shows no "writes" but a few "layoutcommits", showing that
the writes are not going through the "regular" NFS server.

This shows that somehow running with "oflag=direct" causes instability
and lower performance, at least on this version.

Both clients are running Linux 4.1.0-rc2 on CentOS 7.0 and the server is
running Linux 4.1.0-rc2 on CentOS 7.1.

> Can you get network captures and figure out (for example), whether the
> slow writes are going over iSCSI or NFS, and if they're returning errors
> in either case?
>
I'm going to do that now (try and locate errors). However "nfsstat" does
indicate that slower writes are going through iSCSI.

>> The same behaviour can be observed laying out an IO file
>> with FIO for instance, or using some applications which do not use the
>> ODIRECT flag. When using direct IO I can observe lots of iSCSI
>> traffic, at extremely good performance (same performance as the SAN
>> gets on "raw" block devices).
>>
>> All the systems are running CentOS 7.0 with a custom kernel 4.1-rc2
>> (pNFS enabled) apart from the storage nodes which are running a custom
>> minimal Linux distro with Kernel 3.18.
>>
>> The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
>> driver anywhere (Everything is only "standard" upstream Linux).
>
> What's the non-SAN network (that the NFS traffic goes over)?
>
The NFS traffic actually goes over the same SAN: both the iSCSI LUNs and
the NFS server are reachable over the same 40Gb/sec Ethernet fabric.

Regards,
Ben.

> --b.
>
>>
>> Would anybody have any ideas where this issue could be coming from?
>>
>> Regards, Ben - MPSTOR.
>


2015-05-20 16:30:20

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 17/05/15 17:38, Christoph Hellwig wrote:
> Hi Benjamin,
>
> do you also see the issue with a Linux 4.0 client, or with latest
> Linux tree (including commit 869a249123ac117b9995dc9e534644084b8c6321)?
>
>
Hi Christoph,

I'm going to try this now (move the client, and maybe even the server,
to Linux 4.0, now that 4.0 has a more "current" stable release).

By the way, what is the minimum Linux kernel version required to connect
to an NFS v4.1 server using pNFS? We only managed to get this working
with kernel 4.0 on the client, and it appears that this is the lowest
kernel release that supports pNFS block clients (working against a
recent Linux pNFS block server). Could you please confirm this?

Regards,
Ben.

2015-05-20 18:31:20

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 20/05/15 17:27, Benjamin ESTRABAUD wrote:
> On 15/05/15 20:20, J. Bruce Fields wrote:
>> On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
>>> I've been using pNFS for a while since recently, and I am very pleased
>>> with its overall stability and performance.
>>>
>>> A pNFS MDS server was setup with SAN storage in the backend (a RAID0
>>> built ontop of multiple LUNs). Clients were given access to the same
>>> RAID0 using the same LUNs on the same SAN.
>>>
>>> However, I've been noticing a small issue with it that prevents me
>>> from using pNFS to its full potential: If I run non-direct IOs (for
>>> instance "dd" without the "oflag=direct" option), IOs run excessively
>>> slowly (3-4MB/sec) and the dd process hangs until forcefully
>>> terminated.
>>
Here is some additional information:

It turns out that everything works as expected until I write a file of a
specific "sweet spot" size. I wrote a small bash one-liner that writes
files one by one, starting with a 1GiB file and going up to a 1TiB one,
incrementing the file size by 1GiB on each iteration:

for i in {1..1000}; do echo $i; dd if=/dev/zero of=/mnt/pnfs1/testfile."$i"G bs=1M count="$(($i * 1024))"; done

Note that in the above test we are not running "direct IOs", but use
"dd"'s default mode, buffered.

The test runs without a hitch for a good while (yielding between
900MiB/sec and 1.3GiB/sec). I can see the buffering happening: after each
iteration starts, no IOs are detected on the iSCSI SAN LUN for a short
period of time, and then a burst of IOs is detected (about 2-3GiB/sec,
which the backend storage can actually handle).
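(The SAN LUN activity here is just the block-device statistics on the
client, watched with something like "iostat -xm 1 sdX", where sdX is the
iSCSI device backing the layout.)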

"nfsstat" also confirms that no NFS writes are happening, "layoutcommit"
operations are recorded when a new file is written instead.

After 25 iterations (that is, after creating the 25GiB file, for a
cumulative total of 325GiB including testfile.1G -> testfile.24G) the
issue occurred again. The IO rate to the SAN LUN dropped severely, to a
real 3MiB/sec (measured at the SAN LUN block device level).

I've also noticed that a kernel thread is taking up at least 100% of one
core:

  516 root      20   0       0      0      0 R 100.0  0.0  11:09.72 kworker/u49:4

I then canceled the test, removed the partial 26G file that seemed to
have caused the issue, and re-generated the same 26G file using dd.
After a few seconds, a kernel worker (this time kworker/u50:3) came up at
100% CPU (it was barely using any CPU before, I could not really see it
in top).
I then deleted the 25G file and rewrote it, and the same workqueue issue
occurred (100% CPU).
I then deleted a much smaller file (5GiB) and rewrote it without any
issue.
I then tried a 20G file, also without problem.
I overwrote the 24G file, also without problem.
I went back to a 25G file and the issue happened again.

Somehow the issue only happens when hitting a sweet spot, triggered by
writing a file around 25G or larger.

Both SAN iSCSI targets (LIO-based) are pretty idle (apart from the odd
iscsi_tx thread that runs from time to time) and don't report anything
suspicious in dmesg.

Does the 25GiB figure ring any bells? Is there a way for me to identify
this workqueue (i.e. figure out whether it is pNFS-related)?
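(One thing I was going to try is dumping the kernel stack of the
spinning kworker to see where it is looping, e.g. "cat /proc/516/stack",
with 516 being the PID reported by top - better suggestions are
welcome.)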

Thanks a lot in advance for your help!

Regards,
Ben.

> Sorry for the late reply, I was unavailable for the past few days. I had
> time to look at the problem further.
>
>> And that's reproduceable every time?
>>
> It is, and here is what is happening more in details:
>
> on the client, "/mnt/pnfs1" is the "pNFS" mount point. We use NFS v 4.1.
>
> * Running dd with bs=512 and no "direct" set on the client:
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000
>
> => Here we get variable performance, dd's average is 100MB/sec, and we
> can see all the IOs going to the SAN block device. nfsstat confirms that
> no IOs are going through the NFS server (no "writes" are recorded, only
> "layoutcommit". Performance is maybe low but at this block size we don't
> really care.
>
> * Running dd with bs=512 and "direct" setL
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000 oflag=direct
>
> => Here, funnily enough, all the IOs are sent over NFS. The "nfsstat"
> command shows writes increasing, the SAN block device activity on the
> client is idle. The performance is about 13MB/sec, but again expected
> with such a small IO size. The only unexpected is that small 512bytes
> IOs are not going through the iSCSI SAN.
>
> * Running dd with bs=1M and no "direct" set on the client:
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000
>
> => Here the IOs "work" and go through the SAN (no "write" counter
> increasing in "nfsstat" and I can see disk statistics on the block
> device on the client increasing). However the speed at which the IOs go
> through is really slow (the actual speed recorded on the SAN device
> fluctuates a lot, from 3MB/sec to a lot more). Overall dd is not really
> happy and "Ctrl-C"ing it takes a long time, and in the last try actually
> caused a kernel panic (see http://imgur.com/YpXjvQ3 sorry about the
> picture format, did not have the dmesg output capturing and had access
> to the VGA only).
> When "dd" finally comes around and terminates, the average speed is
> 200MB/sec.
> Again the SAN block device shows IOs being submitted and "nfsstat" shows
> no "writes" but a few "layoutcommits", showing that the writes are not
> going through the "regular" NFS server.
>
>
> * Running dd with bs=1M and no "direct" set on the client:
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000 oflag=direct
>
> => Here the IOs work much faster (almost twice as fast as with "direct"
> set, or 350+MB/sec) and dd is much more responsive (can "Ctrl-C" it
> almost instantly). Again the SAN block device shows IOs being submitted
> and "nfsstat" shows no "writes" but a few "layoutcommits", showing that
> the writes are not going through the "regular" NFS server.
>
> This shows that somehow running with "oflag=direct" causes unstability
> and lower performance, at least on this version.
>
> Both clients are running Linux 4.1.0-rc2 on CentOS 7.0 and the server is
> running Linux 4.1.0-rc2 on CentOS 7.1.
>
>> Can you get network captures and figure out (for example), whether the
>> slow writes are going over iSCSI or NFS, and if they're returning errors
>> in either case?
>>
> I'm going to do that now (try and locate errors). However "nfsstat" does
> indicate that slower writes are going through iSCSI.
>
>>> The same behaviour can be observed laying out an IO file
>>> with FIO for instance, or using some applications which do not use the
>>> ODIRECT flag. When using direct IO I can observe lots of iSCSI
>>> traffic, at extremely good performance (same performance as the SAN
>>> gets on "raw" block devices).
>>>
>>> All the systems are running CentOS 7.0 with a custom kernel 4.1-rc2
>>> (pNFS enabled) apart from the storage nodes which are running a custom
>>> minimal Linux distro with Kernel 3.18.
>>>
>>> The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
>>> driver anywhere (Everything is only "standard" upstream Linux).
>>
>> What's the non-SAN network (that the NFS traffic goes over)?
>>
> The NFS traffic also goes through the same SAN actually, both the iSCSI
> LUNs and the NFS server are accessible over the same 40G/sec Ethernet
> fabric.
>
> Regards,
> Ben.
>
>> --b.
>>
>>>
>>> Would anybody have any ideas where this issue could be coming from?
>>>
>>> Regards, Ben - MPSTOR.
>


2015-05-20 19:40:49

by J. Bruce Fields

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On Wed, May 20, 2015 at 05:27:26PM +0100, Benjamin ESTRABAUD wrote:
> On 15/05/15 20:20, J. Bruce Fields wrote:
> >On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
> >>I've been using pNFS for a while since recently, and I am very pleased
> >>with its overall stability and performance.
> >>
> >>A pNFS MDS server was setup with SAN storage in the backend (a RAID0
> >>built ontop of multiple LUNs). Clients were given access to the same
> >>RAID0 using the same LUNs on the same SAN.
> >>
> >>However, I've been noticing a small issue with it that prevents me
> >>from using pNFS to its full potential: If I run non-direct IOs (for
> >>instance "dd" without the "oflag=direct" option), IOs run excessively
> >>slowly (3-4MB/sec) and the dd process hangs until forcefully
> >>terminated.
> >
> Sorry for the late reply, I was unavailable for the past few days. I
> had time to look at the problem further.
>
> >And that's reproduceable every time?
> >

Thanks for the detailed report. Quick questions:

> It is, and here is what is happening more in details:
>
> on the client, "/mnt/pnfs1" is the "pNFS" mount point. We use NFS v 4.1.
>
> * Running dd with bs=512 and no "direct" set on the client:
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000
>
> => Here we get variable performance, dd's average is 100MB/sec, and
> we can see all the IOs going to the SAN block device. nfsstat
> confirms that no IOs are going through the NFS server (no "writes"
> are recorded, only "layoutcommit". Performance is maybe low but at
> this block size we don't really care.
>
> * Running dd with bs=512 and "direct" setL
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000 oflag=direct
>
> => Here, funnily enough, all the IOs are sent over NFS. The
> "nfsstat" command shows writes increasing, the SAN block device
> activity on the client is idle. The performance is about 13MB/sec,
> but again expected with such a small IO size. The only unexpected is
> that small 512bytes IOs are not going through the iSCSI SAN.
>
> * Running dd with bs=1M and no "direct" set on the client:
>
> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000
>
> => Here the IOs "work" and go through the SAN (no "write" counter
> increasing in "nfsstat" and I can see disk statistics on the block
> device on the client increasing). However the speed at which the IOs
> go through is really slow (the actual speed recorded on the SAN
> device fluctuates a lot, from 3MB/sec to a lot more). Overall dd is
> not really happy and "Ctrl-C"ing it takes a long time, and in the
> last try actually caused a kernel panic (see
> http://imgur.com/YpXjvQ3 sorry about the picture format, did not
> have the dmesg output capturing and had access to the VGA only).
> When "dd" finally comes around and terminates, the average speed is
> 200MB/sec.
> Again the SAN block device shows IOs being submitted and "nfsstat"
> shows no "writes" but a few "layoutcommits", showing that the writes
> are not going through the "regular" NFS server.
>
>
> * Running dd with bs=1M and no "direct" set on the client:

I think you meant to leave out the "no" there?

> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000 oflag=direct
>
> => Here the IOs work much faster (almost twice as fast as with
> "direct" set, or 350+MB/sec) and dd is much more responsive (can
> "Ctrl-C" it almost instantly). Again the SAN block device shows IOs
> being submitted and "nfsstat" shows no "writes" but a few
> "layoutcommits", showing that the writes are not going through the
> "regular" NFS server.
>
> This shows that somehow running with "oflag=direct" causes
> unstability and lower performance, at least on this version.

And I think you mean "running without", not "running with"?

Assuming those are just typos, unless I'm missing something.

--b.

>
> Both clients are running Linux 4.1.0-rc2 on CentOS 7.0 and the
> server is running Linux 4.1.0-rc2 on CentOS 7.1.
>
> >Can you get network captures and figure out (for example), whether the
> >slow writes are going over iSCSI or NFS, and if they're returning errors
> >in either case?
> >
> I'm going to do that now (try and locate errors). However "nfsstat"
> does indicate that slower writes are going through iSCSI.
>
> >>The same behaviour can be observed laying out an IO file
> >>with FIO for instance, or using some applications which do not use the
> >>ODIRECT flag. When using direct IO I can observe lots of iSCSI
> >>traffic, at extremely good performance (same performance as the SAN
> >>gets on "raw" block devices).
> >>
> >>All the systems are running CentOS 7.0 with a custom kernel 4.1-rc2
> >>(pNFS enabled) apart from the storage nodes which are running a custom
> >>minimal Linux distro with Kernel 3.18.
> >>
> >>The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
> >>driver anywhere (Everything is only "standard" upstream Linux).
> >
> >What's the non-SAN network (that the NFS traffic goes over)?
> >
> The NFS traffic also goes through the same SAN actually, both the
> iSCSI LUNs and the NFS server are accessible over the same 40G/sec
> Ethernet fabric.
>
> Regards,
> Ben.
>
> >--b.
> >
> >>
> >>Would anybody have any ideas where this issue could be coming from?
> >>
> >>Regards, Ben - MPSTOR.

2015-05-21 10:09:53

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 20/05/15 20:40, J. Bruce Fields wrote:
> On Wed, May 20, 2015 at 05:27:26PM +0100, Benjamin ESTRABAUD wrote:
>> On 15/05/15 20:20, J. Bruce Fields wrote:
>>> On Fri, May 15, 2015 at 10:44:13AM -0700, Benjamin ESTRABAUD wrote:
>>>> I've been using pNFS for a while since recently, and I am very pleased
>>>> with its overall stability and performance.
>>>>
>>>> A pNFS MDS server was setup with SAN storage in the backend (a RAID0
>>>> built ontop of multiple LUNs). Clients were given access to the same
>>>> RAID0 using the same LUNs on the same SAN.
>>>>
>>>> However, I've been noticing a small issue with it that prevents me
>>>> from using pNFS to its full potential: If I run non-direct IOs (for
>>>> instance "dd" without the "oflag=direct" option), IOs run excessively
>>>> slowly (3-4MB/sec) and the dd process hangs until forcefully
>>>> terminated.
>>>
>> Sorry for the late reply, I was unavailable for the past few days. I
>> had time to look at the problem further.
>>
>>> And that's reproduceable every time?
>>>
>
Hi Bruce,

> Thanks for the detailed report. Quick questions:
>
>> It is, and here is what is happening more in details:
>>
>> on the client, "/mnt/pnfs1" is the "pNFS" mount point. We use NFS v 4.1.
>>
>> * Running dd with bs=512 and no "direct" set on the client:
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000
>>
>> => Here we get variable performance, dd's average is 100MB/sec, and
>> we can see all the IOs going to the SAN block device. nfsstat
>> confirms that no IOs are going through the NFS server (no "writes"
>> are recorded, only "layoutcommit". Performance is maybe low but at
>> this block size we don't really care.
>>
>> * Running dd with bs=512 and "direct" setL
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=512 count=100000000 oflag=direct
>>
>> => Here, funnily enough, all the IOs are sent over NFS. The
>> "nfsstat" command shows writes increasing, the SAN block device
>> activity on the client is idle. The performance is about 13MB/sec,
>> but again expected with such a small IO size. The only unexpected is
>> that small 512bytes IOs are not going through the iSCSI SAN.
>>
>> * Running dd with bs=1M and no "direct" set on the client:
>>
>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000
>>
>> => Here the IOs "work" and go through the SAN (no "write" counter
>> increasing in "nfsstat" and I can see disk statistics on the block
>> device on the client increasing). However the speed at which the IOs
>> go through is really slow (the actual speed recorded on the SAN
>> device fluctuates a lot, from 3MB/sec to a lot more). Overall dd is
>> not really happy and "Ctrl-C"ing it takes a long time, and in the
>> last try actually caused a kernel panic (see
>> http://imgur.com/YpXjvQ3 sorry about the picture format, did not
>> have the dmesg output capturing and had access to the VGA only).
>> When "dd" finally comes around and terminates, the average speed is
>> 200MB/sec.
>> Again the SAN block device shows IOs being submitted and "nfsstat"
>> shows no "writes" but a few "layoutcommits", showing that the writes
>> are not going through the "regular" NFS server.
>>
>>
>> * Running dd with bs=1M and no "direct" set on the client:
>
> I think you meant to leave out the "no" there?
>
Exactly, that's what I meant; sorry, I got confused.

>> dd if=/dev/zero of=/mnt/pnfs1/testfile bs=1M count=100000000 oflag=direct
>>
>> => Here the IOs work much faster (almost twice as fast as with
>> "direct" set, or 350+MB/sec) and dd is much more responsive (can
>> "Ctrl-C" it almost instantly). Again the SAN block device shows IOs
>> being submitted and "nfsstat" shows no "writes" but a few
>> "layoutcommits", showing that the writes are not going through the
>> "regular" NFS server.
>>
>> This shows that somehow running with "oflag=direct" causes
>> unstability and lower performance, at least on this version.
>
> And I think you mean "running without", not "running with"?
>
> Assuming those are just typos, unless I'm missing something.
>
Also right, I meant that without oflag=direct I get lower performance.
Well, actually, as my later mail shows, that only happens for a specific
file size. I'm going to run more tests to narrow it down.

In the meantime I tried looking at network traces, but couldn't capture
anything nice as Wireshark kept losing input. I'm running Wireshark
remotely, with the tcpdump input coming over a slow SSH session, so I'll
probably capture a few seconds' worth of output locally, scp the file
back to me and use that instead.
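Something along these lines, I expect (the interface name is a
placeholder):

timeout 10 tcpdump -i ethX -s 0 -w /tmp/pnfs-buffered.pcap
scp <client>:/tmp/pnfs-buffered.pcap .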

Ben.

> --b.
>
>>
>> Both clients are running Linux 4.1.0-rc2 on CentOS 7.0 and the
>> server is running Linux 4.1.0-rc2 on CentOS 7.1.
>>
>>> Can you get network captures and figure out (for example), whether the
>>> slow writes are going over iSCSI or NFS, and if they're returning errors
>>> in either case?
>>>
>> I'm going to do that now (try and locate errors). However "nfsstat"
>> does indicate that slower writes are going through iSCSI.
>>
>>>> The same behaviour can be observed laying out an IO file
>>>> with FIO for instance, or using some applications which do not use the
>>>> ODIRECT flag. When using direct IO I can observe lots of iSCSI
>>>> traffic, at extremely good performance (same performance as the SAN
>>>> gets on "raw" block devices).
>>>>
>>>> All the systems are running CentOS 7.0 with a custom kernel 4.1-rc2
>>>> (pNFS enabled) apart from the storage nodes which are running a custom
>>>> minimal Linux distro with Kernel 3.18.
>>>>
>>>> The SAN is all 40G Mellanox Ethernet, and we are not using the OFED
>>>> driver anywhere (Everything is only "standard" upstream Linux).
>>>
>>> What's the non-SAN network (that the NFS traffic goes over)?
>>>
>> The NFS traffic also goes through the same SAN actually, both the
>> iSCSI LUNs and the NFS server are accessible over the same 40G/sec
>> Ethernet fabric.
>>
>> Regards,
>> Ben.
>>
>>> --b.
>>>
>>>>
>>>> Would anybody have any ideas where this issue could be coming from?
>>>>
>>>> Regards, Ben - MPSTOR.
>


2015-05-25 15:13:17

by Christoph Hellwig

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On Wed, May 20, 2015 at 07:31:12PM +0100, Benjamin ESTRABAUD wrote:
> After 25 iterations (after creating a 25GiB file, for a cumulative total of
> 325GiB if including the testfile.1G -> testfile.24G) the issue occured
> again. The IO rate to the SAN LUN dropped severely to a real 3MiB/sec
> (measured at the SAN LUN block device level).
>
> Also I've noticed that a kernel process is taking up 100% of one core at
> least:
>
> 516 root 20 0 0 0 0 R 100.0 0.0 11:09.72
> kworker/u49:4

Can you send me the output of "perf record -ag" for that run?

Also can you send the output from trace-cmd for tracing all nfsd.layout*
tracepoints for such a run?
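Something like this should do, although the exact event glob might need
adjusting for your tree:

trace-cmd record -e 'nfsd:*layout*' sleep 60
trace-cmd report > nfsd-layout.txt

and on the node where the kworker is spinning:

perf record -ag -- sleep 60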

> Would the 25GiB figure ring any bells to you? Would there be a way for me to
> identify this workqueue (figure out if it is pNFS related)?

The perf record output should help by showing where the cycles are being spent.

2015-05-25 15:14:30

by Christoph Hellwig

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On Wed, May 20, 2015 at 05:30:19PM +0100, Benjamin ESTRABAUD wrote:
> I'm going to try this now (move client and maybe even server to Linux 4.0,
> now that 4.0 has a more "current" stable release).

Only the client is interesting in this case.

> By the way, what is the minimum Linux Kernel version required to connect to
> a NFS v4.1 server using pNFS? We only managed to get this working with
> kernel 4.0 (on the client) and it appears that this is the lowest kernel
> release that supports pNFS block clients (working with Linux pNFS recent
> block server). Could you please confirm this?

The big changes to make the block client useful went into 3.18. Almost
nothing has changed in the client since.

2015-05-26 16:43:29

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 25/05/15 16:13, Christoph Hellwig wrote:
> On Wed, May 20, 2015 at 07:31:12PM +0100, Benjamin ESTRABAUD wrote:
>> After 25 iterations (after creating a 25GiB file, for a cumulative total of
>> 325GiB if including the testfile.1G -> testfile.24G) the issue occured
>> again. The IO rate to the SAN LUN dropped severely to a real 3MiB/sec
>> (measured at the SAN LUN block device level).
>>
>> Also I've noticed that a kernel process is taking up 100% of one core at
>> least:
>>
>> 516 root 20 0 0 0 0 R 100.0 0.0 11:09.72
>> kworker/u49:4
>
Hi Christoph,

> Can you send me the output of "perf record -ag" for that run?
>
I ran "perf record -ag" on the pNFS client and "trace-cmd record -e
nfsd" (it seems to capture all layout* tracepoints) on the pNFS server
(I figured there was no need to run it on the client, and anyways the
trace wound up empty when I tried).

I then ran "dd if=/dev/zero of=/mnt/pnfs1/testfile.26G bs=1M
count=26624" on the client (writing a 26GB file), waited about 20
seconds for the kworker issue to happen (it never happens immediately)
and as soon as it started, waited another 10 seconds so that the trace
has enough data to debug with.

All those three commands (perf record, trace-cmd and dd) where run
within a 3-4 seconds window, so there should be not much "junk" perf
trace at the beginning which has nothing to do with NFS.

Here's the link to the compressed "perf record -ag" + trace-cmd outputs
(let me know if you need me to use a different hosting provider than
Dropbox):

https://www.dropbox.com/s/wou3hqb2go21gbw/traces.tar.gz?dl=0
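(These are the raw outputs of "perf record" and "trace-cmd record"; they
should open with "perf report" and "trace-cmd report" respectively - let
me know if you'd rather have them in another format.)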

> Also can you send the output from trace-cmd for tracing all nfsd.layout*
> tracepoints for such a run?
>
>> Would the 25GiB figure ring any bells to you? Would there be a way for me to
>> identify this workqueue (figure out if it is pNFS related)?
>
> Perf record should help by looking at the cycles spent.
>

Thanks a lot for your help!

Regards,
Ben.

2015-05-26 16:44:46

by Benjamin ESTRABAUD

Subject: Re: Issue running buffered writes to a pNFS (NFS 4.1 backed by SAN) filesystem.

On 25/05/15 16:14, Christoph Hellwig wrote:
> On Wed, May 20, 2015 at 05:30:19PM +0100, Benjamin ESTRABAUD wrote:
>> I'm going to try this now (move client and maybe even server to Linux 4.0,
>> now that 4.0 has a more "current" stable release).
>
> Only the client is interesting in this case.
>
>> By the way, what is the minimum Linux Kernel version required to connect to
>> a NFS v4.1 server using pNFS? We only managed to get this working with
>> kernel 4.0 (on the client) and it appears that this is the lowest kernel
>> release that supports pNFS block clients (working with Linux pNFS recent
>> block server). Could you please confirm this?
>
> The big changes to make the block client useful went into 3.18. Almost
> nothing has changed in the client since.
>
OK, we'll give 3.18 another shot (we had tried it before without luck),
but if not much has changed we're likely to see the problem there as
well. I'm also going to try running this test over a different SAN
fabric (GbE instead of 40GbE) and see if the issue still occurs.

Thanks,

Ben.