2002-12-05 02:44:35

by Jake Hammer

Subject: NFS read ok, write ok, simultaneous R+W=TERRIBLE

Hi All,

I'm working with a 2.4.19 kernel plus all of Neil Brown's 2.4.19 patches on a
P4 Xeon system, 2048MB RAM, uniprocessor kernel. The distro is Debian Woody
stable. The disk subsystem is ATA RAID 5, 14 spindles. With 15 100base-T
clients going through a Foundry switch, I am able to see 45MB/sec writes and
60MB/sec reads. The clients are dd'ing 10GB files to and from the box
simultaneously. The mount command is:
mount -o proto=udp,vers=2,wsize=32768,rsize=32768 bigbox:/space

The problem is read + write. As soon as the clients switch from read-only or
write-only to doing BOTH reads and writes, the CPU pegs and performance drops
to 3MB/sec (three MB/s)! top shows that 4 of the NFSd's are consuming all of
the CPU, as if they are contending for some kind of resource, like a lock.

Any help would be sincerely appreciated. This is very strange behavior. It
also happens with 2.4.18 + all Neil Brown patches for 2.4.18.

Thanks,

Jake Hammer






2002-12-05 05:01:10

by Eric Whiting

Subject: Re: NFS read ok, write ok, simultaneous R+W=TERRIBLE

Jake,

Your results are similar to what I see -- I don't really have a
solution, but here are some things to consider:

1. Make sure your network is clean and doing duplex right. (test using
two ftp sessions moving those 10 gig files -- one reading while one
writes) Ftp will let you know if the TCP part of your network is healthy
under full duplex loads.

2. Try TCP mounts

3. Try NFS V3 (example mount lines for 2 and 3 below)

4. Make sure your IDE RAID system on the server is healthy. Can it
handle concurrent read/write processes on the local box -- with NFS out
of the picture?
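
For 2 and 3, something like this, reusing Jake's options (the /mnt/space
mount point is just an example, not from his post):

  # NFS v2 over TCP instead of UDP
  mount -o proto=tcp,vers=2,wsize=32768,rsize=32768 bigbox:/space /mnt/space

  # NFS v3, still over UDP
  mount -o proto=udp,vers=3,wsize=32768,rsize=32768 bigbox:/space /mnt/space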

What is the filesystem you have on the RAID5 setup?
reiserfs/ext3/xfs/jfs?
What sort of IDE HW raid supports 14 disks? Or are you using SW raid?

I would like to see a solution to this sort of problem as well.

eric





2002-12-05 05:55:31

by Jake Hammer

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

Hi All, Eric,

> 1. Make sure your network is clean and doing duplex right. (test using
> two ftp sessions moving those 10 gig files -- one reading while one
> writes) Ftp will let you know if the TCP part of your network is healthy
> under full duplex loads.

Network is pristine and duplex is 100% correct. Our Sun and NetApp gear
works great.

> 2. Try TCP mounts

They are even slower with read+write! 1-2MB/sec!

> 3. Try NFS V3

Not [well] supported for TCP. Since the Neil Brown patches enable it, I
tried both UDP and TCP; both are the same on read+write. UDP is faster by
10-15% on pure read or pure write, but the same on read+write.

> 4. Make sure your IDE RAID system on the server is healthy. Can it
> handle concurrent read/write processes on the local box -- with NFS out
> of the picture?

Local read/write and read+write are *great* - no issues at all. Local write
is ~100MB/sec, local read is ~120MB/sec, and local read+write is ~105MB/sec.
It's not the disk subsystem.

> What is the filesystem you have on the RAID5 setup?
> reiserfs/ext3/xfs/jfs?
> What sort of IDE HW raid supports 14 disks? Or are you using SW raid?

2 x 8 port 3ware IDE RAID cards, RAID 5 done in software by 2.4.19 IBM EVMS.
Filesystem is EXT3, mounted with defaults. 14 disks.

To summarize:

Local disk is fast for read, for write, and for read+write. NFS is sort of
fast for read and for write, but it *utterly* dies on read + write, both UDP
and TCP. The network is clean.

> I would like to see a solution to this sort of problem as well.

Glad you're seeing it as well. I hope it can be sorted out.

Thanks,

Jake





2002-12-05 06:27:21

by Jake Hammer

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

> Sounds like you are not using the 3ware card for anything other than
> just an IDE controller, right?

Exactly - I'm using it to give me SCSI JBOD.

> There is a linux-ide-raid mailing list -- with good info. I've seen
> people run raid 5 on the 3ware card and then use SW raid to stripe 2
> 3ware cards. RAID 51 of sorts.

I follow this list - no help, as my disks are FAST. Truth be told, I'm using
RAID 0+5: every pair of drives is striped in software, which gives me 7
really fast drives (14 spindles striped in pairs). Then I software-RAID-5
those seven together, all done with EVMS. So basically 1 parity and 6 data.
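
In mdadm terms rather than EVMS (a sketch only; the device names are made
up), the layout is roughly:

  # RAID 0 over each pair of spindles; repeat for all 7 pairs (md0-md6)
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1
  # ... md1 through md6 built the same way from the remaining pairs ...

  # RAID 5 across the 7 stripes: 6 data + 1 parity
  mdadm --create /dev/md7 --level=5 --raid-devices=7 \
      /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4 /dev/md5 /dev/md6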

> Also there was a pretty good IDE raid review today:
>
> http://tech-report.com/reviews/2002q4/ideraid/index.x?pg=1

I saw this cool article. My disks are *FAST* on R/W and R+W. My NFS is great
on read, great on write, but just dies on read + write. Hence the conundrum.

Neil or Trond - any clues?

Thanks,

Jake





2002-12-05 10:32:49

by Tom McNeal

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

I haven't had a chance to look at it carefully, but this sounds
like unnecessary reads to update cached data prior to writes.
We had an issue with this on HP-UX a few years back.

Regards -

Tom

--
-----------------------------------
Tom McNeal (650)906-0761
-----------------------------------

2002-12-05 17:12:26

by canon

Subject: Re: NFS read ok, write ok, simultaneous R+W=TERRIBLE


Jake,

I can add a "me, too".

We have about 40 3ware boxes (~30 TB). These have a single 8
port card running raid 5 (7+1). We are currently
running XFS on them because we see better scaling performance
versus ext3. However, the numbers you are reporting look
similar to ours for 6 clients.

I'm not convinced that the local performance is ruled
out. How are you testing the local performance? Are
you running multiple streams? Are you certain you
are busting cache? Don't get me wrong, I would love to
see a performance bug found in NFS that suddenly
makes everything better. It would make my life
easier. :-)

Also, are the NFSd's in disk wait? If they are not, that would be a
clearer indication that the NFS subsystem may be to blame. We usually
see all the daemons in disk wait when we are really pounding one of
these boxes.

One other note. We were running software raid (on a 2.2.19
kernel). After 3ware made some improvements to their
raid 5 implementation, we moved away from it. It appeared
that we were actually CPU bound with SW raid. We
are seeing better scaling performance with the HW
raid. Plus, we can use the hot swap ability of the
card (which for 300+ drives is nice). :-)

If anyone finds tweaks it would be great to get
them folded into the FAQ.

--Shane


2002-12-05 18:25:45

by Eff Norwood

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

Jake, Shane,

I can also add a "me too". Our similar setup produces the same good read,
good write, but terrible read/write. Local disk tests using multiple streams
with files much larger than the cache show good local disk performance. NFS
v2, v3, UDP, and TCP all work great for read or write alone, but are up to
95% slower on read/write.

Eff Norwood


2002-12-05 18:35:27

by Jake Hammer

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

Hi Shane,

> I'm not convinced that the local performance is ruled
> out. How are you testing the local performance? Are
> you running multiple streams? Are you certain you
> are busting cache?

We're launching 20 occurrences of dd, each producing a 10GB zero-filled file
from /dev/zero. Since the cache is only 2GB, we have to be breaking it with
just one file, never mind all of them. Then we launch 20 occurrences of dd
again and read all the files back into /dev/null. Then comes the local test
that makes me believe NFS is broken: 20 occurrences of dd reading and 20
occurrences of dd writing simultaneously are just as fast as the other tests.
So local disk performance sustains excellent throughput on all of these
*local* operations.
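
In script form, the three phases look roughly like this (paths and block
sizes illustrative):

  #!/bin/sh
  # Phase 1: 20 concurrent writers, 10GB each (cache is only 2GB)
  i=1; while [ $i -le 20 ]; do
      dd if=/dev/zero of=/space/f$i bs=1024k count=10240 &
      i=$((i + 1))
  done; wait

  # Phase 2: 20 concurrent readers of the same files
  i=1; while [ $i -le 20 ]; do
      dd if=/space/f$i of=/dev/null bs=1024k &
      i=$((i + 1))
  done; wait

  # Phase 3: 20 readers of the phase-1 files plus 20 writers of new files
  i=1; while [ $i -le 20 ]; do
      dd if=/space/f$i of=/dev/null bs=1024k &
      dd if=/dev/zero of=/space/new$i bs=1024k count=10240 &
      i=$((i + 1))
  done; wait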

Now try the same testing with 20 NFS mounted clients each DD'ing a 10GB
file. Great read, great write, horrible read/write.

> Also, are the NFSd's in disk wait?

Could you tell me how to check this? I'll be pleased to report once I know
how.

> If they are not, that would be a
> clearer indication that the NFS subsystem may be to blame. We usually
> see all the daemons in disk wait when we are really pounding one of
> these boxes.

Yes, I'd love to be able to check. Please let me know how. This sounds
interesting.

> One other note. We were running software raid (on a 2.2.19
> kernel).

2.2.19 or 2.4.19?

> After 3ware made some improvements to their
> raid 5 implementation, we moved away from it. It appeared
> that we were actually CPU bound with SW raid.

Also interesting.

Have you changed your bdflush settings at all? If so, can we compare
settings? Playing with bdflush makes a huge impact on our local disk
performance.
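
For comparison, the values can be read like this (nine fields on most 2.4
kernels; their exact meaning varies by kernel version, and the numbers in
the echo below are placeholders, not recommendations):

  cat /proc/sys/vm/bdflush
  # or: sysctl vm.bdflush

  # To change them, write a space-separated line back -- placeholder values:
  # echo 30 500 64 256 500 3000 60 0 0 > /proc/sys/vm/bdflush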

Thanks,

Jake





2002-12-05 19:47:31

by Jake Hammer

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

> Are you reading and writing the files at the same time? Or are
> you reading
> older files, while writing newer files?

1. 20 clients each write a unique 10GB file (20 x 10GB files), then
2. 20 clients each read back the unique 10GB file written previously (20 x
10GB files), then
3. 20 clients each read and write 10GB files *simultaneously* (20 x 10GB
read + 20 x 10GB written - 40 files total). The files being read are the
same files written in step 1; the files being written are new.

> So the second test is
> still 20 clients,
> but each client is doing a read and write (to different files)?

Correct - as above.

Thanks,

Jake






2002-12-05 19:55:06

by David Rees

Subject: Re: NFS read ok, write ok, simultaneous R+W=TERRIBLE

On Thu, Dec 05, 2002 at 10:35:14AM -0800, Jake Hammer wrote:
>
> > Also, are the NFSd's in disk wait?
>
> Could you tell me how to check this? I'll be pleased to report once I know
> how.
>
> > If they are not, that would be a
> > clearer indication that the NFS subsystem may be to blame. We usually
> > see all the daemons in disk wait when we are really pounding one of
> > these boxes.
>
> Yes, I'd love to be able to check. Please let me know how. This sounds
> interesting.

I doubt that the NFSd's are in disk wait, given that 4 of them are sucking
up CPU as indicated in your original post:
http://marc.theaimsgroup.com/?l=linux-nfs&m=103905720718370&w=2

However, to check whether a process is indeed in disk wait, top will show
you: under the STAT column, instead of an R you will see a D, indicating a
process waiting on a disk operation to return.
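
For example:

  # D in the STAT column = uninterruptible (disk) sleep, R = running
  ps axo pid,stat,comm | grep nfsd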

This sounds like a legitimate problem with the NFS server code; I'll also
see if I can reproduce it.

-Dave



2002-12-05 23:48:28

by canon

Subject: Re: NFS read ok, write ok, simultaneous R+W=TERRIBLE


> We're launching 20 occurrences of dd, each producing a 10GB zero-filled file
> from /dev/zero. Since the cache is only 2GB, we have to be breaking it with
> just one file, never mind all of them. Then we launch 20 occurrences of dd
> again and read all the files back into /dev/null. Then comes the local test
> that makes me believe NFS is broken: 20 occurrences of dd reading and 20
> occurrences of dd writing simultaneously are just as fast as the other tests.
> So local disk performance sustains excellent throughput on all of these
> *local* operations.
>
> Now try the same testing with 20 NFS mounted clients each DD'ing a 10GB
> file. Great read, great write, horrible read/write.
>

Okay, that sounds good. That's similar to the tests we do, though we use
iozone, so it seems more likely that you are seeing an NFS issue.
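
An iozone run approximating that mix might look something like this (thread
count and sizes matched to Jake's test here, not our exact invocation):

  # 20 threads, 10GB files, 32k records; -i 0 = write test, -i 1 = read test
  iozone -t 20 -s 10240m -r 32k -i 0 -i 1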

>
>
> 2.2.19 or 2.4.19?
>

It actually was 2.2.19 -- this was a while back. We did test software RAID
under 2.4 as well and saw similar behaviour. We haven't revisited it,
though; there have probably been improvements in the md code since, so it
may no longer be an issue.

>
> Have you changed your bdflush settings at all? If so, can we compare
> settings? Playing with bdflush makes a huge impact on our local disk
> performance.
>

We do adjust this, but I'm going to check that we are using optimal
settings. What values are you using?


> Thanks,
>
> Jake
>
>

Likewise,

--Shane



2002-12-06 17:59:26

by Paul Haas

Subject: RE: NFS read ok, write ok, simultaneous R+W=TERRIBLE

My apologies in advance if I get any of the attributions wrong.

On Wed, 4 Dec 2002, Jake Hammer wrote:

> 2 x 8 port 3ware IDE RAID cards, RAID 5 done in software by 2.4.19 IBM EVMS.
> Filesystem is EXT3, mounted with defaults. 14 disks.

So what is the effective block size across the RAID array? Correct me if
I'm wrong, but I'm guessing 48k bytes: each stripe has 6 data disks and 1
parity disk, and the block size on each data disk is 8k bytes, so the
smallest write is 48kbytes of data plus 8kbytes of parity.
In another message you wrote:

> The mount command is:
> mount -o proto=udp,vers=2,wsize=32768,rsize=32768 bigbox:/space

Doesn't NFS default to synchronous writes?

To write the first 32kbytes of data in a file, NFS actually writes 48k, then
waits for that data to get to the disk. The next 32kbytes will straddle two
48kbyte "blocks", so the next write will actually put 96kbytes on the disks.
The third 32k won't straddle a block boundary, so it will be like the first
32k: 48k of data-disk traffic. So for every 96k of writes, you'll see 192k
of disk traffic.

Local writes aren't synchronous, so 96kbytes of writes are 96kbytes.

In other messages you wrote about the NFS write only case:

> I am able to see 45MB/sec writes

And for the local write only case you wrote:

> Local write is ~100MB/sec

That's about the 1 to 2 ratio I would expect.
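
Spelled out, using the 8k-per-disk guess above:

  96k of NFS writes   ->  48k + 96k + 48k = 192k of disk traffic
  amplification        =  192k / 96k = 2x
  expected NFS writes  ~  100MB/sec local / 2 = 50MB/sec (Jake sees 45MB/sec)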

Now add a bunch of other disk I/O activity to trash the caches that are
internal to each disk.

Each partial-block write that doesn't straddle blocks will involve reading
the whole 48kbyte block, modifying the changed bits, and writing the
48kbytes back out. Double that for the writes that straddle block
boundaries. Performance will really, really suck, because now there are
bunches of seek delays.

You also wrote:

> performance drops to 3MB/sec (three MB/s)!

Yup, that really, really sucks.

> To summarize:
>
> Local disk is fast for read, for write, and for read+write. NFS is sort of
> fast for read and for write, but it *utterly* dies on read + write, both UDP
> and TCP. The network is clean.

> > I would like to see a solution to this sort of problem as well.

> Glad you're seeing it as well. I hope it can be sorted out.

Can you try a combination of mirroring and striping instead of RAID 5 and
see what you get?

If it is possible, make the block sizes match. I don't think it is easy to
make the NFS block size 48k. If you could try 4 data disks per parity disk,
it might be reasonable: 4 data disks x 8k is a 32k stripe, matching the 32k
wsize/rsize.

> Thanks,

> Jake


--
Paul Haas EDS
2600 Green Road, Suite 100, Ann Arbor, MI 48105
[email protected] http://www.iware.com (734) 623-5808