2006-02-01 22:02:10

by David Sullivan

Subject: NFS reliability in Enterprise Environment

Does anyone have any data they can share on the reliability of Linux NFS
servers (not clients) in an enterprise environment (e.g. server uptimes,
number of volumes shared, number of clients, IO rates)? I have a large
enterprise application set (about 10K programs) that uses NFS for
file-sharing. A movement is afoot at my company to port the
applications from the more expensive "big iron" Unix systems it is
currently deployed on to less expensive Linux x86 servers. I'm trying
to do a baseline risk assessment of this move, and I'd like some
empirical data (or some speculative) on the reliability of NFS. Even
better would be a contrast in reliability to shared-storage filesystems
like GFS or OCFS2.

TIA!





2006-02-02 00:08:32

by Brian Kerr

Subject: Re: NFS reliability in Enterprise Environment

On 2/1/06, David Sullivan <[email protected]> wrote:
>
> Does anyone have any data they can share on the reliability of Linux NFS
> servers (not clients) in an enterprise environment (e.g. server uptimes,
> number of volumes shared, number of clients, IO rates)? I have a large
> enterprise application set (about 10K programs) that uses NFS for
> file-sharing. A movement is afoot at my company to port the applications
> from the more expensive "big iron" Unix systems it is currently deployed on
> to less expensive Linux x86 servers. I'm trying to do a baseline risk
> assessment of this move, and I'd like some empirical data (or some
> speculative) on the reliability of NFS. Even better would be a contrast in
> reliability to shared-storage filesystems like GFS or OCFS2.
er tcp(only).

While I can't share exact IO rates and stats, I can tell you that it keeps up
with NetApp filers that consistently push 40-80 MBit/second, 24/7/365.
Be sure to do some research on kernel and hardware choices, obviously, and you
should be good.
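
For a sense of scale, 40-80 MBit/second is only about 5-10 MByte/second
sustained (megabits, not megabytes); a quick, purely illustrative
conversion:

# Back-of-the-envelope unit conversion (illustrative only): the filer
# figures above are megabits/second, not megabytes/second.
for mbit_per_sec in (40, 80):
    mbyte_per_sec = mbit_per_sec / 8.0
    tb_per_day = mbyte_per_sec * 86400 / 1e6
    print("%d Mbit/s ~ %.0f MByte/s ~ %.2f TB/day" %
          (mbit_per_sec, mbyte_per_sec, tb_per_day))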

I would be interested to hear of anyone else using GFS in an NFS
implementation. Shortly we will be moving some NFS servers to large
storage arrays, and we need all NFS servers to see the same filesystems
for redundancy and scalability. From what I've gathered, GFS is ready
for prime time - I would love to hear how it works from an NFS
perspective.

-Brian



2006-02-02 00:34:06

by Dan Stromberg

Subject: Re: NFS reliability in Enterprise Environment

On Wed, 2006-02-01 at 19:08 -0500, Brian Kerr wrote:
> On 2/1/06, David Sullivan <[email protected]> wrote:
> >
> > Does anyone have any data they can share on the reliability of Linux NFS
> > servers (not clients) in an enterprise environment (e.g. server uptimes,
> > number of volumes shared, number of clients, IO rates)? I have a large
> > enterprise application set (about 10K programs) that uses NFS for
> > file-sharing. A movement is afoot at my company to port the applications
> > from the more expensive "big iron" Unix systems it is currently deployed on
> > to less expensive Linux x86 servers. I'm trying to do a baseline risk
> > assessment of this move, and I'd like some empirical data (or some
> > speculative) on the reliability of NFS. Even better would be a contrast in
> > reliability to shared-storage filesystems like GFS or OCFS2.
> er tcp(only).
>
> While I can't share exact IO and stats I can tell you that it keeps up
> with netapp filers that consistently push 40-80MBit/second 24/7/365.
> Be sure to do some research on kernel and hardware obviously and you
> should be good.
>
> I would be interested to hear of anyone else using GFS in an NFS
> implementation. Shortly we will be moving some NFS servers to large
> storage arrays and we need all NFS servers to see the same filesystems
> for redundancy and scalability. From what I've gathered GFS is ready
> for primetime - I would love to hear how it works from a NFS
> perspective.
>
> -Brian

We tried GFS served out over NFS, and it didn't work out well at all.

In fact, since we had a bunch of 2 terabyte slices that were made
available via GFS, we could've just eliminated GFS altogether, because
32-bit Linux can handle 2 TB slices -without- GFS. Sorry my GFS notes are a
little sketchy, but my manager seemed to want to do the GFS work himself :) I
mostly just documented how to start it up and shut it down:

http://dcs.nac.uci.edu/~strombrg/gfs_procedures.html
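
As an aside, the 2 terabyte number is just the usual 32-bit
block-addressing ceiling - 2^32 sectors of 512 bytes each. A trivial
sanity check (general arithmetic, nothing specific to our setup):

# The common explanation for the 2 TB boundary on 32-bit systems of that
# era: a 32-bit sector index with 512-byte sectors tops out at 2 TiB.
SECTOR_SIZE = 512                         # bytes
max_sectors = 2 ** 32                     # 32-bit sector addressing
limit_bytes = max_sectors * SECTOR_SIZE
print("limit = %d bytes = %.0f TiB" % (limit_bytes, limit_bytes / 2.0 ** 40))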


We later tried Lustre, but that didn't work out well in combination with
NFS either, despite our having contracted with the vendor to -make- it
work. It was just iteration after iteration, patch after patch, and
"it's the hardware" vs. "it's the software" finger-pointing, because we had
so many vendors involved. My Lustre notes are far less sketchy, because I
got to take a lead role in that one for the most part:

http://dcs.nac.uci.edu/~strombrg/Lustre-notes.html

I also still have reams of log data that I shared with the vendor:

http://dcs.nac.uci.edu/~strombrg/lustre-info-for-CFS.html

...and as I skim through it, I'm recalling that I eventually discovered
that we were getting errors even on the Lustre NAS head that was serving
data out over NFS - i.e., with no NFS involved at all. Then again, we
were using a fork off the Lustre mainline that wasn't going to be merged back
into the mainline (which means if we ever needed to upgrade, we'd
probably end up contracting with them again to merge the needed changes
into either the mainline or -another- fork).


We eventually got IBM to buy back the PC hardware they sold us for the
storage farm, got a StorEdge 3511 and a SPARC box from Sun, and we had
about 16 terabytes up in short order. We're using a SPARC running Solaris
as a NAS head for all that data, and then an AIX system accesses
the data over NFS.


QFS, the filesystem we're using on it, is actually kind of cool. It
allows you to easily aggregate data into one huge filesystem, and it
merges the volume management and filesystem layers for performance - but
I'm glad it's them maintaining the combination and not me :)





2006-02-02 05:00:28

by Brian Kerr

Subject: Re: NFS reliability in Enterprise Environment

On 2/1/06, Dan Stromberg <[email protected]> wrote:

> We tried GFS served out over NFS, and it didn't work out well at all.
>
> In fact, since we had a bunch of 2 terabyte slices that were made
> available via GFS, we could've just eliminated GFS altogether, because
> 32 bit linux can do that -without- GFS. Sorry my GFS notes are a little
> sketchy, but my manager seemed to want to do the GFS stuff himself :) I
> mostly just documented how to start it up and shut it down:

That's not what I wanted to hear!

We are looking at fairly dense storage and will need to make about
9-10 2TB partitions. The limitation of only one host being able to see a
partition is obviously a huge drawback. I was hoping to avoid this
with a clustered filesystem of some sort. I plan to do some of my own
testing on a smaller scale before moving anything into the next phase.

Thanks for the summary!

> http://dcs.nac.uci.edu/~strombrg/gfs_procedures.html
>
>
> We later tried Lustre, but that didn't work out well in combination with
> NFS either, despite our having contracted with the vendor to -make- it
> work. It was just iteration after iteration and patch after patch, and
> "it's the hardware" vs "it's the software" because we had so many
> vendors involved. My lustre notes are far less sketchy, because I got
> to take a lead role in that one for the most part:
>
> http://dcs.nac.uci.edu/~strombrg/Lustre-notes.html
>
> I also still have reams of log data that I shared with the vendor:
>
> http://dcs.nac.uci.edu/~strombrg/lustre-info-for-CFS.html
>
> ...and as I skim through it, I'm recalling that I eventually discovered
> that we were getting errors even on the lustre NAS head that was serving
> data out over NFS - IE, with no NFS involved at all. Then again, we
> were using a fork off the lustre mainline that wasn't to be merged back
> into the mainline (which means if we ever needed to upgrade, we'd
> probably end up contracting with them again to merge the needed changes
> into either the mainline, or -another- fork).
>
>
> We eventually got IBM to buy back the PC hardware they sold us for the
> storage farm, got a StorEdge 3511 and a sparc box from Sun, and we had
> about 16 terabytes up in short order. We're using a sparc with Solaris
> on it as a NAS head for all that data, and then an AIX system accesses
> the data over NFS.
>
>
> QFS, the filesystem we're using on it, is actually kind of cool. It
> allows you to easily aggregate data into one huge filesystem, and it
> merges the volume management and filesystem layers for performance - but
> I'm glad it's them maintaining the combination and not me :)
>
>
>



2006-02-02 14:55:41

by Roger Heflin

Subject: RE: NFS reliability in Enterprise Environment



> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of David Sullivan
> Sent: Wednesday, February 01, 2006 4:02 PM
> To: [email protected]
> Subject: [NFS] NFS reliability in Enterprise Environment
>
> Does anyone have any data they can share on the reliability
> of Linux NFS servers (not clients) in an enterprise
> environment (e.g. server uptimes, number of volumes shared,
> number of clients, IO rates)? I have a large enterprise
> application set (about 10K programs) that uses NFS for
> file-sharing. A movement is afoot at my company to port the
> applications from the more expensive "big iron" Unix systems
> it is currently deployed on to less expensive Linux x86
> servers. I'm trying to do a baseline risk assessment of this
> move, and I'd like some empirical data (or some speculative)
> on the reliability of NFS. Even better would be a contrast
> in reliability to shared-storage filesystems like GFS or OCFS2.
>
> TIA!

Just to note: since the underlying disk is usually the
bottleneck, having two servers hit the same disk hardware does
not really help.

If you want things to scale, I would find a way to logically
separate things.

On the speed note: with cheap arrays and cheap NFS servers (basic
dual-CPU machines) I can serve single-stream writes at
95 MByte/second and reads at 115 MByte/second. Both of these are
sustained rates (10x+ the amount of possible cache anywhere
in the setup), and this is over a single Ethernet link, to a single
14-disk array.
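
If you want to reproduce that kind of number yourself, here is a rough
sketch of the measurement (the mount point, 8 GB file size, and 1 MB
block size are made up; the point is just to move far more data than any
cache in the path can hold):

# Rough sketch of a sustained sequential throughput check over an NFS mount.
# /mnt/nfs and the sizes below are illustrative, not from the setup above.
import os, time

MOUNT = "/mnt/nfs"                  # assumed NFS mount point
PATH = os.path.join(MOUNT, "throughput.test")
BLOCK = 1024 * 1024                 # 1 MB per write/read
TOTAL = 8 * 1024 ** 3               # 8 GB total, well past any cache

def write_test():
    buf = b"\0" * BLOCK
    start = time.time()
    f = open(PATH, "wb")
    for _ in range(TOTAL // BLOCK):
        f.write(buf)
    f.flush()
    os.fsync(f.fileno())            # make sure the data really hit the server
    f.close()
    return TOTAL / (time.time() - start) / 1e6

def read_test():
    start = time.time()
    f = open(PATH, "rb")
    while f.read(BLOCK):
        pass
    f.close()
    return TOTAL / (time.time() - start) / 1e6

print("write: %.0f MByte/s" % write_test())
print("read:  %.0f MByte/s" % read_test())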

I would personally stay away from the shared filesystems; the
trade-offs required to make them share between machines probably
cut the speed enough to make it not worth it.

With single arrays (such as in the above-mentioned test), the total
rate gets lower when two machines (or two streams from the same
machine) access the same array (physical disks), as the disks
start to seek around.

When I last did it, I broke things up into small physical units (1-2 TB
each) and spread them over many cheap machines; this was to allow
us to get very large aggregate IO rates, i.e. 500 MBytes/second+, and this
was 3+ years ago. This also allowed us to have a fully tested cold
spare that could replace any broken server machine
within 1 hour. That was an improvement over the large Suns we
had previously, as we only had 2 of them and really could not afford
to keep a 100k machine as a cold spare.
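
The sizing arithmetic is straightforward; with per-server rates like the
ones above (numbers purely illustrative):

# How many ~100 MByte/s commodity NFS servers a given aggregate rate takes;
# the per-server figure is the single-stream write rate quoted above.
per_server_mb_s = 95           # MByte/s from one cheap server
target_mb_s = 500              # desired aggregate IO rate
servers = -(-target_mb_s // per_server_mb_s)   # ceiling division
print("need ~%d servers to reach %d MByte/s (plus a cold spare)"
      % (servers, target_mb_s))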

Roger




2006-02-03 16:36:52

by Brian Kerr

Subject: Re: NFS reliability in Enterprise Environment

On 2/2/06, Roger Heflin <[email protected]> wrote:

...
> Just to note, since usually the underlying disk is the
> bottleneck, having 2 servers hit the same disk hardware does
> not really help.

Sure, but I'm not looking for SAN speeds on a solution using this
technology. Rather, I would like to see a relatively fast, scalable,
highly available, dense storage solution using NFS technology.

Obviously you can do this with a NetApp or countless other solutions -
but we are looking at a Nexsan Satabeast (22TB in 4U) for its
unmatched density and price. We would add a couple of Linux servers running
GFS into the mix to make the data available to hundreds of machines.

> If you want things to scale, I would find a way to logically
> separate things.

This is the problem we have now. We have over 15 different "storage
servers" with locally attached ATA RAID arrays. Getting an
application to scale properly when you have 15+ different NFS paths is
not going to work in the long run; it isn't working now.

> On the speed note, with cheap arrays and cheap NFS servers (basic
> dual cpu machines) I can serve single stream writes at
> 95MByte/second, and reads at 115MByte/second, both of these are
> sustained rates (10x+ times the amount of possible cache anywhere
> in the setup), and this is over a single ethernet, to a single
> 14 disk array.

Yes, these work great, but when you have 15 or more it gets
unmanageable; hence the switch to GFS. A lot of the data on these
servers is simply archived away anyway.

> I would personally stay away from the shared filesystems, the
> trade off required to make them share between machines probably
> cut the speed enough to make it not worth it.

I will let you know once we implement it. A 22TB array connected to
only two or three servers shouldn't hit a bottleneck. You have to use
3-4 ports off the controllers to maximize bandwidth from the array
anyway.

> With single arrays (such as the above mentioned test), the total
> rate gets lower when two machines (or two streams from the same
> machine) access the same array (physical disks) as the disks
> start to seek around.
>
> When I last did it, I broke things up into small physical units and
> spread them over many cheap machines (1-2TB, this was to allow
> us to get very large IO rates, ie 500MBytes/second+, and this was
> 3+ years ago), this also allow us to have a fully tested cold
> spare that could replace any broken server machine
> within 1 hour. This was a better thing over the large Suns we
> had previously as we only had 2 of them, and really could not afford
> to keep a 100k machine as a cold spare.

