2008-09-17 13:06:40

by Martin Knoblauch

Subject: [RFC][Resend] Make NFS-Client readahead tunable

Hi,

the following/attached patch works around an [obscure] problem when a 2.6 NFS client (not sure about/caring for 2.4) accesses an "offline" file on a Sun/Solaris-10 NFS server whose underlying filesystem is of type SAM-FS. It happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chances of a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I will just have to maintain it out of tree for eternity.

The problem: SAM-FS is Sun's proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media, e.g. tape. This is completely transparent to the users. If the data for an "offline" file is needed, the so-called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if a Linux NFS client tries to read such an offline file, performance drops to "extremely slow". After lengthy investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically, it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.

- The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
- The working solution: disable the client-side readahead, or make it tunable. The patch does that by introducing an NFS module parameter "ra_factor", which can take values between 1 and 15 (default 15), and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.

Signed-off-by: Martin Knoblauch <[email protected]>

diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
--- linux-2.6.27-rc6-git4/fs/nfs/client.c 2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c 2008-09-17 11:55:18.000000000 +0200
@@ -722,6 +722,11 @@ error:
}

/*
+ * NFS Client Read-Ahead factor
+*/
+unsigned int nfs_ra_factor;
+
+/*
* Load up the server record from information gained in an fsinfo record
*/
static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *fsinfo)
@@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
server->rsize = NFS_MAX_FILE_IO_SIZE;
server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;

- server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
+ dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, \
+ nfs_ra_factor, ra_pages: %d %d %d %d %d\n",
+ server->rsize,server->wsize,server->rpages,
+ nfs_ra_factor,server->rpages * nfs_ra_factor);
+ server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;

if (server->wsize > max_rpc_payload)
server->wsize = max_rpc_payload;
diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
--- linux-2.6.27-rc6-git4/fs/nfs/inode.c 2008-09-17 11:35:21.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000 +0200
@@ -53,6 +53,8 @@

/* Default is to see 64-bit inode numbers */
static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
+static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
+

static void nfs_invalidate_inode(struct inode *);
static int nfs_update_inode(struct inode *, struct nfs_fattr *);
@@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
#endif
if ((err = register_nfs_fs()) != 0)
goto out;
+
+ if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
+ nfs_ra_factor = NFS_MAX_READAHEAD;
+ else
+ nfs_ra_factor = ra_factor;
+
return 0;
out:
#ifdef CONFIG_PROC_FS
@@ -1388,6 +1396,10 @@ static void __exit exit_nfs_fs(void)
MODULE_AUTHOR("Olaf Kirch <[email protected]>");
MODULE_LICENSE("GPL");
module_param(enable_ino64, bool, 0644);
+MODULE_PARM_DESC(enable_ino64, "Enable 64-bit inode numbers (Default: 1)");
+module_param(ra_factor, uint, 0644);
+MODULE_PARM_DESC(ra_factor,
+ "Number of rsize read-ahead requests (Default/Max: 15, Min: 1)");

module_init(init_nfs_fs)
module_exit(exit_nfs_fs)
diff -urp linux-2.6.27-rc6-git4/fs/nfs/sysctl.c linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c
--- linux-2.6.27-rc6-git4/fs/nfs/sysctl.c 2008-07-13 23:51:29.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/sysctl.c 2008-09-17 11:45:09.000000000 +0200
@@ -14,9 +14,12 @@
#include <linux/nfs_fs.h>

#include "callback.h"
+#include "internal.h"

static const int nfs_set_port_min = 0;
static const int nfs_set_port_max = 65535;
+static const unsigned int min_nfs_ra_factor = 1;
+static const unsigned int max_nfs_ra_factor = NFS_MAX_READAHEAD;
static struct ctl_table_header *nfs_callback_sysctl_table;

static ctl_table nfs_cb_sysctls[] = {
@@ -58,6 +61,16 @@ static ctl_table nfs_cb_sysctls[] = {
.mode = 0644,
.proc_handler = &proc_dointvec,
},
+ {
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "nfs_ra_factor",
+ .data = &nfs_ra_factor,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec_minmax,
+ .extra1 = (unsigned int *)&min_nfs_ra_factor,
+ .extra2 = (unsigned int *)&max_nfs_ra_factor,
+ },
{ .ctl_name = 0 }
};

diff -urp linux-2.6.27-rc6-git4/include/linux/nfs_fs.h linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h
--- linux-2.6.27-rc6-git4/include/linux/nfs_fs.h 2008-09-17 11:35:25.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/include/linux/nfs_fs.h 2008-09-17 11:45:09.000000000 +0200
@@ -464,6 +464,11 @@ extern int nfs_writeback_done(struct rpc
extern void nfs_writedata_release(void *);

/*
+ * linux/fs/nfs/client.c
+*/
+extern unsigned int nfs_ra_factor;
+
+/*
* Try to write back everything synchronously (but check the
* return value!)
*/
diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
--- linux-2.6.27-rc6-git4/Makefile 2008-09-17 11:35:56.000000000 +0200
+++ linux-2.6.27-rc6-git4-nfs_ra/Makefile 2008-09-17 11:45:09.000000000 +0200
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 27
-EXTRAVERSION = -rc6-git4
+EXTRAVERSION = -rc6-git4-nfs_ra
NAME = Rotary Wombat

# *DOCUMENTATION*



Cheers
Martin

------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www: http://www.knobisoft.de


Attachments:
nfs_ra-2.6.27-rc6-git4.diff (4.36 kB)

2008-09-18 13:20:48

by Peter Zijlstra

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

On Thu, 2008-09-18 at 01:38 -0700, Martin Knoblauch wrote:

> I believe Peter wanted to add per bdi stuff for nfs some time ago. Not sure what came out of it.

$ mount localhost:/ /mnt/tmp
$ grep nfs /proc/$$/mountinfo
21 17 0:17 / /var/lib/nfs/rpc_pipefs rw - rpc_pipefs rpc_pipefs rw
31 13 0:20 / /proc/fs/nfsd rw - nfsd nfsd rw
37 17 0:22 / /mnt/tmp rw - nfs localhost:/ rw,vers=3,rsize=65536,wsize=65536,namlen=255,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountproto=tcp,addr=127.0.0.1
$ ls -la /sys/class/bdi/0\:22/
total 0
drwxr-xr-x 3 root root 0 2008-09-18 15:16 .
drwxr-xr-x 21 root root 0 2008-09-18 15:16 ..
-rw-r--r-- 1 root root 4096 2008-09-18 15:19 max_ratio
-rw-r--r-- 1 root root 4096 2008-09-18 15:19 min_ratio
drwxr-xr-x 2 root root 0 2008-09-18 15:19 power
-rw-r--r-- 1 root root 4096 2008-09-18 15:19 read_ahead_kb
lrwxrwxrwx 1 root root 0 2008-09-18 15:19 subsystem -> ../../bdi
-rw-r--r-- 1 root root 4096 2008-09-18 15:19 uevent
$ cat /sys/class/bdi/0\:22/read_ahead_kb
960

2008-09-17 13:19:25

by Michael Trimarchi

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Hi



----- Original Message -----
> From: Martin Knoblauch <[email protected]>
> To: linux-nfs list <[email protected]>
> Cc: [email protected]
> Sent: Wednesday, 17 September 2008, 15:06:40
> Subject: [RFC][Resend] Make NFS-Client readahead tunable
>
....

> diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
> --- linux-2.6.27-rc6-git4/Makefile 2008-09-17 11:35:56.000000000 +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/Makefile 2008-09-17 11:45:09.000000000 +0200
> @@ -1,7 +1,7 @@
> VERSION = 2
> PATCHLEVEL = 6
> SUBLEVEL = 27
> -EXTRAVERSION = -rc6-git4
> +EXTRAVERSION = -rc6-git4-nfs_ra
> NAME = Rotary Wombat
>
> # *DOCUMENTATION*

I'm not an expert but maybe this is not necessary :)

>
>
> Cheers
> Martin
>
> ------------------------------------------------------

Michael

__________________________________________________
Do You Yahoo!?
Little space and lots of spam? Yahoo! Mail protects you from spam and gives you plenty of free space for your files and messages
http://mail.yahoo.it

2008-09-17 13:25:23

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Jim Rees <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: linux-nfs list <[email protected]>
> Sent: Wednesday, September 17, 2008 3:21:49 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> There are times when you want to increase nfs readahead instead of
> decreasing it. I suggest you make the default 15 but allow tuning it either
> up or down.

We never needed that in our case. But yes, would be trivial. The question is, whether there should be a maximum, just as a safeguard.

Martin


2008-09-17 13:27:47

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Michael Trimarchi <[email protected]>
> To: Martin Knoblauch <[email protected]>; linux-nfs list <[email protected]>
> Cc: [email protected]
> Sent: Wednesday, September 17, 2008 3:19:25 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> Hi
>
>
>
> ----- Original Message -----
> > From: Martin Knoblauch
> > To: linux-nfs list
> > Cc: [email protected]
> > Sent: Wednesday, 17 September 2008, 15:06:40
> > Subject: [RFC][Resend] Make NFS-Client readahead tunable
> >
> ....
>
> > diff -urp linux-2.6.27-rc6-git4/Makefile linux-2.6.27-rc6-git4-nfs_ra/Makefile
> > --- linux-2.6.27-rc6-git4/Makefile 2008-09-17 11:35:56.000000000 +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/Makefile 2008-09-17 11:45:09.000000000 +0200
> > @@ -1,7 +1,7 @@
> > VERSION = 2
> > PATCHLEVEL = 6
> > SUBLEVEL = 27
> > -EXTRAVERSION = -rc6-git4
> > +EXTRAVERSION = -rc6-git4-nfs_ra
> > NAME = Rotary Wombat
> >
> > # *DOCUMENTATION*
>
> I'm not an expert but maybe this is not necessary :)
>

Doh, yes. A final patch should not have that. Instead it should have some documentation for the module parameter and the tunable :-)

Cheers
Martin


2008-09-17 13:30:24

by Jim Rees

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

There are times when you want to increase nfs readahead instead of
decreasing it. I suggest you make the default 15 but allow tuning it either
up or down.

2008-09-17 13:42:32

by Michael Trimarchi

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Hi,

...

> Signed-off-by: Martin Knoblauch
>
> diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c
> linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
> --- linux-2.6.27-rc6-git4/fs/nfs/client.c 2008-09-17 11:35:21.000000000
> +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c 2008-09-17
> 11:55:18.000000000 +0200
> @@ -722,6 +722,11 @@ error:
> }
>
> /*
> + * NFS Client Read-Ahead factor
> +*/
> +unsigned int nfs_ra_factor;
> +
> +/*
> * Load up the server record from information gained in an fsinfo record
> */
> static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo
> *fsinfo)
> @@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
> server->rsize = NFS_MAX_FILE_IO_SIZE;
> server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >>
> PAGE_CACHE_SHIFT;
>
> - server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
> + dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, \
> + nfs_ra_factor, ra_pages: %d %d %d %d %d\n",
> + server->rsize,server->wsize,server->rpages,
> + nfs_ra_factor,server->rpages * nfs_ra_factor);
> + server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;
>
> if (server->wsize > max_rpc_payload)
> server->wsize = max_rpc_payload;
> diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c
> linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
> --- linux-2.6.27-rc6-git4/fs/nfs/inode.c 2008-09-17 11:35:21.000000000
> +0200
> +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000
> +0200
> @@ -53,6 +53,8 @@
>
> /* Default is to see 64-bit inode numbers */
> static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
> +static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
> +
>
> static void nfs_invalidate_inode(struct inode *);
> static int nfs_update_inode(struct inode *, struct nfs_fattr *);
> @@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
> #endif
> if ((err = register_nfs_fs()) != 0)
> goto out;
> +
> + if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
> + nfs_ra_factor = NFS_MAX_READAHEAD;
> + else
> + nfs_ra_factor = ra_factor;
> +

So, I think that this is not necessary because it is done (... I hope) by the
proc_dointvec_minmax handler. Is that correct?

Regards Michael


2008-09-17 14:07:59

by Peter Staubach

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Martin Knoblauch wrote:
> Hi,
>
> the following/attached patch works around a [obscure] problem when an 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I just will have to maintain it for eternity out of tree.
>
> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media like e.g. tape. This is completely transparent to the users. If the date for an "offline" file is needed, the so called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if an Linux NFS client tries to read such an offline file, performance drops to "extremely slow". After lengthly investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
>
> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing a NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.

Hi.

I was curious if a design to limit or eliminate read-ahead
activity when the server returns EJUKEBOX was considered?
Unless one can know that the server and client can get into
this situation ahead of time, how would the tunable be used?

Thanx...

ps


2008-09-17 15:40:15

by Jim Rees

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Martin Knoblauch wrote:

We never needed that in our case. But yes, would be trivial. The question
is, whether there should be a maximum, just as a safeguard.

Yes. The default should be (RPC_DEF_SLOT_TABLE - 1), and the maximum should
be max(xprt_udp_slot_table_entries, xprt_tcp_slot_table_entries) (maybe
minus one).

I wonder if it would make sense to adjust NFS_MAX_READAHEAD when
xprt_*_slot_table_entries is changed via sysctl.

2008-09-17 15:41:18

by Chuck Lever

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach <[email protected]> wrote:
> Martin Knoblauch wrote:
>>
>> Hi,
>>
>> the following/attached patch works around a [obscure] problem when an 2.6
>> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
>> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
>> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
>> problem, but the chance for a short-/mid-term solution from Sun are very
>> slim. So, being lazy, I would love to get this patch into Linux. If not, I
>> just will have to maintain it for eternity out of tree.
>>
>> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores
>> meta-data and a relatively small amount of data "online" on disk and pushes
>> old or infrequently used data to "offline" media like e.g. tape. This is
>> completely transparent to the users. If the date for an "offline" file is
>> needed, the so called "stager daemon" copies it back from the offline
>> medium. All of this works great most of the time. Now, if an Linux NFS
>> client tries to read such an offline file, performance drops to "extremely
>> slow". After lengthly investigation of tcp-dumps, mount options and
>> procedures involving black cats at midnight, we found out that the readahead
>> behaviour of the Linux NFS client causes the problem. Basically it seems to
>> issue read requests up to 15*rsize to the server. In the case of the
>> "offline" files, this behaviour causes heavy competition for the inode lock
>> between the NFSD process and the stager daemon on the Solaris server.
>>
>> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
>> the problem, but a solution will need time. Lots of it.
>> - The working solution: disable the client side readahead, or make it
>> tunable. The patch does that by introducing a NFS module parameter
>> "ra_factor" which can take values between 1 and 15 (default 15) and a
>> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>
> Hi.
>
> I was curious if a design to limit or eliminate read-ahead
> activity when the server returns EJUKEBOX was considered?
> Unless one can know that the server and client can get into
> this situation ahead of time, how would the tunable be used?

I tend to agree. A tunable is probably not a good solution in this case.

I would bet that this lock contention issue is a problem in other more
common cases, and would merit some careful analysis.

--
Chuck Lever

2008-09-17 16:03:18

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Peter Staubach <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: linux-nfs list <[email protected]>; [email protected]
> Sent: Wednesday, September 17, 2008 4:06:44 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> Martin Knoblauch wrote:
> > Hi,
> >
> > the following/attached patch works around a [obscure] problem when an 2.6 (not
> sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10
> NFS server when the underlying filesystem is of type SAM-FS. Happens with
> RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance
> for a short-/mid-term solution from Sun are very slim. So, being lazy, I would
> love to get this patch into Linux. If not, I just will have to maintain it for
> eternity out of tree.
> >
> > The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data
> and a relatively small amount of data "online" on disk and pushes old or
> infrequently used data to "offline" media like e.g. tape. This is completely
> transparent to the users. If the date for an "offline" file is needed, the so
> called "stager daemon" copies it back from the offline medium. All of this works
> great most of the time. Now, if an Linux NFS client tries to read such an
> offline file, performance drops to "extremely slow". After lengthly
> investigation of tcp-dumps, mount options and procedures involving black cats at
> midnight, we found out that the readahead behaviour of the Linux NFS client
> causes the problem. Basically it seems to issue read requests up to 15*rsize to
> the server. In the case of the "offline" files, this behaviour causes heavy
> competition for the inode lock between the NFSD process and the stager daemon on
> the Solaris server.
> >
> > - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the
> problem, but a solution will need time. Lots of it.
> > - The working solution: disable the client side readahead, or make it tunable.
> The patch does that by introducing a NFS module parameter "ra_factor" which can
> take values between 1 and 15 (default 15) and a tunable
> "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>
> Hi.
>
> I was curious if a design to limit or eliminate read-ahead
> activity when the server returns EJUKEBOX was considered?

not seriously, because that would need a lot more knowledge about the internal workings of the NFS client than I have. The Solaris client seems to work along those lines, but the code to modify the readahead window looks complicated. The Solaris client also seems to be a lot less aggressive when doing readahead; the maximum seems to be 4x8k. As far as I can see, the Linux client doesn't really care about readahead handling at all. It just fills "server->backing_dev_info.ra_pages" and leaves the handling to the MM system.

Then, there is no guarantee that EJUKEBOX is ever sent by the server. If the offline archive resides on disk (e.g. a cheap SATA array), delivery will start almost immediately and the server will not send that error. Tracked that :-( Same for already positioned tapes.

> Unless one can know that the server and client can get into
> this situation ahead of time, how would the tunable be used?
>

Basically one has to know that the problem exists (that is easily detected) and that the readahead factor is involved.

My patch of course has some pitfalls, at least:

a) as implemented, the nfs_ra_factor will be used for all NFS mounts. It should/could be per filesystem, but that needs a new mount option and I did not want to touch that code due to lack of understanding (and no time to acquire said understanding). But frankly, so far we have not observed any serious performance drawbacks with ra_factor=1.
b) changing the factor needs a remount, as the NFS client only cares about it at mount time.

Not a problem in my situation of course.

Cheers
Martin


2008-09-17 16:10:50

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Adding back LKML.


----- Original Message ----
> From: Jim Rees <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: linux-nfs list <[email protected]>
> Sent: Wednesday, September 17, 2008 5:31:12 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> Martin Knoblauch wrote:
>
> We never needed that in our case. But yes, would be trivial. The question
> is, whether there should be a maximum, just as a safeguard.
>
> Yes. The default should be (RPC_DEF_SLOT_TABLE - 1), and the maximum should
> be max(xprt_udp_slot_table_entries, xprt_tcp_slot_table_entries) (maybe
> minus one).
>

The default is NFS_MAX_READAHEAD, which is (RPC_DEF_SLOT_TABLE - 1). Incidentally, your suggested maximum seems to be the same on a default setup (minus one applied).

> I wonder if it would make sense to adjust NFS_MAX_READAHEAD when
> xprt_*_slot_table_entries is changed via sysctl.

I am not sure how useful/practical this is, as currently the ra_factor is applied at mount time.

Cheers
Martin


2008-09-17 16:15:46

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Michael Trimarchi <[email protected]>
> To: Martin Knoblauch <[email protected]>; linux-nfs list <[email protected]>
> Cc: [email protected]
> Sent: Wednesday, September 17, 2008 3:42:30 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> Hi,
>
> ...
>
> > Signed-off-by: Martin Knoblauch
> >
> > diff -urp linux-2.6.27-rc6-git4/fs/nfs/client.c
> > linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c
> > --- linux-2.6.27-rc6-git4/fs/nfs/client.c 2008-09-17 11:35:21.000000000
> > +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/client.c 2008-09-17
> > 11:55:18.000000000 +0200
> > @@ -722,6 +722,11 @@ error:
> > }
> >
> > /*
> > + * NFS Client Read-Ahead factor
> > +*/
> > +unsigned int nfs_ra_factor;
> > +
> > +/*
> > * Load up the server record from information gained in an fsinfo record
> > */
> > static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo
>
> > *fsinfo)
> > @@ -746,7 +751,11 @@ static void nfs_server_set_fsinfo(struct
> > server->rsize = NFS_MAX_FILE_IO_SIZE;
> > server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >>
> > PAGE_CACHE_SHIFT;
> >
> > - server->backing_dev_info.ra_pages = server->rpages *
> NFS_MAX_READAHEAD;
> > + dprintk("nfs_server_set_fsinfo: rsize, wsize, rpages, \
> > + nfs_ra_factor, ra_pages: %d %d %d %d %d\n",
> > + server->rsize,server->wsize,server->rpages,
> > + nfs_ra_factor,server->rpages * nfs_ra_factor);
> > + server->backing_dev_info.ra_pages = server->rpages * nfs_ra_factor;
> >
> > if (server->wsize > max_rpc_payload)
> > server->wsize = max_rpc_payload;
> > diff -urp linux-2.6.27-rc6-git4/fs/nfs/inode.c
> > linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c
> > --- linux-2.6.27-rc6-git4/fs/nfs/inode.c 2008-09-17 11:35:21.000000000
> > +0200
> > +++ linux-2.6.27-rc6-git4-nfs_ra/fs/nfs/inode.c 2008-09-17 11:45:09.000000000
> > +0200
> > @@ -53,6 +53,8 @@
> >
> > /* Default is to see 64-bit inode numbers */
> > static int enable_ino64 = NFS_64_BIT_INODE_NUMBERS_ENABLED;
> > +static unsigned int ra_factor __read_mostly = NFS_MAX_READAHEAD;
> > +
> >
> > static void nfs_invalidate_inode(struct inode *);
> > static int nfs_update_inode(struct inode *, struct nfs_fattr *);
> > @@ -1347,6 +1349,12 @@ static int __init init_nfs_fs(void)
> > #endif
> > if ((err = register_nfs_fs()) != 0)
> > goto out;
> > +
> > + if (ra_factor < 1 || ra_factor > NFS_MAX_READAHEAD)
> > + nfs_ra_factor = NFS_MAX_READAHEAD;
> > + else
> > + nfs_ra_factor = ra_factor;
> > +
>
> So, I think that this is not necessary because it is done ( ... I hope) by the
> proc_dointvec_minmax handler. It is correct?
>

That is of course true if the tunable is changed via the /proc interface. The code above handles the module parameter, which is not governed by the minmax handler.

Cheers
Martin


2008-09-17 16:23:06

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Chuck Lever <[email protected]>
> To: Peter Staubach <[email protected]>
> Cc: Martin Knoblauch <[email protected]>; linux-nfs list <[email protected]>; [email protected]
> Sent: Wednesday, September 17, 2008 5:41:15 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> On Wed, Sep 17, 2008 at 9:06 AM, Peter Staubach wrote:
> > Martin Knoblauch wrote:
> >>
> >> Hi,
> >>
> >> the following/attached patch works around a [obscure] problem when an 2.6
> >> (not sure/caring about 2.4) NFS client accesses an "offline" file on a
> >> Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS.
> >> Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux
> >> problem, but the chance for a short-/mid-term solution from Sun are very
> >> slim. So, being lazy, I would love to get this patch into Linux. If not, I
> >> just will have to maintain it for eternity out of tree.
> >>
> >> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores
> >> meta-data and a relatively small amount of data "online" on disk and pushes
> >> old or infrequently used data to "offline" media like e.g. tape. This is
> >> completely transparent to the users. If the date for an "offline" file is
> >> needed, the so called "stager daemon" copies it back from the offline
> >> medium. All of this works great most of the time. Now, if an Linux NFS
> >> client tries to read such an offline file, performance drops to "extremely
> >> slow". After lengthly investigation of tcp-dumps, mount options and
> >> procedures involving black cats at midnight, we found out that the readahead
> >> behaviour of the Linux NFS client causes the problem. Basically it seems to
> >> issue read requests up to 15*rsize to the server. In the case of the
> >> "offline" files, this behaviour causes heavy competition for the inode lock
> >> between the NFSD process and the stager daemon on the Solaris server.
> >>
> >> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks
> >> the problem, but a solution will need time. Lots of it.
> >> - The working solution: disable the client side readahead, or make it
> >> tunable. The patch does that by introducing a NFS module parameter
> >> "ra_factor" which can take values between 1 and 15 (default 15) and a
> >> tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
> >
> > Hi.
> >
> > I was curious if a design to limit or eliminate read-ahead
> > activity when the server returns EJUKEBOX was considered?
> > Unless one can know that the server and client can get into
> > this situation ahead of time, how would the tunable be used?
>
> I tend to agree. A tunable is probably not a good solution in this case.
>
> I would bet that this lock contention issue is a problem in other more
> common cases, and would merit some careful analysis.
>

Are you talking about a Solaris NFS server with SAM-FS/QFS as the backend filesystem? In a lot of tests we have not observed any problems when accessing online files with the default readahead setting. The "offline" situation seems unique. As for other NFS servers, we have never observed any readahead-related problems either.

As I already replied elsewhere, teaching the Linux NFS client to do "proper" readahead handling is beyond my knowledge. But I can test. I guess Sun engineering would be delighted if they didn't have to fix their stuff :-)

Cheers
Martin


2008-09-17 16:43:55

by Chuck Lever

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

On Wed, Sep 17, 2008 at 11:23 AM, Martin Knoblauch <[email protected]> wrote:
> [...]
>>
>> I tend to agree. A tunable is probably not a good solution in this case.
>>
>> I would bet that this lock contention issue is a problem in other more
>> common cases, and would merit some careful analysis.
>>
>
> Are you talking wrt. a Solaris NFS-Server with SAM-FS/QFS as backend filesystem?

I misread your mail, and thought the inode lock contention issue was
on the client.

--
Chuck Lever

2008-09-17 17:01:33

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Chuck Lever <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Peter Staubach <[email protected]>; linux-nfs list <[email protected]>; [email protected]
> Sent: Wednesday, September 17, 2008 6:43:48 PM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> [...]
>
> I misread your mail, and thought the inode lock contention issue was
> on the client.
>

No problem - maybe I was not articulating myself clearly. Just to restate: the lock contention happens on the server.

Cheers
Martin


2008-09-18 01:44:29

by Greg Banks

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

Martin Knoblauch wrote:
> Hi,
>
> the following/attached patch works around a [obscure] problem when an 2.6 (not sure/caring about 2.4) NFS client accesses an "offline" file on a Sun/Solaris-10 NFS server when the underlying filesystem is of type SAM-FS. Happens with RHEL4/5 and mainline kernels. Frankly, it is not a Linux problem, but the chance for a short-/mid-term solution from Sun are very slim. So, being lazy, I would love to get this patch into Linux. If not, I just will have to maintain it for eternity out of tree.
>
> The problem: SAM-FS is Suns proprietary HSM filesystem. It stores meta-data and a relatively small amount of data "online" on disk and pushes old or infrequently used data to "offline" media like e.g. tape. This is completely transparent to the users. If the date for an "offline" file is needed, the so called "stager daemon" copies it back from the offline medium. All of this works great most of the time. Now, if an Linux NFS client tries to read such an offline file, performance drops to "extremely slow".
By "extremely slow" do you mean "tape read speed"?
> After lengthly investigation of tcp-dumps, mount options and procedures involving black cats at midnight, we found out that the readahead behaviour of the Linux NFS client causes the problem. Basically it seems to issue read requests up to 15*rsize to the server. In the case of the "offline" files, this behaviour causes heavy competition for the inode lock between the NFSD process and the stager daemon on the Solaris server.
>
So, you need to

a) make your stager daemon do IO more sensibly, and

b) apply something like this patch, which adds O_NONBLOCK when knfsd does
reads, writes, and truncates, and translates -EAGAIN into NFS3ERR_JUKEBOX

http://kerneltrap.org/mailarchive/linux-fsdevel/2006/5/5/312567

and

c) make your filesystem IO interposing layer report -EAGAIN when a
process tries to do IO to an offline region in a file and O_NONBLOCK is
present.
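Greg's points b) and c) together describe a non-blocking staging protocol. A toy model of the idea (purely illustrative - the real change lives in knfsd and the filesystem's interposing layer, and all names and flag values here are just for sketching):

```python
# Toy model of Greg's suggestion: an HSM-backed NFS server answers
# NFS3ERR_JUKEBOX instead of blocking when a read hits offline data.
# Everything here is illustrative; none of this is the actual patch.

O_NONBLOCK = 0o4000          # open(2) flag value on Linux/x86
EAGAIN = 11                  # "operation would block"
NFS3ERR_JUKEBOX = 10008      # RFC 1813: data temporarily unavailable
NFS3_OK = 0

def fs_read(region_online, open_flags):
    """Greg's c): the interposing layer refuses to block on offline data."""
    if not region_online and (open_flags & O_NONBLOCK):
        return -EAGAIN
    return NFS3_OK  # data is available (or the caller accepts the wait)

def nfsd_read(region_online):
    """Greg's b): the server reads with O_NONBLOCK and maps EAGAIN."""
    rc = fs_read(region_online, O_NONBLOCK)
    if rc == -EAGAIN:
        return NFS3ERR_JUKEBOX   # the client backs off and retries later
    return NFS3_OK
```

With this scheme the client's readahead requests against an offline file all bounce back as JUKEBOX instead of serializing on the server's inode lock, which is exactly why it would address the contention without any client tunable.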
> - The real solution: fixing SAM-FS/NFSD interaction. Sun engineering acks the problem, but a solution will need time. Lots of it.
> - The working solution: disable the client side readahead, or make it tunable. The patch does that by introducing a NFS module parameter "ra_factor" which can take values between 1 and 15 (default 15) and a tunable "/proc/sys/fs/nfs/nfs_ra_factor" with the same range and default.
>
I think having a tunable for client readahead is an excellent idea,
although not to solve your particular problem. The SLES10 kernel has a
patch which does precisely that, perhaps Neil could post it.

I don't think there's a lot of point having both a module parameter and
a sysctl.

A maximum of 15 is unwise. I've found that (at least with the older
readahead mechanisms in SLES10) a multiple of 4 is required to preserve
rsize-alignment of READ rpcs to the server, which helps a lot with wide
RAID backends. So at SGI we tune client readahead to 16.

Your patch seems to have a bunch of other unrelated stuff mixed in.

--
Greg Banks, P.Engineer, SGI Australian Software Group.
Be like the squirrel.
I don't speak for SGI.


2008-09-18 03:13:43

by Andrew Morton

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks <gnb-cP1dWloDopni96+mSzHFpQC/[email protected]> wrote:

> I think having a tunable for client readahead is an excellent idea,
> although not to solve your particular problem. The SLES10 kernel has a
> patch which does precisely that, perhaps Neil could post it.
>
> I don't think there's a lot of point having both a module parameter and
> a sysctl.

mount -o remount,readahead=42
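Since mount(8) hands unrecognised option strings straight down to the filesystem, the client side of Andrew's suggestion reduces to parsing a token out of the option string. A userspace sketch of that parsing, with a hypothetical option name, default, and bounds:

```python
# Sketch of parsing a hypothetical "readahead=" NFS mount option from
# the comma-separated option string mount(8) passes down. The option
# name, default, and bounds are illustrative, not from any real patch.

def parse_readahead(opts, default=15, lo=1, hi=15):
    for opt in opts.split(","):
        if opt.startswith("readahead="):
            val = int(opt[len("readahead="):], 10)
            if not lo <= val <= hi:
                raise ValueError("readahead=%d out of range" % val)
            return val
    return default  # option absent: keep the stock behaviour
```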

2008-09-18 07:42:59

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Andrew Morton <[email protected]>
> To: Greg Banks <gnb-cP1dWloDopni96+mSzHFpQC/[email protected]>
> Cc: Martin Knoblauch <[email protected]>; linux-nfs list <[email protected]>; [email protected]
> Sent: Thursday, September 18, 2008 5:13:34 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> On Thu, 18 Sep 2008 11:42:54 +1000 Greg Banks wrote:
>
> > I think having a tunable for client readahead is an excellent idea,
> > although not to solve your particular problem. The SLES10 kernel has a
> > patch which does precisely that, perhaps Neil could post it.
> >
> > I don't think there's a lot of point having both a module parameter and
> > a sysctl.
>
> mount -o remount,readahead=42

[root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
Bad nfs mount parameter: readahead
[root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
Bad nfs mount parameter: readahead


I assume the reply was meant to say that the correct way of introducing a modifiable readahead size is to implement it as a mount option? :-) I considered it, but it seems to be more intrusive than the workaround patch. It also needs changes to userspace tools - correct?

Cheers
Martin


2008-09-18 08:18:25

by Andrew Morton

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

On Thu, 18 Sep 2008 00:42:58 -0700 (PDT) Martin Knoblauch <[email protected]> wrote:

> [...]
> > mount -o remount,readahead=42
>
> [root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
> Bad nfs mount parameter: readahead
> [root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
> Bad nfs mount parameter: readahead
>
>
> I assume the reply was meant to say that the correct way of introducing a modifyable readahead size is to implement it as a mount option ? :-)

Yes.

> I considered it, but it seems to be more intrusive than the workaround patch. It also needs changes to userspace tools - correct?

No. mount(8) will pass unrecognised options straight down into the
filesystem driver.

It's better this way - it allows the tunable to be set on a per-mount
basis rather than machine-wide.

Note that for block devices, readahead is a per-backing_dev_info thing
(and a backing_dev_info has a 1:1 relationship to a disk drive for sane
setups).

And the NFS client maintains a backing_dev_info, which appears to map
onto a server, so making the NFS readahead a per-backing_dev_info (ie:
per-server) thing might make sense. Maybe nfs makes per-server information
manipulable down in sysfs somewhere...
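Andrew's observation can be modelled in a few lines: if the client keeps one backing_dev_info per server, then a readahead setting on the bdi is effectively per-server and shared by every mount of that server. The class and function names below are made up for illustration only:

```python
# Toy model of "one backing_dev_info per server": tuning readahead on
# the bdi affects every mount of that server. Illustrative names only.

class BackingDevInfo:
    def __init__(self, ra_pages=15):
        self.ra_pages = ra_pages

_bdis = {}  # server address -> its single backing_dev_info

def nfs_get_bdi(server):
    # every mount of the same server reuses the same bdi
    return _bdis.setdefault(server, BackingDevInfo())

bdi_a = nfs_get_bdi("filer1")   # first mount of filer1
bdi_b = nfs_get_bdi("filer1")   # second mount, same server
bdi_a.ra_pages = 4              # tune readahead once, per server
```

This is also why a per-bdi knob sits between a machine-wide sysctl and a per-mount option in granularity: all mounts of one server move together.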


2008-09-18 08:19:30

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Greg Banks <gnb-cP1dWloDopni96+mSzHFpQC/[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: linux-nfs list <[email protected]>; [email protected]
> Sent: Thursday, September 18, 2008 3:42:54 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> Martin Knoblauch wrote:
> > [...]
> By "extremely slow" do you mean "tape read speed"?
> > [...]
>
Hi Greg,

my impression is that there is some confusion here, likely caused by me not writing a good description :-(

> So, you need to
>
> a) make your stager daemon do IO more sensibly, and
>

As I am not affiliated with Sun in any way, it is "their" stager daemon. And I told "them", but a solution will not come before the next major release :-(

> b) apply something like this patch which adds O_NONBLOCK when knfsd does
> reads writes and truncates and translates -EAGAIN into NFS3ERR_JUKEBOX
>
> http://kerneltrap.org/mailarchive/linux-fsdevel/2006/5/5/312567
>

OK, what does knfsd have to do with it? The NFS server is Solaris 10 on SPARC.

> and
>
> c) make your filesystem IO interposing layer report -EAGAIN when a
> process tries to do IO to an offline region in a file and O_NONBLOCK is
> present.

I leave that to "them" :-)

> > [...]
> I think having a tunable for client readahead is an excellent idea,
> although not to solve your particular problem. The SLES10 kernel has a
> patch which does precisely that, perhaps Neil could post it.
>
> I don't think there's a lot of point having both a module parameter and
> a sysctl.
>

Actually there is a good reason. The module parameter can be used to set the value at load time and never be bothered with again. The sysctl is very convenient when doing experiments.

As Andrew already pointed out, the best solution would be a mount option. But that seems much more involved than my workaround patch.

> A maximum of 15 is unwise. I've found that (at least with the older
> readahead mechanisms in SLES10) a multiple of 4 is required to preserve
> rsize-alignment of READ rpcs to the server, which helps a lot with wide
> RAID backends. So in SGI we tune client readahead to 16.
>

15 is the value that the Linux NFS client has used, at least since 2.6.3. As it has not been tunable up to today, the comment seems moot :-) But it opens up two questions:

a) should 1 be the minimum, or 0?
b) can the backing_dev_info.ra_pages field safely be set to something higher than 15?
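For question b), the arithmetic behind the window - assuming the patch scales the per-mount bdi the way the stock 2.6 client sizes it, i.e. ra_pages = factor × pages-per-rsize; the numbers below are illustrative, not taken from the patch:

```python
# How the readahead window presumably follows from the factor, under
# the assumption that ra_pages = ra_factor * (rsize / PAGE_SIZE), as
# in the stock client's "15 * rpages" sizing. Bounds mirror the
# patch description (1..15).

PAGE_SIZE = 4096

def nfs_ra_pages(rsize, ra_factor):
    if not 1 <= ra_factor <= 15:
        raise ValueError("ra_factor out of range")
    rpages = rsize // PAGE_SIZE      # pages per READ rpc
    return ra_factor * rpages
```

With rsize = 32 KiB, factor 15 yields a 120-page (480 KiB) window - the stock behaviour - while factor 1 yields one rsize unit (8 pages). A factor of 0 would disable client readahead entirely, which bears on question a).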

> Your patch seems to have a bunch of other unrelated stuff mixed in.
>

Yeah, someone already pointed out that the Makefile hunk does not belong there. But you say "a bunch" - anything else?

Cheers
Martin
PS: Did we ever meet/mail when I was at SGI (1991-1997)?


2008-09-18 08:38:59

by Martin Knoblauch

Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable

----- Original Message ----

> From: Andrew Morton <[email protected]>
> To: Martin Knoblauch <[email protected]>
> Cc: Greg Banks <gnb-cP1dWloDopni96+mSzHFpQC/[email protected]>; linux-nfs list <[email protected]>; [email protected]
> Sent: Thursday, September 18, 2008 10:18:18 AM
> Subject: Re: [RFC][Resend] Make NFS-Client readahead tunable
>
> On Thu, 18 Sep 2008 00:42:58 -0700 (PDT) Martin Knoblauch
> wrote:
>
> > [...]
> > > mount -o remount,readahead=42
> >
> > [root@lpsdm52 ~]# mount -o remount,readahead=42 /net/spsdms/fs13
> > Bad nfs mount parameter: readahead
> > [root@lpsdm52 ~]# mount -o readahead=42 /net/spsdms/fs13
> > Bad nfs mount parameter: readahead
> >
> >
> > I assume the reply was meant to say that the correct way of introducing a
> modifyable readahead size is to implement it as a mount option ? :-)
>
> Yes.
>

:-)

> > I considered it, but it seems to be more intrusive than the workaround patch.
> It also needs changes to userspace tools - correct?
>
> No. mount(8) will pass unrecognised options straight down into the
> filesystem driver.
>

Has that always been the case, or is it a recent change? I have to support RHEL4 userland, which is not really new.

> It's better this way - it allows the tunable to be set on a per-mount
> basis rather than machine-wide.
>

No question about that. I just thought it would be too complicated. Maybe I erred.

> Note that for block devices, readahead is a per-backing_dev_info thing
> (and a backing_dev_info has a 1:1 relationship to a disk drive for sane
> setups).
>
> And the NFS client maintains a backing_dev_info, which appears to map
> onto a server, so making the NFS readahead a per-backing_dev_info (ie:
> per server) thing might make sense. Maybe nfs makes per-server information
> manipulatable down in sysfs somewhere..

I believe Peter wanted to add per-bdi stuff for NFS some time ago. Not sure what came of it.

Cheers
Martin