2001-12-20 09:45:14

by Steffen Persvold

Subject: Re: 2.4.8 NFS Problems

Hi guys,

I was searching on Google for reports of the problem I'm seeing with our NFS server/clients and
found this thread. It looks much the same (at least the resulting EIO is the same).

Parts of old message :

>From: Mike Black ([email protected])
>Date: Sep 05 2001

>I've been getting random NFS EIO errors for a few months but
>now it's repeatable.
>Trying to copy a large file from one 2.4.8 SMP box to another
>is consistently failing (at different offsets each time).



Our setup is like this :

Server:
RedHat 7.2 - kernel 2.4.9-13smp
nfs-utils-0.3.1-13.7.2.1
ext3 filesystem (73GB)


Clients:
ia32 client - RedHat 6.2 - kernel 2.2.19-6.2.7enterprise
mount-2.10r-0.6.x


alpha client - RedHat 6.2 - kernel 2.2.19 (vanilla)
mount-2.10r-5


ia64 client - RedHat 7.1 - kernel 2.4.3-12smp
mount-2.10r-5



I've seen the "Input/Output error" problem only on the Alpha and the IA64 clients, and it
occurs when making a static library (with 'ar'). The message looks like this:

ar: xxxxxx/libmpi.a: Input/output error


The mount points are mounted like this:

ia32 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

alpha client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,addr=huey 0 0

ia64 client:
huey:/export/home/mpitest /home/mpitest nfs rw,v3,rsize=8192,wsize=8192,hard,udp,lock,addr=huey 0 0


I don't know why the "hard" and "lock" options don't appear on ia32 and alpha, but this might be
related to the /proc/mounts interface on the running kernel (these clients are running 2.2.19 while
the ia64 client is running 2.4). The automount entry looks like this:

/home auto_home rsize=8192,wsize=8192

So according to the nfs man page, the "hard" option should be the default:

hard If an NFS file operation has a major timeout then report "server not
responding" on the console and continue retrying indefinitely. This
is the default.
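
As a rough aid for comparing the option strings above, here is a minimal C sketch that checks whether a comma-separated mount option list names a given option. The function name mount_has_option is hypothetical and not part of any mount or kernel API. It rests on an assumption: that a kernel which omits "hard" from /proc/mounts would still print "soft" for a soft mount, so the absence of both would be consistent with the default hard behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel or libc API): return true if the
 * comma-separated option list `opts` contains `name`, either as a bare
 * flag ("hard") or as the key of a key=value pair ("rsize=8192").
 */
bool mount_has_option(const char *opts, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = opts;

    assert(name_len > 0);
    while (*p != '\0') {
        size_t tok_len = strcspn(p, ",");   /* length of this token */
        size_t key_len = 0;

        /* The key is everything before '=' (or the whole token). */
        while (key_len < tok_len && p[key_len] != '=')
            key_len++;
        if (key_len == name_len && strncmp(p, name, name_len) == 0)
            return true;

        p += tok_len;
        if (*p == ',')
            p++;                            /* skip the separator */
    }
    return false;
}
```

With the strings listed above, the ia64 entry reports "hard" as present, while the ia32 and alpha entries report neither "hard" nor "soft".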


So what could be the problem here? Is it an NFS server bug, an NFS client bug or an NFS/ext3 bug? We
used to run RedHat 7.0 on this server with the 2.2.19-enterprise kernel, nfs-utils-0.3.1-7 and
an ext2 filesystem. This problem did not occur back then.

Thanks,
--
Steffen Persvold | Scalable Linux Systems | Try out the world's best
mailto:[email protected] | http://www.scali.com | performing MPI implementation:
Tel: (+47) 2262 8950 | Olaf Helsets vei 6 | - ScaMPI 1.12.2 -
Fax: (+47) 2262 8951 | N0621 Oslo, NORWAY | >300MBytes/s and <4uS latency


2001-12-20 11:11:26

by Trond Myklebust

Subject: Re: 2.4.8 NFS Problems

>>>>> " " == Steffen Persvold <[email protected]> writes:

>> I've been getting random NFS EIO errors for a few months but
>> now it's repeatable. Trying to copy a large file from one 2.4.8
>> SMP box to another is consistently failing (at different
>> offsets each time).

Please try the patch on

http://www.fys.uio.no/~trondmy/src/2.4.17/linux-2.4.17-fattr.dif

that fixes at least 1 such EIO error which was discovered using fsx.

Cheers,
Trond
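
For readers unfamiliar with fsx: it exercises a file with random operations and checks what it reads back against what it expects. The sketch below illustrates only that core idea in C, using POSIX pread/pwrite against an in-memory shadow copy. It is not the real fsx, which also covers truncate and mmap operations. The name fsx_sketch, the fixed file size and the return convention are all assumptions made for illustration.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for pread/pwrite/ftruncate under strict -std */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SKETCH_FILE_SIZE 65536   /* arbitrary size chosen for the sketch */

/*
 * Hypothetical fsx-style check: perform `nops` random reads and writes on
 * `path` and compare every read with an in-memory shadow copy.
 * Returns 0 if all operations succeeded and matched, -1 if the file could
 * not be set up, or the 1-based index of the first failing operation.
 */
int fsx_sketch(const char *path, unsigned int seed, int nops)
{
    static unsigned char shadow[SKETCH_FILE_SIZE];
    static unsigned char buf[SKETCH_FILE_SIZE];
    int fd, op, result = 0;

    assert(nops > 0);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    /* Extend the file so unwritten ranges read back as zeros. */
    if (ftruncate(fd, SKETCH_FILE_SIZE) != 0) {
        close(fd);
        return -1;
    }
    memset(shadow, 0, sizeof shadow);
    srand(seed);

    for (op = 1; op <= nops; op++) {
        size_t off = (size_t)rand() % SKETCH_FILE_SIZE;
        size_t len = 1 + (size_t)rand() % (SKETCH_FILE_SIZE - off);

        if (rand() % 2) {
            /* Write random bytes and record them in the shadow copy. */
            for (size_t i = 0; i < len; i++)
                shadow[off + i] = (unsigned char)rand();
            if (pwrite(fd, shadow + off, len, (off_t)off) != (ssize_t)len) {
                result = op;
                break;
            }
        } else {
            /* Read back and compare; EIO or stale data both count. */
            if (pread(fd, buf, len, (off_t)off) != (ssize_t)len ||
                memcmp(buf, shadow + off, len) != 0) {
                result = op;
                break;
            }
        }
    }
    close(fd);
    return result;
}
```

Pointing path at a file on the NFS mount and varying the seed would show whether a failure is repeatable. A nonzero return gives the index of the first operation that failed or returned unexpected data.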

2001-12-20 14:41:06

by Steffen Persvold

Subject: Re: 2.4.8 NFS Problems

Trond Myklebust wrote:
>
> >>>>> " " == Steffen Persvold <[email protected]> writes:
>
> >> I've been getting random NFS EIO errors for a few months but
> >> now it's repeatable. Trying to copy a large file from one 2.4.8
> >> SMP box to another is consistently failing (at different
> >> offsets each time).
>
> Please try the patch on
>
> http://www.fys.uio.no/~trondmy/src/2.4.17/linux-2.4.17-fattr.dif
>
> that fixes at least 1 such EIO error which was discovered using fsx.
>

I can do that, but since one of the clients reporting this problem is an Alpha machine running
2.2.19, the patch won't do much good (not that the patch is architecture dependent, but it's only for
2.4.17). Has this problem been there since 2.2, or is it a new "feature" in the "stable" #:) 2.4
kernels?

Regards,

2001-12-20 20:28:03

by Trond Myklebust

Subject: Re: 2.4.8 NFS Problems

>>>>> " " == Steffen Persvold <[email protected]> writes:

> I can do that, but since one of the clients reporting this
> problem is an Alpha machine running
> 2.2.19 the patch won't do much good (not that the patch is
> architecture dependent, but it's only for
> 2.4.17). Has this problem been there since 2.2 or is it a new
> "feature" in the "stable" #:) 2.4 kernels?

All the problems fixed by the patch should be present in 2.2.19 too. I
don't really have time to backport the whole thing, but I've appended
a backport of the bit that is directly relevant to the EIO error.

Cheers,
Trond

--- linux-2.2.19-up/fs/nfs/read.c.orig	Sun Mar 25 18:37:38 2001
+++ linux-2.2.19-up/fs/nfs/read.c	Thu Dec 20 21:25:13 2001
@@ -420,7 +420,7 @@
 {
 	struct nfs_read_data *data = (struct nfs_read_data *) task->tk_calldata;
 	struct inode *inode = data->inode;
-	int count = data->res.count;
+	unsigned int count = data->res.count;

 	dprintk("NFS: %4d nfs_readpage_result, (status %d)\n",
 		task->tk_pid, task->tk_status);
@@ -431,10 +431,15 @@
 		struct page *page = req->wb_page;
 		nfs_list_remove_request(req);

-		if (task->tk_status >= 0 && count >= 0) {
+		if (task->tk_status >= 0) {
+			char *p = page_address(page);
+			if (count < PAGE_CACHE_SIZE) {
+				memset(p + count, 0, PAGE_CACHE_SIZE - count);
+				count = 0;
+			} else
+				count -= PAGE_CACHE_SIZE;
 			flush_dcache_page(page_address(page)); /* Is this correct? */
 			set_bit(PG_uptodate, &page->flags);
-			count -= PAGE_CACHE_SIZE;
 		} else
 			set_bit(PG_error, &page->flags);
 		nfs_unlock_page(page);
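
To make the effect of the change easier to see outside the kernel, here is a small userspace model of the two versions of the loop body. It is only a sketch. The struct sketch_page type, the SKETCH_PAGE_SIZE constant and the function names are hypothetical stand-ins for struct page, PAGE_CACHE_SIZE and the code in nfs_readpage_result, and the 4096-byte page size is an assumption. The flags stand in for PG_uptodate and PG_error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u   /* assumed stand-in for PAGE_CACHE_SIZE */

/* Hypothetical userspace stand-in for a page cache page. */
struct sketch_page {
    unsigned char data[SKETCH_PAGE_SIZE];
    bool uptodate;   /* models PG_uptodate */
    bool error;      /* models PG_error */
};

/* Fill every page with `fill` and clear both flags. */
void sketch_pages_init(struct sketch_page *pages, size_t npages,
                       unsigned char fill)
{
    for (size_t i = 0; i < npages; i++) {
        memset(pages[i].data, fill, SKETCH_PAGE_SIZE);
        pages[i].uptodate = false;
        pages[i].error = false;
    }
}

/* Model of the code before the patch: a signed count that goes negative
 * after a short read, so the remaining pages are flagged as errors. */
void complete_read_old(struct sketch_page *pages, size_t npages,
                       int status, int count)
{
    for (size_t i = 0; i < npages; i++) {
        if (status >= 0 && count >= 0) {
            pages[i].uptodate = true;
            count -= (int)SKETCH_PAGE_SIZE;
        } else
            pages[i].error = true;
    }
}

/* Model of the code after the patch: an unsigned count, with the unread
 * tail of each page zero-filled instead of being treated as an error. */
void complete_read_new(struct sketch_page *pages, size_t npages,
                       int status, unsigned int count)
{
    for (size_t i = 0; i < npages; i++) {
        if (status >= 0) {
            if (count < SKETCH_PAGE_SIZE) {
                memset(pages[i].data + count, 0, SKETCH_PAGE_SIZE - count);
                count = 0;
            } else
                count -= SKETCH_PAGE_SIZE;
            pages[i].uptodate = true;
        } else
            pages[i].error = true;
    }
}
```

In this model, a successful two-page read that returns only 100 bytes leaves the second page flagged as an error under the old logic. Under the new logic both pages are marked up to date and everything past byte 100 is zeroed.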