2000-11-15 23:55:36

by Ivan Kanis

[permalink] [raw]
Subject: [BUG] knfsd causes file system corruption when files are locked.

[1.] knfsd causes file system corruption when files are locked.

[2.] Lock down a file using the NLM_SHARE sharing mechanism. Remove
the file. Unlock the file using NLM_UNSHARE. The filesystem does not
recover the file space. I am running this on ext2fs. Fsck-ing the
filesystem does not help. The only way to recover the space is to
reformat the partition.

[3.] knfsd, lock, NLM_SHARE, NLM_UNSHARE

[4.] Linux version 2.2.16 (root@jedi) (gcc version 2.7.2.3)

[5.] N/A

[6.] This test.c program will reproduce the problem. You need to compile it
on a Solaris machine because Linux fcntl does not support NLM_SHARE.

-----start here
#include <fcntl.h>
#include <errno.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
struct fshare lck;
int fd, ret;
if (argc != 2) {
printf ("Usage: %s file to lock\n", argv[0]);
return 1;
}
fd = open (argv[1], O_WRONLY);
memset (&lck, 0, sizeof (struct flock));
lck.f_access = F_WRACC;
lck.f_deny = F_NODNY;
ret = fcntl (fd, F_SHARE, &lck);
unlink (argv[1]);
ret = fcntl (fd, F_UNSHARE, &lck);

return 0;
}
-----end here

Step to reproduce the problem

- Compile the program:
gcc test.c -o test

- Mount a Linux nfs partition on Solaris: (Remember the partition will
get corrupted, use a partition that you don't care about.)
mount -o vers=2 jedi:/sandbox /mnt

- Create a chunk of data on /mnt
dd if=/dev/zero of=/mnt/chunk count=10000

- Do a df before running the program

- Run the test program
./test /mnt/chunk

- Run df again. The free space reamains the same. The space is gone
till you reformat the partition.


[7.] This bug was seen on a Debian 2.2 machine. We have seen the same
thing happens on systems running Red Hat 6.2 and TurboLinux 6.0 distributions.

[7.1] Environment:
Kernel modules found
Gnu C 2.95.2
Binutils 2.9.5.0.37
Linux C Library ..
Dynamic Linker (ld.so) 1.9.11
ls: /usr/lib/libg++.so: No such file or directory
Procps 2.0.6
Mount 2.10f
Net-tools (1999-04-20)
Kbd 0.99
Sh-utils 2.0
Sh-utils Parker.
Sh-utils
Sh-utils Inc.
Sh-utils NO
Sh-utils PURPOSE.

[7.2] Processor information

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 5
model name : Pentium II (Deschutes)
stepping : 2
cpu MHz : 447.700
cache size : 512 KB
fdiv_bug : no
hlt_bug : no
sep_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
bogomips : 891.29

[7.3] Module information

aic7xxx 124328 1
nfsd 140436 8 (autoclean)
snd-pcm-oss 16840 1 (autoclean)
snd-pcm-plugin 13000 0 (autoclean) [snd-pcm-oss]
snd-mixer-oss 4308 1 (autoclean) [snd-pcm-oss]
snd-card-cs4236 5224 2
snd-mpu401-uart 2356 0 [snd-card-cs4236]
snd-rawmidi 9752 0 [snd-mpu401-uart]
snd-seq-device 3476 0 [snd-rawmidi]
isapnp 27572 0 [snd-card-cs4236]
snd-cs4236 20580 0 [snd-card-cs4236]
snd-cs4231 19008 0 [snd-card-cs4236 snd-cs4236]
snd-mixer 23536 0 [snd-mixer-oss snd-cs4236 snd-cs4231]
snd-pcm 29784 0 [snd-pcm-oss snd-pcm-plugin snd-cs4231]
snd-opl3 4328 0 [snd-card-cs4236]
snd-timer 8224 0 [snd-cs4231 snd-pcm snd-opl3]
snd-hwdep 3052 0 [snd-opl3]
snd 36300 1 [snd-pcm-oss snd-pcm-plugin snd-mixer-oss snd-card-cs4236 snd-mpu401-uart snd-rawmidi snd-seq-device snd-cs4236 snd-cs4231 snd-mixer snd-pcm snd-opl3 snd-timer snd-hwdep]
soundcore 2448 3 [snd]
3c59x 18212 1

[7.4] SCSI Information

Attached devices:
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: NEC Model: CD-ROM DRIVE:465 Rev: 1.03
Type: CD-ROM ANSI SCSI revision: 02

[7.5] N/A

[8.] Here is a trace from the Solaris snoop program while the test
program mentioned above is running:

sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 chunk
jedi.wrq.com -> sun NFS R LOOKUP2 OK FH=D308
sun -> jedi.wrq.com NLM C SHARE3 OH=0009 FH=D308 Mode=0 Access=2
jedi.wrq.com -> sun NLM R SHARE3 OH=0009 granted 0
sun -> jedi.wrq.com NLM C UNSHARE3 OH=000A FH=D308 Mode=0 Access=2
jedi.wrq.com -> sun NLM R UNSHARE3 OH=000A granted 0
sun -> jedi.wrq.com NFS C GETATTR2 FH=2344
jedi.wrq.com -> sun NFS R GETATTR2 OK
sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 chunk
jedi.wrq.com -> sun NFS R LOOKUP2 OK FH=D308
sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 .nfs5FC7
jedi.wrq.com -> sun NFS R LOOKUP2 No such file or directory
sun -> jedi.wrq.com NFS C RENAME2 FH=2344 chunk to FH=2344 .nfs5FC7
jedi.wrq.com -> sun NFS R RENAME2 OK
sun -> jedi.wrq.com NLM C UNSHARE3 OH=000B FH=D308 Mode=0 Access=1
jedi.wrq.com -> sun NLM R UNSHARE3 OH=000B granted 0
sun -> jedi.wrq.com NFS C REMOVE2 FH=2344 .nfs5FC7
jedi.wrq.com -> sun NFS R REMOVE2 OK

/


2000-11-16 00:30:48

by NeilBrown

[permalink] [raw]
Subject: Re: [BUG] knfsd causes file system corruption when files are locked.

On Wednesday November 15, [email protected] wrote:
> [1.] knfsd causes file system corruption when files are locked.
>
> [2.] Lock down a file using the NLM_SHARE sharing mechanism. Remove
> the file. Unlock the file using NLM_UNSHARE. The filesystem does not
> recover the file space. I am running this on ext2fs. Fsck-ing the
> filesystem does not help. The only way to recover the space is to
> reformat the partition.
>
> [3.] knfsd, lock, NLM_SHARE, NLM_UNSHARE
>
> [4.] Linux version 2.2.16 (root@jedi) (gcc version 2.7.2.3)

Lots of changes have gone into knfsd since 2.2.16. Could you please
try again with either a later 2.2.18pre kernel, or 2.2.16 with patches
from
http://nfs.sourceforge.net/
applied? Thanks.

Quick guide is:
2.2.16
plus
http://www.fys.uio.no/~trondmy/src/nfsv3-old/linux-2.2.16-nfsv3-0.22.0.dif.bz2
plus
http://download.sourceforge.net/nfs/kernel-nfs-dhiggen_merge-2.0.gz

NeilBrown

>
> [5.] N/A
>
> [6.] This test.c program will reproduce the problem. You need to compile it
> on a Solaris machine because Linux fcntl does not support NLM_SHARE.
>
> -----start here
> #include <fcntl.h>
> #include <errno.h>
> #include <stdio.h>
>
> int main (int argc, char *argv[])
> {
> struct fshare lck;
> int fd, ret;
> if (argc != 2) {
> printf ("Usage: %s file to lock\n", argv[0]);
> return 1;
> }
> fd = open (argv[1], O_WRONLY);
> memset (&lck, 0, sizeof (struct flock));
> lck.f_access = F_WRACC;
> lck.f_deny = F_NODNY;
> ret = fcntl (fd, F_SHARE, &lck);
> unlink (argv[1]);
> ret = fcntl (fd, F_UNSHARE, &lck);
>
> return 0;
> }
> -----end here
>
> Step to reproduce the problem
>
> - Compile the program:
> gcc test.c -o test
>
> - Mount a Linux nfs partition on Solaris: (Remember the partition will
> get corrupted, use a partition that you don't care about.)
> mount -o vers=2 jedi:/sandbox /mnt
>
> - Create a chunk of data on /mnt
> dd if=/dev/zero of=/mnt/chunk count=10000
>
> - Do a df before running the program
>
> - Run the test program
> ./test /mnt/chunk
>
> - Run df again. The free space reamains the same. The space is gone
> till you reformat the partition.
>
>
> [7.] This bug was seen on a Debian 2.2 machine. We have seen the same
> thing happens on systems running Red Hat 6.2 and TurboLinux 6.0 distributions.
>
> [7.1] Environment:
> Kernel modules found
> Gnu C 2.95.2
> Binutils 2.9.5.0.37
> Linux C Library ..
> Dynamic Linker (ld.so) 1.9.11
> ls: /usr/lib/libg++.so: No such file or directory
> Procps 2.0.6
> Mount 2.10f
> Net-tools (1999-04-20)
> Kbd 0.99
> Sh-utils 2.0
> Sh-utils Parker.
> Sh-utils
> Sh-utils Inc.
> Sh-utils NO
> Sh-utils PURPOSE.
>
> [7.2] Processor information
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 5
> model name : Pentium II (Deschutes)
> stepping : 2
> cpu MHz : 447.700
> cache size : 512 KB
> fdiv_bug : no
> hlt_bug : no
> sep_bug : no
> f00f_bug : no
> coma_bug : no
> fpu : yes
> fpu_exception : yes
> cpuid level : 2
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr
> bogomips : 891.29
>
> [7.3] Module information
>
> aic7xxx 124328 1
> nfsd 140436 8 (autoclean)
> snd-pcm-oss 16840 1 (autoclean)
> snd-pcm-plugin 13000 0 (autoclean) [snd-pcm-oss]
> snd-mixer-oss 4308 1 (autoclean) [snd-pcm-oss]
> snd-card-cs4236 5224 2
> snd-mpu401-uart 2356 0 [snd-card-cs4236]
> snd-rawmidi 9752 0 [snd-mpu401-uart]
> snd-seq-device 3476 0 [snd-rawmidi]
> isapnp 27572 0 [snd-card-cs4236]
> snd-cs4236 20580 0 [snd-card-cs4236]
> snd-cs4231 19008 0 [snd-card-cs4236 snd-cs4236]
> snd-mixer 23536 0 [snd-mixer-oss snd-cs4236 snd-cs4231]
> snd-pcm 29784 0 [snd-pcm-oss snd-pcm-plugin snd-cs4231]
> snd-opl3 4328 0 [snd-card-cs4236]
> snd-timer 8224 0 [snd-cs4231 snd-pcm snd-opl3]
> snd-hwdep 3052 0 [snd-opl3]
> snd 36300 1 [snd-pcm-oss snd-pcm-plugin snd-mixer-oss snd-card-cs4236 snd-mpu401-uart snd-rawmidi snd-seq-device snd-cs4236 snd-cs4231 snd-mixer snd-pcm snd-opl3 snd-timer snd-hwdep]
> soundcore 2448 3 [snd]
> 3c59x 18212 1
>
> [7.4] SCSI Information
>
> Attached devices:
> Host: scsi0 Channel: 00 Id: 05 Lun: 00
> Vendor: NEC Model: CD-ROM DRIVE:465 Rev: 1.03
> Type: CD-ROM ANSI SCSI revision: 02
>
> [7.5] N/A
>
> [8.] Here is a trace from the Solaris snoop program while the test
> program mentioned above is running:
>
> sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 chunk
> jedi.wrq.com -> sun NFS R LOOKUP2 OK FH=D308
> sun -> jedi.wrq.com NLM C SHARE3 OH=0009 FH=D308 Mode=0 Access=2
> jedi.wrq.com -> sun NLM R SHARE3 OH=0009 granted 0
> sun -> jedi.wrq.com NLM C UNSHARE3 OH=000A FH=D308 Mode=0 Access=2
> jedi.wrq.com -> sun NLM R UNSHARE3 OH=000A granted 0
> sun -> jedi.wrq.com NFS C GETATTR2 FH=2344
> jedi.wrq.com -> sun NFS R GETATTR2 OK
> sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 chunk
> jedi.wrq.com -> sun NFS R LOOKUP2 OK FH=D308
> sun -> jedi.wrq.com NFS C LOOKUP2 FH=2344 .nfs5FC7
> jedi.wrq.com -> sun NFS R LOOKUP2 No such file or directory
> sun -> jedi.wrq.com NFS C RENAME2 FH=2344 chunk to FH=2344 .nfs5FC7
> jedi.wrq.com -> sun NFS R RENAME2 OK
> sun -> jedi.wrq.com NLM C UNSHARE3 OH=000B FH=D308 Mode=0 Access=1
> jedi.wrq.com -> sun NLM R UNSHARE3 OH=000B granted 0
> sun -> jedi.wrq.com NFS C REMOVE2 FH=2344 .nfs5FC7
> jedi.wrq.com -> sun NFS R REMOVE2 OK
>
> /
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-11-16 03:14:02

by Ivan Kanis

[permalink] [raw]
Subject: Re: [BUG] knfsd causes file system corruption when files are locked.


> On Wednesday November 15, [email protected] wrote:
Ivan> [1.] knfsd causes file system corruption when files are locked.
Ivan>
Ivan> [2.] Lock down a file using the NLM_SHARE sharing
Ivan> mechanism. Remove the file. Unlock the file using
Ivan> NLM_UNSHARE. The filesystem does not recover the file space. I
Ivan> am running this on ext2fs. Fsck-ing the filesystem does not
Ivan> help. The only way to recover the space is to reformat the
Ivan> partition.
Ivan>
Ivan> [3.] knfsd, lock, NLM_SHARE, NLM_UNSHARE
Ivan>
Ivan> [4.] Linux version 2.2.16 (root@jedi) (gcc version 2.7.2.3)


>>>>> "Neil" == Neil Brown <[email protected]> writes:

Neil> Lots of changes have gone into knfsd since 2.2.16. Could
Neil> you please try again with either a later 2.2.18pre kernel,
Neil> or 2.2.16 with patches from
Neil> http://nfs.sourceforge.net/
Neil> applied? Thanks.

Neil> Quick guide is:
Neil> 2.2.16
Neil> plus
Neil> http://www.fys.uio.no/~trondmy/src/nfsv3-old/linux-2.2.16-nfsv3-0.22.0.dif.bz2
Neil> plus
Neil> http://download.sourceforge.net/nfs/kernel-nfs-dhiggen_merge-2.0.gz

Neil> NeilBrown

I can reproduce the bug using:

Linux version 2.2.18pre21 (root@jedi) (gcc version 2.7.2.3)

I don't have to type vers=2 to mount a linux nfs share on Solaris
(yeah!)

Ivan

2000-11-16 15:07:00

by Trond Myklebust

[permalink] [raw]
Subject: Re: [BUG] knfsd causes file system corruption when files are locked.

>>>>> " " == Ivan Kanis <[email protected]> writes:

Ivan> space. I am running this on ext2fs. Fsck-ing the filesystem
Ivan> does not help. The only way to recover the space is to
Ivan> reformat the partition.
Ivan>
Ivan> [3.] knfsd, lock, NLM_SHARE, NLM_UNSHARE
Ivan>
Ivan> [4.] Linux version 2.2.16 (root@jedi) (gcc version 2.7.2.3)


Please dig around in dejanews for the locking patch I posted on l-k
last week (to be applied on top of 2.2.18pre21). It fixes 3 leaks in
the locking code (amongst them the share leak).

If you can't find it, I'll be happy to post it via private mail...

Cheers,
Trond

2000-11-20 19:14:06

by Ivan Kanis

[permalink] [raw]
Subject: Re: [BUG] knfsd causes file system corruption when files are locked.

>>>>> "Trond" == Trond Myklebust <[email protected]> writes:
>>>>> " " == Ivan Kanis <[email protected]> writes:

Ivan> space. I am running this on ext2fs. Fsck-ing the filesystem
Ivan> does not help. The only way to recover the space is to
Ivan> reformat the partition.
Ivan>
Ivan> [3.] knfsd, lock, NLM_SHARE, NLM_UNSHARE
Ivan>
Ivan> [4.] Linux version 2.2.16 (root@jedi) (gcc version 2.7.2.3)


Trond> Please dig around in dejanews for the locking patch I
Trond> posted on l-k last week (to be applied on top of
Trond> 2.2.18pre21). It fixes 3 leaks in the locking code (amongst
Trond> them the share leak).

Hi Trond,

Fantastic news: your patches fixed the bug! I hope it will make it in
the kernel pretty soon, I will be testing it some more today.

Thank you,

Ivan Kanis