2003-06-13 13:25:12

by Christian Jaeger

Subject: Lockups with loop'ed sparse files on reiserfs?

Hello

I've experienced 3 lockups in the last few days, all while using
sparse files. Could also be problems with UML, SKAS, raid5 over loop
device, or loop devices with vfat files, but it looks like the only
common thing is sparse files on reiserfs.

1.) kernel 2.4.20 from debian unstable (= kernel.org kernel with
quite a few security and other patches), additionally patched with
kernel-patch-skas 3-1 from debian. Started user-mode-linux using a
sparse file with an ext2 filesystem on it, using tap0 networking, did
apt-get upgrade inside this uml (which started to download (and
already unpack?) quite a bit of stuff), halfway through the whole
(host) system froze. Still responded to pings, but telnet $host 80
would not show any activity from running apache. Went to the server
room, I could change virtual terminals with Alt-<number>, but could
not log in. Reset.

2.) same kernel:
- created 6 sparse files of 650MB each, on reiserfs filesystems (some
of them on the same filesystem), and 2 files of 650MB on a vfat
filesystem.
- Tied them to /dev/loop*,
- mdadm /dev/md0 -C -l 5 -n 7 -x 1 /dev/loop*
- then (while the array was building) mkreiserfs /dev/md0,
- mount /dev/md0 /mnt/md0
- cd /mnt/md0; netcat -l -p "$port" | multifeed '|' sh -c 'exec
md5sum >&2' '&' cat | gpg | lzop -d | tar xf -
(where multifeed is a C program by myself feeding the data to
multiple processes)
basically fetch data from tcp and untar it onto the filesystem.
After about 500MB of data has been written onto /mnt/md0, the box
froze. Still responded to ping, but not to telnet $host 80. Could
switch vt's, type root and enter password, but didn't get a login.

3.) kernel 2.4.18 from kernel.org (the machine ran without any
problem (except for sporadically switching off dma on /dev/hda) with
this kernel for about a year):
Did same thing as mentioned under 2.) (rm -rf /mnt/md0/* before
starting the write again). This time it happened already after
filling the md partition with about 200MB. And this time, while still
responding to pings and being able to switch vt's, it wouldn't react
to hitting the keys 'root'.
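
A minimal Python sketch of what "sparse" means for the image files in the steps above, assuming a Linux filesystem that supports holes. The helper names (make_sparse_file, apparent_and_allocated) and the file name raid5_demo are illustrative and do not come from the report. The file gets its full apparent size at once, but blocks are only allocated when data is written, so the host partition can run out of space long after the file was created.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Create a file with the given apparent size without writing data blocks."""
    with open(path, "wb") as f:
        f.truncate(size_bytes)


def apparent_and_allocated(path):
    """Return (apparent size, allocated bytes) for a file.

    Assumption: st_blocks counts 512-byte units, as it does on Linux.
    """
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        image = os.path.join(tmpdir, "raid5_demo")  # hypothetical file name
        make_sparse_file(image, 650 * 1024 * 1024)  # 650MB, as in step 2
        apparent, allocated = apparent_and_allocated(image)
        print(f"apparent size: {apparent} bytes")
        print(f"allocated:     {allocated} bytes")
```

On a filesystem with sparse-file support the allocated figure stays far below the apparent size until the loop device on top of the file is actually written to.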

I'd mainly like to know if all of what I did is supported or not.

The machine is an AMD Duron 1GHz with 256MB RAM, 3 IDE harddisks (but
only hda and hdd involved in the above), 2 ethernet cards using
8139too.

Christian.


2003-06-13 15:45:48

by Oleg Drokin

Subject: Re: Lockups with loop'ed sparse files on reiserfs?

Hello!

On Fri, Jun 13, 2003 at 07:56:34PM +0400, Oleg Drokin wrote:
> > already unpack?) quite a bit of stuff), halfway through the whole
> > (host) system froze. Still responded to pings, but telnet $host 80
> > would not show any activity from running apache. Went to the server
> > room, I could change virtual terminals with Alt-<number>, but could
> > not log in. Reset.
> Was there anything interesting on the console where your kernel outputs
> its messages (the host kernel?)?

BTW, while we are at it, was there enough space on the partition with sparse
files to hold all the data you were writing there?

Bye,
Oleg

2003-06-13 15:42:49

by Oleg Drokin

Subject: Re: Lockups with loop'ed sparse files on reiserfs?

Hello!

On Fri, Jun 13, 2003 at 03:38:44PM +0200, Christian Jaeger wrote:

> I've experienced 3 lockups in the last few days, all while using
> sparse files. Could also be problems with UML, SKAS, raid5 over loop
> device, or loop devices with vfat files, but it looks like the only
> common thing is sparse files on reiserfs.
>
> 1.) kernel 2.4.20 from debian unstable (= kernel.org kernel with
> quite a few security and other patches), additionally patched with
> kernel-patch-skas 3-1 from debian. Started user-mode-linux using a
> sparse file with an ext2 filesystem on it, using tap0 networking, did
> apt-get upgrade inside this uml (which started to download (and
> already unpack?) quite a bit of stuff), halfway through the whole
> (host) system froze. Still responded to pings, but telnet $host 80
> would not show any activity from running apache. Went to the server
> room, I could change virtual terminals with Alt-<number>, but could
> not log in. Reset.

Was there anything interesting on the console where your kernel outputs
its messages (the host kernel?)?
Any chance to hit say sysrq-T/sysrq-P to find out where CPU spins?

> I'd mainly like to know if all of what I did is supported or not.

Yes, it is supported. I am doing this kind of stuff (with uml and skas3)
on reiserfs every day and everything works just fine with 2.4.19, 2.4.20 and 2.4.21.

Bye,
Oleg

2003-06-13 18:16:11

by Christian Jaeger

Subject: Re: Lockups with loop'ed sparse files on reiserfs?

At 19:59 +0400 13.06.2003, Oleg Drokin wrote:
> > Was there anything interesting on the console where your kernel outputs
> > its messages (the host kernel?)?

IIRC nothing was output, at least I don't remember anything that I
thought was significant. But see below re kern.log entries.

>Any chance to hit say sysrq-T/sysrq-P to find out where CPU spins?

I've never used those, I'll have to learn about those debugging
options first. Where should I go to?

>BTW, while we are at it, was there enough space on the partition with sparse
>files to hold all the data you were writing there?

I did calculate the space before I started a few days ago. I have now
recalculated against the current free space on the partitions, and in
fact on one partition there's not enough space (anymore?):

losetup /dev/loop0 /root/raid5_1
losetup /dev/loop1 /root/raid5_2
du /root -> 1675228 k free. 650*1024*2=1331200 k, => ok
losetup /dev/loop2 /mnt/hdd8/raid5_3
losetup /dev/loop3 /mnt/hdd8/raid5_4
losetup /dev/loop4 /mnt/hdd8/raid5_5
du /mnt/hdd8/ -> 1973856 k free. 650*1024*3=1996800k => *not* ok.
(pity that I already deleted those 3 files)
losetup /dev/loop5 /mnt/hda11/raid5_6
du /mnt/hda11 -> 849044 free. => ok.
losetup /dev/loop6 /mnt/hdd6/.c/raid5_8
losetup /dev/loop7 /mnt/hdd6/.c/raid5_9
this is a vfat partition so no sparse files (and 2.9GB free too)
(The files look like:
-rw------- 1 root root 681574400 8. Jun 23:46 raid5_6
)
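
The arithmetic in the listing above can be re-checked with a short Python sketch. It assumes the "k free" figures are KiB and that every sparse file may eventually grow to its full 650MB; filesystem overhead is ignored. The function name check_fit and the table layout are illustrative only.

```python
FILE_KB = 650 * 1024  # one 650MB image file, in KiB (681574400 bytes)


def check_fit(free_kb, n_files, file_kb=FILE_KB):
    """Return (needed_kb, fits) if n_files sparse files were fully allocated."""
    needed_kb = n_files * file_kb
    return needed_kb, needed_kb <= free_kb


# (mount point, free KiB as quoted above, number of sparse files placed there)
partitions = [
    ("/root", 1675228, 2),
    ("/mnt/hdd8", 1973856, 3),
    ("/mnt/hda11", 849044, 1),
]

for mount, free_kb, n_files in partitions:
    needed_kb, fits = check_fit(free_kb, n_files)
    verdict = "ok" if fits else "NOT ok"
    print(f"{mount}: need {needed_kb} k, free {free_kb} k -> {verdict}")

# Prints:
# /root: need 1331200 k, free 1675228 k -> ok
# /mnt/hdd8: need 1996800 k, free 1973856 k -> NOT ok
# /mnt/hda11: need 665600 k, free 849044 k -> ok
```

Under these assumptions /mnt/hdd8 is short by 22944 k once all three files there are fully written.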

Now the question is what happens if a partition is full.
In fact I've seen this in kern.log (full log at
http://pflanze.mine.nu/~chris/scratch/kern.log ):

Jun 13 11:34:57 pflanze kernel: raid5: md0, not all disks are
operational -- trying to recover array
...
Jun 13 11:34:57 pflanze kernel: md0: resyncing spare disk [dev 07:07]
to replace failed disk

Though I think that was before I started writing stuff onto the array.

What does happen if a raid array fails (i.e. 2 disks fail and there's
no spare, or 1 spare and 3 disks fail etc.)? If it's not an important
array (i.e. no swap or root filesystem on it), is there a reason for
the system to go down? Isn't it possible to just mark the mounted
filesystem as erroneous and return EIO to applications accessing it?

There's also the case 1, using uml. In this case I'm sure there was
no problem with space. The sparse filesystem image file I used is
exactly 500'000'000 bytes, and there's 1675228 k free space on the
partition where it is put on.

Christian.

2003-06-13 20:08:20

by Oleg Drokin

Subject: Re: Lockups with loop'ed sparse files on reiserfs?

Hello!

On Fri, Jun 13, 2003 at 08:07:55PM +0200, Christian Jaeger wrote:

> >Any chance to hit say sysrq-T/sysrq-P to find out where CPU spins?
> I've never used those, I'll have to learn about those debugging
> options first. Where should I go to?

Read /usr/src/linux/Documentation/sysrq.txt

> Now the question is what happens if a partition is full.

There was a known problem with reiserfs where it might sometimes
deadlock in an out-of-space situation.
This is fixed in 2.4.21.

> In fact I've seen this in kern.log (full log at
> http://pflanze.mine.nu/~chris/scratch/kern.log ):
> Jun 13 11:34:57 pflanze kernel: raid5: md0, not all disks are
> operational -- trying to recover array
> ...
> Jun 13 11:34:57 pflanze kernel: md0: resyncing spare disk [dev 07:07]
> to replace failed disk

This is raid5 stuff resyncing. It is probably normal if you have just
set up the raid5 array.

> What does happen if a raid array fails (i.e. 2 disks fail and there's
> no spare, or 1 spare and 3 disks fail etc.)? If it's not an important

Everything that will access this array will break, I presume ;)

> array (i.e. no swap or root filesystem on it), is there a reason for
> the system to go down? Isn't it possible to just mark the mounted
> filesystem as erroneous and return EIO to applications accessing it?

Something like that will happen.

> There's also the case 1, using uml. In this case I'm sure there was
> no problem with space. The sparse filesystem image file I used is
> exactly 500'000'000 bytes, and there's 1675228 k free space on the
> partition where it is put on.

Ok, that's where sysrq-T/sysrq-P traces would be most useful.
And if you'd try with 2.4.21 that would be even better.

Thank you.

Bye,
Oleg

2003-06-14 22:57:07

by Christian Jaeger

Subject: Re: Lockups with loop'ed sparse files on reiserfs?

At 0:22 +0400 14.06.2003, Oleg Drokin wrote:
>Read /usr/src/linux/Documentation/sysrq.txt

Done, new kernels now compiled with CONFIG_MAGIC_SYSRQ.

>There was a known problem with reiserfs where it might sometimes
>deadlock in an out-of-space situation.
>This is fixed in 2.4.21.

Good to know.

> > There's also the case 1, using uml. In this case I'm sure there was
> > no problem with space. The sparse filesystem image file I used is
> > exactly 500'000'000 bytes, and there's 1675228 k free space on the
> > partition where it is put on.
>
>Ok, that's where sysrq-T/sysrq-P traces would be most useful.
>And if you'd try with 2.4.21 that would be even better.

I've now compiled 2.4.21 from kernel.org with skas3 from debian, as
well as 2.4.21 with grsecurity (from grsecurity.net, with medium
setting) and skas3, and tried uml again with the same sparse image
multiple times under both. I haven't managed to lock the machine up
so far, even while installing quite a lot of stuff, so maybe the
problem is already solved. If not, I'll tell you when it happens again. (I
think I'll run 2.4.21-grsec-skas3 for the near future now.)

Thanks for your help
Christian.