2001-02-08 14:02:01

by Andrius Adomaitis

Subject: Problems with 2.4.2-pre1 & reiser & vfs


Hello,

I have a dual PIII 800 machine running as a mail server on DAC 960 RAID &
reiserfs coming with the 2.4.1 kernel.

Under very high loads I get the following messages in my kernel log:

kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
7906806 0x0 SD](nlink == 1) not found (pos 23)
kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
7906806 0x0 SD] (nlink == 1) not found (pos 23)
kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for
[7906789 7906806 0xfffffffffffffff DIRECT]
kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
7906806 0x0 SD] (nlink == 1) not found (pos 23)
kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for
[7906789 7906806 0xfffffffffffffff DIRECT]
.....

and afterwards come these:

kernel: vs-3050: wait_buffer_until_released: nobody releases buffer
(dev 30:09, size 4096, blocknr 1661732, count 16,
kernel: vs-3050: wait_buffer_until_released: nobody releases buffer
(dev 30:09, size 4096, blocknr 1661732, count 16,
...
and so on.

The interesting thing is that the system is still operational, but load
jumps up to 260 or so, and any attempts to reboot the system fail. ps aux
shows that there is an immortal (kill -9 $PID doesn't kill it) qmail
process that consumes 97% of one CPU's resources. Also `vmstat` shows
tons of processes in uninterruptible sleep, but `free` reports that there
is still enough memory (no swap used) and huge buffers... The machine gets
sluggish but works for a while (0.5-2h depending on the mail request rate).

System is Debian potato,
gcc version 2.95.2 20000220 (Debian GNU/Linux),
reiserfs utils 3.6.25.

Any patches or suggestions to fix that would be appreciated...

P.S. I also wondered whether it would be good to have a sysctl entry in
proc that reboots the machine, depending on the value in a control file,
when a proper software reboot is impossible (as in the situation described
above). Or does such a thing already exist?

Thanks.
--
Andrius
[email protected]


2001-02-08 22:37:54

by Chris Mason

Subject: Re: Problems with 2.4.2-pre1 & reiser & vfs



On Thursday, February 08, 2001 04:00:26 PM +0100 Andrius Adomaitis <[email protected]> wrote:

>
> Hello,
>
> I have a dual PIII 800 machine running as a mail server on DAC 960 RAID &
> reiserfs coming with the 2.4.1 kernel.
>
> Under very high loads I get the following messages in my kernel log:
>
> kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
> 7906806 0x0 SD](nlink == 1) not found (pos 23)
> kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
> 7906806 0x0 SD] (nlink == 1) not found (pos 23)
> kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for
> [7906789 7906806 0xfffffffffffffff DIRECT]
> kernel: vs-13060: reiserfs_update_sd: stat data of object [7906789
> 7906806 0x0 SD] (nlink == 1) not found (pos 23)
> kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for
> [7906789 7906806 0xfffffffffffffff DIRECT]
> .....

These aren't good at all, and show a general corruption problem. I know the ac kernels have at least one small DAC960 bug fix; are there other fixes pending?

>
> and afterwards come these:
>
> kernel: vs-3050: wait_buffer_until_released: nobody releases buffer
> (dev 30:09, size 4096, blocknr 1661732, count 16,
> kernel: vs-3050: wait_buffer_until_released: nobody releases buffer
> (dev 30:09, size 4096, blocknr 1661732, count 16,
> ...
> and so on.
>

There is more info in this message; it would help if you could send the entire line.

> The interesting thing is that the system is still operational, but load
> jumps up to 260 or so, and any attempts to reboot the system fail. ps aux
> shows that there is an immortal (kill -9 $PID doesn't kill it) qmail
> process that consumes 97% of one CPU's resources. Also `vmstat` shows
> tons of processes in uninterruptible sleep, but `free` reports that there
> is still enough memory (no swap used) and huge buffers... The machine gets
> sluggish but works for a while (0.5-2h depending on the mail request rate).
>

Once you get a vs-3050, any process that tries to change the FS ends up waiting on the journal, which is waiting on the process stuck in vs-3050. There is no escape.
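
A minimal sketch for spotting the processes caught in that wait chain, assuming a Linux /proc where the field after the command name in /proc/<pid>/stat is the process state and "D" means uninterruptible sleep. The function names are illustrative only and are not part of any reiserfs tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren].strip())
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return found  # no readable /proc (not Linux, or not mounted)
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the entry is unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Run on the hung box, this only shows which processes are blocked; it does not say what they are blocked on, so the full vs-3050 line is still needed.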

-chris



2001-02-08 22:49:44

by Alan

Subject: Re: Problems with 2.4.2-pre1 & reiser & vfs

> > kernel: PAP-5660: reiserfs_do_truncate: wrong result -1 of search for
> > [7906789 7906806 0xfffffffffffffff DIRECT]
> > .....
>
> These aren't good at all, and show a general corruption problem. I know the ac kernels have at least one small DAC960 bug fix; are there other fixes pending?
>

The DAC960 changes relate to gcc 2.96 stuff and wouldn't account for real bugs.
DAC960 built with cvs gcc or 2.96 < 2.96-74 or so could do because of the ABI
thing, but wouldn't boot that far. If it's straight 2.4.1, suspect the elevator
corruption thing too.