LinuxLists.cc - PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

2002-03-13 10:02:45

Subject: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

At Namesys, every change to reiserfs is required to be first tested by
the author, and then tested by at least one additional person. For
these tests we have a gigantic suite of tests that run for hours.
Beta test users (all 2.5 users) should expect things to break and
be unreliable. They should also expect that nothing a
regression test can find should ever hit kernel.org. They should
never have to spend their valuable time finding bugs that a computer
can find for them.

I would encourage requiring that every patch sent to Linus have
two signatures on it, one indicating that the author has tested it
(this requires rebooting the computer with the new kernel in the
case of the patch that prompted this email), and another indicating what
second person has tested it.

For reiserfs I personally try to make sure that one of the
signatures is from someone who knows that sometimes the usual
regression tests won't hit the new code, and has been observed
to understand that he is expected to write whatever
code is required to test it properly in that case.

I realize that I am engaging in a bit of cultural imperialism here,
and that many members of the Linux community find testing before
release an inconvenience. In fact, the person responsible
for this breakage has been personally responsible
for almost all of the "so bad no one should use it" breakage incidents
that reiserfs has experienced since it was added to the kernel, and
nothing said privately or publicly seems to be enough to get him
to even try copying a file into ReiserFS between making and releasing
his improvements. In my not at all objective view he is pissing all
over our painstaking efforts at quality control.

Despite this, I would like to encourage that person to continue making
improvements to ReiserFS.

So, let me therefor confine myself to asking that if persons
modify ReiserFS, that the changes be first sent to the maintainer
before being either sent or accepted into the mainstream kernel.
We will indeed ensure that they are tested before release if we
are allowed a chance to do so.

Todd, a patch is attached that will get you working again.

This patch will be sent to Linus a little bit later with
a bunch of other patches.

Hans

In response to:
****************************************************
From: "Todd E. Johnson" <[email protected]>
Subject: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot

Hello,

Information as follows:

[1.] One line summary of the problem:

Kernel Panic on 2.5.6 and 2.5.7-pre1 boot

[2.] Full description of the problem/report:

After a simple build of 2.5.6, and/or 2.5.7-pre1, when I reboot my
machine, I am greeted by
a kernel panic moments after the IDE controller is recognized.

[most of email deleted...]

***************************************************************************

--- linux-2.5.6.original/fs/reiserfs/journal.c Tue Mar 12 15:25:27 2002
+++ linux-2.5.6/fs/reiserfs/journal.c Wed Mar 13 11:01:37 2002
@@ -98,21 +98,6 @@
static int release_journal_dev( struct super_block *super,
struct reiserfs_journal *journal );

-static inline struct buffer_head *journ_get_hash_table(struct super_block *s, int block)
-{
- return __get_hash_table(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize);
-}
-
-static inline struct buffer_head *journ_getblk(struct super_block *s, int block)
-{
- return __getblk(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize);
-}
-
-static inline struct buffer_head *journ_bread(struct super_block *s, int block)
-{
- return __bread(SB_JOURNAL(s)->j_dev_bd, block, s->s_blocksize);
-}
-
static void init_journal_hash(struct super_block *p_s_sb) {
memset(SB_JOURNAL(p_s_sb)->j_hash_table, 0, JOURNAL_HASH_SIZE * sizeof(struct reiserfs_journal_cnode *)) ;
}
@@ -705,7 +690,7 @@
count = 0 ;
for (i = 0 ; atomic_read(&(jl->j_commit_left)) > 1 && i < (jl->j_len + 1) ; i++) { /* everything but commit_bh */
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start+i) % SB_ONDISK_JOURNAL_SIZE(s);
- tbh = journ_get_hash_table(s, bn) ;
+ tbh = get_hash_table(SB_JOURNAL_DEV(s), bn, s->s_blocksize) ;

/* kill this sanity check */
if (count > (orig_commit_left + 2)) {
@@ -734,7 +719,7 @@
for (i = 0 ; atomic_read(&(jl->j_commit_left)) > 1 &&
i < (jl->j_len + 1) ; i++) { /* everything but commit_bh */
bn = SB_ONDISK_JOURNAL_1st_BLOCK(s) + (jl->j_start + i) % SB_ONDISK_JOURNAL_SIZE(s) ;
- tbh = journ_get_hash_table(s, bn) ;
+ tbh = get_hash_table(SB_JOURNAL_DEV(s), bn, s->s_blocksize) ;

wait_on_buffer(tbh) ;
if (!buffer_uptodate(tbh)) {
@@ -1425,8 +1410,9 @@
offset = d_bh->b_blocknr - SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) ;

/* ok, we have a journal description block, lets see if the transaction was valid */
- c_bh = journ_bread(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
- ((offset + le32_to_cpu(desc->j_len) + 1) % SB_ONDISK_JOURNAL_SIZE(p_s_sb))) ;
+ c_bh = bread(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
+ ((offset + le32_to_cpu(desc->j_len) + 1) % SB_ONDISK_JOURNAL_SIZE(p_s_sb)),
+ p_s_sb->s_blocksize) ;
if (!c_bh)
return 0 ;
commit = (struct reiserfs_journal_commit *)c_bh->b_data ;
@@ -1480,7 +1466,7 @@
unsigned long trans_offset ;
int i;

- d_bh = journ_bread(p_s_sb, cur_dblock) ;
+ d_bh = bread(SB_JOURNAL_DEV(p_s_sb), cur_dblock, p_s_sb->s_blocksize) ;
if (!d_bh)
return 1 ;
desc = (struct reiserfs_journal_desc *)d_bh->b_data ;
@@ -1504,9 +1490,9 @@
brelse(d_bh) ;
return 1 ;
}
- c_bh = journ_bread(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
+ c_bh = bread(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
((trans_offset + le32_to_cpu(desc->j_len) + 1) %
- SB_ONDISK_JOURNAL_SIZE(p_s_sb))) ;
+ SB_ONDISK_JOURNAL_SIZE(p_s_sb)), p_s_sb->s_blocksize) ;
if (!c_bh) {
brelse(d_bh) ;
return 1 ;
@@ -1535,7 +1521,7 @@
}
/* get all the buffer heads */
for(i = 0 ; i < le32_to_cpu(desc->j_len) ; i++) {
- log_blocks[i] = journ_getblk(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + (trans_offset + 1 + i) % SB_ONDISK_JOURNAL_SIZE(p_s_sb));
+ log_blocks[i] = getblk(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + (trans_offset + 1 + i) % SB_ONDISK_JOURNAL_SIZE(p_s_sb), p_s_sb->s_blocksize);
if (i < JOURNAL_TRANS_HALF) {
real_blocks[i] = sb_getblk(p_s_sb, le32_to_cpu(desc->j_realblock[i])) ;
} else {
@@ -1675,9 +1661,10 @@
** is the first unflushed, and if that transaction is not valid,
** replay is done
*/
- SB_JOURNAL(p_s_sb)->j_header_bh = journ_bread(p_s_sb,
+ SB_JOURNAL(p_s_sb)->j_header_bh = bread (SB_JOURNAL_DEV(p_s_sb),
SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
- SB_ONDISK_JOURNAL_SIZE(p_s_sb));
+ SB_ONDISK_JOURNAL_SIZE(p_s_sb),
+ p_s_sb->s_blocksize) ;
if (!SB_JOURNAL(p_s_sb)->j_header_bh) {
return 1 ;
}
@@ -1698,7 +1685,7 @@
** there is nothing more we can do, and it makes no sense to read
** through the whole log.
*/
- d_bh = journ_bread(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + le32_to_cpu(jh->j_first_unflushed_offset)) ;
+ d_bh = bread(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + le32_to_cpu(jh->j_first_unflushed_offset), p_s_sb->s_blocksize) ;
ret = journal_transaction_is_valid(p_s_sb, d_bh, NULL, NULL) ;
if (!ret) {
continue_replay = 0 ;
@@ -2049,8 +2036,9 @@
rs = SB_DISK_SUPER_BLOCK(p_s_sb);

/* read journal header */
- bhjh = journ_bread(p_s_sb,
- SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + SB_ONDISK_JOURNAL_SIZE(p_s_sb));
+ bhjh = bread (SB_JOURNAL_DEV(p_s_sb),
+ SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + SB_ONDISK_JOURNAL_SIZE(p_s_sb),
+ SB_BLOCKSIZE(p_s_sb));
if (!bhjh) {
printk("sh-459: unable to read journal header\n") ;
return 1 ;
@@ -2988,7 +2976,7 @@

rs = SB_DISK_SUPER_BLOCK(p_s_sb) ;
/* setup description block */
- d_bh = journ_getblk(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + SB_JOURNAL(p_s_sb)->j_start) ;
+ d_bh = getblk(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) + SB_JOURNAL(p_s_sb)->j_start, p_s_sb->s_blocksize) ;
mark_buffer_uptodate(d_bh, 1) ;
desc = (struct reiserfs_journal_desc *)(d_bh)->b_data ;
memset(desc, 0, sizeof(struct reiserfs_journal_desc)) ;
@@ -2996,8 +2984,9 @@
desc->j_trans_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_trans_id) ;

/* setup commit block. Don't write (keep it clean too) this one until after everyone else is written */
- c_bh = journ_getblk(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
- ((SB_JOURNAL(p_s_sb)->j_start + SB_JOURNAL(p_s_sb)->j_len + 1) % SB_ONDISK_JOURNAL_SIZE(p_s_sb))) ;
+ c_bh = getblk(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
+ ((SB_JOURNAL(p_s_sb)->j_start + SB_JOURNAL(p_s_sb)->j_len + 1) % SB_ONDISK_JOURNAL_SIZE(p_s_sb)),
+ p_s_sb->s_blocksize) ;
commit = (struct reiserfs_journal_commit *)c_bh->b_data ;
memset(commit, 0, sizeof(struct reiserfs_journal_commit)) ;
commit->j_trans_id = cpu_to_le32(SB_JOURNAL(p_s_sb)->j_trans_id) ;
@@ -3087,8 +3076,9 @@
/* copy all the real blocks into log area. dirty log blocks */
if (test_bit(BH_JDirty, &cn->bh->b_state)) {
struct buffer_head *tmp_bh ;
- tmp_bh = journ_getblk(p_s_sb, SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
- ((cur_write_start + jindex) % SB_ONDISK_JOURNAL_SIZE(p_s_sb))) ;
+ tmp_bh = getblk(SB_JOURNAL_DEV(p_s_sb), SB_ONDISK_JOURNAL_1st_BLOCK(p_s_sb) +
+ ((cur_write_start + jindex) % SB_ONDISK_JOURNAL_SIZE(p_s_sb)),
+ p_s_sb->s_blocksize) ;
mark_buffer_uptodate(tmp_bh, 1) ;
memcpy(tmp_bh->b_data, cn->bh->b_data, cn->bh->b_size) ;
jindex++ ;

2002-03-13 13:50:43

by Rik van Riel

[permalink] [raw]

Subject: Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

On Wed, 13 Mar 2002, Hans Reiser wrote:

> I would encourage requiring that every patch sent to Linus have two
> signatures on it, one indicating that the author has tested it and
> another indicating what second person has tested it.

Agreed. Linus and Marcelo are very busy and shouldn't be
bothered with untested patches. If we only send them
known-good stuff their lives should get a little bit
easier.

regards,

Rik
--
<insert bitkeeper endorsement here>

http://www.surriel.com/ http://distro.conectiva.com/

2002-03-13 13:59:14

by Martin Dalecki

[permalink] [raw]

Subject: Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

Rik van Riel wrote:
> On Wed, 13 Mar 2002, Hans Reiser wrote:
>
>
>>I would encourage requiring that every patch sent to Linus have two
>>signatures on it, one indicating that the author has tested it and
>>another indicating what second person has tested it.
>>
>
> Agreed. Linus and Marcelo are very busy and shouldn't be
> bothered with untested patches. If we only send them
> known-good stuff their lives should get a little bit
> easier.

You don't understand by accident that sometimes blind untested
changes serve the purpose of hinting at API changes in
areas where once doesn't have the slightest opportunity
of testing? Just simple count how many FS there are out there
and you should see that there is no chance for "quality"
testing before submission of advancements there.

Breakage is the price you have to pay for advancements.

(I'm not arguing over the particular case in quesiton here.
I'm just arguing over the proposal.)

2002-03-13 14:10:16

by Alexander Viro

[permalink] [raw]

Subject: Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

On Wed, 13 Mar 2002, Martin Dalecki wrote:

> You don't understand by accident that sometimes blind untested
> changes serve the purpose of hinting at API changes in
> areas where once doesn't have the slightest opportunity
> of testing? Just simple count how many FS there are out there
> and you should see that there is no chance for "quality"
> testing before submission of advancements there.
>
> Breakage is the price you have to pay for advancements.
>
> (I'm not arguing over the particular case in quesiton here.
> I'm just arguing over the proposal.)

Particular case is combination of my screwup, seriously convoluted code
in original and Oleg not sending the fix we'd settled down upon to Linus
(as far as I can tell).

Broken patches _are_ bad and "how many FS are there" is a BS argument.
It _is_ possible to do that right and when it's done wrong - it's a
fuckup of patch author, period.

Proposal is a bit naive, though - in most of the cases fuckups merrily
pass original testing.

2002-03-13 14:20:18

by Martin Dalecki

[permalink] [raw]

Subject: Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

Alexander Viro wrote:

> Proposal is a bit naive, though - in most of the cases fuckups merrily
> pass original testing.

No argument here - this was my point.

2002-03-13 17:05:14

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [reiserfs-list] Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

On Wed, 13 Mar 2002 09:09:52 EST, Alexander Viro said:

> Proposal is a bit naive, though - in most of the cases fuckups merrily
> pass original testing.

We're not going to create a testing procedure that will find all bugs,
since that's essentially impossible. On the other hand, I think we
all need to collectively take 24 or 48 hour off, spend the time downing
several <insert beverage here>, and see if anybody has a good proposal
that would catch 90% of the show-stopper bugs that have slipped through
so far in the 2.4 and 2.5 series, without complicating matters TOO much.

Here's a simple one: a -rcN release has to sit there at least 96
hours before it gets tagged as "final". That's something we've been
quite poor at so far:

rel final last -pre rel delay
2.4.1 29 Jan 23:56 -pre12 29 Jan 10:16 13h40m
2.4.2 21 Feb 17:00 -pre4 16 Feb 11:17 5d 3h45m (2 weekend days)
2.4.3 29 Mar 21:03 -pre8 25 Mar 20:08 4d 1h
2.4.4 27 Apr 18:43 -pre8 26 Apr 10:58 1d 7h45m
2.4.5 25 May 18:26 -pre6 24 May 17:47 1d 0h45m
2.4.6 03 Jul 17:07 -pre9 02 Jul 17:41 23h45m (all on weekend)
2.4.7 20 Jul 14:25 -pre9 19 Jul 22:03 16h22m
2.4.8 10 Aug 21:13 -pre8 09 Aug 17:34 1d 3h45m
2.4.9 16 Aug 11:32 -pre4 14 Aug 19:06 1d 16h24m
2.4.10 23 Sep 11:30 -pre15 22 Sep 20:58 14h32m (all on weekend)
2.4.11 09 Oct 16.55 -pre6 08 Oct 14:18 1d 2h37m
2.4.12 11 Oct 00:59 <none> -- -----
2.4.13 23 Oct 22:28 -pre6 21 Oct 10:54 1d 11h30m (1 weekend day)
2.4.14 05 Nov 15:30 -pre8 03 Nov 17:42 1d 21h45m (2 weekend days)
2.4.15 22 Nov 22:18 -pre9 21 Nov 22:46 23h28m
2.4.16 26 Nov 05:32 -pre1 24 Nov 12:05 1d 17h27m (2 weekend days)
2.4.17 21 Dec 09:52 -rc2 18 Dec 13:38 2d 18h14m
2.4.18 25 Feb 11:40 -rc4 22 Feb 04:28 3d 7h12m (2 weekend days)

So even in many cases where there *were* 2 or 3 day waits, they were
across a weekend - which means quite likely a lot of people didn't
test because they were *home* having *lives*.

Think about it. "Just wait 96 hours".

--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech

Attachments:

(No filename) (226.00 B)

2002-03-13 17:46:10

by Alexander Viro

[permalink] [raw]

Subject: Re: [reiserfs-list] Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

On Wed, 13 Mar 2002 [email protected] wrote:

> On Wed, 13 Mar 2002 09:09:52 EST, Alexander Viro said:
>
> > Proposal is a bit naive, though - in most of the cases fuckups merrily
> > pass original testing.
>
>
> We're not going to create a testing procedure that will find all bugs,
> since that's essentially impossible. On the other hand, I think we
> all need to collectively take 24 or 48 hour off, spend the time downing
> several <insert beverage here>, and see if anybody has a good proposal
> that would catch 90% of the show-stopper bugs that have slipped through
> so far in the 2.4 and 2.5 series, without complicating matters TOO much.
>
> Here's a simple one: a -rcN release has to sit there at least 96
> hours before it gets tagged as "final". That's something we've been
> quite poor at so far:

Well, let's see. I can't speak for other folks, but the worst of mine
during 2.4 was in 2.4.15-pre9. The rest was minor and IIRC didn't make
it into -final. That one did, however. It wasn't caught until 2.4.15 -
not immediately, at that.

2002-03-13 18:46:34

by Hans Reiser

[permalink] [raw]

Subject: Re: PROBLEM: Kernel Panic on 2.5.6 and 2.5.7-pre1 boot [PATCH] and discussion of Linux testing procedures

Alexander Viro wrote:

>
>
>Proposal is a bit naive, though - in most of the cases fuckups merrily
>pass original testing.
>
>
>
No, I think we catch 70-85%. Namesys internal bad patches frequently
pass original testing of patch author and are caught by professional
tester whose specialty is testing reiserfs patches. You just haven't
seen these bad patches, so you don't realize how effective the catching
is.;-) There have been a lot of catches.;-) It is not simply a matter
of skill in testing though: you need a second person to test it in the
way that the author did not think to test it (e.g. by booting to the
right kernel, everyone on this list has made that benchmarking/testing
error a few times, yes....;-) ).

Hans