Date: Sun, 3 Aug 2008 07:38:17 -0600
From: Matthew Wilcox <matthew@wil.cx>
To: Hong Tran Duc <hongtd2k@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
       linux-ide@vger.kernel.org
Subject: Re: Oops when read/write or mount/unmount continuously ~ 600.000 times
Message-ID: <20080803133817.GG26461@parisc-linux.org>
References: <4895A96E.2040303@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <4895A96E.2040303@gmail.com>
User-Agent: Mutt/1.5.13 (2006-08-11)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2234
Lines: 48

On Sun, Aug 03, 2008 at 07:49:50PM +0700, Hong Tran Duc wrote:
> I'm using kernel 2.4.20 with full preemption enabled (patched in and set 
> via the CONFIG option). My CPU is a PowerPC 750FX, HDD 80GB, RAM 512,

2.4.20 was released in November 2002; almost 6 years ago.  I don't think
you're going to find too many people interested in helping you debug
this.  If you can reproduce the problem with something more recent (say
2.6.26 or even 2.4.36.6 if you can't use 2.6 for whatever reason), then
I think people will be more interested.

> The cause is almost always that a linked list in those functions was 
> broken. E.g.: linkedlist->next or linkedlist->prev is NULL or set to an 
> invalid address.
> In the SIGILL case, the instruction pointer (NIP) is the same as the 
> return address register (LR).

In later kernels, we have a list debugging option which lets you find
list corruptions earlier.
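
As a rough illustration of what such a check does (the option I have in
mind is presumably CONFIG_DEBUG_LIST, which is not named above), here is
a minimal userspace sketch. The names list_add_valid() and
list_add_checked() are hypothetical and this is not the kernel's code;
it only shows the idea of verifying that both neighbours still point at
each other before linking in a new entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Minimal circular doubly-linked list node, modelled on the kernel's. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/*
 * Hypothetical helper: report corruption if the two neighbours of the
 * insertion point no longer agree with each other. It catches NULL and
 * mismatched pointers; a wild non-NULL pointer can still fault when it
 * is dereferenced here.
 */
static bool list_add_valid(const struct list_head *prev,
                           const struct list_head *next)
{
    if (prev == NULL || next == NULL) {
        fprintf(stderr, "list_add corruption: NULL neighbour\n");
        return false;
    }
    if (next->prev != prev) {
        fprintf(stderr, "list_add corruption: next->prev should be %p, was %p\n",
                (const void *)prev, (const void *)next->prev);
        return false;
    }
    if (prev->next != next) {
        fprintf(stderr, "list_add corruption: prev->next should be %p, was %p\n",
                (const void *)next, (const void *)prev->next);
        return false;
    }
    return true;
}

/* Insert item right after head, refusing to touch a list that looks broken. */
static bool list_add_checked(struct list_head *item, struct list_head *head)
{
    struct list_head *prev = head;
    struct list_head *next = head->next;

    if (!list_add_valid(prev, next))
        return false;

    next->prev = item;
    item->next = next;
    item->prev = prev;
    prev->next = item;
    return true;
}
```

The point of checking at insertion time is that the corruption is
reported close to where the list is next modified, rather than as an
unrelated oops much later.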

> The newest Oops I got was in the function __wait_on_buffer(). The main 
> steps of __wait_on_buffer() are:
> 1. get_bh -> atomic_inc(&bh->b_count);
> 2. add to the wait queue
> 3. loop: do some task state manipulation, call *schedule()*
> 4. remove from the wait queue
> 5. put_bh -> atomic_dec(&bh->b_count); *<- Got Oops here, SEGV because 
> &bh->b_count = r25 = 0x02 *
> 
> After analysing the assembly code (which I uploaded to pastebin, below) 
> at this point, I found that:
> * At point (1), the address of bh->b_count is stored in register r25.
> * From point (2) to (4), anything that modifies r25 restores it from 
> the stack (r25 acts as a non-volatile register in the gcc ABI).
> * At step 5, *r25 = 0x02 ??? I don't know why r25 has changed. Maybe 
> the stack was corrupted somewhere?*

The implementation of __wait_on_buffer has completely changed since
then.  It's probably not worth trying to debug this.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."