2005-11-29 20:31:16

by Ryan Richter

Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, Nov 29, 2005 at 10:04:39PM +0200, Kai Makisara wrote:
> I looked at the driver and it seems that there is a bug: st_write calls
> release_buffering at the end even when it has started an asynchronous
> write. This means that it releases the mapping while it is being used!
> (I wonder why this has not been noticed earlier.)
>
> The patch below (against 2.6.15-rc2) should fix this bug and some others
> related to buffering. It is based on the patch "[PATCH] SCSI tape direct
> i/o fixes" I sent to linux-scsi on Nov 21. The patch restores setting
> pages dirty after reading and clears number of s/g segments when the
> pointers are not valid any more.
>
> The patch has been lightly tested with AMD64.

This applies cleanly to 2.6.14.2. Do you foresee any problems using it
with that kernel? I'd rather not change too many things at once.

If it should be OK, I'll boot this tonight or tomorrow - the backups run
every other night, so it won't get any testing until tomorrow night.

Thanks a lot,
-ryan


2005-11-29 20:48:04

by Kai Mäkisara (Kolumbus)

Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, 29 Nov 2005, Ryan Richter wrote:

> On Tue, Nov 29, 2005 at 10:04:39PM +0200, Kai Makisara wrote:
> > I looked at the driver and it seems that there is a bug: st_write calls
> > release_buffering at the end even when it has started an asynchronous
> > write. This means that it releases the mapping while it is being used!
> > (I wonder why this has not been noticed earlier.)
> >
> > The patch below (against 2.6.15-rc2) should fix this bug and some others
> > related to buffering. It is based on the patch "[PATCH] SCSI tape direct
> > i/o fixes" I sent to linux-scsi on Nov 21. The patch restores setting
> > pages dirty after reading and clears number of s/g segments when the
> > pointers are not valid any more.
> >
> > The patch has been lightly tested with AMD64.
>
> This applies cleanly to 2.6.14.2, do you foresee any problems using it
> with that kernel? I'd like to not change too many things at once.
>
No, I don't see any potential problems applying this patch to 2.6.14.2.
There is nothing specific to 2.6.15-rc2.

If anyone sees something wrong, please yell. The main purpose of the
patch is to stop calling release_buffering() at the end of st_write()
when an asynchronous write has been started, and to call it in
write_behind_check() instead.
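
The ordering change can be sketched as a small user-space model. This is
only an illustration and not the code in st.c: struct tape_buf, its
fields, and the start_async parameter are hypothetical, and waiting for
the SCSI command is reduced to clearing a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not the real st.c structures. */
struct tape_buf {
    bool mapped;        /* the buffer mapping is currently held */
    bool async_pending; /* an asynchronous write is still in flight */
};

static void release_buffering(struct tape_buf *b)
{
    /* Releasing while a write is in flight is the suspected bug. */
    assert(!b->async_pending);
    b->mapped = false;
}

/* Run before the next operation: finish the write, then release. */
static void write_behind_check(struct tape_buf *b)
{
    if (!b->async_pending)
        return;
    b->async_pending = false;   /* stands in for waiting on completion */
    release_buffering(b);
}

/* Model of the write path: release at once only for synchronous writes. */
static void st_write(struct tape_buf *b, bool start_async)
{
    write_behind_check(b);      /* complete any earlier async write first */
    b->mapped = true;
    if (start_async) {
        b->async_pending = true;  /* release deferred to write_behind_check() */
        return;
    }
    release_buffering(b);
}
```

In this model the assert in release_buffering() would fire if st_write()
released the buffer right after starting an asynchronous write, which is
the ordering the patch is meant to rule out.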

> If it should be OK, I'll boot this tonight or tomorrow - the backups run
> every other night, so it won't get any testing until tomorrow night.
>
> Thanks a lot,
> -ryan
>
Thanks for reporting the problem and thanks in advance for testing.

--
Kai

2005-11-29 20:59:03

by Ryan Richter

Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, Nov 29, 2005 at 10:48:22PM +0200, Kai Makisara wrote:
> On Tue, 29 Nov 2005, Ryan Richter wrote:
> > This applies cleanly to 2.6.14.2, do you foresee any problems using it
> > with that kernel? I'd like to not change too many things at once.
> >
> No, I don't see any potential problems applying this patch to 2.6.14.2.
> There is nothing specific to 2.6.15-rc2.
>
> If someone sees that there is something wrong, please yell. The
> main purpose of the patch is not to call release_buffering() at the end of
> st_write() when starting asynchronous write and call it in
> write_behind_check() instead.

OK, thanks. I think I'll go ahead and advance to 2.6.14.3 since that
should theoretically not cause any problems.

One question: do you think the oopses that happened later that actually
crashed the box were from damage caused by this bug or is that a
different problem?

> > If it should be OK, I'll boot this tonight or tomorrow - the backups run
> > every other night, so it won't get any testing until tomorrow night.
> >
> > Thanks a lot,
> > -ryan
> >
> Thanks for reporting the problem and thanks in advance for testing.

Sure thing,
-ryan

2005-11-29 21:36:06

by Kai Mäkisara (Kolumbus)

Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, 29 Nov 2005, Ryan Richter wrote:

> On Tue, Nov 29, 2005 at 10:48:22PM +0200, Kai Makisara wrote:
> > On Tue, 29 Nov 2005, Ryan Richter wrote:
> > > This applies cleanly to 2.6.14.2, do you foresee any problems using it
> > > with that kernel? I'd like to not change too many things at once.
> > >
> > No, I don't see any potential problems applying this patch to 2.6.14.2.
> > There is nothing specific to 2.6.15-rc2.
> >
> > If someone sees that there is something wrong, please yell. The
> > main purpose of the patch is not to call release_buffering() at the end of
> > st_write() when starting asynchronous write and call it in
> > write_behind_check() instead.
>
> OK, thanks. I think I'll go ahead and advance to 2.6.14.3 since that
> should theoretically not cause any problems.
>
> One question: do you think the oopses that happened later that actually
> crashed the box were from damage caused by this bug or is that a
> different problem?
>
I looked at the oopses but, not knowing enough about what is happening
inside the kernel, I can only hope that they are caused by the st bug(s).
We will see after testing with the patch.

--
Kai

2005-11-30 05:12:04

by Kai Mäkisara (Kolumbus)

Subject: Re: Fw: crash on x86_64 - mm related?

On Tue, 29 Nov 2005, Kai Makisara wrote:

> On Tue, 29 Nov 2005, Ryan Richter wrote:
>
> > On Tue, Nov 29, 2005 at 10:04:39PM +0200, Kai Makisara wrote:
> > > I looked at the driver and it seems that there is a bug: st_write calls
> > > release_buffering at the end even when it has started an asynchronous
> > > write. This means that it releases the mapping while it is being used!
> > > (I wonder why this has not been noticed earlier.)
> > >
> > > The patch below (against 2.6.15-rc2) should fix this bug and some others
> > > related to buffering. It is based on the patch "[PATCH] SCSI tape direct
> > > i/o fixes" I sent to linux-scsi on Nov 21. The patch restores setting
> > > pages dirty after reading and clears number of s/g segments when the
> > > pointers are not valid any more.
> > >
> > > The patch has been lightly tested with AMD64.
> >
> > > This applies cleanly to 2.6.14.2, do you foresee any problems using it
> > with that kernel? I'd like to not change too many things at once.
> >
> No, I don't see any potential problems applying this patch to 2.6.14.2.
> There is nothing specific to 2.6.15-rc2.
>
> If someone sees that there is something wrong, please yell. The
> main purpose of the patch is not to call release_buffering() at the end of
> st_write() when starting asynchronous write and call it in
> write_behind_check() instead.
>
Yelling!

The patch does not cause any problems but my theory about calling
release_buffering() too early is wrong: asynchronous writes are not done
with direct i/o. (The reason for this choice is that the API allows the
user to do anything with the buffer between writes and so the SCSI write
has to be finished before returning from write().)
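
A minimal sketch of why a buffered asynchronous write is safe against
later changes to the user's buffer, assuming the buffered path copies
the data into a driver-owned buffer before write() returns. This is a
user-space model, not st.c: struct drv_state, DRV_BUF_SIZE and
buffered_write_async() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DRV_BUF_SIZE 64  /* hypothetical driver buffer size */

struct drv_state {
    char   drv_buf[DRV_BUF_SIZE]; /* driver-owned copy used by async writes */
    size_t pending_len;           /* bytes queued for the device, 0 if none */
};

/* Buffered (non-direct) write: copy first, so the device can write later. */
static bool buffered_write_async(struct drv_state *s, const char *user,
                                 size_t len)
{
    assert(s != NULL && user != NULL);
    if (len > DRV_BUF_SIZE)
        return false;             /* too large for this model's buffer */
    memcpy(s->drv_buf, user, len);
    s->pending_len = len;         /* the device would consume drv_buf later */
    return true;
}
```

Once the call returns, the caller may overwrite its own buffer without
affecting the queued data. With direct i/o there is no such copy, so the
device would still be reading the user's pages.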

The other fixes in the patch are valid, but I don't see how they would fix
this problem.

--
Kai