2011-03-07 19:00:51

by Frederic Weisbecker

Subject: Re: Reiserfs deadlock in 2.6.36

Hi Bastien,

Sorry for the time I've been away.

On Sun, Jan 30, 2011 at 01:08:29AM +0100, Bastien ROUCARIES wrote:
> On Thursday 23 December 2010 04:42:33, Frederic Weisbecker wrote:
> Hi,
>
> It took me more than two days of testing to reproduce this bug with tracing enabled. My filesystem was quite slow and this bug seems
> to be timing related.
>
> One pattern that triggers this bug is git. Doing a lot of git work on my desktop crashes my machine.
>
> Moreover, trying to reproduce this bug led to data loss. I have rebuilt my / partition twice using --rebuild-tree, and restored
> my home partition three times from backups.
>
> My log is here.
>
> Do you need more information?

Yeah, do you have CONFIG_REISERFS_CHECK enabled? I would just
like to make sure we are not missing this important source of
information.

I'm puzzled because, given the traces, your opening and closing of the journal are
well balanced.

You have a writer queued and stuck, but I see no trace of it in the trace stream.
I only see well-balanced journal operations, including a journal close that would have
woken your queued writer.

A theory could be that your queued writer was waiting for someone to close the journal,
which finally happened, but only several minutes later, after many journal
open/close operations had overwritten the old trace containing the queueing of
the stuck writer.
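
To make that theory concrete, here is a toy Python model of a fixed-size trace
buffer. It is only a sketch and not how ftrace is implemented. The event names
("writer_queued", "journal_open", "journal_close") and the function
record_events are hypothetical, chosen for illustration.

```python
from collections import deque


def record_events(capacity, balanced_pairs):
    """Return what is left in a fixed-size trace buffer.

    One "writer_queued" event is logged first, followed by
    `balanced_pairs` matching journal open/close events.
    """
    buffer = deque(maxlen=capacity)  # oldest entries are dropped when full
    buffer.append("writer_queued")
    for i in range(balanced_pairs):
        buffer.append(f"journal_open:{i}")
        buffer.append(f"journal_close:{i}")
    return list(buffer)


# Small buffer: the queueing event is overwritten by later traffic.
small = record_events(capacity=8, balanced_pairs=100)
# Large buffer: the queueing event is still visible.
large = record_events(capacity=1024, balanced_pairs=100)

print("writer_queued" in small)
print("writer_queued" in large)
```

In the small buffer only balanced open/close pairs survive, which would match
a trace that looks healthy even though a writer was queued earlier.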

I don't know what to do yet. I need to think more about it.


2011-03-08 08:41:18

by Bastien Roucariès

Subject: Re: Reiserfs deadlock in 2.6.36

On Mon, Mar 7, 2011 at 8:00 PM, Frederic Weisbecker <[email protected]> wrote:
> Hi Bastien,

Cc: Ingo Molnar, because he works a lot on soft lockups and could have
an idea for debugging this.
Cc: Andrew Morton, who is also tracking "File/memory corruption in 2.6.37".

>> It took me more than two days of testing to reproduce this bug with tracing enabled. My filesystem was quite slow and this bug seems
>> to be timing related.
>>
>> One pattern that triggers this bug is git. Doing a lot of git work on my desktop crashes my machine.
>>
>> Moreover, trying to reproduce this bug led to data loss. I have rebuilt my / partition twice using --rebuild-tree, and restored
>> my home partition three times from backups.
>>
>> My log is here.
>>
>> Do you need more information?
>
> Yeah, do you have CONFIG_REISERFS_CHECK enabled? I would just
> like to make sure we are not missing this important source of
> information.

Yes I have it
> I'm puzzled because, given the traces, your opening and closing of the journal are
> well balanced.
>
> You have a writer queued and stuck, but I see no trace of it in the trace stream.
> I only see well-balanced journal operations, including a journal close that would have
> woken your queued writer.
>
> A theory could be that your queued writer was waiting for someone to close the journal,
> which finally happened, but only several minutes later, after many journal
> open/close operations had overwritten the old trace containing the queueing of
> the stuck writer.

Running "while true; do sync && sleep 1; done" helps a lot.

>
> I don't know what to do yet. I need to think more about it.
>

Could we do what I suggested at first? Use lockdep to track
journal open/close using a fake lock?

BTW, it seems that someone has experienced this condition on ext3. I could
do more testing if you want, and I will run xfstests to see
if I can reproduce it more quickly.

Bastien

2011-03-08 14:05:59

by Frederic Weisbecker

Subject: Re: Reiserfs deadlock in 2.6.36

On Tue, Mar 08, 2011 at 09:41:15AM +0100, Bastien ROUCARIES wrote:
> On Mon, Mar 7, 2011 at 8:00 PM, Frederic Weisbecker <[email protected]> wrote:
> > Hi Bastien,
>
> Cc: Ingo Molnar, because he works a lot on soft lockups and could have
> an idea for debugging this.
> Cc: Andrew Morton, who is also tracking "File/memory corruption in 2.6.37".

About the corruption, I'm not sure it's the same problem. It's hard to
tell yet.

> >> It took me more than two days of testing to reproduce this bug with tracing enabled. My filesystem was quite slow and this bug seems
> >> to be timing related.
> >>
> >> One pattern that triggers this bug is git. Doing a lot of git work on my desktop crashes my machine.
> >>
> >> Moreover, trying to reproduce this bug led to data loss. I have rebuilt my / partition twice using --rebuild-tree, and restored
> >> my home partition three times from backups.
> >>
> >> My log is here.
> >>
> >> Do you need more information?
> >
> > Yeah, do you have CONFIG_REISERFS_CHECK enabled? I would just
> > like to make sure we are not missing this important source of
> > information.
>
> Yes I have it

Ok.

> > I'm puzzled because, given the traces, your opening and closing of the journal are
> > well balanced.
> >
> > You have a writer queued and stuck, but I see no trace of it in the trace stream.
> > I only see well-balanced journal operations, including a journal close that would have
> > woken your queued writer.
> >
> > A theory could be that your queued writer was waiting for someone to close the journal,
> > which finally happened, but only several minutes later, after many journal
> > open/close operations had overwritten the old trace containing the queueing of
> > the stuck writer.
>
> Running "while true; do sync && sleep 1; done" helps a lot.

Which kernel are you running by the way?

> >
> > I don't know what to do yet. I need to think more about it.
> >
>
> Could we do what I suggested at first? Use lockdep to track
> journal open/close using a fake lock?

I don't think it's a suitable test. Lockdep is useful for detecting lock inversion
scenarios, but it's not very useful for detecting a lock that takes too long
to be released. For that we have the hung task detector, whose report we already
have.
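
The distinction can be illustrated with a toy lock-order checker in Python. This
is only a sketch of the general idea of order validation, not lockdep's actual
algorithm or API. The class OrderChecker and the lock names "journal" and
"inode" are hypothetical. The checker only records which lock was taken while
another was held, so it has no notion of how long a lock stays held.

```python
from collections import defaultdict


class OrderChecker:
    """Toy lock-order validator: records 'held -> acquired' edges."""

    def __init__(self):
        self.edges = defaultdict(set)

    def _reachable(self, src, dst):
        # Depth-first search over the recorded ordering edges.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges[node])
        return False

    def acquire(self, held, new):
        """Return True if taking `new` while holding `held` inverts a known order."""
        inversion = any(self._reachable(new, h) for h in held)
        for h in held:
            self.edges[h].add(new)
        return inversion


checker = OrderChecker()
first = checker.acquire(["journal"], "inode")     # establishes journal -> inode
inverted = checker.acquire(["inode"], "journal")  # opposite order: flagged

# A lock taken alone and then held for minutes creates no ordering edge,
# so this kind of checker has nothing to report about it.
long_hold = OrderChecker().acquire([], "journal")

print(first, inverted, long_hold)
```

A writer that waits a long time on a correctly ordered lock is invisible to this
kind of check, which is why a timeout-based report is the relevant tool here.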

> BTW, it seems that someone has experienced this condition on ext3. I could
> do more testing if you want, and I will run xfstests to see
> if I can reproduce it more quickly.

I'm not sure the file corruption and the deadlock are linked. But
maybe xfstests can provoke the deadlock (or the file corruption)
more quickly. It's pretty good at stressing file systems.

2011-03-08 15:21:49

by Bastien Roucariès

Subject: Re: Reiserfs deadlock in 2.6.36

>>
>> Running "while true; do sync && sleep 1; done" helps a lot.
>
> Which kernel are you running by the way?

2.6.37 now

>
>> >
>> > I don't know what to do yet. I need to think more about it.
>> >
>>
>> Could we do what I suggested at first? Use lockdep to track
>> journal open/close using a fake lock?
>
> I don't think it's a suitable test. Lockdep is useful for detecting lock inversion
> scenarios, but it's not very useful for detecting a lock that takes too long
> to be released. For that we have the hung task detector, whose report we already
> have.
>
>> BTW, it seems that someone has experienced this condition on ext3. I could
>> do more testing if you want, and I will run xfstests to see
>> if I can reproduce it more quickly.
>
> I'm not sure the file corruption and the deadlock are linked. But
> maybe xfstests can provoke the deadlock (or the file corruption)
> more quickly. It's pretty good at stressing file systems.
>
Do you know a test number to try?

Bastien