2003-08-20 15:14:40

by Martin Zwickel

[permalink] [raw]
Subject: 2.6.0-t3: vfs/ext3 do_lookup bug?!

Hi there!

Today I wanted to check out some src-files from cvs.
But my fault was, that I ran cvs twice at the same time.

so two "cvs upd -d -A" are now in 'D' state.

I think they got stuck because both tried to access the same file.


#ps lax
4 0 11833 11832 15 0 4136 2664 down D pts/18 0:07 cvs upd -d -A
4 0 11933 11932 22 0 2672 1180 down D pts/19 0:00 cvs upd -d -A

#sysrq-t + dmesg:
cvs D 00000196 294197680 11833 11832 (NOTLB)
dc84de5c 00000082 d2314a0c 00000196 d23148fc dc84de58 cafba080 cccee770
00000282 cafba080 cccee778 c0107f8e 00000001 cafba080 c0119687 d49d3e94
cccee778 dc84de7c 00000000 d919a9f0 00000000 cccee770 cccee708 d919a980
Call Trace:
[<c0107f8e>] __down+0x7c/0xc7
[<c0119687>] default_wake_function+0x0/0x2e
[<c0108118>] __down_failed+0x8/0xc
[<c0159df3>] .text.lock.namei+0x5/0x16a
[<c0156c98>] do_lookup+0x96/0xa1
[<c0157077>] link_path_walk+0x3d4/0x762
[<c0157c57>] open_namei+0x8e/0x3f3
[<c014a265>] filp_open+0x43/0x69
[<c014a64c>] sys_open+0x5b/0x8b
[<c0109063>] syscall_call+0x7/0xb

cvs D C8E86424 297794096 11933 11932 (NOTLB)
d49d3e80 00000086 c03a561b c8e86424 00000001 d092d408 c2dd2080 cccee770
00000286 c2dd2080 cccee778 c0107f8e 00000001 c2dd2080 c0119687 cccee778
dc84de70 d49d3f38 dffe46c0 d49d3ee4 00000000 cccee770 cccee708 d919a980
Call Trace:
[<c0107f8e>] __down+0x7c/0xc7
[<c0119687>] default_wake_function+0x0/0x2e
[<c0108118>] __down_failed+0x8/0xc
[<c0159df3>] .text.lock.namei+0x5/0x16a
[<c0156c98>] do_lookup+0x96/0xa1
[<c0157077>] link_path_walk+0x3d4/0x762
[<c015783c>] __user_walk+0x49/0x5e
[<c0149be1>] sys_access+0x93/0x150
[<c015393d>] sys_stat64+0x37/0x39
[<c0109063>] syscall_call+0x7/0xb


So is it a kernel bug? the down in do_lookup shouldn't lock the process
forever...

Regards,
Martin

--
MyExcuse:
The salesman drove over the CPU board.

Martin Zwickel <[email protected]>
Research & Development

TechnoTrend AG <http://www.technotrend.de>


Attachments:
(No filename) (189.00 B)

2003-08-20 18:34:36

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

Martin Zwickel <[email protected]> wrote:
>
> Today I wanted to check out some src-files from cvs.
> But my fault was, that I ran cvs twice at the same time.
>
> so two "cvs upd -d -A" are now in 'D' state.
>
> I think they got stuck because both tried to access the same file.

How odd. Were they the only processes which were in D state?

2003-08-20 19:07:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

Andrew Morton wrote:

> Martin Zwickel <[email protected]> wrote:
>>
>> Today I wanted to check out some src-files from cvs.
>> But my fault was, that I ran cvs twice at the same time.
>>
>> so two "cvs upd -d -A" are now in 'D' state.
>>
>> I think they got stuck because both tried to access the same file.
>
> How odd. Were they the only processes which were in D state?

They are probably hung on the same semaphore, ie "dir->i_sem". They almost
certainly are not deadlocked on each other: something else has either left
the directory semaphore locked, or is in turn waiting for something else.

Martin - do you have the _full_ list of processes available?

Linus

2003-08-21 06:59:37

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

Martin Zwickel <[email protected]> wrote:
>
> cutted-dmesg.txt text/plain (15496 bytes)

Try `dmesg -s 1000000'. The silly thing has too small a buffer.

2003-08-21 06:59:38

by Martin Zwickel

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

On Wed, 20 Aug 2003 12:06:34 -0700
Linus Torvalds <[email protected]> bubbled:

> Andrew Morton wrote:
>
> > Martin Zwickel <[email protected]> wrote:
> >>
> >> Today I wanted to check out some src-files from cvs.
> >> But my fault was, that I ran cvs twice at the same time.
> >>
> >> so two "cvs upd -d -A" are now in 'D' state.
> >>
> >> I think they got stuck because both tried to access the same file.
> >
> > How odd. Were they the only processes which were in D state?
>
> They are probably hung on the same semaphore, ie "dir->i_sem". They almost
> certainly are not deadlocked on each other: something else has either left
> the directory semaphore locked, or is in turn waiting for something else.
>
> Martin - do you have the _full_ list of processes available?
>
> Linus

hmm, still not rebooted ;)
and i recognized that updatedb now also got stuck in the same down.
(somewhere @
/work/BuildEnv/sandbox/components/base/kernel/source/linux-2.4.21-rc1/include/c
onfig/x86/f00f/CVS/)
but where could the semaphore/mutex got locked and not unlocked?
the ps lax and lsof output is attached.
i'll reboot and fsck the fs...

and thx for the response.

Regards,
Martin

--
MyExcuse:
CPU needs bearings repacked

Martin Zwickel <[email protected]>
Research & Development

TechnoTrend AG <http://www.technotrend.de>


Attachments:
pslax.list (16.29 kB)
cutted-dmesg.txt (14.99 kB)
lsof.updatedb (1.20 kB)
lsof.cvs (3.35 kB)
Download all attachments

2003-08-21 07:25:41

by Martin Zwickel

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

On Wed, 20 Aug 2003 23:41:19 -0700
Andrew Morton <[email protected]> bubbled:

> Martin Zwickel <[email protected]> wrote:
> >
> > cutted-dmesg.txt text/plain (15496 bytes)
>
> Try `dmesg -s 1000000'. The silly thing has too small a buffer.

too late.. :(
rebooted and fscked.
on reboot, my console did hang up while unmounting fs's and i got tons of
strange errors about something on my fs(where the processes got stuck). can't
remeber the outputs, was too much and too fast.
only a sysrq-b helped.

on another fs i got some "Deleted inode ###### has zero dtime. Fix<y>?".
(on other boxes i get them sometimes too if i manually check a ext3
fs)
shouldnt ext3 prevent those errors, since it has a journal and should
recover them???

on the fs where the processes got stuck i got some unattached inodes:
Unattached inode 1035466
Connect to /lost+found<y>? yes

Inode 1035466 ref count is 2, should be 1. Fix<y>? yes

Unattached inode 1053163
Connect to /lost+found<y>? yes

Inode 1053163 ref count is 2, should be 1. Fix<y>? yes

Inode 1053382 ref count is 1, should be 2. Fix<y>? yes

Pass 5: Checking group summary information

is this the normal behaviour, to e2fsck the ext3 fs's after some time?
i thought that ext3 handles those errors itself.

well, after the reboot and fsck i can access my files again.

ps.: 2.6.0-t3 scheduler performance is not that good...

Regards,
Martin

--
MyExcuse:
Zombie processes detected, machine is haunted.

Martin Zwickel <[email protected]>
Research & Development

TechnoTrend AG <http://www.technotrend.de>


Attachments:
(No filename) (189.00 B)

2003-08-21 07:38:27

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

Martin Zwickel <[email protected]> wrote:
>
> On Wed, 20 Aug 2003 23:41:19 -0700
> Andrew Morton <[email protected]> bubbled:
>
> > Martin Zwickel <[email protected]> wrote:
> > >
> > > cutted-dmesg.txt text/plain (15496 bytes)
> >
> > Try `dmesg -s 1000000'. The silly thing has too small a buffer.
>
> too late.. :(
> rebooted and fscked.
> on reboot, my console did hang up while unmounting fs's and i got tons of
> strange errors about something on my fs(where the processes got stuck). can't
> remeber the outputs, was too much and too fast.
> only a sysrq-b helped.
>
> on another fs i got some "Deleted inode ###### has zero dtime. Fix<y>?".

Sigh. Well the filesystem obviously shat itself, so the fsck errors aren't
that surprising.

My guess would be that something oopsed while holding a directory semaphore
and you missed the oops. Maybe you were in X at the time?

If it happens again, please remember that dmesg needs the `-s 1000000'
option to prevent it from truncating output.

> ps.: 2.6.0-t3 scheduler performance is not that good...

It's pretty bad. I'm running 2.6.0-test3-mm1 here which has Ingo/Con
goodies and it is significantly improved.

Once that code is working sufficiently well for everyone and there is
consensus that the general direction is correct and the possible
regressions with mixed database workloads are sorted out, we'll fix it up.
So don't panic yet.

2003-08-21 08:32:33

by Martin Zwickel

[permalink] [raw]
Subject: Re: 2.6.0-t3: vfs/ext3 do_lookup bug?!

On Thu, 21 Aug 2003 00:40:18 -0700
Andrew Morton <[email protected]> bubbled:

> Sigh. Well the filesystem obviously shat itself, so the fsck errors aren't
> that surprising.
>
> My guess would be that something oopsed while holding a directory semaphore
> and you missed the oops. Maybe you were in X at the time?

yes, but I cant find an oops in my /var/log/messages and .?.gz. the last oops
is a month ago.
well, something really strange happened and if it happens again, i'll try to
figure it out what went wrong.

>
> If it happens again, please remember that dmesg needs the `-s 1000000'
> option to prevent it from truncating output.

yep, will do that next time.

>
> > ps.: 2.6.0-t3 scheduler performance is not that good...
>
> It's pretty bad. I'm running 2.6.0-test3-mm1 here which has Ingo/Con
> goodies and it is significantly improved.
>
> Once that code is working sufficiently well for everyone and there is
> consensus that the general direction is correct and the possible
> regressions with mixed database workloads are sorted out, we'll fix it up.
> So don't panic yet.

I tried many patches for test2. and some worked not that bad.
ok, i'll give -mm1 a try.

and thx for the help.

Regards,
Martin

--
MyExcuse:
temporary routing anomaly

Martin Zwickel <[email protected]>
Research & Development

TechnoTrend AG <http://www.technotrend.de>


Attachments:
(No filename) (189.00 B)