2009-09-29 21:48:30

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] New: kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256

Summary: kernel BUG at fs/ext3/super.c:435
Product: File System
Version: 2.5
Kernel Version: 2.6.31
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: normal
Priority: P1
Component: ext3
AssignedTo: [email protected]
ReportedBy: [email protected]
Blocks: 13615
Regression: Yes


Subject : [2.6.31] kernel BUG at fs/ext3/super.c:435
Submitter : Mikael Pettersson <[email protected]>
Date : 2009-09-21 7:29
References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4

This entry is being used for tracking a regression from 2.6.30. Please don't
close it until the problem is fixed in the mainline.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.


2009-10-04 20:48:18

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #1 from Rafael J. Wysocki <[email protected]> 2009-10-04 20:47:41 ---
On Sunday 04 October 2009, Mikael Pettersson wrote:
> Rafael J. Wysocki wrote:
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=3D14256
> > Subject : kernel BUG at fs/ext3/super.c:435
> > Submitter : Mikael Pettersson <[email protected]>
> > Date : 2009-09-21 7:29 (11 days old)
> > References : http://marc.info/?l=3Dlinux-kernel&m=3D125351816109264&w=3D4
>
> The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-10-09 22:02:20

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #2 from Rafael J. Wysocki <[email protected]> 2009-10-09 22:01:43 ---
On Friday 09 October 2009, Mikael Pettersson wrote:
> Mikael Pettersson writes:
> > Rafael J. Wysocki writes:
> > > On Sunday 04 October 2009, Mikael Pettersson wrote:
> > > > Rafael J. Wysocki wrote:
> > > > > The following bug entry is on the current list of known regressions
> > > > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > > > be listed and let me know (either way).
> > > > >
> > > > >
> > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=3D14256
> > > > > Subject : kernel BUG at fs/ext3/super.c:435
> > > > > Submitter : Mikael Pettersson <[email protected]>
> > > > > Date : 2009-09-21 7:29 (11 days old)
> > > > > References : http://marc.info/?l=3Dlinux-kernel&m=3D125351816109264&w=3D4
> > > >
> > > > The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
> > >
> > > Thanks for the update.
> > >
> > > Could you check the current Linus' tree, please? There are some known
> > > regression fixes in there.
> >
> > I tried simplified versions of the bug trigger on two machines
> > running 2.6.32-rc1-git6, and neither triggered the kernel bug.
> >
> > The original recipe involved doing a glibc rebuild, run its test
> > suite, install it, and reboot. Today however machine 1 was already
> > doing a rebuild so after the rebuild it did a reboot into the new
> > kernel before the install. The second machine booted the new kernel
> > directly to install the binary packages from the first machine.
> >
> > I'll re-run the full bug trigger recipe on a third machine later next
> > week (it must rebuild glibc itself anyway due to arch differences).
>
> Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> exact same bug on the third machine.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-10-10 01:32:19

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #3 from Theodore Tso <[email protected]> 2009-10-10 01:31:41 ---
On Fri, Oct 09, 2009 at 10:01:44PM +0000, [email protected]
wrote:
> > Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> > exact same bug on the third machine.

Can you isolate this down to something simpler? Will "make install"
followed by a reboot be sufficient to replicate the assertion failure?
If it does, will copying in a new version of libc.so.6 and then
rebooting be enough to trigger the assert failure?

Thanks,

- Ted

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-10-10 01:32:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [Bug 14256] kernel BUG at fs/ext3/super.c:435

On Fri, Oct 09, 2009 at 10:01:44PM +0000, [email protected] wrote:
> > Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> > exact same bug on the third machine.

Can you isolate this down to something simpler? Will "make install"
followed by a reboot be sufficient to replicate the assertion failure?
If it does, will copying in a new version of libc.so.6 and then
rebooting be enough to trigger the assert failure?

Thanks,

- Ted

2009-10-12 13:49:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #4 from Mikael Pettersson <[email protected]> 2009-10-12 13:48:39 ---
Theodore Tso writes:
> On Fri, Oct 09, 2009 at 10:01:44PM +0000,
[email protected] wrote:
> > > Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> > > exact same bug on the third machine.
>
> Can you isolate this down to something simpler? Will "make install"
> followed by a reboot be sufficient to replicate the assertion failure?

I've done repeated glibc downgrades and upgrades, using pre-built binary
packages, followed by reboots, but they have so far not triggered the bug.

My guess is that something in the build process or the test suite run
triggers the bug.

FWIW, the systems are rpm-based derived from Fedora, so the rebuilds
are just "rpmbuild --rebuild glibc-<version>.src.rpm", and the installs
are "rpm -Uvh" on the binary rpms, with --force if I'm doing a downgrade.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-10-12 13:49:14

by Mikael Pettersson

[permalink] [raw]
Subject: Re: [Bug 14256] kernel BUG at fs/ext3/super.c:435

Theodore Tso writes:
> On Fri, Oct 09, 2009 at 10:01:44PM +0000, [email protected] wrote:
> > > Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> > > exact same bug on the third machine.
>
> Can you isolate this down to something simpler? Will "make install"
> followed by a reboot be sufficient to replicate the assertion failure?

I've done repeated glibc downgrades and upgrades, using pre-built binary
packages, followed by reboots, but they have so far not triggered the bug.

My guess is that something in the build process or the test suite run
triggers the bug.

FWIW, the systems are rpm-based derived from Fedora, so the rebuilds
are just "rpmbuild --rebuild glibc-<version>.src.rpm", and the installs
are "rpm -Uvh" on the binary rpms, with --force if I'm doing a downgrade.

2009-10-12 16:20:32

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Eric Sandeen <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #5 from Eric Sandeen <[email protected]> 2009-10-12 16:19:54 ---
I can give this a whirl on fedora. Just to try to reproduce more exactly,
which version of the glibc rpm package are you rebuilding?

I'm also curious, does simply rebuilding the package and rebooting lead to the
same issue (w/o installing the result?)

Do you have rpm configured in any unique way or is the build happening in all
the normal/standard places?

So you have /tmp as ext3 as a separate fs, any other interesting fs config?

Thanks,
-Eric

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-05 16:07:29

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Jan Kara <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #6 from Jan Kara <[email protected]> 2009-11-05 16:07:31 ---
Eric, have you succeeded in reproducing this?

What seems to happen is that while we are in ext3_put_super(), there are still
some inode references held and therefore the inodes (regular files) don't get
deleted and we complain about them in ext3_put_super. The question is who holds
the reference... I did a quick audit of the code and couldn't find iget()
unmatched with iput() or a similar problem.

I guess I can create a debugging patch which might be able to show us who holds
the reference. Will try to code it next week. Mikael, will you be able to run
such debugging patch?

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-05 16:13:11

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #7 from Eric Sandeen <[email protected]> 2009-11-05 16:13:15 ---
Jan, nope haven't, to be honest fell off my radar as I was looking at that ext4
corruption. I've not tried to reproduce it yet.

Sorry,
-Eric

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-11 15:27:50

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #8 from Jan Kara <[email protected]> 2009-11-11 15:27:53 ---
Created an attachment (id=23749)
--> (http://bugzilla.kernel.org/attachment.cgi?id=23749)
Patch reporting lost inode references

This patch reports more information about inodes left in orphan list.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-11 15:29:13

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #9 from Jan Kara <[email protected]> 2009-11-11 15:29:18 ---
Mikael, could you please run a kernel with the above patch and post here
resulting messages during umount and also attach here your System.map file from
the tested kernel? Thanks.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-20 15:11:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Theodore Tso <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #10 from Theodore Tso <[email protected]> 2009-11-20 15:11:21 ---
Hi Mikael,

Have you had a chance to try running a kernel with Jan's debugging patch?

Is this a problem you can reproduce reliably, or was it a one-off?

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-11-21 14:27:36

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #11 from Rafael J. Wysocki <[email protected]> 2009-11-21 14:27:39 ---
On Saturday 21 November 2009, Mikael Pettersson wrote:
> Rafael J. Wysocki wrote:
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=3D14256
> > Subject : kernel BUG at fs/ext3/super.c:435
> > Submitter : Mikael Pettersson <[email protected]>
> > Date : 2009-09-21 7:29 (57 days old)
> > References : http://marc.info/?l=3Dlinux-kernel&m=3D125351816109264&w=3D4
>
> Still occurs with kernel 2.6.32-rc8.
>
> This time I only rebuilt the glibc src rpm, which also runs its
> test suite, but I did not install the new glibc binaries before
> the shutdown -r.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-08 11:12:14

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Jan Kara <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #12 from Jan Kara <[email protected]> 2009-12-08 11:12:19 ---
Ah, I've noticed that Mikael was not on CC of this bug so he didn't see my
messages. Mikael, could you please try running the debugging patch from comment
#8/#9? Thanks.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-08 11:47:35

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #13 from Mikael Pettersson <[email protected]> 2009-12-08 11:47:41 ---
(In reply to comment #12)
> Ah, I've noticed that Mikael was not on CC of this bug so he didn't see my
> messages. Mikael, could you please try running the debugging patch from comment
> #8/#9? Thanks.

I'll test the debug patch this weekend, right now I'm away from the machines
that trigger the problem.

(Since I issued all my bug reports on LKML it never occurred to me to add a cc:
to myself in this bugzilla entry. Sorry about that.)

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-14 12:40:27

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #14 from Mikael Pettersson <[email protected]> 2009-12-14 12:40:25 ---
With Jan Kara'a debugging patch on top of 2.6.32 final I get the following
after shutdown -r:

Unmounting file systems: sb orphan head is 22
sb_info orphan list:
inode sda2:22 at cc7c1618: mode 100600, nlink 0, next 13, i_count=2501
dentry=
NULL
iget IPs:
c009e590,
c009dd08,
inode sda2:13 at d1d932b8: mode 100600, nlink 0, next 12, i_count=2 dentry=
NULL
iget IPs:
c009e590,
c009dd08,
inode sda2:12 at d1de0618: mode 100600, nlink 0, next 0, i_count=1 dentry=
NULL
iget IPs:
c009e590,
c009dd08,
kernel BUG at fs/ext3/super.c:450!

c009e590 is fs/fs-writeback.c:writeback_inodes_wb(), line 681.

c009dd08 is fs/fs-writeback.c:wait_sb_inodes(), line 1170.

I get this with both gcc-4.3.4 and gcc-4.4.2 compiled kernels.

On a second machine I got a similar result, except the three inodes had only
one iget IP recorded, and it was the one in writeback_inodes_wb().

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-14 14:46:11

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Jan Kara <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #15 from Jan Kara <[email protected]> 2009-12-14 14:46:07 ---
Thanks for the debugging. So the problematic files were regular files that lead
pretty simple life - no references from unexpected places or so. What catches
my eye is that inode numbers are surprisingly small (22, 12, 13; in your
original report 17, 15). Since normal inode numbers start by 11 and that is
usually taken by lost+found, numbers 12 and 13 are as small as they can get. I
suppose this is a filesystem dedicated for the build, isn't it? And the files
making problems are created as a first files on that filesystem in its root
directory. Any idea what those files are?

Another thing: i_count=2501 for inode 22 is absurdly high. We'd have to leak
references really heavily for that file.

I've checked the code in writeback_inodes_wb() and wait_sb_inodes() but
reference counting there looks fine. So I have no clue what's going on right
now. Anyway, adding Jens to CC just in case it rings a bell to him.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-15 09:08:49

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #16 from Mikael Pettersson <[email protected]> 2009-12-15 09:08:45 ---
The affected file system is /tmp. There's nothing special about it, except it
being separate from / and /home. The builds are plain "rpmbuild --rebuild" so
they do most accesses under /usr/src/ and /var/tmp/, both on /, but the glibc
test suite will create temporary files on /tmp. I have no idea what those files
are.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2009-12-29 13:18:07

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]
| |.com




--- Comment #17 from [email protected] 2009-12-29 13:18:03 ---
I have reproduced the exact same problem in a new LFS system using 2.6.32.2
kernel.

The bug is definitely triggered but issuing a "make check" of eglibc before a
reboot. You do not need to install it or just build it, whatever triggers this
bug is related to something that eglibc makes during tests.

---

brw-rw---- 1 root disk 253, 3 Dec 29 2009 /dev/dm-3

brw------- 1 root root 253, 3 Dec 29 2009 LV0-tmp


Unmounting all other currently mounted file systems...
/etc/rc.d/rc6.d/S70mountfs: line 18: 22979 Segmentation fault umount -a -d
-r > /dev/null
[ FAIL ]


[ 903.074244] sb orphan head is 13
[ 903.088507] sb_info orphan list:
[ 903.102590] inode dm-3:13 at ffff880114948a10: mode 100600, nlink 0, next
12
[ 903.131992] inode dm-3:12 at ffff880114907340: mode 100600, nlink 0, next
0
[ 903.149056] ------------[ cut here ]------------
[ 903.149057] kernel BUG at fs/ext3/super.c:435!
[ 903.149058] invalid opcode: 0000 [#1] SMP
[ 903.149060] last sysfs file: /sys/class/net/bonding_masters
[ 903.149061] CPU 2
[ 903.149062] Modules linked in: bonding ipv6 bnx2 usbhid button hpilo
[ 903.149067] Pid: 22979, comm: umount Not tainted 2.6.32.2 #1 ProLiant DL360
G6
[ 903.149068] RIP: 0010:[<ffffffff8117d883>] [<ffffffff8117d883>]
ext3_put_super+0x233/0x280
[ 903.149075] RSP: 0018:ffff88011cd6be28 EFLAGS: 00010202
[ 903.149076] RAX: ffff8801149489c8 RBX: ffff88011d857e00 RCX:
ffffffff815e0800
[ 903.149077] RDX: 00000000007f3f13 RSI: 0000000000000046 RDI:
ffffffff816dc130
[ 903.149078] RBP: ffff88011cd6be58 R08: 0000000000000000 R09:
0000000000000001
[ 903.149080] R10: 0000000000000000 R11: 000000000e200000 R12:
ffff88011d3db400
[ 903.149081] R13: ffff88011d857f98 R14: ffff88011d857f98 R15:
ffff88011d0cc060
[ 903.149083] FS: 00007f9f88456730(0000) GS:ffff88002c040000(0000)
knlGS:0000000000000000
[ 903.149084] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 903.149086] CR2: 00007f9f8845c000 CR3: 000000011dc84000 CR4:
00000000000006e0
[ 903.149087] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 903.149088] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 903.149090] Process umount (pid: 22979, threadinfo ffff88011cd6a000, task
ffff88011d0cc060)
[ 903.149091] Stack:
[ 903.149092] ffff880100000000 ffff88011d3db400 ffff88011d3db400
ffffffff814a4ac0
[ 903.149094] <0> ffffffff8163d1b8 ffff88011d3db400 ffff88011cd6be78
ffffffff810f6cc1
[ 903.149096] <0> ffff88011f29de40 0000000000000003 ffff88011cd6be98
ffffffff810f6d8c
[ 903.149098] Call Trace:
[ 903.149104] [<ffffffff810f6cc1>] generic_shutdown_super+0x51/0xf0
[ 903.149106] [<ffffffff810f6d8c>] kill_block_super+0x2c/0x50
[ 903.149109] [<ffffffff810f81ea>] deactivate_super+0x6a/0xf0
[ 903.149112] [<ffffffff81113d6e>] mntput_no_expire+0x9e/0xf0
[ 903.149114] [<ffffffff8111518e>] sys_umount+0x5e/0x460
[ 903.149117] [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[ 903.149118] Code: c7 b1 de 54 81 31 c0 e8 3c 4d 30 00 4c 8b b3 98 01 00 00
49 8b 06 4d 39 f5 0f 18 08 75 11 4c 3b ab 98 01 00 00 0f 84 c4 fe ff ff <0f> 0b
eb fe 49 8d 46 80 49 8d 4e 48 48 8b b0 c8 01 00 00 44 8b
[ 903.149131] RIP [<ffffffff8117d883>] ext3_put_super+0x233/0x280
[ 903.149134] RSP <ffff88011cd6be28>
[ 903.159165] ---[ end trace 3ed92bd591190341 ]---
[ 903.159166] ------------[ cut here ]------------
[ 903.159169] WARNING: at kernel/exit.c:895 do_exit+0x94a/0x9e0()
[ 903.159170] Hardware name: ProLiant DL360 G6
[ 903.159171] Modules linked in: bonding ipv6 bnx2 usbhid button hpilo
[ 903.159175] Pid: 22979, comm: umount Tainted: G D 2.6.32.2 #1
[ 903.159176] Call Trace:
[ 903.159179] [<ffffffff8104c218>] warn_slowpath_common+0x78/0xf0
[ 903.159181] [<ffffffff8104c29f>] warn_slowpath_null+0xf/0x20
[ 903.159183] [<ffffffff81050fca>] do_exit+0x94a/0x9e0
[ 903.159187] [<ffffffff8106de01>] ? down_trylock+0x31/0x50
[ 903.159190] [<ffffffff81010305>] oops_end+0x75/0xa0
[ 903.159192] [<ffffffff810108a1>] die+0xd1/0x110
[ 903.159194] [<ffffffff8100d0d4>] do_trap+0xc4/0x170
[ 903.159196] [<ffffffff8100d580>] do_invalid_op+0x90/0xb0
[ 903.159198] [<ffffffff8117d883>] ? ext3_put_super+0x233/0x280
[ 903.159200] [<ffffffff8100c815>] invalid_op+0x15/0x20
[ 903.159203] [<ffffffff8117d883>] ? ext3_put_super+0x233/0x280
[ 903.159205] [<ffffffff8117d8c7>] ? ext3_put_super+0x277/0x280
[ 903.159207] [<ffffffff810f6cc1>] generic_shutdown_super+0x51/0xf0
[ 903.159209] [<ffffffff810f6d8c>] kill_block_super+0x2c/0x50
[ 903.159211] [<ffffffff810f81ea>] deactivate_super+0x6a/0xf0
[ 903.159213] [<ffffffff81113d6e>] mntput_no_expire+0x9e/0xf0
[ 903.159215] [<ffffffff8111518e>] sys_umount+0x5e/0x460
[ 903.159217] [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[ 903.159218] ---[ end trace 3ed92bd591190342 ]---

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-11 21:59:34

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #18 from Jan Kara <[email protected]> 2010-01-11 21:59:30 ---
Guys, I'm having problems reproducing this. I've checked out 2.10 version of
eglibc but when I run "make check" I get error on:
GCONV_PATH=/local/test1/build/iconvdata LC_ALL=C
LOCPATH=/local/test1/build/localedata
/local/test1/build/elf/ld-linux-x86-64.so.2 --library-path
/local/test1/build:/local/test1/build/math:/local/test1/build/elf:/local/test1/build/dlfcn:/local/test1/build/nss:/local/test1/build/nis:/local/test1/build/rt:/local/test1/build/resolv:/local/test1/build/crypt:/local/test1/build/nptl
/local/test1/build/localedata/tst_mblen

probably because of:
mblen:de_DE.ISO-8859-1:1:1:2:F:the return value is different from an expected
value

When I configure the package without TLS, I cannot get even that far... So
could you either provide me with a simpler reproduction case or tell me how to
make "make check" do what I need? Thanks...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-11 22:55:42

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #19 from [email protected] 2010-01-11 22:55:40 ---
Jan,

You usually run glibc checks with -k (some errors are usual and recurring):
make -k check

You can follow CLFS/LFS guideline here using the pre-downloaded version of
eglibc and patches:
http://cross-lfs.org/view/svn/x86_64/final-system/eglibc-64bit.html

Regards,
Carlos Velasco

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-12 16:03:34

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #20 from Jan Kara <[email protected]> 2010-01-12 16:03:29 ---
OK, now "make -k check" have been doing something for a reasonable time but I
don't see problems with orphan files you see. Only at the end there are a few
ld-linux-... processes hanging so that I cannot umount the filesystem (I have
to kill them to umount the filesystem) - do you also observe such processes?

Also if you can still reproduce the problem - could you please find which test
does exactly cause the problem so that I can concentrate on it? Thanks.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-12 17:30:55

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #21 from [email protected] 2010-01-12 17:30:52 ---
I don't see any ld-linux process after make -k check

bash-4.0# ps -ef | grep -i ld-linux
root 13640 18376 0 18:27 ? 00:00:00 grep -i ld-linux

bash-4.0# umount /tmp
[435397.580983] sb orphan head is 14
[435397.595803] sb_info orphan list:
[435397.611165] inode dm-3:14 at ffff8800c7880d28: mode 100600, nlink 0, next
13
[435397.639671] inode dm-3:13 at ffff8800df2566e0: mode 100600, nlink 0, next
12
[435397.672072] inode dm-3:12 at ffff8800df2560b0: mode 100600, nlink 0, next
0
bash-4.0# [435397.672080] ------------[ cut here ]------------
[435397.672082] kernel BUG at fs/ext3/super.c:435!
[435397.672083] invalid opcode: 0000 [#1] SMP
[435397.672085] last sysfs file: /sys/class/net/bonding_masters
[435397.672086] CPU 2
[435397.672087] Modules linked in: bonding ipv6 usbhid bnx2 hpilo button
[435397.672092] Pid: 13641, comm: umount Not tainted 2.6.32.3 #1 ProLiant DL360
G6
[435397.672093] RIP: 0010:[<ffffffff8117da53>] [<ffffffff8117da53>]
ext3_put_super+0x233/0x280
[435397.672099] RSP: 0018:ffff88011cda5e28 EFLAGS: 00010206
[435397.672101] RAX: ffff8800c7880ce0 RBX: ffff88011cd54e00 RCX:
000000000001055f
[435397.672102] RDX: 0000000000004646 RSI: 0000000000000082 RDI:
ffffffff815e6330
[435397.672104] RBP: ffff88011cda5e58 R08: 000000000001055e R09:
ffffffff816dd240
[435397.672105] R10: 0000000000000013 R11: ffffffff816dd240 R12:
ffff88011d13b400
[435397.672106] R13: ffff88011cd54f98 R14: ffff88011cd54f98 R15:
ffff88011d3e1350
[435397.672108] FS: 00007f910e82b730(0000) GS:ffff88002c040000(0000)
knlGS:0000000000000000
[435397.672110] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[435397.672111] CR2: 00007f910dedb32e CR3: 000000011cf65000 CR4:
00000000000006e0
[435397.672112] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[435397.672114] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[435397.672116] Process umount (pid: 13641, threadinfo ffff88011cda4000, task
ffff88011d3e1350)
[435397.672117] Stack:
[435397.672118] ffff880100000000 ffff88011d13b400 ffff88011d13b400
ffffffff814a4ac0
[435397.672120] <0> ffffffff8163d1b8 ffff88011d13b400 ffff88011cda5e78
ffffffff810f6cd1
[435397.672121] <0> ffff88011f016180 0000000000000003 ffff88011cda5e98
ffffffff810f6d9c
[435397.672124] Call Trace:
[435397.672130] [<ffffffff810f6cd1>] generic_shutdown_super+0x51/0xf0
[435397.672132] [<ffffffff810f6d9c>] kill_block_super+0x2c/0x50
[435397.672134] [<ffffffff810f81fa>] deactivate_super+0x6a/0xf0
[435397.672137] [<ffffffff81113dce>] mntput_no_expire+0x9e/0xf0
[435397.672140] [<ffffffff811151ee>] sys_umount+0x5e/0x460
[435397.672143] [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[435397.672144] Code: c7 41 df 54 81 31 c0 e8 b7 52 30 00 4c 8b b3 98 01 00 00
49 8b 06 4d 39 f5 0f 18 08 75 11 4c 3b ab 98 01 00 00 0f 84 c4 fe ff ff <0f> 0b
eb fe 49 8d 46 80 49 8d 4e 48 48 8b b0 c8 01 00 00 44 8b
[435397.672158] RIP [<ffffffff8117da53>] ext3_put_super+0x233/0x280
[435397.672160] RSP <ffff88011cda5e28>
[435397.682079] ---[ end trace 7a92197c0643be5c ]---
[435397.682080] ------------[ cut here ]------------
[435397.682083] WARNING: at kernel/exit.c:895 do_exit+0x94a/0x9e0()
[435397.682084] Hardware name: ProLiant DL360 G6
[435397.682085] Modules linked in: bonding ipv6 usbhid bnx2 hpilo button
[435397.682088] Pid: 13641, comm: umount Tainted: G D 2.6.32.3 #1
[435397.682090] Call Trace:
[435397.682093] [<ffffffff8104c208>] warn_slowpath_common+0x78/0xf0
[435397.682095] [<ffffffff8104c28f>] warn_slowpath_null+0xf/0x20
[435397.682097] [<ffffffff81050fba>] do_exit+0x94a/0x9e0
[435397.682100] [<ffffffff8106ddf1>] ? down_trylock+0x31/0x50
[435397.682104] [<ffffffff81010305>] oops_end+0x75/0xa0
[435397.682106] [<ffffffff810108a1>] die+0xd1/0x110
[435397.682108] [<ffffffff8100d0d4>] do_trap+0xc4/0x170
[435397.682110] [<ffffffff8100d580>] do_invalid_op+0x90/0xb0
[435397.682112] [<ffffffff8117da53>] ? ext3_put_super+0x233/0x280
[435397.682114] [<ffffffff8106ddf1>] ? down_trylock+0x31/0x50
[435397.682117] [<ffffffff8100c815>] invalid_op+0x15/0x20
[435397.682119] [<ffffffff8117da53>] ? ext3_put_super+0x233/0x280
[435397.682121] [<ffffffff8117da97>] ? ext3_put_super+0x277/0x280
[435397.682123] [<ffffffff810f6cd1>] generic_shutdown_super+0x51/0xf0
[435397.682126] [<ffffffff810f6d9c>] kill_block_super+0x2c/0x50
[435397.682128] [<ffffffff810f81fa>] deactivate_super+0x6a/0xf0
[435397.682130] [<ffffffff81113dce>] mntput_no_expire+0x9e/0xf0
[435397.682132] [<ffffffff811151ee>] sys_umount+0x5e/0x460
[435397.682134] [<ffffffff8100baeb>] system_call_fastpath+0x16/0x1b
[435397.682135] ---[ end trace 7a92197c0643be5d ]---


There are many many tests, difficult finding which one.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-12 20:12:45

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #22 from Jan Kara <[email protected]> 2010-01-12 20:12:39 ---
Hmm, strange. I've now downloaded eglibc 2.10.1 from the site you pointed to
and followed instructions there. Now I don't have any ld-linux hung processes
and everything seems to finish fine but still no luck in reproducing the
problem.

About finding which test is causing this: I suppose you could easily disable
tests from different directories and the when you find in which directory the
problematic test is, you could do a binary search on tests. But I understand
it's quite a bit of work...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-12 23:25:51

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #23 from [email protected] 2010-01-12 23:25:48 ---
It is something in the nptl (threads) subdir tests.

eglibc-build/nptl# make subdir=nptl -C ../../eglibc-2.10.1/nptl
objdir=/compile/eglibc-build tests
...
eglibc-build/nptl# umount /tmp
Segmentation fault

Not sure about how to isolate the exact test from here now... will try to do my
best.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-13 00:34:02

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #24 from [email protected] 2010-01-13 00:34:00 ---
Found. It is the test "tst-mutexpi9".

eglibc-2.10.1/nptl# GCONV_PATH=/compile/eglibc-build/iconvdata LC_ALL=C
/compile/eglibc-build/elf/ld-linux-x86-64.so.2 --library-path
/compile/eglibc-build:/compile/eglibc-build/math:/compile/eglibc-build/elf:/compile/eglibc-build/dlfcn:/compile/eglibc-build/nss:/compile/eglibc-build/nis:/compile/eglibc-build/rt:/compile/eglibc-build/resolv:/compile/eglibc-build/crypt:/compile/eglibc-build/nptl
/compile/eglibc-build/nptl/tst-mutexpi9 >
/compile/eglibc-build/nptl/tst-mutexpi9.out

eglibc-2.10.1/nptl# umount /tmp
Segmentation fault

eglibc-2.10.1/nptl# cat /compile/eglibc-build/nptl/tst-mutexpi9.out
going to fork now

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-18 22:41:51

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #25 from Jan Kara <[email protected]> 2010-01-18 22:41:46 ---
Whohoo, reproduced! The trick was that this test is creating files in /tmp and
I was doing all the compilation (and unmounting) in a different partition than
/tmp is. Thanks for identifying the test case.

For others interested: What the test does is: It creates a temporary file,
mmaps it, deletes it, creates pthread mutex in the mmap, forks, does some mutex
locking, exits. During this process we apparently leak an inode reference, yay.
What is interesting is that when the mutex is created without priority
inheritance logic, everything works fine so there's something racy about our
pthread mutex handling in the kernel.

I'll now go and debug it more, I just wanted to share the news.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-19 19:56:44

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Jan Kara <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
CC| |[email protected]




--- Comment #26 from Jan Kara <[email protected]> 2010-01-19 19:56:41 ---
OK, I've spent some time looking into this. It's definitely a futex problem -
futexes take reference to inode and apparently this test case takes some code
path where we don't drop all the inode references we have acquired. I've tried
to find that place by looking into the code but wasn't successful.

I think the problem is caused by futex rewrite that happened between 2.6.30 and
2.6.31 so I'm adding Darren to CC (who was author of most of the patches).
Darren, I'm able to reproduce the problem with 2.6.33-rc3 kernel so apparently
the problem isn't fixed yet. I'll attach here a minimal test program that
triggers the problem.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-19 19:57:27

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #27 from Jan Kara <[email protected]> 2010-01-19 19:57:25 ---
Created an attachment (id=24639)
--> (http://bugzilla.kernel.org/attachment.cgi?id=24639)
Test program causing lost inode references

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-19 23:24:43

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #28 from Darren Hart <[email protected]> 2010-01-19 23:24:41 ---
>From the following in the test:

mem = mmap (NULL, ps, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

this may be related to be the problem reported by Kosaki and fixed with his
patch recently pulled into tip by Ingo and sent Linus and stable. Can you try
with this patch from tip/core-fixes-for-linus?

commit 7485d0d3758e8e6491a5c9468114e74dc050785d
Author: KOSAKI Motohiro <[email protected]>
Date: Tue Jan 5 16:32:43 2010 +0900

futexes: Remove rw parameter from get_futex_key()

Currently, futexes have two problem:

A) The current futex code doesn't handle private file mappings properly.

get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:

1) thread-A call futex(private-mapping, FUTEX_WAIT), it
sleeps on file mapping object.
2) thread-B writes a variable and it makes it cow.
3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
wakes up blocked thread on the anonymous page. (but it's nothing)

B) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.

The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.

Performance concerns:

Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.

Compatibility concerns:

This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: KOSAKI Motohiro <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Acked-by: Darren Hart <[email protected]>
Cc: <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Ulrich Drepper <[email protected]>
LKML-Reference: <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 10:40:34

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #29 from Jan Kara <[email protected]> 2010-01-20 10:40:30 ---
I've tested a kernel I've pulled today from Linus (which has the above commit)
and it still has the issue. Note that the mapping created is shared so it's
unlikely that the above patch would really fix it...

BTW: The problem is 100% reproducible for me (and for the guys reporting it as
well) so I believe that you should see it as well with the test program.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 13:52:48

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #30 from Mikael Pettersson <[email protected]> 2010-01-20 13:52:46 ---
Using the standalone test case I can reproduce the bug with all kernels back to
2.6.29, but not with 2.6.28. I'm trying a bisect now.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 18:11:19

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #31 from Darren Hart <[email protected]> 2010-01-20 18:11:16 ---
Jan, I apologize, I missed the shared mapping. I wouldn't expect Kosaki's patch
to fix this. Have you experimented with the SHARED, RECURSIVE, and PRIO_INHERIT
flags of the mutexattr to see if this is the only configuration in which you
see the problem? Understanding which of these flags are required to reproduce
would help point at where to look in the code. Right now I suspect a futex key
reference imbalance in the futex_lock_pi() code path. 2.6.29 includes changes
from me on fault handling as well as key reference imbalance fixes in this
area.

Mikael, please let me know what your git bisection reveals.

I'm trying to make time to do some testing/analysis myself, but I'm being kept
very busy on other pressing issues right now, so I may not be able to respond
as quickly as I'd like.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 19:10:28

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #32 from Darren Hart <[email protected]> 2010-01-20 19:10:25 ---
I see above that PRIO_INHERIT is required, and the test depends on SHARED. I
suppose we could try with pthreads instead of fork to see if it's related to
SHARED. For now I'll start examining the futex_lock_pi() path - particular
around the signal induced wakeup.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 19:29:54

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #33 from Darren Hart <[email protected]> 2010-01-20 19:29:53 ---
Took a look at the 2.6.29 code, I believe it is possible to have an inode
reference imbalance when a fault is taken. Unfortunately, both queue_lock() and
get_futex_key() acquire references to the inode (I'd like to do away with
queue_lock() as it masks reference usage and generally complicates the
corner-case-heavy futex code). In the fault path queue_unlock() will release
the first inode reference, but if on the first attempt (attempt == 0) the
get_user() fails, we'll simply return without dropping the second reference.
Some instrumentation could confirm this. I'll take a look at later sources
which should have a significantly different fault path.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 19:54:39

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #34 from Mikael Pettersson <[email protected]> 2010-01-20 19:54:37 ---
My bisection has identified the following as the culprit:

commit 38d47c1b7075bd7ec3881141bb3629da58f88dab
Author: Peter Zijlstra <[email protected]>
Date: Fri Sep 26 19:32:20 2008 +0200

futex: rely on get_user_pages() for shared futexes

On the way of getting rid of the mmap_sem requirement for shared futexes,
start by relying on get_user_pages().

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-20 20:09:09

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #35 from Darren Hart <[email protected]> 2010-01-20 20:09:08 ---
The fault path inode count problem appears to be fixed in 2.6.33-rc3 where the
fault logic is rewritten and and we properly account for both references taken
by queue_lock as well as get_futex_key(). Since Jan in comment #26 said he can
reproduce on 2.6.33-rc3, this isn't likely to be the path that caused the
problem. I'll continue to look.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-23 22:18:43

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #36 from Mikael Pettersson <[email protected]> 2010-01-23 22:18:38 ---
I've posted a futex bug fix to lkml:
<http://marc.info/?l=linux-kernel&m=126428261230399&w=2>

It fixed the standalone test case. I'm now doing a full glibc rebuild and test
suite run on a machine with the fixed kernel.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-24 11:28:10

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #37 from Mikael Pettersson <[email protected]> 2010-01-24 11:28:05 ---
(In reply to comment #36)
> I've posted a futex bug fix to lkml:
> <http://marc.info/?l=linux-kernel&m=126428261230399&w=2>
>
> It fixed the standalone test case. I'm now doing a full glibc rebuild and test
> suite run on a machine with the fixed kernel.

Completed. The patch eliminated all orphan inodes from the glibc test suite.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-24 17:13:47

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #38 from [email protected] 2010-01-24 17:13:44 ---
It works for me.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-24 18:18:00

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #39 from [email protected] 2010-01-24 18:17:58 ---
Not sure about the patch.
With patch applied, after testing and installing glibc I see plenty of tmp
files left in /tmp:

-rw------- 1 root root 0 Jan 24 18:56 tst-atime.HiK5Qd

-rw------- 1 root root 0 Jan 24 19:07 cc2mcjSe.o
-rw------- 1 root root 0 Jan 24 19:07 cc3s3qwc.c
-rw------- 1 root root 9766 Jan 24 19:07 ccNVp9dh.ld
-rw------- 1 root root 0 Jan 24 19:07 cckbLeAj.le

I don't remember to see them without patch applied.

Also, when rebooting I see this error:

Unmounting all other currently mounted file systems...
umount: /: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))

Very odd...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-24 18:20:16

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #40 from [email protected] 2010-01-24 18:20:15 ---
After rebooting the temp files are gone...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-24 19:27:36

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #41 from [email protected] 2010-01-24 19:27:33 ---
Tested with unpatched kernel and the temp files are there too. Can't test the
error unmounting / because when /tmp is unmounted it gives oops. This may be
another problem.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-25 01:37:35

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #42 from [email protected] 2010-01-25 01:37:33 ---
Same problem using 2.6.33-rc5 with the patch.

Unmounting all other currently mounted file systems...
umount: /: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))

I don't know why but this only happens with the patch applied.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-25 10:29:06

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #43 from Mikael Pettersson <[email protected]> 2010-01-25 10:28:58 ---
Please clarify exactly what you do to trigger the busy /.

Did you run the glibc test suite or the specific tst-mutex9 program?
If so, then without the futex_lock_pi() patch you wouldn't even get to a normal
shutdown due to the oops, so the busy / may be normal and caused by something
else.

Also, does the busy / prevent the shutdown from succeeding?

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-25 12:07:14

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #44 from Jan Kara <[email protected]> 2010-01-25 12:07:11 ---
Carlos, EBUSY from umount usually means that some process has some file open on
the filesystem (/ in your case). Are there any usual processes running before
you run shutdown? Or maybe you could try just switching to single user mode and
then use lsof to see processes which have files open on /...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-28 17:19:21

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #45 from [email protected] 2010-01-28 17:19:13 ---
Sorry about delay, I made further tests.
I think the patch is good and this is another problem.

The oops happened during TEST of the eglibc. This has been solved by the patch.
This second issue is related to the INSTALL of the eglibc.

After "make install" and reboot I always see an umount busy message for /.
This busy message was not seen before because we stepped into the oops
umounting /tmp, before umounting /.

I tried downgrading to linux kernel 2.6.28 but I couldn't, I receive an NMI
interrupt and system hang before testing with this kernel.

So I have issued a "ps -ef" and a lsof just before the umount all.

Looking output of ps, there are no user processes apart of these:
root 30920 1 0 17:50 ttyS1 00:00:00 init [6]
root 30921 30920 0 17:50 ttyS1 00:00:00 /bin/bash /etc/rc.d/init.d/rc 6
root 31085 30921 0 17:50 ttyS1 00:00:00 /bin/bash
/etc/rc.d/rc6.d/S70mou

Umount command is:
"umount -a -d -r >/dev/null"

I ONLY see this error when installing eglibc. Rebooting the system without
installing eglibc is successful without errors.

Looking into LSOF, I have been looking into differences between a "good reboot"
and a "bad reboot - umount busy". I can only see that in the "bad reboot" the
SIZE/OFF columns are empty for the new installed eglibc files.

Bad:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root DEL REG 253,1 887056
/lib64/libc-2.10.1.so
init 1 root DEL REG 253,1 887055 /lib64/ld-2.10.1.so

Good:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
init 1 root mem REG 253,1 1752309 887013 /lib64/libc-2.10.1.so
init 1 root mem REG 253,1 149463 887007 /lib64/ld-2.10.1.so

Not sure how to proceed from here. Not sure if this is an ext3 issue also...

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-28 17:23:57

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #46 from [email protected] 2010-01-28 17:23:55 ---
Created an attachment (id=24757)
--> (http://bugzilla.kernel.org/attachment.cgi?id=24757)
lsof before umount in a reboor without busy error

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-28 17:25:08

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


[email protected] changed:

What |Removed |Added
----------------------------------------------------------------------------
Attachment #24757|lsof before umount in a |lsof before umount in a
description|reboor without busy error |reboot without busy error




--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-28 17:24:44

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #47 from [email protected] 2010-01-28 17:24:43 ---
Created an attachment (id=24758)
--> (http://bugzilla.kernel.org/attachment.cgi?id=24758)
lsof before umount in a "bad" reboot with busy error

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-01-28 17:55:57

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256





--- Comment #48 from Jan Kara <[email protected]> 2010-01-28 17:55:53 ---
Thanks for debugging Carlos! Note, that in the 'bad' case, the files are
deleted... When there are still open but deleted files, we cannot remount the
filesystem read-only because that would make deleting the file at closing of
the last file descriptor impossible. So what you see is an expected behavior...

Mikael, will you take care of pushing your fix to vanilla kernel and possibly
also -stable releases? Thanks.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.

2010-02-10 19:16:07

by bugzilla-daemon

[permalink] [raw]
Subject: [Bug 14256] kernel BUG at fs/ext3/super.c:435

http://bugzilla.kernel.org/show_bug.cgi?id=14256


Jan Kara <[email protected]> changed:

What |Removed |Added
----------------------------------------------------------------------------
AssignedTo|[email protected] |[email protected]
|g |




--- Comment #49 from Jan Kara <[email protected]> 2010-02-10 19:15:27 ---
Rafael, the patch has been merged as 5ecb01cfdf96c5f465192bdb2a4fd4a61a24c6cc
so I'm closing the bug.

--
Configure bugmail: http://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are watching the assignee of the bug.