2011-03-02 03:44:22

by George Spelvin

[permalink] [raw]
Subject: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

Just after resuming from a suspend to RAM, 2.6.38-rc6 plus Eric Dumazet's
[PATCH] net: Add default_advmss() methods to blackhole dst_ops

Core 2 duo, 32-bit kernel, 2 GiB RAM.
I suspend to RAM every few days; this is not the first
time since booting.

The display hadn'd flipped to X yet, so I managed to capture the screen
(manually transcribed):

BUG: unable to handle kernel NULL pointer dereference at 00000030
IP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104
*pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.1/usb3/3-1/3-1.3/3-1.3:1.0/input11/name
Modules linked in: btusb sco rfcomm l2cap crc16 bluetooth ctr twofish_generic twofish_i586 twofish_common serpent xcbc sha256_generic b43 mac80211 cfg80211

Pid: 23367, comm: mount Not tainted 2.6.38-rc6 #228 Dell Inc. MXC061 /0MG532
EIP: 0060:[<c10a3a21>] EFLAGS: 0010246 CPU: 1
EIP is at sync_inodes_sb+0xa9/0x104
EAX: c003fd5c EBX: f58f606c ECX: 00000000 EDX: 00000909
ESI: d4cc1424 EDI: c003fd0c EBP: 00000000 ESP: f56ffee8
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process mount (pid: 23367, ti=f56fe000 task=c28472a0 task.ti = f56fe000)
Stack:
7fffffff f58f6000 00000001 00000000 f56ffef8 f56ffef8 f56fff04 00000000
00000303 f56fff0c f56fff0c f58f6000 00000001 00000000 f4a57000 c10a6c72
f58f6000 00000020 00000000 c108e442 00000000 00000000 00000020 00000000
Call Trace:
[<c10a6c72>] ? _sync_filesystem+0x3a/0x69
[<c108e442>] ? do_remount_sb+0x59/0xce
[<c10a067b>] ? do_mount+0x214/0x6a8
[<c10a0b83>] ? sys_mount+0x74/0xa9
[<c1002750>] ? sysenter_do_call+0x12/0x26
Code: b9 3a c1 e8 f4 6d f8 ff b8 80 f7 51 c1 31 f6 e8 f1 09 26 00 8b 7b 6c 83 c3 6c 83 ef 50 eb 44 f6 47 30 38 75 38 8b af c0 00 00 00 <83> 7d 30 00 74 2c 89 f8 e8 1a 85 ff ff fe 05 80 f7 51 c1 89 f0
EIP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104 SS:ESP 0068:f56ffee8
CR2: 0000000000000030

I can't make much sense out of it, but maybe someone can. It's a simple
laptop with one rotating hard drive. Distro is Debian/unstable, kernel
is hand-compiled from Linus' git.


2011-03-02 11:13:07

by Anton Altaparmakov

[permalink] [raw]
Subject: a major regression in recent kernels? - was: Re: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

Hi Linus,

I am not sure who the "responsible" person for this is but a problem is emerging on recent kernels where device hot plug and/or suspend/resume (which I guess effectively performas hot plug/unplug as part of its operation) causes kernel crashes with NULL pointer dereference. It has now been reported at least by three different people on LKML/fs-devel and also one of Tuxera's customers is seeing it too on their embedded hardware.

The problem appears to be that on device unplug, the s_bdi field in the struct super block is set to NULL. But at that point a file system is still mounted and fully operational! So any file system / VFS code that goes through s_bdi causes a NULL pointer dereference.

Not sure what the correct solution is but this is definitely a functional regression given I now am aware of four independent reports in recent kernels but never before.

Here are links to the reports in the archives:

https://lkml.org/lkml/2010/12/9/436

http://www.gossamer-threads.com/lists/linux/kernel/1345992

And below is the third report which appears to be the same problem as the above two, i.e. the super block's s_bdi is NULL, so dereferencing it causes a NULL pointer dereference and sync_inodes_sb() contains a call to bdi_queue_work() which is likely inlined automatically by gcc and the crash is thus when bdi is first dereferenced in bdi_queue_work()...

Not sure what the solution is. I don't think it is a matter of checking against s_bdi being NULL because it is used in a lot of places so it will be hard to track them all down. More likely s_bdi should not be set to NULL until all super_block structures referencing it go away and the disappeared device driver code should be adapted to match but I may be wrong...

Just thought you should be made aware as this problem appears to be ignored / stay under the radar of people who actually have a hope of fixing it properly so far...

Best regards,

Anton

On 2 Mar 2011, at 03:44, George Spelvin wrote:
> Just after resuming from a suspend to RAM, 2.6.38-rc6 plus Eric Dumazet's
> [PATCH] net: Add default_advmss() methods to blackhole dst_ops
>
> Core 2 duo, 32-bit kernel, 2 GiB RAM.
> I suspend to RAM every few days; this is not the first
> time since booting.
>
> The display hadn'd flipped to X yet, so I managed to capture the screen
> (manually transcribed):
>
> BUG: unable to handle kernel NULL pointer dereference at 00000030
> IP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104
> *pde = 00000000
> Oops: 0000 [#1] SMP
> last sysfs file: /sys/devices/pci0000:00/0000:00:1d.1/usb3/3-1/3-1.3/3-1.3:1.0/input11/name
> Modules linked in: btusb sco rfcomm l2cap crc16 bluetooth ctr twofish_generic twofish_i586 twofish_common serpent xcbc sha256_generic b43 mac80211 cfg80211
>
> Pid: 23367, comm: mount Not tainted 2.6.38-rc6 #228 Dell Inc. MXC061 /0MG532
> EIP: 0060:[<c10a3a21>] EFLAGS: 0010246 CPU: 1
> EIP is at sync_inodes_sb+0xa9/0x104
> EAX: c003fd5c EBX: f58f606c ECX: 00000000 EDX: 00000909
> ESI: d4cc1424 EDI: c003fd0c EBP: 00000000 ESP: f56ffee8
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process mount (pid: 23367, ti=f56fe000 task=c28472a0 task.ti = f56fe000)
> Stack:
> 7fffffff f58f6000 00000001 00000000 f56ffef8 f56ffef8 f56fff04 00000000
> 00000303 f56fff0c f56fff0c f58f6000 00000001 00000000 f4a57000 c10a6c72
> f58f6000 00000020 00000000 c108e442 00000000 00000000 00000020 00000000
> Call Trace:
> [<c10a6c72>] ? _sync_filesystem+0x3a/0x69
> [<c108e442>] ? do_remount_sb+0x59/0xce
> [<c10a067b>] ? do_mount+0x214/0x6a8
> [<c10a0b83>] ? sys_mount+0x74/0xa9
> [<c1002750>] ? sysenter_do_call+0x12/0x26
> Code: b9 3a c1 e8 f4 6d f8 ff b8 80 f7 51 c1 31 f6 e8 f1 09 26 00 8b 7b 6c 83 c3 6c 83 ef 50 eb 44 f6 47 30 38 75 38 8b af c0 00 00 00 <83> 7d 30 00 74 2c 89 f8 e8 1a 85 ff ff fe 05 80 f7 51 c1 89 f0
> EIP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104 SS:ESP 0068:f56ffee8
> CR2: 0000000000000030
>
> I can't make much sense out of it, but maybe someone can. It's a simple
> laptop with one rotating hard drive. Distro is Debian/unstable, kernel
> is hand-compiled from Linus' git.

--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

2011-03-02 18:32:11

by Linus Torvalds

[permalink] [raw]
Subject: Re: a major regression in recent kernels? - was: Re: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

Hmm. Adding some more people to the cc - in particular Jens. I'm not
sure that he'd be on the fsdevel list (although maybe he is).

The whole "backing_dev_info" has been a total disaster. The thing is
crap. It violates all the normal kernel memory management rules ("Thou
shalt use reference counts and free only when it goes to zero") and
the whole thing has been a constant source of "oh, that driver didn't
set it, but we changed all the code to require it to be correct".

And the reason we set it to NULL when the device goes away is exactly
that it's not ref-counted correctly, so we really _have_ to set it to
NULL, because it's not going to be around.

(And the reverse of that is why all kernel data structures should use
refcounts, and not some external lifetime notion)

Gaah. The caller has already done the check for a NULL s_bdi:

/*
* This should be safe, as we require bdi backing to actually
* write out data in the first place
*/
if (!sb->s_bdi || sb->s_bdi == &noop_backing_dev_info)
return 0;

before it calls into "sync_inodes_sb(sb);", but since that stupid
disconnect event can happen at any time, that doesn't protect
anything. The s_bdi field clearly just becomes NULL after the check.

Jens? This was introduced in 592b09a42fc3 ("backing-dev: ensure that a
removed bdi no longer has super_block referencing it") over a year
ago, but the _real_ problem goes back all the way to commit
32a88aa1b6df which introduced that broken "s_bdi" without refcounts to
begin with.

Guys, any idea on how to fix this? The hacky way might be to introduce
a dummy backing_dev_info and instead of setting s_bdi to NULL, we set
it to the dummy one that doesn't do anything. It's still hacky as
hell, though. The real problem really is that having pointers to
structures without refcounts is just _always_ wrong.

See Documentation/CodingStyle: "Chapter 11: Data structures":

"Remember: if another thread can find your data structure, and you don't
have a reference count on it, you almost certainly have a bug."

wiser words have seldom been spoken.

Linus

On Wed, Mar 2, 2011 at 2:52 AM, Anton Altaparmakov <[email protected]> wrote:
> Hi Linus,
>
> I am not sure who the "responsible" person for this is but a problem is emerging on recent kernels where device hot plug and/or suspend/resume (which I guess effectively performas hot plug/unplug as part of its operation) causes kernel crashes with NULL pointer dereference. ?It has now been reported at least by three different people on LKML/fs-devel and also one of Tuxera's customers is seeing it too on their embedded hardware.
>
> The problem appears to be that on device unplug, the s_bdi field in the struct super block is set to NULL. ?But at that point a file system is still mounted and fully operational! ?So any file system / VFS code that goes through s_bdi causes a NULL pointer dereference.
>
> Not sure what the correct solution is but this is definitely a functional regression given I now am aware of four independent reports in recent kernels but never before.
>
> Here are links to the reports in the archives:
>
> ? ? ? ?https://lkml.org/lkml/2010/12/9/436
>
> ? ? ? ?http://www.gossamer-threads.com/lists/linux/kernel/1345992
>
> And below is the third report which appears to be the same problem as the above two, i.e. the super block's s_bdi is NULL, so dereferencing it causes a NULL pointer dereference and sync_inodes_sb() contains a call to bdi_queue_work() which is likely inlined automatically by gcc and the crash is thus when bdi is first dereferenced in bdi_queue_work()...
>
> Not sure what the solution is. ?I don't think it is a matter of checking against s_bdi being NULL because it is used in a lot of places so it will be hard to track them all down. ?More likely s_bdi should not be set to NULL until all super_block structures referencing it go away and the disappeared device driver code should be adapted to match but I may be wrong...
>
> Just thought you should be made aware as this problem appears to be ignored / stay under the radar of people who actually have a hope of fixing it properly so far...
>
> Best regards,
>
> ? ? ? ?Anton
>
> On 2 Mar 2011, at 03:44, George Spelvin wrote:
>> Just after resuming from a suspend to RAM, 2.6.38-rc6 plus Eric Dumazet's
>> [PATCH] net: Add default_advmss() methods to blackhole dst_ops
>>
>> Core 2 duo, 32-bit kernel, 2 GiB RAM.
>> I suspend to RAM every few days; this is not the first
>> time since booting.
>>
>> The display hadn'd flipped to X yet, so I managed to capture the screen
>> (manually transcribed):
>>
>> BUG: unable to handle kernel NULL pointer dereference at 00000030
>> IP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104
>> *pde = 00000000
>> Oops: 0000 [#1] SMP
>> last sysfs file: /sys/devices/pci0000:00/0000:00:1d.1/usb3/3-1/3-1.3/3-1.3:1.0/input11/name
>> Modules linked in: btusb sco rfcomm l2cap crc16 bluetooth ctr twofish_generic twofish_i586 twofish_common serpent xcbc sha256_generic b43 mac80211 cfg80211
>>
>> Pid: 23367, comm: mount Not tainted 2.6.38-rc6 #228 Dell Inc. MXC061 ? ? ? ? ? ? ? ? ? ? ? ? ?/0MG532
>> EIP: 0060:[<c10a3a21>] EFLAGS: 0010246 CPU: 1
>> EIP is at sync_inodes_sb+0xa9/0x104
>> EAX: c003fd5c EBX: f58f606c ECX: 00000000 EDX: 00000909
>> ESI: d4cc1424 EDI: c003fd0c EBP: 00000000 ESP: f56ffee8
>> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
>> Process mount (pid: 23367, ti=f56fe000 task=c28472a0 task.ti = f56fe000)
>> Stack:
>> 7fffffff f58f6000 00000001 00000000 f56ffef8 f56ffef8 f56fff04 00000000
>> 00000303 f56fff0c f56fff0c f58f6000 00000001 00000000 f4a57000 c10a6c72
>> f58f6000 00000020 00000000 c108e442 00000000 00000000 00000020 00000000
>> Call Trace:
>> [<c10a6c72>] ? _sync_filesystem+0x3a/0x69
>> [<c108e442>] ? do_remount_sb+0x59/0xce
>> [<c10a067b>] ? do_mount+0x214/0x6a8
>> [<c10a0b83>] ? sys_mount+0x74/0xa9
>> [<c1002750>] ? sysenter_do_call+0x12/0x26
>> Code: b9 3a c1 e8 f4 6d f8 ff b8 80 f7 51 c1 31 f6 e8 f1 09 26 00 8b 7b 6c 83 c3 6c 83 ef 50 eb 44 f6 47 30 38 75 38 8b af c0 00 00 00 <83> 7d 30 00 74 2c 89 f8 e8 1a 85 ff ff fe 05 80 f7 51 c1 89 f0
>> EIP: [<c10a3a21>] sync_inodes_sb+0xa9/0x104 SS:ESP 0068:f56ffee8
>> CR2: 0000000000000030
>>
>> I can't make much sense out of it, but maybe someone can. ?It's a simple
>> laptop with one rotating hard drive. ?Distro is Debian/unstable, kernel
>> is hand-compiled from Linus' git.
>
> --
> Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
> Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
> Linux NTFS maintainer, http://www.linux-ntfs.org/
>
>

2011-03-03 00:15:12

by Jens Axboe

[permalink] [raw]
Subject: Re: a major regression in recent kernels? - was: Re: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

On 2011-03-02 13:31, Linus Torvalds wrote:
> Hmm. Adding some more people to the cc - in particular Jens. I'm not
> sure that he'd be on the fsdevel list (although maybe he is).
>
> The whole "backing_dev_info" has been a total disaster. The thing is
> crap. It violates all the normal kernel memory management rules ("Thou
> shalt use reference counts and free only when it goes to zero") and
> the whole thing has been a constant source of "oh, that driver didn't
> set it, but we changed all the code to require it to be correct".
>
> And the reason we set it to NULL when the device goes away is exactly
> that it's not ref-counted correctly, so we really _have_ to set it to
> NULL, because it's not going to be around.
>
> (And the reverse of that is why all kernel data structures should use
> refcounts, and not some external lifetime notion)
>
> Gaah. The caller has already done the check for a NULL s_bdi:
>
> /*
> * This should be safe, as we require bdi backing to actually
> * write out data in the first place
> */
> if (!sb->s_bdi || sb->s_bdi == &noop_backing_dev_info)
> return 0;
>
> before it calls into "sync_inodes_sb(sb);", but since that stupid
> disconnect event can happen at any time, that doesn't protect
> anything. The s_bdi field clearly just becomes NULL after the check.
>
> Jens? This was introduced in 592b09a42fc3 ("backing-dev: ensure that a
> removed bdi no longer has super_block referencing it") over a year
> ago, but the _real_ problem goes back all the way to commit
> 32a88aa1b6df which introduced that broken "s_bdi" without refcounts to
> begin with.
>
> Guys, any idea on how to fix this? The hacky way might be to introduce
> a dummy backing_dev_info and instead of setting s_bdi to NULL, we set
> it to the dummy one that doesn't do anything. It's still hacky as
> hell, though. The real problem really is that having pointers to
> structures without refcounts is just _always_ wrong.
>
> See Documentation/CodingStyle: "Chapter 11: Data structures":
>
> "Remember: if another thread can find your data structure, and you don't
> have a reference count on it, you almost certainly have a bug."
>
> wiser words have seldom been spoken.

Agree, from the ->s_bdi point of view it has been and is a mess. I guess
the hope was to avoid adding real reference counting for the bdi. I just
took a quick look at it, and I don't think it'll be too problematic to
add. The bdi is mostly just a settings container.

We already pretty much have a dummy backing_dev_info,
default_backing_dev_info. 2.6.38 and stable backport would be to use
that and going forward ensure we probably reference the bdi when it's
attached to the super block.

I've got a flight coming up tomorrow, I'll cook up both patches for
this.


--
Jens Axboe

2011-03-04 12:52:58

by Christoph Hellwig

[permalink] [raw]
Subject: Re: a major regression in recent kernels? - was: Re: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

On Wed, Mar 02, 2011 at 10:31:15AM -0800, Linus Torvalds wrote:
> The whole "backing_dev_info" has been a total disaster. The thing is
> crap. It violates all the normal kernel memory management rules ("Thou
> shalt use reference counts and free only when it goes to zero") and
> the whole thing has been a constant source of "oh, that driver didn't
> set it, but we changed all the code to require it to be correct".
>
> And the reason we set it to NULL when the device goes away is exactly
> that it's not ref-counted correctly, so we really _have_ to set it to
> NULL, because it's not going to be around.
>
> (And the reverse of that is why all kernel data structures should use
> refcounts, and not some external lifetime notion)

Yes. But the bdi is even worse than that, as it conflates things with
different lifetime into a single object. We have the "old school" bdi
which mostly contained various bits of tuning for the VM and read-ahead
algorithms. This one is required to stay around even with no fs mounted
on block devices because people expect it to stay around with no fs
mounted. And then we have the writeback context entangled into it,
which only makes sense with an active filesystem (or block device node)
on it to make it special fun. Even more fun is that we have a pointer
from the superblock, and one from the inode, and the latter might point
to lala land if this is say a /dev/mem node which has a different bdi
for the "old-school" MM usage.

I had various stages of prototypes for separating the two into:

1) the old bdi. Life time rules are: allocated and reference counted
with the containing device. That is gendisk for block devices,
server context for remote devices, static at module init time for
/dev/zero and similar.
2) writeback context. Only exists if a user is there, and thus
refcounted by itself. For non-blockdevice filesystem instances it's
trivially always allocated with the superblock, and goes away with it.
For block-device instances we need to keep a pointer to it from
struct block_device and properly look it up on mount, or opening of
the block device nodes.

I guess I need to get back to it, but kept it off for now as the code
had reached relative stability and really fear touching it again.

It's for sure not .38 material, though.

2011-03-14 07:52:45

by Jens Axboe

[permalink] [raw]
Subject: Re: a major regression in recent kernels? - was: Re: Null pointer OOPS in sync_inodes_sb+0xa9/0x104

George,

If you can reproduce this easily, can you try with this patch? When we
move the potential dirty list entries to the default_backing_dev_info,
reassign the sb->s_bdi as well. default_backing_dev_info will always be
around. I hope this can fix it up for 2.6.38 and we can add the proper
ref counting for .39.

diff --git a/fs/super.c b/fs/super.c
index 7e9dd4c..0d89e93 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -71,6 +71,7 @@ static struct super_block *alloc_super(struct file_system_type *type)
#else
INIT_LIST_HEAD(&s->s_files);
#endif
+ s->s_bdi = &default_backing_dev_info;
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
@@ -1003,6 +1004,7 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void
}
BUG_ON(!mnt->mnt_sb);
WARN_ON(!mnt->mnt_sb->s_bdi);
+ WARN_ON(mnt->mnt_sb->s_bdi == &default_backing_dev_info);
mnt->mnt_sb->s_flags |= MS_BORN;

error = security_sb_kern_mount(mnt->mnt_sb, flags, secdata);
diff --git a/fs/sync.c b/fs/sync.c
index ba76b96..412dc89 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -33,7 +33,7 @@ static int __sync_filesystem(struct super_block *sb, int wait)
* This should be safe, as we require bdi backing to actually
* write out data in the first place
*/
- if (!sb->s_bdi || sb->s_bdi == &noop_backing_dev_info)
+ if (sb->s_bdi == &noop_backing_dev_info)
return 0;

if (sb->s_qcop && sb->s_qcop->quota_sync)
@@ -79,7 +79,7 @@ EXPORT_SYMBOL_GPL(sync_filesystem);

static void sync_one_sb(struct super_block *sb, void *arg)
{
- if (!(sb->s_flags & MS_RDONLY) && sb->s_bdi)
+ if (!(sb->s_flags & MS_RDONLY))
__sync_filesystem(sb, *(int *)arg);
}
/*
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..8e4ed88 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -604,7 +604,7 @@ static void bdi_prune_sb(struct backing_dev_info *bdi)
spin_lock(&sb_lock);
list_for_each_entry(sb, &super_blocks, s_list) {
if (sb->s_bdi == bdi)
- sb->s_bdi = NULL;
+ sb->s_bdi = &default_backing_dev_info;
}
spin_unlock(&sb_lock);
}

--
Jens Axboe