2015-04-08 13:46:50

by Jan Engelhardt

[permalink] [raw]
Subject: NULL deref around xfs in v4.0-rc1–rc7


Starting somewhere around v4.0-rc1 and persisting through commit
v4.0-rc7, there is a new NULL deference apparently happening in
conjunction with xfs. This inhibits this machine's booting,
as xfs is used for the root filesystem.

First bisection points at first-bad commit v4.0-rc1~8, and since that is
a merge commit, I'll be investigating some more hand-chosen commits (and
then people to Cc) as we speak.


Boot log of v4.0-rc1~8:

Fusion MPT base driver 3.04.20
Copyright (c) 1999-2008 LSI Corporation
Fusion MPT SAS Host driver 3.04.20
mptbase: ioc0: Initiating bringup
ioc0: LSISAS1068 A0: Capabilities={Initiator}
scsi host0: ioc0: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=22
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1, sas_addr 0x1060504030201a0
scsi 0:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 0
mptbase: ioc1: Initiating bringup
ioc1: LSISAS1068 A0: Capabilities={Initiator}
scsi host1: ioc1: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=17
mptsas: ioc1: attaching ssp device: fw_channel 0, fw_id 0, phy 0, sas_addr 0x60504030201a0
scsi 1:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
scsi 1:0:0:0: Attached scsi generic sg1 type 0
sd 0:0:0:0: [sda] 12582912 512-byte logical blocks: (6.44 GB/6.00 GiB)
sd 1:0:0:0: [sdb] 16777216 512-byte logical blocks: (8.58 GB/8.00 GiB)
sd 0:0:0:0: [sda] Write Protect is off
sd 0:0:0:0: [sda] Incomplete mode parameter data
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 1:0:0:0: [sdb] Write Protect is off
sd 1:0:0:0: [sdb] Incomplete mode parameter data
sd 1:0:0:0: [sdb] Assuming drive cache: write through
sda: sda1 sda2
sd 0:0:0:0: [sda] Attached SCSI disk
sdb: sdb1 sdb2
sd 1:0:0:0: [sdb] Attached SCSI disk
audit: type=1130 audit(1428456646.877:11): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-vconsole-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Please enter passphrase for disk HARDDISK (sfroot)! 
NET: Registered protocol family 38
audit_printk_skb: 3 callbacks suppressed
audit: type=1130 audit(1428456653.677:13): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-cryptsetup@sfroot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1130 audit(1428456653.941:14): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dracut-initqueue comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1130 audit(1428456654.369:15): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-fsck@dev-mapper-sfroot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
SGI XFS with ACLs, security attributes, realtime, no debug enabled
XFS (dm-0): Mounting V5 Filesystem
XFS (dm-0): Ending clean mount
audit: type=1130 audit(1428456654.705:16): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=initrd-parse-etc comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1131 audit(1428456654.761:17): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=initrd-parse-etc comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1130 audit(1428456655.077:18): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=dracut-pre-pivot comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1130 audit(1428456655.157:19): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-ask-password-console comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1131 audit(1428456655.417:20): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=systemd-ask-password-console comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1130 audit(1428456655.437:21): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=initrd-cleanup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
audit: type=1131 audit(1428456655.453:22): pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=initrd-cleanup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
systemd-journald[155]: Received SIGTERM from PID 1 (systemd).
BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffff812718d0>] scsi_init_cmd_errh+0x26/0x5d
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: xfs crc32c_generic libcrc32c dm_crypt xts gf128mul algif_skcipher af_alg sd_mod mptsas scsi_transport_sas mptscsih mptbase dm_mod sg ipv6
CPU: 0 PID: 447 Comm: systemd-cgroups Not tainted 4.0.0-rc1 #21
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: ffff88007acceeb0 ti: ffff88007bcc0000 task.ti: ffff88007bcc0000
RIP: 0010:[<ffffffff812718d0>] [<ffffffff812718d0>] scsi_init_cmd_errh+0x26/0x5d
RSP: 0018:ffff88007bcc3730 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88007b078800 RCX: 0000000000000018
RDX: ffff88007b02d370 RSI: 0000000000000000 RDI: 0000000000001000
RBP: ffff88007b02d370 R08: ffff88007bf1a440 R09: 00000000000000fa
R10: 0000000000000000 R11: ffffea0001eef880 R12: 0000000000000000
R13: ffff88007b078800 R14: ffff88007bc35000 R15: ffff88007b02d200
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000001000 CR3: 000000007cb79000 CR4: 00000000000007f0
Stack:
ffffffff812731ac ffff88007acceeb0 ffff88007bc35000 ffff88007b075400
ffff88007b078968 ffff88007b02d378 ffff88007fc18e88 ffff88007b070f20
ffff88007bcc37c0 ffff88007b075408 0000000000000000 0000000000000000
Call Trace:
[<ffffffff812731ac>] ? scsi_queue_rq+0x2e5/0x3d3
[<ffffffff8118d833>] ? __blk_mq_run_hw_queue+0x19a/0x29f
[<ffffffff8118da01>] ? blk_mq_alloc_request+0xc9/0x102
[<ffffffffa00f774b>] ? __xfs_get_blocks+0x321/0x321 [xfs]
[<ffffffff8118df7c>] ? blk_mq_run_hw_queue+0x4a/0x93
[<ffffffff8118ebfa>] ? blk_sq_make_request+0x166/0x171
[<ffffffff8118638e>] ? generic_make_request+0x8f/0xcc
[<ffffffff811864ce>] ? submit_bio+0x103/0x121
[<ffffffff810cc0ae>] ? get_page+0x9/0x25
[<ffffffff810cc49f>] ? __lru_cache_add+0x1a/0x3a
[<ffffffff8113629a>] ? mpage_bio_submit+0x1f/0x25
[<ffffffff81136f07>] ? mpage_readpages+0xe2/0xf6
[<ffffffffa00f774b>] ? __xfs_get_blocks+0x321/0x321 [xfs]
[<ffffffff810c8d19>] ? get_page_from_freelist+0x3b3/0x785
[<ffffffff810f860a>] ? alloc_pages_current+0xad/0xca
[<ffffffff810cb5f9>] ? __do_page_cache_readahead+0x116/0x1af
[<ffffffff810c4731>] ? filemap_fault+0x18e/0x393
[<ffffffff810df31c>] ? __do_fault+0x3b/0x8f
[<ffffffff810e2033>] ? handle_mm_fault+0x50e/0xfc9
[<ffffffff810daf1c>] ? vma_interval_tree_insert+0x1f/0x82
[<ffffffff810e5909>] ? vma_set_page_prot+0x33/0x52
[<ffffffff81034345>] ? __do_page_fault+0x212/0x3ba
[<ffffffff81337de8>] ? page_fault+0x28/0x30
[<ffffffff811adf85>] ? __clear_user+0x1c/0x3d
[<ffffffff811455f9>] ? padzero+0x1b/0x28
[<ffffffff81147144>] ? load_elf_binary+0x7d5/0x1009
[<ffffffff811135dc>] ? search_binary_handler+0x6b/0x170
[<ffffffff81114780>] ? do_execveat_common.isra.31+0x45a/0x5dd
[<ffffffff81114926>] ? do_execve+0x23/0x28
[<ffffffff8104d4f7>] ? ____call_usermodehelper+0x100/0x128
[<ffffffff8104d3f7>] ? call_usermodehelper+0x47/0x47
[<ffffffff8133637c>] ? ret_from_fork+0x7c/0xb0
[<ffffffff8104d3f7>] ? call_usermodehelper+0x47/0x47
Code: c2 89 d0 5b c3 48 c7 87 b0 00 00 00 00 00 00 00 c7 87 f4 00 00 00 00 00 00 00 48 89 fa 48 8b bf 10 01 00 00 31 c0 b9 18 00 00 00 <f3> ab 66 83 ba cc 00 00 00 00 75 2a 48 8b 8a d8 00 00 00 8a 01
RIP [<ffffffff812718d0>] scsi_init_cmd_errh+0x26/0x5d
RSP <ffff88007bcc3730>
CR2: 0000000000001000
---[ end trace ef687cac676b43b0 ]---


2015-04-08 15:20:26

by Jan Engelhardt

[permalink] [raw]
Subject: Re: NULL deref around xfs in v4.0-rc1–rc7

On Wednesday 2015-04-08 15:41, Jan Engelhardt wrote:

>Starting somewhere around v4.0-rc1 and persisting through commit
>v4.0-rc7, there is a new NULL deference apparently happening in
>conjunction with xfs. This inhibits this machine's booting,
>as xfs is used for the root filesystem.
>
>First bisection points at first-bad commit v4.0-rc1~8, and since that is
>a merge commit, I'll be investigating some more hand-chosen commits (and
>then people to Cc) as we speak.

I reran bisect just to be sure.
It now shows v4.0-rc1~9 is bad, v4.0-rc1~9^1 is ok, and v4.0-rc~9^2 is
ok too. So this means that the combination of the both ~9 childs work
badly together.


# good: [2bfedd1d9f470506d98cb5662ced381c38225968] Merge branch 'for-linus' of git://git.kernel.dk/linux-block
git bisect good 2bfedd1d9f470506d98cb5662ced381c38225968
# bad: [cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae] Merge tag 'pm+acpi-3.20-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae
# good: [9d0de5a63a4a22abfd2bd70694a610d18350cf87] Merge branches 'acpi-ec', 'acpi-soc', 'acpi-video' and 'acpi-resources'
git bisect good 9d0de5a63a4a22abfd2bd70694a610d18350cf87
# good: [67fadaa2768716209ee19a8b8bf05bc3ac399445] cpufreq: s3c: remove last use of resume_clocks callback
git bisect good 67fadaa2768716209ee19a8b8bf05bc3ac399445
# good: [70734a786acfd1998e47d40df19cba5c29469bdf] cpuidle: powernv: Avoid endianness conversions while parsing DT
git bisect good 70734a786acfd1998e47d40df19cba5c29469bdf
# good: [3466b547e37b988723dc93465b7cb06b4b1f731f] Merge branches 'pnp', 'pm-cpuidle' and 'pm-cpufreq'
git bisect good 3466b547e37b988723dc93465b7cb06b4b1f731f
# first bad commit: [cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae] Merge tag 'pm+acpi-3.20-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
# good: [2bfedd1d9f470506d98cb5662ced381c38225968] Merge branch 'for-linus' of git://git.kernel.dk/linux-block
git bisect good 2bfedd1d9f470506d98cb5662ced381c38225968
# bad: [cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae] Merge tag 'pm+acpi-3.20-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae
# good: [9d0de5a63a4a22abfd2bd70694a610d18350cf87] Merge branches 'acpi-ec', 'acpi-soc', 'acpi-video' and 'acpi-resources'
git bisect good 9d0de5a63a4a22abfd2bd70694a610d18350cf87
# good: [67fadaa2768716209ee19a8b8bf05bc3ac399445] cpufreq: s3c: remove last use of resume_clocks callback
git bisect good 67fadaa2768716209ee19a8b8bf05bc3ac399445
# good: [70734a786acfd1998e47d40df19cba5c29469bdf] cpuidle: powernv: Avoid endianness conversions while parsing DT
git bisect good 70734a786acfd1998e47d40df19cba5c29469bdf
# good: [3466b547e37b988723dc93465b7cb06b4b1f731f] Merge branches 'pnp', 'pm-cpuidle' and 'pm-cpufreq'
git bisect good 3466b547e37b988723dc93465b7cb06b4b1f731f
# first bad commit: [cd50b70ccd5c87794ec28bfb87b7fba9961eb0ae] Merge tag 'pm+acpi-3.20-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm



BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffff81269d9e>] scsi_init_cmd_errh+0x26/0x5d
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: xfs crc32c_generic libcrc32c dm_crypt xts gf128mul algif_skcipher af_alg sd_mod mptsas scsi_transport_sas mptscsih mptbase dm_mod sg ipv6
CPU: 0 PID: 406 Comm: kworker/u2:0 Not tainted 3.19.0+ #53
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: ffff88007bf73c60 ti: ffff88007cb20000 task.ti: ffff88007cb20000
RIP: 0010:[<ffffffff81269d9e>] [<ffffffff81269d9e>] scsi_init_cmd_errh+0x26/0x5d
RSP: 0018:ffff88007cb23870 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88007bfa6800 RCX: 0000000000000018
RDX: ffff88007bfec970 RSI: 0000000000000000 RDI: 0000000000001000
RBP: ffff88007bfec970 R08: ffff88007be345c0 R09: 00000000000000fa
R10: 0000000000000001 R11: ffffea0001ec8c40 R12: 0000000000000000
R13: ffff88007bfa6800 R14: ffff88007bc04000 R15: ffff88007bfec800
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000001000 CR3: 000000007bc9b000 CR4: 00000000000007f0
Stack:
ffffffff8126b67a ffff88007bf73c60 ffff88007bc04000 ffff88007bf13400
ffff88007bfa6968 ffff88007bfec978 ffff88007fc18e48 ffff88007bfb4f20
ffff88007cb23900 ffff88007bf13408 0000000000000000 0000000000000000
Call Trace:
[<ffffffff8126b67a>] ? scsi_queue_rq+0x2e5/0x3d3
[<ffffffff8118d840>] ? __blk_mq_run_hw_queue+0x19a/0x29f
[<ffffffff8118da01>] ? blk_mq_alloc_request+0xbc/0x102
[<ffffffffa00f974b>] ? __xfs_get_blocks+0x321/0x321 [xfs]
[<ffffffff8118df89>] ? blk_mq_run_hw_queue+0x4a/0x93
[<ffffffff8118ec07>] ? blk_sq_make_request+0x166/0x171
[<ffffffff8118639b>] ? generic_make_request+0x8f/0xcc
[<ffffffff811864db>] ? submit_bio+0x103/0x121
[<ffffffff810cc0ae>] ? get_page+0x9/0x25
[<ffffffff810cc49f>] ? __lru_cache_add+0x1a/0x3a
[<ffffffff81136312>] ? mpage_bio_submit+0x1f/0x25
[<ffffffff81136f7f>] ? mpage_readpages+0xe2/0xf6
[<ffffffffa00f974b>] ? __xfs_get_blocks+0x321/0x321 [xfs]
[<ffffffff810f85f9>] ? alloc_pages_current+0xad/0xca
[<ffffffff810cb5f9>] ? __do_page_cache_readahead+0x116/0x1af
[<ffffffff811a8a21>] ? radix_tree_lookup_slot+0x10/0x23
[<ffffffff810cb88b>] ? ondemand_readahead+0x1f9/0x20a
[<ffffffff810c3231>] ? pagecache_get_page+0x22/0x138
[<ffffffff810c3dab>] ? generic_file_read_iter+0x17a/0x4d4
[<ffffffffa00d430e>] ? xfs_attr_get+0x52/0x113 [xfs]
[<ffffffffa01013d8>] ? xfs_file_read_iter+0x1bb/0x20d [xfs]
[<ffffffff8110e8c9>] ? new_sync_read+0x67/0x8b
[<ffffffff8110f539>] ? vfs_read+0x6d/0xb7
[<ffffffff81112ff7>] ? kernel_read+0x39/0x47
[<ffffffff811146f3>] ? do_execveat_common.isra.31+0x3b7/0x5dd
[<ffffffff8111493c>] ? do_execve+0x23/0x28
[<ffffffff8104d4f7>] ? ____call_usermodehelper+0x100/0x128
[<ffffffff8104d3f7>] ? call_usermodehelper+0x47/0x47
[<ffffffff813188fc>] ? ret_from_fork+0x7c/0xb0
[<ffffffff8104d3f7>] ? call_usermodehelper+0x47/0x47
Code: c2 89 d0 5b c3 48 c7 87 b0 00 00 00 00 00 00 00 c7 87 f4 00 00 00 00 00 00 00 48 89 fa 48 8b bf 10 01 00 00 31 c0 b9 18 00 00 00 <f3> ab 66 83 ba cc 00 00 00 00 75 2a 48 8b 8a d8 00 00 00 8a 01
RIP [<ffffffff81269d9e>] scsi_init_cmd_errh+0x26/0x5d
RSP <ffff88007cb23870>
CR2: 0000000000001000
---[ end trace 54414923d584f14b ]---

2015-04-09 17:38:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: NULL deref around xfs in v4.0-rc1–rc7

On Wed, Apr 8, 2015 at 8:20 AM, Jan Engelhardt <[email protected]> wrote:
> On Wednesday 2015-04-08 15:41, Jan Engelhardt wrote:
>
>>Starting somewhere around v4.0-rc1 and persisting through commit
>>v4.0-rc7, there is a new NULL deference apparently happening in
>>conjunction with xfs. This inhibits this machine's booting,
>>as xfs is used for the root filesystem.
>>
>>First bisection points at first-bad commit v4.0-rc1~8, and since that is
>>a merge commit, I'll be investigating some more hand-chosen commits (and
>>then people to Cc) as we speak.
>
> I reran bisect just to be sure.
> It now shows v4.0-rc1~9 is bad, v4.0-rc1~9^1 is ok, and v4.0-rc~9^2 is
> ok too. So this means that the combination of the both ~9 childs work
> badly together.

Ok, that's just _odd_.

That v4.0-rc1~9 is just the pm+acpi merge, and has absolutely nothing
to do with XFS or the block code. In fact, looking at the diff from
it's direct parent, it doesn't even really change any relevant code.

So I get the feeling that the oops you are seeing is likely not
consistent, and may depend on allocation patterns or similar. Because
the bisect doesn't make any sense at all.

It looks much more like a pure block-mq bug, but one that needs some
very special condition to trigger.

Jens, does this look familiar or trigger any ideas:

BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffff812718d0>] scsi_init_cmd_errh+0x26/0x5d

(The whole oops is on lkml).

Jan, can you reproduce the oops with frame pointers so that we get a
better call trace? Although it looks fairly normal: the trapping code
is

rep stos %eax,%es:*(%rdi)

and %rdi is 0x1000. It seems to be simply

memset(cmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);

where 'cmd->sense_buffer' has some insane value ("PAGE_SIZE" or just a
flipped bit, or whatever)

Jens?

Linus

2015-04-09 18:24:34

by Jan Engelhardt

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7


On Thursday 2015-04-09 19:38, Linus Torvalds wrote:
>>
>> I reran bisect just to be sure.
>> It now shows v4.0-rc1~9 is bad, v4.0-rc1~9^1 is ok, and v4.0-rc~9^2 is
>> ok too. So this means that the combination of the both ~9 childs work
>> badly together.
>
>Ok, that's just _odd_.
>[...]
>So I get the feeling that the oops you are seeing is likely not
>consistent, and may depend on allocation patterns or similar.

It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
really) attempts does it not die.

With frame pointers:
BUG: unable to handle kernel paging request at 0000000000001000
IP: [<ffffffff812853c9>] scsi_init_cmd_errh+0x2a/0x62
PGD 0
Oops: 0002 [#1] SMP
Modules linked in: xfs crc32c_generic libcrc32c dm_crypt xts gf128mul algif_skcipher af_alg sd_mod mptsas scsi_transport_sas mptscsih mptbase dm_mod sg ipv6
CPU: 0 PID: 403 Comm: kworker/u2:1 Not tainted 4.0.0-rc7+ #55
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: ffff88007b686f60 ti: ffff88007bcb4000 task.ti: ffff88007bcb4000
RIP: 0010:[<ffffffff812853c9>] [<ffffffff812853c9>] scsi_init_cmd_errh+0x2a/0x62
RSP: 0018:ffff88007bcb77a8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff88007bf8d800 RCX: 0000000000000018
RDX: ffff88007bf7ab70 RSI: 0000000000000000 RDI: 0000000000001000
RBP: ffff88007bcb77a8 R08: ffff88007beb9c40 R09: 0000000000000000
R10: 0000000000000000 R11: ffffea0001fe17c0 R12: ffff88007bf7ab70
R13: 0000000000000000 R14: ffff88007bf8d800 R15: ffff88007bf7aa00
FS: 0000000000000000(0000) GS:ffff88007fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000001000 CR3: 000000007cb0d000 CR4: 00000000000007f0
Stack:
ffff88007bcb7818 ffffffff81286d59 ffff88007b686f60 ffff88007bc24000
ffff88007bf7ab78 ffff88007bf8d968 ffff88007be56c00 ffff88007bc24000
ffff88007cbfb400 ffff88007bcb7850 ffff88007be56c08 ffff88007bf7aa00
Call Trace:
[<ffffffff81286d59>] scsi_queue_rq+0x2e8/0x3d2
[<ffffffff8119e64d>] __blk_mq_run_hw_queue+0x19b/0x2a2
[<ffffffff8119e901>] ? blk_mq_merge_queue_io+0x75/0x147
[<ffffffffa00fa34a>] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
[<ffffffff8119edeb>] blk_mq_run_hw_queue+0x4f/0x99
[<ffffffff8119fab9>] blk_sq_make_request+0x163/0x170
[<ffffffff81196a8b>] generic_make_request+0x97/0xd6
[<ffffffff81196bd7>] submit_bio+0x10d/0x12c
[<ffffffff810d5e15>] ? __lru_cache_add+0x1e/0x3f
[<ffffffff81142af5>] mpage_bio_submit+0x25/0x2c
[<ffffffff8114387d>] mpage_readpages+0xf8/0x10c
[<ffffffffa00fa34a>] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
[<ffffffffa00f9c45>] xfs_vm_readpages+0x18/0x1a [xfs]
[<ffffffff810d4e5c>] __do_page_cache_readahead+0x137/0x1d3
[<ffffffff810d5102>] ondemand_readahead+0x20a/0x21b
[<ffffffff810d5262>] page_cache_sync_readahead+0x38/0x3a
[<ffffffff810cd1c5>] generic_file_read_iter+0x191/0x4fb
[<ffffffffa010b2cf>] ? xfs_ilock+0x32/0x5d [xfs]
[<ffffffffa01023c2>] xfs_file_read_iter+0x1c2/0x213 [xfs]
[<ffffffff81118e63>] new_sync_read+0x74/0x98
[<ffffffff81119aef>] __vfs_read+0x14/0x3b
[<ffffffff81119b8a>] vfs_read+0x74/0xc1
[<ffffffff8111d977>] kernel_read+0x3c/0x4a
[<ffffffff8111dbd2>] prepare_binprm+0x117/0x11f
[<ffffffff8111f10d>] do_execveat_common.isra.31+0x3b2/0x5d8
[<ffffffff8111f35a>] do_execve+0x27/0x29
[<ffffffff81050e07>] ____call_usermodehelper+0x10a/0x138
[<ffffffff81050cfd>] ? call_usermodehelper+0x49/0x49
[<ffffffff8133b3d8>] ret_from_fork+0x58/0x90
[<ffffffff81050cfd>] ? call_usermodehelper+0x49/0x49
Code: c3 55 48 89 fa 48 c7 87 b0 00 00 00 00 00 00 00 c7 87 f4 00 00 00 00 00 00 00 48 8b bf 10 01 00 00 31 c0 b9 18 00 00 00 48 89 e5 <f3> ab 66 83 ba cc 00 00 00 00 75 2a 48 8b 8a d8 00 00 00 8a 01
RIP [<ffffffff812853c9>] scsi_init_cmd_errh+0x2a/0x62
RSP <ffff88007bcb77a8>
CR2: 0000000000001000
---[ end trace fbec0fe487830b1d ]---



>and %rdi is 0x1000. It seems to be simply
>
> memset(cmd->sense_buffer, 0, SCSI_SENSE_BUFFERSIZE);
>
>where 'cmd->sense_buffer' has some insane value ("PAGE_SIZE" or just a
>flipped bit, or whatever)

Having been observed on two isolated different systems, I don't
think so much that it would be a broken HW-induced bitflip.

Oh yeah, if anybody likes, I can hand out the virtualbox image.

2015-04-09 21:12:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt <[email protected]> wrote:
>
> It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
> really) attempts does it not die.
>
> With frame pointers:
> [<ffffffff81286d59>] scsi_queue_rq+0x2e8/0x3d2
> [<ffffffff8119e64d>] __blk_mq_run_hw_queue+0x19b/0x2a2
> [<ffffffff8119e901>] ? blk_mq_merge_queue_io+0x75/0x147
> [<ffffffffa00fa34a>] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
> [<ffffffff8119edeb>] blk_mq_run_hw_queue+0x4f/0x99
> [<ffffffff8119fab9>] blk_sq_make_request+0x163/0x170

Ok, good.

So the cmd comes from

struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);

which in turn is just

return (void *) rq + sizeof(*rq);

which in turn is written by some crazy monkey on crack. That's some
shit code. Why the hell you'd write it that way, when the natural
thing to do would be just

return rq + 1;

without the sizeof, and without the cast.

The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.

Jens, really. This code is shit.

That ->sense_buffer thing is supposed to be initialized by the
blk_mq_ops.init_request() function, which is called - if it exists =
when the array of requests ('->rqs[]') is initialized.

And that code too looks like crap. It seems to be very clever, trying
to allocaet big contiguous chunks of RAM for the requests, but then
the initialization sequence is questionable as hell. It takes that
nonzeroed allocation, and zeroes a few fields randomly. The rest will
contain whatever garbage data they used to.

Does this entirely untested patch make any difference?

And Jens, this all really looks very fishy. When I look at these kinds
of core functions, and find just *stupid* code like this, it makes me
unhappy.

Anyway, I assume that the actual "disk" in question is mpt fusion?

Linus


Attachments:
patch.diff (870.00 B)

2015-04-09 21:25:16

by Jens Axboe

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On 04/09/2015 03:12 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 11:24 AM, Jan Engelhardt <[email protected]> wrote:
>>
>> It's fairly consistent (reproducible?). Only 1 in 15 or so (have not kept track
>> really) attempts does it not die.
>>
>> With frame pointers:
>> [<ffffffff81286d59>] scsi_queue_rq+0x2e8/0x3d2
>> [<ffffffff8119e64d>] __blk_mq_run_hw_queue+0x19b/0x2a2
>> [<ffffffff8119e901>] ? blk_mq_merge_queue_io+0x75/0x147
>> [<ffffffffa00fa34a>] ? __xfs_get_blocks+0x2f9/0x2f9 [xfs]
>> [<ffffffff8119edeb>] blk_mq_run_hw_queue+0x4f/0x99
>> [<ffffffff8119fab9>] blk_sq_make_request+0x163/0x170
>
> Ok, good.
>
> So the cmd comes from
>
> struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
>
> which in turn is just
>
> return (void *) rq + sizeof(*rq);
>
> which in turn is written by some crazy monkey on crack. That's some
> shit code. Why the hell you'd write it that way, when the natural
> thing to do would be just
>
> return rq + 1;
>
> without the sizeof, and without the cast.
>
> The particular crazy monkey on crack is Jens Axboe, in commit 320ae51feed5c.
>
> Jens, really. This code is shit.

That's a bit rough on a single line of code like that, don't you think?
But yes, rq + 1 is identical and cleaner...

> That ->sense_buffer thing is supposed to be initialized by the
> blk_mq_ops.init_request() function, which is called - if it exists =
> when the array of requests ('->rqs[]') is initialized.
>
> And that code too looks like crap. It seems to be very clever, trying
> to allocaet big contiguous chunks of RAM for the requests, but then
> the initialization sequence is questionable as hell. It takes that
> nonzeroed allocation, and zeroes a few fields randomly. The rest will
> contain whatever garbage data they used to.
>
> Does this entirely untested patch make any difference?
>
> And Jens, this all really looks very fishy. When I look at these kinds
> of core functions, and find just *stupid* code like this, it makes me
> unhappy.

Not sure why it isn't all zeroed, definitely the saner thing to do at
init time. So patch looks fine, should be applied regardless of whether
or not it fixes this issue.

But it's _always_ been like that, so it's not a change in behavior. It
is fragile, so perhaps some SCSI change modified alloc behavior and it's
not causing and issue.

And if this is mpt, we recently ran into some list corruption issues due
to a bug in the driver. It hit on reboot, but it was scan related, so
could be a boot issue as well.

--
Jens Axboe

2015-04-09 21:37:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On Thu, Apr 9, 2015 at 2:25 PM, Jens Axboe <[email protected]> wrote:
>
> Not sure why it isn't all zeroed, definitely the saner thing to do at init
> time.

So practically speaking, it might well often be zeroed just because
the BIOS may have initialized memory that way (and big multi-page
allocations have probably not gotten re-used).

> And if this is mpt, we recently ran into some list corruption issues due to
> a bug in the driver. It hit on reboot, but it was scan related, so could be
> a boot issue as well.

So one of the earlier emails had this:

Copyright (c) 1999-2008 LSI Corporation
Fusion MPT SAS Host driver 3.04.20
mptbase: ioc0: Initiating bringup
ioc0: LSISAS1068 A0: Capabilities={Initiator}
scsi host0: ioc0: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=22
mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1,
sas_addr 0x1060504030201a0
scsi 0:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
scsi 0:0:0:0: Attached scsi generic sg0 type 0
mptbase: ioc1: Initiating bringup
ioc1: LSISAS1068 A0: Capabilities={Initiator}
scsi host1: ioc1: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=17
mptsas: ioc1: attaching ssp device: fw_channel 0, fw_id 0, phy 0,
sas_addr 0x60504030201a0
scsi 1:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
scsi 1:0:0:0: Attached scsi generic sg1 type 0

and I'm assuming that that is the backing storage.

And yes, memory corruption sounds like a more likely cause than
anything else. I don't like how the request data wasn't fully
initialized, but the cmd->sense_buffer pointer itself *should* have
been initialized by the ->init_request() call.

So I don't actually expect my patch to really make any difference,
although I do think that code should be looked at.

Linus

2015-04-09 21:42:56

by Jens Axboe

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On 04/09/2015 03:37 PM, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 2:25 PM, Jens Axboe <[email protected]> wrote:
>>
>> Not sure why it isn't all zeroed, definitely the saner thing to do at init
>> time.
>
> So practically speaking, it might well often be zeroed just because
> the BIOS may have initialized memory that way (and big multi-page
> allocations have probably not gotten re-used).
>
>> And if this is mpt, we recently ran into some list corruption issues due to
>> a bug in the driver. It hit on reboot, but it was scan related, so could be
>> a boot issue as well.
>
> So one of the earlier emails had this:
>
> Copyright (c) 1999-2008 LSI Corporation
> Fusion MPT SAS Host driver 3.04.20
> mptbase: ioc0: Initiating bringup
> ioc0: LSISAS1068 A0: Capabilities={Initiator}
> scsi host0: ioc0: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=22
> mptsas: ioc0: attaching ssp device: fw_channel 0, fw_id 1, phy 1,
> sas_addr 0x1060504030201a0
> scsi 0:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
> scsi 0:0:0:0: Attached scsi generic sg0 type 0
> mptbase: ioc1: Initiating bringup
> ioc1: LSISAS1068 A0: Capabilities={Initiator}
> scsi host1: ioc1: LSISAS1068 A0, FwRev=00000000h, Ports=8, MaxQ=256, IRQ=17
> mptsas: ioc1: attaching ssp device: fw_channel 0, fw_id 0, phy 0,
> sas_addr 0x60504030201a0
> scsi 1:0:0:0: Direct-Access VBOX HARDDISK 1.0 PQ: 0 ANSI: 5
> scsi 1:0:0:0: Attached scsi generic sg1 type 0
>
> and I'm assuming that that is the backing storage.

mpt is a maze of roughly duplicate, crazy drivers. The bug in question
impacted mpt2sas and mpt3sas, and this looks like the mpt fusion driver.
So it's probably not that.

> And yes, memory corruption sounds like a more likely cause than
> anything else. I don't like how the request data wasn't fully
> initialized, but the cmd->sense_buffer pointer itself *should* have
> been initialized by the ->init_request() call.

The block request state should be sane, we clear what we need, and
at-alloc init will ensure that state is safe across request reuse. But
it really should just be cleared unconditionally at allocation time.

And ->init_request() should take care of the SCSI command init. It does
look like it's relying on zeroes already, so either adding the memset()
or just adding the __GFP_ZERO would be prudent.

> So I don't actually expect my patch to really make any difference,
> although I do think that code should be looked at.

Jan, is it always clearing in a page size? That seems odd, especially if
we're considering random gunk in memory.

--
Jens Axboe

2015-04-09 22:29:09

by Jan Engelhardt

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7


On Thursday 2015-04-09 23:12, Linus Torvalds wrote:
>That ->sense_buffer thing is supposed to be initialized by the
>blk_mq_ops.init_request() function, which is called - if it exists =
>when the array of requests ('->rqs[]') is initialized.
>[...]
>Does this entirely untested patch make any difference?

Yes, this seems to solve it for me.


>Anyway, I assume that the actual "disk" in question is mpt fusion?

I had two setups, one non-virtual AHCI, and one emulated mpt,
both of which had failed and are now fixed for what I can tell.

2015-04-09 22:32:53

by Jan Engelhardt

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7


On Thursday 2015-04-09 23:42, Jens Axboe wrote:
>
>> So I don't actually expect my patch to really make any difference,
>> although I do think that code should be looked at.
>
> Jan, is it always clearing in a page size? That seems odd, especially if we're
> considering random gunk in memory.

The particular memset in scsi_init_cmd_errh would only clear
SCSI_SENSE_BUFFERSIZE. The problem really was that cmd->sense_buffer
was nonsensical (I did a hexdump) and read 0x1000. In one instance,
it even read 0x10000 because I used that value in an if clause
preceding a printk statement I added. Together with __GFP_ZERO
apparently fixing the issue for today, the noninitialized memory
theory looks like the most applicable one to me currently.

2015-04-10 00:07:47

by Linus Torvalds

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt <[email protected]> wrote:
>
> On Thursday 2015-04-09 23:12, Linus Torvalds wrote:
>>That ->sense_buffer thing is supposed to be initialized by the
>>blk_mq_ops.init_request() function, which is called - if it exists =
>>when the array of requests ('->rqs[]') is initialized.
>>[...]
>>Does this entirely untested patch make any difference?
>
> Yes, this seems to solve it for me.

Can you humor me, and try that patch on top of a few different kernel
versions that you had trouble with.

The fact that your bisection ended on a completely unrelated commit
makes me think that some particular memory aligment (code or initial
data or whatever) ends up showing or hiding the bug, so I'm a bit
worried that my ->rqs[] initialization patch also happened to just
hide the problem rather than really fix it.

That said, at least it's a lot more likely to be relevant than the
ACPI merge commit you bisected to, so while I'm somewhat cautious
about that patch (not that it would be wrong, but that it would matter
that much) I at least can see it make a real difference.

Linus

2015-04-11 17:46:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt <[email protected]> wrote:
>>
>> Yes, this seems to solve it for me.
>
> Can you humor me, and try that patch on top of a few different kernel
> versions that you had trouble with.

Jan? I'm inclined to commit the patch for 4.0 even though it is late
(it's not like it can hurt), but since you seem to be the only one
seeing this issue, I'd really like you to double-check it..

Linus

2015-04-11 17:47:53

by Jens Axboe

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On 04/11/2015 11:46 AM, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt <[email protected]> wrote:
>>>
>>> Yes, this seems to solve it for me.
>>
>> Can you humor me, and try that patch on top of a few different kernel
>> versions that you had trouble with.
>
> Jan? I'm inclined to commit the patch for 4.0 even though it is late
> (it's not like it can hurt), but since you seem to be the only one
> seeing this issue, I'd really like you to double-check it..

I already committed for 4.1 (marked stable), but I'd be fine with adding
it to this series as well obviously.

--
Jens Axboe

2015-04-11 19:48:08

by Jan Engelhardt

[permalink] [raw]
Subject: Re: NULL deref around blkmq in v4.0-rc1–rc7

On Saturday 2015-04-11 19:46, Linus Torvalds wrote:

>On Thu, Apr 9, 2015 at 5:07 PM, Linus Torvalds
><[email protected]> wrote:
>> On Thu, Apr 9, 2015 at 3:29 PM, Jan Engelhardt <[email protected]> wrote:
>>>
>>> Yes, this seems to solve it for me.
>>
>> Can you humor me, and try that patch on top of a few different kernel
>> versions that you had trouble with.
>
>Jan? I'm inclined to commit the patch for 4.0 even though it is late
>(it's not like it can hurt), but since you seem to be the only one
>seeing this issue, I'd really like you to double-check it..

Yeah, just commit. I'm currently "stuck" in the weekend.