2018-11-29 09:21:50

by Chen, Rong A

Subject: [LKP] [block] 9d037ad707: BUG:unable_to_handle_kernel

FYI, we noticed the following commit (built with gcc-7):

commit: 9d037ad707ed6069fbea4e38e6ee37e027b13f1d ("block: remove req->timeout_list")
https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git mq-perf

in testcase: fsmark
with the following parameters:

iterations: 1x
nr_threads: 64t
disk: 1BRD_48G
fs: btrfs
fs2: nfsv4
filesize: 4M
test_size: 40G
sync_method: fsyncBeforeClose
ucode: 0x42d
cpufreq_governor: performance

test-description: fsmark is a file system benchmark for testing synchronous write workloads, for example, mail server workloads.
test-url: https://sourceforge.net/projects/fsmark/


on test machine: 48 threads Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 64G memory

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):


+---------------------------------------------------------------------+------------+------------+
|                                                                     | 27d420bc47 | 9d037ad707 |
+---------------------------------------------------------------------+------------+------------+
| boot_successes                                                      | 5          | 24         |
| boot_failures                                                       | 1          |            |
| WARNING:suspicious_RCU_usage                                        | 1          |            |
| include/linux/xarray.h:#suspicious_rcu_dereference_check()usage     | 1          |            |
| include/linux/xarray.h:#suspicious_rcu_dereference_protected()usage | 1          |            |
| BUG:kernel_hang_in_test_stage                                       | 1          |            |
+---------------------------------------------------------------------+------------+------------+



[ 26.259902] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 26.269247] PGD 0 P4D 0
[ 26.272559] Oops: 0000 [#1] SMP PTI
[ 26.276947] CPU: 42 PID: 0 Comm: swapper/42 Not tainted 4.20.0-rc1-00174-g9d037ad7 #1
[ 26.286229] Hardware name: Intel Corporation S2600WP/S2600WP, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[ 26.298280] RIP: 0010:t10_pi_complete+0x5c/0x1d0
[ 26.304001] Code: 00 00 0f b6 88 2a 01 00 00 88 5c 24 03 48 8b 5f 30 83 e9 09 48 d3 eb 45 31 ed f6 45 12 01 74 04 4c 8b 6d 78 44 0f b6 64 24 03 <45> 8b 4d 08 41 8b 7d 10 45 8b 55 14 45 8b 45 18 45 0f b6 dc 85 ff
[ 26.326185] RSP: 0018:ffff881012283e98 EFLAGS: 00010246
[ 26.332666] RAX: ffff8807eba48848 RBX: 0000000000000000 RCX: 00000000fffffff7
[ 26.341306] RDX: 0000000000000018 RSI: 0000000000000000 RDI: ffff8807ed1e0000
[ 26.349970] RBP: ffff880fedea5700 R08: ffff88101229ff40 R09: 0000000000000200
[ 26.358648] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 26.367338] R13: 0000000000000000 R14: ffff8807ed1e0000 R15: ffff8810039c3400
[ 26.376039] FS: 0000000000000000(0000) GS:ffff881012280000(0000) knlGS:0000000000000000
[ 26.385840] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 26.393015] CR2: 0000000000000008 CR3: 0000000feeb0a005 CR4: 00000000001606e0
[ 26.401744] Call Trace:
[ 26.405154] <IRQ>
[ 26.408077] sd_done+0x2b2/0x330 [sd_mod]
[ 26.413274] scsi_finish_command+0xcb/0x120
[ 26.418638] blk_done_softirq+0xa1/0xd0
[ 26.423605] __do_softirq+0xe3/0x311
[ 26.428269] irq_exit+0xf0/0x100
[ 26.432548] call_function_single_interrupt+0xf/0x20
[ 26.438813] </IRQ>
[ 26.441836] RIP: 0010:cpuidle_enter_state+0xb4/0x330
[ 26.448115] Code: 31 ff e8 8f 85 89 ff 45 84 f6 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 58 02 00 00 31 ff e8 d3 5a 8f ff fb 66 0f 1f 44 00 00 <85> ed 0f 88 35 02 00 00 4c 29 fb 48 ba cf f7 53 e3 a5 9b c4 20 48
[ 26.470614] RSP: 0018:ffffc900064f7ea0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff04
[ 26.479879] RAX: ffff8810122a2c40 RBX: 000000061d3621be RCX: 000000000000001f
[ 26.488663] RDX: 000000061d3621be RSI: 000000002f8590d5 RDI: 0000000000000000
[ 26.497444] RBP: 0000000000000001 R08: 0000000000000002 R09: 00000000000224c0
[ 26.506226] R10: ffffc900064f7e80 R11: 000000000000002a R12: ffffffff8274efb8
[ 26.515003] R13: ffffe8fffec81d00 R14: 0000000000000000 R15: 000000061d34cc3f
[ 26.523789] do_idle+0x1f4/0x260
[ 26.528114] cpu_startup_entry+0x19/0x20
[ 26.533236] start_secondary+0x1ae/0x200
[ 26.538335] secondary_startup_64+0xa4/0xb0
[ 26.543710] Modules linked in: sd_mod sg intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel mgag200 snd_pcm ipmi_si crypto_simd cryptd snd_timer ttm ipmi_devintf snd glue_helper drm_kms_helper soundcore isci pcspkr ipmi_msghandler syscopyarea libsas sysfillrect ahci sysimgblt fb_sys_fops libahci scsi_transport_sas drm libata wmi pcc_cpufreq ip_tables
[ 26.592415] CR2: 0000000000000008
[ 26.596941] ---[ end trace 47ea337d473f2b82 ]---


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml  # job file is attached in this email
        bin/lkp run     job.yaml



Thanks,
Rong Chen


Attachments:
(No filename) (5.14 kB)
config-4.20.0-rc1-00174-g9d037ad7 (171.01 kB)
job-script (7.78 kB)
dmesg.xz (21.13 kB)
job.yaml (5.25 kB)

2018-11-29 17:07:00

by Christoph Hellwig

Subject: Re: [LKP] [block] 9d037ad707: BUG:unable_to_handle_kernel

On Thu, Nov 29, 2018 at 05:20:31PM +0800, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
>
> commit: 9d037ad707ed6069fbea4e38e6ee37e027b13f1d ("block: remove req->timeout_list")
> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git mq-perf

This looks very odd. How could we introduce a new BUG with the removal
of an unused structure member?

2018-11-29 18:38:50

by Jens Axboe

Subject: Re: [LKP] [block] 9d037ad707: BUG:unable_to_handle_kernel

On 11/29/18 10:05 AM, Christoph Hellwig wrote:
> On Thu, Nov 29, 2018 at 05:20:31PM +0800, kernel test robot wrote:
>> FYI, we noticed the following commit (built with gcc-7):
>>
>> commit: 9d037ad707ed6069fbea4e38e6ee37e027b13f1d ("block: remove req->timeout_list")
>> https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git mq-perf
>
> This looks very odd. How could we introduce a new BUG with the removal
> of an unused structure member?

Someone else reported a T10 PI issue. I'm guessing it's a latent bug, and
the struct size changing is causing it to trigger weirdly on this one.
That's the only explanation, as it can't possibly be this specific commit.


--
Jens Axboe