2016-12-23 18:26:04

by MasterPrenium

[permalink] [raw]
Subject: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hello Guys,

I've having some trouble on a new system I'm setting up. I'm getting a kernel BUG message, seems to be related with the use of Xen (when I boot the system _without_ Xen, I don't get any crash).
Here is configuration :
- 3x Hard Drives running on RAID 5 Software raid created by mdadm
- On top of it, DRBD for replication over another node (Active/passive cluster)
- On top of it, a BTRFS FileSystem with a few subvolumes
- On top of it, XEN VMs running.

The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync for example) on the RAID5 stack.
I've to reset system to make it work again.

Reproducible : ALWAYS (making the i/o, it crash in 2-5mins). Also reproducible on another system with the same hardware.

Kernel versions impacted (at least): kernel-4.4.26, kernel-4.8.15, kernel-4.9.0

Here dmesg errors :
[ 937.123220] ------------[ cut here ]------------
[ 937.127549] kernel BUG at drivers/md/raid5.c:527!
[ 937.131891] invalid opcode: 0000 [#1] SMP
[ 937.136216] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[ 937.145665] CPU: 2 PID: 9704 Comm: kworker/u16:8 Not tainted 4.9.0-gentoo #2
[ 937.150293] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[ 937.155531] Workqueue: drbd0_submit do_submit
[ 937.160506] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[ 937.164115] RIP: e030:[<ffffffff819e1fc1>] [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
[ 937.169584] RSP: e02b:ffffc9000a66fa58 EFLAGS: 00010086
[ 937.175070] RAX: 0000000000000000 RBX: ffff880249d50000 RCX: ffff8802648bb5d0
[ 937.180640] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff880249d50000
[ 937.185505] RBP: ffffc9000a66faf0 R08: ffff8801f4813288 R09: 0000000000000000
[ 937.190631] R10: 0000000000000288 R11: 0000000000000000 R12: 0000000000000000
[ 937.196030] R13: 000000001e773e88 R14: ffff880249d50000 R15: ffff8802648bb400
[ 937.202011] FS: 0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[ 937.206628] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 937.212372] CR2: 00007f68a101b520 CR3: 0000000257875000 CR4: 0000000000042660
[ 937.217538] Stack:
[ 937.223361] ffff8802648bb400 ffff880269550b40 0000000000000000 0000000166cf3800
[ 937.229103] 000000001e773e88 ffff8802648bb5d0 0000000000000001 0000000000000000
[ 937.233707] ffff8802648bb40c 0000000000000001 ffffc9000a66faf0 ffff880047cba958
[ 937.239736] Call Trace:
[ 937.244406] [<ffffffff819e21cd>] raid5_make_request+0x17d/0xdf0
[ 937.250345] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 937.256173] [<ffffffff81a09c03>] md_make_request+0xe3/0x220
[ 937.261031] [<ffffffff81483e9b>] generic_make_request+0xcb/0x1a0
[ 937.265615] [<ffffffff81732537>] drbd_send_and_submit+0x497/0x1310
[ 937.271605] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 937.276726] [<ffffffff817339ba>] send_and_submit_pending+0x6a/0x90
[ 937.282292] [<ffffffff81733e43>] do_submit+0x463/0x550
[ 937.288333] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 937.293205] [<ffffffff81095400>] process_one_work+0x170/0x420
[ 937.298982] [<ffffffff810957d3>] worker_thread+0x123/0x500
[ 937.304154] [<ffffffff810956b0>] ? process_one_work+0x420/0x420
[ 937.310314] [<ffffffff810956b0>] ? process_one_work+0x420/0x420
[ 937.316013] [<ffffffff8109b135>] kthread+0xc5/0xe0
[ 937.320918] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 937.327029] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 937.331994] [<ffffffff81ccbbc5>] ret_from_fork+0x25/0x30
[ 937.338068] Code: 85 d0 fb ff ff f0 41 80 8f 98 02 00 00 04 e9 c2 fb ff ff f3 90 41 8b 47 70 a8 01 75 f6 89 45 a4 e9 e2 fd ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 49 89 d6 e9 e1 fa ff ff 49 8b 82 e8 01 00 00 4d 8b 8a e0
[ 937.349579] RIP [<ffffffff819e1fc1>] raid5_get_active_stripe+0x5e1/0x670
[ 937.355290] RSP <ffffc9000a66fa58>
[ 937.386587] ---[ end trace b870be01f61065a5 ]---
[ 941.931453] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 941.937139] IP: [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
[ 941.943106] PGD 252dde067
[ 941.943219] PUD 252ee7067
[ 941.950107] PMD 0

[ 941.956080] Oops: 0000 [#2] SMP
[ 941.961919] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[ 941.974933] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G D 4.9.0-gentoo #2
[ 941.982080] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[ 941.989296] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[ 941.996831] RIP: e030:[<ffffffff810bcaa6>] [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
[ 942.004391] RSP: e02b:ffffc9000a66fe50 EFLAGS: 00010086
[ 942.011818] RAX: 0000000000000200 RBX: ffffc9000a66ff18 RCX: 0000000000000000
[ 942.019290] RDX: 0000000000000000 RSI: 0000000000000003 RDI: ffffc9000a66ff18
[ 942.026779] RBP: ffffc9000a66fe88 R08: 0000000000000000 R09: 0000000000000000
[ 942.034246] R10: 0000000000000008 R11: 0000000000000001 R12: ffffc9000a66ff20
[ 942.041693] R13: 0000000000000200 R14: 0000000000000000 R15: 0000000000000003
[ 942.049166] FS: 0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[ 942.056599] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 942.063953] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
[ 942.070841] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 942.074862] BUG: unable to handle kernel paging request at ffffc9000234f8f8
[ 942.078910] IP: [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[ 942.082961] PGD 1e9840067
[ 942.083010] PUD 1e983f067
[ 942.086963] PMD 26b42c067
[ 942.086978] PTE 801000026b66c067

[ 942.094822] Oops: 0011 [#3] SMP
[ 942.098734] Modules linked in: x86_pkg_temp_thermal coretemp crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
[ 942.107222] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G D 4.9.0-gentoo #2
[ 942.111581] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F, BIOS 1.0b 11/21/2016
[ 942.116050] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
[ 942.120530] RIP: e030:[<ffffc9000234f8f8>] [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[ 942.125019] RSP: e02b:ffffc9000a66fb80 EFLAGS: 00010082
[ 942.129534] RAX: 0000000000000041 RBX: 0000000000042660 RCX: 0000000000000006
[ 942.134355] RDX: 0000000000000041 RSI: ffffffff824e00a0 RDI: ffff880270c8dd80
[ 942.138921] RBP: ffffc9000a66fbe0 R08: 0000000000000000 R09: 0000000000000000
[ 942.143564] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000080050033
[ 942.148172] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 942.152837] FS: 0000000000000000(0000) GS:ffff880270c80000(0000) knlGS:ffff880270c80000
[ 942.157525] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 942.162213] CR2: 0000000000000028 CR3: 0000000257875000 CR4: 0000000000042660
[ 942.166954] Stack:
[ 942.171576] 0000000257875000 0000000000000028 ffff880270c80000 ffff880270c80000
[ 942.176267] 0000000000000000 0000e0330a66c000 0000000000000000 ffffc9000a66fda8
[ 942.180918] 0000000000000000 ffffc9000a66fda8 0000000000000000 0000000000000000
[ 942.185521] Call Trace:
[ 942.190043] [<ffffffff810302ad>] show_regs+0x2d/0x180
[ 942.194605] [<ffffffff81030725>] __die+0xa5/0xf0
[ 942.199050] [<ffffffff8106041e>] no_context+0x14e/0x3d0
[ 942.203562] [<ffffffff81060798>] __bad_area_nosemaphore+0xf8/0x1c0
[ 942.208002] [<ffffffff8106086f>] bad_area_nosemaphore+0xf/0x20
[ 942.212478] [<ffffffff81061034>] __do_page_fault+0x84/0x4b0
[ 942.216797] [<ffffffff8106148c>] do_page_fault+0x2c/0x40
[ 942.221021] [<ffffffff81ccd808>] page_fault+0x28/0x30
[ 942.225184] [<ffffffff810bcaa6>] ? __wake_up_common+0x26/0x80
[ 942.229287] [<ffffffff810bcb0e>] __wake_up_locked+0xe/0x10
[ 942.233366] [<ffffffff810bd4d2>] complete+0x32/0x50
[ 942.237330] [<ffffffff8107a500>] mm_release+0xc0/0x160
[ 942.241216] [<ffffffff81080206>] do_exit+0x136/0xb50
[ 942.245056] [<ffffffff81ccdc07>] rewind_stack_do_exit+0x17/0x20
[ 942.248933] Code: c9 ff ff c0 cf 74 b7 01 88 ff ff 00 30 cf 66 02 88 ff ff 00 00 00 00 00 00 00 00 40 29 57 6b 02 88 ff ff b0 cf 0b 81 ff ff ff ff <70> fb 66 0a 00 c9 ff ff 88 b6 8b 64 02 88 ff ff 00 00 00 00 01
[ 942.257683] RIP [<ffffc9000234f8f8>] 0xffffc9000234f8f8
[ 942.261814] RSP <ffffc9000a66fb80>
[ 942.265860] CR2: ffffc9000234f8f8
[ 942.269830] ---[ end trace b870be01f61065a6 ]---
[ 942.293603] Fixing recursive fault but reboot is needed!
[ 962.926746] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 962.930582] 4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=5195
[ 962.934400] (detected by 1, t=21002 jiffies, g=26732, c=26731, q=173)
[ 962.938231] Task dump for CPU 4:
[ 962.942054] md10_raid5 R running task 13232 2654 2 0x00000008
[ 962.945939] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[ 962.949899] 0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[ 962.953781] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[ 962.957570] Call Trace:
[ 962.961272] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[ 962.964943] [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[ 962.968604] [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[ 962.972253] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[ 962.975825] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[ 962.979360] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 962.982900] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[ 962.986392] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[ 962.989881] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 962.993382] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 962.996858] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
[ 1025.932534] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1025.936027] 4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=20780
[ 1025.939486] (detected by 0, t=84014 jiffies, g=26732, c=26731, q=742)
[ 1025.942969] Task dump for CPU 4:
[ 1025.946373] md10_raid5 R running task 13232 2654 2 0x00000008
[ 1025.949909] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[ 1025.953451] 0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[ 1025.957015] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[ 1025.960601] Call Trace:
[ 1025.964139] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[ 1025.967724] [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[ 1025.971351] [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[ 1025.975001] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[ 1025.978598] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[ 1025.982255] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 1025.985875] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[ 1025.989496] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[ 1025.993117] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 1025.996707] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 1026.000354] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
[ 1088.937436] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 1088.941109] 4-...: (1 GPs behind) idle=deb/140000000000000/0 softirq=51234/51234 fqs=36280
[ 1088.944649] (detected by 0, t=147019 jiffies, g=26732, c=26731, q=1328)
[ 1088.948180] Task dump for CPU 4:
[ 1088.951671] md10_raid5 R running task 13232 2654 2 0x00000008
[ 1088.955296] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88 0000000000000000
[ 1088.958963] 0000000000000220 ffff8802648bb40c 0000000000000002 ffff8802648bb708
[ 1088.962665] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884 ffff8802648bb400
[ 1088.966301] Call Trace:
[ 1088.969868] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
[ 1088.973451] [<ffffffff819d87f4>] ? release_inactive_stripe_list+0x44/0x180
[ 1088.977020] [<ffffffff819e5469>] ? handle_active_stripes.isra.56+0x169/0x440
[ 1088.980572] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
[ 1088.984066] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
[ 1088.987549] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
[ 1088.991011] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
[ 1088.994444] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
[ 1088.997815] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
[ 1089.001181] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
[ 1089.004534] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30

(Another log here : http://pastebin.com/maxGFc1z)

Xen versions affected (at least): xen-4.6, xen-4.7, xen-4.8
xen-tools same version

Userland is a gentoo linux box.

Kernel .config : http://pastebin.com/p0EcHjbu

All buit with : gcc (Gentoo 4.9.3 p1.5, pie-0.6.4) 4.9.3

-> scripts/ver_linux
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.

Linux Node_1 4.9.0-gentoo #2 SMP Fri Dec 23 16:37:48 CET 2016 x86_64 Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz GenuineIntel GNU/Linux

GNU C 4.9.3
GNU Make 4.1
Binutils 2.25.1
Util-linux 2.26.2
Mount 2.26.2
Module-init-tools 22
E2fsprogs 1.43.3
Linux C Library 2.22
Dynamic linker (ldd) 2.22
Linux C++ Library 6.0.20
Procps 3.3.12
Net-tools 1.60
Kbd 2.0.3
Console-tools 2.0.3
Sh-utils 8.25
Udev 220
Modules Loaded ablk_helper aesni_intel aes_x86_64 coretemp crc32c_intel mei mei_me mpt3sas x86_pkg_temp_thermal

-> System is a SuperMicro Motherboard X10SDV-4C-7TP4F with 8GB of DDR 4 ECC Registered memory


Any help would be greatly appreciated !

Thanks,


2016-12-26 13:14:14

by MasterPrenium

[permalink] [raw]
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hi guys,

I've tested the same set-up except with a RAID 1 Soft Array, in this
case I get no issue at all.
It's definitely a raid 5 problem.
As requested, I've tested re-creating the RAID 5 array (just to be
sure), issue remains the same, even with metadata 0.90 or metadata 1.2.

Thanks,

Le 23/12/2016 19:25, MasterPrenium a écrit :
> Hello Guys,
>
> I've having some trouble on a new system I'm setting up. I'm getting a
> kernel BUG message, seems to be related with the use of Xen (when I
> boot the system _without_ Xen, I don't get any crash).
> Here is configuration :
> - 3x Hard Drives running on RAID 5 Software raid created by mdadm
> - On top of it, DRBD for replication over another node (Active/passive
> cluster)
> - On top of it, a BTRFS FileSystem with a few subvolumes
> - On top of it, XEN VMs running.
>
> The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync
> for example) on the RAID5 stack.
> I've to reset system to make it work again.
>
> Reproducible : ALWAYS (making the i/o, it crash in 2-5mins). Also
> reproducible on another system with the same hardware.
>
> Kernel versions impacted (at least): kernel-4.4.26, kernel-4.8.15,
> kernel-4.9.0
>
> Here dmesg errors :
> [ 937.123220] ------------[ cut here ]------------
> [ 937.127549] kernel BUG at drivers/md/raid5.c:527!
> [ 937.131891] invalid opcode: 0000 [#1] SMP
> [ 937.136216] Modules linked in: x86_pkg_temp_thermal coretemp
> crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [ 937.145665] CPU: 2 PID: 9704 Comm: kworker/u16:8 Not tainted
> 4.9.0-gentoo #2
> [ 937.150293] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F,
> BIOS 1.0b 11/21/2016
> [ 937.155531] Workqueue: drbd0_submit do_submit
> [ 937.160506] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [ 937.164115] RIP: e030:[<ffffffff819e1fc1>] [<ffffffff819e1fc1>]
> raid5_get_active_stripe+0x5e1/0x670
> [ 937.169584] RSP: e02b:ffffc9000a66fa58 EFLAGS: 00010086
> [ 937.175070] RAX: 0000000000000000 RBX: ffff880249d50000 RCX:
> ffff8802648bb5d0
> [ 937.180640] RDX: 0000000000000000 RSI: 0000000000000001 RDI:
> ffff880249d50000
> [ 937.185505] RBP: ffffc9000a66faf0 R08: ffff8801f4813288 R09:
> 0000000000000000
> [ 937.190631] R10: 0000000000000288 R11: 0000000000000000 R12:
> 0000000000000000
> [ 937.196030] R13: 000000001e773e88 R14: ffff880249d50000 R15:
> ffff8802648bb400
> [ 937.202011] FS: 0000000000000000(0000) GS:ffff880270c80000(0000)
> knlGS:ffff880270c80000
> [ 937.206628] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 937.212372] CR2: 00007f68a101b520 CR3: 0000000257875000 CR4:
> 0000000000042660
> [ 937.217538] Stack:
> [ 937.223361] ffff8802648bb400 ffff880269550b40 0000000000000000
> 0000000166cf3800
> [ 937.229103] 000000001e773e88 ffff8802648bb5d0 0000000000000001
> 0000000000000000
> [ 937.233707] ffff8802648bb40c 0000000000000001 ffffc9000a66faf0
> ffff880047cba958
> [ 937.239736] Call Trace:
> [ 937.244406] [<ffffffff819e21cd>] raid5_make_request+0x17d/0xdf0
> [ 937.250345] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 937.256173] [<ffffffff81a09c03>] md_make_request+0xe3/0x220
> [ 937.261031] [<ffffffff81483e9b>] generic_make_request+0xcb/0x1a0
> [ 937.265615] [<ffffffff81732537>] drbd_send_and_submit+0x497/0x1310
> [ 937.271605] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 937.276726] [<ffffffff817339ba>] send_and_submit_pending+0x6a/0x90
> [ 937.282292] [<ffffffff81733e43>] do_submit+0x463/0x550
> [ 937.288333] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 937.293205] [<ffffffff81095400>] process_one_work+0x170/0x420
> [ 937.298982] [<ffffffff810957d3>] worker_thread+0x123/0x500
> [ 937.304154] [<ffffffff810956b0>] ? process_one_work+0x420/0x420
> [ 937.310314] [<ffffffff810956b0>] ? process_one_work+0x420/0x420
> [ 937.316013] [<ffffffff8109b135>] kthread+0xc5/0xe0
> [ 937.320918] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 937.327029] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 937.331994] [<ffffffff81ccbbc5>] ret_from_fork+0x25/0x30
> [ 937.338068] Code: 85 d0 fb ff ff f0 41 80 8f 98 02 00 00 04 e9 c2
> fb ff ff f3 90 41 8b 47 70 a8 01 75 f6 89 45 a4 e9 e2 fd ff ff 0f 0b
> 0f 0b 0f 0b <0f> 0b 49 89 d6 e9 e1 fa ff ff 49 8b 82 e8 01 00 00 4d 8b
> 8a e0
> [ 937.349579] RIP [<ffffffff819e1fc1>]
> raid5_get_active_stripe+0x5e1/0x670
> [ 937.355290] RSP <ffffc9000a66fa58>
> [ 937.386587] ---[ end trace b870be01f61065a5 ]---
> [ 941.931453] BUG: unable to handle kernel NULL pointer dereference
> at (null)
> [ 941.937139] IP: [<ffffffff810bcaa6>] __wake_up_common+0x26/0x80
> [ 941.943106] PGD 252dde067
> [ 941.943219] PUD 252ee7067
> [ 941.950107] PMD 0
>
> [ 941.956080] Oops: 0000 [#2] SMP
> [ 941.961919] Modules linked in: x86_pkg_temp_thermal coretemp
> crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [ 941.974933] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G
> D 4.9.0-gentoo #2
> [ 941.982080] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F,
> BIOS 1.0b 11/21/2016
> [ 941.989296] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [ 941.996831] RIP: e030:[<ffffffff810bcaa6>] [<ffffffff810bcaa6>]
> __wake_up_common+0x26/0x80
> [ 942.004391] RSP: e02b:ffffc9000a66fe50 EFLAGS: 00010086
> [ 942.011818] RAX: 0000000000000200 RBX: ffffc9000a66ff18 RCX:
> 0000000000000000
> [ 942.019290] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
> ffffc9000a66ff18
> [ 942.026779] RBP: ffffc9000a66fe88 R08: 0000000000000000 R09:
> 0000000000000000
> [ 942.034246] R10: 0000000000000008 R11: 0000000000000001 R12:
> ffffc9000a66ff20
> [ 942.041693] R13: 0000000000000200 R14: 0000000000000000 R15:
> 0000000000000003
> [ 942.049166] FS: 0000000000000000(0000) GS:ffff880270c80000(0000)
> knlGS:ffff880270c80000
> [ 942.056599] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 942.063953] CR2: 0000000000000028 CR3: 0000000257875000 CR4:
> 0000000000042660
> [ 942.070841] kernel tried to execute NX-protected page - exploit
> attempt? (uid: 0)
> [ 942.074862] BUG: unable to handle kernel paging request at
> ffffc9000234f8f8
> [ 942.078910] IP: [<ffffc9000234f8f8>] 0xffffc9000234f8f8
> [ 942.082961] PGD 1e9840067
> [ 942.083010] PUD 1e983f067
> [ 942.086963] PMD 26b42c067
> [ 942.086978] PTE 801000026b66c067
>
> [ 942.094822] Oops: 0011 [#3] SMP
> [ 942.098734] Modules linked in: x86_pkg_temp_thermal coretemp
> crc32c_intel aesni_intel aes_x86_64 ablk_helper mei_me mei mpt3sas
> [ 942.107222] CPU: 2 PID: 9704 Comm: kworker/u16:8 Tainted: G
> D 4.9.0-gentoo #2
> [ 942.111581] Hardware name: Supermicro Super Server/X10SDV-4C-7TP4F,
> BIOS 1.0b 11/21/2016
> [ 942.116050] task: ffff88026b0b2940 task.stack: ffffc9000a66c000
> [ 942.120530] RIP: e030:[<ffffc9000234f8f8>] [<ffffc9000234f8f8>]
> 0xffffc9000234f8f8
> [ 942.125019] RSP: e02b:ffffc9000a66fb80 EFLAGS: 00010082
> [ 942.129534] RAX: 0000000000000041 RBX: 0000000000042660 RCX:
> 0000000000000006
> [ 942.134355] RDX: 0000000000000041 RSI: ffffffff824e00a0 RDI:
> ffff880270c8dd80
> [ 942.138921] RBP: ffffc9000a66fbe0 R08: 0000000000000000 R09:
> 0000000000000000
> [ 942.143564] R10: 0000000000000008 R11: 0000000000000001 R12:
> 0000000080050033
> [ 942.148172] R13: 0000000000000000 R14: 0000000000000000 R15:
> 0000000000000000
> [ 942.152837] FS: 0000000000000000(0000) GS:ffff880270c80000(0000)
> knlGS:ffff880270c80000
> [ 942.157525] CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 942.162213] CR2: 0000000000000028 CR3: 0000000257875000 CR4:
> 0000000000042660
> [ 942.166954] Stack:
> [ 942.171576] 0000000257875000 0000000000000028 ffff880270c80000
> ffff880270c80000
> [ 942.176267] 0000000000000000 0000e0330a66c000 0000000000000000
> ffffc9000a66fda8
> [ 942.180918] 0000000000000000 ffffc9000a66fda8 0000000000000000
> 0000000000000000
> [ 942.185521] Call Trace:
> [ 942.190043] [<ffffffff810302ad>] show_regs+0x2d/0x180
> [ 942.194605] [<ffffffff81030725>] __die+0xa5/0xf0
> [ 942.199050] [<ffffffff8106041e>] no_context+0x14e/0x3d0
> [ 942.203562] [<ffffffff81060798>] __bad_area_nosemaphore+0xf8/0x1c0
> [ 942.208002] [<ffffffff8106086f>] bad_area_nosemaphore+0xf/0x20
> [ 942.212478] [<ffffffff81061034>] __do_page_fault+0x84/0x4b0
> [ 942.216797] [<ffffffff8106148c>] do_page_fault+0x2c/0x40
> [ 942.221021] [<ffffffff81ccd808>] page_fault+0x28/0x30
> [ 942.225184] [<ffffffff810bcaa6>] ? __wake_up_common+0x26/0x80
> [ 942.229287] [<ffffffff810bcb0e>] __wake_up_locked+0xe/0x10
> [ 942.233366] [<ffffffff810bd4d2>] complete+0x32/0x50
> [ 942.237330] [<ffffffff8107a500>] mm_release+0xc0/0x160
> [ 942.241216] [<ffffffff81080206>] do_exit+0x136/0xb50
> [ 942.245056] [<ffffffff81ccdc07>] rewind_stack_do_exit+0x17/0x20
> [ 942.248933] Code: c9 ff ff c0 cf 74 b7 01 88 ff ff 00 30 cf 66 02
> 88 ff ff 00 00 00 00 00 00 00 00 40 29 57 6b 02 88 ff ff b0 cf 0b 81
> ff ff ff ff <70> fb 66 0a 00 c9 ff ff 88 b6 8b 64 02 88 ff ff 00 00 00
> 00 01
> [ 942.257683] RIP [<ffffc9000234f8f8>] 0xffffc9000234f8f8
> [ 942.261814] RSP <ffffc9000a66fb80>
> [ 942.265860] CR2: ffffc9000234f8f8
> [ 942.269830] ---[ end trace b870be01f61065a6 ]---
> [ 942.293603] Fixing recursive fault but reboot is needed!
> [ 962.926746] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 962.930582] 4-...: (1 GPs behind) idle=deb/140000000000000/0
> softirq=51234/51234 fqs=5195
> [ 962.934400] (detected by 1, t=21002 jiffies, g=26732, c=26731, q=173)
> [ 962.938231] Task dump for CPU 4:
> [ 962.942054] md10_raid5 R running task 13232 2654 2
> 0x00000008
> [ 962.945939] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88
> 0000000000000000
> [ 962.949899] 0000000000000220 ffff8802648bb40c 0000000000000002
> ffff8802648bb708
> [ 962.953781] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884
> ffff8802648bb400
> [ 962.957570] Call Trace:
> [ 962.961272] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [ 962.964943] [<ffffffff819d87f4>] ?
> release_inactive_stripe_list+0x44/0x180
> [ 962.968604] [<ffffffff819e5469>] ?
> handle_active_stripes.isra.56+0x169/0x440
> [ 962.972253] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [ 962.975825] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [ 962.979360] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 962.982900] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [ 962.986392] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [ 962.989881] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 962.993382] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 962.996858] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
> [ 1025.932534] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1025.936027] 4-...: (1 GPs behind) idle=deb/140000000000000/0
> softirq=51234/51234 fqs=20780
> [ 1025.939486] (detected by 0, t=84014 jiffies, g=26732, c=26731, q=742)
> [ 1025.942969] Task dump for CPU 4:
> [ 1025.946373] md10_raid5 R running task 13232 2654 2
> 0x00000008
> [ 1025.949909] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88
> 0000000000000000
> [ 1025.953451] 0000000000000220 ffff8802648bb40c 0000000000000002
> ffff8802648bb708
> [ 1025.957015] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884
> ffff8802648bb400
> [ 1025.960601] Call Trace:
> [ 1025.964139] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [ 1025.967724] [<ffffffff819d87f4>] ?
> release_inactive_stripe_list+0x44/0x180
> [ 1025.971351] [<ffffffff819e5469>] ?
> handle_active_stripes.isra.56+0x169/0x440
> [ 1025.975001] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [ 1025.978598] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [ 1025.982255] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 1025.985875] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [ 1025.989496] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [ 1025.993117] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 1025.996707] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 1026.000354] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
> [ 1088.937436] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 1088.941109] 4-...: (1 GPs behind) idle=deb/140000000000000/0
> softirq=51234/51234 fqs=36280
> [ 1088.944649] (detected by 0, t=147019 jiffies, g=26732, c=26731,
> q=1328)
> [ 1088.948180] Task dump for CPU 4:
> [ 1088.951671] md10_raid5 R running task 13232 2654 2
> 0x00000008
> [ 1088.955296] ffff880270d0dcc0 ffff880270ed8ec0 000000000306bc88
> 0000000000000000
> [ 1088.958963] 0000000000000220 ffff8802648bb40c 0000000000000002
> ffff8802648bb708
> [ 1088.962665] 0000000000000001 ffffc9000306bcc8 ffffffff81ccb884
> ffff8802648bb400
> [ 1088.966301] Call Trace:
> [ 1088.969868] [<ffffffff81ccb884>] ? _raw_spin_lock_irqsave+0x54/0x60
> [ 1088.973451] [<ffffffff819d87f4>] ?
> release_inactive_stripe_list+0x44/0x180
> [ 1088.977020] [<ffffffff819e5469>] ?
> handle_active_stripes.isra.56+0x169/0x440
> [ 1088.980572] [<ffffffff819e5ae1>] ? raid5d+0x3a1/0x730
> [ 1088.984066] [<ffffffff81a094d3>] ? md_thread+0xf3/0x100
> [ 1088.987549] [<ffffffff810bcfb0>] ? wake_up_atomic_t+0x30/0x30
> [ 1088.991011] [<ffffffff81a093e0>] ? find_pers+0x70/0x70
> [ 1088.994444] [<ffffffff8109b135>] ? kthread+0xc5/0xe0
> [ 1088.997815] [<ffffffff8102c815>] ? __switch_to+0x355/0x7a0
> [ 1089.001181] [<ffffffff8109b070>] ? kthread_park+0x60/0x60
> [ 1089.004534] [<ffffffff81ccbbc5>] ? ret_from_fork+0x25/0x30
>
> (Another log here : http://pastebin.com/maxGFc1z)
>
> Xen versions affected (at least): xen-4.6, xen-4.7, xen-4.8
> xen-tools same version
>
> Userland is a gentoo linux box.
>
> Kernel .config : http://pastebin.com/p0EcHjbu
>
> All buit with : gcc (Gentoo 4.9.3 p1.5, pie-0.6.4) 4.9.3
>
> -> scripts/ver_linux
> If some fields are empty or look unusual you may have an old version.
> Compare to the current minimal requirements in Documentation/Changes.
>
> Linux Node_1 4.9.0-gentoo #2 SMP Fri Dec 23 16:37:48 CET 2016 x86_64
> Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz GenuineIntel GNU/Linux
>
> GNU C 4.9.3
> GNU Make 4.1
> Binutils 2.25.1
> Util-linux 2.26.2
> Mount 2.26.2
> Module-init-tools 22
> E2fsprogs 1.43.3
> Linux C Library 2.22
> Dynamic linker (ldd) 2.22
> Linux C++ Library 6.0.20
> Procps 3.3.12
> Net-tools 1.60
> Kbd 2.0.3
> Console-tools 2.0.3
> Sh-utils 8.25
> Udev 220
> Modules Loaded ablk_helper aesni_intel aes_x86_64 coretemp
> crc32c_intel mei mei_me mpt3sas x86_pkg_temp_thermal
>
> -> System is a SuperMicro Motherboard X10SDV-4C-7TP4F with 8GB of DDR
> 4 ECC Registered memory
>
>
> Any help would be greatly appreciated !
>
> Thanks,
>

2016-12-30 20:54:04

by Jes Sorensen

[permalink] [raw]
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

MasterPrenium <[email protected]> writes:
> Hello Guys,
>
> I've having some trouble on a new system I'm setting up. I'm getting a
> kernel BUG message, seems to be related with the use of Xen (when I
> boot the system _without_ Xen, I don't get any crash).
> Here is configuration :
> - 3x Hard Drives running on RAID 5 Software raid created by mdadm
> - On top of it, DRBD for replication over another node (Active/passive cluster)
> - On top of it, a BTRFS FileSystem with a few subvolumes
> - On top of it, XEN VMs running.
>
> The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync
> for example) on the RAID5 stack.
> I've to reset system to make it work again.
>
> Reproducible : ALWAYS (making the i/o, it crash in 2-5mins). Also
> reproducible on another system with the same hardware.
>
> Kernel versions impacted (at least): kernel-4.4.26, kernel-4.8.15, kernel-4.9.0

Well you have one foreign object in there that is not part of the
kernel and which shows up in the OOPS: DRDB

What happens when you remove that from the equation?

Jes

2016-12-31 00:00:37

by MasterPrenium

[permalink] [raw]
Subject: Re: PROBLEM: Kernel BUG with raid5 soft + Xen + DRBD - invalid opcode

Hello,

Thanks for your reply. DRBD isn't part of the kernel ? I was thinking it
has been included since 2.6.3x ?

I've just tested without DRBD, the issue seems to remain. Can't see the
"BUG", but the kernel crashed also. (A little bit later)
I don't have full dump since I lost my network connection and my serial
connection.
Here is a picture of what I got :
http://img15.hostingpics.net/pics/113882KernelError6.png
Another one : http://img11.hostingpics.net/pics/164702KernelError7.png

It also seems to me that having the "glances" monitoring software
running in dom0, makes the kernel crashes quicker, don't think this can
help but... just in case...

Any idea / test I can make ? This is really a blocking issue with
potential data loss...

Best regards,
MasterPrenium


Le 30/12/2016 21:54, Jes Sorensen a ?crit :
> MasterPrenium<[email protected]> writes:
>> Hello Guys,
>>
>> I've having some trouble on a new system I'm setting up. I'm getting a
>> kernel BUG message, seems to be related with the use of Xen (when I
>> boot the system _without_ Xen, I don't get any crash).
>> Here is configuration :
>> - 3x Hard Drives running on RAID 5 Software raid created by mdadm
>> - On top of it, DRBD for replication over another node (Active/passive cluster)
>> - On top of it, a BTRFS FileSystem with a few subvolumes
>> - On top of it, XEN VMs running.
>>
>> The BUG is happening when I'm making "huge" I/O (20MB/s with a rsync
>> for example) on the RAID5 stack.
>> I've to reset system to make it work again.
>>
>> Reproducible : ALWAYS (making the i/o, it crash in 2-5mins). Also
>> reproducible on another system with the same hardware.
>>
>> Kernel versions impacted (at least): kernel-4.4.26, kernel-4.8.15, kernel-4.9.0
> Well you have one foreign object in there that is not part of the
> kernel and which shows up in the OOPS: DRDB
>
> What happens when you remove that from the equation?
>
> Jes