2020-02-23 18:38:38

by Ondrej Zary

Subject: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

Hello,
a couple of days after upgrading a server from Debian 9 (kernel 4.9.210-1)
to 10 (kernel 4.19.98), qla2xxx crashed, along with mysql.

There is an EMC CX3 array connected through the fibre-channel adapter.
No errors are present in the EMC event log.

This server has been running without any problems since Debian 4.
Is this a known bug?

[979178.888922] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[979178.889160] PGD 0 P4D 0
[979178.889243] Oops: 0002 [#1] SMP PTI
[979178.889362] CPU: 6 PID: 11060 Comm: kworker/u16:2 Not tainted 4.19.0-8-amd64 #1 Debian 4.19.98-1
[979178.889617] Hardware name: Dell Inc. PowerEdge 2950/0JR815, BIOS 2.7.0 10/30/2010
[979178.889855] Workqueue: scsi_tmf_4 scmd_eh_abort_handler [scsi_mod]
[979178.890069] RIP: 0010:qla24xx_async_abort_cmd+0x1b/0x250 [qla2xxx]
[979178.890258] Code: e9 19 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55 53 4c 8b 6f 28 4c 8b 7f 20 4c 8b 77 48 <f0> 41 ff 46 04 0f ae f0 41 f6 46 24 04 74 17 f0 41 ff 4e 04 bd 02
[979178.890801] RSP: 0018:ffffb1250ba83da8 EFLAGS: 00010293
[979178.890966] RAX: 0000000000000800 RBX: ffff93b89db837a8 RCX: 00000000000005f4
[979178.891178] RDX: ffff93b89e28afa8 RSI: 0000000000000001 RDI: ffff93b8a5018fc0
[979178.891389] RBP: ffff93b89ccb89c0 R08: ffffffffc0595860 R09: 0000000000000000
[979178.891600] R10: 8080808080808080 R11: 0000000000000010 R12: ffff93b89db82000
[979178.891811] R13: ffff93b89db837a8 R14: 0000000000000000 R15: ffff93b89d88a800
[979178.892023] FS: 0000000000000000(0000) GS:ffff93b8a7b80000(0000) knlGS:0000000000000000
[979178.892258] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[979178.892430] CR2: 0000000000000004 CR3: 000000021a62a000 CR4: 00000000000006e0
[979178.892642] Call Trace:
[979178.892748] qla24xx_abort_command+0x218/0x2d0 [qla2xxx]
[979178.892911] ? __switch_to_asm+0x41/0x70
[979178.893031] ? __switch_to_asm+0x35/0x70
[979178.893160] qla2xxx_eh_abort+0x117/0x310 [qla2xxx]
[979178.893323] scmd_eh_abort_handler+0x85/0x220 [scsi_mod]
[979178.893484] process_one_work+0x1a7/0x3a0
[979178.893611] worker_thread+0x30/0x390
[979178.893727] ? create_worker+0x1a0/0x1a0
[979178.893847] kthread+0x112/0x130
[979178.893948] ? kthread_bind+0x30/0x30
[979178.894064] ret_from_fork+0x35/0x40
[979178.894174] Modules linked in: loop ipmi_ssif radeon ttm drm_kms_helper drm coretemp i2c_algo_bit iTCO_wdt iTCO_vendor_support ipmi_si joydev kvm sg evdev i5000_edac ipmi_devintf pcc_cpufreq ipmi_msghandler rng_core i5k_amb irqbypass dcdbas serio_raw acpi_cpufreq button pcspkr ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_service_time dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua uas usb_storage hid_generic usbhid hid sr_mod ses cdrom enclosure sd_mod scsi_transport_sas ata_generic qla2xxx uhci_hcd ehci_pci ehci_hcd psmouse ata_piix nvme_fc libata nvme_fabrics usbcore nvme_core megaraid_sas scsi_transport_fc scsi_mod lpc_ich mfd_core usb_common bnx2
[979178.895968] CR2: 0000000000000004
[979178.896075] ---[ end trace 4d42692cc0dc3c87 ]---
[979178.896225] RIP: 0010:qla24xx_async_abort_cmd+0x1b/0x250 [qla2xxx]
[979178.896414] Code: e9 19 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55 53 4c 8b 6f 28 4c 8b 7f 20 4c 8b 77 48 <f0> 41 ff 46 04 0f ae f0 41 f6 46 24 04 74 17 f0 41 ff 4e 04 bd 02
[979178.896956] RSP: 0018:ffffb1250ba83da8 EFLAGS: 00010293
[979178.897121] RAX: 0000000000000800 RBX: ffff93b89db837a8 RCX: 00000000000005f4
[979178.897332] RDX: ffff93b89e28afa8 RSI: 0000000000000001 RDI: ffff93b8a5018fc0
[979178.897544] RBP: ffff93b89ccb89c0 R08: ffffffffc0595860 R09: 0000000000000000
[979178.908415] R10: 8080808080808080 R11: 0000000000000010 R12: ffff93b89db82000
[979178.919419] R13: ffff93b89db837a8 R14: 0000000000000000 R15: ffff93b89d88a800
[979178.930444] FS: 0000000000000000(0000) GS:ffff93b8a7b80000(0000) knlGS:0000000000000000
[979178.941366] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[979178.952142] CR2: 0000000000000004 CR3: 000000021a62a000 CR4: 00000000000006e0
[980103.072740] mysqld[2175]: segfault at 0 ip 000055bbc5cd2d93 sp 00007f2362ffb450 error 6 in mysqld[55bbc551a000+805000]
[980103.083956] Code: c7 45 00 00 00 00 00 8b 7d cc 4c 89 e2 4c 89 f6 e8 62 81 84 ff 49 89 c7 49 39 c4 0f 84 f6 00 00 00 e8 e1 1c 00 00 41 8b 4d 00 <89> 08 85 c9 74 37 49 83 ff ff 0f 84 9d 00 00 00 f6 c3 06 75 28 4d




--
Ondrej Zary


2020-02-23 19:27:01

by Bart Van Assche

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On 2020-02-23 10:29, Ondrej Zary wrote:
> a couple of days after upgrading a server from Debian 9 (kernel 4.9.210-1)
> to 10 (kernel 4.19.98), qla2xxx crashed, along with mysql.
>
> There is an EMC CX3 array connected through the fibre-channel adapter.
> No errors are present in EMC event log.
>
> This server was running without any problems since Debian 4.
> Is this a known bug?

Please report issues encountered with Debian kernels in the Debian bug
tracker. If you want the upstream community to assist please retest with
an upstream kernel.

Thanks,

Bart.

2020-02-23 19:57:48

by Ondrej Zary

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On Sunday 23 February 2020 20:26:39 Bart Van Assche wrote:
> On 2020-02-23 10:29, Ondrej Zary wrote:
> > a couple of days after upgrading a server from Debian 9 (kernel 4.9.210-1)
> > to 10 (kernel 4.19.98), qla2xxx crashed, along with mysql.
> >
> > There is an EMC CX3 array connected through the fibre-channel adapter.
> > No errors are present in EMC event log.
> >
> > This server was running without any problems since Debian 4.
> > Is this a known bug?
>
> Please report issues encountered with Debian kernels in the Debian bug
> tracker. If you want the upstream community to assist please retest with
> an upstream kernel.

The Debian kernel does not carry any patches related to the qla2xxx driver:
https://salsa.debian.org/kernel-team/linux/raw/debian/4.19.98-1/debian/patches/series

It crashed after running for 11 days, so retesting is not quick or easy.

--
Ondrej Zary

2020-02-24 02:17:32

by Bart Van Assche

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On 2020-02-23 11:57, Ondrej Zary wrote:
> On Sunday 23 February 2020 20:26:39 Bart Van Assche wrote:
>> On 2020-02-23 10:29, Ondrej Zary wrote:
>>> a couple of days after upgrading a server from Debian 9 (kernel 4.9.210-1)
>>> to 10 (kernel 4.19.98), qla2xxx crashed, along with mysql.
>>>
>>> There is an EMC CX3 array connected through the fibre-channel adapter.
>>> No errors are present in EMC event log.
>>>
>>> This server was running without any problems since Debian 4.
>>> Is this a known bug?
>>
>> Please report issues encountered with Debian kernels in the Debian bug
>> tracker. If you want the upstream community to assist please retest with
>> an upstream kernel.
>
> Debian kernel does not have any patches related to qla2xxx driver:
> https://salsa.debian.org/kernel-team/linux/raw/debian/4.19.98-1/debian/patches/series
>
> It crashed after running for 11 days. Not a quick&easy test.

It would help a lot if the crash address were translated into a
source code line number. Something like the following commands should do
the trick:
$ gdb drivers/scsi/qla2xxx/qla2xxx.ko
(gdb) list *(qla24xx_async_abort_cmd+0x1b)

Thanks,

Bart.

2020-02-24 08:22:00

by Ondrej Zary

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On Monday 24 February 2020, Bart Van Assche wrote:
> On 2020-02-23 11:57, Ondrej Zary wrote:
> > On Sunday 23 February 2020 20:26:39 Bart Van Assche wrote:
> >> On 2020-02-23 10:29, Ondrej Zary wrote:
> >>> a couple of days after upgrading a server from Debian 9 (kernel
> >>> 4.9.210-1) to 10 (kernel 4.19.98), qla2xxx crashed, along with mysql.
> >>>
> >>> There is an EMC CX3 array connected through the fibre-channel adapter.
> >>> No errors are present in EMC event log.
> >>>
> >>> This server was running without any problems since Debian 4.
> >>> Is this a known bug?
> >>
> >> Please report issues encountered with Debian kernels in the Debian bug
> >> tracker. If you want the upstream community to assist please retest with
> >> an upstream kernel.
> >
> > Debian kernel does not have any patches related to qla2xxx driver:
> > > https://salsa.debian.org/kernel-team/linux/raw/debian/4.19.98-1/debian/patches/series
> >
> > It crashed after running for 11 days. Not a quick&easy test.
>
> It would help a lot if the crash address would be translated into a
> source code line number. Something like the following commands should do
> the trick:
> $ gdb drivers/scsi/qla2xxx/qla2xxx.ko
> (gdb) list *(qla24xx_async_abort_cmd+0x1b)

Looks like it's in some inlined function.

/usr/src/linux-source-4.19# gdb /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
...
Reading symbols from /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...Reading symbols from /usr/lib/debug//lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...done.
done.

(gdb) list *(qla24xx_async_abort_cmd+0x1b)
0xf88b is in qla24xx_async_abort_cmd (./arch/x86/include/asm/atomic.h:97).
92 *
93 * Atomically increments @v by 1.
94 */
95 static __always_inline void arch_atomic_inc(atomic_t *v)
96 {
97 asm volatile(LOCK_PREFIX "incl %0"
98 : "+m" (v->counter) :: "memory");
99 }
100 #define arch_atomic_inc arch_atomic_inc
101

(gdb) list *(qla24xx_abort_command+0x218)
0x22238 is in qla24xx_abort_command (./drivers/scsi/qla2xxx/qla_mbx.c:3084).
3079
3080 if (vha->flags.qpairs_available && sp->qpair)
3081 req = sp->qpair->req;
3082
3083 if (ql2xasynctmfenable)
3084 return qla24xx_async_abort_command(sp);
3085
3086 spin_lock_irqsave(&ha->hardware_lock, flags);
3087 for (handle = 1; handle < req->num_outstanding_cmds; handle++) {
3088 if (req->outstanding_cmds[handle] == sp)

(gdb) list *(qla2xxx_eh_abort+0x117)
0x15e7 is in qla2xxx_eh_abort (./drivers/scsi/qla2xxx/qla_os.c:1314).
1309 /* Get a reference to the sp and drop the lock.*/
1310 sp_get(sp);
1311
1312 spin_unlock_irqrestore(&ha->hardware_lock, flags);
1313 rval = ha->isp_ops->abort_command(sp);
1314 if (rval) {
1315 if (rval == QLA_FUNCTION_PARAMETER_ERROR)
1316 ret = SUCCESS;
1317 else
1318 ret = FAILED;

(gdb) disassemble qla24xx_async_abort_cmd
Dump of assembler code for function qla24xx_async_abort_cmd:
0x000000000000f870 <+0>: callq 0xf875 <qla24xx_async_abort_cmd+5>
0x000000000000f875 <+5>: push %r15
0x000000000000f877 <+7>: push %r14
0x000000000000f879 <+9>: push %r13
0x000000000000f87b <+11>: push %r12
0x000000000000f87d <+13>: push %rbp
0x000000000000f87e <+14>: push %rbx
0x000000000000f87f <+15>: mov 0x28(%rdi),%r13
0x000000000000f883 <+19>: mov 0x20(%rdi),%r15
0x000000000000f887 <+23>: mov 0x48(%rdi),%r14
0x000000000000f88b <+27>: lock incl 0x4(%r14)
0x000000000000f890 <+32>: mfence
0x000000000000f893 <+35>: testb $0x4,0x24(%r14)
0x000000000000f898 <+40>: je 0xf8b1 <qla24xx_async_abort_cmd+65>
0x000000000000f89a <+42>: lock decl 0x4(%r14)
0x000000000000f89f <+47>: mov $0x102,%ebp
0x000000000000f8a4 <+52>: pop %rbx
0x000000000000f8a5 <+53>: mov %ebp,%eax
0x000000000000f8a7 <+55>: pop %rbp
0x000000000000f8a8 <+56>: pop %r12
0x000000000000f8aa <+58>: pop %r13
0x000000000000f8ac <+60>: pop %r14
0x000000000000f8ae <+62>: pop %r15
0x000000000000f8b0 <+64>: retq
0x000000000000f8b1 <+65>: mov %rdi,%rbp
0x000000000000f8b4 <+68>: mov 0x30(%r14),%rdi
0x000000000000f8b8 <+72>: mov %esi,%r12d
0x000000000000f8bb <+75>: mov $0x6000c0,%esi
0x000000000000f8c0 <+80>: callq 0xf8c5 <qla24xx_async_abort_cmd+85>
0x000000000000f8c5 <+85>: mov %rax,%rbx
0x000000000000f8c8 <+88>: test %rax,%rax
0x000000000000f8cb <+91>: je 0xf89a <qla24xx_async_abort_cmd+42>
0x000000000000f8cd <+93>: lea 0x8(%rax),%rdi
0x000000000000f8d1 <+97>: mov %rax,%rcx
0x000000000000f8d4 <+100>: movq $0x0,(%rax)
0x000000000000f8db <+107>: mov $0xc,%edx
0x000000000000f8e0 <+112>: movq $0x0,0x180(%rax)
0x000000000000f8eb <+123>: and $0xfffffffffffffff8,%rdi
0x000000000000f8ef <+127>: xor %eax,%eax
0x000000000000f8f1 <+129>: sub %rdi,%rcx
0x000000000000f8f4 <+132>: add $0x188,%ecx
0x000000000000f8fa <+138>: shr $0x3,%ecx
0x000000000000f8fd <+141>: rep stos %rax,%es:(%rdi)
0x000000000000f900 <+144>: mov %r15,0x20(%rbx)
0x000000000000f904 <+148>: movl $0x1,0x40(%rbx)
0x000000000000f90b <+155>: mov 0x18(%r14),%rax
0x000000000000f90f <+159>: mov %dx,0x36(%rbx)
0x000000000000f913 <+163>: movq $0x0,0x38(%rbx)
0x000000000000f91b <+171>: mov %rax,0x28(%rbx)
0x000000000000f91f <+175>: lea 0x50(%rbx),%rax
0x000000000000f923 <+179>: mov %rax,0x50(%rbx)
0x000000000000f927 <+183>: mov %rax,0x58(%rbx)
0x000000000000f92b <+187>: mov 0x48(%rbp),%rax
0x000000000000f92f <+191>: mov %rax,0x48(%rbx)
0x000000000000f933 <+195>: test %r12b,%r12b
0x000000000000f936 <+198>: je 0xf941 <qla24xx_async_abort_cmd+209>
0x000000000000f938 <+200>: mov $0x40,%eax
0x000000000000f93d <+205>: mov %ax,0x34(%rbx)
0x000000000000f941 <+209>: lea 0xa0(%rbx),%rdi
0x000000000000f948 <+216>: mov $0x0,%rdx
0x000000000000f94f <+223>: mov $0x0,%rsi
0x000000000000f956 <+230>: movq $0x0,0x170(%rbx)
0x000000000000f961 <+241>: lea 0x148(%rbx),%r14
0x000000000000f968 <+248>: movl $0x0,0x98(%rbx)
0x000000000000f972 <+258>: callq 0xf977 <qla24xx_async_abort_cmd+263>
0x000000000000f977 <+263>: xor %r8d,%r8d
0x000000000000f97a <+266>: xor %ecx,%ecx
0x000000000000f97c <+268>: xor %edx,%edx
0x000000000000f97e <+270>: mov $0x0,%rsi
0x000000000000f985 <+277>: mov %r14,%rdi
0x000000000000f988 <+280>: callq 0xf98d <qla24xx_async_abort_cmd+285>
0x000000000000f98d <+285>: mov 0x0(%rip),%rax # 0xf994 <qla24xx_async_abort_cmd+292>
0x000000000000f994 <+292>: lea 0x78(%rbx),%rdi
0x000000000000f998 <+296>: mov $0x0,%rdx
0x000000000000f99f <+303>: mov $0x0,%rsi
0x000000000000f9a6 <+310>: movl $0x0,0x70(%rbx)
0x000000000000f9ad <+317>: add $0x2904,%rax
0x000000000000f9b3 <+323>: movq $0x0,0x180(%rbx)
0x000000000000f9be <+334>: mov %rax,0x158(%rbx)
0x000000000000f9c5 <+341>: callq 0xf9ca <qla24xx_async_abort_cmd+346>
0x000000000000f9ca <+346>: mov 0x28(%rbx),%rax
0x000000000000f9ce <+350>: mov 0x448(%rax),%rax
0x000000000000f9d5 <+357>: testb $0x2,0x15a(%rax)
0x000000000000f9dc <+364>: jne 0xfa80 <qla24xx_async_abort_cmd+528>
0x000000000000f9e2 <+370>: mov %r14,%rdi
0x000000000000f9e5 <+373>: callq 0xf9ea <qla24xx_async_abort_cmd+378>
0x000000000000f9ea <+378>: mov 0x30(%rbp),%r8d
0x000000000000f9ee <+382>: mov 0x48(%rbp),%rax
0x000000000000f9f2 <+386>: mov %r13,%rsi
0x000000000000f9f5 <+389>: movzwl 0x36(%rbp),%r9d
0x000000000000f9fa <+394>: mov $0x507c,%edx
0x000000000000f9ff <+399>: mov $0x2000000,%edi
0x000000000000fa04 <+404>: mov $0x0,%rcx
0x000000000000fa0b <+411>: mov %r8d,0x90(%rbx)
0x000000000000fa12 <+418>: mov 0x48(%rax),%rax
0x000000000000fa16 <+422>: movzwl 0x40(%rax),%eax
0x000000000000fa1a <+426>: movq $0x0,0x178(%rbx)
0x000000000000fa25 <+437>: mov %ax,0x96(%rbx)
0x000000000000fa2c <+444>: callq 0xfa31 <qla24xx_async_abort_cmd+449>
0x000000000000fa31 <+449>: mov %rbx,%rdi
0x000000000000fa34 <+452>: callq 0xfa39 <qla24xx_async_abort_cmd+457>
0x000000000000fa39 <+457>: mov %eax,%ebp
0x000000000000fa3b <+459>: test %eax,%eax
0x000000000000fa3d <+461>: jne 0xfa64 <qla24xx_async_abort_cmd+500>
0x000000000000fa3f <+463>: test %r12b,%r12b
0x000000000000fa42 <+466>: je 0xf8a4 <qla24xx_async_abort_cmd+52>
0x000000000000fa48 <+472>: lea 0x98(%rbx),%rdi
0x000000000000fa4f <+479>: callq 0xfa54 <qla24xx_async_abort_cmd+484>
0x000000000000fa54 <+484>: cmpw $0x0,0x94(%rbx)
0x000000000000fa5c <+492>: mov $0x102,%eax
0x000000000000fa61 <+497>: cmovne %eax,%ebp
0x000000000000fa64 <+500>: mov 0x180(%rbx),%rax
0x000000000000fa6b <+507>: mov %rbx,%rdi
0x000000000000fa6e <+510>: callq 0xfa73 <qla24xx_async_abort_cmd+515>
0x000000000000fa73 <+515>: mov %ebp,%eax
0x000000000000fa75 <+517>: pop %rbx
0x000000000000fa76 <+518>: pop %rbp
0x000000000000fa77 <+519>: pop %r12
0x000000000000fa79 <+521>: pop %r13
0x000000000000fa7b <+523>: pop %r14
0x000000000000fa7d <+525>: pop %r15
0x000000000000fa7f <+527>: retq
0x000000000000fa80 <+528>: cmpw $0xa,0x36(%rbx)
0x000000000000fa85 <+533>: jne 0xf9e2 <qla24xx_async_abort_cmd+370>
0x000000000000fa8b <+539>: lea 0xe8(%rbx),%rdi
0x000000000000fa92 <+546>: mov $0x0,%rdx
0x000000000000fa99 <+553>: mov $0x0,%rsi
0x000000000000faa0 <+560>: movl $0x0,0xe0(%rbx)
0x000000000000faaa <+570>: callq 0xfaaf <qla24xx_async_abort_cmd+575>
0x000000000000faaf <+575>: jmpq 0xf9e2 <qla24xx_async_abort_cmd+370>
End of assembler dump.


--
Ondrej Zary

2020-02-25 03:42:12

by Bart Van Assche

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On 2020-02-24 00:20, Ondrej Zary wrote:
> Looks like it's in some inlined function.
>
> /usr/src/linux-source-4.19# gdb /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> GNU gdb (Debian 8.2.1-2+b3) 8.2.1
> ...
> Reading symbols from /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...Reading symbols
> from /usr/lib/debug//lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...done.
> done.
>
> (gdb) list *(qla24xx_async_abort_cmd+0x1b)
> 0xf88b is in qla24xx_async_abort_cmd (./arch/x86/include/asm/atomic.h:97).
> 92 *
> 93 * Atomically increments @v by 1.
> 94 */
> 95 static __always_inline void arch_atomic_inc(atomic_t *v)
> 96 {
> 97 asm volatile(LOCK_PREFIX "incl %0"
> 98 : "+m" (v->counter) :: "memory");
> 99 }
> 100 #define arch_atomic_inc arch_atomic_inc
>
> [ ... ]
>
> (gdb) disassemble qla24xx_async_abort_cmd
> Dump of assembler code for function qla24xx_async_abort_cmd:
> 0x000000000000f870 <+0>: callq 0xf875 <qla24xx_async_abort_cmd+5>
> 0x000000000000f875 <+5>: push %r15
> 0x000000000000f877 <+7>: push %r14
> 0x000000000000f879 <+9>: push %r13
> 0x000000000000f87b <+11>: push %r12
> 0x000000000000f87d <+13>: push %rbp
> 0x000000000000f87e <+14>: push %rbx
> 0x000000000000f87f <+15>: mov 0x28(%rdi),%r13
> 0x000000000000f883 <+19>: mov 0x20(%rdi),%r15
> 0x000000000000f887 <+23>: mov 0x48(%rdi),%r14
> 0x000000000000f88b <+27>: lock incl 0x4(%r14)
> 0x000000000000f890 <+32>: mfence

Thanks, this is very helpful. I think the above means that the crash is
triggered by the following code:

sp = qla2xxx_get_qpair_sp(cmd_sp->qpair, cmd_sp->fcport,
GFP_KERNEL);

From the start of qla2xxx_get_qpair_sp():

QLA_QPAIR_MARK_BUSY(qpair, bail);

From qla_def.h:

#define QLA_QPAIR_MARK_BUSY(__qpair, __bail) do { \
	atomic_inc(&__qpair->ref_count); \
	mb(); \
	if (__qpair->delete_in_progress) { \
		atomic_dec(&__qpair->ref_count); \
		__bail = 1; \
	} else { \
		__bail = 0; \
	} \
} while (0)

One of the changes between kernel version v4.9.210 and v4.19.98 is the
following: "qla2xxx: Add multiple queue pair functionality". I think the
above information means that the cmd_sp->qpair pointer is NULL. I will
let QLogic recommend a solution.

Bart.

2020-02-27 17:10:58

by Ondrej Zary

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)



On Tuesday 25 February 2020 04:41:48 Bart Van Assche wrote:
> On 2020-02-24 00:20, Ondrej Zary wrote:
> > Looks like it's in some inlined function.
> >
> > /usr/src/linux-source-4.19# gdb /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> > GNU gdb (Debian 8.2.1-2+b3) 8.2.1
> > ...
> > Reading symbols from /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...Reading symbols
> > from /usr/lib/debug//lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...done.
> > done.
> >
> > (gdb) list *(qla24xx_async_abort_cmd+0x1b)
> > 0xf88b is in qla24xx_async_abort_cmd (./arch/x86/include/asm/atomic.h:97).
> > 92 *
> > 93 * Atomically increments @v by 1.
> > 94 */
> > 95 static __always_inline void arch_atomic_inc(atomic_t *v)
> > 96 {
> > 97 asm volatile(LOCK_PREFIX "incl %0"
> > 98 : "+m" (v->counter) :: "memory");
> > 99 }
> > 100 #define arch_atomic_inc arch_atomic_inc
> >
> > [ ... ]
> >
> > (gdb) disassemble qla24xx_async_abort_cmd
> > Dump of assembler code for function qla24xx_async_abort_cmd:
> > 0x000000000000f870 <+0>: callq 0xf875 <qla24xx_async_abort_cmd+5>
> > 0x000000000000f875 <+5>: push %r15
> > 0x000000000000f877 <+7>: push %r14
> > 0x000000000000f879 <+9>: push %r13
> > 0x000000000000f87b <+11>: push %r12
> > 0x000000000000f87d <+13>: push %rbp
> > 0x000000000000f87e <+14>: push %rbx
> > 0x000000000000f87f <+15>: mov 0x28(%rdi),%r13
> > 0x000000000000f883 <+19>: mov 0x20(%rdi),%r15
> > 0x000000000000f887 <+23>: mov 0x48(%rdi),%r14
> > 0x000000000000f88b <+27>: lock incl 0x4(%r14)
> > 0x000000000000f890 <+32>: mfence
>
> Thanks, this is very helpful. I think the above means that the crash is
> triggered by the following code:
>
> sp = qla2xxx_get_qpair_sp(cmd_sp->qpair, cmd_sp->fcport,
> GFP_KERNEL);
>
> From the start of qla2xxx_get_qpair_sp():
>
> QLA_QPAIR_MARK_BUSY(qpair, bail);
>
> From qla_def.h:
>
> #define QLA_QPAIR_MARK_BUSY(__qpair, __bail) do { \
> atomic_inc(&__qpair->ref_count); \
> mb(); \
> if (__qpair->delete_in_progress) { \
> atomic_dec(&__qpair->ref_count); \
> __bail = 1; \
> } else { \
> __bail = 0; \
> } \
> } while (0)
>
> One of the changes between kernel version v4.9.210 and v4.19.98 is the
> following: "qla2xxx: Add multiple queue pair functionality". I think the
> above information means that the cmd_sp->qpair pointer is NULL. I will
> let QLogic recommend a solution.

Thank you very much for the analysis.
Unfortunately, QLogic does not seem to care...

--
Ondrej Zary

2020-03-02 22:27:15

by Ondrej Zary

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On Thursday 27 February 2020 18:09:07 Ondrej Zary wrote:
>
> On Tuesday 25 February 2020 04:41:48 Bart Van Assche wrote:
> > On 2020-02-24 00:20, Ondrej Zary wrote:
> > > Looks like it's in some inlined function.
> > >
> > > /usr/src/linux-source-4.19# gdb /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> > > GNU gdb (Debian 8.2.1-2+b3) 8.2.1
> > > ...
> > > Reading symbols from /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...Reading symbols
> > > from /usr/lib/debug//lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...done.
> > > done.
> > >
> > > (gdb) list *(qla24xx_async_abort_cmd+0x1b)
> > > 0xf88b is in qla24xx_async_abort_cmd (./arch/x86/include/asm/atomic.h:97).
> > > 92 *
> > > 93 * Atomically increments @v by 1.
> > > 94 */
> > > 95 static __always_inline void arch_atomic_inc(atomic_t *v)
> > > 96 {
> > > 97 asm volatile(LOCK_PREFIX "incl %0"
> > > 98 : "+m" (v->counter) :: "memory");
> > > 99 }
> > > 100 #define arch_atomic_inc arch_atomic_inc
> > >
> > > [ ... ]
> > >
> > > (gdb) disassemble qla24xx_async_abort_cmd
> > > Dump of assembler code for function qla24xx_async_abort_cmd:
> > > 0x000000000000f870 <+0>: callq 0xf875 <qla24xx_async_abort_cmd+5>
> > > 0x000000000000f875 <+5>: push %r15
> > > 0x000000000000f877 <+7>: push %r14
> > > 0x000000000000f879 <+9>: push %r13
> > > 0x000000000000f87b <+11>: push %r12
> > > 0x000000000000f87d <+13>: push %rbp
> > > 0x000000000000f87e <+14>: push %rbx
> > > 0x000000000000f87f <+15>: mov 0x28(%rdi),%r13
> > > 0x000000000000f883 <+19>: mov 0x20(%rdi),%r15
> > > 0x000000000000f887 <+23>: mov 0x48(%rdi),%r14
> > > 0x000000000000f88b <+27>: lock incl 0x4(%r14)
> > > 0x000000000000f890 <+32>: mfence
> >
> > Thanks, this is very helpful. I think the above means that the crash is
> > triggered by the following code:
> >
> > sp = qla2xxx_get_qpair_sp(cmd_sp->qpair, cmd_sp->fcport,
> > GFP_KERNEL);
> >
> > From the start of qla2xxx_get_qpair_sp():
> >
> > QLA_QPAIR_MARK_BUSY(qpair, bail);
> >
> > From qla_def.h:
> >
> > #define QLA_QPAIR_MARK_BUSY(__qpair, __bail) do { \
> > atomic_inc(&__qpair->ref_count); \
> > mb(); \
> > if (__qpair->delete_in_progress) { \
> > atomic_dec(&__qpair->ref_count); \
> > __bail = 1; \
> > } else { \
> > __bail = 0; \
> > } \
> > } while (0)
> >
> > One of the changes between kernel version v4.9.210 and v4.19.98 is the
> > following: "qla2xxx: Add multiple queue pair functionality". I think the
> > above information means that the cmd_sp->qpair pointer is NULL. I will
> > let QLogic recommend a solution.
>
> Thank you very much for the analysis.
> Unfortunately, QLogic does not seem to care...

Let's try to CC the people at Cavium who signed off on the commit.

--
Ondrej Zary

2020-03-19 18:04:02

by Ondrej Zary

Subject: Re: NULL pointer dereference in qla24xx_abort_command, kernel 4.19.98 (Debian)

On Monday 02 March 2020 23:26:08 Ondrej Zary wrote:
> On Thursday 27 February 2020 18:09:07 Ondrej Zary wrote:
> >
> > On Tuesday 25 February 2020 04:41:48 Bart Van Assche wrote:
> > > On 2020-02-24 00:20, Ondrej Zary wrote:
> > > > Looks like it's in some inlined function.
> > > >
> > > > /usr/src/linux-source-4.19# gdb /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko
> > > > GNU gdb (Debian 8.2.1-2+b3) 8.2.1
> > > > ...
> > > > Reading symbols from /lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...Reading symbols
> > > > from /usr/lib/debug//lib/modules/4.19.0-8-amd64/kernel/drivers/scsi/qla2xxx/qla2xxx.ko...done.
> > > > done.
> > > >
> > > > (gdb) list *(qla24xx_async_abort_cmd+0x1b)
> > > > 0xf88b is in qla24xx_async_abort_cmd (./arch/x86/include/asm/atomic.h:97).
> > > > 92 *
> > > > 93 * Atomically increments @v by 1.
> > > > 94 */
> > > > 95 static __always_inline void arch_atomic_inc(atomic_t *v)
> > > > 96 {
> > > > 97 asm volatile(LOCK_PREFIX "incl %0"
> > > > 98 : "+m" (v->counter) :: "memory");
> > > > 99 }
> > > > 100 #define arch_atomic_inc arch_atomic_inc
> > > >
> > > > [ ... ]
> > > >
> > > > (gdb) disassemble qla24xx_async_abort_cmd
> > > > Dump of assembler code for function qla24xx_async_abort_cmd:
> > > > 0x000000000000f870 <+0>: callq 0xf875 <qla24xx_async_abort_cmd+5>
> > > > 0x000000000000f875 <+5>: push %r15
> > > > 0x000000000000f877 <+7>: push %r14
> > > > 0x000000000000f879 <+9>: push %r13
> > > > 0x000000000000f87b <+11>: push %r12
> > > > 0x000000000000f87d <+13>: push %rbp
> > > > 0x000000000000f87e <+14>: push %rbx
> > > > 0x000000000000f87f <+15>: mov 0x28(%rdi),%r13
> > > > 0x000000000000f883 <+19>: mov 0x20(%rdi),%r15
> > > > 0x000000000000f887 <+23>: mov 0x48(%rdi),%r14
> > > > 0x000000000000f88b <+27>: lock incl 0x4(%r14)
> > > > 0x000000000000f890 <+32>: mfence
> > >
> > > Thanks, this is very helpful. I think the above means that the crash is
> > > triggered by the following code:
> > >
> > > sp = qla2xxx_get_qpair_sp(cmd_sp->qpair, cmd_sp->fcport,
> > > GFP_KERNEL);
> > >
> > > From the start of qla2xxx_get_qpair_sp():
> > >
> > > QLA_QPAIR_MARK_BUSY(qpair, bail);
> > >
> > > From qla_def.h:
> > >
> > > #define QLA_QPAIR_MARK_BUSY(__qpair, __bail) do { \
> > > atomic_inc(&__qpair->ref_count); \
> > > mb(); \
> > > if (__qpair->delete_in_progress) { \
> > > atomic_dec(&__qpair->ref_count); \
> > > __bail = 1; \
> > > } else { \
> > > __bail = 0; \
> > > } \
> > > } while (0)
> > >
> > > One of the changes between kernel version v4.9.210 and v4.19.98 is the
> > > following: "qla2xxx: Add multiple queue pair functionality". I think the
> > > above information means that the cmd_sp->qpair pointer is NULL. I will
> > > let QLogic recommend a solution.
> >
> > Thank you very much for the analysis.
> > Unfortunately, QLogic does not seem to care...
>
> Let's try to CC the people at Cavium that signed-off the commit.

No reply.

[email protected] address is dead:
Generating server: DC5-EXCH01.marvell.com
[email protected]
Remote Server returned '550 5.1.1 RESOLVER.ADR.RecipNotFound; not found'

Added some more CC addresses.

Yesterday it crashed again at the same place:

[2076301.849762] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[2076301.850021] PGD 0 P4D 0
[2076301.850109] Oops: 0002 [#1] SMP PTI
[2076301.850219] CPU: 4 PID: 18992 Comm: kworker/u16:1 Not tainted 4.19.0-8-amd64 #1 Debian 4.19.98-1
[2076301.850478] Hardware name: Dell Inc. PowerEdge 2950/0JR815, BIOS 2.7.0 10/30/2010
[2076301.850720] Workqueue: scsi_tmf_4 scmd_eh_abort_handler [scsi_mod]
[2076301.850936] RIP: 0010:qla24xx_async_abort_cmd+0x1b/0x250 [qla2xxx]
[2076301.851130] Code: e9 19 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55 53 4c 8b 6f 28 4c 8b 7f 20 4c 8b 77 48 <f0> 41 ff 46 04 0f ae f0 41 f6 46 24 04 74 17 f0 41 ff 4e 04 bd 02
[2076301.851663] RSP: 0018:ffffa10f8bbe7da8 EFLAGS: 00010293
[2076301.851820] RAX: 0000000000000800 RBX: ffff8ab8ddd197a8 RCX: 0000000000000070
[2076301.852036] RDX: ffff8ab8de4a8388 RSI: 0000000000000001 RDI: ffff8ab8799b8c40
[2076301.852253] RBP: ffff8ab8dc96c480 R08: ffffffffc03b7860 R09: 0000000000000000
[2076301.852469] R10: 8080808080808080 R11: 0000000000000010 R12: ffff8ab8dea00000
[2076301.852686] R13: ffff8ab8ddd197a8 R14: 0000000000000000 R15: ffff8ab8dd632000
[2076301.852902] FS: 0000000000000000(0000) GS:ffff8ab8e7b00000(0000) knlGS:0000000000000000
[2076301.853142] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2076301.853320] CR2: 0000000000000004 CR3: 00000002203dc000 CR4: 00000000000006e0
[2076301.853536] Call Trace:
[2076301.853632] qla24xx_abort_command+0x218/0x2d0 [qla2xxx]
[2076301.853799] ? __switch_to_asm+0x41/0x70
[2076301.853924] ? __switch_to_asm+0x35/0x70
[2076301.854056] qla2xxx_eh_abort+0x117/0x310 [qla2xxx]
[2076301.854209] scmd_eh_abort_handler+0x85/0x220 [scsi_mod]
[2076301.854375] process_one_work+0x1a7/0x3a0
[2076301.854506] worker_thread+0x30/0x390
[2076301.854628] ? create_worker+0x1a0/0x1a0
[2076301.854753] kthread+0x112/0x130
[2076301.854859] ? kthread_bind+0x30/0x30
[2076301.854980] ret_from_fork+0x35/0x40
[2076301.855095] Modules linked in: loop ipmi_ssif radeon coretemp ttm drm_kms_helper drm kvm i2c_algo_bit i5000_edac iTCO_wdt sg iTCO_vendor_support irqbypass evdev i5k_amb serio_raw joydev ipmi_si rng_core pcc_cpufreq dcdbas pcspkr ipmi_devintf acpi_cpufreq ipmi_msghandler button ext4 crc16 mbcache jbd2 crc32c_generic fscrypto ecb crypto_simd cryptd glue_helper aes_x86_64 dm_service_time dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua uas usb_storage hid_generic usbhid hid sr_mod cdrom ses enclosure sd_mod scsi_transport_sas ata_generic qla2xxx ata_piix nvme_fc ehci_pci nvme_fabrics libata uhci_hcd psmouse ehci_hcd nvme_core megaraid_sas usbcore scsi_transport_fc lpc_ich mfd_core scsi_mod usb_common bnx2
[2076301.856887] CR2: 0000000000000004
[2076301.856999] ---[ end trace e9083db8fb76e126 ]---
[2076301.857151] RIP: 0010:qla24xx_async_abort_cmd+0x1b/0x250 [qla2xxx]
[2076301.857345] Code: e9 19 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 41 57 41 56 41 55 41 54 55 53 4c 8b 6f 28 4c 8b 7f 20 4c 8b 77 48 <f0> 41 ff 46 04 0f ae f0 41 f6 46 24 04 74 17 f0 41 ff 4e 04 bd 02
[2076301.857878] RSP: 0018:ffffa10f8bbe7da8 EFLAGS: 00010293
[2076301.858035] RAX: 0000000000000800 RBX: ffff8ab8ddd197a8 RCX: 0000000000000070
[2076301.858251] RDX: ffff8ab8de4a8388 RSI: 0000000000000001 RDI: ffff8ab8799b8c40
[2076301.858467] RBP: ffff8ab8dc96c480 R08: ffffffffc03b7860 R09: 0000000000000000
[2076301.869384] R10: 8080808080808080 R11: 0000000000000010 R12: ffff8ab8dea00000
[2076301.880412] R13: ffff8ab8ddd197a8 R14: 0000000000000000 R15: ffff8ab8dd632000
[2076301.891483] FS: 0000000000000000(0000) GS:ffff8ab8e7b00000(0000) knlGS:0000000000000000
[2076301.902490] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2076301.913344] CR2: 0000000000000004 CR3: 00000002203dc000 CR4: 00000000000006e0
[2077225.259348] mysqld[2155]: segfault at 0 ip 000056409366ad93 sp 00007fa049514450 error 6 in mysqld[564092eb2000+805000]
[2077225.270564] Code: c7 45 00 00 00 00 00 8b 7d cc 4c 89 e2 4c 89 f6 e8 62 81 84 ff 49 89 c7 49 39 c4 0f 84 f6 00 00 00 e8 e1 1c 00 00 41 8b 4d 00 <89> 08 85 c9 74 37 49 83 ff ff 0f 84 9d 00 00 00 f6 c3 06 75 28 4d


--
Ondrej Zary