2022-11-07 17:52:53

by Gerd Bayer

Subject: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

Hi,

our internal s390 CI pointed us to a potential race ("use after free" or
similar) in drivers/nvme/host/pci.c, when one of its tests ended in the
following kernel panic:

[ 1836.550881] nvme nvme0: pci function 0004:00:00.0
[ 1836.563814] nvme nvme0: Shutdown timeout set to 15 seconds
[ 1836.569587] nvme nvme0: 63/0/0 default/read/poll queues
[ 1836.577114] nvme0n1: p1 p2
[ 1861.856726] nvme nvme0: pci function 0004:00:00.0
[ 1861.869539] nvme nvme0: failed to mark controller CONNECTING
[ 1861.869542] nvme nvme0: Removing after probe failure status: -16
[ 1861.869552] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1861.869554] Failing address: 0000000000000000 TEID: 0000000000000483
[ 1861.869555] Fault in home space mode while using kernel ASCE.
[ 1861.869558] AS:0000000135c4c007 R3:00000003fffe0007 S:00000003fffe6000 P:000000000000013d
[ 1861.869587] Oops: 0004 ilc:3 [#1] SMP
[ 1861.869591] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables
nfnetlink mlx5_ib ib_uverbs uvdevice s390_trng ib_core vfio_ccw mdev vfio_iommu_type1 eadm_sch
vfio sch_fq_codel configfs dm_service_time mlx5_core ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes
sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 nvme sha_common nvme_core zfcp scsi_transport_fc
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log pkey zcry
pt rng_core autofs4
[ 1861.869627] CPU: 4 PID: 2929 Comm: kworker/u800:0 Not tainted 6.1.0-rc3-next-20221104 #4
[ 1861.869630] Hardware name: IBM 3931 A01 701 (LPAR)
[ 1861.869631] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 1861.869637] Krnl PSW : 0704c00180000000 0000000134f026d0 (mutex_lock+0x10/0x28)
[ 1861.869643] R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 1861.869646] Krnl GPRS: 0000000001000000 0000000000000000 0000000000000078 00000000a5f8c200
[ 1861.869648] 000003800309601c 0000000000000004 0000000000000000 0000000088e64220
[ 1861.869650] 0000000000000078 0000000000000000 0000000000000098 0000000088e64000
[ 1861.869651] 00000000a5f8c200 0000000088e641e0 00000001349bdac2 0000038003ea7c20
[ 1861.869658] Krnl Code: 0000000134f026c0: c0040008cfb8 brcl 0,000000013501c630
[ 1861.869658] 0000000134f026c6: a7190000 lghi %r1,0
[ 1861.869658] #0000000134f026ca: e33003400004 lg %r3,832
[ 1861.869658] >0000000134f026d0: eb1320000030 csg %r1,%r3,0(%r2)
[ 1861.869658] 0000000134f026d6: ec160006007c cgij %r1,0,6,0000000134f026e2
[ 1861.869658] 0000000134f026dc: 07fe bcr 15,%r14
[ 1861.869658] 0000000134f026de: 47000700 bc 0,1792
[ 1861.869658] 0000000134f026e2: c0f4ffffffe7 brcl 15,0000000134f026b0
[ 1861.869715] Call Trace:
[ 1861.869716] [<0000000134f026d0>] mutex_lock+0x10/0x28
[ 1861.869719] [<000003ff7fc381d6>] nvme_dev_disable+0x1b6/0x2b0 [nvme]
[ 1861.869722] [<000003ff7fc3929e>] nvme_reset_work+0x49e/0x6a0 [nvme]
[ 1861.869724] [<0000000134309158>] process_one_work+0x200/0x458
[ 1861.869730] [<00000001343098e6>] worker_thread+0x66/0x480
[ 1861.869732] [<0000000134312888>] kthread+0x108/0x110
[ 1861.869735] [<0000000134297354>] __ret_from_fork+0x3c/0x58
[ 1861.869738] [<0000000134f074ea>] ret_from_fork+0xa/0x40
[ 1861.869740] Last Breaking-Event-Address:
[ 1861.869741] [<00000001349bdabc>] blk_mq_quiesce_tagset+0x2c/0xc0
[ 1861.869747] Kernel panic - not syncing: Fatal exception: panic_on_oops

On a stock kernel from
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20221104
we have been able to reproduce this at will with the following small script:

#!/usr/bin/env bash

echo $1 > /sys/bus/pci/drivers/nvme/unbind
echo $1 > /sys/bus/pci/drivers/nvme/bind
echo 1 > /sys/bus/pci/devices/$1/remove

filling in the NVMe drive's PCI identifier as $1.

We believe this to be a race condition somewhere, since the same sequence does
not produce the panic when executed interactively.
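Since the window appears to be narrow, driving the sequence in a loop makes it easier to hit. Below is a sketch of such a looped reproducer; the address-format check, the iteration count, and the rescan step between passes are additions beyond the three writes above, not part of the original script:

```shell
#!/usr/bin/env bash
# Sketch of a looped reproducer: repeat the unbind/bind/remove sequence
# to widen the race window. $1 is the NVMe function's PCI address
# (e.g. 0004:00:00.0), $2 an optional iteration count. Needs root and
# takes the device offline repeatedly.

pci_addr_valid() {
    # PCI addresses have the form domain:bus:device.function.
    [[ "$1" =~ ^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$ ]]
}

repro_once() {
    echo "$1" > /sys/bus/pci/drivers/nvme/unbind
    echo "$1" > /sys/bus/pci/drivers/nvme/bind
    echo 1 > "/sys/bus/pci/devices/$1/remove"
}

if [[ -n "${1:-}" ]]; then
    pci_addr_valid "$1" || { echo "bad PCI address: $1" >&2; exit 1; }
    for ((i = 0; i < ${2:-10}; i++)); do
        repro_once "$1"
        # Bring the function back so the next pass has a device to unbind.
        echo 1 > /sys/bus/pci/rescan
    done
fi
```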

Could this be linked to the recent (refactoring) work by Christoph Hellwig?
E.g. https://lore.kernel.org/all/[email protected]/

Thank you,
Gerd Bayer



2022-11-08 04:16:29

by Chaitanya Kulkarni

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

On 11/7/22 09:28, Gerd Bayer wrote:
> Hi,
>
> our internal s390 CI pointed us to a potential racy "use after free" or similar
> issue in drivers/nvme/host/pci.c by ending one of the tests in the following
> kernel panic:
>

Thanks a lot for reporting this ...

[...]
>
> On a stock kernel from
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20221104
> we have been able to reproduce this at will with
> this small script
>
> #!/usr/bin/env bash
>
> echo $1 > /sys/bus/pci/drivers/nvme/unbind
> echo $1 > /sys/bus/pci/drivers/nvme/bind
> echo 1 > /sys/bus/pci/devices/$1/remove
>
> when filling in the NVMe drives' PCI identifier.
>

Can you please submit a blktests test case for this,
so it gets run by everyone at each release?
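A test case along those lines might look roughly as follows. This is only a sketch: the sourced rc file and the helper names (_nvme_requires, _get_pci_dev_from_blkdev, test_device with TEST_NAME) are assumptions modeled on existing tests/nvme cases and should be checked against the blktests tree:

```shell
#!/bin/bash
# SPDX-License-Identifier: GPL-3.0+
#
# Sketch of a blktests-style case for the unbind/bind/remove race.
# Helper names below are assumptions based on existing tests/nvme cases.
[ -e tests/nvme/rc ] && . tests/nvme/rc

DESCRIPTION="unbind/bind/remove an NVMe PCI function in a loop"
QUICK=1

# Pure helper: extract the dddd:bb:dd.f PCI address from a sysfs path,
# so the parsing can be checked in isolation.
_pci_addr_from_syspath() {
	[[ "$1" =~ ([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]) ]] &&
		echo "${BASH_REMATCH[1]}"
}

requires() {
	_nvme_requires
}

test_device() {
	echo "Running ${TEST_NAME}"
	local pdev
	pdev="$(_get_pci_dev_from_blkdev)"
	for ((i = 0; i < 10; i++)); do
		echo "$pdev" > /sys/bus/pci/drivers/nvme/unbind
		echo "$pdev" > /sys/bus/pci/drivers/nvme/bind
		echo 1 > "/sys/bus/pci/devices/$pdev/remove"
		# Bring the function back for the next iteration.
		echo 1 > /sys/bus/pci/rescan
	done
	echo "Test complete"
}
```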

> We believe this to be a race-condition somewhere, since this sequence does not produce the panic
> when executed interactively.
>

You can try bisecting the code to pinpoint the exact commit.

> Could this be linked to the recent (refactoring) work by Christoph Hellwig?
> E.g. https://lore.kernel.org/all/[email protected]/
>
> Thank you,
> Gerd Bayer
>
>

-ck

2022-11-08 08:09:07

by Christoph Hellwig

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

Below is the minimal fix. I'll see if I can sort out the mess that is
probe/reset failure vs ->remove a bit better, though.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f94b05c585cbc..577bacdcfee08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);

void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
+ if (!ctrl->tagset)
+ return;
if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
blk_mq_quiesce_tagset(ctrl->tagset);
else
@@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);

void nvme_start_queues(struct nvme_ctrl *ctrl)
{
+ if (!ctrl->tagset)
+ return;
if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
blk_mq_unquiesce_tagset(ctrl->tagset);
}

2022-11-08 16:37:47

by Keith Busch

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

On Tue, Nov 08, 2022 at 03:50:33AM +0000, Chaitanya Kulkarni wrote:
> On 11/7/22 09:28, Gerd Bayer wrote:
>
> > We believe this to be a race-condition somewhere, since this sequence does not produce the panic
> > when executed interactively.
> >
>
> You can try and bisect the code to point out at exact commit.

Bisect doesn't work without a known good commit point.

2022-11-08 17:20:11

by Gerd Bayer

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

Hi Christoph,

with your minimal fix

On Tue, 2022-11-08 at 08:48 +0100, Christoph Hellwig wrote:
> Below is the minimal fix. I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_quiesce_tagset(ctrl->tagset);
> else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_unquiesce_tagset(ctrl->tagset);
> }

on next-20221108 the kernel no longer crashes when I run the short test
script.

dmesg shows:
Nov 08 17:38:51 a46lp24.lnxne.boe kernel: nvme nvme0: pci function 0004:00:00.0
Nov 08 17:38:51 a46lp24.lnxne.boe kernel: nvme nvme0: failed to mark controller CONNECTING
Nov 08 17:38:51 a46lp24.lnxne.boe kernel: nvme nvme0: Removing after probe failure status: -16
Nov 08 17:38:52 a46lp24.lnxne.boe kernel: pci 0004:00:00.0: Removing from iommu group 0

while the kernel remains up.
I can even
- rescan the PCI bus (to bring back the NVMe drive), and
- run the test script
multiple times.

So from my point of view this band-aid is worth incorporating while the larger
overhaul in

https://lore.kernel.org/linux-nvme/[email protected]/

is out for review and testing.

Feel free to add my
Tested-by: Gerd Bayer <[email protected]>

Thank you,
Gerd Bayer


2022-11-08 18:02:21

by Keith Busch

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

On Tue, Nov 08, 2022 at 08:48:47AM +0100, Christoph Hellwig wrote:
> Below is the minimal fix. I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.

This looks good to go. I vote apply for 6.1, and we should consider
merging your larger refactor for 6.2 after more test/review.

Reviewed-by: Keith Busch <[email protected]>

> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_quiesce_tagset(ctrl->tagset);
> else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_unquiesce_tagset(ctrl->tagset);
> }
>

2022-11-09 04:34:36

by Sagi Grimberg

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next


> Below is the minimal fix. I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>
> void nvme_stop_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_quiesce_tagset(ctrl->tagset);
> else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>
> void nvme_start_queues(struct nvme_ctrl *ctrl)
> {
> + if (!ctrl->tagset)
> + return;
> if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
> blk_mq_unquiesce_tagset(ctrl->tagset);
> }

Can we do that in the pci driver and not here?

2022-11-09 06:42:45

by Christoph Hellwig

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

On Tue, Nov 08, 2022 at 10:23:31AM -0700, Keith Busch wrote:
> This looks good to go. I vote apply for 6.1, and we should consider
> merging your larger refactor for 6.2 after more test/review.

The problem doesn't exist in 6.1 yet; it is caused by the recent
quiesce rework.

2022-11-09 08:05:27

by Chao Leng

Subject: Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next

The check "if (!ctrl->tagset)" just reduces the probability.

The real reason is a race between probe and remove.
It is similar for the TCP and RDMA transports.
Israel already tried to fix it; the details:
https://github.com/torvalds/linux/commit/ce1518139e6976cf19c133b555083354fdb629b8
Unfortunately, that patch was reverted.

While "probe" is still in progress, remove should not be called.
Maybe we can move pci_set_drvdata to the end of nvme_probe.
Of course, then the removal may not take effect while "probe" is still in
progress. This is why Israel's patch was reverted.

Perhaps the better option would be for "remove" to wait for "probe" to
complete, and then do the real remove.
This requires an additional mechanism to implement.

On 2022/11/9 10:54, Sagi Grimberg wrote:
>
>> Below is the minimal fix.  I'll see if I sort out the mess that is
>> probe/reset failure vs ->remove a bit better, though.
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f94b05c585cbc..577bacdcfee08 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>>   void nvme_stop_queues(struct nvme_ctrl *ctrl)
>>   {
>> +    if (!ctrl->tagset)
>> +        return;
>>       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>           blk_mq_quiesce_tagset(ctrl->tagset);
>>       else
>> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>>   void nvme_start_queues(struct nvme_ctrl *ctrl)
>>   {
>> +    if (!ctrl->tagset)
>> +        return;
>>       if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>           blk_mq_unquiesce_tagset(ctrl->tagset);
>>   }
>
> Can we do that in the pci driver and not here?
>