2021-10-22 10:12:00

by Li Chen

Subject: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

Hi, all

I found that my nvme may time out under a simple dd. The PCIe controller is a Cadence one, the host driver I use is pcie-cadence-host.c, and the kernel version is 5.10.32:

# dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
[ 41.913484][ T274] urandom_read: 2 callbacks suppressed
[ 41.913490][ T274] random: dd: uninitialized urandom read (4096 bytes read)
[ 41.926130][ T274] random: dd: uninitialized urandom read (4096 bytes read)
[ 41.933348][ T274] random: dd: uninitialized urandom read (4096 bytes read)
[ 47.651842][ C0] random: crng init done
[ 47.655963][ C0] random: 2 urandom warning(s) missed due to ratelimiting
[ 81.448128][ T64] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
[ 81.481139][ T7] nvme 0000:05:00.0: enabling bus mastering
[ 81.486946][ T7] nvme 0000:05:00.0: saving config space at offset 0x0 (reading 0xa809144d)
[ 81.495517][ T7] nvme 0000:05:00.0: saving config space at offset 0x4 (reading 0x20100006)
[ 81.504091][ T7] nvme 0000:05:00.0: saving config space at offset 0x8 (reading 0x1080200)
[ 81.512571][ T7] nvme 0000:05:00.0: saving config space at offset 0xc (reading 0x0)
[ 81.520527][ T7] nvme 0000:05:00.0: saving config space at offset 0x10 (reading 0x8000004)
[ 81.529094][ T7] nvme 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
[ 81.537138][ T7] nvme 0000:05:00.0: saving config space at offset 0x18 (reading 0x0)
[ 81.545186][ T7] nvme 0000:05:00.0: saving config space at offset 0x1c (reading 0x0)
[ 81.553252][ T7] nvme 0000:05:00.0: saving config space at offset 0x20 (reading 0x0)
[ 81.561296][ T7] nvme 0000:05:00.0: saving config space at offset 0x24 (reading 0x0)
[ 81.569340][ T7] nvme 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
[ 81.577384][ T7] nvme 0000:05:00.0: saving config space at offset 0x2c (reading 0xa801144d)
[ 81.586038][ T7] nvme 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
[ 81.594081][ T7] nvme 0000:05:00.0: saving config space at offset 0x34 (reading 0x40)
[ 81.602217][ T7] nvme 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
[ 81.610266][ T7] nvme 0000:05:00.0: saving config space at offset 0x3c (reading 0x12c)
[ 81.634065][ T7] nvme nvme0: Shutdown timeout set to 8 seconds
[ 81.674332][ T7] nvme nvme0: 1/0/0 default/read/poll queues
[ 112.168136][ T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
[ 112.193145][ T256] blk_update_request: I/O error, dev nvme0n1, sector 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.205220][ T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost async page write
[ 112.213978][ T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost async page write
[ 112.222727][ T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost async page write
[ 112.231474][ T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost async page write
[ 112.240220][ T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost async page write
[ 112.248966][ T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost async page write
[ 112.257719][ T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost async page write
[ 112.266467][ T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost async page write
[ 112.275213][ T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost async page write
[ 112.283959][ T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost async page write
[ 112.293554][ T256] blk_update_request: I/O error, dev nvme0n1, sector 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.306559][ T256] blk_update_request: I/O error, dev nvme0n1, sector 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.319525][ T256] blk_update_request: I/O error, dev nvme0n1, sector 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.332501][ T256] blk_update_request: I/O error, dev nvme0n1, sector 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.345466][ T256] blk_update_request: I/O error, dev nvme0n1, sector 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.358427][ T256] blk_update_request: I/O error, dev nvme0n1, sector 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.371386][ T256] blk_update_request: I/O error, dev nvme0n1, sector 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.384346][ T256] blk_update_request: I/O error, dev nvme0n1, sector 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.397315][ T256] blk_update_request: I/O error, dev nvme0n1, sector 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
[ 112.459313][ T7] nvme nvme0: failed to mark controller live state
[ 112.465758][ T7] nvme nvme0: Removing after probe failure status: -19
dd: error writing '/dev/nvme0n1': No space left on device
112200+0 records in
112199+0 records out
459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
# [ 112.935768][ T7] nvme nvme0: failed to set APST feature (-19)




Here is the dts I used:
pciec: pcie-controller@2040000000 {
        compatible = "cdns,cdns-pcie-host";
        device_type = "pci";
        #address-cells = <3>;
        #size-cells = <2>;
        bus-range = <0 5>;
        linux,pci-domain = <0>;
        cdns,no-bar-match-nbits = <38>;
        vendor-id = <0x17cd>;
        device-id = <0x0100>;
        reg-names = "reg", "cfg";
        reg = <0x20 0x40000000 0x0 0x10000000>,
              <0x20 0x00000000 0x0 0x00001000>; /* RC only */

        /*
         * type: 0x00000000 cfg space
         * type: 0x01000000 IO
         * type: 0x02000000 32bit mem space No prefetch
         * type: 0x03000000 64bit mem space No prefetch
         * type: 0x43000000 64bit mem space prefetch
         * The first 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
         * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
         */
        ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
                 <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;

        #interrupt-cells = <0x1>;
        interrupt-map-mask = <0x00 0x0 0x0 0x7>;
        interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
                        <0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
                        <0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
                        <0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
        phys = <&pcie_phy>;
        phy-names = "pcie-phy";
        status = "ok";
};


After some digging, I found that if I change the controller's ranges property from <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> to <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>, then dd succeeds without any timeout. IIUC, this range only describes the non-prefetchable 32-bit MMIO window, but dd should go through DMA (or maybe the CPU sends commands to the nvme controller via MMIO?).
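
For reference, here is my own annotation of the two non-prefetchable MEM entries, assuming the usual 3-cell PCI address / 2-cell parent (CPU) address / 2-cell size layout implied by #address-cells/#size-cells above; please correct me if I am decoding the cells wrong:

/*
 * failing: <0x02000000  0x00 0x08000000  0x20 0x08000000  0x00 0x08000000>
 *           flags        PCI 0x08000000   CPU 0x2008000000  size 128 MiB
 *
 * working: <0x02000000  0x00 0x00400000  0x20 0x00400000  0x00 0x08000000>
 *           flags        PCI 0x00400000   CPU 0x2000400000  size 128 MiB
 *
 * Only the base of the 32-bit non-prefetchable window moves from bus
 * address 0x08000000 down to 0x00400000; the window size stays the same.
 */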

Questions:
1. Why can dd cause an nvme timeout? Are there more ways to debug this?
2. How can this MMIO range affect the nvme timeout?

Regards,
Li

**********************************************************************
This email and attachments contain Ambarella Proprietary and/or Confidential Information and is intended solely for the use of the individual(s) to whom it is addressed. Any unauthorized review, use, disclosure, distribute, copy, or print is prohibited. If you are not an intended recipient, please contact the sender by reply email and destroy all copies of the original message. Thank you.


2021-10-25 09:08:22

by Li Chen

Subject: RE: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

> [ 81.634065][ T7] nvme nvme0: Shutdown timeout set to 8 seconds
> [ 81.674332][ T7] nvme nvme0: 1/0/0 default/read/poll queues
> [ 112.168136][ T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller

I/O 12 on QID 1 was an nvme_cmd_write. I reproduced the timeout after rebooting, this time with ftrace enabled:

# dmesg | grep QID
[ 99.456260] nvme nvme0: I/O 128 QID 1 timeout, disable controller
# grep "qid=1" /sys/kernel/tracing/trace | egrep "cmdid=128\b"
kworker/0:1H-65 [000] .... 68.987492: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=128, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=216856, len=1015, ctrl=0x0, dsmgmt=0, reftag=0)
<idle>-0 [001] ..s1 99.475164: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=128, res=0x0, retries=1, flags=0x1, status=0x371
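
If I am reading the 5.10 driver correctly (please correct me), status=0x371 in that nvme_complete_rq event is not a completion returned by the SSD; it looks like the host-side status that nvme_cancel_request() fills in when the timeout handler aborts the command, i.e. the write never completed on the wire:

/* Host-defined status codes from include/linux/nvme.h (my reading of 5.10).
 * The timeout path cancels the request with NVME_SC_HOST_ABORTED_CMD, which
 * is what shows up as status=0x371 in the trace above. */
enum {
        NVME_SC_HOST_PATH_ERROR  = 0x370,
        NVME_SC_HOST_ABORTED_CMD = 0x371,
};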

Regards,
Li


2021-10-25 16:24:34

by Keith Busch

Subject: Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

On Mon, Oct 25, 2021 at 10:47:39AM -0500, Bjorn Helgaas wrote:
> [+cc Tom (Cadence maintainer), NVMe folks]
>
> On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > pciec: pcie-controller@2040000000 {
> > compatible = "cdns,cdns-pcie-host";
> > device_type = "pci";
> > #address-cells = <3>;
> > #size-cells = <2>;
> > bus-range = <0 5>;
> > linux,pci-domain = <0>;
> > cdns,no-bar-match-nbits = <38>;
> > vendor-id = <0x17cd>;
> > device-id = <0x0100>;
> > reg-names = "reg", "cfg";
> > reg = <0x20 0x40000000 0x0 0x10000000>,
> > <0x20 0x00000000 0x0 0x00001000>; /* RC only */
> >
> > /*
> > * type: 0x00000000 cfg space
> > * type: 0x01000000 IO
> > * type: 0x02000000 32bit mem space No prefetch
> > * type: 0x03000000 64bit mem space No prefetch
> > * type: 0x43000000 64bit mem space prefetch
> > * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
> > */
> > ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;
> >
> > #interrupt-cells = <0x1>;
> > interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > <0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > <0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > <0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > phys = <&pcie_phy>;
> > phy-names="pcie-phy";
> > status = "ok";
> > };
> >
> >
> > After some digging, I find if I change the controller's range
> > property from
> >
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> >
> > then dd will success without timeout. IIUC, range here
> > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > will send cmd to nvme controller via mmio?).

Generally speaking, the nvme driver notifies the controller of new
commands via an MMIO write to a specific nvme register. The nvme
controller then fetches those commands from host memory with a DMA.
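
Something like this, as a rough sketch (not the literal driver code; the struct and function names here are made up), just to show where the MMIO is in that submission path:

/* Submission queue entries live in ordinary host RAM; the only MMIO in the
 * submission path is the 32-bit doorbell write that tells the controller a
 * new tail index is available, after which the controller DMA-reads the
 * command (and later the data) from host memory. */
struct sq_sketch {
        struct nvme_command *sq_cmds;   /* SQ entries, allocated in host RAM */
        u16 sq_tail;
        u16 q_depth;
        void __iomem *q_db;             /* doorbell register inside BAR0 */
};

static void submit_sketch(struct sq_sketch *q, const struct nvme_command *cmd)
{
        memcpy(&q->sq_cmds[q->sq_tail], cmd, sizeof(*cmd));  /* host memory write */
        if (++q->sq_tail == q->q_depth)
                q->sq_tail = 0;
        writel(q->sq_tail, q->q_db);    /* the one MMIO: ring the doorbell */
}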

One exception to that description is if the nvme controller supports CMB
with SQEs, but those are not very common. If you have such a controller,
the driver will use MMIO to write commands directly into controller
memory instead of letting the controller DMA them from host memory. Do
you know if you have such a controller?

The data transfers associated with your 'dd' command will always use DMA.
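
As a hedged sketch of that data path (map_data_sketch is a made-up helper, not the actual nvme code): the driver DMA-maps the data pages and puts their bus addresses into the command, so the dd payload itself never crosses the MMIO window:

#include <linux/dma-mapping.h>
#include <linux/nvme.h>

static int map_data_sketch(struct device *dev, struct page *page,
                           unsigned int len, struct nvme_rw_command *rw)
{
        dma_addr_t addr = dma_map_page(dev, page, 0, len, DMA_TO_DEVICE);

        if (dma_mapping_error(dev, addr))
                return -ENOMEM;
        rw->prp1 = cpu_to_le64(addr);   /* controller DMAs the data at this address */
        return 0;
}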

> I don't know how to interpret "ranges". Can you supply the dmesg and
> "lspci -vvs 0000:05:00.0" output both ways, e.g.,
>
> pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
>
> > Question:
> > 1. Why dd can cause nvme timeout? Is there more debug ways?

That means the nvme controller didn't provide a response to a posted
command within the driver's latency tolerance.

> > 2. How can this mmio range affect nvme timeout?

Let's see how those ranges affect what the kernel sees in the pci
topology, as Bjorn suggested.

2021-10-25 19:08:43

by Bjorn Helgaas

Subject: Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

[+cc Tom (Cadence maintainer), NVMe folks]

On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> Hi, all
>
> I found my nvme may get timeout with simply dd, the pcie controller
> is cadence, and the pcie-controller platform I use is
> pcie-cadence-host.c, kenrel version is 5.10.32:

Given that nvme is pretty widely used, I tend to suspect the cadence
driver or DTS here, but I added both cadence and nvme folks to the CC:
list.

> # dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
> [ 41.913484][ T274] urandom_read: 2 callbacks suppressed
> [ 41.913490][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> [ 41.926130][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> [ 41.933348][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> [ 47.651842][ C0] random: crng init done
> [ 47.655963][ C0] random: 2 urandom warning(s) missed due to ratelimiting
> [ 81.448128][ T64] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x2010
> [ 81.481139][ T7] nvme 0000:05:00.0: enabling bus mastering
> [ 81.486946][ T7] nvme 0000:05:00.0: saving config space at offset 0x0 (reading 0xa809144d)
> [ 81.495517][ T7] nvme 0000:05:00.0: saving config space at offset 0x4 (reading 0x20100006)
> [ 81.504091][ T7] nvme 0000:05:00.0: saving config space at offset 0x8 (reading 0x1080200)
> [ 81.512571][ T7] nvme 0000:05:00.0: saving config space at offset 0xc (reading 0x0)
> [ 81.520527][ T7] nvme 0000:05:00.0: saving config space at offset 0x10 (reading 0x8000004)
> [ 81.529094][ T7] nvme 0000:05:00.0: saving config space at offset 0x14 (reading 0x0)
> [ 81.537138][ T7] nvme 0000:05:00.0: saving config space at offset 0x18 (reading 0x0)
> [ 81.545186][ T7] nvme 0000:05:00.0: saving config space at offset 0x1c (reading 0x0)
> [ 81.553252][ T7] nvme 0000:05:00.0: saving config space at offset 0x20 (reading 0x0)
> [ 81.561296][ T7] nvme 0000:05:00.0: saving config space at offset 0x24 (reading 0x0)
> [ 81.569340][ T7] nvme 0000:05:00.0: saving config space at offset 0x28 (reading 0x0)
> [ 81.577384][ T7] nvme 0000:05:00.0: saving config space at offset 0x2c (reading 0xa801144d)
> [ 81.586038][ T7] nvme 0000:05:00.0: saving config space at offset 0x30 (reading 0x0)
> [ 81.594081][ T7] nvme 0000:05:00.0: saving config space at offset 0x34 (reading 0x40)
> [ 81.602217][ T7] nvme 0000:05:00.0: saving config space at offset 0x38 (reading 0x0)
> [ 81.610266][ T7] nvme 0000:05:00.0: saving config space at offset 0x3c (reading 0x12c)
> [ 81.634065][ T7] nvme nvme0: Shutdown timeout set to 8 seconds
> [ 81.674332][ T7] nvme nvme0: 1/0/0 default/read/poll queues
> [ 112.168136][ T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
> [ 112.193145][ T256] blk_update_request: I/O error, dev nvme0n1, sector 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.205220][ T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost async page write
> [ 112.213978][ T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost async page write
> [ 112.222727][ T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost async page write
> [ 112.231474][ T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost async page write
> [ 112.240220][ T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost async page write
> [ 112.248966][ T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost async page write
> [ 112.257719][ T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost async page write
> [ 112.266467][ T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost async page write
> [ 112.275213][ T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost async page write
> [ 112.283959][ T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost async page write
> [ 112.293554][ T256] blk_update_request: I/O error, dev nvme0n1, sector 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.306559][ T256] blk_update_request: I/O error, dev nvme0n1, sector 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.319525][ T256] blk_update_request: I/O error, dev nvme0n1, sector 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.332501][ T256] blk_update_request: I/O error, dev nvme0n1, sector 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.345466][ T256] blk_update_request: I/O error, dev nvme0n1, sector 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.358427][ T256] blk_update_request: I/O error, dev nvme0n1, sector 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.371386][ T256] blk_update_request: I/O error, dev nvme0n1, sector 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.384346][ T256] blk_update_request: I/O error, dev nvme0n1, sector 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.397315][ T256] blk_update_request: I/O error, dev nvme0n1, sector 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> [ 112.459313][ T7] nvme nvme0: failed to mark controller live state
> [ 112.465758][ T7] nvme nvme0: Removing after probe failure status: -19
> dd: error writing '/dev/nvme0n1': No space left on device
> 112200+0 records in
> 112199+0 records out
> 459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
> # [ 112.935768][ T7] nvme nvme0: failed to set APST feature (-19)
>
>
>
>
> Here is the dts I used:
> pciec: pcie-controller@2040000000 {
> compatible = "cdns,cdns-pcie-host";
> device_type = "pci";
> #address-cells = <3>;
> #size-cells = <2>;
> bus-range = <0 5>;
> linux,pci-domain = <0>;
> cdns,no-bar-match-nbits = <38>;
> vendor-id = <0x17cd>;
> device-id = <0x0100>;
> reg-names = "reg", "cfg";
> reg = <0x20 0x40000000 0x0 0x10000000>,
> <0x20 0x00000000 0x0 0x00001000>; /* RC only */
>
> /*
> * type: 0x00000000 cfg space
> * type: 0x01000000 IO
> * type: 0x02000000 32bit mem space No prefetch
> * type: 0x03000000 64bit mem space No prefetch
> * type: 0x43000000 64bit mem space prefetch
> * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00 0x01000000>, CFG_SPACE
> */
> ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
> <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000>;
>
> #interrupt-cells = <0x1>;
> interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> <0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> <0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> <0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> phys = <&pcie_phy>;
> phy-names="pcie-phy";
> status = "ok";
> };
>
>
> After some digging, I find if I change the controller's range
> property from
>
> <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
>
> then dd will success without timeout. IIUC, range here
> is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> will send cmd to nvme controller via mmio?).

I don't know how to interpret "ranges". Can you supply the dmesg and
"lspci -vvs 0000:05:00.0" output both ways, e.g.,

pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]

> Question:
> 1. Why dd can cause nvme timeout? Is there more debug ways?
> 2. How can this mmio range affect nvme timeout?
>
> Regards,
> Li
>

2021-10-26 07:08:09

by Li Chen

Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

Hi, Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:[email protected]]
> Sent: Monday, October 25, 2021 11:48 PM
> To: Li Chen
> Cc: [email protected]; Lorenzo Pieralisi; Rob Herring; [email protected];
> Bjorn Helgaas; [email protected]; Tom Joseph; Keith Busch; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
> [+cc Tom (Cadence maintainer), NVMe folks]
>
> On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > Hi, all
> >
> > I found my nvme may get timeout with simply dd, the pcie controller
> > is cadence, and the pcie-controller platform I use is
> > pcie-cadence-host.c, kenrel version is 5.10.32:
>
> Given that nvme is pretty widely used, I tend to suspect the cadence
> driver or DTS here, but I added both cadence and nvme folks to the CC:
> list.

Thanks!

>
> > # dd if=/dev/urandom of=/dev/nvme0n1 bs=4k count=1024000
> > [ 41.913484][ T274] urandom_read: 2 callbacks suppressed
> > [ 41.913490][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [ 41.926130][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [ 41.933348][ T274] random: dd: uninitialized urandom read (4096 bytes read)
> > [ 47.651842][ C0] random: crng init done
> > [ 47.655963][ C0] random: 2 urandom warning(s) missed due to ratelimiting
> > [ 81.448128][ T64] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x2010
> > [ 81.481139][ T7] nvme 0000:05:00.0: enabling bus mastering
> > [ 81.486946][ T7] nvme 0000:05:00.0: saving config space at offset 0x0
> (reading 0xa809144d)
> > [ 81.495517][ T7] nvme 0000:05:00.0: saving config space at offset 0x4
> (reading 0x20100006)
> > [ 81.504091][ T7] nvme 0000:05:00.0: saving config space at offset 0x8
> (reading 0x1080200)
> > [ 81.512571][ T7] nvme 0000:05:00.0: saving config space at offset 0xc
> (reading 0x0)
> > [ 81.520527][ T7] nvme 0000:05:00.0: saving config space at offset 0x10
> (reading 0x8000004)
> > [ 81.529094][ T7] nvme 0000:05:00.0: saving config space at offset 0x14
> (reading 0x0)
> > [ 81.537138][ T7] nvme 0000:05:00.0: saving config space at offset 0x18
> (reading 0x0)
> > [ 81.545186][ T7] nvme 0000:05:00.0: saving config space at offset 0x1c
> (reading 0x0)
> > [ 81.553252][ T7] nvme 0000:05:00.0: saving config space at offset 0x20
> (reading 0x0)
> > [ 81.561296][ T7] nvme 0000:05:00.0: saving config space at offset 0x24
> (reading 0x0)
> > [ 81.569340][ T7] nvme 0000:05:00.0: saving config space at offset 0x28
> (reading 0x0)
> > [ 81.577384][ T7] nvme 0000:05:00.0: saving config space at offset 0x2c
> (reading 0xa801144d)
> > [ 81.586038][ T7] nvme 0000:05:00.0: saving config space at offset 0x30
> (reading 0x0)
> > [ 81.594081][ T7] nvme 0000:05:00.0: saving config space at offset 0x34
> (reading 0x40)
> > [ 81.602217][ T7] nvme 0000:05:00.0: saving config space at offset 0x38
> (reading 0x0)
> > [ 81.610266][ T7] nvme 0000:05:00.0: saving config space at offset 0x3c
> (reading 0x12c)
> > [ 81.634065][ T7] nvme nvme0: Shutdown timeout set to 8 seconds
> > [ 81.674332][ T7] nvme nvme0: 1/0/0 default/read/poll queues
> > [ 112.168136][ T256] nvme nvme0: I/O 12 QID 1 timeout, disable controller
> > [ 112.193145][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 600656 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.205220][ T256] Buffer I/O error on dev nvme0n1, logical block 75082, lost
> async page write
> > [ 112.213978][ T256] Buffer I/O error on dev nvme0n1, logical block 75083, lost
> async page write
> > [ 112.222727][ T256] Buffer I/O error on dev nvme0n1, logical block 75084, lost
> async page write
> > [ 112.231474][ T256] Buffer I/O error on dev nvme0n1, logical block 75085, lost
> async page write
> > [ 112.240220][ T256] Buffer I/O error on dev nvme0n1, logical block 75086, lost
> async page write
> > [ 112.248966][ T256] Buffer I/O error on dev nvme0n1, logical block 75087, lost
> async page write
> > [ 112.257719][ T256] Buffer I/O error on dev nvme0n1, logical block 75088, lost
> async page write
> > [ 112.266467][ T256] Buffer I/O error on dev nvme0n1, logical block 75089, lost
> async page write
> > [ 112.275213][ T256] Buffer I/O error on dev nvme0n1, logical block 75090, lost
> async page write
> > [ 112.283959][ T256] Buffer I/O error on dev nvme0n1, logical block 75091, lost
> async page write
> > [ 112.293554][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 601672 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.306559][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 602688 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.319525][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 603704 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.332501][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 604720 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.345466][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 605736 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.358427][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 606752 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.371386][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 607768 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.384346][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 608784 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.397315][ T256] blk_update_request: I/O error, dev nvme0n1, sector
> 609800 op 0x1:(WRITE) flags 0x104000 phys_seg 127 prio class 0
> > [ 112.459313][ T7] nvme nvme0: failed to mark controller live state
> > [ 112.465758][ T7] nvme nvme0: Removing after probe failure status: -19
> > dd: error writing '/dev/nvme0n1': No space left on device
> > 112200+0 records in
> > 112199+0 records out
> > 459567104 bytes (438.3MB) copied, 70.573266 seconds, 6.2MB/s
> > # [ 112.935768][ T7] nvme nvme0: failed to set APST feature (-19)
> >
> >
> >
> >
> > Here is the dts I used:
> > pciec: pcie-controller@2040000000 {
> > compatible = "cdns,cdns-pcie-host";
> > device_type = "pci";
> > #address-cells = <3>;
> > #size-cells = <2>;
> > bus-range = <0 5>;
> > linux,pci-domain = <0>;
> > cdns,no-bar-match-nbits = <38>;
> > vendor-id = <0x17cd>;
> > device-id = <0x0100>;
> > reg-names = "reg", "cfg";
> > reg = <0x20 0x40000000 0x0 0x10000000>,
> > <0x20 0x00000000 0x0 0x00001000>; /* RC only */
> >
> > /*
> > * type: 0x00000000 cfg space
> > * type: 0x01000000 IO
> > * type: 0x02000000 32bit mem space No prefetch
> > * type: 0x03000000 64bit mem space No prefetch
> > * type: 0x43000000 64bit mem space prefetch
> > * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00
> 0x01000000>, CFG_SPACE
> > */
> > ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00
> 0x00100000>,
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00
> 0x08000000>;
> >
> > #interrupt-cells = <0x1>;
> > interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > <0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > <0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > <0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > phys = <&pcie_phy>;
> > phy-names="pcie-phy";
> > status = "ok";
> > };
> >
> >
> > After some digging, I find if I change the controller's range
> > property from
> >
> > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> >
> > then dd will success without timeout. IIUC, range here
> > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > will send cmd to nvme controller via mmio?).
>
> I don't know how to interpret "ranges". Can you supply the dmesg and
> "lspci -vvs 0000:05:00.0" output both ways, e.g.,
>
> pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
>

From the source code:

static int devm_of_pci_get_host_bridge_resources
{
        ...
        err = of_pci_range_parser_init(&parser, dev_node);
        ...
        for_each_of_pci_range(&parser, &range) {
                /* Read next ranges element */
                if ((range.flags & IORESOURCE_TYPE_BITS) == IORESOURCE_IO)
                        range_type = "IO";
                else if ((range.flags & IORESOURCE_TYPE_BITS) == IORESOURCE_MEM)
                        range_type = "MEM";
                else
                        range_type = "err";
                dev_info(dev, " %6s %#012llx..%#012llx -> %#012llx\n",
                         range_type, range.cpu_addr,
                         range.cpu_addr + range.size - 1, range.pci_addr);

                /*
                 * If we failed translation or got a zero-sized region
                 * then skip this range
                 */
                if (range.cpu_addr == OF_BAD_ADDR || range.size == 0)
                        continue;

                err = of_pci_range_to_resource(&range, dev_node, &tmp_res);
                ...
        }

So it simply parses the ranges property and translates each entry into an MMIO window resource.



Take
ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
         <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
for example (these ranges make nvme time out):

lspci output:
# lspci -i /usr/share/misc/pci.ids -vvs 0000:05:00.0
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 44
Region 0: Memory at 2008000000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b0] MSI-X: Enable- Count=13 Masked-
Vector table: BAR=0 offset=00003000
PBA: BAR=0 offset=00002000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Power Budgeting <?>
Capabilities: [168 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [188 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [190 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: nvme
Kernel modules: nvme




/proc/iomem:

01200000-ffffffff : System RAM
01280000-022affff : Kernel code
022b0000-0295ffff : reserved
02960000-040cffff : Kernel data
05280000-0528ffff : reserved
41cc0000-422c0fff : reserved
422c1000-4232afff : reserved
4232d000-667bbfff : reserved
667bc000-667bcfff : reserved
667bd000-667c0fff : reserved
667c1000-ffffffff : reserved
2000000000-2000000fff : cfg
2008000000-200bffffff : pcie-controller@2040000000
2008000000-20081fffff : PCI Bus 0000:01
2008000000-20080fffff : PCI Bus 0000:02
2008000000-20080fffff : PCI Bus 0000:05
2008000000-2008003fff : 0000:05:00.0
2008000000-2008003fff : nvme
2008100000-200817ffff : 0000:01:00.0
...

From dmesg output:

# dmesg | grep pci
[ 0.000000] Kernel command line: console=ttyS0 ubi.mtd=lnx root=ubi0:rootfs rw rootfstype=ubifs init=/linuxrc pci=nomsi pcie_pme=nomsi nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=0
[ 2.431732] cdns-pcie 2040000000.pcie-controller: host bridge /pcie-controller@2040000000 ranges:
[ 2.442055] cdns-pcie 2040000000.pcie-controller: Parsing ranges property...
[ 2.450010] cdns-pcie 2040000000.pcie-controller: IO 0x2000100000..0x20001fffff -> 0x0000000000
[ 2.460215] cdns-pcie 2040000000.pcie-controller: MEM 0x2008000000..0x200bffffff -> 0x0008000000
[ 2.523271] cdns-pcie 2040000000.pcie-controller: Link up
[ 2.530647] cdns-pcie 2040000000.pcie-controller: PCI host bridge to bus 0000:00
[ 2.538885] pci_bus 0000:00: root bus resource [bus 00-05]
[ 2.545194] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
[ 2.552280] pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus address [0x08000000-0x0bffffff])
[ 2.563651] pci_bus 0000:00: scanning bus
[ 2.568753] pci 0000:00:00.0: [17cd:0100] type 01 class 0x060400
[ 2.576196] pci 0000:00:00.0: supports D1
[ 2.581006] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[ 2.587667] pci 0000:00:00.0: PME# disabled
[ 2.597799] pci_bus 0000:00: fixups for bus
[ 2.602800] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.610308] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.619160] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.627525] pci_bus 0000:01: scanning bus
[ 2.632550] pci 0000:01:00.0: [12d8:c016] type 01 class 0x060400
[ 2.639425] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0007ffff]
[ 2.647250] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[ 2.654165] pci 0000:01:00.0: PME# disabled
[ 2.664011] pci_bus 0000:01: fixups for bus
[ 2.669012] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.676521] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.685370] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.693768] pci_bus 0000:02: scanning bus
[ 2.699101] pci 0000:02:02.0: [12d8:c016] type 01 class 0x060400
[ 2.706740] pci 0000:02:02.0: PME# supported from D0 D3hot D3cold
[ 2.713653] pci 0000:02:02.0: PME# disabled
[ 2.720417] pci 0000:02:04.0: [12d8:c016] type 01 class 0x060400
[ 2.728032] pci 0000:02:04.0: PME# supported from D0 D3hot D3cold
[ 2.734945] pci 0000:02:04.0: PME# disabled
[ 2.741684] pci 0000:02:06.0: [12d8:c016] type 01 class 0x060400
[ 2.749288] pci 0000:02:06.0: PME# supported from D0 D3hot D3cold
[ 2.756201] pci 0000:02:06.0: PME# disabled
[ 2.765483] pci_bus 0000:02: fixups for bus
[ 2.770482] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.777991] pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.786839] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.794337] pci 0000:02:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.803184] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.810691] pci 0000:02:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.819539] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.827948] pci_bus 0000:03: scanning bus
[ 2.836100] pci_bus 0000:03: fixups for bus
[ 2.841085] pci_bus 0000:03: bus scan returning with max=03
[ 2.847467] pci_bus 0000:03: busn_res: [bus 03-05] end is updated to 03
[ 2.854911] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.863151] pci_bus 0000:04: scanning bus
[ 2.871319] pci_bus 0000:04: fixups for bus
[ 2.876303] pci_bus 0000:04: bus scan returning with max=04
[ 2.882683] pci_bus 0000:04: busn_res: [bus 04-05] end is updated to 04
[ 2.890151] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.898383] pci_bus 0000:05: scanning bus
[ 2.903362] pci 0000:05:00.0: [144d:a809] type 00 class 0x010802
[ 2.910266] pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 2.923188] pci_bus 0000:05: fixups for bus
[ 2.928174] pci_bus 0000:05: bus scan returning with max=05
[ 2.934553] pci_bus 0000:05: busn_res: [bus 05] end is updated to 05
[ 2.941724] pci_bus 0000:02: bus scan returning with max=05
[ 2.948101] pci_bus 0000:02: busn_res: [bus 02-05] end is updated to 05
[ 2.955557] pci_bus 0000:01: bus scan returning with max=05
[ 2.961983] pci_bus 0000:01: busn_res: [bus 01-05] end is updated to 05
[ 2.969413] pci_bus 0000:00: bus scan returning with max=05
[ 2.975843] pci 0000:00:00.0: BAR 8: assigned [mem 0x2008000000-0x20081fffff]
[ 2.983843] pci 0000:01:00.0: BAR 8: assigned [mem 0x2008000000-0x20080fffff]
[ 2.991795] pci 0000:01:00.0: BAR 0: assigned [mem 0x2008100000-0x200817ffff]
[ 2.999801] pci 0000:02:06.0: BAR 8: assigned [mem 0x2008000000-0x20080fffff]
[ 3.007763] pci 0000:02:02.0: PCI bridge to [bus 03]
[ 3.013622] pci 0000:02:04.0: PCI bridge to [bus 04]
[ 3.019487] pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
[ 3.028012] pci 0000:02:06.0: PCI bridge to [bus 05]
[ 3.033792] pci 0000:02:06.0: bridge window [mem 0x2008000000-0x20080fffff]
[ 3.041767] pci 0000:01:00.0: PCI bridge to [bus 02-05]
[ 3.047808] pci 0000:01:00.0: bridge window [mem 0x2008000000-0x20080fffff]
[ 3.055782] pci 0000:00:00.0: PCI bridge to [bus 01-05]
[ 3.061822] pci 0000:00:00.0: bridge window [mem 0x2008000000-0x20081fffff]
[ 4.184289] pcie_pme=nomsi


Also take
ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00 0x00100000>,
         <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
for example (these ranges do not make nvme time out):

lspci output:
# lspci -i /usr/share/misc/pci.ids -vvs 0000:05:00.0
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980 (prog-if 02 [NVM Express])
Subsystem: Samsung Electronics Co Ltd Device a801
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 44
Region 0: Memory at 2000400000 (64-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/32 Maskable- 64bit+
Address: 0000000000000000 Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s (ok), Width x4 (ok)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b0] MSI-X: Enable- Count=13 Masked-
Vector table: BAR=0 offset=00003000
PBA: BAR=0 offset=00002000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap+ MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [158 v1] Power Budgeting <?>
Capabilities: [168 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Capabilities: [188 v1] Latency Tolerance Reporting
Max snoop latency: 0ns
Max no snoop latency: 0ns
Capabilities: [190 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=0us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Kernel driver in use: nvme
Kernel modules: nvme

/proc/iomem:
01200000-ffffffff : System RAM
01280000-022affff : Kernel code
022b0000-0295ffff : reserved
02960000-040cffff : Kernel data
05280000-0528ffff : reserved
41cc0000-422c0fff : reserved
422c1000-4232afff : reserved
4232d000-667bbfff : reserved
667bc000-667bcfff : reserved
667bd000-667c0fff : reserved
667c1000-ffffffff : reserved
2000000000-2000000fff : cfg
2000400000-20083fffff : pcie-controller@2040000000
2000400000-20005fffff : PCI Bus 0000:01
2000400000-20004fffff : PCI Bus 0000:02
2000400000-20004fffff : PCI Bus 0000:05
2000400000-2000403fff : 0000:05:00.0
2000500000-200057ffff : 0000:01:00.0
...

Dmesg:
# dmesg | grep pci
[ 0.000000] Kernel command line: console=ttyS0 ubi.mtd=lnx root=ubi0:rootfs rw rootfstype=ubifs init=/linuxrc pci=nomsi pcie_pme=nomsi nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pci=nommconf nvme_core.default_ps_max_latency_us=0
[ 2.436357] cdns-pcie 2040000000.pcie-controller: host bridge /pcie-controller@2040000000 ranges:
[ 2.446129] cdns-pcie 2040000000.pcie-controller: Parsing ranges property...
[ 2.454141] cdns-pcie 2040000000.pcie-controller: IO 0x2000100000..0x20001fffff -> 0x0000000000
[ 2.464350] cdns-pcie 2040000000.pcie-controller: MEM 0x2000400000..0x20083fffff -> 0x0000400000
[ 2.527395] cdns-pcie 2040000000.pcie-controller: Link up
[ 2.534745] cdns-pcie 2040000000.pcie-controller: PCI host bridge to bus 0000:00
[ 2.543036] pci_bus 0000:00: root bus resource [bus 00-05]
[ 2.549345] pci_bus 0000:00: root bus resource [io 0x0000-0xfffff]
[ 2.556431] pci_bus 0000:00: root bus resource [mem 0x2000400000-0x20083fffff] (bus address [0x00400000-0x083fffff])
[ 2.567802] pci_bus 0000:00: scanning bus
[ 2.572823] pci 0000:00:00.0: [17cd:0100] type 01 class 0x060400
[ 2.580290] pci 0000:00:00.0: supports D1
[ 2.585099] pci 0000:00:00.0: PME# supported from D0 D1 D3hot
[ 2.591709] pci 0000:00:00.0: PME# disabled
[ 2.601802] pci_bus 0000:00: fixups for bus
[ 2.606801] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.614310] pci 0000:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.623161] pci 0000:00:00.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.631551] pci_bus 0000:01: scanning bus
[ 2.636563] pci 0000:01:00.0: [12d8:c016] type 01 class 0x060400
[ 2.643460] pci 0000:01:00.0: reg 0x10: [mem 0x00000000-0x0007ffff]
[ 2.651289] pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
[ 2.658202] pci 0000:01:00.0: PME# disabled
[ 2.667966] pci_bus 0000:01: fixups for bus
[ 2.673047] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.680556] pci 0000:01:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.689405] pci 0000:01:00.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.697804] pci_bus 0000:02: scanning bus
[ 2.703079] pci 0000:02:02.0: [12d8:c016] type 01 class 0x060400
[ 2.710742] pci 0000:02:02.0: PME# supported from D0 D3hot D3cold
[ 2.717657] pci 0000:02:02.0: PME# disabled
[ 2.724466] pci 0000:02:04.0: [12d8:c016] type 01 class 0x060400
[ 2.732078] pci 0000:02:04.0: PME# supported from D0 D3hot D3cold
[ 2.739037] pci 0000:02:04.0: PME# disabled
[ 2.745734] pci 0000:02:06.0: [12d8:c016] type 01 class 0x060400
[ 2.753344] pci 0000:02:06.0: PME# supported from D0 D3hot D3cold
[ 2.760257] pci 0000:02:06.0: PME# disabled
[ 2.769510] pci_bus 0000:02: fixups for bus
[ 2.774532] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.782043] pci 0000:02:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.790892] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.798494] pci 0000:02:04.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.807347] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 0
[ 2.814855] pci 0000:02:06.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[ 2.823701] pci 0000:02:02.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.832098] pci_bus 0000:03: scanning bus
[ 2.840343] pci_bus 0000:03: fixups for bus
[ 2.845328] pci_bus 0000:03: bus scan returning with max=03
[ 2.851708] pci_bus 0000:03: busn_res: [bus 03-05] end is updated to 03
[ 2.859152] pci 0000:02:04.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.867404] pci_bus 0000:04: scanning bus
[ 2.875551] pci_bus 0000:04: fixups for bus
[ 2.880535] pci_bus 0000:04: bus scan returning with max=04
[ 2.886914] pci_bus 0000:04: busn_res: [bus 04-05] end is updated to 04
[ 2.894359] pci 0000:02:06.0: scanning [bus 00-00] behind bridge, pass 1
[ 2.902637] pci_bus 0000:05: scanning bus
[ 2.907632] pci 0000:05:00.0: [144d:a809] type 00 class 0x010802
[ 2.914536] pci 0000:05:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[ 2.927440] pci_bus 0000:05: fixups for bus
[ 2.932426] pci_bus 0000:05: bus scan returning with max=05
[ 2.938807] pci_bus 0000:05: busn_res: [bus 05] end is updated to 05
[ 2.946025] pci_bus 0000:02: bus scan returning with max=05
[ 2.952403] pci_bus 0000:02: busn_res: [bus 02-05] end is updated to 05
[ 2.959834] pci_bus 0000:01: bus scan returning with max=05
[ 2.966233] pci_bus 0000:01: busn_res: [bus 01-05] end is updated to 05
[ 2.973666] pci_bus 0000:00: bus scan returning with max=05
[ 2.980097] pci 0000:00:00.0: BAR 8: assigned [mem 0x2000400000-0x20005fffff]
[ 2.988097] pci 0000:01:00.0: BAR 8: assigned [mem 0x2000400000-0x20004fffff]
[ 2.996046] pci 0000:01:00.0: BAR 0: assigned [mem 0x2000500000-0x200057ffff]
[ 3.004092] pci 0000:02:06.0: BAR 8: assigned [mem 0x2000400000-0x20004fffff]
[ 3.012055] pci 0000:02:02.0: PCI bridge to [bus 03]
[ 3.017867] pci 0000:02:04.0: PCI bridge to [bus 04]
[ 3.023709] pci 0000:05:00.0: BAR 0: assigned [mem 0x2000400000-0x2000403fff 64bit]
[ 3.032253] pci 0000:02:06.0: PCI bridge to [bus 05]
[ 3.038071] pci 0000:02:06.0: bridge window [mem 0x2000400000-0x20004fffff]
[ 3.046086] pci 0000:01:00.0: PCI bridge to [bus 02-05]
[ 3.052127] pci 0000:01:00.0: bridge window [mem 0x2000400000-0x20004fffff]
[ 3.060100] pci 0000:00:00.0: PCI bridge to [bus 01-05]
[ 3.066140] pci 0000:00:00.0: bridge window [mem 0x2000400000-0x20005fffff]
[ 4.181033] pcie_pme=nomsi
[ 44.419264] nvme nvme0: pci function 0000:05:00.0
[ 44.425541] pci 0000:00:00.0: enabling device (0000 -> 0002)
[ 44.432217] pci 0000:00:00.0: enabling bus mastering
[ 44.438183] pci 0000:01:00.0: enabling device (0000 -> 0002)
[ 44.444720] pci 0000:01:00.0: enabling bus mastering
[ 44.453373] pci 0000:02:06.0: enabling device (0000 -> 0002)
[ 44.459951] pci 0000:02:06.0: enabling bus mastering


[firstlove@sh-lchen tmp]$ diff no-timeout.lspci timeout.lspci
8c8
< Region 0: Memory at 2000400000 (64-bit, non-prefetchable) [size=16K]
---
> Region 0: Memory at 2008000000 (64-bit, non-prefetchable) [size=16K]

> > Question:
> > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > 2. How can this mmio range affect nvme timeout?
> >
> > Regards,
> > Li
> >
> >

Regards,
Li

2021-10-26 08:44:56

by Li Chen

Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

Hi, Keith and Bjorn

> -----Original Message-----
> From: Keith Busch [mailto:[email protected]]
> Sent: Tuesday, October 26, 2021 12:22 AM
> To: Bjorn Helgaas
> Cc: Li Chen; [email protected]; Lorenzo Pieralisi; Rob Herring;
> [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
> On Mon, Oct 25, 2021 at 10:47:39AM -0500, Bjorn Helgaas wrote:
> > [+cc Tom (Cadence maintainer), NVMe folks]
> >
> > On Fri, Oct 22, 2021 at 10:08:20AM +0000, Li Chen wrote:
> > > pciec: pcie-controller@2040000000 {
> > > compatible = "cdns,cdns-pcie-host";
> > > device_type = "pci";
> > > #address-cells = <3>;
> > > #size-cells = <2>;
> > > bus-range = <0 5>;
> > > linux,pci-domain = <0>;
> > > cdns,no-bar-match-nbits = <38>;
> > > vendor-id = <0x17cd>;
> > > device-id = <0x0100>;
> > > reg-names = "reg", "cfg";
> > > reg = <0x20 0x40000000 0x0 0x10000000>,
> > > <0x20 0x00000000 0x0 0x00001000>; /* RC only */
> > >
> > > /*
> > > * type: 0x00000000 cfg space
> > > * type: 0x01000000 IO
> > > * type: 0x02000000 32bit mem space No prefetch
> > > * type: 0x03000000 64bit mem space No prefetch
> > > * type: 0x43000000 64bit mem space prefetch
> > > * The First 16MB from BUS_DEV_FUNC=0:0:0 for cfg space
> > > * <0x00000000 0x00 0x00000000 0x20 0x00000000 0x00
> 0x01000000>, CFG_SPACE
> > > */
> > > ranges = <0x01000000 0x00 0x00000000 0x20 0x00100000 0x00
> 0x00100000>,
> > > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00
> 0x08000000>;
> > >
> > > #interrupt-cells = <0x1>;
> > > interrupt-map-mask = <0x00 0x0 0x0 0x7>;
> > > interrupt-map = <0x0 0x0 0x0 0x1 &gic 0 229 0x4>,
> > > <0x0 0x0 0x0 0x2 &gic 0 230 0x4>,
> > > <0x0 0x0 0x0 0x3 &gic 0 231 0x4>,
> > > <0x0 0x0 0x0 0x4 &gic 0 232 0x4>;
> > > phys = <&pcie_phy>;
> > > phy-names="pcie-phy";
> > > status = "ok";
> > > };
> > >
> > >
> > > After some digging, I find if I change the controller's range
> > > property from
> > >
> > > <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x08000000> into
> > > <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>,
> > >
> > > then dd will success without timeout. IIUC, range here
> > > is only for non-prefetch 32bit mmio, but dd will use dma (maybe cpu
> > > will send cmd to nvme controller via mmio?).
>
> Generally speaking, an nvme driver notifies the controller of new
> commands via a MMIO write to a specific nvme register. The nvme
> controller fetches those commands from host memory with a DMA.
>
> One exception to that description is if the nvme controller supports CMB
> with SQEs, but they're not very common. If you had such a controller,
> the driver will use MMIO to write commands directly into controller
> memory instead of letting the controller DMA them from host memory. Do
> you know if you have such a controller?
>
> The data transfers associated with your 'dd' command will always use DMA.
>

My nvme is "05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980". Its datasheet, https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_Rev.1.1.pdf, says nothing about CMB/SQEs, so I'm not sure. Are there other ways/tools (like nvme-cli) to query this?
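
A hedged way I can think of to check, assuming my reading of the NVMe spec is right that CMBLOC/CMBSZ sit at offsets 0x38/0x3c of BAR0 and that CMBSZ reads back as 0 when there is no Controller Memory Buffer (not sure whether nvme-cli has a shortcut for this):

#include <linux/pci.h>
#include <linux/io.h>

/* Sketch only: CMBSZ == 0 should mean no CMB, hence no SQEs-in-CMB. */
static bool nvme_has_cmb_sketch(struct pci_dev *pdev)
{
        void __iomem *bar0 = pci_iomap(pdev, 0, 0x1000);
        u32 cmbsz;

        if (!bar0)
                return false;
        cmbsz = readl(bar0 + 0x3c);     /* NVME_REG_CMBSZ, per my reading */
        pci_iounmap(pdev, bar0);
        return cmbsz != 0;
}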

> > I don't know how to interpret "ranges". Can you supply the dmesg and
> > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> >
> > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> >
> > > Question:
> > > 1. Why dd can cause nvme timeout? Is there more debug ways?
>
> That means the nvme controller didn't provide a response to a posted
> command within the driver's latency tolerance.

FYI, with the help of the PCIe switch vendor, they found something interesting: "From the CATC log, I saw some memory read packets sent from the SSD card, but their address range is within the memory range of the switch downstream port. So the switch downstream port will reply with a UR (Unsupported Request) packet. It seems not normal." and "Why does the SSD card send out memory packets whose address is within the switch downstream port's memory range? If so, the switch will respond with UR packets." I also don't understand how this can happen.


>
> > > 2. How can this mmio range affect nvme timeout?
>
> Let's see how those ranges affect what the kernel sees in the pci
> topology, as Bjorn suggested.

OK, I will add the details in another mail replying to Bjorn.

>

Regards,
Li

**********************************************************************
This email and attachments contain Ambarella Proprietary and/or Confidential Information and is intended solely for the use of the individual(s) to whom it is addressed. Any unauthorized review, use, disclosure, distribute, copy, or print is prohibited. If you are not an intended recipient, please contact the sender by reply email and destroy all copies of the original message. Thank you.