2021-10-26 08:33:32

by Keith Busch

Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 980". From its datasheet, https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_Rev.1.1.pdf, it says nothing about CMB/SQEs, so I'm not sure. Is there other ways/tools(like nvme-cli) to query?

The driver will export a sysfs property for it if it is supported:

# cat /sys/class/nvme/nvme0/cmb

If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
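
The same existence check can be done from a program. The following is a minimal C sketch, assuming the sysfs path shown above; the helper name `sysfs_attr_exists` is hypothetical and is not part of the nvme driver or nvme-cli.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the given sysfs attribute file
 * exists, 0 otherwise. For the CMB case the path would be
 * /sys/class/nvme/nvme0/cmb, as in the command above.
 */
int sysfs_attr_exists(const char *path)
{
	assert(path != NULL);
	return access(path, F_OK) == 0;
}
```

A return value of 0 for the cmb path would correspond to the "file doesn't exist" case described above.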

> > > I don't know how to interpret "ranges". Can you supply the dmesg and
> > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > >
> > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > >
> > > > Question:
> > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> >
> > That means the nvme controller didn't provide a response to a posted
> > command within the driver's latency tolerance.
>
> FYI, with the help of pci bridger's vendor, they find something interesting: "From catc log, I saw some memory read pkts sent from SSD card, but its memory range is within the memory range of switch down port. So, switch down port will replay UR pkt. It seems not normal." and "Why SSD card send out some memory pkts which memory address is within switch down port's memory range. If so, switch will response UR pkts". I also don't understand how can this happen?

I think we can safely assume you're not attempting peer-to-peer, so that
behavior as described shouldn't be happening. It sounds like the memory
windows may be incorrect. The dmesg may help to show if something appears
wrong.


2021-10-29 10:57:58

by Li Chen

Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

> -----Original Message-----
> From: Keith Busch [mailto:[email protected]]
> Sent: Tuesday, October 26, 2021 12:16 PM
> To: Li Chen
> Cc: Bjorn Helgaas; [email protected]; Lorenzo Pieralisi; Rob Herring;
> [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
> On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> Ltd NVMe SSD Controller 980". From its datasheet,
> https://urldefense.com/v3/__https://s3.ap-northeast-
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> Is there other ways/tools(like nvme-cli) to query?
>
> The driver will export a sysfs property for it if it is supported:
>
> # cat /sys/class/nvme/nvme0/cmb
>
> If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
>
> > > > I don't know how to interpret "ranges". Can you supply the dmesg and
> > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > >
> > > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > >
> > > > > Question:
> > > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > >
> > > That means the nvme controller didn't provide a response to a posted
> > > command within the driver's latency tolerance.
> >
> > FYI, with the help of pci bridger's vendor, they find something interesting:
> "From catc log, I saw some memory read pkts sent from SSD card, but its memory
> range is within the memory range of switch down port. So, switch down port will
> replay UR pkt. It seems not normal." and "Why SSD card send out some memory
> pkts which memory address is within switch down port's memory range. If so,
> switch will response UR pkts". I also don't understand how can this happen?
>
> I think we can safely assume you're not attempting peer-to-peer, so that
> behavior as described shouldn't be happening. It sounds like the memory
> windows may be incorrect. The dmesg may help to show if something appears
> wrong.

Hi, Keith

Agreed, this doesn't involve peer-to-peer DMA. After confirming with the switch vendor today, the two URs (Unsupported Requests) occur because the nvme is trying to DMA-read DRAM at bus addresses 80d5000 and 80d5100. Those two bus addresses fall inside the switch downstream port's memory range, so the downstream port reports UR.

In our SoC, DMA/bus/PCI addresses and physical/AXI addresses are mapped 1:1, and DRAM occupies 000000.0000 - 0fffff.ffff (64G) in the physical address space, so bus addresses 80d5000 and 80d5100 correspond to CPU addresses 80d5000 and 80d5100, both of which are inside the DRAM space.

Neither our bootloader nor our ROM code enumerates or configures PCIe devices and switches, so switch configuration is left to the kernel.

Coming back to the subject of this thread, "nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges", I found:

1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
(which makes the nvme time out)

Switch(bridge of nvme)'s resource window:
Memory behind bridge: 08000000-080fffff [size=1M]

80d5000 and 80d5100 are both inside this range.

2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
(which does not make the nvme time out)

Switch(bridge of nvme)'s resource window:
Memory behind bridge: 00400000-004fffff [size=1M]

80d5000 and 80d5100 are not inside this range, so if the nvme tries to read 80d5000 and 80d5100, UR won't happen.
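
The comparison in cases 1 and 2 is a plain inclusive range check. A minimal C sketch of it follows; the function name `addr_in_window` is made up for illustration, and the addresses are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr lies in the inclusive range [start, end]. */
int addr_in_window(uint64_t addr, uint64_t start, uint64_t end)
{
	assert(start <= end);
	return addr >= start && addr <= end;
}
```

With the window from case 1 (08000000-080fffff), both 0x80d5000 and 0x80d5100 are reported as inside; with the window from case 2 (00400000-004fffff), neither is.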


From /proc/iomem:
# cat /proc/iomem
01200000-ffffffff : System RAM
01280000-022affff : Kernel code
022b0000-0295ffff : reserved
02960000-040cffff : Kernel data
05280000-0528ffff : reserved
41cc0000-422c0fff : reserved
422c1000-4232afff : reserved
4232d000-667bbfff : reserved
667bc000-667bcfff : reserved
667bd000-667c0fff : reserved
667c1000-ffffffff : reserved
2000000000-2000000fff : cfg

Nothing uses 0000000-1200000, so the window "Memory behind bridge: 00400000-004fffff [size=1M]" will never cause a problem (because 0x1200000 > 0x004fffff).
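
The argument above amounts to checking whether the bridge window overlaps the System RAM range reported by /proc/iomem. A small C sketch of that overlap test, with a hypothetical function name and the ranges taken from this message:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the inclusive ranges [a_start, a_end] and
 * [b_start, b_end] share at least one address. */
int ranges_overlap(uint64_t a_start, uint64_t a_end,
		   uint64_t b_start, uint64_t b_end)
{
	assert(a_start <= a_end && b_start <= b_end);
	return a_start <= b_end && b_start <= a_end;
}
```

For System RAM 01200000-ffffffff, the 00400000-004fffff window does not overlap, while the 08000000-080fffff window does.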


The above answers the question in the subject. One question is left: what is the right way to resolve this problem? Should I use the ranges property to configure the switch memory window indirectly (which is what I did), or something else?

I don't think changing the ranges property is the right way. If my PCIe topology becomes more complex, with more endpoints or switches, I may have to reserve more MMIO through the ranges property (please correct me if I'm wrong), and the end of the switch's memory window may then exceed 0x01200000. To avoid getting UR again, I would have to reserve more physical address space for it (for example, by moving the kernel start address from 0x01200000 to 0x02000000), which makes my visible DRAM smaller (I have verified this with "free -m"). That is not acceptable.


So, is there any better solution?

Regards,
Li

**********************************************************************
This email and attachments contain Ambarella Proprietary and/or Confidential Information and is intended solely for the use of the individual(s) to whom it is addressed. Any unauthorized review, use, disclosure, distribute, copy, or print is prohibited. If you are not an intended recipient, please contact the sender by reply email and destroy all copies of the original message. Thank you.

2021-10-29 19:44:55

by Bjorn Helgaas

Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > -----Original Message-----
> > From: Keith Busch [mailto:[email protected]]
> > Sent: Tuesday, October 26, 2021 12:16 PM
> > To: Li Chen
> > Cc: Bjorn Helgaas; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> > prefetch mmio outbound/ranges
> >
> > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics Co
> > Ltd NVMe SSD Controller 980". From its datasheet,
> > https://urldefense.com/v3/__https://s3.ap-northeast-
> > 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not sure.
> > Is there other ways/tools(like nvme-cli) to query?
> >
> > The driver will export a sysfs property for it if it is supported:
> >
> > # cat /sys/class/nvme/nvme0/cmb
> >
> > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> >
> > > > > I don't know how to interpret "ranges". Can you supply the dmesg and
> > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > >
> > > > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > >
> > > > > > Question:
> > > > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > > >
> > > > That means the nvme controller didn't provide a response to a posted
> > > > command within the driver's latency tolerance.
> > >
> > > FYI, with the help of pci bridger's vendor, they find something
> > > interesting:
> > "From catc log, I saw some memory read pkts sent from SSD card,
> > but its memory range is within the memory range of switch down
> > port. So, switch down port will replay UR pkt. It seems not
> > normal." and "Why SSD card send out some memory pkts which memory
> > address is within switch down port's memory range. If so, switch
> > will response UR pkts". I also don't understand how can this
> > happen?
> >
> > I think we can safely assume you're not attempting peer-to-peer,
> > so that behavior as described shouldn't be happening. It sounds
> > like the memory windows may be incorrect. The dmesg may help to
> > show if something appears wrong.
>
> Agree that here doesn't involve peer-to-peer DMA. After conforming
> from switch vendor today, the two ur(unsupported request) is because
> nvme is trying to dma read dram with bus address 80d5000 and
> 80d5100. But the two bus addresses are located in switch's down port
> range, so the switch down port report ur.
>
> In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> and DRAM space in physical memory address space is 000000.0000 -
> 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> are also 80d5000 and 80d5100, which both located inside dram space.
>
> Both our bootloader and romcode don't enum and configure pcie
> devices and switches, so the switch cfg stage should be left to
> kernel.
>
> Come back to the subject of this thread: " nvme may get timeout from
> dd when using different non-prefetch mmio outbound/ranges". I found:
>
> 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> (which will timeout nvme)
>
> Switch(bridge of nvme)'s resource window:
> Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
>
> 80d5000 and 80d5100 are both inside this range.

The PCI host bridge MMIO window is here:

pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus address [0x08000000-0x0bffffff])
pci 0000:01:00.0: PCI bridge to [bus 02-05]
pci 0000:01:00.0: bridge window [mem 0x2008000000-0x20080fffff]
pci 0000:02:06.0: PCI bridge to [bus 05]
pci 0000:02:06.0: bridge window [mem 0x2008000000-0x20080fffff]
pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]

So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
If the NVMe device is generating DMA transactions to 0x080d5000, which
is inside that range, those will be interpreted as peer-to-peer
transactions. But obviously that's not intended and there's no device
at 0x080d5000 anyway.
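
For reference, the host bridge window above follows from the quoted ranges entry <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>. The C sketch below decodes such an entry under the assumption that it consists of three PCI address cells, two CPU address cells, and two size cells; the struct and function names are illustrative only and are not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pci_range {
	uint32_t flags;     /* first cell: address space code and flag bits */
	uint64_t bus_start; /* PCI bus address */
	uint64_t cpu_start; /* CPU physical address */
	uint64_t size;      /* window size in bytes */
};

/*
 * Decode one 7-cell ranges entry. The 3 + 2 + 2 cell layout is an
 * assumption based on the entry quoted in this thread.
 */
struct pci_range decode_pci_range(const uint32_t cells[7])
{
	struct pci_range r;

	assert(cells != NULL);
	r.flags     = cells[0];
	r.bus_start = ((uint64_t)cells[1] << 32) | cells[2];
	r.cpu_start = ((uint64_t)cells[3] << 32) | cells[4];
	r.size      = ((uint64_t)cells[5] << 32) | cells[6];
	return r;
}
```

Decoding the entry gives bus address 0x08000000, CPU address 0x2008000000, and size 0x04000000, which matches the root bus resource line in the dmesg excerpt.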

My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
in how that's set up. Is there an IOMMU? There should be arch code
that knows what RAM is available for DMA buffers, maybe based on the
DT. I'm not really familiar with how all that would be arranged, but
the complete dmesg log and complete DT might have a clue. Can you
post those somewhere?

> 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> (which make nvme not timeout)
>
> Switch(bridge of nvme)'s resource window:
> Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
>
> 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> read 80d5000 and 80d5100 , ur won't happen.
>
> From /proc/iomen:
> # cat /proc/iomem
> 01200000-ffffffff : System RAM
> 01280000-022affff : Kernel code
> 022b0000-0295ffff : reserved
> 02960000-040cffff : Kernel data
> 05280000-0528ffff : reserved
> 41cc0000-422c0fff : reserved
> 422c1000-4232afff : reserved
> 4232d000-667bbfff : reserved
> 667bc000-667bcfff : reserved
> 667bd000-667c0fff : reserved
> 667c1000-ffffffff : reserved
> 2000000000-2000000fff : cfg
>
> No one uses 0000000-1200000, so " Memory behind bridge: Memory
> behind bridge: 00400000-004fffff [size=1M]" will never have any
> problem(because 0x1200000 > 0x004fffff).
>
> Above answers the question in Subject, one question left: what's the
> right way to resolve this problem? Use ranges property to configure
> switch memory window indirectly(just what I did)? Or something else?
>
> I don't think changing range property is the right way: If my PCIe
> topology becomes more complex and have more endpoints or switches,
> maybe I have to reserve more MMIO through range property(please
> correct me if I'm wrong), the end of switch's memory window may be
> larger than 0x01200000. In case getting ur again, I must reserve
> more physical memory address for them(like change kernel start
> address 0x01200000 to 0x02000000), which will make my visible dram
> smaller(I have verified it with "free -m"), it is not acceptable.

Right, I don't think changing the PCI ranges property is the right
answer. I think it's just a coincidence that moving the host bridge
MMIO aperture happens to move it out of the way of the DMA to
0x080d5000.

As far as I can tell, the PCI core and the nvme driver are doing the
right things here, and the problem is something behind the DMA API.

I think there should be something that removes the MMIO aperture bus
addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
pool of memory available for DMA buffers.

The MMIO aperture bus addresses in the non-timeout case,
0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
System RAM area, which would explain why a DMA buffer would never
overlap with it.
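
To make the idea of removing the aperture from the DMA-able memory concrete, here is a sketch of the interval arithmetic only. It is not how the kernel's DMA layer is implemented, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t start; /* inclusive */
	uint64_t end;   /* inclusive */
};

/*
 * Store the parts of ram that are not covered by hole in out[] and
 * return how many pieces were written (0, 1 or 2).
 */
int range_subtract(struct range ram, struct range hole, struct range out[2])
{
	int n = 0;

	assert(ram.start <= ram.end && hole.start <= hole.end);

	/* No overlap: the RAM range is returned unchanged. */
	if (hole.end < ram.start || hole.start > ram.end) {
		out[n++] = ram;
		return n;
	}
	/* Piece below the hole, if any. */
	if (hole.start > ram.start)
		out[n++] = (struct range){ ram.start, hole.start - 1 };
	/* Piece above the hole, if any. */
	if (hole.end < ram.end)
		out[n++] = (struct range){ hole.end + 1, ram.end };
	return n;
}
```

Applied to System RAM 0x01200000-0xffffffff and the aperture 0x08000000-0x0bffffff from the timeout case, this leaves 0x01200000-0x07ffffff and 0x0c000000-0xffffffff.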

Bjorn

2021-11-02 05:24:51

by Li Chen

Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges

Hi, Bjorn

> -----Original Message-----
> From: Bjorn Helgaas [mailto:[email protected]]
> Sent: Saturday, October 30, 2021 3:43 AM
> To: Li Chen
> Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: Re: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
> On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > -----Original Message-----
> > > From: Keith Busch [mailto:[email protected]]
> > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > To: Li Chen
> > > Cc: Bjorn Helgaas; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph;
> Jens
> > > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > > prefetch mmio outbound/ranges
> > >
> > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics
> Co
> > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm not
> sure.
> > > Is there other ways/tools(like nvme-cli) to query?
> > >
> > > The driver will export a sysfs property for it if it is supported:
> > >
> > > # cat /sys/class/nvme/nvme0/cmb
> > >
> > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > >
> > > > > > I don't know how to interpret "ranges". Can you supply the dmesg and
> > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > >
> > > > > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> window]
> > > > > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> window]
> > > > > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > >
> > > > > > > Question:
> > > > > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > > > >
> > > > > That means the nvme controller didn't provide a response to a posted
> > > > > command within the driver's latency tolerance.
> > > >
> > > > FYI, with the help of pci bridger's vendor, they find something
> > > > interesting:
> > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > but its memory range is within the memory range of switch down
> > > port. So, switch down port will replay UR pkt. It seems not
> > > normal." and "Why SSD card send out some memory pkts which memory
> > > address is within switch down port's memory range. If so, switch
> > > will response UR pkts". I also don't understand how can this
> > > happen?
> > >
> > > I think we can safely assume you're not attempting peer-to-peer,
> > > so that behavior as described shouldn't be happening. It sounds
> > > like the memory windows may be incorrect. The dmesg may help to
> > > show if something appears wrong.
> >
> > Agree that here doesn't involve peer-to-peer DMA. After conforming
> > from switch vendor today, the two ur(unsupported request) is because
> > nvme is trying to dma read dram with bus address 80d5000 and
> > 80d5100. But the two bus addresses are located in switch's down port
> > range, so the switch down port report ur.
> >
> > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > and DRAM space in physical memory address space is 000000.0000 -
> > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > are also 80d5000 and 80d5100, which both located inside dram space.
> >
> > Both our bootloader and romcode don't enum and configure pcie
> > devices and switches, so the switch cfg stage should be left to
> > kernel.
> >
> > Come back to the subject of this thread: " nvme may get timeout from
> > dd when using different non-prefetch mmio outbound/ranges". I found:
> >
> > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > (which will timeout nvme)
> >
> > Switch(bridge of nvme)'s resource window:
> > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> >
> > 80d5000 and 80d5100 are both inside this range.
>
> The PCI host bridge MMIO window is here:
>
> pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> address [0x08000000-0x0bffffff])
> pci 0000:01:00.0: PCI bridge to [bus 02-05]
> pci 0000:01:00.0: bridge window [mem 0x2008000000-0x20080fffff]
> pci 0000:02:06.0: PCI bridge to [bus 05]
> pci 0000:02:06.0: bridge window [mem 0x2008000000-0x20080fffff]
> pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
>
> So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> If the NVMe device is generating DMA transactions to 0x080d5000, which
> is inside that range, those will be interpreted as peer-to-peer
> transactions. But obviously that's not intended and there's no device
> at 0x080d5000 anyway.
>
> My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
> in how that's set up. Is there an IOMMU? There should be arch code
> that knows what RAM is available for DMA buffers, maybe based on the
> DT. I'm not really familiar with how all that would be arranged, but
> the complete dmesg log and complete DT might have a clue. Can you
> post those somewhere?

After adding some printk calls, I found that nvme_pci_setup_prps gets some DMA addresses inside the switch's memory window from the sg, but I don't know where the sg comes from (see the comments in the following source code):

static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
		struct request *req, struct nvme_rw_command *cmnd)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	struct dma_pool *pool;
	int length = blk_rq_payload_bytes(req);
	struct scatterlist *sg = iod->sg;
	int dma_len = sg_dma_len(sg);
	u64 dma_addr = sg_dma_address(sg);
	......
	for (;;) {
		if (i == NVME_CTRL_PAGE_SIZE >> 3) {
			__le64 *old_prp_list = prp_list;
			prp_list = dma_pool_alloc(pool, GFP_ATOMIC, &prp_dma);
			printk("lchen %s %d dma pool %llx", __func__, __LINE__, prp_dma);
			if (!prp_list)
				goto free_prps;
			list[iod->npages++] = prp_list;
			prp_list[0] = old_prp_list[i - 1];
			old_prp_list[i - 1] = cpu_to_le64(prp_dma);
			i = 1;
		}
		prp_list[i++] = cpu_to_le64(dma_addr);
		dma_len -= NVME_CTRL_PAGE_SIZE;
		dma_addr += NVME_CTRL_PAGE_SIZE;
		length -= NVME_CTRL_PAGE_SIZE;
		if (length <= 0)
			break;
		if (dma_len > 0)
			continue;
		if (unlikely(dma_len < 0))
			goto bad_sgl;
		sg = sg_next(sg);
		dma_addr = sg_dma_address(sg);
		dma_len = sg_dma_len(sg);

		// XXX: Here get the following output, the region is inside bridge's window 08000000-080fffff [size=1M]
		/*
		 * # dmesg | grep " 80" | grep -v " 80"
		 * [ 0.000476] Console: colour dummy device 80x25
		 * [ 79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
		 * [ 79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
		 * [ 111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
		 * [ 111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
		 */
		printk("lchen dma %s %d addr %llx, end addr %llx", __func__, __LINE__, dma_addr, dma_addr + dma_len);
	}
done:
	cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
	cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
	return BLK_STS_OK;
free_prps:
	nvme_free_prps(dev, req);
	return BLK_STS_RESOURCE;
bad_sgl:
	WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
			"Invalid SGL for payload:%d nents:%d\n",
			blk_rq_payload_bytes(req), iod->nents);
	return BLK_STS_IOERR;
}
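
As a cross-check of the logged segments against the bridge window, the sketch below walks a DMA segment in controller-page steps, the way the loop above advances dma_addr, and counts the steps that start inside a window. It is a standalone userspace illustration; `CTRL_PAGE_SIZE` is assumed to be 4096 to mirror NVME_CTRL_PAGE_SIZE, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CTRL_PAGE_SIZE 4096u /* assumed value of NVME_CTRL_PAGE_SIZE */

/*
 * Count how many controller-page-sized steps of the DMA segment
 * [dma_addr, dma_addr + dma_len) start inside the inclusive window
 * [win_start, win_end].
 */
unsigned int count_pages_in_window(uint64_t dma_addr, uint64_t dma_len,
				   uint64_t win_start, uint64_t win_end)
{
	unsigned int hits = 0;
	uint64_t end = dma_addr + dma_len;
	uint64_t a;

	assert(win_start <= win_end);
	for (a = dma_addr; a < end; a += CTRL_PAGE_SIZE)
		if (a >= win_start && a <= win_end)
			hits++;
	return hits;
}
```

For the logged segment 80ba000-80bc000 (length 0x2000), both pages start inside 08000000-080fffff; for 8088000-8090000 (length 0x8000), all eight do.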

Backtrace of this function:
# entries-in-buffer/entries-written: 1574/1574 #P:2
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
kworker/u4:0-7 [000] ...1 40.095494: nvme_queue_rq <-blk_mq_dispatch_rq_list
kworker/u4:0-7 [000] ...1 40.095503: <stack trace>
=> nvme_queue_rq
=> blk_mq_dispatch_rq_list
=> __blk_mq_do_dispatch_sched
=> __blk_mq_sched_dispatch_requests
=> blk_mq_sched_dispatch_requests
=> __blk_mq_run_hw_queue
=> __blk_mq_delay_run_hw_queue
=> blk_mq_run_hw_queue
=> blk_mq_sched_insert_requests
=> blk_mq_flush_plug_list
=> blk_flush_plug_list
=> blk_mq_submit_bio
=> __submit_bio_noacct_mq
=> submit_bio_noacct
=> submit_bio
=> submit_bh_wbc.constprop.0
=> __block_write_full_page
=> block_write_full_page
=> blkdev_writepage
=> __writepage
=> write_cache_pages
=> generic_writepages
=> blkdev_writepages
=> do_writepages
=> __writeback_single_inode
=> writeback_sb_inodes
=> __writeback_inodes_wb
=> wb_writeback
=> wb_do_writeback
=> wb_workfn
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork


We don't have an IOMMU; outbound DMA just uses a 1:1 mapping.


Here is the whole dmesg output (without my debug log): https://paste.debian.net/1217721/
Here is our dtsi: https://paste.debian.net/1217723/
>
> > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > (which make nvme not timeout)
> >
> > Switch(bridge of nvme)'s resource window:
> > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> >
> > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > read 80d5000 and 80d5100 , ur won't happen.
> >
> > From /proc/iomen:
> > # cat /proc/iomem
> > 01200000-ffffffff : System RAM
> > 01280000-022affff : Kernel code
> > 022b0000-0295ffff : reserved
> > 02960000-040cffff : Kernel data
> > 05280000-0528ffff : reserved
> > 41cc0000-422c0fff : reserved
> > 422c1000-4232afff : reserved
> > 4232d000-667bbfff : reserved
> > 667bc000-667bcfff : reserved
> > 667bd000-667c0fff : reserved
> > 667c1000-ffffffff : reserved
> > 2000000000-2000000fff : cfg
> >
> > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > problem(because 0x1200000 > 0x004fffff).
> >
> > Above answers the question in Subject, one question left: what's the
> > right way to resolve this problem? Use ranges property to configure
> > switch memory window indirectly(just what I did)? Or something else?
> >
> > I don't think changing range property is the right way: If my PCIe
> > topology becomes more complex and have more endpoints or switches,
> > maybe I have to reserve more MMIO through range property(please
> > correct me if I'm wrong), the end of switch's memory window may be
> > larger than 0x01200000. In case getting ur again, I must reserve
> > more physical memory address for them(like change kernel start
> > address 0x01200000 to 0x02000000), which will make my visible dram
> > smaller(I have verified it with "free -m"), it is not acceptable.
>
> Right, I don't think changing the PCI ranges property is the right
> answer. I think it's just a coincidence that moving the host bridge
> MMIO aperture happens to move it out of the way of the DMA to
> 0x080d5000.
>
> As far as I can tell, the PCI core and the nvme driver are doing the
> right things here, and the problem is something behind the DMA API.
>
> I think there should be something that removes the MMIO aperture bus
> addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> pool of memory available for DMA buffers.
>
> The MMIO aperture bus addresses in the non-timeout case,
> 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> System RAM area, which would explain why a DMA buffer would never
> overlap with it.
>
> Bjorn

Regards,
Li


2021-11-02 07:14:35

by Li Chen

Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges



> -----Original Message-----
> From: Li Chen
> Sent: Tuesday, November 2, 2021 1:18 PM
> To: Bjorn Helgaas
> Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
> Hi, Bjorn
>
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:[email protected]]
> > Sent: Saturday, October 30, 2021 3:43 AM
> > To: Li Chen
> > Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > prefetch mmio outbound/ranges
> >
> > On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > > -----Original Message-----
> > > > From: Keith Busch [mailto:[email protected]]
> > > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > > To: Li Chen
> > > > Cc: Bjorn Helgaas; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > > > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph;
> > Jens
> > > > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> > non-
> > > > prefetch mmio outbound/ranges
> > > >
> > > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung Electronics
> > Co
> > > > Ltd NVMe SSD Controller 980". From its datasheet,
> > > > https://urldefense.com/v3/__https://s3.ap-northeast-
> > > >
> >
> 2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_R
> > > >
> >
> ev.1.1.pdf__;!!PeEy7nZLVv0!3MU3LdTWuzON9JMUkq29zwJM4d7g7wKtkiZszTu-
> > > > PVepWchI_uLHpQGgdR_LEZM$ , it says nothing about CMB/SQEs, so I'm
> not
> > sure.
> > > > Is there other ways/tools(like nvme-cli) to query?
> > > >
> > > > The driver will export a sysfs property for it if it is supported:
> > > >
> > > > # cat /sys/class/nvme/nvme0/cmb
> > > >
> > > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > > >
> > > > > > > I don't know how to interpret "ranges". Can you supply the dmesg
> and
> > > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > > >
> > > > > > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff
> > window]
> > > > > > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff
> > window]
> > > > > > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > > >
> > > > > > > > Question:
> > > > > > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > > > > >
> > > > > > That means the nvme controller didn't provide a response to a posted
> > > > > > command within the driver's latency tolerance.
> > > > >
> > > > > FYI, with the help of pci bridger's vendor, they find something
> > > > > interesting:
> > > > "From catc log, I saw some memory read pkts sent from SSD card,
> > > > but its memory range is within the memory range of switch down
> > > > port. So, switch down port will replay UR pkt. It seems not
> > > > normal." and "Why SSD card send out some memory pkts which memory
> > > > address is within switch down port's memory range. If so, switch
> > > > will response UR pkts". I also don't understand how can this
> > > > happen?
> > > >
> > > > I think we can safely assume you're not attempting peer-to-peer,
> > > > so that behavior as described shouldn't be happening. It sounds
> > > > like the memory windows may be incorrect. The dmesg may help to
> > > > show if something appears wrong.
> > >
> > > Agreed, this doesn't involve peer-to-peer DMA. After confirming
> > > with the switch vendor today, the two URs (Unsupported Requests) occur
> > > because the nvme is trying to DMA-read DRAM at bus addresses 80d5000 and
> > > 80d5100. But the two bus addresses are located in the switch's down port
> > > range, so the switch down port reports UR.
> > >
> > > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > > and DRAM space in physical memory address space is 000000.0000 -
> > > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > > are also 80d5000 and 80d5100, which both located inside dram space.
> > >
> > > Both our bootloader and romcode don't enum and configure pcie
> > > devices and switches, so the switch cfg stage should be left to
> > > kernel.
> > >
> > > Come back to the subject of this thread: " nvme may get timeout from
> > > dd when using different non-prefetch mmio outbound/ranges". I found:
> > >
> > > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > > (which will timeout nvme)
> > >
> > > Switch(bridge of nvme)'s resource window:
> > > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> > >
> > > 80d5000 and 80d5100 are both inside this range.
> >
> > The PCI host bridge MMIO window is here:
> >
> > pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> > address [0x08000000-0x0bffffff])
> > pci 0000:01:00.0: PCI bridge to [bus 02-05]
> > pci 0000:01:00.0: bridge window [mem 0x2008000000-0x20080fffff]
> > pci 0000:02:06.0: PCI bridge to [bus 05]
> > pci 0000:02:06.0: bridge window [mem 0x2008000000-0x20080fffff]
> > pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> >
> > So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> > If the NVMe device is generating DMA transactions to 0x080d5000, which
> > is inside that range, those will be interpreted as peer-to-peer
> > transactions. But obviously that's not intended and there's no device
> > at 0x080d5000 anyway.
> >
> > My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> > dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
> > in how that's set up. Is there an IOMMU? There should be arch code
> > that knows what RAM is available for DMA buffers, maybe based on the
> > DT. I'm not really familiar with how all that would be arranged, but
> > the complete dmesg log and complete DT might have a clue. Can you
> > post those somewhere?
>
> After some printk, I found nvme_pci_setup_prps gets some dma addresses inside
> the switch's memory window from sg, but I don't know where the sg is from (see
> comments in the following source code):

I just noticed it should come from mempool_alloc in nvme_map_data:
static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
		struct nvme_command *cmnd)
{
	struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
	blk_status_t ret = BLK_STS_RESOURCE;
	int nr_mapped;

	if (blk_rq_nr_phys_segments(req) == 1) {
		struct bio_vec bv = req_bvec(req);

		if (!is_pci_p2pdma_page(bv.bv_page)) {
			if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
				return nvme_setup_prp_simple(dev, req,
							     &cmnd->rw, &bv);

			if (iod->nvmeq->qid &&
			    dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
				return nvme_setup_sgl_simple(dev, req,
							     &cmnd->rw, &bv);
		}
	}

	iod->dma_len = 0;
	iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);

	/* Debug only: tl/tr bound the timeout window, ntl/ntr the non-timeout window. */
	unsigned int l = sg_dma_address(iod->sg),
		     r = sg_dma_address(iod->sg) + sg_dma_len(iod->sg),
		     tl = 0x08000000, tr = tl + 0x04000000,
		     ntl = 0x00400000, ntr = ntl + 0x08000000;

/*
# dmesg | grep "region-timeout ? 1" | grep nvme_map_data
[ 16.002446] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 16.341240] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1a21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 16.405938] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 36.126917] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 36.703839] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 38.510086] lchen nvme_map_data, 895: first_dma 0, end_dma 89c621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.542394] lchen nvme_map_data, 895: first_dma 0, end_dma 89c421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.573405] lchen nvme_map_data, 895: first_dma 0, end_dma 87ee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.604419] lchen nvme_map_data, 895: first_dma 0, end_dma 87ec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.874395] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.905323] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa021c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.968675] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 40.999659] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.030601] lchen nvme_map_data, 895: first_dma 0, end_dma 8bde21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.061629] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.092598] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.123677] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.160960] lchen nvme_map_data, 895: first_dma 0, end_dma 8cee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.193609] lchen nvme_map_data, 895: first_dma 0, end_dma 8cec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.224607] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.255592] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
[ 41.286594] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
*/
	/* l and r are unsigned int, so print them with %x rather than %llx. */
	printk("lchen %s, %d: first_dma %x, end_dma %x, inside region-timeout ? %d, is inside region-non-timeout ? %d",
	       __func__, __LINE__, l, r,
	       (l >= tl && l <= tr) || (r >= tl && r <= tr),
	       (l >= ntl && l <= ntr) || (r >= ntl && r <= ntr));



//printk("lchen %s %d, addr starts %llx, ends %llx", __func__, __LINE__, sg_dma_address(iod->sg), sg_dma_address(iod->sg) + sg_dma_len(iod->sg));

................
}
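
As an aside, the window test in the debug printk above only asks whether one of the two endpoints falls inside the window. A minimal sketch of a full range-overlap test follows, assuming it is fed an address and length that have already been DMA-mapped. The names dma_range_hits_window() and dma_range_self_check() are hypothetical and not part of the nvme driver; the first two checks use values reported in this thread, the third uses a made-up address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): returns true when the half-open
 * range [start, start + len) shares at least one byte with the half-open
 * window [win_start, win_start + win_len). Assumes neither sum wraps.
 */
bool dma_range_hits_window(uint64_t start, uint64_t len,
			   uint64_t win_start, uint64_t win_len)
{
	if (len == 0 || win_len == 0)
		return false;
	return start < win_start + win_len && win_start < start + len;
}

/* Self-check using addresses reported in this thread. */
void dma_range_self_check(void)
{
	/* 80ba000-80bc000 lies inside the 1M bridge window at 08000000. */
	assert(dma_range_hits_window(0x080ba000, 0x2000, 0x08000000, 0x00100000));
	/* The same buffer misses the 1M bridge window at 00400000. */
	assert(!dma_range_hits_window(0x080ba000, 0x2000, 0x00400000, 0x00100000));
	/* A buffer that encloses the whole window also counts (made-up address). */
	assert(dma_range_hits_window(0x07000000, 0x02000000, 0x08000000, 0x00100000));
}
```

The third case is the one an endpoint-only comparison would miss, since neither end of the buffer lies inside the window.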

>
> static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> struct request *req, struct nvme_rw_command *cmnd)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> struct dma_pool *pool;
> int length = blk_rq_payload_bytes(req);
> struct scatterlist *sg = iod->sg;
> int dma_len = sg_dma_len(sg);
> u64 dma_addr = sg_dma_address(sg);
> ......
> for (;;) {
> if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> __le64 *old_prp_list = prp_list;
> prp_list = dma_pool_alloc(pool, GFP_ATOMIC,
> &prp_dma);
> printk("lchen %s %d dma pool %llx", __func__, __LINE__,
> prp_dma);
> if (!prp_list)
> goto free_prps;
> list[iod->npages++] = prp_list;
> prp_list[0] = old_prp_list[i - 1];
> old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> i = 1;
> }
> prp_list[i++] = cpu_to_le64(dma_addr);
> dma_len -= NVME_CTRL_PAGE_SIZE;
> dma_addr += NVME_CTRL_PAGE_SIZE;
> length -= NVME_CTRL_PAGE_SIZE;
> if (length <= 0)
> break;
> if (dma_len > 0)
> continue;
> if (unlikely(dma_len < 0))
> goto bad_sgl;
> sg = sg_next(sg);
> dma_addr = sg_dma_address(sg);
> dma_len = sg_dma_len(sg);
>
>
> // XXX: Here we get the following output; the region is inside the bridge's
> // window 08000000-080fffff [size=1M]
> /*
>
> # dmesg | grep " 80" | grep -v " 80"
> [ 0.000476] Console: colour dummy device 80x25
> [ 79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
> [ 79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
> [ 111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
> [ 111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
> */
> printk("lchen dma %s %d addr %llx, end addr %llx", __func__,
> __LINE__, dma_addr, dma_addr + dma_len);
> }
> done:
> cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> return BLK_STS_OK;
> free_prps:
> nvme_free_prps(dev, req);
> return BLK_STS_RESOURCE;
> bad_sgl:
> WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
> "Invalid SGL for payload:%d nents:%d\n",
> blk_rq_payload_bytes(req), iod->nents);
> return BLK_STS_IOERR;
> }
>
> Backtrace of this function:
> # entries-in-buffer/entries-written: 1574/1574 #P:2
> #
> # _-----=> irqs-off
> # / _----=> need-resched
> # | / _---=> hardirq/softirq
> # || / _--=> preempt-depth
> # ||| / delay
> # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> # | | | |||| | |
> kworker/u4:0-7 [000] ...1 40.095494: nvme_queue_rq <-blk_mq_dispatch_rq_list
> kworker/u4:0-7 [000] ...1 40.095503: <stack trace>
> => nvme_queue_rq
> => blk_mq_dispatch_rq_list
> => __blk_mq_do_dispatch_sched
> => __blk_mq_sched_dispatch_requests
> => blk_mq_sched_dispatch_requests
> => __blk_mq_run_hw_queue
> => __blk_mq_delay_run_hw_queue
> => blk_mq_run_hw_queue
> => blk_mq_sched_insert_requests
> => blk_mq_flush_plug_list
> => blk_flush_plug_list
> => blk_mq_submit_bio
> => __submit_bio_noacct_mq
> => submit_bio_noacct
> => submit_bio
> => submit_bh_wbc.constprop.0
> => __block_write_full_page
> => block_write_full_page
> => blkdev_writepage
> => __writepage
> => write_cache_pages
> => generic_writepages
> => blkdev_writepages
> => do_writepages
> => __writeback_single_inode
> => writeback_sb_inodes
> => __writeback_inodes_wb
> => wb_writeback
> => wb_do_writeback
> => wb_workfn
> => process_one_work
> => worker_thread
> => kthread
> => ret_from_fork
>
>
> We don't have IOMMU and just have 1:1 mapping dma outbound.
>
>
> Here is the whole dmesg output(without my debug log):
> https://paste.debian.net/1217721/
> Here is our dtsi: https://paste.debian.net/1217723/
> >
> > > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > > (which make nvme not timeout)
> > >
> > > Switch(bridge of nvme)'s resource window:
> > > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> > >
> > > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > > read 80d5000 and 80d5100 , ur won't happen.
> > >
> > > From /proc/iomem:
> > > # cat /proc/iomem
> > > 01200000-ffffffff : System RAM
> > > 01280000-022affff : Kernel code
> > > 022b0000-0295ffff : reserved
> > > 02960000-040cffff : Kernel data
> > > 05280000-0528ffff : reserved
> > > 41cc0000-422c0fff : reserved
> > > 422c1000-4232afff : reserved
> > > 4232d000-667bbfff : reserved
> > > 667bc000-667bcfff : reserved
> > > 667bd000-667c0fff : reserved
> > > 667c1000-ffffffff : reserved
> > > 2000000000-2000000fff : cfg
> > >
> > > No one uses 0000000-1200000, so " Memory behind bridge: Memory
> > > behind bridge: 00400000-004fffff [size=1M]" will never have any
> > > problem(because 0x1200000 > 0x004fffff).
> > >
> > > Above answers the question in Subject, one question left: what's the
> > > right way to resolve this problem? Use ranges property to configure
> > > switch memory window indirectly(just what I did)? Or something else?
> > >
> > > I don't think changing range property is the right way: If my PCIe
> > > topology becomes more complex and have more endpoints or switches,
> > > maybe I have to reserve more MMIO through range property(please
> > > correct me if I'm wrong), the end of switch's memory window may be
> > > larger than 0x01200000. In case getting ur again, I must reserve
> > > more physical memory address for them(like change kernel start
> > > address 0x01200000 to 0x02000000), which will make my visible dram
> > > smaller(I have verified it with "free -m"), it is not acceptable.
> >
> > Right, I don't think changing the PCI ranges property is the right
> > answer. I think it's just a coincidence that moving the host bridge
> > MMIO aperture happens to move it out of the way of the DMA to
> > 0x080d5000.
> >
> > As far as I can tell, the PCI core and the nvme driver are doing the
> > right things here, and the problem is something behind the DMA API.
> >
> > I think there should be something that removes the MMIO aperture bus
> > addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> > pool of memory available for DMA buffers.
> >
> > The MMIO aperture bus addresses in the non-timeout case,
> > 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> > System RAM area, which would explain why a DMA buffer would never
> > overlap with it.
> >
> > Bjorn
>
> Regards,
> Li

Regards,
Li

**********************************************************************
This email and attachments contain Ambarella Proprietary and/or Confidential Information and is intended solely for the use of the individual(s) to whom it is addressed. Any unauthorized review, use, disclosure, distribute, copy, or print is prohibited. If you are not an intended recipient, please contact the sender by reply email and destroy all copies of the original message. Thank you.

2021-11-03 10:06:31

by Li Chen

[permalink] [raw]
Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-prefetch mmio outbound/ranges



> -----Original Message-----
> From: Li Chen
> Sent: Tuesday, November 2, 2021 3:12 PM
> To: Bjorn Helgaas
> Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> Subject: RE: [EXT] Re: nvme may get timeout from dd when using different non-
> prefetch mmio outbound/ranges
>
>
>
> > -----Original Message-----
> > From: Li Chen
> > Sent: Tuesday, November 2, 2021 1:18 PM
> > To: Bjorn Helgaas
> > Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph; Jens
> > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > Subject: RE: [EXT] Re: nvme may get timeout from dd when using different
> non-
> > prefetch mmio outbound/ranges
> >
> > Hi, Bjorn
> >
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:[email protected]]
> > > Sent: Saturday, October 30, 2021 3:43 AM
> > > To: Li Chen
> > > Cc: Keith Busch; [email protected]; Lorenzo Pieralisi; Rob Herring;
> > > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph;
> Jens
> > > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using different
> > non-
> > > prefetch mmio outbound/ranges
> > >
> > > On Fri, Oct 29, 2021 at 10:52:37AM +0000, Li Chen wrote:
> > > > > -----Original Message-----
> > > > > From: Keith Busch [mailto:[email protected]]
> > > > > Sent: Tuesday, October 26, 2021 12:16 PM
> > > > > To: Li Chen
> > > > > Cc: Bjorn Helgaas; [email protected]; Lorenzo Pieralisi; Rob
> Herring;
> > > > > [email protected]; Bjorn Helgaas; [email protected]; Tom Joseph;
> > > Jens
> > > > > Axboe; Christoph Hellwig; Sagi Grimberg; [email protected]
> > > > > Subject: Re: [EXT] Re: nvme may get timeout from dd when using
> different
> > > non-
> > > > > prefetch mmio outbound/ranges
> > > > >
> > > > > On Tue, Oct 26, 2021 at 03:40:54AM +0000, Li Chen wrote:
> > > > > > > My nvme is " 05:00.0 Non-Volatile memory controller: Samsung
> > > > > > > Electronics Co Ltd NVMe SSD Controller 980". From its datasheet,
> > > > > > > https://s3.ap-northeast-2.amazonaws.com/global.semi.static/Samsung_NVMe_SSD_980_Data_Sheet_Rev.1.1.pdf,
> > > > > > > it says nothing about CMB/SQEs, so I'm not sure. Is there other
> > > > > > > ways/tools(like nvme-cli) to query?
> > > > >
> > > > > The driver will export a sysfs property for it if it is supported:
> > > > >
> > > > > # cat /sys/class/nvme/nvme0/cmb
> > > > >
> > > > > If the file doesn't exist, then /dev/nvme0 doesn't have the capability.
> > > > >
> > > > > > > > I don't know how to interpret "ranges". Can you supply the dmesg and
> > > > > > > > "lspci -vvs 0000:05:00.0" output both ways, e.g.,
> > > > > > > >
> > > > > > > > pci_bus 0000:00: root bus resource [mem 0x7f800000-0xefffffff window]
> > > > > > > > pci_bus 0000:00: root bus resource [mem 0xfd000000-0xfe7fffff window]
> > > > > > > > pci 0000:05:00.0: [vvvv:dddd] type 00 class 0x...
> > > > > > > > pci 0000:05:00.0: reg 0x10: [mem 0x.....000-0x.....fff ...]
> > > > > > > >
> > > > > > > > > Question:
> > > > > > > > > 1. Why dd can cause nvme timeout? Is there more debug ways?
> > > > > > >
> > > > > > > That means the nvme controller didn't provide a response to a posted
> > > > > > > command within the driver's latency tolerance.
> > > > > >
> > > > > > > FYI, with the help of pci bridger's vendor, they find something
> > > > > > > interesting: "From catc log, I saw some memory read pkts sent from
> > > > > > > SSD card, but its memory range is within the memory range of switch
> > > > > > > down port. So, switch down port will replay UR pkt. It seems not
> > > > > > > normal." and "Why SSD card send out some memory pkts which memory
> > > > > > > address is within switch down port's memory range. If so, switch
> > > > > > > will response UR pkts". I also don't understand how can this
> > > > > > > happen?
> > > > >
> > > > > I think we can safely assume you're not attempting peer-to-peer,
> > > > > so that behavior as described shouldn't be happening. It sounds
> > > > > like the memory windows may be incorrect. The dmesg may help to
> > > > > show if something appears wrong.
> > > >
> > > > Agreed, this doesn't involve peer-to-peer DMA. After confirming
> > > > with the switch vendor today, the two URs (Unsupported Requests) occur
> > > > because the nvme is trying to DMA-read DRAM at bus addresses 80d5000 and
> > > > 80d5100. But the two bus addresses are located in the switch's down port
> > > > range, so the switch down port reports UR.
> > > >
> > > > In our soc, dma/bus/pci address and physical/AXI address are 1:1,
> > > > and DRAM space in physical memory address space is 000000.0000 -
> > > > 0fffff.ffff 64G, so bus address 80d5000 and 80d5100 to cpu address
> > > > are also 80d5000 and 80d5100, which both located inside dram space.
> > > >
> > > > Both our bootloader and romcode don't enum and configure pcie
> > > > devices and switches, so the switch cfg stage should be left to
> > > > kernel.
> > > >
> > > > Come back to the subject of this thread: " nvme may get timeout from
> > > > dd when using different non-prefetch mmio outbound/ranges". I found:
> > > >
> > > > 1. For <0x02000000 0x00 0x08000000 0x20 0x08000000 0x00 0x04000000>;
> > > > (which will timeout nvme)
> > > >
> > > > Switch(bridge of nvme)'s resource window:
> > > > Memory behind bridge: Memory behind bridge: 08000000-080fffff [size=1M]
> > > >
> > > > 80d5000 and 80d5100 are both inside this range.
> > >
> > > The PCI host bridge MMIO window is here:
> > >
> > > pci_bus 0000:00: root bus resource [mem 0x2008000000-0x200bffffff] (bus
> > > address [0x08000000-0x0bffffff])
> > > pci 0000:01:00.0: PCI bridge to [bus 02-05]
> > > pci 0000:01:00.0: bridge window [mem 0x2008000000-0x20080fffff]
> > > pci 0000:02:06.0: PCI bridge to [bus 05]
> > > pci 0000:02:06.0: bridge window [mem 0x2008000000-0x20080fffff]
> > > pci 0000:05:00.0: BAR 0: assigned [mem 0x2008000000-0x2008003fff 64bit]
> > >
> > > So bus address [0x08000000-0x0bffffff] is the area used for PCI BARs.
> > > If the NVMe device is generating DMA transactions to 0x080d5000, which
> > > is inside that range, those will be interpreted as peer-to-peer
> > > transactions. But obviously that's not intended and there's no device
> > > at 0x080d5000 anyway.
> > >
> > > My guess is the nvme driver got 0x080d5000 from the DMA API, e.g.,
> > > dma_map_bvec() or dma_map_sg_attrs(), so maybe there's something wrong
> > > in how that's set up. Is there an IOMMU? There should be arch code
> > > that knows what RAM is available for DMA buffers, maybe based on the
> > > DT. I'm not really familiar with how all that would be arranged, but
> > > the complete dmesg log and complete DT might have a clue. Can you
> > > post those somewhere?
> >
> > After some printk, I found nvme_pci_setup_prps gets some dma addresses
> > inside the switch's memory window from sg, but I don't know where the sg is
> > from (see comments in the following source code):
>
> I just noticed it should come from mempool_alloc in nvme_map_data:
> static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> struct nvme_command *cmnd)
> {
> struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> blk_status_t ret = BLK_STS_RESOURCE;
> int nr_mapped;
>
> if (blk_rq_nr_phys_segments(req) == 1) {
> struct bio_vec bv = req_bvec(req);
>
> if (!is_pci_p2pdma_page(bv.bv_page)) {
> if (bv.bv_offset + bv.bv_len <= NVME_CTRL_PAGE_SIZE * 2)
> return nvme_setup_prp_simple(dev, req,
> &cmnd->rw, &bv);
>
> if (iod->nvmeq->qid &&
> dev->ctrl.sgls & ((1 << 0) | (1 << 1)))
> return nvme_setup_sgl_simple(dev, req,
> &cmnd->rw, &bv);
> }
> }
>
> iod->dma_len = 0;
> iod->sg = mempool_alloc(dev->iod_mempool, GFP_ATOMIC);
>
>
> unsigned int l = sg_dma_address(iod->sg),
> r = sg_dma_address(iod->sg) + sg_dma_len(iod->sg),
> tl = 0x08000000, tr = tl + 0x04000000,
> ntl = 0x00400000, ntr = ntl + 0x08000000;
>
> /*
> # dmesg | grep "region-timeout ? 1" | grep nvme_map_data
> [ 16.002446] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 16.341240] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1a21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 16.405938] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 36.126917] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 36.703839] lchen nvme_map_data, 895: first_dma 0, end_dma 8f2221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 38.510086] lchen nvme_map_data, 895: first_dma 0, end_dma 89c621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.542394] lchen nvme_map_data, 895: first_dma 0, end_dma 89c421c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.573405] lchen nvme_map_data, 895: first_dma 0, end_dma 87ee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.604419] lchen nvme_map_data, 895: first_dma 0, end_dma 87ec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.874395] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.905323] lchen nvme_map_data, 895: first_dma 0, end_dma 8aa021c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.968675] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 40.999659] lchen nvme_map_data, 895: first_dma 0, end_dma 8a0821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.030601] lchen nvme_map_data, 895: first_dma 0, end_dma 8bde21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.061629] lchen nvme_map_data, 895: first_dma 0, end_dma 8b1e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.092598] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb621c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.123677] lchen nvme_map_data, 895: first_dma 0, end_dma 8eb221c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.160960] lchen nvme_map_data, 895: first_dma 0, end_dma 8cee21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.193609] lchen nvme_map_data, 895: first_dma 0, end_dma 8cec21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.224607] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7e21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.255592] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7c21c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> [ 41.286594] lchen nvme_map_data, 895: first_dma 0, end_dma 8c7821c, inside region-timeout ? 1, is inside region-non-timeout ? 0
> */
> /* l and r are unsigned int, so print them with %x rather than %llx. */
> printk("lchen %s, %d: first_dma %x, end_dma %x, inside region-timeout ? %d, is inside region-non-timeout ? %d",
> __func__, __LINE__, l, r,
> (l >= tl && l <= tr) || (r >= tl && r <= tr),
> (l >= ntl && l <= ntr) || (r >= ntl && r <= ntr));
>
>
>
> //printk("lchen %s %d, addr starts %llx, ends %llx", __func__, __LINE__, sg_dma_address(iod->sg), sg_dma_address(iod->sg) + sg_dma_len(iod->sg));
>
> ................
> }
>
> >
> > static blk_status_t nvme_pci_setup_prps(struct nvme_dev *dev,
> > struct request *req, struct nvme_rw_command *cmnd)
> > {
> > struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > struct dma_pool *pool;
> > int length = blk_rq_payload_bytes(req);
> > struct scatterlist *sg = iod->sg;
> > int dma_len = sg_dma_len(sg);
> > u64 dma_addr = sg_dma_address(sg);
> > ......
> > for (;;) {
> > if (i == NVME_CTRL_PAGE_SIZE >> 3) {
> > __le64 *old_prp_list = prp_list;
> > prp_list = dma_pool_alloc(pool, GFP_ATOMIC,
> > &prp_dma);
> > printk("lchen %s %d dma pool %llx", __func__, __LINE__,
> > prp_dma);
> > if (!prp_list)
> > goto free_prps;
> > list[iod->npages++] = prp_list;
> > prp_list[0] = old_prp_list[i - 1];
> > old_prp_list[i - 1] = cpu_to_le64(prp_dma);
> > i = 1;
> > }
> > prp_list[i++] = cpu_to_le64(dma_addr);
> > dma_len -= NVME_CTRL_PAGE_SIZE;
> > dma_addr += NVME_CTRL_PAGE_SIZE;
> > length -= NVME_CTRL_PAGE_SIZE;
> > if (length <= 0)
> > break;
> > if (dma_len > 0)
> > continue;
> > if (unlikely(dma_len < 0))
> > goto bad_sgl;
> > sg = sg_next(sg);
> > dma_addr = sg_dma_address(sg);
> > dma_len = sg_dma_len(sg);
> >
> >
> > // XXX: Here we get the following output; the region is inside the bridge's
> > // window 08000000-080fffff [size=1M]
> > /*
> >
> > # dmesg | grep " 80" | grep -v " 80"
> > [ 0.000476] Console: colour dummy device 80x25
> > [ 79.331766] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
> > [ 79.815469] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
> > [ 111.562129] lchen dma nvme_pci_setup_prps 708 addr 8088000, end addr 8090000
> > [ 111.873690] lchen dma nvme_pci_setup_prps 708 addr 80ba000, end addr 80bc000
> > */
> > printk("lchen dma %s %d addr %llx, end addr %llx", __func__,
> > __LINE__, dma_addr, dma_addr + dma_len);
> > }
> > done:
> > cmnd->dptr.prp1 = cpu_to_le64(sg_dma_address(iod->sg));
> > cmnd->dptr.prp2 = cpu_to_le64(iod->first_dma);
> > return BLK_STS_OK;
> > free_prps:
> > nvme_free_prps(dev, req);
> > return BLK_STS_RESOURCE;
> > bad_sgl:
> > WARN(DO_ONCE(nvme_print_sgl, iod->sg, iod->nents),
> > "Invalid SGL for payload:%d nents:%d\n",
> > blk_rq_payload_bytes(req), iod->nents);
> > return BLK_STS_IOERR;
> > }
> >
> > Backtrace of this function:
> > # entries-in-buffer/entries-written: 1574/1574 #P:2
> > #
> > # _-----=> irqs-off
> > # / _----=> need-resched
> > # | / _---=> hardirq/softirq
> > # || / _--=> preempt-depth
> > # ||| / delay
> > # TASK-PID CPU# |||| TIMESTAMP FUNCTION
> > # | | | |||| | |
> > kworker/u4:0-7 [000] ...1 40.095494: nvme_queue_rq <-blk_mq_dispatch_rq_list
> > kworker/u4:0-7 [000] ...1 40.095503: <stack trace>
> > => nvme_queue_rq
> > => blk_mq_dispatch_rq_list
> > => __blk_mq_do_dispatch_sched
> > => __blk_mq_sched_dispatch_requests
> > => blk_mq_sched_dispatch_requests
> > => __blk_mq_run_hw_queue
> > => __blk_mq_delay_run_hw_queue
> > => blk_mq_run_hw_queue
> > => blk_mq_sched_insert_requests
> > => blk_mq_flush_plug_list
> > => blk_flush_plug_list
> > => blk_mq_submit_bio
> > => __submit_bio_noacct_mq
> > => submit_bio_noacct
> > => submit_bio
> > => submit_bh_wbc.constprop.0
> > => __block_write_full_page
> > => block_write_full_page
> > => blkdev_writepage
> > => __writepage
> > => write_cache_pages
> > => generic_writepages
> > => blkdev_writepages
> > => do_writepages
> > => __writeback_single_inode
> > => writeback_sb_inodes
> > => __writeback_inodes_wb
> > => wb_writeback
> > => wb_do_writeback
> > => wb_workfn
> > => process_one_work
> > => worker_thread
> > => kthread
> > => ret_from_fork
> >
> >
> > We don't have IOMMU and just have 1:1 mapping dma outbound.
> >
> >
> > Here is the whole dmesg output(without my debug log):
> > https://paste.debian.net/1217721/


From dmesg:


[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x0000000001200000-0x00000000ffffffff]
[ 0.000000] DMA32 empty
[ 0.000000] Normal empty
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x0000000001200000-0x00000000ffffffff]
[ 0.000000] Initmem setup node 0 [mem 0x0000000001200000-0x00000000ffffffff]
[ 0.000000] On node 0 totalpages: 1043968
[ 0.000000] DMA zone: 16312 pages used for memmap
[ 0.000000] DMA zone: 0 pages reserved
[ 0.000000] DMA zone: 1043968 pages, LIFO batch:63

I only have one zone, ZONE_DMA, and it includes all physical memory:

# cat /proc/iomem
01200000-ffffffff : System RAM
01280000-01c3ffff : Kernel code
01c40000-01ebffff : reserved
01ec0000-0201ffff : Kernel data
05280000-0528ffff : reserved
62000000-65ffffff : reserved
66186000-66786fff : reserved
66787000-667c0fff : reserved
667c3000-667c3fff : reserved
667c4000-667c5fff : reserved
667c6000-667c8fff : reserved
667c9000-ffffffff : reserved
2000000000-2000000fff : cfg
2008000000-200bffffff : pcie-controller@2040000000
2008000000-20081fffff : PCI Bus 0000:01
2008000000-20080fffff : PCI Bus 0000:02
2008000000-20080fffff : PCI Bus 0000:05
2008000000-2008003fff : 0000:05:00.0
2008000000-2008003fff : nvme
2008100000-200817ffff : 0000:01:00.0



From the nvme code:

dev->iod_mempool = mempool_create_node(1, mempool_kmalloc,
					mempool_kfree,
					(void *) alloc_size,
					GFP_KERNEL, node);

So the mempool here just uses kmalloc with GFP_KERNEL to get physical memory.

Does this mean I should rearrange ZONE_DMA/ZONE_DMA32/ZONE_NORMAL (e.g., limit ZONE_DMA to a memory region that doesn't include the PCIe MMIO window) to fix this problem?
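
To put numbers on the overlap, here is a small sketch, assuming 4 KiB pages and the 1:1 bus/physical mapping described earlier in the thread. It compares the System RAM range from /proc/iomem with the host bridge bus window from the timeout case. The struct and function names are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Inclusive address range, written the way dmesg and /proc/iomem print it. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* Number of whole pages covered by an inclusive range. */
uint64_t range_pages(struct addr_range r)
{
	return (r.end - r.start + 1) >> SKETCH_PAGE_SHIFT;
}

/* Number of bytes shared by two inclusive ranges, 0 if they are disjoint. */
uint64_t range_overlap_bytes(struct addr_range a, struct addr_range b)
{
	uint64_t lo = a.start > b.start ? a.start : b.start;
	uint64_t hi = a.end < b.end ? a.end : b.end;

	return lo <= hi ? hi - lo + 1 : 0;
}

/* Checks the figures reported in this thread for the timeout case. */
void zone_overlap_self_check(void)
{
	struct addr_range ram = { 0x01200000ULL, 0xffffffffULL };	/* System RAM */
	struct addr_range mmio_bus = { 0x08000000ULL, 0x0bffffffULL };	/* bus window */

	assert(range_pages(ram) == 1043968);
	assert(range_overlap_bytes(ram, mmio_bus) == 0x04000000ULL);
}
```

With these inputs the System RAM range works out to 1043968 pages, which matches totalpages in the dmesg above, and the entire 64 MiB bus window (16384 pages) falls inside System RAM.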



> > Here is our dtsi: https://paste.debian.net/1217723/
> > >
> > > > 2. For <0x02000000 0x00 0x00400000 0x20 0x00400000 0x00 0x08000000>;
> > > > (which make nvme not timeout)
> > > >
> > > > Switch(bridge of nvme)'s resource window:
> > > > Memory behind bridge: Memory behind bridge: 00400000-004fffff [size=1M]
> > > >
> > > > 80d5000 and 80d5100 are not inside this range, so if nvme tries to
> > > > read 80d5000 and 80d5100 , ur won't happen.
> > > >
> > > > From /proc/iomem:
> > > > # cat /proc/iomem
> > > > 01200000-ffffffff : System RAM
> > > > 01280000-022affff : Kernel code
> > > > 022b0000-0295ffff : reserved
> > > > 02960000-040cffff : Kernel data
> > > > 05280000-0528ffff : reserved
> > > > 41cc0000-422c0fff : reserved
> > > > 422c1000-4232afff : reserved
> > > > 4232d000-667bbfff : reserved
> > > > 667bc000-667bcfff : reserved
> > > > 667bd000-667c0fff : reserved
> > > > 667c1000-ffffffff : reserved
> > > > 2000000000-2000000fff : cfg
> > > >
> > > > No one uses 0x00000000-0x01200000, so "Memory behind bridge:
> > > > 00400000-004fffff [size=1M]" will never have any problem (because
> > > > 0x01200000 > 0x004fffff).
> > > >
> > > > The above answers the question in the Subject; one question is left:
> > > > what's the right way to resolve this problem? Use the ranges property
> > > > to configure the switch memory window indirectly (just what I did)?
> > > > Or something else?
> > > >
> > > > I don't think changing the ranges property is the right way: if my
> > > > PCIe topology becomes more complex and has more endpoints or
> > > > switches, I may have to reserve more MMIO through the ranges property
> > > > (please correct me if I'm wrong), and the end of the switch's memory
> > > > window may be larger than 0x01200000. To avoid getting UR again, I
> > > > must reserve more physical address space for them (e.g., change the
> > > > kernel start address from 0x01200000 to 0x02000000), which makes my
> > > > visible DRAM smaller (I have verified it with "free -m"); that is not
> > > > acceptable.
> > >
> > > Right, I don't think changing the PCI ranges property is the right
> > > answer. I think it's just a coincidence that moving the host bridge
> > > MMIO aperture happens to move it out of the way of the DMA to
> > > 0x080d5000.
> > >
> > > As far as I can tell, the PCI core and the nvme driver are doing the
> > > right things here, and the problem is something behind the DMA API.
> > >
> > > I think there should be something that removes the MMIO aperture bus
> > > addresses, i.e., 0x08000000-0x0bffffff in the timeout case, from the
> > > pool of memory available for DMA buffers.
> > >
> > > The MMIO aperture bus addresses in the non-timeout case,
> > > 0x00400000-0x083fffff, are not included in the 0x01200000-0xffffffff
> > > System RAM area, which would explain why a DMA buffer would never
> > > overlap with it.
> > >
> > > Bjorn

Regards,
Li
