From: Alan Mikhak <[email protected]>
Modify sg_miter_stop() to validate the page pointer
before calling PageSlab(). This check prevents a crash
that occurs when PageSlab() is called with a page
pointer that is not backed by a struct page.
A virtual address obtained from ioremap() for a physical
address in PCI address space can be assigned to a
scatterlist segment using the public scatterlist API
as in the following example:
static void my_sg_set_page(struct scatterlist *sg,
			   const void __iomem *ioaddr,
			   size_t iosize)
{
	sg_set_page(sg,
		    virt_to_page(ioaddr),
		    (unsigned int)iosize,
		    offset_in_page(ioaddr));
	sg_init_marker(sg, 1);
}
If the virtual address obtained from ioremap() is not
backed by a struct page, virt_to_page() returns an
invalid page pointer. However, sg_copy_buffer() still
recovers the original virtual address correctly. Such
addresses could therefore be assigned to scatterlist
segments and used to transfer data across the PCI bus
with sg_copy_buffer(), were it not for the crash in
PageSlab() when it is called by sg_miter_stop().
Signed-off-by: Alan Mikhak <[email protected]>
---
lib/scatterlist.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index c2cf2c311b7d..f5c61cad40ba 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -807,6 +807,7 @@ void sg_miter_stop(struct sg_mapping_iter *miter)
 		miter->__remaining -= miter->consumed;
 		if ((miter->__flags & SG_MITER_TO_SG) &&
+		    pfn_valid(page_to_pfn(miter->page)) &&
 		    !PageSlab(miter->page))
 			flush_kernel_dcache_page(miter->page);
--
2.7.4
On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> From: Alan Mikhak <[email protected]>
>
> Modify sg_miter_stop() to validate the page pointer
> before calling PageSlab(). This check prevents a crash
> that will occur if PageSlab() gets called with a page
> pointer that is not backed by page struct.
>
> A virtual address obtained from ioremap() for a physical
> address in PCI address space can be assigned to a
> scatterlist segment using the public scatterlist API
> as in the following example:
>
> my_sg_set_page(struct scatterlist *sg,
> const void __iomem *ioaddr,
> size_t iosize)
> {
> sg_set_page(sg,
> virt_to_page(ioaddr),
> (unsigned int)iosize,
> offset_in_page(ioaddr));
> sg_init_marker(sg, 1);
> }
>
> If the virtual address obtained from ioremap() is not
> backed by a page struct, virt_to_page() returns an
> invalid page pointer. However, sg_copy_buffer() can
> correctly recover the original virtual address. Such
> addresses can successfully be assigned to scatterlist
> segments to transfer data across the PCI bus with
> sg_copy_buffer() if it were not for the crash in
> PageSlab() when called by sg_miter_stop().
I thought we already agreed in general that putting things that don't
have struct page into the scatter list was not allowed?
Jason
On Tue, Oct 1, 2019 at 5:16 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > From: Alan Mikhak <[email protected]>
> >
> > Modify sg_miter_stop() to validate the page pointer
> > before calling PageSlab(). This check prevents a crash
> > that will occur if PageSlab() gets called with a page
> > pointer that is not backed by page struct.
> >
> > A virtual address obtained from ioremap() for a physical
> > address in PCI address space can be assigned to a
> > scatterlist segment using the public scatterlist API
> > as in the following example:
> >
> > my_sg_set_page(struct scatterlist *sg,
> > const void __iomem *ioaddr,
> > size_t iosize)
> > {
> > sg_set_page(sg,
> > virt_to_page(ioaddr),
> > (unsigned int)iosize,
> > offset_in_page(ioaddr));
> > sg_init_marker(sg, 1);
> > }
> >
> > If the virtual address obtained from ioremap() is not
> > backed by a page struct, virt_to_page() returns an
> > invalid page pointer. However, sg_copy_buffer() can
> > correctly recover the original virtual address. Such
> > addresses can successfully be assigned to scatterlist
> > segments to transfer data across the PCI bus with
> > sg_copy_buffer() if it were not for the crash in
> > PageSlab() when called by sg_miter_stop().
>
> I thought we already agreed in general that putting things that don't
> have struct page into the scatter list was not allowed?
>
> Jason
Thanks, Jason, for your comment.
The cost of adding page structs to a large PCI I/O address range can be
quite substantial, so allowing PCI I/O pages without page structs may be
desirable. Perhaps that cost is worth considering here.
The scatterlist code has no problem doing its memcpy() from pages
obtained via ioremap() that have no page struct behind them. It is only
at sg_miter_stop() that the call to PageSlab() prevents such use by
crashing, which seems accidental rather than by design. The scatterlist
API function sg_set_page() lets any page pointer be assigned, even one
with no page struct: it does not check that the page pointer is valid,
and its description does not say that a struct page is required.
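For reference, sg_set_page() in include/linux/scatterlist.h is essentially
just an assignment with no validation (quoted from memory, so the exact
body may differ slightly between kernel versions):
static inline void sg_set_page(struct scatterlist *sg, struct page *page,
			       unsigned int len, unsigned int offset)
{
	sg_assign_page(sg, page);
	sg->offset = offset;
	sg->length = len;
}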
Regards,
Alan
On Tue, Oct 01, 2019 at 10:09:48AM -0700, Alan Mikhak wrote:
> On Tue, Oct 1, 2019 at 5:16 AM Jason Gunthorpe <[email protected]> wrote:
> >
> > On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > > From: Alan Mikhak <[email protected]>
> > >
> > > Modify sg_miter_stop() to validate the page pointer
> > > before calling PageSlab(). This check prevents a crash
> > > that will occur if PageSlab() gets called with a page
> > > pointer that is not backed by page struct.
> > >
> > > A virtual address obtained from ioremap() for a physical
> > > address in PCI address space can be assigned to a
> > > scatterlist segment using the public scatterlist API
> > > as in the following example:
> > >
> > > my_sg_set_page(struct scatterlist *sg,
> > > const void __iomem *ioaddr,
> > > size_t iosize)
> > > {
> > > sg_set_page(sg,
> > > virt_to_page(ioaddr),
> > > (unsigned int)iosize,
> > > offset_in_page(ioaddr));
> > > sg_init_marker(sg, 1);
> > > }
> > >
> > > If the virtual address obtained from ioremap() is not
> > > backed by a page struct, virt_to_page() returns an
> > > invalid page pointer. However, sg_copy_buffer() can
> > > correctly recover the original virtual address. Such
> > > addresses can successfully be assigned to scatterlist
> > > segments to transfer data across the PCI bus with
> > > sg_copy_buffer() if it were not for the crash in
> > > PageSlab() when called by sg_miter_stop().
> >
> > I thought we already agreed in general that putting things that don't
> > have struct page into the scatter list was not allowed?
> >
> > Jason
>
> Thanks Jason for your comment.
>
> Cost of adding page structs to a large PCI I/O address range can be
> quite substantial. Allowing PCI I/O pages without page structs may be
> desirable. Perhaps it is worth considering this cost.
This is generally agreed, but nobody has figured out a solution.
> Scatterlist has no problem doing its memcpy() from pages without a
> page struct that were obtained from ioremap(). It is only at
Calling memcpy on pointers from ioremap is very much not allowed. Code
has to use the iomem safe memcpy.
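For example, a minimal sketch of the __iomem-safe helpers,
memcpy_fromio()/memcpy_toio(); copy_from_bar()/copy_to_bar() are
illustrative wrapper names, not existing APIs:
#include <linux/io.h>

/* read from an __iomem mapping without using plain memcpy() */
static void copy_from_bar(void *dst, const void __iomem *src, size_t len)
{
	memcpy_fromio(dst, src, len);
}

/* and the write direction */
static void copy_to_bar(void __iomem *dst, const void *src, size_t len)
{
	memcpy_toio(dst, src, len);
}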
Jason
On Tue, Oct 1, 2019 at 10:12 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Oct 01, 2019 at 10:09:48AM -0700, Alan Mikhak wrote:
> > On Tue, Oct 1, 2019 at 5:16 AM Jason Gunthorpe <[email protected]> wrote:
> > >
> > > On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > > > From: Alan Mikhak <[email protected]>
> > > >
> > > > Modify sg_miter_stop() to validate the page pointer
> > > > before calling PageSlab(). This check prevents a crash
> > > > that will occur if PageSlab() gets called with a page
> > > > pointer that is not backed by page struct.
> > > >
> > > > A virtual address obtained from ioremap() for a physical
> > > > address in PCI address space can be assigned to a
> > > > scatterlist segment using the public scatterlist API
> > > > as in the following example:
> > > >
> > > > my_sg_set_page(struct scatterlist *sg,
> > > > const void __iomem *ioaddr,
> > > > size_t iosize)
> > > > {
> > > > sg_set_page(sg,
> > > > virt_to_page(ioaddr),
> > > > (unsigned int)iosize,
> > > > offset_in_page(ioaddr));
> > > > sg_init_marker(sg, 1);
> > > > }
> > > >
> > > > If the virtual address obtained from ioremap() is not
> > > > backed by a page struct, virt_to_page() returns an
> > > > invalid page pointer. However, sg_copy_buffer() can
> > > > correctly recover the original virtual address. Such
> > > > addresses can successfully be assigned to scatterlist
> > > > segments to transfer data across the PCI bus with
> > > > sg_copy_buffer() if it were not for the crash in
> > > > PageSlab() when called by sg_miter_stop().
> > >
> > > I thought we already agreed in general that putting things that don't
> > > have struct page into the scatter list was not allowed?
> > >
> > > Jason
> >
> > Thanks Jason for your comment.
> >
> > Cost of adding page structs to a large PCI I/O address range can be
> > quite substantial. Allowing PCI I/O pages without page structs may be
> > desirable. Perhaps it is worth considering this cost.
>
> This is generally agreed, but nobody has figured out a solution.
>
> > Scatterlist has no problem doing its memcpy() from pages without a
> > page struct that were obtained from ioremap(). It is only at
>
> Calling memcpy on pointers from ioremap is very much not allowed. Code
> has to use the iomem safe memcpy.
>
Is it within the realm of possibility to add support for such PCI I/O
pages to scatterlist? Perhaps a solution is possible by adding a new
function, say sg_set_iomem_page(), and a new SG_MITER_IOMEM flag that
tells scatterlist to use the iomem-safe memcpy functions when the page
has no struct page backing because it was obtained from ioremap(). The
same flag could also be used in sg_miter_stop() to skip PageSlab() and
flush_kernel_dcache_page().
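A purely hypothetical sketch of that shape; neither sg_set_iomem_page()
nor SG_MITER_IOMEM exists today, and the flag value is made up for
illustration:
/* Hypothetical only: neither the flag nor the helper exists today. */
#define SG_MITER_IOMEM		(1 << 3)	/* sg entry has no struct page behind it */

void sg_set_iomem_page(struct scatterlist *sg, void __iomem *ioaddr,
		       unsigned int len);

/*
 * The copy path (sg_copy_buffer() and friends) would pick
 * memcpy_toio()/memcpy_fromio() when SG_MITER_IOMEM is set, and
 * sg_miter_stop() would skip the flush, roughly:
 *
 *	if ((miter->__flags & SG_MITER_TO_SG) &&
 *	    !(miter->__flags & SG_MITER_IOMEM) &&
 *	    !PageSlab(miter->page))
 *		flush_kernel_dcache_page(miter->page);
 */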
If supporting PCI I/O pages is not possible, would it be possible to
check for invalid page pointers in sg_set_page() and communicate the
requirement for a page struct backing in its description?
Alan
> Jason
On Tue, Oct 01, 2019 at 10:26:38AM -0700, Alan Mikhak wrote:
> > > Cost of adding page structs to a large PCI I/O address range can be
> > > quite substantial. Allowing PCI I/O pages without page structs may be
> > > desirable. Perhaps it is worth considering this cost.
> >
> > This is generally agreed, but nobody has figured out a solution.
> >
> > > Scatterlist has no problem doing its memcpy() from pages without a
> > > page struct that were obtained from ioremap(). It is only at
> >
> > Calling memcpy on pointers from ioremap is very much not allowed. Code
> > has to use the iomem safe memcpy.
>
> Is it in the realm of possible to add support for such PCI I/O pages
> to scatterlist? Perhaps some solution is possible by adding a new
> function, say sg_set_iomem_page(), and a new SG_MITER_IOMEM flag that
> tells scatterlist to use iomem safe memcpy functions when the page is
> not backed by page struct because it was obtained from ioremap(). This
> flag can also be used at sg_miter_stop() to not call PageSlab() or
> flush_kernel_dcache_page().
People have tried many different things so far; it is more complex than
just the copy functions, as there is also sg_page() to worry about.
> If supporting PCI I/O pages is not possible, would it be possible to
> check for invalid page pointers in sg_set_page() and communicate the
> requirement for a page struct backing in its description?
Clarifying the comments seems reasonable to me.
Jason
On Tue, Oct 1, 2019 at 10:44 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Tue, Oct 01, 2019 at 10:26:38AM -0700, Alan Mikhak wrote:
>
> > > > Cost of adding page structs to a large PCI I/O address range can be
> > > > quite substantial. Allowing PCI I/O pages without page structs may be
> > > > desirable. Perhaps it is worth considering this cost.
> > >
> > > This is generally agreed, but nobody has figured out a solution.
> > >
> > > > Scatterlist has no problem doing its memcpy() from pages without a
> > > > page struct that were obtained from ioremap(). It is only at
> > >
> > > Calling memcpy on pointers from ioremap is very much not allowed. Code
> > > has to use the iomem safe memcpy.
> >
> > Is it in the realm of possible to add support for such PCI I/O pages
> > to scatterlist? Perhaps some solution is possible by adding a new
> > function, say sg_set_iomem_page(), and a new SG_MITER_IOMEM flag that
> > tells scatterlist to use iomem safe memcpy functions when the page is
> > not backed by page struct because it was obtained from ioremap(). This
> > flag can also be used at sg_miter_stop() to not call PageSlab() or
> > flush_kernel_dcache_page().
>
> People have tried many different things so far, it is more comple than
> just the copy functions as there is also sg_page to worry about.
>
> > If supporting PCI I/O pages is not possible, would it be possible to
> > check for invalid page pointers in sg_set_page() and communicate the
> > requirement for a page struct backing in its description?
>
> Clarifying the comments seems reasonable to me.
>
> Jason
Thanks Jason. I'll submit a patch to communicate the requirement in
the description of sg_set_page().
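As a rough idea of the sort of wording such a patch could add to the
sg_set_page() kernel-doc (the exact text would be settled in review):
/**
 * sg_set_page - Set sg entry to point at given page
 * ...
 *
 * Notes:
 *   @page must point at a valid page, i.e. one backed by a struct page.
 *   Addresses obtained from ioremap() have no struct page behind them
 *   and must not be fed to sg_set_page() via virt_to_page().
 */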
Alan
On Tue, Oct 01, 2019 at 09:16:23AM -0300, Jason Gunthorpe wrote:
> > If the virtual address obtained from ioremap() is not
> > backed by a page struct, virt_to_page() returns an
> > invalid page pointer. However, sg_copy_buffer() can
> > correctly recover the original virtual address. Such
> > addresses can successfully be assigned to scatterlist
> > segments to transfer data across the PCI bus with
> > sg_copy_buffer() if it were not for the crash in
> > PageSlab() when called by sg_miter_stop().
>
> I thought we already agreed in general that putting things that don't
> have struct page into the scatter list was not allowed?
Yes, that absolutely is the case.
On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> From: Alan Mikhak <[email protected]>
>
> Modify sg_miter_stop() to validate the page pointer
> before calling PageSlab(). This check prevents a crash
> that will occur if PageSlab() gets called with a page
> pointer that is not backed by page struct.
>
> A virtual address obtained from ioremap() for a physical
> address in PCI address space can be assigned to a
> scatterlist segment using the public scatterlist API
> as in the following example:
As Jason pointed out that is not a valid use of scatterlist. What
are you trying to do here at a higher level?
On Sun, Oct 6, 2019 at 11:13 PM Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > From: Alan Mikhak <[email protected]>
> >
> > Modify sg_miter_stop() to validate the page pointer
> > before calling PageSlab(). This check prevents a crash
> > that will occur if PageSlab() gets called with a page
> > pointer that is not backed by page struct.
> >
> > A virtual address obtained from ioremap() for a physical
> > address in PCI address space can be assigned to a
> > scatterlist segment using the public scatterlist API
> > as in the following example:
>
> As Jason pointed out that is not a valid use of scatterlist. What
> are you trying to do here at a higher level?
I am developing a PCI endpoint framework function driver to bring up
an NVMe device over PCIe. The NVMe endpoint function driver connects
to an x86_64 or other root-complex host over PCIe. Internally, the
NVMe endpoint function driver connects to the unmodified Linux NVMe
target driver running on the embedded CPU. The Linux NVMe target
operates an NVMe namespace as configured for the application.
Currently, the Linux NVMe target code operates a file-based namespace
which is backed by the loop device, but the application can be
expanded to operate on non-volatile storage such as flash or
battery-backed RAM. I am able to mount such an NVMe namespace from the
x86_64 Debian Linux host across PCIe using the Disks app and perform
partition benchmarking. I am also able to save and load files, such as
trace files for debugging the NVMe endpoint with KernelShark, on the
NVMe namespace partition nvme0n1p1.
My goal is to not modify the Linux NVMe target code at all. The NVMe
endpoint function driver currently does the work that is required: it
maps NVMe PRPs and PRP Lists from the host, formats a scatterlist that
the NVMe target driver can consume, and executes the NVMe command with
the scatterlist on the NVMe target controller on behalf of the host.
The NVMe target controller can therefore read and write directly to
host buffers using the scatterlist, as it would if the scatterlist had
arrived over an NVMe fabric.
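A rough sketch of that PRP-to-scatterlist step might look like the
following; ep_map_host_page() and ep_prp_list_to_sg() are hypothetical
names standing in for endpoint-specific code, not existing APIs:
#include <linux/errno.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

/* hypothetical: endpoint-specific mapping of one host PRP to a local page */
struct page *ep_map_host_page(u64 host_prp);

/* hypothetical: build one sg entry per PRP in a PRP list */
static int ep_prp_list_to_sg(struct scatterlist *sgl, const __le64 *prp_list,
			     unsigned int nr_prps, unsigned int prp_size)
{
	struct scatterlist *sg;
	unsigned int i;

	sg_init_table(sgl, nr_prps);
	for_each_sg(sgl, sg, nr_prps, i) {
		struct page *page = ep_map_host_page(le64_to_cpu(prp_list[i]));

		if (!page)
			return -EFAULT;
		sg_set_page(sg, page, prp_size, 0);
	}
	return 0;
}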
On my current platform, there is no page struct backing for the PCIe
memory address space. Nevertheless, I am able to feed the virtual
addresses I obtain from ioremap() to the scatterlist as shown in my
earlier example, and the scatterlist code has no problem traversing a
scatterlist formed from such addresses. The only place the scatterlist
code prevents such usage is in sg_miter_stop(), where it calls
PageSlab() to decide whether it should flush the page. I added a check
so that PageSlab() is not called when the page is not backed by a
struct page. That was all I had to do to be able to pass scatterlists
to the NVMe target.
Given that the PCIe memory address space is large, the cost of adding
page structs for that region is substantial enough that I am asking
whether the scatterlist code could be modified to support memory
pointers obtained from ioremap(). If that is not acceptable, the
solution would be to pay the price and add page structs for the PCIe
memory address space.
Regards,
Alan
On Mon, Oct 7, 2019 at 9:44 AM Alan Mikhak <[email protected]> wrote:
>
> On Sun, Oct 6, 2019 at 11:13 PM Christoph Hellwig <[email protected]> wrote:
> >
> > On Mon, Sep 30, 2019 at 04:22:35PM -0700, Alan Mikhak wrote:
> > > From: Alan Mikhak <[email protected]>
> > >
> > > Modify sg_miter_stop() to validate the page pointer
> > > before calling PageSlab(). This check prevents a crash
> > > that will occur if PageSlab() gets called with a page
> > > pointer that is not backed by page struct.
> > >
> > > A virtual address obtained from ioremap() for a physical
> > > address in PCI address space can be assigned to a
> > > scatterlist segment using the public scatterlist API
> > > as in the following example:
> >
> > As Jason pointed out that is not a valid use of scatterlist. What
> > are you trying to do here at a higher level?
>
> I am developing a PCI endpoint framework function driver to bring-up
> an NVMe device over PCIe. The NVMe endpoint function driver connects
> to an x86_64 or other root-complex host over PCIe. Internally, the
> NVMe endpoint function driver connects to the unmodified Linux NVMe
> target driver running on the embedded CPU. The Linux NVMe target
> operates an NVMe namespace as determined for the application.
> Currently, the Linux NVMe target code operates a file-based namespace
> which is backed by the loop device. However, the application can be
> expanded to operate on non-volatile storage such as flash or
> battery-backed RAM. Currently, I am able to mount such an NVMe
> namespace from the x86_64 Debian Linux host across PCIe using the
> Disks App and perform Partition Benchmarking. I am also able to save
> and load files, such as trace files for debugging the NVMe endpoint
> with KernelShark, on the NVMe namespace partition nvme0n1p1.
>
> My goal is to not modify the Linux NVMe target code at all. The NVMe
> endpoint function driver currently does the work that is required. It
> maps NVMe PRPs and PRP Lists from the host, formats a scatterlist that
> NVMe target driver can consume, and executes the NVMe command with the
> scatterlist on the NVMe target controller on behalf of the host. The
> NVMe target controller can therefore read and write directly to host
> buffers using the scatterlist as it does if the scatterlist had
> arrived over the NVMe fabric.
>
> In my current platform, there are no page struct backing for the PCIe
> memory address space. Nevertheless, I am able to feed the virtual
> addresses I obtain from ioremap() to the scatterlist as shown in my
> example earlier. The scatterlist code has no problem traversing the
> scatterlist that is formed from such addresses that were obtained from
> ioremap(). The only place the scatterlist code prevents such usage is
> in sg_miter_stop() when it calls PageSlab() to decide if it should
> flush the page. I added a check to see if the page is valid and not
> call PageSlab() if it is not a page struct backed page. That is all I
> had to do to be able to pass scatterlists to the NVMe target.
>
> Given that the PCIe memory address space is large, the cost of adding
> page structs for that region is substantial enough for me to ask that
> it be considered here to modify scatterlist code to support such
> memory pointers that were obtained from ioremap(). If not acceptable,
> the solution would be to pay to price and add page structs for the
> PCIe memory address space.
>
Please consider the following information and cost estimate, in
bytes, for requiring page structs for PCI memory used with
scatterlists. For example, a 128GB PCI memory address space
could require as much as 256MB of system memory just for
page struct backing. On a 1GB 64-bit system with a flat memory
model, that consumes 25% of available memory. However,
not all of the 128GB of PCI memory may be mapped for use at
a given time, depending on the application. The cost of the PCI
page structs is an upfront cost paid at system start.
pci memory start:          0x2000000000
pci memory size:           128GB   0x2000000000
with 64KB pci pages:
  pci page_size:           64KB    0x10000
  pci page_shift:          16      0x10
  pci pages:               2M      0x200000
  pci epc bitmap_size:     256KB   0x40000
with 4KB pci pages:
  pci page_size:           4KB     0x1000
  pci page_shift:          12      0xc
  pci pages:               32M     0x2000000
  pci epc bitmap_size:     4MB     0x400000
system page size:          4KB     0x1000
linux page struct size:    8B
pci page struct cost:      256MB   0x10000000
Regards,
Alan
On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
> > My goal is to not modify the Linux NVMe target code at all. The NVMe
> > endpoint function driver currently does the work that is required.
You will have to do some modifications, as for example in PCIe you can
have an n:1 relationship between SQs and CQs. And you need to handle
the Create/Delete SQ/CQ commands, but not the fabrics commands. And
modifying subsystems in Linux is perfectly acceptable; that is how they
improve.
Do you have a pointer to your code?
> > In my current platform, there are no page struct backing for the PCIe
> > memory address space.
In Linux there aren't struct pages for physical memory remapped using
ioremap(). But if you want to feed that memory to the I/O subsystem you
have to use devm_memremap_pages() to create a page backing. Assuming you
are on a RISC-V platform, given your affiliation, you'll need to ensure
your kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has
been working on. I don't remember what the current status is.
> Please consider the following information and cost estimate in
> bytes for requiring page structs for PCI memory if used with
> scatterlists. For example, a 128GB PCI memory address space
> could require as much as 256MB of system memory just for
> page struct backing. In a 1GB 64-bit system with flat memory
> model, that consumes 25% of available memory. However,
> not all of the 128GB PCI memory may be mapped for use at
> a given time depending on the application. The cost of PCI
> page structs is an upfront cost to be paid at system start.
I know the pages are costly. But once you want to feed them through
subsystems that do expect pages you'll have to do that. And anything
using scatterlists currently does. A little hack here and there isn't
going to solve that.
On 2019-10-15 3:55 a.m., Christoph Hellwig wrote:
> On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
>>> My goal is to not modify the Linux NVMe target code at all. The NVMe
>>> endpoint function driver currently does the work that is required.
>
> You will have to do some modifications, as for example in PCIe you can
> have a n:1 relationship between SQs and CQs. And you need to handle
> the Create/Delete SQ/CQ commands, but not the fabrics commands. And
> modifying subsystems in Linux is perfectly acceptable, that is how they
> improve.
>
> Do you have a pointer to your code?
>
>>> In my current platform, there are no page struct backing for the PCIe
>>> memory address space.
>
> In Linux there aren't struct pages for physical memory remapped using
> ioremap(). But if you want to feed them to the I/O subsystem you have
> to use devm_memremap_pages to create a page backing. Assuming you are
> on a RISC-V platform given your affiliation you'll need to ensure your
> kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has been
> working on. I don't remember what the current status is.
The last patchset submission was here [1]. It had some issues but the
main one was that they wanted the page tables to be created and removed
dynamically. I took a crack at it and ran into some issues and haven't
had time to touch it since. I was waiting to see how arm64[2] solves
similar problems and then maybe it can be made common.
It's also a bit of a PITA because the RISC-V hardware we have with hacky
PCI support is fragile and stopped working on recent kernels, last I
tried. So this hasn't been a priority for us.
>> Please consider the following information and cost estimate in
>> bytes for requiring page structs for PCI memory if used with
>> scatterlists. For example, a 128GB PCI memory address space
>> could require as much as 256MB of system memory just for
>> page struct backing. In a 1GB 64-bit system with flat memory
>> model, that consumes 25% of available memory. However,
>> not all of the 128GB PCI memory may be mapped for use at
>> a given time depending on the application. The cost of PCI
>> page structs is an upfront cost to be paid at system start.
>
> I know the pages are costly. But once you want to feed them through
> subsystems that do expect pages you'll have to do that. And anything
> using scatterlists currently does. A little hack here and there isn't
> going to solve that.
>
Agreed. I tried expanding the SG-list to allow for page-less entries and
it was a much bigger mess than what you describe.
Also, I think your analysis is a bit unfair: we don't need to create
pages for the entire 128GB of PCI memory space; we typically only need
the BARs of a subset of devices, which is far less. If a system has only
1GB of memory, it probably doesn't actually have 128GB of PCI BAR space
that is sensibly usable.
Yes, it would be nice to get rid of this overhead, but that's a much
bigger long-term project.
Logan
[1]
https://lore.kernel.org/linux-riscv/[email protected]/
[2]
https://lore.kernel.org/linux-arm-kernel/[email protected]/
On 2019-10-15 11:40 a.m., Alan Mikhak wrote:
> On Tue, Oct 15, 2019 at 2:55 AM Christoph Hellwig <[email protected]> wrote:
>>
>> On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
>>>> My goal is to not modify the Linux NVMe target code at all. The NVMe
>>>> endpoint function driver currently does the work that is required.
>>
>> You will have to do some modifications, as for example in PCIe you can
>> have a n:1 relationship between SQs and CQs. And you need to handle
>> the Create/Delete SQ/CQ commands, but not the fabrics commands. And
>> modifying subsystems in Linux is perfectly acceptable, that is how they
>> improve.
>
> The NVMe endpoint function driver currently creates the admin ACQ and
> ASQ on startup. When the NVMe host connects over PCIe, NVMe endpoint
> function driver handles the Create/Delete SQ/CQ commands and any other
> commands that cannot go to the NVMe target on behalf of the host. For
> example, it creates a pair of I/O CQ and SQ as requested by the Linux
> host kernel nvme.ko driver. The NVMe endpoint function driver supports
> Controller Memory Buffer (CMB). The I/O SQ is therefore located in CMB
> as requested by host nvme.ko.
>
> As for n:1 relationship between SQs and CQs, I have not implemented that
> yet since I didn't get such a request from the host nvme.ko yet. I agree. It
> needs to be implemented at some point. It is doable. I appreciate your
> comment and took note of it.
>
>>
>> Do you have a pointer to your code?
>
> The code is still work in progress. It is not stable yet for reliable use or
> upstream patch submission. It is stable enough for me to see it work from
> my Debian host desktop to capture screenshots of NVMe partition
> benchmarking, formatting, mounting, file storage and retrieval activity
> such as I mentioned. I could look into possibly submitting an RFC patch
> upstream for early review and feedback to improve it but it is not in a
> polished state yet.
>
>>
>>>> In my current platform, there are no page struct backing for the PCIe
>>>> memory address space.
>>
>> In Linux there aren't struct pages for physical memory remapped using
>> ioremap(). But if you want to feed them to the I/O subsystem you have
>> to use devm_memremap_pages to create a page backing. Assuming you are
>> on a RISC-V platform given your affiliation you'll need to ensure your
>> kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has been
>> working on. I don't remember what the current status is.
>
> Thanks for this suggestion. I will try using devm_memremap_pages() to
> create a page backing. I will also look for Logan's work regarding
> ZONE_DEVICE pages.
The nvme driver already creates struct pages for the CMB with
devm_memremap_pages(), at least since v4.20. Though it probably won't
do anything with the CMB on platforms that don't yet support ZONE_DEVICE
(i.e. riscv).
Logan
On Tue, Oct 15, 2019 at 2:55 AM Christoph Hellwig <[email protected]> wrote:
>
> On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
> > > My goal is to not modify the Linux NVMe target code at all. The NVMe
> > > endpoint function driver currently does the work that is required.
>
> You will have to do some modifications, as for example in PCIe you can
> have a n:1 relationship between SQs and CQs. And you need to handle
> the Create/Delete SQ/CQ commands, but not the fabrics commands. And
> modifying subsystems in Linux is perfectly acceptable, that is how they
> improve.
The NVMe endpoint function driver currently creates the admin ACQ and
ASQ on startup. When the NVMe host connects over PCIe, the NVMe endpoint
function driver handles the Create/Delete SQ/CQ commands and any other
commands that cannot go to the NVMe target on behalf of the host. For
example, it creates an I/O CQ and SQ pair as requested by the Linux
host kernel nvme.ko driver. The NVMe endpoint function driver supports
the Controller Memory Buffer (CMB); the I/O SQ is therefore located in
the CMB as requested by the host nvme.ko.
As for the n:1 relationship between SQs and CQs, I have not implemented
that yet, since I have not yet seen such a request from the host nvme.ko.
I agree it needs to be implemented at some point, and it is doable. I
appreciate your comment and have taken note of it.
>
> Do you have a pointer to your code?
The code is still a work in progress. It is not yet stable enough for
reliable use or upstream patch submission, but it is stable enough for me
to see it work from my Debian host desktop and to capture screenshots of
the NVMe partition benchmarking, formatting, mounting, and file storage
and retrieval activity I mentioned. I could look into submitting an RFC
patch upstream for early review and feedback, but it is not in a polished
state yet.
>
> > > In my current platform, there are no page struct backing for the PCIe
> > > memory address space.
>
> In Linux there aren't struct pages for physical memory remapped using
> ioremap(). But if you want to feed them to the I/O subsystem you have
> to use devm_memremap_pages to create a page backing. Assuming you are
> on a RISC-V platform given your affiliation you'll need to ensure your
> kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has been
> working on. I don't remember what the current status is.
Thanks for this suggestion. I will try using devm_memremap_pages() to
create a page backing. I will also look into Logan's work on
ZONE_DEVICE pages.
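A rough sketch of that suggestion, assuming ZONE_DEVICE support is in
place; struct dev_pagemap field names differ between kernel versions,
ep_map_window_with_pages() is a hypothetical helper, and
MEMORY_DEVICE_PCI_P2PDMA is used only for illustration:
#include <linux/device.h>
#include <linux/err.h>
#include <linux/gfp.h>
#include <linux/ioport.h>
#include <linux/memremap.h>

static void *ep_map_window_with_pages(struct device *dev,
				      phys_addr_t base, size_t size)
{
	struct dev_pagemap *pgmap;

	pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
	if (!pgmap)
		return ERR_PTR(-ENOMEM);

	pgmap->res.start = base;
	pgmap->res.end = base + size - 1;
	pgmap->res.flags = IORESOURCE_MEM;
	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;	/* type chosen for illustration */

	/*
	 * On success the range is covered by ZONE_DEVICE struct pages and
	 * the return value is a kernel virtual address for its start.
	 */
	return devm_memremap_pages(dev, pgmap);
}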
>
> > Please consider the following information and cost estimate in
> > bytes for requiring page structs for PCI memory if used with
> > scatterlists. For example, a 128GB PCI memory address space
> > could require as much as 256MB of system memory just for
> > page struct backing. In a 1GB 64-bit system with flat memory
> > model, that consumes 25% of available memory. However,
> > not all of the 128GB PCI memory may be mapped for use at
> > a given time depending on the application. The cost of PCI
> > page structs is an upfront cost to be paid at system start.
>
> I know the pages are costly. But once you want to feed them through
> subsystems that do expect pages you'll have to do that. And anything
> using scatterlists currently does. A little hack here and there isn't
> going to solve that.
The feedback on this is clear: page struct backing is required to use
scatterlists. Thanks.
On Tue, Oct 15, 2019 at 10:45 AM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2019-10-15 11:40 a.m., Alan Mikhak wrote:
> > On Tue, Oct 15, 2019 at 2:55 AM Christoph Hellwig <[email protected]> wrote:
> >>
> >> On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
> >>>> My goal is to not modify the Linux NVMe target code at all. The NVMe
> >>>> endpoint function driver currently does the work that is required.
> >>
> >> You will have to do some modifications, as for example in PCIe you can
> >> have a n:1 relationship between SQs and CQs. And you need to handle
> >> the Create/Delete SQ/CQ commands, but not the fabrics commands. And
> >> modifying subsystems in Linux is perfectly acceptable, that is how they
> >> improve.
> >
> > The NVMe endpoint function driver currently creates the admin ACQ and
> > ASQ on startup. When the NVMe host connects over PCIe, NVMe endpoint
> > function driver handles the Create/Delete SQ/CQ commands and any other
> > commands that cannot go to the NVMe target on behalf of the host. For
> > example, it creates a pair of I/O CQ and SQ as requested by the Linux
> > host kernel nvme.ko driver. The NVMe endpoint function driver supports
> > Controller Memory Buffer (CMB). The I/O SQ is therefore located in CMB
> > as requested by host nvme.ko.
> >
> > As for n:1 relationship between SQs and CQs, I have not implemented that
> > yet since I didn't get such a request from the host nvme.ko yet. I agree. It
> > needs to be implemented at some point. It is doable. I appreciate your
> > comment and took note of it.
> >
> >>
> >> Do you have a pointer to your code?
> >
> > The code is still work in progress. It is not stable yet for reliable use or
> > upstream patch submission. It is stable enough for me to see it work from
> > my Debian host desktop to capture screenshots of NVMe partition
> > benchmarking, formatting, mounting, file storage and retrieval activity
> > such as I mentioned. I could look into possibly submitting an RFC patch
> > upstream for early review and feedback to improve it but it is not in a
> > polished state yet.
> >
> >>
> >>>> In my current platform, there are no page struct backing for the PCIe
> >>>> memory address space.
> >>
> >> In Linux there aren't struct pages for physical memory remapped using
> >> ioremap(). But if you want to feed them to the I/O subsystem you have
> >> to use devm_memremap_pages to create a page backing. Assuming you are
> >> on a RISC-V platform given your affiliation you'll need to ensure your
> >> kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has been
> >> working on. I don't remember what the current status is.
> >
> > Thanks for this suggestion. I will try using devm_memremap_pages() to
> > create a page backing. I will also look for Logan's work regarding
> > ZONE_DEVICE pages.
>
> The nvme driver already creates struct pages for the CMB with
> devm_memremap_pages(). At least since v4.20. Though, it probably won't
> do anything with the CMB on platforms that don't yet support ZONE_DEVICE
> (ie riscv).
Hi Logan, thanks for your comments. I look forward to looking at the
work you have done on ZONE_DEVICE.
The NVMe endpoint function driver that I am developing runs on a Linux
PCIe endpoint device sitting across the PCIe bus from nvme.ko on the
Linux host. It interacts with the NVMe host as an NVMe endpoint across
the PCIe bus. The NVMe host may be Linux, Windows, etc.; in the case of
a Linux host, it interacts with nvme.ko. The endpoint itself is a Linux
PCIe endpoint device running a PCI Endpoint Framework function driver
that implements the NVMe endpoint functionality.
>
> Logan
On Tue, Oct 15, 2019 at 9:12 AM Logan Gunthorpe <[email protected]> wrote:
>
>
>
> On 2019-10-15 3:55 a.m., Christoph Hellwig wrote:
> > On Mon, Oct 07, 2019 at 02:13:51PM -0700, Alan Mikhak wrote:
> >>> My goal is to not modify the Linux NVMe target code at all. The NVMe
> >>> endpoint function driver currently does the work that is required.
> >
> > You will have to do some modifications, as for example in PCIe you can
> > have a n:1 relationship between SQs and CQs. And you need to handle
> > the Create/Delete SQ/CQ commands, but not the fabrics commands. And
> > modifying subsystems in Linux is perfectly acceptable, that is how they
> > improve.
> >
> > Do you have a pointer to your code?
> >
> >>> In my current platform, there are no page struct backing for the PCIe
> >>> memory address space.
> >
> > In Linux there aren't struct pages for physical memory remapped using
> > ioremap(). But if you want to feed them to the I/O subsystem you have
> > to use devm_memremap_pages to create a page backing. Assuming you are
> > on a RISC-V platform given your affiliation you'll need to ensure your
> > kernel allows for ZONE_DEVICE pages, which Logan (added to Cc) has been
> > working on. I don't remember what the current status is.
>
> The last patchset submission was here [1]. It had some issues but the
> main one was that they wanted the page tables to be created and removed
> dynamically. I took a crack at it and ran into some issues and haven't
> had time to touch it since. I was waiting to see how arm64[2] solves
> similar problems and then maybe it can be made common.
Thanks Logan for the links to your patchset.
>
> It's also a bit of a PITA because the RISC-V hardware we have with hacky
> PCI support is fragile and stopped working on recent kernels, last I
> tried. So this hasn't been a priority for us.
I would be interested in hearing more about PCI issues on RISC-V hardware.
Please include me, if you like.
>
> >> Please consider the following information and cost estimate in
> >> bytes for requiring page structs for PCI memory if used with
> >> scatterlists. For example, a 128GB PCI memory address space
> >> could require as much as 256MB of system memory just for
> >> page struct backing. In a 1GB 64-bit system with flat memory
> >> model, that consumes 25% of available memory. However,
> >> not all of the 128GB PCI memory may be mapped for use at
> >> a given time depending on the application. The cost of PCI
> >> page structs is an upfront cost to be paid at system start.
> >
> > I know the pages are costly. But once you want to feed them through
> > subsystems that do expect pages you'll have to do that. And anything
> > using scatterlists currently does. A little hack here and there isn't
> > going to solve that.
> >
>
> Agreed. I tried expanding the SG-list to allow for page-less entries and
> it was a much bigger mess than what you describe.
>
> Also, I think your analysis is a bit unfair, we don't need to create
> pages for the entire 128GB of PCI memory space, we typically only need
> the BARs of a subset of devices which is far less. If a system has only
> 1GB of memory it probably doesn't actually have 128GB of PCI bar spaces
> that are sensibly usable.
I appreciate your comment and your experience from attempting to expand
the scatterlist to allow for page-less entries. The feedback is clear:
such an expansion is bigger than it looks at first glance.
An NVMe endpoint has to map Physical Region Pages (PRPs) that sit
across the PCIe bus anywhere in host memory. PRPs are typically 4KB
pages which may be scattered throughout host memory. The PCIe address
space of an NVMe endpoint is larger than its local physical memory space
or that of the host system. 128GB is not the typical amount of physical
memory populated on a PCIe host or endpoint; however, the PCIe address
space of an NVMe endpoint needs to address much larger regions than the
physical memory on either side of the PCIe bus.
The NVMe endpoint function driver's access to host PRPs is not
BAR-based: the NVMe endpoint accesses host memory as a PCIe bus master,
whereas PCIe hosts typically access endpoint memory through BARs.
Finding an economical solution for page struct backing for a large PCIe
address space, which is not itself backed by physical memory, is
desirable. Clearly, page structs are a requirement for using
scatterlists. Since scatterlists are the mechanism the Linux NVMe target
driver uses, the NVMe endpoint function driver needs to convert PRPs to
scatterlists using mappings that have proper page struct backing.
As Christoph Hellwig suggested, devm_memremap_pages() may be a solution
if the kernel supports ZONE_DEVICE.
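As a minimal sketch of the end state, assuming the remapped window
already has struct pages (for example via devm_memremap_pages()), a
local physical address inside that window could then go through the
normal API; ep_sg_set_mapped() is a hypothetical helper:
#include <linux/mm.h>
#include <linux/scatterlist.h>

static void ep_sg_set_mapped(struct scatterlist *sg, phys_addr_t phys,
			     unsigned int len)
{
	/* valid only once 'phys' is covered by ZONE_DEVICE struct pages */
	sg_set_page(sg, pfn_to_page(PHYS_PFN(phys)), len,
		    offset_in_page(phys));
}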
Regards,
Alan
>
> Yes, it would be nice to get rid of this overhead, but that's a much
> bigger long-term project.
>
> Logan
>
>
> [1]
> https://lore.kernel.org/linux-riscv/[email protected]/
> [2]
> https://lore.kernel.org/linux-arm-kernel/[email protected]/