2005-04-26 05:33:49

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: pci-sysfs resource mmap broken

Hi !

While chasing an interesting bug in ppc/ppc64 implementation of pci
mmap, I discovered that the pci sysfs code is in fact broken and cannot
be made to work on those archs (well, at least for IO space). In fact,
it can, but then, it would break the /proc code :)

The problem is that they are both calling the same arch routine
(pci_mmap_page_range) with the offset argument having a different
semantic.

In the /proc code, it comes from userland directly, and is supposedly,
the raw BAR value.

In sysfs, it's passing the device resource[]->start value.

The problem is that can only work ... on architectures where the
resources contain the same thing as the BAR values. On ppc, where this
is not the case, it will not work. On ppc, resources are "fixed up" in
various ways (for example, PReP adds a fixed offset to all memory
resources to match the HW translation since PCI isn't 1:1 on those, and
all PPCs with more than one domain play tricks with IO resources).

What would be the proper fix here ? Having pci_mmap_resource() actually
read the BAR value for the resource ?

In a similar vein, the "resource" is exposing directly to userland the
content of "struct resource". This doesn't mean anything. The kernel is
internally playing all sort of offset tricks on these values, so they
can't be used for anything useful, either via /dev/mem, or for io port
accesses, or whatever.

Shouldn't we expose the BAR values & size rather here ? That is,
reconsitutes non-offset'd resources, possibly with arch help, or just
reading BAR to get base, and apply resource size & flags ?

Unless you are on x86 of course ...

There is some serious brokenness in there, it needs to be fixed if we
want things like X.org to be ever properly adapted (and we'll have to
deal with existing broken kernels, gack).

Ben.



2005-04-26 06:10:07

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, 2005-04-26 at 15:33 +1000, Benjamin Herrenschmidt wrote:

> In a similar vein, the "resource" is exposing directly to userland the
> content of "struct resource". This doesn't mean anything. The kernel is
> internally playing all sort of offset tricks on these values, so they
> can't be used for anything useful, either via /dev/mem, or for io port
> accesses, or whatever.
>
> Shouldn't we expose the BAR values & size rather here ? That is,
> reconsitutes non-offset'd resources, possibly with arch help, or just
> reading BAR to get base, and apply resource size & flags ?

.../...

Ok, after a bit more thinking, I think I'll go that way for now, please
let me know if you think I'm wrong:

rename "resource" to "resources" and make it contain a start address
that matches the BAR value, that is something that has at least some
sort of meaning in userland and can be passed to pci_mmap_page_range().
To do that "translation", I'll read the BAR value, and use it as start,
then use the resource size & flags.

The name change will also allow userland to "detect" the fixed
implementation.

Ben.


2005-04-26 06:37:01

by Greg KH

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, Apr 26, 2005 at 04:09:41PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2005-04-26 at 15:33 +1000, Benjamin Herrenschmidt wrote:
>
> > In a similar vein, the "resource" is exposing directly to userland the
> > content of "struct resource". This doesn't mean anything. The kernel is
> > internally playing all sort of offset tricks on these values, so they
> > can't be used for anything useful, either via /dev/mem, or for io port
> > accesses, or whatever.
> >
> > Shouldn't we expose the BAR values & size rather here ? That is,
> > reconsitutes non-offset'd resources, possibly with arch help, or just
> > reading BAR to get base, and apply resource size & flags ?
>
> .../...
>
> Ok, after a bit more thinking, I think I'll go that way for now, please
> let me know if you think I'm wrong:
>
> rename "resource" to "resources" and make it contain a start address
> that matches the BAR value, that is something that has at least some
> sort of meaning in userland and can be passed to pci_mmap_page_range().
> To do that "translation", I'll read the BAR value, and use it as start,
> then use the resource size & flags.

That sounds fine with me.

thanks,

greg k-h

2005-04-26 09:24:20

by Russell King

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, Apr 26, 2005 at 04:09:41PM +1000, Benjamin Herrenschmidt wrote:
> Ok, after a bit more thinking, I think I'll go that way for now, please
> let me know if you think I'm wrong:
>
> rename "resource" to "resources" and make it contain a start address
> that matches the BAR value, that is something that has at least some
> sort of meaning in userland and can be passed to pci_mmap_page_range().
> To do that "translation", I'll read the BAR value, and use it as start,
> then use the resource size & flags.
>
> The name change will also allow userland to "detect" the fixed
> implementation.

I'll wait until I see code before commenting.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-04-26 16:31:59

by Grant Grundler

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, Apr 26, 2005 at 03:33:29PM +1000, Benjamin Herrenschmidt wrote:
...
> The problem is that can only work ... on architectures where the
> resources contain the same thing as the BAR values. On ppc, where this
> is not the case, it will not work. On ppc, resources are "fixed up" in
> various ways (for example, PReP adds a fixed offset to all memory
> resources to match the HW translation since PCI isn't 1:1 on those, and
> all PPCs with more than one domain play tricks with IO resources).

IIRC, alpha, sparc, and parisc also are broken then.

> In a similar vein, the "resource" is exposing directly to userland the
> content of "struct resource". This doesn't mean anything. The kernel is
> internally playing all sort of offset tricks on these values, so they
> can't be used for anything useful, either via /dev/mem, or for io port
> accesses, or whatever.

There are two "views" of a PCI resource and the names I've used in
the past are "IO View" and "CPU View". The raw BAR value is the "IO View"
since that's what that devices on that PCI bus need to use for Peer-to-Peer
reads/writes. IIRC, sym2 driver directly reads the BAR for telling the scripts
engine where onboard RAM lives. The "CPU View" is what drivers/user space
needs to use when accessing the device. This is what we should be
exporting to user space and I'm pretty sure that's what X.org/XF86
should be using too. Bjorn, have I got that right?

> Shouldn't we expose the BAR values & size rather here ? That is,
> reconsitutes non-offset'd resources, possibly with arch help, or just
> reading BAR to get base, and apply resource size & flags ?
>
> Unless you are on x86 of course ...
>
> There is some serious brokenness in there, it needs to be fixed if we
> want things like X.org to be ever properly adapted (and we'll have to
> deal with existing broken kernels, gack).

ISTR Bjorn was looking at this and the VGA routing issues.

thanks,
grant

2005-04-26 22:48:10

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken


> There are two "views" of a PCI resource and the names I've used in
> the past are "IO View" and "CPU View". The raw BAR value is the "IO View"
> since that's what that devices on that PCI bus need to use for Peer-to-Peer
> reads/writes. IIRC, sym2 driver directly reads the BAR for telling the scripts
> engine where onboard RAM lives. The "CPU View" is what drivers/user space
> needs to use when accessing the device. This is what we should be
> exporting to user space and I'm pretty sure that's what X.org/XF86
> should be using too. Bjorn, have I got that right?

No. I don't agree. userspace has no business understanding the kernel
resources content. All userspace need to care about is the resource
number, and _eventually_ match it to a BAR value (either for using the
old /proc interface -> existing code, or to use it with real inx/outx
instructions on x86).

> > Shouldn't we expose the BAR values & size rather here ? That is,
> > reconsitutes non-offset'd resources, possibly with arch help, or just
> > reading BAR to get base, and apply resource size & flags ?
> >
> > Unless you are on x86 of course ...
> >
> > There is some serious brokenness in there, it needs to be fixed if we
> > want things like X.org to be ever properly adapted (and we'll have to
> > deal with existing broken kernels, gack).
>
> ISTR Bjorn was looking at this and the VGA routing issues.

OK, well, I'll come up with a patch in a few hours doing what I
described anyway, and we'll start from that.

The only thing I dislike a bit is that forces me to read the BAR on
every access to "un-offset" the kernel resource. We may be able to have
some arch hook do that properly, but for now, that would fix the problem
and make the whole stuff work again.

Ben.


2005-04-27 03:53:10

by Grant Grundler

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Wed, Apr 27, 2005 at 08:47:34AM +1000, Benjamin Herrenschmidt wrote:
> No. I don't agree. userspace has no business understanding the kernel
> resources content.

Sorry - you are right. Userspace doesn't need to understand kernel
resources. It just needs some sort of handle so it can talk
to the device in whatever way is appropriate. I was thinking
the resource content (which happens to be CPU View of a BAR)
could be that handle.

...
> The only thing I dislike a bit is that forces me to read the BAR on
> every access to "un-offset" the kernel resource. We may be able to have
> some arch hook do that properly, but for now, that would fix the problem
> and make the whole stuff work again.


What is wrong with reading the BAR?
Is the "IO View" needed in the performance path someplace?

It should be trivial since config space is exported via /sys and /proc.
libpci probably already has everything that's needed.

grant

2005-04-27 04:31:08

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, 2005-04-26 at 21:55 -0600, Grant Grundler wrote:
> On Wed, Apr 27, 2005 at 08:47:34AM +1000, Benjamin Herrenschmidt wrote:
> > No. I don't agree. userspace has no business understanding the kernel
> > resources content.
>
> Sorry - you are right. Userspace doesn't need to understand kernel
> resources. It just needs some sort of handle so it can talk
> to the device in whatever way is appropriate. I was thinking
> the resource content (which happens to be CPU View of a BAR)
> could be that handle.

Ok, we are crossing each other here, I was about to reply that you were
indeed right :) Anyway, I'm about to post another message explaining all
that I think I found and my proposed fix for ppc. Should be up as soon
as I have finished testing the patch.

> ...
> > The only thing I dislike a bit is that forces me to read the BAR on
> > every access to "un-offset" the kernel resource. We may be able to have
> > some arch hook do that properly, but for now, that would fix the problem
> > and make the whole stuff work again.
>
>
> What is wrong with reading the BAR?
> Is the "IO View" needed in the performance path someplace?
>
> It should be trivial since config space is exported via /sys and /proc.
> libpci probably already has everything that's needed.

True, but then, I tend to prefer you idea of a CPU view ... so my new
"proposal" is something like pci_resource_to_user() implementation.
Anyway, wait a bit so I can polish this patch and tell me what you
think :)

Ben.


2005-04-27 04:37:52

by David Miller

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Wed, 27 Apr 2005 14:30:21 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> True, but then, I tend to prefer you idea of a CPU view ... so my new
> "proposal" is something like pci_resource_to_user() implementation.
> Anyway, wait a bit so I can polish this patch and tell me what you
> think :)

I know that in particular I don't need to tell you this Ben,
but just in case please make sure it handles the 64-bit
kernel 32-bit userspace properly :-)

I'm very happy someone is working on this issue.
So don't get discouraged :)

2005-04-27 04:41:14

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken

On Tue, 2005-04-26 at 21:28 -0700, David S. Miller wrote:
> On Wed, 27 Apr 2005 14:30:21 +1000
> Benjamin Herrenschmidt <[email protected]> wrote:
>
> > True, but then, I tend to prefer you idea of a CPU view ... so my new
> > "proposal" is something like pci_resource_to_user() implementation.
> > Anyway, wait a bit so I can polish this patch and tell me what you
> > think :)
>
> I know that in particular I don't need to tell you this Ben,
> but just in case please make sure it handles the 64-bit
> kernel 32-bit userspace properly :-)

Well, I'm not changing /proc here, it will stay broken, but sysfs
already always exposes 64 bits and I won't change this. I'm also making
sure that can deal with 32 bits kernels with >32 bits IO space.

> I'm very happy someone is working on this issue.
> So don't get discouraged :)

Ok, cool. Please comment on my email/patch (coming real soon now) as
it's really more an RFC than a final fix at this point.

Ben.


2005-04-27 04:48:48

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Wed, 2005-04-27 at 08:47 +1000, Benjamin Herrenschmidt wrote:

> No. I don't agree. userspace has no business understanding the kernel
> resources content. All userspace need to care about is the resource
> number, and _eventually_ match it to a BAR value (either for using the
> old /proc interface -> existing code, or to use it with real inx/outx
> instructions on x86).

Ok, let me correct myself here... I think you _are_ right, but I also
think that pci_dev->resource[] is _not_ a CPU address that can be
exposed to userland, it's is a kernel internal "cookie" whose meaning is
arch specific.

What we need here is a way to go from pci_dev->resource to some CPU
address (typically in most archs that would result in something that can
be fed through /dev/mem, though that is just an example, doing so is
definitely not recommended :)

Additionally, there are a few others issues:

- mmap'ing resourceX vs. alignment. I have this video card with a bunch
of IOs at 0x400, so mmap'ing it will return something that maps the
first page of IO space (from 0). The userland program must be able to
recalculate the necessary alignement to know that it has to add 0x400 to
the mapping obtained via mmap. There is nothing in the API preventing
it, that is, the user program can get that via array exposed in
"resource" (or rather "resources" as I intend to rename it) ... provided
those are really CPU _addresses_ and not some kind of weird tokens
(which they are on some archs). In the case of IO space, there is the
additional requirement of actually knowing the real IO address for use
with inX/outX instructions on architectures that have those.

- What about architectures that have 32 bits core but > 32 bits
physical address space, like ppc32 with 440 or e500 processors where
IO/MMIO space is above 4gb ? The proper fix here is to have struct
resource to be 64 bits fields but we aren't there ...

- What is supposed to be passed to /proc/bus/pci mmap ? Should it be a
BAR value or the output of /proc/bus/pci/devices ? In the later case,
well, we are exposing the same crap as sysfs, that is a raw kernel
struct resource, which doesn't make sense out of the kernel. So I'm
tempted to "fix" /proc/bus/pci/devices the same way (see below) that I'm
fixing /sys/bus/pci/*/resources to pass a CPU physical address. (this
will only affect ppc[64] or other archs using my new callback anyway).

So what do that mean in practice ? Well, a few things :

1 - As described above, we need a kind of pci_resource_to_user() arch
hook so that struct resource can be converted in something that makes
sense to userland, typically a CPU physical address. It would be easy to
have a default implementation just using the resource "as-is" as it does
today, so we only have to fixup archs for which the value in there is
meaningless outside of the kernel (typically, ppc & ppc64 for IO
resources)

2 - The above pci_resource_to_user() might probably need to be defined
as returning a 64 bits value, and for consistency, we should probably
always display 64 bits in "resources" to handle the case of 32 bits CPUs
with > 32 bits IO space.

3 - pci_mmap_page_range() beeing called by both sysfs mmap and proc
mmap, we must clarify what it's "offset" parameter really is. Is it PCI
BAR value or is it a CPU physical address ? It was defined for the /proc
interface, was it ever documented . I though it wanted a BAR value but
Jesse seems to think it wants something out of /proc/bus/pci/devices.
The former would mean that either pci-sysfs.c must "convert" the offset
to a BAR value before calling pci_mmap_page_range(), or we must define a
different interface since they are non-consistent, but the later means
they are consistant (which is cool). In both case, current ppc
implementation is broken.

4 - additionally, to avoid future compatibility issues, I added the
resource number to the output of /sys/bus/pci/*/resources

The patch below implements this, with 1 - defaulting to just using the
resource as-is, plus a pair of imlpementations for ppc and ppc64, with 2
- returning an u64, and with 3 - taking the approach that /proc/bus/pci
expects offsets from /proc/bus/pci/devices (and not BAR values), that
means the interface stays consistent between proc and sysfs (pfiew !)

Additional misc question: The new interface doesn't allow to pass in any
"write combine" parametre like /proc did. In fact, this paremeter was a
bit limitative, since archs tend to be more varied when it comes to
relaxed ordering. On ppc & ppc64, I only use this as a hint, and the
pci_mmap_page_range() implementation (and /dev/mem too, so X benefits
from it) will in fact use the "prefetchable" attribute of the device
resource to decide wetehr to mark the space "guarded" or not. "guarded"
is a PPC specific attribute that disable things like speculation, but in
practice, also seem to disable write combining on some CPUs. I was
wondering wether it would make sense to haev pci-sysfs.c pass 1 for
write_combine when the resource beeing mapped has the prefetchable bit ?

Ok, here's the patch for comments. It's really on for comments now, I
has some nasty printk's in there, some useless hunks, etc... :)

So don't comment on these please :)

Ben.

Index: linux-work/arch/ppc64/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pci.c 2005-04-27 14:44:46.000000000 +1000
@@ -351,9 +351,12 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = (unsigned long)hose->io_base_virt - pci_io_base;
+ printk("offset: %lx, io_base_virt: %p, pci_io_base: %lx, io_offset: %lx\n",
+ *offset, hose->io_base_virt, pci_io_base, io_offset);
*offset += io_offset;
res_bit = IORESOURCE_IO;
+ printk(" -> offset: %lx\n", *offset);
}

/*
@@ -373,12 +376,15 @@
continue;

/* In the range of this resource? */
+ printk(" r%d: %lx -> %lx\n", i, rp->start, rp->end);
if (*offset < (rp->start & PAGE_MASK) || *offset > rp->end)
continue;

/* found it! construct the final physical address */
- if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - io_offset;
+ if (mmap_state == pci_mmap_io) {
+ *offset += hose->io_base_phys - io_offset;
+ printk(" result: %lx\n", *offset);
+ }
return rp;
}

@@ -941,4 +947,22 @@
}
EXPORT_SYMBOL(pci_read_irq_line);

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_host(dev->bus);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = pci_io_base - (unsigned long)hose->io_base_virt +
+ hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
#endif /* CONFIG_PPC_MULTIPLATFORM */
Index: linux-work/drivers/pci/pci-sysfs.c
===================================================================
--- linux-work.orig/drivers/pci/pci-sysfs.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci-sysfs.c 2005-04-27 14:40:28.000000000 +1000
@@ -54,27 +54,30 @@

/* show resources */
static ssize_t
-resource_show(struct device * dev, char * buf)
+resources_show(struct device * dev, char * buf)
{
struct pci_dev * pci_dev = to_pci_dev(dev);
char * str = buf;
int i;
int max = 7;
+ u64 start, end;

if (pci_dev->subordinate)
max = DEVICE_COUNT_RESOURCE;

for (i = 0; i < max; i++) {
- str += sprintf(str,"0x%016lx 0x%016lx 0x%016lx\n",
- pci_resource_start(pci_dev,i),
- pci_resource_end(pci_dev,i),
- pci_resource_flags(pci_dev,i));
+ struct resource *res = &pci_dev->resource[i];
+ pci_resource_to_user(pci_dev, i, res, &start, &end);
+ str += sprintf(str,"0x%02x 0x%016llx 0x%016llx 0x%016llx\n", i,
+ (unsigned long long)start,
+ (unsigned long long)end,
+ (unsigned long long)res->flags);
}
return (str - buf);
}

struct device_attribute pci_dev_attrs[] = {
- __ATTR_RO(resource),
+ __ATTR_RO(resources),
__ATTR_RO(vendor),
__ATTR_RO(device),
__ATTR_RO(subsystem_vendor),
@@ -267,8 +270,21 @@
struct device, kobj));
struct resource *res = (struct resource *)attr->private;
enum pci_mmap_state mmap_type;
+ u64 start, end;
+ int i;
+
+ for (i = 0; i < PCI_ROM_RESOURCE; i++)
+ if (res == &pdev->resource[i])
+ break;
+ if (i >= PCI_ROM_RESOURCE)
+ return -ENODEV;

- vma->vm_pgoff += res->start >> PAGE_SHIFT;
+ /* pci_mmap_page_range() expects the same kind of entry as coming
+ * from /proc/bus/pci/ which is a "user visible" value. If this is
+ * different from the resource itself, arch will do necessary fixup.
+ */
+ pci_resource_to_user(pdev, i, res, &start, &end);
+ vma->vm_pgoff += start >> PAGE_SHIFT;
mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io;

return pci_mmap_page_range(pdev, vma, mmap_type, 0);
Index: linux-work/include/asm-ppc64/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc64/pci.h 2005-04-24 11:38:56.000000000 +1000
+++ linux-work/include/asm-ppc64/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -135,6 +135,11 @@
unsigned long offset,
unsigned long size,
pgprot_t prot);
+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+


#endif /* __KERNEL__ */
Index: linux-work/drivers/pci/proc.c
===================================================================
--- linux-work.orig/drivers/pci/proc.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/proc.c 2005-04-27 14:34:18.000000000 +1000
@@ -354,14 +354,20 @@
dev->device,
dev->irq);
/* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
- for(i=0; i<7; i++)
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
- dev->resource[i].start |
+ ((unsigned long)start) |
(dev->resource[i].flags & PCI_REGION_FLAG_MASK));
- for(i=0; i<7; i++)
+ }
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
dev->resource[i].start < dev->resource[i].end ?
- dev->resource[i].end - dev->resource[i].start + 1 : 0);
+ (unsigned long)(end - start) + 1 : 0);
+ }
seq_putc(m, '\t');
if (drv)
seq_printf(m, "%s", drv->name);
Index: linux-work/drivers/pci/pci.c
===================================================================
--- linux-work.orig/drivers/pci/pci.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci.c 2005-04-27 14:34:18.000000000 +1000
@@ -770,7 +770,7 @@
return 0;
}
#endif
-
+
static int __devinit pci_init(void)
{
struct pci_dev *dev = NULL;
Index: linux-work/include/linux/pci.h
===================================================================
--- linux-work.orig/include/linux/pci.h 2005-04-24 11:39:20.000000000 +1000
+++ linux-work/include/linux/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -1017,6 +1017,21 @@
#define pci_pretty_name(dev) ""
#endif

+
+/* Some archs don't want to expose struct resource to userland as-is
+ * in sysfs and /proc
+ */
+#ifndef HAVE_ARCH_PCI_RESOURCE_TO_USER
+static void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ *start = rsrc->start;
+ *end = rsrc->end;
+}
+#endif /* HAVE_ARCH_PCI_RESOURCE_TO_USER */
+
+
/*
* The world is not perfect and supplies us with broken PCI devices.
* For at least a part of these bugs we need a work-around, so both
Index: linux-work/include/asm-ppc/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc/pci.h 2005-04-24 11:38:54.000000000 +1000
+++ linux-work/include/asm-ppc/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -103,6 +103,12 @@
unsigned long size,
pgprot_t prot);

+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+
+
#endif /* __KERNEL__ */

#endif /* __PPC_PCI_H */
Index: linux-work/arch/ppc/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc/kernel/pci.c 2005-04-27 14:42:34.000000000 +1000
@@ -1495,7 +1495,7 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = hose->io_base_virt - ___IO_BASE;
*offset += io_offset;
res_bit = IORESOURCE_IO;
}
@@ -1522,7 +1522,7 @@

/* found it! construct the final physical address */
if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - _IO_BASE;
+ *offset += hose->io_base_phys - io_offset;
return rp;
}

@@ -1739,6 +1739,23 @@
return result;
}

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_hose(dev->bus->number);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = ___IO_BASE - hose->io_base_virt + hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
void __init
pci_init_resource(struct resource *res, unsigned long start, unsigned long end,
int flags, char *name)


2005-04-27 23:14:59

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: pci-sysfs resource mmap broken (and PATCH)

(RESENT: Not sure it actually made it yesterday, sorry if you got it twice)

> No. I don't agree. userspace has no business understanding the kernel
> resources content. All userspace need to care about is the resource
> number, and _eventually_ match it to a BAR value (either for using the
> old /proc interface -> existing code, or to use it with real inx/outx
> instructions on x86).

Ok, let me correct myself here... I think you _are_ right, but I also
think that pci_dev->resource[] is _not_ a CPU address that can be
exposed to userland, it's is a kernel internal "cookie" whose meaning is
arch specific.

What we need here is a way to go from pci_dev->resource to some CPU
address (typically in most archs that would result in something that can
be fed through /dev/mem, though that is just an example, doing so is
definitely not recommended :)

Additionally, there are a few others issues:

- mmap'ing resourceX vs. alignment. I have this video card with a bunch
of IOs at 0x400, so mmap'ing it will return something that maps the
first page of IO space (from 0). The userland program must be able to
recalculate the necessary alignement to know that it has to add 0x400 to
the mapping obtained via mmap. There is nothing in the API preventing
it, that is, the user program can get that via array exposed in
"resource" (or rather "resources" as I intend to rename it) ... provided
those are really CPU _addresses_ and not some kind of weird tokens
(which they are on some archs). In the case of IO space, there is the
additional requirement of actually knowing the real IO address for use
with inX/outX instructions on architectures that have those.

- What about architectures that have 32 bits core but > 32 bits
physical address space, like ppc32 with 440 or e500 processors where
IO/MMIO space is above 4gb ? The proper fix here is to have struct
resource to be 64 bits fields but we aren't there ...

- What is supposed to be passed to /proc/bus/pci mmap ? Should it be a
BAR value or the output of /proc/bus/pci/devices ? In the later case,
well, we are exposing the same crap as sysfs, that is a raw kernel
struct resource, which doesn't make sense out of the kernel. So I'm
tempted to "fix" /proc/bus/pci/devices the same way (see below) that I'm
fixing /sys/bus/pci/*/resources to pass a CPU physical address. (this
will only affect ppc[64] or other archs using my new callback anyway).

So what do that mean in practice ? Well, a few things :

1 - As described above, we need a kind of pci_resource_to_user() arch
hook so that struct resource can be converted in something that makes
sense to userland, typically a CPU physical address. It would be easy to
have a default implementation just using the resource "as-is" as it does
today, so we only have to fixup archs for which the value in there is
meaningless outside of the kernel (typically, ppc & ppc64 for IO
resources)

2 - The above pci_resource_to_user() might probably need to be defined
as returning a 64 bits value, and for consistency, we should probably
always display 64 bits in "resources" to handle the case of 32 bits CPUs
with > 32 bits IO space.

3 - pci_mmap_page_range() beeing called by both sysfs mmap and proc
mmap, we must clarify what it's "offset" parameter really is. Is it PCI
BAR value or is it a CPU physical address ? It was defined for the /proc
interface, was it ever documented . I though it wanted a BAR value but
Jesse seems to think it wants something out of /proc/bus/pci/devices.
The former would mean that either pci-sysfs.c must "convert" the offset
to a BAR value before calling pci_mmap_page_range(), or we must define a
different interface since they are non-consistent, but the later means
they are consistant (which is cool). In both case, current ppc
implementation is broken.

4 - additionally, to avoid future compatibility issues, I added the
resource number to the output of /sys/bus/pci/*/resources

The patch below implements this, with 1 - defaulting to just using the
resource as-is, plus a pair of imlpementations for ppc and ppc64, with 2
- returning an u64, and with 3 - taking the approach that /proc/bus/pci
expects offsets from /proc/bus/pci/devices (and not BAR values), that
means the interface stays consistent between proc and sysfs (pfiew !)

Additional misc question: The new interface doesn't allow to pass in any
"write combine" parametre like /proc did. In fact, this paremeter was a
bit limitative, since archs tend to be more varied when it comes to
relaxed ordering. On ppc & ppc64, I only use this as a hint, and the
pci_mmap_page_range() implementation (and /dev/mem too, so X benefits
from it) will in fact use the "prefetchable" attribute of the device
resource to decide wetehr to mark the space "guarded" or not. "guarded"
is a PPC specific attribute that disable things like speculation, but in
practice, also seem to disable write combining on some CPUs. I was
wondering wether it would make sense to haev pci-sysfs.c pass 1 for
write_combine when the resource beeing mapped has the prefetchable bit ?

Ok, here's the patch for comments. It's really on for comments now, I
has some nasty printk's in there, some useless hunks, etc... :)

So don't comment on these please :)

Ben.

Index: linux-work/arch/ppc64/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pci.c 2005-04-27 14:44:46.000000000 +1000
@@ -351,9 +351,12 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = (unsigned long)hose->io_base_virt - pci_io_base;
+ printk("offset: %lx, io_base_virt: %p, pci_io_base: %lx, io_offset: %lx\n",
+ *offset, hose->io_base_virt, pci_io_base, io_offset);
*offset += io_offset;
res_bit = IORESOURCE_IO;
+ printk(" -> offset: %lx\n", *offset);
}

/*
@@ -373,12 +376,15 @@
continue;

/* In the range of this resource? */
+ printk(" r%d: %lx -> %lx\n", i, rp->start, rp->end);
if (*offset < (rp->start & PAGE_MASK) || *offset > rp->end)
continue;

/* found it! construct the final physical address */
- if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - io_offset;
+ if (mmap_state == pci_mmap_io) {
+ *offset += hose->io_base_phys - io_offset;
+ printk(" result: %lx\n", *offset);
+ }
return rp;
}

@@ -941,4 +947,22 @@
}
EXPORT_SYMBOL(pci_read_irq_line);

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_host(dev->bus);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = pci_io_base - (unsigned long)hose->io_base_virt +
+ hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
#endif /* CONFIG_PPC_MULTIPLATFORM */
Index: linux-work/drivers/pci/pci-sysfs.c
===================================================================
--- linux-work.orig/drivers/pci/pci-sysfs.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci-sysfs.c 2005-04-27 14:40:28.000000000 +1000
@@ -54,27 +54,30 @@

/* show resources */
static ssize_t
-resource_show(struct device * dev, char * buf)
+resources_show(struct device * dev, char * buf)
{
struct pci_dev * pci_dev = to_pci_dev(dev);
char * str = buf;
int i;
int max = 7;
+ u64 start, end;

if (pci_dev->subordinate)
max = DEVICE_COUNT_RESOURCE;

for (i = 0; i < max; i++) {
- str += sprintf(str,"0x%016lx 0x%016lx 0x%016lx\n",
- pci_resource_start(pci_dev,i),
- pci_resource_end(pci_dev,i),
- pci_resource_flags(pci_dev,i));
+ struct resource *res = &pci_dev->resource[i];
+ pci_resource_to_user(pci_dev, i, res, &start, &end);
+ str += sprintf(str,"0x%02x 0x%016llx 0x%016llx 0x%016llx\n", i,
+ (unsigned long long)start,
+ (unsigned long long)end,
+ (unsigned long long)res->flags);
}
return (str - buf);
}

struct device_attribute pci_dev_attrs[] = {
- __ATTR_RO(resource),
+ __ATTR_RO(resources),
__ATTR_RO(vendor),
__ATTR_RO(device),
__ATTR_RO(subsystem_vendor),
@@ -267,8 +270,21 @@
struct device, kobj));
struct resource *res = (struct resource *)attr->private;
enum pci_mmap_state mmap_type;
+ u64 start, end;
+ int i;
+
+ for (i = 0; i < PCI_ROM_RESOURCE; i++)
+ if (res == &pdev->resource[i])
+ break;
+ if (i >= PCI_ROM_RESOURCE)
+ return -ENODEV;

- vma->vm_pgoff += res->start >> PAGE_SHIFT;
+ /* pci_mmap_page_range() expects the same kind of entry as coming
+ * from /proc/bus/pci/ which is a "user visible" value. If this is
+ * different from the resource itself, arch will do necessary fixup.
+ */
+ pci_resource_to_user(pdev, i, res, &start, &end);
+ vma->vm_pgoff += start >> PAGE_SHIFT;
mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io;

return pci_mmap_page_range(pdev, vma, mmap_type, 0);
Index: linux-work/include/asm-ppc64/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc64/pci.h 2005-04-24 11:38:56.000000000 +1000
+++ linux-work/include/asm-ppc64/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -135,6 +135,11 @@
unsigned long offset,
unsigned long size,
pgprot_t prot);
+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+


#endif /* __KERNEL__ */
Index: linux-work/drivers/pci/proc.c
===================================================================
--- linux-work.orig/drivers/pci/proc.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/proc.c 2005-04-27 14:34:18.000000000 +1000
@@ -354,14 +354,20 @@
dev->device,
dev->irq);
/* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
- for(i=0; i<7; i++)
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
- dev->resource[i].start |
+ ((unsigned long)start) |
(dev->resource[i].flags & PCI_REGION_FLAG_MASK));
- for(i=0; i<7; i++)
+ }
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
dev->resource[i].start < dev->resource[i].end ?
- dev->resource[i].end - dev->resource[i].start + 1 : 0);
+ (unsigned long)(end - start) + 1 : 0);
+ }
seq_putc(m, '\t');
if (drv)
seq_printf(m, "%s", drv->name);
Index: linux-work/drivers/pci/pci.c
===================================================================
--- linux-work.orig/drivers/pci/pci.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci.c 2005-04-27 14:34:18.000000000 +1000
@@ -770,7 +770,7 @@
return 0;
}
#endif
-
+
static int __devinit pci_init(void)
{
struct pci_dev *dev = NULL;
Index: linux-work/include/linux/pci.h
===================================================================
--- linux-work.orig/include/linux/pci.h 2005-04-24 11:39:20.000000000 +1000
+++ linux-work/include/linux/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -1017,6 +1017,21 @@
#define pci_pretty_name(dev) ""
#endif

+
+/* Some archs don't want to expose struct resource to userland as-is
+ * in sysfs and /proc
+ */
+#ifndef HAVE_ARCH_PCI_RESOURCE_TO_USER
+static void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ *start = rsrc->start;
+ *end = rsrc->end;
+}
+#endif /* HAVE_ARCH_PCI_RESOURCE_TO_USER */
+
+
/*
* The world is not perfect and supplies us with broken PCI devices.
* For at least a part of these bugs we need a work-around, so both
Index: linux-work/include/asm-ppc/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc/pci.h 2005-04-24 11:38:54.000000000 +1000
+++ linux-work/include/asm-ppc/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -103,6 +103,12 @@
unsigned long size,
pgprot_t prot);

+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+
+
#endif /* __KERNEL__ */

#endif /* __PPC_PCI_H */
Index: linux-work/arch/ppc/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc/kernel/pci.c 2005-04-27 14:42:34.000000000 +1000
@@ -1495,7 +1495,7 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = hose->io_base_virt - ___IO_BASE;
*offset += io_offset;
res_bit = IORESOURCE_IO;
}
@@ -1522,7 +1522,7 @@

/* found it! construct the final physical address */
if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - _IO_BASE;
+ *offset += hose->io_base_phys - io_offset;
return rp;
}

@@ -1739,6 +1739,23 @@
return result;
}

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_hose(dev->bus->number);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = ___IO_BASE - hose->io_base_virt + hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
void __init
pci_init_resource(struct resource *res, unsigned long start, unsigned long end,
int flags, char *name)


2005-04-28 05:31:11

by Grant Grundler

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, Apr 28, 2005 at 09:13:36AM +1000, Benjamin Herrenschmidt wrote:
> (RESENT: Not sure it actually made it yesterday, sorry if you got it twice)

np - got the previous one too...just needed some time to think about it
and write a not-too-long response.

> > No. I don't agree. userspace has no business understanding the kernel
> > resources content. All userspace need to care about is the resource
> > number, and _eventually_ match it to a BAR value (either for using the
> > old /proc interface -> existing code, or to use it with real inx/outx
> > instructions on x86).
>
> Ok, let me correct myself here... I think you _are_ right, but I also
> think that pci_dev->resource[] is _not_ a CPU address that can be
> exposed to userland, it's is a kernel internal "cookie" whose meaning is
> arch specific.
>
> What we need here is a way to go from pci_dev->resource to some CPU
> address (typically in most archs that would result in something that can
> be fed through /dev/mem, though that is just an example, doing so is
> definitely not recommended :)

Yes, I think that is the key issue. It just needs to be "adjusted" for
whoever the consumer is (e.g X.org) and how they expect to use it.

In practice, I have used mmap'd (uncached) address space to
access chipset registers from user space. Very easy to test
different parameters which is what I needed for my OLS 2003 paper:
http://iou.parisc-linux.org/dma-hints/

Hypothetically, such "platform specific tuning" could be done
from userspace if it doesn't require special syncronization
or a reboot.

My use was a much simpler use than the User Space drivers work
done by Peter Chubb at UNSW:
http://www.gelato.unsw.edu.au/publications.html

I think you could have a very interesting conversation with
Peter about what user driver model needs exported and how
it could be used.

> Additionally, there are a few others issues:
>
> - mmap'ing resourceX vs. alignment. I have this video card with a bunch
> of IOs at 0x400, so mmap'ing it will return something that maps the
> first page of IO space (from 0). The userland program must be able to
> recalculate the necessary alignement to know that it has to add 0x400 to
> the mapping obtained via mmap.

Your description seems to confuse "alignment" with "offset".

I would expect the mmap() return value to point at the base of
whatever thing it is that I handed it. And everything is relative
to that within the range that I ask be mmap'd.

> There is nothing in the API preventing it, that is, the user program
> can get that via array exposed in
> "resource" (or rather "resources" as I intend to rename it) ... provided
> those are really CPU _addresses_ and not some kind of weird tokens
> (which they are on some archs).

If it's a token, the arch specific mmap() will know how to deal with it.
parisc does that with IO Port space(s) in the kernel.

> In the case of IO space, there is the
> additional requirement of actually knowing the real IO address for use
> with inX/outX instructions on architectures that have those.
>
> - What about architectures that have 32 bits core but > 32 bits
> physical address space, like ppc32 with 440 or e500 processors where
> IO/MMIO space is above 4gb ? The proper fix here is to have struct
> resource to be 64 bits fields but we aren't there ...

This is similar to davem's reminder about 32-bit user space on 64-bit
platform. I expect some form of token will need to be used if user
space can't be taught/forced to use 64-bit values coming from
either /sys or /proc. It's probably best to leave /proc untouched
and only mangle /sys so it always prints 64-bit values.


> - What is supposed to be passed to /proc/bus/pci mmap ? Should it be a
> BAR value or the output of /proc/bus/pci/devices ? In the later case,
> well, we are exposing the same crap as sysfs, that is a raw kernel
> struct resource, which doesn't make sense out of the kernel. So I'm
> tempted to "fix" /proc/bus/pci/devices the same way (see below) that I'm
> fixing /sys/bus/pci/*/resources to pass a CPU physical address. (this
> will only affect ppc[64] or other archs using my new callback anyway).
>
> So what do that mean in practice ? Well, a few things :
>
> 1 - As described above, we need a kind of pci_resource_to_user() arch
> hook so that struct resource can be converted in something that makes
> sense to userland, typically a CPU physical address. It would be easy to
> have a default implementation just using the resource "as-is" as it does
> today, so we only have to fixup archs for which the value in there is
> meaningless outside of the kernel (typically, ppc & ppc64 for IO
> resources)


Can you rename the function to pcibios_resource_to_user() ?

I would like the name to indicate this is an arch specific function
that should only be used by PCI subsystem.
Or did you have other in-kernel callers in mind?


> 2 - The above pci_resource_to_user() might probably need to be defined
> as returning a 64 bits value, and for consistency, we should probably
> always display 64 bits in "resources" to handle the case of 32 bits CPUs
> with > 32 bits IO space.

Yes. And an example in Documentation/filesystems/sysfs-pci.txt of
intended use would probably help alot too.

> 3 - pci_mmap_page_range() beeing called by both sysfs mmap and proc
> mmap, we must clarify what it's "offset" parameter really is.

I didn't know anything about pci_mmap_page_range().
parisc and alpha don't implement it. And both translate the IO View
(BAR values) to CPU view (resource) for MMIO space.

> Is it PCI BAR value or is it a CPU physical address?
> It was defined for the /proc interface, was it ever documented.

sys-fs support also uses it:

grundler <533>fgrep HAVE_PCI_MMAP drivers/pci/*
drivers/pci/pci-sysfs.c:#ifdef HAVE_PCI_MMAP
drivers/pci/pci-sysfs.c:#else /* !HAVE_PCI_MMAP */
drivers/pci/pci-sysfs.c:#endif /* HAVE_PCI_MMAP */

Documentation/filesystems/sysfs-pci.txt at least mentions which
parameters can be passed to mmap and what the arch must provide.

> I though it wanted a BAR value but
> Jesse seems to think it wants something out of /proc/bus/pci/devices.

I agree with Jesse - mmap should want the CPU view of the world.

> The former would mean that either pci-sysfs.c must "convert" the offset
> to a BAR value before calling pci_mmap_page_range(), or we must define a
> different interface since they are non-consistent, but the later means
> they are consistant (which is cool). In both case, current ppc
> implementation is broken.
>
> 4 - additionally, to avoid future compatibility issues, I added the
> resource number to the output of /sys/bus/pci/*/resources

What is the expected use for the resource number?
I don't understand what compatibility issue you are trying to avoid.

Somehow, I'm not keen on seeing the resource number there.
Since all of the resources are printed regardless of value,
it's easy enough to derive the resource number in case
someone does need it.

> The patch below implements this, with 1 - defaulting to just using the
> resource as-is, plus a pair of imlpementations for ppc and ppc64, with 2
> - returning an u64, and with 3 - taking the approach that /proc/bus/pci
> expects offsets from /proc/bus/pci/devices (and not BAR values), that
> means the interface stays consistent between proc and sysfs (pfiew !)

Yes - that would be good.

> Additional misc question: The new interface doesn't allow to pass in any
> "write combine" parametre like /proc did. In fact, this paremeter was a
> bit limitative, since archs tend to be more varied when it comes to
> relaxed ordering. On ppc & ppc64, I only use this as a hint, and the
> pci_mmap_page_range() implementation (and /dev/mem too, so X benefits
> from it) will in fact use the "prefetchable" attribute of the device
> resource to decide wetehr to mark the space "guarded" or not. "guarded"
> is a PPC specific attribute that disable things like speculation, but in
> practice, also seem to disable write combining on some CPUs. I was
> wondering wether it would make sense to haev pci-sysfs.c pass 1 for
> write_combine when the resource beeing mapped has the prefetchable bit ?

If it's prefetchable, won't the reads/writes automatically be combined?
Since I equate "prefetchable" == "cacheable", I'd think anything
is fair game.

Legacy frame buffers want write combining for efficienct use of
PCI bus bandwidth transferring data to the card. I don't know
if other devices are as performance sensitive or even want any
sort of write combining.

> Ok, here's the patch for comments. It's really on for comments now, I
> has some nasty printk's in there, some useless hunks, etc... :)
>
> So don't comment on these please :)

Ok. :^)

thanks,
grant

2005-04-28 05:49:02

by David Miller

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Wed, 27 Apr 2005 23:33:11 -0600
Grant Grundler <[email protected]> wrote:

> I would expect the mmap() return value to point at the base of
> whatever thing it is that I handed it. And everything is relative
> to that within the range that I ask be mmap'd.

The 'offset' argument is defined to be page aligned
when passed to mmap().

> If it's a token, the arch specific mmap() will know how to deal with it.
> parisc does that with IO Port space(s) in the kernel.

Yes, if the token goes in as the offset parameter to mmap() then
whatever ->mmap() code we write can call arch specific code to
transform it as necessary.

> This is similar to davem's reminder about 32-bit user space on 64-bit
> platform. I expect some form of token will need to be used if user
> space can't be taught/forced to use 64-bit values coming from
> either /sys or /proc. It's probably best to leave /proc untouched
> and only mangle /sys so it always prints 64-bit values.

Let's make sys use 64-bit, yes.

> I didn't know anything about pci_mmap_page_range().
> parisc and alpha don't implement it. And both translate the IO View
> (BAR values) to CPU view (resource) for MMIO space.

It's been around for ages, and it used in the X server on PPC
and Sparc. It mostly allows handling of multi-domain stuff.
Unfortunately, the $DOMAIN:xxx directory naming change we made
in 2.6.x for /proc/pci stuff broke the X server at least on
sparc64 :-/

> sys-fs support also uses it:
>
> grundler <533>fgrep HAVE_PCI_MMAP drivers/pci/*
> drivers/pci/pci-sysfs.c:#ifdef HAVE_PCI_MMAP
> drivers/pci/pci-sysfs.c:#else /* !HAVE_PCI_MMAP */
> drivers/pci/pci-sysfs.c:#endif /* HAVE_PCI_MMAP */
>
> Documentation/filesystems/sysfs-pci.txt at least mentions which
> parameters can be passed to mmap and what the arch must provide.

I hate to say this, but the largest consumer of this stuff is the
X server, so we really need to force ourselves to work in parallel
on clean X server support. Whether that's via some libpci.a
abstraction or whatever, I personally don't care, but without the
X support in some form all of this is API masterbation :-)

> If it's prefetchable, won't the reads/writes automatically be combined?
> Since I equate "prefetchable" == "cacheable", I'd think anything
> is fair game.

On many platforms some kind of "side effect" bit in the PTE
determine if store buffer compression can happen in the processor.
We'd want to not set such a bit for things like frame-buffers and
the like.

2005-04-28 06:38:09

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)


> Yes, I think that is the key issue. It just needs to be "adjusted" for
> whoever the consumer is (e.g X.org) and how they expect to use it.

I think the key issue is that normally (read: programmatically), user
space should not care. It should either peek the value
in /proc/bus/pci/devices and toss that into mmap, or use the "resourceX"
file representing a BAR. The problem still beeing alignement with BARs,
that require a bit more knowledge, most notably that the values are
actually meaningful as some kind of addresses, at least at the sub-page
boundary. (see below, I do mean _alignment_)

The only additional "issue" here, is that for processors with IO
instructions, that value should be able to be tossed directly into those
IO instructions. That is, those platform have a constraint on what this
user value should be for IO space, but that is fine. It's a platform
constraint.

Having it additionally be "meaninful" by itself (ie. a CPU physical
address) is just a matter of your platform wanting to be nice with ppl
debugging things in userland etc... :) Or using occasionally /dev/mem
because they have some existing hackish tools doing so etc... :)

> .../..
>
> I think you could have a very interesting conversation with
> Peter about what user driver model needs exported and how
> it could be used.

I would like Peter to show up here and say what he thinks of my
proposal. The current stuff is broken anyway, so I'm considering pushing
this patch for 2.6.12. The patch will have 0 effect on non-ppc unless
non-ppc maintainers decide to push a custom implementation of
pci_resource_to_user() anyway.

> > Additionally, there are a few others issues:
> >
> > - mmap'ing resourceX vs. alignment. I have this video card with a bunch
> > of IOs at 0x400, so mmap'ing it will return something that maps the
> > first page of IO space (from 0). The userland program must be able to
> > recalculate the necessary alignement to know that it has to add 0x400 to
> > the mapping obtained via mmap.
>
> Your description seems to confuse "alignment" with "offset".

No, I don't mean offset (as in within the mmap), but alignment of the
resource itself.

> I would expect the mmap() return value to point at the base of
> whatever thing it is that I handed it. And everything is relative
> to that within the range that I ask be mmap'd.

mmap is defined to always request page alignement, both for arguments
and return value. I suppose we could hack around and have it map the
whole page, then return an offset'ed value, but I'm not sure this is
very good practice to hack the defined semantics of a POSIX call.

> > There is nothing in the API preventing it, that is, the user program
> > can get that via array exposed in
> > "resource" (or rather "resources" as I intend to rename it) ... provided
> > those are really CPU _addresses_ and not some kind of weird tokens
> > (which they are on some archs).
>
> If it's a token, the arch specific mmap() will know how to deal with it.
> parisc does that with IO Port space(s) in the kernel.

Yes and no ... Anyway, I want to provide something "nicer" than this
kernel specific "token" on ppc, via pci_resource_to_user(), and thus
expect something of that kind to come back to me via mmap. In the case
of the above, there is still this alignent/offset problem to deal with
which, imho, requires those "tokens" to at least represent addresses for
the low X bits (X = PAGE_SHIFT of the arch).

> This is similar to davem's reminder about 32-bit user space on 64-bit
> platform. I expect some form of token will need to be used if user
> space can't be taught/forced to use 64-bit values coming from
> either /sys or /proc. It's probably best to leave /proc untouched
> and only mangle /sys so it always prints 64-bit values.

Well, /proc is broken, and I don't intend to fix it. sysfs is nice
enough to expose 64 bits values in "resource"/"resources" already.

> Can you rename the function to pcibios_resource_to_user() ?

Yes, I can, but I though we were going away from pcibios_* naming ?

> I would like the name to indicate this is an arch specific function
> that should only be used by PCI subsystem.
> Or did you have other in-kernel callers in mind?

Not specifically in mind, no.

> > 2 - The above pci_resource_to_user() might probably need to be defined
> > as returning a 64 bits value, and for consistency, we should probably
> > always display 64 bits in "resources" to handle the case of 32 bits CPUs
> > with > 32 bits IO space.
>
> Yes. And an example in Documentation/filesystems/sysfs-pci.txt of
> intended use would probably help alot too.

Yup. Will do.

> > 3 - pci_mmap_page_range() beeing called by both sysfs mmap and proc
> > mmap, we must clarify what it's "offset" parameter really is.
>
> I didn't know anything about pci_mmap_page_range().
> parisc and alpha don't implement it. And both translate the IO View
> (BAR values) to CPU view (resource) for MMIO space.

Which means that the only way to mmap PCI space is via /dev/mem right ?
In this case, indeed, your resources must be CPU physical addresses.

> > Is it PCI BAR value or is it a CPU physical address?
> > It was defined for the /proc interface, was it ever documented.
>
> sys-fs support also uses it:

I know it does, that's exactly what I wrote above :) It's used by both,
and I want to make sure it is consistent.

> grundler <533>fgrep HAVE_PCI_MMAP drivers/pci/*
> drivers/pci/pci-sysfs.c:#ifdef HAVE_PCI_MMAP
> drivers/pci/pci-sysfs.c:#else /* !HAVE_PCI_MMAP */
> drivers/pci/pci-sysfs.c:#endif /* HAVE_PCI_MMAP */
>
> Documentation/filesystems/sysfs-pci.txt at least mentions which
> parameters can be passed to mmap and what the arch must provide.

Yah, well, it's not terribly verbose.

> > I though it wanted a BAR value but
> > Jesse seems to think it wants something out of /proc/bus/pci/devices.
>
> I agree with Jesse - mmap should want the CPU view of the world.

I tend to agree, which is why I introduce this pci_resource_to_user() to
turn the kernel "tokens" into that (the thing we use for IO space
resources on ppc are really nothing you want to ever see exposed to
userland).

> What is the expected use for the resource number?
> I don't understand what compatibility issue you are trying to avoid.

Matching an entry in "resource" (or "resources" as my patch renames it)
and the resourceX mmap'able file. Since the first one need to be parsed
to figure out the alignement issue, and for other reasons where the user
may be interested (resource flags ? whatever else ..)

Oh well, now that I double check the code, we always display all
resources in order, so that should be ok without a resource number ...
An intermediate version of my patch had pci_resource_to_user() capable
of failing in which case I would have "skipped" the resource. So I
suppose we don't need to change anything to the actual format of
"resource", which means i probably don't need to rename it neither,
though I still don't like the name beeing singular :)

> Somehow, I'm not keen on seeing the resource number there.
> Since all of the resources are printed regardless of value,
> it's easy enough to derive the resource number in case
> someone does need it.

Yes, they are... they weren't with some intermediate patch of mine that
I was toying with, which is why I introduced the resource number, but in
fact, that is not necessary.

> If it's prefetchable, won't the reads/writes automatically be combined?
> Since I equate "prefetchable" == "cacheable", I'd think anything
> is fair game.

No, prefetchable != cacheable. If you mmap your framebuffer cacheable,
good luck with X video drivers & acceleration for example....

However, prefetchable allows use of more relaxed mapping attributes that
don't enforce ordering etc... On ppc, if I map with _PAGE_NOCACHE, I
have the option of also using _PAGE_GUARDED, which enforces ordering,
disable writecombining (on some CPUs at least) and prevents speculative
access (and I suppose prefetch, though I doubt I get much prefetch on
non-cacheable space), or not. In fact, I'd say the opposite ... I have
the option of not setting _PAGE_GUARDED. I currently do that
transparently for /proc/bus/pci mmap, /sys pci mmap and /dev/mem if it
hits a PCI BAR that is marked prefetchable. Most ppc hardware don't
additionally have separate prefetchable HW ranges (MTRR like thing).

> Legacy frame buffers want write combining for efficienct use of
> PCI bus bandwidth transferring data to the card. I don't know
> if other devices are as performance sensitive or even want any
> sort of write combining.

Yes, but I think that pretty much fit with the presence of a
prefetchable bit, which typically mean no side effects... it's not
exactly the same concept, but in practice, ti's a good match.

> > Ok, here's the patch for comments. It's really on for comments now, I
> > has some nasty printk's in there, some useless hunks, etc... :)
> >
> > So don't comment on these please :)
>
> Ok. :^)
>
> thanks,
> grant
--
Benjamin Herrenschmidt <[email protected]>

2005-04-28 06:40:34

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Wed, 2005-04-27 at 22:37 -0700, David S. Miller wrote:
> On Wed, 27 Apr 2005 23:33:11 -0600

> The 'offset' argument is defined to be page aligned
> when passed to mmap().

mmap API in general is defined to only ever deal with page aligned
parameters & return values no ?

> > If it's a token, the arch specific mmap() will know how to deal with it.
> > parisc does that with IO Port space(s) in the kernel.
>
> Yes, if the token goes in as the offset parameter to mmap() then
> whatever ->mmap() code we write can call arch specific code to
> transform it as necessary.

Except from that page alignment thing ... which is the root of the
problem.

> It's been around for ages, and it used in the X server on PPC
> and Sparc. It mostly allows handling of multi-domain stuff.
> Unfortunately, the $DOMAIN:xxx directory naming change we made
> in 2.6.x for /proc/pci stuff broke the X server at least on
> sparc64 :-/

In fact, I'm fairly sure X still uses /dev/mem on ppc :( Oh well...

> I hate to say this, but the largest consumer of this stuff is the
> X server, so we really need to force ourselves to work in parallel
> on clean X server support.

Cleaning X.org is my goal, this is why I'm trying to clean the kernel
side first :) I'm also working separately on the problem of VGA access
arbitration (We'll probably do a joint session with the desktop summit
an the kernel summit about those issue).

> Whether that's via some libpci.a
> abstraction or whatever, I personally don't care, but without the
> X support in some form all of this is API masterbation :-)

It is :) Userland driver stuff is a consumer too though.

> > If it's prefetchable, won't the reads/writes automatically be combined?
> > Since I equate "prefetchable" == "cacheable", I'd think anything
> > is fair game.
>
> On many platforms some kind of "side effect" bit in the PTE
> determine if store buffer compression can happen in the processor.
> We'd want to not set such a bit for things like frame-buffers and
> the like.

Yes, and I think that pretty much match with PCI devices exposing a
"prefetchable" BAR, don't you agree ?

Ben.


2005-04-28 07:00:25

by David Miller

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, 28 Apr 2005 16:39:13 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> On Wed, 2005-04-27 at 22:37 -0700, David S. Miller wrote:
> > On Wed, 27 Apr 2005 23:33:11 -0600
>
> > The 'offset' argument is defined to be page aligned
> > when passed to mmap().
>
> mmap API in general is defined to only ever deal with page aligned
> parameters & return values no ?

Yes.

> Except from that page alignment thing ... which is the root of the
> problem.

You can hide all of those problems in libpci.a or whatever.
You page align the offset, but pass back to the user a pointer
with his sub-page offset applied to it.

The kernel wants pages, so just give it pages :-)

> > I hate to say this, but the largest consumer of this stuff is the
> > X server, so we really need to force ourselves to work in parallel
> > on clean X server support.
>
> Cleaning X.org is my goal, this is why I'm trying to clean the kernel
> side first :) I'm also working separately on the problem of VGA access
> arbitration (We'll probably do a joint session with the desktop summit
> an the kernel summit about those issue).

Yeah, that one is all about enabling VGA forwarding in the bridges.

Taking out all of the resource garbage in the X server, and replacing
it with properly synchronized calls in the kernel for mapping ROMs
and changing the current VGA forwarding seems to be the way to go.

> > On many platforms some kind of "side effect" bit in the PTE
> > determine if store buffer compression can happen in the processor.
> > We'd want to not set such a bit for things like frame-buffers and
> > the like.
>
> Yes, and I think that pretty much match with PCI devices exposing a
> "prefetchable" BAR, don't you agree ?

Some scsi controllers have prefetchable set in their normal
register BARs. The sym53c8xx does if I remember correctly.

Anyways, what I'm trying to say is that blinding turning prefetchable
BAR into "don't set side effect bit in PTE" might not be so wise.

I really think it's a userlevel decision. That's where all the ioctl()
garbage came from for the /proc/bus/pci mmap() stuff. It was for chossing
IO vs MEM space, and also for setting these kinds of mapping attributes.

2005-04-28 07:23:11

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)


> Yes.
>
> > Except from that page alignment thing ... which is the root of the
> > problem.
>
> You can hide all of those problems in libpci.a or whatever.
> You page align the offset, but pass back to the user a pointer
> with his sub-page offset applied to it.
>
> The kernel wants pages, so just give it pages :-)

Oh, yes, I agree, but I don't want libpci to contain too much arch
specific code, so we must may sure that libpci has the mean of "knowing"
how to deal with that. When you mmap "resource0", you can't know that
there is an alignement/offset issue and that what you'll be getting is
actually not pointing to your actual resource but somewhere ... below.

In order to know that, libpci need to know the actual physical pointer
that was used by the kernel for the mmap before the page alignment so it
can figure out what offset to apply. So that is why I'm saying that
whatever we expose in "resources" and/or "/proc/bus/pci/devices" should
at be a CPU physical address, or at least contain the low PAGE_SHIFT
bits of it... even if the rest is a token...

> > Cleaning X.org is my goal, this is why I'm trying to clean the kernel
> > side first :) I'm also working separately on the problem of VGA access
> > arbitration (We'll probably do a joint session with the desktop summit
> > an the kernel summit about those issue).
>
> Yeah, that one is all about enabling VGA forwarding in the bridges.

Oh, not only that ... also playing ping pong when several cards are
there, tracking who is having the ownership, but also allowing the
driver who can disable legacy decodign on the card to inform the arbiter
so that card stop beeing bothered and can keep interrupts enabled at all
time without bothering etc.. etc.. etc.. that includes fixing the kernel
vga console to deal with an arbiter, some fbdev stuffs as well, and
more.. an finally adapting userland things like X to use it.

The actual arbiter core itself is easy. Making things play nice with it
is the difficult part. I'll come up with various solutions that I wnat
to experience with before KS/DS and will present my results there so we
can decide what to do.

> Taking out all of the resource garbage in the X server, and replacing
> it with properly synchronized calls in the kernel for mapping ROMs
> and changing the current VGA forwarding seems to be the way to go.

It is, and the X.org people do agree, provided there is no loss in
functionality.

> Some scsi controllers have prefetchable set in their normal
> register BARs. The sym53c8xx does if I remember correctly.

Hrm... do we have to worry about somebody in userland mmap'ing it's
register via /sys/*pci and breaking because of that ? :) I have a real
net big performance improvement on X by doing that trick ... the sysfs
mmap API doesn't really provide a mean to do it explicitely from
userland (unlike the ioctl with the old proc api)

> Anyways, what I'm trying to say is that blinding turning prefetchable
> BAR into "don't set side effect bit in PTE" might not be so wise.

Well, I'm still trying to find a case that would break. I'm not doing
that for in-kernel mapping. Oh well, it's done anyway, I'll see what
reports I'm getting from it...

> I really think it's a userlevel decision. That's where all the ioctl()
> garbage came from for the /proc/bus/pci mmap() stuff. It was for chossing
> IO vs MEM space, and also for setting these kinds of mapping attributes.

I know. I just don't see how to replace it with sysfs ... maybe mmap
pgprot flags...

Ben.


2005-04-28 07:31:27

by David Miller

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, 28 Apr 2005 17:21:19 +1000
Benjamin Herrenschmidt <[email protected]> wrote:

> I have a real net big performance improvement on X by doing that
> trick ... the sysfs mmap API doesn't really provide a mean to do
> it explicitely from userland (unlike the ioctl with the old proc api)

You can refine your test to "if PCI class is display or VGA" and the
prefetchability is set in the BAR, then elide the guard PTE
protection bit.

2005-04-28 07:49:10

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, 2005-04-28 at 00:22 -0700, David S. Miller wrote:
> On Thu, 28 Apr 2005 17:21:19 +1000
> Benjamin Herrenschmidt <[email protected]> wrote:
>
> > I have a real net big performance improvement on X by doing that
> > trick ... the sysfs mmap API doesn't really provide a mean to do
> > it explicitely from userland (unlike the ioctl with the old proc api)
>
> You can refine your test to "if PCI class is display or VGA" and the
> prefetchability is set in the BAR, then elide the guard PTE
> protection bit.

Yes, but I was also thinking about some of those Myrinet kind of things
who also provide large PCI shared memory region ...

I may end up adding an explicit list of classes/vid/did that are
"allowed" to use the trick to avoid problems.

Ben.


2005-04-28 15:09:35

by Grant Grundler

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Wed, Apr 27, 2005 at 11:50:56PM -0700, David S. Miller wrote:
> > Yes, and I think that pretty much match with PCI devices exposing a
> > "prefetchable" BAR, don't you agree ?
>
> Some scsi controllers have prefetchable set in their normal
> register BARs. The sym53c8xx does if I remember correctly.

None of the ones I have here seem to.
Maybe someone else does?

grundler <504>lspci -vd 1000:
0000:00:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07)
Flags: bus master, medium devsel, latency 128, IRQ 18
I/O ports at 0100 [size=256]
Memory at f8020000 (64-bit, non-prefetchable) [size=1K]
Memory at f8040000 (64-bit, non-prefetchable) [size=8K]
Capabilities: <available only to root>

0000:00:01.1 SCSI storage controller: LSI Logic / Symbios Logic 53C896/897 (rev 07)
Flags: bus master, medium devsel, latency 128, IRQ 19
I/O ports at 0200 [size=256]
Memory at f8007000 (64-bit, non-prefetchable) [size=1K]
Memory at f8002000 (64-bit, non-prefetchable) [size=8K]
Capabilities: <available only to root>

0000:00:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
Flags: bus master, medium devsel, latency 128, IRQ 19
I/O ports at 0300 [size=256]
Memory at f8008000 (32-bit, non-prefetchable) [size=256]
Memory at f8001000 (32-bit, non-prefetchable) [size=4K]

0000:00:02.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
Flags: bus master, medium devsel, latency 128, IRQ 20
I/O ports at 0400 [size=256]
Memory at f8009000 (32-bit, non-prefetchable) [size=256]
Memory at f8004000 (32-bit, non-prefetchable) [size=4K]

0000:20:00.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz Ultra3 SCSI Adapter (rev 01)
Subsystem: LSI Logic / Symbios Logic: Unknown device 1020
Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 24
I/O ports at 20000 [size=256]
Memory at fa044000 (64-bit, non-prefetchable) [size=1K]
Memory at fa040000 (64-bit, non-prefetchable) [size=8K]
Expansion ROM at fa000000 [disabled] [size=128K]
Capabilities: <available only to root>

0000:20:00.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1010 66MHz Ultra3 SCSI Adapter (rev 01)
Subsystem: LSI Logic / Symbios Logic: Unknown device 1020
Flags: bus master, 66MHz, medium devsel, latency 128, IRQ 25
I/O ports at 20100 [size=256]
Memory at fa045000 (64-bit, non-prefetchable) [size=1K]
Memory at fa042000 (64-bit, non-prefetchable) [size=8K]
Expansion ROM at fa020000 [disabled] [size=128K]
Capabilities: <available only to root>

0000:30:02.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
Subsystem: Hewlett-Packard Company: Unknown device 12c5
Flags: 66MHz, medium devsel, IRQ 27
I/O ports at 30000 [disabled] [size=256]
Memory at fb200000 (64-bit, non-prefetchable) [disabled] [size=128K]
Memory at fb220000 (64-bit, non-prefetchable) [disabled] [size=128K]
Expansion ROM at fb000000 [disabled] [size=1M]
Capabilities: <available only to root>

0000:30:02.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
Subsystem: Hewlett-Packard Company: Unknown device 12c5
Flags: 66MHz, medium devsel, IRQ 28
I/O ports at 30100 [disabled] [size=256]
Memory at fb240000 (64-bit, non-prefetchable) [disabled] [size=128K]
Memory at fb260000 (64-bit, non-prefetchable) [disabled] [size=128K]
Expansion ROM at fb100000 [disabled] [size=1M]
Capabilities: <available only to root>


> Anyways, what I'm trying to say is that blinding turning prefetchable
> BAR into "don't set side effect bit in PTE" might not be so wise.
>
> I really think it's a userlevel decision. That's where all the ioctl()
> garbage came from for the /proc/bus/pci mmap() stuff. It was for chossing
> IO vs MEM space, and also for setting these kinds of mapping attributes.

Well, if it's a device driver decision, I guess that's ok.
And the primary device driver happens to live in user space in X.org case.

hth,
grant

2005-04-28 22:48:53

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, 2005-04-28 at 09:11 -0600, Grant Grundler wrote:

>
> > Anyways, what I'm trying to say is that blinding turning prefetchable
> > BAR into "don't set side effect bit in PTE" might not be so wise.
> >
> > I really think it's a userlevel decision. That's where all the ioctl()
> > garbage came from for the /proc/bus/pci mmap() stuff. It was for chossing
> > IO vs MEM space, and also for setting these kinds of mapping attributes.
>
> Well, if it's a device driver decision, I guess that's ok.
> And the primary device driver happens to live in user space in X.org case.

Agreed, but 1) Do you have an idea on how to expose this capability with
the sysfs interface ? Adding ioctl's to it would suck big time :) and 2)
It's still nice to have a "workaround" for existing X since the
performance benefit is significant, but then, it's in arch code, so
that's fine (and I could indeed limit it to VGA class devices as David
suggests).

Ben.


2005-04-28 23:36:03

by Grant Grundler

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Fri, Apr 29, 2005 at 08:47:27AM +1000, Benjamin Herrenschmidt wrote:
> > Well, if it's a device driver decision, I guess that's ok.
> > And the primary device driver happens to live in user space in X.org case.
>
> Agreed, but 1) Do you have an idea on how to expose this capability with
> the sysfs interface ? Adding ioctl's to it would suck big time :) and

I don't know enough about VM/TLB stuff to know the right answer.

I suspect the MAP_* attribute/hint needs to be passed in together
with the mmap call if any arch (ia64?) would return a different
virtual address depending the attribute (e.g cached vs uncached).

And write combining might be done in a "layer" below the CPU in
the HW hierarchy. e.g. PCI Host bus controller might combine writes
for some MMIO regions. I don't know if arch specific mmap support
can figure out which HW is the right one to enable write combining
in for a particular MMIO region or PCI device.

I generally don't work with graphics devices and only recently
started poking at infiniband support (128-512MB BAR depending
on card option) to understand really well how BAR is accessed/used.

> 2) It's still nice to have a "workaround" for existing X since the
> performance benefit is significant, but then, it's in arch code, so
> that's fine (and I could indeed limit it to VGA class devices as David
> suggests).

Yup.

thanks,
grant

>
> Ben.
>

2005-04-29 15:52:30

by David Miller

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

On Thu, 28 Apr 2005 17:38:28 -0600
Grant Grundler <[email protected]> wrote:

> I suspect the MAP_* attribute/hint needs to be passed in together
> with the mmap call if any arch (ia64?) would return a different
> virtual address depending the attribute (e.g cached vs uncached).

The only problem could me getting the generic mmap() code to
properly pass the flag down into the driver, I seem to recall
that it either does an -EINVAL or masks out any flags which
are not in the standard set.

But then again this conflicts with what I remember seeing in the
XFree86 PCI support, in that IA64 passed in such a mmap() flag
to indicate a framebuffer like mapping that didn't need a guard-like
bit to be set.

Someone should look at the code to make sure :-)

2005-04-29 22:37:28

by Jesse Barnes

[permalink] [raw]
Subject: Re: pci-sysfs resource mmap broken (and PATCH)

David S. Miller <davem <at> davemloft.net> writes:
> The only problem could me getting the generic mmap() code to
> properly pass the flag down into the driver, I seem to recall
> that it either does an -EINVAL or masks out any flags which
> are not in the standard set.

But it would be a relatively clean solution, if a bit arch specific (i.e. some
arches would allow MAP_WRITECOMBINE or somesuch, while others have different
sorts of batching attributes).

> But then again this conflicts with what I remember seeing in the
> XFree86 PCI support, in that IA64 passed in such a mmap() flag
> to indicate a framebuffer like mapping that didn't need a guard-like
> bit to be set.

It used an ioctl last time I looked.

Jesse


2005-05-03 05:41:00

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: pci-sysfs resource mmap broken PATCH#2

Ok, here's my latest version. It goes back to "resource" instead of
"resources" since I finally didn't add the resource number. Since the
format doesn't change for most archs (only ppc/ppc64 for now will get
different values in there), I decided to keep it to it's old name.

So that patch should only impact ppc/ppc64 for now.

If everybody is ok tomorrow, I'll submit it upstream.

Index: linux-work/arch/ppc64/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc64/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc64/kernel/pci.c 2005-04-27 14:44:46.000000000 +1000
@@ -351,9 +351,12 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = (unsigned long)hose->io_base_virt - pci_io_base;
+ printk("offset: %lx, io_base_virt: %p, pci_io_base: %lx, io_offset: %lx\n",
+ *offset, hose->io_base_virt, pci_io_base, io_offset);
*offset += io_offset;
res_bit = IORESOURCE_IO;
+ printk(" -> offset: %lx\n", *offset);
}

/*
@@ -373,12 +376,15 @@
continue;

/* In the range of this resource? */
+ printk(" r%d: %lx -> %lx\n", i, rp->start, rp->end);
if (*offset < (rp->start & PAGE_MASK) || *offset > rp->end)
continue;

/* found it! construct the final physical address */
- if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - io_offset;
+ if (mmap_state == pci_mmap_io) {
+ *offset += hose->io_base_phys - io_offset;
+ printk(" result: %lx\n", *offset);
+ }
return rp;
}

@@ -941,4 +947,22 @@
}
EXPORT_SYMBOL(pci_read_irq_line);

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_host(dev->bus);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = pci_io_base - (unsigned long)hose->io_base_virt +
+ hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
#endif /* CONFIG_PPC_MULTIPLATFORM */
Index: linux-work/drivers/pci/pci-sysfs.c
===================================================================
--- linux-work.orig/drivers/pci/pci-sysfs.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci-sysfs.c 2005-04-27 14:40:28.000000000 +1000
@@ -54,27 +54,30 @@

/* show resources */
static ssize_t
-resource_show(struct device * dev, char * buf)
+resources_show(struct device * dev, char * buf)
{
struct pci_dev * pci_dev = to_pci_dev(dev);
char * str = buf;
int i;
int max = 7;
+ u64 start, end;

if (pci_dev->subordinate)
max = DEVICE_COUNT_RESOURCE;

for (i = 0; i < max; i++) {
- str += sprintf(str,"0x%016lx 0x%016lx 0x%016lx\n",
- pci_resource_start(pci_dev,i),
- pci_resource_end(pci_dev,i),
- pci_resource_flags(pci_dev,i));
+ struct resource *res = &pci_dev->resource[i];
+ pci_resource_to_user(pci_dev, i, res, &start, &end);
+ str += sprintf(str,"0x%02x 0x%016llx 0x%016llx 0x%016llx\n", i,
+ (unsigned long long)start,
+ (unsigned long long)end,
+ (unsigned long long)res->flags);
}
return (str - buf);
}

struct device_attribute pci_dev_attrs[] = {
- __ATTR_RO(resource),
+ __ATTR_RO(resources),
__ATTR_RO(vendor),
__ATTR_RO(device),
__ATTR_RO(subsystem_vendor),
@@ -267,8 +270,21 @@
struct device, kobj));
struct resource *res = (struct resource *)attr->private;
enum pci_mmap_state mmap_type;
+ u64 start, end;
+ int i;
+
+ for (i = 0; i < PCI_ROM_RESOURCE; i++)
+ if (res == &pdev->resource[i])
+ break;
+ if (i >= PCI_ROM_RESOURCE)
+ return -ENODEV;

- vma->vm_pgoff += res->start >> PAGE_SHIFT;
+ /* pci_mmap_page_range() expects the same kind of entry as coming
+ * from /proc/bus/pci/ which is a "user visible" value. If this is
+ * different from the resource itself, arch will do necessary fixup.
+ */
+ pci_resource_to_user(pdev, i, res, &start, &end);
+ vma->vm_pgoff += start >> PAGE_SHIFT;
mmap_type = res->flags & IORESOURCE_MEM ? pci_mmap_mem : pci_mmap_io;

return pci_mmap_page_range(pdev, vma, mmap_type, 0);
Index: linux-work/include/asm-ppc64/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc64/pci.h 2005-04-24 11:38:56.000000000 +1000
+++ linux-work/include/asm-ppc64/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -135,6 +135,11 @@
unsigned long offset,
unsigned long size,
pgprot_t prot);
+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+


#endif /* __KERNEL__ */
Index: linux-work/drivers/pci/proc.c
===================================================================
--- linux-work.orig/drivers/pci/proc.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/proc.c 2005-04-27 14:34:18.000000000 +1000
@@ -354,14 +354,20 @@
dev->device,
dev->irq);
/* Here should be 7 and not PCI_NUM_RESOURCES as we need to preserve compatibility */
- for(i=0; i<7; i++)
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
- dev->resource[i].start |
+ ((unsigned long)start) |
(dev->resource[i].flags & PCI_REGION_FLAG_MASK));
- for(i=0; i<7; i++)
+ }
+ for(i=0; i<7; i++) {
+ u64 start, end;
+ pci_resource_to_user(dev, i, &dev->resource[i], &start, &end);
seq_printf(m, LONG_FORMAT,
dev->resource[i].start < dev->resource[i].end ?
- dev->resource[i].end - dev->resource[i].start + 1 : 0);
+ (unsigned long)(end - start) + 1 : 0);
+ }
seq_putc(m, '\t');
if (drv)
seq_printf(m, "%s", drv->name);
Index: linux-work/drivers/pci/pci.c
===================================================================
--- linux-work.orig/drivers/pci/pci.c 2005-04-24 11:37:45.000000000 +1000
+++ linux-work/drivers/pci/pci.c 2005-04-27 14:34:18.000000000 +1000
@@ -770,7 +770,7 @@
return 0;
}
#endif
-
+
static int __devinit pci_init(void)
{
struct pci_dev *dev = NULL;
Index: linux-work/include/linux/pci.h
===================================================================
--- linux-work.orig/include/linux/pci.h 2005-04-24 11:39:20.000000000 +1000
+++ linux-work/include/linux/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -1017,6 +1017,21 @@
#define pci_pretty_name(dev) ""
#endif

+
+/* Some archs don't want to expose struct resource to userland as-is
+ * in sysfs and /proc
+ */
+#ifndef HAVE_ARCH_PCI_RESOURCE_TO_USER
+static void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ *start = rsrc->start;
+ *end = rsrc->end;
+}
+#endif /* HAVE_ARCH_PCI_RESOURCE_TO_USER */
+
+
/*
* The world is not perfect and supplies us with broken PCI devices.
* For at least a part of these bugs we need a work-around, so both
Index: linux-work/include/asm-ppc/pci.h
===================================================================
--- linux-work.orig/include/asm-ppc/pci.h 2005-04-24 11:38:54.000000000 +1000
+++ linux-work/include/asm-ppc/pci.h 2005-04-27 14:34:18.000000000 +1000
@@ -103,6 +103,12 @@
unsigned long size,
pgprot_t prot);

+#define HAVE_ARCH_PCI_RESOURCE_TO_USER
+extern void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end);
+
+
#endif /* __KERNEL__ */

#endif /* __PPC_PCI_H */
Index: linux-work/arch/ppc/kernel/pci.c
===================================================================
--- linux-work.orig/arch/ppc/kernel/pci.c 2005-04-24 11:37:38.000000000 +1000
+++ linux-work/arch/ppc/kernel/pci.c 2005-04-27 14:42:34.000000000 +1000
@@ -1495,7 +1495,7 @@
*offset += hose->pci_mem_offset;
res_bit = IORESOURCE_MEM;
} else {
- io_offset = (unsigned long)hose->io_base_virt;
+ io_offset = hose->io_base_virt - ___IO_BASE;
*offset += io_offset;
res_bit = IORESOURCE_IO;
}
@@ -1522,7 +1522,7 @@

/* found it! construct the final physical address */
if (mmap_state == pci_mmap_io)
- *offset += hose->io_base_phys - _IO_BASE;
+ *offset += hose->io_base_phys - io_offset;
return rp;
}

@@ -1739,6 +1739,23 @@
return result;
}

+void pci_resource_to_user(const struct pci_dev *dev, int bar,
+ const struct resource *rsrc,
+ u64 *start, u64 *end)
+{
+ struct pci_controller *hose = pci_bus_to_hose(dev->bus->number);
+ unsigned long offset = 0;
+
+ if (hose == NULL)
+ return;
+
+ if (rsrc->flags & IORESOURCE_IO)
+ offset = ___IO_BASE - hose->io_base_virt + hose->io_base_phys;
+
+ *start = rsrc->start + offset;
+ *end = rsrc->end + offset;
+}
+
void __init
pci_init_resource(struct resource *res, unsigned long start, unsigned long end,
int flags, char *name)