2006-12-12 22:05:38

by Roland Dreier

[permalink] [raw]
Subject: mapping PCI registers with write combining (and PAT on x86)...

Suppose that I would like to map some PIO registers (in a PCI BAR) to
userspace, and I would like to enable write combining if possible.

I have two problems. First, there's no generic interface for
requesting write combining if possible when doing
io_remap_pfn_range(). Would it make sense to define
pageprot_writecombine for all architectures (and make it fall back to
doing non-cached access if write combining is not possible)? And it
seems that making pgprot_noncached() universal wouldn't hurt either.

Second, given the extreme shortage of MTRRs, it seems that write
combining is not really possible in general on x86 without some
interface to use PATs instead. What is holding up something like Eric
Biederman's patches from going in?

Right now we end up with stuff like the absolutely hair-raising code
in drivers/video/fbmem.c shown below. I really would like to make
progress towards having a better interface for doing this stuff, and
I'm more than willing to work on getting something mergable.

#if defined(__mc68000__)
#if defined(CONFIG_SUN3)
pgprot_val(vma->vm_page_prot) |= SUN3_PAGE_NOCACHE;
#elif defined(CONFIG_MMU)
if (CPU_IS_020_OR_030)
pgprot_val(vma->vm_page_prot) |= _PAGE_NOCACHE030;
if (CPU_IS_040_OR_060) {
pgprot_val(vma->vm_page_prot) &= _CACHEMASK040;
/* Use no-cache mode, serialized */
pgprot_val(vma->vm_page_prot) |= _PAGE_NOCACHE_S;
}
#endif
#elif defined(__powerpc__)
vma->vm_page_prot = phys_mem_access_prot(file, off >> PAGE_SHIFT,
vma->vm_end - vma->vm_start,
vma->vm_page_prot);
#elif defined(__alpha__)
/* Caching is off in the I/O space quadrant by design. */
#elif defined(__i386__) || defined(__x86_64__)
if (boot_cpu_data.x86 > 3)
pgprot_val(vma->vm_page_prot) |= _PAGE_PCD;
#elif defined(__mips__) || defined(__sparc_v9__)
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#elif defined(__hppa__)
pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE;
#elif defined(__arm__) || defined(__sh__) || defined(__m32r__)
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
#elif defined(__ia64__)
if (efi_range_is_wc(vma->vm_start, vma->vm_end - vma->vm_start))
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
else
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#else
#warning What do we have to do here??
#endif
if (io_remap_pfn_range(vma, vma->vm_start, off >> PAGE_SHIFT,
vma->vm_end - vma->vm_start, vma->vm_page_prot))
return -EAGAIN;
return 0;

Thanks!
Roland


2006-12-12 22:47:58

by Eric W. Biederman

[permalink] [raw]
Subject: Re: mapping PCI registers with write combining (and PAT on x86)...

Roland Dreier <[email protected]> writes:

> Suppose that I would like to map some PIO registers (in a PCI BAR) to
> userspace, and I would like to enable write combining if possible.
>
> I have two problems. First, there's no generic interface for
> requesting write combining if possible when doing
> io_remap_pfn_range(). Would it make sense to define
> pageprot_writecombine for all architectures (and make it fall back to
> doing non-cached access if write combining is not possible)? And it
> seems that making pgprot_noncached() universal wouldn't hurt either.

So I think we may simplify this but there is pci_mmap_page_range. That
already handles this for the architectures that currently support it.
So it is probably the case the fbdev should be changed to use that.

I am certainly in favor of simpler infrastructure like making
pgprot_writecombine and pgprot_uncached universal but I would like to
start with what works today.

Then we can go reexamine things like the ia64 slavishly trusting the
BIOS to know which page protections are good.

> Second, given the extreme shortage of MTRRs, it seems that write
> combining is not really possible in general on x86 without some
> interface to use PATs instead. What is holding up something like Eric
> Biederman's patches from going in?

No one had any serious objections to my patches as they were. The actual
problem was that the patches were incomplete. In particular if you
mismatch page protections it is possible to get silent data corruption
or processor crashes. So we need checks to ensure all mappings of
a given page are using the same protections.

To a certain extent I think adding those checks really is a strawman
and should not stop the merge effort, because we have this feature and
those possible bugs on other architectures right now and we don't have
those checks. But I also think in the long term we need them, it just
requires several days of going through the mm so we don't leave any
path uncovered.

> Right now we end up with stuff like the absolutely hair-raising code
> in drivers/video/fbmem.c shown below. I really would like to make
> progress towards having a better interface for doing this stuff, and
> I'm more than willing to work on getting something mergable.

I hope my comments help.

Eric

2006-12-13 00:12:12

by Roland Dreier

[permalink] [raw]
Subject: Re: mapping PCI registers with write combining (and PAT on x86)...

> So I think we may simplify this but there is pci_mmap_page_range. That
> already handles this for the architectures that currently support it.
> So it is probably the case the fbdev should be changed to use that.

Thanks... I was not aware of pci_mmap_page_range(), but that doesn't
seem to be quite the right interface. It uses vma->vm_pgoff to say
what to remap. A typical use for what I have in mind would be for a
userspace process to open a magic file and do mmap() at some
well-known offset (like 0), and have the kernel driver map the right
PCI registers into userspace, without userspace having to know what
offset to ask for.

This is especially important when the kernel has to handle picking a
"context" or "port" to avoid multiple userspace processes stepping on
each other.

And of course arch/i386/pci/i386.c has the following in its
pci_mmap_page_range() anyway:

/* Write-combine setting is ignored, it is changed via the mtrr
* interfaces on this platform.
*/

so the write_combine parameter is ignored...

> No one had any serious objections to my patches as they were. The actual
> problem was that the patches were incomplete. In particular if you
> mismatch page protections it is possible to get silent data corruption
> or processor crashes. So we need checks to ensure all mappings of
> a given page are using the same protections.
>
> To a certain extent I think adding those checks really is a strawman
> and should not stop the merge effort, because we have this feature and
> those possible bugs on other architectures right now and we don't have
> those checks. But I also think in the long term we need them, it just
> requires several days of going through the mm so we don't leave any
> path uncovered.

It does seem somewhat hard to make sure there aren't multiple mappings
of the same thing, and I'm not sure it's worth trying to avoid it. If
a device driver lets me mmap PCI memory with write-combining on, and
then (as root) I mmap raw PCI resources to get the same memory, whose
fault is it if things break?

I'm kind of an mm dummy but I don't even see a good way to avoid
multiple mappings like that.

- R.

2006-12-13 00:28:55

by Eric W. Biederman

[permalink] [raw]
Subject: Re: mapping PCI registers with write combining (and PAT on x86)...

Roland Dreier <[email protected]> writes:

> > So I think we may simplify this but there is pci_mmap_page_range. That
> > already handles this for the architectures that currently support it.
> > So it is probably the case the fbdev should be changed to use that.
>
> Thanks... I was not aware of pci_mmap_page_range(), but that doesn't
> seem to be quite the right interface. It uses vma->vm_pgoff to say
> what to remap. A typical use for what I have in mind would be for a
> userspace process to open a magic file and do mmap() at some
> well-known offset (like 0), and have the kernel driver map the right
> PCI registers into userspace, without userspace having to know what
> offset to ask for.
>
> This is especially important when the kernel has to handle picking a
> "context" or "port" to avoid multiple userspace processes stepping on
> each other.

Agreed. But I think it is a healthier starting point then the fbdev
location. A little refactoring and we should be able to handle
the generic case.

> And of course arch/i386/pci/i386.c has the following in its
> pci_mmap_page_range() anyway:
>
> /* Write-combine setting is ignored, it is changed via the mtrr
> * interfaces on this platform.
> */
>
> so the write_combine parameter is ignored...

Yes. Although my earlier patch fixed that :)

It is also worth nothing that the pci prefetchable attribute
says that write-combining is safe through that bar.

> > No one had any serious objections to my patches as they were. The actual
> > problem was that the patches were incomplete. In particular if you
> > mismatch page protections it is possible to get silent data corruption
> > or processor crashes. So we need checks to ensure all mappings of
> > a given page are using the same protections.
> >
> > To a certain extent I think adding those checks really is a strawman
> > and should not stop the merge effort, because we have this feature and
> > those possible bugs on other architectures right now and we don't have
> > those checks. But I also think in the long term we need them, it just
> > requires several days of going through the mm so we don't leave any
> > path uncovered.
>
> It does seem somewhat hard to make sure there aren't multiple mappings
> of the same thing, and I'm not sure it's worth trying to avoid it. If
> a device driver lets me mmap PCI memory with write-combining on, and
> then (as root) I mmap raw PCI resources to get the same memory, whose
> fault is it if things break?
>
> I'm kind of an mm dummy but I don't even see a good way to avoid
> multiple mappings like that.

It is possible. It is just a matter of looking for other mappings of the
same thing, and seeing if one exists seeing if it has the mapping
attributes you are asking for. We have reverse mappings and
the basic infrastructure.

The justification for pursuing reverse mappings really is that it is
less work than taking weeks tracing down the one bug caused by
allowing this and then silently corrupting data.

The obvious problem comes when one mapping is write combined and
the other mapping is write back.

Eric

2006-12-13 02:52:01

by Kyle McMartin

[permalink] [raw]
Subject: Re: mapping PCI registers with write combining (and PAT on x86)...

On Tue, Dec 12, 2006 at 02:05:32PM -0800, Roland Dreier wrote:
> #if defined(__mc68000__)
<snip>
> #warning What do we have to do here??
> #endif
> if (io_remap_pfn_range(vma, vma->vm_start, off >> PAGE_SHIFT,
> vma->vm_end - vma->vm_start, vma->vm_page_prot))
> return -EAGAIN;
> return 0;
>

Wow, that should probably take the cake for the ugliest snippet
of code in the kernel.

--Kyle