2004-03-20 13:29:40

by Andrea Arcangeli

[permalink] [raw]
Subject: can device drivers return non-ram via vm_ops->nopage?

The only bugreport I've got so far for the latest anon_vma code is from
Jens, and it's a device driver bug in my opinion, but I'd like to have a
definitive confirmation from you about the ->nopage API.

I changed ->nopage like this to catch bugs:

retry:
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);

/*
* non-ram cannot be mapped via ->nopage, it must
* be mapped via remap_page_range instead synchronously
* in the ->mmap device driver callback.
*
* PageReserved pages can be mapped as far as they're under
* a VM_RESERVED vma.
*/
BUG_ON(!pfn_valid(page_to_pfn(new_page)));

/* ->nopage cannot return swapcache */
BUG_ON(PageSwapCache(new_page));
/* ->nopage cannot return anonymous pages */
BUG_ON(PageAnon(new_page));

/*
* This is the entry point for memory under VM_RESERVED vmas.
* That memory will not be tracked by the vm. These aren't
* real anonymous pages, they're "device" reserved pages
* instead.
* These pages under VM_RESERVED vmas are the only pages mapped
* by the VM into userspace with page->as.mapping = NULL.
*/
reserved = vma->vm_flags & VM_RESERVED;
BUG_ON(!reserved && (!new_page->mapping || PageReserved(new_page)));


really it would not be mandatory for me to enforce the last BUG_ON,
since we don't do the pagetable walk anymore to unmap stuff, but I think
it's nicer to enforce the model in the drivers so if we'll ever want to
do the pagetable walk again, we could, if we giveup then we'll be unable
to go back to the pagetable walk. I'm not saying that we'll ever want to
go back, but since most drivers are already setting VM_RESREVED
correctly to work with 2.4, I believe it worth to maintain this
abstraction so if we really want we can go back.

Anyways returning to the non-ram returned by ->nopage see the below
email exchange with Jens. the bug triggering of course is the
BUG_ON(!pfn_valid(page_to_pfn(new_page))).

If we want to return non-ram, we could, but I believe we should change
the API to return a pfn not a page_t * if we want to.

----- Forwarded message from Andrea Arcangeli <[email protected]> -----

Date: Sat, 20 Mar 2004 14:21:56 +0100
From: Andrea Arcangeli <[email protected]>
To: Jens Axboe <[email protected]>

On Fri, Mar 19, 2004 at 01:32:13PM +0100, Jens Axboe wrote:
> kernel BUG at mm/memory.c:1412!
> invalid operand: 0000 [#1]
> SMP
> CPU: 1
> EIP: 0060:[<c01407fe>] Not tainted
> EFLAGS: 00010216 (2.6.4-0-axboe)
> EIP is at do_no_page+0x42c/0x4dc
> eax: 01f80000 ebx: 00000000 ecx: 00001000 edx: f0c5be78
> esi: f0860280 edi: c0453880 ebp: f0c5bec0 esp: f0c5be88
> ds: 007b es: 007b ss: 0068
> Process mplayer (pid: 1500, threadinfo=f0c5a000 task=f149d300)
> Stack: f7f997a0 f7f997a0 00000282 f06fec80 f0c5bea8 00000000 f7d8c984
> 40b0e000
> f0860280 f1267800 00000001 f0c5a000 f0866c38 c0453880 f0c5bf0c
> c01415da
> 00000000 f0866c38 f08b3408 00000000 000081ff 405ae7a3 1be20e55
> f7de0480
> Call Trace:
> [<c01415da>] handle_mm_fault+0xf3/0x694
> [<c011674d>] do_page_fault+0x16c/0x535
> [<c0144179>] __do_mmap_pgoff+0x34c/0x643
> [<c010c44a>] do_mmap2+0x7a/0xa8
> [<c01165e1>] do_page_fault+0x0/0x535
> [<c0106aad>] error_code+0x2d/0x38

a device driver is returning a non-ram page via ->nopage.

I don't think this has ever been safe, it's just that my more robust
anon_vma code is trapping this bug, I think non-ram pages should use
remap_file_pages not ->nopage.

Let's assume I'm wrong and you can return non-ram via ->nopage (even
ignoring the API would be totally incorrect since one should return a
'pfn' not a 'page_t *' if really ->nopage can return non-ram), let's
take plain 2.6.5-rc1 (w/o my anon_vma code)

new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);
[..]
if (pte_none(*page_table)) {
if (!PageReserved(new_page))
++mm->rss;
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte(page_table, entry);
pte_chain = page_add_rmap(new_page, page_table, pte_chain);
pte_unmap(page_table);


PageReserved(new_page) is already reading random memory, that could even
genrate a machine exception on amd64 or ia64 and lock the box hard.

then it goes ahead and it even does page_add_rmap(new_page), writing
new_page->pte_chain in non-ram with, again potentially crashing the box.
It's ironic that Andrew removed a pfn_valid check in front of
page_add_rmap just in 2.6.5-rc1 (previous kernels wouldn't overwrite
random non-ram there, but still the unrecoverable machine check was
still there simply due the PageReserved).

So I think my anon_vma code is right forbidding non-ram to be returned
by ->nopage, and the device driver should be fixed.

If you disagree, I can change ->nopage to survive a non-ram page, but
besides the API of returning a page_t * would be misleading, this would
be a new feature, and it wouldn't work stable with mainline kernels
(regardless if page_add_rmap starts with a pfn_valid check or not).

----- End forwarded message -----


2004-03-20 14:40:27

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 02:30:25PM +0100, Andrea Arcangeli wrote:
> Anyways returning to the non-ram returned by ->nopage see the below
> email exchange with Jens. the bug triggering of course is the
> BUG_ON(!pfn_valid(page_to_pfn(new_page))).
> If we want to return non-ram, we could, but I believe we should change
> the API to return a pfn not a page_t * if we want to.

This would be very helpful for other reasons also. There's a general
API issue with drivers that want or need to do this. The one I've
heard most about is /dev/mem when it's used to mmap() physical areas
lying in memory holes not covered by ->node_mem_map. Once ->mmap() and
->nopage() supplied by drivers are liberated from reliance on struct
page, numerous hacks, validation overheads, and stability issues may be
eliminated. I'd rather strongly advocate such an API change for mainline,
as it's something that fixes a number of drivers at once, but only if
the implementation carries out a full sweep of all affected callees
and only if it actually resolves the issues with these drivers.

But there's another question that should be asked up-front: in order to
give drivers sufficient expressiveness to correctly implement their
->mmap() methods, is this even sufficient? There is a serious question
of whether the core can actually handle the driver-specific issues,
which suggests devolving a larger swath of the fault handling codepath
to drivers supporting ->mmap() if it is insufficient after all. For
instance, will cache-disabled mappings or bolted/locked TLB entries
that the core doesn't understand be required? I'd like to get someone
with more driver experience or who may have architecture-specific
issues with driver ->nopage() methods to chime in here with respect to
the sufficiency of a pfn-based ->nopage() vs. stronger methods, since
it's pointless to make the pfn-based ->nopage() change if it's
insufficient anyway.

There is also a special case that's hitting a number of architectures
simultaneously that may or may not be a mainline concern. This is that
a number of people actually want to handle faults on hugetlb and do
ZFOD fault handling so that, for instance, various kinds of NUMA and
latency issues can be addressed. The current methods are to trap the
fault before calling handle_mm_fault() in arch code, but a cleaner
solution would very nicely reuse more general or stronger forms of
driver fault handling that would fix driver issues also. It's basically
an upstream call as to whether this will be allowed to have any
influence on the design of a solution to the more critical "drivers are
getting bitten by the requirement of a struct page * return value of
->nopage()" issue, and it looks like upstream is cc:'d on this thread. =)


-- wli

2004-03-20 15:06:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 06:40:22AM -0800, William Lee Irwin III wrote:
> On Sat, Mar 20, 2004 at 02:30:25PM +0100, Andrea Arcangeli wrote:
> > Anyways returning to the non-ram returned by ->nopage see the below
> > email exchange with Jens. the bug triggering of course is the
> > BUG_ON(!pfn_valid(page_to_pfn(new_page))).
> > If we want to return non-ram, we could, but I believe we should change
> > the API to return a pfn not a page_t * if we want to.
>
> This would be very helpful for other reasons also. There's a general
> API issue with drivers that want or need to do this. The one I've

I'm afraid I'll have to teach ->nopage how to deal with non-ram with
this page_t API too (changing it to pfn sounds too intrusive in the
short term), it seems to me that alsa can return non-ram (in the nopage
callback there's a virt_to_page on some iomm region), and changing alsa
to use remap_file_pages sounds too intrusive too.

So in short I believe alsa can corrupt memory randomly starting with
2.6.5-rc1, and it could only generate machine check crashes in previous
kernels.

So for the short term (i.e. next few weeks) we'll have to deal with
page_t still there...

2004-03-20 15:27:35

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 04:06:21PM +0100, Andrea Arcangeli wrote:
> I'm afraid I'll have to teach ->nopage how to deal with non-ram with
> this page_t API too (changing it to pfn sounds too intrusive in the
> short term), it seems to me that alsa can return non-ram (in the nopage
> callback there's a virt_to_page on some iomm region), and changing alsa
> to use remap_file_pages sounds too intrusive too.
> So in short I believe alsa can corrupt memory randomly starting with
> 2.6.5-rc1, and it could only generate machine check crashes in previous
> kernels.
> So for the short term (i.e. next few weeks) we'll have to deal with
> page_t still there...

I've developed an interest in drivers recently, so I may be able to do
some of the footwork here in a timely fashion if we want to go the pfn-
based API route. That actually sounded like the less intrusive of the
two methods I mentioned as well as easily mergeable within a stable
series. OTOH, if there are objections, it may have to wait.

I don't believe devolving larger swaths of the fault path to drivers
would be very difficult to restructure drivers to use. The hard parts
are that it would be time-consuming and would likely merit a support
API exported by architectures to make driver writers' lives easier (i.e.
not introduce more bugs than it resolves) that would need to be agreed
upon, or at least backed by a feature request survey. And, of course,
it would need an implementation for every architecture, which could be
difficult to arrange for the less documented and/or less frequently
updated architectures if features the core doesn't already rely upon
would be required. And that's a certainty, since the core not
understanding the needs of those drivers would be the primary motive
for the more intrusive approach.


-- wli

2004-03-20 15:44:25

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 04:06:21PM +0100, Andrea Arcangeli wrote:
> On Sat, Mar 20, 2004 at 06:40:22AM -0800, William Lee Irwin III wrote:
> > On Sat, Mar 20, 2004 at 02:30:25PM +0100, Andrea Arcangeli wrote:
> > > Anyways returning to the non-ram returned by ->nopage see the below
> > > email exchange with Jens. the bug triggering of course is the
> > > BUG_ON(!pfn_valid(page_to_pfn(new_page))).
> > > If we want to return non-ram, we could, but I believe we should change
> > > the API to return a pfn not a page_t * if we want to.
> >
> > This would be very helpful for other reasons also. There's a general
> > API issue with drivers that want or need to do this. The one I've
>
> I'm afraid I'll have to teach ->nopage how to deal with non-ram with
> this page_t API too (changing it to pfn sounds too intrusive in the
> short term), it seems to me that alsa can return non-ram (in the nopage
> callback there's a virt_to_page on some iomm region), and changing alsa
> to use remap_file_pages sounds too intrusive too.

Actually, ALSA is broken in that respect - it isn't portable as it
stands. It isn't the API which is broken - it's ALSA which is broken.
Performing virt_to_page() on any non-direct mapped RAM page (which
means the value returned from dma_alloc_coherent or pci_alloc_consistent)
is undefined.

One of my current projects is fixing this crap in ALSA.

Besides, returning an invalid struct page will lead to Bad Things(tm)
in set_pte().

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-20 15:56:55

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 03:44:19PM +0000, Russell King wrote:
> On Sat, Mar 20, 2004 at 04:06:21PM +0100, Andrea Arcangeli wrote:
> > On Sat, Mar 20, 2004 at 06:40:22AM -0800, William Lee Irwin III wrote:
> > > On Sat, Mar 20, 2004 at 02:30:25PM +0100, Andrea Arcangeli wrote:
> > > > Anyways returning to the non-ram returned by ->nopage see the below
> > > > email exchange with Jens. the bug triggering of course is the
> > > > BUG_ON(!pfn_valid(page_to_pfn(new_page))).
> > > > If we want to return non-ram, we could, but I believe we should change
> > > > the API to return a pfn not a page_t * if we want to.
> > >
> > > This would be very helpful for other reasons also. There's a general
> > > API issue with drivers that want or need to do this. The one I've
> >
> > I'm afraid I'll have to teach ->nopage how to deal with non-ram with
> > this page_t API too (changing it to pfn sounds too intrusive in the
> > short term), it seems to me that alsa can return non-ram (in the nopage
> > callback there's a virt_to_page on some iomm region), and changing alsa
> > to use remap_file_pages sounds too intrusive too.
>
> Actually, ALSA is broken in that respect - it isn't portable as it
> stands. It isn't the API which is broken - it's ALSA which is broken.
> Performing virt_to_page() on any non-direct mapped RAM page (which
> means the value returned from dma_alloc_coherent or pci_alloc_consistent)
> is undefined.

this is exactly the problem.

> One of my current projects is fixing this crap in ALSA.

Do you agree it should be fixed by returning a PFN from ->nopage? or are
you doing it differently with remap_file_pages or peraphs you're just
multiplying the right pfn for sizeof(page_t) ingoring the misleading API?

> Besides, returning an invalid struct page will lead to Bad Things(tm)
> in set_pte().

you mean in the non-x86 archs right?

there is no way I can change ->nopage to return a pfn right now (this
stuff must work stable ASAP), so I'm currently teaching do_no_page to
handle non-ram pages (for the first time ever it will be able to do
that), I expect this at least will make it work right on x86 w/o iommu.

for mainline 2.6 if we want to keep using ->nopage I agree with Wli that
an API change is reasonable.

BTW, I was wrong talking about machine checks, the problem here is
reading random _virtual_ address (not phyisical ones), so it could oops
too on x86, and starting from 2.6.5-rc1 it'll corrupt mem too.

2004-03-20 16:02:51

by Jaroslav Kysela

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 20 Mar 2004, Russell King wrote:

> Actually, ALSA is broken in that respect - it isn't portable as it
> stands. It isn't the API which is broken - it's ALSA which is broken.
> Performing virt_to_page() on any non-direct mapped RAM page (which
> means the value returned from dma_alloc_coherent or pci_alloc_consistent)
> is undefined.
>
> One of my current projects is fixing this crap in ALSA.

Yes, but if there's no API in the kernel code allowing to obtain page
pointers using any value returned from dma_alloc_coherent(), then we
cannot fix this problem.

So, it's not much subsystem (ALSA) problem, but kernel core is not matured
enough.

The same problem is for the cache coherency for mmaped pages.

Jaroslav

-----
Jaroslav Kysela <[email protected]>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

2004-03-20 16:09:17

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 04:58:21PM +0100, Jaroslav Kysela wrote:
> On Sat, 20 Mar 2004, Russell King wrote:
> > Actually, ALSA is broken in that respect - it isn't portable as it
> > stands. It isn't the API which is broken - it's ALSA which is broken.
> > Performing virt_to_page() on any non-direct mapped RAM page (which
> > means the value returned from dma_alloc_coherent or pci_alloc_consistent)
> > is undefined.
> >
> > One of my current projects is fixing this crap in ALSA.
>
> Yes, but if there's no API in the kernel code allowing to obtain page
> pointers using any value returned from dma_alloc_coherent(), then we
> cannot fix this problem.

It is fixable, if someone sits down and works through it, which is
precisely what I've been doing.

> So, it's not much subsystem (ALSA) problem, but kernel core is not matured
> enough.

It is well known that virt_to_page() is only valid on virtual addresses
which correspond to kernel direct mapped RAM pages, and undefined on
everything else. Unfortunately, ALSA has been using it with
pci_alloc_consistent() for a long time, and this behaviour is what
makes ALSA broken. The fact it works on x86 is merely incidental.

If ALSA wants this functionality, the ALSA people should ideally have
put their requirements forward during the 2.5 development cycle so the
problem could be addressed. However, luckily in this instance, it is
not a big problem to solve. It just requires time to sort through all
the abstraction layers upon abstraction layers which ALSA has.

- and I'm doing exactly this, right now. Be patient. -

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-20 16:15:51

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 04:57:39PM +0100, Andrea Arcangeli wrote:
> > One of my current projects is fixing this crap in ALSA.
>
> Do you agree it should be fixed by returning a PFN from ->nopage?

No. How would you return the PFN from a remapped page? It's far
easier to provide an interface which returns the struct page* for
the underlying pages, thusly:

static struct page *
dma_coherent_to_page(struct device *dev, void *cpu_addr,
dma_addr_t handle, unsigned int offset)

And this is precisely what I would be working on if I weren't writing
this mail. 8)

Take a moment to think about the problem. We've allocated some memory
for coherent DMA via the dma_alloc_coherent() interface. At some point,
we've had to get a struct page* in this allocator. However, the
allocator has had to do some architecture defined operations to provide
coherent memory.

Only the architecture can translate the results from dma_alloc_coherent()
back to a struct page* - which it needs to be able to do if
dma_free_coherent() is going to work.

Therefore, what we need to do to solve the ALSA problem is require all
architectures to provide dma_coherent_to_page() and make ALSA use that.

(A related problem is that some architectures need pgprot_dmacoherent()
to modify the page protections so that the user space mapping is also
DMA-coherent. However, that discussion should be the subject of a
new thread.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-20 16:24:46

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 04:15:38PM +0000, Russell King wrote:
> On Sat, Mar 20, 2004 at 04:57:39PM +0100, Andrea Arcangeli wrote:
> > > One of my current projects is fixing this crap in ALSA.
> >
> > Do you agree it should be fixed by returning a PFN from ->nopage?
>
> No. How would you return the PFN from a remapped page? It's far
> easier to provide an interface which returns the struct page* for
> the underlying pages, thusly:
>
> static struct page *
> dma_coherent_to_page(struct device *dev, void *cpu_addr,
> dma_addr_t handle, unsigned int offset)
>
> And this is precisely what I would be working on if I weren't writing
> this mail. 8)
>
> Take a moment to think about the problem. We've allocated some memory
> for coherent DMA via the dma_alloc_coherent() interface. At some point,

they're using MMIO pci space or it wouldn't catch my BUG_ON on x86.

The whole point is that it is non ram, if it would be ram, x86 couldn't
notice the virt_to_page, since the page_t would be in the range of the
mem_map_t and pfn_valid would be happy with it.

If it was dma_alloc_coherent it would return ram I think, not non-ram.

> we've had to get a struct page* in this allocator. However, the
> allocator has had to do some architecture defined operations to provide
> coherent memory.
>
> Only the architecture can translate the results from dma_alloc_coherent()
> back to a struct page* - which it needs to be able to do if
> dma_free_coherent() is going to work.
>
> Therefore, what we need to do to solve the ALSA problem is require all
> architectures to provide dma_coherent_to_page() and make ALSA use that.

will this dma_coherent_to_page be allowed to run on a non-ram page?
It's pretty ugly to use page_t for non-ram. one can always convert with
a mul or div with sizeof(page_t) though. My point is that if you want to
allow stuff to deal with non-ram you must never have this stuff work
with page_t but it should work with pfn instead.

2004-03-20 16:57:34

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 05:25:34PM +0100, Andrea Arcangeli wrote:
> they're using MMIO pci space or it wouldn't catch my BUG_ON on x86.
> The whole point is that it is non ram, if it would be ram, x86 couldn't
> notice the virt_to_page, since the page_t would be in the range of the
> mem_map_t and pfn_valid would be happy with it.
> If it was dma_alloc_coherent it would return ram I think, not non-ram.

Any idea what driver? /dev/mem, which is where X typically gets its
mappings of mmiospace, doesn't actually use ->nopage(). Maybe rmk's
notion of doing it all from within the drivers is the right idea in
general, or at least until we hit cases that can't be handled that
way at all.


-- wli

2004-03-20 17:40:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Sat, 20 Mar 2004, Andrea Arcangeli wrote:
>
> The only bugreport I've got so far for the latest anon_vma code is from
> Jens, and it's a device driver bug in my opinion, but I'd like to have a
> definitive confirmation from you about the ->nopage API.

I'd say that this is definitely a driver bug.

If a driver wants to map non-RAM pages, that's perfectly ok, but it MUST
NOT happen through "nopage()". The driver should map them with
"remap_page_range()", and thus never take a page fault for such pages at
all.

There is no reason to ever lazily map non-RAM pages - clearly they aren't
using any "real memory", so there is no reason to not fill the page tables
at mmap() time.

In other words, the driver is horribly broken.

Linus

2004-03-20 17:48:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

I noticed there was effectively one bug in anon_vma that would have
caused a VM_FAULT_OOM/SIGBUS to be mistaken for non-ram (I checked
VM_FAULT after pfn_valid oh well). So it's possible what Jens
experienced was a sigbus and not a non-ram condition, sorry.

So it's not certain anymore that alsa was returning non-ram, but it's
still possible in theory and it would have worked in practice up to
2.6.4 (not anymore in 2.6.5-rc1). The non-ram page_t would fall into the
direct mapping (for a 4G phys address, the page_t would be at address
3G+120M triggering an oops only on machines with less than 128m of ram,
and if the mmio region would been at lower than 4G even less ram would
be needed to hide the bug). This untested patch should make it working
with non-ram too, so it sounds safer for the short term. I will test it
a little bit and I'll upload an update.

--- x/mm/memory.c.~1~ 2004-03-20 15:47:31.000000000 +0100
+++ x/mm/memory.c 2004-03-20 18:35:19.000000000 +0100
@@ -1384,7 +1384,7 @@ do_no_page(struct mm_struct *mm, struct
struct page * new_page;
struct address_space *mapping = NULL;
pte_t entry;
- int sequence = 0, reserved, anon;
+ int sequence = 0, reserved, anon, pageable, ram, as;
int ret = VM_FAULT_MINOR;

if (!vma->vm_ops || !vma->vm_ops->nopage)
@@ -1401,36 +1401,40 @@ do_no_page(struct mm_struct *mm, struct
retry:
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, &ret);

+ /* no page was available -- either SIGBUS or OOM */
+ if (new_page == NOPAGE_SIGBUS)
+ return VM_FAULT_SIGBUS;
+ if (new_page == NOPAGE_OOM)
+ return VM_FAULT_OOM;
+
/*
- * non-ram cannot be mapped via ->nopage, it must
- * be mapped via remap_page_range instead synchronously
- * in the ->mmap device driver callback.
- *
- * PageReserved pages can be mapped as far as they're under
- * a VM_RESERVED vma.
+ * ->nopage should return a PFN not a page_t if here we
+ * wanted to handle non-ram, though we've to make non-ram
+ * work with page_t too for a number of device drivers
+ * that may return non-ram via ->nopage.
*/
- BUG_ON(!pfn_valid(page_to_pfn(new_page)));
-
- /* ->nopage cannot return swapcache */
- BUG_ON(PageSwapCache(new_page));
- /* ->nopage cannot return anonymous pages */
- BUG_ON(PageAnon(new_page));
+ pageable = ram = pfn_valid(page_to_pfn(new_page));
+ if (likely(ram)) {
+ pageable = !PageReserved(new_page);
+ as = !!new_page->mapping;
+
+ BUG_ON(!pageable && as);
+
+ pageable &= as;
+
+ /* ->nopage cannot return swapcache */
+ BUG_ON(PageSwapCache(new_page));
+ /* ->nopage cannot return anonymous pages */
+ BUG_ON(PageAnon(new_page));
+ }

/*
* This is the entry point for memory under VM_RESERVED vmas.
* That memory will not be tracked by the vm. These aren't
* real anonymous pages, they're "device" reserved pages instead.
- * These pages under VM_RESERVED vmas are the only pages mapped
- * by the VM into userspace with page->as.mapping = NULL.
*/
- reserved = vma->vm_flags & VM_RESERVED;
- BUG_ON(!reserved && (!new_page->mapping || PageReserved(new_page)));
-
- /* no page was available -- either SIGBUS or OOM */
- if (new_page == NOPAGE_SIGBUS)
- return VM_FAULT_SIGBUS;
- if (new_page == NOPAGE_OOM)
- return VM_FAULT_OOM;
+ reserved = !!(vma->vm_flags & VM_RESERVED);
+ BUG_ON(reserved != pageable);

/*
* Should we do an early C-O-W break?
@@ -1438,6 +1442,8 @@ retry:
anon = 0;
if (write_access && !(vma->vm_flags & VM_SHARED)) {
struct page * page;
+ if (unlikely(!ram))
+ return VM_FAULT_SIGBUS;
if (unlikely(anon_vma_prepare(vma)))
goto oom;
page = alloc_page(GFP_HIGHUSER);
@@ -1460,7 +1466,8 @@ retry:
(unlikely(sequence != atomic_read(&mapping->truncate_count)))) {
sequence = atomic_read(&mapping->truncate_count);
spin_unlock(&mm->page_table_lock);
- page_cache_release(new_page);
+ if (likely(ram))
+ page_cache_release(new_page);
goto retry;
}
page_table = pte_offset_map(pmd, address);
@@ -1477,20 +1484,21 @@ retry:
*/
/* Only go through if we didn't race with anybody else... */
if (pte_none(*page_table)) {
- if (!PageReserved(new_page))
+ if (likely(ram && !PageReserved(new_page)))
++mm->rss;
flush_icache_page(vma, new_page);
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
set_pte(page_table, entry);
- if (likely(!reserved))
+ if (likely(ram))
page_add_rmap(new_page, vma, address, anon);
pte_unmap(page_table);
} else {
/* One of our sibling threads was faster, back out. */
pte_unmap(page_table);
- page_cache_release(new_page);
+ if (likely(ram))
+ page_cache_release(new_page);
spin_unlock(&mm->page_table_lock);
goto out;
}
@@ -1502,7 +1510,8 @@ retry:
return ret;

oom:
- page_cache_release(new_page);
+ if (likely(ram))
+ page_cache_release(new_page);
ret = VM_FAULT_OOM;
goto out;
}

2004-03-20 17:56:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 09:39:51AM -0800, Linus Torvalds wrote:
>
>
> On Sat, 20 Mar 2004, Andrea Arcangeli wrote:
> >
> > The only bugreport I've got so far for the latest anon_vma code is from
> > Jens, and it's a device driver bug in my opinion, but I'd like to have a
> > definitive confirmation from you about the ->nopage API.
>
> I'd say that this is definitely a driver bug.
>
> If a driver wants to map non-RAM pages, that's perfectly ok, but it MUST
> NOT happen through "nopage()". The driver should map them with
> "remap_page_range()", and thus never take a page fault for such pages at
> all.
>
> There is no reason to ever lazily map non-RAM pages - clearly they aren't
> using any "real memory", so there is no reason to not fill the page tables
> at mmap() time.
>
> In other words, the driver is horribly broken.

thanks for the clarification.

At the moment I'm not sure anymore if this was non-ram or a
VM_FAULT_SIGBUS because I noticed I was doing BUG_ON(!pfn_valid)
_before_ checking new_page == VM_FAULT_SIGBUS. Though my theory about
do_no_page working fine with non-ram page_t with >=128m machines up to
2.6.4 still holds, and it's not obvious that Jens triggered a SIGBUS
either.

2004-03-20 18:22:43

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 20 Mar 2004, Andrea Arcangeli wrote:
>> The only bugreport I've got so far for the latest anon_vma code is from
>> Jens, and it's a device driver bug in my opinion, but I'd like to have a
>> definitive confirmation from you about the ->nopage API.

On Sat, Mar 20, 2004 at 09:39:51AM -0800, Linus Torvalds wrote:
> I'd say that this is definitely a driver bug.
> If a driver wants to map non-RAM pages, that's perfectly ok, but it MUST
> NOT happen through "nopage()". The driver should map them with
> "remap_page_range()", and thus never take a page fault for such pages at
> all.
> There is no reason to ever lazily map non-RAM pages - clearly they aren't
> using any "real memory", so there is no reason to not fill the page tables
> at mmap() time.
> In other words, the driver is horribly broken.

If our official story is prefaulting, there should be very little to do.
I'll grep around for drivers doing the wrong thing and see if any rmk's
not handling are in need of conversion from fault handling to
remap_page_range().


-- wli

2004-03-20 19:02:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 06:48:57PM +0100, Andrea Arcangeli wrote:
> be needed to hide the bug). This untested patch should make it working
> with non-ram too, so it sounds safer for the short term. I will test it

btw, that works only if NUMA is disabled. there's no way to do
page_to_pfn with a non-ram page with numa enabled since page_zone starts
by reading page->flags, only pte_pfn works (as I found from Martin's
oops).

2004-03-20 19:49:17

by Jaroslav Kysela

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 20 Mar 2004, Russell King wrote:

> It is well known that virt_to_page() is only valid on virtual addresses
> which correspond to kernel direct mapped RAM pages, and undefined on
> everything else. Unfortunately, ALSA has been using it with
> pci_alloc_consistent() for a long time, and this behaviour is what
> makes ALSA broken. The fact it works on x86 is merely incidental.

It works on PPC as well (at least we have no error reports).

> If ALSA wants this functionality, the ALSA people should ideally have
> put their requirements forward during the 2.5 development cycle so the
> problem could be addressed.

Yes, I'm sorry about that, but the ->nopage usage was requested by Jeff
Garzik and we're not gurus for the VM stuff. Because we're probably first
starting using of this mapping scheme, it resulted to problems.

> However, luckily in this instance, it is not a big problem to solve.
> It just requires time to sort through all the abstraction layers upon
> abstraction layers which ALSA has.
>
> - and I'm doing exactly this, right now. Be patient. -

Thanks a lot.

Jaroslav

-----
Jaroslav Kysela <[email protected]>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

2004-03-20 20:13:49

by Andrew Morton

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Andrea Arcangeli <[email protected]> wrote:
>
> On Sat, Mar 20, 2004 at 06:40:22AM -0800, William Lee Irwin III wrote:
> > On Sat, Mar 20, 2004 at 02:30:25PM +0100, Andrea Arcangeli wrote:
> > > Anyways returning to the non-ram returned by ->nopage see the below
> > > email exchange with Jens. the bug triggering of course is the
> > > BUG_ON(!pfn_valid(page_to_pfn(new_page))).
> > > If we want to return non-ram, we could, but I believe we should change
> > > the API to return a pfn not a page_t * if we want to.
> >
> > This would be very helpful for other reasons also. There's a general
> > API issue with drivers that want or need to do this. The one I've
>
> I'm afraid I'll have to teach ->nopage how to deal with non-ram with
> this page_t API too (changing it to pfn sounds too intrusive in the
> short term), it seems to me that alsa can return non-ram (in the nopage
> callback there's a virt_to_page on some iomm region), and changing alsa
> to use remap_file_pages sounds too intrusive too.

I had a check in a valid pfn in page_add_rmap() for several weeks before I
actually removed the test. The debug check never triggered. But looking
at the code I don't see why not. Weird.

fyi, we don't need the check in page_referenced() and try_to_unmap()
because do_no_page() does not place pages on the LRU. It is the ->nopage
implementation which is responsible for that. Presumably the ALSA driver
was not adding the "page" to the LRU.

I agree that ->nopage implementations should not be doing what that driver
is doing. ->nopage is defined to return a page*: it's crazy to be
returning someting from there which isn't covered by mem_map[].

I just don't think it's important enough to be able to cope with
non-mem_map[] "memory" in do_no_page(), so I agree that requiring ->mmap()
to synchronously instantiate the pte's and retaining the debug check in
do_no_page() is a good idea.

2004-03-20 20:28:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 12:13:45PM -0800, Andrew Morton wrote:
> fyi, we don't need the check in page_referenced() and try_to_unmap()
> because do_no_page() does not place pages on the LRU. It is the ->nopage

yes I've noticed.

> I agree that ->nopage implementations should not be doing what that driver
> is doing. ->nopage is defined to return a page*: it's crazy to be
> returning someting from there which isn't covered by mem_map[].

I may have been wrong about that sorry, it's still not certain though,
but I had a bug in the code that would mistake a sigbus for a page_t
outside the mem_map, so it could have been a sigbus not a non-ram page.

Also in the meantime I noticed with NUMA it's impossible to handle
non-ram correctly in ->nopage, at least if using the current
page_to_pfn.

> I just don't think it's important enough to be able to cope with
> non-mem_map[] "memory" in do_no_page(), so I agree that requiring ->mmap()
> to synchronously instantiate the pte's and retaining the debug check in
> do_no_page() is a good idea.

I agree, I reistantiated the debug check because we cannot handle
non-ram from there if it's numa (actually discontigmem). If alsa uses
non-ram pages it must be fixed, but I've an hope it was a sigbus
trouble. We'll know more in a few more hours.

(btw, Martin definitely triggered the sigbus with numa, the 0x3()
dereference was a the page_t address to read page->flags >> 24)

it's good to have reminded the API cannot handle non-ram.

2004-03-20 20:51:02

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 12:13:45PM -0800, Andrew Morton wrote:
> I agree that ->nopage implementations should not be doing what that driver
> is doing. ->nopage is defined to return a page*: it's crazy to be
> returning someting from there which isn't covered by mem_map[].
> I just don't think it's important enough to be able to cope with
> non-mem_map[] "memory" in do_no_page(), so I agree that requiring ->mmap()
> to synchronously instantiate the pte's and retaining the debug check in
> do_no_page() is a good idea.

There are other reasons for doing it, e.g. unusual TLB attributes
and/or unusual pagetable structures backing the virtual region. I don't
see anyone standing up and screaming for more functionality than cache
coherency and/or disablement now, so as far as I'm concerned,
remap_area_pages() (or rmk's stuff) kills the issue.


-- wli

2004-03-20 22:23:46

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 08:44:44PM +0100, Jaroslav Kysela wrote:
> Yes, I'm sorry about that, but the ->nopage usage was requested by Jeff
> Garzik and we're not gurus for the VM stuff. Because we're probably first
> starting using of this mapping scheme, it resulted to problems.

Well, I've been told to effectively screw my idea by David Woodhouse,
so may I make the radical suggestion that rm -rf linux/sound would
also fix the problem. No, didn't think that was acceptable either.

Ok, so, how the fsck do we fix the sound drivers? How do we mmap()
memory provided by dma_alloc_coherent() into user space portably?

It appears from what David Woodhouse has been going on about, even
providing an architecture dma_coherent_to_page() interface isn't
acceptable.

If we can't answer that question, we might as well remove ALSA and
OSS from the kernel because they are abusing existing kernel
interfaces in ways which can not be solved.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-20 22:26:49

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 12:50:53PM -0800, William Lee Irwin III wrote:
> On Sat, Mar 20, 2004 at 12:13:45PM -0800, Andrew Morton wrote:
> > I agree that ->nopage implementations should not be doing what that driver
> > is doing. ->nopage is defined to return a page*: it's crazy to be
> > returning someting from there which isn't covered by mem_map[].
> > I just don't think it's important enough to be able to cope with
> > non-mem_map[] "memory" in do_no_page(), so I agree that requiring ->mmap()
> > to synchronously instantiate the pte's and retaining the debug check in
> > do_no_page() is a good idea.
>
> There are other reasons for doing it, e.g. unusual TLB attributes
> and/or unusual pagetable structures backing the virtual region. I don't
> see anyone standing up and screaming for more functionality than cache
> coherency and/or disablement now, so as far as I'm concerned,
> remap_area_pages() (or rmk's stuff) kills the issue.

I'm no longer planning on this. In fact, I see a future where I tell
people who want to use sound on ARM to go screw themselves because
there doesn't seem to be an acceptable solution to this problem.

Of course, this will lead to dirty hacks by many people who *REQUIRE*
sound to work, but I guess we just don't care about that.

(Yes, I'm pissed off over this issue.)

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-20 22:45:41

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 08:44:44PM +0100, Jaroslav Kysela wrote:
>> Yes, I'm sorry about that, but the ->nopage usage was requested by Jeff
>> Garzik and we're not gurus for the VM stuff. Because we're probably first
>> starting using of this mapping scheme, it resulted to problems.

On Sat, Mar 20, 2004 at 10:23:41PM +0000, Russell King wrote:
> Well, I've been told to effectively screw my idea by David Woodhouse,
> so may I make the radical suggestion that rm -rf linux/sound would
> also fix the problem. No, didn't think that was acceptable either.
> Ok, so, how the fsck do we fix the sound drivers? How do we mmap()
> memory provided by dma_alloc_coherent() into user space portably?
> It appears from what David Woodhouse has been going on about, even
> providing an architecture dma_coherent_to_page() interface isn't
> acceptable.
> If we can't answer that question, we might as well remove ALSA and
> OSS from the kernel because they are abusing existing kernel
> interfaces in ways which can not be solved.

Is there any possibility of an extension to remap_area_pages() that
could resolve this? I can't say I fully understood and/or remember
the issue with it that you pointed out.


-- wli

2004-03-20 22:45:18

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 12:50:53PM -0800, William Lee Irwin III wrote:
>> There are other reasons for doing it, e.g. unusual TLB attributes
>> and/or unusual pagetable structures backing the virtual region. I don't
>> see anyone standing up and screaming for more functionality than cache
>> coherency and/or disablement now, so as far as I'm concerned,
>> remap_area_pages() (or rmk's stuff) kills the issue.

On Sat, Mar 20, 2004 at 10:26:39PM +0000, Russell King wrote:
> I'm no longer planning on this. In fact, I see a future where I tell
> people who want to use sound on ARM to go screw themselves because
> there doesn't seem to be an acceptable solution to this problem.
> Of course, this will lead to dirty hacks by many people who *REQUIRE*
> sound to work, but I guess we just don't care about that.
> (Yes, I'm pissed off over this issue.)

This is the exact opposite of what I'd hoped come of this discussion.
ISTR something about remap_area_pages() missing several pieces, but
I pretty much need some kind of clarification to know what. Well, that,
and I presumed your fixups for ALSA were headed toward mainline
regardless after coping with whatever issue dwmw2 had (e.g. returning
pfn's or something).


-- wli

2004-03-20 23:54:50

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 02:45:18PM -0800, William Lee Irwin III wrote:
> Is there any possibility of an extension to remap_area_pages() that
> could resolve this? I can't say I fully understood and/or remember
> the issue with it that you pointed out.

The issues are:

1. ALSA wants to mmap the buffer used to transfer data to/from the
card into user space. This buffer may be direct-mapped RAM,
memory allocated via dma_alloc_coherent(), an on-device buffer,
or anything else.

The user space mapping must likewise be DMA-coherent.

Currently, ALSA just does virt_to_page() on whatever address it
feels like in its nopage() function, which is obviously not
acceptable for two out of the three specific cases above.

2. ALSA wants to _coherently_ share data between the kernel-side
drivers, and user space ALSA library, mainly the DMA buffer
head/tail pointers so both kernel space and user space knows
when the buffer is full/empty.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-21 00:23:40

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 11:54:45PM +0000, Russell King wrote:
> The issues are:
> 1. ALSA wants to mmap the buffer used to transfer data to/from the
> card into user space. This buffer may be direct-mapped RAM,
> memory allocated via dma_alloc_coherent(), an on-device buffer,
> or anything else.
> The user space mapping must likewise be DMA-coherent.
> Currently, ALSA just does virt_to_page() on whatever address it
> feels like in its nopage() function, which is obviously not
> acceptable for two out of the three specific cases above.
> 2. ALSA wants to _coherently_ share data between the kernel-side
> drivers, and user space ALSA library, mainly the DMA buffer
> head/tail pointers so both kernel space and user space knows
> when the buffer is full/empty.

Okay, so we've got these pinned down. So I've got two small ideas
(I mentioned them earlier, but maybe vger dropped the message):

(a) I think prefaulting should work for that in general, though the API
doesn't fit the extra things needed for e.g. DMA. Is there some way we
could extend remap_area_pages() (or provide an alternative interface to
similar functionality with the missing pieces included) to do the extra
things needed to make the coherency and/or DMA (or whatever else is
missing) work?

(b) Alternatively, would dma_coherent_to_pfn() instead of
dma_coherent_to_page() and making ->nopage() return pfns help salvage
the method using non-cachable and/or dma-coherent page protections in
vma->vm_page_prot?


-- wli

2004-03-21 00:22:11

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 20 Mar 2004, Russell King wrote:

> The issues are:
>
> 1. ALSA wants to mmap the buffer used to transfer data to/from the
> card into user space. This buffer may be direct-mapped RAM,
> memory allocated via dma_alloc_coherent(), an on-device buffer,
> or anything else.
>
> The user space mapping must likewise be DMA-coherent.
>
> Currently, ALSA just does virt_to_page() on whatever address it
> feels like in its nopage() function, which is obviously not
> acceptable for two out of the three specific cases above.
>
> 2. ALSA wants to _coherently_ share data between the kernel-side
> drivers, and user space ALSA library, mainly the DMA buffer
> head/tail pointers so both kernel space and user space knows
> when the buffer is full/empty.

Doesn't DRI also suffer from the same issues?

2004-03-21 03:13:57

by Chris Wedgwood

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 09:39:51AM -0800, Linus Torvalds wrote:

> If a driver wants to map non-RAM pages, that's perfectly ok, but it
> MUST NOT happen through "nopage()". The driver should map them with
> "remap_page_range()", and thus never take a page fault for such
> pages at all.

This is what the fetchop driver does.


2004-03-21 06:23:30

by Christoph Hellwig

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 07:13:55PM -0800, Chris Wedgwood wrote:
> > If a driver wants to map non-RAM pages, that's perfectly ok, but it
> > MUST NOT happen through "nopage()". The driver should map them with
> > "remap_page_range()", and thus never take a page fault for such
> > pages at all.
>
> This is what the fetchop driver does.

Not sure how you get to fetchop here, but that driver does map ram pages
so it should take pagefaults and not use remap_page_range().

2004-03-21 07:00:30

by Chris Wedgwood

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 06:23:22AM +0000, Christoph Hellwig wrote:

> Not sure how you get to fetchop here, but that driver does map ram
> pages so it should take pagefaults and not use remap_page_range().

It's been a while since I looked at this.... the fetchop driver maps
AMO space which is excluded from the EFI memory map (and any SHub
aliases) and thus shouldn't be touching anything normally considered
RAM.

<pause>

Checking the source I see:

if (remap_page_range(vm_start, __pa(maddr), PAGE_SIZE, vma->vm_page_prot)) {
fetchop_free_pages(vma->vm_private_data);
vfree(vdata);
fetchop_update_stats(-1, -pages);
return -EAGAIN;
}

as part of the drivers 'mmap fop'. The underlying page is actually
from region-6 so I'm pretty sure it's safe. If you think it is doing
something weird please let me know.



--cw

2004-03-21 09:53:07

by Arjan van de Ven

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 00:54, Russell King wrote:
> On Sat, Mar 20, 2004 at 02:45:18PM -0800, William Lee Irwin III wrote:
> > Is there any possibility of an extension to remap_area_pages() that
> > could resolve this? I can't say I fully understood and/or remember
> > the issue with it that you pointed out.
>
> The issues are:
>
> 1. ALSA wants to mmap the buffer used to transfer data to/from the
> card into user space. This buffer may be direct-mapped RAM,
> memory allocated via dma_alloc_coherent(), an on-device buffer,
> or anything else.


fwiw an ideal DRI/DRM driver would do the same with the video cards
ringbuffer so the problem isn't unique to alsa, it's a generic issue.


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2004-03-21 10:44:42

by Jaroslav Kysela

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 20 Mar 2004, Russell King wrote:

> The issues are:
>
> 1. ALSA wants to mmap the buffer used to transfer data to/from the
> card into user space. This buffer may be direct-mapped RAM,
> memory allocated via dma_alloc_coherent(), an on-device buffer,
> or anything else.

We don't require to remap the mmio ring buffer (actually only RME32 has
a PCI memory window with the ring buffer, but this driver uses
memcpy_(to|from)io already). So, we need to remap RAM and DMA pages
(should be special RAM also) only.

> The user space mapping must likewise be DMA-coherent.
>
> Currently, ALSA just does virt_to_page() on whatever address it
> feels like in its nopage() function, which is obviously not
> acceptable for two out of the three specific cases above.

Yes.

> 2. ALSA wants to _coherently_ share data between the kernel-side
> drivers, and user space ALSA library, mainly the DMA buffer
> head/tail pointers so both kernel space and user space knows
> when the buffer is full/empty.

Yes.

Jaroslav

-----
Jaroslav Kysela <[email protected]>
Linux Kernel Sound Maintainer
ALSA Project, SuSE Labs

2004-03-21 20:45:31

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, 2004-03-20 at 14:45 -0800, William Lee Irwin III wrote:
> This is the exact opposite of what I'd hoped come of this discussion.
> ISTR something about remap_area_pages() missing several pieces, but
> I pretty much need some kind of clarification to know what. Well, that,
> and I presumed your fixups for ALSA were headed toward mainline
> regardless after coping with whatever issue dwmw2 had (e.g. returning
> pfn's or something).

My request was that we shouldn't assume an architecture will have a
'struct page' corresponding to whatever it chooses to return from
dma_alloc_coherent().

There are machines where DMA to/from main memory _cannot_ be coherent
but we have some memory elsewhere, perhaps some SRAM which itself is
hanging off an I/O bus somewhere, which can be used. One of my toys is
currently running with dma_alloc_coherent() giving out memory from a PCI
video card, in fact.

Using a PFN should be OK.

--
dwmw2


2004-03-21 20:49:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 08:45:14PM +0000, David Woodhouse wrote:
> There are machines where DMA to/from main memory _cannot_ be coherent
> but we have some memory elsewhere, perhaps some SRAM which itself is
> hanging off an I/O bus somewhere, which can be used. One of my toys is
> currently running with dma_alloc_coherent() giving out memory from a PCI
> video card, in fact.
>
> Using a PFN should be OK.

And what exactly is a PFN without associated struct page supposed to mean?

2004-03-21 20:58:01

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 20:49 +0000, Christoph Hellwig wrote:
> And what exactly is a PFN without associated struct page supposed to mean?

It's something you can put into a PTE, and that's about it. Which unless
I'm misunderstanding ALSA/rmk's requirements, should be enough.

--
dwmw2


2004-03-21 21:53:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Sun, 21 Mar 2004, David Woodhouse wrote:
>
> On Sun, 2004-03-21 at 20:49 +0000, Christoph Hellwig wrote:
> > And what exactly is a PFN without associated struct page supposed to mean?
>
> It's something you can put into a PTE, and that's about it. Which unless
> I'm misunderstanding ALSA/rmk's requirements, should be enough.

It would really be wrong to have nopage() return a pte. The thing is, the
VM really works on "struct page", all over the map. It does things like
"page_cache_release()" on the page if the file-backed VMA has been
truncated, and it just knows that the return value from "nopage()" has
_structure_.

Some architectures have per-page flags for things like "this page may need
to have icache flushed from it" etc.

So I really put my veto on "nopage()" returning a PFN. That's just wrong,
wrong, wrong. It returns a "struct page" pointer, and it has lots of
reasons for that.

And none of the reasons for _not_ doing it are valid, since such a user
can just pre-populate the page tables anyway.

So don't go there.

Linus

2004-03-21 22:19:55

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

I wonder if we could jump back a step...

Years ago, I wanted to avoid remap_page_range() when I was writing
via82cxxx_audio.c, and so Linus suggested the ->nopage approach (which I
liked, and which is still present today in the sound/oss dir).

AFAICS device drivers have three needs that keep getting reinvented over
and over again, WRT mmap(2):
1) letting userspace directly address a region allocated by the kernel
DMA APIs
2) ditto, for MMIO (ioremap)
3) ditto, for PIO (inl/outl)

Alas, #3 must be faked on x86[-64], but this is done anyway for e.g.
mmap'd PCI config access. Many platforms implement in[bwl] essentially
as read[bwl], so for them mmap'd PIO is easy.

#1-3 above are really what device drivers want to do. My
suggestion/request to the VM wizards would be to directly provide mmap
helpers for dma/mmio/pio, that Does The Right Thing. And require their
use in every driver. Don't give driver writers the opportunity to think
about this stuff and/or screw it up.

If there are special DMA requirements of a particular bus or platform,
hide that in there. If some methods of DMA or MMIO or PIO do not lend
themselves to directly mapping to a struct page, the MM guys may dicker
about the interface, but the device driver guys just want #1-3 and don't
really care :) Either it's directly addressible [via some page table
magic] from userland, or it isn't.

So please forgive the tangent, but this thread is IMO talking more about
implementation than the real problem :) pci_dma_mmap() helper or
something like it should be the only thing the driver should care about.
I'm tired of the same platform bugs and issues, in mmap handlers,
reappearing over and over again... Tired of platform-specific ifdefs in
mmap-capable drivers, too.

Jeff



2004-03-21 22:24:40

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 01:53:41PM -0800, Linus Torvalds wrote:
> So I really put my veto on "nopage()" returning a PFN. That's just wrong,
> wrong, wrong. It returns a "struct page" pointer, and it has lots of
> reasons for that.

Ok then. We leave nopage() as is, and define that for returning RAM
backed pages.

We also have a fault() handler which is used for faulting in driver
mappings, which returns a PFN suitable for set_pte(). The fault()
would be separate from do_no_page() in much the same way as
do_anonymous_page() is separate, and it knows that PFNs returned
from this have nothing to do with struct pages. All it does is
set the relevant PTE entry in the page tables to create the mapping.

I don't think remap_area_pages() solves the problem - think about
the DMA ring buffer returned by dma_alloc_coherent(). This returns
an architectually defined virtual address and a DMA address.

Neither of these two addresses can be converted today to a struct
page or a PFN. Sure, we can invent some architecture defined
interface to get hold of this information, but take a moment to
consider all the cases where this type of activity goes on.

What about the case where the buffer is scatter-gather in nature,
just like we're so fond of telling driver writers who want to grab
(eg) 1MB of contiguous kernel memory for video buffers and the like?
Do we really want to tell driver writers to walk over 1MB of pages,
page by page, inserting them into the processes page tables via
remap_area_pages()?

Or does the ->fault() method make sense in all these cases?

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-21 22:27:11

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 13:53 -0800, Linus Torvalds wrote:
> So I really put my veto on "nopage()" returning a PFN. That's just wrong,
> wrong, wrong. It returns a "struct page" pointer, and it has lots of
> reasons for that.

That's fine -- I wasn't suggesting nopage() should return a PFN.

I was suggesting that if someone wants to map something they're given by
dma_alloc_coherent() into memory, they should be given a PFN to deal
with -- _not_ a "struct page". Therefore, you can't use nopage() for
mapping dma_coherent memory into userspace.

Basically, we should consider the stuff returned by dma_alloc_coherent
to be 'non-RAM' in the context of your previous statement:
'If a driver wants to map non-RAM pages, that's perfectly ok,
but it MUST NOT happen through "nopage()".'

There are machines where you _cannot_ sensibly use host memory for
dma_coherent() allocations, but on which there _is_ a few megabytes of
SRAM hanging off the PCI bus which was put there specifically for that
purpose. So dma_alloc_coherent() returns something for which there is
not a valid 'struct page'.

--
dwmw2


2004-03-21 22:34:16

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Russell King wrote:
> What about the case where the buffer is scatter-gather in nature,
> just like we're so fond of telling driver writers who want to grab
> (eg) 1MB of contiguous kernel memory for video buffers and the like?
> Do we really want to tell driver writers to walk over 1MB of pages,
> page by page, inserting them into the processes page tables via
> remap_area_pages()?

Tell driver writers to call a standard platform function with a
{dma|mmio|pio|vmalloc} handle+size+len for {dma|mmio|pio|vmalloc} mmap
setup, and {fault|nopage} handler. ;-) IMO they shouldn't have to care
about the details.

Jeff



2004-03-21 22:42:15

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 17:34 -0500, Jeff Garzik wrote:
> Tell driver writers to call a standard platform function with a
> {dma|mmio|pio|vmalloc} handle+size+len for {dma|mmio|pio|vmalloc} mmap
> setup, and {fault|nopage} handler. ;-) IMO they shouldn't have to care
> about the details.

Don't let drivers see the {fault|nopage} handler. On most arches it can
probably continue to be nopage(); other arches may use the
newly-proposed fault() or perhaps just put all the PTEs in place up
front. The driver shouldn't be given an opportunity to care.

--
dwmw2


2004-03-21 22:51:42

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 05:34:01PM -0500, Jeff Garzik wrote:
> Russell King wrote:
> > What about the case where the buffer is scatter-gather in nature,
> > just like we're so fond of telling driver writers who want to grab
> > (eg) 1MB of contiguous kernel memory for video buffers and the like?
> > Do we really want to tell driver writers to walk over 1MB of pages,
> > page by page, inserting them into the processes page tables via
> > remap_area_pages()?
>
> Tell driver writers to call a standard platform function with a
> {dma|mmio|pio|vmalloc} handle+size+len for {dma|mmio|pio|vmalloc} mmap
> setup, and {fault|nopage} handler. ;-) IMO they shouldn't have to care
> about the details.

I don't think this addresses the scatter-gather case I mentioned above.
Or if we are, we've rewritten ALSA before hand to use Linux scatterlists
along side several dma_alloc_coherent mappings and have the ability to
mmap these as well.

Remember that we're fond of telling driver writers to use scatter gather
lists rather than grabbing one large contiguous memory chunk... So
they did exactly as we told them. Using pci_alloc_consistent and/or
dma_alloc_coherent and built their own scatter lists.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-21 23:07:44

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

David Woodhouse wrote:
> On Sun, 2004-03-21 at 17:34 -0500, Jeff Garzik wrote:
>
>>Tell driver writers to call a standard platform function with a
>>{dma|mmio|pio|vmalloc} handle+size+len for {dma|mmio|pio|vmalloc} mmap
>>setup, and {fault|nopage} handler. ;-) IMO they shouldn't have to care
>>about the details.
>
>
> Don't let drivers see the {fault|nopage} handler. On most arches it can
> probably continue to be nopage(); other arches may use the
> newly-proposed fault() or perhaps just put all the PTEs in place up
> front. The driver shouldn't be given an opportunity to care.

If that's possible within the MM APIs... certainly. Have a standard
struct vm_operations_struct for dma, dma s/g, mmio, ... I presume?

Jeff



2004-03-21 23:10:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Russell King wrote:
> I don't think this addresses the scatter-gather case I mentioned above.
> Or if we are, we've rewritten ALSA before hand to use Linux scatterlists
> along side several dma_alloc_coherent mappings and have the ability to
> mmap these as well.
>
> Remember that we're fond of telling driver writers to use scatter gather
> lists rather than grabbing one large contiguous memory chunk... So
> they did exactly as we told them. Using pci_alloc_consistent and/or
> dma_alloc_coherent and built their own scatter lists.

Agreed... though IMO that can handled by considering DMA S/G as just
one more set of helper functions that the driver writer should not have
to implement ;) dma_sg_setup_mmap() could function as a peer alongside
dma_setup_mmap(), mmio_setup_mmap(), etc. Providing such to driver
writers gives them incentive to use S/G lists as well as incentive not
to invent their own mmap(2) setup and handling code.

Jeff



2004-03-21 23:13:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Sun, 21 Mar 2004, Russell King wrote:
>
> Remember that we're fond of telling driver writers to use scatter gather
> lists rather than grabbing one large contiguous memory chunk... So
> they did exactly as we told them. Using pci_alloc_consistent and/or
> dma_alloc_coherent and built their own scatter lists.

I do think that we should introduce a "map_dma_coherent()" thing, which
basically takes a list of pages that have been allocated by
dma_alloc_coherent(), and remaps them into user space. How hard can that
be?

In fact, on a lot of architectures (well, at least x86, and likely
anything else that doesn't use any IOTLB and just allocates a chunk of
physical memory), I think the "map_dma_coherent()" thing should basically
just become a "remap_page_range()". Ie something like

#define map_dma_coherent(vma, vaddr, len) \
remap_page_range(vma, vma->vm_start, __pa(vaddr), len, vma->vm_page_prot)

for the simple case.

Ehh?

Linus

2004-03-21 23:22:30

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Linus Torvalds wrote:
> In fact, on a lot of architectures (well, at least x86, and likely
> anything else that doesn't use any IOTLB and just allocates a chunk of
> physical memory), I think the "map_dma_coherent()" thing should basically
> just become a "remap_page_range()". Ie something like
>
> #define map_dma_coherent(vma, vaddr, len) \
> remap_page_range(vma, vma->vm_start, __pa(vaddr), len, vma->vm_page_prot)
>
> for the simple case.


That would be nice, though the reason I avoided remap_page_range() in
via82cxxx_audio is that it discourages S/G. Because remap_page_range()
is easier and more portable, several drivers allocate one-big-area and
then create an S/G list describing individual portions of that area.

I want to avoid that. Most decent h/w is s/g these days.

Jeff



2004-03-21 23:45:42

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 03:11:58PM -0800, Linus Torvalds wrote:
> On Sun, 21 Mar 2004, Russell King wrote:
> > Remember that we're fond of telling driver writers to use scatter gather
> > lists rather than grabbing one large contiguous memory chunk... So
> > they did exactly as we told them. Using pci_alloc_consistent and/or
> > dma_alloc_coherent and built their own scatter lists.
>
> I do think that we should introduce a "map_dma_coherent()" thing, which
> basically takes a list of pages that have been allocated by
> dma_alloc_coherent(), and remaps them into user space. How hard can that
> be?
>
> In fact, on a lot of architectures (well, at least x86, and likely
> anything else that doesn't use any IOTLB and just allocates a chunk of
> physical memory), I think the "map_dma_coherent()" thing should basically
> just become a "remap_page_range()". Ie something like
>
> #define map_dma_coherent(vma, vaddr, len) \
> remap_page_range(vma, vma->vm_start, __pa(vaddr), len, vma->vm_page_prot)
>
> for the simple case.

Ok, splitting hairs, for the coherent contiguous case, what about:

int dma_coherent_map(struct vm_area_struct *vma, void *cpu_addr,
dma_addr_t dma_addr, size_t size);

and x86 would be:

#define dma_coherent_map(vma,cpu_addr,dma_addr,size) \
remap_page_range(vma, vma->vm_start, __pa(cpu_addr), \
size, vma->vm_page_prot)

This then leaves the PCI BAR case and the DMA coherent SG buffer case,
though neither of those fall within my personal problem space at present.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-21 23:51:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Sun, 21 Mar 2004, Jeff Garzik wrote:
>
> That would be nice, though the reason I avoided remap_page_range() in
> via82cxxx_audio is that it discourages S/G. Because remap_page_range()
> is easier and more portable, several drivers allocate one-big-area and
> then create an S/G list describing individual portions of that area.

Note that there is really two different kinds of IO memory:
- real IO-mapped memory on the other side of a bus
- real RAM which is on the CPU side of the bus, but that has additionally
been "mapped" some way as to be visible from devices.

The second kind is what you seem to be talking about, and it actually
_does_ have a "struct page" associated with it, and as such you can
happily return it from "nopage()". It's just that you had better be sure
that you find the page properly. Just doing a "virt_to_page()" doesn't do
it - you have to make sure to undo the mapping that was done for DMA
reasons.

So the minimal fix for any misuses would be to just have a
"dma_map_to_page()" reverse mapping for "dma_alloc_coherent()". For x86,
that's just the same thing as "virt_to_page()". For others, you have to
look more carefully at undoing whatever mapping the iommu has been set up
for.

That might be the minimal fix, since it would basically involve:
- change whatever offensive "virt_to_page()" calls into
"dma_map_to_page()".
- implement "dma_map_to_page()" for all architectures.

Would that make people happy?

(Architectures that have cache coherency issues will obviously also have
to set cache disable bits in the vma information, that's they broken
architecture problem)

Linus

2004-03-21 23:59:05

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 03:51:31PM -0800, Linus Torvalds wrote:
> That might be the minimal fix, since it would basically involve:
> - change whatever offensive "virt_to_page()" calls into
> "dma_map_to_page()".
> - implement "dma_map_to_page()" for all architectures.
>
> Would that make people happy?

Unfortunately this doesn't make dwmw2 happy - he claims to have machines
which implement dma_alloc_coherent using RAM which doesn't have any
struct page associated with it.

I've already got the interface you suggest above for ARM, and I'd have
taken this further had dwmw2 not chimed in.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-22 00:03:13

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 15:51 -0800, Linus Torvalds wrote:
> Note that there is really two different kinds of IO memory:
> - real IO-mapped memory on the other side of a bus
> - real RAM which is on the CPU side of the bus, but that has additionally
> been "mapped" some way as to be visible from devices.
>
> The second kind is what you seem to be talking about,

<...>

> So the minimal fix for any misuses would be to just have a
> "dma_map_to_page()" reverse mapping for "dma_alloc_coherent()". For x86,
> that's just the same thing as "virt_to_page()". For others, you have to
> look more carefully at undoing whatever mapping the iommu has been set up
> for.

You are assuming that dma_alloc_coherent() will always return memory of
that second kind -- host-side RAM, not PCI-side. That hasn't previously
been a requirement, and there are machines out there on which it makes a
lot more sense for dma_alloc_coherent() to use some SRAM which happens
to be hanging off the I/O bus than it does to use host RAM.

Doing dma_map_to_pfn() instead of dma_map_to_page() would work. That
means you can't use nopage() for mappings of dma_coherent memory. That's
fine though.

> Would that make people happy?

No. It'd be OK if you make it dma_map_to_pfn() instead of
dma_map_to_page() though. As discussed, that means you can't use
nopage() for mappings of dma_coherent memory. That's fine though.

I think it would be better to provide arch-specific functions for
mapping dma_coherent allocations and SG lists. On most architectures we
can just do it with virt_to_page() and nopage() and it'll be OK. On
others we can do the right thing as appropriate.

--
dwmw2


2004-03-22 00:11:13

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Linus Torvalds wrote:
>
> On Sun, 21 Mar 2004, Jeff Garzik wrote:
>
>>That would be nice, though the reason I avoided remap_page_range() in
>>via82cxxx_audio is that it discourages S/G. Because remap_page_range()
>>is easier and more portable, several drivers allocate one-big-area and
>>then create an S/G list describing individual portions of that area.
>
>
> Note that there is really two different kinds of IO memory:
> - real IO-mapped memory on the other side of a bus
> - real RAM which is on the CPU side of the bus, but that has additionally
> been "mapped" some way as to be visible from devices.

Yes. via audio example is DMA (second kind), and an fbdev driver would
need to worry about the first kind (MMIO).

For the second kind, your solution (snipped) seems sane, though I wonder
where dma_unmap_to_page() is called.

For the first kind, please read fb_mmap in drivers/video/fbmem.c. Look
at the _horror_ of ifdefs in exporting the framebuffer. And that horror
is what's often needed when letting userspace mmap(2) PCI memory IO regions.

So, an mmio_map() in addition to dma_map*?

Jeff



2004-03-22 00:20:51

by Russell King

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 07:10:53PM -0500, Jeff Garzik wrote:
> For the first kind, please read fb_mmap in drivers/video/fbmem.c. Look
> at the _horror_ of ifdefs in exporting the framebuffer. And that horror
> is what's often needed when letting userspace mmap(2) PCI memory IO regions.

Most of this:

#if defined(__mc68000__)
...
#elif defined(__mips__)
pgprot_val(vma->vm_page_prot) &= ~_CACHE_MASK;
pgprot_val(vma->vm_page_prot) |= _CACHE_UNCACHED;
#elif defined(__sh__)
pgprot_val(vma->vm_page_prot) &= ~_PAGE_CACHABLE;
#elif defined(__hppa__)
pgprot_val(vma->vm_page_prot) |= _PAGE_NO_CACHE;
#elif defined(__ia64__) || defined(__arm__)
vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
#else
#warning What do we have to do here??
#endif

exists because architectures haven't defined their private
pgprot_writecombine() implementations, preferring instead to add
to the preprocessor junk instead.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 PCMCIA - http://pcmcia.arm.linux.org.uk/
2.6 Serial core

2004-03-22 00:24:16

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 11:45:15PM +0000, Russell King wrote:
> Ok, splitting hairs, for the coherent contiguous case, what about:
> int dma_coherent_map(struct vm_area_struct *vma, void *cpu_addr,
> dma_addr_t dma_addr, size_t size);
> and x86 would be:
> #define dma_coherent_map(vma,cpu_addr,dma_addr,size) \
> remap_page_range(vma, vma->vm_start, __pa(cpu_addr), \
> size, vma->vm_page_prot)
> This then leaves the PCI BAR case and the DMA coherent SG buffer case,
> though neither of those fall within my personal problem space at present.

Can we get an offset into the area as one of the args? Then scatter/gather
should be trivially constructible (via iteration) from the interface.
Maybe something like:

struct dma_scatterlist {
dma_addr_t dma_addr; /* DMA address */
void *cpu_addr; /* cpu address */
unsigned long length; /* in units of pages */
};

int dma_mmap_coherent_sg(struct dma_scatterlist *sglist,
int nr_sglist_elements, /* length of sglist */
struct vm_area_struct *vma, /* for address space */
unsigned long address, /* user virtual address */
unsigned long offset, /* offset (in pages) */
unsigned long nr_pages); /* length (in pages) */

int dma_munmap_coherent_sg(struct dma_scatterlist *sglist,
int nr_sglist_elements, /* length of sglist */
struct vm_area_struct *vma, /* for address space */
unsigned long address, /* user virtual address */
unsigned long offset, /* offset (in pages) */
unsigned long nr_pages); /* length (in pages) */

int dma_alloc_coherent_sg(struct dma_scatterlist **sglist,
unsigned long length); /* length in pages */

int dma_free_coherent_sg(struct dma_scatterlist **sglist,
unsigned long length); /* length in pages */

Would be useful? And these in turn would drive the dma_alloc_coherent()
and helpers like:

int dma_mmap_coherent(struct vm_area_struct *vma,
unsigned long address,
dma_addr_t dma_addr, /* DMA address */
void *cpu_addr, /* cpu address */
unsigned long nr_pages); /* length (in pages) */

int dma_munmap_coherent(struct vm_area_struct *vma,
unsigned long address,
dma_addr_t dma_addr, /* DMA address */
void *cpu_addr, /* cpu address */
unsigned long nr_pages); /* length (in pages) */

Does any of this sound like it's on the right track API-wise?

My thought on attacking the scatter/gather issue is basically centered
around "they're going to try to do it anyway, and if they don't have
something there to do it for them, they'll get it wrong."


-- wli

2004-03-22 00:30:21

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

William Lee Irwin III wrote:
> int dma_mmap_coherent_sg(struct dma_scatterlist *sglist,
> int nr_sglist_elements, /* length of sglist */
> struct vm_area_struct *vma, /* for address space */
> unsigned long address, /* user virtual address */
> unsigned long offset, /* offset (in pages) */
> unsigned long nr_pages); /* length (in pages) */
>
> int dma_munmap_coherent_sg(struct dma_scatterlist *sglist,
> int nr_sglist_elements, /* length of sglist */
> struct vm_area_struct *vma, /* for address space */
> unsigned long address, /* user virtual address */
> unsigned long offset, /* offset (in pages) */
> unsigned long nr_pages); /* length (in pages) */
>
> int dma_alloc_coherent_sg(struct dma_scatterlist **sglist,
> unsigned long length); /* length in pages */
>
> int dma_free_coherent_sg(struct dma_scatterlist **sglist,
> unsigned long length); /* length in pages */

No comment on struct dma_scatterlist, but the above is the most natural
API for audio drivers at least.

Audio drivers allocate buffers at ->probe() or open(2), and the only
entity that actually cares about the contents of the buffers are (a) the
hardware and (b) userland. via82cxxx_audio only uses
pci_alloc_consistent because there's not a more appropriate DMA
allocator for the use to which that memory is put.

Audio drivers only need to read/write the buffers inside the kernel when
implementing read(2) and write(2) via copy_{to,from}_user().

Jeff



2004-03-22 00:33:44

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 11:58:54PM +0000, Russell King wrote:
> On Sun, Mar 21, 2004 at 03:51:31PM -0800, Linus Torvalds wrote:
> > That might be the minimal fix, since it would basically involve:
> > - change whatever offensive "virt_to_page()" calls into
> > "dma_map_to_page()".
> > - implement "dma_map_to_page()" for all architectures.
> >
> > Would that make people happy?
>
> Unfortunately this doesn't make dwmw2 happy - he claims to have machines
> which implement dma_alloc_coherent using RAM which doesn't have any
> struct page associated with it.

I would suggest to add a ->nopage_dma (or whatever other name for an
additional callback in the vm_ops) that will return a non pageable "pfn"
number (not a page_t*). This is all the VM needs to setup the pte
properly, this callback will not know anything about the pageable stuff
(i.e. it will not have to call page_add_rmap or stuff like that).

I definitely agree a driver currently has no way to work safe if it
returns non-ram via ->nopage and it must use remap_file_pages, but OTOH
I don't like remap_file_pages myself, it's a lot nicer to use paging
even for mapping non-ram, even if you don't use scatter gather, even if
you've just an huge block of contigous physical ram, at the very least
for the scheduler latencies in a loop under the page_table_lock.

nopage_dma will be like this:

do_no_page_dma(vma, ...)
{
pfn = vma->vm_ops->nopage_dma()
if (pfn_valid(pfn)) {
/*
* going from valid pfn to page is always ok
* the other way around not
*/
page = pfn_to_page(pfn);
BUG_ON(page->mapping);
if (!PageReserved(page))
mm->rss++;
}
setup the pte using the pfn here, no vm accounting or pte tracking
required since it's either non valid pfn or reserved page that
will be ignored by the zap_pte stuff
}

do_no_page()
{
if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table,
pmd, write_access, address);
if (vma->vm_ops->nopage_dma)
return do_no_page_dma(...)
}

Then the mmu VM troubles are over, how you keep the cache of this pte
view coherent with the iommu view isn't something solvable by the mmu,
but certainly you can add whatever cache flushing callback in teh
do_no_page_dma core, that's a slow path so you can play with it from any
arch adding whatever needed library calls.

btw, on a slightly related note, I don't think this is safe in
get_user_pages in 2.6:

if (!PageReserved(pages[i]))
page_cache_get(pages[i]);

there's nothing preventing munmap to free the page while somebody does
I/O on the page via get_user_pages.

2004-03-22 00:34:02

by Jeff Garzik

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

Russell King wrote:
> On Sun, Mar 21, 2004 at 07:10:53PM -0500, Jeff Garzik wrote:
>
>>For the first kind, please read fb_mmap in drivers/video/fbmem.c. Look
>>at the _horror_ of ifdefs in exporting the framebuffer. And that horror
>>is what's often needed when letting userspace mmap(2) PCI memory IO regions.
>
>
> Most of this:
[...]
> exists because architectures haven't defined their private
> pgprot_writecombine() implementations, preferring instead to add
> to the preprocessor junk instead.


Agreed but the larger point is that that code should not be in fbmem.c
at all.

There are two main types of usage for bus IO memory (MMIO), data and
hardware registers. Both types driver writers currently export to
userspace via mmap(2). Caching and write combining are simply
driver-controlled attributes one must consider.

Jeff



2004-03-22 01:34:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, Mar 21, 2004 at 07:29:59PM -0500, Jeff Garzik wrote:
> No comment on struct dma_scatterlist, but the above is the most natural
> API for audio drivers at least.
> Audio drivers allocate buffers at ->probe() or open(2), and the only
> entity that actually cares about the contents of the buffers are (a) the
> hardware and (b) userland. via82cxxx_audio only uses
> pci_alloc_consistent because there's not a more appropriate DMA
> allocator for the use to which that memory is put.
> Audio drivers only need to read/write the buffers inside the kernel when
> implementing read(2) and write(2) via copy_{to,from}_user().

I based it on rmk's set of arguments to his functions; I'm hoping for
feedback (or another API/implementation) from him and hopefully at
least one other arch maintainer having problems in this area. I'm
hoping to focus mostly on the driver sweep, and to devolve e.g. finer
details of the design like the above arguments and/or structures to
those with more detailed knowledge or direct experience (and the
broader details came from elsewhere too).


-- wli

2004-03-22 03:05:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Mon, 22 Mar 2004, Andrea Arcangeli wrote:
>
> On Sun, Mar 21, 2004 at 11:58:54PM +0000, Russell King wrote:
> > On Sun, Mar 21, 2004 at 03:51:31PM -0800, Linus Torvalds wrote:
> > > That might be the minimal fix, since it would basically involve:
> > > - change whatever offensive "virt_to_page()" calls into
> > > "dma_map_to_page()".
> > > - implement "dma_map_to_page()" for all architectures.
> > >
> > > Would that make people happy?
> >
> > Unfortunately this doesn't make dwmw2 happy - he claims to have machines
> > which implement dma_alloc_coherent using RAM which doesn't have any
> > struct page associated with it.
>
> I would suggest to add a ->nopage_dma (or whatever other name for an
> additional callback in the vm_ops) that will return a non pageable "pfn"

No.

Fix the broken architecture instead.

Linus

2004-03-22 03:29:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?



On Mon, 22 Mar 2004, David Woodhouse wrote:
>
> You are assuming that dma_alloc_coherent() will always return memory of
> that second kind -- host-side RAM, not PCI-side. That hasn't previously
> been a requirement, and there are machines out there on which it makes a
> lot more sense for dma_alloc_coherent() to use some SRAM which happens
> to be hanging off the I/O bus than it does to use host RAM.

So? Those architectures can just allocate "struct page" entries for that
memory too.

There is a point where we should not care about idiotic architectures any
more. We should care about what happens in 99% of all architectures, and
the rest get to work around their _own_ quirks. We do not make the VM
uglier for some insane "it can happen" case.

Linus

2004-03-22 04:57:11

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sun, 2004-03-21 at 06:44, Jaroslav Kysela wrote:
> On Sat, 20 Mar 2004, Russell King wrote:
>
> > It is well known that virt_to_page() is only valid on virtual addresses
> > which correspond to kernel direct mapped RAM pages, and undefined on
> > everything else. Unfortunately, ALSA has been using it with
> > pci_alloc_consistent() for a long time, and this behaviour is what
> > makes ALSA broken. The fact it works on x86 is merely incidental.
>
> It works on PPC as well (at least we have no error reports).

It works only on PPCs that have consistent IOs, not embedded ones which
will use special mappings for getting non-cacheable pages.

> Yes, I'm sorry about that, but the ->nopage usage was requested by Jeff
> Garzik and we're not gurus for the VM stuff. Because we're probably first
> starting using of this mapping scheme, it resulted to problems.


2004-03-22 05:01:10

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?


> Doesn't DRI also suffer from the same issues?

Well, it depends. Most of the time, DRI uses AGP which is a different
story altogether.

DRI suffers from similar issue when using PCI GART, but then, it also
doesn't use the consistent alloc routines, it gets pages with GFP,
mmap those into userland, and does pci_map_single in the kernel on
each individual page to obtain the bus addresses. This will not be
pretty on non-coherent architectures though.

Ben.


2004-03-22 05:11:55

by Benjamin Herrenschmidt

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?


> exists because architectures haven't defined their private
> pgprot_writecombine() implementations, preferring instead to add
> to the preprocessor junk instead.

And it's not even about writecombine... Like writecombine is an
attribute of the PCI host bridge on pmacs, not a pgprot, while
cacheability issues cannot be always abstracted the same way on
different archs. Actually, GUARDED could be used for !writecombine
on ppc, but !GUARDED would allow prefetch and out of order IOs....

Ben.


2004-03-22 06:41:59

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Sat, Mar 20, 2004 at 10:26:39PM +0000, Russell King wrote:
> I'm no longer planning on this. In fact, I see a future where I tell
> people who want to use sound on ARM to go screw themselves because
> there doesn't seem to be an acceptable solution to this problem.
> Of course, this will lead to dirty hacks by many people who *REQUIRE*
> sound to work, but I guess we just don't care about that.
> (Yes, I'm pissed off over this issue.)

I was really hoping I could help make things work for everyone
because they are not working for everyone now.

Unfortunately, now I also see a future without working sound drivers on
ARM and several others. I'm sorry. I tried, but I've completely run out
of ideas (hell, most of them weren't even mine to begin with, so that
goes back to even before I gave up) and thus far every possible method
of fixing this has been shot down.

... and here I thought fixing drivers was the Right Thing to Do (TM).


-- wli

2004-03-22 18:23:34

by Richard Curnow

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

* Benjamin Herrenschmidt <[email protected]> [2004-03-22]:
> DRI suffers from similar issue when using PCI GART, but then, it also
> doesn't use the consistent alloc routines, it gets pages with GFP,
> mmap those into userland, and does pci_map_single in the kernel on
> each individual page to obtain the bus addresses. This will not be
> pretty on non-coherent architectures though.

... or on platforms where PCI bounce-buffers are being used.

--
Richard \\\ SH-4/SH-5 Core & Debug Architect
Curnow \\\ SuperH (UK) Ltd, Bristol

2004-03-23 17:57:05

by Andy Whitcroft

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

--On 21 March 2004 23:58 +0000 Russell King <[email protected]>
wrote:

> On Sun, Mar 21, 2004 at 03:51:31PM -0800, Linus Torvalds wrote:
>> That might be the minimal fix, since it would basically involve:
>> - change whatever offensive "virt_to_page()" calls into
>> "dma_map_to_page()".
>> - implement "dma_map_to_page()" for all architectures.
>>
>> Would that make people happy?
>
> Unfortunately this doesn't make dwmw2 happy - he claims to have machines
> which implement dma_alloc_coherent using RAM which doesn't have any
> struct page associated with it.

Would it not be possible to allocate struct page's for these special areas
of memory? Worst, worst, worst case could they not represent pages in a
memory only node in the NUMA sense? I am sure there is some way they could
be 'tacked' onto the end of the cmap in reality?

-apw

2004-03-23 17:58:48

by David Woodhouse

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On Tue, 2004-03-23 at 17:59 +0000, Andy Whitcroft wrote:
> Would it not be possible to allocate struct page's for these special areas
> of memory? Worst, worst, worst case could they not represent pages in a
> memory only node in the NUMA sense? I am sure there is some way they could
> be 'tacked' onto the end of the cmap in reality?

It would be possible. But why? What benefit do we gain from this
pretence?

Just hide it all from the driver with dma_coherent_mmap() and forget
about it. Let the arch deal with it -- the _common_ case will be that we
use nopage for the actual mapping, perhaps. But why mandate it?


--
dwmw2

2004-03-23 18:11:59

by William Lee Irwin III

[permalink] [raw]
Subject: Re: can device drivers return non-ram via vm_ops->nopage?

On 21 March 2004 23:58 +0000 Russell King <[email protected]>
>> Unfortunately this doesn't make dwmw2 happy - he claims to have machines
>> which implement dma_alloc_coherent using RAM which doesn't have any
>> struct page associated with it.

On Tue, Mar 23, 2004 at 05:59:20PM +0000, Andy Whitcroft wrote:
> Would it not be possible to allocate struct page's for these special areas
> of memory? Worst, worst, worst case could they not represent pages in a
> memory only node in the NUMA sense? I am sure there is some way they could
> be 'tacked' onto the end of the cmap in reality?

This has already been beaten to death and resolved. dma_mmap_coherent()
is the preferred solution and will have no reliance on the coremap apart
from requiring it when faults are handled (to feed the core API), and
requiring prefaulting when coremap elements are absent for the mapped
areas. More importantly, it allows sane fallback to read()/write() and
understands the results of dma_alloc_coherent(), which virt_to_page(),
whose current use on dma_alloc_coherent()'s results causes driver bugs,
does not.


-- wli