Bryan> I suspect that our use of SetPageReserved in
Bryan> ipath_file_ops.c may be causing us problems, but I am not
Bryan> sure what correct behaviour would look like. Suggestions
Bryan> appreciated.
Why are you doing SetPageReserved? As I understand things, the
reserved bit is deprecated because it doesn't really have any defined
semantics...
- R.
On Thu, 2006-03-09 at 15:28 -0800, Roland Dreier wrote:
> Why are you doing SetPageReserved? As I understand things, the
> reserved bit is deprecated because it doesn't really have any defined
> semantics...
News to me :-)
Any idea what I should be using instead?
<b
On Thu, 2006-03-09 at 16:01 -0800, Roland Dreier wrote:
> Bryan> Any idea what I should be using instead?
>
> It depends what you're trying to do. Hence my original question: why
> are you doing SetPageReserved?
We're mapping some memory that the chip DMAs to into userspace, so that
user processes can spin on memory locations without going through the
kernel. The SetPageReserved hack is an attempt to stop the VM from
reclaiming those pages from us once a user process exits.
I realise that it's surely bogus, and I'd be thrilled to do something
correct instead. We've tried doing both SetPageReserved and get_page,
but it hasn't work out too well so far.
<b
Bryan> Any idea what I should be using instead?
It depends what you're trying to do. Hence my original question: why
are you doing SetPageReserved?
- R.
Bryan> We're mapping some memory that the chip DMAs to into
Bryan> userspace, so that user processes can spin on memory
Bryan> locations without going through the kernel. The
Bryan> SetPageReserved hack is an attempt to stop the VM from
Bryan> reclaiming those pages from us once a user process exits.
Bryan> I realise that it's surely bogus, and I'd be thrilled to do
Bryan> something correct instead. We've tried doing both
Bryan> SetPageReserved and get_page, but it hasn't work out too
Bryan> well so far.
What's wrong with doing get_page()? Surely the VM won't take pages
that you hold a reference to.
- R.
"Bryan O'Sullivan" <[email protected]> wrote:
>
> On Thu, 2006-03-09 at 16:01 -0800, Roland Dreier wrote:
> > Bryan> Any idea what I should be using instead?
> >
> > It depends what you're trying to do. Hence my original question: why
> > are you doing SetPageReserved?
>
> We're mapping some memory that the chip DMAs to into userspace, so that
> user processes can spin on memory locations without going through the
> kernel. The SetPageReserved hack is an attempt to stop the VM from
> reclaiming those pages from us once a user process exits.
If your driver allocated these pages and never added them to the LRU then
the VM won't touch them.
If your driver owns the pages and has a ref on them then they won't get
freed at task exit-time.
If the app owns the pages and you're using get_user_pages() then your
driver owns the last ref on the pages.
> I realise that it's surely bogus, and I'd be thrilled to do something
> correct instead. We've tried doing both SetPageReserved and get_page,
> but it hasn't work out too well so far.
We'd need to see a halfway decent description of the problem first ;)
On Thu, 2006-03-09 at 16:32 -0800, Roland Dreier wrote:
> What's wrong with doing get_page()? Surely the VM won't take pages
> that you hold a reference to.
We've tried it, but it has apparently not been enough so far. I'll see
if I can post an oops report.
<b
On Thu, 2006-03-09 at 16:37 -0800, Andrew Morton wrote:
> We'd need to see a halfway decent description of the problem first ;)
I've been consumed with patch generation for a bit :-)
I'll try to come up with a coherent description of what we're doing and
how it's blowing up in our face tomorrow.
<b
On Thu, 2006-03-09 at 16:37 -0800, Andrew Morton wrote:
> If your driver allocated these pages and never added them to the LRU then
> the VM won't touch them.
We have a rather complex set of user memory mappings. Let me describe
what's going on.
We use get_user_pages to pin user pages, but I'm 85% sure that our
acquisition and release of user pages is just fine (mostly because the
protocol for get_user_pages/free_page is simple). It's all the other
stuff I'm worried about. So here goes ...
There are five (yes, really) different kinds of thing we mmap into
userspace.
The first two are windows onto the chip's MMIO space; we create mappings
for userspace in our mmap code using io_remap_pfn_range.
- General user-oriented chip registers.
- Chip PIO buffers.
We don't do any funnies with get_page/SetPageReserved (or the new
vm_insert_page) on these. However, we turn on the VM_* flag christmas
tree, in a frenzied effort to make the kernel pay no attention:
VM_DONTCOPY | VM_DONTEXPAND | VM_IO | VM_SHM | VM_LOCKED.
I'm pretty sure this is wrong, but I have no idea what right would look
like.
The next three kinds or memory are allocated with dma_alloc_coherent
(first two) or pci_alloc_consistent (last). We have a nopage handler
that knows how to deal with them. We set the low two bits of
vma->vm_private_data to tell the nopage handler what kind of memory it
is dealing with.
- IPATH_VM_RCVEGRBUFS: Memory that the driver allocates for the chip
to DMA small ("eager") messages into. Read-only for userspace. We
mark these pages with SetPageReserved.
- IPATH_VM_RCVHDRQ: A table of queue headers, again DMAed into by the
chip. Read and written by userspace. We mark these pages with
SetPageReserved.
- IPATH_VM_PIOAVAILREGS: A table of buffer availability
registers. DMAed into by the chip. Read-only for userspace. Not
marked with SetPageReserved.
Once again, we sprinkle heaps of VM_* flags all over the freshly baked
mapping: VM_DONTEXPAND | VM_DONTCOPY | VM_RESERVED | VM_IO | VM_SHM
Some of these mappings are exactly one page in size, some are more.
The nopage handler looks very normal, except it does a get_page on
pages marked with IPATH_VM_PIOAVAILREGS, but not on others. Presumably
this is because they've had SetPageReserved set on them.
I won't claim that any of this was ever even slightly correct, but we at
least got away with it from mid-2.5 kernels until 2.6.15. Now we're
apparently being so ill-behaved that the entire box locks up when we run
"hello, world".
These interfaces have changed a lot recently, and surely there are many
people who know better than I what we ought to be doing. I'm personally
stumped. The current code makes no real sense to me, but I have no
outline in my head of what we should be doing instead.
Things about which I am confused:
I don't know what to to protect chip memory that I'm mapping into
userspace.
I think I shouldn't be calling SetPageReserved at all, but I don't know
what I should be doing instead. Naively using get_page instead just
gets me a big crash.
I'll be happy to show off specific hunks of code if you think they'll
help you to understand what's going on, so that you can suggest a better
way.
<b
> The first two are windows onto the chip's MMIO space; we create mappings
> for userspace in our mmap code using io_remap_pfn_range.
>
> - General user-oriented chip registers.
>
> - Chip PIO buffers.
>
> We don't do any funnies with get_page/SetPageReserved (or the new
> vm_insert_page) on these. However, we turn on the VM_* flag christmas
> tree, in a frenzied effort to make the kernel pay no attention:
> VM_DONTCOPY | VM_DONTEXPAND | VM_IO | VM_SHM | VM_LOCKED.
I don't think you need to do anything beyond io_remap_pfn_range().
Look at the comment inside remap_pfn_range() in mm/memory.c.
You may want VM_DONTCOPY for fork() handling I guess.
> The next three kinds or memory are allocated with dma_alloc_coherent
> (first two) or pci_alloc_consistent (last). We have a nopage handler
> that knows how to deal with them. We set the low two bits of
> vma->vm_private_data to tell the nopage handler what kind of memory it
> is dealing with.
As a side note, why do you use both dma_alloc_coherent() and
pci_alloc_consistent()? Presumably all the allocations are for the
same underlying device, so why not just pick one interface (probably
dma_alloc_coherent(), since it lets you specify the GFP mask).
> Once again, we sprinkle heaps of VM_* flags all over the freshly baked
> mapping: VM_DONTEXPAND | VM_DONTCOPY | VM_RESERVED | VM_IO | VM_SHM
You probably want VM_RESERVED. I don't think you want VM_IO (these
pages are in RAM), and there's not much point to VM_SHM, since it's
currently defined as:
#define VM_SHM 0x00000000 /* Means nothing: delete it later */
VM_DONTEXPAND is fine, although you could handle it in your NOPAGE
method too. And VM_DONTCOPY is just as above -- it might make fork()
work better for you.
> The nopage handler looks very normal, except it does a get_page on
> pages marked with IPATH_VM_PIOAVAILREGS, but not on others. Presumably
> this is because they've had SetPageReserved set on them.
I think you should always be doing a get_page(). As far as I know,
there's always a put_page() when the userspace mapping is torn down,
and things are going to get pretty bad if a page count goes below 0.
- R.
On Wed, 2006-03-15 at 17:51 -0800, Roland Dreier wrote:
> However, we turn on the VM_* flag christmas
> > tree, in a frenzied effort to make the kernel pay no attention:
> > VM_DONTCOPY | VM_DONTEXPAND | VM_IO | VM_SHM | VM_LOCKED.
>
> I don't think you need to do anything beyond io_remap_pfn_range().
> Look at the comment inside remap_pfn_range() in mm/memory.c.
> You may want VM_DONTCOPY for fork() handling I guess.
I think we need VM_DONTCOPY for fork, as you say, and probably
VM_DONTEXPAND (for mremap). I don't know why VM_LOCKED is there, since
it seems to be internal to the mm machinery. It looks like it might be
the kernel's equivalent of MCL_FUTURE?
> As a side note, why do you use both dma_alloc_coherent() and
> pci_alloc_consistent()?
I use dma_alloc_coherent when I need to specify the GFP flags,
pci_alloc_consistent when I don't. If you'd rather see only one used,
I'll just drop pci_alloc_consistent.
> You probably want VM_RESERVED.
I'll try it.
> I don't think you want VM_IO (these
> pages are in RAM),
Probably not. The reason some of these flags crept in is that other
drivers use them to try and keep the kernel from paging the pages in
question. I'm pretty sure VM_IO is in that category, and likely
VM_LOCKED as mentioned above, too.
> and there's not much point to VM_SHM, since it's
> currently defined as:
>
> #define VM_SHM 0x00000000 /* Means nothing: delete it later */
I think that's another everyone-else-is-doing-it flag. It only became
zero in 2.6.15-rcX.
> > The nopage handler looks very normal, except it does a get_page on
> > pages marked with IPATH_VM_PIOAVAILREGS, but not on others. Presumably
> > this is because they've had SetPageReserved set on them.
>
> I think you should always be doing a get_page().
Yeah. I think so too, but when I do it, I get an oops.
<b
On Wed, 15 Mar 2006, Bryan O'Sullivan wrote:
>
> I don't know what to to protect chip memory that I'm mapping into
> userspace.
>
> I think I shouldn't be calling SetPageReserved at all, but I don't know
> what I should be doing instead. Naively using get_page instead just
> gets me a big crash.
You should just use "remap_pfn_range()", and new kernels will just
automatically DTRT.
For true chip memory (ie no RAM), even old kernels don't want any
SetPageReserved() games, since there are no actual real real RAM pages for
them - in fact, you shouldn't have any "struct page" to mark reserved -
but if you allocate regular RAM you might want to mark such pages
reserved.
(The current VM no longer needs it or even cares, but that is needed for
older kernels).
Linus
Roland> I think you should always be doing a get_page().
Bryan> Yeah. I think so too, but when I do it, I get an oops.
Hmm, looking at that oops might help debug your problem.
- R.
On Wed, 2006-03-15 at 18:37 -0800, Roland Dreier wrote:
> Roland> I think you should always be doing a get_page().
>
> Bryan> Yeah. I think so too, but when I do it, I get an oops.
>
> Hmm, looking at that oops might help debug your problem.
This is what it looks like when I always call get_page in my nopage
handler (after checking that the pfn is valid and pfn_to_page hasn't
given me junk).
Bad page state at free_hot_cold_page (in process 'mpi_hello', page ffff810002098af8)
flags:0x0100000000000404 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
Backtrace:
Call Trace:<ffffffff80169577>{bad_page+135} <ffffffff8016a743>{free_hot_cold_page+112}
<ffffffff8016a839>{__pagevec_free+41} <ffffffff801710ba>{release_pages+331}
<ffffffff8017fce9>{free_pages_and_swap_cache+125} <ffffffff80176953>{unmap_vmas+1186}
<ffffffff80179a58>{exit_mmap+124} <ffffffff80139fe6>{mmput+37}
<ffffffff8013f2d4>{do_exit+584} <ffffffff8013fdd1>{sys_exit_group+0}
<ffffffff80149fd9>{get_signal_to_deliver+1594} <ffffffff8010f23a>{do_signal+116}
<ffffffff8011017e>{retint_signal+61}
Trying to fix it up, but a reboot is needed
On Wed, 2006-03-15 at 18:53 -0800, Bryan O'Sullivan wrote:
> This is what it looks like when I always call get_page in my nopage
> handler (after checking that the pfn is valid and pfn_to_page hasn't
> given me junk).
Ugh - I clicked on send too soon. I think my next step will be to
figure out which vma is being cleaned up when this problem occurs first.
<b
"Bryan O'Sullivan" <[email protected]> wrote:
>
> On Wed, 2006-03-15 at 18:37 -0800, Roland Dreier wrote:
> > Roland> I think you should always be doing a get_page().
> >
> > Bryan> Yeah. I think so too, but when I do it, I get an oops.
> >
> > Hmm, looking at that oops might help debug your problem.
>
> This is what it looks like when I always call get_page in my nopage
> handler (after checking that the pfn is valid and pfn_to_page hasn't
> given me junk).
>
> Bad page state at free_hot_cold_page (in process 'mpi_hello', page ffff810002098af8)
> flags:0x0100000000000404 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
It still has PG_reserved set. I'd suggest you simply not set PG_reserved
on these pages.
> Call Trace:<ffffffff80169577>{bad_page+135} <ffffffff8016a743>{free_hot_cold_page+112}
> <ffffffff8016a839>{__pagevec_free+41} <ffffffff801710ba>{release_pages+331}
> <ffffffff8017fce9>{free_pages_and_swap_cache+125} <ffffffff80176953>{unmap_vmas+1186}
> <ffffffff80179a58>{exit_mmap+124} <ffffffff80139fe6>{mmput+37}
> <ffffffff8013f2d4>{do_exit+584} <ffffffff8013fdd1>{sys_exit_group+0}
> <ffffffff80149fd9>{get_signal_to_deliver+1594} <ffffffff8010f23a>{do_signal+116}
> <ffffffff8011017e>{retint_signal+61}
hm. Are these pages supposed to be owned by the userspace process? To have their
lifetime controlled by that process?
On Wed, 15 Mar 2006, Bryan O'Sullivan wrote:
>
> This is what it looks like when I always call get_page in my nopage
> handler (after checking that the pfn is valid and pfn_to_page hasn't
> given me junk).
>
> Bad page state at free_hot_cold_page (in process 'mpi_hello', page ffff810002098af8)
> flags:0x0100000000000404 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
> Backtrace:
You shouldn't use PG_reserved on regular pages that you do page counting
on.
Linus
> Bad page state at free_hot_cold_page (in process 'mpi_hello', page ffff810002098af8)
> flags:0x0100000000000404 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
You're setting PG_reserved somewhere. That's deprecated now. Just do
get_page() unconditionally and be happy. I don't think there's any
way that your pages could be swapped out, even though you don't do
SetPageReserved().
- R.
On Wed, 2006-03-15 at 19:28 -0800, Andrew Morton wrote:
> It still has PG_reserved set. I'd suggest you simply not set PG_reserved
> on these pages.
I made that change; thanks to you and Linus for suggesting it.
It caused progress of a sort to occur. This time, we made it through
mmput (the earlier crash site) and tripped over our shoelaces a bit
later during process exit:
Bad page state at __free_pages_ok (in process 'mpi_hello', page ffff8100020e2f88)
flags:0x0100000000000804 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
Backtrace:
Call Trace:<ffffffff80169577>{bad_page+135} <ffffffff8016b1ae>{__free_pages_ok+128}
<ffffffff880f1c88>{:ipath_core:ipath_free_pddata+292}
<ffffffff880f7039>{:ipath_core:ipath_close+1616} <ffffffff8018afeb>{__fput+183}
<ffffffff80188612>{filp_close+91} <ffffffff8013e2a3>{put_files_struct+117}
<ffffffff8013f312>{do_exit+646} <ffffffff8013fdd1>{sys_exit_group+0}
<ffffffff80149fd9>{get_signal_to_deliver+1594} <ffffffff8010f23a>{do_signal+116}
<ffffffff80148529>{force_sig_info+158} <ffffffff8011017e>{retint_signal+61}
Here, I think we're calling dma_free_coherent, on a hunk of memory that
userspace mmapped earlier. Perhaps that got cleaned up during mmput,
and that's why dma_free_coherent is choking here.
> hm. Are these pages supposed to be owned by the userspace process? To have their
> lifetime controlled by that process?
We have two different sets of pages. Some we want to keep around for as
long as a device is plugged in, so the driver should continue to own
them after a user process exits.
Other pages should only live as long as the process that has them
mmapped, but at the moment our driver is (perhaps mistakenly?)
explicitly freeing them as part of fops->close.
I am quite unclear in my head on what mechanism to use to manage the
lifetimes of these pages. Should I use get_page on the pages that span
multiple process lifetimes, and let whatever cleans up the process's
mappings handle the pages that should go away with the process? Is it
even safe to do that, if I allocated them with dma_alloc_coherent
instead of kmalloc?
Thanks for your patience and help,
<b
"Bryan O'Sullivan" <[email protected]> wrote:
>
> On Wed, 2006-03-15 at 19:28 -0800, Andrew Morton wrote:
>
> > It still has PG_reserved set. I'd suggest you simply not set PG_reserved
> > on these pages.
>
> I made that change; thanks to you and Linus for suggesting it.
>
> It caused progress of a sort to occur. This time, we made it through
> mmput (the earlier crash site) and tripped over our shoelaces a bit
> later during process exit:
>
> Bad page state at __free_pages_ok (in process 'mpi_hello', page ffff8100020e2f88)
> flags:0x0100000000000804 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
Someone left PG_private set on this page (!)
>
> > hm. Are these pages supposed to be owned by the userspace process? To have their
> > lifetime controlled by that process?
>
> We have two different sets of pages. Some we want to keep around for as
> long as a device is plugged in, so the driver should continue to own
> them after a user process exits.
>
> Other pages should only live as long as the process that has them
> mmapped, but at the moment our driver is (perhaps mistakenly?)
> explicitly freeing them as part of fops->close.
>
> I am quite unclear in my head on what mechanism to use to manage the
> lifetimes of these pages.
You need to decide who "owns" these pages. Once that's decided, it tells
you who should release them.
> Should I use get_page on the pages that span
> multiple process lifetimes, and let whatever cleans up the process's
> mappings handle the pages that should go away with the process? Is it
> even safe to do that, if I allocated them with dma_alloc_coherent
> instead of kmalloc?
>
Pages which the driver owns should be owned by the, umm, driver. The
driver allocates them, tracks their status, does a put_page() when it's
done with them. Processes might temporarily take a ref on them, in which
case in rare circustances it's the process who does the final put_page(),
but conceptually the driver is still managing these pages.
If the process "owns" these pages then they get allocated in ->nopage and
they get freed in exit(), munmap(), mremap(), etc. ->nopage() should not
take a ref on a page if ->nopage() just allocated it (major fault)
(alloc_pages() did that). If ->nopage finds that the page already exists
(minor fault) then it should take a ref on it. If the driver needs to
temporarily access these pages then it should do a temporary
get_page()/put_page() to protect itself from a concurrent
munmap()/exit()/etc.
If you have pages which are created by ->nopage() but which are supposed to
be shared across forks (the VMA should use VM_DONTCOPY|VM_SHARED) then each
time a new process starts accessing these already-existing pages it'll take
a minor fault. Your ->nopage handler should locate the page by some means,
take a ref on it then return it. do_no_page() will then make the process's
pte map the now-shared page and all is happy. Once all processes which have
faulted in a page have let go of it again (each pte whcih maps that page
has a ref on it which gets undone in munmap/exit/etc), the page will get
freed.
So...
mmap(): set VM_DONTCOPY on VMA, possibly fail if the caller didn't set
VM_SHARED.
nopage(): if the page exists, take a ref on it. If it didn't, allocate it.
Return page.
Approximately. Let's wait for Hugh to come along and clean up my mess.
Andrew> Someone left PG_private set on this page (!)
How does PG_private get set? Could doing a > 1 page kmalloc set it
(because we end up with a compound page)?
Andrew> You need to decide who "owns" these pages. Once that's
Andrew> decided, it tells you who should release them.
[... really good guide to mapping pages into userspace snipped ...]
How about the case where one wants to map pages from
dma_alloc_coherent() into userspace? It seems one should do
get_page() in .nopage, and then the driver can do dma_free_coherent()
when the vma is released.
Or maybe it's just simpler to use vm_insert_page() in the .mmap method
and not try to be fancy with .nopage?
- R.
Roland Dreier <[email protected]> wrote:
>
> Andrew> Someone left PG_private set on this page (!)
>
> How does PG_private get set? Could doing a > 1 page kmalloc set it
> (because we end up with a compound page)?
I can only think that Brian went and set it.
> Andrew> You need to decide who "owns" these pages. Once that's
> Andrew> decided, it tells you who should release them.
>
> [... really good guide to mapping pages into userspace snipped ...]
>
> How about the case where one wants to map pages from
> dma_alloc_coherent() into userspace? It seems one should do
> get_page() in .nopage, and then the driver can do dma_free_coherent()
> when the vma is released.
Well, where were the pages allocated? mmap()? In that case the pages
might be considered driver-owned and yes, I guess the right place to free
them would be in the file's ->release() method. (Sorry to sound vague, but
I haven't really worked on this stuff for a couple of years, and Hugh keeps
on megachanging it).
Lots of sound drivers do this sort of thing for their capture buffers - you
could check them to see what (or, knowing sound drivers, what not) to do.
> Or maybe it's just simpler to use vm_insert_page() in the .mmap method
> and not try to be fancy with .nopage?
>
One would need to work out what to do with these pages when they're shared,
after a fork - a ->nopage() handler would still be needed there, assuming
the VMA is marked VM_DONTCOPY. Because we don't copy all the pte's on a
VM_DONTCOPY vma at fork()-time. (I think we _could_, but we don't)
vm_insert_page() mucks around with rmap-named functions which don't
actually do rmap and sports apparently-incorrect comments wrt
PageReserved(). I don't know how well-cared-for it is...
Andrew> vm_insert_page() mucks around with rmap-named functions
Andrew> which don't actually do rmap and sports
Andrew> apparently-incorrect comments wrt PageReserved(). I don't
Andrew> know how well-cared-for it is...
Linus just added vm_insert_page() a few months ago. Has it already bit-rotted?
- R.
Roland Dreier wrote:
> Andrew> Someone left PG_private set on this page (!)
>
> How does PG_private get set? Could doing a > 1 page kmalloc set it
> (because we end up with a compound page)?
>
I think you should avoid slab allocations for user mappings because
you don't know that it will do compound pages properly, depending on
slab implementation details.
Also, slab lifetimes are not controlled by page refcounting, so when
your driver decides to kfree the memory, a process could still have
refs on the pages via get_user_pages, which might be reused for
unrelated slab allocations.
> Andrew> You need to decide who "owns" these pages. Once that's
> Andrew> decided, it tells you who should release them.
>
> [... really good guide to mapping pages into userspace snipped ...]
>
> How about the case where one wants to map pages from
> dma_alloc_coherent() into userspace? It seems one should do
> get_page() in .nopage, and then the driver can do dma_free_coherent()
> when the vma is released.
>
I think so, provided you set VM_IO on the vma. You need VM_IO to
ensure that get_user_pages callers can't hijack your page's lifetime
rules (Mr. Christmas tree has been carefully engineered for optimal
confusion and best bug count).
> Or maybe it's just simpler to use vm_insert_page() in the .mmap method
> and not try to be fancy with .nopage?
>
That's not being particularly fancy. Either way you need to set VM_IO
on the vma, and make sure the correct memory freeing function (in this
case dma_free_coherent) is called sometime after the last userspace
ref goes away.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Andrew Morton wrote:
> vm_insert_page() mucks around with rmap-named functions which don't
> actually do rmap
What functions are those? I don't see.
> and sports apparently-incorrect comments wrt
> PageReserved().
What's the comment? Are we looking at the same vm_insert_page?
> I don't know how well-cared-for it is...
I think Linus has been caring for it since he added it 3 months ago ;)
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick Piggin <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> > vm_insert_page() mucks around with rmap-named functions which don't
> > actually do rmap
>
> What functions are those? I don't see.
>
> > and sports apparently-incorrect comments wrt
> > PageReserved().
>
> What's the comment? Are we looking at the same vm_insert_page?
vm_insert_page() calls insert_page().
On Wed, 15 Mar 2006, Andrew Morton wrote:
> "Bryan O'Sullivan" <[email protected]> wrote:
> >
> > Bad page state at __free_pages_ok (in process 'mpi_hello', page ffff8100020e2f88)
> > flags:0x0100000000000804 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
>
> Someone left PG_private set on this page (!)
That's pretty sure to be from page_alloc.c's internal use of
PagePrivate, for the buddy system of free pages.
I think Bryan's problem is probably that he has allocated a
high-order page (via pci_alloc_consistent or dma_alloc_coherent),
and is then exposing the constituents of that high-order page to
userspace via nopage: when the final munmap or exit comes,
the constituent pages get freed without realizing they're
supposed to be part of one high-order page.
We had the same problem in sound/core/memalloc.c, and fixed it
by using a compound page (a high-order page in which each constituent
knows it's part of the larger whole): see the two uses of __GFP_COMP
in that file, and in particular the one where snd_malloc_dev_pages
calls dma_alloc_coherent.
Yes, as Roland suggested, it sounds sensible to use dma_alloc_coherent
throughout rather than pci_alloc_consistent, then you can pass __GFP_COMP
down with the other flags and it should work (without any PageReserved).
I haven't checked Bryan's get_pag'ing, but he seemed to have the right
idea of how it should work: was just being a bit diffident about it in
the face of the fact that it clearly wasn't working (through lack of
__GFP_COMP).
Of course, there is an alternative way of going about it, using
remap_pfn_range more (VM_PFNMAP keeping the VM away from refcounting
the pages) and nopage less. But I think Bryan should probably keep on
with the division he already has, no strong reason to change it around.
[ snipping lots of helpful explanation and advice from Andrew ]
> If you have pages which are created by ->nopage() but which are supposed to
> be shared across forks (the VMA should use VM_DONTCOPY|VM_SHARED) then each
> time a new process starts accessing these already-existing pages it'll take
> a minor fault.
No. That isn't how VM_DONTCOPY behaves, and I don't know what you were
thinking to bring it into the discussion there - if you have pages which
are supposed to be shared across forks, that happens without any special
flag, doesn't it? We only teach VM_DONTCOPY in the second year,
I think Bryan would best forget it for now ;)
Oh well, I'd better say a bit on it. VM_DONTCOPY excludes the whole vma
from being forked into the child. So if the child tries to access one
of those addresses, the child'll get a SIGSEGV not a minor fault. RDMA
is finding it a useful flag to get around a peculiarly frustrating issue
with get_user_pages: though that successfully pins pages, a fork followed
by parent writing to pinned private page will put a _copy_ of that page
into the parent's userspace (if child holds a reference to the original),
which likely defeats the purpose of pinning.
So far as Bryan is dealing with shared mappings, he won't be interested
in VM_DONTCOPY at all. If he's dealing with private mappings, he might
want to consider VM_DONTCOPY if get_user_pages and fork can occur
concurrently; but it's an advanced topic, right now he wants to get
the basis working without "Bad page state"s.
He shouldn't set VM_SHM, it means nothing. If he's dealing with real
allocated pages, he shouldn't set VM_IO (but remap_pfn_range will set
it for him if it's used). He shouldn't set VM_RESERVED (won't do any
harm, but we're quite likely to remove it soon). He can set VM_DONT-
EXPAND to avoid worrying about whether mremap to a larger extent,
followed by fault on that larger extent, would behave correctly.
He shouldn't set VM_LOCKED (it would make the locked_vm accounting
go wrong - unless he's careful to participate in that accounting,
as Infiniband's uverbs_mem.c creditably does).
Hugh
On Thu, 16 Mar 2006, Nick Piggin wrote:
> >
> > How about the case where one wants to map pages from
> > dma_alloc_coherent() into userspace? It seems one should do
> > get_page() in .nopage, and then the driver can do dma_free_coherent()
> > when the vma is released.
>
> I think so, provided you set VM_IO on the vma. You need VM_IO to
> ensure that get_user_pages callers can't hijack your page's lifetime
> rules
Once __GFP_COMP is passed to the dma_alloc_coherent, as it needs to be
(unless going VM_PFNMAP), get_user_pages will be safe: no need for VM_IO.
Hugh
On Wed, 15 Mar 2006, Andrew Morton wrote:
> Roland Dreier <[email protected]> wrote:
>
> > Or maybe it's just simpler to use vm_insert_page() in the .mmap method
> > and not try to be fancy with .nopage?
>
> One would need to work out what to do with these pages when they're shared,
> after a fork - a ->nopage() handler would still be needed there, assuming
> the VMA is marked VM_DONTCOPY. Because we don't copy all the pte's on a
> VM_DONTCOPY vma at fork()-time. (I think we _could_, but we don't)
Misapprehension of VM_DONTCOPY addressed in another reply.
> vm_insert_page() mucks around with rmap-named functions which don't
> actually do rmap and sports apparently-incorrect comments wrt
> PageReserved(). I don't know how well-cared-for it is...
It does seem to have four users intree now, so I hope it works.
It was a byproduct of when Linus thought he could get away with
limiting remap_pfn_range, in ways that later proved unsustainable.
It's a bit surplus to requirements now, but does have those users.
I'm not keen on it, because I think most drivers actually
want something slightly different (a remap_vmalloc_pages or a
remap_highorder_page). But that will emerge later on, in that
fabled time when I go remove the SetPageReserveds from drivers.
Yes, there are some out-of-date comments thereabouts: I did correct
them once, but in one of those patches that Linus rejected.
insert_page does a page_add_file_rmap to bump the mapcount, yes;
but whether we ever "reverse map" from page to pte depends on other
things (page->mapping and prio_tree) certainly not set here, and
probably not (indeed, hopefully not!) done by any vm_insert_page
caller. That criticism applies to all the page_add_rmaps and
page_remove_rmaps: we carried over the name from pte_chain days,
but the only thing that gets added or removed there now is "1".
Hugh
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Andrew Morton wrote:
>>
>> > vm_insert_page() mucks around with rmap-named functions which don't
>> > actually do rmap
>>
>> What functions are those? I don't see.
>>
>> > and sports apparently-incorrect comments wrt
>> > PageReserved().
>>
>> What's the comment? Are we looking at the same vm_insert_page?
>
>
> vm_insert_page() calls insert_page().
>
Oh OK. I guess I didn't think insert_page was rmap-named. Yes, it looks
like the comment on that one is wrong.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
On Wed, 2006-03-15 at 21:38 -0800, Andrew Morton wrote:
> > Bad page state at __free_pages_ok (in process 'mpi_hello', page ffff8100020e2f88)
> > flags:0x0100000000000804 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
>
> Someone left PG_private set on this page (!)
I can't tell you how that happened. I'm certainly not explicitly
setting it anywhere. Apart from PG_reserved (now gone, as seen above),
I don't touch any page flags.
> Pages which the driver owns should be owned by the, umm, driver. The
> driver allocates them, tracks their status, does a put_page() when it's
> done with them.
This is more or less what we're doing now. Except we're not doing a
get_page after dma_alloc_coherent, and vmops->nopage is doing a
get_page. Reading between the lines, I guess the driver should be doing
a get_page right after the allocation, and a put_page before it does the
free? This matches my mental model at least, but it seems that my model
is a bit mental.
> If the process "owns" these pages then they get allocated in ->nopage and
> they get freed in exit(), munmap(), mremap(), etc.
OK, we don't have any cases like this.
> If you have pages which are created by ->nopage() but which are supposed to
> be shared across forks (the VMA should use VM_DONTCOPY|VM_SHARED) then each
> time a new process starts accessing these already-existing pages it'll take
> a minor fault.
And we don't have any cases like this either. All of these mappings are
per-process, not to be shared across fork or clone.
Thanks,
<b
On Wed, 2006-03-15 at 21:54 -0800, Roland Dreier wrote:
> How about the case where one wants to map pages from
> dma_alloc_coherent() into userspace?
This is precisely our case, btw. The pages in question are allocated
during fops->open (some during dev->probe). mmap and nopage never
allocate anything.
> It seems one should do
> get_page() in .nopage, and then the driver can do dma_free_coherent()
> when the vma is released.
If that were the case, I'm unclear on how I would do this. Add a
vmops->close method along with the existing vmops->nopage handler?
<b
On Thu, 2006-03-16 at 14:24 +0000, Hugh Dickins wrote:
> I think Bryan's problem is probably that he has allocated a
> high-order page (via pci_alloc_consistent or dma_alloc_coherent),
> and is then exposing the constituents of that high-order page to
> userspace via nopage: when the final munmap or exit comes,
> the constituent pages get freed without realizing they're
> supposed to be part of one high-order page.
Yes, that sounds about right. Some of the allocations in question are
order-1, and some are higher.
> We had the same problem in sound/core/memalloc.c, and fixed it
> by using a compound page (a high-order page in which each constituent
> knows it's part of the larger whole): see the two uses of __GFP_COMP
> in that file, and in particular the one where snd_malloc_dev_pages
> calls dma_alloc_coherent.
Will do, thanks. Will the allocator give me a compound page if I ask
for a single page but pass __GFP_COMP, or do I need to be sure I always
make a higher-order request if I pass that flag?
> No. That isn't how VM_DONTCOPY behaves, and I don't know what you were
> thinking to bring it into the discussion there - if you have pages which
> are supposed to be shared across forks, that happens without any special
> flag, doesn't it? We only teach VM_DONTCOPY in the second year,
> I think Bryan would best forget it for now ;)
Heh. Actually, in this case, I really don't want the vma to get shared
with a child. If the child gets a SEGV after a fork, that's intended
behaviour, as far as I'm concerned.
> RDMA
> is finding it a useful flag to get around a peculiarly frustrating issue
> with get_user_pages: though that successfully pins pages, a fork followed
> by parent writing to pinned private page will put a _copy_ of that page
> into the parent's userspace (if child holds a reference to the original),
> which likely defeats the purpose of pinning.
That's a useful aside, thanks. It sounds like I want to set VM_DONTCOPY
on the pages I'm doing get_user_pages on too, then.
> He shouldn't set VM_SHM, it means nothing.
Is that the case for kernels prior to 2.6.15, too?
The problem is that I need to keep a version of the driver working on
kernels back as far as 2.6.9, in addition to the current mainline. If
VM_SHM meant nothing back then, too, I can remove it entirely, otherwise
I should drop it only in the for-mainline driver, and leave the backport
version as it stands.
> If he's dealing with real
> allocated pages, he shouldn't set VM_IO (but remap_pfn_range will set
> it for him if it's used). He shouldn't set VM_RESERVED (won't do any
> harm, but we're quite likely to remove it soon).
Yes, I actually dropped both of these last night.
> He shouldn't set VM_LOCKED (it would make the locked_vm accounting
> go wrong - unless he's careful to participate in that accounting,
> as Infiniband's uverbs_mem.c creditably does).
Well, uverbs_mem.c only touches locked_vm for pages it's pinned using
get_free_pages. We do much the same thing in that case.
I'll try turning off VM_LOCKED for the pages we allocate, and see what
breaks :-)
<b
On Wed, 15 Mar 2006, Roland Dreier wrote:
>
> Andrew> vm_insert_page() mucks around with rmap-named functions
> Andrew> which don't actually do rmap and sports
> Andrew> apparently-incorrect comments wrt PageReserved(). I don't
> Andrew> know how well-cared-for it is...
>
> Linus just added vm_insert_page() a few months ago. Has it already bit-rotted?
Heh. We relaxed the "needs PG_reserved" requirement, but there's a comment
that still says that it needs to in the internal function, which used to
be shared between remap_pfn_range and vm_insert_page.
And it uses page_add_file_rmap(), which is misnamed (but that has nothing
to do with vm_insert_page).
page_add_file_rmap() doesn't actually do any rmap stuff, it's required for
_any_ page that is mapped, and in particular pages that are _not_ rmapped.
So vm_insert_page() is doing the right thing, but it looks a bit scary due
to bad naming else-where.
Linus
On Wed, 2006-03-15 at 21:38 -0800, Andrew Morton wrote:
> You need to decide who "owns" these pages. Once that's decided, it tells
> you who should release them.
OK, I've made some interesting progress here.
The driver is now doing all of its allocations of DMAable memory using
dma_alloc_coherent. We do a get_page right after the allocation in
every case, and a put_page right before the free. In the cases where
I'm allocating memory that I know or think might be greater than a page
in size, I'm using __GFP_COMP.
The nopage handler always does a get_page.
We now have a rational-looking set of VM_* flags, instead of a random
heap of whatever seemed to work. And we're not touching PG_reserved.
And now everything works. I have yet to examine /proc/meminfo in
microscopic detail after 100,000 runs to be sure we don't have a leak of
some kind, but I no longer get oopses or crashes after 20 repeated runs,
where before I didn't survive even one.
Whew! What a relief.
Hugh, Andrew, Linus and Roland: thanks very much. This has been a
tremendous help.
<b
On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
>
> This is precisely our case, btw. The pages in question are allocated
> during fops->open (some during dev->probe). mmap and nopage never
> allocate anything.
If the mapping isn't actually dynamic, then you really should use either:
- remap_pfn_range() (for when you have a physically consecutive page
mapping)
- vm_insert_page() (for when you have individual pages).
at mmap time. Either of those will then do the right thing at unmap.
The rules are:
- remap_pfn_range() doesn't muck around with "struct page" AT ALL, so you
can pass it damn well anything you want these days. It doesn't care,
the VM doesn't care, there's no ref-counting or page flag checking
either on the mmap or the munmap parh.
So with remap_pfn_range(), you can literally do just
remap_pfn_range(vma,
vma->vm_start,
device->phys_dma_address >> PAGE_SHIFT,
device->phys_dma_size,
vma->vm_page_prot);
and the VM will not care what that DMA region is (it might, for
example, be actual device memory, not real RAM at all). But this should
only be used on memory that will be guaranteed to be around for as long
as the mapping exists (which you can guarantee by doing refcounting of
your own "struct file"s, of course).
Normally, you'd use remap_pfn_range() only for special allocations.
Most commonly, it's not RAM at all, but the PCI MMIO memory window to
the hardware itself.
- vm_insert_page() wants _individual_ pages that have been allocated as
such by the page allocator. You can't pass it a kmalloc'ed area or
anything like that, but you _can_ pass it anything that works with the
page allocator. It will increment the page count appropriately for the
mapping, so you should think of it as a no-op: you can do
...
page = get_free_page(..);
if (!page)
return -ENOMEM;
vm_insert_page(mm, addr, page, prot);
...
.. and then in your close/module_exit routine ..
free_page(page);
ie you should do the freeing of _your_ references (you got one when you
allocated the page, so you should free it), and the VM layer will track
_its_ references (ie vm_insert_page() will do whatever is correct so
that when the VM gets unmapped, the page really gets freed)
As a special case (it's not actually a special case in the VM, but as
far as _usage_ is concerned, it's different from the "allocate
individual pages" case), if you allocate a single large _compound_ page
with __GFP_COMP, you can then use vm_insert_page() to insert the
sub-pages individually in the mapping. IOW, you can do
/* Get a compound page of order 4 (16 pages) */
bigpage = __get_free_pages(GFP_USER | __GFP_COMP, 4);
for (i = 0; i < 15; i++)
vm_insert_page(mm, addr + (i << PAGE_SHIFT), bigpage + i, prot);
.. in close/module-exit ..
free_page(bigpage);
Hope this clarifies.
Linus
On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
> On Thu, 2006-03-16 at 14:24 +0000, Hugh Dickins wrote:
>
> > I think Bryan's problem is probably that he has allocated a
> > high-order page (via pci_alloc_consistent or dma_alloc_coherent),
> > and is then exposing the constituents of that high-order page to
> > userspace via nopage: when the final munmap or exit comes,
> > the constituent pages get freed without realizing they're
> > supposed to be part of one high-order page.
>
> Yes, that sounds about right. Some of the allocations in question are
> order-1, and some are higher.
I tend to say "high-order" when I ought to say ">0-order":
anything over 0-order exhibits the same problem, needing __GFP_COMP.
> > We had the same problem in sound/core/memalloc.c, and fixed it
> > by using a compound page (a high-order page in which each constituent
> > knows it's part of the larger whole): see the two uses of __GFP_COMP
> > in that file, and in particular the one where snd_malloc_dev_pages
> > calls dma_alloc_coherent.
>
> Will do, thanks. Will the allocator give me a compound page if I ask
> for a single page but pass __GFP_COMP, or do I need to be sure I always
> make a higher-order request if I pass that flag?
Just pass __GFP_COMP whatever the order of the request is. In the case
of 0-order requests, it'll simply be ignored (so you don't _need_ to pass
it in that case, but it's just a waste of effort to avoid it): that page
won't be marked PageCompound, nor will it need to be - most of the time
(;)) we have no difficulty holding a single page together with itself.
The whole thing of having to pass __GFP_COMP is a nuisance, and caught
me by surprise. Looking back at the history, it seems that when compound
pages were first introduced (by akpm), it wasn't necessary at all: asking
for a >0-order page gave you a compound page. But then a few places
surfaced which went badly wrong that way, I think somewhere in ARM and
a couple of drivers, places which split the >0-page up again. Perhaps
one day we should go in search of those exceptions, fix them up (maybe
Nick already did so with his split_page in -mm?), and eliminate the
need for a __GFP_COMP flag.
> > No. That isn't how VM_DONTCOPY behaves, and I don't know what you were
> > thinking to bring it into the discussion there - if you have pages which
> > are supposed to be shared across forks, that happens without any special
> > flag, doesn't it? We only teach VM_DONTCOPY in the second year,
> > I think Bryan would best forget it for now ;)
>
> Heh. Actually, in this case, I really don't want the vma to get shared
> with a child. If the child gets a SEGV after a fork, that's intended
> behaviour, as far as I'm concerned.
Okay, fine: I didn't mean to discourage you from using VM_DONTCOPY if
it suits, I just didn't understand why Andrew was promoting its use.
> > RDMA
> > is finding it a useful flag to get around a peculiarly frustrating issue
> > with get_user_pages: though that successfully pins pages, a fork followed
> > by parent writing to pinned private page will put a _copy_ of that page
> > into the parent's userspace (if child holds a reference to the original),
> > which likely defeats the purpose of pinning.
>
> That's a useful aside, thanks. It sounds like I want to set VM_DONTCOPY
> on the pages I'm doing get_user_pages on too, then.
>
> > He shouldn't set VM_SHM, it means nothing.
>
> Is that the case for kernels prior to 2.6.15, too?
Yes, all kernels from 2.4.0 through 2.6.15: a bit was defined for it,
and a motley collection of drivers set that bit, but it wasn't used
for anything at all - not even any files in /proc. A leftover from 2.2.
> The problem is that I need to keep a version of the driver working on
> kernels back as far as 2.6.9, in addition to the current mainline. If
> VM_SHM meant nothing back then, too, I can remove it entirely, otherwise
> I should drop it only in the for-mainline driver, and leave the backport
> version as it stands.
You're safe to drop the VM_SHM everywhere. But your backport driver will
have to be using PageReserved still, not relying on __GFP_COMP: although
__GFP_COMP was defined in 2.6.9 and a few earlier, it used to take effect
only when #ifdef CONFIG_HUGETLB_PAGE - only in 2.6.15 did we make it
available to all configurations. You'll have irritating accounting
differences between the two drivers: it used to be the case that put_page
on a PageReserved page did nothing, so you had to avoid get_page on it to
get the page accounting right; we straightened that out in 2.6.15.
> > If he's dealing with real
> > allocated pages, he shouldn't set VM_IO (but remap_pfn_range will set
> > it for him if it's used). He shouldn't set VM_RESERVED (won't do any
> > harm, but we're quite likely to remove it soon).
>
> Yes, I actually dropped both of these last night.
>
> > He shouldn't set VM_LOCKED (it would make the locked_vm accounting
> > go wrong - unless he's careful to participate in that accounting,
> > as Infiniband's uverbs_mem.c creditably does).
>
> Well, uverbs_mem.c only touches locked_vm for pages it's pinned using
> get_free_pages. We do much the same thing in that case.
>
> I'll try turning off VM_LOCKED for the pages we allocate, and see what
> breaks :-)
I believe uverbs_mem.c sets a good example if you want to follow it
in detail; but I think it's the only driver which does all that.
Most drivers hold a certain number of pages pinned, and even if they
make them available to userspace, we still consider them driver pages
and exempt them from locked_vm accounting. But (as I understand it)
uverbs_mem.c makes it possible for userspace to pin down an otherwise
unlimited number of user pages for an indefinite length of time: so
they've chosen to enforce locked_vm accounting, and I applaud that.
So, if your driver allows userspace to pin an otherwise unlimited
number of user pages for an indefinite length of time, you ought
to follow uverbs_mem.c's locked_vm accounting. But if it's only
a little, briefly, don't bother.
And I'll have confused you by mentioning that under VM_LOCKED:
I was thinking uverbs_mem.c set VM_LOCKED, but it does not, nor
should it. Nor should you, neither at your get_user_pages, nor
when setting up your vmas (if you search the kernel source, I
think you'll find a few drivers which do set VM_LOCKED, but
they're just wrong, and it's one of the things I have to fix
when I do that trawl through them all).
Hugh
On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
>
> This is more or less what we're doing now. Except we're not doing a
> get_page after dma_alloc_coherent, and vmops->nopage is doing a
> get_page. Reading between the lines, I guess the driver should be doing
> a get_page right after the allocation, and a put_page before it does the
> free? This matches my mental model at least, but it seems that my model
> is a bit mental.
There's no need to do a get_page after the allocation and a put_page
before the free (though you could, it's just extra unnecessary work):
the allocation comes with a reference count of 1, the free frees up
that last remaining reference count of 1 (as Andrew explained more
lucidly elsewhere in his mail).
Hugh
On Thu, 2006-03-16 at 17:23 +0000, Hugh Dickins wrote:
> You're safe to drop the VM_SHM everywhere. But your backport driver will
> have to be using PageReserved still, not relying on __GFP_COMP: although
> __GFP_COMP was defined in 2.6.9 and a few earlier, it used to take effect
> only when #ifdef CONFIG_HUGETLB_PAGE - only in 2.6.15 did we make it
> available to all configurations. You'll have irritating accounting
> differences between the two drivers: it used to be the case that put_page
> on a PageReserved page did nothing, so you had to avoid get_page on it to
> get the page accounting right; we straightened that out in 2.6.15.
OK, that's very helpful to know, thanks.
> So, if your driver allows userspace to pin an otherwise unlimited
> number of user pages for an indefinite length of time, you ought
> to follow uverbs_mem.c's locked_vm accounting.
This is exactly the case, and we are in fact aping uverbs_mem.c quite
carefully :-)
> And I'll have confused you by mentioning that under VM_LOCKED:
> I was thinking uverbs_mem.c set VM_LOCKED, but it does not, nor
> should it.
OK, thanks.
<b
On Thu, 2006-03-16 at 17:27 +0000, Hugh Dickins wrote:
> There's no need to do a get_page after the allocation and a put_page
> before the free (though you could, it's just extra unnecessary work):
> the allocation comes with a reference count of 1, the free frees up
> that last remaining reference count of 1 (as Andrew explained more
> lucidly elsewhere in his mail).
All right. I followed your advice and you are indeed correct; the added
get_page and put_page were not necessary; __GFP_COMP alone did the
trick.
Thanks,
<b
On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
> On Wed, 2006-03-15 at 21:54 -0800, Roland Dreier wrote:
>
> > How about the case where one wants to map pages from
> > dma_alloc_coherent() into userspace?
>
> This is precisely our case, btw. The pages in question are allocated
> during fops->open (some during dev->probe). mmap and nopage never
> allocate anything.
>
> > It seems one should do
> > get_page() in .nopage, and then the driver can do dma_free_coherent()
> > when the vma is released.
>
> If that were the case, I'm unclear on how I would do this. Add a
> vmops->close method along with the existing vmops->nopage handler?
If neither mmap nor nopage allocated something, then vmops->close
would be the wrong point at which to free it, I think.
If dev->probe allocated something, module unload would be the right
point to free it. If fops->open allocated something, fops->release
would be the right point to free it.
I think. Beware, I've never written a Linux driver, and may well
have this wrong or oversimplified: in which case, I'll have annoyed
someone enough for them to correct me very soon.
Hugh
On Thu, 2006-03-16 at 17:46 +0000, Hugh Dickins wrote:
> > > It seems one should do
> > > get_page() in .nopage, and then the driver can do dma_free_coherent()
> > > when the vma is released.
> >
> > If that were the case, I'm unclear on how I would do this. Add a
> > vmops->close method along with the existing vmops->nopage handler?
>
> If neither mmap nor nopage allocated something, then vmops->close
> would be the wrong point at which to free it, I think.
I think I've satisfied myself that what we're now doing is fairly sane,
so I think we're OK. I'm sure I'll be corrected in the next round of
driver submission patches if I'm wrong :-)
<b
On Thu, 2006-03-16 at 17:23 +0000, Hugh Dickins wrote:
> But your backport driver will
> have to be using PageReserved still, not relying on __GFP_COMP: although
> __GFP_COMP was defined in 2.6.9 and a few earlier, it used to take effect
> only when #ifdef CONFIG_HUGETLB_PAGE - only in 2.6.15 did we make it
> available to all configurations. You'll have irritating accounting
> differences between the two drivers: it used to be the case that put_page
> on a PageReserved page did nothing, so you had to avoid get_page on it to
> get the page accounting right; we straightened that out in 2.6.15.
OK. Would it be correct to say that this is what we should do, then?
* On 2.6.15 and later kernels, use __GFP_COMP at allocation time,
and get_page in ->nopage. This is what we're doing as of this
morning, and it works.
* For backports to 2.6.14 and earlier, avoid __GFP_COMP, mark each
page with SetPageReserved at allocation time, and do nothing
special in ->nopage. Do we need to ClearPageReserved before
freeing?
Thanks,
<b
On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
>
> OK. Would it be correct to say that this is what we should do, then?
>
> * On 2.6.15 and later kernels, use __GFP_COMP at allocation time,
> and get_page in ->nopage. This is what we're doing as of this
> morning, and it works.
> * For backports to 2.6.14 and earlier, avoid __GFP_COMP, mark each
> page with SetPageReserved at allocation time, and do nothing
> special in ->nopage. Do we need to ClearPageReserved before
> freeing?
Yes, I believe that's exactly right - so long as you do ClearPageReserved
from each of its constituent 0-order-pages before freeing the >0-order
page, in the <= 2.6.14 case.
You wisely remarked earlier that you'd not yet checked for memory leaks:
that is of course the complementary, less obvious, error to the troubles
you've been having so far: and I wish you luck when you come to check,
hoping that I haven't merely misled you from one side to the other!
Hugh
On Thu, 16 Mar 2006, Hugh Dickins wrote:
> On Thu, 16 Mar 2006, Bryan O'Sullivan wrote:
> >
> > OK. Would it be correct to say that this is what we should do, then?
> >
> > * On 2.6.15 and later kernels, use __GFP_COMP at allocation time,
> > and get_page in ->nopage. This is what we're doing as of this
> > morning, and it works.
> > * For backports to 2.6.14 and earlier, avoid __GFP_COMP, mark each
> > page with SetPageReserved at allocation time, and do nothing
> > special in ->nopage. Do we need to ClearPageReserved before
> > freeing?
>
> Yes, I believe that's exactly right - so long as you do ClearPageReserved
> from each of its constituent 0-order-pages before freeing the >0-order
> page, in the <= 2.6.14 case.
The alternative is to always allocate the pages one by one ("order-0"),
and do get_page() when you return them in the ->nopage handler. That will
work with any kernel, so it has the simplicity thing going for it.
Linus
On Thu, 2006-03-16 at 12:35 -0800, Linus Torvalds wrote:
> The alternative is to always allocate the pages one by one ("order-0"),
> and do get_page() when you return them in the ->nopage handler. That will
> work with any kernel, so it has the simplicity thing going for it.
Thanks. That's a very useful suggestion.
<b
[Err, sorry about the empty mail...]
Anyway, I'd like to hijack this thread slightly, since we are close to
a subject that I've been thinking about lately, and I'd like to take
advantage of the expert's attention...
My mthca driver (drivers/infiniband/hw/mthca) supports mapping some
MMIO registers into userspace via io_remap_pfn_range() in a .mmap
method. I think I have that pretty well under control.
However, on a hot unplug event, when the underlying PCI device is
going away, I would like to replace that mapping with a mapping (with
a mapping to the zero page?), so that userspace accesses after the
device is gone don't explode. What's the "right" way to do that?
Presumably it would be something like zeromap_page_range(), but that's
not exported to modules. Exporting that would be one option, but in a
way that's overkill for me: I only have a single page to deal with, so
I could also do something like
vm_insert_page(vma, addr, ZERO_PAGE(addr));
But do I have to do anything to kill the old mapping coming from
remap_pfn_range()? What's the exported API to do that?
I can keep a list of vmas that have registers mapped to userspace and
iterate through it on hot unplug. The only question is what to do
with those vmas.
Thanks,
Roland
On Thu, 2006-03-16 at 15:51 -0800, Roland Dreier wrote:
> However, on a hot unplug event, when the underlying PCI device is
> going away, I would like to replace that mapping with a mapping (with
> a mapping to the zero page?), so that userspace accesses after the
> device is gone don't explode.
Why would you not want accesses to explode? Exploding seems like the
right behaviour to me, since there's no hardware for userspace to talk
to any more.
<b
Hugh Dickins wrote:
> On Thu, 16 Mar 2006, Nick Piggin wrote:
>
>>>How about the case where one wants to map pages from
>>>dma_alloc_coherent() into userspace? It seems one should do
>>>get_page() in .nopage, and then the driver can do dma_free_coherent()
>>>when the vma is released.
>>
>>I think so, provided you set VM_IO on the vma. You need VM_IO to
>>ensure that get_user_pages callers can't hijack your page's lifetime
>>rules
>
>
> Once __GFP_COMP is passed to the dma_alloc_coherent, as it needs to be
> (unless going VM_PFNMAP), get_user_pages will be safe: no need for VM_IO.
>
But it doesn't look like dma_alloc_coherent is guaranteed to return
memory allocated from the regular page allocator, nor even memory
backed by a struct page.
For example, I see one that returns kmalloc()ed memory. If the pages
for the slab are already allocated then __GFP_COMP will not do anything
there. i386 looks like it has a path that uses ioremap...
Now I haven't looked through all these closely like you will have, but
I'd like to know how __GFP_COMP solves all the potential problems I
see.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Nick> But it doesn't look like dma_alloc_coherent is guaranteed to
Nick> return memory allocated from the regular page allocator, nor
Nick> even memory backed by a struct page.
Nick> For example, I see one that returns kmalloc()ed memory. If
Nick> the pages for the slab are already allocated then __GFP_COMP
Nick> will not do anything there. i386 looks like it has a path
Nick> that uses ioremap...
Ugh. So is there a correct way to map DMA-able memory into userspace?
- R.
Bryan> Why would you not want accesses to explode? Exploding
Bryan> seems like the right behaviour to me, since there's no
Bryan> hardware for userspace to talk to any more.
I think it is better for userspace to be notified that the hardware is
gone, and have any accesses before it acts on that notification to be
silently discarded.
- R.
Bryan> Why would you not want accesses to explode? Exploding
Bryan> seems like the right behaviour to me, since there's no
Bryan> hardware for userspace to talk to any more.
Oh yeah... but getting rid of the mapping so userspace gets a segfault
might be a good idea too. However, leaving the old PCI mapping there
seems rather risky to me: I think it's entirely possible that accesses
to that area after the device is gone could trigger machine checks or
worse.
So what's the right way for a driver to get rid of a remap_pfn_range()
mapping into userspace?
- R.
On Iau, 2006-03-16 at 17:12 -0800, Roland Dreier wrote:
> Oh yeah... but getting rid of the mapping so userspace gets a segfault
> might be a good idea too. However, leaving the old PCI mapping there
> seems rather risky to me: I think it's entirely possible that accesses
> to that area after the device is gone could trigger machine checks or
> worse.
Not really. After all the hot remove can race an actual mmio cycle so
you can't close that window to nothing. In other words if it does make
the PCI bridge burp at you - well hotplug has to handle it.
That means on the positive side that all you need to do is refcount
properly and destroy the PCI device when you have finished with it. If a
mapping continues to exist then fine, because the device is still
logically there. If the device is logically there then the resources
have not been unmapped. If the resources have not been unmapped they are
not free for allocation to another device.
Config space looks more problematic but memory maps of PCI space appear
to be ok.
Alan
> > Oh yeah... but getting rid of the mapping so userspace gets a segfault
> > might be a good idea too. However, leaving the old PCI mapping there
> > seems rather risky to me: I think it's entirely possible that accesses
> > to that area after the device is gone could trigger machine checks or
> > worse.
>
> Not really. After all the hot remove can race an actual mmio cycle so
> you can't close that window to nothing. In other words if it does make
> the PCI bridge burp at you - well hotplug has to handle it.
>
> That means on the positive side that all you need to do is refcount
> properly and destroy the PCI device when you have finished with it. If a
> mapping continues to exist then fine, because the device is still
> logically there. If the device is logically there then the resources
> have not been unmapped. If the resources have not been unmapped they are
> not free for allocation to another device.
I'm not sure I'm following you. Is it OK for a driver to return from
its pci_driver .remove method with userspace mappings to MMIO BARs
still hanging around? I wouldn't think so.
Non-privileged userspace processes can get direct access to (a safe
subset of) PCI space for IB devices, and it seems unfriendly to force
all of those processes to be killed before an IB device can be removed.
The way I think about it is that one doesn't have to kill every
process with a connected socket to remove an ethernet device, and I'd
like the same to hold for IB devices. The analogy doesn't really hold
at a technical level I know, but it still seems like the friendliest
interface.
Thanks,
Roland
On Fri, 17 Mar 2006, Nick Piggin wrote:
> Hugh Dickins wrote:
> >
> > Once __GFP_COMP is passed to the dma_alloc_coherent, as it needs to be
> > (unless going VM_PFNMAP), get_user_pages will be safe: no need for VM_IO.
>
> But it doesn't look like dma_alloc_coherent is guaranteed to return
> memory allocated from the regular page allocator, nor even memory
> backed by a struct page.
Hmm, that's bad news.
If it's not backed by a struct page, then I guess Bryan can't be
interested in mapping it into userspace with a .nopage; so perhaps that
case is already ruled out somehow, and we needn't worry about it. Or
should he just be using remap_pfn_range - is there a portable way to
use that with dma_alloc_coherent/pci_alloc_consistent?
I'm not a driver writer, and have no idea of these things: I think,
having got Bryan back on track with his page counts, I'd better step
aside, and let those who understand these things take him forward.
> For example, I see one that returns kmalloc()ed memory. If the pages
> for the slab are already allocated then __GFP_COMP will not do anything
> there. i386 looks like it has a path that uses ioremap...
I don't remember offhand whether passing __GFP_COMP to kmalloc does
something, nothing, errors out, or behaves erratically according to
whether the slab is already allocated.
I'd feel so much more confident if no __GFP_COMP flag were ever needed.
It seems that's how PageCompound started out, every >0-order was compound;
then some nasty code found down in ARM and some drivers that were screwed
by that, so __GFP_COMP introduced (and other trees had __GFP_NOCOMP).
Is there any chance that your split_page() work in -mm, actually addresses
precisely those places that were screwed up by universal compound pages?
So that with your split_page(), we could go back to every >0-order page
being PageCompound, without any need for __GFP_COMP.
There'd probably be a few blips to sort out, but if it seems a plausible
way forward, that's the way I'd like to go. But I wasn't in on the
early days of PageCompound, maybe Andrew remembers the issues. I've
appended a couple of akpm entries from ChangeLog-2.6.6 below, to help
jog memories (I'm amused to see how the compound page logic started
off using page->lru, where we've just now moved it back to).
> Now I haven't looked through all these closely like you will have, but
> I'd like to know how __GFP_COMP solves all the potential problems I
> see.
It seems I can't have looked as closely as I thought. I was advising
Bryan on the basis of the __GFP_COMP (I added) in snd_malloc_dev_pages,
which appears to have been working. But now I fear perhaps that was
just a rare case to support a single driver only needed on a few
architectures (not everyone exports snd_malloc_dev_pages memory into
userspace); or we have a nasty surprise in store for us there too.
Aside from the dark alleys of dma_alloc_coherent that you've mentioned,
there's the architectures which #define it to pci_alloc_consistent,
which takes no gfp_mask, so __GFP_COMP would be ignored (hence,
in part, my desire to make all >0-order pages compound).
None of what I've said above is much help to Bryan (nor is Linus'
suggestion that he allocates one page at a time, if dma_alloc_coherent
won't even give him the right kind of struct-page-backed memory):
but as I said, I'll have to step aside from pretending to advise
on what his driver should be doing.
Hugh
<[email protected]>
[PATCH] stop using page->lru in compound pages
The compound page logic is using page->lru, and these get will scribbled on
in various places so switch the Compound page logic over to using ->mapping
and ->private.
<[email protected]>
[PATCH] use compound pages for hugetlb pages only
The compound page logic is a little fragile - it relies on additional
metadata in the pageframes which some other kernel code likes to stomp on
(xfs was doing this).
Also, because we're treating all higher-order pages as compound pages it is
no longer possible to free individual lower-order pages from the middle of
higher-order pages. At least one ARM driver insists on doing this.
We only really need the compound page logic for higher-order pages which can
be mapped into user pagetables and placed under direct-io. This covers
hugetlb pages and, conceivably, soundcard DMA buffers which were allcoated
with a higher-order allocation but which weren't marked PageReserved.
The patch arranges for the hugetlb implications to allocate their pages with
compound page metadata, and all other higher-order allocations go back to the
old way.
(Andrea supplied the GFP_LEVEL_MASK fix)
On Fri, 2006-03-17 at 11:37 +1100, Nick Piggin wrote:
> But it doesn't look like dma_alloc_coherent is guaranteed to return
> memory allocated from the regular page allocator, nor even memory
> backed by a struct page.
Hmm. Which of the implementations that you've seen will return
something not backed by a struct page? On the half dozen arches I've
looked at (i386/x86_64, powerpc, sparc64, ia64, mips), every one uses
either kmalloc, __get_free_pages, or __alloc_pages at some point, and I
think they all have struct pages behind them.
> For example, I see one that returns kmalloc()ed memory. If the pages
> for the slab are already allocated then __GFP_COMP will not do anything
> there.
Bleh. Perhaps I'm being dense here, but if I'm making a request of
non-zero order and the slab has already been allocated, won't it be
populated with compound pages anyway? Or will it have been allocated as
a single giant compound page, just handing me back individual hunks of
the appropriate size?
I ask this because I seemed to be getting compound pages out of
dma_alloc_coherent even when I *wasn't* passing in __GFP_COMP. This is
apparently why PG_private was set on the individual pages I was getting
back.
<b
On Fri, 17 Mar 2006, Bryan O'Sullivan wrote:
>
> Hmm. Which of the implementations that you've seen will return
> something not backed by a struct page? On the half dozen arches I've
> looked at (i386/x86_64, powerpc, sparc64, ia64, mips), every one uses
> either kmalloc, __get_free_pages, or __alloc_pages at some point, and I
> think they all have struct pages behind them.
kmalloc may be backed by a "struct page", but the point is that it does
not honor the page _count_, and as such it is totally unsuitable for any
VM usage.
The VM relies on the page count very fundamentally, so by doing a
"get_page()" you'll tell every other user that you have a reference to a
page, and it won't be free'd until all references are gone. In contrast,
if you do a "get_page()" on something that has been allocated with
kmalloc(), the slab code will totally ignore it, and happily re-allocate
it to something else once the _original_ allocator free's it.
So kmalloc() really isn't appropriate.
Linus
On Fri, 2006-03-17 at 08:28 -0800, Linus Torvalds wrote:
> kmalloc may be backed by a "struct page", but the point is that it does
> not honor the page _count_, and as such it is totally unsuitable for any
> VM usage.
That's fine. We're not calling dma_free_coherent until after the page
count goes back down to one (the driver is once again the only user).
But this doesn't seem germane to what Nick brought up, anyway.
<b
On Thu, 16 Mar 2006, Roland Dreier wrote:
> Anyway, I'd like to hijack this thread slightly, since we are close to
> a subject that I've been thinking about lately, and I'd like to take
> advantage of the expert's attention...
I look over my shoulder, ah yes, you're speaking to Alan, who's given
you a good reply. But he's looking at it from a bus/driver perspective,
which I can't comment on, so I'll return to your original mail.
> My mthca driver (drivers/infiniband/hw/mthca) supports mapping some
> MMIO registers into userspace via io_remap_pfn_range() in a .mmap
> method. I think I have that pretty well under control.
>
> However, on a hot unplug event, when the underlying PCI device is
> going away, I would like to replace that mapping with a mapping (with
> a mapping to the zero page?), so that userspace accesses after the
> device is gone don't explode. What's the "right" way to do that?
Nobody has asked for that before (so far as I know; or maybe I'm just
not looking at your question from the right angle), so I don't have
any canned right answer.
Sounds a reasonable thing to ask for, especially from the hotplug world.
But it's not how we've proceeded until now: things stay open until the
last user goes away, only then can the driver go away.
Thus a file remains open, and its filesystem busily mounted, until
the last user unmaps or closes it.
You seem to be asking to "revoke" an mmap: not unreasonable, but
we've never provided a revoke method before, and I fear there
might be gotchas. Perhaps there's a different, more familiar
model I can feel more comfortable with...
Shall we look at it, instead, as like truncating a file? You could
use vmtruncate, but you'll have nothing cached, so maybe just use
unmap_mapping_range (see mm/memory.c for args), which is also exported.
Only one caller on the inode(mapping) at once, please (on a regular
file, inode->i_sem is held; but that won't quite work with a device).
So far as I can see, that should work nicely (even, fortuitously,
despite Linus' special adjustment of vm_pgoff on VM_PFNMAP areas).
Though it looks like subsequent userspace accesses will then go to
do_anonymous_page, giving ZERO_PAGE to read faults or a fresh anon
page to write faults: I think I'd prefer it if pte_none accesses in
a VM_PFNMAP area gave SIGBUS, but unsure if we can change that now.
> Presumably it would be something like zeromap_page_range(), but that's
> not exported to modules. Exporting that would be one option, but in a
> way that's overkill for me: I only have a single page to deal with, so
> I could also do something like
>
> vm_insert_page(vma, addr, ZERO_PAGE(addr));
>
> But do I have to do anything to kill the old mapping coming from
> remap_pfn_range()? What's the exported API to do that?
vm_insert_page will fail with -EBUSY if there's something there
already; and though it looks as if it'll do the right thing with
ZERO_PAGE(addr) if the mapping was read-only, it'll very nastily
give write access to it if the mapping was read-write.
But I see no point in inserting pages there anyway: the important
part is removing the old ptes, which unmap_mapping_range should do.
> I can keep a list of vmas that have registers mapped to userspace and
> iterate through it on hot unplug. The only question is what to do
> with those vmas.
unmap_mapping_range will work its own way through the prio_tree of
all vmas mapping your device: so I think it'll spare you some effort.
But it hasn't been used in this way before: I may be missing things.
Hugh
On Fri, 2006-03-17 at 17:13 +0000, Hugh Dickins wrote:
> You seem to be asking to "revoke" an mmap:
Yes. We'd like this ability, too.
> Though it looks like subsequent userspace accesses will then go to
> do_anonymous_page, giving ZERO_PAGE to read faults or a fresh anon
> page to write faults: I think I'd prefer it if pte_none accesses in
> a VM_PFNMAP area gave SIGBUS, but unsure if we can change that now.
It would be unfortunate if userspace were spinning on a chip register,
waiting for the register to transition from zero to non-zero, and we
replaced that mapping with an anonymous page. In that case, userspace
could potentially spin forever, having no way to detect the demise of
the device.
<b
On Fri, 17 Mar 2006, Bryan O'Sullivan wrote:
>
> It would be unfortunate if userspace were spinning on a chip register,
> waiting for the register to transition from zero to non-zero, and we
> replaced that mapping with an anonymous page. In that case, userspace
> could potentially spin forever, having no way to detect the demise of
> the device.
Generally, replacing the mmap with an anonymous zero-mapped mapping would
be a horribly bad idea.
The fact is, you can't avoid the race of seeing the removed state (which
_usually_ means that you will read 0xffffffff from the bus - normal PC's
won't result in bus errors etc). Whatever the kernel does, it can do only
after the device has already been removed - we no longer live in a world
where the administrator can tell the system before-hand that something
will go away.
Replacing the MMIO map with a zero map would be absolutely horrible. It
would be inconsistent, and not even help the fact that the user will haev
seen the removed state.
In fact, I think even "revert" is pretty useless. You're much better off
just sending a perfectly good signal - something that the app will get
regardless of whether it reads the MMIO space at that point in time or
not. After all, the only thing the "revert" would really do is to send a
signal, but then only if the user is trying to access the device.
Anyway, zap_page_range() would do what you want, but it's not exported,
and I'm not convinced it's even something we want to export. You can only
zap a page range from within the context of the zappee, not from an
external module/driver.
[ Maybe it works if somebody else calls it, maybe it doesn't. I wouldn't
bet on it, and more importantly, I can pretty much _guarantee_ that a
driver will get the "struct mm_struct" reference counting wrong. ]
Linus
On Fri, 17 Mar 2006, Linus Torvalds wrote:
>
> Anyway, zap_page_range() would do what you want, but it's not exported,
> and I'm not convinced it's even something we want to export. You can only
> zap a page range from within the context of the zappee, not from an
> external module/driver.
>
> [ Maybe it works if somebody else calls it, maybe it doesn't. I wouldn't
> bet on it, and more importantly, I can pretty much _guarantee_ that a
> driver will get the "struct mm_struct" reference counting wrong. ]
But vmtruncate or unmap_mapping_range is quite used to operating on
whatever mm's have mapped the file, so should be a safe route. Except
here it's a device mapped VM_PFNMAP, so there might prove to be some
gotchas. I'd be happy to make a nopage fault on VM_PFNMAP give SIGBUS,
for sensible behaviour after the "truncate" - but recent history warns
we're liable then to discover some app expecting otherwise ;)
Hugh
Bryan O'Sullivan wrote:
> On Fri, 2006-03-17 at 11:37 +1100, Nick Piggin wrote:
>
>
>>But it doesn't look like dma_alloc_coherent is guaranteed to return
>>memory allocated from the regular page allocator, nor even memory
>>backed by a struct page.
>
>
> Hmm. Which of the implementations that you've seen will return
> something not backed by a struct page? On the half dozen arches I've
> looked at (i386/x86_64, powerpc, sparc64, ia64, mips), every one uses
> either kmalloc, __get_free_pages, or __alloc_pages at some point, and I
> think they all have struct pages behind them.
>
Well I was mainly just looking at the interface definition and that it
only returns an address. i386 specifically looks like it can return
ioremap memory in some cases, right? (which I don't think needs to have
a struct page behind it)
>
>>For example, I see one that returns kmalloc()ed memory. If the pages
>>for the slab are already allocated then __GFP_COMP will not do anything
>>there.
>
>
> Bleh. Perhaps I'm being dense here, but if I'm making a request of
> non-zero order and the slab has already been allocated, won't it be
> populated with compound pages anyway? Or will it have been allocated as
> a single giant compound page, just handing me back individual hunks of
> the appropriate size?
>
By the looks of the slab implementation, it will vary depending on
whether or not the slab was already allocated and whether that was using
__GFP_COMP or not.
> I ask this because I seemed to be getting compound pages out of
> dma_alloc_coherent even when I *wasn't* passing in __GFP_COMP. This is
> apparently why PG_private was set on the individual pages I was getting
> back.
>
I think the reason was a little different: you're getting a higher order
page. This can either be compound or non-compound. If it is not compound,
then you (or the VM) can easily end up freeing one of its constituent
pages (and PagePrivate is set when that page is free in the allocator).
Then when you go to free the whole higher-order page, it finds that one
of its constituent pages has PagePrivate set (which is a bug).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
[I'm not a driver or pci/dma person either, so I can't usefully answer
most of your questions, I'm afraid]
Hugh Dickins wrote:
> Is there any chance that your split_page() work in -mm, actually addresses
> precisely those places that were screwed up by universal compound pages?
> So that with your split_page(), we could go back to every >0-order page
> being PageCompound, without any need for __GFP_COMP.
>
I think it should catch most of the places [I'm sure I've missed some :(]
that split up higher-order pages (which doesn't work on compound pages, I
guess this was the problem).
It makes like difficult for some future patch of mine if the refcounting
mechanism is changed while the page is allocated (eg. like PageReserved
used to do), however I think it wouldn't be too hard to instead invert the
meaning of the flag and just use it in those few places that care?
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Bryan O'Sullivan wrote:
> On Fri, 2006-03-17 at 08:28 -0800, Linus Torvalds wrote:
>
>
>>kmalloc may be backed by a "struct page", but the point is that it does
>>not honor the page _count_, and as such it is totally unsuitable for any
>>VM usage.
>
>
> That's fine. We're not calling dma_free_coherent until after the page
> count goes back down to one (the driver is once again the only user).
And that's probably fine too (provided you can somehow be sure that
you're getting compound pages, which you currently can't) - you're
essentially doing your own refcounting then. However in that case you
do need to ensure get_user_pages can't operate on your mapping (use
VM_IO), because that can hijack the page lifetime (ie. hold a ref
even after all mappings have gone).
> But this doesn't seem germane to what Nick brought up, anyway.
>
Well kmalloc is one of the possible problems I saw, but there may be
others too (eg. ioremap).
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Linus> Generally, replacing the mmap with an anonymous zero-mapped
Linus> mapping would be a horribly bad idea.
Linus> The fact is, you can't avoid the race of seeing the removed
Linus> state (which _usually_ means that you will read 0xffffffff
Linus> from the bus - normal PC's won't result in bus errors
Linus> etc). Whatever the kernel does, it can do only after the
Linus> device has already been removed - we no longer live in a
Linus> world where the administrator can tell the system
Linus> before-hand that something will go away.
Linus> Replacing the MMIO map with a zero map would be absolutely
Linus> horrible. It would be inconsistent, and not even help the
Linus> fact that the user will haev seen the removed state.
Linus> In fact, I think even "revert" is pretty useless. You're
Linus> much better off just sending a perfectly good signal -
Linus> something that the app will get regardless of whether it
Linus> reads the MMIO space at that point in time or not. After
Linus> all, the only thing the "revert" would really do is to send
Linus> a signal, but then only if the user is trying to access the
Linus> device.
Hmm, you're probably right for the hot unplug case. However, there
are a couple of other related situations I've thought of in this context:
- Module remove: userspace has PCI memory mmap()ed. The module is
removed. Userspace still has access to PCI memory. And if the
module is reloaded, userspace still has access to the device that
the driver doesn't know about, so the driver may hand off the same
access to another process.
The obvious solution here is to just take a module reference when
userspace mmap()s the device. However unprivileged processes can
get direct access to IB devices, and it may not be so nice for
unprivileged processes to be able to hold off module removal.
- PCI error recovery (or internal device error recovery): if the
driver detects a problem with the PCI bus or device, it might have
to reset the device. The most natural way to model this is as hot
unplug followed by hot plug. However we don't want userspace
processes to keep their BAR mapping across the device reset,
because the device is probably not set up to handle it right after reset.
Really in this case at least, revoking an mmap() seems the cleanest
solution to me.
Any further thoughts?
Thanks,
Roland
On Thu, 2006-03-16 at 20:10 +0000, Hugh Dickins wrote:
> You wisely remarked earlier that you'd not yet checked for memory leaks:
> that is of course the complementary, less obvious, error to the troubles
> you've been having so far: and I wish you luck when you come to check,
> hoping that I haven't merely misled you from one side to the other!
Well, as it turns out, we're not out of the woods.
The memory that we allocate with dma_alloc_coherent is indeed being
handed back to us as compound pages, but something horrible is happening
after that.
Our nopage handler does an unconditional get_page on every page that it
gets handed, but nothing seems to ever call put_page on those pages. At
least, I infer this to be the case, because the page counts look all
wrong when I free them. The kernel thinks I leak 4.5MB of memory every
time I run "hello, world" (~1300 allocations of 4K).
If you recall, we have three different areas of memory that userspace
can mmap. Two of these (rcvhdrq and pioavailregs) are allocated at
driver init time, and freed when the driver exits. The third (egrbufs)
is allocated when the process opens the char device, and closed when the
char device's release method is called.
I've instrumented my calls to dma_alloc_coherent and dma_free_coherent,
and they are getting called when they should be. However, on each
successive run of "hello, world", the page counts on the rcvhdrq and
pioavailregs pages (the ones allocated at driver start) increase,
without ever decreasing. And for the egrbufs pages, the page counts go
up to 8 (I'm allocating 32K at a time, so get_page is called 8 times per
compound page), but I never get handed the same struct page on two
successive runs, so I conclude that the struct pages are leaking.
So my belief is that I am freeing the memory behind the struct pages,
but the struct pages themselves stay stuck with a permanently elevated
count. This is obviously not a winning strategy.
On 2.6.15, this all "works", but I have the poor kernel thinking it
loses 4.5MB of memory on every run. On 2.6.16-rc6, I get a BUG. I'll
have to do a bit of work to reproduce it, so I can't paste it here yet.
So my quandary is thus: what am I doing wrong? It seems that my calls
to get_page are more or less correct, but should I be doing a put_page
somewhere, too, such as when the char dev's release method is called,
prior to calling dma_free_coherent?
<b
On Tue, 21 Mar 2006, Bryan O'Sullivan wrote:
>
> On 2.6.15, this all "works", but I have the poor kernel thinking it
> loses 4.5MB of memory on every run. On 2.6.16-rc6, I get a BUG. I'll
> have to do a bit of work to reproduce it, so I can't paste it here yet.
>
> So my quandary is thus: what am I doing wrong? It seems that my calls
> to get_page are more or less correct, but should I be doing a put_page
> somewhere, too, such as when the char dev's release method is called,
> prior to calling dma_free_coherent?
Please mail me your source (I guess as a patch against 2.6.15),
and tomorrow I'll try to see if I can work out the leak; probably
won't work out the BUG until you can send us the messages - thanks.
Hugh
On Tue, 2006-03-21 at 23:20 +0000, Hugh Dickins wrote:
> Please mail me your source (I guess as a patch against 2.6.15),
> and tomorrow I'll try to see if I can work out the leak; probably
> won't work out the BUG until you can send us the messages - thanks.
Thank you for the kind offer. Before I send you anything, though, I
have some interesting information to report.
First of all, it turns out that the BUG I mentioned was reported against
the SLES10 2.6.16-rc6 kernel. I haven't had a chance to chase it down
yet, but I'm going to have to, because...
...the driver actually works just fine under 2.6.16-final. No memory
leaks, no funnies with page counting being wrong.
Under 2.6.15, what seems to be actually happening is that vmops->nopage
is being called on each page of a 32K compound page, driving the page
count from 1 (prior to any nopage calls) to 9. By the time I get to my
cleanup code, the page count has gone from 9 to 8 (whereas under 2.6.16,
the page count has gone from 9 back to 1, where it belongs). From this,
it seems fairly clear that the kernel isn't decrementing the use counts
correctly on compound pages in 2.6.15.
I think my next step, rather than boring you to tears with an
interminable slog through unfamiliar source code, is to try Linus's
suggestion from last week of just shooting nopage in the head, and
instead use remap_pfn_range in fops->mmap. If the stars are aligned,
perhaps this will give me something that works on a wide variety of
kernels.
Thanks again,
<b
On Wed, 22 Mar 2006, Bryan O'Sullivan wrote:
>
> Under 2.6.15, what seems to be actually happening is that vmops->nopage
> is being called on each page of a 32K compound page, driving the page
> count from 1 (prior to any nopage calls) to 9. By the time I get to my
> cleanup code, the page count has gone from 9 to 8 (whereas under 2.6.16,
> the page count has gone from 9 back to 1, where it belongs). From this,
> it seems fairly clear that the kernel isn't decrementing the use counts
> correctly on compound pages in 2.6.15.
Ahh. Isn't this the same problem that the fairly recent "mm: compound
release fix" by Nick should fix?
Git commit ID 8519fb30e438f8088b71a94a7d5a660a814d3872, to be exact:
Author: Nick Piggin <[email protected]>
Date: Tue Feb 7 12:58:52 2006 -0800
[PATCH] mm: compound release fix
Compound pages on SMP systems can now often be freed from pagetables via
the release_pages path. This uses put_page_testzero which does not handle
compound pages at all. Releasing constituent pages from process mappings
decrements their count to a large negative number and leaks the reference
at the head page - net result is a memory leak.
The problem was hidden because the debug check in put_page_testzero itself
actually did take compound pages into consideration.
Fix the bug and the debug check.
Signed-off-by: Nick Piggin <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
But anything based on 2.6.16-rc6 should be fine (The bug was fixed by
2.6.16-rc3, methinks). You said:
"First of all, it turns out that the BUG I mentioned was reported
against the SLES10 2.6.16-rc6 kernel. I haven't had a chance to chase
it down yet, but I'm going to have to, because..."
but if that _really_ is 2.6.16-rc6-based, this problem should have been
fixed already.
Maybe SLES is based on 2.6._15_-rc6?
Linus
On Wed, 2006-03-22 at 08:19 -0800, Linus Torvalds wrote:
> Ahh. Isn't this the same problem that the fairly recent "mm: compound
> release fix" by Nick should fix?
It sure sounds like it to me.
> But anything based on 2.6.16-rc6 should be fine (The bug was fixed by
> 2.6.16-rc3, methinks). You said:
>
> "First of all, it turns out that the BUG I mentioned was reported
> against the SLES10 2.6.16-rc6 kernel. I haven't had a chance to chase
> it down yet, but I'm going to have to, because..."
>
> but if that _really_ is 2.6.16-rc6-based, this problem should have been
> fixed already.
I'll have to first make sure I can reproduce the problem on SLES10 beta
(since I'm not the one who saw it), then go off and take a look at the
SLES10 beta kernel source and patches to see what's going on.
The SLES10 beta kernels have been following 2.6.16 very closely (each
one has been either based on a git snapshot or a release candidate), but
there's a biggish quilt patch stack on top. *If* there's a bug at all,
I wouldn't be surprised if it's somewhere in that pile of patches.
<b
On Wed, 22 Mar 2006, Bryan O'Sullivan wrote:
>
> ...the driver actually works just fine under 2.6.16-final. No memory
> leaks, no funnies with page counting being wrong.
Ah, great, then I needn't look through your code, phew (no offence)!
> Under 2.6.15, what seems to be actually happening is that vmops->nopage
> is being called on each page of a 32K compound page, driving the page
> count from 1 (prior to any nopage calls) to 9. By the time I get to my
> cleanup code, the page count has gone from 9 to 8 (whereas under 2.6.16,
> the page count has gone from 9 back to 1, where it belongs). From this,
> it seems fairly clear that the kernel isn't decrementing the use counts
> correctly on compound pages in 2.6.15.
I'm sure Linus is right about that. I remembered put_page_testzero
checking the wrong part of the compound page in its BUG_ON, but I'd
forgotten that release_pages ended up not freeing the compound page
at all. Yes, 2.6.15 and its relatives do indeed leak there.
> I think my next step, rather than boring you to tears with an
> interminable slog through unfamiliar source code, is to try Linus's
> suggestion from last week of just shooting nopage in the head, and
> instead use remap_pfn_range in fops->mmap. If the stars are aligned,
> perhaps this will give me something that works on a wide variety of
> kernels.
That may well be a good plan (given the doubts Nick raised about
whether dma_alcohol_rent gives the right kind of struct page non-slab
memory on all arches). But one way in which the stars will be slightly
misaligned: for 2.6.14 and earlier you'll need to SetPageReserved on
each constituent of the >0-page, to get remap_pfn_range to map it (and
ClearPageReserved before freeing the >0-page); that won't do any harm
on 2.6.15 and 2.6.16 (apart from enlarging the code unnecessarily);
but we might one day remove those macros, from driver use anyway.
Hugh
On Wed, 2006-03-22 at 17:46 +0000, Hugh Dickins wrote:
> Ah, great, then I needn't look through your code, phew (no offence)!
Indeed :-)
> That may well be a good plan (given the doubts Nick raised about
> whether dma_alcohol_rent gives the right kind of struct page non-slab
> memory on all arches).
Well, it Works For Me (TM) right now. Thank goodness. And the driver
is about 200 lines shorter without the nopage handler and consequent
mucking about.
> But one way in which the stars will be slightly
> misaligned: for 2.6.14 and earlier you'll need to SetPageReserved on
> each constituent of the >0-page, to get remap_pfn_range to map it (and
> ClearPageReserved before freeing the >0-page); that won't do any harm
> on 2.6.15 and 2.6.16 (apart from enlarging the code unnecessarily);
> but we might one day remove those macros, from driver use anyway.
Yes, we're already doing this. So far, I've tested on 2.6.12, 2.6.15,
and 2.6.16 using remap_pfn_range, and everything appears to be
cromulent.
<b