Hi!
The comment at set_page_dirty_lock says:
/*
* set_page_dirty() is racy if the caller has no reference against
* page->mapping->host, and if the page is unlocked. This is because another
* CPU could truncate the page off the mapping and then free the mapping.
*
* Usually, the page _is_ locked, or the caller is a user-space process which
* holds a reference on the inode by having an open file.
*
* In other cases, the page should be locked before running set_page_dirty().
*/
Still, I wander whether it might be OK to use set_page_dirty
in another case - if I previously got a reference to the page
with get_user_pages?
The page wouldnt be written back in this case, would it?
What if I'm in the middle of a system call?
Thanks,
--
MST
On Thu, 8 Dec 2005, Michael S. Tsirkin wrote:
> Hi!
> The comment at set_page_dirty_lock says:
>
> /*
> * set_page_dirty() is racy if the caller has no reference against
> * page->mapping->host, and if the page is unlocked. This is because another
> * CPU could truncate the page off the mapping and then free the mapping.
> *
> * Usually, the page _is_ locked, or the caller is a user-space process which
> * holds a reference on the inode by having an open file.
> *
> * In other cases, the page should be locked before running set_page_dirty().
> */
>
> Still, I wander whether it might be OK to use set_page_dirty
> in another case - if I previously got a reference to the page
> with get_user_pages?
> The page wouldnt be written back in this case, would it?
It might be, there's no guarantee not. So if it was written back just
before you did your own dirtying of the page, you do need to set page dirty
again after (usually when releasing the pages got). And get_user_pages is
a typical case when set_page_dirty_lock is really needed - you don't
usually have any hold on the inode (if any) that backs those pages.
It can be very inconvenient (I don't know what to do for drivers/scsi/sg.c
than set_page_dirty and hope for the best, since it cannot wait for a lock
where it needs to). But I'm afraid you do have the very case where
set_page_dirty_lock is appropriate.
Many would be pleased if we could manage without set_page_dirty_lock.
> What if I'm in the middle of a system call?
What if you are?
Hugh
Quoting Hugh Dickins <[email protected]>:
> Many would be pleased if we could manage without set_page_dirty_lock.
Yes, it forces me to bounce a work into a queue from the interrupt.
Thanks!
--
MST
On Thu, Dec 08 2005, Hugh Dickins wrote:
> It can be very inconvenient (I don't know what to do for drivers/scsi/sg.c
> than set_page_dirty and hope for the best, since it cannot wait for a lock
> where it needs to). But I'm afraid you do have the very case where
> set_page_dirty_lock is appropriate.
See bio_set_pages_dirty() in fs/bio.c and the framework for handling
those (notably bio_dirty_fn()).
> Many would be pleased if we could manage without set_page_dirty_lock.
Indeed, would make life easier there as well..
--
Jens Axboe
Quoting Hugh Dickins <[email protected]>:
> Many would be pleased if we could manage without set_page_dirty_lock.
It seems that I can do
if (TestSetPageLocked(page))
schedule_work()
and in this way, avoid the schedule_work overhead for the common case
where the page isnt locked.
Right?
If that works, I can mostly do things directly,
although I'm still stuck with the problem of an app performing
a fork + write into the same page while I'm doing DMA there.
I am currently solving this by doing a second get_user_pages after
DMA is done and comparing the page lists, but this, of course,
needs a task context ...
--
MST
Michael S. Tsirkin wrote:
> Quoting Hugh Dickins <[email protected]>:
>
>>Many would be pleased if we could manage without set_page_dirty_lock.
>
>
> It seems that I can do
>
> if (TestSetPageLocked(page))
> schedule_work()
>
> and in this way, avoid the schedule_work overhead for the common case
> where the page isnt locked.
> Right?
>
I think you can do that - provided you ensure the page mapping hasn't
disappeared after locking it. However, I think you should try to the
simplest way first.
> If that works, I can mostly do things directly,
> although I'm still stuck with the problem of an app performing
> a fork + write into the same page while I'm doing DMA there.
>
> I am currently solving this by doing a second get_user_pages after
> DMA is done and comparing the page lists, but this, of course,
> needs a task context ...
>
Usually we don't care about these kinds of races happening. So long
as it doesn't oops the kernel or hang the hardware, it is up to
userspace not to do stuff like that.
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Quoting Nick Piggin <[email protected]>:
> > It seems that I can do
> >
> > if (TestSetPageLocked(page))
> > schedule_work()
> >
> > and in this way, avoid the schedule_work overhead for the common case
> > where the page isnt locked.
> > Right?
> >
>
> I think you can do that - provided you ensure the page mapping hasn't
> disappeared after locking it.
Ugh. I dont really know how to do that.
Isnt this sufficient?
if (TestSetPageLocked(page))
schedule_work(...)
set_page_dirty(page)
unlock_page(page)
Thats all there seems to be set_page_dirty_lock does if it is called when
PG_Locked bit is clear.
> However, I think you should try to the simplest way first.
Thanks, Nick, thats what I have now, this already works for me here
https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_iocb.c
I'm looking at ways to improve performance, though.
> > If that works, I can mostly do things directly,
> > although I'm still stuck with the problem of an app performing
> > a fork + write into the same page while I'm doing DMA there.
> >
> > I am currently solving this by doing a second get_user_pages after
> > DMA is done and comparing the page lists, but this, of course,
> > needs a task context ...
> >
>
> Usually we don't care about these kinds of races happening. So long
> as it doesn't oops the kernel or hang the hardware, it is up to
> userspace not to do stuff like that.
Note that I am, even, not necessarily talking about full pages
here: an application could be writing to one part of a page
while hardware DMAs another part of it.
So the app is not necessarily buggy.
It seems to me people really dont want to change their applications.
They just want to load a library and have it go faster.
Given that I'm implementing a socket protocol, we are talking about
an awful lot of applications that currently work fine on top of TCP.
--
MST
Michael S. Tsirkin wrote:
> Quoting Nick Piggin <[email protected]>:
>
>>I think you can do that - provided you ensure the page mapping hasn't
>>disappeared after locking it.
>
>
> Ugh. I dont really know how to do that.
> Isnt this sufficient?
>
> if (TestSetPageLocked(page))
> schedule_work(...)
> set_page_dirty(page)
> unlock_page(page)
>
> Thats all there seems to be set_page_dirty_lock does if it is called when
> PG_Locked bit is clear.
>
Oh yeah you are right - set_page_dirty does the check for you. Sorry
for the misinformation.
>
>>However, I think you should try to the simplest way first.
>
>
> Thanks, Nick, thats what I have now, this already works for me here
> https://openib.org/svn/gen2/trunk/src/linux-kernel/infiniband/ulp/sdp/sdp_iocb.c
> I'm looking at ways to improve performance, though.
>
Oh good :)
>
>>>If that works, I can mostly do things directly,
>>>although I'm still stuck with the problem of an app performing
>>>a fork + write into the same page while I'm doing DMA there.
>>>
>>>I am currently solving this by doing a second get_user_pages after
>>>DMA is done and comparing the page lists, but this, of course,
>>>needs a task context ...
>>>
>>
>>Usually we don't care about these kinds of races happening. So long
>>as it doesn't oops the kernel or hang the hardware, it is up to
>>userspace not to do stuff like that.
>
>
> Note that I am, even, not necessarily talking about full pages
> here: an application could be writing to one part of a page
> while hardware DMAs another part of it.
> So the app is not necessarily buggy.
>
Sorry, I might have misunderstdood: what's the race? And how does
a second get_user_pages solve it?
> It seems to me people really dont want to change their applications.
> They just want to load a library and have it go faster.
> Given that I'm implementing a socket protocol, we are talking about
> an awful lot of applications that currently work fine on top of TCP.
>
Understandable.
Nick
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Quoting Nick Piggin <[email protected]>:
> >>>If that works, I can mostly do things directly,
> >>>although I'm still stuck with the problem of an app performing
> >>>a fork + write into the same page while I'm doing DMA there.
> >>>
> >>>I am currently solving this by doing a second get_user_pages after
> >>>DMA is done and comparing the page lists, but this, of course,
> >>>needs a task context ...
> >>>
> >>
> >>Usually we don't care about these kinds of races happening. So long
> >>as it doesn't oops the kernel or hang the hardware, it is up to
> >>userspace not to do stuff like that.
> >
> >
> > Note that I am, even, not necessarily talking about full pages
> > here: an application could be writing to one part of a page
> > while hardware DMAs another part of it.
> > So the app is not necessarily buggy.
> >
>
> Sorry, I might have misunderstdood: what's the race? And how does
> a second get_user_pages solve it?
Here's what I have in mind:
A multithreaded app calls recvmsg(2), (or io_submit with receive request),
passing in a buffer that is not page aligned.
This does get_user_pages on some page and blocks waiting for DMA to complete.
Another thread calls fork(2), marking the page for copy on write.
After fork, it writes (even 1 byte) into one of the pages that were passed
to recvmsg, possibly even outside the buffer passed to recvmsg.
This triggers a page copy in the parent process.
Any data that the device DMA's into this page after this point is not
seen by the application, since it goes into the original page,
while a copy is now mapped into the parent's memory.
--
MST
Michael S. Tsirkin wrote:
> Quoting Nick Piggin <[email protected]>:
>
>>>>>If that works, I can mostly do things directly,
>>>>>although I'm still stuck with the problem of an app performing
>>>>>a fork + write into the same page while I'm doing DMA there.
>>>>>
>>>>>I am currently solving this by doing a second get_user_pages after
>>>>>DMA is done and comparing the page lists, but this, of course,
>>>>>needs a task context ...
>>>>>
>>>>
>>>>Usually we don't care about these kinds of races happening. So long
>>>>as it doesn't oops the kernel or hang the hardware, it is up to
>>>>userspace not to do stuff like that.
>>>
>>>
>>>Note that I am, even, not necessarily talking about full pages
>>>here: an application could be writing to one part of a page
>>>while hardware DMAs another part of it.
>>>So the app is not necessarily buggy.
>>>
>>
>>Sorry, I might have misunderstdood: what's the race? And how does
>>a second get_user_pages solve it?
>
>
> Here's what I have in mind:
>
> A multithreaded app calls recvmsg(2), (or io_submit with receive request),
> passing in a buffer that is not page aligned.
> This does get_user_pages on some page and blocks waiting for DMA to complete.
>
> Another thread calls fork(2), marking the page for copy on write.
> After fork, it writes (even 1 byte) into one of the pages that were passed
> to recvmsg, possibly even outside the buffer passed to recvmsg.
> This triggers a page copy in the parent process.
>
OK, yeah if a thread in the parent process writes into the buffer, then
yes this would leave the copy in the parent AFAIKS.
But this is going to do similar weird stuff when racing with copy_to_user
with ethernet recvmsg, is it not? (and direct-io and probably others). As
such, I don't think it would be something you in particular need to worry
about.
I guess to solve it, we could either retain mmap_sem for the duration to
prevent fork, or try to do something tricky with page_count to determine
if we need to do a copy in fork() rather than a COW.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Quoting Nick Piggin <[email protected]>:
> OK, yeah if a thread in the parent process writes into the buffer, then
> yes this would leave the copy in the parent AFAIKS.
>
> But this is going to do similar weird stuff when racing with
> copy_to_user
> with ethernet recvmsg, is it not? (and direct-io and probably others).
FWIW, I think that copy_to_user will work correctly since it keeps the mmap
semaphore for the duration of the copy.
Direct-io might have the same problem.
> As such, I don't think it would be something you in particular need to
> worry about.
>
> I guess to solve it, we could either retain mmap_sem for the duration to
> prevent fork,
Since this is the receive side, the DMA can take an indefinite
time to arrive. Isnt this a problem if we keep the mmap_sem?
> or try to do something tricky with page_count to determine
> if we need to do a copy in fork() rather than a COW.
I'm actually reasonably happy with the trick that I'm using:
performing a second get_user_pages after DMA and comparing
the page lists.
However, doing this every time on the off chance that a
page was made COW forces me into task context, every time.
--
MST
Michael S. Tsirkin wrote:
>
> FWIW, I think that copy_to_user will work correctly since it keeps the mmap
> semaphore for the duration of the copy.
Oh, true.
> Direct-io might have the same problem.
>
>
>>As such, I don't think it would be something you in particular need to
>>worry about.
>>
>>I guess to solve it, we could either retain mmap_sem for the duration to
>>prevent fork,
>
>
> Since this is the receive side, the DMA can take an indefinite
> time to arrive. Isnt this a problem if we keep the mmap_sem?
>
Well... it goes against our usual stance of trying to push
these kinds of synchronisation issues to userspace.
In a way, you're doing direct-io from the network and so it is
perhaps reasonable to expect the racy semantics that O_DIRECT
has. OTOH, if you are providing the same API on a different
device, then basically by definition you need to provide the
exact same semantics.
>
>>or try to do something tricky with page_count to determine
>>if we need to do a copy in fork() rather than a COW.
>
>
> I'm actually reasonably happy with the trick that I'm using:
> performing a second get_user_pages after DMA and comparing
> the page lists.
> However, doing this every time on the off chance that a
> page was made COW forces me into task context, every time.
>
I think it might be possible to solve it with the early-copy in
fork(). I'll tinker with it.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
Quoting Nick Piggin <[email protected]>:
> >>or try to do something tricky with page_count to determine
> >>if we need to do a copy in fork() rather than a COW.
> >
> >
> > I'm actually reasonably happy with the trick that I'm using:
> > performing a second get_user_pages after DMA and comparing
> > the page lists.
> > However, doing this every time on the off chance that a
> > page was made COW forces me into task context, every time.
> >
>
> I think it might be possible to solve it with the early-copy in
> fork(). I'll tinker with it.
Thanks, that would help!
--
MST
Quoting Nick Piggin <[email protected]>:
> >
> >>or try to do something tricky with page_count to determine
> >>if we need to do a copy in fork() rather than a COW.
> >
> >
> > I'm actually reasonably happy with the trick that I'm using:
> > performing a second get_user_pages after DMA and comparing
> > the page lists.
> > However, doing this every time on the off chance that a
> > page was made COW forces me into task context, every time.
> >
>
> I think it might be possible to solve it with the early-copy in
> fork(). I'll tinker with it.
A possible way to do this would be adding an option to get_user_pages,
and if selected, setting a flag on all pages in range.
This flag should be cleared in put_page when the reference count drops to 1.
The only thing left would be to do the actual copy if the flag is set.
HTH,
--
MST