2003-05-13 20:33:52

by Dave McCracken

Subject: Race between vmtruncate and mapped areas?


As part of chasing the BUG() we've been seeing in objrmap I took a good
look at vmtruncate(). I believe I've identified a race condition that not
only triggers that BUG(), but could also cause some strange behavior
without the objrmap patch.

Basically vmtruncate() does the following steps: first, it unmaps the
truncated pages from all page tables using zap_page_range(). Then it
removes those pages from the page cache using truncate_inode_pages().
These steps are done without any lock that I can find, so it's possible for
another task to get in between the unmap and the remove, and remap one or
more pages back into its page tables.

The result of this is a page that has been disconnected from the file but
is mapped in a task's address space as if it were still part of that file.
Any further modifications to this page will be lost.

I can easily detect this condition by adding a bugcheck for page_mapped()
in truncate_complete_page(), then running Andrew's bash-shared-mapping test
case.

Please feel free to poke holes in my analysis. I'm not at all sure I
haven't missed some subtlety here.

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059


2003-05-13 20:39:44

by Mika Penttilä

Subject: Re: Race between vmtruncate and mapped areas?

Isn't that what inode->i_sem is supposed to protect...?

--Mika


Dave McCracken wrote:

>As part of chasing the BUG() we've been seeing in objrmap I took a good
>look at vmtruncate(). I believe I've identified a race condition that not
>only triggers that BUG(), but could also cause some strange behavior
>without the objrmap patch.
>
>Basically vmtruncate() does the following steps: first, it unmaps the
>truncated pages from all page tables using zap_page_range(). Then it
>removes those pages from the page cache using truncate_inode_pages().
>These steps are done without any lock that I can find, so it's possible for
>another task to get in between the unmap and the remove, and remap one or
>more pages back into its page tables.
>
>The result of this is a page that has been disconnected from the file but
>is mapped in a task's address space as if it were still part of that file.
>Any further modifications to this page will be lost.
>
>I can easily detect this condition by adding a bugcheck for page_mapped()
>in truncate_complete_page(), then running Andrew's bash-shared-mapping test
>case.
>
>Please feel free to poke holes in my analysis. I'm not at all sure I
>haven't missed some subtlety here.
>
>Dave McCracken
>
>======================================================================
>Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
>[email protected] T/L 678-3059
>


2003-05-13 20:51:33

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tue, May 13, 2003 at 11:58:21PM +0300, Mika Penttilä wrote:
> Isn't that what inode->i_sem is supposed to protect...?
> --Mika

It's already called under inode->i_sem. The trouble is that it's not
the ->i_sem but the ->page_lock that's taken by those it's racing
against.


-- wli

2003-05-13 20:48:07

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tue, May 13, 2003 at 03:44:45PM -0500, Dave McCracken wrote:
> As part of chasing the BUG() we've been seeing in objrmap I took a good
> look at vmtruncate(). I believe I've identified a race condition that not
> only triggers that BUG(), but could also cause some strange behavior
> without the objrmap patch.
> Basically vmtruncate() does the following steps: first, it unmaps the
> truncated pages from all page tables using zap_page_range(). Then it
> removes those pages from the page cache using truncate_inode_pages().
> These steps are done without any lock that I can find, so it's possible for
> another task to get in between the unmap and the remove, and remap one or
> more pages back into its page tables.
> The result of this is a page that has been disconnected from the file but
> is mapped in a task's address space as if it were still part of that file.
> Any further modifications to this page will be lost.
> I can easily detect this condition by adding a bugcheck for page_mapped()
> in truncate_complete_page(), then running Andrew's bash-shared-mapping test
> case.
> Please feel free to poke holes in my analysis. I'm not at all sure I
> haven't missed some subtlety here.

There are various flavors of pain here. I think this was the new page
state Hugh was talking about in an earlier post on the subject.

-- wli

2003-05-13 22:16:10

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 23:58:21 +0300 Mika Penttilä
<[email protected]> wrote:

> Isn't that what inode->i_sem is supposed to protect...?

Hmm... Yep, it is. I did some more investigating. My initial scenario
required that the task mapping the page extend the file after the truncate,
which must be done via some kind of write(). The write() would trip over
i_sem and therefore hang waiting for vmtruncate() to complete. So I was
wrong about that one.

However, vmtruncate() does get to truncate_complete_page() with a page
that's mapped...

After some thought it occurred to me there is a simple alternative scenario
that's not protected. If a task is *already* in a page fault mapping the
page in, then vmtruncate() could call zap_page_range() before the page
fault completes. When the page fault does complete the page will be mapped
into the area previously cleared by vmtruncate().

We could make vmtruncate() take mmap_sem for write, but that seems somewhat
drastic. Does anyone have any alternative ideas?

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-13 22:36:54

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tue, May 13, 2003 at 05:26:24PM -0500, Dave McCracken wrote:
> Hmm... Yep, it is. I did some more investigating. My initial scenario
> required that the task mapping the page extend the file after the truncate,
> which must be done via some kind of write(). The write() would trip over
> i_sem and therefore hang waiting for vmtruncate() to complete. So I was
> wrong about that one.
> However, vmtruncate() does get to truncate_complete_page() with a page
> that's mapped...
> After some thought it occurred to me there is a simple alternative scenario
> that's not protected. If a task is *already* in a page fault mapping the
> page in, then vmtruncate() could call zap_page_range() before the page
> fault completes. When the page fault does complete the page will be mapped
> into the area previously cleared by vmtruncate().
> We could make vmtruncate() take mmap_sem for write, but that seems somewhat
> drastic. Does anyone have any alternative ideas?

That doesn't sound like it's going to help, there isn't a unique
mmap_sem to be taken and so we just get caught between acquisitions
with the same problem.


-- wli

2003-05-13 22:49:22

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 15:49:29 -0700 William Lee Irwin III
<[email protected]> wrote:

> That doesn't sound like it's going to help, there isn't a unique
> mmap_sem to be taken and so we just get caught between acquisitions
> with the same problem.

Actually it does fix it. I added code in vmtruncate_list() to do a
down_write(&vma->vm_mm->mmap_sem) around the zap_page_range(), and the
problem went away. It serializes against any outstanding page faults on a
particular page table. New faults will see that the page is no longer in
the file and fail with SIGBUS. Andrew's test case stopped failing.

I've attached the patch so you can see what I did.

Can anyone think of any gotchas to this solution?

Dave McCracken

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059


Attachments:
(No filename) (946.00 B)
vmtrunc-2.5.69-mm3-1.diff (982.00 B)

2003-05-13 23:03:32

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tuesday, May 13, 2003 15:49:29 -0700 William Lee Irwin III <[email protected]> wrote:
>> That doesn't sound like it's going to help, there isn't a unique
>> mmap_sem to be taken and so we just get caught between acquisitions
>> with the same problem.

On Tue, May 13, 2003 at 06:00:08PM -0500, Dave McCracken wrote:
> Actually it does fix it. I added code in vmtruncate_list() to do a
> down_write(&vma->vm_mm->mmap_sem) around the zap_page_range(), and the
> problem went away. It serializes against any outstanding page faults on a
> particular page table. New faults will see that the page is no longer in
> the file and fail with SIGBUS. Andrew's test case stopped failing.
> I've attached the patch so you can see what I did.
> Can anyone think of any gotchas to this solution?

How many processes are you doing this with? If it's a small number can
you try it with 1000 or so?


-- wli

2003-05-13 23:04:34

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 16:11:39 -0700 William Lee Irwin III
<[email protected]> wrote:

> Okay, what's stopping filemap_nopage() from fetching the page from
> pagecache after one of the mm->mmap_sem's is dropped but before
> truncate_inode_pages() removes the page? The fault path is only locked
> out for one mm during one part of the operation. I can see taking
> ->i_sem in do_no_page() fixing it, but not ->mmap_sem in vmtruncate()
> (but of course that's _far_ too heavy-handed to merge at all).

mmap_sem is held for read across the entire fault, so by the time
vmtruncate_list() can call zap_page_range() the page has been instantiated
in the page table and will get removed.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-13 23:00:38

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tuesday, May 13, 2003 15:49:29 -0700 William Lee Irwin III <[email protected]> wrote:
>> That doesn't sound like it's going to help, there isn't a unique
>> mmap_sem to be taken and so we just get caught between acquisitions
>> with the same problem.

On Tue, May 13, 2003 at 06:00:08PM -0500, Dave McCracken wrote:
> Actually it does fix it. I added code in vmtruncate_list() to do a
> down_write(&vma->vm_mm->mmap_sem) around the zap_page_range(), and the
> problem went away. It serializes against any outstanding page faults on a
> particular page table. New faults will see that the page is no longer in
> the file and fail with SIGBUS. Andrew's test case stopped failing.
> I've attached the patch so you can see what I did.
> Can anyone think of any gotchas to this solution?

Okay, what's stopping filemap_nopage() from fetching the page from
pagecache after one of the mm->mmap_sem's is dropped but before
truncate_inode_pages() removes the page? The fault path is only locked
out for one mm during one part of the operation. I can see taking
->i_sem in do_no_page() fixing it, but not ->mmap_sem in vmtruncate()
(but of course that's _far_ too heavy-handed to merge at all).

-- wli

2003-05-13 23:08:30

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tuesday, May 13, 2003 16:11:39 -0700 William Lee Irwin III <[email protected]> wrote:
>> Okay, what's stopping filemap_nopage() from fetching the page from
>> pagecache after one of the mm->mmap_sem's is dropped but before
>> truncate_inode_pages() removes the page? The fault path is only locked
>> out for one mm during one part of the operation. I can see taking
>> ->i_sem in do_no_page() fixing it, but not ->mmap_sem in vmtruncate()
>> (but of course that's _far_ too heavy-handed to merge at all).

On Tue, May 13, 2003 at 06:16:16PM -0500, Dave McCracken wrote:
> mmap_sem is held for read across the entire fault, so by the time
> vmtruncate_list() can call zap_page_range() the page has been instantiated
> in the page table and will get removed.

That's not quite the answer, inode->i_size is.

The mmap_sem works because then ->i_size can't be sampled by
filemap_nopage() before the pagetable wiping operation starts.


-- wli

2003-05-13 23:16:07

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 16:20:38 -0700 William Lee Irwin III
<[email protected]> wrote:

> The mmap_sem works because then ->i_size can't be sampled by
> filemap_nopage() before the pagetable wiping operation starts.

So why isn't that the right way to do it? Waiting for mmap_sem guarantees
we won't catch a page fault in flight, which is the cause of the problem in
the first place.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-13 23:17:00

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tuesday, May 13, 2003 16:20:38 -0700 William Lee Irwin III <[email protected]> wrote:
>> The mmap_sem works because then ->i_size can't be sampled by
>> filemap_nopage() before the pagetable wiping operation starts.

On Tue, May 13, 2003 at 06:28:20PM -0500, Dave McCracken wrote:
> So why isn't that the right way to do it? Waiting for mmap_sem guarantees
> we won't catch a page fault in flight, which is the cause of the problem in
> the first place.

Oh, it works fine; it's just the explanation for why it's sufficient that
I was looking for.


-- wli

2003-05-14 00:56:24

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
> After some though it occurred to me there is a simple alternative scenario
> that's not protected. If a task is *already* in a page fault mapping the
> page in, then vmtruncate() could call zap_page_range() before the page
> fault completes. When the page fault does complete the page will be mapped
> into the area previously cleared by vmtruncate().

That's the one. Process is sleeping on I/O in filemap_nopage(), wakes up
after the truncate has done its thing and the page gets instantiated in
pagetables.

But it's an anon page now. So the application (which was racy anyway) gets
itself an anonymous page.

Which can still have buffers attached, which the swapout code needs to be
careful about.

2003-05-14 00:56:25

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
> Actually it does fix it. I added code in vmtruncate_list() to do a
> down_write(&vma->vm_mm->mmap_sem) around the zap_page_range(), and the
> problem went away. It serializes against any outstanding page faults on a
> particular page table. New faults will see that the page is no longer in
> the file and fail with SIGBUS. Andrew's test case stopped failing.
>
> I've attached the patch so you can see what I did.
>
> Can anyone think of any gotchas to this solution?

mmap_sem nests outside i_shared_sem.

2003-05-14 14:49:42

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 18:10:18 -0700 Andrew Morton <[email protected]>
wrote:

> That's the one. Process is sleeping on I/O in filemap_nopage(), wakes up
> after the truncate has done its thing and the page gets instantiated in
> pagetables.
>
> But it's an anon page now. So the application (which was racy anyway)
> gets itself an anonymous page.

Which the application thinks is still part of the file, and will expect its
changes to be written back. Granted, if the page fault occurred just after
the truncate it'd get SIGBUS, so it's clearly not a robust assumption, but
it will result in unexpected behavior. Note that if the application later
extends the file to include this page it could result in a corrupted file,
since all the pages around it will be written properly.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 14:51:10

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Tuesday, May 13, 2003 18:10:22 -0700 Andrew Morton <[email protected]>
wrote:

>> Can anyone think of any gotchas to this solution?
>
> mmap_sem nests outside i_shared_sem.

Rats. You're right. Time to consider alternatives.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 14:54:17

by William Lee Irwin III

Subject: Re: Race between vmtruncate and mapped areas?

On Tuesday, May 13, 2003 18:10:18 -0700 Andrew Morton <[email protected]> wrote:
>> That's the one. Process is sleeping on I/O in filemap_nopage(), wakes up
>> after the truncate has done its thing and the page gets instantiated in
>> pagetables.
>> But it's an anon page now. So the application (which was racy anyway)
>> gets itself an anonymous page.

On Wed, May 14, 2003 at 10:02:10AM -0500, Dave McCracken wrote:
> Which the application thinks is still part of the file, and will expect its
> changes to be written back. Granted, if the page fault occurred just after
> the truncate it'd get SIGBUS, so it's clearly not a robust assumption, but
> it will result in unexpected behavior. Note that if the application later
> extends the file to include this page it could result in a corrupted file,
> since all the pages around it will be written properly.

Well, for this one I'd say the app loses; it was its own failure to
synchronize truncation vs. access, at least given that the kernel
doesn't oops.


-- wli

2003-05-14 15:16:28

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Wednesday, May 14, 2003 08:06:53 -0700 William Lee Irwin III
<[email protected]> wrote:

>> Which the application thinks is still part of the file, and will expect
>> its changes to be written back. Granted, if the page fault occurred
>> just after the truncate it'd get SIGBUS, so it's clearly not a robust
>> assumption, but it will result in unexpected behavior. Note that if the
>> application later extends the file to include this page it could result
>> in a corrupted file, since all the pages around it will be written
>> properly.
>
> Well, for this one I'd say the app loses; it was its own failure to
> synchronize truncation vs. access, at least given that the kernel
> doesn't oops.

I think allowing a race condition that can randomly leave corrupted files
is a really bad idea, even if the app is doing something stupid. We know
what the race is. We should be able to prevent it.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 16:31:41

by Gerrit Huizenga

Subject: Re: Race between vmtruncate and mapped areas?

On Wed, 14 May 2003 08:06:53 PDT, William Lee Irwin III wrote:
> On Tuesday, May 13, 2003 18:10:18 -0700 Andrew Morton <[email protected]> wrote:
> >> That's the one. Process is sleeping on I/O in filemap_nopage(), wakes up
> >> after the truncate has done its thing and the page gets instantiated in
> >> pagetables.
> >> But it's an anon page now. So the application (which was racy anyway)
> >> gets itself an anonymous page.
>
> On Wed, May 14, 2003 at 10:02:10AM -0500, Dave McCracken wrote:
> > Which the application thinks is still part of the file, and will expect its
> > changes to be written back. Granted, if the page fault occurred just after
> > the truncate it'd get SIGBUS, so it's clearly not a robust assumption, but
> > it will result in unexpected behavior. Note that if the application later
> > extends the file to include this page it could result in a corrupted file,
> > since all the pages around it will be written properly.
>
> Well, for this one I'd say the app loses; it was its own failure to
> synchronize truncation vs. access, at least given that the kernel
> doesn't oops.

Hmm... I'd disagree. The operations of truncation and access should
be synchronized with respect to the mm, even if multiple threads are
performing the operations. The application *should* have synchronization
to understand which event will happen first, but the principle of least
surprise would suggest that either the access happens first,
followed by truncation (and a future access might then extend the file,
putting in zeroed pages for the range that was truncated), or the
truncation happens first, and the access (beyond the new end
of the file) behaves as a normal access beyond the end of file.

Having truly garbage data between the previous end of file and the
new end of file is just wrong.

gerrit

2003-05-14 17:20:16

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
> Which the application thinks is still part of the file, and will expect its
> changes to be written back.

How so? Truncate will chop the page off the mapping - it doesn't
miss any pages.

Truncate has to wait for the page lock, so the page may be removed from the
mapping shortly after the major fault's IO has completed. Maybe that's
what you are seeing.

2003-05-14 17:30:22

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Wednesday, May 14, 2003 10:34:21 -0700 Andrew Morton <[email protected]>
wrote:

> How so? Truncate will chop the page off the mapping - it doesn't
> miss any pages.
>
> Truncate has to wait for the page lock, so the page may be removed from
> the mapping shortly after the major fault's IO has completed. Maybe
> that's what you are seeing.

My guess on the order is this:

task 1 waits for IO in the page fault.

task 2 calls truncate, which does zap_page_range() on the range that page
is in.

task 1 wakes up and maps the page.

task 2 calls truncate_inode_pages which removes the newly mapped page from
the page cache.

Now the state is that the page has been disconnected from the file, but
it's still mapped in task 1's address space. That task thinks it has valid
data from the file in that page, and may continue to read/write there, and
assume any changes will get written back.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 17:43:01

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
> task 1 waits for IO in the page fault.
>
> task 2 calls truncate, which does zap_page_range() on the range that page
> is in.
>
> task 1 wakes up and maps the page.
>
> task 2 calls truncate_inode_pages which removes the newly mapped page from
> the page cache.
>
> Now the state is that the page has been disconnected from the file, but
> it's still mapped in task 1's address space. That task thinks it has valid
> data from the file in that page, and may continue to read/write there, and
> assume any changes will get written back.

yes. It's a very complex way of allocating anonymous memory.

I am told that Stephen, Linus and others discussed this at length at KS a
couple of years ago and the upshot was that the application is racy anyway
and there's nothing wrong with it.

Hugh calls these "Morton pages" but it wasn't me and nobody saw me do it.

It would be nice to make them go away - they cause problems.

2003-05-14 17:53:23

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Wednesday, May 14, 2003 10:57:06 -0700 Andrew Morton <[email protected]>
wrote:

> yes. It's a very complex way of allocating anonymous memory.

Yep. And randomly, at that.

> I am told that Stephen, Linus and others discussed this at length at KS a
> couple of years ago and the upshot was that the application is racy anyway
> and there's nothing wrong with it.
>
> Hugh calls these "Morton pages" but it wasn't me and nobody saw me do it.
>
> It would be nice to make them go away - they cause problems.

Definitely. We almost have the pieces necessary to detect it and/or
prevent it, but the info isn't in quite the right layer at the right time.
If it weren't for the lock order problem with mmap_sem we could have nailed
it that way. Sigh.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 18:03:43

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
> > It would be nice to make them go away - they cause problems.
>
> Definitely. We almost have the pieces necessary to detect it and/or
> prevent it, but the info isn't in quite the right layer at the right time.
> If it weren't for the lock order problem with mmap_sem we could have nailed
> it that way. Sigh.

I think it might be sufficient to re-check the page against i_size
after IO completion in filemap_nopage().

2003-05-14 18:11:49

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Wednesday, May 14, 2003 11:17:48 -0700 Andrew Morton <[email protected]>
wrote:

> I think it might be sufficient to re-check the page against i_size
> after IO completion in filemap_nopage().

It would definitely make the window a lot smaller, though it won't quite
close it. To be entirely safe we'd need to recheck after we've retaken
page_table_lock.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 18:40:59

by Andrew Morton

Subject: Re: Race between vmtruncate and mapped areas?

Dave McCracken <[email protected]> wrote:
>
>
> --On Wednesday, May 14, 2003 11:17:48 -0700 Andrew Morton <[email protected]>
> wrote:
>
> > I think it might be sufficient to re-check the page against i_size
> > after IO completion in filemap_nopage().
>
> It would definitely make the window a lot smaller, though it won't quite
> close it. To be entirely safe we'd need to recheck after we've retaken
> page_table_lock.

hmm.

One possible timing diagram is:

	pagefault:	check i_size
	pagefault:	grab page
	truncate:	drop i_size
	truncate:	shoot down pagetables
	pagefault:	install in pagetables
	truncate:	truncate file

converting i_sem to an rwsem and taking it in the pagefault would certainly
stitch it up. Unpopular, very messy.

Could "truncate file" return some code to say pages were left behind, so
truncate re-runs zap_page_range()? Sounds unpleasant.


Yes, re-checking the page against i_size from do_no_page() would fix it up.
But damn, that's another indirect call, 64-bit math, etc on _every_
file-backed pagefault.


Remind me again what problem this whole thing is currently causing?

2003-05-14 18:49:56

by Rik van Riel

Subject: Re: Race between vmtruncate and mapped areas?

On Wed, 14 May 2003, Andrew Morton wrote:

> It would be nice to make them go away - they cause problems.

Absolutely. No way to know you can't objrmap these pages
and need to convert them to pte chains, so you end up leaking
them with objrmap...

Not to mention they could end up being outside of any VMA,
meaning there's no sane way to deal with them.

2003-05-14 18:52:13

by Rik van Riel

Subject: Re: Race between vmtruncate and mapped areas?

On Wed, 14 May 2003, Rik van Riel wrote:
> On Wed, 14 May 2003, Andrew Morton wrote:
>
> > It would be nice to make them go away - they cause problems.
>
> Not to mention they could end up being outside of any VMA,
> meaning there's no sane way to deal with them.

I hate to follow up to my own email, but the fact that
they're not in any VMA could mean we leak these pages
at exit() time.

Which means a security bug, as well as the potential to
end up with bad pointers in kernel space, eg. think about
the rmap code jumping to a no longer existing mm_struct.

The more I think about it, the more I agree with Andrew
that it would be really really nice to get rid of them ;)

2003-05-14 18:56:37

by Dave McCracken

Subject: Re: Race between vmtruncate and mapped areas?


--On Wednesday, May 14, 2003 15:04:55 -0400 Rik van Riel <[email protected]>
wrote:

>> Not to mention they could end up being outside of any VMA,
>> meaning there's no sane way to deal with them.
>
> I hate to follow up to my own email, but the fact that
> they're not in any VMA could mean we leak these pages
> at exit() time.

Well, they are still inside the vma. Truncate doesn't shrink the vma. It
just generates SIGBUS when the app tries to fault the pages in.

Dave

======================================================================
Dave McCracken IBM Linux Base Kernel Team 1-512-838-3059
[email protected] T/L 678-3059

2003-05-14 18:59:22

by Rik van Riel

Subject: Re: Race between vmtruncate and mapped areas?

On Wed, 14 May 2003, Dave McCracken wrote:
> --On Wednesday, May 14, 2003 15:04:55 -0400 Rik van Riel <[email protected]>
> wrote:
>
> >> Not to mention they could end up being outside of any VMA,
> >> meaning there's no sane way to deal with them.
> >
> > I hate to follow up to my own email, but the fact that
> > they're not in any VMA could mean we leak these pages
> > at exit() time.
>
> Well, they are still inside the vma. Truncate doesn't shrink the vma. It
> just generates SIGBUS when the app tries to fault the pages in.

Right, I forgot about that. Forget the memory leak and
security bug theory, then ;)))

2003-05-15 00:36:33

by Andrea Arcangeli

Subject: Re: Race between vmtruncate and mapped areas?

Hi,

what do you think of this untested fix?

I wonder if vm_file is valid for all nopage operations; I think it
should be, and the i_mapping should always exist as well. In the worst
case it shouldn't be too difficult to take care of special cases
(just checking whether the new_page is reserved and whether the vma is
VM_SPECIAL would eliminate most issues, should there be any).

--- x/fs/inode.c.~1~ 2003-05-14 23:26:10.000000000 +0200
+++ x/fs/inode.c 2003-05-15 02:26:46.000000000 +0200
@@ -147,6 +147,8 @@ void inode_init_once(struct inode *inode
INIT_LIST_HEAD(&inode->i_data.clean_pages);
INIT_LIST_HEAD(&inode->i_data.dirty_pages);
INIT_LIST_HEAD(&inode->i_data.locked_pages);
+ inode->i_data.truncate_sequence1 = 0;
+ inode->i_data.truncate_sequence2 = 0;
INIT_LIST_HEAD(&inode->i_dentry);
INIT_LIST_HEAD(&inode->i_dirty_buffers);
INIT_LIST_HEAD(&inode->i_dirty_data_buffers);
--- x/include/linux/fs.h.~1~ 2003-05-14 23:26:19.000000000 +0200
+++ x/include/linux/fs.h 2003-05-15 02:35:57.000000000 +0200
@@ -421,6 +421,8 @@ struct address_space {
struct vm_area_struct *i_mmap; /* list of private mappings */
struct vm_area_struct *i_mmap_shared; /* list of shared mappings */
spinlock_t i_shared_lock; /* and spinlock protecting it */
+ int truncate_sequence1; /* serialize ->nopage against truncate */
+ int truncate_sequence2; /* serialize ->nopage against truncate */
int gfp_mask; /* how to allocate the pages */
};

--- x/mm/vmscan.c.~1~ 2003-05-14 23:26:12.000000000 +0200
+++ x/mm/vmscan.c 2003-05-15 00:22:57.000000000 +0200
@@ -165,11 +165,10 @@ drop_pte:
goto drop_pte;

/*
- * Anonymous buffercache pages can be left behind by
+ * Anonymous buffercache pages can't be left behind by
* concurrent truncate and pagefault.
*/
- if (page->buffers)
- goto preserve;
+ BUG_ON(page->buffers);

/*
* This is a dirty, swappable page. First of all,
--- x/mm/memory.c.~1~ 2003-05-14 23:26:19.000000000 +0200
+++ x/mm/memory.c 2003-05-15 02:37:08.000000000 +0200
@@ -1127,6 +1127,8 @@ int vmtruncate(struct inode * inode, lof
if (inode->i_size < offset)
goto do_expand;
i_size_write(inode, offset);
+ mapping->truncate_sequence1++;
+ wmb();
spin_lock(&mapping->i_shared_lock);
if (!mapping->i_mmap && !mapping->i_mmap_shared)
goto out_unlock;
@@ -1140,6 +1142,8 @@ int vmtruncate(struct inode * inode, lof
out_unlock:
spin_unlock(&mapping->i_shared_lock);
truncate_inode_pages(mapping, offset);
+ wmb();
+ mapping->truncate_sequence2++;
goto out_truncate;

do_expand:
@@ -1335,12 +1339,20 @@ static int do_no_page(struct mm_struct *
{
struct page * new_page;
pte_t entry;
+ int truncate_sequence;
+ struct file *file;
+ struct address_space *mapping;

if (!vma->vm_ops || !vma->vm_ops->nopage)
return do_anonymous_page(mm, vma, page_table, pmd, write_access, address);
spin_unlock(&mm->page_table_lock);
pte_kunmap(page_table);

+ file = vma->vm_file;
+ mapping = file->f_dentry->d_inode->i_mapping;
+ truncate_sequence = mapping->truncate_sequence2;
+ mb();
+
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

if (new_page == NULL) /* no page was available -- SIGBUS */
@@ -1366,6 +1378,22 @@ static int do_no_page(struct mm_struct *

page_table = pte_offset_atomic(pmd, address);
spin_lock(&mm->page_table_lock);
+ mb(); /* spin_lock has inclusive semantics */
+ if (unlikely(truncate_sequence != mapping->truncate_sequence1)) {
+ struct inode *inode;
+
+ spin_unlock(&mm->page_table_lock);
+
+ /*
+ * Don't uselessly loop here forever, overloading the cpu,
+ * until the truncate has completed.
+ */
+ inode = mapping->host;
+ down(&inode->i_sem);
+ up(&inode->i_sem);
+
+ goto retry;
+ }

/*
* This silly early PAGE_DIRTY setting removes a race
@@ -1388,6 +1416,7 @@ static int do_no_page(struct mm_struct *
set_pte(page_table, entry);
} else {
spin_unlock(&mm->page_table_lock);
+ retry:
pte_kunmap(page_table);
/* One of our sibling threads was faster, back out. */
page_cache_release(new_page);

Andrea

2003-05-15 02:23:44

by Rik van Riel

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, 15 May 2003, Andrea Arcangeli wrote:

> --- x/include/linux/fs.h.~1~ 2003-05-14 23:26:19.000000000 +0200
> +++ x/include/linux/fs.h 2003-05-15 02:35:57.000000000 +0200
> @@ -421,6 +421,8 @@ struct address_space {
> struct vm_area_struct *i_mmap; /* list of private mappings */
> struct vm_area_struct *i_mmap_shared; /* list of shared mappings */
> spinlock_t i_shared_lock; /* and spinlock protecting it */
> + int truncate_sequence1; /* serialize ->nopage against truncate */
> + int truncate_sequence2; /* serialize ->nopage against truncate */

How about calling them truncate_start and truncate_end ?

> --- x/mm/vmscan.c.~1~ 2003-05-14 23:26:12.000000000 +0200
> +++ x/mm/vmscan.c 2003-05-15 00:22:57.000000000 +0200
> @@ -165,11 +165,10 @@ drop_pte:
> goto drop_pte;
>
> /*
> - * Anonymous buffercache pages can be left behind by
> + * Anonymous buffercache pages can't be left behind by
> * concurrent truncate and pagefault.
> */
> - if (page->buffers)
> - goto preserve;
> + BUG_ON(page->buffers);

I wonder whether anything else can leave behind
buffers in this way.

> + mb(); /* spin_lock has inclusive semantics */
> + if (unlikely(truncate_sequence != mapping->truncate_sequence1)) {
> + struct inode *inode;

This code looks like it should work, but IMHO it is very subtle
so it should really get some documentation.


2003-05-15 08:18:32

by Andrew Morton

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

Andrea Arcangeli <[email protected]> wrote:
>
> what do you think of this untested fix?
>
> I wonder if vm_file is valid for all nopage operations; I think it
> should be, and i_mapping should always exist as well. In the worst
> case it shouldn't be too difficult to take care of special cases
> (just checking whether the new_page is reserved and whether the vma is
> VM_SPECIAL would eliminate most issues, should there be any).

yes, I think this is a good solution.

In 2.5 (at least) we can push all the sequence number work into
filemap_nopage(), and add a new vm_ops->revalidate() thing, so do_no_page()
doesn't need to know about inodes and such.

So the mm/memory.c part would look something like:

diff -puN mm/memory.c~a mm/memory.c
--- 25/mm/memory.c~a 2003-05-15 01:29:21.000000000 -0700
+++ 25-akpm/mm/memory.c 2003-05-15 01:32:02.000000000 -0700
@@ -1399,7 +1399,7 @@ do_no_page(struct mm_struct *mm, struct
pmd, write_access, address);
pte_unmap(page_table);
spin_unlock(&mm->page_table_lock);
-
+retry:
new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);

/* no page was available -- either SIGBUS or OOM */
@@ -1408,9 +1408,11 @@ do_no_page(struct mm_struct *mm, struct
if (new_page == NOPAGE_OOM)
return VM_FAULT_OOM;

- pte_chain = pte_chain_alloc(GFP_KERNEL);
- if (!pte_chain)
- goto oom;
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain)
+ goto oom;
+ }

/*
* Should we do an early C-O-W break?
@@ -1428,6 +1430,17 @@ do_no_page(struct mm_struct *mm, struct
}

spin_lock(&mm->page_table_lock);
+
+ /*
+ * comment goes here
+ */
+ if (vma->vm_ops && vma->vm_ops->revalidate &&
+ vma->vm_ops->revalidate(vma, address)) {
+ spin_unlock(&mm->page_table_lock);
+ put_page(new_page);
+ goto retry;
+ }
+
page_table = pte_offset_map(pmd, address);

/*

_

2003-05-15 08:27:53

by Andrew Morton

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

Andrew Morton <[email protected]> wrote:
>
> So the mm/memory.c part would look something like:

er, right patch, wrong concept. That's the "check i_size after taking
page_table_lock" patch.

It's actually not too bad. Yes, there's 64-bit arith involved, but it is
only a shift.

2003-05-15 08:37:35

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Wed, May 14, 2003 at 11:53:19AM -0700, Andrew Morton wrote:
> converting i_sem to an rwsem and taking it in the pagefault would certainly
> stitch it up. Unpopular, very messy.

and very slow, down_read on every page fault wouldn't scale

> Could "truncate file" return some code to say pages were left behind, so
> truncate re-runs zap_page_range()? Sounds unpleasant.
>
>
> Yes, re-checking the page against i_size from do_no_page() would fix it up.

think of two truncates, one zapping the entire file and
another restoring the previous i_size; in that case the new_page will be
wrong, as it won't be zeroed out. I mean, if we do anything about it, we
should close all races and make it 100% correct.

My fix has no scalability cost, no indirect calls, touches mostly just
hot cachelines anyways, and addresses the multiple truncate case too.


> But damn, that's another indirect call, 64-bit math, etc on _every_
> file-backed pagefault.
>
>
> Remind me again what problem this whole thing is currently causing?

the only thing I can imagine is an app trapping SIGBUS to garbage
collect the end of a file. So for example you truncate the file and you
wait for the SIGBUS to know you have to re-extend it. That would be a
legitimate use, and this is a bug; it's not like read against write,
which has no way to synchronize anyway. However, I doubt any
application is being bitten by this race.

Andrea

2003-05-15 08:42:33

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, May 15, 2003 at 01:32:45AM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > what do you think of this untested fix?
> >
> > I wonder if vm_file is valid for all nopage operations; I think it
> > should be, and i_mapping should always exist as well. In the worst
> > case it shouldn't be too difficult to take care of special cases
> > (just checking whether the new_page is reserved and whether the vma
> > is VM_SPECIAL would eliminate most issues, should there be any).
>
> yes, I think this is a good solution.
>
> In 2.5 (at least) we can push all the sequence number work into
> filemap_nopage(), and add a new vm_ops->revalidate() thing, so do_no_page()
> doesn't need to know about inodes and such.
>
> So the mm/memory.c part would look something like:

this is your i_size check idea, and it's still racy (though less racy
than before ;). You will corrupt the user address space again if you get
two truncates under you (the first truncating the page under fault, and
the second extending the file past the page under fault): since you
only call revalidate after re-taking the spinlock, you can't catch the
first truncate, which mandates that the page be zeroed out despite the
i_size check.

You could probably trap this by checking page->mapping inside the
page_table_lock, but taking lock_page before the page_table_lock in
do_no_page is way overkill compared to my read-lockless patch.

>
> diff -puN mm/memory.c~a mm/memory.c
> --- 25/mm/memory.c~a 2003-05-15 01:29:21.000000000 -0700
> +++ 25-akpm/mm/memory.c 2003-05-15 01:32:02.000000000 -0700
> @@ -1399,7 +1399,7 @@ do_no_page(struct mm_struct *mm, struct
> pmd, write_access, address);
> pte_unmap(page_table);
> spin_unlock(&mm->page_table_lock);
> -
> +retry:
> new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
>
> /* no page was available -- either SIGBUS or OOM */
> @@ -1408,9 +1408,11 @@ do_no_page(struct mm_struct *mm, struct
> if (new_page == NOPAGE_OOM)
> return VM_FAULT_OOM;
>
> - pte_chain = pte_chain_alloc(GFP_KERNEL);
> - if (!pte_chain)
> - goto oom;
> + if (!pte_chain) {
> + pte_chain = pte_chain_alloc(GFP_KERNEL);
> + if (!pte_chain)
> + goto oom;
> + }
>
> /*
> * Should we do an early C-O-W break?
> @@ -1428,6 +1430,17 @@ do_no_page(struct mm_struct *mm, struct
> }
>
> spin_lock(&mm->page_table_lock);
> +
> + /*
> + * comment goes here
> + */
> + if (vma->vm_ops && vma->vm_ops->revalidate &&
> + vma->vm_ops->revalidate(vma, address)) {
> + spin_unlock(&mm->page_table_lock);
> + put_page(new_page);
> + goto retry;
> + }
> +
> page_table = pte_offset_map(pmd, address);
>
> /*
>
> _
>


Andrea

2003-05-15 09:05:46

by Andrew Morton

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

Andrea Arcangeli <[email protected]> wrote:
>
> and it's still racy

damn, and it just booted ;)

I'm just a little bit concerned over the ever-expanding inode. Do you
think the dual sequence numbers can be replaced by a single generation
counter?

I do think that we should push the revalidate operation over into the vm_ops.
That'll require an extra arg to ->nopage, but it has a spare one anyway (!).


2003-05-15 09:27:56

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > and it's still racy
>
> damn, and it just booted ;)
>
> I'm just a little bit concerned over the ever-expanding inode. Do you
> think the dual sequence numbers can be replaced by a single generation
> counter?

yes, I wrote it as a single counter first, but it was unreadable and it
had more branches, so I added the other sequence number to make it
cleaner. I don't mind another 4 bytes; that cacheline should be hot
anyway.

> I do think that we should push the revalidate operation over into the vm_ops.
> That'll require an extra arg to ->nopage, but it has a spare one anyway (!).

not sure why you need a callback; the low level, if needed, can
serialize using the same locking in the address space that vmtruncate
uses. I would wait for a real use case before adding a callback.

Andrea

2003-05-15 09:41:28

by Andrew Morton

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

Andrea Arcangeli <[email protected]> wrote:
>
> > > - if (page->buffers)
> > > - goto preserve;
> > > + BUG_ON(page->buffers);
> >
> > I wonder if there is nothing else that can leave behind
> > buffers in this way.
>
> that's why I left the BUG_ON: if there's anything else, I want to know.
> There shouldn't be anything else, as the comment also suggests. I recall
> when we discussed this single check with Andrew, and that was the only
> reason we left it, AFAIK.

yes, the test should no longer be needed.

2003-05-15 09:34:09

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Wed, May 14, 2003 at 10:36:23PM -0400, Rik van Riel wrote:
> On Thu, 15 May 2003, Andrea Arcangeli wrote:
>
> > --- x/include/linux/fs.h.~1~ 2003-05-14 23:26:19.000000000 +0200
> > +++ x/include/linux/fs.h 2003-05-15 02:35:57.000000000 +0200
> > @@ -421,6 +421,8 @@ struct address_space {
> > struct vm_area_struct *i_mmap; /* list of private mappings */
> > struct vm_area_struct *i_mmap_shared; /* list of shared mappings */
> > spinlock_t i_shared_lock; /* and spinlock protecting it */
> > + int truncate_sequence1; /* serialize ->nopage against truncate */
> > + int truncate_sequence2; /* serialize ->nopage against truncate */
>
> How about calling them truncate_start and truncate_end ?

Normally we use start/end for ranges, this is not a range, so I wouldn't
suggest it, but I don't care about names.

> > --- x/mm/vmscan.c.~1~ 2003-05-14 23:26:12.000000000 +0200
> > +++ x/mm/vmscan.c 2003-05-15 00:22:57.000000000 +0200
> > @@ -165,11 +165,10 @@ drop_pte:
> > goto drop_pte;
> >
> > /*
> > - * Anonymous buffercache pages can be left behind by
> > + * Anonymous buffercache pages can't be left behind by
> > * concurrent truncate and pagefault.
> > */
> > - if (page->buffers)
> > - goto preserve;
> > + BUG_ON(page->buffers);
>
> I wonder whether anything else can leave behind
> buffers in this way.

that's why I left the BUG_ON: if there's anything else, I want to know.
There shouldn't be anything else, as the comment also suggests. I recall
when we discussed this single check with Andrew, and that was the only
reason we left it, AFAIK.

> > + mb(); /* spin_lock has inclusive semantics */
> > + if (unlikely(truncate_sequence != mapping->truncate_sequence1)) {
> > + struct inode *inode;
>
> This code looks like it should work, but IMHO it is very subtle
> so it should really get some documentation.

Andrea

2003-05-15 09:44:38

by Andrew Morton

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

Andrea Arcangeli <[email protected]> wrote:
>
> > I do think that we should push the revalidate operation over into the vm_ops.
> > That'll require an extra arg to ->nopage, but it has a spare one anyway (!).
>
> not sure why you need a callback, the lowlevel if needed can serialize
> using the same locking in the address space that vmtruncate uses. I
> would wait a real case need before adding a callback.

Just a vague feeling that knowing about vm_file in do_no_page() is
a layering violation. I guess we can live with it.

2003-05-15 16:25:52

by Daniel McNeil

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, 2003-05-15 at 02:40, Andrea Arcangeli wrote:
> On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > and it's still racy
> >
> > damn, and it just booted ;)
> >
> > I'm just a little bit concerned over the ever-expanding inode. Do you
> > think the dual sequence numbers can be replaced by a single generation
> > counter?
>
> yes, I wrote it as a single counter first, but was unreadable and it had
> more branches, so I added the other sequence number to make it cleaner.
> I don't mind another 4 bytes, that cacheline should be hot anyways.

You could use the seqlock.h sequence locking. It only uses 1 sequence
counter. The 2.5 isize patch 1 has a sequence lock without the spinlock
so it only uses 4 bytes and it is somewhat more readable. I don't
think it has more branches.

I've attached the isize seqlock.h patch.
--
Daniel McNeil <[email protected]>


Attachments:
patch-2.5.69-isize.1 (1.57 kB)

2003-05-15 19:06:37

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, May 15, 2003 at 09:38:26AM -0700, Daniel McNeil wrote:
> On Thu, 2003-05-15 at 02:40, Andrea Arcangeli wrote:
> > On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > and it's still racy
> > >
> > > damn, and it just booted ;)
> > >
> > > I'm just a little bit concerned over the ever-expanding inode. Do you
> > > think the dual sequence numbers can be replaced by a single generation
> > > counter?
> >
> > yes, I wrote it as a single counter first, but was unreadable and it had
> > more branches, so I added the other sequence number to make it cleaner.
> > I don't mind another 4 bytes, that cacheline should be hot anyways.
>
> You could use the seqlock.h sequence locking. It only uses 1 sequence
> counter. The 2.5 isize patch 1 has a sequence lock without the spinlock
> so it only uses 4 bytes and it is somewhat more readable. I don't
> think it has more branches.
>
> I've attached the isize seqlock.h patch.

what do you think of rmb vs mb on the reader side? Can I use rmb
too? I used mb() to be safe. I mean, gettimeofday is a no-brainer since
it does only reads inside the critical section anyway, but here I feel
I need mb().


And yes, there are no extra branches, sorry; just an additional OR.

Andrea

2003-05-15 21:51:36

by Daniel McNeil

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, 2003-05-15 at 12:19, Andrea Arcangeli wrote:
> On Thu, May 15, 2003 at 09:38:26AM -0700, Daniel McNeil wrote:
> > On Thu, 2003-05-15 at 02:40, Andrea Arcangeli wrote:
> > > On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > > > Andrea Arcangeli <[email protected]> wrote:
> > > > >
> > > > > and it's still racy
> > > >
> > > > damn, and it just booted ;)
> > > >
> > > > I'm just a little bit concerned over the ever-expanding inode. Do you
> > > > think the dual sequence numbers can be replaced by a single generation
> > > > counter?
> > >
> > > yes, I wrote it as a single counter first, but was unreadable and it had
> > > more branches, so I added the other sequence number to make it cleaner.
> > > I don't mind another 4 bytes, that cacheline should be hot anyways.
> >
> > You could use the seqlock.h sequence locking. It only uses 1 sequence
> > counter. The 2.5 isize patch 1 has a sequence lock without the spinlock
> > so it only uses 4 bytes and it is somewhat more readable. I don't
> > think it has more branches.
> >
> > I've attached the isize seqlock.h patch.
>
> what do you think of the rmb vs mb in the reader side? Can I use rmb
> too? I used mb() to go safe. I mean gettimeofday is a no brainer since
> it does only reads inside the critical section anyways. But here I feel
> I need mb().
>
>
> And yes, there are no more branches sorry, just an additional or.
>
> Andrea

rmb() is safe on the read side. In fact, the mb()s only need to be
smp_rmb()s and the wmb()s only need to be smp_wmb()s.

Also, the mb() after the spin_lock(&mm->page_table_lock);
is not even needed since spin_lock() acts as a memory barrier.

Why do you feel you need the mb()?
Isn't everything the reader might do protected already?
You're just using the sequence value to know whether you should
clean up and retry.

Also, I like using the seqlock.h approach since it gives consistent
use of sequence locks, is a bit more readable, and documents and hides
all the memory barrier stuff.

Is there any possibility that the truncate side can run faster than
the reader side?
--
Daniel McNeil <[email protected]>

2003-05-15 23:04:30

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, May 15, 2003 at 03:04:10PM -0700, Daniel McNeil wrote:
> On Thu, 2003-05-15 at 12:19, Andrea Arcangeli wrote:
> > On Thu, May 15, 2003 at 09:38:26AM -0700, Daniel McNeil wrote:
> > > On Thu, 2003-05-15 at 02:40, Andrea Arcangeli wrote:
> > > > On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > > > > Andrea Arcangeli <[email protected]> wrote:
> > > > > >
> > > > > > and it's still racy
> > > > >
> > > > > damn, and it just booted ;)
> > > > >
> > > > > I'm just a little bit concerned over the ever-expanding inode. Do you
> > > > > think the dual sequence numbers can be replaced by a single generation
> > > > > counter?
> > > >
> > > > yes, I wrote it as a single counter first, but was unreadable and it had
> > > > more branches, so I added the other sequence number to make it cleaner.
> > > > I don't mind another 4 bytes, that cacheline should be hot anyways.
> > >
> > > You could use the seqlock.h sequence locking. It only uses 1 sequence
> > > counter. The 2.5 isize patch 1 has a sequence lock without the spinlock
> > > so it only uses 4 bytes and it is somewhat more readable. I don't
> > > think it has more branches.
> > >
> > > I've attached the isize seqlock.h patch.
> >
> > what do you think of the rmb vs mb in the reader side? Can I use rmb
> > too? I used mb() to go safe. I mean gettimeofday is a no brainer since
> > it does only reads inside the critical section anyways. But here I feel
> > I need mb().
> >
> >
> > And yes, there are no more branches sorry, just an additional or.
> >
> > Andrea
>
> rmb() is safe on the read side. In fact, The mb()s only need to be
> smp_rmb()s and the wmb()s only need to be smp_wmb()s.

they all can be smp_ things indeed

> Also, the mb() after the spin_lock(&mm->page_table_lock);
> is not even needed since spin_lock() acts as a memory barrier.

no, the spin_lock only acts as a barrier in one way, not both ways, so
an smp_something is still needed.

But yes, I think it can be a smp_rmb, so I can use the seqlock code.

> Why do you feel you need the mb()?

just because there are both reads and writes in the critical section,
but what matters is what we read, not what we wrote. We must make sure
that we have read a real pagecache page outside the page_table_lock;
nothing else matters, and it doesn't matter when our writes are
executed (the code is just robust enough not to oops), so s/mb/smp_rmb/
should do fine here and we can use the seqlock framework.

> Isn't everything the reader might do protected already?
> You just using the sequence value to know whether you should
> cleanup and retry.
>
> Also, I like using the seqlock.h approach since it gives consistent
> use of sequence locks, is a bit more readable, and documents and hides
> all the memory barrier stuff.

Well, that was a quick patch; I was more focused on the design than the
implementation. So yes, now that we've cleared up the mb thing we can
convert it to the seqlock. As said, my tree is still using the old names
and implementation of the seqlock; I don't think it's very important to
upgrade it, but maybe I should. Comments? (I have a few patches that can
make a difference in my queue, so it will probably happen later, if at
all.)

>
> Is there any possibility that the truncate side can run faster than
> the reader side?

speed doesn't matter.

thanks for the help!

Andrea

2003-05-17 00:14:48

by Daniel McNeil

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Thu, 2003-05-15 at 16:17, Andrea Arcangeli wrote:

> no, the spin_lock only acts as a barrier in one way, not both ways, so
> an smp_something is still needed.
>

Can you explain this more? On x86, isn't a spin_lock a "lock; dec"
instruction and rmb() a "lock; addl"? I thought x86 instructions
with a lock prefix provided a memory barrier.

Just curious,

--
Daniel McNeil <[email protected]>

2003-05-17 17:16:54

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Fri, May 16, 2003 at 05:27:25PM -0700, Daniel McNeil wrote:
> On Thu, 2003-05-15 at 16:17, Andrea Arcangeli wrote:
>
> > no, the spin_lock only acts as a barrier in one way, not both ways, so
> > an smp_something is still needed.
> >
>
> Can you explain this more? On a x86, isn't a spin_lock a lock; dec
> instruction and the rmb() a lock; addl. I thought x86 instructions
> with lock prefix provided a memory barrier.

spin_lock is, yes, on x86, but spin_unlock isn't a two-way barrier even
on x86. Other archs like ia64 have special ways to declare the
direction of the barrier, so they can do it for spin_lock too, not only
for spin_unlock as on x86.

In short, stuff outside the critical section can always leak inside
unless a full mb() is executed either before or after the
spin_lock/spin_unlock. Never rely on a single spin_lock/spin_unlock to
be an mb().

In short:

spin_lock
spin_unlock

is a mb().

But:

spin_unlock
spin_lock

is not an mb() (and the above is what we have in our code, which is why
I had to add the mb(); of course, we just agreed it can be reduced to
smp_rmb()).

Andrea

2003-05-17 18:08:33

by Paul McKenney

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

> On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > and it's still racy
> >
> > damn, and it just booted ;)
> >
> > I'm just a little bit concerned over the ever-expanding inode. Do you
> > think the dual sequence numbers can be replaced by a single generation
> > counter?
>
> yes, I wrote it as a single counter first, but was unreadable and it had
> more branches, so I added the other sequence number to make it cleaner.
> I don't mind another 4 bytes, that cacheline should be hot anyways.
>
> > I do think that we should push the revalidate operation over into the vm_ops.
> > That'll require an extra arg to ->nopage, but it has a spare one anyway (!).
>
> not sure why you need a callback, the lowlevel if needed can serialize
> using the same locking in the address space that vmtruncate uses. I
> would wait a real case need before adding a callback.

FYI, we verified that the revalidate callback could also do the same
job that the proposed nopagedone callback does -- permitting filesystems
that provide their own vm_operations_struct to avoid the race between
page faults and invalidating a page from a mapped file.

Thanx, Paul

2003-05-17 18:30:15

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

On Sat, May 17, 2003 at 11:19:39AM -0700, Paul McKenney wrote:
>
> > On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > and it's still racy
> > >
> > > damn, and it just booted ;)
> > >
> > > I'm just a little bit concerned over the ever-expanding inode. Do you
> > > think the dual sequence numbers can be replaced by a single generation
> > > counter?
> >
> > yes, I wrote it as a single counter first, but was unreadable and it had
> > more branches, so I added the other sequence number to make it cleaner.
> > I don't mind another 4 bytes, that cacheline should be hot anyways.
> >
> > > I do think that we should push the revalidate operation over into the vm_ops.
> > > That'll require an extra arg to ->nopage, but it has a spare one anyway (!).
> >
> > not sure why you need a callback, the lowlevel if needed can serialize
> > using the same locking in the address space that vmtruncate uses. I
> > would wait a real case need before adding a callback.
>
> FYI, we verified that the revalidate callback could also do the same
> job that the proposed nopagedone callback does -- permitting filesystems
> > that provide their own vm_operations_struct to avoid the race between
> page faults and invalidating a page from a mapped file.

don't you need two callbacks to avoid the race? (Really I mean calling
a callback two times; the two can also be the same callback.)

Andrea

2003-05-19 18:25:24

by Paul McKenney

[permalink] [raw]
Subject: Re: Race between vmtruncate and mapped areas?

> On Sat, May 17, 2003 at 11:19:39AM -0700, Paul McKenney wrote:
> > > On Thu, May 15, 2003 at 02:20:00AM -0700, Andrew Morton wrote:
> > > not sure why you need a callback, the lowlevel if needed can serialize
> > > using the same locking in the address space that vmtruncate uses. I
> > > would wait a real case need before adding a callback.
> >
> > FYI, we verified that the revalidate callback could also do the same
> > job that the proposed nopagedone callback does -- permitting filesystems
> > that provide their own vm_operations_struct to avoid the race between
> > page faults and invalidating a page from a mapped file.
>
> don't you need two callbacks to avoid the race? (really I mean, to call
> two times a callback, the callback can be also the same)

I do not believe so -- though we could well be talking about
different race conditions. The one that I am worried about
is where a page fault against an mmap of a file in a distributed
filesystem races against an invalidation request. The thought is
that the DFS takes one of its locks in the nopage callback,
and then releases it in the revalidate callback. The
invalidation request would use the same DFS lock, and would
therefore not be able to run between nopage and revalidate.
It would call something like invalidate_mmap_range(), which
in turn calls zap_page_range(), which acquires the
mm->page_table_lock. Since do_no_page() does not release
mm->page_table_lock until after it fills in the PTE, I believe
things are covered.

So, is there another race that I am missing here? ;-)

Thanx, Paul