LinuxLists.cc - page->mapping == 0

2000-10-29 10:08:16

Subject: page->mapping == 0

Linus,

I have tracked down why I am getting the oops with page->mapping == 0
in block_read_full_page. Well, at least if not the ultimate cause,
then the penultimate cause. It's basically yet another truncate bug
(tm). I have some fairly detailed traces which I can send you if you
like, but the executive summary is that a page can get truncated out
from under us while we are sleeping in lock_page. The scenario I
observed is this:

Process A goes to read some data from a file. It finds or makes a
page cache page and starts the read into the page.

Process B goes to read the same data from the same file. It gets down
into do_generic_file_read, finds the existing page cache page,
increments its count, sees that it is not up-to-date, and calls
lock_page, which calls ___wait_on_page.

At some later time the read completes, the page becomes up-to-date and
process B is woken. But by this time process C is running. It is
also trying to read the same data from the same file. It completes
the read without blocking and then does a truncate on the file (in
preparation for rewriting it; the canonical example is when several
bashes exit at the same time, and they all go to read and rewrite
the ~/.bash_history file).

So process C calls vmtruncate() which calls truncate_inode_pages()
which calls truncate_complete_page() on all of the complete pages of
the file. That in turn calls remove_inode_page() which calls
__remove_inode_page() which sets page->mapping to NULL. (This code
doesn't seem to check page->count at all.)

Then process B actually gets to run. It returns from ___wait_on_page
to lock_page to do_generic_file_read. It now has the page locked but
it is no longer really the page it wants. It has page->mapping set to
NULL and it isn't on the page list for the mapping any more. The code
in do_generic_file_read doesn't check this but just charges on to call
mapping->a_ops->readpage.

I am not any kind of expert on the page cache but it seems to me that
maybe after locking the page we should check if it is still the page
we want (i.e. page->mapping and page->index are still correct), and go
back and look up the page again if not. Presumably there will be
quite a few places besides do_generic_file_read where that check would
be needed also.

Regards,
Paul.

--
Paul Mackerras, Senior Open Source Researcher, Linuxcare, Inc.
+61 2 6262 8990 tel, +61 2 6262 8991 fax
[email protected], http://www.linuxcare.com.au/
Linuxcare. Support for the revolution.

2000-10-29 12:40:22

by Juri Haberland

[permalink] [raw]

Subject: Re: page->mapping == 0

Paul Mackerras wrote:
> I am not any kind of expert on the page cache but it seems to me that
> maybe after locking the page we should check if it is still the page
> we want (i.e. page->mapping and page->index are still correct), and go
> back and look up the page again if not. Presumably there will be
> quite a few places besides do_generic_file_read where that check would
> be needed also.

I'm getting at least some kind of clue about the page cache, though I
don't claim to be an expert. You've pointed out how the operation of
becoming a reader on a page can fail because the page disappears while
you're waiting. That's an error, right? It's the result of a user
space race, which in turn is the result of parallel processes doing
nonatomic file operations with no synchronization. It sounds like
bailing out of the file_read if the page turns out to be unmapped after
the lock is the right thing to do, at least for a quick fix.

http://lxr.linux.no/source/mm/filemap.c?v=2.4.0-test9#L1029

There actually aren't that many places where lock_page is called:

http://lxr.linux.no/ident?v=2.4.0-test9;i=lock_page

Maybe lock_page needs to return an error code.

--
Daniel

2000-10-29 18:06:02

by Linus Torvalds

[permalink] [raw]

Subject: Re: page->mapping == 0

On Sun, 29 Oct 2000, Paul Mackerras wrote:
>
> I have tracked down why I am getting the oops with page->mapping == 0
> in block_read_full_page. Well, at least if not the ultimate cause,
> then the penultimate cause. It's basically yet another truncate bug
> (tm). I have some fairly detailed traces which I can send you if you
> like, but the executive summary is that a page can get truncated out
> from under us while we are sleeping in lock_page. The scenario I
> observed is this:

Sorry, maybe our discussions with Al Viro weren't public, but yes, we did
come to the same conclusion a few days ago.

The only problem right now is deciding how to fix it - because as you
rightly point out, this actually happens in a number of places, and as
such it's not trivial to fix it cleanly. We may have to use a brute-force
approach for 2.4.x where we just re-test against the file size in multiple
places where this can happen - but it would be nicer to handle this more
cleanly.

There are a few details that still elude me, the main one being the big
question about why this started happening to so many people just recently.
The bug is not new, and yet it's become obvious only in the last few
weeks.

I _hope_ that the reason for this is that more people are testing out
2.4.x, and that we've fixed most of the other issues that kept people back
from using it. It may be as simple as that. But I wonder if there is
something else going on too, like the read-ahead being buggy, and that
causing the bug to just happen a lot more (if read-ahead is buggy and
tries to read ahead past the end of the file, a truncate() will be much
more likely to hit these pages past the end, for example. Because most
cases of "truncate" are actually about truncating _upwards_, not
downwards).

So the two worries I have now is (a) the details on how to fix this thing
(minor worry, mainly one of aesthetics) and (b) whether there is something
else going on that brings it out of the woodworks..

Linus

2000-10-29 18:33:05

by Alexander Viro

[permalink] [raw]

Subject: Re: page->mapping == 0

On Sun, 29 Oct 2000, Linus Torvalds wrote:

> approach for 2.4.x where we just re-test against the file size in multiple
> places where this can happen - but it would be nicer to handle this more
> cleanly.

One possible way is to access page->mapping _only_ under the page lock
and in cases when we call ->i_mapping->a_ops->foo check the ->mapping
before the method call.

> There are a few details that still elude me, the main one being the big
> question about why this started happening to so many people just recently.
> The bug is not new, and yet it's become obvious only in the last few
> weeks.

And original truncate() problems went undetected since...? Pattern looks
very similar. Petr posted bug reports long time ago - back then they
were ignored since they involved vmware. In both cases we had persistent
problems caught by few people who were using not-too-common combinations
(innd on 2.4 boxen, for example) and at some point the shit had hit the
fan big way. No amount of testing can prove that there is no bugs and all
such...

I would expect problems with truncate, mmap, rename, POSIX locks, fasync,
ptrace and mount go unnoticed for _long_. Ditto for parts of procfs
proper that are not exercised by ps and friends. Sigh...

Maybe something along the lines of BTS might be useful, hell knows.

2000-10-29 19:12:37

by Linus Torvalds

[permalink] [raw]

Subject: Re: page->mapping == 0

On Sun, 29 Oct 2000, Alexander Viro wrote:
>
> One possible way is to access page->mapping _only_ under the page lock
> and in cases when we call ->i_mapping->a_ops->foo check the ->mapping
> before the method call.

I'm leaning towards this for a 2.4.x solution.

As far as I can tell, page->mapping is _already_ only accessed and
modified with the page lock held. It's just that we don't test it for
NULL, in case an earlier lock holder decided to clear it.

(No, I didn't look through all the users, but at least conceptually it
_should_ be true that we only look at "mapping" with the lock held: it's
mainly used for pagein, and pageout, buth with the lock held for other
reasons already. Certainly all the places where we have had bug-reports
have been of this type).

Making it policy that we have to re-test page->mapping after aquireing the
page lock might be the simplest fix for 2.4.x. It still means that we
might end up allowing people to have a "bad" page in the VM space due to
the "test->insert" race condition, but it woul dmake that event pretty
much a harmless bug (and thus move it to the "beauty wart - to be fixed
later" category).

And the places where we get the page lock and use page->mapping are not
that many, I think.

(And notice how we actually _have_ this approach already in
do_buffer_fdatasync(), for similar reasons - we use the "re-test the
page->buffers" thing there. Of course, there we do it because the clearing
of page->buffers is easier to see, and can happen as a result of memory
pressure, and not just truncate()).

So, for example, in __find_lock_page() we should re-test the mapping after
we aquired the page lock. Which is fairly easy, just add something like

/* Race: did the mapping go away before we got the page lock? */
if (page && page->mapping != mapping) {
page_cache_release(page);
goto repeat;
}

to the end of __find_lock_page(). Add similar logic to
do_generic_file_read(), filemap_nopage() and filemap_sync_pte() and
read_cache_page(), and you're pretty much done.

Linus

2000-10-29 19:34:17

by Linus Torvalds

[permalink] [raw]

Subject: Re: page->mapping == 0

On Sun, 29 Oct 2000, Linus Torvalds wrote:
>
> Making it policy that we have to re-test page->mapping after aquireing the
> page lock might be the simplest fix for 2.4.x. It still means that we
> might end up allowing people to have a "bad" page in the VM space due to
> the "test->insert" race condition, but it woul dmake that event pretty
> much a harmless bug (and thus move it to the "beauty wart - to be fixed
> later" category).

I'd like to just re-iterate this point: re-testing "page->mapping" fixes
the oops, but is not a full fix for the conceptual problem.

The problem with just re-testing "page->mapping" is that you still have a
nasty potential race where you insert a (bogus) page into the VM space of
a process instead of giving a SIGBUS/SIGSEGV.

Now, I don't think this is really a valid usage pattern, so it's most
likely to be a result of a buggy application, but I can imagine having
some strange kind of user-space VM memory management scheme that depends
on SIGBUS to maintain a file length. I've never heard of such a thing, but
I could imagine somebody doing some kind of persistent data object store
in user space this way.

Does anybody actually know of an application that does something like
this? Because I'm more and more inclined to just going with the half-fix
for now. It would at least guarantee internal kernel data consistency (and
no oopses, of course).

Don't get me wrong - we need to clean this part up, but as far as I can
tell we have never done this "right", so in that sense it's not a new
2.4.x bug and it can't break existing applications.

[ In fact, with the current ordering inside "vmtruncate()" of doing the
"truncate_inode_pages()" thing before doing the "vmtruncate_list()", I
have this suspicion that the race might even be impossible to trigger.
Even when the race "happens" in kernel space, we will end up unmapping
the page immediately afterwards, and the only effect as far as the user
is concerned is the disappearance of the SIGBUS.

And the "disappearing SIGBUS" is actually explainable with a
_user_level_ race: in order to get the kernel race at all, user level
itself must have been inherently racing on the truncate/access, and
depending on which one happened "first" you'd have lost the SIGBUS and
the data you wrote anyway.

So it may actually be that we can honestly claim that the half-fix is
actually a proper fix. I would have to look a lot closer at the issue to
be able to guarantee this, though. Comments? Anybody? ]

Al?

Linus

2000-10-29 20:32:12

by Alan

[permalink] [raw]

Subject: Re: page->mapping == 0

> I would expect problems with truncate, mmap, rename, POSIX locks, fasync,
> ptrace and mount go unnoticed for _long_. Ditto for parts of procfs

Well the ptrace one still has mysteriously breaks usermode linux against it
on my list here. Was that ever explained. It looked like the stack got corrupted
which is weird.

2000-10-29 20:43:06

by Alexander Viro

[permalink] [raw]

Subject: Re: page->mapping == 0

On Sun, 29 Oct 2000, Alan Cox wrote:

> > I would expect problems with truncate, mmap, rename, POSIX locks, fasync,
> > ptrace and mount go unnoticed for _long_. Ditto for parts of procfs
>
> Well the ptrace one still has mysteriously breaks usermode linux against it
> on my list here. Was that ever explained. It looked like the stack got corrupted
> which is weird.

Alan, is it me or usermode port really tends to catch the stack overflows?
<thinks> How about we allocate the pages by 4, not by 2 as we do now, and
explicitly unmap the pages below the task_struct? That would still leave the
chance of stack overflow going unnoticed, but the window would be limited.

Another possibility for ptrace-related screwups: past-the-EOF check in
filemap_nopage(). I'm still not convinced that it's right. It may be, but...

2000-10-29 22:05:19

by Jeff Dike

[permalink] [raw]

Subject: Re: page->mapping == 0

[email protected] said:
> Well the ptrace one still has mysteriously breaks usermode linux
> against it on my list here. Was that ever explained. It looked like
> the stack got corrupted which is weird.

If this is the test8 "turn off ORIG_EAX if EIP is changed" fix, it apparently
also broke a couple of other ptrace-intensive things which aren't coming to
mind (Andi Kleen noticed them).

AFAIK, it was never explained. Something would change its victim's EIP and
the victim would segfault as soon as it was continued.

Jeff

2000-10-29 22:30:55

by Jeff Merkey

[permalink] [raw]

Subject: Re: page->mapping == 0

Alan Cox wrote:
>
> > I would expect problems with truncate, mmap, rename, POSIX locks, fasync,
> > ptrace and mount go unnoticed for _long_. Ditto for parts of procfs
>
> Well the ptrace one still has mysteriously breaks usermode linux against it
> on my list here. Was that ever explained. It looked like the stack got corrupted
> which is weird.

Do any of you guys have an SMP inverse assembler handy? This bug is
scary as shit, and someone needs to setup an SMP ICE and actually trace
through the code (and set address breakpoints) to see just what is going
on. It really is sounding like a race condition of some kind, and I
read everyone's "speculation" about what they think is happneing, but
how about getting some hardware tools, and tracking is down. It feels
like an MMU bug of some type. Who has access to Yellow Cover Intel
Errata sheets? It may be a hardware bug intel knows about but we don't
-- their MMU is so bug infested how do we know it's not one of these.

I don't think you guys have a good handle on it yet. Linus seems to
think it's a coding problem with a race somewhere. How about someone
getting some hardware tools and nailing the puppy to the floor so 2.4
can get out the door...

:-)

Jeff

>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> Please read the FAQ at http://www.tux.org/lkml/

2000-10-30 01:57:45

by Juri Haberland

[permalink] [raw]

Subject: Re: page->mapping == 0

Linus Torvalds wrote:
> Now, I don't think this is really a valid usage pattern, so it's most
> likely to be a result of a buggy application, but I can imagine having
> some strange kind of user-space VM memory management scheme that depends
> on SIGBUS to maintain a file length. I've never heard of such a thing, but
> I could imagine somebody doing some kind of persistent data object store
> in user space this way.
>
> Does anybody actually know of an application that does something like
> this? Because I'm more and more inclined to just going with the half-fix
> for now. It would at least guarantee internal kernel data consistency (and
> no oopses, of course).
>
> Don't get me wrong - we need to clean this part up, but as far as I can
> tell we have never done this "right", so in that sense it's not a new
> 2.4.x bug and it can't break existing applications.

Strace shows that the history really is updated with no locking:

4030 open("/home/phillips/.bash_history", O_WRONLY|O_TRUNC) = 4
4030 write(4, "l\nless README\ne ../*test8/fs/ext"..., 7546) = 7546
4030 close(4) =
0

There's no reason to continue doing the read after you've detected the
page was pulled out from under you - clearly it's a user error state,
the only question is: return the error and break the buggy application?
Or gloss over it. And it seems like returning the error might actually
break bash - maybe it deserves it... but...

--
Daniel