zap_pte_range should not mark non-uptodate pages dirty
Doing O_DIRECT writes to an mmapped file caused pages in the page cache to
be marked dirty but not uptodate. This led to a bug in mpage_writepage.
Signed-off-by: Dave Kleikamp <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
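[Illustrative sketch, not part of the original report: a minimal
user-space program exercising the access pattern in question. The file
name and exact sequence are assumptions; the ingredients are a
MAP_SHARED dirtying of a page followed by an aligned O_DIRECT write to
the same range, after which the invalidate can leave a dirty pte behind
a page that is no longer uptodate.

	#define _GNU_SOURCE	/* for O_DIRECT */
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		long pagesize = sysconf(_SC_PAGESIZE);
		int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
		void *dbuf;
		char *map;

		if (fd < 0 || ftruncate(fd, pagesize))
			return 1;

		/* dirty the page (and its pte) through a shared mapping */
		map = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;
		map[0] = 'x';

		/* aligned O_DIRECT write over the same page: the kernel
		 * invalidates the pagecache copy, clearing PG_uptodate
		 * while the dirty pte still exists */
		if (posix_memalign(&dbuf, pagesize, pagesize))
			return 1;
		memset(dbuf, 'y', pagesize);
		if (pwrite(fd, dbuf, pagesize, 0) != pagesize)
			return 1;

		/* teardown runs zap_pte_range() against the dirty pte */
		munmap(map, pagesize);
		close(fd);
		return 0;
	}
]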
diff -urp linux-2.6.9/mm/memory.c linux/mm/memory.c
--- linux-2.6.9/mm/memory.c	2004-10-21 10:49:26.598031488 -0500
+++ linux/mm/memory.c	2004-10-21 16:01:44.902376232 -0500
@@ -414,7 +414,15 @@ static void zap_pte_range(struct mmu_gat
 				    && linear_page_index(details->nonlinear_vma,
 						address+offset) != page->index)
 					set_pte(ptep, pgoff_to_pte(page->index));
-			if (pte_dirty(pte))
+			/*
+			 * PG_uptodate can be cleared by
+			 * invalidate_inode_pages2, so we must not try to write
+			 * not uptodate pages. Otherwise we risk invalidating
+			 * underlying O_DIRECT writes, and secondly because
+			 * pdflush would BUG(). Coherency of mmaps against
+			 * O_DIRECT still cannot be guaranteed though.
+			 */
+			if (pte_dirty(pte) && PageUptodate(page))
 				set_page_dirty(page);
 			if (pte_young(pte) && !PageAnon(page))
 				mark_page_accessed(page);
Dave Kleikamp <[email protected]> wrote:
>
> Doing O_DIRECT writes to an mmapped file caused pages in the page cache to
> be marked dirty but not uptodate. This led to a bug in mpage_writepage.
>
Methinks this merely reduces the probability of the BUG.
> diff -urp linux-2.6.9/mm/memory.c linux/mm/memory.c
> --- linux-2.6.9/mm/memory.c	2004-10-21 10:49:26.598031488 -0500
> +++ linux/mm/memory.c	2004-10-21 16:01:44.902376232 -0500
> @@ -414,7 +414,15 @@ static void zap_pte_range(struct mmu_gat
>  				    && linear_page_index(details->nonlinear_vma,
>  						address+offset) != page->index)
>  					set_pte(ptep, pgoff_to_pte(page->index));
> -			if (pte_dirty(pte))
> +			/*
> +			 * PG_uptodate can be cleared by
> +			 * invalidate_inode_pages2, so we must not try to write
> +			 * not uptodate pages. Otherwise we risk invalidating
> +			 * underlying O_DIRECT writes, and secondly because
> +			 * pdflush would BUG(). Coherency of mmaps against
> +			 * O_DIRECT still cannot be guaranteed though.
> +			 */
> +			if (pte_dirty(pte) && PageUptodate(page))

invalidate_inode_pages2 runs ClearPageUptodate() here

>  				set_page_dirty(page);
>  			if (pte_young(pte) && !PageAnon(page))
>  				mark_page_accessed(page);
Maybe we should revisit invalidate_inode_pages2(). It used to be an
invariant that "pages which are mapped into process address space are
always uptodate". We broke that (good) invariant and we're now seeing
some fallout. There may be more.
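[Illustrative sketch, not code from the thread: the invariant described
above, written as a one-line assertion, would be

	BUG_ON(page_mapped(page) && !PageUptodate(page));

i.e. any page visible through user pagetables must be uptodate.]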
On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> Maybe we should revisit invalidate_inode_pages2(). It used to be an
> invariant that "pages which are mapped into process address space are
> always uptodate". We broke that (good) invariant and we're now seeing
> some fallout. There may be more.
Such an invariant hasn't existed since 2.4.10. There's no way to make mmaps
reload data from disk without breaking such an invariant. It's not even
about the write side; it's buffered read against O_DIRECT write that
requires breaking such an invariant.
Either you fix it the above way, or you remove the BUG() in pdflush and
simply clear the dirty bit without doing anything; both are fine.
Perhaps we should do both, but the above is good to have anyway, since
it's more efficient to not even show the not-uptodate pages to pdflush.
Andrea Arcangeli <[email protected]> wrote:
>
> On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > Maybe we should revisit invalidate_inode_pages2(). It used to be an
> > invariant that "pages which are mapped into process address space are
> > always uptodate". We broke that (good) invariant and we're now seeing
> > some fallout. There may be more.
>
> Such an invariant hasn't existed since 2.4.10. There's no way to make mmaps
> reload data from disk without breaking such an invariant.
There are at least two ways:
a) Set a new page flag in invalidate, test+clear that at fault time
b) shoot down all pte's mapping the locked page at invalidate time, mark the
page not uptodate.
The latter is complex but has the advantage of fixing the current
half-assed situation wherein existing mmaps are seeing invalidated data.
> It's not even
> about the write side; it's buffered read against O_DIRECT write that
> requires breaking such an invariant.
>
> Either you fix it the above way, or you remove the BUG() in pdflush and
> simply clear the dirty bit without doing anything; both are fine.
> Perhaps we should do both, but the above is good to have anyway, since
> it's more efficient to not even show the not-uptodate pages to pdflush.
We could just remove the BUG in mpage_writepage() (which I assume is the
one which was being hit) but we might still have a not uptodate page with
uptodate buffers and I suspect that the kernel will either go BUG there
instead or will bring the page uptodate again without performing any I/O.
But I haven't checked that.
On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > Maybe we should revisit invalidate_inode_pages2(). It used to be an
> > > invariant that "pages which are mapped into process address space are
> > > always uptodate". We broke that (good) invariant and we're now seeing
> > > some fallout. There may be more.
> >
> > Such an invariant hasn't existed since 2.4.10. There's no way to make mmaps
> > reload data from disk without breaking such an invariant.
>
> There are at least two ways:
>
> a) Set a new page flag in invalidate, test+clear that at fault time
What's the point of adding a new page flag when the invariant
!PageUptodate && page_mapcount(page) already provides the information?
I took a condition that previously was impossible and made it carry
useful information, turning it into another useful invariant instead of
a BUG_ON invariant. The BUG itself guarantees that nobody was using the
condition for other purposes; in fact, invalidate_inode_pages2 is what
triggered this in the first place.
> b) shoot down all pte's mapping the locked page at invalidate time, mark the
> page not uptodate.
invalidate should run fast. I didn't enforce coherency because it would
hurt the O_DIRECT write too much when something is mapped; we only allow
buffered read against O_DIRECT write to work coherently. mmap coherency
has never been provided, to avoid having to search the prio_tree for
vmas on every single write to an inode.
> The latter is complex but has the advantage of fixing the current
> half-assed situation wherein existing mmaps are seeing invalidated data.
That's a feature, not a bug, since 2.4.10. Nobody ever asked for such
coherency; all we provide is read against write or read against read
(write against write only with both writes O_DIRECT or both buffered).
mmaps are ignored by O_DIRECT; mmaps don't crash the kernel (well,
modulo the PageReserved check added in 2.6), but that's all.
> We could just remove the BUG in mpage_writepage() (which I assume is the
> one which was being hit) but we might still have a not uptodate page with
> uptodate buffers and I suspect that the kernel will either go BUG there
> instead or will bring the page uptodate again without performing any I/O.
> But I haven't checked that.
How can this be related to mmapped pages? Isn't this only an issue with
invalidate_inode_pages2? I agree we miss an invalidate of the bh there.
mpage_readpage not using bhs, coupled with the page-size alignment
enforced by the 2.4 O_DIRECT API (unlike 2.6, which uses hard-blocksize
alignment), probably helps a lot in hiding this, I guess.
Andrea Arcangeli <[email protected]> wrote:
>
> On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > > Maybe we should revisit invalidate_inode_pages2(). It used to be an
> > > > invariant that "pages which are mapped into process address space are
> > > > always uptodate". We broke that (good) invariant and we're now seeing
> > > > some fallout. There may be more.
> > >
> > > Such an invariant hasn't existed since 2.4.10. There's no way to make mmaps
> > > reload data from disk without breaking such an invariant.
> >
> > There are at least two ways:
> >
> > a) Set a new page flag in invalidate, test+clear that at fault time
>
> What's the point of adding a new page flag when the invariant
> !PageUptodate && page_mapcount(page) already provides the information?
Step back and think about this. What earthly sense is there in permitting
userspace access to non uptodate pages?
None. It's completely wrong and the invariant was a good one. We broke it
by introducing some kluge to force new I/O when someone does a new fault
against the page.
(A new PG_needs_rereading flag isn't sufficient btw - we'd also need
BH_Needs_Rereading and associated code. ug.)
> > b) shoot down all pte's mapping the locked page at invalidate time, mark the
> > page not uptodate.
>
> invalidate should run fast. I didn't enforce coherency because it would
> hurt the O_DIRECT write too much when something is mapped; we only allow
> buffered read against O_DIRECT write to work coherently. mmap coherency
> has never been provided, to avoid having to search the prio_tree for
> vmas on every single write to an inode.
I don't get it. invalidate has the pageframe. All it needs to do is
lock the page, examine mapcount and, if it's non-zero, do the shootdown.
The only way in which we would be performing the shootdown a significant
number of times would be if someone was repeatedly faulting the thing back
in anyway, and in that case the physical I/O cost would dominate. Where's
the performance overhead??
Plus it makes the currently incorrect code correct for existing mmaps.
Plus it avoids the idiotic situation of having non uptodate pages
accessible to user processes.
Andrew Morton <[email protected]> wrote:
>
> I don't get it. invalidate has the pageframe. All it needs to do is
> lock the page, examine mapcount and, if it's non-zero, do the shootdown.
unmap_mapping_range() will do that - can call it one page at a time, or
batch up runs of pages. It's not fast, but presumably not frequent either.
The bigger problem is shooting down the buffer_heads. It's certainly the
case that mpage_readpage() will call block_read_full_page() which will then
bring the page uptodate without performing any I/O.
And invalidating the buffer_heads in invalidate_inode_pages2() is tricky -
we need to enter the filesystem and I'm not sure that either
->invalidatepage() or ->releasepage() are quite suitable. For a start,
they're best-effort and may fail. If we just go and mark the buffers not
uptodate we'll probably give ext3 a heart attack, so careful work would be
needed there.
Let's go back to why we needed all of this. Was it just for the NFS
something-changed-on-the-server code? If so, would it be sufficient to add
a new invalidate_inode_pages3() just for NFS, which clears the uptodate
bit? Or something along those lines?
On Thu, Oct 21, 2004 at 04:42:45PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Thu, Oct 21, 2004 at 04:02:33PM -0700, Andrew Morton wrote:
> > > Andrea Arcangeli <[email protected]> wrote:
> > > >
> > > > On Thu, Oct 21, 2004 at 02:45:31PM -0700, Andrew Morton wrote:
> > > > > Maybe we should revisit invalidate_inode_pages2(). It used to be an
> > > > > invariant that "pages which are mapped into process address space are
> > > > > always uptodate". We broke that (good) invariant and we're now seeing
> > > > > some fallout. There may be more.
> > > >
> > > > Such an invariant hasn't existed since 2.4.10. There's no way to make mmaps
> > > > reload data from disk without breaking such an invariant.
> > >
> > > There are at least two ways:
> > >
> > > a) Set a new page flag in invalidate, test+clear that at fault time
> >
> > What's the point of adding a new page flag when the invariant
> > !PageUptodate && page_mapcount(page) already provides the information?
>
> Step back and think about this. What earthly sense is there in permitting
> userspace access to non uptodate pages?
This is exactly why new page faults re-read from disk. Instantiating
not-uptodate pages is definitely a mistake. But after the pte is
instantiated, the uptodate information becomes pointless. It's up to the
page fault to make sure the page is uptodate before mapping it into
userspace.
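[Illustrative sketch, loosely paraphrasing the 2.6-era filemap_nopage()
rather than quoting it: the fault-side rule is that a not-uptodate
pagecache page is re-read before any pte pointing at it is set up.

	lock_page(page);
	if (!PageUptodate(page)) {
		/* ->readpage() unlocks the page at I/O completion */
		error = mapping->a_ops->readpage(file, page);
		if (!error)
			wait_on_page_locked(page);
		if (!PageUptodate(page))
			goto page_error;	/* never map a stale page */
	} else
		unlock_page(page);
]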
> None. It's completely wrong and the invariant was a good one. We
> broke it
> by introducing some kluge to force new I/O when someone does a new fault
> against the page.
The invariant that matters is that the page fault must never map
not-uptodate pages into userspace. That invariant is still obeyed; it's
just that after the page is mapped we start making good use of the
uptodate bit again.
> (A new PG_needs_rereading flag isn't sufficient btw - we'd also need
> BH_Needs_Rereading and associated code. ug.)
Clearing the uptodate bits for the bhs too would fix it. Anyway, I
doubt the lack of bh clearing could ever trigger in practice, thanks to
mpage_readpages and the page-aligned API. Perhaps the more relaxed 2.6
API could expose the problem, and that's why we need to address it in
2.6.
> I don't get it. invalidate has the pageframe. All it needs to do is
> lock the page, examine mapcount and, if it's non-zero, do the shootdown.
> The only way in which we would be performing the shootdown a significant
> number of times would be if someone was repeatedly faulting the thing back
> in anyway, and in that case the physical I/O cost would dominate. Where's
> the performance overhead??
>
> Plus it makes the currently incorrect code correct for existing mmaps.
>
> Plus it avoids the idiotic situation of having non uptodate pages
> accessible to user processes.
Perhaps we can implement it for 2.6, but for 2.4 it was not doable, and
clearing PG_uptodate in 2.4 is what made O_DIRECT possible at all, so I
wouldn't call it an idiotic situation. It was a big new feature and it
still is, since your new bitflag is not needed: what the VM asks for is
to clear the uptodate bit, because the page in the pagecache is not
uptodate anymore. What happens is that the disk has changed under you,
so it's idiotic to set another bitflag and leave the uptodate bit set
when the page is obviously not uptodate anymore.
If you want to shootdown ptes before clearing the bitflag, that's fine
with me, but you will still have to clear the uptodate bitflag on the
page after that. And mmap coherency has never been provided by O_DIRECT
in Linux; this would be a new feature, surely not a bugfix. The mmap
case is a controlled race by design that can cause no harm. BTW, NFS is
doing it too in some 2.4 tree; we've got complaints that mmaps weren't
refreshed. If you change invalidate_inode_pages2 to shoot down ptes then
they'll get much better refreshing for NFS too, but again, that's not
doable in the 2.4 backport: in 2.4 all I could do was clear uptodate on
mapped pages, since after the page is mapped the uptodate bit becomes
useless and we can re-use it. Adding a new bitflag wouldn't change a
thing except to make it more complicated than it already is.
On Thu, Oct 21, 2004 at 05:15:58PM -0700, Andrew Morton wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > I don't get it. invalidate has the pageframe. All it needs to do is
> > lock the page, examine mapcount and, if it's non-zero, do the shootdown.
>
> unmap_mapping_range() will do that - can call it one page at a time, or
> batch up runs of pages. It's not fast, but presumably not frequent either.
That would shoot down the ptes to add complete coherency to the mmaps,
right. Still, we could shoot down the ptes after clearing the uptodate
bitflag, allowing the mapped page to be not uptodate for a short while;
that makes sense and it's harmless. The pte shootdown, from my point of
view, is just an additional coherency feature, but it cannot provide
full coherency anyway, since the invalidate arrives after the I/O hits
the disk, so the page will be out of sync with the disk if it's dirty,
and no coherency can be provided anyway, because no locking happens, for
the sake of max scalability.
> The bigger problem is shooting down the buffer_heads. It's certainly the
> case that mpage_readpage() will call block_read_full_page() which will then
> bring the page uptodate without performing any I/O.
Yes, this is actually the only bug I can see in this whole affair
(besides the BUG that goes away with the patch already posted; that
patch still makes perfect sense to me, since we could use it even with a
more relaxed pte shootdown as described above, and in any case it isn't
worth marking not-uptodate pages dirty, which is really what makes no
sense and needs fixing).
> And invalidating the buffer_heads in invalidate_inode_pages2() is tricky -
> we need to enter the filesystem and I'm not sure that either
> ->invalidatepage() or ->releasepage() are quite suitable. For a start,
> they're best-effort and may fail. If we just go and mark the buffers not
> uptodate we'll probably give ext3 a heart attack, so careful work would be
> needed there.
>
> Let's go back to why we needed all of this. Was it just for the NFS
> something-changed-on-the-server code? If so, would it be sufficient to add
> a new invalidate_inode_pages3() just for NFS, which clears the uptodate
> bit? Or something along those lines?
NFS is a case here too. But this is mostly needed for an O_DIRECT write
happening on a file that is mmapped and read in buffered mode at the
same time. The API totally ignores the mapping, but we must guarantee
that buffered reads see the written data on disk, and in turn the
uptodate bitflag must be cleared because the page is not uptodate
anymore. This just describes what has happened on disk, and it tells
_future_ page faults (or buffered read syscalls) that they have to
re-read from disk. The only issue seems to be the bhs. Perhaps the bhs
require a new bitflag if the fs risks a heart attack, but the VM can do
the natural thing of clearing the uptodate bitflag, reflecting the fact
that the cache is out of date, since the VM can deal with that just fine.
On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> If you want to shootdown ptes before clearing the bitflag, that's fine
Small correction: s/before/after/. Doing the pte shootdown before
clearing the uptodate bitflag would still not guarantee reading
uptodate data after the invalidate (a minor page fault could still
happen between the shootdown and the clear_bit; whereas after clearing
the uptodate bit, a major fault hitting the disk and refreshing the
pagecache contents is guaranteed - modulo bhs, well, at least nfs is
surely OK in that respect ;).
Andrea Arcangeli <[email protected]> wrote:
>
> On Fri, Oct 22, 2004 at 02:30:04AM +0200, Andrea Arcangeli wrote:
> > If you want to shootdown ptes before clearing the bitflag, that's fine
>
> Small correction: s/before/after/. Doing the pte shootdown before
> clearing the uptodate bitflag would still not guarantee reading
> uptodate data after the invalidate (a minor page fault could still
> happen between the shootdown and the clear_bit; whereas after clearing
> the uptodate bit, a major fault hitting the disk and refreshing the
> pagecache contents is guaranteed - modulo bhs, well, at least nfs is
> surely OK in that respect ;).
Yeah. I think the only 100% reliable way to do this is:
	lock_page()
	unmap_mapping_range(page)
	ClearPageUptodate(page);
	invalidate(page);	/* Try to free the thing */
	unlock_page(page);
(Can do it for a whole range of pages if we always agree to lock pages in
file-index-ascending order).
but hrm, we don't even have the locking there to prevent do_no_page() from
reinstantiating the pte immediately after the unmap_mapping_range().
So what to do?
- The patch you originally sent has a race window which can be made
nonfatal by removing the BUGcheck in mpage_writepage().
- NFS should probably be taught to use unmap_mapping_range() regardless
of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
up.
- invalidate_inode_pages2() isn't successfully invalidating the page if
it has buffers.
This is the biggest problem, because the pages can trivially have
buffers - just write()ing to them will attach buffers, if they're ext2-
or ext3-backed.
It means that a direct-IO write to a section of a file which is mmapped
causes pagecache and disk to get out of sync. Question is: do we care,
or do we say "don't do that"? Nobody seems to have noticed thus far and
it's a pretty dopey thing to be doing.
If we _do_ want to fix it, it seems the simplest approach would be to
nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
raced - try again" loop in there to handle the "do_no_page()
reinstantiated a pte" race.
Fun.
Something like this? Slow as a dog, but possibly correct ;)
void invalidate_inode_pages2(struct address_space *mapping)
{
	struct pagevec pvec;
	pgoff_t next = 0;
	int i;

	pagevec_init(&pvec, 0);
	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
		for (i = 0; i < pagevec_count(&pvec); i++) {
			struct page *page = pvec.pages[i];

			lock_page(page);
			if (page->mapping == mapping) {	/* truncate race? */
				wait_on_page_writeback(page);
				next = page->index + 1;
				while (page_mapped(page)) {
					unmap_mapping_range(mapping,
					    page->index << PAGE_CACHE_SHIFT,
					    (page->index << PAGE_CACHE_SHIFT)+1,
					    0);
					clear_page_dirty(page);
				}
				invalidate_complete_page(mapping, page);
			}
			unlock_page(page);
		}
		pagevec_release(&pvec);
		cond_resched();
	}
}
The unmapping from pagetables means that invalidate_complete_page() will
generally successfully remove the page from pagecache. That mostly fixes
the direct-write-of-mmapped-data coherency problem: the pages simply aren't
in pagecache any more so we'll surely redo physical I/O.
But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
it really sucks that we're offering behaviour which only works most of the
time :(
On Fri, 22 Oct 2004, Andrea Arcangeli wrote:
> The pte shootdown, from my point of view, is just an additional
> coherency feature, but it cannot provide full coherency anyway, since
> the invalidate arrives after the I/O hits the disk, so the page will be
> out of sync with the disk if it's dirty, and no coherency can be
> provided anyway, because no locking happens, for the sake of max
> scalability.
That depends on the filesystem. I hope the clustered filesystems
will be able to provide full coherency by doing the invalidates
in the right order.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> Yeah. I think the only 100% reliable way to do this is:
>
> 	lock_page()
> 	unmap_mapping_range(page)
> 	ClearPageUptodate(page);
> 	invalidate(page);	/* Try to free the thing */
> 	unlock_page(page);
>
> (Can do it for a whole range of pages if we always agree to lock pages in
> file-index-ascending order).
>
> but hrm, we don't even have the locking there to prevent do_no_page() from
> reinstantiating the pte immediately after the unmap_mapping_range().
>
> So what to do?
>
> - The patch you originally sent has a race window which can be made
> nonfatal by removing the BUGcheck in mpage_writepage().
>
> - NFS should probably be taught to use unmap_mapping_range() regardless
> of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
> up.
>
> - invalidate_inode_pages2() isn't successfully invalidating the page if
> it has buffers.
I did it right in 2.4; somebody obviously broke the thing in 2.6. Now,
w.r.t. ext2/ext3 I never heard of a problem, and we nuke buffers
regardless of whether the page is mapped or not (so this isn't a corner
case).
#define block_invalidate_page(page)	discard_bh_page(page, 0, 0)

		} else {
			if (page->buffers) {
				/* Restart after this page */
				list_del(head);
				list_add_tail(head, curr);

				page_cache_get(page);
				spin_unlock(&pagecache_lock);
				block_invalidate_page(page);
			} else
				unlocked = 0;

			ClearPageDirty(page);
The guarantee is total in 2.4 w.r.t. buffers; at worst it can give ext3
a heart attack, but it won't risk losing coherency. Only 2.6 broke the
O_DIRECT vs buffered I/O coherency protocol.

The semantics of not-uptodate pages mapped in the address space are
well understood and work fine in 2.4.
> This is the biggest problem, because the pages can trivially have
> buffers - just write()ing to them will attach buffers, if they're ext2-
> or ext3-backed.
>
> It means that a direct-IO write to a section of a file which is mmapped
> causes pagecache and disk to get out of sync. Question is: do we care,
> or do we say "don't do that"? Nobody seems to have noticed thus far and
> it's a pretty dopey thing to be doing.
This is a supported feature. People are writing with O_DIRECT and
reading with buffered reads, and we must support that (think of a
database writing and a tar.gz reading). The coherency is a must after
the O_DIRECT has completed. If people run the operations at the same
time the result is undefined, but by serializing the I/O at the
userspace level (like stop_db(); tar.gz) we must provide guarantees to
them.
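[Illustrative fragment of that contract; assume direct_fd was opened
with O_DIRECT, buffered_fd without, and dbuf is a page-aligned buffer
of new data.

	/* writer side, e.g. the database: aligned O_DIRECT write */
	pwrite(direct_fd, dbuf, pagesize, 0);

	/* once the O_DIRECT write has completed, a buffered reader
	 * (e.g. tar) must observe the new contents */
	pread(buffered_fd, rbuf, sizeof(rbuf), 0);
	assert(memcmp(rbuf, dbuf, sizeof(rbuf)) == 0);
]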
> If we _do_ want to fix it, it seems the simplest approach would be to
> nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
> raced - try again" loop in there to handle the "do_no_page()
> reinstantiated a pte" race.
do_no_page doesn't need to be coherent. The pte is the last thing we
can care about. The API only guarantees read/write syscalls to be
coherent, not mmap.

The only bug is the bh one, and it can be fixed as easily as I did it in
2.4.
> Something like this? Slow as a dog, but possibly correct ;)
>
>
> void invalidate_inode_pages2(struct address_space *mapping)
> {
> 	struct pagevec pvec;
> 	pgoff_t next = 0;
> 	int i;
>
> 	pagevec_init(&pvec, 0);
> 	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> 		for (i = 0; i < pagevec_count(&pvec); i++) {
> 			struct page *page = pvec.pages[i];
>
> 			lock_page(page);
> 			if (page->mapping == mapping) {	/* truncate race? */
> 				wait_on_page_writeback(page);
> 				next = page->index + 1;
> 				while (page_mapped(page)) {
> 					unmap_mapping_range(mapping,
> 					    page->index << PAGE_CACHE_SHIFT,
> 					    (page->index << PAGE_CACHE_SHIFT)+1,
> 					    0);
> 					clear_page_dirty(page);
> 				}
> 				invalidate_complete_page(mapping, page);
> 			}
> 			unlock_page(page);
> 		}
> 		pagevec_release(&pvec);
> 		cond_resched();
> 	}
> }
You know, the page_mapped path isn't really very important as long as
we execute both clear_page_dirty and ClearPageUptodate. We must provide
the guarantee to the syscall. Leaving not-uptodate pages in the ptes is
fine; the VM can deal with that. The semantics of that condition are
that the page changed on disk due to O_DIRECT from under us, so the page
isn't uptodate anymore. If anything, we can argue about ext3 not being
capable of dealing with PG_uptodate going away from under it, but the VM
is capable of that; there is only one spot that must be aware of it (the
zap fix we posted to start the thread).
BTW, we have these fixes in our tree to make it work:
--- sles/mm/truncate.c.~1~	2004-05-18 19:24:40.000000000 +0200
+++ sles/mm/truncate.c	2004-05-19 02:09:28.311781864 +0200
@@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
 			if (page->mapping == mapping) { /* truncate race? */
 				wait_on_page_writeback(page);
 				next = page->index + 1;
-				if (page_mapped(page))
+				if (page_mapped(page)) {
+					ClearPageUptodate(page);
 					clear_page_dirty(page);
-				else
+				} else
 					invalidate_complete_page(mapping, page);
 			}
 			unlock_page(page);
and then this fix for BLKFLSBUF:
diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
--- linux-2.6.8/mm/truncate.c	2004-09-21 15:39:37.850379202 +0200
+++ linux-2.6.8.SUSE/mm/truncate.c	2004-09-21 15:39:54.733668309 +0200
@@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
 		write_unlock_irq(&mapping->tree_lock);
 		return 0;
 	}
+
+	BUG_ON(PagePrivate(page));
+	if (page_count(page) != 2) {
+		write_unlock_irq(&mapping->tree_lock);
+		return 0;
+	}
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
@@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
 				clear_page_dirty(page);
 				ClearPageUptodate(page);
 			} else {
-				invalidate_complete_page(mapping, page);
+				if (!invalidate_complete_page(mapping,
+							      page)) {
+					clear_page_dirty(page);
+					ClearPageUptodate(page);
+				}
 			}
 		}
 		unlock_page(page);
Apparently those weren't submitted to mainline yet; that could have
created some confusion about why we submitted the third patch, which
avoids the BUG in pdflush and avoids trying to write to disk a page that
is not uptodate anymore. Do you see why it's wrong to pass down to
pdflush a page that isn't uptodate? That page must be re-read; it sure
must not be written out. The semantics of mapped pages that are not
uptodate are very well defined and they pose no risk to the VM (modulo
the bug in pdflush and this clear improvement in zap).
> The unmapping from pagetables means that invalidate_complete_page() will
> generally successfully remove the page from pagecache. That mostly fixes
> the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> in pagecache any more so we'll surely redo physical I/O.
The direct-write-of-mmapped-data coherency is the last thing we can
care about. Even before thinking about direct-write-of-mmapped-data we
must fix direct-write-of-cached-data (assuming the cache is not
mmapped). That is a showstopper bug; the direct-write-of-mmapped-data
case is left "undefined" by the Linux API. Garbage can end up in the
pagecache as far as the kernel is concerned (all we guarantee is that no
random page contents are delivered to userspace).
> But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> it really sucks that we're offering behaviour which only works most of the
> time :(
That doesn't need to ever work; I don't think we even need to add a pte
shootdown, since that behaviour has never been provided, by design. All
we need to do is fix the bh layer, fix the spots in zap to stop bugging
out on dirty not-uptodate pages, and apply the above patches in this
email; then it'll work as well as 2.4 and it will avoid breaking apps.
Note that this is a very serious matter: what happens is that Oracle
writes with O_DIRECT, and when you make a tar.gz of the database the
last block may be corrupt if it's not 512-byte aligned, if it was a
partial page and it had partial bhs. MySQL uses O_DIRECT too. I'm not
particularly concerned about ext3; that can be fixed too, or we can add
a new bitflag as you suggested to avoid fixing the filesystems to deal
with it. But for the VM I'm comfortable that clearing the uptodate bit
is much simpler, and it poses no risk to the VM. For the fs you may very
well be right that the BH_reread_from_disk might be needed instead.
So let's forget the mmapped case and let's fix the bh screwup in 2.6.
Then we can think about the mmapped case.

And to fix the mmapped case, IMHO all we have to do is:

	clear_page_dirty
	ClearPageUptodate
	unmap_page_range

Not the other way around. The uptodate bit must be cleared before
zapping the pte, to force a reload.
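[Illustrative sketch of that ordering as it would sit in the
page_mapped leg of invalidate_inode_pages2(); using
unmap_mapping_range() for the shootdown step is an assumption.

	if (page_mapped(page)) {
		clear_page_dirty(page);
		/* mark the cached copy stale first, so a fault racing
		 * with the shootdown still re-reads from disk */
		ClearPageUptodate(page);
		unmap_mapping_range(mapping,
				    (loff_t)page->index << PAGE_CACHE_SHIFT,
				    PAGE_CACHE_SIZE, 0);
	}
]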
On Thu, Oct 21, 2004 at 10:51:34PM -0400, Rik van Riel wrote:
> That depends on the filesystem. I hope the clustered filesystems
I agree: if you do a "is_underlying_fs_GFS?" check then you can make
more assumptions.

But if you don't do that, the Linux API has always left the mmapped
contents undefined after O_DIRECT writes on the mmapped data.
Andrea Arcangeli <[email protected]> wrote:
>
> On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> > Yeah. I think the only 100% reliable way to do this is:
> >
> > 	lock_page()
> > 	unmap_mapping_range(page)
> > 	ClearPageUptodate(page);
> > 	invalidate(page);	/* Try to free the thing */
> > 	unlock_page(page);
> >
> > (Can do it for a whole range of pages if we always agree to lock pages in
> > file-index-ascending order).
> >
> > but hrm, we don't even have the locking there to prevent do_no_page() from
> > reinstantiating the pte immediately after the unmap_mapping_range().
> >
> > So what to do?
> >
> > - The patch you originally sent has a race window which can be made
> > nonfatal by removing the BUGcheck in mpage_writepage().
> >
> > - NFS should probably be taught to use unmap_mapping_range() regardless
> > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
> > up.
> >
> > - invalidate_inode_pages2() isn't successfully invalidating the page if
> > it has buffers.
>
> I did it right in 2.4; somebody obviously broke the thing in 2.6. Now,
> w.r.t. ext2/ext3 I never heard of a problem, and we nuke buffers
> regardless of whether the page is mapped or not (so this isn't a corner
> case).
>
> #define block_invalidate_page(page)	discard_bh_page(page, 0, 0)
>
> 		} else {
> 			if (page->buffers) {
> 				/* Restart after this page */
> 				list_del(head);
> 				list_add_tail(head, curr);
>
> 				page_cache_get(page);
> 				spin_unlock(&pagecache_lock);
> 				block_invalidate_page(page);
> 			} else
> 				unlocked = 0;
>
> 			ClearPageDirty(page);
This is incorrect for ext3 - we have to go through the a_ops to invalidate
the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3
patch.
> The guarantee is total in 2.4 w.r.t. buffers; at worst it can give
> ext3 a heart attack, but it won't risk losing coherency. Only 2.6 broke
> the O_DIRECT vs buffered I/O coherency protocol.
Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK
in 2.6.
> The semantics of not-uptodate pages mapped in the address space are
> well understood and work fine in 2.4.
I'd prefer to continue to disallow the mapping of nonuptodate pages into
process address space. It's indistinguishable from a security hole and I
think we can indeed preserve this invariant while fixing the mmap coherency
problem.
> > This is the biggest problem, because the pages can trivially have
> > buffers - just write()ing to them will attach buffers, if they're ext2-
> > or ext3-backed.
> >
> > It means that a direct-IO write to a section of a file which is mmapped
> > causes pagecache and disk to get out of sync. Question is: do we care,
> > or do we say "don't do that"? Nobody seems to have noticed thus far and
> > it's a pretty dopey thing to be doing.
>
> This is a supported feature. People are writing with O_DIRECT and
> reading with buffered reads, and we must support that (think of a
> database writing and a tar.gz reading). The coherency is a must after
> the O_DIRECT has completed. If people run the operations at the same
> time the result is undefined, but by serializing the I/O at the
> userspace level (like stop_db(); tar.gz) we must provide guarantees to
> them.
If the page is not mapped into pagetables then we'll invalidate it OK even
if it's backed by ext3. The only hole I can see is if someone comes in and
dirties the page via write() _after_ generic_file_direct_IO() has done its
filemap_write_and_wait().
And I agree that in that case we should propagate the
invalidate_complete_page() failure back to the direct-io write() caller.
I wonder how - via a short write, or via -EFOO?
Probably we should sync all the ptes prior to a direct-io read as well.
> > If we _do_ want to fix it, it seems the simplest approach would be to
> > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
> > raced - try again" loop in there to handle the "do_no_page()
> > reinstantiated a pte" race.
>
> do_no_page doesn't need to be coherent. The pte is the last thing we
> can care about. The API only guarantees read/write syscalls to be
> coherent, not mmap.
I want to fix the mmap case too. We currently have a half-fix for NFS
which doesn't work at all for (say) ext3.
> The only bug is the bh one, and it can be fixed as easily as I did it
> in 2.4.
No, we need to call into the filesystem to kill the buffer_heads.
> > Something like this? Slow as a dog, but possibly correct ;)
> >
> >
> > void invalidate_inode_pages2(struct address_space *mapping)
> > {
> > 	struct pagevec pvec;
> > 	pgoff_t next = 0;
> > 	int i;
> >
> > 	pagevec_init(&pvec, 0);
> > 	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > 		for (i = 0; i < pagevec_count(&pvec); i++) {
> > 			struct page *page = pvec.pages[i];
> >
> > 			lock_page(page);
> > 			if (page->mapping == mapping) {	/* truncate race? */
> > 				wait_on_page_writeback(page);
> > 				next = page->index + 1;
> > 				while (page_mapped(page)) {
> > 					unmap_mapping_range(mapping,
> > 					    page->index << PAGE_CACHE_SHIFT,
> > 					    (page->index << PAGE_CACHE_SHIFT)+1,
> > 					    0);
> > 					clear_page_dirty(page);
> > 				}
> > 				invalidate_complete_page(mapping, page);
> > 			}
> > 			unlock_page(page);
> > 		}
> > 		pagevec_release(&pvec);
> > 		cond_resched();
> > 	}
> > }
>
> You know, the page_mapped path isn't really very important as long as
> we execute both clear_page_dirty and ClearPageUptodate.
It's important for NFS - it will allow mmaps to see server-side updates.
> We must provide the guarantee to the syscall. Leaving not-uptodate
> pages in the ptes is fine; the VM can deal with that. The semantics of
> that condition are that the page changed on disk due to O_DIRECT from
> under us, so the page isn't uptodate anymore. If anything, we can argue
> about ext3 not being capable of dealing with PG_uptodate going away
> from under it, but the VM is capable of that; there is only one spot
> that must be aware of it (the zap fix we posted to start the thread).
>
> BTW, we have these fixes in our tree to make it work:
>
> --- sles/mm/truncate.c.~1~	2004-05-18 19:24:40.000000000 +0200
> +++ sles/mm/truncate.c	2004-05-19 02:09:28.311781864 +0200
> @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
>  			if (page->mapping == mapping) { /* truncate race? */
>  				wait_on_page_writeback(page);
>  				next = page->index + 1;
> -				if (page_mapped(page))
> +				if (page_mapped(page)) {
> +					ClearPageUptodate(page);
>  					clear_page_dirty(page);
> -				else
> +				} else
>  					invalidate_complete_page(mapping, page);
>  			}
>  			unlock_page(page);
That's in 2.6.9.
> and then this fix for BLKFLSBUF:
>
> diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
> --- linux-2.6.8/mm/truncate.c	2004-09-21 15:39:37.850379202 +0200
> +++ linux-2.6.8.SUSE/mm/truncate.c	2004-09-21 15:39:54.733668309 +0200
> @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
>  		write_unlock_irq(&mapping->tree_lock);
>  		return 0;
>  	}
> +
> +	BUG_ON(PagePrivate(page));
> +	if (page_count(page) != 2) {
> +		write_unlock_irq(&mapping->tree_lock);
> +		return 0;
> +	}
>  	__remove_from_page_cache(page);
>  	write_unlock_irq(&mapping->tree_lock);
>  	ClearPageUptodate(page);
> @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
>  				clear_page_dirty(page);
>  				ClearPageUptodate(page);
>  			} else {
> -				invalidate_complete_page(mapping, page);
> +				if (!invalidate_complete_page(mapping,
> +							      page)) {
> +					clear_page_dirty(page);
> +					ClearPageUptodate(page);
> +				}
>  			}
>  		}
>  		unlock_page(page);
>
That's in 2.6.9 as well. I need to think about this a bit more - the first
hunk is for read() versus BLKFLSBUF (if someone else is playing with the
page, leave it alone) whereas the second hunk is kinda buggy, because it
will reintroduce the EIO-in-read() error which the first hunk tried to fix.
(You're using an rwlock for tree_lock. It's still unclear to me whether we
should go that way - it helps big SMP and hurts small SMP. Did you perform
an evaluation of this?)
>
> Apparently those weren't submitted to mainline yet; that could have
> created some confusion about why we submitted the third patch, which
> avoids the BUG in pdflush and avoids trying to write to disk a page
> that is not uptodate anymore. Do you see why it's wrong to pass down to
> pdflush a page that isn't uptodate?
Sure. That's why there's a BUG() there ;)
> That page must be re-read; it sure must not be written out. The
> semantics of mapped pages that are not uptodate are very well defined
> and they pose no risk to the VM (modulo the bug in pdflush and this
> clear improvement in zap).
>
> > The unmapping from pagetables means that invalidate_complete_page() will
> > generally successfully remove the page from pagecache. That mostly fixes
> > the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> > in pagecache any more so we'll surely redo physical I/O.
>
> The direct-write-of-mmapped-data coherency is the last thing we can
> care about.
But we can fix it.
> Even before thinking about direct-write-of-mmapped-data we
> must fix direct-write-of-cached-data (assuming the cache is not
> mmapped). That is a showstopper bug;
I don't think we have a bug here. After the sync(), ext3_releasepage()
will be able to release the buffers. But we should check for failures.
The problem we're seeing is with mmapped pages. In that case, clearing
PG_uptodate is insufficient and directly invalidating the buffer_heads will
probably kill ext3 and we're not allowed to assume that page->private
contains bh's anyway...
Probably it is sufficient to run try_to_release_page() in the page_mapped
leg of invalidate_inode_pages2(), and to check the return value.
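[Illustrative sketch of that suggestion; the surrounding loop and the
-EIO policy are assumptions carried over from the draft above.

	if (page_mapped(page)) {
		clear_page_dirty(page);
		/* let the fs drop its buffer_heads via the a_ops
		 * instead of poking them directly */
		if (PagePrivate(page) &&
		    !try_to_release_page(page, GFP_KERNEL))
			ret = -EIO;	/* fs refused; surface the failure */
	}
]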
> the direct-write-of-mmapped-data
> case is left "undefined" by the Linux API. Garbage can end up in the
> pagecache as far as the kernel is concerned (all we guarantee is that
> no random page contents are delivered to userspace).
>
> > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> > it really sucks that we're offering behaviour which only works most of the
> > time :(
>
> That doesn't need to ever work; I don't think we even need to add a pte
> shootdown, since that behaviour has never been provided, by design. All
> we need to do is fix the bh layer, fix the spots in zap to stop bugging
> out on dirty not-uptodate pages, and apply the above patches in this
> email; then it'll work as well as 2.4 and it will avoid breaking apps.
> Note that this is a very serious matter: what happens is that Oracle
> writes with O_DIRECT, and when you make a tar.gz of the database the
> last block may be corrupt if it's not 512-byte aligned, if it was a
> partial page and it had partial bhs. MySQL uses O_DIRECT too.
The current buffered-vs-direct handling looks OK, as long as the pages
aren't mmapped, and as long as nobody dirties a page after
generic_file_direct_IO() has done filemap_write_and_wait().
For the former I'd like to unmap the pages. For the latter we should check
the invalidate_complete_page() return value.
> I'm not particularly concerned about ext3; that can be fixed too, or we
> can add a new bitflag as you suggested to avoid fixing the filesystems
> to deal with it. But for the VM I'm comfortable that clearing the
> uptodate bit is much simpler, and it poses no risk to the VM. For the
> fs you may very well be right that the BH_reread_from_disk might be
> needed instead.
>
> So let's forget the mmapped case and let's fix the bh screwup in 2.6.
> Then we can think about the mmapped case.
The "bh screwup" _is_ the mmapped case. Maybe I said something different
yesterday - it's been a while since I looked in there.
> And to fix the mmapped case, IMHO all we have to do is:
>
> 	clear_page_dirty
> 	ClearPageUptodate
> 	unmap_page_range
>
> Not the other way around. The uptodate bit must be cleared before
> zapping the pte, to force a reload.
Something like this...
- When invalidating pages, take care to shoot down any ptes which map them
as well.
This ensures that the next mmap access to the page will generate a major
fault, so NFS's server-side modifications are picked up.
This also allows us to call invalidate_complete_page() on all pages, so
filesystems such as ext3 get a chance to invalidate the buffer_heads.
- Don't mark in-pagetable pages as non-uptodate any more. That broke a
previous guarantee that mapped-into-user-process pages are always uptodate.
- Check the return value of invalidate_complete_page(). It can fail if
someone redirties a page after generic_file_direct_IO() wrote it back.
But we still have a problem. If invalidate_inode_pages2() calls
unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache
pages. That will redirty the page's buffers and will cause
invalidate_complete_page() to fail.
So, in generic_file_direct_IO() we do a complete pte shootdown on the file
up-front, prior to writing back dirty pagecache. This is only done for
O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full
mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
other people's mapped pagecache.
NFS also uses invalidate_inode_pages2() for handling server-side modification
notifications. But in the NFS case the clear_page_dirty() in
invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
about the "dirty buffers against a clean page" problem. (I think)
Signed-off-by: Andrew Morton <[email protected]>
---
 25-akpm/include/linux/fs.h |    2 -
 25-akpm/mm/filemap.c       |   19 ++++++++++---
 25-akpm/mm/truncate.c      |   63 +++++++++++++++++++++++++++------------------
 3 files changed, 55 insertions(+), 29 deletions(-)
diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c
--- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:40:26 2004
+++ 25-akpm/mm/truncate.c	Fri Oct 22 16:05:26 2004
@@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp
  * be marked dirty at any time too. So we re-check the dirtiness inside
  * ->tree_lock. That provides exclusion against the __set_page_dirty
  * functions.
+ *
+ * Returns non-zero if the page was successfully invalidated.
  */
 static int
 invalidate_complete_page(struct address_space *mapping, struct page *page)
@@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str
 EXPORT_SYMBOL(invalidate_inode_pages);

 /**
- * invalidate_inode_pages2 - remove all unmapped pages from an address_space
+ * invalidate_inode_pages2 - remove all pages from an address_space
  * @mapping - the address_space
  *
- * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case
- * where the page is seen to be mapped into process pagetables. In that case,
- * the page is marked clean but is left attached to its address_space.
- *
- * The page is also marked not uptodate so that a subsequent pagefault will
- * perform I/O to bringthe page's contents back into sync with its backing
- * store.
+ * Any pages which are found to be mapped into pagetables are unmapped prior to
+ * invalidation.
  *
- * FIXME: invalidate_inode_pages2() is probably trivially livelockable.
+ * Returns -EIO if any pages could not be invalidated.
  */
-void invalidate_inode_pages2(struct address_space *mapping)
+int invalidate_inode_pages2(struct address_space *mapping)
 {
 	struct pagevec pvec;
 	pgoff_t next = 0;
 	int i;
+	int ret = 0;
+	int did_full_unmap = 0;

 	pagevec_init(&pvec, 0);
-	while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
-		for (i = 0; i < pagevec_count(&pvec); i++) {
+	while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
+		for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
 			struct page *page = pvec.pages[i];

 			lock_page(page);
-			if (page->mapping == mapping) {	/* truncate race? */
-				wait_on_page_writeback(page);
-				next = page->index + 1;
-				if (page_mapped(page)) {
-					clear_page_dirty(page);
-					ClearPageUptodate(page);
+			if (page->mapping != mapping) {	/* truncate race? */
+				unlock_page(page);
+				continue;
+			}
+			wait_on_page_writeback(page);
+			next = page->index + 1;
+			while (page_mapped(page)) {
+				if (!did_full_unmap) {
+					/*
+					 * Zap the rest of the file in one hit.
+					 * FIXME: invalidate_inode_pages2()
+					 * should take start/end offsets.
+					 */
+					unmap_mapping_range(mapping,
+					    page->index << PAGE_CACHE_SHIFT,
+					    -1, 0);
+					did_full_unmap = 1;
 				} else {
-					if (!invalidate_complete_page(mapping,
-								      page)) {
-						clear_page_dirty(page);
-						ClearPageUptodate(page);
-					}
+					/*
+					 * Just zap this page
+					 */
+					unmap_mapping_range(mapping,
+					    page->index << PAGE_CACHE_SHIFT,
+					    (page->index << PAGE_CACHE_SHIFT)+1,
+					    0);
 				}
 			}
+			clear_page_dirty(page);
+			if (!invalidate_complete_page(mapping, page))
+				ret = -EIO;
 			unlock_page(page);
 		}
 		pagevec_release(&pvec);
 		cond_resched();
 	}
+	return ret;
 }
-
 EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h
--- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:54:50 2004
+++ 25-akpm/include/linux/fs.h	Fri Oct 22 15:54:58 2004
@@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino
 			S_ISLNK(inode->i_mode))
 		invalidate_inode_pages(inode->i_mapping);
 }
-extern void invalidate_inode_pages2(struct address_space *mapping);
+extern int invalidate_inode_pages2(struct address_space *mapping);
 extern void write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c
--- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix	Fri Oct 22 15:55:06 2004
+++ 25-akpm/mm/filemap.c	Fri Oct 22 16:17:19 2004
@@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file
 EXPORT_SYMBOL(generic_file_writev);

 /*
- * Called under i_sem for writes to S_ISREG files
+ * Called under i_sem for writes to S_ISREG files. Returns -EIO if something
+ * went wrong during pagecache shootdown.
  */
 ssize_t
 generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
@@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki
 	struct address_space *mapping = file->f_mapping;
 	ssize_t retval;

+	/*
+	 * If it's a write, unmap all mmappings of the file up-front. This
+	 * will cause any pte dirty bits to be propagated into the pageframes
+	 * for the subsequent filemap_write_and_wait().
+	 */
+	if (rw == WRITE && mapping_mapped(mapping))
+		unmap_mapping_range(mapping, 0, -1, 0);
+
 	retval = filemap_write_and_wait(mapping);
 	if (retval == 0) {
 		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
						offset, nr_segs);
-		if (rw == WRITE && mapping->nrpages)
-			invalidate_inode_pages2(mapping);
+		if (rw == WRITE && mapping->nrpages) {
+			int err = invalidate_inode_pages2(mapping);
+			if (err)
+				retval = err;
+		}
 	}
 	return retval;
 }
-
 EXPORT_SYMBOL_GPL(generic_file_direct_IO);
_
On Fri, Oct 22, 2004 at 06:17:44PM +0200, Andrea Arcangeli wrote:
> So let's forget the mmapped case and let's fix the bh screwup in 2.6.
I propose a solution to fix the coherency problem that afflicts 2.6 if
invalidate_complete_page fails. If the page is still PagePrivate despite
our having called try_to_release_page (which means clearing the uptodate
bitflag wouldn't help), we should simply return an error to the O_DIRECT
write. That way the error will not be overlooked by the database doing
the writing. This is a minimum guarantee we have to provide: if we fail
to invalidate, make the O_DIRECT write fail.
Of course we should return write failures as well in the mmapped case,
but let's ignore the mmapped case for now.
The real showstopper bug is try_to_release_page failing and preventing
the coherency protocol from working even without mmaps. This could never
happen in 2.4; there, at worst, ext3 would get a heart attack, which is
much better than silent corruption in the backup.
Andrew,
This patch does fix the oops I was seeing.
Shaggy
On Fri, 2004-10-22 at 18:24, Andrew Morton wrote:
> Something like this...
>
>
>
>
> - When invalidating pages, take care to shoot down any ptes which map them
> as well.
>
> This ensures that the next mmap access to the page will generate a major
> fault, so NFS's server-side modifications are picked up.
>
> This also allows us to call invalidate_complete_page() on all pages, so
> filesystems such as ext3 get a chance to invalidate the buffer_heads.
>
> - Don't mark in-pagetable pages as non-uptodate any more. That broke a
> previous guarantee that mapped-into-user-process pages are always uptodate.
>
> - Check the return value of invalidate_complete_page(). It can fail if
> someone redirties a page after generic_file_direct_IO() wrote it back.
>
> But we still have a problem. If invalidate_inode_pages2() calls
> unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache
> pages. That will redirty the page's buffers and will cause
> invalidate_complete_page() to fail.
>
> So, in generic_file_direct_IO() we do a complete pte shootdown on the file
> up-front, prior to writing back dirty pagecache. This is only done for
> O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full
> mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
> permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
> other people's mapped pagecache.
>
> NFS also uses invalidate_inode_pages2() for handling server-side modification
> notifications. But in the NFS case the clear_page_dirty() in
> invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
> about the "dirty buffers against a clean page" problem. (I think)
>
> Signed-off-by: Andrew Morton <[email protected]>
>
> [patch snipped]
On Fri, Oct 22, 2004 at 04:24:33PM -0700, Andrew Morton wrote:
> This is incorrect for ext3 - we have to go through the a_ops to invalidate
> the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3
> patch.
BTW, truncate_complete_page also calls into do_flushpage without asking
the fs. truncate_inode_pages does it too (that's probably where it comes
from; however, truncate happens when the fs has already given up on the
inode, which is probably why it's safe there).
> Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK
> in 2.6.
yes, as long as try_to_release_page cannot fail on non-mapped pages;
otherwise it would lose coherency. When exactly can try_to_release_page
return failure from the fs?
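For reference, the failure can only come from the filesystem side. A
rough sketch (from memory, not verbatim 2.6 source) of the helper:

	/*
	 * Sketch of the 2.6-era try_to_release_page().  The fs can
	 * refuse via its ->releasepage() op (e.g. ext3 while the
	 * journal still references the buffer_heads), and
	 * try_to_free_buffers() fails on dirty or busy buffers.
	 */
	int try_to_release_page(struct page *page, int gfp_mask)
	{
		struct address_space * const mapping = page->mapping;

		BUG_ON(!PageLocked(page));
		if (PageWriteback(page))
			return 0;
		if (mapping && mapping->a_ops->releasepage)
			return mapping->a_ops->releasepage(page, gfp_mask);
		return try_to_free_buffers(page);
	}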
> And I agree that in that case we should propagate the
> invalidate_complete_page() failure back to the direct-io write() caller. I wonder
> how - via a short write, or via -EFOO?
can we do a short write at all? The failure would come out of
try_to_free_buffers; we don't know which blocks failed to invalidate
there, and in turn which pages would still be visible correctly. I
suspect all we can do is -EFOO (which sounds safer anyway, since this is
fundamentally a userspace race-condition detection path; -ERACE would
possibly be appropriate if it existed ;).
> Probably we should sync all the ptes prior to a direct-io read as well.
yes, we should msync all the mappings if we want to truly provide mmap
coherency to O_DIRECT.
> I want to fix the mmap case too. We currently have a half-fix for NFS
> which doesn't work at all for (say) ext3.
for nfs it worked well enough, because all they care about is that if
you munmap and mmap again the new data is shown; that is, if they
restart the program it sees the new data on the server.
Perfect mmap coherency is impossible with nfs due to the stateless
protocol, but for ext3 we could provide it (in theory).
But keep in mind that 99% of the O_DIRECT userbase seems not to need it:
if you use O_DIRECT by design, it means you have no mapping on that data
and you do the caching in userspace (otherwise mmap would be the api to
use). The two things are quite mutually exclusive. This is why it always
looked like an unnecessary slowdown to me, but it looks like we might
implement it with little overhead for 2.6.
> It's important for NFS - it will allow mmaps to see server-side updates.
the folks I talked to only care about munmap/mmap coherency; a
persistent mmap isn't required to see the new data.
> That's in 2.6.9.
ok fine.
> That's in 2.6.9 as well. I need to think about this a bit more - the first
fine too ;)
> (You're using an rwlock for tree_lock. It's still unclear to me whether we
> should go that way - it helps big SMP and hurts small SMP. Did you perform
> an evaluation of this?)
I didn't change anything in that area; I've no clue who changed that.
Since a spinlock fits the usage, an rwlock will most probably be pure
overhead. BTW, I've still to backport the semaphore-to-spinlock
conversion to 2.6.5 (I'm in the process of backporting the
lowmem_reserve patch; 2.6.5 didn't even have the new, better protection
behaviour of 2.6.8+, so I'm in the usual rejects procedure.. ;). Then
there's Hugh's rss work to backport to 2.6.5, and finally this patch you
posted here for mmap coherency.
> Sure. That's why there's a BUG() there ;)
eheh ;)
> I don't think we have a bug here. After the sync(), ext3_releasepage()
> will be able to release the buffers. But we should check for failures.
this partly answers the above question (i.e. when exactly can
releasepage fail?).
> The problem we're seeing is with mmapped pages. In that case, clearing
> PG_uptodate is insufficient and directly invalidating the buffer_heads will
> probably kill ext3 and we're not allowed to assume that page->private
> contains bh's anyway...
>
> Probably it is sufficient to run try_to_release_page() in the page_mapped
> leg of invalidate_inode_pages2(), and to check the return value.
I guess so. I mean, ext3 cannot know the page is mapped. Returning
failure sounds more than good enough; a sketch of that leg is below.
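A hypothetical sketch of that page_mapped leg, following the 2.6.9-era
structure (ret is the error code the int-returning variant would
accumulate); this is only an illustration, not merged code:

	if (page_mapped(page)) {
		clear_page_dirty(page);
		ClearPageUptodate(page);
		/* let the fs try to drop its buffer_heads and
		 * report failure instead of silently losing
		 * coherency */
		if (PagePrivate(page) && !try_to_release_page(page, 0))
			ret = -EIO;	/* e.g. ext3 still holds the bhs */
	} else {
		if (!invalidate_complete_page(mapping, page))
			ret = -EIO;
	}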
> mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
> permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
> other people's mapped pagecache.
for O_DIRECT reads you don't need to shoot down the ptes, only to msync
them. That would avoid imposing a performance penalty on other users,
and it would leave the cost in the task issuing the msync and walking
the ptes. Since you've provided either coherency or -EIO to the O_DIRECT
write, it's probably worth adding the msync over the ptes before the
O_DIRECT read hits the disk as well, no?
BTW, this is still not coherent at the block level: if I write a
512-byte block with O_DIRECT on one side and at the same time write a
512-byte block with mmap (or a buffered write), the end result is still
undefined. But if somebody orders the writes, then it should work even
with mmap now (while previously it couldn't work even with
userspace-ordered I/O, since the dirty bit in the pte would simply be
ignored by O_DIRECT). Correct?
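To make the userspace-ordering contract concrete, here is a minimal,
hypothetical illustration (file name invented, error handling trimmed):
the O_DIRECT writer fully completes before the buffered reader starts,
and only then is the result defined:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>

	int main(void)
	{
		int dfd, bfd;
		void *buf;
		char check[512];

		/* O_DIRECT writer (think: the database) */
		dfd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
		posix_memalign(&buf, 512, 512);	/* 512-byte aligned buffer */
		memset(buf, 'A', 512);
		if (write(dfd, buf, 512) != 512)
			exit(1);
		close(dfd);	/* write completed: the ordering point */

		/* buffered reader (think: the tar.gz) */
		bfd = open("datafile", O_RDONLY);
		if (read(bfd, check, 512) != 512 || check[0] != 'A')
			exit(1);	/* coherency the kernel must provide */
		close(bfd);
		return 0;
	}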
> NFS also uses invalidate_inode_pages2() for handling server-side modification
> notifications. But in the NFS case the clear_page_dirty() in
> invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
> about the "dirty buffers against a clean page" problem. (I think)
indeed, nfs was the only safe one w.r.t. mappings.
The patch looks good, but perhaps we can add an msync (not an unmap) for
O_DIRECT reads too, to reduce the magic?
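Roughly, in generic_file_direct_IO() it would look like this -- a sketch
only, where sync_mapping_ptes() is a hypothetical helper that would
propagate pte dirty bits into the pageframes for every mm mapping the
file without tearing the ptes down (filemap_sync() alone only covers the
calling mm):

	/* hypothetical: sync, don't unmap, before O_DIRECT reads */
	if (rw == READ && mapping_mapped(mapping))
		sync_mapping_ptes(mapping);

	retval = filemap_write_and_wait(mapping);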
Andrew & Andrea,
What is the status of this patch? It would be nice to have it in the
-mm4 kernel.
Thanks,
Shaggy
On Fri, 2004-10-22 at 18:24, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > On Thu, Oct 21, 2004 at 07:03:20PM -0700, Andrew Morton wrote:
> > > Yeah. I think the only 100% reliable way to do this is:
> > >
> > > lock_page()
> > > unmap_mapping_range(page)
> > > ClearPageUptodate(page);
> > > invalidate(page); /* Try to free the thing */
> > > unlock_page(page);
> > >
> > > (Can do it for a whole range of pages if we always agree to lock pages in
> > > file-index-ascending order).
> > >
> > > but hrm, we don't even have the locking there to prevent do_no_page() from
> > > reinstantiating the pte immediately after the unmap_mapping_range().
> > >
> > > So what to do?
> > >
> > > - The patch you originally sent has a race window which can be made
> > > nonfatal by removing the BUGcheck in mpage_writepage().
> > >
> > > - NFS should probably be taught to use unmap_mapping_range() regardless
> > > of what we do, so the existing-mmaps-hold-stale-data problem gets fixed
> > > up.
> > >
> > > - invalidate_inode_pages2() isn't successfully invalidating the page if
> > > it has buffers.
> >
> > I did it right in 2.4; somebody obviously broke the thing in 2.6. Now
> > w.r.t. to ext2/ext3 I never heard a problem, and we nuke buffers
> > regardless if the page is mapped or not (so this isn't a corner case).
> >
> > #define block_invalidate_page(page) discard_bh_page(page, 0, 0)
> >
> > } else {
> > if (page->buffers) {
> > /* Restart after this page */
> > list_del(head);
> > list_add_tail(head, curr);
> >
> > page_cache_get(page);
> > spin_unlock(&pagecache_lock);
> > block_invalidate_page(page);
> > } else
> > unlocked = 0;
> >
> > ClearPageDirty(page);
>
> This is incorrect for ext3 - we have to go through the a_ops to invalidate
> the buffer_heads. That's bad news for my (unmerged) 2.4 direct-io-for-ext3
> patch.
>
> > The guarantee is total in 2.4 w.r.t. buffers; at worst it can give
> > ext3 a heart attack, but it won't risk losing coherency. Only 2.6
> > broke the O_DIRECT vs buffered I/O coherency protocol.
>
> Only for mmapped pages. I believe that normal pagecache-vs-directIO is OK
> in 2.6.
>
> > The semantics of not-uptodate pages mapped in the address space are
> > well understood and working fine in 2.4.
>
> I'd prefer to continue to disallow the mapping of nonuptodate pages into
> process address space. It's indistinguishable from a security hole and I
> think we can indeed preserve this invariant while fixing the mmap coherency
> problem.
>
> > > This is the biggest problem, because the pages can trivially have
> > > buffers - just write()ing to them will attach buffers, if they're ext2-
> > > or ext3-backed.
> > >
> > > It means that a direct-IO write to a section of a file which is mmapped
> > > causes pagecache and disk to get out of sync. Question is: do we care,
> > > or do we say "don't do that"? Nobody seems to have noticed thus far and
> > > it's a pretty dopey thing to be doing.
> >
> > this is a supported feature. People are writing with O_DIRECT and reading
> > with buffered read and we must support that (think a database writing
> > and a tar.gz reading). The coherency is a must after the O_DIRECT has
> > completed. If people run the operations at the same time the result is
> > undefined. But by serializing the I/O at the userspace level (like
> > stop_db(); tar.gz) we must provide guarantees to them.
>
> If the page is not mapped into pagetables then we'll invalidate it OK even
> if it's backed by ext3. The only hole I can see is if someone comes in and
> dirties the page via write() _after_ generic_file_direct_IO() has done its
> filemap_write_and_wait().
>
> And I agree that in that case we should propagate the
> invalidate_complete_page() failure back to the direct-io write() caller. I wonder
> how - via a short write, or via -EFOO?
>
> Probably we should sync all the ptes prior to a direct-io read as well.
>
> > > If we _do_ want to fix it, it seems the simplest approach would be to
> > > nuke the pte's in invalidate_inode_pages2(). And we'd need some "oh we
> > > raced - try again" loop in there to handle the "do_no_page()
> > > reinstantiated a pte" race.
> >
> > the do_no_page path doesn't need to be coherent. The pte is the last
> > thing we can care about. The API only guarantees read/write syscalls
> > to be coherent, not mmap.
>
> I want to fix the mmap case too. We currently have a half-fix for NFS
> which doesn't work at all for (say) ext3.
>
> > the only bug is the bh and it has to be fixed as easily as I did in 2.4.
>
> No, we need to call into the filesystem to kill the buffer_heads.
>
> > > Something like this? Slow as a dog, but possibly correct ;)
> > >
> > >
> > > void invalidate_inode_pages2(struct address_space *mapping)
> > > {
> > > struct pagevec pvec;
> > > pgoff_t next = 0;
> > > int i;
> > >
> > > pagevec_init(&pvec, 0);
> > > while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> > > for (i = 0; i < pagevec_count(&pvec); i++) {
> > > struct page *page = pvec.pages[i];
> > >
> > > lock_page(page);
> > > if (page->mapping == mapping) { /* truncate race? */
> > > wait_on_page_writeback(page);
> > > next = page->index + 1;
> > > while (page_mapped(page)) {
> > > unmap_mapping_range(mapping,
> > > page->index << PAGE_CACHE_SHIFT,
> > > (page->index << PAGE_CACHE_SHIFT)+1,
> > > 0);
> > > clear_page_dirty(page);
> > > }
> > > invalidate_complete_page(mapping, page);
> > > }
> > > unlock_page(page);
> > > }
> > > pagevec_release(&pvec);
> > > cond_resched();
> > > }
> > > }
> >
> > You know the page_mapped path isn't really very important as long as
> > we execute both clear_page_dirty and ClearPageUptodate.
>
> It's important for NFS - it will allow mmaps to see server-side updates.
>
> > We must provide the
> > guarantee to the syscall. Leaving not-uptodate pages in the ptes is
> > fine, the VM can deal with that. The semantics of that condition are
> > that the page changed on disk due to O_DIRECT from under us, so the
> > page isn't uptodate anymore. If anything, we can argue about ext3 not
> > being capable of dealing with PG_uptodate going away from under it,
> > but the VM is capable of that; there is only one spot that must be
> > aware of it (the zap fix we posted to start the thread).
> >
> > BTW, we have these fixes in our tree to make it work:
> >
> > --- sles/mm/truncate.c.~1~ 2004-05-18 19:24:40.000000000 +0200
> > +++ sles/mm/truncate.c 2004-05-19 02:09:28.311781864 +0200
> > @@ -260,9 +260,10 @@ void invalidate_inode_pages2(struct addr
> > if (page->mapping == mapping) { /* truncate race? */
> > wait_on_page_writeback(page);
> > next = page->index + 1;
> > - if (page_mapped(page))
> > + if (page_mapped(page)) {
> > + ClearPageUptodate(page);
> > clear_page_dirty(page);
> > - else
> > + } else
> > invalidate_complete_page(mapping, page);
> > }
> > unlock_page(page);
>
> That's in 2.6.9.
>
> > and then this fix for BLKFLSBUF:
> >
> > diff -urNp linux-2.6.8/mm/truncate.c linux-2.6.8.SUSE/mm/truncate.c
> > --- linux-2.6.8/mm/truncate.c 2004-09-21 15:39:37.850379202 +0200
> > +++ linux-2.6.8.SUSE/mm/truncate.c 2004-09-21 15:39:54.733668309 +0200
> > @@ -86,6 +86,12 @@ invalidate_complete_page(struct address_
> > write_unlock_irq(&mapping->tree_lock);
> > return 0;
> > }
> > +
> > + BUG_ON(PagePrivate(page));
> > + if (page_count(page) != 2) {
> > + write_unlock_irq(&mapping->tree_lock);
> > + return 0;
> > + }
> > __remove_from_page_cache(page);
> > write_unlock_irq(&mapping->tree_lock);
> > ClearPageUptodate(page);
> > @@ -306,7 +312,11 @@ void invalidate_inode_pages2(struct addr
> > clear_page_dirty(page);
> > ClearPageUptodate(page);
> > } else {
> > - invalidate_complete_page(mapping, page);
> > + if (!invalidate_complete_page(mapping,
> > + page)) {
> > + clear_page_dirty(page);
> > + ClearPageUptodate(page);
> > + }
> > }
> > }
> > unlock_page(page);
> >
>
> That's in 2.6.9 as well. I need to think about this a bit more - the first
> hunk is for read() versus BLKFLSBUF (if someone else is playing with the
> page, leave it alone) whereas the second hunk is kinda buggy, because it
> will reintroduce the EIO-in-read() error which the first hunk tried to fix.
>
> (You're using an rwlock for tree_lock. It's still unclear to me whether we
> should go that way - it helps big SMP and hurts small SMP. Did you perform
> an evaluation of this?)
>
> >
> > apparently those weren't submitted to mainline yet; that could have
> > created some confusion about why we submitted the third patch, which
> > avoids the BUG in pdflush and avoids trying to write to disk a page
> > that is not uptodate anymore. Do you see why it's wrong to pass down
> > to pdflush a page that isn't uptodate?
>
> Sure. That's why there's a BUG() there ;)
>
> > That page must be re-read; it sure must not
> > be written out. The semantics of mapped not-uptodate pages are very
> > well defined and they pose no risk to the VM (modulo the bug in
> > pdflush and this clear improvement in zap).
> >
> > > The unmapping from pagetables means that invalidate_complete_page() will
> > > generally successfully remove the page from pagecache. That mostly fixes
> > > the direct-write-of-mmapped-data coherency problem: the pages simply aren't
> > > in pagecache any more so we'll surely redo physical I/O.
> >
> > the direct-write-of-mmapped-data coherency is the last thing we can care
> > about.
>
> But we can fix it.
>
> > Even before thinking about direct-write-of-mmapped-data we must
> > fix direct-write-of-cached-data (assuming it's not mmapped cache). That
> > is a showstopper bug,
>
> I don't think we have a bug here. After the sync(), ext3_releasepage()
> will be able to release the buffers. But we should check for failures.
>
> The problem we're seeing is with mmapped pages. In that case, clearing
> PG_uptodate is insufficient and directly invalidating the buffer_heads will
> probably kill ext3 and we're not allowed to assume that page->private
> contains bh's anyway...
>
> Probably it is sufficient to run try_to_release_page() in the page_mapped
> leg of invalidate_inode_pages2(), and to check the return value.
>
> > the direct-write-of-mmapped-data is "undefined" by
> > the linux API. Garbage can end up in the pagecache as far as the kernel
> > is concerned (all we guarantee is that no random page contents are
> > delivered to userspace).
> >
> > > But it's not 100% reliable because ->invalidatepage isn't 100% reliable and
> > > it really sucks that we're offering behaviour which only works most of the
> > > time :(
> >
> > That doesn't ever need to work; I don't think we even need to add a
> > pte shootdown, since the behaviour has never been provided by design.
> > All we need to do is fix the bh layer and fix up the spots in zap to
> > stop bugging out on dirty not-uptodate pages, plus the above patches
> > in this email; then it'll work as well as 2.4 and it will avoid
> > breaking apps. Note that this is a very serious matter: oracle writes
> > with O_DIRECT, and when you do a tar.gz of the database the last block
> > may be corrupt if it's not 512-byte aligned (i.e. it was a partial
> > page with partial bhs). mysql uses O_DIRECT too.
>
> The current buffered-vs-direct handling looks OK, as long as the pages
> aren't mmapped, and as long as nobody dirties a page after
> generic_file_direct_IO() has done filemap_write_and_wait().
>
> For the former I'd like to unmap the pages. For the latter we should check
> the invalidate_complete_page() return value.
>
> > I'm not particularly
> > concerned about ext3; that can be fixed too, or we can add a new
> > bitflag as you suggested to avoid having to fix the filesystems to
> > deal with that. But for the VM I'm comfortable that clearing the
> > uptodate bit is much simpler, and it poses no risk to the VM. For the
> > fs you may very well be right that the BH_reread_from_disk flag might
> > be needed instead.
> >
> > So let's forget the mmapped case and let's fix the bh screwup in 2.6.
> > Then we can think about the mmapped case.
>
> The "bh screwup" _is_ the mmapped case. Maybe I said something different
> yesterday - it's been a while since I looked in there.
>
> > And to fix the mmapped case IMHO all we have to do is:
> >
> > clear_page_dirty
> > ClearPageUptodate
> > unmap_page_range
> >
> > Not the other way around. The uptodate bit must be cleared before
> > zapping the pte to force a reload.
>
> Something like this...
>
>
>
>
> - When invalidating pages, take care to shoot down any ptes which map them
> as well.
>
> This ensures that the next mmap access to the page will generate a major
> fault, so NFS's server-side modifications are picked up.
>
> This also allows us to call invalidate_complete_page() on all pages, so
> filesystems such as ext3 get a chance to invalidate the buffer_heads.
>
> - Don't mark in-pagetable pages as non-uptodate any more. That broke a
> previous guarantee that mapped-into-user-process pages are always uptodate.
>
> - Check the return value of invalidate_complete_page(). It can fail if
> someone redirties a page after generic_file_direct_IO() wrote it back.
>
> But we still have a problem. If invalidate_inode_pages2() calls
> unmap_mapping_range(), that can cause zap_pte_range() to dirty the pagecache
> pages. That will redirty the page's buffers and will cause
> invalidate_complete_page() to fail.
>
> So, in generic_file_direct_IO() we do a complete pte shootdown on the file
> up-front, prior to writing back dirty pagecache. This is only done for
> O_DIRECT writes. It _could_ be done for O_DIRECT reads too, providing full
> mmap-vs-direct-IO coherency for both O_DIRECT reads and O_DIRECT writes, but
> permitting the pte shootdown on O_DIRECT reads trivially allows people to nuke
> other people's mapped pagecache.
>
> NFS also uses invalidate_inode_pages2() for handling server-side modification
> notifications. But in the NFS case the clear_page_dirty() in
> invalidate_inode_pages2() is sufficient, because NFS doesn't have to worry
> about the "dirty buffers against a clean page" problem. (I think)
>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> 25-akpm/include/linux/fs.h | 2 -
> 25-akpm/mm/filemap.c | 19 ++++++++++---
> 25-akpm/mm/truncate.c | 63 +++++++++++++++++++++++++++------------------
> 3 files changed, 55 insertions(+), 29 deletions(-)
>
> diff -puN mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix mm/truncate.c
> --- 25/mm/truncate.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:40:26 2004
> +++ 25-akpm/mm/truncate.c Fri Oct 22 16:05:26 2004
> @@ -65,6 +65,8 @@ truncate_complete_page(struct address_sp
> * be marked dirty at any time too. So we re-check the dirtiness inside
> * ->tree_lock. That provides exclusion against the __set_page_dirty
> * functions.
> + *
> + * Returns non-zero if the page was successfully invalidated.
> */
> static int
> invalidate_complete_page(struct address_space *mapping, struct page *page)
> @@ -281,50 +283,63 @@ unsigned long invalidate_inode_pages(str
> EXPORT_SYMBOL(invalidate_inode_pages);
>
> /**
> - * invalidate_inode_pages2 - remove all unmapped pages from an address_space
> + * invalidate_inode_pages2 - remove all pages from an address_space
> * @mapping - the address_space
> *
> - * invalidate_inode_pages2() is like truncate_inode_pages(), except for the case
> - * where the page is seen to be mapped into process pagetables. In that case,
> - * the page is marked clean but is left attached to its address_space.
> - *
> - * The page is also marked not uptodate so that a subsequent pagefault will
> - * perform I/O to bringthe page's contents back into sync with its backing
> - * store.
> + * Any pages which are found to be mapped into pagetables are unmapped prior to
> + * invalidation.
> *
> - * FIXME: invalidate_inode_pages2() is probably trivially livelockable.
> + * Returns -EIO if any pages could not be invalidated.
> */
> -void invalidate_inode_pages2(struct address_space *mapping)
> +int invalidate_inode_pages2(struct address_space *mapping)
> {
> struct pagevec pvec;
> pgoff_t next = 0;
> int i;
> + int ret = 0;
> + int did_full_unmap = 0;
>
> pagevec_init(&pvec, 0);
> - while (pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> - for (i = 0; i < pagevec_count(&pvec); i++) {
> + while (!ret && pagevec_lookup(&pvec, mapping, next, PAGEVEC_SIZE)) {
> + for (i = 0; !ret && i < pagevec_count(&pvec); i++) {
> struct page *page = pvec.pages[i];
>
> lock_page(page);
> - if (page->mapping == mapping) { /* truncate race? */
> - wait_on_page_writeback(page);
> - next = page->index + 1;
> - if (page_mapped(page)) {
> - clear_page_dirty(page);
> - ClearPageUptodate(page);
> + if (page->mapping != mapping) { /* truncate race? */
> + unlock_page(page);
> + continue;
> + }
> + wait_on_page_writeback(page);
> + next = page->index + 1;
> + while (page_mapped(page)) {
> + if (!did_full_unmap) {
> + /*
> + * Zap the rest of the file in one hit.
> + * FIXME: invalidate_inode_pages2()
> + * should take start/end offsets.
> + */
> + unmap_mapping_range(mapping,
> + page->index << PAGE_CACHE_SHIFT,
> + -1, 0);
> + did_full_unmap = 1;
> } else {
> - if (!invalidate_complete_page(mapping,
> - page)) {
> - clear_page_dirty(page);
> - ClearPageUptodate(page);
> - }
> + /*
> + * Just zap this page
> + */
> + unmap_mapping_range(mapping,
> + page->index << PAGE_CACHE_SHIFT,
> + (page->index << PAGE_CACHE_SHIFT)+1,
> + 0);
> }
> }
> + clear_page_dirty(page);
> + if (!invalidate_complete_page(mapping, page))
> + ret = -EIO;
> unlock_page(page);
> }
> pagevec_release(&pvec);
> cond_resched();
> }
> + return ret;
> }
> -
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> diff -puN include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix include/linux/fs.h
> --- 25/include/linux/fs.h~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:54:50 2004
> +++ 25-akpm/include/linux/fs.h Fri Oct 22 15:54:58 2004
> @@ -1404,7 +1404,7 @@ static inline void invalidate_remote_ino
> S_ISLNK(inode->i_mode))
> invalidate_inode_pages(inode->i_mapping);
> }
> -extern void invalidate_inode_pages2(struct address_space *mapping);
> +extern int invalidate_inode_pages2(struct address_space *mapping);
> extern void write_inode_now(struct inode *, int);
> extern int filemap_fdatawrite(struct address_space *);
> extern int filemap_flush(struct address_space *);
> diff -puN mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix mm/filemap.c
> --- 25/mm/filemap.c~invalidate_inode_pages-mmap-coherency-fix Fri Oct 22 15:55:06 2004
> +++ 25-akpm/mm/filemap.c Fri Oct 22 16:17:19 2004
> @@ -2214,7 +2214,8 @@ ssize_t generic_file_writev(struct file
> EXPORT_SYMBOL(generic_file_writev);
>
> /*
> - * Called under i_sem for writes to S_ISREG files
> + * Called under i_sem for writes to S_ISREG files. Returns -EIO if something
> + * went wrong during pagecache shootdown.
> */
> ssize_t
> generic_file_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
> @@ -2224,14 +2225,24 @@ generic_file_direct_IO(int rw, struct ki
> struct address_space *mapping = file->f_mapping;
> ssize_t retval;
>
> + /*
> + * If it's a write, unmap all mmappings of the file up-front. This
> + * will cause any pte dirty bits to be propagated into the pageframes
> + * for the subsequent filemap_write_and_wait().
> + */
> + if (rw == WRITE && mapping_mapped(mapping))
> + unmap_mapping_range(mapping, 0, -1, 0);
> +
> retval = filemap_write_and_wait(mapping);
> if (retval == 0) {
> retval = mapping->a_ops->direct_IO(rw, iocb, iov,
> offset, nr_segs);
> - if (rw == WRITE && mapping->nrpages)
> - invalidate_inode_pages2(mapping);
> + if (rw == WRITE && mapping->nrpages) {
> + int err = invalidate_inode_pages2(mapping);
> + if (err)
> + retval = err;
> + }
> }
> return retval;
> }
> -
> EXPORT_SYMBOL_GPL(generic_file_direct_IO);
> _
On Tue, Nov 09, 2004 at 08:15:30AM -0600, Dave Kleikamp wrote:
> Andrew & Andrea,
> What is the status of this patch? It would be nice to have it in the
> -mm4 kernel.
I think we should add an msync in front of O_DIRECT reads too (msync
won't hurt other users, and it'll provide full coherency); everything
else is ok (the msync can be added as an incremental patch). We applied
it to the SUSE kernel (without the msync) to fix your crash in pdflush.
Dave Kleikamp <[email protected]> wrote:
>
> Andrew & Andrea,
> What is the status of this patch? It would be nice to have it in the
> -mm4 kernel.
>
I found that the patch was causing I/O errors to be reported with one of
the LTP tests:
testcases/bin/growfiles -W gf15 -b -e 1 -u -r 1-49600 -I r -u -i 0 -L 120 Lgfile1
So I dropped it out pending a bit more work.
Andrea Arcangeli <[email protected]> wrote:
>
> On Tue, Nov 09, 2004 at 08:15:30AM -0600, Dave Kleikamp wrote:
> > Andrew & Andrea,
> > What is the status of this patch? It would be nice to have it in the
> > -mm4 kernel.
>
> I think we should add an msync in front of O_DIRECT reads too (msync
> won't hurt other users, and it'll provide full coherency), everything
> else is ok (the msync can be added as an incremental patch).
I don't think we have a simple way of syncing all ptes which map the
pages without actually shooting those ptes down via zap_page_range(). A
filemap_sync() will only sync the caller's mm's ptes.
I guess it would be pretty simple to add a sync_but_dont_unmap field to
struct zap_details and propagate that down, so we could reuse all the
unmap_vmas() code for an all-mms pte sync.
It could all get very expensive if someone has even a bit of the file
mapped, though. Testing is needed there.
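One possible shape of that idea, purely as a sketch (the field name,
placement, and TLB handling are hypothetical, not merged code):

	struct zap_details {
		struct vm_area_struct *nonlinear_vma;
		struct address_space *check_mapping;
		pgoff_t first_index;
		pgoff_t last_index;
		/* ...existing fields... */
		int sync_but_dont_unmap;	/* hypothetical new field */
	};

	/* ...and in zap_pte_range(), before the pte would be torn down: */
	if (details && details->sync_but_dont_unmap) {
		if (pte_dirty(pte) && PageUptodate(page))
			set_page_dirty(page);
		/* clean the pte but keep the mapping alive; TLB
		 * flushing is elided in this sketch */
		set_pte(ptep, pte_mkclean(pte));
		continue;
	}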