On s390 any write to a page (even from the kernel itself) sets the
architecture-specific page dirty bit. Thus when a page is written to via a
standard write, the HW dirty bit gets set, and when we later map and unmap the
page, page_remove_rmap() finds the dirty bit and calls set_page_dirty().
Dirtying a page which shouldn't be dirty can cause all sorts of problems for
filesystems. The bug we observed in practice is that buffers from the page get
freed, so when the page is later marked dirty and writeback writes it, XFS
crashes due to the assertion BUG_ON(!PagePrivate(page)) in page_buffers()
called from xfs_count_page_state().
A similar problem can also happen when the zero_user_segment() call from
xfs_vm_writepage() (or block_write_full_page() for that matter) sets the
hardware dirty bit during writeback; later the buffers get freed, and then the
page is unmapped.
Fix the issue by ignoring the s390 HW dirty bit for page cache pages in
page_mkclean() and page_remove_rmap(). This is safe because when a page is
marked writeable in the PTE it is also marked dirty in do_wp_page() or
do_page_fault(), and when the dirty bit is cleared by
clear_page_dirty_for_io(), the page gets write-protected in page_mkclean(). So
a pagecache page is writeable if and only if it is dirty.
CC: Martin Schwidefsky <[email protected]>
CC: Mel Gorman <[email protected]>
CC: [email protected]
Signed-off-by: Jan Kara <[email protected]>
---
mm/rmap.c | 16 ++++++++++++++--
1 files changed, 14 insertions(+), 2 deletions(-)
diff --git a/mm/rmap.c b/mm/rmap.c
index 0f3b7cd..6ce8ddb 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
struct address_space *mapping = page_mapping(page);
if (mapping) {
ret = page_mkclean_file(mapping, page);
- if (page_test_and_clear_dirty(page_to_pfn(page), 1))
+ /*
+ * We ignore dirty bit for pagecache pages. It is safe
+ * as page is marked dirty iff it is writeable (page is
+ * marked as dirty when it is made writeable and
+ * clear_page_dirty_for_io() writeprotects the page
+ * again).
+ */
+ if (PageSwapCache(page) &&
+ page_test_and_clear_dirty(page_to_pfn(page), 1))
ret = 1;
}
}
@@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
* this if the page is anon, so about to be freed; but perhaps
* not if it's in swapcache - there might be another pte slot
* containing the swap entry, but page not yet written to swap.
+ * For pagecache pages, we don't care about dirty bit in storage
+ * key because the page is writeable iff it is dirty (page is marked
+ * as dirty when it is made writeable and clear_page_dirty_for_io()
+ * writeprotects the page again).
*/
- if ((!anon || PageSwapCache(page)) &&
+ if (PageSwapCache(page) &&
page_test_and_clear_dirty(page_to_pfn(page), 1))
set_page_dirty(page);
/*
--
1.7.1
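The primitive at the heart of this thread: on s390 the dirty bit lives in a
per-physical-page "storage key" rather than in the pte, so any store to the
page sets it, kernel stores included. The helper the patch touches looked
roughly like this at the time (arch/s390/include/asm/page.h, paraphrased from
memory rather than quoted exactly):

static inline int page_test_and_clear_dirty(unsigned long pfn, int mapped)
{
	unsigned char skey;

	/* Read the storage key of the physical page... */
	skey = page_get_storage_key(pfn << PAGE_SHIFT);
	if (!(skey & _PAGE_CHANGED))
		return 0;
	/* ...and clear the changed (dirty) bit if it was set. */
	page_set_storage_key(pfn << PAGE_SHIFT, skey & ~_PAGE_CHANGED, mapped);
	return 1;
}

On other architectures page_test_and_clear_dirty() is defined away as 0, so
the checks in mm/rmap.c cost nothing elsewhere.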
On Mon, Oct 01, 2012 at 06:26:36PM +0200, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
>
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
>
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
>
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
>
> CC: Martin Schwidefsky <[email protected]>
> CC: Mel Gorman <[email protected]>
> CC: [email protected]
> Signed-off-by: Jan Kara <[email protected]>
Acked-by: Mel Gorman <[email protected]>
--
Mel Gorman
SUSE Labs
On Mon, 1 Oct 2012, Jan Kara wrote:
> On s390 any write to a page (even from kernel itself) sets architecture
> specific page dirty bit. Thus when a page is written to via standard write, HW
> dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> finds the dirty bit and calls set_page_dirty().
>
> Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> filesystems. The bug we observed in practice is that buffers from the page get
> freed, so when the page gets later marked as dirty and writeback writes it, XFS
> crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> from xfs_count_page_state().
What changed recently? Was XFS hardly used on s390 until now?
>
> Similar problem can also happen when zero_user_segment() call from
> xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> hardware dirty bit during writeback, later buffers get freed, and then page
> unmapped.
>
> Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> marked as writeable in PTE it is also marked dirty in do_wp_page() or
> do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> if and only if it is dirty.
Very interesting patch...
>
> CC: Martin Schwidefsky <[email protected]>
which I'd very much like Martin's opinion on...
> CC: Mel Gorman <[email protected]>
and I'm grateful to Mel's ack for reawakening me to it...
> CC: [email protected]
> Signed-off-by: Jan Kara <[email protected]>
but I think it's wrong.
> ---
> mm/rmap.c | 16 ++++++++++++++--
> 1 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0f3b7cd..6ce8ddb 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> struct address_space *mapping = page_mapping(page);
> if (mapping) {
> ret = page_mkclean_file(mapping, page);
> - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> + /*
> + * We ignore dirty bit for pagecache pages. It is safe
> + * as page is marked dirty iff it is writeable (page is
> + * marked as dirty when it is made writeable and
> + * clear_page_dirty_for_io() writeprotects the page
> + * again).
> + */
> + if (PageSwapCache(page) &&
> + page_test_and_clear_dirty(page_to_pfn(page), 1))
> ret = 1;
This part you could cut out: page_mkclean() is not used on SwapCache pages.
I believe you are safe to remove the page_test_and_clear_dirty() from here.
> }
> }
> @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> * this if the page is anon, so about to be freed; but perhaps
> * not if it's in swapcache - there might be another pte slot
> * containing the swap entry, but page not yet written to swap.
> + * For pagecache pages, we don't care about dirty bit in storage
> + * key because the page is writeable iff it is dirty (page is marked
> + * as dirty when it is made writeable and clear_page_dirty_for_io()
> + * writeprotects the page again).
> */
> - if ((!anon || PageSwapCache(page)) &&
> + if (PageSwapCache(page) &&
> page_test_and_clear_dirty(page_to_pfn(page), 1))
> set_page_dirty(page);
But here's where I think the problem is. You're assuming that all
filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
there's no such function, just a confusing maze of three) route as XFS.
But filesystems like tmpfs and ramfs (perhaps they're the only two
that matter here) don't participate in that, and wait for an mmap'ed
page to be seen modified by the user (usually via pte_dirty, but that's
a no-op on s390) before page is marked dirty; and page reclaim throws
away undirtied pages.
So, if I'm understanding right, with this change s390 would be in danger
of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
written with the write system call would already be PageDirty and secure.
You mention above that even the kernel writing to the page would mark
the s390 storage key dirty. I think that means that these shm and
tmpfs and ramfs pages would all have dirty storage keys just from the
clear_highpage() used to prepare them originally, and so would have
been found dirty anyway by the existing code here in page_remove_rmap(),
even though other architectures would regard them as clean and removable.
If that's the case, then maybe we'd do better just to mark them dirty
when faulted in the s390 case. Then your patch above should (I think)
be safe. Though I'd then be VERY tempted to adjust the SwapCache case
too (I've not thought through exactly what that patch would be, just
one or two suitably placed SetPageDirtys, I think), and eliminate
page_test_and_clear_dirty() altogether - no tears shed by any of us!
A separate worry came to mind as I thought about your patch: where
in page migration is s390's dirty storage key migrated from old page
to new? And if there is a problem there, that too should be fixed
by what I propose in the previous paragraph.
Hugh
On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> > <SNIP>
> > CC: Mel Gorman <[email protected]>
>
> and I'm grateful to Mel's ack for reawakening me to it...
>
> > CC: [email protected]
> > Signed-off-by: Jan Kara <[email protected]>
>
> but I think it's wrong.
>
Dang.
> > ---
> > mm/rmap.c | 16 ++++++++++++++--
> > 1 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > + /*
> > + * We ignore dirty bit for pagecache pages. It is safe
> > + * as page is marked dirty iff it is writeable (page is
> > + * marked as dirty when it is made writeable and
> > + * clear_page_dirty_for_io() writeprotects the page
> > + * again).
> > + */
> > + if (PageSwapCache(page) &&
> > + page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
>
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
>
> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> > * this if the page is anon, so about to be freed; but perhaps
> > * not if it's in swapcache - there might be another pte slot
> > * containing the swap entry, but page not yet written to swap.
> > + * For pagecache pages, we don't care about dirty bit in storage
> > + * key because the page is writeable iff it is dirty (page is marked
> > + * as dirty when it is made writeable and clear_page_dirty_for_io()
> > + * writeprotects the page again).
> > */
> > - if ((!anon || PageSwapCache(page)) &&
> > + if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
>
> But here's where I think the problem is. You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
>
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
>
In the case of ramfs, what marks the page clean so it could be discarded? It
does not participate in dirty accounting so it's not going to clear the
dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
handler that would use an end_io handler to clear the page after "IO"
completes. I am not seeing how a ramfs page can get discarded at the moment.
shm and tmpfs are indeed different and I did not take them into account
(ba dum tisch) when reviewing. For those pages would it be sufficient to
check the following?
PageSwapCache(page) || (page->mapping && !mapping_cap_account_dirty(page->mapping))
The problem the patch dealt with involved buffers associated with the page
and that shouldn't be a problem for tmpfs, right? I recognise that this
might work just because of co-incidence and set off your "Yuck" detector
and you'll prefer the proposed solution below.
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty. I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
>
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case. Then your patch above should (I think)
> be safe. Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
>
Do you mean something like this?
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..c66166f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else {
inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
- if (flags & FAULT_FLAG_WRITE) {
+
+ /*
+ * s390 depends on the dirty flag from the storage key
+ * being propagated when the page is unmapped from the
+ * page tables. For dirty-accounted mapping, we instead
+ * depend on the page being marked dirty on writes and
+ * being write-protected on clear_page_dirty_for_io.
+ * The same protection does not apply for tmpfs pages
+ * that do not participate in dirty accounting so mark
+ * them dirty at fault time to avoid the data being
+ * lost
+ */
+	if ((flags & FAULT_FLAG_WRITE) ||
+	    !mapping_cap_account_dirty(page->mapping)) {
dirty_page = page;
get_page(dirty_page);
}
Could something like this result in more writes to swap? Let's say there
is an unmapped tmpfs file with data on it -- a process maps it, reads the
entire mapping and exits. The page is now dirty and potentially will have
to be rewritten to swap. That seems bad. Did I miss your point?
> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new? And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
>
hmm, very good question. It should have been checked in
migrate_page_copy() where it could be done under the page lock before
the PageDirty check. Martin?
--
Mel Gorman
SUSE Labs
On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
>
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> >
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
>
> What changed recently? Was XFS hardly used on s390 until now?
One thing that changed is that the zero_user_segment for the remaining bytes between
i_size and the end of the page has been moved to block_write_full_page_endio, see
git commit eebd2aa355692afa. That changed the timing of the race window in regard
to map/unmap of the page by user space. And yes XFS is in use on s390.
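For context, the zeroing in question sits at the tail-page case of
block_write_full_page_endio(); abridged from fs/buffer.c of that era (from
memory, so treat as approximate), it is a plain kernel store into the page in
the middle of writeback, which on s390 re-dirties the storage key:

	/*
	 * The page straddles i_size. It must be zeroed out on each and
	 * every writepage invocation because it may be mmapped: writes to
	 * the zone between end-of-file and end-of-page must not reach the
	 * file, so the tail is re-zeroed every time.
	 */
	zero_user_segment(page, offset, PAGE_CACHE_SIZE);
	return __block_write_full_page(inode, page, get_block, wbc, handler);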
> >
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> >
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
>
> Very interesting patch...
Yes, it is an interesting idea. I really like the part that we'll use fewer
storage key operations, as these are freaking expensive.
> >
> > CC: Martin Schwidefsky <[email protected]>
>
> which I'd very much like Martin's opinion on...
Until you pointed out the short-comings of the patch I really liked it ..
> > ---
> > mm/rmap.c | 16 ++++++++++++++--
> > 1 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > + /*
> > + * We ignore dirty bit for pagecache pages. It is safe
> > + * as page is marked dirty iff it is writeable (page is
> > + * marked as dirty when it is made writeable and
> > + * clear_page_dirty_for_io() writeprotects the page
> > + * again).
> > + */
> > + if (PageSwapCache(page) &&
> > + page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
>
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
future? At least we should add a comment there.
> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> > * this if the page is anon, so about to be freed; but perhaps
> > * not if it's in swapcache - there might be another pte slot
> > * containing the swap entry, but page not yet written to swap.
> > + * For pagecache pages, we don't care about dirty bit in storage
> > + * key because the page is writeable iff it is dirty (page is marked
> > + * as dirty when it is made writeable and clear_page_dirty_for_io()
> > + * writeprotects the page again).
> > */
> > - if ((!anon || PageSwapCache(page)) &&
> > + if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
>
> But here's where I think the problem is. You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
>
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
>
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
The patch relies on the software dirty bit tracking for file-backed pages;
if dirty bit tracking is not done for tmpfs and ramfs, we are borked.
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty. I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
No, the clear_highpage() will set the dirty bit in the storage key but
the SetPageUptodate will clear the complete storage key including the
dirty bit.
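The SetPageUptodate() special case Martin describes (the "hack in
SetPageUptodate()" mentioned again below) looked roughly like this in
include/linux/page-flags.h at the time; paraphrased from memory, so the exact
helper call may differ by kernel version:

static inline void SetPageUptodate(struct page *page)
{
#ifdef CONFIG_S390
	if (!test_and_set_bit(PG_uptodate, &page->flags))
		/* Reset the whole storage key, dirty bit included, so a
		 * freshly prepared (e.g. zeroed) page starts out clean. */
		page_set_storage_key(page_to_pfn(page) << PAGE_SHIFT,
				     PAGE_DEFAULT_KEY, 0);
#else
	smp_wmb();	/* data must be visible before PG_uptodate is set */
	set_bit(PG_uptodate, &(page)->flags);
#endif
}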
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case. Then your patch above should (I think)
> be safe. Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
I am seriously tempted to switch to pure software dirty bits by using
page protection for writable but clean pages. The worry is the number of
additional protection faults we would get. But as we do software dirty
bit tracking for the most part anyway this might not be as bad as it
used to be.
> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new? And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
That is covered by the SetPageUptodate() in migrate_page_copy().
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> On Mon, 1 Oct 2012, Jan Kara wrote:
>
> > On s390 any write to a page (even from kernel itself) sets architecture
> > specific page dirty bit. Thus when a page is written to via standard write, HW
> > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > finds the dirty bit and calls set_page_dirty().
> >
> > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > filesystems. The bug we observed in practice is that buffers from the page get
> > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > from xfs_count_page_state().
>
> What changed recently? Was XFS hardly used on s390 until now?
The problem was originally hit on SLE11-SP2 which is 3.0 based after
migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
XFS just started to be more peevish about what pages it gets between these
two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
up).
> > Similar problem can also happen when zero_user_segment() call from
> > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > hardware dirty bit during writeback, later buffers get freed, and then page
> > unmapped.
> >
> > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > if and only if it is dirty.
>
> Very interesting patch...
Originally, I even wanted to rip out pte dirty bit handling for shared
file pages but in the end that seemed too bold and unnecessary for my
problem ;)
> > CC: [email protected]
> > Signed-off-by: Jan Kara <[email protected]>
>
> but I think it's wrong.
Thanks for having a look.
> > ---
> > mm/rmap.c | 16 ++++++++++++++--
> > 1 files changed, 14 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 0f3b7cd..6ce8ddb 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > struct address_space *mapping = page_mapping(page);
> > if (mapping) {
> > ret = page_mkclean_file(mapping, page);
> > - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > + /*
> > + * We ignore dirty bit for pagecache pages. It is safe
> > + * as page is marked dirty iff it is writeable (page is
> > + * marked as dirty when it is made writeable and
> > + * clear_page_dirty_for_io() writeprotects the page
> > + * again).
> > + */
> > + if (PageSwapCache(page) &&
> > + page_test_and_clear_dirty(page_to_pfn(page), 1))
> > ret = 1;
>
> This part you could cut out: page_mkclean() is not used on SwapCache pages.
> I believe you are safe to remove the page_test_and_clear_dirty() from here.
OK, will do.
> > }
> > }
> > @@ -1183,8 +1191,12 @@ void page_remove_rmap(struct page *page)
> > * this if the page is anon, so about to be freed; but perhaps
> > * not if it's in swapcache - there might be another pte slot
> > * containing the swap entry, but page not yet written to swap.
> > + * For pagecache pages, we don't care about dirty bit in storage
> > + * key because the page is writeable iff it is dirty (page is marked
> > + * as dirty when it is made writeable and clear_page_dirty_for_io()
> > + * writeprotects the page again).
> > */
> > - if ((!anon || PageSwapCache(page)) &&
> > + if (PageSwapCache(page) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
>
> But here's where I think the problem is. You're assuming that all
> filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> there's no such function, just a confusing maze of three) route as XFS.
>
> But filesystems like tmpfs and ramfs (perhaps they're the only two
> that matter here) don't participate in that, and wait for an mmap'ed
> page to be seen modified by the user (usually via pte_dirty, but that's
> a no-op on s390) before page is marked dirty; and page reclaim throws
> away undirtied pages.
I admit I haven't thought of tmpfs and similar. After some discussion Mel
pointed me to the code in mmap which makes a difference. So if I get it
right, the difference which causes us problems is that on tmpfs we map the
page writeably even during read-only fault. OK, then if I make the above
code in page_remove_rmap():
if ((PageSwapCache(page) ||
(!anon && !mapping_cap_account_dirty(page->mapping))) &&
page_test_and_clear_dirty(page_to_pfn(page), 1))
set_page_dirty(page);
Things should be ok (modulo the ugliness of this condition), right?
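The mmap code Mel pointed to is presumably vma_wants_writenotify() in
mm/mmap.c, which decides whether a shared writable mapping starts out
write-protected; abridged from the ~3.6 source, from memory:

int vma_wants_writenotify(struct vm_area_struct *vma)
{
	vm_flags_t vm_flags = vma->vm_flags;

	/* If it was private or non-writable, the write bit is already clear */
	if ((vm_flags & (VM_WRITE|VM_SHARED)) != (VM_WRITE|VM_SHARED))
		return 0;

	/* The backer wishes to know when pages are first written to? */
	if (vma->vm_ops && vma->vm_ops->page_mkwrite)
		return 1;

	/* Specialty mappings (checks abridged) */
	if (vm_flags & VM_PFNMAP)
		return 0;

	/* Can the mapping track the dirty pages? tmpfs/ramfs cannot, so
	 * their pages are mapped writeably even on a read-only fault. */
	return vma->vm_file && vma->vm_file->f_mapping &&
		mapping_cap_account_dirty(vma->vm_file->f_mapping);
}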
> So, if I'm understanding right, with this change s390 would be in danger
> of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> written with the write system call would already be PageDirty and secure.
>
> You mention above that even the kernel writing to the page would mark
> the s390 storage key dirty. I think that means that these shm and
> tmpfs and ramfs pages would all have dirty storage keys just from the
> clear_highpage() used to prepare them originally, and so would have
> been found dirty anyway by the existing code here in page_remove_rmap(),
> even though other architectures would regard them as clean and removable.
Yes, except as Martin notes, SetPageUptodate() clears them again so that
doesn't work for us.
> If that's the case, then maybe we'd do better just to mark them dirty
> when faulted in the s390 case. Then your patch above should (I think)
> be safe. Though I'd then be VERY tempted to adjust the SwapCache case
> too (I've not thought through exactly what that patch would be, just
> one or two suitably placed SetPageDirtys, I think), and eliminate
> page_test_and_clear_dirty() altogether - no tears shed by any of us!
If we want to get rid of page_test_and_clear_dirty() completely (and a
hack in SetPageUptodate()) it should be possible. But we would have to
change mmap to map pages read-only for read-only faults of tmpfs pages at
least on s390 and then somehow fix the SwapCache handling...
> A separate worry came to mind as I thought about your patch: where
> in page migration is s390's dirty storage key migrated from old page
> to new? And if there is a problem there, that too should be fixed
> by what I propose in the previous paragraph.
I'd think so but I'll let Martin comment on this.
Honza
On Tue, 9 Oct 2012, Mel Gorman wrote:
> On Mon, Oct 08, 2012 at 09:24:40PM -0700, Hugh Dickins wrote:
> >
> > So, if I'm understanding right, with this change s390 would be in danger
> > of discarding shm, and mmap'ed tmpfs and ramfs pages - whereas pages
> > written with the write system call would already be PageDirty and secure.
> >
>
> In the case of ramfs, what marks the page clean so it could be discarded? It
> does not participate in dirty accounting so it's not going to clear the
> dirty flag in clear_page_dirty_for_io(). It doesn't have a writepage
> handler that would use an end_io handler to clear the page after "IO"
> completes. I am not seeing how a ramfs page can get discarded at the moment.
But we don't have a page clean bit: we have a page dirty bit, and where
is that set in the ramfs read-fault case? I've not experimented to check,
maybe you're right and ramfs is exempt from the issue. I thought it was
__do_fault() which does the set_page_dirty, but only if FAULT_FLAG_WRITE.
Ah, you quote almost the very place further down.
>
> shm and tmpfs are indeed different and I did not take them into account
> (ba dum tisch) when reviewing. For those pages would it be sufficient to
> check the following?
>
> PageSwapCache(page) || (page->mapping && !bdi_cap_account_dirty(page->mapping)
Something like that, yes: I've a possible patch I'll put in reply to Jan.
>
> The problem the patch dealt with involved buffers associated with the page
> and that shouldn't be a problem for tmpfs, right?
Right, though I'm now beginning to wonder what the underlying bug is.
It seems to me that we have a bug and an optimization on our hands,
and have rushed into the optimization which would avoid the bug,
without considering what the actual bug is. More in reply to Jan.
> I recognise that this
> might work just because of co-incidence and set off your "Yuck" detector
> and you'll prefer the proposed solution below.
No, I was mistaken to think that s390 would have dirty pages where
others had clean, Martin has now explained that SetPageUptodate cleans.
I didn't mind continuing an (imagined) inefficiency in s390, but I don't
want to make it more inefficient.
>
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty. I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
> >
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case. Then your patch above should (I think)
> > be safe. Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!
So that fantasy was all wrong: appealing, but wrong.
> >
>
> Do you mean something like this?
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5736170..c66166f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3316,7 +3316,20 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> } else {
> inc_mm_counter_fast(mm, MM_FILEPAGES);
> page_add_file_rmap(page);
> - if (flags & FAULT_FLAG_WRITE) {
> +
> + /*
> + * s390 depends on the dirty flag from the storage key
> + * being propagated when the page is unmapped from the
> + * page tables. For dirty-accounted mapping, we instead
> + * depend on the page being marked dirty on writes and
> + * being write-protected on clear_page_dirty_for_io.
> + * The same protection does not apply for tmpfs pages
> + * that do not participate in dirty accounting so mark
> + * them dirty at fault time to avoid the data being
> + * lost
> + */
> + if (flags & FAULT_FLAG_WRITE ||
> + !bdi_cap_account_dirty(page->mapping)) {
> dirty_page = page;
> get_page(dirty_page);
> }
>
> Could something like this result in more writes to swap? Lets say there
> is an unmapped tmpfs file with data on it -- a process maps it, reads the
> entire mapping and exits. The page is now dirty and potentially will have
> to be rewritten to swap. That seems bad. Did I miss your point?
My point was that I mistakenly thought s390 must already be behaving
like that, so wanted it to continue that way, but with cleaner source.
But the CONFIG_S390 in SetPageUptodate makes sure that the zeroed page
starts out storage-key-clean: so you're exactly right, my suggestion
would result in more writes to swap for it, which is not acceptable.
(Plus, having insisted that ramfs is also affected, I went on
to forget that, and was imagining a simple change in mm/shmem.c.)
Hugh
>
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new? And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
> >
>
> hmm, very good question. It should have been checked in
> migrate_page_copy() where it could be done under the page lock before
> the PageDirty check. Martin?
>
> --
> Mel Gorman
> SUSE Labs
On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> >
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > >
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> >
> > What changed recently? Was XFS hardly used on s390 until now?
>
> One thing that changed is that the zero_user_segment for the remaining bytes between
> i_size and the end of the page has been moved to block_write_full_page_endio, see
> git commit eebd2aa355692afa. That changed the timing of the race window in regard
> to map/unmap of the page by user space. And yes XFS is in use on s390.
February 2008: I think we have different ideas of "recently" ;)
>
> > >
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.
> > >
> > > Fix the issue by ignoring s390 HW dirty bit for page cache pages in
> > > page_mkclean() and page_remove_rmap(). This is safe because when a page gets
> > > marked as writeable in PTE it is also marked dirty in do_wp_page() or
> > > do_page_fault(). When the dirty bit is cleared by clear_page_dirty_for_io(),
> > > the page gets writeprotected in page_mkclean(). So pagecache page is writeable
> > > if and only if it is dirty.
> >
> > Very interesting patch...
>
> Yes, it is an interesting idea. I really like the part that we'll use less storage
> key operations, as these are freaking expensive.
As I said to Mel and will repeat to Jan, though an optimization would
be nice, I don't think we should necessarily mix it with the bugfix.
>
> > >
> > > CC: Martin Schwidefsky <[email protected]>
> >
> > which I'd very much like Martin's opinion on...
>
> Until you pointed out the short-comings of the patch I really liked it ..
>
> > > ---
> > > mm/rmap.c | 16 ++++++++++++++--
> > > 1 files changed, 14 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/rmap.c b/mm/rmap.c
> > > index 0f3b7cd..6ce8ddb 100644
> > > --- a/mm/rmap.c
> > > +++ b/mm/rmap.c
> > > @@ -973,7 +973,15 @@ int page_mkclean(struct page *page)
> > > struct address_space *mapping = page_mapping(page);
> > > if (mapping) {
> > > ret = page_mkclean_file(mapping, page);
> > > - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > + /*
> > > + * We ignore dirty bit for pagecache pages. It is safe
> > > + * as page is marked dirty iff it is writeable (page is
> > > + * marked as dirty when it is made writeable and
> > > + * clear_page_dirty_for_io() writeprotects the page
> > > + * again).
> > > + */
> > > + if (PageSwapCache(page) &&
> > > + page_test_and_clear_dirty(page_to_pfn(page), 1))
> > > ret = 1;
> >
> > This part you could cut out: page_mkclean() is not used on SwapCache pages.
> > I believe you are safe to remove the page_test_and_clear_dirty() from here.
>
> Hmm, who guarantees that page_mkclean won't be used for SwapCache in the
> future? At least we should add a comment there.
I set out to do so, to add a comment there; but honestly, it's a strange
place for such a comment when there's no longer even the code to comment
upon. And page_mkclean_file(), called in the line above, already says
BUG_ON(PageAnon(page)), so it would soon fire if we ever make a change
that sends PageSwapCache pages this way. It is possible that one day we
shall want to send tmpfs and swapcache down this route, I'm not ruling
that out; but then we shall have to extend page_mkclean(), yes.
>
> The patch relies on the software dirty bit tracking for file backed pages,
> if dirty bit tracking is not done for tmpfs and ramfs we are borked.
>
> > You mention above that even the kernel writing to the page would mark
> > the s390 storage key dirty. I think that means that these shm and
> > tmpfs and ramfs pages would all have dirty storage keys just from the
> > clear_highpage() used to prepare them originally, and so would have
> > been found dirty anyway by the existing code here in page_remove_rmap(),
> > even though other architectures would regard them as clean and removable.
>
> No, the clear_highpage() will set the dirty bit in the storage key but
> the SetPageUptodate will clear the complete storage key including the
> dirty bit.
Ah, thank you Martin, that clears that up...
>
> > If that's the case, then maybe we'd do better just to mark them dirty
> > when faulted in the s390 case. Then your patch above should (I think)
> > be safe. Though I'd then be VERY tempted to adjust the SwapCache case
> > too (I've not thought through exactly what that patch would be, just
> > one or two suitably placed SetPageDirtys, I think), and eliminate
> > page_test_and_clear_dirty() altogether - no tears shed by any of us!
... so I should not hurt your performance with a change of that kind.
>
> I am seriously tempted to switch to pure software dirty bits by using
> page protection for writable but clean pages. The worry is the number of
> additional protection faults we would get. But as we do software dirty
> bit tracking for the most part anyway this might not be as bad as it
> used to be.
That's exactly the same reason why tmpfs opts out of dirty tracking, fear
of unnecessary extra faults. Anomalous as s390 is here, tmpfs is being
anomalous too, and I'd be a hypocrite to push for you to make that change.
>
> > A separate worry came to mind as I thought about your patch: where
> > in page migration is s390's dirty storage key migrated from old page
> > to new? And if there is a problem there, that too should be fixed
> > by what I propose in the previous paragraph.
>
> That is covered by the SetPageUptodate() in migrate_page_copy().
I don't think so: that makes sure that the newpage is not marked
dirty in storage key just because of the copy_highpage to it; but
I see nothing to mark the newpage dirty in storage key when the
old page was dirty there.
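For reference, the relevant part of migrate_page_copy() (mm/migrate.c of that
era, abridged from memory): SetPageUptodate() resets the new page's storage
key after the copy, and only the software PageDirty bit is transferred;
nothing consults the old page's storage key:

void migrate_page_copy(struct page *newpage, struct page *page)
{
	copy_highpage(newpage, page);	/* dirties newpage's storage key */

	if (PageUptodate(page))
		SetPageUptodate(newpage);	/* resets it again on s390 */
	/* ... other page flags elided ... */
	if (PageDirty(page)) {
		/* Transfers the software dirty bit only; a dirty bit living
		 * solely in the old page's storage key is never seen here. */
		clear_page_dirty_for_io(page);
		__set_page_dirty_nobuffers(newpage);
	}
	/* ... */
}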
Hugh
On Tue, 9 Oct 2012, Jan Kara wrote:
> On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > On Mon, 1 Oct 2012, Jan Kara wrote:
> >
> > > On s390 any write to a page (even from kernel itself) sets architecture
> > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > finds the dirty bit and calls set_page_dirty().
> > >
> > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > filesystems. The bug we observed in practice is that buffers from the page get
> > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > from xfs_count_page_state().
> >
> > What changed recently? Was XFS hardly used on s390 until now?
> The problem was originally hit on SLE11-SP2 which is 3.0 based after
> migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> XFS just started to be more peevish about what pages it gets between these
> two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> up).
Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
ASSERT usually compiled out, stumbling later in page_buffers() as you say.
>
> > > Similar problem can also happen when zero_user_segment() call from
> > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > unmapped.
Similar problem, or is that the whole of the problem? Where else does
the page get written to, after clearing page dirty? (It may not be worth
spending time to answer me, I feel I'm wasting too much time on this.)
I keep trying to put my finger on the precise bug. I said in earlier
mails to Mel and to Martin that we're mixing a bugfix and an optimization,
but I cannot quite point to the bug. Could one say that it's precisely at
the "page straddles i_size" zero_user_segment(), in XFS or in other FSes,
and that the storage key ought to be re-cleaned after that?
What if one day I happened to copy that code into shmem_writepage()?
I've no intention to do so! And it wouldn't cause a BUG. Ah, and we
never write shmem to swap while it's still mapped, so it wouldn't even
have a chance to redirty the page in page_remove_rmap().
I guess I'm worrying too much; but it's not crystal clear to me why any
!mapping_cap_account_dirty mapping would necessarily not have the problem.
> > But here's where I think the problem is. You're assuming that all
> > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > there's no such function, just a confusing maze of three) route as XFS.
> >
> > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > that matter here) don't participate in that, and wait for an mmap'ed
> > page to be seen modified by the user (usually via pte_dirty, but that's
> > a no-op on s390) before page is marked dirty; and page reclaim throws
> > away undirtied pages.
> I admit I haven't thought of tmpfs and similar. After some discussion Mel
> pointed me to the code in mmap which makes a difference. So if I get it
> right, the difference which causes us problems is that on tmpfs we map the
> page writeably even during read-only fault. OK, then if I make the above
> code in page_remove_rmap():
> if ((PageSwapCache(page) ||
> (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> page_test_and_clear_dirty(page_to_pfn(page), 1))
> set_page_dirty(page);
>
> Things should be ok (modulo the ugliness of this condition), right?
(Setting aside my reservations above...) That's almost exactly right, but
I think the issue of a racing truncation (which could reset page->mapping
to NULL at any moment) means we have to be a bit more careful. Usually
we guard against that with page lock, but here we can rely on mapcount.
page_mapping(page), with its built-in PageSwapCache check, actually ends
up making the condition look less ugly; and so far as I could tell,
the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
when we are left with its VM_BUG_ON(PageSlab(page))).
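For reference, page_mapping() of that era (a static inline in
include/linux/mm.h; quoted from memory, so approximate):

static inline struct address_space *page_mapping(struct page *page)
{
	struct address_space *mapping = page->mapping;

	VM_BUG_ON(PageSlab(page));
	if (unlikely(PageSwapCache(page)))
		mapping = &swapper_space;
	else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
		mapping = NULL;
	return mapping;
}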
But please look this over very critically and test (and if you like it,
please adopt it as your own): I'm not entirely convinced yet myself.
(One day, I do want to move that block further down page_remove_rmap(),
outside the mem_cgroup_[begin,end]_update_stat() bracketing: I don't think
there's an actual problem at present in calling set_page_dirty() there,
but I have seen patches which could give it a lock-ordering issue, so
better to untangle them. No reason to muddle that in with your fix,
but I thought I'd mention it while we're all staring at this.)
Hugh
---
mm/rmap.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)
--- 3.6.0+/mm/rmap.c 2012-10-09 14:01:12.356379322 -0700
+++ linux/mm/rmap.c 2012-10-09 14:58:48.160445605 -0700
@@ -56,6 +56,7 @@
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
#include <linux/hugetlb.h>
+#include <linux/backing-dev.h>
#include <asm/tlbflush.h>
@@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
if (page_mapped(page)) {
struct address_space *mapping = page_mapping(page);
- if (mapping) {
+ if (mapping)
ret = page_mkclean_file(mapping, page);
- if (page_test_and_clear_dirty(page_to_pfn(page), 1))
- ret = 1;
- }
}
return ret;
@@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
*/
void page_remove_rmap(struct page *page)
{
+ struct address_space *mapping = page_mapping(page);
bool anon = PageAnon(page);
bool locked;
unsigned long flags;
@@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
* this if the page is anon, so about to be freed; but perhaps
* not if it's in swapcache - there might be another pte slot
* containing the swap entry, but page not yet written to swap.
+ *
+ * And we can skip it on file pages, so long as the filesystem
+ * participates in dirty tracking; but need to catch shm and tmpfs
+ * and ramfs pages which have been modified since creation by read
+ * fault.
+ *
+ * Note that mapping must be decided above, before decrementing
+ * mapcount (which luckily provides a barrier): once page is unmapped,
+ * it could be truncated and page->mapping reset to NULL at any moment.
+ * Note also that we are relying on page_mapping(page) to set mapping
+ * to &swapper_space when PageSwapCache(page).
*/
- if ((!anon || PageSwapCache(page)) &&
+ if (mapping && !mapping_cap_account_dirty(mapping) &&
page_test_and_clear_dirty(page_to_pfn(page), 1))
set_page_dirty(page);
/*
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > >
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > >
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
...
> > > > Similar problem can also happen when zero_user_segment() call from
> > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > unmapped.
>
> Similar problem, or is that the whole of the problem? Where else does
> the page get written to, after clearing page dirty? (It may not be worth
> spending time to answer me, I feel I'm wasting too much time on this.)
I think the devil is in "after clearing page dirty" -
clear_page_dirty_for_io() has an optimization whereby it does not bother
transferring pte or storage key dirty bits to the page dirty bit when the page
is not mapped. On s390 that results in the storage key dirty bit staying set
once a buffered write modifies the page.
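The code path in question, abridged (mm/page-writeback.c of that era, from
memory); the storage key is only folded back into PageDirty via
page_mkclean(), which starts with an if (page_mapped(page)) check:

int clear_page_dirty_for_io(struct page *page)
{
	struct address_space *mapping = page_mapping(page);

	BUG_ON(!PageLocked(page));
	if (mapping && mapping_cap_account_dirty(mapping)) {
		/*
		 * page_mkclean() transfers the pte/storage-key dirty bits
		 * and write-protects the page - but it does nothing when
		 * the page is not mapped, so an unmapped page dirtied by
		 * a buffered write keeps its dirty storage key.
		 */
		if (page_mkclean(page))
			set_page_dirty(page);
		/* ... dirty accounting elided ... */
	}
	return TestClearPageDirty(page);
}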
BTW there's no other place I'm aware of (and I was looking for some time
before I realized that storage key could remain set from buffered write as
described above).
>
> I keep trying to put my finger on the precise bug. I said in earlier
> mails to Mel and to Martin that we're mixing a bugfix and an optimization,
> but I cannot quite point to the bug. Could one say that it's precisely at
> the "page straddles i_size" zero_user_segment(), in XFS or in other FSes?
> that the storage key ought to be re-cleaned after that?
I think the precise bug is that we can leave the dirty bit in the storage key
set after writes from the kernel, while some parts of the kernel assume the
bit can be set only via a user mapping.
In a perfect world with infinite computation resources, all writes to
pages from kernel could look like:
	/* assume a locked page */
	page_mkclean(page);
	if (page_test_and_clear_dirty(page_to_pfn(page), 1))
		set_page_dirty(page);
	/* ... write to the page ... */
	page_test_and_clear_dirty(page_to_pfn(page), 1); /* clean storage key */
This would be bulletproof ... and ridiculously expensive.
> What if one day I happened to copy that code into shmem_writepage()?
> I've no intention to do so! And it wouldn't cause a BUG. Ah, and we
> never write shmem to swap while it's still mapped, so it wouldn't even
> have a chance to redirty the page in page_remove_rmap().
>
> I guess I'm worrying too much; but it's not crystal clear to me why any
> !mapping_cap_account_dirty mapping would necessarily not have the problem.
They can have a problem - if they care that page_remove_rmap() can mark as
dirty a page which was never written to via mmap. So far we are lucky and
all !mapping_cap_account_dirty users don't care.
> > > But here's where I think the problem is. You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > >
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> > I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > if ((PageSwapCache(page) ||
> > (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > page_test_and_clear_dirty(page_to_pfn(page), 1))
> > set_page_dirty(page);
> >
> > Things should be ok (modulo the ugliness of this condition), right?
>
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful. Usually
> we guard against that with page lock, but here we can rely on mapcount.
>
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
>
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
OK, I'll push the kernel with your updated patch to our build machines
and let it run there for a few days (it took about a day to reproduce the
issue originally). Thanks a lot for helping me with this.
Honza
> mm/rmap.c | 20 +++++++++++++++-----
> 1 file changed, 15 insertions(+), 5 deletions(-)
>
> --- 3.6.0+/mm/rmap.c 2012-10-09 14:01:12.356379322 -0700
> +++ linux/mm/rmap.c 2012-10-09 14:58:48.160445605 -0700
> @@ -56,6 +56,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/migrate.h>
> #include <linux/hugetlb.h>
> +#include <linux/backing-dev.h>
>
> #include <asm/tlbflush.h>
>
> @@ -926,11 +927,8 @@ int page_mkclean(struct page *page)
>
> if (page_mapped(page)) {
> struct address_space *mapping = page_mapping(page);
> - if (mapping) {
> + if (mapping)
> ret = page_mkclean_file(mapping, page);
> - if (page_test_and_clear_dirty(page_to_pfn(page), 1))
> - ret = 1;
> - }
> }
>
> return ret;
> @@ -1116,6 +1114,7 @@ void page_add_file_rmap(struct page *pag
> */
> void page_remove_rmap(struct page *page)
> {
> + struct address_space *mapping = page_mapping(page);
> bool anon = PageAnon(page);
> bool locked;
> unsigned long flags;
> @@ -1138,8 +1137,19 @@ void page_remove_rmap(struct page *page)
> * this if the page is anon, so about to be freed; but perhaps
> * not if it's in swapcache - there might be another pte slot
> * containing the swap entry, but page not yet written to swap.
> + *
> + * And we can skip it on file pages, so long as the filesystem
> + * participates in dirty tracking; but need to catch shm and tmpfs
> + * and ramfs pages which have been modified since creation by read
> + * fault.
> + *
> + * Note that mapping must be decided above, before decrementing
> + * mapcount (which luckily provides a barrier): once page is unmapped,
> + * it could be truncated and page->mapping reset to NULL at any moment.
> + * Note also that we are relying on page_mapping(page) to set mapping
> + * to &swapper_space when PageSwapCache(page).
> */
> - if ((!anon || PageSwapCache(page)) &&
> + if (mapping && !mapping_cap_account_dirty(mapping) &&
> page_test_and_clear_dirty(page_to_pfn(page), 1))
> set_page_dirty(page);
> /*
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, 10 Oct 2012, Jan Kara wrote:
> On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > >
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > >
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> ...
> > > > > Similar problem can also happen when zero_user_segment() call from
> > > > > xfs_vm_writepage() (or block_write_full_page() for that matter) set the
> > > > > hardware dirty bit during writeback, later buffers get freed, and then page
> > > > > unmapped.
> >
> > Similar problem, or is that the whole of the problem? Where else does
> > the page get written to, after clearing page dirty? (It may not be worth
> > spending time to answer me, I feel I'm wasting too much time on this.)
> I think the devil is in "after clearing page dirty" -
> clear_page_dirty_for_io() has an optimization that it does not bother
> transfering pte or storage key dirty bits to page dirty bit when page is
> not mapped.
Right, it's "if (page_mkclean) set_page_dirty".
> On s390 that results in storage key dirty bit set once buffered
> write modifies the page.
Ah yes, because set_page_dirty does not clean the storage key,
as perhaps I was expecting (and we wouldn't want to add that if
everything is working without).
>
> BTW there's no other place I'm aware of (and I was looking for some time
> before I realized that storage key could remain set from buffered write as
> described above).
> > I guess I'm worrying too much; but it's not crystal clear to me why any
> > !mapping_cap_account_dirty mapping would necessarily not have the problem.
> They could have a problem - if they cared that page_remove_rmap() can
> mark dirty a page which was never written to via mmap. So far we are
> lucky and none of the !mapping_cap_account_dirty users cares.
Yes, I think it's good enough: it's a workaround rather than a thorough
future-proof fix; a workaround with a nice optimization bonus for s390.
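[For reference, the check being leaned on boils down to a bdi capability
test; roughly this, paraphrased from that era's headers, not verbatim:

	static inline bool bdi_cap_account_dirty(struct backing_dev_info *bdi)
	{
		/* tmpfs, ramfs and friends set BDI_CAP_NO_ACCT_DIRTY */
		return !(bdi->capabilities & BDI_CAP_NO_ACCT_DIRTY);
	}

	static inline bool mapping_cap_account_dirty(struct address_space *mapping)
	{
		return bdi_cap_account_dirty(mapping->backing_dev_info);
	}

So the unaccounted mappings (shm, tmpfs, ramfs) keep the storage key test,
and the accounted ones skip it.]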
> > > Things should be ok (modulo the ugliness of this condition), right?
> >
> > (Setting aside my reservations above...) That's almost exactly right, but
> > I think the issue of a racing truncation (which could reset page->mapping
> > to NULL at any moment) means we have to be a bit more careful. Usually
> > we guard against that with page lock, but here we can rely on mapcount.
> >
> > page_mapping(page), with its built-in PageSwapCache check, actually ends
> > up making the condition look less ugly; and so far as I could tell,
> > the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> > when we are left with its VM_BUG_ON(PageSlab(page))).
> >
> > But please look this over very critically and test (and if you like it,
> > please adopt it as your own): I'm not entirely convinced yet myself.
> OK, I'll push the kernel with your updated patch to our build machines
> and let it run there for a few days (it took about a day to reproduce the
> issue originally). Thanks a lot for helping me with this.
And thank you for explaining it repeatedly for me.
I expect you're most interested in testing the XFS end of it; but if
you've time to check the swap/tmpfs aspect too, fsx on tmpfs while
heavily swapping should do it.
But perhaps these machines aren't much into heavy swapping. Now,
if Martin would send me a nice little zSeries netbook for Xmas,
I could then test that end of it myself ;)
I've just arrived at the conclusion that page migration does _not_
have a problem with transferring the dirty storage key: I had been
thinking that your testing might stumble on that issue, and need a
further patch, but I'll explain in another mail why I now think not.
Hugh
On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
> > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > >
> > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > finds the dirty bit and calls set_page_dirty().
> > > >
> > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > from xfs_count_page_state().
> > >
> > > What changed recently? Was XFS hardly used on s390 until now?
> > The problem was originally hit on SLE11-SP2, which is 3.0 based, after
> > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > XFS just started to be more peevish about what pages it gets between these
> > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > up).
>
> Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> ASSERT usually compiled out, stumbling later in page_buffers() as you say.
What that says is that no-one is running xfstests-based QA on s390
with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
I've never tested XFS on s390 before, and I doubt any of the
upstream developers have, either, because not many people have s390
machines in their basement. So this is probably just an oversight
in the distro QA environment more than anything....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, 9 Oct 2012, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Martin Schwidefsky wrote:
> > On Mon, 8 Oct 2012 21:24:40 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > A separate worry came to mind as I thought about your patch: where
> > > in page migration is s390's dirty storage key migrated from old page
> > > to new? And if there is a problem there, that too should be fixed
> > > by what I propose in the previous paragraph.
> >
> > That is covered by the SetPageUptodate() in migrate_page_copy().
>
> I don't think so: that makes sure that the newpage is not marked
> dirty in storage key just because of the copy_highpage to it; but
> I see nothing to mark the newpage dirty in storage key when the
> old page was dirty there.
I went to prepare a patch to fix this, and ended up finding no such
problem to fix - which fits with how no such problem has been reported.
Most of it is handled by page migration's unmap_and_move() having to
unmap the old page first: so the old page will pass through the final
page_remove_rmap(), which will transfer storage key to page_dirty in
those cases which it deals with (with the old code, any file or swap
page; with the new code, any unaccounted file or swap page, now that
we realize the accounted files don't even need this); and page_dirty
is already properly migrated to the new page.
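[Roughly, the copying side looks like this; a paraphrased sketch of
migrate_page_copy(), trimmed to the bits that matter here:

	static void migrate_page_copy(struct page *newpage, struct page *page)
	{
		copy_highpage(newpage, page);	/* dirties newpage's storage key */

		...
		if (PageUptodate(page))
			SetPageUptodate(newpage);	/* on s390 this also resets the
							 * storage key dirtied just above */
		...
		if (PageDirty(page)) {
			clear_page_dirty_for_io(page);
			__set_page_dirty_nobuffers(newpage);	/* PageDirty carried over */
		}
		...
	}
]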
But that does leave one case behind: an anonymous page not yet in
swapcache, migrated via a swap-like migration entry. But this case
is not a problem because PageDirty doesn't actually affect anything
for an anonymous page not in swapcache. There are various places
where we set it, and its life-history is hard to make sense of, but
in fact it's meaningless in 2.6, where page reclaim adds anon to swap
(and sets PageDirty) whether the page was marked dirty before or not
(which makes sense when we use the ZERO_PAGE for anon read faults).
2.4 did behave differently: it was liable to free anon pages not
marked dirty, and I think most of our anon SetPageDirtys are just a
relic of those days - I do have a patch from 18 months ago to remove
them (adding PG_dirty to the flags which should not be set when a
page is freed), but there are usually more urgent things to attend
to than rebase and retest that.
Hugh
On Wed, 10 Oct 2012 14:28:32 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:
> But perhaps these machines aren't much into heavy swapping. Now,
> if Martin would send me a nice little zSeries netbook for Xmas,
> I could then test that end of it myself ;)
Are you sure about that? The electricity cost alone for such a beast
is quite high ;-)
> I've just arrived at the conclusion that page migration does _not_
> have a problem with transferring the dirty storage key: I had been
> thinking that your testing might stumble on that issue, and need a
> further patch, but I'll explain in another mail why I now think not.
That is good to know, one problem less on the list.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 11 Oct 2012 08:56:00 +1100
Dave Chinner <[email protected]> wrote:
> On Tue, Oct 09, 2012 at 07:19:09PM -0700, Hugh Dickins wrote:
> > On Tue, 9 Oct 2012, Jan Kara wrote:
> > > On Mon 08-10-12 21:24:40, Hugh Dickins wrote:
> > > > On Mon, 1 Oct 2012, Jan Kara wrote:
> > > >
> > > > > On s390 any write to a page (even from kernel itself) sets architecture
> > > > > specific page dirty bit. Thus when a page is written to via standard write, HW
> > > > > dirty bit gets set and when we later map and unmap the page, page_remove_rmap()
> > > > > finds the dirty bit and calls set_page_dirty().
> > > > >
> > > > > Dirtying of a page which shouldn't be dirty can cause all sorts of problems to
> > > > > filesystems. The bug we observed in practice is that buffers from the page get
> > > > > freed, so when the page gets later marked as dirty and writeback writes it, XFS
> > > > > crashes due to an assertion BUG_ON(!PagePrivate(page)) in page_buffers() called
> > > > > from xfs_count_page_state().
> > > >
> > > > What changed recently? Was XFS hardly used on s390 until now?
> > > The problem was originally hit on SLE11-SP2, which is 3.0 based, after
> > > migration of our s390 build machines from SLE11-SP1 (2.6.32 based). I think
> > > XFS just started to be more peevish about what pages it gets between these
> > > two releases ;) (e.g. ext3 or ext4 just says "oh, well" and fixes things
> > > up).
> >
> > Right, in 2.6.32 xfs_vm_writepage() had a !page_has_buffers(page) case,
> > whereas by 3.0 that had become ASSERT(page_has_buffers(page)), with the
> > ASSERT usually compiled out, stumbling later in page_buffers() as you say.
>
> What that says is that no-one is running xfstests-based QA on s390
> with CONFIG_XFS_DEBUG enabled, otherwise this would have been found.
> I've never tested XFS on s390 before, and I doubt any of the
> upstream developers have, either, because not many people have s390
> machines in their basement. So this is probably just an oversight
> in the distro QA environment more than anything....
Our internal builds indeed have CONFIG_XFS_DEBUG=n; I'll change that and
watch for the fallout.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue 09-10-12 19:19:09, Hugh Dickins wrote:
> On Tue, 9 Oct 2012, Jan Kara wrote:
<snip a lot>
> > > But here's where I think the problem is. You're assuming that all
> > > filesystems go the same mapping_cap_account_writeback_dirty() (yeah,
> > > there's no such function, just a confusing maze of three) route as XFS.
> > >
> > > But filesystems like tmpfs and ramfs (perhaps they're the only two
> > > that matter here) don't participate in that, and wait for an mmap'ed
> > > page to be seen modified by the user (usually via pte_dirty, but that's
> > > a no-op on s390) before page is marked dirty; and page reclaim throws
> > > away undirtied pages.
> > I admit I haven't thought of tmpfs and similar. After some discussion Mel
> > pointed me to the code in mmap which makes a difference. So if I get it
> > right, the difference which causes us problems is that on tmpfs we map the
> > page writeably even during read-only fault. OK, then if I make the above
> > code in page_remove_rmap():
> > 	if ((PageSwapCache(page) ||
> > 	     (!anon && !mapping_cap_account_dirty(page->mapping))) &&
> > 	    page_test_and_clear_dirty(page_to_pfn(page), 1))
> > 		set_page_dirty(page);
> >
> > Things should be ok (modulo the ugliness of this condition), right?
>
> (Setting aside my reservations above...) That's almost exactly right, but
> I think the issue of a racing truncation (which could reset page->mapping
> to NULL at any moment) means we have to be a bit more careful. Usually
> we guard against that with page lock, but here we can rely on mapcount.
>
> page_mapping(page), with its built-in PageSwapCache check, actually ends
> up making the condition look less ugly; and so far as I could tell,
> the extra code does get optimized out on x86 (unless CONFIG_DEBUG_VM,
> when we are left with its VM_BUG_ON(PageSlab(page))).
>
> But please look this over very critically and test (and if you like it,
> please adopt it as your own): I'm not entirely convinced yet myself.
Just to follow up on this: the new version of the patch has been running
fine for several days on our s390 build machines. I was also running
fsx-linux on tmpfs while pushing the machine to swap. fsx ran fine, but I hit
WARN_ON(delalloc) in xfs_vm_releasepage(). The exact stack trace is:
[<000003c008edb38e>] xfs_vm_releasepage+0xc6/0xd4 [xfs]
[<0000000000213326>] shrink_page_list+0x6ba/0x734
[<0000000000213924>] shrink_inactive_list+0x230/0x578
[<0000000000214148>] shrink_list+0x6c/0x120
[<00000000002143ee>] shrink_zone+0x1f2/0x238
[<0000000000215482>] balance_pgdat+0x5f6/0x86c
[<00000000002158b8>] kswapd+0x1c0/0x248
[<000000000017642a>] kthread+0xa6/0xb0
[<00000000004e58be>] kernel_thread_starter+0x6/0xc
[<00000000004e58b8>] kernel_thread_starter+0x0/0xc
I don't think it is really related, but I'll hold off on the patch for a
while to investigate what's going on...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
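[For reference, the mmap-time difference Jan mentioned earlier (tmpfs
mapping the page writeably even on a read-only fault) comes down to
vma_wants_writenotify(); a paraphrased, trimmed sketch of that era's
mm/mmap.c, not verbatim:

	int vma_wants_writenotify(struct vm_area_struct *vma)
	{
		vm_flags_t vm_flags = vma->vm_flags;

		/* Only shared writable mappings are interesting. */
		if ((vm_flags & (VM_WRITE|VM_SHARED)) != (VM_WRITE|VM_SHARED))
			return 0;

		/* The filesystem asked to see write faults. */
		if (vma->vm_ops && vma->vm_ops->page_mkwrite)
			return 1;

		...

		/* Can the mapping track dirty pages?  tmpfs/ramfs cannot,
		 * so their shared mappings are made writeable up front,
		 * and even a read fault installs a writable pte. */
		return vma->vm_file && vma->vm_file->f_mapping &&
			mapping_cap_account_dirty(vma->vm_file->f_mapping);
	}
]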
On Tue, 9 Oct 2012 16:21:24 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:
> >
> > I am seriously tempted to switch to pure software dirty bits by using
> > page protection for writable but clean pages. The worry is the number of
> > additional protection faults we would get. But as we do software dirty
> > bit tracking for the most part anyway this might not be as bad as it
> > used to be.
>
> That's exactly the same reason why tmpfs opts out of dirty tracking, fear
> of unnecessary extra faults. Anomalous as s390 is here, tmpfs is being
> anomalous too, and I'd be a hypocrite to push for you to make that change.
I tested the waters with the software dirty bit idea. Using a kernel
compile as the test case I got these numbers:
disk backing,  swdirty: 10,023,870 minor-faults  18 major-faults
disk backing,  hwdirty: 10,023,829 minor-faults  21 major-faults
tmpfs backing, swdirty: 10,019,552 minor-faults  49 major-faults
tmpfs backing, hwdirty: 10,032,909 minor-faults  81 major-faults
That does not look bad at all. One test I found that shows an effect is
lat_mmap from LMBench:
disk backing,  hwdirty: 30,894 minor-faults  0 major-faults
disk backing,  swdirty: 30,894 minor-faults  0 major-faults
tmpfs backing, hwdirty: 22,574 minor-faults  0 major-faults
tmpfs backing, swdirty: 36,652 minor-faults  0 major-faults
The runtimes of the hwdirty and swdirty setups are very similar,
encouraging enough for me to ask our performance team to run a larger test.
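[In case it helps to picture the scheme: it amounts to installing
writable-but-clean ptes write-protected and doing the dirtying in the
protection fault. A hypothetical fragment illustrating the idea only, not
actual s390 code; mm, addr and ptep stand for the usual fault-path
variables:

	/* Installing a writable but clean pte: keep it write-protected,
	 * so that the first store traps. */
	set_pte_at(mm, addr, ptep, pte_wrprotect(pte_mkclean(pte)));

	/* In the protection fault handler, on that first store: */
	pte = pte_mkdirty(pte);		/* software dirty bit, no storage key */
	pte = pte_mkwrite(pte);		/* further stores proceed directly */
	set_pte_at(mm, addr, ptep, pte);

The extra minor faults in the tmpfs swdirty lat_mmap numbers above are
presumably exactly these protection faults.]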
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.