Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:
__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
                                     if (PageDirty())
                                       dec_zone_page_state(page, NR_FILE_DIRTY);
                                       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
        __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes. Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.
Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed. Remove it.
Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memory.c | 11 ++---------
mm/page-writeback.c | 33 ++++++++++++---------------------
2 files changed, 14 insertions(+), 30 deletions(-)
It is unfortunate to hold the page lock while balancing dirty pages,
but I don't see what else would protect mapping at that point. The
same btw applies for the page_mkwrite case: how is mapping safe to
pass to balance_dirty_pages() after unlocking page table and page?
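Concretely, the window in question looks roughly like this (a simplified
excerpt of the page_mkwrite path in do_wp_page(); the full code is quoted
later in the thread):

	struct address_space *mapping = dirty_page->mapping;

	set_page_dirty(dirty_page);
	unlock_page(dirty_page);         /* page lock dropped here...  */
	page_cache_release(dirty_page);  /* ...and the page reference  */
	if (mapping)
		/* what still protects mapping at this point? */
		balance_dirty_pages_ratelimited(mapping);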
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..27aaee6b6f4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,10 @@ reuse:
if (!dirty_page)
return ret;
- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * do_shared_fault is protected similarly.
- */
if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
+ lock_page(dirty_page);
set_page_dirty_balance(dirty_page);
+ unlock_page(dirty_page);
/* file_update_time outside page_lock */
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..86773236f42a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2123,32 +2123,25 @@ EXPORT_SYMBOL(account_page_dirtied);
* page dirty in that case, but not all the buffers. This is a "bottom-up"
* dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
*
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation. Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
*/
int __set_page_dirty_nobuffers(struct page *page)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- struct address_space *mapping2;
unsigned long flags;
if (!mapping)
return 1;
spin_lock_irqsave(&mapping->tree_lock, flags);
- mapping2 = page_mapping(page);
- if (mapping2) { /* Race with truncate? */
- BUG_ON(mapping2 != mapping);
- WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- }
+ BUG_ON(page_mapping(page) != mapping);
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree, page_index(page),
+ PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -2305,12 +2298,10 @@ int clear_page_dirty_for_io(struct page *page)
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
- * at this point. We do this by having them hold the
- * page lock at some point after installing their
- * pte, but before marking the page dirty.
- * Pages are always locked coming in here, so we get
- * the desired exclusion. See mm/memory.c:do_wp_page()
- * for more comments.
+ * at this point. We do this by having them hold the
+ * page lock while dirtying the page, and pages are
+ * always locked coming in here, so we get the desired
+ * exclusion.
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
--
2.1.3
On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <[email protected]> wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
>
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
>                                      if (PageDirty())
>                                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
>         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
>
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held. The notable exception to this rule, though,
> is do_wp_page(), for which this race exists. However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes. Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
>
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed. Remove it.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memory.c | 11 ++---------
> mm/page-writeback.c | 33 ++++++++++++---------------------
> 2 files changed, 14 insertions(+), 30 deletions(-)
>
> It is unfortunate to hold the page lock while balancing dirty pages,
> but I don't see what else would protect mapping at that point.
Yes.
I'm a bit surprised that calling balance_dirty_pages() under
lock_page() doesn't just go and deadlock. Memory fails me.
And yes, often the only thing which protects the address_space is
lock_page().
set_page_dirty_balance() and balance_dirty_pages() don't actually need
the address_space - they just use it to get at the backing_dev_info.
So perhaps what we could do here is to change those functions to take
a bdi directly, then change do_wp_page() to do something like
	lock_page(dirty_page);
	bdi = dirty_page->mapping->backing_dev_info;
	need_balance = set_page_dirty2(dirty_page, bdi);
	unlock_page(dirty_page);
	if (need_balance)
		balance_dirty_pages_ratelimited2(bdi);
so we no longer require that the address_space be stabilized after
lock_page(). Of course something needs to protect the bdi and I'm not
sure what that is, but we're talking about umount and that quiesces and
evicts lots of things before proceeding, so surely there's something in
there which will save us ;)
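For illustration, a minimal sketch of a caller-side helper built on that
idea; set_page_dirty2() and balance_dirty_pages_ratelimited2() are the
hypothetical names from the snippet above, not existing kernel functions
(and the bdi-lifetime assumption is examined later in the thread):

	/* Hypothetical sketch, not existing kernel code. */
	static void wp_page_dirty_balance(struct page *dirty_page)
	{
		struct backing_dev_info *bdi = NULL;
		int need_balance = 0;

		lock_page(dirty_page);
		/*
		 * The page lock pins page->mapping, so the bdi pointer
		 * read here is valid; the bdi itself is assumed to live
		 * as long as the backing device.
		 */
		if (dirty_page->mapping) {
			bdi = dirty_page->mapping->backing_dev_info;
			need_balance = set_page_dirty2(dirty_page, bdi);
		}
		unlock_page(dirty_page);

		/* Throttle without holding the page lock. */
		if (need_balance)
			balance_dirty_pages_ratelimited2(bdi);
	}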
> The
> same btw applies for the page_mkwrite case: how is mapping safe to
> pass to balance_dirty_pages() after unlocking page table and page?
I'm not sure which code you're referring to here, but it's likely that
the switch-balancing-to-bdi approach will address that as well?
On Wed 26-11-14 14:00:06, Andrew Morton wrote:
> On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <[email protected]> wrote:
>
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> >
> > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> >   if (TestSetPageDirty(page))
> >                                      page->mapping = NULL
> >                                      if (PageDirty())
> >                                        dec_zone_page_state(page, NR_FILE_DIRTY);
> >                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >     if (page->mapping)
> >       account_page_dirtied(page)
> >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> >         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> >
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held. The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists. However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> >
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed. Remove it.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/memory.c | 11 ++---------
> > mm/page-writeback.c | 33 ++++++++++++---------------------
> > 2 files changed, 14 insertions(+), 30 deletions(-)
> >
> > It is unfortunate to hold the page lock while balancing dirty pages,
> > but I don't see what else would protect mapping at that point.
>
> Yes.
>
> I'm a bit surprised that calling balance_dirty_pages() under
> lock_page() doesn't just go and deadlock. Memory fails me.
>
> And yes, often the only thing which protects the address_space is
> lock_page().
>
> set_page_dirty_balance() and balance_dirty_pages() don't actually need
> the address_space - they just use it to get at the backing_dev_info.
> So perhaps what we could do here is to change those functions to take
> a bdi directly, then change do_wp_page() to do something like
>
> 	lock_page(dirty_page);
> 	bdi = dirty_page->mapping->backing_dev_info;
> 	need_balance = set_page_dirty2(dirty_page, bdi);
> 	unlock_page(dirty_page);
> 	if (need_balance)
> 		balance_dirty_pages_ratelimited2(bdi);
Yes, please! Holding lock_page() over balance_dirty_pages() definitely has
a potential for deadlock (e.g. the flusher might block on lock_page() in a
WB_SYNC_ALL pass, and then there'd be no one to clean pages and thus release
the process waiting in balance_dirty_pages()).
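Spelled out in the same interleaving style as the race diagram at the top
of the thread (an illustrative scenario):

	faulting task                      flusher thread (WB_SYNC_ALL pass)
	lock_page(page)
	balance_dirty_pages()
	  sleeps until enough pages
	  have been cleaned
	                                   write_cache_pages()
	                                     lock_page(page)  <- blocks
	(never woken: the flusher that would clean pages is stuck)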
> so we no longer require that the address_space be stabilized after
> lock_page(). Of course something needs to protect the bdi and I'm not
> sure what that is, but we're talking about umount and that quiesces and
> evicts lots of things before proceeding, so surely there's something in
> there which will save us ;)
In do_wp_page() the process doing the fault and ending in
balance_dirty_pages() has to have the page mapped, thus it has to have the
file open => no umount.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, 27 Nov 2014 10:40:06 +0100 Jan Kara <[email protected]> wrote:
> > so we no longer require that the address_space be stabilized after
> > lock_page(). Of course something needs to protect the bdi and I'm not
> > sure what that is, but we're talking about umount and that quiesces and
> > evicts lots of things before proceeding, so surely there's something in
> > there which will save us ;)
> In do_wp_page() the process doing the fault and ending in
> balance_dirty_pages() has to have the page mapped, thus it has to have the
> file open => no umount.
Actually, umount isn't enough to kill the backing_dev_info. It's an
attribute of the device itself (for blockdevs it's a field in
request_queue) so I assume it will be stable until device hot-unplug,
losetup -d, rmmod, etc. If the backing_dev can go away in the middle
of a pagefault against that device then we have bigger problems ;)
On Wed, Nov 26, 2014 at 02:00:06PM -0800, Andrew Morton wrote:
> On Tue, 25 Nov 2014 14:48:41 -0500 Johannes Weiner <[email protected]> wrote:
> > The
> > same btw applies for the page_mkwrite case: how is mapping safe to
> > pass to balance_dirty_pages() after unlocking page table and page?
>
> I'm not sure which code you're referring to here, but it's likely that
> the switch-balancing-to-bdi approach will address that as well?
This code in do_wp_page():
	pte_unmap_unlock(page_table, ptl);
	[...]
	put_page(dirty_page);
	if (page_mkwrite) {
		struct address_space *mapping = dirty_page->mapping;

		set_page_dirty(dirty_page);
		unlock_page(dirty_page);
		page_cache_release(dirty_page);
		if (mapping) {
			/*
			 * Some device drivers do not set page.mapping
			 * but still dirty their pages
			 */
			balance_dirty_pages_ratelimited(mapping);
		}
	}
And there is also this code in do_shared_fault():
	pte_unmap_unlock(pte, ptl);

	if (set_page_dirty(fault_page))
		dirtied = 1;
	mapping = fault_page->mapping;
	unlock_page(fault_page);
	if ((dirtied || vma->vm_ops->page_mkwrite) && mapping) {
		/*
		 * Some device drivers do not set page.mapping but still
		 * dirty their pages
		 */
		balance_dirty_pages_ratelimited(mapping);
	}
I don't see anything that ensures mapping stays alive by the time it's
passed to balance_dirty_pages() in either case.

Argh, but of course there is. The mmap_sem. That pins the vma, which
pins the file, which pins the inode. In all cases. So I think we can
just stick with passing mapping to balance_dirty_pages() for now.
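Spelled out, the pinning chain during the fault is roughly:

	mmap_sem held across the fault
	  -> the vma cannot be unmapped
	    -> vma->vm_file holds a reference on the struct file (taken at mmap time)
	      -> the file holds a reference on the inode
	        -> the inode owns the address_space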