LinuxLists.cc - [patch 0/3 resend] mm: close race between dirtying and truncation

2014-12-16 16:18:39

Subject: [patch 0/3 resend] mm: close race between dirtying and truncation

Hi Andrew,

it looks like these 3 patches fell on the floor. The first one is a
bug fix that closes a race condition between truncation and dirtying
from do_wp_page(), which can result in a dirty-accounting error. #2
and #3 simplify the code in the area. Please consider them for 3.19.

Thanks!

include/linux/writeback.h | 1 -
mm/memory.c | 54 ++++++++++++++++++---------------------------
mm/page-writeback.c | 43 ++++++++++--------------------------
3 files changed, 34 insertions(+), 64 deletions(-)

2014-12-16 16:18:41

by Johannes Weiner

[permalink] [raw]

Subject: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
pretty much the same. Make sure the page references are the same in
both cases, then merge the two branches.

It's tempting to go even further and page-lock the !page_mkwrite case,
to get it in line with everybody else setting the page table and thus
further simplify the model. But that's not quite compelling enough to
justify dropping the pte lock, then relocking and verifying the entry
for filesystems without ->page_mkwrite, which notably includes tmpfs.
Leave it for now and lock the page late in the !page_mkwrite case.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
mm/memory.c | 48 +++++++++++++++++-------------------------------
1 file changed, 17 insertions(+), 31 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d5a303cf2901..a26f30b6abd1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2033,7 +2033,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t entry;
int ret = 0;
int page_mkwrite = 0;
- struct page *dirty_page = NULL;
+ bool dirty_shared = false;
unsigned long mmun_start = 0; /* For mmu_notifiers */
unsigned long mmun_end = 0; /* For mmu_notifiers */
struct mem_cgroup *memcg;
@@ -2084,6 +2084,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
+ page_cache_get(old_page);
/*
* Only catch write-faults on shared writable pages,
* read-only shared pages can get COWed by
@@ -2091,7 +2092,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;
- page_cache_get(old_page);
+
pte_unmap_unlock(page_table, ptl);
tmp = do_page_mkwrite(vma, old_page, address);
if (unlikely(!tmp || (tmp &
@@ -2111,11 +2112,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
goto unlock;
}
-
page_mkwrite = 1;
}
- dirty_page = old_page;
- get_page(dirty_page);
+
+ dirty_shared = true;

reuse:
/*
@@ -2134,43 +2134,29 @@ reuse:
pte_unmap_unlock(page_table, ptl);
ret |= VM_FAULT_WRITE;

- if (!dirty_page)
- return ret;
-
- if (!page_mkwrite) {
+ if (dirty_shared) {
struct address_space *mapping;
int dirtied;

- lock_page(dirty_page);
- dirtied = set_page_dirty(dirty_page);
- VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
- mapping = dirty_page->mapping;
- unlock_page(dirty_page);
+ if (!page_mkwrite)
+ lock_page(old_page);

- if (dirtied && mapping) {
- /*
- * Some device drivers do not set page.mapping
- * but still dirty their pages
- */
- balance_dirty_pages_ratelimited(mapping);
- }
+ dirtied = set_page_dirty(old_page);
+ VM_BUG_ON_PAGE(PageAnon(old_page), old_page);
+ mapping = old_page->mapping;
+ unlock_page(old_page);
+ page_cache_release(old_page);

- file_update_time(vma->vm_file);
- }
- put_page(dirty_page);
- if (page_mkwrite) {
- struct address_space *mapping = dirty_page->mapping;
-
- set_page_dirty(dirty_page);
- unlock_page(dirty_page);
- page_cache_release(dirty_page);
- if (mapping) {
+ if ((dirtied || page_mkwrite) && mapping) {
/*
* Some device drivers do not set page.mapping
* but still dirty their pages
*/
balance_dirty_pages_ratelimited(mapping);
}
+
+ if (!page_mkwrite)
+ file_update_time(vma->vm_file);
}

return ret;
--
2.1.3

2014-12-16 16:19:00

by Johannes Weiner

[permalink] [raw]

Subject: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

Shared anonymous mmaps are implemented with shmem files, so all VMAs
with shared writable semantics also have an underlying backing file.

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4c06cdcd36cf..d5a303cf2901 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2155,9 +2155,7 @@ reuse:
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
+ file_update_time(vma->vm_file);
}
put_page(dirty_page);
if (page_mkwrite) {
@@ -3013,8 +3011,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file && !vma->vm_ops->page_mkwrite)
+ if (!vma->vm_ops->page_mkwrite)
file_update_time(vma->vm_file);

return ret;
--
2.1.3

2014-12-16 16:19:20

by Johannes Weiner

[permalink] [raw]

Subject: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers() __delete_from_page_cache()
if (TestSetPageDirty(page))
page->mapping = NULL
if (PageDirty())
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
if (page->mapping)
account_page_dirtied(page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes. Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed. Remove it.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/writeback.h | 1 -
mm/memory.c | 27 +++++++++++++++++----------
mm/page-writeback.c | 43 ++++++++++++-------------------------------
3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
void *data);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
void writeback_set_ratelimit(void);
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index c3b9097251c5..4c06cdcd36cf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2137,17 +2137,24 @@ reuse:
if (!dirty_page)
return ret;

- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * do_shared_fault is protected similarly.
- */
if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page);
+ struct address_space *mapping;
+ int dirtied;
+
+ lock_page(dirty_page);
+ dirtied = set_page_dirty(dirty_page);
+ VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
+ mapping = dirty_page->mapping;
+ unlock_page(dirty_page);
+
+ if (dirtied && mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+
/* file_update_time outside page_lock */
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d5d81f5384d1..6f4335238e33 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
bdi_start_background_writeback(bdi);
}

-void set_page_dirty_balance(struct page *page)
-{
- if (set_page_dirty(page)) {
- struct address_space *mapping = page_mapping(page);
-
- if (mapping)
- balance_dirty_pages_ratelimited(mapping);
- }
-}
-
static DEFINE_PER_CPU(int, bdp_ratelimits);

/*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
* page dirty in that case, but not all the buffers. This is a "bottom-up"
* dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
*
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation. Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
*/
int __set_page_dirty_nobuffers(struct page *page)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- struct address_space *mapping2;
unsigned long flags;

if (!mapping)
return 1;

spin_lock_irqsave(&mapping->tree_lock, flags);
- mapping2 = page_mapping(page);
- if (mapping2) { /* Race with truncate? */
- BUG_ON(mapping2 != mapping);
- WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- }
+ BUG_ON(page_mapping(page) != mapping);
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree, page_index(page),
+ PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
- * at this point. We do this by having them hold the
- * page lock at some point after installing their
- * pte, but before marking the page dirty.
- * Pages are always locked coming in here, so we get
- * the desired exclusion. See mm/memory.c:do_wp_page()
- * for more comments.
+ * at this point. We do this by having them hold the
+ * page lock while dirtying the page, and pages are
+ * always locked coming in here, so we get the desired
+ * exclusion.
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
--
2.1.3