2014-12-05 14:52:56

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers() __delete_from_page_cache()
if (TestSetPageDirty(page))
page->mapping = NULL
if (PageDirty())
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
if (page->mapping)
account_page_dirtied(page)
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes. Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed. Remove it.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Cc: <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
include/linux/writeback.h | 1 -
mm/memory.c | 27 +++++++++++++++++----------
mm/page-writeback.c | 43 ++++++++++++-------------------------------
3 files changed, 29 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
void *data);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
void writeback_set_ratelimit(void);
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..72d998eb0438 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,24 @@ reuse:
if (!dirty_page)
return ret;

- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * do_shared_fault is protected similarly.
- */
if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page);
+ struct address_space *mapping;
+ int dirtied;
+
+ lock_page(dirty_page);
+ dirtied = set_page_dirty(dirty_page);
+ VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
+ mapping = dirty_page->mapping;
+ unlock_page(dirty_page);
+
+ if (dirtied && mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+
/* file_update_time outside page_lock */
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..437174a2aaa3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
bdi_start_background_writeback(bdi);
}

-void set_page_dirty_balance(struct page *page)
-{
- if (set_page_dirty(page)) {
- struct address_space *mapping = page_mapping(page);
-
- if (mapping)
- balance_dirty_pages_ratelimited(mapping);
- }
-}
-
static DEFINE_PER_CPU(int, bdp_ratelimits);

/*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
* page dirty in that case, but not all the buffers. This is a "bottom-up"
* dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
*
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation. Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
*/
int __set_page_dirty_nobuffers(struct page *page)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- struct address_space *mapping2;
unsigned long flags;

if (!mapping)
return 1;

spin_lock_irqsave(&mapping->tree_lock, flags);
- mapping2 = page_mapping(page);
- if (mapping2) { /* Race with truncate? */
- BUG_ON(mapping2 != mapping);
- WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- }
+ BUG_ON(page_mapping(page) != mapping);
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree, page_index(page),
+ PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
- * at this point. We do this by having them hold the
- * page lock at some point after installing their
- * pte, but before marking the page dirty.
- * Pages are always locked coming in here, so we get
- * the desired exclusion. See mm/memory.c:do_wp_page()
- * for more comments.
+ * at this point. We do this by having them hold the
+ * page lock while dirtying the page, and pages are
+ * always locked coming in here, so we get the desired
+ * exclusion.
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
--
2.1.3


2014-12-05 14:52:58

by Johannes Weiner

[permalink] [raw]
Subject: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

Shared anonymous mmaps are implemented with shmem files, so all VMAs
with shared writable semantics also have an underlying backing file.

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 72d998eb0438..5640a718ac58 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2168,9 +2168,7 @@ reuse:
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
+ file_update_time(vma->vm_file);
}
put_page(dirty_page);
if (page_mkwrite) {
@@ -3026,8 +3024,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file && !vma->vm_ops->page_mkwrite)
+ if (!vma->vm_ops->page_mkwrite)
file_update_time(vma->vm_file);

return ret;
--
2.1.3

2014-12-05 14:53:08

by Johannes Weiner

[permalink] [raw]
Subject: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
pretty much the same. Make sure the page references are the same in
both cases, then merge the two branches.

It's tempting to go even further and page-lock the !page_mkwrite case,
to get it in line with everybody else setting the page table and thus
further simplify the model. But that's not quite compelling enough to
justify dropping the pte lock, then relocking and verifying the entry
for filesystems without ->page_mkwrite, which notably includes tmpfs.
Leave it for now and lock the page late in the !page_mkwrite case.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
---
mm/memory.c | 48 +++++++++++++++++-------------------------------
1 file changed, 17 insertions(+), 31 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 5640a718ac58..df47fd0a4b7f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2046,7 +2046,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t entry;
int ret = 0;
int page_mkwrite = 0;
- struct page *dirty_page = NULL;
+ bool dirty_shared = false;
unsigned long mmun_start = 0; /* For mmu_notifiers */
unsigned long mmun_end = 0; /* For mmu_notifiers */
struct mem_cgroup *memcg;
@@ -2097,6 +2097,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
+ page_cache_get(old_page);
/*
* Only catch write-faults on shared writable pages,
* read-only shared pages can get COWed by
@@ -2104,7 +2105,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;
- page_cache_get(old_page);
+
pte_unmap_unlock(page_table, ptl);
tmp = do_page_mkwrite(vma, old_page, address);
if (unlikely(!tmp || (tmp &
@@ -2124,11 +2125,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
goto unlock;
}
-
page_mkwrite = 1;
}
- dirty_page = old_page;
- get_page(dirty_page);
+
+ dirty_shared = true;

reuse:
/*
@@ -2147,43 +2147,29 @@ reuse:
pte_unmap_unlock(page_table, ptl);
ret |= VM_FAULT_WRITE;

- if (!dirty_page)
- return ret;
-
- if (!page_mkwrite) {
+ if (dirty_shared) {
struct address_space *mapping;
int dirtied;

- lock_page(dirty_page);
- dirtied = set_page_dirty(dirty_page);
- VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
- mapping = dirty_page->mapping;
- unlock_page(dirty_page);
+ if (!page_mkwrite)
+ lock_page(old_page);

- if (dirtied && mapping) {
- /*
- * Some device drivers do not set page.mapping
- * but still dirty their pages
- */
- balance_dirty_pages_ratelimited(mapping);
- }
+ dirtied = set_page_dirty(old_page);
+ VM_BUG_ON_PAGE(PageAnon(old_page), old_page);
+ mapping = old_page->mapping;
+ unlock_page(old_page);
+ page_cache_release(old_page);

- file_update_time(vma->vm_file);
- }
- put_page(dirty_page);
- if (page_mkwrite) {
- struct address_space *mapping = dirty_page->mapping;
-
- set_page_dirty(dirty_page);
- unlock_page(dirty_page);
- page_cache_release(dirty_page);
- if (mapping) {
+ if ((dirtied || page_mkwrite) && mapping) {
/*
* Some device drivers do not set page.mapping
* but still dirty their pages
*/
balance_dirty_pages_ratelimited(mapping);
}
+
+ if (!page_mkwrite)
+ file_update_time(vma->vm_file);
}

return ret;
--
2.1.3

2014-12-09 18:18:54

by Jan Kara

[permalink] [raw]
Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Fri 05-12-14 09:52:44, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
>
> __set_page_dirty_nobuffers() __delete_from_page_cache()
> if (TestSetPageDirty(page))
> page->mapping = NULL
> if (PageDirty())
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> if (page->mapping)
> account_page_dirtied(page)
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
>
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held. The notable exception to this rule, though,
> is do_wp_page(), for which this race exists. However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes. Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
>
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed. Remove it.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/writeback.h | 1 -
> mm/memory.c | 27 +++++++++++++++++----------
> mm/page-writeback.c | 43 ++++++++++++-------------------------------
> 3 files changed, 29 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a219be961c0a..00048339c23e 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
> struct writeback_control *wbc, writepage_t writepage,
> void *data);
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> -void set_page_dirty_balance(struct page *page);
> void writeback_set_ratelimit(void);
> void tag_pages_for_writeback(struct address_space *mapping,
> pgoff_t start, pgoff_t end);
> diff --git a/mm/memory.c b/mm/memory.c
> index 3e503831e042..72d998eb0438 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2150,17 +2150,24 @@ reuse:
> if (!dirty_page)
> return ret;
>
> - /*
> - * Yes, Virginia, this is actually required to prevent a race
> - * with clear_page_dirty_for_io() from clearing the page dirty
> - * bit after it clear all dirty ptes, but before a racing
> - * do_wp_page installs a dirty pte.
> - *
> - * do_shared_fault is protected similarly.
> - */
> if (!page_mkwrite) {
> - wait_on_page_locked(dirty_page);
> - set_page_dirty_balance(dirty_page);
> + struct address_space *mapping;
> + int dirtied;
> +
> + lock_page(dirty_page);
> + dirtied = set_page_dirty(dirty_page);
> + VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
> + mapping = dirty_page->mapping;
> + unlock_page(dirty_page);
> +
> + if (dirtied && mapping) {
> + /*
> + * Some device drivers do not set page.mapping
> + * but still dirty their pages
> + */
> + balance_dirty_pages_ratelimited(mapping);
> + }
> +
> /* file_update_time outside page_lock */
> if (vma->vm_file)
> file_update_time(vma->vm_file);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 19ceae87522d..437174a2aaa3 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1541,16 +1541,6 @@ pause:
> bdi_start_background_writeback(bdi);
> }
>
> -void set_page_dirty_balance(struct page *page)
> -{
> - if (set_page_dirty(page)) {
> - struct address_space *mapping = page_mapping(page);
> -
> - if (mapping)
> - balance_dirty_pages_ratelimited(mapping);
> - }
> -}
> -
> static DEFINE_PER_CPU(int, bdp_ratelimits);
>
> /*
> @@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
> * page dirty in that case, but not all the buffers. This is a "bottom-up"
> * dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
> *
> - * Most callers have locked the page, which pins the address_space in memory.
> - * But zap_pte_range() does not lock the page, however in that case the
> - * mapping is pinned by the vma's ->vm_file reference.
> - *
> - * We take care to handle the case where the page was truncated from the
> - * mapping by re-checking page_mapping() inside tree_lock.
> + * The caller must ensure this doesn't race with truncation. Most will simply
> + * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
> + * the pte lock held, which also locks out truncation.
> */
> int __set_page_dirty_nobuffers(struct page *page)
> {
> if (!TestSetPageDirty(page)) {
> struct address_space *mapping = page_mapping(page);
> - struct address_space *mapping2;
> unsigned long flags;
>
> if (!mapping)
> return 1;
>
> spin_lock_irqsave(&mapping->tree_lock, flags);
> - mapping2 = page_mapping(page);
> - if (mapping2) { /* Race with truncate? */
> - BUG_ON(mapping2 != mapping);
> - WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> - account_page_dirtied(page, mapping);
> - radix_tree_tag_set(&mapping->page_tree,
> - page_index(page), PAGECACHE_TAG_DIRTY);
> - }
> + BUG_ON(page_mapping(page) != mapping);
> + WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
> + account_page_dirtied(page, mapping);
> + radix_tree_tag_set(&mapping->page_tree, page_index(page),
> + PAGECACHE_TAG_DIRTY);
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> if (mapping->host) {
> /* !PageAnon && !swapper_space */
> @@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
> /*
> * We carefully synchronise fault handlers against
> * installing a dirty pte and marking the page dirty
> - * at this point. We do this by having them hold the
> - * page lock at some point after installing their
> - * pte, but before marking the page dirty.
> - * Pages are always locked coming in here, so we get
> - * the desired exclusion. See mm/memory.c:do_wp_page()
> - * for more comments.
> + * at this point. We do this by having them hold the
> + * page lock while dirtying the page, and pages are
> + * always locked coming in here, so we get the desired
> + * exclusion.
> */
> if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> --
> 2.1.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-12-09 18:22:26

by Jan Kara

[permalink] [raw]
Subject: Re: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

On Fri 05-12-14 09:52:46, Johannes Weiner wrote:
> Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
> pretty much the same. Make sure the page references are the same in
> both cases, then merge the two branches.
>
> It's tempting to go even further and page-lock the !page_mkwrite case,
> to get it in line with everybody else setting the page table and thus
> further simplify the model. But that's not quite compelling enough to
> justify dropping the pte lock, then relocking and verifying the entry
> for filesystems without ->page_mkwrite, which notably includes tmpfs.
> Leave it for now and lock the page late in the !page_mkwrite case.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
The patch looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/memory.c | 48 +++++++++++++++++-------------------------------
> 1 file changed, 17 insertions(+), 31 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5640a718ac58..df47fd0a4b7f 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2046,7 +2046,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_t entry;
> int ret = 0;
> int page_mkwrite = 0;
> - struct page *dirty_page = NULL;
> + bool dirty_shared = false;
> unsigned long mmun_start = 0; /* For mmu_notifiers */
> unsigned long mmun_end = 0; /* For mmu_notifiers */
> struct mem_cgroup *memcg;
> @@ -2097,6 +2097,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unlock_page(old_page);
> } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> (VM_WRITE|VM_SHARED))) {
> + page_cache_get(old_page);
> /*
> * Only catch write-faults on shared writable pages,
> * read-only shared pages can get COWed by
> @@ -2104,7 +2105,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> */
> if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
> int tmp;
> - page_cache_get(old_page);
> +
> pte_unmap_unlock(page_table, ptl);
> tmp = do_page_mkwrite(vma, old_page, address);
> if (unlikely(!tmp || (tmp &
> @@ -2124,11 +2125,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unlock_page(old_page);
> goto unlock;
> }
> -
> page_mkwrite = 1;
> }
> - dirty_page = old_page;
> - get_page(dirty_page);
> +
> + dirty_shared = true;
>
> reuse:
> /*
> @@ -2147,43 +2147,29 @@ reuse:
> pte_unmap_unlock(page_table, ptl);
> ret |= VM_FAULT_WRITE;
>
> - if (!dirty_page)
> - return ret;
> -
> - if (!page_mkwrite) {
> + if (dirty_shared) {
> struct address_space *mapping;
> int dirtied;
>
> - lock_page(dirty_page);
> - dirtied = set_page_dirty(dirty_page);
> - VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page);
> - mapping = dirty_page->mapping;
> - unlock_page(dirty_page);
> + if (!page_mkwrite)
> + lock_page(old_page);
>
> - if (dirtied && mapping) {
> - /*
> - * Some device drivers do not set page.mapping
> - * but still dirty their pages
> - */
> - balance_dirty_pages_ratelimited(mapping);
> - }
> + dirtied = set_page_dirty(old_page);
> + VM_BUG_ON_PAGE(PageAnon(old_page), old_page);
> + mapping = old_page->mapping;
> + unlock_page(old_page);
> + page_cache_release(old_page);
>
> - file_update_time(vma->vm_file);
> - }
> - put_page(dirty_page);
> - if (page_mkwrite) {
> - struct address_space *mapping = dirty_page->mapping;
> -
> - set_page_dirty(dirty_page);
> - unlock_page(dirty_page);
> - page_cache_release(dirty_page);
> - if (mapping) {
> + if ((dirtied || page_mkwrite) && mapping) {
> /*
> * Some device drivers do not set page.mapping
> * but still dirty their pages
> */
> balance_dirty_pages_ratelimited(mapping);
> }
> +
> + if (!page_mkwrite)
> + file_update_time(vma->vm_file);
> }
>
> return ret;
> --
> 2.1.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
>
> __set_page_dirty_nobuffers() __delete_from_page_cache()
> if (TestSetPageDirty(page))
> page->mapping = NULL
> if (PageDirty())
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> if (page->mapping)
> account_page_dirtied(page)
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
>
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held. The notable exception to this rule, though,
> is do_wp_page(), for which this race exists. However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes. Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
>
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed. Remove it.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>

Hi Johannes,

I'm seeing the following while fuzzing with trinity on linux-next (I've changed
the WARN to a VM_BUG_ON_PAGE for some extra page info).

[ 18.991007] page:ffffea000307c8c0 count:3 mapcount:0 mapping:ffff88010444cbf8 index:0x1^M
[ 18.993051] flags: 0x1fffc0000000011(locked|dirty)^M
[ 18.993621] raw: 01fffc0000000011 ffff88010444cbf8 0000000000000001 00000003ffffffff^M
[ 18.994522] raw: dead000000000100 dead000000000200 0000000000000000 ffff880109c38008^M [ 18.995418] page dumped because: VM_BUG_ON_PAGE(!PagePrivate(page) && !PageUptodate(page))^M
[ 18.996381] page->mem_cgroup:ffff880109c38008^M [ 18.996935] ------------[ cut here ]------------^M [ 18.997483] kernel BUG at mm/page-writeback.c:2486!^M [ 18.998063] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN^M
[ 18.998756] Modules linked in:^M [ 18.999129] CPU: 5 PID: 1388 Comm: trinity-c34 Not tainted 4.11.0-rc5-next-20170407-dirty #12^M [ 19.000117] task: ffff880106ee5d40 task.stack: ffff8800c0f40000^M [ 19.000828] RIP: 0010:__set_page_dirty_nobuffers (??:?)
[ 19.001491] RSP: 0018:ffff8800c0f47318 EFLAGS: 00010006^M
[ 19.002103] RAX: 0000000000000000 RBX: 1ffff100181e8e67 RCX: 0000000000000000^M
[ 19.002929] RDX: 0000000000000021 RSI: 1ffff100181e8da7 RDI: ffffed00181e8e58^M
[ 19.004806] RBP: ffff8800c0f47440 R08: 3830303833633930 R09: 3130383866666666^M
[ 19.005626] R10: dffffc0000000000 R11: 0000000000001491 R12: ffff8800c0f47418^M
[ 19.006452] R13: ffffea000307c8c0 R14: ffff88010444cc10 R15: ffff88010444cbf8^M
[ 19.007277] FS: 00007ff6a26fb700(0000) GS:ffff88010a340000(0000) knlGS:0000000000000000^M
[ 19.008424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
[ 19.009092] CR2: 00007ff6a155267c CR3: 00000000cb301000 CR4: 00000000000406a0^M
[ 19.009919] Call Trace:^M
[ 19.012266] set_page_dirty (mm/page-writeback.c:2579)
[ 19.020028] v9fs_write_end (fs/9p/vfs_addr.c:325)
[ 19.022473] generic_perform_write (mm/filemap.c:2842)
[ 19.024857] __generic_file_write_iter (mm/filemap.c:2957)
[ 19.025830] generic_file_write_iter (./include/linux/fs.h:702 mm/filemap.c:2985)
[ 19.028549] __do_readv_writev (./include/linux/fs.h:1734 fs/read_write.c:696 fs/read_write.c:862)
[ 19.029924] do_readv_writev (fs/read_write.c:895)
[ 19.034044] vfs_writev (fs/read_write.c:921)
[ 19.035223] do_writev (fs/read_write.c:955)
[ 19.036925] SyS_writev (fs/read_write.c:1024)
[ 19.037297] do_syscall_64 (arch/x86/entry/common.c:284)
[ 19.042085] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249) [ 19.042608] RIP: 0033:0x7ff6a200a8e9^M [ 19.043015] RSP: 002b:00007fff78079608 EFLAGS: 00000246 ORIG_RAX: 0000000000000014^M
[ 19.044253] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff6a200a8e9^M [ 19.045045] RDX: 0000000000000001 RSI: 0000000002337d60 RDI: 000000000000018b^M
[ 19.045835] RBP: 00007ff6a2601000 R08: 000000482a1a83cf R09: fffdffffffffffff^M [ 19.046627] R10: 0012536735f82cf7 R11: 0000000000000246 R12: 0000000000000002^M [ 19.047413] R13: 00007ff6a2601048 R14: 00007ff6a26fb698 R15: 00007ff6a2601000^M [ 19.048212] Code: 89 85 f0 fe ff ff e8 39 1b 20 00 8b 85 f0 fe ff ff eb 1a e8 2c bd 12 00 31 c0 eb 11 48 c7 c6 e0 c4 47 83 4c 89 ef e8 39 44 07 00 <0f> 0b 48 ba 00 00 00 00 00 fc ff df 48 c7 04 13 00 00 00 00 48 ^M All code ======== 0: 89 85 f0 fe ff ff mov %eax,-0x110(%rbp)
6: e8 39 1b 20 00 callq 0x201b44
b: 8b 85 f0 fe ff ff mov -0x110(%rbp),%eax
11: eb 1a jmp 0x2d
13: e8 2c bd 12 00 callq 0x12bd44
18: 31 c0 xor %eax,%eax
1a: eb 11 jmp 0x2d
1c: 48 c7 c6 e0 c4 47 83 mov $0xffffffff8347c4e0,%rsi
23: 4c 89 ef mov %r13,%rdi
26: e8 39 44 07 00 callq 0x74464
2b:* 0f 0b ud2 <-- trapping instruction
2d: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
34: fc ff df
37: 48 c7 04 13 00 00 00 movq $0x0,(%rbx,%rdx,1)
3e: 00
3f: 48 rex.W
...

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
9: fc ff df
c: 48 c7 04 13 00 00 00 movq $0x0,(%rbx,%rdx,1)
13: 00
14: 48 rex.W
...
[ 19.050311] RIP: __set_page_dirty_nobuffers+0x407/0x450 RSP: ffff8800c0f47318^M (??:?)

--

Thanks,
Sasha

2017-04-10 12:06:44

by Jan Kara

[permalink] [raw]
Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Mon 10-04-17 02:22:33, [email protected] wrote:
> On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> >
> > __set_page_dirty_nobuffers() __delete_from_page_cache()
> > if (TestSetPageDirty(page))
> > page->mapping = NULL
> > if (PageDirty())
> > dec_zone_page_state(page, NR_FILE_DIRTY);
> > dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > if (page->mapping)
> > account_page_dirtied(page)
> > __inc_zone_page_state(page, NR_FILE_DIRTY);
> > __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> >
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held. The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists. However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> >
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed. Remove it.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
>
> Hi Johannes,
>
> I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> the WARN to a VM_BUG_ON_PAGE for some extra page info).

But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
a !Uptodate page?

Honza

>
> [ 18.991007] page:ffffea000307c8c0 count:3 mapcount:0 mapping:ffff88010444cbf8 index:0x1^M
> [ 18.993051] flags: 0x1fffc0000000011(locked|dirty)^M
> [ 18.993621] raw: 01fffc0000000011 ffff88010444cbf8 0000000000000001 00000003ffffffff^M
> [ 18.994522] raw: dead000000000100 dead000000000200 0000000000000000 ffff880109c38008^M [ 18.995418] page dumped because: VM_BUG_ON_PAGE(!PagePrivate(page) && !PageUptodate(page))^M
> [ 18.996381] page->mem_cgroup:ffff880109c38008^M [ 18.996935] ------------[ cut here ]------------^M [ 18.997483] kernel BUG at mm/page-writeback.c:2486!^M [ 18.998063] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN^M
> [ 18.998756] Modules linked in:^M [ 18.999129] CPU: 5 PID: 1388 Comm: trinity-c34 Not tainted 4.11.0-rc5-next-20170407-dirty #12^M [ 19.000117] task: ffff880106ee5d40 task.stack: ffff8800c0f40000^M [ 19.000828] RIP: 0010:__set_page_dirty_nobuffers (??:?)
> [ 19.001491] RSP: 0018:ffff8800c0f47318 EFLAGS: 00010006^M
> [ 19.002103] RAX: 0000000000000000 RBX: 1ffff100181e8e67 RCX: 0000000000000000^M
> [ 19.002929] RDX: 0000000000000021 RSI: 1ffff100181e8da7 RDI: ffffed00181e8e58^M
> [ 19.004806] RBP: ffff8800c0f47440 R08: 3830303833633930 R09: 3130383866666666^M
> [ 19.005626] R10: dffffc0000000000 R11: 0000000000001491 R12: ffff8800c0f47418^M
> [ 19.006452] R13: ffffea000307c8c0 R14: ffff88010444cc10 R15: ffff88010444cbf8^M
> [ 19.007277] FS: 00007ff6a26fb700(0000) GS:ffff88010a340000(0000) knlGS:0000000000000000^M
> [ 19.008424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> [ 19.009092] CR2: 00007ff6a155267c CR3: 00000000cb301000 CR4: 00000000000406a0^M
> [ 19.009919] Call Trace:^M
> [ 19.012266] set_page_dirty (mm/page-writeback.c:2579)
> [ 19.020028] v9fs_write_end (fs/9p/vfs_addr.c:325)
> [ 19.022473] generic_perform_write (mm/filemap.c:2842)
> [ 19.024857] __generic_file_write_iter (mm/filemap.c:2957)
> [ 19.025830] generic_file_write_iter (./include/linux/fs.h:702 mm/filemap.c:2985)
> [ 19.028549] __do_readv_writev (./include/linux/fs.h:1734 fs/read_write.c:696 fs/read_write.c:862)
> [ 19.029924] do_readv_writev (fs/read_write.c:895)
> [ 19.034044] vfs_writev (fs/read_write.c:921)
> [ 19.035223] do_writev (fs/read_write.c:955)
> [ 19.036925] SyS_writev (fs/read_write.c:1024)
> [ 19.037297] do_syscall_64 (arch/x86/entry/common.c:284)
> [ 19.042085] entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249) [ 19.042608] RIP: 0033:0x7ff6a200a8e9^M [ 19.043015] RSP: 002b:00007fff78079608 EFLAGS: 00000246 ORIG_RAX: 0000000000000014^M
> [ 19.044253] RAX: ffffffffffffffda RBX: 0000000000000014 RCX: 00007ff6a200a8e9^M [ 19.045045] RDX: 0000000000000001 RSI: 0000000002337d60 RDI: 000000000000018b^M
> [ 19.045835] RBP: 00007ff6a2601000 R08: 000000482a1a83cf R09: fffdffffffffffff^M [ 19.046627] R10: 0012536735f82cf7 R11: 0000000000000246 R12: 0000000000000002^M [ 19.047413] R13: 00007ff6a2601048 R14: 00007ff6a26fb698 R15: 00007ff6a2601000^M [ 19.048212] Code: 89 85 f0 fe ff ff e8 39 1b 20 00 8b 85 f0 fe ff ff eb 1a e8 2c bd 12 00 31 c0 eb 11 48 c7 c6 e0 c4 47 83 4c 89 ef e8 39 44 07 00 <0f> 0b 48 ba 00 00 00 00 00 fc ff df 48 c7 04 13 00 00 00 00 48 ^M All code ======== 0: 89 85 f0 fe ff ff mov %eax,-0x110(%rbp)
> 6: e8 39 1b 20 00 callq 0x201b44
> b: 8b 85 f0 fe ff ff mov -0x110(%rbp),%eax
> 11: eb 1a jmp 0x2d
> 13: e8 2c bd 12 00 callq 0x12bd44
> 18: 31 c0 xor %eax,%eax
> 1a: eb 11 jmp 0x2d
> 1c: 48 c7 c6 e0 c4 47 83 mov $0xffffffff8347c4e0,%rsi
> 23: 4c 89 ef mov %r13,%rdi
> 26: e8 39 44 07 00 callq 0x74464
> 2b:* 0f 0b ud2 <-- trapping instruction
> 2d: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
> 34: fc ff df
> 37: 48 c7 04 13 00 00 00 movq $0x0,(%rbx,%rdx,1)
> 3e: 00
> 3f: 48 rex.W
> ...
>
> Code starting with the faulting instruction
> ===========================================
> 0: 0f 0b ud2
> 2: 48 ba 00 00 00 00 00 movabs $0xdffffc0000000000,%rdx
> 9: fc ff df
> c: 48 c7 04 13 00 00 00 movq $0x0,(%rbx,%rdx,1)
> 13: 00
> 14: 48 rex.W
> ...
> [ 19.050311] RIP: __set_page_dirty_nobuffers+0x407/0x450 RSP: ffff8800c0f47318^M (??:?)
>
> --
>
> Thanks,
> Sasha
--
Jan Kara <[email protected]>
SUSE Labs, CR

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Mon, Apr 10, 2017 at 02:06:38PM +0200, Jan Kara wrote:
> On Mon 10-04-17 02:22:33, [email protected] wrote:
> > On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > > Tejun, while reviewing the code, spotted the following race condition
> > > between the dirtying and truncation of a page:
> > >
> > > __set_page_dirty_nobuffers() __delete_from_page_cache()
> > > if (TestSetPageDirty(page))
> > > page->mapping = NULL
> > > if (PageDirty())
> > > dec_zone_page_state(page, NR_FILE_DIRTY);
> > > dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > if (page->mapping)
> > > account_page_dirtied(page)
> > > __inc_zone_page_state(page, NR_FILE_DIRTY);
> > > __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > >
> > > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > >
> > > Dirtiers usually lock out truncation, either by holding the page lock
> > > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > > the page table lock held. The notable exception to this rule, though,
> > > is do_wp_page(), for which this race exists. However, do_wp_page()
> > > already waits for a locked page to unlock before setting the dirty
> > > bit, in order to prevent a race where clear_page_dirty() misses the
> > > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > > locked set_page_dirty() to also cover the situation explained above.
> > >
> > > Afterwards, the code in set_page_dirty() dealing with a truncation
> > > race is no longer needed. Remove it.
> > >
> > > Reported-by: Tejun Heo <[email protected]>
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > Cc: <[email protected]>
> > > Acked-by: Kirill A. Shutemov <[email protected]>
> >
> > Hi Johannes,
> >
> > I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> > the WARN to a VM_BUG_ON_PAGE for some extra page info).
>
> But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
> a !Uptodate page?

I thought that 77469c3f5 ("9p: saner ->write_end() on failing copy into
non-uptodate page") prevented from that happening, but that's actually the
change that's causing it (I ended up misreading it last night).

Will fix it as follows:

diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
index adaf6f6..be84c0c 100644
--- a/fs/9p/vfs_addr.c
+++ b/fs/9p/vfs_addr.c
@@ -310,9 +310,13 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping,

p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping);

- if (unlikely(copied < len && !PageUptodate(page))) {
- copied = 0;
- goto out;
+ if (!PageUptodate(page)) {
+ if (unlikely(copied < len)) {
+ copied = 0;
+ goto out;
+ } else {
+ SetPageUptodate(page);
+ }
}
/*
* No need to use i_size_read() here, the i_size

--

Thanks,
Sasha

2017-04-10 15:51:36

by Jan Kara

[permalink] [raw]
Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Mon 10-04-17 15:07:58, [email protected] wrote:
> On Mon, Apr 10, 2017 at 02:06:38PM +0200, Jan Kara wrote:
> > On Mon 10-04-17 02:22:33, [email protected] wrote:
> > > On Fri, Dec 05, 2014 at 09:52:44AM -0500, Johannes Weiner wrote:
> > > > Tejun, while reviewing the code, spotted the following race condition
> > > > between the dirtying and truncation of a page:
> > > >
> > > > __set_page_dirty_nobuffers() __delete_from_page_cache()
> > > > if (TestSetPageDirty(page))
> > > > page->mapping = NULL
> > > > if (PageDirty())
> > > > dec_zone_page_state(page, NR_FILE_DIRTY);
> > > > dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > > if (page->mapping)
> > > > account_page_dirtied(page)
> > > > __inc_zone_page_state(page, NR_FILE_DIRTY);
> > > > __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> > > >
> > > > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> > > >
> > > > Dirtiers usually lock out truncation, either by holding the page lock
> > > > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > > > the page table lock held. The notable exception to this rule, though,
> > > > is do_wp_page(), for which this race exists. However, do_wp_page()
> > > > already waits for a locked page to unlock before setting the dirty
> > > > bit, in order to prevent a race where clear_page_dirty() misses the
> > > > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > > > locked set_page_dirty() to also cover the situation explained above.
> > > >
> > > > Afterwards, the code in set_page_dirty() dealing with a truncation
> > > > race is no longer needed. Remove it.
> > > >
> > > > Reported-by: Tejun Heo <[email protected]>
> > > > Signed-off-by: Johannes Weiner <[email protected]>
> > > > Cc: <[email protected]>
> > > > Acked-by: Kirill A. Shutemov <[email protected]>
> > >
> > > Hi Johannes,
> > >
> > > I'm seeing the following while fuzzing with trinity on linux-next (I've changed
> > > the WARN to a VM_BUG_ON_PAGE for some extra page info).
> >
> > But this looks more like a bug in 9p which allows v9fs_write_end() to dirty
> > a !Uptodate page?
>
> I thought that 77469c3f5 ("9p: saner ->write_end() on failing copy into
> non-uptodate page") prevented from that happening, but that's actually the
> change that's causing it (I ended up misreading it last night).
>
> Will fix it as follows:

Yep, this looks good to me, although I'd find it more future-proof if we
had that SetPageUptodate() additionally guarded a by len == PAGE_SIZE
check.

Honza

>
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c
> index adaf6f6..be84c0c 100644
> --- a/fs/9p/vfs_addr.c
> +++ b/fs/9p/vfs_addr.c
> @@ -310,9 +310,13 @@ static int v9fs_write_end(struct file *filp, struct address_space *mapping,
>
> p9_debug(P9_DEBUG_VFS, "filp %p, mapping %p\n", filp, mapping);
>
> - if (unlikely(copied < len && !PageUptodate(page))) {
> - copied = 0;
> - goto out;
> + if (!PageUptodate(page)) {
> + if (unlikely(copied < len)) {
> + copied = 0;
> + goto out;
> + } else {
> + SetPageUptodate(page);
> + }
> }
> /*
> * No need to use i_size_read() here, the i_size
>
> --
>
> Thanks,
> Sasha
--
Jan Kara <[email protected]>
SUSE Labs, CR