2014-12-01 22:58:18

by Johannes Weiner

Subject: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

Tejun, while reviewing the code, spotted the following race condition
between the dirtying and truncation of a page:

__set_page_dirty_nobuffers()       __delete_from_page_cache()
  if (TestSetPageDirty(page))
                                     page->mapping = NULL
                                     if (PageDirty())
                                       dec_zone_page_state(page, NR_FILE_DIRTY);
                                       dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    if (page->mapping)
      account_page_dirtied(page)
        __inc_zone_page_state(page, NR_FILE_DIRTY);
        __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);

which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.

Dirtiers usually lock out truncation, either by holding the page lock
directly, or in case of zap_pte_range(), by pinning the mapcount with
the page table lock held. The notable exception to this rule, though,
is do_wp_page(), for which this race exists. However, do_wp_page()
already waits for a locked page to unlock before setting the dirty
bit, in order to prevent a race where clear_page_dirty() misses the
page bit in the presence of dirty ptes. Upgrade that wait to a fully
locked set_page_dirty() to also cover the situation explained above.

Afterwards, the code in set_page_dirty() dealing with a truncation
race is no longer needed. Remove it.

Reported-by: Tejun Heo <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Cc: <[email protected]>
---
include/linux/writeback.h | 1 -
mm/memory.c | 26 ++++++++++++++++----------
mm/page-writeback.c | 43 ++++++++++++-------------------------------
3 files changed, 28 insertions(+), 42 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be961c0a..00048339c23e 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
struct writeback_control *wbc, writepage_t writepage,
void *data);
int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
-void set_page_dirty_balance(struct page *page);
void writeback_set_ratelimit(void);
void tag_pages_for_writeback(struct address_space *mapping,
pgoff_t start, pgoff_t end);
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..73220eb6e9e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2150,17 +2150,23 @@ reuse:
if (!dirty_page)
return ret;

- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * bit after it clear all dirty ptes, but before a racing
- * do_wp_page installs a dirty pte.
- *
- * do_shared_fault is protected similarly.
- */
if (!page_mkwrite) {
- wait_on_page_locked(dirty_page);
- set_page_dirty_balance(dirty_page);
+ struct address_space *mapping;
+ int dirtied;
+
+ lock_page(dirty_page);
+ dirtied = set_page_dirty(dirty_page);
+ mapping = dirty_page->mapping;
+ unlock_page(dirty_page);
+
+ if (dirtied && mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+
/* file_update_time outside page_lock */
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae87522d..437174a2aaa3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1541,16 +1541,6 @@ pause:
bdi_start_background_writeback(bdi);
}

-void set_page_dirty_balance(struct page *page)
-{
- if (set_page_dirty(page)) {
- struct address_space *mapping = page_mapping(page);
-
- if (mapping)
- balance_dirty_pages_ratelimited(mapping);
- }
-}
-
static DEFINE_PER_CPU(int, bdp_ratelimits);

/*
@@ -2123,32 +2113,25 @@ EXPORT_SYMBOL(account_page_dirtied);
* page dirty in that case, but not all the buffers. This is a "bottom-up"
* dirtying, whereas __set_page_dirty_buffers() is a "top-down" dirtying.
*
- * Most callers have locked the page, which pins the address_space in memory.
- * But zap_pte_range() does not lock the page, however in that case the
- * mapping is pinned by the vma's ->vm_file reference.
- *
- * We take care to handle the case where the page was truncated from the
- * mapping by re-checking page_mapping() inside tree_lock.
+ * The caller must ensure this doesn't race with truncation. Most will simply
+ * hold the page lock, but e.g. zap_pte_range() calls with the page mapped and
+ * the pte lock held, which also locks out truncation.
*/
int __set_page_dirty_nobuffers(struct page *page)
{
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
- struct address_space *mapping2;
unsigned long flags;

if (!mapping)
return 1;

spin_lock_irqsave(&mapping->tree_lock, flags);
- mapping2 = page_mapping(page);
- if (mapping2) { /* Race with truncate? */
- BUG_ON(mapping2 != mapping);
- WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- account_page_dirtied(page, mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
- }
+ BUG_ON(page_mapping(page) != mapping);
+ WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
+ account_page_dirtied(page, mapping);
+ radix_tree_tag_set(&mapping->page_tree, page_index(page),
+ PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -2305,12 +2288,10 @@ int clear_page_dirty_for_io(struct page *page)
/*
* We carefully synchronise fault handlers against
* installing a dirty pte and marking the page dirty
- * at this point. We do this by having them hold the
- * page lock at some point after installing their
- * pte, but before marking the page dirty.
- * Pages are always locked coming in here, so we get
- * the desired exclusion. See mm/memory.c:do_wp_page()
- * for more comments.
+ * at this point. We do this by having them hold the
+ * page lock while dirtying the page, and pages are
+ * always locked coming in here, so we get the desired
+ * exclusion.
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
--
2.1.3


2014-12-01 22:58:17

by Johannes Weiner

Subject: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

The only way a VMA can have shared and writable semantics is with a
backing file.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memory.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 73220eb6e9e3..2a2e3648ed65 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2167,9 +2167,7 @@ reuse:
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file)
- file_update_time(vma->vm_file);
+ file_update_time(vma->vm_file);
}
put_page(dirty_page);
if (page_mkwrite) {
@@ -3025,8 +3023,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
balance_dirty_pages_ratelimited(mapping);
}

- /* file_update_time outside page_lock */
- if (vma->vm_file && !vma->vm_ops->page_mkwrite)
+ if (!vma->vm_ops->page_mkwrite)
file_update_time(vma->vm_file);

return ret;
--
2.1.3

2014-12-01 22:58:14

by Johannes Weiner

Subject: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
pretty much the same. Make sure the page references are the same in
both cases, then merge the two branches.

It's tempting to go even further and page-lock the !page_mkwrite case,
to get it in line with everybody else setting the page table and thus
further simplify the model. But that's not quite compelling enough to
justify dropping the pte lock, then relocking and verifying the entry
for filesystems without ->page_mkwrite, which notably includes tmpfs.
Leave it for now and lock the page late in the !page_mkwrite case.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memory.c | 46 ++++++++++++++++------------------------------
1 file changed, 16 insertions(+), 30 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2a2e3648ed65..ff92abfa5303 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2046,7 +2046,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_t entry;
int ret = 0;
int page_mkwrite = 0;
- struct page *dirty_page = NULL;
+ bool dirty_shared = false;
unsigned long mmun_start = 0; /* For mmu_notifiers */
unsigned long mmun_end = 0; /* For mmu_notifiers */
struct mem_cgroup *memcg;
@@ -2097,6 +2097,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
+ page_cache_get(old_page);
/*
* Only catch write-faults on shared writable pages,
* read-only shared pages can get COWed by
@@ -2104,7 +2105,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
int tmp;
- page_cache_get(old_page);
+
pte_unmap_unlock(page_table, ptl);
tmp = do_page_mkwrite(vma, old_page, address);
if (unlikely(!tmp || (tmp &
@@ -2124,11 +2125,10 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unlock_page(old_page);
goto unlock;
}
-
page_mkwrite = 1;
}
- dirty_page = old_page;
- get_page(dirty_page);
+
+ dirty_shared = true;

reuse:
/*
@@ -2147,42 +2147,28 @@ reuse:
pte_unmap_unlock(page_table, ptl);
ret |= VM_FAULT_WRITE;

- if (!dirty_page)
- return ret;
-
- if (!page_mkwrite) {
+ if (dirty_shared) {
struct address_space *mapping;
int dirtied;

- lock_page(dirty_page);
- dirtied = set_page_dirty(dirty_page);
- mapping = dirty_page->mapping;
- unlock_page(dirty_page);
+ if (!page_mkwrite)
+ lock_page(old_page);

- if (dirtied && mapping) {
- /*
- * Some device drivers do not set page.mapping
- * but still dirty their pages
- */
- balance_dirty_pages_ratelimited(mapping);
- }
+ dirtied = set_page_dirty(old_page);
+ mapping = old_page->mapping;
+ unlock_page(old_page);
+ page_cache_release(old_page);

- file_update_time(vma->vm_file);
- }
- put_page(dirty_page);
- if (page_mkwrite) {
- struct address_space *mapping = dirty_page->mapping;
-
- set_page_dirty(dirty_page);
- unlock_page(dirty_page);
- page_cache_release(dirty_page);
- if (mapping) {
+ if ((dirtied || page_mkwrite) && mapping) {
/*
* Some device drivers do not set page.mapping
* but still dirty their pages
*/
balance_dirty_pages_ratelimited(mapping);
}
+
+ if (!page_mkwrite)
+ file_update_time(vma->vm_file);
}

return ret;
--
2.1.3

2014-12-02 08:58:31

by Jan Kara

Subject: Re: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

On Mon 01-12-14 17:58:01, Johannes Weiner wrote:
> The only way a VMA can have shared and writable semantics is with a
> backing file.
OK, one always learns :) After some digging I found that MAP_SHARED |
MAP_ANONYMOUS mappings are in fact mappings of a temporary file in tmpfs.
It would be worth mentioning this in the changelog, I believe. Otherwise
feel free to add:
Reviewed-by: Jan Kara <[email protected]>

Honza
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memory.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 73220eb6e9e3..2a2e3648ed65 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2167,9 +2167,7 @@ reuse:
> balance_dirty_pages_ratelimited(mapping);
> }
>
> - /* file_update_time outside page_lock */
> - if (vma->vm_file)
> - file_update_time(vma->vm_file);
> + file_update_time(vma->vm_file);
> }
> put_page(dirty_page);
> if (page_mkwrite) {
> @@ -3025,8 +3023,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> balance_dirty_pages_ratelimited(mapping);
> }
>
> - /* file_update_time outside page_lock */
> - if (vma->vm_file && !vma->vm_ops->page_mkwrite)
> + if (!vma->vm_ops->page_mkwrite)
> file_update_time(vma->vm_file);
>
> return ret;
> --
> 2.1.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-12-02 09:12:18

by Jan Kara

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Mon 01-12-14 17:58:00, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
>
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
>                                      if (PageDirty())
>                                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
>         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
>
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held. The notable exception to this rule, though,
> is do_wp_page(), for which this race exists. However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes. Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
>
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed. Remove it.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: <[email protected]>
> ---
> include/linux/writeback.h | 1 -
> mm/memory.c | 26 ++++++++++++++++----------
> mm/page-writeback.c | 43 ++++++++++++-------------------------------
> 3 files changed, 28 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a219be961c0a..00048339c23e 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
> struct writeback_control *wbc, writepage_t writepage,
> void *data);
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> -void set_page_dirty_balance(struct page *page);
> void writeback_set_ratelimit(void);
> void tag_pages_for_writeback(struct address_space *mapping,
> pgoff_t start, pgoff_t end);
> diff --git a/mm/memory.c b/mm/memory.c
> index 3e503831e042..73220eb6e9e3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2150,17 +2150,23 @@ reuse:
> if (!dirty_page)
> return ret;
>
> - /*
> - * Yes, Virginia, this is actually required to prevent a race
> - * with clear_page_dirty_for_io() from clearing the page dirty
> - * bit after it clear all dirty ptes, but before a racing
> - * do_wp_page installs a dirty pte.
> - *
> - * do_shared_fault is protected similarly.
> - */
> if (!page_mkwrite) {
> - wait_on_page_locked(dirty_page);
> - set_page_dirty_balance(dirty_page);
> + struct address_space *mapping;
> + int dirtied;
> +
> + lock_page(dirty_page);
> + dirtied = set_page_dirty(dirty_page);
> + mapping = dirty_page->mapping;
> + unlock_page(dirty_page);
> +
> + if (dirtied && mapping) {
> + /*
> + * Some device drivers do not set page.mapping
> + * but still dirty their pages
> + */
The comment doesn't make sense to me here. Is it meant to explain why we
check 'mapping' in the above condition? I always thought truncate was the
main reason.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-12-02 09:19:45

by Jan Kara

Subject: Re: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

On Mon 01-12-14 17:58:02, Johannes Weiner wrote:
> Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
> pretty much the same. Make sure the page references are the same in
> both cases, then merge the two branches.
>
> It's tempting to go even further and page-lock the !page_mkwrite case,
> to get it in line with everybody else setting the page table and thus
> further simplify the model. But that's not quite compelling enough to
> justify dropping the pte lock, then relocking and verifying the entry
> for filesystems without ->page_mkwrite, which notably includes tmpfs.
> Leave it for now and lock the page late in the !page_mkwrite case.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memory.c | 46 ++++++++++++++++------------------------------
> 1 file changed, 16 insertions(+), 30 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 2a2e3648ed65..ff92abfa5303 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
...
> @@ -2147,42 +2147,28 @@ reuse:
> pte_unmap_unlock(page_table, ptl);
> ret |= VM_FAULT_WRITE;
>
> - if (!dirty_page)
> - return ret;
> -
> - if (!page_mkwrite) {
> + if (dirty_shared) {
> struct address_space *mapping;
> int dirtied;
>
> - lock_page(dirty_page);
> - dirtied = set_page_dirty(dirty_page);
> - mapping = dirty_page->mapping;
> - unlock_page(dirty_page);
> + if (!page_mkwrite)
> + lock_page(old_page);
>
> - if (dirtied && mapping) {
> - /*
> - * Some device drivers do not set page.mapping
> - * but still dirty their pages
> - */
> - balance_dirty_pages_ratelimited(mapping);
> - }
> + dirtied = set_page_dirty(old_page);
> + mapping = old_page->mapping;
> + unlock_page(old_page);
> + page_cache_release(old_page);
>
> - file_update_time(vma->vm_file);
> - }
> - put_page(dirty_page);
> - if (page_mkwrite) {
> - struct address_space *mapping = dirty_page->mapping;
> -
> - set_page_dirty(dirty_page);
> - unlock_page(dirty_page);
> - page_cache_release(dirty_page);
> - if (mapping) {
> + if ((dirtied || page_mkwrite) && mapping) {
Why do we actually call balance_dirty_pages_ratelimited() even if we
didn't dirty the page when ->page_mkwrite() exists? Is it because the
filesystem may dirty the page in ->page_mkwrite() and we don't want it to
have to call balance_dirty_pages_ratelimited() itself?

Otherwise the patch looks good to me.

Honza

> /*
> * Some device drivers do not set page.mapping
> * but still dirty their pages
> */
> balance_dirty_pages_ratelimited(mapping);
> }
> +
> + if (!page_mkwrite)
> + file_update_time(vma->vm_file);
> }
>
> return ret;
> --
> 2.1.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-12-02 11:57:10

by Kirill A. Shutemov

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Mon, Dec 01, 2014 at 05:58:00PM -0500, Johannes Weiner wrote:
> Tejun, while reviewing the code, spotted the following race condition
> between the dirtying and truncation of a page:
>
> __set_page_dirty_nobuffers()       __delete_from_page_cache()
>   if (TestSetPageDirty(page))
>                                      page->mapping = NULL
>                                      if (PageDirty())
>                                        dec_zone_page_state(page, NR_FILE_DIRTY);
>                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>     if (page->mapping)
>       account_page_dirtied(page)
>         __inc_zone_page_state(page, NR_FILE_DIRTY);
>         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>
> which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
>
> Dirtiers usually lock out truncation, either by holding the page lock
> directly, or in case of zap_pte_range(), by pinning the mapcount with
> the page table lock held. The notable exception to this rule, though,
> is do_wp_page(), for which this race exists. However, do_wp_page()
> already waits for a locked page to unlock before setting the dirty
> bit, in order to prevent a race where clear_page_dirty() misses the
> page bit in the presence of dirty ptes. Upgrade that wait to a fully
> locked set_page_dirty() to also cover the situation explained above.
>
> Afterwards, the code in set_page_dirty() dealing with a truncation
> race is no longer needed. Remove it.
>
> Reported-by: Tejun Heo <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Cc: <[email protected]>
> ---
> include/linux/writeback.h | 1 -
> mm/memory.c | 26 ++++++++++++++++----------
> mm/page-writeback.c | 43 ++++++++++++-------------------------------
> 3 files changed, 28 insertions(+), 42 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a219be961c0a..00048339c23e 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
> struct writeback_control *wbc, writepage_t writepage,
> void *data);
> int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> -void set_page_dirty_balance(struct page *page);
> void writeback_set_ratelimit(void);
> void tag_pages_for_writeback(struct address_space *mapping,
> pgoff_t start, pgoff_t end);
> diff --git a/mm/memory.c b/mm/memory.c
> index 3e503831e042..73220eb6e9e3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2150,17 +2150,23 @@ reuse:
> if (!dirty_page)
> return ret;
>
> - /*
> - * Yes, Virginia, this is actually required to prevent a race
> - * with clear_page_dirty_for_io() from clearing the page dirty
> - * bit after it clear all dirty ptes, but before a racing
> - * do_wp_page installs a dirty pte.
> - *
> - * do_shared_fault is protected similarly.
> - */
> if (!page_mkwrite) {
> - wait_on_page_locked(dirty_page);
> - set_page_dirty_balance(dirty_page);
> + struct address_space *mapping;
> + int dirtied;
> +
> + lock_page(dirty_page);
> + dirtied = set_page_dirty(dirty_page);
> + mapping = dirty_page->mapping;

At first, I wanted to ask why you don't use page_mapping() here, but after
a bit of digging I see we cannot get here with an anon page.

An explicit VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page); would be great.

Otherwise looks good to me.

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-12-02 12:00:26

by Kirill A. Shutemov

Subject: Re: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

On Mon, Dec 01, 2014 at 05:58:01PM -0500, Johannes Weiner wrote:
> The only way a VMA can have shared and writable semantics is with a
> backing file.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-12-02 12:08:33

by Kirill A. Shutemov

Subject: Re: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

On Mon, Dec 01, 2014 at 05:58:02PM -0500, Johannes Weiner wrote:
> Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
> pretty much the same. Make sure the page references are the same in
> both cases, then merge the two branches.
>
> It's tempting to go even further and page-lock the !page_mkwrite case,
> to get it in line with everybody else setting the page table and thus
> further simplify the model. But that's not quite compelling enough to
> justify dropping the pte lock, then relocking and verifying the entry
> for filesystems without ->page_mkwrite, which notably includes tmpfs.
> Leave it for now and lock the page late in the !page_mkwrite case.
>
> Signed-off-by: Johannes Weiner <[email protected]>

This would conflict with another patchset of do_wp_page() cleanups, but looks
good.

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov

2014-12-02 15:06:35

by Johannes Weiner

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Tue, Dec 02, 2014 at 10:12:12AM +0100, Jan Kara wrote:
> On Mon 01-12-14 17:58:00, Johannes Weiner wrote:
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> >
> > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> >   if (TestSetPageDirty(page))
> >                                      page->mapping = NULL
> >                                      if (PageDirty())
> >                                        dec_zone_page_state(page, NR_FILE_DIRTY);
> >                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >     if (page->mapping)
> >       account_page_dirtied(page)
> >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> >         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> >
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held. The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists. However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> >
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed. Remove it.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: <[email protected]>
> > ---
> > include/linux/writeback.h | 1 -
> > mm/memory.c | 26 ++++++++++++++++----------
> > mm/page-writeback.c | 43 ++++++++++++-------------------------------
> > 3 files changed, 28 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index a219be961c0a..00048339c23e 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
> > struct writeback_control *wbc, writepage_t writepage,
> > void *data);
> > int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> > -void set_page_dirty_balance(struct page *page);
> > void writeback_set_ratelimit(void);
> > void tag_pages_for_writeback(struct address_space *mapping,
> > pgoff_t start, pgoff_t end);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3e503831e042..73220eb6e9e3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2150,17 +2150,23 @@ reuse:
> > if (!dirty_page)
> > return ret;
> >
> > - /*
> > - * Yes, Virginia, this is actually required to prevent a race
> > - * with clear_page_dirty_for_io() from clearing the page dirty
> > - * bit after it clear all dirty ptes, but before a racing
> > - * do_wp_page installs a dirty pte.
> > - *
> > - * do_shared_fault is protected similarly.
> > - */
> > if (!page_mkwrite) {
> > - wait_on_page_locked(dirty_page);
> > - set_page_dirty_balance(dirty_page);
> > + struct address_space *mapping;
> > + int dirtied;
> > +
> > + lock_page(dirty_page);
> > + dirtied = set_page_dirty(dirty_page);
> > + mapping = dirty_page->mapping;
> > + unlock_page(dirty_page);
> > +
> > + if (dirtied && mapping) {
> > + /*
> > + * Some device drivers do not set page.mapping
> > + * but still dirty their pages
> > + */
> The comment doesn't make sense to me here. Is it meant to explain why we
> check 'mapping' in the above condition? I always thought truncate is the
> main reason.

Yes, I just copied it from the page_mkwrite case a few lines down, and
there is another copy of it in do_shared_fault(). Truncate is also a
possibility during a race, of course, but even without it we expect
that the mapping can be NULL for certain device drivers.

2014-12-02 15:11:21

by Johannes Weiner

Subject: Re: [patch 1/3] mm: protect set_page_dirty() from ongoing truncation

On Tue, Dec 02, 2014 at 01:56:52PM +0200, Kirill A. Shutemov wrote:
> On Mon, Dec 01, 2014 at 05:58:00PM -0500, Johannes Weiner wrote:
> > Tejun, while reviewing the code, spotted the following race condition
> > between the dirtying and truncation of a page:
> >
> > __set_page_dirty_nobuffers()       __delete_from_page_cache()
> >   if (TestSetPageDirty(page))
> >                                      page->mapping = NULL
> >                                      if (PageDirty())
> >                                        dec_zone_page_state(page, NR_FILE_DIRTY);
> >                                        dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >     if (page->mapping)
> >       account_page_dirtied(page)
> >         __inc_zone_page_state(page, NR_FILE_DIRTY);
> >         __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> >
> > which results in an imbalance of NR_FILE_DIRTY and BDI_RECLAIMABLE.
> >
> > Dirtiers usually lock out truncation, either by holding the page lock
> > directly, or in case of zap_pte_range(), by pinning the mapcount with
> > the page table lock held. The notable exception to this rule, though,
> > is do_wp_page(), for which this race exists. However, do_wp_page()
> > already waits for a locked page to unlock before setting the dirty
> > bit, in order to prevent a race where clear_page_dirty() misses the
> > page bit in the presence of dirty ptes. Upgrade that wait to a fully
> > locked set_page_dirty() to also cover the situation explained above.
> >
> > Afterwards, the code in set_page_dirty() dealing with a truncation
> > race is no longer needed. Remove it.
> >
> > Reported-by: Tejun Heo <[email protected]>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Cc: <[email protected]>
> > ---
> > include/linux/writeback.h | 1 -
> > mm/memory.c | 26 ++++++++++++++++----------
> > mm/page-writeback.c | 43 ++++++++++++-------------------------------
> > 3 files changed, 28 insertions(+), 42 deletions(-)
> >
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index a219be961c0a..00048339c23e 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -177,7 +177,6 @@ int write_cache_pages(struct address_space *mapping,
> > struct writeback_control *wbc, writepage_t writepage,
> > void *data);
> > int do_writepages(struct address_space *mapping, struct writeback_control *wbc);
> > -void set_page_dirty_balance(struct page *page);
> > void writeback_set_ratelimit(void);
> > void tag_pages_for_writeback(struct address_space *mapping,
> > pgoff_t start, pgoff_t end);
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3e503831e042..73220eb6e9e3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2150,17 +2150,23 @@ reuse:
> > if (!dirty_page)
> > return ret;
> >
> > - /*
> > - * Yes, Virginia, this is actually required to prevent a race
> > - * with clear_page_dirty_for_io() from clearing the page dirty
> > - * bit after it clear all dirty ptes, but before a racing
> > - * do_wp_page installs a dirty pte.
> > - *
> > - * do_shared_fault is protected similarly.
> > - */
> > if (!page_mkwrite) {
> > - wait_on_page_locked(dirty_page);
> > - set_page_dirty_balance(dirty_page);
> > + struct address_space *mapping;
> > + int dirtied;
> > +
> > + lock_page(dirty_page);
> > + dirtied = set_page_dirty(dirty_page);
> > + mapping = dirty_page->mapping;
>
> At first, I wanted to ask why you don't use page_mapping() here, but after
> a bit of digging I see we cannot get here with an anon page.
>
> An explicit VM_BUG_ON_PAGE(PageAnon(dirty_page), dirty_page); would be great.

Fair enough, I added that.

> Otherwise looks good to me.
>
> Acked-by: Kirill A. Shutemov <[email protected]>

Thanks!

2014-12-02 15:18:38

by Johannes Weiner

Subject: Re: [patch 2/3] mm: memory: remove ->vm_file check on shared writable vmas

On Tue, Dec 02, 2014 at 09:58:25AM +0100, Jan Kara wrote:
> On Mon 01-12-14 17:58:01, Johannes Weiner wrote:
> > The only way a VMA can have shared and writable semantics is with a
> > backing file.
> OK, one always learns :) After some digging I found that MAP_SHARED |
> MAP_ANONYMOUS mappings are in fact mappings of a temporary file in tmpfs.
> It would be worth mentioning this in the changelog, I believe. Otherwise
> feel free to add:
> Reviewed-by: Jan Kara <[email protected]>

Thanks, Jan. I updated the changelog to read:

Shared anonymous mmaps are implemented with shmem files, so all VMAs
with shared writable semantics also have an underlying backing file.

2014-12-02 16:56:20

by Johannes Weiner

Subject: Re: [patch 3/3] mm: memory: merge shared-writable dirtying branches in do_wp_page()

On Tue, Dec 02, 2014 at 10:19:39AM +0100, Jan Kara wrote:
> On Mon 01-12-14 17:58:02, Johannes Weiner wrote:
> > Whether there is a vm_ops->page_mkwrite or not, the page dirtying is
> > pretty much the same. Make sure the page references are the same in
> > both cases, then merge the two branches.
> >
> > It's tempting to go even further and page-lock the !page_mkwrite case,
> > to get it in line with everybody else setting the page table and thus
> > further simplify the model. But that's not quite compelling enough to
> > justify dropping the pte lock, then relocking and verifying the entry
> > for filesystems without ->page_mkwrite, which notably includes tmpfs.
> > Leave it for now and lock the page late in the !page_mkwrite case.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/memory.c | 46 ++++++++++++++++------------------------------
> > 1 file changed, 16 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2a2e3648ed65..ff92abfa5303 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> ...
> > @@ -2147,42 +2147,28 @@ reuse:
> > pte_unmap_unlock(page_table, ptl);
> > ret |= VM_FAULT_WRITE;
> >
> > - if (!dirty_page)
> > - return ret;
> > -
> > - if (!page_mkwrite) {
> > + if (dirty_shared) {
> > struct address_space *mapping;
> > int dirtied;
> >
> > - lock_page(dirty_page);
> > - dirtied = set_page_dirty(dirty_page);
> > - mapping = dirty_page->mapping;
> > - unlock_page(dirty_page);
> > + if (!page_mkwrite)
> > + lock_page(old_page);
> >
> > - if (dirtied && mapping) {
> > - /*
> > - * Some device drivers do not set page.mapping
> > - * but still dirty their pages
> > - */
> > - balance_dirty_pages_ratelimited(mapping);
> > - }
> > + dirtied = set_page_dirty(old_page);
> > + mapping = old_page->mapping;
> > + unlock_page(old_page);
> > + page_cache_release(old_page);
> >
> > - file_update_time(vma->vm_file);
> > - }
> > - put_page(dirty_page);
> > - if (page_mkwrite) {
> > - struct address_space *mapping = dirty_page->mapping;
> > -
> > - set_page_dirty(dirty_page);
> > - unlock_page(dirty_page);
> > - page_cache_release(dirty_page);
> > - if (mapping) {
> > + if ((dirtied || page_mkwrite) && mapping) {
> Why do we actually call balance_dirty_pages_ratelimited() even if we
> didn't dirty the page when ->page_mkwrite() exists? Is it because
> filesystem may dirty the page in ->page_mkwrite() and we don't want it to
> deal with calling balance_dirty_pages_ratelimited()?

Yes, ->page_mkwrite() can dirty the page, but balance_dirty_pages(),
as you noted, is not allowed under the page lock. However, the handler
also can't drop the page lock if that is the final set_page_dirty(), as
the pte isn't dirty yet, and clear_page_dirty_for_io() relies on the
pte being dirtied before the page when dirtying outside the page lock.

That being said, the page lock semantics in there are strange. It
seems to me that page_mkwrite semantics inherited fault semantics,
which don't allow the page lock to be dropped between verifying the
page->mapping and installing the page table, to make sure we don't map
a truncated page. That's why when ->page_mkwrite() returns with the
page unlocked, do_page_mkwrite() locks it and verifies page->mapping,
only to hold the page lock until after the page table update is done.

However, unlike during a nopage fault, we fault against an existing
read-only pte mapping of the page, and truncation needs to unmap it.
Even if we dropped both the page lock and the pte lock, that
pte_same() check after re-locking the page table would reliably tell
us if somebody swooped in and truncated the page behind our backs.

So AFAICS ->page_mkwrite() could safely return with the page unlocked
in terms of correctness. But OTOH if it locks the page anyway, it's
cheaper to just keep it until we need it again for set_page_dirty().
Either way, I think the truncation verification in do_page_mkwrite()
is unnecessary.

> Otherwise the patch looks good to me.

Thanks!