2006-05-25 13:55:39

by Peter Zijlstra

Subject: [PATCH 0/3] mm: tracking dirty pages -v5


I hacked up a new version last night.

It's now based on top of David's patches. Hugh's approach of using the
MAP_PRIVATE protections instead of the MAP_SHARED ones does indeed seem
far superior.

Q: would it be feasible to do so for all shared mappings so we can remove
the MAP_SHARED protections altogether?

They survive my simple testing, but especially the msync cleanup might need
some more attention.

I post them now instead of after a little more testing because I'll not
have much time in the coming few days to do so, and hoarding them does
nobody any good.

Peter


2006-05-25 13:56:16

by Peter Zijlstra

Subject: [PATCH 2/3] mm: balance dirty pages


From: Peter Zijlstra <[email protected]>

Now that we can detect writers of shared mappings, throttle them.
This avoids surprise OOM situations.

Changes -v2:

- small helper function (Andrew Morton)

Signed-off-by: Peter Zijlstra <[email protected]>

include/linux/writeback.h | 1 +
mm/memory.c | 5 +++--
mm/page-writeback.c | 10 ++++++++++
3 files changed, 14 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2006-05-25 15:46:31.000000000 +0200
+++ linux-2.6/mm/memory.c 2006-05-25 15:46:55.000000000 +0200
@@ -49,6 +49,7 @@
#include <linux/module.h>
#include <linux/init.h>
#include <linux/backing-dev.h>
+#include <linux/writeback.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -1558,7 +1559,7 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
- set_page_dirty(dirty_page);
+ set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
return ret;
@@ -2205,7 +2206,7 @@ retry:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
- set_page_dirty(dirty_page);
+ set_page_dirty_balance(dirty_page);
put_page(dirty_page);
}
return ret;
Index: linux-2.6/include/linux/writeback.h
===================================================================
--- linux-2.6.orig/include/linux/writeback.h 2006-05-25 15:07:44.000000000 +0200
+++ linux-2.6/include/linux/writeback.h 2006-05-25 15:46:55.000000000 +0200
@@ -114,6 +114,7 @@ int sync_page_range(struct inode *inode,
loff_t pos, loff_t count);
int sync_page_range_nolock(struct inode *inode, struct address_space *mapping,
loff_t pos, loff_t count);
+void set_page_dirty_balance(struct page *page);

/* pdflush.c */
extern int nr_pdflush_threads; /* Global so it can be exported to sysctl
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2006-05-25 15:46:08.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2006-05-25 15:46:55.000000000 +0200
@@ -255,6 +255,16 @@ static void balance_dirty_pages(struct a
pdflush_operation(background_writeout, 0);
}

+void set_page_dirty_balance(struct page *page)
+{
+ if (set_page_dirty(page)) {
+ struct address_space *mapping = page_mapping(page);
+
+ if (mapping)
+ balance_dirty_pages_ratelimited(mapping);
+ }
+}
+
/**
* balance_dirty_pages_ratelimited_nr - balance dirty memory state
* @mapping: address_space which was dirtied

2006-05-25 13:56:38

by Peter Zijlstra

Subject: [PATCH 3/3] mm: msync cleanup


From: Peter Zijlstra <[email protected]>

With the tracking of dirty pages properly done now, msync doesn't need to
scan the PTEs anymore to determine the dirty status.
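
A rough sketch of the resulting usage pattern (illustrative only, not part
of the patch; the file name is arbitrary and the file is assumed to exist
and span at least one page): MS_ASYNC returns immediately, and fsync() does
the real writeout.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("datafile", O_RDWR);
		char *map;

		if (fd < 0)
			return 1;
		map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;

		memset(map, 0, 4096);		/* write fault marks the page dirty */
		msync(map, 4096, MS_ASYNC);	/* returns at once; nothing to scan */
		fsync(fd);			/* actually writes out and waits */

		munmap(map, 4096);
		close(fd);
		return 0;
	}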

Signed-off-by: Peter Zijlstra <[email protected]>

mm/msync.c | 145 ++++---------------------------------------------------------
1 file changed, 10 insertions(+), 135 deletions(-)

Index: linux-2.6/mm/msync.c
===================================================================
--- linux-2.6.orig/mm/msync.c 2006-05-24 18:03:15.000000000 +0200
+++ linux-2.6/mm/msync.c 2006-05-24 18:38:09.000000000 +0200
@@ -20,128 +20,20 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

-static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end)
-{
- pte_t *pte;
- spinlock_t *ptl;
- int progress = 0;
- unsigned long ret = 0;
-
-again:
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- do {
- struct page *page;
-
- if (progress >= 64) {
- progress = 0;
- if (need_resched() || need_lockbreak(ptl))
- break;
- }
- progress++;
- if (!pte_present(*pte))
- continue;
- if (!pte_maybe_dirty(*pte))
- continue;
- page = vm_normal_page(vma, addr, *pte);
- if (!page)
- continue;
- if (ptep_clear_flush_dirty(vma, addr, pte) ||
- page_test_and_clear_dirty(page))
- ret += set_page_dirty(page);
- progress += 3;
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
- if (addr != end)
- goto again;
- return ret;
-}
-
-static inline unsigned long msync_pmd_range(struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr, unsigned long end)
-{
- pmd_t *pmd;
- unsigned long next;
- unsigned long ret = 0;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
- continue;
- ret += msync_pte_range(vma, pmd, addr, next);
- } while (pmd++, addr = next, addr != end);
- return ret;
-}
-
-static inline unsigned long msync_pud_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end)
-{
- pud_t *pud;
- unsigned long next;
- unsigned long ret = 0;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- ret += msync_pmd_range(vma, pud, addr, next);
- } while (pud++, addr = next, addr != end);
- return ret;
-}
-
-static unsigned long msync_page_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
-{
- pgd_t *pgd;
- unsigned long next;
- unsigned long ret = 0;
-
- /* For hugepages we can't go walking the page table normally,
- * but that's ok, hugetlbfs is memory based, so we don't need
- * to do anything more on an msync().
- */
- if (vma->vm_flags & VM_HUGETLB)
- return 0;
-
- BUG_ON(addr >= end);
- pgd = pgd_offset(vma->vm_mm, addr);
- flush_cache_range(vma, addr, end);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- ret += msync_pud_range(vma, pgd, addr, next);
- } while (pgd++, addr = next, addr != end);
- return ret;
-}
-
/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.67). Instead, it just
- * marks the relevant pages dirty. The application may now run fsync() to
+ * MS_ASYNC does not start I/O (it used to, up to 2.5.67).
+ * Nor does it just mark the relevant pages dirty (it used to, up to 2.6.17).
+ * Now it doesn't do anything, since dirty pages are properly tracked.
+ *
+ * The application may now run fsync() to
* write out the dirty pages and wait on the writeout and check the result.
* Or the application may run fadvise(FADV_DONTNEED) against the fd to start
* async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
-static int msync_interval(struct vm_area_struct *vma, unsigned long addr,
- unsigned long end, int flags,
- unsigned long *nr_pages_dirtied)
-{
- struct file *file = vma->vm_file;
-
- if ((flags & MS_INVALIDATE) && (vma->vm_flags & VM_LOCKED))
- return -EBUSY;
-
- if (file && (vma->vm_flags & VM_SHARED))
- *nr_pages_dirtied = msync_page_range(vma, addr, end);
- return 0;
-}

asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
{
@@ -157,6 +49,10 @@ asmlinkage long sys_msync(unsigned long
goto out;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
goto out;
+ if (flags & MS_ASYNC) {
+ error = 0;
+ goto out;
+ }
error = -ENOMEM;
len = (len + ~PAGE_MASK) & PAGE_MASK;
end = start + len;
@@ -178,7 +74,6 @@ asmlinkage long sys_msync(unsigned long
goto out_unlock;
}
do {
- unsigned long nr_pages_dirtied = 0;
struct file *file;

/* Here start < vma->vm_end. */
@@ -188,32 +83,12 @@ asmlinkage long sys_msync(unsigned long
}
/* Here vma->vm_start <= start < vma->vm_end. */
if (end <= vma->vm_end) {
- if (start < end) {
- error = msync_interval(vma, start, end, flags,
- &nr_pages_dirtied);
- if (error)
- goto out_unlock;
- }
error = unmapped_error;
done = 1;
- } else {
- /* Here vma->vm_start <= start < vma->vm_end < end. */
- error = msync_interval(vma, start, vma->vm_end, flags,
- &nr_pages_dirtied);
- if (error)
- goto out_unlock;
}
file = vma->vm_file;
start = vma->vm_end;
- if ((flags & MS_ASYNC) && file && nr_pages_dirtied) {
- get_file(file);
- up_read(&current->mm->mmap_sem);
- balance_dirty_pages_ratelimited_nr(file->f_mapping,
- nr_pages_dirtied);
- fput(file);
- down_read(&current->mm->mmap_sem);
- vma = find_vma(current->mm, start);
- } else if ((flags & MS_SYNC) && file &&
+ if ((flags & MS_SYNC) && file &&
(vma->vm_flags & VM_SHARED)) {
get_file(file);
up_read(&current->mm->mmap_sem);

2006-05-25 13:55:59

by Peter Zijlstra

Subject: [PATCH 1/3] mm: tracking shared dirty pages


From: Peter Zijlstra <[email protected]>

People expressed the need to track dirty pages in shared mappings.

Linus outlined the general idea of doing that through making clean
writable pages write-protected and taking the write fault.

This patch does exactly that, it makes pages in a shared writable
mapping write-protected. On write-fault the pages are marked dirty and
made writable. When the pages get synced with their backing store, the
write-protection is re-instated.

It survives a simple test and shows the dirty pages in /proc/vmstat.
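
A rough sketch of such a test (illustrative only, not part of the patch set;
the file name and size are arbitrary), dirtying a shared mapping and watching
nr_dirty in /proc/vmstat:

	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>

	#define LEN (4 << 20)	/* 4 MiB */

	static void show_nr_dirty(const char *when)
	{
		char line[128];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "nr_dirty", 8))
				printf("%s: %s", when, line);
		fclose(f);
	}

	int main(void)
	{
		int fd = open("testfile", O_RDWR | O_CREAT, 0644);
		char *map;

		if (fd < 0 || ftruncate(fd, LEN) < 0)
			return 1;
		map = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map == MAP_FAILED)
			return 1;

		show_nr_dirty("before");
		memset(map, 0xaa, LEN);	/* each write fault marks its page dirty */
		show_nr_dirty("after");

		munmap(map, LEN);
		close(fd);
		return 0;
	}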

Changes in -v5

- rename page_wrprotect() to page_mkclean() (suggested by Nick Piggin)
- added comment to test_clear_page_dirty() (Andrew Morton)
- cleanup page_wrprotect() (Andrew Morton)
- renamed VM_SharedWritable() to is_shared_writable()
- fs/buffer.c try_to_free_buffers(): remove clear_page_dirty() from under
->private_lock. This seems to be safe, since ->private_lock is used to
serialize access to the buffers, not the page itself.
- rebased on top of David Howells' page_mkwrite() patch.

Changes in -v4:

- small cleanup as suggested by Christoph Lameter.

Changes in -v3:

- move set_page_dirty() outside pte lock (suggested by Christoph Lameter)

Changes in -v2:

- only wrprotect pages from dirty capable mappings. (Nick Piggin)
- move the writefault handling from do_wp_page() into handle_pte_fault().
(Nick Piggin)
- revert to the old install_page interface. (Nick Piggin)
- also clear the pte dirty bit when we make pages read-only again.
(spotted by Rik van Riel)
- make page_wrprotect() return the number of reprotected ptes.

Signed-off-by: Peter Zijlstra <[email protected]>

fs/buffer.c | 2 -
include/linux/mm.h | 5 +++
include/linux/rmap.h | 8 ++++++
mm/fremap.c | 4 +--
mm/memory.c | 20 +++++++++++++++
mm/mmap.c | 7 ++++-
mm/mprotect.c | 7 ++++-
mm/page-writeback.c | 13 ++++++++--
mm/rmap.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 122 insertions(+), 8 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h 2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/include/linux/mm.h 2006-05-25 15:46:08.000000000 +0200
@@ -183,6 +183,11 @@ extern unsigned int kobjsize(const void
#define VM_SequentialReadHint(v) ((v)->vm_flags & VM_SEQ_READ)
#define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)

+static inline int is_shared_writable(struct vm_area_struct *vma)
+{
+ return (vma->vm_flags & (VM_SHARED|VM_WRITE)) == (VM_SHARED|VM_WRITE);
+}
+
/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
Index: linux-2.6/mm/fremap.c
===================================================================
--- linux-2.6.orig/mm/fremap.c 2006-05-25 15:07:43.000000000 +0200
+++ linux-2.6/mm/fremap.c 2006-05-25 15:46:08.000000000 +0200
@@ -79,9 +79,9 @@ int install_page(struct mm_struct *mm, s
inc_mm_counter(mm, file_rss);

flush_icache_page(vma, page);
- set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ pte_val = mk_pte(page, prot);
+ set_pte_at(mm, addr, pte, pte_val);
page_add_file_rmap(page);
- pte_val = *pte;
update_mmu_cache(vma, addr, pte_val);
err = 0;
unlock:
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c 2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/memory.c 2006-05-25 15:47:10.000000000 +0200
@@ -48,6 +48,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/backing-dev.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
struct page *old_page, *new_page;
pte_t entry;
int reuse, ret = VM_FAULT_MINOR;
+ struct page *dirty_page = NULL;

old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page)
goto gotten;

- if (unlikely(vma->vm_flags & VM_SHARED)) {
+ if (vma->vm_flags & VM_SHARED) {
if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
/*
* Notify the address space that the page is about to
@@ -1482,6 +1484,9 @@ static int do_wp_page(struct mm_struct *
}

reuse = 1;
+
+ dirty_page = old_page;
+ get_page(dirty_page);
} else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
reuse = can_share_swap_page(old_page);
unlock_page(old_page);
@@ -1552,6 +1557,10 @@ gotten:
page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
+ if (dirty_page) {
+ set_page_dirty(dirty_page);
+ put_page(dirty_page);
+ }
return ret;
oom:
if (old_page)
@@ -2084,6 +2093,7 @@ static int do_no_page(struct mm_struct *
unsigned int sequence = 0;
int ret = VM_FAULT_MINOR;
int anon = 0;
+ struct page *dirty_page = NULL;

pte_unmap(page_table);
BUG_ON(vma->vm_flags & VM_PFNMAP);
@@ -2178,6 +2188,10 @@ retry:
} else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
+ if (write_access) {
+ dirty_page = new_page;
+ get_page(dirty_page);
+ }
}
} else {
/* One of our sibling threads was faster, back out. */
@@ -2190,6 +2204,10 @@ retry:
lazy_mmu_prot_update(entry);
unlock:
pte_unmap_unlock(page_table, ptl);
+ if (dirty_page) {
+ set_page_dirty(dirty_page);
+ put_page(dirty_page);
+ }
return ret;
oom:
page_cache_release(new_page);
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c 2006-05-25 15:07:44.000000000 +0200
+++ linux-2.6/mm/page-writeback.c 2006-05-25 15:47:10.000000000 +0200
@@ -29,6 +29,7 @@
#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/syscalls.h>
+#include <linux/rmap.h>

/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -725,8 +726,14 @@ int test_clear_page_dirty(struct page *p
page_index(page),
PAGECACHE_TAG_DIRTY);
write_unlock_irqrestore(&mapping->tree_lock, flags);
- if (mapping_cap_account_dirty(mapping))
+ /*
+ * We can continue to use `mapping' here because the
+ * page is locked, which pins the address_space
+ */
+ if (mapping_cap_account_dirty(mapping)) {
+ page_mkclean(page);
dec_page_state(nr_dirty);
+ }
return 1;
}
write_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -756,8 +763,10 @@ int clear_page_dirty_for_io(struct page

if (mapping) {
if (TestClearPageDirty(page)) {
- if (mapping_cap_account_dirty(mapping))
+ if (mapping_cap_account_dirty(mapping)) {
+ page_mkclean(page);
dec_page_state(nr_dirty);
+ }
return 1;
}
return 0;
Index: linux-2.6/mm/rmap.c
===================================================================
--- linux-2.6.orig/mm/rmap.c 2006-05-25 15:07:46.000000000 +0200
+++ linux-2.6/mm/rmap.c 2006-05-25 15:46:08.000000000 +0200
@@ -472,6 +472,70 @@ int page_referenced(struct page *page, i
return referenced;
}

+static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte, entry;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ address = vma_address(page, vma);
+ if (address == -EFAULT)
+ goto out;
+
+ pte = page_check_address(page, mm, address, &ptl);
+ if (!pte)
+ goto out;
+
+ if (!pte_write(*pte))
+ goto unlock;
+
+ entry = pte_mkclean(pte_wrprotect(*pte));
+ ptep_establish(vma, address, pte, entry);
+ update_mmu_cache(vma, address, entry);
+ lazy_mmu_prot_update(entry);
+ ret = 1;
+
+unlock:
+ pte_unmap_unlock(pte, ptl);
+out:
+ return ret;
+}
+
+static int page_mkclean_file(struct address_space *mapping, struct page *page)
+{
+ pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ int ret = 0;
+
+ BUG_ON(PageAnon(page));
+
+ spin_lock(&mapping->i_mmap_lock);
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+ if (is_shared_writable(vma))
+ ret += page_mkclean_one(page, vma);
+ }
+ spin_unlock(&mapping->i_mmap_lock);
+ return ret;
+}
+
+int page_mkclean(struct page *page)
+{
+ int ret = 0;
+
+ BUG_ON(!PageLocked(page));
+
+ if (page_mapped(page)) {
+ struct address_space *mapping = page_mapping(page);
+ if (mapping)
+ ret = page_mkclean_file(mapping, page);
+ }
+
+ return ret;
+}
+
/**
* page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
Index: linux-2.6/include/linux/rmap.h
===================================================================
--- linux-2.6.orig/include/linux/rmap.h 2006-05-25 15:07:46.000000000 +0200
+++ linux-2.6/include/linux/rmap.h 2006-05-25 15:46:08.000000000 +0200
@@ -105,6 +105,14 @@ pte_t *page_check_address(struct page *,
*/
unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);

+/*
+ * Cleans the PTEs of shared mappings.
+ * (and since clean PTEs should also be readonly, write protects them too)
+ *
+ * returns the number of cleaned PTEs.
+ */
+int page_mkclean(struct page *);
+
#else /* !CONFIG_MMU */

#define anon_vma_init() do {} while (0)
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c 2006-05-25 15:07:43.000000000 +0200
+++ linux-2.6/fs/buffer.c 2006-05-25 15:46:09.000000000 +0200
@@ -2985,6 +2985,7 @@ int try_to_free_buffers(struct page *pag

spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
+ spin_unlock(&mapping->private_lock);
if (ret) {
/*
* If the filesystem writes its buffers by hand (eg ext3)
@@ -2996,7 +2997,6 @@ int try_to_free_buffers(struct page *pag
*/
clear_page_dirty(page);
}
- spin_unlock(&mapping->private_lock);
out:
if (buffers_to_free) {
struct buffer_head *bh = buffers_to_free;
Index: linux-2.6/mm/mmap.c
===================================================================
--- linux-2.6.orig/mm/mmap.c 2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/mmap.c 2006-05-25 15:46:45.000000000 +0200
@@ -25,6 +25,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/backing-dev.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -888,6 +889,7 @@ unsigned long do_mmap_pgoff(struct file
struct rb_node ** rb_link, * rb_parent;
int accountable = 1;
unsigned long charged = 0, reqprot = prot;
+ struct address_space *mapping = NULL;

if (file) {
if (is_file_hugepages(file))
@@ -1092,7 +1094,10 @@ munmap_back:

/* Don't make the VMA automatically writable if it's shared, but the
* backer wishes to know when pages are first written to */
- if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+ if (is_shared_writable(vma) && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+ if ((mapping && mapping_cap_account_dirty(mapping)) ||
+ (vma->vm_ops && vma->vm_ops->page_mkwrite))
vma->vm_page_prot =
protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];

Index: linux-2.6/mm/mprotect.c
===================================================================
--- linux-2.6.orig/mm/mprotect.c 2006-05-25 15:46:03.000000000 +0200
+++ linux-2.6/mm/mprotect.c 2006-05-25 15:46:09.000000000 +0200
@@ -19,6 +19,7 @@
#include <linux/mempolicy.h>
#include <linux/personality.h>
#include <linux/syscalls.h>
+#include <linux/backing-dev.h>

#include <asm/uaccess.h>
#include <asm/pgtable.h>
@@ -107,6 +108,7 @@ mprotect_fixup(struct vm_area_struct *vm
long nrpages = (end - start) >> PAGE_SHIFT;
unsigned long charged = 0;
unsigned int mask;
+ struct address_space *mapping = NULL;
pgprot_t newprot;
pgoff_t pgoff;
int error;
@@ -162,7 +164,10 @@ success:
/* Don't make the VMA automatically writable if it's shared, but the
* backer wishes to know when pages are first written to */
mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
- if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+ if (is_shared_writable(vma) && vma->vm_file)
+ mapping = vma->vm_file->f_mapping;
+ if ((mapping && mapping_cap_account_dirty(mapping)) ||
+ (vma->vm_ops && vma->vm_ops->page_mkwrite))
mask &= ~VM_SHARED;

newprot = protection_map[newflags & mask];

2006-05-25 13:55:49

by Peter Zijlstra

Subject: [PATCH -1/3] mm: page_mkwrite


The attached patch adds a new VMA operation to notify a filesystem or other
driver about the MMU generating a fault because userspace attempted to write
to a page mapped through a read-only PTE.

This facility permits the filesystem or driver to:

(*) Implement storage allocation/reservation on attempted write, and so to
deal with problems such as ENOSPC more gracefully (perhaps by generating
SIGBUS).

(*) Delay making the page writable until the contents have been written to a
backing cache. This is useful for NFS/AFS when using FS-Cache/CacheFS.
It permits the filesystem to have some guarantee about the state of the
cache.

(*) Account and limit number of dirty pages. This is one piece of the puzzle
needed to make shared writable mappings work safely in FUSE.
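
A rough sketch of how a filesystem might hook this (illustrative only, not
part of the patch; myfs_reserve_blocks() is a made-up helper returning 0 on
success or -ENOSPC on failure):

	/*
	 * Reserve backing store before a shared-writable page may be
	 * dirtied through a mapping; an error here gives the faulting
	 * writer SIGBUS.
	 */
	static int myfs_page_mkwrite(struct vm_area_struct *vma,
				     struct page *page)
	{
		struct inode *inode = vma->vm_file->f_mapping->host;

		/* hypothetical helper: 0 on success, -ENOSPC on failure */
		return myfs_reserve_blocks(inode, page->index);
	}

	static struct vm_operations_struct myfs_vm_ops = {
		.nopage		= filemap_nopage,
		.page_mkwrite	= myfs_page_mkwrite,
	};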

Signed-Off-By: David Howells <[email protected]>
---

include/linux/mm.h | 4 ++
mm/memory.c | 99 +++++++++++++++++++++++++++++++++++++++-------------
mm/mmap.c | 12 +++++-
mm/mprotect.c | 11 +++++-
4 files changed, 98 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1154684..cd3c2cf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -200,6 +200,10 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+
+ /* notification that a previously read-only page is about to become
+ * writable, if an error is returned it will cause a SIGBUS */
+ int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
#ifdef CONFIG_NUMA
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
struct mempolicy *(*get_policy)(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 0ec7bc6..6c6891e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1445,25 +1445,59 @@ static int do_wp_page(struct mm_struct *
{
struct page *old_page, *new_page;
pte_t entry;
- int ret = VM_FAULT_MINOR;
+ int reuse, ret = VM_FAULT_MINOR;

old_page = vm_normal_page(vma, address, orig_pte);
if (!old_page)
goto gotten;

- if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
- int reuse = can_share_swap_page(old_page);
- unlock_page(old_page);
- if (reuse) {
- flush_cache_page(vma, address, pte_pfn(orig_pte));
- entry = pte_mkyoung(orig_pte);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- ptep_set_access_flags(vma, address, page_table, entry, 1);
- update_mmu_cache(vma, address, entry);
- lazy_mmu_prot_update(entry);
- ret |= VM_FAULT_WRITE;
- goto unlock;
+ if (unlikely(vma->vm_flags & VM_SHARED)) {
+ if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
+ /*
+ * Notify the address space that the page is about to
+ * become writable so that it can prohibit this or wait
+ * for the page to get into an appropriate state.
+ *
+ * We do this without the lock held, so that it can
+ * sleep if it needs to.
+ */
+ page_cache_get(old_page);
+ pte_unmap_unlock(page_table, ptl);
+
+ if (vma->vm_ops->page_mkwrite(vma, old_page) < 0)
+ goto unwritable_page;
+
+ page_cache_release(old_page);
+
+ /*
+ * Since we dropped the lock we need to revalidate
+ * the PTE as someone else may have changed it. If
+ * they did, we just return, as we can count on the
+ * MMU to tell us if they didn't also make it writable.
+ */
+ page_table = pte_offset_map_lock(mm, pmd, address,
+ &ptl);
+ if (!pte_same(*page_table, orig_pte))
+ goto unlock;
}
+
+ reuse = 1;
+ } else if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
+ reuse = can_share_swap_page(old_page);
+ unlock_page(old_page);
+ } else {
+ reuse = 0;
+ }
+
+ if (reuse) {
+ flush_cache_page(vma, address, pte_pfn(orig_pte));
+ entry = pte_mkyoung(orig_pte);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ ptep_set_access_flags(vma, address, page_table, entry, 1);
+ update_mmu_cache(vma, address, entry);
+ lazy_mmu_prot_update(entry);
+ ret |= VM_FAULT_WRITE;
+ goto unlock;
}

/*
@@ -1523,6 +1557,10 @@ oom:
if (old_page)
page_cache_release(old_page);
return VM_FAULT_OOM;
+
+unwritable_page:
+ page_cache_release(old_page);
+ return VM_FAULT_SIGBUS;
}

/*
@@ -2074,18 +2112,31 @@ retry:
/*
* Should we do an early C-O-W break?
*/
- if (write_access && !(vma->vm_flags & VM_SHARED)) {
- struct page *page;
+ if (write_access) {
+ if (!(vma->vm_flags & VM_SHARED)) {
+ struct page *page;

- if (unlikely(anon_vma_prepare(vma)))
- goto oom;
- page = alloc_page_vma(GFP_HIGHUSER, vma, address);
- if (!page)
- goto oom;
- copy_user_highpage(page, new_page, address);
- page_cache_release(new_page);
- new_page = page;
- anon = 1;
+ if (unlikely(anon_vma_prepare(vma)))
+ goto oom;
+ page = alloc_page_vma(GFP_HIGHUSER, vma, address);
+ if (!page)
+ goto oom;
+ copy_user_highpage(page, new_page, address);
+ page_cache_release(new_page);
+ new_page = page;
+ anon = 1;
+
+ } else {
+ /* if the page will be shareable, see if the backing
+ * address space wants to know that the page is about
+ * to become writable */
+ if (vma->vm_ops->page_mkwrite &&
+ vma->vm_ops->page_mkwrite(vma, new_page) < 0
+ ) {
+ page_cache_release(new_page);
+ return VM_FAULT_SIGBUS;
+ }
+ }
}

page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
diff --git a/mm/mmap.c b/mm/mmap.c
index e6ee123..6446c61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1065,7 +1065,8 @@ munmap_back:
vma->vm_start = addr;
vma->vm_end = addr + len;
vma->vm_flags = vm_flags;
- vma->vm_page_prot = protection_map[vm_flags & 0x0f];
+ vma->vm_page_prot = protection_map[vm_flags &
+ (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
vma->vm_pgoff = pgoff;

if (file) {
@@ -1089,6 +1090,12 @@ munmap_back:
goto free_vma;
}

+ /* Don't make the VMA automatically writable if it's shared, but the
+ * backer wishes to know when pages are first written to */
+ if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+ vma->vm_page_prot =
+ protection_map[vm_flags & (VM_READ|VM_WRITE|VM_EXEC)];
+
/* We set VM_ACCOUNT in a shared mapping's vm_flags, to inform
* shmem_zero_setup (perhaps called through /dev/zero's ->mmap)
* that memory reservation must be checked; but that reservation
@@ -1921,7 +1928,8 @@ unsigned long do_brk(unsigned long addr,
vma->vm_end = addr + len;
vma->vm_pgoff = pgoff;
vma->vm_flags = flags;
- vma->vm_page_prot = protection_map[flags & 0x0f];
+ vma->vm_page_prot = protection_map[flags &
+ (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)];
vma_link(mm, vma, prev, rb_link, rb_parent);
out:
mm->total_vm += len >> PAGE_SHIFT;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4c14d42..2697abd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -106,6 +106,7 @@ mprotect_fixup(struct vm_area_struct *vm
unsigned long oldflags = vma->vm_flags;
long nrpages = (end - start) >> PAGE_SHIFT;
unsigned long charged = 0;
+ unsigned int mask;
pgprot_t newprot;
pgoff_t pgoff;
int error;
@@ -132,8 +133,6 @@ mprotect_fixup(struct vm_area_struct *vm
}
}

- newprot = protection_map[newflags & 0xf];
-
/*
* First try to merge with previous and/or next vma.
*/
@@ -160,6 +159,14 @@ mprotect_fixup(struct vm_area_struct *vm
}

success:
+ /* Don't make the VMA automatically writable if it's shared, but the
+ * backer wishes to know when pages are first written to */
+ mask = VM_READ|VM_WRITE|VM_EXEC|VM_SHARED;
+ if (vma->vm_ops && vma->vm_ops->page_mkwrite)
+ mask &= ~VM_SHARED;
+
+ newprot = protection_map[newflags & mask];
+
/*
* vm_flags and vm_page_prot are protected by the mmap_sem
* held in write mode.

2006-05-25 16:21:45

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 25 May 2006, Peter Zijlstra wrote:

> @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
>
> - if (unlikely(vma->vm_flags & VM_SHARED)) {
> + if (vma->vm_flags & VM_SHARED) {

You seem to add this unlikely again later. Why remove it in the first place?

> +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> + entry = pte_mkclean(pte_wrprotect(*pte));
> + ptep_establish(vma, address, pte, entry);

> + update_mmu_cache(vma, address, entry);

You only changed protections on an existing pte and ptep_establish
already flushed the tlb. No need to call update_mmu_cache. See how
change_protection() in mm/mprotect.c does it.

> + lazy_mmu_prot_update(entry);

Needed.

2006-05-25 16:27:58

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 25 May 2006, Peter Zijlstra wrote:

> - rebased on top of David Howells' page_mkwrite() patch.

I am a bit confused about the need for Davids patch. set_page_dirty() is
already a notification that a page is to be dirtied. Why do we need it
twice? set_page_dirty could return an error code and the file system can
use the set_page_dirty() hook to get its notification. What we would need
to do is to make sure that set_page_dirty can sleep.

2006-05-25 17:00:27

by Peter Zijlstra

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 2006-05-25 at 09:21 -0700, Christoph Lameter wrote:
> On Thu, 25 May 2006, Peter Zijlstra wrote:
>
> > @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
> >
> > - if (unlikely(vma->vm_flags & VM_SHARED)) {
> > + if (vma->vm_flags & VM_SHARED) {
>
> You seem to add this unlikely again later. Why remove it in the first place?

I'm not sure I follow you; are you suggesting that we'll find the
condition to be unlikely still, even with most of the shared mappings
trapping this branch?

> > +static int page_mkclean_one(struct page *page, struct vm_area_struct *vma)
> > + entry = pte_mkclean(pte_wrprotect(*pte));
> > + ptep_establish(vma, address, pte, entry);
>
> > + update_mmu_cache(vma, address, entry);
>
> You only changed protections on an existing pte and ptep_establish
> already flushed the tlb. No need to call update_mmu_cache. See how
> change_protection() in mm/mprotect.c does it.

OK, will check.

> > + lazy_mmu_prot_update(entry);
>
> Needed.

2006-05-25 17:03:08

by Peter Zijlstra

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 2006-05-25 at 09:27 -0700, Christoph Lameter wrote:
> On Thu, 25 May 2006, Peter Zijlstra wrote:
>
> > - rebased on top of David Howells' page_mkwrite() patch.
>
> I am a bit confused about the need for Davids patch. set_page_dirty() is
> already a notification that a page is to be dirtied. Why do we need it
> twice? set_page_dirty could return an error code and the file system can
> use the set_page_dirty() hook to get its notification. What we would need
> to do is to make sure that set_page_dirty can sleep.

Ah, I see what you're saying here. Good point, David, Hugh?

The reason I did it was Hugh's trick of using the MAP_SHARED
protection; building on top of it naturally solves the patch conflict
Andrew would otherwise have had to resolve.

2006-05-25 17:03:40

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 25 May 2006, Peter Zijlstra wrote:

> On Thu, 2006-05-25 at 09:21 -0700, Christoph Lameter wrote:
> > On Thu, 25 May 2006, Peter Zijlstra wrote:
> >
> > > @@ -1446,12 +1447,13 @@ static int do_wp_page(struct mm_struct *
> > >
> > > - if (unlikely(vma->vm_flags & VM_SHARED)) {
> > > + if (vma->vm_flags & VM_SHARED) {
> >
> > You seem to add this unlikely again later. Why remove it in the first place?
>
> I'm not sure I follow you; are you suggesting that we'll find the
> condition to be unlikely still, even with most of the shared mappings
> trapping this branch?

No, I just saw the opposite in a later patch. It was the -1 patch that
does

+ if (unlikely(vma->vm_flags & VM_SHARED)) {

but that's a different context?

2006-05-25 17:06:56

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 25 May 2006, Peter Zijlstra wrote:

> Ah, I see what you're saying here. Good point, David, Hugh?
>
> The reason I did it was Hugh's trick of using the MAP_SHARED
> protection; building on top of it naturally solves the patch conflict
> Andrew would otherwise have had to resolve.

I guess what we wanted is a patch that addresses the concerns of both
patches, not a combination of the patches. IMHO the dirty notification
of David's patch is possible with the shared dirty pages patch if we allow
the set_page_dirty method in the address_space operations to sleep and return
an error code. However, this may raise some additional issues because we have
to check whenever we dirty a page.


2006-05-26 02:29:08

by Jeff Anderson-Lee

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

Christoph Lameter wrote:

> I am a bit confused about the need for Davids patch. set_page_dirty() is
> already a notification that a page is to be dirtied. Why do we need it
> twice? set_page_dirty could return an error code and the file system can
> use the set_page_dirty() hook to get its notification. What we would need
> to do is to make sure that set_page_dirty can sleep.

set_page_dirty() is actually called fairly late in the game by
zap_pte_range() and follow_page(). Thus, it is a notification that a page
HAS BEEN dirtied and needs a writeback.

What is needed (at least for log structured or copy-on-write filesystems) is
a function that is called during the page fault BEFORE the pte's writeable
bit is set, so that the page's mapping can either be unmapped or remapped.
That prevents possible readers of the old (read-only) version of the page
from seeing any changes via the page map cache. If the page was unmapped, a
second call is needed later to map it to a new set of blocks on the device.
For a log structured filesystem it can make sense to defer the remapping
until late in the game, when the block is about to be queued for i/o, in case
it gets deleted before it is ever flushed from the page cache.

I looked at using set_page_dirty for the first of these hooks, but it comes
too late. It might be OK for the second (remapping) hook though.

do_wp_page() is called by handle_pte_fault() in exactly the case where this
would apply: if write access is needed and !pte_write(entry). This patch
appears to address the necessary issue in a way that set_page_dirty cannot.

Jeff Anderson-Lee

[I'm a little late in coming in on this thread, and not 100% familiar with
the latest kernels, so please pardon me if I'm not fully up to speed yet.
I'm trying to catch up as quickly as I can though!]

2006-05-26 02:33:43

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Thu, 25 May 2006, Jeff Anderson-Lee wrote:

> Christoph Lameter wrote:
>
> > I am a bit confused about the need for Davids patch. set_page_dirty() is
> > already a notification that a page is to be dirtied. Why do we need it
> > twice? set_page_dirty could return an error code and the file system can
> > use the set_page_dirty() hook to get its notification. What we would need
> > to do is to make sure that set_page_dirty can sleep.
>
> set_page_dirty() is actually called fairly late in the game by
> zap_pte_range() and follow_page(). Thus, it is a notification that a page
> HAS BEEN dirtied and needs a writeback.

The tracking patch changes that behavior. set_page_dirty is called before
the write to the page occurs and so it's similar to the new method
introduced by David's patch.

2006-05-26 14:33:46

by David Howells

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

Christoph Lameter <[email protected]> wrote:

> > - rebased on top of David Howells' page_mkwrite() patch.
>
> I am a bit confused about the need for Davids patch. set_page_dirty() is
> already a notification that a page is to be dirtied. Why do we need it
> twice? set_page_dirty could return an error code and the file system can
> use the set_page_dirty() hook to get its notification. What we would need
> to do is to make sure that set_page_dirty can sleep.

page_mkwrite() is called just before the _PTE_ is dirtied. Take do_wp_page()
for example: set_page_dirty() is called after a lot of stuff, including some
stuff that marks the PTE dirty... by which time it's too late as another
thread sharing the page tables can come along and modify the page before the
first thread calls set_page_dirty().

Furthermore, by that point it's pointless to have set_page_dirty() return an
error. The deed is effectively done: the PTE is marked dirty and
writable.

And also as you pointed out, set_page_dirty() needs to be able to sleep.
There are places where it's called still, even with Peter's patch, with the
page table lock held - zap_pte_range() for example. In that particular case,
dropping the lock for each PTE would be bad for performance.

Basically, you can look at it as page_mkwrite() is called upfront, and
set_page_dirty() is called at the end.

David

2006-05-26 15:40:08

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Fri, 26 May 2006, David Howells wrote:

> page_mkwrite() is called just before the _PTE_ is dirtied. Take do_wp_page()
> for example: set_page_dirty() is called after a lot of stuff, including some
> stuff that marks the PTE dirty... by which time it's too late as another
> thread sharing the page tables can come along and modify the page before the
> first thread calls set_page_dirty().

Since we are terminating the application with extreme prejudice on an
error (SIGBUS) it does not matter if another process has written to the
page in the meantime.

> And also as you pointed out, set_page_dirty() needs to be able to sleep.
> There are places where it's called still, even with Peter's patch, with the
> page table lock held - zap_pte_range() for example. In that particular case,
> dropping the lock for each PTE would be bad for performance.

zap_pte_range would only have to dirty anonymous pages. The pages of
shared mappings would already be dirty.

> Basically, you can look at it as page_mkwrite() is called upfront, and
> set_page_dirty() is called at the end.

The end is that the page is written back. I think we can still solve this
with set_page_dirty being called when a page is about to be dirtied or
was dirtied.

The page_mkwrite() method does not really allow the tracking of dirty
pages. It is a way to track the potentially dirty pages that is useful if
one is not able to track dirty pages. Moreover, unmapped dirtied pages do
not factor into that scheme probably because it was thought that they are
already sufficiently tracked by nr_dirty. However, having two methods
of accounting for dirty pages creates problems in correlating the number
of dirty pages. This is unnecessarily complex.

In order to consistently reach the goal of tracking dirty pages we
have to deal with set_page_dirty(). In the first stage let's just
be satisfied with being able to throttle dirty pages by having an accurate
nr_dirty.

We can then avoid doing too many things in set_page_dirty so that we do
not have to sleep or return an error. Maybe add the support for errors
(SIGBUS) later. But then we should consistently check every time we dirty a
page, be it mapped or unmapped.

2006-05-30 08:01:33

by David Howells

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

Christoph Lameter <[email protected]> wrote:

> > page_mkwrite() is called just before the _PTE_ is dirtied. Take
> > do_wp_page() for example: set_page_dirty() is called after a lot of stuff,
> > including some stuff that marks the PTE dirty... by which time it's too
> > late as another thread sharing the page tables can come along and modify
> > the page before the first thread calls set_page_dirty().
>
> Since we are terminating the application with extreme prejudice on an
> error (SIGBUS) it does not matter if another process has written to the
> page in the meantime.

Erm... Yes, it does matter, at least for AFS or NFS using FS-Cache, whether
or not we're generating a SIGBUS or just proceeding normally. There are two
cases I've come across:

Firstly I use page_mkwrite() to make sure that the page is written to the cache
before we let anyone modify it, just so that we've got a reasonable idea of
what's in the cache.

What we currently have is:

invoke page_mkwrite()
- wait for page to be written to the cache
lock
modify PTE
unlock
invoke set_page_dirty()

What your suggestion gives is:

lock
modify PTE
unlock
invoke set_page_dirty()
- wait for page to be written to the cache

But what can happen is:

CPU 1 CPU 2
======================= =======================
write to page (faults)
lock
modify PTE
unlock
write to page (succeeds)
invoke set_page_dirty()
- wait for page to be written to the cache
write to page (succeeds)

That potentially lets data of uncertain state into the cache, which means we
can't trust what's in the cache any longer.

Secondly some filesystems want to use page_mkwrite() to prevent a write from
occurring if a write to a shared writable mapping would require an allocation
from a filesystem that's currently in an ENOSPC state. That means that we may
not change the PTE until we're sure that the allocation is guaranteed to
succeed, and that means that the kernel isn't left with dirty pages attached to
inodes it'd like to dispose of but can't because there's nowhere to write the
data.

David

2006-05-30 15:38:43

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, David Howells wrote:

> Christoph Lameter <[email protected]> wrote:
>
> > > page_mkwrite() is called just before the _PTE_ is dirtied. Take
> > > do_wp_page() for example: set_page_dirty() is called after a lot of stuff,
> > > including some stuff that marks the PTE dirty... by which time it's too
> > > late as another thread sharing the page tables can come along and modify
> > > the page before the first thread calls set_page_dirty().

Maybe I do not understand properly. I thought page_mkwrite is called
before a page is made writable, not before it is dirtied. If it's only
called before the page is dirtied then a better name might be before_dirty or
so?

> > Since we are terminating the application with extreme prejudice on an
> > error (SIGBUS) it does not matter if another process has written to the
> > page in the meantime.
>
> Erm... Yes, it does matter, at least for AFS or NFS using FS-Cache, whether
> or not we're generating a SIGBUS or just proceeding normally. There are two
> cases I've come across:
>
> Firstly I use page_mkwrite() to make sure that the page is written to the cache
> before we let anyone modify it, just so that we've got a reasonable idea of
> what's in the cache.

What do you mean by "written to the cache"? It cannot be written back
since the page hasn't been dirtied yet. So "written to the cache" means
that the FS does some reservation, right?

> What we currently have is:
>
> invoke page_mkwrite()
> - wait for page to be written to the cache

I am not sure what the point would be to write a page that is
not dirty. So I guess this reserves space on disk for the page.

> lock
> modify PTE
> unlock
> invoke set_page_dirty()
>
> What your suggestion gives is:
>
> lock
> modify PTE
> unlock
> invoke set_page_dirty()
> - wait for page to be written to the cache

Regardless of what "written to the cache" exactly means here,
we have the page marked dirty in the mapping and queued for write back.

> But what can happen is:
>
> CPU 1 CPU 2
> ======================= =======================
> write to page (faults)
> lock
> modify PTE
> unlock
> write to page (succeeds)

Correct.

> invoke set_page_dirty()
> - wait for page to be written to the cache
> write to page (succeeds)
>
> That potentially lets data of uncertain state into the cache, which means we
> can't trust what's in the cache any longer.

It continues established usage, usage that even with your patch
persists at least for anonymous pages. The page dirty state may also be
reflected by any dirty bit set in a pte pointing to the page, even if the
dirty state is not set on the page itself.

> Secondly some filesystems want to use page_mkwrite() to prevent a write from
> occurring if a write to a shared writable mapping would require an allocation
> from a filesystem that's currently in an ENOSPC state. That means that we may
> not change the PTE until we're sure that the allocation is guaranteed to
> succeed, and that means that the kernel isn't left with dirty pages attached to
> inodes it'd like to dispose of but can't because there's nowhere to write the
> data.

If set_page_dirty cannot reserve the page then we know that some severe
action is required. The FS method set_page_dirty() could:

1. Determine the ENOSPC condition before it sets the page dirty.
That leaves the potential that some writes to the page have occurred
by other processes.

2. Track down all processes that use the mapping (or maybe less
severe: processes that have set the dirty bit in the pte) and
terminate them with SIGBUS.

2006-05-30 16:27:59

by David Howells

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

Christoph Lameter <[email protected]> wrote:

> Maybe I do not understand properly. I thought page_mkwrite is called
> before a page is made writable, not before it is dirtied. If it's only
> called before the page is dirtied then a better name might be before_dirty or
> so?

page_mkwrite() is called before any of the PTEs referring to a page are made
writable. This must precede a page being dirtied by writing to it directly
through an mmap'd section. It does not catch write() and co. dirtying pages,
but then there's no need since prepare_write() is available for that.

> What do you mean by "written to the cache"? It cannot be written back
> since the page hasn't been dirtied yet. So "written to the cache" means
> that the FS does some reservation, right?

See the FS-Cache patches posted to LKML on the 19th of May, in particular the
documentation included in the patch with the subject:

[PATCH 10/14] FS-Cache: Generic filesystem caching facility [try #10]

These patches permit data retrieved from network filesystems (NFS and AFS for
now) to be cached locally on disk.

The page is fetched from the server and then written to the cache. We don't
let the clean page be modified or released until the write to the cache is
complete. This permits us to keep track of what state the cache is in.

> If set_page_dirty cannot reserve the page then we know that some severe
> action is required. The FS method set_page_dirty() could:

But by the time set_page_dirty() is called, it's too late as the code
currently stands. We've already marked the PTE writable and dirty. The
page_mkwrite() op is called _first_.

> 1. Determine the ENOSPC condition before it sets the page dirty.
> That leaves the potential that some writes to the page have occurred
> by other processes.

You have to detect ENOSPC before you modify the PTE, otherwise someone else
can jump through your hoop and dirty the page before you can stop them.

> 2. Track down all processes that use the mapping (or maybe less

That's bad, even if you restrict it to those that have MAP_SHARED and
PROT_WRITE set. They should not be terminated if they haven't attempted to
write to the mapping.

> severe: processes that have set the dirty bit in the pte) and
> terminate them with SIGBUS.

Brrrr.

What's wrong with my suggestion anyway?

David

2006-05-30 17:03:17

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, David Howells wrote:

> > If set_page_dirty cannot reserve the page then we know that some severe
> > action is required. The FS method set_page_dirty() could:
>
> But by the time set_page_dirty() is called, it's too late as the code
> currently stands. We've already marked the PTE writable and dirty. The
> page_mkwrite() op is called _first_.

We are in set_page_dirty and this would be part of set_page_dirty
processing.

> > 2. Track down all processes that use the mapping (or maybe less
>
> That's bad, even if you restrict it to those that have MAP_SHARED and
> PROT_WRITE set. They should not be terminated if they haven't attempted to
> write to the mapping.

It's bad, but the out of space situation is an exceptional one. We do
similar contortions when we run out of memory space. As I said: One can
track down the processes that have dirtied the pte to the page in question
and just terminate those and remove the page.

> What's wrong with my suggestion anyway?

Adds yet another method with functionality that for the most part
is the same as set_page_dirty().

The advantage of such a method seems to be that it reserves filesystem
space for pages that could potentially be written to. This allows the
filesystem to accurately deal with out of space situations (a very rare
condition. Is this really justifiable?). Maybe having already reserved
space could speed up the real dirtying of pages?

2006-05-30 17:25:38

by Hugh Dickins

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, Christoph Lameter wrote:
> On Tue, 30 May 2006, David Howells wrote:
>
> > What's wrong with my suggestion anyway?
>
> Adds yet another method with functionality that for the most part
> is the same as set_page_dirty().

Your original question, whether they could be combined, was a good one;
and I hoped you'd be right. But I agree with David, they cannot, unless
we sacrifice the guarantee that one or the other is there to give. It's
much like the relationship between ->prepare_write and ->commit_write.

Hugh

2006-05-30 17:31:17

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, Hugh Dickins wrote:

> Your original question, whether they could be combined, was a good one;
> and I hoped you'd be right. But I agree with David, they cannot, unless
> we sacrifice the guarantee that one or the other is there to give. It's
> much like the relationship between ->prepare_write and ->commit_write.

Ok, so separate patch sets?

2006-05-30 17:41:21

by Hugh Dickins

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, Christoph Lameter wrote:
> On Tue, 30 May 2006, Hugh Dickins wrote:
>
> > Your original question, whether they could be combined, was a good one;
> > and I hoped you'd be right. But I agree with David, they cannot, unless
> > we sacrifice the guarantee that one or the other is there to give. It's
> > much like the relationship between ->prepare_write and ->commit_write.
>
> Ok, so separate patch sets?

They are separate patch sets (and rc5-mm1 has David's without Peter's).
But they trample on nearby areas and share infrastructure, so I'm happy
that they're looked at together. Peter has helpfully arranged his to
go on top of David's: that can be reversed later if it's decided that
Peter's is wanted but David's not.

Hugh

2006-05-30 17:57:05

by David Howells

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

Christoph Lameter <[email protected]> wrote:

> On Tue, 30 May 2006, David Howells wrote:
>
> > > If set_page_dirty cannot reserve the page then we know that some severe
> > > action is required. The FS method set_page_dirty() could:
> >
> > But by the time set_page_dirty() is called, it's too late as the code
> > currently stands. We've already marked the PTE writable and dirty. The
> > page_mkwrite() op is called _first_.
>
> We are in set_page_dirty and this would be part of set_page_dirty
> processing.

Eh? What do you mean "We are in set_page_dirty"?

The main caller of page_mkwrite() is do_wp_page(). With my patch, this
function does the following steps in order:

(1) Invoke page_mkwrite() if available, giving SIGBUS on error.

(2) Mark the PTE writable and dirty.

(3) Unlock the page table.

Then with Peter's patch:

(4) Invoke set_page_dirty_balance().

We shouldn't be having to handle an error at step (4), and we shouldn't be
marking the page dirty before step (2).

In fact, it could be argued that we shouldn't be marking the page dirty until
it is actually dirty, but I can't see any way to do that atomically short of
emulating the store instruction without hardware support.

> > > 2. Track down all processes that use the mapping (or maybe less
> >
> > That's bad, even if you restrict it to those that have MAP_SHARED and
> > PROT_WRITE set. They should not be terminated if they haven't attempted to
> > write to the mapping.
>
> It's bad, but the out of space situation is an exceptional one. We do
> similar contortions when we run out of memory space. As I said: One can
> track down the processes that have dirtied the pte to the page in question
> and just terminate those and remove the page.

In general, we really really shouldn't just remove the page. The only place I
would consider it appropriate to destroy a dirty page is if we can't write it
back because the backing store is no longer available due to an event we
cannot predict (network filesystems for example).

We should operate on preventative measures rather than curative ones: we
should detect ENOSPC in advance and signal the process doing the write
that it's not permitted to proceed. We shouldn't go and kill all the
processes that might write to that mapping.

Prevention is almost always cleaner than trying to fix things up afterwards.

> > What's wrong with my suggestion anyway?
>
> Adds yet another method with functionality that for the most part
> is the same as set_page_dirty().

It's not exactly the same. It's almost the same as the difference between
prepare_write() and commit_write(). Currently it's very much the same
because, as I understand it, the PTE dirty flag is transferred to the page
struct when the PTE is destroyed (in zap_pte_range()).

Actually, I'm not sure that calling set_page_dirty() at the bottom of
do_wp_page() is necessarily a good idea. It's possible that the page will be
marked dirty in do_wp_page() and then will get written back before the write
actually succeeds. In other words the page may be marked dirty and cleaned up
all before the modification _actually_ occurs. On the other hand, the common
case is probably that the store instruction will beat the writeback.

Of course, if you're proposing to call set_page_dirty() before the PTE is
modified, that would work. But you still have to make sure that it isn't
called under spinlock, and doing that would make zap_pte_range() potentially
indulge in lock thrashing.

> The advantage of such a method seems to be that it reserves filesystem
> space for pages that could potentially be written to. This allows the
> filesystem to accurately deal with out of space situations (a very rare
> condition. Is this really justifiable?). Maybe having already reserved
> space could speed up the real dirtying of pages?

It's better than having unwritable pages sat around in the pagecache because
there's no space on the backing device to write them back. At least this way
gives a graceful way to handle the situation.

Destroying uncommitted data isn't really an acceptable answer.

David

2006-05-30 20:22:43

by Christoph Lameter

Subject: Re: [PATCH 1/3] mm: tracking shared dirty pages

On Tue, 30 May 2006, David Howells wrote:

> Christoph Lameter <[email protected]> wrote:
>
> > On Tue, 30 May 2006, David Howells wrote:
> >
> > > > If set_page_dirty cannot reserve the page then we know that some severe
> > > > action is required. The FS method set_page_dirty() could:
> > >
> > > But by the time set_page_dirty() is called, it's too late as the code
> > > currently stands. We've already marked the PTE writable and dirty. The
> > > page_mkwrite() op is called _first_.
> >
> > We are in set_page_dirty and this would be part of set_page_dirty
> > processing.
>
> Eh? What do you mean "We are in set_page_dirty"?

We could do the reservation in as part of the set_page_dirty FS method.

> Actually, I'm not sure that calling set_page_dirty() at the bottom of
> do_wp_page() is necessarily a good idea. It's possible that the page will be
> marked dirty in do_wp_page() and then will get written back before the write
> actually succeeds. In other words the page may be marked dirty and cleaned up
> all before the modification _actually_ occurs. On the other hand, the common
> case is probably that the store instruction will beat the writeback.

Yes, we are aware of that case.