2004-06-02 20:05:38

by Hugh Dickins

Subject: [PATCH] swapper_space.i_mmap_nonlinear

Initialize swapper_space.i_mmap_nonlinear, so that mapping_mapped reports
false on it (as it used to do). Update the comment on swapper_space, now
that more fields are used than those initialized explicitly.

Signed-off-by: Hugh Dickins <[email protected]>

mm/swap_state.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletion(-)

--- 2.6.7-rc2/mm/swap_state.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/swap_state.c 2004-06-02 16:31:21.890093952 +0100
@@ -19,7 +19,8 @@

/*
* swapper_space is a fiction, retained to simplify the path through
- * vmscan's shrink_list. Only those fields initialized below are used.
+ * vmscan's shrink_list, to make sync_page look nicer, and to allow
+ * future use of radix_tree tags in the swap cache.
*/
static struct address_space_operations swap_aops = {
.writepage = swap_writepage,
@@ -36,6 +37,7 @@ struct address_space swapper_space = {
.page_tree = RADIX_TREE_INIT(GFP_ATOMIC),
.tree_lock = SPIN_LOCK_UNLOCKED,
.a_ops = &swap_aops,
+ .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.backing_dev_info = &swap_backing_dev_info,
};
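
For reference, mapping_mapped() at this point amounts to roughly the
following check, which is why an uninitialized i_mmap_nonlinear list made
it report true for the swap cache (a sketch recalled from the 2.6.7-era
include/linux/fs.h, not an exact quote):

	static inline int mapping_mapped(struct address_space *mapping)
	{
		return	!prio_tree_empty(&mapping->i_mmap) ||
			!list_empty(&mapping->i_mmap_nonlinear);
	}

With a statically allocated swapper_space, an uninitialized list_head is
all zeroes, so list_empty() sees next != &list and wrongly answers
"mapped"; the LIST_HEAD_INIT above restores the expected "false".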



2004-06-02 20:07:11

by Hugh Dickins

Subject: [PATCH] follow_page invalid pte_page

The follow_page write-access case is relying on pte_page before checking
pfn_valid: rearrange that - and we don't need three struct page *pages.

(I notice mempolicy.c's verify_pages is also relying on pte_page, but
I'll leave that to Andi: maybe it ought to be failing on, or skipping
over, VM_IO or VM_RESERVED vmas?)

Signed-off-by: Hugh Dickins <[email protected]>

mm/memory.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

--- 2.6.7-rc2/mm/memory.c 2004-05-30 11:36:39.000000000 +0100
+++ linux/mm/memory.c 2004-06-02 16:31:33.037399304 +0100
@@ -637,15 +637,11 @@ follow_page(struct mm_struct *mm, unsign
if (pte_present(pte)) {
if (write && !pte_write(pte))
goto out;
- if (write && !pte_dirty(pte)) {
- struct page *page = pte_page(pte);
- if (!PageDirty(page))
- set_page_dirty(page);
- }
pfn = pte_pfn(pte);
if (pfn_valid(pfn)) {
- struct page *page = pfn_to_page(pfn);
-
+ page = pfn_to_page(pfn);
+ if (write && !pte_dirty(pte) && !PageDirty(page))
+ set_page_dirty(page);
mark_page_accessed(page);
return page;
}
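
To see why the order matters: on a flat-memory i386 configuration the
macros involved expand roughly as below (a sketch of the 2.6.7-era
definitions, quoted from memory rather than from any one header):

	#define pte_page(x)		pfn_to_page(pte_pfn(x))
	#define pfn_to_page(pfn)	(mem_map + (pfn))
	#define pfn_valid(pfn)		((pfn) < max_mapnr)

So calling pte_page() on a pte whose pfn fails pfn_valid() - e.g. a
driver's mapping of I/O space - yields a struct page pointer outside
mem_map, which the old code then dereferenced via PageDirty and
set_page_dirty. The rearranged code only forms the struct page * once
pfn_valid() has approved the pfn.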

2004-06-02 20:09:42

by Hugh Dickins

Subject: [PATCH] vma_adjust insert file earlier

For those arches (arm and parisc) which use the i_mmap tree to implement
flush_dcache_page, during split_vma there's a small window in vma_adjust
when flush_dcache_mmap_lock is dropped, and pages in the split-off part
of the vma might for an instant be invisible to __flush_dcache_page.

Though we're more solid there than ever before, I guess it's a bad idea
to leave that window: so (with regret, it was structurally nicer before)
take __vma_link_file (and vma_prio_tree_init) out of __vma_link.

vma_prio_tree_init (which NULLs a few fields) is actually only needed
when copying a vma, not when a new one has just been memset to 0.

__insert_vm_struct is used by nothing but vma_adjust's split_vma case:
comment it accordingly, remove its mark_mm_hugetlb (it can never create
a new kind of vma) and its validate_mm (another follows immediately).

Signed-off-by: Hugh Dickins <[email protected]>

mm/mmap.c | 24 +++++++++++++++++-------
1 files changed, 17 insertions(+), 7 deletions(-)

--- 2.6.7-rc2/mm/mmap.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/mmap.c 2004-06-02 16:31:55.343008336 +0100
@@ -293,10 +293,8 @@ __vma_link(struct mm_struct *mm, struct
struct vm_area_struct *prev, struct rb_node **rb_link,
struct rb_node *rb_parent)
{
- vma_prio_tree_init(vma);
__vma_link_list(mm, vma, prev, rb_parent);
__vma_link_rb(mm, vma, rb_link, rb_parent);
- __vma_link_file(vma);
__anon_vma_link(vma);
}

@@ -312,7 +310,10 @@ static void vma_link(struct mm_struct *m
if (mapping)
spin_lock(&mapping->i_mmap_lock);
anon_vma_lock(vma);
+
__vma_link(mm, vma, prev, rb_link, rb_parent);
+ __vma_link_file(vma);
+
anon_vma_unlock(vma);
if (mapping)
spin_unlock(&mapping->i_mmap_lock);
@@ -323,9 +324,9 @@ static void vma_link(struct mm_struct *m
}

/*
- * Insert vm structure into process list sorted by address and into the
- * inode's i_mmap tree. The caller should hold mm->mmap_sem and
- * ->f_mappping->i_mmap_lock if vm_file is non-NULL.
+ * Helper for vma_adjust in the split_vma insert case:
+ * insert vm structure into list and rbtree and anon_vma,
+ * but it has already been inserted into prio_tree earlier.
*/
static void
__insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
@@ -337,9 +338,7 @@ __insert_vm_struct(struct mm_struct * mm
if (__vma && __vma->vm_start < vma->vm_end)
BUG();
__vma_link(mm, vma, prev, rb_link, rb_parent);
- mark_mm_hugetlb(mm, vma);
mm->map_count++;
- validate_mm(mm);
}

static inline void
@@ -396,6 +395,15 @@ again: remove_next = 1 + (end > next->
if (!(vma->vm_flags & VM_NONLINEAR))
root = &mapping->i_mmap;
spin_lock(&mapping->i_mmap_lock);
+ if (insert) {
+ /*
+ * Put into prio_tree now, so instantiated pages
+ * are visible to arm/parisc __flush_dcache_page
+ * throughout; but we cannot insert into address
+ * space until vma start or end is updated.
+ */
+ __vma_link_file(insert);
+ }
}

/*
@@ -1456,6 +1464,7 @@ int split_vma(struct mm_struct * mm, str

/* most fields are the same, copy all, and then fixup */
*new = *vma;
+ vma_prio_tree_init(new);

if (new_below)
new->vm_end = addr;
@@ -1768,6 +1777,7 @@ struct vm_area_struct *copy_vma(struct v
new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (new_vma) {
*new_vma = *vma;
+ vma_prio_tree_init(new_vma);
pol = mpol_copy(vma_policy(vma));
if (IS_ERR(pol)) {
kmem_cache_free(vm_area_cachep, new_vma);
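
For context on "NULLs a few fields": vma_prio_tree_init at this point
amounts to roughly the following (the exact field names are an assumption
recalled from the 2.6.7-era include/linux/mm.h):

	static inline void vma_prio_tree_init(struct vm_area_struct *vma)
	{
		vma->shared.vm_set.list.next = NULL;
		vma->shared.vm_set.list.prev = NULL;
		vma->shared.vm_set.parent = NULL;
		vma->shared.vm_set.head = NULL;
	}

which is why a vma that has just been memset to 0 doesn't need it, but a
structure-copied one (the split_vma and copy_vma hunks above) does: the
copy would otherwise carry over the source vma's prio_tree linkage.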

2004-06-02 20:08:56

by Hugh Dickins

Subject: [PATCH] vma_adjust adjust_next wrap

Fix vma_adjust adjust_next wrapping: Rajesh V. pointed out that if end
were 2GB or more beyond next->vm_start (on 32-bit), then next->vm_pgoff
would have been negatively adjusted.

Signed-off-by: Hugh Dickins <[email protected]>

mm/mmap.c | 27 +++++++++++++++++----------
1 files changed, 17 insertions(+), 10 deletions(-)

--- 2.6.7-rc2/mm/mmap.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/mmap.c 2004-06-02 16:31:44.190703744 +0100
@@ -373,20 +373,27 @@ void vma_adjust(struct vm_area_struct *v

if (next && !insert) {
if (end >= next->vm_end) {
+ /*
+ * vma expands, overlapping all the next, and
+ * perhaps the one after too (mprotect case 6).
+ */
again: remove_next = 1 + (end > next->vm_end);
end = next->vm_end;
anon_vma = next->anon_vma;
- } else if (end < vma->vm_end || end > next->vm_start) {
+ } else if (end > next->vm_start) {
/*
- * vma shrinks, and !insert tells it's not
- * split_vma inserting another: so it must
- * be mprotect shifting the boundary down.
- * Or:
* vma expands, overlapping part of the next:
- * must be mprotect shifting the boundary up.
+ * mprotect case 5 shifting the boundary up.
+ */
+ adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
+ anon_vma = next->anon_vma;
+ } else if (end < vma->vm_end) {
+ /*
+ * vma shrinks, and !insert tells it's not
+ * split_vma inserting another: so it must be
+ * mprotect case 4 shifting the boundary down.
*/
- BUG_ON(vma->vm_end != next->vm_start);
- adjust_next = end - next->vm_start;
+ adjust_next = - ((vma->vm_end - end) >> PAGE_SHIFT);
anon_vma = next->anon_vma;
}
}
@@ -418,8 +425,8 @@ again: remove_next = 1 + (end > next->
vma->vm_end = end;
vma->vm_pgoff = pgoff;
if (adjust_next) {
- next->vm_start += adjust_next;
- next->vm_pgoff += adjust_next >> PAGE_SHIFT;
+ next->vm_start += adjust_next << PAGE_SHIFT;
+ next->vm_pgoff += adjust_next;
}

if (root) {
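
A worked example of the wrap being fixed, with made-up addresses on
32-bit and PAGE_SHIFT == 12 (assuming, as the 2GB threshold implies, that
adjust_next previously held a byte count in a signed long):

	next->vm_start	= 0x10000000
	end		= 0x90000000	/* 2GB beyond vm_start */

	old:	adjust_next = end - next->vm_start
			    = 0x80000000 = -2147483648 as a signed long
		next->vm_pgoff += adjust_next >> PAGE_SHIFT;	/* += -524288 */

	new:	adjust_next = (end - next->vm_start) >> PAGE_SHIFT
			    = 0x80000 = 524288 pages
		next->vm_pgoff += adjust_next;			/* += 524288 */

So the old code moved next->vm_pgoff down by half a million pages where it
should have gone up; keeping adjust_next in pages keeps the value
comfortably in range.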

2004-06-02 20:11:23

by Hugh Dickins

Subject: [PATCH] get_user_pages vs. try_to_unmap

Andrea Arcangeli's fix to an ironic weakness with get_user_pages.
try_to_unmap_one must check page_count against page->mapcount before
unmapping a swapcache page: the raised page count by which get_user_pages
ensures the page cannot be freed would cause any write fault to see that
page as not exclusively owned, so a copy page would be substituted for
it - the reverse of what's intended.

rmap.c was entirely free of such page_count heuristics before; I tried
hard to avoid putting this in. But Andrea's fix rarely gives a false
positive; and although it might be nicer to change exclusive_swap_page
etc. to rely on page->mapcount instead, it seems likely that we'll want
to get rid of page->mapcount later, so better not to entrench its use.

Signed-off-by: Hugh Dickins <[email protected]>

mm/rmap.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+)

--- 2.6.7-rc2/mm/rmap.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/rmap.c 2004-06-02 16:32:06.492313384 +0100
@@ -485,6 +485,23 @@ static int try_to_unmap_one(struct page
goto out_unmap;
}

+ /*
+ * Don't pull an anonymous page out from under get_user_pages.
+ * GUP carefully breaks COW and raises page count (while holding
+ * page_table_lock, as we have here) to make sure that the page
+ * cannot be freed. If we unmap that page here, a user write
+ * access to the virtual address will bring back the page, but
+ * its raised count will (ironically) be taken to mean it's not
+ * an exclusive swap page, do_wp_page will replace it by a copy
+ * page, and the user never get to see the data GUP was holding
+ * the original page for.
+ */
+ if (PageSwapCache(page) &&
+ page_count(page) != page->mapcount + 2) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
/* Nuke the page table entry. */
flush_cache_page(vma, address);
pteval = ptep_clear_flush(vma, address, pte);
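
The reference accounting behind the mapcount + 2 test, sketched (the
vmscan-side reference is an assumption about the 2.6.7-era shrink_list
path, which isolates pages from the LRU before calling try_to_unmap):

	page->mapcount		one reference per pte mapping the page
	+ 1			the swap cache's own reference
	+ 1			the reference held by the caller that
				isolated the page from the LRU
	-----------------------------------------------------------
	= mapcount + 2		the expected count of an unpinned
				anonymous swapcache page here

Any count above that means someone else - typically get_user_pages - is
still holding the page, so try_to_unmap_one now backs off rather than
set up the misinterpreted write fault described above.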

2004-06-02 20:13:16

by Hugh Dickins

Subject: [PATCH] kill missed pte warning

I've seen no warnings, nor heard any reports of warnings, that anon_vma
ever misses ptes (nor anonmm before it). That WARN_ON (with its useless
stack dump) was okay to goad developers into making reports, but would
mainly be an irritation if it ever appeared on user systems: kill it now.

Signed-off-by: Hugh Dickins <[email protected]>

mm/rmap.c | 29 +++++++----------------------
1 files changed, 7 insertions(+), 22 deletions(-)

--- 2.6.7-rc2/mm/rmap.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/rmap.c 2004-06-02 16:32:17.644617976 +0100
@@ -193,7 +193,7 @@ vma_address(struct page *page, struct vm
* repeatedly from either page_referenced_anon or page_referenced_file.
*/
static int page_referenced_one(struct page *page,
- struct vm_area_struct *vma, unsigned int *mapcount, int *failed)
+ struct vm_area_struct *vma, unsigned int *mapcount)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -208,14 +208,8 @@ static int page_referenced_one(struct pa
if (address == -EFAULT)
goto out;

- if (!spin_trylock(&mm->page_table_lock)) {
- /*
- * For debug we're currently warning if not all found,
- * but in this case that's expected: suppress warning.
- */
- (*failed)++;
+ if (!spin_trylock(&mm->page_table_lock))
goto out;
- }

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -251,18 +245,14 @@ static inline int page_referenced_anon(s
struct anon_vma *anon_vma = (struct anon_vma *) page->mapping;
struct vm_area_struct *vma;
int referenced = 0;
- int failed = 0;

spin_lock(&anon_vma->lock);
BUG_ON(list_empty(&anon_vma->head));
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- referenced += page_referenced_one(page, vma,
- &mapcount, &failed);
+ referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
- goto out;
+ break;
}
- WARN_ON(!failed);
-out:
spin_unlock(&anon_vma->lock);
return referenced;
}
@@ -289,7 +279,6 @@ static inline int page_referenced_file(s
struct vm_area_struct *vma = NULL;
struct prio_tree_iter iter;
int referenced = 0;
- int failed = 0;

if (!spin_trylock(&mapping->i_mmap_lock))
return 0;
@@ -299,17 +288,13 @@ static inline int page_referenced_file(s
if ((vma->vm_flags & (VM_LOCKED|VM_MAYSHARE))
== (VM_LOCKED|VM_MAYSHARE)) {
referenced++;
- goto out;
+ break;
}
- referenced += page_referenced_one(page, vma,
- &mapcount, &failed);
+ referenced += page_referenced_one(page, vma, &mapcount);
if (!mapcount)
- goto out;
+ break;
}

- if (list_empty(&mapping->i_mmap_nonlinear))
- WARN_ON(!failed);
-out:
spin_unlock(&mapping->i_mmap_lock);
return referenced;
}

2004-06-02 20:19:00

by Hugh Dickins

Subject: [PATCH] pretest pte_young and pte_dirty

Test for pte_young before going to the costlier atomic test_and_clear,
as asm-generic does. Test for pte_dirty before going to the costlier
atomic test_and_clear, as asm-generic does (I said before that I would
not do so for pte_dirty, but was missing the point: there is nothing
atomic about deciding to do nothing). But I've not touched the rather
different ppc and ppc64.

Signed-off-by: Hugh Dickins <[email protected]>

include/asm-i386/pgtable.h | 16 ++++++++++++++--
include/asm-ia64/pgtable.h | 4 ++++
include/asm-parisc/pgtable.h | 4 ++++
include/asm-x86_64/pgtable.h | 17 +++++++++++++++--
4 files changed, 37 insertions(+), 4 deletions(-)

--- 2.6.7-rc2/include/asm-i386/pgtable.h 2004-05-30 11:36:30.000000000 +0100
+++ linux/include/asm-i386/pgtable.h 2004-06-02 16:32:40.032214544 +0100
@@ -220,8 +220,20 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }

-static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }
-static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low); }
+static inline int ptep_test_and_clear_dirty(pte_t *ptep)
+{
+ if (!pte_dirty(*ptep))
+ return 0;
+ return test_and_clear_bit(_PAGE_BIT_DIRTY, &ptep->pte_low);
+}
+
+static inline int ptep_test_and_clear_young(pte_t *ptep)
+{
+ if (!pte_young(*ptep))
+ return 0;
+ return test_and_clear_bit(_PAGE_BIT_ACCESSED, &ptep->pte_low);
+}
+
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, &ptep->pte_low); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, &ptep->pte_low); }

--- 2.6.7-rc2/include/asm-ia64/pgtable.h 2004-05-30 11:36:31.000000000 +0100
+++ linux/include/asm-ia64/pgtable.h 2004-06-02 16:32:40.034214240 +0100
@@ -346,6 +346,8 @@ static inline int
ptep_test_and_clear_young (pte_t *ptep)
{
#ifdef CONFIG_SMP
+ if (!pte_young(*ptep))
+ return 0;
return test_and_clear_bit(_PAGE_A_BIT, ptep);
#else
pte_t pte = *ptep;
@@ -360,6 +362,8 @@ static inline int
ptep_test_and_clear_dirty (pte_t *ptep)
{
#ifdef CONFIG_SMP
+ if (!pte_dirty(*ptep))
+ return 0;
return test_and_clear_bit(_PAGE_D_BIT, ptep);
#else
pte_t pte = *ptep;
--- 2.6.7-rc2/include/asm-parisc/pgtable.h 2004-05-30 11:36:33.000000000 +0100
+++ linux/include/asm-parisc/pgtable.h 2004-06-02 16:32:40.035214088 +0100
@@ -417,6 +417,8 @@ extern void update_mmu_cache(struct vm_a
static inline int ptep_test_and_clear_young(pte_t *ptep)
{
#ifdef CONFIG_SMP
+ if (!pte_young(*ptep))
+ return 0;
return test_and_clear_bit(xlate_pabit(_PAGE_ACCESSED_BIT), ptep);
#else
pte_t pte = *ptep;
@@ -430,6 +432,8 @@ static inline int ptep_test_and_clear_yo
static inline int ptep_test_and_clear_dirty(pte_t *ptep)
{
#ifdef CONFIG_SMP
+ if (!pte_dirty(*ptep))
+ return 0;
return test_and_clear_bit(xlate_pabit(_PAGE_DIRTY_BIT), ptep);
#else
pte_t pte = *ptep;
--- 2.6.7-rc2/include/asm-x86_64/pgtable.h 2004-05-30 11:36:35.000000000 +0100
+++ linux/include/asm-x86_64/pgtable.h 2004-06-02 16:32:40.036213936 +0100
@@ -262,8 +262,21 @@ extern inline pte_t pte_mkexec(pte_t pte
extern inline pte_t pte_mkdirty(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_DIRTY)); return pte; }
extern inline pte_t pte_mkyoung(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_ACCESSED)); return pte; }
extern inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
-static inline int ptep_test_and_clear_dirty(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep); }
-static inline int ptep_test_and_clear_young(pte_t *ptep) { return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep); }
+
+static inline int ptep_test_and_clear_dirty(pte_t *ptep)
+{
+ if (!pte_dirty(*ptep))
+ return 0;
+ return test_and_clear_bit(_PAGE_BIT_DIRTY, ptep);
+}
+
+static inline int ptep_test_and_clear_young(pte_t *ptep)
+{
+ if (!pte_young(*ptep))
+ return 0;
+ return test_and_clear_bit(_PAGE_BIT_ACCESSED, ptep);
+}
+
static inline void ptep_set_wrprotect(pte_t *ptep) { clear_bit(_PAGE_BIT_RW, ptep); }
static inline void ptep_mkdirty(pte_t *ptep) { set_bit(_PAGE_BIT_DIRTY, ptep); }
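
The asm-generic versions referred to in the description look roughly like
this (a sketch recalled from the 2.6.7-era include/asm-generic/pgtable.h;
treat the exact bodies as an assumption); the point the patch echoes is
the early bail-out when the bit is already clear:

	static inline int ptep_test_and_clear_young(pte_t *ptep)
	{
		pte_t pte = *ptep;
		if (!pte_young(pte))
			return 0;
		set_pte(ptep, pte_mkold(pte));
		return 1;
	}

	static inline int ptep_test_and_clear_dirty(pte_t *ptep)
	{
		pte_t pte = *ptep;
		if (!pte_dirty(pte))
			return 0;
		set_pte(ptep, pte_mkclean(pte));
		return 1;
	}

The patch brings the atomic (SMP) implementations above into line with
that pattern: skip the clear when the bit is already off, and only pay
for the locked test_and_clear when it might actually change something.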


2004-06-02 20:16:42

by Hugh Dickins

Subject: [PATCH] flush TLB when clearing young

Traditionally we've not flushed the TLB after clearing the young/referenced
bit; it has seemed just a waste of time. Russell King points out that
on some architectures, with the move from 2.4 mm sweeping to 2.6 rmap,
this may be a serious omission: very frequently referenced pages are never
re-marked young, and the worst choices get made for unmapping.

So, replace ptep_test_and_clear_young by ptep_clear_flush_young
throughout rmap.c. Originally I'd imagined making some kind of TLB
gather optimization, but I don't see one now: whether it's worth it rather
depends on how common cross-cpu flushes are, and whether they're global
or not.

ppc and ppc64 have already found this issue, and worked around it by
arranging a TLB flush from their ptep_test_and_clear_young, with the aid
of pgtable rmap pointers. I'm hoping ptep_clear_flush_young will allow
ppc and ppc64 to remove that special code, but won't change them myself.

It's worth noting that it is Andrea's anon_vma rmap which makes the vma
available for ptep_clear_flush_young in page_referenced_one: anonmm and
pte_chains would both need an additional find_vma for that.

Signed-off-by: Hugh Dickins <[email protected]>

mm/rmap.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

--- 2.6.7-rc2/mm/rmap.c 2004-05-30 11:36:40.000000000 +0100
+++ linux/mm/rmap.c 2004-06-02 16:32:28.792923176 +0100
@@ -232,7 +232,7 @@ static int page_referenced_one(struct pa
if (page_to_pfn(page) != pte_pfn(*pte))
goto out_unmap;

- if (ptep_test_and_clear_young(pte))
+ if (ptep_clear_flush_young(vma, address, pte))
referenced++;

(*mapcount)--;
@@ -480,7 +480,7 @@ static int try_to_unmap_one(struct page
* skipped over this mm) then we should reactivate it.
*/
if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
- ptep_test_and_clear_young(pte)) {
+ ptep_clear_flush_young(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
}
@@ -590,7 +590,7 @@ static int try_to_unmap_cluster(unsigned
if (PageReserved(page))
continue;

- if (ptep_test_and_clear_young(pte))
+ if (ptep_clear_flush_young(vma, address, pte))
continue;

/* Nuke the page table entry. */
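
ptep_clear_flush_young itself is, in its generic form, just the old call
plus a conditional TLB flush - something on the order of the following
(an assumption about the asm-generic definition of this era, not a quote):

	#define ptep_clear_flush_young(__vma, __address, __ptep)	\
	({								\
		int __young = ptep_test_and_clear_young(__ptep);	\
		if (__young)						\
			flush_tlb_page(__vma, __address);		\
		__young;						\
	})

So architectures where a stale TLB entry would keep a page looking
unreferenced get their flush, while an architecture that doesn't need it
can override this with the plain test_and_clear.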

2004-06-02 20:30:44

by David Mosberger

Subject: Re: [PATCH] pretest pte_young and pte_dirty

>>>>> On Wed, 2 Jun 2004 21:16:59 +0100 (BST), Hugh Dickins <[email protected]> said:

Hugh> Test for pte_young before going to the costlier atomic
Hugh> test_and_clear, as asm-generic does. Test for pte_dirty
Hugh> before going to the costlier atomic test_and_clear, as
Hugh> asm-generic does (I said before that I would not do so for
Hugh> pte_dirty, but was missing the point: there is nothing atomic
Hugh> about deciding to do nothing). But I've not touched the
Hugh> rather different ppc and ppc64.

The ia64-portion of the patch looks good to me.

Thanks!

--david