LinuxLists.cc - [PATCH 00/11] unpaged: PageReserved VM fixups

2005-11-17 19:29:43

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 00/11] unpaged: PageReserved VM fixups

Here come my fixups to the PageReserved removal in 2.6.15-rc1.
Diffed against 2.6.15-rc1-git5. I hope they can be included in
2.6.15-rc2, but admit they've had had no real driver testing yet.
Please scrutinize and test!

Hugh

arch/powerpc/kernel/vdso.c | 3
arch/sparc/mm/generic.c | 2
arch/sparc64/mm/generic.c | 2
drivers/char/mem.c | 2
include/linux/mm.h | 24 -------
include/linux/page-flags.h | 4 -
kernel/fork.c | 2
mm/fremap.c | 10 +--
mm/madvise.c | 2
mm/memory.c | 128 +++++++++++++++++++++++++++-------------
mm/mempolicy.c | 2
mm/mmap.c | 11 ---
mm/mprotect.c | 8 --
mm/msync.c | 4 -
mm/page_alloc.c | 51 ++++++++++-----
mm/rmap.c | 22 +++---
mm/swap.c | 3
sound/usb/usx2y/usx2yhwdeppcm.c | 1
18 files changed, 149 insertions(+), 132 deletions(-)

2005-11-17 19:31:25

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 02/11] unpaged: private write VM_RESERVED

The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message
when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED,
and failed with -EACCES: because do_wp_page lacks the refinement to COW
pages in those areas, nor do we expect to find anonymous pages in them;
and it seemed just bloat to add code for handling such a peculiar case.
But immediately it caused vbetool and ddcprobe (using lrmi) to fail.

So revert the "deprecated" messages, letting mmap and mprotect succeed.
But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place
until we've added the code to do it right: so this particular patch is
only good if the app doesn't really need to write to that private area.

Dave Jones has changed vbetool & ddcprobe to use MAP_SHARED or PROT_READ
just as well, but we don't want to force people to update their tools.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/mmap.c | 11 -----------
mm/mprotect.c | 8 --------
2 files changed, 19 deletions(-)

--- unpaged01/mm/mmap.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged02/mm/mmap.c 2005-11-17 15:10:20.000000000 +0000
@@ -1076,17 +1076,6 @@ munmap_back:
error = file->f_op->mmap(file, vma);
if (error)
goto unmap_and_free_vma;
- if ((vma->vm_flags & (VM_SHARED | VM_WRITE | VM_RESERVED))
- == (VM_WRITE | VM_RESERVED)) {
- printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
- "PROT_WRITE mmap of VM_RESERVED memory, which "
- "is deprecated. Please report this to "
- "[email protected]\n",current->comm);
- if (vma->vm_ops && vma->vm_ops->close)
- vma->vm_ops->close(vma);
- error = -EACCES;
- goto unmap_and_free_vma;
- }
} else if (vm_flags & VM_SHARED) {
error = shmem_zero_setup(vma);
if (error)
--- unpaged01/mm/mprotect.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged02/mm/mprotect.c 2005-11-17 15:10:20.000000000 +0000
@@ -124,14 +124,6 @@ mprotect_fixup(struct vm_area_struct *vm
* a MAP_NORESERVE private mapping to writable will now reserve.
*/
if (newflags & VM_WRITE) {
- if (oldflags & VM_RESERVED) {
- BUG_ON(oldflags & VM_WRITE);
- printk(KERN_WARNING "program %s is using MAP_PRIVATE, "
- "PROT_WRITE mprotect of VM_RESERVED memory, "
- "which is deprecated. Please report this to "
- "[email protected]\n",current->comm);
- return -EACCES;
- }
if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_SHARED|VM_HUGETLB))) {
charged = nrpages;
if (security_vm_enough_memory(charged))

2005-11-17 19:30:37

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 01/11] unpaged: get_user_pages VM_RESERVED

The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the
areas flagged VM_RESERVED in place of PageReserved. That is correct in
theory - we ought not to interfere with struct pages in such a reserved
area; but in practice it broke BTTV for one.

So revert to prohibiting only on VM_IO: if someone gets into trouble
with get_user_pages on VM_RESERVED, it'll just be a "don't do that".

You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the
first place, but now's not the time for breaking drivers without notice.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

--- 2.6.15-rc1-git5/mm/memory.c 2005-11-17 13:53:21.000000000 +0000
+++ unpaged01/mm/memory.c 2005-11-17 15:10:08.000000000 +0000
@@ -968,7 +968,7 @@ int get_user_pages(struct task_struct *t
continue;
}

- if (!vma || (vma->vm_flags & (VM_IO | VM_RESERVED))
+ if (!vma || (vma->vm_flags & VM_IO)
|| !(vm_flags & vma->vm_flags))
return i ? : -EFAULT;

2005-11-17 19:32:58

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 03/11] unpaged: sound nopage get_page

Something noticed when studying use of VM_RESERVED in different drivers:
snd_usX2Y_hwdep_pcm_vm_nopage omitted to get_page: fixed.

And how did this work before? Aargh! That nopage is returning a page
from within a buffer allocated by snd_malloc_pages, which allocates a
high-order page, then does SetPageReserved on each 0-order page within.

That would have worked in 2.6.14, because when the area was unmapped,
PageReserved inhibited put_page. 2.6.15-rc1 removed that inhibition
(while leaving ineffective PageReserveds around for now), but it hasn't
caused trouble because.. we've not been freeing from VM_RESERVED at all.

Signed-off-by: Hugh Dickins <[email protected]>
---

sound/usb/usx2y/usx2yhwdeppcm.c | 1 +
1 files changed, 1 insertion(+)

--- unpaged02/sound/usb/usx2y/usx2yhwdeppcm.c 2005-11-12 09:01:30.000000000 +0000
+++ unpaged03/sound/usb/usx2y/usx2yhwdeppcm.c 2005-11-17 15:10:34.000000000 +0000
@@ -691,6 +691,7 @@ static struct page * snd_usX2Y_hwdep_pcm
snd_assert((offset % PAGE_SIZE) == 0, return NOPAGE_OOM);
vaddr = (char*)((usX2Ydev_t*)area->vm_private_data)->hwdep_pcm_shm + offset;
page = virt_to_page(vaddr);
+ get_page(page);

if (type)
*type = VM_FAULT_MINOR;

2005-11-17 19:33:59

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 04/11] unpaged: unifdefed PageCompound

It looks like snd_xxx is not the only nopage to be using PageReserved as
a way of holding a high-order page together: which no longer works, but
is masked by our failure to free from VM_RESERVED areas. We cannot fix
that bug without first substituting another way to hold the high-order
page together, while farming out the 0-order pages from within it.

That's just what PageCompound is designed for, but it's been kept under
CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-
of-line put_page), doesn't slow down what most needs to be fast (already
using hugetlb), and unifies the way we handle high-order pages.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/mm.h | 19 -------------------
include/linux/page-flags.h | 4 ----
mm/page_alloc.c | 5 -----
mm/swap.c | 3 ---
4 files changed, 31 deletions(-)

--- unpaged03/include/linux/mm.h 2005-11-17 13:53:20.000000000 +0000
+++ unpaged04/include/linux/mm.h 2005-11-17 15:10:50.000000000 +0000
@@ -311,8 +311,6 @@ struct page {

extern void FASTCALL(__page_cache_release(struct page *));

-#ifdef CONFIG_HUGETLB_PAGE
-
static inline int page_count(struct page *page)
{
if (PageCompound(page))
@@ -329,23 +327,6 @@ static inline void get_page(struct page

void put_page(struct page *page);

-#else /* CONFIG_HUGETLB_PAGE */
-
-#define page_count(p) (atomic_read(&(p)->_count) + 1)
-
-static inline void get_page(struct page *page)
-{
- atomic_inc(&page->_count);
-}
-
-static inline void put_page(struct page *page)
-{
- if (put_page_testzero(page))
- __page_cache_release(page);
-}
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
--- unpaged03/include/linux/page-flags.h 2005-10-28 01:02:08.000000000 +0100
+++ unpaged04/include/linux/page-flags.h 2005-11-17 15:10:50.000000000 +0000
@@ -287,11 +287,7 @@ extern void __mod_page_state(unsigned lo
#define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
#define TestClearPageReclaim(page) test_and_clear_bit(PG_reclaim, &(page)->flags)

-#ifdef CONFIG_HUGETLB_PAGE
#define PageCompound(page) test_bit(PG_compound, &(page)->flags)
-#else
-#define PageCompound(page) 0
-#endif
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)

--- unpaged03/mm/page_alloc.c 2005-11-17 13:53:21.000000000 +0000
+++ unpaged04/mm/page_alloc.c 2005-11-17 15:10:50.000000000 +0000
@@ -148,10 +148,6 @@ static void bad_page(const char *functio
add_taint(TAINT_BAD_PAGE);
}

-#ifndef CONFIG_HUGETLB_PAGE
-#define prep_compound_page(page, order) do { } while (0)
-#define destroy_compound_page(page, order) do { } while (0)
-#else
/*
* Higher-order pages are called "compound pages". They are structured thusly:
*
@@ -205,7 +201,6 @@ static void destroy_compound_page(struct
ClearPageCompound(p);
}
}
-#endif /* CONFIG_HUGETLB_PAGE */

/*
* function for dealing with page's order in buddy system.
--- unpaged03/mm/swap.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged04/mm/swap.c 2005-11-17 15:10:50.000000000 +0000
@@ -34,8 +34,6 @@
/* How many pages do we try to swap or page in/out together? */
int page_cluster;

-#ifdef CONFIG_HUGETLB_PAGE
-
void put_page(struct page *page)
{
if (unlikely(PageCompound(page))) {
@@ -52,7 +50,6 @@ void put_page(struct page *page)
__page_cache_release(page);
}
EXPORT_SYMBOL(put_page);
-#endif

/*
* Writeback is about to end against a page which has been marked for immediate

2005-11-17 19:36:16

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 05/11] unpaged: VM_UNPAGED

Although we tend to associate VM_RESERVED with remap_pfn_range, quite a
few drivers set VM_RESERVED on areas which are then populated by nopage.
The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free
pages in zap_pte_range, without changing those drivers not to set it:
so their pages just leak away.

Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the
core, to flag the special areas where the ptes may have no struct page,
or if they have then it's not to be touched. Replace most instances of
VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range,
and the sparc and sparc64 io_remap_pfn_range.

Revert addition of VM_RESERVED to powerpc vdso, it's not needed there.
Is it needed anywhere? It still governs the mm->reserved_vm statistic,
and special vmas not to be merged, and areas not to be core dumped; but
could probably be eliminated later (the drivers are probably specifying
it because in 2.4 it kept swapout off the vma, but in 2.6 we work from
the LRU, which these pages don't get on).

Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
purpose whatsoever, and should be removed from drivers when we clean up.

Signed-off-by: Hugh Dickins <[email protected]>
---

arch/powerpc/kernel/vdso.c | 3 +--
arch/sparc/mm/generic.c | 2 +-
arch/sparc64/mm/generic.c | 2 +-
include/linux/mm.h | 5 +++--
mm/fremap.c | 4 ++--
mm/madvise.c | 2 +-
mm/memory.c | 30 ++++++++++++++++++------------
mm/mempolicy.c | 2 +-
mm/msync.c | 4 ++--
9 files changed, 30 insertions(+), 24 deletions(-)

--- unpaged04/arch/powerpc/kernel/vdso.c 2005-11-12 09:00:32.000000000 +0000
+++ unpaged05/arch/powerpc/kernel/vdso.c 2005-11-17 15:11:02.000000000 +0000
@@ -285,8 +285,7 @@ int arch_setup_additional_pages(struct l
* It's fine to use that for setting breakpoints in the vDSO code
* pages though
*/
- vma->vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE |
- VM_MAYEXEC | VM_RESERVED;
+ vma->vm_flags = VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
vma->vm_flags |= mm->def_flags;
vma->vm_page_prot = protection_map[vma->vm_flags & 0x7];
vma->vm_ops = &vdso_vmops;
--- unpaged04/arch/sparc/mm/generic.c 2005-11-12 09:00:36.000000000 +0000
+++ unpaged05/arch/sparc/mm/generic.c 2005-11-17 15:11:02.000000000 +0000
@@ -74,7 +74,7 @@ int io_remap_pfn_range(struct vm_area_st
unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;

/* See comment in mm/memory.c remap_pfn_range */
- vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_flags |= VM_IO | VM_RESERVED | VM_UNPAGED;

prot = __pgprot(pg_iobits);
offset -= from;
--- unpaged04/arch/sparc64/mm/generic.c 2005-11-12 09:00:36.000000000 +0000
+++ unpaged05/arch/sparc64/mm/generic.c 2005-11-17 15:11:02.000000000 +0000
@@ -128,7 +128,7 @@ int io_remap_pfn_range(struct vm_area_st
unsigned long offset = GET_PFN(pfn) << PAGE_SHIFT;

/* See comment in mm/memory.c remap_pfn_range */
- vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_flags |= VM_IO | VM_RESERVED | VM_UNPAGED;

prot = __pgprot(pg_iobits);
offset -= from;
--- unpaged04/include/linux/mm.h 2005-11-17 15:10:50.000000000 +0000
+++ unpaged05/include/linux/mm.h 2005-11-17 15:11:02.000000000 +0000
@@ -144,7 +144,8 @@ extern unsigned int kobjsize(const void

#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
#define VM_GROWSUP 0x00000200
-#define VM_SHM 0x00000400 /* shared memory area, don't swap out */
+#define VM_SHM 0x00000000 /* Means nothing: delete it later */
+#define VM_UNPAGED 0x00000400 /* Pages managed without map count */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */

#define VM_EXECUTABLE 0x00001000
@@ -157,7 +158,7 @@ extern unsigned int kobjsize(const void

#define VM_DONTCOPY 0x00020000 /* Do not copy this vma on fork */
#define VM_DONTEXPAND 0x00040000 /* Cannot expand with mremap() */
-#define VM_RESERVED 0x00080000 /* Pages managed in a special way */
+#define VM_RESERVED 0x00080000 /* Count as reserved_vm like IO */
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
#define VM_NONLINEAR 0x00800000 /* Is non-linear (remap_file_pages) */
--- unpaged04/mm/fremap.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged05/mm/fremap.c 2005-11-17 15:11:02.000000000 +0000
@@ -65,7 +65,7 @@ int install_page(struct mm_struct *mm, s
pte_t pte_val;
spinlock_t *ptl;

- BUG_ON(vma->vm_flags & VM_RESERVED);
+ BUG_ON(vma->vm_flags & VM_UNPAGED);

pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
@@ -122,7 +122,7 @@ int install_file_pte(struct mm_struct *m
pte_t pte_val;
spinlock_t *ptl;

- BUG_ON(vma->vm_flags & VM_RESERVED);
+ BUG_ON(vma->vm_flags & VM_UNPAGED);

pgd = pgd_offset(mm, addr);
pud = pud_alloc(mm, pgd, addr);
--- unpaged04/mm/madvise.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged05/mm/madvise.c 2005-11-17 15:11:02.000000000 +0000
@@ -126,7 +126,7 @@ static long madvise_dontneed(struct vm_a
unsigned long start, unsigned long end)
{
*prev = vma;
- if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_RESERVED))
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_UNPAGED))
return -EINVAL;

if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
--- unpaged04/mm/memory.c 2005-11-17 15:10:08.000000000 +0000
+++ unpaged05/mm/memory.c 2005-11-17 15:11:02.000000000 +0000
@@ -334,7 +334,7 @@ static inline void add_mm_rss(struct mm_

/*
* This function is called to print an error when a pte in a
- * !VM_RESERVED region is found pointing to an invalid pfn (which
+ * !VM_UNPAGED region is found pointing to an invalid pfn (which
* is an error.
*
* The calling function must still handle the error.
@@ -381,15 +381,15 @@ copy_one_pte(struct mm_struct *dst_mm, s
goto out_set_pte;
}

- /* If the region is VM_RESERVED, the mapping is not
+ /* If the region is VM_UNPAGED, the mapping is not
* mapped via rmap - duplicate the pte as is.
*/
- if (vm_flags & VM_RESERVED)
+ if (vm_flags & VM_UNPAGED)
goto out_set_pte;

pfn = pte_pfn(pte);
/* If the pte points outside of valid memory but
- * the region is not VM_RESERVED, we have a problem.
+ * the region is not VM_UNPAGED, we have a problem.
*/
if (unlikely(!pfn_valid(pfn))) {
print_bad_pte(vma, pte, addr);
@@ -528,7 +528,7 @@ int copy_page_range(struct mm_struct *ds
* readonly mappings. The tradeoff is that copy_page_range is more
* efficient than faulting.
*/
- if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_RESERVED))) {
+ if (!(vma->vm_flags & (VM_HUGETLB|VM_NONLINEAR|VM_UNPAGED))) {
if (!vma->anon_vma)
return 0;
}
@@ -572,7 +572,7 @@ static unsigned long zap_pte_range(struc

(*zap_work) -= PAGE_SIZE;

- if (!(vma->vm_flags & VM_RESERVED)) {
+ if (!(vma->vm_flags & VM_UNPAGED)) {
unsigned long pfn = pte_pfn(ptent);
if (unlikely(!pfn_valid(pfn)))
print_bad_pte(vma, ptent, addr);
@@ -1191,10 +1191,16 @@ int remap_pfn_range(struct vm_area_struc
* rest of the world about it:
* VM_IO tells people not to look at these pages
* (accesses can have side effects).
- * VM_RESERVED tells the core MM not to "manage" these pages
- * (e.g. refcount, mapcount, try to swap them out).
+ * VM_RESERVED is specified all over the place, because
+ * in 2.4 it kept swapout's vma scan off this vma; but
+ * in 2.6 the LRU scan won't even find its pages, so this
+ * flag means no more than count its pages in reserved_vm,
+ * and omit it from core dump, even when VM_IO turned off.
+ * VM_UNPAGED tells the core MM not to "manage" these pages
+ * (e.g. refcount, mapcount, try to swap them out): in
+ * particular, zap_pte_range does not try to free them.
*/
- vma->vm_flags |= VM_IO | VM_RESERVED;
+ vma->vm_flags |= VM_IO | VM_RESERVED | VM_UNPAGED;

BUG_ON(addr >= end);
pfn -= addr >> PAGE_SHIFT;
@@ -1276,7 +1282,7 @@ static int do_wp_page(struct mm_struct *
pte_t entry;
int ret = VM_FAULT_MINOR;

- BUG_ON(vma->vm_flags & VM_RESERVED);
+ BUG_ON(vma->vm_flags & VM_UNPAGED);

if (unlikely(!pfn_valid(pfn))) {
/*
@@ -1924,7 +1930,7 @@ retry:
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
- } else if (!(vma->vm_flags & VM_RESERVED)) {
+ } else if (!(vma->vm_flags & VM_UNPAGED)) {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
}
@@ -2203,7 +2209,7 @@ static int __init gate_vma_init(void)
gate_vma.vm_start = FIXADDR_USER_START;
gate_vma.vm_end = FIXADDR_USER_END;
gate_vma.vm_page_prot = PAGE_READONLY;
- gate_vma.vm_flags = VM_RESERVED;
+ gate_vma.vm_flags = 0;
return 0;
}
__initcall(gate_vma_init);
--- unpaged04/mm/mempolicy.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged05/mm/mempolicy.c 2005-11-17 15:11:02.000000000 +0000
@@ -269,7 +269,7 @@ check_range(struct mm_struct *mm, unsign
first = find_vma(mm, start);
if (!first)
return ERR_PTR(-EFAULT);
- if (first->vm_flags & VM_RESERVED)
+ if (first->vm_flags & VM_UNPAGED)
return ERR_PTR(-EACCES);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
--- unpaged04/mm/msync.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged05/mm/msync.c 2005-11-17 15:11:02.000000000 +0000
@@ -97,9 +97,9 @@ static void msync_page_range(struct vm_a
/* For hugepages we can't go walking the page table normally,
* but that's ok, hugetlbfs is memory based, so we don't need
* to do anything more on an msync().
- * Can't do anything with VM_RESERVED regions either.
+ * Can't do anything with VM_UNPAGED regions either.
*/
- if (vma->vm_flags & (VM_HUGETLB|VM_RESERVED))
+ if (vma->vm_flags & (VM_HUGETLB|VM_UNPAGED))
return;

BUG_ON(addr >= end);

2005-11-17 19:37:59

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 06/11] unpaged: VM_NONLINEAR VM_RESERVED

There's one peculiar use of VM_RESERVED which the previous patch left
behind: because VM_NONLINEAR's try_to_unmap_cluster uses vm_private_data
as a swapout cursor, but should never meet VM_RESERVED vmas, it was a
way of extending VM_NONLINEAR to VM_RESERVED vmas using vm_private_data
for some other purpose. But that's an empty set - they don't have the
populate function required. So just throw away those VM_RESERVED tests.

But one more interesting in rmap.c has to go too: try_to_unmap_one will
want to swap out an anonymous page from VM_RESERVED or VM_UNPAGED area.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/fremap.c | 6 ++----
mm/rmap.c | 15 +++++----------
2 files changed, 7 insertions(+), 14 deletions(-)

--- unpaged05/mm/fremap.c 2005-11-17 15:11:02.000000000 +0000
+++ unpaged06/mm/fremap.c 2005-11-17 15:11:16.000000000 +0000
@@ -204,12 +204,10 @@ asmlinkage long sys_remap_file_pages(uns
* Make sure the vma is shared, that it supports prefaulting,
* and that the remapped range is valid and fully within
* the single existing vma. vm_private_data is used as a
- * swapout cursor in a VM_NONLINEAR vma (unless VM_RESERVED
- * or VM_LOCKED, but VM_LOCKED could be revoked later on).
+ * swapout cursor in a VM_NONLINEAR vma.
*/
if (vma && (vma->vm_flags & VM_SHARED) &&
- (!vma->vm_private_data ||
- (vma->vm_flags & (VM_NONLINEAR|VM_RESERVED))) &&
+ (!vma->vm_private_data || (vma->vm_flags & VM_NONLINEAR)) &&
vma->vm_ops && vma->vm_ops->populate &&
end > start && start >= vma->vm_start &&
end <= vma->vm_end) {
--- unpaged05/mm/rmap.c 2005-11-12 09:01:25.000000000 +0000
+++ unpaged06/mm/rmap.c 2005-11-17 15:11:16.000000000 +0000
@@ -529,10 +529,8 @@ static int try_to_unmap_one(struct page
* If the page is mlock()d, we cannot swap it out.
* If it's recently referenced (perhaps page_referenced
* skipped over this mm) then we should reactivate it.
- *
- * Pages belonging to VM_RESERVED regions should not happen here.
*/
- if ((vma->vm_flags & (VM_LOCKED|VM_RESERVED)) ||
+ if ((vma->vm_flags & VM_LOCKED) ||
ptep_clear_flush_young(vma, address, pte)) {
ret = SWAP_FAIL;
goto out_unmap;
@@ -727,7 +725,7 @@ static int try_to_unmap_file(struct page

list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
+ if (vma->vm_flags & VM_LOCKED)
continue;
cursor = (unsigned long) vma->vm_private_data;
if (cursor > max_nl_cursor)
@@ -761,7 +759,7 @@ static int try_to_unmap_file(struct page
do {
list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
shared.vm_set.list) {
- if (vma->vm_flags & (VM_LOCKED|VM_RESERVED))
+ if (vma->vm_flags & VM_LOCKED)
continue;
cursor = (unsigned long) vma->vm_private_data;
while ( cursor < max_nl_cursor &&
@@ -783,11 +781,8 @@ static int try_to_unmap_file(struct page
* in locked vmas). Reset cursor on all unreserved nonlinear
* vmas, now forgetting on which ones it had fallen behind.
*/
- list_for_each_entry(vma, &mapping->i_mmap_nonlinear,
- shared.vm_set.list) {
- if (!(vma->vm_flags & VM_RESERVED))
- vma->vm_private_data = NULL;
- }
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list)
+ vma->vm_private_data = NULL;
out:
spin_unlock(&mapping->i_mmap_lock);
return ret;

2005-11-17 19:38:43

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 07/11] unpaged: COW on VM_UNPAGED

Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let
it do Copy-On-Write without touching the VM_UNPAGED's page counts - but
this is incomplete, because the anonymous page it inserts will itself
need to be handled, here and in other functions - next patch.

We still don't copy the page if the pfn is invalid, because the
copy_user_highpage interface does not allow it. But that's not been
a problem in the past: can be added in later if the need arises.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 37 +++++++++++++++++++++++++------------
1 files changed, 25 insertions(+), 12 deletions(-)

--- unpaged06/mm/memory.c 2005-11-17 15:11:02.000000000 +0000
+++ unpaged07/mm/memory.c 2005-11-17 15:11:30.000000000 +0000
@@ -1277,22 +1277,28 @@ static int do_wp_page(struct mm_struct *
unsigned long address, pte_t *page_table, pmd_t *pmd,
spinlock_t *ptl, pte_t orig_pte)
{
- struct page *old_page, *new_page;
+ struct page *old_page, *src_page, *new_page;
unsigned long pfn = pte_pfn(orig_pte);
pte_t entry;
int ret = VM_FAULT_MINOR;

- BUG_ON(vma->vm_flags & VM_UNPAGED);
-
if (unlikely(!pfn_valid(pfn))) {
/*
* Page table corrupted: show pte and kill process.
+ * Or it's an attempt to COW an out-of-map VM_UNPAGED
+ * entry, which copy_user_highpage does not support.
*/
print_bad_pte(vma, orig_pte, address);
ret = VM_FAULT_OOM;
goto unlock;
}
old_page = pfn_to_page(pfn);
+ src_page = old_page;
+
+ if (unlikely(vma->vm_flags & VM_UNPAGED)) {
+ old_page = NULL;
+ goto gotten;
+ }

if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
int reuse = can_share_swap_page(old_page);
@@ -1313,11 +1319,12 @@ static int do_wp_page(struct mm_struct *
* Ok, we need to copy. Oh, well..
*/
page_cache_get(old_page);
+gotten:
pte_unmap_unlock(page_table, ptl);

if (unlikely(anon_vma_prepare(vma)))
goto oom;
- if (old_page == ZERO_PAGE(address)) {
+ if (src_page == ZERO_PAGE(address)) {
new_page = alloc_zeroed_user_highpage(vma, address);
if (!new_page)
goto oom;
@@ -1325,7 +1332,7 @@ static int do_wp_page(struct mm_struct *
new_page = alloc_page_vma(GFP_HIGHUSER, vma, address);
if (!new_page)
goto oom;
- copy_user_highpage(new_page, old_page, address);
+ copy_user_highpage(new_page, src_page, address);
}

/*
@@ -1333,11 +1340,14 @@ static int do_wp_page(struct mm_struct *
*/
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
if (likely(pte_same(*page_table, orig_pte))) {
- page_remove_rmap(old_page);
- if (!PageAnon(old_page)) {
+ if (old_page) {
+ page_remove_rmap(old_page);
+ if (!PageAnon(old_page)) {
+ dec_mm_counter(mm, file_rss);
+ inc_mm_counter(mm, anon_rss);
+ }
+ } else
inc_mm_counter(mm, anon_rss);
- dec_mm_counter(mm, file_rss);
- }
flush_cache_page(vma, address, pfn);
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -1351,13 +1361,16 @@ static int do_wp_page(struct mm_struct *
new_page = old_page;
ret |= VM_FAULT_WRITE;
}
- page_cache_release(new_page);
- page_cache_release(old_page);
+ if (new_page)
+ page_cache_release(new_page);
+ if (old_page)
+ page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
return ret;
oom:
- page_cache_release(old_page);
+ if (old_page)
+ page_cache_release(old_page);
return VM_FAULT_OOM;
}

2005-11-17 19:39:23

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 08/11] unpaged: anon in VM_UNPAGED

copy_one_pte needs to copy the anonymous COWed pages in a VM_UNPAGED
area, zap_pte_range needs to free them, do_wp_page needs to COW them:
just like ordinary pages, not like the unpaged.

But recognizing them is a little subtle: because PageReserved is no
longer a condition for remap_pfn_range, we can now mmap all of /dev/mem
(whether the distro permits, and whether it's advisable on this or that
architecture, is another matter). So if we can see a PageAnon, it may
not be ours to mess with (or may be ours from elsewhere in the address
space). I suspect there's an entertaining insoluble self-referential
problem here, but the page_is_anon function does a good practical job,
and MAP_PRIVATE PROT_WRITE VM_UNPAGED will always be an odd choice.

In updating the comment on page_address_in_vma, noticed a potential
NULL dereference, in a path we don't actually take, but fixed it.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 63 +++++++++++++++++++++++++++++++++++++++---------------------
mm/rmap.c | 7 ++++--
2 files changed, 46 insertions(+), 24 deletions(-)

--- unpaged07/mm/memory.c 2005-11-17 15:11:30.000000000 +0000
+++ unpaged08/mm/memory.c 2005-11-17 15:11:43.000000000 +0000
@@ -350,6 +350,22 @@ void print_bad_pte(struct vm_area_struct
}

/*
+ * page_is_anon applies strict checks for an anonymous page belonging to
+ * this vma at this address. It is used on VM_UNPAGED vmas, which are
+ * usually populated with shared originals (which must not be counted),
+ * but occasionally contain private COWed copies (when !VM_SHARED, or
+ * perhaps via ptrace when VM_SHARED). An mmap of /dev/mem might window
+ * free pages, pages from other processes, or from other parts of this:
+ * it's tricky, but try not to be deceived by foreign anonymous pages.
+ */
+static inline int page_is_anon(struct page *page,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+ return page && PageAnon(page) && page_mapped(page) &&
+ page_address_in_vma(page, vma) == addr;
+}
+
+/*
* copy one vm_area from one task to the other. Assumes the page tables
* already present in the new task to be cleared in the whole range
* covered by this vma.
@@ -381,23 +397,22 @@ copy_one_pte(struct mm_struct *dst_mm, s
goto out_set_pte;
}

- /* If the region is VM_UNPAGED, the mapping is not
- * mapped via rmap - duplicate the pte as is.
- */
- if (vm_flags & VM_UNPAGED)
- goto out_set_pte;
-
pfn = pte_pfn(pte);
- /* If the pte points outside of valid memory but
+ page = pfn_valid(pfn)? pfn_to_page(pfn): NULL;
+
+ if (unlikely(vm_flags & VM_UNPAGED))
+ if (!page_is_anon(page, vma, addr))
+ goto out_set_pte;
+
+ /*
+ * If the pte points outside of valid memory but
* the region is not VM_UNPAGED, we have a problem.
*/
- if (unlikely(!pfn_valid(pfn))) {
+ if (unlikely(!page)) {
print_bad_pte(vma, pte, addr);
goto out_set_pte; /* try to do something sane */
}

- page = pfn_to_page(pfn);
-
/*
* If it's a COW mapping, write protect it both
* in the parent and the child
@@ -568,17 +583,20 @@ static unsigned long zap_pte_range(struc
continue;
}
if (pte_present(ptent)) {
- struct page *page = NULL;
+ struct page *page;
+ unsigned long pfn;

(*zap_work) -= PAGE_SIZE;

- if (!(vma->vm_flags & VM_UNPAGED)) {
- unsigned long pfn = pte_pfn(ptent);
- if (unlikely(!pfn_valid(pfn)))
- print_bad_pte(vma, ptent, addr);
- else
- page = pfn_to_page(pfn);
- }
+ pfn = pte_pfn(ptent);
+ page = pfn_valid(pfn)? pfn_to_page(pfn): NULL;
+
+ if (unlikely(vma->vm_flags & VM_UNPAGED)) {
+ if (!page_is_anon(page, vma, addr))
+ page = NULL;
+ } else if (unlikely(!page))
+ print_bad_pte(vma, ptent, addr);
+
if (unlikely(details) && page) {
/*
* unmap_shared_mapping_pages() wants to
@@ -1295,10 +1313,11 @@ static int do_wp_page(struct mm_struct *
old_page = pfn_to_page(pfn);
src_page = old_page;

- if (unlikely(vma->vm_flags & VM_UNPAGED)) {
- old_page = NULL;
- goto gotten;
- }
+ if (unlikely(vma->vm_flags & VM_UNPAGED))
+ if (!page_is_anon(old_page, vma, address)) {
+ old_page = NULL;
+ goto gotten;
+ }

if (PageAnon(old_page) && !TestSetPageLocked(old_page)) {
int reuse = can_share_swap_page(old_page);
--- unpaged07/mm/rmap.c 2005-11-17 15:11:16.000000000 +0000
+++ unpaged08/mm/rmap.c 2005-11-17 15:11:43.000000000 +0000
@@ -225,7 +225,9 @@ vma_address(struct page *page, struct vm

/*
* At what user virtual address is page expected in vma? checking that the
- * page matches the vma: currently only used by unuse_process, on anon pages.
+ * page matches the vma: currently only used on anon pages, by unuse_vma;
+ * and by extraordinary checks on anon pages in VM_UNPAGED vmas, taking
+ * care that an mmap of /dev/mem might window free and foreign pages.
*/
unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
{
@@ -234,7 +236,8 @@ unsigned long page_address_in_vma(struct
(void *)page->mapping - PAGE_MAPPING_ANON)
return -EFAULT;
} else if (page->mapping && !(vma->vm_flags & VM_NONLINEAR)) {
- if (vma->vm_file->f_mapping != page->mapping)
+ if (!vma->vm_file ||
+ vma->vm_file->f_mapping != page->mapping)
return -EFAULT;
} else
return -EFAULT;

2005-11-17 19:40:24

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 09/11] unpaged: ZERO_PAGE in VM_UNPAGED

It's strange enough to be looking out for anonymous pages in VM_UNPAGED
areas, let's not insert the ZERO_PAGE there - though whether it would
matter will depend on what we decide about ZERO_PAGE refcounting.

But whereas do_anonymous_page may (exceptionally) be called on a
VM_UNPAGED area, do_no_page should never be: just BUG_ON.

Signed-off-by: Hugh Dickins <[email protected]>
---

drivers/char/mem.c | 2 +-
mm/memory.c | 14 ++++++++++++--
2 files changed, 13 insertions(+), 3 deletions(-)

--- unpaged08/drivers/char/mem.c 2005-11-12 09:00:39.000000000 +0000
+++ unpaged09/drivers/char/mem.c 2005-11-17 15:11:57.000000000 +0000
@@ -591,7 +591,7 @@ static inline size_t read_zero_pagealign

if (vma->vm_start > addr || (vma->vm_flags & VM_WRITE) == 0)
goto out_up;
- if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
+ if (vma->vm_flags & (VM_SHARED | VM_HUGETLB | VM_UNPAGED))
break;
count = vma->vm_end - addr;
if (count > size)
--- unpaged08/mm/memory.c 2005-11-17 15:11:43.000000000 +0000
+++ unpaged09/mm/memory.c 2005-11-17 15:11:57.000000000 +0000
@@ -1812,7 +1812,16 @@ static int do_anonymous_page(struct mm_s
spinlock_t *ptl;
pte_t entry;

- if (write_access) {
+ /*
+ * A VM_UNPAGED vma will normally be filled with present ptes
+ * by remap_pfn_range, and never arrive here; but it might have
+ * holes, or if !VM_DONTEXPAND, mremap might have expanded it.
+ * It's weird enough handling anon pages in unpaged vmas, we do
+ * not want to worry about ZERO_PAGEs too (it may or may not
+ * matter if their counts wrap): just give them anon pages.
+ */
+
+ if (write_access || (vma->vm_flags & VM_UNPAGED)) {
/* Allocate our own private page. */
pte_unmap(page_table);

@@ -1887,6 +1896,7 @@ static int do_no_page(struct mm_struct *
int anon = 0;

pte_unmap(page_table);
+ BUG_ON(vma->vm_flags & VM_UNPAGED);

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
@@ -1962,7 +1972,7 @@ retry:
inc_mm_counter(mm, anon_rss);
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);
- } else if (!(vma->vm_flags & VM_UNPAGED)) {
+ } else {
inc_mm_counter(mm, file_rss);
page_add_file_rmap(new_page);
}

2005-11-17 19:40:47

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 10/11] unpaged: PG_reserved bad_page

It used to be the case that PG_reserved pages were silently never freed,
but in 2.6.15-rc1 they may be freed with a "Bad page state" message. We
should work through such cases as they appear, fixing the code; but for
now it's safer to issue the message without freeing the page, leaving
PG_reserved set.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++------------
1 files changed, 34 insertions(+), 12 deletions(-)

--- unpaged09/mm/page_alloc.c 2005-11-17 15:10:50.000000000 +0000
+++ unpaged10/mm/page_alloc.c 2005-11-17 15:12:12.000000000 +0000
@@ -140,8 +140,7 @@ static void bad_page(const char *functio
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
- 1 << PG_writeback |
- 1 << PG_reserved );
+ 1 << PG_writeback );
set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
@@ -335,7 +334,7 @@ static inline void __free_pages_bulk (st
zone->free_area[order].nr_free++;
}

-static inline void free_pages_check(const char *function, struct page *page)
+static inline int free_pages_check(const char *function, struct page *page)
{
if ( page_mapcount(page) ||
page->mapping != NULL ||
@@ -353,6 +352,12 @@ static inline void free_pages_check(cons
bad_page(function, page);
if (PageDirty(page))
__ClearPageDirty(page);
+ /*
+ * For now, we report if PG_reserved was found set, but do not
+ * clear it, and do not free the page. But we shall soon need
+ * to do more, for when the ZERO_PAGE count wraps negative.
+ */
+ return PageReserved(page);
}

/*
@@ -392,11 +397,10 @@ void __free_pages_ok(struct page *page,
{
LIST_HEAD(list);
int i;
+ int reserved = 0;

arch_free_page(page, order);

- mod_page_state(pgfree, 1 << order);
-
#ifndef CONFIG_MMU
if (order > 0)
for (i = 1 ; i < (1 << order) ; ++i)
@@ -404,8 +408,12 @@ void __free_pages_ok(struct page *page,
#endif

for (i = 0 ; i < (1 << order) ; ++i)
- free_pages_check(__FUNCTION__, page + i);
+ reserved += free_pages_check(__FUNCTION__, page + i);
+ if (reserved)
+ return;
+
list_add(&page->lru, &list);
+ mod_page_state(pgfree, 1 << order);
kernel_map_pages(page, 1<<order, 0);
free_pages_bulk(page_zone(page), 1, &list, order);
}
@@ -463,7 +471,7 @@ void set_page_refs(struct page *page, in
/*
* This page is about to be returned from the page allocator
*/
-static void prep_new_page(struct page *page, int order)
+static int prep_new_page(struct page *page, int order)
{
if ( page_mapcount(page) ||
page->mapping != NULL ||
@@ -481,12 +489,20 @@ static void prep_new_page(struct page *p
1 << PG_reserved )))
bad_page(__FUNCTION__, page);

+ /*
+ * For now, we report if PG_reserved was found set, but do not
+ * clear it, and do not allocate the page: as a safety net.
+ */
+ if (PageReserved(page))
+ return 1;
+
page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_checked | 1 << PG_mappedtodisk);
set_page_private(page, 0);
set_page_refs(page, order);
kernel_map_pages(page, 1 << order, 1);
+ return 0;
}

/*
@@ -669,11 +685,14 @@ static void fastcall free_hot_cold_page(

arch_free_page(page, 0);

- kernel_map_pages(page, 1, 0);
- inc_page_state(pgfree);
if (PageAnon(page))
page->mapping = NULL;
- free_pages_check(__FUNCTION__, page);
+ if (free_pages_check(__FUNCTION__, page))
+ return;
+
+ inc_page_state(pgfree);
+ kernel_map_pages(page, 1, 0);
+
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
local_irq_save(flags);
list_add(&page->lru, &pcp->list);
@@ -712,12 +731,14 @@ static struct page *
buffered_rmqueue(struct zone *zone, int order, gfp_t gfp_flags)
{
unsigned long flags;
- struct page *page = NULL;
+ struct page *page;
int cold = !!(gfp_flags & __GFP_COLD);

+again:
if (order == 0) {
struct per_cpu_pages *pcp;

+ page = NULL;
pcp = &zone_pcp(zone, get_cpu())->pcp[cold];
local_irq_save(flags);
if (pcp->count <= pcp->low)
@@ -739,7 +760,8 @@ buffered_rmqueue(struct zone *zone, int
if (page != NULL) {
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
- prep_new_page(page, order);
+ if (prep_new_page(page, order))
+ goto again;

if (gfp_flags & __GFP_ZERO)
prep_zero_page(page, order, gfp_flags);

2005-11-17 19:41:21

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

On Thu, Nov 17, 2005 at 07:30:04PM +0000, Hugh Dickins wrote:
> The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message
> when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED,
> and failed with -EACCES: because do_wp_page lacks the refinement to COW
> pages in those areas, nor do we expect to find anonymous pages in them;
> and it seemed just bloat to add code for handling such a peculiar case.
> But immediately it caused vbetool and ddcprobe (using lrmi) to fail.
>
> So revert the "deprecated" messages, letting mmap and mprotect succeed.
> But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place
> until we've added the code to do it right: so this particular patch is
> only good if the app doesn't really need to write to that private area.
>
> Dave Jones has changed vbetool & ddcprobe to use MAP_SHARED or PROT_READ
> just as well, but we don't want to force people to update their tools.

Actually Dave Miller did the detective work on that one, I just
rebuilt some packages, and spread the good word :)

Dave

2005-11-17 19:41:36

by Hugh Dickins

[permalink] [raw]

Subject: [PATCH 11/11] unpaged: copy_page_range vma

For copy_one_pte's print_bad_pte to show the task correctly (instead
of "???"), dup_mmap must pass down parent vma rather than child vma.

Signed-off-by: Hugh Dickins <[email protected]>
---

kernel/fork.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

--- unpaged10/kernel/fork.c 2005-11-17 13:53:21.000000000 +0000
+++ unpaged11/kernel/fork.c 2005-11-17 15:12:25.000000000 +0000
@@ -263,7 +263,7 @@ static inline int dup_mmap(struct mm_str
rb_parent = &tmp->vm_rb;

mm->map_count++;
- retval = copy_page_range(mm, oldmm, tmp);
+ retval = copy_page_range(mm, oldmm, mpnt);

if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);

2005-11-17 20:46:19

by Dominik Brodowski

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

Hi!

On Thu, Nov 17, 2005 at 02:41:02PM -0500, Dave Jones wrote:
> On Thu, Nov 17, 2005 at 07:30:04PM +0000, Hugh Dickins wrote:
> > The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message
> > when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED,
> > and failed with -EACCES: because do_wp_page lacks the refinement to COW
> > pages in those areas, nor do we expect to find anonymous pages in them;
> > and it seemed just bloat to add code for handling such a peculiar case.
> > But immediately it caused vbetool and ddcprobe (using lrmi) to fail.
> >
> > So revert the "deprecated" messages, letting mmap and mprotect succeed.
> > But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place
> > until we've added the code to do it right: so this particular patch is
> > only good if the app doesn't really need to write to that private area.
> >
> > Dave Jones has changed vbetool & ddcprobe to use MAP_SHARED or PROT_READ
> > just as well, but we don't want to force people to update their tools.
>
> Actually Dave Miller did the detective work on that one, I just
> rebuilt some packages, and spread the good word :)

My Samsung X05 requires vbetool to resume from suspend-to-ram properly. Up
to 2.6.14, vbetool-0.3 worked fine; the PageReserved patch broke this (as
reported). Also the new package by Dave {Miller,Jones} didn't help and does
not help, even with these new 11 patches. Using the unmodified, original
vbetool-0.3, though, does work as expected, and I can resume back into X
again.

kernel vbetool-0.3 vbetool-0.3+lrmi-patch
--------------------------------------------------------------
2.6.14 + ?
2.6.15-rc1 - -
2.6.15-rc1+patches + -

However -- and this is the first time I'm experiencing this bug, I now got
this in my dmesg right after the last resume from sleep.

[4294994.302000] Restarting tasks... done
[4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb020)
[4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
[4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
[4295188.230000] Backtrace:
[4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
[4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
[4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
[4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
[4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
[4295188.230000] [<c013ecd3>] put_page+0x5b/0x5d
[4295188.230000] [<c0148d69>] free_page_and_swap_cache+0x2c/0x2f
[4295188.230000] [<c0142609>] zap_pte_range+0x1a1/0x214
[4295188.230000] [<c014271a>] unmap_page_range+0x9e/0xe8
[4295188.230000] [<c0142827>] unmap_vmas+0xc3/0x199
[4295188.230000] [<c0145cf7>] unmap_region+0x77/0xf2
[4295188.230000] [<c0145f91>] do_munmap+0xdd/0xf3
[4295188.230000] [<c0145ff6>] sys_munmap+0x4f/0x68
[4295188.230000] [<c0102cab>] sysenter_past_esp+0x54/0x75
[4295188.230000] Trying to fix it up, but a reboot is needed
[4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb040)
[4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
[4295188.230000] Backtrace:
[4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
[4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
[4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
[4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
[4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
[4295188.230000] [<c013ecd3>] put_page+0x5b/0x5d
[4295188.230000] [<c0148d69>] free_page_and_swap_cache+0x2c/0x2f
[4295188.230000] [<c0142609>] zap_pte_range+0x1a1/0x214
[4295188.230000] [<c014271a>] unmap_page_range+0x9e/0xe8
[4295188.230000] [<c0142827>] unmap_vmas+0xc3/0x199
[4295188.230000] [<c0145cf7>] unmap_region+0x77/0xf2
[4295188.230000] [<c0145f91>] do_munmap+0xdd/0xf3
[4295188.230000] [<c0145ff6>] sys_munmap+0x4f/0x68
[4295188.230000] [<c0102cab>] sysenter_past_esp+0x54/0x75
[4295188.230000] Trying to fix it up, but a reboot is needed
[4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb060)
[4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
[4295188.230000] Backtrace:
[4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
[4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
[4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
[4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
[4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
[4295188.231000] [<c013ecd3>] put_page+0x5b/0x5d

... and so on. Don't know whether this is related... and it isn't
reproducible.

Dominik

2005-11-17 20:52:28

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

On Thu, Nov 17, 2005 at 09:46:17PM +0100, Dominik Brodowski wrote:

> > Actually Dave Miller did the detective work on that one, I just
> > rebuilt some packages, and spread the good word :)
>
> My Samsung X05 requires vbetool to resume from suspend-to-ram properly. Up
> to 2.6.14, vbetool-0.3 worked fine; the PageReserved patch broke this (as
> reported). Also the new package by Dave {Miller,Jones} didn't help and does
> not help, even with these new 11 patches.

Davem's initial analysis was on ddcprobe, it's possible that whilst the
code is the same in both projects, that vbetool's needs are different
enough to require a different patch.

Dave

2005-11-17 20:59:24

by Dominik Brodowski

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

Uh,

On Thu, Nov 17, 2005 at 09:46:17PM +0100, Dominik Brodowski wrote:
> [4294994.302000] Restarting tasks... done
> [4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb020)
> [4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
> [4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
> [4295188.230000] Backtrace:
> [4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
> [4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
> [4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
> [4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
> [4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
> [4295188.230000] [<c013ecd3>] put_page+0x5b/0x5d
> [4295188.230000] [<c0148d69>] free_page_and_swap_cache+0x2c/0x2f
> [4295188.230000] [<c0142609>] zap_pte_range+0x1a1/0x214
> [4295188.230000] [<c014271a>] unmap_page_range+0x9e/0xe8
> [4295188.230000] [<c0142827>] unmap_vmas+0xc3/0x199
> [4295188.230000] [<c0145cf7>] unmap_region+0x77/0xf2
> [4295188.230000] [<c0145f91>] do_munmap+0xdd/0xf3
> [4295188.230000] [<c0145ff6>] sys_munmap+0x4f/0x68
> [4295188.230000] [<c0102cab>] sysenter_past_esp+0x54/0x75
> [4295188.230000] Trying to fix it up, but a reboot is needed
> [4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb040)
> [4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
> [4295188.230000] Backtrace:
> [4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
> [4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
> [4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
> [4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
> [4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
> [4295188.230000] [<c013ecd3>] put_page+0x5b/0x5d
> [4295188.230000] [<c0148d69>] free_page_and_swap_cache+0x2c/0x2f
> [4295188.230000] [<c0142609>] zap_pte_range+0x1a1/0x214
> [4295188.230000] [<c014271a>] unmap_page_range+0x9e/0xe8
> [4295188.230000] [<c0142827>] unmap_vmas+0xc3/0x199
> [4295188.230000] [<c0145cf7>] unmap_region+0x77/0xf2
> [4295188.230000] [<c0145f91>] do_munmap+0xdd/0xf3
> [4295188.230000] [<c0145ff6>] sys_munmap+0x4f/0x68
> [4295188.230000] [<c0102cab>] sysenter_past_esp+0x54/0x75
> [4295188.230000] Trying to fix it up, but a reboot is needed
> [4295188.230000] Bad page state at free_hot_cold_page (in process 'gaim', page c15eb060)
> [4295188.230000] flags:0x80000414 mapping:00000000 mapcount:0 count:0
> [4295188.230000] Backtrace:
> [4295188.230000] [<c0103b59>] dump_stack+0x15/0x17
> [4295188.230000] [<c0138f99>] bad_page+0x5b/0x92
> [4295188.230000] [<c013967b>] free_hot_cold_page+0x5c/0xfe
> [4295188.230000] [<c0139727>] free_hot_page+0xa/0xc
> [4295188.230000] [<c013eff0>] __page_cache_release+0x8f/0x94
> [4295188.231000] [<c013ecd3>] put_page+0x5b/0x5d
>
> ... and so on. Don't know whether this is related... and it isn't
> reproducible.

It _is_ reproducible. Only happened after suspend-to-ram, though; but the
first real thing I did after reboot was trying to suspend-to-ram... Will
check tomorrow whether this also appears without suspend-to-ram in between.

Dominik

2005-11-17 21:01:15

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [PATCH 05/11] unpaged: VM_UNPAGED

On Thu, Nov 17, 2005 at 07:34:55PM +0000, Hugh Dickins wrote:
> Although we tend to associate VM_RESERVED with remap_pfn_range, quite a
> few drivers set VM_RESERVED on areas which are then populated by nopage.
> The PageReserved removal in 2.6.15-rc1 changed VM_RESERVED not to free
> pages in zap_pte_range, without changing those drivers not to set it:
> so their pages just leak away.
> Let's not change miscellaneous drivers now: introduce VM_UNPAGED at the
> core, to flag the special areas where the ptes may have no struct page,
> or if they have then it's not to be touched. Replace most instances of
> VM_RESERVED in core mm by VM_UNPAGED. Force it on in remap_pfn_range,
> and the sparc and sparc64 io_remap_pfn_range.
> Revert addition of VM_RESERVED to powerpc vdso, it's not needed there.
> Is it needed anywhere? It still governs the mm->reserved_vm statistic,
> and special vmas not to be merged, and areas not to be core dumped; but
> could probably be eliminated later (the drivers are probably specifying
> it because in 2.4 it kept swapout off the vma, but in 2.6 we work from
> the LRU, which these pages don't get on).

Eminently reasonable. This solves a lot of problems.
Acked-by: William Irwin <[email protected]>

-- wli

2005-11-17 21:25:38

[permalink] [raw]

Subject: Re: [PATCH 09/11] unpaged: ZERO_PAGE in VM_UNPAGED

Hi Hugh,

On Thursday 17 November 2005 20:38, Hugh Dickins wrote:
> It's strange enough to be looking out for anonymous pages in VM_UNPAGED
> areas, let's not insert the ZERO_PAGE there - though whether it would
> matter will depend on what we decide about ZERO_PAGE refcounting.

We do we refcount ZERO_PAGE at all?
Ok, there may be multiple, but they exist always and always at
the same physical addresses, right?

So why do we care at all?
Memory hotplug?
Doesn't it suffice there, that they are reverse mappable?

Aren't they a perfect candidate for a VM_UNPAGED from /dev/zero :-)

Regards

Ingo Oeser

Attachments:

(No filename) (609.00 B)
(No filename) (189.00 B)
Download all attachments

2005-11-17 23:29:00

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 01/11] unpaged: get_user_pages VM_RESERVED

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:29:17 +0000 (GMT)

> The PageReserved removal in 2.6.15-rc1 prohibited get_user_pages on the
> areas flagged VM_RESERVED in place of PageReserved. That is correct in
> theory - we ought not to interfere with struct pages in such a reserved
> area; but in practice it broke BTTV for one.
>
> So revert to prohibiting only on VM_IO: if someone gets into trouble
> with get_user_pages on VM_RESERVED, it'll just be a "don't do that".
>
> You can argue that videobuf_mmap_mapper shouldn't set VM_RESERVED in the
> first place, but now's not the time for breaking drivers without notice.
>
> Signed-off-by: Hugh Dickins <[email protected]>

It might be a nice little 2.6.16 project to simplify that
videobuf_mmap_mapper() code, and in fact other driver subsystems want
a similar facility, such as DVB.

So putting some helper routines in the core MM for this, in one spot,
seems like the long term way to handle this peculiar stuff.

2005-11-17 23:37:38

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:30:04 +0000 (GMT)

> The PageReserved removal in 2.6.15-rc1 issued a "deprecated" message
> when you tried to mmap or mprotect MAP_PRIVATE PROT_WRITE a VM_RESERVED,
> and failed with -EACCES: because do_wp_page lacks the refinement to COW
> pages in those areas, nor do we expect to find anonymous pages in them;
> and it seemed just bloat to add code for handling such a peculiar case.
> But immediately it caused vbetool and ddcprobe (using lrmi) to fail.
>
> So revert the "deprecated" messages, letting mmap and mprotect succeed.
> But leave do_wp_page's BUG_ON(vma->vm_flags & VM_RESERVED) in place
> until we've added the code to do it right: so this particular patch is
> only good if the app doesn't really need to write to that private area.
>
> Dave Jones has changed vbetool & ddcprobe to use MAP_SHARED or PROT_READ
> just as well, but we don't want to force people to update their tools.
>
> Signed-off-by: Hugh Dickins <[email protected]>

lrmi makes two mmaps of interest on "/dev/mem":

1) the lowest page, which includes the interrupt vectors and
the BIOS data area

the BIOS code it executes does need to write to that BIOS
data area, so PROT_WRITE is necessary, but more on this...

2) the ROM image from 0xa0000 -> 0x100000, no PROT_WRITE necessary

But the thing about #1, which needs the PROT_WRITE, is that it does
not want anonymous pages for that stuff, it's mapping the real
physical memory at those addresses, and indeed /dev/mem is going to
setup all the damn page tables already regardless of whether
MAP_SHARED or MAP_PRIVATE is set, and logically indeed MAP_SHARED is
the thing which should be specified here because this mapping is
not "private" to the application in any sense.

That is, /dev/mem mmap()'s do the remap_pfn_range() for the whole area
being mmap()'d (which is where the VM_RESERVED comes from), and
therefore no COW page faults should ever occur for such areas. Faults
can occur for protection violations or memory errors, but that's it.

I would even argue that MAP_PRIVATE on things like /dev/mem should be
flagged with at least a kernel log message if not an outright -EINVAL
as well.

2005-11-17 23:42:23

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 03/11] unpaged: sound nopage get_page

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:31:36 +0000 (GMT)

> Something noticed when studying use of VM_RESERVED in different drivers:
> snd_usX2Y_hwdep_pcm_vm_nopage omitted to get_page: fixed.
>
> And how did this work before? Aargh! That nopage is returning a page
> from within a buffer allocated by snd_malloc_pages, which allocates a
> high-order page, then does SetPageReserved on each 0-order page within.
>
> That would have worked in 2.6.14, because when the area was unmapped,
> PageReserved inhibited put_page. 2.6.15-rc1 removed that inhibition
> (while leaving ineffective PageReserveds around for now), but it hasn't
> caused trouble because.. we've not been freeing from VM_RESERVED at all.
>
> Signed-off-by: Hugh Dickins <[email protected]>

There is probably a lot of other grot like this in various drivers.

The amazing thing about this is that all of these drivers getting
snuffed up by VM_RESERVED issues are trying to do essentially the
same thing. They want to allocate some buffers, which the device
can DMA into and out of, and mmap() those buffers into user space
in some sane way.

The video capture driver layer drivers/media/video/video-buf.c is
probably the best known example, and all the other copies in the
tree of this logic is some derivative.

Note also that the AF_PACKET mmap() facility (in
net/packet/af_packet.c) does this VM_RESERVED stuff, but I think since
it never does the get_user_pages() bit like the video capture drivers
do, it didn't trigger any of the new messages or BUG() traps.

I say "I think" because I haven't tested this stuff out specifically.

2005-11-17 23:44:06

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 04/11] unpaged: unifdefed PageCompound

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:32:40 +0000 (GMT)

> It looks like snd_xxx is not the only nopage to be using PageReserved as
> a way of holding a high-order page together: which no longer works, but
> is masked by our failure to free from VM_RESERVED areas. We cannot fix
> that bug without first substituting another way to hold the high-order
> page together, while farming out the 0-order pages from within it.
>
> That's just what PageCompound is designed for, but it's been kept under
> CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-
> of-line put_page), doesn't slow down what most needs to be fast (already
> using hugetlb), and unifies the way we handle high-order pages.
>
> Signed-off-by: Hugh Dickins <[email protected]>

I think this is a good change regardless of the VM_RESERVED issues.

I've been wanting to use this facility in some sparc64 bits in
the past, for example. But since it was HUGETLB guarded that
wasn't possible.

2005-11-17 23:47:59

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 05/11] unpaged: VM_UNPAGED

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:34:55 +0000 (GMT)

> Use the VM_SHM slot for VM_UNPAGED, and define VM_SHM to 0: it serves no
> purpose whatsoever, and should be removed from drivers when we clean up.

VM_SHM does absolutely nothing in 2.4.x as well, I'd recommend
killing it off now.

2005-11-17 23:53:09

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

From: Hugh Dickins <[email protected]>
Date: Thu, 17 Nov 2005 19:37:23 +0000 (GMT)

> Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let
> it do Copy-On-Write without touching the VM_UNPAGED's page counts - but
> this is incomplete, because the anonymous page it inserts will itself
> need to be handled, here and in other functions - next patch.
>
> We still don't copy the page if the pfn is invalid, because the
> copy_user_highpage interface does not allow it. But that's not been
> a problem in the past: can be added in later if the need arises.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Do we even need this? It is a very serious question...

The cases I've seen, and am aware of, I documented in a previous
mail and those involve MAP_PRIVATE maps of /dev/mem or other
similar fixed page mapping devices.

Which case truly needs COW faults on VM_RESERVED memory which
isn't an application bug of some sort?

2005-11-18 00:00:00

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

From: Dave Jones <[email protected]>
Date: Thu, 17 Nov 2005 15:51:56 -0500

> On Thu, Nov 17, 2005 at 09:46:17PM +0100, Dominik Brodowski wrote:
>
> > > Actually Dave Miller did the detective work on that one, I just
> > > rebuilt some packages, and spread the good word :)
> >
> > My Samsung X05 requires vbetool to resume from suspend-to-ram properly. Up
> > to 2.6.14, vbetool-0.3 worked fine; the PageReserved patch broke this (as
> > reported). Also the new package by Dave {Miller,Jones} didn't help and does
> > not help, even with these new 11 patches.
>
> Davem's initial analysis was on ddcprobe, it's possible that whilst the
> code is the same in both projects, that vbetool's needs are different
> enough to require a different patch.

I don't think so, but lrmi.c instances are doing exactly the
same stuff so the same fix ought to work in both spots.

It could be some other 2.6.15 change that's mucked up suspend-to-ram
resume for this person.

2005-11-18 05:44:36

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Thu, 17 Nov 2005 19:37:23 +0000 (GMT)
>
>
>>Remove the BUG_ON(vma->vm_flags & VM_UNPAGED) from do_wp_page, and let
>>it do Copy-On-Write without touching the VM_UNPAGED's page counts - but
>>this is incomplete, because the anonymous page it inserts will itself
>>need to be handled, here and in other functions - next patch.
>>
>>We still don't copy the page if the pfn is invalid, because the
>>copy_user_highpage interface does not allow it. But that's not been
>>a problem in the past: can be added in later if the need arises.
>>
>>Signed-off-by: Hugh Dickins <[email protected]>
>
>
> Do we even need this? It is a very serious question...
>

I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
this completely. However I think it would not be unresonable to spit
out a warning, and remove it in 2.6.??

Looks like Hugh's done a very good job of this, however keeping
complexity down is always a good thing.

--
SUSE Labs, Novell Inc.

Send instant messages to your online friends http://au.messenger.yahoo.com

2005-11-18 06:45:41

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

From: Nick Piggin <[email protected]>
Date: Fri, 18 Nov 2005 16:46:56 +1100

> I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
> this completely. However I think it would not be unresonable to spit
> out a warning, and remove it in 2.6.??

I am so convinced that handling COW faults on VM_RESERVED is
unnecessary, that I think it's prudent to spit out a warning
for MAP_PRIVATE+VM_RESERVED and changing it to MAP_SHARED
to complete the mmap() call.

I bet every single application still works.

And we'll have none of this rediculious complex crap handling COW
pages in VM_RESERVED areas, which I believe is seriously more
complicated than what we started with before any of the VM_RESERVED
changed went into 2.6.15. In fact, we might as well go back to the
2.6.14 stuff instead. I do not see the second half of Hugh's patches
as progress, it's a severe regression to even 2.6.14

Doing a get_user_pages() on a VM_UNPAGED area, that's sane, and
we know exactly what makes use of that. COW faults on VM_UNPAGED
areas, that's not sane, and we don't know of a single instance
which correctly needs that behavior.

MAP_SHARED mappings of reserved pages shared between driver, device,
and userspace is common and understandable. But MAP_PRIVATE mappings
of such things? Please show me an example of something legitimately
using that, and not doing so by mistake. I will drop all of my
arguments once I see that :-)

Because, frankly, a lot of these COW on VM_UNPAGED patches seemingly
are derived from studying the MM and a few drivers and saying "oh yes,
that's _possible_" but that is far from being enough to justify this
complexity. We really need to see real usage, that makes sense.
All the cases I've investigated in userspace are "they really want
MAP_SHARED" or "they didn't need PROT_WRITE in the first place".

2005-11-18 07:12:11

by Dominik Brodowski

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

Hi,

On Thu, Nov 17, 2005 at 03:58:56PM -0800, David S. Miller wrote:
> > Davem's initial analysis was on ddcprobe, it's possible that whilst the
> > code is the same in both projects, that vbetool's needs are different
> > enough to require a different patch.
>
> I don't think so, but lrmi.c instances are doing exactly the
> same stuff so the same fix ought to work in both spots.
>
> It could be some other 2.6.15 change that's mucked up suspend-to-ram
> resume for this person.

No. 2.6.15-rc1-git-as-of-yesterday and Hugh's patches plus original
vbetool-0.3 works fine, but it plus vbetool-0.3-lrmi.patch breaks resume for
this person's notebook, i.e. mine. And it's repeatable.

Dominik

2005-11-18 07:29:03

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Thu, 17 Nov 2005, David S. Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Fri, 18 Nov 2005 16:46:56 +1100
>
> > I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
> > this completely.

No, that was not too hasty: we all agreed that the case _ought_ not to
arise, and we hadn't worked out the right code to handle it if it did
arise. What was disappointing is that nobody reported the messages
while it was in -mm, but

> > However I think it would not be unresonable to spit
> > out a warning, and remove it in 2.6.??
>
> I am so convinced that handling COW faults on VM_RESERVED is
> unnecessary, that I think it's prudent to spit out a warning
> for MAP_PRIVATE+VM_RESERVED and changing it to MAP_SHARED
> to complete the mmap() call.
>
> I bet every single application still works.
>
> And we'll have none of this rediculious complex crap handling COW
> pages in VM_RESERVED areas, which I believe is seriously more
> complicated than what we started with before any of the VM_RESERVED
> changed went into 2.6.15. In fact, we might as well go back to the
> 2.6.14 stuff instead. I do not see the second half of Hugh's patches
> as progress, it's a severe regression to even 2.6.14
>
> Doing a get_user_pages() on a VM_UNPAGED area, that's sane, and
> we know exactly what makes use of that. COW faults on VM_UNPAGED
> areas, that's not sane, and we don't know of a single instance
> which correctly needs that behavior.
>
> MAP_SHARED mappings of reserved pages shared between driver, device,
> and userspace is common and understandable. But MAP_PRIVATE mappings
> of such things? Please show me an example of something legitimately
> using that, and not doing so by mistake. I will drop all of my
> arguments once I see that :-)
>
> Because, frankly, a lot of these COW on VM_UNPAGED patches seemingly
> are derived from studying the MM and a few drivers and saying "oh yes,
> that's _possible_" but that is far from being enough to justify this
> complexity. We really need to see real usage, that makes sense.
> All the cases I've investigated in userspace are "they really want
> MAP_SHARED" or "they didn't need PROT_WRITE in the first place".
>

2005-11-18 07:47:18

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

Hugh Dickins <[email protected]> wrote:
>
> On Thu, 17 Nov 2005, David S. Miller wrote:
> > From: Nick Piggin <[email protected]>
> > Date: Fri, 18 Nov 2005 16:46:56 +1100
> >
> > > I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
> > > this completely.
>
> No, that was not too hasty: we all agreed that the case _ought_ not to
> arise, and we hadn't worked out the right code to handle it if it did
> arise. What was disappointing is that nobody reported the messages
> while it was in -mm

The window was small - only 1-2 days. I took a punt on jamming it in early
because we want the patches and everyone in the world seems to be hacking
on core mm. As it turns out it wasn't a terribly good punt, but we'll get
there. Well. You will ;)

2005-11-18 08:00:13

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 02/11] unpaged: private write VM_RESERVED

From: Dominik Brodowski <[email protected]>
Date: Fri, 18 Nov 2005 08:12:08 +0100

> No. 2.6.15-rc1-git-as-of-yesterday and Hugh's patches plus original
> vbetool-0.3 works fine, but it plus vbetool-0.3-lrmi.patch breaks resume for
> this person's notebook, i.e. mine. And it's repeatable.

Ok, thanks for the datapoint.

DaveJ take note.

2005-11-18 08:03:24

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

Sorry for the abortive earlier send...

On Thu, 17 Nov 2005, David S. Miller wrote:
> From: Nick Piggin <[email protected]>
> Date: Fri, 18 Nov 2005 16:46:56 +1100
>
> > I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
> > this completely.

No, that was not too hasty: we all agreed that the case _ought_ not to
arise, and we hadn't worked out the right code to handle it if it did
arise. What was disappointing is that nobody hit the warnings while
it was in -mm, but only when it hit Linus' tree.

> > However I think it would not be unresonable to spit
> > out a warning, and remove it in 2.6.??

Yes.

> I am so convinced that handling COW faults on VM_RESERVED is
> unnecessary, that I think it's prudent to spit out a warning
> for MAP_PRIVATE+VM_RESERVED and changing it to MAP_SHARED
> to complete the mmap() call.
>
> I bet every single application still works.

Maybe, but MAP_SHARED PROT_WRITE needs write permission on the device,
whereas MAP_PRIVATE PROT_WRITE only needs read permission. I'd hate
to be giving away write permissions like this.

If I remember rightly from earlier trawls around here, one of the
nasty things I found was many drivers not checking whether private
or shared, but giving a writable vm_page_prot in either case (you
hint at this in one of your other mails).

I expect there are devices where read permission is in fact just as
dangerous as write permission; but nonetheless we ought to clean
that up with checks in the drivers.

> And we'll have none of this rediculious complex crap handling COW
> pages in VM_RESERVED areas, which I believe is seriously more
> complicated than what we started with before any of the VM_RESERVED
> changed went into 2.6.15. In fact, we might as well go back to the
> 2.6.14 stuff instead. I do not see the second half of Hugh's patches
> as progress, it's a severe regression to even 2.6.14

I agree with your principles, but not with your timings.

That code is necessary to reproduce the existing behaviour, which has
always done COW on PageReserved mappings without complaint - if the
vm_page_prot didn't already let you slip through without a WP fault.

(But even if the driver set up too relaxed a vm_page_prot intentionally,
that's vulnerable to having write protection added by mprotect or more
likely fork, if the mapping is actually declared to be private.)

You may be right that in fact nothing has ever been doing COW on
PageReserved mappings. But we don't know that. And we don't have
time to waste letting 2.6.15 sail forward, without being able to
handle the case if it turns out it is really needed. Changing the
apps (and all the user installations) at short notice is not a
sensible option.

We should stick with my "rediculious complex crap handling" for now
(it's not bad at all! and you should see what I originally sent
Nick for that!), and I should send Andrew a patch to add a warning
there later in the day.

> Doing a get_user_pages() on a VM_UNPAGED area, that's sane, and
> we know exactly what makes use of that. COW faults on VM_UNPAGED
> areas, that's not sane, and we don't know of a single instance
> which correctly needs that behavior.
>
> MAP_SHARED mappings of reserved pages shared between driver, device,
> and userspace is common and understandable. But MAP_PRIVATE mappings
> of such things? Please show me an example of something legitimately
> using that, and not doing so by mistake. I will drop all of my
> arguments once I see that :-)
>
> Because, frankly, a lot of these COW on VM_UNPAGED patches seemingly
> are derived from studying the MM and a few drivers and saying "oh yes,
> that's _possible_" but that is far from being enough to justify this
> complexity. We really need to see real usage, that makes sense.
> All the cases I've investigated in userspace are "they really want
> MAP_SHARED" or "they didn't need PROT_WRITE in the first place".

Yes, we do agree with this, which is why the core PageReserved removal
started out with just warning messages and refusal.

But 2.6.15-rc2 is the wrong point for conducting experiments in what
strange applications might be up to. And we know there's lots of
driver cleanup to be done around here: I agree with your suggestions
in earlier mail about tidying them up, but it probably won't happen
quickly. vmalloc_to_page was a _huge_ improvement to them, but
that's about the only thing that's been done in many years.

It's one of those areas where you go in, and find more and more,
and it's hard ever to wrap it up. I'm sure it's already too late
to expect any such cleanup in 2.6.16, but 2.6.17 or 2.6.18 perhaps.

Hugh

2005-11-18 08:04:46

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

From: Hugh Dickins <[email protected]>
Date: Fri, 18 Nov 2005 07:27:36 +0000 (GMT)

> On Thu, 17 Nov 2005, David S. Miller wrote:
> > From: Nick Piggin <[email protected]>
> > Date: Fri, 18 Nov 2005 16:46:56 +1100
> >
> > > I think for 2.6.15, yes. We [read: I :(] was too hasty in removing
> > > this completely.
>
> No, that was not too hasty: we all agreed that the case _ought_ not to
> arise, and we hadn't worked out the right code to handle it if it did
> arise. What was disappointing is that nobody reported the messages
> while it was in -mm, but

The recent vbetool suspend-from-ram datapoint shows that it might be
important that the BIOS data area is application local,
ie. MAP_PRIVATE. Ie. it works only if the writes are not performed
to the real BIOS data page.

If true, that means the MAP_PRIVATE+VM_UNPAGED case has legit users.
Although, such applications could just copy the interrupt vector plus
BIOS data area into an anonymously mapped region and have the vm86
execution work off that instead of the /dev/mem mapping.

So, just to make sure this all adds up, a PROT_WRITE+MAP_PRIVATE
mapping of /dev/mem results in any pages written to being COW'd.
Right?

It is a good question as to which cases doing stuff like this want to
make modifications to the real BIOS data area, and which ones do not.
Aparently vbetool does not.

2005-11-18 08:08:26

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

From: Hugh Dickins <[email protected]>
Date: Fri, 18 Nov 2005 08:02:02 +0000 (GMT)

> That code is necessary to reproduce the existing behaviour, which has
> always done COW on PageReserved mappings without complaint - if the
> vm_page_prot didn't already let you slip through without a WP fault.

And there is evidence today that this is really needed, at least
by vbetool.

Ok, we need COW on VM_UNPAGED. :)

2005-11-18 08:13:53

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Fri, 18 Nov 2005, David S. Miller wrote:
>
> The recent vbetool suspend-from-ram datapoint shows that it might be
> important that the BIOS data area is application local,
> ie. MAP_PRIVATE. Ie. it works only if the writes are not performed
> to the real BIOS data page.

Interesting. I haven't dared reach that conclusion yet.

> If true, that means the MAP_PRIVATE+VM_UNPAGED case has legit users.
> Although, such applications could just copy the interrupt vector plus
> BIOS data area into an anonymously mapped region and have the vm86
> execution work off that instead of the /dev/mem mapping.

Yes, they could indeed. Would save the kernel contortions.

> So, just to make sure this all adds up, a PROT_WRITE+MAP_PRIVATE
> mapping of /dev/mem results in any pages written to being COW'd.
> Right?

Yes, in everything before 2.6.15-rc1, and again with my patches.

> It is a good question as to which cases doing stuff like this want to
> make modifications to the real BIOS data area, and which ones do not.
> Aparently vbetool does not.

Hugh

2005-11-18 08:15:16

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Fri, 18 Nov 2005, David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Fri, 18 Nov 2005 08:02:02 +0000 (GMT)
>
> > That code is necessary to reproduce the existing behaviour, which has
> > always done COW on PageReserved mappings without complaint - if the
> > vm_page_prot didn't already let you slip through without a WP fault.
>
> And there is evidence today that this is really needed, at least
> by vbetool.
>
> Ok, we need COW on VM_UNPAGED. :)

Are you so sure of that, that we should even skip adding a warning?

Hugh

2005-11-18 08:36:19

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

From: Hugh Dickins <[email protected]>
Date: Fri, 18 Nov 2005 08:13:54 +0000 (GMT)

> On Fri, 18 Nov 2005, David S. Miller wrote:
> > From: Hugh Dickins <[email protected]>
> > Date: Fri, 18 Nov 2005 08:02:02 +0000 (GMT)
> >
> > > That code is necessary to reproduce the existing behaviour, which has
> > > always done COW on PageReserved mappings without complaint - if the
> > > vm_page_prot didn't already let you slip through without a WP fault.
> >
> > And there is evidence today that this is really needed, at least
> > by vbetool.
> >
> > Ok, we need COW on VM_UNPAGED. :)
>
> Are you so sure of that, that we should even skip adding a warning?

Yes, I'm pretty sure. The datapoints are like this:

1) Existing vbetool with 2.6.14 and previous works
2) 2.6.15 w/no-COW-on-reserved makes existing vbetool fail
immediately (of course)
3) 2.6.15 w/no-COW-on-reserved still fails with MAP_SHARED
patched vbetool
4) 2.6.15 w/COW-on-VM_UNPAGED works with existing vbetool
5) 2.6.15 w/COW-on-VM_UNPAGED _FAILS_ with MAP_SHARED patched
vbetool

That #5 is the key.

If we use MAP_SHARED and let vbetool modify the real BIOS data
area, resume fails. That's convincing enough for me :)

2005-11-18 09:34:42

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Fri, 18 Nov 2005, David S. Miller wrote:
>
> If we use MAP_SHARED and let vbetool modify the real BIOS data
> area, resume fails. That's convincing enough for me :)

Yes, I agree with your analysis: I was expecting Dominik's resume
failure to have a more conventional cause, but yes, it's from that
new vbetool actually corrupting the memory instead of using it as
a template.

Right, no warning message then. (Some other time we might want to
withdraw the feature - as you said, they could just read /dev/mem -
and put in a research warning in preparation; but right now I've
had enough of such noise.)

Dominik's Bad page states I'll be thinking about later in the day.

Hugh

2005-11-18 18:40:36

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Iau, 2005-11-17 at 15:52 -0800, David S. Miller wrote:
> The cases I've seen, and am aware of, I documented in a previous
> mail and those involve MAP_PRIVATE maps of /dev/mem or other
> similar fixed page mapping devices.
>
> Which case truly needs COW faults on VM_RESERVED memory which
> isn't an application bug of some sort?

lrmi, X, vbetool, ...

In general these applications want to operate with a copy of the real
bios data mapped into the system. It might be possible to read some of
the data objects (eg the 0x400 page) but it is not possible to read the
BIOS ROM area as it may contain paged segments.

I suspect a read only bios map and an r/w read copy of the bios config
pages might work but the new behaviour is a serious regression breaking
legitimate code so it is problematic when it occurs without a years
warning in the middle of a fairly stable sequence of kernels.

2005-11-18 19:59:42

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 09/11] unpaged: ZERO_PAGE in VM_UNPAGED

On Thu, 17 Nov 2005, Ingo Oeser wrote:
>
> We do we refcount ZERO_PAGE at all?

We never used to. They were, and for the moment still are, marked
PageReserved. Prior to 2.6.15-rc we didn't refcount reserved pages,
but now we're trying to move away from PageReserved (some differences
of opinion how far to go), so refcounting them.

We're currently refcounting the ZERO_PAGE(s) simply because the
common case is refcounted: it would just be extra tests and code
NOT to refcount the ZERO_PAGE(s). If they were a commoner case,
then it would indeed be worth avoiding refcounting them, but it
currently doesn't appear to be worth the effort.

But it is still up in the air: there is or may be an issue with
refcounts overflowing, and if it's clear that the ZERO_PAGE is
the only one vulnerable on any architecture, then I'm sure we'd
deal with it by not refcounting them. However, I believe the
issue extends to mapped file pages too: though you need a huge
amount of RAM to reach overflow, so it's not something we need
to resolve this week.

> Ok, there may be multiple, but they exist always and always at
> the same physical addresses, right?

Right.

> So why do we care at all?
> Memory hotplug?
> Doesn't it suffice there, that they are reverse mappable?

Actually, they're not reverse mappable: we tend to find
there's not much gained by swapping out the ZERO_PAGE ;)

Hugh

2005-11-18 21:08:36

[permalink] [raw]

Subject: Re: [PATCH 07/11] unpaged: COW on VM_UNPAGED

On Fri, Nov 18, 2005 at 08:02:02AM +0000, Hugh Dickins wrote:

> No, that was not too hasty: we all agreed that the case _ought_ not to
> arise, and we hadn't worked out the right code to handle it if it did
> arise. What was disappointing is that nobody hit the warnings while
> it was in -mm, but only when it hit Linus' tree.

It was found quite quickly though. Fedora rawhide tracks Linus'
-git trees on an almost daily basis, and this change blew up
the installer, so it wasn't hard to reproduce :)

ddcprobe isn't something that a lot of people run on a daily basis
(probably just the installer people, and they never see -mm kernels),
but I'm surprised that vbetool didn't get spotted when it was in -mm,
as more and more people seem to be trying out suspend/resume these days.

Maybe suspend was busted for some other reason at the time it
was in -mm, so people never got that far ?

Dave

2005-11-19 20:15:04

by Hugh Dickins

[permalink] [raw]

Subject: Re: [PATCH 04/11] unpaged: unifdefed PageCompound

On Thu, 17 Nov 2005, David S. Miller wrote:
> From: Hugh Dickins <[email protected]>
> Date: Thu, 17 Nov 2005 19:32:40 +0000 (GMT)
>
> > That's just what PageCompound is designed for, but it's been kept under
> > CONFIG_HUGETLB_PAGE. Remove the #ifdefs: which saves some space (out-
> > of-line put_page), doesn't slow down what most needs to be fast (already
> > using hugetlb), and unifies the way we handle high-order pages.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> I think this is a good change regardless of the VM_RESERVED issues.
>
> I've been wanting to use this facility in some sparc64 bits in
> the past, for example. But since it was HUGETLB guarded that
> wasn't possible.

I've only just found that we have to supply the __GFP_COMP flag to get
this working. And one of the routes through snd_dma_alloc_pages goes
to sbus_alloc_consistent. Would you be happy for me to send Andrew a
patch with sparc and sparc64 sbus_alloc_consistent including __GFP_COMP?
Ought I to do the same in the sparc and sparc64 pci_alloc_consistent??

Thanks,
Hugh

--- 2.6.15-rc1-mm2/arch/sparc/kernel/ioport.c 2005-06-17 20:48:29.000000000 +0100
+++ linux/arch/sparc/kernel/ioport.c 2005-11-19 19:11:04.000000000 +0000
@@ -252,7 +252,7 @@ void *sbus_alloc_consistent(struct sbus_
}

order = get_order(len_total);
- if ((va = __get_free_pages(GFP_KERNEL, order)) == 0)
+ if ((va = __get_free_pages(GFP_KERNEL|__GFP_COMP, order)) == 0)
goto err_nopages;

if ((res = kmalloc(sizeof(struct resource), GFP_KERNEL)) == NULL)
--- 2.6.15-rc1-mm2/arch/sparc64/kernel/sbus.c 2005-11-12 09:00:36.000000000 +0000
+++ linux/arch/sparc64/kernel/sbus.c 2005-11-19 19:12:53.000000000 +0000
@@ -327,7 +327,7 @@ void *sbus_alloc_consistent(struct sbus_
order = get_order(size);
if (order >= 10)
return NULL;
- first_page = __get_free_pages(GFP_KERNEL, order);
+ first_page = __get_free_pages(GFP_KERNEL|__GFP_COMP, order);
if (first_page == 0UL)
return NULL;
memset((char *)first_page, 0, PAGE_SIZE << order);

2005-11-19 20:56:47

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [PATCH 04/11] unpaged: unifdefed PageCompound

On Thu, 17 Nov 2005, David S. Miller wrote:
>> I think this is a good change regardless of the VM_RESERVED issues.
>> I've been wanting to use this facility in some sparc64 bits in
>> the past, for example. But since it was HUGETLB guarded that
>> wasn't possible.

On Sat, Nov 19, 2005 at 08:15:13PM +0000, Hugh Dickins wrote:
> I've only just found that we have to supply the __GFP_COMP flag to get
> this working. And one of the routes through snd_dma_alloc_pages goes
> to sbus_alloc_consistent. Would you be happy for me to send Andrew a
> patch with sparc and sparc64 sbus_alloc_consistent including __GFP_COMP?
> Ought I to do the same in the sparc and sparc64 pci_alloc_consistent??

I usually end up deferring to Dave on the driver issues, but this time
I can independently agree in an informed manner.

-- wli

2005-11-19 21:45:15

by David Miller

[permalink] [raw]

Subject: Re: [PATCH 04/11] unpaged: unifdefed PageCompound

From: William Lee Irwin III <[email protected]>
Date: Sat, 19 Nov 2005 12:55:57 -0800

> On Thu, 17 Nov 2005, David S. Miller wrote:
> >> I think this is a good change regardless of the VM_RESERVED issues.
> >> I've been wanting to use this facility in some sparc64 bits in
> >> the past, for example. But since it was HUGETLB guarded that
> >> wasn't possible.
>
> On Sat, Nov 19, 2005 at 08:15:13PM +0000, Hugh Dickins wrote:
> > I've only just found that we have to supply the __GFP_COMP flag to get
> > this working. And one of the routes through snd_dma_alloc_pages goes
> > to sbus_alloc_consistent. Would you be happy for me to send Andrew a
> > patch with sparc and sparc64 sbus_alloc_consistent including __GFP_COMP?
> > Ought I to do the same in the sparc and sparc64 pci_alloc_consistent??
>
> I usually end up deferring to Dave on the driver issues, but this time
> I can independently agree in an informed manner.

What is it needed for in this case? We never try to use those
pci_alloc_consistent() pages independantly, ie. freeing up
individual pages from a non-zero order allocation.

Just curious. :-)

2005-11-19 21:59:53

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [PATCH 04/11] unpaged: unifdefed PageCompound

> From: William Lee Irwin III <[email protected]>
> Date: Sat, 19 Nov 2005 12:55:57 -0800
> > I usually end up deferring to Dave on the driver issues, but this time
> > I can independently agree in an informed manner.

On Sat, Nov 19, 2005 at 01:41:14PM -0800, David S. Miller wrote:
> What is it needed for in this case? We never try to use those
> pci_alloc_consistent() pages independantly, ie. freeing up
> individual pages from a non-zero order allocation.
> Just curious. :-)

I glossed over the pci_alloc_consistent() part of hugh's msg. I
recalled the sound driver from memory. I don't know whether
pci_alloc_consistent()'s sparc-specific usage includes a similar
behavior, but never saw it when I looked for allocators of higher-order
pages that free fragments of them at a time. The audit was far from
comprehensive, though.

I must defer once again. =)

-- wli