2008-12-01 00:38:05

by Hugh Dickins

Subject: [PATCH 0/8] badpage: more resilient bad page pte and rmap

Here's a batch of 8 mm patches, intended for 2.6.29: revisiting
bad_page() and print_bad_pte() and the page_remove_rmap() Eeek BUG.
Diffed to slot in to the mmotm series just before "mmend".

The only clash with later mmotm patches is with Manfred's
mm-debug-dump-pageframes-on-bad_page.patch
which puts a hexdump in there. Trivial to fix up, but I've never
actually found that patch helpful - perhaps because it isn't an -mm
tree that "Bad page state" reporters are running. Time to drop it?

include/linux/page-flags.h | 25 ++------
include/linux/rmap.h | 2
include/linux/swap.h | 12 ---
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/internal.h | 1
mm/memory.c | 109 ++++++++++++++++++++++++++---------
mm/page_alloc.c | 108 +++++++++++++++++++---------------
mm/rmap.c | 24 -------
mm/swapfile.c | 7 +-
10 files changed, 166 insertions(+), 126 deletions(-)


2008-12-01 00:40:30

by Hugh Dickins

Subject: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

Simplify the PAGE_FLAGS checking and clearing when freeing and allocating
a page: check the same flags as before when freeing, clear ALL the flags
(unless PageReserved) when freeing, check ALL flags off when allocating.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/page-flags.h | 25 ++++++++-----------------
mm/page_alloc.c | 19 ++++++-------------
2 files changed, 14 insertions(+), 30 deletions(-)

--- badpage0/include/linux/page-flags.h 2008-11-26 12:18:59.000000000 +0000
+++ badpage1/include/linux/page-flags.h 2008-11-28 20:40:33.000000000 +0000
@@ -375,31 +375,22 @@ static inline void __ClearPageTail(struc
#define __PG_MLOCKED 0
#endif

-#define PAGE_FLAGS (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
- 1 << PG_buddy | 1 << PG_writeback | \
- 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
- __PG_UNEVICTABLE | __PG_MLOCKED)
-
-/*
- * Flags checked in bad_page(). Pages on the free list should not have
- * these flags set. It they are, there is a problem.
- */
-#define PAGE_FLAGS_CLEAR_WHEN_BAD (PAGE_FLAGS | \
- 1 << PG_reclaim | 1 << PG_dirty | 1 << PG_swapbacked)
-
/*
* Flags checked when a page is freed. Pages being freed should not have
* these flags set. It they are, there is a problem.
*/
-#define PAGE_FLAGS_CHECK_AT_FREE (PAGE_FLAGS | 1 << PG_reserved)
+#define PAGE_FLAGS_CHECK_AT_FREE \
+ (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
+ 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
+ 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
+ __PG_UNEVICTABLE | __PG_MLOCKED)

/*
* Flags checked when a page is prepped for return by the page allocator.
- * Pages being prepped should not have these flags set. It they are, there
- * is a problem.
+ * Pages being prepped should not have any flags set. It they are set,
+ * there has been a kernel bug or struct page corruption.
*/
-#define PAGE_FLAGS_CHECK_AT_PREP (PAGE_FLAGS | \
- 1 << PG_reserved | 1 << PG_dirty | 1 << PG_swapbacked)
+#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)

#endif /* !__GENERATING_BOUNDS_H */
#endif /* PAGE_FLAGS_H */
--- badpage0/mm/page_alloc.c 2008-11-26 12:19:00.000000000 +0000
+++ badpage1/mm/page_alloc.c 2008-11-28 20:40:33.000000000 +0000
@@ -231,7 +231,6 @@ static void bad_page(struct page *page)
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
KERN_EMERG "Backtrace:\n");
dump_stack();
- page->flags &= ~PAGE_FLAGS_CLEAR_WHEN_BAD;
set_page_count(page, 0);
reset_page_mapcount(page);
page->mapping = NULL;
@@ -468,16 +467,16 @@ static inline int free_pages_check(struc
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
bad_page(page);
- if (PageDirty(page))
- __ClearPageDirty(page);
- if (PageSwapBacked(page))
- __ClearPageSwapBacked(page);
/*
* For now, we report if PG_reserved was found set, but do not
* clear it, and do not free the page. But we shall soon need
* to do more, for when the ZERO_PAGE count wraps negative.
*/
- return PageReserved(page);
+ if (PageReserved(page))
+ return 1;
+ if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
+ page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ return 0;
}

/*
@@ -621,13 +620,7 @@ static int prep_new_page(struct page *pa
if (PageReserved(page))
return 1;

- page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim |
- 1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk
-#ifdef CONFIG_UNEVICTABLE_LRU
- | 1 << PG_mlocked
-#endif
- );
+ page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
set_page_private(page, 0);
set_page_refcounted(page);

2008-12-01 00:42:39

by Hugh Dickins

Subject: [PATCH 2/8] badpage: keep any bad page out of circulation

Until now the bad_page() checkers have special-cased PageReserved, keeping
those pages out of circulation thereafter. Now extend the special case to
all: we want to keep ANY page with bad state out of circulation - the
"free" page may well be in use by something.

Leave the bad state of those pages untouched, for examination by debuggers;
except for PageBuddy - leaving that set would risk bringing the page back.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/page_alloc.c | 52 +++++++++++++++++++++-------------------------
1 file changed, 24 insertions(+), 28 deletions(-)

--- badpage1/mm/page_alloc.c 2008-11-28 20:40:33.000000000 +0000
+++ badpage2/mm/page_alloc.c 2008-11-28 20:40:36.000000000 +0000
@@ -231,9 +231,9 @@ static void bad_page(struct page *page)
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
KERN_EMERG "Backtrace:\n");
dump_stack();
- set_page_count(page, 0);
- reset_page_mapcount(page);
- page->mapping = NULL;
+
+ /* Leave bad fields for debug, except PageBuddy could make trouble */
+ __ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);
}

@@ -290,25 +290,31 @@ void prep_compound_gigantic_page(struct
}
#endif

-static void destroy_compound_page(struct page *page, unsigned long order)
+static int destroy_compound_page(struct page *page, unsigned long order)
{
int i;
int nr_pages = 1 << order;
+ int bad = 0;

- if (unlikely(compound_order(page) != order))
+ if (unlikely(compound_order(page) != order) ||
+ unlikely(!PageHead(page))) {
bad_page(page);
+ bad++;
+ }

- if (unlikely(!PageHead(page)))
- bad_page(page);
__ClearPageHead(page);
+
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;

- if (unlikely(!PageTail(p) |
- (p->first_page != page)))
+ if (unlikely(!PageTail(p) | (p->first_page != page))) {
bad_page(page);
+ bad++;
+ }
__ClearPageTail(p);
}
+
+ return bad;
}

static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags)
@@ -428,7 +434,8 @@ static inline void __free_one_page(struc
int migratetype = get_pageblock_migratetype(page);

if (unlikely(PageCompound(page)))
- destroy_compound_page(page, order);
+ if (unlikely(destroy_compound_page(page, order)))
+ return;

page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);

@@ -465,15 +472,10 @@ static inline int free_pages_check(struc
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(page_count(page) != 0) |
- (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
+ (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) {
bad_page(page);
- /*
- * For now, we report if PG_reserved was found set, but do not
- * clear it, and do not free the page. But we shall soon need
- * to do more, for when the ZERO_PAGE count wraps negative.
- */
- if (PageReserved(page))
return 1;
+ }
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -521,11 +523,11 @@ static void __free_pages_ok(struct page
{
unsigned long flags;
int i;
- int reserved = 0;
+ int bad = 0;

for (i = 0 ; i < (1 << order) ; ++i)
- reserved += free_pages_check(page + i);
- if (reserved)
+ bad += free_pages_check(page + i);
+ if (bad)
return;

if (!PageHighMem(page)) {
@@ -610,17 +612,11 @@ static int prep_new_page(struct page *pa
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
(page_count(page) != 0) |
- (page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
+ (page->flags & PAGE_FLAGS_CHECK_AT_PREP))) {
bad_page(page);
-
- /*
- * For now, we report if PG_reserved was found set, but do not
- * clear it, and do not allocate the page: as a safety net.
- */
- if (PageReserved(page))
return 1;
+ }

- page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
set_page_private(page, 0);
set_page_refcounted(page);

2008-12-01 00:42:56

by Hugh Dickins

Subject: [PATCH 3/8] badpage: replace page_remove_rmap Eeek and BUG

Now that bad pages are kept out of circulation, there is no need for the
infamous page_remove_rmap() BUG() - once such a page reaches the free path,
its negative mapcount will issue a "Bad page state" message and the page
won't actually be freed. Removing the BUG() allows more info, on subsequent
pages, to be gathered.

We do have more info about the page at this point than bad_page() can know
- notably, what the pmd is, which might pinpoint something like low 64kB
corruption - but page_remove_rmap() isn't given the address to find that.

In practice, there is only one call to page_remove_rmap() which has ever
reported anything, that from zap_pte_range() (usually on exit, sometimes
on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek"
message and leave it all to zap_pte_range().

mm/memory.c already has a hardly used print_bad_pte() function, showing
some of the appropriate info: extend it to show what we want for the rmap
case: pte info, page info (when there is a page) and vma info to compare.
zap_pte_range() already knows the pmd, but print_bad_pte() is easier to
use if it works that out for itself.

Some of this info is also shown in bad_page()'s "Bad page state" message.
Keep them separate, but adjust them to match each other as far as possible.
Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE there too.

print_bad_pte() shows current->comm unconditionally (though it should get
repeated in the usually irrelevant stack trace): sorry, I misled Nick
Piggin into making it conditional on vm_mm == current->mm, but current->mm
is already NULL in the exit case. Usually current->comm is good, though
exceptionally it may not be that of the mm (when "swapoff" for example).

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 50 +++++++++++++++++++++++++++++++++++-----------
mm/page_alloc.c | 14 ++++++------
mm/rmap.c | 16 --------------
3 files changed, 46 insertions(+), 34 deletions(-)

--- badpage2/mm/memory.c 2008-11-26 12:19:00.000000000 +0000
+++ badpage3/mm/memory.c 2008-11-28 20:40:40.000000000 +0000
@@ -52,6 +52,9 @@
#include <linux/writeback.h>
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
+#include <linux/kallsyms.h>
+#include <linux/swapops.h>
+#include <linux/elf.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -59,9 +62,6 @@
#include <asm/tlbflush.h>
#include <asm/pgtable.h>

-#include <linux/swapops.h>
-#include <linux/elf.h>
-
#include "internal.h"

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -375,15 +375,41 @@ static inline void add_mm_rss(struct mm_
*
* The calling function must still handle the error.
*/
-static void print_bad_pte(struct vm_area_struct *vma, pte_t pte,
- unsigned long vaddr)
+static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
+ pte_t pte, struct page *page)
{
- printk(KERN_ERR "Bad pte = %08llx, process = %s, "
- "vm_flags = %lx, vaddr = %lx\n",
- (long long)pte_val(pte),
- (vma->vm_mm == current->mm ? current->comm : "???"),
- vma->vm_flags, vaddr);
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ struct address_space *mapping;
+ pgoff_t index;
+
+ mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
+ index = linear_page_index(vma, addr);
+
+ printk(KERN_EMERG "Bad page map in process %s pte:%08llx pmd:%08llx\n",
+ current->comm,
+ (long long)pte_val(pte), (long long)pmd_val(*pmd));
+ if (page) {
+ printk(KERN_EMERG
+ "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
+ page, (void *)page->flags, page_count(page),
+ page_mapcount(page), page->mapping, page->index);
+ }
+ printk(KERN_EMERG
+ "addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
+ (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
+ /*
+ * Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y
+ */
+ if (vma->vm_ops)
+ print_symbol(KERN_EMERG "vma->vm_ops->fault: %s\n",
+ (unsigned long)vma->vm_ops->fault);
+ if (vma->vm_file && vma->vm_file->f_op)
+ print_symbol(KERN_EMERG "vma->vm_file->f_op->mmap: %s\n",
+ (unsigned long)vma->vm_file->f_op->mmap);
dump_stack();
+ add_taint(TAINT_BAD_PAGE);
}

static inline int is_cow_mapping(unsigned int flags)
@@ -763,6 +789,8 @@ static unsigned long zap_pte_range(struc
file_rss--;
}
page_remove_rmap(page, vma);
+ if (unlikely(page_mapcount(page) < 0))
+ print_bad_pte(vma, addr, ptent, page);
tlb_remove_page(tlb, page);
continue;
}
@@ -2657,7 +2685,7 @@ static int do_nonlinear_fault(struct mm_
/*
* Page table corrupted: show pte and kill process.
*/
- print_bad_pte(vma, orig_pte, address);
+ print_bad_pte(vma, address, orig_pte, NULL);
return VM_FAULT_OOM;
}

--- badpage2/mm/page_alloc.c 2008-11-28 20:40:36.000000000 +0000
+++ badpage3/mm/page_alloc.c 2008-11-28 20:40:40.000000000 +0000
@@ -222,14 +222,14 @@ static inline int bad_range(struct zone

static void bad_page(struct page *page)
{
- printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG
- "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n",
- current->comm, page, (int)(2*sizeof(unsigned long)),
- (unsigned long)page->flags, page->mapping,
- page_mapcount(page), page_count(page));
+ printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n",
+ current->comm, page_to_pfn(page));
+ printk(KERN_EMERG
+ "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
+ page, (void *)page->flags, page_count(page),
+ page_mapcount(page), page->mapping, page->index);
+ printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n");

- printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"
- KERN_EMERG "Backtrace:\n");
dump_stack();

/* Leave bad fields for debug, except PageBuddy could make trouble */
--- badpage2/mm/rmap.c 2008-11-26 12:19:00.000000000 +0000
+++ badpage3/mm/rmap.c 2008-11-28 20:40:40.000000000 +0000
@@ -47,7 +47,6 @@
#include <linux/rmap.h>
#include <linux/rcupdate.h>
#include <linux/module.h>
-#include <linux/kallsyms.h>
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/migrate.h>
@@ -725,21 +724,6 @@ void page_dup_rmap(struct page *page, st
void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
- if (unlikely(page_mapcount(page) < 0)) {
- printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! (%d)\n", page_mapcount(page));
- printk (KERN_EMERG " page pfn = %lx\n", page_to_pfn(page));
- printk (KERN_EMERG " page->flags = %lx\n", page->flags);
- printk (KERN_EMERG " page->count = %x\n", page_count(page));
- printk (KERN_EMERG " page->mapping = %p\n", page->mapping);
- print_symbol (KERN_EMERG " vma->vm_ops = %s\n", (unsigned long)vma->vm_ops);
- if (vma->vm_ops) {
- print_symbol (KERN_EMERG " vma->vm_ops->fault = %s\n", (unsigned long)vma->vm_ops->fault);
- }
- if (vma->vm_file && vma->vm_file->f_op)
- print_symbol (KERN_EMERG " vma->vm_file->f_op->mmap = %s\n", (unsigned long)vma->vm_file->f_op->mmap);
- BUG();
- }
-
/*
* Now that the last pte has gone, s390 must transfer dirty
* flag from storage key to struct page. We can usually skip

2008-12-01 00:43:36

by Hugh Dickins

Subject: [PATCH 4/8] badpage: vm_normal_page use print_bad_pte

print_bad_pte() is so far being called only when zap_pte_range() finds
negative page_mapcount, or there's a fault on a pte_file where it does
not belong. That's weak coverage when we suspect pagetable corruption.

Originally, it was called when vm_normal_page() found an invalid pfn:
but pfn_valid is expensive on some architectures and configurations, so
2.6.24 put that under CONFIG_DEBUG_VM (which doesn't help in the field),
then 2.6.26 replaced it by a VM_BUG_ON (likewise).

Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper
test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug)
keeps a __read_mostly note of the highest_memmap_pfn, and vm_normal_page()
then checks pfn against that. We could call this pfn_plausible() or
pfn_sane(), but I doubt we'll need it elsewhere: of course it's not
reliable, but it gives much stronger pagetable validation on many boxes.

Also use print_bad_pte() when the pte_special bit is found outside a
VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON.

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/internal.h | 1 +
mm/memory.c | 20 ++++++++++----------
mm/page_alloc.c | 4 ++++
3 files changed, 15 insertions(+), 10 deletions(-)

--- badpage3/mm/internal.h 2008-11-10 11:27:02.000000000 +0000
+++ badpage4/mm/internal.h 2008-11-28 20:40:42.000000000 +0000
@@ -49,6 +49,7 @@ extern void putback_lru_page(struct page
/*
* in mm/page_alloc.c
*/
+extern unsigned long highest_memmap_pfn;
extern void __free_pages_bootmem(struct page *page, unsigned int order);

/*
--- badpage3/mm/memory.c 2008-11-28 20:40:40.000000000 +0000
+++ badpage4/mm/memory.c 2008-11-28 20:40:42.000000000 +0000
@@ -467,21 +467,18 @@ static inline int is_cow_mapping(unsigne
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
pte_t pte)
{
- unsigned long pfn;
+ unsigned long pfn = pte_pfn(pte);

if (HAVE_PTE_SPECIAL) {
- if (likely(!pte_special(pte))) {
- VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
- return pte_page(pte);
- }
- VM_BUG_ON(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)));
+ if (likely(!pte_special(pte)))
+ goto check_pfn;
+ if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)))
+ print_bad_pte(vma, addr, pte, NULL);
return NULL;
}

/* !HAVE_PTE_SPECIAL case follows: */

- pfn = pte_pfn(pte);
-
if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
if (vma->vm_flags & VM_MIXEDMAP) {
if (!pfn_valid(pfn))
@@ -497,11 +494,14 @@ struct page *vm_normal_page(struct vm_ar
}
}

- VM_BUG_ON(!pfn_valid(pfn));
+check_pfn:
+ if (unlikely(pfn > highest_memmap_pfn)) {
+ print_bad_pte(vma, addr, pte, NULL);
+ return NULL;
+ }

/*
* NOTE! We still have PageReserved() pages in the page tables.
- *
* eg. VDSO mappings can cause them to exist.
*/
out:
--- badpage3/mm/page_alloc.c 2008-11-28 20:40:40.000000000 +0000
+++ badpage4/mm/page_alloc.c 2008-11-28 20:40:42.000000000 +0000
@@ -69,6 +69,7 @@ EXPORT_SYMBOL(node_states);

unsigned long totalram_pages __read_mostly;
unsigned long totalreserve_pages __read_mostly;
+unsigned long highest_memmap_pfn __read_mostly;
int percpu_pagelist_fraction;

#ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE
@@ -2597,6 +2598,9 @@ void __meminit memmap_init_zone(unsigned
unsigned long pfn;
struct zone *z;

+ if (highest_memmap_pfn < end_pfn - 1)
+ highest_memmap_pfn = end_pfn - 1;
+
z = &NODE_DATA(nid)->node_zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*

2008-12-01 00:44:42

by Hugh Dickins

Subject: [PATCH 5/8] badpage: zap print_bad_pte on swap and file

Complete zap_pte_range()'s coverage of bad pagetable entries by calling
print_bad_pte() on a pte_file in a linear vma and on a bad swap entry.
That needs free_swap_and_cache() to tell it, which will also have shown
one of those "swap_free" errors (but with much less information).

Similar checks in fork's copy_one_pte()? No, that would be more noisy
than helpful: we'll see them when parent and child exec or exit.

Where do_nonlinear_fault() calls print_bad_pte(): omit the !VM_CAN_NONLINEAR
case, which could only be a bug in sys_remap_file_pages(), not a bad pte.
VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent
with what happens if do_swap_page() operates on a bad swap entry; but don't
we have patches to be more careful about killing when VM_FAULT_OOM?

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/swap.h | 12 +++---------
mm/memory.c | 11 +++++++----
mm/swapfile.c | 7 ++++---

--- badpage4/include/linux/swap.h 2008-11-26 12:19:00.000000000 +0000
+++ badpage5/include/linux/swap.h 2008-11-28 20:40:46.000000000 +0000
@@ -305,7 +305,7 @@ extern swp_entry_t get_swap_page_of_type
extern int swap_duplicate(swp_entry_t);
extern int valid_swaphandles(swp_entry_t, unsigned long *);
extern void swap_free(swp_entry_t);
-extern void free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_cache(swp_entry_t);
extern int swap_type_of(dev_t, sector_t, struct block_device **);
extern unsigned int count_swap_pages(int, int);
extern sector_t map_swap_page(struct swap_info_struct *, pgoff_t);
@@ -355,14 +355,8 @@ static inline void show_swap_cache_info(
{
}

-static inline void free_swap_and_cache(swp_entry_t swp)
-{
-}
-
-static inline int swap_duplicate(swp_entry_t swp)
-{
- return 0;
-}
+#define free_swap_and_cache(swp) is_migration_entry(swp)
+#define swap_duplicate(swp) is_migration_entry(swp)

static inline void swap_free(swp_entry_t swp)
{
--- badpage4/mm/memory.c 2008-11-28 20:40:42.000000000 +0000
+++ badpage5/mm/memory.c 2008-11-28 20:40:46.000000000 +0000
@@ -800,8 +800,12 @@ static unsigned long zap_pte_range(struc
*/
if (unlikely(details))
continue;
- if (!pte_file(ptent))
- free_swap_and_cache(pte_to_swp_entry(ptent));
+ if (pte_file(ptent)) {
+ if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ } else if
+ (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
+ print_bad_pte(vma, addr, ptent, NULL);
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

@@ -2680,8 +2684,7 @@ static int do_nonlinear_fault(struct mm_
if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
return 0;

- if (unlikely(!(vma->vm_flags & VM_NONLINEAR) ||
- !(vma->vm_flags & VM_CAN_NONLINEAR))) {
+ if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
/*
* Page table corrupted: show pte and kill process.
*/
--- badpage4/mm/swapfile.c 2008-11-28 20:37:16.000000000 +0000
+++ badpage5/mm/swapfile.c 2008-11-28 20:40:46.000000000 +0000
@@ -571,13 +571,13 @@ int try_to_free_swap(struct page *page)
* Free the swap entry like above, but also try to
* free the page cache entry if it is the last user.
*/
-void free_swap_and_cache(swp_entry_t entry)
+int free_swap_and_cache(swp_entry_t entry)
{
- struct swap_info_struct * p;
+ struct swap_info_struct *p;
struct page *page = NULL;

if (is_migration_entry(entry))
- return;
+ return 1;

p = swap_info_get(entry);
if (p) {
@@ -603,6 +603,7 @@ void free_swap_and_cache(swp_entry_t ent
unlock_page(page);
page_cache_release(page);
}
+ return p != NULL;
}

#ifdef CONFIG_HIBERNATION

2008-12-01 00:45:36

by Hugh Dickins

Subject: [PATCH 6/8] badpage: remove vma from page_remove_rmap

Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
page_dup_rmap(): we're trying to be more resilient about that than BUGs.

Signed-off-by: Hugh Dickins <[email protected]>
---

include/linux/rmap.h | 2 +-
mm/filemap_xip.c | 2 +-
mm/fremap.c | 2 +-
mm/memory.c | 4 ++--
mm/rmap.c | 8 +++-----
5 files changed, 8 insertions(+), 10 deletions(-)

--- badpage5/include/linux/rmap.h 2008-11-26 12:18:59.000000000 +0000
+++ badpage6/include/linux/rmap.h 2008-11-28 20:40:48.000000000 +0000
@@ -69,7 +69,7 @@ void __anon_vma_link(struct vm_area_stru
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
void page_add_file_rmap(struct page *);
-void page_remove_rmap(struct page *, struct vm_area_struct *);
+void page_remove_rmap(struct page *);

#ifdef CONFIG_DEBUG_VM
void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address);
--- badpage5/mm/filemap_xip.c 2008-10-09 23:13:53.000000000 +0100
+++ badpage6/mm/filemap_xip.c 2008-11-28 20:40:48.000000000 +0000
@@ -193,7 +193,7 @@ retry:
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
- page_remove_rmap(page, vma);
+ page_remove_rmap(page);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
--- badpage5/mm/fremap.c 2008-10-24 09:28:26.000000000 +0100
+++ badpage6/mm/fremap.c 2008-11-28 20:40:48.000000000 +0000
@@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm
if (page) {
if (pte_dirty(pte))
set_page_dirty(page);
- page_remove_rmap(page, vma);
+ page_remove_rmap(page);
page_cache_release(page);
update_hiwater_rss(mm);
dec_mm_counter(mm, file_rss);
--- badpage5/mm/memory.c 2008-11-28 20:40:46.000000000 +0000
+++ badpage6/mm/memory.c 2008-11-28 20:40:48.000000000 +0000
@@ -788,7 +788,7 @@ static unsigned long zap_pte_range(struc
mark_page_accessed(page);
file_rss--;
}
- page_remove_rmap(page, vma);
+ page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
tlb_remove_page(tlb, page);
@@ -1996,7 +1996,7 @@ gotten:
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page, vma);
+ page_remove_rmap(old_page);
}

/* Free the old page.. */
--- badpage5/mm/rmap.c 2008-11-28 20:40:40.000000000 +0000
+++ badpage6/mm/rmap.c 2008-11-28 20:40:48.000000000 +0000
@@ -707,7 +707,6 @@ void page_add_file_rmap(struct page *pag
*/
void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address)
{
- BUG_ON(page_mapcount(page) == 0);
if (PageAnon(page))
__page_check_anon_rmap(page, vma, address);
atomic_inc(&page->_mapcount);
@@ -717,11 +716,10 @@ void page_dup_rmap(struct page *page, st
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
- * @vma: the vm area in which the mapping is removed
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, struct vm_area_struct *vma)
+void page_remove_rmap(struct page *page)
{
if (atomic_add_negative(-1, &page->_mapcount)) {
/*
@@ -837,7 +835,7 @@ static int try_to_unmap_one(struct page
dec_mm_counter(mm, file_rss);


- page_remove_rmap(page, vma);
+ page_remove_rmap(page);
page_cache_release(page);

out_unmap:
@@ -952,7 +950,7 @@ static int try_to_unmap_cluster(unsigned
if (pte_dirty(pteval))
set_page_dirty(page);

- page_remove_rmap(page, vma);
+ page_remove_rmap(page);
page_cache_release(page);
dec_mm_counter(mm, file_rss);
(*mapcount)--;

2008-12-01 00:46:45

by Hugh Dickins

Subject: [PATCH 7/8] badpage: ratelimit print_bad_pte and bad_page

print_bad_pte() and bad_page() might each need ratelimiting - especially
for their dump_stacks, almost never of interest, yet not quite dispensable.
Correlating corruption across neighbouring entries can be very helpful,
so allow a burst of 60 reports before keeping quiet for the remainder
of that minute (or allow a steady drip of one report per second).

Signed-off-by: Hugh Dickins <[email protected]>
---

mm/memory.c | 23 +++++++++++++++++++++++
mm/page_alloc.c | 26 +++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 1 deletion(-)

--- badpage6/mm/memory.c 2008-11-28 20:40:48.000000000 +0000
+++ badpage7/mm/memory.c 2008-11-28 20:40:50.000000000 +0000
@@ -383,6 +383,29 @@ static void print_bad_pte(struct vm_area
pmd_t *pmd = pmd_offset(pud, addr);
struct address_space *mapping;
pgoff_t index;
+ static unsigned long resume;
+ static unsigned long nr_shown;
+ static unsigned long nr_unshown;
+
+ /*
+ * Allow a burst of 60 reports, then keep quiet for that minute;
+ * or allow a steady drip of one report per second.
+ */
+ if (nr_shown == 60) {
+ if (time_before(jiffies, resume)) {
+ nr_unshown++;
+ return;
+ }
+ if (nr_unshown) {
+ printk(KERN_EMERG
+ "Bad page map: %lu messages suppressed\n",
+ nr_unshown);
+ nr_unshown = 0;
+ }
+ nr_shown = 0;
+ }
+ if (nr_shown++ == 0)
+ resume = jiffies + 60 * HZ;

mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
index = linear_page_index(vma, addr);
--- badpage6/mm/page_alloc.c 2008-11-28 20:40:42.000000000 +0000
+++ badpage7/mm/page_alloc.c 2008-11-28 20:40:50.000000000 +0000
@@ -223,6 +223,30 @@ static inline int bad_range(struct zone

static void bad_page(struct page *page)
{
+ static unsigned long resume;
+ static unsigned long nr_shown;
+ static unsigned long nr_unshown;
+
+ /*
+ * Allow a burst of 60 reports, then keep quiet for that minute;
+ * or allow a steady drip of one report per second.
+ */
+ if (nr_shown == 60) {
+ if (time_before(jiffies, resume)) {
+ nr_unshown++;
+ goto out;
+ }
+ if (nr_unshown) {
+ printk(KERN_EMERG
+ "Bad page state: %lu messages suppressed\n",
+ nr_unshown);
+ nr_unshown = 0;
+ }
+ nr_shown = 0;
+ }
+ if (nr_shown++ == 0)
+ resume = jiffies + 60 * HZ;
+
printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n",
current->comm, page_to_pfn(page));
printk(KERN_EMERG
@@ -232,7 +256,7 @@ static void bad_page(struct page *page)
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n");

dump_stack();
-
+out:
/* Leave bad fields for debug, except PageBuddy could make trouble */
__ClearPageBuddy(page);
add_taint(TAINT_BAD_PAGE);

2008-12-01 00:48:16

by Hugh Dickins

Subject: [PATCH 8/8] badpage: KERN_ALERT BUG instead of KERN_EMERG

bad_page() and rmap Eeek messages have said KERN_EMERG for a few years,
which I've followed in print_bad_pte(). These are serious system errors,
on a par with BUGs, but they're not quite emergencies, and we do our best
to carry on: say KERN_ALERT "BUG: " like the x86 oops does.

And remove the "Trying to fix it up, but a reboot is needed" line:
it's not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

Signed-off-by: Hugh Dickins <[email protected]>
---
I've left this proposal until last, expecting some opposition.

Considered adding oops_begin() and oops_end(),
but I'm not at all sure that would work out well for these.

mm/memory.c | 15 ++++++++-------
mm/page_alloc.c | 9 ++++-----
2 files changed, 12 insertions(+), 12 deletions(-)

--- badpage7/mm/memory.c 2008-11-28 20:40:50.000000000 +0000
+++ badpage8/mm/memory.c 2008-11-28 20:40:52.000000000 +0000
@@ -397,8 +397,8 @@ static void print_bad_pte(struct vm_area
return;
}
if (nr_unshown) {
- printk(KERN_EMERG
- "Bad page map: %lu messages suppressed\n",
+ printk(KERN_ALERT
+ "BUG: Bad page map: %lu messages suppressed\n",
nr_unshown);
nr_unshown = 0;
}
@@ -410,26 +410,27 @@ static void print_bad_pte(struct vm_area
mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
index = linear_page_index(vma, addr);

- printk(KERN_EMERG "Bad page map in process %s pte:%08llx pmd:%08llx\n",
+ printk(KERN_ALERT
+ "BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n",
current->comm,
(long long)pte_val(pte), (long long)pmd_val(*pmd));
if (page) {
- printk(KERN_EMERG
+ printk(KERN_ALERT
"page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
page, (void *)page->flags, page_count(page),
page_mapcount(page), page->mapping, page->index);
}
- printk(KERN_EMERG
+ printk(KERN_ALERT
"addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
(void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
/*
* Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y
*/
if (vma->vm_ops)
- print_symbol(KERN_EMERG "vma->vm_ops->fault: %s\n",
+ print_symbol(KERN_ALERT "vma->vm_ops->fault: %s\n",
(unsigned long)vma->vm_ops->fault);
if (vma->vm_file && vma->vm_file->f_op)
- print_symbol(KERN_EMERG "vma->vm_file->f_op->mmap: %s\n",
+ print_symbol(KERN_ALERT "vma->vm_file->f_op->mmap: %s\n",
(unsigned long)vma->vm_file->f_op->mmap);
dump_stack();
add_taint(TAINT_BAD_PAGE);
--- badpage7/mm/page_alloc.c 2008-11-28 20:40:50.000000000 +0000
+++ badpage8/mm/page_alloc.c 2008-11-28 20:40:52.000000000 +0000
@@ -237,8 +237,8 @@ static void bad_page(struct page *page)
goto out;
}
if (nr_unshown) {
- printk(KERN_EMERG
- "Bad page state: %lu messages suppressed\n",
+ printk(KERN_ALERT
+ "BUG: Bad page state: %lu messages suppressed\n",
nr_unshown);
nr_unshown = 0;
}
@@ -247,13 +247,12 @@ static void bad_page(struct page *page)
if (nr_shown++ == 0)
resume = jiffies + 60 * HZ;

- printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n",
+ printk(KERN_ALERT "BUG: Bad page state in process %s pfn:%05lx\n",
current->comm, page_to_pfn(page));
- printk(KERN_EMERG
+ printk(KERN_ALERT
"page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n",
page, (void *)page->flags, page_count(page),
page_mapcount(page), page->mapping, page->index);
- printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n");

dump_stack();
out:

2008-12-01 14:43:16

by Christoph Lameter

Subject: Re: [PATCH 8/8] badpage: KERN_ALERT BUG instead of KERN_EMERG

On Mon, 1 Dec 2008, Hugh Dickins wrote:
> And remove the "Trying to fix it up, but a reboot is needed" line:
> it's not untrue, but I hope the KERN_ALERT "BUG: " conveys as much.

Thanks. That was a pretty strange message....

2008-12-01 14:50:40

by Christoph Lameter

Subject: Re: [PATCH 2/8] badpage: keep any bad page out of circulation

On Mon, 1 Dec 2008, Hugh Dickins wrote:

> Until now the bad_page() checkers have special-cased PageReserved, keeping
> those pages out of circulation thereafter. Now extend the special case to
> all: we want to keep ANY page with bad state out of circulation - the
> "free" page may well be in use by something.

If I screw up with a VM patch then my machine will now die because of OOM
instead of letting me shutdown and reboot?

2008-12-01 14:53:46

by Christoph Lameter

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Mon, 1 Dec 2008, Hugh Dickins wrote:

> /*
> * Flags checked when a page is freed. Pages being freed should not have
> * these flags set. It they are, there is a problem.
> */
> -#define PAGE_FLAGS_CHECK_AT_FREE (PAGE_FLAGS | 1 << PG_reserved)
> +#define PAGE_FLAGS_CHECK_AT_FREE \
> + (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
> + 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
> + 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> + __PG_UNEVICTABLE | __PG_MLOCKED)

Rename this to PAGE_FLAGS_CLEAR_WHEN_FREE?

> + * Pages being prepped should not have any flags set. It they are set,
> + * there has been a kernel bug or struct page corruption.
> */
> -#define PAGE_FLAGS_CHECK_AT_PREP (PAGE_FLAGS | \
> - 1 << PG_reserved | 1 << PG_dirty | 1 << PG_swapbacked)
> +#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)

These are all the bits. Can we get rid of this definition?

> /*
> * For now, we report if PG_reserved was found set, but do not
> * clear it, and do not free the page. But we shall soon need
> * to do more, for when the ZERO_PAGE count wraps negative.
> */
> - return PageReserved(page);
> + if (PageReserved(page))
> + return 1;
> + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + return 0;

The name PAGE_FLAGS_CHECK_AT_PREP is strange. We clear these flags without
message. This is equal to

page->flags &=~PAGE_FLAGS_CHECK_AT_PREP;

You can drop the if...

2008-12-01 23:20:07

by Hugh Dickins

Subject: Re: [PATCH 2/8] badpage: keep any bad page out of circulation

On Mon, 1 Dec 2008, Christoph Lameter wrote:
> On Mon, 1 Dec 2008, Hugh Dickins wrote:
>
> > Until now the bad_page() checkers have special-cased PageReserved, keeping
> > those pages out of circulation thereafter. Now extend the special case to
> > all: we want to keep ANY page with bad state out of circulation - the
> > "free" page may well be in use by something.
>
> If I screw up with a VM patch

Oh, perish the thought!

> then my machine will now die because of OOM
> instead of letting me shutdown and reboot?

If you screw up so royally as to allocate every page in the machine
and free it with bad state, yes, that's indeed the way it will tend.
Or, to the extent that you're relying on high orders and low zones,
it will happen even sooner. Same as if you screw up and forget to
free your pages.

That's okay. In more normal cases you'll see the warnings before
it's dead, and shutdown and reboot (hopefully a different kernel!)
before it reaches that state. By the time your patches reach -mm,
I'd hope you'll have weeded out the immediate OOM cases.

Hugh

2008-12-01 23:50:29

by Hugh Dickins

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Mon, 1 Dec 2008, Christoph Lameter wrote:
> On Mon, 1 Dec 2008, Hugh Dickins wrote:
> > /*
> > * Flags checked when a page is freed. Pages being freed should not have
> > * these flags set. It they are, there is a problem.
> > */
> > -#define PAGE_FLAGS_CHECK_AT_FREE (PAGE_FLAGS | 1 << PG_reserved)
> > +#define PAGE_FLAGS_CHECK_AT_FREE \
> > + (1 << PG_lru | 1 << PG_private | 1 << PG_locked | \
> > + 1 << PG_buddy | 1 << PG_writeback | 1 << PG_reserved | \
> > + 1 << PG_slab | 1 << PG_swapcache | 1 << PG_active | \
> > + __PG_UNEVICTABLE | __PG_MLOCKED)
>
> Rename this to PAGE_FLAGS_CLEAR_WHEN_FREE?

No, that's a list of just the ones it's checking at free;
it then (with this patch) goes on to clear all of them.

>
> > + * Pages being prepped should not have any flags set. It they are set,
> > + * there has been a kernel bug or struct page corruption.
> > */
> > -#define PAGE_FLAGS_CHECK_AT_PREP (PAGE_FLAGS | \
> > - 1 << PG_reserved | 1 << PG_dirty | 1 << PG_swapbacked)
> > +#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
>
> These are all the bits. Can we get rid of this definition?

PAGE_FLAGS_CHECK_AT_PREP may not be the best name for it now;
but I do think we need a definition for it, and I'm not sure
that it will remain "all the page flags".

As it was, I just took the existing name, and then included every
flag in it. I'd love to include the empty space, if any, up as
far as the mmzone bits - is there a convenient way to do that?

It could as well be called PAGE_FLAGS_CLEAR_AT_FREE. I'm not
sure that it's necessarily the same as all the flags - in fact,
I was rather surprised that the patch booted first time, I was
expecting to find that I'd overlooked some special cases.

I meant to, but didn't, look at Martin's guest page hinting, might
that be defining page flags set even across the free/alloc gap?
Cc'ed Martin now, no need for him to answer, but let's at least
warn him of this patch, something he might need to change with his.

> > /*
> > * For now, we report if PG_reserved was found set, but do not
> > * clear it, and do not free the page. But we shall soon need
> > * to do more, for when the ZERO_PAGE count wraps negative.
> > */
> > - return PageReserved(page);
> > + if (PageReserved(page))
> > + return 1;
> > + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> > + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > + return 0;
>
> The name PAGE_FLAGS_CHECK_AT_PREP is strange. We clear these flags without
> message.

Would you be happier with PAGE_FLAGS_CLEAR_AT_FREE, then?
That would be fine by me, even if we add the gap to mmzone later.

One of the problems with PREP is that it's not obvious that it
means ALLOC: yes, I'd be happier with PAGE_FLAGS_CLEAR_AT_FREE.

> This is equal to
>
> page->flags &=~PAGE_FLAGS_CHECK_AT_PREP;
>
> You can drop the if...

I was intentionally following the existing style of
if (PageDirty(page))
__ClearPageDirty(page);
if (PageSwapBacked(page))
__ClearPageSwapBacked(page);
which is going out of its way to avoid dirtying a cacheline.

In all the obvious cases, I think the cacheline will already
be dirty; but I guess there's an important case (high order
but not compound?) which has a lot of clean cachelines.

Hugh

2008-12-02 02:23:16

by Christoph Lameter

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Mon, 1 Dec 2008, Hugh Dickins wrote:

> > Rename this to PAGE_FLAGS_CLEAR_WHEN_FREE?
>
> No, that's a list of just the ones it's checking at free;
> it then (with this patch) goes on to clear all of them.

But they are always clear on free. The checking is irrelevant.

> One of the problems with PREP is that it's not obvious that it
> means ALLOC: yes, I'd be happier with PAGE_FLAGS_CLEAR_AT_FREE.

Ok.

>
> > This is equal to
> >
> > page->flags &=~PAGE_FLAGS_CHECK_AT_PREP;
> >
> > You can drop the if...
>
> I was intentionally following the existing style of
> if (PageDirty(page))
> __ClearPageDirty(page);
> if (PageSwapBacked(page))
> __ClearPageSwapBacked(page);
> which is going out of its way to avoid dirtying a cacheline.
>
> In all the obvious cases, I think the cacheline will already
> be dirty; but I guess there's an important case (high order
> but not compound?) which has a lot of clean cachelines.

Free or alloc dirties the cacheline of the page struct regardless since
the LRU field is always modified.

Well, ok. The not compound high order case may be an exception.

But then lets at least make a single check

If (page->flags & (all the flags including dirty and SwapBacked))
zap-em.

2008-12-02 10:39:43

by Hugh Dickins

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Mon, 1 Dec 2008, Christoph Lameter wrote:
> On Mon, 1 Dec 2008, Hugh Dickins wrote:
> > > > PAGE_FLAGS_CHECK_AT_FREE
>
> > > Rename this to PAGE_FLAGS_CLEAR_WHEN_FREE?
> >
> > No, that's a list of just the ones it's checking at free;
> > it then (with this patch) goes on to clear all of them.
>
> But they are always clear on free. The checking is irrelevant.

How about CHECK_PAGE_FLAGS_CLEAR_AT_FREE?

>
> > One of the problems with PREP is that it's not obvious that it
> > means ALLOC: yes, I'd be happier with PAGE_FLAGS_CLEAR_AT_FREE.
>
> Ok.

If you like the suggestion above, then for this one
how about CHECK_PAGE_FLAGS_CLEAR_AT_ALLOC?

I just haven't changed those names in the patch, continued to
use the names from before: they're probably not the names I'd
have chosen, but it is hard to write a paragraph in a name.

The one I really disliked was "PAGE_FLAGS" for an obscure
subset of page flags, and have got rid of that.

> > > This is equal to
> > >
> > > page->flags &=~PAGE_FLAGS_CHECK_AT_PREP;
> > >
> > > You can drop the if...
> >
> > I was intentionally following the existing style of
> > if (PageDirty(page))
> > __ClearPageDirty(page);
> > if (PageSwapBacked(page))
> > __ClearPageSwapBacked(page);
> > which is going out of its way to avoid dirtying a cacheline.
> >
> > In all the obvious cases, I think the cacheline will already
> > be dirty; but I guess there's an important case (high order
> > but not compound?) which has a lot of clean cachelines.
>
> Free or alloc dirties the cacheline of the page struct regardless since
> the LRU field is always modified.
>
> Well, ok. The not compound high order case may be an exception.
>
> But then lets at least make a single check
>
> If (page->flags & (all the flags including dirty and SwapBacked))
> zap-em.

That's exactly what I did, isn't it?

Hugh

2008-12-02 13:17:48

by Christoph Lameter

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Tue, 2 Dec 2008, Hugh Dickins wrote:

> > But they are always clear on free. The checking is irrelevant.
>
> How about CHECK_PAGE_FLAGS_CLEAR_AT_FREE?

Strange name.

> The one I really disliked was "PAGE_FLAGS" for an obscure
> subset of page flags, and have got rid of that.

Good.

> > If (page->flags & (all the flags including dirty and SwapBacked))
> > zap-em.
>
> That's exactly what I did, isn't it?

Yes but you added another instance of this. Can you consolidate all the
check and clears into one?

2008-12-02 14:12:57

by Hugh Dickins

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Tue, 2 Dec 2008, Christoph Lameter wrote:
> On Tue, 2 Dec 2008, Hugh Dickins wrote:
>
> > > But they are always clear on free. The checking is irrelevant.
> >
> > How about CHECK_PAGE_FLAGS_CLEAR_AT_FREE?
>
> Strange name.

Looks like I'm not going to be able to satisfy you then. I didn't
introduce the names in the patch, so let's leave them as is for now,
and everybody can muse on what they should get called in the end.

> > > If (page->flags & (all the flags including dirty and SwapBacked))
> > > zap-em.
> >
> > That's exactly what I did, isn't it?
>
> Yes but you added another instance of this.

Did I? Whereabouts? I wonder if you're thinking of the
+ page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
in prep_new_page(), which replaces the clearing of another
collection of flags which somehow didn't get named before.

That clearing is a temporary measure, to keep the handling
of PageReserved unchanged in that patch; then it vanishes in the
next patch, where we treat all bad_page candidates the same way.

> Can you consolidate all the check and clears into one?

You mean one test_and_clear_bits() that somehow covers the different
cases of what we expect at free time and what we need at alloc time?
I don't think so.

Hugh

2008-12-03 00:57:59

by Andrew Morton

Subject: Re: [PATCH 7/8] badpage: ratelimit print_bad_pte and bad_page

On Mon, 1 Dec 2008 00:46:53 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> print_bad_pte() and bad_page() might each need ratelimiting - especially
> for their dump_stacks, almost never of interest, yet not quite dispensable.
> Correlating corruption across neighbouring entries can be very helpful,
> so allow a burst of 60 reports before keeping quiet for the remainder
> of that minute (or allow a steady drip of one report per second).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/memory.c | 23 +++++++++++++++++++++++
> mm/page_alloc.c | 26 +++++++++++++++++++++++++-
> 2 files changed, 48 insertions(+), 1 deletion(-)
>
> --- badpage6/mm/memory.c 2008-11-28 20:40:48.000000000 +0000
> +++ badpage7/mm/memory.c 2008-11-28 20:40:50.000000000 +0000
> @@ -383,6 +383,29 @@ static void print_bad_pte(struct vm_area
> pmd_t *pmd = pmd_offset(pud, addr);
> struct address_space *mapping;
> pgoff_t index;
> + static unsigned long resume;
> + static unsigned long nr_shown;
> + static unsigned long nr_unshown;
> +
> + /*
> + * Allow a burst of 60 reports, then keep quiet for that minute;
> + * or allow a steady drip of one report per second.
> + */
> + if (nr_shown == 60) {
> + if (time_before(jiffies, resume)) {
> + nr_unshown++;
> + return;
> + }
> + if (nr_unshown) {
> + printk(KERN_EMERG
> + "Bad page map: %lu messages suppressed\n",
> + nr_unshown);
> + nr_unshown = 0;
> + }
> + nr_shown = 0;
> + }
> + if (nr_shown++ == 0)
> + resume = jiffies + 60 * HZ;
>
> mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> index = linear_page_index(vma, addr);
> --- badpage6/mm/page_alloc.c 2008-11-28 20:40:42.000000000 +0000
> +++ badpage7/mm/page_alloc.c 2008-11-28 20:40:50.000000000 +0000
> @@ -223,6 +223,30 @@ static inline int bad_range(struct zone
>
> static void bad_page(struct page *page)
> {
> + static unsigned long resume;
> + static unsigned long nr_shown;
> + static unsigned long nr_unshown;
> +
> + /*
> + * Allow a burst of 60 reports, then keep quiet for that minute;
> + * or allow a steady drip of one report per second.
> + */
> + if (nr_shown == 60) {
> + if (time_before(jiffies, resume)) {
> + nr_unshown++;
> + goto out;
> + }
> + if (nr_unshown) {
> + printk(KERN_EMERG
> + "Bad page state: %lu messages suppressed\n",
> + nr_unshown);
> + nr_unshown = 0;
> + }
> + nr_shown = 0;
> + }
> + if (nr_shown++ == 0)
> + resume = jiffies + 60 * HZ;
> +

gee, that's pretty elaborate. There's no way of using the
possibly-enhanced ratelimit.h?

2008-12-03 00:59:50

by Andrew Morton

Subject: Re: [PATCH 1/8] badpage: simplify page_alloc flag check+clear

On Tue, 2 Dec 2008 14:12:05 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 2 Dec 2008, Christoph Lameter wrote:
> > On Tue, 2 Dec 2008, Hugh Dickins wrote:
> >
> > > > But they are always clear on free. The checking is irrelevant.
> > >
> > > How about CHECK_PAGE_FLAGS_CLEAR_AT_FREE?
> >
> > Strange name.
>
> Looks like I'm not going to be able to satisfy you then. I didn't
> introduce the names in the patch, so let's leave them as is for now,
> and everybody can muse on what they should get called in the end.

It's unclear to me where your discussion with Christoph ended up, so I
went the merge-it-and-see-who-shouts-at-me route.

2008-12-03 13:05:31

by Hugh Dickins

Subject: Re: [PATCH 7/8] badpage: ratelimit print_bad_pte and bad_page

On Tue, 2 Dec 2008, Andrew Morton wrote:
> On Mon, 1 Dec 2008 00:46:53 +0000 (GMT)
> Hugh Dickins <[email protected]> wrote:
> > + /*
> > + * Allow a burst of 60 reports, then keep quiet for that minute;
> > + * or allow a steady drip of one report per second.
> > + */
> > + if (nr_shown == 60) {
> > + if (time_before(jiffies, resume)) {
> > + nr_unshown++;
> > + goto out;
> > + }
> > + if (nr_unshown) {
> > + printk(KERN_EMERG
> > + "Bad page state: %lu messages suppressed\n",
> > + nr_unshown);
> > + nr_unshown = 0;
> > + }
> > + nr_shown = 0;
> > + }
> > + if (nr_shown++ == 0)
> > + resume = jiffies + 60 * HZ;
> > +
>
> gee, that's pretty elaborate. There's no way of using the
> possibly-enhanced ratelimit.h?

Thanks a lot for the pointer: I'd browsed around kernel/printk.c and
not found what I needed, hadn't realized there's a lib/ratelimit.c.

It looks eerily like what I'm trying to do, just a less specific
missed/suppressed message, never mind that. I'll try making a patch
later to replace this (in its subsequent KERN_ALERT form) by that -
in doing so, perhaps I'll encounter a problem, but should be good.

Hugh