2009-04-08 00:11:47

by Russ Anderson

Subject: [PATCH 1/2] Avoid putting a bad page back on the LRU

Prevent a page with a physical memory error from being placed back
on the LRU. This patch applies on top of Andi Kleen's POISON
patchset.


Signed-off-by: Russ Anderson <[email protected]>

---
include/linux/page-flags.h | 8 +++++++-
mm/migrate.c | 39 ++++++++++++++++++++++++++++++++++++++-
2 files changed, 45 insertions(+), 2 deletions(-)

Index: linux-next/mm/migrate.c
===================================================================
--- linux-next.orig/mm/migrate.c 2009-04-07 18:32:12.781949840 -0500
+++ linux-next/mm/migrate.c 2009-04-07 18:34:19.169736260 -0500
@@ -72,6 +72,7 @@ int putback_lru_pages(struct list_head *
}
return count;
}
+EXPORT_SYMBOL(isolate_lru_page);

/*
* Restore a potential migration pte to a working pte entry
@@ -139,6 +140,7 @@ static void remove_migration_pte(struct
out:
pte_unmap_unlock(ptep, ptl);
}
+EXPORT_SYMBOL(migrate_prep);

/*
* Note that remove_file_migration_ptes will only work on regular mappings,
@@ -161,6 +163,7 @@ static void remove_file_migration_ptes(s

spin_unlock(&mapping->i_mmap_lock);
}
+EXPORT_SYMBOL(putback_lru_pages);

/*
* Must hold mmap_sem lock on at least one of the vmas containing
@@ -693,6 +696,26 @@ unlock:
* restored.
*/
list_del(&page->lru);
+#ifdef CONFIG_MEMORY_FAILURE
+ if (PagePoison(page)) {
+ if (rc == 0)
+ /*
+ * A page with a memory error that has
+ * been migrated will not be moved to
+ * the LRU.
+ */
+ goto move_newpage;
+ else
+ /*
+ * The page failed to migrate and will not
+ * be added to the bad page list. Clearing
+ * the error bit will allow another attempt
+ * to migrate if it gets another correctable
+ * error.
+ */
+ ClearPagePoison(page);
+ }
+#endif
putback_lru_page(page);
}

@@ -736,7 +759,7 @@ int migrate_pages(struct list_head *from
struct page *page;
struct page *page2;
int swapwrite = current->flags & PF_SWAPWRITE;
- int rc;
+ int rc = 0;

if (!swapwrite)
current->flags |= PF_SWAPWRITE;
@@ -765,6 +788,19 @@ int migrate_pages(struct list_head *from
}
}
}
+
+#ifdef CONFIG_MEMORY_FAILURE
+ if (rc != 0)
+ list_for_each_entry_safe(page, page2, from, lru)
+ if (PagePoison(page))
+ /*
+ * The page failed to migrate. Clearing
+ * the error bit will allow another attempt
+ * to migrate if it gets another correctable
+ * error.
+ */
+ ClearPagePoison(page);
+#endif
rc = 0;
out:
if (!swapwrite)
@@ -777,6 +813,7 @@ out:

return nr_failed + retry;
}
+EXPORT_SYMBOL(migrate_pages);

#ifdef CONFIG_NUMA
/*
Index: linux-next/include/linux/page-flags.h
===================================================================
--- linux-next.orig/include/linux/page-flags.h 2009-04-07 18:32:12.789950956 -0500
+++ linux-next/include/linux/page-flags.h 2009-04-07 18:34:19.197737925 -0500
@@ -169,15 +169,21 @@ static inline int TestSetPage##uname(str
static inline int TestClearPage##uname(struct page *page) \
{ return test_and_clear_bit(PG_##lname, &page->flags); }

+#define PAGEFLAGMASK(uname, lname) \
+static inline int PAGEMASK_##uname(void) \
+ { return (1 << PG_##lname); }

#define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
- SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)
+ SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname) \
+ PAGEFLAGMASK(uname, lname)

#define __PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname) \
__SETPAGEFLAG(uname, lname) __CLEARPAGEFLAG(uname, lname)

#define PAGEFLAG_FALSE(uname) \
static inline int Page##uname(struct page *page) \
+ { return 0; } \
+static inline int PAGEMASK_##uname(void) \
{ return 0; }

#define TESTSCFLAG(uname, lname) \
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
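
Editor's note: the page-flags.h hunk above adds a PAGEFLAGMASK() generator, so that every flag declared through PAGEFLAG() also gets a PAGEMASK_<name>() helper returning its bit mask. The following is a minimal user-space sketch of that macro pattern, not the kernel code itself. The struct page, the enum values, and the use of 1UL with an unsigned long return type are assumptions made for illustration; the patch as posted returns int and shifts a plain 1.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel definitions (assumed values). */
enum pageflags { PG_locked, PG_error, PG_poison };

struct page {
	unsigned long flags;
};

/* Generates Page<uname>(): tests whether the flag bit is set. */
#define TESTPAGEFLAG(uname, lname)					\
static inline int Page##uname(const struct page *page)			\
	{ return (int)((page->flags >> PG_##lname) & 1UL); }

/* Generates SetPage<uname>(): sets the flag bit (non-atomic here). */
#define SETPAGEFLAG(uname, lname)					\
static inline void SetPage##uname(struct page *page)			\
	{ page->flags |= 1UL << PG_##lname; }

/* Generates ClearPage<uname>(): clears the flag bit (non-atomic here). */
#define CLEARPAGEFLAG(uname, lname)					\
static inline void ClearPage##uname(struct page *page)			\
	{ page->flags &= ~(1UL << PG_##lname); }

/* Generates PAGEMASK_<uname>(): returns the mask for the flag bit. */
#define PAGEFLAGMASK(uname, lname)					\
static inline unsigned long PAGEMASK_##uname(void)			\
	{ return 1UL << PG_##lname; }

/* One declaration now yields the test, set, clear and mask helpers. */
#define PAGEFLAG(uname, lname) TESTPAGEFLAG(uname, lname)		\
	SETPAGEFLAG(uname, lname) CLEARPAGEFLAG(uname, lname)		\
	PAGEFLAGMASK(uname, lname)

PAGEFLAG(Poison, poison)
```

With this in place, SetPagePoison(&p) on a zeroed page leaves p.flags equal to PAGEMASK_Poison(). Shifting 1UL rather than 1 keeps the mask well defined for bit positions at or above 31 on 64-bit builds, which is the reason for that deviation in the sketch.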


2009-04-08 03:42:38

by Ingo Oeser

Subject: Re: [PATCH 1/2] Avoid putting a bad page back on the LRU

Hi Russ,

On Wednesday 08 April 2009, Russ Anderson wrote:
> --- linux-next.orig/mm/migrate.c 2009-04-07 18:32:12.781949840 -0500
> +++ linux-next/mm/migrate.c 2009-04-07 18:34:19.169736260 -0500
> @@ -693,6 +696,26 @@ unlock:
> * restored.
> */
> list_del(&page->lru);
> +#ifdef CONFIG_MEMORY_FAILURE
> + if (PagePoison(page)) {
> + if (rc == 0)
> + /*
> + * A page with a memory error that has
> + * been migrated will not be moved to
> + * the LRU.
> + */
> + goto move_newpage;
> + else
> + /*
> + * The page failed to migrate and will not
> + * be added to the bad page list. Clearing
> + * the error bit will allow another attempt
> + * to migrate if it gets another correctable
> + * error.
> + */
> + ClearPagePoison(page);

Clearing the flag doesn't change the fact that this page represents
permanently bad RAM.

What about removing it from the LRU and adding it to a bad RAM list in every case?
After hot swapping the physical RAM banks it could be moved back, not before.


Best Regards

Ingo Oeser

2009-04-08 06:46:57

by Andi Kleen

Subject: Re: [PATCH 1/2] Avoid putting a bad page back on the LRU

Ingo Oeser <[email protected]> writes:
>
> Clearing the flag doesn't change the fact that this page represents
> permanently bad RAM.

Yes, you cannot ever clear a Poison flag, at least not without a special
hardware mechanism that clears the hardware poison too (but that has
other issues in Linux too). Otherwise you would die later.

> What about removing it from the LRU and adding it to a bad RAM list in every case?

That is what memory_failure() already should be doing. Except there's no list
currently.

> After hot swapping the physical RAM banks it could be moved back, not before.

Linux doesn't really support that, at least not when it's OS visible.

-Andi
--
[email protected] -- Speaking for myself only.
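
Editor's note: Andi's point is that a hardware-poisoned page is a one-way state: once marked, it must stay off the LRU and the mark must not be cleared by software alone. A minimal user-space model of that rule is sketched below. The struct and function names (model_page, model_memory_failure, model_putback) are hypothetical and do not correspond to kernel interfaces; the real memory_failure() does considerably more.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page; not the kernel's struct page. */
struct model_page {
	bool hw_poison;	/* set once, never cleared in this model */
	bool on_lru;
};

/*
 * Model of handling an uncorrectable error: mark the page poisoned
 * and take it off the LRU. No inverse operation is provided, which
 * mirrors the rule that the Poison flag cannot be cleared.
 */
static inline void model_memory_failure(struct model_page *p)
{
	p->hw_poison = true;
	p->on_lru = false;
}

/*
 * Model of a putback path: refuses to return a poisoned page to the
 * LRU. Returns true if the page was placed back on the LRU.
 */
static inline bool model_putback(struct model_page *p)
{
	if (p->hw_poison)
		return false;
	p->on_lru = true;
	return true;
}
```

In this model a healthy isolated page goes back on the LRU through model_putback(), while a page that has passed through model_memory_failure() is refused on every later attempt.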

2009-04-08 13:31:31

by Russ Anderson

Subject: Re: [PATCH 1/2] Avoid putting a bad page back on the LRU

On Wed, Apr 08, 2009 at 05:43:15AM +0200, Ingo Oeser wrote:
> Hi Russ,
>
> On Wednesday 08 April 2009, Russ Anderson wrote:
> > --- linux-next.orig/mm/migrate.c 2009-04-07 18:32:12.781949840 -0500
> > +++ linux-next/mm/migrate.c 2009-04-07 18:34:19.169736260 -0500
> > @@ -693,6 +696,26 @@ unlock:
> > * restored.
> > */
> > list_del(&page->lru);
> > +#ifdef CONFIG_MEMORY_FAILURE
> > + if (PagePoison(page)) {
> > + if (rc == 0)
> > + /*
> > + * A page with a memory error that has
> > + * been migrated will not be moved to
> > + * the LRU.
> > + */
> > + goto move_newpage;
> > + else
> > + /*
> > + * The page failed to migrate and will not
> > + * be added to the bad page list. Clearing
> > + * the error bit will allow another attempt
> > + * to migrate if it gets another correctable
> > + * error.
> > + */
> > + ClearPagePoison(page);
>
> > Clearing the flag doesn't change the fact that this page represents
> > permanently bad RAM.

Yes, but this is intended for corrected memory errors (meaning there is
an underlying RAM error, but it has not yet reached the point of losing data).

After talking with Andi, it is clear the intent of the Poison flag
(uncorrectable memory error) is different from my intent (corrected
memory error). I'll go back to using a different page flag to avoid
confusing the two issues.

> What about removing it from the LRU and adding it to a bad RAM list in every case?

That is what happens when the page migrates (the normal case). The else case
is when the page could not be migrated. My intent was to wait for the next
corrected error on that page and try migrating again.

> After hot swapping the physical RAM banks it could be moved back, not before.

As soon as the code is written. :-)

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]
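
Editor's note: the policy Russ describes for corrected errors can be modelled in a few lines. On a corrected error the page is flagged and a migration is attempted. If it succeeds, the old page stays off the LRU with the flag set. If it fails, the flag is cleared and the page goes back on the LRU, so the next corrected error triggers another attempt. This is a user-space sketch only. The corrected_error flag, the callback type, and all function names are hypothetical, and the flag is deliberately separate from Poison, following the distinction drawn in the reply above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a page; not the kernel's struct page. */
struct model_page {
	bool corrected_error;		/* assumed flag, distinct from Poison */
	bool on_lru;
	unsigned int migrate_attempts;
};

/* Migration callback: returns 0 on success, like rc in the patch. */
typedef int (*migrate_fn)(struct model_page *);

/* Handle one corrected error report for a page. */
static inline int handle_corrected_error(struct model_page *p,
					 migrate_fn try_migrate)
{
	int rc;

	p->corrected_error = true;
	p->on_lru = false;		/* isolated for migration */
	p->migrate_attempts++;

	rc = try_migrate(p);
	if (rc == 0)
		return 0;		/* migrated: keep off the LRU */

	/* Migration failed: clear the flag and put the page back. */
	p->corrected_error = false;
	p->on_lru = true;
	return rc;
}

/* Stub callbacks used to exercise both branches. */
static inline int migrate_ok(struct model_page *p)   { (void)p; return 0; }
static inline int migrate_busy(struct model_page *p) { (void)p; return -1; }
```

Calling handle_corrected_error() first with migrate_busy and then with migrate_ok walks through the retry: the page returns to the LRU after the first call and is retired after the second.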