2008-07-18 20:35:28

by Russ Anderson

Subject: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

[PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

Version 7 changes: resubmitting for 2.6.27
page.discard.v7:
- Add pagemask macros in page-flags.h (per Christoph
Lameter's request).
- refreshed for linux-next

cpe.migrate.v7:
- refreshed for linux-next

page.cleanup
- Accepted by Linus.

Version 6 changes:
page.cleanup.v6:
- Fix cut-n-paste comment (per Linus Torvalds's request).

page.discard.v6:
- Fixed a problem where a page that failed to migrate would
end up with an extra reference count (per Christoph Lameter's
request).
- Fixed comments (per Christoph Lameter's request).
- Moved totalbad_pages definition from mm/migrate.c to
arch/ia64/kernel/mca.c (per Christoph Lameter's request).

cpe.migrate.v6:
- Move totalbad_pages from mm/migrate.c to ia64/kernel/mca.c.
- Replace tab with space (per Christoph Lameter's request).


Version 5 changes:
page.cleanup.v5:
- Change names to reflect the use and add comments to explain
the meaning (per Linus Torvalds's request).

Version 4 changes:
page.discard.v4:
- Remove the hot path checks in lru_cache_add() and
lru_cache_add_active(). Avoid moving the bad page to
the LRU in unmap_and_move() and putback_lru_pages()
(per Linus Torvalds's request).

cpe.migrate.v4
- More code style cleanup (per Andrew Morton's request).
- Removed locking when calling the migration code
(per Christoph Lameter's request).
- If the page fails to migrate, clear the PG_memerror flag.
This avoids a page with PG_memerror on the free list.

Version 3 changes:
page.cleanup.v3:
- Put PAGE_FLAGS definitions back in page-flags.h
(per Christoph Lameter's request).

cpe.migrate.v3
- Use putback_lru_pages() when returning an individual page
(per Christoph Lameter's request).
- Code style cleanup
(per Pekka Enberg's request).
- Use strict_strtoul()
(per Pekka Enberg's request).
- Added locking
(per Pekka Enberg's request).
- Use /sys/kernel/ instead of /proc
(per Pekka Enberg's request).

Version 2 changes:

Broke the page.discard patch into two patches, per request by
Christoph Lameter.

page.cleanup.v2:
- minor clean-up of page flags in page_alloc.c.

page.discard.v2:
- Updated for recent page flag clean-up.
- Removed the change to the sysinfo struct.

cpe.migrate.v2
- Added /proc/badram interface to print page discard
information and to free bad pages.

Purpose:

Physical memory with corrected errors may decay over time into
uncorrectable errors. The purpose of this patch is to move the
data off pages with correctable memory errors before the memory
goes bad.

The patches:

[1/3] page.cleanup.v6: Minor clean-up of page flags in mm/page_alloc.c

Minor source code clean-up of page flags in mm/page_alloc.c.
The cleanup makes it easier for the next patch to add PG_memerror.

[2/3] page.discard.v6: Avoid putting a bad page back on the LRU.

page.discard.v6 contains the arch-independent changes. It adds a new
page flag (PG_memerror) to mark the page as bad and avoids putting
the page back on the LRU after migrating the data to a new page.
The reference count on the bad page is not decremented to zero, to
avoid it being reallocated. PG_memerror is only defined if
CONFIG_PAGEFLAGS_EXTENDED is defined.
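
To illustrate the idea (a sketch only, not the actual patch; it assumes
the PageMemError()/SetPageMemError() accessors that the new flag's
PAGEFLAG macros would generate), the LRU putback path simply refuses
to take a page that is marked bad:

/* Sketch, as it might look inside mm/migrate.c: keep a PG_memerror page
 * off the LRU and keep its last reference, so it can never be
 * reallocated.  move_to_lru() is the 2.6.26-era helper that puts a page
 * back on the LRU and drops the reference. */
static void move_to_lru_unless_bad(struct page *page)
{
	if (PageMemError(page))
		return;
	move_to_lru(page);
}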

[3/3] cpe.migrate.v6: Call migration code on correctable errors

cpe.migrate.v6 contains the IA64-specific changes. It connects the CPE
handler to the page migration code. It is implemented as a kernel
loadable module, similar to the mca recovery code (mca_recovery.ko),
so that it can be removed to turn off the feature. It creates
/sys/kernel/badram to print page discard information and to free
bad pages.

Comments:

There is always an issue of how aggressive the code should be about
migrating pages. Should it migrate on the first correctable error,
or wait for some threshold? Reasonable people may disagree on the
threshold, and the "right" answer may be hardware specific. The
decision making is confined to the cpe_migrate.c code and can be
built as a kernel loadable module. It is currently set to migrate
on the first correctable error.
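
For example, a less aggressive module could keep a per-pfn count of
corrected errors and only migrate once a threshold is crossed. A rough
sketch of such a policy (hypothetical, not part of these patches;
locking is omitted for brevity):

#include <linux/list.h>
#include <linux/slab.h>

#define CPE_MIGRATE_THRESHOLD	4	/* hypothetical, hardware specific */

struct cpe_count {
	unsigned long		pfn;
	unsigned int		hits;
	struct list_head	list;
};

static LIST_HEAD(cpe_counts);

/* Return nonzero once this pfn has seen CPE_MIGRATE_THRESHOLD errors. */
static int cpe_should_migrate(unsigned long pfn)
{
	struct cpe_count *c;

	list_for_each_entry(c, &cpe_counts, list)
		if (c->pfn == pfn)
			return ++c->hits >= CPE_MIGRATE_THRESHOLD;

	c = kzalloc(sizeof(*c), GFP_ATOMIC);
	if (!c)
		return 0;
	c->pfn = pfn;
	c->hits = 1;
	list_add(&c->list, &cpe_counts);
	return c->hits >= CPE_MIGRATE_THRESHOLD;
}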

Only pages that can be isolated on the LRU are migrated. Other
pages, such as compound pages, are not migrated. That functionality
could be a future enhancement.

/sys/kernel/badram is a way of displaying information about the bad
memory and freeing the bad pages. A userspace program (or sysadmin)
could determine if a discarded page needs to be freed.

Sample output:

linux> insmod cpe_migrate.ko
linux> cat /sys/kernel/badram // This shows no discarded memory
Bad RAM: 0 kB, 0 pages marked bad
List of bad physical pages

linux> ./errsingle -c 6 -s 1 // Inject correctable errors on
// six pages.
linux> cat /sys/kernel/badram
Bad RAM: 384 kB, 6 pages marked bad
List of bad physical pages
0x06048e10000 0x06870c40000 0x06870c20000 0x06870c10000 0x06007f00000
0x06042070000

linux> echo 0x06870c20000 > /sys/kernel/badram // Free one of the pages

linux> cat /sys/kernel/badram // Five pages remain on the list
Bad RAM: 320 kB, 5 pages marked bad
List of bad physical pages
0x06048e10000 0x06870c40000 0x06870c10000 0x06007f00000 0x06042070000

linux> echo 0 > /sys/kernel/badram // Free all the bad pages
linux> cat /sys/kernel/badram // All the pages are freed
Bad RAM: 0 kB, 0 pages marked bad
List of bad physical pages
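
The sysfs plumbing behind /sys/kernel/badram could look roughly like
the sketch below (2.6.26-era kobj_attribute interface; the list handling
is simplified and the names are mine, not necessarily the patch's):

#include <linux/kernel.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>

static unsigned long bad_pages;		/* pages currently marked bad */

static ssize_t badram_show(struct kobject *kobj, struct kobj_attribute *attr,
			   char *buf)
{
	/* The real module also prints the list of bad physical addresses. */
	return snprintf(buf, PAGE_SIZE,
			"Bad RAM: %lu kB, %lu pages marked bad\n",
			bad_pages << (PAGE_SHIFT - 10), bad_pages);
}

static ssize_t badram_store(struct kobject *kobj, struct kobj_attribute *attr,
			    const char *buf, size_t count)
{
	unsigned long paddr;

	if (strict_strtoul(buf, 0, &paddr))
		return -EINVAL;
	if (paddr == 0)
		bad_pages = 0;	/* free every marked page */
	/* Otherwise the real module walks its badpagelist and frees the
	 * page at that physical address. */
	return count;
}

static struct kobj_attribute badram_attr =
	__ATTR(badram, 0644, badram_show, badram_store);

/* At module init:  sysfs_create_file(kernel_kobj, &badram_attr.attr);  */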



Flow of the code (as tested on IA64):

1) A user level application test program allocates memory and
passes the virtual address of the memory to the error injection
driver.

2) The error injection driver converts the virtual address to
physical, programs the Altix hardware to modify the ECC for the
physical page (creating a correctable error), and returns to the
user application.

3) The user application reads the memory.

4) The Altix hardware detects the correctable error and interrupts
the PROM. SAL builds a CPE error record, then sends a CPE
interrupt to Linux.

5) The Linux CPE handler calls the cpe_migrate module (if installed).

6) cpe_migrate parses the physical address from the CPE record,
adds the address to the migrate list (if not already on the list),
and schedules the worker thread (cpe_enable_work).

7) cpe_enable_work calls ia64_mca_cpe_move_page.

8) ia64_mca_cpe_move_page validates the physical address, converts
it to a page, sets the PG_memerror flag, and calls the migration code
(migrate_prep(), isolate_lru_page(), and migrate_pages()); see the
sketch after this list. If the page migrates successfully, the bad
page is added to badpagelist.

9) Because PG_memerror is set, the bad page is not added back to the LRU
(calls to move_to_lru() are skipped). Skipping move_to_lru() also
prevents the page count from being decremented to zero.

10) If the page fails to migrate, PG_memerror is cleared and the page
is returned to the LRU. If another correctable error occurs on the
page, another attempt will be made to migrate it.
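
A rough sketch of step 8's call sequence, using the 2.6.26-era migration
API (isolate_lru_page(page, &list) and migrate_pages(&list, alloc, 0));
error handling is simplified and new_bad_target() is a hypothetical
allocation callback, so treat this as an outline rather than the patch:

#include <linux/mm.h>
#include <linux/migrate.h>

static struct page *new_bad_target(struct page *p, unsigned long private,
				   int **result)
{
	return alloc_page(GFP_HIGHUSER_MOVABLE);
}

static int cpe_move_page_sketch(unsigned long paddr)
{
	unsigned long pfn = paddr >> PAGE_SHIFT;
	LIST_HEAD(pagelist);
	struct page *page;

	if (!pfn_valid(pfn))
		return -EINVAL;
	page = pfn_to_page(pfn);

	SetPageMemError(page);			/* mark it bad before migrating */
	migrate_prep();
	if (isolate_lru_page(page, &pagelist)) {
		ClearPageMemError(page);	/* not isolatable, give up */
		return -EBUSY;
	}
	/* 0 means all pages migrated; the bad page then stays off the LRU. */
	return migrate_pages(&pagelist, new_bad_target, 0);
}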

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]


2008-07-19 10:38:23

by Andi Kleen

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

Russ Anderson <[email protected]> writes:

> [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

FWIW I discussed this with some hardware people and the general
opinion was that it was way too aggressive to disable a page on the
first corrected error like this patchkit currently does.

The corrected bit error could be caused by a temporary condition
e.g. in the DIMM link, and does not necessarily mean that part of the
DIMM is really going bad. Permanently disabling would only be
justified if you saw repeated corrected errors over a long time from
the same DIMM.

There are also some potential scenarios where being so aggressive
could hurt, e.g. if you have a low rate of random corrected events
spread randomly all over your memory (e.g. with a flakey DIMM
connection), then after a long enough uptime you could lose significant
parts of your memory even though the DIMM is actually still ok.

There is also the other issue that if the DIMM is going bad, it likely
affects larger areas than just the lines making up this page. So you
would still risk uncorrected errors anyway, because disabling
the page would only cover a small subset of the affected area.

If you really wanted to do this you probably should hook it up
to mcelog's (or the IA64 equivalent) DIMM database and then
control it from user space with suitable large thresholds
and DIMM specific knowledge. But it's unlikely it can be really
done nicely in a way that is isolated from very specific
knowledge about the underlying memory configuration.

-Andi

2008-07-19 12:13:42

by Matthew Wilcox

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
> Russ Anderson <[email protected]> writes:
>
> > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
>
> FWIW I discussed this with some hardware people and the general
> opinion was that it was way too aggressive to disable a page on the
> first corrected error like this patchkit currently does.

I think it's reasonable to take a page out of service on the first error.
Then a user program needs to be notified of which bit is suspected.
It can then subject that page to an intense set of tests (I'd start
by stealing the ones from memtest86+) and if no more errors are found,
it could return the page to service.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-07-19 15:07:03

by Andi Kleen

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

Matthew Wilcox <[email protected]> writes:

> On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
>> Russ Anderson <[email protected]> writes:
>>
>> > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
>>
>> FWIW I discussed this with some hardware people and the general
>> opinion was that it was way too aggressive to disable a page on the
>> first corrected error like this patchkit currently does.
>
> I think it's reasonable to take a page out of service on the first error.
> Then a user program needs to be notified of which bit is suspected.
> It can then subject that page to an intense set of tests (I'd start
> by stealing the ones from memtest86+) and if no more errors are found,
> it could return the page to service.

That would only really help if only parts of that specific page
are corrupted. But my understanding is that DIMM failures usually
cluster in larger units (channels, DIMMs, memory chips on them, banks
inside the chips, etc.), all far larger than a 4K page.

So to implement your proposal you would need to do this on units of
whole DIMMs, or at least their pages; otherwise it is somewhat
pointless. Since memory systems typically interleave, this would
likely need to be done on multiple DIMMs, potentially covering a large
memory area.

In the end you'll end up with most of the mess of memory hot unplug,
because the more memory is affected, the more likely it is that
some unmovable kernel data is affected.

-Andi

2008-07-20 17:39:26

by Russ Anderson

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
> Russ Anderson <[email protected]> writes:
>
> > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
>
> FWIW I discussed this with some hardware people and the general
> opinion was that it was way too aggressive to disable a page on the
> first corrected error like this patchkit currently does.

Part of the "fun" of memory error decision making is that memory hardware
can fail in different ways based on design, manufacturing process, running
conditions (e.g. temperature), etc. So the right answer for one type
of memory hardware may be the wrong answer for another type. That is
why the decision-making part of the migration code is implemented
as a kernel loadable module. That way distros/vendors can use
a module appropriate for the specific hardware.

The patch has a module for IA64, based on experience on IA64 hardware.
It is a first step, to get the basic functionality in the kernel.
The module can be enhanced for different failure modes and hardware
types.

Note also the functionality to return pages that have been marked
bad. This allows the pages to be freed if the module is too aggressive.

> The corrected bit error could be caused by a temporary condition
> e.g. in the DIMM link, and does not necessarily mean that part of the
> DIMM is really going bad. Permanently disabling would only be
> justified if you saw repeated corrected errors over a long time from
> the same DIMM.

That is true in some cases. We have extensive experience with Altix
hardware where corrected errors quickly degrade to uncorrected errors.

> There are also some potential scenarios where being so aggressive
> could hurt, e.g. if you have a low rate of random corrected events
> spread randomly all over your memory (e.g. with a flakey DIMM
> connection) after a long enough uptime you could lose significant parts
> of your memory even though the DIMM is actually still ok.

That is a function of system size. The fewer DIMMs in the system, the
greater an issue that could be. Altix systems tend to have many DIMMs
(~20,000 in one customer system). So disabling the memory on a
DIMM with a flaky connector is a small percentage of overall memory.
On a large NUMA machine the flaky DIMM connector would only affect
memory on one node.

> Also the other issue that if the DIMM is going bad then it's likely
> larger areas than just the lines making up this page. So you
> would still risk uncorrected errors anyways because disabling
> the page would only cover a small subset of the affected area.

Sure. A common failure mode is that a row/column on a DRAM goes
bad, which affects a range of addresses. I have a DIMM on one
of my test machines which behaves that way. It was valuable for
testing the code because several megabytes' worth of pages get migrated.
It is a good stress test for the migration code.

A good enhancement would be to migrate all the data off a DRAM and/or
DIMM when a threshold is exceeded. That would take knowledge of how
the physical memory layout maps onto the DRAMs/DIMMs.

> If you really wanted to do this you probably should hook it up
> to mcelog's (or the IA64 equivalent) DIMM database

Is there an IA64 equivalent? I've looked at the x86_64 mcelog,
but have not found an IA64 version.

> and then
> control it from user space with suitable large thresholds
> and DIMM specific knowledge. But it's unlikely it can be really
> done nicely in a way that is isolated from very specific
> knowledge about the underlying memory configuration.

Agreed. An interface to export the physical memory configuration
(from ACPI tables?) would be useful.

Thanks,
--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2008-07-20 17:50:19

by Russ Anderson

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Sat, Jul 19, 2008 at 06:13:28AM -0600, Matthew Wilcox wrote:
> On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
> > Russ Anderson <[email protected]> writes:
> >
> > > [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)
> >
> > FWIW I discussed this with some hardware people and the general
> > opinion was that it was way too aggressive to disable a page on the
> > first corrected error like this patchkit currently does.
>
> I think it's reasonable to take a page out of service on the first error.
> Then a user program needs to be notified of which bit is suspected.
> It can then subject that page to an intense set of tests (I'd start
> by stealing the ones from memtest86+) and if no more errors are found,
> it could return the page to service.

In general I agree with that approach. One concern is that in the
process of testing the memory the diagnostic may hit an uncorrectable
error. That is not a problem with Itanium, which is designed to handle
uncorrected/poisoned data going into and out of the processor core, but
can be a system fatal error (requiring a reboot) on other processor types.
Just something to be aware of.

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2008-07-21 19:15:52

by Alex Williamson

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Sun, 2008-07-20 at 12:39 -0500, Russ Anderson wrote:
> On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
> > If you really wanted to do this you probably should hook it up
> > to mcelog's (or the IA64 equivalent) DIMM database
>
> Is there an IA64 equivalent? I've looked at the x86_64 mcelog,
> but have not found a IA64 version.

There's a bit in the SAL error record that can tell you when the
platform thinks the page should be deallocated. In the section header
(B2.2), the ERROR_RECOVERY_INFO field has bit 3, "Error threshold
exceeded". If you use this bit, then it's a platform decision. If you
want pages to be deallocated on the first hit, then have your SAL
always set that bit. I believe HP systems do implement this bit in SAL
using some kind of heuristics.

Alex

--
Alex Williamson HP Open Source & Linux Org.

2008-07-21 19:40:28

by Andi Kleen

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Sun, Jul 20, 2008 at 12:39:14PM -0500, Russ Anderson wrote:
> The patch has a module for IA64, based on experience on IA64 hardware.
> It is a first step, to get the basic functionality in the kernel.

The basic functionality doesn't seem flexible enough to me
for useful policies.

> (~20,000 in one customer system). So disabling the memory on a
> DIMM with a flaky connector is a small percentage of overall memory.
> On a large NUMA machine the flaky DIMM connector would only effect
> memory on one node.

You would still lose significant parts of that node, wouldn't you?
Even on the large systems people might miss a node or two.

> A good enhancement would be to migrate all the data off a DRAM and/or
> DIMM when a threshold is exceeded. That would take knowledge of the
> physical memory to memory map layout.

It would probably be difficult to teach the kernel this in a nice,
generic way. In particular, interleaving is difficult.

> > If you really wanted to do this you probably should hook it up
> > to mcelog's (or the IA64 equivalent) DIMM database
>
> Is there an IA64 equivalent? I've looked at the x86_64 mcelog,
> but have not found a IA64 version.

There's a SAL logger process in user space, I believe, but I have never
looked at it. It could do these things in theory.
Also, in the IA64 case the firmware can actually tell the kernel
what to do, because it gets involved here (and firmware often
has usable heuristics for this case).

> > and DIMM specific knowledge. But it's unlikely it can be really
> > done nicely in a way that is isolated from very specific
> > knowledge about the underlying memory configuration.
>
> Agreed. An interface to export the physical memory configuration
> (from ACPI tables?) would be useful.

On x86 there's currently only DMI/SMBIOS for this, but it has some issues.

-Andi

2008-07-21 19:45:56

by Russ Anderson

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Mon, Jul 21, 2008 at 01:11:39PM -0600, Alex Williamson wrote:
> On Sun, 2008-07-20 at 12:39 -0500, Russ Anderson wrote:
> > On Sat, Jul 19, 2008 at 12:37:11PM +0200, Andi Kleen wrote:
> > > If you really wanted to do this you probably should hook it up
> > > to mcelog's (or the IA64 equivalent) DIMM database
> >
> > Is there an IA64 equivalent? I've looked at the x86_64 mcelog,
> > but have not found a IA64 version.
>
> There's a bit in the SAL error record that can tell you when the
> platform thinks the page should be deallocated. In the section header
> (B2.2), ERROR_RECOVERY_INFO, bit 3 "Error threshold exceeded". If you
> use this bit, then it's a platform decision. If you want pages to be
> deallocated on the first hit, then have your SAL always set that bit. I
> believe HP systems do implement this bit in SAL using some kind of
> heuristics.

Good point. Linux does not have that field defined.

I'll submit a real patch to Tony shortly.
-------------------------------------------------
---
 include/asm-ia64/sal.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Index: linus/include/asm-ia64/sal.h
===================================================================
--- linus.orig/include/asm-ia64/sal.h	2008-07-18 11:32:02.000000000 -0500
+++ linus/include/asm-ia64/sal.h	2008-07-21 14:40:47.142922279 -0500
@@ -341,7 +341,8 @@ typedef struct sal_log_record_header {
 typedef struct sal_log_sec_header {
 	efi_guid_t guid;		/* Unique Section ID */
 	sal_log_revision_t revision;	/* Major and Minor revision of Section */
-	u16 reserved;
+	u8 error_recovery_info;		/* Platform error recovery status */
+	u8 reserved;
 	u32 len;			/* Section length */
 } sal_log_section_hdr_t;
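
With that field defined, a policy module could defer to the platform's
own threshold decision. A hypothetical check of the bit Alex describes
(the macro name is made up; bit 3 is "Error threshold exceeded" per the
SAL spec section header, B2.2):

/* sal_log_section_hdr_t comes from <asm/sal.h>, with the field added
 * by the patch above.  Sketch only, not part of the posted patches. */
#define SAL_SEC_ERR_THRESHOLD_EXCEEDED	(1 << 3)

static int platform_says_discard(sal_log_section_hdr_t *sh)
{
	return (sh->error_recovery_info & SAL_SEC_ERR_THRESHOLD_EXCEEDED) != 0;
}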


--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2008-07-28 21:44:45

by Russ Anderson

Subject: Re: [PATCH 0/2] Migrate data off physical pages with corrected memory errors (Version 7)

On Mon, Jul 21, 2008 at 09:40:00PM +0200, Andi Kleen wrote:
> On Sun, Jul 20, 2008 at 12:39:14PM -0500, Russ Anderson wrote:
> > The patch has a module for IA64, based on experience on IA64 hardware.
> > It is a first step, to get the basic functionality in the kernel.
>
> The basic functionality doesn't seem flexible enough for me
> for useful policies.

To make sure I understand, it is the decision making functionality
in the kernel loadable module you find not flexible enough, not
the migration code (in mm/migrate.c), correct? I knew the decision
making part would be the most controversial. That's why it's implemented
as a kernel loadable module. I'm not opposed to more flexibility,
but I'm also trying to get some functionality in.

> > (~20,000 in one customer system). So disabling the memory on a
> > DIMM with a flaky connector is a small percentage of overall memory.
> > On a large NUMA machine the flaky DIMM connector would only effect
> > memory on one node.
>
> You would still lose significant parts of that node, won't you?

The amount of memory loss would depend on the number of DIMMs on
the node. It is not unusual to have 4-6 DIMM pairs.

> Even on the large systems people might miss a node or two.

Not the entire node, just a percentage of memory on the node.

Customers vary, but in the case of a flaky connector causing
correctable errors, most of the customers I've worked with would
not want to continue hitting corrected errors, out of fear that they
could become uncorrectable errors, even if that means disabling
innocent memory to reduce the risk of crashing.

> > A good enhancement would be to migrate all the data off a DRAM and/or
> > DIMM when a threshold is exceeded. That would take knowledge of the
> > physical memory to memory map layout.
>
> Would be probably difficult to teach this the kernel in a nice generic
> way. In particular interleaving is difficult.

Sure, especially given the differences in the various archs. If limited
to just x86 (or x86_64), would it be less difficult? Just trying to find
a way of making forward progress.

> > > If you really wanted to do this you probably should hook it up
> > > to mcelog's (or the IA64 equivalent) DIMM database
> >
> > Is there an IA64 equivalent? I've looked at the x86_64 mcelog,
> > but have not found a IA64 version.
>
> There's a sal logger process in user space I believe, but I have never looked
> at it. It could do these things in theory.

Do you mean salinfo_decode? salinfo_decode reads & logs error records.
I guess it could be modified to be more intelligent.

> Also in the IA64 case the firmware can actually tell the kernel
> what to do because it gets involved here (and firmware often
> has usable heuristics for this case)

I'm looking at that.

> > > and DIMM specific knowledge. But it's unlikely it can be really
> > > done nicely in a way that is isolated from very specific
> > > knowledge about the underlying memory configuration.
> >
> > Agreed. An interface to export the physical memory configuration
> > (from ACPI tables?) would be useful.
>
> On x86 there's currently only DMI/SMBIOS for this, but it has some issues.

What would be the best way to export the physical memory configuration?
Enhance DMI/SMBIOS? ACPI table? Other?

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]