2011-04-20 08:08:54

by Robert Whitton

Subject: Background memory scrubbing

Hi,

I have a home-grown module that performs background memory scrubbing to eliminate single-bit memory errors before they become a problem. This has been working in the 2.6.26 kernels for some time (it is specifically targeted at the AMD64 PC architecture). I have now moved to the 2.6.32 kernel and it fails with "unable to handle kernel paging request" after a couple of minutes. In summary, the code works as follows in a kernel thread...

for each PFN from 256 to the highest valid PFN
{
    if (pfn_valid(PFN))
    {
        page = pfn_to_page(PFN)
        va = kmap(page)
        atomic_scrub(va, PAGE_SIZE)
        kunmap(page)
    }

    sleep(for_a_while)
}


This code works absolutely fine up to a short distance beyond the 16MB boundary (specifically it seems to always fail on my hardware at PFN 4105). At this point despite the fact that kmap returns a valid virtual address (and it is the virtual address that I expect - 0xffff880001009000) I get the kernel oops - "unable to handle kernel paging request".
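As a sanity check on the numbers above, the PFN-to-address arithmetic can be written out. This is a minimal userspace sketch, assuming 4 KiB pages and the 2.6.32-era x86-64 direct-mapping base of 0xffff880000000000; the helper names are hypothetical. On 64-bit x86 there is no highmem, so kmap() of a page simply returns its direct-map address.

```c
#include <stdint.h>

/* Assumed values for 2.6.32-era x86-64 (not taken from the thread):
 * 4 KiB pages and the direct-mapping base (__PAGE_OFFSET). */
#define SKETCH_PAGE_SHIFT  12
#define SKETCH_PAGE_OFFSET 0xffff880000000000ULL

/* Physical address of the first byte of page frame 'pfn'. */
static uint64_t sketch_pfn_to_phys(uint64_t pfn)
{
	return pfn << SKETCH_PAGE_SHIFT;
}

/* Direct-map virtual address of that byte; on x86-64 this is what
 * kmap()/page_address() hands back for the page. */
static uint64_t sketch_pfn_to_va(uint64_t pfn)
{
	return SKETCH_PAGE_OFFSET + sketch_pfn_to_phys(pfn);
}
```

For PFN 4105 this gives a physical address of 0x1009000 (36 KiB past the 16 MiB mark) and a virtual address of 0xffff880001009000, matching the address reported above. That suggests the pointer itself is right and the problem lies in how that address is mapped.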

My immediate thought was to check the kernel page tables and avoid those pages that are marked as not present or read-only; however, it appears that init_mm and pgd_offset_k have both been deprecated. I have also looked at page->flags, but the flags for the first page that fails are exactly the same as for the previous page, which works absolutely fine, so I don't appear to be able to use page->flags to make a valid distinction.

So I'm looking for any hints on how to fix the original code, i.e. how can I sensibly detect "a priori" whether a PFN/page has a valid mapping in the kernel page tables, such that I can read/write that page via a kmap()ed virtual address? Alternatively, since init_mm and pgd_offset_k have been deprecated, how can I gain access to the kernel page tables?

Thanks in advance for any help.

Rob


(please CC me in on any responses)




2011-04-20 13:28:43

by Clemens Ladisch

Subject: Re: Background memory scrubbing

Robert Whitton wrote:
> I have a home grown module that performs background memory scrubbing
> to eliminate single bit memory errors before they become a problem.
> ... it is specifically targeted at the AMD64 PC architecture

Then why don't you use the memory controller's automatic background
memory scrubbing support? Doesn't your BIOS have this option?


Regards,
Clemens

2011-04-20 14:41:28

by Robert Whitton

Subject: Re: Background memory scrubbing

> Robert Whitton wrote:
> > I have a home grown module that performs background memory scrubbing
> > to eliminate single bit memory errors before they become a problem.
> > ... it is specifically targeted at the AMD64 PC architecture
>
> Then why don't you use the memory controller's automatic background
> memory scrubbing support? Doesn't your BIOS have this option?
>
> Regards,
> Clemens
>

Hi,

Unfortunately, in common with a large number of hardware platforms, background scrubbing isn't supported in the hardware (even though ECC error correction is supported), and thus there is no BIOS option to enable it. The software solution has always been fine and the CPU load negligible, as it's only necessary to complete one full scrub every day or so. I just need to find a way to make this work on newer Linux kernels.
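To put "one full scrub every day or so" into numbers, the delay between pages for a paced software scrub is the pass period divided by the page count. A minimal sketch; the helper name is hypothetical and the 8 GiB memory size used below is an assumption, not Rob's configuration.

```c
#include <stdint.h>

/* Hypothetical helper: seconds to sleep after each page so that one
 * full pass over mem_bytes takes roughly period_s seconds. */
static double scrub_interval_per_page(uint64_t mem_bytes,
				      uint64_t page_size,
				      double period_s)
{
	uint64_t pages = mem_bytes / page_size;

	/* Guard against an empty range rather than dividing by zero. */
	return pages ? period_s / (double)pages : 0.0;
}
```

With an assumed 8 GiB of RAM, 4 KiB pages and a one-day (86400 s) period this comes to about 0.0412 s, i.e. roughly 41 ms between pages. Scrubbing a batch of pages per wakeup and sleeping proportionally longer would cut the wakeup overhead; that is a design option, not something the original module is known to do.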

Rob

2011-04-20 15:23:43

by Clemens Ladisch

Subject: Re: Background memory scrubbing

Robert Whitton wrote:
> > Robert Whitton wrote:
> > > I have a home grown module that performs background memory scrubbing
> > > to eliminate single bit memory errors before they become a problem.
> > > ... it is specifically targeted at the AMD64 PC architecture
> >
> > Then why don't you use the memory controller's automatic background
> > memory scrubbing support? Doesn't your BIOS have this option?
>
> Unfortunately in common with a large number of hardware platforms
> background scrubbing isn't supported in the hardware (even though ECC
> error correction is supported) and thus there is no BIOS option to
> enable it.

Which hardware platform is this? AFAICT all architectures with ECC
(old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
PCI configuration register 0x58 in function 3 of the CPU's northbridge
device, see the BIOS and Kernel's Developer's Guide for details).
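Enabling it directly amounts to a read-modify-write of bits 0-4 that leaves the rest of the register alone. The sketch below only shows the bit manipulation on a 32-bit value; reading and writing the northbridge's configuration space (for example with pci_read_config_dword()/pci_write_config_dword() from a kernel driver) and choosing a rate value from the BKDG tables are left out, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Bits 0-4 of register 0x58 (function 3) hold the DRAM scrub rate
 * field; the meaning of each value is defined in the BKDG. */
#define SCRUB_DRAM_MASK 0x1Fu

/* Extract the current DRAM scrub rate field from a register value. */
static uint32_t get_dram_scrub_field(uint32_t reg)
{
	return reg & SCRUB_DRAM_MASK;
}

/* Return reg with bits 0-4 replaced by val, all other bits preserved. */
static uint32_t set_dram_scrub_field(uint32_t reg, uint32_t val)
{
	return (reg & ~SCRUB_DRAM_MASK) | (val & SCRUB_DRAM_MASK);
}
```

The amd64_edac driver's read path masks the same field with scrubval & 0x001F, as the patch later in this thread shows.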


Regards,
Clemens

2011-04-20 15:35:36

by Borislav Petkov

Subject: Re: Background memory scrubbing

On Wed, Apr 20, 2011 at 05:19:41PM +0200, Clemens Ladisch wrote:
> > Unfortunately in common with a large number of hardware platforms
> > background scrubbing isn't supported in the hardware (even though ECC
> > error correction is supported) and thus there is no BIOS option to
> > enable it.
>
> Which hardware platform is this? AFAICT all architectures with ECC
> (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
> If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
> PCI configuration register 0x58 in function 3 of the CPU's northbridge
> device, see the BIOS and Kernel's Developer's Guide for details).

Or even better, if on AMD, you can build the amd64_edac module
(CONFIG_EDAC_AMD64) and do

echo <x> > /sys/devices/system/edac/mc/mc<y>/sdram_scrub_rate

where x is the scrubbing bandwidth in bytes/sec and y is the memory
controller on the machine, i.e. node.
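A starting value for <x> can be estimated from how often a full pass is wanted: bandwidth = bytes of RAM / seconds per pass. A minimal sketch with a hypothetical helper name; the memory size and period below are assumptions, and the driver may map the request onto one of the discrete rates the hardware supports rather than use it exactly.

```c
#include <stdint.h>

/* Hypothetical helper: smallest whole bytes/sec rate that covers
 * mem_bytes in at most period_s seconds (division rounded up).
 * period_s must be non-zero. */
static uint64_t scrub_bandwidth_for_period(uint64_t mem_bytes,
					   uint64_t period_s)
{
	return (mem_bytes + period_s - 1) / period_s;
}
```

For an assumed 8 GiB node scrubbed once a day this gives 99421 bytes/sec, which would be the value to echo into sdram_scrub_rate for that node's memory controller.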

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-04-20 15:47:03

by Markus Trippelsdorf

Subject: Re: Background memory scrubbing

On 2011.04.20 at 17:35 +0200, Borislav Petkov wrote:
> On Wed, Apr 20, 2011 at 05:19:41PM +0200, Clemens Ladisch wrote:
> > > Unfortunately in common with a large number of hardware platforms
> > > background scrubbing isn't supported in the hardware (even though ECC
> > > error correction is supported) and thus there is no BIOS option to
> > > enable it.
> >
> > Which hardware platform is this? AFAICT all architectures with ECC
> > (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
> > If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
> > PCI configuration register 0x58 in function 3 of the CPU's northbridge
> > device, see the BIOS and Kernel's Developer's Guide for details).
>
> Or even better, if on AMD, you can build the amd64_edac module
> (CONFIG_EDAC_AMD64) and do
>
> echo <x> > /sys/devices/system/edac/mc/mc<y>/sdram_scrub_rate
>
> where x is the scrubbing bandwidth in bytes/sec and y is the memory
> controller on the machine, i.e. node.

BTW is it really necessary to print the following to syslog:

EDAC amd64: pci-read, sdram scrub control value: 15
EDAC MC: Read scrub rate: 97650

every time one runs:
# cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
97650

?
--
Markus

2011-04-20 15:47:11

by Robert Whitton

Subject: Re: Background memory scrubbing


> On Wed, Apr 20, 2011 at 05:19:41PM +0200, Clemens Ladisch wrote:
> > > Unfortunately in common with a large number of hardware platforms
> > > background scrubbing isn't supported in the hardware (even though ECC
> > > error correction is supported) and thus there is no BIOS option to
> > > enable it.
> >
> > Which hardware platform is this? AFAICT all architectures with ECC
> > (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
> > If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
> > PCI configuration register 0x58 in function 3 of the CPU's northbridge
> > device, see the BIOS and Kernel's Developer's Guide for details).
>
> Or even better, if on AMD, you can build the amd64_edac module
> (CONFIG_EDAC_AMD64) and do
>
> echo <x> > /sys/devices/system/edac/mc/mc<y>/sdram_scrub_rate
>
> where x is the scrubbing bandwidth in bytes/sec and y is the memory
> controller on the machine, i.e. node.
>
> --
> Regards/Gruss,
> Boris.
>

Unfortunately that also isn't an option on my platform(s). There surely must be a way for a module to be able to get a mapping for each physical page of memory in the system and to be able to use that mapping to do atomic read/writes to scrub the memory.

2011-04-20 15:58:59

by Borislav Petkov

Subject: Re: Background memory scrubbing

On Wed, Apr 20, 2011 at 05:46:58PM +0200, Markus Trippelsdorf wrote:
> BTW is it really necessary to print the following to syslog:
>
> EDAC amd64: pci-read, sdram scrub control value: 15
> EDAC MC: Read scrub rate: 97650
>
> every time one runs:
> # cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
> 97650
>
> ?

This is KERN_DEBUG since .38 (commit 24f9a7fe3f19f3fd310f556364d01a22911724b3)
and it shouldn't appear on the console if you don't change your default log level.

--
Regards/Gruss,
Boris.


2011-04-20 16:01:34

by Borislav Petkov

Subject: Re: Background memory scrubbing

On Wed, Apr 20, 2011 at 04:46:22PM +0100, Robert Whitton wrote:
>
> > On Wed, Apr 20, 2011 at 05:19:41PM +0200, Clemens Ladisch wrote:
> > > > Unfortunately in common with a large number of hardware platforms
> > > > background scrubbing isn't supported in the hardware (even though ECC
> > > > error correction is supported) and thus there is no BIOS option to
> > > > enable it.
> > >
> > > Which hardware platform is this? AFAICT all architectures with ECC
> > > (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
> > > If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
> > > PCI configuration register 0x58 in function 3 of the CPU's northbridge
> > > device, see the BIOS and Kernel's Developer's Guide for details).
> >
> > Or even better, if on AMD, you can build the amd64_edac module
> > (CONFIG_EDAC_AMD64) and do
> >
> > echo <x> > /sys/devices/system/edac/mc/mc<y>/sdram_scrub_rate
> >
> > where x is the scrubbing bandwidth in bytes/sec and y is the memory
> > controller on the machine, i.e. node.
>
> Unfortunately that also isn't an option on my platform(s). There surely must be a way for a module to be able to get a mapping for each physical page of memory in the system and to be able to use that mapping to do atomic read/writes to scrub the memory.

For such questions I've added just the right ML to Cc :).

--
Regards/Gruss,
Boris.


2011-04-20 16:45:09

by Markus Trippelsdorf

Subject: Re: Background memory scrubbing

On 2011.04.20 at 17:58 +0200, Borislav Petkov wrote:
> On Wed, Apr 20, 2011 at 05:46:58PM +0200, Markus Trippelsdorf wrote:
> > BTW is it really necessary to print the following to syslog:
> >
> > EDAC amd64: pci-read, sdram scrub control value: 15
> > EDAC MC: Read scrub rate: 97650
> >
> > every time one runs:
> > # cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
> > 97650
> >
> > ?
>
> This is KERN_DEBUG since .38 (commit 24f9a7fe3f19f3fd310f556364d01a22911724b3)
> and it shouldn't appear on the console if you don't change your default log level.

Yes. Sorry, but I was referring to dmesg and not the console.
What I mean is that maybe debugf1 or debugf2 is more appropriate than
amd64_debug?

--
Markus

2011-04-20 16:45:37

by Rik van Riel

Subject: Re: Background memory scrubbing

On 04/20/2011 03:58 AM, Robert Whitton wrote:

> for each PFN from 256 to the highest valid PFN
> {
> if (pfn_valid(PFN))
> {
> page = pfn_to_page(PFN)
> va = kmap(page)
> atomic_scrub(va, PAGE_SIZE)
> kunmap(page)
> }
>
> sleep(for_a_while)
> }

What exactly does atomic_scrub do?

> This code works absolutely fine up to a short distance beyond the 16MB boundary (specifically it seems to always fail on my hardware at PFN 4105). At this point despite the fact that kmap returns a valid virtual address (and it is the virtual address that I expect - 0xffff880001009000) I get the kernel oops - "unable to handle kernel paging request".

Looks like you might be making some of the kernel code that
is running at that moment unreachable, leading to a kernel
page fault.

2011-04-20 16:55:37

by Markus Trippelsdorf

Subject: Re: Background memory scrubbing

On 2011.04.20 at 18:45 +0200, Markus Trippelsdorf wrote:
> On 2011.04.20 at 17:58 +0200, Borislav Petkov wrote:
> > On Wed, Apr 20, 2011 at 05:46:58PM +0200, Markus Trippelsdorf wrote:
> > > BTW is it really necessary to print the following to syslog:
> > >
> > > EDAC amd64: pci-read, sdram scrub control value: 15
> > > EDAC MC: Read scrub rate: 97650
> > >
> > > every time one runs:
> > > # cat /sys/devices/system/edac/mc/mc0/sdram_scrub_rate
> > > 97650
> > >
> > > ?
> >
> > This is KERN_DEBUG since .38 (commit 24f9a7fe3f19f3fd310f556364d01a22911724b3)
> > and it shouldn't appear on the console if you don't change your default log level.
>
> Yes. Sorry, but I was referring to dmesg and not the console.
> What I mean is that maybe debugf1 or debugf2 is more appropriate than
> amd64_debug?

In other words:

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 31e71c4f..13b107e 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -211,7 +211,7 @@ static int amd64_get_scrub_rate(struct mem_ctl_info *mci)

scrubval = scrubval & 0x001F;

- amd64_debug("pci-read, sdram scrub control value: %d\n", scrubval);
+ debugf1 ("pci-read, sdram scrub control value: %d\n", scrubval);

for (i = 0; i < ARRAY_SIZE(scrubrates); i++) {
if (scrubrates[i].scrubval == scrubval) {
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 26343fd..8a34aea 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -459,7 +459,7 @@ static ssize_t mci_sdram_scrub_rate_store(struct mem_ctl_info *mci,

new_bw = mci->set_sdram_scrub_rate(mci, bandwidth);
if (new_bw >= 0) {
- edac_printk(KERN_DEBUG, EDAC_MC, "Scrub rate set to %d\n", new_bw);
+ debugf1 ("Scrub rate set to %d\n", new_bw);
return count;
}

@@ -483,7 +483,7 @@ static ssize_t mci_sdram_scrub_rate_show(struct mem_ctl_info *mci, char *data)
return bandwidth;
}

- edac_printk(KERN_DEBUG, EDAC_MC, "Read scrub rate: %d\n", bandwidth);
+ debugf1 ("Read scrub rate: %d\n", bandwidth);
return sprintf(data, "%d\n", bandwidth);
}


--
Markus

2011-04-20 17:06:00

by Robert Whitton

Subject: Re: Background memory scrubbing


On Wed 20/04/11 6:45 PM, Rik van Riel wrote:

> On 04/20/2011 03:58 AM, Robert Whitton wrote:
>
> > for each PFN from 256 to the highest valid PFN
> > {
> > if (pfn_valid(PFN))
> > {
> > page = pfn_to_page(PFN)
> > va = kmap(page)
> > atomic_scrub(va, PAGE_SIZE)
> > kunmap(page)
> > }
> >
> > sleep(for_a_while)
> > }
>
> What exactly does atomic_scrub do?

atomic_scrub is part of the EDAC subsystem; see arch/x86/include/asm/edac.h. It simply does a locked add of zero to each DWORD in the specified range.

(A shame that on 64-bit platforms it doesn't use QWORDs, but that's just an optimisation.)
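What atomic_scrub() does can be modelled in userspace with the GCC/Clang __atomic builtins. A locked add of zero leaves each word's value unchanged but forces a read, which passes through ECC correction, followed by a write of the same value, so the corrected data eventually lands back in DRAM. This is an analogue with hypothetical function names, not the kernel's code, and it includes the 64-bit (QWORD) variant mentioned above. Because every step is a write, it also faults on any page that is mapped read-only.

```c
#include <stddef.h>
#include <stdint.h>

/* Atomically add zero to every 32-bit word in [va, va + size).
 * The value is unchanged; the point is the locked read-modify-write. */
static void scrub_dwords(void *va, size_t size)
{
	uint32_t *p = va;

	for (size_t i = 0; i < size / sizeof(*p); i++)
		__atomic_fetch_add(&p[i], 0, __ATOMIC_SEQ_CST);
}

/* Same idea with 64-bit words: half as many locked operations.
 * Assumes va is 8-byte aligned. */
static void scrub_qwords(void *va, size_t size)
{
	uint64_t *p = va;

	for (size_t i = 0; i < size / sizeof(*p); i++)
		__atomic_fetch_add(&p[i], 0, __ATOMIC_SEQ_CST);
}
```

The atomicity matters because the scrubber shares memory with running code and DMA: a plain read followed by a plain write could overwrite a value that another CPU or a device stored in between.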

>
> > This code works absolutely fine up to a short distance beyond the 16MB
> boundary (specifically it seems to always fail on my hardware at PFN
> 4105). At this point despite the fact that kmap returns a valid virtual
> address (and it is the virtual address that I expect - 0xffff880001009000)
> I get the kernel oops - "unable to handle kernel paging request".
>
> Looks like you might be making some of the kernel code that
> is running at that moment unreachable, leading to a kernel
> page fault.
>

2011-04-20 17:36:53

by Borislav Petkov

Subject: Re: Background memory scrubbing

On Wed, Apr 20, 2011 at 12:55:33PM -0400, Markus Trippelsdorf wrote:
> > > This is KERN_DEBUG since .38 (commit 24f9a7fe3f19f3fd310f556364d01a22911724b3)
> > > and it shouldn't appear on the console if you don't change your default log level.
> >
> > Yes. Sorry, but I was referring to dmesg and not the console.
> > What I mean is that maybe debugf1 or debugf2 is more appropriate than
> > amd64_debug?
>
> In other words:

Ok, that whole debugging output there is partly historical remains and
we don't really need them if we're smart about it. But you'll have to
change your patch and make a proper commit message :).

>
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 31e71c4f..13b107e 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -211,7 +211,7 @@ static int amd64_get_scrub_rate(struct mem_ctl_info *mci)
>
> scrubval = scrubval & 0x001F;
>
> - amd64_debug("pci-read, sdram scrub control value: %d\n", scrubval);
> + debugf1 ("pci-read, sdram scrub control value: %d\n", scrubval);

Remove this one completely.

>
> for (i = 0; i < ARRAY_SIZE(scrubrates); i++) {
> if (scrubrates[i].scrubval == scrubval) {
> diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
> index 26343fd..8a34aea 100644
> --- a/drivers/edac/edac_mc_sysfs.c
> +++ b/drivers/edac/edac_mc_sysfs.c
> @@ -459,7 +459,7 @@ static ssize_t mci_sdram_scrub_rate_store(struct mem_ctl_info *mci,
>
> new_bw = mci->set_sdram_scrub_rate(mci, bandwidth);
> if (new_bw >= 0) {
> - edac_printk(KERN_DEBUG, EDAC_MC, "Scrub rate set to %d\n", new_bw);
> + debugf1 ("Scrub rate set to %d\n", new_bw);
> return count;
> }

Make the success case implicit here and issue a warning only if we fail
to set the scrub rate:


	new_bw = mci->set_sdram_scrub_rate(mci, bandwidth);
	if (new_bw < 0) {
		edac_printk(KERN_WARNING, EDAC_MC,
			    "Error setting scrub rate to: %lu\n", bandwidth);
		return -EINVAL;
	}

and do the same thing with mci_sdram_scrub_rate_show() below.

>
> @@ -483,7 +483,7 @@ static ssize_t mci_sdram_scrub_rate_show(struct mem_ctl_info *mci, char *data)
> return bandwidth;
> }
>
> - edac_printk(KERN_DEBUG, EDAC_MC, "Read scrub rate: %d\n", bandwidth);
> + debugf1 ("Read scrub rate: %d\n", bandwidth);

This one can go too since the success-case is in sysfs anyway.

> return sprintf(data, "%d\n", bandwidth);
> }

Would you like to do that?

Thanks.

--
Regards/Gruss,
Boris.


2011-04-20 17:53:47

by Rik van Riel

Subject: Re: Background memory scrubbing

On 04/20/2011 01:05 PM, Robert Whitton wrote:
> On Wed 20/04/11 6:45 PM, Rik van Riel wrote:
>> On 04/20/2011 03:58 AM, Robert Whitton wrote:
>>
>>> for each PFN from 256 to the highest valid PFN
>>> {
>>> if (pfn_valid(PFN))
>>> {
>>> page = pfn_to_page(PFN)
>>> va = kmap(page)
>>> atomic_scrub(va, PAGE_SIZE)
>>> kunmap(page)
>>> }
>>>
>>> sleep(for_a_while)
>>> }
>>
>> What exactly does atomic_scrub do?
>
> atomic_scrub is part of the edac subsystem see arch/x86/include/asm/edac.h. It simply does a locked add of zero to each DWORD in the specified range.

I can think of only a few ways in which that could cause a
kernel page fault.

One of the more obvious causes would be running into an
area of kernel memory that is mapped read-only. Writing
to a page that is mapped read-only would cause a page
fault :)

Walking the page tables to check whether my guess is correct
should be possible in the current context. Look at current->mm->pgd for
the page directory and start walking from there.

Incidentally, the kernel mappings should be the same for any
process, so the above should hold true from any context.
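The suggested walk can be made concrete by splitting the faulting address into its page-table indices. On x86-64 with 4-level paging each level uses 9 bits of the address above a 12-bit page offset, and the walk ends early at the PUD or PMD level wherever the kernel maps a region with a 1 GiB or 2 MiB page. This is a userspace sketch with hypothetical names. Two caveats: a kernel thread has current->mm set to NULL, so such a walk would have to start from current->active_mm or the kernel's own tables; and x86 kernels also contain lookup_address() in arch/x86/mm/pageattr.c, which performs this walk and reports the level reached, though whether a module can call it depends on the kernel version.

```c
#include <stdint.h>

/* Table indices for one virtual address under x86-64 4-level paging. */
struct pt_indices {
	unsigned pgd, pud, pmd, pte, offset;
};

/* Each level indexes a 512-entry table with 9 address bits; the low
 * 12 bits are the offset within a 4 KiB page. */
static struct pt_indices split_va(uint64_t va)
{
	struct pt_indices ix = {
		.pgd    = (va >> 39) & 0x1ff,
		.pud    = (va >> 30) & 0x1ff,
		.pmd    = (va >> 21) & 0x1ff,
		.pte    = (va >> 12) & 0x1ff,
		.offset = va & 0xfff,
	};
	return ix;
}
```

For 0xffff880001009000, the address from the oops, this gives PGD index 272, PUD 0, PMD 8, PTE 9 and offset 0. Checking whether the entry found there is present and writable (pte_present() and pte_write() in the kernel) would confirm or rule out the read-only theory.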

2011-04-20 18:28:48

by Markus Trippelsdorf

Subject: [PATCH] edac: Remove debugging output in scrub rate handling

On 2011.04.20 at 19:36 +0200, Borislav Petkov wrote:
> On Wed, Apr 20, 2011 at 12:55:33PM -0400, Markus Trippelsdorf wrote:
> > > > This is KERN_DEBUG since .38 (commit 24f9a7fe3f19f3fd310f556364d01a22911724b3)
> > > > and it shouldn't appear on the console if you don't change your default log level.
> > >
> > > Yes. Sorry, but I was referring to dmesg and not the console.
> > > What I mean is that maybe debugf1 or debugf2 is more appropriate than
> > > amd64_debug?
> >
> > In other words:
>
> Ok, that whole debugging output there is partly historical remains and
> we don't really need them if we're smart about it. But you'll have to
> change your patch and make a proper commit message :).

This patch removes superfluous debugging output in the sysfs scrub rate
handler. It also consolidates the error handling in
mci_sdram_scrub_rate_store.

Signed-off-by: Markus Trippelsdorf <[email protected]>
---
drivers/edac/amd64_edac.c | 2 --
drivers/edac/edac_mc_sysfs.c | 11 +++++------
2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 31e71c4f..4b4071e 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -211,8 +211,6 @@ static int amd64_get_scrub_rate(struct mem_ctl_info *mci)

scrubval = scrubval & 0x001F;

- amd64_debug("pci-read, sdram scrub control value: %d\n", scrubval);
-
for (i = 0; i < ARRAY_SIZE(scrubrates); i++) {
if (scrubrates[i].scrubval == scrubval) {
retval = scrubrates[i].bandwidth;
diff --git a/drivers/edac/edac_mc_sysfs.c b/drivers/edac/edac_mc_sysfs.c
index 26343fd..6ffe438 100644
--- a/drivers/edac/edac_mc_sysfs.c
+++ b/drivers/edac/edac_mc_sysfs.c
@@ -458,13 +458,13 @@ static ssize_t mci_sdram_scrub_rate_store(struct mem_ctl_info *mci,
return -EINVAL;

new_bw = mci->set_sdram_scrub_rate(mci, bandwidth);
- if (new_bw >= 0) {
- edac_printk(KERN_DEBUG, EDAC_MC, "Scrub rate set to %d\n", new_bw);
- return count;
+ if (new_bw < 0) {
+ edac_printk(KERN_WARNING, EDAC_MC,
+ "Error setting scrub rate to: %lu\n", bandwidth);
+ return -EINVAL;
}

- edac_printk(KERN_DEBUG, EDAC_MC, "Error setting scrub rate to: %lu\n", bandwidth);
- return -EINVAL;
+ return count;
}

/*
@@ -483,7 +483,6 @@ static ssize_t mci_sdram_scrub_rate_show(struct mem_ctl_info *mci, char *data)
return bandwidth;
}

- edac_printk(KERN_DEBUG, EDAC_MC, "Read scrub rate: %d\n", bandwidth);
return sprintf(data, "%d\n", bandwidth);
}

--
Markus

2011-04-20 19:23:38

by Bill Gatliff

Subject: Re: Background memory scrubbing

Ladisch:


On Wed, Apr 20, 2011 at 10:19 AM, Clemens Ladisch wrote:
> Which hardware platform is this?  AFAICT all architectures with ECC
> (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
> If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
> PCI configuration register 0x58 in function 3 of the CPU's northbridge
> device, see the BIOS and Kernel's Developer's Guide for details).

That won't help non-AMD64 platforms that want to scrub.

Is there a way to make Robert's approach work? I'm aware of a few
non-AMD64, somewhat-exotic platforms that require scrubbing the way
that Robert is proposing...


b.g.
--
Bill Gatliff

2011-04-21 11:55:38

by Borislav Petkov

Subject: Re: [PATCH] edac: Remove debugging output in scrub rate handling

On Wed, Apr 20, 2011 at 02:28:45PM -0400, Markus Trippelsdorf wrote:
>
> This patch removes superfluous debugging output in the sysfs scrub rate
> handler. It also consolidates the error handling in
> mci_sdram_scrub_rate_store.
>
> Signed-off-by: Markus Trippelsdorf <[email protected]>

Applied, thanks.

--
Regards/Gruss,
Boris.


2011-04-25 11:41:14

by Pavel Machek

Subject: Re: Background memory scrubbing

Hi!

> >>>for each PFN from 256 to the highest valid PFN
> >>>{
> >>>if (pfn_valid(PFN))
> >>>{
> >>>page = pfn_to_page(PFN)
> >>>va = kmap(page)
> >>>atomic_scrub(va, PAGE_SIZE)
> >>>kunmap(page)
> >>>}
> >>>
> >>>sleep(for_a_while)
> >>>}
> >>
> >>What exactly does atomic_scrub do?
> >
> >atomic_scrub is part of the edac subsystem see arch/x86/include/asm/edac.h. It simply does a locked add of zero to each DWORD in the specified range.
>
> I can think of only a few ways in which that could cause a
> kernel page fault.
>
> One of the more obvious causes would be running into an
> area of kernel memory that is mapped read-only. Writing
> to a page that is mapped read-only would cause a page
> fault :)

...also... you are actually making the kernel use "self-modifying code"
here. There are CPU bugs in that area... for example on K6.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2011-04-25 18:21:17

by Chris Friesen

Subject: Re: Background memory scrubbing

On 04/20/2011 10:01 AM, Borislav Petkov wrote:
> On Wed, Apr 20, 2011 at 04:46:22PM +0100, Robert Whitton wrote:
>>
>>> On Wed, Apr 20, 2011 at 05:19:41PM +0200, Clemens Ladisch wrote:
>>>>> Unfortunately in common with a large number of hardware platforms
>>>>> background scrubbing isn't supported in the hardware (even though ECC
>>>>> error correction is supported) and thus there is no BIOS option to
>>>>> enable it.
>>>>
>>>> Which hardware platform is this? AFAICT all architectures with ECC
>>>> (old AMD64, Family 0Fh, Family 10h) also have scrubbing support.
>>>> If your BIOS is too dumb, just try enabling it directly (bits 0-4 of
>>>> PCI configuration register 0x58 in function 3 of the CPU's northbridge
>>>> device, see the BIOS and Kernel's Developer's Guide for details).
>>>
>>> Or even better, if on AMD, you can build the amd64_edac module
>>> (CONFIG_EDAC_AMD64) and do
>>>
>>> echo <x> > /sys/devices/system/edac/mc/mc<y>/sdram_scrub_rate
>>>
>>> where x is the scrubbing bandwidth in bytes/sec and y is the memory
>>> controller on the machine, i.e. node.
>>
>> Unfortunately that also isn't an option on my platform(s). There surely must be a way for a module to be able to get a mapping for each physical page of memory in the system and to be able to use that mapping to do atomic read/writes to scrub the memory.
>
> For such questions I've added just the right ML to Cc :).

There was a thread back in 2009 with the subject "marching through all
physical memory in software" that discussed some of the issues of a
software background scrub.

Chris

--
Chris Friesen
Software Developer
GENBAND
http://www.genband.com