I tested 2.6.31 and the mainline amd64_edac driver. Once errors were
getting reported I noticed that a good amount of valid looking system
addressed were not being correctly decoded to the csrow level. I was
only able to correctly decode errors on channel 0 of a given csrow.
Errors on channel 1 were unable to be mapped.
After some digging I realized that the there was an issue with handling
of Dram Chip Select Masks. Specifically that amd64_map_to_dcs_mask was
returning incorrect values on my Rev F based system. See AMD #32559,
4.5.4, starting on pg 90 for a Rev F explanation of the correct mapping
of DRAM CS Base and DRAM CS Mask regs. This lead to the code below. I
can provide further explanation if needed.
I have tested this code on Rev F based system with ecc debug dimms and
fully expect it to work on earlier and later cpus. I am now able to
correctly decode and map errors to csrows rows on this system.
Submitted-by: Keith Mannthey<[email protected]>
---
diff -urN linux-2.6.31/drivers/edac/amd64_edac.c linux-2.6.31-fixed/drivers/edac/amd64_edac.c
--- linux-2.6.31/drivers/edac/amd64_edac.c 2009-09-09 15:13:59.000000000 -0700
+++ linux-2.6.31-fixed/drivers/edac/amd64_edac.c 2009-09-17 22:32:09.000000000 -0700
@@ -1,6 +1,8 @@
-#include "amd64_edac.h"
+#include <linux/log2.h>
#include <asm/k8.h>
+#include "amd64_edac.h"
+
static struct edac_pci_ctl_info *amd64_ctl_pci;
static int report_gart_errors;
@@ -132,7 +134,7 @@
/* Map from a CSROW entry to the mask entry that operates on it */
static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
{
- return csrow >> (pvt->num_dcsm >> 3);
+ return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));
}
/* return the 'base' address the i'th CS entry of the 'dct' DRAM controller */
diff -urN linux-2.6.31/drivers/edac/amd64_edac.h linux-2.6.31-fixed/drivers/edac/amd64_edac.h
--- linux-2.6.31/drivers/edac/amd64_edac.h 2009-09-17 22:22:18.000000000 -0700
+++ linux-2.6.31-fixed/drivers/edac/amd64_edac.h 2009-09-17 22:48:50.000000000 -0700
@@ -526,7 +526,7 @@
/*
* The following fields are set at (load) run time, after CPU revision
* has been determined, since the dct_base and dct_mask registers vary
- * based on revision
+ * based on revision. num_dcsm is assumed to be a power of 2.
*/
u32 dcsb_base; /* DCSB base bits */
u32 dcsm_mask; /* DCSM mask bits */
On Thu, Sep 17, 2009 at 07:09:41PM -0700, Keith Mannthey wrote:
> I tested 2.6.31 and the mainline amd64_edac driver. Once errors were
> getting reported I noticed that a good amount of valid looking system
> addressed were not being correctly decoded to the csrow level. I was
> only able to correctly decode errors on channel 0 of a given csrow.
> Errors on channel 1 were unable to be mapped.
>
> After some digging I realized that the there was an issue with handling
> of Dram Chip Select Masks. Specifically that amd64_map_to_dcs_mask was
> returning incorrect values on my Rev F based system. See AMD #32559,
> 4.5.4, starting on pg 90 for a Rev F explanation of the correct mapping
> of DRAM CS Base and DRAM CS Mask regs. This lead to the code below. I
> can provide further explanation if needed.
>
>
> I have tested this code on Rev F based system with ecc debug dimms and
> fully expect it to work on earlier and later cpus. I am now able to
> correctly decode and map errors to csrows rows on this system.
>
>
> Submitted-by: Keith Mannthey<[email protected]>
> ---
This whole DSC[BM] handling is rather long-winded and overengineered. It
is on my to-be-rewritten-and-simplified list.
>
> diff -urN linux-2.6.31/drivers/edac/amd64_edac.c linux-2.6.31-fixed/drivers/edac/amd64_edac.c
> --- linux-2.6.31/drivers/edac/amd64_edac.c 2009-09-09 15:13:59.000000000 -0700
> +++ linux-2.6.31-fixed/drivers/edac/amd64_edac.c 2009-09-17 22:32:09.000000000 -0700
> @@ -1,6 +1,8 @@
> -#include "amd64_edac.h"
> +#include <linux/log2.h>
> #include <asm/k8.h>
>
> +#include "amd64_edac.h"
> +
> static struct edac_pci_ctl_info *amd64_ctl_pci;
>
> static int report_gart_errors;
> @@ -132,7 +134,7 @@
> /* Map from a CSROW entry to the mask entry that operates on it */
> static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
> {
> - return csrow >> (pvt->num_dcsm >> 3);
> + return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));
Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
this way you get wrong DCSM offsets for F11h. A dirty fix would be:
if (boot_cpu_data.x86 == 0xf && pvt->ext_model < OPTERON_CPU_REV_E) {
return csrow;
else
return csrow >> 1;
The problem is, the csrow thing still goes over 0..7 which is obviously
wrong on F11h but I'll fix that later. Care to redo your patch according
to these and the comments from my previous mail and resend?
By the way, your patches made me look harder at that code region and
I've found some more problems with it which I've fixed. Would you like
to test the whole bunch of fixes on your setup?
Thanks.
--
Regards/Gruss,
Boris.
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. M?nchen, Germany
Research | Gesch?ftsf?hrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis M?nchen
(OSRC) | Registergericht M?nchen, HRB Nr. 43632
On Fri, 2009-09-18 at 16:42 +0200, Borislav Petkov wrote:
> On Thu, Sep 17, 2009 at 07:09:41PM -0700, Keith Mannthey wrote:
<snip>
> > /* Map from a CSROW entry to the mask entry that operates on it */
> > static inline u32 amd64_map_to_dcs_mask(struct amd64_pvt *pvt, int csrow)
> > {
> > - return csrow >> (pvt->num_dcsm >> 3);
> > + return csrow >> (8 >> (ilog2(pvt->num_dcsm)+1));
>
> Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
> this way you get wrong DCSM offsets for F11h. A dirty fix would be:
I think this will still be ok for F11.
ilog2(2) = 1
1 + 1 == 2
8 >> 2 == 2
csrow >> 2
This would be ok rev F11 assuming 8 total.
Am I missing something else?
> if (boot_cpu_data.x86 == 0xf && pvt->ext_model < OPTERON_CPU_REV_E) {
> return csrow;
> else
> return csrow >> 1;
Still busted for F11.
> The problem is, the csrow thing still goes over 0..7 which is obviously
> wrong on F11h but I'll fix that later. Care to redo your patch according
> to these and the comments from my previous mail and resend?
Are there more than 8 csrows any any version (I don't currently have F11
specs). Maybe should just move to a map rather than a math trick to get
to the right index?
> By the way, your patches made me look harder at that code region and
> I've found some more problems with it which I've fixed. Would you like
> to test the whole bunch of fixes on your setup?
Yes please send any changes you have. I have a decent test setup for
live errors.
Thanks,
Keith
On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > Almost. You have 8 DCSMs on RevE, 4 on RevF and F10h and 2 on F11h and
> > this way you get wrong DCSM offsets for F11h. A dirty fix would be:
>
> I think this will still be ok for F11.
>
>
> ilog2(2) = 1
>
> 1 + 1 == 2
>
> 8 >> 2 == 2
>
> csrow >> 2
>
> This would be ok rev F11 assuming 8 total.
>
> Am I missing something else?
Yes, F11h has only 4 DCSB and 2 DCSM registers. So the first
two DCSB registers F2x[1,0]40 and F2x[1,0]44 use F2x[1,0]60 as
a mask register and F2x[1,0]48 and F2x[1,0]4C use F2x[1,0]64.
You can look at F11h as a F10h but with only the half of the
DSC[BM] registers present. You can find the F11h BKDG at
http://support.amd.com/us/Processor_TechDocs/41256.pdf and especially
Table 24 on page 115.
So, in that case csrow >> 1 is still valid but csrow going beyond 3 is
out of range that's why it needs to be fixed differently.
> Are there more than 8 csrows any any version (I don't currently have
> F11 specs). Maybe should just move to a map rather than a math trick
> to get to the right index?
I'll think up something on Monday.
> > By the way, your patches made me look harder at that code region and
> > I've found some more problems with it which I've fixed. Would you
> > like to test the whole bunch of fixes on your setup?
>
> Yes please send any changes you have. I have a decent test setup for
> live errors.
Cool, I'll get back to you when I have them ready, thanks.
--
Regards/Gruss,
Boris.
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. M?nchen, Germany
Research | Gesch?ftsf?hrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis M?nchen
(OSRC) | Registergericht M?nchen, HRB Nr. 43632
Hi Keith,
On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > By the way, your patches made me look harder at that code region and
> > I've found some more problems with it which I've fixed. Would you like
> > to test the whole bunch of fixes on your setup?
>
> Yes please send any changes you have. I have a decent test setup for
> live errors.
The patch queue is at
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
the 'edac-fixes' branch.
Please do test and let me know of any problems you might have.
Thanks a lot.
--
Regards/Gruss,
Boris.
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. M?nchen, Germany
Research | Gesch?ftsf?hrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis M?nchen
(OSRC) | Registergericht M?nchen, HRB Nr. 43632
On Mon, 2009-09-21 at 16:55 +0200, Borislav Petkov wrote:
> Hi Keith,
>
> On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > > By the way, your patches made me look harder at that code region and
> > > I've found some more problems with it which I've fixed. Would you like
> > > to test the whole bunch of fixes on your setup?
> >
> > Yes please send any changes you have. I have a decent test setup for
> > live errors.
>
> The patch queue is at
>
> git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
>
> the 'edac-fixes' branch.
Please check you have synced with that git tree. When I pull the tree
it is 2.6.30-rc5 without the amd64_edac driver. It has no branch
outside of "master".
Thanks,
Keith
On Mon, Sep 21, 2009 at 04:50:24PM -0700, Keith Mannthey wrote:
> On Mon, 2009-09-21 at 16:55 +0200, Borislav Petkov wrote:
> > Hi Keith,
> >
> > On Fri, Sep 18, 2009 at 10:28:53AM -0700, Keith Mannthey wrote:
> > > > By the way, your patches made me look harder at that code region and
> > > > I've found some more problems with it which I've fixed. Would you like
> > > > to test the whole bunch of fixes on your setup?
> > >
> > > Yes please send any changes you have. I have a decent test setup for
> > > live errors.
> >
> > The patch queue is at
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
> >
> > the 'edac-fixes' branch.
>
> Please check you have synced with that git tree. When I pull the tree
> it is 2.6.30-rc5 without the amd64_edac driver. It has no branch
> outside of "master".
This happens because it is a bare repository without a working tree.
You'll have to add it as a remote repository to an already existing
(with a working tree) one you have locally like so:
(in your repository do:)
git remote add edac git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git
git fetch edac
git checkout -b edac-test edac/edac-fixes
Now you should be on the branch with the patchset.
Let me know how it goes.
Thanks.
--
Regards/Gruss,
Boris.
Operating | Advanced Micro Devices GmbH
System | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. M?nchen, Germany
Research | Gesch?ftsf?hrer: Andrew Bowd, Thomas M. McCoy, Giuliano Meroni
Center | Sitz: Dornach, Gemeinde Aschheim, Landkreis M?nchen
(OSRC) | Registergericht M?nchen, HRB Nr. 43632