> > UBSAN complains this error
> > ~~~
> > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> …
> > ~~~
> >
> > when get rows, cols and ranks, the returned error value doesn't be
> > handled.
> >
> > check the return value is EINVAL, if yes, directly return 0.
> …
>
> * Please improve this change description further.
To be specific. Initially you reported this because of the UBSAN error
report. But, after community discussion you now know that the problem
is that one or more of the rows/cols/ranks has a value that the EDAC driver
doesn't expect and probably can handle.
So, in V2, the commit message should start with the information these
values are out of range and mention this was discovered when UBSAN
put out a warning about a negative shift. No need to include the whole
of the UBSAN stack trace.
Then describe the two fixes that this patch includes. One is to change the
edac debug message into a console error message to enable further
debug of this issue. The other is to skip the unrecognized DIMM.
> * How do you think about to add the tag “Fixes”?
This is a good idea. Use git blame, or dig into the GIT history to
find the commit where this code was introduced (hint .. git blame
says:
88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
but that obviously just refactored code, so you should dig back more into
the history.
> > V2: make error-print explicitly
> > ---
>
> Would you like to avoid a misplaced marker line here?
>
> See also:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
That's an excellent resource.
-Tony
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <[email protected]> wrote:
>
> > > UBSAN complains this error
> > > ~~~
> > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > …
> > > ~~~
> > >
> > > when get rows, cols and ranks, the returned error value doesn't be
> > > handled.
> > >
> > > check the return value is EINVAL, if yes, directly return 0.
> > …
> >
> > * Please improve this change description further.
>
> To be specific. Initially you reported this because of the UBSAN error
> report. But, after community discussion you now know that the problem
> is that one or more of the rows/cols/ranks has a value that the EDAC driver
> doesn't expect and probably can handle.
>
> So, in V2, the commit message should start with the information these
> values are out of range and mention this was discovered when UBSAN
> put out a warning about a negative shift. No need to include the whole
> of the UBSAN stack trace.
>
> Then describe the two fixes that this patch includes. One is to change the
> edac debug message into a console error message to enable further
> debug of this issue. The other is to skip the unrecognized DIMM.
>
> > * How do you think about to add the tag “Fixes”?
>
> This is a good idea. Use git blame, or dig into the GIT history to
> find the commit where this code was introduced (hint .. git blame
> says:
> 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> but that obviously just refactored code, so you should dig back more into
> the history.
>
> > > V2: make error-print explicitly
> > > ---
> >
> > Would you like to avoid a misplaced marker line here?
> >
> > See also:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
>
> That's an excellent resource.
Thanks for your advices and I will modify.
here's part of dmesg enabled EDAC_DEBUG
~~~
[ 4.032332] EDAC DEBUG: skx_register_mci: MC#1: mci = 00000000799db99e
[ 4.032334] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch0 dimm0)
[ 4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[ 4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[ 4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[ 4.032339] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch0 dimm1)
[ 4.032340] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[ 4.032341] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[ 4.032341] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[ 4.032343] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch1 dimm0)
[ 4.032344] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[ 4.032345] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[ 4.032346] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
[ 4.032347] EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff
mcddrtcfg 0xffffffff (mc1 ch1 dimm1)
[ 4.032348] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
[ 4.032349] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
[ 4.032349] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
~~~
I shared the whole dmesg through g-drive.
https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing
>
> -Tony
On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <[email protected]> wrote:
>
> > > UBSAN complains this error
> > > ~~~
> > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > …
> > > ~~~
> > >
> > > when get rows, cols and ranks, the returned error value doesn't be
> > > handled.
> > >
> > > check the return value is EINVAL, if yes, directly return 0.
> > …
> >
> > * Please improve this change description further.
>
> To be specific. Initially you reported this because of the UBSAN error
> report. But, after community discussion you now know that the problem
> is that one or more of the rows/cols/ranks has a value that the EDAC driver
> doesn't expect and probably can handle.
>
> So, in V2, the commit message should start with the information these
> values are out of range and mention this was discovered when UBSAN
> put out a warning about a negative shift. No need to include the whole
> of the UBSAN stack trace.
>
> Then describe the two fixes that this patch includes. One is to change the
> edac debug message into a console error message to enable further
> debug of this issue. The other is to skip the unrecognized DIMM.
>
> > * How do you think about to add the tag “Fixes”?
>
> This is a good idea. Use git blame, or dig into the GIT history to
> find the commit where this code was introduced (hint .. git blame
> says:
> 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> but that obviously just refactored code, so you should dig back more into
> the history.
There are two parts,
1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0
2. get num of ranks, rows and cols, 4ec656bdf43a13
Should I add all of them prefixes with "Fixes"?
>
> > > V2: make error-print explicitly
> > > ---
> >
> > Would you like to avoid a misplaced marker line here?
> >
> > See also:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
>
> That's an excellent resource.
>
> -Tony
Interesting. This isn’t some odd shaped DIMM.
You are just getting a failed read from the register.
So trying to decode all-ones.
I’ll try to look at the full log you linked when I’m
on a computer instead of a phone.
> [ 4.032335] EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
> [ 4.032337] EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
> [ 4.032338] EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
As per suggestions, i modified V3.
could you please take a look
Subject: [PATCH][V3] EDAC/i10nm: shift exponent is negative
Because failed to read from DIMM, get the negative value for shift
operation.
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)
EDAC DEBUG: skx_get_dimm_attr: bad rows = 7 (raw=0xffffffff)
EDAC DEBUG: skx_get_dimm_attr: bad cols = 3 (raw=0xffffffff)
EDAC DEBUG: i10nm_get_dimm_config: dimmmtr 0xffffffff mcddrtcfg
0xffffffff (mc1 ch0 dimm1)`
UBSAN complains this error
`UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
shift exponent -66 is negative`
when get rows, cols and ranks, the returned error value doesn't be
handled.
check the return value is EINVAL, if yes, directly return 0 and
show error message explicitly.
Fixes: 4ec656bdf43a13) EDAC, skx_edac: Add EDAC driver for Skylake
Signed-off-by: Koba Ko <[email protected]>
---
V2 -> V3: simplify the summary and add 'Fixes:'
V1 -> V2: make error-print explicitly
On Tue, Jul 4, 2023 at 10:20 AM Koba Ko <[email protected]> wrote:
>
> On Tue, Jul 4, 2023 at 5:51 AM Luck, Tony <[email protected]> wrote:
> >
> > > > UBSAN complains this error
> > > > ~~~
> > > > UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
> > > …
> > > > ~~~
> > > >
> > > > when get rows, cols and ranks, the returned error value doesn't be
> > > > handled.
> > > >
> > > > check the return value is EINVAL, if yes, directly return 0.
> > > …
> > >
> > > * Please improve this change description further.
> >
> > To be specific. Initially you reported this because of the UBSAN error
> > report. But, after community discussion you now know that the problem
> > is that one or more of the rows/cols/ranks has a value that the EDAC driver
> > doesn't expect and probably can handle.
> >
> > So, in V2, the commit message should start with the information these
> > values are out of range and mention this was discovered when UBSAN
> > put out a warning about a negative shift. No need to include the whole
> > of the UBSAN stack trace.
> >
> > Then describe the two fixes that this patch includes. One is to change the
> > edac debug message into a console error message to enable further
> > debug of this issue. The other is to skip the unrecognized DIMM.
> >
> > > * How do you think about to add the tag “Fixes”?
> >
> > This is a good idea. Use git blame, or dig into the GIT history to
> > find the commit where this code was introduced (hint .. git blame
> > says:
> > 88a242c98740 ("EDAC, skx_common: Separate common code out from skx_edac")
> > but that obviously just refactored code, so you should dig back more into
> > the history.
> There are two parts,
> 1. @get_dimm_attr, edac_dbg was added since e235dd43d8b0f0
> 2. get num of ranks, rows and cols, 4ec656bdf43a13
>
> Should I add all of them prefixes with "Fixes"?
>
> >
> > > > V2: make error-print explicitly
> > > > ---
> > >
> > > Would you like to avoid a misplaced marker line here?
> > >
> > > See also:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.4#n686
> >
> > That's an excellent resource.
> >
> > -Tony
> I shared the whole dmesg through g-drive.
> https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing
The EDAC driver was OK for MC#0, but then failed for the other memory controllers.
Can you send some more information about your system. Output from the
following commands:
# head /proc/cpuinfo
# dmidecode -t 17
# lspci
Thanks
-Tony
On Wed, Jul 5, 2023 at 12:08 PM Luck, Tony <[email protected]> wrote:
>
> > I shared the whole dmesg through g-drive.
> > https://drive.google.com/file/d/1epnDZNezGiJsK1eT4UNOi8_igcBSXtiF/view?usp=sharing
>
> The EDAC driver was OK for MC#0, but then failed for the other memory controllers.
>
> Can you send some more information about your system. Output from the
> following commands:
>
> # head /proc/cpuinfo
>
> # dmidecode -t 17
>
> # lspci
please check through this url,
https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing
>
> Thanks
>
> -Tony
>> # head /proc/cpuinfo
This shows your system is the workstation version of Sapphire rapids. I don't
think we did any validation of the EDAC driver against this model.
> # dmidecode -t 17
You have just one 16GB DIMM, and EDAC found that. So despite the messy warnings,
EDAC should be working for you.
> # lspci
I didn't dig into this. Qiuxu - can you compare this against a server Sapphire rapids?
Maybe it has some clues so the EDAC driver will know not to look for non-existent
memory controllers.
> please check through this url,
> https://drive.google.com/drive/folders/199k3BX6IipNYCDfuMGy8W26ZtYRDIYZr?usp=sharing
Thanks for all the details.
-Tony
> From: Luck, Tony <[email protected]>
> Sent: Wednesday, July 5, 2023 11:22 PM
> ...
> Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
>
> >> # head /proc/cpuinfo
>
> This shows your system is the workstation version of Sapphire rapids. I don't
> think we did any validation of the EDAC driver against this model.
No, we didn't do any validation of the EDAC on Sapphires Rapids workstations.
From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before.
> > # dmidecode -t 17
>
> You have just one 16GB DIMM, and EDAC found that. So despite the messy
> warnings, EDAC should be working for you.
>
> > # lspci
>
> I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> rapids?
> Maybe it has some clues so the EDAC driver will know not to look for non-
> existent memory controllers.
This Sapphire Rapids workstation had 2 memory controllers but appeared
4 memory controller PCIe devices from the log:
0000:fe:0c.0 1101: 8086:324a
0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
0000:fe:0e.0 1101: 8086:324a
0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
By observing that the MMIO registers of these absent
memory controllers consistently hold the value of ~0.
We may identify a memory controller as absent by checking
if its MMIO register "mcmtr" == ~0 in all its channels.
I made a patch below to skip all these absent memory controllers
https://lore.kernel.org/linux-edac/[email protected]/T/#u
@Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!
BTW,
Kai-Heng Feng also found the same issue before:
https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
- Qiuxu
On Thu, Jul 6, 2023 at 10:12 PM Zhuo, Qiuxu <[email protected]> wrote:
>
> > From: Luck, Tony <[email protected]>
> > Sent: Wednesday, July 5, 2023 11:22 PM
> > ...
> > Subject: RE: [PATCH v2] EDAC/i10nm: shift exponent is negative
> >
> > >> # head /proc/cpuinfo
> >
> > This shows your system is the workstation version of Sapphire rapids. I don't
> > think we did any validation of the EDAC driver against this model.
>
> No, we didn't do any validation of the EDAC on Sapphires Rapids workstations.
> From the link below, we know this is a Sapphire Rapids workstation with only 2 memory controllers.
> https://www.intel.com/content/www/us/en/products/sku/233480/intel-xeon-w32435-processor-22-5m-cache-3-10-ghz/specifications.html
>
> We only did validation on the Sapphire Rapids servers which were with 4 memory controllers per socket before.
>
> > > # dmidecode -t 17
> >
> > You have just one 16GB DIMM, and EDAC found that. So despite the messy
> > warnings, EDAC should be working for you.
> >
> > > # lspci
> >
> > I didn't dig into this. Qiuxu - can you compare this against a server Sapphire
> > rapids?
> > Maybe it has some clues so the EDAC driver will know not to look for non-
> > existent memory controllers.
>
> This Sapphire Rapids workstation had 2 memory controllers but appeared
> 4 memory controller PCIe devices from the log:
>
> 0000:fe:0c.0 1101: 8086:324a
> 0000:fe:0d.0 1101: 8086:324a // absent mc fooling the driver, should not appear
> 0000:fe:0e.0 1101: 8086:324a
> 0000:fe:0f.0 1101: 8086:324a // absent mc fooling the driver, should not appear
>
> By observing that the MMIO registers of these absent
> memory controllers consistently hold the value of ~0.
> We may identify a memory controller as absent by checking
> if its MMIO register "mcmtr" == ~0 in all its channels.
>
> I made a patch below to skip all these absent memory controllers
> https://lore.kernel.org/linux-edac/[email protected]/T/#u
> @Koba Ko, could you please verify the patch from the link above on your workstation? Thanks!
Here's dmesg patched(Ref. 1). didn't find the previous message,
`EDAC DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
Ref. 1, https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmVlp?usp=sharing
>
> BTW,
> Kai-Heng Feng also found the same issue before:
> https://lore.kernel.org/linux-edac/CAAd53p41Ku1m1rapeqb1xtD+kKuk+BaUW=dumuoF0ZO3GhFjFA@mail.gmail.com/T/#m5de16dce60a8c836ec235868c7c16e3fefad0cc2
>
> - Qiuxu
> From: Koba Ko <[email protected]>
> Sent: Friday, July 7, 2023 1:41 AM
> To: Zhuo, Qiuxu <[email protected]>
> ...
> > I made a patch below to skip all these absent memory controllers
> > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i
> > ntel.com/T/#u @Koba Ko, could you please verify the patch from the
> > link above on your workstation? Thanks!
>
> Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC
> DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
>
> Ref. 1,
> https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV
> lp?usp=sharing
Thanks for the verification.
Your log showed that the patch worked well.
-Qiuxu
On Fri, Jul 7, 2023 at 11:34 AM Zhuo, Qiuxu <[email protected]> wrote:
>
> > From: Koba Ko <[email protected]>
> > Sent: Friday, July 7, 2023 1:41 AM
> > To: Zhuo, Qiuxu <[email protected]>
> > ...
> > > I made a patch below to skip all these absent memory controllers
> > > https://lore.kernel.org/linux-edac/20230706134216.37044-1-qiuxu.zhuo@i
> > > ntel.com/T/#u @Koba Ko, could you please verify the patch from the
> > > link above on your workstation? Thanks!
> >
> > Here's dmesg patched(Ref. 1). didn't find the previous message, `EDAC
> > DEBUG: skx_get_dimm_attr: bad ranks = 3 (raw=0xffffffff)`
> >
> > Ref. 1,
> > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJKmV
> > lp?usp=sharing
>
> Thanks for the verification.
> Your log showed that the patch worked well.
please,
tested-by: [email protected]
>
> -Qiuxu
> From: Koba Ko <[email protected]>
> ...
> > > Ref. 1,
> > > https://drive.google.com/drive/folders/1xym9JgZZgaJ3EqtP4ccRcVeQYoJK
> > > mV
> > > lp?usp=sharing
> >
> > Thanks for the verification.
> > Your log showed that the patch worked well.
>
> please,
> tested-by: [email protected]
Sure.
On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote:
> tested-by: [email protected]
It's better if you put the tag in the correct format. There are a bunch
of tools which automatically add tags, but they only work if the tag is
in the correct format.
regards,
dan carpenter
> From: Dan Carpenter <[email protected]>
> ...
> Subject: Re: [PATCH v2] EDAC/i10nm: shift exponent is negative
>
> On Sun, Jul 09, 2023 at 11:42:22PM +0800, Koba Ko wrote:
> > tested-by: [email protected]
>
> It's better if you put the tag in the correct format. There are a bunch of tools
> which automatically add tags, but they only work if the tag is in the correct
> format.
I corrected Koba Ko's "Tested-by" tag and added it to my v2:
https://lore.kernel.org/linux-edac/[email protected]/
-Qiuxu