UBSAN complains this error
~~~
UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:369:16
shift exponent -66 is negative
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x70
dump_stack+0x10/0x20
__ubsan_handle_shift_out_of_bounds+0x1ac/0x360
skx_get_dimm_info.cold+0x91/0x175 [i10nm_edac]
? kvasprintf_const+0x2a/0xb0
i10nm_get_dimm_config+0x23c/0x340 [i10nm_edac]
skx_register_mci+0x139/0x1e0 [i10nm_edac]
? __pfx_i10nm_get_dimm_config+0x10/0x10 [i10nm_edac]
i10nm_init+0x403/0xd10 [i10nm_edac]
? __pfx_i10nm_init+0x10/0x10 [i10nm_edac]
do_one_initcall+0x5b/0x250
do_init_module+0x68/0x260
load_module+0xb45/0xcd0
? kernel_read_file+0x2a4/0x320
__do_sys_finit_module+0xc4/0x140
? __do_sys_finit_module+0xc4/0x140
__x64_sys_finit_module+0x18/0x30
do_syscall_64+0x58/0x90
? syscall_exit_to_user_mode+0x29/0x50
? do_syscall_64+0x67/0x90
? syscall_exit_to_user_mode+0x29/0x50
? do_syscall_64+0x67/0x90
? do_syscall_64+0x67/0x90
? __flush_smp_call_function_queue+0x122/0x1f0
? exit_to_user_mode_prepare+0x30/0xb0
? irqentry_exit_to_user_mode+0x9/0x20
? irqentry_exit+0x43/0x50
? sysvec_call_function+0x4b/0xd0
entry_SYSCALL_64_after_hwframe+0x72/0xdc
~~~
when get rows, cols and ranks, the returned error value doesn't be
handled.
check the return value is EINVAL, if yes, directly return 0.
Signed-off-by: Koba Ko <[email protected]>
---
drivers/edac/skx_common.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/drivers/edac/skx_common.c b/drivers/edac/skx_common.c
index 2a00e0503f0d5..98baed1617c9f 100644
--- a/drivers/edac/skx_common.c
+++ b/drivers/edac/skx_common.c
@@ -351,6 +351,8 @@ int skx_get_dimm_info(u32 mtr, u32 mcmtr, u32 amap, struct dimm_info *dimm,
ranks = numrank(mtr);
rows = numrow(mtr);
cols = imc->hbm_mc ? 6 : numcol(mtr);
+ if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
+ return 0;
if (imc->hbm_mc) {
banks = 32;
--
2.34.1
> ranks = numrank(mtr);
> rows = numrow(mtr);
> cols = imc->hbm_mc ? 6 : numcol(mtr);
> + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> + return 0;
This seems to be just hiding the real problem that a DIMM was found
with some number of ranks, rows, or columns that the EDAC driver
didn't expect to see. Your fix makes the driver skip over this DIMM.
Can you build your kernel with CONFIG_EDAC_DEBUG=y and see
what messages you get from this code:
static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
int minval, int maxval, const char *name)
{
u32 val = GET_BITFIELD(reg, lobit, hibit);
if (val < minval || val > maxval) {
edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
return -EINVAL;
}
-Tony
hi Luck,
I agree with your points
is it expected to shift with negative?
Thanks
Koba Ko
On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <[email protected]> wrote:
>
> > ranks = numrank(mtr);
> > rows = numrow(mtr);
> > cols = imc->hbm_mc ? 6 : numcol(mtr);
> > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> > + return 0;
>
> This seems to be just hiding the real problem that a DIMM was found
> with some number of ranks, rows, or columns that the EDAC driver
> didn't expect to see. Your fix makes the driver skip over this DIMM.
>
> Can you build your kernel with CONFIG_EDAC_DEBUG=y and see
> what messages you get from this code:
>
> static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
> int minval, int maxval, const char *name)
> {
> u32 val = GET_BITFIELD(reg, lobit, hibit);
>
> if (val < minval || val > maxval) {
> edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> return -EINVAL;
> }
>
> -Tony
>
>
Hi Ko,
I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it.
As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM.
As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result?
@Luck, Tony, Perhaps we may turn the debug print
edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
to an error-print explicitly
skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see.
- Qiuxu
> From: Koba Ko <[email protected]>
> Sent: Thursday, June 29, 2023 11:53 AM
> To: Luck, Tony <[email protected]>
> Cc: Borislav Petkov <[email protected]>; James Morse <[email protected]>;
> Mauro Carvalho Chehab <[email protected]>; Robert Richter
> <[email protected]>; [email protected]; [email protected]
> Subject: Re: [PATCH] EDAC/i10nm: shift exponent is negative
>
> hi Luck,
> I agree with your points
> is it expected to shift with negative?
>
> Thanks
> Koba Ko
>
> On Thu, Jun 29, 2023 at 12:41 AM Luck, Tony <[email protected]> wrote:
> >
> > > ranks = numrank(mtr);
> > > rows = numrow(mtr);
> > > cols = imc->hbm_mc ? 6 : numcol(mtr);
> > > + if (ranks == -EINVAL || rows == -EINVAL || cols == -EINVAL)
> > > + return 0;
> >
> > This seems to be just hiding the real problem that a DIMM was found
> > with some number of ranks, rows, or columns that the EDAC driver
> > didn't expect to see. Your fix makes the driver skip over this DIMM.
> >
> > Can you build your kernel with CONFIG_EDAC_DEBUG=y and see what
> > messages you get from this code:
> >
> > static int skx_get_dimm_attr(u32 reg, int lobit, int hibit, int add,
> > int minval, int maxval, const char *name)
> > {
> > u32 val = GET_BITFIELD(reg, lobit, hibit);
> >
> > if (val < minval || val > maxval) {
> > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> > return -EINVAL;
> > }
> >
> > -Tony
> >
> >
> I don't agree with simply skipping over a DIMM even EDAC doesn't expect to see it.
> As the EDAC driver can still report errors for this DIMM once there are errors that occur in this DIMM.
>
> As per Tony's suggestion, could you test your kernel with CONFIG_EDAC_DEBUG=y and see the result?
>
> @Luck, Tony, Perhaps we may turn the debug print
>
> edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
>
> to an error-print explicitly
>
> skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
>
> Let the user have the chance to notice there is a DIMM that EDAC doesn't expect to see.
We need both. Changing that debug message to a real error message will let the user
know that EDAC doesn't recognize this DIMM (and will provide the information for you
or me to fix the driver).
But we also need Ko's fix - because it makes no sense to just use that negative shift
and pretend that EDAC knows how to handle this DIMM.
-Tony
> From: Luck, Tony <[email protected]>
> Sent: Friday, June 30, 2023 12:12 AM
> To: Zhuo, Qiuxu <[email protected]>; Koba Ko <[email protected]>
> Cc: Borislav Petkov <[email protected]>; James Morse <[email protected]>;
> Mauro Carvalho Chehab <[email protected]>; Robert Richter
> <[email protected]>; [email protected]; [email protected]
> Subject: RE: [PATCH] EDAC/i10nm: shift exponent is negative
>
> > I don't agree with simply skipping over a DIMM even EDAC doesn't expect
> to see it.
> > As the EDAC driver can still report errors for this DIMM once there are
> errors that occur in this DIMM.
> >
> > As per Tony's suggestion, could you test your kernel with
> CONFIG_EDAC_DEBUG=y and see the result?
> >
> > @Luck, Tony, Perhaps we may turn the debug print
> >
> > edac_dbg(2, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> >
> > to an error-print explicitly
> >
> > skx_printk(KERN_ERR, "bad %s = %d (raw=0x%x)\n", name, val, reg);
> >
> > Let the user have the chance to notice there is a DIMM that EDAC doesn't
> expect to see.
>
> We need both. Changing that debug message to a real error message will let
> the user know that EDAC doesn't recognize this DIMM (and will provide the
> information for you or me to fix the driver).
>
> But we also need Ko's fix - because it makes no sense to just use that
> negative shift and pretend that EDAC knows how to handle this DIMM.
>
OK.
@Koba Ko, could you make a new patch with Tony's suggestion? Thanks!
-Qiuxu