Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
apci_init()") introduced a bug that ghes_edac_register() would be invoked
before edac_init(). Because at that time, the bus "edac" hasn't been even
registered, this created sysfs /devices/mc0 instead of
/sys/devices/system/edac/mc/mc0 on an Ampere eMag server.
The solution is modularizing the ghes_edac driver.
Changelog:
v2:
- add acked-by tag of Patch 1 from Ard
- split the notifier patch
- add 2 patch to get regular drivers selected when ghes edac is not loaded
- fix an errno in igen6 driver
- add a patch to fix the sparse warning of ghes
- refine the commit logs
Jia He (7):
efi/cper: export several helpers for ghes edac to use
EDAC/ghes: Add notifier to report ghes_edac mem error
EDAC/ghes: Modularize ghes_edac driver to remove the dependency on
ghes
EDAC: Get chipset-specific edac drivers selected only when ghes_edac
is not enabled
EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac
is registered
apei/ghes: Use unrcu_pointer for cmpxchg
EDAC/igen6: Keep returned errno consistent when edac mc has been
enabled
drivers/acpi/apei/ghes.c | 72 ++++++++++++++++++++++++++++++++---
drivers/edac/Kconfig | 4 +-
drivers/edac/amd64_edac.c | 3 ++
drivers/edac/ghes_edac.c | 76 ++++++++++++++++++++++++++-----------
drivers/edac/igen6_edac.c | 2 +-
drivers/edac/pnd2_edac.c | 3 ++
drivers/edac/sb_edac.c | 3 ++
drivers/edac/skx_base.c | 3 ++
drivers/firmware/efi/cper.c | 3 ++
include/acpi/ghes.h | 37 ++++++------------
10 files changed, 150 insertions(+), 56 deletions(-)
--
2.25.1
ghes_estatus_caches should be add rcu annotation to avoid sparse warnings.
drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache *
drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache *
unrcu_pointer is to strip the __rcu in cmpxchg.
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Jia He <[email protected]>
---
drivers/acpi/apei/ghes.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9272d963b57d..92ae58f4f7bb 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -144,7 +144,7 @@ struct ghes_vendor_record_entry {
static struct gen_pool *ghes_estatus_pool;
static unsigned long ghes_estatus_pool_size_request;
-static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
+static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;
static int ghes_panic_timeout __read_mostly = 30;
@@ -834,8 +834,9 @@ static void ghes_estatus_cache_add(
}
/* new_cache must be put into array after its contents are written */
smp_wmb();
- if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
- slot_cache, new_cache) == slot_cache) {
+ if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
+ RCU_INITIALIZER(slot_cache),
+ RCU_INITIALIZER(new_cache)))) {
if (slot_cache)
call_rcu(&slot_cache->rcu, ghes_estatus_cache_rcu_free);
} else
--
2.25.1
Previous, there is only one edac can be registering in the EDAC core. After
ghes_edac is modularized, the registering of ghes devices may be deferred
when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
loading after ghes_edac is registered, which allow the edac drivers to
get selected in e.g. HPE platform.
Signed-off-by: Jia He <[email protected]>
---
drivers/acpi/apei/ghes.c | 14 ++++++++++++++
drivers/edac/amd64_edac.c | 2 +-
drivers/edac/ghes_edac.c | 2 ++
drivers/edac/pnd2_edac.c | 2 +-
drivers/edac/sb_edac.c | 2 +-
drivers/edac/skx_base.c | 2 +-
include/acpi/ghes.h | 4 ++++
7 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9c52183e3ad9..9272d963b57d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -95,6 +95,7 @@
#endif
static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
+static bool devs_registered;
static inline bool is_hest_type_generic_v2(struct ghes *ghes)
{
@@ -1266,6 +1267,18 @@ static int apei_sdei_unregister_ghes(struct ghes *ghes)
return sdei_unregister_ghes(ghes);
}
+void set_ghes_devs_registered(bool flag)
+{
+ devs_registered = flag;
+}
+EXPORT_SYMBOL_GPL(set_ghes_devs_registered);
+
+bool ghes_devs_registered(void)
+{
+ return devs_registered;
+}
+EXPORT_SYMBOL_GPL(ghes_devs_registered);
+
static int ghes_probe(struct platform_device *ghes_dev)
{
struct acpi_hest_generic *generic;
@@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device *ghes_dev)
platform_set_drvdata(ghes_dev, ghes);
ghes->dev = &ghes_dev->dev;
+ set_ghes_devs_registered(false);
mutex_lock(&ghes_devs_mutex);
list_add_tail(&ghes->elist, &ghes_devs);
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index e4eaf6668feb..f48507fa7b94 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -4329,7 +4329,7 @@ static int __init amd64_edac_init(void)
int err = -ENODEV;
int i;
- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;
owner = edac_get_owner();
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 3bdf8469882d..d26644b3bc47 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -564,6 +564,7 @@ static int __init ghes_edac_init(void)
list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
ghes_edac_register(g->dev);
}
+ set_ghes_devs_registered(true);
return 0;
}
@@ -576,6 +577,7 @@ static void __exit ghes_edac_exit(void)
list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
ghes_edac_unregister(g);
}
+ set_ghes_devs_registered(false);
}
module_exit(ghes_edac_exit);
diff --git a/drivers/edac/pnd2_edac.c b/drivers/edac/pnd2_edac.c
index 73f2ba0e64e3..66d89d00c4b3 100644
--- a/drivers/edac/pnd2_edac.c
+++ b/drivers/edac/pnd2_edac.c
@@ -1528,7 +1528,7 @@ static int __init pnd2_init(void)
edac_dbg(2, "\n");
- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;
owner = edac_get_owner();
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 1d0520a16840..a3a89e366a74 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3506,7 +3506,7 @@ static int __init sbridge_init(void)
edac_dbg(2, "\n");
- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;
owner = edac_get_owner();
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c
index fe267f8543f5..efa1ae79c35a 100644
--- a/drivers/edac/skx_base.c
+++ b/drivers/edac/skx_base.c
@@ -653,7 +653,7 @@ static int __init skx_init(void)
edac_dbg(2, "\n");
- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;
owner = edac_get_owner();
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 150c0b9500d6..21b9d4d4c3e9 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -72,8 +72,12 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
*/
void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
struct list_head *ghes_get_devices(bool force);
+bool ghes_devs_registered(void);
+void set_ghes_devs_registered(bool flag);
#else
static inline struct list_head *ghes_get_devices(bool force) { return NULL; }
+static inline bool ghes_devs_registered(void) { return false; }
+static inline void set_ghes_devs_registered(bool flag) { return; }
#endif
int ghes_estatus_pool_init(int num_ghes);
--
2.25.1
From: Jia He
> Sent: 17 August 2022 15:35
>
> ghes_estatus_caches should be add rcu annotation to avoid sparse warnings.
> drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types in comparison expression
> (different address spaces):
> drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
> drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache *
> drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types in comparison expression
> (different address spaces):
> drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
> drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache *
>
> unrcu_pointer is to strip the __rcu in cmpxchg.
>
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 9272d963b57d..92ae58f4f7bb 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -144,7 +144,7 @@ struct ghes_vendor_record_entry {
> static struct gen_pool *ghes_estatus_pool;
> static unsigned long ghes_estatus_pool_size_request;
>
> -static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> +static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> static atomic_t ghes_estatus_cache_alloced;
>
> static int ghes_panic_timeout __read_mostly = 30;
> @@ -834,8 +834,9 @@ static void ghes_estatus_cache_add(
> }
> /* new_cache must be put into array after its contents are written */
> smp_wmb();
> - if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
> - slot_cache, new_cache) == slot_cache) {
> + if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
> + RCU_INITIALIZER(slot_cache),
> + RCU_INITIALIZER(new_cache)))) {
Did you test this?
There seems to be an == missing.
David
> if (slot_cache)
> call_rcu(&slot_cache->rcu, ghes_estatus_cache_rcu_free);
> } else
> --
> 2.25.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
> -----Original Message-----
> From: David Laight <[email protected]>
> Sent: Wednesday, August 17, 2022 11:39 PM
> To: Justin He <[email protected]>; Ard Biesheuvel <[email protected]>; Len
> Brown <[email protected]>; James Morse <[email protected]>; Tony
> Luck <[email protected]>; Borislav Petkov <[email protected]>; Mauro
> Carvalho Chehab <[email protected]>; Robert Richter <[email protected]>;
> Robert Moore <[email protected]>; Qiuxu Zhuo
> <[email protected]>; Yazen Ghannam <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected];
> [email protected]; nd <[email protected]>; kernel test robot <[email protected]>
> Subject: RE: [PATCH v2 6/7] apei/ghes: Use unrcu_pointer for cmpxchg
>
> From: Jia He
> > Sent: 17 August 2022 15:35
> >
> > ghes_estatus_caches should be add rcu annotation to avoid sparse
> warnings.
> > drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types
> > in comparison expression (different address spaces):
> > drivers/acpi/apei/ghes.c:733:25: sparse: struct
> ghes_estatus_cache [noderef] __rcu *
> > drivers/acpi/apei/ghes.c:733:25: sparse: struct
> ghes_estatus_cache *
> > drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types
> > in comparison expression (different address spaces):
> > drivers/acpi/apei/ghes.c:813:25: sparse: struct
> ghes_estatus_cache [noderef] __rcu *
> > drivers/acpi/apei/ghes.c:813:25: sparse: struct
> ghes_estatus_cache *
> >
> > unrcu_pointer is to strip the __rcu in cmpxchg.
> >
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > drivers/acpi/apei/ghes.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> > 9272d963b57d..92ae58f4f7bb 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -144,7 +144,7 @@ struct ghes_vendor_record_entry { static struct
> > gen_pool *ghes_estatus_pool; static unsigned long
> > ghes_estatus_pool_size_request;
> >
> > -static struct ghes_estatus_cache
> > *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> > +static struct ghes_estatus_cache __rcu
> > +*ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> > static atomic_t ghes_estatus_cache_alloced;
> >
> > static int ghes_panic_timeout __read_mostly = 30; @@ -834,8 +834,9
> @@
> > static void ghes_estatus_cache_add(
> > }
> > /* new_cache must be put into array after its contents are written */
> > smp_wmb();
> > - if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
> > - slot_cache, new_cache) == slot_cache) {
> > + if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
> > + RCU_INITIALIZER(slot_cache),
> > + RCU_INITIALIZER(new_cache)))) {
>
> Did you test this?
> There seems to be an == missing.
Sorry about it,
--
Cheers,
Justin (Jia He)
On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> Previous, there is only one edac can be registering in the EDAC core. After
> ghes_edac is modularized, the registering of ghes devices may be deferred
> when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
> loading after ghes_edac is registered, which allow the edac drivers to
> get selected in e.g. HPE platform.
...
> @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> *ghes_dev)
> platform_set_drvdata(ghes_dev, ghes);
>
> ghes->dev = &ghes_dev->dev;
> + set_ghes_devs_registered(false);
This does not look right to me.
The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
where:
- ghes-present is latched on ghes_probe()
- ghes-preferred is true if the platform-check passes.
ghes_get_device() introduced in the previous patch works as the
ghes-preferred check.
We cannot use ghes_edac registration as the ghes-present check in
this scheme since it is deferred to module_init().
I'd suggest the following changes:
- Update ghes_get_device() to check a flag latched on ghes_probe().
- Remove 'force' argument from ghes_get_device(). ghes_edac_init()
is too late to set this flag. Also, other edac drivers should not be able
to control this flag. I think this force_load kernel option needs to
belong to ghes with this scheme.
- ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers,
which should be internal to ghes. ghes_get_device() may be renamed
to something like ghes_edac_preferred() which returns true / false.
Toshi
Hi Toshi,
> -----Original Message-----
> From: Kani, Toshi <[email protected]>
> Sent: Friday, August 19, 2022 9:30 AM
> To: Justin He <[email protected]>; Ard Biesheuvel <[email protected]>; Len
> Brown <[email protected]>; James Morse <[email protected]>; Tony Luck
> <[email protected]>; Borislav Petkov <[email protected]>; Mauro Carvalho
> Chehab <[email protected]>; Robert Richter <[email protected]>; Robert
> Moore <[email protected]>; Qiuxu Zhuo <[email protected]>;
> Yazen Ghannam <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected]; nd <[email protected]>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from
> loading after ghes_edac is registered
>
> On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> > Previous, there is only one edac can be registering in the EDAC core.
> > After ghes_edac is modularized, the registering of ghes devices may be
> > deferred when ghes_edac is loaded. Prevent the chipset-specific edac
> > drivers from loading after ghes_edac is registered, which allow the
> > edac drivers to get selected in e.g. HPE platform.
> ...
> > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > *ghes_dev)
> > platform_set_drvdata(ghes_dev, ghes);
> >
> > ghes->dev = &ghes_dev->dev;
> > + set_ghes_devs_registered(false);
>
> This does not look right to me.
>
> The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> where:
> - ghes-present is latched on ghes_probe()
> - ghes-preferred is true if the platform-check passes.
>
> ghes_get_device() introduced in the previous patch works as the
> ghes-preferred check.
>
> We cannot use ghes_edac registration as the ghes-present check in this
> scheme since it is deferred to module_init().
What is the logic for ghes-present check? In this patch, I assumed it is equal to
"ghes_edac devices have been registered". Seems it is not correct.
But should we consider the case as follows:
What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?
>
> I'd suggest the following changes:
> - Update ghes_get_device() to check a flag latched on ghes_probe().
Still need your elaborating about the details of the flag. i.e. When is this flag
latched? When is it unlatched?
> - Remove 'force' argument from ghes_get_device(). ghes_edac_init() is too
> late to set this flag. Also, other edac drivers should not be able to control this
> flag. I think this force_load kernel option needs to belong to ghes with this
> scheme.
Agree, I will remove force_load in ghes_get_device(), and put all check/update of
force_load in ghes
> - ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers, which
> should be internal to ghes. ghes_get_device() may be renamed to something
> like ghes_edac_preferred() which returns true / false.
>
Okay, agree
--
Cheers,
Justin (Jia He)
On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > > *ghes_dev)
> > > platform_set_drvdata(ghes_dev, ghes);
> > >
> > > ghes->dev = &ghes_dev->dev;
> > > + set_ghes_devs_registered(false);
> >
> > This does not look right to me.
> >
> > The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> > where:
> > - ghes-present is latched on ghes_probe()
> > - ghes-preferred is true if the platform-check passes.
> >
> > ghes_get_device() introduced in the previous patch works as the
> > ghes-preferred check.
> >
> > We cannot use ghes_edac registration as the ghes-present check in this
> > scheme since it is deferred to module_init().
>
> What is the logic for ghes-present check? In this patch, I assumed it is equal to
> "ghes_edac devices have been registered". Seems it is not correct.
Using (ghes_edac-registered) is a wrong check in this scheme
since ghes_edac registration is deferred. This check is fine in
the current scheme since ghes_edac gets registered before
any other chipset-specific edac drivers.
> But should we consider the case as follows:
> What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
> sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?
No. The point is that ghes_edac driver needs to be selected,
"regardless of the module ordering", on a system with GHES
present & preferred.
Note that this new scheme leads to the following condition
change:
- Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
- New: (ghes-present) && (ghes-preferred)
The option I suggested previously keeps the current condition,
but this new scheme does not for better modularity.
What this means is that if ghes_edac is not enabled (but ghes
is enabled) on a system with GHES present & preferred, no edac
driver gets registered. This change is fine from my (HPE) perspective
and should be fine for other GHES systems. GHES systems have
chipset-specific edac driver in FW. OS-based chipset-specific edac
driver is not necessary and may lead to a conflict of chipset register
ownership.
> > I'd suggest the following changes:
> > - Update ghes_get_device() to check a flag latched on ghes_probe().
>
> Still need your elaborating about the details of the flag. i.e. When is this flag
> latched? When is it unlatched?
This flag is a static variable, say ghes_present, which is set to false initially.
ghes_probe() sets it to true. ghes_edac_preferred() (aka. ghes_get_device)
checks this flag at beginning and returns false if this flag is false. It does not
get unlatched since ACPI GHES table is static.
Toshi
On Fri, Aug 19, 2022 at 05:48:39PM +0000, Kani, Toshi wrote:
> This flag is a static variable, say ghes_present, which is set to
> false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> (aka. ghes_get_device) checks this flag at beginning and returns false
> if this flag is false. It does not get unlatched since ACPI GHES table
> is static.
What is that flag needed for at all?
There are two possibilities:
1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
ghes_get_devices() to know when to load.
2. ghes_probe() fails and that is caught during platform testing of all
those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.
No need for funky flags whatsoever.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Friday, August 19, 2022 12:50 PM, Borislav Petkov wrote:
> > 3. ghes_probe() is not called,
>
> When is ghes_probe() not called on ARM?
When the system does not implement ACPI GHES table,
which I suppose is most of the case on ARM.
Toshi
> -----Original Message-----
> From: Kani, Toshi <[email protected]>
> Sent: Friday, August 19, 2022 12:49 PM
> To: Justin He <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected]; nd
> <[email protected]>; Ard Biesheuvel <[email protected]>; Len Brown
> <[email protected]>; James Morse <[email protected]>; Tony Luck
> <[email protected]>; Borislav Petkov <[email protected]>; Mauro Carvalho
> Chehab <[email protected]>; Robert Richter <[email protected]>; Robert
> Moore <[email protected]>; Qiuxu Zhuo <[email protected]>;
> Yazen Ghannam <[email protected]>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac
> from loading after ghes_edac is registered
>
> On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct
> platform_device
> > > > *ghes_dev)
> > > > platform_set_drvdata(ghes_dev, ghes);
> > > >
> > > > ghes->dev = &ghes_dev->dev;
> > > > + set_ghes_devs_registered(false);
> > >
> > > This does not look right to me.
> > >
> > > The condition of using ghes_edac is (ghes-present) && (ghes-
> preferred),
> > > where:
> > > - ghes-present is latched on ghes_probe()
> > > - ghes-preferred is true if the platform-check passes.
> > >
> > > ghes_get_device() introduced in the previous patch works as the
> > > ghes-preferred check.
> > >
> > > We cannot use ghes_edac registration as the ghes-present check in
> this
> > > scheme since it is deferred to module_init().
> >
> > What is the logic for ghes-present check? In this patch, I assumed it
> is equal to
> > "ghes_edac devices have been registered". Seems it is not correct.
>
> Using (ghes_edac-registered) is a wrong check in this scheme
> since ghes_edac registration is deferred. This check is fine in
> the current scheme since ghes_edac gets registered before
> any other chipset-specific edac drivers.
>
> > But should we consider the case as follows:
> > What if sbridge_init () is invoked before ghes_edac_init()? i.e.
> Should we get
> > sb_edac driver selected when ghes_edac is not loaded yet (e.g. on
> HPE)?
>
> No. The point is that ghes_edac driver needs to be selected,
> "regardless of the module ordering", on a system with GHES
> present & preferred.
>
> Note that this new scheme leads to the following condition
> change:
> - Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
> - New: (ghes-present) && (ghes-preferred)
>
> The option I suggested previously keeps the current condition,
> but this new scheme does not for better modularity.
>
> What this means is that if ghes_edac is not enabled (but ghes
> is enabled) on a system with GHES present & preferred, no edac
> driver gets registered. This change is fine from my (HPE) perspective
> and should be fine for other GHES systems. GHES systems have
> chipset-specific edac driver in FW. OS-based chipset-specific edac
> driver is not necessary and may lead to a conflict of chipset register
> ownership.
Currently, running with this on the kernel command line
ghes.disable
causes the ACPI ghes driver to quit early in acpi_ghes_init():
/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
* compatible with existing boot arg use cases.
*/
bool ghes_disable;
module_param_named(disable, ghes_disable, bool, 0);
which results in the skx_edac module assuming it should run:
[ 33.628140] calling skx_init+0x0/0xe5a [skx_edac] @ 1444
[ 33.628531] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:36:0a.0 (INTERRUPT)
[ 33.641432] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:36:0c.0 (INTERRUPT)
[ 33.653256] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
[ 33.665055] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
[ 33.676801] initcall skx_init+0x0/0xe5a [skx_edac] returned 0 after 36343 usecs
We might need to differentiate between the system ROM really not
offering GHES vs. the ghes module not running.
On Fri, Aug 19, 2022 at 06:46:34PM +0000, Kani, Toshi wrote:
> 3. ghes_probe() is not called,
When is ghes_probe() not called on ARM?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Friday, August 19, 2022 12:30 PM, Borislav Petkov wrote:
> > This flag is a static variable, say ghes_present, which is set to
> > false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> > (aka. ghes_get_device) checks this flag at beginning and returns false
> > if this flag is false. It does not get unlatched since ACPI GHES table
> > is static.
>
> What is that flag needed for at all?
Because ghes_get_device() always returns &ghes_devs on Arm,
which is !NULL. It does not check if GHES is present.
> There are two possibilities:
>
> 1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
> ghes_get_devices() to know when to load.
>
> 2. ghes_probe() fails and that is caught during platform testing of all
> those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.
>
> No need for funky flags whatsoever.
3. ghes_probe() is not called, but ghes_get_device() is called from
other edac drivers.
Toshi
On Fri, Aug 19, 2022 at 06:57:11PM +0000, Elliott, Robert (Servers) wrote:
> We might need to differentiate between the system ROM really not
> offering GHES vs. the ghes module not running.
All detectable in ghes_get_devices().
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Aug 19, 2022 at 06:53:41PM +0000, Kani, Toshi wrote:
> When the system does not implement ACPI GHES table,
> which I suppose is most of the case on ARM.
That should be detected in ghes_get_devices() - just like on x86.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Friday, August 19, 2022 1:31 PM, Borislav Petkov wrote:
> > When the system does not implement ACPI GHES table,
> > which I suppose is most of the case on ARM.
>
> That should be detected in ghes_get_devices() - just like on x86.
Agreed. And that is the check to ghes_present flag...
Toshi