2022-08-17 14:53:48

by Justin He

[permalink] [raw]
Subject: [PATCH v2 0/7] Modularize ghes_edac driver

Commit dc4e8c07e9e2 ("ACPI: APEI: explicit init of HEST and GHES in
apci_init()") introduced a bug that ghes_edac_register() would be invoked
before edac_init(). Because at that time, the bus "edac" hasn't been even
registered, this created sysfs /devices/mc0 instead of
/sys/devices/system/edac/mc/mc0 on an Ampere eMag server.

The solution is modularizing the ghes_edac driver.

Changelog:
v2:
- add acked-by tag of Patch 1 from Ard
- split the notifier patch
- add 2 patch to get regular drivers selected when ghes edac is not loaded
- fix an errno in igen6 driver
- add a patch to fix the sparse warning of ghes
- refine the commit logs

Jia He (7):
efi/cper: export several helpers for ghes edac to use
EDAC/ghes: Add notifier to report ghes_edac mem error
EDAC/ghes: Modularize ghes_edac driver to remove the dependency on
ghes
EDAC: Get chipset-specific edac drivers selected only when ghes_edac
is not enabled
EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac
is registered
apei/ghes: Use unrcu_pointer for cmpxchg
EDAC/igen6: Keep returned errno consistent when edac mc has been
enabled

drivers/acpi/apei/ghes.c | 72 ++++++++++++++++++++++++++++++++---
drivers/edac/Kconfig | 4 +-
drivers/edac/amd64_edac.c | 3 ++
drivers/edac/ghes_edac.c | 76 ++++++++++++++++++++++++++-----------
drivers/edac/igen6_edac.c | 2 +-
drivers/edac/pnd2_edac.c | 3 ++
drivers/edac/sb_edac.c | 3 ++
drivers/edac/skx_base.c | 3 ++
drivers/firmware/efi/cper.c | 3 ++
include/acpi/ghes.h | 37 ++++++------------
10 files changed, 150 insertions(+), 56 deletions(-)

--
2.25.1


2022-08-17 14:53:55

by Justin He

[permalink] [raw]
Subject: [PATCH v2 6/7] apei/ghes: Use unrcu_pointer for cmpxchg

ghes_estatus_caches should be add rcu annotation to avoid sparse warnings.
drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache *
drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache *

unrcu_pointer is to strip the __rcu in cmpxchg.

Reported-by: kernel test robot <[email protected]>
Signed-off-by: Jia He <[email protected]>
---
drivers/acpi/apei/ghes.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9272d963b57d..92ae58f4f7bb 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -144,7 +144,7 @@ struct ghes_vendor_record_entry {
static struct gen_pool *ghes_estatus_pool;
static unsigned long ghes_estatus_pool_size_request;

-static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
+static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
static atomic_t ghes_estatus_cache_alloced;

static int ghes_panic_timeout __read_mostly = 30;
@@ -834,8 +834,9 @@ static void ghes_estatus_cache_add(
}
/* new_cache must be put into array after its contents are written */
smp_wmb();
- if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
- slot_cache, new_cache) == slot_cache) {
+ if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
+ RCU_INITIALIZER(slot_cache),
+ RCU_INITIALIZER(new_cache)))) {
if (slot_cache)
call_rcu(&slot_cache->rcu, ghes_estatus_cache_rcu_free);
} else
--
2.25.1

2022-08-17 15:19:52

by Justin He

[permalink] [raw]
Subject: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

Previous, there is only one edac can be registering in the EDAC core. After
ghes_edac is modularized, the registering of ghes devices may be deferred
when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
loading after ghes_edac is registered, which allow the edac drivers to
get selected in e.g. HPE platform.

Signed-off-by: Jia He <[email protected]>
---
drivers/acpi/apei/ghes.c | 14 ++++++++++++++
drivers/edac/amd64_edac.c | 2 +-
drivers/edac/ghes_edac.c | 2 ++
drivers/edac/pnd2_edac.c | 2 +-
drivers/edac/sb_edac.c | 2 +-
drivers/edac/skx_base.c | 2 +-
include/acpi/ghes.h | 4 ++++
7 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 9c52183e3ad9..9272d963b57d 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -95,6 +95,7 @@
#endif

static ATOMIC_NOTIFIER_HEAD(ghes_report_chain);
+static bool devs_registered;

static inline bool is_hest_type_generic_v2(struct ghes *ghes)
{
@@ -1266,6 +1267,18 @@ static int apei_sdei_unregister_ghes(struct ghes *ghes)
return sdei_unregister_ghes(ghes);
}

+void set_ghes_devs_registered(bool flag)
+{
+ devs_registered = flag;
+}
+EXPORT_SYMBOL_GPL(set_ghes_devs_registered);
+
+bool ghes_devs_registered(void)
+{
+ return devs_registered;
+}
+EXPORT_SYMBOL_GPL(ghes_devs_registered);
+
static int ghes_probe(struct platform_device *ghes_dev)
{
struct acpi_hest_generic *generic;
@@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device *ghes_dev)
platform_set_drvdata(ghes_dev, ghes);

ghes->dev = &ghes_dev->dev;
+ set_ghes_devs_registered(false);

mutex_lock(&ghes_devs_mutex);
list_add_tail(&ghes->elist, &ghes_devs);
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index e4eaf6668feb..f48507fa7b94 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -4329,7 +4329,7 @@ static int __init amd64_edac_init(void)
int err = -ENODEV;
int i;

- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;

owner = edac_get_owner();
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 3bdf8469882d..d26644b3bc47 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -564,6 +564,7 @@ static int __init ghes_edac_init(void)
list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
ghes_edac_register(g->dev);
}
+ set_ghes_devs_registered(true);

return 0;
}
@@ -576,6 +577,7 @@ static void __exit ghes_edac_exit(void)
list_for_each_entry_safe(g, g_tmp, ghes_devs, elist) {
ghes_edac_unregister(g);
}
+ set_ghes_devs_registered(false);
}
module_exit(ghes_edac_exit);

diff --git a/drivers/edac/pnd2_edac.c b/drivers/edac/pnd2_edac.c
index 73f2ba0e64e3..66d89d00c4b3 100644
--- a/drivers/edac/pnd2_edac.c
+++ b/drivers/edac/pnd2_edac.c
@@ -1528,7 +1528,7 @@ static int __init pnd2_init(void)

edac_dbg(2, "\n");

- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;

owner = edac_get_owner();
diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 1d0520a16840..a3a89e366a74 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -3506,7 +3506,7 @@ static int __init sbridge_init(void)

edac_dbg(2, "\n");

- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;

owner = edac_get_owner();
diff --git a/drivers/edac/skx_base.c b/drivers/edac/skx_base.c
index fe267f8543f5..efa1ae79c35a 100644
--- a/drivers/edac/skx_base.c
+++ b/drivers/edac/skx_base.c
@@ -653,7 +653,7 @@ static int __init skx_init(void)

edac_dbg(2, "\n");

- if (ghes_get_devices(0))
+ if (ghes_get_devices(0) && ghes_devs_registered())
return -EBUSY;

owner = edac_get_owner();
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 150c0b9500d6..21b9d4d4c3e9 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -72,8 +72,12 @@ int ghes_register_vendor_record_notifier(struct notifier_block *nb);
*/
void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
struct list_head *ghes_get_devices(bool force);
+bool ghes_devs_registered(void);
+void set_ghes_devs_registered(bool flag);
#else
static inline struct list_head *ghes_get_devices(bool force) { return NULL; }
+static inline bool ghes_devs_registered(void) { return false; }
+static inline void set_ghes_devs_registered(bool flag) { return; }
#endif

int ghes_estatus_pool_init(int num_ghes);
--
2.25.1

2022-08-17 15:55:22

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 6/7] apei/ghes: Use unrcu_pointer for cmpxchg

From: Jia He
> Sent: 17 August 2022 15:35
>
> ghes_estatus_caches should be add rcu annotation to avoid sparse warnings.
> drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types in comparison expression
> (different address spaces):
> drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
> drivers/acpi/apei/ghes.c:733:25: sparse: struct ghes_estatus_cache *
> drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types in comparison expression
> (different address spaces):
> drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache [noderef] __rcu *
> drivers/acpi/apei/ghes.c:813:25: sparse: struct ghes_estatus_cache *
>
> unrcu_pointer is to strip the __rcu in cmpxchg.
>
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Jia He <[email protected]>
> ---
> drivers/acpi/apei/ghes.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 9272d963b57d..92ae58f4f7bb 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -144,7 +144,7 @@ struct ghes_vendor_record_entry {
> static struct gen_pool *ghes_estatus_pool;
> static unsigned long ghes_estatus_pool_size_request;
>
> -static struct ghes_estatus_cache *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> +static struct ghes_estatus_cache __rcu *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> static atomic_t ghes_estatus_cache_alloced;
>
> static int ghes_panic_timeout __read_mostly = 30;
> @@ -834,8 +834,9 @@ static void ghes_estatus_cache_add(
> }
> /* new_cache must be put into array after its contents are written */
> smp_wmb();
> - if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
> - slot_cache, new_cache) == slot_cache) {
> + if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
> + RCU_INITIALIZER(slot_cache),
> + RCU_INITIALIZER(new_cache)))) {

Did you test this?
There seems to be an == missing.

David

> if (slot_cache)
> call_rcu(&slot_cache->rcu, ghes_estatus_cache_rcu_free);
> } else
> --
> 2.25.1

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2022-08-18 01:36:09

by Justin He

[permalink] [raw]
Subject: RE: [PATCH v2 6/7] apei/ghes: Use unrcu_pointer for cmpxchg



> -----Original Message-----
> From: David Laight <[email protected]>
> Sent: Wednesday, August 17, 2022 11:39 PM
> To: Justin He <[email protected]>; Ard Biesheuvel <[email protected]>; Len
> Brown <[email protected]>; James Morse <[email protected]>; Tony
> Luck <[email protected]>; Borislav Petkov <[email protected]>; Mauro
> Carvalho Chehab <[email protected]>; Robert Richter <[email protected]>;
> Robert Moore <[email protected]>; Qiuxu Zhuo
> <[email protected]>; Yazen Ghannam <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected];
> [email protected]; nd <[email protected]>; kernel test robot <[email protected]>
> Subject: RE: [PATCH v2 6/7] apei/ghes: Use unrcu_pointer for cmpxchg
>
> From: Jia He
> > Sent: 17 August 2022 15:35
> >
> > ghes_estatus_caches should be add rcu annotation to avoid sparse
> warnings.
> > drivers/acpi/apei/ghes.c:733:25: sparse: sparse: incompatible types
> > in comparison expression (different address spaces):
> > drivers/acpi/apei/ghes.c:733:25: sparse: struct
> ghes_estatus_cache [noderef] __rcu *
> > drivers/acpi/apei/ghes.c:733:25: sparse: struct
> ghes_estatus_cache *
> > drivers/acpi/apei/ghes.c:813:25: sparse: sparse: incompatible types
> > in comparison expression (different address spaces):
> > drivers/acpi/apei/ghes.c:813:25: sparse: struct
> ghes_estatus_cache [noderef] __rcu *
> > drivers/acpi/apei/ghes.c:813:25: sparse: struct
> ghes_estatus_cache *
> >
> > unrcu_pointer is to strip the __rcu in cmpxchg.
> >
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Jia He <[email protected]>
> > ---
> > drivers/acpi/apei/ghes.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index
> > 9272d963b57d..92ae58f4f7bb 100644
> > --- a/drivers/acpi/apei/ghes.c
> > +++ b/drivers/acpi/apei/ghes.c
> > @@ -144,7 +144,7 @@ struct ghes_vendor_record_entry { static struct
> > gen_pool *ghes_estatus_pool; static unsigned long
> > ghes_estatus_pool_size_request;
> >
> > -static struct ghes_estatus_cache
> > *ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> > +static struct ghes_estatus_cache __rcu
> > +*ghes_estatus_caches[GHES_ESTATUS_CACHES_SIZE];
> > static atomic_t ghes_estatus_cache_alloced;
> >
> > static int ghes_panic_timeout __read_mostly = 30; @@ -834,8 +834,9
> @@
> > static void ghes_estatus_cache_add(
> > }
> > /* new_cache must be put into array after its contents are written */
> > smp_wmb();
> > - if (slot != -1 && cmpxchg(ghes_estatus_caches + slot,
> > - slot_cache, new_cache) == slot_cache) {
> > + if (slot != -1 && unrcu_pointer(cmpxchg(ghes_estatus_caches + slot,
> > + RCU_INITIALIZER(slot_cache),
> > + RCU_INITIALIZER(new_cache)))) {
>
> Did you test this?
> There seems to be an == missing.

Sorry about it,


--
Cheers,
Justin (Jia He)

2022-08-19 01:39:51

by Kani, Toshimitsu

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> Previous, there is only one edac can be registering in the EDAC core. After
> ghes_edac is modularized, the registering of ghes devices may be deferred
> when ghes_edac is loaded. Prevent the chipset-specific edac drivers from
> loading after ghes_edac is registered, which allow the edac drivers to
> get selected in e.g. HPE platform.
...
> @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> *ghes_dev)
> platform_set_drvdata(ghes_dev, ghes);
>
> ghes->dev = &ghes_dev->dev;
> + set_ghes_devs_registered(false);

This does not look right to me.

The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
where:
- ghes-present is latched on ghes_probe()
- ghes-preferred is true if the platform-check passes.

ghes_get_device() introduced in the previous patch works as the
ghes-preferred check.

We cannot use ghes_edac registration as the ghes-present check in
this scheme since it is deferred to module_init().

I'd suggest the following changes:
- Update ghes_get_device() to check a flag latched on ghes_probe().
- Remove 'force' argument from ghes_get_device(). ghes_edac_init()
is too late to set this flag. Also, other edac drivers should not be able
to control this flag. I think this force_load kernel option needs to
belong to ghes with this scheme.
- ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers,
which should be internal to ghes. ghes_get_device() may be renamed
to something like ghes_edac_preferred() which returns true / false.

Toshi

2022-08-19 11:03:14

by Justin He

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

Hi Toshi,

> -----Original Message-----
> From: Kani, Toshi <[email protected]>
> Sent: Friday, August 19, 2022 9:30 AM
> To: Justin He <[email protected]>; Ard Biesheuvel <[email protected]>; Len
> Brown <[email protected]>; James Morse <[email protected]>; Tony Luck
> <[email protected]>; Borislav Petkov <[email protected]>; Mauro Carvalho
> Chehab <[email protected]>; Robert Richter <[email protected]>; Robert
> Moore <[email protected]>; Qiuxu Zhuo <[email protected]>;
> Yazen Ghannam <[email protected]>
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected]; nd <[email protected]>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from
> loading after ghes_edac is registered
>
> On Wednesday, August 17, 2022 8:35 AM, Jia He wrote:
> > Previous, there is only one edac can be registering in the EDAC core.
> > After ghes_edac is modularized, the registering of ghes devices may be
> > deferred when ghes_edac is loaded. Prevent the chipset-specific edac
> > drivers from loading after ghes_edac is registered, which allow the
> > edac drivers to get selected in e.g. HPE platform.
> ...
> > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > *ghes_dev)
> > platform_set_drvdata(ghes_dev, ghes);
> >
> > ghes->dev = &ghes_dev->dev;
> > + set_ghes_devs_registered(false);
>
> This does not look right to me.
>
> The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> where:
> - ghes-present is latched on ghes_probe()
> - ghes-preferred is true if the platform-check passes.
>
> ghes_get_device() introduced in the previous patch works as the
> ghes-preferred check.
>
> We cannot use ghes_edac registration as the ghes-present check in this
> scheme since it is deferred to module_init().

What is the logic for ghes-present check? In this patch, I assumed it is equal to
"ghes_edac devices have been registered". Seems it is not correct.
But should we consider the case as follows:
What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?

>
> I'd suggest the following changes:
> - Update ghes_get_device() to check a flag latched on ghes_probe().

Still need your elaborating about the details of the flag. i.e. When is this flag
latched? When is it unlatched?

> - Remove 'force' argument from ghes_get_device(). ghes_edac_init() is too
> late to set this flag. Also, other edac drivers should not be able to control this
> flag. I think this force_load kernel option needs to belong to ghes with this
> scheme.
Agree, I will remove force_load in ghes_get_device(), and put all check/update of
force_load in ghes

> - ghes_get_device() exposes 'ghes_devs' to chipset-specific edac drivers, which
> should be internal to ghes. ghes_get_device() may be renamed to something
> like ghes_edac_preferred() which returns true / false.
>

Okay, agree

--
Cheers,
Justin (Jia He)


2022-08-19 18:30:51

by Kani, Toshimitsu

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct platform_device
> > > *ghes_dev)
> > > platform_set_drvdata(ghes_dev, ghes);
> > >
> > > ghes->dev = &ghes_dev->dev;
> > > + set_ghes_devs_registered(false);
> >
> > This does not look right to me.
> >
> > The condition of using ghes_edac is (ghes-present) && (ghes-preferred),
> > where:
> > - ghes-present is latched on ghes_probe()
> > - ghes-preferred is true if the platform-check passes.
> >
> > ghes_get_device() introduced in the previous patch works as the
> > ghes-preferred check.
> >
> > We cannot use ghes_edac registration as the ghes-present check in this
> > scheme since it is deferred to module_init().
>
> What is the logic for ghes-present check? In this patch, I assumed it is equal to
> "ghes_edac devices have been registered". Seems it is not correct.

Using (ghes_edac-registered) is a wrong check in this scheme
since ghes_edac registration is deferred. This check is fine in
the current scheme since ghes_edac gets registered before
any other chipset-specific edac drivers.

> But should we consider the case as follows:
> What if sbridge_init () is invoked before ghes_edac_init()? i.e. Should we get
> sb_edac driver selected when ghes_edac is not loaded yet (e.g. on HPE)?

No. The point is that ghes_edac driver needs to be selected,
"regardless of the module ordering", on a system with GHES
present & preferred.

Note that this new scheme leads to the following condition
change:
- Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
- New: (ghes-present) && (ghes-preferred)

The option I suggested previously keeps the current condition,
but this new scheme does not for better modularity.

What this means is that if ghes_edac is not enabled (but ghes
is enabled) on a system with GHES present & preferred, no edac
driver gets registered. This change is fine from my (HPE) perspective
and should be fine for other GHES systems. GHES systems have
chipset-specific edac driver in FW. OS-based chipset-specific edac
driver is not necessary and may lead to a conflict of chipset register
ownership.

> > I'd suggest the following changes:
> > - Update ghes_get_device() to check a flag latched on ghes_probe().
>
> Still need your elaborating about the details of the flag. i.e. When is this flag
> latched? When is it unlatched?

This flag is a static variable, say ghes_present, which is set to false initially.
ghes_probe() sets it to true. ghes_edac_preferred() (aka. ghes_get_device)
checks this flag at beginning and returns false if this flag is false. It does not
get unlatched since ACPI GHES table is static.

Toshi

2022-08-19 18:52:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Fri, Aug 19, 2022 at 05:48:39PM +0000, Kani, Toshi wrote:
> This flag is a static variable, say ghes_present, which is set to
> false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> (aka. ghes_get_device) checks this flag at beginning and returns false
> if this flag is false. It does not get unlatched since ACPI GHES table
> is static.

What is that flag needed for at all?

There are two possibilities:

1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
ghes_get_devices() to know when to load.

2. ghes_probe() fails and that is caught during platform testing of all
those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.

No need for funky flags whatsoever.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-19 18:57:39

by Kani, Toshimitsu

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Friday, August 19, 2022 12:50 PM, Borislav Petkov wrote:
> > 3. ghes_probe() is not called,
>
> When is ghes_probe() not called on ARM?

When the system does not implement ACPI GHES table,
which I suppose is most of the case on ARM.

Toshi

Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered



> -----Original Message-----
> From: Kani, Toshi <[email protected]>
> Sent: Friday, August 19, 2022 12:49 PM
> To: Justin He <[email protected]>
> Cc: [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Rafael J . Wysocki
> <[email protected]>; Shuai Xue <[email protected]>; Jarkko
> Sakkinen <[email protected]>; [email protected]; nd
> <[email protected]>; Ard Biesheuvel <[email protected]>; Len Brown
> <[email protected]>; James Morse <[email protected]>; Tony Luck
> <[email protected]>; Borislav Petkov <[email protected]>; Mauro Carvalho
> Chehab <[email protected]>; Robert Richter <[email protected]>; Robert
> Moore <[email protected]>; Qiuxu Zhuo <[email protected]>;
> Yazen Ghannam <[email protected]>
> Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac
> from loading after ghes_edac is registered
>
> On Friday, August 19, 2022 4:35 AM, Justin He wrote:
> > > > @@ -1382,6 +1395,7 @@ static int ghes_probe(struct
> platform_device
> > > > *ghes_dev)
> > > > platform_set_drvdata(ghes_dev, ghes);
> > > >
> > > > ghes->dev = &ghes_dev->dev;
> > > > + set_ghes_devs_registered(false);
> > >
> > > This does not look right to me.
> > >
> > > The condition of using ghes_edac is (ghes-present) && (ghes-
> preferred),
> > > where:
> > > - ghes-present is latched on ghes_probe()
> > > - ghes-preferred is true if the platform-check passes.
> > >
> > > ghes_get_device() introduced in the previous patch works as the
> > > ghes-preferred check.
> > >
> > > We cannot use ghes_edac registration as the ghes-present check in
> this
> > > scheme since it is deferred to module_init().
> >
> > What is the logic for ghes-present check? In this patch, I assumed it
> is equal to
> > "ghes_edac devices have been registered". Seems it is not correct.
>
> Using (ghes_edac-registered) is a wrong check in this scheme
> since ghes_edac registration is deferred. This check is fine in
> the current scheme since ghes_edac gets registered before
> any other chipset-specific edac drivers.
>
> > But should we consider the case as follows:
> > What if sbridge_init () is invoked before ghes_edac_init()? i.e.
> Should we get
> > sb_edac driver selected when ghes_edac is not loaded yet (e.g. on
> HPE)?
>
> No. The point is that ghes_edac driver needs to be selected,
> "regardless of the module ordering", on a system with GHES
> present & preferred.
>
> Note that this new scheme leads to the following condition
> change:
> - Current: (ghes-present) && (ghes-preferred) && (ghes_edac registered)
> - New: (ghes-present) && (ghes-preferred)
>
> The option I suggested previously keeps the current condition,
> but this new scheme does not for better modularity.
>
> What this means is that if ghes_edac is not enabled (but ghes
> is enabled) on a system with GHES present & preferred, no edac
> driver gets registered. This change is fine from my (HPE) perspective
> and should be fine for other GHES systems. GHES systems have
> chipset-specific edac driver in FW. OS-based chipset-specific edac
> driver is not necessary and may lead to a conflict of chipset register
> ownership.

Currently, running with this on the kernel command line
ghes.disable

causes the ACPI ghes driver to quit early in acpi_ghes_init():

/*
* This driver isn't really modular, however for the time being,
* continuing to use module_param is the easiest way to remain
* compatible with existing boot arg use cases.
*/
bool ghes_disable;
module_param_named(disable, ghes_disable, bool, 0);

which results in the skx_edac module assuming it should run:
[ 33.628140] calling skx_init+0x0/0xe5a [skx_edac] @ 1444
[ 33.628531] EDAC MC0: Giving out device to module skx_edac controller Skylake Socket#0 IMC#0: DEV 0000:36:0a.0 (INTERRUPT)
[ 33.641432] EDAC MC1: Giving out device to module skx_edac controller Skylake Socket#0 IMC#1: DEV 0000:36:0c.0 (INTERRUPT)
[ 33.653256] EDAC MC2: Giving out device to module skx_edac controller Skylake Socket#1 IMC#0: DEV 0000:ae:0a.0 (INTERRUPT)
[ 33.665055] EDAC MC3: Giving out device to module skx_edac controller Skylake Socket#1 IMC#1: DEV 0000:ae:0c.0 (INTERRUPT)
[ 33.676801] initcall skx_init+0x0/0xe5a [skx_edac] returned 0 after 36343 usecs

We might need to differentiate between the system ROM really not
offering GHES vs. the ghes module not running.

2022-08-19 19:04:23

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Fri, Aug 19, 2022 at 06:46:34PM +0000, Kani, Toshi wrote:
> 3. ghes_probe() is not called,

When is ghes_probe() not called on ARM?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-19 19:18:17

by Kani, Toshimitsu

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Friday, August 19, 2022 12:30 PM, Borislav Petkov wrote:
> > This flag is a static variable, say ghes_present, which is set to
> > false initially. ghes_probe() sets it to true. ghes_edac_preferred()
> > (aka. ghes_get_device) checks this flag at beginning and returns false
> > if this flag is false. It does not get unlatched since ACPI GHES table
> > is static.
>
> What is that flag needed for at all?

Because ghes_get_device() always returns &ghes_devs on Arm,
which is !NULL. It does not check if GHES is present.

> There are two possibilities:
>
> 1. ghes_probe() succeeds. ghes_edac loads properly and other drivers use
> ghes_get_devices() to know when to load.
>
> 2. ghes_probe() fails and that is caught during platform testing of all
> those platforms who wish to use ghes_edac. BIOS is fixed and goto 1.
>
> No need for funky flags whatsoever.

3. ghes_probe() is not called, but ghes_get_device() is called from
other edac drivers.

Toshi

2022-08-19 19:41:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Fri, Aug 19, 2022 at 06:57:11PM +0000, Elliott, Robert (Servers) wrote:
> We might need to differentiate between the system ROM really not
> offering GHES vs. the ghes module not running.

All detectable in ghes_get_devices().

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-19 19:42:55

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Fri, Aug 19, 2022 at 06:53:41PM +0000, Kani, Toshi wrote:
> When the system does not implement ACPI GHES table,
> which I suppose is most of the case on ARM.

That should be detected in ghes_get_devices() - just like on x86.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2022-08-19 20:50:29

by Kani, Toshimitsu

[permalink] [raw]
Subject: RE: [PATCH v2 5/7] EDAC/ghes: Prevent chipset-specific edac from loading after ghes_edac is registered

On Friday, August 19, 2022 1:31 PM, Borislav Petkov wrote:
> > When the system does not implement ACPI GHES table,
> > which I suppose is most of the case on ARM.
>
> That should be detected in ghes_get_devices() - just like on x86.

Agreed. And that is the check to ghes_present flag...

Toshi