2024-03-11 15:58:23

by Borislav Petkov

Subject: [GIT PULL] EDAC updates for v6.9

Hi Linus,

please pull EDAC updates for 6.9.

Due to the topology changes from tip, a one-liner needs to be applied
as part of the merge commit:

diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
index 08c6dbd44c62..59b6169093f7 100644
--- a/drivers/ras/amd/atl/umc.c
+++ b/drivers/ras/amd/atl/umc.c
@@ -315,7 +315,7 @@ static u8 get_die_id(struct atl_err *err)
 	 * For CPUs, this is the AMD Node ID modulo the number
 	 * of AMD Nodes per socket.
 	 */
-	return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
+	return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
 }

 #define UMC_CHANNEL_NUM		GENMASK(31, 20)
---

linux-next was tested with a similar diff carried forward:

https://lore.kernel.org/r/[email protected]

but we very recently realized that
s/topology_die_id/topology_amd_node_id/ needs to happen too.

That's not a big deal, though, as these are all new drivers for new
hardware which pretty much no one has yet so there's no risk of breaking
any existing machines out there.

Thx.

---

The following changes since commit 6613476e225e090cc9aad49be7fa504e290dd33d:

Linux 6.8-rc1 (2024-01-21 14:11:32 -0800)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/edac_updates_for_v6.9

for you to fetch changes up to af65545a0f82d7336f62e34f69d3c644806f5f95:

Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9 (2024-03-11 16:24:20 +0100)

----------------------------------------------------------------
- Add a FRU (Field Replaceable Unit) memory poison manager which
collects and manages previously encountered hw errors in order to
save them to persistent storage across reboots. Previously recorded
errors are "replayed" upon reboot in order to poison memory which has
caused said errors in the past.

The main use case is stacked, on-chip memory which cannot simply be
replaced so poisoning faulty areas of it and thus making them
inaccessible is the only strategy to prolong its lifetime.

- Add an AMD address translation library glue which converts the
reported addresses of hw errors into system physical addresses in
order to be used by other subsystems like memory failure, for
example. Add support for MI300 accelerators to that library.

- igen6: Add support for Alder Lake-N SoC

- i10nm: Add Grand Ridge support

- The usual fixlets and cleanups
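
The record-and-replay idea behind the FRU memory poison manager described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in drivers/ras/amd/fmpm.c: the names poison_store, store_save(), store_replay(), count_poison() and MAX_RECORDS are all hypothetical, and the "persistent storage" is a plain in-memory array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RECORDS 8	/* hypothetical capacity, not taken from the driver */

/* One remembered error: the system physical address (SPA) that failed. */
struct poison_record {
	uint64_t spa;
};

/* Stand-in for the persistent store; here it is just an array in memory. */
struct poison_store {
	struct poison_record rec[MAX_RECORDS];
	size_t count;
};

/* Remember an address. Returns 0 if stored or already known, -1 if full. */
static int store_save(struct poison_store *s, uint64_t spa)
{
	assert(s);
	for (size_t i = 0; i < s->count; i++)
		if (s->rec[i].spa == spa)
			return 0;
	if (s->count == MAX_RECORDS)
		return -1;
	s->rec[s->count++].spa = spa;
	return 0;
}

/* "Replay" after reboot: hand every saved address to the poison callback. */
static size_t store_replay(const struct poison_store *s,
			   void (*poison)(uint64_t spa, void *ctx), void *ctx)
{
	assert(s && poison);
	for (size_t i = 0; i < s->count; i++)
		poison(s->rec[i].spa, ctx);
	return s->count;
}

/* Example callback: only counts calls instead of poisoning real memory. */
static void count_poison(uint64_t spa, void *ctx)
{
	(void)spa;
	(*(size_t *)ctx)++;
}
```

In this sketch a caller would save each reported address with store_save() at error time and, on the next boot, pass a real poisoning routine to store_replay() in place of count_poison().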

----------------------------------------------------------------
Borislav Petkov (AMD) (3):
Documentation: Move RAS section to admin-guide
RAS: Export helper to get ras_debugfs_dir
Merge remote-tracking branches 'ras/edac-drivers', 'ras/edac-misc' and 'ras/edac-amd-atl' into edac-updates-for-v6.9

Dan Carpenter (2):
RAS/AMD/ATL: Fix array overflow in get_logical_coh_st_fabric_id_mi300()
RAS/AMD/FMPM: Fix off by one when unwinding on error

Lili Li (1):
EDAC/igen6: Add one more Intel Alder Lake-N SoC support

Muralidhara M K (1):
RAS/AMD/ATL: Add MI300 support

Qiuxu Zhuo (1):
EDAC/i10nm: Add Intel Grand Ridge micro-server support

Shubhrajyoti Datta (1):
EDAC/versal: Make the bit position of injected errors configurable

Uwe Kleine-König (1):
EDAC/versal: Convert to platform remove callback returning void

Yangtao Li (1):
EDAC/synopsys: Convert to devm_platform_ioremap_resource()

Yazen Ghannam (9):
RAS: Introduce AMD Address Translation Library
EDAC/amd64: Use new AMD Address Translation Library
Documentation: RAS: Add index and address translation section
RAS/AMD/ATL: Add MI300 DRAM to normalized address translation support
RAS/AMD/ATL: Add MI300 row retirement support
RAS: Introduce a FRU memory poison manager
RAS/AMD/ATL: Fix bit overflow in denorm_addr_df4_np2()
RAS/AMD/FMPM: Save SPA values
RAS/AMD/FMPM: Add debugfs interface to print record entries

.../admin-guide/RAS/address-translation.rst | 24 +
.../ras.rst => admin-guide/RAS/error-decoding.rst} | 11 +-
Documentation/admin-guide/RAS/index.rst | 7 +
.../admin-guide/{ras.rst => RAS/main.rst} | 10 +-
Documentation/admin-guide/index.rst | 2 +-
Documentation/index.rst | 1 -
MAINTAINERS | 15 +-
drivers/edac/Kconfig | 1 +
drivers/edac/amd64_edac.c | 286 +-----
drivers/edac/i10nm_base.c | 1 +
drivers/edac/igen6_edac.c | 2 +
drivers/edac/synopsys_edac.c | 4 +-
drivers/edac/versal_edac.c | 199 +++-
drivers/ras/Kconfig | 13 +
drivers/ras/Makefile | 3 +
drivers/ras/amd/atl/Kconfig | 21 +
drivers/ras/amd/atl/Makefile | 18 +
drivers/ras/amd/atl/access.c | 133 +++
drivers/ras/amd/atl/core.c | 225 +++++
drivers/ras/amd/atl/dehash.c | 500 ++++++++++
drivers/ras/amd/atl/denormalize.c | 718 ++++++++++++++
drivers/ras/amd/atl/internal.h | 306 ++++++
drivers/ras/amd/atl/map.c | 682 +++++++++++++
drivers/ras/amd/atl/reg_fields.h | 606 ++++++++++++
drivers/ras/amd/atl/system.c | 288 ++++++
drivers/ras/amd/atl/umc.c | 341 +++++++
drivers/ras/amd/fmpm.c | 1013 ++++++++++++++++++++
drivers/ras/cec.c | 10 +-
drivers/ras/debugfs.c | 8 +-
drivers/ras/debugfs.h | 2 +-
drivers/ras/ras.c | 31 +
include/linux/ras.h | 18 +
32 files changed, 5164 insertions(+), 335 deletions(-)
create mode 100644 Documentation/admin-guide/RAS/address-translation.rst
rename Documentation/{RAS/ras.rst => admin-guide/RAS/error-decoding.rst} (73%)
create mode 100644 Documentation/admin-guide/RAS/index.rst
rename Documentation/admin-guide/{ras.rst => RAS/main.rst} (99%)
create mode 100644 drivers/ras/amd/atl/Kconfig
create mode 100644 drivers/ras/amd/atl/Makefile
create mode 100644 drivers/ras/amd/atl/access.c
create mode 100644 drivers/ras/amd/atl/core.c
create mode 100644 drivers/ras/amd/atl/dehash.c
create mode 100644 drivers/ras/amd/atl/denormalize.c
create mode 100644 drivers/ras/amd/atl/internal.h
create mode 100644 drivers/ras/amd/atl/map.c
create mode 100644 drivers/ras/amd/atl/reg_fields.h
create mode 100644 drivers/ras/amd/atl/system.c
create mode 100644 drivers/ras/amd/atl/umc.c
create mode 100644 drivers/ras/amd/fmpm.c


--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2024-03-12 01:13:22

by Linus Torvalds

Subject: Re: [GIT PULL] EDAC updates for v6.9

On Mon, 11 Mar 2024 at 08:57, Borislav Petkov wrote:
>
> - return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
> + return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();

Ho humm. Lookie here:

static inline unsigned int topology_amd_nodes_per_pkg(void)
{ return 0; };

that's the UP case.

Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
potentially do that modulus by zero.

So I made the merge also change that UP case of
topology_amd_nodes_per_pkg() to return 1.

Because dammit, not only is a mod-by-zero wrong, a UP system most
definitely has one node per package, not zero.

Linus
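
For readers following along, the problem can be reproduced in a small userspace sketch. It assumes hypothetical stand-ins for the !SMP stub before and after the fix (nodes_per_pkg_up_old() and nodes_per_pkg_up_fixed()) and a die_id() helper that only mirrors the shape of get_die_id(); none of these names exist in the kernel.

```c
#include <assert.h>

/* Hypothetical stand-ins for the UP stub before and after the fix. */
static inline unsigned int nodes_per_pkg_up_old(void) { return 0; }
static inline unsigned int nodes_per_pkg_up_fixed(void) { return 1; }

/* Same shape as get_die_id(): node ID modulo the nodes per package. */
static unsigned int die_id(unsigned int node_id, unsigned int nodes_per_pkg)
{
	/* A modulus by zero is undefined behavior in C, so refuse it here. */
	assert(nodes_per_pkg != 0);
	return node_id % nodes_per_pkg;
}
```

With four nodes per package, node 5 maps to die 1. With the fixed UP stub every node maps to die 0, while passing the old stub's value of 0 would trip the assertion instead of silently dividing by zero.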

2024-03-12 01:30:59

by pr-tracker-bot

Subject: Re: [GIT PULL] EDAC updates for v6.9

The pull request you sent on Mon, 11 Mar 2024 16:57:11 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git tags/edac_updates_for_v6.9

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/b0402403e54ae9eb94ce1cbb53c7def776e97426

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2024-03-12 02:24:28

by Randy Dunlap

Subject: Re: [GIT PULL] EDAC updates for v6.9



On 3/11/24 18:12, Linus Torvalds wrote:
> On Mon, 11 Mar 2024 at 08:57, Borislav Petkov wrote:
>>
>> - return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
>> + return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
>
> Ho humm. Lookie here:
>
> static inline unsigned int topology_amd_nodes_per_pkg(void)
> { return 0; };
>

and there's an extra/trailing ';'.

> that's the UP case.
>
> Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
> potentially do that modulus by zero.
>
> So I made the merge also change that UP case of
> topology_amd_nodes_per_pkg() to return 1.
>
> Because dammit, not only is a mod-by-zero wrong, a UP system most
> definitely has one node per package, not zero.
>
> Linus
>

--
#Randy

2024-03-12 02:25:35

by Linus Torvalds

Subject: Re: [GIT PULL] EDAC updates for v6.9

On Mon, 11 Mar 2024 at 19:24, Randy Dunlap wrote:
>
> and there's an extra/trailing ';'.

Ayup, I fixed that too while I was in there anyway.

Linus

2024-03-12 07:45:37

by Borislav Petkov

Subject: Re: [GIT PULL] EDAC updates for v6.9

On Mon, Mar 11, 2024 at 06:12:54PM -0700, Linus Torvalds wrote:
> Ho humm. Lookie here:
>
> static inline unsigned int topology_amd_nodes_per_pkg(void)
> { return 0; };
>
> that's the UP case.
>
> Yeah, I'm assuming nobody tests this for UP,

Unless it gets randomly enabled in my randconfig builds once in a blue
moon, I'd say pretty seldom. I've heard people raise the question
multiple times whether we should simply make CONFIG_SMP default y on x86
and frankly, it'll get rid of a whole bunch of stupid corner cases like
that...

> but it's clearly wrong to potentially do that modulus by zero.

Yep.

> So I made the merge also change that UP case of
> topology_amd_nodes_per_pkg() to return 1.
>
> Because dammit, not only is a mod-by-zero wrong, a UP system most
> definitely has one node per package, not zero.

Yap, that's the straightforward thing to do, thanks for fixing it!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-12 09:16:50

by Ingo Molnar

Subject: Re: [GIT PULL] EDAC updates for v6.9


* Borislav Petkov wrote:

> On Mon, Mar 11, 2024 at 06:12:54PM -0700, Linus Torvalds wrote:
> > Ho humm. Lookie here:
> >
> > static inline unsigned int topology_amd_nodes_per_pkg(void)
> > { return 0; };
> >
> > that's the UP case.
> >
> > Yeah, I'm assuming nobody tests this for UP,
>
> Unless it gets randomly enabled in my randconfig builds once in a blue
> moon, I'd say pretty seldom. I've heard people raise the question
> multiple times whether we should simply make CONFIG_SMP default y on x86
> and frankly, it'll get rid of a whole bunch of stupid corner cases like
> that...

Making it 'default y' in the Kconfig alone changes very little, as people &
bots will still stumble on !SMP via allnoconfig or randconfig builds.

If you mean forcing CONFIG_SMP via 'select SMP' on x86 on the other hand,
that's worth considering - although I think there will be a ton of extra
cross-build breakage as most patches still get created & tested on x86.

In other words, the x86 UP build basically has the side-effect utility of
covering a lot of UP cross-build scenarios in generic code.

I think the most viable approach would be to make SMP the only model all
across the kernel (and eventually removing the CONFIG_SMP option), while
propagating UP data structures and locking primitives to the UP arch level,
instead of having CONFIG_SMP #ifdefs in generic code.

Maybe not today, but certainly in a few years.

Thanks,

Ingo

2024-03-12 09:41:52

by Borislav Petkov

Subject: Re: [GIT PULL] EDAC updates for v6.9

On Tue, Mar 12, 2024 at 10:16:10AM +0100, Ingo Molnar wrote:
> If you mean forcing CONFIG_SMP via 'select SMP' on x86 on the other
> hand, that's worth considering

Yeah, that.

> - although I think there will be a ton of extra cross-build breakage
> as most patches still get created & tested on x86.

I wanna say "this better be build-tested on the target architecture too"
but I can certainly see the use case of having to cross-build a UP
config.

> I think the most viable approach would be to make SMP the only model
> all across the kernel (and eventually removing the CONFIG_SMP option),
> while propagating UP data structures and locking primitives to the UP
> arch level, instead of having CONFIG_SMP #ifdefs in generic code.

Right, UP is an SMP machine with only 1 CPU. It should just work. :-P

> Maybe not today, but certainly in a few years.

It makes sense to aim for such a model, yap. Let's do it.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-12 10:07:48

by Thomas Gleixner

Subject: Re: [GIT PULL] EDAC updates for v6.9

On Mon, Mar 11 2024 at 18:12, Linus Torvalds wrote:

> On Mon, 11 Mar 2024 at 08:57, Borislav Petkov wrote:
>>
>> - return topology_die_id(err->cpu) % amd_get_nodes_per_socket();
>> + return topology_amd_node_id(err->cpu) % topology_amd_nodes_per_pkg();
>
> Ho humm. Lookie here:
>
> static inline unsigned int topology_amd_nodes_per_pkg(void)
> { return 0; };
>
> that's the UP case.
>
> Yeah, I'm assuming nobody tests this for UP, but it's clearly wrong to
> potentially do that modulus by zero.

Duh. I clearly was not thinking at all when I wrote this.

Thanks for spotting it.


tglx