This patch proposes to add an extra .note section in the corefile to dump the CPUID information of a machine. This is being done to solve the issue of tools like the debuggers having to deal with coredumps from machines with varying XSAVE layouts in spite of having the same XCR0 bits. The new proposed .note section, at this point, consists of an array of records containing the information of each extended feature that is present. This provides details about the offsets and the sizes of the various extended save state components of the machine where the application crash occurred. Requesting a review for this patch.
Some background:
The XSAVE layouts of modern AMD and Intel CPUs differ, especially since Memory Protection Keys and the AVX-512 features have been inculcated into the AMD CPUs. This is since AMD never adopted (and hence never left room in the XSAVE layout for) the Intel MPX feature. Tools like GDB had assumed a fixed XSAVE layout matching that of Intel (based on the XCR0 mask). Hence, the core dumps from AMD CPUs didn't match the known size for the XCR0 mask. This resulted in GDB and other tools not being able to access the values of the AVX-512 and PKRU registers on AMD CPUs. To solve this, an interim solution has been accepted into GDB, and is already a part of GDB 14, thanks to these series of patches : [ https://sourceware.org/pipermail/gdb-patches/2023-March/198081.html ].
But this patch series depends on heuristics based on the total XSAVE register set size and the XCR0 mask to infer the layouts of the various register blocks for core dumps, and hence, is not a foolproof mechanism to determine the layout of the XSAVE area.
Hence this new core dump note has been proposed as a more sturdy mechanism to allow GDB/LLDB and other relevant tools to determine the layout of the XSAVE area of the machine where the corefile was dumped.
The new core dump note (which is being proposed as a per-process .note section), NT_X86_XSAVE_LAYOUT (0x205) contains an array of structures.
Each structure describes an individual extended feature containing offset, size and flags (that is obtained through CPUID instruction) in a format roughly matching the follow C structure:
struct xfeat_component {
u32 xfeat_type;
u32 xfeat_sz;
u32 xfeat_off;
u32 xfeat_flags;
};
Vignesh Balasubramanian (1):
x86/elf: Add a new .note section containing Xfeatures information to
x86 core files
arch/Kconfig | 9 +++
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/elf.h | 2 -
arch/x86/Kconfig | 1 +
arch/x86/include/asm/elf.h | 7 +++
arch/x86/kernel/fpu/xstate.c | 101 +++++++++++++++++++++++++++++++++
include/linux/elf.h | 2 +-
include/uapi/linux/elf.h | 1 +
8 files changed, 121 insertions(+), 3 deletions(-)
--
2.43.0
Add a new .note section containing type, size, offset and flags of
every xfeature that is present.
This information will be used by the debuggers to understand the XSAVE
layout of the machine where the core file is dumped, and to read XSAVE
registers, especially during cross-platform debugging.
Co-developed-by: Jini Susan George <[email protected]>
Signed-off-by: Jini Susan George <[email protected]>
Signed-off-by: Vignesh Balasubramanian <[email protected]>
---
arch/Kconfig | 9 +++
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/elf.h | 2 -
arch/x86/Kconfig | 1 +
arch/x86/include/asm/elf.h | 7 +++
arch/x86/kernel/fpu/xstate.c | 101 +++++++++++++++++++++++++++++++++
include/linux/elf.h | 2 +-
include/uapi/linux/elf.h | 1 +
8 files changed, 121 insertions(+), 3 deletions(-)
diff --git a/arch/Kconfig b/arch/Kconfig
index fd18b7db2c77..3bd8a0b2bba1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -502,6 +502,15 @@ config MMU_LAZY_TLB_SHOOTDOWN
config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
+config ARCH_HAVE_EXTRA_ELF_NOTES
+ bool
+ help
+ An architecture should select this in order to enable adding an
+ arch-specific ELF note section to core files. It must provide two
+ functions: elf_coredump_extra_notes_size() and
+ elf_coredump_extra_notes_write() which are invoked by the ELF core
+ dumper.
+
config ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
bool
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index a91cb070ca4a..3b31bd7490e2 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -156,6 +156,7 @@ config PPC
select ARCH_HAS_UACCESS_FLUSHCACHE
select ARCH_HAS_UBSAN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select ARCH_HAVE_EXTRA_ELF_NOTES if SPU_BASE
select ARCH_KEEP_MEMBLOCK
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index 79f1c480b5eb..bb4b94444d3e 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -127,8 +127,6 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
/* Notes used in ET_CORE. Note name is "SPU/<fd>/<filename>". */
#define NT_SPU 1
-#define ARCH_HAVE_EXTRA_ELF_NOTES
-
#endif /* CONFIG_SPU_BASE */
#ifdef CONFIG_PPC64
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 78050d5d7fac..35e8d1201099 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -104,6 +104,7 @@ config X86
select ARCH_HAS_DEBUG_WX
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+ select ARCH_HAVE_EXTRA_ELF_NOTES
select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 1fb83d47711f..1b9f0b4bf6bc 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -13,6 +13,13 @@
#include <asm/auxvec.h>
#include <asm/fsgsbase.h>
+struct xfeat_component {
+ u32 xfeat_type;
+ u32 xfeat_sz;
+ u32 xfeat_off;
+ u32 xfeat_flags;
+} __packed;
+
typedef unsigned long elf_greg_t;
#define ELF_NGREG (sizeof(struct user_regs_struct) / sizeof(elf_greg_t))
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 117e74c44e75..6e5ea483ec1d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -13,6 +13,7 @@
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
#include <linux/vmalloc.h>
+#include <linux/coredump.h>
#include <asm/fpu/api.h>
#include <asm/fpu/regset.h>
@@ -1836,3 +1837,103 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
return 0;
}
#endif /* CONFIG_PROC_PID_ARCH_STATUS */
+
+/*
+ * Dump type, size, offset and flag values for every xfeature that is present.
+ */
+static int dump_xsave_layout_desc(struct coredump_params *cprm)
+{
+
+ struct xfeat_component xc;
+ int num_records = 0;
+ int i;
+
+ /* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
+ for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
+ xc.xfeat_type = i;
+ xc.xfeat_sz = xstate_sizes[i];
+ xc.xfeat_off = xstate_offsets[i];
+ xc.xfeat_flags = xstate_flags[i];
+
+ if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
+ return 0;
+ num_records++;
+ }
+
+ for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
+ xc.xfeat_type = i;
+ xc.xfeat_sz = xstate_sizes[i];
+ xc.xfeat_off = xstate_offsets[i];
+ xc.xfeat_flags = xstate_flags[i];
+
+ if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
+ return 0;
+ num_records++;
+ }
+
+ return num_records;
+}
+
+static int get_xsave_desc_size(void)
+{
+ /* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
+ int xfeatures_count = 2;
+ int i;
+
+ for_each_extended_xfeature(i, fpu_user_cfg.max_features)
+ xfeatures_count++;
+
+ return xfeatures_count * (sizeof(struct xfeat_component));
+}
+
+int elf_coredump_extra_notes_write(struct coredump_params *cprm)
+{
+ const char *owner_name = "LINUX";
+ int num_records = 0;
+ struct elf_note en;
+
+ en.n_namesz = strlen(owner_name) + 1;
+ en.n_descsz = get_xsave_desc_size();
+ en.n_type = NT_X86_XSAVE_LAYOUT;
+
+ if (!dump_emit(cprm, &en, sizeof(en)))
+ return 1;
+ if (!dump_emit(cprm, owner_name, en.n_namesz))
+ return 1;
+ if (!dump_align(cprm, 4))
+ return 1;
+
+ num_records = dump_xsave_layout_desc(cprm);
+ if (!num_records) {
+ pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
+ return 1;
+ }
+
+ /* Total size should be equal to the number of records */
+ if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
+ pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
+ return 1;
+ }
+
+ if (!dump_align(cprm, 4))
+ return 1;
+
+ return 0;
+}
+
+/*
+ * Return the size of new note.
+ */
+int elf_coredump_extra_notes_size(void)
+{
+ const char *fullname = "LINUX";
+ int size = 0;
+
+ /* NOTE Header */
+ size += sizeof(struct elf_note);
+ /* name + align */
+ size += roundup(strlen(fullname) + 1, 4);
+ size += get_xsave_desc_size();
+
+ return size;
+}
diff --git a/include/linux/elf.h b/include/linux/elf.h
index c9a46c4e183b..5c402788da19 100644
--- a/include/linux/elf.h
+++ b/include/linux/elf.h
@@ -65,7 +65,7 @@ extern Elf64_Dyn _DYNAMIC [];
struct file;
struct coredump_params;
-#ifndef ARCH_HAVE_EXTRA_ELF_NOTES
+#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES
static inline int elf_coredump_extra_notes_size(void) { return 0; }
static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) { return 0; }
#else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 9417309b7230..3325488cb39b 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -411,6 +411,7 @@ typedef struct elf64_shdr {
#define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
/* Old binutils treats 0x203 as a CET state */
#define NT_X86_SHSTK 0x204 /* x86 SHSTK state */
+#define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */
#define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
#define NT_S390_TIMER 0x301 /* s390 timer register */
#define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
--
2.43.0
On 3/14/24 04:23, Vignesh Balasubramanian wrote:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.
Mechanically, I'd much rather have all of that info in the cover letter
in the actual changelog instead.
I'd also love to see a practical example of what an actual example core
dump looks like on two conflicting systems:
* Total XSAVE size
* XCR0 value
* XSTATE_BV from the core dump
* XFEATURE offsets for each feature
Do you have any information about what other OSes are doing in this
area? I thought Windows, for instance, was even less flexible about the
XSAVE format than Linux is.
Why didn't LWP cause this problem?
From the cover letter:
> But this patch series depends on heuristics based on the total XSAVE
> register set size and the XCR0 mask to infer the layouts of the
> various register blocks for core dumps, and hence, is not a foolproof
> mechanism to determine the layout of the XSAVE area.
It may not be theoretically foolproof. But I'm struggling to think of a
case where it would matter in practice. Is there any CPU from any
vendor where this is actually _needed_?
Sure, it's ugly as hell, but these notes aren't going to be available
universally _ever_, so it's not like the crummy heuristic code gets to
go away.
Have you seen the APX spec?
>
https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
It makes this even more fun because it adds a new XSAVE state component,
but reuses the MPX offsets.
> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.
This is pretty close to just a raw dump of the XSAVE CPUID leaves.
Rather than come up with an XSAVE-specific ABI that depends on CPUID
*ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
should just bite the bullet and dump out (some of) the raw CPUID space.
On Thu, Mar 14, 2024 at 08:37:09AM -0700, Dave Hansen wrote:
> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
> Rather than come up with an XSAVE-specific ABI that depends on CPUID
> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
> should just bite the bullet and dump out (some of) the raw CPUID space.
Funny you should say that. This was what they had done originally but if
you dump CPUID and you want to add another component in the future which
is *not* described by CPUID, your scheme breaks.
So the idea is to have a self-describing buffers layout, independent
from any x86-ism. You can extend this in a straight-forward way then
later.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Mar 14, 2024 at 04:53:28PM +0530, Vignesh Balasubramanian wrote:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.
>
> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.
I see binutils in CC. Can someone from gdb confirm that this solution
can be used?
>
> Co-developed-by: Jini Susan George <[email protected]>
> Signed-off-by: Jini Susan George <[email protected]>
> Signed-off-by: Vignesh Balasubramanian <[email protected]>
> ---
> arch/Kconfig | 9 +++
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/elf.h | 2 -
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/elf.h | 7 +++
> arch/x86/kernel/fpu/xstate.c | 101 +++++++++++++++++++++++++++++++++
> include/linux/elf.h | 2 +-
> include/uapi/linux/elf.h | 1 +
> 8 files changed, 121 insertions(+), 3 deletions(-)
>
> diff --git a/arch/Kconfig b/arch/Kconfig
> index fd18b7db2c77..3bd8a0b2bba1 100644
> --- a/arch/Kconfig
> +++ b/arch/Kconfig
> @@ -502,6 +502,15 @@ config MMU_LAZY_TLB_SHOOTDOWN
> config ARCH_HAVE_NMI_SAFE_CMPXCHG
> bool
>
> +config ARCH_HAVE_EXTRA_ELF_NOTES
> + bool
> + help
> + An architecture should select this in order to enable adding an
> + arch-specific ELF note section to core files. It must provide two
> + functions: elf_coredump_extra_notes_size() and
> + elf_coredump_extra_notes_write() which are invoked by the ELF core
> + dumper.
> +
> config ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
> bool
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index a91cb070ca4a..3b31bd7490e2 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -156,6 +156,7 @@ config PPC
> select ARCH_HAS_UACCESS_FLUSHCACHE
> select ARCH_HAS_UBSAN
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> + select ARCH_HAVE_EXTRA_ELF_NOTES if SPU_BASE
> select ARCH_KEEP_MEMBLOCK
> select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU
> select ARCH_MIGHT_HAVE_PC_PARPORT
> diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
> index 79f1c480b5eb..bb4b94444d3e 100644
> --- a/arch/powerpc/include/asm/elf.h
> +++ b/arch/powerpc/include/asm/elf.h
> @@ -127,8 +127,6 @@ extern int arch_setup_additional_pages(struct linux_binprm *bprm,
> /* Notes used in ET_CORE. Note name is "SPU/<fd>/<filename>". */
> #define NT_SPU 1
>
> -#define ARCH_HAVE_EXTRA_ELF_NOTES
> -
> #endif /* CONFIG_SPU_BASE */
>
> #ifdef CONFIG_PPC64
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 78050d5d7fac..35e8d1201099 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -104,6 +104,7 @@ config X86
> select ARCH_HAS_DEBUG_WX
> select ARCH_HAS_ZONE_DMA_SET if EXPERT
> select ARCH_HAVE_NMI_SAFE_CMPXCHG
> + select ARCH_HAVE_EXTRA_ELF_NOTES
> select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
> select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
> select ARCH_MIGHT_HAVE_PC_PARPORT
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index 1fb83d47711f..1b9f0b4bf6bc 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -13,6 +13,13 @@
> #include <asm/auxvec.h>
> #include <asm/fsgsbase.h>
>
> +struct xfeat_component {
> + u32 xfeat_type;
> + u32 xfeat_sz;
> + u32 xfeat_off;
> + u32 xfeat_flags;
> +} __packed;
While it is currently true, just for robustness, can you add
a _Static_assert that sizeof(struct xfeat_component) % 4 == 0 ?
Notes must be 4-byte aligned.
> +
> typedef unsigned long elf_greg_t;
>
> #define ELF_NGREG (sizeof(struct user_regs_struct) / sizeof(elf_greg_t))
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 117e74c44e75..6e5ea483ec1d 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -13,6 +13,7 @@
> #include <linux/seq_file.h>
> #include <linux/proc_fs.h>
> #include <linux/vmalloc.h>
> +#include <linux/coredump.h>
>
> #include <asm/fpu/api.h>
> #include <asm/fpu/regset.h>
> @@ -1836,3 +1837,103 @@ int proc_pid_arch_status(struct seq_file *m, struct pid_namespace *ns,
> return 0;
> }
> #endif /* CONFIG_PROC_PID_ARCH_STATUS */
> +
> +/*
> + * Dump type, size, offset and flag values for every xfeature that is present.
> + */
> +static int dump_xsave_layout_desc(struct coredump_params *cprm)
> +{
> +
> + struct xfeat_component xc;
> + int num_records = 0;
> + int i;
> +
> + /* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
> + for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
> + xc.xfeat_type = i;
> + xc.xfeat_sz = xstate_sizes[i];
> + xc.xfeat_off = xstate_offsets[i];
> + xc.xfeat_flags = xstate_flags[i];
> +
> + if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
> + return 0;
> + num_records++;
> + }
> +
> + for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
> + xc.xfeat_type = i;
> + xc.xfeat_sz = xstate_sizes[i];
> + xc.xfeat_off = xstate_offsets[i];
> + xc.xfeat_flags = xstate_flags[i];
> +
> + if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
> + return 0;
> + num_records++;
> + }
> +
> + return num_records;
> +}
> +
> +static int get_xsave_desc_size(void)
> +{
> + /* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
> + int xfeatures_count = 2;
> + int i;
> +
> + for_each_extended_xfeature(i, fpu_user_cfg.max_features)
> + xfeatures_count++;
> +
> + return xfeatures_count * (sizeof(struct xfeat_component));
> +}
> +
> +int elf_coredump_extra_notes_write(struct coredump_params *cprm)
> +{
> + const char *owner_name = "LINUX";
If you use an array instead of a pointer, there's no need for strlen(),
and you can make it a static outside of the function to refer to it
later.
static const char owner_name[] = "LINUX";
> + int num_records = 0;
> + struct elf_note en;
> +
> + en.n_namesz = strlen(owner_name) + 1;
en.n_namesz = sizeof(owner_name);
> + en.n_descsz = get_xsave_desc_size();
> + en.n_type = NT_X86_XSAVE_LAYOUT;
> +
> + if (!dump_emit(cprm, &en, sizeof(en)))
> + return 1;
> + if (!dump_emit(cprm, owner_name, en.n_namesz))
> + return 1;
> + if (!dump_align(cprm, 4))
> + return 1;
> +
> + num_records = dump_xsave_layout_desc(cprm);
> + if (!num_records) {
> + pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
Can you make this pr_warn_ratelimited() (and below)?
> + return 1;
> + }
> +
> + /* Total size should be equal to the number of records */
> + if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
> + pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
> + return 1;
> + }
> +
> + if (!dump_align(cprm, 4))
> + return 1;
I don't think this call is needed?
> +
> + return 0;
> +}
> +
> +/*
> + * Return the size of new note.
> + */
> +int elf_coredump_extra_notes_size(void)
> +{
> + const char *fullname = "LINUX";
Now this can be dropped.
> + int size = 0;
> +
> + /* NOTE Header */
> + size += sizeof(struct elf_note);
> + /* name + align */
> + size += roundup(strlen(fullname) + 1, 4);
size += roundup(sizeof(owner_name), 4);
> + size += get_xsave_desc_size();
> +
> + return size;
> +}
> diff --git a/include/linux/elf.h b/include/linux/elf.h
> index c9a46c4e183b..5c402788da19 100644
> --- a/include/linux/elf.h
> +++ b/include/linux/elf.h
> @@ -65,7 +65,7 @@ extern Elf64_Dyn _DYNAMIC [];
> struct file;
> struct coredump_params;
>
> -#ifndef ARCH_HAVE_EXTRA_ELF_NOTES
> +#ifndef CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES
> static inline int elf_coredump_extra_notes_size(void) { return 0; }
> static inline int elf_coredump_extra_notes_write(struct coredump_params *cprm) { return 0; }
> #else
> diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
> index 9417309b7230..3325488cb39b 100644
> --- a/include/uapi/linux/elf.h
> +++ b/include/uapi/linux/elf.h
> @@ -411,6 +411,7 @@ typedef struct elf64_shdr {
> #define NT_X86_XSTATE 0x202 /* x86 extended state using xsave */
> /* Old binutils treats 0x203 as a CET state */
> #define NT_X86_SHSTK 0x204 /* x86 SHSTK state */
> +#define NT_X86_XSAVE_LAYOUT 0x205 /* XSAVE layout description */
> #define NT_S390_HIGH_GPRS 0x300 /* s390 upper register halves */
> #define NT_S390_TIMER 0x301 /* s390 timer register */
> #define NT_S390_TODCMP 0x302 /* s390 TOD clock comparator register */
> --
> 2.43.0
>
Otherwise looks reasonable, though I see Dave has feedback to address
too. :)
Thanks for working on this!
-Kees
--
Kees Cook
On 3/14/24 09:08, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 08:37:09AM -0700, Dave Hansen wrote:
>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
>
> Funny you should say that. This was what they had done originally but if
> you dump CPUID and you want to add another component in the future which
> is *not* described by CPUID, your scheme breaks.
Are you envisioning an *XSAVE* state component that's not described by
CPUID?
Or some _other_ (non-XSAVE) component in a core dump that isn't
described by CPUID?
> So the idea is to have a self-describing buffers layout, independent
> from any x86-ism. You can extend this in a straight-forward way then
> later.
That argument breaks down a bit on the flags though:
xc.xfeat_flags = xstate_flags[i];
Because it comes _directly_ from CPUID with zero filtering:
cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
...
xstate_flags[i] = ecx;
So this layout is quite dependent on what's in x86's CPUID.
> -----Original Message-----
> From: Vignesh Balasubramanian <[email protected]>
> Sent: Donnerstag, 14. März 2024 12:23
> To: [email protected]; [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected]; linuxppc-
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; Willgerodt, Felix <[email protected]>; Vignesh
> Balasubramanian <[email protected]>
> Subject: [PATCH 0/1] Add XSAVE layout description to Core files for debuggers to
> support varying XSAVE layouts
>
> This patch proposes to add an extra .note section in the corefile to dump the
> CPUID information of a machine. This is being done to solve the issue of tools like
> the debuggers having to deal with coredumps from machines with varying XSAVE
> layouts in spite of having the same XCR0 bits. The new proposed .note section, at
> this point, consists of an array of records containing the information of each
> extended feature that is present. This provides details about the offsets and the
> sizes of the various extended save state components of the machine where the
> application crash occurred. Requesting a review for this patch.
>
> Some background:
>
> The XSAVE layouts of modern AMD and Intel CPUs differ, especially since Memory
> Protection Keys and the AVX-512 features have been inculcated into the AMD
> CPUs. This is since AMD never adopted (and hence never left room in the XSAVE
> layout for) the Intel MPX feature. Tools like GDB had assumed a fixed XSAVE
> layout matching that of Intel (based on the XCR0 mask).
Hi,
I am a GDB developer and very much in favour of getting rid of the
interim solution added to GDB. It doesn't scale well, as soon as we add new state
that has the same size as some existing state.
I am wondering if it wouldn't be easier for everyone if corefiles would just
contain space for all possible XSAVE components? Regardless of whether the CPU
supports it or whether or not AMD or Intel ever supported the feature.
Or if we would at least not drop some state from the middle, like in this case.
Regards,
Felix
Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, http://www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928
On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote:
> Are you envisioning an *XSAVE* state component that's not described by
> CPUID?
I want to be prepared. You can imagine some of the short cuts and
corners cutting hw guys would do so I'd want to be prepared there and
not tie this to CPUID.
> Or some _other_ (non-XSAVE) component in a core dump that isn't
> described by CPUID?
Yes, that too.
Since the format of this buffer is so simple and machine-independent, it
can be extended as needed without issues.
> That argument breaks down a bit on the flags though:
>
> xc.xfeat_flags = xstate_flags[i];
>
> Because it comes _directly_ from CPUID with zero filtering:
>
> cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
> ...
> xstate_flags[i] = ecx;
>
> So this layout is quite dependent on what's in x86's CPUID.
Yeah, no, this should not be copying CPUID flags - those flags should be
*translated* to independently defined flags which describe those
buffers.
This is even more important if we change our xstate_flags[] machinery.
This buffer should not use any kernel-internal definitions and
structures but be completely self-describing.
Vignesh, pls fix that.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Mar 14, 2024 at 04:25:44PM +0000, Willgerodt, Felix wrote:
> I am wondering if it wouldn't be easier for everyone if corefiles would just
> contain space for all possible XSAVE components?
You mean we should shuffle out from the kernel 8K of AMX state even if
nothing uses it or the machine doesn't even support it?
That's silly.
Please have a look at this:
+struct xfeat_component {
+ u32 xfeat_type;
+ u32 xfeat_sz;
+ u32 xfeat_off;
+ u32 xfeat_flags;
+} __packed;
What is wrong with having a blob of such xfeat_component things
describing the XSTATE buffer and parsing it in gdb?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 3/14/24 09:29, Borislav Petkov wrote:
>
>> That argument breaks down a bit on the flags though:
>>
>> xc.xfeat_flags = xstate_flags[i];
>>
>> Because it comes _directly_ from CPUID with zero filtering:
>>
>> cpuid_count(XSTATE_CPUID, i, &eax, &ebx, &ecx, &edx);
>> ...
>> xstate_flags[i] = ecx;
>>
>> So this layout is quite dependent on what's in x86's CPUID.
> Yeah, no, this should not be copying CPUID flags - those flags should be
> *translated* to independently defined flags which describe those
> buffers.
Ditto for:
xc.xfeat_type = i;
Right now, that's bound to CPUID and XSAVE. "feat_type==10" can only
ever be PKRU and that's derived from the XSAVE architecture.
If you want this to be extensible to things outside of the XSAVE
architecture, it needs to be something actually extensible and not
entangled with XSAVE.
In other words "xc.xfeat_type" can enumerate XSAVE state components
being in the dump, but it should not be limited to XSAVE. Just as an
example:
enum feat_type {
FEATURE_XSAVE_PKRU,
FEATURE_XSAVE__YMM,
FEATURE_XSAVE_BNDREGS,
FEATURE_XSAVE_BNDCSR,
...
RANDOM_STATE_NOT_XSAVE
};
See how feat_type==1 is PKRU and *NOT* feat_type==10? That opens the
door to RANDOM_STATE_NOT_XSAVE or anything else you want. This would be
_actually_ extensible.
On 3/14/24 8:37 AM, Dave Hansen wrote:
> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>> Add a new .note section containing type, size, offset and flags of
>> every xfeature that is present.
>
> Mechanically, I'd much rather have all of that info in the cover letter
> in the actual changelog instead.
>
> I'd also love to see a practical example of what an actual example core
> dump looks like on two conflicting systems:
>
> * Total XSAVE size
> * XCR0 value
> * XSTATE_BV from the core dump
> * XFEATURE offsets for each feature
I noticed this when I bought an AMD Ryzen 9 5900X based system for my desktop
running FreeBSD and found that the XSAVE core dump notes were not recognized
by GDB (FreeBSD dumps an XSAVE register set note that matches the same
layout of NT_X86_XSTATE used by Linux).
In particular, as the cover letter notes, on this AMD processor, there is
no "gap" for MPX, so the PKRU registers it provides are at a different offset
than on Intel CPUs. Furthermore, my reading of the SDM is that there is no
guarantee of architectural offsets of a given XSAVE feature and that software
should be querying CPUID to determine the layout.
FWIW, the relevant CPUID leaves for my AMD system:
XSAVE features (0xd/0):
XCR0 valid bit field mask = 0x0000000000000207
x87 state = true
SSE state = true
AVX state = true
MPX BNDREGS = false
MPX BNDCSR = false
AVX-512 opmask = false
AVX-512 ZMM_Hi256 = false
AVX-512 Hi16_ZMM = false
PKRU state = true
XTILECFG state = false
XTILEDATA state = false
bytes required by fields in XCR0 = 0x00000988 (2440)
bytes required by XSAVE/XRSTOR area = 0x00000988 (2440)
XSAVEOPT instruction = true
XSAVEC instruction = true
XGETBV instruction = true
XSAVES/XRSTORS instructions = true
XFD: extended feature disable supported = false
SAVE area size in bytes = 0x00000348 (840)
IA32_XSS valid bit field mask = 0x0000000000001800
PT state = false
PASID state = false
CET_U user state = true
CET_S supervisor state = true
HDC state = false
UINTR state = false
LBR state = false
HWP state = false
> Do you have any information about what other OSes are doing in this
> area? I thought Windows, for instance, was even less flexible about the
> XSAVE format than Linux is.
I have an implementation of a similar note for FreeBSD already as well as
patches for GDB to make use of the note (for FreeBSD) and generate it
via 'gcore' (on FreeBSD). However, I would very much like to reach
consensus on a shared format of the note to avoid gratuitous differences
between FreeBSD and Linux. The AMD folks were gracious enough to work on
the Linux kernel implementation. A bit more on that below though.
> Why didn't LWP cause this problem?
>
> From the cover letter:
>
>> But this patch series depends on heuristics based on the total XSAVE
>> register set size and the XCR0 mask to infer the layouts of the
>> various register blocks for core dumps, and hence, is not a foolproof
>> mechanism to determine the layout of the XSAVE area.
>
> It may not be theoretically foolproof. But I'm struggling to think of a
> case where it would matter in practice. Is there any CPU from any
> vendor where this is actually _needed_?
>
> Sure, it's ugly as hell, but these notes aren't going to be available
> universally _ever_, so it's not like the crummy heuristic code gets to
> go away.
>
> Have you seen the APX spec?
>
>>
> https://www.intel.com/content/www/us/en/developer/articles/technical/advanced-performance-extensions-apx.html
>
> It makes this even more fun because it adds a new XSAVE state component,
> but reuses the MPX offsets.
>
>> This information will be used by the debuggers to understand the XSAVE
>> layout of the machine where the core file is dumped, and to read XSAVE
>> registers, especially during cross-platform debugging.
>
> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
> Rather than come up with an XSAVE-specific ABI that depends on CPUID
> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
> should just bite the bullet and dump out (some of) the raw CPUID space.
So the current note I initially proposed and implemented for FreeBSD
(https://reviews.freebsd.org/D42136) and an initial patch set for GDB
(https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
do indeed dump a raw set of CPUID leaves. The version I have for FreeBSD
only dumps the raw leaf values for leaf 0x0d though the note format is
extensible should additional leaves be needed in the future. One of the
questions if we wanted to use a CPUID leaf note is which leaves to dump
(e.g. do you dump all of them, or do you just dump the subset that is
currently needed). Another quirky question is what to do about systems
with hetergeneous cores (E vs P for example). Currently those systems
use the same XSAVE layout across all cores, but other CPUID leaves do
already vary across cores on those systems. Some options considered for
that are to 1) use a separate note type for "other" core types (e.g.
a separate note type for "E" cores), or 2) make this new note a per-thread
note that matches the core the given thread was running on when the
register state stored in the process core dump was saved.
However, there are other wrinkles with the leaf approach. Namely, one
of the use cases that I currently have an ugly hack for in GDB is if
you are using gdb against a remote host running gdbserver and then use
'gcore' to generate a core dump. GDB needs to write out a NT_X86_XSTATE
note, but that note requires a layout. What GDB does today is just pick
a known Intel layout based on the XCR0 mask. However, GDB should ideally
start writing out whatever new note we adopt here, so if we dump raw
CPUID leaves it means extending the GDB remote protocol so we can query
the CPUID leaves from the remote host. On the other hand, if we choose a
more abstract format as proposed in this patch, the local GDB (or LLDB
or whatever) can generate whatever synthetic layout it wants to write
the local NT_X86_XSTATE. (NB: A relevant detail here is that the GDB
remote protocol does not pass the entire XSAVE state across as a block,
instead gdbserver parses individual register values for AVX, etc.
registers and those decoded register values are passed over the
protocol.)
Another question is potentially supporting compact XSAVE format in
for NT_X86_XSTATE. Today Linux has some complicated code to re-expand
the compat XSAVE format back out to the standard layout for ptrace() and
process core dumps. (FreeBSD doesn't yet make use of XSAVEC so we
haven't yet dealt with that problem.) The CPUID leaf approach would
allow us to support compact formats, though GDB would have to check for
the flag in the XSAVE header to decide which format to use, etc. On
the other hand, if we use the more abstract format in this patch, then
GDB wouldn't actually have to care at all. The kernel would just dump
out the "compact" form of the layout note and the direct XSAVEC output
as the note. (I will probably do this in FreeBSD eventually, but
using a policy knob (sysctl on FreeBSD) to control if it is enabled
that FreeBSD would default to on at some point in the future.)
I don't really have a strong preference for which type of note to dump
myself, I really just want to have a shared format so that there is
less work to do on the tools side (e.g. GDB).
Also, FWIW, I did try to raise this topic on LLDB's discussion forums
and got a simple "sounds ok" type response but no detailed feedback.
That was a proposal for the CPUID leaf note, but I suspect LLDB will
be fine with either approach. Certainly I will update GDB to work
with whatever approach is adopted.
--
John Baldwin
On 3/14/24 8:37 AM, Dave Hansen wrote:
> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>> But this patch series depends on heuristics based on the total XSAVE
>> register set size and the XCR0 mask to infer the layouts of the
>> various register blocks for core dumps, and hence, is not a foolproof
>> mechanism to determine the layout of the XSAVE area.
>
> It may not be theoretically foolproof. But I'm struggling to think of a
> case where it would matter in practice. Is there any CPU from any
> vendor where this is actually _needed_?
>
> Sure, it's ugly as hell, but these notes aren't going to be available
> universally _ever_, so it's not like the crummy heuristic code gets to
> go away.
I forgot to mention one other use case for this note.
Today (and before my earlier patch series to add the ugly heuristic),
when the NT_X86_XSTATE core dump note grows because a CPU vendor adds
a new xfeature and OS's which just dump the entire XSAVE state start
including that, GDB fails to parse the entire note.
Having a note describing the layout (whichever format is chosen),
allows GDB to still pull registers for features it understands from
the larger note and ignoring the parts of the XSAVE block it doesn't
understand.
--
John Baldwin
On 3/14/24 09:45, John Baldwin wrote:
> On 3/14/24 8:37 AM, Dave Hansen wrote:
>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>> Add a new .note section containing type, size, offset and flags of
>>> every xfeature that is present.
>>
>> Mechanically, I'd much rather have all of that info in the cover letter
>> in the actual changelog instead.
>>
>> I'd also love to see a practical example of what an actual example core
>> dump looks like on two conflicting systems:
>>
>> * Total XSAVE size
>> * XCR0 value
>> * XSTATE_BV from the core dump
>> * XFEATURE offsets for each feature
>
> I noticed this when I bought an AMD Ryzen 9 5900X based system for
> my desktop running FreeBSD and found that the XSAVE core dump notes
> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
> that matches the same layout of NT_X86_XSTATE used by Linux).
I just want to make sure that you heard what I asked. I'd like to see a
practical example of how the real-world enumeration changes between two
real world systems.
Is that possible?
Here's the raw CPUID data from the XSAVE region on my laptop:
> 0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
> 0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
> 0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
> 0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
> 0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
> 0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
> 0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
> 0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
> 0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
> 0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
> 0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
Could we get that for an impacted AMD system, please?
cpuid -1 --raw | grep " 0x0000000d "
should do it.
> In particular, as the cover letter notes, on this AMD processor,
> there is no "gap" for MPX, so the PKRU registers it provides are at a
> different offset than on Intel CPUs. Furthermore, my reading of the
> SDM is that there is no guarantee of architectural offsets of a given
> XSAVE feature and that software should be querying CPUID to determine
> the layout.
I'd say the SDM is an aspirational document. Intel _aspires_ to be able
to change the layouts whenever it wants. That doesn't mean that it
could actually pull it off in practice.
In practice, the offset are fixed and Intel can't change them.
> FWIW, the relevant CPUID leaves for my AMD system:
>
> XSAVE features (0xd/0):
> XCR0 valid bit field mask = 0x0000000000000207> x87 state = true
..
So, those actually aren't the relevant ones. We need EAX (size) and EBX
(user offset) from all of the sub-leaves.
>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>> should just bite the bullet and dump out (some of) the raw CPUID space.
>
> So the current note I initially proposed and implemented for FreeBSD
> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
> do indeed dump a raw set of CPUID leaves. The version I have for FreeBSD
> only dumps the raw leaf values for leaf 0x0d though the note format is
> extensible should additional leaves be needed in the future. One of the
> questions if we wanted to use a CPUID leaf note is which leaves to dump
> (e.g. do you dump all of them, or do you just dump the subset that is
> currently needed).
You dump what is needed and add to the dump over time.
> Another quirky question is what to do about systems with hetergeneous
> cores (E vs P for example).
That's irrelevant for now. The cores may be heterogeneous but the
userspace ISA and (and thus XSAVE formats) are identical. If they're
not, then we have bigger problems on our hands.
> Currently those systems use the same XSAVE layout across all cores,
> but other CPUID leaves do already vary across cores on those systems.
There shouldn't be any CPUID leaves that differ _and_ matter to
userspace and thus core dumps.
> However, there are other wrinkles with the leaf approach. Namely, one
> of the use cases that I currently have an ugly hack for in GDB is if
> you are using gdb against a remote host running gdbserver and then use
> 'gcore' to generate a core dump. GDB needs to write out a NT_X86_XSTATE
> note, but that note requires a layout. What GDB does today is just pick
> a known Intel layout based on the XCR0 mask. However, GDB should ideally
> start writing out whatever new note we adopt here, so if we dump raw
> CPUID leaves it means extending the GDB remote protocol so we can query
> the CPUID leaves from the remote host. On the other hand, if we choose a
> more abstract format as proposed in this patch, the local GDB (or LLDB
> or whatever) can generate whatever synthetic layout it wants to write
> the local NT_X86_XSTATE. (NB: A relevant detail here is that the GDB
> remote protocol does not pass the entire XSAVE state across as a block,
> instead gdbserver parses individual register values for AVX, etc.
> registers and those decoded register values are passed over the
> protocol.)
So the gdb side says, "Give me PKRU" and the remote side parses the
XSAVE image, finds PKRU, and sends it over the wire?
> Another question is potentially supporting compact XSAVE format in
> for NT_X86_XSTATE. Today Linux has some complicated code to re-expand
> the compat XSAVE format back out to the standard layout for ptrace() and
> process core dumps.
Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
us at all. We still intermingle user and supervisor state and that
needs to get repacked _anyway_.
In other words, no matter what we do, it's going to be complicated
because the userspace buffer can't have supervisor state and the kernel
buffer does have it. The compacted format mismatch is the least of our
problems.
> (FreeBSD doesn't yet make use of XSAVEC so we
> haven't yet dealt with that problem.)
.. or XSAVES, which is actually the most relevant here.
Backing up... there are two approaches here:
1. Dump out raw x86-specific gunk, aka. CPUID contents itself. There
are a billion ways to do this and lots of complications, including
the remote protocol implications
or
2. Define an abstract format that works anywhere, not just on x86 and
not just for XSAVE.
There's no (sane) middle ground. The implementation here (in this
patch) is fundamentally x86-specific and pretends to be some kind of
abstracted x86-independent format.
On 3/14/24 10:10 AM, Dave Hansen wrote:
> On 3/14/24 09:45, John Baldwin wrote:
>> On 3/14/24 8:37 AM, Dave Hansen wrote:
>>> On 3/14/24 04:23, Vignesh Balasubramanian wrote:
>>>> Add a new .note section containing type, size, offset and flags of
>>>> every xfeature that is present.
>>>
>>> Mechanically, I'd much rather have all of that info in the cover letter
>>> in the actual changelog instead.
>>>
>>> I'd also love to see a practical example of what an actual example core
>>> dump looks like on two conflicting systems:
>>>
>>> * Total XSAVE size
>>> * XCR0 value
>>> * XSTATE_BV from the core dump
>>> * XFEATURE offsets for each feature
>>
>> I noticed this when I bought an AMD Ryzen 9 5900X based system for
>> my desktop running FreeBSD and found that the XSAVE core dump notes
>> were not recognized by GDB (FreeBSD dumps an XSAVE register set note
>> that matches the same layout of NT_X86_XSTATE used by Linux).
>
> I just want to make sure that you heard what I asked. I'd like to see a
> practical example of how the real-world enumeration changes between two
> real world systems.
>
> Is that possible?
>
> Here's the raw CPUID data from the XSAVE region on my laptop:
>
>> 0x0000000d 0x00: eax=0x000002e7 ebx=0x00000a88 ecx=0x00000a88 edx=0x00000000
>> 0x0000000d 0x01: eax=0x0000000f ebx=0x00000998 ecx=0x00003900 edx=0x00000000
>> 0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x05: eax=0x00000040 ebx=0x00000440 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x06: eax=0x00000200 ebx=0x00000480 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x07: eax=0x00000400 ebx=0x00000680 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x08: eax=0x00000080 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x09: eax=0x00000008 ebx=0x00000a80 ecx=0x00000000 edx=0x00000000
>> 0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>> 0x0000000d 0x0d: eax=0x00000008 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
>
> Could we get that for an impacted AMD system, please?
>
> cpuid -1 --raw | grep " 0x0000000d "
>
> should do it.
0x0000000d 0x00: eax=0x00000207 ebx=0x00000988 ecx=0x00000988 edx=0x00000000
0x0000000d 0x01: eax=0x0000000f ebx=0x00000348 ecx=0x00001800 edx=0x00000000
0x0000000d 0x02: eax=0x00000100 ebx=0x00000240 ecx=0x00000000 edx=0x00000000
0x0000000d 0x09: eax=0x00000008 ebx=0x00000980 ecx=0x00000000 edx=0x00000000
0x0000000d 0x0b: eax=0x00000010 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
0x0000000d 0x0c: eax=0x00000018 ebx=0x00000000 ecx=0x00000001 edx=0x00000000
Here, I think the ebx value for the 0x09 leaf (PKRU) is the relevant difference
here, it is 0xa80 on your laptop and 0x980 on the AMD CPU. (This is the
missing MPX gap on AMD.)
>>> This is pretty close to just a raw dump of the XSAVE CPUID leaves.
>>> Rather than come up with an XSAVE-specific ABI that depends on CPUID
>>> *ANYWAY* (because it dumps the "flags" register aka. ECX), maybe we
>>> should just bite the bullet and dump out (some of) the raw CPUID space.
>>
>> So the current note I initially proposed and implemented for FreeBSD
>> (https://reviews.freebsd.org/D42136) and an initial patch set for GDB
>> (https://sourceware.org/pipermail/gdb-patches/2023-October/203083.html)
>> do indeed dump a raw set of CPUID leaves. The version I have for FreeBSD
>> only dumps the raw leaf values for leaf 0x0d though the note format is
>> extensible should additional leaves be needed in the future. One of the
>> questions if we wanted to use a CPUID leaf note is which leaves to dump
>> (e.g. do you dump all of them, or do you just dump the subset that is
>> currently needed).
>
> You dump what is needed and add to the dump over time.
That is what I started with, yes, but am attempting to anticipate future
problems in my list of caveats.
>> Another quirky question is what to do about systems with hetergeneous
>> cores (E vs P for example).
> That's irrelevant for now. The cores may be heterogeneous but the
> userspace ISA and (and thus XSAVE formats) are identical. If they're
> not, then we have bigger problems on our hands.
Yes, I agree on the bigger problems and hope we don't have to solve
them.
>> Currently those systems use the same XSAVE layout across all cores,
>> but other CPUID leaves do already vary across cores on those systems.
>
> There shouldn't be any CPUID leaves that differ _and_ matter to
> userspace and thus core dumps.
Today that is true, yes. I'm fine with making that tradeoff (along
with only dumping a subset of leaves) so long as the consensus is that
is an acceptable tradeoff to make.
>> However, there are other wrinkles with the leaf approach. Namely, one
>> of the use cases that I currently have an ugly hack for in GDB is if
>> you are using gdb against a remote host running gdbserver and then use
>> 'gcore' to generate a core dump. GDB needs to write out a NT_X86_XSTATE
>> note, but that note requires a layout. What GDB does today is just pick
>> a known Intel layout based on the XCR0 mask. However, GDB should ideally
>> start writing out whatever new note we adopt here, so if we dump raw
>> CPUID leaves it means extending the GDB remote protocol so we can query
>> the CPUID leaves from the remote host. On the other hand, if we choose a
>> more abstract format as proposed in this patch, the local GDB (or LLDB
>> or whatever) can generate whatever synthetic layout it wants to write
>> the local NT_X86_XSTATE. (NB: A relevant detail here is that the GDB
>> remote protocol does not pass the entire XSAVE state across as a block,
>> instead gdbserver parses individual register values for AVX, etc.
>> registers and those decoded register values are passed over the
>> protocol.)
>
> So the gdb side says, "Give me PKRU" and the remote side parses the
> XSAVE image, finds PKRU, and sends it over the wire?
Yes.
>> Another question is potentially supporting compact XSAVE format in
>> for NT_X86_XSTATE. Today Linux has some complicated code to re-expand
>> the compat XSAVE format back out to the standard layout for ptrace() and
>> process core dumps.
>
> Yeah, but supporting the compacted format in NT_X86_XSTATE doesn't help
> us at all. We still intermingle user and supervisor state and that
> needs to get repacked _anyway_.
Fair enough.
> In other words, no matter what we do, it's going to be complicated
> because the userspace buffer can't have supervisor state and the kernel
> buffer does have it. The compacted format mismatch is the least of our
> problems.
>
>> (FreeBSD doesn't yet make use of XSAVEC so we
>> haven't yet dealt with that problem.)
>
> ... or XSAVES, which is actually the most relevant here.
>
> Backing up... there are two approaches here:
>
> 1. Dump out raw x86-specific gunk, aka. CPUID contents itself. There
> are a billion ways to do this and lots of complications, including
> the remote protocol implications
> or
> 2. Define an abstract format that works anywhere, not just on x86 and
> not just for XSAVE.
>
> There's no (sane) middle ground. The implementation here (in this
> patch) is fundamentally x86-specific and pretends to be some kind of
> abstracted x86-independent format.
Well, are there other register notes that could benefit from an approach
like this? Most other register notes I'm aware of on various architectures
either have a fixed layout (like the typical general purpose register notes),
or they have a fixed set of registers but the size of individual registers
can vary (thinks like SME or RISC-V's vector extension). XSAVE is the only
one I'm aware of that packs multiple register sets into a single note.
To step back a bit, another approach that could be taken (and I'm not sure
it is worth it at this point) would to stop dumping a single XSAVE note
and dump a separate register note for each feature. That is, dump a note
for AVX (the upper bits of ymmX), a note for PKRU, etc. I think if I had
to pick a strategy at the very beginning that's what I would choose now,
but this isn't the very beginning and that sort of change is likely too
disruptive. (This approach is what happens on other arches today in effect,
e.g. on AArch64.)
--
John Baldwin
Hi Vignesh,
Vignesh Balasubramanian <[email protected]> writes:
> Add a new .note section containing type, size, offset and flags of
> every xfeature that is present.
>
> This information will be used by the debuggers to understand the XSAVE
> layout of the machine where the core file is dumped, and to read XSAVE
> registers, especially during cross-platform debugging.
>
> Co-developed-by: Jini Susan George <[email protected]>
> Signed-off-by: Jini Susan George <[email protected]>
> Signed-off-by: Vignesh Balasubramanian <[email protected]>
> ---
> arch/Kconfig | 9 +++
> arch/powerpc/Kconfig | 1 +
> arch/powerpc/include/asm/elf.h | 2 -
> arch/x86/Kconfig | 1 +
> arch/x86/include/asm/elf.h | 7 +++
> arch/x86/kernel/fpu/xstate.c | 101 +++++++++++++++++++++++++++++++++
> include/linux/elf.h | 2 +-
> include/uapi/linux/elf.h | 1 +
> 8 files changed, 121 insertions(+), 3 deletions(-)
IMHO you should split the changes to replace ARCH_HAVE_EXTRA_ELF_NOTES
with CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES into a lead-up patch.
cheers
> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Donnerstag, 14. März 2024 17:34
> To: Willgerodt, Felix <[email protected]>
> Cc: Vignesh Balasubramanian <[email protected]>; linux-
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 0/1] Add XSAVE layout description to Core files for
> debuggers to support varying XSAVE layouts
>
> On Thu, Mar 14, 2024 at 04:25:44PM +0000, Willgerodt, Felix wrote:
> > I am wondering if it wouldn't be easier for everyone if corefiles would just
> > contain space for all possible XSAVE components?
>
> You mean we should shuffle out from the kernel 8K of AMX state even if
> nothing uses it or the machine doesn't even support it?
>
> That's silly.
I don't think it is so silly ;) Let me elaborate.
I was mostly trying to suggest an "easy fix" for the MPX issue.
I also said that we could drop features from the end of the list if the CPU
doesn't support them. Yes it is still wasteful, but kind of was the status quo
in the past afaik. And yes, if new states come after AMX it could get more wasteful.
Though in another email here Dave said that the offsets in XSAVE
are "fixed in practice". That makes me wonder. Even if we add CPUID to corefiles,
will Intel CPUs benefit? (I am not saying it isn't worth changing even if Intel CPUs don't.)
E.g. will we actually get rid of the 8k that you are complaining about?
I don't know the answer to be honest. If XSAVE offsets are "fixed in practice",
I don't see how we would benefit.
And how would you check that "nothing uses AMX", if the state exist according
to CPUID?
> Please have a look at this:
>
> +struct xfeat_component {
> + u32 xfeat_type;
> + u32 xfeat_sz;
> + u32 xfeat_off;
> + u32 xfeat_flags;
> +} __packed;
>
> What is wrong with having a blob of such xfeat_component things
> describing the XSTATE buffer and parsing it in gdb?
I didn't say it is wrong or that I am opposing it. In fact CPUID in
corefiles could be useful for other scenarios as well.
Though all consumers will need to add to their implementation.
I assume LLDB and GDB aren't the only consumers.
Felix
Intel Deutschland GmbH
Registered Address: Am Campeon 10, 85579 Neubiberg, Germany
Tel: +49 89 99 8853-0, http://www.intel.de <http://www.intel.de>
Managing Directors: Christin Eisenschmid, Sharon Heck, Tiffany Doon Silva
Chairperson of the Supervisory Board: Nicole Lau
Registered Office: Munich
Commercial Register: Amtsgericht Muenchen HRB 186928
Hi Vignesh,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.8 next-20240315]
[cannot apply to kees/for-next/execve tip/x86/core powerpc/next powerpc/fixes]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vignesh-Balasubramanian/x86-elf-Add-a-new-note-section-containing-Xfeatures-information-to-x86-core-files/20240314-192650
base: linus/master
patch link: https://lore.kernel.org/r/20240314112359.50713-2-vigbalas%40amd.com
patch subject: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
config: i386-buildonly-randconfig-001-20240315 (https://download.01.org/0day-ci/archive/20240315/[email protected]/config)
compiler: clang version 17.0.6 (https://github.com/llvm/llvm-project 6009708b4367171ccdbf4b5905cb6a803753fe18)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240315/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> arch/x86/kernel/fpu/xstate.c:1858:8: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1858 | if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
| ^
arch/x86/kernel/fpu/xstate.c:1869:8: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1869 | if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
| ^
arch/x86/kernel/fpu/xstate.c:1899:7: error: call to undeclared function 'dump_emit'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1899 | if (!dump_emit(cprm, &en, sizeof(en)))
| ^
>> arch/x86/kernel/fpu/xstate.c:1903:7: error: call to undeclared function 'dump_align'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
1903 | if (!dump_align(cprm, 4))
| ^
4 errors generated.
vim +/dump_emit +1858 arch/x86/kernel/fpu/xstate.c
1846
1847 struct xfeat_component xc;
1848 int num_records = 0;
1849 int i;
1850
1851 /* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
1852 for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
1853 xc.xfeat_type = i;
1854 xc.xfeat_sz = xstate_sizes[i];
1855 xc.xfeat_off = xstate_offsets[i];
1856 xc.xfeat_flags = xstate_flags[i];
1857
> 1858 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
1859 return 0;
1860 num_records++;
1861 }
1862
1863 for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
1864 xc.xfeat_type = i;
1865 xc.xfeat_sz = xstate_sizes[i];
1866 xc.xfeat_off = xstate_offsets[i];
1867 xc.xfeat_flags = xstate_flags[i];
1868
1869 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
1870 return 0;
1871 num_records++;
1872 }
1873
1874 return num_records;
1875 }
1876
1877 static int get_xsave_desc_size(void)
1878 {
1879 /* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
1880 int xfeatures_count = 2;
1881 int i;
1882
1883 for_each_extended_xfeature(i, fpu_user_cfg.max_features)
1884 xfeatures_count++;
1885
1886 return xfeatures_count * (sizeof(struct xfeat_component));
1887 }
1888
1889 int elf_coredump_extra_notes_write(struct coredump_params *cprm)
1890 {
1891 const char *owner_name = "LINUX";
1892 int num_records = 0;
1893 struct elf_note en;
1894
1895 en.n_namesz = strlen(owner_name) + 1;
1896 en.n_descsz = get_xsave_desc_size();
1897 en.n_type = NT_X86_XSAVE_LAYOUT;
1898
1899 if (!dump_emit(cprm, &en, sizeof(en)))
1900 return 1;
1901 if (!dump_emit(cprm, owner_name, en.n_namesz))
1902 return 1;
> 1903 if (!dump_align(cprm, 4))
1904 return 1;
1905
1906 num_records = dump_xsave_layout_desc(cprm);
1907 if (!num_records) {
1908 pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
1909 return 1;
1910 }
1911
1912 /* Total size should be equal to the number of records */
1913 if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
1914 pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
1915 return 1;
1916 }
1917
1918 if (!dump_align(cprm, 4))
1919 return 1;
1920
1921 return 0;
1922 }
1923
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
Hi Vignesh,
kernel test robot noticed the following build errors:
[auto build test ERROR on linus/master]
[also build test ERROR on v6.8 next-20240315]
[cannot apply to kees/for-next/execve tip/x86/core powerpc/next powerpc/fixes]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Vignesh-Balasubramanian/x86-elf-Add-a-new-note-section-containing-Xfeatures-information-to-x86-core-files/20240314-192650
base: linus/master
patch link: https://lore.kernel.org/r/20240314112359.50713-2-vigbalas%40amd.com
patch subject: [PATCH 1/1] x86/elf: Add a new .note section containing Xfeatures information to x86 core files
config: x86_64-randconfig-122-20240315 (https://download.01.org/0day-ci/archive/20240315/[email protected]/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240315/[email protected]/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
arch/x86/kernel/fpu/xstate.c: In function 'dump_xsave_layout_desc':
>> arch/x86/kernel/fpu/xstate.c:1858:22: error: implicit declaration of function 'dump_emit'; did you mean 'dir_emit'? [-Werror=implicit-function-declaration]
1858 | if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
| ^~~~~~~~~
| dir_emit
arch/x86/kernel/fpu/xstate.c: In function 'elf_coredump_extra_notes_write':
>> arch/x86/kernel/fpu/xstate.c:1903:14: error: implicit declaration of function 'dump_align'; did you mean 'dump_mapping'? [-Werror=implicit-function-declaration]
1903 | if (!dump_align(cprm, 4))
| ^~~~~~~~~~
| dump_mapping
cc1: some warnings being treated as errors
vim +1858 arch/x86/kernel/fpu/xstate.c
1846
1847 struct xfeat_component xc;
1848 int num_records = 0;
1849 int i;
1850
1851 /* XFEATURE_FPU and XFEATURE_SSE, both are fixed legacy states. */
1852 for (i = 0; i < FIRST_EXTENDED_XFEATURE; i++) {
1853 xc.xfeat_type = i;
1854 xc.xfeat_sz = xstate_sizes[i];
1855 xc.xfeat_off = xstate_offsets[i];
1856 xc.xfeat_flags = xstate_flags[i];
1857
> 1858 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
1859 return 0;
1860 num_records++;
1861 }
1862
1863 for_each_extended_xfeature(i, fpu_user_cfg.max_features) {
1864 xc.xfeat_type = i;
1865 xc.xfeat_sz = xstate_sizes[i];
1866 xc.xfeat_off = xstate_offsets[i];
1867 xc.xfeat_flags = xstate_flags[i];
1868
1869 if (!dump_emit(cprm, &xc, sizeof(struct xfeat_component)))
1870 return 0;
1871 num_records++;
1872 }
1873
1874 return num_records;
1875 }
1876
1877 static int get_xsave_desc_size(void)
1878 {
1879 /* XFEATURE_FP and XFEATURE_SSE, both are fixed legacy states */
1880 int xfeatures_count = 2;
1881 int i;
1882
1883 for_each_extended_xfeature(i, fpu_user_cfg.max_features)
1884 xfeatures_count++;
1885
1886 return xfeatures_count * (sizeof(struct xfeat_component));
1887 }
1888
1889 int elf_coredump_extra_notes_write(struct coredump_params *cprm)
1890 {
1891 const char *owner_name = "LINUX";
1892 int num_records = 0;
1893 struct elf_note en;
1894
1895 en.n_namesz = strlen(owner_name) + 1;
1896 en.n_descsz = get_xsave_desc_size();
1897 en.n_type = NT_X86_XSAVE_LAYOUT;
1898
1899 if (!dump_emit(cprm, &en, sizeof(en)))
1900 return 1;
1901 if (!dump_emit(cprm, owner_name, en.n_namesz))
1902 return 1;
> 1903 if (!dump_align(cprm, 4))
1904 return 1;
1905
1906 num_records = dump_xsave_layout_desc(cprm);
1907 if (!num_records) {
1908 pr_warn("Error adding XSTATE layout ELF note. XSTATE buffer in the core file will be unparseable.");
1909 return 1;
1910 }
1911
1912 /* Total size should be equal to the number of records */
1913 if ((sizeof(struct xfeat_component) * num_records) != en.n_descsz) {
1914 pr_warn("Error adding XSTATE layout ELF note. The size of the .note section does not match with the total size of the records.");
1915 return 1;
1916 }
1917
1918 if (!dump_align(cprm, 4))
1919 return 1;
1920
1921 return 0;
1922 }
1923
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Thu, Mar 14 2024 at 17:29, Borislav Petkov wrote:
> On Thu, Mar 14, 2024 at 09:19:15AM -0700, Dave Hansen wrote:
>> Are you envisioning an *XSAVE* state component that's not described by
>> CPUID?
>
> I want to be prepared. You can imagine some of the short cuts and
> corners cutting hw guys would do so I'd want to be prepared there and
> not tie this to CPUID.
Anything which is not enumerated in CPUID does not exist in
XSTATE. Period and end of story.
Stop paving the way for hardware people to have an excuse for being
stupid.
Aside of that XSTATE is a dead end as we all know.
Thanks,
tglx
On Sat, Mar 16, 2024 at 12:51:28AM +0100, Thomas Gleixner wrote:
> Anything which is not enumerated in CPUID does not exist in
> XSTATE. Period and end of story.
But why not have a simple buffer definition which doesn't need CPUID?
Also, doing the CPUID thing would need extending the gdb remote protocol
as explained here:
https://lore.kernel.org/r/[email protected]
The simple buffer layout won't.
So regardless of where hw is going, I think a simple buffer definition
is always better.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
> Otherwise looks reasonable, though I see Dave has feedback to address
> too. :)
>
> Thanks for working on this!
>
> -Kees
Thank you for the review.
I will address all this on next version.
thanks,
vigneshbalu.
> --
> Kees Cook
> IMHO you should split the changes to replace ARCH_HAVE_EXTRA_ELF_NOTES
> with CONFIG_ARCH_HAVE_EXTRA_ELF_NOTES into a lead-up patch.
>
> cheers
Thanks for the input and i will take care in next version.
regards,
vigneshbalu.