2019-01-10 12:27:26

by Lianbo Jiang

Subject: [PATCH 0/2 v6] kdump,vmcoreinfo: Export the value of sme mask to vmcoreinfo

This patchset does two things:
a. add a new document for vmcoreinfo

This document lists some of the variables that are exported to vmcoreinfo
and briefly describes what they indicate. It should be instructive for
people who are not familiar with vmcoreinfo.

b. export the value of sme mask to vmcoreinfo

For AMD machines with the SME feature, makedumpfile needs to know whether
the crashed kernel was encrypted or not. If SME is enabled in the first
kernel, the crashed kernel's page table entries (pgd/pud/pmd/pte) contain
the memory encryption mask, so makedumpfile needs to remove the SME mask
to obtain the true physical address.

Changes since v1:
1. No need to export a kernel-internal mask to userspace, so copy the
value of sme_me_mask to a local variable 'sme_mask' and write the value
of sme_mask to vmcoreinfo.
2. Add comment for the code.
3. Improve the patch log.
4. Add the vmcoreinfo documentation.

Changes since v2:
1. Improve the vmcoreinfo document and add more descriptions of the
exported variables.
2. Fix spelling errors in the document.

Changes since v3:
1. Further improve the vmcoreinfo document and make it clearer and
easier to read.
2. Move sme_mask comments in the code to the vmcoreinfo document.
3. Improve patch log.

Changes since v4:
1. Remove the command that dumps the VMCOREINFO contents from this
document.
2. Merge the 'PG_buddy' and 'PG_offline' into the PG_* flag in this
document.
3. Correct some of the mistakes in this document.

Changes since v5:
1. Improve patch log.

Lianbo Jiang (2):
kdump: add the vmcoreinfo documentation
kdump,vmcoreinfo: Export the value of sme mask to vmcoreinfo

Documentation/kdump/vmcoreinfo.txt | 500 +++++++++++++++++++++++++++++
arch/x86/kernel/machine_kexec_64.c | 3 +
2 files changed, 503 insertions(+)
create mode 100644 Documentation/kdump/vmcoreinfo.txt

--
2.17.1
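
To illustrate item b. of the cover letter: a minimal sketch, not part of
the patchset, of how a user-space tool might drop the SME mask from a raw
page table entry before extracting the physical address. The helper names
and PTE_ADDR_MASK are hypothetical; the address field in bits 12..51
assumes an x86_64 entry that maps a 4 KiB page.

```c
#include <stdint.h>

/*
 * Hypothetical helper: clear the SME encryption bit(s) from a raw page
 * table entry. sme_mask is the value read from vmcoreinfo; when it is 0
 * (SME not active) this is a no-op.
 */
static inline uint64_t strip_sme_mask(uint64_t entry, uint64_t sme_mask)
{
	return entry & ~sme_mask;
}

/* Assumption: x86_64 entry for a 4 KiB page, address in bits 12..51. */
#define PTE_ADDR_MASK 0x000ffffffffff000ULL

/* Physical address of the page an entry points to, with the mask removed. */
static inline uint64_t pte_to_phys(uint64_t pte, uint64_t sme_mask)
{
	return strip_sme_mask(pte, sme_mask) & PTE_ADDR_MASK;
}
```

Without the first step, bit 47 would survive the address mask and the
tool would read from a physical address that does not exist.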



2019-01-10 13:51:38

by Lianbo Jiang

Subject: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

This document lists some of the variables that are exported to vmcoreinfo
and briefly describes what they indicate. It should be instructive for
people who are not familiar with vmcoreinfo.

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
---
Documentation/kdump/vmcoreinfo.txt | 500 +++++++++++++++++++++++++++++
1 file changed, 500 insertions(+)
create mode 100644 Documentation/kdump/vmcoreinfo.txt

diff --git a/Documentation/kdump/vmcoreinfo.txt b/Documentation/kdump/vmcoreinfo.txt
new file mode 100644
index 000000000000..8e444586b87b
--- /dev/null
+++ b/Documentation/kdump/vmcoreinfo.txt
@@ -0,0 +1,500 @@
+================================================================
+ VMCOREINFO
+================================================================
+
+=======================
+What is the VMCOREINFO?
+=======================
+
+VMCOREINFO is a special ELF note section. It contains various
+information from the kernel like structure size, page size, symbol
+values, field offsets, etc. These data are packed into an ELF note
+section and used by user-space tools like crash and makedumpfile to
+analyze a kernel's memory layout.
+
+================
+Common variables
+================
+
+init_uts_ns.name.release
+------------------------
+
+The version of the Linux kernel. Used to find the corresponding source
+code from which the kernel has been built.
+
+PAGE_SIZE
+---------
+
+The size of a page. It is the smallest unit of data for memory
+management in kernel. It is usually 4096 bytes and a page is aligned
+on 4096 bytes. Used for computing page addresses.
+
+init_uts_ns
+-----------
+
+This is the UTS namespace, which is used to isolate two specific
+elements of the system that relate to the uname(2) system call. The UTS
+namespace is named after the data structure used to store information
+returned by the uname(2) system call.
+
+User-space tools can get the kernel name, host name, kernel release
+number, kernel version, architecture name and OS type from it.
+
+node_online_map
+---------------
+
+An array node_states[N_ONLINE] which represents the set of online nodes
+in a system, one bit position per node number. Used to keep track of
+which nodes are in the system and online.
+
+swapper_pg_dir
+--------------
+
+The global page directory pointer of the kernel. Used to translate
+virtual to physical addresses.
+
+_stext
+------
+
+Defines the beginning of the text section. In general, _stext indicates
+the kernel start address. Used to convert a virtual address from the
+direct kernel map to a physical address.
+
+vmap_area_list
+--------------
+
+Stores the virtual area list. makedumpfile can get the vmalloc start
+value from this variable. This value is necessary for vmalloc translation.
+
+mem_map
+-------
+
+Physical addresses are translated to struct pages by treating them as
+an index into the mem_map array. Right-shifting a physical address
+PAGE_SHIFT bits converts it into a page frame number which is an index
+into that mem_map array.
+
+Used to map an address to the corresponding struct page.
+
+contig_page_data
+----------------
+
+Makedumpfile can get the pglist_data structure from this symbol, which
+is used to describe the memory layout.
+
+User-space tools use this to exclude free pages when dumping memory.
+
+mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
+--------------------------------------------------------------------------
+
+The address of the mem_section array, its length, structure size, and
+the section_mem_map offset.
+
+It exists in the sparse memory mapping model and is somewhat similar
+to the mem_map variable; both of them are used to translate an
+address.
+
+page
+----
+
+The size of a page structure. struct page is an important data structure
+and it is widely used to compute the contiguous memory.
+
+pglist_data
+-----------
+
+The size of a pglist_data structure. This value will be used to check
+if the pglist_data structure is valid. It is also used for checking the
+memory type.
+
+zone
+----
+
+The size of a zone structure. This value is often used to check if the
+zone structure has been found. It is also used for excluding free pages.
+
+free_area
+---------
+
+The size of a free_area structure. It indicates whether the free_area
+structure is valid or not. Useful for excluding free pages.
+
+list_head
+---------
+
+The size of a list_head structure. Used when iterating lists in a
+post-mortem analysis session.
+
+nodemask_t
+----------
+
+The size of a nodemask_t type. Used to compute the number of online
+nodes.
+
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
+ compound_order|compound_head)
+-------------------------------------------------------------------
+
+User-space tools can compute their values based on the offset of these
+variables. The variables are helpful to exclude unnecessary pages.
+
+(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|node_
+ spanned_pages|node_id)
+-------------------------------------------------------------------
+
+On NUMA machines, each NUMA node has a pg_data_t to describe its memory
+layout. On UMA machines there is a single pglist_data which describes the
+whole memory.
+
+These values are used to check the memory type, and they are also helpful
+to compute the virtual address for memory map.
+
+(zone, free_area|vm_stat|spanned_pages)
+---------------------------------------
+
+Each node is divided into a number of blocks called zones which
+represent ranges within memory. A zone is described by a structure zone.
+Each zone type is suitable for a different type of usage.
+
+User-space tools can compute required values based on the offset of these
+variables.
+
+(free_area, free_list)
+----------------------
+
+Offset of the free_list's member. This value is used to compute the number
+of free pages.
+
+Each zone has a free_area structure array called free_area[MAX_ORDER].
+The fields in this structure are simple, the free_list represents a linked
+list of free page blocks.
+
+(list_head, next|prev)
+----------------------
+
+Offsets of the list_head's members. list_head is used to define a
+circular linked list. User-space tools need these in order to traverse
+lists.
+
+(vmap_area, va_start|list)
+--------------------------
+
+Offsets of the vmap_area's members. They indicate the vmalloc layer
+information. Makedumpfile gets the start address of the vmalloc region.
+
+(zone.free_area, MAX_ORDER)
+---------------------------
+
+It indicates the maximum number of the array free_area. This macro is
+used by the zone buddy allocator. User-space tools use this value to
+iterate the free_area.
+
+log_buf
+-------
+
+Console output is written to the ring buffer log_buf at index
+log_first_idx. Used to get the kernel log.
+
+log_buf_len
+-----------
+
+Length of a log_buf. Used to read the number of strings from the
+log_buf.
+
+log_first_idx
+-------------
+
+Index of the first record stored in the buffer log_buf. Used by
+user-space tools to read the strings in the log_buf.
+
+clear_idx
+---------
+
+The index of the next printk() record to read after the last clear
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, like issued by 'dmesg -c'. Used by user-space
+tools to dump the dmesg log.
+
+log_next_idx
+------------
+
+The index of the next record to store in the buffer log_buf. Used to
+compute the index of the current string position.
+
+printk_log
+----------
+
+The size of a structure printk_log. Used to compute the size of
+messages, and extract dmesg log. It can output human readable text.
+Encapsulate header information for log_buf, such as timestamp, syslog
+level, etc.
+
+(printk_log, ts_nsec|len|text_len|dict_len)
+-------------------------------------------
+
+It represents field offsets in struct printk_log. User space tools can
+parse it and check whether the values of printk_log's members have been
+changed.
+
+(free_area.free_list, MIGRATE_TYPES)
+------------------------------------
+
+The number of migrate types for pages. The free_list is an array with
+one list per migrate type, so makedumpfile needs to know the length of
+the array when it computes the number of free pages.
+
+NR_FREE_PAGES
+-------------
+
+On linux-2.6.21 or later, the number of free_pages is in
+vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
+
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison
+|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
+|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+-----------------------------------------------------------------
+
+Page attributes. These flags are used to filter various unnecessary
+pages.
+
+HUGETLB_PAGE_DTOR
+-----------------
+
+The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
+excludes these pages.
+
+======
+x86_64
+======
+
+phys_base
+---------
+
+Used to convert the virtual address of an exported kernel symbol to its
+physical address.
+
+init_top_pgt
+------------
+
+Used to walk through the whole page table and convert virtual addresses
+to physical addresses. The init_top_pgt is somewhat similar to the
+swapper_pg_dir, but it is only used in x86_64.
+
+pgtable_l5_enabled
+------------------
+
+User-space tools need to know whether the crash kernel was in 5-level
+paging mode.
+
+node_data
+---------
+
+This is a struct pglist_data array and stores all numa nodes
+information. Makedumpfile gets the pglist_data structure from it.
+
+(node_data, MAX_NUMNODES)
+-------------------------
+
+The maximum number of nodes in the system.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+KERNEL_IMAGE_SIZE
+-----------------
+
+Currently unused by Makedumpfile. Used to compute the module virtual
+address by Crash.
+
+sme_mask
+--------
+
+For AMD machines with the SME feature, it indicates the secure memory
+encryption mask. Makedumpfile needs to know whether the crash kernel
+was encrypted. If SME is enabled in the first kernel, the crash
+kernel's page table entries (pgd/pud/pmd/pte) contain the memory
+encryption mask, which must be removed to obtain the true physical
+address.
+
+Currently, sme_mask stores the value of sme_me_mask (bit 47). If needed,
+the bit (sme_mask) might be redefined in the future, but bit 63 will
+be reserved.
+
+For example:
+[ misc ][ enc bit ][ other misc SME info ]
+0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
+63   59   55   51   47   43   39   35   31   27   ... 3
+
+======
+x86_32
+======
+
+X86_PAE
+-------
+
+Denotes whether physical address extensions are enabled. It has the cost
+of more page table lookup overhead, and also consumes more page table
+space per process. Used to check whether PAE was enabled in the crash
+kernel when converting virtual addresses to physical addresses.
+
+====
+ia64
+====
+
+pgdat_list|(pgdat_list, MAX_NUMNODES)
+-------------------------------------
+
+pg_data_t array storing all numa nodes information. MAX_NUMNODES
+indicates the number of the nodes.
+
+node_memblk|(node_memblk, NR_NODE_MEMBLKS)
+------------------------------------------
+
+List of node memory chunks. Filled when parsing SRAT table to obtain
+information about memory nodes. NR_NODE_MEMBLKS indicates the number
+of node memory chunks.
+
+These values are used to compute the number of nodes in the crash kernel.
+
+node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
+----------------------------------------------------------------
+
+The size of a struct node_memblk_s and the offsets of the
+node_memblk_s's members. Used to compute the number of nodes.
+
+PGTABLE_3|PGTABLE_4
+-------------------
+
+User-space tools need to know whether the crash kernel was in 3-level or
+4-level paging mode. Used to distinguish the page table.
+
+=====
+ARM64
+=====
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+kimage_voffset
+--------------
+
+The offset between the kernel virtual and physical mappings. Used to
+translate virtual to physical addresses.
+
+PHYS_OFFSET
+-----------
+
+Indicates the physical address of the start of memory. Similar to
+kimage_voffset, which is used to translate virtual address to physical
+address.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+====
+arm
+====
+
+ARM_LPAE
+--------
+
+It indicates whether the crash kernel supports large physical address
+extensions. Used to translate virtual address to physical address.
+
+====
+s390
+====
+
+lowcore_ptr
+-----------
+
+An array with a pointer to the lowcore of every CPU. Used to print the
+psw and all registers information.
+
+high_memory
+-----------
+
+Used to get the vmalloc_start address from the high_memory symbol.
+
+(lowcore_ptr, NR_CPUS)
+----------------------
+
+The maximum number of CPUs.
+
+=======
+powerpc
+=======
+
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+contig_page_data
+----------------
+
+See above.
+
+vmemmap_list
+------------
+
+The vmemmap_list maintains the entire vmemmap physical mapping. It can
+get vmemmap list count and populate vmemmap regions info. If the vmemmap
+address translation information is stored in the crash kernel, it helps
+to translate vmemmap kernel virtual addresses.
+
+mmu_vmemmap_psize
+-----------------
+
+The size of a page. Used to translate virtual to physical addresses.
+
+mmu_psize_defs
+--------------
+
+Page size definitions, i.e. 4k, 64k, or 16M.
+
+Used to make vtop translations.
+
+vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
+(vmemmap_backing, virt_addr)
+----------------------------------------------------------------
+
+The vmemmap virtual address space management does not have a traditional
+page table to track which virtual struct pages are backed by physical
+mapping. The virtual to physical mappings are tracked in a simple linked
+list format.
+
+User-space tools need to know the offset of list, phys and virt_addr
+when computing the count of vmemmap regions.
+
+mmu_psize_def|(mmu_psize_def, shift)
+------------------------------------
+
+The size of a struct mmu_psize_def and the offset of mmu_psize_def's
+member.
+
+Used in vtop translations.
+
+==
+sh
+==
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+X2TLB
+-----
+
+Indicates whether the crash kernel enables SH extended mode.
--
2.17.1
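
The mem_map entry in the document above describes an index computation
that can be sketched in a few lines. This is only an illustration under
stated assumptions, not makedumpfile's code: it assumes the flat memory
model with no PFN offset, and the function and parameter names are
hypothetical. The page size and the size of struct page would come from
vmcoreinfo; mem_map_base is the value of the mem_map pointer as read from
the dump.

```c
#include <stdint.h>

/* Derive PAGE_SHIFT from the exported page size (assumed a power of two). */
static unsigned int page_shift(uint64_t page_size)
{
	unsigned int shift = 0;

	while ((page_size >> shift) > 1)
		shift++;
	return shift;
}

/* Right-shifting a physical address by PAGE_SHIFT gives the page frame number. */
static uint64_t phys_to_pfn(uint64_t paddr, uint64_t page_size)
{
	return paddr >> page_shift(page_size);
}

/*
 * The PFN indexes the mem_map array, so the struct page for that frame
 * lives at mem_map_base + pfn * sizeof(struct page).
 */
static uint64_t pfn_to_page_addr(uint64_t mem_map_base, uint64_t pfn,
				 uint64_t sizeof_page)
{
	return mem_map_base + pfn * sizeof_page;
}
```

With the sparse memory model the lookup goes through mem_section
instead, so this sketch does not apply there.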


2019-01-10 13:52:03

by Lianbo Jiang

Subject: [PATCH 2/2 v6] kdump,vmcoreinfo: Export the value of sme mask to vmcoreinfo

For AMD machines with the SME feature, makedumpfile needs to know
whether the crashed kernel was encrypted or not. If SME is enabled
in the first kernel, the crashed kernel's page table entries
(pgd/pud/pmd/pte) contain the memory encryption mask, so makedumpfile
needs to remove the SME mask to obtain the true physical address.

Signed-off-by: Lianbo Jiang <[email protected]>
---
arch/x86/kernel/machine_kexec_64.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4c8acdfdc5a7..bc4108096b18 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -352,10 +352,13 @@ void machine_kexec(struct kimage *image)

 void arch_crash_save_vmcoreinfo(void)
 {
+	u64 sme_mask = sme_me_mask;
+
 	VMCOREINFO_NUMBER(phys_base);
 	VMCOREINFO_SYMBOL(init_top_pgt);
 	vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
 			pgtable_l5_enabled());
+	VMCOREINFO_NUMBER(sme_mask);
 
 #ifdef CONFIG_NUMA
 	VMCOREINFO_SYMBOL(node_data);
--
2.17.1
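
On the consumer side, the exported value shows up as one text line in the
vmcoreinfo note. A rough sketch of how a tool could look it up follows;
it is not taken from makedumpfile or crash. It assumes the note has
already been extracted into a NUL-terminated, newline-separated buffer
and that the value is printed in decimal, in the same NUMBER(name)=value
form as the pgtable_l5_enabled line in the hunk above. The function name
vmcoreinfo_number() is hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "NUMBER(<name>)=<value>" in a newline-separated vmcoreinfo
 * text buffer. Returns 1 and stores the value in *out on success,
 * 0 if the entry is missing or has no digits.
 */
static int vmcoreinfo_number(const char *info, const char *name, uint64_t *out)
{
	size_t name_len = strlen(name);
	const char *line = info;

	while (line && *line) {
		if (strncmp(line, "NUMBER(", 7) == 0 &&
		    strncmp(line + 7, name, name_len) == 0 &&
		    strncmp(line + 7 + name_len, ")=", 2) == 0) {
			const char *val = line + 7 + name_len + 2;
			char *end;
			unsigned long long v = strtoull(val, &end, 10);

			if (end == val)
				return 0;
			*out = (uint64_t)v;
			return 1;
		}
		/* Advance to the start of the next line, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return 0;
}
```

A tool that does not find NUMBER(sme_mask) would presumably treat the
mask as zero, which keeps older dumps working.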


2019-01-11 14:19:28

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Thu, Jan 10, 2019 at 08:19:43PM +0800, Lianbo Jiang wrote:
> +init_uts_ns.name.release
> +------------------------
> +
> +The version of the Linux kernel. Used to find the corresponding source
> +code from which the kernel has been built.
> +

...

> +
> +init_uts_ns
> +-----------
> +
> +This is the UTS namespace, which is used to isolate two specific
> +elements of the system that relate to the uname(2) system call. The UTS
> +namespace is named after the data structure used to store information
> +returned by the uname(2) system call.
> +
> +User-space tools can get the kernel name, host name, kernel release
> +number, kernel version, architecture name and OS type from it.

Already asked this but no reply so lemme paste my question again:

"And this document already fulfills its purpose - those two vmcoreinfo
exports are redundant and the first one can be removed.

And now that we agreed that VMCOREINFO is not an ABI and is very tightly
coupled to the kernel version, init_uts_ns.name.release can be removed,
yes?

Or is there anything speaking against that?"

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-11 14:59:01

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Thu, Jan 10, 2019 at 08:19:43PM +0800, Lianbo Jiang wrote:
> This document lists some variables that export to vmcoreinfo, and briefly
> describes what these variables indicate. It should be instructive for
> many people who do not know the vmcoreinfo.
>
> Suggested-by: Borislav Petkov <[email protected]>
> Signed-off-by: Lianbo Jiang <[email protected]>
> ---
> Documentation/kdump/vmcoreinfo.txt | 500 +++++++++++++++++++++++++++++
> 1 file changed, 500 insertions(+)
> create mode 100644 Documentation/kdump/vmcoreinfo.txt

Ok, below is what I'm going to commit if no one complains. I hope you'd
find some time to work on adding the checkpatch check for patches which
add vmcoreinfo members but do not document them and also remove those
vmcoreinfo members which are unused.

Which should be easy because we don't have to be backwards-compatible
with makedumpfile as this is not an ABI.

Thx.

---
From: Lianbo Jiang <[email protected]>
Date: Thu, 10 Jan 2019 20:19:43 +0800
Subject: [PATCH] kdump: Document kernel data exported in the vmcoreinfo note

Document data exported in vmcoreinfo and briefly describe its use by
userspace tools.

[ bp: heavily massage and redact the text. ]

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
Documentation/kdump/vmcoreinfo.txt | 494 +++++++++++++++++++++++++++++
1 file changed, 494 insertions(+)
create mode 100644 Documentation/kdump/vmcoreinfo.txt

diff --git a/Documentation/kdump/vmcoreinfo.txt b/Documentation/kdump/vmcoreinfo.txt
new file mode 100644
index 000000000000..2dc3797940a3
--- /dev/null
+++ b/Documentation/kdump/vmcoreinfo.txt
@@ -0,0 +1,494 @@
+================================================================
+ VMCOREINFO
+================================================================
+
+===========
+What is it?
+===========
+
+VMCOREINFO is a special ELF note section. It contains various
+information from the kernel like structure size, page size, symbol
+values, field offsets, etc. These data are packed into an ELF note
+section and used by user-space tools like crash and makedumpfile to
+analyze a kernel's memory layout.
+
+================
+Common variables
+================
+
+init_uts_ns.name.release
+------------------------
+
+The version of the Linux kernel. Used to find the corresponding source
+code from which the kernel has been built.
+
+PAGE_SIZE
+---------
+
+The size of a page. It is the smallest unit of data used by the memory
+management facilities. It is usually 4096 bytes in size and a page is
+aligned on 4096 bytes. Used for computing page addresses.
+
+init_uts_ns
+-----------
+
+The UTS namespace which is used to isolate two specific elements of the
+system that relate to the uname(2) system call. It is named after the
+data structure used to store information returned by the uname(2) system
+call.
+
+User-space tools can get the kernel name, host name, kernel release
+number, kernel version, architecture name and OS type from it.
+
+node_online_map
+---------------
+
+An array node_states[N_ONLINE] which represents the set of online nodes
+in a system, one bit position per node number. Used to keep track of
+which nodes are in the system and online.
+
+swapper_pg_dir
+--------------
+
+The global page directory pointer of the kernel. Used to translate
+virtual to physical addresses.
+
+_stext
+------
+
+Defines the beginning of the text section. In general, _stext indicates
+the kernel start address. Used to convert a virtual address from the
+direct kernel map to a physical address.
+
+vmap_area_list
+--------------
+
+Stores the virtual area list. makedumpfile gets the vmalloc start value
+from this variable and its value is necessary for vmalloc translation.
+
+mem_map
+-------
+
+Physical addresses are translated to struct pages by treating them as
+an index into the mem_map array. Right-shifting a physical address
+PAGE_SHIFT bits converts it into a page frame number which is an index
+into that mem_map array.
+
+Used to map an address to the corresponding struct page.
+
+contig_page_data
+----------------
+
+Makedumpfile gets the pglist_data structure from this symbol, which is
+used to describe the memory layout.
+
+User-space tools use this to exclude free pages when dumping memory.
+
+mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
+--------------------------------------------------------------------------
+
+The address of the mem_section array, its length, structure size, and
+the section_mem_map offset.
+
+It exists in the sparse memory mapping model and is somewhat similar
+to the mem_map variable; both of them are used to translate an
+address.
+
+page
+----
+
+The size of a page structure. struct page is an important data structure
+and it is widely used to compute contiguous memory.
+
+pglist_data
+-----------
+
+The size of a pglist_data structure. This value is used to check if the
+pglist_data structure is valid. It is also used for checking the memory
+type.
+
+zone
+----
+
+The size of a zone structure. This value is used to check if the zone
+structure has been found. It is also used for excluding free pages.
+
+free_area
+---------
+
+The size of a free_area structure. It indicates whether the free_area
+structure is valid or not. Useful when excluding free pages.
+
+list_head
+---------
+
+The size of a list_head structure. Used when iterating lists in a
+post-mortem analysis session.
+
+nodemask_t
+----------
+
+The size of a nodemask_t type. Used to compute the number of online
+nodes.
+
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
+ compound_order|compound_head)
+-------------------------------------------------------------------
+
+User-space tools compute their values based on the offset of these
+variables. The variables are used when excluding unnecessary pages.
+
+(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|node_
+ spanned_pages|node_id)
+-------------------------------------------------------------------
+
+On NUMA machines, each NUMA node has a pg_data_t to describe its memory
+layout. On UMA machines there is a single pglist_data which describes the
+whole memory.
+
+These values are used to check the memory type and to compute the
+virtual address for memory map.
+
+(zone, free_area|vm_stat|spanned_pages)
+---------------------------------------
+
+Each node is divided into a number of blocks called zones which
+represent ranges within memory. A zone is described by a structure zone.
+
+User-space tools compute required values based on the offset of these
+variables.
+
+(free_area, free_list)
+----------------------
+
+Offset of the free_list's member. This value is used to compute the number
+of free pages.
+
+Each zone has a free_area structure array called free_area[MAX_ORDER].
+The free_list represents a linked list of free page blocks.
+
+(list_head, next|prev)
+----------------------
+
+Offsets of the list_head's members. list_head is used to define a
+circular linked list. User-space tools need these in order to traverse
+lists.
+
+(vmap_area, va_start|list)
+--------------------------
+
+Offsets of the vmap_area's members. They carry vmalloc-specific
+information. Makedumpfile gets the start address of the vmalloc region
+from this.
+
+(zone.free_area, MAX_ORDER)
+---------------------------
+
+Free areas descriptor. User-space tools use this value to iterate the
+free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+
+log_first_idx
+-------------
+
+Index of the first record stored in the buffer log_buf. Used by
+user-space tools to read the strings in the log_buf.
+
+log_buf
+-------
+
+Console output is written to the ring buffer log_buf at index
+log_first_idx. Used to get the kernel log.
+
+log_buf_len
+-----------
+
+log_buf's length.
+
+clear_idx
+---------
+
+The index of the next printk() record to read after the last clear
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, like issued by 'dmesg -c'. Used by user-space
+tools to dump the dmesg log.
+
+log_next_idx
+------------
+
+The index of the next record to store in the buffer log_buf. Used to
+compute the index of the current buffer position.
+
+printk_log
+----------
+
+The size of a structure printk_log. Used to compute the size of
+messages, and extract dmesg log. It encapsulates header information for
+log_buf, such as timestamp, syslog level, etc.
+
+(printk_log, ts_nsec|len|text_len|dict_len)
+-------------------------------------------
+
+It represents field offsets in struct printk_log. User space tools
+parse it and check whether the values of printk_log's members have been
+changed.
+
+(free_area.free_list, MIGRATE_TYPES)
+------------------------------------
+
+The number of migrate types for pages. The free_list is described by the
+array. Used by tools to compute the number of free pages.
+
+NR_FREE_PAGES
+-------------
+
+On linux-2.6.21 or later, the number of free pages is in
+vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
+
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison
+|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
+|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+-----------------------------------------------------------------
+
+Page attributes. These flags are used to filter out various pages that
+are unnecessary for dumping.
+
+HUGETLB_PAGE_DTOR
+-----------------
+
+The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
+excludes these pages.
+
+======
+x86_64
+======
+
+phys_base
+---------
+
+Used to convert the virtual address of an exported kernel symbol to its
+corresponding physical address.
+
+init_top_pgt
+------------
+
+Used to walk through the whole page table and convert virtual addresses
+to physical addresses. The init_top_pgt is somewhat similar to
+swapper_pg_dir, but it is only used in x86_64.
+
+pgtable_l5_enabled
+------------------
+
+User-space tools need to know whether the crash kernel was in 5-level
+paging mode.
+
+node_data
+---------
+
+This is a struct pglist_data array and stores all NUMA nodes
+information. Makedumpfile gets the pglist_data structure from it.
+
+(node_data, MAX_NUMNODES)
+-------------------------
+
+The maximum number of nodes in the system.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+KERNEL_IMAGE_SIZE
+-----------------
+
+Currently unused by Makedumpfile. Used to compute the module virtual
+address by Crash.
+
+sme_mask
+--------
+
+AMD-specific with SME support: it indicates the secure memory encryption
+mask. Makedumpfile tools need to know whether the crash kernel was
+encrypted. If SME is enabled in the first kernel, the crash kernel's
+page table entries (pgd/pud/pmd/pte) contain the memory encryption
+mask. This is used to remove the SME mask and obtain the true physical
+address.
+
+Currently, sme_mask stores the value of the C-bit position. If needed,
+additional SME-relevant info can be placed in that variable.
+
+For example:
+[ misc ][ enc bit ][ other misc SME info ]
+0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
+63   59   55   51   47   43   39   35   31   27   ... 3
+
+======
+x86_32
+======
+
+X86_PAE
+-------
+
+Denotes whether physical address extensions are enabled. It has the cost
+of a higher page table lookup overhead, and also consumes more page
+table space per process. Used to check whether PAE was enabled in the
+crash kernel when converting virtual addresses to physical addresses.
+
+====
+ia64
+====
+
+pgdat_list|(pgdat_list, MAX_NUMNODES)
+-------------------------------------
+
+pg_data_t array storing all NUMA nodes information. MAX_NUMNODES
+indicates the number of the nodes.
+
+node_memblk|(node_memblk, NR_NODE_MEMBLKS)
+------------------------------------------
+
+List of node memory chunks. Filled when parsing the SRAT table to obtain
+information about memory nodes. NR_NODE_MEMBLKS indicates the number of
+node memory chunks.
+
+These values are used to compute the number of nodes the crashed kernel used.
+
+node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
+----------------------------------------------------------------
+
+The size of a struct node_memblk_s and the offsets of the
+node_memblk_s's members. Used to compute the number of nodes.
+
+PGTABLE_3|PGTABLE_4
+-------------------
+
+User-space tools need to know whether the crash kernel was in 3-level or
+4-level paging mode. Used to distinguish the page table.
+
+=====
+ARM64
+=====
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+kimage_voffset
+--------------
+
+The offset between the kernel virtual and physical mappings. Used to
+translate virtual to physical addresses.
+
+PHYS_OFFSET
+-----------
+
+Indicates the physical address of the start of memory. Similar to
+kimage_voffset, which is used to translate virtual to physical
+addresses.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+====
+arm
+====
+
+ARM_LPAE
+--------
+
+It indicates whether the crash kernel supports large physical address
+extensions. Used to translate virtual to physical addresses.
+
+====
+s390
+====
+
+lowcore_ptr
+-----------
+
+An array with a pointer to the lowcore of every CPU. Used to print the
+PSW and all register information.
+
+high_memory
+-----------
+
+Used to get the vmalloc_start address from the high_memory symbol.
+
+(lowcore_ptr, NR_CPUS)
+----------------------
+
+The maximum number of CPUs.
+
+=======
+powerpc
+=======
+
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+contig_page_data
+----------------
+
+See above.
+
+vmemmap_list
+------------
+
+The vmemmap_list maintains the entire vmemmap physical mapping. Used
+to get vmemmap list count and populated vmemmap regions info. If the
+vmemmap address translation information is stored in the crash kernel,
+it is used to translate vmemmap kernel virtual addresses.
+
+mmu_vmemmap_psize
+-----------------
+
+The size of a page. Used to translate virtual to physical addresses.
+
+mmu_psize_defs
+--------------
+
+Page size definitions, e.g. 4k, 64k, or 16M.
+
+Used to make vtop translations.
+
+vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
+(vmemmap_backing, virt_addr)
+----------------------------------------------------------------
+
+The vmemmap virtual address space management does not have a traditional
+page table to track which virtual struct pages are backed by a physical
+mapping. The virtual to physical mappings are tracked in a simple linked
+list format.
+
+User-space tools need to know the offset of list, phys and virt_addr
+when computing the count of vmemmap regions.
+
+mmu_psize_def|(mmu_psize_def, shift)
+------------------------------------
+
+The size of a struct mmu_psize_def and the offset of mmu_psize_def's
+member.
+
+Used in vtop translations.
+
+==
+sh
+==
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+X2TLB
+-----
+
+Indicates whether the crashed kernel enabled SH extended mode.

--
2.19.1

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

Subject: [tip:x86/kdump] x86/kdump: Export the SME mask to vmcoreinfo

Commit-ID: 65f750e5457aef9a8085a99d613fea0430303e93
Gitweb: https://git.kernel.org/tip/65f750e5457aef9a8085a99d613fea0430303e93
Author: Lianbo Jiang <[email protected]>
AuthorDate: Thu, 10 Jan 2019 20:19:44 +0800
Committer: Borislav Petkov <[email protected]>
CommitDate: Fri, 11 Jan 2019 16:09:25 +0100

x86/kdump: Export the SME mask to vmcoreinfo

On AMD SME machines, makedumpfile tools need to know whether the crashed
kernel was encrypted.

If SME is enabled in the first kernel, the crashed kernel's page table
entries (pgd/pud/pmd/pte) contain the memory encryption mask which
makedumpfile needs to remove in order to obtain the true physical
address.

Export that mask in a vmcoreinfo variable.

[ bp: Massage commit message and move define at the end of the
function. ]

Signed-off-by: Lianbo Jiang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tom Lendacky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/kernel/machine_kexec_64.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4c8acdfdc5a7..ceba408ea982 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -352,6 +352,8 @@ void machine_kexec(struct kimage *image)

void arch_crash_save_vmcoreinfo(void)
{
+ u64 sme_mask = sme_me_mask;
+
VMCOREINFO_NUMBER(phys_base);
VMCOREINFO_SYMBOL(init_top_pgt);
vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
@@ -364,6 +366,7 @@ void arch_crash_save_vmcoreinfo(void)
vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
kaslr_offset());
VMCOREINFO_NUMBER(KERNEL_IMAGE_SIZE);
+ VMCOREINFO_NUMBER(sme_mask);
}

/* arch-dependent functionality related to kexec file-based syscall */
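
For illustration only, here is a minimal userspace sketch of what "removing
the SME mask" could look like in a dump tool. It is not taken from
makedumpfile: pte_strip_sme(), pte_to_paddr() and PTE_PFN_MASK_SKETCH are
hypothetical names, and the frame mask assumes a plain 4K x86_64 page table
entry with the physical address in bits 51:12.

```c
#include <stdint.h>

/*
 * Hypothetical helper: clear the SME encryption bit(s) from a raw page
 * table entry. sme_mask is the NUMBER(sme_mask) value read from
 * vmcoreinfo; it is 0 when SME is not active, so the entry is unchanged.
 */
static inline uint64_t pte_strip_sme(uint64_t entry, uint64_t sme_mask)
{
	return entry & ~sme_mask;
}

/* Assumption: 4K x86_64 entry, physical frame address in bits 51:12. */
#define PTE_PFN_MASK_SKETCH 0x000ffffffffff000ULL

/* Strip the encryption mask first, then drop the flag bits. */
static inline uint64_t pte_to_paddr(uint64_t entry, uint64_t sme_mask)
{
	return pte_strip_sme(entry, sme_mask) & PTE_PFN_MASK_SKETCH;
}
```

With the C-bit at position 47, as in the documentation example, the mask
would be 1ULL << 47. Skipping the strip step would leave bit 47 set inside
the frame address and yield a bogus physical address.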

2019-01-14 01:53:43

by Lianbo Jiang

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On 2019-01-11 20:33, Borislav Petkov wrote:
> On Thu, Jan 10, 2019 at 08:19:43PM +0800, Lianbo Jiang wrote:
>> +init_uts_ns.name.release
>> +------------------------
>> +
>> +The version of the Linux kernel. Used to find the corresponding source
>> +code from which the kernel has been built.
>> +
>
> ...
>
>> +
>> +init_uts_ns
>> +-----------
>> +
>> +This is the UTS namespace, which is used to isolate two specific
>> +elements of the system that relate to the uname(2) system call. The UTS
>> +namespace is named after the data structure used to store information
>> +returned by the uname(2) system call.
>> +
>> +User-space tools can get the kernel name, host name, kernel release
>> +number, kernel version, architecture name and OS type from it.
>
> Already asked this but no reply so lemme paste my question again:
>
> "And this document already fulfills its purpose - those two vmcoreinfo
> exports are redundant and the first one can be removed.
>
> And now that we agreed that VMCOREINFO is not an ABI and is very tightly
> coupled to the kernel version, init_uts_ns.name.release can be removed,
> yes?
>
> Or is there anything speaking against that?"
>

Sorry about this, that was my mistake. Thanks for the reminder.

I agree with your point of view, but I forgot to remove this variable
in this post.

I would like to remove this variable and post again.

Thanks.
Lianbo

2019-01-14 05:35:25

by Lianbo Jiang

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On 2019-01-11 22:56, Borislav Petkov wrote:
> On Thu, Jan 10, 2019 at 08:19:43PM +0800, Lianbo Jiang wrote:
>> This document lists some variables that export to vmcoreinfo, and briefly
>> describles what these variables indicate. It should be instructive for
>> many people who do not know the vmcoreinfo.
>>
>> Suggested-by: Borislav Petkov <[email protected]>
>> Signed-off-by: Lianbo Jiang <[email protected]>
>> ---
>> Documentation/kdump/vmcoreinfo.txt | 500 +++++++++++++++++++++++++++++
>> 1 file changed, 500 insertions(+)
>> create mode 100644 Documentation/kdump/vmcoreinfo.txt
>
> Ok, below is what I'm going to commit if no one complains. I hope you'd
> find some time to work on adding the checkpatch check for patches which
> add vmcoreinfo members but do not document them

I noticed that checkpatch is written in Perl, but I am not familiar with
the Perl programming language, so doing this is beyond my ability for now.
I will have to learn Perl step by step. :-)

> and also remove those vmcoreinfo members which are unused.
>

Do you mean this one 'KERNEL_IMAGE_SIZE'?

Currently unused by Makedumpfile, but used to compute the module virtual
address by Crash.

I have corrected this issue in the VMCOREINFO doc.

Thanks.
Lianbo

> Which should be easy because we don't have to be backwards-compatible
> with makedumpfile as this is not an ABI.
>
> Thx.
>
> ---
> From: Lianbo Jiang <[email protected]>
> Date: Thu, 10 Jan 2019 20:19:43 +0800
> Subject: [PATCH] kdump: Document kernel data exported in the vmcoreinfo note
>
> Document data exported in vmcoreinfo and briefly describe its use by
> userspace tools.
>
> [ bp: heavily massage and redact the text. ]
>
> Suggested-by: Borislav Petkov <[email protected]>
> Signed-off-by: Lianbo Jiang <[email protected]>
> Signed-off-by: Borislav Petkov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Dave Young <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: x86-ml <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> ---
> Documentation/kdump/vmcoreinfo.txt | 494 +++++++++++++++++++++++++++++
> 1 file changed, 494 insertions(+)
> create mode 100644 Documentation/kdump/vmcoreinfo.txt
>
> diff --git a/Documentation/kdump/vmcoreinfo.txt b/Documentation/kdump/vmcoreinfo.txt
> new file mode 100644
> index 000000000000..2dc3797940a3
> --- /dev/null
> +++ b/Documentation/kdump/vmcoreinfo.txt
> @@ -0,0 +1,494 @@
> +================================================================
> + VMCOREINFO
> +================================================================
> +
> +===========
> +What is it?
> +===========
> +
> +VMCOREINFO is a special ELF note section. It contains various
> +information from the kernel like structure size, page size, symbol
> +values, field offsets, etc. These data are packed into an ELF note
> +section and used by user-space tools like crash and makedumpfile to
> +analyze a kernel's memory layout.
> +
> +================
> +Common variables
> +================
> +
> +init_uts_ns.name.release
> +------------------------
> +
> +The version of the Linux kernel. Used to find the corresponding source
> +code from which the kernel has been built.
> +
> +PAGE_SIZE
> +---------
> +
> +The size of a page. It is the smallest unit of data used by the memory
> +management facilities. It is usually 4096 bytes in size and a page is
> +aligned on 4096 bytes. Used for computing page addresses.
> +
> +init_uts_ns
> +-----------
> +
> +The UTS namespace which is used to isolate two specific elements of the
> +system that relate to the uname(2) system call. It is named after the
> +data structure used to store information returned by the uname(2) system
> +call.
> +
> +User-space tools can get the kernel name, host name, kernel release
> +number, kernel version, architecture name and OS type from it.
> +
> +node_online_map
> +---------------
> +
> +An array node_states[N_ONLINE] which represents the set of online nodes
> +in a system, one bit position per node number. Used to keep track of
> +which nodes are in the system and online.
> +
> +swapper_pg_dir
> +--------------
> +
> +The global page directory pointer of the kernel. Used to translate
> +virtual to physical addresses.
> +
> +_stext
> +------
> +
> +Defines the beginning of the text section. In general, _stext indicates
> +the kernel start address. Used to convert a virtual address from the
> +direct kernel map to a physical address.
> +
> +vmap_area_list
> +--------------
> +
> +Stores the virtual area list. makedumpfile gets the vmalloc start value
> +from this variable and its value is necessary for vmalloc translation.
> +
> +mem_map
> +-------
> +
> +Physical addresses are translated to struct pages by treating them as
> +an index into the mem_map array. Right-shifting a physical address
> +PAGE_SHIFT bits converts it into a page frame number which is an index
> +into that mem_map array.
> +
> +Used to map an address to the corresponding struct page.
> +
> +contig_page_data
> +----------------
> +
> +Makedumpfile gets the pglist_data structure from this symbol, which is
> +used to describe the memory layout.
> +
> +User-space tools use this to exclude free pages when dumping memory.
> +
> +mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
> +--------------------------------------------------------------------------
> +
> +The address of the mem_section array, its length, structure size, and
> +the section_mem_map offset.
> +
> +It exists in the sparse memory mapping model and is somewhat similar
> +to the mem_map variable: both are used to translate an address.
> +
> +page
> +----
> +
> +The size of a page structure. struct page is an important data structure
> +and it is widely used to compute contiguous memory.
> +
> +pglist_data
> +-----------
> +
> +The size of a pglist_data structure. This value is used to check if the
> +pglist_data structure is valid. It is also used for checking the memory
> +type.
> +
> +zone
> +----
> +
> +The size of a zone structure. This value is used to check if the zone
> +structure has been found. It is also used for excluding free pages.
> +
> +free_area
> +---------
> +
> +The size of a free_area structure. It indicates whether the free_area
> +structure is valid or not. Useful when excluding free pages.
> +
> +list_head
> +---------
> +
> +The size of a list_head structure. Used when iterating lists in a
> +post-mortem analysis session.
> +
> +nodemask_t
> +----------
> +
> +The size of a nodemask_t type. Used to compute the number of online
> +nodes.
> +
> +(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
> + compound_order|compound_head)
> +-------------------------------------------------------------------
> +
> +User-space tools compute their values based on the offset of these
> +variables. The variables are used when excluding unnecessary pages.
> +
> +(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|
> + node_spanned_pages|node_id)
> +-------------------------------------------------------------------
> +
> +On NUMA machines, each NUMA node has a pg_data_t to describe its memory
> +layout. On UMA machines there is a single pglist_data which describes the
> +whole memory.
> +
> +These values are used to check the memory type and to compute the
> +virtual address for memory map.
> +
> +(zone, free_area|vm_stat|spanned_pages)
> +---------------------------------------
> +
> +Each node is divided into a number of blocks called zones which
> +represent ranges within memory. A zone is described by a structure zone.
> +
> +User-space tools compute required values based on the offset of these
> +variables.
> +
> +(free_area, free_list)
> +----------------------
> +
> +Offset of the free_list's member. This value is used to compute the number
> +of free pages.
> +
> +Each zone has a free_area structure array called free_area[MAX_ORDER].
> +The free_list represents a linked list of free page blocks.
> +
> +(list_head, next|prev)
> +----------------------
> +
> +Offsets of the list_head's members. list_head is used to define a
> +circular linked list. User-space tools need these in order to traverse
> +lists.
> +
> +(vmap_area, va_start|list)
> +--------------------------
> +
> +Offsets of the vmap_area's members. They carry vmalloc-specific
> +information. Makedumpfile gets the start address of the vmalloc region
> +from this.
> +
> +(zone.free_area, MAX_ORDER)
> +---------------------------
> +
> +Free areas descriptor. User-space tools use this value to iterate the
> +free_area ranges. MAX_ORDER is used by the zone buddy allocator.
> +
> +log_first_idx
> +-------------
> +
> +Index of the first record stored in the buffer log_buf. Used by
> +user-space tools to read the strings in the log_buf.
> +
> +log_buf
> +-------
> +
> +Console output is written to the ring buffer log_buf at index
> +log_first_idx. Used to get the kernel log.
> +
> +log_buf_len
> +-----------
> +
> +log_buf's length.
> +
> +clear_idx
> +---------
> +
> +The index of the next printk() record to read after the last clear
> +command. It indicates the first record after the last
> +SYSLOG_ACTION_CLEAR, such as the one issued by 'dmesg -c'. Used by
> +user-space tools to dump the dmesg log.
> +
> +log_next_idx
> +------------
> +
> +The index of the next record to store in the buffer log_buf. Used to
> +compute the index of the current buffer position.
> +
> +printk_log
> +----------
> +
> +The size of a structure printk_log. Used to compute the size of
> +messages, and extract dmesg log. It encapsulates header information for
> +log_buf, such as timestamp, syslog level, etc.
> +
> +(printk_log, ts_nsec|len|text_len|dict_len)
> +-------------------------------------------
> +
> +It represents field offsets in struct printk_log. User space tools
> +parse it and check whether the values of printk_log's members have been
> +changed.
> +
> +(free_area.free_list, MIGRATE_TYPES)
> +------------------------------------
> +
> +The number of migrate types for pages. The free_list is described by the
> +array. Used by tools to compute the number of free pages.
> +
> +NR_FREE_PAGES
> +-------------
> +
> +On linux-2.6.21 or later, the number of free pages is in
> +vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
> +
> +PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison
> +|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
> +|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
> +-----------------------------------------------------------------
> +
> +Page attributes. These flags are used to filter out various pages that
> +are unnecessary for dumping.
> +
> +HUGETLB_PAGE_DTOR
> +-----------------
> +
> +The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
> +excludes these pages.
> +
> +======
> +x86_64
> +======
> +
> +phys_base
> +---------
> +
> +Used to convert the virtual address of an exported kernel symbol to its
> +corresponding physical address.
> +
> +init_top_pgt
> +------------
> +
> +Used to walk through the whole page table and convert virtual addresses
> +to physical addresses. The init_top_pgt is somewhat similar to
> +swapper_pg_dir, but it is only used in x86_64.
> +
> +pgtable_l5_enabled
> +------------------
> +
> +User-space tools need to know whether the crash kernel was in 5-level
> +paging mode.
> +
> +node_data
> +---------
> +
> +This is a struct pglist_data array and stores all NUMA nodes
> +information. Makedumpfile gets the pglist_data structure from it.
> +
> +(node_data, MAX_NUMNODES)
> +-------------------------
> +
> +The maximum number of nodes in the system.
> +
> +KERNELOFFSET
> +------------
> +
> +The kernel randomization offset. Used to compute the page offset. If
> +KASLR is disabled, this value is zero.
> +
> +KERNEL_IMAGE_SIZE
> +-----------------
> +
> +Currently unused by Makedumpfile. Used to compute the module virtual
> +address by Crash.
> +
> +sme_mask
> +--------
> +
> +AMD-specific with SME support: it indicates the secure memory encryption
> +mask. Makedumpfile tools need to know whether the crash kernel was
> +encrypted. If SME is enabled in the first kernel, the crash kernel's
> +page table entries (pgd/pud/pmd/pte) contain the memory encryption
> +mask. This is used to remove the SME mask and obtain the true physical
> +address.
> +
> +Currently, sme_mask stores the value of the C-bit position. If needed,
> +additional SME-relevant info can be placed in that variable.
> +
> +For example:
> +[ misc          ][ enc bit  ][ other misc SME info       ]
> +0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
> +63   59   55   51   47   43   39   35   31   27   ... 3
> +
> +======
> +x86_32
> +======
> +
> +X86_PAE
> +-------
> +
> +Denotes whether physical address extensions are enabled. It has the cost
> +of a higher page table lookup overhead, and also consumes more page
> +table space per process. Used to check whether PAE was enabled in the
> +crash kernel when converting virtual addresses to physical addresses.
> +
> +====
> +ia64
> +====
> +
> +pgdat_list|(pgdat_list, MAX_NUMNODES)
> +-------------------------------------
> +
> +pg_data_t array storing all NUMA nodes information. MAX_NUMNODES
> +indicates the number of the nodes.
> +
> +node_memblk|(node_memblk, NR_NODE_MEMBLKS)
> +------------------------------------------
> +
> +List of node memory chunks. Filled when parsing the SRAT table to obtain
> +information about memory nodes. NR_NODE_MEMBLKS indicates the number of
> +node memory chunks.
> +
> +These values are used to compute the number of nodes the crashed kernel used.
> +
> +node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
> +----------------------------------------------------------------
> +
> +The size of a struct node_memblk_s and the offsets of the
> +node_memblk_s's members. Used to compute the number of nodes.
> +
> +PGTABLE_3|PGTABLE_4
> +-------------------
> +
> +User-space tools need to know whether the crash kernel was in 3-level or
> +4-level paging mode. Used to distinguish the page table.
> +
> +=====
> +ARM64
> +=====
> +
> +VA_BITS
> +-------
> +
> +The maximum number of bits for virtual addresses. Used to compute the
> +virtual memory ranges.
> +
> +kimage_voffset
> +--------------
> +
> +The offset between the kernel virtual and physical mappings. Used to
> +translate virtual to physical addresses.
> +
> +PHYS_OFFSET
> +-----------
> +
> +Indicates the physical address of the start of memory. Similar to
> +kimage_voffset, which is used to translate virtual to physical
> +addresses.
> +
> +KERNELOFFSET
> +------------
> +
> +The kernel randomization offset. Used to compute the page offset. If
> +KASLR is disabled, this value is zero.
> +
> +====
> +arm
> +====
> +
> +ARM_LPAE
> +--------
> +
> +It indicates whether the crash kernel supports large physical address
> +extensions. Used to translate virtual to physical addresses.
> +
> +====
> +s390
> +====
> +
> +lowcore_ptr
> +-----------
> +
> +An array with a pointer to the lowcore of every CPU. Used to print the
> +PSW and all register information.
> +
> +high_memory
> +-----------
> +
> +Used to get the vmalloc_start address from the high_memory symbol.
> +
> +(lowcore_ptr, NR_CPUS)
> +----------------------
> +
> +The maximum number of CPUs.
> +
> +=======
> +powerpc
> +=======
> +
> +
> +node_data|(node_data, MAX_NUMNODES)
> +-----------------------------------
> +
> +See above.
> +
> +contig_page_data
> +----------------
> +
> +See above.
> +
> +vmemmap_list
> +------------
> +
> +The vmemmap_list maintains the entire vmemmap physical mapping. Used
> +to get vmemmap list count and populated vmemmap regions info. If the
> +vmemmap address translation information is stored in the crash kernel,
> +it is used to translate vmemmap kernel virtual addresses.
> +
> +mmu_vmemmap_psize
> +-----------------
> +
> +The size of a page. Used to translate virtual to physical addresses.
> +
> +mmu_psize_defs
> +--------------
> +
> +Page size definitions, e.g. 4k, 64k, or 16M.
> +
> +Used to make vtop translations.
> +
> +vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
> +(vmemmap_backing, virt_addr)
> +----------------------------------------------------------------
> +
> +The vmemmap virtual address space management does not have a traditional
> +page table to track which virtual struct pages are backed by a physical
> +mapping. The virtual to physical mappings are tracked in a simple linked
> +list format.
> +
> +User-space tools need to know the offset of list, phys and virt_addr
> +when computing the count of vmemmap regions.
> +
> +mmu_psize_def|(mmu_psize_def, shift)
> +------------------------------------
> +
> +The size of a struct mmu_psize_def and the offset of mmu_psize_def's
> +member.
> +
> +Used in vtop translations.
> +
> +==
> +sh
> +==
> +
> +node_data|(node_data, MAX_NUMNODES)
> +-----------------------------------
> +
> +See above.
> +
> +X2TLB
> +-----
> +
> +Indicates whether the crashed kernel enabled SH extended mode.
>

2019-01-14 09:12:03

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 09:52:14AM +0800, lijiang wrote:
> I would like to remove this variable and post again.

No, you should remove the vmcoreinfo export too:

kernel/crash_core.c:398: VMCOREINFO_OSRELEASE(init_uts_ns.name.release);

after making sure userspace is not using it and *then* remove the
documentation.

But you can do that in a separate patch, so that it can be reverted in
case of trouble.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 09:17:29

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 01:30:30PM +0800, lijiang wrote:
> I noticed that checkpatch is written in Perl, but I am not familiar with
> the Perl programming language, so doing this is beyond my ability for now.
> I will have to learn Perl step by step. :-)

You could give it a try - it is not hard :-)

And there's no hurry for this, take your time.

> Do you mean this one 'KERNEL_IMAGE_SIZE'?

I mean, all those which are unused. Optimally, you should look at the
tools and see whether they're using those exports and if not, remove
them. But no hurry here too, take your time.

My final goal is to have this up-to-date documentation of what is
exported and what is used by user tools so that people can look at it
first before carelessly exporting yet another thing.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 17:54:48

by Kazuhito Hagio

Subject: RE: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On 1/11/2019 7:33 AM, Borislav Petkov wrote:
> On Thu, Jan 10, 2019 at 08:19:43PM +0800, Lianbo Jiang wrote:
>> +init_uts_ns.name.release
>> +------------------------
>> +
>> +The version of the Linux kernel. Used to find the corresponding source
>> +code from which the kernel has been built.
>> +
>
> ...
>
>> +
>> +init_uts_ns
>> +-----------
>> +
>> +This is the UTS namespace, which is used to isolate two specific
>> +elements of the system that relate to the uname(2) system call. The UTS
>> +namespace is named after the data structure used to store information
>> +returned by the uname(2) system call.
>> +
>> +User-space tools can get the kernel name, host name, kernel release
>> +number, kernel version, architecture name and OS type from it.
>
> Already asked this but no reply so lemme paste my question again:
>
> "And this document already fulfills its purpose - those two vmcoreinfo
> exports are redundant and the first one can be removed.
>
> And now that we agreed that VMCOREINFO is not an ABI and is very tightly
> coupled to the kernel version, init_uts_ns.name.release can be removed,
> yes?
>
> Or is there anything speaking against that?"

As for makedumpfile, it would not be impossible to remove
init_uts_ns.name.release (OSRELEASE), but some fixes are needed.
Historically, OSRELEASE has been a kind of mandatory entry in
vmcoreinfo from the very beginning, so makedumpfile uses its
existence to check whether a vmcoreinfo is sane.
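
As a rough illustration of that kind of check (a sketch only, not
makedumpfile's actual code), a tool could scan the note text for a line
that starts with "OSRELEASE=". The function name vmcoreinfo_has_osrelease()
is hypothetical, and the note is assumed to be plain KEY=VALUE lines
separated by newlines.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sanity check: return true only if some line of the
 * vmcoreinfo note text begins with "OSRELEASE=".
 */
static bool vmcoreinfo_has_osrelease(const char *note, size_t len)
{
	static const char key[] = "OSRELEASE=";
	const size_t klen = sizeof(key) - 1;
	size_t i = 0;

	while (i < len) {
		/* i always points at the start of a line here */
		if (len - i >= klen && !memcmp(note + i, key, klen))
			return true;
		/* skip to the end of this line, then past the newline */
		while (i < len && note[i] != '\n')
			i++;
		i++;
	}
	return false;
}
```

A tool that relies on such a check would reject an otherwise valid note once
the OSRELEASE export is gone, which is why the check would have to be
changed first.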

Also, I think crash will need to be fixed as well if it is removed.
So I hope it will be left as it is.

Thanks,
Kazu

2019-01-14 18:03:15

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 05:48:48PM +0000, Kazuhito Hagio wrote:
> As for makedumpfile, it would not be impossible to remove
> init_uts_ns.name.release (OSRELEASE), but some fixes are needed.
> Historically, OSRELEASE has been a kind of mandatory entry in
> vmcoreinfo from the very beginning, so makedumpfile uses its
> existence to check whether a vmcoreinfo is sane.

Well, init_uts_ns is exported in vmcoreinfo anyway - makedumpfile
can simply test init_uts_ns.name.release just as well. And the
"historically" argument doesn't matter because vmcoreinfo is not an ABI.

So makedumpfile needs to be changed to check that new export.

> Also, I think crash will need to be fixed as well if it is removed.

Yes, I'm expecting user tools to be fixed and then exports removed.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 18:59:50

by Dave Anderson

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation



----- Original Message -----
> On Mon, Jan 14, 2019 at 05:48:48PM +0000, Kazuhito Hagio wrote:
> > As for makedumpfile, it would not be impossible to remove
> > init_uts_ns.name.release (OSRELEASE), but some fixes are needed.
> > Historically, OSRELEASE has been a kind of mandatory entry in
> > vmcoreinfo from the very beginning, so makedumpfile uses its
> > existence to check whether a vmcoreinfo is sane.
>
> Well, init_uts_ns is exported in vmcoreinfo anyway - makedumpfile
> can simply test init_uts_ns.name.release just as well. And the
> "historically" argument doesn't matter because vmcoreinfo is not an ABI.
>
> So makedumpfile needs to be changed to check that new export.
>
> > Also, I think crash will need to be fixed as well if it is removed.

Preferably it would be left as-is. The crash utility has a "crash --osrelease vmcore"
option that only looks at the dumpfile header, and just dumps the string. With respect
to compressed kdumps, crash could alternatively look at the utsname data that is stored
in the diskdump_header.utsname field, but with ELF vmcores, there is no such back-up.
What's the problem with leaving it alone?

Dave


>
> Yes, I'm expecting user tools to be fixed and then exports removed.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
>

2019-01-14 19:23:19

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 01:58:32PM -0500, Dave Anderson wrote:
> Preferably it would be left as-is. The crash utility has a "crash --osrelease vmcore"
> option that only looks at the dumpfile header, and just dumps the string. With respect
> to compressed kdumps, crash could alternatively look at the utsname data that is stored
> in the diskdump_header.utsname field, but with ELF vmcores, there is no such back-up.

Well, there is:

00000000 4f 53 52 45 4c 45 41 53 45 3d 35 2e 30 2e 30 2d |OSRELEASE=5.0.0-|
00000010 72 63 32 2b 0a 50 41 47 45 53 49 5a 45 3d 34 30 |rc2+.PAGESIZE=40|
00000020 39 36 0a 53 59 4d 42 4f 4c 28 6d 65 6d 5f 73 65 |96.SYMBOL(mem_se|
00000030 63 74 69 6f 6e 29 3d 66 66 66 66 66 66 66 66 38 |ction)=ffffffff8|
00000040 34 35 31 61 31 61 38 0a 53 59 4d 42 4f 4c 28 69 |451a1a8.SYMBOL(i|
00000050 6e 69 74 5f 75 74 73 5f 6e 73 29 3d 66 66 66 66 |nit_uts_ns)=ffff|
^^^^
00000060 66 66 66 66 38 32 30 31 33 35 34 30 0a 53 59 4d |ffff82013540
^^^^^^^^^^^^

This address has it.

> What's the problem with leaving it alone?

The problem is that I'd like to get all those vmcoreinfo exports under
control and to not have people frivolously export whatever they feel
like, for obvious reasons, and to get rid of the duplicate/unused pieces
being part of vmcoreinfo.

I'm guessing removing OSRELEASE would simplify the kernel a bit by
getting rid of the VMCOREINFO_OSRELEASE define and export, and userspace
can read out the kernel version from init_uts_ns which is also exported.
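
To make the difference between the two exports concrete, a minimal sketch of
a key lookup in the vmcoreinfo note text follows. vmcoreinfo_get() is a
hypothetical helper, not code from crash or makedumpfile, and it assumes the
note consists of plain KEY=VALUE lines as in the hexdump above.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: find "key=" at the start of a line in the
 * vmcoreinfo note text and copy the value into out as a C string.
 * Returns 0 on success, -1 if the key is missing or out is too small.
 */
static int vmcoreinfo_get(const char *note, size_t len, const char *key,
			  char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = note, *end = note + len;

	while (p < end) {
		const char *nl = memchr(p, '\n', (size_t)(end - p));
		const char *eol = nl ? nl : end;
		size_t llen = (size_t)(eol - p);

		/* the line must be "key=" followed by the value */
		if (llen > klen && p[klen] == '=' && !memcmp(p, key, klen)) {
			size_t vlen = llen - klen - 1;

			if (vlen >= outlen)
				return -1;
			memcpy(out, p + klen + 1, vlen);
			out[vlen] = '\0';
			return 0;
		}
		p = nl ? nl + 1 : end;
	}
	return -1;
}
```

Looking up "OSRELEASE" returns the release string directly from the note.
Looking up "SYMBOL(init_uts_ns)" returns only a kernel virtual address, so a
tool would still have to read the dump's memory at that address to get the
release string.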

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 19:38:14

by Dave Anderson

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation




----- Original Message -----
> On Mon, Jan 14, 2019 at 01:58:32PM -0500, Dave Anderson wrote:
> > Preferably it would be left as-is. The crash utility has a "crash
> > --osrelease vmcore"
> > option that only looks at the dumpfile header, and just dump the string.
> > With respect
> > to compressed kdumps, crash could alternatively look at the utsname data
> > that is stored
> > in the diskdump_header.utsname field, but with ELF vmcores, there is no
> > such back-up.
>
> Well, there is:
>
> 00000000 4f 53 52 45 4c 45 41 53 45 3d 35 2e 30 2e 30 2d |OSRELEASE=5.0.0-|
> 00000010 72 63 32 2b 0a 50 41 47 45 53 49 5a 45 3d 34 30 |rc2+.PAGESIZE=40|
> 00000020 39 36 0a 53 59 4d 42 4f 4c 28 6d 65 6d 5f 73 65 |96.SYMBOL(mem_se|
> 00000030 63 74 69 6f 6e 29 3d 66 66 66 66 66 66 66 66 38 |ction)=ffffffff8|
> 00000040 34 35 31 61 31 61 38 0a 53 59 4d 42 4f 4c 28 69 |451a1a8.SYMBOL(i|
> 00000050 6e 69 74 5f 75 74 73 5f 6e 73 29 3d 66 66 66 66 |nit_uts_ns)=ffff|
>                                                                       ^^^^
> 00000060 66 66 66 66 38 32 30 31 33 35 34 30 0a 53 59 4d |ffff82013540
>                                                           ^^^^^^^^^^^^
>
> This address has it.

There's no reading of the dumpfile's memory involved, and that being the case,
the vmlinux file is not utilized. That's the whole point of the crash option, i.e.,
taking a vmcore file, and trying to determine what kernel should be used with it:

$ man crash
...
--osrelease dumpfile
Display the OSRELEASE vmcoreinfo string from a kdump dumpfile header.
...


>
> > What's the problem with leaving it alone?
>
> The problem is that I'd like to get all those vmcoreinfo exports under
> control and to not have people frivolously export whatever they feel
> like, for obvious reasons, and to get rid of the duplicate/unused pieces
> being part of vmcoreinfo.

Well, I just don't agree that the OSRELEASE item is "frivolous". It's been in place,
and depended upon, for many years.

Dave


2019-01-14 20:02:15

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 02:36:47PM -0500, Dave Anderson wrote:
> There's no reading of the dumpfile's memory involved, and that being the case,
> the vmlinux file is not utilized. That's the whole point of the crash option, i.e.,
> taking a vmcore file, and trying to determine what kernel should be used with it:
>
> $ man crash
> ...
> --osrelease dumpfile
> Display the OSRELEASE vmcoreinfo string from a kdump dumpfile header.

I don't understand - if you have the vmcoreinfo (which I assume is part
of the vmcore, yes, no?) you can go and dig out the kernel version from
it, no?

Why should you not utilize the vmcore file?

(I'm most likely missing something.)

> Well, I just don't agree that the OSRELEASE item is "frivolous". It's
> been in place, and depended upon, for many years.

Yeah, no. The ABI argument is moot in this case as in the last couple
of months people have been persuading me that vmcoreinfo is not ABI. So
you guys need to make up your mind what is it. And if it is an ABI, it
wasn't documented anywhere.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 20:08:55

by Dave Anderson

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation



----- Original Message -----
> On Mon, Jan 14, 2019 at 02:36:47PM -0500, Dave Anderson wrote:
> > There's no reading of the dumpfile's memory involved, and that being the case,
> > the vmlinux file is not utilized. That's the whole point of the crash option, i.e.,
> > taking a vmcore file, and trying to determine what kernel should be used
> > with it:
> >
> > $ man crash
> > ...
> > --osrelease dumpfile
> > Display the OSRELEASE vmcoreinfo string from a kdump dumpfile header.
>
> I don't understand - if you have the vmcoreinfo (which I assume is part
> of the vmcore, yes, no?) you can go and dig out the kernel version from
> it, no?
>
> Why should you not utilize the vmcore file?

That's what it *does* utilize -- it takes a standalone vmcore dumpfile, and
pulls out the OSRELEASE string from it, so that a user can determine what
vmlinux file should be used with that vmcore for normal crash analysis.

Dave

>
> (I'm most likely missing something.)
>
> > Well, I just don't agree that the OSRELEASE item is "frivolous". It's
> > been in place, and depended upon, for many years.
>
> Yeah, no. The ABI argument is moot in this case as in the last couple
> of months people have been persuading me that vmcoreinfo is not ABI. So
> you guys need to make up your mind what is it. And if it is an ABI, it
> wasn't documented anywhere.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
>

2019-01-14 20:21:04

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 03:07:33PM -0500, Dave Anderson wrote:
> That's what it *does* utilize -- it takes a standalone vmcore dumpfile, and
> pulls out the OSRELEASE string from it, so that a user can determine what
> vmlinux file should be used with that vmcore for normal crash analysis.

And the vmcoreinfo is part of the vmcore, right?

So it can just as well read out the address of init_uts_ns and get the
kernel version from there.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 20:28:29

by Dave Anderson

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation



----- Original Message -----
> On Mon, Jan 14, 2019 at 03:07:33PM -0500, Dave Anderson wrote:
> > That's what it *does* utilize -- it takes a standalone vmcore dumpfile, and
> > pulls out the OSRELEASE string from it, so that a user can determine what
> > vmlinux file should be used with that vmcore for normal crash analysis.
>
> And the vmcoreinfo is part of the vmcore, right?

Correct.

>
> So it can just as well read out the address of init_uts_ns and get the
> kernel version from there.

No. It needs *both* the vmlinux file and the vmcore file in order to read kernel
virtual memory, so just having a kernel virtual address is insufficient.

So it's a chicken-and-egg situation. This particular --osrelease option is used
to determine *what* vmlinux file would be required for an actual crash analysis
session.

Dave
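
Editorial sketch, not part of the original thread: the point above is that OSRELEASE can be read from the vmcoreinfo note text alone, with no vmlinux and no virtual-address translation. The Python below is a minimal illustration under that assumption. The function name parse_vmcoreinfo is hypothetical, the sample bytes are taken from the hexdump quoted earlier in the thread, and this is not how the crash utility itself is implemented.

```python
def parse_vmcoreinfo(note: bytes) -> dict[str, str]:
    """Split a VMCOREINFO note payload into KEY=VALUE pairs.

    Hypothetical helper for illustration only; it assumes the payload
    is newline-separated ASCII text, as the hexdump in the thread shows.
    """
    info = {}
    for line in note.decode("ascii", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines that have no '='
            info[key] = value
    return info


# Leading bytes of the note, reconstructed from the hexdump in the thread.
sample = (
    b"OSRELEASE=5.0.0-rc2+\n"
    b"PAGESIZE=4096\n"
    b"SYMBOL(mem_section)=ffffffff8451a1a8\n"
    b"SYMBOL(init_uts_ns)=ffffffff82013540\n"
)

info = parse_vmcoreinfo(sample)
print(info["OSRELEASE"])            # readable without any vmlinux
print(info["SYMBOL(init_uts_ns)"])  # only a kernel virtual address
```

The SYMBOL(init_uts_ns) value is just a kernel virtual address; turning it into the release string would require reading kernel memory, which is the chicken-and-egg problem described above.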



2019-01-14 20:36:31

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 03:26:32PM -0500, Dave Anderson wrote:
> No. It needs *both* the vmlinux file and the vmcore file in order to read kernel
> virtual memory, so just having a kernel virtual address is insufficient.
>
> So it's a chicken-and-egg situation. This particular --osrelease option is used
> to determine *what* vmlinux file would be required for an actual crash analysis
> session.

Ok, that makes sense. I could've used that explanation when reviewing
the documentation. Do you mind skimming through this:

https://lkml.kernel.org/r/[email protected]

in case we've missed explaining relevant usage - like that above - of
some of the vmcoreinfo members?

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2019-01-14 20:50:52

by Dave Anderson

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation



----- Original Message -----
> On Mon, Jan 14, 2019 at 03:26:32PM -0500, Dave Anderson wrote:
> > No. It needs *both* the vmlinux file and the vmcore file in order to read
> > kernel
> > virtual memory, so just having a kernel virtual address is insufficient.
> >
> > So it's a chicken-and-egg situation. This particular --osrelease option is used
> > to determine *what* vmlinux file would be required for an actual crash analysis
> > session.
>
> Ok, that makes sense. I could've used that explanation when reviewing
> the documentation. Do you mind skimming through this:
>
> https://lkml.kernel.org/r/[email protected]
>
> in case we've missed explaining relevant usage - like that above - of
> some of the vmcoreinfo members?

Yeah, I've been watching the thread, and the document looks fine to me.
It's just that when I saw the discussion of this one being removed that
I felt the need to respond... ;-)

Dave

2019-01-15 09:54:17

by Lianbo Jiang

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On 2019-01-14 17:15, Borislav Petkov wrote:
> On Mon, Jan 14, 2019 at 01:30:30PM +0800, lijiang wrote:
>> I noticed that the checkpatch was coded in Perl. But i am not familiar with
>> the Perl program language, that would be beyond my ability to do this, i have
>> to learn the Perl program language step by step. :-)
>
> You could give it a try - it is not hard :-)
>

Thank you for the encouragement.

> And there's no hurry for this, take your time.
>
OK. I will put this task in my queue. Once I have free time, I will work
on this issue.

>> Do you mean this one 'KERNEL_IMAGE_SIZE'?
>
> I mean, all those which are unused. Optimally, you should look at the
> tools and see whether they're using those exports and if not, remove
> them. But no hurry here too, take your time.
>
OK. I will check again, but basically they are used by the makedumpfile or crash tools.


> My final goal is to have this up-to-date documentation of what is
> exported and what is used by user tools so that people can look at it
> first before carelessly exporting yet another thing.
>
Agree.

Thanks,
Lianbo
> Thx.
>

Subject: [tip:x86/kdump] kdump: Document kernel data exported in the vmcoreinfo note

Commit-ID: f263245a0ce2c4e23b89a58fa5f7dfc048e11929
Gitweb: https://git.kernel.org/tip/f263245a0ce2c4e23b89a58fa5f7dfc048e11929
Author: Lianbo Jiang <[email protected]>
AuthorDate: Thu, 10 Jan 2019 20:19:43 +0800
Committer: Borislav Petkov <[email protected]>
CommitDate: Tue, 15 Jan 2019 11:05:28 +0100

kdump: Document kernel data exported in the vmcoreinfo note

Document data exported in vmcoreinfo and briefly describe its use by
userspace tools.

[ bp: heavily massage and redact the text. ]

Suggested-by: Borislav Petkov <[email protected]>
Signed-off-by: Lianbo Jiang <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Dave Young <[email protected]>
Cc: Jonathan Corbet <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: x86-ml <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
---
Documentation/kdump/vmcoreinfo.txt | 495 +++++++++++++++++++++++++++++++++++++
1 file changed, 495 insertions(+)

diff --git a/Documentation/kdump/vmcoreinfo.txt b/Documentation/kdump/vmcoreinfo.txt
new file mode 100644
index 000000000000..bb94a4bd597a
--- /dev/null
+++ b/Documentation/kdump/vmcoreinfo.txt
@@ -0,0 +1,495 @@
+================================================================
+                           VMCOREINFO
+================================================================
+
+===========
+What is it?
+===========
+
+VMCOREINFO is a special ELF note section. It contains various
+information from the kernel like structure size, page size, symbol
+values, field offsets, etc. These data are packed into an ELF note
+section and used by user-space tools like crash and makedumpfile to
+analyze a kernel's memory layout.
+
+================
+Common variables
+================
+
+init_uts_ns.name.release
+------------------------
+
+The version of the Linux kernel. Used to find the corresponding source
+code from which the kernel has been built. For example, crash uses it to
+find the corresponding vmlinux in order to process vmcore.
+
+PAGE_SIZE
+---------
+
+The size of a page. It is the smallest unit of data used by the memory
+management facilities. It is usually 4096 bytes in size and a page is
+aligned on 4096 bytes. Used for computing page addresses.
+
+init_uts_ns
+-----------
+
+The UTS namespace which is used to isolate two specific elements of the
+system that relate to the uname(2) system call. It is named after the
+data structure used to store information returned by the uname(2) system
+call.
+
+User-space tools can get the kernel name, host name, kernel release
+number, kernel version, architecture name and OS type from it.
+
+node_online_map
+---------------
+
+An array node_states[N_ONLINE] which represents the set of online nodes
+in a system, one bit position per node number. Used to keep track of
+which nodes are in the system and online.
+
+swapper_pg_dir
+--------------
+
+The global page directory pointer of the kernel. Used to translate
+virtual to physical addresses.
+
+_stext
+------
+
+Defines the beginning of the text section. In general, _stext indicates
+the kernel start address. Used to convert a virtual address from the
+direct kernel map to a physical address.
+
+vmap_area_list
+--------------
+
+Stores the virtual area list. makedumpfile gets the vmalloc start value
+from this variable and its value is necessary for vmalloc translation.
+
+mem_map
+-------
+
+Physical addresses are translated to struct pages by treating them as
+an index into the mem_map array. Right-shifting a physical address
+PAGE_SHIFT bits converts it into a page frame number which is an index
+into that mem_map array.
+
+Used to map an address to the corresponding struct page.
+
+contig_page_data
+----------------
+
+Makedumpfile gets the pglist_data structure from this symbol, which is
+used to describe the memory layout.
+
+User-space tools use this to exclude free pages when dumping memory.
+
+mem_section|(mem_section, NR_SECTION_ROOTS)|(mem_section, section_mem_map)
+--------------------------------------------------------------------------
+
+The address of the mem_section array, its length, structure size, and
+the section_mem_map offset.
+
+It exists in the sparse memory mapping model and is somewhat similar
+to the mem_map variable: both of them are used to translate an
+address.
+
+page
+----
+
+The size of a page structure. struct page is an important data structure
+and it is widely used to compute contiguous memory.
+
+pglist_data
+-----------
+
+The size of a pglist_data structure. This value is used to check if the
+pglist_data structure is valid. It is also used for checking the memory
+type.
+
+zone
+----
+
+The size of a zone structure. This value is used to check if the zone
+structure has been found. It is also used for excluding free pages.
+
+free_area
+---------
+
+The size of a free_area structure. It indicates whether the free_area
+structure is valid or not. Useful when excluding free pages.
+
+list_head
+---------
+
+The size of a list_head structure. Used when iterating lists in a
+post-mortem analysis session.
+
+nodemask_t
+----------
+
+The size of a nodemask_t type. Used to compute the number of online
+nodes.
+
+(page, flags|_refcount|mapping|lru|_mapcount|private|compound_dtor|
+ compound_order|compound_head)
+-------------------------------------------------------------------
+
+User-space tools compute their values based on the offset of these
+variables. The variables are used when excluding unnecessary pages.
+
+(pglist_data, node_zones|nr_zones|node_mem_map|node_start_pfn|
+ node_spanned_pages|node_id)
+--------------------------------------------------------------
+
+On NUMA machines, each NUMA node has a pg_data_t to describe its memory
+layout. On UMA machines there is a single pglist_data which describes the
+whole memory.
+
+These values are used to check the memory type and to compute the
+virtual address for memory map.
+
+(zone, free_area|vm_stat|spanned_pages)
+---------------------------------------
+
+Each node is divided into a number of blocks called zones which
+represent ranges within memory. A zone is described by a structure zone.
+
+User-space tools compute required values based on the offset of these
+variables.
+
+(free_area, free_list)
+----------------------
+
+Offset of the free_list's member. This value is used to compute the number
+of free pages.
+
+Each zone has a free_area structure array called free_area[MAX_ORDER].
+The free_list represents a linked list of free page blocks.
+
+(list_head, next|prev)
+----------------------
+
+Offsets of the list_head's members. list_head is used to define a
+circular linked list. User-space tools need these in order to traverse
+lists.
+
+(vmap_area, va_start|list)
+--------------------------
+
+Offsets of the vmap_area's members. They carry vmalloc-specific
+information. Makedumpfile gets the start address of the vmalloc region
+from this.
+
+(zone.free_area, MAX_ORDER)
+---------------------------
+
+Free areas descriptor. User-space tools use this value to iterate the
+free_area ranges. MAX_ORDER is used by the zone buddy allocator.
+
+log_first_idx
+-------------
+
+Index of the first record stored in the buffer log_buf. Used by
+user-space tools to read the strings in the log_buf.
+
+log_buf
+-------
+
+Console output is written to the ring buffer log_buf at index
+log_first_idx. Used to get the kernel log.
+
+log_buf_len
+-----------
+
+log_buf's length.
+
+clear_idx
+---------
+
+The index of the next printk() record to read after the last clear
+command. It indicates the first record after the last
+SYSLOG_ACTION_CLEAR, as issued by 'dmesg -c'. Used by user-space tools
+to dump the dmesg log.
+
+log_next_idx
+------------
+
+The index of the next record to store in the buffer log_buf. Used to
+compute the index of the current buffer position.
+
+printk_log
+----------
+
+The size of a structure printk_log. Used to compute the size of
+messages, and extract dmesg log. It encapsulates header information for
+log_buf, such as timestamp, syslog level, etc.
+
+(printk_log, ts_nsec|len|text_len|dict_len)
+-------------------------------------------
+
+It represents field offsets in struct printk_log. User-space tools
+parse it and check whether the values of printk_log's members have
+changed.
+
+(free_area.free_list, MIGRATE_TYPES)
+------------------------------------
+
+The number of migrate types for pages. The free_list is described by the
+array. Used by tools to compute the number of free pages.
+
+NR_FREE_PAGES
+-------------
+
+On linux-2.6.21 or later, the number of free pages is in
+vm_stat[NR_FREE_PAGES]. Used to get the number of free pages.
+
+PG_lru|PG_private|PG_swapcache|PG_swapbacked|PG_slab|PG_hwpoison
+|PG_head_mask|PAGE_BUDDY_MAPCOUNT_VALUE(~PG_buddy)
+|PAGE_OFFLINE_MAPCOUNT_VALUE(~PG_offline)
+-----------------------------------------------------------------
+
+Page attributes. These flags are used to filter out various pages that
+are unnecessary for dumping.
+
+HUGETLB_PAGE_DTOR
+-----------------
+
+The HUGETLB_PAGE_DTOR flag denotes hugetlbfs pages. Makedumpfile
+excludes these pages.
+
+======
+x86_64
+======
+
+phys_base
+---------
+
+Used to convert the virtual address of an exported kernel symbol to its
+corresponding physical address.
+
+init_top_pgt
+------------
+
+Used to walk through the whole page table and convert virtual addresses
+to physical addresses. The init_top_pgt is somewhat similar to
+swapper_pg_dir, but it is only used in x86_64.
+
+pgtable_l5_enabled
+------------------
+
+User-space tools need to know whether the crash kernel was in 5-level
+paging mode.
+
+node_data
+---------
+
+This is a struct pglist_data array and stores information about all
+NUMA nodes. Makedumpfile gets the pglist_data structure from it.
+
+(node_data, MAX_NUMNODES)
+-------------------------
+
+The maximum number of nodes in the system.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+KERNEL_IMAGE_SIZE
+-----------------
+
+Currently unused by Makedumpfile. Used to compute the module virtual
+address by Crash.
+
+sme_mask
+--------
+
+AMD-specific with SME support: it indicates the secure memory encryption
+mask. Makedumpfile needs to know whether the crash kernel was
+encrypted. If SME is enabled in the first kernel, the crash kernel's
+page table entries (pgd/pud/pmd/pte) contain the memory encryption
+mask. This is used to remove the SME mask and obtain the true physical
+address.
+
+Currently, sme_mask stores the value of the C-bit position. If needed,
+additional SME-relevant info can be placed in that variable.
+
+For example:
+[ misc  ][ enc bit  ][ other misc SME info       ]
+0000_0000_0000_0000_1000_0000_0000_0000_0000_0000_..._0000
+63   59   55   51   47   43   39   35   31   27   ... 3
+
+======
+x86_32
+======
+
+X86_PAE
+-------
+
+Denotes whether physical address extensions are enabled. It has the cost
+of a higher page table lookup overhead, and also consumes more page
+table space per process. Used to check whether PAE was enabled in the
+crash kernel when converting virtual addresses to physical addresses.
+
+====
+ia64
+====
+
+pgdat_list|(pgdat_list, MAX_NUMNODES)
+-------------------------------------
+
+pg_data_t array storing information about all NUMA nodes. MAX_NUMNODES
+indicates the number of the nodes.
+
+node_memblk|(node_memblk, NR_NODE_MEMBLKS)
+------------------------------------------
+
+List of node memory chunks. Filled when parsing the SRAT table to obtain
+information about memory nodes. NR_NODE_MEMBLKS indicates the number of
+node memory chunks.
+
+These values are used to compute the number of nodes the crashed kernel used.
+
+node_memblk_s|(node_memblk_s, start_paddr)|(node_memblk_s, size)
+----------------------------------------------------------------
+
+The size of a struct node_memblk_s and the offsets of the
+node_memblk_s's members. Used to compute the number of nodes.
+
+PGTABLE_3|PGTABLE_4
+-------------------
+
+User-space tools need to know whether the crash kernel was in 3-level or
+4-level paging mode. Used to distinguish the page table.
+
+=====
+ARM64
+=====
+
+VA_BITS
+-------
+
+The maximum number of bits for virtual addresses. Used to compute the
+virtual memory ranges.
+
+kimage_voffset
+--------------
+
+The offset between the kernel virtual and physical mappings. Used to
+translate virtual to physical addresses.
+
+PHYS_OFFSET
+-----------
+
+Indicates the physical address of the start of memory. Similar to
+kimage_voffset, which is used to translate virtual to physical
+addresses.
+
+KERNELOFFSET
+------------
+
+The kernel randomization offset. Used to compute the page offset. If
+KASLR is disabled, this value is zero.
+
+===
+arm
+===
+
+ARM_LPAE
+--------
+
+It indicates whether the crash kernel supports large physical address
+extensions. Used to translate virtual to physical addresses.
+
+====
+s390
+====
+
+lowcore_ptr
+-----------
+
+An array with a pointer to the lowcore of every CPU. Used to print the
+PSW and all register information.
+
+high_memory
+-----------
+
+Used to get the vmalloc_start address from the high_memory symbol.
+
+(lowcore_ptr, NR_CPUS)
+----------------------
+
+The maximum number of CPUs.
+
+=======
+powerpc
+=======
+
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+contig_page_data
+----------------
+
+See above.
+
+vmemmap_list
+------------
+
+The vmemmap_list maintains the entire vmemmap physical mapping. Used
+to get vmemmap list count and populated vmemmap regions info. If the
+vmemmap address translation information is stored in the crash kernel,
+it is used to translate vmemmap kernel virtual addresses.
+
+mmu_vmemmap_psize
+-----------------
+
+The size of a page. Used to translate virtual to physical addresses.
+
+mmu_psize_defs
+--------------
+
+Page size definitions, i.e. 4k, 64k, or 16M.
+
+Used to make vtop translations.
+
+vmemmap_backing|(vmemmap_backing, list)|(vmemmap_backing, phys)|
+(vmemmap_backing, virt_addr)
+----------------------------------------------------------------
+
+The vmemmap virtual address space management does not have a traditional
+page table to track which virtual struct pages are backed by a physical
+mapping. The virtual to physical mappings are tracked in a simple linked
+list format.
+
+User-space tools need to know the offset of list, phys and virt_addr
+when computing the count of vmemmap regions.
+
+mmu_psize_def|(mmu_psize_def, shift)
+------------------------------------
+
+The size of a struct mmu_psize_def and the offset of mmu_psize_def's
+member.
+
+Used in vtop translations.
+
+==
+sh
+==
+
+node_data|(node_data, MAX_NUMNODES)
+-----------------------------------
+
+See above.
+
+X2TLB
+-----
+
+Indicates whether the crashed kernel enabled SH extended mode.
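
Editorial sketch, not part of the patch above: two of the address computations the document describes (removing the SME mask from a page table entry, and right-shifting a physical address by PAGE_SHIFT to get a page frame number) can be illustrated in a few lines of Python. The function names and sample values are hypothetical, PAGE_SHIFT = 12 assumes PAGESIZE=4096, and the C-bit at position 47 follows the sme_mask diagram in the document. Real page table entries also carry flag bits that this sketch does not handle, and this is not makedumpfile's actual code.

```python
PAGE_SHIFT = 12  # assumption: PAGESIZE=4096


def strip_sme_mask(entry: int, sme_mask: int) -> int:
    """Clear the memory encryption bit(s) from a page table entry value."""
    return entry & ~sme_mask


def phys_to_pfn(paddr: int) -> int:
    """Right-shift a physical address by PAGE_SHIFT to get its page frame number."""
    return paddr >> PAGE_SHIFT


# Hypothetical values: C-bit at position 47 (as in the sme_mask diagram)
# and a made-up, page-aligned physical address.
sme_mask = 1 << 47
entry = sme_mask | 0x12345000

paddr = strip_sme_mask(entry, sme_mask)
print(hex(paddr))               # 0x12345000
print(hex(phys_to_pfn(paddr)))  # 0x12345
```

With sme_mask equal to zero (SME disabled), strip_sme_mask() leaves the entry unchanged, so a tool could apply it unconditionally.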

2019-01-15 11:03:53

by Lianbo Jiang

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On 2019-01-15 04:49, Dave Anderson wrote:
>
>
> ----- Original Message -----
>> On Mon, Jan 14, 2019 at 03:26:32PM -0500, Dave Anderson wrote:
>>> No. It needs *both* the vmlinux file and the vmcore file in order to read
>>> kernel
>>> virtual memory, so just having a kernel virtual address is insufficient.
>>>
>>> So it's a chicken-and-egg situation. This particular --osrelease option is used
>>> to determine *what* vmlinux file would be required for an actual crash analysis
>>> session.
>>
>> Ok, that makes sense. I could've used that explanation when reviewing
>> the documentation. Do you mind skimming through this:
>>
>> https://lkml.kernel.org/r/[email protected]
>>
>> in case we've missed explaining relevant usage - like that above - of
>> some of the vmcoreinfo members?
>
> Yeah, I've been watching the thread, and the document looks fine to me.
> It's just that when I saw the discussion of this one being removed that
> I felt the need to respond... ;-)

Thank you for explaining this issue.

Regards,
Lianbo

>
> Dave
>

2019-01-15 11:42:02

by Borislav Petkov

Subject: Re: [PATCH 1/2 v6] kdump: add the vmcoreinfo documentation

On Mon, Jan 14, 2019 at 03:49:14PM -0500, Dave Anderson wrote:
> Yeah, I've been watching the thread, and the document looks fine to me.
> It's just that when I saw the discussion of this one being removed that
> I felt the need to respond... ;-)

Good. :-)

Ok, I've amended the init_uts_ns.name.release description with the
additional important info.

Thx.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.