2018-11-15 21:48:57

by Bhupesh Sharma

[permalink] [raw]
Subject: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

x86_64 kernel uses 'page_offset_base' variable to point to the
start of direct mapping of all physical memory. This variable
is also updated for KASLR boot cases, so this can be exported
via vmcoreinfo as a standard ABI between kernel and user-space,
to allow user-space utilities to use the same for calculating
the start of direct mapping of all physical memory.

'arch/x86/kernel/head64.c' sets the same as:
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;

and also uses the same to indicate the base of KASLR regions on x86_64:
static __initdata struct kaslr_memory_region {
unsigned long *base;
unsigned long size_tb;
} kaslr_regions[] = {
{ &page_offset_base, 0 },
.. snip ..

Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
live-debugging of a running kernel via user-space utilities
like makedumpfile (see [1]).

Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
details), whose live debugging feature is broken with newer kernels
(I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
added to kcore, thus leading to an additional sections in the same, and
makedumpfile is not longer able to determine the start of direct
mapping of all physical memory, as it relies on traversing the PT_LOAD
segments inside kcore and using the last PT_LOAD segment
to determine the start of direct mapping.

Such user-space issues can be resolved if the user-space code instead
uses a standard ABI to read the kernel exposed machine specific
variables. With the kernel commit 23c85094fe1895caefdd
["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is
now possible to use the vmcoreinfo present inside kcore as the standard
ABI which can be used by the user-space utilities for reading
the machine specific information (and hence for debugging a
live kernel).

User-space utilities like makedumpfile, kexec-tools and crash
are either already using this ABI or are discussing patches
which look to add the same feature. This helps in simplifying the
overall code and also in reducing code-rewrite across the
user-space utilities for getting values of these kernel
symbols/variables.

Accordingly this patch allows appending 'page_offset_base' for
x86_64 platforms to vmcoreinfo, so that user-space tools can use the
same as a standard interface to determine the start of direct mapping
of all physical memory.

Testing:
-------
- I tested this patch (rebased on 'linux-next') on a x86_64 machine
using the modified 'makedumpfile' user-space code (see [3] for my
github tree which contains the same) for determining how many pages
are dumpable when different dump_level is specified (which is
one use-case of live-debugging via 'makedumpfile').
- I tested both the KASLR and non-KASLR boot cases with this patch.
- Here is one sample log (for KASLR boot case) on my x86_64 machine:

< snip..>
The kernel doesn't support mmap(),read() will be used instead.

TYPE PAGES EXCLUDABLE DESCRIPTION
----------------------------------------------------------------------
ZERO 21299 yes Pages filled
with zero
NON_PRI_CACHE 91785 yes Cache
pages without private flag
PRI_CACHE 1 yes Cache pages with
private flag
USER 14057 yes User process
pages
FREE 740346 yes Free pages
KERN_DATA 58152 no Dumpable kernel
data

page size: 4096
Total pages on system: 925640
Total size on system: 3791421440 Byte

[1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
[2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
[3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1

Cc: Boris Petkov <[email protected]>
Cc: Baoquan He <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Kazuhito Hagio <[email protected]>
Cc: Dave Anderson <[email protected]>
Cc: James Morse <[email protected]>
Cc: Omar Sandoval <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Bhupesh Sharma <[email protected]>
---
Changes since v1:
- Fixed the build issue reported by build bot and tested this version
with 'make allmodconfig'.
- Reworded most of the commit log to explain the intent behind the
patch.

arch/x86/kernel/machine_kexec_64.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 4c8acdfdc5a7..6161d77c5bfb 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
VMCOREINFO_SYMBOL(init_top_pgt);
vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
pgtable_l5_enabled());
+#ifdef CONFIG_RANDOMIZE_BASE
+ VMCOREINFO_NUMBER(page_offset_base);
+#endif

#ifdef CONFIG_NUMA
VMCOREINFO_SYMBOL(node_data);
--
2.7.4



2018-11-19 21:18:55

by Kazuhito Hagio

[permalink] [raw]
Subject: RE: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/15/2018 4:47 PM, Bhupesh Sharma wrote:
> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).

I agree.

> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> details), whose live debugging feature is broken with newer kernels
> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.

Actually, the KCORE_REMAP segments were already removed from kcore by
commit bf904d2762ee ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
and kcore's program headers got back to the previous ones, but this
fact shows that they are changeable.

So I think that if we have this NUMBER(page_offset_base) in vmcoreinfo
for x86_64, user-space tools (especially makedumpfile) would become
more stable against changes in kcore and vmcore, rather than depending
on their PT_LOAD values.

Thanks,
Kazu



2018-11-21 08:42:21

by Bhupesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

Hi Kazu,

On Tue, Nov 20, 2018 at 2:47 AM Kazuhito Hagio <[email protected]> wrote:
>
> On 11/15/2018 4:47 PM, Bhupesh Sharma wrote:
> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
>
> I agree.
>
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> > details), whose live debugging feature is broken with newer kernels
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
>
> Actually, the KCORE_REMAP segments were already removed from kcore by
> commit bf904d2762ee ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
> and kcore's program headers got back to the previous ones, but this
> fact shows that they are changeable.
>
> So I think that if we have this NUMBER(page_offset_base) in vmcoreinfo
> for x86_64, user-space tools (especially makedumpfile) would become
> more stable against changes in kcore and vmcore, rather than depending
> on their PT_LOAD values.

Thanks a lot for your review.

Hi x86 maintainers,

Ping, any comments on this patch. Since it is a useful addition to the
kernel-userspace ABI (as suggested by the user-space utility
maintainer as well), can we consider including this in the kernel?

Thanks,
Bhupesh

2018-11-21 11:49:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

+ Kees.

On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> x86_64 kernel uses 'page_offset_base' variable to point to the
> start of direct mapping of all physical memory. This variable
> is also updated for KASLR boot cases, so this can be exported
> via vmcoreinfo as a standard ABI between kernel and user-space,
> to allow user-space utilities to use the same for calculating
> the start of direct mapping of all physical memory.
>
> 'arch/x86/kernel/head64.c' sets the same as:
> unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
>
> and also uses the same to indicate the base of KASLR regions on x86_64:
> static __initdata struct kaslr_memory_region {
> unsigned long *base;
> unsigned long size_tb;
> } kaslr_regions[] = {
> { &page_offset_base, 0 },
> .. snip ..

Why is that detail needed in the commit message?

> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).
>
> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for

Use passive tone in your commit message: no "we" or "I", etc.

Also, pls read section "2) Describe your changes" in
Documentation/process/submitting-patches.rst.

> details), whose live debugging feature is broken with newer kernels
> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.
>
> Such user-space issues can be resolved if the user-space code instead
> uses a standard ABI to read the kernel exposed machine specific
> variables. With the kernel commit 23c85094fe1895caefdd
> ["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is

ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 23c85094fe18 ("proc/kcore: add vmcoreinfo note to /proc/kcore")'
#54:
variables. With the kernel commit 23c85094fe1895caefdd

> now possible to use the vmcoreinfo present inside kcore as the standard
> ABI which can be used by the user-space utilities for reading
> the machine specific information (and hence for debugging a
> live kernel).
>
> User-space utilities like makedumpfile, kexec-tools and crash
> are either already using this ABI or are discussing patches
> which look to add the same feature. This helps in simplifying the
> overall code and also in reducing code-rewrite across the
> user-space utilities for getting values of these kernel
> symbols/variables.

> Accordingly this patch allows appending 'page_offset_base' for
> x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> same as a standard interface to determine the start of direct mapping
> of all physical memory.
>
> Testing:
> -------
> - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> using the modified 'makedumpfile' user-space code (see [3] for my
> github tree which contains the same) for determining how many pages
> are dumpable when different dump_level is specified (which is
> one use-case of live-debugging via 'makedumpfile').
> - I tested both the KASLR and non-KASLR boot cases with this patch.
> - Here is one sample log (for KASLR boot case) on my x86_64 machine:
>
> < snip..>
> The kernel doesn't support mmap(),read() will be used instead.
>
> TYPE PAGES EXCLUDABLE DESCRIPTION
> ----------------------------------------------------------------------
> ZERO 21299 yes Pages filled
> with zero
> NON_PRI_CACHE 91785 yes Cache
> pages without private flag
> PRI_CACHE 1 yes Cache pages with
> private flag
> USER 14057 yes User process
> pages
> FREE 740346 yes Free pages
> KERN_DATA 58152 no Dumpable kernel
> data
>
> page size: 4096
> Total pages on system: 925640
> Total size on system: 3791421440 Byte
>
> [1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
> [2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
> [3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1
>
> Cc: Boris Petkov <[email protected]>
> Cc: Baoquan He <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Kazuhito Hagio <[email protected]>
> Cc: Dave Anderson <[email protected]>
> Cc: James Morse <[email protected]>
> Cc: Omar Sandoval <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Bhupesh Sharma <[email protected]>
> ---
> Changes since v1:
> - Fixed the build issue reported by build bot and tested this version
> with 'make allmodconfig'.
> - Reworded most of the commit log to explain the intent behind the
> patch.
>
> arch/x86/kernel/machine_kexec_64.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4c8acdfdc5a7..6161d77c5bfb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> VMCOREINFO_SYMBOL(init_top_pgt);
> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> pgtable_l5_enabled());
> +#ifdef CONFIG_RANDOMIZE_BASE
> + VMCOREINFO_NUMBER(page_offset_base);
> +#endif
>
> #ifdef CONFIG_NUMA
> VMCOREINFO_SYMBOL(node_data);
> --

All above are only nitpicks though.

My opinion is this: people are exporting all kinds of kernel-internal
stuff in vmcoreinfo and frankly, I'm not crazy about this idea.

And AFAICT, this thing basically bypasses KASLR completely but you need
root for it so it probably doesn't really matter.

Now, on another thread we agreed more or less that what gets exported in
vmcoreinfo is so tightly coupled to the running kernel so that it is not
even considered an ABI. I guess that is debatable but whatever.

So my only request right now would be to have all those things being
exported, documented somewhere and I believe Lianbo is working on that.

But I'm sure others will have more to say about it.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2018-11-24 20:09:33

by Bhupesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

Hi Boris,

Thanks for your review. Please see my replies inline:

On Wed, Nov 21, 2018 at 5:10 PM Borislav Petkov <[email protected]> wrote:
>
> + Kees.
>
> On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> > x86_64 kernel uses 'page_offset_base' variable to point to the
> > start of direct mapping of all physical memory. This variable
> > is also updated for KASLR boot cases, so this can be exported
> > via vmcoreinfo as a standard ABI between kernel and user-space,
> > to allow user-space utilities to use the same for calculating
> > the start of direct mapping of all physical memory.
> >
> > 'arch/x86/kernel/head64.c' sets the same as:
> > unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
> >
> > and also uses the same to indicate the base of KASLR regions on x86_64:
> > static __initdata struct kaslr_memory_region {
> > unsigned long *base;
> > unsigned long size_tb;
> > } kaslr_regions[] = {
> > { &page_offset_base, 0 },
> > .. snip ..
>
> Why is that detail needed in the commit message?

This (and similar) details were requested by Baoquan during the v1
review, that is why I added them to the v2 commit log.
Although personally I also think that such details are probably not
needed in a commit log (may be better suited for a cover letter, which
is maybe a overkill for this single patch).
Will drop this from v3.

> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
> >
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
>
> Use passive tone in your commit message: no "we" or "I", etc.

Ok.

> Also, pls read section "2) Describe your changes" in
> Documentation/process/submitting-patches.rst.

Ok.

> > details), whose live debugging feature is broken with newer kernels
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
> >
> > Such user-space issues can be resolved if the user-space code instead
> > uses a standard ABI to read the kernel exposed machine specific
> > variables. With the kernel commit 23c85094fe1895caefdd
> > ["proc/kcore: add vmcoreinfo note to /proc/kcore"]), it is
>
> ERROR: Please use git commit description style 'commit <12+ chars of sha1> ("<title line>")' - ie: 'commit 23c85094fe18 ("proc/kcore: add vmcoreinfo note to /proc/kcore")'
> #54:
> variables. With the kernel commit 23c85094fe1895caefdd

Ok.

> > now possible to use the vmcoreinfo present inside kcore as the standard
> > ABI which can be used by the user-space utilities for reading
> > the machine specific information (and hence for debugging a
> > live kernel).
> >
> > User-space utilities like makedumpfile, kexec-tools and crash
> > are either already using this ABI or are discussing patches
> > which look to add the same feature. This helps in simplifying the
> > overall code and also in reducing code-rewrite across the
> > user-space utilities for getting values of these kernel
> > symbols/variables.
>
> > Accordingly this patch allows appending 'page_offset_base' for
> > x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> > same as a standard interface to determine the start of direct mapping
> > of all physical memory.
> >
> > Testing:
> > -------
> > - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> > using the modified 'makedumpfile' user-space code (see [3] for my
> > github tree which contains the same) for determining how many pages
> > are dumpable when different dump_level is specified (which is
> > one use-case of live-debugging via 'makedumpfile').
> > - I tested both the KASLR and non-KASLR boot cases with this patch.
> > - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> >
> > < snip..>
> > The kernel doesn't support mmap(),read() will be used instead.
> >
> > TYPE PAGES EXCLUDABLE DESCRIPTION
> > ----------------------------------------------------------------------
> > ZERO 21299 yes Pages filled
> > with zero
> > NON_PRI_CACHE 91785 yes Cache
> > pages without private flag
> > PRI_CACHE 1 yes Cache pages with
> > private flag
> > USER 14057 yes User process
> > pages
> > FREE 740346 yes Free pages
> > KERN_DATA 58152 no Dumpable kernel
> > data
> >
> > page size: 4096
> > Total pages on system: 925640
> > Total size on system: 3791421440 Byte
> >
> > [1]. MAN pages -> MAKEDUMPFILE(8) and CRASH(8)
> > [2]. makedumpfile issue with latest kernels -> http://lists.infradead.org/pipermail/kexec/2018-October/021769.html
> > [3]. https://github.com/bhupesh-sharma/makedumpfile/tree/add-page-offset-base-to-vmcore-v1
> >
> > Cc: Boris Petkov <[email protected]>
> > Cc: Baoquan He <[email protected]>
> > Cc: Ingo Molnar <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: Kazuhito Hagio <[email protected]>
> > Cc: Dave Anderson <[email protected]>
> > Cc: James Morse <[email protected]>
> > Cc: Omar Sandoval <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Bhupesh Sharma <[email protected]>
> > ---
> > Changes since v1:
> > - Fixed the build issue reported by build bot and tested this version
> > with 'make allmodconfig'.
> > - Reworded most of the commit log to explain the intent behind the
> > patch.
> >
> > arch/x86/kernel/machine_kexec_64.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > VMCOREINFO_SYMBOL(init_top_pgt);
> > vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > pgtable_l5_enabled());
> > +#ifdef CONFIG_RANDOMIZE_BASE
> > + VMCOREINFO_NUMBER(page_offset_base);
> > +#endif
> >
> > #ifdef CONFIG_NUMA
> > VMCOREINFO_SYMBOL(node_data);
> > --
>
> All above are only nitpicks though.

Right the above are mostly nitpicks, but useful ones, so I will
address these in the v3.

> My opinion is this: people are exporting all kinds of kernel-internal
> stuff in vmcoreinfo and frankly, I'm not crazy about this idea.
>
> And AFAICT, this thing basically bypasses KASLR completely but you need
> root for it so it probably doesn't really matter.
>
> Now, on another thread we agreed more or less that what gets exported in
> vmcoreinfo is so tightly coupled to the running kernel so that it is not
> even considered an ABI. I guess that is debatable but whatever.

I understand.

> So my only request right now would be to have all those things being
> exported, documented somewhere and I believe Lianbo is working on that.

I will work with Lianbo to get this added to his documentation patch as well.

> But I'm sure others will have more to say about it.

Sure, I will wait for a couple of days more and then send out a v3 and
also make sure that
this variable being export'ed also gets added to the overall documentation patch
which Lianbo is creating.

Regards,
Bhupesh

2018-11-25 10:21:01

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/25/18 at 01:36am, Bhupesh Sharma wrote:
> Hi Boris,
>
> Thanks for your review. Please see my replies inline:
>
> On Wed, Nov 21, 2018 at 5:10 PM Borislav Petkov <[email protected]> wrote:
> >
> > + Kees.
> >
> > On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
> > > x86_64 kernel uses 'page_offset_base' variable to point to the
> > > start of direct mapping of all physical memory. This variable
> > > is also updated for KASLR boot cases, so this can be exported
> > > via vmcoreinfo as a standard ABI between kernel and user-space,
> > > to allow user-space utilities to use the same for calculating
> > > the start of direct mapping of all physical memory.
> > >
> > > 'arch/x86/kernel/head64.c' sets the same as:
> > > unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
> > >
> > > and also uses the same to indicate the base of KASLR regions on x86_64:
> > > static __initdata struct kaslr_memory_region {
> > > unsigned long *base;
> > > unsigned long size_tb;
> > > } kaslr_regions[] = {
> > > { &page_offset_base, 0 },
> > > .. snip ..
> >
> > Why is that detail needed in the commit message?
>
> This (and similar) details were requested by Baoquan during the v1
> review, that is why I added them to the v2 commit log.

Hmm, Bhupesh, this is what I requested in your v1:
lkml.kernel.org/r/20181030020752.GB11408@MiWiFi-R3L-srv

You said x86 makedumpfile is broken because KCORE_REMAP is added. I
wanted to know why it's broken. Now I think I don't need to request it
in this thread, becasue Kazu has made it clear and fixed it:
[email protected]

Or could you abstract the sentences in which I requested you to add
above information into patch log so that I can notice my expression
and can improve on future reviewing and communicating?

And for problem solving and describing, if things are simple, just
tell it direclty; If things are complex, try to simplify it and make it
be simply told. For this page_offset_base exporting, we can save time on
'what' and 'how' since it's very simple and very easy to see what it is
and how it's done. You just need tell WHY it's needed. BUT I saw so many
words and not easy to get why. If simplifying problems is one kind of
ability, I don't know what to say about complicating one simple problem,
another kind of ability?

I have read your patch log when you sent out v2, honestly I don't like it.
I tried to not touch this patch, hope other people can review and give
comments. This can make you know I didn't intentionally embarrass you on
patch reviewing.

Honestly, if this patch is merged, no matter v2 or your old v1,
a small record will be made. On my x1 laptop, I need scroll down FOUR
full screens till the real patch content is seen. Are those really
needed? You can check other vmcoreinfo exporting patch.

I really don't want to review people's patch if they don't welcome my
words, just I was asked to do. Hope you know I had no any offence when
I started to read your patch, just I felt not good when my head was
scratched to bleed when read. My english is not good, I don't know how to
express euphemistically sometime, so if I say so about one thing, that is
how I think so to the thing, without any personal prejudice or attack to
person. Please forgive me if any offence felt.

Thanks
Baoquan

> Although personally I also think that such details are probably not
> needed in a commit log (may be better suited for a cover letter, which
> is maybe a overkill for this single patch).
> Will drop this from v3.


2018-11-26 01:29:16

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/16/18 at 03:17am, Bhupesh Sharma wrote:
> Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> live-debugging of a running kernel via user-space utilities
> like makedumpfile (see [1]).
>
> Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> details), whose live debugging feature is broken with newer kernels

I think this paragraph explained why KCORE_REMAP adding caused the
mistake of page_offset calculation in makedumpfile. It can prove the
advantage of appending 'page_offset_base' to vmcoreinfo. The old way I
took in makedumpfile could be impacted by kernel code change, adding it
to vmcoreinfo can make it stable. The example is KCORE_REMAP adding, and
later it's removed.

But it's not live debugging feature of makedumpfile. Makedumpfile can't be
used to live debug. The feature is called '--mem-usage' in makedumpfile,
in fact it's used to estimate how big the vmcore could be so that customer
can deply an appropriate size of storage space to store it. Because both
kcore and vmcore are all elf files which the 1st kernel's memory is
mapped to, even though they are different, kcore is dynamically changing.
This is more likely a precision in order of of magnitude. This is a feature
required by redhat customer.

I thought you are talking about using DaveA's crash utility to live
debug the running kernel, like we usually do with gdb.

gdb vmlinux /proc/kcore

Yes, this gdb live debugging is broken because of KASLR. We have bug about
this, while it has not been fixed. Using Crash utility to replace gdb is
one way if Crash code is adjusted.

> (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> added to kcore, thus leading to an additional sections in the same, and
> makedumpfile is not longer able to determine the start of direct
> mapping of all physical memory, as it relies on traversing the PT_LOAD
> segments inside kcore and using the last PT_LOAD segment
> to determine the start of direct mapping.
...
> Testing:
> -------

This one vmcoreinfo entry adding won't impact kernel performance. And
page_offset_base need be got during makedumpfile initialization, it
won't impact makedumpfile efficiency either, especially compared with
the later page filterring and writting out to storage space. I don't
think there's any need to provide a detailed test result here. If
possible, just mention it works in this way, maybe it's better in some
aspects, such as code simplicity, etc.

> - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> using the modified 'makedumpfile' user-space code (see [3] for my
> github tree which contains the same) for determining how many pages
> are dumpable when different dump_level is specified (which is
> one use-case of live-debugging via 'makedumpfile').
> - I tested both the KASLR and non-KASLR boot cases with this patch.
> - Here is one sample log (for KASLR boot case) on my x86_64 machine:
>
> < snip..>
> The kernel doesn't support mmap(),read() will be used instead.
>
> TYPE PAGES EXCLUDABLE DESCRIPTION
> ----------------------------------------------------------------------
> ZERO 21299 yes Pages filled
> with zero
> NON_PRI_CACHE 91785 yes Cache
> pages without private flag
> PRI_CACHE 1 yes Cache pages with
> private flag
> USER 14057 yes User process
> pages
> FREE 740346 yes Free pages
> KERN_DATA 58152 no Dumpable kernel
> data
>
> page size: 4096
> Total pages on system: 925640
> Total size on system: 3791421440 Byte
>
...

> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> index 4c8acdfdc5a7..6161d77c5bfb 100644
> --- a/arch/x86/kernel/machine_kexec_64.c
> +++ b/arch/x86/kernel/machine_kexec_64.c
> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> VMCOREINFO_SYMBOL(init_top_pgt);
> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> pgtable_l5_enabled());
> +#ifdef CONFIG_RANDOMIZE_BASE

Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
right. The latest kernel is using page_offset_base to do the dynamic
memory layout between level4 and level5 changing. This may not work in
5-level system with CONFIG_RANDOMIZE_BASE=n.

> + VMCOREINFO_NUMBER(page_offset_base);
> +#endif
>
> #ifdef CONFIG_NUMA
> VMCOREINFO_SYMBOL(node_data);
> --
> 2.7.4
>

2018-11-26 19:34:12

by Bhupesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On Mon, Nov 26, 2018 at 6:58 AM Baoquan He <[email protected]> wrote:
>
> On 11/16/18 at 03:17am, Bhupesh Sharma wrote:
> > Adding 'page_offset_base' to the vmcoreinfo can be specially useful for
> > live-debugging of a running kernel via user-space utilities
> > like makedumpfile (see [1]).
> >
> > Recently, I saw an issue with the 'makedumpfile' utility (see [2] for
> > details), whose live debugging feature is broken with newer kernels
>
> I think this paragraph explained why KCORE_REMAP adding caused the
> mistake of page_offset calculation in makedumpfile. It can prove the
> advantage of appending 'page_offset_base' to vmcoreinfo. The old way I
> took in makedumpfile could be impacted by kernel code change, adding it
> to vmcoreinfo can make it stable. The example is KCORE_REMAP adding, and
> later it's removed.
>
> But it's not live debugging feature of makedumpfile. Makedumpfile can't be
> used to live debug. The feature is called '--mem-usage' in makedumpfile,
> in fact it's used to estimate how big the vmcore could be so that customer
> can deply an appropriate size of storage space to store it. Because both
> kcore and vmcore are all elf files which the 1st kernel's memory is
> mapped to, even though they are different, kcore is dynamically changing.
> This is more likely a precision in order of of magnitude. This is a feature
> required by redhat customer.

Indeed this is a live debugging feature - see we are running this in
the primary kernel
context, not in kdump context. We are trying to debug a kernel we are
presently running (in this case determining the page mapping)
hence the term live debugging.

Also, this feature is not limited to redhat - we are talking in
upstream makedumpfile context here - it is used by other projects as
well which can have even a simple busybox rootfs configuration (e.g.
qemu).

> I thought you are talking about using DaveA's crash utility to live
> debug the running kernel, like we usually do with gdb.
>
> gdb vmlinux /proc/kcore
>
> Yes, this gdb live debugging is broken because of KASLR. We have bug about
> this, while it has not been fixed. Using Crash utility to replace gdb is
> one way if Crash code is adjusted.
>
> > (I tested the same with 4.19-rc8+ kernel), as KCORE_REMAP segments were
> > added to kcore, thus leading to an additional sections in the same, and
> > makedumpfile is not longer able to determine the start of direct
> > mapping of all physical memory, as it relies on traversing the PT_LOAD
> > segments inside kcore and using the last PT_LOAD segment
> > to determine the start of direct mapping.
> ...
> > Testing:
> > -------
>
> This one vmcoreinfo entry adding won't impact kernel performance. And
> page_offset_base need be got during makedumpfile initialization, it
> won't impact makedumpfile efficiency either, especially compared with
> the later page filterring and writting out to storage space. I don't
> think there's any need to provide a detailed test result here. If
> possible, just mention it works in this way, maybe it's better in some
> aspects, such as code simplicity, etc.
>
> > - I tested this patch (rebased on 'linux-next') on a x86_64 machine
> > using the modified 'makedumpfile' user-space code (see [3] for my
> > github tree which contains the same) for determining how many pages
> > are dumpable when different dump_level is specified (which is
> > one use-case of live-debugging via 'makedumpfile').
> > - I tested both the KASLR and non-KASLR boot cases with this patch.
> > - Here is one sample log (for KASLR boot case) on my x86_64 machine:
> >
> > < snip..>
> > The kernel doesn't support mmap(),read() will be used instead.
> >
> > TYPE PAGES EXCLUDABLE DESCRIPTION
> > ----------------------------------------------------------------------
> > ZERO 21299 yes Pages filled
> > with zero
> > NON_PRI_CACHE 91785 yes Cache
> > pages without private flag
> > PRI_CACHE 1 yes Cache pages with
> > private flag
> > USER 14057 yes User process
> > pages
> > FREE 740346 yes Free pages
> > KERN_DATA 58152 no Dumpable kernel
> > data
> >
> > page size: 4096
> > Total pages on system: 925640
> > Total size on system: 3791421440 Byte
> >
> ...
>
> > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > --- a/arch/x86/kernel/machine_kexec_64.c
> > +++ b/arch/x86/kernel/machine_kexec_64.c
> > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > VMCOREINFO_SYMBOL(init_top_pgt);
> > vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > pgtable_l5_enabled());
> > +#ifdef CONFIG_RANDOMIZE_BASE
>
> Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
> right. The latest kernel is using page_offset_base to do the dynamic
> memory layout between level4 and level5 changing. This may not work in
> 5-level system with CONFIG_RANDOMIZE_BASE=n.

I think you missed the v2 change log and the build-bot error on v1
(see here: <https://patchwork.kernel.org/patch/10657933/#22291691>).
With .config files which have CONFIG_RANDOMIZE_BASE=n, we get the
following compilation error
without the #ifdef jugglery:

arch/x86/kernel/machine_kexec_64.o: In function `arch_crash_save_vmcoreinfo':
arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
`page_offset_base'
arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
`page_offset_base'

Anyways, with Kazu's and Boris's comments on the v2, I understand that
adding 'page_offset_base' variable to vmcoreinfo is useful for x86
kernel.
I will now work on the v3 to take into account review comments and
also work with Lianbo to get the same added to the overall vmcoreinfo
documentation he is preparing for x86.

Thanks,
Bhupesh


> > + VMCOREINFO_NUMBER(page_offset_base);
> > +#endif
> >
> > #ifdef CONFIG_NUMA
> > VMCOREINFO_SYMBOL(node_data);
> > --
> > 2.7.4
> >

2018-11-27 06:57:44

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/27/18 at 01:01am, Bhupesh Sharma wrote:
> > But it's not live debugging feature of makedumpfile. Makedumpfile can't be
> > used to live debug. The feature is called '--mem-usage' in makedumpfile,
> > in fact it's used to estimate how big the vmcore could be so that customer
> > can deply an appropriate size of storage space to store it. Because both
> > kcore and vmcore are all elf files which the 1st kernel's memory is
> > mapped to, even though they are different, kcore is dynamically changing.
> > This is more likely a precision in order of of magnitude. This is a feature
> > required by redhat customer.
>
> Indeed this is a live debugging feature - see we are running this in
> the primary kernel
> context, not in kdump context. We are trying to debug a kernel we are
> presently running (in this case determining the page mapping)
> hence the term live debugging.

Thanks for redefining '--mem-usage' of makedumpfile as a live debugging
feature. I am the author of this feature, I don't know it has this
awesome attribute. Can you post patch to refresh it's doc or man page
about this live debugging thing? I think it's important, I am
astonished. I noticed Boris said you are adding this to support live
debug, just not sure if he is misled by you, or he also think this
vmcore size estimating feature of makedumpfile is live debugging
feature.

>
> Also, this feature is not limited to redhat - we are talking in
> upstream makedumpfile context here - it is used by other projects as
> well which can have even a simple busybox rootfs configuration (e.g.
> qemu).

Yes, it's not limited to redhat, I am just saying its origin.

>
> > I thought you are talking about using DaveA's crash utility to live
> > debug the running kernel, like we usually do with gdb.
> >
> > gdb vmlinux /proc/kcore
> >
> > Yes, this gdb live debugging is broken because of KASLR. We have bug about
> > this, while it has not been fixed. Using Crash utility to replace gdb is
> > one way if Crash code is adjusted.

......

> > > diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > > index 4c8acdfdc5a7..6161d77c5bfb 100644
> > > --- a/arch/x86/kernel/machine_kexec_64.c
> > > +++ b/arch/x86/kernel/machine_kexec_64.c
> > > @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > > VMCOREINFO_SYMBOL(init_top_pgt);
> > > vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > > pgtable_l5_enabled());
> > > +#ifdef CONFIG_RANDOMIZE_BASE
> >
> > Finally, embracing it into CONFIG_RANDOMIZE_BASE ifdefery seems not
> > right. The latest kernel is using page_offset_base to do the dynamic
> > memory layout between level4 and level5 changing. This may not work in
> > 5-level system with CONFIG_RANDOMIZE_BASE=n.
>
> I think you missed the v2 change log and the build-bot error on v1
> (see here: <https://patchwork.kernel.org/patch/10657933/#22291691>).
> With .config files which have CONFIG_RANDOMIZE_BASE=n, we get the
> following compilation error
> without the #ifdef jugglery:

Which line of your v2 change log did I miss?

Ask again, what if I set
CONFIG_DYNAMIC_MEMORY_LAYOUT=y && CONFIG_RANDOMIZE_MEMORY=n
in a kernel with LEVEL5 enabled to make kernel be able to switch
into l4 and l5 dynamically?

Is there anything wrong if I add it like this:

#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
VMCOREINFO_SYMBOL(node_data);
#endif

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
...
#endif

config DYNAMIC_MEMORY_LAYOUT
bool
---help---
This option makes base addresses of vmalloc and vmemmap as well as
__PAGE_OFFSET movable during boot.

config RANDOMIZE_MEMORY
bool "Randomize the kernel memory sections"
depends on X86_64
depends on RANDOMIZE_BASE
select DYNAMIC_MEMORY_LAYOUT
default RANDOMIZE_BASE
---help---
Randomizes the base virtual address of kernel memory sections
(physical memory mapping, vmalloc & vmemmap). This security feature
makes exploits relying on predictable memory locations less reliable.

The order of allocations remains unchanged. Entropy is generated in
the same way as RANDOMIZE_BASE. Current implementation in the optimal
configuration have in average 30,000 different possible virtual
addresses for each memory section.

If unsure, say Y.

>
> arch/x86/kernel/machine_kexec_64.o: In function `arch_crash_save_vmcoreinfo':
> arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
> `page_offset_base'
> arch/x86/kernel/machine_kexec_64.c:359: undefined reference to
> `page_offset_base'
>
> Anyways, with Kazu's and Boris's comments on the v2, I understand that
> adding 'page_offset_base' variable to vmcoreinfo is useful for x86
> kernel.

Please notice that the makedumpfile '--mem-usage' bug has been fixed by
below commit. Adding 'page_offset_base' into vmcoreinfo is 50:50 to
me. Adding it to vmcoreinfo to expose, or deducing it from kcore/vmcore
program segment, both is fine to me since both has pro and con. Adding
it you have to give a good log, based on good understanding, but
not those confusing self invented term and the useless test results,
and wrong code.

commit 1ea989bf6a93db377689c16b61e9c2c6a9bafa17 (HEAD -> devel, origin/devel)
Author: Kazuhito Hagio <[email protected]>
Date: Wed Nov 21 13:57:31 2018 -0500

[PATCH] x86_64: fix failure of getting kcore vmcoreinfo on kernel 4.19

> I will now work on the v3 to take into account review comments and
> also work with Lianbo to get the same added to the overall vmcoreinfo
> documentation he is preparing for x86.

So please do not post new version directly without replying to confirm if
we have agreement on concerns, then saying that is requested by me or
other people.

Thanks
Baoquan

2018-11-27 07:23:47

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/27/18 at 02:48pm, Baoquan He wrote:

> #ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
> VMCOREINFO_SYMBOL(node_data);
^^^

> #endif

Wrong copy
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
VMCOREINFO_NUMBER(page_offset_base);

#endif


2018-11-27 22:18:51

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On Wed, Nov 21, 2018 at 3:39 AM, Borislav Petkov <[email protected]> wrote:
> + Kees.
>
> On Fri, Nov 16, 2018 at 03:17:49AM +0530, Bhupesh Sharma wrote:
>> x86_64 kernel uses 'page_offset_base' variable to point to the
>> start of direct mapping of all physical memory. This variable
>> is also updated for KASLR boot cases, so this can be exported
>> via vmcoreinfo as a standard ABI between kernel and user-space,
>> to allow user-space utilities to use the same for calculating
>> the start of direct mapping of all physical memory.

Why is KERNELOFFSET= not sufficient?

See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")

+ vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
+ (unsigned long)&_text - __START_KERNEL);

-Kees

>> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> index 4c8acdfdc5a7..6161d77c5bfb 100644
>> --- a/arch/x86/kernel/machine_kexec_64.c
>> +++ b/arch/x86/kernel/machine_kexec_64.c
>> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>> VMCOREINFO_SYMBOL(init_top_pgt);
>> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>> pgtable_l5_enabled());
>> +#ifdef CONFIG_RANDOMIZE_BASE
>> + VMCOREINFO_NUMBER(page_offset_base);
>> +#endif
>>
>> #ifdef CONFIG_NUMA
>> VMCOREINFO_SYMBOL(node_data);

--
Kees Cook

2018-11-27 23:31:30

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/27/18 at 02:16pm, Kees Cook wrote:
> Why is KERNELOFFSET= not sufficient?
>
> See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")
>
> + vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
> + (unsigned long)&_text - __START_KERNEL);

KERNELOFFSET is virtual address delta after kernel text KASLR, namely
the offset from the original default kernel text virtual address,
0xffffffff88000000.

While after memory region KASLR in kernel_randomize_memory(), the
starting address of the direct mapping of physical memory, PAGE_OFFSET,
is changed too. We need get it to analyze memory in makedumpfile/crash.
Currently we deduce it from elf program segment of kcore:
Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
......

LOAD 0x00000a62c0004000 0xffff8a62c0001000 0x0000000000001000
0x000000000009c000 0x000000000009c000 RWE 1000

page_offset = 0xffff8a62c0001000 - 0x0000000000001000;
Since we put the direct mapping segments at the bottom part of kcore, we
can always get page_offset right.

Thanks
Baoquan

>
> -Kees
>
> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >> VMCOREINFO_SYMBOL(init_top_pgt);
> >> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >> pgtable_l5_enabled());
> >> +#ifdef CONFIG_RANDOMIZE_BASE
> >> + VMCOREINFO_NUMBER(page_offset_base);
> >> +#endif
> >>
> >> #ifdef CONFIG_NUMA
> >> VMCOREINFO_SYMBOL(node_data);
>
> --
> Kees Cook
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec

2018-11-28 00:41:06

by Kees Cook

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On Tue, Nov 27, 2018 at 3:29 PM, Baoquan He <[email protected]> wrote:
> On 11/27/18 at 02:16pm, Kees Cook wrote:
>> Why is KERNELOFFSET= not sufficient?
>>
>> See commit b6085a865762 ("x86, kaslr: export offset in VMCOREINFO ELF notes")
>>
>> + vmcoreinfo_append_str("KERNELOFFSET=%lx\n",
>> + (unsigned long)&_text - __START_KERNEL);
>
> KERNELOFFSET is virtual address delta after kernel text KASLR, namely
> the offset from the original default kernel text virtual address,
> 0xffffffff88000000.
>
> While after memory region KASLR in kernel_randomize_memory(), the
> starting address of the direct mapping of physical memory, PAGE_OFFSET,
> is changed too. We need get it to analyze memory in makedumpfile/crash.
> Currently we deduce it from elf program segment of kcore:
> Program Headers:
> Type Offset VirtAddr PhysAddr
> FileSiz MemSiz Flags Align
> ......
>
> LOAD 0x00000a62c0004000 0xffff8a62c0001000 0x0000000000001000
> 0x000000000009c000 0x000000000009c000 RWE 1000
>
> page_offset = 0xffff8a62c0001000 - 0x0000000000001000;
> Since we put the direct mapping segments at the bottom part of kcore, we
> can always get page_offset right.
>
> Thanks
> Baoquan
>
>>
>> -Kees
>>
>> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
>> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
>> >> --- a/arch/x86/kernel/machine_kexec_64.c
>> >> +++ b/arch/x86/kernel/machine_kexec_64.c
>> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
>> >> VMCOREINFO_SYMBOL(init_top_pgt);
>> >> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
>> >> pgtable_l5_enabled());
>> >> +#ifdef CONFIG_RANDOMIZE_BASE

Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

-Kees

>> >> + VMCOREINFO_NUMBER(page_offset_base);
>> >> +#endif
>> >>
>> >> #ifdef CONFIG_NUMA
>> >> VMCOREINFO_SYMBOL(node_data);
>>
>> --
>> Kees Cook
>>
>> _______________________________________________
>> kexec mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/kexec



--
Kees Cook

2018-11-28 01:41:15

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/27/18 at 04:39pm, Kees Cook wrote:
> >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >> >> VMCOREINFO_SYMBOL(init_top_pgt);
> >> >> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >> >> pgtable_l5_enabled());
> >> >> +#ifdef CONFIG_RANDOMIZE_BASE
>
> Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

Currently, Kirill added level5 support to x86_64, and kernel with level5
enabled can be boot switched into level4 or level5 with kernel option
"no5lvl". So page_offset_base will be changed accordingly. You can see
code pasted at bottom, DYNAMIC_MEMORY_LAYOUT is added for this change,
not only KASLR, but 5LEVEL.

If only put it inside ifdef CONFIG_RANDOMIZE_MEMORY, change between l4
and l5 will force us to decide page_offset again if
CONFIG_RANDOMIZE_MEMORY or CONFIG_RANDOMIZE_BASE is not set. Besides,
below commit change the starting address of the direct mapping again, if
only judge CONFIG_RANDOMIZE_MEMORY, in case KASLR is disabled, code in
userspace may need many if-else checking as below. So if we pass, better
pass it for all.

get_page_offset()
{
if(get_page_offset_from_vmcoreinfo()) {
xxx //in KASLR case
return;
} else if (check_5level_paging()) {
if (version < 4.21) {
page_offset = 0xff10000000000000;
} else //version > = 4.21
{
page_offset = 0xff11000000000000;
}

} else { //4level
if (version < 2.6.27) {
page_offset = 0xffff810000000000;
} else if (version < 4.21) {
page_offset = 0xffff880000000000;
} else //version > = 4.21
{
page_offset = 0xffff888000000000,;
}
}
}

Sign, seeing above code, I still think that deducing it from
kcore/vmcore elf header is better. Can't we be better to ourselves?

commit d52888aa2753e3063a9d3a0c9f72f94aa9809c15
Author: Kirill A. Shutemov <[email protected]>
Date: Fri Oct 26 15:28:54 2018 +0300

x86/mm: Move LDT remap out of KASLR region on 5-level paging

[bhe@ linux]$ git describe --contains d52888aa2753e3063a9d3a0c9f72f94aa9809c15
v4.20-rc2~5^2~4


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#ifdef CONFIG_DYNAMIC_MEMORY_LAYOUT
unsigned long page_offset_base __ro_after_init = __PAGE_OFFSET_BASE_L4;
EXPORT_SYMBOL(page_offset_base);
unsigned long vmalloc_base __ro_after_init = __VMALLOC_BASE_L4;
EXPORT_SYMBOL(vmalloc_base);
unsigned long vmemmap_base __ro_after_init = __VMEMMAP_BASE_L4;
EXPORT_SYMBOL(vmemmap_base);
#endif


config DYNAMIC_MEMORY_LAYOUT
bool
---help---
This option makes base addresses of vmalloc and vmemmap as well as
__PAGE_OFFSET movable during boot.

config RANDOMIZE_MEMORY
bool "Randomize the kernel memory sections"
depends on X86_64
depends on RANDOMIZE_BASE
select DYNAMIC_MEMORY_LAYOUT
default RANDOMIZE_BASE

config X86_5LEVEL
bool "Enable 5-level page tables support"
select DYNAMIC_MEMORY_LAYOUT
select SPARSEMEM_VMEMMAP
depends on X86_64

2018-11-28 02:00:06

by Baoquan He

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On 11/27/18 at 04:39pm, Kees Cook wrote:
> >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> >> >> VMCOREINFO_SYMBOL(init_top_pgt);
> >> >> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> >> >> pgtable_l5_enabled());
> >> >> +#ifdef CONFIG_RANDOMIZE_BASE
>
> Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?

And yes, if we only care about KASLR, it should be
CONFIG_RANDOMIZE_MEMORY, but not CONFIG_RANDOMIZE_BASE.


2018-11-28 04:28:05

by Bhupesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

On Wed, Nov 28, 2018 at 7:27 AM Baoquan He <[email protected]> wrote:
>
> On 11/27/18 at 04:39pm, Kees Cook wrote:
> > >> >> diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
> > >> >> index 4c8acdfdc5a7..6161d77c5bfb 100644
> > >> >> --- a/arch/x86/kernel/machine_kexec_64.c
> > >> >> +++ b/arch/x86/kernel/machine_kexec_64.c
> > >> >> @@ -356,6 +356,9 @@ void arch_crash_save_vmcoreinfo(void)
> > >> >> VMCOREINFO_SYMBOL(init_top_pgt);
> > >> >> vmcoreinfo_append_str("NUMBER(pgtable_l5_enabled)=%d\n",
> > >> >> pgtable_l5_enabled());
> > >> >> +#ifdef CONFIG_RANDOMIZE_BASE
> >
> > Okay, gotcha. In that case, shouldn't this be CONFIG_RANDOMIZE_MEMORY?
>
> And yes, if we only care about KASLR, it should be
> CONFIG_RANDOMIZE_MEMORY, but not CONFIG_RANDOMIZE_BASE.
>

Have you tried building the changes with 'make allmodconfig' with all
the different CONFIG options you have suggested so far?

2018-11-28 11:39:24

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH v2] x86_64, vmcoreinfo: Append 'page_offset_base' to vmcoreinfo

> > now possible to use the vmcoreinfo present inside kcore as the standard
> > ABI which can be used by the user-space utilities for reading
> > the machine specific information (and hence for debugging a
> > live kernel).
> >
> > User-space utilities like makedumpfile, kexec-tools and crash
> > are either already using this ABI or are discussing patches
> > which look to add the same feature. This helps in simplifying the
> > overall code and also in reducing code-rewrite across the
> > user-space utilities for getting values of these kernel
> > symbols/variables.
>
> > Accordingly this patch allows appending 'page_offset_base' for
> > x86_64 platforms to vmcoreinfo, so that user-space tools can use the
> > same as a standard interface to determine the start of direct mapping
> > of all physical memory.
> >

[snip]

> All above are only nitpicks though.
>
> My opinion is this: people are exporting all kinds of kernel-internal
> stuff in vmcoreinfo and frankly, I'm not crazy about this idea.
>
> And AFAICT, this thing basically bypasses KASLR completely but you need
> root for it so it probably doesn't really matter.
>
> Now, on another thread we agreed more or less that what gets exported in
> vmcoreinfo is so tightly coupled to the running kernel so that it is not
> even considered an ABI. I guess that is debatable but whatever.

We do not regard this strictly as an ABI, but we also carefully review
every new extra exported thing and only export when we have to do so eg.
something breaks.

Seems this change only make userspace tools handling on the kaslr case
easier but since everything works without this patch I would prefer not to
do it.

>
> So my only request right now would be to have all those things being
> exported, documented somewhere and I believe Lianbo is working on that.
>
> But I'm sure others will have more to say about it.
>
> --
> Regards/Gruss,
> Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave