2020-02-21 11:44:33

by Will Deacon

[permalink] [raw]
Subject: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

Hi folks,

Despite having just a single modular in-tree user that I could spot,
kallsyms_lookup_name() is exported to modules and provides a mechanism
for out-of-tree modules to access and invoke arbitrary, non-exported
kernel symbols when kallsyms is enabled.

This patch series fixes up that one user and unexports the symbol along
with kallsyms_on_each_symbol(), since that could also be abused in a
similar manner.

Cheers,

Will

Cc: K.Prasad <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Quentin Perret <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Masami Hiramatsu <[email protected]>

--->8

Will Deacon (3):
samples/hw_breakpoint: Drop HW_BREAKPOINT_R when reporting writes
samples/hw_breakpoint: Drop use of kallsyms_lookup_name()
kallsyms: Unexport kallsyms_lookup_name() and
kallsyms_on_each_symbol()

kernel/kallsyms.c | 2 --
samples/hw_breakpoint/data_breakpoint.c | 11 ++++++++---
2 files changed, 8 insertions(+), 5 deletions(-)

--
2.25.0.265.gbab2e86ba0-goog


2020-02-21 11:44:38

by Will Deacon

[permalink] [raw]
Subject: [PATCH 1/3] samples/hw_breakpoint: Drop HW_BREAKPOINT_R when reporting writes

Given the name of a kernel symbol, the 'data_breakpoint' test claims to
"report any write operations on the kernel symbol". However, it creates
the breakpoint using both HW_BREAKPOINT_W and HW_BREAKPOINT_R, which
menas it also fires for read access.

Drop HW_BREAKPOINT_R from the breakpoint attributes.

Cc: K.Prasad <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
samples/hw_breakpoint/data_breakpoint.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c
index c58504774118..469b36f93696 100644
--- a/samples/hw_breakpoint/data_breakpoint.c
+++ b/samples/hw_breakpoint/data_breakpoint.c
@@ -45,7 +45,7 @@ static int __init hw_break_module_init(void)
hw_breakpoint_init(&attr);
attr.bp_addr = kallsyms_lookup_name(ksym_name);
attr.bp_len = HW_BREAKPOINT_LEN_4;
- attr.bp_type = HW_BREAKPOINT_W | HW_BREAKPOINT_R;
+ attr.bp_type = HW_BREAKPOINT_W;

sample_hbp = register_wide_hw_breakpoint(&attr, sample_hbp_handler, NULL);
if (IS_ERR((void __force *)sample_hbp)) {
--
2.25.0.265.gbab2e86ba0-goog

2020-02-21 11:45:06

by Will Deacon

[permalink] [raw]
Subject: [PATCH 3/3] kallsyms: Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
modules despite having no in-tree users and being wide open to abuse by
out-of-tree modules that can use them as a method to invoke arbitrary
non-exported kernel functions.

Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().

Cc: Alexei Starovoitov <[email protected]>
Cc: Masami Hiramatsu <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
kernel/kallsyms.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
index a9b3f660dee7..16c8c605f4b0 100644
--- a/kernel/kallsyms.c
+++ b/kernel/kallsyms.c
@@ -175,7 +175,6 @@ unsigned long kallsyms_lookup_name(const char *name)
}
return module_kallsyms_lookup_name(name);
}
-EXPORT_SYMBOL_GPL(kallsyms_lookup_name);

int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
unsigned long),
@@ -194,7 +193,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
}
return module_kallsyms_on_each_symbol(fn, data);
}
-EXPORT_SYMBOL_GPL(kallsyms_on_each_symbol);

static unsigned long get_symbol_pos(unsigned long addr,
unsigned long *symbolsize,
--
2.25.0.265.gbab2e86ba0-goog

2020-02-21 11:45:41

by Will Deacon

[permalink] [raw]
Subject: [PATCH 2/3] samples/hw_breakpoint: Drop use of kallsyms_lookup_name()

The 'data_breakpoint' test code is the only modular user of
kallsyms_lookup_name(), which was exported as part of fixing the test in
f60d24d2ad04 ("hw-breakpoints: Fix broken hw-breakpoint sample module").

In preparation for un-exporting this symbol, switch the test over to
using __symbol_get(), which can be used to place breakpoints on exported
symbols.

Cc: K.Prasad <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Quentin Perret <[email protected]>
Signed-off-by: Will Deacon <[email protected]>
---
samples/hw_breakpoint/data_breakpoint.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/samples/hw_breakpoint/data_breakpoint.c b/samples/hw_breakpoint/data_breakpoint.c
index 469b36f93696..418c46fe5ffc 100644
--- a/samples/hw_breakpoint/data_breakpoint.c
+++ b/samples/hw_breakpoint/data_breakpoint.c
@@ -23,7 +23,7 @@

struct perf_event * __percpu *sample_hbp;

-static char ksym_name[KSYM_NAME_LEN] = "pid_max";
+static char ksym_name[KSYM_NAME_LEN] = "jiffies";
module_param_string(ksym, ksym_name, KSYM_NAME_LEN, S_IRUGO);
MODULE_PARM_DESC(ksym, "Kernel symbol to monitor; this module will report any"
" write operations on the kernel symbol");
@@ -41,9 +41,13 @@ static int __init hw_break_module_init(void)
{
int ret;
struct perf_event_attr attr;
+ void *addr = __symbol_get(ksym_name);
+
+ if (!addr)
+ return -ENXIO;

hw_breakpoint_init(&attr);
- attr.bp_addr = kallsyms_lookup_name(ksym_name);
+ attr.bp_addr = (unsigned long)addr;
attr.bp_len = HW_BREAKPOINT_LEN_4;
attr.bp_type = HW_BREAKPOINT_W;

@@ -66,6 +70,7 @@ static int __init hw_break_module_init(void)
static void __exit hw_break_module_exit(void)
{
unregister_wide_hw_breakpoint(sample_hbp);
+ symbol_put(ksym_name);
printk(KERN_INFO "HW Breakpoint for %s write uninstalled\n", ksym_name);
}

--
2.25.0.265.gbab2e86ba0-goog

2020-02-21 11:53:45

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 3/3] kallsyms: Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Fri, Feb 21, 2020 at 11:44:04AM +0000, Will Deacon wrote:
> kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
> modules despite having no in-tree users and being wide open to abuse by
> out-of-tree modules that can use them as a method to invoke arbitrary
> non-exported kernel functions.
>
> Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().
>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>

Reviewed-by: Greg Kroah-Hartman <[email protected]>

2020-02-21 14:12:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/3] samples/hw_breakpoint: Drop HW_BREAKPOINT_R when reporting writes

On Fri, Feb 21, 2020 at 11:44:02AM +0000, Will Deacon wrote:
> Given the name of a kernel symbol, the 'data_breakpoint' test claims to
> "report any write operations on the kernel symbol". However, it creates
> the breakpoint using both HW_BREAKPOINT_W and HW_BREAKPOINT_R, which
> menas it also fires for read access.
>
> Drop HW_BREAKPOINT_R from the breakpoint attributes.

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-02-21 14:14:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/3] samples/hw_breakpoint: Drop use of kallsyms_lookup_name()

On Fri, Feb 21, 2020 at 11:44:03AM +0000, Will Deacon wrote:
> -static char ksym_name[KSYM_NAME_LEN] = "pid_max";
> +static char ksym_name[KSYM_NAME_LEN] = "jiffies";

Is jiffies actually an exported symbol on all configfs? I thought
there was some weird aliasing going on with jiffies64.

Except for the symbol choice this looks fine, though:

Reviewed-by: Christoph Hellwig <[email protected]>

2020-02-21 14:15:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 3/3] kallsyms: Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Fri, Feb 21, 2020 at 11:44:04AM +0000, Will Deacon wrote:
> kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
> modules despite having no in-tree users and being wide open to abuse by
> out-of-tree modules that can use them as a method to invoke arbitrary
> non-exported kernel functions.
>
> Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().

Looks good,

Reviewed-by: Christoph Hellwig <[email protected]>

2020-02-21 14:21:28

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 2/3] samples/hw_breakpoint: Drop use of kallsyms_lookup_name()

On Fri, Feb 21, 2020 at 03:13:54PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 21, 2020 at 11:44:03AM +0000, Will Deacon wrote:
> > -static char ksym_name[KSYM_NAME_LEN] = "pid_max";
> > +static char ksym_name[KSYM_NAME_LEN] = "jiffies";
>
> Is jiffies actually an exported symbol on all configfs? I thought
> there was some weird aliasing going on with jiffies64.

There is some weird aliasing with jiffies_64, but kernel/time/jiffies.c
has an unconditional:

EXPORT_SYMBOL(jiffies);

so I think we're ok.

> Except for the symbol choice this looks fine, though:
>
> Reviewed-by: Christoph Hellwig <[email protected]>

Brill, cheers.

Will

2020-02-21 14:28:30

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

Hi Will,

On Fri, 21 Feb 2020 11:44:01 +0000
Will Deacon <[email protected]> wrote:

> Hi folks,
>
> Despite having just a single modular in-tree user that I could spot,
> kallsyms_lookup_name() is exported to modules and provides a mechanism
> for out-of-tree modules to access and invoke arbitrary, non-exported
> kernel symbols when kallsyms is enabled.
>
> This patch series fixes up that one user and unexports the symbol along
> with kallsyms_on_each_symbol(), since that could also be abused in a
> similar manner.

What kind of issue would you like to fix with this?
There are many ways to find (estimate) symbol address, especially, if
the programmer already has the symbol map, it is *very* easy to find
the target symbol address even from one exported symbol (the distance
of 2 symbols doesn't change.) If not, they can use kprobes to find
their required symbol address. If they have a time, they can use
snprintf("%pF") to search symbol.

So, for me, this series just make it hard for casual developers (but
maybe they will find the answer on any technical Q&A site soon).

Hmm, are there other good way to detect such bad-manner out-of-tree
module and reject them? What about decoding them and monitor their
all call instructions?

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-02-21 14:49:25

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

Hi Masami,

On Fri, Feb 21, 2020 at 11:27:46PM +0900, Masami Hiramatsu wrote:
> On Fri, 21 Feb 2020 11:44:01 +0000
> Will Deacon <[email protected]> wrote:
> > Despite having just a single modular in-tree user that I could spot,
> > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > for out-of-tree modules to access and invoke arbitrary, non-exported
> > kernel symbols when kallsyms is enabled.
> >
> > This patch series fixes up that one user and unexports the symbol along
> > with kallsyms_on_each_symbol(), since that could also be abused in a
> > similar manner.
>
> What kind of issue would you like to fix with this?

I would like to avoid out-of-tree modules being easily able to call
functions that are not exported. kallsyms_lookup_name() makes this
trivial to the point that there is very little incentive to rework these
modules to either use upstream interfaces correctly or propose functionality
which may be otherwise missing upstream. Both of these latter solutions
would be pre-requisites to upstreaming these modules, and the current state
of things actively discourages that approach.

The background here is that we are aiming for Android devices to be able
to use a generic binary kernel image closely following upstream, with
any vendor extensions coming in as kernel modules. In this case, we
(Google) end up maintaining the binary module ABI within the scope of a
single LTS kernel. Monitoring and managing the ABI surface is not feasible
if it effectively includes all data and functions via kallsyms_lookup_name().
Of course, we could just carry this patch in the Android kernel tree,
but we're aiming to carry as little as possible (ideally nothing) and
I think it's a sensible change in its own right. I'm surprised you object
to it, in all honesty.

Now, you could turn around and say "that's not upstream's problem", but
it still seems highly undesirable to me to have an upstream bypass for
exported symbols that isn't even used by upstream modules. It's ripe for
abuse and encourages people to work outside of the upstream tree. The
usual rule is that we don't export symbols without a user in the tree
and that seems especially relevant in this case.

> There are many ways to find (estimate) symbol address, especially, if
> the programmer already has the symbol map, it is *very* easy to find
> the target symbol address even from one exported symbol (the distance
> of 2 symbols doesn't change.) If not, they can use kprobes to find
> their required symbol address. If they have a time, they can use
> snprintf("%pF") to search symbol.

I would say that both of these are inconvenient enough that the developer
would think twice before considering to use them in production.

> So, for me, this series just make it hard for casual developers (but
> maybe they will find the answer on any technical Q&A site soon).

Which casual developers? I don't understand who you're referring to here.
Do you have a specific example in mind?

> Hmm, are there other good way to detect such bad-manner out-of-tree
> module and reject them? What about decoding them and monitor their
> all call instructions?

That sounds like using a sledge hammer to crack a nut to me.

Will

2020-02-21 15:13:16

by Alexei Starovoitov

[permalink] [raw]
Subject: Re: [PATCH 3/3] kallsyms: Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Fri, Feb 21, 2020 at 11:44:04AM +0000, Will Deacon wrote:
> kallsyms_lookup_name() and kallsyms_on_each_symbol() are exported to
> modules despite having no in-tree users and being wide open to abuse by
> out-of-tree modules that can use them as a method to invoke arbitrary
> non-exported kernel functions.
>
> Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol().
>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Signed-off-by: Will Deacon <[email protected]>
> ---
> kernel/kallsyms.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/kernel/kallsyms.c b/kernel/kallsyms.c
> index a9b3f660dee7..16c8c605f4b0 100644
> --- a/kernel/kallsyms.c
> +++ b/kernel/kallsyms.c
> @@ -175,7 +175,6 @@ unsigned long kallsyms_lookup_name(const char *name)
> }
> return module_kallsyms_lookup_name(name);
> }
> -EXPORT_SYMBOL_GPL(kallsyms_lookup_name);
>
> int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
> unsigned long),
> @@ -194,7 +193,6 @@ int kallsyms_on_each_symbol(int (*fn)(void *, const char *, struct module *,
> }
> return module_kallsyms_on_each_symbol(fn, data);
> }
> -EXPORT_SYMBOL_GPL(kallsyms_on_each_symbol);

Looking at commit 75a66614db21 ("Ksplice: Add functions for walking kallsyms symbols")
this change will break ksplice.
But I think it's the right call. live-patching needs to find a way to be better
integrated with the core kernel.

Acked-by: Alexei Starovoitov <[email protected]>

2020-02-21 15:42:11

by David Laight

[permalink] [raw]
Subject: RE: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

From: Will Deacon
> Sent: 21 February 2020 11:44
> Hi folks,
>
> Despite having just a single modular in-tree user that I could spot,
> kallsyms_lookup_name() is exported to modules and provides a mechanism
> for out-of-tree modules to access and invoke arbitrary, non-exported
> kernel symbols when kallsyms is enabled.

Now where did I put that kernel code that opens /proc/kallsyms and
then reads it to find the addresses of symbols ...

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-02-21 16:27:12

by Quentin Perret

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Friday 21 Feb 2020 at 15:41:35 (+0000), David Laight wrote:
> From: Will Deacon
> > Sent: 21 February 2020 11:44
> > Hi folks,
> >
> > Despite having just a single modular in-tree user that I could spot,
> > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > for out-of-tree modules to access and invoke arbitrary, non-exported
> > kernel symbols when kallsyms is enabled.
>
> Now where did I put that kernel code that opens /proc/kallsyms and
> then reads it to find the addresses of symbols ...

Sure, but the point of this patch IIUC is not to make it totally
impossible to find the address of symbols in the kernel. It is just to
not make it utterly trivial for out-of-tree modules to bypass entirely
the upstream infrastructure for exported symbols.

All in all, I really don't see what is the benefit of keeping
kallsyms_lookup_name exported upstream. Especially since we don't have a
good use-case for it.

So, for the whole series:

Reviewed-by: Quentin Perret <[email protected]>

Thanks,
Quentin

2020-02-21 23:45:32

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Fri, 21 Feb 2020 14:48:54 +0000
Will Deacon <[email protected]> wrote:

> Hi Masami,
>
> On Fri, Feb 21, 2020 at 11:27:46PM +0900, Masami Hiramatsu wrote:
> > On Fri, 21 Feb 2020 11:44:01 +0000
> > Will Deacon <[email protected]> wrote:
> > > Despite having just a single modular in-tree user that I could spot,
> > > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > > for out-of-tree modules to access and invoke arbitrary, non-exported
> > > kernel symbols when kallsyms is enabled.
> > >
> > > This patch series fixes up that one user and unexports the symbol along
> > > with kallsyms_on_each_symbol(), since that could also be abused in a
> > > similar manner.
> >
> > What kind of issue would you like to fix with this?
>
> I would like to avoid out-of-tree modules being easily able to call
> functions that are not exported. kallsyms_lookup_name() makes this
> trivial to the point that there is very little incentive to rework these
> modules to either use upstream interfaces correctly or propose functionality
> which may be otherwise missing upstream. Both of these latter solutions
> would be pre-requisites to upstreaming these modules, and the current state
> of things actively discourages that approach.
>
> The background here is that we are aiming for Android devices to be able
> to use a generic binary kernel image closely following upstream, with
> any vendor extensions coming in as kernel modules. In this case, we
> (Google) end up maintaining the binary module ABI within the scope of a
> single LTS kernel. Monitoring and managing the ABI surface is not feasible
> if it effectively includes all data and functions via kallsyms_lookup_name().
> Of course, we could just carry this patch in the Android kernel tree,
> but we're aiming to carry as little as possible (ideally nothing) and
> I think it's a sensible change in its own right. I'm surprised you object
> to it, in all honesty.
>
> Now, you could turn around and say "that's not upstream's problem", but
> it still seems highly undesirable to me to have an upstream bypass for
> exported symbols that isn't even used by upstream modules. It's ripe for
> abuse and encourages people to work outside of the upstream tree. The
> usual rule is that we don't export symbols without a user in the tree
> and that seems especially relevant in this case.

So this is to officially states our policy that if out-of-tree driver
developers need some symbol exposed, they should work with upstream to
find better solution. Not for fixing some kind of security hole.

> > There are many ways to find (estimate) symbol address, especially, if
> > the programmer already has the symbol map, it is *very* easy to find
> > the target symbol address even from one exported symbol (the distance
> > of 2 symbols doesn't change.) If not, they can use kprobes to find
> > their required symbol address. If they have a time, they can use
> > snprintf("%pF") to search symbol.
>
> I would say that both of these are inconvenient enough that the developer
> would think twice before considering to use them in production.

Fair enough.

>
> > So, for me, this series just make it hard for casual developers (but
> > maybe they will find the answer on any technical Q&A site soon).
>
> Which casual developers? I don't understand who you're referring to here.
> Do you have a specific example in mind?

No, I don't. :)

>
> > Hmm, are there other good way to detect such bad-manner out-of-tree
> > module and reject them? What about decoding them and monitor their
> > all call instructions?
>
> That sounds like using a sledge hammer to crack a nut to me.

Agreed. Just for discouraging abuse of unexposed symbols, I think this is
enough.

Reviewed-by: Masami Hiramatsu <[email protected]>

for thise series.

Thank you,

--
Masami Hiramatsu <[email protected]>

2020-02-25 10:07:04

by Miroslav Benes

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

CC live-patching ML, because this could affect many of its users...

On Fri, 21 Feb 2020, Will Deacon wrote:

> Hi folks,
>
> Despite having just a single modular in-tree user that I could spot,
> kallsyms_lookup_name() is exported to modules and provides a mechanism
> for out-of-tree modules to access and invoke arbitrary, non-exported
> kernel symbols when kallsyms is enabled.
>
> This patch series fixes up that one user and unexports the symbol along
> with kallsyms_on_each_symbol(), since that could also be abused in a
> similar manner.
>
> Cheers,
>
> Will
>
> Cc: K.Prasad <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Greg Kroah-Hartman <[email protected]>
> Cc: Frederic Weisbecker <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: Quentin Perret <[email protected]>
> Cc: Alexei Starovoitov <[email protected]>
> Cc: Masami Hiramatsu <[email protected]>
>
> --->8
>
> Will Deacon (3):
> samples/hw_breakpoint: Drop HW_BREAKPOINT_R when reporting writes
> samples/hw_breakpoint: Drop use of kallsyms_lookup_name()
> kallsyms: Unexport kallsyms_lookup_name() and
> kallsyms_on_each_symbol()
>
> kernel/kallsyms.c | 2 --
> samples/hw_breakpoint/data_breakpoint.c | 11 ++++++++---
> 2 files changed, 8 insertions(+), 5 deletions(-)
>
> --
> 2.25.0.265.gbab2e86ba0-goog
>

2020-02-25 12:15:21

by Petr Mladek

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Tue 2020-02-25 11:05:39, Miroslav Benes wrote:
> CC live-patching ML, because this could affect many of its users...
>
> On Fri, 21 Feb 2020, Will Deacon wrote:
>
> > Hi folks,
> >
> > Despite having just a single modular in-tree user that I could spot,
> > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > for out-of-tree modules to access and invoke arbitrary, non-exported
> > kernel symbols when kallsyms is enabled.

Just to explain how this affects livepatching users.

Livepatch is a module that inludes fixed copies of functions that
are buggy in the running kernel. These functions often
call functions or access variables that were defined static in
the original source code. There are two ways how this is currently
solved.

Some livepatch authors use kallsyms_lookup_name() to locate the
non-exported symbols in the running kernel and then use these
address in the fixed code.

Another possibility is to used special relocation sections,
see Documentation/livepatch/module-elf-format.rst

The problem with the special relocations sections is that the support
to generate them is not ready yet. The main piece would klp-convert
tool. Its development is pretty slow. The last version can be
found at
https://lkml.kernel.org/r/[email protected]

I am not sure if this use case is enough to keep the symbols exported.
Anyway, there are currently some out-of-tree users.

Best Regards,
Petr

2020-02-25 15:09:35

by Joe Lawrence

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On 2/25/20 7:11 AM, Petr Mladek wrote:
> On Tue 2020-02-25 11:05:39, Miroslav Benes wrote:
>> CC live-patching ML, because this could affect many of its users...
>>
>> On Fri, 21 Feb 2020, Will Deacon wrote:
>>
>>> Hi folks,
>>>
>>> Despite having just a single modular in-tree user that I could spot,
>>> kallsyms_lookup_name() is exported to modules and provides a mechanism
>>> for out-of-tree modules to access and invoke arbitrary, non-exported
>>> kernel symbols when kallsyms is enabled.
>
> Just to explain how this affects livepatching users.
>
> Livepatch is a module that inludes fixed copies of functions that
> are buggy in the running kernel. These functions often
> call functions or access variables that were defined static in
> the original source code. There are two ways how this is currently
> solved.
>
> Some livepatch authors use kallsyms_lookup_name() to locate the
> non-exported symbols in the running kernel and then use these
> address in the fixed code.
>

FWIW, kallsyms was historically used by the out-of-tree kpatch support
module to resolve external symbols as well as call set_memory_r{w,o}()
API. All of that support code has been merged upstream, so modern
kpatch modules* no longer leverage kallsyms by default.

* That said, there are still some users who still use the deprecated
support module with newer kernels, but that is not officially supported
by the project.

> Another possibility is to used special relocation sections,
> see Documentation/livepatch/module-elf-format.rst
>
> The problem with the special relocations sections is that the support
> to generate them is not ready yet. The main piece would klp-convert
> tool. Its development is pretty slow. The last version can be
> found at
> https://lkml.kernel.org/r/[email protected]
>
> I am not sure if this use case is enough to keep the symbols exported.
> Anyway, there are currently some out-of-tree users.
>

Another (temporary?) klp-relocation issue is that binutils has limited
support for them as currently implemented:

https://sourceware.org/ml/binutils/2020-02/msg00317.html

For example, try running strip or objcopy on a .ko that includes them
and you may find surprising results :(


As far as the klp-convert patchset goes, I forget whether or not we tied
its use case to source-based livepatch creation. If kallsyms goes
unexported, perhaps it finds more immediate users.

However since klp-convert provides nearly the same functionality as
kallsyms, i.e. both can be used to circumvent symbol export licensing --
one could make similar arguments against its inclusion.


If there is renewed (or greater, to be more accurate) interest in the
klp-convert patchset, we can dust it off and see what's left. AFAIK it
was blocked on arch-specific klp-relocations and whether per-object​
livepatch modules would remove that requirement.

-- Joe

2020-02-25 18:07:43

by Miroslav Benes

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Tue, 25 Feb 2020, Joe Lawrence wrote:

> On 2/25/20 7:11 AM, Petr Mladek wrote:
> > On Tue 2020-02-25 11:05:39, Miroslav Benes wrote:
> >> CC live-patching ML, because this could affect many of its users...
> >>
> >> On Fri, 21 Feb 2020, Will Deacon wrote:
> >>
> >>> Hi folks,
> >>>
> >>> Despite having just a single modular in-tree user that I could spot,
> >>> kallsyms_lookup_name() is exported to modules and provides a mechanism
> >>> for out-of-tree modules to access and invoke arbitrary, non-exported
> >>> kernel symbols when kallsyms is enabled.
> >
> > Just to explain how this affects livepatching users.
> >
> > Livepatch is a module that inludes fixed copies of functions that
> > are buggy in the running kernel. These functions often
> > call functions or access variables that were defined static in
> > the original source code. There are two ways how this is currently
> > solved.
> >
> > Some livepatch authors use kallsyms_lookup_name() to locate the
> > non-exported symbols in the running kernel and then use these
> > address in the fixed code.
> >
>
> FWIW, kallsyms was historically used by the out-of-tree kpatch support module
> to resolve external symbols as well as call set_memory_r{w,o}() API. All of
> that support code has been merged upstream, so modern kpatch modules* no
> longer leverage kallsyms by default.

Good. Quick grep through the sources gave me a couple of hits, so I was
not sure.

> * That said, there are still some users who still use the deprecated support
> module with newer kernels, but that is not officially supported by the
> project.
>
> > Another possibility is to used special relocation sections,
> > see Documentation/livepatch/module-elf-format.rst
> >
> > The problem with the special relocations sections is that the support
> > to generate them is not ready yet. The main piece would klp-convert
> > tool. Its development is pretty slow. The last version can be
> > found at
> > https://lkml.kernel.org/r/[email protected]
> >
> > I am not sure if this use case is enough to keep the symbols exported.
> > Anyway, there are currently some out-of-tree users.
> >
>
> Another (temporary?) klp-relocation issue is that binutils has limited support
> for them as currently implemented:
>
> https://sourceware.org/ml/binutils/2020-02/msg00317.html
>
> For example, try running strip or objcopy on a .ko that includes them and you
> may find surprising results :(
>
>
> As far as the klp-convert patchset goes, I forget whether or not we tied its
> use case to source-based livepatch creation. If kallsyms goes unexported,
> perhaps it finds more immediate users.
>
> However since klp-convert provides nearly the same functionality as kallsyms,
> i.e. both can be used to circumvent symbol export licensing -- one could make
> similar arguments against its inclusion.

In a way yes, but as Masami described elsewhere in the thread there are
more convenient ways to circumvent it even now. Not as convenient as
kallsyms, of course.

> If there is renewed (or greater, to be more accurate) interest in the
> klp-convert patchset, we can dust it off and see what's left. AFAIK it was
> blocked on arch-specific klp-relocations and whether per-object​ livepatch
> modules would remove that requirement.

Yes, I think it is on standby now. I thought about it while walking
through Petr's module split patch set and it seemed to me that klp-convert
could be made much much simpler on top of that. So we should start there.

Anyway, as far as Will's patch set is concerned, there is no real obstacle
on our side, is there?

Alexei mentioned ksplice from git history, but no one cares about ksplice
in upstream now, I would say.

Thanks
Miroslav

2020-02-26 14:19:13

by Joe Lawrence

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On 2/25/20 1:01 PM, Miroslav Benes wrote:
> Anyway, as far as Will's patch set is concerned, there is no real obstacle
> on our side, is there?
>

It places greater importance on getting the klp-relocation parts
correct, but assuming is is/will be then I think we're good.

-- Joe

2020-03-02 19:29:59

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On 21-Feb-2020 11:44:01 AM, Will Deacon wrote:
> Hi folks,
>
> Despite having just a single modular in-tree user that I could spot,
> kallsyms_lookup_name() is exported to modules and provides a mechanism
> for out-of-tree modules to access and invoke arbitrary, non-exported
> kernel symbols when kallsyms is enabled.
>
> This patch series fixes up that one user and unexports the symbol along
> with kallsyms_on_each_symbol(), since that could also be abused in a
> similar manner.

Hi,

I maintain a GPL kernel tracer (LTTng) since 2005 which happens to be
out-of-tree, even though we have made unsuccessful attempts to upstream
it in the past. It uses kallsyms_lookup_name() to fetch a few symbols. I
would be very glad to have them GPL-exported upstream rather than
relying on this work-around. Here is the list of symbols we would need
to GPL-export:

stack_trace_save
stack_trace_save_user
vmalloc_sync_all (CONFIG_X86)
get_pfnblock_flags_mask
disk_name
block_class
disk_type
global_wb_domain
task_prio

In order to provide address-to-symbol mapping at trace post-processing
(for which we have a prototype branch), we would also need the "_text"
symbol to be GPL-exported, as well as the list of currently loaded
modules (LIST_HEAD(modules) or a getter function).

The tricky part is justifying having those exported for a project
which is not upstream.

I welcome advice on this matter,

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-03-02 19:39:09

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Mon, Mar 02, 2020 at 02:28:11PM -0500, Mathieu Desnoyers wrote:
> On 21-Feb-2020 11:44:01 AM, Will Deacon wrote:
> > Hi folks,
> >
> > Despite having just a single modular in-tree user that I could spot,
> > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > for out-of-tree modules to access and invoke arbitrary, non-exported
> > kernel symbols when kallsyms is enabled.
> >
> > This patch series fixes up that one user and unexports the symbol along
> > with kallsyms_on_each_symbol(), since that could also be abused in a
> > similar manner.
>
> Hi,
>
> I maintain a GPL kernel tracer (LTTng) since 2005 which happens to be
> out-of-tree, even though we have made unsuccessful attempts to upstream
> it in the past. It uses kallsyms_lookup_name() to fetch a few symbols. I
> would be very glad to have them GPL-exported upstream rather than
> relying on this work-around. Here is the list of symbols we would need
> to GPL-export:
>
> stack_trace_save
> stack_trace_save_user
> vmalloc_sync_all (CONFIG_X86)
> get_pfnblock_flags_mask
> disk_name
> block_class
> disk_type

I hate to ask, but why does anyone need block_class? or disk_name or
disk_type? I need to put them behind a driver core namespace or
something soon...

thanks,

greg k-h

2020-03-02 19:40:26

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
> On Mon, Mar 02, 2020 at 02:28:11PM -0500, Mathieu Desnoyers wrote:
> > On 21-Feb-2020 11:44:01 AM, Will Deacon wrote:
> > > Hi folks,
> > >
> > > Despite having just a single modular in-tree user that I could spot,
> > > kallsyms_lookup_name() is exported to modules and provides a mechanism
> > > for out-of-tree modules to access and invoke arbitrary, non-exported
> > > kernel symbols when kallsyms is enabled.
> > >
> > > This patch series fixes up that one user and unexports the symbol along
> > > with kallsyms_on_each_symbol(), since that could also be abused in a
> > > similar manner.
> >
> > Hi,
> >
> > I maintain a GPL kernel tracer (LTTng) since 2005 which happens to be
> > out-of-tree, even though we have made unsuccessful attempts to upstream
> > it in the past. It uses kallsyms_lookup_name() to fetch a few symbols. I
> > would be very glad to have them GPL-exported upstream rather than
> > relying on this work-around. Here is the list of symbols we would need
> > to GPL-export:
> >
> > stack_trace_save
> > stack_trace_save_user
> > vmalloc_sync_all (CONFIG_X86)
> > get_pfnblock_flags_mask
> > disk_name
> > block_class
> > disk_type
>
> I hate to ask, but why does anyone need block_class? or disk_name or
> disk_type? I need to put them behind a driver core namespace or
> something soon...

Wait, disk_type is a static variable. And there's multiple ones of
them, how does that work?

thanks,

greg k-h

2020-03-02 20:17:46

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

----- On Mar 2, 2020, at 2:39 PM, Greg Kroah-Hartman [email protected] wrote:

> On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
>> On Mon, Mar 02, 2020 at 02:28:11PM -0500, Mathieu Desnoyers wrote:
>> > On 21-Feb-2020 11:44:01 AM, Will Deacon wrote:
>> > > Hi folks,
>> > >
>> > > Despite having just a single modular in-tree user that I could spot,
>> > > kallsyms_lookup_name() is exported to modules and provides a mechanism
>> > > for out-of-tree modules to access and invoke arbitrary, non-exported
>> > > kernel symbols when kallsyms is enabled.
>> > >
>> > > This patch series fixes up that one user and unexports the symbol along
>> > > with kallsyms_on_each_symbol(), since that could also be abused in a
>> > > similar manner.
>> >
>> > Hi,
>> >
>> > I maintain a GPL kernel tracer (LTTng) since 2005 which happens to be
>> > out-of-tree, even though we have made unsuccessful attempts to upstream
>> > it in the past. It uses kallsyms_lookup_name() to fetch a few symbols. I
>> > would be very glad to have them GPL-exported upstream rather than
>> > relying on this work-around. Here is the list of symbols we would need
>> > to GPL-export:
>> >
>> > stack_trace_save
>> > stack_trace_save_user
>> > vmalloc_sync_all (CONFIG_X86)
>> > get_pfnblock_flags_mask
>> > disk_name
>> > block_class
>> > disk_type
>>
>> I hate to ask, but why does anyone need block_class? or disk_name or
>> disk_type? I need to put them behind a driver core namespace or
>> something soon...
>

In LTTng, we have a "statedump" which dumps all the relevant state of
the kernel at trace start (or when the user asks for it) into the
kernel trace. It is used to collect information which helps translating
compact numeric data into human-readable information at post-processing.

For block devices, the statedump contains the mapping between the
device number and the disk name [1]. It uses the "block_class",
"disk_name", and "disk_type" symbols for this. Here is the
post-processing output:

[14:48:41.388934812] (+?.?????????) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 1048576, diskname = "ram0" }
[...]
[14:48:41.442548745] (+0.003574998) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 1048591, diskname = "ram15" }
[14:48:41.446064977] (+0.003516232) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289728, diskname = "vda" }
[14:48:41.449579781] (+0.003514804) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289729, diskname = "vda1" }
[14:48:41.453113808] (+0.003534027) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289744, diskname = "vdb" }
[14:48:41.456640876] (+0.003527068) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289745, diskname = "vdb1" }

This information is then used in our I/O analyses to show information
comprehensible to a user.

> Wait, disk_type is a static variable. And there's multiple ones of
> them, how does that work?

Yes, this is far from ideal. Here are the ones I observe in the kernel
sources:

block/genhd.c
40:static const struct device_type disk_type; <---- the one we use

lib/raid6/test/test.c
41:static char disk_type(int d) <---- this is a stand-alone user-space test program, not part of the kernel image.

crypto/async_tx/raid6test.c (depends on CONFIG_ASYNC_RAID6_TEST)
44:static char disk_type(int d, int disks) <---- the compiler optimizes away this function, so this symbol is not present in the kernel image.

I think a better approach to solve this would be to implement and expose an
iterator function in the core kernel which could invoke a callback. However,
the main issue remains: if the only user is out-of-tree, I cannot justify
adding an exported kernel helper for this.

Thanks,

Mathieu

[1] https://github.com/lttng/lttng-modules/blob/master/lttng-statedump-impl.c#L125

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-03-03 06:58:23

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

On Mon, Mar 02, 2020 at 03:17:07PM -0500, Mathieu Desnoyers wrote:
> ----- On Mar 2, 2020, at 2:39 PM, Greg Kroah-Hartman [email protected] wrote:
>
> > On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
> >> On Mon, Mar 02, 2020 at 02:28:11PM -0500, Mathieu Desnoyers wrote:
> >> > On 21-Feb-2020 11:44:01 AM, Will Deacon wrote:
> >> > > Hi folks,
> >> > >
> >> > > Despite having just a single modular in-tree user that I could spot,
> >> > > kallsyms_lookup_name() is exported to modules and provides a mechanism
> >> > > for out-of-tree modules to access and invoke arbitrary, non-exported
> >> > > kernel symbols when kallsyms is enabled.
> >> > >
> >> > > This patch series fixes up that one user and unexports the symbol along
> >> > > with kallsyms_on_each_symbol(), since that could also be abused in a
> >> > > similar manner.
> >> >
> >> > Hi,
> >> >
> >> > I maintain a GPL kernel tracer (LTTng) since 2005 which happens to be
> >> > out-of-tree, even though we have made unsuccessful attempts to upstream
> >> > it in the past. It uses kallsyms_lookup_name() to fetch a few symbols. I
> >> > would be very glad to have them GPL-exported upstream rather than
> >> > relying on this work-around. Here is the list of symbols we would need
> >> > to GPL-export:
> >> >
> >> > stack_trace_save
> >> > stack_trace_save_user
> >> > vmalloc_sync_all (CONFIG_X86)
> >> > get_pfnblock_flags_mask
> >> > disk_name
> >> > block_class
> >> > disk_type
> >>
> >> I hate to ask, but why does anyone need block_class? or disk_name or
> >> disk_type? I need to put them behind a driver core namespace or
> >> something soon...
> >
>
> In LTTng, we have a "statedump" which dumps all the relevant state of
> the kernel at trace start (or when the user asks for it) into the
> kernel trace. It is used to collect information which helps translating
> compact numeric data into human-readable information at post-processing.
>
> For block devices, the statedump contains the mapping between the
> device number and the disk name [1]. It uses the "block_class",
> "disk_name", and "disk_type" symbols for this. Here is the
> post-processing output:
>
> [14:48:41.388934812] (+?.?????????) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 1048576, diskname = "ram0" }
> [...]
> [14:48:41.442548745] (+0.003574998) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 1048591, diskname = "ram15" }
> [14:48:41.446064977] (+0.003516232) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289728, diskname = "vda" }
> [14:48:41.449579781] (+0.003514804) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289729, diskname = "vda1" }
> [14:48:41.453113808] (+0.003534027) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289744, diskname = "vdb" }
> [14:48:41.456640876] (+0.003527068) compudjdev lttng_statedump_block_device: { cpu_id = 0 }, { dev = 265289745, diskname = "vdb1" }
>
> This information is then used in our I/O analyses to show information
> comprehensible to a user.

But all of that is availble to you today in userspace, why dig through
random kernel symbols?

Look in /sys/dev/block/ or in /sys/block/ for all of that information.
Is there something that you can only find by the internal symbols that
is not present today in sysfs?

> > Wait, disk_type is a static variable. And there's multiple ones of
> > them, how does that work?
>
> Yes, this is far from ideal. Here are the ones I observe in the kernel
> sources:
>
> block/genhd.c
> 40:static const struct device_type disk_type; <---- the one we use
>
> lib/raid6/test/test.c
> 41:static char disk_type(int d) <---- this is a stand-alone user-space test program, not part of the kernel image.
>
> crypto/async_tx/raid6test.c (depends on CONFIG_ASYNC_RAID6_TEST)
> 44:static char disk_type(int d, int disks) <---- the compiler optimizes away this function, so this symbol is not present in the kernel image.
>
> I think a better approach to solve this would be to implement and expose an
> iterator function in the core kernel which could invoke a callback. However,
> the main issue remains: if the only user is out-of-tree, I cannot justify
> adding an exported kernel helper for this.

I think the best thing would be to use the userspace api :)

thanks,

greg k-h

2020-03-03 17:16:13

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH 0/3] Unexport kallsyms_lookup_name() and kallsyms_on_each_symbol()

----- On Mar 3, 2020, at 1:57 AM, Greg Kroah-Hartman [email protected] wrote:

> On Mon, Mar 02, 2020 at 03:17:07PM -0500, Mathieu Desnoyers wrote:
>> ----- On Mar 2, 2020, at 2:39 PM, Greg Kroah-Hartman [email protected]
>> wrote:
>>
>> > On Mon, Mar 02, 2020 at 08:36:58PM +0100, Greg Kroah-Hartman wrote:
[...]
>> >>
>> >> I hate to ask, but why does anyone need block_class? or disk_name or
>> >> disk_type? I need to put them behind a driver core namespace or
>> >> something soon...
>> >
>>
>> In LTTng, we have a "statedump" which dumps all the relevant state of
>> the kernel at trace start (or when the user asks for it) into the
>> kernel trace. It is used to collect information which helps translating
>> compact numeric data into human-readable information at post-processing.
>>
>> For block devices, the statedump contains the mapping between the
>> device number and the disk name [1]. It uses the "block_class",
>> "disk_name", and "disk_type" symbols for this. Here is the
>> post-processing output:
>>
>> [14:48:41.388934812] (+?.?????????) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048576, diskname = "ram0" }
>> [...]
>> [14:48:41.442548745] (+0.003574998) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 1048591, diskname = "ram15" }
>> [14:48:41.446064977] (+0.003516232) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289728, diskname = "vda" }
>> [14:48:41.449579781] (+0.003514804) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289729, diskname = "vda1" }
>> [14:48:41.453113808] (+0.003534027) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289744, diskname = "vdb" }
>> [14:48:41.456640876] (+0.003527068) compudjdev lttng_statedump_block_device: {
>> cpu_id = 0 }, { dev = 265289745, diskname = "vdb1" }
>>
>> This information is then used in our I/O analyses to show information
>> comprehensible to a user.
>
> But all of that is availble to you today in userspace, why dig through
> random kernel symbols?
>
> Look in /sys/dev/block/ or in /sys/block/ for all of that information.
> Is there something that you can only find by the internal symbols that
> is not present today in sysfs?

There is indeed additional information we are getting by iterating
directly on the data structures and emitting tracepoints from within the
kernel which is lost when we copy the data to user-space: the time-stamp
at which the data is observed.

Please note that the statedump approach is applied not only to block
devices, but also to namespaces, thread scheduling, process memory
mappings, file descriptor tables, interrupt handlers, network
interfaces, and cpu topology. Those are more or less long running states
which can change dynamically while a trace is being recorded. Our trace
post-processing tools model the overall kernel state over time by
reconciling tracepoints tracking all state changes (e.g.
insertions/removals) with the statedump information. The time-stamps
available for both state-change events and statedump events is what
allow us to do this modeling.

In the case of block/genhd.c, we care about the mapping between the
device number and the disk name, which is something which can be changed
dynamically, and is thus valid for a given time-range in the trace.

What we are trying to ensure is to gather all the relevant information
to allow trace analyses to re-create a precise model of the kernel
through time. In the case of genhd, that would be tuples of mapping
between device number and name, which are valid for given time-ranges.

These are recorded by the "lttng_statedump_block_device" event, which
has a timestamp generated by the LTTng tracer, along with the
(device_nr, disk_name) event payload. You will notice from what I stated
above that we are missing information about disks being
registered/unregistered later in the trace. Ideally, we would also like
to add tracepoints into register_disk() and del_gendisk() (or more
relevant functions nearby) to trace those state changes. We have those
state-change tracking tracepoints in other subsystems, but unfortunately
not in the block layer.

This information is then used at post-processing to augment the block
events (see include/trace/block.h) to convert the device numbers to
human-readable strings. It is also used to provide human-readable
rows, columns and graph labels when presenting block I/O state over
time, and aggregated summary, on a per disk basis.

If we would instead go through sysfs to export this block device
information, we end up copying the relevant information to user-space,
and only then write the data into a trace buffer. There ends up being a
delay between the point at which the state is observed within the kernel
and the sampling of the time-stamp describing when it occurs, which
introduces many interesting race scenarios where disk
registration/unregistration happens in parallel with a statedump and
block device activity emitting tracepoints into the trace. We lose the
ability to provide a reliable time-range for which the mapping between
(device_nr, disk_name) is valid. Going through uevent also lacks
time-stamp consistency with the block tracepoint events.

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com