2011-05-12 14:48:51

by Stephane Eranian

Subject: [BUG] perf: bogus correlation of kernel symbols

Hi,

I think there is a serious problem with kernel symbol correlation
with the latest perf in 2.6.39-rc7-tip.

Here is a simple example with a stupid program that only
does open()/close() on /dev/null:

$ perf record -e cycles:k openclose
$ perf report --stdio

# Events: 2K cycles
#
# Overhead Command Shared Object Symbol
# ........ ......... ................ ...............
#
99.76% openclose [binfmt_misc] [k] 0xffffffff81010fe6
0.13% openclose libc-2.12.1.so [.] __open_nocancel
0.09% openclose libc-2.12.1.so [.] __GI_close

The DSO (binfmt_misc) is bogus. That's not where time is spent.

But if I run the same test as root:

$ sudo perf record -e cycles:k openclose
$ sudo perf report --stdio

# Events: 2K cycles
#
# Overhead Command Shared Object Symbol
# ........ ......... ................. .............................
#
17.13% openclose [kernel.kallsyms] [k] __lock_acquire
11.77% openclose [kernel.kallsyms] [k] native_sched_clock
7.36% openclose [kernel.kallsyms] [k] sched_clock_local
5.99% openclose [kernel.kallsyms] [k] lock_release
5.38% openclose [kernel.kallsyms] [k] local_clock
4.43% openclose [kernel.kallsyms] [k] lock_acquired
4.05% openclose [kernel.kallsyms] [k] lock_acquire
3.95% openclose [kernel.kallsyms] [k] lock_is_held
3.51% openclose [kernel.kallsyms] [k] sched_clock_cpu
3.24% openclose [kernel.kallsyms] [k] trace_hardirqs_off_caller

This is much more meaningful.

This is not related to the paranoid level (1 for me).

Looking at perf report -D, the same kernel address is associated with a different
DSO depending on my permission level.

first perf.data:
416749738927 0x4210 [0x28]: PERF_RECORD_SAMPLE(IP, 1): 4886/4886:
0xffffffff8107c1d8 period: 2262681
... thread: openclose:4886
...... dso: /lib/modules/2.6.39-rc7-tip/kernel/fs/binfmt_misc.ko

second perf.data:
436879910722 0xc950 [0x28]: PERF_RECORD_SAMPLE(IP, 1): 4894/4894:
0xffffffff8107c1d8 period: 2280253
... thread: openclose:4894
...... dso: vmlinux

Same address different mapping!

My path to vmlinux is all accessible to me.

If there were permission problems, I would expect perf record or perf report
to tell me, and not fall back to some bogus mappings.


2011-05-12 18:06:37

by David Miller

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

From: Stephane Eranian <[email protected]>
Date: Thu, 12 May 2011 16:48:46 +0200

> I think there is a serious problem with kernel symbol correlation
> with the latest perf in 2.6.39-rc7-tip.

The behavior seems to be intentional, so that we don't expose internal
kernel addresses to userspace.

I hate this too, and I think it's absolutely ridiculous.

Also, like you, I lost an entire afternoon trying to figure out why
this started happening.

I wish we could revert this change.

2011-05-12 18:37:51

by Dave Jones

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> From: Stephane Eranian <[email protected]>
> Date: Thu, 12 May 2011 16:48:46 +0200
>
> > I think there is a serious problem with kernel symbol correlation
> > with the latest perf in 2.6.39-rc7-tip.
>
> The behavior seems to be intentional, so that we don't expose internal
> kernel addresses to userspace.

Sounds like commit 9f36e2c448007b54851e7e4fa48da97d1477a175

> I hate this too, and I think it's absolutely ridiculous.
>
> Also, like you, I lost an entire afternoon trying to figure out why
> this started happening.
>
> I wish we could revert this change.

At least it can be permanently disabled..

echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf

Dave

2011-05-12 19:01:45

by David Miller

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

From: Dave Jones <[email protected]>
Date: Thu, 12 May 2011 14:37:41 -0400

> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> > I hate this too, and I think it's absolutely ridiculous.
> >
> > Also, like you, I lost an entire afternoon trying to figure out why
> > this started happening.
> >
> > I wish we could revert this change.
>
> At least it can be permanently disabled..
>
> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf

Regardless, what to do about all of the "perf is broken" reports?

First off, perf can find out whether this madness exists, and it
should by default print out a warning in this situation instead of
knowingly emitting garbage kernel event information.
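
A tool could, for instance, read the sysctl before trusting /proc/kallsyms and warn when addresses are likely hidden. The following is only a minimal Python sketch of that idea, not perf's actual code: the helper names read_kptr_restrict and kallsyms_warning are made up for illustration, and treating "root" as able to see real addresses at level 1 is a simplifying assumption.

```python
KPTR_RESTRICT_PATH = "/proc/sys/kernel/kptr_restrict"


def read_kptr_restrict(path=KPTR_RESTRICT_PATH):
    """Return the sysctl value as an int, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def kallsyms_warning(level, is_root):
    """Return a warning string when kernel addresses are likely hidden, else None."""
    if level is None or level == 0:
        return None
    if level == 1 and is_root:
        # Assumption: a privileged reader still sees real addresses at level 1.
        return None
    return ("warning: kernel.kptr_restrict=%d hides kernel addresses; "
            "kernel symbols cannot be resolved" % level)
```

A caller would print the returned string (if any) before producing a report, rather than silently attributing samples to the wrong symbols.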

"I'm going to knowingly give you bad data, and I'm not even going to
let you know about it."

It's really crazy that we give people these incredibly powerful tools
and they don't even work properly by default.

We've been exposing kernel pointers for 20 years, nobody's grandmother
died because of it.

This is very "Animal Farm" the way we're gradually losing little bits
of functionality, time and time again, over this "kernel pointer
exposure" issue.

Are we going to be like animals and just accept the totality of this,
or are we going to be outraged enough to push back on stuff like perf
actually working properly?

2011-05-12 19:58:55

by Pekka Enberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 10:01 PM, David Miller <[email protected]> wrote:
> From: Dave Jones <[email protected]>
> Date: Thu, 12 May 2011 14:37:41 -0400
>
>> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
>> > I hate this too, and I think it's absolutely ridiculous.
>> >
>> > Also, like you, I lost an entire afternoon trying to figure out why
>> > this started happening.
>> >
>> > I wish we could revert this change.
>>
>> At least it can be permanently disabled..
>>
>> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
>
> Regardless, what to do about all of the "perf is broken" reports?

Let's revert the commit 9f36e2c448007b54851e7e4fa48da97d1477a175
("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
too have been wondering what's going on with perf reporting insane
symbols and this should definitely not be enabled by default.

Pekka

2011-05-12 20:25:10

by Alexey Dobriyan

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 03:01:32PM -0400, David Miller wrote:
> From: Dave Jones <[email protected]>
> Date: Thu, 12 May 2011 14:37:41 -0400
>
> > On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> > > I hate this too, and I think it's absolutely ridiculous.
> > >
> > > Also, like you, I lost an entire afternoon trying to figure out why
> > > this started happening.
> > >
> > > I wish we could revert this change.
> >
> > At least it can be permanently disabled..
> >
> > echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
>
> Regardless, what to do about all of the "perf is broken" reports?

The problem is that they turned it on by default.

int kptr_restrict = 1;

2011-05-12 20:32:33

by Linus Torvalds

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <[email protected]> wrote:
>
> I think there is a serious problem with kernel symbol correlation
> with the latest perf in 2.6.39-rc7-tip.

Yeah. It's annoying. It's a "perf" bug, though - triggered by
/proc/sys/kernel/kptr_restrict being set to 1.

The bug is that perf doesn't say "I can't match kernel symbols", but
instead does some crazy matching and gives total crap module
information (I think it just picks the one that shows up last in
/proc/kallsyms).

That said, I have considered just reverting the thing that makes
kptr_restrict be 1 by default. I do like the security implications of
restricting visibility into kernel pointers, but I also think that
security rules that make the system less usable are dubious. So I
dunno.

Linus

2011-05-12 20:44:07

by David Miller

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

From: Linus Torvalds <[email protected]>
Date: Thu, 12 May 2011 13:31:37 -0700

> That said, I have considered just reverting the thing that makes
> kptr_restrict be 1 by default. I do like the security implications of
> restricting visibility into kernel pointers, but I also think that
> security rules that make the system less usable are dubious. So I
> dunno.

We don't have any firewalling or SELINUX rules installed by default,
even if those features are enabled in the kernel. Userspace asks for
it.

Many people would claim that use of such things are "essential" these
days.

I don't see a good reason to handle kptr_restrict any differently.

2011-05-12 21:00:40

by Ingo Molnar

Subject: [PATCH] vsprintf: Turn kptr_restrict off by default


* David Miller <[email protected]> wrote:

> From: Linus Torvalds <[email protected]>
> Date: Thu, 12 May 2011 13:31:37 -0700
>
> > That said, I have considered just reverting the thing that makes
> > kptr_restrict be 1 by default. I do like the security implications of
> > restricting visibility into kernel pointers, but I also think that
> > security rules that make the system less usable are dubious. So I
> > dunno.
>
> We don't have any firewalling or SELINUX rules installed by default, even if
> those features are enabled in the kernel. Userspace asks for it.
>
> Many people would claim that use of such things are "essential" these days.
>
> I don't see a good reason to handle kptr_restrict any differently.

That's a good argument.

We'll fix the perf bug - i was bitten by another incarnation of it: 'perf top'
stops showing kernel symbols, and it took some time to realize that kptr_restrict
was turned on by default. I reported it to Arnaldo, so he knows about it, but
there's no fix yet.

I didn't realize that perf diff got confused by this as well. (but it's logical)

So how about the patch below?

Thanks,

Ingo

------------------------>
Subject: vsprintf: Turn kptr_restrict off by default

kptr_restrict has been triggering bugs in apps such as perf, and it also makes
the system less useful by default, so turn it off by default.

This is how we generally handle security features that remove functionality,
such as firewall code or SELinux - they have to be configured and activated
from user-space.

Distributions can turn kptr_restrict on again via this line in
/etc/sysctl.conf:

kernel.kptr_restrict = 1

( Also mark the variable __read_mostly while at it, as it's typically modified
only once per bootup, or not at all. )

Signed-off-by: Ingo Molnar <[email protected]>
---
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index bc0ac6b..dfd6019 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -797,7 +797,7 @@ char *uuid_string(char *buf, char *end, const u8 *addr,
 	return string(buf, end, uuid, spec);
 }

-int kptr_restrict = 1;
+int kptr_restrict __read_mostly;

 /*
  * Show a '%p' thing. A kernel extension is that the '%p' is followed

2011-05-12 21:06:26

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* David Miller <[email protected]> wrote:

> From: Stephane Eranian <[email protected]>
> Date: Thu, 12 May 2011 16:48:46 +0200
>
> > I think there is a serious problem with kernel symbol correlation
> > with the latest perf in 2.6.39-rc7-tip.
>
> The behavior seems to be intentional, so that we don't expose internal
> kernel addresses to userspace.
>
> I hate this too, and I think it's absolutely ridiculous.
>
> Also, like you, I lost an entire afternoon trying to figure out why
> this started happening.

I lost about an hour, with Arnaldo helping me on IRC, until we figured out that
/proc/kallsyms started having zero-valued entries ... I too am running perf as an
unprivileged user.

Zero is a valid symbol address so nothing within perf tripped up explicitly,
but perf report and perf top results were nonsensical.
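
Because a single zero address can be legitimate, a detector has to look at the table as a whole. Below is a minimal Python sketch, assuming the usual "address type name [module]" line layout of /proc/kallsyms; the function names parse_kallsyms and looks_restricted are hypothetical, and this is not how perf itself implements the check.

```python
def parse_kallsyms(lines):
    """Yield (address, type, name) from '/proc/kallsyms'-style lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        try:
            addr = int(fields[0], 16)
        except ValueError:
            continue  # first column is not a hex address
        yield addr, fields[1], fields[2]


def looks_restricted(lines):
    """True when symbols are present but every single address is zero."""
    symbols = list(parse_kallsyms(lines))
    return bool(symbols) and all(addr == 0 for addr, _, _ in symbols)
```

Checking "all addresses are zero" rather than "some address is zero" avoids a false alarm on symbols that really do sit at offset zero.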

There was another problem with it: perf is caching and storing known kernel
buildid addresses in ~/.debug, under the (previously correct) assumption that
kernel symbols do not change for one given kernel build. But with kptr_restrict
it would cache the zero values - which were cached even after kptr_restrict was
set back to 0.
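
One way to avoid poisoning such a cache is to refuse to store a table whose addresses are all zero. The sketch below is a toy in-memory stand-in written in Python; the SymbolCache class and its methods are hypothetical and do not mirror perf's real ~/.debug layout or code.

```python
class SymbolCache:
    """Toy in-memory stand-in for an on-disk cache keyed by build-id."""

    def __init__(self):
        self._tables = {}

    def store(self, build_id, symbols):
        """Cache (address, name) pairs unless every address is zero.

        Returns True if the table was stored, False if it was rejected.
        """
        if not symbols or all(addr == 0 for addr, _ in symbols):
            return False
        self._tables[build_id] = list(symbols)
        return True

    def lookup(self, build_id):
        """Return the cached table, or None if nothing usable was stored."""
        return self._tables.get(build_id)
```

With this guard, a run made while addresses are hidden leaves no entry behind, so a later run with visible addresses can populate the cache normally.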

Thanks,

Ingo

2011-05-12 21:07:13

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 10:31 PM, Linus Torvalds
<[email protected]> wrote:
>
> On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <[email protected]> wrote:
> >
> > I think there is a serious problem with kernel symbol correlation
> > with the latest perf in 2.6.39-rc7-tip.
>
> Yeah. It's annoying. It's a "perf" bug, though - triggered by
> /proc/sys/kernel/kptr_restrict being set to 1.
>
I did not know about this new masquerading of pointers in /proc/kallsyms.
That certainly explains the problem.

>
> The bug is that perf doesn't say "I can't match kernel symbols", but
> instead does some crazy matching and gives total crap module
> information (I think it just picks the one that shows up last in
> /proc/kallsyms).
>
But I agree perf must not silently return bogus information. It
should print a big warning message and/or fall back to printing the raw
addresses. So much for having perf in the kernel source tree to
keep things in sync...

>
> That said, I have considered just reverting the thing that makes
> kptr_restrict be 1 by default. I do like the security implications of
> restricting visibility into kernel pointers, but I also think that
> security rules that make the system less usable are dubious. So I
> dunno.
>
I am not clear as to what people could actually do with the addresses
taken out of /proc/kallsyms. Looks to me like we've lost functionality
for the vast majority of users. So maybe the default should be inverted.

I know of a somewhat similar issue with the file descriptor limit which
people are hitting frequently these days when monitoring apps with lots
of threads or lots of events in one run on large smp systems.
That can easily be corrected, but again requires root privilege to regain
the functionality.

2011-05-12 21:09:24

by David Miller

Subject: Re: [PATCH] vsprintf: Turn kptr_restrict off by default

From: Ingo Molnar <[email protected]>
Date: Thu, 12 May 2011 23:00:28 +0200

> Subject: vsprintf: Turn kptr_restrict off by default
>
> kptr_restrict has been triggering bugs in apps such as perf, and it also makes
> the system less useful by default, so turn it off by default.
>
> This is how we generally handle security features that remove functionality,
> such as firewall code or SELinux - they have to be configured and activated
> from user-space.
>
> Distributions can turn kptr_restrict on again via this line in
> /etc/sysctl.conf:
>
> kernel.kptr_restrict = 1
>
> ( Also mark the variable __read_mostly while at it, as it's typically modified
> only once per bootup, or not at all. )
>
> Signed-off-by: Ingo Molnar <[email protected]>

Acked-by: David S. Miller <[email protected]>

2011-05-12 21:30:34

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

The other contradiction, I see, is that you have perf_event paranoia level
and this new kptr masquerading feature which conflict with each
other.

You can be allowed to monitor at the kernel level (paranoid=1, default)
but you cannot correlate symbols:

$ perf record -e cycles:k foo

I suspect if you have this kptr thing turned on, then you need to disallow
monitoring at the kernel level too.
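
The inconsistent combination can be stated as a small predicate. This is a Python sketch under stated assumptions, not kernel or perf code: it assumes perf_event_paranoid values of 2 and above forbid kernel profiling for unprivileged users, that kptr_restrict=1 hides addresses from unprivileged readers only, and that kptr_restrict=2 hides them from everyone. The function name is made up.

```python
def kernel_samples_unresolvable(perf_event_paranoid, kptr_restrict,
                                privileged=False):
    """True when kernel samples can be recorded but not mapped to symbols."""
    # Assumption: privileged users may always profile the kernel.
    can_profile_kernel = privileged or perf_event_paranoid < 2
    # Assumption: level 1 hides addresses from unprivileged users only,
    # level 2 hides them from everyone.
    if privileged:
        addresses_hidden = kptr_restrict >= 2
    else:
        addresses_hidden = kptr_restrict >= 1
    return can_profile_kernel and addresses_hidden
```

For the setup described above (paranoid=1, kptr_restrict=1, regular user) the predicate is true, which is exactly the case where a tool should warn or refuse.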



On Thu, May 12, 2011 at 11:07 PM, Stephane Eranian <[email protected]> wrote:
> On Thu, May 12, 2011 at 10:31 PM, Linus Torvalds
> <[email protected]> wrote:
>>
>> On Thu, May 12, 2011 at 7:48 AM, Stephane Eranian <[email protected]> wrote:
>> >
>> > I think there is a serious problem with kernel symbol correlation
>> > with the latest perf in 2.6.39-rc7-tip.
>>
>> Yeah. It's annoying. It's a "perf" bug, though - triggered by
>> /proc/sys/kernel/kptr_restrict being set to 1.
>>
> I did not know about this new masquerading of pointers in /proc/kallsyms.
> That certainly explains the problem.
>
>>
>> The bug is that perf doesn't say "I can't match kernel symbols", but
>> instead does some crazy matching and gives total crap module
>> information (I think it just picks the one that shows up last in
>> /proc/kallsyms).
>>
> But I agree perf must not silently return bogus information. It
> should print a big warning message and/or fall back to printing the raw
> addresses. So much for having perf in the kernel source tree to
> keep things in sync...
>
>>
>> That said, I have considered just reverting the thing that makes
>> kptr_restrict be 1 by default. I do like the security implications of
>> restricting visibility into kernel pointers, but I also think that
>> security rules that make the system less usable are dubious. So I
>> dunno.
>>
> I am not clear as to what people could actually do with the addresses
> taken out of /proc/kallsyms. Looks to me like we've lost functionality
> for the vast majority of users. So maybe the default should be inverted.
>
> I know of a somewhat similar issue with the file descriptor limit which
> people are hitting frequently these days when monitoring apps with lots
> of threads or lots of events in one run on large smp systems.
> That can easily be corrected, but again requires root privilege to regain
> the functionality.
>

2011-05-12 21:35:55

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian <[email protected]> wrote:

> The other contradiction, I see, is that you have perf_event paranoia level
> and this new kptr masquerading feature which conflict with each
> other.
>
> You can be allowed to monitor at the kernel level (paranoid=1, default)
> but you cannot correlate symbols:
>
> $ perf record -e cycles:k foo
>
> I suspect if you have this kptr thing turned on, then you need to disallow
> monitoring at the kernel level too.

The better (and consistent) solution would be to turn the kptr_restrict thing
off - see the patch i sent.

Thanks,

Ingo

2011-05-12 21:37:10

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian <[email protected]> wrote:

> > The bug is that perf doesn't say "I can't match kernel symbols", but
> > instead does some crazy matching and gives total crap module information (I
> > think it just picks the one that shows up last in /proc/kallsyms).
>
> But I agree perf must not silently return bogus information. It should print
> a big warning message and/or fall back to printing the raw addresses. [...]

Yes, agreed, this is a bug in perf. I found out about this about two weeks ago
and reported it to Arnaldo, but he is away right now - he might be able to fix
it next week at the earliest.

> [...] So much for having perf in the kernel source tree to keep things in
> sync...

What do you mean?

Thanks,

Ingo

2011-05-12 21:38:43

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> The other contradiction, I see, is that you have perf_event paranoia level
>> and this new kptr masquerading feature which conflict with each
>> other.
>>
>> You can be allowed to monitor at the kernel level (paranoid=1, default)
>> but you cannot correlate symbols:
>>
>> $ perf record -e cycles:k foo
>>
>> I suspect if you have this kptr thing turned on, then you need to disallow
>> monitoring at the kernel level too.
>
> The better (and consistent) solution would be to turn the kptr_restrict thing
> off - see the patch i sent.
>
I saw that. But I think that when someone turns it back on, then you need
to increase the perf_events paranoia level to disallow kernel monitoring to
regular users such that you maintain consistency across the board.

2011-05-12 21:41:33

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 11:36 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> > The bug is that perf doesn't say "I can't match kernel symbols", but
>> > instead does some crazy matching and gives total crap module information (I
>> > think it just picks the one that shows up last in /proc/kallsyms).
>>
>> But I agree perf must not silently return bogus information. It should print
>> a big warning message and/or fall back to printing the raw addresses. [...]
>
> Yes, agreed, this is a bug in perf. I found out about this about two weeks ago
> and reported it to Arnaldo, but he is away right now - he might be able to fix
> it next week at the earliest.
>
>> [...] So much for having perf in the kernel source tree to keep things in
>> sync...
>
> What do you mean?
>
I meant that when this kptr feature was added, people should have scanned the
entire tree (including tools/perf) to look for potential impact on programs
relying on /proc/kallsyms. Having perf in the tree should have made this easier
to catch. That's all.



> Thanks,
>
>        Ingo
>

2011-05-12 21:50:35

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian <[email protected]> wrote:

> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> >> The other contradiction, I see, is that you have perf_event paranoia level
> >> and this new kptr masquerading feature which conflict with each
> >> other.
> >>
> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
> >> but you cannot correlate symbols:
> >>
> >> $ perf record -e cycles:k foo
> >>
> >> I suspect if you have this kptr thing turned on, then you need to disallow
> >> monitoring at the kernel level too.
> >
> > The better (and consistent) solution would be to turn the kptr_restrict thing
> > off - see the patch i sent.
>
> I saw that. But I think that when someone turns it back on, then you need to
> increase the perf_events paranoia level to disallow kernel monitoring to
> regular users such that you maintain consistency across the board.

Dunno, i would not couple them necessarily - certain users might still have
access to kernel symbols via some other channel - for example the System.map.

Thanks,

Ingo

2011-05-12 21:54:59

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian <[email protected]> wrote:

> >> [...] So much for having perf in the kernel source tree to keep things in
> >> sync...
> >
> > What do you mean?
>
> I meant that when this kptr feature was added, people should have scanned the
> entire tree (including tools/perf) to look for potential impact on programs
> relying on /proc/kallsyms. Having perf in the tree should have made this
> easier to catch. That's all.

It was noticed in another case when there was kallsyms twiddling going on, so it
depends. What wasn't noticed here was how the present-but-zero-valued symbols:

0000000000000000 D irq_stack_union
0000000000000000 D __per_cpu_start
0000000000000000 D gdt_page
0000000000000000 d exception_stacks
0000000000000000 d tlb_vector_offset
0000000000000000 d shared_msrs
0000000000000000 d cpu_tsc_khz

caused the symbol code of perf to consider them non-existing.

Perf being in-tree won't magically avoid all bugs, so you should not expect that
magical effect from tool integration into the kernel tree.

Thanks,

Ingo

2011-05-12 21:56:40

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 11:50 PM, Ingo Molnar <[email protected]> wrote:
>
> * Stephane Eranian <[email protected]> wrote:
>
>> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <[email protected]> wrote:
>> >
>> > * Stephane Eranian <[email protected]> wrote:
>> >
>> >> The other contradiction, I see, is that you have perf_event paranoia level
>> >> and this new kptr masquerading feature which conflict with each
>> >> other.
>> >>
>> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
>> >> but you cannot correlate symbols:
>> >>
>> >> $ perf record -e cycles:k foo
>> >>
>> >> I suspect if you have this kptr thing turned on, then you need to disallow
>> >> monitoring at the kernel level too.
>> >
>> > The better (and consistent) solution would be to turn the kptr_restrict thing
>> > off - see the patch i sent.
>>
>> I saw that. But I think that when someone turns it back on, then you need to
>> increase the perf_events paranoia level to disallow kernel monitoring to
>> regular users such that you maintain consistency across the board.
>
> Dunno, i would not couple them necessarily - certain users might still have
> access to kernel symbols via some other channel - for example the System.map.
>
Ok, that's true, but then you'd need to have perf print a message or refuse to
use /proc/kallsyms and suggest that the user provides a System.map.

2011-05-12 22:00:29

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian <[email protected]> wrote:

> On Thu, May 12, 2011 at 11:50 PM, Ingo Molnar <[email protected]> wrote:
> >
> > * Stephane Eranian <[email protected]> wrote:
> >
> >> On Thu, May 12, 2011 at 11:35 PM, Ingo Molnar <[email protected]> wrote:
> >> >
> >> > * Stephane Eranian <[email protected]> wrote:
> >> >
> >> >> The other contradiction, I see, is that you have perf_event paranoia level
> >> >> and this new kptr masquerading feature which conflict with each
> >> >> other.
> >> >>
> >> >> You can be allowed to monitor at the kernel level (paranoid=1, default)
> >> >> but you cannot correlate symbols:
> >> >>
> >> >> $ perf record -e cycles:k foo
> >> >>
> >> >> I suspect if you have this kptr thing turned on, then you need to disallow
> >> >> monitoring at the kernel level too.
> >> >
> >> > The better (and consistent) solution would be to turn the kptr_restrict thing
> >> > off - see the patch i sent.
> >>
> >> I saw that. But I think that when someone turns it back on, then you need to
> >> increase the perf_events paranoia level to disallow kernel monitoring to
> >> regular users such that you maintain consistency across the board.
> >
> > Dunno, i would not couple them necessarily - certain users might still have
> > access to kernel symbols via some other channel - for example the System.map.
>
> Ok, that's true, but then you'd need to have perf print a message or refuse to
> use /proc/kallsyms and suggest that the user provides a System.map.

Correct - the right approach would be to just use what we had in earlier
versions, an 'unknown symbol' kind of catch-all entry that shows how much
stuff we did not recognize.
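
Such a catch-all can be sketched in a few lines of Python. This is an illustration only, not perf's implementation: the names UNKNOWN, resolve and histogram are made up, symbol sizes are ignored (a sample is attributed to the nearest symbol starting at or below it), and zero-address entries are dropped on the assumption that they carry no usable location.

```python
import bisect
from collections import Counter

UNKNOWN = "[unknown]"


def resolve(addr, starts, names):
    """Map addr to the last symbol starting at or below it, else UNKNOWN."""
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return UNKNOWN
    return names[i]


def histogram(sample_addrs, symbols):
    """Count samples per symbol; symbols is a list of (address, name) pairs."""
    # Drop zero-address entries so a hidden table resolves nothing at all.
    usable = sorted((a, n) for a, n in symbols if a != 0)
    starts = [a for a, _ in usable]
    names = [n for _, n in usable]
    counts = Counter()
    for addr in sample_addrs:
        counts[resolve(addr, starts, names)] += 1
    return counts
```

With a fully zeroed table every sample lands in the "[unknown]" bucket, so the report shows how much was not recognized instead of blaming an arbitrary module.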

Thanks,

Ingo

2011-05-12 22:07:28

by Dave Jones

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:

> Dunno, i would not couple them necessarily - certain users might still have
> access to kernel symbols via some other channel - for example the System.map.

That always made this security by obscurity feature seem pointless for the bulk
of users to me. Given the majority are going to be running distro kernels,
anyone can find those addresses easily no matter how hard we hide them on the
running system.

Unless we somehow introduced randomness into where we unpack the kernel
each boot, and used System.map as a table of offsets instead of absolute addresses.
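
The "table of offsets" idea amounts to simple arithmetic. Here is a minimal Python sketch, assuming symbols are given as (address, name) pairs and that _text marks the start of the kernel image; the function names and the load base used in the usage note are hypothetical.

```python
def to_offsets(symbols, anchor="_text"):
    """Convert (address, name) pairs into {name: offset from the anchor}."""
    # Duplicate names keep the last address seen; fine for a sketch.
    table = dict((name, addr) for addr, name in symbols)
    base = table[anchor]
    return {name: addr - base for name, addr in table.items()}


def runtime_address(offsets, name, load_base):
    """Address of a symbol when the kernel image is loaded at load_base."""
    return load_base + offsets[name]
```

For example, a symbol listed at 0xffffffff8107c1d8 with _text at 0xffffffff81000000 has offset 0x7c1d8, and would sit at load_base + 0x7c1d8 wherever the image ends up; without knowing load_base, the published table no longer reveals absolute addresses.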

Dave

2011-05-12 22:15:55

by Stephane Eranian

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, May 13, 2011 at 12:07 AM, Dave Jones <[email protected]> wrote:
> On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
>
>  > Dunno, i would not couple them necessarily - certain users might still have
>  > access to kernel symbols via some other channel - for example the System.map.
>
> That always made this security by obscurity feature seem pointless for the bulk
> of users to me. Given the majority are going to be running distro kernels,
> anyone can find those addresses easily no matter how hard we hide them on the
> running system.

> Unless we somehow introduced randomness into where we unpack the kernel
> each boot, and used System.map as a table of offsets instead of absolute addresses.
>
Good point about System.map! Even if /proc/kallsyms contains zero addresses, I
can still get them from /boot/System.map, which is readable by everyone, I
think. It does not contain the module addresses, but you have the core
functions, unless I am somehow mistaken.

2011-05-13 06:15:31

by Kees Cook

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

Hi Pekka,

On Thu, May 12, 2011 at 10:58:53PM +0300, Pekka Enberg wrote:
> On Thu, May 12, 2011 at 10:01 PM, David Miller <[email protected]> wrote:
> > From: Dave Jones <[email protected]>
> > Date: Thu, 12 May 2011 14:37:41 -0400
> >
> >> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
> >> > I hate this too, and I think it's absolutely ridiculous.
> >> >
> >> > Also, like you, I lost an entire afternoon trying to figure out why
> >> > this started happening.
> >> >
> >> > I wish we could revert this change.
> >>
> >> At least it can be permanently disabled..
> >>
> >> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
> >
> > Regardless, what to do about all of the "perf is broken" reports?
>
> Let's revert the commit 9f36e2c448007b54851e7e4fa48da97d1477a175
> ("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
> too have been wondering what's going on with perf reporting insane
> symbols and this should definitely not be enabled by default.

No, reverting that is not the answer. If perf has a problem with the
kptr_restrict feature, it should just disable it in /proc/sys when it
runs and restore it when finished. Since our defaults should be secure
for the average user (who does not use perf), it's fine the way it
is. Anyone using perf can adjust this for their use-case (that is why
there is a /proc/sys tunable).

-Kees

--
Kees Cook
Ubuntu Security Team

2011-05-13 06:24:19

by Pekka Enberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

Hi Kees,

On Fri, May 13, 2011 at 9:12 AM, Kees Cook <[email protected]> wrote:
> Hi Pekka,
>
> On Thu, May 12, 2011 at 10:58:53PM +0300, Pekka Enberg wrote:
>> On Thu, May 12, 2011 at 10:01 PM, David Miller <[email protected]> wrote:
>> > From: Dave Jones <[email protected]>
>> > Date: Thu, 12 May 2011 14:37:41 -0400
>> >
>> >> On Thu, May 12, 2011 at 02:06:30PM -0400, David Miller wrote:
>> >> > I hate this too, and I think it's absolutely ridiculous.
>> >> >
>> >> > Also, like you, I lost an entire afternoon trying to figure out why
>> >> > this started happening.
>> >> >
>> >> > I wish we could revert this change.
>> >>
>> >> At least it can be permanently disabled..
>> >>
>> >> echo kernel.kptr_restrict = 0 >> /etc/sysctl.conf
>> >
>> > Regardless, what to do about all of the "perf is broken" reports?
>>
>> Let's revert the commit 9f36e2c448007b54851e7e4fa48da97d1477a175
>> ("printk: use %pK for /proc/kallsyms and /proc/modules"), please! I
>> too have been wondering what's going on with perf reporting insane
>> symbols and this should definitely not be enabled by default.
>
> No, reverting that is not the answer. If perf has a problem with the
> kptr_restrict feature, it should just disable it in /proc/sys when it
> runs and restore it when finished. Since our defaults should be secure
> for the average user (who does not use perf), it's fine the way it
> is. Anyone using perf can adjust this for their use-case (that is why
> there is a /proc/sys tunable).

No, it's the other way around. See commit
411f05f123cbd7f8aa1edcae86970755a6e2a9d9 ("vsprintf: Turn
kptr_restrict off by default") in Linus' tree for details.

Pekka
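
For anyone who wants to confirm that this is what they are hitting: with kptr_restrict in effect, an unprivileged read of /proc/kallsyms shows every address as zero. Below is a minimal sketch in Python of that check. The function name and the sample text are made up for illustration, and this is not a description of what perf itself does.

```python
def kallsyms_is_censored(text):
    """Return True if every address in kallsyms-style text is zero."""
    addrs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        addrs.append(int(fields[0], 16))
    # No usable lines at all tells us nothing, so report False.
    return bool(addrs) and all(a == 0 for a in addrs)


# Illustrative samples, not real output from any particular kernel.
restricted = (
    "0000000000000000 T native_sched_clock\n"
    "0000000000000000 t sched_clock_local\n"
)
visible = (
    "ffffffff81000100 T native_sched_clock\n"
    "ffffffff81000200 t sched_clock_local\n"
)

print(kallsyms_is_censored(restricted))  # True
print(kallsyms_is_censored(visible))     # False
```

On a live system the input would be the contents of /proc/kallsyms as read by the user running the profiler.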

2011-05-13 08:58:13

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dave Jones wrote:

> On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
>
> > Dunno, i would not couple them necessarily - certain users might still have
> > access to kernel symbols via some other channel - for example the System.map.
>
> That always made this security by obscurity feature seem pointless for the bulk
> of users to me. Given the majority are going to be running distro kernels,
> anyone can find those addresses easily no matter how hard we hide them on the
> running system.

I certainly agree and made that argument as well, in the original thread(s)
about /proc/kallsyms.

> Unless we somehow introduced randomness into where we unpack the kernel
> each boot, and used System.map as a table of offsets instead of absolute
> addresses.

Correct. This security feature is IMO only solving a tiny fraction of the
problem and is thus in fact hindering the implementation of a *real* layer
of protection of kernel absolute addresses:

The x86 kernel is relocatable, so slightly randomizing the position of the
kernel would be feasible with no overhead on the vast majority of existing
distro installs, with just an updated kernel.

When exposing randomized RIPs to user-space we could recalculate all RIPs back
to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
form:

[ 32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
[ 32.946003] PGD 0
[ 32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
[ 32.946003] last sysfs file:
[ 32.946003] CPU 1
[ 32.946003] Pid: 1, comm: swapper Tainted: G W 2.6.29-rc1-00190-g37a76bd #10
[ 32.946003] RIP: 0010:[<ffffffff80222521>] [<ffffffff80222521>] get_cur_val+0xcc/0x106
[ 32.946003] RSP: 0018:ffff88003f977b80 EFLAGS: 00010202
[ 32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
[ 32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
[ 32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
[ 32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
[ 32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
[ 32.946003] FS: 0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
[ 32.946003] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
[ 32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
[ 32.946003] Stack:

Likewise, /proc/kallsyms could pass these addresses as well and the perf
call-chain code and other places that sample RIPs could easily convert them to
the constant address as well.
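
The recalculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `random_base` is a hypothetical per-boot value, the helper name is made up, and only the bracketed code addresses are rewritten, so register and stack values stay raw.

```python
import re

CANONICAL_BASE = 0xffffffff80000000  # base named in the mail above

# Matches the bracketed code addresses in an oops, e.g. [<ffffffff80222521>]
_BRACKETED = re.compile(r"\[<([0-9a-f]{16})>\]")


def normalize_oops_line(line, random_base):
    """Rewrite bracketed addresses as if the kernel sat at CANONICAL_BASE."""
    def fix(match):
        addr = int(match.group(1), 16)
        return "[<%016x>]" % (addr - random_base + CANONICAL_BASE)
    return _BRACKETED.sub(fix, line)


# Hypothetical boot where the kernel landed 0x3000000 above the usual base.
random_base = CANONICAL_BASE + 0x3000000
raw = "IP: [<ffffffff83222521>] get_cur_val+0xcc/0x106"
print(normalize_oops_line(raw, random_base))
# IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
```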

We'd still leak some information like the relative position of symbols from
each other (this can be useful to certain classes of attacks), but we could
pretty effectively hide the absolute location of the kernel, which is the most
valuable piece of information.

Then the random base has to be protected: i.e. all information leaks of raw
kernel RIPs have to be plugged. The nice thing is that this will happen as
*bugfixes*: randomized RIPs will not be useful for anything, so any
tools/people who rely on them will notice it immediately.

I think *that* would be a maintainable and complete security feature to truly
hide the exact location of the kernel image. kptr_restrict is not.

Thanks,

Ingo

2011-05-13 09:01:46

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Stephane Eranian wrote:

> Good point about System.map! Even if /proc/kallsyms contains zero addresses,
> I can still get them from /boot/System.map which is readable by everyone, I
> think. It does not contain the modules addresses, but you have the core
> functions, unless I am somehow mistaken.

Yes. I pointed out this and some other details a couple of months ago, for a
similarly motivated /proc/kallsyms obfuscation patch:

http://lkml.org/lkml/2010/11/4/113
http://lkml.org/lkml/2010/11/4/145

Thanks,

Ingo
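
To make the System.map point concrete, here is a rough Python sketch of resolving a sampled address against a System.map-style table. The three-entry table is made up (only `get_cur_val` and its offsets are taken from the oops quoted earlier in the thread), and the helper names are hypothetical.

```python
import bisect


def parse_system_map(text):
    """Parse 'address type name' lines into a list sorted by address."""
    syms = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            syms.append((int(fields[0], 16), fields[2]))
    syms.sort()
    return syms


def resolve(syms, addr):
    """Return 'name+0xoff' for the closest symbol at or below addr."""
    starts = [start for start, _ in syms]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None  # address is below the first symbol
    start, name = syms[i]
    return "%s+0x%x" % (name, addr - start)


# Made-up table for illustration; a real System.map has thousands of entries.
sample = """\
ffffffff80222400 T cpufreq_get
ffffffff80222455 t get_cur_val
ffffffff8022255b t drv_read
"""
syms = parse_system_map(sample)
print(resolve(syms, 0xffffffff80222521))  # get_cur_val+0xcc
```

Nothing here needs /proc/kallsyms, which is the point being made: for a distro kernel the table ships with the package.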

2011-05-13 13:29:18

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

Hi all,

I would have appreciated a CC on this one, as the author of the feature
that got disabled.

> * Dave Jones wrote:
>
> > On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
> >
> > > Dunno, i would not couple them necessarily - certain users might still have
> > > access to kernel symbols via some other channel - for example the System.map.
> >
> > That always made this security by obscurity feature seem pointless for the bulk
> > of users to me. Given the majority are going to be running distro kernels,
> > anyone can find those addresses easily no matter how hard we hide them on the
> > running system.
>
> I certainly agree and made that argument as well, in the original thread(s)
> about /proc/kallsyms.
>

I agree about the fact that kptr_restrict is an incomplete security
feature. However, I disagree that it lacks usefulness entirely.
Virtually every public kernel exploit in the past year
leverages /proc/kallsyms or other kernel address leakage to target an
attack. I'm not ignorant of the fact that it's trivial to fingerprint
distribution kernels in the absence of this information, but the reality
is, a huge portion of real life exploit attempts leverage pre-fabricated
exploits and are conducted by people who lack the ability to adjust
exploits to target a specific running kernel. Even though this is
trivial to sidestep if you know what you're doing, this extra little
step may mean some script kiddie can't root some poor sysadmin's
machine, and that's a win. In addition, when more powerful
randomization is hopefully introduced, blocking access to these pointers
will be more essential in preserving the lack of knowledge of the
location of kernel internals.

But this is all just for the record I suppose, since it seems that ship
has sailed.

> > Unless we somehow introduced randomness into where we unpack the kernel
> > each boot, and used System.map as a table of offsets instead of absolute
> > addresses.
>
> Correct. This security feature is IMO only solving a tiny fraction of the
> problem and is thus in fact hindering the implementation of a *real* layer
> of protection of kernel absolute addresses:
>
> The x86 kernel is relocatable, so slightly randomizing the position of the
> kernel would be feasible with no overhead on the vast majority of existing
> distro installs, with just an updated kernel.
>
> When exposing randomized RIPs to user-space we could recalculate all RIPs back
> to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
> form:
>
> [ 32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> [ 32.946003] PGD 0
> [ 32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> [ 32.946003] last sysfs file:
> [ 32.946003] CPU 1
> [ 32.946003] Pid: 1, comm: swapper Tainted: G W 2.6.29-rc1-00190-g37a76bd #10
> [ 32.946003] RIP: 0010:[<ffffffff80222521>] [<ffffffff80222521>] get_cur_val+0xcc/0x106
> [ 32.946003] RSP: 0018:ffff88003f977b80 EFLAGS: 00010202
> [ 32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> [ 32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> [ 32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> [ 32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> [ 32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> [ 32.946003] FS: 0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> [ 32.946003] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> [ 32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> [ 32.946003] Stack:
>
> Likewise, /proc/kallsyms could pass these addresses as well and the perf
> call-chain code and other places that sample RIPs could easily convert them to
> the constant address as well.
>
> We'd still leak some information like the relative position of symbols from
> each other (this can be useful to certain classes of attacks), but we could
> pretty effectively hide the absolute location of the kernel - which is the most
> valuable piece of information -.
>
> Then the random base has to be protected: i.e. all information leaks of raw
> kernel RIPs have to be plugged. The nice thing is that this will happen as
> *bugfixes*: randomized RIPs will not be useful for anything, so any
> tools/people who rely on them will notice it immediately.
>
> I think *that* would be a maintainable and complete security feature to truly
> hide the exact location of the kernel image. kptr_restrict is not.
>

I want this feature, as I think it is far more useful and important.
This has been mentioned before, but no one has stepped up to actually do
it. Unfortunately, I lack the necessary knowledge of the relevant code
to do it properly. What's the best way to make this feature a reality?

Regards,
Dan


> Thanks,
>
> Ingo

2011-05-13 16:24:19

by Andi Kleen

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

Ingo Molnar writes:

I agree that the current %pK default is really a catastrophe, clearly
on the trajectory of "the system is only secure when nothing works anymore".

> The x86 kernel is relocatable, so slightly randomizing the position of the
> kernel would be feasible with no overhead on the vast majority of existing
> distro installs, with just an updated kernel.

Problem is that we don't have a source of secure randomness early on
when the relocation would need to happen.

You could either pass it as an option, but that option would be right
now too exposed, or just use kexec and boot twice.

But all of this has drawbacks.

> When exposing randomized RIPs to user-space we could recalculate all RIPs back
> to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
> form:

This would be very confusing because the register and stack contents
would have the non relocated addresses.

I bet it would cause a lot of similar problems as the current %pK
madness, just more subtle ones.

-Andi
--
[email protected] -- Speaking for myself only

2011-05-16 15:35:52

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg wrote:

> Hi all,
>
> I would have appreciated a CC on this one, as the author of the feature
> that got disabled.

That's true and sorry about it: i could have sworn the author was Cc:-ed but
confused you with Kees ...

> > * Dave Jones wrote:
> >
> > > On Thu, May 12, 2011 at 11:50:23PM +0200, Ingo Molnar wrote:
> > >
> > > > Dunno, i would not couple them necessarily - certain users might still have
> > > > access to kernel symbols via some other channel - for example the System.map.
> > >
> > > That always made this security by obscurity feature seem pointless for the bulk
> > > of users to me. Given the majority are going to be running distro kernels,
> > > anyone can find those addresses easily no matter how hard we hide them on the
> > > running system.
> >
> > I certainly agree and made that argument as well, in the original thread(s)
> > about /proc/kallsyms.
>
> I agree about the fact that kptr_restrict is an incomplete security feature.
> However, I disagree that it lacks usefulness entirely. Virtually every public
> kernel exploit in the past year leverages /proc/kallsyms or other kernel
> address leakage to target an attack. I'm not ignorant of the fact that it's
> trivial to fingerprint distribution kernels in the absence of this
> information, but the reality is, a huge portion of real life exploit attempts
> leverage pre-fabricated exploits and are conducted by people who lack the
> ability to adjust exploits to target a specific running kernel. Even though
> this is trivial to sidestep if you know what you're doing, this extra little
> step may mean some script kiddie can't root some poor sysadmin's machine, and
> that's a win. In addition, when more powerful randomization is hopefully
> introduced, blocking access to these pointers will be more essential in
> preserving the lack of knowledge of the location of kernel internals.

Well, but let's think it through further: what happens when we do such a change?

- Script kiddies get thwarted for a few weeks.

- Script authors will laugh and will update their scripts to query rpmfind.net
or other package servers for symbol info.

- After that transition all the exploits will continue to work. They might in
fact be more robust because they can specifically target only package
versions that are known to be exploitable.

- *Useful* tools that do not try to harm the system will stay less useful
forever and that's permanent collateral damage.

I.e. we would have driven the development of *attack* tools to be even more
harmful and will have hurt *useful* tools. Is this really what we want?

> But this is all just for the record I suppose, since it seems that ship has
> sailed.

We can still revert the revert as well, although that is indeed not very common.

> > > Unless we somehow introduced randomness into where we unpack the kernel
> > > each boot, and used System.map as a table of offsets instead of absolute
> > > addresses.
> >
> > Correct. This security feature is IMO only solving a tiny fraction of the
> > problem and is thus in fact hindering the implementation of a *real* layer
> > of protection of kernel absolute addresses:
> >
> > The x86 kernel is relocatable, so slightly randomizing the position of the
> > kernel would be feasible with no overhead on the vast majority of existing
> > distro installs, with just an updated kernel.
> >
> > When exposing randomized RIPs to user-space we could recalculate all RIPs back
> > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
> > form:
> >
> > [ 32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > [ 32.946003] PGD 0
> > [ 32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> > [ 32.946003] last sysfs file:
> > [ 32.946003] CPU 1
> > [ 32.946003] Pid: 1, comm: swapper Tainted: G W 2.6.29-rc1-00190-g37a76bd #10
> > [ 32.946003] RIP: 0010:[<ffffffff80222521>] [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > [ 32.946003] RSP: 0018:ffff88003f977b80 EFLAGS: 00010202
> > [ 32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> > [ 32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> > [ 32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> > [ 32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> > [ 32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> > [ 32.946003] FS: 0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> > [ 32.946003] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > [ 32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> > [ 32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> > [ 32.946003] Stack:
> >
> > Likewise, /proc/kallsyms could pass these addresses as well and the perf
> > call-chain code and other places that sample RIPs could easily convert them to
> > the constant address as well.
> >
> > We'd still leak some information like the relative position of symbols from
> > each other (this can be useful to certain classes of attacks), but we could
> > pretty effectively hide the absolute location of the kernel - which is the most
> > valuable piece of information -.
> >
> > Then the random base has to be protected: i.e. all information leaks of raw
> > kernel RIPs have to be plugged. The nice thing is that this will happen as
> > *bugfixes*: randomized RIPs will not be useful for anything, so any
> > tools/people who rely on them will notice it immediately.
> >
> > I think *that* would be a maintainable and complete security feature to truly
> > hide the exact location of the kernel image. kptr_restrict is not.
> >
>
> I want this feature, as I think it is far more useful and important. This has
> been mentioned before, but no one has stepped up to actually do it.
> Unfortunately, I lack the necessary knowledge of the relevant code to do it
> properly. What's the best way to make this feature a reality?

Agreed, it would be a very useful feature.

I'd suggest to implement it along the lines of:

- First check whether grsecurity or PAX has this implemented already via the
relocation facility - they are pretty good at being paranoid so i'd be
surprised if they didn't think of this already! :-)

- If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel
binary intentionally via a hardcoded parameter. Just see whether you can do
it and whether it works as you expect it. Check /proc/kallsyms changing
after your patch. Enjoy the kernel still working ;-)

- Then promote it to a boot parameter - this way you'll be able to tell
whether there's any hidden build-time assumptions about relocation position.
(there really shouldn't be any given that kexec works just fine - but i'd
suggest this step just in case.)

- Then promote that hack to be a randomized parameter. Marvel at a different,
randomized /proc/kallsyms output at every bootup and enjoy the still working
kernel!

- Then look at all RIP outputs (thanks to your prior efforts they are now
mostly concentrated in the vsprintf code!) and reverse apply the random
offset before it's exported into user-space. wchan, etc. Marvel at the
constant /proc/kallsyms output, fully knowing that the *real* addresses
are randomized.

- Please do not forget to transfer perf RIPs and callchains and marvel at the
well working 'perf top' output.
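
The steps above can be modelled in user-space Python to see what each one is supposed to preserve. This is a toy sketch, not kernel code: the 2 MiB granularity, the eight candidate slots, the symbol table and all helper names are assumptions made for illustration.

```python
import random

ALIGN = 0x200000  # assumed 2 MiB relocation granularity, illustration only
SLOTS = 8         # assumed number of candidate positions


def pick_offset(rng):
    """Choose an aligned load offset; stands in for the randomized parameter."""
    return rng.randrange(SLOTS) * ALIGN


def relocate(symbols, offset):
    """What the running kernel really looks like after relocation."""
    return {name: addr + offset for name, addr in symbols.items()}


def export_view(real_symbols, offset):
    """What user-space would see once the offset is reverse-applied."""
    return {name: addr - offset for name, addr in real_symbols.items()}


link_time = {"get_cur_val": 0xffffffff80222455}  # hypothetical symbol table
offset = pick_offset(random.Random())
real = relocate(link_time, offset)

assert export_view(real, offset) == link_time  # exported view never changes
print(hex(offset), hex(real["get_cur_val"]))
```

The invariant in the final assert is the one the last two steps rely on: whatever offset is chosen at boot, the exported table and the addresses perf sees stay the same.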

At that point the feature will be highly useful already IMO. Remaining work
will be to think through and close down all remaining avenues of RIP leakage.

At this point kptr_restrict will be a lot less relevant - the symbols will
expose offsets (so it's not totally unhelpful to attackers) but not the real
absolute addresses.

Unless i'm missing some particularly difficult roadblock, which is possible.

If you try this then please keep us posted at every step above, even if your
patches are not fully working and useful yet. Maybe some other
details/ideas/suggestions will arise at that point.

Thanks,

Ingo

2011-05-16 16:14:22

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Mon, 2011-05-16 at 17:35 +0200, Ingo Molnar wrote:

> Agreed, it would be a very useful feature.
>
> I'd suggest to implement it along the lines of:
>
> - First check whether grsecurity or PAX has this implemented already via the
> relocation facility - they are pretty good at being paranoid so i'd be
> surprised if they didn't think of this already! :-)
>
> - If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel
> binary intentionally via a hardcoded parameter. Just see whether you can do
> it and whether it works as you expect it. Check /proc/kallsyms changing
> after your patch. Enjoy the kernel still working ;-)
>
> - Then promote it to a boot parameter - this way you'll be able to tell
> whether there's any hidden build-time assumptions about relocation position.
> (there really shouldn't be any given that kexec works just fine - but i'd
> suggest this step just in case.)
>
> - Then promote that hack to be a randomized parameter. Marvel at a different,
> randomized /proc/kallsyms output at every bootup and enjoy the still working
> kernel!
>
> - Then look at all RIP outputs (thanks to your prior efforts they are now
> mostly concentrated in the vsprintf code!) and reverse apply the random
> offset before it's exported into user-space. wchan, etc. Marvel at the
> constant /proc/kallsyms output, fully knowing that the *real* addresses
> are randomized.
>
> - Please do not forget to transfer perf RIPs and callchains and marvel at the
> well working 'perf top' output.
>
> At that point the feature will be highly useful already IMO. Remaining work
> will be to think through and close down all remaining avenues of RIP leakage.
>
> At this point kptr_restrict will be a lot less relevant - the symbols will
> expose offsets (so it's not totally unhelpful to attackers) but not the real
> absolute addresses.
>
> Unless i'm missing some particularly difficult roadblock, which is possible.
>
> If you try this then please keep us posted at every step above, even if your
> patches are not fully working and useful yet. Maybe some other
> details/ideas/suggestions will arise at that point.
>

Thanks for the detailed response. I will attempt to go down this road,
and will keep people posted with my progress.

-Dan

> Thanks,
>
> Ingo

2011-05-17 12:18:08

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Andi Kleen wrote:

> > The x86 kernel is relocatable, so slightly randomizing the position of the
> > kernel would be feasible with no overhead on the vast majority of existing
> > distro installs, with just an updated kernel.
>
> Problem is that we don't have a source of secure randomness early on when the
> relocation would need to happen.

That's indeed a problem but not a fundamental one: we can read out the current
time (the RTC CMOS is available on most systems), mix it with the current
cycle counter value and run the result through a PRNG.

It could only be recovered if the attacker is local to that box, guesses the
precise cycle count on that specific hardware (and hardware has small thermal
variations) and knows the precise boot time to the second as present in the
RTC.

Note that the amount of randomization would be small to begin with: if we have
only 3 bits of randomization and can thus make ~90% of kernel exploit attempts
crash statistically then we have most of the advantages already.
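
As a back-of-the-envelope check of the numbers above, here is a small Python sketch. SHA-256 is used purely as a stand-in for the "PRNG mix" step, the input values are made up, and the function names are hypothetical; early boot code would read the RTC and the cycle counter directly and would use whatever mixing function is practical there.

```python
import hashlib
import struct


def pick_slot(rtc_seconds, cycle_count, bits=3):
    """Mix two weak inputs through a hash and keep `bits` bits of the result."""
    material = struct.pack("<QQ", rtc_seconds, cycle_count)
    digest = hashlib.sha256(material).digest()
    return digest[0] & ((1 << bits) - 1)


def miss_probability(bits):
    """Chance that a single blind guess of the slot is wrong."""
    return 1.0 - 1.0 / (1 << bits)


# Made-up inputs for illustration.
slot = pick_slot(rtc_seconds=1305634688, cycle_count=123456789012)
print(slot)                 # some value in 0..7
print(miss_probability(3))  # 0.875
```

With 3 bits there are 8 possible positions, so a blind guess misses 7 times out of 8, i.e. 87.5%, which is where the "~90%" figure comes from.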

[ For the really paranoid we could add a new flag to the boot protocol and
embed a random seed in the bzImage. This could be re-set upon installation
of a new kernel package, so on the next reboot the system gains a unique
seed.

Or we could add a boot parameter to seed things and cut this particular boot
parameter from all output like /proc/cmdline or the syslog command line
parameters printout. /etc/grub.conf is already inaccessible to unprivileged
userspace on most distros so the parameter is hidden. ]

So it's a solvable problem.

> You could either pass it as an option, but that option would be right now too
> exposed, or just use kexec and boot twice.
>
> But all of this has drawbacks.
>
> > When exposing randomized RIPs to user-space we could recalculate all RIPs back
> > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
> > form:
>
> This would be very confusing because the register and stack contents
> would have the non relocated addresses.

Well, kernel crashes can expose security-relevant details anyway, so they had
better be hidden from unprivileged attackers. The important thing is to
properly decode the symbols. We can keep the rest of the oops in its raw form
(and thus expose the seed to a privileged user, which we'd do anyway); being
dependable is important for oopses.

> I bet it would cause a lot of similar problems as the current %pK madness,
> just more subtle ones.

Well, did you expect me to react to your claim of 'subtle issues'?

If yes (which i assume) then why didn't you outline what you meant with that in
more detail, why are you forcing me to ask you what you mean precisely?

Thanks,

Ingo

2011-05-20 00:56:23

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


> > > > Unless we somehow introduced randomness into where we unpack the kernel
> > > > each boot, and used System.map as a table of offsets instead of absolute
> > > > addresses.
> > >
> > > Correct. This security feature is IMO only solving a tiny fraction of the
> > > problem and is thus in fact hindering the implementation of a *real* layer
> > > of protection of kernel absolute addresses:
> > >
> > > The x86 kernel is relocatable, so slightly randomizing the position of the
> > > kernel would be feasible with no overhead on the vast majority of existing
> > > distro installs, with just an updated kernel.
> > >
> > > When exposing randomized RIPs to user-space we could recalculate all RIPs back
> > > to the 0xffffffff80000000 base, so oopses would have the usual non-randomized
> > > form:
> > >
> > > [ 32.946003] IP: [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > > [ 32.946003] PGD 0
> > > [ 32.946003] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
> > > [ 32.946003] last sysfs file:
> > > [ 32.946003] CPU 1
> > > [ 32.946003] Pid: 1, comm: swapper Tainted: G W 2.6.29-rc1-00190-g37a76bd #10
> > > [ 32.946003] RIP: 0010:[<ffffffff80222521>] [<ffffffff80222521>] get_cur_val+0xcc/0x106
> > > [ 32.946003] RSP: 0018:ffff88003f977b80 EFLAGS: 00010202
> > > [ 32.946003] RAX: 0000000000000001 RBX: ffff8800029c8c80 RCX: 0000000000000008
> > > [ 32.946003] RDX: 0000000000000000 RSI: ffffffff80ce0100 RDI: 0000000000000000
> > > [ 32.946003] RBP: ffff88003f977bd0 R08: 0000000000000004 R09: 0000000000000040
> > > [ 32.946003] R10: 0000000000000060 R11: 0000000081363fa8 R12: ffffffff81c4f0e0
> > > [ 32.946003] R13: ffffffff80ce0100 R14: ffff88003c888a00 R15: 0000000000000000
> > > [ 32.946003] FS: 0000000000000000(0000) GS:ffff88003f802c00(0000) knlGS:0000000000000000
> > > [ 32.946003] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> > > [ 32.946003] CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000006e0
> > > [ 32.946003] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [ 32.946003] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > [ 32.946003] Process swapper (pid: 1, threadinfo ffff88003f976000, task ffff88003f978000)
> > > [ 32.946003] Stack:
> > >
> > > Likewise, /proc/kallsyms could pass these addresses as well and the perf
> > > call-chain code and other places that sample RIPs could easily convert them to
> > > the constant address as well.
> > >
> > > We'd still leak some information like the relative position of symbols from
> > > each other (this can be useful to certain classes of attacks), but we could
> > > pretty effectively hide the absolute location of the kernel - which is the most
> > > valuable piece of information -.
> > >
> > > Then the random base has to be protected: i.e. all information leaks of raw
> > > kernel RIPs have to be plugged. The nice thing is that this will happen as
> > > *bugfixes*: randomized RIPs will not be useful for anything, so any
> > > tools/people who rely on them will notice it immediately.
> > >
> > > I think *that* would be a maintainable and complete security feature to truly
> > > hide the exact location of the kernel image. kptr_restrict is not.
> > >
> >
> > I want this feature, as I think it is far more useful and important. This has
> > been mentioned before, but no one has stepped up to actually do it.
> > Unfortunately, I lack the necessary knowledge of the relevant code to do it
> > properly. What's the best way to make this feature a reality?
>
> Agreed, it would be a very useful feature.
>
> I'd suggest to implement it along the lines of:
>
> - First check whether grsecurity or PAX has this implemented already via the
> relocation facility - they are pretty good at being paranoid so i'd be
> surprised if they didn't think of this already! :-)
>
> - If not then have a look at CONFIG_RELOCATABLE and try to relocate the kernel
> binary intentionally via a hardcoded parameter. Just see whether you can do
> it and whether it works as you expect it. Check /proc/kallsyms changing
> after your patch. Enjoy the kernel still working ;-)
>
> - Then promote it to a boot parameter - this way you'll be able to tell
> whether there's any hidden build-time assumptions about relocation position.
> (there really shouldn't be any given that kexec works just fine - but i'd
> suggest this step just in case.)
>
> - Then promote that hack to be a randomized parameter. Marvel at a different,
> randomized /proc/kallsyms output at every bootup and enjoy the still working
> kernel!
>
> - Then look at all RIP outputs (thanks to your prior efforts they are now
> mostly concentrated in the vsprintf code!) and reverse apply the random
> offset before it's exported into user-space. wchan, etc. Marvel at the
> constant /proc/kallsyms output, fully knowing that the *real* addresses
> are randomized.
>
> - Please do not forget to transfer perf RIPs and callchains and marvel at the
> well working 'perf top' output.
>
> At that point the feature will be highly useful already IMO. Remaining work
> will be to think through and close down all remaining avenues of RIP leakage.
>
> At this point kptr_restrict will be a lot less relevant - the symbols will
> expose offsets (so it's not totally unhelpful to attackers) but not the real
> absolute addresses.
>
> Unless i'm missing some particularly difficult roadblock, which is possible.
>
> If you try this then please keep us posted at every step above, even if your
> patches are not fully working and useful yet. Maybe some other
> details/ideas/suggestions will arise at that point.
>

I was able to boot a relocatable kernel with the decompression location
at a hard-coded offset without too much trouble. Everything seems to
work fine.

However, it occurred to me that even if the kernel image's base address
were randomized at boot, assuming a binary distro kernel it would still
be possible to sidt the address of the IDT and calculate symbol offsets
relative to that. Any thoughts on how to avoid that? Seems difficult.
Another hurdle will be to find a reasonable source of entropy that early
in the boot process.
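
To spell out the arithmetic behind that concern, a short Python sketch follows. All addresses are made up, and it assumes the IDT lives inside the kernel image at an offset that can be read from the distro's System.map; the helper name is hypothetical.

```python
def base_from_leak(leaked_addr, link_time_addr, link_time_base):
    """Infer the randomized base from one leaked address of a known symbol."""
    return leaked_addr - (link_time_addr - link_time_base)


LINK_BASE = 0xffffffff80000000  # base used earlier in this thread
# Hypothetical System.map entry for the IDT in a given distro kernel.
IDT_LINK_ADDR = 0xffffffff80a5c000
# Hypothetical value an attacker reads back with SIDT on a randomized boot.
leaked_idt = 0xffffffff83a5c000

base = base_from_leak(leaked_idt, IDT_LINK_ADDR, LINK_BASE)
print(hex(base))  # 0xffffffff83000000
```

One leaked pointer to any symbol with a known link-time address is enough to recover the base, and from there every other symbol.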

-Dan


> Thanks,
>
> Ingo

2011-05-20 12:08:07

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg wrote:

> I was able to boot a relocatable kernel with the decompression location at a
> hard-coded offset without too much trouble. Everything seems to work fine.

Nice!

> However, it occurred to me that even if the kernel image's base address were
> randomized at boot, assuming a binary distro kernel it would still be
> possible to sidt the address of the IDT and calculate symbol offsets relative
> to that. Any thoughts on how to avoid that? Seems difficult. Another hurdle
> will be to find a reasonable source of entropy that early in the boot
> process.

I do not think it's an issue.

If an attacker can execute arbitrary privileged instructions like SIDT then
it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs that
would tell roughly where the kernel is, etc.

The attack that randomization protects against is one where the attacker has a
limited amount of control over a stack return address (due to a buffer overflow,
for example) and can redirect kernel execution to some 'interesting' place that
allows more control. With SMEP and kernel image randomization this would be
rather difficult to pull off: the kernel won't jump to a pre-prepared user-space
shellcode buffer (due to SMEP) while the location of already existing,
executable, supervisor-privileged pages is randomized.

So when you have implemented this i'd suggest enabling CONFIG_X86_PTDUMP=y to
get access to a dump of all pagetables, in the /debug/kernel_page_tables file.
There you can check every single executable, kernel-privileged mapping on a
live system and make sure it's not easily discovered.

Thanks,

Ingo

2011-05-20 12:55:04

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> * Dan Rosenberg <[email protected]> wrote:
>
> > I was able to boot a relocatable kernel with the decompression location at a
> > hard-coded offset without too much trouble. Everything seems to work fine.
>
> Nice!
>
> > However, it occurred to me that even if the kernel image's base address were
> > randomized at boot, assuming a binary distro kernel it would still be
> > possible to sidt the address of the IDT and calculate symbol offsets relative
> > to that. Any thoughts on how to avoid that? Seems difficult. Another hurdle
> > will be to find a reasonable source of entropy that early in the boot
> > process.
>
> I do not think it's an issue.
>
> If an attacker can execute arbitrary privileged instructions like SIDT then
> it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs that
> would tell roughly where the kernel is, etc.
>

Except that SIDT isn't a privileged instruction; it's accessible from
ring 3.

> The attack randomization protects against is when the attacker has a limited
> amount of control over a stack return address (due to a buffer overflow for
> example) and can redirect kernel execution to some 'interesting' place that
> allows more control. With SMEP and kernel image randomization this would be
> rather difficult to pull off: the kernel won't jump to a pre-prepared user-space
> shellcode buffer (due to SMEP) while the location of already existing,
> executable, supervisor-privileged pages is randomized.
>

Yes, all true, except are you specifically considering remote-only
attack vectors?

> So when you have implemented this i'd suggest enabling CONFIG_X86_PTDUMP=y to
> get access to a dump of all pagetables, in the /debug/kernel_page_tables file.
> There you can check every single executable, kernel-privileged mapping on a
> live system and make sure it's not easily discovered.
>

I'll do this too, but first I'd like to address the above.

Thanks,
Dan

> Thanks,
>
> Ingo

2011-05-20 13:11:21

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg <[email protected]> wrote:

> On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> > * Dan Rosenberg <[email protected]> wrote:
> >
> > > I was able to boot a relocatable kernel with the decompression location at a
> > > hard-coded offset without too much trouble. Everything seems to work fine.
> >
> > Nice!
> >
> > > However, it occurred to me that even if the kernel image's base address were
> > > randomized at boot, assuming a binary distro kernel it would still be
> > > possible to sidt the address of the IDT and calculate symbol offsets relative
> > > to that. Any thoughts on how to avoid that? Seems difficult. Another hurdle
> > > will be to find a reasonable source of entropy that early in the boot
> > > process.
> >
> > I do not think it's an issue.
> >
> > If an attacker can execute arbitrary privileged instructions like SIDT then
> > it's game over. There's plenty of CPU state, the IDT, GDT, various MSRs
> > that would tell roughly where the kernel is, etc.
>
> Except that SIDT isn't a privileged instruction; it's accessible from ring 3.

Oops, stupid me :-/

We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr
and do a load_idt(). Double check places that modify idt_descr or use
idt_table.

Note, you could do this as a side effect of a nice performance optimization:
would you be interested in allocating it in the percpu area, using
percpu_alloc()? That way the IDT is distributed between CPUs - this has
scalability advantages on NUMA systems and maybe even on SMP.

> > The attack randomization protects against is when the attacker has a
> > limited amount of control over a stack return address (due to a buffer
> > overflow for example) and can redirect kernel execution to some
> > 'interesting' place that allows more control. With SMEP and kernel image
> > randomization this would be rather difficult to pull off: the kernel won't
> > jump to a pre-prepared user-space shellcode buffer (due to SMEP) while the
> > location of already existing, executable, supervisor-privileged pages is
> > randomized.
>
> Yes, all true, except are you specifically considering remote-only attack
> vectors?

No, unprivileged local user, so yes, the IDT address has to be protected.

Thanks,

Ingo

2011-05-20 17:41:12

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:

> We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr
> and do a load_idt(). Double check places that modify idt_descr or use
> idt_table.
>
> Note, you could do this as a side effect of a nice performance optimization:
> would you be interested in allocating it in the percpu area, using
> percpu_alloc()? That way the IDT is distributed between CPUs - this has
> scalability advantages on NUMA systems and maybe even on SMP.
>

Any suggestions on when this allocation should take place? I'm hesitant
to touch anything in arch/x86/kernel/head_32.S, where the IDT is set up
and lidt idt_descr is executed (on x86-32 anyway). That means at some
point I'd have to copy the table into a region allocated with
alloc_percpu() and set up a new descriptor. Seems like this should
happen before IRQs are enabled, but I'm not sure about the best place.

Also, I'd still welcome suggestions on generating entropy so early in
the boot process as to randomize the location at which the kernel is
decompressed.

On a related note, would there be obstacles to marking the IDT as
read-only?

Thanks,
Dan

2011-05-20 18:14:36

by Linus Torvalds

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
<[email protected]> wrote:
>
> Also, I'd still welcome suggestions on generating entropy so early in
> the boot process as to randomize the location at which the kernel is
> decompressed.

The fundamental problem with the whole kernel address randomization is
sadly totally unrelated to any of the small details.

There's a *big* detail that makes it hard: there's only a few bits of
randomness we can add to the address. The kernel base address ends up
having various fundamental limitations (cacheline alignment for the
code, and we have several segments that require page alignment), so
you really can't realistically do more than something like 8-12 bits
of address randomization.

Which means that once you have a vmlinux image (say, because it's a
standard distro kernel), you only need to try your exploit a few
hundred times. That can be done quickly enough that no MIS person will
ever have time to react to the attack.

Sure, it will likely leave some hints around (oopses etc), but still..

> On a related note, would there be obstacles to marking the IDT as
> read-only?

We do that for the F00F bug workaround. But while the linear address
is read-only, the IDT can still be accessed read-write through the
physical address through the normal 1:1 mapping.

Regardless, the virtual mapping trick (independently of whether it's
read-only or not) can be used to avoid exposing the *actual* address
of the IDT of the kernel, and would hide the kernel load address
details. However, it does make traps slightly slower, if they cannot
use the 1:1 mapping with large pages for the IDT access and thus cause
more TLB pressure. Of course, in many situations we probably end up
not having large pages for the kernel anyway, so..

As a result, we do that F00F bug workaround _only_ if we're actually
running on a CPU with the F00F bug.

Linus

2011-05-20 18:27:23

by Kees Cook

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, May 20, 2011 at 11:14:09AM -0700, Linus Torvalds wrote:
> On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
> <[email protected]> wrote:
> >
> > Also, I'd still welcome suggestions on generating entropy so early in
> > the boot process as to randomize the location at which the kernel is
> > decompressed.
>
> The fundamental problem with the whole kernel address randomization is
> sadly totally unrelated to any of the small details.
>
> There's a *big* detail that makes it hard: there's only a few bits of
> randomness we can add to the address. The kernel base address ends up
> having various fundamental limitations (cacheline alignment for the
> code, and we have several segments that require page alignment), so
> you really can't realistically do more than something like 8-12 bits
> of address randomization.
>
> Which means that once you have a vmlinux image (say, because it's a
> standard distro kernel), you only need to try your exploit a few
> hundred times. That can be done quickly enough that no MIS person will
> ever have time to react to the attack.
>
> Sure, it will likely leave some hints around (oopses etc), but still..

Certain flaws will present that way, yes. Others will take the entire
system down on the first missed address guess. Many times, ASLR will
give a statistical advantage to the defender. As a result it has value,
even if it's not perfect.

-Kees

--
Kees Cook
Ubuntu Security Team

2011-05-20 18:28:55

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Linus Torvalds <[email protected]> wrote:

> There's a *big* detail that makes it hard: there's only a few bits of
> randomness we can add to the address. The kernel base address ends up having
> various fundamental limitations (cacheline alignment for the code, and we
> have several segments that require page alignment), so you really can't
> realistically do more than something like 8-12 bits of address randomization.

Yeah, i tried to address this issue in my first mail: basically just a few bits
would already make a big difference in practice: even a *single* bit of
randomness makes an exploit crash 50% of the time - at which point the attack
stops being stealthy.

8 bits would be a lot.

So i think this is really realistic, even if a brute force, networked attack
can successfully attack 1 out of 256, 512 or 1024 boxes. Even for the worm case
the networked attack would not scale very well.
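
The arithmetic behind those figures can be sketched in a few lines of C. This is
only an illustration, assuming every base address is equally likely and that an
attacker who is allowed to retry never repeats a wrong guess; the function names
are made up for the example and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

/* Number of equally likely base addresses for a given number of random bits. */
uint64_t slot_count(unsigned bits)
{
    assert(bits < 64);              /* keep the shift well-defined */
    return (uint64_t)1 << bits;
}

/* Probability that a single blind guess hits the right base address. */
double single_guess_success(unsigned bits)
{
    return 1.0 / (double)slot_count(bits);
}

/*
 * Expected number of guesses for an attacker who can retry and never
 * repeats a wrong guess: (N + 1) / 2 for N equally likely slots.
 */
double expected_guesses(unsigned bits)
{
    return ((double)slot_count(bits) + 1.0) / 2.0;
}
```

With 1 bit a single guess succeeds half the time; with 8 bits a single guess
succeeds 1 time in 256, and a retrying attacker needs 128.5 guesses on average,
which only matters if failed guesses do not take the machine down first.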

> Regardless, the virtual mapping trick (independently of whether it's
> read-only or not) can be used to avoid exposing the *actual* address of the
> IDT of the kernel, and would hide the kernel load address details. However,
> it does make traps slightly slower, if they cannot use the 1:1 mapping with
> large pages for the IDT access and thus cause more TLB pressure. Of course,
> in many situations we probably end up not having large pages for the kernel
> anyway, so..

We could put per CPU IDTs into the percpu area if that improves performance.

This might help on NUMA: on NUMA only one node has the IDT local, the others
will take a remote DRAM access every time they miss the IDT - and the IDT could
easily be missed if there are no IRQs or traps for a long time (say CPU-bound
user-space processing).

There may also be cases where an implicit locked access is generated to the
IDT?

Thanks,

Ingo

2011-05-20 18:34:55

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, 2011-05-20 at 11:27 -0700, Kees Cook wrote:
> On Fri, May 20, 2011 at 11:14:09AM -0700, Linus Torvalds wrote:
> > On Fri, May 20, 2011 at 10:41 AM, Dan Rosenberg
> > <[email protected]> wrote:
> > >
> > > Also, I'd still welcome suggestions on generating entropy so early in
> > > the boot process as to randomize the location at which the kernel is
> > > decompressed.
> >
> > The fundamental problem with the whole kernel address randomization is
> > sadly totally unrelated to any of the small details.
> >
> > There's a *big* detail that makes it hard: there's only a few bits of
> > randomness we can add to the address. The kernel base address ends up
> > having various fundamental limitations (cacheline alignment for the
> > code, and we have several segments that require page alignment), so
> > you really can't realistically do more than something like 8-12 bits
> > of address randomization.
> >
> > Which means that once you have a vmlinux image (say, because it's a
> > standard distro kernel), you only need to try your exploit a few
> > hundred times. That can be done quickly enough that no MIS person will
> > ever have time to react to the attack.
> >
> > Sure, it will likely leave some hints around (oopses etc), but still..
>
> Certain flaws will present that way, yes. Others will take the entire
> system down on the first missed address guess. Many times, ASLR will
> give a statistical advantage to the defender. As a result it has value,
> even if it's not perfect.
>

At least one distro (Red Hat) ships with panic_on_oops enabled by
default, so attackers don't get more than one chance. Likewise,
vulnerabilities in interrupt context will only have one chance, as will
any issue where failed exploitation prevents subsequent attempts, as is
frequently the case due to failures to clean up locking primitives on an
OOPS.

-Dan

> -Kees
>
> --
> Kees Cook
> Ubuntu Security Team

2011-05-20 18:35:28

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg <[email protected]> wrote:

> On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:
>
> > We need to allocate the IDT dynamically: just kmalloc() it, update idt_descr
> > and do a load_idt(). Double check places that modify idt_descr or use
> > idt_table.
> >
> > Note, you could do this as a side effect of a nice performance optimization:
> > would you be interested in allocating it in the percpu area, using
> > percpu_alloc()? That way the IDT is distributed between CPUs - this has
> > scalability advantages on NUMA systems and maybe even on SMP.
> >
>
> Any suggestions on when this allocation should take place? I'm hesitant to
> touch anything in arch/x86/kernel/head_32.S, where the IDT is set up and lidt
> idt_descr is executed (on x86-32 anyway). That means at some point I'd have to
> copy the table into a region allocated with alloc_percpu() and set up a new
> descriptor. Seems like this should happen before IRQs are enabled, but I'm not
> sure about the best place.

I think there's a static percpu area that can be used pretty early on.

The boot IDT can be marked __initdata so its space won't be wasted.

The thing is, the boot IDT can be kept until SMP is initialized. So i'd
suggest allocating per CPU IDTs after memory has initialized. For that a pretty
good place is trap_init(): there we already have the page allocator initialized
and probably the percpu allocator too. IDT allocation is also pretty naturally
done in trap_init().

> Also, I'd still welcome suggestions on generating entropy so early in the
> boot process as to randomize the location at which the kernel is
> decompressed.
>
> On a related note, would there be obstacles to marking the IDT as read-only?

The cost is that the TLB entry used to access it may change from a 2MB entry to
a 4K entry. We generally try to keep critical data structures in 2MB mapped
areas.

But this is really hard to measure (you'd have to have a borderline workload
where the loss of a single 4K TLB is measurable) so i'd suggest splitting this
from the randomization step.

Thanks,

Ingo

2011-05-20 18:43:15

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg <[email protected]> wrote:

> At least one distro (Red Hat) ships with panic_on_oops enabled by default, so
> attackers don't get more than one chance. Likewise, vulnerabilities in
> interrupt context will only have one chance, as will any issue where failed
> exploitation prevents subsequent attempts, as is frequently the case due to
> failures to clean up locking primitives on an OOPS.

So it's basically a last line of defense: the attacker has to assume the risk
of the attack being detected.

That has a chilling effect on some types of attacks: especially those where the
attacker goes against a high value target with a zero day kernel exploit.
Risking a crash does not just mean possibly alerting the target, but also means
possibly losing the zero-day exploit - if that oops log gets to a kernel
developer who starts wondering about the weird backtrace.

Thanks,

Ingo

2011-05-22 06:11:57

by David Lang

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, 20 May 2011, Linus Torvalds wrote:

> Sure, it will likely leave some hints around (oopses etc), but still..
>

as an admin running a large site, I'd love to have evidence around that
strange things were happening, especially prior to the exploit running.

no, I probably can't take manual action fast enough to prevent the initial
exploit, but I can alarm on the failed attacks and potentially react fast
enough to prevent the attacker from going further into my network.

David Lang

2011-05-22 18:45:38

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Fri, 2011-05-20 at 15:11 +0200, Ingo Molnar wrote:
> * Dan Rosenberg <[email protected]> wrote:
>
> > On Fri, 2011-05-20 at 14:07 +0200, Ingo Molnar wrote:
> > > * Dan Rosenberg <[email protected]> wrote:
> > >
> > > > I was able to boot a relocatable kernel with the decompression location at a
> > > > hard-coded offset without too much trouble. Everything seems to work fine.
> > >
> > > Nice!
> > >

A progress update, and a number of questions.

The randomization itself is working fine with the following hack in
arch/x86/boot/compressed/head_32.S:

#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
#ifdef CONFIG_RANDOMIZE_BASE
rdtsc
shll $0x8, %eax
andl $0x3ffffff, %eax
addl %eax, %ebx
#endif
movl BP_kernel_alignment(%esi), %eax
decl %eax
addl %eax, %ebx
notl %eax
andl %eax, %ebx
#endif

That brings me to my first two questions:

1. Is it ok to assume the existence of rdtsc? If not, what are other
ways of gathering entropy early in the boot process? If this is the
approach that's going to be taken, system uptime potentially becomes
useful for attackers. Any thoughts on how to address this?

2. The current default physical alignment is 16mb as a result of this
patch:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ceefccc93932b920a8ec6f35f596db05202a12fe

Having 16mb alignment greatly restricts the amount of usable entropy for
randomization. It seems an alternate solution to the problem the above
patch addresses is reverting back to 1mb alignment (or 2/4 mb if that
has performance benefits) and enforcing a 16mb minimum physical start
for relocatable kernels by bumping it up in the boot code if necessary.
Would this be possible? I'd like to avoid requiring distros to touch
CONFIG_PHYSICAL_ALIGN (and risk breaking things) in order to have more
useful randomization.
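
To illustrate question 2, here is a minimal user-space C model of the assembly
hack above; it is a sketch, not the boot code itself. It assumes a load address
of 16 MB purely for the example, and the names randomized_base() and
distinct_bases() are made up. The model applies the same shift, mask and
round-up steps and counts how many different base addresses can result for a
given alignment.

```c
#include <assert.h>
#include <stdint.h>

#define OFFSET_MASK 0x3ffffffu  /* from the andl above: offsets below 64 MB */

/* Model of the hack: shift and mask the TSC, add it, round up to align. */
uint32_t randomized_base(uint32_t load_addr, uint32_t tsc_low, uint32_t align)
{
    uint32_t offset, base;

    assert(align != 0 && (align & (align - 1)) == 0);  /* power of two */
    offset = (tsc_low << 8) & OFFSET_MASK;     /* shll $0x8; andl $0x3ffffff */
    base = load_addr + offset;                 /* addl %eax, %ebx */
    return (base + align - 1) & ~(align - 1);  /* decl/addl/notl/andl */
}

/* Count the distinct bases reachable over every possible masked offset. */
unsigned distinct_bases(uint32_t load_addr, uint32_t align)
{
    unsigned count = 0;
    uint32_t previous = 0;

    /* Bases never decrease as the offset grows, so counting changes suffices. */
    for (uint32_t tsc_low = 0; tsc_low <= (OFFSET_MASK >> 8); tsc_low++) {
        uint32_t base = randomized_base(load_addr, tsc_low, align);

        if (count == 0 || base != previous) {
            count++;
            previous = base;
        }
    }
    return count;
}
```

Under these assumptions, 16 MB alignment leaves only 5 possible bases (roughly
2 bits), while 1 MB alignment leaves 65 (roughly 6 bits). The lowest base is
reached by a single offset value only, so the distribution is not perfectly
uniform.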

A few more questions arose during my efforts:

3. The current hack I'm using to determine the offset to reverse apply
to symbol output looks something like this:

unsigned long kptr_adjust = ((unsigned long)_text &
~(CONFIG_PHYSICAL_ALIGN-1)) - (PAGE_OFFSET + CONFIG_PHYSICAL_START);

Is it safe to assume that kernel .text is the first thing in a
decompressed kernel image? If not, any other suggestions? It seemed
easier to compute this in the decompressed kernel at runtime rather than
try to figure out a way to pass the actual decompression address from
the boot stage to the main kernel.

4. What kind of behavior do people want with %pK and kptr_restrict? If
possible, I'd like to find a way that perf users can have the benefits
of this feature and still have usable symbol support. However, module
symbols are a bit tricky, since they're not being relocated with the
rest of the kernel, and it doesn't seem meaningful to reverse-apply the
same offset in module symbol output. Perhaps a separate format
specifier should be introduced to differentiate symbols that need to be
offset?

Basically, we've got kernel .text symbols, module symbols, and dynamic
kernel pointers, and I'm not sure with what granularity people are
interested in hiding them. It seems perf at least wants more than "all
or nothing".

5. I'd like some more opinions on moving the IDT. So far, the two
options are using a fixed mapping similar to the F00F bug fix, and
allocating it percpu at runtime.
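
For question 3, the offset computation can be tried out in user space with the
following sketch. The three constants are illustrative x86-32 values chosen for
the example (the real ones come from the kernel configuration), and
unrandomize() is a hypothetical helper showing how the offset would be
reverse-applied to a symbol address before printing.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative x86-32 values; assumptions for this sketch only. */
#define SKETCH_PAGE_OFFSET     0xC0000000u
#define SKETCH_PHYSICAL_START  0x01000000u  /* 16 MB */
#define SKETCH_PHYSICAL_ALIGN  0x01000000u  /* 16 MB */

/*
 * Same expression as the kptr_adjust hack: round the runtime address of
 * _text down to the alignment and subtract the unrandomized link address.
 */
uint32_t kptr_adjust(uint32_t text_addr)
{
    assert((SKETCH_PHYSICAL_ALIGN & (SKETCH_PHYSICAL_ALIGN - 1u)) == 0);
    return (text_addr & ~(SKETCH_PHYSICAL_ALIGN - 1u))
           - (SKETCH_PAGE_OFFSET + SKETCH_PHYSICAL_START);
}

/* Hypothetical helper: map a runtime address back to its link-time value. */
uint32_t unrandomize(uint32_t runtime_addr, uint32_t adjust)
{
    return runtime_addr - adjust;
}
```

Because of the round-down, the result stays the same as long as _text lies
anywhere within the first aligned chunk of the image, so in this model .text
does not have to be the very first byte for the computation to hold.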

Looking forward to feedback, criticism, disgust, etc.

Regards,
Dan

2011-05-23 00:25:47

by Dan Rosenberg

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Sun, 2011-05-22 at 16:49 -0700, Tony Luck wrote:
>
>
> On Sun, May 22, 2011 at 11:45 AM, Dan Rosenberg
> <[email protected]> wrote:
> > 1. Is it ok to assume the existence of rdtsc? If not, what are other
> > ways of gathering entropy early in the boot process? If this is the
> > approach that's going to be taken, system uptime potentially becomes
> > useful for attackers. Any thoughts on how to address this?
>
> There is a cpuid bit to tell you whether the processor supports rdtsc.
>

This might be worth checking, but it also might just be easier to
clearly document in the Kconfig that the feature requires rdtsc. I'll
play with it a bit.

> In the cold boot case, I'd worry about whether rdtsc was all that random,
> it counts from zero when the processors come out of cold reset, and
> things should be quite deterministic up to when the kernel loads;
> especially if you have a solid state boot drive rather than the old
> spinning kind.
>

The question really becomes whether it's "good enough". It's certainly
not cryptographic randomness, but it will produce different results on
different boots depending on a variety of factors, including
out-of-order instruction execution, multi-CPU systems, and differences
in CPU models. Even though a single machine may be slightly more likely
to produce certain offsets more often, it seems to me that it would
still accomplish the goal, because it's highly unlikely that an attacker
would be able to perform the statistical analysis necessary to figure
that out.

> Sometime soon you'll have "rdrand" available (check a different cpuid
> bit).
>

This of course is highly preferable, but I'd rather implement a solution
that's widely supported in hardware today. Perhaps further down the
road it could be switched.

> -Tony
>
>

Thanks for the feedback.

-Dan

2011-05-23 00:38:19

by H. Peter Anvin

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On 05/22/2011 05:25 PM, Dan Rosenberg wrote:
>
>> Sometime soon you'll have "rdrand" available (check a different cpuid
>> bit).
>
> This of course is highly preferable, but I'd rather implement a solution
> that's widely supported in hardware today. Perhaps further down the
> road it could be switched.
>

Use the better one that is actually available.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2011-05-23 10:49:20

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Dan Rosenberg <[email protected]> wrote:

> On Sun, 2011-05-22 at 16:49 -0700, Tony Luck wrote:
> >
> >
> > On Sun, May 22, 2011 at 11:45 AM, Dan Rosenberg
> > <[email protected]> wrote:
> > > 1. Is it ok to assume the existence of rdtsc? If not, what are other
> > > ways of gathering entropy early in the boot process? If this is the
> > > approach that's going to be taken, system uptime potentially becomes
> > > useful for attackers. Any thoughts on how to address this?
> >
> > There is a cpuid bit to tell you whether the processor supports rdtsc.
> >
>
> This might be worth checking, but it also might just be easier to clearly
> document in the Kconfig that the feature requires rdtsc. I'll play with it a
> bit.

All modern x86 boxes have an RDTSC so this is not really a practical concern -
other than not crashing old boxes that do not have an RDTSC.

In theory there's other sources of entropy: we could read out the RTC CMOS as
well. We could also read the current CPU frequency and fan speeds - those have
components of thermal noise as well. Obviously it's hard to read this out early
during bootup, when system devices are not enumerated yet.

What would be very useful is to print out the bootup RDTSC value for several
bootups. I've done this on a testbox here, with real full reboots in between,
putting the RDTSC printout very close to where you'd sample it for kernel image
randomization:

[ 0.000000] RDTSC: 26615467048
[ 0.000000] RDTSC: 26527108278
[ 0.000000] RDTSC: 26464738628
[ 0.000000] RDTSC: 26554778708
[ 0.000000] RDTSC: 26441165788
[ 0.000000] RDTSC: 26555252088
[ 0.000000] RDTSC: 26431986988
[ 0.000000] RDTSC: 26521303608
[ 0.000000] RDTSC: 26497878018
[ 0.000000] RDTSC: 26455546968
[ 0.000000] RDTSC: 26467673718
[ 0.000000] RDTSC: 26460404758
[ 0.000000] RDTSC: 26496175038

Even visually there's *plenty* of real randomness in the bootup value of the
cycle counter (look at the above pattern of numbers and squint), even during
early bootup, on real hardware.

While the last few bits are non-random there's at least 10-20 bits of
randomness on this (very simple) testbox - possibly more.

I've done a few tests on virtual hardware as well, KVM based:

[ 0.000000] RDTSC: 208122033
[ 0.000000] RDTSC: 200455104
[ 0.000000] RDTSC: 207258945

As expected there's even more randomness on virtual hardware than on real
hardware: virtual hardware will boot in a complex host CPU state and is so
exposed to the non-deterministic CPU cache state.

So the cycle counter is plenty good for this.

[ I'd only worry about the cycle counter if a hardware platform is so
minimalistic that it boots up very quickly without exposing itself to natural
sources of entropy. Eventually we'll get such systems, but right now the
cycle counter is good enough. ]

Btw, we already rely on the cycle counter for early bootup randomness:
boot_init_stack_canary() relies on it.

> > Sometime soon you'll have "rdrand" available (check a different cpuid bit).
>
> This of course is highly preferable, but I'd rather implement a solution
> that's widely supported in hardware today. Perhaps further down the road it
> could be switched.

Well, since entropy does not get reduced on addition of independent variables
the right sequence is (pseudocode):

rnd = entropy_cycles();
rnd += entropy_rdrand();
rnd += entropy_RTC();
rnd += entropy_system();

That way systems that do not have RDRAND will still have the other ones as
fallback.

Of course in your prototype you can use RDTSC as a first step, just make it
easy to add other sources.

Thanks,

Ingo

2011-05-23 19:03:15

by Ray Lee

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Mon, May 23, 2011 at 3:49 AM, Ingo Molnar <[email protected]> wrote:
> Well, since entropy does not get reduced on addition of independent variables
> the right sequence is (pseudocode):
>
>        rnd  = entropy_cycles();
>        rnd += entropy_rdrand();
>        rnd += entropy_RTC();
>        rnd += entropy_system();

I think you mean concatenation rather than addition? Or perhaps XOR,
or a hash? It's pretty easy to show that the addition of n random
variables evenly distributed between [0, 1] converges to 1/2 n +-
1/sqrt(n) (or numbers to that effect), which gives an attacker better
chances than they would otherwise if they target the center of the
distribution.
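
A small exhaustive check makes the difference concrete. This is only a sketch
with two independent, uniform 4-bit values standing in for the entropy sources,
combined either by plain addition without wrap-around or by concatenation; the
function name is made up for the example.

```c
#include <assert.h>
#include <string.h>

#define BITS   4u
#define VALUES (1u << BITS)  /* 16 possible values per source */

/*
 * Enumerate every (a, b) pair of two uniform 4-bit values and return how
 * many pairs land on the single most likely combined result.
 * concatenate == 0: combine with plain addition (no wrap-around).
 * concatenate != 0: combine by placing a above b, i.e. (a << 4) | b.
 */
unsigned most_common_count(int concatenate)
{
    unsigned counts[VALUES * VALUES];
    unsigned best = 0;

    memset(counts, 0, sizeof(counts));
    for (unsigned a = 0; a < VALUES; a++) {
        for (unsigned b = 0; b < VALUES; b++) {
            unsigned combined = concatenate ? ((a << BITS) | b) : (a + b);

            assert(combined < VALUES * VALUES);
            counts[combined]++;
        }
    }
    for (unsigned i = 0; i < VALUES * VALUES; i++)
        if (counts[i] > best)
            best = counts[i];
    return best;
}
```

With addition, the most likely result is produced by 16 of the 256 pairs, so an
attacker's best guess succeeds 1 time in 16. With concatenation every result is
produced by exactly one pair, so the best guess succeeds 1 time in 256.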

But none of this is to detract from your main point, which still
holds. Structuring it such that other sources of randomness can be
included as available keeps options open.

~r.

2011-05-23 19:36:22

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* Ray Lee <[email protected]> wrote:

> On Mon, May 23, 2011 at 3:49 AM, Ingo Molnar <[email protected]> wrote:
> > Well, since entropy does not get reduced on addition of independent variables
> > the right sequence is (pseudocode):
> >
> >        rnd  = entropy_cycles();
> >        rnd += entropy_rdrand();
> >        rnd += entropy_RTC();
> >        rnd += entropy_system();
>
> I think you mean concatenation rather than addition? Or perhaps XOR, or a
> hash? [...]

Yeah.

In this special case probably concatenation works the best: the above 4 random
variables have total randomness probably less than 32 bits, so we want to
create 4 tight random numbers and concatenate them.

[ XOR would destroy some fair amount of entropy because most of these random
variables have their randomness in their low bits, and a hash would probably
lose about 2 bits and would also be slower. A hash would probably be safer
and more robust though, if we mis-identify any of the random variables. ]
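
A minimal sketch of the concatenation idea in C follows. The 8-bit field width
per source, the choice of taking the low bits, and the names low_bits() and
concat_seed() are all assumptions made for illustration; which bits of each
source actually carry the randomness would have to be measured.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the `bits` low bits of a sample. */
uint32_t low_bits(uint64_t sample, unsigned bits)
{
    assert(bits > 0 && bits < 32);
    return (uint32_t)(sample & (((uint64_t)1 << bits) - 1));
}

/*
 * Concatenate one 8-bit field from each source into a 32-bit seed, so that
 * no source can cancel out or overlap the bits contributed by another.
 */
uint32_t concat_seed(uint64_t cycles, uint64_t rdrand,
                     uint64_t rtc, uint64_t system)
{
    return (low_bits(cycles, 8) << 24) |
           (low_bits(rdrand, 8) << 16) |
           (low_bits(rtc, 8) << 8) |
            low_bits(system, 8);
}
```

A source that is unavailable can simply contribute zero: its field stays empty
while the remaining fields are unaffected.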

Thanks,

Ingo

2011-05-24 02:00:38

by Valdis Klētnieks

Subject: Re: [BUG] perf: bogus correlation of kernel symbols

On Mon, 23 May 2011 12:49:02 +0200, Ingo Molnar said:
> Well, since entropy does not get reduced on addition of independent variables
> the right sequence is (pseudocode):
>
> rnd = entropy_cycles();
> rnd += entropy_rdrand();
> rnd += entropy_RTC();
> rnd += entropy_system();

I'm having trouble convincing myself that RTC and cycles are truly independent
variables.... ;)

Consider the case of a fixed-frequency CPU - if you know the time since boot,
and the current RTC, and the current cycle count, you can work backwards to
find the RTC and cycle count at boot. I'm not sure that a variable clockspeed
helps all that much - an attacker can perhaps find a way to force the highest/
lowest CPU speed - or the system may even helpfully do it for the attacker -
I've seen plenty of misconfigured laptops that force lowest supported CPU
clockspeed on battery rather than race-to-idle.

Having said that, the 13 bootup rdtsc values you list *seem* to have on the
order of 24-28 bits of entropy, and only the lowest-order bit seems to be
non-random (the low-order byte of the 13 values are 28, b6, 44, 54, dc, 78, 2c,
38, 02, 58, 76, 16, and be). So rdtsc appears to be good enough for what we
want here...
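
The low-order bytes above can be reproduced with a short C sketch over the 13
values from Ingo's mail. The helper varying_bits() is made up for the example;
it reports which bit positions take both values somewhere in the sample set,
which is a crude upper bound on where randomness can come from and not an
entropy estimate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 13 bootup RDTSC values quoted earlier in the thread. */
const uint64_t rdtsc_samples[13] = {
    26615467048ULL, 26527108278ULL, 26464738628ULL, 26554778708ULL,
    26441165788ULL, 26555252088ULL, 26431986988ULL, 26521303608ULL,
    26497878018ULL, 26455546968ULL, 26467673718ULL, 26460404758ULL,
    26496175038ULL,
};

/* Mask of bit positions that are 0 in some sample and 1 in another. */
uint64_t varying_bits(const uint64_t *samples, size_t n)
{
    uint64_t all_or = 0;
    uint64_t all_and = ~(uint64_t)0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        all_or |= samples[i];
        all_and &= samples[i];
    }
    return all_or & ~all_and;
}
```

For these samples the low byte of the mask is 0xfe: bit 0 is zero in every
value, while bits 1 through 7 all vary.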


2011-05-24 04:06:41

by Ingo Molnar

Subject: Re: [BUG] perf: bogus correlation of kernel symbols


* [email protected] <[email protected]> wrote:

> On Mon, 23 May 2011 12:49:02 +0200, Ingo Molnar said:
> > Well, since entropy does not get reduced on addition of independent variables
> > the right sequence is (pseudocode):
> >
> > rnd = entropy_cycles();
> > rnd += entropy_rdrand();
> > rnd += entropy_RTC();
> > rnd += entropy_system();
>
> I'm having trouble convincing myself that RTC and cycles are truly independent
> variables.... ;)

Generally the RTC stores absolute time in seconds (it stores the date), while
cycles start new when the CPU is reset.

So they are independent.

The question i think you are asking is whether the fact that we can observe
current values of them after bootup can be used to figure out their value:

> Consider the case of a fixed-frequency CPU - if you know the time since boot,
> and the current RTC, and the current cycle count, you can work backwards to
> find the RTC and cycle count at boot. [...]

Yes, you are correct, if you are local then guessing the RTC to the second
is probably possible.

Guessing the cycle counter's value will be hard: see the natural noise it has
at a fixed instruction after bootup in the same-bzImage test i performed - with
no IRQs having executed at all yet ...

The RTC is still reasonably noisy to external attackers though.

> [...] I'm not sure that a variable clockspeed helps all that much - an
> attacker can perhaps find a way to force the highest/ lowest CPU speed - or
> the system may even helpfully do it for the attacker - I've seen plenty of
> misconfigured laptops that force lowest supported CPU clockspeed on battery
> rather than race-to-idle.

The tests i performed were on a fixed frequency system - the cycle counter was
still largely random during early bootup.

Others should try it too - i've attached a simple patch. Maybe my system has
more bootup noise than others.

> Having said that, the 13 bootup rdtsc values you list *seem* to have on the
> order of 24-28 bits of entropy, and only the lowest-order bit seems to be
> non-random (the low-order byte of the 13 values are 28, b6, 44, 54, dc, 78,
> 2c, 38, 02, 58, 76, 16, and be). So rdtsc appears to be good enough for what
> we want here...

Yeah. And for cases where the rdtsc might be predictable for some weird reason
(say it would be 0 on an old system with no RDTSC), the RTC would give some
minimal fallback seed to make the canary at least not remotely guessable.

Thanks,

Ingo

---
init/main.c | 6 ++++++
1 file changed, 6 insertions(+)

Index: linux/init/main.c
===================================================================
--- linux.orig/init/main.c
+++ linux/init/main.c
@@ -472,6 +472,12 @@ asmlinkage void __init start_kernel(void
*/
boot_init_stack_canary();

+ {
+ u64 cycles = get_cycles();
+
+ printk("RDTSC: %Ld / %08Lx\n", cycles, cycles);
+ }
+
cgroup_init_early();

local_irq_disable();