update: latest x86.git#mm has a pretty much working stack-protector
feature - you can pick it up for testing via:
http://people.redhat.com/mingo/x86.git/README
as [email protected] already indicated in his analysis and
patch, there were multiple bugs hitting us here. The number and scope of
these bugs point to structural problems in how security was approached
in this area. So with these changes we try to go deeper than just
minimally fixing the feature. We've got 15 changes so far in and around
this area:
x86: fix execve with -fstack-protect
x86: exclude vsyscall files from stackprotect
x86: fix stackprotect Makefile rule
x86: fix stackprotector canary updates during context switches
panic: print more informative messages on stackprotect failure
panic: print out stacktrace if DEBUG_BUGVERBOSE
x86: enable stack-protector by default
x86: setup stack canary for the idle threads
x86: fix canary of the boot CPU's idle task
stackprotector: include files
stackprotector: add boot_init_stack_canary()
x86: fix the stackprotector canary of the boot CPU
x86: stackprotector: mix TSC to the boot canary
x86: test the presence of the stackprotector
x86: streamline stackprotector
but we've not completed this work yet. We'll push the independent bits
to Linus ASAP.
Ingo
On 14 Feb 2008 at 18:00, Ingo Molnar wrote:
some comments:
> x86: fix execve with -fstack-protect
the commit comment says:
> This hack was true up to the point the stackprotector added
> another word to the stack frame. Shifting all the addresses
> by 8 bytes, crashing and burning any exec attempt.
actually, that's not the problem here because the canary is in the
local variable area, it doesn't affect the passed arguments. what
happens here is that gcc treats the argument area as owned by the
callee, not the caller, and is allowed to do certain tricks. for ssp
it will make a copy of the struct passed by value into the local
variable area and pass *its* address down, and it won't copy it back
into the original instance stored in the argument area.
so once sys_execve returns, the pt_regs passed by value hasn't changed
at all and its default content will cause a nice double fault (FWIW,
this part took me the longest to debug, being down with a cold didn't
help either ;).
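a minimal sketch of the trap (hypothetical stand-in names; the real
code is x86-64's asmlinkage sys_execve(), whose by-value pt_regs
argument sits on top of the saved-register frame on the kernel stack):

struct regs { long ip, sp; };           /* stand-in for pt_regs */

void load_binary(struct regs *r)
{
        r->ip = 0x400000;               /* new program entry point */
}

long my_execve(struct regs regs)        /* struct passed by value */
{
        /*
         * without ssp, writes through &regs hit the caller-visible
         * argument area, i.e. the real saved registers. with ssp,
         * gcc may materialize 'regs' as a copy in the protected
         * local area and pass the copy's address down - the writes
         * are then lost on return.
         */
        load_binary(&regs);
        return 0;
}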
> x86: fix stackprotector canary updates during context switches
the commit says:
> Note: this means that we call __switch_to() [and its sub-functions]
> still with the old canary, but that is not a problem, both the previous
> and the next task has a high-quality canary. The only (mostly academic)
> disadvantage is that the canary of one task may leak onto the stack of
> another task, increasing the risk of information leaks, were an attacker
> able to read the stack of specific tasks (but not that of others).
the best practical defense against leaking the canary is to change its
value on every syscall but it has some performance impact in microbenchmarks.
> x86: stackprotector: mix TSC to the boot canary
this should probably be implemented in a general get_random_long()
function...
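a sketch of what such a helper could look like (get_random_long() did
not exist at the time; get_random_bytes() and get_cycles() are the real
kernel primitives, the mixing policy here is just illustrative):

#include <linux/random.h>       /* get_random_bytes() */
#include <asm/timex.h>          /* get_cycles() */

unsigned long get_random_long(void)
{
        unsigned long val = 0;

        get_random_bytes(&val, sizeof(val));    /* core kernel RNG */
        val ^= (unsigned long)get_cycles();     /* mix in the TSC */

        return val;
}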
> x86: test the presence of the stackprotector
> x86: streamline stackprotector
the config comment says -fstack-protector whereas you're really enabling
the stronger -fstack-protector-all feature (without the latter the ssp
test function sample_stackprotector_fn wouldn't even be instrumented ;-).
* [email protected] <[email protected]> wrote:
> the commit comment says:
>
> > This hack was true up to the point the stackprotector added another
> > word to the stack frame. Shifting all the addresses by 8 bytes,
> > crashing and burning any exec attempt.
>
> actually, that's not the problem here because the canary is in the
> local variable area, it doesn't affect the passed arguments. what
> happens here is that gcc treats the argument area as owned by the
> callee, not the caller, and is allowed to do certain tricks. for ssp it
> will make a copy of the struct passed by value into the local variable
> area and pass *its* address down, and it won't copy it back into the
> original instance stored in the argument area.
>
> so once sys_execve returns, the pt_regs passed by value hasn't changed
> at all and its default content will cause a nice double fault (FWIW,
> this part took me the longest to debug, being down with a cold didn't
> help either ;).
heh, indeed :-) I saw the double fault and was happy that you showed how
to fix it (it did not seem obvious to me _at all_, and i've seen my fair
share of fixes) and i quickly lured myself into having understood it :-/
I've now updated the commit message with your (correct) description.
> > Note: this means that we call __switch_to() [and its sub-functions]
> > still with the old canary, but that is not a problem, both the
> > previous and the next task has a high-quality canary. The only
> > (mostly academic) disadvantage is that the canary of one task may
> > leak onto the stack of another task, increasing the risk of
> > information leaks, were an attacker able to read the stack of
> > specific tasks (but not that of others).
>
> the best practical defense against leaking the canary is to change its
> value on every syscall but it has some performance impact in
> microbenchmarks.
yeah, that's not really practical (especially as it would deplete our
entropy pool pretty fast, which could in some circumstances introduce a
higher risk to the system than the risk of a canary leaking).
I think we can avoid the leakage across tasks by being careful at
context-switch time: never calling __switch_to() with the old canary
still in the PDA. I think this should be fairly easy as we'd just have
to load the new pdaptr in the switch_to() assembly code.
[ the only (even more academic) possibility here would be an NMI hitting
during the two instructions that update RSP and load the new pdaptr,
leaving the old canary on the stack. But NMIs go to separate stack
frames anyway on 64-bit so this isnt really a practical worry IMO. ]
Hm?
> > x86: stackprotector: mix TSC to the boot canary
>
> this should probably be implemented in a general get_random_long()
> function...
yeah ... this is the second time i could have used it recently.
> > x86: test the presence of the stackprotector
> > x86: streamline stackprotector
>
> the config comment says -fstack-protector whereas you're really
> enabling the stronger -fstack-protector-all feature (without the latter
> the ssp test function sample_stackprotector_fn wouldn't even be
> instrumented ;-).
yeah, indeed. I fixed the comments and flipped the commits around :-)
TODO: perhaps all vsyscall functions need to move into separate .o
files. (probably academic as the functions affected by that tend to be
very simple and do not deal with anything overflowable.)
Ingo
On 14 Feb 2008 at 20:00, Ingo Molnar wrote:
> > the best practical defense against leaking the canary is to change its
> > value on every syscall but it has some performance impact in
> > microbenchmarks.
>
> yeah, that's not really practical (especially as it would deplete our
> entropy pool pretty fast, which could in some circumstances introduce a
> higher risk to the system than the risk of a canary leaking).
you don't necessarily have to use the heavy-handed ip id code as it
is used now, random32 is plenty good here.
> I think we can avoid the leakage across tasks by being careful during
> context-switch time: never calling with the old canary still in the PDA.
> I think this should be fairly easy as we'd just have to load the new
> pdaptr in the switch_to() assembly code.
i don't think you have to worry about cross-task leaking at all, a
hypothetical exploit is happy to learn its own canary and that's
actually easier than learning some other task's canary by virtue
of bugs that leak uninitialized struct padding and stuff...
really, the best defense is to reduce the useful lifetime of any
leaked canary, and you can't get better than syscall granularity
without disproportional effort and impact elsewhere (and i'm sure
some would find even this disproportional ;).
> TODO: perhaps all vsyscall functions need to move into separate .o
> files. (probably academic as the functions affected by that tend to be
> very simple and do not deal with anything overflowable.)
yeah, i wasn't suggesting it for its security value, more for 'full
coverage'. if such a separation isn't useful otherwise (no idea whether
keeping the .vsyscall* code away from the related kernel code would be
confusing for the reader, for example), i'd say this isn't important.
* [email protected] <[email protected]> wrote:
> really, the best defense is to reduce the useful lifetime of any
> leaked canary, and you can't get better than syscall granularity
> without disproportional effort and impact elsewhere (and i'm sure some
> would find even this disproportional ;).
hm, i think per syscall canaries are really expensive.
The per function call overhead from stackprotector is already pretty
serious IMO, but at least that's something that GCC _could_ be doing
(much) smarter (why doesnt it jne forward out to __check_stk_failure,
instead of generating 4 instructions, one of them a default-mispredicted
branch instruction??), so that overhead could in theory be something
like 4 fall-through instructions per function, instead of the current 6.
Also, even with -fstack-protector-all, gcc could detect the _provably_
correct and overflow-safe functions and skip their annotation
altogether. And those are the ones that hurt most: the small and
straightforward (and also, most of the time, overflow-free) ones.
The heuristics for -fstack-protector were really botched: as you noted,
they happened to mark vmsplice as "safe". So either we do the full thing
or nothing - anything in between is just russian roulette and a false
sense of security.
But per syscall canary randomization is something that would be a drag
on our null syscall latency forever. I'm really unsure about it. It will
also be driven in the wrong direction: people would try to "optimize"
the per syscall random number generation, which is a constant erosive
force on its strength, and there's no strong counter-force. (and no
clear boundary either: 'random32 should be plenty enough' is too vague
as a limit.)
So lets perhaps also analyze the security effects of not having per
system call canaries and make the best out of it: the main threat is
information leaks via padding and similar initialization mistakes [which
have almost zero accidental discovery rate], right?
So could we perhaps somehow turn kernel-space information leaks into
functional, and hence detectable failures? Wouldnt for example kmemcheck
notice them, once they are copied to user-space? [ If you havent seen
kmemcheck yet i think you'll love it, it's a similarly neat trick as the
pte noexec trick you did in the PaX kernel. It's similarly disgusting as
well ;-) ]
> > TODO: perhaps all vsyscall functions need to move into separate .o
> > files. (probably academic as the functions affected by that tend to
> > be very simple and do not deal with anything overflowable.)
>
> yeah, i wasn't suggesting it for its security value, more for 'full
> coverage'. if such a separation isn't useful otherwise (no idea whether
> keeping the .vsyscall* code away from the related kernel code would be
> confusing for the reader, for example), i'd say this isn't important.
perhaps it would be more useful to have some "no stack protector" gcc
attribute. We could then stick that into the __vsyscall definition and
it's all done and self-maintaining from that point on. (because a
missing __vsyscall annotation results in a runtime failure.) (I'd fully
expect GCC to not have provided such an attribute.)
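a sketch of how that could look (no such attribute existed in the gcc of
the day; much later gcc releases did grow a 'no_stack_protector'
function attribute - the section name here follows the existing
__vsyscall convention):

/* hypothetical per-function opt-out, folded into the vsyscall marker: */
#define __vsyscall(nr)                                          \
        __attribute__((no_stack_protector,                      \
                       __section__(".vsyscall_" #nr)))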
but yes, i agree in general that it's easier to maintain something if
it's full coverage, that way one does not have to keep in mind (and
cannot accidentally forget ;-) the exceptions.
Ingo
On Thu, 14 Feb 2008 21:25:35 +0100
Ingo Molnar <[email protected]> wrote:
>
> * [email protected] <[email protected]> wrote:
>
> > really, the best defense is to reduce the useful lifetime of any
> > leaked canary, and you can't get better than syscall granularity
> > without disproportional effort and impact elsewhere (and i'm sure
> > some would find even this disproportional ;).
>
> hm, i think per syscall canaries are really expensive.
it's not that bad. Assuming you use a PRNG that you re-seed
periodically, it's:
* go to the next random number with the PRNG
* write it to the PDA and the task struct
give or take 10 cycles total if you squeeze it hard, 20 if you don't.
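a rough sketch of those two steps (random32() is the in-kernel
Tausworthe PRNG; the stack_canary fields match the x86-64 task struct
and PDA layout of the time, but treat this as illustrative):

#include <linux/sched.h>        /* struct task_struct */
#include <linux/random.h>       /* random32() */
#include <asm/pda.h>            /* write_pda() */

static inline void refresh_stack_canary(struct task_struct *tsk)
{
        unsigned long canary = random32();      /* step the PRNG */

        tsk->stack_canary = canary;             /* task struct copy */
        write_pda(stack_canary, canary);        /* %gs-based PDA copy */
}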
On 14 Feb 2008 at 21:25, Ingo Molnar wrote:
> * [email protected] <[email protected]> wrote:
>
> > really, the best defense is to reduce the useful lifetime of any
> > leaked canary, and you can't get better than syscall granularity
> > without disproportional effort and impact elsewhere (and i'm sure some
> > would find even this disproportional ;).
>
> hm, i think per syscall canaries are really expensive.
it depends on how it's implemented and on what counts as expensive. i'd
say offering this as a config option would make sense for those who want
to turn ssp from a debugging feature (which it is at the moment) into
a security one (i know i will do it once i port the kernel stack
randomization to amd64 in PaX, it's the same code path).
speaking of that, for years PaX has had a feature to randomize the kernel
stack ptr on each kernel->user transition based on rdtsc (using only a few
bits of it obviously). its performance impact is really hard to measure
on anything (i mean, macrobenchmarks) and frankly, people blindly enable
it without having second thoughts about its impact because it just doesn't
matter.
as Arjan said, this ssp canary randomization could be done cheaply from
a PRNG, or heck, even a simple ror on it by a random amount from rdtsc
will raise the bar enough (and the PRNG state is itself subject to info
leaking whereas rdtsc isn't ;).
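for illustration, the ror variant could be as small as this (a sketch;
rdtscll() is the raw TSC read, and 64-bit rotates mask the count to 6
bits in hardware so no extra masking is needed):

#include <asm/msr.h>            /* rdtscll() */

static inline unsigned long perturb_canary(unsigned long canary)
{
        unsigned long tsc;

        rdtscll(tsc);                           /* raw TSC read */
        asm("rorq %%cl, %0"                     /* rotate by low TSC bits */
            : "+r" (canary)
            : "c" (tsc));
        return canary;
}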
in short, i suggest that this per-syscall canary be offered as a config
option at least and use something cheap and not particularly strong
(in crypto terms) just to raise the bar enough that no attacker would
accept the risk of panic'ing the box should the opportunity arise.
> The per function call overhead from stackprotector is already pretty
> serious IMO,
it's a mov/mov/.../mov/xor/jz on the normal path basically, it's a
few cycles except for the forward jz but that should be easy to fix
in gcc.
with that said, this whole ssp business should have been changed into
return address verification a long time ago (i.e., use the saved return
address as the canary itself); that would both detect stack overflows
and, for practical purposes, prevent ret2libc style attacks.
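conceptually something like this (illustrative only - a real
implementation would live in the compiler's prologue/epilogue, and
__builtin_return_address() here merely stands in for reading the saved
return slot on the stack):

#include <linux/kernel.h>       /* panic() */
#include <linux/compiler.h>     /* barrier() */

void instrumented_fn(void)
{
        void *const expected = __builtin_return_address(0);

        /* ... function body that may overflow a stack buffer ... */
        barrier();      /* keep gcc from caching the first read */

        if (__builtin_return_address(0) != expected)
                panic("stack: return address corrupted");
}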
> But per syscall canary randomization is something that would be a drag
> on our null syscall latency forever. I'm really unsure about it. It will
> also be driven in the wrong direction: people would try to "optimize"
> the per syscall random number generation which is a constant erosive
> force on its strength and there's no strong counter-force. (and no clear
> boundary either 'random32 should be plenty enough' is too vague as
> limit.)
i think you're getting lost in the details a bit here. the purpose of
changing the canary is to create a guaranteed minimum risk for an
attacker abusing a leaked canary. obviously a constant (per task) canary
means 0 risk for an attacker, so we can only get better than that. how
much better we want to get is mainly limited by what is considered cheap
implementation-wise and 'good enough' from a risk assessment point of
view.
say, if you were to flip a random bit in the canary then a leaked canary
would give an attacker a 50% chance to succeed and a 50% chance to panic
the box. even that is already high (for an attacker), but you see what
i'm getting at when you increase the number of randomly changed bits in
the canary: with k randomly changed bits, a stale canary survives with a
probability of only 1 in 2^k.
in practice, there's no need to change all 32/64 bits, that's for crypto
nuts only; for real life attackers there's no difference between 99% and
99.99999% chances of panic'ing a box (or even a box farm: one alarm
raised is an alarm that will draw attention regardless of how many other
boxes panic'ed). that's why i keep saying that as simple a thing as
using rdtsc/xor/ror/etc is 'plenty good'. i know it's not good for a
cryptanalyst in a generic setting, but this is a special situation (if
you fail to guess you don't get to try infinitely, you fail hard) and we
can relax the crypto requirements.
> So lets perhaps also analyze the security effects of not having per
> system call canaries and make the best out of it: the main threat is
> information leaks via padding and similar initialization mistakes [which
> have almost zero accidental discovery rate], right?
that was only an example for a bug class that could provide/leak such
info, but obviously programmers are a lot more creative in accomplishing
the same.
the problem with kmemcheck is that it's a debugging aid, it obviously
cannot be used in production. also it's not comprehensive when you have
bugs that leak memory the way the do_brk (or mremap, i forget) one did:
a valid userland pte kept pointing to a freed and later reused page. i'm
not saying that one would use such a bug for canary leaking when it can
be abused a lot better in other ways, it's just to say that there are
many more bug classes in the kernel than in userland.
On Thu, Feb 14, 2008 at 09:25:35PM +0100, Ingo Molnar wrote:
> The per function call overhead from stackprotector is already pretty
> serious IMO, but at least that's something that GCC _could_ be doing
> (much) smarter (why doesnt it jne forward out to __check_stk_failure,
> instead of generating 4 instructions, one of them a default-mispredicted
> branch instruction??), so that overhead could in theory be something
> like 4 fall-through instructions per function, instead of the current 6.
Where do you see a mispredicted branch?
int foo (void)
{
  char buf[64];
  bar (buf);
  return 6;
}
-O2 -fstack-protector -m64:
subq $88, %rsp
movq %fs:40, %rax
movq %rax, 72(%rsp)
xorl %eax, %eax
movq %rsp, %rdi
call bar
movq 72(%rsp), %rdx
xorq %fs:40, %rdx
movl $6, %eax
jne .L5
addq $88, %rsp
ret
.L5:
.p2align 4,,6
.p2align 3
call __stack_chk_fail
-O2 -fstack-protector -m32:
pushl %ebp
movl %esp, %ebp
subl $88, %esp
movl %gs:20, %eax
movl %eax, -4(%ebp)
xorl %eax, %eax
leal -68(%ebp), %eax
movl %eax, (%esp)
call bar
movl $6, %eax
movl -4(%ebp), %edx
xorl %gs:20, %edx
jne .L5
leave
ret
.L5:
.p2align 4,,7
.p2align 3
call __stack_chk_fail
-O2 -fstack-protector -m64 -mcmodel=kernel:
subq $88, %rsp
movq %gs:40, %rax
movq %rax, 72(%rsp)
xorl %eax, %eax
movq %rsp, %rdi
call bar
movq 72(%rsp), %rdx
xorq %gs:40, %rdx
movl $6, %eax
jne .L5
addq $88, %rsp
ret
.L5:
.p2align 4,,6
.p2align 3
call __stack_chk_fail
both with gcc 4.1.x and 4.3.0.
BTW, you can use -fstack-protector --param=ssp-buffer-size=4
etc. to tweak the size of buffers to trigger stack protection, the
default is 8, but e.g. whole Fedora is compiled with 4.
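for example (a sketch; a function whose largest buffer is 4 bytes is
instrumented with --param=ssp-buffer-size=4 but not with the default
threshold of 8 - exact triggering can vary by gcc version):

extern void bar(char *);

void small_buf(void)
{
        char buf[4];    /* below the default threshold, at the lowered one */

        bar(buf);
}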
Jakub
On 14 Feb 2008 at 17:35, Jakub Jelinek wrote:
> Where do you see a mispredicted branch?
try it with -Os (gentoo gcc 4.2.2):
0000000000000000 <foo>:
0: 48 83 ec 58 sub $0x58,%rsp
4: 65 48 8b 04 25 28 00 00 00 mov %gs:0x28,%rax
d: 48 89 44 24 48 mov %rax,0x48(%rsp)
12: 31 c0 xor %eax,%eax
14: 48 89 e7 mov %rsp,%rdi
17: e8 00 00 00 00 callq 1c <foo+0x1c> 18: R_X86_64_PC32 bar+0xfffffffffffffffc
1c: 48 8b 54 24 48 mov 0x48(%rsp),%rdx
21: 65 48 33 14 25 28 00 00 00 xor %gs:0x28,%rdx
2a: b8 06 00 00 00 mov $0x6,%eax
2f: 74 05 je 36 <foo+0x36>
31: e8 00 00 00 00 callq 36 <foo+0x36> 32: R_X86_64_PC32 __stack_chk_fail+0xfffffffffffffffc
36: 48 83 c4 58 add $0x58,%rsp
3a: c3 retq
* Jakub Jelinek <[email protected]> wrote:
> On Thu, Feb 14, 2008 at 09:25:35PM +0100, Ingo Molnar wrote:
> > The per function call overhead from stackprotector is already pretty
> > serious IMO, but at least that's something that GCC _could_ be doing
> > (much) smarter (why doesnt it jne forward out to __check_stk_failure,
> > instead of generating 4 instructions, one of them a default-mispredicted
> > branch instruction??), so that overhead could in theory be something
> > like 4 fall-through instructions per function, instead of the current 6.
>
> Where do you see a mispredicted branch?
ah!
> int foo (void)
> {
>   char buf[64];
>   bar (buf);
>   return 6;
> }
>
> -O2 -fstack-protector -m64:
> subq $88, %rsp
> movq %fs:40, %rax
> movq %rax, 72(%rsp)
> xorl %eax, %eax
> movq %rsp, %rdi
> call bar
> movq 72(%rsp), %rdx
> xorq %fs:40, %rdx
> movl $6, %eax
> jne .L5
> addq $88, %rsp
> ret
> .L5:
> .p2align 4,,6
> .p2align 3
> call __stack_chk_fail
i got this:
.file ""
.text
.globl foo
.type foo, @function
foo:
.LFB2:
pushq %rbp
.LCFI0:
movq %rsp, %rbp
.LCFI1:
subq $208, %rsp
.LCFI2:
movq __stack_chk_guard(%rip), %rax
movq %rax, -8(%rbp)
xorl %eax, %eax
movl $3, %eax
movq -8(%rbp), %rdx
xorq __stack_chk_guard(%rip), %rdx
je .L3
call __stack_chk_fail
.L3:
leave
ret
but that's F8's gcc 4.1, and not the kernel mode code generator either.
the code you cited looks far better - that's good news!
one optimization would be to do a 'jne' straight into __stack_chk_fail()
- it's not like we ever want to return. [and it's obvious from the
existing stackframe which function failed] That way we'd save about 3
bytes per function? We dont want to return to the original function so
for the kernel it would be OK.
another potential optimization would be to exchange this:
> subq $88, %rsp
> movq %fs:40, %rax
> movq %rax, 72(%rsp)
into:
pushq %fs:40
subq $80, %rsp
or am i missing something? (is there perhaps an address generation
dependency between the pushq and the subq? or would the canary end up at
the wrong position?)
> both with gcc 4.1.x and 4.3.0. BTW, you can use -fstack-protector
> --param=ssp-buffer-size=4 etc. to tweak the size of buffers to trigger
> stack protection, the default is 8, but e.g. whole Fedora is compiled
> with 4.
ok. is -fstack-protector-all basically equivalent to
--param=ssp-buffer-size=0 ? I'm wondering whether it would be easy for
gcc to completely skip stackprotector code on functions that have no
buffers, even under -fstack-protector-all. (perhaps it already does?)
Ingo
* [email protected] <[email protected]> wrote:
> On 14 Feb 2008 at 17:35, Jakub Jelinek wrote:
>
> > Where do you see a mispredicted branch?
>
> try it with -Os (gentoo gcc 4.2.2):
>
> 0000000000000000 <foo>:
> 0: 48 83 ec 58 sub $0x58,%rsp
> 4: 65 48 8b 04 25 28 00 00 00 mov %gs:0x28,%rax
> d: 48 89 44 24 48 mov %rax,0x48(%rsp)
> 12: 31 c0 xor %eax,%eax
> 14: 48 89 e7 mov %rsp,%rdi
> 17: e8 00 00 00 00 callq 1c <foo+0x1c> 18: R_X86_64_PC32 bar+0xfffffffffffffffc
> 1c: 48 8b 54 24 48 mov 0x48(%rsp),%rdx
> 21: 65 48 33 14 25 28 00 00 00 xor %gs:0x28,%rdx
> 2a: b8 06 00 00 00 mov $0x6,%eax
> 2f: 74 05 je 36 <foo+0x36>
> 31: e8 00 00 00 00 callq 36 <foo+0x36> 32: R_X86_64_PC32 __stack_chk_fail+0xfffffffffffffffc
> 36: 48 83 c4 58 add $0x58,%rsp
> 3a: c3 retq
ah, that's what my kernel uses too - and CONFIG_CC_OPTIMIZE_FOR_SIZE=y
is the default for Fedora too.
hm, especially those big addressing mode mov's and xor's look rather
nasty.
Ingo