Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932456AbbBYV7a (ORCPT ); Wed, 25 Feb 2015 16:59:30 -0500 Received: from mail-lb0-f177.google.com ([209.85.217.177]:42296 "EHLO mail-lb0-f177.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932182AbbBYV72 (ORCPT ); Wed, 25 Feb 2015 16:59:28 -0500 MIME-Version: 1.0 In-Reply-To: <54EE3E94.7060208@redhat.com> References: <1423778052-21038-1-git-send-email-dvlasenk@redhat.com> <1423778052-21038-2-git-send-email-dvlasenk@redhat.com> <54EE1799.2000602@redhat.com> <54EE3E94.7060208@redhat.com> From: Andy Lutomirski Date: Wed, 25 Feb 2015 13:59:06 -0800 Message-ID: Subject: Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs" To: Denys Vlasenko Cc: Andrey Wagin , Ingo Molnar , Linus Torvalds , Oleg Nesterov , Borislav Petkov , "H. Peter Anvin" , Frederic Weisbecker , X86 ML , Alexei Starovoitov , Will Drewry , Kees Cook , LKML Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7016 Lines: 184 On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: > On 02/25/2015 09:10 PM, Andy Lutomirski wrote: >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : >>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote: >>>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : >>>>>> 64-bit code was using six stack slots less by not saving/restoring >>>>>> registers which are callee-preserved according to C ABI, >>>>>> and not allocating space for them. >>>>>> Only when syscall needed a complete "struct pt_regs", >>>>>> the complete area was allocated and filled in. >>>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs" >>>>>> trick is used, to make nested interrupt stacks easier to unwind. >>>>>> >>>>>> This proved to be a source of significant obfuscation and subtle bugs. >>>>>> For example, stub_fork had to pop the return address, >>>>>> extend the struct, save registers, and push return address back. Ugly. >>>>>> ia32_ptregs_common pops return address and "returns" via jmp insn, >>>>>> throwing a wrench into CPU return stack cache. >>>>>> >>>>>> This patch changes code to always allocate a complete "struct pt_regs". >>>>>> The saving of registers is still done lazily. >>>>>> >>>>>> "Partial pt_regs" trick on interrupt stack is retained. >>>>>> >>>>>> Macros which manipulate "struct pt_regs" on stack are reworked: >>>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure. >>>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code. >>>>>> SAVE_EXTRA_REGS saves to it all other registers. >>>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. >>>>>> >>>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with >>>>>> return pointer. >>>>>> >>>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets >>>>>> instead of magic numbers. >>>>>> >>>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS >>>>>> instead of having it open-coded yet again. >>>>>> >>>>>> Patch was run-tested: 64-bit executables, 32-bit executables, >>>>>> strace works. >>>>>> Timing tests did not show measurable difference in 32-bit >>>>>> and 64-bit syscalls. >>>>> >>>>> Hello Denys, >>>>> >>>>> My test vm doesn't boot with this patch. Could you help to investigate >>>>> this issue? >>>> >>>> I think I found it. This part of my patch is possibly wrong: >>>> >>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) >>>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ >>>> TRACE_IRQS_ON; \ >>>> sti; \ >>>> - SAVE_REST; \ >>>> + SAVE_EXTRA_REGS; \ >>>> LOCKDEP_SYS_EXIT; \ >>>> - RESTORE_REST; \ >>>> + RESTORE_EXTRA_REGS; \ >>>> cli; \ >>>> TRACE_IRQS_OFF; >>>> >>>> The "SAVE_REST" here is intended to really *push* extra regs on stack, >>>> but the patch changed it so that they are written to existing stack >>>> slots above. >>>> >>>> From code inspection it should work in almost all cases, but some >>>> locations where it is used are really obscure. >>>> >>>> If there are places where *pushing* regs is really necessary, >>>> this can corrupt rbp,rbx,r12-15 registers. >>>> >>>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug >>>> was here. >>>> Please find updated patch attached. Can you try it? >>> >>> It doesn't work > > Thanks for testing it anyway. > > >>> [ 3.016262] traps: systemd-cgroups[390] general protection >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in >>> ld-2.18.so[7f456f79e000+20000] > > This is what I know about these crashes. The SEGV itself is caused by > HLT instruction executed by dynamic loader, ld-2.NN.so. > The instruction is in _exit function, and is only reachable if > exit_group and exit syscalls fail to terminate the process. > So it seems that syscall execution is getting badly broken somehow > at some point. > > This happens to both reporters. > > My theory that it is related to lockdep seems to be wrong, because > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. > > Both kernels are paravirtualized, both are booted under KVM, > Andrey runs it with four virtual CPUs, Sabrina runs with two. > > My next theory is that I missed something related to paravirt. > I am looking at that code, so far I don't see anything suspicious. > > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, > I run it under "qemu-system-x86_64 -enable-kvm -smp 2", > I see in dmesg that kernel does detect that it is being run under KVM, > but it works for me. No mysterious segfaults. > > Andrey, can you send me your bzImage? Maybe it will trigger > the problem for me. > > >> The change to stub_\func looks wrong to me. It saves and restores >> regs, but those regs might already have been saved if we're on the >> slow path. (Yes, all that code is quite buggy even without all these >> patches.) So is execve. >> >> This means that, for example, execve called in the slow path will >> save/restore regs twice. If the values in the regs after the first >> save and before the second save are different, then we corrupt user >> state. > > This part? > > .macro FORK_LIKE func > ENTRY(stub_\func) > CFI_STARTPROC > - popq %r11 /* save return address */ > - PARTIAL_FRAME 0 > - SAVE_REST > - pushq %r11 /* put it back on stack */ > + DEFAULT_FRAME 0, 8 /* offset 8: return address */ > + SAVE_EXTRA_REGS 8 > FIXUP_TOP_OF_STACK %r11, 8 > - DEFAULT_FRAME 0 8 /* offset 8: return address */ > call sys_\func > RESTORE_TOP_OF_STACK %r11, 8 > - ret $REST_SKIP /* pop extended registers */ > + ret > CFI_ENDPROC > END(stub_\func) > .endm > > FORK_LIKE clone > FORK_LIKE fork > FORK_LIKE vfork > > But the old code (SAVE_REST thing) was also saving registers here. > It had to jump through hoops (pop return address, SAVE_REST, > push return address) to do that. > After the patch, "SAVE_EXTRA_REGS 8" does the same, just without > pop/push pair. > > I just don't see what's wrong with it. Can you elaborate? SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just writes them in place. It's possible for this to be called when the regs have already been saved. > > And this area of code has no paravirt gunk, so if the bug is here, > why it doesn't fail for people running this natively? I don't know whether paravirt is involved. It could be something else. --Andy > > -- > vda > -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/