Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754073AbbBYWlP (ORCPT ); Wed, 25 Feb 2015 17:41:15 -0500 Received: from smtp3-g21.free.fr ([212.27.42.3]:38869 "EHLO smtp3-g21.free.fr" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753012AbbBYWlN (ORCPT ); Wed, 25 Feb 2015 17:41:13 -0500 Date: Wed, 25 Feb 2015 23:40:55 +0100 From: Sabrina Dubroca To: Andy Lutomirski Cc: Denys Vlasenko , Andrey Wagin , Ingo Molnar , Linus Torvalds , Oleg Nesterov , Borislav Petkov , "H. Peter Anvin" , Frederic Weisbecker , X86 ML , Alexei Starovoitov , Will Drewry , Kees Cook , LKML Subject: Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs" Message-ID: <20150225224055.GA3678@kria> References: <1423778052-21038-1-git-send-email-dvlasenk@redhat.com> <1423778052-21038-2-git-send-email-dvlasenk@redhat.com> <54EE1799.2000602@redhat.com> <54EE3E94.7060208@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7662 Lines: 186 2015-02-25, 13:59:06 -0800, Andy Lutomirski wrote: > On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko wrote: > > On 02/25/2015 09:10 PM, Andy Lutomirski wrote: > >> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin wrote: > >>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko : > >>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote: > >>>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko : > >>>>>> 64-bit code was using six stack slots less by not saving/restoring > >>>>>> registers which are callee-preserved according to C ABI, > >>>>>> and not allocating space for them. > >>>>>> Only when syscall needed a complete "struct pt_regs", > >>>>>> the complete area was allocated and filled in. > >>>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs" > >>>>>> trick is used, to make nested interrupt stacks easier to unwind. > >>>>>> > >>>>>> This proved to be a source of significant obfuscation and subtle bugs. > >>>>>> For example, stub_fork had to pop the return address, > >>>>>> extend the struct, save registers, and push return address back. Ugly. > >>>>>> ia32_ptregs_common pops return address and "returns" via jmp insn, > >>>>>> throwing a wrench into CPU return stack cache. > >>>>>> > >>>>>> This patch changes code to always allocate a complete "struct pt_regs". > >>>>>> The saving of registers is still done lazily. > >>>>>> > >>>>>> "Partial pt_regs" trick on interrupt stack is retained. > >>>>>> > >>>>>> Macros which manipulate "struct pt_regs" on stack are reworked: > >>>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure. > >>>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code. > >>>>>> SAVE_EXTRA_REGS saves to it all other registers. > >>>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it. > >>>>>> > >>>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with > >>>>>> return pointer. > >>>>>> > >>>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets > >>>>>> instead of magic numbers. > >>>>>> > >>>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS > >>>>>> instead of having it open-coded yet again. > >>>>>> > >>>>>> Patch was run-tested: 64-bit executables, 32-bit executables, > >>>>>> strace works. > >>>>>> Timing tests did not show measurable difference in 32-bit > >>>>>> and 64-bit syscalls. > >>>>> > >>>>> Hello Denys, > >>>>> > >>>>> My test vm doesn't boot with this patch. Could you help to investigate > >>>>> this issue? > >>>> > >>>> I think I found it. This part of my patch is possibly wrong: > >>>> > >>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void) > >>>> #define ARCH_LOCKDEP_SYS_EXIT_IRQ \ > >>>> TRACE_IRQS_ON; \ > >>>> sti; \ > >>>> - SAVE_REST; \ > >>>> + SAVE_EXTRA_REGS; \ > >>>> LOCKDEP_SYS_EXIT; \ > >>>> - RESTORE_REST; \ > >>>> + RESTORE_EXTRA_REGS; \ > >>>> cli; \ > >>>> TRACE_IRQS_OFF; > >>>> > >>>> The "SAVE_REST" here is intended to really *push* extra regs on stack, > >>>> but the patch changed it so that they are written to existing stack > >>>> slots above. > >>>> > >>>> From code inspection it should work in almost all cases, but some > >>>> locations where it is used are really obscure. > >>>> > >>>> If there are places where *pushing* regs is really necessary, > >>>> this can corrupt rbp,rbx,r12-15 registers. > >>>> > >>>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug > >>>> was here. > >>>> Please find updated patch attached. Can you try it? > >>> > >>> It doesn't work > > > > Thanks for testing it anyway. > > > > > >>> [ 3.016262] traps: systemd-cgroups[390] general protection > >>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in > >>> ld-2.18.so[7f456f79e000+20000] > > > > This is what I know about these crashes. The SEGV itself is caused by > > HLT instruction executed by dynamic loader, ld-2.NN.so. > > The instruction is in _exit function, and is only reachable if > > exit_group and exit syscalls fail to terminate the process. > > So it seems that syscall execution is getting badly broken somehow > > at some point. > > > > This happens to both reporters. > > > > My theory that it is related to lockdep seems to be wrong, because > > Sabrina's kernel is not lockdep-enabled, yet it sees the same failure. > > > > Both kernels are paravirtualized, both are booted under KVM, > > Andrey runs it with four virtual CPUs, Sabrina runs with two. > > > > My next theory is that I missed something related to paravirt. > > I am looking at that code, so far I don't see anything suspicious. > > > > Unfortunately, it doesn't happen to me: I have Sabrina's bzImage, > > I run it under "qemu-system-x86_64 -enable-kvm -smp 2", > > I see in dmesg that kernel does detect that it is being run under KVM, > > but it works for me. No mysterious segfaults. > > > > Andrey, can you send me your bzImage? Maybe it will trigger > > the problem for me. > > > > > >> The change to stub_\func looks wrong to me. It saves and restores > >> regs, but those regs might already have been saved if we're on the > >> slow path. (Yes, all that code is quite buggy even without all these > >> patches.) So is execve. > >> > >> This means that, for example, execve called in the slow path will > >> save/restore regs twice. If the values in the regs after the first > >> save and before the second save are different, then we corrupt user > >> state. > > > > This part? > > > > .macro FORK_LIKE func > > ENTRY(stub_\func) > > CFI_STARTPROC > > - popq %r11 /* save return address */ > > - PARTIAL_FRAME 0 > > - SAVE_REST > > - pushq %r11 /* put it back on stack */ > > + DEFAULT_FRAME 0, 8 /* offset 8: return address */ > > + SAVE_EXTRA_REGS 8 > > FIXUP_TOP_OF_STACK %r11, 8 > > - DEFAULT_FRAME 0 8 /* offset 8: return address */ > > call sys_\func > > RESTORE_TOP_OF_STACK %r11, 8 > > - ret $REST_SKIP /* pop extended registers */ > > + ret > > CFI_ENDPROC > > END(stub_\func) > > .endm > > > > FORK_LIKE clone > > FORK_LIKE fork > > FORK_LIKE vfork > > > > But the old code (SAVE_REST thing) was also saving registers here. > > It had to jump through hoops (pop return address, SAVE_REST, > > push return address) to do that. > > After the patch, "SAVE_EXTRA_REGS 8" does the same, just without > > pop/push pair. > > > > I just don't see what's wrong with it. Can you elaborate? > > SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just > writes them in place. It's possible for this to be called when the > regs have already been saved. > > > > > And this area of code has no paravirt gunk, so if the bug is here, > > why it doesn't fail for people running this natively? > > I don't know whether paravirt is involved. It could be something else. After reading Denys's last mail, I tried booting the same VM - with 1 cpu - without CONFIG_PARAVIRT - with x86_64_defconfig and I still get the same traps in all 3 cases. I can run some userspace programs, but I have no idea what would be helpful. I can also try booting a real machine with archlinux/systemd tomorrow. -- Sabrina -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/