MIME-Version: 1.0
In-Reply-To: <54EE3E94.7060208@redhat.com>
References: <1423778052-21038-1-git-send-email-dvlasenk@redhat.com>
 <1423778052-21038-2-git-send-email-dvlasenk@redhat.com> <CANaxB-zH5k1cKYMA+tOJxgLhenadJQmdtGktBE6EAUOExsqmEg@mail.gmail.com>
 <54EE1799.2000602@redhat.com> <CANaxB-xtTzwAw22LT1wBWD5V4yBcf5yv-YtvvOp3QtXG76hgTg@mail.gmail.com>
 <CALCETrWWHxqZo+aaDrPhJKEORMWG5Fn4-8Hq6B19yNX++GLb5g@mail.gmail.com> <54EE3E94.7060208@redhat.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Wed, 25 Feb 2015 13:59:06 -0800
Message-ID: <CALCETrVFzm-nXwJSgjhLuHS6B1HT70Q0CsxybjJUmVAKALL5NA@mail.gmail.com>
Subject: Re: [PATCH 2/3 v3] x86: entry_64.S: always allocate complete "struct pt_regs"
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrey Wagin <avagin@gmail.com>, Ingo Molnar <mingo@redhat.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Oleg Nesterov <oleg@redhat.com>, Borislav Petkov <bp@alien8.de>,
        "H. Peter Anvin" <hpa@zytor.com>,
        Frederic Weisbecker <fweisbec@gmail.com>, X86 ML <x86@kernel.org>,
        Alexei Starovoitov <ast@plumgrid.com>, Will Drewry <wad@chromium.org>,
        Kees Cook <keescook@chromium.org>, LKML <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 7016
Lines: 184

On Wed, Feb 25, 2015 at 1:28 PM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 02/25/2015 09:10 PM, Andy Lutomirski wrote:
>> On Wed, Feb 25, 2015 at 11:59 AM, Andrey Wagin <avagin@gmail.com> wrote:
>>> 2015-02-25 21:42 GMT+03:00 Denys Vlasenko <dvlasenk@redhat.com>:
>>>> On 02/25/2015 01:37 PM, Andrey Wagin wrote:
>>>>> 2015-02-13 0:54 GMT+03:00 Denys Vlasenko <dvlasenk@redhat.com>:
>>>>>> 64-bit code was using six stack slots less by not saving/restoring
>>>>>> registers which are callee-preserved according to C ABI,
>>>>>> and not allocating space for them.
>>>>>> Only when syscall needed a complete "struct pt_regs",
>>>>>> the complete area was allocated and filled in.
>>>>>> As an additional twist, on interrupt entry a "slightly less truncated pt_regs"
>>>>>> trick is used, to make nested interrupt stacks easier to unwind.
>>>>>>
>>>>>> This proved to be a source of significant obfuscation and subtle bugs.
>>>>>> For example, stub_fork had to pop the return address,
>>>>>> extend the struct, save registers, and push return address back. Ugly.
>>>>>> ia32_ptregs_common pops return address and "returns" via jmp insn,
>>>>>> throwing a wrench into CPU return stack cache.
>>>>>>
>>>>>> This patch changes code to always allocate a complete "struct pt_regs".
>>>>>> The saving of registers is still done lazily.
>>>>>>
>>>>>> "Partial pt_regs" trick on interrupt stack is retained.
>>>>>>
>>>>>> Macros which manipulate "struct pt_regs" on stack are reworked:
>>>>>> ALLOC_PT_GPREGS_ON_STACK allocates the structure.
>>>>>> SAVE_C_REGS saves to it those registers which are clobbered by C code.
>>>>>> SAVE_EXTRA_REGS saves to it all other registers.
>>>>>> Corresponding RESTORE_* and REMOVE_PT_GPREGS_FROM_STACK macros reverse it.
>>>>>>
>>>>>> ia32_ptregs_common, stub_fork and friends lost their ugly dance with
>>>>>> return pointer.
>>>>>>
>>>>>> LOAD_ARGS32 in ia32entry.S now uses symbolic stack offsets
>>>>>> instead of magic numbers.
>>>>>>
>>>>>> error_entry and save_paranoid now use SAVE_C_REGS + SAVE_EXTRA_REGS
>>>>>> instead of having it open-coded yet again.
>>>>>>
>>>>>> Patch was run-tested: 64-bit executables, 32-bit executables,
>>>>>> strace works.
>>>>>> Timing tests did not show measurable difference in 32-bit
>>>>>> and 64-bit syscalls.
>>>>>
>>>>> Hello Denys,
>>>>>
>>>>> My test vm doesn't boot with this patch. Could you help to investigate
>>>>> this issue?
>>>>
>>>> I think I found it. This part of my patch is possibly wrong:
>>>>
>>>> @@ -171,9 +171,9 @@ static inline int arch_irqs_disabled(void)
>>>>  #define ARCH_LOCKDEP_SYS_EXIT_IRQ      \
>>>>         TRACE_IRQS_ON; \
>>>>         sti; \
>>>> -       SAVE_REST; \
>>>> +       SAVE_EXTRA_REGS; \
>>>>         LOCKDEP_SYS_EXIT; \
>>>> -       RESTORE_REST; \
>>>> +       RESTORE_EXTRA_REGS; \
>>>>         cli; \
>>>>         TRACE_IRQS_OFF;
>>>>
>>>> The "SAVE_REST" here is intended to really *push* extra regs on stack,
>>>> but the patch changed it so that they are written to existing stack
>>>> slots above.
>>>>
>>>> From code inspection it should work in almost all cases, but some
>>>> locations where it is used are really obscure.
>>>>
>>>> If there are places where *pushing* regs is really necessary,
>>>> this can corrupt rbp,rbx,r12-15 registers.
>>>>
>>>> Your config has CONFIG_LOCKDEP=y, I think it's worth trying whether the bug
>>>> was here.
>>>> Please find updated patch attached. Can you try it?
>>>
>>> It doesn't work
>
> Thanks for testing it anyway.
>
>
>>> [    3.016262] traps: systemd-cgroups[390] general protection
>>> ip:7f456f7b6028 sp:7fffdc059718 error:0 in
>>> ld-2.18.so[7f456f79e000+20000]
>
> This is what I know about these crashes. The SEGV itself is caused by
> HLT instruction executed by dynamic loader, ld-2.NN.so.
> The instruction is in _exit function, and is only reachable if
> exit_group and exit syscalls fail to terminate the process.
> So it seems that syscall execution is getting badly broken somehow
> at some point.
>
> This happens to both reporters.
>
> My theory that it is related to lockdep seems to be wrong, because
> Sabrina's kernel is not lockdep-enabled, yet it sees the same failure.
>
> Both kernels are paravirtualized, both are booted under KVM,
> Andrey runs it with four virtual CPUs, Sabrina runs with two.
>
> My next theory is that I missed something related to paravirt.
> I am looking at that code, so far I don't see anything suspicious.
>
> Unfortunately, it doesn't happen to me: I have Sabrina's bzImage,
> I run it under "qemu-system-x86_64 -enable-kvm -smp 2",
> I see in dmesg that kernel does detect that it is being run under KVM,
> but it works for me. No mysterious segfaults.
>
> Andrey, can you send me your bzImage? Maybe it will trigger
> the problem for me.
>
>
>> The change to stub_\func looks wrong to me.  It saves and restores
>> regs, but those regs might already have been saved if we're on the
>> slow path.  (Yes, all that code is quite buggy even without all these
>> patches.)  So is execve.
>>
>> This means that, for example, execve called in the slow path will
>> save/restore regs twice.  If the values in the regs after the first
>> save and before the second save are different, then we corrupt user
>> state.
>
> This part?
>
>         .macro FORK_LIKE func
>  ENTRY(stub_\func)
>         CFI_STARTPROC
> -       popq    %r11                    /* save return address */
> -       PARTIAL_FRAME 0
> -       SAVE_REST
> -       pushq   %r11                    /* put it back on stack */
> +       DEFAULT_FRAME 0, 8              /* offset 8: return address */
> +       SAVE_EXTRA_REGS 8
>         FIXUP_TOP_OF_STACK %r11, 8
> -       DEFAULT_FRAME 0 8               /* offset 8: return address */
>         call sys_\func
>         RESTORE_TOP_OF_STACK %r11, 8
> -       ret $REST_SKIP          /* pop extended registers */
> +       ret
>         CFI_ENDPROC
>  END(stub_\func)
>         .endm
>
>         FORK_LIKE  clone
>         FORK_LIKE  fork
>         FORK_LIKE  vfork
>
> But the old code (SAVE_REST thing) was also saving registers here.
> It had to jump through hoops (pop return address, SAVE_REST,
> push return address) to do that.
> After the patch, "SAVE_EXTRA_REGS 8" does the same, just without
> pop/push pair.
>
> I just don't see what's wrong with it. Can you elaborate?

SAVE_REST pushed the regs onto the stack, whereas SAVE_EXTRA_REGS just
writes them in place.  It's possible for this to be called when the
regs have already been saved.

>
> And this area of code has no paravirt gunk, so if the bug is here,
> why it doesn't fail for people running this natively?

I don't know whether paravirt is involved.  It could be something else.

--Andy

>
> --
> vda
>


-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/