From: Brian Gerst
Date: Mon, 5 Mar 2018 11:41:01 -0500
Subject: Re: [PATCH 11/34] x86/entry/32: Handle Entry from Kernel-Mode on Entry-Stack
To: Joerg Roedel
Cc: Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    the arch/x86 maintainers, Linux Kernel Mailing List, Linux-MM,
    Linus Torvalds, Andy Lutomirski, Dave Hansen, Josh Poimboeuf,
    Juergen Gross, Peter Zijlstra, Borislav Petkov, Jiri Kosina,
    Boris Ostrovsky, David Laight, Denys Vlasenko, Eduardo Valentin,
    Greg KH, Will Deacon, "Liguori, Anthony", Daniel Gruss,
    Hugh Dickins, Kees Cook, Andrea Arcangeli, Waiman Long,
    Pavel Machek, Joerg Roedel
In-Reply-To: <1520245563-8444-12-git-send-email-joro@8bytes.org>
References: <1520245563-8444-1-git-send-email-joro@8bytes.org>
            <1520245563-8444-12-git-send-email-joro@8bytes.org>
On Mon, Mar 5, 2018 at 5:25 AM, Joerg Roedel wrote:
> From: Joerg Roedel
>
> It can happen that we enter the kernel from kernel-mode and
> on the entry-stack. The most common way this happens is when
> we get an exception while loading the user-space segment
> registers on the kernel-to-userspace exit path.
>
> The segment loading needs to be done after the entry-stack
> switch, because the stack-switch needs kernel %fs for
> per_cpu access.
>
> When this happens, we need to make sure that we leave the
> kernel with the entry-stack again, so that the interrupted
> code-path runs on the right stack when switching to the
> user-cr3.
>
> We do this by detecting this condition on kernel-entry by
> checking CS.RPL and %esp, and if it happens, we copy over
> the complete content of the entry stack to the task-stack.
> This needs to be done because once we enter the exception
> handlers we might be scheduled out or even migrated to a
> different CPU, so that we can't rely on the entry-stack
> contents. We also leave a marker in the stack-frame to
> detect this condition on the exit path.
>
> On the exit path the copy is reversed, we copy all of the
> remaining task-stack back to the entry-stack and switch
> to it.
>
> Signed-off-by: Joerg Roedel
> ---
>  arch/x86/entry/entry_32.S | 110 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 109 insertions(+), 1 deletion(-)
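(To make the flow above concrete before the assembly, here is a rough
C-level sketch of what the entry-path check and copy amount to. All
constants and the helper name are illustrative stand-ins, not the
kernel's actual interfaces; the authoritative version is the
SWITCH_TO_KERNEL_STACK macro in the diff below.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins; real values come from the kernel headers. */
#define SEGMENT_RPL_MASK    0x3u        /* low two bits of CS: privilege level */
#define CS_FROM_ENTRY_STACK (1u << 31)  /* marker bit introduced by this patch */
#define SIZEOF_entry_stack  0x1000u     /* assumed entry-stack size/alignment */
#define MASK_entry_stack    (~((uintptr_t)SIZEOF_entry_stack - 1))
#define PTREGS_SIZE         68u         /* assumed 32-bit struct pt_regs size */

/*
 * Sketch of SWITCH_TO_KERNEL_STACK: 'frame' points at the pt_regs just
 * pushed on the entry stack, 'saved_cs' at its CS slot, and
 * 'task_stack_top' at the top of the per-task kernel stack.  Returns
 * the new stack pointer on the task stack.
 */
static uint8_t *switch_to_kernel_stack(uint8_t *frame, uint32_t *saved_cs,
				       uint8_t *task_stack_top)
{
	size_t nbytes;

	if ((*saved_cs & SEGMENT_RPL_MASK) == 0) {
		/*
		 * Entered from kernel mode while on the entry stack:
		 * copy everything from the frame up to the top of the
		 * entry stack, because the entry-stack contents can't
		 * be trusted once C code runs, and mark the frame so
		 * the exit path can reverse the copy.
		 */
		uintptr_t top = ((uintptr_t)frame & MASK_entry_stack)
				+ SIZEOF_entry_stack;

		nbytes = top - (uintptr_t)frame;
		*saved_cs |= CS_FROM_ENTRY_STACK;
	} else {
		/* Normal entry from user mode: copy just the pt_regs frame. */
		nbytes = PTREGS_SIZE;
	}

	memcpy(task_stack_top - nbytes, frame, nbytes);
	return task_stack_top - nbytes;
}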
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index bb0bd896..3a84945 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -299,6 +299,9 @@
>   * copied there. So allocate the stack-frame on the task-stack and
>   * switch to it before we do any copying.
>   */
> +
> +#define CS_FROM_ENTRY_STACK	(1 << 31)
> +
>  .macro SWITCH_TO_KERNEL_STACK
>
>  	ALTERNATIVE	"", "jmp .Lend_\@", X86_FEATURE_XENPV
>
> @@ -320,6 +323,10 @@
>  	/* Load top of task-stack into %edi */
>  	movl	TSS_entry_stack(%edi), %edi
>
> +	/* Special case - entry from kernel mode via entry stack */
> +	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
> +	jz	.Lentry_from_kernel_\@
> +
>  	/* Bytes to copy */
>  	movl	$PTREGS_SIZE, %ecx
>
> @@ -333,8 +340,8 @@
>  	 */
>  	addl	$(4 * 4), %ecx
>
> -.Lcopy_pt_regs_\@:
>  #endif
> +.Lcopy_pt_regs_\@:
>
>  	/* Allocate frame on task-stack */
>  	subl	%ecx, %edi
>
> @@ -350,6 +357,56 @@
>  	cld
>  	rep movsl
>
> +	jmp	.Lend_\@
> +
> +.Lentry_from_kernel_\@:
> +
> +	/*
> +	 * This handles the case when we enter the kernel from
> +	 * kernel-mode and %esp points to the entry-stack. When this
> +	 * happens we need to switch to the task-stack to run C code,
> +	 * but switch back to the entry-stack again when we approach
> +	 * iret and return to the interrupted code-path. This usually
> +	 * happens when we hit an exception while restoring user-space
> +	 * segment registers on the way back to user-space.
> +	 *
> +	 * When we switch to the task-stack here, we can't trust the
> +	 * contents of the entry-stack anymore, as the exception handler
> +	 * might be scheduled out or moved to another CPU. Therefore we
> +	 * copy the complete entry-stack to the task-stack and set a
> +	 * marker in the iret-frame (bit 31 of the CS dword) to detect
> +	 * what we've done on the iret path.

We don't need to worry about preemption changing the entry stack.  The
faults that IRET or segment loads can generate just run the exception
fixup handler and return.  Interrupts were disabled when the fault
occurred, so the kernel cannot be preempted.  The other case to watch
is #DB on SYSENTER, but that simply returns and doesn't sleep either.

We can keep the same process as the existing debug/NMI handlers - leave
the current exception pt_regs on the entry stack and just switch to the
task stack for the call to the handler.  Then switch back to the entry
stack and continue.  No copying needed.

> +	 *
> +	 * On the iret path we copy everything back and switch to the
> +	 * entry-stack, so that the interrupted kernel code-path
> +	 * continues on the same stack it was interrupted with.
> +	 *
> +	 * Be aware that an NMI can happen anytime in this code.
> +	 *
> +	 * %esi: Entry-Stack pointer (same as %esp)
> +	 * %edi: Top of the task stack
> +	 */
> +
> +	/* Calculate number of bytes on the entry stack in %ecx */
> +	movl	%esi, %ecx
> +
> +	/* %ecx to the top of entry-stack */
> +	andl	$(MASK_entry_stack), %ecx
> +	addl	$(SIZEOF_entry_stack), %ecx
> +
> +	/* Number of bytes on the entry stack to %ecx */
> +	sub	%esi, %ecx
> +
> +	/* Mark stackframe as coming from entry stack */
> +	orl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)

Not all 32-bit processors will zero-extend segment pushes.  You will
need to explicitly clear the bit in the case where we didn't switch
CR3.
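(In C terms, the concern reads roughly as follows; the helper name and
flag layout are assumptions for illustration only. On CPUs that push
segment registers without zero-extending, the upper 16 bits of the
saved CS slot may hold junk, so the marker bit has to be cleared
explicitly on the path that does not set it.)

#include <stdint.h>

#define CS_FROM_ENTRY_STACK (1u << 31)  /* marker bit from the patch */

/*
 * Sketch of the suggested fix: set the marker only in the
 * from-entry-stack case, and explicitly clear it otherwise, because a
 * segment push may leave garbage in the upper 16 bits of the saved CS
 * slot on some 32-bit CPUs.
 */
static void mark_saved_cs(uint32_t *saved_cs, int from_entry_stack)
{
	if (from_entry_stack)
		*saved_cs |= CS_FROM_ENTRY_STACK;   /* flag for the exit path */
	else
		*saved_cs &= ~CS_FROM_ENTRY_STACK;  /* don't trust stale bits */
}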
> +
> +	/*
> +	 * %esi and %edi are unchanged, %ecx contains the number of
> +	 * bytes to copy. The code at .Lcopy_pt_regs_\@ will allocate
> +	 * the stack-frame on task-stack and copy everything over
> +	 */
> +	jmp	.Lcopy_pt_regs_\@
> +
>  .Lend_\@:
>  .endm
>
> @@ -408,6 +465,56 @@
>  .endm
>
>  /*
> + * This macro handles the case when we return to kernel-mode on the iret
> + * path and have to switch back to the entry stack.
> + *
> + * See the comments below the .Lentry_from_kernel_\@ label in the
> + * SWITCH_TO_KERNEL_STACK macro for more details.
> + */
> +.macro PARANOID_EXIT_TO_KERNEL_MODE
> +
> +	/*
> +	 * Test if we entered the kernel with the entry-stack. Most
> +	 * likely we did not, because this code only runs on the
> +	 * return-to-kernel path.
> +	 */
> +	testl	$CS_FROM_ENTRY_STACK, PT_CS(%esp)
> +	jz	.Lend_\@
> +
> +	/* Unlikely slow-path */
> +
> +	/* Clear marker from stack-frame */
> +	andl	$(~CS_FROM_ENTRY_STACK), PT_CS(%esp)
> +
> +	/* Copy the remaining task-stack contents to entry-stack */
> +	movl	%esp, %esi
> +	movl	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %edi
> +
> +	/* Bytes on the task-stack to ecx */
> +	movl	PER_CPU_VAR(cpu_current_top_of_stack), %ecx
> +	subl	%esi, %ecx
> +
> +	/* Allocate stack-frame on entry-stack */
> +	subl	%ecx, %edi
> +
> +	/*
> +	 * Save future stack-pointer, we must not switch until the
> +	 * copy is done, otherwise the NMI handler could destroy the
> +	 * contents of the task-stack we are about to copy.
> +	 */
> +	movl	%edi, %ebx
> +
> +	/* Do the copy */
> +	shrl	$2, %ecx
> +	cld
> +	rep movsl
> +
> +	/* Safe to switch to entry-stack now */
> +	movl	%ebx, %esp
> +
> +.Lend_\@:
> +.endm
> +/*
>   * %eax: prev task
>   * %edx: next task
>   */
>
> @@ -765,6 +872,7 @@ restore_all:
>
>  restore_all_kernel:
>  	TRACE_IRQS_IRET
> +	PARANOID_EXIT_TO_KERNEL_MODE
>  	RESTORE_REGS 4
>  	jmp	.Lirq_return
>
> --
> 2.7.4

--
Brian Gerst
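(For completeness, the exit-path reversal that PARANOID_EXIT_TO_KERNEL_MODE
implements in the quoted diff corresponds to roughly this C sketch; the
helper name and parameters are again illustrative assumptions, not the
kernel's real interfaces.)

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CS_FROM_ENTRY_STACK (1u << 31)  /* marker bit from the patch */

/*
 * Sketch of PARANOID_EXIT_TO_KERNEL_MODE: if the entry path flagged
 * this frame as having come from the entry stack, copy the remaining
 * task-stack contents back to the entry stack and return the new
 * stack pointer there; otherwise leave the stack alone.  The copy is
 * done before the switch so an NMI arriving in between still sees an
 * intact task stack.
 */
static uint8_t *paranoid_exit_to_kernel_mode(uint8_t *esp, uint32_t *saved_cs,
					     uint8_t *task_stack_top,
					     uint8_t *entry_stack_top)
{
	size_t nbytes;
	uint8_t *new_esp;

	if (!(*saved_cs & CS_FROM_ENTRY_STACK))
		return esp;                     /* common case: nothing to do */

	*saved_cs &= ~CS_FROM_ENTRY_STACK;      /* clear the marker */

	nbytes  = task_stack_top - esp;         /* bytes left on the task stack */
	new_esp = entry_stack_top - nbytes;     /* frame location on entry stack */

	memcpy(new_esp, esp, nbytes);           /* copy first ... */
	return new_esp;                         /* ... then "switch" stacks */
}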