From: Lai Jiangshan
To: linux-kernel@vger.kernel.org
Cc: x86@kernel.org, Lai Jiangshan, Andy Lutomirski, Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin"
Subject: [PATCH V4 50/50] x86/syscall/64: Move the checking for sysret to C code
Date: Tue, 26 Oct 2021 22:38:48 +0800
Message-Id: <20211026143851.19481-6-jiangshanlai@gmail.com>
X-Mailer: git-send-email 2.19.1.6.gb485710b
In-Reply-To: <20211026141420.17138-1-jiangshanlai@gmail.com>
References: <20211026141420.17138-1-jiangshanlai@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

From: Lai Jiangshan

Like do_fast_syscall_32(), which checks whether it can return to userspace via fast instructions before it returns, do_syscall_64() now also checks, in C code, whether it can use SYSRET to return to userspace before it returns. This allows a bunch of ASM code to be removed.

No functional change intended.
Signed-off-by: Lai Jiangshan
---
 arch/x86/entry/calling.h       | 10 +----
 arch/x86/entry/common.c        | 73 ++++++++++++++++++++++++++++++-
 arch/x86/entry/entry_64.S      | 78 ++--------------------------------
 arch/x86/include/asm/syscall.h |  2 +-
 4 files changed, 78 insertions(+), 85 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 6f9de1c6da73..05da3ef48ee4 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -109,27 +109,19 @@ For 32-bit we have the following conventions - kernel is built with
 	CLEAR_REGS
 .endm
 
-.macro POP_REGS pop_rdi=1 skip_r11rcx=0
+.macro POP_REGS pop_rdi=1
 	popq %r15
 	popq %r14
 	popq %r13
 	popq %r12
 	popq %rbp
 	popq %rbx
-	.if \skip_r11rcx
-	popq %rsi
-	.else
 	popq %r11
-	.endif
 	popq %r10
 	popq %r9
 	popq %r8
 	popq %rax
-	.if \skip_r11rcx
-	popq %rsi
-	.else
 	popq %rcx
-	.endif
 	popq %rdx
 	popq %rsi
 	.if \pop_rdi
diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 6c2826417b33..718045b7a53c 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -70,7 +70,77 @@ static __always_inline bool do_syscall_x32(struct pt_regs *regs, int nr)
 	return false;
 }
 
-__visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
+/*
+ * Change top bits to match the most significant bit (47th or 56th bit
+ * depending on paging mode) in the address to get canonical address.
+ *
+ * If width of "canonical tail" ever becomes variable, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+static __always_inline u64 canonical_address(u64 vaddr)
+{
+	if (IS_ENABLED(CONFIG_X86_5LEVEL) && static_cpu_has(X86_FEATURE_LA57))
+		return ((s64)vaddr << (64 - 57)) >> (64 - 57);
+	else
+		return ((s64)vaddr << (64 - 48)) >> (64 - 48);
+}
+
+/*
+ * Check if it can use SYSRET.
+ *
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ *
+ * Returns 0 to return using IRET or 1 to return using SYSRET.
+ */
+static __always_inline int can_sysret(struct pt_regs *regs)
+{
+	/* In the Xen PV case we must use iret anyway. */
+	if (static_cpu_has(X86_FEATURE_XENPV))
+		return 0;
+
+	/* SYSRET requires RCX == RIP && R11 == RFLAGS */
+	if (regs->ip != regs->cx || regs->flags != regs->r11)
+		return 0;
+
+	/* CS and SS must match SYSRET */
+	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
+		return 0;
+
+	/*
+	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.
+	 */
+	if (regs->cx != canonical_address(regs->cx))
+		return 0;
+
+	/*
+	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
+	 * restore RF properly. If the slowpath sets it for whatever reason, we
+	 * need to restore it correctly.
+	 *
+	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
+	 * trap from userspace immediately after SYSRET.  This would cause an
+	 * infinite loop whenever #DB happens with register state that satisfies
+	 * the opportunistic SYSRET conditions.  For example, single-stepping
+	 * this user code:
+	 *
+	 *           movq	$stuck_here, %rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past 'stuck_here'.
+	 */
+	if (regs->r11 & (X86_EFLAGS_RF | X86_EFLAGS_TF))
+		return 0;
+
+	return 1;
+}
+
+/* Returns 0 to return using IRET or 1 to return using SYSRET. */
+__visible noinstr int do_syscall_64(struct pt_regs *regs, int nr)
 {
 	add_random_kstack_offset();
 	nr = syscall_enter_from_user_mode(regs, nr);
@@ -84,6 +154,7 @@ __visible noinstr void do_syscall_64(struct pt_regs *regs, int nr)
 	instrumentation_end();
 
 	syscall_exit_to_user_mode(regs);
+	return can_sysret(regs);
 }
 #endif
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 19f3e642707b..06b33631494d 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -112,85 +112,15 @@ SYM_INNER_LABEL(entry_SYSCALL_64_after_hwframe, SYM_L_GLOBAL)
 	movslq	%eax, %rsi
 	call	do_syscall_64		/* returns with IRQs disabled */
 
-	/*
-	 * Try to use SYSRET instead of IRET if we're returning to
-	 * a completely clean 64-bit userspace context.  If we're not,
-	 * go to the slow exit path.
-	 * In the Xen PV case we must use iret anyway.
-	 */
-
-	ALTERNATIVE "", "jmp	xenpv_restore_regs_and_return_to_usermode", \
-		X86_FEATURE_XENPV
-
-	movq	RCX(%rsp), %rcx
-	movq	RIP(%rsp), %r11
-
-	cmpq	%rcx, %r11	/* SYSRET requires RCX == RIP */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.
-	 *
-	 * If width of "canonical tail" ever becomes variable, this will need
-	 * to be updated to remain correct on both old and new CPUs.
-	 *
-	 * Change top bits to match most significant bit (47th or 56th bit
-	 * depending on paging mode) in the address.
-	 */
-#ifdef CONFIG_X86_5LEVEL
-	ALTERNATIVE "shl $(64 - 48), %rcx; sar $(64 - 48), %rcx", \
-		"shl $(64 - 57), %rcx; sar $(64 - 57), %rcx", X86_FEATURE_LA57
-#else
-	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
-#endif
-
-	/* If this changed %rcx, it was not canonical */
-	cmpq	%rcx, %r11
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	cmpq	$__USER_CS, CS(%rsp)		/* CS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	movq	R11(%rsp), %r11
-	cmpq	%r11, EFLAGS(%rsp)		/* R11 == RFLAGS */
-	jne	swapgs_restore_regs_and_return_to_usermode
-
-	/*
-	 * SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
-	 * restore RF properly. If the slowpath sets it for whatever reason, we
-	 * need to restore it correctly.
-	 *
-	 * SYSRET can restore TF, but unlike IRET, restoring TF results in a
-	 * trap from userspace immediately after SYSRET.  This would cause an
-	 * infinite loop whenever #DB happens with register state that satisfies
-	 * the opportunistic SYSRET conditions.  For example, single-stepping
-	 * this user code:
-	 *
-	 *           movq	$stuck_here, %rcx
-	 *           pushfq
-	 *           popq %r11
-	 *   stuck_here:
-	 *
-	 * would never get past 'stuck_here'.
-	 */
-	testq	$(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
-	jnz	swapgs_restore_regs_and_return_to_usermode
-
-	/* nothing to check for RSP */
-
-	cmpq	$__USER_DS, SS(%rsp)		/* SS must match SYSRET */
-	jne	swapgs_restore_regs_and_return_to_usermode
+	ALTERNATIVE "testl %eax, %eax; jz swapgs_restore_regs_and_return_to_usermode", \
+		"jmp xenpv_restore_regs_and_return_to_usermode", X86_FEATURE_XENPV
 
 	/*
-	 * We win! This label is here just for ease of understanding
+	 * This label is here just for ease of understanding
 	 * perf profiles. Nothing jumps here.
 	 */
 syscall_return_via_sysret:
-	/* rcx and r11 are already restored (see code above) */
-	POP_REGS pop_rdi=0 skip_r11rcx=1
+	POP_REGS pop_rdi=0
 
 	/*
 	 * Now all regs are restored except RSP and RDI.
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index f7e2d82d24fb..477adea7bac0 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -159,7 +159,7 @@ static inline int syscall_get_arch(struct task_struct *task)
 		? AUDIT_ARCH_I386 : AUDIT_ARCH_X86_64;
 }
 
-void do_syscall_64(struct pt_regs *regs, int nr);
+int do_syscall_64(struct pt_regs *regs, int nr);
 void do_int80_syscall_32(struct pt_regs *regs);
 long do_fast_syscall_32(struct pt_regs *regs);
 
-- 
2.19.1.6.gb485710b