From: Andy Lutomirski
Date: Thu, 15 Feb 2018 23:29:42 +0000
Subject: Re: [PATCH RFC v2 4/6] x86: Disable PTI on compatibility mode
To: Nadav Amit, Pavel Emelyanov, Cyrill Gorcunov, Linus Torvalds
Cc: Andy Lutomirski, Ingo Molnar, Thomas Gleixner, Peter Zijlstra,
    Dave Hansen, Willy Tarreau, X86 ML, LKML
In-Reply-To: <9EB804CA-0EC9-4CBB-965A-F3C8520201E7@gmail.com>
References: <20180215163602.61162-1-namit@vmware.com>
 <20180215163602.61162-5-namit@vmware.com>
 <9EB804CA-0EC9-4CBB-965A-F3C8520201E7@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Feb 15, 2018 at 8:58 PM, Nadav Amit wrote:
> Andy Lutomirski wrote:
>
>> On Thu, Feb 15, 2018 at 4:36 PM, Nadav Amit wrote:
>>> Based on the understanding that there should be no way for userspace to
>>> address the kernel-space from compatibility mode, disable it while
>>> running in compatibility mode as long as the 64-bit code segment of the
>>> user is not used.
>>>
>>> Reenabling PTI is performed by restoring NX-bits to the userspace
>>> mappings, flushing the TLBs, and notifying all the CPUs that use the
>>> affected mm to disable PTI. Each core responds by removing the present
>>> bit for the 64-bit code-segment, and marking that PTI is disabled on
>>> that core.
>>
>> I dislike this patch because it's conflating two things. The patch
>> claims to merely disable PTI for compat tasks, whatever those are.
>> But it's also introducing a much stronger concept of what a compat
>> task is. The kernel currently mostly doesn't care whether a task is
>> "compat" or not, and I think that most remaining code paths that do
>> care are buggy and should be removed.
>>
>> I think the right way to approach this is to add a new arch_prctl()
>> that changes allowable bitness, like this:
>>
>> arch_prctl(ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS32 | X86_ALLOW_CS64);
>>
>> this would set the current task to work the normal way, where 32-bit
>> and 64-bit CS are available. You could set just X86_ALLOW_CS32 to
>> deny 64-bit mode and just X86_ALLOW_CS64 to deny 32-bit mode. This
>> would make nice attack surface reduction tools for the more paranoid
>> sandbox users to use. Doing arch_prctl(ARCH_SET_ALLOWED_GDT_CS, 0)
>> would return -EINVAL.
>>
>> A separate patch would turn PTI off if you set X86_ALLOW_CS32.
>>
>> This has the downside that old code doesn't get the benefit without
>> some code change, but that's not the end of the world.
>>
>>> +static void pti_cpu_update_func(void *info)
>>> +{
>>> +        struct mm_struct *mm = (struct mm_struct *)info;
>>> +
>>> +        if (mm != this_cpu_read(cpu_tlbstate.loaded_mm))
>>> +                return;
>>> +
>>> +        /*
>>> +         * Keep CS64 and CPU settings in sync despite potential concurrent
>>> +         * updates.
>>> +         */
>>> +        set_cpu_pti_disable(READ_ONCE(mm->context.pti_disable));
>>> +}
>>
>> I don't like this at all. IMO a sane implementation should never
>> change PTI status on a remote CPU. Just track it per task.
>
> From your comments in the past, I understood that this requirement
> (enabling/disabling PTI per mm) came from Linus. I can do it per task, but
> that means that the NX-bit will always be cleared on the kernel page-table
> for user mappings.

I disagree with Linus here, although I could be convinced. But there's
still no need for an IPI -- even if it were per-mm, PTI could stay on
until the next kernel entry.
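[For concreteness, the knob proposed above might be exercised from
userspace roughly as follows. This is a sketch only: ARCH_SET_ALLOWED_GDT_CS
and the X86_ALLOW_CS* flags do not exist in any kernel, so the numeric
values below are placeholders.]

/*
 * Illustrative only: ARCH_SET_ALLOWED_GDT_CS and the X86_ALLOW_CS* flags
 * are hypothetical; the numeric values are placeholders.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARCH_SET_ALLOWED_GDT_CS 0x1021          /* placeholder code */
#define X86_ALLOW_CS32          (1 << 0)        /* placeholder: permit 32-bit CS */
#define X86_ALLOW_CS64          (1 << 1)        /* placeholder: permit 64-bit CS */

int main(void)
{
        /*
         * A 64-bit-only sandbox opts out of compatibility mode entirely;
         * a kernel implementing the proposal would then refuse any later
         * switch to a 32-bit GDT code segment by this task.
         */
        if (syscall(SYS_arch_prctl, ARCH_SET_ALLOWED_GDT_CS, X86_ALLOW_CS64))
                perror("arch_prctl(ARCH_SET_ALLOWED_GDT_CS)");
        return 0;
}

[A 32-bit-only caller would pass just X86_ALLOW_CS32, which is the case
the separate PTI-off patch mentioned above would key on.]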
>
>>> +void __pti_reenable(void)
>>> +{
>>> +        struct mm_struct *mm = current->mm;
>>> +        int cpu;
>>> +
>>> +        if (!mm_pti_disable(mm))
>>> +                return;
>>> +
>>> +        /*
>>> +         * Prevent a spurious page-fault storm while we set the NX-bit
>>> +         * but have not yet updated the per-CPU pti_disable flag.
>>> +         */
>>> +        down_write(&mm->mmap_sem);
>>> +
>>> +        if (!mm_pti_disable(mm))
>>> +                goto out;
>>> +
>>> +        /*
>>> +         * First, mark that PTI is enabled. Although we don't do anything
>>> +         * yet, we are safe as long as we do not reenable CS64. Since we
>>> +         * did not update the page tables yet, this may lead to spurious
>>> +         * page-faults, but we need the pti_disable in mm to be set for
>>> +         * __pti_set_user_pgd() to do the right thing. Holding mmap_sem
>>> +         * prevents those faults from swamping the system.
>>> +         */
>>> +        mm->context.pti_disable = PTI_DISABLE_OFF;
>>> +
>>> +        /* Second, restore the NX bits. */
>>> +        pti_update_user_pgds(mm, true);
>>
>> You're holding mmap_sem, but there are code paths that touch page
>> tables that don't hold mmap_sem, such as the stack extension code.
>
> Do they change user mappings? These are the only ones pti_update_user_pgds()
> changes.

I think they do -- it's the user stack. page_table_lock may be more
appropriate, although I'm far from an expert in this.

>
>>> +
>>> +bool pti_handle_segment_not_present(long error_code)
>>> +{
>>> +        if (!static_cpu_has(X86_FEATURE_PTI))
>>> +                return false;
>>> +
>>> +        if ((unsigned short)error_code != GDT_ENTRY_DEFAULT_USER_CS << 3)
>>> +                return false;
>>> +
>>> +        pti_reenable();
>>> +        return true;
>>> +}
>>
>> Please don't. You're trying to emulate the old behavior here, but
>> you're emulating it wrong. In particular, you won't trap on LAR.
>
> Yes, I thought I’d manage to address LAR, but failed. I thought you said
> this is not a “show-stopper”. I’ll adapt your approach of using prctl,
> although it really limits the benefit of this mechanism.

It's possible we could get away with adding the prctl but making the
default be that only the bitness that matches the program being run is
allowed. After all, it's possible that CRIU is literally the only
program that switches bitness using the GDT. (DOSEMU2 definitely does
cross-bitness stuff, but it uses the LDT as far as I know.) And I've
never been entirely sure that CRIU fully counts toward the Linux
"don't break ABI" guarantee.

Linus, how would you feel about, by default, preventing 64-bit programs
from long-jumping to __USER32_CS and vice versa? I think it has some
value as a hardening measure. I've certainly engaged in some exploit
shenanigans myself that took advantage of the ability to long jump/ret
to change bitness at will. This wouldn't affect users of modify_ldt()
-- 64-bit programs could still create and use their own private 32-bit
segments with modify_ldt(), and seccomp can (and should!) prevent that
in sandboxed programs.

In general, I prefer an approach where everything is explicit to an
approach where we almost, but not quite, emulate the weird historical
behavior.

Pavel and Cyrill, how annoying would it be if CRIU had to do an extra
arch_prctl() to enable its cross-bitness shenanigans when checkpointing
and restoring a 32-bit program?
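[For reference, the LDT route mentioned above -- the one DOSEMU2-style
code relies on, which a GDT-only __USER32_CS restriction would not touch
and which seccomp could deny -- looks roughly like the sketch below. It
only installs a 32-bit code segment descriptor via modify_ldt(); it does
not actually far-jump through the selector or run 32-bit code.]

#include <asm/ldt.h>            /* struct user_desc, MODIFY_LDT_CONTENTS_CODE */
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
        struct user_desc desc;
        unsigned short sel;

        memset(&desc, 0, sizeof(desc));
        desc.entry_number   = 0;                        /* LDT slot 0 */
        desc.base_addr      = 0;                        /* flat base */
        desc.limit          = 0xfffff;                  /* 4 GiB limit ... */
        desc.limit_in_pages = 1;                        /* ... in page units */
        desc.seg_32bit      = 1;                        /* 32-bit default operand size */
        desc.contents       = MODIFY_LDT_CONTENTS_CODE; /* code segment */
        desc.read_exec_only = 0;                        /* readable and executable */

        /* No glibc wrapper exists for modify_ldt(); func 1 writes an entry. */
        if (syscall(SYS_modify_ldt, 1, &desc, sizeof(desc)) != 0) {
                perror("modify_ldt");
                return 1;
        }

        /*
         * Selector: index 0, TI=1 (LDT), RPL=3. A far jump or call through
         * this selector would switch the task into compatibility mode.
         */
        sel = (desc.entry_number << 3) | 0x7;
        printf("installed 32-bit LDT code segment, selector 0x%x\n", sel);
        return 0;
}

[Whether a sandboxed process should be allowed to do this at all is
exactly the seccomp question raised above.]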