From: Andy Lutomirski
Date: Tue, 11 Dec 2018 18:20:19 -0800
Subject: Re: [PATCH v2 1/4] vmalloc: New flags for safe vfree on special perms
To: Rick Edgecombe
Cc: Andrew Morton, Andrew Lutomirski, Will Deacon, Linux-MM, LKML,
    Kernel Hardening, "Naveen N . Rao", Anil S Keshavamurthy,
    "David S. Miller", Masami Hiramatsu, Steven Rostedt, Ingo Molnar,
    Alexei Starovoitov, Daniel Borkmann, Jessica Yu, Nadav Amit,
    Network Development, Ard Biesheuvel, Jann Horn,
    Kristen Carlson Accardi, Dave Hansen, "Dock, Deneen T"
In-Reply-To: <20181212000354.31955-2-rick.p.edgecombe@intel.com>
References: <20181212000354.31955-1-rick.p.edgecombe@intel.com>
    <20181212000354.31955-2-rick.p.edgecombe@intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Dec 11, 2018 at 4:12 PM Rick Edgecombe wrote:
>
> This adds two new flags VM_IMMEDIATE_UNMAP and VM_HAS_SPECIAL_PERMS, for
> enabling vfree operations to immediately clear executable TLB entries to freed
> pages, and handle freeing memory with special permissions.
>
> In order to support vfree being called on memory that might be RO, the vfree
> deferred list node is moved to a kmalloc allocated struct, from where it is
> today, reusing the allocation being freed.
>
> arch_vunmap is a new __weak function that implements the actual unmapping and
> resetting of the direct map permissions. It can be overridden by more efficient
> architecture specific implementations.
>
> For the default implementation, it uses architecture agnostic methods which are
> equivalent to what most usages do before calling vfree. So now it is just
> centralized here.
>
> This implementation derives from two sketches from Dave Hansen and Andy
> Lutomirski.
>
> Suggested-by: Dave Hansen
> Suggested-by: Andy Lutomirski
> Suggested-by: Will Deacon
> Signed-off-by: Rick Edgecombe
> ---
>  include/linux/vmalloc.h |  2 ++
>  mm/vmalloc.c            | 73 +++++++++++++++++++++++++++++++++++++----
>  2 files changed, 69 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 398e9c95cd61..872bcde17aca 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -21,6 +21,8 @@ struct notifier_block; /* in notifier.h */
>  #define VM_UNINITIALIZED       0x00000020 /* vm_struct is not fully initialized */
>  #define VM_NO_GUARD            0x00000040 /* don't add guard page */
>  #define VM_KASAN               0x00000080 /* has allocated kasan shadow memory */
> +#define VM_IMMEDIATE_UNMAP     0x00000200 /* flush before releasing pages */
> +#define VM_HAS_SPECIAL_PERMS   0x00000400 /* may be freed with special perms */
>  /* bits [20..32] reserved for arch specific ioremap internals */
>
>  /*
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 97d4b25d0373..02b284d2245a 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -18,6 +18,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -38,6 +39,11 @@
>
>  #include "internal.h"
>
> +struct vfree_work {
> +        struct llist_node node;
> +        void *addr;
> +};
> +
>  struct vfree_deferred {
>          struct llist_head list;
>          struct work_struct wq;
> @@ -50,9 +56,13 @@ static void free_work(struct work_struct *w)
>  {
>          struct vfree_deferred *p = container_of(w, struct vfree_deferred, wq);
>          struct llist_node *t, *llnode;
> +        struct vfree_work *cur;
>
> -        llist_for_each_safe(llnode, t, llist_del_all(&p->list))
> -                __vunmap((void *)llnode, 1);
> +        llist_for_each_safe(llnode, t, llist_del_all(&p->list)) {
> +                cur = container_of(llnode, struct vfree_work, node);
> +                __vunmap(cur->addr, 1);
> +                kfree(cur);
> +        }
>  }
>
>  /*** Page table manipulation functions ***/
> @@ -1494,6 +1504,48 @@ struct vm_struct *remove_vm_area(const void *addr)
>          return NULL;
>  }
>
> +/*
> + * This function handles unmapping and resetting the direct map as efficiently
> + * as it can with cross arch functions. The three categories of architectures
> + * are:
> + *   1. Architectures with no set_memory implementations and no direct map
> + *      permissions.
> + *   2. Architectures with set_memory implementations but no direct map
> + *      permissions
> + *   3. Architectures with set_memory implementations and direct map permissions
> + */
> +void __weak arch_vunmap(struct vm_struct *area, int deallocate_pages)

My general preference is to avoid __weak functions -- they don't
optimize well. Instead, I prefer either:

#ifndef arch_vunmap
void arch_vunmap(...);
#endif

or

#ifdef CONFIG_HAVE_ARCH_VUNMAP
...
#endif

> +{
> +        unsigned long addr = (unsigned long)area->addr;
> +        int immediate = area->flags & VM_IMMEDIATE_UNMAP;
> +        int special = area->flags & VM_HAS_SPECIAL_PERMS;
> +
> +        /*
> +         * In case of 2 and 3, use this general way of resetting the permissions
> +         * on the directmap. Do NX before RW, in case of X, so there is no W^X
> +         * violation window.
> +         *
> +         * For case 1 these will be noops.
> +         */
> +        if (immediate)
> +                set_memory_nx(addr, area->nr_pages);
> +        if (deallocate_pages && special)
> +                set_memory_rw(addr, area->nr_pages);

Can you elaborate on the intent here? VM_IMMEDIATE_UNMAP means "I
want that alias gone before any deallocation happens".
VM_HAS_SPECIAL_PERMS means "I mucked with the direct map -- fix it for
me, please". deallocate means "this was vfree -- please free the
pages". I'm not convinced that all the various combinations make
sense. Do we really need both flags?

(VM_IMMEDIATE_UNMAP is a bit of a lie, since, if in_interrupt(), it's
not immediate.)

If we do keep both flags, maybe some restructuring would make sense,
like this, perhaps. Sorry about horrible whitespace damage.

if (special) {
        /* VM_HAS_SPECIAL_PERMS makes little sense without deallocate_pages.
         */
        WARN_ON_ONCE(!deallocate_pages);

        if (immediate) {
                /*
                 * It's possible that the vmap alias is X and we're about
                 * to make the direct map RW. To avoid a window where
                 * executable memory is writable, first mark the vmap
                 * alias NX. This is silly, since we're about to *unmap*
                 * it, but this is the best we can do if all we have to
                 * work with is the set_memory_abc() APIs. Architectures
                 * should override this whole function to get better
                 * behavior.
                 */
                set_memory_nx(...);
        }

        set_memory_rw(addr, area->nr_pages);
}

> +
> +        /* Always actually remove the area */
> +        remove_vm_area(area->addr);
> +
> +        /*
> +         * Need to flush the TLB before freeing pages in the case of this flag.
> +         * As long as that's happening, unmap aliases.
> +         *
> +         * For 2 and 3, this will not be needed because of the set_memory_nx
> +         * above, because the stale TLBs will be NX.

I'm not sure I agree with this comment. If the caller asked for an
immediate unmap, we should give an immediate unmap. But I'm still not
sure I see why VM_IMMEDIATE_UNMAP needs to exist as a separate flag.

> +         */
> +        if (immediate && !IS_ENABLED(ARCH_HAS_SET_MEMORY))
> +                vm_unmap_aliases();
> +}
> +
>  static void __vunmap(const void *addr, int deallocate_pages)
>  {
>          struct vm_struct *area;
> @@ -1515,7 +1567,8 @@ static void __vunmap(const void *addr, int deallocate_pages)
>          debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>          debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>
> -        remove_vm_area(addr);
> +        arch_vunmap(area, deallocate_pages);
> +
>          if (deallocate_pages) {
>                  int i;
>
> @@ -1542,8 +1595,15 @@ static inline void __vfree_deferred(const void *addr)
>           * nother cpu's list. schedule_work() should be fine with this too.
>           */
>          struct vfree_deferred *p = raw_cpu_ptr(&vfree_deferred);
> +        struct vfree_work *w = kmalloc(sizeof(struct vfree_work), GFP_ATOMIC);
> +
> +        /* If no memory for the deferred list node, give up */
> +        if (!w)
> +                return;

That's nasty.
I see what you're trying to do here, but I think you're
solving a problem that doesn't need solving quite so urgently. How
about dropping this part and replacing it with a comment like "NB:
this writes a word to a potentially executable address. It would be
nice if we could avoid doing this." And maybe a future patch could
more robustly avoid it without risking memory leaks.
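
For reference, the existing approach (reusing the allocation being
freed as the deferred-list node) can be sketched in userspace C. This
is only an illustration under simplifying assumptions: it is
single-threaded, it uses malloc()/free() in place of vmalloc()/vfree(),
and the names deferred_node, defer_free() and drain_deferred() are made
up for the sketch rather than taken from mm/vmalloc.c.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the kernel's struct llist_node. */
struct deferred_node {
	struct deferred_node *next;
};

/* Head of the pending-free list (single-threaded sketch, no atomics). */
static struct deferred_node *deferred_head;

/*
 * Queue a region for later freeing by storing the list link in the
 * region's first word. Nothing is allocated here, so there is no
 * failure path, but one pointer is written into memory that is about
 * to be freed. The region must be at least sizeof(struct deferred_node)
 * bytes and writable.
 */
static void defer_free(void *addr)
{
	struct deferred_node *n = addr;

	n->next = deferred_head;
	deferred_head = n;
}

/* Free everything queued so far; returns how many regions were freed. */
static size_t drain_deferred(void)
{
	struct deferred_node *n = deferred_head;
	size_t freed = 0;

	deferred_head = NULL;
	while (n) {
		/* Read the link before the memory holding it is freed. */
		struct deferred_node *next = n->next;

		free(n);
		n = next;
		freed++;
	}
	return freed;
}
```

Because defer_free() never allocates, it cannot fail and so cannot
leak. The cost is the single word written into the region, which is
the write the proposed "NB:" comment would call out.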