Date: Tue, 28 Aug 2018 17:49:48 +0900
From: Masami Hiramatsu
To: Andy Lutomirski
Cc: Nadav Amit, Masami Hiramatsu, Peter Zijlstra, Kees Cook,
        Linus Torvalds, Paolo Bonzini, Jiri Kosina, Will Deacon,
        Benjamin Herrenschmidt, Nick Piggin, "the arch/x86 maintainers",
        Borislav Petkov, Rik van Riel, Jann Horn, Adin Scannell,
        Dave Hansen, Linux Kernel
        Mailing List, linux-mm, David Miller, Martin Schwidefsky,
        Michael Ellerman
Subject: Re: TLB flushes on fixmap changes
Message-Id: <20180828174948.dc0474009be1ec7a0392cfcd@kernel.org>
In-Reply-To:
References: <20180826090958.GT24124@hirez.programming.kicks-ass.net>
        <20180827120305.01a6f26267c64610cadec5d8@kernel.org>
        <4BF82052-4738-441C-8763-26C85003F2C9@gmail.com>
        <20180827170511.6bafa15cbc102ae135366e86@kernel.org>
        <01DA0BDD-7504-4209-8A8F-20B27CF6A1C7@gmail.com>
        <0000D631-FDDF-4273-8F3C-714E6825E59B@gmail.com>
        <823D916E-4056-4A36-BDD8-0FB682A8DCAE@gmail.com>
        <08A6BCB2-66C2-47ED-AEB8-AA8F4D7DBD45@gmail.com>
        <4F72D40A-25A3-4C64-B8DD-56970CFDE61E@gmail.com>
        <2998E63A-5663-4805-9DE7-829A5870AA6D@gmail.com>
X-Mailer: Sylpheed 3.5.1 (GTK+ 2.24.31; x86_64-redhat-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-2022-JP
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, 27 Aug 2018 16:01:32 -0700
Andy Lutomirski wrote:

> On Mon, Aug 27, 2018 at 3:54 PM, Nadav Amit wrote:
> > at 3:32 PM, Andy Lutomirski wrote:
> >
> >> On Mon, Aug 27, 2018 at 2:55 PM, Nadav Amit wrote:
> >>> at 1:16 PM, Nadav Amit wrote:
> >>>
> >>>> at 12:58 PM, Andy Lutomirski wrote:
> >>>>
> >>>>> On Mon, Aug 27, 2018 at 12:43 PM, Nadav Amit wrote:
> >>>>>> at 12:10 PM, Nadav Amit wrote:
> >>>>>>
> >>>>>>> at 11:58 AM, Andy Lutomirski wrote:
> >>>>>>>
> >>>>>>>> On Mon, Aug 27, 2018 at 11:54 AM, Nadav Amit wrote:
> >>>>>>>>>> On Mon, Aug 27, 2018 at 10:34 AM, Nadav Amit wrote:
> >>>>>>>>>> What do you all think?
> >>>>>>>>>
> >>>>>>>>> I agree in general. But I think that current->mm would need to be loaded, as
> >>>>>>>>> otherwise I am afraid it would break switch_mm_irqs_off().
> >>>>>>>>
> >>>>>>>> What breaks?
> >>>>>>>
> >>>>>>> Actually nothing. I just saw the IBPB stuff regarding tsk, but it should not
> >>>>>>> matter.
> >>>>>>
> >>>>>> So here is what I got.
It certainly needs some cleanup, but it boots.
> >>>>>>
> >>>>>> Let me know how crappy you find it...
> >>>>>>
> >>>>>>
> >>>>>> diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
> >>>>>> index bbc796eb0a3b..336779650a41 100644
> >>>>>> --- a/arch/x86/include/asm/mmu_context.h
> >>>>>> +++ b/arch/x86/include/asm/mmu_context.h
> >>>>>> @@ -343,4 +343,24 @@ static inline unsigned long __get_current_cr3_fast(void)
> >>>>>>  	return cr3;
> >>>>>>  }
> >>>>>>
> >>>>>> +typedef struct {
> >>>>>> +	struct mm_struct *prev;
> >>>>>> +} temporary_mm_state_t;
> >>>>>> +
> >>>>>> +static inline temporary_mm_state_t use_temporary_mm(struct mm_struct *mm)
> >>>>>> +{
> >>>>>> +	temporary_mm_state_t state;
> >>>>>> +
> >>>>>> +	lockdep_assert_irqs_disabled();
> >>>>>> +	state.prev = this_cpu_read(cpu_tlbstate.loaded_mm);
> >>>>>> +	switch_mm_irqs_off(NULL, mm, current);
> >>>>>> +	return state;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static inline void unuse_temporary_mm(temporary_mm_state_t prev)
> >>>>>> +{
> >>>>>> +	lockdep_assert_irqs_disabled();
> >>>>>> +	switch_mm_irqs_off(NULL, prev.prev, current);
> >>>>>> +}
> >>>>>> +
> >>>>>>  #endif /* _ASM_X86_MMU_CONTEXT_H */
> >>>>>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> >>>>>> index 5715647fc4fe..ef62af9a0ef7 100644
> >>>>>> --- a/arch/x86/include/asm/pgtable.h
> >>>>>> +++ b/arch/x86/include/asm/pgtable.h
> >>>>>> @@ -976,6 +976,10 @@ static inline void __meminit init_trampoline_default(void)
> >>>>>>  	/* Default trampoline pgd value */
> >>>>>>  	trampoline_pgd_entry = init_top_pgt[pgd_index(__PAGE_OFFSET)];
> >>>>>>  }
> >>>>>> +
> >>>>>> +void __init patching_mm_init(void);
> >>>>>> +#define patching_mm_init patching_mm_init
> >>>>>> +
> >>>>>>  # ifdef CONFIG_RANDOMIZE_MEMORY
> >>>>>>  void __meminit init_trampoline(void);
> >>>>>>  # else
> >>>>>> diff --git a/arch/x86/include/asm/pgtable_64_types.h b/arch/x86/include/asm/pgtable_64_types.h
> >>>>>> index 054765ab2da2..9f44262abde0 100644
> >>>>>> --- a/arch/x86/include/asm/pgtable_64_types.h
> >>>>>> +++ b/arch/x86/include/asm/pgtable_64_types.h
> >>>>>> @@ -116,6 +116,9 @@ extern unsigned int ptrs_per_p4d;
> >>>>>>  #define LDT_PGD_ENTRY		(pgtable_l5_enabled() ? LDT_PGD_ENTRY_L5 : LDT_PGD_ENTRY_L4)
> >>>>>>  #define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
> >>>>>>
> >>>>>> +#define TEXT_POKE_PGD_ENTRY	-5UL
> >>>>>> +#define TEXT_POKE_ADDR		(TEXT_POKE_PGD_ENTRY << PGDIR_SHIFT)
> >>>>>> +
> >>>>>>  #define __VMALLOC_BASE_L4	0xffffc90000000000UL
> >>>>>>  #define __VMALLOC_BASE_L5	0xffa0000000000000UL
> >>>>>>
> >>>>>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >>>>>> index 99fff853c944..840c72ec8c4f 100644
> >>>>>> --- a/arch/x86/include/asm/pgtable_types.h
> >>>>>> +++ b/arch/x86/include/asm/pgtable_types.h
> >>>>>> @@ -505,6 +505,9 @@ pgprot_t phys_mem_access_prot(struct file *file, unsigned long pfn,
> >>>>>>  /* Install a pte for a particular vaddr in kernel space. */
> >>>>>>  void set_pte_vaddr(unsigned long vaddr, pte_t pte);
> >>>>>>
> >>>>>> +struct mm_struct;
> >>>>>> +void set_mm_pte_vaddr(struct mm_struct *mm, unsigned long vaddr, pte_t pte);
> >>>>>> +
> >>>>>>  #ifdef CONFIG_X86_32
> >>>>>>  extern void native_pagetable_init(void);
> >>>>>>  #else
> >>>>>> diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
> >>>>>> index 2ecd34e2d46c..cb364ea5b19d 100644
> >>>>>> --- a/arch/x86/include/asm/text-patching.h
> >>>>>> +++ b/arch/x86/include/asm/text-patching.h
> >>>>>> @@ -38,4 +38,6 @@ extern void *text_poke(void *addr, const void *opcode, size_t len);
> >>>>>>  extern int poke_int3_handler(struct pt_regs *regs);
> >>>>>>  extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
> >>>>>>
> >>>>>> +extern struct mm_struct *patching_mm;
> >>>>>> +
> >>>>>>  #endif /* _ASM_X86_TEXT_PATCHING_H */
> >>>>>> diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
> >>>>>> index a481763a3776..fd8a950b0d62 100644
> >>>>>> --- a/arch/x86/kernel/alternative.c
> >>>>>> +++ b/arch/x86/kernel/alternative.c
> >>>>>> @@ -11,6 +11,7 @@
> >>>>>>  #include
> >>>>>>  #include
> >>>>>>  #include
> >>>>>> +#include
> >>>>>>  #include
> >>>>>>  #include
> >>>>>>  #include
> >>>>>> @@ -701,8 +702,36 @@ void *text_poke(void *addr, const void *opcode, size_t len)
> >>>>>>  		WARN_ON(!PageReserved(pages[0]));
> >>>>>>  		pages[1] = virt_to_page(addr + PAGE_SIZE);
> >>>>>>  	}
> >>>>>> -	BUG_ON(!pages[0]);
> >>>>>> +
> >>>>>>  	local_irq_save(flags);
> >>>>>> +	BUG_ON(!pages[0]);
> >>>>>> +
> >>>>>> +	/*
> >>>>>> +	 * During initial boot, it is hard to initialize patching_mm due to
> >>>>>> +	 * dependencies in boot order.
> >>>>>> +	 */
> >>>>>> +	if (patching_mm) {
> >>>>>> +		pte_t pte;
> >>>>>> +		temporary_mm_state_t prev;
> >>>>>> +
> >>>>>> +		prev = use_temporary_mm(patching_mm);
> >>>>>> +		pte = mk_pte(pages[0], PAGE_KERNEL);
> >>>>>> +		set_mm_pte_vaddr(patching_mm, TEXT_POKE_ADDR, pte);
> >>>>>> +		pte = mk_pte(pages[1], PAGE_KERNEL);
> >>>>>> +		set_mm_pte_vaddr(patching_mm, TEXT_POKE_ADDR + PAGE_SIZE, pte);
> >>>>>> +
> >>>>>> +		memcpy((void *)(TEXT_POKE_ADDR | ((unsigned long)addr & ~PAGE_MASK)),
> >>>>>> +		       opcode, len);
> >>>>>> +
> >>>>>> +		set_mm_pte_vaddr(patching_mm, TEXT_POKE_ADDR, __pte(0));
> >>>>>> +		set_mm_pte_vaddr(patching_mm, TEXT_POKE_ADDR + PAGE_SIZE, __pte(0));
> >>>>>> +		local_flush_tlb();
> >>>>>
> >>>>> Hmm. This is still busted on SMP, and it's IMO more complicated than
> >>>>> needed. How about getting rid of all the weird TLB flushing stuff and
> >>>>> instead putting the mapping at vaddr - __START_KERNEL_map or whatever
> >>>>> it is? You *might* need to flush_tlb_mm_range() on module unload, but
> >>>>> that's it.
> >>>>
> >>>> I don’t see what’s wrong in SMP, since this entire piece of code should be
> >>>> running under text_mutex.
> >>>>
> >>>> I don’t quite understand your proposal. I really don’t want to have any
> >>>> chance in which the page tables for the poked address are not preallocated.
> >>>>
> >>>> It is more complicated than needed, and there are redundant TLB flushes. The
> >>>> reason I preferred to do it this way is in order not to use other functions
> >>>> that take locks during the software page-walk, and not to duplicate existing
> >>>> code. Yet, duplication might be the way to go.
> >>>>
> >>>>>> +		sync_core();
> >>>>>
> >>>>> I can't think of any case where sync_core() is needed. The mm switch
> >>>>> serializes.
> >>>>
> >>>> Good point!
> >>>>
> >>>>> Also, is there any circumstance in which any of this is used before at
> >>>>> least jump table init? All the early stuff is text_poke_early(),
> >>>>> right?
> >>>> Not before jump_label_init. However, I did not manage to get rid of the two
> >>>> code paths in text_poke(), since text_poke() is used relatively early by
> >>>> x86_late_time_init(), and at this stage kmem_cache_alloc() - which is needed
> >>>> to duplicate init_mm - still fails.
> >>>
> >>> Another correction: the populate_extra_pte() is not needed.
> >>>
> >>> Anyhow, if you want to do this whole thing differently, I obviously will not
> >>> object, but I think it will end up more complicated.
> >>>
> >>> I think I finally understood your comment about "vaddr -
> >>> __START_KERNEL_map”. I did something like that before, and it is not
> >>> super-simple. You need not only to conditionally flush the TLB, but also
> >>> to synchronize the PUD/PMD on changes. Don’t forget that module memory
> >>> is installed even when BPF programs are installed.
> >>>
> >>> Let me know if you want me to submit cleaner patches or you want to carry on
> >>> yourself.
> >>
> >> I think your approach is a good start and should be good enough (with
> >> cleanups) as a fix for the bug. But I think your code has the same
> >> bug that we have now! You're reusing the same address on multiple
> >> CPUs without flushing. You can easily fix it by forcing a flush
> >> before loading the mm, which should be as simple as adding
> >> flush_tlb_mm() before you load the mm. (It won't actually flush
> >> anything by itself, since the mm isn't loaded, but it will update the
> >> bookkeeping so that switch_mm_irqs_off() flushes the mm.)
> >
> > What am I missing?
> >
> > We have a lock (text_mutex) which prevents the use of the new page-table
> > hierarchy on multiple CPUs. In addition we have __set_pte_vaddr(), which does
> > a local TLB flush before the lock is released and before IRQs are enabled.
> > So how can the PTE be cached on multiple CPUs?
> >
> > Yes, __set_pte_vaddr() is ugly and flushes too much.
> > I’ll try to remove some
> > redundant TLB flushes, but these flushes are already there.
>
> I missed that set_pte_vaddr() contained a flush. It's probably nicer
> to use one of the APIs that doesn't imply a flush. If you end up with
> vm_insert_pfn() or similar, you can use copy_to_user() if
> needed :)
>
> >
> >> Also, please at least get rid of TEXT_POKE_ADDR. If you don't want to
> >> do the vaddr - __START_KERNEL_map thing, then at least pick an address
> >> in the low half of the address space such as 0 :) Ideally you'd only
> >> use this thing late enough that you could even use the normal
> >> insert_pfn (or similar) API for it, but that doesn't really matter.
> >
> > Perhaps the vaddr - __START_KERNEL_map actually makes sense. I was
> > misunderstanding it (again) before, thinking you wanted me to use vaddr (and
> > not the delta). I’ll give it a try.
>
> It does prevent easy preallocation of the intermediate tables.

Would we have to change the mapping address? If so, how about using a hash
of vaddr, which would allow us to map it with one PTE?

And do we really allow text_poke() / text_poke_bp() on BPF programs?
Hmm, I don't think so. Allowing text_poke() from a BPF program would be a
security hazard. (Or did you mean the BPF JIT?)

Thanks,

-- 
Masami Hiramatsu