From: vinayak menon
Date: Tue, 15 May 2018 18:39:20 +0530
Subject: Re: [PATCH v10 18/25] mm: provide speculative fault infrastructure
To: Laurent Dufour
Cc: Andrew Morton, Michal Hocko, Peter Zijlstra, kirill@shutemov.name,
 ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
 benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
 Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
 Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov,
 kemi.wang@intel.com, sergey.senozhatsky.work@gmail.com, Daniel Jordan,
 David Rientjes, Jerome Glisse, Ganesh Mahendran,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
 npiggin@gmail.com, Balbir Singh, Paul McKenney, Tim Chen,
 linuxppc-dev@lists.ozlabs.org, x86@kernel.org, Vinayak Menon
In-Reply-To: <1523975611-15978-19-git-send-email-ldufour@linux.vnet.ibm.com>
References: <1523975611-15978-1-git-send-email-ldufour@linux.vnet.ibm.com>
 <1523975611-15978-19-git-send-email-ldufour@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 17, 2018 at 8:03 PM, Laurent Dufour wrote:
>
> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
> +
> +#ifndef __HAVE_ARCH_PTE_SPECIAL
> +/* This is required by vm_normal_page() */
> +#error "Speculative page fault handler requires __HAVE_ARCH_PTE_SPECIAL"
> +#endif
> +
> +/*
> + * vm_normal_page() adds some processing which should be done while
> + * holding the mmap_sem.
> + */
> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +			       unsigned int flags)
> +{
> +	struct vm_fault vmf = {
> +		.address = address,
> +	};
> +	pgd_t *pgd, pgdval;
> +	p4d_t *p4d, p4dval;
> +	pud_t pudval;
> +	int seq, ret = VM_FAULT_RETRY;
> +	struct vm_area_struct *vma;
> +
> +	/* Clear flags that may lead to release the mmap_sem to retry */
> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	vma = get_vma(mm, address);
> +	if (!vma)
> +		return ret;
> +
> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
> +	if (seq & 1)
> +		goto out_put;
> +
> +	/*
> +	 * Can't call vm_ops services as we don't know what they would do
> +	 * with the VMA.
> +	 * This includes huge pages from hugetlbfs.
> +	 */
> +	if (vma->vm_ops)
> +		goto out_put;
> +
> +	/*
> +	 * __anon_vma_prepare() requires the mmap_sem to be held
> +	 * because vm_next and vm_prev must be safe. This can't be guaranteed
> +	 * in the speculative path.
> +	 */
> +	if (unlikely(!vma->anon_vma))
> +		goto out_put;
> +
> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
> +
> +	/* Can't call userland page fault handler in the speculative path */
> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
> +		goto out_put;
> +
> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
> +		/*
> +		 * This could be detected by checking the address against the
> +		 * VMA's boundaries but we want to trace it as not supported
> +		 * instead of changed.
> +		 */
> +		goto out_put;
> +
> +	if (address < READ_ONCE(vma->vm_start)
> +	    || READ_ONCE(vma->vm_end) <= address)
> +		goto out_put;
> +
> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
> +				       flags & FAULT_FLAG_INSTRUCTION,
> +				       flags & FAULT_FLAG_REMOTE)) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto out_put;
> +	}
> +
> +	/* This one is required to check that the VMA has write access set */
> +	if (flags & FAULT_FLAG_WRITE) {
> +		if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
> +			ret = VM_FAULT_SIGSEGV;
> +			goto out_put;
> +		}
> +	} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
> +		ret = VM_FAULT_SIGSEGV;
> +		goto out_put;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_NUMA)) {
> +		struct mempolicy *pol;
> +
> +		/*
> +		 * MPOL_INTERLEAVE implies additional checks in
> +		 * mpol_misplaced() which are not compatible with the
> +		 * speculative page fault processing.
> +		 */
> +		pol = __get_vma_policy(vma, address);

This gives a compile-time error when CONFIG_NUMA is disabled, as there
is no definition of __get_vma_policy().

> +		if (!pol)
> +			pol = get_task_policy(current);
> +		if (pol && pol->mode == MPOL_INTERLEAVE)
> +			goto out_put;
> +	}
> +
> +	/*
> +	 * Do a speculative lookup of the PTE entry.
> +	 */
> +	local_irq_disable();
> +	pgd = pgd_offset(mm, address);
> +	pgdval = READ_ONCE(*pgd);
> +	if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
> +		goto out_walk;
> +
> +	p4d = p4d_offset(pgd, address);
> +	p4dval = READ_ONCE(*p4d);
> +	if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
> +		goto out_walk;
> +
> +	vmf.pud = pud_offset(p4d, address);
> +	pudval = READ_ONCE(*vmf.pud);
> +	if (pud_none(pudval) || unlikely(pud_bad(pudval)))
> +		goto out_walk;
> +
> +	/* Huge pages at PUD level are not supported. */
> +	if (unlikely(pud_trans_huge(pudval)))
> +		goto out_walk;
> +
> +	vmf.pmd = pmd_offset(vmf.pud, address);
> +	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
> +	/*
> +	 * pmd_none could mean that a hugepage collapse is in progress
> +	 * behind our back, as collapse_huge_page() marks the PMD before
> +	 * invalidating the pte (which is done once the IPI is caught
> +	 * by all CPUs and we have interrupts disabled).
> +	 * For this reason we cannot handle THP in a speculative way since we
> +	 * can't safely identify an in-progress collapse operation done
> +	 * behind our back on that PMD.
> +	 * Regarding the order of the following checks, see comment in
> +	 * pmd_devmap_trans_unstable()
> +	 */
> +	if (unlikely(pmd_devmap(vmf.orig_pmd) ||
> +		     pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
> +		     is_swap_pmd(vmf.orig_pmd)))
> +		goto out_walk;
> +
> +	/*
> +	 * The above does not allocate/instantiate page-tables because doing so
> +	 * would lead to the possibility of instantiating page-tables after
> +	 * free_pgtables() -- and consequently leaking them.
> +	 *
> +	 * The result is that we take at least one !speculative fault per PMD
> +	 * in order to instantiate it.
> +	 */
> +
> +	vmf.pte = pte_offset_map(vmf.pmd, address);
> +	vmf.orig_pte = READ_ONCE(*vmf.pte);
> +	barrier(); /* See comment in handle_pte_fault() */
> +	if (pte_none(vmf.orig_pte)) {
> +		pte_unmap(vmf.pte);
> +		vmf.pte = NULL;
> +	}
> +
> +	vmf.vma = vma;
> +	vmf.pgoff = linear_page_index(vma, address);
> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
> +	vmf.sequence = seq;
> +	vmf.flags = flags;
> +
> +	local_irq_enable();
> +
> +	/*
> +	 * We need to re-validate the VMA after checking the bounds, otherwise
> +	 * we might have a false positive on the bounds.
> +	 */
> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
> +		goto out_put;
> +
> +	mem_cgroup_oom_enable();
> +	ret = handle_pte_fault(&vmf);
> +	mem_cgroup_oom_disable();
> +
> +	put_vma(vma);
> +
> +	/*
> +	 * The task may have entered a memcg OOM situation but
> +	 * if the allocation error was handled gracefully (no
> +	 * VM_FAULT_OOM), there is no need to kill anything.
> +	 * Just clean up the OOM state peacefully.
> +	 */
> +	if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
> +		mem_cgroup_oom_synchronize(false);
> +	return ret;
> +
> +out_walk:
> +	local_irq_enable();
> +out_put:
> +	put_vma(vma);
> +	return ret;
> +}
> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
> +
>  /*
>   * By the time we get here, we already hold the mm semaphore
>   *
> --
> 2.7.4
>