Subject: Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure
From: Laurent Dufour
To: zhong jiang
Cc: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org,
 kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz,
 Matthew Wilcox, khandual@linux.vnet.ibm.com, aneesh.kumar@linux.vnet.ibm.com,
 benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
 Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
 Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com, Andrea Arcangeli,
 Alexei Starovoitov, kemi.wang@intel.com, Daniel Jordan, David Rientjes,
 Jerome Glisse, Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
 Yang Shi, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 haren@linux.vnet.ibm.com, npiggin@gmail.com, bsingharora@gmail.com,
 paulmck@linux.vnet.ibm.com, Tim Chen, linuxppc-dev@lists.ozlabs.org,
 x86@kernel.org
Date: Tue, 24 Jul 2018 18:10:20 +0200
Message-Id: <3ec16be6-ce2a-99af-4805-e00a59bad677@linux.vnet.ibm.com>
In-Reply-To: <5B573715.5070201@huawei.com>
References: <1526555193-7242-1-git-send-email-ldufour@linux.vnet.ibm.com>
 <1526555193-7242-20-git-send-email-ldufour@linux.vnet.ibm.com>
 <5B573715.5070201@huawei.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 24/07/2018 16:26, zhong jiang wrote:
> On 2018/5/17 19:06, Laurent Dufour wrote:
>> From: Peter Zijlstra
>>
>> Provide infrastructure to do a speculative fault (not holding
>> mmap_sem).
>>
>> The not holding of mmap_sem means we can race against VMA
>> change/removal and page-table destruction. We use the SRCU VMA freeing
>> to keep the VMA around. We use the VMA seqcount to detect change
>> (including umapping / page-table deletion) and we use gup_fast() style
>> page-table walking to deal with page-table races.
>> >> Once we've obtained the page and are ready to update the PTE, we >> validate if the state we started the fault with is still valid, if >> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the >> PTE and we're done. >> >> Signed-off-by: Peter Zijlstra (Intel) >> >> [Manage the newly introduced pte_spinlock() for speculative page >> fault to fail if the VMA is touched in our back] >> [Rename vma_is_dead() to vma_has_changed() and declare it here] >> [Fetch p4d and pud] >> [Set vmd.sequence in __handle_mm_fault()] >> [Abort speculative path when handle_userfault() has to be called] >> [Add additional VMA's flags checks in handle_speculative_fault()] >> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()] >> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed] >> [Remove warning comment about waiting for !seq&1 since we don't want >> to wait] >> [Remove warning about no huge page support, mention it explictly] >> [Don't call do_fault() in the speculative path as __do_fault() calls >> vma->vm_ops->fault() which may want to release mmap_sem] >> [Only vm_fault pointer argument for vma_has_changed()] >> [Fix check against huge page, calling pmd_trans_huge()] >> [Use READ_ONCE() when reading VMA's fields in the speculative path] >> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for >> processing done in vm_normal_page()] >> [Check that vma->anon_vma is already set when starting the speculative >> path] >> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to >> the processing done in mpol_misplaced()] >> [Don't support VMA growing up or down] >> [Move check on vm_sequence just before calling handle_pte_fault()] >> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT] >> [Add mem cgroup oom check] >> [Use READ_ONCE to access p*d entries] >> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()] >> [Don't fetch pte again in handle_pte_fault() when running the speculative >> path] >> [Check PMD against concurrent collapsing operation] >> [Try spin lock the pte during the speculative path to avoid deadlock with >> other CPU's invalidating the TLB and requiring this CPU to catch the >> inter processor's interrupt] >> [Move define of FAULT_FLAG_SPECULATIVE here] >> [Introduce __handle_speculative_fault() and add a check against >> mm->mm_users in handle_speculative_fault() defined in mm.h] >> Signed-off-by: Laurent Dufour >> --- >> include/linux/hugetlb_inline.h | 2 +- >> include/linux/mm.h | 30 ++++ >> include/linux/pagemap.h | 4 +- >> mm/internal.h | 16 +- >> mm/memory.c | 340 ++++++++++++++++++++++++++++++++++++++++- >> 5 files changed, 385 insertions(+), 7 deletions(-) >> >> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h >> index 0660a03d37d9..9e25283d6fc9 100644 >> --- a/include/linux/hugetlb_inline.h >> +++ b/include/linux/hugetlb_inline.h >> @@ -8,7 +8,7 @@ >> >> static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma) >> { >> - return !!(vma->vm_flags & VM_HUGETLB); >> + return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB); >> } >> >> #else >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 05cbba70104b..31acf98a7d92 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -315,6 +315,7 @@ extern pgprot_t protection_map[16]; >> #define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */ >> #define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */ >> #define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an 
instruction fetch */ >> +#define FAULT_FLAG_SPECULATIVE 0x200 /* Speculative fault, not holding mmap_sem */ >> >> #define FAULT_FLAG_TRACE \ >> { FAULT_FLAG_WRITE, "WRITE" }, \ >> @@ -343,6 +344,10 @@ struct vm_fault { >> gfp_t gfp_mask; /* gfp mask to be used for allocations */ >> pgoff_t pgoff; /* Logical page offset based on vma */ >> unsigned long address; /* Faulting virtual address */ >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT >> + unsigned int sequence; >> + pmd_t orig_pmd; /* value of PMD at the time of fault */ >> +#endif >> pmd_t *pmd; /* Pointer to pmd entry matching >> * the 'address' */ >> pud_t *pud; /* Pointer to pud entry matching >> @@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page); >> #ifdef CONFIG_MMU >> extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address, >> unsigned int flags); >> + >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT >> +extern int __handle_speculative_fault(struct mm_struct *mm, >> + unsigned long address, >> + unsigned int flags); >> +static inline int handle_speculative_fault(struct mm_struct *mm, >> + unsigned long address, >> + unsigned int flags) >> +{ >> + /* >> + * Try speculative page fault for multithreaded user space task only. >> + */ >> + if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1) >> + return VM_FAULT_RETRY; >> + return __handle_speculative_fault(mm, address, flags); >> +} >> +#else >> +static inline int handle_speculative_fault(struct mm_struct *mm, >> + unsigned long address, >> + unsigned int flags) >> +{ >> + return VM_FAULT_RETRY; >> +} >> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */ >> + >> extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, >> unsigned long address, unsigned int fault_flags, >> bool *unlocked); >> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h >> index b1bd2186e6d2..6e2aa4e79af7 100644 >> --- a/include/linux/pagemap.h >> +++ b/include/linux/pagemap.h >> @@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma, >> pgoff_t pgoff; >> if (unlikely(is_vm_hugetlb_page(vma))) >> return linear_hugepage_index(vma, address); >> - pgoff = (address - vma->vm_start) >> PAGE_SHIFT; >> - pgoff += vma->vm_pgoff; >> + pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT; >> + pgoff += READ_ONCE(vma->vm_pgoff); >> return pgoff; >> } >> >> diff --git a/mm/internal.h b/mm/internal.h >> index fb2667b20f0a..10b188c87fa4 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -44,7 +44,21 @@ int do_swap_page(struct vm_fault *vmf); >> extern struct vm_area_struct *get_vma(struct mm_struct *mm, >> unsigned long addr); >> extern void put_vma(struct vm_area_struct *vma); >> -#endif >> + >> +static inline bool vma_has_changed(struct vm_fault *vmf) >> +{ >> + int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb); >> + unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence); >> + >> + /* >> + * Matches both the wmb in write_seqlock_{begin,end}() and >> + * the wmb in vma_rb_erase(). 
>> + */ >> + smp_rmb(); >> + >> + return ret || seq != vmf->sequence; >> +} >> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */ >> >> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma, >> unsigned long floor, unsigned long ceiling); >> diff --git a/mm/memory.c b/mm/memory.c >> index ab32b0b4bd69..7bbbb8c7b9cd 100644 >> --- a/mm/memory.c >> +++ b/mm/memory.c >> @@ -769,7 +769,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, >> if (page) >> dump_page(page, "bad pte"); >> pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n", >> - (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); >> + (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma, >> + mapping, index); >> pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n", >> vma->vm_file, >> vma->vm_ops ? vma->vm_ops->fault : NULL, >> @@ -2306,6 +2307,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr, >> } >> EXPORT_SYMBOL_GPL(apply_to_page_range); >> >> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT >> +static bool pte_spinlock(struct vm_fault *vmf) >> +{ >> + bool ret = false; >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + pmd_t pmdval; >> +#endif >> + >> + /* Check if vma is still valid */ >> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) { >> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); >> + spin_lock(vmf->ptl); >> + return true; >> + } >> + >> +again: >> + local_irq_disable(); >> + if (vma_has_changed(vmf)) >> + goto out; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + /* >> + * We check if the pmd value is still the same to ensure that there >> + * is not a huge collapse operation in progress in our back. >> + */ >> + pmdval = READ_ONCE(*vmf->pmd); >> + if (!pmd_same(pmdval, vmf->orig_pmd)) >> + goto out; >> +#endif >> + >> + vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); >> + if (unlikely(!spin_trylock(vmf->ptl))) { >> + local_irq_enable(); >> + goto again; >> + } >> + >> + if (vma_has_changed(vmf)) { >> + spin_unlock(vmf->ptl); >> + goto out; >> + } >> + >> + ret = true; >> +out: >> + local_irq_enable(); >> + return ret; >> +} >> + >> +static bool pte_map_lock(struct vm_fault *vmf) >> +{ >> + bool ret = false; >> + pte_t *pte; >> + spinlock_t *ptl; >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + pmd_t pmdval; >> +#endif >> + >> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) { >> + vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, >> + vmf->address, &vmf->ptl); >> + return true; >> + } >> + >> + /* >> + * The first vma_has_changed() guarantees the page-tables are still >> + * valid, having IRQs disabled ensures they stay around, hence the >> + * second vma_has_changed() to make sure they are still valid once >> + * we've got the lock. After that a concurrent zap_pte_range() will >> + * block on the PTL and thus we're safe. >> + */ >> +again: >> + local_irq_disable(); >> + if (vma_has_changed(vmf)) >> + goto out; >> + >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE >> + /* >> + * We check if the pmd value is still the same to ensure that there >> + * is not a huge collapse operation in progress in our back. >> + */ >> + pmdval = READ_ONCE(*vmf->pmd); >> + if (!pmd_same(pmdval, vmf->orig_pmd)) >> + goto out; >> +#endif >> + >> + /* >> + * Same as pte_offset_map_lock() except that we call >> + * spin_trylock() in place of spin_lock() to avoid race with >> + * unmap path which may have the lock and wait for this CPU >> + * to invalidate TLB but this CPU has irq disabled. 
>> + * Since we are in a speculative patch, accept it could fail >> + */ >> + ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); >> + pte = pte_offset_map(vmf->pmd, vmf->address); >> + if (unlikely(!spin_trylock(ptl))) { >> + pte_unmap(pte); >> + local_irq_enable(); >> + goto again; >> + } >> + >> + if (vma_has_changed(vmf)) { >> + pte_unmap_unlock(pte, ptl); >> + goto out; >> + } >> + >> + vmf->pte = pte; >> + vmf->ptl = ptl; >> + ret = true; >> +out: >> + local_irq_enable(); >> + return ret; >> +} >> +#else >> static inline bool pte_spinlock(struct vm_fault *vmf) >> { >> vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd); >> @@ -2319,6 +2432,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf) >> vmf->address, &vmf->ptl); >> return true; >> } >> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */ >> >> /* >> * handle_pte_fault chooses page fault handler according to an entry which was >> @@ -3208,6 +3322,14 @@ static int do_anonymous_page(struct vm_fault *vmf) >> ret = check_stable_address_space(vma->vm_mm); >> if (ret) >> goto unlock; >> + /* >> + * Don't call the userfaultfd during the speculative path. >> + * We already checked for the VMA to not be managed through >> + * userfaultfd, but it may be set in our back once we have lock >> + * the pte. In such a case we can ignore it this time. >> + */ >> + if (vmf->flags & FAULT_FLAG_SPECULATIVE) >> + goto setpte; >> /* Deliver the page fault to userland, check inside PT lock */ >> if (userfaultfd_missing(vma)) { >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> @@ -3249,7 +3371,7 @@ static int do_anonymous_page(struct vm_fault *vmf) >> goto unlock_and_release; >> >> /* Deliver the page fault to userland, check inside PT lock */ >> - if (userfaultfd_missing(vma)) { >> + if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) && userfaultfd_missing(vma)) { >> pte_unmap_unlock(vmf->pte, vmf->ptl); >> mem_cgroup_cancel_charge(page, memcg, false); >> put_page(page); >> @@ -3994,13 +4116,22 @@ static int handle_pte_fault(struct vm_fault *vmf) >> >> if (unlikely(pmd_none(*vmf->pmd))) { >> /* >> + * In the case of the speculative page fault handler we abort >> + * the speculative path immediately as the pmd is probably >> + * in the way to be converted in a huge one. We will try >> + * again holding the mmap_sem (which implies that the collapse >> + * operation is done). >> + */ >> + if (vmf->flags & FAULT_FLAG_SPECULATIVE) >> + return VM_FAULT_RETRY; >> + /* >> * Leave __pte_alloc() until later: because vm_ops->fault may >> * want to allocate huge page, and if we expose page table >> * for an instant, it will be difficult to retract from >> * concurrent faults and from rmap lookups. >> */ >> vmf->pte = NULL; >> - } else { >> + } else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) { >> /* See comment in pte_alloc_one_map() */ >> if (pmd_devmap_trans_unstable(vmf->pmd)) >> return 0; >> @@ -4009,6 +4140,9 @@ static int handle_pte_fault(struct vm_fault *vmf) >> * pmd from under us anymore at this point because we hold the >> * mmap_sem read mode and khugepaged takes it in write mode. >> * So now it's safe to run pte_offset_map(). >> + * This is not applicable to the speculative page fault handler >> + * but in that case, the pte is fetched earlier in >> + * handle_speculative_fault(). 
>> */
>> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> vmf->orig_pte = *vmf->pte;
>> @@ -4031,6 +4165,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>> if (!vmf->pte) {
>> if (vma_is_anonymous(vmf->vma))
>> return do_anonymous_page(vmf);
>> + else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>> + return VM_FAULT_RETRY;
>> else
>> return do_fault(vmf);
>> }
>> @@ -4128,6 +4264,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>> vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>> if (!vmf.pmd)
>> return VM_FAULT_OOM;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> + vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>> +#endif
>> if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
>> ret = create_huge_pmd(&vmf);
>> if (!(ret & VM_FAULT_FALLBACK))
>> @@ -4161,6 +4300,201 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>> return handle_pte_fault(&vmf);
>> }
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +/*
>> + * Tries to handle the page fault in a speculative way, without grabbing the
>> + * mmap_sem.
>> + */
>> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>> + unsigned int flags)
>> +{
>> + struct vm_fault vmf = {
>> + .address = address,
>> + };
>> + pgd_t *pgd, pgdval;
>> + p4d_t *p4d, p4dval;
>> + pud_t pudval;
>> + int seq, ret = VM_FAULT_RETRY;
>> + struct vm_area_struct *vma;
>> +#ifdef CONFIG_NUMA
>> + struct mempolicy *pol;
>> +#endif
>> +
>> + /* Clear flags that may lead to release the mmap_sem to retry */
>> + flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>> + flags |= FAULT_FLAG_SPECULATIVE;
>> +
>> + vma = get_vma(mm, address);
>> + if (!vma)
>> + return ret;
>> +
>> + seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>> + if (seq & 1)
>> + goto out_put;
>> +
>> + /*
>> + * Can't call vm_ops service has we don't know what they would do
>> + * with the VMA.
>> + * This include huge page from hugetlbfs.
>> + */
>> + if (vma->vm_ops)
>> + goto out_put;
>> +
> Hi Laurent,
>
> I think that most page faults will leave here. Is there any case that needs to be skipped?
> I have tested the following patch, it works well.

Hi Zhong,

Well, this will allow file mappings to be handled in a speculative way, but
that's a bit dangerous today as there is no guarantee that the
vm_ops->fault() operation will be fair. In the case of anonymous file
mappings that's often not a problem, depending on the underlying file
system, but there are so many cases to check that it is hard to say this
can be done in a speculative way as is.

The big piece of work is to double-check that none of the code called by
vm_ops->fault() deals with the mmap_sem (that part could be handled using
FAULT_FLAG_RETRY_NOWAIT), and care is also needed about the resources that
code manages, as it may assume, perhaps only implicitly, that it runs under
the protection of the mmap_sem held in read mode.

Cheers,
Laurent.

>
> diff --git a/mm/memory.c b/mm/memory.c
> index 936128b..9bc1545 100644
> @@ -3893,8 +3898,6 @@ static int handle_pte_fault(struct fault_env *fe)
> if (!fe->pte) {
> if (vma_is_anonymous(fe->vma))
> return do_anonymous_page(fe);
> - else if (fe->flags & FAULT_FLAG_SPECULATIVE)
> - return VM_FAULT_RETRY;
> else
> return do_fault(fe);
> }
> @@ -4026,20 +4029,11 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> goto out_put;
> }
> /*
> - * Can't call vm_ops service has we don't know what they would do
> - * with the VMA.
> - * This include huge page from hugetlbfs. > - */ > - if (vma->vm_ops) { > - trace_spf_vma_notsup(_RET_IP_, vma, address); > - goto out_put; > - } > > > Thanks > zhong jiang >> + /* >> + * __anon_vma_prepare() requires the mmap_sem to be held >> + * because vm_next and vm_prev must be safe. This can't be guaranteed >> + * in the speculative path. >> + */ >> + if (unlikely(!vma->anon_vma)) >> + goto out_put; >> + >> + vmf.vma_flags = READ_ONCE(vma->vm_flags); >> + vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot); >> + >> + /* Can't call userland page fault handler in the speculative path */ >> + if (unlikely(vmf.vma_flags & VM_UFFD_MISSING)) >> + goto out_put; >> + >> + if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP) >> + /* >> + * This could be detected by the check address against VMA's >> + * boundaries but we want to trace it as not supported instead >> + * of changed. >> + */ >> + goto out_put; >> + >> + if (address < READ_ONCE(vma->vm_start) >> + || READ_ONCE(vma->vm_end) <= address) >> + goto out_put; >> + >> + if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE, >> + flags & FAULT_FLAG_INSTRUCTION, >> + flags & FAULT_FLAG_REMOTE)) { >> + ret = VM_FAULT_SIGSEGV; >> + goto out_put; >> + } >> + >> + /* This is one is required to check that the VMA has write access set */ >> + if (flags & FAULT_FLAG_WRITE) { >> + if (unlikely(!(vmf.vma_flags & VM_WRITE))) { >> + ret = VM_FAULT_SIGSEGV; >> + goto out_put; >> + } >> + } else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) { >> + ret = VM_FAULT_SIGSEGV; >> + goto out_put; >> + } >> + >> +#ifdef CONFIG_NUMA >> + /* >> + * MPOL_INTERLEAVE implies additional checks in >> + * mpol_misplaced() which are not compatible with the >> + *speculative page fault processing. >> + */ >> + pol = __get_vma_policy(vma, address); >> + if (!pol) >> + pol = get_task_policy(current); >> + if (pol && pol->mode == MPOL_INTERLEAVE) >> + goto out_put; >> +#endif >> + >> + /* >> + * Do a speculative lookup of the PTE entry. >> + */ >> + local_irq_disable(); >> + pgd = pgd_offset(mm, address); >> + pgdval = READ_ONCE(*pgd); >> + if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval))) >> + goto out_walk; >> + >> + p4d = p4d_offset(pgd, address); >> + p4dval = READ_ONCE(*p4d); >> + if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval))) >> + goto out_walk; >> + >> + vmf.pud = pud_offset(p4d, address); >> + pudval = READ_ONCE(*vmf.pud); >> + if (pud_none(pudval) || unlikely(pud_bad(pudval))) >> + goto out_walk; >> + >> + /* Huge pages at PUD level are not supported. */ >> + if (unlikely(pud_trans_huge(pudval))) >> + goto out_walk; >> + >> + vmf.pmd = pmd_offset(vmf.pud, address); >> + vmf.orig_pmd = READ_ONCE(*vmf.pmd); >> + /* >> + * pmd_none could mean that a hugepage collapse is in progress >> + * in our back as collapse_huge_page() mark it before >> + * invalidating the pte (which is done once the IPI is catched >> + * by all CPU and we have interrupt disabled). >> + * For this reason we cannot handle THP in a speculative way since we >> + * can't safely indentify an in progress collapse operation done in our >> + * back on that PMD. 
>> + * Regarding the order of the following checks, see comment in >> + * pmd_devmap_trans_unstable() >> + */ >> + if (unlikely(pmd_devmap(vmf.orig_pmd) || >> + pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) || >> + is_swap_pmd(vmf.orig_pmd))) >> + goto out_walk; >> + >> + /* >> + * The above does not allocate/instantiate page-tables because doing so >> + * would lead to the possibility of instantiating page-tables after >> + * free_pgtables() -- and consequently leaking them. >> + * >> + * The result is that we take at least one !speculative fault per PMD >> + * in order to instantiate it. >> + */ >> + >> + vmf.pte = pte_offset_map(vmf.pmd, address); >> + vmf.orig_pte = READ_ONCE(*vmf.pte); >> + barrier(); /* See comment in handle_pte_fault() */ >> + if (pte_none(vmf.orig_pte)) { >> + pte_unmap(vmf.pte); >> + vmf.pte = NULL; >> + } >> + >> + vmf.vma = vma; >> + vmf.pgoff = linear_page_index(vma, address); >> + vmf.gfp_mask = __get_fault_gfp_mask(vma); >> + vmf.sequence = seq; >> + vmf.flags = flags; >> + >> + local_irq_enable(); >> + >> + /* >> + * We need to re-validate the VMA after checking the bounds, otherwise >> + * we might have a false positive on the bounds. >> + */ >> + if (read_seqcount_retry(&vma->vm_sequence, seq)) >> + goto out_put; >> + >> + mem_cgroup_oom_enable(); >> + ret = handle_pte_fault(&vmf); >> + mem_cgroup_oom_disable(); >> + >> + put_vma(vma); >> + >> + /* >> + * The task may have entered a memcg OOM situation but >> + * if the allocation error was handled gracefully (no >> + * VM_FAULT_OOM), there is no need to kill anything. >> + * Just clean up the OOM state peacefully. >> + */ >> + if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM)) >> + mem_cgroup_oom_synchronize(false); >> + return ret; >> + >> +out_walk: >> + local_irq_enable(); >> +out_put: >> + put_vma(vma); >> + return ret; >> +} >> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */ >> + >> /* >> * By the time we get here, we already hold the mm semaphore >> * > >
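
For readers who want to see the core idea of the quoted commit message in
isolation (sample the VMA's fields, then trust the sample only if the
sequence count says nothing changed meanwhile), here is a minimal,
self-contained user-space C sketch. It is an illustration only: the struct,
function names and memory-ordering choices below are invented for the
example, while the patch itself relies on the kernel's seqcount API
(raw_read_seqcount()/read_seqcount_retry() on vma->vm_sequence) and on SRCU
to keep the VMA object alive.

/*
 * Illustrative only: a user-space approximation of the seqcount
 * validate-and-retry scheme used by the speculative fault path.
 * A single writer at a time is assumed (in the kernel, VMA writers
 * are serialized by mmap_sem held for writing).
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct spec_vma {                        /* stand-in for vm_area_struct */
	atomic_uint seq;                 /* even: stable, odd: update in flight */
	_Atomic unsigned long vm_start;  /* atomic so the racy reads/writes are */
	_Atomic unsigned long vm_end;    /* well defined in plain C11           */
};

/* Writer: make the counter odd, update the fields, make it even again. */
static void vma_update(struct spec_vma *v, unsigned long start, unsigned long end)
{
	unsigned int s = atomic_load_explicit(&v->seq, memory_order_relaxed);

	atomic_store_explicit(&v->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&v->vm_start, start, memory_order_relaxed);
	atomic_store_explicit(&v->vm_end, end, memory_order_relaxed);
	atomic_store_explicit(&v->seq, s + 2, memory_order_release);
}

/* Reader: snapshot the fields, then validate; false means "retry under the lock". */
static bool vma_snapshot(struct spec_vma *v, unsigned long *start, unsigned long *end)
{
	unsigned int s1 = atomic_load_explicit(&v->seq, memory_order_acquire);

	if (s1 & 1)		/* an update is in progress: do not speculate */
		return false;

	*start = atomic_load_explicit(&v->vm_start, memory_order_relaxed);
	*end   = atomic_load_explicit(&v->vm_end, memory_order_relaxed);

	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&v->seq, memory_order_relaxed) == s1;
}

int main(void)
{
	struct spec_vma v;
	unsigned long s, e;

	atomic_init(&v.seq, 0);
	atomic_init(&v.vm_start, 0x1000);
	atomic_init(&v.vm_end, 0x2000);

	vma_update(&v, 0x1000, 0x3000);
	if (vma_snapshot(&v, &s, &e))
		printf("stable snapshot: [%#lx, %#lx)\n", s, e);
	else
		printf("VMA changed, fall back to the mmap_sem path\n");
	return 0;
}

In the patch, vma_has_changed() plays the role of the final comparison, and
a mismatch makes the speculative path return VM_FAULT_RETRY so the fault is
retried under mmap_sem.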
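
The discussion above about vm_ops->fault() and FAULT_FLAG_RETRY_NOWAIT only
matters because the speculative path is strictly optional: whenever it
cannot proceed it returns VM_FAULT_RETRY and the architecture fault handler
falls back to the classic, mmap_sem-protected handle_mm_fault() path (the
arch wiring is not part of this patch). The self-contained sketch below
shows just that calling pattern; the stub functions and the SPF_FAULT_*
codes are made up for the example and are not the kernel's VM_FAULT_*
values.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative fault codes; the kernel's VM_FAULT_* values are different. */
#define SPF_FAULT_OK	0x0
#define SPF_FAULT_RETRY	0x1

/* Stub for handle_speculative_fault(): gives up whenever the VMA is unstable. */
static int try_speculative_fault(unsigned long address, bool vma_stable)
{
	if (!vma_stable)
		return SPF_FAULT_RETRY;	/* VMA changed or unsupported case */
	printf("  fault at %#lx handled without mmap_sem\n", address);
	return SPF_FAULT_OK;
}

/* Stub for the classic handle_mm_fault() path, entered with mmap_sem held. */
static int classic_fault(unsigned long address)
{
	printf("  fault at %#lx handled under mmap_sem\n", address);
	return SPF_FAULT_OK;
}

/*
 * The calling pattern the patch enables: try the lockless path first and
 * only fall back to the existing lock-protected path on SPF_FAULT_RETRY.
 */
static int do_page_fault(unsigned long address, bool vma_stable)
{
	int ret = try_speculative_fault(address, vma_stable);

	if (ret != SPF_FAULT_RETRY)
		return ret;		/* handled without taking mmap_sem */

	/* down_read(&mm->mmap_sem) would go here in the real handler. */
	return classic_fault(address);
}

int main(void)
{
	printf("stable VMA:\n");
	do_page_fault(0x7f0000001000UL, true);
	printf("racing VMA update:\n");
	do_page_fault(0x7f0000002000UL, false);
	return 0;
}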