From: Laurent Dufour
Subject: Re: [PATCH v9 04/24] mm: Prepare for FAULT_FLAG_SPECULATIVE
To: David Rientjes
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org,
    kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net,
    jack@suse.cz, Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
    paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
    Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov, kemi.wang@intel.com,
    sergey.senozhatsky.work@gmail.com, Daniel Jordan, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
    npiggin@gmail.com, bsingharora@gmail.com, Tim Chen, linuxppc-dev@lists.ozlabs.org,
    x86@kernel.org
References: <1520963994-28477-1-git-send-email-ldufour@linux.vnet.ibm.com> <1520963994-28477-5-git-send-email-ldufour@linux.vnet.ibm.com>
Date: Wed, 28 Mar 2018 12:27:29 +0200
Message-Id: <361fa6e7-3c17-e1b8-8046-af72c4459613@linux.vnet.ibm.com>

On 25/03/2018 23:50, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
>
>> From: Peter Zijlstra
>>
>> When speculating faults (without holding mmap_sem) we need to validate
>> that the vma against which we loaded pages is still valid when we're
>> ready to install the new PTE.
>>
>> Therefore, replace the pte_offset_map_lock() calls that (re)take the
>> PTL with pte_map_lock() which can fail in case we find the VMA changed
>> since we started the fault.
>>
>
> Based on how it's used, I would have suspected this to be named
> pte_map_trylock().
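
At this point in the series pte_map_lock() cannot fail, so the try-lock
semantics only really show up once the speculative path is wired in. Just to
illustrate the intent (this is not a hunk from a later patch, and
vma_has_changed() below is a purely hypothetical stand-in for whatever VMA
revalidation the later patches introduce), the helper would end up looking
roughly like:

static inline bool pte_map_lock(struct vm_fault *vmf)
{
	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
		/* Classic fault: mmap_sem is held, just take the PTL. */
		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
					       vmf->address, &vmf->ptl);
		return true;
	}

	/*
	 * Speculative fault: mmap_sem is not held, so the VMA may have been
	 * modified or unmapped since the fault started.  Bail out instead of
	 * taking the PTL; the caller undoes its side effects and returns
	 * VM_FAULT_RETRY.
	 */
	if (vma_has_changed(vmf))
		return false;

	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	return true;
}

Whatever the final shape, every converted call site treats a false return as
"the VMA may have changed: undo any side effects and return VM_FAULT_RETRY",
which is indeed try-lock behaviour.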
>
>> Signed-off-by: Peter Zijlstra (Intel)
>>
>> [Port to 4.12 kernel]
>> [Remove the comment about the fault_env structure which has been
>>  implemented as the vm_fault structure in the kernel]
>> [move pte_map_lock()'s definition upper in the file]
>> Signed-off-by: Laurent Dufour
>> ---
>>  include/linux/mm.h |  1 +
>>  mm/memory.c        | 56 ++++++++++++++++++++++++++++++++++++++----------------
>>  2 files changed, 41 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 4d02524a7998..2f3e98edc94a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -300,6 +300,7 @@ extern pgprot_t protection_map[16];
>>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
>>
>>  #define FAULT_FLAG_TRACE \
>>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>
> I think FAULT_FLAG_SPECULATIVE should be introduced in the patch that
> actually uses it.

I think you're right, I'll move this define down in the series.

>> diff --git a/mm/memory.c b/mm/memory.c
>> index e0ae4999c824..8ac241b9f370 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -2288,6 +2288,13 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>>
>> +static bool pte_map_lock(struct vm_fault *vmf)
>
> inline?

Agreed.

>> +{
>> +	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>> +				       vmf->address, &vmf->ptl);
>> +	return true;
>> +}
>> +
>>  /*
>>   * handle_pte_fault chooses page fault handler according to an entry which was
>>   * read non-atomically. Before making any commitment, on those architectures
>> @@ -2477,6 +2484,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  	const unsigned long mmun_start = vmf->address & PAGE_MASK;
>>  	const unsigned long mmun_end = mmun_start + PAGE_SIZE;
>>  	struct mem_cgroup *memcg;
>> +	int ret = VM_FAULT_OOM;
>>
>>  	if (unlikely(anon_vma_prepare(vma)))
>>  		goto oom;
>> @@ -2504,7 +2512,11 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  	/*
>>  	 * Re-check the pte - we dropped the lock
>>  	 */
>> -	vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
>> +	if (!pte_map_lock(vmf)) {
>> +		mem_cgroup_cancel_charge(new_page, memcg, false);
>> +		ret = VM_FAULT_RETRY;
>> +		goto oom_free_new;
>> +	}
>
> Ugh, but we aren't oom here, so maybe rename oom_free_new so that it makes
> sense for return values other than VM_FAULT_OOM?

You're right, that label name is no longer accurate. I'll rename it to
"out_free_new", and also rename the "oom" label to "out" since it is now
generic as well.

>>  	if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
>>  		if (old_page) {
>>  			if (!PageAnon(old_page)) {
>> @@ -2596,7 +2608,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  oom:
>>  	if (old_page)
>>  		put_page(old_page);
>> -	return VM_FAULT_OOM;
>> +	return ret;
>>  }
>>
>>  /**
>> @@ -2617,8 +2629,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>>  int finish_mkwrite_fault(struct vm_fault *vmf)
>>  {
>>  	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
>> -	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
>> -				       &vmf->ptl);
>> +	if (!pte_map_lock(vmf))
>> +		return VM_FAULT_RETRY;
>>  	/*
>>  	 * We might have raced with another page fault while we released the
>>  	 * pte_offset_map_lock.
>> @@ -2736,8 +2748,11 @@ static int do_wp_page(struct vm_fault *vmf)
>>  			get_page(vmf->page);
>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>  			lock_page(vmf->page);
>> -			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -					vmf->address, &vmf->ptl);
>> +			if (!pte_map_lock(vmf)) {
>> +				unlock_page(vmf->page);
>> +				put_page(vmf->page);
>> +				return VM_FAULT_RETRY;
>> +			}
>>  			if (!pte_same(*vmf->pte, vmf->orig_pte)) {
>>  				unlock_page(vmf->page);
>>  				pte_unmap_unlock(vmf->pte, vmf->ptl);
>> @@ -2947,8 +2962,10 @@ int do_swap_page(struct vm_fault *vmf)
>>  			 * Back out if somebody else faulted in this pte
>>  			 * while we released the pte lock.
>>  			 */
>
> Comment needs updating, pte_same() isn't the only reason to bail out here.

I'll update it to:

	/*
	 * Back out if the VMA has changed behind our back during
	 * a speculative page fault or if somebody else
	 * faulted in this pte while we released the pte lock.
	 */

>
>> -			vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -					vmf->address, &vmf->ptl);
>> +			if (!pte_map_lock(vmf)) {
>> +				delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>> +				return VM_FAULT_RETRY;
>> +			}
>>  			if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
>>  				ret = VM_FAULT_OOM;
>>  			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
>
> Not crucial, but it would be nice if this could do goto out instead,
> otherwise this is the first mid function return.

OK, will do.

>
>> @@ -3003,8 +3020,11 @@ int do_swap_page(struct vm_fault *vmf)
>>  	/*
>>  	 * Back out if somebody else already faulted in this pte.
>>  	 */
>
> Same as above.

OK, changing as above.

>
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	if (!pte_map_lock(vmf)) {
>> +		ret = VM_FAULT_RETRY;
>> +		mem_cgroup_cancel_charge(page, memcg, false);
>> +		goto out_page;
>> +	}
>>  	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
>>  		goto out_nomap;
>>
>
> mem_cgroup_try_charge() is done before grabbing pte_offset_map_lock(), why
> does the out_nomap exit path do mem_cgroup_cancel_charge();
> pte_unmap_unlock()?  If the pte lock can be dropped first, there's no need
> to embed the mem_cgroup_cancel_charge() here.

I think we can safely invert the calls to mem_cgroup_cancel_charge() and
pte_unmap_unlock(), and then introduce a new label to jump to when
pte_map_lock() fails. Something like this:

@@ -3001,10 +3020,13 @@ int do_swap_page(struct vm_fault *vmf)
 	}

 	/*
-	 * Back out if somebody else already faulted in this pte.
+	 * Back out if the VMA has changed behind our back during a speculative
+	 * page fault or if somebody else already faulted in this pte.
 	 */
-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
+		goto out_cancel_cgroup;
+	}
 	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
 		goto out_nomap;
@@ -3082,8 +3104,9 @@ int do_swap_page(struct vm_fault *vmf)
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge(page, memcg, false);
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
+out_cancel_cgroup:
+	mem_cgroup_cancel_charge(page, memcg, false);
 out_page:
 	unlock_page(page);
 out_release:

>> @@ -3133,8 +3153,8 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>  			!mm_forbids_zeropage(vma->vm_mm)) {
>>  		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
>>  						vma->vm_page_prot));
>> -		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> -				vmf->address, &vmf->ptl);
>> +		if (!pte_map_lock(vmf))
>> +			return VM_FAULT_RETRY;
>>  		if (!pte_none(*vmf->pte))
>>  			goto unlock;
>>  		ret = check_stable_address_space(vma->vm_mm);
>> @@ -3169,8 +3189,11 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>  	if (vma->vm_flags & VM_WRITE)
>>  		entry = pte_mkwrite(pte_mkdirty(entry));
>>
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	if (!pte_map_lock(vmf)) {
>> +		mem_cgroup_cancel_charge(page, memcg, false);
>> +		put_page(page);
>> +		return VM_FAULT_RETRY;
>> +	}
>>  	if (!pte_none(*vmf->pte))
>>  		goto release;
>>
>
> This is more spaghetti, can the exit path be fixed up so we order things
> consistently for all gotos?

I agree; this was due to the inverted ordering of mem_cgroup_cancel_charge()
and pte_unmap_unlock(). This will become:

@@ -3170,14 +3193,16 @@ static int do_anonymous_page(struct vm_fault *vmf)
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));

-	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
-			&vmf->ptl);
-	if (!pte_none(*vmf->pte))
+	if (!pte_map_lock(vmf)) {
+		ret = VM_FAULT_RETRY;
 		goto release;
+	}
+	if (!pte_none(*vmf->pte))
+		goto unlock_and_release;

 	ret = check_stable_address_space(vma->vm_mm);
 	if (ret)
-		goto release;
+		goto unlock_and_release;

 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
@@ -3199,10 +3224,12 @@ static int do_anonymous_page(struct vm_fault *vmf)
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return ret;
+unlock_and_release:
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
 release:
 	mem_cgroup_cancel_charge(page, memcg, false);
 	put_page(page);
-	goto unlock;
+	return ret;
 oom_free_page:
 	put_page(page);
 oom:

Thanks,
Laurent.

>
>> @@ -3294,8 +3317,9 @@ static int pte_alloc_one_map(struct vm_fault *vmf)
>>  	 * pte_none() under vmf->ptl protection when we return to
>>  	 * alloc_set_pte().
>>  	 */
>> -	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
>> -			&vmf->ptl);
>> +	if (!pte_map_lock(vmf))
>> +		return VM_FAULT_RETRY;
>> +
>>  	return 0;
>>  }
>>
>
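
P.S. To make the wp_page_copy() label renaming discussed above concrete, the
tail of that function should end up looking roughly like this (sketch only,
reconstructed from the hunks quoted above, not a hunk from the next version
of the patch):

	if (!pte_map_lock(vmf)) {
		mem_cgroup_cancel_charge(new_page, memcg, false);
		ret = VM_FAULT_RETRY;
		goto out_free_new;
	}
	...
out_free_new:
	put_page(new_page);
out:
	if (old_page)
		put_page(old_page);
	/* VM_FAULT_OOM by default, VM_FAULT_RETRY when the PTL was not taken */
	return ret;
}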