Date: Mon, 22 Apr 2019 16:06:54 -0400
From: Jerome Glisse
To: Laurent Dufour
Cc: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org,
	kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
	jack@suse.cz, Matthew Wilcox, aneesh.kumar@linux.ibm.com,
	benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
	Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
	Sergey Senozhatsky,
	sergey.senozhatsky.work@gmail.com, Andrea Arcangeli,
	Alexei Starovoitov, kemi.wang@intel.com, Daniel Jordan,
	David Rientjes, Ganesh Mahendran, Minchan Kim, Punit Agrawal,
	vinayak menon, Yang Shi, zhong jiang, Haiyan Song, Balbir Singh,
	sj38.park@gmail.com, Michel Lespinasse, Mike Rapoport,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	haren@linux.vnet.ibm.com, npiggin@gmail.com,
	paulmck@linux.vnet.ibm.com, Tim Chen, linuxppc-dev@lists.ozlabs.org,
	x86@kernel.org
Subject: Re: [PATCH v12 13/31] mm: cache some VMA fields in the vm_fault structure
Message-ID: <20190422200654.GD14666@redhat.com>
References: <20190416134522.17540-1-ldufour@linux.ibm.com>
 <20190416134522.17540-14-ldufour@linux.ibm.com>
In-Reply-To: <20190416134522.17540-14-ldufour@linux.ibm.com>

On Tue, Apr 16, 2019 at 03:45:04PM +0200, Laurent Dufour wrote:
> When handling a speculative page fault, the vma->vm_flags and
> vma->vm_page_prot fields are read after the page table lock is released,
> so there is no longer any guarantee that these fields will not change
> behind our back. They are therefore saved in the vm_fault structure
> before the VMA is checked for changes.
> 
> In detail: when we deal with a speculative page fault, the mmap_sem is
> not taken, so VMA changes can occur in parallel. Whenever a VMA change
> is made that impacts page fault processing, the VMA sequence counter is
> expected to be bumped. During page fault processing, once the PTE is
> locked, we check the VMA sequence counter to detect changes made behind
> our back; if no change is detected, we can continue. But this does not
> prevent the VMA from being changed behind our back while the PTE is
> locked, so the VMA fields used while the PTE is locked must be saved to
> ensure that we are working with *stable* values. This is important
> because the PTE changes are made with regard to these VMA fields, and
> they need to be consistent. This concerns the vma->vm_flags and
> vma->vm_page_prot fields.
> 
> This patch also sets the fields in hugetlb_no_page() and
> __collapse_huge_page_swapin() even though they are not needed by the
> callee.
> 
> Signed-off-by: Laurent Dufour

I am unsure about one thing, see below, so you might need to update that
hunk, but it would not change the structure of the patch, thus:

Reviewed-by: Jérôme Glisse

> ---
>  include/linux/mm.h | 10 +++++++--
>  mm/huge_memory.c   |  6 +++---
>  mm/hugetlb.c       |  2 ++
>  mm/khugepaged.c    |  2 ++
>  mm/memory.c        | 53 ++++++++++++++++++++++++----------------------
>  mm/migrate.c       |  2 +-
>  6 files changed, 44 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 5d45b7d8718d..f465bb2b049e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -439,6 +439,12 @@ struct vm_fault {
>  					 * page table to avoid allocation from
>  					 * atomic context.
>  					 */
> +	/*
> +	 * These entries are required when handling speculative page fault.
> +	 * This way the page handling is done using consistent field values.
> +	 */
> +	unsigned long vma_flags;
> +	pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> @@ -781,9 +787,9 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite. But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> -	if (likely(vma->vm_flags & VM_WRITE))
> +	if (likely(vma_flags & VM_WRITE))
>  		pte = pte_mkwrite(pte);
>  	return pte;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 823688414d27..865886a689ee 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1244,8 +1244,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
>  
>  	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
>  		pte_t entry;
> -		entry = mk_pte(pages[i], vma->vm_page_prot);
> -		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +		entry = mk_pte(pages[i], vmf->vma_page_prot);
> +		entry = maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>  		memcg = (void *)page_private(pages[i]);
>  		set_page_private(pages[i], 0);
>  		page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
> @@ -2228,7 +2228,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  			entry = pte_swp_mksoft_dirty(entry);
>  		} else {
>  			entry = mk_pte(page + i, READ_ONCE(vma->vm_page_prot));
> -			entry = maybe_mkwrite(entry, vma);
> +			entry = maybe_mkwrite(entry, vma->vm_flags);
>  			if (!write)
>  				entry = pte_wrprotect(entry);
>  			if (!young)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 109f5de82910..13246da4bc50 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3812,6 +3812,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  			.vma = vma,
>  			.address = haddr,
>  			.flags = flags,
> +			.vma_flags = vma->vm_flags,
> +			.vma_page_prot = vma->vm_page_prot,

Shouldn't you use READ_ONCE()? I doubt the compiler will do something
creative with a struct initialization, but as you are using WRITE_ONCE()
to update those fields, pairing the reads with READ_ONCE() where the
mmap_sem is not always taken might make sense; see the sketch below.

>  			/*
>  			 * Hard to debug if it ends up being
>  			 * used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6a0cbca3885e..42469037240a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -888,6 +888,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  		.flags = FAULT_FLAG_ALLOW_RETRY,
>  		.pmd = pmd,
>  		.pgoff = linear_page_index(vma, address),
> +		.vma_flags = vma->vm_flags,
> +		.vma_page_prot = vma->vm_page_prot,

Same as above.

[...]
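To be explicit about what I mean by pairing with READ_ONCE() in the two
initializers above (and the one further below), here is a purely
illustrative sketch for the hugetlb_no_page() case, assuming the
WRITE_ONCE() updates of these fields done elsewhere in the series; it is
a suggestion, not the patch's actual code:

	struct vm_fault vmf = {
		.vma = vma,
		.address = haddr,
		.flags = flags,
		/* Pair the reads with the WRITE_ONCE() updates of the VMA fields. */
		.vma_flags = READ_ONCE(vma->vm_flags),
		.vma_page_prot = READ_ONCE(vma->vm_page_prot),
	};

Whether READ_ONCE() really buys anything here, especially for pgprot_t,
is exactly the part I am unsure about.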
>  	return VM_FAULT_FALLBACK;
> @@ -3924,6 +3925,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
>  		.flags = flags,
>  		.pgoff = linear_page_index(vma, address),
>  		.gfp_mask = __get_fault_gfp_mask(vma),
> +		.vma_flags = vma->vm_flags,
> +		.vma_page_prot = vma->vm_page_prot,

Same as above.

>  	};
>  	unsigned int dirty = flags & FAULT_FLAG_WRITE;
>  	struct mm_struct *mm = vma->vm_mm;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f2ecc2855a12..a9138093a8e2 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -240,7 +240,7 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>  		 */
>  		entry = pte_to_swp_entry(*pvmw.pte);
>  		if (is_write_migration_entry(entry))
> -			pte = maybe_mkwrite(pte, vma);
> +			pte = maybe_mkwrite(pte, vma->vm_flags);
>  
>  		if (unlikely(is_zone_device_page(new))) {
>  			if (is_device_private_page(new)) {
> -- 
> 2.21.0
> 
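As a side note for anyone skimming the thread, the ordering the changelog
describes boils down to the rough sketch below. This is not literal code
from the series: pte_map_lock() and the vm_sequence count are my shorthand
for what earlier patches in the series introduce, so treat every name here
as an assumption.

	/*
	 * Speculative path, no mmap_sem held: snapshot the VMA fields first,
	 * then only trust them if the VMA did not change before the PTE
	 * could be locked.
	 */
	vmf.vma_flags = READ_ONCE(vma->vm_flags);
	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);

	/* ... walk the page tables without mmap_sem ... */

	if (!pte_map_lock(&vmf))	/* re-checks vm_sequence under the PTL */
		return VM_FAULT_RETRY;

	/* From here on, use only the cached copies, never vma->vm_flags. */
	entry = mk_pte(page, vmf.vma_page_prot);
	entry = maybe_mkwrite(pte_mkdirty(entry), vmf.vma_flags);

That is also why hugetlb_no_page() and __collapse_huge_page_swapin() set
the fields even though they never run speculatively: a callee cannot tell
which path it was called from, so leaving them uninitialized would be hard
to debug.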