Subject: Re: [PATCH v10 08/25] mm: VMA sequence count
From: Laurent Dufour
Date: Mon, 30 Apr 2018 17:14:27 +0200
To: Minchan Kim
Cc: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org,
    kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz,
    Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
    paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com,
    Will Deacon, Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov,
    kemi.wang@intel.com, sergey.senozhatsky.work@gmail.com, Daniel Jordan,
    David Rientjes, Jerome Glisse, Ganesh Mahendran,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com, npiggin@gmail.com,
    bsingharora@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org
References: <1523975611-15978-1-git-send-email-ldufour@linux.vnet.ibm.com>
    <1523975611-15978-9-git-send-email-ldufour@linux.vnet.ibm.com>
    <20180423064259.GC114098@rodete-desktop-imager.corp.google.com>
In-Reply-To: <20180423064259.GC114098@rodete-desktop-imager.corp.google.com>
Message-Id: <06b996b0-b831-3d39-8a99-792abfb6a4d1@linux.vnet.ibm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 23/04/2018 08:42, Minchan Kim wrote:
> On Tue, Apr 17, 2018 at 04:33:14PM +0200, Laurent Dufour wrote:
>> From: Peter Zijlstra
>>
>> Wrap the VMA modifications (vma_adjust/unmap_page_range) with sequence
>> counts such that we can easily test if a VMA is changed.
>
> So, seqcount is to protect modifying all attributes of vma?

The seqcount is used to protect the fields that are checked during the
speculative page fault, such as the VMA's boundaries and protections.

>>
>> The unmap_page_range() one allows us to make assumptions about
>> page-tables; when we find the seqcount hasn't changed we can assume
>> page-tables are still valid.
>
> Hmm, seqcount covers page-table, too.
> Please describe what the seqcount want to protect.

The calls to vm_write_begin/end() in unmap_page_range() are used to
detect when a VMA is being unmapped, so that a new page fault will not
be satisfied for this VMA. This protects the VMA unmapping operation,
not the page tables themselves.

>>
>> The flip side is that we cannot distinguish between a vma_adjust() and
>> the unmap_page_range() -- where with the former we could have
>> re-checked the vma bounds against the address.
>>
>> Signed-off-by: Peter Zijlstra (Intel)
>>
>> [Port to 4.12 kernel]
>> [Build depends on CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Introduce vm_write_* inline function depending on
>>  CONFIG_SPECULATIVE_PAGE_FAULT]
>> [Fix lock dependency between mapping->i_mmap_rwsem and vma->vm_sequence by
>>  using vm_raw_write* functions]
>> [Fix a lock dependency warning in mmap_region() when entering the error
>>  path]
>> [move sequence initialisation INIT_VMA()]
>> Signed-off-by: Laurent Dufour
>> ---
>>  include/linux/mm.h       | 44 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/mm_types.h |  3 +++
>>  mm/memory.c              |  2 ++
>>  mm/mmap.c                | 31 +++++++++++++++++++++++++++++++
>>  4 files changed, 80 insertions(+)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index efc1248b82bd..988daf7030c9 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -1264,6 +1264,9 @@ struct zap_details {
>>  static inline void INIT_VMA(struct vm_area_struct *vma)
>>  {
>>  	INIT_LIST_HEAD(&vma->anon_vma_chain);
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +	seqcount_init(&vma->vm_sequence);
>> +#endif
>>  }
>>
>>  struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
>> @@ -1386,6 +1389,47 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
>>  	unmap_mapping_range(mapping, holebegin, holelen, 0);
>>  }
>>
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +static inline void vm_write_begin(struct vm_area_struct *vma)
>> +{
>> +	write_seqcount_begin(&vma->vm_sequence);
>> +}
>> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
>> +					 int subclass)
>> +{
>> +	write_seqcount_begin_nested(&vma->vm_sequence, subclass);
>> +}
>> +static inline void vm_write_end(struct vm_area_struct *vma)
>> +{
>> +	write_seqcount_end(&vma->vm_sequence);
>> +}
>> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
>> +{
>> +	raw_write_seqcount_begin(&vma->vm_sequence);
>> +}
>> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
>> +{
>> +	raw_write_seqcount_end(&vma->vm_sequence);
>> +}
>> +#else
>> +static inline void vm_write_begin(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_write_begin_nested(struct vm_area_struct *vma,
>> +					 int subclass)
>> +{
>> +}
>> +static inline void vm_write_end(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_raw_write_begin(struct vm_area_struct *vma)
>> +{
>> +}
>> +static inline void vm_raw_write_end(struct vm_area_struct *vma)
>> +{
>> +}
>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>> +
>>  extern int access_process_vm(struct task_struct *tsk, unsigned long addr,
>>  		void *buf, int len, unsigned int gup_flags);
>>  extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 21612347d311..db5e9d630e7a 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -335,6 +335,9 @@ struct vm_area_struct {
>>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>>  #endif
>>  	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>> +	seqcount_t vm_sequence;
>> +#endif
>>  } __randomize_layout;
>>
>>  struct core_thread {
>> diff --git a/mm/memory.c b/mm/memory.c
>> index f86efcb8e268..f7fed053df80 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -1503,6 +1503,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>>  	unsigned long next;
>>
>>  	BUG_ON(addr >= end);
>
> The comment about saying it aims for page-table stability will help.

A comment may be added mentioning that we use the seqcount to indicate
that the VMA is being modified or unmapped. But there is no real page
table protection here, and I think it would be confusing to talk about
page table stability at this point.

>
>> +	vm_write_begin(vma);
>>  	tlb_start_vma(tlb, vma);
>>  	pgd = pgd_offset(vma->vm_mm, addr);
>>  	do {
>> @@ -1512,6 +1513,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>>  		next = zap_p4d_range(tlb, vma, pgd, addr, next, details);
>>  	} while (pgd++, addr = next, addr != end);
>>  	tlb_end_vma(tlb, vma);
>> +	vm_write_end(vma);
>>  }
>>
>>
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index 8bd9ae1dfacc..813e49589ea1 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -692,6 +692,30 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	long adjust_next = 0;
>>  	int remove_next = 0;
>>
>> +	/*
>> +	 * Why use the vm_raw_write*() functions here to avoid lockdep's warning?
>> +	 *
>> +	 * Lockdep is complaining about a theoretical lock dependency, involving
>> +	 * 3 locks:
>> +	 *   mapping->i_mmap_rwsem --> vma->vm_sequence --> fs_reclaim
>> +	 *
>> +	 * Here are the major paths leading to this dependency:
>> +	 *  1. __vma_adjust()           mmap_sem     -> vm_sequence  -> i_mmap_rwsem
>> +	 *  2. move_vmap()              mmap_sem     -> vm_sequence  -> fs_reclaim
>> +	 *  3. __alloc_pages_nodemask() fs_reclaim   -> i_mmap_rwsem
>> +	 *  4. unmap_mapping_range()    i_mmap_rwsem -> vm_sequence
>> +	 *
>> +	 * So there is no way to solve this easily, especially because in
>> +	 * unmap_mapping_range() the i_mmap_rwsem is grabbed while the impacted
>> +	 * VMAs are not yet known.
>> +	 * However, the way vm_seq is used guarantees that we will
>> +	 * never block on it, since we just check its value and never wait
>> +	 * for it to move, see vma_has_changed() and handle_speculative_fault().
>> +	 */
>> +	vm_raw_write_begin(vma);
>> +	if (next)
>> +		vm_raw_write_begin(next);
>> +
>>  	if (next && !insert) {
>>  		struct vm_area_struct *exporter = NULL, *importer = NULL;
>>
>> @@ -902,6 +926,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  		anon_vma_merge(vma, next);
>>  		mm->map_count--;
>>  		mpol_put(vma_policy(next));
>> +		vm_raw_write_end(next);
>>  		kmem_cache_free(vm_area_cachep, next);
>>  		/*
>>  		 * In mprotect's case 6 (see comments on vma_merge),
>> @@ -916,6 +941,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  		 * "vma->vm_next" gap must be updated.
>>  		 */
>>  		next = vma->vm_next;
>> +		if (next)
>> +			vm_raw_write_begin(next);
>>  	} else {
>>  		/*
>>  		 * For the scope of the comment "next" and
>> @@ -962,6 +989,10 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	if (insert && file)
>>  		uprobe_mmap(insert);
>>
>> +	if (next && next != vma)
>> +		vm_raw_write_end(next);
>> +	vm_raw_write_end(vma);
>> +
>>  	validate_mm(mm);
>>
>>  	return 0;
>> --
>> 2.7.4
>>
>