Subject: Re: [PATCH v9 09/24] mm: protect mremap() against SPF handler
From: Laurent Dufour
To: David Rientjes
Cc: paulmck@linux.vnet.ibm.com, peterz@infradead.org,
    akpm@linux-foundation.org, kirill@shutemov.name, ak@linux.intel.com,
    mhocko@kernel.org, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox,
    benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
    Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
    Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov,
    kemi.wang@intel.com, sergey.senozhatsky.work@gmail.com, Daniel Jordan,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    haren@linux.vnet.ibm.com, khandual@linux.vnet.ibm.com,
    npiggin@gmail.com, bsingharora@gmail.com, Tim Chen,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org
References: <1520963994-28477-1-git-send-email-ldufour@linux.vnet.ibm.com>
            <1520963994-28477-10-git-send-email-ldufour@linux.vnet.ibm.com>
Date: Wed, 28 Mar 2018 20:11:28 +0200
Message-Id: <1fe7529a-947c-fdb2-12d2-b38bdd41bb04@linux.vnet.ibm.com>

On 28/03/2018 00:12, David Rientjes wrote:
> On Tue, 13 Mar 2018, Laurent Dufour wrote:
>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 88042d843668..ef6ef0627090 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2189,16 +2189,24 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>>  extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
>>  extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> -	struct vm_area_struct *expand);
>> +	struct vm_area_struct *expand, bool keep_locked);
>>  static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
>>  {
>> -	return __vma_adjust(vma, start, end, pgoff, insert, NULL);
>> +	return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
>>  }
>> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
>> +extern struct vm_area_struct *__vma_merge(struct mm_struct *,
>>  	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>>  	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
>> -	struct mempolicy *, struct vm_userfaultfd_ctx);
>> +	struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
>> +static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
>> +	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>> +	unsigned long vm_flags, struct anon_vma *anon, struct file *file,
>> +	pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
>> +{
>> +	return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
>> +			   pol, uff, false);
>> +}
>
> The first formal to vma_merge() is an mm, not a vma.

Oops!

> This area could use an uncluttering.
>
>>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>>  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>>  	unsigned long addr, int new_below);
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index d6533cb85213..ac32b577a0c9 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
>>   */
>>  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> -	struct vm_area_struct *expand)
>> +	struct vm_area_struct *expand, bool keep_locked)
>>  {
>>  	struct mm_struct *mm = vma->vm_mm;
>>  	struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
>> @@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>
>>  	if (next && next != vma)
>>  		vm_raw_write_end(next);
>> -	vm_raw_write_end(vma);
>> +	if (!keep_locked)
>> +		vm_raw_write_end(vma);
>>
>>  	validate_mm(mm);
>>
>
> This will require a fixup for the following patch where a retval from
> anon_vma_close() can also return without vma locked even though
> keep_locked == true.

Yes, I saw your previous email about that.

> How does vma_merge() handle that error wrt vm_raw_write_begin(vma)?

The unwritten assumption is that when __vma_adjust() returns an error, the
vma is no longer sequence locked. That needs to be spelled out in a big
comment near the __vma_adjust() definition.

This being said, in that case __vma_merge() returns NULL, which means it
didn't do the job (the caller can't tell whether this is due to an error
or because the VMAs cannot be merged). Here again the assumption is that
no VMA is left locked, since the merge operation was not done. But again,
this is not documented at all and I have to fix that.

In addition, the caller copy_vma(), which is the only one calling
__vma_merge() with keep_locked=true, assumes that in that case the VMA is
not locked, and it will allocate a new VMA which will be locked before it
is inserted in the RB tree.

So it should work, but I'll have to put real effort into documenting that
in the code.

Thanks a lot for raising this!
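Concretely, the comment I have in mind near the __vma_adjust() and
__vma_merge() definitions could look something like this (a rough sketch
of the contract described above, not the final wording):

	/*
	 * Sequence-lock contract when keep_locked is true:
	 *
	 * - __vma_adjust(): on success, @vma is left write-locked
	 *   (vm_raw_write_begin()) and the caller must release it with
	 *   vm_raw_write_end().  On error, no VMA is left locked, even
	 *   when keep_locked == true.
	 *
	 * - __vma_merge(): on a successful merge, the returned VMA is
	 *   left write-locked.  A NULL return means either an error or
	 *   that no merge was possible; in both cases no VMA is left
	 *   locked, and a keep_locked caller such as copy_vma() must
	 *   lock any newly allocated VMA itself before linking it into
	 *   the RB tree.
	 */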
>
>> @@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
>>   * parameter) may establish ptes with the wrong permissions of NNNN
>>   * instead of the right permissions of XXXX.
>>   */
>> -struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> +struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>>  		struct vm_area_struct *prev, unsigned long addr,
>>  		unsigned long end, unsigned long vm_flags,
>>  		struct anon_vma *anon_vma, struct file *file,
>>  		pgoff_t pgoff, struct mempolicy *policy,
>> -		struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
>> +		struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
>> +		bool keep_locked)
>>  {
>>  	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>>  	struct vm_area_struct *area, *next;
>> @@ -1185,10 +1187,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>  			/* cases 1, 6 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>>  					 next->vm_end, prev->vm_pgoff, NULL,
>> -					 prev);
>> +					 prev, keep_locked);
>>  		} else				/* cases 2, 5, 7 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>> -					 end, prev->vm_pgoff, NULL, prev);
>> +					 end, prev->vm_pgoff, NULL, prev,
>> +					 keep_locked);
>>  		if (err)
>>  			return NULL;
>>  		khugepaged_enter_vma_merge(prev, vm_flags);
>> @@ -1205,10 +1208,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>  					     vm_userfaultfd_ctx)) {
>>  		if (prev && addr < prev->vm_end)	/* case 4 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>> -					 addr, prev->vm_pgoff, NULL, next);
>> +					 addr, prev->vm_pgoff, NULL, next,
>> +					 keep_locked);
>>  		else {				/* cases 3, 8 */
>>  			err = __vma_adjust(area, addr, next->vm_end,
>> -					 next->vm_pgoff - pglen, NULL, next);
>> +					 next->vm_pgoff - pglen, NULL, next,
>> +					 keep_locked);
>>  			/*
>>  			 * In case 3 area is already equal to next and
>>  			 * this is a noop, but in case 8 "area" has
>> @@ -3163,9 +3168,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>
>>  	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>>  		return NULL;	/* should never get here */
>> -	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> -			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
>> -			vma->vm_userfaultfd_ctx);
>> +
>> +	/* There is 3 cases to manage here in
>> +	 *     AAAA            AAAA                AAAA              AAAA
>> +	 * PPPP....        PPPP......NNNN      PPPP....NNNN      PP........NN
>> +	 * PPPPPPPP(A)     PPPP..NNNNNNNN(B)   PPPPPPPPPPPP(1)       NULL
>> +	 *                                     PPPPPPPPNNNN(2)
>> +	 *                                         PPPPNNNNNNNN(3)
>> +	 *
>> +	 * new_vma == prev in case A,1,2
>> +	 * new_vma == next in case B,3
>> +	 */
>
> Interleaved tabs and whitespace.

Fair enough, I will try to fix that.

>> +	new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> +			      vma->anon_vma, vma->vm_file, pgoff,
>> +			      vma_policy(vma), vma->vm_userfaultfd_ctx, true);
>>  	if (new_vma) {
>>  		/*
>>  		 * Source vma may have been merged into new_vma
>> @@ -3205,6 +3221,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>  			get_file(new_vma->vm_file);
>>  		if (new_vma->vm_ops && new_vma->vm_ops->open)
>>  			new_vma->vm_ops->open(new_vma);
>> +		/*
>> +		 * As the VMA is linked right now, it may be hit by the
>> +		 * speculative page fault handler. But we don't want it to
>> +		 * to start mapping page in this area until the caller has
>> +		 * potentially move the pte from the moved VMA. To prevent
>> +		 * that we protect it right now, and let the caller unprotect
>> +		 * it once the move is done.
>> +		 */
>> +		vm_raw_write_begin(new_vma);
>>  		vma_link(mm, new_vma, prev, rb_link, rb_parent);
>>  		*need_rmap_locks = false;
>>  	}
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 049470aa1e3e..8ed1a1d6eaed 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -302,6 +302,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  	if (!new_vma)
>>  		return -ENOMEM;
>>
>> +	/* new_vma is returned protected by copy_vma, to prevent speculative
>> +	 * page fault to be done in the destination area before we move the pte.
>> +	 * Now, we must also protect the source VMA since we don't want pages
>> +	 * to be mapped in our back while we are copying the PTEs.
>> +	 */
>> +	if (vma != new_vma)
>> +		vm_raw_write_begin(vma);
>> +
>>  	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
>>  				     need_rmap_locks);
>>  	if (moved_len < old_len) {
>> @@ -318,6 +326,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  		 */
>>  		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
>>  				 true);
>> +		if (vma != new_vma)
>> +			vm_raw_write_end(vma);
>>  		vma = new_vma;
>>  		old_len = new_len;
>>  		old_addr = new_addr;
>> @@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  		mremap_userfaultfd_prep(new_vma, uf);
>>  		arch_remap(mm, old_addr, old_addr + old_len,
>>  			   new_addr, new_addr + new_len);
>> +		if (vma != new_vma)
>> +			vm_raw_write_end(vma);
>>  	}
>> +	vm_raw_write_end(new_vma);
>
> Just do
>
>	vm_raw_write_end(vma);
>	vm_raw_write_end(new_vma);
>
> here.

Are you sure? We can have "vma = new_vma" assigned when the
if (unlikely(err)) path is taken, in which case those two calls would
unlock new_vma twice.
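To make that concrete, the tail of move_vma() roughly looks like this
with the patch applied (simplified from the hunks above, keeping only the
locking-relevant lines; the vm_ops->mremap() and arch_remap() steps are
left out):

	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr,
				     old_len, need_rmap_locks);
	if (moved_len < old_len)
		err = -ENOMEM;

	if (unlikely(err)) {
		/* Move the PTEs back into the source VMA. */
		move_page_tables(new_vma, new_addr, vma, old_addr,
				 moved_len, true);
		if (vma != new_vma)
			vm_raw_write_end(vma);
		vma = new_vma;	/* from here on, vma == new_vma */
	} else {
		mremap_userfaultfd_prep(new_vma, uf);
		if (vma != new_vma)
			vm_raw_write_end(vma);
	}
	/*
	 * An unconditional vm_raw_write_end(vma) just before this call
	 * would unlock new_vma a second time whenever the error path
	 * above was taken.
	 */
	vm_raw_write_end(new_vma);

Cheers,
Laurent.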