Subject: Re: [PATCH v12 11/31] mm: protect mremap() against SPF handler
From: Laurent Dufour
To: Jerome Glisse
Cc: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org,
    kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net,
    jack@suse.cz, Matthew Wilcox, aneesh.kumar@linux.ibm.com,
    benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org,
    Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
    Sergey Senozhatsky, sergey.senozhatsky.work@gmail.com,
    Andrea Arcangeli, Alexei Starovoitov, Daniel Jordan,
    David Rientjes, Ganesh Mahendran, Minchan Kim, Punit Agrawal,
    vinayak menon, Yang Shi, zhong jiang, Haiyan Song, Balbir Singh,
    sj38.park@gmail.com, Michel Lespinasse, Mike Rapoport,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    haren@linux.vnet.ibm.com, npiggin@gmail.com,
    paulmck@linux.vnet.ibm.com, Tim Chen,
    linuxppc-dev@lists.ozlabs.org, x86@kernel.org
Date: Tue, 23 Apr 2019 17:51:20 +0200
Message-Id: <665c1e4f-cf0f-31b7-d5f9-287239f7c0d9@linux.ibm.com>
In-Reply-To: <20190422195157.GB14666@redhat.com>
References: <20190416134522.17540-1-ldufour@linux.ibm.com>
 <20190416134522.17540-12-ldufour@linux.ibm.com>
 <20190422195157.GB14666@redhat.com>

On 22/04/2019 at 21:51, Jerome Glisse wrote:
> On Tue, Apr 16, 2019 at 03:45:02PM +0200, Laurent Dufour wrote:
>> If a thread is remapping an area while another one is faulting on the
>> destination area, the SPF handler may fetch the vma from the RB tree
>> before the ptes have been moved by the other thread. This means that
>> the moved ptes will overwrite those created by the page fault handler,
>> leading to leaked pages.
>>
>>     CPU 1                        CPU 2
>>     enter mremap()
>>     unmap the dest area
>>     copy_vma()                   Enter speculative page fault handler
>>        >> at this time the dest area is present in the RB tree
>>                                  fetch the vma matching the dest area
>>                                  create a pte as the VMA matched
>>                                  Exit the SPF handler
>>                                  <data written in the new page>
>>     move_ptes()
>>       > it is assumed that the dest area is empty,
>>       > the moved ptes overwrite the page mapped by CPU 2.
>>
>> To prevent that, when the VMA matching the dest area is extended or
>> created by copy_vma(), it should be marked as not available to the SPF
>> handler. The usual way to do so is to rely on vm_write_begin()/end().
>> This is already done in __vma_adjust(), called by copy_vma() (through
>> vma_merge()). But __vma_adjust() calls vm_write_end() before returning,
>> which creates a window for another thread.
>> This patch adds a new parameter to vma_merge() which is passed down to
>> __vma_adjust(). The assumption is that copy_vma() returns a vma which
>> should be released by its caller through vm_raw_write_end() once the
>> ptes have been moved.
>>
>> Signed-off-by: Laurent Dufour
>
> Reviewed-by: Jérôme Glisse
>
> Small comment about a comment below, but that can be fixed as a fixup
> patch; nothing earth-shattering.
>
>> ---
>>  include/linux/mm.h | 24 ++++++++++++++++-----
>>  mm/mmap.c          | 53 +++++++++++++++++++++++++++++++++++-----------
>>  mm/mremap.c        | 13 ++++++++++++
>>  3 files changed, 73 insertions(+), 17 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 906b9e06f18e..5d45b7d8718d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2343,18 +2343,32 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
>>  
>>  /* mmap.c */
>>  extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
>> +
>>  extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> -	struct vm_area_struct *expand);
>> +	struct vm_area_struct *expand, bool keep_locked);
>> +
>>  static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
>>  {
>> -	return __vma_adjust(vma, start, end, pgoff, insert, NULL);
>> +	return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
>>  }
>> -extern struct vm_area_struct *vma_merge(struct mm_struct *,
>> +
>> +extern struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>> +	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>> +	unsigned long vm_flags, struct anon_vma *anon, struct file *file,
>> +	pgoff_t pgoff, struct mempolicy *mpol,
>> +	struct vm_userfaultfd_ctx uff, bool keep_locked);
>> +
>> +static inline struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>  	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
>> -	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
>> -	struct mempolicy *, struct vm_userfaultfd_ctx);
>> +	unsigned long vm_flags, struct anon_vma *anon, struct file *file,
>> +	pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
>> +{
>> +	return __vma_merge(mm, prev, addr, end, vm_flags, anon, file, off,
>> +			   pol, uff, false);
>> +}
>> +
>>  extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
>>  extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
>>  	unsigned long addr, int new_below);
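A quick note for readers who don't have the rest of the series paged in:
the vm_raw_write_begin()/end() helpers used below come from an earlier
patch in this series and are built on a per-VMA sequence count. A minimal
sketch of their intent (the vm_sequence field and the helper shapes follow
that earlier patch; this is an illustration, not the exact series code):

#include <linux/mm_types.h>
#include <linux/seqlock.h>

static inline void vm_raw_write_begin(struct vm_area_struct *vma)
{
	/*
	 * Make the sequence count odd: a speculative page fault sampling
	 * vma->vm_sequence sees a write in progress and falls back to the
	 * regular fault path, taken under mmap_sem.
	 */
	raw_write_seqcount_begin(&vma->vm_sequence);
}

static inline void vm_raw_write_end(struct vm_area_struct *vma)
{
	/* Count even again: the VMA is usable speculatively once more. */
	raw_write_seqcount_end(&vma->vm_sequence);
}

Keeping the destination VMA "write-locked" from copy_vma() until after the
ptes are moved simply stretches that odd window across the whole move.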
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index b77ec0149249..13460b38b0fb 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -714,7 +714,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
>>   */
>>  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
>> -	struct vm_area_struct *expand)
>> +	struct vm_area_struct *expand, bool keep_locked)
>>  {
>>  	struct mm_struct *mm = vma->vm_mm;
>>  	struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
>> @@ -830,8 +830,12 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  
>>  			importer->anon_vma = exporter->anon_vma;
>>  			error = anon_vma_clone(importer, exporter);
>> -			if (error)
>> +			if (error) {
>> +				if (next && next != vma)
>> +					vm_raw_write_end(next);
>> +				vm_raw_write_end(vma);
>>  				return error;
>> +			}
>>  		}
>>  	}
>>  again:
>> @@ -1025,7 +1029,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
>>  
>>  	if (next && next != vma)
>>  		vm_raw_write_end(next);
>> -	vm_raw_write_end(vma);
>> +	if (!keep_locked)
>> +		vm_raw_write_end(vma);
>>  
>>  	validate_mm(mm);
>>  
>> @@ -1161,12 +1166,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
>>   * parameter) may establish ptes with the wrong permissions of NNNN
>>   * instead of the right permissions of XXXX.
>>   */
>> -struct vm_area_struct *vma_merge(struct mm_struct *mm,
>> +struct vm_area_struct *__vma_merge(struct mm_struct *mm,
>>  			struct vm_area_struct *prev, unsigned long addr,
>>  			unsigned long end, unsigned long vm_flags,
>>  			struct anon_vma *anon_vma, struct file *file,
>>  			pgoff_t pgoff, struct mempolicy *policy,
>> -			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
>> +			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
>> +			bool keep_locked)
>>  {
>>  	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
>>  	struct vm_area_struct *area, *next;
>> @@ -1214,10 +1220,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>  			/* cases 1, 6 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>>  					 next->vm_end, prev->vm_pgoff, NULL,
>> -					 prev);
>> +					 prev, keep_locked);
>>  		} else			/* cases 2, 5, 7 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>> -					 end, prev->vm_pgoff, NULL, prev);
>> +					 end, prev->vm_pgoff, NULL, prev,
>> +					 keep_locked);
>>  		if (err)
>>  			return NULL;
>>  		khugepaged_enter_vma_merge(prev, vm_flags);
>> @@ -1234,10 +1241,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
>>  					     vm_userfaultfd_ctx)) {
>>  		if (prev && addr < prev->vm_end)	/* case 4 */
>>  			err = __vma_adjust(prev, prev->vm_start,
>> -					 addr, prev->vm_pgoff, NULL, next);
>> +					 addr, prev->vm_pgoff, NULL, next,
>> +					 keep_locked);
>>  		else {					/* cases 3, 8 */
>>  			err = __vma_adjust(area, addr, next->vm_end,
>> -					 next->vm_pgoff - pglen, NULL, next);
>> +					 next->vm_pgoff - pglen, NULL, next,
>> +					 keep_locked);
>>  			/*
>>  			 * In case 3 area is already equal to next and
>>  			 * this is a noop, but in case 8 "area" has
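For completeness, here is the read side this pairs with, as a sketch only
(the helper name is hypothetical; in the series the check is open-coded in
the speculative fault path): sample the count, bail out if it is odd, and
re-validate before committing the pte.

static vm_fault_t spf_fault_sketch(struct vm_area_struct *vma)
{
	unsigned int seq = raw_read_seqcount(&vma->vm_sequence);

	if (seq & 1)			/* writer active, e.g. __vma_adjust() */
		return VM_FAULT_RETRY;	/* take the classic fault path */

	/* ... speculatively walk the page tables and build the pte ... */

	if (read_seqcount_retry(&vma->vm_sequence, seq))
		return VM_FAULT_RETRY;	/* the VMA changed under us */

	/* ... safe to commit the pte ... */
	return 0;
}

With keep_locked, __vma_adjust() returns leaving the VMA's count odd, so a
concurrent speculative fault on the destination area keeps hitting the
first check until move_vma() eventually calls vm_raw_write_end().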
>> @@ -3259,9 +3268,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>  
>>  	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
>>  		return NULL;	/* should never get here */
>> -	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> -			vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
>> -			vma->vm_userfaultfd_ctx);
>> +
>> +	/* There are 3 cases to manage here in
>> +	 *     AAAA            AAAA                AAAA             AAAA
>> +	 * PPPP....        PPPP......NNNN      PPPP....NNNN     PP........NN
>> +	 * PPPPPPPP(A)     PPPP..NNNNNNNN(B)   PPPPPPPPPPPP(1)      NULL
>> +	 *                                     PPPPPPPPNNNN(2)
>> +	 *                                     PPPPNNNNNNNN(3)
>> +	 *
>> +	 * new_vma == prev in case A,1,2
>> +	 * new_vma == next in case B,3
>> +	 */
>> +	new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
>> +			      vma->anon_vma, vma->vm_file, pgoff,
>> +			      vma_policy(vma), vma->vm_userfaultfd_ctx, true);
>>  	if (new_vma) {
>>  		/*
>>  		 * Source vma may have been merged into new_vma
>> @@ -3299,6 +3319,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>>  			get_file(new_vma->vm_file);
>>  		if (new_vma->vm_ops && new_vma->vm_ops->open)
>>  			new_vma->vm_ops->open(new_vma);
>> +		/*
>> +		 * As the VMA is linked right now, it may be hit by the
>> +		 * speculative page fault handler. But we don't want it to
>> +		 * start mapping pages in this area until the caller has
>> +		 * potentially moved the ptes from the moved VMA. To prevent
>> +		 * that we protect it right now, and let the caller unprotect
>> +		 * it once the move is done.
>> +		 */
>
> It would be better to say:
> 	/*
> 	 * Block speculative page faults on the new VMA before "linking" it,
> 	 * as once it is linked it may be hit by a speculative page fault.
> 	 * But we don't want it to start mapping pages in this area until
> 	 * the caller has potentially moved the ptes from the moved VMA. To
> 	 * prevent that we protect it before linking and let the caller
> 	 * unprotect it once the move is done.
> 	 */
>

I'm fine with your proposal. Thanks for reviewing this.

>> +		vm_raw_write_begin(new_vma);
>>  		vma_link(mm, new_vma, prev, rb_link, rb_parent);
>>  		*need_rmap_locks = false;
>>  	}
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index fc241d23cd97..ae5c3379586e 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -357,6 +357,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  	if (!new_vma)
>>  		return -ENOMEM;
>>  
>> +	/* new_vma is returned protected by copy_vma, to prevent speculative
>> +	 * page faults from being handled in the destination area before we
>> +	 * move the ptes. We must also protect the source VMA, since we don't
>> +	 * want pages to be mapped behind our back while we are copying ptes.
>> +	 */
>> +	if (vma != new_vma)
>> +		vm_raw_write_begin(vma);
>> +
>>  	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
>>  				     need_rmap_locks);
>>  	if (moved_len < old_len) {
>> @@ -373,6 +381,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  		 */
>>  		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
>>  				 true);
>> +		if (vma != new_vma)
>> +			vm_raw_write_end(vma);
>>  		vma = new_vma;
>>  		old_len = new_len;
>>  		old_addr = new_addr;
>> @@ -381,7 +391,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>>  		mremap_userfaultfd_prep(new_vma, uf);
>>  		arch_remap(mm, old_addr, old_addr + old_len,
>>  			   new_addr, new_addr + new_len);
>> +		if (vma != new_vma)
>> +			vm_raw_write_end(vma);
>>  	}
>> +	vm_raw_write_end(new_vma);
>>  
>>  	/* Conceal VM_ACCOUNT so old reservation is not undone */
>>  	if (vm_flags & VM_ACCOUNT) {
>> -- 
>> 2.21.0
>> 
> 
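To sum up the lock lifetimes this patch establishes, here is a simplified
view of the success path of move_vma() (error paths omitted):

/*
 *	new_vma = copy_vma(&vma, ...);
 *		// returned with vm_raw_write_begin(new_vma) still held:
 *		// SPF cannot instantiate ptes in the dest area yet
 *	if (vma != new_vma)
 *		vm_raw_write_begin(vma);	// protect the source too
 *	move_page_tables(vma, old_addr, new_vma, new_addr, ...);
 *	if (vma != new_vma)
 *		vm_raw_write_end(vma);		// source released
 *	vm_raw_write_end(new_vma);		// dest visible to SPF again
 */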