From: Laurent Dufour
To: paulmck@linux.vnet.ibm.com, peterz@infradead.org, akpm@linux-foundation.org,
    kirill@shutemov.name, ak@linux.intel.com, mhocko@kernel.org, dave@stgolabs.net,
    jack@suse.cz, Matthew Wilcox, benh@kernel.crashing.org, mpe@ellerman.id.au,
    paulus@samba.org, Thomas Gleixner, Ingo Molnar, hpa@zytor.com, Will Deacon,
    Sergey Senozhatsky, Andrea Arcangeli, Alexei Starovoitov, kemi.wang@intel.com,
    sergey.senozhatsky.work@gmail.com, Daniel Jordan
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, haren@linux.vnet.ibm.com,
    khandual@linux.vnet.ibm.com, npiggin@gmail.com, bsingharora@gmail.com,
    Tim Chen, linuxppc-dev@lists.ozlabs.org, x86@kernel.org
Subject: [PATCH v8 09/24] mm: protect mremap() against SPF handler
Date: Fri, 16 Feb 2018 16:25:23 +0100
X-Mailer: git-send-email 2.7.4
In-Reply-To: <1518794738-4186-1-git-send-email-ldufour@linux.vnet.ibm.com>
References: <1518794738-4186-1-git-send-email-ldufour@linux.vnet.ibm.com>
Message-Id: <1518794738-4186-10-git-send-email-ldufour@linux.vnet.ibm.com>

If a thread is remapping an area while another one is faulting on the
destination area, the SPF handler may fetch the vma from the RB tree before
the ptes have been moved by the other thread. This means that the moved ptes
will overwrite those created by the page fault handler, leading to leaked
pages.

	CPU 1					CPU 2
	enter mremap()
	unmap the dest area
	copy_vma()
						Enter speculative page fault handler
						>> at this time the dest area is
						   present in the RB tree
						fetch the vma matching the dest area
						create a pte as the VMA matched
						Exit the SPF handler
	move_ptes()
	  > it is assumed that the dest area is empty,
	  > the moved ptes overwrite the page mapped by CPU 2.

To prevent that, when the VMA matching the dest area is extended or created
by copy_vma(), it should be marked as not available to the SPF handler. The
usual way to do so is to rely on vm_write_begin()/vm_write_end(). This is
already done in __vma_adjust(), which is called by copy_vma() (through
vma_merge()). But __vma_adjust() calls vm_write_end() before returning, which
creates a window for another thread.

This patch adds a new parameter to vma_merge() which is passed down to
__vma_adjust(). The assumption is that copy_vma() returns a vma which should
be released by calling vm_raw_write_end() by the caller once the ptes have
been moved.

Signed-off-by: Laurent Dufour
---
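Note: the scheme relies on the SPF handler noticing that a VMA is being
modified and backing off. Below is a minimal, stand-alone user-space sketch
of that interaction, for illustration only. The names toy_vma,
toy_write_begin(), toy_write_end() and toy_speculative_fault() are
hypothetical stand-ins for a VMA, for vm_raw_write_begin()/vm_raw_write_end()
and for the SPF handler; it assumes the write helpers behave like a raw
sequence-count writer section on a per-VMA counter, as elsewhere in this
series.

/*
 * Illustrative user-space sketch only -- not kernel code.  The names
 * below (toy_vma, toy_write_begin, toy_write_end, toy_speculative_fault)
 * are hypothetical and merely mirror the roles of a VMA, of
 * vm_raw_write_begin()/vm_raw_write_end() and of the SPF handler.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_vma {
	atomic_uint seq;		/* even: stable, odd: writer in progress */
	unsigned long start, end;
};

static void toy_write_begin(struct toy_vma *vma)
{
	/* Mark the VMA as "not available" to speculative readers. */
	atomic_fetch_add_explicit(&vma->seq, 1, memory_order_release);
}

static void toy_write_end(struct toy_vma *vma)
{
	/* Make the VMA visible to speculative readers again. */
	atomic_fetch_add_explicit(&vma->seq, 1, memory_order_release);
}

/* Return true if the speculative path may rely on what it sampled. */
static bool toy_speculative_fault(struct toy_vma *vma)
{
	unsigned int seq = atomic_load_explicit(&vma->seq, memory_order_acquire);

	if (seq & 1)			/* mremap() is rearranging this VMA: back off */
		return false;

	/* ... speculative handling against the sampled VMA would go here ... */

	/* Abort if the VMA changed while we were working on it. */
	return atomic_load_explicit(&vma->seq, memory_order_acquire) == seq;
}

int main(void)
{
	struct toy_vma vma = { .seq = 0, .start = 0x1000, .end = 0x2000 };

	toy_write_begin(&vma);		/* what copy_vma() now leaves us with */
	printf("SPF allowed while protected: %d\n", toy_speculative_fault(&vma));
	toy_write_end(&vma);		/* what move_vma() does once the ptes are moved */
	printf("SPF allowed afterwards:      %d\n", toy_speculative_fault(&vma));
	return 0;
}

With this model, copy_vma() handing back new_vma with the write section still
open is what keeps speculative faults from instantiating ptes in the
destination area until move_vma() finally calls vm_raw_write_end(new_vma).
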
 include/linux/mm.h | 16 ++++++++++++----
 mm/mmap.c          | 47 ++++++++++++++++++++++++++++++++++++-----------
 mm/mremap.c        | 13 +++++++++++++
 3 files changed, 61 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9c22b4134c5d..dcc2a6db8419 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2187,16 +2187,24 @@ void anon_vma_interval_tree_verify(struct anon_vma_chain *node);
 extern int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin);
 extern int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-	struct vm_area_struct *expand);
+	struct vm_area_struct *expand, bool keep_locked);
 static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert)
 {
-	return __vma_adjust(vma, start, end, pgoff, insert, NULL);
+	return __vma_adjust(vma, start, end, pgoff, insert, NULL, false);
 }
-extern struct vm_area_struct *vma_merge(struct mm_struct *,
+extern struct vm_area_struct *__vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *, struct vm_userfaultfd_ctx);
+	struct mempolicy *, struct vm_userfaultfd_ctx, bool keep_locked);
+static inline struct vm_area_struct *vma_merge(struct mm_struct *vma,
+	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
+	unsigned long vm_flags, struct anon_vma *anon, struct file *file,
+	pgoff_t off, struct mempolicy *pol, struct vm_userfaultfd_ctx uff)
+{
+	return __vma_merge(vma, prev, addr, end, vm_flags, anon, file, off,
+			   pol, uff, false);
+}
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
diff --git a/mm/mmap.c b/mm/mmap.c
index c4d944e1be23..13c799710a8a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -684,7 +684,7 @@ static inline void __vma_unlink_prev(struct mm_struct *mm,
  */
 int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	unsigned long end, pgoff_t pgoff, struct vm_area_struct *insert,
-	struct vm_area_struct *expand)
+	struct vm_area_struct *expand, bool keep_locked)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
@@ -996,7 +996,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 
 	if (next && next != vma)
 		vm_raw_write_end(next);
-	vm_raw_write_end(vma);
+	if (!keep_locked)
+		vm_raw_write_end(vma);
 
 	validate_mm(mm);
 
@@ -1132,12 +1133,13 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
  * parameter) may establish ptes with the wrong permissions of NNNN
  * instead of the right permissions of XXXX.
  */
-struct vm_area_struct *vma_merge(struct mm_struct *mm,
+struct vm_area_struct *__vma_merge(struct mm_struct *mm,
 			struct vm_area_struct *prev, unsigned long addr,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
 			pgoff_t pgoff, struct mempolicy *policy,
-			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+			bool keep_locked)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1185,10 +1187,11 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			/* cases 1, 6 */
 			err = __vma_adjust(prev, prev->vm_start,
 					 next->vm_end, prev->vm_pgoff, NULL,
-					 prev);
+					 prev, keep_locked);
 		} else					/* cases 2, 5, 7 */
 			err = __vma_adjust(prev, prev->vm_start,
-					 end, prev->vm_pgoff, NULL, prev);
+					 end, prev->vm_pgoff, NULL, prev,
+					 keep_locked);
 		if (err)
 			return NULL;
 		khugepaged_enter_vma_merge(prev, vm_flags);
@@ -1205,10 +1208,12 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 					     vm_userfaultfd_ctx)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = __vma_adjust(prev, prev->vm_start,
-					 addr, prev->vm_pgoff, NULL, next);
+					 addr, prev->vm_pgoff, NULL, next,
+					 keep_locked);
 		else {					/* cases 3, 8 */
 			err = __vma_adjust(area, addr, next->vm_end,
-					 next->vm_pgoff - pglen, NULL, next);
+					 next->vm_pgoff - pglen, NULL, next,
+					 keep_locked);
 			/*
 			 * In case 3 area is already equal to next and
 			 * this is a noop, but in case 8 "area" has
@@ -3163,9 +3168,20 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
-	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
-			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+
+	/* There are 3 cases to manage here:
+	 *     AAAA            AAAA              AAAA              AAAA
+	 * PPPP....      PPPP......NNNN      PPPP....NNNN      PP........NN
+	 * PPPPPPPP(A)   PPPP..NNNNNNNN(B)   PPPPPPPPPPPP(1)       NULL
+	 *                                   PPPPPPPPNNNN(2)
+	 *                                   PPPPNNNNNNNN(3)
+	 *
+	 * new_vma == prev in case A,1,2
+	 * new_vma == next in case B,3
+	 */
+	new_vma = __vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
+			      vma->anon_vma, vma->vm_file, pgoff,
+			      vma_policy(vma), vma->vm_userfaultfd_ctx, true);
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
@@ -3205,6 +3221,15 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
+		/*
+		 * As the VMA is linked right now, it may be hit by the
+		 * speculative page fault handler. But we don't want it to
+		 * start mapping pages in this area until the caller has
+		 * potentially moved the ptes from the source VMA. To prevent
+		 * that, we protect it right now, and let the caller unprotect
+		 * it once the move is done.
+		 */
+		vm_raw_write_begin(new_vma);
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..8ed1a1d6eaed 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -302,6 +302,14 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (!new_vma)
 		return -ENOMEM;
 
+	/* new_vma is returned protected by copy_vma, to prevent a speculative
+	 * page fault from being handled in the destination area before we move
+	 * the ptes. We must also protect the source VMA since we don't want
+	 * pages to be mapped behind our back while we are copying the PTEs.
+	 */
+	if (vma != new_vma)
+		vm_raw_write_begin(vma);
+
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
 				     need_rmap_locks);
 	if (moved_len < old_len) {
@@ -318,6 +326,8 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
 				 true);
+		if (vma != new_vma)
+			vm_raw_write_end(vma);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
@@ -326,7 +336,10 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		mremap_userfaultfd_prep(new_vma, uf);
 		arch_remap(mm, old_addr, old_addr + old_len,
 			   new_addr, new_addr + new_len);
+		if (vma != new_vma)
+			vm_raw_write_end(vma);
 	}
+	vm_raw_write_end(new_vma);
 
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT) {
-- 
2.7.4