Received: by 2002:a25:4158:0:0:0:0:0 with SMTP id o85csp597730yba; Wed, 24 Apr 2019 06:38:44 -0700 (PDT) X-Google-Smtp-Source: APXvYqw4oSI4MFH0QKXEjZHzcwx8jBnR06WCohS+lTuuQ0ysXEmSogAwoI1PhKvqwR5chp8JfjYI X-Received: by 2002:a63:500f:: with SMTP id e15mr30710017pgb.198.1556113124645; Wed, 24 Apr 2019 06:38:44 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1556113124; cv=none; d=google.com; s=arc-20160816; b=gqFyLuoNmnICJZWx9X5nIQmQxxH8Mxv4uQ15tzYvhfmQvgIQElFAeE7xzRcdmcRUUi D4IrspX8BxYJis5mRc7vuY6YezqQZoQBgY8FM2dGMv43bjGCpm4sQGW95GYDTe2NrcJS /kn77U7Vnkm3UNxVZl0+WoIiW4PyfpFqtB6VAUK9aqyLyPWNqlfxqHTAV5X4n/qjKJkD aDVyrI9PBJB9s3ELBWQWtYRP/auNiDpnjPx5yAdgaURfGjsoGLK1GaDvf7WTJcu9fcx0 SsH+4uVNizIZnket/XULVxGmGXkWrZ2cdEHqlI31em3tilNIHihRfomy4G6aeAFQpB4r v7RA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:message-id:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date:from :references:cc:to:subject; bh=nFs6IDDG/TshbkxB94tbMwb5jDUiYltDvO7hNh8yOqw=; b=x8moj5zoC6HwyFxjZTCEY6ilmKJF7b34pxE8zMT+XbKUJQ7kCocOrtXyFzUpQpwUEG 76D7+IuDYB9gDNrv1Yrhlx38JHtJFnBEG00TgUE8Ye8dlQV7iQSO/XKHlk8VIO1bquDe eXL0yoP322xyJ8sfblBSYhXcwxOIYBPu0lcK96jtORdXb74rANcVoL1cw57IMLYeGUT5 ABpGKux/p9uw7QLcfz5jXZrpLoy/M0a1qmNIvysFxyCY4AltoYL2voi2VmQeS4MJ1Zyq WwsSIovmajtxHEzs1u0vplD6KhiWWXZYjzmVXi40dR+eMmw7u1Dvldcj3NH3w7j1kyGJ mgsw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id h189si19459726pfc.283.2019.04.24.06.38.22; Wed, 24 Apr 2019 06:38:44 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=ibm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728036AbfDXKdq (ORCPT + 99 others); Wed, 24 Apr 2019 06:33:46 -0400 Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:43154 "EHLO mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726168AbfDXKdp (ORCPT ); Wed, 24 Apr 2019 06:33:45 -0400 Received: from pps.filterd (m0098414.ppops.net [127.0.0.1]) by mx0b-001b2d01.pphosted.com (8.16.0.27/8.16.0.27) with SMTP id x3OAXaL9122095 for ; Wed, 24 Apr 2019 06:33:44 -0400 Received: from e06smtp05.uk.ibm.com (e06smtp05.uk.ibm.com [195.75.94.101]) by mx0b-001b2d01.pphosted.com with ESMTP id 2s2n2nkj9k-1 (version=TLSv1.2 cipher=AES256-GCM-SHA384 bits=256 verify=NOT) for ; Wed, 24 Apr 2019 06:33:43 -0400 Received: from localhost by e06smtp05.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Wed, 24 Apr 2019 11:33:27 +0100 Received: from b06cxnps4075.portsmouth.uk.ibm.com (9.149.109.197) by e06smtp05.uk.ibm.com (192.168.101.135) with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted; (version=TLSv1/SSLv3 cipher=AES256-GCM-SHA384 bits=256/256) Wed, 24 Apr 2019 11:33:16 +0100 Received: from d06av23.portsmouth.uk.ibm.com (d06av23.portsmouth.uk.ibm.com [9.149.105.59]) by b06cxnps4075.portsmouth.uk.ibm.com (8.14.9/8.14.9/NCO v10.0) with ESMTP id x3OAXFXh57802952 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 24 Apr 2019 10:33:15 GMT Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id EC166A4057; Wed, 24 Apr 2019 10:33:14 +0000 (GMT) Received: from d06av23.portsmouth.uk.ibm.com (unknown [127.0.0.1]) by IMSVA (Postfix) with ESMTP id 325F3A4040; Wed, 24 Apr 2019 10:33:12 +0000 (GMT) Received: from [9.145.184.124] (unknown [9.145.184.124]) by d06av23.portsmouth.uk.ibm.com (Postfix) with ESMTP; Wed, 24 Apr 2019 10:33:12 +0000 (GMT) Subject: Re: [PATCH v12 18/31] mm: protect against PTE changes done by dup_mmap() To: Jerome Glisse Cc: akpm@linux-foundation.org, mhocko@kernel.org, peterz@infradead.org, kirill@shutemov.name, ak@linux.intel.com, dave@stgolabs.net, jack@suse.cz, Matthew Wilcox , aneesh.kumar@linux.ibm.com, benh@kernel.crashing.org, mpe@ellerman.id.au, paulus@samba.org, Thomas Gleixner , Ingo Molnar , hpa@zytor.com, Will Deacon , Sergey Senozhatsky , sergey.senozhatsky.work@gmail.com, Andrea Arcangeli , Alexei Starovoitov , kemi.wang@intel.com, Daniel Jordan , David Rientjes , Ganesh Mahendran , Minchan Kim , Punit Agrawal , vinayak menon , Yang Shi , zhong jiang , Haiyan Song , Balbir Singh , sj38.park@gmail.com, Michel Lespinasse , Mike Rapoport , linux-kernel@vger.kernel.org, linux-mm@kvack.org, haren@linux.vnet.ibm.com, npiggin@gmail.com, paulmck@linux.vnet.ibm.com, Tim Chen , linuxppc-dev@lists.ozlabs.org, x86@kernel.org, Vinayak Menon References: <20190416134522.17540-1-ldufour@linux.ibm.com> <20190416134522.17540-19-ldufour@linux.ibm.com> <20190422203217.GI14666@redhat.com> From: Laurent Dufour Date: Wed, 24 Apr 2019 12:33:11 +0200 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 In-Reply-To: <20190422203217.GI14666@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 8bit X-TM-AS-GCONF: 00 x-cbid: 19042410-0020-0000-0000-00000333F137 X-IBM-AV-DETECTION: SAVI=unused REMOTE=unused XFE=unused x-cbparentid: 19042410-0021-0000-0000-0000218655DC Message-Id: X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:,, definitions=2019-04-24_07:,, signatures=0 X-Proofpoint-Spam-Details: rule=outbound_notspam policy=outbound score=0 priorityscore=1501 malwarescore=0 suspectscore=2 phishscore=0 bulkscore=0 spamscore=0 clxscore=1015 lowpriorityscore=0 mlxscore=0 impostorscore=0 mlxlogscore=999 adultscore=0 classifier=spam adjust=0 reason=mlx scancount=1 engine=8.0.1-1810050000 definitions=main-1904240088 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Le 22/04/2019 à 22:32, Jerome Glisse a écrit : > On Tue, Apr 16, 2019 at 03:45:09PM +0200, Laurent Dufour wrote: >> Vinayak Menon and Ganesh Mahendran reported that the following scenario may >> lead to thread being blocked due to data corruption: >> >> CPU 1 CPU 2 CPU 3 >> Process 1, Process 1, Process 1, >> Thread A Thread B Thread C >> >> while (1) { while (1) { while(1) { >> pthread_mutex_lock(l) pthread_mutex_lock(l) fork >> pthread_mutex_unlock(l) pthread_mutex_unlock(l) } >> } } >> >> In the details this happens because : >> >> CPU 1 CPU 2 CPU 3 >> fork() >> copy_pte_range() >> set PTE rdonly >> got to next VMA... >> . PTE is seen rdonly PTE still writable >> . thread is writing to page >> . -> page fault >> . copy the page Thread writes to page >> . . -> no page fault >> . update the PTE >> . flush TLB for that PTE >> flush TLB PTE are now rdonly > > Should the fork be on CPU3 to be consistant with the top thing (just to > make it easier to read and go from one to the other as thread can move > from one CPU to another). Sure, this is quite confusing this way ;) >> >> So the write done by the CPU 3 is interfering with the page copy operation >> done by CPU 2, leading to the data corruption. >> >> To avoid this we mark all the VMA involved in the COW mechanism as changing >> by calling vm_write_begin(). This ensures that the speculative page fault >> handler will not try to handle a fault on these pages. >> The marker is set until the TLB is flushed, ensuring that all the CPUs will >> now see the PTE as not writable. >> Once the TLB is flush, the marker is removed by calling vm_write_end(). >> >> The variable last is used to keep tracked of the latest VMA marked to >> handle the error path where part of the VMA may have been marked. >> >> Since multiple VMA from the same mm may have the sequence count increased >> during this process, the use of the vm_raw_write_begin/end() is required to >> avoid lockdep false warning messages. >> >> Reported-by: Ganesh Mahendran >> Reported-by: Vinayak Menon >> Signed-off-by: Laurent Dufour > > A minor comment (see below) > > Reviewed-by: Jérome Glisse Thanks for the review Jérôme. >> --- >> kernel/fork.c | 30 ++++++++++++++++++++++++++++-- >> 1 file changed, 28 insertions(+), 2 deletions(-) >> >> diff --git a/kernel/fork.c b/kernel/fork.c >> index f8dae021c2e5..2992d2c95256 100644 >> --- a/kernel/fork.c >> +++ b/kernel/fork.c >> @@ -462,7 +462,7 @@ EXPORT_SYMBOL(free_task); >> static __latent_entropy int dup_mmap(struct mm_struct *mm, >> struct mm_struct *oldmm) >> { >> - struct vm_area_struct *mpnt, *tmp, *prev, **pprev; >> + struct vm_area_struct *mpnt, *tmp, *prev, **pprev, *last = NULL; >> struct rb_node **rb_link, *rb_parent; >> int retval; >> unsigned long charge; >> @@ -581,8 +581,18 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >> rb_parent = &tmp->vm_rb; >> >> mm->map_count++; >> - if (!(tmp->vm_flags & VM_WIPEONFORK)) >> + if (!(tmp->vm_flags & VM_WIPEONFORK)) { >> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) { >> + /* >> + * Mark this VMA as changing to prevent the >> + * speculative page fault hanlder to process >> + * it until the TLB are flushed below. >> + */ >> + last = mpnt; >> + vm_raw_write_begin(mpnt); >> + } >> retval = copy_page_range(mm, oldmm, mpnt); >> + } >> >> if (tmp->vm_ops && tmp->vm_ops->open) >> tmp->vm_ops->open(tmp); >> @@ -595,6 +605,22 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, >> out: >> up_write(&mm->mmap_sem); >> flush_tlb_mm(oldmm); >> + >> + if (IS_ENABLED(CONFIG_SPECULATIVE_PAGE_FAULT)) { > > You do not need to check for CONFIG_SPECULATIVE_PAGE_FAULT as last > will always be NULL if it is not enabled but maybe the compiler will > miss the optimization opportunity if you only have the for() loop > below. I didn't check for the generated code, perhaps the compiler will be optimize that correctly. This being said, I think the if block is better for the code readability, highlighting that this block is only needed in the case of SPF. >> + /* >> + * Since the TLB has been flush, we can safely unmark the >> + * copied VMAs and allows the speculative page fault handler to >> + * process them again. >> + * Walk back the VMA list from the last marked VMA. >> + */ >> + for (; last; last = last->vm_prev) { >> + if (last->vm_flags & VM_DONTCOPY) >> + continue; >> + if (!(last->vm_flags & VM_WIPEONFORK)) >> + vm_raw_write_end(last); >> + } >> + } >> + >> up_write(&oldmm->mmap_sem); >> dup_userfaultfd_complete(&uf); >> fail_uprobe_end: >> -- >> 2.21.0 >> >