Date: Thu, 24 May 2018 22:06:40 +0300
From: Mike Rapoport
To: Pavel Emelyanov
Cc: Andrew Morton, linux-mm, lkml, Andrea Arcangeli, Mike Kravetz, Andrei Vagin
Subject: Re: [PATCH] userfaultfd: prevent non-cooperative events vs mcopy_atomic races
References: <1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com> <0e1ce040-1beb-fd96-683c-1b18eb635fd6@virtuozzo.com> <20180524115613.GA16908@rapoport-lnx>
Message-Id: <20180524190639.GD16908@rapoport-lnx>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, May 24, 2018 at 07:40:07PM +0300, Pavel Emelyanov wrote:
> On 05/24/2018 02:56 PM, Mike Rapoport wrote:
> > On Thu, May 24, 2018 at 02:24:37PM +0300, Pavel Emelyanov wrote:
> >> On 05/23/2018 10:42 AM, Mike Rapoport wrote:
> >>> If a process monitored with userfaultfd changes its memory mappings or
> >>> forks() at the same time as uffd monitor fills the process memory with
> >>> UFFDIO_COPY, the actual creation of page table entries and copying of the
> >>> data in mcopy_atomic may happen either before or after the memory mapping
> >>> modifications and there is no way for the uffd monitor to maintain
> >>> consistent view of the process memory layout.
> >>>
> >>> For instance, let's consider fork() running in parallel with
> >>> userfaultfd_copy():
> >>>
> >>> process                          | uffd monitor
> >>> ---------------------------------+------------------------------
> >>> fork()                           | userfaultfd_copy()
> >>> ...                              | ...
> >>>     dup_mmap()                   |     down_read(mmap_sem)
> >>>     down_write(mmap_sem)         |     /* create PTEs, copy data */
> >>>         dup_uffd()               |     up_read(mmap_sem)
> >>>         copy_page_range()        |
> >>>         up_write(mmap_sem)       |
> >>>         dup_uffd_complete()      |
> >>>             /* notify monitor */ |
> >>>
> >>> If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will be
> >>> present by the time copy_page_range() is called and they will appear in the
> >>> child's memory mappings. However, if the fork() is the first to take the
> >>> mmap_sem, the new pages won't be mapped in the child's address space.
> >>
> >> But in this case child should get an entry, that emits a message to uffd when stepped upon!
> >> And uffd will just userfaultfd_copy() it again. No?
> >
> > There will be a message, indeed. But there is no way for monitor to tell
> > whether the pages it copied are present or not in the child.
>
> If there's a message, then they are not present, that's for sure :)

If the pages are not present and the child tries to access them, the monitor
will get a page fault notification and everything is fine. However, if the
pages *are present*, the child can access them without uffd noticing. And if
we copy them into the child, it'll see the wrong data. Since we are talking
about background copy, we'd need to decide whether the pages should be
copied or not regardless of #PF notifications.

> > Since the monitor cannot assume that the process will access all its memory
> > it has to copy some pages "in the background". A simple monitor may look
> > like:
> >
> > 	for (;;) {
> > 		wait_for_uffd_events(timeout);
> > 		handle_uffd_events();
> > 		uffd_copy(some not faulted pages);
> > 	}
> >
> > Then, if the "background" uffd_copy() races with fork, the pages we've
> > copied may be already present in parent's mappings before the call to
> > copy_page_range() and may be not.
> >
> > If the pages were not present, uffd_copy'ing them again to the child's
> > memory would be ok.
>
> Yes.
>
> > But if uffd_copy() was first to catch mmap_sem, and we would uffd_copy them
> > again, child process will get memory corruption.
>
> You mean the background uffd_copy()?

Yes.

> But doesn't it race even with regular PF handling, not only the fork? How
> do we handle this race?

With the regular #PF handling, the faulting thread patiently waits until the
page fault is resolved. With fork(), mremap() etc. the thread that caused the
event resumes once the uffd message is read by the monitor. That's surely
way before the monitor had a chance to somehow process that message.

> -- Pavel
>

--
Sincerely yours,
Mike.
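
The two orderings from the fork() vs userfaultfd_copy() table can be modelled
with a toy C program. This is only a sketch under loud assumptions: toy_mm,
toy_copy(), toy_fork() and run_race() are hypothetical names invented for
illustration, nothing here is kernel or monitor code, and the mmap_sem
serialization is reduced to "one step runs entirely before the other".

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4

/* Toy address space: a present bit and one byte of data per page.
 * Hypothetical stand-in for real page tables. */
struct toy_mm {
	bool present[NPAGES];
	char data[NPAGES];
};

enum order { COPY_FIRST, FORK_FIRST };

/* What the monitor can observe, plus the one thing it cannot. */
struct outcome {
	bool copy_ok;          /* visible: the background copy succeeded   */
	bool fork_event_seen;  /* visible: a fork event was delivered      */
	bool present_in_child; /* hidden: did the child inherit the page?  */
};

/* Stand-in for the background copy: fill a page that is not present yet. */
bool toy_copy(struct toy_mm *mm, int page, char value)
{
	if (mm->present[page])
		return false;
	mm->present[page] = true;
	mm->data[page] = value;
	return true;
}

/* Stand-in for copy_page_range(): the child inherits whatever is
 * present in the parent at this instant. */
void toy_fork(const struct toy_mm *parent, struct toy_mm *child)
{
	*child = *parent;
}

/* Run the two steps in the given order, as if serialized by mmap_sem. */
struct outcome run_race(enum order order, int page)
{
	struct toy_mm parent = {0}, child = {0};
	struct outcome out = {false, false, false};

	assert(page >= 0 && page < NPAGES);

	if (order == COPY_FIRST) {
		out.copy_ok = toy_copy(&parent, page, 'A');
		toy_fork(&parent, &child);
	} else {
		toy_fork(&parent, &child);
		out.copy_ok = toy_copy(&parent, page, 'A');
	}

	/* In this model the fork event is queued only after the fork step
	 * completes, so the monitor sees it after its copy returned in
	 * both orderings. */
	out.fork_event_seen = true;
	out.present_in_child = child.present[page];
	return out;
}
```

Calling run_race(COPY_FIRST, 1) and run_race(FORK_FIRST, 1) yields the same
copy_ok and fork_event_seen values, while present_in_child differs. That is
the ambiguity described above: the monitor-visible fields alone do not say
whether the copied page made it into the child.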