From: Axel Rasmussen
Date: Tue, 30 Mar 2021 16:30:13 -0700
Subject: Re: [PATCH v3] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTINUE behavior
To: Peter Xu
Cc: Alexander Viro, Andrea Arcangeli, Andrew Morton, Hugh Dickins,
    Jerome Glisse, Joe Perches, Lokesh Gidra, Mike Rapoport, Shaohua Li,
    Shuah Khan, Stephen Rothwell, Wang Qing, LKML,
    linux-fsdevel@vger.kernel.org, Linux MM, linux-kselftest@vger.kernel.org,
    Brian Geffon, Cannon Matthews, "Dr. David Alan Gilbert", David Rientjes,
    Michel Lespinasse, Mina Almasry, Oliver Upton
References: <20210329234131.304999-1-axelrasmussen@google.com> <20210330205519.GK429942@xz-x1>
In-Reply-To: <20210330205519.GK429942@xz-x1>

On Tue, Mar 30, 2021 at 1:55 PM Peter Xu wrote:
>
> On Mon, Mar 29, 2021 at 04:41:31PM -0700, Axel Rasmussen wrote:
> > Previously, we shared too much of the code with COPY and ZEROPAGE, so we
> > manipulated things in various invalid ways:
> >
> > - Previously, we unconditionally called shmem_inode_acct_block. In the
> >   continue case, we're looking up an existing page which would have been
> >   accounted for properly when it was allocated. So doing it twice
> >   results in double-counting, and eventually leaking.
> >
> > - Previously, we made the pte writable whenever the VMA was writable.
> >   However, for continue, consider this case:
> >
> >   1. A tmpfs file was created
> >   2. The non-UFFD-registered side mmap()-s it with MAP_SHARED
> >   3. The UFFD-registered side mmap()-s it with MAP_PRIVATE
> >
> >   In this case, even though the UFFD-registered VMA may be writable, we
> >   still want CoW behavior. So, check for this case and don't make the
> >   pte writable.
> >
> > - The initial pgoff / max_off check isn't necessary, so we can skip past
> >   it. The second one seems likely to be unnecessary too, but keep it
> >   just in case. Modify both checks to use pgoff, as offset is equivalent
> >   and not needed.
> >
> > - Previously, we unconditionally called ClearPageDirty() in the error
> >   path. In the continue case, though, since this is an existing page, it
> >   might have already been dirty before we started touching it. It's very
> >   problematic to clear the bit incorrectly, but not a problem to leave
> >   it - so, just omit the ClearPageDirty() entirely.
> >
> > - Previously, we unconditionally removed the page from the page cache in
> >   the error path. But in the continue case, we didn't add it - it was
> >   already there because the page is present in some second
> >   (non-UFFD-registered) mapping. So, removing it is invalid.
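
To make the MAP_SHARED / MAP_PRIVATE scenario in the second bullet above
concrete, here is a minimal userspace sketch of that setup. It is purely
illustrative and not taken from the patch or its selftest; in particular the
memfd_create() backing and the UFFD_FEATURE_MINOR_SHMEM /
UFFDIO_REGISTER_MODE_MINOR registration are my assumptions about how such a
setup would typically be written, and all error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;

	/* 1. A tmpfs file is created (a memfd is one convenient way). */
	int memfd = memfd_create("uffd-minor-demo", 0);
	ftruncate(memfd, len);

	/* 2. The non-UFFD-registered side maps it MAP_SHARED. */
	char *shared_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED, memfd, 0);

	/* 3. The UFFD-registered side maps the same file MAP_PRIVATE. */
	char *priv_map = mmap(NULL, len, PROT_READ | PROT_WRITE,
			      MAP_PRIVATE, memfd, 0);

	/* Register the private mapping for minor faults. */
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = {
		.api = UFFD_API,
		.features = UFFD_FEATURE_MINOR_SHMEM,
	};
	ioctl(uffd, UFFDIO_API, &api);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)priv_map, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MINOR,
	};
	ioctl(uffd, UFFDIO_REGISTER, &reg);

	/*
	 * Even though priv_map is PROT_WRITE, a store through it must CoW
	 * rather than modify the page that shared_map also sees.  That is
	 * why the pte installed when a minor fault is resolved with
	 * MCOPY_ATOMIC_CONTINUE must not be made writable for a
	 * non-VM_SHARED VMA.
	 */
	shared_map[0] = 'x';	/* visible through the file */
	priv_map[0] = 'y';	/* must trigger CoW, not clobber shared_map */

	return 0;
}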

> > Because the error handling issues are easy to exercise in the selftest,
> > make a small modification there to do so.
> >
> > Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've
> > added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just
> > check for that mode first thing, and then "goto" down to where the parts
> > we actually want are. This leaves the code in between cleaner.
> >
> > Changes since v2:
> > - Drop the ClearPageDirty() entirely, instead of trying to remember the
> >   old value.
> > - Modify both pgoff / max_off checks to use pgoff. It's equivalent to
> >   offset, but offset wasn't initialized until the first check (which
> >   we're skipping).
> > - Keep the second pgoff / max_off check in the continue case.
> >
> > Changes since v1:
> > - Refactor to skip ahead with goto, instead of adding several more
> >   "if (!is_continue)".
> > - Fix unconditional ClearPageDirty().
> > - Don't pte_mkwrite() when is_continue && !VM_SHARED.
> >
> > Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem")
> > Signed-off-by: Axel Rasmussen
> > ---
> >  mm/shmem.c                               | 60 +++++++++++++-----------
> >  tools/testing/selftests/vm/userfaultfd.c | 12 +++++
> >  2 files changed, 44 insertions(+), 28 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index d2e0e81b7d2e..fbcce850a16e 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2377,18 +2377,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >  	struct page *page;
> >  	pte_t _dst_pte, *dst_pte;
> >  	int ret;
> > -	pgoff_t offset, max_off;
> > -
> > -	ret = -ENOMEM;
> > -	if (!shmem_inode_acct_block(inode, 1))
> > -		goto out;
> > +	pgoff_t max_off;
> > +	int writable;
>
> Nit: can be bool.
>
> [...]
>
> > +install_ptes:
> >  	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> > -	if (dst_vma->vm_flags & VM_WRITE)
> > +	/* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */
> > +	writable = is_continue && !(dst_vma->vm_flags & VM_SHARED)
> > +			? 0
> > +			: dst_vma->vm_flags & VM_WRITE;
>
> Nit: this code is slightly hard to read.. I'd slightly prefer
> "if (is_continue)...". But more below.
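
For what it's worth, the "if (is_continue)" form mentioned in the nit above
would presumably look something like the following. This is just my sketch of
how the ternary from the hunk above could be restructured (using a bool, per
the earlier nit); it is not code from the patch or from Peter's reply:

	bool writable;

	/* For CONTINUE on a non-shared VMA, keep CoW: don't make the pte writable. */
	if (is_continue && !(dst_vma->vm_flags & VM_SHARED))
		writable = false;
	else
		writable = dst_vma->vm_flags & VM_WRITE;

The logic is identical; it only trades the conditional expression for an
explicit branch.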
> > +	if (writable)
> >  		_dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> >  	else {
> >  		/*
> > @@ -2455,7 +2458,7 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >
> >  	ret = -EFAULT;
> >  	max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> > -	if (unlikely(offset >= max_off))
> > +	if (unlikely(pgoff >= max_off))
> >  		goto out_release_unlock;
> >
> >  	ret = -EEXIST;
> > @@ -2485,13 +2488,14 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
> >
> >  	return ret;
> >  out_release_unlock:
> >  	pte_unmap_unlock(dst_pte, ptl);
> > -	ClearPageDirty(page);
> > -	delete_from_page_cache(page);
> > +	if (!is_continue)
> > +		delete_from_page_cache(page);
> >  out_release:
> >  	unlock_page(page);
> >  	put_page(page);
> >  out_unacct_blocks:
> > -	shmem_inode_unacct_blocks(inode, 1);
> > +	if (!is_continue)
> > +		shmem_inode_unacct_blocks(inode, 1);
>
> If you see we still have tons of "if (!is_continue)". Those are the places
> error prone.. even if not in this patch, could be in the patch when this
> function got changed again.
>
> Sorry to say this a bit late: how about introduce a helper to install the pte?

No worries. :)

> Pseudo code:
>
>     int shmem_install_uffd_pte(..., bool writable)
>     {
>         ...
>         _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
>         if (dst_vma->vm_flags & VM_WRITE)
>             _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
>         else
>             set_page_dirty(page);
>
>         dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
>         if (!pte_none(*dst_pte)) {
>             pte_unmap_unlock(dst_pte, ptl);
>             return -EEXIST;
>         }
>
>         inc_mm_counter(dst_mm, mm_counter_file(page));
>         page_add_file_rmap(page, false);
>         set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
>
>         /* No need to invalidate - it was non-present before */
>         update_mmu_cache(dst_vma, dst_addr, dst_pte);
>         pte_unmap_unlock(dst_pte, ptl);
>         return 0;
>     }
>
> Then at the entry of shmem_mcopy_atomic_pte():
>
>     if (is_continue) {
>         page = find_lock_page(mapping, pgoff);
>         if (!page)
>             return -EFAULT;
>         ret = shmem_install_uffd_pte(...,
>                 is_continue && !(dst_vma->vm_flags & VM_SHARED));
>         unlock_page(page);
>         if (ret)
>             put_page(page);
>         return ret;
>     }
>
> Do you think this would be cleaner?

Yes, a refactor like that is promising. It's hard to say for certain without
actually looking at the result - I'll spend some time tomorrow on a few
options, and send along the cleanest version I come up with.

Thanks for all the feedback and advice on this feature, Peter!

> --
> Peter Xu