Received: by 2002:a05:6a10:9848:0:0:0:0 with SMTP id x8csp3844810pxf; Mon, 29 Mar 2021 13:10:24 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwVp7uZzcoICKiItFzVtoSTyM51zr9y1FnWJHg/11aM6xMAjZyv1GxQ9FjTLMCvvWWuZbBe X-Received: by 2002:aa7:cf95:: with SMTP id z21mr29631672edx.76.1617048624198; Mon, 29 Mar 2021 13:10:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1617048624; cv=none; d=google.com; s=arc-20160816; b=D2+QoO4TuB6CmOm+MiOOpXz93F1cP9q0Wzs0Ta1wVBAg5nTv9mQC7VlLQa0n2C0EvC SQ4+s4TPA23Pxe04TCsCzte/S72BffYOSLRzkIB1LsDCLrn8LXKBBJy+V6EjsZbhpso5 RBxo4X7T+haAN6lnYeQKwdTQ6iBUQp2H7V5tjgyzebXvQaVC4gwXrd0F9zxu7DGxrUfj 0OGL7+uCr997IvhptE1zo+z/ote9EMMbSOwn6ZBTpJs9UfgTJPL0lHep3zyNR0TY0eiS hLFL0YRTeDTbna4SeVl/E2ylmZr5foXwGzNygrevEQxO1lf3XOej+0s8w7ue4iv7DKi8 aI9Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:from:subject:mime-version:message-id:date :dkim-signature; bh=NOHWU2jsEa/0zjQcmOdfFuIgUG8Hlc9nuqXoIh83A5Q=; b=n27PumuriCsmrQ6B/2+k5zni3MZD0jTzH9WXSLLZddcD7I4ZVU5XqQRInA/HPXRnxg O/Wv6dJWkLtKqrbfb7/k5Yl+/RAPahssbOLtDkGhSNhnqkUeldzJYaOhDUdmGS8MwPS4 +HROD5TEJ778i7B0ZC+O4NQPqwwW03iN6y4tc5bfXzNjCULU8y+rONNyaZZT/xLbS5Aw v7mezDY4s7nfQ701emKXWJ9qScUeDeSQ+iCOPVazzLNqJ149p3g8cMYwN5WDeuHdy+/Y 2VB96vTk/HrIt7cPSsrhLJhZcaJUEqtvUo8JDoo0tYqsi5rHbyHTRz3nm5m39q5bQGLN 1XLQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=ZedhW5SW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id g23si13266194ejx.623.2021.03.29.13.10.01; Mon, 29 Mar 2021 13:10:24 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=ZedhW5SW; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231899AbhC2UHo (ORCPT + 99 others); Mon, 29 Mar 2021 16:07:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33890 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231905AbhC2UHV (ORCPT ); Mon, 29 Mar 2021 16:07:21 -0400 Received: from mail-yb1-xb49.google.com (mail-yb1-xb49.google.com [IPv6:2607:f8b0:4864:20::b49]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 240D1C061762 for ; Mon, 29 Mar 2021 13:07:21 -0700 (PDT) Received: by mail-yb1-xb49.google.com with SMTP id u128so20311869ybf.12 for ; Mon, 29 Mar 2021 13:07:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:message-id:mime-version:subject:from:to:cc; bh=NOHWU2jsEa/0zjQcmOdfFuIgUG8Hlc9nuqXoIh83A5Q=; b=ZedhW5SWdfomD/RnQ+SHrsEUCCBbuJiPY9OzxJ1ZFxXYEZzCuSc0SRKBj6vawvmRrX 39+YtgebEbfd/BN7ghUiBk400/OFFNODUez8lemMziyt09UJlv37pc/8jW0TJbkPVOMN YiZKF/tGBBSHOYp6O0CIj2+Nwhs+At6wrK9HwHA1mrP7g5E6YEsdDlCWvuo9XsVp6s+8 l5XBKu6NE0qxj6pVKxCrSf3CwiTYBSoxWkSDJATwW6OxIvK6rn6Vus9ZCH9yrPvGzzov Tlj/ZxEI9xnRDXbijtCFxfG8hQOT9VCP9CBCZ2NP63ocQ6Yj397+L60dJI7NrGwiHBog ClmQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:message-id:mime-version:subject:from:to:cc; bh=NOHWU2jsEa/0zjQcmOdfFuIgUG8Hlc9nuqXoIh83A5Q=; b=qNfg9rt7BUFSpKlMas6Z1TGU75OrYWFs03ERunpfnykIhIAa5oWOXRmbP5zRC9ZrD5 4S4kCCuEclN9MV4LUXUrXdUPUsNJbLxrqFvVKwhqzns6riiu+d6xTwprbRifeF0Hj8dP mBWOBJgLZQJ2r10G2u/dDsOdbKx3O3OEJy2HxZLW0N4fKjvTZJXEMp+DlsqZ0DsznYAX 1YOVZM0E3hc8DfbUbvHpBiNV+APNj5husEQFTaHTRECrYJX/TDVAoK7HJ6jOLR6Msiqd SDfgJkwyFBdWng5AolPSW7HPV+5azPyvAlxoSaD2Op7yUXwbUxybXk2EvpgVrJSexVzR oPIQ== X-Gm-Message-State: AOAM532VU5xzlQISsyudBahcKzfUmI3qHEc3+m2QBeojpGTl5vlPyqbA t79yNpnVAv8IlxnavhmFCseL5Y05KipOGKDFl5pU X-Received: from ajr0.svl.corp.google.com ([2620:15c:2cd:203:8578:c167:dddf:13a2]) (user=axelrasmussen job=sendgmr) by 2002:a25:3188:: with SMTP id x130mr15759667ybx.174.1617048440354; Mon, 29 Mar 2021 13:07:20 -0700 (PDT) Date: Mon, 29 Mar 2021 13:06:59 -0700 Message-Id: <20210329200659.65971-1-axelrasmussen@google.com> Mime-Version: 1.0 X-Mailer: git-send-email 2.31.0.291.g576ba9dcdaf-goog Subject: [PATCH v2] userfaultfd/shmem: fix MCOPY_ATOMIC_CONTNUE behavior From: Axel Rasmussen To: Alexander Viro , Andrea Arcangeli , Andrew Morton , Hugh Dickins , Jerome Glisse , Joe Perches , Lokesh Gidra , Mike Rapoport , Peter Xu , Shaohua Li , Shuah Khan , Stephen Rothwell , Wang Qing Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Axel Rasmussen , Brian Geffon , Cannon Matthews , "Dr . David Alan Gilbert" , David Rientjes , Michel Lespinasse , Mina Almasry , Oliver Upton Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Previously, we shared too much of the code with COPY and ZEROPAGE, so we manipulated things in various invalid ways: - Previously, we unconditionally called shmem_inode_acct_block. In the continue case, we're looking up an existing page which would have been accounted for properly when it was allocated. So doing it twice results in double-counting, and eventually leaking. - Previously, we made the pte writable whenever the VMA was writable. However, for continue, consider this case: 1. A tmpfs file was created 2. The non-UFFD-registered side mmap()-s with MAP_SHARED 3. The UFFD-registered side mmap()-s with MAP_PRIVATE In this case, even though the UFFD-registered VMA may be writable, we still want CoW behavior. So, check for this case and don't make the pte writable. - The offset / max_off checking doesn't necessarily hurt anything, but it's not needed in the CONTINUE case, so skip it. - Previously, we unconditionally called ClearPageDirty() in the error path. In the continue case though, since this is an existing page, it might have already been dirty before we started touching it. So, remember whether or not it was dirty before we set_page_dirty(), and only clear the bit if it wasn't dirty before. - Previously, we unconditionally removed the page from the page cache in the error path. But in the continue case, we didn't add it - it was already there because the page is present in some second (non-UFFD-registered) mapping. So, removing it is invalid. Because the error handling issues are easy to exercise in the selftest, make a small modification there to do so. Finally, refactor shmem_mcopy_atomic_pte a bit. By this point, we've added a lot of "if (!is_continue)"-s everywhere. It's cleaner to just check for that mode first thing, and then "goto" down to where the parts we actually want are. This leaves the code in between cleaner. Changes since v1: - Refactor to skip ahead with goto, instead of adding several more "if (!is_continue)". - Fix unconditional ClearPageDirty(). - Don't pte_mkwrite() when is_continue && !VM_SHARED. Fixes: 00da60b9d0a0 ("userfaultfd: support minor fault handling for shmem") Signed-off-by: Axel Rasmussen --- mm/shmem.c | 67 ++++++++++++++---------- tools/testing/selftests/vm/userfaultfd.c | 12 +++++ 2 files changed, 51 insertions(+), 28 deletions(-) diff --git a/mm/shmem.c b/mm/shmem.c index d2e0e81b7d2e..8ab1f1f29987 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2378,17 +2378,22 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, pte_t _dst_pte, *dst_pte; int ret; pgoff_t offset, max_off; - - ret = -ENOMEM; - if (!shmem_inode_acct_block(inode, 1)) - goto out; + int writable; + bool was_dirty; if (is_continue) { ret = -EFAULT; page = find_lock_page(mapping, pgoff); if (!page) - goto out_unacct_blocks; - } else if (!*pagep) { + goto out; + goto install_ptes; + } + + ret = -ENOMEM; + if (!shmem_inode_acct_block(inode, 1)) + goto out; + + if (!*pagep) { page = shmem_alloc_page(gfp, info, pgoff); if (!page) goto out_unacct_blocks; @@ -2415,13 +2420,11 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, *pagep = NULL; } - if (!is_continue) { - VM_BUG_ON(PageSwapBacked(page)); - VM_BUG_ON(PageLocked(page)); - __SetPageLocked(page); - __SetPageSwapBacked(page); - __SetPageUptodate(page); - } + VM_BUG_ON(PageSwapBacked(page)); + VM_BUG_ON(PageLocked(page)); + __SetPageLocked(page); + __SetPageSwapBacked(page); + __SetPageUptodate(page); ret = -EFAULT; offset = linear_page_index(dst_vma, dst_addr); @@ -2429,16 +2432,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, if (unlikely(offset >= max_off)) goto out_release; - /* If page wasn't already in the page cache, add it. */ - if (!is_continue) { - ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, - gfp & GFP_RECLAIM_MASK, dst_mm); - if (ret) - goto out_release; - } + ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL, + gfp & GFP_RECLAIM_MASK, dst_mm); + if (ret) + goto out_release; +install_ptes: _dst_pte = mk_pte(page, dst_vma->vm_page_prot); - if (dst_vma->vm_flags & VM_WRITE) + /* For CONTINUE on a non-shared VMA, don't pte_mkwrite for CoW. */ + writable = is_continue && !(dst_vma->vm_flags & VM_SHARED) + ? 0 + : dst_vma->vm_flags & VM_WRITE; + if (writable) _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte)); else { /* @@ -2448,15 +2453,18 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, * unconditionally before unlock_page(), but doing it * only if VM_WRITE is not set is faster. */ + was_dirty = PageDirty(page); set_page_dirty(page); } dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); - ret = -EFAULT; - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(offset >= max_off)) - goto out_release_unlock; + if (!is_continue) { + ret = -EFAULT; + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); + if (unlikely(offset >= max_off)) + goto out_release_unlock; + } ret = -EEXIST; if (!pte_none(*dst_pte)) @@ -2485,13 +2493,16 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, return ret; out_release_unlock: pte_unmap_unlock(dst_pte, ptl); - ClearPageDirty(page); - delete_from_page_cache(page); + if (!was_dirty) + ClearPageDirty(page); + if (!is_continue) + delete_from_page_cache(page); out_release: unlock_page(page); put_page(page); out_unacct_blocks: - shmem_inode_unacct_blocks(inode, 1); + if (!is_continue) + shmem_inode_unacct_blocks(inode, 1); goto out; } #endif /* CONFIG_USERFAULTFD */ diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c index f6c86b036d0f..d8541a59dae5 100644 --- a/tools/testing/selftests/vm/userfaultfd.c +++ b/tools/testing/selftests/vm/userfaultfd.c @@ -485,6 +485,7 @@ static void wp_range(int ufd, __u64 start, __u64 len, bool wp) static void continue_range(int ufd, __u64 start, __u64 len) { struct uffdio_continue req; + int ret; req.range.start = start; req.range.len = len; @@ -493,6 +494,17 @@ static void continue_range(int ufd, __u64 start, __u64 len) if (ioctl(ufd, UFFDIO_CONTINUE, &req)) err("UFFDIO_CONTINUE failed for address 0x%" PRIx64, (uint64_t)start); + + /* + * Error handling within the kernel for continue is subtly different + * from copy or zeropage, so it may be a source of bugs. Trigger an + * error (-EEXIST) on purpose, to verify doing so doesn't cause a BUG. + */ + req.mapped = 0; + ret = ioctl(ufd, UFFDIO_CONTINUE, &req); + if (ret >= 0 || req.mapped != -EEXIST) + err("failed to exercise UFFDIO_CONTINUE error handling, ret=%d, mapped=%" PRId64, + ret, req.mapped); } static void *locking_thread(void *arg) -- 2.31.0.291.g576ba9dcdaf-goog