References: <20230923013148.1390521-1-surenb@google.com> <20230923013148.1390521-3-surenb@google.com>
From: Jann Horn
Date: Thu, 28 Sep 2023 17:29:43 +0200
Subject: Re: potential new userfaultfd vs khugepaged conflict [was: Re: [PATCH v2 2/3] userfaultfd: UFFDIO_REMAP uABI]
To: Suren Baghdasaryan
Cc: Hugh Dickins, Andrew Morton, Al Viro, brauner@kernel.org, Shuah Khan, Andrea Arcangeli, Lokesh Gidra, Peter Xu, David Hildenbrand, Michal Hocko, Axel Rasmussen, Mike Rapoport, willy@infradead.org, Liam.Howlett@oracle.com, zhangpeng362@huawei.com, Brian Geffon, Kalesh Singh, Nicolas Geoffray, Jared Duke, Linux-MM,
linux-fsdevel, kernel list, "open list:KERNEL SELFTEST FRAMEWORK", kernel-team
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Sep 27, 2023 at 7:12 PM Suren Baghdasaryan wrote:
>
> On Wed, Sep 27, 2023 at 3:07 AM Jann Horn wrote:
> >
> > [moving Hugh into "To:" recipients as FYI for khugepaged interaction]
> >
> > On Sat, Sep 23, 2023 at 3:31 AM Suren Baghdasaryan wrote:
> > > From: Andrea Arcangeli
> > >
> > > This implements the uABI of UFFDIO_REMAP.
> > >
> > > Notably one mode bitflag is also forwarded (and in turn known) by the
> > > lowlevel remap_pages method.
> > >
> > > Signed-off-by: Andrea Arcangeli
> > > Signed-off-by: Suren Baghdasaryan
> > [...]
> > > +/*
> > > + * The mmap_lock for reading is held by the caller. Just move the page
> > > + * from src_pmd to dst_pmd if possible, and return true if succeeded
> > > + * in moving the page.
> > > + */
> > > +static int remap_pages_pte(struct mm_struct *dst_mm,
> > > +                          struct mm_struct *src_mm,
> > > +                          pmd_t *dst_pmd,
> > > +                          pmd_t *src_pmd,
> > > +                          struct vm_area_struct *dst_vma,
> > > +                          struct vm_area_struct *src_vma,
> > > +                          unsigned long dst_addr,
> > > +                          unsigned long src_addr,
> > > +                          __u64 mode)
> > > +{
> > > +       swp_entry_t entry;
> > > +       pte_t orig_src_pte, orig_dst_pte;
> > > +       spinlock_t *src_ptl, *dst_ptl;
> > > +       pte_t *src_pte = NULL;
> > > +       pte_t *dst_pte = NULL;
> > > +
> > > +       struct folio *src_folio = NULL;
> > > +       struct anon_vma *src_anon_vma = NULL;
> > > +       struct mmu_notifier_range range;
> > > +       int err = 0;
> > > +
> > > +       mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm,
> > > +                               src_addr, src_addr + PAGE_SIZE);
> > > +       mmu_notifier_invalidate_range_start(&range);
> > > +retry:
> > > +       dst_pte = pte_offset_map_nolock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
> > > +
> > > +       /* If a huge pmd materialized from under us fail */
> > > +       if (unlikely(!dst_pte)) {
> > > +               err = -EFAULT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       src_pte = pte_offset_map_nolock(src_mm, src_pmd, src_addr, &src_ptl);
> > > +
> > > +       /*
> > > +        * We held the mmap_lock for reading so MADV_DONTNEED
> > > +        * can zap transparent huge pages under us, or the
> > > +        * transparent huge page fault can establish new
> > > +        * transparent huge pages under us.
> > > +        */
> > > +       if (unlikely(!src_pte)) {
> > > +               err = -EFAULT;
> > > +               goto out;
> > > +       }
> > > +
> > > +       BUG_ON(pmd_none(*dst_pmd));
> > > +       BUG_ON(pmd_none(*src_pmd));
> > > +       BUG_ON(pmd_trans_huge(*dst_pmd));
> > > +       BUG_ON(pmd_trans_huge(*src_pmd));
> >
> > This works for now, but note that Hugh Dickins has recently been
> > reworking khugepaged such that PTE-based mappings can be collapsed
> > into transhuge mappings under the mmap lock held in *read mode*;
> > holders of the mmap lock in read mode can only synchronize against
> > this by taking the right page table spinlock and rechecking the pmd
> > value. This is only the case for file-based mappings so far, not for
> > anonymous private VMAs; and this code only operates on anonymous
> > private VMAs so far, so it works out.
> >
> > But if either Hugh further reworks khugepaged such that anonymous VMAs
> > can be collapsed under the mmap lock in read mode, or you expand this
> > code to work on file-backed VMAs, then it will become possible to hit
> > these BUG_ON() calls. I'm not sure what the plans for khugepaged going
> > forward are, but the number of edge cases everyone has to keep in mind
> > would go down if you changed this function to deal gracefully with
> > page tables disappearing under you.
> >
> > In the newest version of mm/pgtable-generic.c, above
> > __pte_offset_map_lock(), there is a big comment block explaining the
> > current rules for page table access; in particular, regarding the
> > helper pte_offset_map_nolock() that you're using:
> >
> >  * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_map();
> >  * but when successful, it also outputs a pointer to the spinlock in ptlp - as
> >  * pte_offset_map_lock() does, but in this case without locking it.
> >  * This helps the caller to avoid a later pte_lockptr(mm, *pmd), which might
> >  * by that time act on a changed *pmd: pte_offset_map_nolock() provides the
> >  * correct spinlock pointer for the page table that it returns. In principle,
> >  * the caller should recheck *pmd once the lock is taken; in practice, no
> >  * callsite needs that - either the mmap_lock for write, or pte_same() check
> >  * on contents, is enough.
> >
> > If this becomes hittable in the future, I think you will need to
> > recheck *pmd, at least for dst_pte, to avoid copying PTEs into a
> > detached page table.
>
> Thanks for the warning, Jann. It sounds to me it would be better to
> add this pmd check now even though it's not hittable. Does that sound
> good to everyone?

Sounds good to me.