Received: by 2002:a05:7412:da14:b0:e2:908c:2ebd with SMTP id fe20csp2004911rdb; Mon, 9 Oct 2023 09:29:48 -0700 (PDT) X-Google-Smtp-Source: AGHT+IEJO5VD3eH7fCOpetL2SSrPNtelVCPEAmjGK3YuKJTAK6+ieQnZjgTNytdLKbtdCWsKODu6 X-Received: by 2002:a17:902:6b04:b0:1b7:e86f:7631 with SMTP id o4-20020a1709026b0400b001b7e86f7631mr12704904plk.19.1696868988490; Mon, 09 Oct 2023 09:29:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1696868988; cv=none; d=google.com; s=arc-20160816; b=yePGBA2cqMxr90a6MvTkG2ZtSIJaH4jmZraddD4vIvga9fiAJhvB7isYPLv0ZnPYbJ gyaBodaLsOAhhFx8OgbJHlrJ6aAVwhba5du4V8RVAaEYGHrbTvVzZojLAJcO2QZ1TEX1 4FB1Qyl14sxda4shFGSTWzO9JEGcLsvWnJoadZgmeITpTT4l44sls7wTlb0LhcSfPo5B lISRPBXuuUHmpezgMsT6qrwn157JttQ5l/xPYLp+PKrzU7R+IgU1u6U9XwIcCqxygkXA t/WH8MfRz7i8bL8TFkRSWcLiBShNIzZoQILfFB7Z5OKZF/rZXZRQnS3CGLRlNdkbLnH3 DQcA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=n9uxnDjj5RsOcAN+RF0ynO6xdWSr/KBmzmP+ARysPwQ=; fh=zTYYM8eLLPhNAYI2noeOflv4wnOOG4bsonXVXomK7fI=; b=v2JFvQaAvXNoOGAhgsfSQcS5yBUhC4OWmK23+ZQNiD8N5tpqUyu0OrWcnSx9AZYrsC GflxdXRAek3RStWNlEWCA6RmMts/bWuhN8DfZOv+i0vwcN0Vi/JlylX3i+7Z4FsEbixY hhQbUSm6kyi4OVz4oYa8Q2O2R28XRySuxipv/wj+34LWa7leHpz5QrCPv6vCBAqBnp9i ZXjW2Z48EE/G7r3cT3ylIgGkWuQWUE2H8aF7UEriCmi/IMHebvO9e26gFaPwAF59Wroo Df2IFRqdn6pJczKcXFy46YgGie8/krgsDfK4xSKV2T/zVVqmpRB5gegFsbmZzT0tkud8 Sw8Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=bzknRyEN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from morse.vger.email (morse.vger.email. [2620:137:e000::3:1]) by mx.google.com with ESMTPS id a4-20020a170902ecc400b001b829a32f2dsi1190556plh.457.2023.10.09.09.29.47 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 09 Oct 2023 09:29:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) client-ip=2620:137:e000::3:1; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20230601 header.b=bzknRyEN; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::3:1 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by morse.vger.email (Postfix) with ESMTP id 8C4CB801BF4C; Mon, 9 Oct 2023 09:29:45 -0700 (PDT) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at morse.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1377179AbjJIQ30 (ORCPT + 99 others); Mon, 9 Oct 2023 12:29:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56446 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1377146AbjJIQ3Z (ORCPT ); Mon, 9 Oct 2023 12:29:25 -0400 Received: from mail-wr1-x429.google.com (mail-wr1-x429.google.com [IPv6:2a00:1450:4864:20::429]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0BE619F for ; Mon, 9 Oct 2023 09:29:23 -0700 (PDT) Received: by mail-wr1-x429.google.com with SMTP id ffacd0b85a97d-313e742a787so2816337f8f.1 for ; Mon, 09 Oct 2023 09:29:22 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1696868961; x=1697473761; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=n9uxnDjj5RsOcAN+RF0ynO6xdWSr/KBmzmP+ARysPwQ=; b=bzknRyENyCX43KOi5mzPbUiU3uf2+7Ruiku1yomFUMIf8Pj1+NQk6nQ40aJi6/djLl COd/P9S3AabzMxTQQQexSgQ+PqzU4Xajk75Fe6OW4EVQWhVaBHf7onlEw7rm9MzwZY15 ZlhP3AnlljENRL3p/Uqfz0Z1yyIHlcscCHuaUB69h7F4mpZBLqW6XRUs45M0Twihf7Zf 86lYSXibmt9ixAT5fZCzgqgRltLgPcg7zpjVKGCTOlen4PuQVpb6Ui3M/qJCJFfkWBSl RpLiTRgiUdL5D7tSiIj4wDhMsRug935ToshteoMxm7VoqMbYNdI/HW968qTvMSRxS3Tm jZpw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1696868961; x=1697473761; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=n9uxnDjj5RsOcAN+RF0ynO6xdWSr/KBmzmP+ARysPwQ=; b=GsLUAxzFGkXe7AVMfcVr7ev4hIGZh3ppSBcA6gTQk/xF6KErAPAVw2hmQFgLWame2C GO6pIdz1l/QQzaxA506ofX9SYU2dcnhSRSNgpqiWQzi/Vyaq7i6V7Ae3Ri0NS8tn4X0J 1cl4VkBBImagct1NypQe+Ab/xtY2L5eiZHIa7FGTm57R4pi7Ozvyt2EuVaTapfTIWF34 OpvKg7RL912cumPbnK2L+z452kM3u+J7C8Egid42iCNG6mTMDf3iti4b+6nPGtEjBZcS UOGt9wO0Km/SLa/o8G8mR7Uac2PDujXIgxDHdabpc529Lb23D4qL6EasAgMbm+91fbmb 5leA== X-Gm-Message-State: AOJu0YyZ/JBLVCW9lwCiizjnKJ9Xv44UW/61br67HIVHnJ0ac+BPbkMg jV98F/ieY2UdZ26SYCQq2aXtWjOfinElffMcRnYowg== X-Received: by 2002:adf:ff88:0:b0:319:5234:5c92 with SMTP id j8-20020adfff88000000b0031952345c92mr9481043wrr.35.1696868961196; Mon, 09 Oct 2023 09:29:21 -0700 (PDT) MIME-Version: 1.0 References: <20231009064230.2952396-1-surenb@google.com> <20231009064230.2952396-3-surenb@google.com> <214b78ed-3842-5ba1-fa9c-9fa719fca129@redhat.com> <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com> In-Reply-To: <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com> From: Lokesh Gidra Date: Mon, 9 Oct 2023 17:29:08 +0100 Message-ID: Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI To: David Hildenbrand Cc: Suren Baghdasaryan , akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com, peterx@redhat.com, hughd@google.com, mhocko@suse.com, axelrasmussen@google.com, rppt@kernel.org, willy@infradead.org, Liam.Howlett@oracle.com, jannh@google.com, zhangpeng362@huawei.com, bgeffon@google.com, kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, kernel-team@android.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-4.8 required=5.0 tests=DKIMWL_WL_MED,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,RCVD_IN_SBL_CSS,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on morse.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (morse.vger.email [0.0.0.0]); Mon, 09 Oct 2023 09:29:45 -0700 (PDT) On Mon, Oct 9, 2023 at 5:24=E2=80=AFPM David Hildenbrand = wrote: > > On 09.10.23 18:21, Suren Baghdasaryan wrote: > > On Mon, Oct 9, 2023 at 7:38=E2=80=AFAM David Hildenbrand wrote: > >> > >> On 09.10.23 08:42, Suren Baghdasaryan wrote: > >>> From: Andrea Arcangeli > >>> > >>> Implement the uABI of UFFDIO_MOVE ioctl. > >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the applicatio= n > >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages = are > >>> available (in userspace) for recycling, as is usually the case in hea= p > >>> compaction algorithms, then we can avoid the page allocation and memc= py > >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the > >>> userspace, we avoid the need to release (via madvise) the pages back = to > >>> the kernel [2]. > >>> We see over 40% reduction (on a Google pixel 6 device) in the compact= ing > >>> thread=E2=80=99s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY= . This was > >>> measured using a benchmark that emulates a heap compaction implementa= tion > >>> using userfaultfd (to allow concurrent accesses by application thread= s). > >>> More details of the usecase are explained in [2]. > >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without > >>> touching them within the same vma. Today, it can only be done by mrem= ap, > >>> however it forces splitting the vma. > >>> > >>> [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarc= ange@redhat.com/ > >>> [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZ= KNKXqxRCyjniNVjp0Aw@mail.gmail.com/ > >>> > >>> Update for the ioctl_userfaultfd(2) manpage: > >>> > >>> UFFDIO_MOVE > >>> (Since Linux xxx) Move a continuous memory chunk into the > >>> userfault registered range and optionally wake up the blocke= d > >>> thread. The source and destination addresses and the number = of > >>> bytes to move are specified by the src, dst, and len fields = of > >>> the uffdio_move structure pointed to by argp: > >>> > >>> struct uffdio_move { > >>> __u64 dst; /* Destination of move */ > >>> __u64 src; /* Source of move */ > >>> __u64 len; /* Number of bytes to move */ > >>> __u64 mode; /* Flags controlling behavior of move = */ > >>> __s64 move; /* Number of bytes moved, or negated e= rror */ > >>> }; > >>> > >>> The following value may be bitwise ORed in mode to change th= e > >>> behavior of the UFFDIO_MOVE operation: > >>> > >>> UFFDIO_MOVE_MODE_DONTWAKE > >>> Do not wake up the thread that waits for page-fault > >>> resolution > >>> > >>> UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES > >>> Allow holes in the source virtual range that is being= moved. > >>> When not specified, the holes will result in ENOENT e= rror. > >>> When specified, the holes will be accounted as succes= sfully > >>> moved memory. This is mostly useful to move hugepage = aligned > >>> virtual regions without knowing if there are transpar= ent > >>> hugepages in the regions or not, but preventing the r= isk of > >>> having to split the hugepage during the operation. > >>> > >>> The move field is used by the kernel to return the number of > >>> bytes that was actually moved, or an error (a negated errno- > >>> style value). If the value returned in move doesn't match t= he > >>> value that was specified in len, the operation fails with th= e > >>> error EAGAIN. The move field is output-only; it is not read= by > >>> the UFFDIO_MOVE operation. > >>> > >>> The operation may fail for various reasons. Usually, remappi= ng of > >>> pages that are not exclusive to the given process fail; once= KSM > >>> might deduplicate pages or fork() COW-shares pages during fo= rk() > >>> with child processes, they are no longer exclusive. Further,= the > >>> kernel might only perform lightweight checks for detecting w= hether > >>> the pages are exclusive, and return -EBUSY in case that chec= k fails. > >>> To make the operation more likely to succeed, KSM should be > >>> disabled, fork() should be avoided or MADV_DONTFORK should b= e > >>> configured for the source VMA before fork(). > >>> > >>> This ioctl(2) operation returns 0 on success. In this case,= the > >>> entire area was moved. On error, -1 is returned and errno i= s > >>> set to indicate the error. Possible errors include: > >>> > >>> EAGAIN The number of bytes moved (i.e., the value returned i= n > >>> the move field) does not equal the value that was > >>> specified in the len field. > >>> > >>> EINVAL Either dst or len was not a multiple of the system pa= ge > >>> size, or the range specified by src and len or dst an= d len > >>> was invalid. > >>> > >>> EINVAL An invalid bit was specified in the mode field. > >>> > >>> ENOENT > >>> The source virtual memory range has unmapped holes an= d > >>> UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set. > >>> > >>> EEXIST > >>> The destination virtual memory range is fully or part= ially > >>> mapped. > >>> > >>> EBUSY > >>> The pages in the source virtual memory range are not > >>> exclusive to the process. The kernel might only perfo= rm > >>> lightweight checks for detecting whether the pages ar= e > >>> exclusive. To make the operation more likely to succe= ed, > >>> KSM should be disabled, fork() should be avoided or > >>> MADV_DONTFORK should be configured for the source vir= tual > >>> memory area before fork(). > >>> > >>> ENOMEM Allocating memory needed for the operation failed. > >>> > >>> ESRCH > >>> The faulting process has exited at the time of a > >>> UFFDIO_MOVE operation. > >>> > >> > >> A general comment simply because I realized that just now: does anythi= ng > >> speak against limiting the operations now to a single MM? > >> > >> The use cases I heard so far don't need it. If ever required, we could > >> consider extending it. > >> > >> Let's reduce complexity and KIS unless really required. > > > > Let me check if there are use cases that require moves between MMs. > > Andrea seems to have put considerable effort to make it work between > > MMs and it would be a pity to lose that. I can send a follow-up patch > > to recover that functionality and even if it does not get merged, it > > can be used in the future as a reference. But first let me check if we > > can drop it. For the compaction use case that we have it's fine to limit it to single MM. However, for general use I think Peter will have a better idea. > > Yes, that sounds reasonable. Unless the big important use cases requires > moving pages between processes, let's leave that as future work for now. > > -- > Cheers, > > David / dhildenb >