From: Andrei Vagin
Date: Thu, 21 Apr 2022 22:23:02 -0700
Subject: Re: [fs/pipe] 5a519c8fe4: WARNING:at_mm/page_alloc.c:#__alloc_pages
To: Linus Torvalds
Cc: kernel test robot, Dmitry Safonov <0x7f454c46@gmail.com>, Alexander Viro, Andrew Morton, LKML, lkp@lists.01.org, kernel test robot, Mike Rapoport, Pavel Emelyanov
References: <20220420073717.GD16310@xsang-OptiPlex-9020>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Apr 21, 2022 at 12:28 PM Linus Torvalds wrote:
>
> On Thu, Apr 21, 2022 at 9:30 AM Linus Torvalds
> wrote:
> >
> > The pipe part sounds like a horrible hacky thing.
> >
> > I also assume you already tried that, and hit some performance issues.
> > But it does sound like the better interface, more directly what you
> > want.
> >
> > So what are the problems with using process_vm_readv?

The big advantage of vmsplice is that it can attach real user pages to a
pipe, and any subsequent changes to these pages by the process don't trigger
any allocations or extra copies of data. vmsplice in this case is fast.

After splicing pages into pipes, we resume the process and splice the pages
from the pipes to a socket or a file. The whole process of dumping process
pages is zero-copy.

>
> Actually, I take that back.
>
> Don't use pipes.
>
> Don't use process_vm_readv().
>
> Use the system call we already have for "snapshot the current VM".
>
> It's called "fork()". It's cheap, it's efficient, and it snapshots the
> whole VM in one go. No stupid extra buffers in pipes, no crazy things
> like that.
>
> So just make your pre-dump code do a simple fork(), let the parent
> continue, and then do the dumping in the child at whatever pace you
> want.
>
> In fact, you might just leave the child process alone, and let it _be_
> that pre-dump.
>
> You can create a new snapshot every once in a while, and kill the
> previous snapshot, if you want to keep the snapshot close to the
> target, and then use the memory tracking to track what has changed
> since.
>
> And you might not want to use plain "fork()", but instead some kind of
> "clone()" variant. You might want to use CLONE_PARENT and some
> non-SIGCHLD exit signal to basically hide the snapshot image from the
> thing you are snapshotting.
>
> Anyway, the "use vmsplice to a pipe to create a snapshot" sounds just
> insane when you have a very traditional system call that is all about
> snapshotting the process.
>
> Maybe a new CLONE_xyz flag could be added to make that memory tracking
> integrate better or whatever.
>
> Any showstoppers?
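
For concreteness, the fork()-as-snapshot idea quoted above can be sketched
roughly as follows. This is only a minimal illustration and not CRIU code:
the name fork_snapshot_demo and the two synchronization pipes are made up
for this example, and the process snapshots itself here, whereas a
checkpoint tool would have to trigger the fork from inside the dumped task.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SNAP_LEN 16

/*
 * Hypothetical demo: fork a child that keeps a copy-on-write snapshot of
 * buf, modify buf in the parent, then ask the child what it still sees.
 * On success returns 0 and stores the child's view in seen, which must
 * have room for at least SNAP_LEN bytes.
 */
int fork_snapshot_demo(char *seen, size_t len)
{
	char buf[SNAP_LEN] = "before";
	int go[2], out[2];
	char token;
	ssize_t n = -1;
	pid_t pid;

	assert(seen != NULL);
	if (len < SNAP_LEN)
		return -1;
	if (pipe(go) < 0)
		return -1;
	if (pipe(out) < 0) {
		close(go[0]);
		close(go[1]);
		return -1;
	}

	pid = fork();		/* the "snapshot" is taken here */
	if (pid < 0) {
		close(go[0]);
		close(go[1]);
		close(out[0]);
		close(out[1]);
		return -1;
	}
	if (pid == 0) {
		/* Child: wait until the parent has changed its own copy. */
		close(go[1]);
		close(out[0]);
		if (read(go[0], &token, 1) != 1)
			_exit(1);
		/* Report the snapshotted contents back to the parent. */
		if (write(out[1], buf, SNAP_LEN) != SNAP_LEN)
			_exit(1);
		_exit(0);
	}

	/* Parent: keep running and dirty the page after the fork. */
	close(go[0]);
	close(out[1]);
	strcpy(buf, "after");
	if (write(go[1], "x", 1) == 1)
		n = read(out[0], seen, SNAP_LEN);
	close(go[1]);
	close(out[0]);
	waitpid(pid, NULL, 0);
	return n == SNAP_LEN ? 0 : -1;
}
```

Called with a 16-byte buffer, this should leave "before" in it: the child's
view predates the parent's write, which is the snapshot property the
suggestion relies on. Note that the parent's write after the fork is also
what forces a page copy.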

We considered this approach. CRIU dumps a tree of processes. In many cases,
it's a container with its own pid namespace. In such cases, it isn't
possible to fork helper processes without affecting the behavior of the
dumped processes. First, the helper processes will be visible to the dumped
processes. Second, waitid with __WALL will wait for our helpers, and a dumped
process can be very surprised to find a child that it hasn't created.

For the pre-dump, we don't need a true memory snapshot; we don't care about
changed pages. But if we fork a process at the wrong moment, we can double
its memory consumption, and as this is happening in the dumped process's
context, we can hit its resource limits or trigger OOM in the dumped
container. Forking a helper can itself hit resource limits such as rlimits
or cgroup limits.

Thanks,
Andrei
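
The waitid point above can be illustrated with a small sketch. The names
are assumptions made for this example only: spawn_helper() is a
hypothetical stand-in for a helper forked in the dumped task's context, and
reap_any_child() stands for application code that waits for any child.
Neither is taken from CRIU.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a snapshot helper forked in the target's context. */
pid_t spawn_helper(void)
{
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);	/* the helper does nothing in this sketch */
	return pid;		/* helper pid in the parent, -1 on error */
}

/* What an unsuspecting application might do: wait for any child at all. */
pid_t reap_any_child(void)
{
	siginfo_t info;

	memset(&info, 0, sizeof(info));
	if (waitid(P_ALL, 0, &info, WEXITED | __WALL) < 0)
		return -1;
	assert(info.si_pid > 0);	/* a blocking waitid() reports a real child */
	return info.si_pid;
}
```

If spawn_helper() runs first, reap_any_child() returns the helper's pid,
even though the application logic never created that child.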