Date: Fri, 19 Feb 2021 14:23:10 -0500
From: Peter Xu
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
	Arnd Bergmann, Michal Hocko, Oscar Salvador, Matthew Wilcox,
	Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe,
	Dave Hansen, Hugh Dickins, Rik van Riel, "Michael S. Tsirkin",
	"Kirill A. Shutemov", Vlastimil Babka, Richard Henderson,
	Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
	"James E.J. Bottomley", Helge Deller, Chris Zankel, Max Filippov,
	linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
	linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
	linux-arch@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Message-ID: <20210219192310.GI6669@xz-x1>
References: <20210217154844.12392-1-david@redhat.com>
 <20210218225904.GB6669@xz-x1>
 <20210219163157.GF6669@xz-x1>
 <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>
In-Reply-To: <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>

On Fri, Feb 19, 2021 at 06:13:47PM +0100, David Hildenbrand wrote:
> On 19.02.21 17:31, Peter Xu wrote:
> > On Fri, Feb 19, 2021 at 09:20:16AM +0100, David Hildenbrand wrote:
> > > On 18.02.21 23:59, Peter Xu wrote:
> > > > Hi, David,
> > > > 
> > > > On Wed, Feb 17, 2021 at 04:48:44PM +0100, David Hildenbrand wrote:
> > > > > When we manage sparse memory mappings dynamically in user space - also
> > > > > sometimes involving MAP_NORESERVE - we want to dynamically populate/
> > > > > discard memory inside such a sparse memory region. Example users are
> > > > > hypervisors (especially implementing memory ballooning or similar
> > > > > technologies like virtio-mem) and memory allocators. In addition, we want
> > > > > to fail in a nice way if populating does not succeed because we are out of
> > > > > backend memory (which can happen easily with file-based mappings,
> > > > > especially tmpfs and hugetlbfs). [1]
> > 
> > E.g., can we simply ask the kernel "how much memory this process can still
> > allocate", then get a number out of it? I'm not sure whether it can be done
> 
> Anything like that is completely racy and unreliable.
The failure path won't be racy, imho - if we can detect that the current
process doesn't have enough memory budget, it'll be more efficient to fail
even before trying to populate any memory and then dropping part of it
again. But I see your point - indeed it's good to guarantee that the guest
won't crash at any point of further guest-side memory access.

Another question: can the user actually specify an arbitrary max-length for
the virtio-mem device (which decides the maximum memory this device could
possibly consume)? I thought we should check that first before realizing
the device, and we really shouldn't fail any guest memory access if that
check passed. Feel free to correct me..

[...]

> > I think there's a special case if QEMU fork() with a MAP_PRIVATE hugetlbfs
> > mapping, that could cause the memory accounting to be delayed until COW happens.
> 
> That would be kind of weird. I'd assume the reservation gets properly done
> during fork() - just like for VM_ACCOUNT.

AFAIK VM_ACCOUNT is never applied to hugetlbfs. Nor do I know of any
accounting done for hugetlbfs during fork(), if not taking pinned pages
into account - that is definitely a special case.

> > However that's definitely not the case for QEMU since QEMU won't work at all as
> > late as that point.
> > 
> > IOW, for hugetlbfs I don't know why we need to populate the pages at all if we
> > simply want to know "whether we do still have enough space".. And IIUC 2)
> > above is the major issue you'd like to solve too.
> 
> To avoid page faults at runtime on access I think. Reservation <=
> Preallocation.

Yes. Besides my above question regarding the max-length of the virtio-mem
device: we care most about private mappings of hugetlbfs/shmem here, am I
right? I'm wondering why we'd need MAP_PRIVATE of these at all in a VM
context. It's definitely not the major scenario when they're shared with
either ovs or any non-qemu process, because then MAP_SHARED is a must.
Then, if we use them privately, can we simply always make it MAP_SHARED?
IMHO MAP_PRIVATE could be helpful only if we'd like COW semantics: when
there's already something in place, we'd like to keep that snapshot but
trigger a page copy on write. But is that the case for a VM memory backend,
which should always be zeroed by default?

Then I'm wondering whether we can simply avoid bothering with MAP_PRIVATE
on these file-backed memory regions at all - we'd then naturally have
fallocate() on hand, which seems to already work for us.

Thanks,

-- 
Peter Xu