From: David Hildenbrand
To: Peter Xu
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
    Arnd Bergmann, Michal Hocko, Oscar Salvador, Matthew Wilcox,
    Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe,
    Dave Hansen, Hugh Dickins, Rik van Riel, "Michael S. Tsirkin",
    "Kirill A. Shutemov", Vlastimil Babka, Richard Henderson,
    Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer,
    "James E.J. Bottomley", Helge Deller, Chris Zankel, Max Filippov,
    linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org,
    linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org,
    linux-arch@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Fri, 19 Feb 2021 20:14:45 +0100
Message-ID: <4d8e6f55-66a6-d701-6a94-79f5e2b23e46@redhat.com>
In-Reply-To: <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>
References: <20210217154844.12392-1-david@redhat.com>
 <20210218225904.GB6669@xz-x1>
 <20210219163157.GF6669@xz-x1>
 <41444eb8-8bb8-8d5b-4cec-be7fa7530d0e@redhat.com>
Organization: Red Hat GmbH

>> It's interesting to know about commit 1e356fc14be ("mem-prealloc: reduce
>> large guest start-up and migration time.", 2017-03-14). It seems for
>> speeding up VM boot, but what I can't understand is why it would cause
>> the delay of hugetlb accounting - I thought we'd fail even earlier at
>> either fallocate() on the hugetlb file (when we use /dev/hugepages) or
>> on mmap() of the memfd which contains the huge pages. See
>> hugetlb_reserve_pages() and its callers. Or did I miss something?
>
> We should fail on mmap() when the reservation happens (unless
> MAP_NORESERVE is passed), I think.
>
>> I think there's a special case if QEMU fork()s with a MAP_PRIVATE
>> hugetlbfs mapping; that could cause the memory accounting to be delayed
>> until COW happens.
>
> That would be kind of weird. I'd assume the reservation gets properly
> done during fork() - just like for VM_ACCOUNT.
>
>> However that's definitely not the case for QEMU, since QEMU won't work
>> at all as late as that point.
>>
>> IOW, for hugetlbfs I don't know why we need to populate the pages at all
>> if we simply want to know "whether we do still have enough space".. And
>> IIUC 2) above is the major issue you'd like to solve too.
>
> To avoid page faults at runtime on access, I think. Reservation <=
> Preallocation.
I just learned that there is more to it: (test done on v5.9)

# echo 512 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
# cat /sys/devices/system/node/node*/meminfo | grep HugePages_
Node 0 HugePages_Total:   512
Node 0 HugePages_Free:    512
Node 0 HugePages_Surp:      0
Node 1 HugePages_Total:     0
Node 1 HugePages_Free:      0
Node 1 HugePages_Surp:      0
# cat /proc/meminfo | grep HugePages_
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0

# /usr/libexec/qemu-kvm -m 1G -smp 1 \
    -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=0 \
    -numa node,nodeid=0,memdev=mem0 \
    -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic

-> works just fine

# /usr/libexec/qemu-kvm -m 1G -smp 1 \
    -object memory-backend-memfd,id=mem0,size=1G,hugetlb=on,hugetlbsize=2M,policy=bind,host-nodes=1 \
    -numa node,nodeid=0,memdev=mem0 \
    -hda Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2 -nographic

-> Does not fail nicely but crashes!

See https://bugzilla.redhat.com/show_bug.cgi?id=1686261 for something
similar; however, it no longer applies like that on more recent kernels.

Hugetlbfs reservations don't always protect you (especially with NUMA) -
that's why e.g., libvirt always tells QEMU to prealloc.

I think the "issue" is that the reservation happens on mmap(), while
mbind() only runs afterwards. Preallocation saves you from that.

I suspect something similar will happen with anonymous memory with
mbind(), even if we reserved swap space. Did not test yet, though.

-- 
Thanks,

David / dhildenb