Subject: Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
To: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
    Dan Williams, "Michael S. Tsirkin", Jason Wang, Dave Hansen,
    Michal Hocko, Liang Li, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org
From: David Hildenbrand
Organization: Red Hat GmbH
Message-ID: <7bf0e895-52d6-9e2d-294b-980c33cf08e4@redhat.com>
Date: Tue, 22 Dec 2020 09:47:27 +0100
In-Reply-To: <20201221162519.GA22504@open-light-1.localdomain>
X-Mailing-List: linux-kernel@vger.kernel.org

On 21.12.20 17:25, Liang Li wrote:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zeroing out page contents usually happens when allocating pages with
> the __GFP_ZERO flag. It is a time-consuming operation and makes
> populating a large VMA very slow. This patch series introduces a
> feature that zeroes out free pages before allocation, which helps
> speed up page allocation with __GFP_ZERO.
>
> My original intention for adding this feature was to shorten VM
> creation time when an SR-IOV device is attached. It works well: VM
> creation time is reduced by about 90%.
>
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                   round1   round2   round3
> w/o this patch:   23.5s    24.7s    24.6s
> w/  this patch:   10.2s    10.3s    11.2s
>
> QEMU uses 4K pages, THP is on
>                   round1   round2   round3
> w/o this patch:   17.9s    14.8s    14.9s
> w/  this patch:    1.9s     1.8s     1.9s
> =====================================================
>

I am still not convinced that we want/need this for this (main) use
case. Why can't we use huge pages for such use cases (that really care
about VM creation time) and rather deal with pre-zeroing of huge pages
instead? If possible, I'd like to avoid GFP_ZERO (for reasons already
discussed).

> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:
>
> Interactive scenarios
> =====================
> Shortening application launch time on a desktop or mobile phone helps
> to improve the user experience. Tests on a server
> [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz] show that zeroing out 1GB
> of RAM in the kernel takes about 200ms, while commonly used
> applications such as the Firefox browser or Office consume 100~300 MB
> of RAM right after launch. By pre-zeroing free pages, application
> launch time would be reduced by about 20~60ms (can that be visually
> sensed?). Maybe we can use this feature to speed up the launch of
> Android apps (I didn't do any tests for Android).

I am not really sure you can actually visually sense a difference in
your examples. Startup time of an application is not just memory
allocation (page zeroing) time. It would be interesting to see how much
of a difference this actually makes in practice (e.g., Firefox startup
time etc.).

>
> Virtualization
> ==============
> Speed up VM creation and shorten guest boot time, especially for the
> PCI SR-IOV device passthrough scenario.
> Compared with some of the para-virtualization solutions, it is easy
> to deploy because it's transparent to the guest and can handle DMA
> properly in the BIOS stage, while the para-virtualization solutions
> can't handle that well.

What is the "para virtualization" approach you are talking about?

> Improve guest performance when using VIRTIO_BALLOON_F_REPORTING for
> memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports
> guest pages to the VMM, and the VMM unmaps the corresponding host
> pages to reclaim them. When the guest allocates a page that was just
> reclaimed, the host has to allocate a new page and zero it out for
> the guest. In this case, pre-zeroing free pages helps speed up the
> fault-in process and reduces the performance impact.

Such faults in the VMM are no different from other faults when first
accessing a page to be populated. Again, I wonder how much of a
difference it actually makes.

> Speed up kernel routines
> ========================
> This can't be guaranteed because we don't pre-zero all the free
> pages, but it is true for most cases. It can help speed up some
> important system calls such as fork(), which allocates zeroed pages
> for building page tables. It also speeds up page fault handling,
> especially for huge page faults. A POC of hugetlb free page
> pre-zeroing has been done.

Would be interesting to have an actual example with some numbers.

> Security
> ========
> This is a weak version of "introduce init_on_alloc=1 and
> init_on_free=1 boot options", which zeroes out pages in an
> asynchronous way. For users who can't tolerate the impact that
> 'init_on_alloc=1' or 'init_on_free=1' brings, this feature provides
> another choice.

"we don't pre-zero all the free pages", so this is of little actual
use.

-- 
Thanks,

David / dhildenb