Date: Tue, 19 Feb 2019 09:40:16 -0500
From: "Michael S. Tsirkin"
To: David Hildenbrand
Cc: Alexander Duyck, Nitesh Narayan Lal, kvm list, LKML, Paolo Bonzini,
 lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com,
 Yang Zhang, Rik van Riel, dodgen@google.com, Konrad Rzeszutek Wilk,
 dhildenb@redhat.com, Andrea Arcangeli
Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting
Message-ID: <20190219093000-mutt-send-email-mst@kernel.org>
References: <20190204201854.2328-1-nitesh@redhat.com>
 <20190218114601-mutt-send-email-mst@kernel.org>
 <44740a29-bb14-e6e6-2992-98d0ae58e994@redhat.com>
 <93c78cb7-5dc9-39ae-83bf-a4d6426b5221@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <93c78cb7-5dc9-39ae-83bf-a4d6426b5221@redhat.com>

On Tue, Feb 19, 2019 at 09:06:01AM +0100, David Hildenbrand wrote:
> On 19.02.19 00:47, Alexander Duyck wrote:
> > On Mon, Feb 18, 2019 at 9:42 AM David Hildenbrand wrote:
> >>
> >> On 18.02.19 18:31, Alexander Duyck wrote:
> >>> On Mon, Feb 18, 2019 at 8:59 AM David Hildenbrand wrote:
> >>>>
> >>>> On 18.02.19 17:49, Michael S. Tsirkin wrote:
> >>>>> On Sat, Feb 16, 2019 at 10:40:15AM +0100, David Hildenbrand wrote:
> >>>>>> It would be worth a try. My feeling is that a synchronous report after
> >>>>>> e.g. 512 frees should be acceptable, as it seems to be acceptable on
> >>>>>> s390x. (basically always enabled, nobody complains).
> >>>>>
> >>>>> What slips under the radar on an arch like s390 might
> >>>>> raise issues for a popular arch like x86. My fear would be
> >>>>> if it's only a problem e.g. for realtime. Then you get
> >>>>> a condition that's very hard to trigger and affects
> >>>>> worst case latencies.
> >>>>
> >>>> Realtime should never use free page hinting. Just like it should never
> >>>> use ballooning. Just like it should pin all pages in the hypervisor.
> >>>>
> >>>>>
> >>>>> But really what business has something that is supposedly
> >>>>> an optimization blocking a VCPU? We are just freeing up
> >>>>> lots of memory, why is it a good idea to slow that
> >>>>> process down?
> >>>>
> >>>> I first want to know that it is a problem before we declare it a
> >>>> problem. I provided an example (s390x) where it does not seem to be a
> >>>> problem. One hypercall ~every 512 frees. As simple as it can get.
> >>>>
> >>>> Not trying to deny that it could be a problem on x86, but then I assume
> >>>> it is only a problem in specific setups.
> >>>>
> >>>> I would much rather prefer a simple solution that can eventually be
> >>>> disabled in selected setups than a complicated solution that tries to fit
> >>>> all possible setups. Realtime is one of the examples where such stuff is
> >>>> to be disabled either way.
> >>>>
> >>>> Optimization of space comes with a price (here: execution time).
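Just so we are all looking at the same thing: below is roughly the guest
side I picture when we say "one hypercall per ~512 frees". It is only a
sketch - hint_hypercall(), struct hint_buf and friends are made-up names
for illustration and are not taken from Nitesh's series:

/* Sketch: collect freed ranges per CPU and report them to the host
 * synchronously once HINT_BATCH entries have accumulated.  Assumes the
 * caller sits in the page freeing path with interrupts off, so we do
 * not migrate between CPUs while touching the per-CPU buffer.
 */
#include <linux/mm.h>
#include <linux/percpu.h>

#define HINT_BATCH	512

struct page_hint {
	unsigned long pfn;
	unsigned int order;
};

struct hint_buf {
	unsigned int count;
	struct page_hint hints[HINT_BATCH];
};

static DEFINE_PER_CPU(struct hint_buf, hint_buf);

/* Placeholder for the guest->host interface, not an existing API. */
extern void hint_hypercall(phys_addr_t buf_pa, unsigned int count);

static void guest_free_page_hint(struct page *page, unsigned int order)
{
	struct hint_buf *buf = this_cpu_ptr(&hint_buf);

	buf->hints[buf->count].pfn = page_to_pfn(page);
	buf->hints[buf->count].order = order;

	if (++buf->count == HINT_BATCH) {
		/* Synchronous: the vCPU blocks here until the host has
		 * acted on the whole batch (e.g. MADV_FREE on the
		 * covered ranges).
		 */
		hint_hypercall(__pa(buf->hints), buf->count);
		buf->count = 0;
	}
}

Everything interesting - the cost of that blocking call, and the race
with allocation that David mentions - is hidden behind that one
hypercall.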
> >>>
> >>> One thing to keep in mind though is that if you are already having to
> >>> pull pages in and out of swap on the host in order to be able to provide
> >>> enough memory for the guests, the free page hinting should be a
> >>> significant win in terms of performance.
> >>
> >> Indeed. And also we are in a virtualized environment already, we can
> >> have any kind of sudden hiccups. (again, realtime has special
> >> requirements on the setup)
> >>
> >> Side note: I like your approach because it is simple. I don't like your
> >> approach because it cannot deal with fragmented memory. And that can
> >> happen easily.
> >>
> >> The idea I described here can similarly be an extension of your
> >> approach, merging in the "batched reporting" Nitesh proposed, so we can
> >> report on something < MAX_ORDER, similar to s390x. In the end it boils
> >> down to reporting via hypercall vs. reporting via virtio. The main point
> >> is that it is synchronous and batched. (and that we properly take care
> >> of the race between host freeing and guest allocation)
> >
> > I'd say the discussion is even simpler than that. My concern is more
> > synchronous versus asynchronous. I honestly think the cost of a
> > synchronous call is being overblown and we are likely to see the fault
> > and zeroing of pages cost more than the hypercall or virtio
> > transaction itself.
>
> The overhead of page faults and zeroing should be mitigated by
> MADV_FREE, as Andrea correctly stated (thanks!). Then the call overhead
> (context switch) becomes relevant.
>
> We have various discussions now :) And I think they are related:
>
> synchronous versus asynchronous
> batched vs. non-batched
> MAX_ORDER - 1 vs. other/no magic number
>
> 1. A synchronous call without batching on every kfree is bad. The
> interface is fixed to big magic numbers, otherwise we end up having a
> hypercall on every kfree. This is your approach.
>
> 2. Asynchronous calls without batching would most probably have similar
> problems with small granularities as we had when ballooning without
> batching. Just overhead we can avoid.
>
> 3. Synchronous and batched is what s390x does. It can deal with page
> granularity. It is what I initially described in this sub-thread.
>
> 4. Asynchronous and batched. This is the other approach we discussed
> yesterday. If we can get it implemented, I would be interested in
> performance numbers.
>
> As far as I understood, Michael seems to favor something like 4 (and I
> assume eventually 2 if it is similarly fast). I am a friend of either 3
> or 4.

Well, Linus said big granularity is important for Linux MM
and not to bother with hinting small sizes.

Alex said the cost of a hypercall is dwarfed by a page fault
after alloc. I would be curious whether async page faults
can help things somehow though.

> >
> > Also, one reason why I am not a fan of working with anything less than
> > PMD order is because there have been issues in the past with false
> > memory leaks being created when hints were provided on THP pages that
> > essentially fragmented them. I guess khugepaged went through and
> > started trying to reassemble the huge pages, and as a result there have
> > been apps that ended up consuming more memory than they would have
> > otherwise, since they were using fragments of THP pages after doing an
> > MADV_DONTNEED on sections of the page.
>
> I understand your concerns, but we should not let bugs in the hypervisor
> dictate the design. Bugs are there to be fixed. Interesting read,
> though, thanks!
Right, but if we break up a huge page we are then creating more
work for the hypervisor to reassemble it.

> --
>
> Thanks,
>
> David / dhildenb
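P.S. For anyone who has not run into the effect Alex describes: the
host-side mechanism is easy to see with a few lines of userspace code.
This is only an illustration of a partial MADV_DONTNEED on a THP-backed
region, not anything taken from QEMU:

/* Fault in a 2 MiB anonymous region so it can be backed by a THP, then
 * zap a single 4 KiB page out of the middle.  The partial MADV_DONTNEED
 * forces the huge mapping to be split; khugepaged has to do extra work
 * later to collapse the region back into a huge page, and until then the
 * app is running on small-page fragments.
 */
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
	void *buf;

	if (posix_memalign(&buf, THP_SIZE, THP_SIZE))
		return 1;

	madvise(buf, THP_SIZE, MADV_HUGEPAGE);	/* ask for a huge page */
	memset(buf, 1, THP_SIZE);		/* fault the region in */

	/* Hint away one 4 KiB chunk: the 2 MiB mapping gets split. */
	madvise((char *)buf + THP_SIZE / 2, 4096, MADV_DONTNEED);

	return 0;
}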