Received: by 2002:ac0:946b:0:0:0:0:0 with SMTP id j40csp4048811imj; Tue, 19 Feb 2019 14:19:09 -0800 (PST) X-Google-Smtp-Source: AHgI3IaAX4LcMTP2t9bs56uRRMr49Y5YI9LGRWL/5sCCv+tyFZmwo38sfVzjfnFHM8dqywzsvMLF X-Received: by 2002:aa7:8249:: with SMTP id e9mr31593552pfn.93.1550614749672; Tue, 19 Feb 2019 14:19:09 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1550614749; cv=none; d=google.com; s=arc-20160816; b=hTfIT9s6Ro/TE+dBh64DlxIqqfhFAlWF5eDr92pXlcH+sc60j6mFm/iRE0ko5CIGbl LrVA8BwF9ywK0KD8pfP6BsmDuZdZNOKJsVVTbzAomXnaF0VZk22sQFtmtJqOq2CKGzNW KFoPp0zKK9h5TTuWUBSlqqSlYBiYXo7tJHjwGKq8nWv5mwU1v0NHUQwvUbVRCGYRjTWi WO9Dh7WJaj1HFJWDeVFul+JDbU4En9ilccQfyNzMnB5Tmmt0832KVIgfJgcfSuUWy+VJ 7AWWQupSXXndLkRRy9jOpw5ESfDiphTqnTlqu8RkBrQxs2xWtb/bO8mJW9JFxWv6ItPe noHA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=Z2Z+lpO8HZ6jlB7OoQM3hrLYDeJtdKTChovy+cPVcYs=; b=YE6YOXfUzbx1Is1ETeufqCmppH6eNUUN46mWDZW8TKduh+fehDrZYp1vdqZC3j4S5H 31tLLagMFYRMu5p13326eUi9ph9Qu6PsmKidlzWYaorldZgPKRBXQXsuOuAn2/5d3x1d OQmFF+Zc2wLfpwnbiBsjsEubP5kl+CeiZ1OCy/5DMP0N/I0bFnxVpPOt9ZwK8N341+x7 ZHwhUgs7X2Jum6BHbw3cnKvlXgbzkFyKYdd2bmE9Pb+pTb6KEvrXqWHmnq5D6jR2E2sA U8nBFdEeH/x/l/v/bjRQCotOSLFQt6vZ6Hlsun1HPY3tNRqGb+8GliUsBXcjH4L+Rh0F MDKg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id y3si12039026plr.116.2019.02.19.14.18.54; Tue, 19 Feb 2019 14:19:09 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729676AbfBSWRL (ORCPT + 99 others); Tue, 19 Feb 2019 17:17:11 -0500 Received: from mail-qt1-f196.google.com ([209.85.160.196]:36773 "EHLO mail-qt1-f196.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726986AbfBSWRK (ORCPT ); Tue, 19 Feb 2019 17:17:10 -0500 Received: by mail-qt1-f196.google.com with SMTP id p25so24845343qtb.3 for ; Tue, 19 Feb 2019 14:17:09 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=Z2Z+lpO8HZ6jlB7OoQM3hrLYDeJtdKTChovy+cPVcYs=; b=tAX4NUdDMi2Vn44BavNLxQzHE50UUidUSkHsEI/mVNuQbVEpAMGgZWMvZNZFkZ/0b7 mWWR2e5AOu4sZHd5gqCy1AFSh7jShceM9bNijWBRx1ElyZCoeP8b1CG8w0wiQ6jp4E7C OMrFOfz432sdCe802BHxzHO0ci69AasUSoy1jEFFGpeBnHQK4IvsCUWgRgkfzWyZ2noD q6ZuosrKy/hzH0s1Y40K58J/OpCj3e/Y5ccAyCfdRiCOc2hEpT/Vqp0N/MyDBB+9VgBG TjJALToCOLYxcesszfpfzoFLguuwQoNsRsW5KhGDEWlRyFogK91HKVUKtVSt2y7VuvrN 2Q1A== X-Gm-Message-State: AHQUAubtKzE1AQV0z71ZemSRzxANR+On9M9e6PqkMnYxDnemDQOLqzwK FhXZ10vm9Pp9tBL8J2blhUQBvQ== X-Received: by 2002:a0c:ad19:: with SMTP id u25mr23544080qvc.14.1550614629274; Tue, 19 Feb 2019 14:17:09 -0800 (PST) Received: from redhat.com (pool-173-76-246-42.bstnma.fios.verizon.net. [173.76.246.42]) by smtp.gmail.com with ESMTPSA id z42sm5991956qta.28.2019.02.19.14.17.07 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Tue, 19 Feb 2019 14:17:08 -0800 (PST) Date: Tue, 19 Feb 2019 17:17:06 -0500 From: "Michael S. Tsirkin" To: Alexander Duyck Cc: David Hildenbrand , Nitesh Narayan Lal , kvm list , LKML , Paolo Bonzini , lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, Yang Zhang , Rik van Riel , dodgen@google.com, Konrad Rzeszutek Wilk , dhildenb@redhat.com, Andrea Arcangeli Subject: Re: [RFC][Patch v8 0/7] KVM: Guest Free Page Hinting Message-ID: <20190219170446-mutt-send-email-mst@kernel.org> References: <20190218143819-mutt-send-email-mst@kernel.org> <58714908-f203-0b64-845b-5818e52a62fa@redhat.com> <20190218152021-mutt-send-email-mst@kernel.org> <18d87846-72c7-adf0-5ca3-7312540bb31b@redhat.com> <478a9574-a604-0aa9-d569-6a5cd98d7cdc@redhat.com> <77e71dc3-640b-bbf6-6a47-bb2371c06172@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 19, 2019 at 01:57:14PM -0800, Alexander Duyck wrote: > On Tue, Feb 19, 2019 at 10:32 AM David Hildenbrand wrote: > > > > >>> This essentially just ends up being another trade-off of CPU versus > > >>> memory though. Assuming we aren't using THP we are going to take a > > >>> penalty in terms of performance but could then free individual pages > > >>> less than HUGETLB_PAGE_ORDER, but the CPU utilization is going to be > > >>> much higher in general even without the hinting. I figure for x86 we > > >>> probably don't have too many options since if I am not mistaken > > >>> MAX_ORDER is just one or two more than HUGETLB_PAGE_ORDER. > > >> > > >> THP is an implementation detail in the hypervisor. Yes, it is the common > > >> case on x86. But it is e.g. not available on s390x yet. And we also want > > >> this mechanism to work on s390x (e.g. for nested virtualization setups > > >> as discussed). > > >> > > >> If we e.g. report any granularity after merging was done in the buddy, > > >> we could end up reporting everything from page size up to MAX_SIZE - 1, > > >> the hypervisor could ignore hints below a certain magic number, if it > > >> makes its life easier. > > > > > > For each architecture we can do a separate implementation of what to > > > hint on. We already do that for bare metal so why would we have guests > > > do the same type of hinting in the virtualization case when there are > > > fundamental differences in page size and features in each > > > architecture? > > > > > > This is another reason why I think the hypercall approach is a better > > > idea since each architecture is likely going to want to handle things > > > differently and it would be a pain to try and sort that all out in a > > > virtio driver. > > > > I can't follow. We are talking about something as simple as a minimum > > page granularity here that can easily be configured. Nothing that > > screams for different implementations. But I get your point, we could > > tune for different architectures. > > I was thinking about the guest side of things. Basically if we need to > define different page orders for different architectures then we start > needing to do architecture specific includes. Then if we throw in > stuff like the fact that the first level of KVM can make use of the > host style hints then that is another thing that will be a difference > int he different architectures. Sorry didn't catch this one. What are host style hints? > I'm just worried this stuff is going > to start adding up to a bunch of "#ifdef" cruft if we are trying to do > this as a virtio driver. I agree we want to avoid that. And by comparison, if it's up to host or if it's tied to logic within guest (such as MAX_PAGE_ORDER as suggested by Linus) as opposed to CPU architecture, then virtio is easier as you can re-use config space and feature bits to negotiate host/guest capabilities. Doing hypercalls for that would add lots of hypercalls. I CC'd Wei Wang who implemented host-driven hints in the balloon right now. Wei I wonder - could you try changing from MAX_PAGE_ORDER to HUGETLB_PAGE_ORDER? Does this affect performance for you at all? Thanks! > > > > > >>> > > >>> As far as fragmentation my thought is that we may want to look into > > >>> adding support to the guest for prioritizing defragmentation on pages > > >>> lower than THP size. Then that way we could maintain the higher > > >>> overall performance with or without the hinting since shuffling lower > > >>> order pages around between guests would start to get expensive pretty > > >>> quick. > > >> > > >> My take would be, design an interface/mechanism that allows any kind of > > >> granularity. You can than balance between cpu overead and space shifting. > > > > > > The problem with using "any kind of granularity" is that in the case > > > of memory we are already having problems with 4K pages being deemed > > > too small of a granularity to be useful for anything and making > > > operations too expensive. > > > > No, sorry, s390x does it. And via batch reporting it could work. Not > > saying we should do page granularity, but "to be useful for anything" is > > just wrong. > > Yeah, I was engaging in a bit of hyperbole. I have had a headache this > morning so I am a bit cranky. > > So I am assuming the batching is the reason why you also have a > arch_alloc_page then for the s390 so that you can abort the hint if a > page is reallocated before the hint is processed then? I just want to > confirm so that my understanding of this is correct. > > If that is the case I would be much happier with an asynchronous page > hint setup as this doesn't deprive the guest of memory while waiting > on the hint. The current logic in the patches from Nitesh has the > pages unavailable to the guest while waiting on the hint and that has > me somewhat concerned as it is going to hurt cache locality as it will > guarantee that we cannot reuse the same page if we are doing a cycle > of alloc and free for the same page size. > > > > > > I'm open to using other page orders for other architectures. Nothing > > > says we have to stick with THP sized pages for all architectures. I > > > have just been focused on x86 and this seems like the best fit for the > > > balance between CPU and freeing of memory for now on that > > > architecture. > > > > > >> I feel like repeating myself, but on s390x hinting is done on page > > >> granularity, and I have never heard somebody say "how can I turn it off, > > >> this is slowing down my system too much.". All we know is that one > > >> hypercall per free is most probably not acceptable. We really have to > > >> play with the numbers. > > > > > > My thought was we could look at doing different implementations for > > > other architectures such as s390 and powerPC. Odds are the > > > implementations would be similar but have slight differences where > > > appropriate such as what order we should start hinting on, or if we > > > bypass the hypercall/virtio-balloon for a host native approach if > > > available. > > > > > >> I tend to like an asynchronous reporting approach as discussed in this > > >> thread, we would have to see if Nitesh could get it implemented. > > > > > > I agree it would be great if it could work. However I have concerns > > > given that work on this patch set dates back to 2017, major issues > > > such as working around device assignment have yet to be addressed, and > > > it seems like most of the effort is being focused on things that in my > > > opinion are being over-engineered for little to no benefit. > > > > I can understand that you are trying to push your solution. I would do > > the same. Again, I don't like a pure synchronous approach that works on > > one-element-at-a-time. Period. Other people might have other opinions. > > This is mine - luckily I don't have anything to say here :) > > > > MST also voted for an asynchronous solution if we can make it work. > > Nitesh made significant improvements since the 2017. Complicated stuff > > needs time. No need to rush. People have been talking about free page > > hinting since 2006. I talked to various people that experimented with > > bitmap based solutions two years ago. > > Now that I think I have a better understanding of how the s390x is > handling this I'm beginning to come around to the idea of an > asynchronous setup. The one thing that has been bugging me about the > asynchronous approach is the fact that the pages are not available to > the guest while waiting on the hint to be completed. If we can do > something like an arch_alloc_page and that would abort the hint and > allow us to keep the page available while waiting on the hint that > would be my preferred way of handling this. > > > So much to that, if you think your solution is the way to go, please > > follow up on it. Nitesh seems to have decided to look into the > > asynchronous approach you also called "great if it could work". As long > > as we don't run into elementary blockers there, to me it all looks like > > we are making progress, which is good. If we find out asynchronous > > doesn't work, synchronous is the only alternative. > > I plan to follow up in the next week or so. > > > And just so you don't get me wrong: Thanks for looking and working on > > this. And thanks for sharing your opinions and insights! However making > > a decision about going your way at this point does not seem reasonable > > to me. We have plenty of time. > > I appreciate the feedback. Sorry if I seemed a bit short. As I > mentioned I've had a headache most of the morning which hasn't really > helped my mood. > > Thanks. > > - Alex -- MST