Date: Tue, 12 Feb 2019 00:16:34 -0500
From: "Michael S. Tsirkin"
Tsirkin" To: David Hildenbrand Cc: Alexander Duyck , Nitesh Narayan Lal , kvm list , LKML , Paolo Bonzini , lcapitulino@redhat.com, pagupta@redhat.com, wei.w.wang@intel.com, Yang Zhang , Rik van Riel , dodgen@google.com, Konrad Rzeszutek Wilk , dhildenb@redhat.com, Andrea Arcangeli Subject: Re: [RFC][Patch v8 6/7] KVM: Enables the kernel to isolate and report free pages Message-ID: <20190212001027-mutt-send-email-mst@kernel.org> References: <20190205165514-mutt-send-email-mst@kernel.org> <20190208163601-mutt-send-email-mst@kernel.org> <20190209192104-mutt-send-email-mst@kernel.org> <19f6d1f2-9287-6113-07b8-1988907b6108@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <19f6d1f2-9287-6113-07b8-1988907b6108@redhat.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.26]); Tue, 12 Feb 2019 05:16:43 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Feb 11, 2019 at 10:28:31AM +0100, David Hildenbrand wrote: > On 10.02.19 01:38, Michael S. Tsirkin wrote: > > On Fri, Feb 08, 2019 at 02:05:09PM -0800, Alexander Duyck wrote: > >> On Fri, Feb 8, 2019 at 1:38 PM Michael S. Tsirkin wrote: > >>> > >>> On Fri, Feb 08, 2019 at 03:41:55PM -0500, Nitesh Narayan Lal wrote: > >>>>>> I am also planning to try Michael's suggestion of using MAX_ORDER - 1. > >>>>>> However I am still thinking about a workload which I can use to test its > >>>>>> effectiveness. > >>>>> You might want to look at doing something like min(MAX_ORDER - 1, > >>>>> HUGETLB_PAGE_ORDER). I know for x86 a 2MB page is the upper limit for > >>>>> THP which is the most likely to be used page size with the guest. > >>>> Sure, thanks for the suggestion. > >>> > >>> Given current hinting in balloon is MAX_ORDER I'd say > >>> share code. If you feel a need to adjust down the road, > >>> adjust both of them with actual testing showing gains. > >> > >> Actually I'm left kind of wondering why we are even going through > >> virtio-balloon for this? > > > > Just look at what does it do. > > > > It improves memory overcommit if guests are cooperative, and it does > > this by giving the hypervisor addresses of pages which it can discard. > > > > It's just *exactly* like the balloon with all the same limitations. > > I agree, this belongs to virtio-balloon *unless* we run into real > problems implementing it via an asynchronous mechanism. > > > > >> It seems like this would make much more sense > >> as core functionality of KVM itself for the specific architectures > >> rather than some side thing. > > Whatever can be handled in user space and does not have significant > performance impacts should be handled in user space. If we run into real > problems with that approach, fair enough. (e.g. vcpu yielding is a good > example where an implementation in KVM makes sense, not going via QEMU) Just to note, if we wanted to we could add a special kind of VQ where e.g. kick yields the VCPU. You don't necessarily need a hypercall for this. A virtio-cpu, yay! > > > > Well same as balloon: whether it's useful to you at all > > would very much depend on your workloads. > > > > This kind of cooperative functionality is good for co-located > > single-tenant VMs. That's pretty niche. The core things in KVM > > generally don't trust guests. 
> >> In addition this could end up being
> >> redundant when you start getting into either the s390 or PowerPC
> >> architectures as they already have means of providing unused page
> >> hints.
>
> I'd like to note that on s390x the functionality is not provided when
> running nested guests. And there are real problems getting it ever
> supported. (See the description below of how it works on s390x; the
> issue for nested guests is the bits in the guest -> host page tables,
> which we cannot support for nested guests.)
>
> Hinting only works for guests running one level under LPAR (with a
> recent machine), but not nested guests.
>
> (LPAR -> KVM1 works, LPAR -> KVM1 -> KVM2 does not work for the latter)
>
> So an implementation for s390 would still make sense for this scenario.
>
> > Interesting. Is there host support in kvm?
>
> On s390x there is. It works at page granularity, and synchronization
> between guest and host ("don't drop a page in the host while the guest is
> reusing it") is done via special bits in the host->guest page table.
> Instructions in the guest are able to modify these bits. A guest can
> configure a "usage state" of its backed PTEs, e.g. "unused" or "stable".
>
> Whenever a page in the guest is freed/reused, the ESSA instruction is
> triggered in the guest. It will modify the page table bits and add the
> guest physical pfn to a buffer in the host. Once that buffer is full,
> ESSA will trigger an intercept to the hypervisor. Here, all these
> "unused" pages can be zapped.
>
> Also, when swapping a page out in the hypervisor, if it was marked by
> the guest as unused or logically zero, instead of swapping out the page,
> it can simply be dropped and a fresh zero page can be supplied when the
> guest tries to access it.
>
> "ESSA" is implemented in KVM in arch/s390/kvm/priv.c:handle_essa().
>
> So on s390x, it works because the synchronization with the hypervisor is
> directly built into hw virtualization support (guest->host page tables +
> instruction) and ESSA will not intercept on every call (due to the buffer).
>
> >> I have a set of patches I proposed that add similar functionality via
> >> a KVM hypercall for x86 instead of doing it as a part of a Virtio
> >> device[1]. I'm suspecting the overhead of doing things this way is
> >> much less than having to make multiple madvise system calls from QEMU
> >> back into the kernel.
> >
> > Well, whether it's a virtio device is orthogonal to whether it's an
> > madvise call, right? You can build vhost-pagehint and that can
> > handle requests in a VQ within balloon and do it
> > within the host kernel directly.
> >
> > virtio rings let you pass multiple pages so it's really hard to
> > say which will win outright - maybe it's more important
> > to coalesce exits.
>
> We don't know until we measure it.

So to measure, I think we can start with traces that show how often
specific workloads allocate/free pages of specific sizes. We don't
necessarily need hypercall/host support for that. We might want
"mm: Add merge page notifier" so we can count merges.

> -- 
>
> Thanks,
>
> David / dhildenb
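As a rough illustration of the measurement idea above (a sketch only, not
part of any posted patch; the counter and function names are made up), a
guest-side per-order counter is enough to see how often a workload frees
pages of hintable sizes. The existing kmem:mm_page_alloc / mm_page_free
tracepoints should also already expose the order if plain tracing is
preferred:

#include <linux/atomic.h>
#include <linux/mmzone.h>

/*
 * Debug-only sketch: count page frees per order in the guest so the
 * hinting potential of a workload can be estimated without any
 * hypercall or host support.
 */
static atomic_long_t free_order_hist[MAX_ORDER];

static inline void hint_stats_count_free(unsigned int order)
{
	atomic_long_inc(&free_order_hist[order]);
}

/*
 * Call this from the free path in a debug build and dump the histogram
 * via debugfs or printk once the workload is done.
 */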