Date: Fri, 23 Aug 2019 01:16:34 -0400 (EDT)
From: Pankaj Gupta
To: Alexander Duyck
Cc: Alexander Duyck, nitesh@redhat.com, kvm@vger.kernel.org, mst@redhat.com,
    david@redhat.com, dave hansen, linux-kernel@vger.kernel.org,
    willy@infradead.org, mhocko@kernel.org, linux-mm@kvack.org,
    akpm@linux-foundation.org, virtio-dev@lists.oasis-open.org,
    osalvador@suse.de, yang zhang wz,
    riel@surriel.com, konrad wilk, lcapitulino@redhat.com, wei w wang,
    aarcange@redhat.com, pbonzini@redhat.com, dan j williams
Message-ID: <860165703.10076075.1566537394212.JavaMail.zimbra@redhat.com>
In-Reply-To: <31b75078d004a1ccf77b710b35b8f847f404de9a.camel@linux.intel.com>
References: <20190821145806.20926.22448.stgit@localhost.localdomain>
    <1297409377.9866813.1566470593223.JavaMail.zimbra@redhat.com>
    <31b75078d004a1ccf77b710b35b8f847f404de9a.camel@linux.intel.com>
Subject: Re: [PATCH v6 0/6] mm / virtio: Provide support for unused page reporting
X-Mailing-List: linux-kernel@vger.kernel.org

> On Thu, 2019-08-22 at 06:43 -0400, Pankaj Gupta wrote:
> > > This series provides an asynchronous means of reporting to a hypervisor
> > > that a guest page is no longer in use and can have the data associated
> > > with it dropped. To do this I have implemented functionality that allows
> > > for what I am referring to as unused page reporting
> > >
> > > The functionality for this is fairly simple. When enabled it will
> > > allocate statistics to track the number of reported pages in a given
> > > free area. When the number of free pages exceeds this value plus a high
> > > water value, currently 32, it will begin performing page reporting which
> > > consists of pulling pages off of free list and placing them into a
> > > scatter list.
> > > The scatterlist is then given to the page reporting device and it will
> > > perform the required action to make the pages "reported"; in the case
> > > of virtio-balloon this results in the pages being madvised as
> > > MADV_DONTNEED and as such they are forced out of the guest. After this
> > > they are placed back on the free list, and an additional bit is added
> > > if they are not merged indicating that they are a reported buddy page
> > > instead of a standard buddy page. The cycle then repeats with
> > > additional non-reported pages being pulled until the free areas all
> > > consist of reported pages.
> > >
> > > I am leaving a number of things hard-coded such as limiting the lowest
> > > order processed to PAGEBLOCK_ORDER, and have left it up to the guest to
> > > determine what the limit is on how many pages it wants to allocate to
> > > process the hints. The upper limit for this is based on the size of the
> > > queue used to store the scatter-gather list.
> > >
> > > My primary testing has just been to verify the memory is being freed
> > > after allocation by running memhog 40g on a 40g guest and watching the
> > > total free memory via /proc/meminfo on the host. With this I have
> > > verified most of the memory is freed after each iteration.
> >
> > I tried to go through the entire patch series. I can see you reported a
> > -3.27 drop from the baseline. Is it because of re-faulting the pages
> > after the host has freed them? Can we avoid freeing all the pages from
> > the guest free_area and keep some pages (maybe of mixed order), so that
> > the next allocation is done from the guest itself rather than faulting
> > to the host? This will work with real workloads where allocation and
> > deallocation happen at regular intervals.
> >
> > This can be further optimized based on other factors like host memory
> > pressure etc.
> >
> > Thanks,
> > Pankaj
>
> When I originally started implementing and testing this code I was seeing
> less than a 1% regression. I didn't feel like that was really an accurate
> result since it wasn't putting much stress on the changed code so I have
> modified my tests and kernel so that I have memory shuffling and THP
> enabled. In addition I have gone out of my way to lock things down to a
> single NUMA node on my host system as the code I had would sometimes
> perform better than baseline when running the test due to the fact that
> memory was being freed back to the host and then reallocated which
> actually allowed for better NUMA locality.
>
> The general idea was I wanted to know what the worst case penalty would be
> for running this code, and it turns out most of that is just the cost of
> faulting back in the pages. By enabling memory shuffling I am forcing the
> memory to churn as pages are added to both the head and tail of the
> free_list. The test itself was modified so that it didn't allocate order 0
> pages and instead was allocating transparent huge pages so the effects
> were as visible as possible. Without that the page faulting overhead would
> mostly fall into the noise of having to allocate the memory as order 0
> pages; that is what I had essentially seen earlier when I was running the
> stock page_fault1 test.

Right. I think the reason is that this test is allocating THPs in the
guest, while on the host side you are still using order 0 pages, I assume?

>
> This code does no hinting on anything smaller than either MAX_ORDER - 1 or
> HUGETLB_PAGE_ORDER pages, and it only starts when there are at least 32 of
> them available to hint on. This results in us not starting to perform the
> hinting until there is 64MB to 128MB of memory sitting in the higher order
> regions of the zone.

OK.

>
> The hinting itself stops as soon as we run out of unhinted pages to
> pull from.
> When this occurs we let any pages that are freed after that accumulate
> until we get back to 32 pages being free in a given order. During this
> time we should build up the cache of warm pages that you mentioned,
> assuming that shuffling is not enabled.

I was thinking about something like retaining pages to a lower watermark
here. It looks like we might still have a few lower-order pages in the free
list if they are not merged into orders which are hinted.

>
> As far as further optimizations I don't think there is anything here that
> prevents us from doing that. For now I am focused on just getting the
> basics in place so we have a foundation to start from.

Agree. Thanks for explaining.

Best regards,
Pankaj

>
> Thanks.
>
> - Alex
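
For readers following the thread, the start/stop behaviour and the
"64MB to 128MB" figure discussed above can be sketched as a toy model. This
is not the patch's code. The class and function names are made up for
illustration, and the page size and order values (4 KiB base pages,
HUGETLB_PAGE_ORDER = 9, MAX_ORDER - 1 = 10) are assumed x86-64 values that
the thread itself does not state. Only the high water value of 32 comes
from the thread.

```python
PAGE_SIZE = 4096   # assumption: 4 KiB base pages
HIGH_WATER = 32    # "a high water value, currently 32" in the thread


def threshold_mib(order: int) -> int:
    """Memory that must sit free at `order` before reporting starts, in MiB."""
    return HIGH_WATER * (PAGE_SIZE << order) // (1 << 20)


class FreeArea:
    """Toy model of one free area (a single page order); not kernel code."""

    def __init__(self) -> None:
        self.nr_free = 0      # pages currently on the free list
        self.nr_reported = 0  # of those, pages already reported to the host

    def free_page(self) -> None:
        """A page of this order is returned to the free list."""
        self.nr_free += 1

    def should_start(self) -> bool:
        # The thread says both "exceeds" and "at least 32"; >= is assumed here.
        return self.nr_free - self.nr_reported >= HIGH_WATER

    def report_all(self) -> int:
        """Report every unreported free page; return how many were reported."""
        newly_reported = self.nr_free - self.nr_reported
        self.nr_reported = self.nr_free
        return newly_reported


if __name__ == "__main__":
    # Assumed x86-64 values: HUGETLB_PAGE_ORDER = 9, MAX_ORDER - 1 = 10.
    print(threshold_mib(9), threshold_mib(10))
```

With the assumed orders, the two thresholds come out at 64 MiB and 128 MiB,
which matches the range Alex quotes. After report_all() drains the
unreported pages, should_start() stays false until another 32 pages have
been freed, which is the window in which unreported ("warm") pages can
accumulate in the guest.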