Date: Fri, 26 Jan 2018 23:43:43 +0200
From: "Michael S. Tsirkin"
To: Wei Wang
Cc: virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
	linux-mm@kvack.org, mhocko@kernel.org, akpm@linux-foundation.org,
	pbonzini@redhat.com, liliang.opensource@gmail.com,
	yang.zhang.wz@gmail.com, quan.xu0@gmail.com, nilal@redhat.com,
	riel@redhat.com
Subject: Re: [PATCH v24 1/2] mm: support reporting free page blocks
Message-ID: <20180126233950-mutt-send-email-mst@kernel.org>
References: <1516790562-37889-1-git-send-email-wei.w.wang@intel.com>
	<1516790562-37889-2-git-send-email-wei.w.wang@intel.com>
	<20180125152933-mutt-send-email-mst@kernel.org>
	<5A6AA08B.2080508@intel.com>
	<20180126155224-mutt-send-email-mst@kernel.org>
In-Reply-To: <20180126155224-mutt-send-email-mst@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 26, 2018 at 05:00:09PM +0200, Michael S. Tsirkin wrote:
> On Fri, Jan 26, 2018 at 11:29:15AM +0800, Wei Wang wrote:
> > On 01/25/2018 09:41 PM, Michael S. Tsirkin wrote:
> > > On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote:
> > > > This patch adds support to walk through the free page blocks in the
> > > > system and report them via a callback function. Some page blocks may
> > > > leave the free list after zone->lock is released, so it is the caller's
> > > > responsibility to either detect or prevent the use of such pages.
> > > >
> > > > One use example of this patch is to accelerate live migration by skipping
> > > > the transfer of free pages reported from the guest.
> > > > A popular method used
> > > > by the hypervisor to track which part of memory is written during live
> > > > migration is to write-protect all the guest memory. So, those pages that
> > > > are reported as free pages but are written after the report function
> > > > returns will be captured by the hypervisor, and they will be added to the
> > > > next round of memory transfer.
> > > >
> > > > Signed-off-by: Wei Wang
> > > > Signed-off-by: Liang Li
> > > > Cc: Michal Hocko
> > > > Cc: Michael S. Tsirkin
> > > > Acked-by: Michal Hocko
> > > > ---
> > > >  include/linux/mm.h |  6 ++++
> > > >  mm/page_alloc.c    | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  2 files changed, 97 insertions(+)
> > > >
> > > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > > index ea818ff..b3077dd 100644
> > > > --- a/include/linux/mm.h
> > > > +++ b/include/linux/mm.h
> > > > @@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size,
> > > >  		unsigned long zone_start_pfn, unsigned long *zholes_size);
> > > >  extern void free_initmem(void);
> > > > +extern void walk_free_mem_block(void *opaque,
> > > > +				int min_order,
> > > > +				bool (*report_pfn_range)(void *opaque,
> > > > +							 unsigned long pfn,
> > > > +							 unsigned long num));
> > > > +
> > > >  /*
> > > >   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> > > >   * into the buddy system. The freed pages will be poisoned with pattern
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 76c9688..705de22 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> > > >  	show_swap_cache_info();
> > > >  }
> > > > +/*
> > > > + * Walk through a free page list and report the found pfn range via the
> > > > + * callback.
> > > > + *
> > > > + * Return false if the callback requests to stop reporting. Otherwise,
> > > > + * return true.
> > > > + */
> > > > +static bool walk_free_page_list(void *opaque,
> > > > +				struct zone *zone,
> > > > +				int order,
> > > > +				enum migratetype mt,
> > > > +				bool (*report_pfn_range)(void *,
> > > > +							 unsigned long,
> > > > +							 unsigned long))
> > > > +{
> > > > +	struct page *page;
> > > > +	struct list_head *list;
> > > > +	unsigned long pfn, flags;
> > > > +	bool ret;
> > > > +
> > > > +	spin_lock_irqsave(&zone->lock, flags);
> > > > +	list = &zone->free_area[order].free_list[mt];
> > > > +	list_for_each_entry(page, list, lru) {
> > > > +		pfn = page_to_pfn(page);
> > > > +		ret = report_pfn_range(opaque, pfn, 1 << order);
> > > > +		if (!ret)
> > > > +			break;
> > > > +	}
> > > > +	spin_unlock_irqrestore(&zone->lock, flags);
> > > > +
> > > > +	return ret;
> > > > +}
> > > There are two issues with this API. One is that it is not
> > > restartable: if you return false, you start from the
> > > beginning. So no way to drop lock, do something slow
> > > and then proceed.
> > >
> > > Another is that you are using it to report free page hints. Presumably
> > > the point is to drop these pages - keeping them near head of the list
> > > and reusing the reported ones will just make everything slower
> > > invalidating the hint.
> > >
> > > How about rotating these pages towards the end of the list?
> > > Probably not on each call, collect reported pages and then
> > > move them to tail when we exit.
> >
> >
> > I'm not sure how this would help. For example, we have a list of 2M free
> > page blocks:
> > A-->B-->C-->D-->E-->F-->G-->H
> >
> > After reporting A and B, and put them to the end and exit, when the caller
> > comes back,
> > 1) if the list remains unchanged, then it will be
> > C-->D-->E-->F-->G-->H-->A-->B
>
> Right. So here we can just scan until we see A, right? It's a harder
> question what to do if A and only A has been consumed. We don't want B
> to be sent twice ideally. OTOH maybe that isn't a big deal if it's only
> twice.
> Host might know page is already gone - how about host gives us a
> hint after using the buffer?
>
> > 2) If worse, all the blocks have been split into smaller blocks and used
> > after the caller comes back.
> >
> > where could we continue?
>
> I'm not sure. But an alternative appears to be to hold a lock
> and just block whoever wanted to use any pages. Yes we are sending
> hints faster but apparently something wanted these pages, and holding
> the lock is interfering with this something.

I've been thinking about it. How about the following scheme:
1. register balloon to get a (new) callback when free list runs empty
2. take pages off the free list, add them to the balloon specific list
3. report to host
4. re-add to free list at tail
5. if callback triggers, interrupt balloon reporting to host,
   and re-add to free list at tail

This needs some thought wrt what happens when there are multiple users of this
API, but looks like it will work.

-- 
MST
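
To make the five-step scheme above concrete, here is a rough userspace model in C. It is only a sketch under stated assumptions, not kernel code: struct block, struct list, balloon_report_pass() and the demo_* helpers are all hypothetical names, the "free list runs empty" callback is modelled as a predicate that is polled before each block is taken (a real callback would fire asynchronously from the allocator), and no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a free page block; not a kernel structure. */
struct block {
	int id;
	struct block *prev, *next;
};

/* Circular doubly linked list with a sentinel head node. */
struct list {
	struct block head;
	size_t len;
};

static void list_init(struct list *l)
{
	l->head.prev = l->head.next = &l->head;
	l->len = 0;
}

static void list_add_tail(struct list *l, struct block *b)
{
	assert(b != NULL);
	b->prev = l->head.prev;
	b->next = &l->head;
	l->head.prev->next = b;
	l->head.prev = b;
	l->len++;
}

/* Detach and return the first block, or NULL if the list is empty. */
static struct block *list_pop_head(struct list *l)
{
	struct block *b = l->head.next;

	if (b == &l->head)
		return NULL;
	b->prev->next = b->next;
	b->next->prev = b->prev;
	b->prev = b->next = NULL;
	l->len--;
	return b;
}

/*
 * One reporting pass over at most `batch` blocks.
 * Step 1 is modelled by passing `pressure`, which stands in for the
 * (new) callback; it is polled here, which is an assumption.
 * Returns the number of blocks reported.
 */
static size_t balloon_report_pass(struct list *free_list, struct list *balloon,
				  size_t batch,
				  void (*report)(int id, void *opaque),
				  bool (*pressure)(void *opaque),
				  void *opaque)
{
	size_t reported = 0;
	struct block *b;

	while (reported < batch) {
		if (pressure(opaque))		/* step 5: interrupt reporting */
			break;
		b = list_pop_head(free_list);	/* step 2: off the free list */
		if (!b)
			break;
		list_add_tail(balloon, b);	/* step 2: onto the balloon list */
		report(b->id, opaque);		/* step 3: report to host */
		reported++;
	}
	/* Steps 4 and 5: everything the balloon holds goes back at the tail. */
	while ((b = list_pop_head(balloon)) != NULL)
		list_add_tail(free_list, b);
	return reported;
}

/* Demo helpers (hypothetical): pressure fires once `budget` reports are used. */
struct demo_state {
	int last_reported;
	size_t budget;
};

static void demo_report(int id, void *opaque)
{
	struct demo_state *s = opaque;

	s->last_reported = id;
	if (s->budget)
		s->budget--;
}

static bool demo_pressure(void *opaque)
{
	const struct demo_state *s = opaque;

	return s->budget == 0;
}
```

With blocks A..H numbered 0..7, a pass with batch 2 and no pressure leaves the list as C..H followed by A and B, which matches case 1) in the quoted discussion. Reported blocks end up at the tail, so a later pass reaches them last.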