Received: by 10.223.176.46 with SMTP id f43csp2144664wra; Thu, 25 Jan 2018 05:42:12 -0800 (PST) X-Google-Smtp-Source: AH8x224vffppBNJXBcvbtJBcAI51hE6RhB/YvBzmC3cv5jaH6V+cv/tOwV/4Guus4lkax9gPMbUW X-Received: by 2002:a17:902:8a97:: with SMTP id p23-v6mr11448058plo.74.1516887732343; Thu, 25 Jan 2018 05:42:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1516887732; cv=none; d=google.com; s=arc-20160816; b=h1o2uUo+IItszrmSgl4BYR+bglQMCqobmHxHgTk6EX6MJya7qPVfuBEgtFLxUza8yf DVpnqSo8nLPRDHRzFP10TRGgZiSQg5MibD1BujYjfrOr6V2XDE0Y/AKsmmx85BghN9OB XagAkR0Kl9sgJehHI7RZ/kaMx2V/hMCvsA1EQZmcD6SmLpuBs+8ijK6raXBhGUovl8Xg tz4fCF5z2zDf+mU1fagbqYGMQw6C9pBjkILc6ymBtiIoWS/pM2iXCaGCHSOyiBLF8jKc 5w6q5TbSPrrrcY478DlaPIwhbFDVPfSVa0lwdQymUH7RsR63gVury6CFTc089wv8cp3Z qS7Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date :arc-authentication-results; bh=a2d6ijWzCSA991csKFq1JS3qp3Q8dphuLFgSorxsrow=; b=Aau2iYUg8Xf2qcGbN9u/f8d8pBovcBiF54bshZ+9QenqkE/TGo3ULdWX/JAkZYci38 QPLfIWyyGiEvxKrKkB5JL/jOvcndg8fk3hPUUx3Io0VAw5ITEZVBA+RuUcRHozXtdKD+ gQQ1sY4xvm6mQVepFyrQ5mh2JMqezC3ONYISJZiEJT01AYOVdg5Fw4KKjXNLieUj9vtx Eipk5VOc1RlsOvVpQwoaQLnG91BllbZl4Bzn+XKN9sygZd9mH2cuyBBMDPC6ggBjlc6P RdEfcD4xDjzxbVJV/W6CQutlU4L35NNDoW/xuDJ5QZp/3aqtZR75DM2HkIqP5p5cEwZM 0fGA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t2si4615660pfg.94.2018.01.25.05.41.57; Thu, 25 Jan 2018 05:42:12 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751270AbeAYNlZ (ORCPT + 99 others); Thu, 25 Jan 2018 08:41:25 -0500 Received: from mx1.redhat.com ([209.132.183.28]:57000 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751111AbeAYNlX (ORCPT ); Thu, 25 Jan 2018 08:41:23 -0500 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.phx2.redhat.com [10.5.11.15]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 3DD57D9631; Thu, 25 Jan 2018 13:41:23 +0000 (UTC) Received: from redhat.com (ovpn-122-11.rdu2.redhat.com [10.10.122.11]) by smtp.corp.redhat.com (Postfix) with SMTP id 3C8DF2C326; Thu, 25 Jan 2018 13:41:16 +0000 (UTC) Date: Thu, 25 Jan 2018 15:41:15 +0200 From: "Michael S. Tsirkin" To: Wei Wang Cc: virtio-dev@lists.oasis-open.org, linux-kernel@vger.kernel.org, virtualization@lists.linux-foundation.org, kvm@vger.kernel.org, linux-mm@kvack.org, mhocko@kernel.org, akpm@linux-foundation.org, pbonzini@redhat.com, liliang.opensource@gmail.com, yang.zhang.wz@gmail.com, quan.xu0@gmail.com, nilal@redhat.com, riel@redhat.com Subject: Re: [PATCH v24 1/2] mm: support reporting free page blocks Message-ID: <20180125152933-mutt-send-email-mst@kernel.org> References: <1516790562-37889-1-git-send-email-wei.w.wang@intel.com> <1516790562-37889-2-git-send-email-wei.w.wang@intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1516790562-37889-2-git-send-email-wei.w.wang@intel.com> X-Scanned-By: MIMEDefang 2.79 on 10.5.11.15 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (mx1.redhat.com [10.5.110.38]); Thu, 25 Jan 2018 13:41:23 +0000 (UTC) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Jan 24, 2018 at 06:42:41PM +0800, Wei Wang wrote: > This patch adds support to walk through the free page blocks in the > system and report them via a callback function. Some page blocks may > leave the free list after zone->lock is released, so it is the caller's > responsibility to either detect or prevent the use of such pages. > > One use example of this patch is to accelerate live migration by skipping > the transfer of free pages reported from the guest. A popular method used > by the hypervisor to track which part of memory is written during live > migration is to write-protect all the guest memory. So, those pages that > are reported as free pages but are written after the report function > returns will be captured by the hypervisor, and they will be added to the > next round of memory transfer. > > Signed-off-by: Wei Wang > Signed-off-by: Liang Li > Cc: Michal Hocko > Cc: Michael S. Tsirkin > Acked-by: Michal Hocko > --- > include/linux/mm.h | 6 ++++ > mm/page_alloc.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 97 insertions(+) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ea818ff..b3077dd 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -1938,6 +1938,12 @@ extern void free_area_init_node(int nid, unsigned long * zones_size, > unsigned long zone_start_pfn, unsigned long *zholes_size); > extern void free_initmem(void); > > +extern void walk_free_mem_block(void *opaque, > + int min_order, > + bool (*report_pfn_range)(void *opaque, > + unsigned long pfn, > + unsigned long num)); > + > /* > * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK) > * into the buddy system. The freed pages will be poisoned with pattern > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index 76c9688..705de22 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -4899,6 +4899,97 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask) > show_swap_cache_info(); > } > > +/* > + * Walk through a free page list and report the found pfn range via the > + * callback. > + * > + * Return false if the callback requests to stop reporting. Otherwise, > + * return true. > + */ > +static bool walk_free_page_list(void *opaque, > + struct zone *zone, > + int order, > + enum migratetype mt, > + bool (*report_pfn_range)(void *, > + unsigned long, > + unsigned long)) > +{ > + struct page *page; > + struct list_head *list; > + unsigned long pfn, flags; > + bool ret; > + > + spin_lock_irqsave(&zone->lock, flags); > + list = &zone->free_area[order].free_list[mt]; > + list_for_each_entry(page, list, lru) { > + pfn = page_to_pfn(page); > + ret = report_pfn_range(opaque, pfn, 1 << order); > + if (!ret) > + break; > + } > + spin_unlock_irqrestore(&zone->lock, flags); > + > + return ret; > +} There are two issues with this API. One is that it is not restarteable: if you return false, you start from the beginning. So no way to drop lock, do something slow and then proceed. Another is that you are using it to report free page hints. Presumably the point is to drop these pages - keeping them near head of the list and reusing the reported ones will just make everything slower invalidating the hint. How about rotating these pages towards the end of the list? Probably not on each call, callect reported pages and then move them to tail when we exit. Of course it's possible not all reporters want this. So maybe change the callback to return int: 0 - page reported, move page to end of free list > 0 - page skipped, proceed < 0 - stop processing > + > +/** > + * walk_free_mem_block - Walk through the free page blocks in the system > + * @opaque: the context passed from the caller > + * @min_order: the minimum order of free lists to check > + * @report_pfn_range: the callback to report the pfn range of the free pages > + * > + * If the callback returns false, stop iterating the list of free page blocks. > + * Otherwise, continue to report. > + * > + * Please note that there are no locking guarantees for the callback and > + * that the reported pfn range might be freed or disappear after the > + * callback returns so the caller has to be very careful how it is used. > + * > + * The callback itself must not sleep or perform any operations which would > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC) > + * or via any lock dependency. It is generally advisable to implement > + * the callback as simple as possible and defer any heavy lifting to a > + * different context. > + * > + * There is no guarantee that each free range will be reported only once > + * during one walk_free_mem_block invocation. > + * > + * pfn_to_page on the given range is strongly discouraged and if there is > + * an absolute need for that make sure to contact MM people to discuss > + * potential problems. > + * > + * The function itself might sleep so it cannot be called from atomic > + * contexts. > + * > + * In general low orders tend to be very volatile and so it makes more > + * sense to query larger ones first for various optimizations which like > + * ballooning etc... This will reduce the overhead as well. > + */ > +void walk_free_mem_block(void *opaque, > + int min_order, > + bool (*report_pfn_range)(void *opaque, > + unsigned long pfn, > + unsigned long num)) > +{ > + struct zone *zone; > + int order; > + enum migratetype mt; > + bool ret; > + > + for_each_populated_zone(zone) { > + for (order = MAX_ORDER - 1; order >= min_order; order--) { > + for (mt = 0; mt < MIGRATE_TYPES; mt++) { > + ret = walk_free_page_list(opaque, zone, > + order, mt, > + report_pfn_range); > + if (!ret) > + return; > + } > + } > + } > +} > +EXPORT_SYMBOL_GPL(walk_free_mem_block); > + I think callers need a way to 1. distinguish between completion and exit on error 2. restart from where we stopped So I would both accept and return the current zone and a special value to mean "complete" > static void zoneref_set_zone(struct zone *zone, struct zoneref *zoneref) > { > zoneref->zone = zone; > -- > 2.7.4