Date: Fri, 28 Feb 2020 09:49:54 +0000
From: Mel Gorman
To: "Huang, Ying"
Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Vlastimil Babka, Zi Yan, Michal Hocko,
    Peter Zijlstra, Dave Hansen, Minchan Kim, Johannes Weiner, Hugh Dickins,
    Alexander Duyck
Subject: Re: [RFC 0/3] mm: Discard lazily freed pages when migrating
Message-ID: <20200228094954.GB3772@suse.de>
References: <20200228033819.3857058-1-ying.huang@intel.com> <20200228034248.GE29971@bombadil.infradead.org> <87a7538977.fsf@yhuang-dev.intel.com> <871rqf850z.fsf@yhuang-dev.intel.com>
In-Reply-To: <871rqf850z.fsf@yhuang-dev.intel.com>
On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote:
> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest will
> > report currently free pages to the hypervisor, which will MADV_FREE the
> > reported memory. As long as there is no memory pressure, there is no
> > need to actually free the pages. Once the guest reuses such a page, it
> > could happen that the old page is still there and pulling in a fresh
> > (zeroed) page can be avoided.
> >
> > AFAIKs, after your change, we would get more pages discarded from our
> > guest, resulting in more fresh (zeroed) pages having to be pulled in
> > when a guest touches a reported free page again. But OTOH, page
> > migration is sped up (avoiding having to migrate these pages).
>
> Let's look at this problem from another perspective. To migrate the
> MADV_FREE pages of the QEMU process from node A to node B, we need to
> free the original pages in node A and (maybe) allocate the same number
> of pages in node B. So the question becomes:
>
> - we may need to allocate some pages in node B
> - these pages may or may not be accessed by the application
> - should we allocate all these pages in advance, or allocate them
>   lazily when they are accessed?
>
> We thought the common philosophy in the Linux kernel is to allocate
> lazily.
>

I also think there needs to be an example of a real application that
benefits from this new behaviour. Consider the possible sources of page
migration:

1. NUMA balancing -- the application has to read or write the data for
   this to trigger. A write cancels MADV_FREE, and the access is most
   likely a write unless the application has a bug.

2. sys_movepages -- the application has explicitly stated that the data
   is in use on a particular node, yet any MADV_FREE page gets discarded.

3. Compaction -- there may be no memory pressure at all, but the
   MADV_FREE memory is discarded prematurely.

In the first case, the data is explicitly in use, most likely due to a
write, in which case it is inappropriate to discard it; discarding and
then reallocating a zeroed page is not what the application expects. In
the second case, the data is likely in use, or else why would the system
call be used? In the third case, the timing of when MADV_FREE pages
disappear becomes arbitrary, as it can happen without any actual memory
pressure. This may or may not be a problem, but it leads to unpredictable
latencies for applications that use MADV_FREE for a quick malloc/free
implementation. Previously, as long as there was no memory pressure,
reusing an MADV_FREE page incurred only a small penalty; now, with
compaction discarding such pages, it does not matter whether the system
avoids memory pressure, because the application may still incur a fault
to allocate and zero a new page.

There is a hypothetical fourth case which I only mention because of your
email address. If persistent memory is ever used for tiered memory, then
MADV_FREE pages that migrate from dram to pmem get discarded instead of
migrated. When such a page is reused, it gets reallocated from dram
regardless of whether that region is hot or not. This may lead to an odd
scenario whereby applications occupy dram prematurely due to a single
reference of a MADV_FREE page.

It's all subtle enough that we really should have an example application
in mind that benefits, so we can weigh the benefits against the potential
risks.

-- 
Mel Gorman
SUSE Labs
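
The compaction concern above hinges on applications using MADV_FREE as a
cheap free/reuse path. The following is a minimal userspace sketch of that
pattern, not anything taken from the thread; it assumes Linux 4.5+ (where
MADV_FREE is available) and the 2MB buffer size is purely illustrative.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)		/* illustrative buffer size */

int main(void)
{
	char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0xaa, LEN);		/* populate: pages are now dirty */

	/* "free": the kernel may reclaim these pages lazily */
	if (madvise(buf, LEN, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/*
	 * "reuse": writing cancels MADV_FREE. If the pages were neither
	 * reclaimed nor discarded, the old frames are reused cheaply;
	 * otherwise each touch faults in a freshly zeroed page, which is
	 * the unpredictable latency discussed above.
	 */
	buf[0] = 1;
	printf("first byte after reuse: %d\n", buf[0]);

	munmap(buf, LEN);
	return 0;
}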
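
For case 2 (sys_movepages), the point is that the application has gone out
of its way to place the data on a specific node. A hedged sketch of such a
call via libnuma's move_pages(2) wrapper is below; the target node number
is an assumption for illustration only, and it needs -lnuma to build.

#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	buf[0] = 1;			/* fault the page in */

	void *pages[1]  = { buf };
	int   nodes[1]  = { 1 };	/* assumed target node, illustrative */
	int   status[1];

	/* pid 0 means the calling process; MPOL_MF_MOVE moves our own pages */
	if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0)
		perror("move_pages");
	else
		printf("page status/node: %d\n", status[0]);

	/*
	 * Under the proposed change, if this range had been MADV_FREE'd,
	 * the page could be discarded rather than migrated, despite the
	 * explicit placement request.
	 */
	munmap(buf, page_size);
	return 0;
}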