Date: Mon, 2 Mar 2020 15:16:07 +0000
From: Mel Gorman
To: "Huang, Ying"
Cc: David Hildenbrand, Matthew Wilcox, Andrew Morton, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org, Vlastimil Babka, Zi Yan, Michal Hocko,
    Peter Zijlstra, Dave Hansen, Minchan Kim, Johannes Weiner,
    Hugh Dickins, Alexander Duyck
Subject: Re: [RFC 0/3] mm: Discard lazily freed pages when migrating
Message-ID: <20200302151607.GC3772@suse.de>
References: <20200228033819.3857058-1-ying.huang@intel.com>
    <20200228034248.GE29971@bombadil.infradead.org>
    <87a7538977.fsf@yhuang-dev.intel.com>
    <871rqf850z.fsf@yhuang-dev.intel.com>
    <20200228094954.GB3772@suse.de>
    <87h7z76lwf.fsf@yhuang-dev.intel.com>
In-Reply-To: <87h7z76lwf.fsf@yhuang-dev.intel.com>
On Mon, Mar 02, 2020 at 07:23:12PM +0800, Huang, Ying wrote:
> Mel Gorman writes:
>
> > On Fri, Feb 28, 2020 at 04:55:40PM +0800, Huang, Ying wrote:
> >> > E.g., free page reporting in QEMU wants to use MADV_FREE. The guest
> >> > will report currently free pages to the hypervisor, which will
> >> > MADV_FREE the reported memory. As long as there is no memory
> >> > pressure, there is no need to actually free the pages. Once the
> >> > guest reuses such a page, it could happen that the old page is
> >> > still there and pulling in a fresh (zeroed) page can be avoided.
> >> >
> >> > AFAIK, after your change, we would get more pages discarded from
> >> > our guest, resulting in more fresh (zeroed) pages having to be
> >> > pulled in when a guest touches a reported free page again. But
> >> > OTOH, page migration is sped up (avoiding migrating these pages).
> >>
> >> Let's look at this problem from another perspective. To migrate the
> >> MADV_FREE pages of the QEMU process from node A to node B, we need
> >> to free the original pages in node A, and (maybe) allocate the same
> >> number of pages in node B. So the question becomes
> >>
> >> - we may need to allocate some pages in node B
> >> - these pages may or may not be accessed by the application
> >> - should we allocate all these pages in advance, or allocate them
> >>   lazily when they are accessed?
> >>
> >> We thought the common philosophy in the Linux kernel is to allocate
> >> lazily.
> >>
> >
> > I also think there needs to be an example of a real application that
> > benefits from this new behaviour. Consider the possible sources of
> > page migration:
> >
> > 1. NUMA balancing -- The application has to read/write the data for
> >    this to trigger. In the case of a write, MADV_FREE is cancelled,
> >    and it is most likely going to be a write unless it's an
> >    application bug.
> > 2. sys_movepages -- the application has explicitly stated the data
> >    is in use on a particular node, yet any MADV_FREE page gets
> >    discarded.
> > 3. Compaction -- there may be no memory pressure at all, but the
> >    MADV_FREE memory is discarded prematurely.
> >
> > In the first case, the data is explicitly in use, most likely due to
> > a write, in which case it's inappropriate to discard. Discarding and
> > reallocating a zero'd page is not expected.
>
> If my understanding is correct, NUMA balancing will not try to migrate
> clean MADV_FREE pages most of the time. Because the lazily freed pages
> may be zeroed at any time, it makes almost no sense to read them, so
> the first access after being freed lazily should be a write. That
> makes the page a dirty MADV_FREE page, so it will not be discarded.
> And the first write access on a new node will usually not trigger
> migration because of the two-stage filter in
> should_numa_migrate_memory(). So the current behaviour in case 1 will
> not change most of the time after the patchset.

Yes, so in this case the series neither helps nor hurts NUMA balancing.

> > In the second case, the data is likely in use or else why would the
> > system call be used?
>
> If the pages were in use, they shouldn't be clean MADV_FREE pages. So
> there is no behaviour change after this patchset.

You cannot guarantee that. The application could be caching them
optimistically as long as they stay resident until memory pressure
forces them out. Consider something like an in-memory object database.
For very old objects, it might decide to mark them MADV_FREE, using a
header at the start of each page to detect whether the page still has
valid data. The intent would be that under memory pressure, the hot
data is preserved as long as possible. I'm not aware of such an
application, it simply is a valid use case.
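Purely to illustrate the pattern I mean (a hypothetical, untested
sketch; the structure, names and magic value are invented, and a real
cache would track object sizes and alignment properly):

/*
 * Hypothetical sketch of an application caching cold objects with
 * MADV_FREE (Linux >= 4.5). Layout and names are invented for
 * illustration only. obj must be page-aligned (e.g. from mmap())
 * and len a multiple of the page size.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define OBJ_MAGIC 0xdecafbadULL

struct cached_obj {
	uint64_t magic;		/* validity marker at the start of the page */
	char payload[];		/* cached object data */
};

/* Object went cold: let the kernel reclaim the pages lazily if it
 * needs to. The contents stay intact until there is memory pressure
 * (or, with the proposed change, migration/compaction).
 */
static void obj_make_cold(struct cached_obj *obj, size_t len)
{
	madvise(obj, len, MADV_FREE);
}

/* Object is needed again: reading the marker tells us whether the
 * kernel discarded the page (a read faults in a zero-filled page) or
 * whether the old data is still there and can be reused as-is. The
 * first write after MADV_FREE dirties the page and cancels the lazy
 * free, so the object is "hot" again from then on.
 */
static int obj_reuse(struct cached_obj *obj)
{
	if (obj->magic != OBJ_MAGIC)
		return -1;	/* discarded: caller must rebuild the object */
	obj->magic = OBJ_MAGIC;	/* write makes the page dirty again */
	return 0;
}

The point is that as long as the pages stay resident, reuse is almost
free; every page discarded without memory pressure turns that reuse
into a minor fault, a zeroed page and a rebuild of the object.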
Similarly, a malloc implementation may be using MADV_FREE to mark freed
buffers so they can be quickly reused. Now if they use sys_movepages,
they take the alloc+zero hit and the strategy is less useful.

> > In the third case the timing of when MADV_FREE pages disappear is
> > arbitrary as it can happen without any actual memory pressure. This
> > may or may not be problematic, but it leads to unpredictable
> > latencies for applications that use MADV_FREE for a quick
> > malloc/free implementation. Before, as long as there was no
> > pressure, the reuse of a MADV_FREE page incurred just a small
> > penalty, but now with compaction it does not matter if the system
> > avoids memory pressure because they may still incur a fault to
> > allocate and zero a new page.
>
> Yes. This is a real problem. Previously we thought that the migration
> is kind of
>
> - unmap and free the old pages
> - map the new pages
>
> If we can allocate new pages lazily in mmap(), why can't we allocate
> new pages lazily when migrating pages too, if possible (for clean
> MADV_FREE pages), because its second stage is kind of an mmap() too?
> But you remind us that there are some differences:
>
> - mmap() is called by the application directly, so its effect is
>   predictable, while some migration, like that in compaction, isn't
>   called by the application, so it may have unpredictable behaviour.
> - With mmap(), the application can choose to allocate new pages lazily
>   or eagerly, but we lack the same mechanism in this patchset.
>
> So, maybe we can make the mechanism more flexible. That is, let the
> administrator or the application choose the right behaviour for them,
> via some system-level configuration knob or API flags. For example, if
> the memory allocation latency isn't critical for the workloads, the
> administrator can choose to discard clean MADV_FREE pages to make
> compaction quicker and easier?

That will be the type of knob that is almost impossible to tune for.
More on this later.

> > This may lead to an odd scenario whereby applications occupy DRAM
> > prematurely due to a single reference of a MADV_FREE page.
>
> The better way to deal with this is to enhance mem_cgroup to limit
> DRAM usage in a memory tiering system?

I did not respond to this properly because it's a separate discussion
on implementation details of something that does not exist. I only
mentioned it to highlight a potential hazard that could raise its head
later.

> > It's all subtle enough that we really should have an example
> > application in mind that benefits so we can weigh the benefits
> > against the potential risks.
>
> If some applications cannot tolerate the latency incurred by the
> memory allocation and zeroing, then we cannot always discard instead
> of migrate. While in some situations, less memory pressure can help.
> So isn't it better to let the administrator and the application choose
> the right behaviour for the specific situation?

Is there an application you have in mind that benefits from discarding
MADV_FREE pages instead of migrating them?

Allowing the administrator or application to tune this would be very
problematic. An application would require an update to the system call
to take advantage of it and would then have to detect whether the
running kernel supports it. An administrator would have to detect that
MADV_FREE pages are being prematurely discarded, leading to a slowdown,
and that is hard to detect. It could be inferred from monitoring
compaction stats and checking whether compaction activity is correlated
with higher minor faults in the target application. Proving the
correlation would require using the perf software event
PERF_COUNT_SW_PAGE_FAULTS_MIN and matching the faulting addresses to
MADV_FREE regions that were freed prematurely.
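Just counting the minor faults for a target pid is simple enough
(rough, untested sketch; correlating individual fault addresses with
the MADV_FREE ranges would additionally require sampling with
PERF_SAMPLE_ADDR, which is not shown here):

/*
 * Rough, untested sketch: count minor faults of a target pid via
 * PERF_COUNT_SW_PAGE_FAULTS_MIN. Roughly equivalent to
 * "perf stat -e minor-faults -p <pid>".
 */
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

static long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid,
				int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
	struct perf_event_attr attr;
	uint64_t count;
	pid_t pid = argc > 1 ? atoi(argv[1]) : 0;	/* 0 == self */
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_SOFTWARE;
	attr.size = sizeof(attr);
	attr.config = PERF_COUNT_SW_PAGE_FAULTS_MIN;
	attr.disabled = 1;

	fd = sys_perf_event_open(&attr, pid, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	ioctl(fd, PERF_EVENT_IOC_RESET, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	sleep(10);				/* measurement window */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("minor faults: %llu\n", (unsigned long long)count);
	close(fd);
	return 0;
}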
That is not an obvious debugging step to take when an application
detects latency spikes. Now, you could add a counter specifically for
MADV_FREE pages freed for reasons other than memory pressure and hope
the administrator knows about the counter and what it means. That type
of knowledge could take a long time to spread, so it's really very
important that there is evidence of an application that suffers due to
the current MADV_FREE and migration behaviour.

-- 
Mel Gorman
SUSE Labs