Date: Thu, 24 May 2018 14:03:41 +0200
From: Michal Hocko
To: David Hildenbrand
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexander Potapenko,
    Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
    Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams, Dave Young,
    Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
    Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara, Jérôme Glisse,
    Joonsoo Kim, Juergen Gross, Kate Stewart,
    "Kirill A. Shutemov", Matthew Wilcox, Mel Gorman, Michael Ellerman,
    Miles Chen, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
    Philippe Ombredanne, Rashmica Gupta, Reza Arbab, Souptick Joarder,
    Tetsuo Handa, Thomas Gleixner, Vlastimil Babka
Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
Message-ID: <20180524120341.GF20441@dhcp22.suse.cz>
References: <20180523151151.6730-1-david@redhat.com>
 <20180524075327.GU20441@dhcp22.suse.cz>
 <14d79dad-ad47-f090-2ec0-c5daf87ac529@redhat.com>
 <20180524093121.GZ20441@dhcp22.suse.cz>
User-Agent: Mutt/1.9.5 (2018-04-13)

On Thu 24-05-18 12:45:50, David Hildenbrand wrote:
> On 24.05.2018 11:31, Michal Hocko wrote:
> > On Thu 24-05-18 10:31:30, David Hildenbrand wrote:
[...]
> >> Allowing to unplug such small chunks is actually the interesting thing.
> >
> > Not really. The vmemmap will stay behind and so you are still wasting
> > memory. Well, unless you want to have small ptes for the hotplug memory,
> > which is just too suboptimal.
> >
> >> Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much impossible.
> >> With ZONE_MOVABLE: maybe possible.
> >
> > It should always be possible. We have some players who pin pages for
> > arbitrary amounts of time, even from zone movable, but we should focus
> > on fixing them or come up with a way to handle that. Zone movable is
> > about movable memory, pretty much by definition.
>
> You exactly describe what has been the case for way too long. But this
> is only the tip of the iceberg. Simply adding all memory to
> ZONE_MOVABLE is not going to work (we create an imbalance - e.g. page
> tables have to go into ZONE_NORMAL; this imbalance will have to be
> managed later on). That's why I am rather thinking about said assignment
> to different zones in the future. For now, using ZONE_NORMAL is the
> easiest approach.

Well, I think it would be fair to say that memory hotplug is not really
suitable for balancing memory between guests, exactly because of the
offline part. If you want reliable offlining, you just screw yourself
with the old highmem-land problems. This is the primary reason why I
really detest any ballooning-like solutions based on memory hotplug. I
can see the case for adding memory to increase the initial guest sizes,
but that can be achieved with what we have currently, without any
additional complexity in the generic code. So you really have to have
some seriously convincing arguments in hand. So far I haven't heard any,
to be honest. This is just yet another thing that can be achieved with
what other ballooning solutions are doing. I might be wrong, because
this is not really my area and I might underestimate some nuances, but
as already said, you have to be really convincing...

> >> Try to find one 4MB chunk of a "128MB" DIMM that can be unplugged:
> >> with ZONE_NORMAL maybe possible, with ZONE_MOVABLE likely possible.
> >>
> >> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL;
> >> I plan to work on that in the future.
> >>
> >> Think about it that way: a compromise between section-based memory
> >> hotplug and page-based ballooning.
> >>
> >>
> >> 2. memory hotplug and 128MB size
> >>
> >> Interesting point. One thing to note is that "the whole memory hotplug
> >> is centered around memory sections and those are 128MB in size" is
> >> simply the current state of the Linux implementation, nothing more.
> >
> > Yes, and we do care about that, because the whole memory hotplug is a
> > bunch of hacks duct-taped together to address very specific use cases.
> > It took me one year to put it into a state where my eyes do not bleed
> > anytime I have to look there. There are still way too many problems to
> > address. I certainly do not want to add more complication. Quite the
> > contrary, the whole code cries for cleanups and sanity.
>
> And I highly appreciate your effort. But look at the details: I am even
> cleaning up online_pages() and offline_pages(). And this is not the end
> of my contributions :) This is one step in that direction. It showcases
> what is easily possible right now, with existing interfaces.

If you can bring some cleanups then great. I would suggest pulling those
out and posting them separately.
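As a reference for the numbers above: the 128MB granularity is simply
sparsemem's section size on x86-64 (SECTION_SIZE_BITS is 27 there), and
hotplug operates on memory blocks of one or more sections. A minimal
userspace sketch of that arithmetic, with the kernel constants copied in
purely for illustration:

/*
 * Sketch only: where the 128MB hotplug granularity comes from.
 * The constants mirror the x86-64 kernel values; the program itself
 * is plain userspace arithmetic, not kernel code.
 */
#include <stdio.h>

#define SECTION_SIZE_BITS 27	/* x86-64: 2^27 bytes per memory section */
#define PAGE_SHIFT        12	/* 4 KiB base pages */
#define PAGES_PER_SECTION (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

int main(void)
{
	printf("section size:      %lu MiB\n",
	       (1UL << SECTION_SIZE_BITS) >> 20);		/* 128 */
	printf("pages per section: %lu\n", PAGES_PER_SECTION);	/* 32768 */
	return 0;
}

Memory blocks under /sys/devices/system/memory are built from one or
more such sections, which is why add/online/offline currently happens in
(multiples of) 128MB on x86-64.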
> >> I want to avoid what balloon drivers do: rip out random pages,
> >> fragmenting guest memory until we eventually trigger the OOM killer.
> >> So instead, using 4MB chunks produces no fragmentation. And if I
> >> can't find such a chunk anymore: bad luck. At least I won't be
> >> risking the stability of my guest.
> >>
> >> Does that answer your question?
> >
> > So you basically pull those pages out of the page allocator and mark
> > them offline (reserved, whatever)? Why do you need any integration
> > with the hotplug code base then? You should be perfectly fine to work
> > on top and only teach the hotplug code to recognize that your pages
> > are free when somebody decides to offline the whole section. I can
> > think of a callback that would allow that.
> >
> > But then you are, well, a balloon driver, aren't you?
>
> Pointing you at: [1]
>
> I *cannot* use the page allocator. Using it would be a potential addon
> (for special cases!) in the future. I really scan for removable chunks
> in the memory region a certain virtio-mem device owns. Why can't I do
> it:
>
> 1. I have to allocate memory in a certain physical address range (the
> range that belongs to a virtio-mem device). Well, we could write an
> allocator.

Not really. Why can't you simply mimic what the hotplug already does?
Scan a pfn range and isolate/migrate it? We can abstract such
functionality to make it better usable.

> 2. I might have to deal with chunks that are bigger than MAX_ORDER - 1.
> Say my virtio-mem device has a block size of 8MB and I can only
> allocate 4MB. I'm out of luck then.

Confused again. So you are managing 4MB chunks in 8MB units?
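To sketch the isolate/migrate route suggested above (a hedged
illustration, not the patch set's code): alloc_contig_range() already
performs exactly that pass over a pfn range, and it is not bound by the
buddy allocator's 2^(MAX_ORDER - 1) page limit (4MB with 4KB pages on
x86-64), which is the limit behind point 2. The helper name and the 4MB
chunk size below are made up:

/*
 * Sketch only: reuse the existing isolate+migrate machinery instead of
 * alloc_pages(). try_unplug_chunk() and CHUNK_PAGES are hypothetical;
 * alloc_contig_range()/free_contig_range() are the stock interfaces.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/mmzone.h>
#include <linux/sizes.h>

#define CHUNK_PAGES	(SZ_4M >> PAGE_SHIFT)	/* hypothetical 4MB chunk */

static int try_unplug_chunk(unsigned long start_pfn)
{
	int ret;

	/* Isolate the pfn range and migrate everything movable out. */
	ret = alloc_contig_range(start_pfn, start_pfn + CHUNK_PAGES,
				 MIGRATE_MOVABLE, GFP_KERNEL);
	if (ret)
		return ret;	/* e.g. -EBUSY: unmovable page in range */

	/*
	 * The range now belongs to the driver and can be handed back to
	 * the hypervisor. To abort and return it to the buddy instead:
	 * free_contig_range(start_pfn, CHUNK_PAGES);
	 */
	return 0;
}

On failure the driver can simply try the next chunk in its range, which
matches the "if I can't find such a chunk anymore: bad luck" policy
described above.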
> So, no, virtio-mem is not a balloon driver :)

[...]

> >> 1. "hotplug should simply not depend on kdump at all"
> >>
> >> In theory, yes. In the current state we already have to trigger kdump
> >> to reload whenever we add/remove a memory block.
> >
> > More details please.
>
> I just had another look at the whole complexity of
> makedumpfile/kdump/uevents and I'll follow up with a detailed
> description.
>
> kdump.service is definitely reloaded when setting a memory block
> online/offline (not when adding/removing, as I wrongly claimed before).
>
> I'll follow up with a more detailed description and all the pointers.

Please make sure to describe what the architecture is then. I have no
idea what kdump.service is supposed to do, for example.

> >> 2. kdump part
> >>
> >> Whenever we offline a page and tell the hypervisor about it
> >> ("unplug"), we should not assume that we can read that page again.
> >> Now, if dumping tools assume they can read all memory that is
> >> offline, we are in trouble.
> >
> > Sure. Just make those pages reserved. Nobody should touch those IIRC.
>
> I think I answered that question already (see [1]) in another thread:
> we have certain buffers that are marked reserved. Reserved does not
> imply "don't dump"; all dump tools I am aware of will dump reserved
> pages. I cannot use reserved to mark sections offline once all pages
> are offline.

Then fix those dump tools. They shouldn't have any business touching
reserved memory, should they? That memory can be in an arbitrary state.

> And I don't think the current approach of using a mapcount value is the
> problematic part. This is straightforward now.

The thing is that struct page space is extremely scarce. Anytime you
take a bit out of there, somebody else will have to scratch his head
much harder in the future. So if we can go with the existing
infrastructure then that should be preferable. And we do have "this page
is mine, do not even think about touching it" - PageReserved.
-- 
Michal Hocko
SUSE Labs
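As a postscript on the PageReserved point: a minimal sketch, assuming a
hypothetical helper in an unplug path, of marking an offlined chunk
reserved. SetPageReserved() is the existing page-flag accessor; whether
dump tools such as makedumpfile actually skip such pages is exactly the
open question in the exchange above, and is not settled by this sketch:

/*
 * Sketch only: mark every page of an unplugged chunk PageReserved so
 * that nothing else touches it. mark_chunk_unplugged() and the chunk
 * geometry are hypothetical; this illustrates the suggestion in the
 * thread, not David's mapcount-based approach from the patch set.
 */
#include <linux/mm.h>
#include <linux/page-flags.h>

static void mark_chunk_unplugged(unsigned long start_pfn,
				 unsigned long nr_pages)
{
	unsigned long pfn;

	for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
		struct page *page = pfn_to_page(pfn);

		/*
		 * "This page is mine, do not even think about touching
		 * it": once the hypervisor has taken the memory back,
		 * the page content may no longer be readable.
		 */
		SetPageReserved(page);
	}
}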