Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat GmbH
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexander Potapenko,
 Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
 Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams, Dave Young,
 Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
 Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara, Jérôme Glisse,
 Joonsoo Kim, Juergen Gross, Kate Stewart, "Kirill A. Shutemov",
 Matthew Wilcox, Mel Gorman, Michael Ellerman, Miles Chen,
 Oscar Salvador, Paul Mackerras, Pavel Tatashin, Philippe Ombredanne,
 Rashmica Gupta, Reza Arbab, Souptick Joarder, Tetsuo Handa,
 Thomas Gleixner, Vlastimil Babka
References: <20180523151151.6730-1-david@redhat.com>
 <20180524075327.GU20441@dhcp22.suse.cz>
 <14d79dad-ad47-f090-2ec0-c5daf87ac529@redhat.com>
 <20180524093121.GZ20441@dhcp22.suse.cz>
In-Reply-To: <20180524093121.GZ20441@dhcp22.suse.cz>
Date: Thu, 24 May 2018 12:45:50 +0200

On 24.05.2018 11:31, Michal Hocko wrote:
> On Thu 24-05-18 10:31:30, David Hildenbrand wrote:
>> On 24.05.2018 09:53, Michal Hocko wrote:
>>> I've had some questions before and I am not sure they are fully
>>> covered. At least not in the cover letter (I didn't get much further
>>> yet), which should give us a high-level overview of the feature.
>>
>> Sure, I can give you more details. Adding all details to the cover
>> letter will result in a cover letter that nobody will read :)
>
> Well, these are not details. Those are mostly high-level design points
> and integration with the existing hotplug scheme. I am definitely not
> suggesting describing the code etc.

Will definitely do as we move on.

>>> On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
>>>> This is now the !RFC version. I did some additional tests and
>>>> inspected all memory notifiers. At least page_ext and kasan need
>>>> fixes.
>>>>
>>>> ==========
>>>>
>>>> I am right now working on a paravirtualized memory device
>>>> ("virtio-mem"). These devices control a memory region and the
>>>> amount of memory available via it. Memory will not be
>>>> indicated/added/onlined via ACPI and friends; the device driver is
>>>> responsible for that.
>>>>
>>>> When the device driver starts up, it will add and online the
>>>> requested amount of memory from its assigned physical memory
>>>> region. On request, it can either add (online) more memory or try
>>>> to remove (offline) memory. As it will be a virtio module, we also
>>>> want to be able to have it as a loadable kernel module.
>>>
>>> How do you handle the offline case? Do you online all the memory to
>>> zone_movable?
>>
>> Right now everything is added to ZONE_NORMAL. I have some plans to
>> change that, but that will require more work (auto-assigning to
>> ZONE_MOVABLE or ZONE_NORMAL depending on certain heuristics). For now
>> this showcases that offlining of memory actually works at that
>> granularity and that it can be used in some scenarios. To make unplug
>> more reliable, more work is needed.
>
> Spell that out then. Memory offline is basically unusable for
> zone_normal. So you are talking about adding memory only in practice
> and it would be fair to be explicit about that.

Yes, I will add that. For now, hotplug works reliably (except when
already very tight on memory). Unplug is done on a "best we can do"
basis: nothing critical happens if I can't free up a 4MB chunk. Bad
luck.
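To make the intended flow concrete, here is a rough sketch (not the
actual patch code; struct virtio_mem_dev and the function names are
made up for illustration, and error handling is trimmed) of adding a
memory block and onlining a 4MB chunk within it via the existing
interfaces:

#include <linux/memory_hotplug.h>
#include <linux/memory.h>
#include <linux/sizes.h>
#include <linux/pfn.h>

/* Hypothetical per-device state; not from the patches. */
struct virtio_mem_dev {
	int nid;		/* node the memory belongs to */
	u64 base, size;		/* assigned physical memory region */
};

/* Add one whole memory block (128MB on x86-64), still offline. */
static int virtio_mem_add_block(struct virtio_mem_dev *vm, u64 addr)
{
	return add_memory(vm->nid, addr, memory_block_size_bytes());
}

/* Online a single 4MB chunk inside an already-added block. */
static int virtio_mem_online_chunk(struct virtio_mem_dev *vm, u64 addr)
{
	return online_pages(PHYS_PFN(addr), SZ_4M >> PAGE_SHIFT,
			    MMOP_ONLINE_KEEP);
}

Note that online_pages()/offline_pages() already take pfn ranges; the
cleanups in the series are about callers and notifiers that silently
assume those ranges span whole sections.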
>>>> Such a device can be thought of as a "resizable DIMM" or a "huge
>>>> number of 4MB DIMMs" that can be automatically managed.
>>>
>>> Why do we need such a small granularity? The whole memory hotplug is
>>> centered around memory sections and those are 128MB in size. Smaller
>>> sizes simply do not fit into that concept. How do you deal with
>>> that?
>>
>> 1. Why do we need such a small granularity?
>>
>> Because we can :) No, honestly, on s390x sections are 256MB and, if I
>> am not wrong, on certain x86/arm machines they are even 1 or 2GB.
>> This simply gives more flexibility when plugging memory (thinking
>> about cloud environments).
>
> We can, but if that makes the memory hotplug (cluttered enough in the
> current state) more complicated then we simply won't. Or at least I
> will not ack anything that goes in that direction.

I agree, but at least the current set of patches doesn't increase the
complexity, at least from my POV. E.g., onlining of arbitrary sizes was
always implemented, and offlining was claimed to be supported to some
extent. I am cleaning that up here (e.g. in offline_pages() and in the
memory notifiers): I am using what the existing interfaces promised,
but fixing and cleaning them up.
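To illustrate the notifier point: once 4MB chunks are onlined/offlined,
a memory notifier may no longer assume section-sized, section-aligned
ranges. A minimal sketch (illustrative only; this is not one of the
actual page_ext/kasan fixes):

#include <linux/memory.h>
#include <linux/notifier.h>

static int example_mem_notifier(struct notifier_block *nb,
				unsigned long action, void *data)
{
	struct memory_notify *mn = data;

	switch (action) {
	case MEM_GOING_ONLINE:
		/* Prepare metadata for exactly mn->nr_pages pages. */
		break;
	case MEM_OFFLINE:
		/* Tear down [mn->start_pfn, mn->start_pfn + mn->nr_pages)
		 * only, not the full section containing it. */
		break;
	}
	return NOTIFY_OK;
}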
>> Allowing to unplug such small chunks is actually the interesting
>> thing.
>
> Not really. The vmemmap will stay behind, and so you are still wasting
> memory. Well, unless you want to have small ptes for the hotplug
> memory, which is just too suboptimal.
>
>> Try unplugging a 128MB DIMM. With ZONE_NORMAL: pretty much
>> impossible. With ZONE_MOVABLE: maybe possible.
>
> It should be always possible. We have some players who pin pages for
> arbitrary amounts of time, even from zone movable, but we should focus
> on fixing them or come up with a way to handle that. Zone movable is
> about movable memory, pretty much by definition.

You exactly describe what has been the case for way too long. But this
is only the tip of the iceberg. Simply adding all memory to
ZONE_MOVABLE is not going to work (we create an imbalance - e.g. page
tables have to go into ZONE_NORMAL, and this imbalance will have to be
managed later on). That's why I am rather thinking about said
assignment to different zones in the future. For now, using ZONE_NORMAL
is the easiest approach.

>
>> Try to find one 4MB chunk of a "128MB" DIMM that can be unplugged:
>> With ZONE_NORMAL: maybe possible. With ZONE_MOVABLE: likely possible.
>>
>> But let's not go into the discussion of ZONE_MOVABLE vs. ZONE_NORMAL;
>> I plan to work on that in the future.
>>
>> Think about it that way: a compromise between section-based memory
>> hotplug and page-based ballooning.
>>
>>
>> 2. Memory hotplug and the 128MB section size
>>
>> Interesting point. One thing to note is that "the whole memory
>> hotplug is centered around memory sections and those are 128MB in
>> size" is simply the current state of how it is implemented in Linux,
>> nothing more.
>
> Yes, and we do care about that, because the whole memory hotplug is a
> bunch of hacks duct-taped together to address very specific usecases.
> It took me one year to put it into a state where my eyes do not bleed
> every time I have to look there. There are still way too many problems
> to address. I certainly do not want to add more complication. Quite
> the contrary, the whole code cries for cleanups and sanity.

And I highly appreciate your effort. But look at the details: I am even
cleaning up online_pages() and offline_pages(). And this is not the end
of my contributions :) This is one step in that direction. It showcases
what is easily possible right now, with existing interfaces.

>> I want to avoid what balloon drivers do: rip out random pages,
>> fragmenting guest memory until we eventually trigger the OOM killer.
>> So instead, using 4MB chunks produces no fragmentation. And if I
>> can't find such a chunk anymore: bad luck. At least I won't be
>> risking the stability of my guest.
>>
>> Does that answer your question?
>
> So you basically pull out those pages from the page allocator and mark
> them offline (reserved, whatever)? Why do you need any integration
> with the hotplug code base then? You should be perfectly fine to work
> on top and only teach the hotplug code to recognize that your pages
> are free when somebody decides to offline the whole section. I can
> think of a callback that would allow that.
>
> But then you are, well, a balloon driver, aren't you?

Pointing you at: [1]

I *cannot* use the page allocator. Using it would be a potential addon
(for special cases!) in the future. I really scan for removable chunks
in the memory region that a certain virtio-mem device owns.

Why can't I use it?

1. I have to allocate memory in a certain physical address range (the
range that belongs to a virtio-mem device). Well, we could write an
allocator.

2. I might have to deal with chunks that are bigger than MAX_ORDER - 1.
Say my virtio-mem device has a block size of 8MB and I can only
allocate 4MB. I'm out of luck then.

So, no, virtio-mem is not a balloon driver :)
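For reference, the scanning approach (as opposed to allocating) could
look roughly like this; virtio_mem_chunk_is_online() and
virtio_mem_send_unplug_request() are hypothetical helpers, while
offline_pages() is the existing hotplug interface. On the MAX_ORDER
point: with MAX_ORDER = 11 on x86-64, the buddy allocator hands out at
most 2^10 pages = 4MB in one piece, so an 8MB block could never come
from it.

/* Try to find and unplug one removable 4MB chunk; sketch only. */
static int virtio_mem_unplug_one_chunk(struct virtio_mem_dev *vm)
{
	const unsigned long nr_pages = SZ_4M >> PAGE_SHIFT;
	unsigned long pfn;

	for (pfn = PHYS_PFN(vm->base);
	     pfn < PHYS_PFN(vm->base + vm->size);
	     pfn += nr_pages) {
		if (!virtio_mem_chunk_is_online(vm, pfn))
			continue;
		/* Chunks with unmovable pages simply fail; skip them. */
		if (offline_pages(pfn, nr_pages))
			continue;
		/* Tell the hypervisor the chunk is gone. */
		return virtio_mem_send_unplug_request(vm, pfn);
	}
	return -EBUSY;	/* bad luck, nothing removable right now */
}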
>>>
>>>> For kdump and onlining/offlining code, we have to mark pages as
>>>> offline before a new segment is visible to the system (e.g. as
>>>> these pages might not be backed by real memory in the hypervisor).
>>>
>>> Please expand on the kdump part. That is really confusing, because
>>> hotplug should simply not depend on kdump at all. Moreover, why
>>> don't you simply mark those pages reserved and pull them out from
>>> the page allocator?
>>
>> 1. "hotplug should simply not depend on kdump at all"
>>
>> In theory, yes. In the current state we already have to trigger kdump
>> to reload whenever we add/remove a memory block.
>
> More details please.

I just had another look at the whole complexity of
makedumpfile/kdump/uevents and I'll follow up with a detailed
description. kdump.service is definitely reloaded when setting a memory
block online/offline (not when adding/removing, as I wrongly claimed
before). I'll follow up with a more detailed description and all the
pointers.

>
>> 2. kdump part
>>
>> Whenever we offline a page and tell the hypervisor about it
>> ("unplug"), we should not assume that we can read that page again.
>> Now, if dump tools assume they can read all memory that is offline,
>> we are in trouble.
>
> Sure. Just make those pages reserved. Nobody should touch those IIRC.

I think I answered that question already (see [1]) in another thread:

We have certain buffers that are marked reserved. Reserved does not
imply "don't dump"; all dump tools I am aware of will dump reserved
pages. I cannot use reserved to mark sections offline once all pages
are offline.

And I don't think the current approach of using a mapcount value is the
problematic part. This is straightforward now.

>
>> It is the same thing as we already have with PG_hwpoison, just with a
>> different meaning - "don't touch this page, it is offline" compared
>> to "don't touch this page, hw is broken".
>>
>> Balloon drivers solve this problem by always allowing unplugged
>> memory to be read. In virtio-mem, this cannot and should not even be
>> guaranteed.
>>
>> And what we have to do to make this work is actually pretty simple:
>> just like PG_hwpoison, track per page whether it is online and
>> provide this information to kdump.
>
> If somebody doesn't check your new flag then you are screwed anyway.

Yes, but that's not a problem in a new environment; we don't modify
existing environments. The same used to be true with PG_hwpoison.

> Existing code should be quite used to PageReserved.

Please refer to [3]:makedumpfile.c and the checks performed to exclude
pages. No sign of reserved pages.
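To make that concrete: conceptually, dump-tool page filtering looks
like the following (a simplified sketch with made-up names and values,
not makedumpfile's actual code; the real checks live in makedumpfile.c
and operate on page data read from the vmcore):

#include <stdbool.h>

/* Made-up constants, for illustration only. */
#define PG_HWPOISON_FLAG	(1UL << 22)	/* assumed flag bit */
#define OFFLINE_MAPCOUNT_MAGIC	0xdeadbeefU	/* assumed marker value */

static bool page_should_be_excluded(unsigned long flags,
				    unsigned int mapcount)
{
	if (flags & PG_HWPOISON_FLAG)
		return true;	/* hw broken: never read the page */
	if (mapcount == OFFLINE_MAPCOUNT_MAGIC)
		return true;	/* offline: may be unbacked in hypervisor */
	/* Reserved pages fall through and ARE dumped. */
	return false;
}

Which is exactly why marking offline pages merely PageReserved would
not keep dump tools from touching them.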
>
>> 3. Marking pages reserved and pulling them out of the page allocator
>>
>> This is basically what e.g. the XEN balloon does. I don't see how
>> that helps related to kdump. Can you explain how that would solve any
>> of the problems I am trying to solve here? It solves neither the
>> unplug part nor the "tell dump tools not to read such memory" part.
>
> I still do not understand the unplug part, to be honest. And the rest
> should be pretty much solvable. Or what do I miss? Your 4MB chunks
> can be perfectly emulated on top of existing sections by pulling pages
> out of the allocator. How you mark those pages is an implementation
> detail. PageReserved sounds like the easiest way forward.

No allocator. Please refer to the virtio-mem prototype and the
explanation I gave above. That should give you a better idea of what I
do in the current prototype.

[1] https://lkml.org/lkml/2018/5/23/803
[2] https://www.spinics.net/lists/linux-mm/msg150029.html
[3] https://github.com/jmesmon/makedumpfile

-- 

Thanks,

David / dhildenb