From: David Hildenbrand <david@redhat.com>
Organization: Red Hat GmbH
To: Michal Hocko
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Alexander Potapenko, Andrew Morton, Andrey Ryabinin, Balbir Singh,
    Baoquan He, Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams,
    Dave Young, Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini,
    Huang Ying, Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara,
    Jérôme Glisse, Joonsoo Kim, Juergen Gross, Kate Stewart,
    "Kirill A. Shutemov", Matthew Wilcox, Mel Gorman, Michael Ellerman,
    Miles Chen, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
    Philippe Ombredanne, Rashmica Gupta, Reza Arbab, Souptick Joarder,
    Tetsuo Handa, Thomas Gleixner, Vlastimil Babka
Subject: Re: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
Date: Thu, 24 May 2018 10:31:30 +0200
Message-ID: <14d79dad-ad47-f090-2ec0-c5daf87ac529@redhat.com>
In-Reply-To: <20180524075327.GU20441@dhcp22.suse.cz>

On 24.05.2018 09:53, Michal Hocko wrote:
> I've had some questions before and I am not sure they are fully covered.
> At least not in the cover letter (I didn't get much further yet), which
> should give us a high-level overview of the feature.

Sure, I can give you more details. Adding all the details to the cover
letter would, however, result in a cover letter that nobody will read :)

> On Wed 23-05-18 17:11:41, David Hildenbrand wrote:
>> This is now the !RFC version. I did some additional tests and inspected
>> all memory notifiers. At least page_ext and kasan need fixes.
>>
>> ==========
>>
>> I am right now working on a paravirtualized memory device
>> ("virtio-mem"). Such a device controls a memory region and the amount
>> of memory available via it. Memory will not be indicated/added/onlined
>> via ACPI and friends; the device driver is responsible for that.
>>
>> When the device driver starts up, it will add and online the requested
>> amount of memory from its assigned physical memory region. On request,
>> it can either add (online) more memory or try to remove (offline)
>> memory. As it will be a virtio module, we also want to be able to load
>> it as a kernel module.
>
> How do you handle the offline case? Do you online all the memory to
> zone_movable?

Right now, everything is added to ZONE_NORMAL. I have some plans to
change that, but that will require more work (auto-assigning to
ZONE_MOVABLE or ZONE_NORMAL based on certain heuristics). For now, this
showcases that offlining memory actually works at that granularity and
that it can be used in some scenarios. To make unplug more reliable,
more work is needed.

>> Such a device can be thought of as a "resizable DIMM" or a "huge
>> number of 4MB DIMMs" that can be automatically managed.
>
> Why do we need such a small granularity? The whole memory hotplug is
> centered around memory sections and those are 128MB in size. Smaller
> sizes simply do not fit into that concept. How do you deal with that?

1. Why do we need such a small granularity?

Because we can :) No, honestly: sections are 256MB on s390x and, if I am
not mistaken, even 1 or 2GB on certain x86/arm machines. A small
granularity simply gives more flexibility when plugging memory (think of
cloud environments).

Being able to unplug such small chunks is actually the interesting part.
Try unplugging a 128MB DIMM: with ZONE_NORMAL, pretty much impossible;
with ZONE_MOVABLE, maybe possible. Now try to find just one 4MB chunk of
a 128MB DIMM that can be unplugged: with ZONE_NORMAL, maybe possible;
with ZONE_MOVABLE, likely possible. But let's not go into the discussion
of ZONE_MOVABLE vs. ZONE_NORMAL here; I plan to work on that in the
future.

Think of it as a compromise between section-based memory hotplug and
page-based ballooning.

2. Memory hotplug and the 128MB section size

Interesting point. Note that "the whole memory hotplug is centered
around memory sections and those are 128MB in size" merely describes the
current Linux implementation, nothing more. Windows, e.g., supports 1MB
DIMMs. So the statement "smaller sizes simply do not fit into that
concept" is wrong; they just do not fit *perfectly* into the way it is
implemented right now in Linux. As this patch series shows, that is a
minor drawback we can easily work around.

"How do you deal with that?" I think the question is answered below: add
a whole section, but online only parts of it.
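To illustrate the idea, a minimal, purely hypothetical driver-side
sketch (not code from this series: plug_one_chunk() and the constants
are made up, x86-64 sizes are assumed, and making online_pages()
callable from a driver at sub-section granularity is exactly what this
series enables):

/* Illustrative sketch only: "add a whole section, online only parts
 * of it". Error handling and locking are elided. */
#include <linux/memory_hotplug.h>
#include <linux/mmzone.h>
#include <linux/pfn.h>

#define VM_CHUNK_SIZE  (4UL << 20)                 /* 4MB plug/unplug unit */
#define SECTION_BYTES  (1UL << SECTION_SIZE_BITS)  /* 128MB on x86-64      */

/* Make one more 4MB chunk of a section usable by the guest. */
static int plug_one_chunk(int nid, u64 section_base, int chunk_idx)
{
	u64 addr = section_base + chunk_idx * VM_CHUNK_SIZE;
	int ret;

	/* The first chunk makes the whole section known to the system
	 * (struct pages and all), but nothing is onlined yet. */
	if (chunk_idx == 0) {
		ret = add_memory(nid, section_base, SECTION_BYTES);
		if (ret)
			return ret;
	}

	/* Online only the 4MB piece the hypervisor actually granted. */
	return online_pages(PFN_DOWN(addr), VM_CHUNK_SIZE >> PAGE_SHIFT,
			    MMOP_ONLINE_KEEP);
}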
>> As we want to be able to add/remove small chunks of memory to a VM
>> without fragmenting guest memory ("it's not what the guest pays for"
>> and "what if the hypervisor wants to use huge pages"), it looks like
>> we can do that under Linux at a 4MB granularity by using
>> online_pages()/offline_pages().
>
> Please expand on this some more. Larger logical units usually lead to a
> smaller fragmentation.

I want to avoid what balloon drivers do: rip out random pages,
fragmenting guest memory until we eventually trigger the OOM killer. By
instead using fixed 4MB chunks, there is no fragmentation from the
hypervisor's point of view. And if I can't find such a chunk anymore:
bad luck, but at least I won't be risking the stability of my guest.
Does that answer your question?
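Again as a sketch only (not from the series; try_unplug_chunk() is made
up, and alloc_contig_range() is merely one existing in-tree primitive,
the migration-based approach CMA and gigantic-page allocation use, that
could serve here; whether virtio-mem ends up using it is an open
implementation detail):

/* Illustrative sketch only: try to take one aligned 4MB chunk out of
 * use, instead of ripping out random pages like a balloon would. */
#include <linux/gfp.h>
#include <linux/mmzone.h>

#define CHUNK_PAGES ((4UL << 20) >> PAGE_SHIFT)  /* 1024 pages with 4KB pages */

static int try_unplug_chunk(unsigned long chunk_start_pfn)
{
	int ret;

	/* Migrate all data out of [start, start + 4MB); this fails
	 * cleanly if the range contains unmovable allocations. */
	ret = alloc_contig_range(chunk_start_pfn,
				 chunk_start_pfn + CHUNK_PAGES,
				 MIGRATE_MOVABLE, GFP_KERNEL);
	if (ret)
		return ret;  /* bad luck, no movable 4MB chunk here */

	/* The chunk now belongs to us: hand it back to the hypervisor
	 * and offline it (details elided). */
	return 0;
}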
>> We add a segment and online only 4MB blocks of it on demand. So the
>> other memory might not be accessible.
>
> But you still allocate vmemmap for the full memory section, right? That
> would mean that you spend 2MB to online 4MB of memory. Sounds quite
> wasteful to me.

This is true for the first 4MB chunk of a section, but not for the
remaining ones. Of course, I try to minimize the number of such
partially onlined sections (ideally, there would be only one "open"
section per virtio-mem device while we are only plugging memory). So as
we online further 4MB chunks of a section, the relative overhead gets
smaller and smaller.
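To put numbers on that (assuming the usual x86-64 defaults, 4KB base
pages and a 64-byte struct page, which is where the 2MB figure comes
from), a quick back-of-the-envelope check:

/* Assumes 4KB pages, 64-byte struct page, 128MB sections (x86-64). */
#include <stdio.h>

int main(void)
{
	unsigned long section = 128UL << 20;     /* one 128MB memory section */
	unsigned long pages   = section / 4096;  /* 32768 base pages         */
	unsigned long vmemmap = pages * 64;      /* 2MB of struct pages      */

	/* The vmemmap is allocated for the whole section up front, so
	 * the overhead amortizes as more 4MB chunks get onlined: */
	for (unsigned long chunks = 1; chunks <= 32; chunks *= 2)
		printf("%2lu x 4MB onlined -> %5.2f%% vmemmap overhead\n",
		       chunks, 100.0 * vmemmap / (chunks * (4UL << 20)));
	return 0;
}

With all 32 chunks of a section onlined, the overhead drops from 50% to
the usual ~1.6% that struct pages always cost.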
>> For kdump and onlining/offlining code, we have to mark pages as
>> offline before a new segment is visible to the system (e.g. as these
>> pages might not be backed by real memory in the hypervisor).
>
> Please expand on the kdump part. That is really confusing because
> hotplug should simply not depend on kdump at all. Moreover why don't you
> simply mark those pages reserved and pull them out from the page
> allocator?

1. "hotplug should simply not depend on kdump at all"

In theory, yes. But in the current state we already have to trigger a
kdump reload whenever we add/remove a memory block.

2. The kdump part

Whenever we offline a page and tell the hypervisor about it ("unplug"),
we must not assume that we can read that page again. Now, if dump tools
assume they can read all memory that is offline, we are in trouble. It
is the same situation we already have with PG_hwpoison, just with a
different meaning: "don't touch this page, it is offline" instead of
"don't touch this page, the hardware is broken". Balloon drivers solve
this problem by always allowing unplugged memory to be read; in
virtio-mem, this cannot and should not even be guaranteed. What we have
to do to make this work is actually pretty simple: just like
PG_hwpoison, track per page whether it is online, and expose that
information to kdump.

3. Marking pages reserved and pulling them out of the page allocator

This is basically what e.g. the Xen balloon does, and I don't see how it
helps with kdump. Can you explain how it would solve any of the problems
I am trying to solve here? It solves neither the unplug part nor the
"tell dump tools not to read such memory" part.

Thanks for looking into this.

-- 

Thanks,

David / dhildenb