From: David Hildenbrand
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, David Hildenbrand, Alexander Potapenko,
    Andrew Morton, Andrey Ryabinin, Balbir Singh, Baoquan He,
    Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams, Dave Young,
    Dmitry Vyukov, Greg Kroah-Hartman, Hari Bathini, Huang Ying,
    Hugh Dickins, Ingo Molnar, Jaewon Kim, Jan Kara, Jérôme Glisse,
    Joonsoo Kim, Juergen Gross, Kate Stewart, "Kirill A. Shutemov",
    Matthew Wilcox, Mel Gorman, Michael Ellerman, Michal Hocko,
    Miles Chen, Oscar Salvador, Paul Mackerras, Pavel Tatashin,
    Philippe Ombredanne, Rashmica Gupta, Reza Arbab, Souptick Joarder,
    Tetsuo Handa, Thomas Gleixner, Vlastimil Babka
Subject: [PATCH v1 00/10] mm: online/offline 4MB chunks controlled by device driver
Date: Wed, 23 May 2018 17:11:41 +0200
Message-Id: <20180523151151.6730-1-david@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

This is now the !RFC version. I did some additional tests and inspected
all memory notifiers. At least page_ext and kasan need fixes.

==========

I am currently working on a paravirtualized memory device ("virtio-mem").
These devices control a memory region and the amount of memory available
via it. Memory will not be indicated/added/onlined via ACPI and friends;
the device driver is responsible for it.

When the device driver starts up, it will add and online the requested
amount of memory from its assigned physical memory region. On request, it
can either add (online) more memory or try to remove (offline) memory. As
it will be a virtio module, we also want to be able to have it as a
loadable kernel module.

Such a device can be thought of as a "resizable DIMM" or a "huge number
of 4MB DIMMs" that can be automatically managed.

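To make the "huge number of 4MB DIMMs" picture a bit more concrete, here
is a minimal userspace sketch (Python, not kernel code) of the bookkeeping
such a device would need: one region, split into 4MB chunks, moved toward
a requested size. All names (VirtioMemModel, set_requested_size,
can_offline, CHUNK_SIZE) are hypothetical and do not appear in this
series; a real driver would online/offline pages through the kernel
instead of flipping flags, and onlining is simply assumed to succeed here.

```python
CHUNK_SIZE = 4 * 1024 * 1024  # 4MB granularity, as described above


class VirtioMemModel:
    """Toy model of a device that owns one physical memory region.

    Hypothetical illustration only: a chunk is just a boolean here,
    whereas a real driver would online/offline it via the kernel.
    """

    def __init__(self, region_size):
        if region_size % CHUNK_SIZE:
            raise ValueError("region size must be a multiple of 4MB")
        # One flag per 4MB chunk: False = offline, True = online.
        self.online = [False] * (region_size // CHUNK_SIZE)

    def plugged_size(self):
        """Bytes currently online in this region."""
        return sum(self.online) * CHUNK_SIZE

    def set_requested_size(self, requested, can_offline=lambda idx: True):
        """Move toward the requested size; return the resulting plugged size.

        The request is rounded down to whole chunks and capped at the
        region size. Offlining may fail per chunk (e.g. the chunk is in
        use), which the can_offline callback models.
        """
        target = min(requested // CHUNK_SIZE, len(self.online))
        current = sum(self.online)
        for idx, is_online in enumerate(self.online):
            if current == target:
                break
            if current < target and not is_online:
                self.online[idx] = True   # "plug" one more 4MB chunk
                current += 1
            elif current > target and is_online and can_offline(idx):
                self.online[idx] = False  # "unplug" one 4MB chunk
                current -= 1
        return self.plugged_size()
```

In this model, shrinking is best effort: if can_offline() refuses some
chunks, the device simply ends up with more memory plugged than was
requested, which mirrors the "try to remove (offline) memory" wording
above.
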
As we want to be able to add/remove small chunks of memory to a VM without
fragmenting guest memory ("it's not what the guest pays for" and "what if
the hypervisor wants to use huge pages"), it looks like we can do that
under Linux in a 4MB granularity by using online_pages()/offline_pages().

We add a segment and online only 4MB blocks of it on demand, so the rest
of the segment might not be accessible. For kdump and the
onlining/offlining code, we have to mark pages as offline before a new
segment is visible to the system (e.g. as these pages might not be backed
by real memory in the hypervisor).

This is not a balloon driver. Main differences:
- We can add more memory to a VM without having to use a mixture of
  technologies - e.g. ACPI for plugging, balloon for unplugging (in
  contrast to virtio-balloon).
- The device is responsible for its own memory only - it will not inflate
  on any system memory. (in contrast to all balloons)
- Works on a coarser granularity (e.g. 4MB because that's what we can
  online/offline in Linux). We are not using the buddy allocator when
  unplugging but really search for chunks of memory we can offline. We
  can actually support arbitrary block sizes. (in contrast to all
  balloons)
- That's why we don't fragment guest memory.
- A device can belong to exactly one NUMA node. This way we can
  online/offline memory in a fine-grained, NUMA-aware way, even if the
  guest does not even know how to spell NUMA. (in contrast to all
  balloons)
- Architectures that don't have proper memory hotplug interfaces (e.g.
  s390x) get memory hotplug support. I have a prototype for s390x.
- Once all 4MB chunks of a memory block are offline, we can remove the
  memory block and therefore the struct pages. (in contrast to all
  balloons)

This essentially allows us to add/remove 4MB chunks to/from a VM.
In particular, there is no need to plan ahead when adding memory ("If I
add a 128GB DIMM I can only unplug 128GB again") or to worry about running
into limits ("If I want my VM to grow to 4TB, I have to plug at least 16GB
per DIMM").

Future work:
- Performance improvements
- Be smarter about which blocks to offline first (e.g. free ones)
- Automatically manage assignment to the NORMAL/MOVABLE zone to make
  unplug more likely to succeed.

I will post the next prototype of virtio-mem shortly. This time for real :)

==========

RFCv2 -> v1:
- "mm: introduce and use PageOffline()"
-- fix set_page_address() handling for WANT_PAGE_VIRTUAL
- Include "mm/page_ext.c: support online/offline of memory < section size"
- Include "kasan: prepare for online/offline of different start/size"
- Include "mm/memory_hotplug: onlining pages can only fail due to notifiers"

David Hildenbrand (10):
  mm: introduce and use PageOffline()
  mm/page_ext.c: support online/offline of memory < section size
  kasan: prepare for online/offline of different start/size
  kdump: include PAGE_OFFLINE_MAPCOUNT_VALUE in VMCOREINFO
  mm/memory_hotplug: limit offline_pages() to sizes we can actually
    handle
  mm/memory_hotplug: onlining pages can only fail due to notifiers
  mm/memory_hotplug: print only with DEBUG_VM in online/offline_pages()
  mm/memory_hotplug: allow to control onlining/offlining of memory by a
    driver
  mm/memory_hotplug: teach offline_pages() to not try forever
  mm/memory_hotplug: allow online/offline memory by a kernel module

 arch/powerpc/platforms/powernv/memtrace.c |   2 +-
 drivers/base/memory.c                     |  25 +--
 drivers/base/node.c                       |   1 -
 drivers/xen/balloon.c                     |   2 +-
 include/linux/memory.h                    |   2 +-
 include/linux/memory_hotplug.h            |  20 ++-
 include/linux/mm.h                        |  10 ++
 include/linux/page-flags.h                |   9 ++
 kernel/crash_core.c                       |   1 +
 mm/kasan/kasan.c                          | 107 ++++++++-----
 mm/memory_hotplug.c                       | 180 +++++++++++++++++-----
 mm/page_alloc.c                           |  32 ++--
 mm/page_ext.c                             |   9 +-
 mm/sparse.c                               |  25 ++-
 14 files changed, 315 insertions(+), 110 deletions(-)

-- 
2.17.0