Subject: Re: [PATCH RCFv2 0/7] mm: online/offline 4MB chunks controlled by device driver
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Wed, 9 May 2018 16:14:40 +0200
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Andrew Morton, Balbir Singh, Baoquan He,
    Benjamin Herrenschmidt, Boris Ostrovsky, Dan Williams, Dave Young,
    Greg Kroah-Hartman, Hari Bathini, Huang Ying, Hugh Dickins, Ingo Molnar,
    Jan Kara, Jérôme Glisse, Joonsoo Kim, Juergen Gross, "Kirill A. Shutemov",
    Matthew Wilcox, Mel Gorman, Michael Ellerman, Michal Hocko, Miles Chen,
    Paul Mackerras, Pavel Tatashin, Philippe Ombredanne, Rashmica Gupta,
    Reza Arbab, Souptick Joarder, Tetsuo Handa, Thomas Gleixner, Vlastimil Babka
References: <20180430094236.29056-1-david@redhat.com>
In-Reply-To: <20180430094236.29056-1-david@redhat.com>

On 30.04.2018 11:42, David Hildenbrand wrote:
> I am right now working on a paravirtualized memory device ("virtio-mem").
> These devices control a memory region and the amount of memory available
> via it. Memory will not be indicated/added/onlined via ACPI and friends,
> the device driver is responsible for it.
>
> When the device driver starts up, it will add and online the requested
> amount of memory from its assigned physical memory region. On request, it can
> add (online) either more memory or try to remove (offline) memory. As it
> will be a virtio module, we also want to be able to have it as a loadable
> kernel module.
>
> Such a device can be thought of like a "resizable DIMM" or a "huge
> number of 4MB DIMMS" that can be automatically managed.
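
A toy userspace model may make the add/online and offline/remove accounting
described above more concrete. It is not taken from the patch series: the
`Region` class, its method names, and the 128 MiB memory block size are
assumptions made purely for illustration, and unlike a real offline attempt
in the kernel, this model never fails to offline a chunk.

```python
# Toy userspace model of 4 MiB chunk accounting. Not kernel code.
# All names are hypothetical; BLOCK_SIZE is an assumed value (the real
# memory block size depends on architecture and configuration).

MiB = 1024 * 1024
CHUNK_SIZE = 4 * MiB      # granularity named in the cover letter
BLOCK_SIZE = 128 * MiB    # assumption for this sketch
CHUNKS_PER_BLOCK = BLOCK_SIZE // CHUNK_SIZE


class Region:
    """Tracks which chunks of a device-owned region are online."""

    def __init__(self, size):
        if size % BLOCK_SIZE:
            raise ValueError("region size must be a multiple of BLOCK_SIZE")
        self.nr_blocks = size // BLOCK_SIZE
        # block index -> set of online chunk indices. A block missing from
        # the dict models a memory block that has not been added at all.
        self.blocks = {}

    def online_bytes(self):
        return sum(len(chunks) for chunks in self.blocks.values()) * CHUNK_SIZE

    def plug(self, nbytes):
        """Online up to nbytes, adding blocks on demand; return bytes onlined."""
        todo = nbytes // CHUNK_SIZE
        done = 0
        for block in range(self.nr_blocks):
            for chunk in range(CHUNKS_PER_BLOCK):
                if done == todo:
                    return done * CHUNK_SIZE
                online = self.blocks.setdefault(block, set())
                if chunk not in online:
                    online.add(chunk)
                    done += 1
        return done * CHUNK_SIZE

    def unplug(self, nbytes):
        """Offline up to nbytes, highest blocks first; return bytes offlined."""
        todo = nbytes // CHUNK_SIZE
        done = 0
        for block in sorted(self.blocks, reverse=True):
            online = self.blocks[block]
            while online and done < todo:
                online.remove(max(online))
                done += 1
            if not online:
                # Every chunk of the block is offline: drop the whole block.
                del self.blocks[block]
        return done * CHUNK_SIZE
```

For example, plugging 132 MiB into a 256 MiB region fills the first block and
onlines a single chunk of the second; unplugging 4 MiB afterwards empties the
second block again, so the model drops it.
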
>
> As we want to be able to add/remove small chunks of memory to a VM without
> fragmenting guest memory ("it's not what the guest pays for" and "what if
> the hypervisor wants to use huge pages"), it looks like we can do that
> under Linux in a 4MB granularity by using online_pages()/offline_pages().
>
> We add a segment and online only 4MB blocks of it on demand. So the other
> memory might not be accessible. For kdump and offlining code, we have to
> mark pages as offline before a new segment is visible to the system (e.g.
> as these pages might not be backed by real memory in the hypervisor).
>
> This is not a balloon driver. Main differences:
> - We can add more memory to a VM without having to use a mixture of
>   technologies - e.g. ACPI for plugging, balloon for unplugging (in contrast
>   to virtio-balloon).
> - The device is responsible for its own memory only - will not inflate on
>   any system memory. (in contrast to all balloons)
> - Works on a coarser granularity (e.g. 4MB because that's what we can
>   online/offline in Linux). We are not using the buddy allocator when unplugging
>   but really search for chunks of memory we can offline. We actually
>   can support arbitrary block sizes. (in contrast to all balloons)
> - That's why we don't fragment guest memory.
> - A device can belong to exactly one NUMA node. This way we can online/offline
>   memory in a fine granularity NUMA aware. Even if the guest does not even
>   know how to spell NUMA. (in contrast to all balloons)
> - Architectures that don't have proper memory hotplug interfaces (e.g. s390x)
>   get memory hotplug support. I have a prototype for s390x.
> - Once all 4MB chunks of a memory block are offline, we can remove the
>   memory block and therefore the struct pages. (in contrast to all balloons)
>
> This essentially allows us to add/remove 4MB chunks to/from a VM.
> Especially without caring about the future when adding memory ("If I add a
> 128GB DIMM I can only unplug 128GB again") or running into limits ("If I want
> my VM to grow to 4TB, I have to plug at least 16GB per DIMM").
>
> Future work:
> - Performance improvements
> - Be smarter about which blocks to offline first (e.g. free ones)
> - Automatically manage assignment to NORMAL/MOVABLE zone to make
>   unplug more likely to succeed.
>
> I will post the next prototype of virtio-mem shortly.
>

If there are no further comments, I'll send a v1 (!RFC) version, along with
the virtio-mem prototype after rebasing (assuming that nothing breaks :) ).

--

Thanks,

David / dhildenb