From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, virtio-dev@lists.oasis-open.org,
    virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
    Michal Hocko, Andrew Morton, "Michael S. Tsirkin", David Hildenbrand,
    Jason Wang, Oscar Salvador, Igor Mammedov, Dave Young, Dan Williams,
    Pavel Tatashin, Stefan Hajnoczi, Vlastimil Babka
Subject: [PATCH RFC v4 05/13] virtio-mem: Paravirtualized memory hotunplug part 2
Date: Thu, 12 Dec 2019 18:11:29 +0100
Message-Id: <20191212171137.13872-6-david@redhat.com>
In-Reply-To: <20191212171137.13872-1-david@redhat.com>
References: <20191212171137.13872-1-david@redhat.com>

We can use alloc_contig_range() to try to unplug subblocks. Unplugged
subblocks will be marked PG_offline but won't have the PG_reserved flag
set. This way, we can differentiate these allocated subblocks from
subblocks that were never onlined and handle them properly in
virtio_mem_fake_online(). free_contig_range() is used to hand back
subblocks to Linux.

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might be completely
fragmented with unmovable data, such that no subblock can get unplugged.
We might want to improve the unplugging capability in the future.

We are not touching ZONE_MOVABLE. If memory is onlined to ZONE_MOVABLE,
it can only get unplugged after that memory was offlined manually by
user space.
In normal operation, virtio-mem memory is suggested to be onlined to
ZONE_NORMAL. In the future, we will try to make unplug more likely to
succeed.

Add a module parameter to control if online memory shall be touched.

Future work:
- Performance improvements:
-- Sense (locklessly) if it makes sense to try alloc_contig_range() at
   all before directly trying to isolate and taking locks.
-- Try to unplug bigger chunks first if possible.
-- Identify free areas first that don't have to be evacuated.
- Make unplug more likely to succeed:
-- There are various ideas to limit fragmentation on memory block
   granularity (e.g., ZONE_PREFER_MOVABLE and smart balancing).
-- Allocate the memmap from added memory. This way, less unmovable data
   can end up on the memory blocks.
- OOM handling, e.g., via an OOM handler.
- Defragmentation:
-- Will require a new virtio-mem CMD to exchange plugged<->unplugged
   blocks.

Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Oscar Salvador
Cc: Michal Hocko
Cc: Igor Mammedov
Cc: Dave Young
Cc: Andrew Morton
Cc: Dan Williams
Cc: Pavel Tatashin
Cc: Stefan Hajnoczi
Cc: Vlastimil Babka
Signed-off-by: David Hildenbrand
---
 drivers/virtio/Kconfig      |   1 +
 drivers/virtio/virtio_mem.c | 156 ++++++++++++++++++++++++++++++++----
 2 files changed, 143 insertions(+), 14 deletions(-)

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
index 294720d53057..75a760f32ec7 100644
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -71,6 +71,7 @@ config VIRTIO_MEM
 	depends on VIRTIO
 	depends on MEMORY_HOTPLUG_SPARSE
 	depends on MEMORY_HOTREMOVE
+	select CONTIG_ALLOC
 	help
 	  This driver provides access to virtio-mem paravirtualized memory
 	  devices, allowing to hotplug and hotunplug memory.
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index f1af05def5df..5a142a371222 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -21,6 +21,10 @@
 
 #include
 
+static bool unplug_online = true;
+module_param(unplug_online, bool, 0644);
+MODULE_PARM_DESC(unplug_online, "Try to unplug online memory");
+
 enum virtio_mem_mb_state {
 	/* Unplugged, not added to Linux. Can be reused later. */
 	VIRTIO_MEM_MB_STATE_UNUSED = 0,
@@ -646,23 +650,35 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 }
 
 /*
- * Set a range of pages PG_offline.
+ * Set a range of pages PG_offline. Remember pages that were never onlined
+ * (via generic_online_page()) using PageDirty().
  */
 static void virtio_mem_set_fake_offline(unsigned long pfn,
-					unsigned int nr_pages)
+					unsigned int nr_pages, bool onlined)
 {
-	for (; nr_pages--; pfn++)
-		__SetPageOffline(pfn_to_page(pfn));
+	for (; nr_pages--; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__SetPageOffline(page);
+		if (!onlined)
+			SetPageDirty(page);
+	}
 }
 
 /*
- * Clear PG_offline from a range of pages.
+ * Clear PG_offline from a range of pages. If the pages were never onlined
+ * (via generic_online_page()), clear PageDirty().
  */
 static void virtio_mem_clear_fake_offline(unsigned long pfn,
-					  unsigned int nr_pages)
+					  unsigned int nr_pages, bool onlined)
 {
-	for (; nr_pages--; pfn++)
-		__ClearPageOffline(pfn_to_page(pfn));
+	for (; nr_pages--; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		__ClearPageOffline(page);
+		if (!onlined)
+			ClearPageDirty(page);
+	}
 }
 
 /*
@@ -681,10 +697,26 @@ static void virtio_mem_fake_online(unsigned long pfn, unsigned int nr_pages)
 	BUG_ON(!IS_ALIGNED(pfn, 1 << order));
 	BUG_ON(!IS_ALIGNED(nr_pages, 1 << order));
 
-	virtio_mem_clear_fake_offline(pfn, nr_pages);
+	for (i = 0; i < nr_pages; i += 1 << order) {
+		struct page *page = pfn_to_page(pfn + i);
 
-	for (i = 0; i < nr_pages; i += 1 << order)
-		generic_online_page(pfn_to_page(pfn + i), order);
+		/*
+		 * If the page is PageDirty(), it was kept fake-offline when
+		 * onlining the memory block. Otherwise, it was allocated
+		 * using alloc_contig_range(). All pages in a subblock are
+		 * alike.
+		 */
+		if (PageDirty(page)) {
+			virtio_mem_clear_fake_offline(pfn + i, 1 << order,
+						      false);
+			generic_online_page(page, order);
+		} else {
+			virtio_mem_clear_fake_offline(pfn + i, 1 << order,
+						      true);
+			free_contig_range(pfn + i, 1 << order);
+			adjust_managed_page_count(page, 1 << order);
+		}
+	}
 }
 
 static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
@@ -713,7 +745,8 @@ static void virtio_mem_online_page_cb(struct page *page, unsigned int order)
 		if (virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
 			generic_online_page(page, order);
 		else
-			virtio_mem_set_fake_offline(PFN_DOWN(addr), 1 << order);
+			virtio_mem_set_fake_offline(PFN_DOWN(addr), 1 << order,
+						    false);
 		rcu_read_unlock();
 		return;
 	}
@@ -1181,6 +1214,72 @@ static int virtio_mem_mb_unplug_any_sb_offline(struct virtio_mem *vm,
 	return 0;
 }
 
+/*
+ * Unplug the desired number of plugged subblocks of an online memory block.
+ * Will skip subblocks that are busy.
+ *
+ * Will modify the state of the memory block.
+ *
+ * Note: Can fail after some subblocks were successfully unplugged. Can
+ *       return 0 even if subblocks were busy and could not get unplugged.
+ */
+static int virtio_mem_mb_unplug_any_sb_online(struct virtio_mem *vm,
+					      unsigned long mb_id,
+					      uint64_t *nb_sb)
+{
+	const unsigned long nr_pages = PFN_DOWN(vm->subblock_size);
+	unsigned long start_pfn;
+	int rc, sb_id;
+
+	/*
+	 * TODO: To increase the performance we want to try bigger, consecutive
+	 *	 subblocks first before falling back to single subblocks. Also,
+	 *	 we should sense via something like is_mem_section_removable()
+	 *	 first if it makes sense to go ahead and try to allocate.
+	 */
+	for (sb_id = 0; sb_id < vm->nb_sb_per_mb && *nb_sb; sb_id++) {
+		/* Find the next candidate subblock */
+		while (sb_id < vm->nb_sb_per_mb &&
+		       !virtio_mem_mb_test_sb_plugged(vm, mb_id, sb_id, 1))
+			sb_id++;
+		if (sb_id >= vm->nb_sb_per_mb)
+			break;
+
+		start_pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+				     sb_id * vm->subblock_size);
+		rc = alloc_contig_range(start_pfn, start_pfn + nr_pages,
+					MIGRATE_MOVABLE, GFP_KERNEL);
+		if (rc == -ENOMEM)
+			/* whoops, out of memory */
+			return rc;
+		if (rc)
+			/* memory busy, we can't unplug this chunk */
+			continue;
+
+		/* Mark it as fake-offline before unplugging it */
+		virtio_mem_set_fake_offline(start_pfn, nr_pages, true);
+		adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
+
+		/* Try to unplug the allocated memory */
+		rc = virtio_mem_mb_unplug_sb(vm, mb_id, sb_id, 1);
+		if (rc) {
+			/* Return the memory to the buddy. */
+			virtio_mem_fake_online(start_pfn, nr_pages);
+			return rc;
+		}
+
+		virtio_mem_mb_set_state(vm, mb_id,
+					VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL);
+		*nb_sb -= 1;
+	}
+
+	/*
+	 * TODO: Once all subblocks of a memory block were unplugged, we want
+	 *	 to offline the memory block and remove it.
+	 */
+	return 0;
+}
+
 /*
  * Try to unplug the requested amount of memory.
  */
@@ -1219,8 +1318,36 @@ static int virtio_mem_unplug_request(struct virtio_mem *vm, uint64_t diff)
 		cond_resched();
 	}
 
+	if (!unplug_online) {
+		mutex_unlock(&vm->hotplug_mutex);
+		return 0;
+	}
+
+	/* Try to unplug subblocks of partially plugged online blocks. */
+	virtio_mem_for_each_mb_state(vm, mb_id,
+				     VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL) {
+		rc = virtio_mem_mb_unplug_any_sb_online(vm, mb_id,
+							&nb_sb);
+		if (rc || !nb_sb)
+			goto out_unlock;
+		mutex_unlock(&vm->hotplug_mutex);
+		cond_resched();
+		mutex_lock(&vm->hotplug_mutex);
+	}
+
+	/* Try to unplug subblocks of plugged online blocks. */
+	virtio_mem_for_each_mb_state(vm, mb_id, VIRTIO_MEM_MB_STATE_ONLINE) {
+		rc = virtio_mem_mb_unplug_any_sb_online(vm, mb_id,
+							&nb_sb);
+		if (rc || !nb_sb)
+			goto out_unlock;
+		mutex_unlock(&vm->hotplug_mutex);
+		cond_resched();
+		mutex_lock(&vm->hotplug_mutex);
+	}
+
 	mutex_unlock(&vm->hotplug_mutex);
-	return 0;
+	return nb_sb ? -EBUSY : 0;
 out_unlock:
 	mutex_unlock(&vm->hotplug_mutex);
 	return rc;
@@ -1324,7 +1451,8 @@ static void virtio_mem_run_wq(struct work_struct *work)
 	case -EBUSY:
 		/*
 		 * The hypervisor cannot process our request right now
-		 * (e.g., out of memory, migrating).
+		 * (e.g., out of memory, migrating) or we cannot free up
+		 * any memory to unplug it (all plugged memory is busy).
 		 */
 	case -ENOMEM:
 		/* Out of memory, try again later. */
-- 
2.23.0