From: David Hildenbrand
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, David Hildenbrand, Andrew Morton, Arnd Bergmann,
    Michal Hocko, Oscar Salvador, Matthew Wilcox, Andrea Arcangeli,
    Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen, Hugh Dickins,
    Rik van Riel, "Michael S . Tsirkin", "Kirill A . Shutemov",
    Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner,
    Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller,
    Chris Zankel, Max Filippov, linux-alpha@vger.kernel.org,
    linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
    linux-xtensa@linux-xtensa.org, linux-arch@vger.kernel.org
Subject: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Date: Wed, 17 Feb 2021 16:48:44 +0100
Message-Id: <20210217154844.12392-1-david@redhat.com>

When we manage sparse memory mappings dynamically in user space - also
sometimes involving MAP_NORESERVE - we want to dynamically
populate/discard memory inside such a sparse memory region. Example users
are hypervisors (especially implementing memory ballooning or similar
technologies like virtio-mem) and memory allocators. In addition, we want
to fail in a nice way if populating does not succeed because we are out of
backend memory (which can happen easily with file-based mappings,
especially tmpfs and hugetlbfs).

While MADV_DONTNEED and FALLOC_FL_PUNCH_HOLE provide us ways to reliably
discard memory, there is no generic approach to populate ("preallocate")
memory.
Although mmap() supports MAP_POPULATE, it is not applicable to the concept
of sparse memory mappings, where we want to populate/discard dynamically
and avoid expensive/problematic remappings. In addition, MAP_POPULATE
never actually reports errors during the final populate phase - it is
best-effort only.

fallocate() can be used to preallocate file-based memory and fail in a
safe way. However, it is less useful for private mappings on anonymous
files due to COW semantics. For example, using fallocate() to preallocate
memory on anonymous memfd files that are mapped MAP_PRIVATE results in
double memory consumption when actually writing via the mapping. In
addition, fallocate() does not actually populate page tables, so we still
always have to resolve minor faults on first access.

Because we don't have a proper interface, what applications (like QEMU and
databases) end up doing is touching (i.e., writing) all individual pages.
However, catching failures this way requires expensive signal handling
(SIGBUS); for example, this is problematic in hypervisors like QEMU where
SIGBUS handlers might already be used by other subsystems concurrently to,
e.g., handle hardware errors. "Simply" doing preallocation from another
thread is not that easy.

Let's introduce MADV_POPULATE with the following semantics:
1. MADV_POPULATE does not work on PROT_NONE and special VMAs. It works on
   everything else.
2. Errors during MADV_POPULATE (especially OOM) are reported. If we hit
   hardware errors on pages, ignore them - nothing we really can or
   should do.
3. On errors during MADV_POPULATE, some memory might have been populated.
   Callers have to clean up if they care.
4. Concurrent changes to the virtual memory layout are tolerated - we
   process each and every PFN only once, though.
5. If MADV_POPULATE succeeds, all memory in the range can be accessed
   without SIGBUS (of course, not if user space changed mappings in the
   meantime or KSM kicked in on anonymous memory).
Although sparse memory mappings are the primary use case, this will also
be useful for ordinary preallocations where MAP_POPULATE is not desired
(e.g., in QEMU, where users can trigger preallocation of guest RAM after
the mapping was created).

Looking at the history, MADV_POPULATE was already proposed in 2013 [1];
however, the main motivation back then was performance improvements (which
should also still be the case, but it's a secondary concern).

Basic functionality was tested with:
- anonymous memory
- MAP_PRIVATE on anonymous file via memfd
- MAP_SHARED on anonymous file via memfd
- MAP_PRIVATE on anonymous hugetlbfs file via memfd
- MAP_SHARED on anonymous hugetlbfs file via memfd
- MAP_PRIVATE on tmpfs/shmem file (we end up with double memory
  consumption though, as the actual file gets populated with zeroes)
- MAP_SHARED on tmpfs/shmem file

Note: For populating/preallocating zeroed-out memory while userfaultfd is
active, it's even faster to first use fallocate() or to place zeroed pages
via userfaultfd APIs. Otherwise, we'll have to route every fault while
populating via the userfaultfd handler.

[1] https://lkml.org/lkml/2013/6/27/698

Cc: Andrew Morton
Cc: Arnd Bergmann
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Matthew Wilcox (Oracle)
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Jann Horn
Cc: Jason Gunthorpe
Cc: Dave Hansen
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Michael S. Tsirkin
Cc: Kirill A. Shutemov
Cc: Vlastimil Babka
Cc: Richard Henderson
Cc: Ivan Kokshaysky
Cc: Matt Turner
Cc: Thomas Bogendoerfer
Cc: "James E.J. Bottomley"
Cc: Helge Deller
Cc: Chris Zankel
Cc: Max Filippov
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Signed-off-by: David Hildenbrand
---

If we agree that this makes sense I'll do more testing to see if we are
missing any return value handling and prepare a man page update to
document the semantics. Thoughts?

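For comparison, the fallocate()-based preallocation referred to in the
note above might look like this for a MAP_SHARED mapping of an anonymous
memfd. This is a minimal sketch, not code from the patch: the helper name
map_prealloc_memfd() and the memfd name "prealloc-demo" are made up. It
only allocates backing pages; page tables are still filled on first
access.

```c
#define _GNU_SOURCE		/* for memfd_create() and fallocate() */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memfd of the given size,
 * preallocate its backing pages and map it MAP_SHARED. Returns MAP_FAILED
 * with errno set if any step fails, for example when the backend is out
 * of memory.
 */
static void *map_prealloc_memfd(size_t len)
{
	void *addr = MAP_FAILED;
	int saved_errno;
	int fd;

	fd = memfd_create("prealloc-demo", MFD_CLOEXEC);
	if (fd < 0)
		return MAP_FAILED;

	/* Mode 0 allocates the range and extends the file to len bytes. */
	if (fallocate(fd, 0, 0, (off_t)len) == 0)
		addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
			    fd, 0);

	/* The mapping keeps the file alive; keep errno across close(). */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return addr;
}
```

With a MAP_PRIVATE mapping the same preallocation would be wasted on
write, since every written page is copied, which is the double memory
consumption described in the commit message.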
---
 arch/alpha/include/uapi/asm/mman.h     |  2 +
 arch/mips/include/uapi/asm/mman.h      |  2 +
 arch/parisc/include/uapi/asm/mman.h    |  2 +
 arch/xtensa/include/uapi/asm/mman.h    |  2 +
 include/uapi/asm-generic/mman-common.h |  2 +
 mm/madvise.c                           | 70 ++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index a18ec7f63888..e90eeb5e6cf1 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -71,6 +71,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 57dc2ac4f8bd..b928becc5308 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -98,6 +98,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index ab78cba446ed..9d3a56044287 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -52,6 +52,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 #define MADV_MERGEABLE   65		/* KSM may merge identical pages */
 #define MADV_UNMERGEABLE 66		/* KSM may not merge identical pages */
 
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index e5e643752947..3169b1be8920 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -106,6 +106,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index f94f65d429be..fa617fd0d733 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -72,6 +72,8 @@
 #define MADV_COLD	20		/* deactivate these pages */
 #define MADV_PAGEOUT	21		/* reclaim these pages */
 
+#define MADV_POPULATE	22		/* populate pages */
+
 /* compatibility flags */
 #define MAP_FILE	0
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 6a660858784b..f76fdd6fcf10 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -53,6 +53,7 @@ static int madvise_need_mmap_write(int behavior)
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_FREE:
+	case MADV_POPULATE:
 		return 0;
 	default:
 		/* be safe, default to 1. list exceptions explicitly */
@@ -821,6 +822,72 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 		return -EINVAL;
 }
 
+static long madvise_populate(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
+			     unsigned long start, unsigned long end)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long tmp_end;
+	int locked = 1;
+	long pages;
+
+	*prev = vma;
+
+	while (start < end) {
+		/*
+		 * We might have temporarily dropped the lock. For example,
+		 * our VMA might have been split.
+		 */
+		if (!vma || start >= vma->vm_end) {
+			vma = find_vma(mm, start);
+			if (!vma)
+				return -ENOMEM;
+		}
+
+		/* Bail out on incompatible VMA types. */
+		if (vma->vm_flags & (VM_IO | VM_PFNMAP) ||
+		    !vma_is_accessible(vma)) {
+			return -EINVAL;
+		}
+
+		/*
+		 * Populate pages and take care of VM_LOCKED: simulate user
+		 * space access.
+		 *
+		 * For private, writable mappings, trigger a write fault to
+		 * break COW (i.e., shared zeropage). For other mappings (i.e.,
+		 * read-only, shared), trigger a read fault.
+		 */
+		tmp_end = min_t(unsigned long, end, vma->vm_end);
+		pages = populate_vma_page_range(vma, start, tmp_end, &locked);
+		if (!locked) {
+			mmap_read_lock(mm);
+			*prev = NULL;
+			vma = NULL;
+		}
+		if (pages < 0) {
+			switch (pages) {
+			case -EINTR:
+			case -ENOMEM:
+				return pages;
+			case -EHWPOISON:
+				/* Skip over any poisoned pages. */
+				start += PAGE_SIZE;
+				continue;
+			case -EBUSY:
+			case -EAGAIN:
+				continue;
+			default:
+				pr_warn_once("%s: unhandled return value: %ld\n",
+					     __func__, pages);
+				return -ENOMEM;
+			}
+		}
+		start += pages * PAGE_SIZE;
+	}
+	return 0;
+}
+
 /*
  * Application wants to free up the pages and associated backing store.
  * This is effectively punching a hole into the middle of a file.
@@ -934,6 +1001,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	case MADV_FREE:
 	case MADV_DONTNEED:
 		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_POPULATE:
+		return madvise_populate(vma, prev, start, end);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -954,6 +1023,7 @@ madvise_behavior_valid(int behavior)
 	case MADV_FREE:
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+	case MADV_POPULATE:
 #ifdef CONFIG_KSM
 	case MADV_MERGEABLE:
 	case MADV_UNMERGEABLE:
-- 
2.29.2