Date: Thu, 18 Feb 2021 11:25:31 +0100
From: Michal Hocko
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton,
    Arnd Bergmann, Oscar Salvador, Matthew Wilcox, Andrea Arcangeli,
    Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen, Hugh Dickins,
    Rik van Riel, "Michael S. Tsirkin", "Kirill A. Shutemov",
    Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner,
    Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller,
    Chris Zankel, Max Filippov, linux-alpha@vger.kernel.org,
    linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org,
    linux-xtensa@linux-xtensa.org, linux-arch@vger.kernel.org
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
In-Reply-To: <20210217154844.12392-1-david@redhat.com>

On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
> When we manage sparse memory mappings dynamically in user space - also
> sometimes involving MADV_NORESERVE - we want to dynamically populate/

Just wondering, what is MADV_NORESERVE? I do not see anything like that
in Linus' tree. Did you mean MAP_NORESERVE?

> discard memory inside such a sparse memory region. Example users are
> hypervisors (especially implementing memory ballooning or similar
> technologies like virtio-mem) and memory allocators. In addition, we want
> to fail in a nice way if populating does not succeed because we are out of
> backend memory (which can happen easily with file-based mappings,
> especially tmpfs and hugetlbfs).

By "fail in a nice way", do you mean failing before a #PF would fail and
raise SIGBUS, which would be harder to handle?

[...]
> Because we don't have a proper interface, what applications
> (like QEMU and databases) end up doing is touching (i.e., writing) all
> individual pages. However, it requires expensive signal handling (SIGBUS);
> for example, this is problematic in hypervisors like QEMU where SIGBUS
> handlers might already be used by other subsystems concurrently to e.g.,
> handle hardware errors. "Simply" doing preallocation from another thread
> is not that easy.

OK, that clarifies my question above.
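
For reference, the page-touching approach described in the quoted paragraph
can be sketched roughly as follows. This is only an illustration under
stated assumptions, not code taken from QEMU or any database: the helper
name prefault_by_touch() and the sigsetjmp()-based recovery are made up for
the example. It also shows why the approach is awkward: the SIGBUS handler
is process-wide, so it collides with any other user of SIGBUS.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Jump target for the SIGBUS handler; only meaningful while
 * prefault_by_touch() is running. */
static sigjmp_buf prefault_env;

static void prefault_sigbus(int sig)
{
	(void)sig;
	siglongjmp(prefault_env, 1);
}

/*
 * Hypothetical helper: write-touch every page in [addr, addr + len).
 * Returns 0 on success, -1 if a SIGBUS was caught or the handler could
 * not be installed. Not thread-safe, because signal dispositions are
 * shared by the whole process.
 */
static int prefault_by_touch(void *addr, size_t len)
{
	struct sigaction sa, old;
	long page = sysconf(_SC_PAGESIZE);
	volatile int ret = 0;

	assert(page > 0);

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = prefault_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	/* savemask = 1 so the signal mask is restored after siglongjmp() */
	if (sigsetjmp(prefault_env, 1) == 0) {
		volatile char *p = addr;

		/* Read and write back one byte per page: this forces a
		 * write fault while preserving the existing contents. */
		for (size_t off = 0; off < len; off += (size_t)page)
			p[off] = p[off];
	} else {
		/* Reached via siglongjmp() from the handler. */
		ret = -1;
	}

	/* Restore whatever SIGBUS disposition was installed before. */
	sigaction(SIGBUS, &old, NULL);
	return ret;
}
```

Note that on failure the caller learns only that some page could not be
populated, not which one, and pages touched before the failure stay
populated.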
>
> Let's introduce MADV_POPULATE with the following semantics
> 1. MADV_POPULATED does not work on PROT_NONE and special VMAs. It works
>    on everything else.

It would be better to clarify what "does not work" means. I assume those
are ignored and do not report any error?

> 2. Errors during MADV_POPULATED (especially OOM) are reported.

How do you want to achieve that? The gup/page fault handler will allocate
memory and trigger the OOM killer without the caller noticing. You would
somehow have to weaken the allocation context to __GFP_RETRY_MAYFAIL or
__GFP_NORETRY to achieve the error handling.

>    If we hit
>    hardware errors on pages, ignore them - nothing we really can or
>    should do.
> 3. On errors during MADV_POPULATED, some memory might have been
>    populated. Callers have to clean up if they care.

How does the caller find out? madvise() returns 0 on success, so how do
you find out how much has been populated?

> 4. Concurrent changes to the virtual memory layout are tolerated - we
>    process each and every PFN only once, though.

I do not understand this. madvise() is about the virtual address space,
not the physical address space.

> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>    without SIGBUS. (of course, not if user space changed mappings in the
>    meantime or KSM kicked in on anonymous memory).

I do not see how KSM would change anything here, and maybe it is not
really important to mention it. KSM should be completely transparent from
the user space POV. Parallel and destructive virtual address space
operations are also expected to change the outcome; there is nothing the
kernel can do about that, nor can it provide any meaningful guarantees.
I guess we want to assume reasonable userspace behavior here.

> Although sparse memory mappings are the primary use case, this will
> also be useful for ordinary preallocations where MAP_POPULATE is not
> desired (e.g., in QEMU, where users can trigger preallocation of
> guest RAM after the mapping was created).
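
Coming back to the question under item 3 of how a caller could find out
how much has been populated: one thing user space can do today is ask
mincore() after the fact. The sketch below is only an approximation under
stated assumptions. The helper name count_resident_pages() is hypothetical,
and mincore() reports residency rather than "populated by this call", so
pages that were swapped out, or page cache pages that are resident but not
mapped, can make the count misleading.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of [addr, addr + len) are
 * resident according to mincore(). addr must be page-aligned.
 * Returns the number of resident pages, or -1 on error.
 */
static long count_resident_pages(void *addr, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t npages;
	unsigned char *vec;
	long resident = 0;

	assert(page > 0);

	/* mincore() fills one byte per page, rounding len up. */
	npages = (len + (size_t)page - 1) / (size_t)page;
	vec = malloc(npages ? npages : 1);
	if (!vec)
		return -1;

	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}

	/* Bit 0 of each byte is set if the page is resident. */
	for (size_t i = 0; i < npages; i++)
		resident += vec[i] & 1;

	free(vec);
	return resident;
}
```

This costs an extra syscall plus a scan of the vector, and it is racy
against concurrent changes to the mapping, so it does not replace a
proper way of reporting partial progress.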
>
> Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
> however, the main motivation back then was performance improvements
> (which should also still be the case, but it's a secondary concern).

Well, I think it is more of a concern now than in the pre-Spectre era,
when syscalls were quite cheap.

-- 
Michal Hocko
SUSE Labs