To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Andrew Morton, Arnd Bergmann, Oscar Salvador, Matthew Wilcox, Andrea Arcangeli, Minchan Kim, Jann Horn, Jason Gunthorpe, Dave Hansen, Hugh Dickins, Rik van Riel, "Michael S . Tsirkin", "Kirill A . Shutemov", Vlastimil Babka, Richard Henderson, Ivan Kokshaysky, Matt Turner, Thomas Bogendoerfer, "James E.J. Bottomley", Helge Deller, Chris Zankel, Max Filippov, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, linux-xtensa@linux-xtensa.org, linux-arch@vger.kernel.org
References: <20210217154844.12392-1-david@redhat.com>
From: David Hildenbrand
Organization: Red Hat GmbH
Subject: Re: [PATCH RFC] mm/madvise: introduce MADV_POPULATE to prefault/prealloc memory
Message-ID: <3763a505-02d6-5efe-a9f5-40381acfbdfd@redhat.com>
Date: Thu, 18 Feb 2021 11:44:41 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

On 18.02.21 11:25, Michal Hocko wrote:
> On Wed 17-02-21 16:48:44, David Hildenbrand wrote:
>> When we manage sparse memory mappings dynamically in user space - also
>> sometimes involving MADV_NORESERVE - we want to dynamically populate/
>
> Just wondering what is MADV_NORESERVE? I do not see anything like that
> in the Linus tree. Did you mean MAP_NORESERVE?

Most certainly, thanks :)

>
>> discard memory inside such a sparse memory region.
>> Example users are hypervisors (especially implementing memory ballooning
>> or similar technologies like virtio-mem) and memory allocators. In
>> addition, we want to fail in a nice way if populating does not succeed
>> because we are out of backend memory (which can happen easily with
>> file-based mappings, especially tmpfs and hugetlbfs).
>
> by "fail in a nice way" you mean before a #PF would fail and SIGBUS
> which would be harder to handle?

Yes.

>
> [...]
>> Because we don't have a proper interface, what applications
>> (like QEMU and databases) end up doing is touching (i.e., writing) all
>> individual pages. However, it requires expensive signal handling (SIGBUS);
>> for example, this is problematic in hypervisors like QEMU where SIGBUS
>> handlers might already be used by other subsystems concurrently to, e.g.,
>> handle hardware errors. "Simply" doing preallocation from another thread
>> is not that easy.
>
> OK, that clarifies my above question.
>
>>
>> Let's introduce MADV_POPULATE with the following semantics
>> 1. MADV_POPULATE does not work on PROT_NONE and special VMAs. It works
>>    on everything else.
>
> This would better clarify what "does not work" means. I assume those are
> ignored and do not report any error?

I'm currently preparing the man page. The wording there is "fail with
-ENOMEM" (like MADV_DONTNEED or MADV_REMOVE).

>
>> 2. Errors during MADV_POPULATE (especially OOM) are reported.
>
> How do you want to achieve that? gup/page fault handler will allocate
> memory and trigger the oom without caller noticing that. You would
> somehow have to weaken the allocation context to GFP_RETRY_MAYFAIL or
> NORETRY to achieve the error handling.

Okay, I should be clearer here (again, I'm realizing this as well while
I create the man page). "OOM" is confusing: the point is to avoid SIGBUS
at runtime - like we would get on actual file systems/shmem/hugetlbfs
when preallocating. It cannot save us from the actual OOM killer.
To handle anonymous memory more reliably I'll need other means as well
(dynamic swap space allocation for sparse mappings).

>
>> If we hit
>>    hardware errors on pages, ignore them - nothing we really can or
>>    should do.
>> 3. On errors during MADV_POPULATE, some memory might have been
>>    populated. Callers have to clean up if they care.
>
> How does caller find out? madvise reports 0 on success so how do you
> find out how much has been populated?

If there is an error, something might have been populated. In my QEMU
implementation, I simply discard the range again, which is good enough.
I don't think we really need to indicate "error and populated" or "error
and not populated".

>
>> 4. Concurrent changes to the virtual memory layout are tolerated - we
>>    process each and every PFN only once, though.
>
> I do not understand this. madvise is about virtual address space not a
> physical address space.

What I wanted to express: if we detect a change in the mapping we don't
restart at the beginning, we always make forward progress. We process
each virtual address once (on a per-page basis, thus I accidentally used
"PFN").

>
>> 5. If MADV_POPULATE succeeds, all memory in the range can be accessed
>>    without SIGBUS. (of course, not if user space changed mappings in the
>>    meantime or KSM kicked in on anonymous memory).
>
> I do not see how KSM would change anything here and maybe it is not
> really important to mention it. KSM should be really transparent from
> the user space POV. Parallel and destructive virtual address space
> operations are also expected to change the outcome and there is nothing
> the kernel can do about that to provide any meaningful guarantees. I
> guess we want to assume a reasonable userspace behavior here.

It's just a note that we cannot protect from someone interfering
(discard/ksm/whatever). I'm making that clearer in the cover letter.

Thanks!

-- 
Thanks,

David / dhildenb
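
For illustration, the "populate, and discard the range again on error" pattern discussed above could look roughly like this in user space. This is only a sketch under assumptions, not the QEMU code: it uses MADV_POPULATE_WRITE (the name under which this functionality was later merged into mainline, value 23 in the generic Linux UAPI headers) as a stand-in for the RFC's MADV_POPULATE, and the helper names map_sparse() and populate_or_discard() are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Assumption: stand-in for the RFC's MADV_POPULATE. Older libc headers may
 * not define it; 23 is the value in the generic Linux UAPI headers.
 */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Hypothetical helper: create a sparse, private anonymous mapping without
 * reserving swap space up front (MAP_NORESERVE). Returns NULL on failure.
 */
static void *map_sparse(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: try to populate [addr, addr + len). Returns 0 on
 * success or a negative errno value on failure. On failure, part of the
 * range may already be populated, so the whole range is discarded again.
 */
static int populate_or_discard(void *addr, size_t len)
{
    int err;

    assert(addr != NULL && len > 0);

    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;

    err = errno; /* save before the cleanup call can overwrite it */

    /*
     * MADV_DONTNEED is sufficient for private anonymous memory only;
     * the cleanup result is intentionally ignored in this sketch.
     */
    (void)madvise(addr, len, MADV_DONTNEED);
    return -err;
}
```

On a kernel that does not know the flag, madvise() fails with EINVAL and the helper returns -EINVAL after discarding. For shmem/hugetlbfs-backed ranges, MADV_DONTNEED does not free the backing pages, so a caller would need MADV_REMOVE or fallocate() with FALLOC_FL_PUNCH_HOLE instead.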