From: Suren Baghdasaryan
Date: Fri, 15 Oct 2021 09:30:09 -0700
Subject: Re: [PATCH v10 3/3] mm: add anonymous vma name refcounting
To: David Hildenbrand
Cc: Michal Hocko, Kees Cook, Pavel Machek, Rasmus Villemoes, John Hubbard,
    Andrew Morton, Colin Cross, Sumit Semwal, Dave Hansen, Matthew Wilcox,
    "Kirill A . Shutemov", Vlastimil Babka, Johannes Weiner, Jonathan Corbet,
    Al Viro, Randy Dunlap, Kalesh Singh, Peter Xu, rppt@kernel.org,
    Peter Zijlstra, Catalin Marinas, vincenzo.frascino@arm.com,
    Chinwen Chang (張錦文), Axel Rasmussen, Andrea Arcangeli, Jann Horn,
    apopple@nvidia.com, Yu Zhao, Will Deacon, fenghua.yu@intel.com,
    thunder.leizhen@huawei.com, Hugh Dickins, feng.tang@intel.com,
    Jason Gunthorpe, Roman Gushchin, Thomas Gleixner, krisman@collabora.com,
    Chris Hyser, Peter Collingbourne, "Eric W. Biederman", Jens Axboe,
    legion@kernel.org, Rolf Eike Beer, Cyrill Gorcunov, Muchun Song,
    Viresh Kumar, Thomas Cedeno, sashal@kernel.org, cxfcosmos@gmail.com,
    LKML, linux-fsdevel@vger.kernel.org, linux-doc@vger.kernel.org,
    linux-mm, kernel-team
References: <92cbfe3b-f3d1-a8e1-7eb9-bab735e782f6@rasmusvillemoes.dk>
    <20211007101527.GA26288@duo.ucw.cz>
    <202110071111.DF87B4EE3@keescook>
    <202110081344.FE6A7A82@keescook>
    <26f9db1e-69e9-1a54-6d49-45c0c180067c@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Oct 15, 2021 at 1:04 AM David Hildenbrand wrote:
>
> On 14.10.21 22:16, Suren Baghdasaryan wrote:
> > On Tue, Oct 12, 2021 at 10:01 AM Suren Baghdasaryan wrote:
> >>
> >> On Tue, Oct 12, 2021 at 12:44 AM David Hildenbrand wrote:
> >>>
> >>>> I'm still evaluating the proposal to use memfds but I'm not sure if
> >>>> the issue that David Hildenbrand mentioned about additional memory
> >>>> consumed in pagecache (which has to be addressed) is the only one we
> >>>> will encounter with this approach. If anyone knows of any potential
> >>>> issues with using memfds as named anonymous memory, I would really
> >>>> appreciate your feedback before I go too far in that direction.
> >>>
> >>> [MAP_PRIVATE memfd only behave that way with 4k, not with huge pages, so
> >>> I think it just has to be fixed. It doesn't make any sense to allocate a
> >>> page for the pagecache ("populate the file") when accessing via a
> >>> private mapping that's supposed to leave the file untouched]
> >>>
> >>> My gut feeling is if you really need a string as identifier, then try
> >>> going with memfds. Yes, we might hit some road blocks to be sorted out,
> >>> but it just logically makes sense to me: Files have names. These names
> >>> exist before mapping and after mapping. They "name" the content.
> >>
> >> I'm investigating this direction. I don't have much background with
> >> memfds, so I'll need to digest the code first.
> >
> > I've done some investigation into the possibility of using memfds to
> > name anonymous VMAs. Here are my findings:
>
> Thanks for exploring the alternatives!

Thanks for pointing to them!

> >
> > 1. Forking a process with anonymous vmas named using memfd is 5-15%
> > slower than with prctl (depends on the number of VMAs in the process
> > being forked). Profiling shows that i_mmap_lock_write() dominates
> > dup_mmap(). Exit path is also slower by roughly 9% with
> > free_pgtables() and fput() dominating exit_mmap(). Fork performance is
> > important for Android because almost all processes are forked from
> > zygote, therefore this limitation already makes this approach
> > prohibitive.
>
> Interesting, naturally I wonder if that can be optimized.

Maybe, but it looks like we simply do additional work for file-backed
memory, which seems natural. The call to i_mmap_lock_write() is from
here: https://elixir.bootlin.com/linux/latest/source/kernel/fork.c#L565

> >
> > 2. mremap() usage to grow the mapping has an issue when used with memfds:
> >
> > fd = memfd_create(name, MFD_ALLOW_SEALING);
> > ftruncate(fd, size_bytes);
> > ptr = mmap(NULL, size_bytes, prot, MAP_PRIVATE, fd, 0);
> > close(fd);
> > ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
> > touch_mem(ptr, size_bytes * 2);
> >
> > This would generate a SIGBUS in touch_mem(). I believe it's because
> > ftruncate() specified the size to be size_bytes and we are accessing
> > more than that after remapping. prctl() does not have this limitation
> > and we do have a usecase for growing a named VMA.
>
> Can't you simply size the memfd much larger? I mean, it doesn't really
> cost much, does it?

If we knew beforehand the maximum size it could reach, then that would be
possible. I would really hate to miscalculate here and have a simple
memory access generate signals. Tracking down such corner cases in the
field is not an easy task and I would rather avoid the possibility
altogether.

> >
> > 3. Leaves an fd exposed, even briefly, which may lead to unexpected
> > flaws (e.g. anything using mmap MAP_SHARED could allow exposures or
> > overwrites).
> > Even MAP_PRIVATE, if an attacker writes into the file
> > after ftruncate() and before mmap(), can cause private memory to be
> > initialized with unexpected data.
>
> I don't quite follow. Can you elaborate what exactly the issue here is?
> We use a temporary fd, yes, but how is that a problem?
>
> Any attacker can just write any random memory in the address
> space, so I don't see the issue.

It feels to me that introducing another handle to the memory region
creates a potential attack vector, but I'm not a security expert. Maybe
Kees can assess this better?

> >
> > 4. There is a usecase in the Android userspace where vma naming
> > happens after memory was allocated. Bionic linker does in-memory
> > relocations and then names some relocated sections.
>
> Would renaming a memfd be an option or is that "too late" ?

My understanding is that the linker allocates space to load and relocate
the code, performs the relocations in that space, and only then names
some of the regions. Whether it can be redesigned to allocate multiple
named regions and perform the relocation between them, I did not really
try to find out, since that would be a project by itself.

TBH, at some point I just look at the amount of required changes (both
kernel and userspace) and the new limitations that userspace has to
adhere to in order to fit memfds to my usecase, and I feel that it's just
not worth it. In the end we would end up using the same refcounted
strings, with vma->vm_file->f_count as the refcount and the name stored
in vma->vm_file->f_path.dentry, but with more overhead.
Thanks,
Suren.

>
> >
> > In the light of these findings, could the current patchset be reconsidered?
> > Thanks,
> > Suren.
> >
>
>
> --
> Thanks,
>
> David / dhildenb
>
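
For reference, the mremap() behaviour described in point 2 can be
reproduced with a small standalone program along the following lines.
This is only a sketch, not code from the patchset: touch_mem() and
grow_memfd_mapping() are hypothetical helpers, the memfd name "demo" is
arbitrary, and it assumes Linux with a glibc that provides
memfd_create() (2.27 or later). The sequence runs in a forked child so
that the caller can see which signal, if any, terminated it. Passing a
nonzero grow_file keeps the fd open and extends the file before the
access, which roughly corresponds to the "size the memfd larger"
suggestion.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: write one byte to every page in [ptr, ptr + len). */
static void touch_mem(volatile char *ptr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    for (size_t off = 0; off < len; off += page)
        ptr[off] = 1;
}

/*
 * Hypothetical helper: map a memfd of size_bytes with MAP_PRIVATE, grow the
 * mapping to twice that size with mremap(), then touch all of it.
 * Returns the signal that killed the child, 0 if the child finished
 * cleanly, or -1 if the setup itself failed.
 */
static int grow_memfd_mapping(size_t size_bytes, int grow_file)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(size_bytes != 0 && size_bytes % page == 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        int fd = memfd_create("demo", MFD_ALLOW_SEALING);
        if (fd < 0 || ftruncate(fd, (off_t)size_bytes) != 0)
            _exit(2);

        void *ptr = mmap(NULL, size_bytes, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE, fd, 0);
        if (ptr == MAP_FAILED)
            _exit(2);

        ptr = mremap(ptr, size_bytes, size_bytes * 2, MREMAP_MAYMOVE);
        if (ptr == MAP_FAILED)
            _exit(2);

        /* Optionally extend the file so the new pages are within its size. */
        if (grow_file && ftruncate(fd, (off_t)(size_bytes * 2)) != 0)
            _exit(2);
        close(fd);

        /* Pages past the end of the file fault with SIGBUS. */
        touch_mem(ptr, size_bytes * 2);
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Called with a page-multiple size, grow_memfd_mapping(size, 0) should
report SIGBUS, while grow_memfd_mapping(size, 1) should return 0. The
second case only works because the fd is still open when the mapping
grows, which is exactly the kind of bookkeeping the prctl() approach does
not require.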