Date: Thu, 5 Aug 2021 21:34:20 -0700 (PDT)
From: Hugh Dickins
To: "Kirill A. Shutemov"
cc: Hugh Dickins, Andrew Morton, Shakeel Butt, "Kirill A. Shutemov",
    Yang Shi, Miaohe Lin, Mike Kravetz, Michal Hocko, Rik van Riel,
    Christoph Hellwig, Matthew Wilcox, "Eric W. Biederman",
    Alexey Gladkov, Chris Wilson, Matthew Auld,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-api@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 08/16] huge tmpfs: fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE)
In-Reply-To: <20210804140805.vpuerwaiqtcvc5or@box.shutemov.name>
References: <2862852d-badd-7486-3a8e-c5ea9666d6fb@google.com>
    <1c32c75b-095-22f0-aee3-30a44d4a4744@google.com>
    <20210804140805.vpuerwaiqtcvc5or@box.shutemov.name>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 4 Aug 2021, Kirill A. Shutemov wrote:
> On Fri, Jul 30, 2021 at 12:48:33AM -0700, Hugh Dickins wrote:
> > Add support for fcntl(fd, F_HUGEPAGE) and fcntl(fd, F_NOHUGEPAGE), to
> > select hugeness per file: useful to override the default hugeness of the
> > shmem mount, when occasionally needing to store a hugepage file in a
> > smallpage mount or vice versa.
>
> Hm. But why is the new MFD_* needed if the fcntl() can do the same.

That I've just addressed in the MFD_HUGEPAGE 07/16 thread.

>
> > These fcntls just specify whether or not to try for huge pages when
> > allocating to the object later: F_HUGEPAGE does not touch small pages
> > already allocated (though khugepaged may do so when the file is mapped
> > afterwards), F_NOHUGEPAGE does not split huge pages already allocated.
> >
> > Why fcntl? Because it's already in use (for sealing) on memfds; and I'm
> > anxious to keep this simple, just applying it to whole files: fallocate,
> > madvise and posix_fadvise each involve a range, which would need a new
> > kind of tree attached to the inode for proper support.
>
> Most of fadvise() operations ignore the range. I like fadvise() because
> it's less prescriptive: kernel is free to ignore it.

As to ignoring the range, yes, I see now that some do; and I'm relieved
to see "Len == 0 means as much as possible", that's great, I was afraid
of compat bugs over 0xffy numbers for the len. And we would want, not to
ignore the range, but insist on offset 0, len 0 for now, if there's any
intention (not mine) of extending it to ranges in the future.

As to ignoring the prescription, that's just a matter of how we describe
it in the manpage, no matter whether it's fadvise() or fcntl().

And in the 07/16 thread you also said:
>
> If a tunable needed, I would rather go with fadvise(). It would operate on
> a couple of bits per struct file and they get translated into VM_HUGEPAGE
> and VM_NOHUGEPAGE on mmap().

Not so sure about that detail: the point here is to decide what kind of
allocations to try for, before the file is mmap()ed; and it is the file
(the underlying object) that I want to condition here, rather than the
struct file of who has it open at the time, or their mmap()s.

But adding the flags into the vm_flags on mmap(): that's an interesting
idea, I haven't played with that at all. Offhand, I don't think it will
give different allocation results from what I'm already doing, but might
affect what is shown by default in /proc/<pid>/smaps.

>
> Later if needed fadvise() implementation may be extended to track
> requested ranges. But initially it can be simple.

I still prefer fcntl() myself, but we can go with either: what I'd like
to hear is the preference of linux-fsdevel and linux-api people.

Aside from the unused offset+len, my main problem with fadvise() is that...
it doesn't exist. It's posix_fadvise() or fadvise64() or fadvise64_64(),
and all its good advices are POSIX_MADV_whatever. Are we comfortable now
adding LINUX_MADV_HUGEPAGE, LINUX_MADV_NOHUGEPAGE? I find myself singing
64 64 Zoo Lane.

Hugh
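
For readers following the fadvise() side of the argument, here is a
minimal sketch of the "insist on offset 0, len 0 for now" rule, modelled
in user-space Python rather than kernel C. It is an illustration under
stated assumptions, not code from the patch series: the helper names
check_whole_file_advice and advise_whole_file are hypothetical, and since
no Linux-specific hugepage advice value exists in posix_fadvise() today,
the real call below uses an existing advice value (these are named
POSIX_FADV_* in current headers) purely to show the len == 0 convention.

```python
import errno
import os
import tempfile


def check_whole_file_advice(offset: int, length: int) -> int:
    """Model of the rule discussed in the thread: accept only offset 0, len 0.

    Returns 0 on success or errno.EINVAL for any other range, mirroring how
    a syscall would report a rejected argument. Hypothetical helper, not
    kernel code.
    """
    if offset == 0 and length == 0:
        return 0
    return errno.EINVAL


def advise_whole_file(fd: int, advice: int) -> None:
    """Apply an existing advice value to the whole file.

    For posix_fadvise(), len == 0 means "from offset to the end of the file",
    so (0, 0) covers the whole object.
    """
    os.posix_fadvise(fd, 0, 0, advice)


if __name__ == "__main__":
    # Whole-file request: accepted by the modelled check.
    print(check_whole_file_advice(0, 0) == 0)
    # Ranged request: rejected, leaving room to define ranges later.
    print(check_whole_file_advice(0, 4096) == errno.EINVAL)

    # Real syscall with an existing advice value, applied to the whole file.
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 4096)
        f.flush()
        advise_whole_file(f.fileno(), os.POSIX_FADV_NORMAL)
```

Rejecting non-zero ranges up front, rather than silently ignoring them,
is what would keep the door open for giving ranges a meaning later
without changing the behaviour of existing callers.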