From: Axel Rasmussen
Date: Thu, 27 Jan 2022 09:52:32 -0800
Subject: Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
To: David Hildenbrand
Cc: Mike Kravetz, LKML, Linux MM, Michal Hocko, Naoya Horiguchi,
    Peter Xu, Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton
References: <20220113180308.15610-1-mike.kravetz@oracle.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand wrote:
>
> On 13.01.22 19:03, Mike Kravetz wrote:
> > Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
> > testing.  However, mremap support was recently added in commit
> > 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
> > vma").  While attempting to enable mremap support in the test, it was
> > discovered that the mremap test indirectly depends on MADV_DONTNEED.
> >
> > hugetlb does not support MADV_DONTNEED.  However, the only thing
> > preventing support is a check in can_madv_lru_vma().  Simply removing
> > the check will enable support.
> >
> > This is sent as a RFC because there is no existing use case calling
> > for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
> > However, adding support makes sense as it is fairly trivial and brings
> > hugetlb functionality more in line with 'normal' memory.
> >
>
> Just a note:
>
> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
> but instead always goes either via hugetlbfs or via memfd.
>
> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
> to get the job done (IOW: also discards private anon pages). See the
> comments in the QEMU code below. I remember that that is somewhat
> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
> to make sure
>
> a) All file pages are removed
> b) All private anon pages are removed
>
> IIRC hugetlbfs really is different in that regard, but maybe other fs
> behave similarly.
>
> That's why QEMU was able to live for now without MADV_DONTNEED support
> for hugetlbfs and most probably won't ever need it.

Agreed, all of the production use cases I'm aware of use hugetlbfs,
not MAP_HUGE...

But, I would say this is convenient for testing purposes. It's
slightly more convenient to not have to mount hugetlbfs / perform the
associated setup for tests. Perhaps that's only a small motivation for
enabling this, but then again Mike's patch to do so is likewise very
small. :)

>
>
> ...
>     /* The logic here is messy;
>      *    madvise DONTNEED fails for hugepages
>      *    fallocate works on hugepages and shmem
>      *    shared anonymous memory requires madvise REMOVE
>      */
>     need_madvise = (rb->page_size == qemu_host_page_size);
>     need_fallocate = rb->fd != -1;
>     if (need_fallocate) {
>         /* For a file, this causes the area of the file to be zero'd
>          * if read, and for hugetlbfs also causes it to be unmapped
>          * so a userfault will trigger.
>          */
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>         ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                         start, length);
>         if (ret) {
>             ret = -errno;
>             error_report("ram_block_discard_range: Failed to fallocate "
>                          "%s:%" PRIx64 " +%zx (%d)",
>                          rb->idstr, start, length, ret);
>             goto err;
>         }
> #else
>         ret = -ENOSYS;
>         error_report("ram_block_discard_range: fallocate not available/file"
>                      "%s:%" PRIx64 " +%zx (%d)",
>                      rb->idstr, start, length, ret);
>         goto err;
> #endif
>     }
>     if (need_madvise) {
>         /* For normal RAM this causes it to be unmapped,
>          * for shared memory it causes the local mapping to disappear
>          * and to fall back on the file contents (which we just
>          * fallocate'd away).
>          */
> #if defined(CONFIG_MADVISE)
>         if (qemu_ram_is_shared(rb) && rb->fd < 0) {
>             ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
>         } else {
>             ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
>         }
>         if (ret) {
>             ret = -errno;
>             error_report("ram_block_discard_range: Failed to discard range "
>                          "%s:%" PRIx64 " +%zx (%d)",
>                          rb->idstr, start, length, ret);
>             goto err;
>         }
> #else
> ...
>
> --
> Thanks,
>
> David / dhildenb
>
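
To make the testing point above concrete, here is a rough sketch of what
a test could do with an anonymous MAP_HUGETLB mapping, with no hugetlbfs
mount involved. It is not taken from the patch series or from the
userfaultfd selftest: try_hugetlb_dontneed() and HPAGE_SIZE are
hypothetical names, and the 2 MiB huge page size is an assumption (the
x86-64 default; other configurations differ). It also assumes the huge
page pool has at least one free page; otherwise mmap() fails and the
helper reports that.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Assumed huge page size: 2 MiB. Adjust for other configurations. */
#define HPAGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns  1 if MADV_DONTNEED succeeded and the page reads back as zero,
 *          0 if madvise() was rejected or the old contents survived,
 *         -1 if no anonymous huge page mapping could be created
 *            (for example, the huge page pool is empty).
 */
static int try_hugetlb_dontneed(void)
{
    volatile unsigned char *p;
    int result;

    /* Anonymous huge page mapping: no hugetlbfs mount is needed. */
    p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* Fault the huge page in and leave a marker in it. */
    p[0] = 0xaa;

    if (madvise((void *)p, HPAGE_SIZE, MADV_DONTNEED) != 0) {
        /* A kernel without hugetlb MADV_DONTNEED support fails here. */
        result = 0;
    } else {
        /* If the page was discarded, this read faults in a zeroed page. */
        result = (p[0] == 0) ? 1 : 0;
    }

    munmap((void *)p, HPAGE_SIZE);
    return result;
}
```

A test would call try_hugetlb_dontneed() and skip on -1 rather than fail,
since that only means no huge page was available to map.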