Date: Fri, 29 Sep 2023 13:33:03 +0100
From: Ryan Roberts
To: Zach O'Keefe
Cc: Bagas Sanjaya, Hugh Dickins, David Hildenbrand, Matthew Wilcox, Chandan Babu R, "Darrick J. Wong", Linux Memory Management List, Linux XFS, Linux Kernel Mailing List, Yu Zhao
Subject: Re: BUG: MADV_COLLAPSE doesn't work for XFS files
References: <4d6c9b19-cdbb-4a00-9a40-5ed5c36332e5@arm.com> <54e5accf-1a56-495a-a4f5-d57504bc2fc8@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 28/09/2023 20:43, Zach O'Keefe wrote:
> Hey Ryan,
>
> Thanks for bringing this up.
>
> On Thu, Sep 28, 2023 at 4:59 AM Ryan Roberts wrote:
>>
>> On 28/09/2023 11:54, Bagas Sanjaya wrote:
>>> On Thu, Sep 28, 2023 at 10:55:17AM +0100, Ryan Roberts wrote:
>>>> Hi all,
>>>>
>>>> I've just noticed that when applied to a file mapping for a file on xfs,
>>>> MADV_COLLAPSE returns EINVAL. The same test case works fine if the file is
>>>> on ext4.
>>>>
>>>> I think the root cause is that the implementation bails out if it finds a
>>>> (non-PMD-sized) large folio in the page cache for any part of the file
>>>> covered by the region. XFS does readahead into large folios so we hit this
>>>> issue. See khugepaged.c:collapse_file():
>>>>
>>>>         if (PageTransCompound(page)) {
>>>>                 struct page *head = compound_head(page);
>>>>
>>>>                 result = compound_order(head) == HPAGE_PMD_ORDER &&
>>>>                                 head->index == start
>>>>                                 /* Maybe PMD-mapped */
>>>>                                 ? SCAN_PTE_MAPPED_HUGEPAGE
>>>>                                 : SCAN_PAGE_COMPOUND;
>>>>                 goto out_unlock;
>>>>         }
>>>
>
> Ya, non-PMD-sized THPs were just barely visible in my peripherals when
> writing this, and I'm still woefully behind on your work on them now
> (sorry!).

Nothing to apologise for!
Although, this issue has no relation to the work I've been doing for anonymous
large folios; it shows up for large _file_ folios. And it looks like the kernel
was capable of doing large file folios for XFS before the collapse
implementation went in, so I guess this behavior has always been the case:

git rev-list --no-walk=sorted --pretty=oneline \
        793917d997df2e432f3e9ac126e4482d68256d01 \
        6795801366da0cd3d99e27c37f020a8f16714886 \
        8549a26308f945bddb39391643eb102da026f0ef \
        e6687b89225ee9c817e6dcbadc873f6a4691e5c2 \
        7d8faaf155454f8798ec56404faca29a82689c77

7d8faaf155454f8798ec56404faca29a82689c77 mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse
793917d997df2e432f3e9ac126e4482d68256d01 mm/readahead: Add large folio readahead
6795801366da0cd3d99e27c37f020a8f16714886 xfs: Support large folios

So first, XFS supported it, then readahead actually started allocating large
folios, then MADV_COLLAPSE came along.

>
> I'd like to eventually make collapse (not just MADV_COLLAPSE, but
> khugepaged too) support arbitrary-sized large folios in general, but
> I'm very pressed for time right now. I think M. Wilcox is also
> interested in this, given he left the TODO to support it :P

Yes, I think this could be a useful capability. I'm currently investigating use
of MADV_COLLAPSE as a work-around to get executable sections into large folios
for file systems that don't natively support them (ext4 mainly). On arm64,
having executable memory in 64K folios means we can make better use of the iTLB
and improve performance.

>
> Thank you for the reproducer though! I haven't run it, but I'll
> probably come back here to steal it when the time comes.
>
>>> I don't see any hint to -EINVAL above. Am I missing something?
>>
>> The SCAN_PAGE_COMPOUND result ends up back at madvise_collapse() where it
>> eventually gets converted to -EINVAL by madvise_collapse_errno().
>>
>>>
>>>>
>>>> I'm not sure if this is already a known issue?
>>>> I don't have time to work on a fix for this right now, so thought I would
>>>> highlight it at least. I might get around to it at some point in the future
>>>> if nobody else tackles it.
>
> My guess is Q1 2024 is when I'd be able to look into this, at the
> current level of urgency. It doesn't sound like it's blocking anything
> for your work right now -- lmk if that changes though!

No - it's not a blocker for me. I just wanted to highlight the issue.

>
> Thanks,
> Zach
>
>
>
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>
>>>> Test case I've been using:
>>>>
>>>> -->8--
>>>>
>>>> #include <fcntl.h>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <sys/mman.h>
>>>> #include <unistd.h>
>>>>
>>>> #ifndef MADV_COLLAPSE
>>>> #define MADV_COLLAPSE 25
>>>> #endif
>>>>
>>>> #define handle_error(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
>>>>
>>>> #define SZ_1K 1024
>>>> #define SZ_1M (SZ_1K * SZ_1K)
>>>> #define ALIGN(val, align) (((val) + ((align) - 1)) & ~((align) - 1))
>>>>
>>>> #if 1
>>>> // ext4
>>>> #define DATA_FILE "/home/ubuntu/data.txt"
>>>> #else
>>>> // xfs
>>>> #define DATA_FILE "/boot/data.txt"
>>>> #endif
>>>>
>>>> int main(void)
>>>> {
>>>>         int fd;
>>>>         char *mem;
>>>>         int ret;
>>>>
>>>>         fd = open(DATA_FILE, O_RDONLY);
>>>>         if (fd == -1)
>>>>                 handle_error("open");
>>>>
>>>>         mem = mmap(NULL, SZ_1M * 4, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0);
>>>>         close(fd);
>>>>         if (mem == MAP_FAILED)
>>>>                 handle_error("mmap");
>>>>
>>>>         printf("1: pid=%d, mem=%p\n", getpid(), mem);
>>>>         getchar();
>>>>
>>>>         mem = (char *)ALIGN((unsigned long)mem, SZ_1M * 2);
>>>>         ret = madvise(mem, SZ_1M * 2, MADV_COLLAPSE);
>>>>         if (ret)
>>>>                 handle_error("madvise");
>>>>
>>>>         printf("2: pid=%d, mem=%p\n", getpid(), mem);
>>>>         getchar();
>>>>
>>>>         return 0;
>>>> }
>>>>
>>>> -->8--
>>>>
>>>
>>> Confused...
>>
>> This is a user space test case that shows the problem; data.txt needs to be at
>> least 4MB and on a mounted ext4 and xfs filesystem.
>> By toggling the '#if 1' to 0, you can see the different behaviours for ext4
>> and xfs - handle_error("madvise") fires with EINVAL in the xfs case. The
>> getchar()s are leftovers from me looking at the smaps file.
>>
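
The getchar() pauses in the test case exist so that the smaps file can be inspected by hand. A minimal sketch of automating that inspection follows, assuming the running kernel reports a "FilePmdMapped" field in /proc/<pid>/smaps. The helper names smaps_sum_kb() and read_all() are hypothetical and not part of the original test case. The sketch sums the field across every mapping in the process rather than isolating the collapsed region, so it is only a coarse check.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum every "<field>: <n> kB" entry found in a buffer of smaps-style text.
 * Only matches at the start of a line, so "ShmemPmdMapped" is not counted
 * when asking for "FilePmdMapped". Returns the total in kB (0 if absent).
 */
long smaps_sum_kb(const char *text, const char *field)
{
	size_t flen;
	long total = 0;
	const char *line = text;

	assert(text != NULL && field != NULL);
	flen = strlen(field);

	while (*line) {
		const char *eol = strchr(line, '\n');

		/* strtol() skips the padding spaces after the colon. */
		if (strncmp(line, field, flen) == 0 && line[flen] == ':')
			total += strtol(line + flen + 1, NULL, 10);

		if (!eol)
			break;
		line = eol + 1;
	}
	return total;
}

/*
 * Read a whole file into a malloc'd, NUL-terminated buffer. procfs files
 * report a size of 0, so the file is read in chunks until EOF.
 * Returns NULL on failure; the caller frees the result.
 */
char *read_all(const char *path)
{
	FILE *f = fopen(path, "r");
	size_t cap = 1 << 16, len = 0, n;
	char *buf, *tmp;

	if (!f)
		return NULL;

	buf = malloc(cap);
	while (buf && (n = fread(buf + len, 1, cap - len - 1, f)) > 0) {
		len += n;
		if (len + 1 >= cap) {
			/* Buffer full: double it, keeping room for the NUL. */
			cap *= 2;
			tmp = realloc(buf, cap);
			if (!tmp)
				free(buf);
			buf = tmp;
		}
	}
	fclose(f);
	if (buf)
		buf[len] = '\0';
	return buf;
}
```

In the test case, one could call read_all("/proc/self/smaps") after madvise() returns and print smaps_sum_kb(buf, "FilePmdMapped") in place of the second getchar(); a non-zero value would indicate that some file-backed memory in the process is PMD-mapped.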