Date: Mon, 16 Nov 2020 13:27:47 -0800 (PST)
From: Hugh Dickins
To: Matthew Wilcox
Cc: Hugh Dickins, Andrew Morton, Jan Kara, William Kucharski,
    Rik van Riel, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
    hch@lst.de, hannes@cmpxchg.org, yang.shi@linux.alibaba.com,
    dchinner@redhat.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v4 00/16] Overhaul multi-page lookups for THP
In-Reply-To: <20201116151444.GB29991@casper.infradead.org>
References: <20201112212641.27837-1-willy@infradead.org> <20201116151444.GB29991@casper.infradead.org>

On Mon, 16 Nov 2020, Matthew Wilcox wrote:
> On Mon, Nov 16, 2020 at 02:34:34AM -0800, Hugh Dickins wrote:
> > On Thu, 12 Nov 2020, Matthew Wilcox (Oracle) wrote:
> > >
> > > This THP prep patchset changes several page cache iteration APIs to only
> > > return head pages.
> > >
> > >  - It's only possible to tag head pages in the page cache, so only
> > >    return head pages, not all their subpages.
> > >  - Factor a lot of common code out of the various batch lookup routines
> > >  - Add mapping_seek_hole_data()
> > >  - Unify find_get_entries() and pagevec_lookup_entries()
> > >  - Make find_get_entries only return head pages, like find_get_entry().
> > >
> > > These are only loosely connected, but they seem to make sense together
> > > as a series.
> > >
> > > v4:
> > >  - Add FGP_ENTRY, remove find_get_entry and find_lock_entry
> > >  - Rename xas_find_get_entry to find_get_entry
> > >  - Add "Optimise get_shadow_from_swap_cache"
> > >  - Move "iomap: Use mapping_seek_hole_data" to this patch series
> > >  - Rebase against next-20201112
> >
> > I hope next-20201112 had nothing vital for this series: I applied
> > it to v5.10-rc3, and have been busy testing huge tmpfs on that.
>
> Thank you.  It's plain I'm not able to hit these cases ... I do run
> xfstests against shmem, but that's obviously not good enough.  Can
> you suggest something I should be running to improve my coverage?

Not quickly.
I'll send you a builds.tar.xz of my tmpfs kernel-builds swapping load,
like the one I sent Alex Shi for his lru_lock testing back in March:
but there's no point in resending exactly that before I've updated its
source patch to suit less ancient compilers.  Sometime in the next week.
It's hard to get working usefully in new environments: probably better
as a source of ideas (like building a kernel in ext4 on loop on huge
tmpfs while swapping) than as something to duplicate and run directly.

I haven't tried LTP on your series yet, but will do so later today.
I install it into a huge tmpfs mounted at /opt/ltp and run it there,
with /sys/kernel/mm/transparent_hugepage/shmem_enabled set to "force"
and khugepaged expedited.  It doesn't usually find anything, but it's
well worth a go.  I'll be doing it based on 5.10-rc3 or rc4: I'm
avoiding mmotm and linux-next at the moment, because I've noticed that
Rik's shmem THP gfp_mask mods have severely limited the use of huge
pages in my testing there.  Maybe I just need to tweak defrag, or maybe
I'll want the patches adjusted - I just haven't looked yet; making your
series safe took priority.  (And aside from Rik's, there seemed to be
something else affecting the use of huge pages in mmotm, not yet
investigated.)

xfstests: an awkward time to ask, since I'm amidst transitioning from
my old internal hack for user xattrs to what I had hoped to get into
the next release (enabling them, but restricting them to the nr_inodes
space) - too late for that now.  So I have a few workaround patches to
the xfstests tree, and a few tmpfs patches to the kernel tree, all in
flux at the moment.  Not that user xattrs are much related to your
series.  If you've got xfstests going on tmpfs, probably the easiest
thing to add is setting /sys/kernel/mm/transparent_hugepage/shmem_enabled
to "force", so that mounts are huge=always by default.

I do have six xfstests failing with your series that passed before;
but I have not had time to analyse them yet, and don't even want to
enumerate them yet.  Some will probably be "not run"s in a standard
tree, so hard for you to try; and maybe all of them will turn out to
depend on different interpretations of small holes in huge pages - not
necessarily anything to worry about.  But I wonder if they might relate
to failures to split: your truncate_inode_partial_page() work is very
nice, but for better or worse it does skip all the retries implicit in
the previous way of proceeding.

> > Fix to [PATCH v4 06/16] mm/filemap: Add helper for finding pages.
> > I hit that VM_BUG_ON_PAGE(!thp_contains) when swapping: it is not
> > safe without page lock, during the interval when shmem is moving a
> > page between page cache and swap cache.  It could be tightened by
> > passing in a new FGP to distinguish whether searching page or swap
> > cache, but I think never tight enough in the swap case - because there
> > is no rule for persisting page->private as there is for page->index.
> > The best I could do is:
>
> I'll just move this out to the caller who actually locks the page:
>
> +++ b/mm/filemap.c
> @@ -1839,7 +1839,6 @@ static inline struct page *find_get_entry(struct xa_state *xas, pgoff_t max,
>  		put_page(page);
>  		goto reset;
>  	}
> -	VM_BUG_ON_PAGE(!thp_contains(page, xas->xa_index), page);
>
>  	return page;
>  reset:
> @@ -1923,6 +1922,8 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
>  				goto put;
>  			if (page->mapping != mapping || PageWriteback(page))
>  				goto unlock;
> +			VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
> +					page);
>  		}
>  		indices[pvec->nr] = xas.xa_index;
>  		if (!pagevec_add(pvec, page))

Okay, up to you: that's obviously much cleaner-looking than what I did;
and if it covers all the cases you're most concerned about, fine (but
your original placing did check some other usages, and I didn't want to
weaken them - though now I notice that e.g. mapping_get_entry() has no
equivalent check, so the find_get_entry() check was exceptional).

> > Fix to [PATCH v4 07/16] mm/filemap: Add mapping_seek_hole_data.
> > Crashed on a swap entry 0x2ff09, fairly obvious...
>
> Whoops.  Thanks.

> > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > One machine ran fine, swapping and building in ext4 on loop0 on huge tmpfs;
> > one machine got occasional pages of zeros in its .os; one machine couldn't
> > get started because of ext4_find_dest_de errors on the newly mkfs'ed fs.
> > The partial_end case was decided by PAGE_SIZE, when there might be a THP
> > there.  The patch below has run well (for not very long), but I could
> > easily have got it slightly wrong, off-by-one or whatever; and I have
> > not looked into the similar code in mm/truncate.c - maybe that will need
> > a similar fix, or maybe not.
> >
> > --- 5103w/mm/shmem.c	2020-11-12 15:46:21.075254036 -0800
> > +++ 5103wh/mm/shmem.c	2020-11-16 01:09:35.431677308 -0800
> > @@ -874,7 +874,7 @@ static void shmem_undo_range(struct inod
> >  	long nr_swaps_freed = 0;
> >  	pgoff_t index;
> >  	int i;
> > -	bool partial_end;
> > +	bool same_page;
> >  
> >  	if (lend == -1)
> >  		end = -1;	/* unsigned, so actually very big */
> > @@ -907,16 +907,12 @@ static void shmem_undo_range(struct inod
> >  			index++;
> >  	}
> >  
> > -	partial_end = ((lend + 1) % PAGE_SIZE) > 0;
> > +	same_page = (lstart >> PAGE_SHIFT) == end;
> >  	page = NULL;
> >  	shmem_getpage(inode, lstart >> PAGE_SHIFT, &page, SGP_READ);
> >  	if (page) {
> > -		bool same_page;
> > -
> >  		page = thp_head(page);
> >  		same_page = lend < page_offset(page) + thp_size(page);
> > -		if (same_page)
> > -			partial_end = false;
>
> I don't object to this patch at all, at least partly because it's shorter
> and simpler!  I don't understand what it's solving, though.  The case
> where there's a THP which covers partial_end is supposed to be handled
> by the three lines above.

I'm glad to hear it's not obvious to you - I was ashamed of how long it
took me to see.  The initial

	partial_end = ((lend + 1) % PAGE_SIZE) > 0;

left partial_end false if the end is nicely PAGE_SIZE aligned.  If the
lstart shmem_getpage() finds a page, THP or small, and the lend is
covered by that same page, all is well.  But if the lend is covered by
a THP, PAGE_SIZE aligned but not page_size() aligned (forgive my
eliding of off-by-ones wrt lend), then partial_end remains false, so
there's no shmem_getpage() of the end, no discovery that it's a THP,
and no attempt to split it as intended.  I think "end" then goes
forward unadjusted, and that whole ending THP is truncated out, when
it should have been split and only its first pages removed.  Hence
zeroes appearing where they should not.
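To make that failure concrete, here is a toy userspace sketch of the two
predicates with hypothetical numbers (PAGE_SIZE 4096, a 2MB THP at file
offset 2MB): it only illustrates the predicate values, and the behaviour
notes in the comments are my reading of the explanation above, not code
from the patch.

/*
 * Toy illustration, userspace, hypothetical numbers - not kernel code:
 * a hole punch whose end is PAGE_SIZE aligned but lands inside a THP.
 */
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_SHIFT	12

int main(void)
{
	unsigned long lstart = 0;		/* punch the first 3MB ... */
	unsigned long lend = 0x300000 - 1;	/* ... so lend + 1 is PAGE_SIZE aligned */
	unsigned long end = (lend + 1) >> PAGE_SHIFT;	/* 768 */

	/*
	 * Suppose a 2MB THP covers bytes 0x200000..0x3fffff (indices
	 * 512..1023): the end of the punched range falls inside it, and
	 * pages 768..1023 must survive, so the THP needs splitting.
	 */

	/*
	 * Old test: 0 here, so the end page is never looked up, the THP
	 * is not split, and bytes 0x300000..0x3fffff are wrongly freed.
	 */
	int partial_end = ((lend + 1) % PAGE_SIZE) > 0;

	/*
	 * New default: 0 here too (page index of lstart != end), which
	 * is what sends the fixed code on to look up, discover and split
	 * the THP at "end".
	 */
	int same_page = (lstart >> PAGE_SHIFT) == end;

	printf("partial_end=%d same_page=%d\n", partial_end, same_page);
	return 0;
}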
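For reference, a generic fallback for the flush_dcache_thp() (or
flush_dcache_folio()) suggested above might look like the sketch below.
This is purely illustrative, not code from the series: the name is the
hypothetical one from the paragraph above, thp_nr_pages() is assumed
available, and an architecture with real dcache aliasing would override
it with a single ranged flush rather than a per-subpage loop.

/*
 * Sketch of a generic fallback for the suggested flush_dcache_thp() -
 * illustrative only, not from the series under discussion.
 */
#include <linux/mm.h>
#include <linux/highmem.h>

static inline void flush_dcache_thp(struct page *page)
{
	unsigned int i, nr = thp_nr_pages(page);	/* 1 for a small page */

	/*
	 * Flush each sub-page of the compound page; an architecture
	 * could replace this with one ranged flush over the whole THP.
	 */
	for (i = 0; i < nr; i++)
		flush_dcache_page(page + i);
}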
> > Fix to [PATCH v4 15/16] mm/truncate,shmem: Handle truncates that split THPs.
> > xfstests generic/012 on huge tmpfs hit this every time (when checking
> > which xfs_io commands are available: it later decides "not run" because
> > there is no "fiemap").  I grabbed this line unthinkingly from one of
> > your later series, and it fixes the crash; but once I actually thought
> > about it while trying to track down weirder behaviours, I realized that
> > the kmap_atomic() and flush_dcache_page() in zero_user_segments() are
> > not prepared for a THP - so highmem and flush_dcache_page architectures
> > will be disappointed.  If I searched through your other series, I might
> > find the complete fix; or perhaps it's already there in linux-next, I
> > haven't looked.
>
> zero_user_segments() is fixed by "mm: Support THPs in zero_user_segments".
> I think most recently posted here:
> https://lore.kernel.org/linux-mm/20201026183136.10404-2-willy@infradead.org/
>
> My fault for not realising this patch depended on that patch.  I did
> test these patches stand-alone, but it didn't trigger this problem.
>
> flush_dcache_page() needs to be called once for each sub-page.  We
> really need a flush_dcache_thp() so that architectures can optimise this.

Right.

> Although maybe now that's going to be called flush_dcache_folio().

Yes, I was delighted to notice the return of "folio":
https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/

> > I also had noise from the WARN_ON(page_to_index(page) != index)
> > in invalidate_inode_pages2_range(): but that's my problem, since
> > for testing I add a dummy shmem_direct_IO() (return 0): for that
> > I've now added a shmem_mapping() check at the head of pages2_range().
>
> Ah, I have a later fix for invalidate_inode_pages2_range():
> https://lore.kernel.org/linux-mm/20201026183136.10404-6-willy@infradead.org/
>
> I didn't post it earlier because there aren't any filesystems currently
> which use THPs and directIO ;-)

Yes, I'd seen your exchange with Jan about that.  And initially I ported
in your later fix, but later decided it was better to decouple my testing
mods from whatever is decided in your series.  If tmpfs does (pretend to)
support directIO, it is correct that invalidate_inode_pages2_range() do
nothing on it: because the pretended IO is to and from the tmpfs backing
store, which is the page cache.

> > That's all for now: I'll fire off more overnight testing.
>
> Thanks!

All running fine.

Hugh