Received: by 2002:a05:6a10:f3d0:0:0:0:0 with SMTP id a16csp3730001pxv; Mon, 28 Jun 2021 11:20:49 -0700 (PDT) X-Google-Smtp-Source: ABdhPJyIk32dJA5dZ2rreumcq8ndHZeOKvfC3MKFHBKLgrNuwQk5cxd7qWFPTj6BILxSjxUcbCou X-Received: by 2002:a17:907:728a:: with SMTP id dt10mr25652956ejc.280.1624904448857; Mon, 28 Jun 2021 11:20:48 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1624904448; cv=none; d=google.com; s=arc-20160816; b=DVtuD3x0h1fRAshf7EONTfqQRRqaOW32SsRNy2Lv5dY8h4az51NUsMyzFo2LbujzHK qy6OSQBubYADuSj2EVKk7foksVpWyUWtw/vhwKnWZm7CIFtRyrLjqZWfLkhkpgoWSON8 mb+AUyL2ndQjszRsrAGDLhOVH+DMBbQa+OXJ4WXlAum9PfeytUGupfrfSFiJS+autp1F Le8CG7QXoV84N/+tTcUSPQzeKJy3TgDOqhj6bA026hKIf2c/MZj5TPB5bpwfJM7e4VFZ RLF4riW228cfwFCD/+oybkmwGmr2ApYLJ0+e5/kCLhNitb0Ngm/B+6JtRxI0jJeFYTW8 8sbA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=rthHQGdmLlW2XjYfc/KWqFknPZpKIpL1w8Huhp6xvpE=; b=tBiHl9HLL1+hvcg0V9tPfF+93IbZOdHd0LqFg9PeVeFA3Hw/D2C1D1kwikvFd6JqCx oZ77c26A9g/ZSHRROXpnmC+bfChMnzlLF90NO2a5r4aKVsME8io4o3XhxAq6ZOdTuMaD KBvLkVDwqZ5g8MyDoVEZ46iRYh3PfuiJeBinAuBUzRqeKSlCwnJCwkXUMkyTsdsOodsn Bpu0vfhOrRIu75bw18a+ZmSygCPvYpEWeE/PlQAk93aoKQ7iCC+rqzWC6FdzN6vWvN/V g+OWlNRqdyk+HhdqlsQK4k0JV7pHC+/+MLKaIMU562RUkNOx4FCcQu0ikCyj7TXjCDEb sePQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=KuGzplwx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id 4si14875775ejr.444.2021.06.28.11.20.23; Mon, 28 Jun 2021 11:20:48 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=KuGzplwx; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235559AbhF1OkI (ORCPT + 99 others); Mon, 28 Jun 2021 10:40:08 -0400 Received: from mail.kernel.org ([198.145.29.99]:36732 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234217AbhF1Ocl (ORCPT ); Mon, 28 Jun 2021 10:32:41 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 252C961CD4; Mon, 28 Jun 2021 14:27:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1624890446; bh=8H3OYQOIj1iLD2E93EpnIWzqpZKKmP8S9n54WjIi2kM=; h=From:To:Cc:Subject:Date:In-Reply-To:References:From; b=KuGzplwxfGKxANU+ZiGgtgzXniYWNosEUVB9PibXcmc/KjWxjFRX4eANODLnJr1Ty p+V6RBkwaSHws71tei40dEEfVdG9OHOZ5e8ZTAQCxfAV+ezZNBFb7ZztM3I5VzOl9o /GBt0IZi4xemgJsXZvrHUcJZHTVv5ZKvp97mBfx4Kh9fbtUvDqYynA7V15qR2ti+4q vdVZVkSz6yMWlzcc1PUDBokrXLGXthTN7hney44HJAfYqqXd7644+03bdSf/4y00UB FuJaGejC4z54040JIDlUKIAt/kGTW80mGZGMIMQCu9BpS63BmfIr2BIf4Pv8OyARNh pcSnDE72JFmjA== From: Sasha Levin To: linux-kernel@vger.kernel.org, stable@vger.kernel.org Cc: Hugh Dickins , "Kirill A . Shutemov" , Yang Shi , Alistair Popple , Jan Kara , Jue Wang , "Matthew Wilcox (Oracle)" , Miaohe Lin , Minchan Kim , Naoya Horiguchi , Oscar Salvador , Peter Xu , Ralph Campbell , Shakeel Butt , Wang Yugui , Zi Yan , Andrew Morton , Linus Torvalds , Greg Kroah-Hartman Subject: [PATCH 5.10 080/101] mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() Date: Mon, 28 Jun 2021 10:25:46 -0400 Message-Id: <20210628142607.32218-81-sashal@kernel.org> X-Mailer: git-send-email 2.30.2 In-Reply-To: <20210628142607.32218-1-sashal@kernel.org> References: <20210628142607.32218-1-sashal@kernel.org> MIME-Version: 1.0 X-KernelTest-Patch: http://kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.47-rc1.gz X-KernelTest-Tree: git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git X-KernelTest-Branch: linux-5.10.y X-KernelTest-Patches: git://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git X-KernelTest-Version: 5.10.47-rc1 X-KernelTest-Deadline: 2021-06-30T14:25+00:00 X-stable: review X-Patchwork-Hint: Ignore Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org From: Hugh Dickins [ Upstream commit 22061a1ffabdb9c3385de159c5db7aac3a4df1cc ] There is a race between THP unmapping and truncation, when truncate sees pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared it, but before its page_remove_rmap() gets to decrement compound_mapcount: generating false "BUG: Bad page cache" reports that the page is still mapped when deleted. This commit fixes that, but not in the way I hoped. The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK) instead of unmap_mapping_range() in truncate_cleanup_page(): it has often been an annoyance that we usually call unmap_mapping_range() with no pages locked, but there apply it to a single locked page. try_to_unmap() looks more suitable for a single locked page. However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page): it is used to insert THP migration entries, but not used to unmap THPs. Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB needs are different, I'm too ignorant of the DAX cases, and couldn't decide how far to go for anon+swap. Set that aside. The second attempt took a different tack: make no change in truncate.c, but modify zap_huge_pmd() to insert an invalidated huge pmd instead of clearing it initially, then pmd_clear() between page_remove_rmap() and unlocking at the end. Nice. But powerpc blows that approach out of the water, with its serialize_against_pte_lookup(), and interesting pgtable usage. It would need serious help to get working on powerpc (with a minor optimization issue on s390 too). Set that aside. Just add an "if (page_mapped(page)) synchronize_rcu();" or other such delay, after unmapping in truncate_cleanup_page()? Perhaps, but though that's likely to reduce or eliminate the number of incidents, it would give less assurance of whether we had identified the problem correctly. This successful iteration introduces "unmap_mapping_page(page)" instead of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route, with an addition to details. Then zap_pmd_range() watches for this case, and does spin_unlock(pmd_lock) if so - just like page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but safe. Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to assert its interface; but currently that's only used to make sure that page->mapping is stable, and zap_pmd_range() doesn't care if the page is locked or not. Along these lines, in invalidate_inode_pages2_range() move the initial unmap_mapping_range() out from under page lock, before then calling unmap_mapping_page() under page lock if still mapped. Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com Fixes: fc127da085c2 ("truncate: handle file thp") Signed-off-by: Hugh Dickins Acked-by: Kirill A. Shutemov Reviewed-by: Yang Shi Cc: Alistair Popple Cc: Jan Kara Cc: Jue Wang Cc: "Matthew Wilcox (Oracle)" Cc: Miaohe Lin Cc: Minchan Kim Cc: Naoya Horiguchi Cc: Oscar Salvador Cc: Peter Xu Cc: Ralph Campbell Cc: Shakeel Butt Cc: Wang Yugui Cc: Zi Yan Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds Note on stable backport: fixed up call to truncate_cleanup_page() in truncate_inode_pages_range(). Signed-off-by: Hugh Dickins Signed-off-by: Greg Kroah-Hartman --- include/linux/mm.h | 3 +++ mm/memory.c | 41 +++++++++++++++++++++++++++++++++++++++++ mm/truncate.c | 43 +++++++++++++++++++------------------------ 3 files changed, 63 insertions(+), 24 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5106db3ad1ce..289c26f055cd 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1648,6 +1648,7 @@ struct zap_details { struct address_space *check_mapping; /* Check page->mapping if set */ pgoff_t first_index; /* Lowest page->index to unmap */ pgoff_t last_index; /* Highest page->index to unmap */ + struct page *single_page; /* Locked page to be unmapped */ }; struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, @@ -1695,6 +1696,7 @@ extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, extern int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked); +void unmap_mapping_page(struct page *page); void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows); void unmap_mapping_range(struct address_space *mapping, @@ -1715,6 +1717,7 @@ static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address, BUG(); return -EFAULT; } +static inline void unmap_mapping_page(struct page *page) { } static inline void unmap_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t nr, bool even_cows) { } static inline void unmap_mapping_range(struct address_space *mapping, diff --git a/mm/memory.c b/mm/memory.c index b70bd3ba3388..eb31b3e4ef93 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1355,7 +1355,18 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, else if (zap_huge_pmd(tlb, vma, pmd, addr)) goto next; /* fall through */ + } else if (details && details->single_page && + PageTransCompound(details->single_page) && + next - addr == HPAGE_PMD_SIZE && pmd_none(*pmd)) { + spinlock_t *ptl = pmd_lock(tlb->mm, pmd); + /* + * Take and drop THP pmd lock so that we cannot return + * prematurely, while zap_huge_pmd() has cleared *pmd, + * but not yet decremented compound_mapcount(). + */ + spin_unlock(ptl); } + /* * Here there can be other concurrent MADV_DONTNEED or * trans huge page faults running, and if the pmd is @@ -3185,6 +3196,36 @@ static inline void unmap_mapping_range_tree(struct rb_root_cached *root, } } +/** + * unmap_mapping_page() - Unmap single page from processes. + * @page: The locked page to be unmapped. + * + * Unmap this page from any userspace process which still has it mmaped. + * Typically, for efficiency, the range of nearby pages has already been + * unmapped by unmap_mapping_pages() or unmap_mapping_range(). But once + * truncation or invalidation holds the lock on a page, it may find that + * the page has been remapped again: and then uses unmap_mapping_page() + * to unmap it finally. + */ +void unmap_mapping_page(struct page *page) +{ + struct address_space *mapping = page->mapping; + struct zap_details details = { }; + + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(PageTail(page)); + + details.check_mapping = mapping; + details.first_index = page->index; + details.last_index = page->index + thp_nr_pages(page) - 1; + details.single_page = page; + + i_mmap_lock_write(mapping); + if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))) + unmap_mapping_range_tree(&mapping->i_mmap, &details); + i_mmap_unlock_write(mapping); +} + /** * unmap_mapping_pages() - Unmap pages from processes. * @mapping: The address space containing pages to be unmapped. diff --git a/mm/truncate.c b/mm/truncate.c index 960edf5803ca..8914ca4ce4b1 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -173,13 +173,10 @@ void do_invalidatepage(struct page *page, unsigned int offset, * its lock, b) when a concurrent invalidate_mapping_pages got there first and * c) when tmpfs swizzles a page between a tmpfs inode and swapper_space. */ -static void -truncate_cleanup_page(struct address_space *mapping, struct page *page) +static void truncate_cleanup_page(struct page *page) { - if (page_mapped(page)) { - unsigned int nr = thp_nr_pages(page); - unmap_mapping_pages(mapping, page->index, nr, false); - } + if (page_mapped(page)) + unmap_mapping_page(page); if (page_has_private(page)) do_invalidatepage(page, 0, thp_size(page)); @@ -224,7 +221,7 @@ int truncate_inode_page(struct address_space *mapping, struct page *page) if (page->mapping != mapping) return -EIO; - truncate_cleanup_page(mapping, page); + truncate_cleanup_page(page); delete_from_page_cache(page); return 0; } @@ -362,7 +359,7 @@ void truncate_inode_pages_range(struct address_space *mapping, pagevec_add(&locked_pvec, page); } for (i = 0; i < pagevec_count(&locked_pvec); i++) - truncate_cleanup_page(mapping, locked_pvec.pages[i]); + truncate_cleanup_page(locked_pvec.pages[i]); delete_from_page_cache_batch(mapping, &locked_pvec); for (i = 0; i < pagevec_count(&locked_pvec); i++) unlock_page(locked_pvec.pages[i]); @@ -737,6 +734,16 @@ int invalidate_inode_pages2_range(struct address_space *mapping, continue; } + if (!did_range_unmap && page_mapped(page)) { + /* + * If page is mapped, before taking its lock, + * zap the rest of the file in one hit. + */ + unmap_mapping_pages(mapping, index, + (1 + end - index), false); + did_range_unmap = 1; + } + lock_page(page); WARN_ON(page_to_index(page) != index); if (page->mapping != mapping) { @@ -744,23 +751,11 @@ int invalidate_inode_pages2_range(struct address_space *mapping, continue; } wait_on_page_writeback(page); - if (page_mapped(page)) { - if (!did_range_unmap) { - /* - * Zap the rest of the file in one hit. - */ - unmap_mapping_pages(mapping, index, - (1 + end - index), false); - did_range_unmap = 1; - } else { - /* - * Just zap this page - */ - unmap_mapping_pages(mapping, index, - 1, false); - } - } + + if (page_mapped(page)) + unmap_mapping_page(page); BUG_ON(page_mapped(page)); + ret2 = do_launder_page(mapping, page); if (ret2 == 0) { if (!invalidate_complete_page2(mapping, page)) -- 2.30.2