Date: Fri, 9 Oct 2020 20:07:59 -0700 (PDT)
From: Hugh Dickins
To: Andrew Morton
cc: Linus Torvalds, Matthew Wilcox, Song Liu, "Kirill A. Shutemov",
    Yang Shi, Denis Lisov, Qian Cai, Suren Baghdasaryan, David Rientjes,
    Minchan Kim, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH] mm/khugepaged: fix filemap page_to_pgoff(page) != offset

There have been elusive reports of filemap_fault() hitting its
VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
with CONFIG_READ_ONLY_THP_FOR_FS=y.

Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
without NUMA reuses the same huge page after collapse_file() failed
(whereas NUMA targets its allocation to the respective node each time).
And most of us were usually testing with CONFIG_NUMA=y kernels.

collapse_file(old start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=old offset
  new_page->mapping = mapping
  xas_store(&xas, new_page)

                          filemap_fault
                            page = find_get_page(mapping, offset)
                            // if offset falls inside hpage then
                            // compound_head(page) == hpage
                            lock_page_maybe_drop_mmap()
                              __lock_page(page)

  // collapse fails
  xas_store(&xas, old page)
  new_page->mapping = NULL
  unlock_page(new_page)

collapse_file(new start)
  new_page = khugepaged_alloc_page(hpage)
  __SetPageLocked(new_page)
  new_page->index = start // hpage->index=new offset
  new_page->mapping = mapping // mapping becomes valid again

                            // since compound_head(page) == hpage
                            // page_to_pgoff(page) got changed
                            VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)

An initial patch replaced __SetPageLocked() by lock_page(), which did
fix the race which Suren illustrates above.  But testing showed that it's
not good enough: if the racing task's __lock_page() gets delayed long
after its find_get_page(), then it may follow collapse_file(new start)'s
successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
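For reference, here is a simplified paraphrase of the 5.9-era
filemap_fault() lookup path in mm/filemap.c that the trace above races
against (abbreviated and lightly commented, not a verbatim quote): a
stale ->mapping already gets a recheck and retry there, whereas a pgoff
mismatch is treated as a hard bug.

	page = find_get_page(mapping, offset);
	/* may be (a subpage of) the not-yet-collapsed hpage */
	...
	if (!lock_page_maybe_drop_mmap(vmf, page, &fpin))
		goto out_retry;

	/* Did it get truncated? A stale ->mapping is retried... */
	if (unlikely(compound_head(page)->mapping != mapping)) {
		unlock_page(page);
		put_page(page);
		goto retry_find;
	}

	/* ...but a changed offset is fatal: the VM_BUG_ON being hit */
	VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);

That existing ->mapping recheck is the "as is done for mapping"
referred to just below.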
It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
check and retry (as is done for mapping), with similar relaxations in
find_lock_entry() and pagecache_get_page(): but it's not obvious what
else might get caught out; and khugepaged non-NUMA appears to be unique
in exposing a page to page cache, then revoking, without going through
a full cycle of freeing before reuse.

Instead, non-NUMA khugepaged_prealloc_page() releases the old page
if anyone else has a reference to it (1% of cases when I tested).

Although never reported on huge tmpfs, I believe its find_lock_entry()
has been at similar risk; but huge tmpfs does not rely on khugepaged
for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.

Reported-by: Denis Lisov
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
Reported-by: Qian Cai
Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
Reported-and-analyzed-by: Suren Baghdasaryan
Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
Signed-off-by: Hugh Dickins
Cc: stable@vger.kernel.org # v4.9+
---
 mm/khugepaged.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

--- 5.9-rc8/mm/khugepaged.c	2020-09-06 17:34:46.939306972 -0700
+++ linux/mm/khugepaged.c	2020-10-08 16:19:42.999765534 -0700
@@ -914,6 +914,18 @@ static struct page *khugepaged_alloc_hug
 static bool khugepaged_prealloc_page(struct page **hpage, bool *wait)
 {
+	/*
+	 * If the hpage allocated earlier was briefly exposed in page cache
+	 * before collapse_file() failed, it is possible that racing lookups
+	 * have not yet completed, and would then be unpleasantly surprised by
+	 * finding the hpage reused for the same mapping at a different offset.
+	 * Just release the previous allocation if there is any danger of that.
+	 */
+	if (*hpage && page_count(*hpage) > 1) {
+		put_page(*hpage);
+		*hpage = NULL;
+	}
+
 	if (!*hpage)
 		*hpage = khugepaged_alloc_hugepage(wait);