From: Greg Kroah-Hartman
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman, stable@vger.kernel.org, Jann Horn, John Hubbard,
    Matthew Wilcox, "Kirill A.
Shutemov", Jan Kara, Andrew Morton, Linus Torvalds
Subject: [PATCH 5.12 068/700] mm/gup: fix try_grab_compound_head() race with split_huge_page()
Date: Mon, 12 Jul 2021 08:02:31 +0200
Message-Id: <20210712060934.314878937@linuxfoundation.org>
In-Reply-To: <20210712060924.797321836@linuxfoundation.org>
References: <20210712060924.797321836@linuxfoundation.org>

From: Jann Horn

commit c24d37322548a6ec3caec67100d28b9c1f89f60a upstream.

try_grab_compound_head() is used to grab a reference to a page from
get_user_pages_fast(), which is only protected against concurrent freeing
of page tables (via local_irq_save()), but not against concurrent TLB
flushes, freeing of data pages, or splitting of compound pages.

Because no reference is held to the page when try_grab_compound_head() is
called, the page may have been freed and reallocated by the time its
refcount has been elevated; therefore, once we're holding a stable
reference to the page, the caller re-checks whether the PTE still points
to the same page (with the same access rights).

The problem is that try_grab_compound_head() has to grab a reference on
the head page; but between the time we look up what the head page is and
the time we actually grab a reference on the head page, the compound page
may have been split up (either explicitly through split_huge_page() or by
freeing the compound page to the buddy allocator and then allocating its
individual order-0 pages).  If that happens, get_user_pages_fast() may end
up returning the right page but lifting the refcount on a now-unrelated
page, leading to use-after-free of pages.

To fix it:
Re-check whether the pages still belong together after lifting the
refcount on the head page.  Move anything else that checks
compound_head(page) below the refcount increment.

This can't actually happen on bare-metal x86 (because there, disabling
IRQs locks out remote TLB flushes), but it can happen on virtualized x86
(e.g. under KVM) and probably also on arm64.  The race window is pretty
narrow, and constantly allocating and shattering hugepages isn't exactly
fast; for now I've only managed to reproduce this in an x86 KVM guest with
an artificially widened timing window (by adding a loop that repeatedly
calls `inl(0x3f8 + 5)` in `try_get_compound_head()` to force VM exits, so
that PV TLB flushes are used instead of IPIs).

As requested on the list, also replace the existing VM_BUG_ON_PAGE() with
a warning and bailout.  Since the existing code only performed the BUG_ON
check on DEBUG_VM kernels, ensure that the new code also only performs the
check under that configuration - I don't want to mix two logically
separate changes together too much.  The macro VM_WARN_ON_ONCE_PAGE()
doesn't return a value on !DEBUG_VM, so wrap the whole check in an #ifdef
block.  An alternative would be to change the VM_WARN_ON_ONCE_PAGE()
definition for !DEBUG_VM such that it always returns false, but since that
would differ from the behavior of the normal WARN macros, it might be too
confusing for readers.
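[ A minimal user-space sketch of the macro constraint described above.
  The definitions here are simplified stand-ins, not the kernel's real
  VM_WARN_ON_ONCE_PAGE()/CONFIG_DEBUG_VM machinery; they only illustrate
  why the new check must sit inside #ifdef CONFIG_DEBUG_VM: the !DEBUG_VM
  variant evaluates to void, so an if () cannot test it. ]

#include <stdio.h>

#ifdef CONFIG_DEBUG_VM
/* DEBUG_VM stand-in: print a warning and yield the condition as a value. */
#define VM_WARN_ON_ONCE_PAGE(cond, page) \
	({ int __warn = !!(cond); \
	   if (__warn) \
		   fprintf(stderr, "warning: %s\n", #cond); \
	   __warn; })
#else
/* !DEBUG_VM stand-in: a void expression, no value that if () could test. */
#define VM_WARN_ON_ONCE_PAGE(cond, page) ((void)(cond))
#endif

int main(void)
{
	int refcount = 1, refs = 2;

#ifdef CONFIG_DEBUG_VM
	/* Same shape as the bailout added to put_page_refs() below. */
	if (VM_WARN_ON_ONCE_PAGE(refcount < refs, NULL))
		return 1;
#else
	/* Without CONFIG_DEBUG_VM the check can only be a bare statement;
	 * it compiles away and cannot drive a bailout. */
	VM_WARN_ON_ONCE_PAGE(refcount < refs, NULL);
#endif
	return 0;
}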
Link: https://lkml.kernel.org/r/20210615012014.1100672-1-jannh@google.com
Fixes: 7aef4172c795 ("mm: handle PTE-mapped tail pages in gerneric fast gup implementaiton")
Signed-off-by: Jann Horn
Reviewed-by: John Hubbard
Cc: Matthew Wilcox
Cc: Kirill A. Shutemov
Cc: Jan Kara
Cc: stable@vger.kernel.org
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
---
 mm/gup.c |   58 +++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 43 insertions(+), 15 deletions(-)

--- a/mm/gup.c
+++ b/mm/gup.c
@@ -44,6 +44,23 @@ static void hpage_pincount_sub(struct pa
 	atomic_sub(refs, compound_pincount_ptr(page));
 }
 
+/* Equivalent to calling put_page() @refs times. */
+static void put_page_refs(struct page *page, int refs)
+{
+#ifdef CONFIG_DEBUG_VM
+	if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
+		return;
+#endif
+
+	/*
+	 * Calling put_page() for each ref is unnecessarily slow. Only the last
+	 * ref needs a put_page().
+	 */
+	if (refs > 1)
+		page_ref_sub(page, refs - 1);
+	put_page(page);
+}
+
 /*
  * Return the compound head page with ref appropriately incremented,
  * or NULL if that failed.
@@ -56,6 +73,21 @@ static inline struct page *try_get_compo
 		return NULL;
 	if (unlikely(!page_cache_add_speculative(head, refs)))
 		return NULL;
+
+	/*
+	 * At this point we have a stable reference to the head page; but it
+	 * could be that between the compound_head() lookup and the refcount
+	 * increment, the compound page was split, in which case we'd end up
+	 * holding a reference on a page that has nothing to do with the page
+	 * we were given anymore.
+	 * So now that the head page is stable, recheck that the pages still
+	 * belong together.
+	 */
+	if (unlikely(compound_head(page) != head)) {
+		put_page_refs(head, refs);
+		return NULL;
+	}
+
 	return head;
 }
 
@@ -95,6 +127,14 @@ __maybe_unused struct page *try_grab_com
 		return NULL;
 
 	/*
+	 * CAUTION: Don't use compound_head() on the page before this
+	 * point, the result won't be stable.
+	 */
+	page = try_get_compound_head(page, refs);
+	if (!page)
+		return NULL;
+
+	/*
 	 * When pinning a compound page of order > 1 (which is what
 	 * hpage_pincount_available() checks for), use an exact count to
 	 * track it, via hpage_pincount_add/_sub().
@@ -102,15 +142,10 @@ __maybe_unused struct page *try_grab_com
 	 * However, be sure to *also* increment the normal page refcount
 	 * field at least once, so that the page really is pinned.
 	 */
-	if (!hpage_pincount_available(page))
-		refs *= GUP_PIN_COUNTING_BIAS;
-
-	page = try_get_compound_head(page, refs);
-	if (!page)
-		return NULL;
-
 	if (hpage_pincount_available(page))
 		hpage_pincount_add(page, refs);
+	else
+		page_ref_add(page, refs * (GUP_PIN_COUNTING_BIAS - 1));
 
 	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED,
 			    orig_refs);
@@ -134,14 +169,7 @@ static void put_compound_head(struct pag
 		refs *= GUP_PIN_COUNTING_BIAS;
 	}
 
-	VM_BUG_ON_PAGE(page_ref_count(page) < refs, page);
-	/*
-	 * Calling put_page() for each ref is unnecessarily slow. Only the last
-	 * ref needs a put_page().
-	 */
-	if (refs > 1)
-		page_ref_sub(page, refs - 1);
-	put_page(page);
+	put_page_refs(page, refs);
 }
 
 /**
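[ A toy user-space model of the "look up the head, speculatively take the
  references, then recheck" sequence that the fixed try_get_compound_head()
  above relies on.  The struct and function names are made up for
  illustration, C11 atomics stand in for the kernel's page refcounting,
  and the check-and-increment is not atomic the way
  page_cache_add_speculative() is. ]

#include <stdatomic.h>
#include <stdio.h>

/* Toy model of a page: a refcount plus a pointer to its compound head
 * (a head page points to itself). */
struct toy_page {
	atomic_int refcount;
	struct toy_page *_Atomic head;
};

/* Drop @refs references again, loosely mirroring put_page_refs(). */
static void toy_put_page_refs(struct toy_page *page, int refs)
{
	atomic_fetch_sub(&page->refcount, refs);
}

/*
 * Same shape as the fixed try_get_compound_head(): look up the head,
 * speculatively take the references, then recheck that the page still
 * belongs to that head; if it no longer does, drop the references and fail.
 */
static struct toy_page *toy_try_get_compound_head(struct toy_page *page, int refs)
{
	struct toy_page *head = atomic_load(&page->head);

	/* The kernel does this zero-check and increment atomically via
	 * page_cache_add_speculative(); the toy version does not. */
	if (atomic_load(&head->refcount) == 0)
		return NULL;				/* already freed */
	atomic_fetch_add(&head->refcount, refs);	/* speculative grab */

	/* Recheck: was the compound page split while we were grabbing? */
	if (atomic_load(&page->head) != head) {
		toy_put_page_refs(head, refs);
		return NULL;
	}
	return head;
}

int main(void)
{
	struct toy_page head = { .refcount = 1 };
	struct toy_page tail = { .refcount = 0 };

	atomic_store(&head.head, &head);
	atomic_store(&tail.head, &head);

	struct toy_page *got = toy_try_get_compound_head(&tail, 1);
	printf("grabbed %s, head refcount now %d\n",
	       got ? "the head" : "nothing",
	       atomic_load(&head.refcount));
	return 0;
}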