Here is a very rare race which leaks memory:
Page P0 is allocated to the page cache. Page P1 is free.
Thread A Thread B Thread C
find_get_entry():
xas_load() returns P0
Removes P0 from page cache
P0 finds its buddy P1
alloc_pages(GFP_KERNEL, 1) returns P0
P0 has refcount 1
page_cache_get_speculative(P0)
P0 has refcount 2
__free_pages(P0)
P0 has refcount 1
put_page(P0)
P1 is not freed
Fix this by freeing all the pages in __free_pages() that won't be freed
by the call to put_page(). It's usually not a good idea to split a page,
but this is a very unlikely scenario.
Fixes: e286781d5f2e ("mm: speculative page references")
Signed-off-by: Matthew Wilcox (Oracle) <[email protected]>
---
v2: Add a test module. Verified it works by:
(1) loading the test module on an unmodified kernel and watching it OOM
(2) modifying __free_pages() with my v1 patch that neglected to add the
PageHead test and hitting VM_BUG_ON_PAGE(PageTail(page))
(3) applying all of this patch and seeing it survive
lib/Kconfig.debug | 9 +++++++++
lib/Makefile | 1 +
lib/test_free_pages.c | 42 ++++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 3 +++
4 files changed, 55 insertions(+)
create mode 100644 lib/test_free_pages.c
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e068c3c7189a..eed59af0e907 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2369,6 +2369,15 @@ config TEST_HMM
If unsure, say N.
+config TEST_FREE_PAGES
+ tristate "Test freeing pages"
+ help
+ Test that a memory leak does not occur due to a race between
+ freeing a block of pages and a speculative page reference.
+ Loading this module is safe if your kernel has the bug fixed.
+ If the bug is not fixed, it will leak gigabytes of memory and
+ probably OOM your system.
+
config TEST_FPU
tristate "Test floating point operations in kernel space"
depends on X86 && !KCOV_INSTRUMENT_ALL
diff --git a/lib/Makefile b/lib/Makefile
index a4a4c6864f51..071b687b7363 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -99,6 +99,7 @@ obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o
obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o
obj-$(CONFIG_TEST_LOCKUP) += test_lockup.o
obj-$(CONFIG_TEST_HMM) += test_hmm.o
+obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o
#
# CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
diff --git a/lib/test_free_pages.c b/lib/test_free_pages.c
new file mode 100644
index 000000000000..074e76bd76b2
--- /dev/null
+++ b/lib/test_free_pages.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * test_free_pages.c: Check that free_pages() doesn't leak memory
+ * Copyright (c) 2020 Oracle
+ * Author: Matthew Wilcox <[email protected]>
+ */
+
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+
+static void test_free_pages(gfp_t gfp)
+{
+ unsigned int i;
+
+ for (i = 0; i < 1000 * 1000; i++) {
+ unsigned long addr = __get_free_pages(gfp, 3);
+ struct page *page = virt_to_page(addr);
+
+ /* Simulate page cache getting a speculative reference */
+ get_page(page);
+ free_pages(addr, 3);
+ put_page(page);
+ }
+}
+
+static int m_in(void)
+{
+ test_free_pages(GFP_KERNEL);
+ test_free_pages(GFP_KERNEL | __GFP_COMP);
+
+ return 0;
+}
+
+static void m_ex(void)
+{
+}
+
+module_init(m_in);
+module_exit(m_ex);
+MODULE_AUTHOR("Matthew Wilcox <[email protected]>");
+MODULE_LICENSE("GPL");
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fab5e97dc9ca..9b259c76e285 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4947,6 +4947,9 @@ void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page))
free_the_page(page, order);
+ else if (!PageHead(page))
+ while (order-- > 0)
+ free_the_page(page + (1 << order), order);
}
EXPORT_SYMBOL(__free_pages);
--
2.28.0
On Sat, 26 Sep 2020 22:39:19 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
> Here is a very rare race which leaks memory:
Not worth a cc:stable?
> Page P0 is allocated to the page cache. Page P1 is free.
>
> Thread A Thread B Thread C
> find_get_entry():
> xas_load() returns P0
> Removes P0 from page cache
> P0 finds its buddy P1
> alloc_pages(GFP_KERNEL, 1) returns P0
> P0 has refcount 1
> page_cache_get_speculative(P0)
> P0 has refcount 2
> __free_pages(P0)
__free_pages(P0, 1), I assume.
> P0 has refcount 1
> put_page(P0)
but this is implicitly order 0
> P1 is not freed
huh.
> Fix this by freeing all the pages in __free_pages() that won't be freed
> by the call to put_page(). It's usually not a good idea to split a page,
> but this is a very unlikely scenario.
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4947,6 +4947,9 @@ void __free_pages(struct page *page, unsigned int order)
> {
> if (put_page_testzero(page))
> free_the_page(page, order);
> + else if (!PageHead(page))
> + while (order-- > 0)
> + free_the_page(page + (1 << order), order);
Well that's weird and scary looking. `page' has non-zero refcount yet
we go and free random followon pages. Methinks it merits an
explanatory comment?
On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
> On Sat, 26 Sep 2020 22:39:19 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
>
> > Here is a very rare race which leaks memory:
>
> Not worth a cc:stable?
Yes, it probably should have been. I just assume the stablebot will
pick up anything that has a Fixes: tag.
> > Page P0 is allocated to the page cache. Page P1 is free.
> >
> > Thread A Thread B Thread C
> > find_get_entry():
> > xas_load() returns P0
> > Removes P0 from page cache
> > P0 finds its buddy P1
> > alloc_pages(GFP_KERNEL, 1) returns P0
> > P0 has refcount 1
> > page_cache_get_speculative(P0)
> > P0 has refcount 2
> > __free_pages(P0)
>
> __free_pages(P0, 1), I assume.
Good catch. That was what I meant to type.
> > P0 has refcount 1
> > put_page(P0)
>
> but this is implicitly order 0
Right, because it's not a compound page.
> > P1 is not freed
>
> huh.
Yeah. Nasty, and we'll never know how often it was hit.
> > Fix this by freeing all the pages in __free_pages() that won't be freed
> > by the call to put_page(). It's usually not a good idea to split a page,
> > but this is a very unlikely scenario.
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4947,6 +4947,9 @@ void __free_pages(struct page *page, unsigned int order)
> > {
> > if (put_page_testzero(page))
> > free_the_page(page, order);
> > + else if (!PageHead(page))
> > + while (order-- > 0)
> > + free_the_page(page + (1 << order), order);
>
> Well that's weird and scary looking. `page' has non-zero refcount yet
> we go and free random followon pages. Methinks it merits an
> explanatory comment?
Well, poot. I lost that comment in the shuffling of patches. In a
different tree, I have:
@@ -4943,10 +4943,19 @@ static inline void free_the_page(struct page *page, unsi
gned int order)
__free_pages_ok(page, order);
}
+/*
+ * If we free a non-compound allocation, another thread may have a
+ * speculative reference to the first page. It has no way of knowing
+ * about the rest of the allocation, so we have to free all but the
+ * first page here.
+ */
void __free_pages(struct page *page, unsigned int order)
{
if (put_page_testzero(page))
free_the_page(page, order);
+ else if (!PageHead(page))
+ while (order-- > 0)
+ free_the_page(page + (1 << order), order);
}
EXPORT_SYMBOL(__free_pages);
Although I'm now thinking of making that comment into kernel-doc and
turning it into advice to the caller rather than an internal note to
other mm developers.
On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
> Well that's weird and scary looking. `page' has non-zero refcount yet
> we go and free random followon pages. Methinks it merits an
> explanatory comment?
Here's some kernel-doc. Opinions?
/**
* __free_pages - Free pages allocated with alloc_pages().
* @page: The page pointer returned from alloc_pages().
* @order: The order of the allocation.
*
* This function differs from put_page() in that it can free multi-page
* allocations that were not allocated with %__GFP_COMP. This function
* does not check that the @order passed in matches that of the
* allocation, so it is possible to leak memory. Freeing more memory than
* was allocated will probably be warned about by other debugging checks.
*
* It is only safe to use the page reference count to determine when
* to free an allocation if you use %__GFP_COMP (in which case, you may
* as well use put_page() to free the page). Another thread may have a
* speculative reference to the first page, but it has no way of knowing
* about the rest of the allocation, so we have to free all but the
* first page here.
*
* Context: May be called in interrupt context but not NMI context.
*/
On Tue, 29 Sep 2020 02:17:19 +0100 Matthew Wilcox <[email protected]> wrote:
> On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
> > On Sat, 26 Sep 2020 22:39:19 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
> >
> > > Here is a very rare race which leaks memory:
> >
> > Not worth a cc:stable?
>
> Yes, it probably should have been.
Have you a feeling for how often this occurs?
> I just assume the stablebot will
> pick up anything that has a Fixes: tag.
We asked them not to do that for mm/ patches. Crazy stuff was getting
backported.
> > >
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4947,6 +4947,9 @@ void __free_pages(struct page *page, unsigned int order)
> > > {
> > > if (put_page_testzero(page))
> > > free_the_page(page, order);
> > > + else if (!PageHead(page))
> > > + while (order-- > 0)
> > > + free_the_page(page + (1 << order), order);
> >
> > Well that's weird and scary looking. `page' has non-zero refcount yet
> > we go and free random followon pages. Methinks it merits an
> > explanatory comment?
>
> Well, poot. I lost that comment in the shuffling of patches. In a
> different tree, I have:
>
> @@ -4943,10 +4943,19 @@ static inline void free_the_page(struct page *page, unsi
> gned int order)
> __free_pages_ok(page, order);
> }
>
> +/*
> + * If we free a non-compound allocation, another thread may have a
"non-compound, higher-order", I suggest?
> + * speculative reference to the first page. It has no way of knowing
> + * about the rest of the allocation, so we have to free all but the
> + * first page here.
> + */
> void __free_pages(struct page *page, unsigned int order)
> {
> if (put_page_testzero(page))
> free_the_page(page, order);
> + else if (!PageHead(page))
> + while (order-- > 0)
> + free_the_page(page + (1 << order), order);
> }
> EXPORT_SYMBOL(__free_pages);
>
>
> Although I'm now thinking of making that comment into kernel-doc and
> turning it into advice to the caller rather than an internal note to
> other mm developers.
hm. But what action could the caller take? The explanatory comment
seems OK to me.
On Tue, Sep 29, 2020 at 04:40:26AM +0100, Matthew Wilcox wrote:
> On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
> > Well that's weird and scary looking. `page' has non-zero refcount yet
> > we go and free random followon pages. Methinks it merits an
> > explanatory comment?
>
> Here's some kernel-doc. Opinions?
>
> /**
> * __free_pages - Free pages allocated with alloc_pages().
> * @page: The page pointer returned from alloc_pages().
> * @order: The order of the allocation.
> *
> * This function differs from put_page() in that it can free multi-page
This sentence presumes existing description/prior knowledge about
put_page().
Maybe
This function can free multi-page allocations that were not allocated
with %__GFP_COMP, unlike put_page() that would free only the first page
in such case. __free_pages() does not ...
> * allocations that were not allocated with %__GFP_COMP. This function
> * does not check that the @order passed in matches that of the
> * allocation, so it is possible to leak memory. Freeing more memory than
> * was allocated will probably be warned about by other debugging checks.
> *
> * It is only safe to use the page reference count to determine when
> * to free an allocation if you use %__GFP_COMP (in which case, you may
> * as well use put_page() to free the page). Another thread may have a
> * speculative reference to the first page, but it has no way of knowing
> * about the rest of the allocation, so we have to free all but the
> * first page here.
> *
> * Context: May be called in interrupt context but not NMI context.
> */
>
--
Sincerely yours,
Mike.
On Tue, Sep 29, 2020 at 10:26:22AM +0300, Mike Rapoport wrote:
> This sentence presumes existing description/prior knowledge about
> put_page().
>
> Maybe
>
> This function can free multi-page allocations that were not allocated
> with %__GFP_COMP, unlike put_page() that would free only the first page
> in such case. __free_pages() does not ...
Thanks. After waking up this morning I did a more extensive rewrite:
/**
* __free_pages - Free pages allocated with alloc_pages().
* @page: The page pointer returned from alloc_pages().
* @order: The order of the allocation.
*
* This function can free multi-page allocations that are not compound
* pages. It does not check that the @order passed in matches that of
* the allocation, so it is easy to leak memory. Freeing more memory
* than was allocated will probably emit a warning.
*
* If the last reference to this page is speculative, it will be released
* by put_page() which only frees the first page of a non-compound
* allocation. To prevent the remaining pages from being leaked, we free
* the subsequent pages here. If you want to use the page's reference
* count to decide when to free the allocation, you should allocate a
* compound page, and use put_page() instead of __free_pages().
*
* Context: May be called in interrupt context or holding a normal
* spinlock, but not in NMI context or while holding a raw spinlock.
*/
On Mon, Sep 28, 2020 at 09:46:56PM -0700, Andrew Morton wrote:
> On Tue, 29 Sep 2020 02:17:19 +0100 Matthew Wilcox <[email protected]> wrote:
>
> > On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
> > > On Sat, 26 Sep 2020 22:39:19 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
> > >
> > > > Here is a very rare race which leaks memory:
> > >
> > > Not worth a cc:stable?
> >
> > Yes, it probably should have been.
>
> Have you a feeling for how often this occurs?
I doubt it happens often. I don't think I could construct a workload to
make it happen frequently. Maybe more often with a virtualised workload
where a thread can be preempted between instructions.
> > I just assume the stablebot will
> > pick up anything that has a Fixes: tag.
>
> We asked them not to do that for mm/ patches. Crazy stuff was getting
> backported.
That's a shame. I'll try to remember to cc them explicitly in the future.
> > Although I'm now thinking of making that comment into kernel-doc and
> > turning it into advice to the caller rather than an internal note to
> > other mm developers.
>
> hm. But what action could the caller take? The explanatory comment
> seems OK to me.
Use compound pages instead of non-compound pages. Although Linus has
asked that people stop using __get_free_pages(), so maybe that will be
the direction we go in.
https://lore.kernel.org/lkml/CA+55aFwyxJ+TOpaJZnC5MPJ-25xbLAEu8iJP8zTYhmA3LXFF8Q@mail.gmail.com/
On Tue, Sep 29, 2020 at 03:06:22PM +0100, Matthew Wilcox wrote:
> On Tue, Sep 29, 2020 at 10:26:22AM +0300, Mike Rapoport wrote:
> > This sentence presumes existing description/prior knowledge about
> > put_page().
> >
> > Maybe
> >
> > This function can free multi-page allocations that were not allocated
> > with %__GFP_COMP, unlike put_page() that would free only the first page
> > in such case. __free_pages() does not ...
>
> Thanks. After waking up this morning I did a more extensive rewrite:
I like this one
Acked-by: Mike Rapoport <[email protected]>
> /**
> * __free_pages - Free pages allocated with alloc_pages().
> * @page: The page pointer returned from alloc_pages().
> * @order: The order of the allocation.
> *
> * This function can free multi-page allocations that are not compound
> * pages. It does not check that the @order passed in matches that of
> * the allocation, so it is easy to leak memory. Freeing more memory
> * than was allocated will probably emit a warning.
> *
> * If the last reference to this page is speculative, it will be released
> * by put_page() which only frees the first page of a non-compound
> * allocation. To prevent the remaining pages from being leaked, we free
> * the subsequent pages here. If you want to use the page's reference
> * count to decide when to free the allocation, you should allocate a
> * compound page, and use put_page() instead of __free_pages().
> *
> * Context: May be called in interrupt context or holding a normal
> * spinlock, but not in NMI context or while holding a raw spinlock.
> */
>
--
Sincerely yours,
Mike.
Excerpts from Andrew Morton's message of September 29, 2020 2:46 pm:
> On Tue, 29 Sep 2020 02:17:19 +0100 Matthew Wilcox <[email protected]> wrote:
>
>> On Mon, Sep 28, 2020 at 06:03:07PM -0700, Andrew Morton wrote:
>> > On Sat, 26 Sep 2020 22:39:19 +0100 "Matthew Wilcox (Oracle)" <[email protected]> wrote:
>> >
>> > > Here is a very rare race which leaks memory:
Great catch! [sorry, a bit behind with emails]
>> >
>> > Not worth a cc:stable?
>>
>> Yes, it probably should have been.
>
> Have you a feeling for how often this occurs?
>
>> I just assume the stablebot will
>> pick up anything that has a Fixes: tag.
>
> We asked them not to do that for mm/ patches. Crazy stuff was getting
> backported.
>
>> > >
>> > > --- a/mm/page_alloc.c
>> > > +++ b/mm/page_alloc.c
>> > > @@ -4947,6 +4947,9 @@ void __free_pages(struct page *page, unsigned int order)
>> > > {
>> > > if (put_page_testzero(page))
>> > > free_the_page(page, order);
>> > > + else if (!PageHead(page))
>> > > + while (order-- > 0)
>> > > + free_the_page(page + (1 << order), order);
>> >
>> > Well that's weird and scary looking. `page' has non-zero refcount yet
>> > we go and free random followon pages. Methinks it merits an
>> > explanatory comment?
>>
>> Well, poot. I lost that comment in the shuffling of patches. In a
>> different tree, I have:
>>
>> @@ -4943,10 +4943,19 @@ static inline void free_the_page(struct page *page, unsi
>> gned int order)
>> __free_pages_ok(page, order);
>> }
>>
>> +/*
>> + * If we free a non-compound allocation, another thread may have a
>
> "non-compound, higher-order", I suggest?
>
>> + * speculative reference to the first page. It has no way of knowing
>> + * about the rest of the allocation, so we have to free all but the
>> + * first page here.
>> + */
>> void __free_pages(struct page *page, unsigned int order)
>> {
>> if (put_page_testzero(page))
>> free_the_page(page, order);
>> + else if (!PageHead(page))
>> + while (order-- > 0)
>> + free_the_page(page + (1 << order), order);
>> }
>> EXPORT_SYMBOL(__free_pages);
>>
>>
>> Although I'm now thinking of making that comment into kernel-doc and
>> turning it into advice to the caller rather than an internal note to
>> other mm developers.
>
> hm. But what action could the caller take? The explanatory comment
> seems OK to me.
The version of this without the comment got merged. I didn't mind the
comment...
Thanks,
Nick