2021-10-27 03:41:27

by Pasha Tatashin

Subject: [RFC 0/8] Hardening page _refcount

It is hard to root cause _refcount problems, because they usually
manifest after the damage has occurred. Yet, they can lead to
catastrophic failures such as memory corruptions.

Improve debuggability by adding more checks that ensure that
page->_refcount never turns negative (i.e. no double free, no free
after freeze, etc.).

- Check for overflow and underflow right from the functions that
modify _refcount
- Remove set_page_count(), so we do not unconditionally overwrite
_refcount with an unrestrained value
- Trace return values in all functions that modify _refcount

Applies against v5.15-rc7. Boot tested in QEMU.
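
As an illustration (a minimal sketch, not taken from the series), with
these checks in place a double free trips an assertion at the second
put, instead of corrupting memory later:

	put_page(page);	/* page_ref_dec_and_test(): 1 -> 0, page is freed */
	put_page(page);	/* _refcount goes 0 -> -1, VM_BUG_ON_PAGE() fires */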

Pasha Tatashin (8):
mm: add overflow and underflow checks for page->_refcount
mm/hugetlb: remove useless set_page_count()
mm: Avoid using set_page_count() in set_page_recounted()
mm: remove set_page_count() from page_frag_alloc_align
mm: avoid using set_page_count() when pages are freed into allocator
mm: rename init_page_count() -> page_ref_init()
mm: remove set_page_count()
mm: simplify page_ref_* functions

arch/m68k/mm/motorola.c         |   2 +-
include/linux/mm.h              |   2 +-
include/linux/page_ref.h        | 116 ++++++++++++++++----------------
include/trace/events/page_ref.h |  66 +++++++++++-------
mm/debug_page_ref.c             |  22 ++----
mm/hugetlb.c                    |   2 +-
mm/internal.h                   |   5 +-
mm/page_alloc.c                 |  19 ++++--
8 files changed, 125 insertions(+), 109 deletions(-)

--
2.33.0.1079.g6e70778dc9-goog


2021-10-27 03:50:38

by Pasha Tatashin

Subject: [RFC 2/8] mm/hugetlb: remove useless set_page_count()

prep_compound_gigantic_page() calls set_page_count(p, 0), but it is not
needed because page_ref_freeze(p, 1) already sets the refcount to 0.

Using set_page_count() is dangerous because it unconditionally resets
the refcount from its current value to an unrestrained value, and its
use should therefore be minimized.

Signed-off-by: Pasha Tatashin <[email protected]>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95dc7b83381f..7e3996c8b696 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1707,7 +1707,7 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
goto out_error;
}
- set_page_count(p, 0);
+ VM_BUG_ON_PAGE(page_count(p), p);
set_compound_head(p, page);
}
atomic_set(compound_mapcount_ptr(page), -1);
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 03:50:38

by Pasha Tatashin

Subject: [RFC 4/8] mm: remove set_page_count() from page_frag_alloc_align

set_page_count() unconditionally resets the value of _refcount, which
is dangerous because it is not programmatically verified. Instead we
rely on comments like: "OK, page count is 0, we can safely set it".

Add a new refcount function, page_ref_add_return(), which returns the
new refcount value after adding to it. Use the return value to verify
that _refcount was indeed what we expected.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/page_ref.h | 13 +++++++++++++
mm/page_alloc.c          |  6 ++++--
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index b3ec2b231fc7..db7ccb461c3e 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -88,6 +88,19 @@ static inline void init_page_count(struct page *page)
set_page_count(page, 1);
}

+static inline int page_ref_add_return(struct page *page, int nr)
+{
+ int ret;
+
+ VM_BUG_ON(nr <= 0);
+ ret = atomic_add_return(nr, &page->_refcount);
+ VM_BUG_ON_PAGE(ret <= 0, page);
+
+ if (page_ref_tracepoint_active(page_ref_mod_and_return))
+ __page_ref_mod_and_return(page, nr, ret);
+ return ret;
+}
+
static inline void page_ref_add(struct page *page, int nr)
{
int ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..6af4596bddc2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5510,6 +5510,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int size = PAGE_SIZE;
struct page *page;
int offset;
+ int refcnt;

if (unlikely(!nc->va)) {
refill:
@@ -5548,8 +5549,9 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
/* if size can vary use size else just use PAGE_SIZE */
size = nc->size;
#endif
- /* OK, page count is 0, we can safely set it */
- set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+ /* page count is 0, set it to PAGE_FRAG_CACHE_MAX_SIZE + 1 */
+ refcnt = page_ref_add_return(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
+ VM_BUG_ON_PAGE(refcnt != PAGE_FRAG_CACHE_MAX_SIZE + 1, page);

/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 03:50:38

by Pasha Tatashin

Subject: [RFC 5/8] mm: avoid using set_page_count() when pages are freed into allocator

When struct pages are first initialized, the page->_refcount field is
set to 1. However, later, when pages are freed into the allocator, we
set _refcount to 0 via set_page_count(). Unconditionally resetting
_refcount is dangerous.

Instead, use page_ref_dec_return() and verify that _refcount ends up
with the expected value.

Signed-off-by: Pasha Tatashin <[email protected]>
---
mm/page_alloc.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6af4596bddc2..9d18e5f9a85a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1659,6 +1659,7 @@ void __free_pages_core(struct page *page, unsigned int order)
unsigned int nr_pages = 1 << order;
struct page *p = page;
unsigned int loop;
+ int refcnt;

/*
* When initializing the memmap, __init_single_page() sets the refcount
@@ -1669,10 +1670,12 @@ void __free_pages_core(struct page *page, unsigned int order)
for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
prefetchw(p + 1);
__ClearPageReserved(p);
- set_page_count(p, 0);
+ refcnt = page_ref_dec_return(p);
+ VM_BUG_ON_PAGE(refcnt, p);
}
__ClearPageReserved(p);
- set_page_count(p, 0);
+ refcnt = page_ref_dec_return(p);
+ VM_BUG_ON_PAGE(refcnt, p);

atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

@@ -2244,10 +2247,12 @@ void __init init_cma_reserved_pageblock(struct page *page)
{
unsigned i = pageblock_nr_pages;
struct page *p = page;
+ int refcnt;

do {
__ClearPageReserved(p);
- set_page_count(p, 0);
+ refcnt = page_ref_dec_return(p);
+ VM_BUG_ON_PAGE(refcnt, p);
} while (++p, --i);

set_pageblock_migratetype(page, MIGRATE_CMA);
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 03:50:38

by Pasha Tatashin

Subject: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

The problems with page->_refcount are hard to debug, because usually
when they are detected, the damage has occurred a long time ago. Yet,
the problems with invalid page refcount may be catastrophic and lead to
memory corruptions.

Reduce the scope of when the _refcount problems manifest themselves by
adding checks for underflows and overflows into functions that modify
_refcount.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/page_ref.h | 61 ++++++++++++++++++++++++++++++++--------
1 file changed, 49 insertions(+), 12 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 7ad46f45df39..b3ec2b231fc7 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -90,21 +90,35 @@ static inline void init_page_count(struct page *page)

static inline void page_ref_add(struct page *page, int nr)
{
- atomic_add(nr, &page->_refcount);
+ int ret;
+
+ VM_BUG_ON(nr <= 0);
+ ret = atomic_add_return(nr, &page->_refcount);
+ VM_BUG_ON_PAGE(ret <= 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod))
__page_ref_mod(page, nr);
}

static inline void page_ref_sub(struct page *page, int nr)
{
- atomic_sub(nr, &page->_refcount);
+ int ret;
+
+ VM_BUG_ON(nr <= 0);
+ ret = atomic_sub_return(nr, &page->_refcount);
+ VM_BUG_ON_PAGE(ret < 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod))
__page_ref_mod(page, -nr);
}

static inline int page_ref_sub_return(struct page *page, int nr)
{
- int ret = atomic_sub_return(nr, &page->_refcount);
+ int ret;
+
+ VM_BUG_ON(nr <= 0);
+ ret = atomic_sub_return(nr, &page->_refcount);
+ VM_BUG_ON_PAGE(ret < 0, page);

if (page_ref_tracepoint_active(page_ref_mod_and_return))
__page_ref_mod_and_return(page, -nr, ret);
@@ -113,31 +127,43 @@ static inline int page_ref_sub_return(struct page *page, int nr)

static inline void page_ref_inc(struct page *page)
{
- atomic_inc(&page->_refcount);
+ int ret = atomic_inc_return(&page->_refcount);
+
+ VM_BUG_ON_PAGE(ret <= 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod))
__page_ref_mod(page, 1);
}

static inline void page_ref_dec(struct page *page)
{
- atomic_dec(&page->_refcount);
+ int ret = atomic_dec_return(&page->_refcount);
+
+ VM_BUG_ON_PAGE(ret < 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod))
__page_ref_mod(page, -1);
}

static inline int page_ref_sub_and_test(struct page *page, int nr)
{
- int ret = atomic_sub_and_test(nr, &page->_refcount);
+ int ret;
+
+ VM_BUG_ON(nr <= 0);
+ ret = atomic_sub_return(nr, &page->_refcount);
+ VM_BUG_ON_PAGE(ret < 0, page);

if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -nr, ret);
- return ret;
+ return ret == 0;
}

static inline int page_ref_inc_return(struct page *page)
{
int ret = atomic_inc_return(&page->_refcount);

+ VM_BUG_ON_PAGE(ret <= 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod_and_return))
__page_ref_mod_and_return(page, 1, ret);
return ret;
@@ -145,17 +171,21 @@ static inline int page_ref_inc_return(struct page *page)

static inline int page_ref_dec_and_test(struct page *page)
{
- int ret = atomic_dec_and_test(&page->_refcount);
+ int ret = atomic_dec_return(&page->_refcount);
+
+ VM_BUG_ON_PAGE(ret < 0, page);

if (page_ref_tracepoint_active(page_ref_mod_and_test))
__page_ref_mod_and_test(page, -1, ret);
- return ret;
+ return ret == 0;
}

static inline int page_ref_dec_return(struct page *page)
{
int ret = atomic_dec_return(&page->_refcount);

+ VM_BUG_ON_PAGE(ret < 0, page);
+
if (page_ref_tracepoint_active(page_ref_mod_and_return))
__page_ref_mod_and_return(page, -1, ret);
return ret;
@@ -163,16 +193,23 @@ static inline int page_ref_dec_return(struct page *page)

static inline int page_ref_add_unless(struct page *page, int nr, int u)
{
- int ret = atomic_add_unless(&page->_refcount, nr, u);
+ int ret;
+
+ VM_BUG_ON(nr <= 0 || u < 0);
+ ret = atomic_fetch_add_unless(&page->_refcount, nr, u);
+ VM_BUG_ON_PAGE(ret < 0, page);

if (page_ref_tracepoint_active(page_ref_mod_unless))
__page_ref_mod_unless(page, nr, ret);
- return ret;
+ return ret != u;
}

static inline int page_ref_freeze(struct page *page, int count)
{
- int ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
+ int ret;
+
+ VM_BUG_ON(count <= 0);
+ ret = likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);

if (page_ref_tracepoint_active(page_ref_freeze))
__page_ref_freeze(page, count, ret);
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 03:50:53

by Pasha Tatashin

Subject: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

set_page_refcounted() converts a non-refcounted page that has
(page->_refcount == 0) into a refcounted page by setting _refcount to 1.

Use page_ref_inc_return() instead to avoid unconditionally overwriting
the _refcount value.

Signed-off-by: Pasha Tatashin <[email protected]>
---
mm/internal.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..cf345fac6894 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page)
*/
static inline void set_page_refcounted(struct page *page)
{
+ int refcnt;
+
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(page_ref_count(page), page);
- set_page_count(page, 1);
+ refcnt = page_ref_inc_return(page);
+ VM_BUG_ON_PAGE(refcnt != 1, page);
}

extern unsigned long highest_memmap_pfn;
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 04:01:35

by Pasha Tatashin

Subject: [RFC 6/8] mm: rename init_page_count() -> page_ref_init()

Now that set_page_count() is no longer called from outside and is about
to be removed, init_page_count() is the only remaining function that
unconditionally sets _refcount, and it is restricted to setting it only
to 1.

Make init_page_count() consistent with the other page_ref_*
functions by renaming it.

Signed-off-by: Pasha Tatashin <[email protected]>
---
arch/m68k/mm/motorola.c  |  2 +-
include/linux/mm.h       |  2 +-
include/linux/page_ref.h | 10 +++++++---
mm/page_alloc.c          |  2 +-
4 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index 9f3f77785aa7..0d016c2e390b 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -133,7 +133,7 @@ void __init init_pointer_table(void *table, int type)

/* unreserve the page so it's possible to free that page */
__ClearPageReserved(PD_PAGE(dp));
- init_page_count(PD_PAGE(dp));
+ page_ref_init(PD_PAGE(dp));

return;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 73a52aba448f..46a25e6a14b8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2397,7 +2397,7 @@ extern void reserve_bootmem_region(phys_addr_t start, phys_addr_t end);
static inline void free_reserved_page(struct page *page)
{
ClearPageReserved(page);
- init_page_count(page);
+ page_ref_init(page);
__free_page(page);
adjust_managed_page_count(page, 1);
}
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index db7ccb461c3e..81a628dc9b8b 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -80,10 +80,14 @@ static inline void set_page_count(struct page *page, int v)
}

/*
- * Setup the page count before being freed into the page allocator for
- * the first time (boot or memory hotplug)
+ * Setup the page refcount to one before being freed into the page allocator.
+ * The memory might not be initialized and therefore there cannot be any
+ * assumptions about the current value of page->_refcount. This call should be
+ * done during boot when memory is being initialized, during memory hotplug
+ * when new memory is added, or when previously reserved memory is unreserved,
+ * i.e. the first time the kernel takes control of the given memory.
*/
-static inline void init_page_count(struct page *page)
+static inline void page_ref_init(struct page *page)
{
set_page_count(page, 1);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9d18e5f9a85a..fcd4c4ce329b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1561,7 +1561,7 @@ static void __meminit __init_single_page(struct page *page, unsigned long pfn,
{
mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
- init_page_count(page);
+ page_ref_init(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
page_kasan_tag_reset(page);
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 04:01:36

by Pasha Tatashin

Subject: [RFC 7/8] mm: remove set_page_count()

set_page_count() is dangerous because it resets _refcount to an
arbitrary value. Instead, we now initialize _refcount to 1 only once,
and the rest of the time we use add/dec/cmpxchg so that the counter
is tracked continuously.

Remove set_page_count() and add new tracing hooks to page_ref_init().

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/page_ref.h        | 19 +++++---------
include/trace/events/page_ref.h | 46 ++++++++++++++++++++++++++++-----
mm/debug_page_ref.c             |  8 +++---
3 files changed, 50 insertions(+), 23 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 81a628dc9b8b..06f5760fcd06 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -7,7 +7,7 @@
#include <linux/page-flags.h>
#include <linux/tracepoint-defs.h>

-DECLARE_TRACEPOINT(page_ref_set);
+DECLARE_TRACEPOINT(page_ref_init);
DECLARE_TRACEPOINT(page_ref_mod);
DECLARE_TRACEPOINT(page_ref_mod_and_test);
DECLARE_TRACEPOINT(page_ref_mod_and_return);
@@ -26,7 +26,7 @@ DECLARE_TRACEPOINT(page_ref_unfreeze);
*/
#define page_ref_tracepoint_active(t) tracepoint_enabled(t)

-extern void __page_ref_set(struct page *page, int v);
+extern void __page_ref_init(struct page *page);
extern void __page_ref_mod(struct page *page, int v);
extern void __page_ref_mod_and_test(struct page *page, int v, int ret);
extern void __page_ref_mod_and_return(struct page *page, int v, int ret);
@@ -38,7 +38,7 @@ extern void __page_ref_unfreeze(struct page *page, int v);

#define page_ref_tracepoint_active(t) false

-static inline void __page_ref_set(struct page *page, int v)
+static inline void __page_ref_init(struct page *page)
{
}
static inline void __page_ref_mod(struct page *page, int v)
@@ -72,15 +72,8 @@ static inline int page_count(const struct page *page)
return atomic_read(&compound_head(page)->_refcount);
}

-static inline void set_page_count(struct page *page, int v)
-{
- atomic_set(&page->_refcount, v);
- if (page_ref_tracepoint_active(page_ref_set))
- __page_ref_set(page, v);
-}
-
/*
- * Setup the page refcount to one before being freed into the page allocator.
+ * Setup the page->_refcount to 1 before being freed into the page allocator.
* The memory might not be initialized and therefore there cannot be any
* assumptions about the current value of page->_refcount. This call should be
* done during boot when memory is being initialized, during memory hotplug
@@ -89,7 +82,9 @@ static inline void set_page_count(struct page *page, int v)
*/
static inline void page_ref_init(struct page *page)
{
- set_page_count(page, 1);
+ atomic_set(&page->_refcount, 1);
+ if (page_ref_tracepoint_active(page_ref_init))
+ __page_ref_init(page);
}

static inline int page_ref_add_return(struct page *page, int nr)
diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h
index 8a99c1cd417b..87551bb1df9e 100644
--- a/include/trace/events/page_ref.h
+++ b/include/trace/events/page_ref.h
@@ -10,6 +10,45 @@
#include <linux/tracepoint.h>
#include <trace/events/mmflags.h>

+DECLARE_EVENT_CLASS(page_ref_init_template,
+
+ TP_PROTO(struct page *page),
+
+ TP_ARGS(page),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(unsigned long, flags)
+ __field(int, count)
+ __field(int, mapcount)
+ __field(void *, mapping)
+ __field(int, mt)
+ __field(int, val)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->flags = page->flags;
+ __entry->count = page_ref_count(page);
+ __entry->mapcount = page_mapcount(page);
+ __entry->mapping = page->mapping;
+ __entry->mt = get_pageblock_migratetype(page);
+ ),
+
+ TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d",
+ __entry->pfn,
+ show_page_flags(__entry->flags & PAGEFLAGS_MASK),
+ __entry->count,
+ __entry->mapcount, __entry->mapping, __entry->mt)
+);
+
+DEFINE_EVENT(page_ref_init_template, page_ref_init,
+
+ TP_PROTO(struct page *page),
+
+ TP_ARGS(page)
+);
+
DECLARE_EVENT_CLASS(page_ref_mod_template,

TP_PROTO(struct page *page, int v),
@@ -44,13 +83,6 @@ DECLARE_EVENT_CLASS(page_ref_mod_template,
__entry->val)
);

-DEFINE_EVENT(page_ref_mod_template, page_ref_set,
-
- TP_PROTO(struct page *page, int v),
-
- TP_ARGS(page, v)
-);
-
DEFINE_EVENT(page_ref_mod_template, page_ref_mod,

TP_PROTO(struct page *page, int v),
diff --git a/mm/debug_page_ref.c b/mm/debug_page_ref.c
index f3b2c9d3ece2..e32149734122 100644
--- a/mm/debug_page_ref.c
+++ b/mm/debug_page_ref.c
@@ -5,12 +5,12 @@
#define CREATE_TRACE_POINTS
#include <trace/events/page_ref.h>

-void __page_ref_set(struct page *page, int v)
+void __page_ref_init(struct page *page)
{
- trace_page_ref_set(page, v);
+ trace_page_ref_init(page);
}
-EXPORT_SYMBOL(__page_ref_set);
-EXPORT_TRACEPOINT_SYMBOL(page_ref_set);
+EXPORT_SYMBOL(__page_ref_init);
+EXPORT_TRACEPOINT_SYMBOL(page_ref_init);

void __page_ref_mod(struct page *page, int v)
{
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 04:01:36

by Pasha Tatashin

Subject: [RFC 8/8] mm: simplify page_ref_* functions

Now that we are using the atomic *_return variants to add/sub/inc/dec
the page _refcount, it makes sense to combine the page_ref_* return and
non-return functions.

Also remove some extra tracepoints for the non-return variants. This
improves traceability by always recording the new _refcount value
after the modification has occurred.

Signed-off-by: Pasha Tatashin <[email protected]>
---
include/linux/page_ref.h        | 79 +++++++--------------------------
include/trace/events/page_ref.h | 26 +++--------
mm/debug_page_ref.c             | 14 ------
3 files changed, 22 insertions(+), 97 deletions(-)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 06f5760fcd06..2a91dbc33486 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -8,8 +8,6 @@
#include <linux/tracepoint-defs.h>

DECLARE_TRACEPOINT(page_ref_init);
-DECLARE_TRACEPOINT(page_ref_mod);
-DECLARE_TRACEPOINT(page_ref_mod_and_test);
DECLARE_TRACEPOINT(page_ref_mod_and_return);
DECLARE_TRACEPOINT(page_ref_mod_unless);
DECLARE_TRACEPOINT(page_ref_freeze);
@@ -27,8 +25,6 @@ DECLARE_TRACEPOINT(page_ref_unfreeze);
#define page_ref_tracepoint_active(t) tracepoint_enabled(t)

extern void __page_ref_init(struct page *page);
-extern void __page_ref_mod(struct page *page, int v);
-extern void __page_ref_mod_and_test(struct page *page, int v, int ret);
extern void __page_ref_mod_and_return(struct page *page, int v, int ret);
extern void __page_ref_mod_unless(struct page *page, int v, int u);
extern void __page_ref_freeze(struct page *page, int v, int ret);
@@ -41,12 +37,6 @@ extern void __page_ref_unfreeze(struct page *page, int v);
static inline void __page_ref_init(struct page *page)
{
}
-static inline void __page_ref_mod(struct page *page, int v)
-{
-}
-static inline void __page_ref_mod_and_test(struct page *page, int v, int ret)
-{
-}
static inline void __page_ref_mod_and_return(struct page *page, int v, int ret)
{
}
@@ -102,26 +92,7 @@ static inline int page_ref_add_return(struct page *page, int nr)

static inline void page_ref_add(struct page *page, int nr)
{
- int ret;
-
- VM_BUG_ON(nr <= 0);
- ret = atomic_add_return(nr, &page->_refcount);
- VM_BUG_ON_PAGE(ret <= 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod))
- __page_ref_mod(page, nr);
-}
-
-static inline void page_ref_sub(struct page *page, int nr)
-{
- int ret;
-
- VM_BUG_ON(nr <= 0);
- ret = atomic_sub_return(nr, &page->_refcount);
- VM_BUG_ON_PAGE(ret < 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod))
- __page_ref_mod(page, -nr);
+ page_ref_add_return(page, nr);
}

static inline int page_ref_sub_return(struct page *page, int nr)
@@ -137,37 +108,14 @@ static inline int page_ref_sub_return(struct page *page, int nr)
return ret;
}

-static inline void page_ref_inc(struct page *page)
-{
- int ret = atomic_inc_return(&page->_refcount);
-
- VM_BUG_ON_PAGE(ret <= 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod))
- __page_ref_mod(page, 1);
-}
-
-static inline void page_ref_dec(struct page *page)
+static inline void page_ref_sub(struct page *page, int nr)
{
- int ret = atomic_dec_return(&page->_refcount);
-
- VM_BUG_ON_PAGE(ret < 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod))
- __page_ref_mod(page, -1);
+ page_ref_sub_return(page, nr);
}

static inline int page_ref_sub_and_test(struct page *page, int nr)
{
- int ret;
-
- VM_BUG_ON(nr <= 0);
- ret = atomic_sub_return(nr, &page->_refcount);
- VM_BUG_ON_PAGE(ret < 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod_and_test))
- __page_ref_mod_and_test(page, -nr, ret);
- return ret == 0;
+ return page_ref_sub_return(page, nr) == 0;
}

static inline int page_ref_inc_return(struct page *page)
@@ -181,15 +129,10 @@ static inline int page_ref_inc_return(struct page *page)
return ret;
}

-static inline int page_ref_dec_and_test(struct page *page)
+static inline void page_ref_inc(struct page *page)
{
- int ret = atomic_dec_return(&page->_refcount);

- VM_BUG_ON_PAGE(ret < 0, page);
-
- if (page_ref_tracepoint_active(page_ref_mod_and_test))
- __page_ref_mod_and_test(page, -1, ret);
- return ret == 0;
+ page_ref_inc_return(page);
}

static inline int page_ref_dec_return(struct page *page)
@@ -203,6 +146,16 @@ static inline int page_ref_dec_return(struct page *page)
return ret;
}

+static inline void page_ref_dec(struct page *page)
+{
+ page_ref_dec_return(page);
+}
+
+static inline int page_ref_dec_and_test(struct page *page)
+{
+ return page_ref_dec_return(page) == 0;
+}
+
static inline int page_ref_add_unless(struct page *page, int nr, int u)
{
int ret;
diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h
index 87551bb1df9e..883d90508ca2 100644
--- a/include/trace/events/page_ref.h
+++ b/include/trace/events/page_ref.h
@@ -49,7 +49,7 @@ DEFINE_EVENT(page_ref_init_template, page_ref_init,
TP_ARGS(page)
);

-DECLARE_EVENT_CLASS(page_ref_mod_template,
+DECLARE_EVENT_CLASS(page_ref_unfreeze_template,

TP_PROTO(struct page *page, int v),

@@ -83,14 +83,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_template,
__entry->val)
);

-DEFINE_EVENT(page_ref_mod_template, page_ref_mod,
-
- TP_PROTO(struct page *page, int v),
-
- TP_ARGS(page, v)
-);
-
-DECLARE_EVENT_CLASS(page_ref_mod_and_test_template,
+DECLARE_EVENT_CLASS(page_ref_mod_template,

TP_PROTO(struct page *page, int v, int ret),

@@ -126,35 +119,28 @@ DECLARE_EVENT_CLASS(page_ref_mod_and_test_template,
__entry->val, __entry->ret)
);

-DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_and_test,
-
- TP_PROTO(struct page *page, int v, int ret),
-
- TP_ARGS(page, v, ret)
-);
-
-DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_and_return,
+DEFINE_EVENT(page_ref_mod_template, page_ref_mod_and_return,

TP_PROTO(struct page *page, int v, int ret),

TP_ARGS(page, v, ret)
);

-DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_unless,
+DEFINE_EVENT(page_ref_mod_template, page_ref_mod_unless,

TP_PROTO(struct page *page, int v, int ret),

TP_ARGS(page, v, ret)
);

-DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_freeze,
+DEFINE_EVENT(page_ref_mod_template, page_ref_freeze,

TP_PROTO(struct page *page, int v, int ret),

TP_ARGS(page, v, ret)
);

-DEFINE_EVENT(page_ref_mod_template, page_ref_unfreeze,
+DEFINE_EVENT(page_ref_unfreeze_template, page_ref_unfreeze,

TP_PROTO(struct page *page, int v),

diff --git a/mm/debug_page_ref.c b/mm/debug_page_ref.c
index e32149734122..1de9d93cca25 100644
--- a/mm/debug_page_ref.c
+++ b/mm/debug_page_ref.c
@@ -12,20 +12,6 @@ void __page_ref_init(struct page *page)
EXPORT_SYMBOL(__page_ref_init);
EXPORT_TRACEPOINT_SYMBOL(page_ref_init);

-void __page_ref_mod(struct page *page, int v)
-{
- trace_page_ref_mod(page, v);
-}
-EXPORT_SYMBOL(__page_ref_mod);
-EXPORT_TRACEPOINT_SYMBOL(page_ref_mod);
-
-void __page_ref_mod_and_test(struct page *page, int v, int ret)
-{
- trace_page_ref_mod_and_test(page, v, ret);
-}
-EXPORT_SYMBOL(__page_ref_mod_and_test);
-EXPORT_TRACEPOINT_SYMBOL(page_ref_mod_and_test);
-
void __page_ref_mod_and_return(struct page *page, int v, int ret)
{
trace_page_ref_mod_and_return(page, v, ret);
--
2.33.0.1079.g6e70778dc9-goog

2021-10-27 04:38:37

by John Hubbard

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 10/26/21 10:53, John Hubbard wrote:
> On 10/26/21 10:38, Pasha Tatashin wrote:
>> set_page_refcounted() converts a non-refcounted page that has
>> (page->_refcount == 0) into a refcounted page by setting _refcount to 1.
>>
>> Use page_ref_inc_return() instead to avoid unconditionally overwriting
>> the _refcount value.
>>
>> Signed-off-by: Pasha Tatashin <[email protected]>
>> ---
>>   mm/internal.h | 5 ++++-
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index cf3cb933eba3..cf345fac6894 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page)
>>    */
>>   static inline void set_page_refcounted(struct page *page)
>>   {
>> +    int refcnt;
>> +
>>       VM_BUG_ON_PAGE(PageTail(page), page);
>>       VM_BUG_ON_PAGE(page_ref_count(page), page);
>> -    set_page_count(page, 1);
>> +    refcnt = page_ref_inc_return(page);
>> +    VM_BUG_ON_PAGE(refcnt != 1, page);
>
> Hi Pavel,

ohhh, s/Pavel/Pasha/ !

Huge apologies for the name mixup, I had just seen another email...
very sorry about that mistake.


thanks,
--
John Hubbard
NVIDIA

>
> I am acutely uncomfortable with this change, because it changes the
> meaning and behavior of the function to something completely different,
> while leaving the function name unchanged. Furthermore, it relies upon
> debug assertions, rather than a return value (for example) to verify
> that all is well.
>
> I understand where this patchset is going, but this intermediate step is
> not a good move.
>
> Also, for the overall series, if you want to change from
> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> return something like -EBUSY if incrementing the value results in a
> surprise, and let the caller decide how to handle it.
>
> thanks,

2021-10-27 04:38:38

by John Hubbard

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 10/26/21 10:38, Pasha Tatashin wrote:
> set_page_refcounted() converts a non-refcounted page that has
> (page->_refcount == 0) into a refcounted page by setting _refcount to 1.
>
> Use page_ref_inc_return() instead to avoid unconditionally overwriting
> the _refcount value.
>
> Signed-off-by: Pasha Tatashin <[email protected]>
> ---
> mm/internal.h | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index cf3cb933eba3..cf345fac6894 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -91,9 +91,12 @@ static inline bool page_evictable(struct page *page)
> */
> static inline void set_page_refcounted(struct page *page)
> {
> + int refcnt;
> +
> VM_BUG_ON_PAGE(PageTail(page), page);
> VM_BUG_ON_PAGE(page_ref_count(page), page);
> - set_page_count(page, 1);
> + refcnt = page_ref_inc_return(page);
> + VM_BUG_ON_PAGE(refcnt != 1, page);

Hi Pavel,

I am acutely uncomfortable with this change, because it changes the
meaning and behavior of the function to something completely different,
while leaving the function name unchanged. Furthermore, it relies upon
debug assertions, rather than a return value (for example) to verify
that all is well.

I understand where this patchset is going, but this intermediate step is
not a good move.

Also, for the overall series, if you want to change from
"set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
do that is *not* to depend solely on VM_BUG*() to verify. Instead,
return something like -EBUSY if incrementing the value results in a
surprise, and let the caller decide how to handle it.

thanks,
--
John Hubbard
NVIDIA

> }
>
> extern unsigned long highest_memmap_pfn;
>

2021-10-27 05:30:30

by Pasha Tatashin

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

> > Hi Pavel,
>
> ohhh, s/Pavel/Pasha/ !
>
> Huge apologies for the name mixup, I had just seen another email...
> very sorry about that mistake.

Hi John,

No need to be sorry, Pavel == Pasha :) In Russian it is the same
name; I've been trying to reduce the confusion by converting
everything to Pasha.

Pasha

2021-10-27 07:03:18

by Pasha Tatashin

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

Hi John,

Thank you for looking at this series.

> > static inline void set_page_refcounted(struct page *page)
> > {
> > + int refcnt;
> > +
> > VM_BUG_ON_PAGE(PageTail(page), page);
> > VM_BUG_ON_PAGE(page_ref_count(page), page);
> > - set_page_count(page, 1);
> > + refcnt = page_ref_inc_return(page);
> > + VM_BUG_ON_PAGE(refcnt != 1, page);


> I am acutely uncomfortable with this change, because it changes the
> meaning and behavior of the function to something completely different,
> while leaving the function name unchanged. Furthermore, it relies upon
> debug assertions, rather than a return value (for example) to verify
> that all is well.


It must return the same thing; if it does not, we have a bug in our
kernel which may lead to memory corruptions and security holes.

So today we have this:
VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
< What if something modified here? Hmm..>
set_page_count(page, 1); -> Yet we reset it to 1.

With my proposed change:
VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
refcnt = page_ref_inc_return(page); -> ref_count better be 1.
VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.

>
> I understand where this patchset is going, but this intermediate step is
> not a good move.
>
> Also, for the overall series, if you want to change from
> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> return something like -EBUSY if incrementing the value results in a
> surprise, and let the caller decide how to handle it.

Actually, -EBUSY would be OK if the problems were because we failed to
modify refcount for some reason, but if we modified refcount and got
an unexpected value (i.e underflow/overflow) we better report it right
away instead of waiting for memory corruption to happen.

Thanks,
Pasha

2021-10-27 07:03:21

by Matthew Wilcox

Subject: Re: [RFC 0/8] Hardening page _refcount

On Tue, Oct 26, 2021 at 05:38:14PM +0000, Pasha Tatashin wrote:
> It is hard to root cause _refcount problems, because they usually
> manifest after the damage has occurred. Yet, they can lead to
> catastrophic failures such as memory corruptions.
>
> Improve debuggability by adding more checks that ensure that
> page->_refcount never turns negative (i.e. no double free, no free
> after freeze, etc.).
>
> - Check for overflow and underflow right from the functions that
> modify _refcount
> - Remove set_page_count(), so we do not unconditionally overwrite
> _refcount with an unrestrained value
> - Trace return values in all functions that modify _refcount

I think this is overkill. Won't we get exactly the same protection
by simply testing that page->_refcount == 0 in set_page_count()?
Anything which triggers that BUG_ON would already be buggy because
it can race with speculative gets.
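
i.e., something like this (a sketch, keeping the existing tracepoint):

	static inline void set_page_count(struct page *page, int v)
	{
		VM_BUG_ON_PAGE(page_ref_count(page) != 0, page);
		atomic_set(&page->_refcount, v);
		if (page_ref_tracepoint_active(page_ref_set))
			__page_ref_set(page, v);
	}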

2021-10-27 07:07:30

by Pasha Tatashin

Subject: Re: [RFC 0/8] Hardening page _refcount

On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 05:38:14PM +0000, Pasha Tatashin wrote:
> > It is hard to root cause _refcount problems, because they usually
> > manifest after the damage has occurred. Yet, they can lead to
> > catastrophic failures such as memory corruptions.
> >
> > Improve debuggability by adding more checks that ensure that
> > page->_refcount never turns negative (i.e. no double free, no free
> > after freeze, etc.).
> >
> > - Check for overflow and underflow right from the functions that
> > modify _refcount
> > - Remove set_page_count(), so we do not unconditionally overwrite
> > _refcount with an unrestrained value
> > - Trace return values in all functions that modify _refcount
>
> I think this is overkill. Won't we get exactly the same protection
> by simply testing that page->_refcount == 0 in set_page_count()?
> Anything which triggers that BUG_ON would already be buggy because
> it can race with speculative gets.

We can't, because set_page_count(v) is used for
1. changing _refcount from a current value to an unconstrained v
2. initializing _refcount from an undefined state to v.

In this work we forbid the first case, and reduce the second case to
initializing only to 1.
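
With this series the initialization case reduces to (as in patch 7):

	static inline void page_ref_init(struct page *page)
	{
		atomic_set(&page->_refcount, 1);
		if (page_ref_tracepoint_active(page_ref_init))
			__page_ref_init(page);
	}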

Pasha

2021-10-27 07:17:01

by Pasha Tatashin

Subject: Re: [RFC 2/8] mm/hugetlb: remove useless set_page_count()

On Tue, Oct 26, 2021 at 2:45 PM Mike Kravetz <[email protected]> wrote:
>
> On 10/26/21 10:38 AM, Pasha Tatashin wrote:
> > prep_compound_gigantic_page() calls set_page_count(p, 0), but it is not
> > needed because page_ref_freeze(p, 1) already sets the refcount to 0.
> >
> > Using set_page_count() is dangerous because it unconditionally resets
> > the refcount from its current value to an unrestrained value, and its
> > use should therefore be minimized.
> >
> > Signed-off-by: Pasha Tatashin <[email protected]>
>
> Thanks!
>
> My bad for not removing the set_page_count when adding the page_ref_freeze.
>
> FYI, there have been additional changes to this routine in Andrew's
> tree. Not really sure if we want/need the VM_BUG_ON_PAGE as that would
> only check if there was a 'bug' in page_ref_freeze.

I would like to keep it. Part of the idea of this series is to reduce
reliance on comments such as:

/* No worries, refcount is A therefore we can do B */

And instead enforce that via VM_BUG_ON(). It should be able to
prevent existing and future _refcount related bugs from manifesting as
memory corruptions.
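
For example (borrowing the pattern from patch 4), instead of:

	/* OK, page count is 0, we can safely set it */
	set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);

we verify the assumption as part of the modification itself:

	refcnt = page_ref_add_return(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
	VM_BUG_ON_PAGE(refcnt != PAGE_FRAG_CACHE_MAX_SIZE + 1, page);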

Pasha

>
> --
> Mike Kravetz
>
> > ---
> > mm/hugetlb.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 95dc7b83381f..7e3996c8b696 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1707,7 +1707,7 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> > pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> > goto out_error;
> > }
> > - set_page_count(p, 0);
> > + VM_BUG_ON_PAGE(page_count(p), p);
> > set_compound_head(p, page);
> > }
> > atomic_set(compound_mapcount_ptr(page), -1);
> >

2021-10-27 07:17:01

by Mike Kravetz

Subject: Re: [RFC 2/8] mm/hugetlb: remove useless set_page_count()

On 10/26/21 10:38 AM, Pasha Tatashin wrote:
> prep_compound_gigantic_page() calls set_page_count(p, 0), but it is not
> needed because page_ref_freeze(p, 1) already sets the refcount to 0.
>
> Using set_page_count() is dangerous because it unconditionally resets
> the refcount from its current value to an unrestrained value, and its
> use should therefore be minimized.
>
> Signed-off-by: Pasha Tatashin <[email protected]>

Thanks!

My bad for not removing the set_page_count when adding the page_ref_freeze.

FYI, there have been additional changes to this routine in Andrew's
tree. Not really sure if we want/need the VM_BUG_ON_PAGE as that would
only check if there was a 'bug' in page_ref_freeze.

--
Mike Kravetz

> ---
> mm/hugetlb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 95dc7b83381f..7e3996c8b696 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1707,7 +1707,7 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
> pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> goto out_error;
> }
> - set_page_count(p, 0);
> + VM_BUG_ON_PAGE(page_count(p), p);
> set_compound_head(p, page);
> }
> atomic_set(compound_mapcount_ptr(page), -1);
>

2021-10-27 10:12:12

by Matthew Wilcox

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Tue, Oct 26, 2021 at 05:38:15PM +0000, Pasha Tatashin wrote:
> static inline void page_ref_add(struct page *page, int nr)
> {
> - atomic_add(nr, &page->_refcount);
> + int ret;
> +
> + VM_BUG_ON(nr <= 0);
> + ret = atomic_add_return(nr, &page->_refcount);
> + VM_BUG_ON_PAGE(ret <= 0, page);

This isn't right. _refcount is allowed to overflow into the negatives.
See page_ref_zero_or_close_to_overflow() and the conversations that led
to it being added.

2021-10-27 10:15:46

by Matthew Wilcox

Subject: Re: [RFC 0/8] Hardening page _refcount

On Tue, Oct 26, 2021 at 02:30:25PM -0400, Pasha Tatashin wrote:
> On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <[email protected]> wrote:
> > I think this is overkill. Won't we get exactly the same protection
> > by simply testing that page->_refcount == 0 in set_page_count()?
> > Anything which triggers that BUG_ON would already be buggy because
> > it can race with speculative gets.
>
> We can't, because set_page_count(v) is used for
> 1. changing _refcount from a current value to an unconstrained v
> 2. initializing _refcount from an undefined state to v.
>
> In this work we forbid the first case, and reduce the second case to
> initializing only to 1.

Anything that is calling set_page_refcount() on something which is
not 0 is buggy today. There are several ways to increment the page
refcount speculatively if it is not 0. eg lockless GUP and page cache
reads. So we could have:

CPU 0: alloc_page() (refcount now 1)
CPU 1: get_page_unless_zero() (refcount now 2)
CPU 0: set_page_refcount(5) (refcount now 5)
CPU 1: put_page() (refcount now 4)

Now the refcount is wrong. So it is *only* safe to call
set_page_refcount() if the refcount is 0. If you can find somewhere
that's calling set_page_refcount() on a non-0 refcount, that's a bug
that needs to be fixed.

2021-10-27 11:04:18

by Mike Kravetz

Subject: Re: [RFC 2/8] mm/hugetlb: remove useless set_page_count()

On 10/26/21 11:50 AM, Pasha Tatashin wrote:
> On Tue, Oct 26, 2021 at 2:45 PM Mike Kravetz <[email protected]> wrote:
>>
>> On 10/26/21 10:38 AM, Pasha Tatashin wrote:
>>> prep_compound_gigantic_page() calls set_page_count(p, 0), but it is not
>>> needed because page_ref_freeze(p, 1) already sets the refcount to 0.
>>>
>>> Using set_page_count() is dangerous because it unconditionally resets
>>> the refcount from its current value to an unrestrained value, and its
>>> use should therefore be minimized.
>>>
>>> Signed-off-by: Pasha Tatashin <[email protected]>
>>
>> Thanks!
>>
>> My bad for not removing the set_page_count when adding the page_ref_freeze.
>>
>> FYI, there have been additional changes to this routine in Andrew's
>> tree. Not really sure if we want/need the VM_BUG_ON_PAGE as that would
>> only check if there was a 'bug' in page_ref_freeze.
>
> I would like to keep it. Part of the idea of this series is to reduce
> reliance on comments such as:
>
> /* No worries, refcount is A therefore we can do B */
>
> And instead enforce that via VM_BUG_ON(). It should be able to
> prevent existing and future _refcount related bugs from manifesting as
> memory corruptions.

Ok, but that seems a bit redundant to me.

There is actually a VM_BUG_ON_PAGE(page_count(p), p) in the 'non-demote'
case in Andrew's tree. This is in the path where we do not call
page_ref_freeze to zero page ref. That seems sufficient to me.

Since you did point out the unnecessary set_page_count, I will submit a
code cleanup patch to remove it. I think that is independent of your
efforts here, and adding VM_BUG_ON can be discussed in the context of
this series.
--
Mike Kravetz

2021-10-27 11:06:03

by Pasha Tatashin

Subject: Re: [RFC 0/8] Hardening page _refcount

On Tue, Oct 26, 2021 at 4:14 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 02:30:25PM -0400, Pasha Tatashin wrote:
> > On Tue, Oct 26, 2021 at 2:24 PM Matthew Wilcox <[email protected]> wrote:
> > > I think this is overkill. Won't we get exactly the same protection
> > > by simply testing that page->_refcount == 0 in set_page_count()?
> > > Anything which triggers that BUG_ON would already be buggy because
> > > it can race with speculative gets.
> >
> > We can't, because set_page_count(v) is used for
> > 1. changing _refcount from a current value to an unconstrained v
> > 2. initializing _refcount from an undefined state to v.
> >
> > In this work we forbid the first case, and reduce the second case to
> > initializing only to 1.
>
> Anything that is calling set_page_refcount() on something which is
> not 0 is buggy today

For performance reasons the memblock allocator does not zero struct
page memory; we initialize every field in struct page individually,
including page->_refcount. Most of the time the boot memory is zeroed
by firmware, but that is not guaranteed: non-zero pages can come from
the bootloader or from freed firmware pages. Also, after kexec the
memory state is not zeroed either, and struct pages can contain
garbage until the fields are initialized.

This is a valid reason to call set_page_count() with a non-zero
refcount, but the function is too generic; in this case we really only
need to set the count to 1, so instead rename it to page_ref_init() and
have it set _refcount only to 1.

Another example:
In __free_pages_core() and in init_cma_reserved_pageblock() we call
set_page_count() when _refcount is 1 and we change it to 0.

In this case doing a dec_return check makes more sense, in order to
verify that the refcount was indeed 1 and that we won't have a double
free.
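
That is (as in patch 5):

	refcnt = page_ref_dec_return(p);	/* expect 1 -> 0 */
	VM_BUG_ON_PAGE(refcnt, p);		/* a double free would trip here */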

> There are several ways to increment the page
> refcount speculatively if it is not 0. eg lockless GUP and page cache
> reads. So we could have:
>
> CPU 0: alloc_page() (refcount now 1)
> CPU 1: get_page_unless_zero() (refcount now 2)
> CPU 0: set_page_refcount(5) (refcount now 5)
> CPU 1: put_page() (refcount now 4)
>
> Now the refcount is wrong. So it is *only* safe to call
> set_page_refcount() if the refcount is 0. If you can find somewhere
> that's calling set_page_refcount() on a non-0 refcount, that's a bug
> that needs to be fixed.

Right, add_return/sub_return() with a check after the operation ensures
that there are no races where the refcount could be written to between
observing that it is 0 and calling set_page_refcount().

Thanks,
Pasha

2021-10-27 11:21:47

by Pasha Tatashin

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Tue, Oct 26, 2021 at 3:50 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 05:38:15PM +0000, Pasha Tatashin wrote:
> > static inline void page_ref_add(struct page *page, int nr)
> > {
> > - atomic_add(nr, &page->_refcount);
> > + int ret;
> > +
> > + VM_BUG_ON(nr <= 0);
> > + ret = atomic_add_return(nr, &page->_refcount);
> > + VM_BUG_ON_PAGE(ret <= 0, page);
>
> This isn't right. _refcount is allowed to overflow into the negatives.
> See page_ref_zero_or_close_to_overflow() and the conversations that led
> to it being added.

#define page_ref_zero_or_close_to_overflow(page) \
	((unsigned int) page_ref_count(page) + 127u <= 127u)


Uh, right, I saw the macro but did not realize there was an (unsigned int) cast.

OK, I think we can move this macro inside:
include/linux/page_ref.h

modify it to something like this:
#define page_ref_zero_or_close_to_overflow(page) \
((unsigned int) page_ref_count(page) + v + 127u <= v + 127u)

The sub/dec can also be fixed to ensure that we do not underflow while
still working with the fact that we use all 32 bits of _refcount.

Pasha

2021-10-27 14:53:14

by Pasha Tatashin

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Tue, Oct 26, 2021 at 5:34 PM Pasha Tatashin
<[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 3:50 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Oct 26, 2021 at 05:38:15PM +0000, Pasha Tatashin wrote:
> > > static inline void page_ref_add(struct page *page, int nr)
> > > {
> > > - atomic_add(nr, &page->_refcount);
> > > + int ret;
> > > +
> > > + VM_BUG_ON(nr <= 0);
> > > + ret = atomic_add_return(nr, &page->_refcount);
> > > + VM_BUG_ON_PAGE(ret <= 0, page);
> >
> > This isn't right. _refcount is allowed to overflow into the negatives.
> > See page_ref_zero_or_close_to_overflow() and the conversations that led
> > to it being added.
>
> #define page_ref_zero_or_close_to_overflow(page) \
> 	((unsigned int) page_ref_count(page) + 127u <= 127u)
>
>
> Uh, right, I saw the macro but did not realize there was an (unsigned int) cast.
>
> OK, I think we can move this macro inside:
> include/linux/page_ref.h
>
> modify it to something like this:
> #define page_ref_zero_or_close_to_overflow(page) \
> ((unsigned int) page_ref_count(page) + v + 127u <= v + 127u)
>
> The sub/dec can also be fixed to ensure that we do not underflow while
> still working with the fact that we use all 32 bits of _refcount.

I think we can do that by using:
atomic_fetch_*() and check for overflow/underflow after operation. I
will send the updated series soon.
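
Something along these lines for the sub/dec paths (an untested sketch):

	int old = atomic_fetch_sub(nr, &page->_refcount);

	/* underflow: the value before the subtraction was already below nr */
	VM_BUG_ON_PAGE(old < nr, page);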

Pasha

2021-10-27 16:42:44

by Matthew Wilcox

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Tue, Oct 26, 2021 at 09:21:29PM -0400, Pasha Tatashin wrote:
> I think we can do that by using:
> atomic_fetch_*() and check for overflow/underflow after operation. I
> will send the updated series soon.

Please include performance measurements. You're adding code to some hot
paths.

2021-10-27 17:32:27

by John Hubbard

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 10/26/21 11:21, Pasha Tatashin wrote:
> It must return the same thing; if it does not, we have a bug in our
> kernel which may lead to memory corruptions and security holes.
>
> So today we have this:
> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> < What if something modified here? Hmm..>
> set_page_count(page, 1); -> Yet we reset it to 1.
>
> With my proposed change:
> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> refcnt = page_ref_inc_return(page); -> ref_count better be 1.
> VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
>

Yes, you are just repeating what the diffs say.

But it's still not good to have this function name doing something completely
different than its name indicates.

>>
>> I understand where this patchset is going, but this intermediate step is
>> not a good move.
>>
>> Also, for the overall series, if you want to change from
>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>> return something like -EBUSY if incrementing the value results in a
>> surprise, and let the caller decide how to handle it.
>
> Actually, -EBUSY would be OK if the problems were because we failed to
> modify refcount for some reason, but if we modified refcount and got
> an unexpected value (i.e underflow/overflow) we better report it right
> away instead of waiting for memory corruption to happen.
>

Having the caller do the BUG() or VM_BUG*() is not a significant delay.


thanks,
--
John Hubbard
NVIDIA

2021-10-27 18:30:58

by Geert Uytterhoeven

Subject: Re: [RFC 6/8] mm: rename init_page_count() -> page_ref_init()

On Tue, Oct 26, 2021 at 7:38 PM Pasha Tatashin
<[email protected]> wrote:
> Now that set_page_count() is no longer called from outside and is about
> to be removed, init_page_count() is the only remaining function that
> unconditionally sets _refcount, and it is restricted to setting it only
> to 1.
>
> Make init_page_count() consistent with the other page_ref_*
> functions by renaming it.
>
> Signed-off-by: Pasha Tatashin <[email protected]>

> arch/m68k/mm/motorola.c | 2 +-

Acked-by: Geert Uytterhoeven <[email protected]>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2021-10-27 20:45:16

by Muchun Song

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Wed, Oct 27, 2021 at 1:38 AM Pasha Tatashin
<[email protected]> wrote:
>
> The problems with page->_refcount are hard to debug, because usually
> when they are detected, the damage has occurred a long time ago. Yet,
> the problems with invalid page refcount may be catastrophic and lead to
> memory corruptions.
>
> Reduce the scope of when the _refcount problems manifest themselves by
> adding checks for underflows and overflows into functions that modify
> _refcount.
>
> Signed-off-by: Pasha Tatashin <[email protected]>

I found that some atomic_add/dec calls are replaced with atomic_add/dec_return;
those helpers with a return value imply a full memory barrier, but the
ones without a return value do not. Do you have any numbers to show
the impact? Maybe atomic_add/dec_return_relaxed can help with this.

Thanks.

2021-10-27 21:33:05

by Pasha Tatashin

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Tue, Oct 26, 2021 at 11:05 PM Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Oct 26, 2021 at 09:21:29PM -0400, Pasha Tatashin wrote:
> > I think we can do that by using:
> > atomic_fetch_*() and check for overflow/underflow after operation. I
> > will send the updated series soon.
>
> Please include performance measurements. You're adding code to some hot
> paths.

I will do boot performance tests to ensure we do not regress
initializing "struct pages" on larger machines. Do you have a
suggestion for another perf test?

Thanks,
Pasha

2021-10-27 21:33:34

by Pasha Tatashin

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <[email protected]> wrote:
>
> On 10/26/21 11:21, Pasha Tatashin wrote:
> > It must return the same thing; if it does not, we have a bug in our
> > kernel which may lead to memory corruptions and security holes.
> >
> > So today we have this:
> > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> > < What if something modified here? Hmm..>
> > set_page_count(page, 1); -> Yet we reset it to 1.
> >
> > With my proposed change:
> > VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
> > refcnt = page_ref_inc_return(page); -> ref_count better be 1.
> > VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
> >
>
> Yes, you are just repeating what the diffs say.
>
> But it's still not good to have this function name doing something completely
> different than its name indicates.

I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?

>
> >>
> >> I understand where this patchset is going, but this intermediate step is
> >> not a good move.
> >>
> >> Also, for the overall series, if you want to change from
> >> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> >> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> >> return something like -EBUSY if incrementing the value results in a
> >> surprise, and let the caller decide how to handle it.
> >
> > Actually, -EBUSY would be OK if the problems were because we failed to
> > modify refcount for some reason, but if we modified refcount and got
> > an unexpected value (i.e underflow/overflow) we better report it right
> > away instead of waiting for memory corruption to happen.
> >
>
> Having the caller do the BUG() or VM_BUG*() is not a significant delay.

We cannot guarantee that new callers in the future will check return
values; the idea behind this work is to ensure that we are always
protected from refcount underflow/overflow and from invalid refcount
modifications via set_page_count().

Pasha

2021-10-27 21:35:07

by Pasha Tatashin

Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

> I found that some atomic_add/dec calls are replaced with atomic_add/dec_return;

I am going to replace the -return variants with -fetch variants, potentially the -fetch_relaxed ones.

> those helpers with a return value imply a full memory barrier, but the
> ones without a return value do not. Do you have any numbers to show
> the impact? Maybe atomic_add/dec_return_relaxed can help with this.

The generic variant uses arch_cmpxchg() for all atomic variants
without any extra barriers. Therefore, on platforms that use generic
implementations there won't be performance differences except for an
extra branch that checks results when VM_BUG_ON is enabled.

On x86 the difference between the two is the following

atomic_add:
lock add %eax,(%rsi)

atomic_fetch_add:
lock xadd %eax,(%rsi)

atomic_fetch_add_relaxed:
lock xadd %eax,(%rsi)

No difference between the relaxed and non-relaxed variants. However, we
use lock xadd instead of lock add. I am not sure whether the performance
difference is going to be noticeable.

Pasha

2021-10-28 01:22:29

by John Hubbard

Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 10/27/21 11:27, Pasha Tatashin wrote:
> On Wed, Oct 27, 2021 at 1:12 AM John Hubbard <[email protected]> wrote:
>>
>> On 10/26/21 11:21, Pasha Tatashin wrote:
>>> It must return the same thing; if it does not, we have a bug in our
>>> kernel which may lead to memory corruption and security holes.
>>>
>>> So today we have this:
>>> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>>> < What if something modified here? Hmm..>
>>> set_page_count(page, 1); -> Yet we reset it to 1.
>>>
>>> With my proposed change:
>>> VM_BUG_ON_PAGE(page_ref_count(page), page); -> check ref_count is 0
>>> refcnt = page_ref_inc_return(page); -> ref_count better be 1.
>>> VM_BUG_ON_PAGE(refcnt != 1, page); -> Verify that it is 1.
>>>
>>
>> Yes, you are just repeating what the diffs say.
>>
>> But it's still not good to have this function name doing something completely
>> different than its name indicates.
>
> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>

What? No, that's not where I was going at all. The function is already
named set_page_refcounted(), and one of the problems I see is that your
changes turn it into something that most certainly does not
set_page_refcounted(). Instead, this patch *increments* the refcount.
That is not the same thing.

And then it uses a .config-sensitive assertion to "prevent" problems.
And by that I mean, the wording throughout this series seems to equate
VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
however, in CONFIG_DEBUG_VM configurations, and provide no protection at
all for normal (most distros) users. That's something that the wording,
comments, and even design should be tweaked to account for.


>>
>>>>
>>>> I understand where this patchset is going, but this intermediate step is
>>>> not a good move.
>>>>
>>>> Also, for the overall series, if you want to change from
>>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>>>> return something like -EBUSY if incrementing the value results in a
>>>> surprise, and let the caller decide how to handle it.
>>>
>>> Actually, -EBUSY would be OK if the problems were because we failed to
>>> modify refcount for some reason, but if we modified refcount and got
> >>> an unexpected value (i.e. underflow/overflow) we better report it right
>>> away instead of waiting for memory corruption to happen.
>>>
>>
>> Having the caller do the BUG() or VM_BUG*() is not a significant delay.
>
> We cannot guarantee that new callers in the future will check return
> values; the idea behind this work is to ensure that we are always
> protected from refcount underflow/overflow and from invalid refcount
> modifications via set_page_count().
>

I don't have a problem with putting assertions closest to where they should
fire. That's a good thing. I'm looking here for ways to fix up the problems
listed in the points above, though.

And I do want to point out another thing, though, and that is: generally, we
don't have to program to quite the level of defensiveness you seem to be at.
If return values must be checked, they usually are in the kernel--and we even
have tooling to enforce it:

/*
* gcc:
https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-warn_005funused_005fresult-function-attribute
* clang: https://clang.llvm.org/docs/AttributeReference.html#nodiscard-warn-unused-result
*/
#define __must_check __attribute__((__warn_unused_result__))
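
As a hypothetical illustration (not from the series): annotating a helper
this way makes a caller that drops the result get a compile-time warning.
The body below paraphrases the existing page_ref_freeze(), minus its
tracepoint hook:

static inline int __must_check page_ref_freeze(struct page *page, int count)
{
	/* Returns 1 if _refcount was 'count' and is now frozen to 0. */
	return likely(atomic_cmpxchg(&page->_refcount, count, 0) == count);
}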


Please take that into consideration when weighing tradeoffs, just sort of in
general.



thanks,
--
John Hubbard
NVIDIA

2021-10-28 01:38:18

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 10/27/21 18:20, John Hubbard wrote:
>>> But it's still not good to have this function name doing something completely
>>> different than its name indicates.
>>
>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>
>
> What? No, that's not where I was going at all. The function is already
> named set_page_refcounted(), and one of the problems I see is that your
> changes turn it into something that most certainly does not
> set_page_refcounted(). Instead, this patch *increments* the refcount.
> That is not the same thing.
>
> And then it uses a .config-sensitive assertion to "prevent" problems.
> And by that I mean, the wording throughout this series seems to equate
> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> all for normal (most distros) users. That's something that the wording,
> comments, and even design should be tweaked to account for.

...and to clarify a bit more, maybe this also helps:

These patches are attempting to improve debugging, and that is fine, as
far as debugging goes. However, a point that seems to be slightly
misunderstood is: incrementing a bad refcount value is not actually any
better than overwriting it, from a recovery point of view. Maybe (?)
it's better from a debugging point of view.

That's because the problem occurred before this code, and its debug-only
assertions, ran. Once here, the code cannot actually recover: there is
no automatic way to recover from a refcount that is 1, -1, 2, or 706,
when it was supposed to be zero. Incrementing it is, again, not really
necessarily better than setting: setting it might actually make the
broken system appear to run--and in some cases, even avoid symptoms.
Whereas incrementing doesn't cover anything up. The only thing you can
really do is just panic() or BUG().

Don't get me wrong, I don't want bugs covered up. But the claim that
incrementing is somehow better deserves some actual thinking about it.

Overall, I'm inclined to *not* switch anything over to incrementing the
refcounts. Instead, go ahead and:

a) Add assertions up to a "reasonable" point (some others have pointed
out that you don't need quite all of the assertions you've added).

b) Remove set_page_count() calls where possible--maybe everywhere.

c) Fix any bugs found along the way.

d) ...but, leave the basic logic as-is: no changing over to
page_ref_inc_return().

Anyway, that's my take on it.

thanks,
--
John Hubbard
NVIDIA

2021-10-28 04:10:11

by Muchun Song

[permalink] [raw]
Subject: Re: [RFC 1/8] mm: add overflow and underflow checks for page->_refcount

On Thu, Oct 28, 2021 at 2:22 AM Pasha Tatashin
<[email protected]> wrote:
>
> > I found some atomic_add/dec are replaced with atomic_add/dec_return,
>
> I am going to replace the -return variants with -fetch variants, potentially the relaxed -fetch ones.
>
> > those helpers with return value imply a full memory barrier around it, but
> > others without return value do not. Do you have any numbers to show
> > the impact? Maybe atomic_add/dec_return_relaxed can help this.
>
> The generic variant uses arch_cmpxchg() for all atomic variants
> without any extra barriers. Therefore, on platforms that use generic
> implementations there won't be performance differences except for an
> extra branch that checks results when VM_BUG_ON is enabled.
>
> On x86 the difference between the two is the following:
>
> atomic_add:
> lock add %eax,(%rsi)
>
> atomic_fetch_add:
> lock xadd %eax,(%rsi)
>
> atomic_fetch_add_relaxed:
> lock xadd %eax,(%rsi)
>
> No difference between the relaxed and non-relaxed variants. However, we

Right. There is no difference on x86. Maybe there are differences on
other architectures.
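
For example (illustrative, and worth double-checking the exact instruction
selection): on arm64 with LSE atomics, the ordering does change the
instruction used:

/*
 * arm64 + LSE (illustrative):
 *   atomic_fetch_add()         -> ldaddal  (fully ordered)
 *   atomic_fetch_add_relaxed() -> ldadd    (no ordering guarantees)
 */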

> now use lock xadd instead of lock add. I am not sure whether the performance
> difference is going to be measurable.
>
> Pasha

2021-11-01 14:25:19

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

> >> Yes, you are just repeating what the diffs say.
> >>
> >> But it's still not good to have this function name doing something completely
> >> different than its name indicates.
> >
> > I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
> >
>
> What? No, that's not where I was going at all. The function is already
> named set_page_refcounted(), and one of the problems I see is that your
> changes turn it into something that most certainly does not
> set_page_refcounted(). Instead, this patch *increments* the refcount.
> That is not the same thing.
>
> And then it uses a .config-sensitive assertion to "prevent" problems.
> And by that I mean, the wording throughout this series seems to equate
> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> all for normal (most distros) users. That's something that the wording,
> comments, and even design should be tweaked to account for.

VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
sensitive, but in both cases *BUG_ON() means that there is an
unrecoverable problem that occurred. The only difference between the
two is that VM_BUG_ON() is not enabled when distros decide to reduce
the size of their kernel and improve runtime performance by skipping
some extra checking.

There is no logical separation between VM_BUG_ON and BUG_ON; there has
been a lengthy discussion about this:

https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/
"so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
just allows some people to build smaller kernels, but apparently
distro people would rather have debugging than save a few kB of RAM."
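
For reference, the config sensitivity being discussed boils down to roughly
this (paraphrasing include/linux/mmdebug.h):

#ifdef CONFIG_DEBUG_VM
#define VM_BUG_ON(cond)		BUG_ON(cond)
#else
/* The condition is still compile-checked, but no runtime test is emitted. */
#define VM_BUG_ON(cond)		BUILD_BUG_ON_INVALID(cond)
#endif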

Losing control of ref_count is an unrecoverable problem because it
leads to security-sensitive memory corruption. It is better to crash
the kernel when that happens instead of ending up with some pages
mapped into the wrong address space.

The races are tricky to spot, but set_page_count() is inherently
dangerous, so I am removing it entirely and replacing it with safer
operations which do the same thing.

One example is this:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29
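
For the curious, the gist of that fix (a hedged summary) was to replace an
unconditional refcount reset with an atomic verify-and-freeze:

/* Before (racy): assume p's refcount is 1, then overwrite it. */
set_page_count(p, 0);

/* After: atomically check-and-freeze; bail out if a speculative
 * reference was taken concurrently.
 */
if (!page_ref_freeze(p, 1))
	goto out_error;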

> >>>> I understand where this patchset is going, but this intermediate step is
> >>>> not a good move.
> >>>>
> >>>> Also, for the overall series, if you want to change from
> >>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
> >>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
> >>>> return something like -EBUSY if incrementing the value results in a
> >>>> surprise, and let the caller decide how to handle it.

In set_page_refcounted() we already have:

VM_BUG_ON_PAGE(page_ref_count(page), page);
set_page_count(page, 1);

I am pointing out that the above code is racy:

Between the VM_BUG_ON_PAGE() check and the unconditional set to 1,
the value of page->_refcount can change.

I am replacing it with a functionally identical version that is not racy.
There is no need to complicate the code by introducing new -EBUSY
returns here, as it would only make this code even more fragile.
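
Concretely, the non-racy replacement looks roughly like this (a sketch
matching the pseudocode earlier in the thread; page_ref_inc_return() is
the helper this series introduces):

static inline void set_page_refcounted(struct page *page)
{
	int refcnt;

	VM_BUG_ON_PAGE(PageTail(page), page);
	VM_BUG_ON_PAGE(page_ref_count(page), page);
	/* Atomically increment, then verify we observed the 0 -> 1 transition. */
	refcnt = page_ref_inc_return(page);
	VM_BUG_ON_PAGE(refcnt != 1, page);
}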

> >>> Actually, -EBUSY would be OK if the problems were because we failed to

I am not sure -EBUSY would be OK here: it means we had a race which we
were not aware of, and which could have led to memory corruption.

> >>> modify refcount for some reason, but if we modified refcount and got
> >>> an unexpected value (i.e. underflow/overflow) we better report it right
> >>> away instead of waiting for memory corruption to happen.
> >>>
> >>
> >> Having the caller do the BUG() or VM_BUG*() is not a significant delay.

I agree; however, helper functions exist to remove code duplication.
If we must verify the assumption of set_page_refcounted() that a non-
counted page is turned into a counted page, it is better to do it in
one place than at every call site. We do it today in this helper
function, and I do not see why we would change that.

2021-11-01 14:33:59

by Pasha Tatashin

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <[email protected]> wrote:
>
> On 10/27/21 18:20, John Hubbard wrote:
> >>> But it's still not good to have this function name doing something completely
> >>> different than its name indicates.
> >>
> >> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
> >>
> >
> > What? No, that's not where I was going at all. The function is already
> > named set_page_refcounted(), and one of the problems I see is that your
> > changes turn it into something that most certainly does not
> > set_page_refcounted(). Instead, this patch *increments* the refcount.
> > That is not the same thing.
> >
> > And then it uses a .config-sensitive assertion to "prevent" problems.
> > And by that I mean, the wording throughout this series seems to equate
> > VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
> > however, in CONFIG_DEBUG_VM configurations, and provide no protection at
> > all for normal (most distros) users. That's something that the wording,
> > comments, and even design should be tweaked to account for.
>
> ...and to clarify a bit more, maybe this also helps:
>
> These patches are attempting to improve debugging, and that is fine, as

They are attempting to catch potential race conditions where
_refcount is changed between the time we verified what it was and we
set it to something else.

They also attempt to prevent overflow and underflow bugs, which are
not all caught today but can be detected with this patch set, at least
on kernels where DEBUG_VM is enabled.

> far as debugging goes. However, a point that seems to be slightly
> misunderstood is: incrementing a bad refcount value is not actually any
> better than overwriting it, from a recovery point of view. Maybe (?)
> it's better from a debugging point of view.

It is better for debugging as well: if one is tracing the page
_refcount history, knowing that the _refcount can only be
incremented/decremented/frozen/unfrozen provides a contiguous history
of the refcount that can be tracked. When we instead set the refcount in
some places, as we do today, the contiguous history is lost, as we do not
know the actual _refcount value at the time of the set operation.

>
> That's because the problem occurred before this code, and its debug-only
> assertions, ran. Once here, the code cannot actually recover: there is
> no automatic way to recover from a refcount that is 1, -1, 2, or 706,
> when it was supposed to be zero. Incrementing it is, again, not really
> necessarily better than setting: setting it might actually make the
> broken system appear to run--and in some cases, even avoid symptoms.
> Whereas incrementing doesn't cover anything up. The only thing you can
> really do is just panic() or BUG().

This is what my patch series attempts to do. I chose to use VM_BUG_ON()
instead of BUG_ON() because this is VM code, and to avoid potential
performance regressions for those who choose performance over possible
security implications.

>
> Don't get me wrong, I don't want bugs covered up. But the claim that
> incrementing is somehow better deserves some actual thinking about it.

I think it does; I described my points above. If you still disagree,
please let me know.

Thank you for providing your thoughts on this RFC. I will send out a
new version, and we can continue the discussion in the new thread.

Pasha

2021-11-01 19:32:52

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 11/1/21 07:22, Pasha Tatashin wrote:
>>>> Yes, you are just repeating what the diffs say.
>>>>
>>>> But it's still not good to have this function name doing something completely
>>>> different than its name indicates.
>>>
>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>
>>
>> What? No, that's not where I was going at all. The function is already
>> named set_page_refcounted(), and one of the problems I see is that your
>> changes turn it into something that most certainly does not
>> set_page_refcounted(). Instead, this patch *increments* the refcount.
>> That is not the same thing.
>>
>> And then it uses a .config-sensitive assertion to "prevent" problems.
>> And by that I mean, the wording throughout this series seems to equate
>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>> all for normal (most distros) users. That's something that the wording,
>> comments, and even design should be tweaked to account for.
>
> VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
> sensitive, but in both cases *BUG_ON() means that there is an
> unrecoverable problem that occurred. The only difference between the
> two is that VM_BUG_ON() is not enabled when distros decide to reduce
> the size of their kernel and improve runtime performance by skipping
> some extra checking.
>
> There is no logical separation between VM_BUG_ON and BUG_ON; there has
> been a lengthy discussion about this:
>
> https://lore.kernel.org/lkml/CA+55aFy6a8BVWtqgeJKZuhU-CZFVZ3X90SdQ5z+NTDDsEOnpJA@mail.gmail.com/
> "so *no*. VM_BUG_ON() is no less deadly than a regular BUG_ON(). It
> just allows some people to build smaller kernels, but apparently
> distro people would rather have debugging than save a few kB of RAM."
>
> Losing control of ref_count is an unrecoverable problem because it
> leads to security-sensitive memory corruption. It is better to crash
> the kernel when that happens instead of ending up with some pages
> mapped into the wrong address space.
>
> The races are tricky to spot, but set_page_count() is inherently
> dangerous, so I am removing it entirely and replacing it with safer
> operations which do the same thing.
>
> One example is this:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=7118fc29

I don't think we have any disagreement about the landscape here. But it's
much easier to describe the problem than it is to fix it--as always. And
repeating the problem doesn't make a proposed fix more (or less) appropriate. :)

>
>>>>>> I understand where this patchset is going, but this intermediate step is
>>>>>> not a good move.
>>>>>>
>>>>>> Also, for the overall series, if you want to change from
>>>>>> "set_page_count()" to "inc_and_verify_val_equals_one()", then the way to
>>>>>> do that is *not* to depend solely on VM_BUG*() to verify. Instead,
>>>>>> return something like -EBUSY if incrementing the value results in a
>>>>>> surprise, and let the caller decide how to handle it.
>
> In set_page_refcounted() we already have:
>
> VM_BUG_ON_PAGE(page_ref_count(page), page);
> set_page_count(page, 1);
>
> I am pointing out that the above code is racy:
>
> Between the VM_BUG_ON_PAGE() check and the unconditional set to 1,
> the value of page->_refcount can change.
>
> I am replacing it with a functionally identical version that is not racy.

And I'm pointing out that raciness is not the real bug, or at least, not
the only bug. "Fixing" the race does not fix the code, but the patch
series seems to imply that it does.

> There is no need to complicate the code by introducing new -EBUSY
> returns here, as it would only make this code even more fragile.
>
>>>>> Actually, -EBUSY would be OK if the problems were because we failed to
>
> I am not sure -EBUSY would be OK here: it means we had a race which we
> were not aware of, and which could have led to memory corruption.
>
>>>>> modify refcount for some reason, but if we modified refcount and got
>>>>> an unexpected value (i.e. underflow/overflow) we better report it right
>>>>> away instead of waiting for memory corruption to happen.
>>>>>
>>>>
>>>> Having the caller do the BUG() or VM_BUG*() is not a significant delay.
>
> I agree; however, helper functions exist to remove code duplication.
> If we must verify the assumption of set_page_refcounted() that a non-
> counted page is turned into a counted page, it is better to do it in
> one place than at every call site. We do it today in this helper
> function, and I do not see why we would change that.
>

Let's ignore this -EBUSY idea for now, because I'm not sure where you are
going with your next version, and maybe it won't even come up.


thanks,
--
John Hubbard
NVIDIA

2021-11-01 19:37:57

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 11/1/21 07:30, Pasha Tatashin wrote:
> On Wed, Oct 27, 2021 at 9:35 PM John Hubbard <[email protected]> wrote:
>>
>> On 10/27/21 18:20, John Hubbard wrote:
>>>>> But it's still not good to have this function name doing something completely
>>>>> different than its name indicates.
>>>>
>>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>>
>>>
>>> What? No, that's not where I was going at all. The function is already
>>> named set_page_refcounted(), and one of the problems I see is that your
>>> changes turn it into something that most certainly does not
>>> set_page_refcounted(). Instead, this patch *increments* the refcount.
>>> That is not the same thing.
>>>
>>> And then it uses a .config-sensitive assertion to "prevent" problems.
>>> And by that I mean, the wording throughout this series seems to equate
>>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>>> all for normal (most distros) users. That's something that the wording,
>>> comments, and even design should be tweaked to account for.
>>
>> ...and to clarify a bit more, maybe this also helps:
>>
>> These patches are attempting to improve debugging, and that is fine, as
>
> They are attempting to catch potential race conditions where
> _refcount is changed between the time we verified what it was and we
> set it to something else.
>
> They also attempt to prevent overflow and underflow bugs, which are
> not all caught today but can be detected with this patch set, at least
> on kernels where DEBUG_VM is enabled.

OK, but did you get my point about the naming problem?

>
>> far as debugging goes. However, a point that seems to be slightly
>> misunderstood is: incrementing a bad refcount value is not actually any
>> better than overwriting it, from a recovery point of view. Maybe (?)
>> it's better from a debugging point of view.
>
> It is better for debugging as well: if one is tracing the page
> _refcount history, knowing that the _refcount can only be
> incremented/decremented/frozen/unfrozen provides a contiguous history
> of the refcount that can be tracked. When we instead set the refcount in
> some places, as we do today, the contiguous history is lost, as we do not
> know the actual _refcount value at the time of the set operation.
>

OK, that is a reasonable argument. Let's put it somewhere, maybe in a
comment block, if it's not already there.

>>
>> That's because the problem occurred before this code, and its debug-only
>> assertions, ran. Once here, the code cannot actually recover: there is
>> no automatic way to recover from a refcount that is 1, -1, 2, or 706,
>> when it was supposed to be zero. Incrementing it is, again, not really
>> necessarily better than setting: setting it might actually make the
>> broken system appear to run--and in some cases, even avoid symptoms.
>> Whereas incrementing doesn't cover anything up. The only thing you can
>> really do is just panic() or BUG().
>
> This is what my patch series attempts to do. I chose to use VM_BUG_ON()
> instead of BUG_ON() because this is VM code, and to avoid potential
> performance regressions for those who choose performance over possible
> security implications.

Yes, the VM_BUG_ON() vs. BUG_ON() situation is awkward. But you cannot rely on VM_BUG_ON()
to stop the system, even if Fedora does turn it on.

>
>>
>> Don't get me wrong, I don't want bugs covered up. But the claim that
>> incrementing is somehow better deserves some actual thinking about it.
>
> I think it does; I described my points above. If you still disagree,
> please let me know.
>
> Thank you for providing your thoughts on this RFC. I will send out a
> new version, and we can continue the discussion in the new thread.
>
> Pasha
>

Yes, let's see what it looks like.

thanks,
--
John Hubbard
NVIDIA

2021-11-01 19:44:06

by John Hubbard

[permalink] [raw]
Subject: Re: [RFC 3/8] mm: Avoid using set_page_count() in set_page_recounted()

On 11/1/21 07:22, Pasha Tatashin wrote:
>>>> Yes, you are just repeating what the diffs say.
>>>>
>>>> But it's still not good to have this function name doing something completely
>>>> different than its name indicates.
>>>
>>> I see, I can rename it to: 'set_page_recounted/get_page_recounted' ?
>>>
>>
>> What? No, that's not where I was going at all. The function is already
>> named set_page_refcounted(), and one of the problems I see is that your
>> changes turn it into something that most certainly does not
>> set_page_refcounted(). Instead, this patch *increments* the refcount.
>> That is not the same thing.
>>
>> And then it uses a .config-sensitive assertion to "prevent" problems.
>> And by that I mean, the wording throughout this series seems to equate
>> VM_BUG_ON_PAGE() assertions with real assertions. They are only active,
>> however, in CONFIG_DEBUG_VM configurations, and provide no protection at
>> all for normal (most distros) users. That's something that the wording,
>> comments, and even design should be tweaked to account for.
>
> VM_BUG_ON and BUG_ON should be treated the same. Yes, they are config
> sensitive, but in both cases *BUG_ON() means that there is an
> unrecoverable problem that occurred. The only difference between the
> two is that VM_BUG_ON() is not enabled when distros decide to reduce
> the size of their kernel and improve runtime performance by skipping
> some extra checking.
>
> There is no logical separation between VM_BUG_ON and BUG_ON; there has
> been a lengthy discussion about this:

Actually I do want to mention one more thing about this, before we move
on to the next version of the patchset. The above is inaccurate. The
intent of VM_BUG_ON() and BUG_ON() is similar, but there is *definitely*
a logical separation between them: they do not behave the same at runtime.

Just because some distros enable VM_BUG_ON() does not mean that we can
treat it the same as BUG_ON() in "both directions". From the "don't
BUG()-crash the kernel unless really warranted" direction, they are about
the same, as Linus keeps repeating. From the other direction, though ("I
need to BUG()-crash the kernel"), they are NOT the same. BUG_ON() is more
reliably available.

And that's the essence of my objection to treating this as if you have
reliably stopped the kernel with a VM_BUG_ON(). It's not really the same!


thanks,
--
John Hubbard
NVIDIA