RFC v2:
https://lore.kernel.org/linux-mm/[email protected]/
This patch series moves the PG_slab page flag to page_type,
freeing one bit in page->flags, and introduces the %pGt format,
which prints a human-readable page_type just as %pGp does for page flags.
See the changelog of patch 2 for more implementation details.
Thanks to everyone who gave valuable comments.
v2 -> v3:
- dropped show_page_types(), on the grounds that tracking the refcount
  of non-user-mapped pages is not interesting.
- added patch 1, which cleans up MF_MSG_SLAB in hwpoison
- split the implementation of %pGt and its use into separate patches.
- instead of printing "no page_type for ..." in %pGt, just print
  '0x<value>()' for a page that does not use the page_type field.
Hyeonggon Yoo (4):
mm/hwpoison: remove MF_MSG_SLAB from action_page_types
mm: move PG_slab flag to page_type
mm, printk: introduce new format %pGt for page_type
mm/debug: use %pGt to print page_type in dump_page()
Documentation/core-api/printk-formats.rst | 3 +-
fs/proc/page.c | 13 ++--
include/linux/mm_types.h | 11 ++--
include/linux/page-flags.h | 77 ++++++++++++++++-------
include/trace/events/mmflags.h | 8 ++-
kernel/crash_core.c | 3 +-
lib/test_printf.c | 26 ++++++++
lib/vsprintf.c | 21 +++++++
mm/debug.c | 7 +++
mm/internal.h | 1 +
mm/memory-failure.c | 10 ---
mm/slab.c | 44 ++++++++-----
mm/slab.h | 3 +-
13 files changed, 162 insertions(+), 65 deletions(-)
--
2.32.0
As Naoya pointed out [1], identify_page_state() is never
called when handling a memory error on a slab page.
Clean this up before moving the PG_slab flag to page_type in a later patch.
[1] https://lore.kernel.org/linux-mm/Y2s+dnBsHAJu19ob@hyeyoo/#r
Suggested-by: Naoya Horiguchi <[email protected]>
Signed-off-by: Hyeonggon Yoo <[email protected]>
---
mm/memory-failure.c | 10 ----------
1 file changed, 10 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c77a9e37e27e..74ad1db989e3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -783,7 +783,6 @@ static const char *action_name[] = {
static const char * const action_page_types[] = {
[MF_MSG_KERNEL] = "reserved kernel page",
[MF_MSG_KERNEL_HIGH_ORDER] = "high-order kernel page",
- [MF_MSG_SLAB] = "kernel slab page",
[MF_MSG_DIFFERENT_COMPOUND] = "different compound page after locking",
[MF_MSG_HUGE] = "huge page",
[MF_MSG_FREE_HUGE] = "free huge page",
@@ -1146,7 +1145,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
#define mlock (1UL << PG_mlocked)
#define lru (1UL << PG_lru)
#define head (1UL << PG_head)
-#define slab (1UL << PG_slab)
#define reserved (1UL << PG_reserved)
static struct page_state error_states[] = {
@@ -1156,13 +1154,6 @@ static struct page_state error_states[] = {
* PG_buddy pages only make a small fraction of all free pages.
*/
- /*
- * Could in theory check if slab page is free or if we can drop
- * currently unused objects without touching them. But just
- * treat it as standard kernel for now.
- */
- { slab, slab, MF_MSG_SLAB, me_kernel },
-
{ head, head, MF_MSG_HUGE, me_huge_page },
{ sc|dirty, sc|dirty, MF_MSG_DIRTY_SWAPCACHE, me_swapcache_dirty },
@@ -1189,7 +1180,6 @@ static struct page_state error_states[] = {
#undef mlock
#undef lru
#undef head
-#undef slab
#undef reserved
/*
--
2.32.0
Currently only SLAB uses the _mapcount field to store the number of
active objects in a slab; the other slab allocators do not. As 16 bits
are enough for that counter, use the remaining 16 bits of _mapcount as
page_type even when SLAB is used, and then move the PG_slab flag to
page_type.
As suggested by Matthew, store the number of active objects in negated
form and use helpers when accessing or modifying it.
page_type is always placed in the upper half of _mapcount so that a
normal _mapcount value cannot be mistaken for a page_type. As an
underflow or overflow of page_mapcount() is no longer a concern, more
of the remaining bits can be used for page types.
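To illustrate the encoding (a worked example using the constants from
this patch; not additional code):
	page_type                        0xffffffff  (fresh page, no type)
	after __SetPageSlab()            0xffefffff  (PG_slab bit cleared)
	after 3 x slab_inc_active()      0xffeffffc  (counter stored negated)
	slab_get_active()                ~(0xffeffffc | 0x00100000) == 3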
Add folio helpers to PAGE_TYPE_OPS() so that the existing users of the
slab flag keep working. To preserve the current behaviour, apply a page
policy in PAGE_TYPE_OPS() and use PF_NO_TAIL for slab pages and PF_ANY
for the others.
Remove the PG_slab check from PAGE_FLAGS_CHECK_AT_FREE; the buddy
allocator still checks that _mapcount is properly set at free time.
Note that with this change, page_mapped() and folio_mapped() always return
false for a slab page.
Signed-off-by: Hyeonggon Yoo <[email protected]>
---
fs/proc/page.c | 13 ++----
include/linux/mm_types.h | 11 +++--
include/linux/page-flags.h | 77 ++++++++++++++++++++++++----------
include/trace/events/mmflags.h | 1 -
kernel/crash_core.c | 3 +-
mm/slab.c | 44 ++++++++++++-------
mm/slab.h | 3 +-
7 files changed, 98 insertions(+), 54 deletions(-)
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 6249c347809a..e7524f21cefe 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -67,7 +67,7 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
*/
ppage = pfn_to_online_page(pfn);
- if (!ppage || PageSlab(ppage) || page_has_type(ppage))
+ if (!ppage || page_has_type(ppage))
pcount = 0;
else
pcount = page_mapcount(ppage);
@@ -124,11 +124,8 @@ u64 stable_page_flags(struct page *page)
/*
* pseudo flags for the well known (anonymous) memory mapped pages
- *
- * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
- * simple test in page_mapped() is not enough.
*/
- if (!PageSlab(page) && page_mapped(page))
+ if (page_mapped(page))
u |= 1 << KPF_MMAP;
if (PageAnon(page))
u |= 1 << KPF_ANON;
@@ -178,16 +175,14 @@ u64 stable_page_flags(struct page *page)
u |= 1 << KPF_OFFLINE;
if (PageTable(page))
u |= 1 << KPF_PGTABLE;
+ if (PageSlab(page))
+ u |= 1 << KPF_SLAB;
if (page_is_idle(page))
u |= 1 << KPF_IDLE;
u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);
- u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
- if (PageTail(page) && PageSlab(compound_head(page)))
- u |= 1 << KPF_SLAB;
-
u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
u |= kpf_copy_bit(k, KPF_UPTODATE, PG_uptodate);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 3b8475007734..6b04ae65241d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -207,10 +207,13 @@ struct page {
atomic_t _mapcount;
/*
- * If the page is neither PageSlab nor mappable to userspace,
- * the value stored here may help determine what this page
- * is used for. See page-flags.h for a list of page types
- * which are currently stored here.
+ * If the page is not mappable to userspace, the value
+ * stored here may help determine what this page is used for.
+ * See page-flags.h for a list of page types which are currently
+ * stored here.
+ *
+ * Note that only upper half is used for page types and lower
+ * half is reserved for SLAB.
*/
unsigned int page_type;
};
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 69e93a0c1277..07063d60efe3 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -107,7 +107,6 @@ enum pageflags {
PG_workingset,
PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
PG_error,
- PG_slab,
PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
PG_arch_1,
PG_reserved,
@@ -482,7 +481,6 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
-__PAGEFLAG(Slab, slab, PF_NO_TAIL)
__PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
@@ -906,42 +904,72 @@ static inline bool is_page_hwpoison(struct page *page)
}
/*
- * For pages that are never mapped to userspace (and aren't PageSlab),
- * page_type may be used. Because it is initialised to -1, we invert the
- * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and
- * __ClearPageFoo *sets* the bit used for PageFoo. We reserve a few high and
- * low bits so that an underflow or overflow of page_mapcount() won't be
- * mistaken for a page type value.
+ * For pages that are never mapped to userspace, page_type may be used.
+ * Because it is initialised to -1, we invert the sense of the bit,
+ * so __SetPageFoo *clears* the bit used for PageFoo, and __ClearPageFoo
+ * *sets* the bit used for PageFoo. We reserve a few high and low bits
+ * so that an underflow or overflow of page_mapcount() won't be mistaken
+ * for a page type value.
*/
#define PAGE_TYPE_BASE 0xf0000000
-/* Reserve 0x0000007f to catch underflows of page_mapcount */
-#define PAGE_MAPCOUNT_RESERVE -128
-#define PG_buddy 0x00000080
-#define PG_offline 0x00000100
-#define PG_table 0x00000200
-#define PG_guard 0x00000400
+#define PG_buddy 0x00010000
+#define PG_offline 0x00020000
+#define PG_table 0x00040000
+#define PG_guard 0x00080000
+#define PG_slab 0x00100000
#define PageType(page, flag) \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
-static inline int page_has_type(struct page *page)
+#define PAGE_TYPE_MASK 0xffff0000
+
+static inline bool page_type_has_type(unsigned int page_type)
{
- return (int)page->page_type < PAGE_MAPCOUNT_RESERVE;
+ return ((int)page_type < (int)PAGE_TYPE_MASK);
}
-#define PAGE_TYPE_OPS(uname, lname) \
+static inline bool page_has_type(struct page *page)
+{
+ return page_type_has_type(page->page_type);
+}
+
+
+#define PAGE_TYPE_OPS(uname, lname, policy) \
static __always_inline int Page##uname(struct page *page) \
{ \
+ page = policy(page, 0); \
+ return PageType(page, PG_##lname); \
+} \
+static __always_inline int folio_test_##lname(struct folio *folio) \
+{ \
+ struct page *page = &folio->page; \
+ \
return PageType(page, PG_##lname); \
} \
static __always_inline void __SetPage##uname(struct page *page) \
{ \
+ page = policy(page, 1); \
+ VM_BUG_ON_PAGE(!PageType(page, 0), page); \
+ page->page_type &= ~PG_##lname; \
+} \
+static __always_inline void __folio_set_##lname(struct folio *folio) \
+{ \
+ struct page *page = &folio->page; \
+ \
VM_BUG_ON_PAGE(!PageType(page, 0), page); \
page->page_type &= ~PG_##lname; \
} \
static __always_inline void __ClearPage##uname(struct page *page) \
{ \
+ page = policy(page, 1); \
+ VM_BUG_ON_PAGE(!Page##uname(page), page); \
+ page->page_type |= PG_##lname; \
+} \
+static __always_inline void __folio_clear_##lname(struct folio *folio) \
+{ \
+ struct page *page = &folio->page; \
+ \
VM_BUG_ON_PAGE(!Page##uname(page), page); \
page->page_type |= PG_##lname; \
}
@@ -950,7 +978,7 @@ static __always_inline void __ClearPage##uname(struct page *page) \
* PageBuddy() indicates that the page is free and in the buddy system
* (see mm/page_alloc.c).
*/
-PAGE_TYPE_OPS(Buddy, buddy)
+PAGE_TYPE_OPS(Buddy, buddy, PF_ANY)
/*
* PageOffline() indicates that the page is logically offline although the
@@ -974,7 +1002,10 @@ PAGE_TYPE_OPS(Buddy, buddy)
* pages should check PageOffline() and synchronize with such drivers using
* page_offline_freeze()/page_offline_thaw().
*/
-PAGE_TYPE_OPS(Offline, offline)
+PAGE_TYPE_OPS(Offline, offline, PF_ANY)
+
+/* PageSlab() indicates that the page is used by slab subsystem. */
+PAGE_TYPE_OPS(Slab, slab, PF_NO_TAIL)
extern void page_offline_freeze(void);
extern void page_offline_thaw(void);
@@ -984,12 +1015,12 @@ extern void page_offline_end(void);
/*
* Marks pages in use as page tables.
*/
-PAGE_TYPE_OPS(Table, table)
+PAGE_TYPE_OPS(Table, table, PF_ANY)
/*
* Marks guardpages used with debug_pagealloc.
*/
-PAGE_TYPE_OPS(Guard, guard)
+PAGE_TYPE_OPS(Guard, guard, PF_ANY)
extern bool is_free_buddy_page(struct page *page);
@@ -1037,8 +1068,8 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
(1UL << PG_lru | 1UL << PG_locked | \
1UL << PG_private | 1UL << PG_private_2 | \
1UL << PG_writeback | 1UL << PG_reserved | \
- 1UL << PG_slab | 1UL << PG_active | \
- 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
+ 1UL << PG_active | 1UL << PG_unevictable | \
+ __PG_MLOCKED | LRU_GEN_MASK)
/*
* Flags checked when a page is prepped for return by the page allocator.
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..8301912f8c25 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -113,7 +113,6 @@
{1UL << PG_lru, "lru" }, \
{1UL << PG_active, "active" }, \
{1UL << PG_workingset, "workingset" }, \
- {1UL << PG_slab, "slab" }, \
{1UL << PG_owner_priv_1, "owner_priv_1" }, \
{1UL << PG_arch_1, "arch_1" }, \
{1UL << PG_reserved, "reserved" }, \
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 87ef6096823f..1ea8198c26e1 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -482,13 +482,14 @@ static int __init crash_save_vmcoreinfo_init(void)
VMCOREINFO_NUMBER(PG_private);
VMCOREINFO_NUMBER(PG_swapcache);
VMCOREINFO_NUMBER(PG_swapbacked);
- VMCOREINFO_NUMBER(PG_slab);
#ifdef CONFIG_MEMORY_FAILURE
VMCOREINFO_NUMBER(PG_hwpoison);
#endif
VMCOREINFO_NUMBER(PG_head_mask);
#define PAGE_BUDDY_MAPCOUNT_VALUE (~PG_buddy)
VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
+#define PAGE_SLAB_MAPCOUNT_VALUE (~PG_slab)
+ VMCOREINFO_NUMBER(PAGE_SLAB_MAPCOUNT_VALUE);
#ifdef CONFIG_HUGETLB_PAGE
VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
#define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
diff --git a/mm/slab.c b/mm/slab.c
index 7a269db050ee..eee46f71c4b8 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2269,6 +2269,21 @@ void __kmem_cache_release(struct kmem_cache *cachep)
}
}
+static inline unsigned int slab_get_active(struct slab *slab)
+{
+ return ~(slab->page_type | PG_slab);
+}
+
+static inline void slab_inc_active(struct slab *slab)
+{
+ slab->page_type--;
+}
+
+static inline void slab_dec_active(struct slab *slab)
+{
+ slab->page_type++;
+}
+
/*
* Get the memory for a slab management obj.
*
@@ -2291,7 +2306,6 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep,
void *addr = slab_address(slab);
slab->s_mem = addr + colour_off;
- slab->active = 0;
if (OBJFREELIST_SLAB(cachep))
freelist = NULL;
@@ -2510,8 +2524,8 @@ static void *slab_get_obj(struct kmem_cache *cachep, struct slab *slab)
{
void *objp;
- objp = index_to_obj(cachep, slab, get_free_obj(slab, slab->active));
- slab->active++;
+ objp = index_to_obj(cachep, slab, get_free_obj(slab, slab_get_active(slab)));
+ slab_inc_active(slab);
return objp;
}
@@ -2524,7 +2538,7 @@ static void slab_put_obj(struct kmem_cache *cachep,
unsigned int i;
/* Verify double free bug */
- for (i = slab->active; i < cachep->num; i++) {
+ for (i = slab_get_active(slab); i < cachep->num; i++) {
if (get_free_obj(slab, i) == objnr) {
pr_err("slab: double free detected in cache '%s', objp %px\n",
cachep->name, objp);
@@ -2532,11 +2546,11 @@ static void slab_put_obj(struct kmem_cache *cachep,
}
}
#endif
- slab->active--;
+ slab_dec_active(slab);
if (!slab->freelist)
slab->freelist = objp + obj_offset(cachep);
- set_free_obj(slab, slab->active, objnr);
+ set_free_obj(slab, slab_get_active(slab), objnr);
}
/*
@@ -2635,14 +2649,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct slab *slab)
raw_spin_lock(&n->list_lock);
n->total_slabs++;
- if (!slab->active) {
+ if (!slab_get_active(slab)) {
list_add_tail(&slab->slab_list, &n->slabs_free);
n->free_slabs++;
} else
fixup_slab_list(cachep, n, slab, &list);
STATS_INC_GROWN(cachep);
- n->free_objects += cachep->num - slab->active;
+ n->free_objects += cachep->num - slab_get_active(slab);
raw_spin_unlock(&n->list_lock);
fixup_objfreelist_debug(cachep, &list);
@@ -2744,7 +2758,7 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
{
/* move slabp to correct slabp list: */
list_del(&slab->slab_list);
- if (slab->active == cachep->num) {
+ if (slab_get_active(slab) == cachep->num) {
list_add(&slab->slab_list, &n->slabs_full);
if (OBJFREELIST_SLAB(cachep)) {
#if DEBUG
@@ -2783,7 +2797,7 @@ static noinline struct slab *get_valid_first_slab(struct kmem_cache_node *n,
/* Move pfmemalloc slab to the end of list to speed up next search */
list_del(&slab->slab_list);
- if (!slab->active) {
+ if (!slab_get_active(slab)) {
list_add_tail(&slab->slab_list, &n->slabs_free);
n->free_slabs++;
} else
@@ -2865,9 +2879,9 @@ static __always_inline int alloc_block(struct kmem_cache *cachep,
* There must be at least one object available for
* allocation.
*/
- BUG_ON(slab->active >= cachep->num);
+ BUG_ON(slab_get_active(slab) >= cachep->num);
- while (slab->active < cachep->num && batchcount--) {
+ while (slab_get_active(slab) < cachep->num && batchcount--) {
STATS_INC_ALLOCED(cachep);
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
@@ -3162,7 +3176,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
STATS_INC_ACTIVE(cachep);
STATS_SET_HIGH(cachep);
- BUG_ON(slab->active == cachep->num);
+ BUG_ON(slab_get_active(slab) == cachep->num);
obj = slab_get_obj(cachep, slab);
n->free_objects--;
@@ -3297,7 +3311,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
STATS_DEC_ACTIVE(cachep);
/* fixup slab chains */
- if (slab->active == 0) {
+ if (slab_get_active(slab) == 0) {
list_add(&slab->slab_list, &n->slabs_free);
n->free_slabs++;
} else {
@@ -3352,7 +3366,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
struct slab *slab;
list_for_each_entry(slab, &n->slabs_free, slab_list) {
- BUG_ON(slab->active);
+ BUG_ON(slab_get_active(slab));
i++;
}
diff --git a/mm/slab.h b/mm/slab.h
index 7cc432969945..c6ffe6799436 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -20,7 +20,8 @@ struct slab {
};
struct rcu_head rcu_head;
};
- unsigned int active;
+ /* lower half of page_type is used as active objects counter */
+ unsigned int page_type;
#elif defined(CONFIG_SLUB)
--
2.32.0
Some page flags are stored in page_type rather than in the flags field.
Use the newly introduced %pGt format to print page_type in dump_page().
Below are some examples:
page:00000000e47d45a7 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10175e
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffffffff()
raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: just after alloc_pages()
page:00000000e47d45a7 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x10175e
flags: 0x200000000000000(node=0|zone=2)
page_type: 0xffefffff(slab)
raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffefffff 0000000000000000
page dumped because: page with PG_slab set
page:00000000e47d45a7 refcount:1 mapcount:2 mapping:0000000000000000 index:0x0 pfn:0x10175e
flags: 0x200000000000000(node=0|zone=2)
page_type: 0x1()
raw: 0200000000000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 0000000100000001 0000000000000000
page dumped because: page with _mapcount == 1
Signed-off-by: Hyeonggon Yoo <[email protected]>
---
mm/debug.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/debug.c b/mm/debug.c
index 5ce6b359004a..d6a0eb0a9bb8 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -120,6 +120,8 @@ static void __dump_page(struct page *page)
pr_warn("%sflags: %pGp%s\n", type, &head->flags,
page_cma ? " CMA" : "");
+ pr_warn("page_type: %pGt\n", &head->page_type);
+
print_hex_dump(KERN_WARNING, "raw: ", DUMP_PREFIX_NONE, 32,
sizeof(unsigned long), page,
sizeof(struct page), false);
--
2.32.0
The %pGp format is used to print the 'flags' field of struct page.
As some page flags (e.g. PG_buddy, see page-flags.h for more details)
are stored in the page_type field, introduce the %pGt format, which
provides human-readable output of page_type.
Note that the sense of the bits is inverted in page_type: if page_type
is 0xffffffff, no flags are set; if the PG_slab (0x00100000) flag is set,
page_type is 0xffefffff. In other words, setting a type clears the
corresponding bit.
Bits in page_type are inverted when printing page type names.
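A minimal usage sketch (illustrative only; the in-tree user added in a
later patch is dump_page()):
	/* pass a pointer to the page_type field; prints e.g.
	 * "page_type: 0xffefffff(slab)" for a slab page */
	pr_alert("page_type: %pGt\n", &page->page_type);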
Signed-off-by: Hyeonggon Yoo <[email protected]>
---
Documentation/core-api/printk-formats.rst | 3 ++-
include/trace/events/mmflags.h | 7 ++++++
lib/test_printf.c | 26 +++++++++++++++++++++++
lib/vsprintf.c | 21 ++++++++++++++++++
mm/debug.c | 5 +++++
mm/internal.h | 1 +
6 files changed, 62 insertions(+), 1 deletion(-)
diff --git a/Documentation/core-api/printk-formats.rst b/Documentation/core-api/printk-formats.rst
index dbe1aacc79d0..582e965508eb 100644
--- a/Documentation/core-api/printk-formats.rst
+++ b/Documentation/core-api/printk-formats.rst
@@ -575,12 +575,13 @@ The field width is passed by value, the bitmap is passed by reference.
Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
printing cpumask and nodemask.
-Flags bitfields such as page flags, gfp_flags
+Flags bitfields such as page flags, page_type, gfp_flags
---------------------------------------------
::
%pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
+ %pGt 0xffefffff(slab)
%pGg GFP_USER|GFP_DMA32|GFP_NOWARN
%pGv read|exec|mayread|maywrite|mayexec|denywrite
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 8301912f8c25..57f52d00e761 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -138,6 +138,13 @@ IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
__def_pageflag_names \
) : "none"
+#define __def_pagetype_names \
+ {PG_slab, "slab" }, \
+ {PG_offline, "offline" }, \
+ {PG_guard, "guard" }, \
+ {PG_table, "table" }, \
+ {PG_buddy, "buddy" }
+
#if defined(CONFIG_X86)
#define __VM_ARCH_SPECIFIC_1 {VM_PAT, "pat" }
#elif defined(CONFIG_PPC)
diff --git a/lib/test_printf.c b/lib/test_printf.c
index d34dc636b81c..e0d0770d5eec 100644
--- a/lib/test_printf.c
+++ b/lib/test_printf.c
@@ -642,12 +642,26 @@ page_flags_test(int section, int node, int zone, int last_cpupid,
test(cmp_buf, "%pGp", &flags);
}
+static void __init page_type_test(unsigned int page_type, const char *name,
+ char *cmp_buf)
+{
+ unsigned long size;
+
+ size = scnprintf(cmp_buf, BUF_SIZE, "%#x(", page_type);
+ if (page_type_has_type(page_type))
+ size += scnprintf(cmp_buf + size, BUF_SIZE - size, "%s", name);
+
+ snprintf(cmp_buf + size, BUF_SIZE - size, ")");
+ test(cmp_buf, "%pGt", &page_type);
+}
+
static void __init
flags(void)
{
unsigned long flags;
char *cmp_buffer;
gfp_t gfp;
+ unsigned int page_type;
cmp_buffer = kmalloc(BUF_SIZE, GFP_KERNEL);
if (!cmp_buffer)
@@ -687,6 +701,18 @@ flags(void)
gfp |= __GFP_ATOMIC;
test(cmp_buffer, "%pGg", &gfp);
+ page_type = ~0;
+ page_type_test(page_type, "", cmp_buffer);
+
+ page_type = 10;
+ page_type_test(page_type, "", cmp_buffer);
+
+ page_type = ~PG_slab;
+ page_type_test(page_type, "slab", cmp_buffer);
+
+ page_type = ~(PG_slab | PG_table | PG_buddy);
+ page_type_test(page_type, "slab|table|buddy", cmp_buffer);
+
kfree(cmp_buffer);
}
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index be71a03c936a..fbe320b5e89f 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -2052,6 +2052,25 @@ char *format_page_flags(char *buf, char *end, unsigned long flags)
return buf;
}
+static
+char *format_page_type(char *buf, char *end, unsigned int page_type)
+{
+ buf = number(buf, end, page_type, default_flag_spec);
+
+ if (buf < end)
+ *buf = '(';
+ buf++;
+
+ if (page_type_has_type(page_type))
+ buf = format_flags(buf, end, ~page_type, pagetype_names);
+
+ if (buf < end)
+ *buf = ')';
+ buf++;
+
+ return buf;
+}
+
static noinline_for_stack
char *flags_string(char *buf, char *end, void *flags_ptr,
struct printf_spec spec, const char *fmt)
@@ -2065,6 +2084,8 @@ char *flags_string(char *buf, char *end, void *flags_ptr,
switch (fmt[1]) {
case 'p':
return format_page_flags(buf, end, *(unsigned long *)flags_ptr);
+ case 't':
+ return format_page_type(buf, end, *(unsigned int *)flags_ptr);
case 'v':
flags = *(unsigned long *)flags_ptr;
names = vmaflag_names;
diff --git a/mm/debug.c b/mm/debug.c
index 7f8e5f744e42..5ce6b359004a 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -36,6 +36,11 @@ const struct trace_print_flags pageflag_names[] = {
{0, NULL}
};
+const struct trace_print_flags pagetype_names[] = {
+ __def_pagetype_names,
+ {0, NULL}
+};
+
const struct trace_print_flags gfpflag_names[] = {
__def_gfpflag_names,
{0, NULL}
diff --git a/mm/internal.h b/mm/internal.h
index bcf75a8b032d..b4ba6fd6051c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -773,6 +773,7 @@ static inline void flush_tlb_batched_pending(struct mm_struct *mm)
#endif /* CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH */
extern const struct trace_print_flags pageflag_names[];
+extern const struct trace_print_flags pagetype_names[];
extern const struct trace_print_flags vmaflag_names[];
extern const struct trace_print_flags gfpflag_names[];
--
2.32.0
On Sun, Dec 18, 2022 at 07:19:00PM +0900, Hyeonggon Yoo wrote:
> %pGp format is used to print 'flags' field of struct page.
> As some page flags (e.g. PG_buddy, see page-flags.h for more details)
> are set in page_type field, introduce %pGt format which provides
> human readable output of page_type.
>
> Note that the sense of bits are different in page_type. if page_type is
> 0xffffffff, no flags are set. if PG_slab (0x00100000) flag is set,
> page_type is 0xffefffff. Clearing a bit means we set the bit.
>
> Bits in page_type are inverted when printing page type names.
...
> +#define __def_pagetype_names \
> + {PG_slab, "slab" }, \
> + {PG_offline, "offline" }, \
> + {PG_guard, "guard" }, \
> + {PG_table, "table" }, \
> + {PG_buddy, "buddy" }
Wondering if it will be more robust to define a helper macro
#define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) }
...
#undef DEF_PAGETYPE_MASK
In this case it decreases the chances of typo in the strings and flags.
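For illustration, a sketch of how __def_pagetype_names might then look
(assuming __stringify() is available in this header):
	#define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) }
	#define __def_pagetype_names		\
		DEF_PAGETYPE_NAME(slab),	\
		DEF_PAGETYPE_NAME(offline),	\
		DEF_PAGETYPE_NAME(guard),	\
		DEF_PAGETYPE_NAME(table),	\
		DEF_PAGETYPE_NAME(buddy)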
--
With Best Regards,
Andy Shevchenko
On 12/19/22 01:44, Andy Shevchenko wrote:
> On Sun, Dec 18, 2022 at 07:19:00PM +0900, Hyeonggon Yoo wrote:
>> %pGp format is used to print 'flags' field of struct page.
>> As some page flags (e.g. PG_buddy, see page-flags.h for more details)
>> are set in page_type field, introduce %pGt format which provides
>> human readable output of page_type.
>>
>> Note that the sense of bits are different in page_type. if page_type is
>> 0xffffffff, no flags are set. if PG_slab (0x00100000) flag is set,
>> page_type is 0xffefffff. Clearing a bit means we set the bit.
>>
>> Bits in page_type are inverted when printing page type names.
>
> ...
>
>> +#define __def_pagetype_names \
>> + {PG_slab, "slab" }, \
>> + {PG_offline, "offline" }, \
>> + {PG_guard, "guard" }, \
>> + {PG_table, "table" }, \
>> + {PG_buddy, "buddy" }
>
> Wondering if it will be more robust to define a helper macro
>
> #define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) }
> ...
> #undef DEF_PAGETYPE_MASK
>
> In this case it decreases the chances of typo in the strings and flags.
I'd say Yes. (and #undef DEF_PAGETYPE_NAME methinks; or change both to _MASK)
--
~Randy
On Mon, Dec 19, 2022 at 11:35:38AM -0800, Randy Dunlap wrote:
> On 12/19/22 01:44, Andy Shevchenko wrote:
> > On Sun, Dec 18, 2022 at 07:19:00PM +0900, Hyeonggon Yoo wrote:
...
> > #define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) }
> > ...
> > #undef DEF_PAGETYPE_MASK
> >
> > In this case it decreases the chances of typo in the strings and flags.
>
> I'd say Yes. (and #undef DEF_PAGETYPE_NAME methinks; or change both to _MASK)
Ah, it's me who used two different names for the same macro :-)
--
With Best Regards,
Andy Shevchenko
On Sun 2022-12-18 19:19:00, Hyeonggon Yoo wrote:
> %pGp format is used to print 'flags' field of struct page.
> As some page flags (e.g. PG_buddy, see page-flags.h for more details)
> are set in page_type field, introduce %pGt format which provides
> human readable output of page_type.
>
> Note that the sense of bits are different in page_type. if page_type is
> 0xffffffff, no flags are set. if PG_slab (0x00100000) flag is set,
> page_type is 0xffefffff. Clearing a bit means we set the bit.
>
> Bits in page_type are inverted when printing page type names.
>
> --- a/Documentation/core-api/printk-formats.rst
> +++ b/Documentation/core-api/printk-formats.rst
> @@ -575,12 +575,13 @@ The field width is passed by value, the bitmap is passed by reference.
> Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
> printing cpumask and nodemask.
>
> -Flags bitfields such as page flags, gfp_flags
> +Flags bitfields such as page flags, page_type, gfp_flags
> ---------------------------------------------
Please, underline the entire title. Otherwise, "make htmldoc"
complains ;-)
/prace/kernel/linux/Documentation/core-api/printk-formats.rst:579: WARNING: Title underline too short.
Flags bitfields such as page flags, page_type, gfp_flags
>
> ::
>
> %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
> + %pGt 0xffefffff(slab)
> %pGg GFP_USER|GFP_DMA32|GFP_NOWARN
> %pGv read|exec|mayread|maywrite|mayexec|denywrite
>
Please, explain this also in the paragraph below these examples.
I would personally refactor it to an itemized list, something like:
<proposal>
For printing flags bitfields as a collection of symbolic constants that
would construct the value. The type of flags is given by the third
character. Currently supported are:
- p - [p]age flags, expects value of type (``unsigned long *``)
- t - page [t]ype, expects value of type (``unsigned int *``)
- v - [v]ma_flags, expects value of type (``unsigned long *``)
- g - [g]fp_flags, expects value of type (``gfp_t *``)
The flag names and print order depends on the particular type.
</proposal>
Rant:
Sigh, it looks a bit error prone when similar pointer modifiers
expect pointers to different types. I wish there was a way to
check the passed pointer type at compilation time. But it
is a generic problem with these %p* modifiers.
Otherwise the patch looks fine for the vsprinf side.
Best Regards,
Petr
On Sun, Dec 18, 2022 at 07:18:58PM +0900, Hyeonggon Yoo wrote:
> As suggested by Naoya [1], identify_page_state() is never
> called when handling memory error on a slab page.
>
> Clean this up before moving PG_slab flag to page_type in later patch.
>
> [1] https://lore.kernel.org/linux-mm/Y2s+dnBsHAJu19ob@hyeyoo/#r
>
> Suggested-by: Naoya Horiguchi <[email protected]>
> Signed-off-by: Hyeonggon Yoo <[email protected]>
Thank you for the patch,
I think there're a few other places to remove under include/.
$ grep -inr MF_MSG_SLA include
include/ras/ras_event.h:359: EM ( MF_MSG_SLAB, "kernel slab page" ) \
include/linux/mm.h:3502: MF_MSG_SLAB,
, so could you update them together?
Thanks,
Naoya Horiguchi
On Tue, Dec 20, 2022 at 11:53:11PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Dec 18, 2022 at 07:18:58PM +0900, Hyeonggon Yoo wrote:
> > As suggested by Naoya [1], identify_page_state() is never
> > called when handling memory error on a slab page.
> >
> > Clean this up before moving PG_slab flag to page_type in later patch.
> > [1] https://lore.kernel.org/linux-mm/Y2s+dnBsHAJu19ob@hyeyoo/#r
You can make it a Link: tag.
--
With Best Regards,
Andy Shevchenko
On Tue, Dec 20, 2022 at 11:53:11PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Sun, Dec 18, 2022 at 07:18:58PM +0900, Hyeonggon Yoo wrote:
> > As suggested by Naoya [1], identify_page_state() is never
> > called when handling memory error on a slab page.
> >
> > Clean this up before moving PG_slab flag to page_type in later patch.
> >
> > [1] https://lore.kernel.org/linux-mm/Y2s+dnBsHAJu19ob@hyeyoo/#r
> >
> > Suggested-by: Naoya Horiguchi <[email protected]>
> > Signed-off-by: Hyeonggon Yoo <[email protected]>
>
> Thank you for the patch,
> I think there're a few other places to remove under include/.
>
> $ grep -inr MF_MSG_SLA include
> include/ras/ras_event.h:359: EM ( MF_MSG_SLAB, "kernel slab page" ) \
> include/linux/mm.h:3502: MF_MSG_SLAB,
>
> , so could you update them together?
Oh, missed this. Will do in next version.
--
Thanks,
Hyeonggon
On Tue, Dec 20, 2022 at 04:20:26PM +0100, Petr Mladek wrote:
> On Sun 2022-12-18 19:19:00, Hyeonggon Yoo wrote:
> > %pGp format is used to print 'flags' field of struct page.
> > As some page flags (e.g. PG_buddy, see page-flags.h for more details)
> > are set in page_type field, introduce %pGt format which provides
> > human readable output of page_type.
> >
> > Note that the sense of bits are different in page_type. if page_type is
> > 0xffffffff, no flags are set. if PG_slab (0x00100000) flag is set,
> > page_type is 0xffefffff. Clearing a bit means we set the bit.
> >
> > Bits in page_type are inverted when printing page type names.
> >
> > --- a/Documentation/core-api/printk-formats.rst
> > +++ b/Documentation/core-api/printk-formats.rst
> > @@ -575,12 +575,13 @@ The field width is passed by value, the bitmap is passed by reference.
> > Helper macros cpumask_pr_args() and nodemask_pr_args() are available to ease
> > printing cpumask and nodemask.
> >
> > -Flags bitfields such as page flags, gfp_flags
> > +Flags bitfields such as page flags, page_type, gfp_flags
> > ---------------------------------------------
>
> Please, underline the entire title. Otherwise, "make htmldoc"
> complains ;-)
>
> /prace/kernel/linux/Documentation/core-api/printk-formats.rst:579: WARNING: Title underline too short.
> Flags bitfields such as page flags, page_type, gfp_flags
Still not getting used to rst format ;)
Will fix, thanks!
>
>
> >
> > ::
> >
> > %pGp 0x17ffffc0002036(referenced|uptodate|lru|active|private|node=0|zone=2|lastcpupid=0x1fffff)
> > + %pGt 0xffefffff(slab)
> > %pGg GFP_USER|GFP_DMA32|GFP_NOWARN
> > %pGv read|exec|mayread|maywrite|mayexec|denywrite
> >
>
> Please, explain this also in the paragraph below these examples.
> I would personally refactor it to an itemized list, something like:
>
> <proposal>
> For printing flags bitfields as a collection of symbolic constants that
> would construct the value. The type of flags is given by the third
> character. Currently supported are:
>
> - p - [p]age flags, expects value of type (``unsigned long *``)
> - t - page [t]ype, expects value of type (``unsigned int *``)
> - v - [v]ma_flags, expects value of type (``unsigned long *``)
> - g - [g]fp_flags, expects value of type (``gfp_t *``)
>
> The flag names and print order depends on the particular type.
> </proposal>
The proposal sounds reasonable to me,
will adjust in next version.
> Rant:
> Sigh, it looks a bit error prone when similar pointer modifiers
> expect pointers to different types. I wish there was a way to
> check the passed pointer type at compilation time. But it
> is a generic problem with these %p* modifiers.
From my limited knowledge, it seems that there is no way to check
this :/
>
> Otherwise the patch looks fine for the vsprinf side.
Thank you for looking at this!
>
> Best Regards,
> Petr
--
Thanks,
Hyeonggon
On Wed, Dec 21, 2022 at 07:00:48PM +0200, Andy Shevchenko wrote:
> On Tue, Dec 20, 2022 at 11:53:11PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Sun, Dec 18, 2022 at 07:18:58PM +0900, Hyeonggon Yoo wrote:
> > > As suggested by Naoya [1], identify_page_state() is never
> > > called when handling memory error on a slab page.
> > >
> > > Clean this up before moving PG_slab flag to page_type in later patch.
>
> > > [1] https://lore.kernel.org/linux-mm/Y2s+dnBsHAJu19ob@hyeyoo/#r
>
> You can make it a Link: tag.
Will do in next version, thanks!
--
Thanks,
Hyeonggon
On Tue, Dec 20, 2022 at 12:58:53PM +0200, Andy Shevchenko wrote:
> On Mon, Dec 19, 2022 at 11:35:38AM -0800, Randy Dunlap wrote:
> > On 12/19/22 01:44, Andy Shevchenko wrote:
> > > On Sun, Dec 18, 2022 at 07:19:00PM +0900, Hyeonggon Yoo wrote:
>
> ...
>
> > > #define DEF_PAGETYPE_NAME(_name) { PG_##_name, __stringify(_name) }
> > > ...
> > > #undef DEF_PAGETYPE_MASK
> > >
> > > In this case it decreases the chances of typo in the strings and flags.
> >
> > I'd say Yes. (and #undef DEF_PAGETYPE_NAME methinks; or change both to _MASK)
>
> Ah, it's me who used two different names for the same macro :-)
It seems like a good way to make fewer mistakes.
Will do in next version, thanks! :-)
--
Thanks,
Hyeonggon
On 12/18/22 11:18, Hyeonggon Yoo wrote:
> For now, only SLAB uses _mapcount field as a number of active objects in
> a slab, and other slab allocators do not. As 16 bits are enough for that,
> use remaining 16 bits of _mapcount as page_type even when SLAB is used.
> And then move PG_slab flag to page_type.
>
> As suggested by Matthew, store number of active objects in negative
> form and use helper when accessing or modifying it.
>
> page_type is always placed in upper half of _mapcount to avoid
> confusing normal _mapcount as page_type. As underflow (actually I mean,
> yeah, overflow) is not a concern anymore, use more lower bits.
>
> Add more folio helpers for PAGE_TYPE_OPS() not to break existing
> slab implementations. To preserve current behavior apply page policy
> in PAGE_TYPE_OPS() and use PF_NO_TAIL for slab pages and PF_ANY for others.
>
> Remove PG_slab check from PAGE_FLAGS_CHECK_AT_FREE. buddy will still
> check if _mapcount is properly set at free.
>
> Note that with this change, page_mapped() and folio_mapped() always return
> false for a slab page.
>
> Signed-off-by: Hyeonggon Yoo <[email protected]>
Seems like quite some changes to page_type to accommodate SLAB, which is
hopefully going away soon(TM). Could we perhaps avoid that?
- Alternative 1: convert only SLUB for now, let SLAB keep using PG_slab
until it's gone.
- Alternative 2: SLAB's s_mem field seems wasteful to be a full blown
pointer, it could share the word with "active" if it was an offset from
page_addr(), see how it's normally set up:
void *addr = slab_address(slab);
slab->s_mem = addr + colour_off;
so basically we would store the colour_off and perform this calculation
on-demand.
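For illustration, a rough sketch of that idea (field and helper names
here are hypothetical, ignoring the KFENCE caveat mentioned below):
	struct slab {
		...
		/* the two 16-bit fields share the old s_mem word */
		unsigned short s_mem_off;	/* == colour_off */
		unsigned short active;
		...
	};

	static inline void *slab_s_mem(const struct slab *slab)
	{
		return slab_address(slab) + slab->s_mem_off;
	}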
One caveat is KFENCE's kfence_guarded_alloc() which seems to be creating
some fake slab? And setting s_mem to something that maybe isn't an offset up
to 16 bits from slab_address(). That has to be investigated. Worst case,
SLAB is going away soon(TM) so it's fine if KFENCE only works with SLUB?
> ---
> fs/proc/page.c | 13 ++----
> include/linux/mm_types.h | 11 +++--
> include/linux/page-flags.h | 77 ++++++++++++++++++++++++----------
> include/trace/events/mmflags.h | 1 -
> kernel/crash_core.c | 3 +-
> mm/slab.c | 44 ++++++++++++-------
> mm/slab.h | 3 +-
> 7 files changed, 98 insertions(+), 54 deletions(-)
>
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 6249c347809a..e7524f21cefe 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -67,7 +67,7 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
> */
> ppage = pfn_to_online_page(pfn);
>
> - if (!ppage || PageSlab(ppage) || page_has_type(ppage))
> + if (!ppage || page_has_type(ppage))
> pcount = 0;
> else
> pcount = page_mapcount(ppage);
> @@ -124,11 +124,8 @@ u64 stable_page_flags(struct page *page)
>
> /*
> * pseudo flags for the well known (anonymous) memory mapped pages
> - *
> - * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
> - * simple test in page_mapped() is not enough.
> */
> - if (!PageSlab(page) && page_mapped(page))
> + if (page_mapped(page))
> u |= 1 << KPF_MMAP;
> if (PageAnon(page))
> u |= 1 << KPF_ANON;
> @@ -178,16 +175,14 @@ u64 stable_page_flags(struct page *page)
> u |= 1 << KPF_OFFLINE;
> if (PageTable(page))
> u |= 1 << KPF_PGTABLE;
> + if (PageSlab(page))
> + u |= 1 << KPF_SLAB;
>
> if (page_is_idle(page))
> u |= 1 << KPF_IDLE;
>
> u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);
>
> - u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
> - if (PageTail(page) && PageSlab(compound_head(page)))
> - u |= 1 << KPF_SLAB;
> -
> u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
> u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
> u |= kpf_copy_bit(k, KPF_UPTODATE, PG_uptodate);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 3b8475007734..6b04ae65241d 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -207,10 +207,13 @@ struct page {
> atomic_t _mapcount;
>
> /*
> - * If the page is neither PageSlab nor mappable to userspace,
> - * the value stored here may help determine what this page
> - * is used for. See page-flags.h for a list of page types
> - * which are currently stored here.
> + * If the page is not mappable to userspace, the value
> + * stored here may help determine what this page is used for.
> + * See page-flags.h for a list of page types which are currently
> + * stored here.
> + *
> + * Note that only upper half is used for page types and lower
> + * half is reserved for SLAB.
> */
> unsigned int page_type;
> };
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 69e93a0c1277..07063d60efe3 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -107,7 +107,6 @@ enum pageflags {
> PG_workingset,
> PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
> PG_error,
> - PG_slab,
> PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> PG_arch_1,
> PG_reserved,
> @@ -482,7 +481,6 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
> TESTCLEARFLAG(Active, active, PF_HEAD)
> PAGEFLAG(Workingset, workingset, PF_HEAD)
> TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
> -__PAGEFLAG(Slab, slab, PF_NO_TAIL)
> __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
> PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
>
> @@ -906,42 +904,72 @@ static inline bool is_page_hwpoison(struct page *page)
> }
>
> /*
> - * For pages that are never mapped to userspace (and aren't PageSlab),
> - * page_type may be used. Because it is initialised to -1, we invert the
> - * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and
> - * __ClearPageFoo *sets* the bit used for PageFoo. We reserve a few high and
> - * low bits so that an underflow or overflow of page_mapcount() won't be
> - * mistaken for a page type value.
> + * For pages that are never mapped to userspace, page_type may be used.
> + * Because it is initialised to -1, we invert the sense of the bit,
> + * so __SetPageFoo *clears* the bit used for PageFoo, and __ClearPageFoo
> + * *sets* the bit used for PageFoo. We reserve a few high and low bits
> + * so that an underflow or overflow of page_mapcount() won't be mistaken
> + * for a page type value.
> */
>
> #define PAGE_TYPE_BASE 0xf0000000
> -/* Reserve 0x0000007f to catch underflows of page_mapcount */
> -#define PAGE_MAPCOUNT_RESERVE -128
> -#define PG_buddy 0x00000080
> -#define PG_offline 0x00000100
> -#define PG_table 0x00000200
> -#define PG_guard 0x00000400
> +#define PG_buddy 0x00010000
> +#define PG_offline 0x00020000
> +#define PG_table 0x00040000
> +#define PG_guard 0x00080000
> +#define PG_slab 0x00100000
>
> #define PageType(page, flag) \
> ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
>
> -static inline int page_has_type(struct page *page)
> +#define PAGE_TYPE_MASK 0xffff0000
> +
> +static inline bool page_type_has_type(unsigned int page_type)
> {
> - return (int)page->page_type < PAGE_MAPCOUNT_RESERVE;
> + return ((int)page_type < (int)PAGE_TYPE_MASK);
> }
>
> -#define PAGE_TYPE_OPS(uname, lname) \
> +static inline bool page_has_type(struct page *page)
> +{
> + return page_type_has_type(page->page_type);
> +}
> +
> +
> +#define PAGE_TYPE_OPS(uname, lname, policy) \
> static __always_inline int Page##uname(struct page *page) \
> { \
> + page = policy(page, 0); \
> + return PageType(page, PG_##lname); \
> +} \
> +static __always_inline int folio_test_##lname(struct folio *folio) \
> +{ \
> + struct page *page = &folio->page; \
> + \
> return PageType(page, PG_##lname); \
> } \
> static __always_inline void __SetPage##uname(struct page *page) \
> { \
> + page = policy(page, 1); \
> + VM_BUG_ON_PAGE(!PageType(page, 0), page); \
> + page->page_type &= ~PG_##lname; \
> +} \
> +static __always_inline void __folio_set_##lname(struct folio *folio) \
> +{ \
> + struct page *page = &folio->page; \
> + \
> VM_BUG_ON_PAGE(!PageType(page, 0), page); \
> page->page_type &= ~PG_##lname; \
> } \
> static __always_inline void __ClearPage##uname(struct page *page) \
> { \
> + page = policy(page, 1); \
> + VM_BUG_ON_PAGE(!Page##uname(page), page); \
> + page->page_type |= PG_##lname; \
> +} \
> +static __always_inline void __folio_clear_##lname(struct folio *folio) \
> +{ \
> + struct page *page = &folio->page; \
> + \
> VM_BUG_ON_PAGE(!Page##uname(page), page); \
> page->page_type |= PG_##lname; \
> }
> @@ -950,7 +978,7 @@ static __always_inline void __ClearPage##uname(struct page *page) \
> * PageBuddy() indicates that the page is free and in the buddy system
> * (see mm/page_alloc.c).
> */
> -PAGE_TYPE_OPS(Buddy, buddy)
> +PAGE_TYPE_OPS(Buddy, buddy, PF_ANY)
>
> /*
> * PageOffline() indicates that the page is logically offline although the
> @@ -974,7 +1002,10 @@ PAGE_TYPE_OPS(Buddy, buddy)
> * pages should check PageOffline() and synchronize with such drivers using
> * page_offline_freeze()/page_offline_thaw().
> */
> -PAGE_TYPE_OPS(Offline, offline)
> +PAGE_TYPE_OPS(Offline, offline, PF_ANY)
> +
> +/* PageSlab() indicates that the page is used by slab subsystem. */
> +PAGE_TYPE_OPS(Slab, slab, PF_NO_TAIL)
>
> extern void page_offline_freeze(void);
> extern void page_offline_thaw(void);
> @@ -984,12 +1015,12 @@ extern void page_offline_end(void);
> /*
> * Marks pages in use as page tables.
> */
> -PAGE_TYPE_OPS(Table, table)
> +PAGE_TYPE_OPS(Table, table, PF_ANY)
>
> /*
> * Marks guardpages used with debug_pagealloc.
> */
> -PAGE_TYPE_OPS(Guard, guard)
> +PAGE_TYPE_OPS(Guard, guard, PF_ANY)
>
> extern bool is_free_buddy_page(struct page *page);
>
> @@ -1037,8 +1068,8 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
> (1UL << PG_lru | 1UL << PG_locked | \
> 1UL << PG_private | 1UL << PG_private_2 | \
> 1UL << PG_writeback | 1UL << PG_reserved | \
> - 1UL << PG_slab | 1UL << PG_active | \
> - 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
> + 1UL << PG_active | 1UL << PG_unevictable | \
> + __PG_MLOCKED | LRU_GEN_MASK)
>
> /*
> * Flags checked when a page is prepped for return by the page allocator.
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 412b5a46374c..8301912f8c25 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -113,7 +113,6 @@
> {1UL << PG_lru, "lru" }, \
> {1UL << PG_active, "active" }, \
> {1UL << PG_workingset, "workingset" }, \
> - {1UL << PG_slab, "slab" }, \
> {1UL << PG_owner_priv_1, "owner_priv_1" }, \
> {1UL << PG_arch_1, "arch_1" }, \
> {1UL << PG_reserved, "reserved" }, \
> diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> index 87ef6096823f..1ea8198c26e1 100644
> --- a/kernel/crash_core.c
> +++ b/kernel/crash_core.c
> @@ -482,13 +482,14 @@ static int __init crash_save_vmcoreinfo_init(void)
> VMCOREINFO_NUMBER(PG_private);
> VMCOREINFO_NUMBER(PG_swapcache);
> VMCOREINFO_NUMBER(PG_swapbacked);
> - VMCOREINFO_NUMBER(PG_slab);
> #ifdef CONFIG_MEMORY_FAILURE
> VMCOREINFO_NUMBER(PG_hwpoison);
> #endif
> VMCOREINFO_NUMBER(PG_head_mask);
> #define PAGE_BUDDY_MAPCOUNT_VALUE (~PG_buddy)
> VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
> +#define PAGE_SLAB_MAPCOUNT_VALUE (~PG_slab)
> + VMCOREINFO_NUMBER(PAGE_SLAB_MAPCOUNT_VALUE);
> #ifdef CONFIG_HUGETLB_PAGE
> VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> #define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
> diff --git a/mm/slab.c b/mm/slab.c
> index 7a269db050ee..eee46f71c4b8 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2269,6 +2269,21 @@ void __kmem_cache_release(struct kmem_cache *cachep)
> }
> }
>
> +static inline unsigned int slab_get_active(struct slab *slab)
> +{
> + return ~(slab->page_type | PG_slab);
> +}
> +
> +static inline void slab_inc_active(struct slab *slab)
> +{
> + slab->page_type--;
> +}
> +
> +static inline void slab_dec_active(struct slab *slab)
> +{
> + slab->page_type++;
> +}
> +
> /*
> * Get the memory for a slab management obj.
> *
> @@ -2291,7 +2306,6 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep,
> void *addr = slab_address(slab);
>
> slab->s_mem = addr + colour_off;
> - slab->active = 0;
>
> if (OBJFREELIST_SLAB(cachep))
> freelist = NULL;
> @@ -2510,8 +2524,8 @@ static void *slab_get_obj(struct kmem_cache *cachep, struct slab *slab)
> {
> void *objp;
>
> - objp = index_to_obj(cachep, slab, get_free_obj(slab, slab->active));
> - slab->active++;
> + objp = index_to_obj(cachep, slab, get_free_obj(slab, slab_get_active(slab)));
> + slab_inc_active(slab);
>
> return objp;
> }
> @@ -2524,7 +2538,7 @@ static void slab_put_obj(struct kmem_cache *cachep,
> unsigned int i;
>
> /* Verify double free bug */
> - for (i = slab->active; i < cachep->num; i++) {
> + for (i = slab_get_active(slab); i < cachep->num; i++) {
> if (get_free_obj(slab, i) == objnr) {
> pr_err("slab: double free detected in cache '%s', objp %px\n",
> cachep->name, objp);
> @@ -2532,11 +2546,11 @@ static void slab_put_obj(struct kmem_cache *cachep,
> }
> }
> #endif
> - slab->active--;
> + slab_dec_active(slab);
> if (!slab->freelist)
> slab->freelist = objp + obj_offset(cachep);
>
> - set_free_obj(slab, slab->active, objnr);
> + set_free_obj(slab, slab_get_active(slab), objnr);
> }
>
> /*
> @@ -2635,14 +2649,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct slab *slab)
>
> raw_spin_lock(&n->list_lock);
> n->total_slabs++;
> - if (!slab->active) {
> + if (!slab_get_active(slab)) {
> list_add_tail(&slab->slab_list, &n->slabs_free);
> n->free_slabs++;
> } else
> fixup_slab_list(cachep, n, slab, &list);
>
> STATS_INC_GROWN(cachep);
> - n->free_objects += cachep->num - slab->active;
> + n->free_objects += cachep->num - slab_get_active(slab);
> raw_spin_unlock(&n->list_lock);
>
> fixup_objfreelist_debug(cachep, &list);
> @@ -2744,7 +2758,7 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
> {
> /* move slabp to correct slabp list: */
> list_del(&slab->slab_list);
> - if (slab->active == cachep->num) {
> + if (slab_get_active(slab) == cachep->num) {
> list_add(&slab->slab_list, &n->slabs_full);
> if (OBJFREELIST_SLAB(cachep)) {
> #if DEBUG
> @@ -2783,7 +2797,7 @@ static noinline struct slab *get_valid_first_slab(struct kmem_cache_node *n,
>
> /* Move pfmemalloc slab to the end of list to speed up next search */
> list_del(&slab->slab_list);
> - if (!slab->active) {
> + if (!slab_get_active(slab)) {
> list_add_tail(&slab->slab_list, &n->slabs_free);
> n->free_slabs++;
> } else
> @@ -2865,9 +2879,9 @@ static __always_inline int alloc_block(struct kmem_cache *cachep,
> * There must be at least one object available for
> * allocation.
> */
> - BUG_ON(slab->active >= cachep->num);
> + BUG_ON(slab_get_active(slab) >= cachep->num);
>
> - while (slab->active < cachep->num && batchcount--) {
> + while (slab_get_active(slab) < cachep->num && batchcount--) {
> STATS_INC_ALLOCED(cachep);
> STATS_INC_ACTIVE(cachep);
> STATS_SET_HIGH(cachep);
> @@ -3162,7 +3176,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
> STATS_INC_ACTIVE(cachep);
> STATS_SET_HIGH(cachep);
>
> - BUG_ON(slab->active == cachep->num);
> + BUG_ON(slab_get_active(slab) == cachep->num);
>
> obj = slab_get_obj(cachep, slab);
> n->free_objects--;
> @@ -3297,7 +3311,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
> STATS_DEC_ACTIVE(cachep);
>
> /* fixup slab chains */
> - if (slab->active == 0) {
> + if (slab_get_active(slab) == 0) {
> list_add(&slab->slab_list, &n->slabs_free);
> n->free_slabs++;
> } else {
> @@ -3352,7 +3366,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> struct slab *slab;
>
> list_for_each_entry(slab, &n->slabs_free, slab_list) {
> - BUG_ON(slab->active);
> + BUG_ON(slab_get_active(slab));
>
> i++;
> }
> diff --git a/mm/slab.h b/mm/slab.h
> index 7cc432969945..c6ffe6799436 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -20,7 +20,8 @@ struct slab {
> };
> struct rcu_head rcu_head;
> };
> - unsigned int active;
> + /* lower half of page_type is used as active objects counter */
> + unsigned int page_type;
>
> #elif defined(CONFIG_SLUB)
>
On Thu, Jan 12, 2023 at 05:27:24PM +0100, Vlastimil Babka wrote:
> On 12/18/22 11:18, Hyeonggon Yoo wrote:
> > For now, only SLAB uses _mapcount field as a number of active objects in
> > a slab, and other slab allocators do not. As 16 bits are enough for that,
> > use remaining 16 bits of _mapcount as page_type even when SLAB is used.
> > And then move PG_slab flag to page_type.
> >
> > As suggested by Matthew, store number of active objects in negative
> > form and use helper when accessing or modifying it.
> >
> > page_type is always placed in upper half of _mapcount to avoid
> > confusing normal _mapcount as page_type. As underflow (actually I mean,
> > yeah, overflow) is not a concern anymore, use more lower bits.
> >
> > Add more folio helpers for PAGE_TYPE_OPS() not to break existing
> > slab implementations. To preserve current behavior apply page policy
> > in PAGE_TYPE_OPS() and use PF_NO_TAIL for slab pages and PF_ANY for others.
> >
> > Remove PG_slab check from PAGE_FLAGS_CHECK_AT_FREE. buddy will still
> > check if _mapcount is properly set at free.
> >
> > Note that with this change, page_mapped() and folio_mapped() always return
> > false for a slab page.
> >
> > Signed-off-by: Hyeonggon Yoo <[email protected]>
>
> Seems like quite some changes to page_type to accommodate SLAB, which is
> hopefully going away soon(TM). Could we perhaps avoid that?
If it can be done with fewer changes, I'll try to avoid that.
>
> - Alternative 1: convert only SLUB for now, let SLAB keep using PG_slab
> until it's gone.
>
> - Alternative 2: SLAB's s_mem field seems wasteful to be a full blown
> pointer, it could share the word with "active" if it was an offset from
> page_addr(), see how it's normally set up:
>
> void *addr = slab_address(slab);
> slab->s_mem = addr + colour_off;
>
> so basically we would store the colour_off and perform this calculation
> on-demand.
Thank you for the suggestions, I prefer the latter approach.
Will refresh the series with that.
By the way, I think the printk part is already ready, so I sent it separately first.
( https://lore.kernel.org/linux-mm/[email protected] )
> One caveat is KFENCE's kfence_guarded_alloc() which seems to be creating
> some fake slab? And setting s_mem to something that maybe isn't an offset up
> to 16 bits from slab_address(). That has to be investigated. Worst case,
> SLAB is going away soon(TM) so it's fine if KFENCE only works with SLUB?
Will investigate, but if SLAB is going away anyway I guess that would be
okay.
> > ---
> > fs/proc/page.c | 13 ++----
> > include/linux/mm_types.h | 11 +++--
> > include/linux/page-flags.h | 77 ++++++++++++++++++++++++----------
> > include/trace/events/mmflags.h | 1 -
> > kernel/crash_core.c | 3 +-
> > mm/slab.c | 44 ++++++++++++-------
> > mm/slab.h | 3 +-
> > 7 files changed, 98 insertions(+), 54 deletions(-)
> >
> > diff --git a/fs/proc/page.c b/fs/proc/page.c
> > index 6249c347809a..e7524f21cefe 100644
> > --- a/fs/proc/page.c
> > +++ b/fs/proc/page.c
> > @@ -67,7 +67,7 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf,
> > */
> > ppage = pfn_to_online_page(pfn);
> >
> > - if (!ppage || PageSlab(ppage) || page_has_type(ppage))
> > + if (!ppage || page_has_type(ppage))
> > pcount = 0;
> > else
> > pcount = page_mapcount(ppage);
> > @@ -124,11 +124,8 @@ u64 stable_page_flags(struct page *page)
> >
> > /*
> > * pseudo flags for the well known (anonymous) memory mapped pages
> > - *
> > - * Note that page->_mapcount is overloaded in SLOB/SLUB/SLQB, so the
> > - * simple test in page_mapped() is not enough.
> > */
> > - if (!PageSlab(page) && page_mapped(page))
> > + if (page_mapped(page))
> > u |= 1 << KPF_MMAP;
> > if (PageAnon(page))
> > u |= 1 << KPF_ANON;
> > @@ -178,16 +175,14 @@ u64 stable_page_flags(struct page *page)
> > u |= 1 << KPF_OFFLINE;
> > if (PageTable(page))
> > u |= 1 << KPF_PGTABLE;
> > + if (PageSlab(page))
> > + u |= 1 << KPF_SLAB;
> >
> > if (page_is_idle(page))
> > u |= 1 << KPF_IDLE;
> >
> > u |= kpf_copy_bit(k, KPF_LOCKED, PG_locked);
> >
> > - u |= kpf_copy_bit(k, KPF_SLAB, PG_slab);
> > - if (PageTail(page) && PageSlab(compound_head(page)))
> > - u |= 1 << KPF_SLAB;
> > -
> > u |= kpf_copy_bit(k, KPF_ERROR, PG_error);
> > u |= kpf_copy_bit(k, KPF_DIRTY, PG_dirty);
> > u |= kpf_copy_bit(k, KPF_UPTODATE, PG_uptodate);
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 3b8475007734..6b04ae65241d 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -207,10 +207,13 @@ struct page {
> > atomic_t _mapcount;
> >
> > /*
> > - * If the page is neither PageSlab nor mappable to userspace,
> > - * the value stored here may help determine what this page
> > - * is used for. See page-flags.h for a list of page types
> > - * which are currently stored here.
> > + * If the page is not mappable to userspace, the value
> > + * stored here may help determine what this page is used for.
> > + * See page-flags.h for a list of page types which are currently
> > + * stored here.
> > + *
> > + * Note that only upper half is used for page types and lower
> > + * half is reserved for SLAB.
> > */
> > unsigned int page_type;
> > };
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 69e93a0c1277..07063d60efe3 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -107,7 +107,6 @@ enum pageflags {
> > PG_workingset,
> > PG_waiters, /* Page has waiters, check its waitqueue. Must be bit #7 and in the same byte as "PG_locked" */
> > PG_error,
> > - PG_slab,
> > PG_owner_priv_1, /* Owner use. If pagecache, fs may use*/
> > PG_arch_1,
> > PG_reserved,
> > @@ -482,7 +481,6 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
> > TESTCLEARFLAG(Active, active, PF_HEAD)
> > PAGEFLAG(Workingset, workingset, PF_HEAD)
> > TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
> > -__PAGEFLAG(Slab, slab, PF_NO_TAIL)
> > __PAGEFLAG(SlobFree, slob_free, PF_NO_TAIL)
> > PAGEFLAG(Checked, checked, PF_NO_COMPOUND) /* Used by some filesystems */
> >
> > @@ -906,42 +904,72 @@ static inline bool is_page_hwpoison(struct page *page)
> > }
> >
> > /*
> > - * For pages that are never mapped to userspace (and aren't PageSlab),
> > - * page_type may be used. Because it is initialised to -1, we invert the
> > - * sense of the bit, so __SetPageFoo *clears* the bit used for PageFoo, and
> > - * __ClearPageFoo *sets* the bit used for PageFoo. We reserve a few high and
> > - * low bits so that an underflow or overflow of page_mapcount() won't be
> > - * mistaken for a page type value.
> > + * For pages that are never mapped to userspace, page_type may be used.
> > + * Because it is initialised to -1, we invert the sense of the bit,
> > + * so __SetPageFoo *clears* the bit used for PageFoo, and __ClearPageFoo
> > + * *sets* the bit used for PageFoo. We reserve a few high and low bits
> > + * so that an underflow or overflow of page_mapcount() won't be mistaken
> > + * for a page type value.
> > */
> >
> > #define PAGE_TYPE_BASE 0xf0000000
> > -/* Reserve 0x0000007f to catch underflows of page_mapcount */
> > -#define PAGE_MAPCOUNT_RESERVE -128
> > -#define PG_buddy 0x00000080
> > -#define PG_offline 0x00000100
> > -#define PG_table 0x00000200
> > -#define PG_guard 0x00000400
> > +#define PG_buddy 0x00010000
> > +#define PG_offline 0x00020000
> > +#define PG_table 0x00040000
> > +#define PG_guard 0x00080000
> > +#define PG_slab 0x00100000
> >
> > #define PageType(page, flag) \
> > ((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
> >
> > -static inline int page_has_type(struct page *page)
> > +#define PAGE_TYPE_MASK 0xffff0000
> > +
> > +static inline bool page_type_has_type(unsigned int page_type)
> > {
> > - return (int)page->page_type < PAGE_MAPCOUNT_RESERVE;
> > + return ((int)page_type < (int)PAGE_TYPE_MASK);
> > }
> >
> > -#define PAGE_TYPE_OPS(uname, lname) \
> > +static inline bool page_has_type(struct page *page)
> > +{
> > + return page_type_has_type(page->page_type);
> > +}
> > +
> > +
> > +#define PAGE_TYPE_OPS(uname, lname, policy) \
> > static __always_inline int Page##uname(struct page *page) \
> > { \
> > + page = policy(page, 0); \
> > + return PageType(page, PG_##lname); \
> > +} \
> > +static __always_inline int folio_test_##lname(struct folio *folio) \
> > +{ \
> > + struct page *page = &folio->page; \
> > + \
> > return PageType(page, PG_##lname); \
> > } \
> > static __always_inline void __SetPage##uname(struct page *page) \
> > { \
> > + page = policy(page, 1); \
> > + VM_BUG_ON_PAGE(!PageType(page, 0), page); \
> > + page->page_type &= ~PG_##lname; \
> > +} \
> > +static __always_inline void __folio_set_##lname(struct folio *folio) \
> > +{ \
> > + struct page *page = &folio->page; \
> > + \
> > VM_BUG_ON_PAGE(!PageType(page, 0), page); \
> > page->page_type &= ~PG_##lname; \
> > } \
> > static __always_inline void __ClearPage##uname(struct page *page) \
> > { \
> > + page = policy(page, 1); \
> > + VM_BUG_ON_PAGE(!Page##uname(page), page); \
> > + page->page_type |= PG_##lname; \
> > +} \
> > +static __always_inline void __folio_clear_##lname(struct folio *folio) \
> > +{ \
> > + struct page *page = &folio->page; \
> > + \
> > VM_BUG_ON_PAGE(!Page##uname(page), page); \
> > page->page_type |= PG_##lname; \
> > }
> > @@ -950,7 +978,7 @@ static __always_inline void __ClearPage##uname(struct page *page) \
> > * PageBuddy() indicates that the page is free and in the buddy system
> > * (see mm/page_alloc.c).
> > */
> > -PAGE_TYPE_OPS(Buddy, buddy)
> > +PAGE_TYPE_OPS(Buddy, buddy, PF_ANY)
> >
> > /*
> > * PageOffline() indicates that the page is logically offline although the
> > @@ -974,7 +1002,10 @@ PAGE_TYPE_OPS(Buddy, buddy)
> > * pages should check PageOffline() and synchronize with such drivers using
> > * page_offline_freeze()/page_offline_thaw().
> > */
> > -PAGE_TYPE_OPS(Offline, offline)
> > +PAGE_TYPE_OPS(Offline, offline, PF_ANY)
> > +
> > +/* PageSlab() indicates that the page is used by slab subsystem. */
> > +PAGE_TYPE_OPS(Slab, slab, PF_NO_TAIL)
> >
> > extern void page_offline_freeze(void);
> > extern void page_offline_thaw(void);
> > @@ -984,12 +1015,12 @@ extern void page_offline_end(void);
> > /*
> > * Marks pages in use as page tables.
> > */
> > -PAGE_TYPE_OPS(Table, table)
> > +PAGE_TYPE_OPS(Table, table, PF_ANY)
> >
> > /*
> > * Marks guardpages used with debug_pagealloc.
> > */
> > -PAGE_TYPE_OPS(Guard, guard)
> > +PAGE_TYPE_OPS(Guard, guard, PF_ANY)
> >
> > extern bool is_free_buddy_page(struct page *page);
> >
> > @@ -1037,8 +1068,8 @@ static __always_inline void __ClearPageAnonExclusive(struct page *page)
> > (1UL << PG_lru | 1UL << PG_locked | \
> > 1UL << PG_private | 1UL << PG_private_2 | \
> > 1UL << PG_writeback | 1UL << PG_reserved | \
> > - 1UL << PG_slab | 1UL << PG_active | \
> > - 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
> > + 1UL << PG_active | 1UL << PG_unevictable | \
> > + __PG_MLOCKED | LRU_GEN_MASK)
> >
> > /*
> > * Flags checked when a page is prepped for return by the page allocator.
> > diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> > index 412b5a46374c..8301912f8c25 100644
> > --- a/include/trace/events/mmflags.h
> > +++ b/include/trace/events/mmflags.h
> > @@ -113,7 +113,6 @@
> > {1UL << PG_lru, "lru" }, \
> > {1UL << PG_active, "active" }, \
> > {1UL << PG_workingset, "workingset" }, \
> > - {1UL << PG_slab, "slab" }, \
> > {1UL << PG_owner_priv_1, "owner_priv_1" }, \
> > {1UL << PG_arch_1, "arch_1" }, \
> > {1UL << PG_reserved, "reserved" }, \
> > diff --git a/kernel/crash_core.c b/kernel/crash_core.c
> > index 87ef6096823f..1ea8198c26e1 100644
> > --- a/kernel/crash_core.c
> > +++ b/kernel/crash_core.c
> > @@ -482,13 +482,14 @@ static int __init crash_save_vmcoreinfo_init(void)
> > VMCOREINFO_NUMBER(PG_private);
> > VMCOREINFO_NUMBER(PG_swapcache);
> > VMCOREINFO_NUMBER(PG_swapbacked);
> > - VMCOREINFO_NUMBER(PG_slab);
> > #ifdef CONFIG_MEMORY_FAILURE
> > VMCOREINFO_NUMBER(PG_hwpoison);
> > #endif
> > VMCOREINFO_NUMBER(PG_head_mask);
> > #define PAGE_BUDDY_MAPCOUNT_VALUE (~PG_buddy)
> > VMCOREINFO_NUMBER(PAGE_BUDDY_MAPCOUNT_VALUE);
> > +#define PAGE_SLAB_MAPCOUNT_VALUE (~PG_slab)
> > + VMCOREINFO_NUMBER(PAGE_SLAB_MAPCOUNT_VALUE);
> > #ifdef CONFIG_HUGETLB_PAGE
> > VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
> > #define PAGE_OFFLINE_MAPCOUNT_VALUE (~PG_offline)
> > diff --git a/mm/slab.c b/mm/slab.c
> > index 7a269db050ee..eee46f71c4b8 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -2269,6 +2269,21 @@ void __kmem_cache_release(struct kmem_cache *cachep)
> > }
> > }
> >
> > +static inline unsigned int slab_get_active(struct slab *slab)
> > +{
> > + return ~(slab->page_type | PG_slab);
> > +}
> > +
> > +static inline void slab_inc_active(struct slab *slab)
> > +{
> > + slab->page_type--;
> > +}
> > +
> > +static inline void slab_dec_active(struct slab *slab)
> > +{
> > + slab->page_type++;
> > +}
> > +
> > /*
> > * Get the memory for a slab management obj.
> > *
> > @@ -2291,7 +2306,6 @@ static void *alloc_slabmgmt(struct kmem_cache *cachep,
> > void *addr = slab_address(slab);
> >
> > slab->s_mem = addr + colour_off;
> > - slab->active = 0;
> >
> > if (OBJFREELIST_SLAB(cachep))
> > freelist = NULL;
> > @@ -2510,8 +2524,8 @@ static void *slab_get_obj(struct kmem_cache *cachep, struct slab *slab)
> > {
> > void *objp;
> >
> > - objp = index_to_obj(cachep, slab, get_free_obj(slab, slab->active));
> > - slab->active++;
> > + objp = index_to_obj(cachep, slab, get_free_obj(slab, slab_get_active(slab)));
> > + slab_inc_active(slab);
> >
> > return objp;
> > }
> > @@ -2524,7 +2538,7 @@ static void slab_put_obj(struct kmem_cache *cachep,
> > unsigned int i;
> >
> > /* Verify double free bug */
> > - for (i = slab->active; i < cachep->num; i++) {
> > + for (i = slab_get_active(slab); i < cachep->num; i++) {
> > if (get_free_obj(slab, i) == objnr) {
> > pr_err("slab: double free detected in cache '%s', objp %px\n",
> > cachep->name, objp);
> > @@ -2532,11 +2546,11 @@ static void slab_put_obj(struct kmem_cache *cachep,
> > }
> > }
> > #endif
> > - slab->active--;
> > + slab_dec_active(slab);
> > if (!slab->freelist)
> > slab->freelist = objp + obj_offset(cachep);
> >
> > - set_free_obj(slab, slab->active, objnr);
> > + set_free_obj(slab, slab_get_active(slab), objnr);
> > }
> >
> > /*
> > @@ -2635,14 +2649,14 @@ static void cache_grow_end(struct kmem_cache *cachep, struct slab *slab)
> >
> > raw_spin_lock(&n->list_lock);
> > n->total_slabs++;
> > - if (!slab->active) {
> > + if (!slab_get_active(slab)) {
> > list_add_tail(&slab->slab_list, &n->slabs_free);
> > n->free_slabs++;
> > } else
> > fixup_slab_list(cachep, n, slab, &list);
> >
> > STATS_INC_GROWN(cachep);
> > - n->free_objects += cachep->num - slab->active;
> > + n->free_objects += cachep->num - slab_get_active(slab);
> > raw_spin_unlock(&n->list_lock);
> >
> > fixup_objfreelist_debug(cachep, &list);
> > @@ -2744,7 +2758,7 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
> > {
> > /* move slabp to correct slabp list: */
> > list_del(&slab->slab_list);
> > - if (slab->active == cachep->num) {
> > + if (slab_get_active(slab) == cachep->num) {
> > list_add(&slab->slab_list, &n->slabs_full);
> > if (OBJFREELIST_SLAB(cachep)) {
> > #if DEBUG
> > @@ -2783,7 +2797,7 @@ static noinline struct slab *get_valid_first_slab(struct kmem_cache_node *n,
> >
> > /* Move pfmemalloc slab to the end of list to speed up next search */
> > list_del(&slab->slab_list);
> > - if (!slab->active) {
> > + if (!slab_get_active(slab)) {
> > list_add_tail(&slab->slab_list, &n->slabs_free);
> > n->free_slabs++;
> > } else
> > @@ -2865,9 +2879,9 @@ static __always_inline int alloc_block(struct kmem_cache *cachep,
> > * There must be at least one object available for
> > * allocation.
> > */
> > - BUG_ON(slab->active >= cachep->num);
> > + BUG_ON(slab_get_active(slab) >= cachep->num);
> >
> > - while (slab->active < cachep->num && batchcount--) {
> > + while (slab_get_active(slab) < cachep->num && batchcount--) {
> > STATS_INC_ALLOCED(cachep);
> > STATS_INC_ACTIVE(cachep);
> > STATS_SET_HIGH(cachep);
> > @@ -3162,7 +3176,7 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
> > STATS_INC_ACTIVE(cachep);
> > STATS_SET_HIGH(cachep);
> >
> > - BUG_ON(slab->active == cachep->num);
> > + BUG_ON(slab_get_active(slab) == cachep->num);
> >
> > obj = slab_get_obj(cachep, slab);
> > n->free_objects--;
> > @@ -3297,7 +3311,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
> > STATS_DEC_ACTIVE(cachep);
> >
> > /* fixup slab chains */
> > - if (slab->active == 0) {
> > + if (slab_get_active(slab) == 0) {
> > list_add(&slab->slab_list, &n->slabs_free);
> > n->free_slabs++;
> > } else {
> > @@ -3352,7 +3366,7 @@ static void cache_flusharray(struct kmem_cache *cachep, struct array_cache *ac)
> > struct slab *slab;
> >
> > list_for_each_entry(slab, &n->slabs_free, slab_list) {
> > - BUG_ON(slab->active);
> > + BUG_ON(slab_get_active(slab));
> >
> > i++;
> > }
> > diff --git a/mm/slab.h b/mm/slab.h
> > index 7cc432969945..c6ffe6799436 100644
> > --- a/mm/slab.h
> > +++ b/mm/slab.h
> > @@ -20,7 +20,8 @@ struct slab {
> > };
> > struct rcu_head rcu_head;
> > };
> > - unsigned int active;
> > + /* lower half of page_type is used as active objects counter */
> > + unsigned int page_type;
> >
> > #elif defined(CONFIG_SLUB)
> >
>
On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
> > Seems like quite some changes to page_type to accomodate SLAB, which is
> > hopefully going away soon(TM). Could we perhaps avoid that?
>
> If it could be done with less changes, I'll try to avoid that.
Let me outline the idea I had for removing PG_slab:
Observe that PG_reserved and PG_slab are mutually exclusive. Also,
if PG_reserved is set, no other flags are set. If PG_slab is set, only
PG_locked is used. Many of the flags are only for use by anon/page
cache pages (eg referenced, uptodate, dirty, lru, active, workingset,
waiters, error, owner_priv_1, writeback, mappedtodisk, reclaim,
swapbacked, unevictable, mlocked).
Redefine PG_reserved as PG_kernel. Now we can use the other _15_
flags to indicate pagetype, as long as PG_kernel is set. So, eg
PageSlab() can now be (page->flags & PG_type) == PG_slab where
#define PG_kernel 0x00001
#define PG_type (PG_kernel | 0x7fff0)
#define PG_slab (PG_kernel | 0x00010)
#define PG_reserved (PG_kernel | 0x00020)
#define PG_buddy (PG_kernel | 0x00030)
#define PG_offline (PG_kernel | 0x00040)
#define PG_table (PG_kernel | 0x00050)
#define PG_guard (PG_kernel | 0x00060)
That frees up the existing PG_slab, lets us drop the page_type field
altogether and gives us space to define all the page types we might
want (eg PG_vmalloc).
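For illustration, with the values above the per-type tests become simple
equality checks on page->flags (sketch only; compound_head() handling and
the usual page-flag policy macros are omitted):

static inline bool PageSlab(const struct page *page)
{
        /* true only when PG_kernel is set and the type bits say "slab" */
        return (page->flags & PG_type) == PG_slab;
}

static inline bool PageBuddy(const struct page *page)
{
        return (page->flags & PG_type) == PG_buddy;
}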
We'll want to reorganise all the flags which are for anon/file pages
into a contiguous block. And now that I think about it, vmalloc pages
can be mapped to userspace, so they can get marked dirty, so only
14 bits are available. Maybe rearrange to ...
PG_locked 0x000001
PG_writeback 0x000002
PG_head 0x000004
PG_dirty 0x000008
PG_owner_priv_1 0x000010
PG_arch_1 0x000020
PG_private 0x000040
PG_waiters 0x000080
PG_kernel 0x000100
PG_referenced 0x000200
PG_uptodate 0x000400
PG_lru 0x000800
PG_active 0x001000
PG_workingset 0x002000
PG_error 0x004000
PG_private_2 0x008000
PG_mappedtodisk 0x010000
PG_reclaim 0x020000
PG_swapbacked 0x040000
PG_unevictable 0x080000
PG_mlocked 0x100000
... or something. There are a number of constraints and it may take
a few iterations to get this right. Oh, and if this is the layout
we use, then:
PG_type 0x1fff00
PG_reserved (PG_kernel | 0x200)
PG_slab (PG_kernel | 0x400)
PG_buddy (PG_kernel | 0x600)
PG_offline (PG_kernel | 0x800)
PG_table (PG_kernel | 0xa00)
PG_guard (PG_kernel | 0xc00)
PG_vmalloc (PG_kernel | 0xe00)
This is going to make show_page_flags() more complex :-P
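(For instance, the flag printer would have to mask and compare the whole
PG_type field rather than test individual bits; a sketch, with an invented
helper name:)

static const char *page_type_name(unsigned long flags)
{
        switch (flags & PG_type) {
        case PG_reserved:       return "reserved";
        case PG_slab:           return "slab";
        case PG_buddy:          return "buddy";
        case PG_offline:        return "offline";
        case PG_table:          return "table";
        case PG_guard:          return "guard";
        case PG_vmalloc:        return "vmalloc";
        default:                return NULL;    /* anon/file page */
        }
}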
Oh, and while we're doing this, we should just make PG_mlocked
unconditional. NOMMU doesn't need the extra space in page flags
(for what? their large number of NUMA nodes?)
On Mon, Jan 30, 2023 at 05:11:48AM +0000, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
> > > Seems like quite some changes to page_type to accomodate SLAB, which is
> > > hopefully going away soon(TM). Could we perhaps avoid that?
> >
> > If it could be done with less changes, I'll try to avoid that.
>
> Let me outline the idea I had for removing PG_slab:
>
> Observe that PG_reserved and PG_slab are mutually exclusive. Also,
> if PG_reserved is set, no other flags are set. If PG_slab is set, only
> PG_locked is used. Many of the flags are only for use by anon/page
> cache pages (eg referenced, uptodate, dirty, lru, active, workingset,
> waiters, error, owner_priv_1, writeback, mappedtodisk, reclaim,
> swapbacked, unevictable, mlocked).
>
> Redefine PG_reserved as PG_kernel. Now we can use the other _15_
> flags to indicate pagetype, as long as PG_kernel is set.
So PG_kernel is a new special flag. I thought it indicated "pages that are
never mapped to userspace", but considering PG_vmalloc, it does not.
> So, eg
> PageSlab() can now be (page->flags & PG_type) == PG_slab where
But if PG_xxx and PG_slab share the same bits, wouldn't PG_xxx be confused with it?
> #define PG_kernel 0x00001
> #define PG_type (PG_kernel | 0x7fff0)
> #define PG_slab (PG_kernel | 0x00010)
> #define PG_reserved (PG_kernel | 0x00020)
> #define PG_buddy (PG_kernel | 0x00030)
> #define PG_offline (PG_kernel | 0x00040)
> #define PG_table (PG_kernel | 0x00050)
> #define PG_guard (PG_kernel | 0x00060)
>
> That frees up the existing PG_slab, lets us drop the page_type field
> altogether and gives us space to define all the page types we might
> want (eg PG_vmalloc)
>
> We'll want to reorganise all the flags which are for anon/file pages
> into a contiguous block. And now that I think about it, vmalloc pages
> can be mapped to userspace, so they can get marked dirty, so only
> 14 bits are available. Maybe rearrange to ...
>
> PG_locked 0x000001
> PG_writeback 0x000002
> PG_head 0x000004
I think slab still needs PG_head,
but that seems to be okay with this layout.
(These assumptions had better be documented, though.)
> PG_dirty 0x000008
> PG_owner_priv_1 0x000010
> PG_arch_1 0x000020
> PG_private 0x000040
> PG_waiters 0x000080
> PG_kernel 0x000100
> PG_referenced 0x000200
> PG_uptodate 0x000400
> PG_lru 0x000800
> PG_active 0x001000
> PG_workingset 0x002000
> PG_error 0x004000
> PG_private_2 0x008000
> PG_mappedtodisk 0x010000
> PG_reclaim 0x020000
> PG_swapbacked 0x040000
> PG_unevictable 0x080000
> PG_mlocked 0x100000
>
> ... or something. There are a number of constraints and it may take
> a few iterations to get this right. Oh, and if this is the layout
> we use, then:
>
> PG_type 0x1fff00
> PG_reserved (PG_kernel | 0x200)
> PG_slab (PG_kernel | 0x400)
> PG_buddy (PG_kernel | 0x600)
> PG_offline (PG_kernel | 0x800)
> PG_table (PG_kernel | 0xa00)
> PG_guard (PG_kernel | 0xc00)
> PG_vmalloc (PG_kernel | 0xe00)
What is PG_vmalloc for? Is it just an example to illustrate a possible layout?
> This is going to make show_page_flags() more complex :-P
>
> Oh, and while we're doing this, we should just make PG_mlocked
> unconditional. NOMMU doesn't need the extra space in page flags
> (for what? their large number of NUMA nodes?)
On 30.01.23 06:11, Matthew Wilcox wrote:
> On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
>>> Seems like quite some changes to page_type to accomodate SLAB, which is
>>> hopefully going away soon(TM). Could we perhaps avoid that?
>>
>> If it could be done with less changes, I'll try to avoid that.
>
> Let me outline the idea I had for removing PG_slab:
>
> Observe that PG_reserved and PG_slab are mutually exclusive. Also,
I recall that there are SetPageReserved() calls on pages allocated via slab.
--
Thanks,
David / dhildenb
On Fri, Feb 03, 2023 at 04:00:08PM +0000, Hyeonggon Yoo wrote:
> On Mon, Jan 30, 2023 at 05:11:48AM +0000, Matthew Wilcox wrote:
> > On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
> > > > Seems like quite some changes to page_type to accomodate SLAB, which is
> > > > hopefully going away soon(TM). Could we perhaps avoid that?
> > >
> > > If it could be done with less changes, I'll try to avoid that.
> >
> > Let me outline the idea I had for removing PG_slab:
> >
> > Observe that PG_reserved and PG_slab are mutually exclusive. Also,
> > if PG_reserved is set, no other flags are set. If PG_slab is set, only
> > PG_locked is used. Many of the flags are only for use by anon/page
> > cache pages (eg referenced, uptodate, dirty, lru, active, workingset,
> > waiters, error, owner_priv_1, writeback, mappedtodisk, reclaim,
> > swapbacked, unevictable, mlocked).
> >
> > Redefine PG_reserved as PG_kernel. Now we can use the other _15_
> > flags to indicate pagetype, as long as PG_kernel is set.
>
> So PG_kernel is a new special flag, I thought it indicates
> "not usermappable pages", but considering PG_vmalloc it's not.
Right, it means "The kernel allocated this page for its own purposes;
what that purpose is might be available by looking at PG_type". ie
it's not-anon, not-page-cache.
> > So, eg
> > PageSlab() can now be (page->flags & PG_type) == PG_slab where
>
> But if PG_xxx and PG_slab shares same bit, PG_xxx would be confused?
Correct. Ideally those tests wouldn't be used on arbitrary pages,
only pages which are already confirmed to be anon or file. I suspect
we haven't been super-careful about that in the past, and so there
would be some degree of "Oh, we need to fix this up". But flags like
PG_mappedtodisk, PG_mlocked, PG_unevictable, PG_workingset should be
all gated behind "We know this is anon/file".
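To make that concrete, "gated behind" would look something like the
following (purely a sketch; the helper name is made up):

/*
 * The anon/file-only bits may only be interpreted when PG_kernel is
 * clear, i.e. the page really is anon or page cache.
 */
static inline bool page_is_anon_or_file(const struct page *page)
{
        return !(page->flags & PG_kernel);
}

/* a caller would then do e.g.: */
if (page_is_anon_or_file(page) && PageMlocked(page))
        handle_mlocked(page);   /* hypothetical caller */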
> > #define PG_kernel 0x00001
> > #define PG_type (PG_kernel | 0x7fff0)
> > #define PG_slab (PG_kernel | 0x00010)
> > #define PG_reserved (PG_kernel | 0x00020)
> > #define PG_buddy (PG_kernel | 0x00030)
> > #define PG_offline (PG_kernel | 0x00040)
> > #define PG_table (PG_kernel | 0x00050)
> > #define PG_guard (PG_kernel | 0x00060)
> >
> > That frees up the existing PG_slab, lets us drop the page_type field
> > altogether and gives us space to define all the page types we might
> > want (eg PG_vmalloc)
> >
> > We'll want to reorganise all the flags which are for anon/file pages
> > into a contiguous block. And now that I think about it, vmalloc pages
> > can be mapped to userspace, so they can get marked dirty, so only
> > 14 bits are available. Maybe rearrange to ...
> >
> > PG_locked 0x000001
> > PG_writeback 0x000002
> > PG_head 0x000004
>
> I think slab still needs PG_head,
> but it seems to be okay with this layout.
> (but these assumpstions are better documented, I think)
Yes, slab needs PG_head so it knows whether this is a multi-page slab or
not. I forgot to mention it above as a bit that slab needs, but I did put
it in the low bits here.
> > PG_dirty 0x000008
> > PG_owner_priv_1 0x000010
> > PG_arch_1 0x000020
> > PG_private 0x000040
> > PG_waiters 0x000080
> > PG_kernel 0x000100
> > PG_referenced 0x000200
> > PG_uptodate 0x000400
> > PG_lru 0x000800
> > PG_active 0x001000
> > PG_workingset 0x002000
> > PG_error 0x004000
> > PG_private_2 0x008000
> > PG_mappedtodisk 0x010000
> > PG_reclaim 0x020000
> > PG_swapbacked 0x040000
> > PG_unevictable 0x080000
> > PG_mlocked 0x100000
> >
> > ... or something. There are a number of constraints and it may take
> > a few iterations to get this right. Oh, and if this is the layout
> > we use, then:
> >
> > PG_type 0x1fff00
> > PG_reserved (PG_kernel | 0x200)
> > PG_slab (PG_kernel | 0x400)
> > PG_buddy (PG_kernel | 0x600)
> > PG_offline (PG_kernel | 0x800)
> > PG_table (PG_kernel | 0xa00)
> > PG_guard (PG_kernel | 0xc00)
> > PG_vmalloc (PG_kernel | 0xe00)
>
> what is PG_vmalloc for, is it just an example for
> explaining possible layout?
I really want to mark pages as being allocated from vmalloc. It's
one of the things we could do to make debugging better.
On Fri, Feb 03, 2023 at 05:04:01PM +0100, David Hildenbrand wrote:
> On 30.01.23 06:11, Matthew Wilcox wrote:
> > On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
> > > > Seems like quite some changes to page_type to accomodate SLAB, which is
> > > > hopefully going away soon(TM). Could we perhaps avoid that?
> > >
> > > If it could be done with less changes, I'll try to avoid that.
> >
> > Let me outline the idea I had for removing PG_slab:
> >
> > Observe that PG_reserved and PG_slab are mutually exclusive. Also,
>
> I recall that there are SetPageReserved() calls on pages allocated via slab.
I did a quick scan, and it seems that all allocated reserved pages come
from alloc_pages() or vmalloc().
BTW, looking at the current usage of SetPageReserved, it seems that some
are bogus/historical and some would justify a different page type if we are
to have 15 bits for PG_kernel.
> --
> Thanks,
>
> David / dhildenb
--
Sincerely yours,
Mike.
On 08.02.23 10:44, Mike Rapoport wrote:
> On Fri, Feb 03, 2023 at 05:04:01PM +0100, David Hildenbrand wrote:
>> On 30.01.23 06:11, Matthew Wilcox wrote:
>>> On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
>>>>> Seems like quite some changes to page_type to accomodate SLAB, which is
>>>>> hopefully going away soon(TM). Could we perhaps avoid that?
>>>>
>>>> If it could be done with less changes, I'll try to avoid that.
>>>
>>> Let me outline the idea I had for removing PG_slab:
>>>
>>> Observe that PG_reserved and PG_slab are mutually exclusive. Also,
>>
>> I recall that there are SetPageReserved() calls on pages allocated via slab.
>
> I did a quick scan, and it seems that all allocated reserved pages come
> from alloc_pages() or vmalloc().
>
Thanks for checking, good that my memory was wrong :)
> BTW, looking at the current usage of SetPageReserved, it seems that some
> are bogus/historical and some would justify a different page type if we are
> to have 15 bits for PG_kernel.
I remember raising in the past that something like PG_hole might be helpful.
I also recall that most SetPageReserved usage in drivers is there to make
ioremap happy; PG_ioremappable or something like that would be better.
So yes, that would even help to clean up the existing PG_reserved usage.
--
Thanks,
David / dhildenb
On Fri, Feb 03, 2023 at 04:19:32PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 03, 2023 at 04:00:08PM +0000, Hyeonggon Yoo wrote:
> > On Mon, Jan 30, 2023 at 05:11:48AM +0000, Matthew Wilcox wrote:
> > > On Mon, Jan 30, 2023 at 01:34:59PM +0900, Hyeonggon Yoo wrote:
> > > > > Seems like quite some changes to page_type to accomodate SLAB, which is
> > > > > hopefully going away soon(TM). Could we perhaps avoid that?
> > > >
> > > > If it could be done with less changes, I'll try to avoid that.
> > >
> > > Let me outline the idea I had for removing PG_slab:
> > >
> > > Observe that PG_reserved and PG_slab are mutually exclusive. Also,
> > > if PG_reserved is set, no other flags are set. If PG_slab is set, only
> > > PG_locked is used. Many of the flags are only for use by anon/page
> > > cache pages (eg referenced, uptodate, dirty, lru, active, workingset,
> > > waiters, error, owner_priv_1, writeback, mappedtodisk, reclaim,
> > > swapbacked, unevictable, mlocked).
> > >
> > > Redefine PG_reserved as PG_kernel. Now we can use the other _15_
> > > flags to indicate pagetype, as long as PG_kernel is set.
> >
> > So PG_kernel is a new special flag, I thought it indicates
> > "not usermappable pages", but considering PG_vmalloc it's not.
>
> Right, it means "The kernel allocated this page for its own purposes;
> what that purpose is might be available by looking at PG_type". ie
> it's not-anon, not-page-cache.
>
> > > So, eg
> > > PageSlab() can now be (page->flags & PG_type) == PG_slab where
> >
> > But if PG_xxx and PG_slab shares same bit, PG_xxx would be confused?
>
> Correct. Ideally those tests wouldn't be used on arbitrary pages,
> only pages which are already confirmed to be anon or file. I suspect
> we haven't been super-careful about that in the past, and so there
> would be some degree of "Oh, we need to fix this up". But flags like
> PG_mappedtodisk, PG_mlocked, PG_unevictable, PG_workingset should be
> all gated behind "We know this is anon/file".
Okay, let's just try it then!
> > > PG_dirty 0x000008
> > > PG_owner_priv_1 0x000010
> > > PG_arch_1 0x000020
> > > PG_private 0x000040
> > > PG_waiters 0x000080
> > > PG_kernel 0x000100
> > > PG_referenced 0x000200
> > > PG_uptodate 0x000400
> > > PG_lru 0x000800
> > > PG_active 0x001000
> > > PG_workingset 0x002000
> > > PG_error 0x004000
> > > PG_private_2 0x008000
> > > PG_mappedtodisk 0x010000
> > > PG_reclaim 0x020000
> > > PG_swapbacked 0x040000
> > > PG_unevictable 0x080000
> > > PG_mlocked 0x100000
> > >
> > > ... or something. There are a number of constraints and it may take
> > > a few iterations to get this right. Oh, and if this is the layout
> > > we use, then:
> > >
> > > PG_type 0x1fff00
> > > PG_reserved (PG_kernel | 0x200)
> > > PG_slab (PG_kernel | 0x400)
> > > PG_buddy (PG_kernel | 0x600)
> > > PG_offline (PG_kernel | 0x800)
> > > PG_table (PG_kernel | 0xa00)
> > > PG_guard (PG_kernel | 0xc00)
> > > PG_vmalloc (PG_kernel | 0xe00)
> >
> > what is PG_vmalloc for, is it just an example for
> > explaining possible layout?
>
> I really want to mark pages as being allocated from vmalloc. It's
> one of the things we could do to make debugging better.
Got it. Anyway, the proposed approach is not compatible with this series,
so I'll try implementing a new series based on your outline.
Thanks!