2015-11-09 07:23:23

by Joonsoo Kim

Subject: [PATCH 1/2] mm: introduce page reference manipulation functions

The success of CMA allocation largely depends on the success of migration,
and the key factor there is the page reference count. Until now, the page
reference count has been manipulated by directly calling atomic functions,
so we cannot follow up on who manipulates it and where. That makes it hard
to find the actual reason for a CMA allocation failure. CMA allocation should
be guaranteed to succeed, so finding the offending place is really important.

In this patch, the call sites that manipulate the page reference count are
converted to the newly introduced wrapper functions. This is a preparation
step for adding a tracepoint to each page reference manipulation function.
With that facility, we can easily find the reason for a CMA allocation failure.
There is no functional change in this patch.
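
For illustration, a typical conversion looks like this (the complete list of
converted call sites is in the diff below):

	/* before: open-coded atomics on the page reference count */
	atomic_inc(&page->_count);
	atomic_add(nr, &page->_count);
	atomic_dec_and_test(&page->_count);

	/* after: the equivalent wrappers from the new <linux/page_ref.h> */
	page_ref_inc(page);
	page_ref_add(page, nr);
	page_ref_dec_and_test(page);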

Signed-off-by: Joonsoo Kim <[email protected]>
---
arch/mips/mm/gup.c | 2 +-
arch/powerpc/mm/mmu_context_hash64.c | 3 +-
arch/powerpc/mm/pgtable_64.c | 2 +-
arch/x86/mm/gup.c | 2 +-
drivers/block/aoe/aoecmd.c | 4 +-
drivers/net/ethernet/freescale/gianfar.c | 2 +-
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +--
drivers/net/ethernet/sun/niu.c | 2 +-
include/linux/mm.h | 21 ++-----
include/linux/page_ref.h | 76 +++++++++++++++++++++++
include/linux/pagemap.h | 19 +-----
mm/huge_memory.c | 6 +-
mm/internal.h | 5 --
mm/memory_hotplug.c | 4 +-
mm/migrate.c | 10 +--
mm/page_alloc.c | 6 +-
mm/vmscan.c | 6 +-
21 files changed, 114 insertions(+), 71 deletions(-)
create mode 100644 include/linux/page_ref.h

diff --git a/arch/mips/mm/gup.c b/arch/mips/mm/gup.c
index 1afd87c..6cdffc7 100644
--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -64,7 +64,7 @@ static inline void get_head_page_multiple(struct page *page, int nr)
{
VM_BUG_ON(page != compound_head(page));
VM_BUG_ON(page_count(page) == 0);
- atomic_add(nr, &page->_count);
+ page_ref_add(page, nr);
SetPageReferenced(page);
}

diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 4e4efbc..9ca6fe1 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -118,8 +118,7 @@ static void destroy_pagetable_page(struct mm_struct *mm)
/* drop all the pending references */
count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
- count = atomic_sub_return(PTE_FRAG_NR - count, &page->_count);
- if (!count) {
+ if (page_ref_sub_and_test(page, PTE_FRAG_NR - count)) {
pgtable_page_dtor(page);
free_hot_cold_page(page, 0);
}
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index 422c59a..b1cab84 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -403,7 +403,7 @@ static pte_t *__alloc_for_cache(struct mm_struct *mm, int kernel)
* count.
*/
if (likely(!mm->context.pte_frag)) {
- atomic_set(&page->_count, PTE_FRAG_NR);
+ set_page_count(page, PTE_FRAG_NR);
mm->context.pte_frag = ret + PTE_FRAG_SIZE;
}
spin_unlock(&mm->page_table_lock);
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index f8cb3e8..ba002ae 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -110,7 +110,7 @@ static inline void get_head_page_multiple(struct page *page, int nr)
{
VM_BUG_ON_PAGE(page != compound_head(page), page);
VM_BUG_ON_PAGE(page_count(page) == 0, page);
- atomic_add(nr, &page->_count);
+ page_ref_add(page, nr);
SetPageReferenced(page);
}

diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index ad80c85..2c8dd1b 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -875,7 +875,7 @@ bio_pageinc(struct bio *bio)
* compound pages is no longer allowed by the kernel.
*/
page = compound_head(bv.bv_page);
- atomic_inc(&page->_count);
+ page_ref_inc(page);
}
}

@@ -888,7 +888,7 @@ bio_pagedec(struct bio *bio)

bio_for_each_segment(bv, bio, iter) {
page = compound_head(bv.bv_page);
- atomic_dec(&page->_count);
+ page_ref_dec(page);
}
}

diff --git a/drivers/net/ethernet/freescale/gianfar.c b/drivers/net/ethernet/freescale/gianfar.c
index 3e6b9b4..4511015 100644
--- a/drivers/net/ethernet/freescale/gianfar.c
+++ b/drivers/net/ethernet/freescale/gianfar.c
@@ -2943,7 +2943,7 @@ static bool gfar_add_rx_frag(struct gfar_rx_buff *rxb, u32 lstatus,
/* change offset to the other half */
rxb->page_offset ^= GFAR_RXB_TRUESIZE;

- atomic_inc(&page->_count);
+ page_ref_inc(page);

return true;
}
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e76a44c..3766a0d 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -245,7 +245,7 @@ static bool fm10k_can_reuse_rx_page(struct fm10k_rx_buffer *rx_buffer,
/* Even if we own the page, we are not allowed to use atomic_set()
* This would break get_page_unless_zero() users.
*/
- atomic_inc(&page->_count);
+ page_ref_inc(page);

return true;
}
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index ea7b098..9a7d492 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6622,7 +6622,7 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer,
/* Even if we own the page, we are not allowed to use atomic_set()
* This would break get_page_unless_zero() users.
*/
- atomic_inc(&page->_count);
+ page_ref_inc(page);

return true;
}
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 47395ff..489b1c5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1941,7 +1941,7 @@ static bool ixgbe_add_rx_frag(struct ixgbe_ring *rx_ring,
/* Even if we own the page, we are not allowed to use atomic_set()
* This would break get_page_unless_zero() users.
*/
- atomic_inc(&page->_count);
+ page_ref_inc(page);

return true;
}
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 592ff23..424a159 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -835,7 +835,7 @@ add_tail_frag:
/* Even if we own the page, we are not allowed to use atomic_set()
* This would break get_page_unless_zero() users.
*/
- atomic_inc(&page->_count);
+ page_ref_inc(page);

return true;
}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index e7a5000..98f4536 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -82,8 +82,7 @@ static int mlx4_alloc_pages(struct mlx4_en_priv *priv,
/* Not doing get_page() for each frag is a big win
* on asymetric workloads. Note we can not use atomic_set().
*/
- atomic_add(page_alloc->page_size / frag_info->frag_stride - 1,
- &page->_count);
+ page_ref_add(page, page_alloc->page_size / frag_info->frag_stride - 1);
return 0;
}

@@ -127,7 +126,7 @@ out:
dma_unmap_page(priv->ddev, page_alloc[i].dma,
page_alloc[i].page_size, PCI_DMA_FROMDEVICE);
page = page_alloc[i].page;
- atomic_set(&page->_count, 1);
+ set_page_count(page, 1);
put_page(page);
}
}
@@ -177,7 +176,7 @@ out:
dma_unmap_page(priv->ddev, page_alloc->dma,
page_alloc->page_size, PCI_DMA_FROMDEVICE);
page = page_alloc->page;
- atomic_set(&page->_count, 1);
+ set_page_count(page, 1);
put_page(page);
page_alloc->page = NULL;
}
diff --git a/drivers/net/ethernet/sun/niu.c b/drivers/net/ethernet/sun/niu.c
index ab6051a..9cc4564 100644
--- a/drivers/net/ethernet/sun/niu.c
+++ b/drivers/net/ethernet/sun/niu.c
@@ -3341,7 +3341,7 @@ static int niu_rbr_add_page(struct niu *np, struct rx_ring_info *rp,

niu_hash_page(rp, page, addr);
if (rp->rbr_blocks_per_page > 1)
- atomic_add(rp->rbr_blocks_per_page - 1, &page->_count);
+ page_ref_add(page, rp->rbr_blocks_per_page - 1);

for (i = 0; i < rp->rbr_blocks_per_page; i++) {
__le32 *rbr = &rp->rbr[start_index + i];
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 30ef3b5..d562638 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -21,6 +21,7 @@
#include <linux/resource.h>
#include <linux/page_ext.h>
#include <linux/err.h>
+#include <linux/page_ref.h>

struct mempolicy;
struct anon_vma;
@@ -340,7 +341,7 @@ struct inode;
static inline int put_page_testzero(struct page *page)
{
VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0, page);
- return atomic_dec_and_test(&page->_count);
+ return page_ref_dec_and_test(page);
}

/*
@@ -351,7 +352,7 @@ static inline int put_page_testzero(struct page *page)
*/
static inline int get_page_unless_zero(struct page *page)
{
- return atomic_inc_not_zero(&page->_count);
+ return page_ref_add_unless(page, 1, 0);
}

extern int page_is_ram(unsigned long pfn);
@@ -433,11 +434,6 @@ static inline int page_mapcount(struct page *page)
return ret;
}

-static inline int page_count(struct page *page)
-{
- return atomic_read(&compound_head(page)->_count);
-}
-
static inline void get_page(struct page *page)
{
page = compound_head(page);
@@ -446,7 +442,7 @@ static inline void get_page(struct page *page)
* requires to already have an elevated page->_count.
*/
VM_BUG_ON_PAGE(atomic_read(&page->_count) <= 0, page);
- atomic_inc(&page->_count);
+ page_ref_inc(page);
}

static inline struct page *virt_to_head_page(const void *x)
@@ -456,15 +452,6 @@ static inline struct page *virt_to_head_page(const void *x)
return compound_head(page);
}

-/*
- * Setup the page count before being freed into the page allocator for
- * the first time (boot or memory hotplug)
- */
-static inline void init_page_count(struct page *page)
-{
- atomic_set(&page->_count, 1);
-}
-
void __put_page(struct page *page);

static inline void put_page(struct page *page)
diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
new file mode 100644
index 0000000..534249c
--- /dev/null
+++ b/include/linux/page_ref.h
@@ -0,0 +1,76 @@
+#include <linux/atomic.h>
+#include <linux/mm_types.h>
+#include <linux/page-flags.h>
+
+static inline int page_count(struct page *page)
+{
+ return atomic_read(&compound_head(page)->_count);
+}
+
+static inline void set_page_count(struct page *page, int v)
+{
+ atomic_set(&page->_count, v);
+}
+
+/*
+ * Setup the page count before being freed into the page allocator for
+ * the first time (boot or memory hotplug)
+ */
+static inline void init_page_count(struct page *page)
+{
+ set_page_count(page, 1);
+}
+
+static inline void page_ref_add(struct page *page, int nr)
+{
+ atomic_add(nr, &page->_count);
+}
+
+static inline void page_ref_sub(struct page *page, int nr)
+{
+ atomic_sub(nr, &page->_count);
+}
+
+static inline void page_ref_inc(struct page *page)
+{
+ atomic_inc(&page->_count);
+}
+
+static inline void page_ref_dec(struct page *page)
+{
+ atomic_dec(&page->_count);
+}
+
+static inline int page_ref_sub_and_test(struct page *page, int nr)
+{
+ return atomic_sub_and_test(nr, &page->_count);
+}
+
+static inline int page_ref_dec_and_test(struct page *page)
+{
+ return atomic_dec_and_test(&page->_count);
+}
+
+static inline int page_ref_dec_return(struct page *page)
+{
+ return atomic_dec_return(&page->_count);
+}
+
+static inline int page_ref_add_unless(struct page *page, int nr, int u)
+{
+ return atomic_add_unless(&page->_count, nr, u);
+}
+
+static inline int page_ref_freeze(struct page *page, int count)
+{
+ return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+}
+
+static inline void page_ref_unfreeze(struct page *page, int count)
+{
+ VM_BUG_ON_PAGE(page_count(page) != 0, page);
+ VM_BUG_ON(count == 0);
+
+ atomic_set(&page->_count, count);
+}
+
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4d08b6c..67ce560 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -165,7 +165,7 @@ static inline int page_cache_get_speculative(struct page *page)
* SMP requires.
*/
VM_BUG_ON_PAGE(page_count(page) == 0, page);
- atomic_inc(&page->_count);
+ page_ref_inc(page);

#else
if (unlikely(!get_page_unless_zero(page))) {
@@ -194,10 +194,10 @@ static inline int page_cache_add_speculative(struct page *page, int count)
VM_BUG_ON(!in_atomic());
# endif
VM_BUG_ON_PAGE(page_count(page) == 0, page);
- atomic_add(count, &page->_count);
+ page_ref_add(page, count);

#else
- if (unlikely(!atomic_add_unless(&page->_count, count, 0)))
+ if (unlikely(!page_ref_add_unless(page, count, 0)))
return 0;
#endif
VM_BUG_ON_PAGE(PageCompound(page) && page != compound_head(page), page);
@@ -205,19 +205,6 @@ static inline int page_cache_add_speculative(struct page *page, int count)
return 1;
}

-static inline int page_freeze_refs(struct page *page, int count)
-{
- return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
-}
-
-static inline void page_unfreeze_refs(struct page *page, int count)
-{
- VM_BUG_ON_PAGE(page_count(page) != 0, page);
- VM_BUG_ON(count == 0);
-
- atomic_set(&page->_count, count);
-}
-
#ifdef CONFIG_NUMA
extern struct page *__page_cache_alloc(gfp_t gfp);
#else
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4b3420a..a189f27 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2881,7 +2881,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

page = pmd_page(*pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
- atomic_add(HPAGE_PMD_NR - 1, &page->_count);
+ page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
young = pmd_young(*pmd);

@@ -3197,7 +3197,7 @@ static int __split_huge_page_tail(struct page *head, int tail,
* atomic_set() here would be safe on all archs (and not only on x86),
* it's safer to use atomic_add().
*/
- atomic_add(mapcount + 1, &page_tail->_count);
+ page_ref_add(page_tail, mapcount + 1);

/* after clearing PageTail the gup refcount can be released */
smp_mb__after_atomic();
@@ -3256,7 +3256,7 @@ static void __split_huge_page(struct page *page, struct list_head *list)
tail_mapcount = 0;
for (i = HPAGE_PMD_NR - 1; i >= 1; i--)
tail_mapcount += __split_huge_page_tail(head, i, lruvec, list);
- atomic_sub(tail_mapcount, &head->_count);
+ page_ref_sub(head, tail_mapcount);

ClearPageCompound(head);
spin_unlock_irq(&zone->lru_lock);
diff --git a/mm/internal.h b/mm/internal.h
index dbe0436..f7be635 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,11 +41,6 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);

-static inline void set_page_count(struct page *page, int v)
-{
- atomic_set(&page->_count, v);
-}
-
extern int __do_page_cache_readahead(struct address_space *mapping,
struct file *filp, pgoff_t offset, unsigned long nr_to_read,
unsigned long lookahead_size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 67d488a..a598c1c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -161,7 +161,7 @@ void get_page_bootmem(unsigned long info, struct page *page,
page->lru.next = (struct list_head *) type;
SetPagePrivate(page);
set_page_private(page, info);
- atomic_inc(&page->_count);
+ page_ref_inc(page);
}

void put_page_bootmem(struct page *page)
@@ -172,7 +172,7 @@ void put_page_bootmem(struct page *page)
BUG_ON(type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE);

- if (atomic_dec_return(&page->_count) == 1) {
+ if (page_ref_dec_return(page) == 1) {
ClearPagePrivate(page);
set_page_private(page, 0);
INIT_LIST_HEAD(&page->lru);
diff --git a/mm/migrate.c b/mm/migrate.c
index 1ae0113..b48a9b1 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -349,7 +349,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
return -EAGAIN;
}

- if (!page_freeze_refs(page, expected_count)) {
+ if (!page_ref_freeze(page, expected_count)) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -363,7 +363,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
*/
if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
- page_unfreeze_refs(page, expected_count);
+ page_ref_unfreeze(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -398,7 +398,7 @@ int migrate_page_move_mapping(struct address_space *mapping,
* to one less reference.
* We know this isn't the last reference.
*/
- page_unfreeze_refs(page, expected_count - 1);
+ page_ref_unfreeze(page, expected_count - 1);

spin_unlock(&mapping->tree_lock);
/* Leave irq disabled to prevent preemption while updating stats */
@@ -452,7 +452,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
return -EAGAIN;
}

- if (!page_freeze_refs(page, expected_count)) {
+ if (!page_ref_freeze(page, expected_count)) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -464,7 +464,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,

radix_tree_replace_slot(pslot, newpage);

- page_unfreeze_refs(page, expected_count - 1);
+ page_ref_unfreeze(page, expected_count - 1);

spin_unlock_irq(&mapping->tree_lock);
return MIGRATEPAGE_SUCCESS;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e78d78f..40c07db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3388,7 +3388,7 @@ refill:
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
- atomic_add(size - 1, &page->_count);
+ page_ref_add(page, size - 1);

/* reset page count bias and offset to start of new frag */
nc->pfmemalloc = page_is_pfmemalloc(page);
@@ -3400,7 +3400,7 @@ refill:
if (unlikely(offset < 0)) {
page = virt_to_page(nc->va);

- if (!atomic_sub_and_test(nc->pagecnt_bias, &page->_count))
+ if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
goto refill;

#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
@@ -3408,7 +3408,7 @@ refill:
size = nc->size;
#endif
/* OK, page count is 0, we can safely set it */
- atomic_set(&page->_count, size);
+ set_page_count(page, size);

/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = size;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9b52ecf..d9ccf02 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -642,11 +642,11 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (!page_freeze_refs(page, 2))
+ if (!page_ref_freeze(page, 2))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
- page_unfreeze_refs(page, 2);
+ page_ref_unfreeze(page, 2);
goto cannot_free;
}

@@ -705,7 +705,7 @@ int remove_mapping(struct address_space *mapping, struct page *page)
* drops the pagecache ref for us without requiring another
* atomic operation.
*/
- page_unfreeze_refs(page, 1);
+ page_ref_unfreeze(page, 1);
return 1;
}
return 0;
--
1.9.1


2015-11-09 07:23:33

by Joonsoo Kim

Subject: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

CMA allocation should be guaranteed to succeed by definition, but,
unfortunately, it sometimes fails. The problem is hard to track down
because it is related to page reference manipulation and we don't have
any facility to analyze it.

This patch adds tracepoints to track down page reference manipulation.
With them, we can find the exact reason for a failure and fix the problem.
The following is an example of the tracepoint output.

<...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
<...>-9018 [004] 92.678378: kernel_stack:
=> get_page_from_freelist (ffffffff81176659)
=> __alloc_pages_nodemask (ffffffff81176d22)
=> alloc_pages_vma (ffffffff811bf675)
=> handle_mm_fault (ffffffff8119e693)
=> __do_page_fault (ffffffff810631ea)
=> trace_do_page_fault (ffffffff81063543)
=> do_async_page_fault (ffffffff8105c40a)
=> async_page_fault (ffffffff817581d8)
[snip]
<...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
[snip]
...
...
<...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
[snip]
<...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
=> release_pages (ffffffff8117c9e4)
=> free_pages_and_swap_cache (ffffffff811b0697)
=> tlb_flush_mmu_free (ffffffff81199616)
=> tlb_finish_mmu (ffffffff8119a62c)
=> exit_mmap (ffffffff811a53f7)
=> mmput (ffffffff81073f47)
=> do_exit (ffffffff810794e9)
=> do_group_exit (ffffffff81079def)
=> SyS_exit_group (ffffffff81079e74)
=> entry_SYSCALL_64_fastpath (ffffffff817560b6)

This output shows that the problem comes from the exit path. In the exit path,
to improve performance, pages are not freed immediately; they are gathered
and processed in a batch. During this process, migration is not possible
and the CMA allocation fails. This problem is hard to find without this
page reference tracepoint facility.

Enabling this feature bloats the kernel text by about 20 KB in my configuration.

text data bss dec hex filename
12041272 2223424 1507328 15772024 f0a978 vmlinux_disabled
12064844 2225920 1507328 15798092 f10f4c vmlinux_enabled

Signed-off-by: Joonsoo Kim <[email protected]>
---
include/linux/page_ref.h | 67 +++++++++++++++++++--
include/trace/events/page_ref.h | 128 ++++++++++++++++++++++++++++++++++++++++
mm/Kconfig.debug | 4 ++
mm/Makefile | 1 +
mm/debug_page_ref.c | 46 +++++++++++++++
5 files changed, 241 insertions(+), 5 deletions(-)
create mode 100644 include/trace/events/page_ref.h
create mode 100644 mm/debug_page_ref.c

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 534249c..de81073 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -2,6 +2,42 @@
#include <linux/mm_types.h>
#include <linux/page-flags.h>

+#ifdef CONFIG_DEBUG_PAGE_REF
+extern void __page_ref_set(struct page *page, int v);
+extern void __page_ref_mod(struct page *page, int v);
+extern void __page_ref_mod_and_test(struct page *page, int v, int ret);
+extern void __page_ref_mod_and_return(struct page *page, int v, int ret);
+extern void __page_ref_mod_unless(struct page *page, int v, int u);
+extern void __page_ref_freeze(struct page *page, int v, int ret);
+extern void __page_ref_unfreeze(struct page *page, int v);
+
+#else
+
+
+static inline void __page_ref_set(struct page *page, int v)
+{
+}
+static inline void __page_ref_mod(struct page *page, int v)
+{
+}
+static inline void __page_ref_mod_and_test(struct page *page, int v, int ret)
+{
+}
+static inline void __page_ref_mod_and_return(struct page *page, int v, int ret)
+{
+}
+static inline void __page_ref_mod_unless(struct page *page, int v, int u)
+{
+}
+static inline void __page_ref_freeze(struct page *page, int v, int ret)
+{
+}
+static inline void __page_ref_unfreeze(struct page *page, int v)
+{
+}
+
+#endif
+
static inline int page_count(struct page *page)
{
return atomic_read(&compound_head(page)->_count);
@@ -10,6 +46,7 @@ static inline int page_count(struct page *page)
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
+ __page_ref_set(page, v);
}

/*
@@ -24,46 +61,65 @@ static inline void init_page_count(struct page *page)
static inline void page_ref_add(struct page *page, int nr)
{
atomic_add(nr, &page->_count);
+ __page_ref_mod(page, nr);
}

static inline void page_ref_sub(struct page *page, int nr)
{
atomic_sub(nr, &page->_count);
+ __page_ref_mod(page, -nr);
}

static inline void page_ref_inc(struct page *page)
{
atomic_inc(&page->_count);
+ __page_ref_mod(page, 1);
}

static inline void page_ref_dec(struct page *page)
{
atomic_dec(&page->_count);
+ __page_ref_mod(page, -1);
}

static inline int page_ref_sub_and_test(struct page *page, int nr)
{
- return atomic_sub_and_test(nr, &page->_count);
+ int ret = atomic_sub_and_test(nr, &page->_count);
+
+ __page_ref_mod_and_test(page, -nr, ret);
+ return ret;
}

static inline int page_ref_dec_and_test(struct page *page)
{
- return atomic_dec_and_test(&page->_count);
+ int ret = atomic_dec_and_test(&page->_count);
+
+ __page_ref_mod_and_test(page, -1, ret);
+ return ret;
}

static inline int page_ref_dec_return(struct page *page)
{
- return atomic_dec_return(&page->_count);
+ int ret = atomic_dec_return(&page->_count);
+
+ __page_ref_mod_and_return(page, -1, ret);
+ return ret;
}

static inline int page_ref_add_unless(struct page *page, int nr, int u)
{
- return atomic_add_unless(&page->_count, nr, u);
+ int ret = atomic_add_unless(&page->_count, nr, u);
+
+ __page_ref_mod_unless(page, nr, ret);
+ return ret;
}

static inline int page_ref_freeze(struct page *page, int count)
{
- return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+ int ret = likely(atomic_cmpxchg(&page->_count, count, 0) == count);
+
+ __page_ref_freeze(page, count, ret);
+ return ret;
}

static inline void page_ref_unfreeze(struct page *page, int count)
@@ -72,5 +128,6 @@ static inline void page_ref_unfreeze(struct page *page, int count)
VM_BUG_ON(count == 0);

atomic_set(&page->_count, count);
+ __page_ref_unfreeze(page, count);
}

diff --git a/include/trace/events/page_ref.h b/include/trace/events/page_ref.h
new file mode 100644
index 0000000..6c5fd5b
--- /dev/null
+++ b/include/trace/events/page_ref.h
@@ -0,0 +1,128 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM page_ref
+
+#if !defined(_TRACE_PAGE_REF_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PAGE_REF_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(page_ref_mod_template,
+
+ TP_PROTO(struct page *page, int v),
+
+ TP_ARGS(page, v),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(unsigned long, flags)
+ __field(int, count)
+ __field(int, mapcount)
+ __field(void *, mapping)
+ __field(int, mt)
+ __field(int, val)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->flags = page->flags & ((1UL << NR_PAGEFLAGS) - 1);
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ __entry->mapping = page->mapping;
+ __entry->mt = get_pageblock_migratetype(page);
+ __entry->val = v;
+ ),
+
+ TP_printk("pfn=0x%lx flags=0x%lx count=%d mapcount=%d mapping=%p mt=%d val=%d",
+ __entry->pfn, __entry->flags, __entry->count,
+ __entry->mapcount, __entry->mapping, __entry->mt,
+ __entry->val)
+);
+
+DEFINE_EVENT(page_ref_mod_template, page_ref_set,
+
+ TP_PROTO(struct page *page, int v),
+
+ TP_ARGS(page, v)
+);
+
+DEFINE_EVENT(page_ref_mod_template, page_ref_mod,
+
+ TP_PROTO(struct page *page, int v),
+
+ TP_ARGS(page, v)
+);
+
+DECLARE_EVENT_CLASS(page_ref_mod_and_test_template,
+
+ TP_PROTO(struct page *page, int v, int ret),
+
+ TP_ARGS(page, v, ret),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, pfn)
+ __field(unsigned long, flags)
+ __field(int, count)
+ __field(int, mapcount)
+ __field(void *, mapping)
+ __field(int, mt)
+ __field(int, val)
+ __field(int, ret)
+ ),
+
+ TP_fast_assign(
+ __entry->pfn = page_to_pfn(page);
+ __entry->flags = page->flags & ((1UL << NR_PAGEFLAGS) - 1);
+ __entry->count = atomic_read(&page->_count);
+ __entry->mapcount = page_mapcount(page);
+ __entry->mapping = page->mapping;
+ __entry->mt = get_pageblock_migratetype(page);
+ __entry->val = v;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("pfn=0x%lx flags=0x%lx count=%d mapcount=%d mapping=%p mt=%d val=%d ret=%d",
+ __entry->pfn, __entry->flags, __entry->count,
+ __entry->mapcount, __entry->mapping, __entry->mt,
+ __entry->val, __entry->ret)
+);
+
+DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_and_test,
+
+ TP_PROTO(struct page *page, int v, int ret),
+
+ TP_ARGS(page, v, ret)
+);
+
+DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_and_return,
+
+ TP_PROTO(struct page *page, int v, int ret),
+
+ TP_ARGS(page, v, ret)
+);
+
+DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_mod_unless,
+
+ TP_PROTO(struct page *page, int v, int ret),
+
+ TP_ARGS(page, v, ret)
+);
+
+DEFINE_EVENT(page_ref_mod_and_test_template, page_ref_freeze,
+
+ TP_PROTO(struct page *page, int v, int ret),
+
+ TP_ARGS(page, v, ret)
+);
+
+DEFINE_EVENT(page_ref_mod_template, page_ref_unfreeze,
+
+ TP_PROTO(struct page *page, int v),
+
+ TP_ARGS(page, v)
+);
+
+#endif /* _TRACE_PAGE_REF_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
index 957d3da..71d2399 100644
--- a/mm/Kconfig.debug
+++ b/mm/Kconfig.debug
@@ -28,3 +28,7 @@ config DEBUG_PAGEALLOC

config PAGE_POISONING
bool
+
+config DEBUG_PAGE_REF
+ bool "Enable tracepoint to track down page reference manipulation"
+ depends on DEBUG_KERNEL
diff --git a/mm/Makefile b/mm/Makefile
index 2ed4319..000f89f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -81,3 +81,4 @@ obj-$(CONFIG_CMA_DEBUGFS) += cma_debug.o
obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
obj-$(CONFIG_FRAME_VECTOR) += frame_vector.o
+obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
diff --git a/mm/debug_page_ref.c b/mm/debug_page_ref.c
new file mode 100644
index 0000000..d80b376
--- /dev/null
+++ b/mm/debug_page_ref.c
@@ -0,0 +1,46 @@
+#include <linux/tracepoint.h>
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/page_ref.h>
+
+void __page_ref_set(struct page *page, int v)
+{
+ trace_page_ref_set(page, v);
+}
+EXPORT_SYMBOL(__page_ref_set);
+
+void __page_ref_mod(struct page *page, int v)
+{
+ trace_page_ref_mod(page, v);
+}
+EXPORT_SYMBOL(__page_ref_mod);
+
+void __page_ref_mod_and_test(struct page *page, int v, int ret)
+{
+ trace_page_ref_mod_and_test(page, v, ret);
+}
+EXPORT_SYMBOL(__page_ref_mod_and_test);
+
+void __page_ref_mod_and_return(struct page *page, int v, int ret)
+{
+ trace_page_ref_mod_and_return(page, v, ret);
+}
+EXPORT_SYMBOL(__page_ref_mod_and_return);
+
+void __page_ref_mod_unless(struct page *page, int v, int u)
+{
+ trace_page_ref_mod_unless(page, v, u);
+}
+EXPORT_SYMBOL(__page_ref_mod_unless);
+
+void __page_ref_freeze(struct page *page, int v, int ret)
+{
+ trace_page_ref_freeze(page, v, ret);
+}
+EXPORT_SYMBOL(__page_ref_freeze);
+
+void __page_ref_unfreeze(struct page *page, int v)
+{
+ trace_page_ref_unfreeze(page, v);
+}
+EXPORT_SYMBOL(__page_ref_unfreeze);
--
1.9.1

2015-11-09 07:52:46

by Sergey Senozhatsky

Subject: Re: [PATCH 1/2] mm: introduce page reference manipulation functions

Hi,

On (11/09/15 16:23), Joonsoo Kim wrote:
[..]
> +static inline int page_count(struct page *page)
> +{
> + return atomic_read(&compound_head(page)->_count);
> +}
> +
> +static inline void set_page_count(struct page *page, int v)
> +{
> + atomic_set(&page->_count, v);
> +}
> +
> +/*
> + * Setup the page count before being freed into the page allocator for
> + * the first time (boot or memory hotplug)
> + */
> +static inline void init_page_count(struct page *page)
> +{
> + set_page_count(page, 1);
> +}
> +
> +static inline void page_ref_add(struct page *page, int nr)
> +{
> + atomic_add(nr, &page->_count);
> +}

Since the page_ref_FOO wrappers operate on page->_count and there
are already page_count()/set_page_count()/etc., maybe name the new
wrappers in the page_count_FOO() manner?


> +static inline void page_ref_sub(struct page *page, int nr)

for example, page_count_sub(), etc.

-ss

> +{
> + atomic_sub(nr, &page->_count);
> +}
> +
> +static inline void page_ref_inc(struct page *page)
> +{
> + atomic_inc(&page->_count);
> +}
> +
> +static inline void page_ref_dec(struct page *page)
> +{
> + atomic_dec(&page->_count);
> +}
> +
> +static inline int page_ref_sub_and_test(struct page *page, int nr)
> +{
> + return atomic_sub_and_test(nr, &page->_count);
> +}
> +
> +static inline int page_ref_dec_and_test(struct page *page)
> +{
> + return atomic_dec_and_test(&page->_count);
> +}
> +
> +static inline int page_ref_dec_return(struct page *page)
> +{
> + return atomic_dec_return(&page->_count);
> +}
> +
> +static inline int page_ref_add_unless(struct page *page, int nr, int u)
> +{
> + return atomic_add_unless(&page->_count, nr, u);
> +}
> +
> +static inline int page_ref_freeze(struct page *page, int count)
> +{
> + return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> +}
> +
> +static inline void page_ref_unfreeze(struct page *page, int count)
> +{
> + VM_BUG_ON_PAGE(page_count(page) != 0, page);
> + VM_BUG_ON(count == 0);
> +
> + atomic_set(&page->_count, count);
> +}

2015-11-09 08:00:35

by Joonsoo Kim

Subject: Re: [PATCH 1/2] mm: introduce page reference manipulation functions

2015-11-09 16:53 GMT+09:00 Sergey Senozhatsky
<[email protected]>:
> Hi,
>
> On (11/09/15 16:23), Joonsoo Kim wrote:
> [..]
>> +static inline int page_count(struct page *page)
>> +{
>> + return atomic_read(&compound_head(page)->_count);
>> +}
>> +
>> +static inline void set_page_count(struct page *page, int v)
>> +{
>> + atomic_set(&page->_count, v);
>> +}
>> +
>> +/*
>> + * Setup the page count before being freed into the page allocator for
>> + * the first time (boot or memory hotplug)
>> + */
>> +static inline void init_page_count(struct page *page)
>> +{
>> + set_page_count(page, 1);
>> +}
>> +
>> +static inline void page_ref_add(struct page *page, int nr)
>> +{
>> + atomic_add(nr, &page->_count);
>> +}
>
> Since page_ref_FOO wrappers operate with page->_count and there
> are already page_count()/set_page_count()/etc. may be name new
> wrappers in page_count_FOO() manner?

Hello,

I used the page_count_ prefix before but changed my mind.
I think that ref is more relevant to this operation.
Perhaps it'd be better to change page_count()/set_page_count()
to page_ref()/set_page_ref().

FYI, some functions such as page_(un)freeze_refs already use ref. :)

Thanks.

2015-11-09 11:45:46

by Kirill A. Shutemov

Subject: Re: [PATCH 1/2] mm: introduce page reference manipulation functions

On Mon, Nov 09, 2015 at 05:00:32PM +0900, Joonsoo Kim wrote:
> 2015-11-09 16:53 GMT+09:00 Sergey Senozhatsky
> <[email protected]>:
> > Hi,
> >
> > On (11/09/15 16:23), Joonsoo Kim wrote:
> > [..]
> >> +static inline int page_count(struct page *page)
> >> +{
> >> + return atomic_read(&compound_head(page)->_count);
> >> +}
> >> +
> >> +static inline void set_page_count(struct page *page, int v)
> >> +{
> >> + atomic_set(&page->_count, v);
> >> +}
> >> +
> >> +/*
> >> + * Setup the page count before being freed into the page allocator for
> >> + * the first time (boot or memory hotplug)
> >> + */
> >> +static inline void init_page_count(struct page *page)
> >> +{
> >> + set_page_count(page, 1);
> >> +}
> >> +
> >> +static inline void page_ref_add(struct page *page, int nr)
> >> +{
> >> + atomic_add(nr, &page->_count);
> >> +}
> >
> > Since page_ref_FOO wrappers operate with page->_count and there
> > are already page_count()/set_page_count()/etc. may be name new
> > wrappers in page_count_FOO() manner?
>
> Hello,
>
> I used that page_count_ before but change my mind.
> I think that ref is more relevant to this operation.
> Perhaps, it'd be better to change page_count()/set_page_count()
> to page_ref()/set_page_ref().

What about get_page() vs. page_cache_get() and put_page() vs.
page_cache_release()? Having two different helpers for the same thing has been
annoying me for some time (plus PAGE_SIZE vs. PAGE_CACHE_SIZE, etc.).

If you want a coherent API you might want to make them consistent too.

> FYI, some functions such as page_(un)freeze_refs uses ref. :)

--
Kirill A. Shutemov

2015-11-10 00:28:18

by Joonsoo Kim

Subject: Re: [PATCH 1/2] mm: introduce page reference manipulation functions

On Mon, Nov 09, 2015 at 01:45:37PM +0200, Kirill A. Shutemov wrote:
> On Mon, Nov 09, 2015 at 05:00:32PM +0900, Joonsoo Kim wrote:
> > 2015-11-09 16:53 GMT+09:00 Sergey Senozhatsky
> > <[email protected]>:
> > > Hi,
> > >
> > > On (11/09/15 16:23), Joonsoo Kim wrote:
> > > [..]
> > >> +static inline int page_count(struct page *page)
> > >> +{
> > >> + return atomic_read(&compound_head(page)->_count);
> > >> +}
> > >> +
> > >> +static inline void set_page_count(struct page *page, int v)
> > >> +{
> > >> + atomic_set(&page->_count, v);
> > >> +}
> > >> +
> > >> +/*
> > >> + * Setup the page count before being freed into the page allocator for
> > >> + * the first time (boot or memory hotplug)
> > >> + */
> > >> +static inline void init_page_count(struct page *page)
> > >> +{
> > >> + set_page_count(page, 1);
> > >> +}
> > >> +
> > >> +static inline void page_ref_add(struct page *page, int nr)
> > >> +{
> > >> + atomic_add(nr, &page->_count);
> > >> +}
> > >
> > > Since page_ref_FOO wrappers operate with page->_count and there
> > > are already page_count()/set_page_count()/etc. may be name new
> > > wrappers in page_count_FOO() manner?
> >
> > Hello,
> >
> > I used that page_count_ before but change my mind.
> > I think that ref is more relevant to this operation.
> > Perhaps, it'd be better to change page_count()/set_page_count()
> > to page_ref()/set_page_ref().
>
> What about get_page() vs. page_cache_get() and put_page() vs.
> page_cache_release()? Two different helpers for the same thing is annyoing
> me for some time (plus PAGE_SIZE vs. PAGE_CACHE_SIZE, etc.).
>
> If you want coherent API you might want to get them consitent too.

In fact, consistent naming is outside the scope of this patchset. :)

Thanks.

2015-11-10 15:59:03

by Michal Nazarewicz

Subject: Re: [PATCH 1/2] mm: introduce page reference manipulation functions

On Mon, Nov 09 2015, Joonsoo Kim wrote:
> Success of CMA allocation largely depends on success of migration
> and key factor of it is page reference count. Until now, page reference
> is manipulated by direct calling atomic functions so we cannot follow up
> who and where manipulate it. Then, it is hard to find actual reason
> of CMA allocation failure. CMA allocation should be guaranteed to succeed
> so finding offending place is really important.
>
> In this patch, call sites where page reference is manipulated are converted
> to introduced wrapper function. This is preparation step to add tracepoint
> to each page reference manipulation function. With this facility, we can
> easily find reason of CMA allocation failure. There is no functional change
> in this patch.
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Michal Nazarewicz <[email protected]>

> ---
> arch/mips/mm/gup.c | 2 +-
> arch/powerpc/mm/mmu_context_hash64.c | 3 +-
> arch/powerpc/mm/pgtable_64.c | 2 +-
> arch/x86/mm/gup.c | 2 +-
> drivers/block/aoe/aoecmd.c | 4 +-
> drivers/net/ethernet/freescale/gianfar.c | 2 +-
> drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
> drivers/net/ethernet/intel/igb/igb_main.c | 2 +-
> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
> drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
> drivers/net/ethernet/mellanox/mlx4/en_rx.c | 7 +--
> drivers/net/ethernet/sun/niu.c | 2 +-
> include/linux/mm.h | 21 ++-----
> include/linux/page_ref.h | 76 +++++++++++++++++++++++
> include/linux/pagemap.h | 19 +-----
> mm/huge_memory.c | 6 +-
> mm/internal.h | 5 --
> mm/memory_hotplug.c | 4 +-
> mm/migrate.c | 10 +--
> mm/page_alloc.c | 6 +-
> mm/vmscan.c | 6 +-
> 21 files changed, 114 insertions(+), 71 deletions(-)
> create mode 100644 include/linux/page_ref.h
>

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, ミハウ “mina86” ナザレヴイツ (o o)
ooo +--<[email protected]>--<xmpp:[email protected]>-----ooO--(_)--Ooo--

2015-11-10 16:02:50

by Michal Nazarewicz

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Mon, Nov 09 2015, Joonsoo Kim wrote:
> CMA allocation should be guaranteed to succeed by definition,

Uh? That’s a peculiar statement. Which is to say that it’s not true.

> but,
> unfortunately, it would be failed sometimes. It is hard to track down
> the problem, because it is related to page reference manipulation and
> we don't have any facility to analyze it.
>
> This patch adds tracepoints to track down page reference manipulation.
> With it, we can find exact reason of failure and can fix the problem.
> Following is an example of tracepoint output.
>
> <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
> <...>-9018 [004] 92.678378: kernel_stack:
> => get_page_from_freelist (ffffffff81176659)
> => __alloc_pages_nodemask (ffffffff81176d22)
> => alloc_pages_vma (ffffffff811bf675)
> => handle_mm_fault (ffffffff8119e693)
> => __do_page_fault (ffffffff810631ea)
> => trace_do_page_fault (ffffffff81063543)
> => do_async_page_fault (ffffffff8105c40a)
> => async_page_fault (ffffffff817581d8)
> [snip]
> <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
> [snip]
> ...
> ...
> <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
> [snip]
> <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
> => release_pages (ffffffff8117c9e4)
> => free_pages_and_swap_cache (ffffffff811b0697)
> => tlb_flush_mmu_free (ffffffff81199616)
> => tlb_finish_mmu (ffffffff8119a62c)
> => exit_mmap (ffffffff811a53f7)
> => mmput (ffffffff81073f47)
> => do_exit (ffffffff810794e9)
> => do_group_exit (ffffffff81079def)
> => SyS_exit_group (ffffffff81079e74)
> => entry_SYSCALL_64_fastpath (ffffffff817560b6)
>
> This output shows that problem comes from exit path. In exit path,
> to improve performance, pages are not freed immediately. They are gathered
> and processed by batch. During this process, migration cannot be possible
> and CMA allocation is failed. This problem is hard to find without this
> page reference tracepoint facility.
>
> Enabling this feature bloat kernel text 20 KB in my configuration.
>
> text data bss dec hex filename
> 12041272 2223424 1507328 15772024 f0a978 vmlinux_disabled
> 12064844 2225920 1507328 15798092 f10f4c vmlinux_enabled
>
> Signed-off-by: Joonsoo Kim <[email protected]>

Acked-by: Michal Nazarewicz <[email protected]>

> ---
> include/trace/events/page_ref.h | 128 ++++++++++++++++++++++++++++++++++++++++

I haven’t really looked at the above file though.

--
Best regards, _ _
.o. | Liege of Serenely Enlightened Majesty of o' \,=./ `o
..o | Computer Science, ミハウ “mina86” ナザレヴイツ (o o)
ooo +--<[email protected]>--<xmpp:[email protected]>-----ooO--(_)--Ooo--

2015-11-18 15:34:47

by Vlastimil Babka

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On 11/09/2015 08:23 AM, Joonsoo Kim wrote:
> CMA allocation should be guaranteed to succeed by definition, but,
> unfortunately, it would be failed sometimes. It is hard to track down
> the problem, because it is related to page reference manipulation and
> we don't have any facility to analyze it.

Reminds me of PeterZ's VM_PINNED patchset. What happened to it?
https://lwn.net/Articles/600502/

> This patch adds tracepoints to track down page reference manipulation.
> With it, we can find exact reason of failure and can fix the problem.
> Following is an example of tracepoint output.
>
> <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
> <...>-9018 [004] 92.678378: kernel_stack:
> => get_page_from_freelist (ffffffff81176659)
> => __alloc_pages_nodemask (ffffffff81176d22)
> => alloc_pages_vma (ffffffff811bf675)
> => handle_mm_fault (ffffffff8119e693)
> => __do_page_fault (ffffffff810631ea)
> => trace_do_page_fault (ffffffff81063543)
> => do_async_page_fault (ffffffff8105c40a)
> => async_page_fault (ffffffff817581d8)
> [snip]
> <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
> [snip]
> ...
> ...
> <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
> [snip]
> <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
> => release_pages (ffffffff8117c9e4)
> => free_pages_and_swap_cache (ffffffff811b0697)
> => tlb_flush_mmu_free (ffffffff81199616)
> => tlb_finish_mmu (ffffffff8119a62c)
> => exit_mmap (ffffffff811a53f7)
> => mmput (ffffffff81073f47)
> => do_exit (ffffffff810794e9)
> => do_group_exit (ffffffff81079def)
> => SyS_exit_group (ffffffff81079e74)
> => entry_SYSCALL_64_fastpath (ffffffff817560b6)
>
> This output shows that problem comes from exit path. In exit path,
> to improve performance, pages are not freed immediately. They are gathered
> and processed by batch. During this process, migration cannot be possible
> and CMA allocation is failed. This problem is hard to find without this
> page reference tracepoint facility.

Yeah, but when you realized it was this problem, what was the fix? Probably not
removing batching from the exit path? Shouldn't CMA in this case just try waiting
for the pins to go away, which would eventually happen? And for long-term pins,
VM_PINNED would make sure the pages are migrated away from CMA pageblocks first?

So I'm worried that this is quite a nontrivial change for a very specific use case.

> Enabling this feature bloat kernel text 20 KB in my configuration.

It's not just that, see below.

[...]


> static inline int page_ref_freeze(struct page *page, int count)
> {
> - return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> + int ret = likely(atomic_cmpxchg(&page->_count, count, 0) == count);

The "likely" mean makes no sense anymore, doe it?

> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> index 957d3da..71d2399 100644
> --- a/mm/Kconfig.debug
> +++ b/mm/Kconfig.debug
> @@ -28,3 +28,7 @@ config DEBUG_PAGEALLOC
>
> config PAGE_POISONING
> bool
> +
> +config DEBUG_PAGE_REF
> + bool "Enable tracepoint to track down page reference manipulation"

So you should probably state the costs: the extra memory, and also that
all the page ref manipulations are now turned into function calls, even if the
tracepoints are disabled. Patch 1 didn't change that many call sites, so maybe it
would be feasible to have the tracepoints inline, where being disabled has
near-zero overhead?

2015-11-19 06:50:12

by Minchan Kim

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Wed, Nov 18, 2015 at 04:34:30PM +0100, Vlastimil Babka wrote:
> On 11/09/2015 08:23 AM, Joonsoo Kim wrote:
> > CMA allocation should be guaranteed to succeed by definition, but,
> > unfortunately, it would be failed sometimes. It is hard to track down
> > the problem, because it is related to page reference manipulation and
> > we don't have any facility to analyze it.
>
> Reminds me of the PeterZ's VM_PINNED patchset. What happened to it?
> https://lwn.net/Articles/600502/
>
> > This patch adds tracepoints to track down page reference manipulation.
> > With it, we can find exact reason of failure and can fix the problem.
> > Following is an example of tracepoint output.
> >
> > <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
> > <...>-9018 [004] 92.678378: kernel_stack:
> > => get_page_from_freelist (ffffffff81176659)
> > => __alloc_pages_nodemask (ffffffff81176d22)
> > => alloc_pages_vma (ffffffff811bf675)
> > => handle_mm_fault (ffffffff8119e693)
> > => __do_page_fault (ffffffff810631ea)
> > => trace_do_page_fault (ffffffff81063543)
> > => do_async_page_fault (ffffffff8105c40a)
> > => async_page_fault (ffffffff817581d8)
> > [snip]
> > <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
> > [snip]
> > ...
> > ...
> > <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
> > [snip]
> > <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
> > => release_pages (ffffffff8117c9e4)
> > => free_pages_and_swap_cache (ffffffff811b0697)
> > => tlb_flush_mmu_free (ffffffff81199616)
> > => tlb_finish_mmu (ffffffff8119a62c)
> > => exit_mmap (ffffffff811a53f7)
> > => mmput (ffffffff81073f47)
> > => do_exit (ffffffff810794e9)
> > => do_group_exit (ffffffff81079def)
> > => SyS_exit_group (ffffffff81079e74)
> > => entry_SYSCALL_64_fastpath (ffffffff817560b6)
> >
> > This output shows that problem comes from exit path. In exit path,
> > to improve performance, pages are not freed immediately. They are gathered
> > and processed by batch. During this process, migration cannot be possible
> > and CMA allocation is failed. This problem is hard to find without this
> > page reference tracepoint facility.
>
> Yeah but when you realized it was this problem, what was the fix? Probably not
> remove batching from exit path? Shouldn't CMA in this case just try waiting for
> the pins to go away, which would eventually happen? And for long-term pins,
> VM_PINNED would make sure the pages are migrated away from CMA pageblocks first?
>
> So I'm worried that this is quite nontrivial change for a very specific usecase.

This patch is not meant to solve the problem but just to expose the culprit.

To use VM_PINNED, we first need to know where the long-term pins are.
There are obviously clear places that would be the first targets if we use
VM_PINNED, but others are vague. For the vague places, this patch will
help find them. Even if we don't use VM_PINNED, this patch will expose
the current obstacles, which can help in understanding the current problems.

2015-11-20 06:33:19

by Joonsoo Kim

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

Ccing Steven.

Hello,

On Wed, Nov 18, 2015 at 04:34:30PM +0100, Vlastimil Babka wrote:
> On 11/09/2015 08:23 AM, Joonsoo Kim wrote:
> > CMA allocation should be guaranteed to succeed by definition, but,
> > unfortunately, it would be failed sometimes. It is hard to track down
> > the problem, because it is related to page reference manipulation and
> > we don't have any facility to analyze it.
>
> Reminds me of the PeterZ's VM_PINNED patchset. What happened to it?
> https://lwn.net/Articles/600502/
>
> > This patch adds tracepoints to track down page reference manipulation.
> > With it, we can find exact reason of failure and can fix the problem.
> > Following is an example of tracepoint output.
> >
> > <...>-9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
> > <...>-9018 [004] 92.678378: kernel_stack:
> > => get_page_from_freelist (ffffffff81176659)
> > => __alloc_pages_nodemask (ffffffff81176d22)
> > => alloc_pages_vma (ffffffff811bf675)
> > => handle_mm_fault (ffffffff8119e693)
> > => __do_page_fault (ffffffff810631ea)
> > => trace_do_page_fault (ffffffff81063543)
> > => do_async_page_fault (ffffffff8105c40a)
> > => async_page_fault (ffffffff817581d8)
> > [snip]
> > <...>-9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
> > [snip]
> > ...
> > ...
> > <...>-9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
> > [snip]
> > <...>-9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
> > => release_pages (ffffffff8117c9e4)
> > => free_pages_and_swap_cache (ffffffff811b0697)
> > => tlb_flush_mmu_free (ffffffff81199616)
> > => tlb_finish_mmu (ffffffff8119a62c)
> > => exit_mmap (ffffffff811a53f7)
> > => mmput (ffffffff81073f47)
> > => do_exit (ffffffff810794e9)
> > => do_group_exit (ffffffff81079def)
> > => SyS_exit_group (ffffffff81079e74)
> > => entry_SYSCALL_64_fastpath (ffffffff817560b6)
> >
> > This output shows that problem comes from exit path. In exit path,
> > to improve performance, pages are not freed immediately. They are gathered
> > and processed by batch. During this process, migration cannot be possible
> > and CMA allocation is failed. This problem is hard to find without this
> > page reference tracepoint facility.
>
> Yeah but when you realized it was this problem, what was the fix? Probably not
> remove batching from exit path? Shouldn't CMA in this case just try waiting for
> the pins to go away, which would eventually happen? And for long-term pins,
> VM_PINNED would make sure the pages are migrated away from CMA pageblocks first?
>
> So I'm worried that this is quite nontrivial change for a very specific usecase.

Minchan already answered it. Thanks, Minchan.
This can also be used for memory offlining and compaction.

> > Enabling this feature bloat kernel text 20 KB in my configuration.
>
> It's not just that, see below.
>
> [...]
>
>
> > static inline int page_ref_freeze(struct page *page, int count)
> > {
> > - return likely(atomic_cmpxchg(&page->_count, count, 0) == count);
> > + int ret = likely(atomic_cmpxchg(&page->_count, count, 0) == count);
>
> The "likely" mean makes no sense anymore, doe it?

I don't know exactly how the compiler uses this hint, but it will have the
same effect as before. Why do you think it makes no sense now?

>
> > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > index 957d3da..71d2399 100644
> > --- a/mm/Kconfig.debug
> > +++ b/mm/Kconfig.debug
> > @@ -28,3 +28,7 @@ config DEBUG_PAGEALLOC
> >
> > config PAGE_POISONING
> > bool
> > +
> > +config DEBUG_PAGE_REF
> > + bool "Enable tracepoint to track down page reference manipulation"
>
> So you should probably state the costs. Which is the extra memory, and also that
> all the page ref manipulations are now turned to function calls, even if the
> tracepoints are disabled.

Yes, will do.

> Patch 1 didn't change that many callsites, so maybe it
> would be feasible to have the tracepoints inline, where being disabled has
> near-zero overhead?

Hmm... Although I changed just one get_page() in the previous patch, it is
used in many places. So it's not feasible to inline the tracepoints into
each call site directly.

In fact, I tried to inline the tracepoint into the wrapper function directly,
but it fails due to (maybe) a header file dependency.

Steven, is it possible to add a tracepoint to an inlined function such as
get_page() in include/linux/mm.h?

Thanks.

2015-11-20 16:42:33

by Steven Rostedt

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Fri, 20 Nov 2015 15:33:25 +0900
Joonsoo Kim <[email protected]> wrote:


> Steven, is it possible to add tracepoint to inlined fucntion such as
> get_page() in include/linux/mm.h?

I highly recommend against it. The tracepoint code adds a bit of bloat,
and if you inline it, you add that bloat to every use case. Also, it
makes things difficult if this file is included in other files that
create tracepoints, which I could easily imagine would be the case.
That is, if a tracepoint file in include/trace/events/foo.h needs to
include include/linux/mm.h, then when you do CREATE_TRACE_POINTS for foo.h,
it will create the tracepoints for mm.h as well, since to use tracepoints
there you would need to include include/trace/events/mm.h (or whatever its
name is), and that has caused issues in the past.

Now, if you still want to have these tracepoints in the inlined
function, it would be best to add a new file, mm_trace.h or something,
that would include it, and then have only the .c files include that
directly. Do not put it into mm.h, as that would definitely cause
tracepoint include troubles.
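
For illustration, a minimal sketch of that layout could look like the following
(the mm_trace.h file name and the mm_trace_get_page() helper are hypothetical,
not part of the posted patches; trace_page_ref_mod() is the event added in
patch 2):

	/* include/linux/mm_trace.h -- hypothetical sketch; included only by the
	 * .c files that need it, never by include/linux/mm.h itself, so other
	 * trace headers that include mm.h are unaffected by CREATE_TRACE_POINTS */
	#ifndef _LINUX_MM_TRACE_H
	#define _LINUX_MM_TRACE_H

	#include <linux/mm.h>
	#include <trace/events/page_ref.h>

	/* helper a .c file can call next to get_page(); it reuses the
	 * page_ref_mod event for the single reference that get_page() takes */
	static inline void mm_trace_get_page(struct page *page)
	{
		trace_page_ref_mod(page, 1);
	}

	#endif /* _LINUX_MM_TRACE_H */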

-- Steve

2015-11-23 08:27:49

by Joonsoo Kim

Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> On Fri, 20 Nov 2015 15:33:25 +0900
> Joonsoo Kim <[email protected]> wrote:
>
>
> > Steven, is it possible to add tracepoint to inlined fucntion such as
> > get_page() in include/linux/mm.h?
>
> I highly recommend against it. The tracepoint code adds a bit of bloat,
> and if you inline it, you add that bloat to every use case. Also, it

Is it worse than adding a call to my own stub function in an inlined
function such as get_page()? I implemented it as follows.

get_page()
{
	atomic_inc()
	stub_get_page()
}

stub_get_page() in foo.c
{
	trace_page_ref_get_page()
}

> makes things difficult if this file is included in other files that
> create tracepoints, which I could easily imagine would be the case.
> That is, if a tracepoint file in include/trace/events/foo.h needs to
> include include/linux/mm.h, when you do CREATE_TRACEPOINTS for foo.h,
> it will create tracepoints for mm.h as to use tracepoints there you
> would need to include the include/trace/events/mm.h (or whatever its
> name is), and that has caused issues in the past.
>
> Now, if you still want to have these tracepoints in the inlined
> function, it would be best to add a new file mm_trace.h? or something
> that would include it, and then have only the .c files include that
> directly. Do not put it into mm.h as that would definitely cause
> tracepoint include troubles.

Okay. If I choose this way, I have to change too many places and churn
the code. If the bloat of my implementation is similar to that of this
suggestion, I prefer my implementation.

Thanks for the good advice.

Thanks.

2015-11-23 14:26:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Mon, 23 Nov 2015 17:28:05 +0900
Joonsoo Kim <[email protected]> wrote:

> On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> > On Fri, 20 Nov 2015 15:33:25 +0900
> > Joonsoo Kim <[email protected]> wrote:
> >
> >
> > > Steven, is it possible to add tracepoint to inlined fucntion such as
> > > get_page() in include/linux/mm.h?
> >
> > I highly recommend against it. The tracepoint code adds a bit of bloat,
> > and if you inline it, you add that bloat to every use case. Also, it
>
> Is it worse than adding function call to my own stub function into
> inlined function such as get_page(). I implemented it as following.
>
> get_page()
> {
> atomic_inc()
> stub_get_page()
> }
>
> stub_get_page() in foo.c
> {
> trace_page_ref_get_page()
> }

Now you just slowed down the fast path. But what you could do is:

get_page()
{
	atomic_inc();
	if (trace_page_ref_get_page_enabled())
		stub_get_page();
}

Now that "trace_page_ref_get_page_enabled()" will turn into:

if (static_key_false(&__tracepoint_page_ref_get_page.key)) {

which is a jump label (nop when disabled, a jmp when enabled). That's
less bloat but doesn't solve the include problem. You still need to add
the include, and that will cause havoc with other tracepoints.
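
For reference, that helper is roughly what __DECLARE_TRACE() in
include/linux/tracepoint.h generates for every event "name" (a
simplified sketch, not the exact macro text):

static inline bool trace_name_enabled(void)
{
	return static_key_false(&__tracepoint_name.key);
}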

-- Steve

2015-11-24 02:16:22

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Mon, Nov 23, 2015 at 09:26:04AM -0500, Steven Rostedt wrote:
> On Mon, 23 Nov 2015 17:28:05 +0900
> Joonsoo Kim <[email protected]> wrote:
>
> > On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> > > On Fri, 20 Nov 2015 15:33:25 +0900
> > > Joonsoo Kim <[email protected]> wrote:
> > >
> > >
> > > > Steven, is it possible to add tracepoint to inlined fucntion such as
> > > > get_page() in include/linux/mm.h?
> > >
> > > I highly recommend against it. The tracepoint code adds a bit of bloat,
> > > and if you inline it, you add that bloat to every use case. Also, it
> >
> > Is it worse than adding function call to my own stub function into
> > inlined function such as get_page(). I implemented it as following.
> >
> > get_page()
> > {
> > atomic_inc()
> > stub_get_page()
> > }
> >
> > stub_get_page() in foo.c
> > {
> > trace_page_ref_get_page()
> > }
>
> Now you just slowed down the fast path. But what you could do is:
>
> get_page()
> {
> atomic_inc();
> if (trace_page_ref_get_page_enabled())
> stub_get_page();
> }
>
> Now that "trace_page_ref_get_page_enabled()" will turn into:
>
> if (static_key_false(&__tracepoint_page_ref_get_page.key)) {
>
> which is a jump label (nop when disabled, a jmp when enabled). That's
> less bloat but doesn't solve the include problem. You still need to add
> the include of that will cause havoc with other tracepoints.

Yes, it also has an include dependency problem, so I can't use
trace_page_ref_get_page_enabled() in mm.h. BTW, I tested the following
implementation and it works fine.

extern struct tracepoint __tracepoint_page_ref_get_page;

get_page()
{
	atomic_inc()
	if (static_key_false(&__tracepoint_page_ref_get_page.key))
		stub_get_page()
}

This would not slow down the fast path, although it can't prevent the
bloat. I know that it isn't good coding practice, but these page
reference handling functions have complex include dependencies, so I'm
not sure I can solve it completely. For this special case, can I use
this raw data structure?
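
To show the whole shape of what I have in mind, here is a sketch (the
Kconfig symbol, the stub name and the file placement are illustrative,
not final):

/* include/linux/page_ref.h (illustrative sketch) */
extern struct tracepoint __tracepoint_page_ref_get_page;

#ifdef CONFIG_DEBUG_PAGE_REF
/* open-coded jump label test, since we cannot include the trace header */
#define page_ref_tracepoint_active(t)	static_key_false(&(t).key)
extern void stub_get_page(struct page *page);	/* out of line, in a .c file */
#else
#define page_ref_tracepoint_active(t)	false
static inline void stub_get_page(struct page *page) { }
#endif

static inline void get_page(struct page *page)
{
	atomic_inc(&page->_count);
	/* a nop via jump label while the tracepoint is disabled */
	if (page_ref_tracepoint_active(__tracepoint_page_ref_get_page))
		stub_get_page(page);
}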

Thanks.

2015-12-03 04:16:06

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Tue, Nov 24, 2015 at 10:45:28AM +0900, Joonsoo Kim wrote:
> On Mon, Nov 23, 2015 at 09:26:04AM -0500, Steven Rostedt wrote:
> > On Mon, 23 Nov 2015 17:28:05 +0900
> > Joonsoo Kim <[email protected]> wrote:
> >
> > > On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> > > > On Fri, 20 Nov 2015 15:33:25 +0900
> > > > Joonsoo Kim <[email protected]> wrote:
> > > >
> > > >
> > > > > Steven, is it possible to add tracepoint to inlined fucntion such as
> > > > > get_page() in include/linux/mm.h?
> > > >
> > > > I highly recommend against it. The tracepoint code adds a bit of bloat,
> > > > and if you inline it, you add that bloat to every use case. Also, it
> > >
> > > Is it worse than adding function call to my own stub function into
> > > inlined function such as get_page(). I implemented it as following.
> > >
> > > get_page()
> > > {
> > > atomic_inc()
> > > stub_get_page()
> > > }
> > >
> > > stub_get_page() in foo.c
> > > {
> > > trace_page_ref_get_page()
> > > }
> >
> > Now you just slowed down the fast path. But what you could do is:
> >
> > get_page()
> > {
> > atomic_inc();
> > if (trace_page_ref_get_page_enabled())
> > stub_get_page();
> > }
> >
> > Now that "trace_page_ref_get_page_enabled()" will turn into:
> >
> > if (static_key_false(&__tracepoint_page_ref_get_page.key)) {
> >
> > which is a jump label (nop when disabled, a jmp when enabled). That's
> > less bloat but doesn't solve the include problem. You still need to add
> > the include of that will cause havoc with other tracepoints.
>
> Yes, It also has a include dependency problem so I can't use
> trace_page_ref_get_page_enabled() in mm.h. BTW, I tested following
> implementation and it works fine.
>
> extern struct tracepoint __tracepoint_page_ref_get_page;
>
> get_page()
> {
> atomic_inc()
> if (static_key_false(&__tracepoint_page_ref_get_page.key))
> stub_get_page()
> }
>
> This would not slow down fast path although it can't prevent bloat.
> I know that it isn't good code practice, but, this page reference
> handling functions have complex include dependency so I'm not sure
> I can solve it completely. For this special case, can I use
> this raw data structure?
>

Steven, any comment?

Thanks.

2015-12-09 20:02:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Thu, 3 Dec 2015 13:16:58 +0900
Joonsoo Kim <[email protected]> wrote:

> On Tue, Nov 24, 2015 at 10:45:28AM +0900, Joonsoo Kim wrote:
> > On Mon, Nov 23, 2015 at 09:26:04AM -0500, Steven Rostedt wrote:
> > > On Mon, 23 Nov 2015 17:28:05 +0900
> > > Joonsoo Kim <[email protected]> wrote:
> > >
> > > > On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> > > > > On Fri, 20 Nov 2015 15:33:25 +0900
> > > > > Joonsoo Kim <[email protected]> wrote:
> > > > >
> > > > >
> > > > > > Steven, is it possible to add tracepoint to inlined fucntion such as
> > > > > > get_page() in include/linux/mm.h?
> > > > >
> > > > > I highly recommend against it. The tracepoint code adds a bit of bloat,
> > > > > and if you inline it, you add that bloat to every use case. Also, it
> > > >
> > > > Is it worse than adding function call to my own stub function into
> > > > inlined function such as get_page(). I implemented it as following.
> > > >
> > > > get_page()
> > > > {
> > > > atomic_inc()
> > > > stub_get_page()
> > > > }
> > > >
> > > > stub_get_page() in foo.c
> > > > {
> > > > trace_page_ref_get_page()
> > > > }
> > >
> > > Now you just slowed down the fast path. But what you could do is:
> > >
> > > get_page()
> > > {
> > > atomic_inc();
> > > if (trace_page_ref_get_page_enabled())
> > > stub_get_page();
> > > }
> > >
> > > Now that "trace_page_ref_get_page_enabled()" will turn into:
> > >
> > > if (static_key_false(&__tracepoint_page_ref_get_page.key)) {
> > >
> > > which is a jump label (nop when disabled, a jmp when enabled). That's
> > > less bloat but doesn't solve the include problem. You still need to add
> > > the include of that will cause havoc with other tracepoints.
> >
> > Yes, It also has a include dependency problem so I can't use
> > trace_page_ref_get_page_enabled() in mm.h. BTW, I tested following
> > implementation and it works fine.
> >
> > extern struct tracepoint __tracepoint_page_ref_get_page;
> >
> > get_page()
> > {
> > atomic_inc()
> > if (static_key_false(&__tracepoint_page_ref_get_page.key))
> > stub_get_page()
> > }
> >
> > This would not slow down fast path although it can't prevent bloat.
> > I know that it isn't good code practice, but, this page reference
> > handling functions have complex include dependency so I'm not sure
> > I can solve it completely. For this special case, can I use
> > this raw data structure?
> >
>
> Steven, any comment?

Sorry for the late reply, I was going to reply but then got called off
to do something else, and then forgot about this message :-/

I wanted you to look at what Andi has done here:

http://lkml.kernel.org/r/[email protected]

and here

http://lkml.kernel.org/r/[email protected]

-- Steve

2015-12-10 02:58:41

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Wed, Dec 09, 2015 at 03:01:54PM -0500, Steven Rostedt wrote:
> On Thu, 3 Dec 2015 13:16:58 +0900
> Joonsoo Kim <[email protected]> wrote:
>
> > On Tue, Nov 24, 2015 at 10:45:28AM +0900, Joonsoo Kim wrote:
> > > On Mon, Nov 23, 2015 at 09:26:04AM -0500, Steven Rostedt wrote:
> > > > On Mon, 23 Nov 2015 17:28:05 +0900
> > > > Joonsoo Kim <[email protected]> wrote:
> > > >
> > > > > On Fri, Nov 20, 2015 at 11:42:25AM -0500, Steven Rostedt wrote:
> > > > > > On Fri, 20 Nov 2015 15:33:25 +0900
> > > > > > Joonsoo Kim <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > > Steven, is it possible to add tracepoint to inlined fucntion such as
> > > > > > > get_page() in include/linux/mm.h?
> > > > > >
> > > > > > I highly recommend against it. The tracepoint code adds a bit of bloat,
> > > > > > and if you inline it, you add that bloat to every use case. Also, it
> > > > >
> > > > > Is it worse than adding function call to my own stub function into
> > > > > inlined function such as get_page(). I implemented it as following.
> > > > >
> > > > > get_page()
> > > > > {
> > > > > atomic_inc()
> > > > > stub_get_page()
> > > > > }
> > > > >
> > > > > stub_get_page() in foo.c
> > > > > {
> > > > > trace_page_ref_get_page()
> > > > > }
> > > >
> > > > Now you just slowed down the fast path. But what you could do is:
> > > >
> > > > get_page()
> > > > {
> > > > atomic_inc();
> > > > if (trace_page_ref_get_page_enabled())
> > > > stub_get_page();
> > > > }
> > > >
> > > > Now that "trace_page_ref_get_page_enabled()" will turn into:
> > > >
> > > > if (static_key_false(&__tracepoint_page_ref_get_page.key)) {
> > > >
> > > > which is a jump label (nop when disabled, a jmp when enabled). That's
> > > > less bloat but doesn't solve the include problem. You still need to add
> > > > the include of that will cause havoc with other tracepoints.
> > >
> > > Yes, It also has a include dependency problem so I can't use
> > > trace_page_ref_get_page_enabled() in mm.h. BTW, I tested following
> > > implementation and it works fine.
> > >
> > > extern struct tracepoint __tracepoint_page_ref_get_page;
> > >
> > > get_page()
> > > {
> > > atomic_inc()
> > > if (static_key_false(&__tracepoint_page_ref_get_page.key))
> > > stub_get_page()
> > > }
> > >
> > > This would not slow down fast path although it can't prevent bloat.
> > > I know that it isn't good code practice, but, this page reference
> > > handling functions have complex include dependency so I'm not sure
> > > I can solve it completely. For this special case, can I use
> > > this raw data structure?
> > >
> >
> > Steven, any comment?
>
> Sorry for the later reply, I was going to reply but then got called off
> to do something else, and then forgot about this message :-/

No problem. :)

>
> I wanted you to look at what Andi has done here:
>
> http://lkml.kernel.org/r/[email protected]
>
> and here
>
> http://lkml.kernel.org/r/[email protected]

Wow... They look like what I'm looking for. Nice!
Thanks for the pointer!

I have one more question about trace-cmd.
'trace-cmd report' shows time-sorted output, even for stack traces. See
the following example.

trace-cmd-6338 [003] 54.046508: page_ref_mod: ...
trace-cmd-6583 [007] 54.046509: page_ref_mod: ...
trace-cmd-6338 [003] 54.046515: kernel_stack: <stack trace>
=> do_wp_page (ffffffff811a0c6f)
=> handle_mm_fault (ffffffff811a34e2)
=> __do_page_fault (ffffffff810632da)
=> trace_do_page_fault (ffffffff81063633)
=> do_async_page_fault (ffffffff8105c3ea)
=> async_page_fault (ffffffff817733f8)
trace-cmd-6583 [007] 54.046515: kernel_stack: <stack trace>
=> do_wp_page (ffffffff811a0c6f)
=> handle_mm_fault (ffffffff811a34e2)
...

Output of cpus 3 and 7 is interleaved and it's not easy to analyze.

I think that it'd be better not to sort stack traces. What do you
think about it? Could you fix it, please?

Thanks.

2015-12-10 03:37:08

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Thu, 10 Dec 2015 11:50:15 +0900
Joonsoo Kim <[email protected]> wrote:

> Output of cpu 3, 7 are mixed and it's not easy to analyze it.
>
> I think that it'd be better not to sort stack trace. How do
> you think about it? Could you fix it, please?

It may not be that easy to fix because of the sorting algorithm. That
would require looking ahead one more event each time and then checking
if it's a stack trace. I may look at it and see if I can come up with
something that's not too invasive to the algorithm.

That said, for now you can use the --cpu option. I'm not sure I ever
documented it, as it was originally added for debugging, but I use it
enough that it may be worthwhile to officially support it.

trace-cmd report --cpu 3

will show you just cpu 3 and nothing else, which is what I use a lot.

But doing the stack trace thing may be something to fix as well. I'll
see what I can do, but no guarantees.

-- Steve

2015-12-10 04:20:27

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm/page_ref: add tracepoint to track down page reference manipulation

On Wed, Dec 09, 2015 at 10:36:48PM -0500, Steven Rostedt wrote:
> On Thu, 10 Dec 2015 11:50:15 +0900
> Joonsoo Kim <[email protected]> wrote:
>
> > Output of cpu 3, 7 are mixed and it's not easy to analyze it.
> >
> > I think that it'd be better not to sort stack trace. How do
> > you think about it? Could you fix it, please?
>
> It may not be that easy to fix because of the sorting algorithm. That
> would require looking going ahead one more event each time and then
> checking if its a stacktrace. I may look at it and see if I can come up
> with something that's not too invasive in the algorithms.

Okay.

> That said, for now you can use the --cpu option. I'm not sure I ever
> documented it as it was originally added for debugging, but I use it
> enough that it may be worth while to officially support it.
>
> trace-cmd report --cpu 3
>
> Will show you just cpu 3 and nothing else. Which is what I use a lot.

Thanks for the input. It works, but it's not sufficient for me.
Page references are manipulated by multiple cpus, so it's better to
analyze a unified output.

>
> But doing the stack trace thing may be something to fix as well. I'll
> see what I can do, but no guarantees.

Okay. No hurry. :)
trace-cmd is excellent and works well for me as it is.

Thanks.