2020-05-28 11:04:49

by Alex Shi

Subject: [PATCH v11 00/16] per memcg lru lock

This is a new version, based on linux-next.

Johannes Weiner has suggested:
"So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only serialize list operations?
..."

With the new memcg charge path and this solution, we can isolate LRU
pages for exclusive access in compaction, page migration, reclaim,
memcg move_account, huge page split and other scenarios, while keeping
the pages' memcg stable. That makes it possible to change the per-node
lru locking to per-memcg lru locking. As for the pagevec_lru_move_fn
functions, it is safe to let pages remain on the lru list; the lru lock
can still guard them for list integrity.
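
To make the idea concrete, here is a minimal sketch of the isolation
pattern this series converges on, assuming the TestClearPageLRU and
per-memcg lruvec lock patches below are applied (simplified; the real
callers below differ in details):

	/*
	 * Sketch only: isolate one page from its LRU.  The atomic
	 * TestClearPageLRU() makes the isolation exclusive before the
	 * per-memcg, per-node lruvec lock is taken for the list change.
	 */
	static bool isolate_one_page(struct page *page)
	{
		struct lruvec *lruvec;

		get_page(page);			/* pin the page first */
		if (!TestClearPageLRU(page)) {	/* lost the race to another isolator */
			put_page(page);
			return false;
		}

		/* PageLRU is cleared, so the page's lruvec is now stable */
		lruvec = lock_page_lruvec_irq(page);
		del_page_from_lru_list(page, lruvec, page_lru(page));
		unlock_page_lruvec_irq(lruvec);
		return true;
	}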

The patchset includes 3 parts:
1, some code cleanup and minor optimization as preparation.
2, use TestClearPageLRU as the precondition for page isolation.
3, replace the per-node lru_lock with a per-memcg, per-node lru_lock.

The 3rd part moves the per-node lru_lock into the lruvec, which brings
a lru_lock for each memcg on each node. So on a large machine, each
memcg no longer has to suffer from per-node pgdat->lru_lock contention;
they can go fast with their own lru_lock.
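
Concretely, patch 12/16 below adds the lock to struct lruvec and
introduces a small locking API around it; the declarations, copied here
in condensed form from that patch, look like:

	struct lruvec *lock_page_lruvec(struct page *page);
	struct lruvec *lock_page_lruvec_irq(struct page *page);
	struct lruvec *lock_page_lruvec_irqsave(struct page *page,
						unsigned long *flags);

	void unlock_page_lruvec(struct lruvec *lruvec);
	void unlock_page_lruvec_irq(struct lruvec *lruvec);
	void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
						unsigned long flags);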

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this patchset, the readtwice performance increased by about 80%
with concurrent containers.

Thanks to Hugh Dickins and Konstantin Khlebnikov, who both brought up
this idea 8 years ago, and to the others who gave comments as well:
Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox etc.

Thanks for the testing support from Intel 0day, Rong Chen, Fengguang Wu,
and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!


Alex Shi (14):
mm/vmscan: remove unnecessary lruvec adding
mm/page_idle: no unlikely double check for idle page counting
mm/compaction: correct the comments of compact_defer_shift
mm/compaction: rename compact_deferred as compact_should_defer
mm/thp: move lru_add_page_tail func to huge_memory.c
mm/thp: clean up lru_add_page_tail
mm/thp: narrow lru locking
mm/memcg: add debug checking in lock_page_memcg
mm/lru: introduce TestClearPageLRU
mm/compaction: do page isolation first in compaction
mm/mlock: reorder isolation sequence during munlock
mm/lru: replace pgdat lru_lock with lruvec lock
mm/lru: introduce the relock_page_lruvec function
mm/pgdat: remove pgdat lru_lock

Hugh Dickins (2):
mm/vmscan: use relock for move_pages_to_lru
mm/lru: revise the comments of lru_lock

Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +-
Documentation/admin-guide/cgroup-v1/memory.rst | 8 +-
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 +--
include/linux/compaction.h | 4 +-
include/linux/memcontrol.h | 92 +++++++++++
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 6 +-
include/linux/page-flags.h | 1 +
include/linux/swap.h | 4 +-
include/trace/events/compaction.h | 2 +-
mm/compaction.c | 104 ++++++++-----
mm/filemap.c | 4 +-
mm/huge_memory.c | 51 +++++--
mm/memcontrol.c | 87 ++++++++++-
mm/mlock.c | 93 ++++++------
mm/mmzone.c | 1 +
mm/page_alloc.c | 1 -
mm/page_idle.c | 8 -
mm/rmap.c | 2 +-
mm/swap.c | 112 ++++----------
mm/swap_state.c | 6 +-
mm/vmscan.c | 168 +++++++++++----------
mm/workingset.c | 4 +-
24 files changed, 487 insertions(+), 312 deletions(-)

--
1.8.3.1


2020-05-28 11:04:49

by Alex Shi

Subject: [PATCH v11 14/16] mm/vmscan: use relock for move_pages_to_lru

From: Hugh Dickins <[email protected]>

Use the relock function to replace the open-coded relocking, and try to
save a few lock/unlock cycles.
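
For reference, the helper used here is added by patch 13/16 later in
this series; condensed, it reads roughly:

	/* Don't lock again iff page's lruvec locked */
	static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
						struct lruvec *locked_lruvec)
	{
		struct lruvec *lruvec;

		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		if (likely(locked_lruvec == lruvec))
			return lruvec;		/* same lruvec: keep the lock */

		if (unlikely(locked_lruvec))
			unlock_page_lruvec_irq(locked_lruvec);

		return lock_page_lruvec_irq(page);
	}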

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 17 ++++++-----------
1 file changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a0d4ac71558..672e7304f211 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1854,15 +1854,15 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
enum lru_list lru;

while (!list_empty(list)) {
- struct lruvec *new_lruvec = NULL;
-
page = lru_to_page(list);
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ if (lruvec) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
+ }
putback_lru_page(page);
- spin_lock_irq(&lruvec->lru_lock);
continue;
}

@@ -1876,12 +1876,7 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
* list_add(&page->lru,)
* list_add(&page->lru,) //corrupt
*/
- new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (new_lruvec != lruvec) {
- if (lruvec)
- spin_unlock_irq(&lruvec->lru_lock);
- lruvec = lock_page_lruvec_irq(page);
- }
+ lruvec = relock_page_lruvec_irq(page, lruvec);
SetPageLRU(page);

if (unlikely(put_page_testzero(page))) {
@@ -1890,8 +1885,8 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,

if (unlikely(PageCompound(page))) {
spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
destroy_compound_page(page);
- spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
continue;
--
1.8.3.1

2020-05-28 11:05:02

by Alex Shi

Subject: [PATCH v11 04/16] mm/compaction: rename compact_deferred as compact_should_defer

compaction_deferred() is only a check that suggests whether to defer;
the actual deferring is done in defer_compaction(), not here. So rename
it to avoid confusion.
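
In other words, a hedged sketch of the intended contract (here
"compaction_failed" is only an illustrative placeholder, not a real
variable):

	if (compaction_should_defer(zone, order))	/* read-only suggestion check */
		return COMPACT_DEFERRED;

	/* ... run compaction ... */

	if (compaction_failed)
		defer_compaction(zone, order);		/* defer state is recorded here */
	else
		compaction_defer_reset(zone, order, true);	/* success clears it */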

Signed-off-by: Alex Shi <[email protected]>
Cc: Steven Rostedt <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Mike Kravetz <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/compaction.h | 4 ++--
include/trace/events/compaction.h | 2 +-
mm/compaction.c | 8 ++++----
3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 3ed2f22b588a..9626f057b9f2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -100,7 +100,7 @@ extern enum compact_result compaction_suitable(struct zone *zone, int order,
unsigned int alloc_flags, int highest_zoneidx);

extern void defer_compaction(struct zone *zone, int order);
-extern bool compaction_deferred(struct zone *zone, int order);
+extern bool compaction_should_defer(struct zone *zone, int order);
extern void compaction_defer_reset(struct zone *zone, int order,
bool alloc_success);
extern bool compaction_restarting(struct zone *zone, int order);
@@ -199,7 +199,7 @@ static inline void defer_compaction(struct zone *zone, int order)
{
}

-static inline bool compaction_deferred(struct zone *zone, int order)
+static inline bool compaction_should_defer(struct zone *zone, int order)
{
return true;
}
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 54e5bf081171..33633c71df04 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -274,7 +274,7 @@
1UL << __entry->defer_shift)
);

-DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_deferred,
+DEFINE_EVENT(mm_compaction_defer_template, mm_compaction_should_defer,

TP_PROTO(struct zone *zone, int order),

diff --git a/mm/compaction.c b/mm/compaction.c
index 38cdf392837b..c359772dbfcc 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -154,7 +154,7 @@ void defer_compaction(struct zone *zone, int order)
}

/* Returns true if compaction should be skipped this time */
-bool compaction_deferred(struct zone *zone, int order)
+bool compaction_should_defer(struct zone *zone, int order)
{
unsigned long defer_limit = 1UL << zone->compact_defer_shift;

@@ -168,7 +168,7 @@ bool compaction_deferred(struct zone *zone, int order)
if (zone->compact_considered >= defer_limit)
return false;

- trace_mm_compaction_deferred(zone, order);
+ trace_mm_compaction_should_defer(zone, order);

return true;
}
@@ -2379,7 +2379,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
enum compact_result status;

if (prio > MIN_COMPACT_PRIORITY
- && compaction_deferred(zone, order)) {
+ && compaction_should_defer(zone, order)) {
rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
continue;
}
@@ -2563,7 +2563,7 @@ static void kcompactd_do_work(pg_data_t *pgdat)
if (!populated_zone(zone))
continue;

- if (compaction_deferred(zone, cc.order))
+ if (compaction_should_defer(zone, cc.order))
continue;

if (compaction_suitable(zone, cc.order, 0, zoneid) !=
--
1.8.3.1

2020-05-28 11:05:06

by Alex Shi

Subject: [PATCH v11 02/16] mm/page_idle: no unlikely double check for idle page counting

As the function comment mentions, missing a few isolated pages can be
tolerated. So why not go further and drop the unlikely double check?
That won't cause more idle pages, but it does reduce lock contention.

This is also a preparation for the later page isolation changes.
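
After the removal, page_idle_get_page() is roughly just the following
(reconstructed from the hunk below plus its unchanged context, shown
here only for illustration):

	static struct page *page_idle_get_page(unsigned long pfn)
	{
		struct page *page;

		if (!pfn_valid(pfn))
			return NULL;

		page = pfn_to_page(pfn);
		if (!page || !PageLRU(page) ||
		    !get_page_unless_zero(page))
			return NULL;

		return page;
	}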

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/page_idle.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..914df63948b1 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -31,7 +31,6 @@
static struct page *page_idle_get_page(unsigned long pfn)
{
struct page *page;
- pg_data_t *pgdat;

if (!pfn_valid(pfn))
return NULL;
@@ -41,13 +40,6 @@ static struct page *page_idle_get_page(unsigned long pfn)
!get_page_unless_zero(page))
return NULL;

- pgdat = page_pgdat(page);
- spin_lock_irq(&pgdat->lru_lock);
- if (unlikely(!PageLRU(page))) {
- put_page(page);
- page = NULL;
- }
- spin_unlock_irq(&pgdat->lru_lock);
return page;
}

--
1.8.3.1

2020-05-28 11:05:06

by Alex Shi

Subject: [PATCH v11 01/16] mm/vmscan: remove unnecessary lruvec adding

We don't have to add a freeable page onto the lru list and then remove
it again. This change saves a couple of actions and makes the page
moving clearer.

The SetPageLRU needs to be kept here for list integrity.
Otherwise:
  #0 move_pages_to_lru              #1 release_pages
  if (put_page_testzero())
                                    if !put_page_testzero
                                      !PageLRU //skip lru_lock
                                      list_add(&page->lru,)
  list_add(&page->lru,) //corrupt

[[email protected]: coding style fixes]
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/vmscan.c | 32 +++++++++++++++++++++-----------
1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a482b22fe4e..d856a1545ad6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1855,26 +1855,29 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
while (!list_empty(list)) {
page = lru_to_page(list);
VM_BUG_ON_PAGE(PageLRU(page), page);
+ list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- list_del(&page->lru);
spin_unlock_irq(&pgdat->lru_lock);
putback_lru_page(page);
spin_lock_irq(&pgdat->lru_lock);
continue;
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);

+ /*
+ * The SetPageLRU needs to be kept here for list intergrity.
+ * Otherwise:
+ * #0 mave_pages_to_lru #1 release_pages
+ * if (put_page_testzero())
+ * if !put_page_testzero
+ * !PageLRU //skip lru_lock
+ * list_add(&page->lru,)
+ * list_add(&page->lru,) //corrupt
+ */
SetPageLRU(page);
- lru = page_lru(page);
-
- nr_pages = hpage_nr_pages(page);
- update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
- list_move(&page->lru, &lruvec->lists[lru]);

- if (put_page_testzero(page)) {
+ if (unlikely(put_page_testzero(page))) {
__ClearPageLRU(page);
__ClearPageActive(page);
- del_page_from_lru_list(page, lruvec, lru);

if (unlikely(PageCompound(page))) {
spin_unlock_irq(&pgdat->lru_lock);
@@ -1882,9 +1885,16 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
spin_lock_irq(&pgdat->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
- } else {
- nr_moved += nr_pages;
+ continue;
}
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lru = page_lru(page);
+ nr_pages = hpage_nr_pages(page);
+
+ update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
+ list_add(&page->lru, &lruvec->lists[lru]);
+ nr_moved += nr_pages;
}

/*
--
1.8.3.1

2020-05-28 11:05:12

by Alex Shi

Subject: [PATCH v11 09/16] mm/lru: introduce TestClearPageLRU

Combine the PageLRU check and ClearPageLRU into one function with the
newly introduced TestClearPageLRU. This function will be used as the
page isolation precondition, to prevent concurrent isolations elsewhere.

After this there may be non-PageLRU pages on an lru list, so the BUG
checking needs to be removed accordingly.

As Andrew Morton mentioned, this change dirties the cacheline for a
page that isn't on the LRU. But the loss is acceptable according to
Rong Chen's <[email protected]> report:
https://lkml.org/lkml/2020/3/4/173
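
For reference, the TESTCLEARFLAG(LRU, lru, PF_HEAD) line added below
expands to roughly the following (simplified; the PF_HEAD policy also
redirects the operation to the head page of a compound page):

	static __always_inline int TestClearPageLRU(struct page *page)
	{
		return test_and_clear_bit(PG_lru, &compound_head(page)->flags);
	}

Because test_and_clear_bit() is atomic, only one of several racing
isolators can see it return true, which is what makes it usable as the
isolation precondition.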

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/page-flags.h | 1 +
mm/mlock.c | 3 +--
mm/swap.c | 8 ++------
mm/vmscan.c | 29 +++++++++++++----------------
4 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 222f6f7b2bb3..45a576631a94 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -326,6 +326,7 @@ static inline void page_init_poison(struct page *page, size_t size)
PAGEFLAG(Dirty, dirty, PF_HEAD) TESTSCFLAG(Dirty, dirty, PF_HEAD)
__CLEARPAGEFLAG(Dirty, dirty, PF_HEAD)
PAGEFLAG(LRU, lru, PF_HEAD) __CLEARPAGEFLAG(LRU, lru, PF_HEAD)
+ TESTCLEARFLAG(LRU, lru, PF_HEAD)
PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
TESTCLEARFLAG(Active, active, PF_HEAD)
PAGEFLAG(Workingset, workingset, PF_HEAD)
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..03b3a5d99ad7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -108,13 +108,12 @@ void mlock_vma_page(struct page *page)
*/
static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
{
- if (PageLRU(page)) {
+ if (TestClearPageLRU(page)) {
struct lruvec *lruvec;

lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
if (getpage)
get_page(page);
- ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_lru(page));
return true;
}
diff --git a/mm/swap.c b/mm/swap.c
index ffb4ea7b82b5..2898efc24135 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -59,15 +59,13 @@
*/
static void __page_cache_release(struct page *page)
{
- if (PageLRU(page)) {
+ if (TestClearPageLRU(page)) {
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
unsigned long flags;

spin_lock_irqsave(&pgdat->lru_lock, flags);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
- VM_BUG_ON_PAGE(!PageLRU(page), page);
- __ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
spin_unlock_irqrestore(&pgdat->lru_lock, flags);
}
@@ -827,7 +825,7 @@ void release_pages(struct page **pages, int nr)
continue;
}

- if (PageLRU(page)) {
+ if (TestClearPageLRU(page)) {
struct pglist_data *pgdat = page_pgdat(page);

if (pgdat != locked_pgdat) {
@@ -840,8 +838,6 @@ void release_pages(struct page **pages, int nr)
}

lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
- VM_BUG_ON_PAGE(!PageLRU(page), page);
- __ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d856a1545ad6..8a88a907c19d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1547,16 +1547,16 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
{
int ret = -EINVAL;

- /* Only take pages on the LRU. */
- if (!PageLRU(page))
- return ret;
-
/* Compaction should not handle unevictable pages but CMA can do so */
if (PageUnevictable(page) && !(mode & ISOLATE_UNEVICTABLE))
return ret;

ret = -EBUSY;

+ /* Only take pages on the LRU. */
+ if (!PageLRU(page))
+ return ret;
+
/*
* To minimise LRU disruption, the caller can indicate that it only
* wants to isolate pages it will be able to operate on without
@@ -1670,8 +1670,6 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);

- VM_BUG_ON_PAGE(!PageLRU(page), page);
-
nr_pages = compound_nr(page);
total_scan += nr_pages;

@@ -1768,21 +1766,20 @@ int isolate_lru_page(struct page *page)
VM_BUG_ON_PAGE(!page_count(page), page);
WARN_RATELIMIT(PageTail(page), "trying to isolate tail page");

- if (PageLRU(page)) {
+ get_page(page);
+ if (TestClearPageLRU(page)) {
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
+ int lru = page_lru(page);

- spin_lock_irq(&pgdat->lru_lock);
lruvec = mem_cgroup_page_lruvec(page, pgdat);
- if (PageLRU(page)) {
- int lru = page_lru(page);
- get_page(page);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, lru);
- ret = 0;
- }
+ spin_lock_irq(&pgdat->lru_lock);
+ del_page_from_lru_list(page, lruvec, lru);
spin_unlock_irq(&pgdat->lru_lock);
- }
+ ret = 0;
+ } else
+ put_page(page);
+
return ret;
}

--
1.8.3.1

2020-05-28 11:05:18

by Alex Shi

Subject: [PATCH v11 10/16] mm/compaction: do page isolation first in compaction

Johannes Weiner has suggested:
"So here is a crazy idea that may be worth exploring:

Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
linked list.

Can we make PageLRU atomic and use it to stabilize the lru_lock
instead, and then use the lru_lock only serialize list operations?
..."

Yes, this patch does so in __isolate_lru_page, the core page isolation
function in the compaction and shrinking paths.
With this patch, compaction will only deal with pages that have PageLRU
set and are now isolated, skipping just-allocated pages that have no
LRU bit. And the isolation excludes the other isolations in memcg
move_account, page migration and THP split_huge_page.

As a side effect, PageLRU may be cleared during the shrink_inactive_list
path because of isolation. If so, we can skip that page.
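
The resulting ordering in the compaction isolation path, condensed from
the isolate hunk below:

	if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
		goto isolate_fail;

	/* pin the page before clearing PageLRU -- the release path relies on it */
	if (unlikely(!get_page_unless_zero(page)))
		goto isolate_fail;

	/* the atomic flag clear is what excludes all other isolators */
	if (!TestClearPageLRU(page)) {
		put_page(page);
		goto isolate_fail;
	}

	/* only then take (or keep) the lru lock for the list manipulation */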

Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/swap.h | 2 +-
mm/compaction.c | 33 +++++++++++++++++++++++----------
mm/vmscan.c | 38 ++++++++++++++++++++++----------------
3 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d12ecacce307..8baf0c2928e2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -356,7 +356,7 @@ extern void lru_cache_add_active_or_unevictable(struct page *page,
extern unsigned long zone_reclaimable_pages(struct zone *zone);
extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *mask);
-extern int __isolate_lru_page(struct page *page, isolate_mode_t mode);
+extern int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode);
extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
diff --git a/mm/compaction.c b/mm/compaction.c
index c359772dbfcc..c36d832b2a84 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -954,6 +954,23 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (!(cc->gfp_mask & __GFP_FS) && page_mapping(page))
goto isolate_fail;

+ if (__isolate_lru_page_prepare(page, isolate_mode) != 0)
+ goto isolate_fail;
+
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto isolate_fail;
+
+ /* Try isolate the page */
+ if (!TestClearPageLRU(page)) {
+ put_page(page);
+ goto isolate_fail;
+ }
+
/* If we already hold the lock, we can skip some rechecking */
if (!locked) {
locked = compact_lock_irqsave(&pgdat->lru_lock,
@@ -966,10 +983,6 @@ static bool too_many_isolated(pg_data_t *pgdat)
goto isolate_abort;
}

- /* Recheck PageLRU and PageCompound under lock */
- if (!PageLRU(page))
- goto isolate_fail;
-
/*
* Page become compound since the non-locked check,
* and it's on LRU. It can only be a THP so the order
@@ -980,18 +993,18 @@ static bool too_many_isolated(pg_data_t *pgdat)
goto isolate_fail;
}

- /* Recheck page extra references under lock */
- if (page_count(page) > page_mapcount(page) +
+ /*
+ * Recheck page extra references under lock. The
+ * extra page_count comes from above
+ * get_page_unless_zero().
+ */
+ if (page_count(page) > page_mapcount(page) + 1 +
(!PageAnon(page) || PageSwapCache(page)))
goto isolate_fail;
}

lruvec = mem_cgroup_page_lruvec(page, pgdat);

- /* Try isolate the page */
- if (__isolate_lru_page(page, isolate_mode) != 0)
- goto isolate_fail;
-
/* The whole page is taken off the LRU; skip the tail pages. */
if (PageCompound(page))
low_pfn += compound_nr(page) - 1;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8a88a907c19d..df0765203473 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1543,7 +1543,7 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
*
* returns 0 on success, -ve errno on failure.
*/
-int __isolate_lru_page(struct page *page, isolate_mode_t mode)
+int __isolate_lru_page_prepare(struct page *page, isolate_mode_t mode)
{
int ret = -EINVAL;

@@ -1597,20 +1597,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
return ret;

- if (likely(get_page_unless_zero(page))) {
- /*
- * Be careful not to clear PageLRU until after we're
- * sure the page is not being freed elsewhere -- the
- * page release code relies on it.
- */
- ClearPageLRU(page);
- ret = 0;
- }
-
- return ret;
+ return 0;
}

-
/*
* Update LRU sizes after isolating pages. The LRU size updates must
* be complete before mem_cgroup_update_lru_size due to a sanity check.
@@ -1690,17 +1679,34 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
* only when the page is being freed somewhere else.
*/
scan += nr_pages;
- switch (__isolate_lru_page(page, mode)) {
+ switch (__isolate_lru_page_prepare(page, mode)) {
case 0:
+ /*
+ * Be careful not to clear PageLRU until after we're
+ * sure the page is not being freed elsewhere -- the
+ * page release code relies on it.
+ */
+ if (unlikely(!get_page_unless_zero(page)))
+ goto busy;
+
+ if (!TestClearPageLRU(page)) {
+ /*
+ * This page may in other isolation path,
+ * but we still hold lru_lock.
+ */
+ put_page(page);
+ goto busy;
+ }
+
nr_taken += nr_pages;
nr_zone_taken[page_zonenum(page)] += nr_pages;
list_move(&page->lru, dst);
break;
-
+busy:
case -EBUSY:
/* else it is being freed elsewhere */
list_move(&page->lru, src);
- continue;
+ break;

default:
BUG();
--
1.8.3.1

2020-05-28 11:05:20

by Alex Shi

Subject: [PATCH v11 11/16] mm/mlock: reorder isolation sequence during munlock

This patch reorders the isolation steps during munlock, moving the lru
lock so that it guards each page, and unfolds the
__munlock_isolate_lru_page function, as preparation for the lru lock
change.

__split_huge_page_refcount no longer exists, but we still have to guard
PageMlocked and PageLRU in __split_huge_page_tail.
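
The reordered flow of munlock_vma_page(), condensed from the hunk below
(the lock here is still pgdat->lru_lock; it becomes the lruvec lock in
patch 12/16):

	get_page(page);				/* extra pin while we juggle flags */
	spin_lock_irq(&pgdat->lru_lock);
	clearlru = TestClearPageLRU(page);	/* try to isolate under the lock */

	if (!TestClearPageMlocked(page)) {
		if (clearlru)
			SetPageLRU(page);	/* not mlocked after all: undo */
		spin_unlock_irq(&pgdat->lru_lock);
		put_page(page);
		return 0;
	}
	/* ... otherwise finish the munlock with or without the LRU isolation ... */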

[[email protected]: found a sleeping function bug ... at mm/rmap.c]
Signed-off-by: Alex Shi <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/mlock.c | 93 ++++++++++++++++++++++++++++++++++----------------------------
1 file changed, 51 insertions(+), 42 deletions(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 03b3a5d99ad7..a0856085c4b7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -103,25 +103,6 @@ void mlock_vma_page(struct page *page)
}

/*
- * Isolate a page from LRU with optional get_page() pin.
- * Assumes lru_lock already held and page already pinned.
- */
-static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
-{
- if (TestClearPageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (getpage)
- get_page(page);
- del_page_from_lru_list(page, lruvec, page_lru(page));
- return true;
- }
-
- return false;
-}
-
-/*
* Finish munlock after successful page isolation
*
* Page must be locked. This is a wrapper for try_to_munlock()
@@ -181,6 +162,7 @@ static void __munlock_isolation_failed(struct page *page)
unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
+ bool clearlru = false;
pg_data_t *pgdat = page_pgdat(page);

/* For try_to_munlock() and to serialize with page migration */
@@ -189,32 +171,42 @@ unsigned int munlock_vma_page(struct page *page)
VM_BUG_ON_PAGE(PageTail(page), page);

/*
- * Serialize with any parallel __split_huge_page_refcount() which
+ * Serialize with any parallel __split_huge_page_tail() which
* might otherwise copy PageMlocked to part of the tail pages before
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
+ get_page(page);
spin_lock_irq(&pgdat->lru_lock);
+ clearlru = TestClearPageLRU(page);

if (!TestClearPageMlocked(page)) {
- /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
- nr_pages = 1;
- goto unlock_out;
+ if (clearlru)
+ SetPageLRU(page);
+ /*
+ * Potentially, PTE-mapped THP: do not skip the rest PTEs
+ * Reuse lock as memory barrier for release_pages racing.
+ */
+ spin_unlock_irq(&pgdat->lru_lock);
+ put_page(page);
+ return 0;
}

nr_pages = hpage_nr_pages(page);
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

- if (__munlock_isolate_lru_page(page, true)) {
+ if (clearlru) {
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ del_page_from_lru_list(page, lruvec, page_lru(page));
spin_unlock_irq(&pgdat->lru_lock);
__munlock_isolated_page(page);
- goto out;
+ } else {
+ spin_unlock_irq(&pgdat->lru_lock);
+ put_page(page);
+ __munlock_isolation_failed(page);
}
- __munlock_isolation_failed(page);
-
-unlock_out:
- spin_unlock_irq(&pgdat->lru_lock);

-out:
return nr_pages - 1;
}

@@ -297,34 +289,51 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
pagevec_init(&pvec_putback);

/* Phase 1: page isolation */
- spin_lock_irq(&zone->zone_pgdat->lru_lock);
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
+ struct lruvec *lruvec;
+ bool clearlru;

- if (TestClearPageMlocked(page)) {
- /*
- * We already have pin from follow_page_mask()
- * so we can spare the get_page() here.
- */
- if (__munlock_isolate_lru_page(page, false))
- continue;
- else
- __munlock_isolation_failed(page);
- } else {
+ clearlru = TestClearPageLRU(page);
+ spin_lock_irq(&zone->zone_pgdat->lru_lock);
+
+ if (!TestClearPageMlocked(page)) {
delta_munlocked++;
+ if (clearlru)
+ SetPageLRU(page);
+ goto putback;
+ }
+
+ if (!clearlru) {
+ __munlock_isolation_failed(page);
+ goto putback;
}

/*
+ * Isolate this page.
+ * We already have pin from follow_page_mask()
+ * so we can spare the get_page() here.
+ */
+ lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ del_page_from_lru_list(page, lruvec, page_lru(page));
+ spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+ continue;
+
+ /*
* We won't be munlocking this page in the next phase
* but we still need to release the follow_page_mask()
* pin. We cannot do it under lru_lock however. If it's
* the last pin, __page_cache_release() would deadlock.
*/
+putback:
+ spin_unlock_irq(&zone->zone_pgdat->lru_lock);
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
}
+ /* tempary disable irq, will remove later */
+ local_irq_disable();
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);
+ local_irq_enable();

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
--
1.8.3.1

2020-05-28 11:05:21

by Alex Shi

Subject: [PATCH v11 13/16] mm/lru: introduce the relock_page_lruvec function

Use this new function to replace the same open-coded relocking repeated
in several places.
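
A condensed caller pattern, taken from the pagevec_lru_move_fn() hunk
below (the function's pvec, move_fn and arg parameters are elided),
shows the intent: the lock is only cycled when consecutive pages belong
to different lruvecs:

	struct lruvec *lruvec = NULL;
	unsigned long flags = 0;
	int i;

	for (i = 0; i < pagevec_count(pvec); i++) {
		struct page *page = pvec->pages[i];

		lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
		(*move_fn)(page, lruvec, arg);
	}
	if (lruvec)
		unlock_page_lruvec_irqrestore(lruvec, flags);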

Signed-off-by: Alex Shi <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 36 ++++++++++++++++++++++++++++++++++++
mm/mlock.c | 9 +--------
mm/swap.c | 24 ++++++------------------
mm/vmscan.c | 8 +-------
4 files changed, 44 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a4601169bf7d..ed09bf53c70f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1313,6 +1313,42 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}

+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
+ struct lruvec *locked_lruvec)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (likely(locked_lruvec == lruvec))
+ return lruvec;
+
+ if (unlikely(locked_lruvec))
+ unlock_page_lruvec_irq(locked_lruvec);
+
+ return lock_page_lruvec_irq(page);
+}
+
+/* Don't lock again iff page's lruvec locked */
+static inline struct lruvec *relock_page_lruvec_irqsave(struct page *page,
+ struct lruvec *locked_lruvec, unsigned long *flags)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
+ if (likely(locked_lruvec == lruvec))
+ return lruvec;
+
+ if (unlikely(locked_lruvec))
+ unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
+
+ return lock_page_lruvec_irqsave(page, flags);
+}
+
#ifdef CONFIG_CGROUP_WRITEBACK

struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
diff --git a/mm/mlock.c b/mm/mlock.c
index c1ef4ac7a744..5b79757e5d02 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -289,17 +289,10 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
/* Phase 1: page isolation */
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
- struct lruvec *new_lruvec;
bool clearlru;

clearlru = TestClearPageLRU(page);
-
- new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (new_lruvec != lruvec) {
- if (lruvec)
- unlock_page_lruvec_irq(lruvec);
- lruvec = lock_page_lruvec_irq(page);
- }
+ lruvec = relock_page_lruvec_irq(page, lruvec);

if (!TestClearPageMlocked(page)) {
delta_munlocked++;
diff --git a/mm/swap.c b/mm/swap.c
index 91ff3d4a7751..bea9497bbda3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -190,15 +190,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct lruvec *new_lruvec;
-
- new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (lruvec != new_lruvec) {
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
- lruvec = lock_page_lruvec_irqsave(page, &flags);
- }

+ lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
(*move_fn)(page, lruvec, arg);
}
if (lruvec)
@@ -821,17 +814,12 @@ void release_pages(struct page **pages, int nr)
}

if (TestClearPageLRU(page)) {
- struct lruvec *new_lruvec;
-
- new_lruvec = mem_cgroup_page_lruvec(page,
- page_pgdat(page));
- if (new_lruvec != lruvec) {
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec,
- flags);
+ struct lruvec *pre_lruvec = lruvec;
+
+ lruvec = relock_page_lruvec_irqsave(page, lruvec,
+ &flags);
+ if (pre_lruvec != lruvec)
lock_batch = 0;
- lruvec = lock_page_lruvec_irqsave(page, &flags);
- }

del_page_from_lru_list(page, lruvec, page_off_lru(page));
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4c30b530876..7a0d4ac71558 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4291,15 +4291,9 @@ void check_move_unevictable_pages(struct pagevec *pvec)

for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct lruvec *new_lruvec;

pgscanned++;
- new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
- if (lruvec != new_lruvec) {
- if (lruvec)
- unlock_page_lruvec_irq(lruvec);
- lruvec = lock_page_lruvec_irq(page);
- }
+ lruvec = relock_page_lruvec_irq(page, lruvec);

if (!PageLRU(page) || !PageUnevictable(page))
continue;
--
1.8.3.1

2020-05-28 11:05:36

by Alex Shi

Subject: [PATCH v11 16/16] mm/lru: revise the comments of lru_lock

From: Hugh Dickins <[email protected]>

Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix
the now-incorrect comments in the code. Also fix some ancient
zone->lru_lock comment errors, etc.

Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Andrey Ryabinin <[email protected]>
Cc: Jann Horn <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +++------------
Documentation/admin-guide/cgroup-v1/memory.rst | 8 ++++----
Documentation/trace/events-kmem.rst | 2 +-
Documentation/vm/unevictable-lru.rst | 22 ++++++++--------------
include/linux/mm_types.h | 2 +-
include/linux/mmzone.h | 2 +-
mm/filemap.c | 4 ++--
mm/memcontrol.c | 2 +-
mm/rmap.c | 2 +-
mm/vmscan.c | 12 ++++++++----
10 files changed, 30 insertions(+), 41 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memcg_test.rst b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
index 3f7115e07b5d..0b9f91589d3d 100644
--- a/Documentation/admin-guide/cgroup-v1/memcg_test.rst
+++ b/Documentation/admin-guide/cgroup-v1/memcg_test.rst
@@ -133,18 +133,9 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y.

8. LRU
======
- Each memcg has its own private LRU. Now, its handling is under global
- VM's control (means that it's handled under global pgdat->lru_lock).
- Almost all routines around memcg's LRU is called by global LRU's
- list management functions under pgdat->lru_lock.
-
- A special function is mem_cgroup_isolate_pages(). This scans
- memcg's private LRU and call __isolate_lru_page() to extract a page
- from LRU.
-
- (By __isolate_lru_page(), the page is removed from both of global and
- private LRU.)
-
+ Each memcg has its own vector of LRUs (inactive anon, active anon,
+ inactive file, active file, unevictable) of pages from each node,
+ each LRU handled under a single lru_lock for that memcg and node.

9. Typical Tests.
=================
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 12757e63b26c..669277c82769 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -292,13 +292,13 @@ When oom event notifier is registered, event will be delivered.

PG_locked.
mm->page_table_lock
- pgdat->lru_lock
- lock_page_cgroup.
+ lruvec->lru_lock
+ lock_page_cgroup.

In many cases, just lock_page_cgroup() is called.

- per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
- pgdat->lru_lock, it has no lock of its own.
+ per-node-per-cgroup LRU (cgroup's private LRU) is just guarded by
+ lruvec->lru_lock, it has no lock of its own.

2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
diff --git a/Documentation/trace/events-kmem.rst b/Documentation/trace/events-kmem.rst
index 555484110e36..68fa75247488 100644
--- a/Documentation/trace/events-kmem.rst
+++ b/Documentation/trace/events-kmem.rst
@@ -69,7 +69,7 @@ When pages are freed in batch, the also mm_page_free_batched is triggered.
Broadly speaking, pages are taken off the LRU lock in bulk and
freed in batch with a page list. Significant amounts of activity here could
indicate that the system is under memory pressure and can also indicate
-contention on the zone->lru_lock.
+contention on the lruvec->lru_lock.

4. Per-CPU Allocator Activity
=============================
diff --git a/Documentation/vm/unevictable-lru.rst b/Documentation/vm/unevictable-lru.rst
index 17d0861b0f1d..0e1490524f53 100644
--- a/Documentation/vm/unevictable-lru.rst
+++ b/Documentation/vm/unevictable-lru.rst
@@ -33,7 +33,7 @@ reclaim in Linux. The problems have been observed at customer sites on large
memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
-main memory will have over 32 million 4k pages in a single zone. When a large
+main memory will have over 32 million 4k pages in a single node. When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable. This can result in a situation where all CPUs are
@@ -55,7 +55,7 @@ unevictable, either by definition or by circumstance, in the future.
The Unevictable Page List
-------------------------

-The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list
+The Unevictable LRU infrastructure consists of an additional, per-node, LRU list
called the "unevictable" list and an associated page flag, PG_unevictable, to
indicate that the page is being managed on the unevictable list.

@@ -84,15 +84,9 @@ The unevictable list does not differentiate between file-backed and anonymous,
swap-backed pages. This differentiation is only important while the pages are,
in fact, evictable.

-The unevictable list benefits from the "arrayification" of the per-zone LRU
+The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.

-The unevictable list does not use the LRU pagevec mechanism. Rather,
-unevictable pages are placed directly on the page's zone's unevictable list
-under the zone lru_lock. This allows us to prevent the stranding of pages on
-the unevictable list when one task has the page isolated from the LRU and other
-tasks are changing the "evictability" state of the page.
-

Memory Control Group Interaction
--------------------------------
@@ -101,8 +95,8 @@ The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by extending the
lru_list enum.

-The memory controller data structure automatically gets a per-zone unevictable
-list as a result of the "arrayification" of the per-zone LRU lists (one per
+The memory controller data structure automatically gets a per-node unevictable
+list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element). The memory controller tracks the movement of pages to
and from the unevictable list.

@@ -196,7 +190,7 @@ for the sake of expediency, to leave a unevictable page on one of the regular
active/inactive LRU lists for vmscan to deal with. vmscan checks for such
pages in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such pages that it encounters: that is, it diverts those pages to the
-unevictable list for the zone being scanned.
+unevictable list for the node being scanned.

There may be situations where a page is mapped into a VM_LOCKED VMA, but the
page is not marked as PG_mlocked. Such pages will make it all the way to
@@ -328,7 +322,7 @@ If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the
page from the LRU, as it is likely on the appropriate active or inactive list
at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will put
back the page - by calling putback_lru_page() - which will notice that the page
-is now mlocked and divert the page to the zone's unevictable list. If
+is now mlocked and divert the page to the node's unevictable list. If
mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle
it later if and when it attempts to reclaim the page.

@@ -603,7 +597,7 @@ Some examples of these unevictable pages on the LRU lists are:
unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate node's unevictable list.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ef6d3aface8a..6f2a61e35deb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -78,7 +78,7 @@ struct page {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
- * pgdat->lru_lock. Sometimes used as a generic list
+ * lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 42e646f7f30d..1df5cd06da04 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -115,7 +115,7 @@ static inline bool free_area_empty(struct free_area *area, int migratetype)
struct pglist_data;

/*
- * zone->lock and the zone lru_lock are two of the hottest locks in the kernel.
+ * zone->lock and the lru_lock are two of the hottest locks in the kernel.
* So add a wild amount of padding here to ensure that they fall into separate
* cachelines. There are very few zone structures in the machine, so space
* consumption is not a concern here.
diff --git a/mm/filemap.c b/mm/filemap.c
index 5fda0ed6ee19..147736acd6f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -101,8 +101,8 @@
* ->swap_lock (try_to_unmap_one)
* ->private_lock (try_to_unmap_one)
* ->i_pages lock (try_to_unmap_one)
- * ->pgdat->lru_lock (follow_page->mark_page_accessed)
- * ->pgdat->lru_lock (check_pte_range->isolate_lru_page)
+ * ->lruvec->lru_lock (follow_page->mark_page_accessed)
+ * ->lruvec->lru_lock (check_pte_range->isolate_lru_page)
* ->private_lock (page_remove_rmap->set_page_dirty)
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b106e3b86fff..ca4bbc25dde8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3016,7 +3016,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE

/*
- * Because tail pages are not marked as "used", set it. We're under
+ * Because tail pages are not marked as "used", set it. Don't need
* lruvec->lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
diff --git a/mm/rmap.c b/mm/rmap.c
index ad4a0fdcc94c..d3717d21c992 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -28,7 +28,7 @@
* hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
* anon_vma->rwsem
* mm->page_table_lock or pte_lock
- * pgdat->lru_lock (in mark_page_accessed, isolate_lru_page)
+ * lruvec->lru_lock (in mark_page_accessed, isolate_lru_page)
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 672e7304f211..fb3a5e580a1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1619,14 +1619,16 @@ static __always_inline void update_lru_sizes(struct lruvec *lruvec,
}

/**
- * pgdat->lru_lock is heavily contended. Some of the functions that
+ * Isolating page from the lruvec to fill in @dst list by nr_to_scan times.
+ *
+ * lruvec->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
* and working on them outside the LRU lock.
*
* For pagecache intensive workloads, this function is the hottest
* spot in the kernel (apart from copy_*_user functions).
*
- * Appropriate locks must be held before calling this function.
+ * Lru_lock must be held before calling this function.
*
* @nr_to_scan: The number of eligible pages to look through on the list.
* @lruvec: The LRU vector to pull pages from.
@@ -1826,14 +1828,16 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,

/*
* This moves pages from @list to corresponding LRU list.
+ * The pages from @list is out of any lruvec, and in the end list reuses as
+ * pages_to_free list.
*
* We move them the other way if the page is referenced by one or more
* processes, from rmap.
*
* If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone_lru_lock across the whole operation. But if
+ * appropriate to hold lru_lock across the whole operation. But if
* the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone_lru_lock around each page. It's impossible to balance
+ * should drop lru_lock around each page. It's impossible to balance
* this, so instead we remove the pages from the LRU while processing them.
* It is safe to rely on PG_active against the non-LRU pages in here because
* nobody will play with that bit on a non-LRU page.
--
1.8.3.1

2020-05-28 11:05:58

by Alex Shi

Subject: [PATCH v11 12/16] mm/lru: replace pgdat lru_lock with lruvec lock

This patch moves the per-node lru_lock into the lruvec, which brings a
lru_lock for each memcg on each node. So on a large machine, each memcg
no longer has to suffer from per-node pgdat->lru_lock contention; they
can go fast with their own lru_lock.

After moving the memcg charge before lru inserting, page isolation can
stabilize the page's memcg, so the per-memcg lruvec lock is stable and
can replace the per-node lru lock.

Following Daniel Jordan's suggestion, I ran 208 'dd' tasks in 104
containers on a 2-socket * 26-core * HT box with a modified case:
https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice

With this and later patches, the readtwice performance increases by
about 80% with concurrent containers.

Also add a debug function in the locking which may give some clues if
something gets out of hand.
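
For reference, the irq-disabling variant of the new lock helpers,
condensed from the mm/memcontrol.c hunk below (note the RCU read lock
held across the lruvec lookup until the spinlock is taken):

	struct lruvec *lock_page_lruvec_irq(struct page *page)
	{
		struct lruvec *lruvec;

		rcu_read_lock();
		lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
		spin_lock_irq(&lruvec->lru_lock);
		rcu_read_unlock();

		lruvec_memcg_debug(lruvec, page);	/* the debug check mentioned above */

		return lruvec;
	}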

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vladimir Davydov <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 56 ++++++++++++++++++++++++++++++++
include/linux/mmzone.h | 2 ++
mm/compaction.c | 61 +++++++++++++++++++++++------------
mm/huge_memory.c | 9 ++----
mm/memcontrol.c | 79 ++++++++++++++++++++++++++++++++++++++++++++--
mm/mlock.c | 32 +++++++++----------
mm/mmzone.c | 1 +
mm/swap.c | 75 ++++++++++++++++++++-----------------------
mm/swap_state.c | 6 ++--
mm/vmscan.c | 70 ++++++++++++++++++++++------------------
mm/workingset.c | 4 +--
11 files changed, 275 insertions(+), 120 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0ba84f1c3f91..a4601169bf7d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -413,6 +413,17 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,

struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);

+struct lruvec *lock_page_lruvec(struct page *page);
+struct lruvec *lock_page_lruvec_irq(struct page *page);
+struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ unsigned long *flags);
+
+void unlock_page_lruvec(struct lruvec *lruvec);
+void unlock_page_lruvec_irq(struct lruvec *lruvec);
+void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, unsigned long flags);
+
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
+
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -894,6 +905,47 @@ static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}

+static inline struct lruvec *lock_page_lruvec(struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock(&pgdat->__lruvec.lru_lock);
+ return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock_irq(&pgdat->__lruvec.lru_lock);
+ return &pgdat->__lruvec;
+}
+
+static inline struct lruvec *lock_page_lruvec_irqsave(struct page *page,
+ unsigned long *flagsp)
+{
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ spin_lock_irqsave(&pgdat->__lruvec.lru_lock, *flagsp);
+ return &pgdat->__lruvec;
+}
+
+static inline void unlock_page_lruvec(struct lruvec *lruvec)
+{
+ spin_unlock(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+ spin_unlock_irq(&lruvec->lru_lock);
+}
+
+static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+ unsigned long flags)
+{
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
static inline struct mem_cgroup *
mem_cgroup_iter(struct mem_cgroup *root,
struct mem_cgroup *prev,
@@ -1128,6 +1180,10 @@ static inline void count_memcg_page_event(struct page *page,
void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
{
}
+
+static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+}
#endif /* CONFIG_MEMCG */

/* idx can be of type enum memcg_stat_item or node_stat_item */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 545b663678ed..d70a12214936 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -259,6 +259,8 @@ struct lruvec {
atomic_long_t inactive_age;
/* Refaults at the time of last reclaim cycle */
unsigned long refaults;
+ /* per lruvec lru_lock for memcg */
+ spinlock_t lru_lock;
/* Various lruvec state flags (enum lruvec_flags) */
unsigned long flags;
#ifdef CONFIG_MEMCG
diff --git a/mm/compaction.c b/mm/compaction.c
index c36d832b2a84..e15aa83ebc07 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -787,7 +787,7 @@ static bool too_many_isolated(pg_data_t *pgdat)
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
- bool locked = false;
+ struct lruvec *locked_lruvec = NULL;
struct page *page = NULL, *valid_page = NULL;
unsigned long start_pfn = low_pfn;
bool skip_on_failure = false;
@@ -847,11 +847,21 @@ static bool too_many_isolated(pg_data_t *pgdat)
* contention, to give chance to IRQs. Abort completely if
* a fatal signal is pending.
*/
- if (!(low_pfn % SWAP_CLUSTER_MAX)
- && compact_unlock_should_abort(&pgdat->lru_lock,
- flags, &locked, cc)) {
- low_pfn = 0;
- goto fatal_pending;
+ if (!(low_pfn % SWAP_CLUSTER_MAX)) {
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec,
+ flags);
+ locked_lruvec = NULL;
+ }
+
+ if (fatal_signal_pending(current)) {
+ cc->contended = true;
+
+ low_pfn = 0;
+ goto fatal_pending;
+ }
+
+ cond_resched();
}

if (!pfn_valid_within(low_pfn))
@@ -921,10 +931,9 @@ static bool too_many_isolated(pg_data_t *pgdat)
*/
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock,
- flags);
- locked = false;
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);
+ locked_lruvec = NULL;
}

if (!isolate_movable_page(page, isolate_mode))
@@ -971,10 +980,20 @@ static bool too_many_isolated(pg_data_t *pgdat)
goto isolate_fail;
}

+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+
/* If we already hold the lock, we can skip some rechecking */
- if (!locked) {
- locked = compact_lock_irqsave(&pgdat->lru_lock,
- &flags, cc);
+ if (lruvec != locked_lruvec) {
+ if (locked_lruvec)
+ unlock_page_lruvec_irqrestore(locked_lruvec,
+ flags);
+
+ compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ locked_lruvec = lruvec;
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);

/* Try get exclusive access under lock */
if (!skip_updated) {
@@ -1001,9 +1020,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
if (page_count(page) > page_mapcount(page) + 1 +
(!PageAnon(page) || PageSwapCache(page)))
goto isolate_fail;
- }
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ } else
+ rcu_read_unlock();

/* The whole page is taken off the LRU; skip the tail pages. */
if (PageCompound(page))
@@ -1043,9 +1061,10 @@ static bool too_many_isolated(pg_data_t *pgdat)
* page anyway.
*/
if (nr_isolated) {
- if (locked) {
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- locked = false;
+ if (locked_lruvec) {
+ unlock_page_lruvec_irqrestore(locked_lruvec,
+ flags);
+ locked_lruvec = NULL;
}
putback_movable_pages(&cc->migratepages);
cc->nr_migratepages = 0;
@@ -1070,8 +1089,8 @@ static bool too_many_isolated(pg_data_t *pgdat)
low_pfn = end_pfn;

isolate_abort:
- if (locked)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ if (locked_lruvec)
+ unlock_page_lruvec_irqrestore(locked_lruvec, flags);

/*
* Updated the cached scanner pfn once the pageblock has been scanned
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 44d4b45281a3..39025c651692 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2322,7 +2322,7 @@ void lru_add_page_tail(struct page *head, struct page *page_tail,
VM_BUG_ON_PAGE(!PageHead(head), head);
VM_BUG_ON_PAGE(PageCompound(page_tail), head);
VM_BUG_ON_PAGE(PageLRU(page_tail), head);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+ lockdep_assert_held(&lruvec->lru_lock);

if (!list)
SetPageLRU(page_tail);
@@ -2412,7 +2412,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
pgoff_t end, unsigned long flags)
{
struct page *head = compound_head(page);
- pg_data_t *pgdat = page_pgdat(head);
struct lruvec *lruvec;
struct address_space *swap_cache = NULL;
unsigned long offset = 0;
@@ -2430,9 +2429,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}

/* lock lru list/PageCompound, isolate freezed by page_ref_freeze */
- spin_lock(&pgdat->lru_lock);
-
- lruvec = mem_cgroup_page_lruvec(head, pgdat);
+ lruvec = lock_page_lruvec(head);

for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
@@ -2452,7 +2449,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
}
}
ClearPageCompound(head);
- spin_unlock(&pgdat->lru_lock);
+ unlock_page_lruvec(lruvec);

split_page_owner(head, HPAGE_PMD_ORDER);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91b073891d06..b106e3b86fff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1196,6 +1196,20 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}

+
+void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
+{
+#ifdef CONFIG_DEBUG_VM
+ if (mem_cgroup_disabled())
+ return;
+
+ if (!page->mem_cgroup)
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
+ else
+ VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup, page);
+#endif
+}
+
/**
* mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
* @page: the page
@@ -1215,7 +1229,7 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
goto out;
}

- memcg = page->mem_cgroup;
+ memcg = READ_ONCE(page->mem_cgroup);
/*
* Swapcache readahead pages are added to the LRU - and
* possibly migrated - before they are charged.
@@ -1236,6 +1250,67 @@ struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgd
return lruvec;
}

+/* page was isolated */
+struct lruvec *lock_page_lruvec(struct page *page)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock(&lruvec->lru_lock);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irq(struct page *page)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
+{
+ struct lruvec *lruvec;
+ struct pglist_data *pgdat = page_pgdat(page);
+
+ rcu_read_lock();
+ lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+ rcu_read_unlock();
+
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
+void unlock_page_lruvec(struct lruvec *lruvec)
+{
+ spin_unlock(&lruvec->lru_lock);
+}
+
+void unlock_page_lruvec_irq(struct lruvec *lruvec)
+{
+ spin_unlock_irq(&lruvec->lru_lock);
+}
+
+void unlock_page_lruvec_irqrestore(struct lruvec *lruvec, unsigned long flags)
+{
+ spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+}
+
/**
* mem_cgroup_update_lru_size - account for adding or removing an lru page
* @lruvec: mem_cgroup per zone lru vector
@@ -2942,7 +3017,7 @@ void __memcg_kmem_uncharge_page(struct page *page, int order)

/*
* Because tail pages are not marked as "used", set it. We're under
- * pgdat->lru_lock and migration entries setup in all page mappings.
+ * lruvec->lru_lock and migration entries setup in all page mappings.
*/
void mem_cgroup_split_huge_fixup(struct page *head)
{
diff --git a/mm/mlock.c b/mm/mlock.c
index a0856085c4b7..c1ef4ac7a744 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -163,7 +163,7 @@ unsigned int munlock_vma_page(struct page *page)
{
int nr_pages;
bool clearlru = false;
- pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;

/* For try_to_munlock() and to serialize with page migration */
BUG_ON(!PageLocked(page));
@@ -176,7 +176,7 @@ unsigned int munlock_vma_page(struct page *page)
* we clear it in the head page. It also stabilizes hpage_nr_pages().
*/
get_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
clearlru = TestClearPageLRU(page);

if (!TestClearPageMlocked(page)) {
@@ -186,7 +186,7 @@ unsigned int munlock_vma_page(struct page *page)
* Potentially, PTE-mapped THP: do not skip the rest PTEs
* Reuse lock as memory barrier for release_pages racing.
*/
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
put_page(page);
return 0;
}
@@ -195,14 +195,11 @@ unsigned int munlock_vma_page(struct page *page)
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);

if (clearlru) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
del_page_from_lru_list(page, lruvec, page_lru(page));
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
__munlock_isolated_page(page);
} else {
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
put_page(page);
__munlock_isolation_failed(page);
}
@@ -284,6 +281,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
int nr = pagevec_count(pvec);
int delta_munlocked = -nr;
struct pagevec pvec_putback;
+ struct lruvec *lruvec = NULL;
int pgrescued = 0;

pagevec_init(&pvec_putback);
@@ -291,11 +289,17 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
/* Phase 1: page isolation */
for (i = 0; i < nr; i++) {
struct page *page = pvec->pages[i];
- struct lruvec *lruvec;
+ struct lruvec *new_lruvec;
bool clearlru;

clearlru = TestClearPageLRU(page);
- spin_lock_irq(&zone->zone_pgdat->lru_lock);
+
+ new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ if (new_lruvec != lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = lock_page_lruvec_irq(page);
+ }

if (!TestClearPageMlocked(page)) {
delta_munlocked++;
@@ -314,9 +318,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* We already have pin from follow_page_mask()
* so we can spare the get_page() here.
*/
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
del_page_from_lru_list(page, lruvec, page_lru(page));
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);
continue;

/*
@@ -326,14 +328,12 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
* the last pin, __page_cache_release() would deadlock.
*/
putback:
- spin_unlock_irq(&zone->zone_pgdat->lru_lock);
pagevec_add(&pvec_putback, pvec->pages[i]);
pvec->pages[i] = NULL;
}
- /* tempary disable irq, will remove later */
- local_irq_disable();
__mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- local_irq_enable();
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);

/* Now we can release pins of pages that we are not munlocking */
pagevec_release(&pvec_putback);
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 4686fdc23bb9..3750a90ed4a0 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -91,6 +91,7 @@ void lruvec_init(struct lruvec *lruvec)
enum lru_list lru;

memset(lruvec, 0, sizeof(struct lruvec));
+ spin_lock_init(&lruvec->lru_lock);

for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
diff --git a/mm/swap.c b/mm/swap.c
index 2898efc24135..91ff3d4a7751 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -60,14 +60,12 @@
static void __page_cache_release(struct page *page)
{
if (TestClearPageLRU(page)) {
- pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
unsigned long flags;

- spin_lock_irqsave(&pgdat->lru_lock, flags);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = lock_page_lruvec_irqsave(page, &flags);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ unlock_page_lruvec_irqrestore(lruvec, flags);
}
__ClearPageWaiters(page);
}
@@ -187,26 +185,24 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
void *arg)
{
int i;
- struct pglist_data *pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;

for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ struct lruvec *new_lruvec;

- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
- pgdat = pagepgdat;
- spin_lock_irqsave(&pgdat->lru_lock, flags);
+ new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ if (lruvec != new_lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = lock_page_lruvec_irqsave(page, &flags);
}

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
(*move_fn)(page, lruvec, arg);
}
- if (pgdat)
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);
release_pages(pvec->pages, pvec->nr);
pagevec_reinit(pvec);
}
@@ -345,11 +341,12 @@ static inline void activate_page_drain(int cpu)
void activate_page(struct page *page)
{
pg_data_t *pgdat = page_pgdat(page);
+ struct lruvec *lruvec;

page = compound_head(page);
- spin_lock_irq(&pgdat->lru_lock);
- __activate_page(page, mem_cgroup_page_lruvec(page, pgdat), NULL);
- spin_unlock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
+ __activate_page(page, lruvec, NULL);
+ unlock_page_lruvec_irq(lruvec);
}
#endif

@@ -773,8 +770,7 @@ void release_pages(struct page **pages, int nr)
{
int i;
LIST_HEAD(pages_to_free);
- struct pglist_data *locked_pgdat = NULL;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long uninitialized_var(flags);
unsigned int uninitialized_var(lock_batch);

@@ -784,21 +780,20 @@ void release_pages(struct page **pages, int nr)
/*
* Make sure the IRQ-safe lock-holding time does not get
* excessive with a continuous string of pages from the
- * same pgdat. The lock is held only if pgdat != NULL.
+ * same lruvec. The lock is held only if lruvec != NULL.
*/
- if (locked_pgdat && ++lock_batch == SWAP_CLUSTER_MAX) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec && ++lock_batch == SWAP_CLUSTER_MAX) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}

if (is_huge_zero_page(page))
continue;

if (is_zone_device_page(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
- flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}
/*
* ZONE_DEVICE pages that return 'false' from
@@ -817,27 +812,27 @@ void release_pages(struct page **pages, int nr)
continue;

if (PageCompound(page)) {
- if (locked_pgdat) {
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
- locked_pgdat = NULL;
+ if (lruvec) {
+ unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec = NULL;
}
__put_compound_page(page);
continue;
}

if (TestClearPageLRU(page)) {
- struct pglist_data *pgdat = page_pgdat(page);
+ struct lruvec *new_lruvec;

- if (pgdat != locked_pgdat) {
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock,
+ new_lruvec = mem_cgroup_page_lruvec(page,
+ page_pgdat(page));
+ if (new_lruvec != lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec,
flags);
lock_batch = 0;
- locked_pgdat = pgdat;
- spin_lock_irqsave(&locked_pgdat->lru_lock, flags);
+ lruvec = lock_page_lruvec_irqsave(page, &flags);
}

- lruvec = mem_cgroup_page_lruvec(page, locked_pgdat);
del_page_from_lru_list(page, lruvec, page_off_lru(page));
}

@@ -847,8 +842,8 @@ void release_pages(struct page **pages, int nr)

list_add(&page->lru, &pages_to_free);
}
- if (locked_pgdat)
- spin_unlock_irqrestore(&locked_pgdat->lru_lock, flags);
+ if (lruvec)
+ unlock_page_lruvec_irqrestore(lruvec, flags);

mem_cgroup_uncharge_list(&pages_to_free);
free_unref_page_list(&pages_to_free);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9d20b00627af..cbaa1a60434d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -362,6 +362,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
{
struct swap_info_struct *si;
struct page *page;
+ struct lruvec *lruvec = NULL;

*new_page_allocated = false;

@@ -441,9 +442,10 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
}

/* XXX: Move to lru_cache_add() when it supports new vs putback */
- spin_lock_irq(&page_pgdat(page)->lru_lock);
+ lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ spin_lock_irq(&lruvec->lru_lock);
lru_note_cost_page(page);
- spin_unlock_irq(&page_pgdat(page)->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

/* Caller will initiate read into locked page */
SetPageWorkingset(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index df0765203473..c4c30b530876 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1774,14 +1774,12 @@ int isolate_lru_page(struct page *page)

get_page(page);
if (TestClearPageLRU(page)) {
- pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;
int lru = page_lru(page);

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
- spin_lock_irq(&pgdat->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
del_page_from_lru_list(page, lruvec, lru);
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
ret = 0;
} else
put_page(page);
@@ -1849,20 +1847,22 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
struct list_head *list)
{
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
struct page *page;
+ struct lruvec *orig_lruvec = lruvec;
enum lru_list lru;

while (!list_empty(list)) {
+ struct lruvec *new_lruvec = NULL;
+
page = lru_to_page(list);
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
putback_lru_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
continue;
}

@@ -1876,6 +1876,12 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
* list_add(&page->lru,)
* list_add(&page->lru,) //corrupt
*/
+ new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ if (new_lruvec != lruvec) {
+ if (lruvec)
+ spin_unlock_irq(&lruvec->lru_lock);
+ lruvec = lock_page_lruvec_irq(page);
+ }
SetPageLRU(page);

if (unlikely(put_page_testzero(page))) {
@@ -1883,15 +1889,14 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
__ClearPageActive(page);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
destroy_compound_page(page);
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
} else
list_add(&page->lru, &pages_to_free);
continue;
}

- lruvec = mem_cgroup_page_lruvec(page, pgdat);
lru = page_lru(page);
nr_pages = hpage_nr_pages(page);

@@ -1899,6 +1904,11 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
list_add(&page->lru, &lruvec->lists[lru]);
nr_moved += nr_pages;
}
+ if (orig_lruvec != lruvec) {
+ if (lruvec)
+ spin_unlock_irq(&lruvec->lru_lock);
+ spin_lock_irq(&orig_lruvec->lru_lock);
+ }

/*
* To save our caller's stack, now use input list for pages to free.
@@ -1954,7 +1964,7 @@ static int current_may_throttle(void)

lru_add_drain();

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
@@ -1966,7 +1976,7 @@ static int current_may_throttle(void)
__count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
__count_vm_events(PGSCAN_ANON + file, nr_scanned);

- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

if (nr_taken == 0)
return 0;
@@ -1974,7 +1984,7 @@ static int current_may_throttle(void)
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
&stat, false);

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

move_pages_to_lru(lruvec, &page_list);

@@ -1986,7 +1996,7 @@ static int current_may_throttle(void)
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);

- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_uncharge_list(&page_list);
free_unref_page_list(&page_list);
@@ -2038,7 +2048,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

lru_add_drain();

- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
@@ -2048,7 +2058,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGREFILL, nr_scanned);
__count_memcg_events(lruvec_memcg(lruvec), PGREFILL, nr_scanned);

- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

while (!list_empty(&l_hold)) {
cond_resched();
@@ -2094,7 +2104,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);

nr_activate = move_pages_to_lru(lruvec, &l_active);
nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
@@ -2105,7 +2115,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);

__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);

mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
@@ -2695,10 +2705,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
/*
* Determine the scan balance between anon and file LRUs.
*/
- spin_lock_irq(&pgdat->lru_lock);
+ spin_lock_irq(&target_lruvec->lru_lock);
sc->anon_cost = target_lruvec->anon_cost;
sc->file_cost = target_lruvec->file_cost;
- spin_unlock_irq(&pgdat->lru_lock);
+ spin_unlock_irq(&target_lruvec->lru_lock);

/*
* Target desirable inactive:active list ratios for the anon
@@ -4274,24 +4284,22 @@ int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
*/
void check_move_unevictable_pages(struct pagevec *pvec)
{
- struct lruvec *lruvec;
- struct pglist_data *pgdat = NULL;
+ struct lruvec *lruvec = NULL;
int pgscanned = 0;
int pgrescued = 0;
int i;

for (i = 0; i < pvec->nr; i++) {
struct page *page = pvec->pages[i];
- struct pglist_data *pagepgdat = page_pgdat(page);
+ struct lruvec *new_lruvec;

pgscanned++;
- if (pagepgdat != pgdat) {
- if (pgdat)
- spin_unlock_irq(&pgdat->lru_lock);
- pgdat = pagepgdat;
- spin_lock_irq(&pgdat->lru_lock);
+ new_lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ if (lruvec != new_lruvec) {
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
+ lruvec = lock_page_lruvec_irq(page);
}
- lruvec = mem_cgroup_page_lruvec(page, pgdat);

if (!PageLRU(page) || !PageUnevictable(page))
continue;
@@ -4307,10 +4315,10 @@ void check_move_unevictable_pages(struct pagevec *pvec)
}
}

- if (pgdat) {
+ if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- spin_unlock_irq(&pgdat->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
}
}
EXPORT_SYMBOL_GPL(check_move_unevictable_pages);
diff --git a/mm/workingset.c b/mm/workingset.c
index d481ea452eeb..7423a022c27b 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -366,9 +366,9 @@ void workingset_refault(struct page *page, void *shadow)
if (workingset) {
SetPageWorkingset(page);
/* XXX: Move to lru_cache_add() when it supports new vs putback */
- spin_lock_irq(&page_pgdat(page)->lru_lock);
+ spin_lock_irq(&lruvec->lru_lock);
lru_note_cost_page(page);
- spin_unlock_irq(&page_pgdat(page)->lru_lock);
+ spin_unlock_irq(&lruvec->lru_lock);
inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
}
out:
--
1.8.3.1

2020-05-28 11:06:06

by Alex Shi

[permalink] [raw]
Subject: [PATCH v11 05/16] mm/thp: move lru_add_page_tail func to huge_memory.c

The function is only used in huge_memory.c; defining it in another file
under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks weird.

Let's move it into huge_memory.c.

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/swap.h | 2 --
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
mm/swap.c | 33 ---------------------------------
3 files changed, 30 insertions(+), 35 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index d9e362c7439c..d12ecacce307 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -338,8 +338,6 @@ extern void lru_note_cost(struct lruvec *lruvec, bool file,
unsigned int nr_pages);
extern void lru_note_cost_page(struct page *);
extern void lru_cache_add(struct page *);
-extern void lru_add_page_tail(struct page *page, struct page *page_tail,
- struct lruvec *lruvec, struct list_head *head);
extern void activate_page(struct page *);
extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 21e6687895e2..4c3990ba29cb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2316,6 +2316,36 @@ static void remap_page(struct page *page)
}
}

+void lru_add_page_tail(struct page *page, struct page *page_tail,
+ struct lruvec *lruvec, struct list_head *list)
+{
+ VM_BUG_ON_PAGE(!PageHead(page), page);
+ VM_BUG_ON_PAGE(PageCompound(page_tail), page);
+ VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+ lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
+
+ if (!list)
+ SetPageLRU(page_tail);
+
+ if (likely(PageLRU(page)))
+ list_add_tail(&page_tail->lru, &page->lru);
+ else if (list) {
+ /* page reclaim is reclaiming a huge page */
+ get_page(page_tail);
+ list_add_tail(&page_tail->lru, list);
+ } else {
+ /*
+ * Head page has not yet been counted, as an hpage,
+ * so we must account for each subpage individually.
+ *
+ * Put page_tail on the list at the correct position
+ * so they all end up in order.
+ */
+ add_page_to_lru_list_tail(page_tail, lruvec,
+ page_lru(page_tail));
+ }
+}
+
static void __split_huge_page_tail(struct page *head, int tail,
struct lruvec *lruvec, struct list_head *list)
{
diff --git a/mm/swap.c b/mm/swap.c
index acd88873f076..ffb4ea7b82b5 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -880,39 +880,6 @@ void __pagevec_release(struct pagevec *pvec)
}
EXPORT_SYMBOL(__pagevec_release);

-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* used by __split_huge_page_refcount() */
-void lru_add_page_tail(struct page *page, struct page *page_tail,
- struct lruvec *lruvec, struct list_head *list)
-{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page);
- VM_BUG_ON_PAGE(PageLRU(page_tail), page);
- lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);
-
- if (!list)
- SetPageLRU(page_tail);
-
- if (likely(PageLRU(page)))
- list_add_tail(&page_tail->lru, &page->lru);
- else if (list) {
- /* page reclaim is reclaiming a huge page */
- get_page(page_tail);
- list_add_tail(&page_tail->lru, list);
- } else {
- /*
- * Head page has not yet been counted, as an hpage,
- * so we must account for each subpage individually.
- *
- * Put page_tail on the list at the correct position
- * so they all end up in order.
- */
- add_page_to_lru_list_tail(page_tail, lruvec,
- page_lru(page_tail));
- }
-}
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
--
1.8.3.1

2020-05-28 11:06:38

by Alex Shi

[permalink] [raw]
Subject: [PATCH v11 06/16] mm/thp: clean up lru_add_page_tail

Since the first parameter is only ever the head page, it's better to make
that explicit and name it 'head'.

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c3990ba29cb..a4ba75e143b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2316,19 +2316,19 @@ static void remap_page(struct page *page)
}
}

-void lru_add_page_tail(struct page *page, struct page *page_tail,
+void lru_add_page_tail(struct page *head, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- VM_BUG_ON_PAGE(!PageHead(page), page);
- VM_BUG_ON_PAGE(PageCompound(page_tail), page);
- VM_BUG_ON_PAGE(PageLRU(page_tail), page);
+ VM_BUG_ON_PAGE(!PageHead(head), head);
+ VM_BUG_ON_PAGE(PageCompound(page_tail), head);
+ VM_BUG_ON_PAGE(PageLRU(page_tail), head);
lockdep_assert_held(&lruvec_pgdat(lruvec)->lru_lock);

if (!list)
SetPageLRU(page_tail);

- if (likely(PageLRU(page)))
- list_add_tail(&page_tail->lru, &page->lru);
+ if (likely(PageLRU(head)))
+ list_add_tail(&page_tail->lru, &head->lru);
else if (list) {
/* page reclaim is reclaiming a huge page */
get_page(page_tail);
--
1.8.3.1

2020-05-28 11:07:10

by Alex Shi

[permalink] [raw]
Subject: [PATCH v11 03/16] mm/compaction: correct the comments of compact_defer_shift

There is no compact_defer_limit; compact_defer_shift is what is actually
used. Also add an explanation of compact_order_failed.
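
For reference, the deferral bookkeeping these fields drive can be modelled
by the stand-alone sketch below. The struct and the model_* helpers are
simplified, made-up stand-ins for the zone fields and for
defer_compaction()/compaction_deferred(); this is illustrative only, not
the kernel code itself:

#include <stdbool.h>
#include <stdio.h>

#define COMPACT_MAX_DEFER_SHIFT 6

struct zone_model {
        unsigned int compact_considered;        /* attempts since last failure */
        unsigned int compact_defer_shift;       /* skip 1 << shift attempts */
        int compact_order_failed;               /* lowest order seen to fail */
};

/* On compaction failure: reset the counter and back off exponentially. */
static void model_defer_compaction(struct zone_model *z, int order)
{
        z->compact_considered = 0;
        if (z->compact_defer_shift < COMPACT_MAX_DEFER_SHIFT)
                z->compact_defer_shift++;
        if (order < z->compact_order_failed)
                z->compact_order_failed = order;
}

/* Should this compaction attempt be skipped? */
static bool model_compaction_deferred(struct zone_model *z, int order)
{
        unsigned int limit = 1U << z->compact_defer_shift;

        if (order < z->compact_order_failed)
                return false;           /* lower orders never failed */
        if (++z->compact_considered >= limit)
                return false;           /* skipped enough, try again */
        return true;
}

int main(void)
{
        struct zone_model z = { 0, 0, 10 };
        int i;

        model_defer_compaction(&z, 9);
        model_defer_compaction(&z, 9);
        /* shift is now 2, so the next (1 << 2) - 1 = 3 attempts are deferred */
        for (i = 0; i < 6; i++)
                printf("attempt %d: %s\n", i,
                       model_compaction_deferred(&z, 9) ? "deferred" : "run");
        return 0;
}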

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmzone.h | 1 +
mm/compaction.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d8cad09d34ff..545b663678ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -510,6 +510,7 @@ struct zone {
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
+ * compact_order_failed is the minimum compaction failed order.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
diff --git a/mm/compaction.c b/mm/compaction.c
index 478f6e754f38..38cdf392837b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -136,7 +136,7 @@ void __ClearPageMovable(struct page *page)

/*
* Compaction is deferred when compaction fails to result in a page
- * allocation success. 1 << compact_defer_limit compactions are skipped up
+ * allocation success. compact_defer_shift++, compactions are skipped up
* to a limit of 1 << COMPACT_MAX_DEFER_SHIFT
*/
void defer_compaction(struct zone *zone, int order)
--
1.8.3.1

2020-05-28 11:08:40

by Alex Shi

[permalink] [raw]
Subject: [PATCH v11 07/16] mm/thp: narrow lru locking

With the current sequence, lru_lock and the page cache xa_lock have no
reason to be held together, so nesting them isn't necessary. Let's narrow
the lru locking, but keep the local_irq_disable()/preempt_disable() to
block interrupt re-entry and keep the statistics update safe.
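
In outline, the locking in split_huge_page_to_list()/__split_huge_page()
changes as sketched below; this is only an ordering sketch with the real
work elided, see the diff for the actual code:

  Before:
        spin_lock_irqsave(&pgdat->lru_lock, flags);
        xa_lock(&mapping->i_pages);     /* file/swapcache only */
        /* freeze refcount, split tail pages, touch lru lists */
        xa_unlock(&mapping->i_pages);
        spin_unlock_irqrestore(&pgdat->lru_lock, flags);

  After:
        local_irq_save(flags);
        preempt_disable();
        xa_lock(&mapping->i_pages);     /* file/swapcache only */
        spin_lock(&pgdat->lru_lock);
        /* split tail pages, touch lru lists */
        spin_unlock(&pgdat->lru_lock);
        xa_unlock(&mapping->i_pages);
        preempt_enable();
        local_irq_restore(flags);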

Signed-off-by: Alex Shi <[email protected]>
Signed-off-by: Wei Yang <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
mm/huge_memory.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a4ba75e143b3..44d4b45281a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2418,8 +2418,6 @@ static void __split_huge_page(struct page *page, struct list_head *list,
unsigned long offset = 0;
int i;

- lruvec = mem_cgroup_page_lruvec(head, pgdat);
-
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);

@@ -2431,6 +2429,11 @@ static void __split_huge_page(struct page *page, struct list_head *list,
xa_lock(&swap_cache->i_pages);
}

+ /* lock lru list/PageCompound, isolate freezed by page_ref_freeze */
+ spin_lock(&pgdat->lru_lock);
+
+ lruvec = mem_cgroup_page_lruvec(head, pgdat);
+
for (i = HPAGE_PMD_NR - 1; i >= 1; i--) {
__split_huge_page_tail(head, i, lruvec, list);
/* Some pages can be beyond i_size: drop them from page cache */
@@ -2448,8 +2451,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
head + i, 0);
}
}
-
ClearPageCompound(head);
+ spin_unlock(&pgdat->lru_lock);

split_page_owner(head, HPAGE_PMD_ORDER);

@@ -2467,8 +2470,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
page_ref_add(head, 2);
xa_unlock(&head->mapping->i_pages);
}
-
- spin_unlock_irqrestore(&pgdat->lru_lock, flags);
+ preempt_enable();
+ local_irq_restore(flags);

remap_page(head);

@@ -2607,7 +2610,6 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
- struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
struct deferred_split *ds_queue = get_deferred_split_queue(head);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
@@ -2673,9 +2675,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
unmap_page(head);
VM_BUG_ON_PAGE(compound_mapcount(head), head);

- /* prevent PageLRU to go away from under us, and freeze lru stats */
- spin_lock_irqsave(&pgdata->lru_lock, flags);
-
+ local_irq_save(flags);
+ preempt_disable();
if (mapping) {
XA_STATE(xas, &mapping->i_pages, page_index(head));

@@ -2724,7 +2725,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
spin_unlock(&ds_queue->split_queue_lock);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
- spin_unlock_irqrestore(&pgdata->lru_lock, flags);
+ preempt_enable();
+ local_irq_restore(flags);
remap_page(head);
ret = -EBUSY;
}
--
1.8.3.1

2020-05-28 11:08:50

by Alex Shi

[permalink] [raw]
Subject: [PATCH v11 15/16] mm/pgdat: remove pgdat lru_lock

Now that pgdat->lru_lock has been replaced by the lruvec lock, it is not
used anymore; remove it.

Signed-off-by: Alex Shi <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 1 -
2 files changed, 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d70a12214936..42e646f7f30d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -733,7 +733,6 @@ struct deferred_split {

/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
- spinlock_t lru_lock;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 79a3a6d62532..30081dc0cc15 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6765,7 +6765,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
init_waitqueue_head(&pgdat->pfmemalloc_wait);

pgdat_page_ext_init(pgdat);
- spin_lock_init(&pgdat->lru_lock);
lruvec_init(&pgdat->__lruvec);
}

--
1.8.3.1

2020-06-08 04:19:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock

On Thu, 28 May 2020, Alex Shi wrote:

> This is a new version which bases on linux-next
>
> Johannes Weiner has suggested:
> "So here is a crazy idea that may be worth exploring:
>
> Right now, pgdat->lru_lock protects both PageLRU *and* the lruvec's
> linked list.
>
> Can we make PageLRU atomic and use it to stabilize the lru_lock
> instead, and then use the lru_lock only serialize list operations?
> ..."
>
> With new memcg charge path and this solution, we could isolate
> LRU pages to exclusive visit them in compaction, page migration, reclaim,
> memcg move_accunt, huge page split etc scenarios while keeping pages'
> memcg stable. Then possible to change per node lru locking to per memcg
> lru locking. As to pagevec_lru_move_fn funcs, it would be safe to let
> pages remain on lru list, lru lock could guard them for list integrity.
>
> The patchset includes 3 parts:
> 1, some code cleanup and minimum optimization as a preparation.
> 2, use TestCleanPageLRU as page isolation's precondition
> 3, replace per node lru_lock with per memcg per node lru_lock
>
> The 3rd part moves per node lru_lock into lruvec, thus bring a lru_lock for
> each of memcg per node. So on a large machine, each of memcg don't
> have to suffer from per node pgdat->lru_lock competition. They could go
> fast with their self lru_lock
>
> Following Daniel Jordan's suggestion, I have run 208 'dd' with on 104
> containers on a 2s * 26cores * HT box with a modefied case:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-lru-file-readtwice
>
> With this patchset, the readtwice performance increased about 80%
> in concurrent containers.
>
> Thanks Hugh Dickins and Konstantin Khlebnikov, they both brought this
> idea 8 years ago, and others who give comments as well: Daniel Jordan,
> Mel Gorman, Shakeel Butt, Matthew Wilcox etc.
>
> Thanks for Testing support from Intel 0day and Rong Chen, Fengguang Wu,
> and Yun Wang. Hugh Dickins also shared his kbuild-swap case. Thanks!
>
>
> Alex Shi (14):
> mm/vmscan: remove unnecessary lruvec adding
> mm/page_idle: no unlikely double check for idle page counting
> mm/compaction: correct the comments of compact_defer_shift
> mm/compaction: rename compact_deferred as compact_should_defer
> mm/thp: move lru_add_page_tail func to huge_memory.c
> mm/thp: clean up lru_add_page_tail
> mm/thp: narrow lru locking
> mm/memcg: add debug checking in lock_page_memcg
> mm/lru: introduce TestClearPageLRU
> mm/compaction: do page isolation first in compaction
> mm/mlock: reorder isolation sequence during munlock
> mm/lru: replace pgdat lru_lock with lruvec lock
> mm/lru: introduce the relock_page_lruvec function
> mm/pgdat: remove pgdat lru_lock
>
> Hugh Dickins (2):
> mm/vmscan: use relock for move_pages_to_lru
> mm/lru: revise the comments of lru_lock
>
> Documentation/admin-guide/cgroup-v1/memcg_test.rst | 15 +-
> Documentation/admin-guide/cgroup-v1/memory.rst | 8 +-
> Documentation/trace/events-kmem.rst | 2 +-
> Documentation/vm/unevictable-lru.rst | 22 +--
> include/linux/compaction.h | 4 +-
> include/linux/memcontrol.h | 92 +++++++++++
> include/linux/mm_types.h | 2 +-
> include/linux/mmzone.h | 6 +-
> include/linux/page-flags.h | 1 +
> include/linux/swap.h | 4 +-
> include/trace/events/compaction.h | 2 +-
> mm/compaction.c | 104 ++++++++-----
> mm/filemap.c | 4 +-
> mm/huge_memory.c | 51 +++++--
> mm/memcontrol.c | 87 ++++++++++-
> mm/mlock.c | 93 ++++++------
> mm/mmzone.c | 1 +
> mm/page_alloc.c | 1 -
> mm/page_idle.c | 8 -
> mm/rmap.c | 2 +-
> mm/swap.c | 112 ++++----------
> mm/swap_state.c | 6 +-
> mm/vmscan.c | 168 +++++++++++----------
> mm/workingset.c | 4 +-
> 24 files changed, 487 insertions(+), 312 deletions(-)

Hi Alex,

I didn't get to try v10 at all, waited until Johannes's preparatory
memcg swap cleanup was in mmotm; but I have spent a while thrashing
this v11, and can happily report that it is much better than v9 etc:
I believe this memcg lru_lock work will soon be ready for v5.9.

I've not yet found any flaw at the swapping end, but fixes are needed
for isolate_migratepages_block() and mem_cgroup_move_account(): I've
got a series of 4 fix patches to send you (I guess two to fold into
existing patches of yours, and two to keep as separate from me).

I haven't yet written the patch descriptions, will return to that
tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1
or v5.8-rc2, and will be able to include these fixes in that.

Tomorrow...
Hugh

2020-06-08 06:16:08

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock



On 2020/6/8 12:15 PM, Hugh Dickins wrote:
>> 24 files changed, 487 insertions(+), 312 deletions(-)
> Hi Alex,
>
> I didn't get to try v10 at all, waited until Johannes's preparatory
> memcg swap cleanup was in mmotm; but I have spent a while thrashing
> this v11, and can happily report that it is much better than v9 etc:
> I believe this memcg lru_lock work will soon be ready for v5.9.
>
> I've not yet found any flaw at the swapping end, but fixes are needed
> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> got a series of 4 fix patches to send you (I guess two to fold into
> existing patches of yours, and two to keep as separate from me).
>
> I haven't yet written the patch descriptions, will return to that
> tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1
> or v5.8-rc2, and will be able to include these fixes in that.

I am very glad to get your help on this feature!

and looking forward for your fixes tomorrow. :)

Thanks a lot!
Alex

2020-06-10 03:27:20

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock

On Mon, 8 Jun 2020, Alex Shi wrote:
> On 2020/6/8 12:15 PM, Hugh Dickins wrote:
> >> 24 files changed, 487 insertions(+), 312 deletions(-)
> > Hi Alex,
> >
> > I didn't get to try v10 at all, waited until Johannes's preparatory
> > memcg swap cleanup was in mmotm; but I have spent a while thrashing
> > this v11, and can happily report that it is much better than v9 etc:
> > I believe this memcg lru_lock work will soon be ready for v5.9.
> >
> > I've not yet found any flaw at the swapping end, but fixes are needed
> > for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> > got a series of 4 fix patches to send you (I guess two to fold into
> > existing patches of yours, and two to keep as separate from me).
> >
> > I haven't yet written the patch descriptions, will return to that
> > tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1
> > or v5.8-rc2, and will be able to include these fixes in that.
>
> I am very glad to get your help on this feature!
>
> and looking forward for your fixes tomorrow. :)
>
> Thanks a lot!
> Alex

Sorry, Alex, the news is not so good today.

You'll have noticed I sent nothing yesterday. That's because I got
stuck on my second patch: could not quite convince myself that it
was safe.

I keep hinting at these patches, and I can't complete their writeups
until I'm convinced; but to give you a better idea of what they do:

1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
4. Adds lruvec lock protection in mem_cgroup_move_account().

In the second, I was using rcu_read_lock() instead of trylock_page()
(like in my own patchset), but could not quite be sure of the case when
PageSwapCache gets set at the wrong moment. Gave up for the night, and
in the morning abandoned that, instead just shifting the call to
__isolate_lru_page_prepare() after the get_page_unless_zero(),
where that trylock_page() becomes safe (no danger of stomping on page
flags while page is being freed or newly allocated to another owner).

I thought that a very safe change, but best to do some test runs with
it in before finalizing. And was then unpleasantly surprised to hit a
VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
Then similar but < rotate_reclaimable_page after 8 hours on another.

Only seen once before: that's what drove me to add patch 4 (with 3 to
revert the locking before it): somehow, when adding the lruvec locking
there, I just took it for granted that your patchset would have the
appropriate locking (or TestClearPageLRU magic) at the other end.

But apparently not. And I'm beginning to think that TestClearPageLRU
was just to distract the audience from the lack of proper locking.

I have certainly not concluded that yet, but I'm having to think about
an area of the code which I'd imagined you had under control (and I'm
puzzled why my testing has found it so very hard to hit). If we're
lucky, I'll find that pagevec_move_tail is a special case, and
nothing much else needs changing; but I doubt that will be so.

There's one other unexplained and unfixed bug I've seen several times
while exercising mem_cgroup_move_account(): refcount_warn_saturate()
from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
I'll be glad if that goes away when the lruvec locking is fixed,
but don't understand the connection. And it's quite possible that
this refcounting bug has nothing to do with your changes: I have
not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
but I didn't really try long enough to be sure.

(I should also warn, that I'm surprised by the amount of change
11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)

Taking a break for the evening,
Hugh

2020-06-11 06:11:28

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock



On 2020/6/10 11:22 AM, Hugh Dickins wrote:
> On Mon, 8 Jun 2020, Alex Shi wrote:
>> On 2020/6/8 12:15 PM, Hugh Dickins wrote:
>>>> 24 files changed, 487 insertions(+), 312 deletions(-)
>>> Hi Alex,
>>>
>>> I didn't get to try v10 at all, waited until Johannes's preparatory
>>> memcg swap cleanup was in mmotm; but I have spent a while thrashing
>>> this v11, and can happily report that it is much better than v9 etc:
>>> I believe this memcg lru_lock work will soon be ready for v5.9.
>>>
>>> I've not yet found any flaw at the swapping end, but fixes are needed
>>> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
>>> got a series of 4 fix patches to send you (I guess two to fold into
>>> existing patches of yours, and two to keep as separate from me).
>>>
>>> I haven't yet written the patch descriptions, will return to that
>>> tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1
>>> or v5.8-rc2, and will be able to include these fixes in that.
>>
>> I am very glad to get your help on this feature!
>>
>> and looking forward for your fixes tomorrow. :)
>>
>> Thanks a lot!
>> Alex
>
> Sorry, Alex, the news is not so good today.
>
> You'll have noticed I sent nothing yesterday. That's because I got
> stuck on my second patch: could not quite convince myself that it
> was safe.

Hi Hugh,

Thanks a lot for your help and effort! I really appreciate it.

>
> I keep hinting at these patches, and I can't complete their writeups
> until I'm convinced; but to give you a better idea of what they do:
>
> 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().

I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
was removed.

> 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> 4. Adds lruvec lock protection in mem_cgroup_move_account().

Sorry, I can't quite follow you on the above issues. Anyway, I will send out a new
patchset with the first issue fixed, and then let's discuss based on it.

>
> In the second, I was using rcu_read_lock() instead of trylock_page()
> (like in my own patchset), but could not quite be sure of the case when
> PageSwapCache gets set at the wrong moment. Gave up for the night, and
> in the morning abandoned that, instead just shifting the call to
> __isolate_lru_page_prepare() after the get_page_unless_zero(),
> where that trylock_page() becomes safe (no danger of stomping on page
> flags while page is being freed or newly allocated to another owner).

Sorry, I don't see the problem with trylock_page() here. Could you
describe it as a race?

>
> I thought that a very safe change, but best to do some test runs with
> it in before finalizing. And was then unpleasantly surprised to hit a
> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> Then similar but < rotate_reclaimable_page after 8 hours on another.
>
> Only seen once before: that's what drove me to add patch 4 (with 3 to
> revert the locking before it): somehow, when adding the lruvec locking
> there, I just took it for granted that your patchset would have the
> appropriate locking (or TestClearPageLRU magic) at the other end.
>
> But apparently not. And I'm beginning to think that TestClearPageLRU
> was just to distract the audience from the lack of proper locking.
>
> I have certainly not concluded that yet, but I'm having to think about
> an area of the code which I'd imagined you had under control (and I'm
> puzzled why my testing has found it so very hard to hit). If we're
> lucky, I'll find that pagevec_move_tail is a special case, and
> nothing much else needs changing; but I doubt that will be so.
>
> There's one other unexplained and unfixed bug I've seen several times
> while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> I'll be glad if that goes away when the lruvec locking is fixed,
> but don't understand the connection. And it's quite possible that
> this refcounting bug has nothing to do with your changes: I have
> not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> but I didn't really try long enough to be sure.
>
> (I should also warn, that I'm surprised by the amount of change
> 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)

Yes, that is a bit complex. I have tried the mlock cases from the selftests
together with your swap & build case. They were all fine across 300 runs.

>
> Taking a break for the evening,
> Hugh
>

2020-06-11 22:15:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock

On Thu, 11 Jun 2020, Alex Shi wrote:
> On 2020/6/10 11:22 AM, Hugh Dickins wrote:
> > On Mon, 8 Jun 2020, Alex Shi wrote:
> >> On 2020/6/8 12:15 PM, Hugh Dickins wrote:
> >>>> 24 files changed, 487 insertions(+), 312 deletions(-)
> >>> Hi Alex,
> >>>
> >>> I didn't get to try v10 at all, waited until Johannes's preparatory
> >>> memcg swap cleanup was in mmotm; but I have spent a while thrashing
> >>> this v11, and can happily report that it is much better than v9 etc:
> >>> I believe this memcg lru_lock work will soon be ready for v5.9.
> >>>
> >>> I've not yet found any flaw at the swapping end, but fixes are needed
> >>> for isolate_migratepages_block() and mem_cgroup_move_account(): I've
> >>> got a series of 4 fix patches to send you (I guess two to fold into
> >>> existing patches of yours, and two to keep as separate from me).
> >>>
> >>> I haven't yet written the patch descriptions, will return to that
> >>> tomorrow. I expect you will be preparing a v12 rebased on v5.8-rc1
> >>> or v5.8-rc2, and will be able to include these fixes in that.
> >>
> >> I am very glad to get your help on this feature!
> >>
> >> and looking forward for your fixes tomorrow. :)
> >>
> >> Thanks a lot!
> >> Alex
> >
> > Sorry, Alex, the news is not so good today.
> >
> > You'll have noticed I sent nothing yesterday. That's because I got
> > stuck on my second patch: could not quite convince myself that it
> > was safe.
>
> Hi Hugh,
>
> Thanks a lot for your help and effort! I really appreciate it.
>
> >
> > I keep hinting at these patches, and I can't complete their writeups
> > until I'm convinced; but to give you a better idea of what they do:
> >
> > 1. Fixes isolate_fail and isolate_abort in isolate_migratepages_block().
>
> I guess I know this after mm-compaction-avoid-vm_bug_onpageslab-in-page_mapcount.patch
> was removed.

No, I already assumed you had backed that out: these are fixes beyond that.

>
> > 2. Fixes unsafe use of trylock_page() in __isolate_lru_page_prepare().
> > 3. Reverts 07/16 inversion of lock ordering in split_huge_page_to_list().
> > 4. Adds lruvec lock protection in mem_cgroup_move_account().
>
> Sorry, I can't quite follow you on the above issues.

Indeed, more explanation needed: coming.

> Anyway, I will send out a new
> patchset with the first issue fixed, and then let's discuss based on it.

Sigh. I wish you had waited for me to send you fixes, or waited for an
identifiable tag like 5.8-rc1. Andrew has been very hard at work with
mm patches to Linus, but it looks like there are still "data_race" mods
to come before -rc1, which may stop your v12 from applying cleanly.

>
> >
> > In the second, I was using rcu_read_lock() instead of trylock_page()
> > (like in my own patchset), but could not quite be sure of the case when
> > PageSwapCache gets set at the wrong moment. Gave up for the night, and
> > in the morning abandoned that, instead just shifting the call to
> > __isolate_lru_page_prepare() after the get_page_unless_zero(),
> > where that trylock_page() becomes safe (no danger of stomping on page
> > flags while page is being freed or newly allocated to another owner).
>
> Sorry, I don't see the problem with trylock_page() here. Could you
> describe it as a race?

Races, yes. Look, I'll send you now patches 1 and 2: at least with those
in it should be safe for you and others to test compaction (if 5.8-rc1
turns out well: I think so much has gone in that it might have unrelated
problems, and often the -rc2 is much more stable).

But no point in sending 3 and 4 at this point, since ...

>
> >
> > I thought that a very safe change, but best to do some test runs with
> > it in before finalizing. And was then unpleasantly surprised to hit a
> > VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
> > lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
> > pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
> > Then similar but < rotate_reclaimable_page after 8 hours on another.
> >
> > Only seen once before: that's what drove me to add patch 4 (with 3 to
> > revert the locking before it): somehow, when adding the lruvec locking
> > there, I just took it for granted that your patchset would have the
> > appropriate locking (or TestClearPageLRU magic) at the other end.
> >
> > But apparently not. And I'm beginning to think that TestClearPageLRU
> > was just to distract the audience from the lack of proper locking.
> >
> > I have certainly not concluded that yet, but I'm having to think about
> > an area of the code which I'd imagined you had under control (and I'm
> > puzzled why my testing has found it so very hard to hit). If we're
> > lucky, I'll find that pagevec_move_tail is a special case, and
> > nothing much else needs changing; but I doubt that will be so.

... shows that your locking primitives are not yet good enough
to handle the case when tasks are moved between memcgs with
move_charge_at_immigrate set. "bin/cg m" in the tests I sent,
but today I'm changing its "seconds=60" to "seconds=1" in hope
of speeding up the reproduction.

Ah, good, two machines crashed in 1.5 hours: but I don't need to
examine the crashes, now that it's obvious there's no protection -
please, think about rotate_reclaimable_page() (there will be more
cases, but in practice that seems easiest to hit, so focus on that)
and how it is not protected from mem_cgroup_move_account().

I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
(8 years ago it was unsuitable, but a lot has changed for the better
since then); otherwise it's back to what I've been doing all along,
taking the likely lruvec lock, and checking under that lock whether
we have the right lock (as your lruvec_memcg_debug() does), retrying
if not. Which may be more efficient than involving lock_page_memcg().
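
Roughly, the retry version I have in mind looks like the sketch below
(sketch only: the helper name is made up, and RCU plus the root-memcg
special case are glossed over):

static struct lruvec *lock_page_lruvec_irq_retry(struct page *page)
{
        struct lruvec *lruvec;

        for (;;) {
                /* Lock the lruvec the page appears to belong to. */
                lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
                spin_lock_irq(&lruvec->lru_lock);

                /* Recheck under the lock: move_account may have raced. */
                if (lruvec == mem_cgroup_page_lruvec(page, page_pgdat(page)))
                        return lruvec;

                spin_unlock_irq(&lruvec->lru_lock);
        }
}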

But I guess still worth sending my first two patches, since most of us
use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
Whereas compaction bugs can hit any of us at any time.

> >
> > There's one other unexplained and unfixed bug I've seen several times
> > while exercising mem_cgroup_move_account(): refcount_warn_saturate()
> > from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
> > I'll be glad if that goes away when the lruvec locking is fixed,
> > but don't understand the connection. And it's quite possible that
> > this refcounting bug has nothing to do with your changes: I have
> > not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
> > but I didn't really try long enough to be sure.

I got one of those quite quickly too after setting "cg m"'s seconds=1.
I think the best thing I can do while thinking and researching, is
give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
to see whether or not that refcount bug reproduces.

> >
> > (I should also warn, that I'm surprised by the amount of change
> > 11/16 makes to mm/mlock.c: I've not been exercising mlock at all.)
>
> Yes, that is a bit complex. I have tried the mlock cases from the selftests
> together with your swap & build case. They were all fine across 300 runs.

Good, thanks.

Hugh

2020-06-12 10:46:26

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock



On 2020/6/12 6:09 AM, Hugh Dickins wrote:
>> Anyway, I will send out a new
>> patchset with the first issue fixed, and then let's discuss based on it.
> Sigh. I wish you had waited for me to send you fixes, or waited for an
> identifiable tag like 5.8-rc1. Andrew has been very hard at work with
> mm patches to Linus, but it looks like there are still "data_race" mods
> to come before -rc1, which may stop your v12 from applying cleanly.

Sorry, I wasn't aware you had another set of fixes to send... My fault.
And yes, an official 5.8-rc is a better base.

>
>>> In the second, I was using rcu_read_lock() instead of trylock_page()
>>> (like in my own patchset), but could not quite be sure of the case when
>>> PageSwapCache gets set at the wrong moment. Gave up for the night, and
>>> in the morning abandoned that, instead just shifting the call to
>>> __isolate_lru_page_prepare() after the get_page_unless_zero(),
>>> where that trylock_page() becomes safe (no danger of stomping on page
>>> flags while page is being freed or newly allocated to another owner).
>> Sorry, I don't see the problem with trylock_page() here. Could you
>> describe it as a race?
> Races, yes. Look, I'll send you now patches 1 and 2: at least with those
> in it should be safe for you and others to test compaction (if 5.8-rc1
> turns out well: I think so much has gone in that it might have unrelated
> problems, and often the -rc2 is much more stable).
>
> But no point in sending 3 and 4 at this point, since ...
>

I guess some of the concern may come from the mm bug described next?

>>> I thought that a very safe change, but best to do some test runs with
>>> it in before finalizing. And was then unpleasantly surprised to hit a
>>> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
>>> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
>>> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
>>> Then similar but < rotate_reclaimable_page after 8 hours on another.
>>>
>>> Only seen once before: that's what drove me to add patch 4 (with 3 to
>>> revert the locking before it): somehow, when adding the lruvec locking
>>> there, I just took it for granted that your patchset would have the
>>> appropriate locking (or TestClearPageLRU magic) at the other end.
>>>
>>> But apparently not. And I'm beginning to think that TestClearPageLRU
>>> was just to distract the audience from the lack of proper locking.
>>>
>>> I have certainly not concluded that yet, but I'm having to think about
>>> an area of the code which I'd imagined you had under control (and I'm
>>> puzzled why my testing has found it so very hard to hit). If we're
>>> lucky, I'll find that pagevec_move_tail is a special case, and
>>> nothing much else needs changing; but I doubt that will be so.
> ... shows that your locking primitives are not yet good enough
> to handle the case when tasks are moved between memcgs with
> move_charge_at_immigrate set. "bin/cg m" in the tests I sent,
> but today I'm changing its "seconds=60" to "seconds=1" in hope
> of speeding up the reproduction.

Yes, I am using your great test cases with the 'm' parameter to do migration
testing, but unluckily no error has been found on my box.

>
> Ah, good, two machines crashed in 1.5 hours: but I don't need to
> examine the crashes, now that it's obvious there's no protection -
> please, think about rotate_reclaimable_page() (there will be more
> cases, but in practice that seems easiest to hit, so focus on that)
> and how it is not protected from mem_cgroup_move_account().
> > I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
> (8 years ago it was unsuitable, but a lot has changed for the better
> since then); otherwise it's back to what I've been doing all along,
> taking the likely lruvec lock, and checking under that lock whether
> we have the right lock (as your lruvec_memcg_debug() does), retrying
> if not. Which may be more efficient than involving lock_page_memcg().
>
> But I guess still worth sending my first two patches, since most of us
> use move_charge_at_immigrate only for... testing move_charge_at_immigrate.
> Whereas compaction bugs can hit any of us at any time.
>
>>> There's one other unexplained and unfixed bug I've seen several times
>>> while exercising mem_cgroup_move_account(): refcount_warn_saturate()
>>> from where __mem_cgroup_clear_mc() calls mem_cgroup_id_get_many().
>>> I'll be glad if that goes away when the lruvec locking is fixed,
>>> but don't understand the connection. And it's quite possible that
>>> this refcounting bug has nothing to do with your changes: I have
>>> not succeeded in reproducing it on 5.7 nor on 5.7-rc7-mm1,
>>> but I didn't really try long enough to be sure.
> I got one of those quite quickly too after setting "cg m"'s seconds=1.
> I think the best thing I can do while thinking and researching, is
> give 5.7-rc7-mm1 a run on that machine with the speeded up moving,
> to see whether or not that refcount bug reproduces.
>

Many thanks for your help on this patchset!

Alex

2020-06-16 06:17:43

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH v11 00/16] per memcg lru lock



On 2020/6/12 6:09 AM, Hugh Dickins wrote:
>>> I thought that a very safe change, but best to do some test runs with
>>> it in before finalizing. And was then unpleasantly surprised to hit a
>>> VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != page->mem_cgroup) from
>>> lock_page_lruvec_irqsave < relock_page_lruvec < pagevec_lru_move_fn <
>>> pagevec_move_tail < lru_add_drain_cpu after 6 hours on one machine.
>>> Then similar but < rotate_reclaimable_page after 8 hours on another.
>>>
>>> Only seen once before: that's what drove me to add patch 4 (with 3 to
>>> revert the locking before it): somehow, when adding the lruvec locking
>>> there, I just took it for granted that your patchset would have the
>>> appropriate locking (or TestClearPageLRU magic) at the other end.
>>>
>>> But apparently not. And I'm beginning to think that TestClearPageLRU
>>> was just to distract the audience from the lack of proper locking.
>>>
>>> I have certainly not concluded that yet, but I'm having to think about
>>> an area of the code which I'd imagined you had under control (and I'm
>>> puzzled why my testing has found it so very hard to hit). If we're
>>> lucky, I'll find that pagevec_move_tail is a special case, and
>>> nothing much else needs changing; but I doubt that will be so.
> ... shows that your locking primitives are not yet good enough
> to handle the case when tasks are moved between memcgs with
> move_charge_at_immigrate set. "bin/cg m" in the tests I sent,
> but today I'm changing its "seconds=60" to "seconds=1" in hope
> of speeding up the reproduction.
>
> Ah, good, two machines crashed in 1.5 hours: but I don't need to
> examine the crashes, now that it's obvious there's no protection -
> please, think about rotate_reclaimable_page() (there will be more
> cases, but in practice that seems easiest to hit, so focus on that)
> and how it is not protected from mem_cgroup_move_account().
>
> I'm thinking too. Maybe judicious use of lock_page_memcg() can fix it
> (8 years ago it was unsuitable, but a lot has changed for the better
> since then); otherwise it's back to what I've been doing all along,
> taking the likely lruvec lock, and checking under that lock whether
> we have the right lock (as your lruvec_memcg_debug() does), retrying
> if not. Which may be more efficient than involving lock_page_memcg().
>
Hi Hugh,

Thanks a lot for the report!

Thinking again about the relation between lru_move_fn and mem_cgroup_move_account,
I found that if we want to change pgdat->lru_lock to the memcg's lruvec lock, we
have to serialize mem_cgroup_move_account against pagevec_lru_move_fn. Otherwise
the bad scenario could look like this:

	cpu 0					cpu 1
	lruvec = mem_cgroup_page_lruvec()
						if (!isolate_lru_page())
							mem_cgroup_move_account

	spin_lock_irqsave(&lruvec->lru_lock)	<== wrong lock.

So we need ClearPageLRU to block isolate_lru_page(), and thereby serialize the
memcg change here. Doing a relock re-check would only mitigate the race, not
solve it.
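
To make the ordering concrete, here is a minimal illustrative sketch of the idea
(the helper name lru_move_one_page is hypothetical and only for exposition; the
real change is the patch below):

/*
 * Illustrative sketch only (hypothetical helper, not in the patch):
 * the ordering that closes the race shown above.  move_fn is one of
 * pagevec_move_tail_fn, lru_deactivate_fn, etc.; the caller drops the
 * last held lruvec lock with unlock_page_lruvec_irqrestore() afterwards.
 */
static void lru_move_one_page(struct page *page, struct lruvec **lruvecp,
			      unsigned long *flags,
			      void (*move_fn)(struct page *, struct lruvec *))
{
	/*
	 * Whoever clears PageLRU first owns the page: the move_charge
	 * path isolates pages with isolate_lru_page(), which also uses
	 * TestClearPageLRU, so once we win here it cannot isolate this
	 * page and mem_cgroup_move_account() cannot change
	 * page->mem_cgroup under us.
	 */
	if (!TestClearPageLRU(page))
		return;

	/* page->mem_cgroup is stable now, so this lruvec lock is the right one. */
	*lruvecp = relock_page_lruvec_irqsave(page, *lruvecp, flags);
	(*move_fn)(page, *lruvecp);

	/* Only after the list move is done may the page be isolated again. */
	SetPageLRU(page);
}

Whichever side clears PageLRU first wins; the loser simply skips the page, so
the wrong-lock window in the diagram above cannot occur.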

The following patch folds the PGROTATED vm event accounting into
pagevec_move_tail_fn and fixes this problem by clearing PageLRU before a page
is moved between LRU lists. I will split it into 2 patches and merge them into
the v12 patchset.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Alex Shi <[email protected]>


diff --git a/mm/swap.c b/mm/swap.c
index eba0c17dffd8..fa211157bfec 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -200,8 +200,7 @@ int get_kernel_page(unsigned long start, int write, struct page **pages)
EXPORT_SYMBOL_GPL(get_kernel_page);

static void pagevec_lru_move_fn(struct pagevec *pvec,
- void (*move_fn)(struct page *page, struct lruvec *lruvec, void *arg),
- void *arg)
+ void (*move_fn)(struct page *page, struct lruvec *lruvec), bool add)
{
int i;
struct lruvec *lruvec = NULL;
@@ -210,8 +209,14 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];

+ if (!add && !TestClearPageLRU(page))
+ continue;
+
lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
- (*move_fn)(page, lruvec, arg);
+ (*move_fn)(page, lruvec);
+
+ if (!add)
+ SetPageLRU(page);
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -219,35 +224,23 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
pagevec_reinit(pvec);
}

-static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
- int *pgmoved = arg;
-
if (PageLRU(page) && !PageUnevictable(page)) {
del_page_from_lru_list(page, lruvec, page_lru(page));
ClearPageActive(page);
add_page_to_lru_list_tail(page, lruvec, page_lru(page));
- (*pgmoved) += hpage_nr_pages(page);
+ __count_vm_events(PGROTATED, hpage_nr_pages(page));
}
}

/*
- * pagevec_move_tail() must be called with IRQ disabled.
- * Otherwise this may cause nasty races.
- */
-static void pagevec_move_tail(struct pagevec *pvec)
-{
- int pgmoved = 0;
-
- pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, &pgmoved);
- __count_vm_events(PGROTATED, pgmoved);
-}
-
-/*
* Writeback is about to end against a page which has been marked for immediate
* reclaim. If it still appears to be reclaimable, move it to the tail of the
* inactive list.
+ *
+ * pagevec_move_tail_fn() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
*/
void rotate_reclaimable_page(struct page *page)
{
@@ -260,7 +253,7 @@ void rotate_reclaimable_page(struct page *page)
local_lock_irqsave(&lru_rotate.lock, flags);
pvec = this_cpu_ptr(&lru_rotate.pvec);
if (!pagevec_add(pvec, page) || PageCompound(page))
- pagevec_move_tail(pvec);
+ pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
local_unlock_irqrestore(&lru_rotate.lock, flags);
}
}
@@ -302,8 +295,7 @@ void lru_note_cost_page(struct page *page)
page_is_file_lru(page), hpage_nr_pages(page));
}

-static void __activate_page(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void __activate_page(struct page *page, struct lruvec *lruvec)
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
int lru = page_lru_base_type(page);
@@ -327,7 +319,7 @@ static void activate_page_drain(int cpu)
struct pagevec *pvec = &per_cpu(lru_pvecs.activate_page, cpu);

if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, __activate_page, NULL);
+ pagevec_lru_move_fn(pvec, __activate_page, false);
}

static bool need_activate_page_drain(int cpu)
@@ -345,7 +337,7 @@ void activate_page(struct page *page)
pvec = this_cpu_ptr(&lru_pvecs.activate_page);
get_page(page);
if (!pagevec_add(pvec, page) || PageCompound(page))
- pagevec_lru_move_fn(pvec, __activate_page, NULL);
+ pagevec_lru_move_fn(pvec, __activate_page, false);
local_unlock(&lru_pvecs.lock);
}
}
@@ -515,8 +507,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
* be write it out by flusher threads as this is much more effective
* than the single-page writeout from reclaim.
*/
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
{
int lru;
bool active;
@@ -563,8 +554,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
}
}

-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
{
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
int lru = page_lru_base_type(page);
@@ -581,8 +571,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
}
}

-static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
{
if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
!PageSwapCache(page) && !PageUnevictable(page)) {
@@ -625,21 +614,21 @@ void lru_add_drain_cpu(int cpu)

/* No harm done if a racing interrupt already did this */
local_lock_irqsave(&lru_rotate.lock, flags);
- pagevec_move_tail(pvec);
+ pagevec_lru_move_fn(pvec, pagevec_move_tail_fn, false);
local_unlock_irqrestore(&lru_rotate.lock, flags);
}

pvec = &per_cpu(lru_pvecs.lru_deactivate_file, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);

pvec = &per_cpu(lru_pvecs.lru_deactivate, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);

pvec = &per_cpu(lru_pvecs.lru_lazyfree, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);

activate_page_drain(cpu);
}
@@ -668,7 +657,7 @@ void deactivate_file_page(struct page *page)
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate_file);

if (!pagevec_add(pvec, page) || PageCompound(page))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, false);
local_unlock(&lru_pvecs.lock);
}
}
@@ -690,7 +679,7 @@ void deactivate_page(struct page *page)
pvec = this_cpu_ptr(&lru_pvecs.lru_deactivate);
get_page(page);
if (!pagevec_add(pvec, page) || PageCompound(page))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, false);
local_unlock(&lru_pvecs.lock);
}
}
@@ -712,7 +701,7 @@ void mark_page_lazyfree(struct page *page)
pvec = this_cpu_ptr(&lru_pvecs.lru_lazyfree);
get_page(page);
if (!pagevec_add(pvec, page) || PageCompound(page))
- pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, false);
local_unlock(&lru_pvecs.lock);
}
}
@@ -913,8 +902,7 @@ void __pagevec_release(struct pagevec *pvec)
}
EXPORT_SYMBOL(__pagevec_release);

-static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
{
enum lru_list lru;
int was_unevictable = TestClearPageUnevictable(page);
@@ -973,7 +961,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, NULL);
+ pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn, true);
}

/**