2012-02-14 03:05:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 0/6 v4] memcg: page cgroup diet


Here is v4. I removed the RFC tag and fixed a fatal bug in v3.

This patch series is a preparation for removing page_cgroup->flags. To remove
it, we need to reduce the number of flags on page_cgroup. In this set, PCG_MOVE_LOCK
and PCG_FILE_MAPPED are removed.

After this, we only have 3 bits.
==
enum {
        /* flags for mem_cgroup */
        PCG_LOCK,      /* Lock for pc->mem_cgroup and following bits. */
        PCG_USED,      /* this object is in use. */
        PCG_MIGRATION, /* under page migration */
        __NR_PCG_FLAGS,
};
==

I'll try to merge these 3 bits into the lower 3 bits of the pointer to
the memcg. Then, we can make page_cgroup 8 (4) bytes per page.
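
As a rough illustration of that direction (a hypothetical userspace sketch only,
not the actual memcg code; it assumes struct mem_cgroup is at least 8-byte
aligned, so the low 3 bits of a pointer to it are always zero):
==
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* stand-ins for the 3 remaining flags */
#define PCG_LOCK        0x1UL
#define PCG_USED        0x2UL
#define PCG_MIGRATION   0x4UL
#define PCG_FLAG_MASK   0x7UL

struct mem_cgroup { long dummy; };      /* placeholder for the real struct */

/* fold the flags into the low bits of the (8-byte aligned) pointer */
static uintptr_t pack(struct mem_cgroup *memcg, unsigned long flags)
{
        assert(((uintptr_t)memcg & PCG_FLAG_MASK) == 0);
        return (uintptr_t)memcg | (flags & PCG_FLAG_MASK);
}

/* split them out again */
static struct mem_cgroup *unpack(uintptr_t v, unsigned long *flags)
{
        *flags = v & PCG_FLAG_MASK;
        return (struct mem_cgroup *)(v & ~PCG_FLAG_MASK);
}

int main(void)
{
        struct mem_cgroup cg;
        unsigned long flags;
        uintptr_t v = pack(&cg, PCG_USED | PCG_LOCK);

        printf("memcg ok: %d flags: %#lx\n", unpack(v, &flags) == &cg, flags);
        return 0;
}
==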


Major changes since v3:
- replaced move_lock_page_cgroup() with move_lock_mem_cgroup(), which
  takes a pointer to memcg rather than page_cgroup.

This works on linux-next (Feb 13) and has passed several tests.

Thanks,
-Kame


2012-02-14 03:08:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)


This is just a cleanup.
==
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 2 Feb 2012 12:05:41 +0900
Subject: [PATCH 1/6] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)

From the log, I guess EXPORT was for preparing dirty accounting.
But _now_, we don't need to export this. Remove this for now.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ab315ab..4c2b759 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1897,7 +1897,6 @@ out:
move_unlock_page_cgroup(pc, &flags);
rcu_read_unlock();
}
-EXPORT_SYMBOL(mem_cgroup_update_page_stat);

/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
--
1.7.4.1

2012-02-14 03:09:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 2/6 v4] memcg: simplify move_account() check

From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 2 Feb 2012 10:02:39 +0900
Subject: [PATCH 2/6] memcg: simplify move_account() check.

In memcg, for avoiding take-lock-irq-off at accessing page_cgroup,
a logic, flag + rcu_read_lock(), is used. This works as following

     CPU-A                     CPU-B
                             rcu_read_lock()
    set flag
                             if(flag is set)
                                   take heavy lock
                             do job.
    synchronize_rcu()        rcu_read_unlock()

In a recent discussion, it was argued that using a per-cpu value for this
flag just complicates the code because 'set flag' is very rare.

This patch changes the 'flag' implementation from per-cpu to atomic_t.
This will be much simpler.
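
Schematically, the new pattern looks like the following (a simplified
kernel-style sketch with made-up names such as heavy_lock; not the exact
memcg code):
==
#include <linux/atomic.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

static atomic_t moving_account = ATOMIC_INIT(0);
static DEFINE_SPINLOCK(heavy_lock);

/* writer side, corresponds to mem_cgroup_start_move() */
static void start_move(void)
{
        atomic_inc(&moving_account);
        synchronize_rcu();      /* all new readers now see the flag */
        /* ... take heavy_lock and move accounting ... */
}

/* reader side, corresponds to the page stat update path */
static void update_stat(void)
{
        unsigned long flags;

        rcu_read_lock();
        if (atomic_read(&moving_account) > 0) {
                /* a mover may be running: take the heavy lock */
                spin_lock_irqsave(&heavy_lock, flags);
                /* ... do the update ... */
                spin_unlock_irqrestore(&heavy_lock, flags);
        } else {
                /* ... lockless fast path ... */
        }
        rcu_read_unlock();
}
==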

Changelog: v4
- fixed many typos.
- fixed return value to be bool
- added comments.
Changelog: v3
- this is a new patch since v3.

memcontrol.c | 65 ++++++++++++++++++++++-------------------------------------
1 file changed, 25 insertions(+), 40 deletions(-)

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 70 +++++++++++++++++++++++-------------------------------
1 files changed, 30 insertions(+), 40 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4c2b759..9f69679 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -89,7 +89,6 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
- MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
MEM_CGROUP_STAT_NSTATS,
};

@@ -278,6 +277,10 @@ struct mem_cgroup {
*/
unsigned long move_charge_at_immigrate;
/*
+ * set > 0 if pages under this cgroup are moving to other cgroup.
+ */
+ atomic_t moving_account;
+ /*
* percpu counter.
*/
struct mem_cgroup_stat_cpu *stat;
@@ -1253,36 +1256,37 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)

return memcg->swappiness;
}
+/*
+ * memcg->moving_account is used for checking possibility that some thread is
+ * calling move_account(). When a thread on CPU-A starts moving pages under
+ * a memcg, other threads should check memcg->moving_account under
+ * rcu_read_lock(), like this:
+ *
+ *        CPU-A                        CPU-B
+ *                                     rcu_read_lock()
+ *        memcg->moving_account+1      if (memcg->moving_account)
+ *                                         take heavy locks.
+ *        synchronize_rcu()            update something.
+ *                                     rcu_read_unlock()
+ *        start move here.
+ */

static void mem_cgroup_start_move(struct mem_cgroup *memcg)
{
- int cpu;
-
- get_online_cpus();
- spin_lock(&memcg->pcp_counter_lock);
- for_each_online_cpu(cpu)
- per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
- memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] += 1;
- spin_unlock(&memcg->pcp_counter_lock);
- put_online_cpus();
-
+ atomic_inc(&memcg->moving_account);
synchronize_rcu();
}

static void mem_cgroup_end_move(struct mem_cgroup *memcg)
{
- int cpu;
-
- if (!memcg)
- return;
- get_online_cpus();
- spin_lock(&memcg->pcp_counter_lock);
- for_each_online_cpu(cpu)
- per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
- memcg->nocpu_base.count[MEM_CGROUP_ON_MOVE] -= 1;
- spin_unlock(&memcg->pcp_counter_lock);
- put_online_cpus();
+ /*
+ * Now, mem_cgroup_clear_mc() may call this function with NULL.
+ * We check NULL in callee rather than caller.
+ */
+ if (memcg)
+ atomic_dec(&memcg->moving_account);
}
+
/*
* 2 routines for checking "mem" is under move_account() or not.
*
@@ -1298,7 +1302,7 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
static bool mem_cgroup_stealed(struct mem_cgroup *memcg)
{
VM_BUG_ON(!rcu_read_lock_held());
- return this_cpu_read(memcg->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+ return atomic_read(&memcg->moving_account) > 0;
}

static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
@@ -1849,8 +1853,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
* by flags.
*
* Considering "move", this is an only case we see a race. To make the race
- * small, we check MEM_CGROUP_ON_MOVE percpu value and detect there are
- * possibility of race condition. If there is, we take a lock.
+ small, we check memcg->moving_account and detect there is a possibility of race.
+ * If there is, we take a lock.
*/

void mem_cgroup_update_page_stat(struct page *page,
@@ -2067,17 +2071,6 @@ static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
per_cpu(memcg->stat->events[i], cpu) = 0;
memcg->nocpu_base.events[i] += x;
}
- /* need to clear ON_MOVE value, works as a kind of lock. */
- per_cpu(memcg->stat->count[MEM_CGROUP_ON_MOVE], cpu) = 0;
- spin_unlock(&memcg->pcp_counter_lock);
-}
-
-static void synchronize_mem_cgroup_on_move(struct mem_cgroup *memcg, int cpu)
-{
- int idx = MEM_CGROUP_ON_MOVE;
-
- spin_lock(&memcg->pcp_counter_lock);
- per_cpu(memcg->stat->count[idx], cpu) = memcg->nocpu_base.count[idx];
spin_unlock(&memcg->pcp_counter_lock);
}

@@ -2089,11 +2082,8 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
struct memcg_stock_pcp *stock;
struct mem_cgroup *iter;

- if ((action == CPU_ONLINE)) {
- for_each_mem_cgroup(iter)
- synchronize_mem_cgroup_on_move(iter, cpu);
+ if ((action == CPU_ONLINE))
return NOTIFY_OK;
- }

if ((action != CPU_DEAD) || action != CPU_DEAD_FROZEN)
return NOTIFY_OK;
--
1.7.4.1

2012-02-14 03:14:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.

From ffd1b013fe294a80c12e3f30e85386135a6fb284 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 2 Feb 2012 11:49:59 +0900
Subject: [PATCH 3/6] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.

PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
pc->mem_cgroup and page statistics accounting per memcg.
This lock helps to avoid the race but the race is very rare because moving
tasks between cgroup is not a usual job.
So, it seems using 1bit per page is too costly.

This patch changes this lock as per-memcg spinlock and removes PCG_MOVE_LOCK.

If smaller lock is required, we'll be able to add some hashes but
I'd like to start from this.
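
(For reference only, the "some hashes" idea could look roughly like the sketch
below; move_locks, MOVE_LOCK_HASH_BITS and move_lock_for() are made-up names
and are not part of this series.)
==
#include <linux/hash.h>
#include <linux/spinlock.h>

#define MOVE_LOCK_HASH_BITS     4       /* 16 locks shared by all memcgs */

/* each entry would need spin_lock_init() at boot */
static spinlock_t move_locks[1 << MOVE_LOCK_HASH_BITS];

/* pick a lock from the hashed array instead of one lock per memcg */
static spinlock_t *move_lock_for(struct mem_cgroup *memcg)
{
        return &move_locks[hash_ptr(memcg, MOVE_LOCK_HASH_BITS)];
}
==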

Changelog:
- fixed to pass memcg as an argument rather than page_cgroup.
and renamed from move_lock_page_cgroup() to move_lock_mem_cgroup()

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 19 -------------------
mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++----------
2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 1060292..7a3af74 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -7,7 +7,6 @@ enum {
PCG_USED, /* this object is in use. */
PCG_MIGRATION, /* under page migration */
/* flags for mem_cgroup and file and I/O status */
- PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
__NR_PCG_FLAGS,
};
@@ -89,24 +88,6 @@ static inline void unlock_page_cgroup(struct page_cgroup *pc)
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

-static inline void move_lock_page_cgroup(struct page_cgroup *pc,
- unsigned long *flags)
-{
- /*
- * We know updates to pc->flags of page cache's stats are from both of
- * usual context or IRQ context. Disable IRQ to avoid deadlock.
- */
- local_irq_save(*flags);
- bit_spin_lock(PCG_MOVE_LOCK, &pc->flags);
-}
-
-static inline void move_unlock_page_cgroup(struct page_cgroup *pc,
- unsigned long *flags)
-{
- bit_spin_unlock(PCG_MOVE_LOCK, &pc->flags);
- local_irq_restore(*flags);
-}
-
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9f69679..ecf8856 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,8 @@ struct mem_cgroup {
* set > 0 if pages under this cgroup are moving to other cgroup.
*/
atomic_t moving_account;
+ /* taken only while moving_account > 0 */
+ spinlock_t move_lock;
/*
* percpu counter.
*/
@@ -1343,6 +1345,24 @@ static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
return false;
}

+/*
+ * Take this lock when
+ * - a code tries to modify page's memcg while it's USED.
+ * - a code tries to modify page state accounting in a memcg.
+ * see mem_cgroup_stealed(), too.
+ */
+static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
+ unsigned long *flags)
+{
+ spin_lock_irqsave(&memcg->move_lock, *flags);
+}
+
+static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
+ unsigned long *flags)
+{
+ spin_unlock_irqrestore(&memcg->move_lock, *flags);
+}
+
/**
* mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
* @memcg: The memory cgroup that went over limit
@@ -1867,7 +1887,7 @@ void mem_cgroup_update_page_stat(struct page *page,

if (mem_cgroup_disabled())
return;
-
+again:
rcu_read_lock();
memcg = pc->mem_cgroup;
if (unlikely(!memcg || !PageCgroupUsed(pc)))
@@ -1875,11 +1895,13 @@ void mem_cgroup_update_page_stat(struct page *page,
/* pc->mem_cgroup is unstable ? */
if (unlikely(mem_cgroup_stealed(memcg))) {
/* take a lock against to access pc->mem_cgroup */
- move_lock_page_cgroup(pc, &flags);
+ move_lock_mem_cgroup(memcg, &flags);
+ if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+ move_unlock_mem_cgroup(memcg, &flags);
+ rcu_read_unlock();
+ goto again;
+ }
need_unlock = true;
- memcg = pc->mem_cgroup;
- if (!memcg || !PageCgroupUsed(pc))
- goto out;
}

switch (idx) {
@@ -1898,7 +1920,7 @@ void mem_cgroup_update_page_stat(struct page *page,

out:
if (unlikely(need_unlock))
- move_unlock_page_cgroup(pc, &flags);
+ move_unlock_mem_cgroup(memcg, &flags);
rcu_read_unlock();
}

@@ -2440,8 +2462,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *memcg,

#ifdef CONFIG_TRANSPARENT_HUGEPAGE

-#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MOVE_LOCK) |\
- (1 << PCG_MIGRATION))
+#define PCGF_NOCOPY_AT_SPLIT ((1 << PCG_LOCK) | (1 << PCG_MIGRATION))
/*
* Because tail pages are not marked as "used", set it. We're under
* zone->lru_lock, 'splitting on pmd' and compound_lock.
@@ -2512,7 +2533,7 @@ static int mem_cgroup_move_account(struct page *page,
if (!PageCgroupUsed(pc) || pc->mem_cgroup != from)
goto unlock;

- move_lock_page_cgroup(pc, &flags);
+ move_lock_mem_cgroup(from, &flags);

if (PageCgroupFileMapped(pc)) {
/* Update mapped_file data for mem_cgroup */
@@ -2536,7 +2557,7 @@ static int mem_cgroup_move_account(struct page *page,
* guaranteed that "to" is never removed. So, we don't check rmdir
* status here.
*/
- move_unlock_page_cgroup(pc, &flags);
+ move_unlock_mem_cgroup(from, &flags);
ret = 0;
unlock:
unlock_page_cgroup(pc);
@@ -4928,6 +4949,7 @@ mem_cgroup_create(struct cgroup *cont)
atomic_set(&memcg->refcnt, 1);
memcg->move_charge_at_immigrate = 0;
mutex_init(&memcg->thresholds_lock);
+ spin_lock_init(&memcg->move_lock);
return &memcg->css;
free_out:
__mem_cgroup_free(memcg);
--
1.7.4.1

2012-02-14 03:15:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 4/6 v4] memcg: use new logic for page stat accounting

From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 2 Feb 2012 11:49:59 +0900
Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.

Now, page-stat-per-memcg is recorded into per page_cgroup flag by
duplicating page's status into the flag. The reason is that memcg
has a feature to move a page from a group to another group and we
have race between "move" and "page stat accounting",

Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
and CPU-B does "page stat accounting".

When CPU-A goes 1st,

            CPU-A                           CPU-B
                                    update "struct page" info.
    move_lock_mem_cgroup(memcg)
    see flags
    copy page stat to new group
    overwrite pc->mem_cgroup.
    move_unlock_mem_cgroup(memcg)
                                    move_lock_mem_cgroup(mem)
                                    set pc->flags
                                    update page stat accounting
                                    move_unlock_mem_cgroup(mem)

stat accounting is guarded by move_lock_mem_cgroup() and "move"
logic (CPU-A) doesn't see changes in "struct page" information.

But it's costly to have the same information both in 'struct page' and
'struct page_cgroup'. And, there is a potential problem.

For example, assume we have PG_dirty accounting in memcg.
PG_..is a flag for struct page.
PCG_ is a flag for struct page_cgroup.
(This is just an example. The same problem can be found in any
kind of page stat accounting.)

          CPU-A                               CPU-B
      TestSet PG_dirty
      (delay)                        TestClear PG_dirty_
                                     if (TestClear(PCG_dirty))
                                          memcg->nr_dirty--
      if (TestSet(PCG_dirty))
          memcg->nr_dirty++

Here, memcg->nr_dirty = +1; this is wrong.
This race was reported by Greg Thelen <[email protected]>.
Now, only FILE_MAPPED is supported, but fortunately it's serialized by
the page table lock, so this is not a real bug _now_.

Since this potential problem is caused by having duplicated information in
struct page and struct page_cgroup, we may be able to fix it by using
only the original 'struct page' information.
But then we'll have a problem with "move account":

Assume we use only PG_dirty.

     CPU-A                          CPU-B
 TestSet PG_dirty
 (delay)                    move_lock_mem_cgroup()
                            if (PageDirty(page))
                                    new_memcg->nr_dirty++
                            pc->mem_cgroup = new_memcg;
                            move_unlock_mem_cgroup()
 move_lock_mem_cgroup()
 memcg = pc->mem_cgroup
 new_memcg->nr_dirty++

Accounting information may be double-counted. This was the original
reason to have the PCG_xxx flags, but it seems PCG_xxx has another problem.

I think we need a bigger lock, as in:

move_lock_mem_cgroup(page)
TestSetPageDirty(page)
update page stats (without any checks)
move_unlock_mem_cgroup(page)

This fixes both problems and we don't have to duplicate the page flag
into page_cgroup. Please note: move_lock_mem_cgroup() is held
only when there is a possibility of "account move" in the system.
So, in most paths, status updates will go without atomic locks.

This patch introduces mem_cgroup_begin_update_page_stat() and
mem_cgroup_end_update_page_stat(); both should be called when modifying
'struct page' information that memcg takes care of, as in:

mem_cgroup_begin_update_page_stat()
modify page information
mem_cgroup_update_page_stat()
=> never check any 'struct page' info, just update counters.
mem_cgroup_end_update_page_stat().
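
For example, a hypothetical dirty-page accounting site (not part of this
series; MEMCG_NR_FILE_DIRTY and account_page_dirtied_memcg() are made up
for illustration) would look like:
==
void account_page_dirtied_memcg(struct page *page)
{
        bool locked;
        unsigned long flags;

        mem_cgroup_begin_update_page_stat(page, &locked, &flags);
        /*
         * modify 'struct page' info and update the counter under the same
         * protection; no PCG_ flag is needed any more.
         */
        if (!TestSetPageDirty(page))
                mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
        mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
==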

This patch is slow because we need to call begin_update_page_stat()/
end_update_page_stat() regardless of whether the accounting will be changed
or not. A following patch adds an easy optimization to reduce the cost.

Changes since v3
- fixed typos

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 35 +++++++++++++++++++++++++++
mm/memcontrol.c | 57 ++++++++++++++++++++++++++++---------------
mm/rmap.c | 15 ++++++++++-
3 files changed, 85 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bf4e1f4..3f3ef33 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
return false;
}

+void __mem_cgroup_begin_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags);
+
+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+ if (mem_cgroup_disabled())
+ return;
+ rcu_read_lock();
+ *lock = false;
+ return __mem_cgroup_begin_update_page_stat(page, lock, flags);
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page, bool *lock,
+ unsigned long *flags);
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+ if (mem_cgroup_disabled())
+ return;
+ if (*lock)
+ __mem_cgroup_end_update_page_stat(page, lock, flags);
+ rcu_read_unlock();
+}
+
void mem_cgroup_update_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx,
int val);
@@ -356,6 +381,16 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_begin_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+}
+
+static inline void mem_cgroup_end_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ecf8856..30afea5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
* If there is, we take a lock.
*/

+void __mem_cgroup_begin_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+ struct mem_cgroup *memcg;
+ struct page_cgroup *pc;
+
+ pc = lookup_page_cgroup(page);
+again:
+ memcg = pc->mem_cgroup;
+ if (unlikely(!memcg || !PageCgroupUsed(pc)))
+ return;
+ if (!mem_cgroup_stealed(memcg))
+ return;
+
+ move_lock_mem_cgroup(memcg, flags);
+ if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
+ move_unlock_mem_cgroup(memcg, flags);
+ goto again;
+ }
+ *lock = true;
+}
+
+void __mem_cgroup_end_update_page_stat(struct page *page,
+ bool *lock, unsigned long *flags)
+{
+ struct page_cgroup *pc = lookup_page_cgroup(page);
+
+ /*
+ * It's guaranteed that pc->mem_cgroup never changes while
+ * lock is held
+ */
+ move_unlock_mem_cgroup(pc->mem_cgroup, flags);
+}
+
+
void mem_cgroup_update_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx, int val)
{
struct mem_cgroup *memcg;
struct page_cgroup *pc = lookup_page_cgroup(page);
- bool need_unlock = false;
unsigned long uninitialized_var(flags);

if (mem_cgroup_disabled())
return;
-again:
- rcu_read_lock();
+
memcg = pc->mem_cgroup;
if (unlikely(!memcg || !PageCgroupUsed(pc)))
- goto out;
- /* pc->mem_cgroup is unstable ? */
- if (unlikely(mem_cgroup_stealed(memcg))) {
- /* take a lock against to access pc->mem_cgroup */
- move_lock_mem_cgroup(memcg, &flags);
- if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
- move_unlock_mem_cgroup(memcg, &flags);
- rcu_read_unlock();
- goto again;
- }
- need_unlock = true;
- }
+ return;

switch (idx) {
case MEMCG_NR_FILE_MAPPED:
@@ -1917,11 +1939,6 @@ again:
}

this_cpu_add(memcg->stat->count[idx], val);
-
-out:
- if (unlikely(need_unlock))
- move_unlock_mem_cgroup(memcg, &flags);
- rcu_read_unlock();
}

/*
diff --git a/mm/rmap.c b/mm/rmap.c
index aa547d4..dd193db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1150,10 +1150,15 @@ void page_add_new_anon_rmap(struct page *page,
*/
void page_add_file_rmap(struct page *page)
{
+ bool locked;
+ unsigned long flags;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED);
}
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
}

/**
@@ -1164,9 +1169,13 @@ void page_add_file_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page)
{
+ bool locked;
+ unsigned long flags;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
/* page still mapped by someone else? */
if (!atomic_add_negative(-1, &page->_mapcount))
- return;
+ goto out;

/*
* Now that the last pte has gone, s390 must transfer dirty
@@ -1183,7 +1192,7 @@ void page_remove_rmap(struct page *page)
* and not charged by memcg for now.
*/
if (unlikely(PageHuge(page)))
- return;
+ goto out;
if (PageAnon(page)) {
mem_cgroup_uncharge_page(page);
if (!PageTransHuge(page))
@@ -1204,6 +1213,8 @@ void page_remove_rmap(struct page *page)
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/
+out:
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
}

/*
--
1.7.4.1

2012-02-14 03:16:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED

From 96407a510d5212179a4768f28591b35d5c17d403 Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Thu, 2 Feb 2012 15:02:18 +0900
Subject: [PATCH 5/6] memcg: remove PCG_FILE_MAPPED

with new lock scheme for updating memcg's page stat, we don't need
a flag PCG_FILE_MAPPED which was duplicated information of page_mapped().

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 6 ------
mm/memcontrol.c | 6 +-----
2 files changed, 1 insertions(+), 11 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 7a3af74..a88cdba 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -6,8 +6,6 @@ enum {
PCG_LOCK, /* Lock for pc->mem_cgroup and following bits. */
PCG_USED, /* this object is in use. */
PCG_MIGRATION, /* under page migration */
- /* flags for mem_cgroup and file and I/O status */
- PCG_FILE_MAPPED, /* page is accounted as "mapped" */
__NR_PCG_FLAGS,
};

@@ -66,10 +64,6 @@ TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
SETPCGFLAG(Used, USED)

-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30afea5..ec55880 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1928,10 +1928,6 @@ void mem_cgroup_update_page_stat(struct page *page,

switch (idx) {
case MEMCG_NR_FILE_MAPPED:
- if (val > 0)
- SetPageCgroupFileMapped(pc);
- else if (!page_mapped(page))
- ClearPageCgroupFileMapped(pc);
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
default:
@@ -2552,7 +2548,7 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_mem_cgroup(from, &flags);

- if (PageCgroupFileMapped(pc)) {
+ if (page_mapped(page)) {
/* Update mapped_file data for mem_cgroup */
preempt_disable();
__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
--
1.7.4.1

2012-02-14 03:17:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()

From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
From: KAMEZAWA Hiroyuki <[email protected]>
Date: Mon, 6 Feb 2012 12:14:47 +0900
Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()

mem_cgroup_begin_update_page_stat() should be very fast because
it's called very frequently. Now, it needs to look up page_cgroup
and its memcg....this is slow.

This patch adds a global variable to check "a memcg is moving or not".
By this, the caller doesn't need to visit page_cgroup and memcg.

Here is a test result. A test program makes page faults onto a file,
MAP_SHARED and makes each page's page_mapcount(page) > 1, and free
the range by madvise() and page fault again. This program causes
26214400 times of page fault onto a file(size was 1G.) and shows
the cost of mem_cgroup_begin_update_page_stat().

Before this patch for mem_cgroup_begin_update_page_stat()
[kamezawa@bluextal test]$ time ./mmap 1G

real 0m21.765s
user 0m5.999s
sys 0m15.434s

27.46% mmap mmap [.] reader
21.15% mmap [kernel.kallsyms] [k] page_fault
9.17% mmap [kernel.kallsyms] [k] filemap_fault
2.96% mmap [kernel.kallsyms] [k] __do_fault
2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat

After this patch
[root@bluextal test]# time ./mmap 1G

real 0m21.373s
user 0m6.113s
sys 0m15.016s

In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.

Note: we may be able to remove this optimization in future if
we can get pointer to memcg directly from struct page.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/memcontrol.c | 7 ++++++-
2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3f3ef33..3df9979 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -143,7 +143,7 @@ static inline bool mem_cgroup_disabled(void)

void __mem_cgroup_begin_update_page_stat(struct page *page,
bool *lock, unsigned long *flags);
-
+extern atomic_t memcg_moving;
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *lock, unsigned long *flags)
{
@@ -151,7 +151,8 @@ static inline void mem_cgroup_begin_update_page_stat(struct page *page,
return;
rcu_read_lock();
*lock = false;
- return __mem_cgroup_begin_update_page_stat(page, lock, flags);
+ if (atomic_read(&memcg_moving))
+ return __mem_cgroup_begin_update_page_stat(page, lock, flags);
}

void __mem_cgroup_end_update_page_stat(struct page *page, bool *lock,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ec55880..fd02fab 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1273,8 +1273,11 @@ int mem_cgroup_swappiness(struct mem_cgroup *memcg)
* start move here.
*/

+/* for quick checking without looking up memcg */
+atomic_t memcg_moving __read_mostly;
static void mem_cgroup_start_move(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg_moving);
atomic_inc(&memcg->moving_account);
synchronize_rcu();
}
@@ -1285,8 +1288,10 @@ static void mem_cgroup_end_move(struct mem_cgroup *memcg)
* Now, mem_cgroup_clear_mc() may call this function with NULL.
* We check NULL in callee rather than caller.
*/
- if (memcg)
+ if (memcg) {
+ atomic_dec(&memcg_moving);
atomic_dec(&memcg->moving_account);
+ }
}

/*
--
1.7.4.1

2012-02-14 07:21:36

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 1/6 v4] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)

On Mon, Feb 13, 2012 at 7:06 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> This is just a cleanup.
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 2 Feb 2012 12:05:41 +0900
> Subject: [PATCH 1/6] memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat)
>
> From the log, I guess EXPORT was for preparing dirty accounting.
> But _now_, we don't need to export this. Remove this for now.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Looks good to me.

Reviewed-by: Greg Thelen <[email protected]>

> ---
>  mm/memcontrol.c |    1 -
>  1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ab315ab..4c2b759 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1897,7 +1897,6 @@ out:
>                move_unlock_page_cgroup(pc, &flags);
>        rcu_read_unlock();
>  }
> -EXPORT_SYMBOL(mem_cgroup_update_page_stat);
>
>  /*
>  * size of first charge trial. "32" comes from vmscan.c's magic value.
> --
> 1.7.4.1

2012-02-14 07:21:57

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 2/6 v4] memcg: simplify move_account() check

On Mon, Feb 13, 2012 at 7:07 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 2 Feb 2012 10:02:39 +0900
> Subject: [PATCH 2/6] memcg: simplify move_account() check.
>
> In memcg, for avoiding take-lock-irq-off at accessing page_cgroup,
> a logic, flag + rcu_read_lock(), is used. This works as following
>
>      CPU-A                     CPU-B
>                              rcu_read_lock()
>     set flag
>                              if(flag is set)
>                                    take heavy lock
>                              do job.
>     synchronize_rcu()        rcu_read_unlock()

I assume that CPU-A will take heavy lock after synchronize_rcu() when
updating variables read by CPU-B.

>  memcontrol.c |   65 ++++++++++++++++++++++-------------------------------
>  1 file changed, 25 insertions(+), 40 deletions(-)
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Greg Thelen <[email protected]>

> ---
>  mm/memcontrol.c |   70 +++++++++++++++++++++++-------------------------
>  1 files changed, 30 insertions(+), 40 deletions(-)

> @@ -2089,11 +2082,8 @@ static int __cpuinit memcg_cpu_hotplug_callback(struct notifier_block *nb,
>        struct memcg_stock_pcp *stock;
>        struct mem_cgroup *iter;
>
> -       if ((action == CPU_ONLINE)) {
> -               for_each_mem_cgroup(iter)
> -                       synchronize_mem_cgroup_on_move(iter, cpu);
> +       if ((action == CPU_ONLINE))

Extra parenthesis. I recommend:
+       if (action == CPU_ONLINE)

2012-02-14 07:22:08

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 3/6 v4] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.

On Mon, Feb 13, 2012 at 7:13 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From ffd1b013fe294a80c12e3f30e85386135a6fb284 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 2 Feb 2012 11:49:59 +0900
> Subject: [PATCH 3/6] memcg: remove PCG_MOVE_LOCK flag from page_cgroup.
>
> PCG_MOVE_LOCK is used for bit spinlock to avoid race between overwriting
> pc->mem_cgroup and page statistics accounting per memcg.
> This lock helps to avoid the race but the race is very rare because moving
> tasks between cgroup is not a usual job.
> So, it seems using 1bit per page is too costly.
>
> This patch changes this lock as per-memcg spinlock and removes PCG_MOVE_LOCK.
>
> If smaller lock is required, we'll be able to add some hashes but
> I'd like to start from this.
>
> Changelog:
>  - fixed to pass memcg as an argument rather than page_cgroup.
>    and renamed from move_lock_page_cgroup() to move_lock_mem_cgroup()
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Seems good. Thanks.

Acked-by: Greg Thelen <[email protected]>

2012-02-14 07:22:44

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 4/6 v4] memcg: use new logic for page stat accounting

On Mon, Feb 13, 2012 at 7:14 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 2 Feb 2012 11:49:59 +0900
> Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.
>
> Now, page-stat-per-memcg is recorded into per page_cgroup flag by
> duplicating page's status into the flag. The reason is that memcg
> has a feature to move a page from a group to another group and we
> have race between "move" and "page stat accounting",
>
> Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
> and CPU-B does "page stat accounting".
>
> When CPU-A goes 1st,
>
>            CPU-A                           CPU-B
>                                    update "struct page" info.
>    move_lock_mem_cgroup(memcg)
>    see flags

pc->flags?

>    copy page stat to new group
>    overwrite pc->mem_cgroup.
>    move_unlock_mem_cgroup(memcg)
>                                    move_lock_mem_cgroup(mem)
>                                    set pc->flags
>                                    update page stat accounting
>                                    move_unlock_mem_cgroup(mem)
>
> stat accounting is guarded by move_lock_mem_cgroup() and "move"
> logic (CPU-A) doesn't see changes in "struct page" information.
>
> But it's costly to have the same information both in 'struct page' and
> 'struct page_cgroup'. And, there is a potential problem.
>
> For example, assume we have PG_dirty accounting in memcg.
> PG_..is a flag for struct page.
> PCG_ is a flag for struct page_cgroup.
> (This is just an example. The same problem can be found in any
>  kind of page stat accounting.)
>
>          CPU-A                               CPU-B
>      TestSet PG_dirty
>      (delay)                        TestClear PG_dirty_
>
PG_dirty

>                                     if (TestClear(PCG_dirty))
>                                          memcg->nr_dirty--
>      if (TestSet(PCG_dirty))
>          memcg->nr_dirty++
>

> @@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
>        return false;
>  }
>
> +void __mem_cgroup_begin_update_page_stat(struct page *page,
> +                                       bool *lock, unsigned long *flags);
> +
> +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> +                                       bool *lock, unsigned long *flags)
> +{
> +       if (mem_cgroup_disabled())
> +               return;
> +       rcu_read_lock();
> +       *lock = false;

This seems like a strange place to set *lock=false. I think it's
clearer if __mem_cgroup_begin_update_page_stat() is the only routine
that sets or clears *lock. But I do see that in patch 6/6 'memcg: fix
performance of mem_cgroup_begin_update_page_stat()' this position is
required.

> +       return __mem_cgroup_begin_update_page_stat(page, lock, flags);
> +}

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ecf8856..30afea5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
>  * If there is, we take a lock.
>  */
>
> +void __mem_cgroup_begin_update_page_stat(struct page *page,
> +                               bool *lock, unsigned long *flags)
> +{
> +       struct mem_cgroup *memcg;
> +       struct page_cgroup *pc;
> +
> +       pc = lookup_page_cgroup(page);
> +again:
> +       memcg = pc->mem_cgroup;
> +       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> +               return;
> +       if (!mem_cgroup_stealed(memcg))
> +               return;
> +
> +       move_lock_mem_cgroup(memcg, flags);
> +       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> +               move_unlock_mem_cgroup(memcg, flags);
> +               goto again;
> +       }
> +       *lock = true;
> +}
> +
> +void __mem_cgroup_end_update_page_stat(struct page *page,
> +                               bool *lock, unsigned long *flags)

'lock' looks like an unused parameter. If so, then remove it.

> +{
> +       struct page_cgroup *pc = lookup_page_cgroup(page);
> +
> +       /*
> +        * It's guaranteed that pc->mem_cgroup never changes while
> +        * lock is held

Please continue comment describing what provides this guarantee. I
assume it is because rcu_read_lock() is held by
mem_cgroup_begin_update_page_stat(). Maybe it's best to just make
small reference to the locking protocol description in
mem_cgroup_start_move().

> +        */
> +       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> +}
> +
> +

I think it would be useful to add a small comment here declaring that
all callers of this routine must be in a
mem_cgroup_begin_update_page_stat(), mem_cgroup_end_update_page_stat()
critical section to keep pc->mem_cgroup stable.

>  void mem_cgroup_update_page_stat(struct page *page,
>                                 enum mem_cgroup_page_stat_item idx, int val)
>  {

2012-02-14 07:22:59

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 5/6 v4] memcg: remove PCG_FILE_MAPPED

On Mon, Feb 13, 2012 at 7:15 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From 96407a510d5212179a4768f28591b35d5c17d403 Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Thu, 2 Feb 2012 15:02:18 +0900
> Subject: [PATCH 5/6] memcg: remove PCG_FILE_MAPPED
>
> with new lock scheme for updating memcg's page stat, we don't need
> a flag PCG_FILE_MAPPED which was duplicated information of page_mapped().
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Thanks!

Acked-by: Greg Thelen <[email protected]>

2012-02-14 07:23:13

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()

On Mon, Feb 13, 2012 at 7:16 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
> From: KAMEZAWA Hiroyuki <[email protected]>
> Date: Mon, 6 Feb 2012 12:14:47 +0900
> Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()
>
> mem_cgroup_begin_update_page_stat() should be very fast because
> it's called very frequently. Now, it needs to look up page_cgroup
> and its memcg....this is slow.
>
> This patch adds a global variable to check "a memcg is moving or not".

s/a memcg/any memcg/

> By this, the caller doesn't need to visit page_cgroup and memcg.

s/By/With/

> Here is a test result. A test program makes page faults onto a file,
> MAP_SHARED and makes each page's page_mapcount(page) > 1, and free
> the range by madvise() and page fault again.  This program causes
> 26214400 times of page fault onto a file(size was 1G.) and shows
> the cost of mem_cgroup_begin_update_page_stat().

Out of curiosity, what is the performance of the mmap program before
this series?

> Before this patch for mem_cgroup_begin_update_page_stat()
> [kamezawa@bluextal test]$ time ./mmap 1G
>
> real    0m21.765s
> user    0m5.999s
> sys     0m15.434s
>
>    27.46%     mmap  mmap               [.] reader
>    21.15%     mmap  [kernel.kallsyms]  [k] page_fault
>     9.17%     mmap  [kernel.kallsyms]  [k] filemap_fault
>     2.96%     mmap  [kernel.kallsyms]  [k] __do_fault
>     2.83%     mmap  [kernel.kallsyms]  [k] __mem_cgroup_begin_update_page_stat
>
> After this patch
> [root@bluextal test]# time ./mmap 1G
>
> real    0m21.373s
> user    0m6.113s
> sys     0m15.016s
>
> In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away.
>
> Note: we may be able to remove this optimization in future if
>      we can get pointer to memcg directly from struct page.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: Greg Thelen <[email protected]>

2012-02-14 08:45:24

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 4/6 v4] memcg: use new logic for page stat accounting

On Mon, 13 Feb 2012 23:22:22 -0800
Greg Thelen <[email protected]> wrote:

> On Mon, Feb 13, 2012 at 7:14 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From ad2905362ef58a44d96a325193ab384739418050 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Thu, 2 Feb 2012 11:49:59 +0900
> > Subject: [PATCH 4/6] memcg: use new logic for page stat accounting.
> >
> > Now, page-stat-per-memcg is recorded into per page_cgroup flag by
> > duplicating page's status into the flag. The reason is that memcg
> > has a feature to move a page from a group to another group and we
> > have race between "move" and "page stat accounting",
> >
> > Under current logic, assume CPU-A and CPU-B. CPU-A does "move"
> > and CPU-B does "page stat accounting".
> >
> > When CPU-A goes 1st,
> >
> >            CPU-A                           CPU-B
> >                                    update "struct page" info.
> >    move_lock_mem_cgroup(memcg)
> >    see flags
>
> pc->flags?
>
yes.


> >    copy page stat to new group
> >    overwrite pc->mem_cgroup.
> >    move_unlock_mem_cgroup(memcg)
> >                                    move_lock_mem_cgroup(mem)
> >                                    set pc->flags
> >                                    update page stat accounting
> >                                    move_unlock_mem_cgroup(mem)
> >
> > stat accounting is guarded by move_lock_mem_cgroup() and "move"
> > logic (CPU-A) doesn't see changes in "struct page" information.
> >
> > But it's costly to have the same information both in 'struct page' and
> > 'struct page_cgroup'. And, there is a potential problem.
> >
> > For example, assume we have PG_dirty accounting in memcg.
> > PG_..is a flag for struct page.
> > PCG_ is a flag for struct page_cgroup.
> > (This is just an example. The same problem can be found in any
> >  kind of page stat accounting.)
> >
> >          CPU-A                               CPU-B
> >      TestSet PG_dirty
> >      (delay)                        TestClear PG_dirty_
>
> PG_dirty
>
> >                                     if (TestClear(PCG_dirty))
> >                                          memcg->nr_dirty--
> >      if (TestSet(PCG_dirty))
> >          memcg->nr_dirty++
> >
>
> > @@ -141,6 +141,31 @@ static inline bool mem_cgroup_disabled(void)
> >        return false;
> >  }
> >
> > +void __mem_cgroup_begin_update_page_stat(struct page *page,
> > +                                       bool *lock, unsigned long *flags);
> > +
> > +static inline void mem_cgroup_begin_update_page_stat(struct page *page,
> > +                                       bool *lock, unsigned long *flags)
> > +{
> > +       if (mem_cgroup_disabled())
> > +               return;
> > +       rcu_read_lock();
> > +       *lock = false;
>
> This seems like a strange place to set *lock=false. I think it's
> clearer if __mem_cgroup_begin_update_page_stat() is the only routine
> that sets or clears *lock. But I do see that in patch 6/6 'memcg: fix
> performance of mem_cgroup_begin_update_page_stat()' this position is
> required.
>

Ah, yes. Hmm, it would be better to move this into the body of the function.



> > +       return __mem_cgroup_begin_update_page_stat(page, lock, flags);
> > +}
>
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index ecf8856..30afea5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1877,32 +1877,54 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
> >  * If there is, we take a lock.
> >  */
> >
> > +void __mem_cgroup_begin_update_page_stat(struct page *page,
> > +                               bool *lock, unsigned long *flags)
> > +{
> > +       struct mem_cgroup *memcg;
> > +       struct page_cgroup *pc;
> > +
> > +       pc = lookup_page_cgroup(page);
> > +again:
> > +       memcg = pc->mem_cgroup;
> > +       if (unlikely(!memcg || !PageCgroupUsed(pc)))
> > +               return;
> > +       if (!mem_cgroup_stealed(memcg))
> > +               return;
> > +
> > +       move_lock_mem_cgroup(memcg, flags);
> > +       if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) {
> > +               move_unlock_mem_cgroup(memcg, flags);
> > +               goto again;
> > +       }
> > +       *lock = true;
> > +}
> > +
> > +void __mem_cgroup_end_update_page_stat(struct page *page,
> > +                               bool *lock, unsigned long *flags)
>
> 'lock' looks like an unused parameter. If so, then remove it.
>

Ok.

> > +{
> > +       struct page_cgroup *pc = lookup_page_cgroup(page);
> > +
> > +       /*
> > +        * It's guaranteed that pc->mem_cgroup never changes while
> > +        * lock is held
>
> Please continue comment describing what provides this guarantee. I
> assume it is because rcu_read_lock() is held by
> mem_cgroup_begin_update_page_stat(). Maybe it's best to to just make
> small reference to the locking protocol description in
> mem_cgroup_start_move().
>
Ok, I will update this.


> > +        */
> > +       move_unlock_mem_cgroup(pc->mem_cgroup, flags);
> > +}
> > +
> > +
>
> I think it would be useful to add a small comment here declaring that
> all callers of this routine must be in a
> mem_cgroup_begin_update_page_stat(), mem_cgroup_end_update_page_stat()
> critical section to keep pc->mem_cgroup stable.
>

Sure, will do.

Thank you for review.
-Kame


> >  void mem_cgroup_update_page_stat(struct page *page,
> >                                 enum mem_cgroup_page_stat_item idx, int val)
> >  {
>

2012-02-14 08:46:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 2/6 v4] memcg: simplify move_account() check

On Mon, 13 Feb 2012 23:21:34 -0800
Greg Thelen <[email protected]> wrote:

> On Mon, Feb 13, 2012 at 7:07 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From 9cdb3b63dc8d08cc2220c54c80438c13433a0d12 Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Thu, 2 Feb 2012 10:02:39 +0900
> > Subject: [PATCH 2/6] memcg: simplify move_account() check.
> >
> > In memcg, for avoiding take-lock-irq-off at accessing page_cgroup,
> > a logic, flag + rcu_read_lock(), is used. This works as following
> >
> >     CPU-A                     CPU-B
> >                             rcu_read_lock()
> >    set flag
> >                             if(flag is set)
> >                                   take heavy lock
> >                             do job.
> >    synchronize_rcu()        rcu_read_unlock()
>
> I assume that CPU-A will take heavy lock after synchronize_rcu() when
> updating variables read by CPU-B.
>
Ah, yes. I should have written that.

> >  memcontrol.c |   65 ++++++++++++++++++++++-------------------------------------
> >  1 file changed, 25 insertions(+), 40 deletions(-)
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Acked-by: Greg Thelen <[email protected]>
>

Thank you!

-Kame

2012-02-14 11:10:15

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 6/6 v4] memcg: fix performance of mem_cgroup_begin_update_page_stat()

On Mon, 13 Feb 2012 23:22:50 -0800
Greg Thelen <[email protected]> wrote:

> On Mon, Feb 13, 2012 at 7:16 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > From 3377fd7b6e23a5d2a368c078eae27e2b49c4f4aa Mon Sep 17 00:00:00 2001
> > From: KAMEZAWA Hiroyuki <[email protected]>
> > Date: Mon, 6 Feb 2012 12:14:47 +0900
> > Subject: [PATCH 6/6] memcg: fix performance of mem_cgroup_begin_update_page_stat()
> >
> > mem_cgroup_begin_update_page_stat() should be very fast because
> > it's called very frequently. Now, it needs to look up page_cgroup
> > and its memcg....this is slow.
> >
> > This patch adds a global variable to check "a memcg is moving or not".
>
> s/a memcg/any memcg/
>
yes.

> > By this, the caller doesn't need to visit page_cgroup and memcg.
>
> s/By/With/
>
ok.

> > Here is a test result. A test program makes page faults onto a file,
> > MAP_SHARED and makes each page's page_mapcount(page) > 1, and free
> > the range by madvise() and page fault again.  This program causes
> > 26214400 times of page fault onto a file(size was 1G.) and shows
> > shows the cost of mem_cgroup_begin_update_page_stat().
>
> Out of curiosity, what is the performance of the mmap program before
> this series?
>

Scores of 3 runs under linux-next.
==
[root@bluextal test]# time ./mmap 1G

real 0m21.041s
user 0m6.146s
sys 0m14.625s
[root@bluextal test]# time ./mmap 1G

real 0m21.063s
user 0m6.019s
sys 0m14.776s
[root@bluextal test]# time ./mmap 1G

real 0m21.178s
user 0m6.000s
sys 0m14.849s
==

My program is attached. This program is for checking the cost of updating FILE_MAPPED.
I guess the sys time scores' error rate will be 0.2-0.3 sec.

Thanks,
-Kame
==
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>

void reader(int fd, int size)
{
        int i, off, x = 0;      /* x is only read to force the page faults */
        char *addr;

        addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
        for (i = 0; i < 100; i++) {
                for (off = 0; off < size; off += 4096)
                        x += *(addr + off);
                madvise(addr, size, MADV_DONTNEED);
        }
}

int main(int argc, char *argv[])
{
        int fd;
        char *addr;
        unsigned long size;
        struct stat statbuf;

        fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("cannot open file");
                return 1;
        }

        if (fstat(fd, &statbuf)) {
                perror("fstat failed");
                return 1;
        }
        size = statbuf.st_size;
        /* mmap in 2 places: once here, once in reader(). */
        addr = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
        mlock(addr, size);
        reader(fd, size);
        return 0;
}