2010-07-27 07:56:45

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 0/7][memcg] towards I/O aware memory cgroup


From a patch-management point of view, this set is a mixture of a few features for
memcg, and I should divide them into groups. But, at first, I'd like to
show the total view. This set consists of 5 parts. The main purpose is to
create room in page_cgroup for I/O tracking and to add a light-weight access
method for file-cache-related accounting.

1. A virtually indexed array library.
2,3. Use the virtually indexed array for ID-to-memory-cgroup lookup.
4. Modify page_cgroup to use an ID instead of a pointer; this gives us enough
space for further memory tracking.
5,6. Use a light-weight locking mechanism for file-related accounting.
7. Use spin_lock instead of bit_spinlock.


Functionally, patches 5 and 6 could form an independent series, and I'll accept
reordering the series if someone requests it.
But I think we'll need all of them.
(irq_save for patch 7 will be required later.)

Any comments are welcome.

Thanks,
-Kame


2010-07-27 07:57:55

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 1/7][memcg] virtually indexed array library.

From: KAMEZAWA Hiroyuki <[email protected]>

This virt-array allocates a virtually contiguous array via get_vm_area()
and allows object allocation per array element.
Physical pages are used only for elements that are actually in use.

- At first, the user has to create an array with create_varray().
- Before using an element, alloc_varray_item(index) should be called.
- When an element is no longer needed, free_varray_item(index) should be called.
- At teardown, destroy_varray() should be called.

Item used/unused status is tracked by a bitmap, and the backing physical
pages are automatically allocated/freed. This is useful when you
want light-weight access to objects by index. For example,

struct your_struct *objmap = create_varray(&va, sizeof(*objmap), nelems);
Then, after alloc_varray_item(&va, i), you can access your objects as objmap[i].

In the usual case, holding a reference by index rather than by pointer can save
memory, but the index -> object lookup cost is not negligible. In such cases,
this virt-array may be helpful. Of course, if lookup performance is not important,
using a radix-tree will be better (from a TLB point of view). This virt-array
may consume too much VMALLOC area, and its alloc/free routines are very slow.
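
A minimal usage sketch (the element type, element count and index below are made
up purely for illustration, and error handling is abbreviated):

    #include <linux/err.h>
    #include <linux/virt-array.h>

    struct my_obj {                     /* hypothetical element type */
            int value;
    };

    static struct virt_array va;
    static struct my_obj *objmap;

    static int my_objs_init(void)
    {
            /* reserve a virtually contiguous array of 1024 elements */
            objmap = create_varray(&va, sizeof(struct my_obj), 1024);
            if (!objmap)
                    return -ENOMEM;
            /* back slot 10 with a real page before touching it */
            if (IS_ERR_OR_NULL(alloc_varray_item(&va, 10)))
                    return -ENOMEM;
            objmap[10].value = 42;      /* direct indexed access */
            return 0;
    }

    static void my_objs_exit(void)
    {
            free_varray_item(&va, 10);  /* backing page freed if nobody else uses it */
            destroy_varray(&va);
    }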

Changelog:
- fixed bugs in bitmap ops.
- add offset for find_free_index.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/virt-array.h | 22 ++++
lib/Makefile | 2
lib/virt-array.c | 227 +++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 250 insertions(+), 1 deletion(-)

Index: mmotm-0719/lib/Makefile
===================================================================
--- mmotm-0719.orig/lib/Makefile
+++ mmotm-0719/lib/Makefile
@@ -14,7 +14,7 @@ lib-y := ctype.o string.o vsprintf.o cmd
proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o flex_array.o

-lib-$(CONFIG_MMU) += ioremap.o
+lib-$(CONFIG_MMU) += ioremap.o virt-array.o
lib-$(CONFIG_SMP) += cpumask.o

lib-y += kobject.o kref.o klist.o
Index: mmotm-0719/lib/virt-array.c
===================================================================
--- /dev/null
+++ mmotm-0719/lib/virt-array.c
@@ -0,0 +1,227 @@
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/slab.h>
+#include <linux/virt-array.h>
+#include <asm/cacheflush.h>
+
+
+/*
+ * This is defined here, but ideally the user should define their own
+ * version to get better code. This generic one is slow because the
+ * compiler cannot know what "size" is.
+ */
+static unsigned long idx_to_addr(struct virt_array *v, int idx)
+{
+ return (unsigned long)v->vm_area->addr + idx * v->size;
+}
+
+static unsigned long addr_index(struct virt_array *v, unsigned long addr)
+{
+ return (addr - (unsigned long)v->vm_area->addr) >> PAGE_SHIFT;
+}
+
+static int idx_used(const struct virt_array *v, int idx)
+{
+ return test_bit(idx, v->map);
+}
+
+static void unmap_free_page(struct virt_array *v, unsigned long page)
+{
+ struct page *pg;
+ unsigned long page_idx;
+
+ page_idx = addr_index(v, page);
+ /* alloc_idx's undo routine may find !pg */
+ pg = radix_tree_delete(&v->phys_pages, page_idx);
+ if (!pg)
+ return;
+ unmap_kernel_range(page, PAGE_SIZE);
+ __free_page(pg);
+
+}
+
+static void __free_head_page(struct virt_array *v, int idx)
+{
+ int i;
+ unsigned long page;
+
+ page = idx_to_addr(v, idx) & PAGE_MASK;
+
+ /* check backward */
+ for (i = idx - 1; i >= 0; i--) {
+ unsigned long address = idx_to_addr(v, i) + v->size - 1;
+ if ((address & PAGE_MASK) != page)
+ break;
+ /* A used object shares this page ? */
+ if (idx_used(v, i))
+ return;
+ }
+ unmap_free_page(v, page);
+}
+
+
+static void __free_middle_page(struct virt_array *v, int idx)
+{
+ unsigned long page, end_page;
+
+ page = (idx_to_addr(v, idx)) & PAGE_MASK;
+ end_page = (idx_to_addr(v, idx) + v->size) & PAGE_MASK;
+ if (end_page - page <= PAGE_SIZE)
+ return;
+ /* free all pages between head and tail */
+ for (page += PAGE_SIZE; page != end_page; page += PAGE_SIZE)
+ unmap_free_page(v, page);
+}
+
+
+static void __free_tail_page(struct virt_array *v, int idx)
+{
+ int i;
+ unsigned long page;
+
+ page = (idx_to_addr(v, idx) + v->size) & PAGE_MASK;
+ /* check forward */
+ for (i = idx + 1; i < v->nelem ; i++) {
+ unsigned long address = idx_to_addr(v, i);
+ if ((address & PAGE_MASK) != page)
+ break;
+ /* A used object shares this page ? */
+ if (idx_used(v, i))
+ return;
+ }
+ /* we can free this page */
+ unmap_free_page(v, page);
+}
+
+static void __free_this_page(struct virt_array *v, int idx, unsigned long page)
+{
+ int i;
+
+ /* check backward */
+ for (i = idx - 1; i >= 0; i--) {
+ unsigned long address = idx_to_addr(v, i) + v->size - 1;
+ if ((address & PAGE_MASK) != page)
+ break;
+ /* A used object shares this page ? */
+ if (idx_used(v, i))
+ return;
+ }
+ /* check forward */
+ for (i = idx + 1; i < v->nelem; i++) {
+ unsigned long address = idx_to_addr(v, i);
+ if ((address & PAGE_MASK) != page)
+ break;
+ /* A used object shares this page ? */
+ if (idx_used(v, i))
+ return;
+ }
+ /* we can free this page */
+ unmap_free_page(v, page);
+}
+
+static void __free_unmap_entry(struct virt_array *v, int idx)
+{
+ unsigned long address, end;
+
+ address = idx_to_addr(v, idx);
+ end = address + v->size;
+ if ((address & PAGE_MASK) == (end & PAGE_MASK)) {
+ __free_this_page(v, idx, address & PAGE_MASK);
+ } else {
+ __free_head_page(v, idx);
+ __free_middle_page(v, idx);
+ __free_tail_page(v, idx);
+ }
+ clear_bit(idx, v->map);
+}
+
+void free_varray_item(struct virt_array *v, int idx)
+{
+ mutex_lock(&v->mutex);
+ __free_unmap_entry(v, idx);
+ mutex_unlock(&v->mutex);
+}
+
+void *alloc_varray_item(struct virt_array *v, int idx)
+{
+ unsigned long obj, tmp, start, end, addr_idx;
+ struct page *pg[1];
+ void *ret = ERR_PTR(-EBUSY);
+
+ mutex_lock(&v->mutex);
+ if (idx_used(v, idx))
+ goto out;
+
+ obj = idx_to_addr(v, idx);
+ start = obj & PAGE_MASK;
+ end = PAGE_ALIGN(obj + v->size);
+
+ for (tmp = start; tmp < end; tmp+=PAGE_SIZE) {
+ addr_idx = addr_index(v, tmp);
+ pg[0] = radix_tree_lookup(&v->phys_pages, addr_idx);
+ if (pg[0])
+ continue;
+ pg[0] = alloc_page(GFP_KERNEL);
+ if (map_kernel_range_noflush(tmp, PAGE_SIZE,
+ PAGE_KERNEL, pg) == -ENOMEM) {
+ __free_page(pg[0]);
+ goto out_unmap;
+ }
+
+ radix_tree_preload(GFP_KERNEL);
+ if (radix_tree_insert(&v->phys_pages, addr_idx, pg[0])) {
+ BUG();
+ }
+ radix_tree_preload_end();
+ }
+ flush_cache_vmap(start, end);
+ ret = (void *)obj;
+ set_bit(idx, v->map);
+out:
+ mutex_unlock(&v->mutex);
+ return ret;
+out_unmap:
+ ret = ERR_PTR(-ENOMEM);
+ __free_unmap_entry(v, idx);
+ goto out;
+}
+
+void *create_varray(struct virt_array *v,int size, int nelem)
+{
+ unsigned long total = size * nelem;
+ unsigned long bits;
+
+ bits = ((nelem/BITS_PER_LONG)+1) * sizeof(long);
+ v->map = kzalloc(bits, GFP_KERNEL);
+ if (!v->map)
+ return NULL;
+ total = PAGE_ALIGN(total);
+ v->vm_area = get_vm_area(total, 0);
+ if (!v->vm_area) {
+ kfree(v->map);
+ return NULL;
+ }
+
+ v->size = size;
+ v->nelem = nelem;
+ INIT_RADIX_TREE(&v->phys_pages, GFP_KERNEL);
+ mutex_init(&v->mutex);
+ return v->vm_area->addr;
+}
+
+void destroy_varray(struct virt_array *v)
+{
+ int i;
+
+ for_each_set_bit(i, v->map, v->nelem)
+ __free_unmap_entry(v, i);
+ kfree(v->map);
+ free_vm_area(v->vm_area);
+ return;
+}
+
+int varray_find_free_index(struct virt_array *v, int base)
+{
+ return find_next_zero_bit(v->map, v->nelem, base);
+}
+
Index: mmotm-0719/include/linux/virt-array.h
===================================================================
--- /dev/null
+++ mmotm-0719/include/linux/virt-array.h
@@ -0,0 +1,22 @@
+#ifndef __LINUX_VIRTARRAY_H
+#define __LINUX_VIRTARRAY_H
+
+#include <linux/vmalloc.h>
+#include <linux/radix-tree.h>
+
+struct virt_array {
+ struct vm_struct *vm_area;
+ int size;
+ int nelem;
+ struct mutex mutex;
+ struct radix_tree_root phys_pages;
+ unsigned long *map;
+};
+
+void *create_varray(struct virt_array *va, int size, int nelems);
+void *alloc_varray_item(struct virt_array *va, int index);
+void free_varray_item(struct virt_array *va, int index);
+void destroy_varray(struct virt_array *va);
+int varray_find_free_index(struct virt_array *va, int base);
+
+#endif

2010-07-27 07:59:11

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 2/7][memcg] cgroup arbitrary ID allocation

From: KAMEZAWA Hiroyuki <[email protected]>

When a subsystem wants to make more use of its "id", it's necessary to
manage the id at cgroup subsystem creation time. But, now,
because of the order of the cgroup creation callbacks, a subsystem can't
declare the id it wants. This patch allows subsystems to use a customized
ID of their own.
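
As a rough sketch (the subsystem and helper below are hypothetical, only to show
the shape of the callback; patch 3 adds the real memcg implementation):

    /* hypothetical subsystem that keeps its state in an indexed array */
    static int foo_custom_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
    {
            struct foo_state *st = foo_state_from_cgroup(cgrp); /* hypothetical helper */

            /*
             * Return the array slot chosen at ->create() time; the cgroup
             * core then registers exactly this value as the css_id.
             */
            return st->index;
    }

    /* in the subsystem definition */
    .custom_id = foo_custom_id,
    .use_id = 1,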

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/cgroups.txt | 9 +++++++++
include/linux/cgroup.h | 3 ++-
kernel/cgroup.c | 17 ++++++++++++-----
3 files changed, 23 insertions(+), 6 deletions(-)

Index: mmotm-2.6.35-0719/include/linux/cgroup.h
===================================================================
--- mmotm-2.6.35-0719.orig/include/linux/cgroup.h
+++ mmotm-2.6.35-0719/include/linux/cgroup.h
@@ -475,7 +475,7 @@ struct cgroup_subsys {
struct cgroup *cgrp);
void (*post_clone)(struct cgroup_subsys *ss, struct cgroup *cgrp);
void (*bind)(struct cgroup_subsys *ss, struct cgroup *root);
-
+ int (*custom_id)(struct cgroup_subsys *ss, struct cgroup *cgrp);
int subsys_id;
int active;
int disabled;
@@ -483,6 +483,7 @@ struct cgroup_subsys {
/*
* True if this subsys uses ID. ID is not available before cgroup_init()
* (not available in early_init time.)
+ * You can determine the ID yourself if you provide a custom_id() callback.
*/
bool use_id;
#define MAX_CGROUP_TYPE_NAMELEN 32
Index: mmotm-2.6.35-0719/kernel/cgroup.c
===================================================================
--- mmotm-2.6.35-0719.orig/kernel/cgroup.c
+++ mmotm-2.6.35-0719/kernel/cgroup.c
@@ -4526,10 +4526,11 @@ EXPORT_SYMBOL_GPL(free_css_id);
* always serialized (By cgroup_mutex() at create()).
*/

-static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
+static struct css_id *get_new_cssid(struct cgroup_subsys *ss,
+ int depth, struct cgroup *child)
{
struct css_id *newid;
- int myid, error, size;
+ int from_id, myid, error, size;

BUG_ON(!ss->use_id);

@@ -4542,9 +4543,13 @@ static struct css_id *get_new_cssid(stru
error = -ENOMEM;
goto err_out;
}
+ if (child && ss->custom_id)
+ from_id = ss->custom_id(ss, child);
+ else
+ from_id = 1;
spin_lock(&ss->id_lock);
/* Don't use 0. allocates an ID of 1-65535 */
- error = idr_get_new_above(&ss->idr, newid, 1, &myid);
+ error = idr_get_new_above(&ss->idr, newid, from_id, &myid);
spin_unlock(&ss->id_lock);

/* Returns error when there are no free spaces for new ID.*/
@@ -4552,6 +4557,8 @@ static struct css_id *get_new_cssid(stru
error = -ENOSPC;
goto err_out;
}
+ BUG_ON(ss->custom_id && from_id != myid);
+
if (myid > CSS_ID_MAX)
goto remove_idr;

@@ -4577,7 +4584,7 @@ static int __init_or_module cgroup_init_
spin_lock_init(&ss->id_lock);
idr_init(&ss->idr);

- newid = get_new_cssid(ss, 0);
+ newid = get_new_cssid(ss, 0, NULL);
if (IS_ERR(newid))
return PTR_ERR(newid);

@@ -4600,7 +4607,7 @@ static int alloc_css_id(struct cgroup_su
parent_id = parent_css->id;
depth = parent_id->depth + 1;

- child_id = get_new_cssid(ss, depth);
+ child_id = get_new_cssid(ss, depth, child);
if (IS_ERR(child_id))
return PTR_ERR(child_id);

Index: mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
===================================================================
--- mmotm-2.6.35-0719.orig/Documentation/cgroups/cgroups.txt
+++ mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
@@ -621,6 +621,15 @@ and root cgroup. Currently this will onl
the default hierarchy (which never has sub-cgroups) and a hierarchy
that is being created/destroyed (and hence has no sub-cgroups).

+int custom_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
+
+Called when assigning a new ID to a cgroup subsystem state struct. This
+is only called when ss->use_id == true. If this function is not provided,
+a new ID is automatically assigned. If you enable ss->use_id,
+you can use css_lookup() and css_get_next() to access "css" objects
+via IDs.
+
+
4. Questions
============

2010-07-27 08:00:27

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 3/7][memcg] memcg on virt array for quick access via ID.

From: KAMEZAWA Hiroyuki <[email protected]>

Now, each memory cgroup has an ID (1-65535) and uses it for
walking the hierarchy, recording the owner of swapped-out resources, etc.

This patch tries to make more use of it by allocating memory cgroups
from a (virtual) array. This allows us to access a memory cgroup by
mem = mem_cgroup_base + id.

With this, we don't have to use css_lookup(), and we have a chance to
replace the pointer to mem_cgroup (8 bytes on 64bit) with an ID (2 bytes).
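
For illustration only (a condensed view of what this patch does; the real helper
added below also treats id == 0 as "no cgroup"):

    /* before: an IDR walk plus container_of() */
    css = css_lookup(&mem_cgroup_subsys, id);
    mem = container_of(css, struct mem_cgroup, css);

    /* after: plain pointer arithmetic into the virtually indexed array */
    mem = mem_cgroup_base + id;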

Thought:
- Although I added a CONFIG option in this patch, I wonder whether I should remove it.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
init/Kconfig | 14 +++++++-
mm/memcontrol.c | 93 ++++++++++++++++++++++++++++++--------------------------
2 files changed, 63 insertions(+), 44 deletions(-)

Index: mmotm-0719/init/Kconfig
===================================================================
--- mmotm-0719.orig/init/Kconfig
+++ mmotm-0719/init/Kconfig
@@ -555,7 +555,7 @@ config RESOURCE_COUNTERS

config CGROUP_MEM_RES_CTLR
bool "Memory Resource Controller for Control Groups"
- depends on CGROUPS && RESOURCE_COUNTERS
+ depends on CGROUPS && RESOURCE_COUNTERS && MMU
select MM_OWNER
help
Provides a memory resource controller that manages both anonymous
@@ -594,6 +594,18 @@ config CGROUP_MEM_RES_CTLR_SWAP
Now, memory usage of swap_cgroup is 2 bytes per entry. If swap page
size is 4096bytes, 512k per 1Gbytes of swap.

+config MEM_CGROUP_MAX_GROUPS
+ int "Maximum number of memory cgroups on a system"
+ range 1 65535 if 64BIT
+ default 8192 if 64BIT
+ range 1 4096 if 32BIT
+ default 2048 if 32BIT
+ help
+ The memory cgroup controller limits the number of groups that can be
+ created. Please select a suitable value. The more groups you allow,
+ the more memory will be consumed. This consumes vmalloc() area, so
+ the value should be kept small on 32bit arches.
+
menuconfig CGROUP_SCHED
bool "Group CPU scheduler"
depends on EXPERIMENTAL && CGROUPS
Index: mmotm-0719/mm/memcontrol.c
===================================================================
--- mmotm-0719.orig/mm/memcontrol.c
+++ mmotm-0719/mm/memcontrol.c
@@ -48,6 +48,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/virt-array.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -242,7 +243,8 @@ struct mem_cgroup {

/* For oom notifier event fd */
struct list_head oom_notify;
-
+ /* Used when varray is used */
+ int custom_id;
/*
* Should we move charges of a task when a task is moved into this
* mem_cgroup ? And what type of charges should we move ?
@@ -254,6 +256,8 @@ struct mem_cgroup {
struct mem_cgroup_stat_cpu *stat;
};

+static struct mem_cgroup *mem_cgroup_base __read_mostly;
+
/* Stuffs for move charges at task migration. */
/*
* Types of charges to be moved. "move_charge_at_immitgrate" is treated as a
@@ -341,6 +345,19 @@ static void mem_cgroup_put(struct mem_cg
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);

+/*
+ * A helper function to get mem_cgroup from ID. must be called under
+ * rcu_read_lock(). The caller must check css_is_removed() or some if
+ * it's concern. (dropping refcnt from swap can be called against removed
+ * memcg.)
+ */
+static struct mem_cgroup *id_to_mem(unsigned short id)
+{
+ if (id)
+ return mem_cgroup_base + id;
+ return NULL;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
{
@@ -1818,24 +1835,6 @@ static void mem_cgroup_cancel_charge(str
__mem_cgroup_cancel_charge(mem, 1);
}

-/*
- * A helper function to get mem_cgroup from ID. must be called under
- * rcu_read_lock(). The caller must check css_is_removed() or some if
- * it's concern. (dropping refcnt from swap can be called against removed
- * memcg.)
- */
-static struct mem_cgroup *mem_cgroup_lookup(unsigned short id)
-{
- struct cgroup_subsys_state *css;
-
- /* ID 0 is unused ID */
- if (!id)
- return NULL;
- css = css_lookup(&mem_cgroup_subsys, id);
- if (!css)
- return NULL;
- return container_of(css, struct mem_cgroup, css);
-}

struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
{
@@ -1856,7 +1855,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
ent.val = page_private(page);
id = lookup_swap_cgroup(ent);
rcu_read_lock();
- mem = mem_cgroup_lookup(id);
+ mem = id_to_mem(id);
if (mem && !css_tryget(&mem->css))
mem = NULL;
rcu_read_unlock();
@@ -2208,7 +2207,7 @@ __mem_cgroup_commit_charge_swapin(struct

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_mem(id);
if (memcg) {
/*
* This recorded memcg can be obsolete one. So, avoid
@@ -2472,7 +2471,7 @@ void mem_cgroup_uncharge_swap(swp_entry_

id = swap_cgroup_record(ent, 0);
rcu_read_lock();
- memcg = mem_cgroup_lookup(id);
+ memcg = id_to_mem(id);
if (memcg) {
/*
* We uncharge this because swap is freed.
@@ -3983,32 +3982,42 @@ static void free_mem_cgroup_per_zone_inf
kfree(mem->info.nodeinfo[node]);
}

-static struct mem_cgroup *mem_cgroup_alloc(void)
+struct virt_array memcg_varray;
+
+static int mem_cgroup_custom_id(struct cgroup_subsys *ss, struct cgroup *cont)
{
- struct mem_cgroup *mem;
- int size = sizeof(struct mem_cgroup);
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
+ return mem->custom_id;
+}

- /* Can be very big if MAX_NUMNODES is very big */
- if (size < PAGE_SIZE)
- mem = kmalloc(size, GFP_KERNEL);
- else
- mem = vmalloc(size);
+static struct mem_cgroup *mem_cgroup_alloc(struct cgroup *cgroup)
+{
+ struct mem_cgroup *mem;
+ int idx;

- if (!mem)
+ if (cgroup->parent == NULL) {
+ mem_cgroup_base = create_varray(&memcg_varray, sizeof(*mem),
+ CONFIG_MEM_CGROUP_MAX_GROUPS);
+ BUG_ON(IS_ERR_OR_NULL(mem_cgroup_base));
+ }
+ /* 0 is unused ID.(see css_id's spec). */
+ idx = varray_find_free_index(&memcg_varray, 1);
+ if (idx == memcg_varray.nelem)
return NULL;
-
- memset(mem, 0, size);
+ mem = alloc_varray_item(&memcg_varray, idx);
+ if (IS_ERR_OR_NULL(mem))
+ return NULL;
+ memset(mem, 0, sizeof(*mem));
mem->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!mem->stat) {
- if (size < PAGE_SIZE)
- kfree(mem);
- else
- vfree(mem);
+ free_varray_item(&memcg_varray, idx);
mem = NULL;
- }
+ } else
+ mem->custom_id = idx;
return mem;
}

+
/*
* At destroying mem_cgroup, references from swap_cgroup can remain.
* (scanning all at force_empty is too costly...)
@@ -4031,10 +4040,7 @@ static void __mem_cgroup_free(struct mem
free_mem_cgroup_per_zone_info(mem, node);

free_percpu(mem->stat);
- if (sizeof(struct mem_cgroup) < PAGE_SIZE)
- kfree(mem);
- else
- vfree(mem);
+ free_varray_item(&memcg_varray, mem->custom_id);
}

static void mem_cgroup_get(struct mem_cgroup *mem)
@@ -4111,7 +4117,7 @@ mem_cgroup_create(struct cgroup_subsys *
long error = -ENOMEM;
int node;

- mem = mem_cgroup_alloc();
+ mem = mem_cgroup_alloc(cont);
if (!mem)
return ERR_PTR(error);

@@ -4692,6 +4698,7 @@ struct cgroup_subsys mem_cgroup_subsys =
.can_attach = mem_cgroup_can_attach,
.cancel_attach = mem_cgroup_cancel_attach,
.attach = mem_cgroup_move_task,
+ .custom_id = mem_cgroup_custom_id,
.early_init = 0,
.use_id = 1,
};

2010-07-27 08:01:21

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

From: KAMEZAWA Hiroyuki <[email protected]>

Now, the address of a memory cgroup can be calculated from its ID without any
complex lookup. This patch replaces pc->mem_cgroup, turning it from a pointer
into an unsigned short. On 64bit architectures, this gives us 6 more bytes of
room per page_cgroup. 2 bytes are used for blkio-cgroup's page tracking; the
remaining 4 bytes will be used for some light-weight concurrent access.

We may be able to move this id into the flags field, but ...let's go step by step.
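
In short, the change amounts to this (a condensed sketch of the hunks below):

    /* store: remember only the css_id of the charged memcg */
    pc->mem_cgroup = css_id(&mem->css);

    /* load: turn the id back into a pointer via the virtually indexed array */
    mem = id_to_mem(pc->mem_cgroup);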

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/page_cgroup.h | 3 ++-
mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
mm/page_cgroup.c | 2 +-
3 files changed, 28 insertions(+), 17 deletions(-)

Index: mmotm-0719/include/linux/page_cgroup.h
===================================================================
--- mmotm-0719.orig/include/linux/page_cgroup.h
+++ mmotm-0719/include/linux/page_cgroup.h
@@ -12,7 +12,8 @@
*/
struct page_cgroup {
unsigned long flags;
- struct mem_cgroup *mem_cgroup;
+ unsigned short mem_cgroup; /* ID of assigned memory cgroup */
+ unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
struct list_head lru; /* per cgroup LRU list */
};
Index: mmotm-0719/mm/page_cgroup.c
===================================================================
--- mmotm-0719.orig/mm/page_cgroup.c
+++ mmotm-0719/mm/page_cgroup.c
@@ -14,7 +14,7 @@ static void __meminit
__init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
{
pc->flags = 0;
- pc->mem_cgroup = NULL;
+ pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
}
Index: mmotm-0719/mm/memcontrol.c
===================================================================
--- mmotm-0719.orig/mm/memcontrol.c
+++ mmotm-0719/mm/memcontrol.c
@@ -372,7 +372,7 @@ struct cgroup_subsys_state *mem_cgroup_c
static struct mem_cgroup_per_zone *
page_cgroup_zoneinfo(struct page_cgroup *pc)
{
- struct mem_cgroup *mem = pc->mem_cgroup;
+ struct mem_cgroup *mem = id_to_mem(pc->mem_cgroup);
int nid = page_cgroup_nid(pc);
int zid = page_cgroup_zid(pc);

@@ -577,7 +577,11 @@ static void mem_cgroup_charge_statistics
bool charge)
{
int val = (charge) ? 1 : -1;
-
+ if (pc->mem_cgroup == 0) {
+ show_stack(NULL, NULL);
+ printk("charge to 0\n");
+ while(1);
+ }
preempt_disable();

if (PageCgroupCache(pc))
@@ -714,6 +718,11 @@ static inline bool mem_cgroup_is_root(st
return (mem == root_mem_cgroup);
}

+static inline bool mem_cgroup_is_rootid(unsigned short id)
+{
+ return (id == 1);
+}
+
/*
* Following LRU functions are allowed to be used without PCG_LOCK.
* Operations are called by routine of global LRU independently from memcg.
@@ -746,7 +755,7 @@ void mem_cgroup_del_lru_list(struct page
*/
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) -= 1;
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
VM_BUG_ON(list_empty(&pc->lru));
list_del_init(&pc->lru);
@@ -773,7 +782,7 @@ void mem_cgroup_rotate_lru_list(struct p
*/
smp_rmb();
/* unused or root page is not rotated. */
- if (!PageCgroupUsed(pc) || mem_cgroup_is_root(pc->mem_cgroup))
+ if (!PageCgroupUsed(pc) || mem_cgroup_is_rootid(pc->mem_cgroup))
return;
mz = page_cgroup_zoneinfo(pc);
list_move(&pc->lru, &mz->lists[lru]);
@@ -799,7 +808,7 @@ void mem_cgroup_add_lru_list(struct page
mz = page_cgroup_zoneinfo(pc);
MEM_CGROUP_ZSTAT(mz, lru) += 1;
SetPageCgroupAcctLRU(pc);
- if (mem_cgroup_is_root(pc->mem_cgroup))
+ if (mem_cgroup_is_rootid(pc->mem_cgroup))
return;
list_add(&pc->lru, &mz->lists[lru]);
}
@@ -1467,7 +1476,7 @@ void mem_cgroup_update_file_mapped(struc
return;

lock_page_cgroup(pc);
- mem = pc->mem_cgroup;
+ mem = id_to_mem(pc->mem_cgroup);
if (!mem || !PageCgroupUsed(pc))
goto done;

@@ -1848,7 +1857,7 @@ struct mem_cgroup *try_get_mem_cgroup_fr
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_mem(pc->mem_cgroup);
if (mem && !css_tryget(&mem->css))
mem = NULL;
} else if (PageSwapCache(page)) {
@@ -1884,7 +1893,7 @@ static void __mem_cgroup_commit_charge(s
return;
}

- pc->mem_cgroup = mem;
+ pc->mem_cgroup = css_id(&mem->css);
/*
* We access a page_cgroup asynchronously without lock_page_cgroup().
* Especially when a page_cgroup is taken from a page, pc->mem_cgroup
@@ -1942,7 +1951,7 @@ static void __mem_cgroup_move_account(st
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
- VM_BUG_ON(pc->mem_cgroup != from);
+ VM_BUG_ON(id_to_mem(pc->mem_cgroup) != from);

if (PageCgroupFileMapped(pc)) {
/* Update mapped_file data for mem_cgroup */
@@ -1957,7 +1966,7 @@ static void __mem_cgroup_move_account(st
mem_cgroup_cancel_charge(from);

/* caller should have done css_get */
- pc->mem_cgroup = to;
+ pc->mem_cgroup = css_id(&to->css);
mem_cgroup_charge_statistics(to, pc, true);
/*
* We charges against "to" which may not have any tasks. Then, "to"
@@ -1977,7 +1986,7 @@ static int mem_cgroup_move_account(struc
{
int ret = -EINVAL;
lock_page_cgroup(pc);
- if (PageCgroupUsed(pc) && pc->mem_cgroup == from) {
+ if (PageCgroupUsed(pc) && id_to_mem(pc->mem_cgroup) == from) {
__mem_cgroup_move_account(pc, from, to, uncharge);
ret = 0;
}
@@ -2316,9 +2325,9 @@ __mem_cgroup_uncharge_common(struct page

lock_page_cgroup(pc);

- mem = pc->mem_cgroup;
+ mem = id_to_mem(pc->mem_cgroup);

- if (!PageCgroupUsed(pc))
+ if (!mem || !PageCgroupUsed(pc))
goto unlock_out;

switch (ctype) {
@@ -2561,7 +2570,7 @@ int mem_cgroup_prepare_migration(struct
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
- mem = pc->mem_cgroup;
+ mem = id_to_mem(pc->mem_cgroup);
css_get(&mem->css);
/*
* At migrating an anonymous page, its mapcount goes down
@@ -4384,7 +4393,8 @@ static int is_target_pte_for_mc(struct v
* mem_cgroup_move_account() checks the pc is valid or not under
* the lock.
*/
- if (PageCgroupUsed(pc) && pc->mem_cgroup == mc.from) {
+ if (PageCgroupUsed(pc) &&
+ id_to_mem(pc->mem_cgroup) == mc.from) {
ret = MC_TARGET_PAGE;
if (target)
target->page = page;

2010-07-27 08:04:33

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 5/7][memcg] memcg lockless update of file mapped


From: KAMEZAWA Hiroyuki <[email protected]>

When accounting file events per memory cgroup, we need to find the memory cgroup
via page_cgroup->mem_cgroup. Currently, we use lock_page_cgroup() for this.

But, considering the contexts in which page_cgroups for file pages are accessed,
we can use an alternative, light-weight mutual exclusion in most cases.
When handling file caches, the only race we have to take care of is a "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
updates are done while the page cache is in a stable state, we don't have to
take care of races with charge/uncharge.

Unlike charge/uncharge, "move" does not happen very frequently. It happens only at
rmdir() and task-moving (with special settings).
This patch adds a race checker for file-cache-status accounting vs. account
moving. A new per-cpu, per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move:
1. Increment it before starting the move.
2. Call synchronize_rcu().
3. Decrement it after the end of the move.
With this, the file-status-counting routine can check whether it needs to call
lock_page_cgroup(). In most cases, it doesn't need to, as shown in the sketch below.
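
Roughly, the reader side then looks like this (a simplified sketch of the
mem_cgroup_update_file_mapped() change later in this patch):

    rcu_read_lock();
    mem = id_to_mem(pc->mem_cgroup);
    need_lock = mem && mem_cgroup_is_moved(mem);
    if (need_lock) {
            /* a move may be in progress: serialize with it */
            lock_page_cgroup(pc);
            mem = id_to_mem(pc->mem_cgroup);    /* re-check under the lock */
    }
    /* ... update the per-cpu file statistics if mem && PageCgroupUsed(pc) ... */
    if (need_lock)
            unlock_page_cgroup(pc);
    rcu_read_unlock();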

Note: update_file_mapped is safe against charge/uncharge even if it's
not under address_space->tree_lock or lock_page(), because it runs under
page_table_lock(): nobody can unmap the page, so nobody can uncharge() it.



Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 60 insertions(+), 4 deletions(-)

Index: mmotm-0719/mm/memcontrol.c
===================================================================
--- mmotm-0719.orig/mm/memcontrol.c
+++ mmotm-0719/mm/memcontrol.c
@@ -89,6 +89,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
+ MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */

MEM_CGROUP_STAT_NSTATS,
};
@@ -1071,7 +1072,48 @@ static unsigned int get_swappiness(struc
return swappiness;
}

-/* A routine for testing mem is not under move_account */
+static void mem_cgroup_start_move(struct mem_cgroup *mem)
+{
+ int cpu;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
+ spin_unlock(&mc.lock);
+
+ synchronize_rcu();
+}
+
+static void mem_cgroup_end_move(struct mem_cgroup *mem)
+{
+ int cpu;
+
+ if (!mem)
+ return;
+ /* for fast checking in mem_cgroup_update_file_stat() etc..*/
+ spin_lock(&mc.lock);
+ for_each_possible_cpu(cpu)
+ per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
+ spin_unlock(&mc.lock);
+}
+
+/*
+ * mem_cgroup_is_moved -- checks whether a cgroup is the mc.from target.
+ * Used for avoiding races.
+ * mem_cgroup_under_move -- checks whether a cgroup is mc.from or mc.to or
+ * under their hierarchy. Used for waiting under
+ * memory pressure.
+ * The result of is_moved can be trusted until rcu_read_unlock().
+ * The caller must do
+ * rcu_read_lock();
+ * result = mem_cgroup_is_moved();
+ * .....make use of result here....
+ * rcu_read_unlock();
+ */
+static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
+{
+ return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
+}

static bool mem_cgroup_under_move(struct mem_cgroup *mem)
{
@@ -1470,13 +1512,21 @@ void mem_cgroup_update_file_mapped(struc
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
+ bool need_lock = false;

pc = lookup_page_cgroup(page);
if (unlikely(!pc))
return;
-
- lock_page_cgroup(pc);
+ rcu_read_lock();
mem = id_to_mem(pc->mem_cgroup);
+ if (!mem)
+ goto done;
+ need_lock = mem_cgroup_is_moved(mem);
+ if (need_lock) {
+ /* need to serialize with move_account */
+ lock_page_cgroup(pc);
+ mem = id_to_mem(pc->mem_cgroup);
+ }
if (!mem || !PageCgroupUsed(pc))
goto done;

@@ -1492,7 +1542,9 @@ void mem_cgroup_update_file_mapped(struc
}

done:
- unlock_page_cgroup(pc);
+ if (need_lock)
+ unlock_page_cgroup(pc);
+ rcu_read_unlock();
}

/*
@@ -3024,6 +3076,7 @@ move_account:
lru_add_drain_all();
drain_all_stock_sync();
ret = 0;
+ mem_cgroup_start_move(mem);
for_each_node_state(node, N_HIGH_MEMORY) {
for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
enum lru_list l;
@@ -3037,6 +3090,7 @@ move_account:
if (ret)
break;
}
+ mem_cgroup_end_move(mem);
memcg_oom_recover(mem);
/* it seems parent cgroup doesn't have enough mem */
if (ret == -ENOMEM)
@@ -4503,6 +4557,7 @@ static void mem_cgroup_clear_mc(void)
mc.to = NULL;
mc.moving_task = NULL;
spin_unlock(&mc.lock);
+ mem_cgroup_end_move(from);
memcg_oom_recover(from);
memcg_oom_recover(to);
wake_up_all(&mc.waitq);
@@ -4533,6 +4588,7 @@ static int mem_cgroup_can_attach(struct
VM_BUG_ON(mc.moved_charge);
VM_BUG_ON(mc.moved_swap);
VM_BUG_ON(mc.moving_task);
+ mem_cgroup_start_move(from);
spin_lock(&mc.lock);
mc.from = from;
mc.to = mem;

2010-07-27 08:05:50

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 6/7][memcg] generic file status update

This patch itself is not important. I just feel we'll need this kind of
cleanup in the future.

==
From: KAMEZAWA Hiroyuki <[email protected]>

Prepare for adding new statistics around file caches (dirty, writeback, etc.)
by using a unified macro and more generic names.
All counters will follow the same rule for updating.
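
For example (a sketch only; dirty accounting is not part of this series), adding
a new file statistic would then just mean a new enum entry plus calls at the
appropriate places:

    /* in the enum in memcontrol.h (sketch) */
    enum {
            __MEMCG_FILE_MAPPED,
            __MEMCG_FILE_DIRTY,     /* hypothetical new entry */
            NR_MEMCG_FILE_STAT
    };

    /* at the hypothetical dirtying / cleaning sites */
    mem_cgroup_update_file_stat(page, __MEMCG_FILE_DIRTY, 1);
    ...
    mem_cgroup_update_file_stat(page, __MEMCG_FILE_DIRTY, -1);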

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 10 +++++++---
include/linux/page_cgroup.h | 21 +++++++++++++++------
mm/memcontrol.c | 27 +++++++++++++++++----------
mm/rmap.c | 4 ++--
4 files changed, 41 insertions(+), 21 deletions(-)

Index: mmotm-0719/include/linux/memcontrol.h
===================================================================
--- mmotm-0719.orig/include/linux/memcontrol.h
+++ mmotm-0719/include/linux/memcontrol.h
@@ -121,7 +121,11 @@ static inline bool mem_cgroup_disabled(v
return false;
}

-void mem_cgroup_update_file_mapped(struct page *page, int val);
+enum {
+ __MEMCG_FILE_MAPPED,
+ NR_MEMCG_FILE_STAT
+};
+void mem_cgroup_update_file_stat(struct page *page, int stat, int val);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
@@ -292,8 +296,8 @@ mem_cgroup_print_oom_info(struct mem_cgr
{
}

-static inline void mem_cgroup_update_file_mapped(struct page *page,
- int val)
+static inline void
+mem_cgroup_update_file_stat(struct page *page, int stat, int val)
{
}

Index: mmotm-0719/mm/memcontrol.c
===================================================================
--- mmotm-0719.orig/mm/memcontrol.c
+++ mmotm-0719/mm/memcontrol.c
@@ -84,16 +84,20 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
-
+ MEM_CGROUP_FSTAT_BASE,
+ MEM_CGROUP_FSTAT_END
+ = MEM_CGROUP_FSTAT_BASE + NR_MEMCG_FILE_STAT,
MEM_CGROUP_STAT_NSTATS,
};

+#define MEM_CGROUP_STAT_FILE_MAPPED\
+ (MEM_CGROUP_FSTAT_BASE + __MEMCG_FILE_MAPPED)
+
struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
};
@@ -1508,7 +1512,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
*/
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
{
struct mem_cgroup *mem;
struct page_cgroup *pc;
@@ -1534,11 +1538,11 @@ void mem_cgroup_update_file_mapped(struc
* Preemption is already disabled. We can use __this_cpu_xxx
*/
if (val > 0) {
- __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- SetPageCgroupFileMapped(pc);
+ __this_cpu_inc(mem->stat->count[MEM_CGROUP_FSTAT_BASE + idx]);
+ SetPCGFileFlag(pc, idx);
} else {
- __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- ClearPageCgroupFileMapped(pc);
+ __this_cpu_dec(mem->stat->count[MEM_CGROUP_FSTAT_BASE + idx]);
+ ClearPCGFileFlag(pc, idx);
}

done:
@@ -1999,17 +2003,20 @@ static void __mem_cgroup_commit_charge(s
static void __mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
+ int i;
VM_BUG_ON(from == to);
VM_BUG_ON(PageLRU(pc->page));
VM_BUG_ON(!PageCgroupLocked(pc));
VM_BUG_ON(!PageCgroupUsed(pc));
VM_BUG_ON(id_to_mem(pc->mem_cgroup) != from);

- if (PageCgroupFileMapped(pc)) {
+ for (i = 0; i < NR_MEMCG_FILE_STAT; ++i) {
+ if (!TestPCGFileFlag(pc, i))
+ continue;
/* Update mapped_file data for mem_cgroup */
preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+ __this_cpu_dec(from->stat->count[MEM_CGROUP_FSTAT_BASE + i]);
+ __this_cpu_inc(to->stat->count[MEM_CGROUP_FSTAT_BASE + i]);
preempt_enable();
}
mem_cgroup_charge_statistics(from, pc, false);
Index: mmotm-0719/include/linux/page_cgroup.h
===================================================================
--- mmotm-0719.orig/include/linux/page_cgroup.h
+++ mmotm-0719/include/linux/page_cgroup.h
@@ -40,8 +40,8 @@ enum {
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
- PCG_FILE_MAPPED, /* page is accounted as "mapped" */
PCG_MIGRATION, /* under page migration */
+ PCG_FILE_FLAGS, /* see memcontrol.h */
};

#define TESTPCGFLAG(uname, lname) \
@@ -76,11 +76,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
TESTPCGFLAG(AcctLRU, ACCT_LRU)
TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)

-
-SETPCGFLAG(FileMapped, FILE_MAPPED)
-CLEARPCGFLAG(FileMapped, FILE_MAPPED)
-TESTPCGFLAG(FileMapped, FILE_MAPPED)
-
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
@@ -105,6 +100,20 @@ static inline void unlock_page_cgroup(st
bit_spin_unlock(PCG_LOCK, &pc->flags);
}

+static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
+{
+ set_bit(PCG_FILE_FLAGS + idx, &pc->flags);
+}
+
+static inline void ClearPCGFileFlag(struct page_cgroup *pc, int idx)
+{
+ clear_bit(PCG_FILE_FLAGS + idx, &pc->flags);
+}
+static inline bool TestPCGFileFlag(struct page_cgroup *pc, int idx)
+{
+ return test_bit(PCG_FILE_FLAGS + idx, &pc->flags);
+}
+
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct page_cgroup;

Index: mmotm-0719/mm/rmap.c
===================================================================
--- mmotm-0719.orig/mm/rmap.c
+++ mmotm-0719/mm/rmap.c
@@ -891,7 +891,7 @@ void page_add_file_rmap(struct page *pag
{
if (atomic_inc_and_test(&page->_mapcount)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_update_file_mapped(page, 1);
+ mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, 1);
}
}

@@ -929,7 +929,7 @@ void page_remove_rmap(struct page *page)
__dec_zone_page_state(page, NR_ANON_PAGES);
} else {
__dec_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_update_file_mapped(page, -1);
+ mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, -1);
}
/*
* It would be tidy to reset the PageAnon mapping here,

2010-07-27 08:07:21

by Kamezawa Hiroyuki

Subject: [RFC][PATCH 7/7][memcg] use spin lock instead of bit_spin_lock in page_cgroup

From: KAMEZAWA Hiroyuki <[email protected]>

This patch replaces page_cgroup's bit_spinlock with a spinlock. In general,
spinlock has a better implementation than bit_spin_lock, and we should use
it if we have room for it. On 64bit arches, we have an extra 4 bytes.
Let's use them.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
--
Index: mmotm-0719/include/linux/page_cgroup.h
===================================================================
--- mmotm-0719.orig/include/linux/page_cgroup.h
+++ mmotm-0719/include/linux/page_cgroup.h
@@ -10,8 +10,14 @@
* All page cgroups are allocated at boot or memory hotplug event,
* then the page cgroup for pfn always exists.
*/
+#ifdef CONFIG_64BIT
+#define PCG_HAS_SPINLOCK
+#endif
struct page_cgroup {
unsigned long flags;
+#ifdef PCG_HAS_SPINLOCK
+ spinlock_t lock;
+#endif
unsigned short mem_cgroup; /* ID of assigned memory cgroup */
unsigned short blk_cgroup; /* Not Used..but will be. */
struct page *page;
@@ -90,6 +96,16 @@ static inline enum zone_type page_cgroup
return page_zonenum(pc->page);
}

+#ifdef PCG_HAS_SPINLOCK
+static inline void lock_page_cgroup(struct page_cgroup *pc)
+{
+ spin_lock(&pc->lock);
+}
+static inline void unlock_page_cgroup(struct page_cgroup *pc)
+{
+ spin_unlock(&pc->lock);
+}
+#else
static inline void lock_page_cgroup(struct page_cgroup *pc)
{
bit_spin_lock(PCG_LOCK, &pc->flags);
@@ -99,6 +115,7 @@ static inline void unlock_page_cgroup(st
{
bit_spin_unlock(PCG_LOCK, &pc->flags);
}
+#endif

static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
{
Index: mmotm-0719/mm/page_cgroup.c
===================================================================
--- mmotm-0719.orig/mm/page_cgroup.c
+++ mmotm-0719/mm/page_cgroup.c
@@ -17,6 +17,9 @@ __init_page_cgroup(struct page_cgroup *p
pc->mem_cgroup = 0;
pc->page = pfn_to_page(pfn);
INIT_LIST_HEAD(&pc->lru);
+#ifdef PCG_HAS_SPINLOCK
+ spin_lock_init(&pc->lock);
+#endif
}
static unsigned long total_usage;

2010-07-27 18:29:53

by Jonathan Corbet

Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Tue, 27 Jul 2010 16:53:03 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> This virt-array allocates a virtally contiguous array via get_vm_area()
> and allows object allocation per an element of array.

Quick question: this looks a lot like the "flexible array" mechanism
which went in around a year ago, and which is documented in
Documentation/flexible-arrays.txt. I'm not sure we need two of
these... That said, it appears that there are still no users of
flexible arrays. If your virtually-indexed arrays provide something
that flexible arrays don't, perhaps your implementation should replace
flexible arrays?

Thanks,

jon

2010-07-28 00:13:45

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Tue, 27 Jul 2010 12:29:49 -0600
Jonathan Corbet <[email protected]> wrote:

> On Tue, 27 Jul 2010 16:53:03 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > This virt-array allocates a virtally contiguous array via get_vm_area()
> > and allows object allocation per an element of array.
>
> Quick question: this looks a lot like the "flexible array" mechanism
> which went in around a year ago, and which is documented in
> Documentation/flexible-arrays.txt. I'm not sure we need two of
> these... That said, it appears that there are still no users of
> flexible arrays. If your virtually-indexed arrays provide something
> that flexible arrays don't, perhaps your implementation should replace
> flexible arrays?

Hmm. As Documentation/flexible-arrays.txt says,

"The down sides are that the arrays cannot be indexed directly, individual object
size cannot exceed the system page size, and putting data into a flexible array
requires a copy operation. "

With this virtually indexed array:

- the array can be indexed directly.
- individual object size can be defined arbitrarily.
- putting data into the virt-array requires memory allocation via alloc_varray_item().

But the virtually indexed array has its own downsides, too.

- It uses vmalloc() area, which can be very limited on 32bit archs.
- It cannot be used on !MMU archs.
- It may consume many TLB entries because the vmalloc area tends not to be backed by hugepages.

In particular, I think the !MMU case is quite different, so there is a functional
difference. I implemented this to get quick, direct access to objects by index.
Otherwise, flex-array may be able to provide a more generic framework.

So I myself don't think virt-array is a replacement for flex-array.

A discussion of whether flex-array should be dropped or not is out of my scope, sorry.
I think you can ask to drop it simply because it's almost unused, without mentioning
virt-array.

Thanks,
-Kame


2010-07-28 00:18:21

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 0/7][memcg] towards I/O aware memory cgroup

On Tue, 27 Jul 2010 16:51:55 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

>
> From a view of patch management, this set is a mixture of a few features for
> memcg, and I should divide them to some groups. But, at first, I'd like to
> show the total view. This set is consists from 5 sets. Main purpose is
> create a room in page_cgroup for I/O tracking and add light-weight access method
> for file-cache related accounting.
>
> 1. An virtual-indexed array.
> 2,3. Use virtual-indexed array for id-to-memory_cgroup detection.
> 4. modify page_cgroup to use ID instead of pointer, this gives us enough
> spaces for further memory tracking.
> 5,6 Use light-weight locking mechanism for file related accounting.
> 7. use spin_lock instead of bit_spinlock.
>
>
> As a function, patch 5,6 can be an independent patch and I'll accept
> reordering series of patch if someone requests.
> But we'll need all, I think.
> (irq_save for patch 7 will be required later.)
>
> Any comments are welcome.
>

This was based on mmotm-0719, but mmotm-0727 has shipped. I'll post a rebased
version if someone wants to test.

Thanks,
-Kame

2010-07-28 02:31:29

by Vivek Goyal

Subject: Re: [RFC][PATCH 2/7][memcg] cgroup arbitrary ID allocation

On Tue, Jul 27, 2010 at 04:54:17PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When a subsystem want to make use of "id" more, it's necessary to
> manage the id at cgroup subsystem creation time. But, now,
> because of the order of cgroup creation callback, subsystem can't
> declare the id it wants. This patch allows subsystem to use customized
> ID for themselves.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

[..]
> Index: mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> ===================================================================
> --- mmotm-2.6.35-0719.orig/Documentation/cgroups/cgroups.txt
> +++ mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> @@ -621,6 +621,15 @@ and root cgroup. Currently this will onl
> the default hierarchy (which never has sub-cgroups) and a hierarchy
> that is being created/destroyed (and hence has no sub-cgroups).
>
> +void custom_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
> +
> +Called at assigning a new ID to cgroup subsystem state struct. This
> +is called when ss->use_id == true. If this function is not provided,
> +a new ID is automatically assigned. If you enable ss->use_id,
> +you can use css_lookup() and css_get_next() to access "css" objects
> +via IDs.
> +

A couple of lines explaining why a subsystem would want to assign its
own ids rather than be happy with the generic cgroup-assigned id would be helpful.
In this case, I think you are using this id as an index into an array
and want to control the index, hence you seem to be doing it.

But again, I am not sure why you want to control the index.

Vivek

2010-07-28 02:39:22

by Vivek Goyal

Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Tue, Jul 27, 2010 at 04:56:29PM +0900, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, addresses of memory cgroup can be calculated by their ID without complex.
> This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> some light-weight concurrent access.
>
> We may able to move this id onto flags field but ...go step by step.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/page_cgroup.h | 3 ++-
> mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
> mm/page_cgroup.c | 2 +-
> 3 files changed, 28 insertions(+), 17 deletions(-)
>
> Index: mmotm-0719/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-0719.orig/include/linux/page_cgroup.h
> +++ mmotm-0719/include/linux/page_cgroup.h
> @@ -12,7 +12,8 @@
> */
> struct page_cgroup {
> unsigned long flags;
> - struct mem_cgroup *mem_cgroup;
> + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> + unsigned short blk_cgroup; /* Not Used..but will be. */

So will I later have to use virtually indexed arrays in the blkio controller?
Or are you just using virtually indexed arrays for lookup speed, so that
I can continue to use css_lookup() and not worry about using virtually
indexed arrays?

So the idea is that when a page is allocated, we also store the blk_group
id, and once that page is submitted for writeback, we should be able
to associate it with the right blkio group?

Vivek

2010-07-28 02:40:26

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 2/7][memcg] cgroup arbitrary ID allocation

On Tue, 27 Jul 2010 22:30:27 -0400
Vivek Goyal <[email protected]> wrote:

> > Index: mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> > ===================================================================
> > --- mmotm-2.6.35-0719.orig/Documentation/cgroups/cgroups.txt
> > +++ mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> > @@ -621,6 +621,15 @@ and root cgroup. Currently this will onl
> > the default hierarchy (which never has sub-cgroups) and a hierarchy
> > that is being created/destroyed (and hence has no sub-cgroups).
> >
> > +void custom_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > +
> > +Called at assigning a new ID to cgroup subsystem state struct. This
> > +is called when ss->use_id == true. If this function is not provided,
> > +a new ID is automatically assigned. If you enable ss->use_id,
> > +you can use css_lookup() and css_get_next() to access "css" objects
> > +via IDs.
> > +
>
> Couple of lines to explain why a subsystem would like to assign its
> own ids and not be happy with generic cgroup assigned id be helpful.
> In this case, I think you are using this id as index into array
> and want to control the index, hence you seem to be doing it.
>
> But I am not sure again why do you want to control index?
>

Now, the subsystem-creation/id-allocation order is

->create()
alloc_id()

On the other hand, the "id" of a memory cgroup is just determined by its place in
the virtually indexed array, as
memcg = mem_cgroup_base + id

and this "id" is determined at create().

If the "id" is determined regardless of the memory cgroup's placement, it's of no use.
My original design of css_id() allocated the id in create(), but that was moved to
the generic part. So this is an expected change in my plan.

We have 2 choices:
id = alloc_id()
create(id)
or
this patch.

Both are okay with me. But allocating the id before create() may add some ugly rollback.

Thanks,
-Kame

2010-07-28 02:48:52

by Kamezawa Hiroyuki

Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Tue, 27 Jul 2010 22:39:04 -0400
Vivek Goyal <[email protected]> wrote:

> On Tue, Jul 27, 2010 at 04:56:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, addresses of memory cgroup can be calculated by their ID without complex.
> > This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> > On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> > Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> > some light-weight concurrent access.
> >
> > We may able to move this id onto flags field but ...go step by step.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/page_cgroup.h | 3 ++-
> > mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
> > mm/page_cgroup.c | 2 +-
> > 3 files changed, 28 insertions(+), 17 deletions(-)
> >
> > Index: mmotm-0719/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-0719.orig/include/linux/page_cgroup.h
> > +++ mmotm-0719/include/linux/page_cgroup.h
> > @@ -12,7 +12,8 @@
> > */
> > struct page_cgroup {
> > unsigned long flags;
> > - struct mem_cgroup *mem_cgroup;
> > + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > + unsigned short blk_cgroup; /* Not Used..but will be. */
>
> So later I shall have to use virtually indexed arrays in blkio controller?
> Or you are just using virtually indexed arrays for lookup speed and
> I can continue to use css_lookup() and not worry about using virtually
> indexed arrays.
>
Yes. You can use css_lookup() even if it's slow.

> So the idea is that when a page is allocated, also store the blk_group
> id and once that page is submitted for writeback, we should be able
> to associate it to right blkio group?
>
The blk_cgroup id can be attached whenever you want; please overwrite
page_cgroup->blk_cgroup when it's necessary.
Did you read Ikeda's patch? I myself don't have patches at this point.
This is just to make room for recording the blkio ID, which has been requested
for a year.

Hmm, but page-allocation time doesn't sound very good to me.

Thanks.
-Kame

2010-07-28 03:11:10

by Vivek Goyal

Subject: Re: [RFC][PATCH 2/7][memcg] cgroup arbitrary ID allocation

On Wed, Jul 28, 2010 at 11:35:29AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Jul 2010 22:30:27 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > > Index: mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> > > ===================================================================
> > > --- mmotm-2.6.35-0719.orig/Documentation/cgroups/cgroups.txt
> > > +++ mmotm-2.6.35-0719/Documentation/cgroups/cgroups.txt
> > > @@ -621,6 +621,15 @@ and root cgroup. Currently this will onl
> > > the default hierarchy (which never has sub-cgroups) and a hierarchy
> > > that is being created/destroyed (and hence has no sub-cgroups).
> > >
> > > +void custom_id(struct cgroup_subsys *ss, struct cgroup *cgrp)
> > > +
> > > +Called at assigning a new ID to cgroup subsystem state struct. This
> > > +is called when ss->use_id == true. If this function is not provided,
> > > +a new ID is automatically assigned. If you enable ss->use_id,
> > > +you can use css_lookup() and css_get_next() to access "css" objects
> > > +via IDs.
> > > +
> >
> > Couple of lines to explain why a subsystem would like to assign its
> > own ids and not be happy with generic cgroup assigned id be helpful.
> > In this case, I think you are using this id as index into array
> > and want to control the index, hence you seem to be doing it.
> >
> > But I am not sure again why do you want to control index?
> >
>
> Now, the subsystem allocation/id-allocation order is
>
> ->create()
> alloc_id.
>
> Otherwise "id" of memory cgroup is just determined by the place in virtual-indexed
> array.
> As
> memcg = mem_cgroup_base + id
>
> This "id" is determined at create().
>
> If "id" is determined regardless of memory cgroup's placement, it's of no use.
> My original design of css_id() allocates id in create() but it was moved to
> generic part. So, this is expected change in my plan.
>
> We have 2 choices.
> id = alloc_id()
> create(id)
> or
> this patch.
>
> Both are okay for me. But alloc id before create() may add some ugly rollback.

Ok, so in the current design the css_id is not available at the time of
mem_cgroup instantiation, so you don't know at what index to put the newly
instantiated mem_cgroup object; hence the notion of letting the subsystem decide
the css_id, which the cgroup core can query from the subsystem later.

I don't have any preference. Anything simple works..

Vivek

2010-07-28 03:14:20

by Vivek Goyal

Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Wed, Jul 28, 2010 at 11:44:02AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 27 Jul 2010 22:39:04 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Tue, Jul 27, 2010 at 04:56:29PM +0900, KAMEZAWA Hiroyuki wrote:
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > Now, addresses of memory cgroup can be calculated by their ID without complex.
> > > This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
> > > On 64bit architecture, this offers us more 6bytes room per page_cgroup.
> > > Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
> > > some light-weight concurrent access.
> > >
> > > We may able to move this id onto flags field but ...go step by step.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > > ---
> > > include/linux/page_cgroup.h | 3 ++-
> > > mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
> > > mm/page_cgroup.c | 2 +-
> > > 3 files changed, 28 insertions(+), 17 deletions(-)
> > >
> > > Index: mmotm-0719/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-0719.orig/include/linux/page_cgroup.h
> > > +++ mmotm-0719/include/linux/page_cgroup.h
> > > @@ -12,7 +12,8 @@
> > > */
> > > struct page_cgroup {
> > > unsigned long flags;
> > > - struct mem_cgroup *mem_cgroup;
> > > + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > > + unsigned short blk_cgroup; /* Not Used..but will be. */
> >
> > So later I shall have to use virtually indexed arrays in blkio controller?
> > Or you are just using virtually indexed arrays for lookup speed and
> > I can continue to use css_lookup() and not worry about using virtually
> > indexed arrays.
> >
> yes. you can use css_lookup() even if it's slow.
>

Ok.

> > So the idea is that when a page is allocated, also store the blk_group
> > id and once that page is submitted for writeback, we should be able
> > to associate it to right blkio group?
> >
> blk_cgroup id can be attached whenever you wants. please overwrite
> page_cgroup->blk_cgroup when it's necessary.

> Did you read Ikeda's patch ? I myself doesn't have patches at this point.
> This is just for make a room for recording blkio-ID, which was requested
> for a year.

I have not read his patches yet. IIRC, previously there were issues
regarding which group should be charged for a page: the thread that
allocated it or the thread that last wrote to it, etc. I guess
we can sort that out later.

>
> Hmm, but page-allocation-time doesn't sound very good for me.
>

Why?

Vivek

2010-07-28 03:23:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Tue, 27 Jul 2010 23:13:58 -0400
Vivek Goyal <[email protected]> wrote:

> > > So the idea is that when a page is allocated, also store the blk_group
> > > id and once that page is submitted for writeback, we should be able
> > > to associate it to right blkio group?
> > >
> > blk_cgroup id can be attached whenever you wants. please overwrite
> > page_cgroup->blk_cgroup when it's necessary.
>
> > Did you read Ikeda's patch ? I myself doesn't have patches at this point.
> > This is just for make a room for recording blkio-ID, which was requested
> > for a year.
>
> I have not read his patches yet. IIRC, previously there were issues
> regarding which group should be charged for a page: the thread that
> allocated it or the thread that last wrote to it, etc. I guess
> we can sort that out later.
>
> >
> > Hmm, but page-allocation-time doesn't sound very good for me.
> >
>
> Why?
>

As you wrote, by attaching the ID when a page cache is added, we'll have
many chances of free-riders until it's paged out. So, adding some
point for resetting the owner may be good.

But considering real-world usage, I may be wrong.
There will not be many free-riders in the real world, especially at write().
Then, page-allocation time may be good.

(Because databases don't use the page cache, there will be no big random-write
application.)

Thanks,
-Kame

2010-07-28 03:26:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Wed, 28 Jul 2010 12:18:20 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> > > Hmm, but page-allocation-time doesn't sound very good for me.
> > >
> >
> > Why?
> >
>
> As you wrote, by attaching the ID when a page cache is added, we'll have
> many chances of free-riders until it's paged out. So, adding some
> point for resetting the owner may be good.
>
> But considering real-world usage, I may be wrong.
> There will not be many free-riders in the real world, especially at write().
> Then, page-allocation time may be good.
>
> (Because databases don't use the page cache, there will be no big random-write
> application.)
>

Sorry, one more reason. The memory cgroup has a lot of complex code supporting
move_account, i.e., re-attaching the memory cgroup of each page.
So, if you take care of tasks moving between groups, blkio-ID may have
some problems if you only support allocation-time accounting.

Thanks,
-Kame

2010-07-28 06:17:21

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/7][memcg] use spin lock instead of bit_spin_lock in page_cgroup

KAMEZAWA Hiroyuki <[email protected]> writes:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> This patch replaces page_cgroup's bit_spinlock with spinlock. In general,
> spinlock has good implementation than bit_spin_lock and we should use
> it if we have a room for it. In 64bit arch, we have extra 4bytes.
> Let's use it.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> --
> Index: mmotm-0719/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-0719.orig/include/linux/page_cgroup.h
> +++ mmotm-0719/include/linux/page_cgroup.h
> @@ -10,8 +10,14 @@
> * All page cgroups are allocated at boot or memory hotplug event,
> * then the page cgroup for pfn always exists.
> */
> +#ifdef CONFIG_64BIT
> +#define PCG_HAS_SPINLOCK
> +#endif
> struct page_cgroup {
> unsigned long flags;
> +#ifdef PCG_HAS_SPINLOCK
> + spinlock_t lock;
> +#endif
> unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> unsigned short blk_cgroup; /* Not Used..but will be. */
> struct page *page;
> @@ -90,6 +96,16 @@ static inline enum zone_type page_cgroup
> return page_zonenum(pc->page);
> }
>
> +#ifdef PCG_HAS_SPINLOCK
> +static inline void lock_page_cgroup(struct page_cgroup *pc)
> +{
> + spin_lock(&pc->lock);
> +}

This is a minor issue, but this patch breaks the use of PageCgroupLocked().
For example, this from __mem_cgroup_move_account() causes a panic:
VM_BUG_ON(!PageCgroupLocked(pc));

I assume that this patch should also delete the following:
- PCG_LOCK definition from page_cgroup.h
- TESTPCGFLAG(Locked, LOCK) from page_cgroup.h
- PCGF_LOCK from memcontrol.c

> +static inline void unlock_page_cgroup(struct page_cgroup *pc)
> +{
> + spin_unlock(&pc->lock);
> +}
> +#else
> static inline void lock_page_cgroup(struct page_cgroup *pc)
> {
> bit_spin_lock(PCG_LOCK, &pc->flags);
> @@ -99,6 +115,7 @@ static inline void unlock_page_cgroup(st
> {
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
> +#endif
>
> static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
> {
> Index: mmotm-0719/mm/page_cgroup.c
> ===================================================================
> --- mmotm-0719.orig/mm/page_cgroup.c
> +++ mmotm-0719/mm/page_cgroup.c
> @@ -17,6 +17,9 @@ __init_page_cgroup(struct page_cgroup *p
> pc->mem_cgroup = 0;
> pc->page = pfn_to_page(pfn);
> INIT_LIST_HEAD(&pc->lru);
> +#ifdef PCG_HAS_SPINLOCK
> + spin_lock_init(&pc->lock);
> +#endif
> }
> static unsigned long total_usage;
>

2010-07-28 06:25:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/7][memcg] use spin lock instead of bit_spin_lock in page_cgroup

On Tue, 27 Jul 2010 23:16:54 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This patch replaces page_cgroup's bit_spinlock with spinlock. In general,
> > spinlock has good implementation than bit_spin_lock and we should use
> > it if we have a room for it. In 64bit arch, we have extra 4bytes.
> > Let's use it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > --
> > Index: mmotm-0719/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-0719.orig/include/linux/page_cgroup.h
> > +++ mmotm-0719/include/linux/page_cgroup.h
> > @@ -10,8 +10,14 @@
> > * All page cgroups are allocated at boot or memory hotplug event,
> > * then the page cgroup for pfn always exists.
> > */
> > +#ifdef CONFIG_64BIT
> > +#define PCG_HAS_SPINLOCK
> > +#endif
> > struct page_cgroup {
> > unsigned long flags;
> > +#ifdef PCG_HAS_SPINLOCK
> > + spinlock_t lock;
> > +#endif
> > unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > unsigned short blk_cgroup; /* Not Used..but will be. */
> > struct page *page;
> > @@ -90,6 +96,16 @@ static inline enum zone_type page_cgroup
> > return page_zonenum(pc->page);
> > }
> >
> > +#ifdef PCG_HAS_SPINLOCK
> > +static inline void lock_page_cgroup(struct page_cgroup *pc)
> > +{
> > + spin_lock(&pc->lock);
> > +}
>
> This is a minor issue, but this patch breaks the use of PageCgroupLocked().
> For example, this from __mem_cgroup_move_account() causes a panic:
> VM_BUG_ON(!PageCgroupLocked(pc));
>
> I assume that this patch should also delete the following:
> - PCG_LOCK definition from page_cgroup.h
> - TESTPCGFLAG(Locked, LOCK) from page_cgroup.h
> - PCGF_LOCK from memcontrol.c
>

yes. thank you.

-Kame



> > +static inline void unlock_page_cgroup(struct page_cgroup *pc)
> > +{
> > + spin_unlock(&pc->lock);
> > +}
> > +#else
> > static inline void lock_page_cgroup(struct page_cgroup *pc)
> > {
> > bit_spin_lock(PCG_LOCK, &pc->flags);
> > @@ -99,6 +115,7 @@ static inline void unlock_page_cgroup(st
> > {
> > bit_spin_unlock(PCG_LOCK, &pc->flags);
> > }
> > +#endif
> >
> > static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
> > {
> > Index: mmotm-0719/mm/page_cgroup.c
> > ===================================================================
> > --- mmotm-0719.orig/mm/page_cgroup.c
> > +++ mmotm-0719/mm/page_cgroup.c
> > @@ -17,6 +17,9 @@ __init_page_cgroup(struct page_cgroup *p
> > pc->mem_cgroup = 0;
> > pc->page = pfn_to_page(pfn);
> > INIT_LIST_HEAD(&pc->lru);
> > +#ifdef PCG_HAS_SPINLOCK
> > + spin_lock_init(&pc->lock);
> > +#endif
> > }
> > static unsigned long total_usage;
> >
>

2010-07-28 07:09:52

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/7][memcg] memcg lockless update of file mapped

KAMEZAWA Hiroyuki <[email protected]> writes:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> At accounting file events per memory cgroup, we need to find memory cgroup
> via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
>
> But, considering the context which page-cgroup for files are accessed,
> we can use alternative light-weight mutual execusion in the most case.
> At handling file-caches, the only race we have to take care of is "moving"
> account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> update is done while the page-cache is in stable state, we don't have to
> take care of race with charge/uncharge.
>
> Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> rmdir() and task-moving (with a special settings.)
> This patch adds a race-checker for file-cache-status accounting v.s. account
> moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> The routine for account move
> 1. Increment it before start moving
> 2. Call synchronize_rcu()
> 3. Decrement it after the end of moving.
> By this, file-status-counting routine can check it needs to call
> lock_page_cgroup(). In most case, I doesn't need to call it.
>
> Note: update_file_mapped is safe against charge/uncharge even if it's
> not under address_space->tree_lock or lock_page(). Because it's under
> page_table_lock(), anyone can't unmap it...then, anyone can't uncharge().
>
>
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 60 insertions(+), 4 deletions(-)
>
> Index: mmotm-0719/mm/memcontrol.c
> ===================================================================
> --- mmotm-0719.orig/mm/memcontrol.c
> +++ mmotm-0719/mm/memcontrol.c
> @@ -89,6 +89,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
>
> MEM_CGROUP_STAT_NSTATS,
> };
> @@ -1071,7 +1072,48 @@ static unsigned int get_swappiness(struc
> return swappiness;
> }
>
> -/* A routine for testing mem is not under move_account */
> +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> + spin_unlock(&mc.lock);
> +
> + synchronize_rcu();
> +}
> +
> +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> +{
> + int cpu;
> +
> + if (!mem)
> + return;
> + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> + spin_lock(&mc.lock);
> + for_each_possible_cpu(cpu)
> + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> + spin_unlock(&mc.lock);
> +}
> +
> +/*
> + * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
> + * used for avoiding race.
> + * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
> + * under hierarchy of them. used for waiting at
> + * memory pressure.
> + * Result of is_moved can be trusted until the end of rcu_read_unlock().
> + * The caller must do
> + * rcu_read_lock();
> + * result = mem_cgroup_is_moved();
> + * .....make use of result here....
> + * rcu_read_unlock();
> + */
> +static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
> +{

Could we add an assertion to confirm locking contract is upheld:
VM_BUG_ON(!rcu_read_lock_held());

> + return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> +}
>
> static bool mem_cgroup_under_move(struct mem_cgroup *mem)
> {
> @@ -1470,13 +1512,21 @@ void mem_cgroup_update_file_mapped(struc
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> + bool need_lock = false;
>
> pc = lookup_page_cgroup(page);
> if (unlikely(!pc))
> return;
> -
> - lock_page_cgroup(pc);
> + rcu_read_lock();
> mem = id_to_mem(pc->mem_cgroup);
> + if (!mem)
> + goto done;
> + need_lock = mem_cgroup_is_moved(mem);
> + if (need_lock) {
> + /* need to serialize with move_account */
> + lock_page_cgroup(pc);
> + mem = id_to_mem(pc->mem_cgroup);
> + }
> if (!mem || !PageCgroupUsed(pc))
> goto done;

Could we add a preemption() check here to ensure that the
__this_cpu_xxx() is safe to use?

/*
* Preemption is already disabled. We can use __this_cpu_xxx
*/
+ VM_BUG_ON(preemptible());

> @@ -1492,7 +1542,9 @@ void mem_cgroup_update_file_mapped(struc
> }
>
> done:
> - unlock_page_cgroup(pc);
> + if (need_lock)
> + unlock_page_cgroup(pc);
> + rcu_read_unlock();
> }
>
> /*
> @@ -3024,6 +3076,7 @@ move_account:
> lru_add_drain_all();
> drain_all_stock_sync();
> ret = 0;
> + mem_cgroup_start_move(mem);
> for_each_node_state(node, N_HIGH_MEMORY) {
> for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> enum lru_list l;
> @@ -3037,6 +3090,7 @@ move_account:
> if (ret)
> break;
> }
> + mem_cgroup_end_move(mem);
> memcg_oom_recover(mem);
> /* it seems parent cgroup doesn't have enough mem */
> if (ret == -ENOMEM)
> @@ -4503,6 +4557,7 @@ static void mem_cgroup_clear_mc(void)
> mc.to = NULL;
> mc.moving_task = NULL;
> spin_unlock(&mc.lock);
> + mem_cgroup_end_move(from);
> memcg_oom_recover(from);
> memcg_oom_recover(to);
> wake_up_all(&mc.waitq);
> @@ -4533,6 +4588,7 @@ static int mem_cgroup_can_attach(struct
> VM_BUG_ON(mc.moved_charge);
> VM_BUG_ON(mc.moved_swap);
> VM_BUG_ON(mc.moving_task);
> + mem_cgroup_start_move(from);
> spin_lock(&mc.lock);
> mc.from = from;
> mc.to = mem;

2010-07-28 07:13:03

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH 6/7][memcg] generic file status update

KAMEZAWA Hiroyuki <[email protected]> writes:

> This patch itself is not important. I just feel we need this kind of
> clean up in future.
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
> Using a unified macro and more generic names.
> All counters will have the same rule for updating.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/memcontrol.h | 10 +++++++---
> include/linux/page_cgroup.h | 21 +++++++++++++++------
> mm/memcontrol.c | 27 +++++++++++++++++----------
> mm/rmap.c | 4 ++--
> 4 files changed, 41 insertions(+), 21 deletions(-)
>
> Index: mmotm-0719/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-0719.orig/include/linux/memcontrol.h
> +++ mmotm-0719/include/linux/memcontrol.h
> @@ -121,7 +121,11 @@ static inline bool mem_cgroup_disabled(v
> return false;
> }
>
> -void mem_cgroup_update_file_mapped(struct page *page, int val);
> +enum {
> + __MEMCG_FILE_MAPPED,
> + NR_MEMCG_FILE_STAT
> +};

These two stat values need to be defined outside of "#if
CONFIG_CGROUP_MEM_RES_CTLR" (above) so that rmap.c can still use the
following when memcg is not compiled in:
mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, 1);
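
A minimal sketch of that arrangement (the exact placement is an assumption;
the names mirror the quoted patch), keeping the indices visible
unconditionally and stubbing the updater so the rmap.c call builds with or
without memcg:

struct page;

enum {
	__MEMCG_FILE_MAPPED,
	NR_MEMCG_FILE_STAT
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
void mem_cgroup_update_file_stat(struct page *page, int stat, int val);
#else
static inline void
mem_cgroup_update_file_stat(struct page *page, int stat, int val)
{
}
#endif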

> +void mem_cgroup_update_file_stat(struct page *page, int stat, int val);
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask, int nid,
> int zid);
> @@ -292,8 +296,8 @@ mem_cgroup_print_oom_info(struct mem_cgr
> {
> }
>
> -static inline void mem_cgroup_update_file_mapped(struct page *page,
> - int val)
> +static inline void
> +mem_cgroup_update_file_stat(struct page *page, int stat, int val);

Trailing ';' needs to be removed.

> {
> }
>
> Index: mmotm-0719/mm/memcontrol.c
> ===================================================================
> --- mmotm-0719.orig/mm/memcontrol.c
> +++ mmotm-0719/mm/memcontrol.c
> @@ -84,16 +84,20 @@ enum mem_cgroup_stat_index {
> */
> MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
> MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> -
> + MEM_CGROUP_FSTAT_BASE,
> + MEM_CGROUP_FSTAT_END
> + = MEM_CGROUP_FSTAT_BASE + NR_MEMCG_FILE_STAT,
> MEM_CGROUP_STAT_NSTATS,
> };
>
> +#define MEM_CGROUP_STAT_FILE_MAPPED\
> + (MEM_CGROUP_FSTAT_BASE + __MEMCG_FILE_MAPPED)
> +
> struct mem_cgroup_stat_cpu {
> s64 count[MEM_CGROUP_STAT_NSTATS];
> };
> @@ -1508,7 +1512,7 @@ bool mem_cgroup_handle_oom(struct mem_cg
> * Currently used to update mapped file statistics, but the routine can be
> * generalized to update other statistics as well.
> */
> -void mem_cgroup_update_file_mapped(struct page *page, int val)
> +void mem_cgroup_update_file_stat(struct page *page, int idx, int val)
> {
> struct mem_cgroup *mem;
> struct page_cgroup *pc;
> @@ -1534,11 +1538,11 @@ void mem_cgroup_update_file_mapped(struc
> * Preemption is already disabled. We can use __this_cpu_xxx
> */
> if (val > 0) {
> - __this_cpu_inc(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - SetPageCgroupFileMapped(pc);
> + __this_cpu_inc(mem->stat->count[MEM_CGROUP_FSTAT_BASE + idx]);
> + SetPCGFileFlag(pc, idx);
> } else {
> - __this_cpu_dec(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - ClearPageCgroupFileMapped(pc);
> + __this_cpu_dec(mem->stat->count[MEM_CGROUP_FSTAT_BASE + idx]);
> + ClearPCGFileFlag(pc, idx);
> }
>
> done:
> @@ -1999,17 +2003,20 @@ static void __mem_cgroup_commit_charge(s
> static void __mem_cgroup_move_account(struct page_cgroup *pc,
> struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
> {
> + int i;
> VM_BUG_ON(from == to);
> VM_BUG_ON(PageLRU(pc->page));
> VM_BUG_ON(!PageCgroupLocked(pc));
> VM_BUG_ON(!PageCgroupUsed(pc));
> VM_BUG_ON(id_to_mem(pc->mem_cgroup) != from);
>
> - if (PageCgroupFileMapped(pc)) {
> + for (i = 0; i < NR_MEMCG_FILE_STAT; ++i) {
> + if (!TestPCGFileFlag(pc, i))
> + continue;
> /* Update mapped_file data for mem_cgroup */
> preempt_disable();
> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> + __this_cpu_dec(from->stat->count[MEM_CGROUP_FSTAT_BASE + i]);
> + __this_cpu_inc(to->stat->count[MEM_CGROUP_FSTAT_BASE + i]);
> preempt_enable();
> }
> mem_cgroup_charge_statistics(from, pc, false);
> Index: mmotm-0719/include/linux/page_cgroup.h
> ===================================================================
> --- mmotm-0719.orig/include/linux/page_cgroup.h
> +++ mmotm-0719/include/linux/page_cgroup.h
> @@ -40,8 +40,8 @@ enum {
> PCG_CACHE, /* charged as cache */
> PCG_USED, /* this object is in use. */
> PCG_ACCT_LRU, /* page has been accounted for */
> - PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> PCG_MIGRATION, /* under page migration */
> + PCG_FILE_FLAGS, /* see memcontrol.h */
> };
>
> #define TESTPCGFLAG(uname, lname) \
> @@ -76,11 +76,6 @@ CLEARPCGFLAG(AcctLRU, ACCT_LRU)
> TESTPCGFLAG(AcctLRU, ACCT_LRU)
> TESTCLEARPCGFLAG(AcctLRU, ACCT_LRU)
>
> -
> -SETPCGFLAG(FileMapped, FILE_MAPPED)
> -CLEARPCGFLAG(FileMapped, FILE_MAPPED)
> -TESTPCGFLAG(FileMapped, FILE_MAPPED)
> -
> SETPCGFLAG(Migration, MIGRATION)
> CLEARPCGFLAG(Migration, MIGRATION)
> TESTPCGFLAG(Migration, MIGRATION)
> @@ -105,6 +100,20 @@ static inline void unlock_page_cgroup(st
> bit_spin_unlock(PCG_LOCK, &pc->flags);
> }
>
> +static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
> +{
> + set_bit(PCG_FILE_FLAGS + idx, &pc->flags);
> +}
> +
> +static inline void ClearPCGFileFlag(struct page_cgroup *pc, int idx)
> +{
> + clear_bit(PCG_FILE_FLAGS + idx, &pc->flags);
> +}
> +static inline bool TestPCGFileFlag(struct page_cgroup *pc, int idx)
> +{
> + return test_bit(PCG_FILE_FLAGS + idx, &pc->flags);
> +}
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct page_cgroup;
>
> Index: mmotm-0719/mm/rmap.c
> ===================================================================
> --- mmotm-0719.orig/mm/rmap.c
> +++ mmotm-0719/mm/rmap.c
> @@ -891,7 +891,7 @@ void page_add_file_rmap(struct page *pag
> {
> if (atomic_inc_and_test(&page->_mapcount)) {
> __inc_zone_page_state(page, NR_FILE_MAPPED);
> - mem_cgroup_update_file_mapped(page, 1);
> + mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, 1);
> }
> }
>
> @@ -929,7 +929,7 @@ void page_remove_rmap(struct page *page)
> __dec_zone_page_state(page, NR_ANON_PAGES);
> } else {
> __dec_zone_page_state(page, NR_FILE_MAPPED);
> - mem_cgroup_update_file_mapped(page, -1);
> + mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, -1);
> }
> /*
> * It would be tidy to reset the PageAnon mapping here,

2010-07-28 07:17:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 5/7][memcg] memcg lockless update of file mapped

On Wed, 28 Jul 2010 00:09:21 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > At accounting file events per memory cgroup, we need to find memory cgroup
> > via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().
> >
> > But, considering the context which page-cgroup for files are accessed,
> > we can use alternative light-weight mutual execusion in the most case.
> > At handling file-caches, the only race we have to take care of is "moving"
> > account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
> > update is done while the page-cache is in stable state, we don't have to
> > take care of race with charge/uncharge.
> >
> > Unlike charge/uncharge, "move" happens not so frequently. It happens only when
> > rmdir() and task-moving (with a special settings.)
> > This patch adds a race-checker for file-cache-status accounting v.s. account
> > moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
> > The routine for account move
> > 1. Increment it before start moving
> > 2. Call synchronize_rcu()
> > 3. Decrement it after the end of moving.
> > By this, file-status-counting routine can check it needs to call
> > lock_page_cgroup(). In most case, I doesn't need to call it.
> >
> > Note: update_file_mapped is safe against charge/uncharge even if it's
> > not under address_space->tree_lock or lock_page(). Because it's under
> > page_table_lock(), anyone can't unmap it...then, anyone can't uncharge().
> >
> >
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > mm/memcontrol.c | 64 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 60 insertions(+), 4 deletions(-)
> >
> > Index: mmotm-0719/mm/memcontrol.c
> > ===================================================================
> > --- mmotm-0719.orig/mm/memcontrol.c
> > +++ mmotm-0719/mm/memcontrol.c
> > @@ -89,6 +89,7 @@ enum mem_cgroup_stat_index {
> > MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
> > MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > MEM_CGROUP_EVENTS, /* incremented at every pagein/pageout */
> > + MEM_CGROUP_ON_MOVE, /* A check for locking move account/status */
> >
> > MEM_CGROUP_STAT_NSTATS,
> > };
> > @@ -1071,7 +1072,48 @@ static unsigned int get_swappiness(struc
> > return swappiness;
> > }
> >
> > -/* A routine for testing mem is not under move_account */
> > +static void mem_cgroup_start_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) += 1;
> > + spin_unlock(&mc.lock);
> > +
> > + synchronize_rcu();
> > +}
> > +
> > +static void mem_cgroup_end_move(struct mem_cgroup *mem)
> > +{
> > + int cpu;
> > +
> > + if (!mem)
> > + return;
> > + /* for fast checking in mem_cgroup_update_file_stat() etc..*/
> > + spin_lock(&mc.lock);
> > + for_each_possible_cpu(cpu)
> > + per_cpu(mem->stat->count[MEM_CGROUP_ON_MOVE], cpu) -= 1;
> > + spin_unlock(&mc.lock);
> > +}
> > +
> > +/*
> > + * mem_cgroup_is_moved -- checking a cgroup is mc.from target or not.
> > + * used for avoiding race.
> > + * mem_cgroup_under_move -- checking a cgroup is mc.from or mc.to or
> > + * under hierarchy of them. used for waiting at
> > + * memory pressure.
> > + * Result of is_moved can be trusted until the end of rcu_read_unlock().
> > + * The caller must do
> > + * rcu_read_lock();
> > + * result = mem_cgroup_is_moved();
> > + * .....make use of result here....
> > + * rcu_read_unlock();
> > + */
> > +static bool mem_cgroup_is_moved(struct mem_cgroup *mem)
> > +{
>
> Could we add an assertion to confirm locking contract is upheld:
> VM_BUG_ON(!rcu_read_lock_held());
>

Hmm, there is only one caller... I'll either add one or not make
this a function.


> > + return this_cpu_read(mem->stat->count[MEM_CGROUP_ON_MOVE]) > 0;
> > +}
> >
> > static bool mem_cgroup_under_move(struct mem_cgroup *mem)
> > {
> > @@ -1470,13 +1512,21 @@ void mem_cgroup_update_file_mapped(struc
> > {
> > struct mem_cgroup *mem;
> > struct page_cgroup *pc;
> > + bool need_lock = false;
> >
> > pc = lookup_page_cgroup(page);
> > if (unlikely(!pc))
> > return;
> > -
> > - lock_page_cgroup(pc);
> > + rcu_read_lock();
> > mem = id_to_mem(pc->mem_cgroup);
> > + if (!mem)
> > + goto done;
> > + need_lock = mem_cgroup_is_moved(mem);
> > + if (need_lock) {
> > + /* need to serialize with move_account */
> > + lock_page_cgroup(pc);
> > + mem = id_to_mem(pc->mem_cgroup);
> > + }
> > if (!mem || !PageCgroupUsed(pc))
> > goto done;
>
> Could we add a preemption() check here to ensure that the
> __this_cpu_xxx() is safe to use?
>
Hmm, ok.


Thanks,
-Kame
> /*
> * Preemption is already disabled. We can use __this_cpu_xxx
> */
> + VM_BUG_ON(preemptible());
>
> > @@ -1492,7 +1542,9 @@ void mem_cgroup_update_file_mapped(struc
> > }
> >
> > done:
> > - unlock_page_cgroup(pc);
> > + if (need_lock)
> > + unlock_page_cgroup(pc);
> > + rcu_read_unlock();
> > }
> >
> > /*
> > @@ -3024,6 +3076,7 @@ move_account:
> > lru_add_drain_all();
> > drain_all_stock_sync();
> > ret = 0;
> > + mem_cgroup_start_move(mem);
> > for_each_node_state(node, N_HIGH_MEMORY) {
> > for (zid = 0; !ret && zid < MAX_NR_ZONES; zid++) {
> > enum lru_list l;
> > @@ -3037,6 +3090,7 @@ move_account:
> > if (ret)
> > break;
> > }
> > + mem_cgroup_end_move(mem);
> > memcg_oom_recover(mem);
> > /* it seems parent cgroup doesn't have enough mem */
> > if (ret == -ENOMEM)
> > @@ -4503,6 +4557,7 @@ static void mem_cgroup_clear_mc(void)
> > mc.to = NULL;
> > mc.moving_task = NULL;
> > spin_unlock(&mc.lock);
> > + mem_cgroup_end_move(from);
> > memcg_oom_recover(from);
> > memcg_oom_recover(to);
> > wake_up_all(&mc.waitq);
> > @@ -4533,6 +4588,7 @@ static int mem_cgroup_can_attach(struct
> > VM_BUG_ON(mc.moved_charge);
> > VM_BUG_ON(mc.moved_swap);
> > VM_BUG_ON(mc.moving_task);
> > + mem_cgroup_start_move(from);
> > spin_lock(&mc.lock);
> > mc.from = from;
> > mc.to = mem;
>

2010-07-28 07:19:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 6/7][memcg] generic file status update

On Wed, 28 Jul 2010 00:12:13 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > This patch itself is not important. I just feel we need this kind of
> > clean up in future.
> >
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
> > Using a unified macro and more generic names.
> > All counters will have the same rule for updating.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > include/linux/memcontrol.h | 10 +++++++---
> > include/linux/page_cgroup.h | 21 +++++++++++++++------
> > mm/memcontrol.c | 27 +++++++++++++++++----------
> > mm/rmap.c | 4 ++--
> > 4 files changed, 41 insertions(+), 21 deletions(-)
> >
> > Index: mmotm-0719/include/linux/memcontrol.h
> > ===================================================================
> > --- mmotm-0719.orig/include/linux/memcontrol.h
> > +++ mmotm-0719/include/linux/memcontrol.h
> > @@ -121,7 +121,11 @@ static inline bool mem_cgroup_disabled(v
> > return false;
> > }
> >
> > -void mem_cgroup_update_file_mapped(struct page *page, int val);
> > +enum {
> > + __MEMCG_FILE_MAPPED,
> > + NR_MEMCG_FILE_STAT
> > +};
>
> These two stat values need to be defined outside of "#if
> CONFIG_CGROUP_MEM_RES_CTLR" (above) so that rmap.c can still use the
> following when memcg is not compiled in:
> mem_cgroup_update_file_stat(page, __MEMCG_FILE_MAPPED, 1);
>

OK, or I'll just not remove mem_cgroup_update_file_mapped().



> > +void mem_cgroup_update_file_stat(struct page *page, int stat, int val);
> > unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> > gfp_t gfp_mask, int nid,
> > int zid);
> > @@ -292,8 +296,8 @@ mem_cgroup_print_oom_info(struct mem_cgr
> > {
> > }
> >
> > -static inline void mem_cgroup_update_file_mapped(struct page *page,
> > - int val)
> > +static inline void
> > +mem_cgroup_update_file_stat(struct page *page, int stat, int val);
>
> Trailing ';' needs to be removed.
>
ok, will do.

Bye.
-Kame

2010-07-28 14:17:27

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

On Wed, Jul 28, 2010 at 12:21:28PM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 28 Jul 2010 12:18:20 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > > > Hmm, but page-allocation-time doesn't sound very good for me.
> > > >
> > >
> > > Why?
> > >
> >
> > As you wrote, by attaching the ID when a page cache is added, we'll have
> > many chances of free-riders until it's paged out. So, adding some
> > point for resetting the owner may be good.
> >
> > But considering real-world usage, I may be wrong.
> > There will not be many free-riders in the real world, especially at write().
> > Then, page-allocation time may be good.
> >
> > (Because databases don't use the page cache, there will be no big random-write
> > application.)
> >
>
> Sorry, one more reason. The memory cgroup has a lot of complex code supporting
> move_account, i.e., re-attaching the memory cgroup of each page.
> So, if you take care of tasks moving between groups, blkio-ID may have
> some problems if you only support allocation-time accounting.

I think initially we can just keep it simple for the blkio controller and
not move page charges across blkio cgroups when a process moves.

Vivek

2010-07-28 14:42:41

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/7][memcg] towards I/O aware memory cgroup

* KAMEZAWA Hiroyuki <[email protected]> [2010-07-27 16:51:55]:

>
> From a view of patch management, this set is a mixture of a few features for
> memcg, and I should divide them to some groups. But, at first, I'd like to
> show the total view. This set is consists from 5 sets. Main purpose is
> create a room in page_cgroup for I/O tracking and add light-weight access method
> for file-cache related accounting.
>
> 1. An virtual-indexed array.
> 2,3. Use virtual-indexed array for id-to-memory_cgroup detection.
> 4. modify page_cgroup to use ID instead of pointer, this gives us enough
> spaces for further memory tracking.

Yes, this is good, I've been meaning to merge the flags and the
pointer. Thanks for looking into this.

> 5,6 Use light-weight locking mechanism for file related accounting.
> 7. use spin_lock instead of bit_spinlock.
>
>
> As a function, patch 5,6 can be an independent patch and I'll accept
> reordering series of patch if someone requests.
> But we'll need all, I think.
> (irq_save for patch 7 will be required later.)
>

--
Three Cheers,
Balbir

2010-07-28 15:47:36

by Munehiro Ikeda

[permalink] [raw]
Subject: Re: [RFC][PATCH 4/7][memcg] memcg use ID in page_cgroup

Vivek Goyal wrote, on 07/27/2010 11:13 PM:
> On Wed, Jul 28, 2010 at 11:44:02AM +0900, KAMEZAWA Hiroyuki wrote:
>> On Tue, 27 Jul 2010 22:39:04 -0400
>> Vivek Goyal<[email protected]> wrote:
>>
>>> On Tue, Jul 27, 2010 at 04:56:29PM +0900, KAMEZAWA Hiroyuki wrote:
>>>> From: KAMEZAWA Hiroyuki<[email protected]>
>>>>
>>>> Now, addresses of memory cgroup can be calculated by their ID without complex.
>>>> This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
>>>> On 64bit architecture, this offers us more 6bytes room per page_cgroup.
>>>> Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
>>>> some light-weight concurrent access.
>>>>
>>>> We may able to move this id onto flags field but ...go step by step.
>>>>
>>>> Signed-off-by: KAMEZAWA Hiroyuki<[email protected]>
>>>> ---
>>>> include/linux/page_cgroup.h | 3 ++-
>>>> mm/memcontrol.c | 40 +++++++++++++++++++++++++---------------
>>>> mm/page_cgroup.c | 2 +-
>>>> 3 files changed, 28 insertions(+), 17 deletions(-)
>>>>
>>>> Index: mmotm-0719/include/linux/page_cgroup.h
>>>> ===================================================================
>>>> --- mmotm-0719.orig/include/linux/page_cgroup.h
>>>> +++ mmotm-0719/include/linux/page_cgroup.h
>>>> @@ -12,7 +12,8 @@
>>>> */
>>>> struct page_cgroup {
>>>> unsigned long flags;
>>>> - struct mem_cgroup *mem_cgroup;
>>>> + unsigned short mem_cgroup; /* ID of assigned memory cgroup */
>>>> + unsigned short blk_cgroup; /* Not Used..but will be. */
>>>
>>> So later I shall have to use virtually indexed arrays in blkio controller?
>>> Or you are just using virtually indexed arrays for lookup speed and
>>> I can continue to use css_lookup() and not worry about using virtually
>>> indexed arrays.
>>>
>> yes. you can use css_lookup() even if it's slow.
>>
>
> Ok.
>
>>> So the idea is that when a page is allocated, also store the blk_group
>>> id and once that page is submitted for writeback, we should be able
>>> to associate it to right blkio group?
>>>
>> blk_cgroup id can be attached whenever you wants. please overwrite
>> page_cgroup->blk_cgroup when it's necessary.
>
>> Did you read Ikeda's patch ? I myself doesn't have patches at this point.
>> This is just for make a room for recording blkio-ID, which was requested
>> for a year.
>
> I have not read his patches yet. IIRC, previously there were issues
> regarding which group should be charged for a page: the thread that
> allocated it or the thread that last wrote to it, etc. I guess
> we can sort that out later.

Absolutely.
iotrack, the part of the blkio-cgroup async-write patch set I posted, charges
the thread (to be exact, the blkio-cgroup to which the thread belongs) that
dirtied the page first. Though that choice may be controversial and we need
to discuss who should be charged, adding pc->blk_cgroup is not a problem
because blkio-cgroup can overwrite it at any time, as Kame said above.
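
As a rough sketch of what that could look like once pc->blk_cgroup exists
(nothing here is from a posted patch; current_blkio_id() and the helper
name are hypothetical):

unsigned short current_blkio_id(void);	/* hypothetical: ID of current's blkio cgroup */

static void blkio_record_page_owner(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (!pc)
		return;
	/*
	 * Overwriting is fine; whether the owner is the first dirtier or
	 * the last writer is policy decided by whoever calls this.
	 */
	pc->blk_cgroup = current_blkio_id();
}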

Beyond that, adding pc->blk_cgroup is a big step for us. Right now I encode
and store the ID in pc->flags; pc->blk_cgroup is more straightforward and is
what we have been looking forward to. Thanks Kame!



--
IKEDA, Munehiro
NEC Corporation of America
[email protected]

2010-07-28 19:45:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Tue, 27 Jul 2010 16:53:03 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> This virt-array allocates a virtally contiguous array via get_vm_area()
> and allows object allocation per an element of array.
> Physical pages are used only for used items in the array.
>
> - At first, the user has to create an array by create_virt_array().
> - At using an element, virt_array_alloc_index(index) should be called.
> - At freeing an element, virt_array_free_index(index) should be called.
> - At destroying, destroy_virt_array() should be called.
>
> Item used/unused status is controlled by bitmap and back-end physical
> pages are automatically allocated/freed. This is useful when you
> want to access objects by index in light weight. For example,
>
> create_virt_array(va);
> struct your_struct *objmap = va->address;
> Then, you can access your objects by objmap[i].
>
> In usual case, holding reference by index rather than pointer can save memory.
> But index -> object lookup cost cannot be negligible. In such case,
> this virt-array may be helpful. Ah yes, if lookup performance is not important,
> using radix-tree will be better (from TLB point of view). This virty-array
> may consume VMALLOC area too much. and alloc/free routine is very slow.
>
> Changelog:
> - fixed bugs in bitmap ops.
> - add offset for find_free_index.
>

My gut reaction to this sort of thing is "run away in terror". It
encourages kernel developers to operate like lackadaisical userspace
developers and to assume that underlying code can perform heroic and
immortal feats. But it can't. This is the kernel and the kernel is a
tough and hostile place and callers should be careful and defensive and
take great efforts to minimise the strain they put upon other systems.

IOW, can we avoid doing this?

>
> ...
>
> +void free_varray_item(struct virt_array *v, int idx)
> +{
> + mutex_lock(&v->mutex);
> + __free_unmap_entry(v, idx);
> + mutex_unlock(&v->mutex);
> +}

It's generally a bad idea for library code to perform its own locking.
In this case we've just made this whole facility inaccessible to code
which runs from interrupt or atomic contexts.

> + pg[0] = alloc_page(GFP_KERNEL);

And hard-wiring GFP_KERNEL makes this facility inaccessible to GFP_NOIO
and GFP_NOFS contexts as well.

2010-07-29 00:37:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Wed, 28 Jul 2010 12:45:13 -0700
Andrew Morton <[email protected]> wrote:

> On Tue, 27 Jul 2010 16:53:03 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This virt-array allocates a virtally contiguous array via get_vm_area()
> > and allows object allocation per an element of array.
> > Physical pages are used only for used items in the array.
> >
> > - At first, the user has to create an array by create_virt_array().
> > - At using an element, virt_array_alloc_index(index) should be called.
> > - At freeing an element, virt_array_free_index(index) should be called.
> > - At destroying, destroy_virt_array() should be called.
> >
> > Item used/unused status is controlled by bitmap and back-end physical
> > pages are automatically allocated/freed. This is useful when you
> > want to access objects by index in light weight. For example,
> >
> > create_virt_array(va);
> > struct your_struct *objmap = va->address;
> > Then, you can access your objects by objmap[i].
> >
> > In usual case, holding reference by index rather than pointer can save memory.
> > But index -> object lookup cost cannot be negligible. In such case,
> > this virt-array may be helpful. Ah yes, if lookup performance is not important,
> > using radix-tree will be better (from TLB point of view). This virty-array
> > may consume VMALLOC area too much. and alloc/free routine is very slow.
> >
> > Changelog:
> > - fixed bugs in bitmap ops.
> > - add offset for find_free_index.
> >
>
> My gut reaction to this sort of thing is "run away in terror". It
> encourages kernel developers to operate like lackadaisical userspace
> developers and to assume that underlying code can perform heroic and
> immortal feats. But it can't. This is the kernel and the kernel is a
> tough and hostile place and callers should be careful and defensive and
> take great efforts to minimise the strain they put upon other systems.
>
> IOW, can we avoid doing this?
>

Hmm. To pack more information into page_cgroup, I'd like to reduce its size.
One candidate is pc->mem_cgroup, a pointer. mem_cgroup has its own ID already.

If we replace pc->mem_cgroup with an ID instead of a pointer,
- struct mem_cgroup *mem = pc->mem_cgroup;
+ struct mem_cgroup *mem = id_to_mem_cgroup(pc->mem_cgroup);

a call like this will be added.

The problem is that we have to call the id_to_mem_cgroup routine even in
add_to_lru() and delete_from_lru(), and no kind of "lookup" routine is
fast enough there. So, I'd like to use an array[].
(mem_cgroup_add_to_lru() and delete_from_lru() are already slow paths...
and I hear there is a brave guy who uses 2000+ memory cgroups on a host.)

With this, we just pay the cost of a pointer calculation and incur no extra memory access.

The basic idea of this virt-array[] implementation is vmemmap (virtual memmap).
But that is highly hard-coded per architecture and uses a very large amount of space.
So, I wrote a generic library routine.

Another idea I considered for making this kind of lookup fast is to prepare a
per-cpu lookup cache. But it adds much more complicated control and can't
guarantee that it always works well.

IOW, I love this virt-array[]. But it's okay to move this routine into
memcontrol.c and not expose any interface to others if you have concerns.

If you'd rather recommend paying 16 more bytes per page than implementing a
thing like this, okay, that's much easier.



> >
> > ...
> >
> > +void free_varray_item(struct virt_array *v, int idx)
> > +{
> > + mutex_lock(&v->mutex);
> > + __free_unmap_entry(v, idx);
> > + mutex_unlock(&v->mutex);
> > +}
>
> It's generally a bad idea for library code to perform its own locking.
> In this case we've just made this whole facility inaccessible to code
> which runs from interrupt or atomic contexts.
>
Hmm, IIUC, because this uses code like vmalloc, it can't be used from atomic
context anyway. But OK, I'll move the locking out and add a comment.
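
A sketch of that direction (free_varray_item and __free_unmap_entry are from
the quoted patch; the caller-side lock is an assumption):

/*
 * library side: no internal locking; the caller must serialize access
 * to the same virt_array
 */
void free_varray_item(struct virt_array *v, int idx)
{
	__free_unmap_entry(v, idx);
}

/* caller side (e.g. memcontrol.c) wraps it in its own lock */
static DEFINE_MUTEX(my_varray_mutex);	/* assumed caller-side lock */

static void my_free_item(struct virt_array *v, int idx)
{
	mutex_lock(&my_varray_mutex);
	free_varray_item(v, idx);
	mutex_unlock(&my_varray_mutex);
}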


> > + pg[0] = alloc_page(GFP_KERNEL);
>
> And hard-wiring GFP_KERNEL makes this facility inaccessible to GFP_NOIO
> and GFP_NOFS contexts as well.
>
OK, I'll pass a gfp mask.

Thanks,
-Kame

2010-07-29 04:31:55

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Thu, 29 Jul 2010 09:32:26 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 28 Jul 2010 12:45:13 -0700
> Andrew Morton <[email protected]> wrote:

> > My gut reaction to this sort of thing is "run away in terror". It
> > encourages kernel developers to operate like lackadaisical userspace
> > developers and to assume that underlying code can perform heroic and
> > immortal feats. But it can't. This is the kernel and the kernel is a
> > tough and hostile place and callers should be careful and defensive and
> > take great efforts to minimise the strain they put upon other systems.
> >
> > IOW, can we avoid doing this?
> >
>

I'll use pre-allocated pointer array in the next version. It's simple even
if a bit slow.

==
struct mem_cgroup *mem_cgroups[CONFIG_MAX_MEM_CGROUPS] __read_mostly;
#define id_to_memcg(id) mem_cgroups[id];
==
But this can use the kernel's hugepage mapping... so it can be quick.
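
A rough sketch of how ids might then be handed out from that flat table
(mem_cgroup_alloc_id()/mem_cgroup_free_id() and the lock are illustrative
assumptions, not posted code; starting at 1 assumes id 0 keeps meaning
"no cgroup"):

static DEFINE_SPINLOCK(memcg_id_lock);

static int mem_cgroup_alloc_id(struct mem_cgroup *memcg)
{
	int id;

	spin_lock(&memcg_id_lock);
	for (id = 1; id < CONFIG_MAX_MEM_CGROUPS; id++) {
		if (!mem_cgroups[id]) {
			mem_cgroups[id] = memcg;	/* slot index becomes the id */
			break;
		}
	}
	spin_unlock(&memcg_id_lock);
	return id < CONFIG_MAX_MEM_CGROUPS ? id : -ENOSPC;
}

static void mem_cgroup_free_id(unsigned short id)
{
	spin_lock(&memcg_id_lock);
	mem_cgroups[id] = NULL;
	spin_unlock(&memcg_id_lock);
}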

Thanks,
-Kame


2010-08-02 18:00:59

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

* KAMEZAWA Hiroyuki <[email protected]> [2010-07-29 13:27:03]:

> On Thu, 29 Jul 2010 09:32:26 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Wed, 28 Jul 2010 12:45:13 -0700
> > Andrew Morton <[email protected]> wrote:
>
> > > My gut reaction to this sort of thing is "run away in terror". It
> > > encourages kernel developers to operate like lackadaisical userspace
> > > developers and to assume that underlying code can perform heroic and
> > > immortal feats. But it can't. This is the kernel and the kernel is a
> > > tough and hostile place and callers should be careful and defensive and
> > > take great efforts to minimise the strain they put upon other systems.
> > >
> > > IOW, can we avoid doing this?
> > >
> >
>
> I'll use pre-allocated pointer array in the next version. It's simple even
> if a bit slow.
>
> ==
> struct mem_cgroup *mem_cgroups[CONFIG_MAX_MEM_CGROUPS] __read_mostly;
> #define id_to_memcg(id) mem_cgroups[id];
> ==

Hmm.. I thought we were going to reuse css_id() and use that to get to
the cgroup. Maybe I am missing something.


--
Three Cheers,
Balbir

2010-08-02 18:04:36

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7][memcg] cgroup arbitarary ID allocation

* KAMEZAWA Hiroyuki <[email protected]> [2010-07-27 16:54:17]:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When a subsystem want to make use of "id" more, it's necessary to
> manage the id at cgroup subsystem creation time. But, now,
> because of the order of cgroup creation callback, subsystem can't
> declare the id it wants. This patch allows subsystem to use customized
> ID for themselves.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---

What happens if the id is taken already? Could you please explain how
this is used?

--
Three Cheers,
Balbir

2010-08-02 18:09:17

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/7][memcg] use spin lock instead of bit_spin_lock in page_cgroup

* Greg Thelen <[email protected]> [2010-07-27 23:16:54]:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > This patch replaces page_cgroup's bit_spinlock with spinlock. In general,
> > spinlock has good implementation than bit_spin_lock and we should use
> > it if we have a room for it. In 64bit arch, we have extra 4bytes.
> > Let's use it.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > --
> > Index: mmotm-0719/include/linux/page_cgroup.h
> > ===================================================================
> > --- mmotm-0719.orig/include/linux/page_cgroup.h
> > +++ mmotm-0719/include/linux/page_cgroup.h
> > @@ -10,8 +10,14 @@
> > * All page cgroups are allocated at boot or memory hotplug event,
> > * then the page cgroup for pfn always exists.
> > */
> > +#ifdef CONFIG_64BIT
> > +#define PCG_HAS_SPINLOCK
> > +#endif
> > struct page_cgroup {
> > unsigned long flags;
> > +#ifdef PCG_HAS_SPINLOCK
> > + spinlock_t lock;
> > +#endif
> > unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > unsigned short blk_cgroup; /* Not Used..but will be. */
> > struct page *page;
> > @@ -90,6 +96,16 @@ static inline enum zone_type page_cgroup
> > return page_zonenum(pc->page);
> > }
> >
> > +#ifdef PCG_HAS_SPINLOCK
> > +static inline void lock_page_cgroup(struct page_cgroup *pc)
> > +{
> > + spin_lock(&pc->lock);
> > +}
>
> This is a minor issue, but this patch breaks the use of PageCgroupLocked().
> For example, this from __mem_cgroup_move_account() causes a panic:
> VM_BUG_ON(!PageCgroupLocked(pc));
>
> I assume that this patch should also delete the following:
> - PCG_LOCK definition from page_cgroup.h
> - TESTPCGFLAG(Locked, LOCK) from page_cgroup.h
> - PCGF_LOCK from memcontrol.c
>


Good catch! But from my understanding of the code we use spinlock_t
only for 64 bit systems, so we still need the PCG* and TESTPGFLAGS.

> > +static inline void unlock_page_cgroup(struct page_cgroup *pc)
> > +{
> > + spin_unlock(&pc->lock);
> > +}
> > +#else
> > static inline void lock_page_cgroup(struct page_cgroup *pc)
> > {
> > bit_spin_lock(PCG_LOCK, &pc->flags);
> > @@ -99,6 +115,7 @@ static inline void unlock_page_cgroup(st
> > {
> > bit_spin_unlock(PCG_LOCK, &pc->flags);
> > }
> > +#endif
> >
> > static inline void SetPCGFileFlag(struct page_cgroup *pc, int idx)
> > {
> > Index: mmotm-0719/mm/page_cgroup.c
> > ===================================================================
> > --- mmotm-0719.orig/mm/page_cgroup.c
> > +++ mmotm-0719/mm/page_cgroup.c
> > @@ -17,6 +17,9 @@ __init_page_cgroup(struct page_cgroup *p
> > pc->mem_cgroup = 0;
> > pc->page = pfn_to_page(pfn);
> > INIT_LIST_HEAD(&pc->lru);
> > +#ifdef PCG_HAS_SPINLOCK
> > + spin_lock_init(&pc->lock);
> > +#endif
> > }
> > static unsigned long total_usage;
> >

--
Three Cheers,
Balbir

2010-08-02 23:49:57

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/7][memcg] virtually indexed array library.

On Mon, 2 Aug 2010 23:30:51 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-07-29 13:27:03]:
>
> > On Thu, 29 Jul 2010 09:32:26 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Wed, 28 Jul 2010 12:45:13 -0700
> > > Andrew Morton <[email protected]> wrote:
> >
> > > > My gut reaction to this sort of thing is "run away in terror". It
> > > > encourages kernel developers to operate like lackadaisical userspace
> > > > developers and to assume that underlying code can perform heroic and
> > > > immortal feats. But it can't. This is the kernel and the kernel is a
> > > > tough and hostile place and callers should be careful and defensive and
> > > > take great efforts to minimise the strain they put upon other systems.
> > > >
> > > > IOW, can we avoid doing this?
> > > >
> > >
> >
> > I'll use pre-allocated pointer array in the next version. It's simple even
> > if a bit slow.
> >
> > ==
> > struct mem_cgroup *mem_cgroups[CONFIG_MAX_MEM_CGROUPS] __read_mostly;
> > #define id_to_memcg(id) mem_cgroups[id];
> > ==
>
> Hmm.. I thought we were going to reuse css_id() and use that to get to
> the cgroup. Maybe I am missing something.
>
?
lookup_css_id() requires a multi-level table lookup because of the radix-tree,
and the compiler can't generate optimized code for it. A linear table lookup is quick.

-Kame

2010-08-02 23:50:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 2/7][memcg] cgroup arbitarary ID allocation

On Mon, 2 Aug 2010 23:34:29 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-07-27 16:54:17]:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > When a subsystem want to make use of "id" more, it's necessary to
> > manage the id at cgroup subsystem creation time. But, now,
> > because of the order of cgroup creation callback, subsystem can't
> > declare the id it wants. This patch allows subsystem to use customized
> > ID for themselves.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
>
> What happens if the id is taken already? Could you please explain how
> this is used?
>

This patch is dropped.

-Kame

2010-08-02 23:51:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH 7/7][memcg] use spin lock instead of bit_spin_lock in page_cgroup

On Mon, 2 Aug 2010 23:39:11 +0530
Balbir Singh <[email protected]> wrote:

> * Greg Thelen <[email protected]> [2010-07-27 23:16:54]:
>
> > KAMEZAWA Hiroyuki <[email protected]> writes:
> >
> > > From: KAMEZAWA Hiroyuki <[email protected]>
> > >
> > > This patch replaces page_cgroup's bit_spinlock with spinlock. In general,
> > > spinlock has good implementation than bit_spin_lock and we should use
> > > it if we have a room for it. In 64bit arch, we have extra 4bytes.
> > > Let's use it.
> > >
> > > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > > --
> > > Index: mmotm-0719/include/linux/page_cgroup.h
> > > ===================================================================
> > > --- mmotm-0719.orig/include/linux/page_cgroup.h
> > > +++ mmotm-0719/include/linux/page_cgroup.h
> > > @@ -10,8 +10,14 @@
> > > * All page cgroups are allocated at boot or memory hotplug event,
> > > * then the page cgroup for pfn always exists.
> > > */
> > > +#ifdef CONFIG_64BIT
> > > +#define PCG_HAS_SPINLOCK
> > > +#endif
> > > struct page_cgroup {
> > > unsigned long flags;
> > > +#ifdef PCG_HAS_SPINLOCK
> > > + spinlock_t lock;
> > > +#endif
> > > unsigned short mem_cgroup; /* ID of assigned memory cgroup */
> > > unsigned short blk_cgroup; /* Not Used..but will be. */
> > > struct page *page;
> > > @@ -90,6 +96,16 @@ static inline enum zone_type page_cgroup
> > > return page_zonenum(pc->page);
> > > }
> > >
> > > +#ifdef PCG_HAS_SPINLOCK
> > > +static inline void lock_page_cgroup(struct page_cgroup *pc)
> > > +{
> > > + spin_lock(&pc->lock);
> > > +}
> >
> > This is a minor issue, but this patch breaks the use of PageCgroupLocked().
> > For example, this from __mem_cgroup_move_account() causes a panic:
> > VM_BUG_ON(!PageCgroupLocked(pc));
> >
> > I assume that this patch should also delete the following:
> > - PCG_LOCK definition from page_cgroup.h
> > - TESTPCGFLAG(Locked, LOCK) from page_cgroup.h
> > - PCGF_LOCK from memcontrol.c
> >
>
>
> Good catch! But from my understanding of the code we use spinlock_t
> only for 64 bit systems, so we still need the PCG* and TESTPGFLAGS.
>
The latest sets have proper calls.

-Kame