2011-05-13 08:49:33

by Greg Thelen

Subject: [RFC][PATCH v7 00/14] memcg: per cgroup dirty page accounting

This patch series provides the ability for each cgroup to have independent dirty
page usage limits. Limiting dirty memory caps the maximum amount of dirty (hard
to reclaim) page cache used by a cgroup. This allows for better per-cgroup
memory isolation and fewer OOMs within a single cgroup.

Having per-cgroup dirty memory limits is not very interesting unless writeback
is cgroup aware. There is not much isolation if cgroups have to write back data
from other cgroups to get below their dirty memory threshold.

Per-memcg dirty limits are provided to support isolation, and thus cross-cgroup
inode sharing is not a priority. This allows the code to be simpler.

To add cgroup awareness to writeback, this series adds a memcg field to the
inode's address_space to allow writeback to isolate inodes for a particular
cgroup. When an inode is marked dirty, i_memcg is set to the current cgroup.
When inode pages are marked dirty, the i_memcg field is compared against the
page's cgroup. If they differ, then the inode is marked as shared by setting
i_memcg to a special shared value (zero).
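
In rough pseudocode, the sharing check looks like the following. This is only a
simplified sketch (the helper name below is made up); the real hook is part of
the page accounting code added later in this series:

  /* Simplified sketch; the real check lives in mem_cgroup_update_page_stat(). */
  static void memcg_note_dirtier(struct page *page, struct mem_cgroup *mem)
  {
          struct address_space *mapping = page_mapping(page);

          if (mapping && mapping->i_memcg != css_id(&mem->css))
                  /* dirtied by a second cgroup: mark the inode shared */
                  mapping->i_memcg = I_MEMCG_SHARED;      /* zero */
  }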

Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
way to associate inodes with a cgroup without having to add a field to struct
inode. I prototyped this approach but found that it involved more complex
writeback changes and had at least one major shortcoming: detection of when an
inode becomes shared by multiple cgroups. While such sharing is not expected to
be common, the system should gracefully handle it.

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
dirty usage vs the dirty thresholds for the current cgroup and its parents. If
any over-limit cgroups are found, they are marked in a global over-limit bitmap
(indexed by cgroup id) and the bdi flusher is woken.

The bdi flusher uses wb_check_background_flush() to check for any memcg over
their dirty limit. When performing per-memcg background writeback,
move_expired_inodes() walks the per-bdi b_dirty list, using each inode's i_memcg
and the global over-limit memcg bitmap to determine if the inode should be
written.
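
The per-inode filter amounts to the check below (a condensed form of
should_writeback_mem_cgroup_inode(), introduced later in this series):

  /* Condensed sketch of the inode filter used during per-memcg writeback. */
  unsigned short id = inode->i_mapping->i_memcg;
  bool write_it;

  if (wbc->shared_inodes && id == I_MEMCG_SHARED)
          write_it = true;        /* writing shared inodes was requested */
  else
          write_it = test_bit(id, over_bground_dirty_thresh);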

If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
threshold by writing per-memcg inodes, then it downshifts to also writing shared
inodes (i_memcg=0).
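
Roughly, the throttling side behaves like the sketch below. This is only an
outline (writeback_over_thresh_inodes() is a placeholder name); the actual
logic is in mem_cgroup_balance_dirty_pages() later in this series:

  /* Outline only; see the page-writeback support patch for the real code. */
  mem_cgroup_dirty_info(sys_available_mem, mem, &info);
  if (dirty_info_reclaimable(&info) > info.background_thresh)
          mem_cgroup_queue_bg_writeback(mem, bdi);  /* mark bitmap, wake flusher */
  if (dirty_info_reclaimable(&info) > info.dirty_thresh) {
          wbc.for_cgroup = 1;
          wbc.shared_inodes = 0;                    /* memcg-exclusive inodes first */
          writeback_over_thresh_inodes(&wbc);       /* placeholder helper */
          /* if still over the limit, downshift to shared inodes (i_memcg == 0) */
          wbc.shared_inodes = 1;
  }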

I know that there are some significant writeback changes associated with the
IO-less balance_dirty_pages() effort. I am not trying to derail that, so this
patch series is merely an RFC to get feedback on the design. There are probably
some subtle races in these patches. I have done moderate functional testing of
the newly proposed features.

Here is an example of the memcg-oom that is avoided with this patch series:
# mkdir /dev/cgroup/memory/x
# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
# echo $$ > /dev/cgroup/memory/x/tasks
# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
# wait
[1]- Killed dd if=/dev/zero of=/data/f1 bs=1k count=1M
[2]+ Killed dd if=/dev/zero of=/data/f2 bs=1k count=1M

Known limitations:
If a dirty limit is lowered, a cgroup may already be over its limit.

Changes since -v6:
- memcg aware writeback.

Single patch that can be applied to mmotm-2011-05-06-16-39:
http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v7-on-mmotm-2011-05-06-16-39.patch

Patches are based on mmotm-2011-05-06-16-39.

Greg Thelen (14):
memcg: document cgroup dirty memory interfaces
memcg: add page_cgroup flags for dirty page tracking
memcg: add mem_cgroup_mark_inode_dirty()
memcg: add dirty page accounting infrastructure
memcg: add kernel calls for memcg dirty page stats
memcg: add dirty limits to mem_cgroup
memcg: add cgroupfs interface to memcg dirty limits
writeback: add memcg fields to writeback_control
cgroup: move CSS_ID_MAX to cgroup.h
memcg: dirty page accounting support routines
memcg: create support routines for writeback
memcg: create support routines for page-writeback
writeback: make background writeback cgroup aware
memcg: check memcg dirty limits in page writeback

Documentation/cgroups/memory.txt | 70 ++++
fs/fs-writeback.c | 33 ++-
fs/inode.c | 3 +
fs/nfs/write.c | 4 +
include/linux/cgroup.h | 1 +
include/linux/fs.h | 9 +
include/linux/memcontrol.h | 64 ++++-
include/linux/page_cgroup.h | 23 ++
include/linux/writeback.h | 5 +-
include/trace/events/memcontrol.h | 198 +++++++++++
kernel/cgroup.c | 1 -
mm/filemap.c | 1 +
mm/memcontrol.c | 705 ++++++++++++++++++++++++++++++++++++-
mm/page-writeback.c | 42 ++-
mm/truncate.c | 1 +
mm/vmscan.c | 2 +-
16 files changed, 1134 insertions(+), 28 deletions(-)
create mode 100644 include/trace/events/memcontrol.h

--
1.7.3.1


2011-05-13 08:50:04

by Greg Thelen

Subject: [RFC][PATCH v7 01/14] memcg: document cgroup dirty memory interfaces

Document cgroup dirty memory interfaces and statistics.

The implementation of these new interface routines comes in the following
patches of this series.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Balbir Singh <[email protected]>
---
Changelog since v6:
- Removed 'Inode writeback issue' now that memcg-writeback is implemented in
the series.
- Trivial reword of section 5.6 "dirty memory".

Changelog since v4:
- Minor rewording of '5.5 dirty memory' section.
- Added '5.5.1 Inode writeback issue' section.

Changelog since v3:
- Described interactions with memory.use_hierarchy.
- Added description of total_dirty, total_writeback, and total_nfs_unstable.

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Describe a situation where a cgroup can exceed its dirty limit.

Documentation/cgroups/memory.txt | 70 ++++++++++++++++++++++++++++++++++++++
1 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 43b9e46..15019a3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -395,6 +395,10 @@ soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
direct reclaim
soft_direct_scan- # of pages scanned in global hierarchical reclaim from
direct reclaim
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
+nfs_unstable - # of bytes sent to the NFS server, but not yet committed to
+ the actual storage.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
@@ -420,6 +424,9 @@ total_soft_kswapd_steal - sum of all children's "soft_kswapd_steal"
total_soft_kswapd_scan - sum of all children's "soft_kswapd_scan"
total_soft_direct_steal - sum of all children's "soft_direct_steal"
total_soft_direct_scan - sum of all children's "soft_direct_scan"
+total_dirty - sum of all children's "dirty"
+total_writeback - sum of all children's "writeback"
+total_nfs_unstable - sum of all children's "nfs_unstable"
total_inactive_anon - sum of all children's "inactive_anon"
total_active_anon - sum of all children's "active_anon"
total_inactive_file - sum of all children's "inactive_file"
@@ -476,6 +483,69 @@ value for efficient access. (Of course, when necessary, it's synchronized.)
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).

+5.6 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup. So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be throttled if they cross that limit. System-wide dirty limits are also
+consulted. Dirty memory consumption is checked against both system-wide and
+per-cgroup dirty limits.
+
+The interface is similar to the procfs interface: /proc/sys/vm/dirty_*. It is
+possible to configure a limit to trigger throttling of a dirtier or queue
+background writeback. The root cgroup memory.dirty_* control files are
+read-only and match the contents of the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+ cgroup memory) at which a process generating dirty pages will be throttled.
+ The default value is the system-wide dirty ratio, /proc/sys/vm/dirty_ratio.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+ in the cgroup at which a process generating dirty pages will be throttled.
+ Suffix (k, K, m, M, g, or G) can be used to indicate that value is kilo, mega
+ or gigabytes. The default value is the system-wide dirty limit,
+ /proc/sys/vm/dirty_bytes.
+
+ Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+ Only one may be specified at a time. When one is written it is immediately
+ taken into account to evaluate the dirty memory limits and the other appears
+ as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+ (expressed as a percentage of cgroup memory) at which background writeback
+ kernel threads will start writing out dirty data. The default value is the
+ system-wide background dirty ratio, /proc/sys/vm/dirty_background_ratio.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+ in bytes) in the cgroup at which background writeback kernel threads will
+ start writing out dirty data. Suffix (k, K, m, M, g, or G) can be used to
+ indicate that value is kilo, mega or gigabytes. The default value is the
+ system-wide dirty background limit, /proc/sys/vm/dirty_background_bytes.
+
+ Note: memory.dirty_background_limit_in_bytes is the counterpart of
+ memory.dirty_background_ratio. Only one may be specified at a time. When one
+ is written it is immediately taken into account to evaluate the dirty memory
+ limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit. This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup. Example: If page is allocated by a
+cgroup A task, then the page is charged to cgroup A. If the page is later
+dirtied by a task in cgroup B, then the cgroup A dirty count will be
+incremented. If cgroup A is over its dirty limit but cgroup B is not, then
+dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
+limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+When use_hierarchy=1 the dirty limits of parent cgroups are also checked to
+ensure that no dirty limit is exceeded.
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.3.1

2011-05-13 08:50:05

by Greg Thelen

Subject: [RFC][PATCH v7 02/14] memcg: add page_cgroup flags for dirty page tracking

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.
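
For example, the new test-and-set variants let a caller charge a page exactly
once, because they return the previous flag state (memcg_charge_dirty() below
is only a placeholder; the real accounting hook comes in a later patch):

  /* Hypothetical caller: account a page as dirty only on the 0 -> 1 transition. */
  struct page_cgroup *pc = lookup_page_cgroup(page);

  if (!TestSetPageCgroupFileDirty(pc))
          memcg_charge_dirty(page);       /* placeholder for the accounting step */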

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v6:
- Trivial: removed extraneous space.

include/linux/page_cgroup.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..66d3245 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -10,6 +10,9 @@ enum {
/* flags for mem_cgroup and file and I/O status */
PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is under writeback */
+ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
/* No lock in page_cgroup */
PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
__NR_PCG_FLAGS,
@@ -67,6 +70,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }

+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
CLEARPCGFLAG(Cache, CACHE)
@@ -86,6 +93,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)

+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
--
1.7.3.1

2011-05-13 08:50:32

by Greg Thelen

Subject: [RFC][PATCH v7 03/14] memcg: add mem_cgroup_mark_inode_dirty()

Create the mem_cgroup_mark_inode_dirty() routine, which is called when
an inode is marked dirty. In kernels without memcg, this is an inline
no-op.

Add i_memcg field to struct address_space. When an inode is marked
dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
recorded in i_memcg. Per-memcg writeback (introduced in a later
change) uses this field to isolate inodes associated with a particular
memcg.

The type of i_memcg is an 'unsigned short' because it stores the css_id
of the memcg. Using a struct mem_cgroup pointer would be larger and
also create a reference on the memcg which would hang memcg rmdir
deletion. Usage of a css_id is not a reference so cgroup deletion is
not affected. The memcg can be deleted without cleaning up the i_memcg
field. When a memcg is deleted its pages are recharged to the cgroup
parent, and the related inode(s) are marked as shared thus
disassociating the inodes from the deleted cgroup.
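
When a later consumer needs the memcg back, the css_id can be resolved under
RCU while tolerating a concurrently deleted cgroup; this is the pattern the
writeback support patch later in this series uses:

  /* Sketch: resolve i_memcg back to a memcg, tolerating cgroup deletion. */
  rcu_read_lock();
  mem = mem_cgroup_lookup(inode->i_mapping->i_memcg);
  if (mem && !css_tryget(&mem->css))
          mem = NULL;             /* cgroup is going away; caller must cope */
  rcu_read_unlock();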

A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
easier understanding of memcg writeback operation.

Signed-off-by: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 2 ++
fs/inode.c | 3 +++
include/linux/fs.h | 9 +++++++++
include/linux/memcontrol.h | 6 ++++++
include/trace/events/memcontrol.h | 32 ++++++++++++++++++++++++++++++++
mm/memcontrol.c | 24 ++++++++++++++++++++++++
6 files changed, 76 insertions(+), 0 deletions(-)
create mode 100644 include/trace/events/memcontrol.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3392c29..0174fcf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/fs.h>
+#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1111,6 +1112,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_lock(&bdi->wb.list_lock);
inode->dirtied_when = jiffies;
list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+ mem_cgroup_mark_inode_dirty(inode);
spin_unlock(&bdi->wb.list_lock);

if (wakeup_bdi)
diff --git a/fs/inode.c b/fs/inode.c
index ce61a1b..9ecb0bb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -228,6 +228,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mapping->i_memcg = 0;
+#endif

/*
* If the block_device provides a backing_dev_info for client
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29c02f6..deabca3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ unsigned short i_memcg; /* css_id of memcg dirtier */
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -652,6 +655,12 @@ struct address_space {
* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
*/

+/*
+ * When an address_space is shared by multiple memcg dirtiers, then i_memcg is
+ * set to this special, wildcard, css_id value (zero).
+ */
+#define I_MEMCG_SHARED 0
+
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
int bd_openers;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 77e47f5..14b6d67 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -103,6 +103,8 @@ mem_cgroup_prepare_migration(struct page *page,
extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
struct page *oldpage, struct page *newpage, bool migration_ok);

+void mem_cgroup_mark_inode_dirty(struct inode *inode);
+
/*
* For memory reclaim.
*/
@@ -273,6 +275,10 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
{
}

+static inline void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+}
+
static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
{
return 0;
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
new file mode 100644
index 0000000..781ef9fc
--- /dev/null
+++ b/include/trace/events/memcontrol.h
@@ -0,0 +1,32 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM memcontrol
+
+#if !defined(_TRACE_MEMCONTROL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MEMCONTROL_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mem_cgroup_mark_inode_dirty,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned short, css_id)
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->css_id =
+ inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+ ),
+
+ TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
+)
+
+#endif /* _TRACE_MEMCONTROL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 95aecca..3a792b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,9 @@

#include <trace/events/vmscan.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/memcontrol.h>
+
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
struct mem_cgroup *root_mem_cgroup __read_mostly;
@@ -1122,6 +1125,27 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
return inactive_ratio;
}

+/*
+ * Mark the current task's memcg as the memcg associated with inode. Note: the
+ * recorded cgroup css_id is not guaranteed to remain correct. The current task
+ * may be moved to another cgroup. The memcg may also be deleted before the
+ * caller has time to use the i_memcg.
+ */
+void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+ struct mem_cgroup *mem;
+ unsigned short id;
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(current);
+ id = mem ? css_id(&mem->css) : 0;
+ rcu_read_unlock();
+
+ inode->i_mapping->i_memcg = id;
+
+ trace_mem_cgroup_mark_inode_dirty(inode);
+}
+
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
unsigned long active;
--
1.7.3.1

2011-05-13 08:50:48

by Greg Thelen

Subject: [RFC][PATCH v7 04/14] memcg: add dirty page accounting infrastructure

Add memcg routines to count dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages. A
later change adds kernel calls to these new routines.

As inode pages are marked dirty, if the dirtied page's cgroup differs
from the inode's cgroup, then the inode is marked as shared across several
cgroups.
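
A later patch in this series adds the actual call sites; for example, the page
dirtying path ends up with a hook like this (taken from the
account_page_dirtied() hunk in the next patch):

  if (mapping_cap_account_dirty(mapping)) {
          mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
          __inc_zone_page_state(page, NR_FILE_DIRTY);
          ...
  }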

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
---
Changelog since v6:
- Mark inode as cgroup-shared if charging a page from a cgroup other than
the inode cgroup.
- Mark inode as cgroup-shared if migrating a page to a different cgroup.

include/linux/memcontrol.h | 8 +++-
mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 14b6d67..f1261e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,15 @@ struct page_cgroup;
struct page;
struct mm_struct;

-/* Stats that can be updated by kernel. */
+/*
+ * Per mem_cgroup page counts tracked by kernel. As pages enter and leave these
+ * states, the kernel notifies memcg using mem_cgroup_{inc,dec}_page_stat().
+ */
enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a792b7..a4cb991 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -86,8 +86,11 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
MEM_CGROUP_STAT_NSTATS,
@@ -1860,6 +1863,7 @@ void mem_cgroup_update_page_stat(struct page *page,
{
struct mem_cgroup *mem;
struct page_cgroup *pc = lookup_page_cgroup(page);
+ struct address_space *mapping;
bool need_unlock = false;
unsigned long uninitialized_var(flags);

@@ -1888,6 +1892,53 @@ void mem_cgroup_update_page_stat(struct page *page,
ClearPageCgroupFileMapped(pc);
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
+
+ case MEMCG_NR_FILE_DIRTY:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ mapping = page_mapping(page);
+ if (TestSetPageCgroupFileDirty(pc))
+ val = 0;
+ else if (mapping &&
+ (mapping->i_memcg != css_id(&mem->css)))
+ /*
+ * If the inode is being dirtied by a memcg
+ * other than the one that marked it dirty, then
+ * mark the inode shared by multiple memcg.
+ */
+ mapping->i_memcg = I_MEMCG_SHARED;
+ } else {
+ if (!TestClearPageCgroupFileDirty(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_DIRTY;
+ break;
+
+ case MEMCG_NR_FILE_WRITEBACK:
+ /*
+ * This counter is adjusted while holding the mapping's
+ * tree_lock. Therefore there is no race between setting and
+ * clearing of this flag.
+ */
+ if (val > 0)
+ SetPageCgroupFileWriteback(pc);
+ else
+ ClearPageCgroupFileWriteback(pc);
+ idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+ break;
+
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+ break;
+
default:
BUG();
}
@@ -2447,6 +2498,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
}
#endif

+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2495,13 +2557,28 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_page_cgroup(pc, &flags);

- if (PageCgroupFileMapped(pc)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
+ if (PageCgroupFileMapped(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+ if (PageCgroupFileDirty(pc)) {
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+ /*
+ * Moving a dirty file page between memcg makes the underlying
+ * inode shared. If the new (to) cgroup attempts writeback it
+ * should consider this inode. If the old (from) cgroup
+ * attempts writeback it likely has other pages in the same
+ * inode. The inode is now shared by the to and from cgroups.
+ * So mark the inode as shared.
+ */
+ page_mapping(page)->i_memcg = I_MEMCG_SHARED;
}
+ if (PageCgroupFileWriteback(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+ if (PageCgroupFileUnstableNFS(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
@@ -3981,6 +4058,9 @@ enum {
MCS_SOFT_KSWAPD_SCAN,
MCS_SOFT_DIRECT_STEAL,
MCS_SOFT_DIRECT_SCAN,
+ MCS_FILE_DIRTY,
+ MCS_WRITEBACK,
+ MCS_UNSTABLE_NFS,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
MCS_INACTIVE_FILE,
@@ -4009,6 +4089,9 @@ struct {
{"soft_kswapd_scan", "total_soft_scan"},
{"soft_direct_steal", "total_soft_direct_steal"},
{"soft_direct_scan", "total_soft_direct_scan"},
+ {"dirty", "total_dirty"},
+ {"writeback", "total_writeback"},
+ {"nfs_unstable", "total_nfs_unstable"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
{"inactive_file", "total_inactive_file"},
@@ -4050,6 +4133,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
s->stat[MCS_PGMAJFAULT] += val;

+ /* dirty stat */
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
/* per zone stat */
val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
--
1.7.3.1

2011-05-13 08:50:39

by Greg Thelen

Subject: [RFC][PATCH v7 05/14] memcg: add kernel calls for memcg dirty page stats

Add calls into memcg dirty page accounting. Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs. This
allows the memory controller to maintain an accurate view of the amount
of its memory that is dirty.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
---
Changelog since v6:
- moved accounting of writeback pages into account_page_writeback().

Changelog since v5:
- moved accounting site in test_clear_page_writeback() and
test_set_page_writeback().

fs/nfs/write.c | 4 ++++
mm/filemap.c | 1 +
mm/page-writeback.c | 7 ++++++-
mm/truncate.c | 1 +
4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3bd5d7e..c23b168 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg)
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
pnfs_mark_request_commit(req, lseg);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -460,6 +461,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;

if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1376,6 +1378,8 @@ void nfs_retry_commit(struct list_head *page_list,
req = nfs_list_entry(page_list->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req, lseg);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 707ae82..6cd8297 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -145,6 +145,7 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cca0803..62fcf3d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1124,6 +1124,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1140,6 +1141,7 @@ EXPORT_SYMBOL(account_page_dirtied);
*/
void account_page_writeback(struct page *page)
{
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
inc_zone_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_WRITTEN);
}
@@ -1323,6 +1325,7 @@ int clear_page_dirty_for_io(struct page *page)
* for more comments.
*/
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -1358,8 +1361,10 @@ int test_clear_page_writeback(struct page *page)
} else {
ret = TestClearPageWriteback(page);
}
- if (ret)
+ if (ret) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK);
+ }
return ret;
}

diff --git a/mm/truncate.c b/mm/truncate.c
index 3a29a61..3dbade6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
--
1.7.3.1

2011-05-13 08:51:01

by Greg Thelen

Subject: [RFC][PATCH v7 06/14] memcg: add dirty limits to mem_cgroup

Extend mem_cgroup to contain dirty page limits.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v5:
- To simplify this patch, deferred adding routines for kernel to query dirty
usage and limits to a later patch in this series.
- Collapsed __mem_cgroup_has_dirty_limit() into mem_cgroup_has_dirty_limit().
- Renamed __mem_cgroup_dirty_param() to mem_cgroup_dirty_param().

Changelog since v4:
- Added support for hierarchical dirty limits.
- Simplified __mem_cgroup_dirty_param().
- Simplified mem_cgroup_page_stat().
- Deleted mem_cgroup_nr_pages_item enum, which added little value.
Instead the mem_cgroup_page_stat_item enum values are used to identify
memcg dirty statistics exported to kernel.
- Fixed overflow issues in mem_cgroup_hierarchical_free_pages().

Changelog since v3:
- Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
advertise dirty memory limits. Now struct dirty_info and
mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
the rest of the kernel.
- __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.
- memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
- created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
logic.

Changelog since v1:
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Removed unnecessary get_ prefix from get_xxx() functions.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
mem_cgroup_has_dirty_limit().

mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a4cb991..f496677 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -216,6 +216,14 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -260,6 +268,10 @@ struct mem_cgroup {
atomic_t refcnt;

unsigned int swappiness;
+
+ /* control memory cgroup dirty pages */
+ struct vm_dirty_param dirty_param;
+
/* OOM-Killer disable */
int oom_kill_disable;

@@ -1309,6 +1321,36 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs. So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
+{
+ return mem && !mem_cgroup_is_root(mem);
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits. If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where neither dirty_[background_]_ratio nor _bytes are set.
+ */
+static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
+ struct mem_cgroup *mem)
+{
+ if (mem_cgroup_has_dirty_limit(mem)) {
+ *param = mem->dirty_param;
+ } else {
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = vm_dirty_bytes;
+ param->dirty_background_ratio = dirty_background_ratio;
+ param->dirty_background_bytes = dirty_background_bytes;
+ }
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
@@ -4917,8 +4959,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&mem->oom_notify);

- if (parent)
+ if (parent) {
mem->swappiness = get_swappiness(parent);
+ mem_cgroup_dirty_param(&mem->dirty_param, parent);
+ } else {
+ /*
+ * The root cgroup dirty_param field is not used, instead,
+ * system-wide dirty limits are used.
+ */
+ }
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
--
1.7.3.1

2011-05-13 08:51:18

by Greg Thelen

Subject: [RFC][PATCH v7 07/14] memcg: add cgroupfs interface to memcg dirty limits

Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_limit_in_bytes

Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts. This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
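
For example (the cgroup name and values below are only illustrative):

# echo 200M > /dev/cgroup/memory/x/memory.dirty_limit_in_bytes
# echo 10 > /dev/cgroup/memory/x/memory.dirty_background_ratio
# cat /dev/cgroup/memory/x/memory.dirty_limit_in_bytes
209715200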

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v3:
- Make use of new routine, __mem_cgroup_has_dirty_limit(), to disable memcg
dirty limits when use_hierarchy=1.

Changelog since v1:
- Renamed newly created proc files:
- memory.dirty_bytes -> memory.dirty_limit_in_bytes
- memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

mm/memcontrol.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f496677..248396c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -126,6 +126,13 @@ enum mem_cgroup_events_target {
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)

+enum {
+ MEM_CGROUP_DIRTY_RATIO,
+ MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
@@ -4638,6 +4645,89 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ bool use_sys = !mem_cgroup_has_dirty_limit(mem);
+
+ switch (cft->private) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ return use_sys ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ return use_sys ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ return use_sys ? dirty_background_ratio :
+ mem->dirty_param.dirty_background_ratio;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ return use_sys ? dirty_background_bytes :
+ mem->dirty_param.dirty_background_bytes;
+ default:
+ BUG();
+ }
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return -EINVAL;
+ if ((type == MEM_CGROUP_DIRTY_RATIO ||
+ type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+ return -EINVAL;
+ switch (type) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ memcg->dirty_param.dirty_ratio = val;
+ memcg->dirty_param.dirty_bytes = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ memcg->dirty_param.dirty_background_ratio = val;
+ memcg->dirty_param.dirty_background_bytes = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4701,6 +4791,30 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "dirty_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_RATIO,
+ },
+ {
+ .name = "dirty_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ },
+ {
+ .name = "dirty_background_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ },
+ {
+ .name = "dirty_background_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
--
1.7.3.1

2011-05-13 08:51:27

by Greg Thelen

Subject: [RFC][PATCH v7 08/14] writeback: add memcg fields to writeback_control

Add writeback_control fields to differentiate between bdi-wide and
per-cgroup writeback. Cgroup writeback is also able to differentiate
between writing inodes isolated to a particular cgroup and inodes shared
by multiple cgroups.

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/writeback.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d10d133..4f5c0d2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -47,6 +47,8 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
+ unsigned for_cgroup:1; /* enable cgroup writeback */
+ unsigned shared_inodes:1; /* write inodes spanning cgroups */
};

/*
--
1.7.3.1

2011-05-13 08:52:47

by Greg Thelen

Subject: [RFC][PATCH v7 09/14] cgroup: move CSS_ID_MAX to cgroup.h

This allows users of css_id() to know the largest possible css_id value.
This knowledge can be used to build per-cgroup bitmaps.
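
A later patch in this series uses it exactly this way:

  /* From the writeback support patch: one bit per possible css_id. */
  static DECLARE_BITMAP(over_bground_dirty_thresh, CSS_ID_MAX + 1);
  ...
  set_bit(css_id(&mem->css), over_bground_dirty_thresh);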

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 1 -
2 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ab4ac0c..5eb6543 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -624,6 +624,7 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
const struct cgroup_subsys_state *root);

/* Get id and depth of css */
+#define CSS_ID_MAX (65535)
unsigned short css_id(struct cgroup_subsys_state *css);
unsigned short css_depth(struct cgroup_subsys_state *css);
struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2731d11..ab7e7a7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -129,7 +129,6 @@ static struct cgroupfs_root rootnode;
* CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
* cgroup_subsys->use_id != 0.
*/
-#define CSS_ID_MAX (65535)
struct css_id {
/*
* The css to which this ID points. This pointer is set to valid value
--
1.7.3.1

2011-05-13 08:53:28

by Greg Thelen

Subject: [RFC][PATCH v7 10/14] memcg: dirty page accounting support routines

Add memcg dirty page accounting support routines. These routines are
used by later changes to provide memcg aware writeback and dirty page
limiting. A mem_cgroup_dirty_info() tracepoint is also included to
allow for easier understanding of memcg writeback operation.

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 9 +++
include/trace/events/memcontrol.h | 34 +++++++++
mm/memcontrol.c | 145 +++++++++++++++++++++++++++++++++++++
3 files changed, 188 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f1261e5..f06c2de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+ MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
+};
+
+struct dirty_info {
+ unsigned long dirty_thresh;
+ unsigned long background_thresh;
+ unsigned long nr_file_dirty;
+ unsigned long nr_writeback;
+ unsigned long nr_unstable_nfs;
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 781ef9fc..abf1306 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
)

+TRACE_EVENT(mem_cgroup_dirty_info,
+ TP_PROTO(unsigned short css_id,
+ struct dirty_info *dirty_info),
+
+ TP_ARGS(css_id, dirty_info),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ __field(unsigned long, dirty_thresh)
+ __field(unsigned long, background_thresh)
+ __field(unsigned long, nr_file_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_unstable_nfs)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ __entry->dirty_thresh = dirty_info->dirty_thresh;
+ __entry->background_thresh = dirty_info->background_thresh;
+ __entry->nr_file_dirty = dirty_info->nr_file_dirty;
+ __entry->nr_writeback = dirty_info->nr_writeback;
+ __entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
+ ),
+
+ TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
+ "unstable_nfs=%ld",
+ __entry->css_id,
+ __entry->dirty_thresh,
+ __entry->background_thresh,
+ __entry->nr_file_dirty,
+ __entry->nr_writeback,
+ __entry->nr_unstable_nfs)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 248396c..75ef32c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1328,6 +1328,11 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static unsigned long dirty_info_reclaimable(struct dirty_info *info)
+{
+ return info->nr_file_dirty + info->nr_unstable_nfs;
+}
+
/*
* Return true if the current memory cgroup has local dirty memory settings.
* There is an allowed race between the current task migrating in-to/out-of the
@@ -1358,6 +1363,146 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
}
}

+static inline bool mem_cgroup_can_swap(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return nr_swap_pages > 0;
+ return !mem->memsw_is_minimum &&
+ (res_counter_read_u64(&mem->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ s64 ret;
+
+ switch (item) {
+ case MEMCG_NR_FILE_DIRTY:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ break;
+ case MEMCG_NR_FILE_WRITEBACK:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ break;
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ ret = mem_cgroup_read_stat(mem,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ break;
+ case MEMCG_NR_DIRTYABLE_PAGES:
+ ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+ if (mem_cgroup_can_swap(mem))
+ ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Return the number of additional pages that the @mem cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+ u64 free;
+ unsigned long min_free;
+
+ min_free = global_page_state(NR_FREE_PAGES);
+
+ while (mem) {
+ free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
+ res_counter_read_u64(&mem->res, RES_USAGE)) >>
+ PAGE_SHIFT;
+ min_free = min((u64)min_free, free);
+ mem = parent_mem_cgroup(mem);
+ }
+
+ return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @mem: memory cgroup to query
+ * @item: memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ struct mem_cgroup *iter;
+ s64 value;
+
+ /*
+ * If we're looking for dirtyable pages we need to evaluate free pages
+ * depending on the limit and usage of the parents first of all.
+ */
+ if (item == MEMCG_NR_DIRTYABLE_PAGES)
+ value = mem_cgroup_hierarchical_free_pages(mem);
+ else
+ value = 0;
+
+ /*
+ * Recursively evaluate page statistics against all cgroup under
+ * hierarchy tree
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+ value += mem_cgroup_local_page_stat(iter, item);
+
+ /*
+ * Summing of unlocked per-cpu counters is racy and may yield a slightly
+ * negative value. Zero is the only sensible value in such cases.
+ */
+ if (unlikely(value < 0))
+ value = 0;
+
+ return value;
+}
+
+/* Return dirty thresholds and usage for @mem. */
+static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ unsigned long uninitialized_var(available_mem);
+ struct vm_dirty_param dirty_param;
+
+ mem_cgroup_dirty_param(&dirty_param, mem);
+
+ if (!dirty_param.dirty_bytes || !dirty_param.dirty_background_bytes)
+ available_mem = min(
+ sys_available_mem,
+ mem_cgroup_page_stat(mem, MEMCG_NR_DIRTYABLE_PAGES));
+
+ if (dirty_param.dirty_bytes)
+ info->dirty_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+ else
+ info->dirty_thresh =
+ (dirty_param.dirty_ratio * available_mem) / 100;
+
+ if (dirty_param.dirty_background_bytes)
+ info->background_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+ PAGE_SIZE);
+ else
+ info->background_thresh =
+ (dirty_param.dirty_background_ratio *
+ available_mem) / 100;
+
+ info->nr_file_dirty = mem_cgroup_page_stat(mem, MEMCG_NR_FILE_DIRTY);
+ info->nr_writeback = mem_cgroup_page_stat(mem, MEMCG_NR_FILE_WRITEBACK);
+ info->nr_unstable_nfs =
+ mem_cgroup_page_stat(mem, MEMCG_NR_FILE_UNSTABLE_NFS);
+
+ trace_mem_cgroup_dirty_info(css_id(&mem->css), info);
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-05-13 08:53:33

by Greg Thelen

Subject: [RFC][PATCH v7 11/14] memcg: create support routines for writeback

Introduce memcg routines to assist in per-memcg writeback:

- mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
writeback because they are over their dirty memory threshold.

- should_writeback_mem_cgroup_inode() determines if an inode is
contributing pages to an over-limit memcg.

- mem_cgroup_writeback_done() is used periodically during writeback to
update memcg writeback data.
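
A later patch wires these into the flusher thread; the intended usage is
roughly as follows (sketch only, the walk over the bdi's dirty inodes is
elided):

  if (mem_cgroups_over_bground_dirty_thresh()) {
          wbc.for_cgroup = 1;
          wbc.shared_inodes = 0;
          /* for each inode on the bdi's b_dirty list, write it only if */
          /* should_writeback_mem_cgroup_inode(inode, &wbc) says so */
          ...
          mem_cgroup_writeback_done();    /* clear bits for memcg now below limit */
  }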

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/trace/events/memcontrol.h | 49 ++++++++++++++++
mm/memcontrol.c | 116 +++++++++++++++++++++++++++++++++++++
3 files changed, 187 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f06c2de..3d72e09 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -26,6 +26,7 @@ struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+struct writeback_control;

/*
* Per mem_cgroup page counts tracked by kernel. As pages enter and leave these
@@ -162,6 +163,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc);
+bool mem_cgroups_over_bground_dirty_thresh(void);
+void mem_cgroup_writeback_done(void);
+
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
@@ -361,6 +367,22 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
{
}

+static inline bool
+should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc)
+{
+ return true;
+}
+
+static inline bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+ return true;
+}
+
+static inline void mem_cgroup_writeback_done(void)
+{
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index abf1306..326a66b 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -60,6 +60,55 @@ TRACE_EVENT(mem_cgroup_dirty_info,
__entry->nr_unstable_nfs)
)

+TRACE_EVENT(should_writeback_mem_cgroup_inode,
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ bool over_limit),
+
+ TP_ARGS(inode, wbc, over_limit),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned short, css_id)
+ __field(bool, shared_inodes)
+ __field(bool, over_limit)
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->css_id =
+ inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+ __entry->shared_inodes = wbc->shared_inodes;
+ __entry->over_limit = over_limit;
+ ),
+
+ TP_printk("ino=%ld css_id=%d shared_inodes=%d over_limit=%d",
+ __entry->ino,
+ __entry->css_id,
+ __entry->shared_inodes,
+ __entry->over_limit)
+)
+
+TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
+ TP_PROTO(bool over_limit,
+ unsigned short first_id),
+
+ TP_ARGS(over_limit, first_id),
+
+ TP_STRUCT__entry(
+ __field(bool, over_limit)
+ __field(unsigned short, first_id)
+ ),
+
+ TP_fast_assign(
+ __entry->over_limit = over_limit;
+ __entry->first_id = first_id;
+ ),
+
+ TP_printk("over_limit=%d first_css_id=%d", __entry->over_limit,
+ __entry->first_id)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 75ef32c..230f0fb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -389,10 +389,18 @@ enum charge_type {
#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)

+/*
+ * A bitmap representing all possible memcg, indexed by css_id. Each bit
+ * indicates if the respective memcg is over its background dirty memory
+ * limit.
+ */
+static DECLARE_BITMAP(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);
+static struct mem_cgroup *mem_cgroup_lookup(unsigned short id);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1503,6 +1511,114 @@ static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
trace_mem_cgroup_dirty_info(css_id(&mem->css), info);
}

+/* Are any memcg over their background dirty memory limit? */
+bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+ bool over_thresh;
+
+ over_thresh = !bitmap_empty(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
+ trace_mem_cgroups_over_bground_dirty_thresh(
+ over_thresh,
+ over_thresh ? find_next_bit(over_bground_dirty_thresh,
+ CSS_ID_MAX + 1, 0) : 0);
+
+ return over_thresh;
+}
+
+/*
+ * Should inode be written back? wbc indicates if this is foreground or
+ * background writeback and the set of inodes worth considering.
+ */
+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc)
+{
+ unsigned short id;
+ bool over;
+
+ id = inode->i_mapping->i_memcg;
+ VM_BUG_ON(id >= CSS_ID_MAX + 1);
+
+ if (wbc->shared_inodes && id == I_MEMCG_SHARED)
+ over = true;
+ else
+ over = test_bit(id, over_bground_dirty_thresh);
+
+ trace_should_writeback_mem_cgroup_inode(inode, wbc, over);
+ return over;
+}
+
+/*
+ * Mark @mem and all of its child cgroups as eligible for writeback because
+ * @mem is over its bg threshold.
+ */
+static void mem_cgroup_mark_over_bg_thresh(struct mem_cgroup *mem)
+{
+ struct mem_cgroup *iter;
+
+ /* mark this memcg and all of its children as candidates for writeback */
+ for_each_mem_cgroup_tree(iter, mem)
+ set_bit(css_id(&iter->css), over_bground_dirty_thresh);
+}
+
+static void mem_cgroup_queue_bg_writeback(struct mem_cgroup *mem,
+ struct backing_dev_info *bdi)
+{
+ mem_cgroup_mark_over_bg_thresh(mem);
+ bdi_start_background_writeback(bdi);
+}
+
+/*
+ * This routine is called when per-memcg writeback completes. It scans any
+ * previously over-bground-thresh memcgs to determine if they are still over
+ * their background dirty memory limit.
+ */
+void mem_cgroup_writeback_done(void)
+{
+ struct mem_cgroup *mem;
+ struct mem_cgroup *ref_mem;
+ struct dirty_info info;
+ unsigned long sys_available_mem;
+ int id;
+
+ sys_available_mem = 0;
+
+ /* for each previously over-bg-limit memcg... */
+ for (id = 0; (id = find_next_bit(over_bground_dirty_thresh,
+ CSS_ID_MAX + 1, id)) < CSS_ID_MAX + 1;
+ id++) {
+
+ /* reference the memcg */
+ rcu_read_lock();
+ mem = mem_cgroup_lookup(id);
+ if (mem && !css_tryget(&mem->css))
+ mem = NULL;
+ rcu_read_unlock();
+ if (!mem)
+ continue;
+ ref_mem = mem;
+
+ if (!sys_available_mem)
+ sys_available_mem = determine_dirtyable_memory();
+
+ /*
+ * Walk the ancestry of mem, clearing the over-limit bits
+ * for any memcg under its dirty memory background
+ * threshold.
+ */
+ for (; mem_cgroup_has_dirty_limit(mem);
+ mem = parent_mem_cgroup(mem)) {
+ mem_cgroup_dirty_info(sys_available_mem, mem, &info);
+ if (dirty_info_reclaimable(&info) >= info.dirty_thresh)
+ break;
+
+ clear_bit(css_id(&mem->css), over_bground_dirty_thresh);
+ }
+
+ css_put(&ref_mem->css);
+ }
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-05-13 08:53:16

by Greg Thelen

[permalink] [raw]
Subject: [RFC][PATCH v7 12/14] memcg: create support routines for page-writeback

Introduce memcg routines to assist in per-memcg dirty page management:

- mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
dirty memory usage against memcg foreground and background thresholds.
If an over-background-threshold memcg is found, then per-memcg
background writeback is queued. If an over-foreground-threshold memcg
is found, then foreground writeout occurs. When performing foreground
writeout, first consider inodes exclusive to the memcg. If unable to
make enough progress, then consider inodes shared between memcgs. Such
cross-memcg inode sharing is likely to be rare in situations that use
per-cgroup memory isolation, so the approach tries to handle the
common case well without falling over in cases where such sharing
exists. This routine is used by balance_dirty_pages() in a later
change.

- mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
and limits of the memcg closest to (or over) its dirty limit. This
will be used by throttle_vm_writeout() in a later change.

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 18 +++++
include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
3 files changed, 251 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d72e09..0d0363e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
struct writeback_control *wbc);
bool mem_cgroups_over_bground_dirty_thresh(void);
void mem_cgroup_writeback_done(void);
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info);
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk);

unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
@@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
{
}

+static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+}
+
+static inline bool
+mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ return false;
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 326a66b..b42dae1 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
__entry->first_id)
)

+DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
+ TP_PROTO(unsigned short css_id,
+ struct backing_dev_info *bdi,
+ unsigned long nr_reclaimable,
+ unsigned long thresh,
+ bool over_limit),
+
+ TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ __field(struct backing_dev_info *, bdi)
+ __field(unsigned long, nr_reclaimable)
+ __field(unsigned long, thresh)
+ __field(bool, over_limit)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ __entry->bdi = bdi;
+ __entry->nr_reclaimable = nr_reclaimable;
+ __entry->thresh = thresh;
+ __entry->over_limit = over_limit;
+ ),
+
+ TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
+ "over_limit=%d", __entry->css_id, __entry->bdi,
+ __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
+)
+
+#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
+DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
+ TP_PROTO(unsigned short id, \
+ struct backing_dev_info *bdi, \
+ unsigned long nr_reclaimable, \
+ unsigned long thresh, \
+ bool over_limit), \
+ TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
+)
+
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
+
+TRACE_EVENT(mem_cgroup_fg_writeback,
+ TP_PROTO(unsigned long write_chunk,
+ struct writeback_control *wbc),
+
+ TP_ARGS(write_chunk, wbc),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, write_chunk)
+ __field(long, wbc_to_write)
+ __field(bool, shared_inodes)
+ ),
+
+ TP_fast_assign(
+ __entry->write_chunk = write_chunk;
+ __entry->wbc_to_write = wbc->nr_to_write;
+ __entry->shared_inodes = wbc->shared_inodes;
+ ),
+
+ TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
+ __entry->write_chunk,
+ __entry->wbc_to_write,
+ __entry->shared_inodes)
+)
+
+TRACE_EVENT(mem_cgroup_enable_shared_writeback,
+ TP_PROTO(unsigned short css_id),
+
+ TP_ARGS(css_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ ),
+
+ TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 230f0fb..e595514 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1619,6 +1619,156 @@ void mem_cgroup_writeback_done(void)
}
}

+/*
+ * This routine must be called by processes which are generating dirty pages.
+ * It considers the dirty page usage and thresholds of the current cgroup and
+ * (depending on whether hierarchical accounting is enabled) its ancestral
+ * memcgs. If any of the considered memcgs are over their background dirty
+ * limit, then background writeback is queued. If any are over the foreground
+ * dirty limit, then the dirtying task is throttled while writing dirty data.
+ * The per-memcg dirty limit checks performed by this routine are distinct
+ * from the per-system, per-bdi, and per-task limits considered by
+ * balance_dirty_pages().
+ */
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct mem_cgroup *mem;
+ struct mem_cgroup *ref_mem;
+ struct dirty_info info;
+ unsigned long nr_reclaimable;
+ unsigned long sys_available_mem;
+ unsigned long pause = 1;
+ unsigned short id;
+ bool over;
+ bool shared_inodes;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ sys_available_mem = determine_dirtyable_memory();
+
+ /* reference the memcg so it is not deleted during this routine */
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(current);
+ if (mem && mem_cgroup_is_root(mem))
+ mem = NULL;
+ if (mem)
+ css_get(&mem->css);
+ rcu_read_unlock();
+ ref_mem = mem;
+
+ /* balance entire ancestry of current's mem. */
+ for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
+ id = css_id(&mem->css);
+
+ /*
+ * keep throttling and writing inode data so long as mem is over
+ * its dirty limit.
+ */
+ for (shared_inodes = false; ; ) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ .for_cgroup = 1,
+ .nr_to_write = write_chunk,
+ .shared_inodes = shared_inodes,
+ };
+
+ /*
+ * if mem is under dirty limit, then break from
+ * throttling loop.
+ */
+ mem_cgroup_dirty_info(sys_available_mem, mem, &info);
+ nr_reclaimable = dirty_info_reclaimable(&info);
+ over = nr_reclaimable > info.dirty_thresh;
+ trace_mem_cgroup_consider_fg_writeback(
+ id, bdi, nr_reclaimable, info.dirty_thresh,
+ over);
+ if (!over)
+ break;
+
+ mem_cgroup_mark_over_bg_thresh(mem);
+ writeback_inodes_wb(&bdi->wb, &wbc);
+ trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
+ /* if no progress, then consider shared inodes */
+ if ((wbc.nr_to_write == write_chunk) &&
+ !shared_inodes) {
+ trace_mem_cgroup_enable_shared_writeback(id);
+ shared_inodes = true;
+ }
+
+ /*
+ * Sleep up to 100ms to throttle writer and wait for
+ * queued background I/O to complete.
+ */
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ io_schedule_timeout(pause);
+ pause <<= 1;
+ if (pause > HZ / 10)
+ pause = HZ / 10;
+ }
+
+ /* if mem is over background limit, then queue bg writeback */
+ over = nr_reclaimable >= info.background_thresh;
+ trace_mem_cgroup_consider_bg_writeback(
+ id, bdi, nr_reclaimable, info.background_thresh,
+ over);
+ if (over)
+ mem_cgroup_queue_bg_writeback(mem, bdi);
+ }
+
+ if (ref_mem)
+ css_put(&ref_mem->css);
+}
+
+/*
+ * Return the dirty thresholds and usage of the memcg (within the ancestral
+ * chain of @mem) closest to its dirty limit, or of the first memcg found to
+ * be over its limit.
+ *
+ * The check is not stable because the usage and limits can change
+ * asynchronously with respect to this routine.
+ */
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ unsigned long usage;
+ struct dirty_info uninitialized_var(cur_info);
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ info->nr_writeback = ULONG_MAX; /* invalid initial value */
+
+ /* walk up hierarchy enabled parents */
+ for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
+ mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
+ usage = dirty_info_reclaimable(&cur_info) +
+ cur_info.nr_writeback;
+
+ /* if over limit, stop searching */
+ if (usage >= cur_info.dirty_thresh) {
+ *info = cur_info;
+ break;
+ }
+
+ /*
+ * Save the dirty info of the memcg closest to its limit if either:
+ * - mem is the first memcg considered
+ * - mem's dirty margin is smaller than the last recorded one
+ */
+ if ((info->nr_writeback == ULONG_MAX) ||
+ (cur_info.dirty_thresh - usage) <
+ (info->dirty_thresh -
+ (dirty_info_reclaimable(info) + info->nr_writeback)))
+ *info = cur_info;
+ }
+
+ return info->nr_writeback != ULONG_MAX;
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-05-13 08:53:50

by Greg Thelen

[permalink] [raw]
Subject: [RFC][PATCH v7 13/14] writeback: make background writeback cgroup aware

When the system is under its background dirty memory threshold but a cgroup
is over its background dirty memory threshold, only write back
inodes associated with the over-limit cgroup(s).

In addition to checking if the system dirty memory usage is over the
system background threshold, over_bground_thresh() also checks if any
cgroups are over their respective background dirty memory thresholds.
The writeback_control.for_cgroup field is set to distinguish between a
system and memcg overage.

If performing cgroup writeback, move_expired_inodes() skips inodes that
do not contribute dirty pages to the cgroup being written back.

After writing some pages, wb_writeback() will call
mem_cgroup_writeback_done() to update the set of over-bg-limit memcgs.

Signed-off-by: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 31 +++++++++++++++++++++++--------
1 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0174fcf..b01bb2a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -256,14 +256,17 @@ static void move_expired_inodes(struct list_head *delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
- struct inode *inode;
+ struct inode *inode, *tmp_inode;
int do_sb_sort = 0;

- while (!list_empty(delaying_queue)) {
- inode = wb_inode(delaying_queue->prev);
+ list_for_each_entry_safe_reverse(inode, tmp_inode, delaying_queue,
+ i_wb_list) {
if (wbc->older_than_this &&
inode_dirtied_after(inode, *wbc->older_than_this))
break;
+ if (wbc->for_cgroup &&
+ !should_writeback_mem_cgroup_inode(inode, wbc))
+ continue;
if (sb && sb != inode->i_sb)
do_sb_sort = 1;
sb = inode->i_sb;
@@ -614,14 +617,21 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
*/
#define MAX_WRITEBACK_PAGES 1024

-static inline bool over_bground_thresh(void)
+static inline bool over_bground_thresh(struct bdi_writeback *wb,
+ struct writeback_control *wbc)
{
unsigned long background_thresh, dirty_thresh;

global_dirty_limits(&background_thresh, &dirty_thresh);

- return (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+ if (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
+ wbc->for_cgroup = 0;
+ return true;
+ }
+
+ wbc->for_cgroup = 1;
+ return mem_cgroups_over_bground_dirty_thresh();
}

/*
@@ -700,7 +710,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh())
+ if (work->for_background && !over_bground_thresh(wb, &wbc))
break;

if (work->for_kupdate || work->for_background) {
@@ -729,6 +739,9 @@ retry:
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;

+ if (write_chunk - wbc.nr_to_write > 0)
+ mem_cgroup_writeback_done();
+
/*
* Did we write something? Try for more
*
@@ -809,7 +822,9 @@ static unsigned long get_nr_dirty_pages(void)

static long wb_check_background_flush(struct bdi_writeback *wb)
{
- if (over_bground_thresh()) {
+ struct writeback_control wbc;
+
+ if (over_bground_thresh(wb, &wbc)) {

struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
--
1.7.3.1

2011-05-13 08:53:10

by Greg Thelen

[permalink] [raw]
Subject: [RFC][PATCH v7 14/14] memcg: check memcg dirty limits in page writeback

If the current process is in a non-root memcg, then
balance_dirty_pages() will consider the memcg dirty limits as well as
the system-wide limits. This allows different cgroups to have distinct
dirty limits which trigger direct and background writeback at different
levels.

If called with a mem_cgroup, then throttle_vm_writeout() queries the
given cgroup for its dirty memory usage limits.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Wu Fengguang <[email protected]>
---
Changelog since v6:
- Adapt to new mem_cgroup_hierarchical_dirty_info() parameters: it no longer
takes a background/foreground parameter.
- Trivial comment reword.

Changelog since v5:
- Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
cramming the somewhat different logic into balance_dirty_pages(). This means
the global (non-memcg) dirty limits are not passed around in the
struct dirty_info, so there's less change to existing code.

Changelog since v4:
- Added missing 'struct mem_cgroup' forward declaration in writeback.h.
- Made throttle_vm_writeout() memcg aware.
- Removed previously added dirty_writeback_pages() which is no longer needed.
- Added logic to balance_dirty_pages() to throttle if over foreground memcg
limit.

Changelog since v3:
- Leave determine_dirtyable_memory() static. v3 made it non-static.
- balance_dirty_pages() now considers both system and memcg dirty limits and
usage data. This data is retrieved with global_dirty_info() and
memcg_dirty_info().

include/linux/writeback.h | 3 ++-
mm/page-writeback.c | 35 +++++++++++++++++++++++++++++------
mm/vmscan.c | 2 +-
3 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 4f5c0d2..0b4b851 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -8,6 +8,7 @@
#include <linux/fs.h>

struct backing_dev_info;
+struct mem_cgroup;

/*
* fs/fs-writeback.c
@@ -91,7 +92,7 @@ void laptop_mode_timer_fn(unsigned long data);
#else
static inline void laptop_sync_completion(void) { }
#endif
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup);

/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 62fcf3d..30c265b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -473,7 +473,8 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
* If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
+ * perform some writeout. The current task may belong to a cgroup with
+ * dirty limits, which are also checked.
*/
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
@@ -488,6 +489,8 @@ static void balance_dirty_pages(struct address_space *mapping,
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;

+ mem_cgroup_balance_dirty_pages(mapping, write_chunk);
+
for (;;) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
@@ -651,23 +654,43 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+/*
+ * Throttle the current task if it is near dirty memory usage limits. Both
+ * global dirty memory limits and (if @mem_cgroup is given) per-cgroup dirty
+ * memory limits are checked.
+ *
+ * If near limits, then wait for usage to drop. Dirty usage should drop because
+ * dirty producers should have used balance_dirty_pages(), which would have
+ * scheduled writeback.
+ */
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
{
unsigned long background_thresh;
unsigned long dirty_thresh;
+ struct dirty_info memcg_info;
+ bool do_memcg;

for ( ; ; ) {
global_dirty_limits(&background_thresh, &dirty_thresh);
+ do_memcg = mem_cgroup &&
+ mem_cgroup_hierarchical_dirty_info(
+ determine_dirtyable_memory(), mem_cgroup,
+ &memcg_info);

/*
* Boost the allowable dirty threshold a bit for page
* allocators so they don't get DoS'ed by heavy writers
*/
dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
+ if (do_memcg)
+ memcg_info.dirty_thresh += memcg_info.dirty_thresh / 10;
+
+ if ((global_page_state(NR_UNSTABLE_NFS) +
+ global_page_state(NR_WRITEBACK) <= dirty_thresh) &&
+ (!do_memcg ||
+ (memcg_info.nr_unstable_nfs +
+ memcg_info.nr_writeback <= memcg_info.dirty_thresh)))
+ break;
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..66324a4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1953,7 +1953,7 @@ restart:
sc->nr_scanned - nr_scanned, sc))
goto restart;

- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
}

/*
--
1.7.3.1

2011-05-13 09:32:31

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 00/14] memcg: per cgroup dirty page accounting

On Fri, 13 May 2011 01:47:39 -0700
Greg Thelen <[email protected]> wrote:

> This patch series provides the ability for each cgroup to have independent dirty
> page usage limits. Limiting dirty memory fixes the max amount of dirty (hard to
> reclaim) page cache used by a cgroup. This allows for better per cgroup memory
> isolation and fewer ooms within a single cgroup.
>
> Having per cgroup dirty memory limits is not very interesting unless writeback
> is cgroup aware. There is not much isolation if cgroups have to writeback data
> from other cgroups to get below their dirty memory threshold.
>
> Per-memcg dirty limits are provided to support isolation and thus cross cgroup
> inode sharing is not a priority. This allows the code be simpler.
>
> To add cgroup awareness to writeback, this series adds a memcg field to the
> inode to allow writeback to isolate inodes for a particular cgroup. When an
> inode is marked dirty, i_memcg is set to the current cgroup. When inode pages
> are marked dirty the i_memcg field compared against the page's cgroup. If they
> differ, then the inode is marked as shared by setting i_memcg to a special
> shared value (zero).
>
> Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
> way to assoicate inodes with a cgroup without having to add a field to struct
> inode. I prototyped this approach but found that it involved more complex
> writeback changes and had at least one major shortcoming: detection of when an
> inode becomes shared by multiple cgroups. While such sharing is not expected to
> be common, the system should gracefully handle it.
>
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
> dirty usage vs dirty thresholds for the current cgroup and its parents. If any
> over-limit cgroups are found, they are marked in a global over-limit bitmap
> (indexed by cgroup id) and the bdi flusher is awoke.
>
> The bdi flusher uses wb_check_background_flush() to check for any memcg over
> their dirty limit. When performing per-memcg background writeback,
> move_expired_inodes() walks per bdi b_dirty list using each inode's i_memcg and
> the global over-limit memcg bitmap to determine if the inode should be written.
>
> If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
> threshold writing per-memcg inodes, then downshifts to also writing shared
> inodes (i_memcg=0).
>
> I know that there is some significant writeback changes associated with the
> IO-less balance_dirty_pages() effort. I am not trying to derail that, so this
> patch series is merely an RFC to get feedback on the design. There are probably
> some subtle races in these patches. I have done moderate functional testing of
> the newly proposed features.
>
> Here is an example of the memcg-oom that is avoided with this patch series:
> # mkdir /dev/cgroup/memory/x
> # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
> # echo $$ > /dev/cgroup/memory/x/tasks
> # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
> # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
> # wait
> [1]- Killed dd if=/dev/zero of=/data/f1 bs=1M count=1k
> [2]+ Killed dd if=/dev/zero of=/data/f1 bs=1M count=1k
>
> Known limitations:
> If a dirty limit is lowered a cgroup may be over its limit.
>


Thank you. I think this should be merged earlier than all of the other work. Without this,
I think all of the memcg memory reclaim changes will do something wrong.

I'll do a brief review today but I'll be busy until Wednesday, sorry.

In general, I agree with inode->i_mapping->i_memcg: a simple 2-byte field, and
ignoring the special case of an inode shared between memcgs.

BTW, IIUC, i_memcg is always reset when mark_inode_dirty() sets a new I_DIRTY in
the flags, right?

Thanks,
-Kame

2011-05-13 09:38:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 03/14] memcg: add mem_cgroup_mark_inode_dirty()

On Fri, 13 May 2011 01:47:42 -0700
Greg Thelen <[email protected]> wrote:

> Create the mem_cgroup_mark_inode_dirty() routine, which is called when
> an inode is marked dirty. In kernels without memcg, this is an inline
> no-op.
>
> Add i_memcg field to struct address_space. When an inode is marked
> dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
> recorded in i_memcg. Per-memcg writeback (introduced in a latter
> change) uses this field to isolate inodes associated with a particular
> memcg.
>
> The type of i_memcg is an 'unsigned short' because it stores the css_id
> of the memcg. Using a struct mem_cgroup pointer would be larger and
> also create a reference on the memcg which would hang memcg rmdir
> deletion. Usage of a css_id is not a reference so cgroup deletion is
> not affected. The memcg can be deleted without cleaning up the i_memcg
> field. When a memcg is deleted its pages are recharged to the cgroup
> parent, and the related inode(s) are marked as shared thus
> disassociating the inodes from the deleted cgroup.
>
> A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
> easier understanding of memcg writeback operation.
>
> Signed-off-by: Greg Thelen <[email protected]>


Seems simple and handy enough.

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-05-13 09:40:03

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 04/14] memcg: add dirty page accounting infrastructure

On Fri, 13 May 2011 01:47:43 -0700
Greg Thelen <[email protected]> wrote:

> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages. A
> later change adds kernel calls to these new routines.
>
> As inode pages are marked dirty, if the dirtied page's cgroup differs
> from the inode's cgroup, then mark the inode shared across several
> cgroup.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>


Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-05-13 09:48:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 08/14] writeback: add memcg fields to writeback_control

On Fri, 13 May 2011 01:47:47 -0700
Greg Thelen <[email protected]> wrote:

> Add writeback_control fields to differentiate between bdi-wide and
> per-cgroup writeback. Cgroup writeback is also able to differentiate
> between writing inodes isolated to a particular cgroup and inodes shared
> by multiple cgroups.
>
> Signed-off-by: Greg Thelen <[email protected]>

Personally, I want to see new flags with their usage in a patch...


> ---
> include/linux/writeback.h | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index d10d133..4f5c0d2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -47,6 +47,8 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + unsigned for_cgroup:1; /* enable cgroup writeback */
> + unsigned shared_inodes:1; /* write inodes spanning cgroups */
> };


If shared inodes are really a rare case... then we don't need this shared_inodes
flag and can always write back shared inodes..... No?
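
Something like this is what I have in mind -- just a rough, untested sketch
of should_writeback_mem_cgroup_inode() from patch 11 without the flag (the
names are yours; the simplification itself is only my suggestion, and the
tracepoint is dropped for brevity):

	bool should_writeback_mem_cgroup_inode(struct inode *inode,
					       struct writeback_control *wbc)
	{
		unsigned short id = inode->i_mapping->i_memcg;

		VM_BUG_ON(id >= CSS_ID_MAX + 1);

		/* shared inodes are always candidates for writeback */
		return id == I_MEMCG_SHARED ||
		       test_bit(id, over_bground_dirty_thresh);
	}

Then wbc->shared_inodes and the downshift logic in
mem_cgroup_balance_dirty_pages() could go away.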

Thanks,
-Kame

2011-05-13 09:58:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 09/14] cgroup: move CSS_ID_MAX to cgroup.h

On Fri, 13 May 2011 01:47:48 -0700
Greg Thelen <[email protected]> wrote:

> This allows users of css_id() to know the largest possible css_id value.
> This knowledge can be used to build per-cgroup bitmaps.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

Hmm, I think this can be merged into the following bitmap patch.

2011-05-13 10:03:08

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 10/14] memcg: dirty page accounting support routines

On Fri, 13 May 2011 01:47:49 -0700
Greg Thelen <[email protected]> wrote:

> Added memcg dirty page accounting support routines. These routines are
> used by later changes to provide memcg aware writeback and dirty page
> limiting. A mem_cgroup_dirty_info() tracepoint is is also included to
> allow for easier understanding of memcg writeback operation.
>
> Signed-off-by: Greg Thelen <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

with a small nit (see below).


> ---
> include/linux/memcontrol.h | 9 +++
> include/trace/events/memcontrol.h | 34 +++++++++
> mm/memcontrol.c | 145 +++++++++++++++++++++++++++++++++++++
> 3 files changed, 188 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f1261e5..f06c2de 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
> MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> + MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
> +};
> +
> +struct dirty_info {
> + unsigned long dirty_thresh;
> + unsigned long background_thresh;
> + unsigned long nr_file_dirty;
> + unsigned long nr_writeback;
> + unsigned long nr_unstable_nfs;
> };
>
> extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> index 781ef9fc..abf1306 100644
> --- a/include/trace/events/memcontrol.h
> +++ b/include/trace/events/memcontrol.h
> @@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
> TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
> )
>
> +TRACE_EVENT(mem_cgroup_dirty_info,
> + TP_PROTO(unsigned short css_id,
> + struct dirty_info *dirty_info),
> +
> + TP_ARGS(css_id, dirty_info),
> +
> + TP_STRUCT__entry(
> + __field(unsigned short, css_id)
> + __field(unsigned long, dirty_thresh)
> + __field(unsigned long, background_thresh)
> + __field(unsigned long, nr_file_dirty)
> + __field(unsigned long, nr_writeback)
> + __field(unsigned long, nr_unstable_nfs)
> + ),
> +
> + TP_fast_assign(
> + __entry->css_id = css_id;
> + __entry->dirty_thresh = dirty_info->dirty_thresh;
> + __entry->background_thresh = dirty_info->background_thresh;
> + __entry->nr_file_dirty = dirty_info->nr_file_dirty;
> + __entry->nr_writeback = dirty_info->nr_writeback;
> + __entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
> + ),
> +
> + TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
> + "unstable_nfs=%ld",
> + __entry->css_id,
> + __entry->dirty_thresh,
> + __entry->background_thresh,
> + __entry->nr_file_dirty,
> + __entry->nr_writeback,
> + __entry->nr_unstable_nfs)
> +)
> +
> #endif /* _TRACE_MEMCONTROL_H */
>
> /* This part must be outside protection */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 248396c..75ef32c 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1328,6 +1328,11 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
> return memcg->swappiness;
> }
>
> +static unsigned long dirty_info_reclaimable(struct dirty_info *info)
> +{
> + return info->nr_file_dirty + info->nr_unstable_nfs;
> +}
> +
> /*
> * Return true if the current memory cgroup has local dirty memory settings.
> * There is an allowed race between the current task migrating in-to/out-of the
> @@ -1358,6 +1363,146 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
> }
> }
>
> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *mem)
> +{
> + if (!do_swap_account)
> + return nr_swap_pages > 0;
> + return !mem->memsw_is_minimum &&
> + (res_counter_read_u64(&mem->memsw, RES_LIMIT) > 0);
> +}
> +
> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
> + enum mem_cgroup_page_stat_item item)
> +{
> + s64 ret;
> +
> + switch (item) {
> + case MEMCG_NR_FILE_DIRTY:
> + ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> + break;
> + case MEMCG_NR_FILE_WRITEBACK:
> + ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
> + break;
> + case MEMCG_NR_FILE_UNSTABLE_NFS:
> + ret = mem_cgroup_read_stat(mem,
> + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> + break;
> + case MEMCG_NR_DIRTYABLE_PAGES:
> + ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
> + mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
> + if (mem_cgroup_can_swap(mem))
> + ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
> + mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
> + break;
> + default:
> + BUG();
> + break;
> + }
> + return ret;
> +}
> +
> +/*
> + * Return the number of additional pages that the @mem cgroup could allocate.
> + * If use_hierarchy is set, then this involves checking parent mem cgroups to
> + * find the cgroup with the smallest free space.
> + */
> +static unsigned long
> +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
> +{
> + u64 free;
> + unsigned long min_free;
> +
> + min_free = global_page_state(NR_FREE_PAGES);
> +
> + while (mem) {
> + free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
> + res_counter_read_u64(&mem->res, RES_USAGE)) >>
> + PAGE_SHIFT;
> + min_free = min((u64)min_free, free);
> + mem = parent_mem_cgroup(mem);
> + }
> +
> + return min_free;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @mem: memory cgroup to query
> + * @item: memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value.
> + */
> +static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
> + enum mem_cgroup_page_stat_item item)

How about mem_cgroup_file_cache_stat() ?


> +{
> + struct mem_cgroup *iter;
> + s64 value;
> +
> + /*
> + * If we're looking for dirtyable pages we need to evaluate free pages
> + * depending on the limit and usage of the parents first of all.
> + */
> + if (item == MEMCG_NR_DIRTYABLE_PAGES)
> + value = mem_cgroup_hierarchical_free_pages(mem);
> + else
> + value = 0;
> +
> + /*
> + * Recursively evaluate page statistics against all cgroup under
> + * hierarchy tree
> + */
> + for_each_mem_cgroup_tree(iter, mem)
> + value += mem_cgroup_local_page_stat(iter, item);
> +
> + /*
> + * Summing of unlocked per-cpu counters is racy and may yield a slightly
> + * negative value. Zero is the only sensible value in such cases.
> + */
> + if (unlikely(value < 0))
> + value = 0;
> +
> + return value;
> +}

seems very nice handling of hierarchy :)

2011-05-13 10:11:50

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 11/14] memcg: create support routines for writeback

On Fri, 13 May 2011 01:47:50 -0700
Greg Thelen <[email protected]> wrote:

> Introduce memcg routines to assist in per-memcg writeback:
>
> - mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
> writeback because they are over their dirty memory threshold.
>
> - should_writeback_mem_cgroup_inode() determines if an inode is
> contributing pages to an over-limit memcg.
>
> - mem_cgroup_writeback_done() is used periodically during writeback to
> update memcg writeback data.
>
> Signed-off-by: Greg Thelen <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

I'm okay with the bitmap.. then the problem will be when to set/clear wbc->for_cgroup...

2011-05-13 10:21:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 13/14] writeback: make background writeback cgroup aware

On Fri, 13 May 2011 01:47:52 -0700
Greg Thelen <[email protected]> wrote:

> When the system is under background dirty memory threshold but a cgroup
> is over its background dirty memory threshold, then only writeback
> inodes associated with the over-limit cgroup(s).
>
> In addition to checking if the system dirty memory usage is over the
> system background threshold, over_bground_thresh() also checks if any
> cgroups are over their respective background dirty memory thresholds.
> The writeback_control.for_cgroup field is set to distinguish between a
> system and memcg overage.
>
> If performing cgroup writeback, move_expired_inodes() skips inodes that
> do not contribute dirty pages to the cgroup being written back.
>
> After writing some pages, wb_writeback() will call
> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>

Seems ok to me, at least..
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-05-14 00:55:24

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 00/14] memcg: per cgroup dirty page accounting

On Fri, May 13, 2011 at 2:25 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 13 May 2011 01:47:39 -0700
> Greg Thelen <[email protected]> wrote:
>
>> This patch series provides the ability for each cgroup to have independent dirty
>> page usage limits. ?Limiting dirty memory fixes the max amount of dirty (hard to
>> reclaim) page cache used by a cgroup. ?This allows for better per cgroup memory
>> isolation and fewer ooms within a single cgroup.
>>
>> Having per cgroup dirty memory limits is not very interesting unless writeback
>> is cgroup aware. ?There is not much isolation if cgroups have to writeback data
>> from other cgroups to get below their dirty memory threshold.
>>
>> Per-memcg dirty limits are provided to support isolation and thus cross cgroup
>> inode sharing is not a priority. ?This allows the code be simpler.
>>
>> To add cgroup awareness to writeback, this series adds a memcg field to the
>> inode to allow writeback to isolate inodes for a particular cgroup. ?When an
>> inode is marked dirty, i_memcg is set to the current cgroup. ?When inode pages
>> are marked dirty the i_memcg field compared against the page's cgroup. ?If they
>> differ, then the inode is marked as shared by setting i_memcg to a special
>> shared value (zero).
>>
>> Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
>> way to assoicate inodes with a cgroup without having to add a field to struct
>> inode. ?I prototyped this approach but found that it involved more complex
>> writeback changes and had at least one major shortcoming: detection of when an
>> inode becomes shared by multiple cgroups. ?While such sharing is not expected to
>> be common, the system should gracefully handle it.
>>
>> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
>> dirty usage vs dirty thresholds for the current cgroup and its parents. ?If any
>> over-limit cgroups are found, they are marked in a global over-limit bitmap
>> (indexed by cgroup id) and the bdi flusher is awoke.
>>
>> The bdi flusher uses wb_check_background_flush() to check for any memcg over
>> their dirty limit. ?When performing per-memcg background writeback,
>> move_expired_inodes() walks per bdi b_dirty list using each inode's i_memcg and
>> the global over-limit memcg bitmap to determine if the inode should be written.
>>
>> If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
>> threshold writing per-memcg inodes, then downshifts to also writing shared
>> inodes (i_memcg=0).
>>
>> I know that there is some significant writeback changes associated with the
>> IO-less balance_dirty_pages() effort. ?I am not trying to derail that, so this
>> patch series is merely an RFC to get feedback on the design. ?There are probably
>> some subtle races in these patches. ?I have done moderate functional testing of
>> the newly proposed features.
>>
>> Here is an example of the memcg-oom that is avoided with this patch series:
>> ? ? ? # mkdir /dev/cgroup/memory/x
>> ? ? ? # echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
>> ? ? ? # echo $$ > /dev/cgroup/memory/x/tasks
>> ? ? ? # dd if=/dev/zero of=/data/f1 bs=1k count=1M &
>> ? ? ? ? # dd if=/dev/zero of=/data/f2 bs=1k count=1M &
>> ? ? ? ? # wait
>> ? ? ? [1]- ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
>> ? ? ? [2]+ ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
>>
>> Known limitations:
>> ? ? ? If a dirty limit is lowered a cgroup may be over its limit.
>>
>
>
> Thank you, I think this should be merged earlier than all other works. Without this,
> I think all memory reclaim changes of memcg will do something wrong.
>
> I'll do a brief review today but I'll be busy until Wednesday, sorry.

Thank you.

> In general, I agree with inode->i_mapping->i_memcg, simple 2bytes field and
> ignoring a special case of shared inode between memcg.

These proposed patches do not optimize for sharing, but the patches do
attempt to handle sharing to ensure forward progress. The sharing
case I have in mind is where an inode is transferred between memcgs
(e.g. if cgroup_a appends to a log file and then cgroup_b appends to
the same file). While such cases are thought to be somewhat rare for
isolated memcg workloads, they will happen sometimes. In these
situations I want to make sure that the memcg that is charged for
dirty pages of a shared inode is able to make forward progress writing
dirty pages so it can drop below the cgroup dirty memory threshold.

The patches try to perform well for cgroups that operate on non-shared
inodes. If a cgroup has no shared inodes, then that cgroup should not
be punished if other cgroups have shared inodes.

Currently the patches perform the following:
1) when exceeding background limit, wake bdi flusher to write any
inodes of over-limit cgroups.
2) when exceeding foreground limit, write dirty inodes of the
over-limit cgroup. This will change when integrated with IO-less
balance_dirty_pages(). If unable to make forward progress, also write
shared inodes.

One could argue that step (2) should always consider writing shared
inodes, but I wanted to avoid burdening cgroups that had no shared
inodes with the responsibility of writing dirty shared inodes.

> BTW, IIUC, i_memcg is resetted always when mark_inode_dirty() sets new I_DIRTY to
> the flags, right ?

Yes.

> Thanks,
> -Kame

One small bug in this patch series is that per-memcg background
writeback does not write shared inode pages. In the (potentially
common) case where the system dirty memory usage is below the system
background dirty threshold but at least one cgroup is over its
background dirty limit, per-memcg background writeback is queued
for any over-background-threshold cgroups. In this case background
writeback should be allowed to write back shared inodes. The hope is
that writing such inodes has a good chance of cleaning them so
they can transition from shared to non-shared. Such a transition is
good because the inode will then remain unshared until it is again
written by multiple cgroups. This is easy to fix by having
wb_check_background_flush() set shared_inodes=1. I will include this
change in the next version of these patches.
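
Roughly, the fix I have in mind looks like this (an untested sketch against
patch 13; it assumes wb_writeback_work grows a shared_inodes field that
wb_writeback() copies into its wbc -- that plumbing is not shown in this
posting):

	static long wb_check_background_flush(struct bdi_writeback *wb)
	{
		struct writeback_control wbc;

		if (over_bground_thresh(wb, &wbc)) {
			struct wb_writeback_work work = {
				.nr_pages	= LONG_MAX,
				.sync_mode	= WB_SYNC_NONE,
				.for_background	= 1,
				.range_cyclic	= 1,
				/*
				 * Let per-memcg background writeback also
				 * clean shared (i_memcg == 0) inodes.
				 */
				.shared_inodes	= 1,
			};

			return wb_writeback(wb, &work);
		}

		return 0;
	}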

2011-05-15 19:53:37

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 08/14] writeback: add memcg fields to writeback_control

On Fri, May 13, 2011 at 2:41 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 13 May 2011 01:47:47 -0700
> Greg Thelen <[email protected]> wrote:
>
>> Add writeback_control fields to differentiate between bdi-wide and
>> per-cgroup writeback. ?Cgroup writeback is also able to differentiate
>> between writing inodes isolated to a particular cgroup and inodes shared
>> by multiple cgroups.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>
> Personally, I want to see new flags with their usage in a patch...

OK. The next version will merge the flag definition with the first usage of the flag.

>> ---
>> ?include/linux/writeback.h | ? ?2 ++
>> ?1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index d10d133..4f5c0d2 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -47,6 +47,8 @@ struct writeback_control {
>> ? ? ? unsigned for_reclaim:1; ? ? ? ? /* Invoked from the page allocator */
>> ? ? ? unsigned range_cyclic:1; ? ? ? ?/* range_start is cyclic */
>> ? ? ? unsigned more_io:1; ? ? ? ? ? ? /* more io to be dispatched */
>> + ? ? unsigned for_cgroup:1; ? ? ? ? ?/* enable cgroup writeback */
>> + ? ? unsigned shared_inodes:1; ? ? ? /* write inodes spanning cgroups */
>> ?};
>
>
> If shared_inode is really rare case...we don't need to have this shared_inodes
> flag and do writeback shared_inode always.....No ?
>
> Thanks,
> -Kame

The shared_inodes field is present to avoid punishing cgroups that are
not sharing, if they are run on a system that also includes sharing.

This issue is being debated in another thread: "[RFC][PATCH v7 00/14]
memcg: per cgroup dirty page accounting". Depending on the decision,
we may be able to delete the shared_inodes field if we choose to
always write shared inodes in both cgroup foreground and cgroup
background writeback.

2011-05-15 19:53:45

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 09/14] cgroup: move CSS_ID_MAX to cgroup.h

On Fri, May 13, 2011 at 2:51 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 13 May 2011 01:47:48 -0700
> Greg Thelen <[email protected]> wrote:
>
>> This allows users of css_id() to know the largest possible css_id value.
>> This knowledge can be used to build per-cgroup bitmaps.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Hmm, I think this can be merged to following bitmap patch.

OK. I will merge this into the following patch.

2011-05-15 19:56:24

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 10/14] memcg: dirty page accounting support routines

On Fri, May 13, 2011 at 2:56 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 13 May 2011 01:47:49 -0700
> Greg Thelen <[email protected]> wrote:
>
>> Added memcg dirty page accounting support routines. ?These routines are
>> used by later changes to provide memcg aware writeback and dirty page
>> limiting. ?A mem_cgroup_dirty_info() tracepoint is is also included to
>> allow for easier understanding of memcg writeback operation.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> with small nit..(see below)
>
>
>> ---
>> ?include/linux/memcontrol.h ? ? ? ?| ? ?9 +++
>> ?include/trace/events/memcontrol.h | ? 34 +++++++++
>> ?mm/memcontrol.c ? ? ? ? ? ? ? ? ? | ?145 +++++++++++++++++++++++++++++++++++++
>> ?3 files changed, 188 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index f1261e5..f06c2de 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
>> ? ? ? MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
>> ? ? ? MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
>> ? ? ? MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
>> + ? ? MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
>> +};
>> +
>> +struct dirty_info {
>> + ? ? unsigned long dirty_thresh;
>> + ? ? unsigned long background_thresh;
>> + ? ? unsigned long nr_file_dirty;
>> + ? ? unsigned long nr_writeback;
>> + ? ? unsigned long nr_unstable_nfs;
>> ?};
>>
>> ?extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
>> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> index 781ef9fc..abf1306 100644
>> --- a/include/trace/events/memcontrol.h
>> +++ b/include/trace/events/memcontrol.h
>> @@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
>> ? ? ? TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
>> ?)
>>
>> +TRACE_EVENT(mem_cgroup_dirty_info,
>> + ? ? TP_PROTO(unsigned short css_id,
>> + ? ? ? ? ? ? ?struct dirty_info *dirty_info),
>> +
>> + ? ? TP_ARGS(css_id, dirty_info),
>> +
>> + ? ? TP_STRUCT__entry(
>> + ? ? ? ? ? ? __field(unsigned short, css_id)
>> + ? ? ? ? ? ? __field(unsigned long, dirty_thresh)
>> + ? ? ? ? ? ? __field(unsigned long, background_thresh)
>> + ? ? ? ? ? ? __field(unsigned long, nr_file_dirty)
>> + ? ? ? ? ? ? __field(unsigned long, nr_writeback)
>> + ? ? ? ? ? ? __field(unsigned long, nr_unstable_nfs)
>> + ? ? ? ? ? ? ),
>> +
>> + ? ? TP_fast_assign(
>> + ? ? ? ? ? ? __entry->css_id = css_id;
>> + ? ? ? ? ? ? __entry->dirty_thresh = dirty_info->dirty_thresh;
>> + ? ? ? ? ? ? __entry->background_thresh = dirty_info->background_thresh;
>> + ? ? ? ? ? ? __entry->nr_file_dirty = dirty_info->nr_file_dirty;
>> + ? ? ? ? ? ? __entry->nr_writeback = dirty_info->nr_writeback;
>> + ? ? ? ? ? ? __entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
>> + ? ? ? ? ? ? ),
>> +
>> + ? ? TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
>> + ? ? ? ? ? ? ? "unstable_nfs=%ld",
>> + ? ? ? ? ? ? ? __entry->css_id,
>> + ? ? ? ? ? ? ? __entry->dirty_thresh,
>> + ? ? ? ? ? ? ? __entry->background_thresh,
>> + ? ? ? ? ? ? ? __entry->nr_file_dirty,
>> + ? ? ? ? ? ? ? __entry->nr_writeback,
>> + ? ? ? ? ? ? ? __entry->nr_unstable_nfs)
>> +)
>> +
>> ?#endif /* _TRACE_MEMCONTROL_H */
>>
>> ?/* This part must be outside protection */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 248396c..75ef32c 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1328,6 +1328,11 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
>> ? ? ? return memcg->swappiness;
>> ?}
>>
>> +static unsigned long dirty_info_reclaimable(struct dirty_info *info)
>> +{
>> + ? ? return info->nr_file_dirty + info->nr_unstable_nfs;
>> +}
>> +
>> ?/*
>> ? * Return true if the current memory cgroup has local dirty memory settings.
>> ? * There is an allowed race between the current task migrating in-to/out-of the
>> @@ -1358,6 +1363,146 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
>> ? ? ? }
>> ?}
>>
>> +static inline bool mem_cgroup_can_swap(struct mem_cgroup *mem)
>> +{
>> + ? ? if (!do_swap_account)
>> + ? ? ? ? ? ? return nr_swap_pages > 0;
>> + ? ? return !mem->memsw_is_minimum &&
>> + ? ? ? ? ? ? (res_counter_read_u64(&mem->memsw, RES_LIMIT) > 0);
>> +}
>> +
>> +static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum mem_cgroup_page_stat_item item)
>> +{
>> + ? ? s64 ret;
>> +
>> + ? ? switch (item) {
>> + ? ? case MEMCG_NR_FILE_DIRTY:
>> + ? ? ? ? ? ? ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
>> + ? ? ? ? ? ? break;
>> + ? ? case MEMCG_NR_FILE_WRITEBACK:
>> + ? ? ? ? ? ? ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
>> + ? ? ? ? ? ? break;
>> + ? ? case MEMCG_NR_FILE_UNSTABLE_NFS:
>> + ? ? ? ? ? ? ret = mem_cgroup_read_stat(mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
>> + ? ? ? ? ? ? break;
>> + ? ? case MEMCG_NR_DIRTYABLE_PAGES:
>> + ? ? ? ? ? ? ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
>> + ? ? ? ? ? ? ? ? ? ? mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
>> + ? ? ? ? ? ? if (mem_cgroup_can_swap(mem))
>> + ? ? ? ? ? ? ? ? ? ? ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
>> + ? ? ? ? ? ? break;
>> + ? ? default:
>> + ? ? ? ? ? ? BUG();
>> + ? ? ? ? ? ? break;
>> + ? ? }
>> + ? ? return ret;
>> +}
>> +
>> +/*
>> + * Return the number of additional pages that the @mem cgroup could allocate.
>> + * If use_hierarchy is set, then this involves checking parent mem cgroups to
>> + * find the cgroup with the smallest free space.
>> + */
>> +static unsigned long
>> +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
>> +{
>> + ? ? u64 free;
>> + ? ? unsigned long min_free;
>> +
>> + ? ? min_free = global_page_state(NR_FREE_PAGES);
>> +
>> + ? ? while (mem) {
>> + ? ? ? ? ? ? free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
>> + ? ? ? ? ? ? ? ? ? ? res_counter_read_u64(&mem->res, RES_USAGE)) >>
>> + ? ? ? ? ? ? ? ? ? ? PAGE_SHIFT;
>> + ? ? ? ? ? ? min_free = min((u64)min_free, free);
>> + ? ? ? ? ? ? mem = parent_mem_cgroup(mem);
>> + ? ? }
>> +
>> + ? ? return min_free;
>> +}
>> +
>> +/*
>> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
>> + * @mem: ? ? ? memory cgroup to query
>> + * @item: ? ? ?memory statistic item exported to the kernel
>> + *
>> + * Return the accounted statistic value.
>> + */
>> +static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? enum mem_cgroup_page_stat_item item)
>
> How about mem_cgroup_file_cache_stat() ?

The suggested rename is possible. But for consistency I assume we
would also want to rename:
* mem_cgroup_local_page_stat()
* enum mem_cgroup_page_stat_item
* mem_cgroup_update_page_stat()
* mem_cgroup_move_account_page_stat()

I have a slight preference for leaving it as is,
mem_cgroup_page_stat(), to allow for future coverage of pages other
than just file cache pages. But I do not feel very strongly.

>> +{
>> + ? ? struct mem_cgroup *iter;
>> + ? ? s64 value;
>> +
>> + ? ? /*
>> + ? ? ?* If we're looking for dirtyable pages we need to evaluate free pages
>> + ? ? ?* depending on the limit and usage of the parents first of all.
>> + ? ? ?*/
>> + ? ? if (item == MEMCG_NR_DIRTYABLE_PAGES)
>> + ? ? ? ? ? ? value = mem_cgroup_hierarchical_free_pages(mem);
>> + ? ? else
>> + ? ? ? ? ? ? value = 0;
>> +
>> + ? ? /*
>> + ? ? ?* Recursively evaluate page statistics against all cgroup under
>> + ? ? ?* hierarchy tree
>> + ? ? ?*/
>> + ? ? for_each_mem_cgroup_tree(iter, mem)
>> + ? ? ? ? ? ? value += mem_cgroup_local_page_stat(iter, item);
>> +
>> + ? ? /*
>> + ? ? ?* Summing of unlocked per-cpu counters is racy and may yield a slightly
>> + ? ? ?* negative value. ?Zero is the only sensible value in such cases.
>> + ? ? ?*/
>> + ? ? if (unlikely(value < 0))
>> + ? ? ? ? ? ? value = 0;
>> +
>> + ? ? return value;
>> +}
>
> seems very nice handling of hierarchy :)

2011-05-15 19:56:48

by Greg Thelen

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 11/14] memcg: create support routines for writeback

On Fri, May 13, 2011 at 3:04 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 13 May 2011 01:47:50 -0700
> Greg Thelen <[email protected]> wrote:
>
>> Introduce memcg routines to assist in per-memcg writeback:
>>
>> - mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
>> ? writeback because they are over their dirty memory threshold.
>>
>> - should_writeback_mem_cgroup_inode() determines if an inode is
>> ? contributing pages to an over-limit memcg.
>>
>> - mem_cgroup_writeback_done() is used periodically during writeback to
>> ? update memcg writeback data.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
>
> I'm okay with the bitmap..then, problem will be when set/clear wbc->for_cgroup...

wbc->for_cgroup is only set under two conditions:

a) when mem_cgroup_balance_dirty_pages() is trying to get a cgroup
below its dirty memory foreground threshold. This is in patch 12/14.

b) when the bdi flusher is performing background writeback and determines
that any of the cgroups are over their respective background dirty
memory thresholds. This is in patch 13/14.

2011-05-16 06:05:26

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH v7 10/14] memcg: dirty page accounting support routines

On Sun, 15 May 2011 12:56:00 -0700
Greg Thelen <[email protected]> wrote:

> On Fri, May 13, 2011 at 2:56 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Fri, 13 May 2011 01:47:49 -0700
> > Greg Thelen <[email protected]> wrote:

> >> +static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
> >> +                                       enum mem_cgroup_page_stat_item item)
> >
> > How about mem_cgroup_file_cache_stat() ?
>
> The suggested rename is possible. But for consistency I assume we
> would also want to rename:
> * mem_cgroup_local_page_stat()
> * enum mem_cgroup_page_stat_item
> * mem_cgroup_update_page_stat()
> * mem_cgroup_move_account_page_stat()
>

Yes, maybe a cleanup is necessary.

> I have a slight preference for leaving it as is,
> mem_cgroup_page_stat(), to allow for future coverage of pages other
> that just file cache pages. But I do not feel very strongly.
>

OK, I don't have a big concern about the naming for now. Please do as you like.

Thanks,
-Kame