2011-06-03 16:13:18

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 00/12] memcg: per cgroup dirty page accounting

This patch series provides the ability for each cgroup to have independent dirty
page usage limits. Limiting dirty memory caps the amount of dirty (hard to
reclaim) page cache used by a cgroup. This allows for better per-cgroup memory
isolation and fewer OOMs within a single cgroup.

Having per cgroup dirty memory limits is not very interesting unless writeback
is cgroup aware. There is not much isolation if cgroups have to write back data
from other cgroups to get below their dirty memory threshold.

Per-memcg dirty limits are provided to support isolation, so cross-cgroup
inode sharing is not a priority. This allows the code to be simpler.

To add cgroup awareness to writeback, this series adds a memcg field to the
inode to allow writeback to isolate inodes for a particular cgroup. When an
inode is marked dirty, i_memcg is set to the current cgroup. When inode pages
are marked dirty, the i_memcg field is compared against the page's cgroup. If
they differ, then the inode is marked as shared by setting i_memcg to a special
shared value (zero).
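
In rough pseudo-C (a simplified sketch of what patches 03 and 04 below add;
css_id_of_current_memcg() is only shorthand here for looking up the css_id of
the current task's memcg, not a real helper):

  /* inode marked dirty (patch 03): remember the dirtying memcg */
  void mem_cgroup_mark_inode_dirty(struct inode *inode)
  {
          /* shorthand for mem_cgroup_from_task(current) + css_id() */
          inode->i_mapping->i_memcg = css_id_of_current_memcg();
  }

  /* page of that inode accounted dirty (patch 04) */
  if (mapping->i_memcg != css_id(&mem->css))
          /* a second memcg dirtied this inode; mark it shared */
          mapping->i_memcg = I_MEMCG_SHARED;      /* == 0 */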

Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
way to associate inodes with a cgroup without having to add a field to struct
inode. I prototyped this approach but found that it involved more complex
writeback changes and had at least one major shortcoming: detection of when an
inode becomes shared by multiple cgroups. While such sharing is not expected to
be common, the system should gracefully handle it.

balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
dirty usage vs dirty thresholds for the current cgroup and its parents. If any
over-limit cgroups are found, they are marked in a global over-limit bitmap
(indexed by cgroup id) and the bdi flusher is woken.
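
Roughly (a simplified sketch of that flow; the real code is in the
'support routines for writeback' and 'page-writeback' patches below, and
current_memcg() is only shorthand for the current task's memcg):

  struct mem_cgroup *mem;
  struct dirty_info info;

  for (mem = current_memcg(); mem; mem = parent_mem_cgroup(mem)) {
          mem_cgroup_dirty_info(determine_dirtyable_memory(), mem, &info);
          if (dirty_info_reclaimable(&info) > info.background_thresh) {
                  /* flag mem (and its children) in the global bitmap */
                  mem_cgroup_mark_over_bg_thresh(mem);
                  bdi_start_background_writeback(mapping->backing_dev_info);
          }
  }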

The bdi flusher uses wb_check_background_flush() to check for any memcg over
their dirty limit. When performing per-memcg background writeback,
move_expired_inodes() walks the per-bdi b_dirty list using each inode's i_memcg and
the global over-limit memcg bitmap to determine if the inode should be written.

If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
threshold by writing per-memcg inodes, then it downshifts to also writing
shared inodes (i_memcg=0).
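
A simplified form of the inode selection predicate (the full version is in
the 'support routines for writeback' patch below):

  bool should_writeback_mem_cgroup_inode(struct inode *inode,
                                         struct writeback_control *wbc)
  {
          unsigned short id = inode->i_mapping->i_memcg;

          /* shared inodes are only written when explicitly requested */
          if (id == I_MEMCG_SHARED)
                  return wbc->shared_inodes;

          /* otherwise, write the inode iff its memcg is over its limit */
          return test_bit(id, over_bground_dirty_thresh);
  }

The downshift then amounts to repeating the writeout with
wbc->shared_inodes set.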

I know that there are significant writeback changes associated with the
IO-less balance_dirty_pages() effort. I am not trying to derail that, so this
patch series is merely an RFC to get feedback on the design. There are probably
some subtle races in these patches. I have done moderate functional testing of
the newly proposed features.

Here is an example of the memcg-oom that is avoided with this patch series:
# mkdir /dev/cgroup/memory/x
# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
# echo $$ > /dev/cgroup/memory/x/tasks
# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
# wait
[1]- Killed dd if=/dev/zero of=/data/f1 bs=1k count=1M
[2]+ Killed dd if=/dev/zero of=/data/f2 bs=1k count=1M

Known limitations:
If a dirty limit is lowered, a cgroup may be over its limit.

Changes since -v7:
- Merged -v7 09/14 'cgroup: move CSS_ID_MAX to cgroup.h' into
-v8 09/13 'memcg: create support routines for writeback'

- Merged -v7 08/14 'writeback: add memcg fields to writeback_control'
into -v8 09/13 'memcg: create support routines for writeback' and
-v8 10/13 'memcg: create support routines for page-writeback'. This
moves the declaration of new fields with the first usage of the
respective fields.

- mem_cgroup_writeback_done() now clears the corresponding bit for a cgroup that
cannot be referenced. Such a bit would represent a cgroup previously over its
dirty limit, but that has been deleted before writeback cleaned all pages. By
clearing the bit, writeback will not continually try to write back the deleted
cgroup.

- Previously mem_cgroup_writeback_done() would only finish writeback when the
cgroup's dirty memory usage dropped below the dirty limit. This was the wrong
limit to check. This now correctly checks usage against the background dirty
limit.

- over_bground_thresh() now sets shared_inodes=1. In -v7 per memcg
background writeback did not, so it did not write pages of shared
inodes in background writeback. In the (potentially common) case
where the system dirty memory usage is below the system background
dirty threshold but at least one cgroup is over its background dirty
limit, then per memcg background writeback is queued for any
over-background-threshold cgroups. Background writeback should be
allowed to write back shared inodes. The hope is that writing such
inodes has a good chance of cleaning them so they can transition
from shared to non-shared. Such a transition is good because then the
inode will remain unshared until it is again written by multiple cgroups.
Non-shared inodes offer better isolation.

Single patch that can be applied to mmotm-2011-05-12-15-52:
http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v8-on-mmotm-2011-05-12-15-52.patch

Patches are based on mmotm-2011-05-12-15-52.

Greg Thelen (12):
memcg: document cgroup dirty memory interfaces
memcg: add page_cgroup flags for dirty page tracking
memcg: add mem_cgroup_mark_inode_dirty()
memcg: add dirty page accounting infrastructure
memcg: add kernel calls for memcg dirty page stats
memcg: add dirty limits to mem_cgroup
memcg: add cgroupfs interface to memcg dirty limits
memcg: dirty page accounting support routines
memcg: create support routines for writeback
memcg: create support routines for page-writeback
writeback: make background writeback cgroup aware
memcg: check memcg dirty limits in page writeback

Documentation/cgroups/memory.txt | 70 ++++
fs/fs-writeback.c | 34 ++-
fs/inode.c | 3 +
fs/nfs/write.c | 4 +
include/linux/cgroup.h | 1 +
include/linux/fs.h | 9 +
include/linux/memcontrol.h | 63 ++++-
include/linux/page_cgroup.h | 23 ++
include/linux/writeback.h | 5 +-
include/trace/events/memcontrol.h | 198 +++++++++++
kernel/cgroup.c | 1 -
mm/filemap.c | 1 +
mm/memcontrol.c | 708 ++++++++++++++++++++++++++++++++++++-
mm/page-writeback.c | 42 ++-
mm/truncate.c | 1 +
mm/vmscan.c | 2 +-
16 files changed, 1138 insertions(+), 27 deletions(-)
create mode 100644 include/trace/events/memcontrol.h

--
1.7.3.1


2011-06-03 16:13:24

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 01/12] memcg: document cgroup dirty memory interfaces

Document cgroup dirty memory interfaces and statistics.

The implementation of these new interface routines comes in the
following patches of this series.
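
For example, with the series applied the new counters appear alongside the
existing ones in memory.stat (values below are purely illustrative):

# grep -e dirty -e writeback -e nfs_unstable /dev/cgroup/memory/x/memory.stat
dirty 1318912
writeback 0
nfs_unstable 0
total_dirty 1318912
total_writeback 0
total_nfs_unstable 0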

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Balbir Singh <[email protected]>
---
Documentation/cgroups/memory.txt | 70 ++++++++++++++++++++++++++++++++++++++
1 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 43b9e46..15019a3 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -395,6 +395,10 @@ soft_direct_steal- # of pages reclaimed in global hierarchical reclaim from
direct reclaim
soft_direct_scan- # of pages scanned in global hierarchical reclaim from
direct reclaim
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
+nfs_unstable - # of bytes sent to the NFS server, but not yet committed to
+ the actual storage.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
@@ -420,6 +424,9 @@ total_soft_kswapd_steal - sum of all children's "soft_kswapd_steal"
total_soft_kswapd_scan - sum of all children's "soft_kswapd_scan"
total_soft_direct_steal - sum of all children's "soft_direct_steal"
total_soft_direct_scan - sum of all children's "soft_direct_scan"
+total_dirty - sum of all children's "dirty"
+total_writeback - sum of all children's "writeback"
+total_nfs_unstable - sum of all children's "nfs_unstable"
total_inactive_anon - sum of all children's "inactive_anon"
total_active_anon - sum of all children's "active_anon"
total_inactive_file - sum of all children's "inactive_file"
@@ -476,6 +483,69 @@ value for efficient access. (Of course, when necessary, it's synchronized.)
If you want to know more exact memory usage, you should use RSS+CACHE(+SWAP)
value in memory.stat(see 5.2).

+5.6 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup. So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be throttled if they cross that limit. System-wide dirty limits are also
+consulted. Dirty memory consumption is checked against both system-wide and
+per-cgroup dirty limits.
+
+The interface is similar to the procfs interface: /proc/sys/vm/dirty_*. It is
+possible to configure a limit to trigger throttling of a dirtier or queue
+background writeback. The root cgroup memory.dirty_* control files are
+read-only and match the contents of the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+ cgroup memory) at which a process generating dirty pages will be throttled.
+ The default value is the system-wide dirty ratio, /proc/sys/vm/dirty_ratio.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+ in the cgroup at which a process generating dirty pages will be throttled.
+ Suffix (k, K, m, M, g, or G) can be used to indicate that value is kilo, mega
+ or gigabytes. The default value is the system-wide dirty limit,
+ /proc/sys/vm/dirty_bytes.
+
+ Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+ Only one may be specified at a time. When one is written it is immediately
+ taken into account to evaluate the dirty memory limits and the other appears
+ as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+ (expressed as a percentage of cgroup memory) at which background writeback
+ kernel threads will start writing out dirty data. The default value is the
+ system-wide background dirty ratio, /proc/sys/vm/dirty_background_ratio.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+ in bytes) in the cgroup at which background writeback kernel threads will
+ start writing out dirty data. Suffix (k, K, m, M, g, or G) can be used to
+ indicate that value is kilo, mega or gigabytes. The default value is the
+ system-wide dirty background limit, /proc/sys/vm/dirty_background_bytes.
+
+ Note: memory.dirty_background_limit_in_bytes is the counterpart of
+ memory.dirty_background_ratio. Only one may be specified at a time. When one
+ is written it is immediately taken into account to evaluate the dirty memory
+ limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit. This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup. Example: If page is allocated by a
+cgroup A task, then the page is charged to cgroup A. If the page is later
+dirtied by a task in cgroup B, then the cgroup A dirty count will be
+incremented. If cgroup A is over its dirty limit but cgroup B is not, then
+dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
+limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+When use_hierarchy=1 the dirty limits of parent cgroups are also checked to
+ensure that no dirty limit is exceeded.
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.3.1

2011-06-03 16:13:43

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 02/12] memcg: add page_cgroup flags for dirty page tracking

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/page_cgroup.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 961ecc7..66d3245 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -10,6 +10,9 @@ enum {
/* flags for mem_cgroup and file and I/O status */
PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is under writeback */
+ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
/* No lock in page_cgroup */
PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
__NR_PCG_FLAGS,
@@ -67,6 +70,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }

+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
CLEARPCGFLAG(Cache, CACHE)
@@ -86,6 +93,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)

+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
--
1.7.3.1

2011-06-03 16:13:50

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 03/12] memcg: add mem_cgroup_mark_inode_dirty()

Create the mem_cgroup_mark_inode_dirty() routine, which is called when
an inode is marked dirty. In kernels without memcg, this is an inline
no-op.

Add i_memcg field to struct address_space. When an inode is marked
dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
recorded in i_memcg. Per-memcg writeback (introduced in a later
change) uses this field to isolate inodes associated with a particular
memcg.

The type of i_memcg is an 'unsigned short' because it stores the css_id
of the memcg. Using a struct mem_cgroup pointer would be larger and
also create a reference on the memcg which would hang memcg rmdir
deletion. Usage of a css_id is not a reference so cgroup deletion is
not affected. The memcg can be deleted without cleaning up the i_memcg
field. When a memcg is deleted its pages are recharged to the cgroup
parent, and the related inode(s) are marked as shared thus
disassociating the inodes from the deleted cgroup.

A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
easier understanding of memcg writeback operation.
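
For example, the tracepoint can be enabled through the normal tracing
interface (path assumes debugfs mounted at /sys/kernel/debug); each event
prints the inode number and css_id per the TP_printk format below:

# echo 1 > /sys/kernel/debug/tracing/events/memcontrol/mem_cgroup_mark_inode_dirty/enable
# cat /sys/kernel/debug/tracing/trace_pipe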

Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/fs-writeback.c | 2 ++
fs/inode.c | 3 +++
include/linux/fs.h | 9 +++++++++
include/linux/memcontrol.h | 6 ++++++
include/trace/events/memcontrol.h | 32 ++++++++++++++++++++++++++++++++
mm/memcontrol.c | 24 ++++++++++++++++++++++++
6 files changed, 76 insertions(+), 0 deletions(-)
create mode 100644 include/trace/events/memcontrol.h

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3392c29..0174fcf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,7 @@
#include <linux/slab.h>
#include <linux/sched.h>
#include <linux/fs.h>
+#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
@@ -1111,6 +1112,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_lock(&bdi->wb.list_lock);
inode->dirtied_when = jiffies;
list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+ mem_cgroup_mark_inode_dirty(inode);
spin_unlock(&bdi->wb.list_lock);

if (wakeup_bdi)
diff --git a/fs/inode.c b/fs/inode.c
index ce61a1b..9ecb0bb 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -228,6 +228,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
mapping->assoc_mapping = NULL;
mapping->backing_dev_info = &default_backing_dev_info;
mapping->writeback_index = 0;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ mapping->i_memcg = 0;
+#endif

/*
* If the block_device provides a backing_dev_info for client
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 29c02f6..deabca3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -645,6 +645,9 @@ struct address_space {
spinlock_t private_lock; /* for use by the address_space */
struct list_head private_list; /* ditto */
struct address_space *assoc_mapping; /* ditto */
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ unsigned short i_memcg; /* css_id of memcg dirtier */
+#endif
} __attribute__((aligned(sizeof(long))));
/*
* On most architectures that alignment is already the case; but
@@ -652,6 +655,12 @@ struct address_space {
* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
*/

+/*
+ * When an address_space is shared by multiple memcg dirtiers, then i_memcg is
+ * set to this special, wildcard, css_id value (zero).
+ */
+#define I_MEMCG_SHARED 0
+
struct block_device {
dev_t bd_dev; /* not a kdev_t - it's a search key */
int bd_openers;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 77e47f5..14b6d67 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -103,6 +103,8 @@ mem_cgroup_prepare_migration(struct page *page,
extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
struct page *oldpage, struct page *newpage, bool migration_ok);

+void mem_cgroup_mark_inode_dirty(struct inode *inode);
+
/*
* For memory reclaim.
*/
@@ -273,6 +275,10 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
{
}

+static inline void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+}
+
static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
{
return 0;
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
new file mode 100644
index 0000000..781ef9fc
--- /dev/null
+++ b/include/trace/events/memcontrol.h
@@ -0,0 +1,32 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM memcontrol
+
+#if !defined(_TRACE_MEMCONTROL_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MEMCONTROL_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(mem_cgroup_mark_inode_dirty,
+ TP_PROTO(struct inode *inode),
+
+ TP_ARGS(inode),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned short, css_id)
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->css_id =
+ inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+ ),
+
+ TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
+)
+
+#endif /* _TRACE_MEMCONTROL_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bf642b5..e83ef74 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -54,6 +54,9 @@

#include <trace/events/vmscan.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/memcontrol.h>
+
struct cgroup_subsys mem_cgroup_subsys __read_mostly;
#define MEM_CGROUP_RECLAIM_RETRIES 5
struct mem_cgroup *root_mem_cgroup __read_mostly;
@@ -1122,6 +1125,27 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
return inactive_ratio;
}

+/*
+ * Mark the current task's memcg as the memcg associated with inode. Note: the
+ * recorded cgroup css_id is not guaranteed to remain correct. The current task
+ * may be moved to another cgroup. The memcg may also be deleted before the
+ * caller has time to use the i_memcg.
+ */
+void mem_cgroup_mark_inode_dirty(struct inode *inode)
+{
+ struct mem_cgroup *mem;
+ unsigned short id;
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(current);
+ id = mem ? css_id(&mem->css) : 0;
+ rcu_read_unlock();
+
+ inode->i_mapping->i_memcg = id;
+
+ trace_mem_cgroup_mark_inode_dirty(inode);
+}
+
int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{
unsigned long active;
--
1.7.3.1

2011-06-03 16:14:07

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 04/12] memcg: add dirty page accounting infrastructure

Add memcg routines to count dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages. A
later change adds kernel calls to these new routines.

As inode pages are marked dirty, if the dirtied page's cgroup differs
from the inode's cgroup, then the inode is marked as shared across
several cgroups.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 8 +++-
mm/memcontrol.c | 105 +++++++++++++++++++++++++++++++++++++++++---
2 files changed, 105 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 14b6d67..f1261e5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -27,9 +27,15 @@ struct page_cgroup;
struct page;
struct mm_struct;

-/* Stats that can be updated by kernel. */
+/*
+ * Per mem_cgroup page counts tracked by kernel. As pages enter and leave these
+ * states, the kernel notifies memcg using mem_cgroup_{inc,dec}_page_stat().
+ */
enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e83ef74..e323978 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -86,8 +86,11 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
MEM_CGROUP_STAT_NSTATS,
@@ -1860,6 +1863,7 @@ void mem_cgroup_update_page_stat(struct page *page,
{
struct mem_cgroup *mem;
struct page_cgroup *pc = lookup_page_cgroup(page);
+ struct address_space *mapping;
bool need_unlock = false;
unsigned long uninitialized_var(flags);

@@ -1888,6 +1892,53 @@ void mem_cgroup_update_page_stat(struct page *page,
ClearPageCgroupFileMapped(pc);
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
+
+ case MEMCG_NR_FILE_DIRTY:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ mapping = page_mapping(page);
+ if (TestSetPageCgroupFileDirty(pc))
+ val = 0;
+ else if (mapping &&
+ (mapping->i_memcg != css_id(&mem->css)))
+ /*
+ * If the inode is being dirtied by a memcg
+ * other than the one that marked it dirty, then
+ * mark the inode shared by multiple memcg.
+ */
+ mapping->i_memcg = I_MEMCG_SHARED;
+ } else {
+ if (!TestClearPageCgroupFileDirty(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_DIRTY;
+ break;
+
+ case MEMCG_NR_FILE_WRITEBACK:
+ /*
+ * This counter is adjusted while holding the mapping's
+ * tree_lock. Therefore there is no race between setting and
+ * clearing of this flag.
+ */
+ if (val > 0)
+ SetPageCgroupFileWriteback(pc);
+ else
+ ClearPageCgroupFileWriteback(pc);
+ idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+ break;
+
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+ break;
+
default:
BUG();
}
@@ -2447,6 +2498,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
}
#endif

+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2495,13 +2557,28 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_page_cgroup(pc, &flags);

- if (PageCgroupFileMapped(pc)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
+ if (PageCgroupFileMapped(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+ if (PageCgroupFileDirty(pc)) {
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+ /*
+ * Moving a dirty file page between memcg makes the underlying
+ * inode shared. If the new (to) cgroup attempts writeback it
+ * should consider this inode. If the old (from) cgroup
+ * attempts writeback it likely has other pages in the same
+ * inode. The inode is now shared by the to and from cgroups.
+ * So mark the inode as shared.
+ */
+ page_mapping(page)->i_memcg = I_MEMCG_SHARED;
}
+ if (PageCgroupFileWriteback(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+ if (PageCgroupFileUnstableNFS(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
@@ -3981,6 +4058,9 @@ enum {
MCS_SOFT_KSWAPD_SCAN,
MCS_SOFT_DIRECT_STEAL,
MCS_SOFT_DIRECT_SCAN,
+ MCS_FILE_DIRTY,
+ MCS_WRITEBACK,
+ MCS_UNSTABLE_NFS,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
MCS_INACTIVE_FILE,
@@ -4009,6 +4089,9 @@ struct {
{"soft_kswapd_scan", "total_soft_scan"},
{"soft_direct_steal", "total_soft_direct_steal"},
{"soft_direct_scan", "total_soft_direct_scan"},
+ {"dirty", "total_dirty"},
+ {"writeback", "total_writeback"},
+ {"nfs_unstable", "total_nfs_unstable"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
{"inactive_file", "total_inactive_file"},
@@ -4050,6 +4133,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
val = mem_cgroup_read_events(mem, MEM_CGROUP_EVENTS_PGMAJFAULT);
s->stat[MCS_PGMAJFAULT] += val;

+ /* dirty stat */
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
/* per zone stat */
val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
--
1.7.3.1

2011-06-03 16:14:15

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 05/12] memcg: add kernel calls for memcg dirty page stats

Add calls into memcg dirty page accounting. Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs. This
allows the memory controller to maintain an accurate view of the amount
of its memory that is dirty.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
---
fs/nfs/write.c | 4 ++++
mm/filemap.c | 1 +
mm/page-writeback.c | 7 ++++++-
mm/truncate.c | 1 +
4 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 3bd5d7e..c23b168 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg)
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
pnfs_mark_request_commit(req, lseg);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -460,6 +461,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;

if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1376,6 +1378,8 @@ void nfs_retry_commit(struct list_head *page_list,
req = nfs_list_entry(page_list->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req, lseg);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 707ae82..6cd8297 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -145,6 +145,7 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cca0803..62fcf3d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1124,6 +1124,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1140,6 +1141,7 @@ EXPORT_SYMBOL(account_page_dirtied);
*/
void account_page_writeback(struct page *page)
{
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
inc_zone_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_WRITTEN);
}
@@ -1323,6 +1325,7 @@ int clear_page_dirty_for_io(struct page *page)
* for more comments.
*/
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -1358,8 +1361,10 @@ int test_clear_page_writeback(struct page *page)
} else {
ret = TestClearPageWriteback(page);
}
- if (ret)
+ if (ret) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK);
+ }
return ret;
}

diff --git a/mm/truncate.c b/mm/truncate.c
index 3a29a61..3dbade6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
--
1.7.3.1

2011-06-03 16:14:24

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 06/12] memcg: add dirty limits to mem_cgroup

Extend mem_cgroup to contain dirty page limits.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e323978..91e0692 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -216,6 +216,14 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -260,6 +268,10 @@ struct mem_cgroup {
atomic_t refcnt;

unsigned int swappiness;
+
+ /* control memory cgroup dirty pages */
+ struct vm_dirty_param dirty_param;
+
/* OOM-Killer disable */
int oom_kill_disable;

@@ -1309,6 +1321,36 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs. So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
+{
+ return mem && !mem_cgroup_is_root(mem);
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits. If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where neither dirty_[background_]_ratio nor _bytes are set.
+ */
+static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
+ struct mem_cgroup *mem)
+{
+ if (mem_cgroup_has_dirty_limit(mem)) {
+ *param = mem->dirty_param;
+ } else {
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = vm_dirty_bytes;
+ param->dirty_background_ratio = dirty_background_ratio;
+ param->dirty_background_bytes = dirty_background_bytes;
+ }
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
@@ -4917,8 +4959,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_node = MAX_NUMNODES;
INIT_LIST_HEAD(&mem->oom_notify);

- if (parent)
+ if (parent) {
mem->swappiness = get_swappiness(parent);
+ mem_cgroup_dirty_param(&mem->dirty_param, parent);
+ } else {
+ /*
+ * The root cgroup dirty_param field is not used, instead,
+ * system-wide dirty limits are used.
+ */
+ }
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
--
1.7.3.1

2011-06-03 16:14:53

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 07/12] memcg: add cgroupfs interface to memcg dirty limits

Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_limit_in_bytes

Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts. This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
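
For example (using the same /dev/cgroup mount point as the cover letter;
note that writing one of a ratio/bytes pair zeroes its counterpart):

# echo 10 > /dev/cgroup/memory/x/memory.dirty_ratio
# cat /dev/cgroup/memory/x/memory.dirty_limit_in_bytes
0
# echo 200M > /dev/cgroup/memory/x/memory.dirty_background_limit_in_bytes
# cat /dev/cgroup/memory/x/memory.dirty_background_ratio
0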

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 91e0692..72fed89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -126,6 +126,13 @@ enum mem_cgroup_events_target {
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)

+enum {
+ MEM_CGROUP_DIRTY_RATIO,
+ MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
@@ -4638,6 +4645,89 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ bool use_sys = !mem_cgroup_has_dirty_limit(mem);
+
+ switch (cft->private) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ return use_sys ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ return use_sys ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ return use_sys ? dirty_background_ratio :
+ mem->dirty_param.dirty_background_ratio;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ return use_sys ? dirty_background_bytes :
+ mem->dirty_param.dirty_background_bytes;
+ default:
+ BUG();
+ }
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return -EINVAL;
+ if ((type == MEM_CGROUP_DIRTY_RATIO ||
+ type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+ return -EINVAL;
+ switch (type) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ memcg->dirty_param.dirty_ratio = val;
+ memcg->dirty_param.dirty_bytes = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ memcg->dirty_param.dirty_background_ratio = val;
+ memcg->dirty_param.dirty_background_bytes = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4701,6 +4791,30 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "dirty_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_RATIO,
+ },
+ {
+ .name = "dirty_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ },
+ {
+ .name = "dirty_background_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ },
+ {
+ .name = "dirty_background_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
--
1.7.3.1

2011-06-03 16:15:39

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 08/12] memcg: dirty page accounting support routines

Added memcg dirty page accounting support routines. These routines are
used by later changes to provide memcg aware writeback and dirty page
limiting. A mem_cgroup_dirty_info() tracepoint is also included to
allow for easier understanding of memcg writeback operation.

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 9 +++
include/trace/events/memcontrol.h | 34 +++++++++
mm/memcontrol.c | 145 +++++++++++++++++++++++++++++++++++++
3 files changed, 188 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f1261e5..f06c2de 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -36,6 +36,15 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+ MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
+};
+
+struct dirty_info {
+ unsigned long dirty_thresh;
+ unsigned long background_thresh;
+ unsigned long nr_file_dirty;
+ unsigned long nr_writeback;
+ unsigned long nr_unstable_nfs;
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 781ef9fc..abf1306 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -26,6 +26,40 @@ TRACE_EVENT(mem_cgroup_mark_inode_dirty,
TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
)

+TRACE_EVENT(mem_cgroup_dirty_info,
+ TP_PROTO(unsigned short css_id,
+ struct dirty_info *dirty_info),
+
+ TP_ARGS(css_id, dirty_info),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ __field(unsigned long, dirty_thresh)
+ __field(unsigned long, background_thresh)
+ __field(unsigned long, nr_file_dirty)
+ __field(unsigned long, nr_writeback)
+ __field(unsigned long, nr_unstable_nfs)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ __entry->dirty_thresh = dirty_info->dirty_thresh;
+ __entry->background_thresh = dirty_info->background_thresh;
+ __entry->nr_file_dirty = dirty_info->nr_file_dirty;
+ __entry->nr_writeback = dirty_info->nr_writeback;
+ __entry->nr_unstable_nfs = dirty_info->nr_unstable_nfs;
+ ),
+
+ TP_printk("css_id=%d thresh=%ld bg_thresh=%ld dirty=%ld wb=%ld "
+ "unstable_nfs=%ld",
+ __entry->css_id,
+ __entry->dirty_thresh,
+ __entry->background_thresh,
+ __entry->nr_file_dirty,
+ __entry->nr_writeback,
+ __entry->nr_unstable_nfs)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72fed89..4306a61 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1328,6 +1328,11 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static unsigned long dirty_info_reclaimable(struct dirty_info *info)
+{
+ return info->nr_file_dirty + info->nr_unstable_nfs;
+}
+
/*
* Return true if the current memory cgroup has local dirty memory settings.
* There is an allowed race between the current task migrating in-to/out-of the
@@ -1358,6 +1363,146 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
}
}

+static inline bool mem_cgroup_can_swap(struct mem_cgroup *mem)
+{
+ if (!do_swap_account)
+ return nr_swap_pages > 0;
+ return !mem->memsw_is_minimum &&
+ (res_counter_read_u64(&mem->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ s64 ret;
+
+ switch (item) {
+ case MEMCG_NR_FILE_DIRTY:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ break;
+ case MEMCG_NR_FILE_WRITEBACK:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ break;
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ ret = mem_cgroup_read_stat(mem,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ break;
+ case MEMCG_NR_DIRTYABLE_PAGES:
+ ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+ if (mem_cgroup_can_swap(mem))
+ ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Return the number of additional pages that the @mem cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+ u64 free;
+ unsigned long min_free;
+
+ min_free = global_page_state(NR_FREE_PAGES);
+
+ while (mem) {
+ free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
+ res_counter_read_u64(&mem->res, RES_USAGE)) >>
+ PAGE_SHIFT;
+ min_free = min((u64)min_free, free);
+ mem = parent_mem_cgroup(mem);
+ }
+
+ return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @mem: memory cgroup to query
+ * @item: memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ struct mem_cgroup *iter;
+ s64 value;
+
+ /*
+ * If we're looking for dirtyable pages we need to evaluate free pages
+ * depending on the limit and usage of the parents first of all.
+ */
+ if (item == MEMCG_NR_DIRTYABLE_PAGES)
+ value = mem_cgroup_hierarchical_free_pages(mem);
+ else
+ value = 0;
+
+ /*
+ * Recursively evaluate page statistics against all cgroup under
+ * hierarchy tree
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+ value += mem_cgroup_local_page_stat(iter, item);
+
+ /*
+ * Summing of unlocked per-cpu counters is racy and may yield a slightly
+ * negative value. Zero is the only sensible value in such cases.
+ */
+ if (unlikely(value < 0))
+ value = 0;
+
+ return value;
+}
+
+/* Return dirty thresholds and usage for @mem. */
+static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ unsigned long uninitialized_var(available_mem);
+ struct vm_dirty_param dirty_param;
+
+ mem_cgroup_dirty_param(&dirty_param, mem);
+
+ if (!dirty_param.dirty_bytes || !dirty_param.dirty_background_bytes)
+ available_mem = min(
+ sys_available_mem,
+ mem_cgroup_page_stat(mem, MEMCG_NR_DIRTYABLE_PAGES));
+
+ if (dirty_param.dirty_bytes)
+ info->dirty_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+ else
+ info->dirty_thresh =
+ (dirty_param.dirty_ratio * available_mem) / 100;
+
+ if (dirty_param.dirty_background_bytes)
+ info->background_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+ PAGE_SIZE);
+ else
+ info->background_thresh =
+ (dirty_param.dirty_background_ratio *
+ available_mem) / 100;
+
+ info->nr_file_dirty = mem_cgroup_page_stat(mem, MEMCG_NR_FILE_DIRTY);
+ info->nr_writeback = mem_cgroup_page_stat(mem, MEMCG_NR_FILE_WRITEBACK);
+ info->nr_unstable_nfs =
+ mem_cgroup_page_stat(mem, MEMCG_NR_FILE_UNSTABLE_NFS);
+
+ trace_mem_cgroup_dirty_info(css_id(&mem->css), info);
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-06-03 16:15:56

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 09/12] memcg: create support routines for writeback

Introduce memcg routines to assist in per-memcg writeback:

- mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
writeback because they are over their dirty memory threshold.

- should_writeback_mem_cgroup_inode() determines if an inode is
contributing pages to an over-limit memcg. A new writeback_control
field determines if shared inodes should be written back.

- mem_cgroup_writeback_done() is used periodically during writeback to
update memcg writeback data.

These routines make use of a new over_bground_dirty_thresh bitmap that
indicates which mem_cgroups are over their respective dirty background
threshold. As this bitmap is indexed by css_id, the largest possible
css_id value is needed to create the bitmap. So move the definition of
CSS_ID_MAX from cgroup.c to cgroup.h. This allows users of css_id() to
know the largest possible css_id value. This knowledge can be used to
build such per-cgroup bitmaps.

Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v7:
- Combined creation of and first use of new struct writeback_control
shared_inodes field into this patch. In -v7 these were in separate patches.

- Promote CSS_ID_MAX from cgroup.c to cgroup.h for general usage. In -v7 this
change was in a separate patch. CSS_ID_MAX is needed by this patch, so the
promotion is included with this patch.

- mem_cgroup_writeback_done() now clears the corresponding bit for a cgroup that
cannot be referenced. Such a bit would represent a cgroup previously over its
dirty limit, but that has been deleted before writeback cleaned all pages. By
clearing the bit, writeback will not continually try to write back the deleted
cgroup.

- Previously mem_cgroup_writeback_done() would only finish writeback when the
cgroup's dirty memory usage dropped below the dirty limit. This was the wrong
limit to check. This now correctly checks usage against the background dirty
limit.

include/linux/cgroup.h | 1 +
include/linux/memcontrol.h | 22 +++++++
include/linux/writeback.h | 1 +
include/trace/events/memcontrol.h | 49 +++++++++++++++
kernel/cgroup.c | 1 -
mm/memcontrol.c | 119 +++++++++++++++++++++++++++++++++++++
6 files changed, 192 insertions(+), 1 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ab4ac0c..5eb6543 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -624,6 +624,7 @@ bool css_is_ancestor(struct cgroup_subsys_state *cg,
const struct cgroup_subsys_state *root);

/* Get id and depth of css */
+#define CSS_ID_MAX (65535)
unsigned short css_id(struct cgroup_subsys_state *css);
unsigned short css_depth(struct cgroup_subsys_state *css);
struct cgroup_subsys_state *cgroup_css_from_dir(struct file *f, int id);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f06c2de..3d72e09 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -26,6 +26,7 @@ struct mem_cgroup;
struct page_cgroup;
struct page;
struct mm_struct;
+struct writeback_control;

/*
* Per mem_cgroup page counts tracked by kernel. As pages enter and leave these
@@ -162,6 +163,11 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc);
+bool mem_cgroups_over_bground_dirty_thresh(void);
+void mem_cgroup_writeback_done(void);
+
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
unsigned long *total_scanned);
@@ -361,6 +367,22 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
{
}

+static inline bool
+should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc)
+{
+ return true;
+}
+
+static inline bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+ return true;
+}
+
+static inline void mem_cgroup_writeback_done(void)
+{
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d10d133..66ec339 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -47,6 +47,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
+ unsigned shared_inodes:1; /* write inodes spanning cgroups */
};

/*
diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index abf1306..326a66b 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -60,6 +60,55 @@ TRACE_EVENT(mem_cgroup_dirty_info,
__entry->nr_unstable_nfs)
)

+TRACE_EVENT(should_writeback_mem_cgroup_inode,
+ TP_PROTO(struct inode *inode,
+ struct writeback_control *wbc,
+ bool over_limit),
+
+ TP_ARGS(inode, wbc, over_limit),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, ino)
+ __field(unsigned short, css_id)
+ __field(bool, shared_inodes)
+ __field(bool, over_limit)
+ ),
+
+ TP_fast_assign(
+ __entry->ino = inode->i_ino;
+ __entry->css_id =
+ inode->i_mapping ? inode->i_mapping->i_memcg : 0;
+ __entry->shared_inodes = wbc->shared_inodes;
+ __entry->over_limit = over_limit;
+ ),
+
+ TP_printk("ino=%ld css_id=%d shared_inodes=%d over_limit=%d",
+ __entry->ino,
+ __entry->css_id,
+ __entry->shared_inodes,
+ __entry->over_limit)
+)
+
+TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
+ TP_PROTO(bool over_limit,
+ unsigned short first_id),
+
+ TP_ARGS(over_limit, first_id),
+
+ TP_STRUCT__entry(
+ __field(bool, over_limit)
+ __field(unsigned short, first_id)
+ ),
+
+ TP_fast_assign(
+ __entry->over_limit = over_limit;
+ __entry->first_id = first_id;
+ ),
+
+ TP_printk("over_limit=%d first_css_id=%d", __entry->over_limit,
+ __entry->first_id)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2731d11..ab7e7a7 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -129,7 +129,6 @@ static struct cgroupfs_root rootnode;
* CSS ID -- ID per subsys's Cgroup Subsys State(CSS). used only when
* cgroup_subsys->use_id != 0.
*/
-#define CSS_ID_MAX (65535)
struct css_id {
/*
* The css to which this ID points. This pointer is set to valid value
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4306a61..a5b1794 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -389,10 +389,18 @@ enum charge_type {
#define MEM_CGROUP_RECLAIM_SOFT_BIT 0x2
#define MEM_CGROUP_RECLAIM_SOFT (1 << MEM_CGROUP_RECLAIM_SOFT_BIT)

+/*
+ * A bitmap representing all possible memcg, indexed by css_id. Each bit
+ * indicates if the respective memcg is over its background dirty memory
+ * limit.
+ */
+static DECLARE_BITMAP(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
static void mem_cgroup_get(struct mem_cgroup *mem);
static void mem_cgroup_put(struct mem_cgroup *mem);
static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *mem);
static void drain_all_stock_async(void);
+static struct mem_cgroup *mem_cgroup_lookup(unsigned short id);

static struct mem_cgroup_per_zone *
mem_cgroup_zoneinfo(struct mem_cgroup *mem, int nid, int zid)
@@ -1503,6 +1511,117 @@ static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
trace_mem_cgroup_dirty_info(css_id(&mem->css), info);
}

+/* Are any memcg over their background dirty memory limit? */
+bool mem_cgroups_over_bground_dirty_thresh(void)
+{
+ bool over_thresh;
+
+ over_thresh = !bitmap_empty(over_bground_dirty_thresh, CSS_ID_MAX + 1);
+
+ trace_mem_cgroups_over_bground_dirty_thresh(
+ over_thresh,
+ over_thresh ? find_next_bit(over_bground_dirty_thresh,
+ CSS_ID_MAX + 1, 0) : 0);
+
+ return over_thresh;
+}
+
+/*
+ * Should inode be written back? wbc indicates if this is foreground or
+ * background writeback and the set of inodes worth considering.
+ */
+bool should_writeback_mem_cgroup_inode(struct inode *inode,
+ struct writeback_control *wbc)
+{
+ unsigned short id;
+ bool over;
+
+ id = inode->i_mapping->i_memcg;
+ VM_BUG_ON(id >= CSS_ID_MAX + 1);
+
+ if (wbc->shared_inodes && id == I_MEMCG_SHARED)
+ over = true;
+ else
+ over = test_bit(id, over_bground_dirty_thresh);
+
+ trace_should_writeback_mem_cgroup_inode(inode, wbc, over);
+ return over;
+}
+
+/*
+ * Mark all child cgroup as eligible for writeback because @mem is over its bg
+ * threshold.
+ */
+static void mem_cgroup_mark_over_bg_thresh(struct mem_cgroup *mem)
+{
+ struct mem_cgroup *iter;
+
+ /* mark this and all child cgroup as candidates for writeback */
+ for_each_mem_cgroup_tree(iter, mem)
+ set_bit(css_id(&iter->css), over_bground_dirty_thresh);
+}
+
+static void mem_cgroup_queue_bg_writeback(struct mem_cgroup *mem,
+ struct backing_dev_info *bdi)
+{
+ mem_cgroup_mark_over_bg_thresh(mem);
+ bdi_start_background_writeback(bdi);
+}
+
+/*
+ * This routine is called when per-memcg writeback completes. It scans any
+ * previously over-bground-thresh memcg to determine if the memcg are still over
+ * their background dirty memory limit.
+ */
+void mem_cgroup_writeback_done(void)
+{
+ struct mem_cgroup *mem;
+ struct mem_cgroup *ref_mem;
+ struct dirty_info info;
+ unsigned long sys_available_mem;
+ int id;
+
+ sys_available_mem = 0;
+
+ /* for each previously over-bg-limit memcg... */
+ for (id = 0; (id = find_next_bit(over_bground_dirty_thresh,
+ CSS_ID_MAX + 1, id)) < CSS_ID_MAX + 1;
+ id++) {
+
+ /* reference the memcg */
+ rcu_read_lock();
+ mem = mem_cgroup_lookup(id);
+ if (mem && !css_tryget(&mem->css))
+ mem = NULL;
+ rcu_read_unlock();
+ if (!mem) {
+ clear_bit(id, over_bground_dirty_thresh);
+ continue;
+ }
+ ref_mem = mem;
+
+ if (!sys_available_mem)
+ sys_available_mem = determine_dirtyable_memory();
+
+ /*
+ * Walk the ancestry of mem, clearing the over-limit bit
+ * for any memcg under its background dirty memory
+ * threshold.
+ */
+ for (; mem_cgroup_has_dirty_limit(mem);
+ mem = parent_mem_cgroup(mem)) {
+ mem_cgroup_dirty_info(sys_available_mem, mem, &info);
+ if (dirty_info_reclaimable(&info) >=
+ info.background_thresh)
+ break;
+
+ clear_bit(css_id(&mem->css), over_bground_dirty_thresh);
+ }
+
+ css_put(&ref_mem->css);
+ }
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-06-03 16:15:37

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 10/12] memcg: create support routines for page-writeback

Introduce memcg routines to assist in per-memcg dirty page management:

- mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
dirty memory usage against memcg foreground and background thresholds.
If an over-background-threshold memcg is found, then per-memcg
background writeback is queued. Per-memcg writeback differs from
classic, non-memcg, per bdi writeback by setting the new
writeback_control.for_cgroup bit.

If an over-foreground-threshold memcg is found, then foreground
writeout occurs. When performing foreground writeout, first consider
inodes exclusive to the memcg. If unable to make enough progress,
then consider inodes shared between memcgs. Such cross-memcg inode
sharing is likely to be rare in situations that use per-cgroup memory
isolation. The approach tries to handle the common (non-shared)
case well without punishing well-behaved (non-sharing) cgroups.
As a last resort, write back shared inodes.

This routine is used by balance_dirty_pages() in a later change.

- mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
and limits of the memcg closest to (or over) its dirty limit. This
will be used by throttle_vm_writeout() in a later change; a hedged
sketch of both intended call sites follows this list.
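
For orientation, here is a minimal sketch of how the later patch in this
series is expected to hook these two routines into mm/page-writeback.c.
The enclosing functions (balance_dirty_pages() and throttle_vm_writeout())
already exist; the exact hook placement, the use of mem_cgroup_from_task(),
the throttle condition, and the omission of locking/refcounting around the
memcg lookup are assumptions made for illustration, not the literal 12/12
hunks:

	/* sketch only -- not the actual patch 12/12 code */
	static void balance_dirty_pages(struct address_space *mapping,
					unsigned long write_chunk)
	{
		/* ... existing global and per-bdi balancing ... */

		/*
		 * Also enforce the per-memcg dirty limits; this may queue
		 * per-memcg background writeback and throttle the dirtier.
		 */
		mem_cgroup_balance_dirty_pages(mapping, write_chunk);
	}

	void throttle_vm_writeout(gfp_t gfp_mask)
	{
		struct dirty_info info;

		/* ... existing global NR_WRITEBACK-based throttling ... */

		/*
		 * Assumed usage: find the memcg in current's hierarchy that
		 * is closest to (or over) its dirty limit and wait if its
		 * writeback exceeds that limit.
		 */
		if (mem_cgroup_hierarchical_dirty_info(
			    determine_dirtyable_memory(),
			    mem_cgroup_from_task(current), &info) &&
		    info.nr_writeback > info.dirty_thresh)
			congestion_wait(BLK_RW_ASYNC, HZ / 10);
	}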

Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v7:
- Add more detail to commit description.

- Declare the new writeback_control for_cgroup bit in this change, the
first patch that uses the new field. In -v7 the field was declared in a
separate patch.

include/linux/memcontrol.h | 18 +++++
include/linux/writeback.h | 1 +
include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
4 files changed, 252 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3d72e09..0d0363e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
struct writeback_control *wbc);
bool mem_cgroups_over_bground_dirty_thresh(void);
void mem_cgroup_writeback_done(void);
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info);
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk);

unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
@@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
{
}

+static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+}
+
+static inline bool
+mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ return false;
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 66ec339..4f5c0d2 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -47,6 +47,7 @@ struct writeback_control {
unsigned for_reclaim:1; /* Invoked from the page allocator */
unsigned range_cyclic:1; /* range_start is cyclic */
unsigned more_io:1; /* more io to be dispatched */
+ unsigned for_cgroup:1; /* enable cgroup writeback */
unsigned shared_inodes:1; /* write inodes spanning cgroups */
};

diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
index 326a66b..b42dae1 100644
--- a/include/trace/events/memcontrol.h
+++ b/include/trace/events/memcontrol.h
@@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
__entry->first_id)
)

+DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
+ TP_PROTO(unsigned short css_id,
+ struct backing_dev_info *bdi,
+ unsigned long nr_reclaimable,
+ unsigned long thresh,
+ bool over_limit),
+
+ TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ __field(struct backing_dev_info *, bdi)
+ __field(unsigned long, nr_reclaimable)
+ __field(unsigned long, thresh)
+ __field(bool, over_limit)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ __entry->bdi = bdi;
+ __entry->nr_reclaimable = nr_reclaimable;
+ __entry->thresh = thresh;
+ __entry->over_limit = over_limit;
+ ),
+
+ TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
+ "over_limit=%d", __entry->css_id, __entry->bdi,
+ __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
+)
+
+#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
+DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
+ TP_PROTO(unsigned short id, \
+ struct backing_dev_info *bdi, \
+ unsigned long nr_reclaimable, \
+ unsigned long thresh, \
+ bool over_limit), \
+ TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
+)
+
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
+DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
+
+TRACE_EVENT(mem_cgroup_fg_writeback,
+ TP_PROTO(unsigned long write_chunk,
+ struct writeback_control *wbc),
+
+ TP_ARGS(write_chunk, wbc),
+
+ TP_STRUCT__entry(
+ __field(unsigned long, write_chunk)
+ __field(long, wbc_to_write)
+ __field(bool, shared_inodes)
+ ),
+
+ TP_fast_assign(
+ __entry->write_chunk = write_chunk;
+ __entry->wbc_to_write = wbc->nr_to_write;
+ __entry->shared_inodes = wbc->shared_inodes;
+ ),
+
+ TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
+ __entry->write_chunk,
+ __entry->wbc_to_write,
+ __entry->shared_inodes)
+)
+
+TRACE_EVENT(mem_cgroup_enable_shared_writeback,
+ TP_PROTO(unsigned short css_id),
+
+ TP_ARGS(css_id),
+
+ TP_STRUCT__entry(
+ __field(unsigned short, css_id)
+ ),
+
+ TP_fast_assign(
+ __entry->css_id = css_id;
+ ),
+
+ TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
+)
+
#endif /* _TRACE_MEMCONTROL_H */

/* This part must be outside protection */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a5b1794..17cb888 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
}
}

+/*
+ * This routine must be called by processes which are generating dirty pages.
+ * It considers the dirty page usage and thresholds of the current cgroup and
+ * (if hierarchical accounting is enabled) its ancestral memcg. If any of
+ * the considered memcg are over their background dirty limit, then background
+ * writeback is queued. If any are over the foreground dirty limit, then the
+ * dirtying task is throttled while dirty data is written. The per-memcg dirty
+ * limits checked by this routine are distinct from the per-system,
+ * per-bdi, and per-task limits considered by balance_dirty_pages().
+ */
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct mem_cgroup *mem;
+ struct mem_cgroup *ref_mem;
+ struct dirty_info info;
+ unsigned long nr_reclaimable;
+ unsigned long sys_available_mem;
+ unsigned long pause = 1;
+ unsigned short id;
+ bool over;
+ bool shared_inodes;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ sys_available_mem = determine_dirtyable_memory();
+
+ /* reference the memcg so it is not deleted during this routine */
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(current);
+ if (mem && mem_cgroup_is_root(mem))
+ mem = NULL;
+ if (mem)
+ css_get(&mem->css);
+ rcu_read_unlock();
+ ref_mem = mem;
+
+ /* balance entire ancestry of current's mem. */
+ for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
+ id = css_id(&mem->css);
+
+ /*
+ * keep throttling and writing inode data so long as mem is over
+ * its dirty limit.
+ */
+ for (shared_inodes = false; ; ) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = NULL,
+ .range_cyclic = 1,
+ .for_cgroup = 1,
+ .nr_to_write = write_chunk,
+ .shared_inodes = shared_inodes,
+ };
+
+ /*
+ * if mem is under dirty limit, then break from
+ * throttling loop.
+ */
+ mem_cgroup_dirty_info(sys_available_mem, mem, &info);
+ nr_reclaimable = dirty_info_reclaimable(&info);
+ over = nr_reclaimable > info.dirty_thresh;
+ trace_mem_cgroup_consider_fg_writeback(
+ id, bdi, nr_reclaimable, info.dirty_thresh,
+ over);
+ if (!over)
+ break;
+
+ mem_cgroup_mark_over_bg_thresh(mem);
+ writeback_inodes_wb(&bdi->wb, &wbc);
+ trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
+ /* if no progress, then consider shared inodes */
+ if ((wbc.nr_to_write == write_chunk) &&
+ !shared_inodes) {
+ trace_mem_cgroup_enable_shared_writeback(id);
+ shared_inodes = true;
+ }
+
+ /*
+ * Sleep up to 100ms to throttle writer and wait for
+ * queued background I/O to complete.
+ */
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ io_schedule_timeout(pause);
+ pause <<= 1;
+ if (pause > HZ / 10)
+ pause = HZ / 10;
+ }
+
+ /* if mem is over background limit, then queue bg writeback */
+ over = nr_reclaimable >= info.background_thresh;
+ trace_mem_cgroup_consider_bg_writeback(
+ id, bdi, nr_reclaimable, info.background_thresh,
+ over);
+ if (over)
+ mem_cgroup_queue_bg_writeback(mem, bdi);
+ }
+
+ if (ref_mem)
+ css_put(&ref_mem->css);
+}
+
+/*
+ * Return the dirty thresholds and usage of the memcg (within the ancestral
+ * chain of @mem) closest to its dirty limit, or of the first memcg found to
+ * be over its limit.
+ *
+ * The check is not stable because the usage and limits can change
+ * asynchronously to this routine.
+ */
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *mem,
+ struct dirty_info *info)
+{
+ unsigned long usage;
+ struct dirty_info uninitialized_var(cur_info);
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ info->nr_writeback = ULONG_MAX; /* invalid initial value */
+
+ /* walk up hierarchy enabled parents */
+ for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
+ mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
+ usage = dirty_info_reclaimable(&cur_info) +
+ cur_info.nr_writeback;
+
+ /* if over limit, stop searching */
+ if (usage >= cur_info.dirty_thresh) {
+ *info = cur_info;
+ break;
+ }
+
+ /*
+ * Save dirty usage of mem closest to its limit if either:
+ * - mem is the first mem considered
+ * - mem dirty margin is smaller than last recorded one
+ */
+ if ((info->nr_writeback == ULONG_MAX) ||
+ (cur_info.dirty_thresh - usage) <
+ (info->dirty_thresh -
+ (dirty_info_reclaimable(info) + info->nr_writeback)))
+ *info = cur_info;
+ }
+
+ return info->nr_writeback != ULONG_MAX;
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-06-03 16:15:32

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v8 11/12] writeback: make background writeback cgroup aware

When the system is under its background dirty memory threshold but a cgroup
is over its background dirty memory threshold, then only write back
inodes associated with the over-limit cgroup(s).

In addition to checking if the system dirty memory usage is over the
system background threshold, over_bground_thresh() also checks if any
cgroups are over their respective background dirty memory thresholds.
The writeback_control.for_cgroup field is set to distinguish between a
system and memcg overage.

If performing cgroup writeback, move_expired_inodes() skips inodes that
do not contribute dirty pages to the cgroup being written back.

After writing some pages, wb_writeback() will call
mem_cgroup_writeback_done() to update the set of over-background-limit memcgs.

Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v7:
- over_bground_thresh() now sets shared_inodes=1. In -v7 per memcg
background writeback did not, so it did not write pages of shared
inodes in background writeback. In the (potentially common) case
where the system dirty memory usage is below the system background
dirty threshold but at least one cgroup is over its background dirty
limit, then per memcg background writeback is queued for any
over-background-threshold cgroups. Background writeback should be
allowed to write back shared inodes. The hope is that writing such
inodes has a good chance of cleaning them so they can transition
from shared to non-shared. Such a transition is good because then the
inode will remain unshared until it is written by multiple cgroups.
Non-shared inodes offer better isolation.

fs/fs-writeback.c | 32 ++++++++++++++++++++++++--------
1 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0174fcf..c0bfe62 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -256,14 +256,17 @@ static void move_expired_inodes(struct list_head *delaying_queue,
LIST_HEAD(tmp);
struct list_head *pos, *node;
struct super_block *sb = NULL;
- struct inode *inode;
+ struct inode *inode, *tmp_inode;
int do_sb_sort = 0;

- while (!list_empty(delaying_queue)) {
- inode = wb_inode(delaying_queue->prev);
+ list_for_each_entry_safe_reverse(inode, tmp_inode, delaying_queue,
+ i_wb_list) {
if (wbc->older_than_this &&
inode_dirtied_after(inode, *wbc->older_than_this))
break;
+ if (wbc->for_cgroup &&
+ !should_writeback_mem_cgroup_inode(inode, wbc))
+ continue;
if (sb && sb != inode->i_sb)
do_sb_sort = 1;
sb = inode->i_sb;
@@ -614,14 +617,22 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
*/
#define MAX_WRITEBACK_PAGES 1024

-static inline bool over_bground_thresh(void)
+static inline bool over_bground_thresh(struct bdi_writeback *wb,
+ struct writeback_control *wbc)
{
unsigned long background_thresh, dirty_thresh;

global_dirty_limits(&background_thresh, &dirty_thresh);

- return (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+ if (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
+ wbc->for_cgroup = 0;
+ return true;
+ }
+
+ wbc->for_cgroup = 1;
+ wbc->shared_inodes = 1;
+ return mem_cgroups_over_bground_dirty_thresh();
}

/*
@@ -700,7 +711,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh())
+ if (work->for_background && !over_bground_thresh(wb, &wbc))
break;

if (work->for_kupdate || work->for_background) {
@@ -729,6 +740,9 @@ retry:
work->nr_pages -= write_chunk - wbc.nr_to_write;
wrote += write_chunk - wbc.nr_to_write;

+ if (write_chunk - wbc.nr_to_write > 0)
+ mem_cgroup_writeback_done();
+
/*
* Did we write something? Try for more
*
@@ -809,7 +823,9 @@ static unsigned long get_nr_dirty_pages(void)

static long wb_check_background_flush(struct bdi_writeback *wb)
{
- if (over_bground_thresh()) {
+ struct writeback_control wbc;
+
+ if (over_bground_thresh(wb, &wbc)) {

struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
--
1.7.3.1

2011-06-03 22:46:50

by Hiroyuki Kamezawa

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] memcg: per cgroup dirty page accounting

2011/6/4 Greg Thelen <[email protected]>:
> This patch series provides the ability for each cgroup to have independent dirty
> page usage limits. ?Limiting dirty memory fixes the max amount of dirty (hard to
> reclaim) page cache used by a cgroup. ?This allows for better per cgroup memory
> isolation and fewer ooms within a single cgroup.
>
> Having per cgroup dirty memory limits is not very interesting unless writeback
> is cgroup aware. ?There is not much isolation if cgroups have to writeback data
> from other cgroups to get below their dirty memory threshold.
>
> Per-memcg dirty limits are provided to support isolation and thus cross cgroup
> inode sharing is not a priority. ?This allows the code be simpler.
>
> To add cgroup awareness to writeback, this series adds a memcg field to the
> inode to allow writeback to isolate inodes for a particular cgroup. ?When an
> inode is marked dirty, i_memcg is set to the current cgroup. ?When inode pages
> are marked dirty the i_memcg field compared against the page's cgroup. ?If they
> differ, then the inode is marked as shared by setting i_memcg to a special
> shared value (zero).
>
> Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
> way to assoicate inodes with a cgroup without having to add a field to struct
> inode. ?I prototyped this approach but found that it involved more complex
> writeback changes and had at least one major shortcoming: detection of when an
> inode becomes shared by multiple cgroups. ?While such sharing is not expected to
> be common, the system should gracefully handle it.
>
> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
> dirty usage vs dirty thresholds for the current cgroup and its parents. ?If any
> over-limit cgroups are found, they are marked in a global over-limit bitmap
> (indexed by cgroup id) and the bdi flusher is awoke.
>
> The bdi flusher uses wb_check_background_flush() to check for any memcg over
> their dirty limit. ?When performing per-memcg background writeback,
> move_expired_inodes() walks per bdi b_dirty list using each inode's i_memcg and
> the global over-limit memcg bitmap to determine if the inode should be written.
>
> If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
> threshold writing per-memcg inodes, then downshifts to also writing shared
> inodes (i_memcg=0).
>
> I know that there is some significant writeback changes associated with the
> IO-less balance_dirty_pages() effort. ?I am not trying to derail that, so this
> patch series is merely an RFC to get feedback on the design. ?There are probably
> some subtle races in these patches. ?I have done moderate functional testing of
> the newly proposed features.
>

Thank you... hmm, is this set really "merely an RFC"? I'd like to merge
this functionality before the other big-hammer work, because it makes
the behavior of memcg much better.

I'd like to review and test this set (but maybe I can't do much over the
weekend...)

Anyway, thank you.
-Kame


> Here is an example of the memcg-oom that is avoided with this patch series:
> ? ? ? ?# mkdir /dev/cgroup/memory/x
> ? ? ? ?# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
> ? ? ? ?# echo $$ > /dev/cgroup/memory/x/tasks
> ? ? ? ?# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
> ? ? ? ?# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
> ? ? ? ?# wait
> ? ? ? ?[1]- ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
> ? ? ? ?[2]+ ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
>
> Known limitations:
> ? ? ? ?If a dirty limit is lowered a cgroup may be over its limit.
>
> Changes since -v7:
> - Merged -v7 09/14 'cgroup: move CSS_ID_MAX to cgroup.h' into
> ?-v8 09/13 'memcg: create support routines for writeback'
>
> - Merged -v7 08/14 'writeback: add memcg fields to writeback_control'
> ?into -v8 09/13 'memcg: create support routines for writeback' and
> ?-v8 10/13 'memcg: create support routines for page-writeback'. ?This
> ?moves the declaration of new fields with the first usage of the
> ?respective fields.
>
> - mem_cgroup_writeback_done() now clears corresponding bit for cgroup that
> ?cannot be referenced. ?Such a bit would represent a cgroup previously over
> ?dirty limit, but that has been deleted before writeback cleaned all pages. ?By
> ?clearing bit, writeback will not continually try to writeback the deleted
> ?cgroup.
>
> - Previously mem_cgroup_writeback_done() would only finish writeback when the
> ?cgroup's dirty memory usage dropped below the dirty limit. ?This was the wrong
> ?limit to check. ?This now correctly checks usage against the background dirty
> ?limit.
>
> - over_bground_thresh() now sets shared_inodes=1. ?In -v7 per memcg
> ?background writeback did not, so it did not write pages of shared
> ?inodes in background writeback. ?In the (potentially common) case
> ?where the system dirty memory usage is below the system background
> ?dirty threshold but at least one cgroup is over its background dirty
> ?limit, then per memcg background writeback is queued for any
> ?over-background-threshold cgroups. ?Background writeback should be
> ?allowed to writeback shared inodes. ?The hope is that writing such
> ?inodes has good chance of cleaning the inodes so they can transition
> ?from shared to non-shared. ?Such a transition is good because then the
> ?inode will remain unshared until it is written by multiple cgroup.
> ?Non-shared inodes offer better isolation.
>
> Single patch that can be applied to mmotm-2011-05-12-15-52:
> ?http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v8-on-mmotm-2011-05-12-15-52.patch
>
> Patches are based on mmotm-2011-05-12-15-52.
>
> Greg Thelen (12):
> ?memcg: document cgroup dirty memory interfaces
> ?memcg: add page_cgroup flags for dirty page tracking
> ?memcg: add mem_cgroup_mark_inode_dirty()
> ?memcg: add dirty page accounting infrastructure
> ?memcg: add kernel calls for memcg dirty page stats
> ?memcg: add dirty limits to mem_cgroup
> ?memcg: add cgroupfs interface to memcg dirty limits
> ?memcg: dirty page accounting support routines
> ?memcg: create support routines for writeback
> ?memcg: create support routines for page-writeback
> ?writeback: make background writeback cgroup aware
> ?memcg: check memcg dirty limits in page writeback
>
> ?Documentation/cgroups/memory.txt ?| ? 70 ++++
> ?fs/fs-writeback.c ? ? ? ? ? ? ? ? | ? 34 ++-
> ?fs/inode.c ? ? ? ? ? ? ? ? ? ? ? ?| ? ?3 +
> ?fs/nfs/write.c ? ? ? ? ? ? ? ? ? ?| ? ?4 +
> ?include/linux/cgroup.h ? ? ? ? ? ?| ? ?1 +
> ?include/linux/fs.h ? ? ? ? ? ? ? ?| ? ?9 +
> ?include/linux/memcontrol.h ? ? ? ?| ? 63 ++++-
> ?include/linux/page_cgroup.h ? ? ? | ? 23 ++
> ?include/linux/writeback.h ? ? ? ? | ? ?5 +-
> ?include/trace/events/memcontrol.h | ?198 +++++++++++
> ?kernel/cgroup.c ? ? ? ? ? ? ? ? ? | ? ?1 -
> ?mm/filemap.c ? ? ? ? ? ? ? ? ? ? ?| ? ?1 +
> ?mm/memcontrol.c ? ? ? ? ? ? ? ? ? | ?708 ++++++++++++++++++++++++++++++++++++-
> ?mm/page-writeback.c ? ? ? ? ? ? ? | ? 42 ++-
> ?mm/truncate.c ? ? ? ? ? ? ? ? ? ? | ? ?1 +
> ?mm/vmscan.c ? ? ? ? ? ? ? ? ? ? ? | ? ?2 +-
> ?16 files changed, 1138 insertions(+), 27 deletions(-)
> ?create mode 100644 include/trace/events/memcontrol.h
>
> --
> 1.7.3.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

2011-06-03 22:51:11

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 00/12] memcg: per cgroup dirty page accounting

On Fri, Jun 3, 2011 at 3:46 PM, Hiroyuki Kamezawa
<[email protected]> wrote:
> 2011/6/4 Greg Thelen <[email protected]>:
>> This patch series provides the ability for each cgroup to have independent dirty
>> page usage limits. ?Limiting dirty memory fixes the max amount of dirty (hard to
>> reclaim) page cache used by a cgroup. ?This allows for better per cgroup memory
>> isolation and fewer ooms within a single cgroup.
>>
>> Having per cgroup dirty memory limits is not very interesting unless writeback
>> is cgroup aware. ?There is not much isolation if cgroups have to writeback data
>> from other cgroups to get below their dirty memory threshold.
>>
>> Per-memcg dirty limits are provided to support isolation and thus cross cgroup
>> inode sharing is not a priority. ?This allows the code be simpler.
>>
>> To add cgroup awareness to writeback, this series adds a memcg field to the
>> inode to allow writeback to isolate inodes for a particular cgroup. ?When an
>> inode is marked dirty, i_memcg is set to the current cgroup. ?When inode pages
>> are marked dirty the i_memcg field compared against the page's cgroup. ?If they
>> differ, then the inode is marked as shared by setting i_memcg to a special
>> shared value (zero).
>>
>> Previous discussions suggested that a per-bdi per-memcg b_dirty list was a good
>> way to assoicate inodes with a cgroup without having to add a field to struct
>> inode. ?I prototyped this approach but found that it involved more complex
>> writeback changes and had at least one major shortcoming: detection of when an
>> inode becomes shared by multiple cgroups. ?While such sharing is not expected to
>> be common, the system should gracefully handle it.
>>
>> balance_dirty_pages() calls mem_cgroup_balance_dirty_pages(), which checks the
>> dirty usage vs dirty thresholds for the current cgroup and its parents. ?If any
>> over-limit cgroups are found, they are marked in a global over-limit bitmap
>> (indexed by cgroup id) and the bdi flusher is awoke.
>>
>> The bdi flusher uses wb_check_background_flush() to check for any memcg over
>> their dirty limit. ?When performing per-memcg background writeback,
>> move_expired_inodes() walks per bdi b_dirty list using each inode's i_memcg and
>> the global over-limit memcg bitmap to determine if the inode should be written.
>>
>> If mem_cgroup_balance_dirty_pages() is unable to get below the dirty page
>> threshold writing per-memcg inodes, then downshifts to also writing shared
>> inodes (i_memcg=0).
>>
>> I know that there is some significant writeback changes associated with the
>> IO-less balance_dirty_pages() effort. ?I am not trying to derail that, so this
>> patch series is merely an RFC to get feedback on the design. ?There are probably
>> some subtle races in these patches. ?I have done moderate functional testing of
>> the newly proposed features.
>>
>
> Thank you...hmm, is this set really "merely RFC ?". I'd like to merge
> this function
> before other new big hammer works because this makes behavior of memcg
> much better.

Oops. I meant to remove the above RFC paragraph. This -v8 patch
series is intended for merging into mmotm.

> I'd like to review and test this set (but maybe I can't do much in the
> weekend...)

Thank you.

> Anyway, thank you.
> -Kame



>> Here is an example of the memcg-oom that is avoided with this patch series:
>> ? ? ? ?# mkdir /dev/cgroup/memory/x
>> ? ? ? ?# echo 100M > /dev/cgroup/memory/x/memory.limit_in_bytes
>> ? ? ? ?# echo $$ > /dev/cgroup/memory/x/tasks
>> ? ? ? ?# dd if=/dev/zero of=/data/f1 bs=1k count=1M &
>> ? ? ? ?# dd if=/dev/zero of=/data/f2 bs=1k count=1M &
>> ? ? ? ?# wait
>> ? ? ? ?[1]- ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
>> ? ? ? ?[2]+ ?Killed ? ? ? ? ? ? ? ? ?dd if=/dev/zero of=/data/f1 bs=1M count=1k
>>
>> Known limitations:
>> ? ? ? ?If a dirty limit is lowered a cgroup may be over its limit.
>>
>> Changes since -v7:
>> - Merged -v7 09/14 'cgroup: move CSS_ID_MAX to cgroup.h' into
>> ?-v8 09/13 'memcg: create support routines for writeback'
>>
>> - Merged -v7 08/14 'writeback: add memcg fields to writeback_control'
>> ?into -v8 09/13 'memcg: create support routines for writeback' and
>> ?-v8 10/13 'memcg: create support routines for page-writeback'. ?This
>> ?moves the declaration of new fields with the first usage of the
>> ?respective fields.
>>
>> - mem_cgroup_writeback_done() now clears corresponding bit for cgroup that
>> ?cannot be referenced. ?Such a bit would represent a cgroup previously over
>> ?dirty limit, but that has been deleted before writeback cleaned all pages. ?By
>> ?clearing bit, writeback will not continually try to writeback the deleted
>> ?cgroup.
>>
>> - Previously mem_cgroup_writeback_done() would only finish writeback when the
>> ?cgroup's dirty memory usage dropped below the dirty limit. ?This was the wrong
>> ?limit to check. ?This now correctly checks usage against the background dirty
>> ?limit.
>>
>> - over_bground_thresh() now sets shared_inodes=1. ?In -v7 per memcg
>> ?background writeback did not, so it did not write pages of shared
>> ?inodes in background writeback. ?In the (potentially common) case
>> ?where the system dirty memory usage is below the system background
>> ?dirty threshold but at least one cgroup is over its background dirty
>> ?limit, then per memcg background writeback is queued for any
>> ?over-background-threshold cgroups. ?Background writeback should be
>> ?allowed to writeback shared inodes. ?The hope is that writing such
>> ?inodes has good chance of cleaning the inodes so they can transition
>> ?from shared to non-shared. ?Such a transition is good because then the
>> ?inode will remain unshared until it is written by multiple cgroup.
>> ?Non-shared inodes offer better isolation.
>>
>> Single patch that can be applied to mmotm-2011-05-12-15-52:
>> ?http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v8-on-mmotm-2011-05-12-15-52.patch
>>
>> Patches are based on mmotm-2011-05-12-15-52.
>>
>> Greg Thelen (12):
>> ?memcg: document cgroup dirty memory interfaces
>> ?memcg: add page_cgroup flags for dirty page tracking
>> ?memcg: add mem_cgroup_mark_inode_dirty()
>> ?memcg: add dirty page accounting infrastructure
>> ?memcg: add kernel calls for memcg dirty page stats
>> ?memcg: add dirty limits to mem_cgroup
>> ?memcg: add cgroupfs interface to memcg dirty limits
>> ?memcg: dirty page accounting support routines
>> ?memcg: create support routines for writeback
>> ?memcg: create support routines for page-writeback
>> ?writeback: make background writeback cgroup aware
>> ?memcg: check memcg dirty limits in page writeback
>>
>> ?Documentation/cgroups/memory.txt ?| ? 70 ++++
>> ?fs/fs-writeback.c ? ? ? ? ? ? ? ? | ? 34 ++-
>> ?fs/inode.c ? ? ? ? ? ? ? ? ? ? ? ?| ? ?3 +
>> ?fs/nfs/write.c ? ? ? ? ? ? ? ? ? ?| ? ?4 +
>> ?include/linux/cgroup.h ? ? ? ? ? ?| ? ?1 +
>> ?include/linux/fs.h ? ? ? ? ? ? ? ?| ? ?9 +
>> ?include/linux/memcontrol.h ? ? ? ?| ? 63 ++++-
>> ?include/linux/page_cgroup.h ? ? ? | ? 23 ++
>> ?include/linux/writeback.h ? ? ? ? | ? ?5 +-
>> ?include/trace/events/memcontrol.h | ?198 +++++++++++
>> ?kernel/cgroup.c ? ? ? ? ? ? ? ? ? | ? ?1 -
>> ?mm/filemap.c ? ? ? ? ? ? ? ? ? ? ?| ? ?1 +
>> ?mm/memcontrol.c ? ? ? ? ? ? ? ? ? | ?708 ++++++++++++++++++++++++++++++++++++-
>> ?mm/page-writeback.c ? ? ? ? ? ? ? | ? 42 ++-
>> ?mm/truncate.c ? ? ? ? ? ? ? ? ? ? | ? ?1 +
>> ?mm/vmscan.c ? ? ? ? ? ? ? ? ? ? ? | ? ?2 +-
>> ?16 files changed, 1138 insertions(+), 27 deletions(-)
>> ?create mode 100644 include/trace/events/memcontrol.h
>>
>> --
>> 1.7.3.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at ?http://www.tux.org/lkml/
>>
>

2011-06-03 23:18:46

by Andrea Righi

[permalink] [raw]
Subject: Re: [PATCH v8 03/12] memcg: add mem_cgroup_mark_inode_dirty()

On Fri, Jun 03, 2011 at 09:12:09AM -0700, Greg Thelen wrote:
...
> diff --git a/fs/inode.c b/fs/inode.c
> index ce61a1b..9ecb0bb 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -228,6 +228,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
> mapping->assoc_mapping = NULL;
> mapping->backing_dev_info = &default_backing_dev_info;
> mapping->writeback_index = 0;
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + mapping->i_memcg = 0;
> +#endif

It won't change too much, since it's always 0, but shouldn't we use
I_MEMCG_SHARED by default?

>
> /*
> * If the block_device provides a backing_dev_info for client
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 29c02f6..deabca3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -645,6 +645,9 @@ struct address_space {
> spinlock_t private_lock; /* for use by the address_space */
> struct list_head private_list; /* ditto */
> struct address_space *assoc_mapping; /* ditto */
> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
> + unsigned short i_memcg; /* css_id of memcg dirtier */
> +#endif
> } __attribute__((aligned(sizeof(long))));
> /*
> * On most architectures that alignment is already the case; but
> @@ -652,6 +655,12 @@ struct address_space {
> * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
> */
>
> +/*
> + * When an address_space is shared by multiple memcg dirtieres, then i_memcg is
> + * set to this special, wildcard, css_id value (zero).
> + */
> +#define I_MEMCG_SHARED 0
> +
> struct block_device {
> dev_t bd_dev; /* not a kdev_t - it's a search key */
> int bd_openers;
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 77e47f5..14b6d67 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -103,6 +103,8 @@ mem_cgroup_prepare_migration(struct page *page,
> extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
> struct page *oldpage, struct page *newpage, bool migration_ok);
>
> +void mem_cgroup_mark_inode_dirty(struct inode *inode);
> +
> /*
> * For memory reclaim.
> */
> @@ -273,6 +275,10 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
> {
> }
>
> +static inline void mem_cgroup_mark_inode_dirty(struct inode *inode)
> +{
> +}
> +
> static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
> {
> return 0;
> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> new file mode 100644
> index 0000000..781ef9fc
> --- /dev/null
> +++ b/include/trace/events/memcontrol.h
> @@ -0,0 +1,32 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM memcontrol
> +
> +#if !defined(_TRACE_MEMCONTROL_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MEMCONTROL_H
> +
> +#include <linux/types.h>
> +#include <linux/tracepoint.h>
> +
> +TRACE_EVENT(mem_cgroup_mark_inode_dirty,
> + TP_PROTO(struct inode *inode),
> +
> + TP_ARGS(inode),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, ino)
> + __field(unsigned short, css_id)
> + ),
> +
> + TP_fast_assign(
> + __entry->ino = inode->i_ino;
> + __entry->css_id =
> + inode->i_mapping ? inode->i_mapping->i_memcg : 0;
> + ),
> +
> + TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
> +)
> +
> +#endif /* _TRACE_MEMCONTROL_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bf642b5..e83ef74 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -54,6 +54,9 @@
>
> #include <trace/events/vmscan.h>
>
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/memcontrol.h>
> +
> struct cgroup_subsys mem_cgroup_subsys __read_mostly;
> #define MEM_CGROUP_RECLAIM_RETRIES 5
> struct mem_cgroup *root_mem_cgroup __read_mostly;
> @@ -1122,6 +1125,27 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
> return inactive_ratio;
> }
>
> +/*
> + * Mark the current task's memcg as the memcg associated with inode. Note: the
> + * recorded cgroup css_id is not guaranteed to remain correct. The current task
> + * may be moved to another cgroup. The memcg may also be deleted before the
> + * caller has time to use the i_memcg.
> + */
> +void mem_cgroup_mark_inode_dirty(struct inode *inode)
> +{
> + struct mem_cgroup *mem;
> + unsigned short id;
> +
> + rcu_read_lock();
> + mem = mem_cgroup_from_task(current);
> + id = mem ? css_id(&mem->css) : 0;

Ditto.

> + rcu_read_unlock();
> +
> + inode->i_mapping->i_memcg = id;
> +
> + trace_mem_cgroup_mark_inode_dirty(inode);
> +}
> +
> int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> {
> unsigned long active;
> --
> 1.7.3.1

Thanks,
-Andrea

2011-06-03 23:46:22

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 03/12] memcg: add mem_cgroup_mark_inode_dirty()

On Fri, Jun 3, 2011 at 4:09 PM, Andrea Righi <[email protected]> wrote:
> On Fri, Jun 03, 2011 at 09:12:09AM -0700, Greg Thelen wrote:
> ...
>> diff --git a/fs/inode.c b/fs/inode.c
>> index ce61a1b..9ecb0bb 100644
>> --- a/fs/inode.c
>> +++ b/fs/inode.c
>> @@ -228,6 +228,9 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
>> ? ? ? mapping->assoc_mapping = NULL;
>> ? ? ? mapping->backing_dev_info = &default_backing_dev_info;
>> ? ? ? mapping->writeback_index = 0;
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> + ? ? mapping->i_memcg = 0;
>> +#endif
>
> It won't change too much, since it's always 0, but shouldn't we use
> I_MEMCG_SHARED by default?

I agree, I_MEMCG_SHARED should be used instead of 0. Will include in
-v9, if there is a need for -v9.
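
For concreteness, the -v9 change agreed above would presumably amount to
something like the following (hypothetical delta, not posted code):

	/* fs/inode.c, inode_init_always() */
	mapping->i_memcg = I_MEMCG_SHARED;

	/* mm/memcontrol.c, mem_cgroup_mark_inode_dirty() */
	id = mem ? css_id(&mem->css) : I_MEMCG_SHARED;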

>> ? ? ? /*
>> ? ? ? ?* If the block_device provides a backing_dev_info for client
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 29c02f6..deabca3 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -645,6 +645,9 @@ struct address_space {
>> ? ? ? spinlock_t ? ? ? ? ? ? ?private_lock; ? /* for use by the address_space */
>> ? ? ? struct list_head ? ? ? ?private_list; ? /* ditto */
>> ? ? ? struct address_space ? ?*assoc_mapping; /* ditto */
>> +#ifdef CONFIG_CGROUP_MEM_RES_CTLR
>> + ? ? unsigned short ? ? ? ? ?i_memcg; ? ? ? ?/* css_id of memcg dirtier */
>> +#endif
>> ?} __attribute__((aligned(sizeof(long))));
>> ? ? ? /*
>> ? ? ? ?* On most architectures that alignment is already the case; but
>> @@ -652,6 +655,12 @@ struct address_space {
>> ? ? ? ?* of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
>> ? ? ? ?*/
>>
>> +/*
>> + * When an address_space is shared by multiple memcg dirtieres, then i_memcg is
>> + * set to this special, wildcard, css_id value (zero).
>> + */
>> +#define I_MEMCG_SHARED 0
>> +
>> ?struct block_device {
>> ? ? ? dev_t ? ? ? ? ? ? ? ? ? bd_dev; ?/* not a kdev_t - it's a search key */
>> ? ? ? int ? ? ? ? ? ? ? ? ? ? bd_openers;
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 77e47f5..14b6d67 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -103,6 +103,8 @@ mem_cgroup_prepare_migration(struct page *page,
>> ?extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
>> ? ? ? struct page *oldpage, struct page *newpage, bool migration_ok);
>>
>> +void mem_cgroup_mark_inode_dirty(struct inode *inode);
>> +
>> ?/*
>> ? * For memory reclaim.
>> ? */
>> @@ -273,6 +275,10 @@ static inline void mem_cgroup_end_migration(struct mem_cgroup *mem,
>> ?{
>> ?}
>>
>> +static inline void mem_cgroup_mark_inode_dirty(struct inode *inode)
>> +{
>> +}
>> +
>> ?static inline int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem)
>> ?{
>> ? ? ? return 0;
>> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> new file mode 100644
>> index 0000000..781ef9fc
>> --- /dev/null
>> +++ b/include/trace/events/memcontrol.h
>> @@ -0,0 +1,32 @@
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM memcontrol
>> +
>> +#if !defined(_TRACE_MEMCONTROL_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_MEMCONTROL_H
>> +
>> +#include <linux/types.h>
>> +#include <linux/tracepoint.h>
>> +
>> +TRACE_EVENT(mem_cgroup_mark_inode_dirty,
>> + ? ? TP_PROTO(struct inode *inode),
>> +
>> + ? ? TP_ARGS(inode),
>> +
>> + ? ? TP_STRUCT__entry(
>> + ? ? ? ? ? ? __field(unsigned long, ino)
>> + ? ? ? ? ? ? __field(unsigned short, css_id)
>> + ? ? ? ? ? ? ),
>> +
>> + ? ? TP_fast_assign(
>> + ? ? ? ? ? ? __entry->ino = inode->i_ino;
>> + ? ? ? ? ? ? __entry->css_id =
>> + ? ? ? ? ? ? ? ? ? ? inode->i_mapping ? inode->i_mapping->i_memcg : 0;
>> + ? ? ? ? ? ? ),
>> +
>> + ? ? TP_printk("ino=%ld css_id=%d", __entry->ino, __entry->css_id)
>> +)
>> +
>> +#endif /* _TRACE_MEMCONTROL_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index bf642b5..e83ef74 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -54,6 +54,9 @@
>>
>> ?#include <trace/events/vmscan.h>
>>
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/memcontrol.h>
>> +
>> ?struct cgroup_subsys mem_cgroup_subsys __read_mostly;
>> ?#define MEM_CGROUP_RECLAIM_RETRIES ? 5
>> ?struct mem_cgroup *root_mem_cgroup __read_mostly;
>> @@ -1122,6 +1125,27 @@ static int calc_inactive_ratio(struct mem_cgroup *memcg, unsigned long *present_
>> ? ? ? return inactive_ratio;
>> ?}
>>
>> +/*
>> + * Mark the current task's memcg as the memcg associated with inode. ?Note: the
>> + * recorded cgroup css_id is not guaranteed to remain correct. ?The current task
>> + * may be moved to another cgroup. ?The memcg may also be deleted before the
>> + * caller has time to use the i_memcg.
>> + */
>> +void mem_cgroup_mark_inode_dirty(struct inode *inode)
>> +{
>> + ? ? struct mem_cgroup *mem;
>> + ? ? unsigned short id;
>> +
>> + ? ? rcu_read_lock();
>> + ? ? mem = mem_cgroup_from_task(current);
>> + ? ? id = mem ? css_id(&mem->css) : 0;
>
> Ditto.

I agree, I_MEMCG_SHARED should be used instead of 0.

>> + ? ? rcu_read_unlock();
>> +
>> + ? ? inode->i_mapping->i_memcg = id;
>> +
>> + ? ? trace_mem_cgroup_mark_inode_dirty(inode);
>> +}
>> +
>> ?int mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
>> ?{
>> ? ? ? unsigned long active;
>> --
>> 1.7.3.1
>
> Thanks,
> -Andrea
>

2011-06-04 09:54:32

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 01/12] memcg: document cgroup dirty memory interfaces

On Fri, Jun 03, 2011 at 09:12:07AM -0700, Greg Thelen wrote:
> Document cgroup dirty memory interfaces and statistics.
>
> The implementation for these new interfaces routines comes in a series
> of following patches.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Balbir Singh <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-04 09:56:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 02/12] memcg: add page_cgroup flags for dirty page tracking

On Fri, Jun 03, 2011 at 09:12:08AM -0700, Greg Thelen wrote:
> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-04 10:11:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 04/12] memcg: add dirty page accounting infrastructure

On Fri, Jun 03, 2011 at 09:12:10AM -0700, Greg Thelen wrote:
> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages. A
> later change adds kernel calls to these new routines.
>
> As inode pages are marked dirty, if the dirtied page's cgroup differs
> from the inode's cgroup, then mark the inode shared across several
> cgroup.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-04 15:42:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 05/12] memcg: add kernel calls for memcg dirty page stats

On Fri, Jun 03, 2011 at 09:12:11AM -0700, Greg Thelen wrote:
> Add calls into memcg dirty page accounting. Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs. This
> allows the memory controller to maintain an accurate view of the amount
> of its memory that is dirty.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-04 15:57:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 06/12] memcg: add dirty limits to mem_cgroup

On Fri, Jun 03, 2011 at 09:12:12AM -0700, Greg Thelen wrote:
> Extend mem_cgroup to contain dirty page limits.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-04 16:04:52

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 07/12] memcg: add cgroupfs interface to memcg dirty limits

On Fri, Jun 03, 2011 at 09:12:13AM -0700, Greg Thelen wrote:
> Add cgroupfs interface to memcg dirty page limits:
> Direct write-out is controlled with:
> - memory.dirty_ratio
> - memory.dirty_limit_in_bytes
>
> Background write-out is controlled with:
> - memory.dirty_background_ratio
> - memory.dirty_background_limit_bytes
>
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts. This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_bytes.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Balbir Singh <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-05 02:46:54

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 09/12] memcg: create support routines for writeback

On Fri, Jun 03, 2011 at 09:12:15AM -0700, Greg Thelen wrote:
> Introduce memcg routines to assist in per-memcg writeback:
>
> - mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
> writeback because they are over their dirty memory threshold.
>
> - should_writeback_mem_cgroup_inode() determines if an inode is
> contributing pages to an over-limit memcg. A new writeback_control
> field determines if shared inodes should be written back.
>
> - mem_cgroup_writeback_done() is used periodically during writeback to
> update memcg writeback data.
>
> These routines make use of a new over_bground_dirty_thresh bitmap that
> indicates which mem_cgroup are over their respective dirty background
> threshold. As this bitmap is indexed by css_id, the largest possible
> css_id value is needed to create the bitmap. So move the definition of
> CSS_ID_MAX from cgroup.c to cgroup.h. This allows users of css_id() to
> know the largest possible css_id value. This knowledge can be used to
> build such per-cgroup bitmaps.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards
Minchan Kim

2011-06-05 03:12:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

Hi Greg,

On Fri, Jun 03, 2011 at 09:12:16AM -0700, Greg Thelen wrote:
> Introduce memcg routines to assist in per-memcg dirty page management:
>
> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
> dirty memory usage against memcg foreground and background thresholds.
> If an over-background-threshold memcg is found, then per-memcg
> background writeback is queued. Per-memcg writeback differs from
> classic, non-memcg, per bdi writeback by setting the new
> writeback_control.for_cgroup bit.
>
> If an over-foreground-threshold memcg is found, then foreground
> writeout occurs. When performing foreground writeout, first consider
> inodes exclusive to the memcg. If unable to make enough progress,
> then consider inodes shared between memcg. Such cross-memcg inode
> sharing likely to be rare in situations that use per-cgroup memory
> isolation. The approach tries to handle the common (non-shared)
> case well without punishing well behaved (non-sharing) cgroups.
> As a last resort writeback shared inodes.
>
> This routine is used by balance_dirty_pages() in a later change.
>
> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
> and limits of the memcg closest to (or over) its dirty limit. This
> will be used by throttle_vm_writeout() in a latter change.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---
> Changelog since v7:
> - Add more detail to commit description.
>
> - Declare the new writeback_control for_cgroup bit in this change, the
> first patch that uses the new field is first used. In -v7 the field
> was declared in a separate patch.
>
> include/linux/memcontrol.h | 18 +++++
> include/linux/writeback.h | 1 +
> include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
> mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 252 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3d72e09..0d0363e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
> struct writeback_control *wbc);
> bool mem_cgroups_over_bground_dirty_thresh(void);
> void mem_cgroup_writeback_done(void);
> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info);
> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk);
>
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask,
> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
> {
> }
>
> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk)
> +{
> +}
> +
> +static inline bool
> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info)
> +{
> + return false;
> +}
> +
> static inline
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask,
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 66ec339..4f5c0d2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -47,6 +47,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + unsigned for_cgroup:1; /* enable cgroup writeback */
> unsigned shared_inodes:1; /* write inodes spanning cgroups */
> };
>
> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> index 326a66b..b42dae1 100644
> --- a/include/trace/events/memcontrol.h
> +++ b/include/trace/events/memcontrol.h
> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
> __entry->first_id)
> )
>
> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
> + TP_PROTO(unsigned short css_id,
> + struct backing_dev_info *bdi,
> + unsigned long nr_reclaimable,
> + unsigned long thresh,
> + bool over_limit),
> +
> + TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
> +
> + TP_STRUCT__entry(
> + __field(unsigned short, css_id)
> + __field(struct backing_dev_info *, bdi)
> + __field(unsigned long, nr_reclaimable)
> + __field(unsigned long, thresh)
> + __field(bool, over_limit)
> + ),
> +
> + TP_fast_assign(
> + __entry->css_id = css_id;
> + __entry->bdi = bdi;
> + __entry->nr_reclaimable = nr_reclaimable;
> + __entry->thresh = thresh;
> + __entry->over_limit = over_limit;
> + ),
> +
> + TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
> + "over_limit=%d", __entry->css_id, __entry->bdi,
> + __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
> +)
> +
> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
> + TP_PROTO(unsigned short id, \
> + struct backing_dev_info *bdi, \
> + unsigned long nr_reclaimable, \
> + unsigned long thresh, \
> + bool over_limit), \
> + TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
> +)
> +
> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
> +
> +TRACE_EVENT(mem_cgroup_fg_writeback,
> + TP_PROTO(unsigned long write_chunk,
> + struct writeback_control *wbc),
> +
> + TP_ARGS(write_chunk, wbc),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, write_chunk)
> + __field(long, wbc_to_write)
> + __field(bool, shared_inodes)
> + ),
> +
> + TP_fast_assign(
> + __entry->write_chunk = write_chunk;
> + __entry->wbc_to_write = wbc->nr_to_write;
> + __entry->shared_inodes = wbc->shared_inodes;
> + ),
> +
> + TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
> + __entry->write_chunk,
> + __entry->wbc_to_write,
> + __entry->shared_inodes)
> +)
> +
> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
> + TP_PROTO(unsigned short css_id),
> +
> + TP_ARGS(css_id),
> +
> + TP_STRUCT__entry(
> + __field(unsigned short, css_id)
> + ),
> +
> + TP_fast_assign(
> + __entry->css_id = css_id;
> + ),
> +
> + TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
> +)
> +
> #endif /* _TRACE_MEMCONTROL_H */
>
> /* This part must be outside protection */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a5b1794..17cb888 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
> }
> }
>
> +/*
> + * This routine must be called by processes which are generating dirty pages.
> + * It considers the dirty pages usage and thresholds of the current cgroup and
> + * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
> + * the considered memcg are over their background dirty limit, then background
> + * writeback is queued. If any are over the foreground dirty limit then
> + * throttle the dirtying task while writing dirty data. The per-memcg dirty
> + * limits check by this routine are distinct from either the per-system,
> + * per-bdi, or per-task limits considered by balance_dirty_pages().
> + */
> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk)
> +{
> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct mem_cgroup *mem;
> + struct mem_cgroup *ref_mem;
> + struct dirty_info info;
> + unsigned long nr_reclaimable;
> + unsigned long sys_available_mem;
> + unsigned long pause = 1;
> + unsigned short id;
> + bool over;
> + bool shared_inodes;
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + sys_available_mem = determine_dirtyable_memory();
> +
> + /* reference the memcg so it is not deleted during this routine */
> + rcu_read_lock();
> + mem = mem_cgroup_from_task(current);
> + if (mem && mem_cgroup_is_root(mem))
> + mem = NULL;
> + if (mem)
> + css_get(&mem->css);
> + rcu_read_unlock();
> + ref_mem = mem;
> +
> + /* balance entire ancestry of current's mem. */
> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
> + id = css_id(&mem->css);
> +
> + /*
> + * keep throttling and writing inode data so long as mem is over
> + * its dirty limit.
> + */
> + for (shared_inodes = false; ; ) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .older_than_this = NULL,
> + .range_cyclic = 1,
> + .for_cgroup = 1,
> + .nr_to_write = write_chunk,
> + .shared_inodes = shared_inodes,
> + };
> +
> + /*
> + * if mem is under dirty limit, then break from
> + * throttling loop.
> + */
> + mem_cgroup_dirty_info(sys_available_mem, mem, &info);
> + nr_reclaimable = dirty_info_reclaimable(&info);
> + over = nr_reclaimable > info.dirty_thresh;
> + trace_mem_cgroup_consider_fg_writeback(
> + id, bdi, nr_reclaimable, info.dirty_thresh,
> + over);
> + if (!over)
> + break;
> +
> + mem_cgroup_mark_over_bg_thresh(mem);

We are over the foreground threshold here.
Then why do you mark the background threshold, too?


> + writeback_inodes_wb(&bdi->wb, &wbc);
> + trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
> + /* if no progress, then consider shared inodes */
> + if ((wbc.nr_to_write == write_chunk) &&
> + !shared_inodes) {
> + trace_mem_cgroup_enable_shared_writeback(id);
> + shared_inodes = true;

I am not sure this is really the right condition for punishing shared inodes.
We requested the wbc as async writeback. If the bdi was congested, isn't it
possible that we can't write anything in this turn?

If the shared inodes are on a different bdi, this would be effective, but what
if they are on the same bdi?

If you assume that the shared-inode case is rare in memcg configurations, and
that assumption is right, then I don't have a concern about it. But I am not
sure.

> + }
> +
> + /*
> + * Sleep up to 100ms to throttle writer and wait for
> + * queued background I/O to complete.
> + */
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + io_schedule_timeout(pause);
> + pause <<= 1;
> + if (pause > HZ / 10)
> + pause = HZ / 10;
> + }
> +
> + /* if mem is over background limit, then queue bg writeback */
> + over = nr_reclaimable >= info.background_thresh;
> + trace_mem_cgroup_consider_bg_writeback(
> + id, bdi, nr_reclaimable, info.background_thresh,
> + over);
> + if (over)
> + mem_cgroup_queue_bg_writeback(mem, bdi);
> + }
> +
> + if (ref_mem)
> + css_put(&ref_mem->css);
> +}
> +
> +/*
> + * Return the dirty thresholds and usage for the mem (within the ancestral chain
> + * of @mem) closest to its dirty limit or the first memcg over its limit.

This routine has a 'bool' return value. What does it mean? Of course, we can
guess it by looking at the code, but let's not make users suffer. Please
document the meaning of the return value, too.

> + *
> + * The check is not stable because the usage and limits can change asynchronously
> + * to this routine.
> + */
> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info)
> +{
> + unsigned long usage;
> + struct dirty_info uninitialized_var(cur_info);
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + info->nr_writeback = ULONG_MAX; /* invalid initial value */
> +
> + /* walk up hierarchy enabled parents */
> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
> + mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
> + usage = dirty_info_reclaimable(&cur_info) +
> + cur_info.nr_writeback;
> +
> + /* if over limit, stop searching */
> + if (usage >= cur_info.dirty_thresh) {
> + *info = cur_info;
> + break;
> + }
> +
> + /*
> + * Save dirty usage of mem closest to its limit if either:
> + * - mem is the first mem considered
> + * - mem dirty margin is smaller than last recorded one
> + */
> + if ((info->nr_writeback == ULONG_MAX) ||
> + (cur_info.dirty_thresh - usage) <
> + (info->dirty_thresh -
> + (dirty_info_reclaimable(info) + info->nr_writeback)))
> + *info = cur_info;
> + }
> +
> + return info->nr_writeback != ULONG_MAX;
> +}
> +
> static void mem_cgroup_start_move(struct mem_cgroup *mem)
> {
> int cpu;
> --
> 1.7.3.1
>

--
Kind regards
Minchan Kim

2011-06-05 04:11:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> When the system is under background dirty memory threshold but a cgroup
> is over its background dirty memory threshold, then only writeback
> inodes associated with the over-limit cgroup(s).
>
> In addition to checking if the system dirty memory usage is over the
> system background threshold, over_bground_thresh() also checks if any
> cgroups are over their respective background dirty memory thresholds.
> The writeback_control.for_cgroup field is set to distinguish between a
> system and memcg overage.
>
> If performing cgroup writeback, move_expired_inodes() skips inodes that
> do not contribute dirty pages to the cgroup being written back.
>
> After writing some pages, wb_writeback() will call
> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

There are some nitpicks below.

> ---
> Changelog since v7:
> - over_bground_thresh() now sets shared_inodes=1. In -v7 per memcg
> background writeback did not, so it did not write pages of shared
> inodes in background writeback. In the (potentially common) case
> where the system dirty memory usage is below the system background
> dirty threshold but at least one cgroup is over its background dirty
> limit, then per memcg background writeback is queued for any
> over-background-threshold cgroups. Background writeback should be
> allowed to writeback shared inodes. The hope is that writing such
> inodes has a good chance of cleaning the inodes so they can transition
> from shared to non-shared. Such a transition is good because then the
> inode will remain unshared until it is written by multiple cgroups.
> Non-shared inodes offer better isolation.

The above comment should be in the commit description.

>
> fs/fs-writeback.c | 32 ++++++++++++++++++++++++--------
> 1 files changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 0174fcf..c0bfe62 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -256,14 +256,17 @@ static void move_expired_inodes(struct list_head *delaying_queue,
> LIST_HEAD(tmp);
> struct list_head *pos, *node;
> struct super_block *sb = NULL;
> - struct inode *inode;
> + struct inode *inode, *tmp_inode;
> int do_sb_sort = 0;
>
> - while (!list_empty(delaying_queue)) {
> - inode = wb_inode(delaying_queue->prev);
> + list_for_each_entry_safe_reverse(inode, tmp_inode, delaying_queue,
> + i_wb_list) {
> if (wbc->older_than_this &&
> inode_dirtied_after(inode, *wbc->older_than_this))
> break;
> + if (wbc->for_cgroup &&
> + !should_writeback_mem_cgroup_inode(inode, wbc))
> + continue;
> if (sb && sb != inode->i_sb)
> do_sb_sort = 1;
> sb = inode->i_sb;
> @@ -614,14 +617,22 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
> */
> #define MAX_WRITEBACK_PAGES 1024
>
> -static inline bool over_bground_thresh(void)
> +static inline bool over_bground_thresh(struct bdi_writeback *wb,

At present, wb isn't used.
Did you keep it intentionally for future use?

> + struct writeback_control *wbc)
> {
> unsigned long background_thresh, dirty_thresh;
>
> global_dirty_limits(&background_thresh, &dirty_thresh);
>
> - return (global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> + if (global_page_state(NR_FILE_DIRTY) +
> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> + wbc->for_cgroup = 0;
> + return true;
> + }
> +
> + wbc->for_cgroup = 1;
> + wbc->shared_inodes = 1;
> + return mem_cgroups_over_bground_dirty_thresh();
> }
>
> /*
> @@ -700,7 +711,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> * For background writeout, stop when we are below the
> * background dirty threshold
> */
> - if (work->for_background && !over_bground_thresh())
> + if (work->for_background && !over_bground_thresh(wb, &wbc))
> break;
>
> if (work->for_kupdate || work->for_background) {
> @@ -729,6 +740,9 @@ retry:
> work->nr_pages -= write_chunk - wbc.nr_to_write;
> wrote += write_chunk - wbc.nr_to_write;
>
> + if (write_chunk - wbc.nr_to_write > 0)
> + mem_cgroup_writeback_done();
> +
> /*
> * Did we write something? Try for more
> *
> @@ -809,7 +823,9 @@ static unsigned long get_nr_dirty_pages(void)
>
> static long wb_check_background_flush(struct bdi_writeback *wb)
> {
> - if (over_bground_thresh()) {
> + struct writeback_control wbc;
> +
> + if (over_bground_thresh(wb, &wbc)) {
>
> struct wb_writeback_work work = {
> .nr_pages = LONG_MAX,
> --
> 1.7.3.1
>

--
Kind regards
Minchan Kim

2011-06-06 18:47:55

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

Minchan Kim <[email protected]> writes:

> Hi Greg,
>
> On Fri, Jun 03, 2011 at 09:12:16AM -0700, Greg Thelen wrote:
>> Introduce memcg routines to assist in per-memcg dirty page management:
>>
>> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
>> dirty memory usage against memcg foreground and background thresholds.
>> If an over-background-threshold memcg is found, then per-memcg
>> background writeback is queued. Per-memcg writeback differs from
>> classic, non-memcg, per bdi writeback by setting the new
>> writeback_control.for_cgroup bit.
>>
>> If an over-foreground-threshold memcg is found, then foreground
>> writeout occurs. When performing foreground writeout, first consider
>> inodes exclusive to the memcg. If unable to make enough progress,
>> then consider inodes shared between memcg. Such cross-memcg inode
>> sharing is likely to be rare in situations that use per-cgroup memory
>> isolation. The approach tries to handle the common (non-shared)
>> case well without punishing well behaved (non-sharing) cgroups.
>> As a last resort writeback shared inodes.
>>
>> This routine is used by balance_dirty_pages() in a later change.
>>
>> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
>> and limits of the memcg closest to (or over) its dirty limit. This
>> will be used by throttle_vm_writeout() in a later change.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> ---
>> Changelog since v7:
>> - Add more detail to commit description.
>>
>> - Declare the new writeback_control for_cgroup bit in this change, the
>> first patch that uses the new field. In -v7 the field
>> was declared in a separate patch.
>>
>> include/linux/memcontrol.h | 18 +++++
>> include/linux/writeback.h | 1 +
>> include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
>> mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
>> 4 files changed, 252 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3d72e09..0d0363e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
>> struct writeback_control *wbc);
>> bool mem_cgroups_over_bground_dirty_thresh(void);
>> void mem_cgroup_writeback_done(void);
>> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info);
>> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk);
>>
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> gfp_t gfp_mask,
>> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
>> {
>> }
>>
>> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk)
>> +{
>> +}
>> +
>> +static inline bool
>> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info)
>> +{
>> + return false;
>> +}
>> +
>> static inline
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> gfp_t gfp_mask,
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index 66ec339..4f5c0d2 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -47,6 +47,7 @@ struct writeback_control {
>> unsigned for_reclaim:1; /* Invoked from the page allocator */
>> unsigned range_cyclic:1; /* range_start is cyclic */
>> unsigned more_io:1; /* more io to be dispatched */
>> + unsigned for_cgroup:1; /* enable cgroup writeback */
>> unsigned shared_inodes:1; /* write inodes spanning cgroups */
>> };
>>
>> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> index 326a66b..b42dae1 100644
>> --- a/include/trace/events/memcontrol.h
>> +++ b/include/trace/events/memcontrol.h
>> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
>> __entry->first_id)
>> )
>>
>> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
>> + TP_PROTO(unsigned short css_id,
>> + struct backing_dev_info *bdi,
>> + unsigned long nr_reclaimable,
>> + unsigned long thresh,
>> + bool over_limit),
>> +
>> + TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned short, css_id)
>> + __field(struct backing_dev_info *, bdi)
>> + __field(unsigned long, nr_reclaimable)
>> + __field(unsigned long, thresh)
>> + __field(bool, over_limit)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->css_id = css_id;
>> + __entry->bdi = bdi;
>> + __entry->nr_reclaimable = nr_reclaimable;
>> + __entry->thresh = thresh;
>> + __entry->over_limit = over_limit;
>> + ),
>> +
>> + TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
>> + "over_limit=%d", __entry->css_id, __entry->bdi,
>> + __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
>> +)
>> +
>> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
>> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
>> + TP_PROTO(unsigned short id, \
>> + struct backing_dev_info *bdi, \
>> + unsigned long nr_reclaimable, \
>> + unsigned long thresh, \
>> + bool over_limit), \
>> + TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
>> +)
>> +
>> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
>> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
>> +
>> +TRACE_EVENT(mem_cgroup_fg_writeback,
>> + TP_PROTO(unsigned long write_chunk,
>> + struct writeback_control *wbc),
>> +
>> + TP_ARGS(write_chunk, wbc),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned long, write_chunk)
>> + __field(long, wbc_to_write)
>> + __field(bool, shared_inodes)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->write_chunk = write_chunk;
>> + __entry->wbc_to_write = wbc->nr_to_write;
>> + __entry->shared_inodes = wbc->shared_inodes;
>> + ),
>> +
>> + TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
>> + __entry->write_chunk,
>> + __entry->wbc_to_write,
>> + __entry->shared_inodes)
>> +)
>> +
>> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
>> + TP_PROTO(unsigned short css_id),
>> +
>> + TP_ARGS(css_id),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned short, css_id)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->css_id = css_id;
>> + ),
>> +
>> + TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
>> +)
>> +
>> #endif /* _TRACE_MEMCONTROL_H */
>>
>> /* This part must be outside protection */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a5b1794..17cb888 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
>> }
>> }
>>
>> +/*
>> + * This routine must be called by processes which are generating dirty pages.
>> + * It considers the dirty pages usage and thresholds of the current cgroup and
>> + * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
>> + * the considered memcg are over their background dirty limit, then background
>> + * writeback is queued. If any are over the foreground dirty limit then
>> + * throttle the dirtying task while writing dirty data. The per-memcg dirty
>> + * limits checked by this routine are distinct from either the per-system,
>> + * per-bdi, or per-task limits considered by balance_dirty_pages().
>> + */
>> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk)
>> +{
>> + struct backing_dev_info *bdi = mapping->backing_dev_info;
>> + struct mem_cgroup *mem;
>> + struct mem_cgroup *ref_mem;
>> + struct dirty_info info;
>> + unsigned long nr_reclaimable;
>> + unsigned long sys_available_mem;
>> + unsigned long pause = 1;
>> + unsigned short id;
>> + bool over;
>> + bool shared_inodes;
>> +
>> + if (mem_cgroup_disabled())
>> + return;
>> +
>> + sys_available_mem = determine_dirtyable_memory();
>> +
>> + /* reference the memcg so it is not deleted during this routine */
>> + rcu_read_lock();
>> + mem = mem_cgroup_from_task(current);
>> + if (mem && mem_cgroup_is_root(mem))
>> + mem = NULL;
>> + if (mem)
>> + css_get(&mem->css);
>> + rcu_read_unlock();
>> + ref_mem = mem;
>> +
>> + /* balance entire ancestry of current's mem. */
>> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
>> + id = css_id(&mem->css);
>> +
>> + /*
>> + * keep throttling and writing inode data so long as mem is over
>> + * its dirty limit.
>> + */
>> + for (shared_inodes = false; ; ) {
>> + struct writeback_control wbc = {
>> + .sync_mode = WB_SYNC_NONE,
>> + .older_than_this = NULL,
>> + .range_cyclic = 1,
>> + .for_cgroup = 1,
>> + .nr_to_write = write_chunk,
>> + .shared_inodes = shared_inodes,
>> + };
>> +
>> + /*
>> + * if mem is under dirty limit, then break from
>> + * throttling loop.
>> + */
>> + mem_cgroup_dirty_info(sys_available_mem, mem, &info);
>> + nr_reclaimable = dirty_info_reclaimable(&info);
>> + over = nr_reclaimable > info.dirty_thresh;
>> + trace_mem_cgroup_consider_fg_writeback(
>> + id, bdi, nr_reclaimable, info.dirty_thresh,
>> + over);
>> + if (!over)
>> + break;
>> +
>> + mem_cgroup_mark_over_bg_thresh(mem);
>
> We are over the fg_thresh.
> Then, why do you mark bg_thresh, too?

It is possible for a cgroup to exceed its background limit without
calling balance_dirty_pages() due to rate limiting in
balance_dirty_pages_ratelimited_nr(). So this call to
mem_cgroup_mark_over_bg_thresh() is needed to ensure that the cgroup is
marked as over-bg-limit. The writeback_inodes_wb() call below writes
the inodes owned by over-bg-limit memcgs. This is not obvious. I will
add a comment here explaining why over_bg_thresh is marked in this
situation.
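
Something like this is what I have in mind (the comment wording below is
only illustrative, not from a posted patch):

	/*
	 * Rate limiting in balance_dirty_pages_ratelimited_nr() means a
	 * memcg can exceed its background threshold without this routine
	 * ever having marked it over-bg-limit.  Mark it here so that the
	 * writeback_inodes_wb() call below, which only writes inodes of
	 * over-bg-limit memcg, has something to operate on.
	 */
	mem_cgroup_mark_over_bg_thresh(mem);
	writeback_inodes_wb(&bdi->wb, &wbc);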

>> + writeback_inodes_wb(&bdi->wb, &wbc);
>> + trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
>> + /* if no progress, then consider shared inodes */
>> + if ((wbc.nr_to_write == write_chunk) &&
>> + !shared_inodes) {
>> + trace_mem_cgroup_enable_shared_writeback(id);
>> + shared_inodes = true;
>
> I am not sure this is really the right condition for punishing shared inodes.
> We requested the wbc as async writeback. If the bdi was congested, isn't it
> possible that we can't write anything in this turn?
>
> If the shared inodes are on a different bdi, this would be effective, but
> what if they are on the same bdi?
>
> If you assume that the shared-inode case is rare in memcg configurations,
> and that assumption is right, then I don't have a concern about it. But I am
> not sure.

I am not sure if bdi congestion can cause a problem here. I don't think
that bdi congestion would cause writeback to fail to submit IOs and
decrement nr_to_write. But I could be wrong.

There are other reasons that writeback_inodes_wb() may not make progress
(e.g. if b_io has only one inode currently under writeback, then no
progress may be made). To detect such cases would require adding more
counters in the writeback code. I do not feel that it is necessary to
avoid downshifting here into writing shared inodes. But I do see value
in first trying to write non-shared inodes (as proposed in the patch).
Shared inodes should be rare because their use would further limit the
isolation offered by cgroups.

I will add a comment to this section to clarify this.
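
Roughly along these lines (again, the comment wording is just a sketch):

	/*
	 * A pass over the memcg's own (non-shared) inodes made no
	 * progress: nr_to_write was not decremented.  Rather than add
	 * writeback counters to distinguish why (e.g. the only inode on
	 * b_io may already be under writeback), widen the next pass to
	 * also include shared inodes.  Shared inodes should be rare,
	 * because sharing weakens the isolation cgroups are used for.
	 */
	if ((wbc.nr_to_write == write_chunk) && !shared_inodes) {
		trace_mem_cgroup_enable_shared_writeback(id);
		shared_inodes = true;
	}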

>> + }
>> +
>> + /*
>> + * Sleep up to 100ms to throttle writer and wait for
>> + * queued background I/O to complete.
>> + */
>> + __set_current_state(TASK_UNINTERRUPTIBLE);
>> + io_schedule_timeout(pause);
>> + pause <<= 1;
>> + if (pause > HZ / 10)
>> + pause = HZ / 10;
>> + }
>> +
>> + /* if mem is over background limit, then queue bg writeback */
>> + over = nr_reclaimable >= info.background_thresh;
>> + trace_mem_cgroup_consider_bg_writeback(
>> + id, bdi, nr_reclaimable, info.background_thresh,
>> + over);
>> + if (over)
>> + mem_cgroup_queue_bg_writeback(mem, bdi);
>> + }
>> +
>> + if (ref_mem)
>> + css_put(&ref_mem->css);
>> +}
>> +
>> +/*
>> + * Return the dirty thresholds and usage for the mem (within the ancestral chain
>> + * of @mem) closest to its dirty limit or the first memcg over its limit.
>
> This routine has a 'bool' return value. What does it mean? Of course, we
> can guess it by looking at the code, but let's not make users suffer. Please
> document the meaning of the return value, too.

Good point. I will add more details to the comment.

>> + *
>> + * The check is not stable because the usage and limits can change asynchronously
>> + * to this routine.
>> + */
>> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info)
>> +{
>> + unsigned long usage;
>> + struct dirty_info uninitialized_var(cur_info);
>> +
>> + if (mem_cgroup_disabled())
>> + return false;
>> +
>> + info->nr_writeback = ULONG_MAX; /* invalid initial value */
>> +
>> + /* walk up hierarchy enabled parents */
>> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
>> + mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
>> + usage = dirty_info_reclaimable(&cur_info) +
>> + cur_info.nr_writeback;
>> +
>> + /* if over limit, stop searching */
>> + if (usage >= cur_info.dirty_thresh) {
>> + *info = cur_info;
>> + break;
>> + }
>> +
>> + /*
>> + * Save dirty usage of mem closest to its limit if either:
>> + * - mem is the first mem considered
>> + * - mem dirty margin is smaller than last recorded one
>> + */
>> + if ((info->nr_writeback == ULONG_MAX) ||
>> + (cur_info.dirty_thresh - usage) <
>> + (info->dirty_thresh -
>> + (dirty_info_reclaimable(info) + info->nr_writeback)))
>> + *info = cur_info;
>> + }
>> +
>> + return info->nr_writeback != ULONG_MAX;
>> +}
>> +
>> static void mem_cgroup_start_move(struct mem_cgroup *mem)
>> {
>> int cpu;
>> --
>> 1.7.3.1
>>

2011-06-06 18:52:47

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

Minchan Kim <[email protected]> writes:

> On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
>> When the system is under background dirty memory threshold but a cgroup
>> is over its background dirty memory threshold, then only writeback
>> inodes associated with the over-limit cgroup(s).
>>
>> In addition to checking if the system dirty memory usage is over the
>> system background threshold, over_bground_thresh() also checks if any
>> cgroups are over their respective background dirty memory thresholds.
>> The writeback_control.for_cgroup field is set to distinguish between a
>> system and memcg overage.
>>
>> If performing cgroup writeback, move_expired_inodes() skips inodes that
>> do not contribute dirty pages to the cgroup being written back.
>>
>> After writing some pages, wb_writeback() will call
>> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
> There are some nitpicks below.
>
>> ---
>> Changelog since v7:
>> - over_bground_thresh() now sets shared_inodes=1. In -v7 per memcg
>> background writeback did not, so it did not write pages of shared
>> inodes in background writeback. In the (potentially common) case
>> where the system dirty memory usage is below the system background
>> dirty threshold but at least one cgroup is over its background dirty
>> limit, then per memcg background writeback is queued for any
>> over-background-threshold cgroups. Background writeback should be
>> allowed to writeback shared inodes. The hope is that writing such
>> inodes has a good chance of cleaning the inodes so they can transition
>> from shared to non-shared. Such a transition is good because then the
>> inode will remain unshared until it is written by multiple cgroups.
>> Non-shared inodes offer better isolation.
>
> The above comment should be in the commit description.

Good point. I will add this to the commit description.

>>
>> fs/fs-writeback.c | 32 ++++++++++++++++++++++++--------
>> 1 files changed, 24 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 0174fcf..c0bfe62 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -256,14 +256,17 @@ static void move_expired_inodes(struct list_head *delaying_queue,
>> LIST_HEAD(tmp);
>> struct list_head *pos, *node;
>> struct super_block *sb = NULL;
>> - struct inode *inode;
>> + struct inode *inode, *tmp_inode;
>> int do_sb_sort = 0;
>>
>> - while (!list_empty(delaying_queue)) {
>> - inode = wb_inode(delaying_queue->prev);
>> + list_for_each_entry_safe_reverse(inode, tmp_inode, delaying_queue,
>> + i_wb_list) {
>> if (wbc->older_than_this &&
>> inode_dirtied_after(inode, *wbc->older_than_this))
>> break;
>> + if (wbc->for_cgroup &&
>> + !should_writeback_mem_cgroup_inode(inode, wbc))
>> + continue;
>> if (sb && sb != inode->i_sb)
>> do_sb_sort = 1;
>> sb = inode->i_sb;
>> @@ -614,14 +617,22 @@ void writeback_inodes_wb(struct bdi_writeback *wb,
>> */
>> #define MAX_WRITEBACK_PAGES 1024
>>
>> -static inline bool over_bground_thresh(void)
>> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
>
> At present, wb isn't used.
> Did you keep it intentionally for future use?

Good catch. At present, wb isn't used. I do not have plans to use it.
I will remove it.

>> + struct writeback_control *wbc)
>> {
>> unsigned long background_thresh, dirty_thresh;
>>
>> global_dirty_limits(&background_thresh, &dirty_thresh);
>>
>> - return (global_page_state(NR_FILE_DIRTY) +
>> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>> + if (global_page_state(NR_FILE_DIRTY) +
>> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
>> + wbc->for_cgroup = 0;
>> + return true;
>> + }
>> +
>> + wbc->for_cgroup = 1;
>> + wbc->shared_inodes = 1;
>> + return mem_cgroups_over_bground_dirty_thresh();
>> }
>>
>> /*
>> @@ -700,7 +711,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>> * For background writeout, stop when we are below the
>> * background dirty threshold
>> */
>> - if (work->for_background && !over_bground_thresh())
>> + if (work->for_background && !over_bground_thresh(wb, &wbc))
>> break;
>>
>> if (work->for_kupdate || work->for_background) {
>> @@ -729,6 +740,9 @@ retry:
>> work->nr_pages -= write_chunk - wbc.nr_to_write;
>> wrote += write_chunk - wbc.nr_to_write;
>>
>> + if (write_chunk - wbc.nr_to_write > 0)
>> + mem_cgroup_writeback_done();
>> +
>> /*
>> * Did we write something? Try for more
>> *
>> @@ -809,7 +823,9 @@ static unsigned long get_nr_dirty_pages(void)
>>
>> static long wb_check_background_flush(struct bdi_writeback *wb)
>> {
>> - if (over_bground_thresh()) {
>> + struct writeback_control wbc;
>> +
>> + if (over_bground_thresh(wb, &wbc)) {
>>
>> struct wb_writeback_work work = {
>> .nr_pages = LONG_MAX,
>> --
>> 1.7.3.1
>>

2011-06-07 07:34:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 03/12] memcg: add mem_cgroup_mark_inode_dirty()

On Fri, 3 Jun 2011 09:12:09 -0700
Greg Thelen <[email protected]> wrote:

> Create the mem_cgroup_mark_inode_dirty() routine, which is called when
> an inode is marked dirty. In kernels without memcg, this is an inline
> no-op.
>
> Add i_memcg field to struct address_space. When an inode is marked
> dirty with mem_cgroup_mark_inode_dirty(), the css_id of current memcg is
> recorded in i_memcg. Per-memcg writeback (introduced in a later
> change) uses this field to isolate inodes associated with a particular
> memcg.
>
> The type of i_memcg is an 'unsigned short' because it stores the css_id
> of the memcg. Using a struct mem_cgroup pointer would be larger and
> also create a reference on the memcg which would hang memcg rmdir
> deletion. Usage of a css_id is not a reference so cgroup deletion is
> not affected. The memcg can be deleted without cleaning up the i_memcg
> field. When a memcg is deleted its pages are recharged to the cgroup
> parent, and the related inode(s) are marked as shared thus
> disassociating the inodes from the deleted cgroup.
>
> A mem_cgroup_mark_inode_dirty() tracepoint is also included to allow for
> easier understanding of memcg writeback operation.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-06-07 07:35:53

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 04/12] memcg: add dirty page accounting infrastructure

On Fri, 3 Jun 2011 09:12:10 -0700
Greg Thelen <[email protected]> wrote:

> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages. A
> later change adds kernel calls to these new routines.
>
> As inode pages are marked dirty, if the dirtied page's cgroup differs
> from the inode's cgroup, then mark the inode shared across several
> cgroups.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-06-07 07:51:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 08/12] memcg: dirty page accounting support routines

On Fri, 3 Jun 2011 09:12:14 -0700
Greg Thelen <[email protected]> wrote:

> Added memcg dirty page accounting support routines. These routines are
> used by later changes to provide memcg aware writeback and dirty page
> limiting. A mem_cgroup_dirty_info() tracepoint is also included to
> allow for easier understanding of memcg writeback operation.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-06-07 07:53:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 09/12] memcg: create support routines for writeback

On Fri, 3 Jun 2011 09:12:15 -0700
Greg Thelen <[email protected]> wrote:

> Introduce memcg routines to assist in per-memcg writeback:
>
> - mem_cgroups_over_bground_dirty_thresh() determines if any cgroups need
> writeback because they are over their dirty memory threshold.
>
> - should_writeback_mem_cgroup_inode() determines if an inode is
> contributing pages to an over-limit memcg. A new writeback_control
> field determines if shared inodes should be written back.
>
> - mem_cgroup_writeback_done() is used periodically during writeback to
> update memcg writeback data.
>
> These routines make use of a new over_bground_dirty_thresh bitmap that
> indicates which mem_cgroup are over their respective dirty background
> threshold. As this bitmap is indexed by css_id, the largest possible
> css_id value is needed to create the bitmap. So move the definition of
> CSS_ID_MAX from cgroup.c to cgroup.h. This allows users of css_id() to
> know the largest possible css_id value. This knowledge can be used to
> build such per-cgroup bitmaps.
>
> Signed-off-by: Greg Thelen <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-06-07 08:58:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

On Fri, 3 Jun 2011 09:12:16 -0700
Greg Thelen <[email protected]> wrote:

> Introduce memcg routines to assist in per-memcg dirty page management:
>
> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
> dirty memory usage against memcg foreground and background thresholds.
> If an over-background-threshold memcg is found, then per-memcg
> background writeback is queued. Per-memcg writeback differs from
> classic, non-memcg, per bdi writeback by setting the new
> writeback_control.for_cgroup bit.
>
> If an over-foreground-threshold memcg is found, then foreground
> writeout occurs. When performing foreground writeout, first consider
> inodes exclusive to the memcg. If unable to make enough progress,
> then consider inodes shared between memcg. Such cross-memcg inode
> sharing is likely to be rare in situations that use per-cgroup memory
> isolation. The approach tries to handle the common (non-shared)
> case well without punishing well behaved (non-sharing) cgroups.
> As a last resort writeback shared inodes.
>
> This routine is used by balance_dirty_pages() in a later change.
>
> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
> and limits of the memcg closest to (or over) its dirty limit. This
> will be used by throttle_vm_writeout() in a later change.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---
> Changelog since v7:
> - Add more detail to commit description.
>
> - Declare the new writeback_control for_cgroup bit in this change, the
> first patch that uses the new field. In -v7 the field
> was declared in a separate patch.
>
> include/linux/memcontrol.h | 18 +++++
> include/linux/writeback.h | 1 +
> include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
> mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
> 4 files changed, 252 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3d72e09..0d0363e 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
> struct writeback_control *wbc);
> bool mem_cgroups_over_bground_dirty_thresh(void);
> void mem_cgroup_writeback_done(void);
> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info);
> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk);
>
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask,
> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
> {
> }
>
> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk)
> +{
> +}
> +
> +static inline bool
> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info)
> +{
> + return false;
> +}
> +
> static inline
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask,
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 66ec339..4f5c0d2 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -47,6 +47,7 @@ struct writeback_control {
> unsigned for_reclaim:1; /* Invoked from the page allocator */
> unsigned range_cyclic:1; /* range_start is cyclic */
> unsigned more_io:1; /* more io to be dispatched */
> + unsigned for_cgroup:1; /* enable cgroup writeback */
> unsigned shared_inodes:1; /* write inodes spanning cgroups */
> };
>
> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> index 326a66b..b42dae1 100644
> --- a/include/trace/events/memcontrol.h
> +++ b/include/trace/events/memcontrol.h
> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
> __entry->first_id)
> )
>
> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
> + TP_PROTO(unsigned short css_id,
> + struct backing_dev_info *bdi,
> + unsigned long nr_reclaimable,
> + unsigned long thresh,
> + bool over_limit),
> +
> + TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
> +
> + TP_STRUCT__entry(
> + __field(unsigned short, css_id)
> + __field(struct backing_dev_info *, bdi)
> + __field(unsigned long, nr_reclaimable)
> + __field(unsigned long, thresh)
> + __field(bool, over_limit)
> + ),
> +
> + TP_fast_assign(
> + __entry->css_id = css_id;
> + __entry->bdi = bdi;
> + __entry->nr_reclaimable = nr_reclaimable;
> + __entry->thresh = thresh;
> + __entry->over_limit = over_limit;
> + ),
> +
> + TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
> + "over_limit=%d", __entry->css_id, __entry->bdi,
> + __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
> +)
> +
> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
> + TP_PROTO(unsigned short id, \
> + struct backing_dev_info *bdi, \
> + unsigned long nr_reclaimable, \
> + unsigned long thresh, \
> + bool over_limit), \
> + TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
> +)
> +
> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
> +
> +TRACE_EVENT(mem_cgroup_fg_writeback,
> + TP_PROTO(unsigned long write_chunk,
> + struct writeback_control *wbc),
> +
> + TP_ARGS(write_chunk, wbc),
> +
> + TP_STRUCT__entry(
> + __field(unsigned long, write_chunk)
> + __field(long, wbc_to_write)
> + __field(bool, shared_inodes)
> + ),
> +
> + TP_fast_assign(
> + __entry->write_chunk = write_chunk;
> + __entry->wbc_to_write = wbc->nr_to_write;
> + __entry->shared_inodes = wbc->shared_inodes;
> + ),
> +
> + TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
> + __entry->write_chunk,
> + __entry->wbc_to_write,
> + __entry->shared_inodes)
> +)
> +
> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
> + TP_PROTO(unsigned short css_id),
> +
> + TP_ARGS(css_id),
> +
> + TP_STRUCT__entry(
> + __field(unsigned short, css_id)
> + ),
> +
> + TP_fast_assign(
> + __entry->css_id = css_id;
> + ),
> +
> + TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
> +)
> +
> #endif /* _TRACE_MEMCONTROL_H */
>
> /* This part must be outside protection */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a5b1794..17cb888 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
> }
> }
>
> +/*
> + * This routine must be called by processes which are generating dirty pages.
> + * It considers the dirty pages usage and thresholds of the current cgroup and
> + * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
> + * the considered memcg are over their background dirty limit, then background
> + * writeback is queued. If any are over the foreground dirty limit then
> + * throttle the dirtying task while writing dirty data. The per-memcg dirty
> + * limits checked by this routine are distinct from either the per-system,
> + * per-bdi, or per-task limits considered by balance_dirty_pages().
> + */
> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> + unsigned long write_chunk)
> +{
> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct mem_cgroup *mem;
> + struct mem_cgroup *ref_mem;
> + struct dirty_info info;
> + unsigned long nr_reclaimable;
> + unsigned long sys_available_mem;
> + unsigned long pause = 1;
> + unsigned short id;
> + bool over;
> + bool shared_inodes;
> +
> + if (mem_cgroup_disabled())
> + return;
> +
> + sys_available_mem = determine_dirtyable_memory();
> +
> + /* reference the memcg so it is not deleted during this routine */
> + rcu_read_lock();
> + mem = mem_cgroup_from_task(current);
> + if (mem && mem_cgroup_is_root(mem))
> + mem = NULL;
> + if (mem)
> + css_get(&mem->css);
> + rcu_read_unlock();
> + ref_mem = mem;
> +
> + /* balance entire ancestry of current's mem. */
> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
> + id = css_id(&mem->css);
> +

Hmm, this sounds natural...but don't we need to restart checking from ref_mem's
dirty_ratio once we find an ancestor is over its dirty_ratio and we have slept?

Even if the parent's dirty ratio returns to a clean state, the children's may
not. So I think we need some "restart loop" jump after io_schedule_timeout().
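
Something like this is what I mean (only a sketch of the idea, not a tested
change):

	/* balance entire ancestry of current's mem. */
restart:
	for (mem = ref_mem; mem_cgroup_has_dirty_limit(mem);
	     mem = parent_mem_cgroup(mem)) {
		...
		__set_current_state(TASK_UNINTERRUPTIBLE);
		io_schedule_timeout(pause);
		/* a child may have gone back over its limit while we slept */
		goto restart;
	}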


Thanks,
-Kame


> + /*
> + * keep throttling and writing inode data so long as mem is over
> + * its dirty limit.
> + */
> + for (shared_inodes = false; ; ) {
> + struct writeback_control wbc = {
> + .sync_mode = WB_SYNC_NONE,
> + .older_than_this = NULL,
> + .range_cyclic = 1,
> + .for_cgroup = 1,
> + .nr_to_write = write_chunk,
> + .shared_inodes = shared_inodes,
> + };
> +
> + /*
> + * if mem is under dirty limit, then break from
> + * throttling loop.
> + */
> + mem_cgroup_dirty_info(sys_available_mem, mem, &info);
> + nr_reclaimable = dirty_info_reclaimable(&info);
> + over = nr_reclaimable > info.dirty_thresh;
> + trace_mem_cgroup_consider_fg_writeback(
> + id, bdi, nr_reclaimable, info.dirty_thresh,
> + over);
> + if (!over)
> + break;
> +
> + mem_cgroup_mark_over_bg_thresh(mem);
> + writeback_inodes_wb(&bdi->wb, &wbc);
> + trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
> + /* if no progress, then consider shared inodes */
> + if ((wbc.nr_to_write == write_chunk) &&
> + !shared_inodes) {
> + trace_mem_cgroup_enable_shared_writeback(id);
> + shared_inodes = true;
> + }
> +
> + /*
> + * Sleep up to 100ms to throttle writer and wait for
> + * queued background I/O to complete.
> + */
> + __set_current_state(TASK_UNINTERRUPTIBLE);
> + io_schedule_timeout(pause);
> + pause <<= 1;
> + if (pause > HZ / 10)
> + pause = HZ / 10;

Hmm, is this exponential back-off "pause" from mm/page-writeback.c?
I'm happy if we can have shared code. (But it's OK if that adds some mess.)



> + }
> +
> + /* if mem is over background limit, then queue bg writeback */
> + over = nr_reclaimable >= info.background_thresh;
> + trace_mem_cgroup_consider_bg_writeback(
> + id, bdi, nr_reclaimable, info.background_thresh,
> + over);
> + if (over)
> + mem_cgroup_queue_bg_writeback(mem, bdi);
> + }
> +
> + if (ref_mem)
> + css_put(&ref_mem->css);
> +}
> +
> +/*
> + * Return the dirty thresholds and usage for the mem (within the ancestral chain
> + * of @mem) closest to its dirty limit or the first memcg over its limit.
> + *
> + * The check is not stable because the usage and limits can change asynchronously
> + * to this routine.
> + */
> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> + struct mem_cgroup *mem,
> + struct dirty_info *info)
> +{
> + unsigned long usage;
> + struct dirty_info uninitialized_var(cur_info);
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + info->nr_writeback = ULONG_MAX; /* invalid initial value */
> +
> + /* walk up hierarchy enabled parents */
> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
> + mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
> + usage = dirty_info_reclaimable(&cur_info) +
> + cur_info.nr_writeback;
> +
> + /* if over limit, stop searching */
> + if (usage >= cur_info.dirty_thresh) {
> + *info = cur_info;
> + break;
> + }
> +
> + /*
> + * Save dirty usage of mem closest to its limit if either:
> + * - mem is the first mem considered
> + * - mem dirty margin is smaller than last recorded one
> + */
> + if ((info->nr_writeback == ULONG_MAX) ||
> + (cur_info.dirty_thresh - usage) <
> + (info->dirty_thresh -
> + (dirty_info_reclaimable(info) + info->nr_writeback)))
> + *info = cur_info;
> + }
> +
> + return info->nr_writeback != ULONG_MAX;
> +}
> +
> static void mem_cgroup_start_move(struct mem_cgroup *mem)
> {
> int cpu;
> --
> 1.7.3.1
>

2011-06-07 09:03:39

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Fri, 3 Jun 2011 09:12:17 -0700
Greg Thelen <[email protected]> wrote:

> When the system is under background dirty memory threshold but a cgroup
> is over its background dirty memory threshold, then only writeback
> inodes associated with the over-limit cgroup(s).
>
> In addition to checking if the system dirty memory usage is over the
> system background threshold, over_bground_thresh() also checks if any
> cgroups are over their respective background dirty memory thresholds.
> The writeback_control.for_cgroup field is set to distinguish between a
> system and memcg overage.
>
> If performing cgroup writeback, move_expired_inodes() skips inodes that
> do not contribute dirty pages to the cgroup being written back.
>
> After writing some pages, wb_writeback() will call
> mem_cgroup_writeback_done() to update the set of over-bg-limits memcg.
>
> Signed-off-by: Greg Thelen <[email protected]>



> ---
> Changelog since v7:
> - over_bground_thresh() now sets shared_inodes=1. In -v7 per memcg
> background writeback did not, so it did not write pages of shared
> inodes in background writeback. In the (potentially common) case
> where the system dirty memory usage is below the system background
> dirty threshold but at least one cgroup is over its background dirty
> limit, then per memcg background writeback is queued for any
> over-background-threshold cgroups. Background writeback should be
> allowed to writeback shared inodes. The hope is that writing such
> inodes has a good chance of cleaning the inodes so they can transition
> from shared to non-shared. Such a transition is good because then the
> inode will remain unshared until it is written by multiple cgroups.
> Non-shared inodes offer better isolation.

If you post v9, please add the above explanation as a comment in the code.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-06-07 15:59:57

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Fri, 3 Jun 2011 09:12:16 -0700
> Greg Thelen <[email protected]> wrote:
>
>> Introduce memcg routines to assist in per-memcg dirty page management:
>>
>> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
>> dirty memory usage against memcg foreground and background thresholds.
>> If an over-background-threshold memcg is found, then per-memcg
>> background writeback is queued. Per-memcg writeback differs from
>> classic, non-memcg, per bdi writeback by setting the new
>> writeback_control.for_cgroup bit.
>>
>> If an over-foreground-threshold memcg is found, then foreground
>> writeout occurs. When performing foreground writeout, first consider
>> inodes exclusive to the memcg. If unable to make enough progress,
>> then consider inodes shared between memcg. Such cross-memcg inode
>> sharing is likely to be rare in situations that use per-cgroup memory
>> isolation. The approach tries to handle the common (non-shared)
>> case well without punishing well behaved (non-sharing) cgroups.
>> As a last resort writeback shared inodes.
>>
>> This routine is used by balance_dirty_pages() in a later change.
>>
>> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
>> and limits of the memcg closest to (or over) its dirty limit. This
>> will be used by throttle_vm_writeout() in a later change.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> ---
>> Changelog since v7:
>> - Add more detail to commit description.
>>
>> - Declare the new writeback_control for_cgroup bit in this change, the
>> first patch that uses the new field. In -v7 the field
>> was declared in a separate patch.
>>
>> include/linux/memcontrol.h | 18 +++++
>> include/linux/writeback.h | 1 +
>> include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
>> mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
>> 4 files changed, 252 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3d72e09..0d0363e 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
>> struct writeback_control *wbc);
>> bool mem_cgroups_over_bground_dirty_thresh(void);
>> void mem_cgroup_writeback_done(void);
>> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info);
>> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk);
>>
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> gfp_t gfp_mask,
>> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
>> {
>> }
>>
>> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk)
>> +{
>> +}
>> +
>> +static inline bool
>> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info)
>> +{
>> + return false;
>> +}
>> +
>> static inline
>> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> gfp_t gfp_mask,
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index 66ec339..4f5c0d2 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -47,6 +47,7 @@ struct writeback_control {
>> unsigned for_reclaim:1; /* Invoked from the page allocator */
>> unsigned range_cyclic:1; /* range_start is cyclic */
>> unsigned more_io:1; /* more io to be dispatched */
>> + unsigned for_cgroup:1; /* enable cgroup writeback */
>> unsigned shared_inodes:1; /* write inodes spanning cgroups */
>> };
>>
>> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> index 326a66b..b42dae1 100644
>> --- a/include/trace/events/memcontrol.h
>> +++ b/include/trace/events/memcontrol.h
>> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
>> __entry->first_id)
>> )
>>
>> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
>> + TP_PROTO(unsigned short css_id,
>> + struct backing_dev_info *bdi,
>> + unsigned long nr_reclaimable,
>> + unsigned long thresh,
>> + bool over_limit),
>> +
>> + TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned short, css_id)
>> + __field(struct backing_dev_info *, bdi)
>> + __field(unsigned long, nr_reclaimable)
>> + __field(unsigned long, thresh)
>> + __field(bool, over_limit)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->css_id = css_id;
>> + __entry->bdi = bdi;
>> + __entry->nr_reclaimable = nr_reclaimable;
>> + __entry->thresh = thresh;
>> + __entry->over_limit = over_limit;
>> + ),
>> +
>> + TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
>> + "over_limit=%d", __entry->css_id, __entry->bdi,
>> + __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
>> +)
>> +
>> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
>> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
>> + TP_PROTO(unsigned short id, \
>> + struct backing_dev_info *bdi, \
>> + unsigned long nr_reclaimable, \
>> + unsigned long thresh, \
>> + bool over_limit), \
>> + TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
>> +)
>> +
>> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
>> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
>> +
>> +TRACE_EVENT(mem_cgroup_fg_writeback,
>> + TP_PROTO(unsigned long write_chunk,
>> + struct writeback_control *wbc),
>> +
>> + TP_ARGS(write_chunk, wbc),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned long, write_chunk)
>> + __field(long, wbc_to_write)
>> + __field(bool, shared_inodes)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->write_chunk = write_chunk;
>> + __entry->wbc_to_write = wbc->nr_to_write;
>> + __entry->shared_inodes = wbc->shared_inodes;
>> + ),
>> +
>> + TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
>> + __entry->write_chunk,
>> + __entry->wbc_to_write,
>> + __entry->shared_inodes)
>> +)
>> +
>> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
>> + TP_PROTO(unsigned short css_id),
>> +
>> + TP_ARGS(css_id),
>> +
>> + TP_STRUCT__entry(
>> + __field(unsigned short, css_id)
>> + ),
>> +
>> + TP_fast_assign(
>> + __entry->css_id = css_id;
>> + ),
>> +
>> + TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
>> +)
>> +
>> #endif /* _TRACE_MEMCONTROL_H */
>>
>> /* This part must be outside protection */
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index a5b1794..17cb888 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
>> }
>> }
>>
>> +/*
>> + * This routine must be called by processes which are generating dirty pages.
>> + * It considers the dirty pages usage and thresholds of the current cgroup and
>> + * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
>> + * the considered memcg are over their background dirty limit, then background
>> + * writeback is queued. If any are over the foreground dirty limit then
>> + * throttle the dirtying task while writing dirty data. The per-memcg dirty
>> + * limits checked by this routine are distinct from either the per-system,
>> + * per-bdi, or per-task limits considered by balance_dirty_pages().
>> + */
>> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> + unsigned long write_chunk)
>> +{
>> + struct backing_dev_info *bdi = mapping->backing_dev_info;
>> + struct mem_cgroup *mem;
>> + struct mem_cgroup *ref_mem;
>> + struct dirty_info info;
>> + unsigned long nr_reclaimable;
>> + unsigned long sys_available_mem;
>> + unsigned long pause = 1;
>> + unsigned short id;
>> + bool over;
>> + bool shared_inodes;
>> +
>> + if (mem_cgroup_disabled())
>> + return;
>> +
>> + sys_available_mem = determine_dirtyable_memory();
>> +
>> + /* reference the memcg so it is not deleted during this routine */
>> + rcu_read_lock();
>> + mem = mem_cgroup_from_task(current);
>> + if (mem && mem_cgroup_is_root(mem))
>> + mem = NULL;
>> + if (mem)
>> + css_get(&mem->css);
>> + rcu_read_unlock();
>> + ref_mem = mem;
>> +
>> + /* balance entire ancestry of current's mem. */
>> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
>> + id = css_id(&mem->css);
>> +
>
> Hmm, this sounds natural...but don't we need to restart checking from
> ref_mem's dirty_ratio once we find an ancestor is over its dirty_ratio and
> we have slept?
>
> Even if the parent's dirty ratio returns to a clean state, the children's may
> not. So I think we need some "restart loop" jump after io_schedule_timeout().
>
> Thanks,
> -Kame

I do not think that we need to restart, but maybe you have a case in
mind that I am not considering.

Example hierarchy:
    root
        A
            A1
                A11
                A12
            A2
                A21
                A22
        B

Assume mem_cgroup_balance_dirty_pages() is called from A11, so ref_mem=A11.

We start at A11 and walk up towards the root. If A11 is over its limit, then
write A11 until it is under its limit. Next check A1; if A1 is over its limit,
then write A1, A11, A12. Then check A. If A is over its limit, then we invoke
writeback on A* until A is under its limit. Are you concerned that, while
performing writeback on A*, other tasks may push A1 over the A1 limit? Such
other task writers would also be calling mem_cgroup_balance_dirty_pages()
later.

>> + /*
>> + * keep throttling and writing inode data so long as mem is over
>> + * its dirty limit.
>> + */
>> + for (shared_inodes = false; ; ) {
>> + struct writeback_control wbc = {
>> + .sync_mode = WB_SYNC_NONE,
>> + .older_than_this = NULL,
>> + .range_cyclic = 1,
>> + .for_cgroup = 1,
>> + .nr_to_write = write_chunk,
>> + .shared_inodes = shared_inodes,
>> + };
>> +
>> + /*
>> + * if mem is under dirty limit, then break from
>> + * throttling loop.
>> + */
>> + mem_cgroup_dirty_info(sys_available_mem, mem, &info);
>> + nr_reclaimable = dirty_info_reclaimable(&info);
>> + over = nr_reclaimable > info.dirty_thresh;
>> + trace_mem_cgroup_consider_fg_writeback(
>> + id, bdi, nr_reclaimable, info.dirty_thresh,
>> + over);
>> + if (!over)
>> + break;
>> +
>> + mem_cgroup_mark_over_bg_thresh(mem);
>> + writeback_inodes_wb(&bdi->wb, &wbc);
>> + trace_mem_cgroup_fg_writeback(write_chunk, &wbc);
>> + /* if no progress, then consider shared inodes */
>> + if ((wbc.nr_to_write == write_chunk) &&
>> + !shared_inodes) {
>> + trace_mem_cgroup_enable_shared_writeback(id);
>> + shared_inodes = true;
>> + }
>> +
>> + /*
>> + * Sleep up to 100ms to throttle writer and wait for
>> + * queued background I/O to complete.
>> + */
>> + __set_current_state(TASK_UNINTERRUPTIBLE);
>> + io_schedule_timeout(pause);
>> + pause <<= 1;
>> + if (pause > HZ / 10)
>> + pause = HZ / 10;
>
> Hmm, is this exponential back-off "pause" from mm/page-writeback.c?

Yes. This is from mm/page-writeback.c:balance_dirty_pages().

> I'm happy if we can have shared code. (But it's OK if that adds some mess.)

Sounds good (will include in -v9). The refactored shared code will look
like:

/*
 * Sleep up to 100ms to throttle writer and wait for queued background
 * I/O to complete.
 */
unsigned long balance_dirty_pages_pause(unsigned long pause)
{
	__set_current_state(TASK_UNINTERRUPTIBLE);
	io_schedule_timeout(pause);
	pause <<= 1;
	if (pause > HZ / 10)
		pause = HZ / 10;
	return pause;
}

void [mem_cgroup_]balance_dirty_pages(...)
{
	unsigned long pause = 1;

	for (;;) {
		...
		pause = balance_dirty_pages_pause(pause);
	}
}

>> + }
>> +
>> + /* if mem is over background limit, then queue bg writeback */
>> + over = nr_reclaimable >= info.background_thresh;
>> + trace_mem_cgroup_consider_bg_writeback(
>> + id, bdi, nr_reclaimable, info.background_thresh,
>> + over);
>> + if (over)
>> + mem_cgroup_queue_bg_writeback(mem, bdi);
>> + }
>> +
>> + if (ref_mem)
>> + css_put(&ref_mem->css);
>> +}
>> +
>> +/*
>> + * Return the dirty thresholds and usage for the mem (within the ancestral chain
>> + * of @mem) closest to its dirty limit or the first memcg over its limit.
>> + *
>> + * The check is not stable because the usage and limits can change asynchronously
>> + * to this routine.
>> + */
>> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> + struct mem_cgroup *mem,
>> + struct dirty_info *info)
>> +{
>> + unsigned long usage;
>> + struct dirty_info uninitialized_var(cur_info);
>> +
>> + if (mem_cgroup_disabled())
>> + return false;
>> +
>> + info->nr_writeback = ULONG_MAX; /* invalid initial value */
>> +
>> + /* walk up hierarchy enabled parents */
>> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
>> + mem_cgroup_dirty_info(sys_available_mem, mem, &cur_info);
>> + usage = dirty_info_reclaimable(&cur_info) +
>> + cur_info.nr_writeback;
>> +
>> + /* if over limit, stop searching */
>> + if (usage >= cur_info.dirty_thresh) {
>> + *info = cur_info;
>> + break;
>> + }
>> +
>> + /*
>> + * Save dirty usage of mem closest to its limit if either:
>> + * - mem is the first mem considered
>> + * - mem dirty margin is smaller than last recorded one
>> + */
>> + if ((info->nr_writeback == ULONG_MAX) ||
>> + (cur_info.dirty_thresh - usage) <
>> + (info->dirty_thresh -
>> + (dirty_info_reclaimable(info) + info->nr_writeback)))
>> + *info = cur_info;
>> + }
>> +
>> + return info->nr_writeback != ULONG_MAX;
>> +}
>> +
>> static void mem_cgroup_start_move(struct mem_cgroup *mem)
>> {
>> int cpu;
>> --
>> 1.7.3.1
>>

2011-06-07 19:40:01

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> When the system is under background dirty memory threshold but a cgroup
> is over its background dirty memory threshold, then only writeback
> inodes associated with the over-limit cgroup(s).
>

[..]
> -static inline bool over_bground_thresh(void)
> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> + struct writeback_control *wbc)
> {
> unsigned long background_thresh, dirty_thresh;
>
> global_dirty_limits(&background_thresh, &dirty_thresh);
>
> - return (global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> + if (global_page_state(NR_FILE_DIRTY) +
> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> + wbc->for_cgroup = 0;
> + return true;
> + }
> +
> + wbc->for_cgroup = 1;
> + wbc->shared_inodes = 1;
> + return mem_cgroups_over_bground_dirty_thresh();
> }

Hi Greg,

So all the logic of writeout from mem cgroup works only if system is
below background limit. The moment we cross background limit, looks
like we will fall back to existing way of writing inodes?

This kind of cgroup writeback I think will at least not solve the problem
for CFQ IO controller, as we fall back to old ways of writing back inodes
the moment we cross dirty ratio.

Also have you done any benchmarking regarding what's the overhead of
going through say thousands of inodes to find the inode which is eligible
for writeback from a cgroup? I think Dave Chinner had raised this concern
in the past.

Thanks
Vivek

2011-06-07 19:43:46

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, Jun 07, 2011 at 03:38:35PM -0400, Vivek Goyal wrote:
> On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> > When the system is under background dirty memory threshold but a cgroup
> > is over its background dirty memory threshold, then only writeback
> > inodes associated with the over-limit cgroup(s).
> >
>
> [..]
> > -static inline bool over_bground_thresh(void)
> > +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> > + struct writeback_control *wbc)
> > {
> > unsigned long background_thresh, dirty_thresh;
> >
> > global_dirty_limits(&background_thresh, &dirty_thresh);
> >
> > - return (global_page_state(NR_FILE_DIRTY) +
> > - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> > + if (global_page_state(NR_FILE_DIRTY) +
> > + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> > + wbc->for_cgroup = 0;
> > + return true;
> > + }
> > +
> > + wbc->for_cgroup = 1;
> > + wbc->shared_inodes = 1;
> > + return mem_cgroups_over_bground_dirty_thresh();
> > }
>
> Hi Greg,
>
> So all the logic of writeout from mem cgroup works only if system is
> below background limit. The moment we cross background limit, looks
> like we will fall back to existing way of writting inodes?

If yes, then from design point of view it is little odd that as long
as we are below background limit, we share the bdi between different
cgroups. The moment we are above background limit, we fall back to
algorithm of sharing the disk among individual inodes and forget
about memory cgroups. Kind of awkward.

Thanks
Vivek

2011-06-07 20:43:53

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

Vivek Goyal <[email protected]> writes:

> On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
>> When the system is under background dirty memory threshold but a cgroup
>> is over its background dirty memory threshold, then only writeback
>> inodes associated with the over-limit cgroup(s).
>>
>
> [..]
>> -static inline bool over_bground_thresh(void)
>> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
>> + struct writeback_control *wbc)
>> {
>> unsigned long background_thresh, dirty_thresh;
>>
>> global_dirty_limits(&background_thresh, &dirty_thresh);
>>
>> - return (global_page_state(NR_FILE_DIRTY) +
>> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>> + if (global_page_state(NR_FILE_DIRTY) +
>> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
>> + wbc->for_cgroup = 0;
>> + return true;
>> + }
>> +
>> + wbc->for_cgroup = 1;
>> + wbc->shared_inodes = 1;
>> + return mem_cgroups_over_bground_dirty_thresh();
>> }
>
> Hi Greg,
>
> So all the logic of writeout from mem cgroup works only if system is
> below background limit. The moment we cross background limit, looks
> like we will fall back to existing way of writting inodes?

Correct. If the system is over its background limit then the previous
cgroup-unaware background writeback occurs. I think of the system
limits as those of the root cgroup. If the system is over the global
limit then all cgroups are eligible for writeback. In this situation
the current code does not distinguish between cgroups over or under
their dirty background limit.

Vivek Goyal <[email protected]> writes:
> If yes, then from design point of view it is little odd that as long
> as we are below background limit, we share the bdi between different
> cgroups. The moment we are above background limit, we fall back to
> algorithm of sharing the disk among individual inodes and forget
> about memory cgroups. Kind of awkward.
>
> This kind of cgroup writeback I think will atleast not solve the problem
> for CFQ IO controller, as we fall back to old ways of writting back inodes
> the moment we cross dirty ratio.

It might make more sense to reverse the order of the checks in the
proposed over_bground_thresh(): the new version would first check if any
memcg are over limit; assuming none are over limit, then check global
limits. Assuming that the system is over its background limit and some
cgroups are also over their limits, then the over limit cgroups would
first be written possibly getting the system below its limit. Does this
address your concern?
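
Roughly, the reordered check might look something like this (untested
sketch, only to illustrate the ordering; it uses the same helpers and
writeback_control fields as the patch above):

static inline bool over_bground_thresh(struct bdi_writeback *wb,
				       struct writeback_control *wbc)
{
	unsigned long background_thresh, dirty_thresh;

	/* give over-limit cgroups the first chance at background writeback */
	if (mem_cgroups_over_bground_dirty_thresh()) {
		wbc->for_cgroup = 1;
		wbc->shared_inodes = 1;
		return true;
	}

	/* otherwise fall back to the global background check */
	global_dirty_limits(&background_thresh, &dirty_thresh);
	wbc->for_cgroup = 0;
	return global_page_state(NR_FILE_DIRTY) +
	       global_page_state(NR_UNSTABLE_NFS) > background_thresh;
}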

Note: mem_cgroup_balance_dirty_pages() (patch 10/12) will perform
foreground writeback when a memcg is above its dirty limit. This would
offer CFQ multiple tasks issuing IO.

> Also have you done any benchmarking regarding what's the overhead of
> going through say thousands of inodes to find the inode which is eligible
> for writeback from a cgroup? I think Dave Chinner had raised this concern
> in the past.
>
> Thanks
> Vivek

I will collect some performance data measuring the cost of scanning.

2011-06-07 21:07:06

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> >> When the system is under background dirty memory threshold but a cgroup
> >> is over its background dirty memory threshold, then only writeback
> >> inodes associated with the over-limit cgroup(s).
> >>
> >
> > [..]
> >> -static inline bool over_bground_thresh(void)
> >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> >> + struct writeback_control *wbc)
> >> {
> >> unsigned long background_thresh, dirty_thresh;
> >>
> >> global_dirty_limits(&background_thresh, &dirty_thresh);
> >>
> >> - return (global_page_state(NR_FILE_DIRTY) +
> >> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> >> + if (global_page_state(NR_FILE_DIRTY) +
> >> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> >> + wbc->for_cgroup = 0;
> >> + return true;
> >> + }
> >> +
> >> + wbc->for_cgroup = 1;
> >> + wbc->shared_inodes = 1;
> >> + return mem_cgroups_over_bground_dirty_thresh();
> >> }
> >
> > Hi Greg,
> >
> > So all the logic of writeout from mem cgroup works only if system is
> > below background limit. The moment we cross background limit, looks
> > like we will fall back to existing way of writting inodes?
>
> Correct. If the system is over its background limit then the previous
> cgroup-unaware background writeback occurs. I think of the system
> limits as those of the root cgroup. If the system is over the global
> limit than all cgroups are eligible for writeback. In this situation
> the current code does not distinguish between cgroups over or under
> their dirty background limit.
>
> Vivek Goyal <[email protected]> writes:
> > If yes, then from design point of view it is little odd that as long
> > as we are below background limit, we share the bdi between different
> > cgroups. The moment we are above background limit, we fall back to
> > algorithm of sharing the disk among individual inodes and forget
> > about memory cgroups. Kind of awkward.
> >
> > This kind of cgroup writeback I think will atleast not solve the problem
> > for CFQ IO controller, as we fall back to old ways of writting back inodes
> > the moment we cross dirty ratio.
>
> It might make more sense to reverse the order of the checks in the
> proposed over_bground_thresh(): the new version would first check if any
> memcg are over limit; assuming none are over limit, then check global
> limits. Assuming that the system is over its background limit and some
> cgroups are also over their limits, then the over limit cgroups would
> first be written possibly getting the system below its limit. Does this
> address your concern?

Do you treat the root group also as any other cgroup? If not, then the above logic
can lead to starvation of root group inodes, or unfair writeback.
So I guess it will be important to treat the root group the same as other groups.

>
> Note: mem_cgroup_balance_dirty_pages() (patch 10/12) will perform
> foreground writeback when a memcg is above its dirty limit. This would
> offer CFQ multiple tasks issuing IO.

I guess we can't rely on this as this will go away once IO-less dirty
throttling is merged.

Thanks
Vivek

2011-06-08 00:08:38

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

On Tue, 07 Jun 2011 08:58:16 -0700
Greg Thelen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > On Fri, 3 Jun 2011 09:12:16 -0700
> > Greg Thelen <[email protected]> wrote:
> >
> >> Introduce memcg routines to assist in per-memcg dirty page management:
> >>
> >> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
> >> dirty memory usage against memcg foreground and background thresholds.
> >> If an over-background-threshold memcg is found, then per-memcg
> >> background writeback is queued. Per-memcg writeback differs from
> >> classic, non-memcg, per bdi writeback by setting the new
> >> writeback_control.for_cgroup bit.
> >>
> >> If an over-foreground-threshold memcg is found, then foreground
> >> writeout occurs. When performing foreground writeout, first consider
> >> inodes exclusive to the memcg. If unable to make enough progress,
> >> then consider inodes shared between memcg. Such cross-memcg inode
> >> sharing likely to be rare in situations that use per-cgroup memory
> >> isolation. The approach tries to handle the common (non-shared)
> >> case well without punishing well behaved (non-sharing) cgroups.
> >> As a last resort writeback shared inodes.
> >>
> >> This routine is used by balance_dirty_pages() in a later change.
> >>
> >> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
> >> and limits of the memcg closest to (or over) its dirty limit. This
> >> will be used by throttle_vm_writeout() in a latter change.
> >>
> >> Signed-off-by: Greg Thelen <[email protected]>
> >> ---
> >> Changelog since v7:
> >> - Add more detail to commit description.
> >>
> >> - Declare the new writeback_control for_cgroup bit in this change, the
> >> first patch that uses the new field is first used. In -v7 the field
> >> was declared in a separate patch.
> >>
> >> include/linux/memcontrol.h | 18 +++++
> >> include/linux/writeback.h | 1 +
> >> include/trace/events/memcontrol.h | 83 ++++++++++++++++++++
> >> mm/memcontrol.c | 150 +++++++++++++++++++++++++++++++++++++
> >> 4 files changed, 252 insertions(+), 0 deletions(-)
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index 3d72e09..0d0363e 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
> >> struct writeback_control *wbc);
> >> bool mem_cgroups_over_bground_dirty_thresh(void);
> >> void mem_cgroup_writeback_done(void);
> >> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> >> + struct mem_cgroup *mem,
> >> + struct dirty_info *info);
> >> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> >> + unsigned long write_chunk);
> >>
> >> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >> gfp_t gfp_mask,
> >> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
> >> {
> >> }
> >>
> >> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> >> + unsigned long write_chunk)
> >> +{
> >> +}
> >> +
> >> +static inline bool
> >> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
> >> + struct mem_cgroup *mem,
> >> + struct dirty_info *info)
> >> +{
> >> + return false;
> >> +}
> >> +
> >> static inline
> >> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> >> gfp_t gfp_mask,
> >> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> >> index 66ec339..4f5c0d2 100644
> >> --- a/include/linux/writeback.h
> >> +++ b/include/linux/writeback.h
> >> @@ -47,6 +47,7 @@ struct writeback_control {
> >> unsigned for_reclaim:1; /* Invoked from the page allocator */
> >> unsigned range_cyclic:1; /* range_start is cyclic */
> >> unsigned more_io:1; /* more io to be dispatched */
> >> + unsigned for_cgroup:1; /* enable cgroup writeback */
> >> unsigned shared_inodes:1; /* write inodes spanning cgroups */
> >> };
> >>
> >> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
> >> index 326a66b..b42dae1 100644
> >> --- a/include/trace/events/memcontrol.h
> >> +++ b/include/trace/events/memcontrol.h
> >> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
> >> __entry->first_id)
> >> )
> >>
> >> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
> >> + TP_PROTO(unsigned short css_id,
> >> + struct backing_dev_info *bdi,
> >> + unsigned long nr_reclaimable,
> >> + unsigned long thresh,
> >> + bool over_limit),
> >> +
> >> + TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
> >> +
> >> + TP_STRUCT__entry(
> >> + __field(unsigned short, css_id)
> >> + __field(struct backing_dev_info *, bdi)
> >> + __field(unsigned long, nr_reclaimable)
> >> + __field(unsigned long, thresh)
> >> + __field(bool, over_limit)
> >> + ),
> >> +
> >> + TP_fast_assign(
> >> + __entry->css_id = css_id;
> >> + __entry->bdi = bdi;
> >> + __entry->nr_reclaimable = nr_reclaimable;
> >> + __entry->thresh = thresh;
> >> + __entry->over_limit = over_limit;
> >> + ),
> >> +
> >> + TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
> >> + "over_limit=%d", __entry->css_id, __entry->bdi,
> >> + __entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
> >> +)
> >> +
> >> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
> >> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
> >> + TP_PROTO(unsigned short id, \
> >> + struct backing_dev_info *bdi, \
> >> + unsigned long nr_reclaimable, \
> >> + unsigned long thresh, \
> >> + bool over_limit), \
> >> + TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
> >> +)
> >> +
> >> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
> >> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
> >> +
> >> +TRACE_EVENT(mem_cgroup_fg_writeback,
> >> + TP_PROTO(unsigned long write_chunk,
> >> + struct writeback_control *wbc),
> >> +
> >> + TP_ARGS(write_chunk, wbc),
> >> +
> >> + TP_STRUCT__entry(
> >> + __field(unsigned long, write_chunk)
> >> + __field(long, wbc_to_write)
> >> + __field(bool, shared_inodes)
> >> + ),
> >> +
> >> + TP_fast_assign(
> >> + __entry->write_chunk = write_chunk;
> >> + __entry->wbc_to_write = wbc->nr_to_write;
> >> + __entry->shared_inodes = wbc->shared_inodes;
> >> + ),
> >> +
> >> + TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
> >> + __entry->write_chunk,
> >> + __entry->wbc_to_write,
> >> + __entry->shared_inodes)
> >> +)
> >> +
> >> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
> >> + TP_PROTO(unsigned short css_id),
> >> +
> >> + TP_ARGS(css_id),
> >> +
> >> + TP_STRUCT__entry(
> >> + __field(unsigned short, css_id)
> >> + ),
> >> +
> >> + TP_fast_assign(
> >> + __entry->css_id = css_id;
> >> + ),
> >> +
> >> + TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
> >> +)
> >> +
> >> #endif /* _TRACE_MEMCONTROL_H */
> >>
> >> /* This part must be outside protection */
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index a5b1794..17cb888 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
> >> }
> >> }
> >>
> >> +/*
> >> + * This routine must be called by processes which are generating dirty pages.
> >> + * It considers the dirty pages usage and thresholds of the current cgroup and
> >> + * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
> >> + * the considered memcg are over their background dirty limit, then background
> >> + * writeback is queued. If any are over the foreground dirty limit then
> >> + * throttle the dirtying task while writing dirty data. The per-memcg dirty
> >> + * limits check by this routine are distinct from either the per-system,
> >> + * per-bdi, or per-task limits considered by balance_dirty_pages().
> >> + */
> >> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
> >> + unsigned long write_chunk)
> >> +{
> >> + struct backing_dev_info *bdi = mapping->backing_dev_info;
> >> + struct mem_cgroup *mem;
> >> + struct mem_cgroup *ref_mem;
> >> + struct dirty_info info;
> >> + unsigned long nr_reclaimable;
> >> + unsigned long sys_available_mem;
> >> + unsigned long pause = 1;
> >> + unsigned short id;
> >> + bool over;
> >> + bool shared_inodes;
> >> +
> >> + if (mem_cgroup_disabled())
> >> + return;
> >> +
> >> + sys_available_mem = determine_dirtyable_memory();
> >> +
> >> + /* reference the memcg so it is not deleted during this routine */
> >> + rcu_read_lock();
> >> + mem = mem_cgroup_from_task(current);
> >> + if (mem && mem_cgroup_is_root(mem))
> >> + mem = NULL;
> >> + if (mem)
> >> + css_get(&mem->css);
> >> + rcu_read_unlock();
> >> + ref_mem = mem;
> >> +
> >> + /* balance entire ancestry of current's mem. */
> >> + for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
> >> + id = css_id(&mem->css);
> >> +
> >
> > Hmm, this sounds natural...but...don't we need to restart checking from ref_mem's
> > dirty_ratio once we find an ancestor is over dirty_ratio and we slept ?
> >
> > Even if parent's dirty ratio comes to be clean state, children's may not.
> > So, I think some "restart loop" jump after io_schedule_timeout().
> >
> > Thanks,
> > -Kame
>
> I do not think that we need to restart, but maybe you have a case in
> mind that I am not considering.
>
> Example hierarchy:
>               root
>          A            B
>      A1      A2
>   A11 A12  A21 A22
>
> Assume that mem_cgroup_balance_dirty_pages(A11), so ref_mem=A11.
>
> We start at A11 and walk up towards the root. If A11 is over limit,
> then write A11 until under limit. Next check A1, if over limit then
> write A1,A11,A12. Then check A. If A is over A limit, then we invoke
> writeback on A* until A is under A limit. Are you concerned that while
> performing writeback on A* that other tasks may push A1 over the A1
> limit? Such other task writers would also be calling
> mem_cgroup_balance_dirty_pages() later.
>

Hm, ok. Could you add comments to explain the algorithm ?

Thanks,
-Kame

2011-06-08 00:25:21

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, 7 Jun 2011 17:05:40 -0400
Vivek Goyal <[email protected]> wrote:

> On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
> > Vivek Goyal <[email protected]> writes:
> >
> > > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> > >> When the system is under background dirty memory threshold but a cgroup
> > >> is over its background dirty memory threshold, then only writeback
> > >> inodes associated with the over-limit cgroup(s).
> > >>
> > >
> > > [..]
> > >> -static inline bool over_bground_thresh(void)
> > >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> > >> + struct writeback_control *wbc)
> > >> {
> > >> unsigned long background_thresh, dirty_thresh;
> > >>
> > >> global_dirty_limits(&background_thresh, &dirty_thresh);
> > >>
> > >> - return (global_page_state(NR_FILE_DIRTY) +
> > >> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> > >> + if (global_page_state(NR_FILE_DIRTY) +
> > >> + global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> > >> + wbc->for_cgroup = 0;
> > >> + return true;
> > >> + }
> > >> +
> > >> + wbc->for_cgroup = 1;
> > >> + wbc->shared_inodes = 1;
> > >> + return mem_cgroups_over_bground_dirty_thresh();
> > >> }
> > >
> > > Hi Greg,
> > >
> > > So all the logic of writeout from mem cgroup works only if system is
> > > below background limit. The moment we cross background limit, looks
> > > like we will fall back to existing way of writting inodes?
> >
> > Correct. If the system is over its background limit then the previous
> > cgroup-unaware background writeback occurs. I think of the system
> > limits as those of the root cgroup. If the system is over the global
> > limit than all cgroups are eligible for writeback. In this situation
> > the current code does not distinguish between cgroups over or under
> > their dirty background limit.
> >
> > Vivek Goyal <[email protected]> writes:
> > > If yes, then from design point of view it is little odd that as long
> > > as we are below background limit, we share the bdi between different
> > > cgroups. The moment we are above background limit, we fall back to
> > > algorithm of sharing the disk among individual inodes and forget
> > > about memory cgroups. Kind of awkward.
> > >
> > > This kind of cgroup writeback I think will atleast not solve the problem
> > > for CFQ IO controller, as we fall back to old ways of writting back inodes
> > > the moment we cross dirty ratio.
> >
> > It might make more sense to reverse the order of the checks in the
> > proposed over_bground_thresh(): the new version would first check if any
> > memcg are over limit; assuming none are over limit, then check global
> > limits. Assuming that the system is over its background limit and some
> > cgroups are also over their limits, then the over limit cgroups would
> > first be written possibly getting the system below its limit. Does this
> > address your concern?
>
> Do you treat root group also as any other cgroup? If no, then above logic
> can lead to issue of starvation of root group inode. Or unfair writeback.
> So I guess it will be important to treat root group same as other groups.
>

As far as I can say, you should not place programs onto ROOT cgroups if you need
performance isolation.

From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
set and background writeback can work for ROOT cgroup seamlessly.

Thanks,
-Kame

2011-06-08 01:51:00

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 10/12] memcg: create support routines for page-writeback

On Tue, Jun 7, 2011 at 5:01 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 07 Jun 2011 08:58:16 -0700
> Greg Thelen <[email protected]> wrote:
>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > On Fri,  3 Jun 2011 09:12:16 -0700
>> > Greg Thelen <[email protected]> wrote:
>> >
>> >> Introduce memcg routines to assist in per-memcg dirty page management:
>> >>
>> >> - mem_cgroup_balance_dirty_pages() walks a memcg hierarchy comparing
>> >>   dirty memory usage against memcg foreground and background thresholds.
>> >>   If an over-background-threshold memcg is found, then per-memcg
>> >>   background writeback is queued.  Per-memcg writeback differs from
>> >>   classic, non-memcg, per bdi writeback by setting the new
>> >>   writeback_control.for_cgroup bit.
>> >>
>> >>   If an over-foreground-threshold memcg is found, then foreground
>> >>   writeout occurs.  When performing foreground writeout, first consider
>> >>   inodes exclusive to the memcg.  If unable to make enough progress,
>> >>   then consider inodes shared between memcg.  Such cross-memcg inode
>> >>   sharing likely to be rare in situations that use per-cgroup memory
>> >>   isolation.  The approach tries to handle the common (non-shared)
>> >>   case well without punishing well behaved (non-sharing) cgroups.
>> >>   As a last resort writeback shared inodes.
>> >>
>> >>   This routine is used by balance_dirty_pages() in a later change.
>> >>
>> >> - mem_cgroup_hierarchical_dirty_info() returns the dirty memory usage
>> >>   and limits of the memcg closest to (or over) its dirty limit.  This
>> >>   will be used by throttle_vm_writeout() in a latter change.
>> >>
>> >> Signed-off-by: Greg Thelen <[email protected]>
>> >> ---
>> >> Changelog since v7:
>> >> - Add more detail to commit description.
>> >>
>> >> - Declare the new writeback_control for_cgroup bit in this change, the
>> >>   first patch that uses the new field is first used.  In -v7 the field
>> >>   was declared in a separate patch.
>> >>
>> >>  include/linux/memcontrol.h        |   18 +++++
>> >>  include/linux/writeback.h         |    1 +
>> >>  include/trace/events/memcontrol.h |   83 ++++++++++++++++++++
>> >>  mm/memcontrol.c                   |  150 +++++++++++++++++++++++++++++++++++++
>> >>  4 files changed, 252 insertions(+), 0 deletions(-)
>> >>
>> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> >> index 3d72e09..0d0363e 100644
>> >> --- a/include/linux/memcontrol.h
>> >> +++ b/include/linux/memcontrol.h
>> >> @@ -167,6 +167,11 @@ bool should_writeback_mem_cgroup_inode(struct inode *inode,
>> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct writeback_control *wbc);
>> >> ?bool mem_cgroups_over_bground_dirty_thresh(void);
>> >> ?void mem_cgroup_writeback_done(void);
>> >> +bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct mem_cgroup *mem,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?struct dirty_info *info);
>> >> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long write_chunk);
>> >>
>> >> ?unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask,
>> >> @@ -383,6 +388,19 @@ static inline void mem_cgroup_writeback_done(void)
>> >> ?{
>> >> ?}
>> >>
>> >> +static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long write_chunk)
>> >> +{
>> >> +}
>> >> +
>> >> +static inline bool
>> >> +mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct mem_cgroup *mem,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct dirty_info *info)
>> >> +{
>> >> + ?return false;
>> >> +}
>> >> +
>> >> ?static inline
>> >> ?unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
>> >> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?gfp_t gfp_mask,
>> >> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> >> index 66ec339..4f5c0d2 100644
>> >> --- a/include/linux/writeback.h
>> >> +++ b/include/linux/writeback.h
>> >> @@ -47,6 +47,7 @@ struct writeback_control {
>> >> ? ?unsigned for_reclaim:1; ? ? ? ? /* Invoked from the page allocator */
>> >> ? ?unsigned range_cyclic:1; ? ? ? ?/* range_start is cyclic */
>> >> ? ?unsigned more_io:1; ? ? ? ? ? ? /* more io to be dispatched */
>> >> + ?unsigned for_cgroup:1; ? ? ? ? ?/* enable cgroup writeback */
>> >> ? ?unsigned shared_inodes:1; ? ? ? /* write inodes spanning cgroups */
>> >> ?};
>> >>
>> >> diff --git a/include/trace/events/memcontrol.h b/include/trace/events/memcontrol.h
>> >> index 326a66b..b42dae1 100644
>> >> --- a/include/trace/events/memcontrol.h
>> >> +++ b/include/trace/events/memcontrol.h
>> >> @@ -109,6 +109,89 @@ TRACE_EVENT(mem_cgroups_over_bground_dirty_thresh,
>> >> ? ? ? ? ? ? ?__entry->first_id)
>> >> ?)
>> >>
>> >> +DECLARE_EVENT_CLASS(mem_cgroup_consider_writeback,
>> >> + ?TP_PROTO(unsigned short css_id,
>> >> + ? ? ? ? ? struct backing_dev_info *bdi,
>> >> + ? ? ? ? ? unsigned long nr_reclaimable,
>> >> + ? ? ? ? ? unsigned long thresh,
>> >> + ? ? ? ? ? bool over_limit),
>> >> +
>> >> + ?TP_ARGS(css_id, bdi, nr_reclaimable, thresh, over_limit),
>> >> +
>> >> + ?TP_STRUCT__entry(
>> >> + ? ? ? ? ?__field(unsigned short, css_id)
>> >> + ? ? ? ? ?__field(struct backing_dev_info *, bdi)
>> >> + ? ? ? ? ?__field(unsigned long, nr_reclaimable)
>> >> + ? ? ? ? ?__field(unsigned long, thresh)
>> >> + ? ? ? ? ?__field(bool, over_limit)
>> >> + ?),
>> >> +
>> >> + ?TP_fast_assign(
>> >> + ? ? ? ? ?__entry->css_id = css_id;
>> >> + ? ? ? ? ?__entry->bdi = bdi;
>> >> + ? ? ? ? ?__entry->nr_reclaimable = nr_reclaimable;
>> >> + ? ? ? ? ?__entry->thresh = thresh;
>> >> + ? ? ? ? ?__entry->over_limit = over_limit;
>> >> + ?),
>> >> +
>> >> + ?TP_printk("css_id=%d bdi=%p nr_reclaimable=%ld thresh=%ld "
>> >> + ? ? ? ? ? ?"over_limit=%d", __entry->css_id, __entry->bdi,
>> >> + ? ? ? ? ? ?__entry->nr_reclaimable, __entry->thresh, __entry->over_limit)
>> >> +)
>> >> +
>> >> +#define DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(name) \
>> >> +DEFINE_EVENT(mem_cgroup_consider_writeback, name, \
>> >> + ?TP_PROTO(unsigned short id, \
>> >> + ? ? ? ? ? struct backing_dev_info *bdi, \
>> >> + ? ? ? ? ? unsigned long nr_reclaimable, \
>> >> + ? ? ? ? ? unsigned long thresh, \
>> >> + ? ? ? ? ? bool over_limit), \
>> >> + ?TP_ARGS(id, bdi, nr_reclaimable, thresh, over_limit) \
>> >> +)
>> >> +
>> >> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_bg_writeback);
>> >> +DEFINE_MEM_CGROUP_CONSIDER_WRITEBACK_EVENT(mem_cgroup_consider_fg_writeback);
>> >> +
>> >> +TRACE_EVENT(mem_cgroup_fg_writeback,
>> >> + ?TP_PROTO(unsigned long write_chunk,
>> >> + ? ? ? ? ? struct writeback_control *wbc),
>> >> +
>> >> + ?TP_ARGS(write_chunk, wbc),
>> >> +
>> >> + ?TP_STRUCT__entry(
>> >> + ? ? ? ? ?__field(unsigned long, write_chunk)
>> >> + ? ? ? ? ?__field(long, wbc_to_write)
>> >> + ? ? ? ? ?__field(bool, shared_inodes)
>> >> + ?),
>> >> +
>> >> + ?TP_fast_assign(
>> >> + ? ? ? ? ?__entry->write_chunk = write_chunk;
>> >> + ? ? ? ? ?__entry->wbc_to_write = wbc->nr_to_write;
>> >> + ? ? ? ? ?__entry->shared_inodes = wbc->shared_inodes;
>> >> + ?),
>> >> +
>> >> + ?TP_printk("write_chunk=%ld nr_to_write=%ld shared_inodes=%d",
>> >> + ? ? ? ? ? ?__entry->write_chunk,
>> >> + ? ? ? ? ? ?__entry->wbc_to_write,
>> >> + ? ? ? ? ? ?__entry->shared_inodes)
>> >> +)
>> >> +
>> >> +TRACE_EVENT(mem_cgroup_enable_shared_writeback,
>> >> + ?TP_PROTO(unsigned short css_id),
>> >> +
>> >> + ?TP_ARGS(css_id),
>> >> +
>> >> + ?TP_STRUCT__entry(
>> >> + ? ? ? ? ?__field(unsigned short, css_id)
>> >> + ? ? ? ? ?),
>> >> +
>> >> + ?TP_fast_assign(
>> >> + ? ? ? ? ?__entry->css_id = css_id;
>> >> + ? ? ? ? ?),
>> >> +
>> >> + ?TP_printk("enabling shared writeback for memcg %d", __entry->css_id)
>> >> +)
>> >> +
>> >> ?#endif /* _TRACE_MEMCONTROL_H */
>> >>
>> >> ?/* This part must be outside protection */
>> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> index a5b1794..17cb888 100644
>> >> --- a/mm/memcontrol.c
>> >> +++ b/mm/memcontrol.c
>> >> @@ -1622,6 +1622,156 @@ void mem_cgroup_writeback_done(void)
>> >> ? ?}
>> >> ?}
>> >>
>> >> +/*
>> >> + * This routine must be called by processes which are generating dirty pages.
>> >> + * It considers the dirty pages usage and thresholds of the current cgroup and
>> >> + * (depending if hierarchical accounting is enabled) ancestral memcg. ?If any of
>> >> + * the considered memcg are over their background dirty limit, then background
>> >> + * writeback is queued. ?If any are over the foreground dirty limit then
>> >> + * throttle the dirtying task while writing dirty data. ?The per-memcg dirty
>> >> + * limits check by this routine are distinct from either the per-system,
>> >> + * per-bdi, or per-task limits considered by balance_dirty_pages().
>> >> + */
>> >> +void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
>> >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?unsigned long write_chunk)
>> >> +{
>> >> + ?struct backing_dev_info *bdi = mapping->backing_dev_info;
>> >> + ?struct mem_cgroup *mem;
>> >> + ?struct mem_cgroup *ref_mem;
>> >> + ?struct dirty_info info;
>> >> + ?unsigned long nr_reclaimable;
>> >> + ?unsigned long sys_available_mem;
>> >> + ?unsigned long pause = 1;
>> >> + ?unsigned short id;
>> >> + ?bool over;
>> >> + ?bool shared_inodes;
>> >> +
>> >> + ?if (mem_cgroup_disabled())
>> >> + ? ? ? ? ?return;
>> >> +
>> >> + ?sys_available_mem = determine_dirtyable_memory();
>> >> +
>> >> + ?/* reference the memcg so it is not deleted during this routine */
>> >> + ?rcu_read_lock();
>> >> + ?mem = mem_cgroup_from_task(current);
>> >> + ?if (mem && mem_cgroup_is_root(mem))
>> >> + ? ? ? ? ?mem = NULL;
>> >> + ?if (mem)
>> >> + ? ? ? ? ?css_get(&mem->css);
>> >> + ?rcu_read_unlock();
>> >> + ?ref_mem = mem;
>> >> +
>> >> + ?/* balance entire ancestry of current's mem. */
>> >> + ?for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {
>> >> + ? ? ? ? ?id = css_id(&mem->css);
>> >> +
>> >
>> > Hmm, this sounds natural...but...don't we need to restart checking from ref_mem's
>> > dirty_ratio once we find an ancestor is over dirty_ratio and we slept ?
>> >
>> > Even if parent's dirty ratio comes to be clean state, children's may not.
>> > So, I think some "restart loop" jump after io_schedule_timeout().
>> >
>> > Thanks,
>> > -Kame
>>
>> I do not think that we need to restart, but maybe you have a case in
>> mind that I am not considering.
>>
>> Example hierarchy:
>>               root
>>          A            B
>>      A1      A2
>>   A11 A12  A21 A22
>>
>> Assume that mem_cgroup_balance_dirty_pages(A11), so ref_mem=A11.
>>
>> We start at A11 and walk up towards the root.  If A11 is over limit,
>> then write A11 until under limit.  Next check A1, if over limit then
>> write A1,A11,A12.  Then check A.  If A is over A limit, then we invoke
>> writeback on A* until A is under A limit.  Are you concerned that while
>> performing writeback on A* that other tasks may push A1 over the A1
>> limit?  Such other task writers would also be calling
>> mem_cgroup_balance_dirty_pages() later.
>>
>
> Hm, ok. Could you add comments to explain the algorithm ?
>
> Thanks,
> -Kame

No problem. I will add comments to this routine in -v9 to clarify.

2011-06-08 04:02:46

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, Jun 7, 2011 at 5:18 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 7 Jun 2011 17:05:40 -0400
> Vivek Goyal <[email protected]> wrote:
>
>> On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
>> > Vivek Goyal <[email protected]> writes:
>> >
>> > > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
>> > >> When the system is under background dirty memory threshold but a cgroup
>> > >> is over its background dirty memory threshold, then only writeback
>> > >> inodes associated with the over-limit cgroup(s).
>> > >>
>> > >
>> > > [..]
>> > >> -static inline bool over_bground_thresh(void)
>> > >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
>> > >> +                                       struct writeback_control *wbc)
>> > >>  {
>> > >>          unsigned long background_thresh, dirty_thresh;
>> > >>
>> > >>          global_dirty_limits(&background_thresh, &dirty_thresh);
>> > >>
>> > >> -        return (global_page_state(NR_FILE_DIRTY) +
>> > >> -                global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>> > >> +        if (global_page_state(NR_FILE_DIRTY) +
>> > >> +            global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
>> > >> +                wbc->for_cgroup = 0;
>> > >> +                return true;
>> > >> +        }
>> > >> +
>> > >> +        wbc->for_cgroup = 1;
>> > >> +        wbc->shared_inodes = 1;
>> > >> +        return mem_cgroups_over_bground_dirty_thresh();
>> > >>  }
>> > >
>> > > Hi Greg,
>> > >
>> > > So all the logic of writeout from mem cgroup works only if system is
>> > > below background limit. The moment we cross background limit, looks
>> > > like we will fall back to existing way of writting inodes?
>> >
>> > > Correct.  If the system is over its background limit then the previous
>> > > cgroup-unaware background writeback occurs.  I think of the system
>> > > limits as those of the root cgroup.  If the system is over the global
>> > > limit than all cgroups are eligible for writeback.  In this situation
>> > the current code does not distinguish between cgroups over or under
>> > their dirty background limit.
>> >
>> > Vivek Goyal <[email protected]> writes:
>> > > If yes, then from design point of view it is little odd that as long
>> > > as we are below background limit, we share the bdi between different
>> > > cgroups. The moment we are above background limit, we fall back to
>> > > algorithm of sharing the disk among individual inodes and forget
>> > > about memory cgroups. Kind of awkward.
>> > >
>> > > This kind of cgroup writeback I think will atleast not solve the problem
>> > > for CFQ IO controller, as we fall back to old ways of writting back inodes
>> > > the moment we cross dirty ratio.
>> >
>> > It might make more sense to reverse the order of the checks in the
>> > proposed over_bground_thresh(): the new version would first check if any
>> > memcg are over limit; assuming none are over limit, then check global
>> > > limits.  Assuming that the system is over its background limit and some
>> > > cgroups are also over their limits, then the over limit cgroups would
>> > > first be written possibly getting the system below its limit.  Does this
>> > address your concern?
>>
>> Do you treat root group also as any other cgroup? If no, then above logic
>> can lead to issue of starvation of root group inode. Or unfair writeback.
>> So I guess it will be important to treat root group same as other groups.
>>
>
> As far as I can say, you should not place programs onto ROOT cgroups if you need
> performance isolation.

Agreed.

> From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
> set and background writeback can work for ROOT cgroup seamlessly.
>
> Thanks,
> -Kame

Not quite. The proposed patches do not set the "1" bit (css_id of
root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
introduces the following balancing loop:
+       /* balance entire ancestry of current's mem. */
+       for (; mem_cgroup_has_dirty_limit(mem); mem = parent_mem_cgroup(mem)) {

The loop terminates when mem_cgroup_has_dirty_limit() is called for
the root cgroup. The bitmap is set in the body of the loop. So the
root cgroup's bit (bit 1) will never be set in the bitmap. However, I
think the effect is the same. The proposed changes in this patch
(11/12) have background writeback first checking if the system is over
limit and, if yes, then b_dirty inodes from any cgroup are written. This
means that a small system background limit with an over-{fg or
bg}-limit cgroup could cause other cgroups that are not over their
limit to have their inodes written back. In a system-over-limit
situation normal system-wide bdi writeback is used (writing inodes in
b_dirty order). For those who want isolation, a simple rule to avoid
this is to ensure that the sum of all cgroup background_limits is
less than the system background limit.
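
For example (made up numbers, just to illustrate the rule): with a 1000MB
system background limit and three cgroups each given a 200MB background
limit, the cgroups together account for at most 600MB of dirty memory
while under their limits, so they alone cannot push the system over its
background limit; only dirty memory outside them (e.g. root group tasks)
could trigger the cgroup-unaware system-wide writeback.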

2011-06-08 04:10:27

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, 7 Jun 2011 21:02:21 -0700
Greg Thelen <[email protected]> wrote:

> On Tue, Jun 7, 2011 at 5:18 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Tue, 7 Jun 2011 17:05:40 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> >> On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
> >> > Vivek Goyal <[email protected]> writes:
> >> >
> >> > > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> >> > >> When the system is under background dirty memory threshold but a cgroup
> >> > >> is over its background dirty memory threshold, then only writeback
> >> > >> inodes associated with the over-limit cgroup(s).
> >> > >>
> >> > >
> >> > > [..]
> >> > >> -static inline bool over_bground_thresh(void)
> >> > >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> >> > >> +                                       struct writeback_control *wbc)
> >> > >>  {
> >> > >>          unsigned long background_thresh, dirty_thresh;
> >> > >>
> >> > >>          global_dirty_limits(&background_thresh, &dirty_thresh);
> >> > >>
> >> > >> -        return (global_page_state(NR_FILE_DIRTY) +
> >> > >> -                global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> >> > >> +        if (global_page_state(NR_FILE_DIRTY) +
> >> > >> +            global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> >> > >> +                wbc->for_cgroup = 0;
> >> > >> +                return true;
> >> > >> +        }
> >> > >> +
> >> > >> +        wbc->for_cgroup = 1;
> >> > >> +        wbc->shared_inodes = 1;
> >> > >> +        return mem_cgroups_over_bground_dirty_thresh();
> >> > >>  }
> >> > >
> >> > > Hi Greg,
> >> > >
> >> > > So all the logic of writeout from mem cgroup works only if system is
> >> > > below background limit. The moment we cross background limit, looks
> >> > > like we will fall back to existing way of writting inodes?
> >> >
> >> > Correct.  If the system is over its background limit then the previous
> >> > cgroup-unaware background writeback occurs.  I think of the system
> >> > limits as those of the root cgroup.  If the system is over the global
> >> > limit than all cgroups are eligible for writeback.  In this situation
> >> > the current code does not distinguish between cgroups over or under
> >> > their dirty background limit.
> >> >
> >> > Vivek Goyal <[email protected]> writes:
> >> > > If yes, then from design point of view it is little odd that as long
> >> > > as we are below background limit, we share the bdi between different
> >> > > cgroups. The moment we are above background limit, we fall back to
> >> > > algorithm of sharing the disk among individual inodes and forget
> >> > > about memory cgroups. Kind of awkward.
> >> > >
> >> > > This kind of cgroup writeback I think will atleast not solve the problem
> >> > > for CFQ IO controller, as we fall back to old ways of writting back inodes
> >> > > the moment we cross dirty ratio.
> >> >
> >> > It might make more sense to reverse the order of the checks in the
> >> > proposed over_bground_thresh(): the new version would first check if any
> >> > memcg are over limit; assuming none are over limit, then check global
> >> > limits.  Assuming that the system is over its background limit and some
> >> > cgroups are also over their limits, then the over limit cgroups would
> >> > first be written possibly getting the system below its limit.  Does this
> >> > address your concern?
> >>
> >> Do you treat root group also as any other cgroup? If no, then above logic
> >> can lead to issue of starvation of root group inode. Or unfair writeback.
> >> So I guess it will be important to treat root group same as other groups.
> >>
> >
> > As far as I can say, you should not place programs onto ROOT cgroups if you need
> > performance isolation.
>
> Agreed.
>
> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
> > set and background writeback can work for ROOT cgroup seamlessly.
> >
> > Thanks,
> > -Kame
>
> Not quite. The proposed patches do not set the "1" bit (css_id of
> root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
> introduces the following balancing loop:
> + /* balance entire ancestry of current's mem. */
> + for (; mem_cgroup_has_dirty_limit(mem); mem =
> parent_mem_cgroup(mem)) {
>
> The loop terminates when mem_cgroup_has_dirty_limit() is called for
> the root cgroup. The bitmap is set in the body of the loop. So the
> root cgroup's bit (bit 1) will never be set in the bitmap. However, I
> think the effect is the same. The proposed changes in this patch
> (11/12) have background writeback first checking if the system is over
> limit and if yes, then b_dirty inodes from any cgroup written. This
> means that a small system background limit with an over-{fg or
> bg}-limit cgroup could cause other cgroups that are not over their
> limit to have their inodes written back. In an system-over-limit
> situation normal system-wide bdi writeback is used (writing inodes in
> b_dirty order). For those who want isolation, a simple rule to avoid
> this is to ensure that that sum of all cgroup background_limits is
> less than the system background limit.
>

Hmm, should we add the rule?
How about disallowing setting dirty_ratio bigger than the system's one?

Thanks,
-Kame




2011-06-08 05:20:51

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, Jun 7, 2011 at 9:03 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Tue, 7 Jun 2011 21:02:21 -0700
> Greg Thelen <[email protected]> wrote:
>
>> On Tue, Jun 7, 2011 at 5:18 PM, KAMEZAWA Hiroyuki
>> <[email protected]> wrote:
>> > On Tue, 7 Jun 2011 17:05:40 -0400
>> > Vivek Goyal <[email protected]> wrote:
>> >
>> >> On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
>> >> > Vivek Goyal <[email protected]> writes:
>> >> >
>> >> > > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
>> >> > >> When the system is under background dirty memory threshold but a cgroup
>> >> > >> is over its background dirty memory threshold, then only writeback
>> >> > >> inodes associated with the over-limit cgroup(s).
>> >> > >>
>> >> > >
>> >> > > [..]
>> >> > >> -static inline bool over_bground_thresh(void)
>> >> > >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
>> >> > >> +                                       struct writeback_control *wbc)
>> >> > >>  {
>> >> > >>          unsigned long background_thresh, dirty_thresh;
>> >> > >>
>> >> > >>          global_dirty_limits(&background_thresh, &dirty_thresh);
>> >> > >>
>> >> > >> -        return (global_page_state(NR_FILE_DIRTY) +
>> >> > >> -                global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>> >> > >> +        if (global_page_state(NR_FILE_DIRTY) +
>> >> > >> +            global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
>> >> > >> +                wbc->for_cgroup = 0;
>> >> > >> +                return true;
>> >> > >> +        }
>> >> > >> +
>> >> > >> +        wbc->for_cgroup = 1;
>> >> > >> +        wbc->shared_inodes = 1;
>> >> > >> +        return mem_cgroups_over_bground_dirty_thresh();
>> >> > >>  }
>> >> > >
>> >> > > Hi Greg,
>> >> > >
>> >> > > So all the logic of writeout from mem cgroup works only if system is
>> >> > > below background limit. The moment we cross background limit, looks
>> >> > > like we will fall back to existing way of writting inodes?
>> >> >
>> >> > Correct.  If the system is over its background limit then the previous
>> >> > cgroup-unaware background writeback occurs.  I think of the system
>> >> > limits as those of the root cgroup.  If the system is over the global
>> >> > limit than all cgroups are eligible for writeback.  In this situation
>> >> > the current code does not distinguish between cgroups over or under
>> >> > their dirty background limit.
>> >> >
>> >> > Vivek Goyal <[email protected]> writes:
>> >> > > If yes, then from design point of view it is little odd that as long
>> >> > > as we are below background limit, we share the bdi between different
>> >> > > cgroups. The moment we are above background limit, we fall back to
>> >> > > algorithm of sharing the disk among individual inodes and forget
>> >> > > about memory cgroups. Kind of awkward.
>> >> > >
>> >> > > This kind of cgroup writeback I think will atleast not solve the problem
>> >> > > for CFQ IO controller, as we fall back to old ways of writting back inodes
>> >> > > the moment we cross dirty ratio.
>> >> >
>> >> > It might make more sense to reverse the order of the checks in the
>> >> > proposed over_bground_thresh(): the new version would first check if any
>> >> > memcg are over limit; assuming none are over limit, then check global
>> >> > limits.  Assuming that the system is over its background limit and some
>> >> > cgroups are also over their limits, then the over limit cgroups would
>> >> > first be written possibly getting the system below its limit.  Does this
>> >> > address your concern?
>> >>
>> >> Do you treat root group also as any other cgroup? If no, then above logic
>> >> can lead to issue of starvation of root group inode. Or unfair writeback.
>> >> So I guess it will be important to treat root group same as other groups.
>> >>
>> >
>> > As far as I can say, you should not place programs onto ROOT cgroups if you need
>> > performance isolation.
>>
>> Agreed.
>>
>> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
>> > set and background writeback can work for ROOT cgroup seamlessly.
>> >
>> > Thanks,
>> > -Kame
>>
>> Not quite.  The proposed patches do not set the "1" bit (css_id of
>> root is 1).  mem_cgroup_balance_dirty_pages() (from patch 10/12)
>> introduces the following balancing loop:
>> +       /* balance entire ancestry of current's mem. */
>> +       for (; mem_cgroup_has_dirty_limit(mem); mem =
>> parent_mem_cgroup(mem)) {
>>
>> The loop terminates when mem_cgroup_has_dirty_limit() is called for
>> the root cgroup.  The bitmap is set in the body of the loop.  So the
>> root cgroup's bit (bit 1) will never be set in the bitmap.  However, I
>> think the effect is the same.  The proposed changes in this patch
>> (11/12) have background writeback first checking if the system is over
>> limit and if yes, then b_dirty inodes from any cgroup written.  This
>> means that a small system background limit with an over-{fg or
>> bg}-limit cgroup could cause other cgroups that are not over their
>> limit to have their inodes written back.  In an system-over-limit
>> situation normal system-wide bdi writeback is used (writing inodes in
>> b_dirty order).  For those who want isolation, a simple rule to avoid
>> this is to ensure that that sum of all cgroup background_limits is
>> less than the system background limit.
>>
>
> Hmm, should we add the rule ?
> How about disallowing to set dirty_ratio bigger than system's one ?

The rule needs to consider all cgroups when adjusting any cgroup's (or
the system's) effective background limit:

check_rule()
{
	cgroup_bg_bytes = 0;
	for_each_mem_cgroup_tree(root, mem)
		cgroup_bg_bytes += mem->dirty_param.dirty_background_bytes;

	assert cgroup_bg_bytes < effective_background_limit;
}

There may be more aggressive checks (lower values of cgroup_bg_bytes) if
hierarchy is enabled. If hierarchy is enabled the cgroup limits may
be more restrictive than just the sum of all. But the sum of all is
simpler. Enforcing this rule would disallow background over-commit.

If a global dirty ratio (rather than byte count) is set, then the
effective_background_limit is a function of
global_reclaimable_pages(), which can fluctuate as the number of lru
pages changes (e.g. mlock may lower effective_background_limit). So
the rule could be true when the limits are set, but when an mlock
occurs the rule would need to be reevaluated. This feels way too
complex. So my thinking is not to enforce this rule in code. I will
plan to add this guidance to the memcg Documentation.
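
To spell out the fluctuation: when a background ratio rather than a byte
count is configured, the global code computes roughly (a sketch, not the
exact kernel expression):

	effective_background_limit ~=
		dirty_background_ratio * determine_dirtyable_memory() / 100;

and determine_dirtyable_memory() shrinks as pages leave the LRUs (mlock
being one example), so the right-hand side of the rule above moves at
runtime.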

2011-06-08 20:41:13

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Tue, Jun 07, 2011 at 09:02:21PM -0700, Greg Thelen wrote:

[..]
> > As far as I can say, you should not place programs onto ROOT cgroups if you need
> > performance isolation.
>
> Agreed.
>
> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
> > set and background writeback can work for ROOT cgroup seamlessly.
> >
> > Thanks,
> > -Kame
>
> Not quite. The proposed patches do not set the "1" bit (css_id of
> root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
> introduces the following balancing loop:
> + /* balance entire ancestry of current's mem. */
> + for (; mem_cgroup_has_dirty_limit(mem); mem =
> parent_mem_cgroup(mem)) {
>
> The loop terminates when mem_cgroup_has_dirty_limit() is called for
> the root cgroup. The bitmap is set in the body of the loop. So the
> root cgroup's bit (bit 1) will never be set in the bitmap. However, I
> think the effect is the same. The proposed changes in this patch
> (11/12) have background writeback first checking if the system is over
> limit and if yes, then b_dirty inodes from any cgroup written. This
> means that a small system background limit with an over-{fg or
> bg}-limit cgroup could cause other cgroups that are not over their
> limit to have their inodes written back. In an system-over-limit
> situation normal system-wide bdi writeback is used (writing inodes in
> b_dirty order). For those who want isolation, a simple rule to avoid
> this is to ensure that that sum of all cgroup background_limits is
> less than the system background limit.

Ok, we seem to be mixing multiple things.

- First of all, I thought running apps in the root group is a very valid
use case. Generally by default we run everything in the root group and
once somebody notices that an application or group of applications
is a memory hog, that can be moved out into a cgroup of its own with
upper limits.

- Secondly, the root starvation issue is not present as long as we fall
back to the normal way of writing inodes once we have crossed the dirty
limit. But you had suggested that we move cgroup based writeout
above so that we always use the same scheme for writeout and that
potentially will have the root starvation issue.

- If we don't move it up, then at least it will not work for the CFQ IO
controller.

Thanks
Vivek

2011-06-08 20:42:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Wed, Jun 08, 2011 at 01:03:15PM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 7 Jun 2011 21:02:21 -0700
> Greg Thelen <[email protected]> wrote:
>
> > On Tue, Jun 7, 2011 at 5:18 PM, KAMEZAWA Hiroyuki
> > <[email protected]> wrote:
> > > On Tue, 7 Jun 2011 17:05:40 -0400
> > > Vivek Goyal <[email protected]> wrote:
> > >
> > >> On Tue, Jun 07, 2011 at 01:43:08PM -0700, Greg Thelen wrote:
> > >> > Vivek Goyal <[email protected]> writes:
> > >> >
> > >> > > On Fri, Jun 03, 2011 at 09:12:17AM -0700, Greg Thelen wrote:
> > >> > >> When the system is under background dirty memory threshold but a cgroup
> > >> > >> is over its background dirty memory threshold, then only writeback
> > >> > >> inodes associated with the over-limit cgroup(s).
> > >> > >>
> > >> > >
> > >> > > [..]
> > >> > >> -static inline bool over_bground_thresh(void)
> > >> > >> +static inline bool over_bground_thresh(struct bdi_writeback *wb,
> > >> > >> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? struct writeback_control *wbc)
> > >> > >> ?{
> > >> > >> ? ? ? ? ?unsigned long background_thresh, dirty_thresh;
> > >> > >>
> > >> > >> ? ? ? ? ?global_dirty_limits(&background_thresh, &dirty_thresh);
> > >> > >>
> > >> > >> - ? ? ? ?return (global_page_state(NR_FILE_DIRTY) +
> > >> > >> - ? ? ? ? ? ? ? ?global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> > >> > >> + ? ? ? ?if (global_page_state(NR_FILE_DIRTY) +
> > >> > >> + ? ? ? ? ? ?global_page_state(NR_UNSTABLE_NFS) > background_thresh) {
> > >> > >> + ? ? ? ? ? ? ? ?wbc->for_cgroup = 0;
> > >> > >> + ? ? ? ? ? ? ? ?return true;
> > >> > >> + ? ? ? ?}
> > >> > >> +
> > >> > >> + ? ? ? ?wbc->for_cgroup = 1;
> > >> > >> + ? ? ? ?wbc->shared_inodes = 1;
> > >> > >> + ? ? ? ?return mem_cgroups_over_bground_dirty_thresh();
> > >> > >> ?}
> > >> > >
> > >> > > Hi Greg,
> > >> > >
> > >> > > So all the logic of writeout from mem cgroup works only if the system is
> > >> > > below the background limit. The moment we cross the background limit, it
> > >> > > looks like we will fall back to the existing way of writing inodes?
> > >> >
> > >> > Correct. If the system is over its background limit then the previous
> > >> > cgroup-unaware background writeback occurs. I think of the system
> > >> > limits as those of the root cgroup. If the system is over the global
> > >> > limit then all cgroups are eligible for writeback. In this situation
> > >> > the current code does not distinguish between cgroups over or under
> > >> > their dirty background limit.
> > >> >
> > >> > Vivek Goyal <[email protected]> writes:
> > >> > > If yes, then from a design point of view it is a little odd that as long
> > >> > > as we are below background limit, we share the bdi between different
> > >> > > cgroups. The moment we are above background limit, we fall back to
> > >> > > algorithm of sharing the disk among individual inodes and forget
> > >> > > about memory cgroups. Kind of awkward.
> > >> > >
> > >> > > This kind of cgroup writeback I think will at least not solve the problem
> > >> > > for the CFQ IO controller, as we fall back to old ways of writing back inodes
> > >> > > the moment we cross the dirty ratio.
> > >> >
> > >> > It might make more sense to reverse the order of the checks in the
> > >> > proposed over_bground_thresh(): the new version would first check if any
> > >> > memcg are over limit; assuming none are over limit, then check global
> > >> > limits. Assuming that the system is over its background limit and some
> > >> > cgroups are also over their limits, then the over limit cgroups would
> > >> > first be written possibly getting the system below its limit. Does this
> > >> > address your concern?
> > >>
> > >> Do you treat the root group the same as any other cgroup? If not, then the
> > >> above logic can lead to starvation of root group inodes, or unfair writeback.
> > >> So I guess it will be important to treat the root group the same as other groups.
> > >>
> > >
> > > As far as I can say, you should not place programs onto ROOT cgroups if you need
> > > performance isolation.
> >
> > Agreed.
> >
> > > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
> > > set and background writeback can work for ROOT cgroup seamlessly.
> > >
> > > Thanks,
> > > -Kame
> >
> > Not quite. The proposed patches do not set the "1" bit (css_id of
> > root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
> > introduces the following balancing loop:
> > + /* balance entire ancestry of current's mem. */
> > + for (; mem_cgroup_has_dirty_limit(mem); mem =
> > parent_mem_cgroup(mem)) {
> >
> > The loop terminates when mem_cgroup_has_dirty_limit() is called for
> > the root cgroup. The bitmap is set in the body of the loop. So the
> > root cgroup's bit (bit 1) will never be set in the bitmap. However, I
> > think the effect is the same. The proposed changes in this patch
> > (11/12) have background writeback first checking if the system is over
> > limit and if yes, then b_dirty inodes from any cgroup are written. This
> > means that a small system background limit with an over-{fg or
> > bg}-limit cgroup could cause other cgroups that are not over their
> > limit to have their inodes written back. In a system-over-limit
> > situation normal system-wide bdi writeback is used (writing inodes in
> > b_dirty order). For those who want isolation, a simple rule to avoid
> > this is to ensure that the sum of all cgroup background_limits is
> > less than the system background limit.
> >
>
> Hmm, should we add the rule ?
> How about disallowing to set dirty_ratio bigger than system's one ?

I guess in the common case people will use a common dirty ratio for all
cgroups (the same as the system dirty ratio), so it might not be of much value.

Thanks
Vivek
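
For reference, the rule Kame floats above would amount to a small check at
configuration time. The sketch below is only an assumption about how such a
restriction could look; the handler and the per-memcg field name are made up
and are not part of the posted patches.

/*
 * Sketch of the suggested restriction: refuse a per-memcg background
 * dirty ratio larger than the system-wide dirty_background_ratio.
 */
static int memcg_set_dirty_background_ratio(struct mem_cgroup *mem,
					    unsigned long val)
{
	if (val > dirty_background_ratio)
		return -EINVAL;

	mem->dirty_param.dirty_background_ratio = val;	/* field name assumed */
	return 0;
}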

2011-06-09 17:56:06

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Wed, Jun 8, 2011 at 1:39 PM, Vivek Goyal <[email protected]> wrote:
> On Tue, Jun 07, 2011 at 09:02:21PM -0700, Greg Thelen wrote:
>
> [..]
>> > As far as I can say, you should not place programs onto ROOT cgroups if you need
>> > performance isolation.
>>
>> Agreed.
>>
>> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
>> > set and background writeback can work for ROOT cgroup seamlessly.
>> >
>> > Thanks,
>> > -Kame
>>
>> Not quite. The proposed patches do not set the "1" bit (css_id of
>> root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
>> introduces the following balancing loop:
>> +       /* balance entire ancestry of current's mem. */
>> +       for (; mem_cgroup_has_dirty_limit(mem); mem =
>> parent_mem_cgroup(mem)) {
>>
>> The loop terminates when mem_cgroup_has_dirty_limit() is called for
>> the root cgroup. The bitmap is set in the body of the loop. So the
>> root cgroup's bit (bit 1) will never be set in the bitmap. However, I
>> think the effect is the same. The proposed changes in this patch
>> (11/12) have background writeback first checking if the system is over
>> limit and if yes, then b_dirty inodes from any cgroup are written. This
>> means that a small system background limit with an over-{fg or
>> bg}-limit cgroup could cause other cgroups that are not over their
>> limit to have their inodes written back. In a system-over-limit
>> situation normal system-wide bdi writeback is used (writing inodes in
>> b_dirty order). For those who want isolation, a simple rule to avoid
>> this is to ensure that the sum of all cgroup background_limits is
>> less than the system background limit.
>
> Ok, we seem to be mixing multiple things.
>
> - First of all, I thought running apps in the root group is a very valid
>  use case. Generally by default we run everything in the root group and
>  once somebody notices that an application or group of applications
>  is a memory hog, that can be moved out into a cgroup of its own with
>  upper limits.
>
> - Secondly, the root starvation issue is not present as long as we fall
>  back to the normal way of writing inodes once we have crossed the dirty
>  limit. But you had suggested that we move cgroup based writeout
>  above so that we always use the same scheme for writeout and that
>  potentially will have the root starvation issue.

To reduce the risk of breaking system writeback (by potentially
starving root inodes), my preference is to retain this patch's
original ordering (first check and write towards system limits; only
if under system limits, write per-cgroup).

> - If we don't move it up, then at least it will not work for the CFQ IO
>  controller.

As originally proposed, over_bground_thresh() would check system
background limit, and if over limit then write b_dirty, until under
system limit. Then over_bground_thresh() checks cgroup background
limits, and if over limit(s) write over-limit-cgroup inodes until
cgroups are under their background limits.

How does the order of the checks in over_bground_thresh() affect CFQ
IO? Are you referring to recently proposed block throttle patches,
which (AFAIK) throttle the rate at which a cgroup can produce dirty
pages as a way to approximate the rate that async dirty pages will be
written to disk?

2011-06-09 21:28:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Thu, Jun 09, 2011 at 10:55:40AM -0700, Greg Thelen wrote:
> On Wed, Jun 8, 2011 at 1:39 PM, Vivek Goyal <[email protected]> wrote:
> > On Tue, Jun 07, 2011 at 09:02:21PM -0700, Greg Thelen wrote:
> >
> > [..]
> >> > As far as I can say, you should not place programs onto ROOT cgroups if you need
> >> > performance isolation.
> >>
> >> Agreed.
> >>
> >> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
> >> > set and background writeback can work for ROOT cgroup seamlessly.
> >> >
> >> > Thanks,
> >> > -Kame
> >>
> >> Not quite. The proposed patches do not set the "1" bit (css_id of
> >> root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
> >> introduces the following balancing loop:
> >> +       /* balance entire ancestry of current's mem. */
> >> +       for (; mem_cgroup_has_dirty_limit(mem); mem =
> >> parent_mem_cgroup(mem)) {
> >>
> >> The loop terminates when mem_cgroup_has_dirty_limit() is called for
> >> the root cgroup. The bitmap is set in the body of the loop. So the
> >> root cgroup's bit (bit 1) will never be set in the bitmap. However, I
> >> think the effect is the same. The proposed changes in this patch
> >> (11/12) have background writeback first checking if the system is over
> >> limit and if yes, then b_dirty inodes from any cgroup are written. This
> >> means that a small system background limit with an over-{fg or
> >> bg}-limit cgroup could cause other cgroups that are not over their
> >> limit to have their inodes written back. In a system-over-limit
> >> situation normal system-wide bdi writeback is used (writing inodes in
> >> b_dirty order). For those who want isolation, a simple rule to avoid
> >> this is to ensure that the sum of all cgroup background_limits is
> >> less than the system background limit.
> >
> > Ok, we seem to be mixing multiple things.
> >
> > - First of all, I thought running apps in the root group is a very valid
> >  use case. Generally by default we run everything in the root group and
> >  once somebody notices that an application or group of applications
> >  is a memory hog, that can be moved out into a cgroup of its own with
> >  upper limits.
> >
> > - Secondly, the root starvation issue is not present as long as we fall
> >  back to the normal way of writing inodes once we have crossed the dirty
> >  limit. But you had suggested that we move cgroup based writeout
> >  above so that we always use the same scheme for writeout and that
> >  potentially will have the root starvation issue.
>
> To reduce the risk of breaking system writeback (by potentially
> starving root inodes), my preference is to retain this patch's
> original ordering (first check and write towards system limits; only
> if under system limits, write per-cgroup).
>
> > - If we don't move it up, then at least it will not work for the CFQ IO
> >  controller.
>
> As originally proposed, over_bground_thresh() would check system
> background limit, and if over limit then write b_dirty, until under
> system limit. Then over_bground_thresh() checks cgroup background
> limits, and if over limit(s) write over-limit-cgroup inodes until
> cgroups are under their background limits.
>
> How does the order of the checks in over_bground_thresh() affect CFQ
> IO?

If you are over the background limit, you will select inodes independent of
the cgroup they belong to. So it might happen that for a long time you
select inodes only from a low prio IO cgroup and that will result in
pages being written from the low prio cgroup (as opposed to the high prio
cgroup) and the low prio group gets to finish its writes earlier. This
is just the reverse of what we wanted from the IO controller.

So the CFQ IO controller really can't do anything here till inode writeback
logic is cgroup aware in a way that we are doing round robin among
dirty cgroups so that most of the time these groups have some IO to
do at the device level.
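
As a very rough illustration of the kind of cgroup-aware selection being
asked for here (purely a sketch, not anything posted in this series), the
flusher could rotate through the over-limit bitmap instead of always
starting from the lowest css_id, so no single cgroup's inodes monopolize
the writeback stream:

/*
 * Sketch only: pick the next over-limit cgroup in round-robin order
 * from the global over-limit bitmap, so writeback (and the IO
 * controller underneath it) sees work from each dirty cgroup in turn.
 * Returns nbits if the bitmap is empty.
 */
static unsigned long next_over_limit_memcg(unsigned long *bitmap,
					   unsigned long nbits,
					   unsigned long prev_id)
{
	unsigned long id;

	id = find_next_bit(bitmap, nbits, prev_id + 1);
	if (id >= nbits)
		id = find_next_bit(bitmap, nbits, 0);	/* wrap around */
	return id;
}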

> Are you referring to recently proposed block throttle patches,
> which (AFAIK) throttle the rate at which a cgroup can produce dirty
> pages as a way to approximate the rate that async dirty pages will be
> written to disk?

No, this is not related to throttling of async writes.

Thanks
Vivek

2011-06-09 22:21:45

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v8 11/12] writeback: make background writeback cgroup aware

On Thu, Jun 9, 2011 at 2:26 PM, Vivek Goyal <[email protected]> wrote:
> On Thu, Jun 09, 2011 at 10:55:40AM -0700, Greg Thelen wrote:
>> On Wed, Jun 8, 2011 at 1:39 PM, Vivek Goyal <[email protected]> wrote:
>> > On Tue, Jun 07, 2011 at 09:02:21PM -0700, Greg Thelen wrote:
>> >
>> > [..]
>> >> > As far as I can say, you should not place programs onto ROOT cgroups if you need
>> >> > performance isolation.
>> >>
>> >> Agreed.
>> >>
>> >> > From the code, I think if the system hits dirty_ratio, "1" bit of bitmap should be
>> >> > set and background writeback can work for ROOT cgroup seamlessly.
>> >> >
>> >> > Thanks,
>> >> > -Kame
>> >>
>> >> Not quite. The proposed patches do not set the "1" bit (css_id of
>> >> root is 1). mem_cgroup_balance_dirty_pages() (from patch 10/12)
>> >> introduces the following balancing loop:
>> >> +       /* balance entire ancestry of current's mem. */
>> >> +       for (; mem_cgroup_has_dirty_limit(mem); mem =
>> >> parent_mem_cgroup(mem)) {
>> >>
>> >> The loop terminates when mem_cgroup_has_dirty_limit() is called for
>> >> the root cgroup. The bitmap is set in the body of the loop. So the
>> >> root cgroup's bit (bit 1) will never be set in the bitmap. However, I
>> >> think the effect is the same. The proposed changes in this patch
>> >> (11/12) have background writeback first checking if the system is over
>> >> limit and if yes, then b_dirty inodes from any cgroup are written. This
>> >> means that a small system background limit with an over-{fg or
>> >> bg}-limit cgroup could cause other cgroups that are not over their
>> >> limit to have their inodes written back. In a system-over-limit
>> >> situation normal system-wide bdi writeback is used (writing inodes in
>> >> b_dirty order). For those who want isolation, a simple rule to avoid
>> >> this is to ensure that the sum of all cgroup background_limits is
>> >> less than the system background limit.
>> >
>> > Ok, we seem to be mixing multiple things.
>> >
>> > - First of all, I thought running apps in the root group is a very valid
>> >  use case. Generally by default we run everything in the root group and
>> >  once somebody notices that an application or group of applications
>> >  is a memory hog, that can be moved out into a cgroup of its own with
>> >  upper limits.
>> >
>> > - Secondly, the root starvation issue is not present as long as we fall
>> >  back to the normal way of writing inodes once we have crossed the dirty
>> >  limit. But you had suggested that we move cgroup based writeout
>> >  above so that we always use the same scheme for writeout and that
>> >  potentially will have the root starvation issue.
>>
>> To reduce the risk of breaking system writeback (by potentially
>> starving root inodes), my preference is to retain this patch's
>> original ordering (first check and write towards system limits; only
>> if under system limits, write per-cgroup).
>>
>> > - If we don't move it up, then at least it will not work for the CFQ IO
>> >  controller.
>>
>> As originally proposed, over_bground_thresh() would check system
>> background limit, and if over limit then write b_dirty, until under
>> system limit. Then over_bground_thresh() checks cgroup background
>> limits, and if over limit(s) write over-limit-cgroup inodes until
>> cgroups are under their background limits.
>>
>> How does the order of the checks in over_bground_thresh() affect CFQ
>> IO?
>
> If you are over the background limit, you will select inodes independent of
> the cgroup they belong to. So it might happen that for a long time you
> select inodes only from a low prio IO cgroup and that will result in
> pages being written from the low prio cgroup (as opposed to the high prio
> cgroup) and the low prio group gets to finish its writes earlier. This
> is just the reverse of what we wanted from the IO controller.
>
> So the CFQ IO controller really can't do anything here till inode writeback
> logic is cgroup aware in a way that we are doing round robin among
> dirty cgroups so that most of the time these groups have some IO to
> do at the device level.

Thanks for the explanation.

My thinking is that this patch's original proposal (first checking
system limits before cgroup limits) is better for CFQ than the reversal
discussed earlier in this thread. By walking the system's inode list
CFQ would be getting inodes in dirtied_when order from a mix of
cgroups rather than just inodes from a particular cgroup. This patch
series introduces a bitmap of cgroups needing writeback but the inodes
for multiple cgroups are still kept in a single bdi b_dirty list.
move_expired_inodes() could be changed to examine the over-limit
cgroup bitmap (over_bground_dirty_thresh) and the CFQ priorities of
found cgroups to develop an inode sort strategy which provides CFQ
with the right set of inodes. Though I would prefer to defer that to
a later patch series.
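
As a sketch of the direction described here (the helper name is made up,
and the exact placement inside move_expired_inodes() is an assumption, not
the posted code), the per-memcg filter amounts to checking an inode's
i_memcg against the over-limit bitmap, with the CFQ-priority-aware sort
left for the later series mentioned above:

/*
 * Sketch: decide whether an inode should be picked up by per-memcg
 * background writeback.  Shared inodes carry i_memcg == 0 and are only
 * taken when the caller asked for them; otherwise the inode's cgroup
 * must be marked in the global over-limit bitmap.
 */
static bool inode_eligible_for_memcg_writeback(struct inode *inode,
					       struct writeback_control *wbc)
{
	if (!wbc->for_cgroup)
		return true;			/* plain system-wide writeback */
	if (inode->i_memcg == 0)
		return wbc->shared_inodes;	/* shared between cgroups */
	return test_bit(inode->i_memcg, over_bground_dirty_thresh);
}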

>> Are you referring to recently proposed block throttle patches,
>> which (AFAIK) throttle the rate at which a cgroup can produce dirty
>> pages as a way to approximate the rate that async dirty pages will be
>> written to disk?
>
> No, this is not related to throttling of async writes.
>
> Thanks
> Vivek