2011-03-11 18:45:33

by Greg Thelen

Subject: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

Changes since v5:
- Restructured the previously proposed (-v5) interface between page-writeback.c
  and memcontrol.c to make a (hopefully) clearer interface.  The largest change
  is that balance_dirty_pages() now calls mem_cgroup_balance_dirty_pages()
  rather than internally grappling with the memcg usage and limits.  Due to this
  restructuring, the new struct dirty_info is only used by memcg code, so the
  following -v5 patches are dropped in -v6:
    3/9 writeback: convert variables to unsigned
    4/9 writeback: create dirty_info structure
  This restructuring introduces new patches in -v6:
    7/9 memcg: add dirty limiting routines
    9/9 memcg: make background writeback memcg aware
- memcg background writeback is more effective, though not perfect.  -v5 testing
  showed that exceeding memcg background limits queued background writeback
  which used over_bground_thresh() to consult system-wide rather than per-cgroup
  background limits.  This has been addressed in -v6.
- Thanks to review feedback, memcg page accounting was adjusted in
  test_clear_page_writeback() and test_set_page_writeback() in patch 5/9.
- Rebased to mmotm-2011-03-10-16-42.

Changes since v4:
- Moved documentation changes to the start of the series to provide a better
  introduction to the series.
- Added support for hierarchical dirty limits.
- Incorporated bug fixes previously found in v4.
- Included a new patch, "writeback: convert variables to unsigned", to provide a
  clearer transition to the new dirty_info structure (patch "writeback:
  create dirty_info structure").
- Within the new dirty_info structure, replaced nr_reclaimable with
  nr_file_dirty and nr_unstable_nfs to give callers finer grained dirty usage
  information; also added dirty_info_reclaimable().
- Rebased the series to mmotm-2011-02-10-16-26 with two pending mmotm patches:
    memcg: break out event counters from other stats
      https://lkml.org/lkml/2011/2/17/415
    memcg: use native word page statistics counters
      https://lkml.org/lkml/2011/2/17/413

Changes since v3:
- Refactored the balance_dirty_pages() dirty checking to use the new struct
  dirty_info, which is used to compare both system and memcg dirty limits
  against usage.
- Disabled memcg dirty limits when memory.use_hierarchy=1.  An enhancement is
  needed to check the chain of parents to ensure that no dirty limit is
  exceeded.
- Ported to mmotm-2010-10-22-16-36.

Changes since v2:
- Rather than disabling softirq in lock_page_cgroup(), introduce a separate lock
  to synchronize between memcg page accounting and migration.  This only affects
  patch 4 of the series: patch 4 used to disable softirq, and now it introduces
  the new lock.

Changes since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
  memory.stat to match /proc/meminfo.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
  mem_cgroup_has_dirty_limit().
- Fixed lockdep issue in mem_cgroup_read_stat() which is exposed by these
  patches.
- Remove redundant comments.
- Rename (for clarity):
  - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
  - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Renamed newly created proc files:
  - memory.dirty_bytes -> memory.dirty_limit_in_bytes
  - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Removed unnecessary get_ prefix from get_xxx() functions.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Disable softirq rather than hardirq in lock_page_cgroup().
- Made mem_cgroup_move_account_page_stat() inline.
- Ported patches to mmotm-2010-10-13-17-13.

This patch set provides the ability for each cgroup to have independent dirty
page limits.

Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
page cache used by a cgroup. So, in case of multiple cgroup writers, they will
not be able to consume more than their designated share of dirty pages and will
be throttled if they cross that limit.

Example use case:
#!/bin/bash
#
# Here is a test script that shows a situation where memcg dirty limits are
# beneficial.
#
# The script runs two programs:
# 1) a dirty page background antagonist (dd)
# 2) an interactive foreground process (tar).
#
# If the script's argument is false, then both processes are run together in
# the root cgroup sharing system-wide dirty memory in classic fashion. If the
# script is given a true argument, then a cgroup is used to contain dd dirty
# page consumption. The cgroup isolates the dd dirty memory consumption from
# the rest of the system processes (tar in this case).
#
# The time used by the tar process is printed (lower is better).
#
# The tar process had faster and more predictable performance. memcg dirty
# ratios might be useful to serve different task classes (interactive vs
# batch). A past discussion touched on this:
# http://lkml.org/lkml/2010/5/20/136
#
# When called with 'false' (using memcg without dirty isolation):
# tar takes 8s
# dd reports 69 MB/s
#
# When called with 'true' (using memcg for dirty isolation):
# tar takes 6s
# dd reports 66 MB/s
#
echo memcg_dirty_limits: $1

# Declare system limits.
echo $((1<<30)) > /proc/sys/vm/dirty_bytes
echo $((1<<29)) > /proc/sys/vm/dirty_background_bytes

mkdir /dev/cgroup/memory/A

# start antagonist
if $1; then  # if using cgroup to contain 'dd'...
  echo 400M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
fi

(echo $BASHPID > /dev/cgroup/memory/A/tasks; \
 dd if=/dev/zero of=big.file count=10k bs=1M) &

# let antagonist get warmed up
sleep 10

# time interactive job
time tar -xzf linux.tar.gz

wait
sleep 10
rmdir /dev/cgroup/memory/A


The patches are based on a series proposed by Andrea Righi in Mar 2010.


Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
  unstable.

- Extend mem_cgroup to record the total number of pages in each of the
  interesting dirty states (dirty, writeback, unstable_nfs).

- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to
  mem_cgroup.  The mem_cgroup dirty parameters are accessible via cgroupfs
  control files.

- Consider both system and per-memcg dirty limits in page writeback when
  deciding to queue background writeback or throttle dirty memory production
  (a simplified standalone model of this check follows below).
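
As a rough illustration of that per-memcg check, the following standalone
program (not kernel code, and not part of the patch series) models the decision
using the dirty_info fields introduced later in the series (patch 7/9); the
thresholds and counts in main() are made-up sample numbers:

#include <stdio.h>

struct dirty_info {
	unsigned long dirty_thresh;		/* foreground limit, in pages */
	unsigned long background_thresh;	/* background limit, in pages */
	unsigned long nr_file_dirty;
	unsigned long nr_writeback;
	unsigned long nr_unstable_nfs;
};

/* Reclaimable dirty usage: dirty page cache plus unstable NFS pages. */
static unsigned long dirty_info_reclaimable(const struct dirty_info *info)
{
	return info->nr_file_dirty + info->nr_unstable_nfs;
}

/*
 * Simplified model of the check performed when a task dirties page cache:
 * throttle the dirtier when the foreground threshold is exceeded, queue
 * background writeback when only the background threshold is exceeded.
 */
static const char *memcg_dirty_action(const struct dirty_info *info)
{
	unsigned long nr_reclaimable = dirty_info_reclaimable(info);

	if (nr_reclaimable >= info->dirty_thresh)
		return "throttle dirtier and write back inodes";
	if (nr_reclaimable >= info->background_thresh)
		return "queue background writeback";
	return "no action";
}

int main(void)
{
	/* made-up sample numbers: 400 MiB / 200 MiB limits in 4 KiB pages */
	struct dirty_info info = {
		.dirty_thresh = 102400,
		.background_thresh = 51200,
		.nr_file_dirty = 60000,
		.nr_writeback = 1000,
		.nr_unstable_nfs = 0,
	};

	printf("%s\n", memcg_dirty_action(&info));
	return 0;
}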

Known shortcomings (see the patch 1/9 update to Documentation/cgroups/memory.txt
for more details):
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write
  back dirty inodes.  Bdi writeback considers inodes from any cgroup, not just
  inodes contributing dirty pages to the cgroup exceeding its limit.

- A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
  different memcg.

Performance data:
- A page fault microbenchmark workload was used to measure performance.  It can
  be called in read or write mode (a runnable sketch of this loop appears after
  the results table below):
      f = open(foo.$cpu)
      truncate(f, 4096)
      alarm(60)
      while (1) {
          p = mmap(f, 4096)
          if (write)
              *p = 1
          else
              x = *p
          munmap(p)
      }

- The workload was called for several points in the patch series in different
modes:
- s_read is a single threaded reader
- s_write is a single threaded writer
- p_read is a 16 thread reader, each operating on a different file
- p_write is a 16 thread writer, each operating on a different file

- Measurements were collected on a 16 core non-numa system using "perf stat
--repeat 3".

- All numbers are page fault rate (M/sec). Higher is better.

- To compare the performance of a kernel without memcg, compare the first and
  last rows; neither has memcg configured.  The first row does not include any
  of these memcg dirty limit patches.

- To compare the performance of using memcg dirty limits, compare the memcg
  baseline (2nd row, titled "mmotm w/ memcg") with the 3rd row (memcg enabled
  with all patches).

                          root_cgroup                     child_cgroup
                s_read s_write p_read p_write   s_read s_write p_read p_write
mmotm w/o memcg  0.313   0.271  0.307   0.267
mmotm w/ memcg   0.311   0.280  0.303   0.268    0.317   0.278  0.299   0.266
all patches      0.315   0.283  0.303   0.267    0.318   0.279  0.307   0.267
all patches      0.324   0.277  0.315   0.273
  w/o memcg
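
For reference, here is a runnable reconstruction of the page fault
microbenchmark loop shown above.  It is a sketch built from the pseudocode, not
the exact tool used for these measurements; the file naming (foo.<n>) and the
read/write modes follow the pseudocode, while the command-line handling and the
fault-rate report are assumptions:

#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t done;

static void on_alarm(int sig)
{
	(void)sig;
	done = 1;
}

int main(int argc, char **argv)
{
	int do_write = argc > 1 && !strcmp(argv[1], "write");
	unsigned long faults = 0;
	volatile char x;
	char name[64];
	int fd;

	/* one file per instance, e.g. foo.0 ... foo.15 for the 16-thread runs */
	snprintf(name, sizeof(name), "foo.%s", argc > 2 ? argv[2] : "0");
	fd = open(name, O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, 4096) != 0) {
		perror(name);
		return 1;
	}

	signal(SIGALRM, on_alarm);
	alarm(60);

	while (!done) {
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		if (do_write)
			*p = 1;		/* fault the page in by writing */
		else
			x = *p;		/* fault the page in by reading */
		munmap(p, 4096);
		faults++;
	}

	printf("%.3f M faults/sec\n", faults / 60.0 / 1e6);
	return 0;
}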

Greg Thelen (9):
memcg: document cgroup dirty memory interfaces
memcg: add page_cgroup flags for dirty page tracking
memcg: add dirty page accounting infrastructure
memcg: add kernel calls for memcg dirty page stats
memcg: add dirty limits to mem_cgroup
memcg: add cgroupfs interface to memcg dirty limits
memcg: add dirty limiting routines
memcg: check memcg dirty limits in page writeback
memcg: make background writeback memcg aware

Documentation/cgroups/memory.txt | 80 ++++++
fs/fs-writeback.c | 63 ++---
fs/nfs/write.c | 4 +
include/linux/backing-dev.h | 3 +-
include/linux/memcontrol.h | 43 +++-
include/linux/page_cgroup.h | 23 ++
include/linux/writeback.h | 3 +-
mm/filemap.c | 1 +
mm/memcontrol.c | 581 +++++++++++++++++++++++++++++++++++++-
mm/page-writeback.c | 46 +++-
mm/truncate.c | 1 +
mm/vmscan.c | 2 +-
12 files changed, 795 insertions(+), 55 deletions(-)

--
1.7.3.1


2011-03-11 18:45:46

by Greg Thelen

Subject: [PATCH v6 1/9] memcg: document cgroup dirty memory interfaces

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Balbir Singh <[email protected]>
---
Changelog since v4:
- Minor rewording of '5.5 dirty memory' section.
- Added '5.5.1 Inode writeback issue' section.

Changelog since v3:
- Described interactions with memory.use_hierarchy.
- Added description of total_dirty, total_writeback, and total_nfs_unstable.

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Describe a situation where a cgroup can exceed its dirty limit.

Documentation/cgroups/memory.txt | 80 ++++++++++++++++++++++++++++++++++++++
1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..4db695e 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
pgpgin - # of pages paged in (equivalent to # of charging events).
pgpgout - # of pages paged out (equivalent to # of uncharging events).
swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
+nfs_unstable - # of bytes sent to the NFS server, but not yet committed to
+ the actual storage.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
@@ -406,6 +410,9 @@ total_mapped_file - sum of all children's "cache"
total_pgpgin - sum of all children's "pgpgin"
total_pgpgout - sum of all children's "pgpgout"
total_swap - sum of all children's "swap"
+total_dirty - sum of all children's "dirty"
+total_writeback - sum of all children's "writeback"
+total_nfs_unstable - sum of all children's "nfs_unstable"
total_inactive_anon - sum of all children's "inactive_anon"
total_active_anon - sum of all children's "active_anon"
total_inactive_file - sum of all children's "inactive_file"
@@ -453,6 +460,79 @@ memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup. So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be throttled if they cross that limit. System-wide dirty limits are also
+consulted. Dirty memory consumption is checked against both system-wide and
+per-cgroup dirty limits.
+
+The interface is similar to the procfs interface: /proc/sys/vm/dirty_*. It is
+possible to configure a limit to trigger throttling of a dirtier or queue
+background writeback. The root cgroup memory.dirty_* control files are
+read-only and match the contents of the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+ cgroup memory) at which a process generating dirty pages will be throttled.
+ The default value is the system-wide dirty ratio, /proc/sys/vm/dirty_ratio.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+ in the cgroup at which a process generating dirty pages will be throttled.
+ Suffix (k, K, m, M, g, or G) can be used to indicate that value is kilo, mega
+ or gigabytes. The default value is the system-wide dirty limit,
+ /proc/sys/vm/dirty_bytes.
+
+ Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+ Only one may be specified at a time. When one is written it is immediately
+ taken into account to evaluate the dirty memory limits and the other appears
+ as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+ (expressed as a percentage of cgroup memory) at which background writeback
+ kernel threads will start writing out dirty data. The default value is the
+ system-wide background dirty ratio, /proc/sys/vm/dirty_background_ratio.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+ in bytes) in the cgroup at which background writeback kernel threads will
+ start writing out dirty data. Suffix (k, K, m, M, g, or G) can be used to
+ indicate that value is kilo, mega or gigabytes. The default value is the
+ system-wide dirty background limit, /proc/sys/vm/dirty_background_bytes.
+
+ Note: memory.dirty_background_limit_in_bytes is the counterpart of
+ memory.dirty_background_ratio. Only one may be specified at a time. When one
+ is written it is immediately taken into account to evaluate the dirty memory
+ limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit. This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup. Example: If page is allocated by a
+cgroup A task, then the page is charged to cgroup A. If the page is later
+dirtied by a task in cgroup B, then the cgroup A dirty count will be
+incremented. If cgroup A is over its dirty limit but cgroup B is not, then
+dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
+limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+When use_hierarchy=1 the dirty limits of parents cgroups are also checked to
+ensure that no dirty limit is exceeded.
+
+5.5.1 Inode writeback issue
+
+When a memcg dirty limit is exceeded, then bdi writeback is employed to
+writeback dirty inodes. Bdi writeback considers inodes from any memcg, not just
+inodes contributing dirty pages to the memcg exceeding its limit. Ideally when
+a memcg dirty limit is exceeded only inodes contributing dirty pages to that
+memcg would be considered for writeback. However, the current implementation
+does not behave this way because there is no way to quickly check the memcgs
+that an inode contributes dirty pages to.
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.3.1
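
The following user-space sketch (not part of the patch) exercises the interface
documented above.  It assumes the memory cgroup hierarchy is mounted at
/dev/cgroup/memory and that a cgroup "A" already exists, as in the cover-letter
test script; per the documentation, after writing memory.dirty_limit_in_bytes
the paired memory.dirty_ratio should read back as 0:

#include <stdio.h>
#include <string.h>

#define CG "/dev/cgroup/memory/A/"

/* write a string to a cgroupfs control file */
static int write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f) {
		perror(path);
		return -1;
	}
	ok = fputs(val, f) != EOF;
	if (fclose(f) != 0 || !ok) {
		perror(path);
		return -1;
	}
	return 0;
}

/* print the first line of a cgroupfs control file */
static void show(const char *path)
{
	char buf[64] = "";
	FILE *f = fopen(path, "r");

	if (f) {
		if (fgets(buf, sizeof(buf), f))
			buf[strcspn(buf, "\n")] = '\0';
		fclose(f);
	}
	printf("%s = %s\n", path, buf);
}

int main(void)
{
	/* byte limits accept k/K/m/M/g/G suffixes, per the documentation */
	write_file(CG "memory.dirty_limit_in_bytes", "400M");
	write_file(CG "memory.dirty_background_limit_in_bytes", "200M");

	show(CG "memory.dirty_limit_in_bytes");
	/* the paired ratio should now read back as 0 */
	show(CG "memory.dirty_ratio");
	return 0;
}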

2011-03-11 18:46:11

by Greg Thelen

Subject: [PATCH v6 2/9] memcg: add page_cgroup flags for dirty page tracking

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
---
include/linux/page_cgroup.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index f5de21d..a002ba8 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -10,6 +10,9 @@ enum {
/* flags for mem_cgroup and file and I/O status */
PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is under writeback */
+ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
/* No lock in page_cgroup */
PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
__NR_PCG_FLAGS,
@@ -67,6 +70,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }

+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
CLEARPCGFLAG(Cache, CACHE)
@@ -86,6 +93,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)

+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
--
1.7.3.1

2011-03-11 18:46:25

by Greg Thelen

Subject: [PATCH v6 4/9] memcg: add kernel calls for memcg dirty page stats

Add calls into memcg dirty page accounting. Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
---
Changelog since v5:
- moved accounting site in test_clear_page_writeback() and
test_set_page_writeback().

fs/nfs/write.c | 4 ++++
mm/filemap.c | 1 +
mm/page-writeback.c | 10 ++++++++--
mm/truncate.c | 1 +
4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 42b92d7..7863777 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -462,6 +463,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;

if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1319,6 +1321,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
req = nfs_list_entry(head->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index a6cfecf..7e751fe 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -143,6 +143,7 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 632b464..d8005b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1118,6 +1118,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1317,6 +1318,7 @@ int clear_page_dirty_for_io(struct page *page)
* for more comments.
*/
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -1352,8 +1354,10 @@ int test_clear_page_writeback(struct page *page)
} else {
ret = TestClearPageWriteback(page);
}
- if (ret)
+ if (ret) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK);
+ }
return ret;
}

@@ -1386,8 +1390,10 @@ int test_set_page_writeback(struct page *page)
} else {
ret = TestSetPageWriteback(page);
}
- if (!ret)
+ if (!ret) {
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
account_page_writeback(page);
+ }
return ret;

}
diff --git a/mm/truncate.c b/mm/truncate.c
index 5e5a43a..18d0304 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
--
1.7.3.1

2011-03-11 18:46:35

by Greg Thelen

Subject: [PATCH v6 5/9] memcg: add dirty limits to mem_cgroup

Extend mem_cgroup to contain dirty page limits.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v5:
- To simplify this patch, deferred adding routines for kernel to query dirty
usage and limits to a later patch in this series.
- Collapsed __mem_cgroup_has_dirty_limit() into mem_cgroup_has_dirty_limit().
- Renamed __mem_cgroup_dirty_param() to mem_cgroup_dirty_param().

Changelog since v4:
- Added support for hierarchical dirty limits.
- Simplified __mem_cgroup_dirty_param().
- Simplified mem_cgroup_page_stat().
- Deleted the mem_cgroup_nr_pages_item enum, which added little value.
  Instead the mem_cgroup_page_stat_item enum values are used to identify
  memcg dirty statistics exported to the kernel.
- Fixed overflow issues in mem_cgroup_hierarchical_free_pages().

Changelog since v3:
- Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
advertise dirty memory limits. Now struct dirty_info and
mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
the rest of the kernel.
- __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.
- memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
- created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
logic.

Changelog since v1:
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Removed unnecessary get_ prefix from get_xxx() functions.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
mem_cgroup_has_dirty_limit().

mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 50 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b8f517d..5c80622 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -203,6 +203,14 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -242,6 +250,10 @@ struct mem_cgroup {
atomic_t refcnt;

unsigned int swappiness;
+
+ /* control memory cgroup dirty pages */
+ struct vm_dirty_param dirty_param;
+
/* OOM-Killer disable */
int oom_kill_disable;

@@ -1198,6 +1210,36 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs. So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
+{
+ return mem && !mem_cgroup_is_root(mem);
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits. If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where neither dirty_[background_]_ratio nor _bytes are set.
+ */
+static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
+ struct mem_cgroup *mem)
+{
+ if (mem_cgroup_has_dirty_limit(mem)) {
+ *param = mem->dirty_param;
+ } else {
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = vm_dirty_bytes;
+ param->dirty_background_ratio = dirty_background_ratio;
+ param->dirty_background_bytes = dirty_background_bytes;
+ }
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
@@ -4669,8 +4711,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
INIT_LIST_HEAD(&mem->oom_notify);

- if (parent)
+ if (parent) {
mem->swappiness = get_swappiness(parent);
+ mem_cgroup_dirty_param(&mem->dirty_param, parent);
+ } else {
+ /*
+ * The root cgroup dirty_param field is not used, instead,
+ * system-wide dirty limits are used.
+ */
+ }
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
--
1.7.3.1

2011-03-11 18:46:49

by Greg Thelen

Subject: [PATCH v6 7/9] memcg: add dirty limiting routines

Add new memcg routines for use by mm to balance and throttle per-memcg
dirty page usage:
- mem_cgroup_balance_dirty_pages() balances memcg dirty memory usage by
checking memcg foreground and background limits.
- mem_cgroup_hierarchical_dirty_info() searches a memcg hierarchy to
find the memcg that is closest to (or over) its foreground or
background dirty limit.

A later change adds kernel calls to these new routines.

Signed-off-by: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 35 +++++
mm/memcontrol.c | 329 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 364 insertions(+), 0 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 549fa7c..42f5f63 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -34,6 +34,15 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+ MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
+};
+
+struct dirty_info {
+ unsigned long dirty_thresh;
+ unsigned long background_thresh;
+ unsigned long nr_file_dirty;
+ unsigned long nr_writeback;
+ unsigned long nr_unstable_nfs;
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -149,6 +158,14 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ bool fg_limit,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info);
+void mem_cgroup_bg_writeback_done(struct mem_cgroup *memcg);
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk);
+
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -342,6 +359,24 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
{
}

+static inline bool
+mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ bool fg_limit,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ return false;
+}
+
+static inline void mem_cgroup_bg_writeback_done(struct mem_cgroup *memcg)
+{
+}
+
+static inline void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 07cbb35..25dc077 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -218,6 +218,12 @@ struct vm_dirty_param {
unsigned long dirty_background_bytes;
};

+/* Define per-memcg flags */
+enum mem_cgroup_flags {
+ /* is background writeback in-progress for this memcg? */
+ MEM_CGROUP_BG_WRITEBACK,
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -261,6 +267,9 @@ struct mem_cgroup {
/* control memory cgroup dirty pages */
struct vm_dirty_param dirty_param;

+ /* see enum mem_cgroup_flags for bit definitions */
+ unsigned long flags;
+
/* OOM-Killer disable */
int oom_kill_disable;

@@ -1217,6 +1226,11 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static unsigned long dirty_info_reclaimable(struct dirty_info *info)
+{
+ return info->nr_file_dirty + info->nr_unstable_nfs;
+}
+
/*
* Return true if the current memory cgroup has local dirty memory settings.
* There is an allowed race between the current task migrating in-to/out-of the
@@ -1247,6 +1261,321 @@ static void mem_cgroup_dirty_param(struct vm_dirty_param *param,
}
}

+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+ if (!do_swap_account)
+ return nr_swap_pages > 0;
+ return !memcg->memsw_is_minimum &&
+ (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ s64 ret;
+
+ switch (item) {
+ case MEMCG_NR_FILE_DIRTY:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ break;
+ case MEMCG_NR_FILE_WRITEBACK:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ break;
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ ret = mem_cgroup_read_stat(mem,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ break;
+ case MEMCG_NR_DIRTYABLE_PAGES:
+ ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+ if (mem_cgroup_can_swap(mem))
+ ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Return the number of additional pages that the @mem cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+ u64 free;
+ unsigned long min_free;
+
+ min_free = global_page_state(NR_FREE_PAGES);
+
+ while (mem) {
+ free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
+ res_counter_read_u64(&mem->res, RES_USAGE)) >>
+ PAGE_SHIFT;
+ min_free = min((u64)min_free, free);
+ mem = parent_mem_cgroup(mem);
+ }
+
+ return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @mem: memory cgroup to query
+ * @item: memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+static unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ struct mem_cgroup *iter;
+ s64 value;
+
+ VM_BUG_ON(!mem);
+ VM_BUG_ON(mem_cgroup_is_root(mem));
+
+ /*
+ * If we're looking for dirtyable pages we need to evaluate free pages
+ * depending on the limit and usage of the parents first of all.
+ */
+ if (item == MEMCG_NR_DIRTYABLE_PAGES)
+ value = mem_cgroup_hierarchical_free_pages(mem);
+ else
+ value = 0;
+
+ /*
+ * Recursively evaluate page statistics against all cgroup under
+ * hierarchy tree
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+ value += mem_cgroup_local_page_stat(iter, item);
+
+ /*
+ * Summing of unlocked per-cpu counters is racy and may yield a slightly
+ * negative value. Zero is the only sensible value in such cases.
+ */
+ if (unlikely(value < 0))
+ value = 0;
+
+ return value;
+}
+
+/* Return dirty thresholds and usage for @memcg. */
+static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ unsigned long uninitialized_var(available_mem);
+ struct vm_dirty_param dirty_param;
+
+ mem_cgroup_dirty_param(&dirty_param, memcg);
+
+ if (!dirty_param.dirty_bytes || !dirty_param.dirty_background_bytes)
+ available_mem = min(
+ sys_available_mem,
+ mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES));
+
+ if (dirty_param.dirty_bytes)
+ info->dirty_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+ else
+ info->dirty_thresh =
+ (dirty_param.dirty_ratio * available_mem) / 100;
+
+ if (dirty_param.dirty_background_bytes)
+ info->background_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+ PAGE_SIZE);
+ else
+ info->background_thresh =
+ (dirty_param.dirty_background_ratio *
+ available_mem) / 100;
+
+ info->nr_file_dirty = mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_DIRTY);
+ info->nr_writeback = mem_cgroup_page_stat(memcg,
+ MEMCG_NR_FILE_WRITEBACK);
+ info->nr_unstable_nfs =
+ mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_UNSTABLE_NFS);
+}
+
+/*
+ * Releases memcg reference. Called when per-memcg background writeback
+ * completes.
+ */
+void mem_cgroup_bg_writeback_done(struct mem_cgroup *memcg)
+{
+ VM_BUG_ON(!test_bit(MEM_CGROUP_BG_WRITEBACK, &memcg->flags));
+ clear_bit(MEM_CGROUP_BG_WRITEBACK, &memcg->flags);
+ css_put(&memcg->css);
+}
+
+/*
+ * This routine must be called by processes which are generating dirty pages.
+ * It considers the dirty pages usage and thresholds of the current cgroup and
+ * (depending if hierarchical accounting is enabled) ancestral memcg. If any of
+ * the considered memcg are over their background dirty limit, then background
+ * writeback is queued. If any are over the foreground dirty limit then
+ * throttle the dirtying task while writing dirty data. The per-memcg dirty
+ * limits check by this routine are distinct from either the per-system,
+ * per-bdi, or per-task limits considered by balance_dirty_pages().
+ */
+void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
+ unsigned long write_chunk)
+{
+ unsigned long nr_reclaimable;
+ unsigned long sys_available_mem;
+ struct mem_cgroup *memcg;
+ struct mem_cgroup *ref_memcg;
+ struct dirty_info info;
+ struct backing_dev_info *bdi = mapping->backing_dev_info;
+ unsigned long pause = 1;
+
+ if (mem_cgroup_disabled())
+ return;
+
+ sys_available_mem = determine_dirtyable_memory();
+
+ /* reference the memcg so it is not deleted during this routine */
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ ref_memcg = memcg;
+ if (memcg)
+ css_get(&ref_memcg->css);
+ rcu_read_unlock();
+
+ /* balance entire ancestry of current's memcg. */
+ while (mem_cgroup_has_dirty_limit(memcg)) {
+ /*
+ * keep throttling and writing inode data so long as memcg is
+ * over its dirty limit.
+ */
+ while (true) {
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .older_than_this = NULL,
+ .nr_to_write = write_chunk,
+ .range_cyclic = 1,
+ };
+
+ mem_cgroup_dirty_info(sys_available_mem, memcg, &info);
+ nr_reclaimable = dirty_info_reclaimable(&info);
+
+ /* if memcg is over dirty limit, then throttle. */
+ if (nr_reclaimable >= info.dirty_thresh) {
+ writeback_inodes_wb(&bdi->wb, &wbc);
+ /*
+ * Sleep up to 100ms to throttle writer and wait
+ * for queued background I/O to complete.
+ */
+ __set_current_state(TASK_UNINTERRUPTIBLE);
+ io_schedule_timeout(pause);
+ pause <<= 1;
+ if (pause > HZ / 10)
+ pause = HZ / 10;
+ } else
+ break;
+ }
+
+ /* if memcg is over background limit, then queue bg writeback */
+ if ((nr_reclaimable >= info.background_thresh) &&
+ !test_and_set_bit(MEM_CGROUP_BG_WRITEBACK, &memcg->flags)) {
+ /*
+ * grab css reference that will be released by
+ * mem_cgroup_bg_writeback_done().
+ */
+ css_get(&memcg->css);
+ bdi_start_background_writeback(bdi);
+ }
+
+ /* continue walking up hierarchy enabled parents */
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg || !memcg->use_hierarchy)
+ break;
+ }
+
+ if (ref_memcg)
+ css_put(&ref_memcg->css);
+}
+
+/*
+ * Return the dirty thresholds and usage for the memcg (within the ancestral
+ * chain of @memcg) closest to its limit or the first memcg over its limit.
+ * If @fg_limit is set, then check the dirty_limit, otherwise check
+ * background_limit. If @memcg is not set, then use the current task's memcg.
+ *
+ * The current task may be moved to another cgroup while this routine accesses
+ * the dirty limit. But a precise check is meaningless because the task can be
+ * moved after our access and writeback tends to take long time. At least,
+ * "memcg" will not be freed while holding rcu_read_lock().
+ */
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ bool fg_limit,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ unsigned long usage;
+ struct dirty_info uninitialized_var(cur_info);
+ struct mem_cgroup *ref_memcg = NULL;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ info->nr_writeback = ULONG_MAX; /* invalid initial value */
+
+ /* reference current's memcg unless a memcg was provided by caller */
+ if (!memcg) {
+ rcu_read_lock();
+ memcg = mem_cgroup_from_task(current);
+ if (memcg)
+ css_get(&memcg->css);
+ ref_memcg = memcg;
+ rcu_read_unlock();
+ }
+
+ while (mem_cgroup_has_dirty_limit(memcg)) {
+ mem_cgroup_dirty_info(sys_available_mem, memcg, &cur_info);
+ usage = dirty_info_reclaimable(&cur_info) +
+ cur_info.nr_writeback;
+
+ /* if over limit, stop searching */
+ if (usage >= (fg_limit ? cur_info.dirty_thresh :
+ cur_info.background_thresh)) {
+ *info = cur_info;
+ break;
+ }
+
+ /*
+ * Save dirty usage of memcg closest to its limit if either:
+ * - memcg is the first memcg considered
+ * - memcg dirty margin is smaller than last recorded one
+ */
+ /* cur_memcg_margin < */
+ if ((info->nr_writeback == ULONG_MAX) ||
+ ((fg_limit ? cur_info.dirty_thresh :
+ cur_info.background_thresh) - usage <
+ (fg_limit ? info->dirty_thresh :
+ info->background_thresh) -
+ (dirty_info_reclaimable(info) + info->nr_writeback)))
+
+ *info = cur_info;
+
+ /* continue walking up hierarchy enabled parents */
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg || !memcg->use_hierarchy)
+ break;
+ }
+
+ if (ref_memcg)
+ css_put(&ref_memcg->css);
+
+ return info->nr_writeback != ULONG_MAX;
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
--
1.7.3.1

2011-03-11 18:46:52

by Greg Thelen

Subject: [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits

Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_limit_in_bytes

Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts. This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v3:
- Make use of new routine, __mem_cgroup_has_dirty_limit(), to disable memcg
dirty limits when use_hierarchy=1.

Changelog since v1:
- Renamed newly created proc files:
- memory.dirty_bytes -> memory.dirty_limit_in_bytes
- memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

mm/memcontrol.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5c80622..07cbb35 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -113,6 +113,13 @@ enum mem_cgroup_events_target {
#define THRESHOLDS_EVENTS_TARGET (128)
#define SOFTLIMIT_EVENTS_TARGET (1024)

+enum {
+ MEM_CGROUP_DIRTY_RATIO,
+ MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
@@ -4391,6 +4398,89 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ bool use_sys = !mem_cgroup_has_dirty_limit(mem);
+
+ switch (cft->private) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ return use_sys ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ return use_sys ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ return use_sys ? dirty_background_ratio :
+ mem->dirty_param.dirty_background_ratio;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ return use_sys ? dirty_background_bytes :
+ mem->dirty_param.dirty_background_bytes;
+ default:
+ BUG();
+ }
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+
+ if (!mem_cgroup_has_dirty_limit(memcg))
+ return -EINVAL;
+ if ((type == MEM_CGROUP_DIRTY_RATIO ||
+ type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+ return -EINVAL;
+ switch (type) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ memcg->dirty_param.dirty_ratio = val;
+ memcg->dirty_param.dirty_bytes = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ memcg->dirty_param.dirty_background_ratio = val;
+ memcg->dirty_param.dirty_background_bytes = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4454,6 +4544,30 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "dirty_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_RATIO,
+ },
+ {
+ .name = "dirty_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ },
+ {
+ .name = "dirty_background_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ },
+ {
+ .name = "dirty_background_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
--
1.7.3.1

2011-03-11 18:46:09

by Greg Thelen

Subject: [PATCH v6 3/9] memcg: add dirty page accounting infrastructure

Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
---
Changelog since v5:
- Updated enum mem_cgroup_page_stat_item comment.

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Remove redundant comments.
- Made mem_cgroup_move_account_page_stat() inline.

include/linux/memcontrol.h | 8 ++++-
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 86 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a5ce70..549fa7c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,9 +25,15 @@ struct page_cgroup;
struct page;
struct mm_struct;

-/* Stats that can be updated by kernel. */
+/*
+ * Per mem_cgroup page counts tracked by kernel. As pages enter and leave these
+ * states, the kernel notifies memcg using mem_cgroup_{inc,dec}_page_stat().
+ */
enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4407dd0..b8f517d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,8 +83,11 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
MEM_CGROUP_STAT_NSTATS,
@@ -1692,6 +1695,44 @@ void mem_cgroup_update_page_stat(struct page *page,
ClearPageCgroupFileMapped(pc);
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
+
+ case MEMCG_NR_FILE_DIRTY:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileDirty(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileDirty(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_DIRTY;
+ break;
+
+ case MEMCG_NR_FILE_WRITEBACK:
+ /*
+ * This counter is adjusted while holding the mapping's
+ * tree_lock. Therefore there is no race between settings and
+ * clearing of this flag.
+ */
+ if (val > 0)
+ SetPageCgroupFileWriteback(pc);
+ else
+ ClearPageCgroupFileWriteback(pc);
+ idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+ break;
+
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+ break;
+
default:
BUG();
}
@@ -2251,6 +2292,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
}
#endif

+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2299,13 +2351,18 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_page_cgroup(pc, &flags);

- if (PageCgroupFileMapped(pc)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
- }
+ if (PageCgroupFileMapped(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+ if (PageCgroupFileDirty(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+ if (PageCgroupFileWriteback(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+ if (PageCgroupFileUnstableNFS(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
@@ -3772,6 +3829,9 @@ enum {
MCS_PGPGIN,
MCS_PGPGOUT,
MCS_SWAP,
+ MCS_FILE_DIRTY,
+ MCS_WRITEBACK,
+ MCS_UNSTABLE_NFS,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
MCS_INACTIVE_FILE,
@@ -3794,6 +3854,9 @@ struct {
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
{"swap", "total_swap"},
+ {"dirty", "total_dirty"},
+ {"writeback", "total_writeback"},
+ {"nfs_unstable", "total_nfs_unstable"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
{"inactive_file", "total_inactive_file"},
@@ -3823,6 +3886,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
s->stat[MCS_SWAP] += val * PAGE_SIZE;
}

+ /* dirty stat */
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
/* per zone stat */
val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
--
1.7.3.1

2011-03-11 18:47:44

by Greg Thelen

Subject: [PATCH v6 9/9] memcg: make background writeback memcg aware

Add an memcg parameter to bdi_start_background_writeback(). If a memcg
is specified then the resulting background writeback call to
wb_writeback() will run until the memcg dirty memory usage drops below
the memcg background limit. This is used when balancing memcg dirty
memory with mem_cgroup_balance_dirty_pages().

If the memcg parameter is not specified, then background writeback runs
until global system dirty memory usage falls below the system background
limit.

Signed-off-by: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 63 ++++++++++++++++++++----------------------
include/linux/backing-dev.h | 3 +-
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 2 +-
4 files changed, 34 insertions(+), 36 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59c6e49..975741d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -20,6 +20,7 @@
#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/memcontrol.h>
#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/writeback.h>
@@ -35,6 +36,7 @@
struct wb_writeback_work {
long nr_pages;
struct super_block *sb;
+ struct mem_cgroup *mem_cgroup; /* NULL for global (root cgroup) */
enum writeback_sync_modes sync_mode;
unsigned int for_kupdate:1;
unsigned int range_cyclic:1;
@@ -113,7 +115,8 @@ static void bdi_queue_work(struct backing_dev_info *bdi,

static void
__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- bool range_cyclic)
+ bool range_cyclic, bool for_background,
+ struct mem_cgroup *mem_cgroup)
{
struct wb_writeback_work *work;

@@ -133,6 +136,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
work->sync_mode = WB_SYNC_NONE;
work->nr_pages = nr_pages;
work->range_cyclic = range_cyclic;
+ work->for_background = for_background;
+ work->mem_cgroup = mem_cgroup;

bdi_queue_work(bdi, work);
}
@@ -150,7 +155,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
*/
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
{
- __bdi_start_writeback(bdi, nr_pages, true);
+ __bdi_start_writeback(bdi, nr_pages, true, false, NULL);
}

/**
@@ -163,16 +168,10 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages)
* some IO is happening if we are over background dirty threshold.
* Caller need not hold sb s_umount semaphore.
*/
-void bdi_start_background_writeback(struct backing_dev_info *bdi)
+void bdi_start_background_writeback(struct backing_dev_info *bdi,
+ struct mem_cgroup *mem_cgroup)
{
- /*
- * We just wake up the flusher thread. It will perform background
- * writeback as soon as there is no other work to do.
- */
- trace_writeback_wake_background(bdi);
- spin_lock_bh(&bdi->wb_lock);
- bdi_wakeup_flusher(bdi);
- spin_unlock_bh(&bdi->wb_lock);
+ __bdi_start_writeback(bdi, LONG_MAX, true, true, mem_cgroup);
}

/*
@@ -593,10 +592,22 @@ static void __writeback_inodes_sb(struct super_block *sb,
*/
#define MAX_WRITEBACK_PAGES 1024

-static inline bool over_bground_thresh(void)
+static inline bool over_bground_thresh(struct mem_cgroup *mem_cgroup)
{
unsigned long background_thresh, dirty_thresh;

+ if (mem_cgroup) {
+ struct dirty_info info;
+
+ if (!mem_cgroup_hierarchical_dirty_info(
+ determine_dirtyable_memory(), false,
+ mem_cgroup, &info))
+ return false;
+
+ return info.nr_file_dirty +
+ info.nr_unstable_nfs > info.background_thresh;
+ }
+
global_dirty_limits(&background_thresh, &dirty_thresh);

return (global_page_state(NR_FILE_DIRTY) +
@@ -683,7 +694,8 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh())
+ if (work->for_background &&
+ !over_bground_thresh(work->mem_cgroup))
break;

wbc.more_io = 0;
@@ -761,23 +773,6 @@ static unsigned long get_nr_dirty_pages(void)
get_nr_dirty_inodes();
}

-static long wb_check_background_flush(struct bdi_writeback *wb)
-{
- if (over_bground_thresh()) {
-
- struct wb_writeback_work work = {
- .nr_pages = LONG_MAX,
- .sync_mode = WB_SYNC_NONE,
- .for_background = 1,
- .range_cyclic = 1,
- };
-
- return wb_writeback(wb, &work);
- }
-
- return 0;
-}
-
static long wb_check_old_data_flush(struct bdi_writeback *wb)
{
unsigned long expired;
@@ -839,15 +834,17 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
*/
if (work->done)
complete(work->done);
- else
+ else {
+ if (work->mem_cgroup)
+ mem_cgroup_bg_writeback_done(work->mem_cgroup);
kfree(work);
+ }
}

/*
* Check for periodic writeback, kupdated() style
*/
wrote += wb_check_old_data_flush(wb);
- wrote += wb_check_background_flush(wb);
clear_bit(BDI_writeback_running, &wb->bdi->state);

return wrote;
@@ -934,7 +931,7 @@ void wakeup_flusher_threads(long nr_pages)
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
continue;
- __bdi_start_writeback(bdi, nr_pages, false);
+ __bdi_start_writeback(bdi, nr_pages, false, false, NULL);
}
rcu_read_unlock();
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4ce34fa..f794604 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -103,7 +103,8 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
int bdi_setup_and_register(struct backing_dev_info *, char *, unsigned int);
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void bdi_start_background_writeback(struct backing_dev_info *bdi,
+ struct mem_cgroup *mem_cgroup);
int bdi_writeback_thread(void *data);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
void bdi_arm_supers_timer(void);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 25dc077..41ba94f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1489,7 +1489,7 @@ void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
* mem_cgroup_bg_writeback_done().
*/
css_get(&memcg->css);
- bdi_start_background_writeback(bdi);
+ bdi_start_background_writeback(bdi, memcg);
}

/* continue walking up hierarchy enabled parents */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f6a8dd6..492c3db 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -597,7 +597,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
if ((laptop_mode && pages_written) ||
(!laptop_mode && (nr_reclaimable > background_thresh)))
- bdi_start_background_writeback(bdi);
+ bdi_start_background_writeback(bdi, NULL);
}

void set_page_dirty_balance(struct page *page, int page_mkwrite)
--
1.7.3.1

2011-03-11 18:47:35

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

If the current process is in a non-root memcg, then
balance_dirty_pages() will consider the memcg dirty limits as well as
the system-wide limits. This allows different cgroups to have distinct
dirty limits which trigger direct and background writeback at different
levels.

If called with a mem_cgroup, then throttle_vm_writeout() queries the
given cgroup for its dirty memory usage limits.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Wu Fengguang <[email protected]>
---
Changelog since v5:
- Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
cramming the somewhat different logic into balance_dirty_pages(). This means
the global (non-memcg) dirty limits are not passed around in the
struct dirty_info, so there's less change to existing code.

Changelog since v4:
- Added missing 'struct mem_cgroup' forward declaration in writeback.h.
- Made throttle_vm_writeout() memcg aware.
- Removed previously added dirty_writeback_pages() which is no longer needed.
- Added logic to balance_dirty_pages() to throttle if over foreground memcg
limit.

Changelog since v3:
- Leave determine_dirtyable_memory() static. v3 made it non-static.
- balance_dirty_pages() now considers both system and memcg dirty limits and
usage data. This data is retrieved with global_dirty_info() and
memcg_dirty_info().

include/linux/writeback.h | 3 ++-
mm/page-writeback.c | 34 ++++++++++++++++++++++++++++------
mm/vmscan.c | 2 +-
3 files changed, 31 insertions(+), 8 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0ead399..a45d895 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -8,6 +8,7 @@
#include <linux/fs.h>

struct backing_dev_info;
+struct mem_cgroup;

extern spinlock_t inode_lock;

@@ -92,7 +93,7 @@ void laptop_mode_timer_fn(unsigned long data);
#else
static inline void laptop_sync_completion(void) { }
#endif
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup);

/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index d8005b0..f6a8dd6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -473,7 +473,8 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
* If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
+ * perform some writeout. The current task may have per-memcg dirty
+ * limits, which are also checked.
*/
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
@@ -488,6 +489,8 @@ static void balance_dirty_pages(struct address_space *mapping,
bool dirty_exceeded = false;
struct backing_dev_info *bdi = mapping->backing_dev_info;

+ mem_cgroup_balance_dirty_pages(mapping, write_chunk);
+
for (;;) {
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
@@ -651,23 +654,42 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+/*
+ * Throttle the current task if it is near dirty memory usage limits. Both
+ * global dirty memory limits and (if @mem_cgroup is given) per-cgroup dirty
+ * memory limits are checked.
+ *
+ * If near limits, then wait for usage to drop. Dirty usage should drop because
+ * dirty producers should have used balance_dirty_pages(), which would have
+ * scheduled writeback.
+ */
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
{
unsigned long background_thresh;
unsigned long dirty_thresh;
+ struct dirty_info memcg_info;
+ bool do_memcg;

for ( ; ; ) {
global_dirty_limits(&background_thresh, &dirty_thresh);
+ do_memcg = mem_cgroup && mem_cgroup_hierarchical_dirty_info(
+ determine_dirtyable_memory(), true, mem_cgroup,
+ &memcg_info);

/*
* Boost the allowable dirty threshold a bit for page
* allocators so they don't get DoS'ed by heavy writers
*/
dirty_thresh += dirty_thresh / 10; /* wheeee... */
-
- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
+ if (do_memcg)
+ memcg_info.dirty_thresh += memcg_info.dirty_thresh / 10;
+
+ if ((global_page_state(NR_UNSTABLE_NFS) +
+ global_page_state(NR_WRITEBACK) <= dirty_thresh) &&
+ (!do_memcg ||
+ (memcg_info.nr_unstable_nfs +
+ memcg_info.nr_writeback <= memcg_info.dirty_thresh)))
+ break;
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..035d2ea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1939,7 +1939,7 @@ restart:
sc->nr_scanned - nr_scanned, sc))
goto restart;

- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
}

/*
--
1.7.3.1

2011-03-12 01:11:23

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, 11 Mar 2011 10:43:22 -0800
Greg Thelen <[email protected]> wrote:

>
> ...
>
> This patch set provides the ability for each cgroup to have independent dirty
> page limits.

Here, it would be helpful to describe the current kernel behaviour.
And to explain what is wrong with it and why the patch set improves
things!

>
> ...
>
> Known shortcomings (see the patch 1/9 update to Documentation/cgroups/memory.txt
> for more details):
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
> writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not
> just inodes contributing dirty pages to the cgroup exceeding its limit.

This is a pretty large shortcoming, I suspect. Will it be addressed?

There's a risk that a poorly (or maliciously) configured memcg could
have a pretty large effect upon overall system behaviour. Would
elevated permissions be needed to do this?

We could just crawl the memcg's page LRU and bring things under control
that way, couldn't we? That would fix it. What were the reasons for
not doing this?

> - A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
> different memcg.

Please describe this scenario in (a lot) more detail?

2011-03-14 14:50:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 1/9] memcg: document cgroup dirty memory interfaces

On Fri, Mar 11, 2011 at 10:43:23AM -0800, Greg Thelen wrote:
> Document cgroup dirty memory interfaces and statistics.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Balbir Singh <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-03-14 14:56:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 3/9] memcg: add dirty page accounting infrastructure

On Fri, Mar 11, 2011 at 10:43:25AM -0800, Greg Thelen wrote:
> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-03-14 15:10:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] memcg: add kernel calls for memcg dirty page stats

On Fri, Mar 11, 2011 at 10:43:26AM -0800, Greg Thelen wrote:
> Add calls into memcg dirty page accounting. Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Daisuke Nishimura <[email protected]>
> ---
> Changelog since v5:
> - moved accounting site in test_clear_page_writeback() and
> test_set_page_writeback().
>
> fs/nfs/write.c | 4 ++++
> mm/filemap.c | 1 +
> mm/page-writeback.c | 10 ++++++++--
> mm/truncate.c | 1 +
> 4 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 42b92d7..7863777 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> NFS_PAGE_TAG_COMMIT);
> nfsi->ncommit++;
> spin_unlock(&inode->i_lock);
> + mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
> inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -462,6 +463,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> struct page *page = req->wb_page;
>
> if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
> dec_zone_page_state(page, NR_UNSTABLE_NFS);
> dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> return 1;
> @@ -1319,6 +1321,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> req = nfs_list_entry(head->next);
> nfs_list_remove_request(req);
> nfs_mark_request_commit(req);
> + mem_cgroup_dec_page_stat(req->wb_page,
> + MEMCG_NR_FILE_UNSTABLE_NFS);
> dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a6cfecf..7e751fe 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -143,6 +143,7 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 632b464..d8005b0 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1118,6 +1118,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -1317,6 +1318,7 @@ int clear_page_dirty_for_io(struct page *page)
> * for more comments.
> */
> if (TestClearPageDirty(page)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> @@ -1352,8 +1354,10 @@ int test_clear_page_writeback(struct page *page)
> } else {
> ret = TestClearPageWriteback(page);
> }
> - if (ret)
> + if (ret) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
> dec_zone_page_state(page, NR_WRITEBACK);
> + }
> return ret;
> }
>
> @@ -1386,8 +1390,10 @@ int test_set_page_writeback(struct page *page)
> } else {
> ret = TestSetPageWriteback(page);
> }
> - if (!ret)
> + if (!ret) {
> + mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
> account_page_writeback(page);
> + }
> return ret;
>
> }

At least in mainline, NR_WRITEBACK is handled in the following places.

1) increase

* account_page_writeback

2) decrease

* test_clear_page_writeback
* __nilfs_end_page_io

I think account_page_writeback() is a good place to add your accounting for the
increment side. The problem is the decrement. Normally we can handle the decrement
in test_clear_page_writeback(), but I am not sure it's okay for __nilfs_end_page_io().
If __nilfs_end_page_io() is right as it is, then it should also call
mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK).

What do you think about it?



--
Kind regards,
Minchan Kim

2011-03-14 15:16:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits

On Fri, Mar 11, 2011 at 10:43:28AM -0800, Greg Thelen wrote:
> Add cgroupfs interface to memcg dirty page limits:
> Direct write-out is controlled with:
> - memory.dirty_ratio
> - memory.dirty_limit_in_bytes
>
> Background write-out is controlled with:
> - memory.dirty_background_ratio
> - memory.dirty_background_limit_in_bytes
>
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
> and 'G' suffixes for byte counts. This patch provides the
> same functionality for memory.dirty_limit_in_bytes and
> memory.dirty_background_limit_in_bytes.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Balbir Singh <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-03-14 17:54:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits as well as
> the system-wide limits. This allows different cgroups to have distinct
> dirty limits which trigger direct and background writeback at different
> levels.
>
> If called with a mem_cgroup, then throttle_vm_writeout() queries the
> given cgroup for its dirty memory usage limits.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Wu Fengguang <[email protected]>
> ---
> Changelog since v5:
> - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
> cramming the somewhat different logic into balance_dirty_pages(). This means
> the global (non-memcg) dirty limits are not passed around in the
> struct dirty_info, so there's less change to existing code.

Yes there is less change to existing code but now we also have a separate
throttling logic for cgroups.

I thought that we are moving in the direction of IO less throttling
where bdi threads always do the IO and Jan Kara also implemented the
logic to distribute the finished IO pages uniformly across the waiting
threads.

Keeping it separate for cgroups, reduces the complexity but also forks
off the balancing logic for root and other cgroups. So if Jan Kara's
changes go in, it automatically does not get used for memory cgroups.

Not sure how good an idea it is to use a separate throttling logic
for non-root cgroups.

Thanks
Vivek

2011-03-14 18:00:13

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Mon, Mar 14, 2011 at 01:54:08PM -0400, Vivek Goyal wrote:
> On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
> > If the current process is in a non-root memcg, then
> > balance_dirty_pages() will consider the memcg dirty limits as well as
> > the system-wide limits. This allows different cgroups to have distinct
> > dirty limits which trigger direct and background writeback at different
> > levels.
> >
> > If called with a mem_cgroup, then throttle_vm_writeout() queries the
> > given cgroup for its dirty memory usage limits.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > Signed-off-by: Greg Thelen <[email protected]>
> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Acked-by: Wu Fengguang <[email protected]>
> > ---
> > Changelog since v5:
> > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
> > cramming the somewhat different logic into balance_dirty_pages(). This means
> > the global (non-memcg) dirty limits are not passed around in the
> > struct dirty_info, so there's less change to existing code.
>
> Yes there is less change to existing code but now we also have a separate
> throttling logic for cgroups.
>
> I thought that we are moving in the direction of IO less throttling
> where bdi threads always do the IO and Jan Kara also implemented the
> logic to distribute the finished IO pages uniformly across the waiting
> threads.
>
> Keeping it separate for cgroups, reduces the complexity but also forks
> off the balancing logic for root and other cgroups. So if Jan Kara's
> changes go in, it automatically does not get used for memory cgroups.
>
> Not sure how good an idea it is to use a separate throttling logic
> for non-root cgroups.

One more disadvantage seems to be that the new logic does not take into
account the dirty ratio of each task. So a fast writer in a non-root
cgroup might consume most of the quota of that cgroup and starve the
other tasks in the same cgroup.

So keeping a separate logic for cgroup throttling does simplify things
and achieves the immediate goal of limiting the memory usage of
a cgroup, but I suspect that down the line the root cgroup and non-root cgroup
logic will keep drifting further apart.

Thanks
Vivek

2011-03-14 18:29:41

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, Mar 11, 2011 at 5:10 PM, Andrew Morton
<[email protected]> wrote:
> On Fri, 11 Mar 2011 10:43:22 -0800
> Greg Thelen <[email protected]> wrote:
>
>>
>> ...
>>
>> This patch set provides the ability for each cgroup to have independent dirty
>> page limits.
>
> Here, it would be helpful to describe the current kernel behaviour.
> And to explain what is wrong with it and why the patch set improves
> things!

Good question.

The memcg dirty limits offer similar value to a cgroup as the global dirty
limits offer to the system.

Prior to this patch series, memcg had neither dirty limits nor background
reclaim. So when a memcg hit its limit_in_bytes it would enter memcg direct
reclaim. If the memcg memory was mostly dirty, then direct reclaim would be
slow, so page allocation latency would stink. By placing limits on the portion
of cgroup memory that can be dirty, page allocation latencies are improved.
These patches provide more predictable page allocation latencies. Another
benefit is the ability to put a heavy dirtier in a small dirty memory jail that
is smaller than the system-wide dirty limits.

dd if=/dev/zero of=/data/input bs=1M count=50
sync
echo 3 > /proc/sys/vm/drop_caches
mkdir /dev/cgroup/memory/A
echo $$ > /dev/cgroup/memory/A/tasks
echo 100M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
echo 200M > /dev/cgroup/memory/A/memory.limit_in_bytes
dd if=/dev/zero of=/data/output bs=1M &
dd if=/data/input of=/dev/null bs=1M

If the setting of memory.dirty_limit_in_bytes is omitted (as if this patch
series was not available), then the dd writer is able to fill the cgroup with
dirty memory increasing the page allocation latency for the dd reader.

With this sample, the dd reader sees roughly a 1.4x throughput difference
(6.5 MB/s with the dirty limit vs 4.5 MB/s if setting
A/memory.dirty_limit_in_bytes is omitted).

>>
>> ...
>>
>> Known shortcomings (see the patch 1/9 update to Documentation/cgroups/memory.txt
>> for more details):
>> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
>> writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not
>> just inodes contributing dirty pages to the cgroup exceeding its limit.
>
> This is a pretty large shortcoming, I suspect. Will it be addressed?

Yes. The two issues in this email are my next priorities for memcg dirty
limits.

> There's a risk that a poorly (or maliciously) configured memcg could
> have a pretty large effect upon overall system behaviour. Would
> elevated permissions be needed to do this?

Such an effect is possible. But root permissions are required to create it
because the new memory.dirty* limit files are 0644; this is similar to
/proc/sys/vm/dirty*. I suspect it would be easier to trash system performance
with sync().

> We could just crawl the memcg's page LRU and bring things under control
> that way, couldn't we? That would fix it. What were the reasons for
> not doing this?

My rationale for pursuing bdi writeback was I/O locality. I have heard that
per-page I/O has bad locality. Per inode bdi-style writeback should have better
locality.

My hunch is the best solution is a hybrid which uses a) bdi writeback with a
target memcg filter and b) using the memcg lru as a fallback to identify the bdi
that needed writeback. I think the part a) memcg filtering is likely something
like:
http://marc.info/?l=linux-kernel&m=129910424431837

The part b) bdi selection should not be too hard assuming that page-to-mapping
locking is doable.

An alternative approach is to bind each inode to exactly one cgroup (possibly
the root cgroup). Both the cache page allocations and dirtying charges would be
accounted to the i_cgroup. With this approach there is no foreign dirtier issue
because all pages are in a single cgroup. I find this undesirable because the
first memcg to touch an inode is charged for all pages later cached even by
other memcg.

>> - A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
>> different memcg.
>
> Please describe this scenario in (a lot) more detail?

The documentation in patch 1/9 discusses this issue somewhat:
A cgroup may contain more dirty memory than its dirty limit. This is possible
because of the principle that the first cgroup to touch a page is charged for
it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
counted to the originally charged cgroup. Example: If page is allocated by a
cgroup A task, then the page is charged to cgroup A. If the page is later
dirtied by a task in cgroup B, then the cgroup A dirty count will be
incremented. If cgroup A is over its dirty limit but cgroup B is not, then
dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
limit without throttling the dirtying cgroup B task.

Here are some additional thoughts on this foreign dirtier issue:

When a page is allocated it is charged to the current task's memcg. When a
memcg page is later marked dirty the dirty charge is billed to the memcg from
the original page allocation. The billed memcg may be different than the
dirtying task's memcg.
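
At the code level, this is roughly what the dirty accounting hook ends up doing
(names simplified, locking omitted; this is not the literal patch code):

        static void sketch_account_page_dirtied(struct page *page)
        {
                /* the memcg recorded when the page was first charged (cgroup A) */
                struct page_cgroup *pc = lookup_page_cgroup(page);
                struct mem_cgroup *memcg = pc->mem_cgroup;

                /*
                 * current's memcg (possibly cgroup B) is never consulted here,
                 * so the dirty count always lands on the original charger.
                 */
                this_cpu_inc(memcg->stat->count[MEMCG_NR_FILE_DIRTY]);
        }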

After a rate limited number of file backed pages have been dirtied,
balance_dirty_pages() is called to enforce dirty limits by a) throttling
production of more dirty pages by current and b) queuing background writeback to
the current bdi.

balance_dirty_pages() receives a mapping and page count, which indicate what may
have been dirtied and the max number of pages that may have been dirtied. Due
to per cpu rate limiting and batching (when nr_pages_dirtied > 0),
balance_dirty_pages() does not know which memcg were charged for recently dirty
pages.
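
For reference, the batching looks roughly like this (abridged from
balance_dirty_pages_ratelimited_nr() in current mainline; thresholds and
details trimmed):

        void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
                                                unsigned long nr_pages_dirtied)
        {
                unsigned long *p;

                preempt_disable();
                /* per-cpu counter: no record of which memcg were charged */
                p = &__get_cpu_var(bdp_ratelimits);
                *p += nr_pages_dirtied;
                if (unlikely(*p >= ratelimit_pages)) {
                        *p = 0;
                        preempt_enable();
                        /* by now the batched dirtyings may span many memcg */
                        balance_dirty_pages(mapping, nr_pages_dirtied);
                        return;
                }
                preempt_enable();
        }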

I think both bdi and system limits have the same issue in that a bdi may be
pushed over its dirty limit but not immediately checked due to rate limits. If
future dirtied pages are backed by different bdi, then future
balance_dirty_page() calls will check the second, compliant bdi ignoring the
first, over-limit bdi. The safety net is that the system wide limits are also
checked in balance_dirty_pages. However, per bdi writeback is employed in this
situation.

Note: This memcg foreign dirtier issue does not make it any more likely that a
memcg is pushed above its usage limit (limit_in_bytes). The only limit with
this weak contract is the dirty limit.

For reference, this issue was touched on in
http://marc.info/?l=linux-mm&m=128840780125261

There are ways to handle this issue (my preferred option is option #1).

1) keep a (global?) foreign_dirtied_memcg list of memcg that were recently
charged for dirty pages by tasks outside of memcg. When a memcg dirty page
count is elevated, the page's memcg would be queued to the list if current's
memcg does not match the page's cgroup. mem_cgroup_balance_dirty_pages()
would balance the current memcg and each memcg it dequeues from this list.
This should be a straightforward fix (see the sketch after this list).

2) When pages are dirtied, migrate them to the current task's memcg.
mem_cgroup_balance_dirty_pages would then have a better chance at seeing all
pages dirtied by the current operation. This is still not a perfect solution
due to rate limiting. This also is bad because such a migration would
involve charging and possibly memcg direct reclaim because the destination
memcg may be at its memory usage limit. Doing all of this in
account_page_dirtied() seems like trouble, so I do not like this approach.

3) Pass in some context which represents a set of pages recently dirtied into
[mem_cgroup]_balance_dirty_pages. What would be a good context to collect
the set of memcg that should be balanced?
- an extra passed in parameter - yuck.
- address_space extension - does not feel quite right because an address space
is not an io context object; I presume it can be shared by concurrent
threads.
- something hanging on current. Are there cases where pages become dirty
that are not followed by a call to balance_dirty_pages()? Note: this option
(3) is not a good idea because rate limiting makes dirty limit enforcement
an inexact science. There is no guarantee that a caller will have context
describing the pages (or bdis) recently dirtied.
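
For option 1, here is a rough sketch of the idea. The names are hypothetical,
the foreign_dirtied list_head in struct mem_cgroup does not exist today, and
locking/refcounting is hand-waved:

        static LIST_HEAD(foreign_dirtied_memcgs);
        static DEFINE_SPINLOCK(foreign_dirtied_lock);

        /* called when a dirty count is charged to a memcg other than current's */
        static void memcg_note_foreign_dirtier(struct mem_cgroup *page_memcg)
        {
                spin_lock(&foreign_dirtied_lock);
                /* an unlinked (initialized) list_head means "not queued yet" */
                if (list_empty(&page_memcg->foreign_dirtied))
                        list_add_tail(&page_memcg->foreign_dirtied,
                                      &foreign_dirtied_memcgs);
                spin_unlock(&foreign_dirtied_lock);
        }

mem_cgroup_balance_dirty_pages() would then balance current's memcg and drain
this list, balancing each memcg it dequeues. The global lock is the obvious
cost of this approach.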

2011-03-14 20:24:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:

[..]
> > We could just crawl the memcg's page LRU and bring things under control
> > that way, couldn't we? That would fix it. What were the reasons for
> > not doing this?
>
> My rationale for pursuing bdi writeback was I/O locality. I have heard that
> per-page I/O has bad locality. Per inode bdi-style writeback should have better
> locality.
>
> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> that needed writeback. I think the part a) memcg filtering is likely something
> like:
> http://marc.info/?l=linux-kernel&m=129910424431837
>
> The part b) bdi selection should not be too hard assuming that page-to-mapping
> locking is doable.

Greg,

IIUC, option b) seems to be going through pages of particular memcg and
mapping page to inode and start writeback on particular inode?

If yes, this might be reasonably good. In the case when cgroups are not
sharing inodes then it automatically maps one inode to one cgroup and
once a cgroup is over its limit, it starts writeback of its own inodes.

In case an inode is shared, then we get the case of one cgroup writing
back the pages of another cgroup. Well I guess that can also be handled
by the flusher thread, where a bunch or group of pages can be compared with
the cgroup passed in the writeback structure. I guess that might hurt us
more than benefit us.

IIUC how option b) works then we don't even need option a) where an N level
deep cache is maintained?

Thanks
Vivek

2011-03-14 21:10:11

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Mon 14-03-11 13:54:08, Vivek Goyal wrote:
> On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
> > If the current process is in a non-root memcg, then
> > balance_dirty_pages() will consider the memcg dirty limits as well as
> > the system-wide limits. This allows different cgroups to have distinct
> > dirty limits which trigger direct and background writeback at different
> > levels.
> >
> > If called with a mem_cgroup, then throttle_vm_writeout() queries the
> > given cgroup for its dirty memory usage limits.
> >
> > Signed-off-by: Andrea Righi <[email protected]>
> > Signed-off-by: Greg Thelen <[email protected]>
> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Acked-by: Wu Fengguang <[email protected]>
> > ---
> > Changelog since v5:
> > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
> > cramming the somewhat different logic into balance_dirty_pages(). This means
> > the global (non-memcg) dirty limits are not passed around in the
> > struct dirty_info, so there's less change to existing code.
>
> Yes there is less change to existing code but now we also have a separate
> throttling logic for cgroups.
>
> I thought that we are moving in the direction of IO less throttling
> where bdi threads always do the IO and Jan Kara also implemented the
> logic to distribute the finished IO pages uniformly across the waiting
> threads.
Yes, we'd like to avoid doing IO from balance_dirty_pages(). But if the
logic in the cgroup-specific part doesn't get too fancy (which doesn't seem
to be the case currently), it shouldn't be too hard to convert it to the new
approach.

We can talk about it at LSF but at least with my approach to IO-less
balance_dirty_pages() it would be easy to convert cgroups throttling to
the new way. With Fengguang's approach it might be a bit harder since he
computes a throughput and from that the necessary delay for a throttled task
but with cgroups that is impossible to compute so he'd have to add some
looping if we didn't write enough pages from the cgroup yet. But still it
would be reasonably doable AFAICT.

> Keeping it separate for cgroups, reduces the complexity but also forks
> off the balancing logic for root and other cgroups. So if Jan Kara's
> changes go in, it automatically does not get used for memory cgroups.
>
> Not sure how good an idea it is to use a separate throttling logic
> for non-root cgroups.
Yeah, it looks a bit odd. I'd think that we could just cap
task_dirty_limit() by a value computed from a cgroup limit and be done
with that, but I'm probably missing something... Sure, there is also a different
background limit but that's broken anyway because a flusher thread will
quickly stop doing writeback if the global background limit is not exceeded.
But that's a separate topic so I'll reply with this to a more appropriate
email ;)
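
To be concrete about the capping idea, I mean roughly the following in
balance_dirty_pages(), where memcg_dirty_thresh() is a made-up helper returning
a threshold derived from the cgroup limit:

        bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
        bdi_thresh = task_dirty_limit(current, bdi_thresh);
        /* made-up part: also clamp by what the task's memcg allows */
        bdi_thresh = min(bdi_thresh, memcg_dirty_thresh(current));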

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-15 02:02:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, 14 Mar 2011 11:29:17 -0700
Greg Thelen <[email protected]> wrote:

> On Fri, Mar 11, 2011 at 5:10 PM, Andrew Morton

> My rationale for pursuing bdi writeback was I/O locality. I have heard that
> per-page I/O has bad locality. Per inode bdi-style writeback should have better
> locality.
>
> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> that needed writeback. I think the part a) memcg filtering is likely something
> like:
> http://marc.info/?l=linux-kernel&m=129910424431837
>
> The part b) bdi selection should not be too hard assuming that page-to-mapping
> locking is doable.
>

For now, I like b).

> An alternative approach is to bind each inode to exactly one cgroup (possibly
> the root cgroup). Both the cache page allocations and dirtying charges would be
> accounted to the i_cgroup. With this approach there is no foreign dirtier issue
> because all pages are in a single cgroup. I find this undesirable because the
> first memcg to touch an inode is charged for all pages later cached even by
> other memcg.
>

I don't think 'foreign dirtier' is a big problem. When a program does write(),
the file being written tends to be under the control of the application in
the cgroup. I don't think a 'written file' is shared between several cgroups,
typically. But /var/log/messages is a shared one ;)

But I know some other OSs have a 'group for file cache'. I'll not nack it if you
propose such a patch. Maybe there are some guys who want to limit the amount of
file cache.



> When a page is allocated it is charged to the current task's memcg. When a
> memcg page is later marked dirty the dirty charge is billed to the memcg from
> the original page allocation. The billed memcg may be different than the
> dirtying task's memcg.
>
yes.

> After a rate limited number of file backed pages have been dirtied,
> balance_dirty_pages() is called to enforce dirty limits by a) throttling
> production of more dirty pages by current and b) queuing background writeback to
> the current bdi.
>
> balance_dirty_pages() receives a mapping and page count, which indicate what may
> have been dirtied and the max number of pages that may have been dirtied. Due
> to per cpu rate limiting and batching (when nr_pages_dirtied > 0),
> balance_dirty_pages() does not know which memcg were charged for recently dirty
> pages.
>
> I think both bdi and system limits have the same issue in that a bdi may be
> pushed over its dirty limit but not immediately checked due to rate limits. If
> future dirtied pages are backed by different bdi, then future
> balance_dirty_page() calls will check the second, compliant bdi ignoring the
> first, over-limit bdi. The safety net is that the system wide limits are also
> checked in balance_dirty_pages. However, per bdi writeback is employed in this
> situation.
>
> Note: This memcg foreign dirtier issue does not make it any more likely that a
> memcg is pushed above its usage limit (limit_in_bytes). The only limit with
> this weak contract is the dirty limit.
>
> For reference, this issue was touched on in
> http://marc.info/?l=linux-mm&m=128840780125261
>
> There are ways to handle this issue (my preferred option is option #1).
>
> 1) keep a (global?) foreign_dirtied_memcg list of memcg that were recently
> charged for dirty pages by tasks outside of memcg. When a memcg dirty page
> count is elevated, the page's memcg would be queued to the list if current's
> memcg does not match the page's cgroup. mem_cgroup_balance_dirty_pages()
> would balance the current memcg and each memcg it dequeues from this list.
> This should be a straightforward fix.
>

Can you implement this in an efficient way ? (without taking any locks ?)
It seems cost > benefit.



> 2) When pages are dirtied, migrate them to the current task's memcg.
> mem_cgroup_balance_dirty_pages would then have a better chance at seeing all
> pages dirtied by the current operation. This is still not a perfect solution
> due to rate limiting. This also is bad because such a migration would
> involve charging and possibly memcg direct reclaim because the destination
> memcg may be at its memory usage limit. Doing all of this in
> account_page_dirtied() seems like trouble, so I do not like this approach.
>

I think this cannot be implemented in an efficient way.



> 3) Pass in some context which represents a set of pages recently dirtied into
> [mem_cgroup]_balance_dirty_pages. What would be a good context to collect
> the set of memcg that should be balanced?
> - an extra passed in parameter - yuck.
> - address_space extension - does not feel quite right because an address space
> is not an io context object; I presume it can be shared by concurrent
> threads.
> - something hanging on current. Are there cases where pages become dirty
> that are not followed by a call to balance_dirty_pages()? Note: this option
> (3) is not a good idea because rate limiting makes dirty limit enforcement
> an inexact science. There is no guarantee that a caller will have context
> describing the pages (or bdis) recently dirtied.
>

I'd like to have an option 'cgroup only for file cache' rather than adding more
hooks and complicated operations.

But, if we need to record 'who dirtied ?' information, record it in page_cgroup
or radix-tree and do filtering is what I can consider, now.
In this case, some tracking information will be required to be stored into
struct inode, too.


How about this ?

1. record 'who dirtied memcg' into page_cgroup or radix-tree.
I prefer recording in radix-tree rather than using more fields in page_cgroup.
2. bdi-writeback does some extra queueing operation per memcg.
find a page, check 'who dirtied', enqueue it (using page_cgroup or a list of pagevecs)
3. write back its own queue (this can be done before 2. if the cgroup already has a queue)
4. Some GC may be required...
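
As a very rough illustration of 1.-2. (everything below is hypothetical):

        /* per-memcg backlog the flusher would fill while scanning dirty pages */
        struct memcg_wb_queue {
                struct list_head        items;  /* page_cgroups or pagevecs */
                struct mem_cgroup       *owner;
        };

The 'who dirtied' record (a page_cgroup flag or a radix-tree entry) decides
which queue a page goes to, and 3. then writes each queue back on behalf of
its own memcg.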

Thanks,
-Kame

2011-03-15 02:41:38

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
>
> [..]
>> > We could just crawl the memcg's page LRU and bring things under control
>> > that way, couldn't we? That would fix it. What were the reasons for
>> > not doing this?
>>
>> My rationale for pursuing bdi writeback was I/O locality. I have heard that
>> per-page I/O has bad locality. Per inode bdi-style writeback should have better
>> locality.
>>
>> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
>> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
>> that needed writeback. I think the part a) memcg filtering is likely something
>> like:
>> http://marc.info/?l=linux-kernel&m=129910424431837
>>
>> The part b) bdi selection should not be too hard assuming that page-to-mapping
>> locking is doable.
>
> Greg,
>
> IIUC, option b) seems to be going through pages of particular memcg and
> mapping page to inode and start writeback on particular inode?

Yes.

> If yes, this might be reasonably good. In the case when cgroups are not
> sharing inodes then it automatically maps one inode to one cgroup and
> once a cgroup is over its limit, it starts writeback of its own inodes.
>
> In case an inode is shared, then we get the case of one cgroup writing
> back the pages of another cgroup. Well I guess that can also be handled
> by the flusher thread, where a bunch or group of pages can be compared with
> the cgroup passed in the writeback structure. I guess that might hurt us
> more than benefit us.

Agreed. For now just writing the entire inode is probably fine.

> IIUC how option b) works then we don't even need option a) where an N level
> deep cache is maintained?

Originally I was thinking that bdi-wide writeback with memcg filter
was a good idea. But this may be unnecessarily complex. Now I am
agreeing with you that option (a) may not be needed. Memcg could
queue per-inode writeback using the memcg lru to locate inodes
(lru->page->inode) with something like this in
[mem_cgroup_]balance_dirty_pages():

while (memcg_usage() >= memcg_fg_limit) {
        /* scan the memcg lru for a dirty page, then grab its mapping & inode */
        inode = memcg_dirty_inode(cg);
        sync_inode(inode, &wbc);
}

if (memcg_usage() >= memcg_bg_limit) {
        /* queue per-memcg bg flush work item */
}

Does this look sensible?

2011-03-15 02:51:55

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, Mar 14, 2011 at 6:56 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Mon, 14 Mar 2011 11:29:17 -0700
> Greg Thelen <[email protected]> wrote:
>
>> On Fri, Mar 11, 2011 at 5:10 PM, Andrew Morton
>
>> My rationale for pursuing bdi writeback was I/O locality. I have heard that
>> per-page I/O has bad locality. Per inode bdi-style writeback should have better
>> locality.
>>
>> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
>> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
>> that needed writeback. I think the part a) memcg filtering is likely something
>> like:
>> http://marc.info/?l=linux-kernel&m=129910424431837
>>
>> The part b) bdi selection should not be too hard assuming that page-to-mapping
>> locking is doable.
>>
>
> For now, I like b).
>
>> An alternative approach is to bind each inode to exactly one cgroup (possibly
>> the root cgroup). Both the cache page allocations and dirtying charges would be
>> accounted to the i_cgroup. With this approach there is no foreign dirtier issue
>> because all pages are in a single cgroup. I find this undesirable because the
>> first memcg to touch an inode is charged for all pages later cached even by
>> other memcg.
>>
>
> I don't think 'foreign dirtier' is a big problem. When program does write(),
> the file to be written is tend to be under control of the application in
> the cgroup. I don't think 'written file' is shared between several cgroups,
> typically. But /var/log/messages is a shared one ;)
>
> But I know some other OSs has 'group for file cache'. I'll not nack if you
> propose such patch. Maybe there are some guys who want to limit the amount of
> file cache.
>
>
>
>> When a page is allocated it is charged to the current task's memcg. When a
>> memcg page is later marked dirty the dirty charge is billed to the memcg from
>> the original page allocation. The billed memcg may be different than the
>> dirtying task's memcg.
>>
> yes.
>
>> After a rate limited number of file backed pages have been dirtied,
>> balance_dirty_pages() is called to enforce dirty limits by a) throttling
>> production of more dirty pages by current and b) queuing background writeback to
>> the current bdi.
>>
>> balance_dirty_pages() receives a mapping and page count, which indicate what may
>> have been dirtied and the max number of pages that may have been dirtied. Due
>> to per cpu rate limiting and batching (when nr_pages_dirtied > 0),
>> balance_dirty_pages() does not know which memcg were charged for recently dirty
>> pages.
>>
>> I think both bdi and system limits have the same issue in that a bdi may be
>> pushed over its dirty limit but not immediately checked due to rate limits. If
>> future dirtied pages are backed by different bdi, then future
>> balance_dirty_page() calls will check the second, compliant bdi ignoring the
>> first, over-limit bdi. The safety net is that the system wide limits are also
>> checked in balance_dirty_pages. However, per bdi writeback is employed in this
>> situation.
>>
>> Note: This memcg foreign dirtier issue does not make it any more likely that a
>> memcg is pushed above its usage limit (limit_in_bytes). The only limit with
>> this weak contract is the dirty limit.
>>
>> For reference, this issue was touched on in
>> http://marc.info/?l=linux-mm&m=128840780125261
>>
>> There are ways to handle this issue (my preferred option is option #1).
>>
>> 1) keep a (global?) foreign_dirtied_memcg list of memcg that were recently
>> charged for dirty pages by tasks outside of memcg. When a memcg dirty page
>> count is elevated, the page's memcg would be queued to the list if current's
>> memcg does not match the page's cgroup. mem_cgroup_balance_dirty_pages()
>> would balance the current memcg and each memcg it dequeues from this list.
>> This should be a straightforward fix.
>>
>
> Can you implement this in an efficient way ? (without taking any locks ?)
> It seems cost > benefit.

I am not sure either. But if we are willing to defer addressing the
foreign dirtier issue, then we can avoid this for now.

>> 2) When pages are dirtied, migrate them to the current task's memcg.
>> mem_cgroup_balance_dirty_pages would then have a better chance at seeing all
>> pages dirtied by the current operation. This is still not a perfect solution
>> due to rate limiting. This also is bad because such a migration would
>> involve charging and possibly memcg direct reclaim because the destination
>> memcg may be at its memory usage limit. Doing all of this in
>> account_page_dirtied() seems like trouble, so I do not like this approach.
>>
>
> I think this cannot be implemented in an efficient way.
>
>
>
>> 3) Pass in some context which represents a set of pages recently dirtied into
>> [mem_cgroup]_balance_dirty_pages. What would be a good context to collect
>> the set of memcg that should be balanced?
>> - an extra passed in parameter - yuck.
>> - address_space extension - does not feel quite right because an address space
>> is not an io context object; I presume it can be shared by concurrent
>> threads.
>> - something hanging on current. Are there cases where pages become dirty
>> that are not followed by a call to balance_dirty_pages()? Note: this option
>> (3) is not a good idea because rate limiting makes dirty limit enforcement
>> an inexact science. There is no guarantee that a caller will have context
>> describing the pages (or bdis) recently dirtied.
>>
>
> I'd like to have an option 'cgroup only for file cache' rather than adding more
> hooks and complicated operations.
>
> But, if we need to record 'who dirtied ?' information, record it in page_cgroup
> or radix-tree and do filtering is what I can consider, now.
> In this case, some tracking information will be required to be stored into
> struct inode, too.
>
>
> How about this ?
>
> 1. record 'who dirtied memcg' into page_cgroup or radix-tree.
> I prefer recording in radix-tree rather than using more fields in page_cgroup.
> 2. bdi-writeback does some extra queueing operation per memcg.
> find a page, check 'who dirtied', enqueue it (using page_cgroup or a list of pagevecs)
> 3. write back its own queue (this can be done before 2. if the cgroup already has a queue)
> 4. Some GC may be required...
>
> Thanks,
> -Kame
>
>

The foreign dirtier issue is all about identifying the memcg (or
possibly multiple bdi) that need balancing. If the foreign dirtier
issue is not important then we can focus on identifying inodes to
writeback that will lower the current's memcg dirty usage. I am fine
ignoring the foreign dirtier issue for now and breaking the problem
into smaller pieces.

I think this can be done without any additional state. Can we just scan
the memcg lru to find dirty file pages, and thus inodes to pass to
sync_inode() or some other per-inode writeback routine?
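
Roughly (memcg_first_dirty_page() below is hypothetical and the page/mapping
locking is hand-waved):

        static struct inode *memcg_dirty_inode(struct mem_cgroup *memcg)
        {
                /* scan the memcg file LRU for a dirty page */
                struct page *page = memcg_first_dirty_page(memcg);

                if (!page || !page->mapping)
                        return NULL;
                return igrab(page->mapping->host); /* pin inode for sync_inode() */
        }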

2011-03-15 03:01:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, 14 Mar 2011 19:51:22 -0700
Greg Thelen <[email protected]> wrote:

> On Mon, Mar 14, 2011 at 6:56 PM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Mon, 14 Mar 2011 11:29:17 -0700
> > Greg Thelen <[email protected]> wrote:
> >
> >> On Fri, Mar 11, 2011 at 5:10 PM, Andrew Morton
> The foreign dirtier issue is all about identifying the memcg (or
> possibly multiple bdi) that need balancing. If the foreign dirtier
> issue is not important then we can focus on identifying inodes to
> writeback that will lower the current's memcg dirty usage. I am fine
> ignoring the foreign dirtier issue for now and breaking the problem
> into smaller pieces.
>
ok.

> I think this can be done without any additional state. Can we just scan
> the memcg lru to find dirty file pages, and thus inodes to pass to
> sync_inode() or some other per-inode writeback routine?
>

I think it works, finding inodes to be cleaned by LRU scanning.

Thanks,
-Kame

2011-03-15 03:27:58

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Mon, Mar 14, 2011 at 2:10 PM, Jan Kara <[email protected]> wrote:
> On Mon 14-03-11 13:54:08, Vivek Goyal wrote:
>> On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
>> > If the current process is in a non-root memcg, then
>> > balance_dirty_pages() will consider the memcg dirty limits as well as
>> > the system-wide limits.  This allows different cgroups to have distinct
>> > dirty limits which trigger direct and background writeback at different
>> > levels.
>> >
>> > If called with a mem_cgroup, then throttle_vm_writeout() queries the
>> > given cgroup for its dirty memory usage limits.
>> >
>> > Signed-off-by: Andrea Righi <[email protected]>
>> > Signed-off-by: Greg Thelen <[email protected]>
>> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>> > Acked-by: Wu Fengguang <[email protected]>
>> > ---
>> > Changelog since v5:
>> > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
>> >   cramming the somewhat different logic into balance_dirty_pages().  This means
>> >   the global (non-memcg) dirty limits are not passed around in the
>> >   struct dirty_info, so there's less change to existing code.
>>
>> Yes there is less change to existing code but now we also have a separate
>> throttlig logic for cgroups.
>>
>> I thought that we are moving in the direction of IO less throttling
>> where bdi threads always do the IO and Jan Kara also implemented the
>> logic to distribute the finished IO pages uniformly across the waiting
>> threads.
>  Yes, we'd like to avoid doing IO from balance_dirty_pages(). But if the
> logic in cgroups specific part won't get too fancy (which it doesn't seem
> to be the case currently), it shouldn't be too hard to convert it to the new
> approach.

Handling the memcg hierarchy was something that was not trivial to implement
in mem_cgroup_balance_dirty_pages().

> We can talk about it at LSF but at least with my approach to IO-less
> balance_dirty_pages() it would be easy to convert cgroups throttling to
> the new way. With Fengguang's approach it might be a bit harder since he
> computes a throughput and from that necessary delay for a throttled task
> but with cgroups that is impossible to compute so he'd have to add some
> looping if we didn't write enough pages from the cgroup yet. But still it
> would be reasonable doable AFAICT.

I am definitely interested in finding a way to merge these features
cleanly together.

>> Keeping it separate for cgroups, reduces the complexity but also forks
>> off the balancing logic for root and other cgroups. So if Jan Kara's
>> changes go in, it automatically does not get used for memory cgroups.
>>
>> Not sure how good a idea it is to use a separate throttling logic for
>> for non-root cgroups.
>  Yeah, it looks a bit odd. I'd think that we could just cap
> task_dirty_limit() by a value computed from a cgroup limit and be done
> with that but I probably miss something...

That is an interesting idea. When looking at upstream balance_dirty_pages(),
the result of task_dirty_limit() is compared against bdi_nr_reclaimable and
bdi_nr_writeback. I think we should be comparing memcg usage to memcg limits
to catch cases where memcg usage is above memcg limits.
Or am I missing something in your approach?

> Sure there is also a different
> background limit but that's broken anyway because a flusher thread will
> quickly stop doing writeback if global background limit is not exceeded.
> But that's a separate topic so I'll reply with this to a more appropriate
> email ;)

;) I am also interested in this bg issue, but I should also try
to stay on topic.

>                                                                Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

2011-03-15 06:33:07

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] memcg: add kernel calls for memcg dirty page stats

On Mon, Mar 14, 2011 at 8:10 AM, Minchan Kim <[email protected]> wrote:
> On Fri, Mar 11, 2011 at 10:43:26AM -0800, Greg Thelen wrote:
>> Add calls into memcg dirty page accounting.  Notify memcg when pages
>> transition between clean, file dirty, writeback, and unstable nfs.
>> This allows the memory controller to maintain an accurate view of
>> the amount of its memory that is dirty.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> Signed-off-by: Andrea Righi <[email protected]>
>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>> Reviewed-by: Daisuke Nishimura <[email protected]>
>> ---
>> Changelog since v5:
>> - moved accounting site in test_clear_page_writeback() and
>>   test_set_page_writeback().
>>
>>  fs/nfs/write.c      |    4 ++++
>>  mm/filemap.c        |    1 +
>>  mm/page-writeback.c |   10 ++++++++--
>>  mm/truncate.c       |    1 +
>>  4 files changed, 14 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index 42b92d7..7863777 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page *req)
>>                        NFS_PAGE_TAG_COMMIT);
>>        nfsi->ncommit++;
>>        spin_unlock(&inode->i_lock);
>> +     mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
>>        inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>>        inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>>        __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>> @@ -462,6 +463,7 @@ nfs_clear_request_commit(struct nfs_page *req)
>>        struct page *page = req->wb_page;
>>
>>        if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
>> +             mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
>>               dec_zone_page_state(page, NR_UNSTABLE_NFS);
>>               dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>>               return 1;
>> @@ -1319,6 +1321,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
>>               req = nfs_list_entry(head->next);
>>               nfs_list_remove_request(req);
>>               nfs_mark_request_commit(req);
>> +             mem_cgroup_dec_page_stat(req->wb_page,
>> +                                      MEMCG_NR_FILE_UNSTABLE_NFS);
>>               dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>>               dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>>                            BDI_RECLAIMABLE);
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index a6cfecf..7e751fe 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -143,6 +143,7 @@ void __delete_from_page_cache(struct page *page)
>>         * having removed the page entirely.
>>         */
>>        if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> +             mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
>>               dec_zone_page_state(page, NR_FILE_DIRTY);
>>               dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>>        }
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 632b464..d8005b0 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1118,6 +1118,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>>  {
>>        if (mapping_cap_account_dirty(mapping)) {
>> +             mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
>>               __inc_zone_page_state(page, NR_FILE_DIRTY);
>>               __inc_zone_page_state(page, NR_DIRTIED);
>>               __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>> @@ -1317,6 +1318,7 @@ int clear_page_dirty_for_io(struct page *page)
>>                * for more comments.
>>                */
>>               if (TestClearPageDirty(page)) {
>> +                     mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
>>                       dec_zone_page_state(page, NR_FILE_DIRTY);
>>                       dec_bdi_stat(mapping->backing_dev_info,
>>                                    BDI_RECLAIMABLE);
>> @@ -1352,8 +1354,10 @@ int test_clear_page_writeback(struct page *page)
>>        } else {
>>               ret = TestClearPageWriteback(page);
>>        }
>> -     if (ret)
>> +     if (ret) {
>> +             mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
>>               dec_zone_page_state(page, NR_WRITEBACK);
>> +     }
>>        return ret;
>>  }
>>
>> @@ -1386,8 +1390,10 @@ int test_set_page_writeback(struct page *page)
>>        } else {
>>               ret = TestSetPageWriteback(page);
>>        }
>> -     if (!ret)
>> +     if (!ret) {
>> +             mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
>>               account_page_writeback(page);
>> +     }
>>        return ret;
>>
>>  }
>
> At least in mainline, NR_WRITEBACK handling codes are following as.
>
> 1) increase
>
>  * account_page_writeback
>
> 2) decrease
>
>  * test_clear_page_writeback
>  * __nilfs_end_page_io
>
> I think account_page_writeback is a good place to add your accounting function for the increase.
> The problem is the decrease. Normally we can handle the decrease in test_clear_page_writeback,
> but I am not sure it's okay in __nilfs_end_page_io.
> I think if __nilfs_end_page_io is right, __nilfs_end_page_io should call
> mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK).
>
> What do you think about it?
>
>
>
> --
> Kind regards,
> Minchan Kim
>

I would like to not have any special cases that avoid certain memory.
So I think your suggestion is good.
However, nilfs memcg dirty page accounting was skipped in a previous
memcg dirty limit effort due to complexity. See 'clone_page'
reference in:
http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/02997.html

I admit that I don't follow all of the nilfs code path, but it looks
like some of the nilfs pages are allocated but not charged to memcg.
There is code in mem_cgroup_update_page_stat() to gracefully handle
pages not associated with a memcg. So perhaps nilfs clone pages dirty
[un]charge could be attempted. I have not succeeded in exercising
these code paths in nilfs testing.
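If nilfs ever clears PG_writeback on its private clone pages directly
(rather than via test_clear_page_writeback()), a symmetric adjustment along
these lines might be needed. This is only a sketch of the idea, not actual
nilfs code, and clone_page is a placeholder name:

	/*
	 * Sketch only: mirror the accounting done in
	 * test_clear_page_writeback() wherever nilfs clears writeback on a
	 * clone page itself.  mem_cgroup_update_page_stat() already copes
	 * with pages that were never charged to a memcg.
	 */
	if (TestClearPageWriteback(clone_page)) {
		mem_cgroup_dec_page_stat(clone_page, MEMCG_NR_FILE_WRITEBACK);
		dec_zone_page_state(clone_page, NR_WRITEBACK);
	}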

2011-03-15 13:51:11

by Ryusuke Konishi

[permalink] [raw]
Subject: Re: [PATCH v6 4/9] memcg: add kernel calls for memcg dirty page stats

On Mon, 14 Mar 2011 23:32:38 -0700, Greg Thelen wrote:
> On Mon, Mar 14, 2011 at 8:10 AM, Minchan Kim <[email protected]> wrote:
>> On Fri, Mar 11, 2011 at 10:43:26AM -0800, Greg Thelen wrote:
>>> Add calls into memcg dirty page accounting.  Notify memcg when pages
>>> transition between clean, file dirty, writeback, and unstable nfs.
>>> This allows the memory controller to maintain an accurate view of
>>> the amount of its memory that is dirty.
>>>
>>> Signed-off-by: Greg Thelen <[email protected]>
>>> Signed-off-by: Andrea Righi <[email protected]>
>>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>>> Reviewed-by: Daisuke Nishimura <[email protected]>
<snip>
>>
>> At least in mainline, NR_WRITEBACK handling codes are following as.
>>
>> 1) increase
>>
>> * account_page_writeback
>>
>> 2) decrease
>>
>> * test_clear_page_writeback
>> * __nilfs_end_page_io
>>
>> I think account_page_writeback is a good place to add your accounting function for the increase.
>> The problem is the decrease. Normally we can handle the decrease in test_clear_page_writeback,
>> but I am not sure it's okay in __nilfs_end_page_io.
>> I think if __nilfs_end_page_io is right, __nilfs_end_page_io should call
>> mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK).
>>
>> What do you think about it?
>>
>>
>>
>> --
>> Kind regards,
>> Minchan Kim
>>
>
> I would like to not have any special cases that avoid certain memory.
> So I think your suggestion is good.
> However, nilfs memcg dirty page accounting was skipped in a previous
> memcg dirty limit effort due to complexity. See 'clone_page'
> reference in:
> http://lkml.indiana.edu/hypermail/linux/kernel/1003.0/02997.html
>
> I admit that I don't follow all of the nilfs code path, but it looks
> like some of the nilfs pages are allocated but not charged to memcg.
> There is code in mem_cgroup_update_page_stat() to gracefully handle
> pages not associated with a memcg. So perhaps nilfs clone pages dirty
> [un]charge could be attempted. I have not succeeded in exercising
> these code paths in nilfs testing.

Sorry for this matter. The clone_page code paths in nilfs are
exercised only when mmapped pages are written back.

I think the private page allocation used for the current clone_page
code should be altered to eliminate the root cause of these issues.
I would like to try to find some sort of alternative way.


Regards,
Ryusuke Konishi

2011-03-15 14:01:10

by Mike Heffner

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits

On 03/11/2011 01:43 PM, Greg Thelen wrote:
> Add cgroupfs interface to memcg dirty page limits:
> Direct write-out is controlled with:
> - memory.dirty_ratio
> - memory.dirty_limit_in_bytes
>
> Background write-out is controlled with:
> - memory.dirty_background_ratio
> - memory.dirty_background_limit_bytes


What's the overlap, if any, between the current memory limits controlled by
`memory.limit_in_bytes` and the above `memory.dirty_limit_in_bytes`? If
I want to fairly balance memory between two cgroups, one a dirty page
antagonist (dd) and the other an anonymous page user (memcache), do I just
set `memory.limit_in_bytes`? Does this patch simply provide a more
granular level of control of the dirty limits?


Thanks,

Mike

2011-03-15 16:21:36

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Mon, Mar 14, 2011 at 10:10:03PM +0100, Jan Kara wrote:
> On Mon 14-03-11 13:54:08, Vivek Goyal wrote:
> > On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
> > > If the current process is in a non-root memcg, then
> > > balance_dirty_pages() will consider the memcg dirty limits as well as
> > > the system-wide limits. This allows different cgroups to have distinct
> > > dirty limits which trigger direct and background writeback at different
> > > levels.
> > >
> > > If called with a mem_cgroup, then throttle_vm_writeout() queries the
> > > given cgroup for its dirty memory usage limits.
> > >
> > > Signed-off-by: Andrea Righi <[email protected]>
> > > Signed-off-by: Greg Thelen <[email protected]>
> > > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > > Acked-by: Wu Fengguang <[email protected]>
> > > ---
> > > Changelog since v5:
> > > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
> > > cramming the somewhat different logic into balance_dirty_pages(). This means
> > > the global (non-memcg) dirty limits are not passed around in the
> > > struct dirty_info, so there's less change to existing code.
> >
> > Yes there is less change to existing code but now we also have a separate
> > throttlig logic for cgroups.
> >
> > I thought that we are moving in the direction of IO less throttling
> > where bdi threads always do the IO and Jan Kara also implemented the
> > logic to distribute the finished IO pages uniformly across the waiting
> > threads.
> Yes, we'd like to avoid doing IO from balance_dirty_pages(). But if the
> logic in cgroups specific part won't get too fancy (which it doesn't seem
> to be the case currently), it shouldn't be too hard to convert it to the new
> approach.
>
> We can talk about it at LSF but at least with my approach to IO-less
> balance_dirty_pages() it would be easy to convert cgroups throttling to
> the new way. With Fengguang's approach it might be a bit harder since he
> computes a throughput and from that necessary delay for a throttled task
> but with cgroups that is impossible to compute so he'd have to add some
> looping if we didn't write enough pages from the cgroup yet. But still it
> would be reasonable doable AFAICT.
>
> > Keeping it separate for cgroups, reduces the complexity but also forks
> > off the balancing logic for root and other cgroups. So if Jan Kara's
> > changes go in, it automatically does not get used for memory cgroups.
> >
> > Not sure how good a idea it is to use a separate throttling logic for
> > for non-root cgroups.
> Yeah, it looks a bit odd. I'd think that we could just cap
> task_dirty_limit() by a value computed from a cgroup limit and be done
> with that but I probably miss something...

I think the previous implementation did something similar. Currently the dirty
limit is per_bdi/per_task. They made it per_cgroup/per_bdi/per_task. This
new version tries to simplify things by keeping the mem cgroup throttling
logic separate.

> Sure there is also a different
> background limit but that's broken anyway because a flusher thread will
> quickly stop doing writeback if global background limit is not exceeded.
> But that's a separate topic so I'll reply with this to a more appropriate
> email ;)

I think the last patch in the series (patch 9) takes care of that. In case of
mem_cgroup writeback, it forces the flusher thread to write until we are
below the background ratio of the cgroup (and not the global background ratio).

Thanks
Vivek

2011-03-15 18:50:03

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> >
> > [..]
> >> > We could just crawl the memcg's page LRU and bring things under control
> >> > that way, couldn't we?  That would fix it.  What were the reasons for
> >> > not doing this?
> >>
> >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> >> locality.
> >>
> >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> >> that needed writeback.  I think the part a) memcg filtering is likely something
> >> like:
> >>  http://marc.info/?l=linux-kernel&m=129910424431837
> >>
> >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> >> locking is doable.
> >
> > Greg,
> >
> > IIUC, option b) seems to be going through pages of particular memcg and
> > mapping page to inode and start writeback on particular inode?
>
> Yes.
>
> > If yes, this might be reasonably good. In the case when cgroups are not
> > sharing inodes then it automatically maps one inode to one cgroup and
> > once cgroup is over limit, it starts writebacks of its own inode.
> >
> > In case inode is shared, then we get the case of one cgroup writting
> > back the pages of other cgroup. Well I guess that also can be handeled
> > by flusher thread where a bunch or group of pages can be compared with
> > the cgroup passed in writeback structure. I guess that might hurt us
> > more than benefit us.
>
> Agreed. For now just writing the entire inode is probably fine.
>
> > IIUC how option b) works then we don't even need option a) where an N level
> > deep cache is maintained?
>
> Originally I was thinking that bdi-wide writeback with memcg filter
> was a good idea. But this may be unnecessarily complex. Now I am
> agreeing with you that option (a) may not be needed. Memcg could
> queue per-inode writeback using the memcg lru to locate inodes
> (lru->page->inode) with something like this in
> [mem_cgroup_]balance_dirty_pages():
>
> while (memcg_usage() >= memcg_fg_limit) {
>         inode = memcg_dirty_inode(cg); /* scan lru for a dirty page,
>                                           then grab mapping & inode */
>         sync_inode(inode, &wbc);
> }
>
> if (memcg_usage() >= memcg_bg_limit) {
>         queue per-memcg bg flush work item
> }

I think even for background writeback we shall have to implement some kind
of logic where inodes are selected by traversing the memcg->lru list, so that
for background writes we don't end up writing too many inodes from the root
or other groups in an attempt to meet the low background ratio of the memcg.

So to me it boils down to coming up with a new inode selection logic for
memcg which can be used both for background as well as foreground
writes. This will make sure we don't end up writing pages from
inodes we don't want to.

Though we shall also have to come up with some approximation so that,
if there are multiple inodes in the cgroup, we don't end up writing the
same inodes all the time while some inodes don't get written back at
all. Maybe skip a random number of pages from the beginning of the list
before selecting an inode; a rough sketch follows.
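A rough sketch of that selection, in which all of the memcg_* iteration
helpers are hypothetical and locking/reference counting are glossed over:

	/*
	 * Sketch only: pick a dirty inode for this memcg by walking its LRU,
	 * skipping a random number of pages first so we don't hammer the
	 * same inode every time.  memcg_lru_for_each_file_page() is a
	 * hypothetical iterator over the memcg's file LRU.
	 */
	static struct inode *memcg_pick_dirty_inode(struct mem_cgroup *memcg)
	{
		struct page *page;
		unsigned int skip = random32() % 1024;	/* arbitrary rotation */

		memcg_lru_for_each_file_page(memcg, page) {
			if (skip) {
				skip--;
				continue;
			}
			if (PageDirty(page) && page->mapping)
				return igrab(page->mapping->host);
		}
		return NULL;
	}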

This has the disadvantage that we are using a different logic for non-root
cgroups, but until we figure out how to retrieve inodes belonging
to a memory cgroup, it might not be a bad idea.

Thanks
Vivek

2011-03-15 21:24:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> >
> > [..]
> >> > We could just crawl the memcg's page LRU and bring things under control
> >> > that way, couldn't we?  That would fix it.  What were the reasons for
> >> > not doing this?
> >>
> >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> >> locality.
> >>
> >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> >> that needed writeback.  I think the part a) memcg filtering is likely something
> >> like:
> >>  http://marc.info/?l=linux-kernel&m=129910424431837
> >>
> >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> >> locking is doable.
> >
> > Greg,
> >
> > IIUC, option b) seems to be going through pages of particular memcg and
> > mapping page to inode and start writeback on particular inode?
>
> Yes.
>
> > If yes, this might be reasonably good. In the case when cgroups are not
> > sharing inodes then it automatically maps one inode to one cgroup and
> > once cgroup is over limit, it starts writebacks of its own inode.
> >
> > In case inode is shared, then we get the case of one cgroup writting
> > back the pages of other cgroup. Well I guess that also can be handeled
> > by flusher thread where a bunch or group of pages can be compared with
> > the cgroup passed in writeback structure. I guess that might hurt us
> > more than benefit us.
>
> Agreed. For now just writing the entire inode is probably fine.
>
> > IIUC how option b) works then we don't even need option a) where an N level
> > deep cache is maintained?
>
> Originally I was thinking that bdi-wide writeback with memcg filter
> was a good idea. But this may be unnecessarily complex. Now I am
> agreeing with you that option (a) may not be needed. Memcg could
> queue per-inode writeback using the memcg lru to locate inodes
> (lru->page->inode) with something like this in
> [mem_cgroup_]balance_dirty_pages():
>
> while (memcg_usage() >= memcg_fg_limit) {
>         inode = memcg_dirty_inode(cg); /* scan lru for a dirty page,
>                                           then grab mapping & inode */
>         sync_inode(inode, &wbc);
> }

Is it possible to pass the mem_cgroup in the writeback_control structure, or
in the work structure which in turn sets it in writeback_control? Then
writeback_inodes_wb() could check whether ->mem_cgroup is set and, instead
of calling queue_io(), call memcg_queue_io(), so that the memory cgroup can
look at its lru list and take its own decision on which inodes need to be
pushed for IO.
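A hedged sketch of that dispatch point; wbc->mem_cgroup is the assumed new
field and memcg_queue_io() is hypothetical, while queue_io() and
older_than_this already exist in fs/fs-writeback.c:

	/*
	 * Sketch only: let the memcg pick which dirty inodes move to b_io;
	 * otherwise fall back to the existing queue_io() behavior.
	 */
	static void wb_queue_io(struct bdi_writeback *wb,
				struct writeback_control *wbc)
	{
		if (wbc->mem_cgroup)				/* assumed new field */
			memcg_queue_io(wb, wbc->mem_cgroup);	/* hypothetical */
		else
			queue_io(wb, wbc->older_than_this);
	}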

Thanks
Vivek

2011-03-15 22:55:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 9/9] memcg: make background writeback memcg aware

On Fri, Mar 11, 2011 at 10:43:31AM -0800, Greg Thelen wrote:
> Add a memcg parameter to bdi_start_background_writeback(). If a memcg
> is specified then the resulting background writeback call to
> wb_writeback() will run until the memcg dirty memory usage drops below
> the memcg background limit. This is used when balancing memcg dirty
> memory with mem_cgroup_balance_dirty_pages().
>
> If the memcg parameter is not specified, then background writeback runs
> globally until system dirty memory usage falls below the system background
> limit.
>
> Signed-off-by: Greg Thelen <[email protected]>
> ---

[..]
> -static inline bool over_bground_thresh(void)
> +static inline bool over_bground_thresh(struct mem_cgroup *mem_cgroup)
> {
> unsigned long background_thresh, dirty_thresh;
>
> + if (mem_cgroup) {
> + struct dirty_info info;
> +
> + if (!mem_cgroup_hierarchical_dirty_info(
> + determine_dirtyable_memory(), false,
> + mem_cgroup, &info))
> + return false;
> +
> + return info.nr_file_dirty +
> + info.nr_unstable_nfs > info.background_thresh;
> + }
> +
> global_dirty_limits(&background_thresh, &dirty_thresh);
>
> return (global_page_state(NR_FILE_DIRTY) +
> @@ -683,7 +694,8 @@ static long wb_writeback(struct bdi_writeback *wb,
> * For background writeout, stop when we are below the
> * background dirty threshold
> */
> - if (work->for_background && !over_bground_thresh())
> + if (work->for_background &&
> + !over_bground_thresh(work->mem_cgroup))
> break;
>
> wbc.more_io = 0;
> @@ -761,23 +773,6 @@ static unsigned long get_nr_dirty_pages(void)
> get_nr_dirty_inodes();
> }
>
> -static long wb_check_background_flush(struct bdi_writeback *wb)
> -{
> - if (over_bground_thresh()) {
> -
> - struct wb_writeback_work work = {
> - .nr_pages = LONG_MAX,
> - .sync_mode = WB_SYNC_NONE,
> - .for_background = 1,
> - .range_cyclic = 1,
> - };
> -
> - return wb_writeback(wb, &work);
> - }
> -
> - return 0;
> -}
> -
> static long wb_check_old_data_flush(struct bdi_writeback *wb)
> {
> unsigned long expired;
> @@ -839,15 +834,17 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
> */
> if (work->done)
> complete(work->done);
> - else
> + else {
> + if (work->mem_cgroup)
> + mem_cgroup_bg_writeback_done(work->mem_cgroup);
> kfree(work);
> + }
> }
>
> /*
> * Check for periodic writeback, kupdated() style
> */
> wrote += wb_check_old_data_flush(wb);
> - wrote += wb_check_background_flush(wb);

Hi Greg,

So in the past we would leave the background work unfinished and try
to finish the queued work first.

I see following line in wb_writeback().

/*
* Background writeout and kupdate-style writeback may
* run forever. Stop them if there is other work to do
* so that e.g. sync can proceed. They'll be restarted
* after the other works are all done.
*/
if ((work->for_background || work->for_kupdate) &&
!list_empty(&wb->bdi->work_list))
break;

Now you seem to have converted background writeout into a queued work
item as well. So it sounds like wb_writeback() will stop that background
work early, never take it up again, and go on to finish the other queued
items. So we might finish the queued items, yet the flusher thread might
exit without bringing down the background ratio of either root or the
memcg, depending on the ->mem_cgroup pointer.

Maybe requeuing the background work at the end of the list would help.

Thanks
Vivek

2011-03-15 23:12:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Tue, Mar 15, 2011 at 05:23:39PM -0400, Vivek Goyal wrote:
> On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> > On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> > >
> > > [..]
> > >> > We could just crawl the memcg's page LRU and bring things under control
> > >> > that way, couldn't we?  That would fix it.  What were the reasons for
> > >> > not doing this?
> > >>
> > >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> > >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> > >> locality.
> > >>
> > >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> > >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> > >> that needed writeback.  I think the part a) memcg filtering is likely something
> > >> like:
> > >>  http://marc.info/?l=linux-kernel&m=129910424431837
> > >>
> > >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> > >> locking is doable.
> > >
> > > Greg,
> > >
> > > IIUC, option b) seems to be going through pages of particular memcg and
> > > mapping page to inode and start writeback on particular inode?
> >
> > Yes.
> >
> > > If yes, this might be reasonably good. In the case when cgroups are not
> > > sharing inodes then it automatically maps one inode to one cgroup and
> > > once cgroup is over limit, it starts writebacks of its own inode.
> > >
> > > In case inode is shared, then we get the case of one cgroup writting
> > > back the pages of other cgroup. Well I guess that also can be handeled
> > > by flusher thread where a bunch or group of pages can be compared with
> > > the cgroup passed in writeback structure. I guess that might hurt us
> > > more than benefit us.
> >
> > Agreed. For now just writing the entire inode is probably fine.
> >
> > > IIUC how option b) works then we don't even need option a) where an N level
> > > deep cache is maintained?
> >
> > Originally I was thinking that bdi-wide writeback with memcg filter
> > was a good idea. But this may be unnecessarily complex. Now I am
> > agreeing with you that option (a) may not be needed. Memcg could
> > queue per-inode writeback using the memcg lru to locate inodes
> > (lru->page->inode) with something like this in
> > [mem_cgroup_]balance_dirty_pages():
> >
> > while (memcg_usage() >= memcg_fg_limit) {
> >         inode = memcg_dirty_inode(cg); /* scan lru for a dirty page,
> >                                           then grab mapping & inode */
> >         sync_inode(inode, &wbc);
> > }
>
> Is it possible to pass mem_cgroup in writeback_control structure or in
> work structure which in turn will be set in writeback_control. And
> modify writeback_inodes_wb() which will look that ->mem_cgroup is
> set. So instead of calling queue_io() it can call memcg_queue_io()
> and then memory cgroup can look at lru list and take its own decision
> on which inodes needs to be pushed for IO?

Thinking more about it, for foreground work your solution of
sync_inode() and writing actively in the context of the dirtier is probably
also good. It has the disadvantage that it does not fit the IO-less
throttling model, but that may be fine if foreground IO is submitted in the
context of the memcg process and we can do per-process accounting of
submitted IO and move nr_requests out of the request queue.

Using memcg_queue_io() has one disadvantage: it will take the memcg's
dirty inodes from the ->b_dirty list and put them on the ->b_io list. That is
equivalent to prioritizing writeback of memcg inodes and it can starve
root cgroup inodes.

Maybe we can use memcg_queue_io() for background writeout and
sync_inode() for foreground writeout, and design memcg in such a way
that it moves one of its own inodes onto the io_list and at the same time
moves one more inode which does not belong to it; a rough sketch follows.
That might help mitigate the issue of starving the root cgroup and also
help make sure that we don't end up writing a lot of other inodes before
the dirty background ratio of a cgroup is under control.
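A rough sketch of that "one of mine, plus one foreign" queueing;
memcg_queue_io() and inode_in_memcg() are hypothetical, the list field names
follow ~2.6.38 fs/fs-writeback.c, and inode_lock is assumed held by the
caller, as for queue_io():

	/* Sketch only, not a real implementation. */
	static void memcg_queue_io(struct bdi_writeback *wb,
				   struct mem_cgroup *memcg)
	{
		struct inode *inode, *tmp;
		bool queued_foreign = false;

		list_for_each_entry_safe(inode, tmp, &wb->b_dirty, i_wb_list) {
			if (inode_in_memcg(inode, memcg)) {	/* hypothetical */
				list_move(&inode->i_wb_list, &wb->b_io);
			} else if (!queued_foreign) {
				/* push one foreign inode too, so root is not starved */
				list_move(&inode->i_wb_list, &wb->b_io);
				queued_foreign = true;
			}
		}
	}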

I also need to check how this is going to interact with IO-less throttling.
My head is spinning now. It is complicated. I will read more code tomorrow.

Thanks
Vivek

2011-03-15 23:12:42

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Mon 14-03-11 20:27:33, Greg Thelen wrote:
> On Mon, Mar 14, 2011 at 2:10 PM, Jan Kara <[email protected]> wrote:
> > On Mon 14-03-11 13:54:08, Vivek Goyal wrote:
> >> On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
> >> > If the current process is in a non-root memcg, then
> >> > balance_dirty_pages() will consider the memcg dirty limits as well as
>> > the system-wide limits.  This allows different cgroups to have distinct
> >> > dirty limits which trigger direct and background writeback at different
> >> > levels.
> >> >
> >> > If called with a mem_cgroup, then throttle_vm_writeout() queries the
> >> > given cgroup for its dirty memory usage limits.
> >> >
> >> > Signed-off-by: Andrea Righi <[email protected]>
> >> > Signed-off-by: Greg Thelen <[email protected]>
> >> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> >> > Acked-by: Wu Fengguang <[email protected]>
> >> > ---
> >> > Changelog since v5:
> >> > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
>> >   cramming the somewhat different logic into balance_dirty_pages().  This means
>> >   the global (non-memcg) dirty limits are not passed around in the
>> >   struct dirty_info, so there's less change to existing code.
> >>
> >> Yes there is less change to existing code but now we also have a separate
> >> throttlig logic for cgroups.
> >>
> >> I thought that we are moving in the direction of IO less throttling
> >> where bdi threads always do the IO and Jan Kara also implemented the
> >> logic to distribute the finished IO pages uniformly across the waiting
> >> threads.
> >  Yes, we'd like to avoid doing IO from balance_dirty_pages(). But if the
> > logic in cgroups specific part won't get too fancy (which it doesn't seem
> > to be the case currently), it shouldn't be too hard to convert it to the new
> > approach.
>
> Handling memcg hierarchy was something that was not trivial to implement in
> mem_cgroup_balance_dirty_pages.
>
> > We can talk about it at LSF but at least with my approach to IO-less
> > balance_dirty_pages() it would be easy to convert cgroups throttling to
> > the new way. With Fengguang's approach it might be a bit harder since he
> > computes a throughput and from that necessary delay for a throttled task
> > but with cgroups that is impossible to compute so he'd have to add some
> > looping if we didn't write enough pages from the cgroup yet. But still it
> > would be reasonable doable AFAICT.
>
> I am definitely interested in finding a way to merge these feature
> cleanly together.
What my patches do is that instead of calling writeback_inodes_wb() the
process waits for IO on enough pages to get completed. Now if we can tell
for each page against which cgroup it is accounted (and I believe we are
able to do so), we can as well properly account the number of pages completed
against a particular cgroup and thus wait for the right number of pages for
that cgroup to get written. The only difficult part is that for a BDI I can
estimate throughput, set the sleep time appropriately, and thus avoid
unnecessary looping to check whether pages have already completed or not.
With per-cgroup limits this is impossible (cgroups share the resource) so
we'd have to check relatively often...
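A sketch of what that per-cgroup completion accounting could look like; every
memcg_* name and field here is hypothetical, the point being only to credit
completed writeback to the owning memcg and to poll, since no per-cgroup
throughput estimate exists:

	/* Sketch only. */
	void memcg_writeback_page_done(struct mem_cgroup *memcg)
	{
		atomic_long_inc(&memcg->nr_writeback_done);	/* hypothetical field */
		wake_up(&memcg->writeback_waitq);		/* hypothetical field */
	}

	static void memcg_wait_for_writeback(struct mem_cgroup *memcg,
					     long nr_pages)
	{
		long start = atomic_long_read(&memcg->nr_writeback_done);

		/* no per-memcg throughput estimate, so poll relatively often */
		while (atomic_long_read(&memcg->nr_writeback_done) - start < nr_pages)
			wait_event_timeout(memcg->writeback_waitq,
					   atomic_long_read(&memcg->nr_writeback_done) -
					   start >= nr_pages, HZ / 10);
	}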

> >> Keeping it separate for cgroups, reduces the complexity but also forks
> >> off the balancing logic for root and other cgroups. So if Jan Kara's
> >> changes go in, it automatically does not get used for memory cgroups.
> >>
> >> Not sure how good a idea it is to use a separate throttling logic for
> >> for non-root cgroups.
> >  Yeah, it looks a bit odd. I'd think that we could just cap
> > task_dirty_limit() by a value computed from a cgroup limit and be done
> > with that but I probably miss something...
>
> That is an interesting idea. When looking at upstream balance_dirty_pages(),
> the result of task_dirty_limit() is compared per bdi_nr_reclaimable and
> bdi_nr_writeback. I think we should be comparing memcg usage to memcg limits
> to catch cases where memcg usage is above memcg limits.
> Or am I missing something in your apporach?
Oh right. It was too late yesterday :).

> > Sure there is also a different
> > background limit but that's broken anyway because a flusher thread will
> > quickly stop doing writeback if global background limit is not exceeded.
> > But that's a separate topic so I'll reply with this to a more appropriate
> > email ;)
> ;) I am also interested in the this bg issue, but I should also try
> to stay on topic.
I found out I've already deleted the relevant email and thus have no good
way to reply to it. So in the end I'll write it here: As Vivek pointed out,
you try to introduce background writeback that honors per-cgroup limits but
the way you do it it doesn't quite work. To avoid livelocking of flusher
thread, any essentially unbounded work (and background writeback of bdi or
in your case a cgroup pages on the bdi is in principle unbounded) has to
give way to other work items in the queue (like a work submitted by
sync(1)). Thus wb_writeback() stops for_background works if there are other
works to do with the rationale that as soon as that work is finished, we
may happily return to background cleaning (and that other work works for
background cleaning as well anyway).

But with your introduction of per-cgroup background writeback we are going
to loose the information in which cgroup we have to get below background
limit. And if we stored the context somewhere and tried to return to it
later, we'd have the above problems with livelocking and we'd have to
really carefully handle cases where more cgroups actually want their limits
observed.

I'm not decided what would be a good solution for this. It seems that
a flusher thread should check all cgroups whether they are not exceeding
their background limit and if yes, do writeback. I'm not sure how practical
that would be but possibly we could have a list of cgroups with exceeded
limits and flusher thread could check that?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-16 00:07:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits

On Tue, 15 Mar 2011 10:01:05 -0400
Mike Heffner <[email protected]> wrote:

> On 03/11/2011 01:43 PM, Greg Thelen wrote:
> > Add cgroupfs interface to memcg dirty page limits:
> > Direct write-out is controlled with:
> > - memory.dirty_ratio
> > - memory.dirty_limit_in_bytes
> >
> > Background write-out is controlled with:
> > - memory.dirty_background_ratio
> > - memory.dirty_background_limit_bytes
>
>
> What's the overlap, if any, with the current memory limits controlled by
> `memory.limit_in_bytes` and the above `memory.dirty_limit_in_bytes`? If
> I want to fairly balance memory between two cgroups be one a dirty page
> antagonist (dd) and the other an anonymous page (memcache), do I just
> set `memory.limit_in_bytes`? Does this patch simply provide a more
> granular level of control of the dirty limits?
>

dirty_ratio controls:
- the speed of write() within the cgroup.
- the risk of huge latency at memory reclaim (and OOM).
A small dirty ratio means a bigger ratio of clean pages within the cgroup.
This makes memory reclaim and pageout easier.

memory.limit_in_bytes controls the amount of memory.

Thanks,
-Kame

2011-03-16 00:50:45

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 6/9] memcg: add cgroupfs interface to memcg dirty limits

On Tue, Mar 15, 2011 at 7:01 AM, Mike Heffner <[email protected]> wrote:
> On 03/11/2011 01:43 PM, Greg Thelen wrote:
>>
>> Add cgroupfs interface to memcg dirty page limits:
>>   Direct write-out is controlled with:
>>   - memory.dirty_ratio
>>   - memory.dirty_limit_in_bytes
>>
>>   Background write-out is controlled with:
>>   - memory.dirty_background_ratio
>>   - memory.dirty_background_limit_bytes
>
>
> What's the overlap, if any, with the current memory limits controlled by
> `memory.limit_in_bytes` and the above `memory.dirty_limit_in_bytes`? If I
> want to fairly balance memory between two cgroups be one a dirty page
> antagonist (dd) and the other an anonymous page (memcache), do I just set
> `memory.limit_in_bytes`? Does this patch simply provide a more granular
> level of control of the dirty limits?
>
>
> Thanks,
>
> Mike
>

The per memcg dirty ratios are more about controlling how memory
within a cgroup is used. If you isolate two processes in
different memcgs, then the memcg dirty ratios will neither help nor hurt
isolation between cgroups. The per memcg dirty limits are more
focused on providing some form of better behavior when multiple
processes share a single memcg.
Running an antagonist (dd) in the same cgroup as a read-mostly workload
would benefit because the antagonist's dirty memory usage would be
capped by the memcg's dirty limit. So any clean page
allocation requests by the read-mostly workload should be faster (and
less likely to OOM) because there will be more clean pages available
within the memcg.

2011-03-16 01:02:46

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 9/9] memcg: make background writeback memcg aware

On Tue, Mar 15, 2011 at 3:54 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Mar 11, 2011 at 10:43:31AM -0800, Greg Thelen wrote:
>> Add a memcg parameter to bdi_start_background_writeback().  If a memcg
>> is specified then the resulting background writeback call to
>> wb_writeback() will run until the memcg dirty memory usage drops below
>> the memcg background limit.  This is used when balancing memcg dirty
>> memory with mem_cgroup_balance_dirty_pages().
>>
>> If the memcg parameter is not specified, then background writeback runs
>> globally until system dirty memory usage falls below the system background
>> limit.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> ---
>
> [..]
>> -static inline bool over_bground_thresh(void)
>> +static inline bool over_bground_thresh(struct mem_cgroup *mem_cgroup)
>>  {
>>        unsigned long background_thresh, dirty_thresh;
>>
>> +     if (mem_cgroup) {
>> +             struct dirty_info info;
>> +
>> +             if (!mem_cgroup_hierarchical_dirty_info(
>> +                         determine_dirtyable_memory(), false,
>> +                         mem_cgroup, &info))
>> +                     return false;
>> +
>> +             return info.nr_file_dirty +
>> +                     info.nr_unstable_nfs > info.background_thresh;
>> +     }
>> +
>>        global_dirty_limits(&background_thresh, &dirty_thresh);
>>
>>        return (global_page_state(NR_FILE_DIRTY) +
>> @@ -683,7 +694,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>>                * For background writeout, stop when we are below the
>>                * background dirty threshold
>>                */
>> -             if (work->for_background && !over_bground_thresh())
>> +             if (work->for_background &&
>> +                 !over_bground_thresh(work->mem_cgroup))
>>                       break;
>>
>>               wbc.more_io = 0;
>> @@ -761,23 +773,6 @@ static unsigned long get_nr_dirty_pages(void)
>>               get_nr_dirty_inodes();
>>  }
>>
>> -static long wb_check_background_flush(struct bdi_writeback *wb)
>> -{
>> -     if (over_bground_thresh()) {
>> -
>> -             struct wb_writeback_work work = {
>> -                     .nr_pages       = LONG_MAX,
>> -                     .sync_mode      = WB_SYNC_NONE,
>> -                     .for_background = 1,
>> -                     .range_cyclic   = 1,
>> -             };
>> -
>> -             return wb_writeback(wb, &work);
>> -     }
>> -
>> -     return 0;
>> -}
>> -
>>  static long wb_check_old_data_flush(struct bdi_writeback *wb)
>>  {
>>        unsigned long expired;
>> @@ -839,15 +834,17 @@ long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
>>                */
>>               if (work->done)
>>                       complete(work->done);
>> -             else
>> +             else {
>> +                     if (work->mem_cgroup)
>> +                             mem_cgroup_bg_writeback_done(work->mem_cgroup);
>>                       kfree(work);
>> +             }
>>        }
>>
>>        /*
>>         * Check for periodic writeback, kupdated() style
>>         */
>>        wrote += wb_check_old_data_flush(wb);
>> -     wrote += wb_check_background_flush(wb);
>
> Hi Greg,
>
> So in the past we will leave the background work unfinished and try
> to finish queued work first.
>
> I see following line in wb_writeback().
>
>                /*
>                 * Background writeout and kupdate-style writeback may
>                 * run forever. Stop them if there is other work to do
>                 * so that e.g. sync can proceed. They'll be restarted
>                 * after the other works are all done.
>                 */
>                if ((work->for_background || work->for_kupdate) &&
>                    !list_empty(&wb->bdi->work_list))
>                        break;
>
> Now you seem to have converted background writeout also as queued
> work item. So it sounds wb_writebac() will finish that background
> work early and never take it up and finish other queued items. So
> we might finish queued items still flusher thread might exit
> without bringing down the background ratio of either root or memcg
> depending on the ->mem_cgroup pointer.
>
> May be requeuing the background work at the end of list might help.

Good catch! I agree that an interrupted queued bg writeback work item
should be requeued to the tail.
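Something along these lines might do it, assuming the ~2.6.38 bdi work-list
layout; this is only a sketch of the requeue, and whether to reset any
progress counters in the work item is left open:

	/*
	 * Sketch only: instead of freeing an interrupted background work
	 * item, put it back at the tail of the bdi work list so per-memcg
	 * background writeback resumes after other queued works have run.
	 */
	static void bdi_requeue_work(struct backing_dev_info *bdi,
				     struct wb_writeback_work *work)
	{
		spin_lock_bh(&bdi->wb_lock);
		list_add_tail(&work->list, &bdi->work_list);
		if (bdi->wb.task)
			wake_up_process(bdi->wb.task);
		spin_unlock_bh(&bdi->wb_lock);
	}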

> Thanks
> Vivek

2011-03-16 02:35:56

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Tue, Mar 15, 2011 at 4:12 PM, Jan Kara <[email protected]> wrote:
> On Mon 14-03-11 20:27:33, Greg Thelen wrote:
>> On Mon, Mar 14, 2011 at 2:10 PM, Jan Kara <[email protected]> wrote:
>> > On Mon 14-03-11 13:54:08, Vivek Goyal wrote:
>> >> On Fri, Mar 11, 2011 at 10:43:30AM -0800, Greg Thelen wrote:
>> >> > If the current process is in a non-root memcg, then
>> >> > balance_dirty_pages() will consider the memcg dirty limits as well as
>> >> > the system-wide limits.  This allows different cgroups to have distinct
>> >> > dirty limits which trigger direct and background writeback at different
>> >> > levels.
>> >> >
>> >> > If called with a mem_cgroup, then throttle_vm_writeout() queries the
>> >> > given cgroup for its dirty memory usage limits.
>> >> >
>> >> > Signed-off-by: Andrea Righi <[email protected]>
>> >> > Signed-off-by: Greg Thelen <[email protected]>
>> >> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
>> >> > Acked-by: Wu Fengguang <[email protected]>
>> >> > ---
>> >> > Changelog since v5:
>> >> > - Simplified this change by using mem_cgroup_balance_dirty_pages() rather than
>> >> >   cramming the somewhat different logic into balance_dirty_pages().  This means
>> >> >   the global (non-memcg) dirty limits are not passed around in the
>> >> >   struct dirty_info, so there's less change to existing code.
>> >>
>> >> Yes there is less change to existing code but now we also have a separate
>> >> throttlig logic for cgroups.
>> >>
>> >> I thought that we are moving in the direction of IO less throttling
>> >> where bdi threads always do the IO and Jan Kara also implemented the
>> >> logic to distribute the finished IO pages uniformly across the waiting
>> >> threads.
>> >  Yes, we'd like to avoid doing IO from balance_dirty_pages(). But if the
>> > logic in cgroups specific part won't get too fancy (which it doesn't seem
>> > to be the case currently), it shouldn't be too hard to convert it to the new
>> > approach.
>>
>> Handling memcg hierarchy was something that was not trivial to implement in
>> mem_cgroup_balance_dirty_pages.
>>
>> > We can talk about it at LSF but at least with my approach to IO-less
>> > balance_dirty_pages() it would be easy to convert cgroups throttling to
>> > the new way. With Fengguang's approach it might be a bit harder since he
>> > computes a throughput and from that necessary delay for a throttled task
>> > but with cgroups that is impossible to compute so he'd have to add some
>> > looping if we didn't write enough pages from the cgroup yet. But still it
>> > would be reasonable doable AFAICT.
>>
>> I am definitely interested in finding a way to merge these feature
>> cleanly together.
>  What my patches do is that instead of calling writeback_inodes_wb() the
> process waits for IO on enough pages to get completed. Now if we can tell
> for each page against which cgroup it is accounted (and I believe we are
> able to do so), we can as well properly account amount of pages completed
> against a particular cgroup and thus wait for right amount of pages for
> that cgroup to get written. The only difficult part is that for BDI I can
> estimate throughput, set sleep time appropriately, and thus avoid
> unnecessary looping checking whether pages have already completed or not.
> With per-cgroup this is impossible (cgroups share the resource) so we'd have
> to check relatively often...
>
>> >> Keeping it separate for cgroups, reduces the complexity but also forks
>> >> off the balancing logic for root and other cgroups. So if Jan Kara's
>> >> changes go in, it automatically does not get used for memory cgroups.
>> >>
>> >> Not sure how good a idea it is to use a separate throttling logic for
>> >> for non-root cgroups.
>> >  Yeah, it looks a bit odd. I'd think that we could just cap
>> > task_dirty_limit() by a value computed from a cgroup limit and be done
>> > with that but I probably miss something...
>>
>> That is an interesting idea.  When looking at upstream balance_dirty_pages(),
>> the result of task_dirty_limit() is compared against bdi_nr_reclaimable and
>> bdi_nr_writeback.  I think we should be comparing memcg usage to memcg limits
>> to catch cases where memcg usage is above memcg limits.
>> Or am I missing something in your approach?
>  Oh right. It was too late yesterday :).
>
>> > Sure there is also a different
>> > background limit but that's broken anyway because a flusher thread will
>> > quickly stop doing writeback if global background limit is not exceeded.
>> > But that's a separate topic so I'll reply with this to a more appropriate
>> > email ;)
>> ;)  I am also interested in this bg issue, but I should also try
>> to stay on topic.
>  I found out I've already deleted the relevant email and thus have no good
> way to reply to it. So in the end I'll write it here: As Vivek pointed out,
> you try to introduce background writeback that honors per-cgroup limits but
> the way you do it it doesn't quite work. To avoid livelocking of flusher
> thread, any essentially unbounded work (and background writeback of bdi or
> in your case a cgroup pages on the bdi is in principle unbounded) has to
> give way to other work items in the queue (like a work submitted by
> sync(1)). Thus wb_writeback() stops for_background works if there are other
> works to do with the rationale that as soon as that work is finished, we
> may happily return to background cleaning (and that other work works for
> background cleaning as well anyway).
>
> But with your introduction of per-cgroup background writeback we are going
> to loose the information in which cgroup we have to get below background
> limit. And if we stored the context somewhere and tried to return to it
> later, we'd have the above problems with livelocking and we'd have to
> really carefully handle cases where more cgroups actually want their limits
> observed.
>
> I'm not decided what would be a good solution for this. It seems that
> a flusher thread should check all cgroups whether they are not exceeding
> their background limit and if yes, do writeback. I'm not sure how practical
> that would be but possibly we could have a list of cgroups with exceeded
> limits and flusher thread could check that?
>
>                                                                Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

mem_cgroup_balance_dirty_pages() queues a bdi work item which already
includes a memcg that is available to wb_writeback() in '[PATCH v6
9/9] memcg: make background writeback memcg aware'. Background
writeback checks the given memcg usage vs memcg limit rather than
global usage vs global limit.

If we amend this to requeue an interrupted background work to the end
of the per-bdi work_list, then I think that would address the
livelocking issue.

To prevent a memcg writeback work item from writing irrelevant inodes
(outside the memcg), bdi writeback could call
mem_cgroup_queue_io(memcg, bdi) to locate inodes to write back for
the memcg under dirty pressure. mem_cgroup_queue_io() would scan the
memcg lru for dirty pages belonging to the particular bdi.

If mem_cgroup_queue_io() is unable to find any dirty inodes for the
bdi, then it would return an empty set. Then wb_writeback() would
abandon background writeback because there is nothing useful to write
back to that bdi. In patch 9/9, wb_writeback() calls
mem_cgroup_bg_writeback_done() when writeback completes.
mem_cgroup_bg_writeback_done() could check whether the cgroup is still over
its background thresh and use the memcg lru to select another bdi to start
per-memcg bdi writeback on; a rough sketch follows. This lets one queued
per-memcg bdi background writeback work item pass off to another bdi to
continue per-memcg background writeback.
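A sketch of that hand-off; the memcg-aware bdi_start_background_writeback()
is from patch 9/9, while the two other mem_cgroup_* helpers are hypothetical:

	/* Sketch only. */
	void mem_cgroup_bg_writeback_done(struct mem_cgroup *memcg)
	{
		struct backing_dev_info *bdi;

		/* hypothetical: recheck the (hierarchical) background threshold */
		if (!mem_cgroup_over_bg_dirty_thresh(memcg))
			return;

		/* hypothetical: scan the memcg lru for a dirty page, return its bdi */
		bdi = mem_cgroup_next_dirty_bdi(memcg);
		if (bdi)
			bdi_start_background_writeback(bdi, memcg);
	}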

Does this seem reasonable?

Unfortunately the approach above would only queue a memcg's bg writes
to one bdi at a time. Another way to approach the problem would be to
have a per-memcg flusher thread that is able to queue inodes to
multiple bdis concurrently.

2011-03-16 12:35:37

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Tue 15-03-11 19:35:26, Greg Thelen wrote:
> On Tue, Mar 15, 2011 at 4:12 PM, Jan Kara <[email protected]> wrote:
> >  I found out I've already deleted the relevant email and thus have no good
> > way to reply to it. So in the end I'll write it here: As Vivek pointed out,
> > you try to introduce background writeback that honors per-cgroup limits but
> > the way you do it it doesn't quite work. To avoid livelocking of flusher
> > thread, any essentially unbounded work (and background writeback of bdi or
> > in your case a cgroup pages on the bdi is in principle unbounded) has to
> > give way to other work items in the queue (like a work submitted by
> > sync(1)). Thus wb_writeback() stops for_background works if there are other
> > works to do with the rationale that as soon as that work is finished, we
> > may happily return to background cleaning (and that other work works for
> > background cleaning as well anyway).
> >
> > But with your introduction of per-cgroup background writeback we are going
> > to loose the information in which cgroup we have to get below background
> > limit. And if we stored the context somewhere and tried to return to it
> > later, we'd have the above problems with livelocking and we'd have to
> > really carefully handle cases where more cgroups actually want their limits
> > observed.
> >
> > I'm not decided what would be a good solution for this. It seems that
> > a flusher thread should check all cgroups whether they are not exceeding
> > their background limit and if yes, do writeback. I'm not sure how practical
> > that would be but possibly we could have a list of cgroups with exceeded
> > limits and flusher thread could check that?
>
> mem_cgroup_balance_dirty_pages() queues a bdi work item which already
> includes a memcg that is available to wb_writeback() in '[PATCH v6
> 9/9] memcg: make background writeback memcg aware'. Background
> writeback checks the given memcg usage vs memcg limit rather than
> global usage vs global limit.
Yes.

> If we amend this to requeue an interrupted background work to the end
> of the per-bdi work_list, then I think that would address the
> livelocking issue.
Yes, that would work. But it would be nice (I'd find it a cleaner design)
if we could keep just one type of background work and make sure that it
observes all the imposed memcg limits. For that we wouldn't explicitly
pass the memcg to the flusher thread but rather make over_bground_thresh()
check all the memcg limits - or, to make this more effective, keep some list
of memcgs which crossed the background limit. What do you think?

> To prevent a memcg writeback work item from writing irrelevant inodes
> (outside the memcg) then bdi writeback could call
> mem_cgroup_queue_io(memcg, bdi) to locate an inode to writeback for
> the memcg under dirty pressure. mem_cgroup_queue_io() would scan the
> memcg lru for dirty pages belonging to the particular bdi.
And similarly here, we could just loop over all relevant memcg's and
let each of them queue relevant inodes as they wish and after that we go
and write all the queued inodes... That would also solve the problem with
cgroups competing with each other on the same bdi (writeback thread makes
sure that all queued inodes get comparable amount of writeback). Does it
look OK? It seems cleaner to me than what you propose but maybe I miss
something...

> If mem_cgroup_queue_io() is unable to find any dirty inodes for the
> bdi, then it would return an empty set. Then wb_writeback() would
> abandon background writeback because there is nothing useful to write
> back to that bdi. In patch 9/9, wb_writeback() calls
> mem_cgroup_bg_writeback_done() when writeback completes.
> mem_cgroup_bg_writeback_done() could check that cgroup is still over
> background thresh and use the memcg lru to select another bdi to start
> per-memcg bdi writeback on. This allows one queued per-memcg bdi
> background writeback work item to pass off to another bdi to continue
> per-memcg background writeback.
Here I'd think that memcg_balance_dirty_pages() would check which memcgs
are over threshold on that bdi and add these to the list of relevant memcgs
for that bdi. Thinking about it, it's rather similar to what you propose
just instead of queueing and requeueing work items (which are processed
sequentially, which isn't really what we want) we'd rather maintain a
separate list of memcgs which would be processed in parallel.
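
A very rough sketch of that bookkeeping, just to make the idea concrete (none
of these fields or helpers exist, and the real over_bground_thresh() currently
takes no bdi argument):

/* hypothetical per-bdi state, maintained from memcg code */
struct backing_dev_info {
        ...
        /* memcgs that crossed their background limit on this bdi */
        struct list_head memcg_over_bg;
        spinlock_t       memcg_over_bg_lock;
};

/* one background work type that covers both the global and the memcg limits */
static bool over_bground_thresh(struct backing_dev_info *bdi)
{
        if (global_over_bground_thresh())
                return true;
        return !list_empty(&bdi->memcg_over_bg);
}

mem_cgroup_balance_dirty_pages() would add and remove memcgs on that list, and
the flusher would then pick inodes for each listed memcg rather than working
through one queued work item per memcg.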

> Unfortunately the approach above would only queue a memcg's bg writes
> to one bdi at a time. Another way to approach the problem would be to
> have a per-memcg flusher thread that is able to queue inodes to
> multiple bdis concurrently.
Well, in principle there's no reason why memcg couldn't ask for writeback
(either via work items as you propose or via my mechanism) on several bdis
in parallel. I see no reason why a special flusher thread would be needed
for that...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-16 12:45:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, Mar 11, 2011 at 10:43:22AM -0800, Greg Thelen wrote:
> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
>
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup. So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be throttled if they cross that limit.
>
> Example use case:
> #!/bin/bash
> #
> # Here is a test script that shows a situation where memcg dirty limits are
> # beneficial.
> #
> # The script runs two programs:
> # 1) a dirty page background antagonist (dd)
> # 2) an interactive foreground process (tar).
> #
> # If the script's argument is false, then both processes are run together in
> # the root cgroup sharing system-wide dirty memory in classic fashion. If the
> # script is given a true argument, then a cgroup is used to contain dd dirty
> # page consumption. The cgroup isolates the dd dirty memory consumption from
> # the rest of the system processes (tar in this case).
> #
> # The time used by the tar process is printed (lower is better).
> #
> # The tar process had faster and more predictable performance. memcg dirty
> # ratios might be useful to serve different task classes (interactive vs
> # batch). A past discussion touched on this:
> # http://lkml.org/lkml/2010/5/20/136
> #
> # When called with 'false' (using memcg without dirty isolation):
> # tar takes 8s
> # dd reports 69 MB/s
> #
> # When called with 'true' (using memcg for dirty isolation):
> # tar takes 6s
> # dd reports 66 MB/s
> #
> echo memcg_dirty_limits: $1
>
> # Declare system limits.
> echo $((1<<30)) > /proc/sys/vm/dirty_bytes
> echo $((1<<29)) > /proc/sys/vm/dirty_background_bytes
>
> mkdir /dev/cgroup/memory/A
>
> # start antagonist
> if $1; then # if using cgroup to contain 'dd'...
> echo 400M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
> fi
>
> (echo $BASHPID > /dev/cgroup/memory/A/tasks; \
> dd if=/dev/zero of=big.file count=10k bs=1M) &
>
> # let antagonist get warmed up
> sleep 10
>
> # time interactive job
> time tar -xzf linux.tar.gz
>
> wait
> sleep 10
> rmdir /dev/cgroup/memory/A
>
>
> The patches are based on a series proposed by Andrea Righi in Mar 2010.
>
>
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
> unstable.
>
> - Extend mem_cgroup to record the total number of pages in each of the
> interesting dirty states (dirty, writeback, unstable_nfs).
>
> - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to
> mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs
> control files.
>
> - Consider both system and per-memcg dirty limits in page writeback when
> deciding to queue background writeback or throttle dirty memory production.
>
> Known shortcomings (see the patch 1/9 update to Documentation/cgroups/memory.txt
> for more details):
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
> writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not
> just inodes contributing dirty pages to the cgroup exceeding its limit.

The smaller your cgroups wrt overall system size, the less likely it
becomes that writeback will find the pages that unblock throttled
cgrouped dirtiers.

Your example was probably not much affected by this because the
'needle' of 400MB dirty memory from the cgroup would have to be
searched for only in the 1G 'haystack' that is the sum total of dirty
memory before the dirtier could continue. At that ratio, even
memcg-unaware writeback has a decent chance of writing back the right
pages by accident.

The performance of the throttled dirtier already dropped by 5% in your
testcase, so I would be really interested in a case where you have 100
cgroups with dirty limits that are small relative to the global dirty limit.

> - A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
> different memcg.

I do wonder if we should hide the knobs from userspace as long as the
limits can not be strictly enforced. Just proportionally apply the
global dirty limit (memory.limit_in_bytes * dirty_ratio / 100) and
make it a best-effort optimization. It would be an improvement for
the common case without promising anything. Does that make sense?
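
Something along these lines, as a sketch only (mem_cgroup_limit_in_bytes() and
global_dirty_limit_pages() are made-up names):

static unsigned long mem_cgroup_dirty_limit(struct mem_cgroup *memcg)
{
        /* best effort: scale the global dirty_ratio by the memcg's memory limit */
        u64 limit = mem_cgroup_limit_in_bytes(memcg);   /* memory.limit_in_bytes */

        limit *= vm_dirty_ratio;
        do_div(limit, 100);

        /* never promise more than the system-wide dirty limit */
        return min_t(unsigned long, limit >> PAGE_SHIFT,
                     global_dirty_limit_pages());
}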

Hannes

2011-03-16 13:13:48

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
> On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> > On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> > >
> > > [..]
> > >> > We could just crawl the memcg's page LRU and bring things under control
> > >> > that way, couldn't we?  That would fix it.  What were the reasons for
> > >> > not doing this?
> > >>
> > >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> > >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> > >> locality.
> > >>
> > >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> > >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> > >> that needed writeback.  I think the part a) memcg filtering is likely something
> > >> like:
> > >>  http://marc.info/?l=linux-kernel&m=129910424431837
> > >>
> > >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> > >> locking is doable.
> > >
> > > Greg,
> > >
> > > IIUC, option b) seems to be going through pages of particular memcg and
> > > mapping page to inode and start writeback on particular inode?
> >
> > Yes.
> >
> > > If yes, this might be reasonably good. In the case when cgroups are not
> > > sharing inodes then it automatically maps one inode to one cgroup and
> > > once cgroup is over limit, it starts writebacks of its own inode.
> > >
> > > In case inode is shared, then we get the case of one cgroup writing
> > > back the pages of other cgroup. Well I guess that also can be handled
> > > by flusher thread where a bunch or group of pages can be compared with
> > > the cgroup passed in writeback structure. I guess that might hurt us
> > > more than benefit us.
> >
> > Agreed. For now just writing the entire inode is probably fine.
> >
> > > IIUC how option b) works then we don't even need option a) where an N level
> > > deep cache is maintained?
> >
> > Originally I was thinking that bdi-wide writeback with memcg filter
> > was a good idea. But this may be unnecessarily complex. Now I am
> > agreeing with you that option (a) may not be needed. Memcg could
> > queue per-inode writeback using the memcg lru to locate inodes
> > (lru->page->inode) with something like this in
> > [mem_cgroup_]balance_dirty_pages():
> >
> > while (memcg_usage() >= memcg_fg_limit) {
> > inode = memcg_dirty_inode(cg); /* scan lru for a dirty page, then
> > grab mapping & inode */
> > sync_inode(inode, &wbc);
> > }
> >
> > if (memcg_usage() >= memcg_bg_limit) {
> > queue per-memcg bg flush work item
> > }
>
> I think even for background we shall have to implement some kind of logic
> where inodes are selected by traversing memcg->lru list so that for
> > background write we don't end up writing too many inodes from other
> > root group in an attempt to meet the low background ratio of memcg.
> >
> > So to me it boils down to coming up with a new inode selection logic for
> > memcg which can be used both for background as well as foreground
> > writes. This will make sure we don't end up writing pages from the
> inodes we don't want to.

Originally for struct page_cgroup reduction, I had the idea of
introducing something like

        struct memcg_mapping {
                struct address_space *mapping;
                struct mem_cgroup *memcg;
        };

hanging off page->mapping to make memcg association no longer per-page
and save the pc->memcg linkage (it's not completely per-inode either,
multiple memcgs can still refer to a single inode).

We could put these descriptors on a per-memcg list and write inodes
from this list during memcg-writeback.

We would have the option of extending this structure to contain hints
as to which subrange of the inode is actually owned by the cgroup, to
further narrow writeback to the right pages - iff shared big files
become a problem.

Does that sound feasible?

2011-03-16 15:01:32

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 02:13:24PM +0100, Johannes Weiner wrote:
> On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
> > On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> > > On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > > > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> > > >
> > > > [..]
> > > >> > We could just crawl the memcg's page LRU and bring things under control
> > > >> > that way, couldn't we?  That would fix it.  What were the reasons for
> > > >> > not doing this?
> > > >>
> > > >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> > > >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> > > >> locality.
> > > >>
> > > >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> > > >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> > > >> that needed writeback.  I think the part a) memcg filtering is likely something
> > > >> like:
> > > >>  http://marc.info/?l=linux-kernel&m=129910424431837
> > > >>
> > > >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> > > >> locking is doable.
> > > >
> > > > Greg,
> > > >
> > > > IIUC, option b) seems to be going through pages of particular memcg and
> > > > mapping page to inode and start writeback on particular inode?
> > >
> > > Yes.
> > >
> > > > If yes, this might be reasonably good. In the case when cgroups are not
> > > > sharing inodes then it automatically maps one inode to one cgroup and
> > > > once cgroup is over limit, it starts writebacks of its own inode.
> > > >
> > > > In case inode is shared, then we get the case of one cgroup writting
> > > > back the pages of other cgroup. Well I guess that also can be handeled
> > > > by flusher thread where a bunch or group of pages can be compared with
> > > > the cgroup passed in writeback structure. I guess that might hurt us
> > > > more than benefit us.
> > >
> > > Agreed. For now just writing the entire inode is probably fine.
> > >
> > > > IIUC how option b) works then we don't even need option a) where an N level
> > > > deep cache is maintained?
> > >
> > > Originally I was thinking that bdi-wide writeback with memcg filter
> > > was a good idea. But this may be unnecessarily complex. Now I am
> > > agreeing with you that option (a) may not be needed. Memcg could
> > > queue per-inode writeback using the memcg lru to locate inodes
> > > (lru->page->inode) with something like this in
> > > [mem_cgroup_]balance_dirty_pages():
> > >
> > > while (memcg_usage() >= memcg_fg_limit) {
> > > inode = memcg_dirty_inode(cg); /* scan lru for a dirty page, then
> > > grab mapping & inode */
> > > sync_inode(inode, &wbc);
> > > }
> > >
> > > if (memcg_usage() >= memcg_bg_limit) {
> > > queue per-memcg bg flush work item
> > > }
> >
> > I think even for background we shall have to implement some kind of logic
> > where inodes are selected by traversing memcg->lru list so that for
> > background write we don't end up writting too many inodes from other
> > root group in an attempt to meet the low background ratio of memcg.
> >
> > So to me it boils down to coming up a new inode selection logic for
> > memcg which can be used both for background as well as foreground
> > writes. This will make sure we don't end up writting pages from the
> > inodes we don't want to.
>
> Originally for struct page_cgroup reduction, I had the idea of
> introducing something like
>
>        struct memcg_mapping {
>                struct address_space *mapping;
>                struct mem_cgroup *memcg;
>        };
>
> hanging off page->mapping to make memcg association no longer per-page
> and save the pc->memcg linkage (it's not completely per-inode either,
> multiple memcgs can still refer to a single inode).

So page->mapping will basically be a list where multiple memcg_mappings
are hanging? That will essentially tell what memory cgroups own pages
in this inode?

And similarly every cgroup will have a list where these memcg_mappings
are hanging, allowing us to trace which memcg is doing IO on which inodes?

>
> We could put these descriptors on a per-memcg list and write inodes
> from this list during memcg-writeback.
>
> We would have the option of extending this structure to contain hints
> as to which subrange of the inode is actually owned by the cgroup, to
> further narrow writeback to the right pages - iff shared big files
> become a problem.
>
> Does that sound feasible?

Maybe. I am really not an expert in this area.

IIUC, this sounds more like a solution to quickly come up with a list of
inodes one should be writing back. One could also come up with this kind of
list (approximately) by going through the memcg->lru list. So this can be
an improvement: instead of going through memcg->lru, go through
memcg->mapping_list.

Thanks
Vivek

2011-03-16 16:35:34

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 10:59:59AM -0400, Vivek Goyal wrote:
> On Wed, Mar 16, 2011 at 02:13:24PM +0100, Johannes Weiner wrote:
> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
> > > On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
> > > > On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
> > > > > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
> > > > >
> > > > > [..]
> > > > >> > We could just crawl the memcg's page LRU and bring things under control
> > > > >> > that way, couldn't we?  That would fix it.  What were the reasons for
> > > > >> > not doing this?
> > > > >>
> > > > >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
> > > > >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
> > > > >> locality.
> > > > >>
> > > > >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
> > > > >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
> > > > >> that needed writeback.  I think the part a) memcg filtering is likely something
> > > > >> like:
> > > > >>  http://marc.info/?l=linux-kernel&m=129910424431837
> > > > >>
> > > > >> The part b) bdi selection should not be too hard assuming that page-to-mapping
> > > > >> locking is doable.
> > > > >
> > > > > Greg,
> > > > >
> > > > > IIUC, option b) seems to be going through pages of particular memcg and
> > > > > mapping page to inode and start writeback on particular inode?
> > > >
> > > > Yes.
> > > >
> > > > > If yes, this might be reasonably good. In the case when cgroups are not
> > > > > sharing inodes then it automatically maps one inode to one cgroup and
> > > > > once cgroup is over limit, it starts writebacks of its own inode.
> > > > >
> > > > > In case inode is shared, then we get the case of one cgroup writting
> > > > > back the pages of other cgroup. Well I guess that also can be handeled
> > > > > by flusher thread where a bunch or group of pages can be compared with
> > > > > the cgroup passed in writeback structure. I guess that might hurt us
> > > > > more than benefit us.
> > > >
> > > > Agreed. For now just writing the entire inode is probably fine.
> > > >
> > > > > IIUC how option b) works then we don't even need option a) where an N level
> > > > > deep cache is maintained?
> > > >
> > > > Originally I was thinking that bdi-wide writeback with memcg filter
> > > > was a good idea. But this may be unnecessarily complex. Now I am
> > > > agreeing with you that option (a) may not be needed. Memcg could
> > > > queue per-inode writeback using the memcg lru to locate inodes
> > > > (lru->page->inode) with something like this in
> > > > [mem_cgroup_]balance_dirty_pages():
> > > >
> > > > while (memcg_usage() >= memcg_fg_limit) {
> > > > inode = memcg_dirty_inode(cg); /* scan lru for a dirty page, then
> > > > grab mapping & inode */
> > > > sync_inode(inode, &wbc);
> > > > }
> > > >
> > > > if (memcg_usage() >= memcg_bg_limit) {
> > > > queue per-memcg bg flush work item
> > > > }
> > >
> > > I think even for background we shall have to implement some kind of logic
> > > where inodes are selected by traversing memcg->lru list so that for
> > > background write we don't end up writting too many inodes from other
> > > root group in an attempt to meet the low background ratio of memcg.
> > >
> > > So to me it boils down to coming up a new inode selection logic for
> > > memcg which can be used both for background as well as foreground
> > > writes. This will make sure we don't end up writting pages from the
> > > inodes we don't want to.
> >
> > Originally for struct page_cgroup reduction, I had the idea of
> > introducing something like
> >
> >        struct memcg_mapping {
> >                struct address_space *mapping;
> >                struct mem_cgroup *memcg;
> >        };
> >
> > hanging off page->mapping to make memcg association no longer per-page
> > and save the pc->memcg linkage (it's not completely per-inode either,
> > multiple memcgs can still refer to a single inode).
>
> So page->mapping will basically be a list where multiple memcg_mappings
> are hanging?

No, a single memcg_mapping per page. A page can only be part of one
mapping, and only be part of one memcg at any point in time.

But not all pages backing an inode belong to the same memcg. So the
two extremes are 1) every page associated individually with one memcg,
which is what we have now or 2) all pages in an inode collectively
associated with one memcg, which is not feasible.

The trade-off I propose is grouping all pages backing an inode that
are associated with the same memcg. struct memcg_mapping would be the
representation of this group.

> That will essentially tell what memory cgroups own pages
> in this inode?

The idea is to have it efficiently the other way round: quickly find
all inodes referenced by one memcg.

> And similary every cgroup will have a list where these memcg_mapping
> are hanging allowing to trace which memcg is doing IO on which inodes?

Yes.

> > We could put these descriptors on a per-memcg list and write inodes
> > from this list during memcg-writeback.
> >
> > We would have the option of extending this structure to contain hints
> > as to which subrange of the inode is actually owned by the cgroup, to
> > further narrow writeback to the right pages - iff shared big files
> > become a problem.
> >
> > Does that sound feasible?
>
> May be. I am really not an expert in this area.
>
> IIUC, this sounds more like a solution to quickly come up with a list of
> inodes one should be writting back. One could also come up with this kind of
> list by going through memcg->lru list also (approximate). So this can be
> an improvement over going through memcg->lru instead go through
> memcg->mapping_list.

Well, if you operate on a large file it may make a difference between
taking five inodes off the list and crawling through hundreds of
thousands of pages to get to those same five inodes.

And having efficient inode lookup for a memcg makes targeted
background writeback more feasible: pass the memcg in the background
writeback work and have the flusher go through memcg->mappings,
selecting those that match the bdi.
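
For illustration, the flusher side could then be as simple as this sketch
(memcg->mappings, the memcg_node linkage and memcg_writeback() itself are all
hypothetical; only the bdi test uses today's mapping->backing_dev_info):

static void memcg_writeback(struct backing_dev_info *bdi,
                            struct mem_cgroup *memcg,
                            struct writeback_control *wbc)
{
        struct memcg_mapping *mm;

        list_for_each_entry(mm, &memcg->mappings, memcg_node) {
                /* only touch inodes that live on the bdi being flushed */
                if (mm->mapping->backing_dev_info != bdi)
                        continue;
                /* write the whole inode for now, as discussed above */
                writeback_single_inode(mm->mapping->host, wbc);
        }
}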

Am I missing something? I feel like I missed your point.

2011-03-16 17:07:49

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 05:35:10PM +0100, Johannes Weiner wrote:

[..]
> > IIUC, this sounds more like a solution to quickly come up with a list of
> > inodes one should be writting back. One could also come up with this kind of
> > list by going through memcg->lru list also (approximate). So this can be
> > an improvement over going through memcg->lru instead go through
> > memcg->mapping_list.
>
> Well, if you operate on a large file it may make a difference between
> taking five inodes off the list and crawling through hundreds of
> thousands of pages to get to those same five inodes.
>
> And having efficient inode lookup for a memcg makes targeted
> background writeback more feasible: pass the memcg in the background
> writeback work and have the flusher go through memcg->mappings,
> selecting those that match the bdi.
>
> Am I missing something? I feel like I missed your point.

No. Thanks for the explanation. I get it. Walking through the memcg->lru
list to figure out which inodes the memcg is writing to will be slow and can be
very painful for large files. Keeping a more direct mapping like
memcg_mapping list per memcg can simplify it a lot.

Thanks
Vivek

2011-03-16 18:09:03

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 8/9] memcg: check memcg dirty limits in page writeback

On Wed, Mar 16, 2011 at 01:35:14PM +0100, Jan Kara wrote:
> On Tue 15-03-11 19:35:26, Greg Thelen wrote:
> > On Tue, Mar 15, 2011 at 4:12 PM, Jan Kara <[email protected]> wrote:
> > >  I found out I've already deleted the relevant email and thus have no good
> > > way to reply to it. So in the end I'll write it here: As Vivek pointed out,
> > > you try to introduce background writeback that honors per-cgroup limits but
> > > the way you do it, it doesn't quite work. To avoid livelocking of the flusher
> > > thread, any essentially unbounded work (and background writeback of bdi or
> > > in your case a cgroup pages on the bdi is in principle unbounded) has to
> > > give way to other work items in the queue (like a work submitted by
> > > sync(1)). Thus wb_writeback() stops for_background works if there are other
> > > works to do with the rationale that as soon as that work is finished, we
> > > may happily return to background cleaning (and that other work works for
> > > background cleaning as well anyway).
> > >
> > > But with your introduction of per-cgroup background writeback we are going
> > > to lose the information in which cgroup we have to get below background
> > > limit. And if we stored the context somewhere and tried to return to it
> > > later, we'd have the above problems with livelocking and we'd have to
> > > really carefully handle cases where more cgroups actually want their limits
> > > observed.
> > >
> > > I'm not decided what would be a good solution for this. It seems that
> > > a flusher thread should check all cgroups whether they are not exceeding
> > > their background limit and if yes, do writeback. I'm not sure how practical
> > > that would be but possibly we could have a list of cgroups with exceeded
> > > limits and flusher thread could check that?
> >
> > mem_cgroup_balance_dirty_pages() queues a bdi work item which already
> > includes a memcg that is available to wb_writeback() in '[PATCH v6
> > 9/9] memcg: make background writeback memcg aware'. Background
> > writeback checks the given memcg usage vs memcg limit rather than
> > global usage vs global limit.
> Yes.
>
> > If we amend this to requeue an interrupted background work to the end
> > of the per-bdi work_list, then I think that would address the
> > livelocking issue.
> Yes, that would work. But it would be nice (I'd find that cleaner design)
> if we could keep just one type of background work and make sure that it
> observes all the imposed memcg limits. For that we wouldn't explicitly
> pass memcg to the flusher thread but rather make over_bground_thresh()
> check all the memcg limits - or to make this more effective have some list
> of memcgs which crossed the background limit. What do you think?

A list of memcgs per bdi which need writeback sounds interesting. This
can also allow us to keep track of additional state in the memory cgroup
regarding how much IO is in flight per memory cgroup on a bdi. One of
the additional things we wanted to do was differentiating between the
write speeds of two buffered writers in two groups. The IO controller at
the end device can differentiate between the rates, but that is only
possible if flusher threads are submitting enough IO from the faster
moving group and not getting stuck behind the slow group.

So if we can also do some accounting of in-flight IO per memcg per bdi,
then flusher threads can skip the memcgs which have a lot of pending IO.
That means the IO controller at the device is holding back on these
requests and prioritizing some other group. And flusher threads can
move on to other memcgs in the list and pick inodes from those.

If there are per-memcg per-bdi structures, then there can be per-memcg
per-bdi waitlists too; a throttled task can sleep on those wait lists,
and one can keep a count of BDI_WRITTEN per memory cgroup and distribute
completions to its tasks. That way, even memory cgroup foreground
writeout becomes IO-less.
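
For instance, a per-memcg per-bdi structure along these lines could carry that
state (entirely hypothetical, just to make the idea concrete):

struct memcg_bdi_info {
        struct backing_dev_info *bdi;
        struct list_head         node;          /* on the memcg's bdi list */
        unsigned long            nr_in_flight;  /* writeback submitted, not yet completed */
        unsigned long            nr_written;    /* per-memcg analogue of BDI_WRITTEN */
        wait_queue_head_t        throttled;     /* dirtiers sleeping in balance_dirty_pages() */
};

/* called when writeback against this memcg/bdi pair completes */
static void memcg_bdi_writeback_done(struct memcg_bdi_info *mb, long nr_pages)
{
        mb->nr_in_flight -= nr_pages;
        mb->nr_written += nr_pages;
        wake_up(&mb->throttled);        /* let throttled dirtiers recheck their limit */
}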

Thanks
Vivek

2011-03-16 21:19:55

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner <[email protected]> wrote:
> On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
>> On Mon, Mar 14, 2011 at 07:41:13PM -0700, Greg Thelen wrote:
>> > On Mon, Mar 14, 2011 at 1:23 PM, Vivek Goyal <[email protected]> wrote:
>> > > On Mon, Mar 14, 2011 at 11:29:17AM -0700, Greg Thelen wrote:
>> > >
>> > > [..]
>> > >> > We could just crawl the memcg's page LRU and bring things under control
>> > >> > that way, couldn't we?  That would fix it.  What were the reasons for
>> > >> > not doing this?
>> > >>
>> > >> My rationale for pursuing bdi writeback was I/O locality.  I have heard that
>> > >> per-page I/O has bad locality.  Per inode bdi-style writeback should have better
>> > >> locality.
>> > >>
>> > >> My hunch is the best solution is a hybrid which uses a) bdi writeback with a
>> > >> target memcg filter and b) using the memcg lru as a fallback to identify the bdi
>> > >> that needed writeback.  I think the part a) memcg filtering is likely something
>> > >> like:
>> > >>  http://marc.info/?l=linux-kernel&m=129910424431837
>> > >>
>> > >> The part b) bdi selection should not be too hard assuming that page-to-mapping
>> > >> locking is doable.
>> > >
>> > > Greg,
>> > >
>> > > IIUC, option b) seems to be going through pages of particular memcg and
>> > > mapping page to inode and start writeback on particular inode?
>> >
>> > Yes.
>> >
>> > > If yes, this might be reasonably good. In the case when cgroups are not
>> > > sharing inodes then it automatically maps one inode to one cgroup and
>> > > once cgroup is over limit, it starts writebacks of its own inode.
>> > >
>> > > In case inode is shared, then we get the case of one cgroup writting
>> > > back the pages of other cgroup. Well I guess that also can be handeled
>> > > by flusher thread where a bunch or group of pages can be compared with
>> > > the cgroup passed in writeback structure. I guess that might hurt us
>> > > more than benefit us.
>> >
>> > Agreed.  For now just writing the entire inode is probably fine.
>> >
>> > > IIUC how option b) works then we don't even need option a) where an N level
>> > > deep cache is maintained?
>> >
>> > Originally I was thinking that bdi-wide writeback with memcg filter
>> > was a good idea.  But this may be unnecessarily complex.  Now I am
>> > agreeing with you that option (a) may not be needed.  Memcg could
>> > queue per-inode writeback using the memcg lru to locate inodes
>> > (lru->page->inode) with something like this in
>> > [mem_cgroup_]balance_dirty_pages():
>> >
>> >   while (memcg_usage() >= memcg_fg_limit) {
>> >     inode = memcg_dirty_inode(cg);  /* scan lru for a dirty page, then
>> > grab mapping & inode */
>> >     sync_inode(inode, &wbc);
>> >   }
>> >
>> >   if (memcg_usage() >= memcg_bg_limit) {
>> >     queue per-memcg bg flush work item
>> >   }
>>
>> I think even for background we shall have to implement some kind of logic
>> where inodes are selected by traversing memcg->lru list so that for
>> background write we don't end up writting too many inodes from other
>> root group in an attempt to meet the low background ratio of memcg.
>>
>> So to me it boils down to coming up a new inode selection logic for
>> memcg which can be used both for background as well as foreground
>> writes. This will make sure we don't end up writting pages from the
>> inodes we don't want to.
>
> Originally for struct page_cgroup reduction, I had the idea of
> introducing something like
>
>        struct memcg_mapping {
>                struct address_space *mapping;
>                struct mem_cgroup *memcg;
>        };
>
> hanging off page->mapping to make memcg association no longer per-page
> and save the pc->memcg linkage (it's not completely per-inode either,
> multiple memcgs can still refer to a single inode).
>
> We could put these descriptors on a per-memcg list and write inodes
> from this list during memcg-writeback.
>
> We would have the option of extending this structure to contain hints
> as to which subrange of the inode is actually owned by the cgroup, to
> further narrow writeback to the right pages - iff shared big files
> become a problem.
>
> Does that sound feasible?

If I understand your memcg_mapping proposal, then each inode could
have a collection of memcg_mapping objects representing the set of
memcg that were charged for caching pages of the inode's data. When a
new file page is charged to a memcg, then the inode's set of
memcg_mapping would be scanned to determine if current's memcg is
already in the memcg_mapping set. If this is the first page for the
memcg within the inode, then a new memcg_mapping would be allocated
and attached to the inode. The memcg_mapping may be reference counted
and would be deleted when the last inode page for a particular memcg
is uncharged.

page->mapping = &memcg_mapping
inode->i_mapping = collection of memcg_mapping, grows/shrinks with [un]charge

Am I close?

I still have to think through the various use cases, but I wanted to
make sure I had the basic idea.
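
As a sketch of that charge path (the list on struct address_space, the field
names and the locking are all hypothetical, and a real version would have to
be careful about the context the charge path runs in):

static struct memcg_mapping *memcg_mapping_get(struct address_space *mapping,
                                               struct mem_cgroup *memcg)
{
        struct memcg_mapping *mm;

        /* find an existing descriptor for this (inode, memcg) pair... */
        list_for_each_entry(mm, &mapping->memcg_mappings, mapping_node)
                if (mm->memcg == memcg)
                        goto found;

        /* ...or create one on the first charge of a page of this inode */
        mm = kzalloc(sizeof(*mm), GFP_NOFS);
        if (!mm)
                return NULL;
        mm->mapping = mapping;
        mm->memcg = memcg;
        atomic_set(&mm->count, 0);
        list_add(&mm->mapping_node, &mapping->memcg_mappings);
found:
        atomic_inc(&mm->count);         /* one reference per charged page */
        return mm;
}

Uncharge would be the mirror image: atomic_dec_and_test() on mm->count and,
when it reaches zero, unlink and free the descriptor.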

2011-03-16 21:52:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 02:19:26PM -0700, Greg Thelen wrote:
> On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner <[email protected]> wrote:
> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
> >> I think even for background we shall have to implement some kind of logic
> >> where inodes are selected by traversing memcg->lru list so that for
> >> background write we don't end up writting too many inodes from other
> >> root group in an attempt to meet the low background ratio of memcg.
> >>
> >> So to me it boils down to coming up a new inode selection logic for
> >> memcg which can be used both for background as well as foreground
> >> writes. This will make sure we don't end up writting pages from the
> >> inodes we don't want to.
> >
> > Originally for struct page_cgroup reduction, I had the idea of
> > introducing something like
> >
> >        struct memcg_mapping {
> >                struct address_space *mapping;
> >                struct mem_cgroup *memcg;
> >        };
> >
> > hanging off page->mapping to make memcg association no longer per-page
> > and save the pc->memcg linkage (it's not completely per-inode either,
> > multiple memcgs can still refer to a single inode).
> >
> > We could put these descriptors on a per-memcg list and write inodes
> > from this list during memcg-writeback.
> >
> > We would have the option of extending this structure to contain hints
> > as to which subrange of the inode is actually owned by the cgroup, to
> > further narrow writeback to the right pages - iff shared big files
> > become a problem.
> >
> > Does that sound feasible?
>
> If I understand your memcg_mapping proposal, then each inode could
> have a collection of memcg_mapping objects representing the set of
> memcg that were charged for caching pages of the inode's data. When a
> new file page is charged to a memcg, then the inode's set of
> memcg_mapping would be scanned to determine if current's memcg is
> already in the memcg_mapping set. If this is the first page for the
> memcg within the inode, then a new memcg_mapping would be allocated
> and attached to the inode. The memcg_mapping may be reference counted
> and would be deleted when the last inode page for a particular memcg
> is uncharged.

Dead-on. Well, on which side you put the list - a per-memcg list of
inodes, or a per-inode list of memcgs - really depends on which way
you want to do the lookups. But this is the idea, yes.

> page->mapping = &memcg_mapping
> inode->i_mapping = collection of memcg_mapping, grows/shrinks with [un]charge

If the memcg_mapping list (or hash-table for quick find-or-create?)
was to be on the inode side, I'd put it in struct address_space, since
this is all about page cache, not so much an fs thing.

Still, correct in general.

2011-03-17 04:42:18

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 2:52 PM, Johannes Weiner <[email protected]> wrote:
> On Wed, Mar 16, 2011 at 02:19:26PM -0700, Greg Thelen wrote:
>> On Wed, Mar 16, 2011 at 6:13 AM, Johannes Weiner <[email protected]> wrote:
>> > On Tue, Mar 15, 2011 at 02:48:39PM -0400, Vivek Goyal wrote:
>> >> I think even for background we shall have to implement some kind of logic
>> >> where inodes are selected by traversing memcg->lru list so that for
>> >> background write we don't end up writting too many inodes from other
>> >> root group in an attempt to meet the low background ratio of memcg.
>> >>
>> >> So to me it boils down to coming up a new inode selection logic for
>> >> memcg which can be used both for background as well as foreground
>> >> writes. This will make sure we don't end up writting pages from the
>> >> inodes we don't want to.
>> >
>> > Originally for struct page_cgroup reduction, I had the idea of
>> > introducing something like
>> >
>> >        struct memcg_mapping {
>> >                struct address_space *mapping;
>> >                struct mem_cgroup *memcg;
>> >        };
>> >
>> > hanging off page->mapping to make memcg association no longer per-page
>> > and save the pc->memcg linkage (it's not completely per-inode either,
>> > multiple memcgs can still refer to a single inode).
>> >
>> > We could put these descriptors on a per-memcg list and write inodes
>> > from this list during memcg-writeback.
>> >
>> > We would have the option of extending this structure to contain hints
>> > as to which subrange of the inode is actually owned by the cgroup, to
>> > further narrow writeback to the right pages - iff shared big files
>> > become a problem.
>> >
>> > Does that sound feasible?
>>
>> If I understand your memcg_mapping proposal, then each inode could
>> have a collection of memcg_mapping objects representing the set of
>> memcg that were charged for caching pages of the inode's data.  When a
>> new file page is charged to a memcg, then the inode's set of
>> memcg_mapping would be scanned to determine if current's memcg is
>> already in the memcg_mapping set.  If this is the first page for the
>> memcg within the inode, then a new memcg_mapping would be allocated
>> and attached to the inode.  The memcg_mapping may be reference counted
>> and would be deleted when the last inode page for a particular memcg
>> is uncharged.
>
> Dead-on.  Well, on which side you put the list - a per-memcg list of
> inodes, or a per-inode list of memcgs - really depends on which way
> you want to do the lookups.  But this is the idea, yes.
>
>>   page->mapping = &memcg_mapping
>>   inode->i_mapping = collection of memcg_mapping, grows/shrinks with [un]charge
>
> If the memcg_mapping list (or hash-table for quick find-or-create?)
> was to be on the inode side, I'd put it in struct address_space, since
> this is all about page cache, not so much an fs thing.
>
> Still, correct in general.
>

In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
Vivek have had some discussion around how memcg and writeback mesh. In my
mind, the discussions in 8/9 are starting to blend with this thread.

I have been thinking about Johannes' struct memcg_mapping. I think the idea
may address several of the issues being discussed, especially the interaction
between IO-less balance_dirty_pages() and memcg writeback.

Here is my thinking. Feedback is most welcome!

The data structures:
- struct memcg_mapping {
        struct address_space *mapping;
        struct mem_cgroup *memcg;
        int refcnt;
  };
- each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
- each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
  very large set representing many cached inodes.
- each memcg_mapping represents all pages within a (bdi, inode, memcg) tuple.
  All corresponding cached inode pages point to the same memcg_mapping via
  pc->mapping. I assume that all pages of an inode belong to no more than one
  bdi.
- manage a global list of memcg that are over their respective background dirty
  limit.
- i_mapping continues to point to a traditional non-memcg mapping (no change
  here).
- none of these memcg_* structures affect the root cgroup or kernels with memcg
  configured out.

The routines under discussion:
- memcg charging a new inode page to a memcg: will use inode->mapping and the
  inode to walk memcg -> memcg_bdi -> memcg_mapping, lazily allocating missing
  levels in the data structure.

- Uncharging an inode page from a memcg: will use pc->mapping->memcg to locate
  the memcg. If the refcnt drops to zero, then remove the memcg_mapping from
  the memcg_bdi. Also delete the memcg_bdi if its last memcg_mapping is
  removed.

- account_page_dirtied(): nothing new here, continue to set the per-page flags
  and increment the memcg per-cpu dirty page counter. Same goes for routines
  that mark pages in writeback and clean states.

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
  background limit, then add the memcg to the global memcg_over_bg_limit list
  and use the memcg's set of memcg_bdi to wake up each(?) corresponding bdi
  flusher. If over the fg limit, then use IO-less style foreground throttling
  with the per-memcg per-bdi (aka memcg_bdi) accounting structure. A rough
  sketch follows this list.

- bdi writeback: will revert some of the mmotm memcg dirty limit changes to
  fs-writeback.c so that wb_do_writeback() will return to checking
  wb_check_background_flush() to check background limits, and to being
  interruptible if a sync flush occurs. wb_check_background_flush() will check
  the global memcg_over_bg_limit list for memcg that are over their dirty
  limit. wb_writeback() will either (I am not sure):
  a) scan the memcg's bdi_memcg list of inodes (only some of them are dirty)
  b) scan the bdi dirty inode list (only some of them in the memcg) using
     inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg)
     would walk memcg -> memcg_bdi -> memcg_mapping to determine if the memcg
     is caching pages from the inode.

- over_bground_thresh() will determine if the memcg is still over its bg
  limit. If over the limit, then per-bdi per-memcg background flushing will
  continue. If not over the limit, then the memcg will be removed from the
  memcg_over_bg_limit list.
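
A rough sketch of that mem_cgroup_balance_dirty_pages() flow (helper names are
placeholders and locking around the global list is omitted):

void mem_cgroup_balance_dirty_pages(struct address_space *mapping,
                                    unsigned long nr_pages_dirtied)
{
        struct backing_dev_info *bdi = mapping->backing_dev_info;
        struct mem_cgroup *memcg = mem_cgroup_from_task(current);

        if (mem_cgroup_dirty_pages(memcg) > mem_cgroup_fg_limit(memcg)) {
                /* IO-less foreground throttling against the memcg_bdi state */
                memcg_bdi_wait_for_writeback(memcg, bdi);
                return;
        }

        if (mem_cgroup_dirty_pages(memcg) > mem_cgroup_bg_limit(memcg)) {
                /* put the memcg on the global over-bg-limit list (if not already there) */
                memcg_over_bg_limit_add(memcg);
                /* and kick the flusher for the bdi being dirtied */
                bdi_start_background_writeback(bdi);
        }
}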

2011-03-17 12:44:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
> Vivek have had some discussion around how memcg and writeback mesh.
> In my mind, the
> discussions in 8/9 are starting to blend with this thread.
>
> I have been thinking about Johannes' struct memcg_mapping. I think the idea
> may address several of the issues being discussed, especially
> interaction between
> IO-less balance_dirty_pages() and memcg writeback.
>
> Here is my thinking. Feedback is most welcome!
>
> The data structures:
> - struct memcg_mapping {
> struct address_space *mapping;
> struct mem_cgroup *memcg;
> int refcnt;
> };
> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> - each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
> very large set representing many cached inodes.
> - each memcg_mapping represents all pages within an bdi,inode,memcg. All
> corresponding cached inode pages point to the same memcg_mapping via
> pc->mapping. I assume that all pages of inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
> limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
> here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
> configured out.

So structures roughly like this:

struct mem_cgroup {
        ...
        /* key is struct backing_dev_info * */
        struct rb_root memcg_bdis;
};

struct memcg_bdi {
        /* key is struct address_space * */
        struct rb_root memcg_mappings;
        struct rb_node node;
};

struct memcg_mapping {
        struct address_space *mapping;
        struct mem_cgroup *memcg;
        struct rb_node node;
        atomic_t count;
};

struct page_cgroup {
        ...
        struct memcg_mapping *memcg_mapping;
};

> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
> levels in data structure.
>
> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> memcg. If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
> Also delete memcg_bdi if last memcg_mapping is removed.
>
> - account_page_dirtied(): nothing new here, continue to set the per-page flags
> and increment the memcg per-cpu dirty page counter. Same goes for routines
> that mark pages in writeback and clean states.

We may want to remember the dirty memcg_mappings so that on writeback
we don't have to go through every single one that the memcg refers to?

> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above
> background limit, then add memcg to global memcg_over_bg_limit list and use
> memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
> fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
> (aka memcg_bdi) accounting structure.

I wonder if we could just schedule a for_background work manually in
the memcg case that writes back the corresponding memcg_bdi set (and
e.g. having it continue until either the memcg is below bg thresh OR
the global bg thresh is exceeded OR there is other work scheduled)?
Then we would get away without the extra list, and it doesn't sound
overly complex to implement.
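
The continuation check for such a manually scheduled work could then look
something like this (sketch only; 'memcg' on the work item is the field patch
9/9 adds, and mem_cgroup_over_bg_limit() is made up):

static bool keep_memcg_background(struct bdi_writeback *wb,
                                  struct wb_writeback_work *work)
{
        /* yield to explicitly queued work (sync, etc.) */
        if (!list_empty(&wb->bdi->work_list))
                return false;
        /* once the global background limit is exceeded, plain background
         * writeback takes over anyway */
        if (over_bground_thresh())
                return false;
        /* keep going while the memcg is over its own background limit */
        return mem_cgroup_over_bg_limit(work->memcg);
}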

2011-03-17 14:46:50

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Wed 16-03-11 21:41:48, Greg Thelen wrote:
> I have been thinking about Johannes' struct memcg_mapping. I think the idea
> may address several of the issues being discussed, especially
> interaction between
> IO-less balance_dirty_pages() and memcg writeback.
>
> Here is my thinking. Feedback is most welcome!
>
> The data structures:
> - struct memcg_mapping {
> struct address_space *mapping;
> struct mem_cgroup *memcg;
> int refcnt;
> };
> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> - each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
> very large set representing many cached inodes.
> - each memcg_mapping represents all pages within an bdi,inode,memcg. All
> corresponding cached inode pages point to the same memcg_mapping via
> pc->mapping. I assume that all pages of inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
> limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
> here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
> configured out.
>
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
> levels in data structure.
>
> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> memcg. If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
> Also delete memcg_bdi if last memcg_mapping is removed.
>
> - account_page_dirtied(): nothing new here, continue to set the per-page flags
> and increment the memcg per-cpu dirty page counter. Same goes for routines
> that mark pages in writeback and clean states.
>
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above
> background limit, then add memcg to global memcg_over_bg_limit list and
> use memcg's set of memcg_bdi to wakeup each(?) corresponding bdi
> flusher.
In fact, mem_cgroup_balance_dirty_pages() is called for a particular bdi.
So after some thought I think we should wake up flusher thread only for that
bdi to mimic the logic of global background limit. If memcg is dirtying
pages on another bdi, mem_cgroup_balance_dirty_pages() will get called for
that bdi eventually as well.

> If over fg limit, then use IO-less style foreground
> throttling with per-memcg per-bdi (aka memcg_bdi) accounting structure.
We'll probably need a counter of written pages in memcg_bdi so that we
are able to tell how much progress we are making with the writeback and
decide when to release the throttled thread. But it should be doable just fine.

> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> fs-writeback.c so that wb_do_writeback() will return to checking
> wb_check_background_flush() to check background limits and being
> interruptible if
> sync flush occurs. wb_check_background_flush() will check the global
> memcg_over_bg_limit list for memcg that are over their dirty limit.
> wb_writeback() will either (I am not sure):
> a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
> b) scan bdi dirty inode list (only some of them in memcg) using
> inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
> would walk memcg -> memcg_bdi -> memcg_mapping to determine if the memcg
> is caching pages from the inode.
Hmm, both have their problems. With a) we could queue all the dirty inodes
from the memcg for writeback but then we'd essentially write all dirty data
for a memcg, not only enough data to get below the bg limit. And if we started
skipping inodes when the memcg(s) an inode belongs to get below the bg limit,
we'd risk moving inodes there and back without reason, cases where some inodes
never get written because they always end up skipped, etc. Also, the question
whether some of the memcgs an inode belongs to is still over its limit is the
hardest part of solution b), so we wouldn't help ourselves much.

The problem with b) is that the flusher thread will not work on behalf
of one memcg but rather on behalf of all memcgs that have crossed their
background limits. Thus what it should do is scan the inode dirty list and
for each inode ask: does the inode belong to any of the memcgs that have
crossed their background limit? If yes, write it; if no, skip it. I'm not sure
about the best data structure for such a query - essentially we have a set of
memcgs for an inode (memcgs which have pages in a mapping - ideally only
dirty ones but I guess that's asking for too much ;)) and a set of
memcgs that have crossed their background limit, and we ask whether they have
a nonempty intersection. Hmm, if we had the memcgs for the inode and the
memcgs over the bg limit in trees ordered by an arbitrary criterion, we could
do the intersection rather efficiently in time O(m*log(n)) where 'm' is the
size of the smaller tree and 'n' the size of the larger tree. But it gets complex.
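
For what it's worth, the lookup-in-the-larger-tree variant is short; a sketch,
with every type and helper hypothetical:

/* 'm' memcgs referenced by the inode, 'n' memcgs over their bg limit */
static bool inode_has_memcg_over_bg_limit(struct list_head *inode_memcgs,
                                          struct rb_root *over_bg_tree)
{
        struct memcg_ref *ref;

        list_for_each_entry(ref, inode_memcgs, node)             /* m iterations */
                if (memcg_tree_lookup(over_bg_tree, ref->memcg)) /* log(n) each */
                        return true;
        return false;
}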

All in all I see as reasonable choices either b) or a) in a trivial variant
where we write all the dirty data in a memcg...

> - over_bground_thresh() will determine if memcg is still over bg limit.
> If over limit, then it per bdi per memcg background flushing will continue.
> If not over limit then memcg will be removed from memcg_over_bg_limit list.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-17 14:49:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 01:43:50PM +0100, Johannes Weiner wrote:
> On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> > In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
> > Vivek have had some discussion around how memcg and writeback mesh.
> > In my mind, the
> > discussions in 8/9 are starting to blend with this thread.
> >
> > I have been thinking about Johannes' struct memcg_mapping. I think the idea
> > may address several of the issues being discussed, especially
> > interaction between
> > IO-less balance_dirty_pages() and memcg writeback.
> >
> > Here is my thinking. Feedback is most welcome!
> >
> > The data structures:
> > - struct memcg_mapping {
> > struct address_space *mapping;
> > struct mem_cgroup *memcg;
> > int refcnt;
> > };
> > - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> > - each memcg_bdi contains a mapping from inode to memcg_mapping. This may be a
> > very large set representing many cached inodes.
> > - each memcg_mapping represents all pages within an bdi,inode,memcg. All
> > corresponding cached inode pages point to the same memcg_mapping via
> > pc->mapping. I assume that all pages of inode belong to no more than one bdi.
> > - manage a global list of memcg that are over their respective background dirty
> > limit.
> > - i_mapping continues to point to a traditional non-memcg mapping (no change
> > here).
> > - none of these memcg_* structures affect root cgroup or kernels with memcg
> > configured out.
>
> So structures roughly like this:
>
> struct mem_cgroup {
>         ...
>         /* key is struct backing_dev_info * */
>         struct rb_root memcg_bdis;
> };
>
> struct memcg_bdi {
>         /* key is struct address_space * */
>         struct rb_root memcg_mappings;
>         struct rb_node node;
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct mem_cgroup *memcg;
>         struct rb_node node;
>         atomic_t count;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
> > The routines under discussion:
> > - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> > to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
> > levels in data structure.
> >
> > - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> > memcg. If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
> > Also delete memcg_bdi if last memcg_mapping is removed.
> >
> > - account_page_dirtied(): nothing new here, continue to set the per-page flags
> > and increment the memcg per-cpu dirty page counter. Same goes for routines
> > that mark pages in writeback and clean states.
>
> We may want to remember the dirty memcg_mappings so that on writeback
> we don't have to go through every single one that the memcg refers to?
>
> > - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above
> > background limit, then add memcg to global memcg_over_bg_limit list and use
> > memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
> > fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
> > (aka memcg_bdi) accounting structure.
>
> I wonder if we could just schedule a for_background work manually in
> the memcg case that writes back the corresponding memcg_bdi set (and
> e.g. having it continue until either the memcg is below bg thresh OR
> the global bg thresh is exceeded OR there is other work scheduled)?
> Then we would get away without the extra list, and it doesn't sound
> overly complex to implement.

Jan thought that the design of maintaining a list of memory cgroups (memcg_bdi)
in this case is cleaner. So he preferred to let the queuing of work be
for sync work, and all this background writeout and IO-less throttling
stuff can be covered through the background writeout logic.

I think I tend to agree that keeping a list is a clean design, as the bdi
is a shared medium and now the flusher thread can decide how to divide the
bandwidth of the shared bdi among the queued memory cgroups. So as long as
it does not turn out to be complex, I think maintaining a separate list
of cgroups waiting for writeback on this bdi is not a bad idea.

Thanks
Vivek

2011-03-17 14:53:08

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu 17-03-11 13:43:50, Johannes Weiner wrote:
> > - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above
> > background limit, then add memcg to global memcg_over_bg_limit list and use
> > memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
> > fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
> > (aka memcg_bdi) accounting structure.
>
> I wonder if we could just schedule a for_background work manually in
> the memcg case that writes back the corresponding memcg_bdi set (and
> e.g. having it continue until either the memcg is below bg thresh OR
> the global bg thresh is exceeded OR there is other work scheduled)?
> Then we would get away without the extra list, and it doesn't sound
> overly complex to implement.
But then, when you stop background writeback because of other work, you
have to know that you should restart it after that other work is done. For
this you basically need the list. With the one-work-per-memcg approach you
also run into the problem that one cgroup can livelock the flusher thread
and thus other memcgs won't get writeback. So you have to switch between
memcgs once in a while.

We've tried several approaches with global background writeback before we
arrived at what we have now and what seems to work at least reasonably...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-17 15:42:34

by Curt Wohlgemuth

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 7:53 AM, Jan Kara <[email protected]> wrote:
> On Thu 17-03-11 13:43:50, Johannes Weiner wrote:
>> > - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage if above
>> >   background limit, then add memcg to global memcg_over_bg_limit list and use
>> >   memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher.  If over
>> >   fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
>> >   (aka memcg_bdi) accounting structure.
>>
>> I wonder if we could just schedule a for_background work manually in
>> the memcg case that writes back the corresponding memcg_bdi set (and
>> e.g. having it continue until either the memcg is below bg thresh OR
>> the global bg thresh is exceeded OR there is other work scheduled)?
>> Then we would get away without the extra list, and it doesn't sound
>> overly complex to implement.
>  But then when you stop background writeback because of other work, you
> have to know you should restart it after that other work is done. For this
> you basically need the list. With this approach of one-work-per-memcg
> you also get into problems that one cgroup can livelock the flusher thread
> and thus other memcgs won't get writeback. So you have to switch between
> memcgs once in a while.

In pre-2.6.38 kernels (when background writeback enqueued work items, and
we didn't break out of the loop in wb_writeback() with for_background when
other work items arrived), we experimented with this issue. One solution we
came up with was enqueuing a background work item for a given memory cgroup,
but limiting nr_pages to something like 2048 instead of LONG_MAX, to avoid
livelock. Writeback would only operate on inodes with dirty pages from this
memory cgroup. (A rough sketch of such a bounded work item follows.)
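For illustration only, a sketch against the 2.6.38-era wb_writeback_work in
fs/fs-writeback.c. The nr_pages cap, the ->memcg field and
memcg_queue_bg_writeback() are assumptions for this discussion, not existing
kernel API; bdi_queue_work() is internal to fs-writeback.c, so something
like this would have to live there:

/* Sketch only: cap per-memcg background writeback to avoid livelock. */
#define MEMCG_BG_WRITEBACK_PAGES        2048

static void memcg_queue_bg_writeback(struct mem_cgroup *memcg,
                                     struct backing_dev_info *bdi)
{
        struct wb_writeback_work *work;

        work = kzalloc(sizeof(*work), GFP_ATOMIC);
        if (!work)
                return;

        work->sync_mode      = WB_SYNC_NONE;
        work->nr_pages       = MEMCG_BG_WRITEBACK_PAGES; /* not LONG_MAX */
        work->for_background = 1;
        work->memcg          = memcg;   /* hypothetical: only this memcg's inodes */

        bdi_queue_work(bdi, work);
}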

If BG writeback takes place for all memcgs that are over their BG limits,
it seems that simply asking whether each inode is "related" somehow to the
set of dirty memcgs is the simplest way to go, assuming of course that
efficient data structures are built to answer this question.

Thanks,
Curt

> We've tried several approaches with global background writeback before we
> arrived at what we have now and what seems to work at least reasonably...
>
>                                                                 Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

2011-03-17 17:13:46

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 03:46:41PM +0100, Jan Kara wrote:

[..]
> > - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> > fs-writeback.c so that wb_do_writeback() will return to checking
> > wb_check_background_flush() to check background limits and being
> > interruptible if
> > sync flush occurs. wb_check_background_flush() will check the global
> > memcg_over_bg_limit list for memcg that are over their dirty limit.
> > wb_writeback() will either (I am not sure):
> > a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
> > b) scan bdi dirty inode list (only some of them in memcg) using
> > inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
> > would walk memcg- -> memcg_bdi -> memcg_mapping to determine if the memcg
> > is caching pages from the inode.
> Hmm, both has its problems. With a) we could queue all the dirty inodes
> from the memcg for writeback but then we'd essentially write all dirty data
> for a memcg, not only enough data to get below bg limit. And if we started
> skipping inodes when memcg(s) inode belongs to get below bg limit, we'd
> risk copying inodes there and back without reason, cases where some inodes
> never get written because they always end up skipped etc. Also the question
> whether some of the memcgs inode belongs to is still over limit is the
> hardest part of solution b) so we wouldn't help ourselves much.

Maybe I am missing something, but can't we just start traversing the
memcg_over_bg_limit list and take option a): traverse each group's list of
inodes and write them until we are below the limit of that group? We of
course skip inodes which are not dirty.

This assumes that the root group is also part of that list, so that inodes
in the root group are not starved of writeback.

We still continue to have all the inodes on the bdi wb structure; the memcg
just gives us pointers to those inodes. So for background writeback, instead
of going serially through the dirty inode list, we first pick the cgroup to
write and then the inode to write. Since we round-robin among the cgroups on
the list, this makes sure that neither the cgroups (including root) nor the
inodes are starved. (A rough sketch of such a loop is below.)
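A rough sketch of that loop. Everything here is illustrative:
memcg_over_bg_limit, bg_node, memcg_writeback_inodes() and
memcg_over_background_limit() are made-up names for the global list and
helpers described above, and locking of the list is omitted:

/* Sketch: round-robin background writeback across over-limit cgroups. */
static long memcg_background_flush(struct bdi_writeback *wb)
{
        struct mem_cgroup *memcg, *tmp;
        long written = 0;

        list_for_each_entry_safe(memcg, tmp, &memcg_over_bg_limit, bg_node) {
                /*
                 * Write a bounded chunk of this group's dirty inodes on
                 * this bdi, then move on to the next group so that no
                 * single cgroup (or inode) starves the others.
                 */
                written += memcg_writeback_inodes(memcg, wb,
                                                  MAX_WRITEBACK_PAGES);

                if (!memcg_over_background_limit(memcg))
                        list_del_init(&memcg->bg_node); /* below bg limit now */
        }
        return written;
}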

What am I missing?

Thanks
Vivek

2011-03-17 17:59:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu 17-03-11 13:12:19, Vivek Goyal wrote:
> On Thu, Mar 17, 2011 at 03:46:41PM +0100, Jan Kara wrote:
> [..]
> > > - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> > > fs-writeback.c so that wb_do_writeback() will return to checking
> > > wb_check_background_flush() to check background limits and being
> > > interruptible if
> > > sync flush occurs. wb_check_background_flush() will check the global
> > > memcg_over_bg_limit list for memcg that are over their dirty limit.
> > > wb_writeback() will either (I am not sure):
> > > a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
> > > b) scan bdi dirty inode list (only some of them in memcg) using
> > > inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
> > > would walk memcg- -> memcg_bdi -> memcg_mapping to determine if the memcg
> > > is caching pages from the inode.
> > Hmm, both has its problems. With a) we could queue all the dirty inodes
> > from the memcg for writeback but then we'd essentially write all dirty data
> > for a memcg, not only enough data to get below bg limit. And if we started
> > skipping inodes when memcg(s) inode belongs to get below bg limit, we'd
> > risk copying inodes there and back without reason, cases where some inodes
> > never get written because they always end up skipped etc. Also the question
> > whether some of the memcgs inode belongs to is still over limit is the
> > hardest part of solution b) so we wouldn't help ourselves much.
>
> May be I am missing something but can't we just start traversing
> through list of memcg_over_bg_list and take option a) to traverse
> through list of inodes and write them till we are below limit of
> that group. We of course skip inodes which are not dirty.
>
> This is assuming that root group is also part of that list so that
> inodes in root group do not starve writeback.
>
> We still continue to have all the inodes on bdi wb structure and
> memcg will just give us pointers to those inodes. So for background
> write, instead of going serially through dirty inodes list, we
> will first pick the cgroup to write and then inode to write. As
> we will be doing round robin among cgroup list, it will make sure
> that none of the cgroups (including root) as well as inode are not
> starved.
I was considering this as well and didn't quite like it, but on second
thought it need not be that bad. If we wrote MAX_WRITEBACK_PAGES from one
memcg, then switched to another one while keeping a pointer into the
per-memcg inode list (for when we return to this memcg), it could work just
fine.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2011-03-17 18:16:16

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 06:59:08PM +0100, Jan Kara wrote:
> On Thu 17-03-11 13:12:19, Vivek Goyal wrote:
> > On Thu, Mar 17, 2011 at 03:46:41PM +0100, Jan Kara wrote:
> > [..]
> > > > - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> > > > fs-writeback.c so that wb_do_writeback() will return to checking
> > > > wb_check_background_flush() to check background limits and being
> > > > interruptible if
> > > > sync flush occurs. wb_check_background_flush() will check the global
> > > > memcg_over_bg_limit list for memcg that are over their dirty limit.
> > > > wb_writeback() will either (I am not sure):
> > > > a) scan memcg's bdi_memcg list of inodes (only some of them are dirty)
> > > > b) scan bdi dirty inode list (only some of them in memcg) using
> > > > inode_in_memcg() to identify inodes to write. inode_in_memcg(inode,memcg),
> > > > would walk memcg- -> memcg_bdi -> memcg_mapping to determine if the memcg
> > > > is caching pages from the inode.
> > > Hmm, both has its problems. With a) we could queue all the dirty inodes
> > > from the memcg for writeback but then we'd essentially write all dirty data
> > > for a memcg, not only enough data to get below bg limit. And if we started
> > > skipping inodes when memcg(s) inode belongs to get below bg limit, we'd
> > > risk copying inodes there and back without reason, cases where some inodes
> > > never get written because they always end up skipped etc. Also the question
> > > whether some of the memcgs inode belongs to is still over limit is the
> > > hardest part of solution b) so we wouldn't help ourselves much.
> >
> > May be I am missing something but can't we just start traversing
> > through list of memcg_over_bg_list and take option a) to traverse
> > through list of inodes and write them till we are below limit of
> > that group. We of course skip inodes which are not dirty.
> >
> > This is assuming that root group is also part of that list so that
> > inodes in root group do not starve writeback.
> >
> > We still continue to have all the inodes on bdi wb structure and
> > memcg will just give us pointers to those inodes. So for background
> > write, instead of going serially through dirty inodes list, we
> > will first pick the cgroup to write and then inode to write. As
> > we will be doing round robin among cgroup list, it will make sure
> > that none of the cgroups (including root) as well as inode are not
> > starved.
> I was considering this as well and didn't quite like it but on a second
> thought it need not be that bad. If we wrote MAX_WRITEBACK_PAGES from one
> memcg, then switched to another one while keeping pointers to per-memcg inode
> list (for the time when we return to this memcg), it could work just fine.

Yes, we can write MAX_WRITEBACK_PAGES from each memcg and then move on to
the next one. In fact memcg_bdi should have a list of memcg_mapping. So once
we select an inode (memcg_mapping) from a cgroup for writeout (move the
inode onto the ->b_io list), we can also shuffle the position of the
memcg_mapping within the memory cgroup, so that inodes within a cgroup get a
fair share of writeout in a round-robin manner.

As you said in the other mail, we will probably keep a MEMCG_BDI_WRITTEN
count so that IO-less throttling can attribute the completed pages to the
right cgroup.

Down the line we could also maintain MEMCG_BDI_WRITEBACK to keep track of
how many pages are already under writeout from a cgroup and skip that cgroup
if too many pages are already in flight. This might help us push more writes
for a higher-weight IO cgroup compared to a lower-weight one. (A small
sketch of such counters is below.)
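A small sketch of what such counters could look like. MEMCG_BDI_MAX_INFLIGHT,
the stats field inside memcg_bdi and memcg_bdi_congested() are placeholders
for the idea above, not existing code:

/* Sketch: per-memcg, per-bdi writeback counters. */
struct memcg_bdi_stats {
        atomic_long_t written;          /* pages whose writeback completed */
        atomic_long_t writeback;        /* pages currently under writeback */
};

#define MEMCG_BDI_MAX_INFLIGHT  1024    /* arbitrary threshold for the sketch */

static bool memcg_bdi_congested(struct memcg_bdi *mb)
{
        /* skip this cgroup if too many of its pages are already in flight */
        return atomic_long_read(&mb->stats.writeback) > MEMCG_BDI_MAX_INFLIGHT;
}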

Thanks
Vivek

2011-03-18 07:57:44

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
>> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
>> Vivek have had some discussion around how memcg and writeback mesh.
>> In my mind, the
>> discussions in 8/9 are starting to blend with this thread.
>>
>> I have been thinking about Johannes' struct memcg_mapping.  I think the idea
>> may address several of the issues being discussed, especially
>> interaction between
>> IO-less balance_dirty_pages() and memcg writeback.
>>
>> Here is my thinking.  Feedback is most welcome!
>>
>> The data structures:
>> - struct memcg_mapping {
>>        struct address_space *mapping;
>>        struct mem_cgroup *memcg;
>>        int refcnt;
>>   };
>> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
>> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be a
>>   very large set representing many cached inodes.
>> - each memcg_mapping represents all pages within an bdi,inode,memcg.  All
>>   corresponding cached inode pages point to the same memcg_mapping via
>>   pc->mapping.  I assume that all pages of inode belong to no more than one bdi.
>> - manage a global list of memcg that are over their respective background dirty
>>   limit.
>> - i_mapping continues to point to a traditional non-memcg mapping (no change
>>   here).
>> - none of these memcg_* structures affect root cgroup or kernels with memcg
>>   configured out.
>
> So structures roughly like this:
>
> struct mem_cgroup {
>        ...
>        /* key is struct backing_dev_info * */
>        struct rb_root memcg_bdis;
> };
>
> struct memcg_bdi {
>        /* key is struct address_space * */
>        struct rb_root memcg_mappings;
>        struct rb_node node;
> };
>
> struct memcg_mapping {
>        struct address_space *mapping;
>        struct mem_cgroup *memcg;
>        struct rb_node node;
>        atomic_t count;
> };
>
> struct page_cgroup {
>        ...
>        struct memcg_mapping *memcg_mapping;
> };
>
>> The routines under discussion:
>> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
>>   to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
>>   levels in data structure.
>>
>> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
>>   memcg.  If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
>>   Also delete memcg_bdi if last memcg_mapping is removed.
>>
>> - account_page_dirtied(): nothing new here, continue to set the per-page flags
>>   and increment the memcg per-cpu dirty page counter.  Same goes for routines
>>   that mark pages in writeback and clean states.
>
> We may want to remember the dirty memcg_mappings so that on writeback
> we don't have to go through every single one that the memcg refers to?

I think it is a good idea to keep a per-memcg, per-bdi list of dirty mappings.

It feels like some of this is starting to gel. I've been sketching
some of the code to see how the memcg locking will work out. The
basic structures I see are:

struct mem_cgroup {
        ...
        /*
         * All file pages cached by this memcg, sorted by bdi.
         * key is struct backing_dev_info *; value is struct memcg_bdi *
         * Protected by bdis_lock.
         */
        struct rb_root bdis;
        spinlock_t bdis_lock;   /* or use an rcu structure; the memcg:bdi
                                   set could be fairly static */
};

struct memcg_bdi {
        struct backing_dev_info *bdi;
        /*
         * key is struct address_space *; value is struct memcg_mapping *
         * memcg_mappings live within either the mappings or the
         * dirty_mappings set.
         */
        struct rb_root mappings;
        struct rb_root dirty_mappings;
        struct rb_node node;
        spinlock_t lock;        /* protects [dirty_]mappings */
};

struct memcg_mapping {
        struct address_space *mapping;
        struct memcg_bdi *memcg_bdi;
        struct rb_node node;
        atomic_t nr_pages;
        atomic_t nr_dirty;
};

struct page_cgroup {
        ...
        struct memcg_mapping *memcg_mapping;
};

- each memcg contains a mapping from bdi to memcg_bdi.
- each memcg_bdi contains two mappings:
    mappings:       from address_space to memcg_mapping for clean pages
    dirty_mappings: from address_space to memcg_mapping when there are some
                    dirty pages
- each memcg_mapping represents a set of cached pages within a
  bdi,inode,memcg.  All corresponding cached inode pages point to the same
  memcg_mapping via pc->mapping.  I assume that all pages of an inode belong
  to no more than one bdi.
- manage a global list of memcg that are over their respective background
  dirty limit.
- i_mapping continues to point to a traditional non-memcg mapping (no change
  here).
- none of these memcg_* structures affect the root cgroup or kernels with
  memcg configured out.
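As an illustration of how these structures might be walked and lazily
populated when charging a page (the first routine below), a sketch.
memcg_bdi_find_or_create() and memcg_mapping_find_or_create() are
hypothetical helpers over the rb-trees above, and error handling is elided:

/* Sketch: find or lazily create the memcg_mapping for a charged page. */
static struct memcg_mapping *
memcg_mapping_get(struct mem_cgroup *memcg, struct address_space *mapping)
{
        struct backing_dev_info *bdi = mapping->backing_dev_info;
        struct memcg_bdi *mb;
        struct memcg_mapping *mm;

        spin_lock(&memcg->bdis_lock);
        mb = memcg_bdi_find_or_create(memcg, bdi);      /* memcg->bdis */
        spin_unlock(&memcg->bdis_lock);

        spin_lock(&mb->lock);
        mm = memcg_mapping_find_or_create(mb, mapping); /* mb->mappings */
        atomic_inc(&mm->nr_pages);
        spin_unlock(&mb->lock);

        return mm;
}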

The routines under discussion:
- memcg charging a new inode page to a memcg: will use inode->mapping and the
  inode to walk memcg -> memcg_bdi -> mappings, lazily allocating missing
  levels in the data structure (roughly as in the sketch above).

- Uncharging an inode page from a memcg: will use pc->mapping->memcg to locate
  the memcg.  If the refcnt drops to zero, then remove the memcg_mapping from
  memcg_bdi.[dirty_]mappings.  Also delete the memcg_bdi if its last
  memcg_mapping is removed.

- account_page_dirtied(): increment nr_dirty.  If this is the first dirty
  page, then move the memcg_mapping from memcg_bdi.mappings to
  memcg_bdi.dirty_mappings.  When marking a page clean, do the opposite.
  (See the sketch after this list.)

- mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage is above the
  background limit, then add the memcg to the global memcg_over_bg_limit list
  and use the memcg's set of memcg_bdi to wake up each(?) corresponding bdi
  flusher.  If over the fg limit, then use IO-less style foreground throttling
  with the per-memcg per-bdi (aka memcg_bdi) accounting structure.

- bdi writeback: will revert some of the mmotm memcg dirty limit changes to
  fs-writeback.c so that wb_do_writeback() returns to calling
  wb_check_background_flush() to check background limits, and remains
  interruptible if a sync flush occurs.  wb_check_background_flush() will
  check the global memcg_over_bg_limit list for memcgs that are over their
  dirty limit.  Within each memcg, write inodes from the dirty_mappings list
  until a threshold page count (MAX_WRITEBACK_PAGES) has been reached, then
  move to the next listed memcg.

- over_bground_thresh() will determine if the memcg is still over its bg
  limit.  If it is over the limit, then the per-bdi per-memcg background
  flushing will continue.  If it is not over the limit, the memcg will be
  removed from the memcg_over_bg_limit list.
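A sketch of the mappings -> dirty_mappings transition mentioned in the
account_page_dirtied() item above. The rb-tree insert helper and the exact
locking are assumptions, and races with the clean path are ignored here:

/* Sketch: move a memcg_mapping to the dirty tree on its first dirty page. */
static void memcg_mapping_account_dirtied(struct memcg_mapping *mm)
{
        struct memcg_bdi *mb = mm->memcg_bdi;

        if (atomic_inc_return(&mm->nr_dirty) == 1) {
                spin_lock(&mb->lock);
                rb_erase(&mm->node, &mb->mappings);
                /* hypothetical rb-tree insert keyed by mm->mapping */
                memcg_mapping_insert(&mb->dirty_mappings, mm);
                spin_unlock(&mb->lock);
        }
}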

I'll post my resulting patches in RFC form, or (at the least) my conclusions.

2011-03-18 14:31:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Thu, Mar 17, 2011 at 01:43:50PM +0100, Johannes Weiner wrote:

[..]
> So structures roughly like this:
>
> struct mem_cgroup {
> ...
> /* key is struct backing_dev_info * */
> struct rb_root memcg_bdis;
> };
>
> struct memcg_bdi {
> /* key is struct address_space * */
> struct rb_root memcg_mappings;
> struct rb_node node;
> };
>
> struct memcg_mapping {
> struct address_space *mapping;
> struct mem_cgroup *memcg;
> struct rb_node node;
> atomic_t count;
> };
>
> struct page_cgroup {
> ...
> struct memcg_mapping *memcg_mapping;
> };

Johannes, didn't you want page->mapping to point to memcg_mapping instead
of increasing the size of page_cgroup?

Thanks
Vivek

2011-03-18 14:47:39

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, Mar 18, 2011 at 10:29:50AM -0400, Vivek Goyal wrote:
> On Thu, Mar 17, 2011 at 01:43:50PM +0100, Johannes Weiner wrote:
>
> [..]
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> > ...
> > /* key is struct backing_dev_info * */
> > struct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> > /* key is struct address_space * */
> > struct rb_root memcg_mappings;
> > struct rb_node node;
> > };
> >
> > struct memcg_mapping {
> > struct address_space *mapping;
> > struct mem_cgroup *memcg;
> > struct rb_node node;
> > atomic_t count;
> > };
> >
> > struct page_cgroup {
> > ...
> > struct memcg_mapping *memcg_mapping;
> > };
>
> Johannes, didn't you want page->mapping to point to memcg_mapping instead
> of increasing the size of page_cgroup?

Initially, yes, but this is far less invasive. We don't increase
page_cgroup, though: memcg_mapping contains the pointer to struct
mem_cgroup, so it can replace pc->memcg for now.

2011-03-18 14:51:06

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, Mar 18, 2011 at 12:57:09AM -0700, Greg Thelen wrote:

[..]
> I think this is a good idea to allow per memcg per bdi list of dirty mappings.
>
> It feels like some of this is starting to gel. I've been sketching
> some of the code to see how the memcg locking will work out. The
> basic structures I see are:
>
> struct mem_cgroup {
>         ...
>         /*
>          * All file pages cached by this memcg, sorted by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock;   /* or use an rcu structure; the memcg:bdi
>                                    set could be fairly static */
> };
>
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct memcg_mapping *
>          * memcg_mappings live within either the mappings or the
>          * dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock;        /* protects [dirty_]mappings */
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
> mappings: from address_space to memcg_mapping for clean pages
> dirty_mappings: from address_space to memcg_mapping when there are
> some dirty pages

Keeping dirty_mappings separate sounds like a good idea. To me this is the
equivalent of wb->b_dirty, and down the line we might also want to maintain
equivalents of ->b_io and ->b_more_io. But that's for later.

> - each memcg_mapping represents a set of cached pages within an
> bdi,inode,memcg. All
> corresponding cached inode pages point to the same memcg_mapping via
> pc->mapping. I assume that all pages of inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
> limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
> here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
> configured out.

So there will not be any memcg structure for the root cgroup? Then how do we
make sure that the flusher thread starves neither root cgroup inodes nor
memory cgroup inodes? I thought that if everything is on a single list
(including the root group), then we could just select cgroups for writeback
in a round-robin manner. With the root cgroup not being on that list, how
would you make sure that the root group's inodes are not starved of
writeback?

>
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> to walk memcg -> memcg_bdi -> mappings and lazily allocating missing
> levels in data structure.
>
> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> memcg. If refcnt drops to zero, then remove memcg_mapping from the
> memcg_bdi.[dirty_]mappings.
> Also delete memcg_bdi if last memcg_mapping is removed.
>
> - account_page_dirtied(): increment nr_dirty. If first dirty page,
> then move memcg_mapping from memcg_bdi.mappings to
> memcg_bdi.dirty_mappings page counter. When marking page clean, do
> the opposite.
>
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage if above
> background limit, then add memcg to global memcg_over_bg_limit list and use
> memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
> fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
> (aka memcg_bdi) accounting structure.

In the other mail Jan mentioned that mem_cgroup_balance_dirty_pages() is per
bdi, so we only have to wake up the corresponding bdi flusher thread.

Thanks
Vivek

2011-03-23 09:13:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v6 0/9] memcg: per cgroup dirty page accounting

On Fri, 18 Mar 2011 00:57:09 -0700
Greg Thelen <[email protected]> wrote:

> On Thu, Mar 17, 2011 at 5:43 AM, Johannes Weiner <[email protected]> wrote:
> > On Wed, Mar 16, 2011 at 09:41:48PM -0700, Greg Thelen wrote:
> >> In '[PATCH v6 8/9] memcg: check memcg dirty limits in page writeback' Jan and
> >> Vivek have had some discussion around how memcg and writeback mesh.
> >> In my mind, the
> >> discussions in 8/9 are starting to blend with this thread.
> >>
> >> I have been thinking about Johannes' struct memcg_mapping.  I think the idea
> >> may address several of the issues being discussed, especially
> >> interaction between
> >> IO-less balance_dirty_pages() and memcg writeback.
> >>
> >> Here is my thinking.  Feedback is most welcome!
> >>
> >> The data structures:
> >> - struct memcg_mapping {
> >>        struct address_space *mapping;
> >>        struct mem_cgroup *memcg;
> >>        int refcnt;
> >>   };
> >> - each memcg contains a (radix, hash_table, etc.) mapping from bdi to memcg_bdi.
> >> - each memcg_bdi contains a mapping from inode to memcg_mapping.  This may be a
> >>   very large set representing many cached inodes.
> >> - each memcg_mapping represents all pages within an bdi,inode,memcg.  All
> >>   corresponding cached inode pages point to the same memcg_mapping via
> >>   pc->mapping.  I assume that all pages of inode belong to no more than one bdi.
> >> - manage a global list of memcg that are over their respective background dirty
> >>   limit.
> >> - i_mapping continues to point to a traditional non-memcg mapping (no change
> >>   here).
> >> - none of these memcg_* structures affect root cgroup or kernels with memcg
> >>   configured out.
> >
> > So structures roughly like this:
> >
> > struct mem_cgroup {
> >        ...
> >        /* key is struct backing_dev_info * */
> >        struct rb_root memcg_bdis;
> > };
> >
> > struct memcg_bdi {
> >        /* key is struct address_space * */
> >        struct rb_root memcg_mappings;
> >        struct rb_node node;
> > };
> >
> > struct memcg_mapping {
> >        struct address_space *mapping;
> >        struct mem_cgroup *memcg;
> >        struct rb_node node;
> >        atomic_t count;
> > };
> >
> > struct page_cgroup {
> >        ...
> >        struct memcg_mapping *memcg_mapping;
> > };
> >
> >> The routines under discussion:
> >> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> >>   to walk memcg -> memcg_bdi -> memcg_mapping and lazily allocating missing
> >>   levels in data structure.
> >>
> >> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> >>   memcg.  If refcnt drops to zero, then remove memcg_mapping from the memcg_bdi.
> >>   Also delete memcg_bdi if last memcg_mapping is removed.
> >>
> >> - account_page_dirtied(): nothing new here, continue to set the per-page flags
> >>   and increment the memcg per-cpu dirty page counter.  Same goes for routines
> >>   that mark pages in writeback and clean states.
> >
> > We may want to remember the dirty memcg_mappings so that on writeback
> > we don't have to go through every single one that the memcg refers to?
>
> I think this is a good idea to allow per memcg per bdi list of dirty mappings.
>
> It feels like some of this is starting to gel. I've been sketching
> some of the code to see how the memcg locking will work out. The
> basic structures I see are:
>
> struct mem_cgroup {
>         ...
>         /*
>          * All file pages cached by this memcg, sorted by bdi.
>          * key is struct backing_dev_info *; value is struct memcg_bdi *
>          * Protected by bdis_lock.
>          */
>         struct rb_root bdis;
>         spinlock_t bdis_lock;   /* or use an rcu structure; the memcg:bdi
>                                    set could be fairly static */
> };
>
> struct memcg_bdi {
>         struct backing_dev_info *bdi;
>         /*
>          * key is struct address_space *; value is struct memcg_mapping *
>          * memcg_mappings live within either the mappings or the
>          * dirty_mappings set.
>          */
>         struct rb_root mappings;
>         struct rb_root dirty_mappings;
>         struct rb_node node;
>         spinlock_t lock;        /* protects [dirty_]mappings */
> };
>
> struct memcg_mapping {
>         struct address_space *mapping;
>         struct memcg_bdi *memcg_bdi;
>         struct rb_node node;
>         atomic_t nr_pages;
>         atomic_t nr_dirty;
> };
>
> struct page_cgroup {
>         ...
>         struct memcg_mapping *memcg_mapping;
> };
>
> - each memcg contains a mapping from bdi to memcg_bdi.
> - each memcg_bdi contains two mappings:
> mappings: from address_space to memcg_mapping for clean pages
> dirty_mappings: from address_space to memcg_mapping when there are
> some dirty pages
> - each memcg_mapping represents a set of cached pages within an
> bdi,inode,memcg. All
> corresponding cached inode pages point to the same memcg_mapping via
> pc->mapping. I assume that all pages of inode belong to no more than one bdi.
> - manage a global list of memcg that are over their respective background dirty
> limit.
> - i_mapping continues to point to a traditional non-memcg mapping (no change
> here).
> - none of these memcg_* structures affect root cgroup or kernels with memcg
> configured out.
>
> The routines under discussion:
> - memcg charging a new inode page to a memcg: will use inode->mapping and inode
> to walk memcg -> memcg_bdi -> mappings and lazily allocating missing
> levels in data structure.
>
> - Uncharging a inode page from a memcg: will use pc->mapping->memcg to locate
> memcg. If refcnt drops to zero, then remove memcg_mapping from the
> memcg_bdi.[dirty_]mappings.
> Also delete memcg_bdi if last memcg_mapping is removed.
>
> - account_page_dirtied(): increment nr_dirty. If first dirty page,
> then move memcg_mapping from memcg_bdi.mappings to
> memcg_bdi.dirty_mappings page counter. When marking page clean, do
> the opposite.
>
> - mem_cgroup_balance_dirty_pages(): if memcg dirty memory usage if above
> background limit, then add memcg to global memcg_over_bg_limit list and use
> memcg's set of memcg_bdi to wakeup each(?) corresponding bdi flusher. If over
> fg limit, then use IO-less style foreground throttling with per-memcg per-bdi
> (aka memcg_bdi) accounting structure.
>
> - bdi writeback: will revert some of the mmotm memcg dirty limit changes to
> fs-writeback.c so that wb_do_writeback() will return to checking
> wb_check_background_flush() to check background limits and being
> interruptible if sync flush occurs. wb_check_background_flush() will
> check the global
> memcg_over_bg_limit list for memcg that are over their dirty limit.
> Within each memcg write inodes from the dirty_mappings list until a
> threshold page count has been reached (MAX_WRITEBACK_PAGES). Then
> move to next listed memcg.
>
> - over_bground_thresh() will determine if memcg is still over bg limit.
> If over limit, then it per bdi per memcg background flushing will continue.
> If not over limit then memcg will be removed from memcg_over_bg_limit list.
>
> I'll post my resulting patches in RFC form, or (at the least) my conclusions.
>
Please take care of force_empty and move_mapping et al. when you do this,
and please do rmdir() tests.


Thanks,
-Kame