2011-02-25 21:38:09

by Greg Thelen

Subject: [PATCH v5 0/9] memcg: per cgroup dirty page accounting

Changes since v4:
- Moved documentation changes to start of series to provide a better
introduction to the series.
- Added support for hierarchical dirty limits.
- Incorporated bug fixes previously found in v4.
- Included a new patch "writeback: convert variables to unsigned" to provide a
clearer transition to the new dirty_info structure (patch "writeback:
create dirty_info structure").
- Within the new dirty_info structure, replaced nr_reclaimable with
nr_file_dirty and nr_unstable_nfs to give callers finer-grained dirty usage
information; also added dirty_info_reclaimable().
- Rebased the series to mmotm-2011-02-10-16-26 with two pending mmotm patches:
memcg: break out event counters from other stats
https://lkml.org/lkml/2011/2/17/415
memcg: use native word page statistics counters
https://lkml.org/lkml/2011/2/17/413

Changes since v3:
- Refactored balance_dirty_pages() dirtying checking to use new struct
dirty_info, which is used to compare both system and memcg dirty limits
against usage.
- Disabled memcg dirty limits when memory.use_hierarchy=1. An enhancement is
needed to check the chain of parents to ensure that no dirty limit is
exceeded.
- Ported to mmotm-2010-10-22-16-36.

Changes since v2:
- Rather than disabling softirq in lock_page_cgroup(), introduce a separate lock
to synchronize between memcg page accounting and migration. This only affects
patch 4 of the series. Patch 4 used to disable softirq; now it introduces the
new lock.

Changes since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
mem_cgroup_has_dirty_limit().
- Fixed lockdep issue in mem_cgroup_read_stat() which is exposed by these
patches.
- Remove redundant comments.
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Renamed newly created proc files:
- memory.dirty_bytes -> memory.dirty_limit_in_bytes
- memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Removed unnecessary get_ prefix from get_xxx() functions.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Disable softirq rather than hardirq in lock_page_cgroup().
- Made mem_cgroup_move_account_page_stat() inline.
- Ported patches to mmotm-2010-10-13-17-13.

This patch set provides the ability for each cgroup to have independent dirty
page limits.

Limiting dirty memory places an upper bound on the amount of dirty (hard to
reclaim) page cache used by a cgroup. So, when multiple cgroups write
concurrently, none can consume more than its designated share of dirty pages
and each is throttled if it crosses that limit.

Example use case:
#!/bin/bash
#
# Here is a test script that shows a situation where memcg dirty limits are
# beneficial.
#
# The script runs two programs:
# 1) a dirty page background antagonist (dd)
# 2) an interactive foreground process (tar).
#
# If the script's argument is false, then both processes are limited by the
# classic global dirty limits. If the script is given a true argument, then a
# per-cgroup dirty limit is used to contain dd dirty page consumption. The
# cgroup isolates the dd dirty memory consumption from the rest of the system
# processes (tar in this case).
#
# The time used by the tar process is printed (lower is better).
#
# When dd is run within a dirty limiting cgroup, the tar process has faster
# and more predictable performance. memcg dirty ratios might be useful to
# serve different task classes (interactive vs batch). A past discussion
# touched on this: http://lkml.org/lkml/2010/5/20/136
#
# When called with 'false' (using memcg without dirty isolation):
# tar finished in 7.0s
# dd reports 92 MB/s
#
# When called with 'true' (using memcg for dirty isolation):
# tar finished in 2.5s
# dd reports 82 MB/s

echo memcg_dirty_limits: $1

# set system dirty limits.
echo $((1<<30)) > /proc/sys/vm/dirty_bytes
echo $((1<<29)) > /proc/sys/vm/dirty_background_bytes

mkdir /dev/cgroup/memory/A

if $1; then # if using cgroup to contain 'dd'...
echo 100M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
fi

# run antagonist (dd) in cgroup A
(echo $BASHPID > /dev/cgroup/memory/A/tasks; \
dd if=/dev/zero of=/disk1/big.file count=10k bs=1M) &

# let antagonist (dd) get warmed up
sleep 10

# time interactive job
time tar -C /disk2 -xzf linux.tar.gz

wait
sleep 10
rmdir /dev/cgroup/memory/A

The patches are based on a series proposed by Andrea Righi in Mar 2010.

Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or NFS
unstable.

- Extend mem_cgroup to record the total number of pages in each of the
interesting dirty states (dirty, writeback, unstable_nfs).

- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to
mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs
control files.

- Consider both system and per-memcg dirty limits in page writeback when
deciding to queue background writeback or throttle dirty memory production.

Known shortcomings (see the patch 1 update to Documentation/cgroups/memory.txt
for more details):
- When a cgroup dirty limit is exceeded, bdi writeback is employed to write
back dirty inodes. Bdi writeback considers inodes from any cgroup, not just
inodes contributing dirty pages to the cgroup exceeding its limit.

- A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
different memcg.

Performance data:
- A page fault microbenchmark workload, which can be run in read or write mode,
was used to measure performance (a runnable C approximation of this pseudocode
appears after the performance table below):
  f = open(foo.$cpu)
  truncate(f, 4096)
  alarm(60)
  while (1) {
        p = mmap(f, 4096)
        if (write)
                *p = 1
        else
                x = *p
        munmap(p)
  }

- The workload was run at several points in the patch series in different
modes:
- s_read is a single threaded reader
- s_write is a single threaded writer
- p_read is a 16 thread reader, each operating on a different file
- p_write is a 16 thread writer, each operating on a different file

- Measurements were collected on a 16 core non-NUMA system using "perf stat
--repeat 3".

- All numbers are page fault rate (M/sec). Higher is better.

- To compare the performance of a kernel without memcg, compare the first and
last rows; neither has memcg configured. The first row does not include any
of these memcg dirty limit patches.

- To compare the performance of using memcg dirty limits, compare the memcg
baseline (2nd row titled "mmotm w/ memcg") with the 3rd row (memcg enabled
with all patches).

                          root_cgroup                      child_cgroup
                 s_read s_write p_read p_write    s_read s_write p_read p_write
mmotm w/o memcg   0.359   0.312  0.357   0.312
mmotm w/ memcg    0.366   0.316  0.342   0.301     0.368   0.309  0.347   0.301
all patches       0.347   0.322  0.327   0.303     0.342   0.323  0.327   0.305
all patches       0.358   0.322  0.357   0.316
  w/o memcg
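
For reference, the pseudocode above could be fleshed out roughly as follows.
This is a minimal sketch of the benchmark, not the exact program used to
produce the numbers above: the "write" command-line switch, the getpid()-based
file naming (standing in for the per-CPU "foo.$cpu" files), and the error
handling are assumptions; only the 4 KiB file size, the 60 second alarm, and
the map/touch/unmap loop come from the pseudocode.

/*
 * Minimal sketch of the page fault microbenchmark (assumptions noted
 * above).  Each pass maps the 4 KiB file, touches one byte, and unmaps
 * it, so every iteration of the loop takes a fresh page fault.
 */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static volatile sig_atomic_t done;

static void on_alarm(int sig)
{
	(void)sig;
	done = 1;
}

int main(int argc, char **argv)
{
	int write_mode = (argc > 1 && !strcmp(argv[1], "write"));
	volatile char x;
	char name[64];
	char *p;
	int fd;

	/* one scratch file per benchmark instance ("foo.$cpu" above) */
	snprintf(name, sizeof(name), "foo.%d", (int)getpid());
	fd = open(name, O_CREAT | O_RDWR, 0600);
	if (fd < 0 || ftruncate(fd, 4096) < 0)
		return 1;

	signal(SIGALRM, on_alarm);
	alarm(60);			/* run for 60 seconds */

	while (!done) {
		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;
		if (write_mode)
			*p = 1;		/* fault the page in and dirty it */
		else
			x = *p;		/* fault the page in without dirtying it */
		munmap(p, 4096);	/* next iteration faults again */
	}
	return 0;
}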

Greg Thelen (9):
memcg: document cgroup dirty memory interfaces
memcg: add page_cgroup flags for dirty page tracking
writeback: convert variables to unsigned
writeback: create dirty_info structure
memcg: add dirty page accounting infrastructure
memcg: add kernel calls for memcg dirty page stats
memcg: add dirty limits to mem_cgroup
memcg: add cgroupfs interface to memcg dirty limits
memcg: check memcg dirty limits in page writeback

Documentation/cgroups/memory.txt | 80 +++++++
fs/fs-writeback.c | 7 +-
fs/nfs/write.c | 4 +
include/linux/memcontrol.h | 33 +++-
include/linux/page_cgroup.h | 23 ++
include/linux/writeback.h | 18 ++-
mm/backing-dev.c | 18 +-
mm/filemap.c | 1 +
mm/memcontrol.c | 470 +++++++++++++++++++++++++++++++++++++-
mm/page-writeback.c | 150 +++++++++----
mm/truncate.c | 1 +
mm/vmscan.c | 2 +-
mm/vmstat.c | 6 +-
13 files changed, 742 insertions(+), 71 deletions(-)

--
1.7.3.1


2011-02-25 21:38:34

by Greg Thelen

Subject: [PATCH v5 1/9] memcg: document cgroup dirty memory interfaces

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v4:
- Minor rewording of '5.5 dirty memory' section.
- Added '5.5.1 Inode writeback issue' section.

Changelog since v3:
- Described interactions with memory.use_hierarchy.
- Added description of total_dirty, total_writeback, and total_nfs_unstable.

Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
- Describe a situation where a cgroup can exceed its dirty limit.

Documentation/cgroups/memory.txt | 80 ++++++++++++++++++++++++++++++++++++++
1 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b6ed61c..4db695e 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -385,6 +385,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
pgpgin - # of pages paged in (equivalent to # of charging events).
pgpgout - # of pages paged out (equivalent to # of uncharging events).
swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
+nfs_unstable - # of bytes sent to the NFS server, but not yet committed to
+ the actual storage.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
@@ -406,6 +410,9 @@ total_mapped_file - sum of all children's "cache"
total_pgpgin - sum of all children's "pgpgin"
total_pgpgout - sum of all children's "pgpgout"
total_swap - sum of all children's "swap"
+total_dirty - sum of all children's "dirty"
+total_writeback - sum of all children's "writeback"
+total_nfs_unstable - sum of all children's "nfs_unstable"
total_inactive_anon - sum of all children's "inactive_anon"
total_active_anon - sum of all children's "active_anon"
total_inactive_file - sum of all children's "inactive_file"
@@ -453,6 +460,79 @@ memory under it will be reclaimed.
You can reset failcnt by writing 0 to failcnt file.
# echo 0 > .../memory.failcnt

+5.5 dirty memory
+
+Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
+page cache used by a cgroup. So, in case of multiple cgroup writers, they will
+not be able to consume more than their designated share of dirty pages and will
+be throttled if they cross that limit. System-wide dirty limits are also
+consulted. Dirty memory consumption is checked against both system-wide and
+per-cgroup dirty limits.
+
+The interface is similar to the procfs interface: /proc/sys/vm/dirty_*. It is
+possible to configure a limit to trigger throttling of a dirtier or queue
+background writeback. The root cgroup memory.dirty_* control files are
+read-only and match the contents of the /proc/sys/vm/dirty_* files.
+
+Per-cgroup dirty limits can be set using the following files in the cgroupfs:
+
+- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
+ cgroup memory) at which a process generating dirty pages will be throttled.
+ The default value is the system-wide dirty ratio, /proc/sys/vm/dirty_ratio.
+
+- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
+ in the cgroup at which a process generating dirty pages will be throttled.
+ Suffix (k, K, m, M, g, or G) can be used to indicate that value is kilo, mega
+ or gigabytes. The default value is the system-wide dirty limit,
+ /proc/sys/vm/dirty_bytes.
+
+ Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
+ Only one may be specified at a time. When one is written it is immediately
+ taken into account to evaluate the dirty memory limits and the other appears
+ as 0 when read.
+
+- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
+ (expressed as a percentage of cgroup memory) at which background writeback
+ kernel threads will start writing out dirty data. The default value is the
+ system-wide background dirty ratio, /proc/sys/vm/dirty_background_ratio.
+
+- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
+ in bytes) in the cgroup at which background writeback kernel threads will
+ start writing out dirty data. Suffix (k, K, m, M, g, or G) can be used to
+ indicate that value is kilo, mega or gigabytes. The default value is the
+ system-wide dirty background limit, /proc/sys/vm/dirty_background_bytes.
+
+ Note: memory.dirty_background_limit_in_bytes is the counterpart of
+ memory.dirty_background_ratio. Only one may be specified at a time. When one
+ is written it is immediately taken into account to evaluate the dirty memory
+ limits and the other appears as 0 when read.
+
+A cgroup may contain more dirty memory than its dirty limit. This is possible
+because of the principle that the first cgroup to touch a page is charged for
+it. Subsequent page counting events (dirty, writeback, nfs_unstable) are also
+counted to the originally charged cgroup. Example: If page is allocated by a
+cgroup A task, then the page is charged to cgroup A. If the page is later
+dirtied by a task in cgroup B, then the cgroup A dirty count will be
+incremented. If cgroup A is over its dirty limit but cgroup B is not, then
+dirtying a cgroup A page from a cgroup B task may push cgroup A over its dirty
+limit without throttling the dirtying cgroup B task.
+
+When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
+When use_hierarchy=1 the dirty limits of parents cgroups are also checked to
+ensure that no dirty limit is exceeded.
+
+5.5.1 Inode writeback issue
+
+When a memcg dirty limit is exceeded, then bdi writeback is employed to
+writeback dirty inodes. Bdi writeback considers inodes from any memcg, not just
+inodes contributing dirty pages to the memcg exceeding its limit. Ideally when
+a memcg dirty limit is exceeded only inodes contributing dirty pages to that
+memcg would be considered for writeback. However, the current implementation
+does not behave this way because there is no way to quickly check the memcgs
+that an inode contributes dirty pages to.
+
6. Hierarchy support

The memory controller supports a deep hierarchy and hierarchical accounting.
--
1.7.3.1

2011-02-25 21:38:49

by Greg Thelen

Subject: [PATCH v5 3/9] writeback: convert variables to unsigned

Convert two balance_dirty_pages() page counter variables (nr_reclaimable
and nr_writeback) from 'long' to 'unsigned long'.

These two variables are used to store results from global_page_state().
global_page_state() returns an unsigned long and carefully sums per-cpu
counters, explicitly avoiding returning a negative value.

Signed-off-by: Greg Thelen <[email protected]>
---
Changelog since v4:
- Created this patch for clarity. Previously this patch was integrated within
the "writeback: create dirty_info structure" patch.

mm/page-writeback.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2cb01f6..4408e54 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -478,8 +478,10 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
{
- long nr_reclaimable, bdi_nr_reclaimable;
- long nr_writeback, bdi_nr_writeback;
+ unsigned long nr_reclaimable;
+ long bdi_nr_reclaimable;
+ unsigned long nr_writeback;
+ long bdi_nr_writeback;
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long bdi_thresh;
--
1.7.3.1

2011-02-25 21:38:54

by Greg Thelen

Subject: [PATCH v5 2/9] memcg: add page_cgroup flags for dirty page tracking

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
---
include/linux/page_cgroup.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index f5de21d..a002ba8 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -10,6 +10,9 @@ enum {
/* flags for mem_cgroup and file and I/O status */
PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is under writeback */
+ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
/* No lock in page_cgroup */
PCG_ACCT_LRU, /* page has been accounted for (under lru_lock) */
__NR_PCG_FLAGS,
@@ -67,6 +70,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }

+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
CLEARPCGFLAG(Cache, CACHE)
@@ -86,6 +93,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)

+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
SETPCGFLAG(Migration, MIGRATION)
CLEARPCGFLAG(Migration, MIGRATION)
TESTPCGFLAG(Migration, MIGRATION)
--
1.7.3.1

2011-02-25 21:39:14

by Greg Thelen

Subject: [PATCH v5 7/9] memcg: add dirty limits to mem_cgroup

Extend mem_cgroup to contain dirty page limits. Also add routines
allowing the kernel to query the dirty usage of a memcg.

These interfaces are not used by the kernel yet. A subsequent commit
will add kernel calls to utilize these new routines.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
Changelog since v4:
- Added support for hierarchical dirty limits.
- Simplified __mem_cgroup_dirty_param().
- Simplified mem_cgroup_page_stat().
- Deleted the mem_cgroup_nr_pages_item enum, which added little value.
Instead, the mem_cgroup_page_stat_item enum values are used to identify
memcg dirty statistics exported to the kernel.
- Fixed overflow issues in mem_cgroup_hierarchical_free_pages().

Changelog since v3:
- Previously memcontrol.c used struct vm_dirty_param and vm_dirty_param() to
advertise dirty memory limits. Now struct dirty_info and
mem_cgroup_dirty_info() is used to share dirty limits between memcontrol and
the rest of the kernel.
- __mem_cgroup_has_dirty_limit() now returns false if use_hierarchy is set.
- memcg_hierarchical_free_pages() now uses parent_mem_cgroup() and is simpler.
- created internal routine, __mem_cgroup_has_dirty_limit(), to consolidate the
logic.

Changelog since v1:
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Removed unnecessary get_ prefix from get_xxx() functions.
- Avoid lockdep warnings by using rcu_read_[un]lock() in
mem_cgroup_has_dirty_limit().

include/linux/memcontrol.h | 28 +++++
mm/memcontrol.c | 269 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 296 insertions(+), 1 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1f70a9..8c00c06 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@

#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
#include <linux/cgroup.h>
struct mem_cgroup;
struct page_cgroup;
@@ -31,6 +32,7 @@ enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+ MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
@@ -145,6 +147,13 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}

+bool mem_cgroup_has_dirty_limit(void);
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info);
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item);
+
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -333,6 +342,25 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
{
}

+static inline bool mem_cgroup_has_dirty_limit(void)
+{
+ return false;
+}
+
+static inline bool
+mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ return false;
+}
+
+static inline unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ return -ENOSYS;
+}
+
static inline
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 38f786b..bc86329 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -198,6 +198,14 @@ struct mem_cgroup_eventfd_list {
static void mem_cgroup_threshold(struct mem_cgroup *mem);
static void mem_cgroup_oom_notify(struct mem_cgroup *mem);

+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
/*
* The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
@@ -237,6 +245,10 @@ struct mem_cgroup {
atomic_t refcnt;

unsigned int swappiness;
+
+ /* control memory cgroup dirty pages */
+ struct vm_dirty_param dirty_param;
+
/* OOM-Killer disable */
int oom_kill_disable;

@@ -1128,6 +1140,254 @@ static unsigned int get_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+/*
+ * Return true if the current memory cgroup has local dirty memory settings.
+ * There is an allowed race between the current task migrating in-to/out-of the
+ * root cgroup while this routine runs. So the return value may be incorrect if
+ * the current task is being simultaneously migrated.
+ */
+static bool __mem_cgroup_has_dirty_limit(struct mem_cgroup *mem)
+{
+ return mem && !mem_cgroup_is_root(mem);
+}
+
+bool mem_cgroup_has_dirty_limit(void)
+{
+ struct mem_cgroup *mem;
+ bool ret;
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ rcu_read_lock();
+ mem = mem_cgroup_from_task(current);
+ ret = __mem_cgroup_has_dirty_limit(mem);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+/*
+ * Returns a snapshot of the current dirty limits which is not synchronized with
+ * the routines that change the dirty limits. If this routine races with an
+ * update to the dirty bytes/ratio value, then the caller must handle the case
+ * where neither dirty_[background_]_ratio nor _bytes are set.
+ */
+static void __mem_cgroup_dirty_param(struct vm_dirty_param *param,
+ struct mem_cgroup *mem)
+{
+ if (__mem_cgroup_has_dirty_limit(mem)) {
+ *param = mem->dirty_param;
+ } else {
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = vm_dirty_bytes;
+ param->dirty_background_ratio = dirty_background_ratio;
+ param->dirty_background_bytes = dirty_background_bytes;
+ }
+}
+
+/* Return dirty thresholds and usage metrics for @memcg. */
+static void mem_cgroup_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ unsigned long available_mem;
+ struct vm_dirty_param dirty_param;
+
+ __mem_cgroup_dirty_param(&dirty_param, memcg);
+
+ if (!dirty_param.dirty_bytes || !dirty_param.dirty_background_bytes)
+ available_mem = min(
+ sys_available_mem,
+ mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES));
+
+ if (dirty_param.dirty_bytes)
+ info->dirty_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_bytes, PAGE_SIZE);
+ else
+ info->dirty_thresh =
+ (dirty_param.dirty_ratio * available_mem) / 100;
+
+ if (dirty_param.dirty_background_bytes)
+ info->background_thresh =
+ DIV_ROUND_UP(dirty_param.dirty_background_bytes,
+ PAGE_SIZE);
+ else
+ info->background_thresh =
+ (dirty_param.dirty_background_ratio *
+ available_mem) / 100;
+
+ info->nr_file_dirty = mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_DIRTY);
+ info->nr_writeback = mem_cgroup_page_stat(memcg,
+ MEMCG_NR_FILE_WRITEBACK);
+ info->nr_unstable_nfs =
+ mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_UNSTABLE_NFS);
+}
+
+/*
+ * Return the dirty thresholds and usage metrics for the memcg (within the
+ * ancestral chain of @memcg) closest to its limit. If @memcg is not set, then
+ * use the current task's memcg.
+ *
+ * The current task may be moved to another cgroup while this routine accesses
+ * the dirty limit. But a precise check is meaningless because the task can be
+ * moved after our access and writeback tends to take long time. At least,
+ * "memcg" will not be freed while holding rcu_read_lock().
+ */
+bool mem_cgroup_hierarchical_dirty_info(unsigned long sys_available_mem,
+ struct mem_cgroup *memcg,
+ struct dirty_info *info)
+{
+ unsigned long usage;
+ struct dirty_info uninitialized_var(cur_info);
+
+ if (mem_cgroup_disabled())
+ return false;
+
+ info->nr_writeback = ULONG_MAX; /* invalid initial value */
+
+ /*
+ * Routine within mem_cgroup_page_stat() need online cpus locked.
+ * get_online_cpus() can sleep so it must be called before
+ * rcu_read_lock().
+ */
+ get_online_cpus();
+ rcu_read_lock();
+ if (!memcg)
+ memcg = mem_cgroup_from_task(current);
+
+ while (__mem_cgroup_has_dirty_limit(memcg)) {
+ mem_cgroup_dirty_info(sys_available_mem, memcg, &cur_info);
+ usage = dirty_info_reclaimable(&cur_info) +
+ cur_info.nr_writeback;
+
+ if (!memcg->use_hierarchy || usage >= cur_info.dirty_thresh) {
+ *info = cur_info;
+ break;
+ }
+
+ /* save dirty stats for memcg closest to its limit */
+ if ((info->nr_writeback == ULONG_MAX) ||
+ (cur_info.dirty_thresh - usage <
+ info->dirty_thresh - dirty_info_reclaimable(info) -
+ info->nr_writeback))
+ *info = cur_info;
+
+ /* continue walking up hierarchy enabled parents */
+ memcg = parent_mem_cgroup(memcg);
+ if (!memcg || !memcg->use_hierarchy)
+ break;
+ }
+
+ rcu_read_unlock();
+ put_online_cpus();
+ return info->nr_writeback != ULONG_MAX;
+}
+
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+ if (!do_swap_account)
+ return nr_swap_pages > 0;
+ return !memcg->memsw_is_minimum &&
+ (res_counter_read_u64(&memcg->memsw, RES_LIMIT) > 0);
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ s64 ret;
+
+ switch (item) {
+ case MEMCG_NR_FILE_DIRTY:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ break;
+ case MEMCG_NR_FILE_WRITEBACK:
+ ret = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ break;
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ ret = mem_cgroup_read_stat(mem,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ break;
+ case MEMCG_NR_DIRTYABLE_PAGES:
+ ret = mem_cgroup_read_stat(mem, LRU_ACTIVE_FILE) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_FILE);
+ if (mem_cgroup_can_swap(mem))
+ ret += mem_cgroup_read_stat(mem, LRU_ACTIVE_ANON) +
+ mem_cgroup_read_stat(mem, LRU_INACTIVE_ANON);
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+/*
+ * Return the number of additional pages that the @mem cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *mem)
+{
+ u64 free;
+ unsigned long min_free;
+
+ min_free = global_page_state(NR_FREE_PAGES);
+
+ while (mem) {
+ free = (res_counter_read_u64(&mem->res, RES_LIMIT) -
+ res_counter_read_u64(&mem->res, RES_USAGE)) >>
+ PAGE_SHIFT;
+ min_free = min((u64)min_free, free);
+ mem = parent_mem_cgroup(mem);
+ }
+
+ return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @mem: memory cgroup to query
+ * @item: memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *mem,
+ enum mem_cgroup_page_stat_item item)
+{
+ struct mem_cgroup *iter;
+ s64 value;
+
+ VM_BUG_ON(!mem);
+ VM_BUG_ON(mem_cgroup_is_root(mem));
+
+ /*
+ * If we're looking for dirtyable pages we need to evaluate free pages
+ * depending on the limit and usage of the parents first of all.
+ */
+ if (item == MEMCG_NR_DIRTYABLE_PAGES)
+ value = mem_cgroup_hierarchical_free_pages(mem);
+ else
+ value = 0;
+
+ /*
+ * Recursively evaluate page statistics against all cgroup under
+ * hierarchy tree
+ */
+ for_each_mem_cgroup_tree(iter, mem)
+ value += mem_cgroup_local_page_stat(iter, item);
+
+ /*
+ * Summing of unlocked per-cpu counters is racy and may yield a slightly
+ * negative value. Zero is the only sensible value in such cases.
+ */
+ if (unlikely(value < 0))
+ value = 0;
+
+ return value;
+}
+
static void mem_cgroup_start_move(struct mem_cgroup *mem)
{
int cpu;
@@ -4578,8 +4838,15 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
mem->last_scanned_child = 0;
INIT_LIST_HEAD(&mem->oom_notify);

- if (parent)
+ if (parent) {
mem->swappiness = get_swappiness(parent);
+ __mem_cgroup_dirty_param(&mem->dirty_param, parent);
+ } else {
+ /*
+ * The root cgroup dirty_param field is not used, instead,
+ * system-wide dirty limits are used.
+ */
+ }
atomic_set(&mem->refcnt, 1);
mem->move_charge_at_immigrate = 0;
mutex_init(&mem->thresholds_lock);
--
1.7.3.1

2011-02-25 21:39:25

by Greg Thelen

Subject: [PATCH v5 4/9] writeback: create dirty_info structure

Bundle dirty limits and dirty memory usage metrics into a dirty_info
structure to simplify the interfaces of routines that need all of them.

Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v4:
- Within the new dirty_info structure, replaced nr_reclaimable with nr_file_dirty
and nr_unstable_nfs to give callers finer-grained dirty usage information.
- Added new dirty_info_reclaimable() function.
- Made more use of dirty_info structure in throttle_vm_writeout().

Changelog since v3:
- This is a new patch in v4.

fs/fs-writeback.c | 7 ++---
include/linux/writeback.h | 15 ++++++++++++-
mm/backing-dev.c | 18 +++++++++------
mm/page-writeback.c | 50 ++++++++++++++++++++++----------------------
mm/vmstat.c | 6 +++-
5 files changed, 57 insertions(+), 39 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59c6e49..d75e4da 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -595,12 +595,11 @@ static void __writeback_inodes_sb(struct super_block *sb,

static inline bool over_bground_thresh(void)
{
- unsigned long background_thresh, dirty_thresh;
+ struct dirty_info info;

- global_dirty_limits(&background_thresh, &dirty_thresh);
+ global_dirty_info(&info);

- return (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh);
+ return dirty_info_reclaimable(&info) > info.background_thresh;
}

/*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0ead399..a06fb38 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -84,6 +84,19 @@ static inline void inode_sync_wait(struct inode *inode)
/*
* mm/page-writeback.c
*/
+struct dirty_info {
+ unsigned long dirty_thresh;
+ unsigned long background_thresh;
+ unsigned long nr_file_dirty;
+ unsigned long nr_writeback;
+ unsigned long nr_unstable_nfs;
+};
+
+static inline unsigned long dirty_info_reclaimable(struct dirty_info *info)
+{
+ return info->nr_file_dirty + info->nr_unstable_nfs;
+}
+
#ifdef CONFIG_BLOCK
void laptop_io_completion(struct backing_dev_info *info);
void laptop_sync_completion(void);
@@ -124,7 +137,7 @@ struct ctl_table;
int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);

-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
+void global_dirty_info(struct dirty_info *info);
unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
unsigned long dirty);

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 027100d..17a06ab 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -66,8 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
{
struct backing_dev_info *bdi = m->private;
struct bdi_writeback *wb = &bdi->wb;
- unsigned long background_thresh;
- unsigned long dirty_thresh;
+ struct dirty_info dirty_info;
unsigned long bdi_thresh;
unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
struct inode *inode;
@@ -82,8 +81,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_more_io++;
spin_unlock(&inode_lock);

- global_dirty_limits(&background_thresh, &dirty_thresh);
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ global_dirty_info(&dirty_info);
+ bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);

#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
@@ -99,9 +98,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"state: %8lx\n",
(unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
(unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
- K(bdi_thresh), K(dirty_thresh),
- K(background_thresh), nr_dirty, nr_io, nr_more_io,
- !list_empty(&bdi->bdi_list), bdi->state);
+ K(bdi_thresh),
+ K(dirty_info.dirty_thresh),
+ K(dirty_info.background_thresh),
+ nr_dirty,
+ nr_io,
+ nr_more_io,
+ !list_empty(&bdi->bdi_list),
+ bdi->state);
#undef K

return 0;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4408e54..00424b9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -398,7 +398,7 @@ unsigned long determine_dirtyable_memory(void)
}

/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ * global_dirty_info - return dirty thresholds and usage metrics
*
* Calculate the dirty thresholds based on sysctl parameters
* - vm.dirty_background_ratio or vm.dirty_background_bytes
@@ -406,7 +406,7 @@ unsigned long determine_dirtyable_memory(void)
* The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
* real-time tasks.
*/
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+void global_dirty_info(struct dirty_info *info)
{
unsigned long background;
unsigned long dirty;
@@ -426,6 +426,10 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
else
background = (dirty_background_ratio * available_memory) / 100;

+ info->nr_file_dirty = global_page_state(NR_FILE_DIRTY);
+ info->nr_writeback = global_page_state(NR_WRITEBACK);
+ info->nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS);
+
if (background >= dirty)
background = dirty / 2;
tsk = current;
@@ -433,8 +437,8 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
background += background / 4;
dirty += dirty / 4;
}
- *pbackground = background;
- *pdirty = dirty;
+ info->background_thresh = background;
+ info->dirty_thresh = dirty;
}

/*
@@ -478,12 +482,9 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
{
- unsigned long nr_reclaimable;
+ struct dirty_info sys_info;
long bdi_nr_reclaimable;
- unsigned long nr_writeback;
long bdi_nr_writeback;
- unsigned long background_thresh;
- unsigned long dirty_thresh;
unsigned long bdi_thresh;
unsigned long pages_written = 0;
unsigned long pause = 1;
@@ -498,22 +499,19 @@ static void balance_dirty_pages(struct address_space *mapping,
.range_cyclic = 1,
};

- nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS);
- nr_writeback = global_page_state(NR_WRITEBACK);
-
- global_dirty_limits(&background_thresh, &dirty_thresh);
+ global_dirty_info(&sys_info);

/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
* when the bdi limits are ramping up.
*/
- if (nr_reclaimable + nr_writeback <=
- (background_thresh + dirty_thresh) / 2)
+ if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
+ (sys_info.background_thresh +
+ sys_info.dirty_thresh) / 2)
break;

- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
bdi_thresh = task_dirty_limit(current, bdi_thresh);

/*
@@ -542,7 +540,8 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
- || (nr_reclaimable + nr_writeback > dirty_thresh);
+ || (dirty_info_reclaimable(&sys_info) +
+ sys_info.nr_writeback > sys_info.dirty_thresh);

if (!dirty_exceeded)
break;
@@ -595,7 +594,8 @@ static void balance_dirty_pages(struct address_space *mapping,
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (nr_reclaimable > background_thresh)))
+ (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
+ sys_info.background_thresh)))
bdi_start_background_writeback(bdi);
}

@@ -655,21 +655,21 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

void throttle_vm_writeout(gfp_t gfp_mask)
{
- unsigned long background_thresh;
- unsigned long dirty_thresh;
+ struct dirty_info sys_info;

for ( ; ; ) {
- global_dirty_limits(&background_thresh, &dirty_thresh);
+ global_dirty_info(&sys_info);

/*
* Boost the allowable dirty threshold a bit for page
* allocators so they don't get DoS'ed by heavy writers
*/
- dirty_thresh += dirty_thresh / 10; /* wheeee... */
+ sys_info.dirty_thresh +=
+ sys_info.dirty_thresh / 10; /* wheeee... */

- if (global_page_state(NR_UNSTABLE_NFS) +
- global_page_state(NR_WRITEBACK) <= dirty_thresh)
- break;
+ if (sys_info.nr_unstable_nfs +
+ sys_info.nr_writeback <= sys_info.dirty_thresh)
+ break;
congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0c3b504..ec95924 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1044,6 +1044,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
{
unsigned long *v;
int i, stat_items_size;
+ struct dirty_info dirty_info;

if (*pos >= ARRAY_SIZE(vmstat_text))
return NULL;
@@ -1062,8 +1063,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
v[i] = global_page_state(i);
v += NR_VM_ZONE_STAT_ITEMS;

- global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
- v + NR_DIRTY_THRESHOLD);
+ global_dirty_info(&dirty_info);
+ v[NR_DIRTY_BG_THRESHOLD] = dirty_info.background_thresh;
+ v[NR_DIRTY_THRESHOLD] = dirty_info.dirty_thresh;
v += NR_VM_WRITEBACK_STAT_ITEMS;

#ifdef CONFIG_VM_EVENT_COUNTERS
--
1.7.3.1

2011-02-25 21:39:12

by Greg Thelen

Subject: [PATCH v5 5/9] memcg: add dirty page accounting infrastructure

Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Daisuke Nishimura <[email protected]>
---
Changelog since v1:
- Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
memory.stat to match /proc/meminfo.
- Rename (for clarity):
- mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
- mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
- Remove redundant comments.
- Made mem_cgroup_move_account_page_stat() inline.

include/linux/memcontrol.h | 5 ++-
mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 83 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3da48ae..e1f70a9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,9 +25,12 @@ struct page_cgroup;
struct page;
struct mm_struct;

-/* Stats that can be updated by kernel. */
+/* mem_cgroup page counts accessed by kernel. */
enum mem_cgroup_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};

extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1c2704a..38f786b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -92,8 +92,11 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
MEM_CGROUP_STAT_NSTATS,
@@ -1622,6 +1625,44 @@ void mem_cgroup_update_page_stat(struct page *page,
ClearPageCgroupFileMapped(pc);
idx = MEM_CGROUP_STAT_FILE_MAPPED;
break;
+
+ case MEMCG_NR_FILE_DIRTY:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileDirty(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileDirty(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_DIRTY;
+ break;
+
+ case MEMCG_NR_FILE_WRITEBACK:
+ /*
+ * This counter is adjusted while holding the mapping's
+ * tree_lock. Therefore there is no race between settings and
+ * clearing of this flag.
+ */
+ if (val > 0)
+ SetPageCgroupFileWriteback(pc);
+ else
+ ClearPageCgroupFileWriteback(pc);
+ idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+ break;
+
+ case MEMCG_NR_FILE_UNSTABLE_NFS:
+ /* Use Test{Set,Clear} to only un/charge the memcg once. */
+ if (val > 0) {
+ if (TestSetPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ } else {
+ if (!TestClearPageCgroupFileUnstableNFS(pc))
+ val = 0;
+ }
+ idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+ break;
+
default:
BUG();
}
@@ -2181,6 +2222,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
}
#endif

+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2229,13 +2281,18 @@ static int mem_cgroup_move_account(struct page *page,

move_lock_page_cgroup(pc, &flags);

- if (PageCgroupFileMapped(pc)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
- }
+ if (PageCgroupFileMapped(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+ if (PageCgroupFileDirty(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+ if (PageCgroupFileWriteback(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+ if (PageCgroupFileUnstableNFS(pc))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
if (uncharge)
/* This is not "cancel", but cancel_charge does all we need. */
@@ -3681,6 +3738,9 @@ enum {
MCS_PGPGIN,
MCS_PGPGOUT,
MCS_SWAP,
+ MCS_FILE_DIRTY,
+ MCS_WRITEBACK,
+ MCS_UNSTABLE_NFS,
MCS_INACTIVE_ANON,
MCS_ACTIVE_ANON,
MCS_INACTIVE_FILE,
@@ -3703,6 +3763,9 @@ struct {
{"pgpgin", "total_pgpgin"},
{"pgpgout", "total_pgpgout"},
{"swap", "total_swap"},
+ {"dirty", "total_dirty"},
+ {"writeback", "total_writeback"},
+ {"nfs_unstable", "total_nfs_unstable"},
{"inactive_anon", "total_inactive_anon"},
{"active_anon", "total_active_anon"},
{"inactive_file", "total_inactive_file"},
@@ -3732,6 +3795,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
s->stat[MCS_SWAP] += val * PAGE_SIZE;
}

+ /* dirty stat */
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
+ s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
+ s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+ val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+ s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
/* per zone stat */
val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
--
1.7.3.1

2011-02-25 21:39:51

by Greg Thelen

Subject: [PATCH v5 6/9] memcg: add kernel calls for memcg dirty page stats

Add calls into memcg dirty page accounting. Notify memcg when pages
transition between clean, file dirty, writeback, and unstable NFS.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.

Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Andrea Righi <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
---
fs/nfs/write.c | 4 ++++
mm/filemap.c | 1 +
mm/page-writeback.c | 4 ++++
mm/truncate.c | 1 +
4 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c8278f4..118935b 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -462,6 +463,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;

if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1317,6 +1319,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
req = nfs_list_entry(head->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 7b137b4..e234b8d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -143,6 +143,7 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 00424b9..8d61cfa 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1119,6 +1119,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1308,6 +1309,7 @@ int clear_page_dirty_for_io(struct page *page)
* for more comments.
*/
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -1338,6 +1340,7 @@ int test_clear_page_writeback(struct page *page)
__dec_bdi_stat(bdi, BDI_WRITEBACK);
__bdi_writeout_inc(bdi);
}
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
} else {
@@ -1365,6 +1368,7 @@ int test_set_page_writeback(struct page *page)
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index 94be6c0..75ad82a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
--
1.7.3.1

2011-02-25 21:39:54

by Greg Thelen

Subject: [PATCH v5 9/9] memcg: check memcg dirty limits in page writeback

If the current process is in a non-root memcg, then
balance_dirty_pages() will consider the memcg dirty limits as well as
the system-wide limits. This allows different cgroups to have distinct
dirty limits which trigger direct and background writeback at different
levels.

If called with a mem_cgroup, then throttle_vm_writeout() should query
the given cgroup for its dirty memory usage limits.

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: Wu Fengguang <[email protected]>
---
Changelog since v4:
- Added missing 'struct mem_cgroup' forward declaration in writeback.h.
- Made throttle_vm_writeout() memcg aware.
- Removed previously added dirty_writeback_pages() which is no longer needed.
- Added logic to balance_dirty_pages() to throttle if over foreground memcg
limit.

Changelog since v3:
- Leave determine_dirtyable_memory() static. v3 made it non-static.
- balance_dirty_pages() now considers both system and memcg dirty limits and
usage data. This data is retrieved with global_dirty_info() and
memcg_dirty_info().

include/linux/writeback.h | 3 +-
mm/page-writeback.c | 112 +++++++++++++++++++++++++++++++++------------
mm/vmscan.c | 2 +-
3 files changed, 86 insertions(+), 31 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a06fb38..e4688d6 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -8,6 +8,7 @@
#include <linux/fs.h>

struct backing_dev_info;
+struct mem_cgroup;

extern spinlock_t inode_lock;

@@ -105,7 +106,7 @@ void laptop_mode_timer_fn(unsigned long data);
#else
static inline void laptop_sync_completion(void) { }
#endif
-void throttle_vm_writeout(gfp_t gfp_mask);
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup);

/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 8d61cfa..5557e0c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -398,47 +398,72 @@ unsigned long determine_dirtyable_memory(void)
}

/*
+ * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * real-time tasks.
+ */
+static inline void adjust_dirty_info(struct dirty_info *info)
+{
+ struct task_struct *tsk;
+
+ if (info->background_thresh >= info->dirty_thresh)
+ info->background_thresh = info->dirty_thresh / 2;
+ tsk = current;
+ if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
+ info->background_thresh += info->background_thresh / 4;
+ info->dirty_thresh += info->dirty_thresh / 4;
+ }
+}
+
+/*
* global_dirty_info - return dirty thresholds and usage metrics
*
* Calculate the dirty thresholds based on sysctl parameters
* - vm.dirty_background_ratio or vm.dirty_background_bytes
* - vm.dirty_ratio or vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
- * real-time tasks.
*/
void global_dirty_info(struct dirty_info *info)
{
- unsigned long background;
- unsigned long dirty;
unsigned long uninitialized_var(available_memory);
- struct task_struct *tsk;

if (!vm_dirty_bytes || !dirty_background_bytes)
available_memory = determine_dirtyable_memory();

if (vm_dirty_bytes)
- dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+ info->dirty_thresh = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
else
- dirty = (vm_dirty_ratio * available_memory) / 100;
+ info->dirty_thresh = (vm_dirty_ratio * available_memory) / 100;

if (dirty_background_bytes)
- background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+ info->background_thresh =
+ DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
else
- background = (dirty_background_ratio * available_memory) / 100;
+ info->background_thresh =
+ (dirty_background_ratio * available_memory) / 100;

info->nr_file_dirty = global_page_state(NR_FILE_DIRTY);
info->nr_writeback = global_page_state(NR_WRITEBACK);
info->nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS);

- if (background >= dirty)
- background = dirty / 2;
- tsk = current;
- if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- background += background / 4;
- dirty += dirty / 4;
- }
- info->background_thresh = background;
- info->dirty_thresh = dirty;
+ adjust_dirty_info(info);
+}
+
+/*
+ * Calculate the background-writeback and dirty-throttling thresholds and dirty
+ * usage metrics from the current task's memcg dirty limit parameters. Returns
+ * false if no memcg limits exist.
+ *
+ * @memcg may be NULL if the current task's memcg should be used.
+ * @info is the location where the dirty information is written.
+ */
+static bool memcg_dirty_info(struct mem_cgroup *memcg, struct dirty_info *info)
+{
+ unsigned long available_memory = determine_dirtyable_memory();
+
+ if (!mem_cgroup_hierarchical_dirty_info(available_memory, memcg, info))
+ return false;
+
+ adjust_dirty_info(info);
+ return true;
}

/*
@@ -477,12 +502,14 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
* data. It looks at the number of dirty pages in the machine and will force
* the caller to perform writeback if the system is over `vm_dirty_ratio'.
* If we're over `background_thresh' then the writeback threads are woken to
- * perform some writeout.
+ * perform some writeout. The current task may have per-memcg dirty
+ * limits, which are also checked.
*/
static void balance_dirty_pages(struct address_space *mapping,
unsigned long write_chunk)
{
struct dirty_info sys_info;
+ struct dirty_info memcg_info;
long bdi_nr_reclaimable;
long bdi_nr_writeback;
unsigned long bdi_thresh;
@@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
};

global_dirty_info(&sys_info);
+ if (!memcg_dirty_info(NULL, &memcg_info))
+ memcg_info = sys_info;

/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
* when the bdi limits are ramping up.
*/
- if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
+ if ((dirty_info_reclaimable(&sys_info) +
+ sys_info.nr_writeback <=
(sys_info.background_thresh +
- sys_info.dirty_thresh) / 2)
+ sys_info.dirty_thresh) / 2) &&
+ (dirty_info_reclaimable(&memcg_info) +
+ memcg_info.nr_writeback <=
+ (memcg_info.background_thresh +
+ memcg_info.dirty_thresh) / 2))
break;

- bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
+ bdi_thresh = bdi_dirty_limit(bdi,
+ min(sys_info.dirty_thresh,
+ memcg_info.dirty_thresh));
bdi_thresh = task_dirty_limit(current, bdi_thresh);

/*
@@ -541,7 +577,9 @@ static void balance_dirty_pages(struct address_space *mapping,
dirty_exceeded =
(bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
|| (dirty_info_reclaimable(&sys_info) +
- sys_info.nr_writeback > sys_info.dirty_thresh);
+ sys_info.nr_writeback > sys_info.dirty_thresh)
+ || (dirty_info_reclaimable(&memcg_info) +
+ memcg_info.nr_writeback > memcg_info.dirty_thresh);

if (!dirty_exceeded)
break;
@@ -559,7 +597,9 @@ static void balance_dirty_pages(struct address_space *mapping,
* up.
*/
trace_wbc_balance_dirty_start(&wbc, bdi);
- if (bdi_nr_reclaimable > bdi_thresh) {
+ if ((bdi_nr_reclaimable > bdi_thresh) ||
+ (dirty_info_reclaimable(&memcg_info) >
+ memcg_info.dirty_thresh)) {
writeback_inodes_wb(&bdi->wb, &wbc);
pages_written += write_chunk - wbc.nr_to_write;
trace_wbc_balance_dirty_written(&wbc, bdi);
@@ -594,8 +634,10 @@ static void balance_dirty_pages(struct address_space *mapping,
* background_thresh, to keep the amount of dirty memory low.
*/
if ((laptop_mode && pages_written) ||
- (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
- sys_info.background_thresh)))
+ (!laptop_mode && ((dirty_info_reclaimable(&sys_info) >
+ sys_info.background_thresh) ||
+ (dirty_info_reclaimable(&memcg_info) >
+ memcg_info.background_thresh))))
bdi_start_background_writeback(bdi);
}

@@ -653,12 +695,20 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);

-void throttle_vm_writeout(gfp_t gfp_mask)
+/*
+ * Throttle the current task if it is near dirty memory usage limits.
+ * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
+ * information; otherwise use the per-memcg dirty limits.
+ */
+void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
{
struct dirty_info sys_info;
+ struct dirty_info memcg_info;

for ( ; ; ) {
global_dirty_info(&sys_info);
+ if (!memcg_dirty_info(mem_cgroup, &memcg_info))
+ memcg_info = sys_info;

/*
* Boost the allowable dirty threshold a bit for page
@@ -666,9 +716,13 @@ void throttle_vm_writeout(gfp_t gfp_mask)
*/
sys_info.dirty_thresh +=
sys_info.dirty_thresh / 10; /* wheeee... */
+ memcg_info.dirty_thresh +=
+ memcg_info.dirty_thresh / 10; /* wheeee... */

- if (sys_info.nr_unstable_nfs +
- sys_info.nr_writeback <= sys_info.dirty_thresh)
+ if ((sys_info.nr_unstable_nfs +
+ sys_info.nr_writeback <= sys_info.dirty_thresh) &&
+ (memcg_info.nr_unstable_nfs +
+ memcg_info.nr_writeback <= memcg_info.dirty_thresh))
break;
congestion_wait(BLK_RW_ASYNC, HZ/10);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ba11e28..f723242 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1927,7 +1927,7 @@ restart:
sc->nr_scanned - nr_scanned, sc))
goto restart;

- throttle_vm_writeout(sc->gfp_mask);
+ throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
}

/*
--
1.7.3.1

2011-02-25 21:40:16

by Greg Thelen

[permalink] [raw]
Subject: [PATCH v5 8/9] memcg: add cgroupfs interface to memcg dirty limits

Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_limit_in_bytes

Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_limit_in_bytes

Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g'
and 'G' suffixes for byte counts. This patch provides the
same functionality for memory.dirty_limit_in_bytes and
memory.dirty_background_limit_in_bytes.
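
For illustration, here is a usage sketch in the style of the cover letter's test
script (the /dev/cgroup/memory mount point and group A are assumptions borrowed
from that script; the values shown are what the read handlers would report):

# assumes the memory cgroup is mounted at /dev/cgroup/memory, use_hierarchy=0
mkdir /dev/cgroup/memory/A
echo 200M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
echo 100M > /dev/cgroup/memory/A/memory.dirty_background_limit_in_bytes
cat /dev/cgroup/memory/A/memory.dirty_limit_in_bytes            # 209715200
cat /dev/cgroup/memory/A/memory.dirty_background_limit_in_bytes # 104857600

As with the global vm.dirty_* sysctls, writing a *_limit_in_bytes file clears
the corresponding ratio (and vice versa), so only one of each pair is in effect
at a time.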

Signed-off-by: Andrea Righi <[email protected]>
Signed-off-by: Balbir Singh <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Acked-by: KAMEZAWA Hiroyuki <[email protected]>
---
Changelog since v3:
- Make use of new routine, __mem_cgroup_has_dirty_limit(), to disable memcg
dirty limits when use_hierarchy=1.

Changelog since v1:
- Renamed newly created proc files:
- memory.dirty_bytes -> memory.dirty_limit_in_bytes
- memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
- Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.

mm/memcontrol.c | 114 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bc86329..6c8885b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -109,6 +109,13 @@ enum mem_cgroup_events_index {
MEM_CGROUP_EVENTS_NSTATS,
};

+enum {
+ MEM_CGROUP_DIRTY_RATIO,
+ MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+};
+
struct mem_cgroup_stat_cpu {
long count[MEM_CGROUP_STAT_NSTATS];
unsigned long events[MEM_CGROUP_EVENTS_NSTATS];
@@ -4518,6 +4525,89 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ bool use_sys = !__mem_cgroup_has_dirty_limit(mem);
+
+ switch (cft->private) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ return use_sys ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ return use_sys ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ return use_sys ? dirty_background_ratio :
+ mem->dirty_param.dirty_background_ratio;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ return use_sys ? dirty_background_bytes :
+ mem->dirty_param.dirty_background_bytes;
+ default:
+ BUG();
+ }
+}
+
+static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (!__mem_cgroup_has_dirty_limit(memcg))
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_LIMIT_IN_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+
+ if (!__mem_cgroup_has_dirty_limit(memcg))
+ return -EINVAL;
+ if ((type == MEM_CGROUP_DIRTY_RATIO ||
+ type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+ return -EINVAL;
+ switch (type) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ memcg->dirty_param.dirty_ratio = val;
+ memcg->dirty_param.dirty_bytes = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ memcg->dirty_param.dirty_background_ratio = val;
+ memcg->dirty_param.dirty_background_bytes = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return 0;
+}
+
static struct cftype mem_cgroup_files[] = {
{
.name = "usage_in_bytes",
@@ -4581,6 +4671,30 @@ static struct cftype mem_cgroup_files[] = {
.unregister_event = mem_cgroup_oom_unregister_event,
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
+ {
+ .name = "dirty_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_RATIO,
+ },
+ {
+ .name = "dirty_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_LIMIT_IN_BYTES,
+ },
+ {
+ .name = "dirty_background_ratio",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_u64 = mem_cgroup_dirty_write,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ },
+ {
+ .name = "dirty_background_limit_in_bytes",
+ .read_u64 = mem_cgroup_dirty_read,
+ .write_string = mem_cgroup_dirty_write_string,
+ .private = MEM_CGROUP_DIRTY_BACKGROUND_LIMIT_IN_BYTES,
+ },
};

#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
--
1.7.3.1

2011-02-26 04:17:17

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v5 0/9] memcg: per cgroup dirty page accounting

Greg Thelen <[email protected]> writes:

> Changes since v4:
> - Moved documentation changes to start of series to provide a better
> introduction to the series.
> - Added support for hierarchical dirty limits.
> - Incorporated bug fixes previously found in v4.
> - Include a new patch "writeback: convert variables to unsigned" to provide a
> clearer transition to the the new dirty_info structure (patch "writeback:
> create dirty_info structure").
> - Within the new dirty_info structure, replaced nr_reclaimable with
> nr_file_dirty and nr_unstable_nfs to give callers finer grain dirty usage
> information also added dirty_info_reclaimable().
> - Rebased the series to mmotm-2011-02-10-16-26 with two pending mmotm patches:
> memcg: break out event counters from other stats
> https://lkml.org/lkml/2011/2/17/415
> memcg: use native word page statistics counters
> https://lkml.org/lkml/2011/2/17/413

For those who prefer a monolithic patch, which applies to
mmotm-2011-02-10-16-26:
http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v5-on-mmotm-2011-02-10-16-26.patch

The monolithic patch includes this v5 series and the two memcg prerequisite
patches mentioned above.
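
For instance, a hypothetical sequence to apply it (assuming an unpacked
mmotm-2011-02-10-16-26 tree as the current directory):

wget http://www.kernel.org/pub/linux/kernel/people/gthelen/memcg/memcg-dirty-limits-v5-on-mmotm-2011-02-10-16-26.patch
patch -p1 < memcg-dirty-limits-v5-on-mmotm-2011-02-10-16-26.patch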

> Changes since v3:
> - Refactored balance_dirty_pages() dirtying checking to use new struct
> dirty_info, which is used to compare both system and memcg dirty limits
> against usage.
> - Disabled memcg dirty limits when memory.use_hierarchy=1. An enhancement is
> needed to check the chain of parents to ensure that no dirty limit is
> exceeded.
> - Ported to mmotm-2010-10-22-16-36.
>
> Changes since v2:
> - Rather than disabling softirq in lock_page_cgroup(), introduce a separate lock
> to synchronize between memcg page accounting and migration. This only affects
> patch 4 of the series. Patch 4 used to disable softirq, now it introduces the
> new lock.
>
> Changes since v1:
> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
> memory.stat to match /proc/meminfo.
> - Avoid lockdep warnings by using rcu_read_[un]lock() in
> mem_cgroup_has_dirty_limit().
> - Fixed lockdep issue in mem_cgroup_read_stat() which is exposed by these
> patches.
> - Remove redundant comments.
> - Rename (for clarity):
> - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
> - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
> - Renamed newly created proc files:
> - memory.dirty_bytes -> memory.dirty_limit_in_bytes
> - memory.dirty_background_bytes -> memory.dirty_background_limit_in_bytes
> - Removed unnecessary get_ prefix from get_xxx() functions.
> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
> - Disable softirq rather than hardirq in lock_page_cgroup()
> - Made mem_cgroup_move_account_page_stat() inline.
> - Ported patches to mmotm-2010-10-13-17-13.
>
> This patch set provides the ability for each cgroup to have independent dirty
> page limits.
>
> Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
> page cache used by a cgroup. So, in case of multiple cgroup writers, they will
> not be able to consume more than their designated share of dirty pages and will
> be throttled if they cross that limit.
>
> Example use case:
> #!/bin/bash
> #
> # Here is a test script that shows a situation where memcg dirty limits are
> # beneficial.
> #
> # The script runs two programs:
> # 1) a dirty page background antagonist (dd)
> # 2) an interactive foreground process (tar).
> #
> # If the script's argument is false, then both processes are limited by the
> # classic global dirty limits. If the script is given a true argument, then a
> # per-cgroup dirty limit is used to contain dd dirty page consumption. The
> # cgroup isolates the dd dirty memory consumption from the rest of the system
> # processes (tar in this case).
> #
> # The time used by the tar process is printed (lower is better).
> #
> # When dd is run within a dirty limiting cgroup, the tar process had faster
> # and more predictable performance. memcg dirty ratios might be useful to
> # serve different task classes (interactive vs batch). A past discussion
> # touched on this: http://lkml.org/lkml/2010/5/20/136
> #
> # When called with 'false' (using memcg without dirty isolation):
> # tar finished in 7.0s
> # dd reports 92 MB/s
> #
> # When called with 'true' (using memcg for dirty isolation):
> # tar finished in 2.5s
> # dd reports 82 MB/s
>
> echo memcg_dirty_limits: $1
>
> # set system dirty limits.
> echo $((1<<30)) > /proc/sys/vm/dirty_bytes
> echo $((1<<29)) > /proc/sys/vm/dirty_background_bytes
>
> mkdir /dev/cgroup/memory/A
>
> if $1; then # if using cgroup to contain 'dd'...
> echo 100M > /dev/cgroup/memory/A/memory.dirty_limit_in_bytes
> fi
>
> # run antagonist (dd) in cgroup A
> (echo $BASHPID > /dev/cgroup/memory/A/tasks; \
> dd if=/dev/zero of=/disk1/big.file count=10k bs=1M) &
>
> # let antagonist (dd) get warmed up
> sleep 10
>
> # time interactive job
> time tar -C /disk2 -xzf linux.tar.gz
>
> wait
> sleep 10
> rmdir /dev/cgroup/memory/A
>
> The patches are based on a series proposed by Andrea Righi in Mar 2010.
>
> Overview:
> - Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
> unstable.
>
> - Extend mem_cgroup to record the total number of pages in each of the
> interesting dirty states (dirty, writeback, unstable_nfs).
>
> - Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_* limits to
> mem_cgroup. The mem_cgroup dirty parameters are accessible via cgroupfs
> control files.
>
> - Consider both system and per-memcg dirty limits in page writeback when
> deciding to queue background writeback or throttle dirty memory production.
>
> Known shortcomings (see the patch 1 update to Documentation/cgroups/memory.txt
> for more details):
> - When a cgroup dirty limit is exceeded, then bdi writeback is employed to
> writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not
> just inodes contributing dirty pages to the cgroup exceeding its limit.
>
> - A cgroup may exceed its dirty limit if the memory is dirtied by a process in a
> different memcg.
>
> Performance data:
> - A page fault microbenchmark workload was used to measure performance, which
> can be called in read or write mode:
> f = open(foo. $cpu)
> truncate(f, 4096)
> alarm(60)
> while (1) {
> p = mmap(f, 4096)
> if (write)
> *p = 1
> else
> x = *p
> munmap(p)
> }
>
> - The workload was called for several points in the patch series in different
> modes:
> - s_read is a single threaded reader
> - s_write is a single threaded writer
> - p_read is a 16 thread reader, each operating on a different file
> - p_write is a 16 thread writer, each operating on a different file
>
> - Measurements were collected on a 16 core non-numa system using "perf stat
> --repeat 3".
>
> - All numbers are page fault rate (M/sec). Higher is better.
>
> - To compare the performance of a kernel without memcg compare the first and
> last rows - neither has memcg configured. The first row does not include any
> of these memcg dirty limit patches.
>
> - To compare the performance of using memcg dirty limits, compare the memcg
> baseline (2nd row titled "mmotm w/ memcg") with the 3rd row (memcg enabled
> with all patches).
>
> root_cgroup child_cgroup
> s_read s_write p_read p_write s_read s_write p_read p_write
> mmotm w/o memcg 0.359 0.312 0.357 0.312
> mmotm w/ memcg 0.366 0.316 0.342 0.301 0.368 0.309 0.347 0.301
> all patches 0.347 0.322 0.327 0.303 0.342 0.323 0.327 0.305
> all patches 0.358 0.322 0.357 0.316
> w/o memcg
>
> Greg Thelen (9):
> memcg: document cgroup dirty memory interfaces
> memcg: add page_cgroup flags for dirty page tracking
> writeback: convert variables to unsigned
> writeback: create dirty_info structure
> memcg: add dirty page accounting infrastructure
> memcg: add kernel calls for memcg dirty page stats
> memcg: add dirty limits to mem_cgroup
> memcg: add cgroupfs interface to memcg dirty limits
> memcg: check memcg dirty limits in page writeback
>
> Documentation/cgroups/memory.txt | 80 +++++++
> fs/fs-writeback.c | 7 +-
> fs/nfs/write.c | 4 +
> include/linux/memcontrol.h | 33 +++-
> include/linux/page_cgroup.h | 23 ++
> include/linux/writeback.h | 18 ++-
> mm/backing-dev.c | 18 +-
> mm/filemap.c | 1 +
> mm/memcontrol.c | 470 +++++++++++++++++++++++++++++++++++++-
> mm/page-writeback.c | 150 +++++++++----
> mm/truncate.c | 1 +
> mm/vmscan.c | 2 +-
> mm/vmstat.c | 6 +-
> 13 files changed, 742 insertions(+), 71 deletions(-)

2011-02-27 15:56:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 2/9] memcg: add page_cgroup flags for dirty page tracking

On Fri, Feb 25, 2011 at 01:35:53PM -0800, Greg Thelen wrote:
> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: Daisuke Nishimura <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2011-02-27 16:07:33

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] writeback: convert variables to unsigned

On Fri, Feb 25, 2011 at 01:35:54PM -0800, Greg Thelen wrote:
> Convert two balance_dirty_pages() page counter variables (nr_reclaimable
> and nr_writeback) from 'long' to 'unsigned long'.
>
> These two variables are used to store results from global_page_state().
> global_page_state() returns unsigned long and carefully sums per-cpu
> counters explicitly avoiding returning a negative value.
>
> Signed-off-by: Greg Thelen <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

> ---
> Changelog since v4:
> - Created this patch for clarity. Previously this patch was integrated within
> the "writeback: create dirty_info structure" patch.
>
> mm/page-writeback.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 2cb01f6..4408e54 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -478,8 +478,10 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> - long nr_reclaimable, bdi_nr_reclaimable;
> - long nr_writeback, bdi_nr_writeback;
> + unsigned long nr_reclaimable;
> + long bdi_nr_reclaimable;
> + unsigned long nr_writeback;
> + long bdi_nr_writeback;
> unsigned long background_thresh;
> unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> --
> 1.7.3.1
>
bdi_nr_[reclaimable|writeback] can return negative value?
When I just look through bdi_stat_sum, it uses *percpu_counter_sum_positive*.
So I guess it always returns positive value.
If it is right, could you change it, too?

--
Kind regards,
Minchan Kim

2011-02-27 16:38:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] writeback: create dirty_info structure

On Fri, Feb 25, 2011 at 01:35:55PM -0800, Greg Thelen wrote:
> Bundle dirty limits and dirty memory usage metrics into a dirty_info
> structure to simplify interfaces of routines that need all.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

> ---
> Changelog since v4:
> - Within new dirty_info structure, replaced nr_reclaimable with nr_file_dirty
> and nr_unstable_nfs to give callers finer grain dirty usage information.
> - Added new dirty_info_reclaimable() function.
> - Made more use of dirty_info structure in throttle_vm_writeout().
>
> Changelog since v3:
> - This is a new patch in v4.
>
> fs/fs-writeback.c | 7 ++---
> include/linux/writeback.h | 15 ++++++++++++-
> mm/backing-dev.c | 18 +++++++++------
> mm/page-writeback.c | 50 ++++++++++++++++++++++----------------------
> mm/vmstat.c | 6 +++-
> 5 files changed, 57 insertions(+), 39 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 59c6e49..d75e4da 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -595,12 +595,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
>
> static inline bool over_bground_thresh(void)
> {
> - unsigned long background_thresh, dirty_thresh;
> + struct dirty_info info;
>
> - global_dirty_limits(&background_thresh, &dirty_thresh);
> + global_dirty_info(&info);
>
> - return (global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_UNSTABLE_NFS) > background_thresh);
> + return dirty_info_reclaimable(&info) > info.background_thresh;
> }

This now fetches nr_writeback unnecessarily.

>
> /*
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 0ead399..a06fb38 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -84,6 +84,19 @@ static inline void inode_sync_wait(struct inode *inode)
> /*
> * mm/page-writeback.c
> */
> +struct dirty_info {
> + unsigned long dirty_thresh;
> + unsigned long background_thresh;
> + unsigned long nr_file_dirty;
> + unsigned long nr_writeback;
> + unsigned long nr_unstable_nfs;
> +};
> +
> +static inline unsigned long dirty_info_reclaimable(struct dirty_info *info)
> +{
> + return info->nr_file_dirty + info->nr_unstable_nfs;
> +}
> +
> #ifdef CONFIG_BLOCK
> void laptop_io_completion(struct backing_dev_info *info);
> void laptop_sync_completion(void);
> @@ -124,7 +137,7 @@ struct ctl_table;
> int dirty_writeback_centisecs_handler(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
>
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
> +void global_dirty_info(struct dirty_info *info);
> unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> unsigned long dirty);
>
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 027100d..17a06ab 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -66,8 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> {
> struct backing_dev_info *bdi = m->private;
> struct bdi_writeback *wb = &bdi->wb;
> - unsigned long background_thresh;
> - unsigned long dirty_thresh;
> + struct dirty_info dirty_info;
> unsigned long bdi_thresh;
> unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
> struct inode *inode;
> @@ -82,8 +81,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> nr_more_io++;
> spin_unlock(&inode_lock);
>
> - global_dirty_limits(&background_thresh, &dirty_thresh);
> - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> + global_dirty_info(&dirty_info);
> + bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);
>
> #define K(x) ((x) << (PAGE_SHIFT - 10))
> seq_printf(m,
> @@ -99,9 +98,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> "state: %8lx\n",
> (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
> - K(bdi_thresh), K(dirty_thresh),
> - K(background_thresh), nr_dirty, nr_io, nr_more_io,
> - !list_empty(&bdi->bdi_list), bdi->state);
> + K(bdi_thresh),
> + K(dirty_info.dirty_thresh),
> + K(dirty_info.background_thresh),

This now fetches nr_file_dirty, nr_writeback, and nr_unstable_nfs unnecessarily.

> + nr_dirty,
> + nr_io,
> + nr_more_io,
> + !list_empty(&bdi->bdi_list),
> + bdi->state);
> #undef K
>
> return 0;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4408e54..00424b9 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -398,7 +398,7 @@ unsigned long determine_dirtyable_memory(void)
> }
>
> /*
> - * global_dirty_limits - background-writeback and dirty-throttling thresholds
> + * global_dirty_info - return dirty thresholds and usage metrics
> *
> * Calculate the dirty thresholds based on sysctl parameters
> * - vm.dirty_background_ratio or vm.dirty_background_bytes
> @@ -406,7 +406,7 @@ unsigned long determine_dirtyable_memory(void)
> * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
> * real-time tasks.
> */
> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> +void global_dirty_info(struct dirty_info *info)
> {
> unsigned long background;
> unsigned long dirty;
> @@ -426,6 +426,10 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> else
> background = (dirty_background_ratio * available_memory) / 100;
>
> + info->nr_file_dirty = global_page_state(NR_FILE_DIRTY);
> + info->nr_writeback = global_page_state(NR_WRITEBACK);
> + info->nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS);
> +
> if (background >= dirty)
> background = dirty / 2;
> tsk = current;
> @@ -433,8 +437,8 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
> background += background / 4;
> dirty += dirty / 4;
> }
> - *pbackground = background;
> - *pdirty = dirty;
> + info->background_thresh = background;
> + info->dirty_thresh = dirty;
> }
>
> /*
> @@ -478,12 +482,9 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> - unsigned long nr_reclaimable;
> + struct dirty_info sys_info;
> long bdi_nr_reclaimable;
> - unsigned long nr_writeback;
> long bdi_nr_writeback;
> - unsigned long background_thresh;
> - unsigned long dirty_thresh;
> unsigned long bdi_thresh;
> unsigned long pages_written = 0;
> unsigned long pause = 1;
> @@ -498,22 +499,19 @@ static void balance_dirty_pages(struct address_space *mapping,
> .range_cyclic = 1,
> };
>
> - nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> - global_page_state(NR_UNSTABLE_NFS);
> - nr_writeback = global_page_state(NR_WRITEBACK);
> -
> - global_dirty_limits(&background_thresh, &dirty_thresh);
> + global_dirty_info(&sys_info);
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (nr_reclaimable + nr_writeback <=
> - (background_thresh + dirty_thresh) / 2)
> + if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
> + (sys_info.background_thresh +
> + sys_info.dirty_thresh) / 2)
> break;
>
> - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> + bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
> bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
> /*
> @@ -542,7 +540,8 @@ static void balance_dirty_pages(struct address_space *mapping,
> */
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> - || (nr_reclaimable + nr_writeback > dirty_thresh);
> + || (dirty_info_reclaimable(&sys_info) +
> + sys_info.nr_writeback > sys_info.dirty_thresh);
>
> if (!dirty_exceeded)
> break;
> @@ -595,7 +594,8 @@ static void balance_dirty_pages(struct address_space *mapping,
> * background_thresh, to keep the amount of dirty memory low.
> */
> if ((laptop_mode && pages_written) ||
> - (!laptop_mode && (nr_reclaimable > background_thresh)))
> + (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
> + sys_info.background_thresh)))
> bdi_start_background_writeback(bdi);
> }
>
> @@ -655,21 +655,21 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> void throttle_vm_writeout(gfp_t gfp_mask)
> {
> - unsigned long background_thresh;
> - unsigned long dirty_thresh;
> + struct dirty_info sys_info;
>
> for ( ; ; ) {
> - global_dirty_limits(&background_thresh, &dirty_thresh);
> + global_dirty_info(&sys_info);

This now fetches nr_file_dirty unnecessarily.

>
> /*
> * Boost the allowable dirty threshold a bit for page
> * allocators so they don't get DoS'ed by heavy writers
> */
> - dirty_thresh += dirty_thresh / 10; /* wheeee... */
> + sys_info.dirty_thresh +=
> + sys_info.dirty_thresh / 10; /* wheeee... */
>
> - if (global_page_state(NR_UNSTABLE_NFS) +
> - global_page_state(NR_WRITEBACK) <= dirty_thresh)
> - break;
> + if (sys_info.nr_unstable_nfs +
> + sys_info.nr_writeback <= sys_info.dirty_thresh)
> + break;
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> /*
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 0c3b504..ec95924 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1044,6 +1044,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> {
> unsigned long *v;
> int i, stat_items_size;
> + struct dirty_info dirty_info;
>
> if (*pos >= ARRAY_SIZE(vmstat_text))
> return NULL;
> @@ -1062,8 +1063,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> v[i] = global_page_state(i);
> v += NR_VM_ZONE_STAT_ITEMS;
>
> - global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
> - v + NR_DIRTY_THRESHOLD);
> + global_dirty_info(&dirty_info);
> + v[NR_DIRTY_BG_THRESHOLD] = dirty_info.background_thresh;
> + v[NR_DIRTY_THRESHOLD] = dirty_info.dirty_thresh;
> v += NR_VM_WRITEBACK_STAT_ITEMS;

This now fetches nr_file_dirty, nr_writeback, and nr_unstable_nfs unnecessarily.

The code itself doesn't have a problem, but although it makes the code simpler,
it sometimes gathers unnecessary information in those contexts. global_page_state()
is never a cheap operation, and we have been trying to reduce overhead in
page-writeback (16c4042f, e50e3720).

Fortunately this patch doesn't increase balance_dirty_pages()'s overhead, and the
places affected by this patch are not fast paths.
So I think it isn't a problem.

--
Kind regards,
Minchan Kim

2011-02-27 16:47:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 5/9] memcg: add dirty page accounting infrastructure

On Fri, Feb 25, 2011 at 01:35:56PM -0800, Greg Thelen wrote:
> Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.
> A later change adds kernel calls to these new routines.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Daisuke Nishimura <[email protected]>
> ---
> Changelog since v1:
> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
> memory.stat to match /proc/meminfo.
> - Rename (for clarity):
> - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
> - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
> - Remove redundant comments.
> - Made mem_cgroup_move_account_page_stat() inline.
>
> include/linux/memcontrol.h | 5 ++-
> mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 83 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3da48ae..e1f70a9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -25,9 +25,12 @@ struct page_cgroup;
> struct page;
> struct mm_struct;
>
> -/* Stats that can be updated by kernel. */
> +/* mem_cgroup page counts accessed by kernel. */

I'm confused by 'kernel' and 'accessed' here.
What would page counts accessed by user space be?
I don't like such words.

Please clarify the comment:
'Stats of pages that can be tracked by memcg' or whatever.

> enum mem_cgroup_page_stat_item {
> MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> + MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> };
>
> extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1c2704a..38f786b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -92,8 +92,11 @@ enum mem_cgroup_stat_index {
> */
> MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
> + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
> MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
> MEM_CGROUP_STAT_NSTATS,
> @@ -1622,6 +1625,44 @@ void mem_cgroup_update_page_stat(struct page *page,
> ClearPageCgroupFileMapped(pc);
> idx = MEM_CGROUP_STAT_FILE_MAPPED;
> break;
> +
> + case MEMCG_NR_FILE_DIRTY:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileDirty(pc))
> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileDirty(pc))
> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_DIRTY;
> + break;
> +
> + case MEMCG_NR_FILE_WRITEBACK:
> + /*
> + * This counter is adjusted while holding the mapping's
> + * tree_lock. Therefore there is no race between settings and
> + * clearing of this flag.
> + */
> + if (val > 0)
> + SetPageCgroupFileWriteback(pc);
> + else
> + ClearPageCgroupFileWriteback(pc);
> + idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> + break;
> +
> + case MEMCG_NR_FILE_UNSTABLE_NFS:
> + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> + if (val > 0) {
> + if (TestSetPageCgroupFileUnstableNFS(pc))
> + val = 0;
> + } else {
> + if (!TestClearPageCgroupFileUnstableNFS(pc))
> + val = 0;
> + }
> + idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> + break;

This part can be simplified by some macro work.
But it's another issue.

> +
> default:
> BUG();
> }
> @@ -2181,6 +2222,17 @@ void mem_cgroup_split_huge_fixup(struct page *head, struct page *tail)
> }
> #endif
>
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> + struct mem_cgroup *to,
> + enum mem_cgroup_stat_index idx)
> +{
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
> /**
> * mem_cgroup_move_account - move account of the page
> * @page: the page
> @@ -2229,13 +2281,18 @@ static int mem_cgroup_move_account(struct page *page,
>
> move_lock_page_cgroup(pc, &flags);
>
> - if (PageCgroupFileMapped(pc)) {
> - /* Update mapped_file data for mem_cgroup */
> - preempt_disable();
> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - preempt_enable();
> - }
> + if (PageCgroupFileMapped(pc))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_MAPPED);
> + if (PageCgroupFileDirty(pc))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_DIRTY);
> + if (PageCgroupFileWriteback(pc))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_WRITEBACK);
> + if (PageCgroupFileUnstableNFS(pc))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
> if (uncharge)
> /* This is not "cancel", but cancel_charge does all we need. */
> @@ -3681,6 +3738,9 @@ enum {
> MCS_PGPGIN,
> MCS_PGPGOUT,
> MCS_SWAP,
> + MCS_FILE_DIRTY,
> + MCS_WRITEBACK,
> + MCS_UNSTABLE_NFS,
> MCS_INACTIVE_ANON,
> MCS_ACTIVE_ANON,
> MCS_INACTIVE_FILE,
> @@ -3703,6 +3763,9 @@ struct {
> {"pgpgin", "total_pgpgin"},
> {"pgpgout", "total_pgpgout"},
> {"swap", "total_swap"},
> + {"dirty", "total_dirty"},
> + {"writeback", "total_writeback"},
> + {"nfs_unstable", "total_nfs_unstable"},
> {"inactive_anon", "total_inactive_anon"},
> {"active_anon", "total_active_anon"},
> {"inactive_file", "total_inactive_file"},
> @@ -3732,6 +3795,14 @@ mem_cgroup_get_local_stat(struct mem_cgroup *mem, struct mcs_total_stat *s)
> s->stat[MCS_SWAP] += val * PAGE_SIZE;
> }
>
> + /* dirty stat */
> + val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_DIRTY);
> + s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
> + val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_WRITEBACK);
> + s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
> + val = mem_cgroup_read_stat(mem, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
> + s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
> +
> /* per zone stat */
> val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
> s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
> --
> 1.7.3.1
>

--
Kind regards,
Minchan Kim

2011-02-27 17:01:56

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 6/9] memcg: add kernel calls for memcg dirty page stats

On Fri, Feb 25, 2011 at 01:35:57PM -0800, Greg Thelen wrote:
> Add calls into memcg dirty page accounting. Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.
> This allows the memory controller to maintain an accurate view of
> the amount of its memory that is dirty.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Daisuke Nishimura <[email protected]>
> ---
> fs/nfs/write.c | 4 ++++
> mm/filemap.c | 1 +
> mm/page-writeback.c | 4 ++++
> mm/truncate.c | 1 +
> 4 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c8278f4..118935b 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -451,6 +451,7 @@ nfs_mark_request_commit(struct nfs_page *req)
> NFS_PAGE_TAG_COMMIT);
> nfsi->ncommit++;
> spin_unlock(&inode->i_lock);
> + mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
> inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> __mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -462,6 +463,7 @@ nfs_clear_request_commit(struct nfs_page *req)
> struct page *page = req->wb_page;
>
> if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
> dec_zone_page_state(page, NR_UNSTABLE_NFS);
> dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
> return 1;
> @@ -1317,6 +1319,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
> req = nfs_list_entry(head->next);
> nfs_list_remove_request(req);
> nfs_mark_request_commit(req);
> + mem_cgroup_dec_page_stat(req->wb_page,
> + MEMCG_NR_FILE_UNSTABLE_NFS);
> dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 7b137b4..e234b8d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -143,6 +143,7 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 00424b9..8d61cfa 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1119,6 +1119,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -1308,6 +1309,7 @@ int clear_page_dirty_for_io(struct page *page)
> * for more comments.
> */
> if (TestClearPageDirty(page)) {
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> @@ -1338,6 +1340,7 @@ int test_clear_page_writeback(struct page *page)
> __dec_bdi_stat(bdi, BDI_WRITEBACK);
> __bdi_writeout_inc(bdi);
> }
> + mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
> }
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> } else {
> @@ -1365,6 +1368,7 @@ int test_set_page_writeback(struct page *page)
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> __inc_bdi_stat(bdi, BDI_WRITEBACK);
> + mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);

Question:
Why should we care of BDI_CAP_NO_WRITEBACK?


--
Kind regards,
Minchan Kim

2011-02-28 02:33:56

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 1/9] memcg: document cgroup dirty memory interfaces

On Fri, 25 Feb 2011 13:35:52 -0800
Greg Thelen <[email protected]> wrote:

> Document cgroup dirty memory interfaces and statistics.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>

very nice. Thank you.

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-02-28 02:35:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] writeback: convert variables to unsigned

On Fri, 25 Feb 2011 13:35:54 -0800
Greg Thelen <[email protected]> wrote:

> Convert two balance_dirty_pages() page counter variables (nr_reclaimable
> and nr_writeback) from 'long' to 'unsigned long'.
>
> These two variables are used to store results from global_page_state().
> global_page_state() returns unsigned long and carefully sums per-cpu
> counters explicitly avoiding returning a negative value.
>
> Signed-off-by: Greg Thelen <[email protected]>

Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

2011-02-28 02:40:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 5/9] memcg: add dirty page accounting infrastructure

On Mon, 28 Feb 2011 01:47:30 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, Feb 25, 2011 at 01:35:56PM -0800, Greg Thelen wrote:
> > Add memcg routines to track dirty, writeback, and unstable_NFS pages.
> > These routines are not yet used by the kernel to count such pages.
> > A later change adds kernel calls to these new routines.
> >
> > Signed-off-by: Greg Thelen <[email protected]>
> > Signed-off-by: Andrea Righi <[email protected]>
> > Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> > Acked-by: Daisuke Nishimura <[email protected]>
> > ---
> > Changelog since v1:
> > - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
> > memory.stat to match /proc/meminfo.
> > - Rename (for clarity):
> > - mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item
> > - mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item
> > - Remove redundant comments.
> > - Made mem_cgroup_move_account_page_stat() inline.
> >
> > include/linux/memcontrol.h | 5 ++-
> > mm/memcontrol.c | 87 ++++++++++++++++++++++++++++++++++++++++----
> > 2 files changed, 83 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3da48ae..e1f70a9 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -25,9 +25,12 @@ struct page_cgroup;
> > struct page;
> > struct mm_struct;
> >
> > -/* Stats that can be updated by kernel. */
> > +/* mem_cgroup page counts accessed by kernel. */
>
> I'm confused by 'kernel' and 'accessed' here.
> What would page counts accessed by user space be?
> I don't like such words.
>
> Please clarify the comment:
> 'Stats of pages that can be tracked by memcg' or whatever.
>
Ah, yes. That's better.




> > enum mem_cgroup_page_stat_item {
> > MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
> > + MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
> > + MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
> > + MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> > };
> >
> > extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1c2704a..38f786b 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -92,8 +92,11 @@ enum mem_cgroup_stat_index {
> > */
> > MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
> > MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> > - MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> > MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
> > + MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> > + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> > + MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
> > + MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
> > MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
> > MEM_CGROUP_ON_MOVE, /* someone is moving account between groups */
> > MEM_CGROUP_STAT_NSTATS,
> > @@ -1622,6 +1625,44 @@ void mem_cgroup_update_page_stat(struct page *page,
> > ClearPageCgroupFileMapped(pc);
> > idx = MEM_CGROUP_STAT_FILE_MAPPED;
> > break;
> > +
> > + case MEMCG_NR_FILE_DIRTY:
> > + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> > + if (val > 0) {
> > + if (TestSetPageCgroupFileDirty(pc))
> > + val = 0;
> > + } else {
> > + if (!TestClearPageCgroupFileDirty(pc))
> > + val = 0;
> > + }
> > + idx = MEM_CGROUP_STAT_FILE_DIRTY;
> > + break;
> > +
> > + case MEMCG_NR_FILE_WRITEBACK:
> > + /*
> > + * This counter is adjusted while holding the mapping's
> > + * tree_lock. Therefore there is no race between settings and
> > + * clearing of this flag.
> > + */
> > + if (val > 0)
> > + SetPageCgroupFileWriteback(pc);
> > + else
> > + ClearPageCgroupFileWriteback(pc);
> > + idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
> > + break;
> > +
> > + case MEMCG_NR_FILE_UNSTABLE_NFS:
> > + /* Use Test{Set,Clear} to only un/charge the memcg once. */
> > + if (val > 0) {
> > + if (TestSetPageCgroupFileUnstableNFS(pc))
> > + val = 0;
> > + } else {
> > + if (!TestClearPageCgroupFileUnstableNFS(pc))
> > + val = 0;
> > + }
> > + idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
> > + break;
>
> This part can be simplified by some macro work.
> But it's another issue.
>
Agreed, doing it in another patch is better.

Thanks,
-Kame

2011-02-28 02:46:47

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 6/9] memcg: add kernel calls for memcg dirty page stats

On Mon, 28 Feb 2011 02:01:43 +0900
Minchan Kim <[email protected]> wrote:

> On Fri, Feb 25, 2011 at 01:35:57PM -0800, Greg Thelen wrote:
> > spin_unlock_irqrestore(&mapping->tree_lock, flags);
> > } else {
> > @@ -1365,6 +1368,7 @@ int test_set_page_writeback(struct page *page)
> > PAGECACHE_TAG_WRITEBACK);
> > if (bdi_cap_account_writeback(bdi))
> > __inc_bdi_stat(bdi, BDI_WRITEBACK);
> > + mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
>
> Question:
> Why should we care of BDI_CAP_NO_WRITEBACK?
>
Hmm, should we do ..
==
if (!ret) {
account_page_writeback(page);
+ mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
}
==

Thanks,
-Kame

2011-02-28 02:53:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH v5 7/9] memcg: add dirty limits to mem_cgroup

On Fri, 25 Feb 2011 13:35:58 -0800
Greg Thelen <[email protected]> wrote:

> Extend mem_cgroup to contain dirty page limits. Also add routines
> allowing the kernel to query the dirty usage of a memcg.
>
> These interfaces not used by the kernel yet. A subsequent commit
> will add kernel calls to utilize these new routines.
>
> Signed-off-by: Greg Thelen <[email protected]>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KAMEZAWA Hiroyuki <[email protected]>

2011-02-28 23:53:18

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] writeback: convert variables to unsigned

On Sun, Feb 27, 2011 at 8:07 AM, Minchan Kim <[email protected]> wrote:
> On Fri, Feb 25, 2011 at 01:35:54PM -0800, Greg Thelen wrote:
>> Convert two balance_dirty_pages() page counter variables (nr_reclaimable
>> and nr_writeback) from 'long' to 'unsigned long'.
>>
>> These two variables are used to store results from global_page_state().
>> global_page_state() returns unsigned long and carefully sums per-cpu
>> counters explicitly avoiding returning a negative value.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
>> ---
>> Changelog since v4:
>> - Created this patch for clarity.  Previously this patch was integrated within
>>   the "writeback: create dirty_info structure" patch.
>>
>>  mm/page-writeback.c |    6 ++++--
>>  1 files changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 2cb01f6..4408e54 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -478,8 +478,10 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
>>  static void balance_dirty_pages(struct address_space *mapping,
>>                                  unsigned long write_chunk)
>>  {
>> -     long nr_reclaimable, bdi_nr_reclaimable;
>> -     long nr_writeback, bdi_nr_writeback;
>> +     unsigned long nr_reclaimable;
>> +     long bdi_nr_reclaimable;
>> +     unsigned long nr_writeback;
>> +     long bdi_nr_writeback;
>>       unsigned long background_thresh;
>>       unsigned long dirty_thresh;
>>       unsigned long bdi_thresh;
>> --
>> 1.7.3.1
>>
> bdi_nr_[reclaimable|writeback] can return negative value?
> When I just look through bdi_stat_sum, it uses *percpu_counter_sum_positive*.
> So I guess it always returns positive value.
> If it is right, could you change it, too?

Yes, I think we can also change bdi_nr_[reclaimable|writeback] to unsigned long.

bdi_stat_sum() and bdi_stat() both call percpu_counter_sum_positive(),
which return a positive number. bdi_stat[_sum]() return s64. Should
we also change bdi_stat[_sum]() to return unsigned long rather than
s64? I would like the return value type to match the type of the
corresponding local variables in balance_dirty_pages(). All current
callers appear to expect bdi_stat[_sum]() to return unsigned long.

> --
> Kind regards,
> Minchan Kim
>

2011-03-01 04:45:13

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 9/9] memcg: check memcg dirty limits in page writeback

On Fri, Feb 25, 2011 at 01:36:00PM -0800, Greg Thelen wrote:
> If the current process is in a non-root memcg, then
> balance_dirty_pages() will consider the memcg dirty limits as well as
> the system-wide limits. This allows different cgroups to have distinct
> dirty limits which trigger direct and background writeback at different
> levels.
>
> If called with a mem_cgroup, then throttle_vm_writeout() should query
> the given cgroup for its dirty memory usage limits.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Acked-by: Wu Fengguang <[email protected]>
> ---

<snip>

> /*
> @@ -477,12 +502,14 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> * data. It looks at the number of dirty pages in the machine and will force
> * the caller to perform writeback if the system is over `vm_dirty_ratio'.
> * If we're over `background_thresh' then the writeback threads are woken to
> - * perform some writeout.
> + * perform some writeout. The current task may have per-memcg dirty
> + * limits, which are also checked.
> */
> static void balance_dirty_pages(struct address_space *mapping,
> unsigned long write_chunk)
> {
> struct dirty_info sys_info;
> + struct dirty_info memcg_info;
> long bdi_nr_reclaimable;
> long bdi_nr_writeback;
> unsigned long bdi_thresh;
> @@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
> };
>
> global_dirty_info(&sys_info);
> + if (!memcg_dirty_info(NULL, &memcg_info))
> + memcg_info = sys_info;


Sigh.

I don't like the double check in the no-memcg configuration or root-memcg case.
"Double check" means 1) the sys_info check and 2) the memcg_info check, but they
are the same, so the second check is redundant.
In addition, we always need the same logic between global and memcg.
It adds binary bloat.

>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
> + if ((dirty_info_reclaimable(&sys_info) +
> + sys_info.nr_writeback <=
> (sys_info.background_thresh +
> - sys_info.dirty_thresh) / 2)
> + sys_info.dirty_thresh) / 2) &&
> + (dirty_info_reclaimable(&memcg_info) +
> + memcg_info.nr_writeback <=
> + (memcg_info.background_thresh +
> + memcg_info.dirty_thresh) / 2))
> break;
>
> - bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
> + bdi_thresh = bdi_dirty_limit(bdi,
> + min(sys_info.dirty_thresh,
> + memcg_info.dirty_thresh));
> bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
> /*
> @@ -541,7 +577,9 @@ static void balance_dirty_pages(struct address_space *mapping,
> dirty_exceeded =
> (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
> || (dirty_info_reclaimable(&sys_info) +
> - sys_info.nr_writeback > sys_info.dirty_thresh);
> + sys_info.nr_writeback > sys_info.dirty_thresh)
> + || (dirty_info_reclaimable(&memcg_info) +
> + memcg_info.nr_writeback > memcg_info.dirty_thresh);
>
> if (!dirty_exceeded)
> break;
> @@ -559,7 +597,9 @@ static void balance_dirty_pages(struct address_space *mapping,
> * up.
> */
> trace_wbc_balance_dirty_start(&wbc, bdi);
> - if (bdi_nr_reclaimable > bdi_thresh) {
> + if ((bdi_nr_reclaimable > bdi_thresh) ||
> + (dirty_info_reclaimable(&memcg_info) >
> + memcg_info.dirty_thresh)) {

Why does memcg need this check?
I guess bdi_thresh is f(x) with x = dirty_thresh, and dirty_thresh is already the memcg's value.
So isn't that enough?
I don't know the logic well, but from the comment it looks like this at least changes the behavior.
Please write down why you need it, and if you do need it, please update the comment, too.

"
* Only move pages to writeback if this bdi is over its
* threshold otherwise wait until the disk writes catch
* up.
"

> writeback_inodes_wb(&bdi->wb, &wbc);
> pages_written += write_chunk - wbc.nr_to_write;
> trace_wbc_balance_dirty_written(&wbc, bdi);
> @@ -594,8 +634,10 @@ static void balance_dirty_pages(struct address_space *mapping,
> * background_thresh, to keep the amount of dirty memory low.
> */
> if ((laptop_mode && pages_written) ||
> - (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
> - sys_info.background_thresh)))
> + (!laptop_mode && ((dirty_info_reclaimable(&sys_info) >
> + sys_info.background_thresh) ||
> + (dirty_info_reclaimable(&memcg_info) >
> + memcg_info.background_thresh))))
> bdi_start_background_writeback(bdi);
> }
>
> @@ -653,12 +695,20 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>
> -void throttle_vm_writeout(gfp_t gfp_mask)
> +/*
> + * Throttle the current task if it is near dirty memory usage limits.
> + * If @mem_cgroup is NULL or the root_cgroup, then use global dirty memory
> + * information; otherwise use the per-memcg dirty limits.
> + */
> +void throttle_vm_writeout(gfp_t gfp_mask, struct mem_cgroup *mem_cgroup)
> {
> struct dirty_info sys_info;
> + struct dirty_info memcg_info;
>
> for ( ; ; ) {
> global_dirty_info(&sys_info);
> + if (!memcg_dirty_info(mem_cgroup, &memcg_info))
> + memcg_info = sys_info;
>
> /*
> * Boost the allowable dirty threshold a bit for page
> @@ -666,9 +716,13 @@ void throttle_vm_writeout(gfp_t gfp_mask)
> */
> sys_info.dirty_thresh +=
> sys_info.dirty_thresh / 10; /* wheeee... */
> + memcg_info.dirty_thresh +=
> + memcg_info.dirty_thresh / 10; /* wheeee... */
>
> - if (sys_info.nr_unstable_nfs +
> - sys_info.nr_writeback <= sys_info.dirty_thresh)
> + if ((sys_info.nr_unstable_nfs +
> + sys_info.nr_writeback <= sys_info.dirty_thresh) &&
> + (memcg_info.nr_unstable_nfs +
> + memcg_info.nr_writeback <= memcg_info.dirty_thresh))
> break;
> congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ba11e28..f723242 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1927,7 +1927,7 @@ restart:
> sc->nr_scanned - nr_scanned, sc))
> goto restart;
>
> - throttle_vm_writeout(sc->gfp_mask);
> + throttle_vm_writeout(sc->gfp_mask, sc->mem_cgroup);
> }
>
> /*
> --
> 1.7.3.1
>

--
Kind regards,
Minchan Kim

2011-03-01 04:50:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v5 3/9] writeback: convert variables to unsigned

On Mon, Feb 28, 2011 at 03:52:53PM -0800, Greg Thelen wrote:
> On Sun, Feb 27, 2011 at 8:07 AM, Minchan Kim <[email protected]> wrote:
> > On Fri, Feb 25, 2011 at 01:35:54PM -0800, Greg Thelen wrote:
> >> Convert two balance_dirty_pages() page counter variables (nr_reclaimable
> >> and nr_writeback) from 'long' to 'unsigned long'.
> >>
> >> These two variables are used to store results from global_page_state().
> >> global_page_state() returns unsigned long and carefully sums per-cpu
> >> counters explicitly avoiding returning a negative value.
> >>
> >> Signed-off-by: Greg Thelen <[email protected]>
> > Reviewed-by: Minchan Kim <[email protected]>
> >
> >> ---
> >> Changelog since v4:
> >> - Created this patch for clarity.  Previously this patch was integrated within
> >>   the "writeback: create dirty_info structure" patch.
> >>
> >>  mm/page-writeback.c |    6 ++++--
> >>  1 files changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 2cb01f6..4408e54 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -478,8 +478,10 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> >>  static void balance_dirty_pages(struct address_space *mapping,
> >>                                  unsigned long write_chunk)
> >>  {
> >> -     long nr_reclaimable, bdi_nr_reclaimable;
> >> -     long nr_writeback, bdi_nr_writeback;
> >> +     unsigned long nr_reclaimable;
> >> +     long bdi_nr_reclaimable;
> >> +     unsigned long nr_writeback;
> >> +     long bdi_nr_writeback;
> >>       unsigned long background_thresh;
> >>       unsigned long dirty_thresh;
> >>       unsigned long bdi_thresh;
> >> --
> >> 1.7.3.1
> >>
> > Can bdi_nr_[reclaimable|writeback] actually be negative?
> > Looking through bdi_stat_sum(), it uses *percpu_counter_sum_positive*,
> > so I guess it always returns a non-negative value.
> > If that's right, could you change those to unsigned as well?
>
> Yes, I think we can also change bdi_nr_[reclaimable|writeback] to unsigned long.
>
> bdi_stat_sum() and bdi_stat() both call percpu_counter_sum_positive(),
> which returns a non-negative value. bdi_stat[_sum]() returns s64. Should
> we also change bdi_stat[_sum]() to return unsigned long rather than
> s64? I would like the return value type to match the type of the
> corresponding local variables in balance_dirty_pages(). All current
> callers appear to expect bdi_stat[_sum]() to return unsigned long.

Please clean those up, too.
But that change could be orthogonal to your series, so it's up to you. :)
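
A minimal sketch of that follow-up cleanup (hypothetical, outside this series;
since percpu_counter_*_positive() never returns a negative value, the signed
types can simply be dropped; signatures recalled from memory, so treat this as
illustrative):

	/* include/linux/backing-dev.h */
	-static inline s64 bdi_stat(struct backing_dev_info *bdi,
	-			   enum bdi_stat_item item)
	+static inline unsigned long bdi_stat(struct backing_dev_info *bdi,
	+				     enum bdi_stat_item item)

	-static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
	-			       enum bdi_stat_item item)
	+static inline unsigned long bdi_stat_sum(struct backing_dev_info *bdi,
	+					 enum bdi_stat_item item)

	/* mm/page-writeback.c: the balance_dirty_pages() locals follow suit */
	-	long bdi_nr_reclaimable;
	-	long bdi_nr_writeback;
	+	unsigned long bdi_nr_reclaimable;
	+	unsigned long bdi_nr_writeback;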

>
> > --
> > Kind regards,
> > Minchan Kim
> >

--
Kind regards,
Minchan Kim

2011-03-01 21:14:24

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v5 4/9] writeback: create dirty_info structure

On Sun, Feb 27, 2011 at 8:38 AM, Minchan Kim <[email protected]> wrote:
> On Fri, Feb 25, 2011 at 01:35:55PM -0800, Greg Thelen wrote:
>> Bundle dirty limits and dirty memory usage metrics into a dirty_info
>> structure to simplify interfaces of routines that need all.
>>
>> Signed-off-by: Greg Thelen <[email protected]>
>> Acked-by: KAMEZAWA Hiroyuki <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
>> ---
>> Changelog since v4:
>> - Within new dirty_info structure, replaced nr_reclaimable with nr_file_dirty
>>   and nr_unstable_nfs to give callers finer grain dirty usage information.
>> - Added new dirty_info_reclaimable() function.
>> - Made more use of dirty_info structure in throttle_vm_writeout().
>>
>> Changelog since v3:
>> - This is a new patch in v4.
>>
>>  fs/fs-writeback.c         |    7 ++---
>>  include/linux/writeback.h |   15 ++++++++++++-
>>  mm/backing-dev.c          |   18 +++++++++------
>>  mm/page-writeback.c       |   50 ++++++++++++++++++++++----------------------
>>  mm/vmstat.c               |    6 +++-
>>  5 files changed, 57 insertions(+), 39 deletions(-)
>>
>> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
>> index 59c6e49..d75e4da 100644
>> --- a/fs/fs-writeback.c
>> +++ b/fs/fs-writeback.c
>> @@ -595,12 +595,11 @@ static void __writeback_inodes_sb(struct super_block *sb,
>>
>>  static inline bool over_bground_thresh(void)
>>  {
>> -     unsigned long background_thresh, dirty_thresh;
>> +     struct dirty_info info;
>>
>> -     global_dirty_limits(&background_thresh, &dirty_thresh);
>> +     global_dirty_info(&info);
>>
>> -     return (global_page_state(NR_FILE_DIRTY) +
>> -             global_page_state(NR_UNSTABLE_NFS) > background_thresh);
>> +     return dirty_info_reclaimable(&info) > info.background_thresh;
>>  }
>
> This now also fetches nr_writeback, which is unnecessary here.
>
>>
>>  /*
>> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
>> index 0ead399..a06fb38 100644
>> --- a/include/linux/writeback.h
>> +++ b/include/linux/writeback.h
>> @@ -84,6 +84,19 @@ static inline void inode_sync_wait(struct inode *inode)
>>  /*
>>   * mm/page-writeback.c
>>   */
>> +struct dirty_info {
>> +     unsigned long dirty_thresh;
>> +     unsigned long background_thresh;
>> +     unsigned long nr_file_dirty;
>> +     unsigned long nr_writeback;
>> +     unsigned long nr_unstable_nfs;
>> +};
>> +
>> +static inline unsigned long dirty_info_reclaimable(struct dirty_info *info)
>> +{
>> +     return info->nr_file_dirty + info->nr_unstable_nfs;
>> +}
>> +
>>  #ifdef CONFIG_BLOCK
>>  void laptop_io_completion(struct backing_dev_info *info);
>>  void laptop_sync_completion(void);
>> @@ -124,7 +137,7 @@ struct ctl_table;
>>  int dirty_writeback_centisecs_handler(struct ctl_table *, int,
>>                                        void __user *, size_t *, loff_t *);
>>
>> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
>> +void global_dirty_info(struct dirty_info *info);
>>  unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
>>                               unsigned long dirty);
>>
>> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
>> index 027100d..17a06ab 100644
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -66,8 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>  {
>>       struct backing_dev_info *bdi = m->private;
>>       struct bdi_writeback *wb = &bdi->wb;
>> -     unsigned long background_thresh;
>> -     unsigned long dirty_thresh;
>> +     struct dirty_info dirty_info;
>>       unsigned long bdi_thresh;
>>       unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
>>       struct inode *inode;
>> @@ -82,8 +81,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>               nr_more_io++;
>>       spin_unlock(&inode_lock);
>>
>> -     global_dirty_limits(&background_thresh, &dirty_thresh);
>> -     bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
>> +     global_dirty_info(&dirty_info);
>> +     bdi_thresh = bdi_dirty_limit(bdi, dirty_info.dirty_thresh);
>>
>>  #define K(x) ((x) << (PAGE_SHIFT - 10))
>>       seq_printf(m,
>> @@ -99,9 +98,14 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>>                  "state:            %8lx\n",
>>                  (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
>>                  (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
>> -                K(bdi_thresh), K(dirty_thresh),
>> -                K(background_thresh), nr_dirty, nr_io, nr_more_io,
>> -                !list_empty(&bdi->bdi_list), bdi->state);
>> +                K(bdi_thresh),
>> +                K(dirty_info.dirty_thresh),
>> +                K(dirty_info.background_thresh),
>
> This now also fetches nr_file_dirty, nr_writeback, and nr_unstable_nfs, which are unnecessary here.
>
>> +                nr_dirty,
>> +                nr_io,
>> +                nr_more_io,
>> +                !list_empty(&bdi->bdi_list),
>> +                bdi->state);
>>  #undef K
>>
>>       return 0;
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 4408e54..00424b9 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -398,7 +398,7 @@ unsigned long determine_dirtyable_memory(void)
>> ?}
>>
>> ?/*
>> - * global_dirty_limits - background-writeback and dirty-throttling thresholds
>> + * global_dirty_info - return dirty thresholds and usage metrics
>> ? *
>> ? * Calculate the dirty thresholds based on sysctl parameters
>> ? * - vm.dirty_background_ratio ?or ?vm.dirty_background_bytes
>> @@ -406,7 +406,7 @@ unsigned long determine_dirtyable_memory(void)
>> ? * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
>> ? * real-time tasks.
>> ? */
>> -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>> +void global_dirty_info(struct dirty_info *info)
>> ?{
>> ? ? ? unsigned long background;
>> ? ? ? unsigned long dirty;
>> @@ -426,6 +426,10 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>> ? ? ? else
>> ? ? ? ? ? ? ? background = (dirty_background_ratio * available_memory) / 100;
>>
>> + ? ? info->nr_file_dirty = global_page_state(NR_FILE_DIRTY);
>> + ? ? info->nr_writeback = global_page_state(NR_WRITEBACK);
>> + ? ? info->nr_unstable_nfs = global_page_state(NR_UNSTABLE_NFS);
>> +
>> ? ? ? if (background >= dirty)
>> ? ? ? ? ? ? ? background = dirty / 2;
>> ? ? ? tsk = current;
>> @@ -433,8 +437,8 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
>> ? ? ? ? ? ? ? background += background / 4;
>> ? ? ? ? ? ? ? dirty += dirty / 4;
>> ? ? ? }
>> - ? ? *pbackground = background;
>> - ? ? *pdirty = dirty;
>> + ? ? info->background_thresh = background;
>> + ? ? info->dirty_thresh = dirty;
>> ?}
>>
>> ?/*
>> @@ -478,12 +482,9 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
>> ?static void balance_dirty_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? unsigned long write_chunk)
>> ?{
>> - ? ? unsigned long nr_reclaimable;
>> + ? ? struct dirty_info sys_info;
>> ? ? ? long bdi_nr_reclaimable;
>> - ? ? unsigned long nr_writeback;
>> ? ? ? long bdi_nr_writeback;
>> - ? ? unsigned long background_thresh;
>> - ? ? unsigned long dirty_thresh;
>> ? ? ? unsigned long bdi_thresh;
>> ? ? ? unsigned long pages_written = 0;
>> ? ? ? unsigned long pause = 1;
>> @@ -498,22 +499,19 @@ static void balance_dirty_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? ? ? ? ? .range_cyclic ? = 1,
>> ? ? ? ? ? ? ? };
>>
>> - ? ? ? ? ? ? nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? global_page_state(NR_UNSTABLE_NFS);
>> - ? ? ? ? ? ? nr_writeback = global_page_state(NR_WRITEBACK);
>> -
>> - ? ? ? ? ? ? global_dirty_limits(&background_thresh, &dirty_thresh);
>> + ? ? ? ? ? ? global_dirty_info(&sys_info);
>>
>> ? ? ? ? ? ? ? /*
>> ? ? ? ? ? ? ? ?* Throttle it only when the background writeback cannot
>> ? ? ? ? ? ? ? ?* catch-up. This avoids (excessively) small writeouts
>> ? ? ? ? ? ? ? ?* when the bdi limits are ramping up.
>> ? ? ? ? ? ? ? ?*/
>> - ? ? ? ? ? ? if (nr_reclaimable + nr_writeback <=
>> - ? ? ? ? ? ? ? ? ? ? ? ? ? ? (background_thresh + dirty_thresh) / 2)
>> + ? ? ? ? ? ? if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? (sys_info.background_thresh +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?sys_info.dirty_thresh) / 2)
>> ? ? ? ? ? ? ? ? ? ? ? break;
>>
>> - ? ? ? ? ? ? bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
>> + ? ? ? ? ? ? bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
>> ? ? ? ? ? ? ? bdi_thresh = task_dirty_limit(current, bdi_thresh);
>>
>> ? ? ? ? ? ? ? /*
>> @@ -542,7 +540,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>> ? ? ? ? ? ? ? ?*/
>> ? ? ? ? ? ? ? dirty_exceeded =
>> ? ? ? ? ? ? ? ? ? ? ? (bdi_nr_reclaimable + bdi_nr_writeback > bdi_thresh)
>> - ? ? ? ? ? ? ? ? ? ? || (nr_reclaimable + nr_writeback > dirty_thresh);
>> + ? ? ? ? ? ? ? ? ? ? || (dirty_info_reclaimable(&sys_info) +
>> + ? ? ? ? ? ? ? ? ? ? ? ? ?sys_info.nr_writeback > sys_info.dirty_thresh);
>>
>> ? ? ? ? ? ? ? if (!dirty_exceeded)
>> ? ? ? ? ? ? ? ? ? ? ? break;
>> @@ -595,7 +594,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>> ? ? ? ?* background_thresh, to keep the amount of dirty memory low.
>> ? ? ? ?*/
>> ? ? ? if ((laptop_mode && pages_written) ||
>> - ? ? ? ? (!laptop_mode && (nr_reclaimable > background_thresh)))
>> + ? ? ? ? (!laptop_mode && (dirty_info_reclaimable(&sys_info) >
>> + ? ? ? ? ? ? ? ? ? ? ? ? ? sys_info.background_thresh)))
>> ? ? ? ? ? ? ? bdi_start_background_writeback(bdi);
>> ?}
>>
>> @@ -655,21 +655,21 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
>>
>>  void throttle_vm_writeout(gfp_t gfp_mask)
>>  {
>> -     unsigned long background_thresh;
>> -     unsigned long dirty_thresh;
>> +     struct dirty_info sys_info;
>>
>>          for ( ; ; ) {
>> -             global_dirty_limits(&background_thresh, &dirty_thresh);
>> +             global_dirty_info(&sys_info);
>
> This now also fetches nr_file_dirty, which is unnecessary here.
>
>>
>>                  /*
>>                   * Boost the allowable dirty threshold a bit for page
>>                   * allocators so they don't get DoS'ed by heavy writers
>>                   */
>> -                dirty_thresh += dirty_thresh / 10;      /* wheeee... */
>> +             sys_info.dirty_thresh +=
>> +                     sys_info.dirty_thresh / 10;      /* wheeee... */
>>
>> -                if (global_page_state(NR_UNSTABLE_NFS) +
>> -                     global_page_state(NR_WRITEBACK) <= dirty_thresh)
>> -                             break;
>> +             if (sys_info.nr_unstable_nfs +
>> +                 sys_info.nr_writeback <= sys_info.dirty_thresh)
>> +                     break;
>>                  congestion_wait(BLK_RW_ASYNC, HZ/10);
>>
>>               /*
>> diff --git a/mm/vmstat.c b/mm/vmstat.c
>> index 0c3b504..ec95924 100644
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1044,6 +1044,7 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>>  {
>>       unsigned long *v;
>>       int i, stat_items_size;
>> +     struct dirty_info dirty_info;
>>
>>       if (*pos >= ARRAY_SIZE(vmstat_text))
>>               return NULL;
>> @@ -1062,8 +1063,9 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>>               v[i] = global_page_state(i);
>>       v += NR_VM_ZONE_STAT_ITEMS;
>>
>> -     global_dirty_limits(v + NR_DIRTY_BG_THRESHOLD,
>> -                         v + NR_DIRTY_THRESHOLD);
>> +     global_dirty_info(&dirty_info);
>> +     v[NR_DIRTY_BG_THRESHOLD] = dirty_info.background_thresh;
>> +     v[NR_DIRTY_THRESHOLD] = dirty_info.dirty_thresh;
>>       v += NR_VM_WRITEBACK_STAT_ITEMS;
>
> This now also fetches nr_file_dirty, nr_writeback, and nr_unstable_nfs, which are unnecessary here.
>
> The code itself doesn't have a problem, but although it simplifies the code,
> it sometimes gathers information that isn't needed in that context.
> global_page_state() is never a cheap operation, and we have been trying to
> reduce overhead in page-writeback (16c4042f, e50e3720).
>
> Fortunately this patch doesn't increase balance_dirty_pages()'s overhead, and
> the paths affected by it are not fast paths, so I think it's fine.

I am doing a small restructure to address this feedback. The plan is
to only use the new dirty_info structure for memcg limits and usage.
The global limits and usage variables and logic will not be changed.
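
A rough sketch of what that restructure could look like in
balance_dirty_pages() (illustrative only, not the posted v6 code; the
over_memcg_limit flag is a made-up name):

	unsigned long nr_reclaimable, nr_writeback;	/* global path: unchanged */
	unsigned long background_thresh, dirty_thresh;
	struct dirty_info memcg_info;			/* used only for memcg */
	bool over_memcg_limit = false;

	nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
			 global_page_state(NR_UNSTABLE_NFS);
	nr_writeback = global_page_state(NR_WRITEBACK);
	global_dirty_limits(&background_thresh, &dirty_thresh);

	/* the extra counters are only gathered when a memcg dirty limit applies */
	if (memcg_dirty_info(NULL, &memcg_info))
		over_memcg_limit = dirty_info_reclaimable(&memcg_info) +
				   memcg_info.nr_writeback > memcg_info.dirty_thresh;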

> --
> Kind regards,
> Minchan Kim
>

2011-03-02 23:18:52

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v5 9/9] memcg: check memcg dirty limits in page writeback

On Fri, Feb 25, 2011 at 01:36:00PM -0800, Greg Thelen wrote:

[..]
> @@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
> };
>
> global_dirty_info(&sys_info);
> + if (!memcg_dirty_info(NULL, &memcg_info))
> + memcg_info = sys_info;
>
> /*
> * Throttle it only when the background writeback cannot
> * catch-up. This avoids (excessively) small writeouts
> * when the bdi limits are ramping up.
> */
> - if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
> + if ((dirty_info_reclaimable(&sys_info) +
> + sys_info.nr_writeback <=
> (sys_info.background_thresh +
> - sys_info.dirty_thresh) / 2)
> + sys_info.dirty_thresh) / 2) &&
> + (dirty_info_reclaimable(&memcg_info) +
> + memcg_info.nr_writeback <=
> + (memcg_info.background_thresh +
> + memcg_info.dirty_thresh) / 2))
> break;
>
> - bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
> + bdi_thresh = bdi_dirty_limit(bdi,
> + min(sys_info.dirty_thresh,
> + memcg_info.dirty_thresh));
> bdi_thresh = task_dirty_limit(current, bdi_thresh);

Greg, so currently we seem to have per_bdi/per_task dirty limits and
now with this patch it will sort of become per_cgroup/per_bdi/per_task
dirty limits? I think that kind of makes sense to me.

Thanks
Vivek

2011-03-04 00:30:48

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v5 9/9] memcg: check memcg dirty limits in page writeback

On Wed, Mar 2, 2011 at 3:17 PM, Vivek Goyal <[email protected]> wrote:
> On Fri, Feb 25, 2011 at 01:36:00PM -0800, Greg Thelen wrote:
>
> [..]
>> @@ -500,18 +527,27 @@ static void balance_dirty_pages(struct address_space *mapping,
>>               };
>>
>>               global_dirty_info(&sys_info);
>> +             if (!memcg_dirty_info(NULL, &memcg_info))
>> +                     memcg_info = sys_info;
>>
>>               /*
>>                * Throttle it only when the background writeback cannot
>>                * catch-up. This avoids (excessively) small writeouts
>>                * when the bdi limits are ramping up.
>>                */
>> -             if (dirty_info_reclaimable(&sys_info) + sys_info.nr_writeback <=
>> +             if ((dirty_info_reclaimable(&sys_info) +
>> +                  sys_info.nr_writeback <=
>>                               (sys_info.background_thresh +
>> -                              sys_info.dirty_thresh) / 2)
>> +                              sys_info.dirty_thresh) / 2) &&
>> +                 (dirty_info_reclaimable(&memcg_info) +
>> +                  memcg_info.nr_writeback <=
>> +                             (memcg_info.background_thresh +
>> +                              memcg_info.dirty_thresh) / 2))
>>                       break;
>>
>> -             bdi_thresh = bdi_dirty_limit(bdi, sys_info.dirty_thresh);
>> +             bdi_thresh = bdi_dirty_limit(bdi,
>> +                                          min(sys_info.dirty_thresh,
>> +                                              memcg_info.dirty_thresh));
>>               bdi_thresh = task_dirty_limit(current, bdi_thresh);
>
> Greg, so currently we seem to have per_bdi/per_task dirty limits and
> now with this patch it will sort of become per_cgroup/per_bdi/per_task
> dirty limits? I think that kind of makes sense to me.
>
> Thanks
> Vivek
>

Vivek, you are correct. This patch adds per-cgroup limits on top of the
existing per-task, per-bdi, and system-wide dirty memory limits.
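
Condensed, the effective throttle threshold in the quoted hunk is the most
restrictive of the three layers (names as in the patch, structure illustrative):

	unsigned long thresh;

	/* tighter of the system-wide and per-memcg dirty limits ... */
	thresh = min(sys_info.dirty_thresh, memcg_info.dirty_thresh);
	/* ... scaled to this bdi's share of writeback ... */
	thresh = bdi_dirty_limit(bdi, thresh);
	/* ... then adjusted for the current task */
	thresh = task_dirty_limit(current, thresh);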

2011-03-10 16:18:59

by Greg Thelen

[permalink] [raw]
Subject: Re: [PATCH v5 6/9] memcg: add kernel calls for memcg dirty page stats

On Sun, Feb 27, 2011 at 6:40 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Mon, 28 Feb 2011 02:01:43 +0900
> Minchan Kim <[email protected]> wrote:
>
>> On Fri, Feb 25, 2011 at 01:35:57PM -0800, Greg Thelen wrote:
>                 spin_unlock_irqrestore(&mapping->tree_lock, flags);
>> >     } else {
>> > @@ -1365,6 +1368,7 @@ int test_set_page_writeback(struct page *page)
>> >                                             PAGECACHE_TAG_WRITEBACK);
>> >                     if (bdi_cap_account_writeback(bdi))
>> >                             __inc_bdi_stat(bdi, BDI_WRITEBACK);
>> > +                   mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
>>
>> Question:
>> Why should we care about BDI_CAP_NO_WRITEBACK?
>>
> Hmm, should we do something like this instead?
> ==
>        if (!ret) {
>                account_page_writeback(page);
> +              mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
>        }
> ==

Yes, I agree with Minchan that this is an issue. I think Kame's fix
is good. I will apply Kame's fix to test_set_page_writeback(). I
also found that test_clear_page_writeback() has the same issue and it
will also be fixed. I will be posting v6 shortly (hopefully today)
with these fixes.
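
A rough sketch of the symmetric fix in test_clear_page_writeback()
(reconstructed from this discussion, not the posted v6 hunk):

	if (ret) {
		dec_zone_page_state(page, NR_WRITEBACK);
		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
	}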

> Thanks,
> -Kame
>
>

2011-03-11 10:19:58

by Balbir Singh

[permalink] [raw]
Subject: Re: [PATCH v5 1/9] memcg: document cgroup dirty memory interfaces

* Greg Thelen <[email protected]> [2011-02-25 13:35:52]:

> Document cgroup dirty memory interfaces and statistics.
>
> Signed-off-by: Andrea Righi <[email protected]>
> Signed-off-by: Greg Thelen <[email protected]>
> ---


Acked-by: Balbir Singh <[email protected]>


--
Three Cheers,
Balbir