2015-05-22 22:23:43

by Tejun Heo

Subject: [PATCHSET 2/3 v3 block/for-4.2/core] writeback: cgroup writeback backpressure propagation

Hello,

Changes from the last take[L] are

* Rebased on top of block/for-4.2/core.

While the previous patchset[1] implemented cgroup writeback support,
the IO back pressure propagation mechanism implemented in
balance_dirty_pages() and its subroutines isn't yet aware of cgroup
writeback.

Processes belonging to a memcg may have access to only a subset of the
total memory available in the system. Not factoring this into dirty
throttling rendered it completely ineffective for processes under memcg
limits, and memcg ended up building a separate, ad-hoc degenerate
mechanism directly into the vmscan code to limit page dirtying.

This patchset refactors the dirty throttling logic implemented in
balance_dirty_pages() and its subroutines so that it can handle both
global and memcg memory domains. The dirty throttling mechanism is
applied against both the global and memcg constraints and the more
restricted of the two is used for actual throttling.

This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them.
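
For illustration only, here is a minimal userspace C sketch of the "more
restricted domain wins" idea described above. None of the names below are
kernel code; they are made up for the sketch.

/* Hypothetical userspace sketch of dual-domain throttling; not kernel code. */
#include <stdio.h>

struct domain_state {
	unsigned long dirty;	/* dirty pages charged to this domain */
	unsigned long thresh;	/* dirty threshold of this domain */
};

/* 0 == no throttling, larger == throttle harder (toy scale) */
static unsigned long throttle_strength(const struct domain_state *d)
{
	if (d->dirty <= d->thresh)
		return 0;
	return d->dirty - d->thresh;
}

int main(void)
{
	struct domain_state global = { .dirty = 900, .thresh = 1000 };
	struct domain_state memcg  = { .dirty = 180, .thresh = 100 };

	unsigned long g = throttle_strength(&global);
	unsigned long m = throttle_strength(&memcg);

	/* The more restricted (more throttled) of the two domains wins. */
	unsigned long effective = g > m ? g : m;

	printf("global=%lu memcg=%lu effective=%lu\n", g, m, effective);
	return 0;
}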

This patchset contains the following 19 patches.

0001-memcg-make-mem_cgroup_read_-stat-event-iterate-possi.patch
0002-writeback-clean-up-wb_dirty_limit.patch
0003-writeback-reorganize-__-wb_update_bandwidth.patch
0004-writeback-implement-wb_domain.patch
0005-writeback-move-global_dirty_limit-into-wb_domain.patch
0006-writeback-consolidate-dirty-throttle-parameters-into.patch
0007-writeback-add-dirty_throttle_control-wb_bg_thresh.patch
0008-writeback-make-__wb_calc_thresh-take-dirty_throttle_.patch
0009-writeback-add-dirty_throttle_control-pos_ratio.patch
0010-writeback-add-dirty_throttle_control-wb_completions.patch
0011-writeback-add-dirty_throttle_control-dom.patch
0012-writeback-make-__wb_writeout_inc-and-hard_dirty_limi.patch
0013-writeback-separate-out-domain_dirty_limits.patch
0014-writeback-move-over_bground_thresh-to-mm-page-writeb.patch
0015-writeback-update-wb_over_bg_thresh-to-use-wb_domain-.patch
0016-writeback-implement-memcg-wb_domain.patch
0017-writeback-reset-wb_domain-dirty_limit-_tstmp-when-me.patch
0018-writeback-implement-memcg-writeback-domain-based-thr.patch
0019-mm-vmscan-disable-memcg-direct-reclaim-stalling-if-c.patch

0001-0003 are prep patches.

0004-0015 refactor the dirty throttling logic so that it operates on
wb_domain.

0016-0019 implement memcg wb_domain.

This patchset is on top of

block/for-4.2/core b04a5636a665 ("block: replace trylock with mutex_lock in blkdev_reread_part()")
+ [1] [PATCHSET 1/3 v4 block/for-4.2/core] writeback: cgroup writeback support

and available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-backpressure-20150522

diffstat follows. Thanks.

 fs/fs-writeback.c                |   32 -
 include/linux/backing-dev-defs.h |    1
 include/linux/memcontrol.h       |   21 +
 include/linux/writeback.h        |   84 +++-
 include/trace/events/writeback.h |    7
 mm/backing-dev.c                 |   15
 mm/memcontrol.c                  |  145 +++++--
 mm/page-writeback.c              |  744 +++++++++++++++++++++++++--------------
 mm/vmscan.c                      |   51 ++
 9 files changed, 739 insertions(+), 361 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/[email protected]
[1] http://lkml.kernel.org/g/[email protected]


2015-05-22 22:23:49

by Tejun Heo

Subject: [PATCH 01/19] memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online

cpu_possible_mask represents the CPUs which are actually possible
during that boot instance. For systems which don't support CPU
hotplug, this will match cpu_online_mask exactly in most cases. Even
for systems which support CPU hotplug, the number of possible CPU
slots is highly unlikely to diverge greatly from the number of online
CPUs. The only cases where the difference between possible and online
caused problems were when the boot code failed to initialize the
possible mask and left it fully set at NR_CPUS - 1.

As such, most per-cpu constructs allocate for all possible CPUs and
often iterate over the possibles, which also has the benefit of
avoiding the blocking CPU hotplug synchronization.

memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
mem_cgroup_read_events(), which iterates over online CPUs and handles
CPU hotplug operations explicitly. This complexity doesn't actually
buy anything. Switch to iterating over the possibles and drop the
explicit CPU hotplug handling.

Eventually, we want to convert memcg to use percpu_counter instead of
its own custom implementation, which also benefits from quick access
w/o summing for cases where a larger error margin is acceptable.

This will allow mem_cgroup_read_stat() to be called from non-sleepable
contexts, which will be used by cgroup writeback.
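
For illustration, a rough userspace analogue of the "iterate possible CPUs"
pattern (the names and the fixed-size array below are made up for the
sketch; this is not the memcg code):

/* Userspace sketch of "iterate possible CPUs" counter summing; names are made up. */
#include <stdio.h>

#define NR_POSSIBLE_CPUS 8	/* stand-in for the boot-time possible mask */

/* One counter slot per possible CPU, allocated up front. */
static long per_cpu_count[NR_POSSIBLE_CPUS];

static long read_stat(void)
{
	long val = 0;
	int cpu;

	/*
	 * Iterating all possible slots means offline CPUs simply contribute
	 * their last value (or 0), so no hotplug callback has to drain them.
	 */
	for (cpu = 0; cpu < NR_POSSIBLE_CPUS; cpu++)
		val += per_cpu_count[cpu];
	return val;
}

int main(void)
{
	per_cpu_count[0] = 10;
	per_cpu_count[3] = 5;	/* e.g. a CPU that has since gone offline */
	printf("total = %ld\n", read_stat());
	return 0;
}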

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 51 ++-------------------------------------------------
1 file changed, 2 insertions(+), 49 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6732c2c..d7d270a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -324,11 +324,6 @@ struct mem_cgroup {
* percpu counter.
*/
struct mem_cgroup_stat_cpu __percpu *stat;
- /*
- * used when a cpu is offlined or other synchronizations
- * See mem_cgroup_read_stat().
- */
- struct mem_cgroup_stat_cpu nocpu_base;
spinlock_t pcp_counter_lock;

#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
@@ -815,15 +810,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
long val = 0;
int cpu;

- get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->count[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
- spin_lock(&memcg->pcp_counter_lock);
- val += memcg->nocpu_base.count[idx];
- spin_unlock(&memcg->pcp_counter_lock);
-#endif
- put_online_cpus();
return val;
}

@@ -833,15 +821,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
unsigned long val = 0;
int cpu;

- get_online_cpus();
- for_each_online_cpu(cpu)
+ for_each_possible_cpu(cpu)
val += per_cpu(memcg->stat->events[idx], cpu);
-#ifdef CONFIG_HOTPLUG_CPU
- spin_lock(&memcg->pcp_counter_lock);
- val += memcg->nocpu_base.events[idx];
- spin_unlock(&memcg->pcp_counter_lock);
-#endif
- put_online_cpus();
return val;
}

@@ -2191,37 +2172,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
mutex_unlock(&percpu_charge_mutex);
}

-/*
- * This function drains percpu counter value from DEAD cpu and
- * move it to local cpu. Note that this function can be preempted.
- */
-static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
-{
- int i;
-
- spin_lock(&memcg->pcp_counter_lock);
- for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
- long x = per_cpu(memcg->stat->count[i], cpu);
-
- per_cpu(memcg->stat->count[i], cpu) = 0;
- memcg->nocpu_base.count[i] += x;
- }
- for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
- unsigned long x = per_cpu(memcg->stat->events[i], cpu);
-
- per_cpu(memcg->stat->events[i], cpu) = 0;
- memcg->nocpu_base.events[i] += x;
- }
- spin_unlock(&memcg->pcp_counter_lock);
-}
-
static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
unsigned long action,
void *hcpu)
{
int cpu = (unsigned long)hcpu;
struct memcg_stock_pcp *stock;
- struct mem_cgroup *iter;

if (action == CPU_ONLINE)
return NOTIFY_OK;
@@ -2229,9 +2185,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
return NOTIFY_OK;

- for_each_mem_cgroup(iter)
- mem_cgroup_drain_pcp_counter(iter, cpu);
-
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
return NOTIFY_OK;
--
2.4.0

2015-05-22 22:23:55

by Tejun Heo

Subject: [PATCH 02/19] writeback: clean up wb_dirty_limit()

The function name wb_dirty_limit(), its argument @dirty and the local
variable @wb_dirty are mortally confusing given that the function
calculates a per-wb threshold value, not dirty pages, especially since
@dirty and @wb_dirty are used elsewhere for dirty pages.

Let's rename the function to wb_calc_thresh() and @wb_dirty to
@wb_thresh.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 2 +-
include/linux/writeback.h | 2 +-
mm/backing-dev.c | 6 +++---
mm/page-writeback.c | 30 +++++++++++++++---------------
4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 881ea5d..b1b3b81 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1081,7 +1081,7 @@ static bool over_bground_thresh(struct bdi_writeback *wb)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

- if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
+ if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh))
return true;

return false;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 23af355..0435c85 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -155,7 +155,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
-unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
+unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);

void __wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long thresh,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ad5608d..9c8b7b5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -49,7 +49,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
struct bdi_writeback *wb = &bdi->wb;
unsigned long background_thresh;
unsigned long dirty_thresh;
- unsigned long bdi_thresh;
+ unsigned long wb_thresh;
unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
struct inode *inode;

@@ -67,7 +67,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
spin_unlock(&wb->list_lock);

global_dirty_limits(&background_thresh, &dirty_thresh);
- bdi_thresh = wb_dirty_limit(wb, dirty_thresh);
+ wb_thresh = wb_calc_thresh(wb, dirty_thresh);

#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
@@ -87,7 +87,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"state: %10lx\n",
(unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
(unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
- K(bdi_thresh),
+ K(wb_thresh),
K(dirty_thresh),
K(background_thresh),
(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 70cf98d..c7745a7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -556,7 +556,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
}

/**
- * wb_dirty_limit - @wb's share of dirty throttling threshold
+ * wb_calc_thresh - @wb's share of dirty throttling threshold
* @wb: bdi_writeback to query
* @dirty: global dirty limit in pages
*
@@ -577,28 +577,28 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
* The wb's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*/
-unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
+unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
{
- u64 wb_dirty;
+ u64 wb_thresh;
long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;

/*
- * Calculate this BDI's share of the dirty ratio.
+ * Calculate this BDI's share of the thresh ratio.
*/
wb_writeout_fraction(wb, &numerator, &denominator);

- wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
- wb_dirty *= numerator;
- do_div(wb_dirty, denominator);
+ wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
+ wb_thresh *= numerator;
+ do_div(wb_thresh, denominator);

wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);

- wb_dirty += (dirty * wb_min_ratio) / 100;
- if (wb_dirty > (dirty * wb_max_ratio) / 100)
- wb_dirty = dirty * wb_max_ratio / 100;
+ wb_thresh += (thresh * wb_min_ratio) / 100;
+ if (wb_thresh > (thresh * wb_max_ratio) / 100)
+ wb_thresh = thresh * wb_max_ratio / 100;

- return wb_dirty;
+ return wb_thresh;
}

/*
@@ -750,7 +750,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
* limits are set by default to 10% and 20% (background and throttle).
* Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
- * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. wb_setpoint is
+ * wb_calc_thresh(wb, bg_thresh) is about ~4K pages. wb_setpoint is
* about ~6K pages (as the average of background and throttle wb
* limits). The 3rd order polynomial will provide positive feedback if
* wb_dirty is under wb_setpoint and vice versa.
@@ -1115,7 +1115,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
*
* We rampup dirty_ratelimit forcibly if wb_dirty is low because
* it's possible that wb_thresh is close to zero due to inactivity
- * of backing device (see the implementation of wb_dirty_limit()).
+ * of backing device (see the implementation of wb_calc_thresh()).
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = wb_dirty;
@@ -1123,7 +1123,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
setpoint = wb_dirty + 1;
else
setpoint = (wb_thresh +
- wb_dirty_limit(wb, bg_thresh)) / 2;
+ wb_calc_thresh(wb, bg_thresh)) / 2;
}

if (dirty < setpoint) {
@@ -1352,7 +1352,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
* wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
- *wb_thresh = wb_dirty_limit(wb, dirty_thresh);
+ *wb_thresh = wb_calc_thresh(wb, dirty_thresh);

if (wb_bg_thresh)
*wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh *
--
2.4.0

2015-05-22 22:29:48

by Tejun Heo

Subject: [PATCH 03/19] writeback: reorganize [__]wb_update_bandwidth()

__wb_update_bandwidth() is called from two places -
mm/page-writeback.c::balance_dirty_pages() and
fs/fs-writeback.c::wb_writeback(). The latter updates only the
write bandwidth while the former also deals with the dirty ratelimit.
The two callsites are distinguished by whether the @thresh parameter
is zero or not, which is cryptic. In addition, the two files define
their own different versions of wb_update_bandwidth() on top of
__wb_update_bandwidth(), which is confusing to say the least. This
patch cleans up [__]wb_update_bandwidth() in the following ways.

* __wb_update_bandwidth() now takes explicit @update_ratelimit
parameter to gate dirty ratelimit handling.

* mm/page-writeback.c::wb_update_bandwidth() is flattened into its
caller - balance_dirty_pages().

* fs/fs-writeback.c::wb_update_bandwidth() is moved to
mm/page-writeback.c and __wb_update_bandwidth() is made static.

* While at it, add a lockdep assertion to __wb_update_bandwidth().

Except for the lockdep addition, this is pure reorganization and
doesn't introduce any behavioral changes.
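
As an aside, the first point is an instance of a generic refactoring
pattern: replacing a sentinel argument ("@thresh == 0 means don't touch the
ratelimit") with an explicit flag. A minimal standalone C sketch of that
pattern, with made-up names, looks like this:

/* Sketch of replacing a "thresh == 0 means don't update" sentinel with a flag. */
#include <stdbool.h>
#include <stdio.h>

static void update_bandwidth(unsigned long thresh, bool update_ratelimit)
{
	/* Always-updated part (write bandwidth in the real code). */
	printf("update write bandwidth\n");

	/* Gated explicitly instead of by "thresh != 0". */
	if (update_ratelimit)
		printf("update dirty ratelimit against thresh=%lu\n", thresh);
}

int main(void)
{
	update_bandwidth(0, false);	/* background writeback style caller */
	update_bandwidth(1000, true);	/* balance_dirty_pages() style caller */
	return 0;
}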

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 10 ----------
include/linux/writeback.h | 9 +--------
mm/page-writeback.c | 45 ++++++++++++++++++++++-----------------------
3 files changed, 23 insertions(+), 41 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b1b3b81..cd89484 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1088,16 +1088,6 @@ static bool over_bground_thresh(struct bdi_writeback *wb)
}

/*
- * Called under wb->list_lock. If there are multiple wb per bdi,
- * only the flusher working on the first wb should do it.
- */
-static void wb_update_bandwidth(struct bdi_writeback *wb,
- unsigned long start_time)
-{
- __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
-}
-
-/*
* Explicit flushing or periodic writeback of "old" data.
*
* Define "old": the first time one of an inode's pages is dirtied, we mark the
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0435c85..80adf3d 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -157,14 +157,7 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);

-void __wb_update_bandwidth(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time);
-
+void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c7745a7..bebdd41 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1160,19 +1160,22 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
}

-void __wb_update_bandwidth(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long wb_thresh,
- unsigned long wb_dirty,
- unsigned long start_time)
+static void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long wb_thresh,
+ unsigned long wb_dirty,
+ unsigned long start_time,
+ bool update_ratelimit)
{
unsigned long now = jiffies;
unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
unsigned long written;

+ lockdep_assert_held(&wb->list_lock);
+
/*
* rate-limit, only update once every 200ms.
*/
@@ -1189,7 +1192,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
goto snapshot;

- if (thresh) {
+ if (update_ratelimit) {
global_update_bandwidth(thresh, dirty, now);
wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
wb_thresh, wb_dirty,
@@ -1203,20 +1206,9 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
wb->bw_time_stamp = now;
}

-static void wb_update_bandwidth(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long wb_thresh,
- unsigned long wb_dirty,
- unsigned long start_time)
+void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
{
- if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
- return;
- spin_lock(&wb->list_lock);
- __wb_update_bandwidth(wb, thresh, bg_thresh, dirty,
- wb_thresh, wb_dirty, start_time);
- spin_unlock(&wb->list_lock);
+ __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time, false);
}

/*
@@ -1467,8 +1459,15 @@ static void balance_dirty_pages(struct address_space *mapping,
if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;

- wb_update_bandwidth(wb, dirty_thresh, background_thresh,
- nr_dirty, wb_thresh, wb_dirty, start_time);
+ if (time_is_before_jiffies(wb->bw_time_stamp +
+ BANDWIDTH_INTERVAL)) {
+ spin_lock(&wb->list_lock);
+ __wb_update_bandwidth(wb, dirty_thresh,
+ background_thresh, nr_dirty,
+ wb_thresh, wb_dirty, start_time,
+ true);
+ spin_unlock(&wb->list_lock);
+ }

dirty_ratelimit = wb->dirty_ratelimit;
pos_ratio = wb_position_ratio(wb, dirty_thresh,
--
2.4.0

2015-05-22 22:29:16

by Tejun Heo

Subject: [PATCH 04/19] writeback: implement wb_domain

Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets a proportionately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg. IOW, a wb will belong to two writeback domains - the global
and memcg domains.

Currently, what constitutes the global writeback domain are scattered
across a number of global states. This patch starts collecting them
into struct wb_domain.

* fprop_global which serves as the basis for proportional bandwidth
measurement and its period timer are moved into struct wb_domain.

* global_wb_domain hosts the states for the global domain.

* While at it, flatten wb_writeout_fraction() into its callers. This
thin wrapper doesn't provide any actual benefits while getting in
the way.

This is pure reorganization and doesn't introduce any behavioral
changes.
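
For illustration, the underlying pattern - folding file-scope statics into a
struct so that more than one instance can exist - can be sketched in plain
userspace C as follows (field names loosely mirror the patch, everything
else is made up):

/* Sketch: fold per-domain state that used to be file-scope statics into a struct. */
#include <string.h>
#include <stdio.h>

struct wb_domain_sketch {
	unsigned long completions;	/* stand-in for struct fprop_global */
	unsigned long period_time;	/* stand-in for the aging timer state */
};

static struct wb_domain_sketch global_dom;	/* the one global instance */

static int domain_init(struct wb_domain_sketch *dom)
{
	memset(dom, 0, sizeof(*dom));
	/* the real wb_domain_init() also sets up a deferrable timer */
	return 0;
}

int main(void)
{
	struct wb_domain_sketch memcg_dom;	/* a second, per-memcg instance */

	domain_init(&global_dom);
	domain_init(&memcg_dom);	/* now possible because the state isn't static */
	printf("two independent domains initialized\n");
	return 0;
}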

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
include/linux/writeback.h | 32 +++++++++++++++++++++
mm/page-writeback.c | 72 ++++++++++++++++++-----------------------------
2 files changed, 59 insertions(+), 45 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 80adf3d..3148db1 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -7,6 +7,7 @@
#include <linux/sched.h>
#include <linux/workqueue.h>
#include <linux/fs.h>
+#include <linux/flex_proportions.h>

DECLARE_PER_CPU(int, dirty_throttle_leaks);

@@ -87,6 +88,36 @@ struct writeback_control {
};

/*
+ * A wb_domain represents a domain that wb's (bdi_writeback's) belong to
+ * and are measured against each other in. There always is one global
+ * domain, global_wb_domain, that every wb in the system is a member of.
+ * This allows measuring the relative bandwidth of each wb to distribute
+ * dirtyable memory accordingly.
+ */
+struct wb_domain {
+ /*
+ * Scale the writeback cache size proportional to the relative
+ * writeout speed.
+ *
+ * We do this by keeping a floating proportion between BDIs, based
+ * on page writeback completions [end_page_writeback()]. Those
+ * devices that write out pages fastest will get the larger share,
+ * while the slower will get a smaller share.
+ *
+ * We use page writeout completions because we are interested in
+ * getting rid of dirty pages. Having them written out is the
+ * primary goal.
+ *
+ * We introduce a concept of time, a period over which we measure
+ * these events, because demand can/will vary over time. The length
+ * of this period itself is measured in page writeback completions.
+ */
+ struct fprop_global completions;
+ struct timer_list period_timer; /* timer for aging of completions */
+ unsigned long period_time;
+};
+
+/*
* fs/fs-writeback.c
*/
struct bdi_writeback;
@@ -120,6 +151,7 @@ static inline void laptop_sync_completion(void) { }
#endif
void throttle_vm_writeout(gfp_t gfp_mask);
bool zone_dirty_ok(struct zone *zone);
+int wb_domain_init(struct wb_domain *dom, gfp_t gfp);

extern unsigned long global_dirty_limit;

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bebdd41..08e1737 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,29 +124,7 @@ EXPORT_SYMBOL(laptop_mode);

unsigned long global_dirty_limit;

-/*
- * Scale the writeback cache size proportional to the relative writeout speeds.
- *
- * We do this by keeping a floating proportion between BDIs, based on page
- * writeback completions [end_page_writeback()]. Those devices that write out
- * pages fastest will get the larger share, while the slower will get a smaller
- * share.
- *
- * We use page writeout completions because we are interested in getting rid of
- * dirty pages. Having them written out is the primary goal.
- *
- * We introduce a concept of time, a period over which we measure these events,
- * because demand can/will vary over time. The length of this period itself is
- * measured in page writeback completions.
- *
- */
-static struct fprop_global writeout_completions;
-
-static void writeout_period(unsigned long t);
-/* Timer for aging of writeout_completions */
-static struct timer_list writeout_period_timer =
- TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0);
-static unsigned long writeout_period_time = 0;
+static struct wb_domain global_wb_domain;

/*
* Length of period for aging writeout fractions of bdis. This is an
@@ -433,24 +411,26 @@ static unsigned long wp_next_time(unsigned long cur_time)
}

/*
- * Increment the BDI's writeout completion count and the global writeout
+ * Increment the wb's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
*/
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
+ struct wb_domain *dom = &global_wb_domain;
+
__inc_wb_stat(wb, WB_WRITTEN);
- __fprop_inc_percpu_max(&writeout_completions, &wb->completions,
+ __fprop_inc_percpu_max(&dom->completions, &wb->completions,
wb->bdi->max_prop_frac);
/* First event after period switching was turned off? */
- if (!unlikely(writeout_period_time)) {
+ if (!unlikely(dom->period_time)) {
/*
* We can race with other __bdi_writeout_inc calls here but
* it does not cause any harm since the resulting time when
* timer will fire and what is in writeout_period_time will be
* roughly the same.
*/
- writeout_period_time = wp_next_time(jiffies);
- mod_timer(&writeout_period_timer, writeout_period_time);
+ dom->period_time = wp_next_time(jiffies);
+ mod_timer(&dom->period_timer, dom->period_time);
}
}

@@ -465,37 +445,37 @@ void wb_writeout_inc(struct bdi_writeback *wb)
EXPORT_SYMBOL_GPL(wb_writeout_inc);

/*
- * Obtain an accurate fraction of the BDI's portion.
- */
-static void wb_writeout_fraction(struct bdi_writeback *wb,
- long *numerator, long *denominator)
-{
- fprop_fraction_percpu(&writeout_completions, &wb->completions,
- numerator, denominator);
-}
-
-/*
* On idle system, we can be called long after we scheduled because we use
* deferred timers so count with missed periods.
*/
static void writeout_period(unsigned long t)
{
- int miss_periods = (jiffies - writeout_period_time) /
+ struct wb_domain *dom = (void *)t;
+ int miss_periods = (jiffies - dom->period_time) /
VM_COMPLETIONS_PERIOD_LEN;

- if (fprop_new_period(&writeout_completions, miss_periods + 1)) {
- writeout_period_time = wp_next_time(writeout_period_time +
+ if (fprop_new_period(&dom->completions, miss_periods + 1)) {
+ dom->period_time = wp_next_time(dom->period_time +
miss_periods * VM_COMPLETIONS_PERIOD_LEN);
- mod_timer(&writeout_period_timer, writeout_period_time);
+ mod_timer(&dom->period_timer, dom->period_time);
} else {
/*
* Aging has zeroed all fractions. Stop wasting CPU on period
* updates.
*/
- writeout_period_time = 0;
+ dom->period_time = 0;
}
}

+int wb_domain_init(struct wb_domain *dom, gfp_t gfp)
+{
+ memset(dom, 0, sizeof(*dom));
+ init_timer_deferrable(&dom->period_timer);
+ dom->period_timer.function = writeout_period;
+ dom->period_timer.data = (unsigned long)dom;
+ return fprop_global_init(&dom->completions, gfp);
+}
+
/*
* bdi_min_ratio keeps the sum of the minimum dirty shares of all
* registered backing devices, which, for obvious reasons, can not
@@ -579,6 +559,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
*/
unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
{
+ struct wb_domain *dom = &global_wb_domain;
u64 wb_thresh;
long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;
@@ -586,7 +567,8 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
/*
* Calculate this BDI's share of the thresh ratio.
*/
- wb_writeout_fraction(wb, &numerator, &denominator);
+ fprop_fraction_percpu(&dom->completions, &wb->completions,
+ &numerator, &denominator);

wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
wb_thresh *= numerator;
@@ -1831,7 +1813,7 @@ void __init page_writeback_init(void)
writeback_set_ratelimit();
register_cpu_notifier(&ratelimit_nb);

- fprop_global_init(&writeout_completions, GFP_KERNEL);
+ BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL));
}

/**
--
2.4.0

2015-05-22 22:28:54

by Tejun Heo

Subject: [PATCH 05/19] writeback: move global_dirty_limit into wb_domain

This patch is a part of the series to define wb_domain which
represents a domain that wb's (bdi_writeback's) belong to and are
measured against each other in. This will enable IO backpressure
propagation for cgroup writeback.

global_dirty_limit exists to regulate the global dirty threshold, which
is a property of the wb_domain. This patch moves global_dirty_limit,
dirty_lock, and update_time into wb_domain.

This is pure reorganization and doesn't introduce any behavioral
changes.
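
For reference, the asymmetric tracking performed by update_dirty_limit()
(the logic itself is unchanged by this patch, only relocated) can be
sketched in isolation as below; the 1/32 decay matches the ">> 5" in the
code, and the driver in main() is made up:

/* Userspace sketch of the dirty_limit tracking done by update_dirty_limit(). */
#include <stdio.h>

static unsigned long track_limit(unsigned long limit, unsigned long thresh,
				 unsigned long dirty)
{
	/* Follow a rising threshold in one step. */
	if (limit < thresh)
		return thresh;

	/*
	 * Follow a falling threshold slowly, and never drop below the
	 * current dirty page count, so heavy dirtiers aren't suddenly
	 * thrown deep into the "exceeded" state.
	 */
	thresh = thresh > dirty ? thresh : dirty;
	if (limit > thresh)
		limit -= (limit - thresh) >> 5;
	return limit;
}

int main(void)
{
	unsigned long limit = 10000, thresh = 2000, dirty = 1500;
	int i;

	for (i = 0; i < 5; i++) {
		limit = track_limit(limit, thresh, dirty);
		printf("step %d: limit=%lu\n", i, limit);
	}
	return 0;
}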

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 2 +-
include/linux/writeback.h | 17 ++++++++++++++-
include/trace/events/writeback.h | 7 +++---
mm/page-writeback.c | 46 ++++++++++++++++++++--------------------
4 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cd89484..51c8a5b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -887,7 +887,7 @@ static long writeback_chunk_size(struct bdi_writeback *wb,
pages = LONG_MAX;
else {
pages = min(wb->avg_write_bandwidth / 2,
- global_dirty_limit / DIRTY_SCOPE);
+ global_wb_domain.dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
MIN_WRITEBACK_PAGES);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3148db1..5fdd4e1 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -95,6 +95,8 @@ struct writeback_control {
* dirtyable memory accordingly.
*/
struct wb_domain {
+ spinlock_t lock;
+
/*
* Scale the writeback cache size proportional to the relative
* writeout speed.
@@ -115,6 +117,19 @@ struct wb_domain {
struct fprop_global completions;
struct timer_list period_timer; /* timer for aging of completions */
unsigned long period_time;
+
+ /*
+ * The dirtyable memory and dirty threshold could be suddenly
+ * knocked down by a large amount (eg. on the startup of KVM in a
+ * swapless system). This may throw the system into deep dirty
+ * exceeded state and throttle heavy/light dirtiers alike. To
+ * retain good responsiveness, maintain global_dirty_limit for
+ * tracking slowly down to the knocked down dirty threshold.
+ *
+ * Both fields are protected by ->lock.
+ */
+ unsigned long dirty_limit_tstamp;
+ unsigned long dirty_limit;
};

/*
@@ -153,7 +168,7 @@ void throttle_vm_writeout(gfp_t gfp_mask);
bool zone_dirty_ok(struct zone *zone);
int wb_domain_init(struct wb_domain *dom, gfp_t gfp);

-extern unsigned long global_dirty_limit;
+extern struct wb_domain global_wb_domain;

/* These are exported to sysctl. */
extern int dirty_background_ratio;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 9b876f6..bec6999 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -361,7 +361,7 @@ TRACE_EVENT(global_dirty_state,
__entry->nr_written = global_page_state(NR_WRITTEN);
__entry->background_thresh = background_thresh;
__entry->dirty_thresh = dirty_thresh;
- __entry->dirty_limit = global_dirty_limit;
+ __entry->dirty_limit = global_wb_domain.dirty_limit;
),

TP_printk("dirty=%lu writeback=%lu unstable=%lu "
@@ -463,8 +463,9 @@ TRACE_EVENT(balance_dirty_pages,
unsigned long freerun = (thresh + bg_thresh) / 2;
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);

- __entry->limit = global_dirty_limit;
- __entry->setpoint = (global_dirty_limit + freerun) / 2;
+ __entry->limit = global_wb_domain.dirty_limit;
+ __entry->setpoint = (global_wb_domain.dirty_limit +
+ freerun) / 2;
__entry->dirty = dirty;
__entry->bdi_setpoint = __entry->setpoint *
bdi_thresh / (thresh + 1);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 08e1737..27e60ba 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -122,9 +122,7 @@ EXPORT_SYMBOL(laptop_mode);

/* End of sysctl-exported parameters */

-unsigned long global_dirty_limit;
-
-static struct wb_domain global_wb_domain;
+struct wb_domain global_wb_domain;

/*
* Length of period for aging writeout fractions of bdis. This is an
@@ -470,9 +468,15 @@ static void writeout_period(unsigned long t)
int wb_domain_init(struct wb_domain *dom, gfp_t gfp)
{
memset(dom, 0, sizeof(*dom));
+
+ spin_lock_init(&dom->lock);
+
init_timer_deferrable(&dom->period_timer);
dom->period_timer.function = writeout_period;
dom->period_timer.data = (unsigned long)dom;
+
+ dom->dirty_limit_tstamp = jiffies;
+
return fprop_global_init(&dom->completions, gfp);
}

@@ -532,7 +536,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh,

static unsigned long hard_dirty_limit(unsigned long thresh)
{
- return max(thresh, global_dirty_limit);
+ struct wb_domain *dom = &global_wb_domain;
+
+ return max(thresh, dom->dirty_limit);
}

/**
@@ -916,17 +922,10 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
wb->avg_write_bandwidth = avg;
}

-/*
- * The global dirtyable memory and dirty threshold could be suddenly knocked
- * down by a large amount (eg. on the startup of KVM in a swapless system).
- * This may throw the system into deep dirty exceeded state and throttle
- * heavy/light dirtiers alike. To retain good responsiveness, maintain
- * global_dirty_limit for tracking slowly down to the knocked down dirty
- * threshold.
- */
static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
{
- unsigned long limit = global_dirty_limit;
+ struct wb_domain *dom = &global_wb_domain;
+ unsigned long limit = dom->dirty_limit;

/*
* Follow up in one step.
@@ -939,7 +938,7 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
/*
* Follow down slowly. Use the higher one as the target, because thresh
* may drop below dirty. This is exactly the reason to introduce
- * global_dirty_limit which is guaranteed to lie above the dirty pages.
+ * dom->dirty_limit which is guaranteed to lie above the dirty pages.
*/
thresh = max(thresh, dirty);
if (limit > thresh) {
@@ -948,28 +947,27 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
}
return;
update:
- global_dirty_limit = limit;
+ dom->dirty_limit = limit;
}

static void global_update_bandwidth(unsigned long thresh,
unsigned long dirty,
unsigned long now)
{
- static DEFINE_SPINLOCK(dirty_lock);
- static unsigned long update_time = INITIAL_JIFFIES;
+ struct wb_domain *dom = &global_wb_domain;

/*
* check locklessly first to optimize away locking for the most time
*/
- if (time_before(now, update_time + BANDWIDTH_INTERVAL))
+ if (time_before(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL))
return;

- spin_lock(&dirty_lock);
- if (time_after_eq(now, update_time + BANDWIDTH_INTERVAL)) {
+ spin_lock(&dom->lock);
+ if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) {
update_dirty_limit(thresh, dirty);
- update_time = now;
+ dom->dirty_limit_tstamp = now;
}
- spin_unlock(&dirty_lock);
+ spin_unlock(&dom->lock);
}

/*
@@ -1761,10 +1759,12 @@ void laptop_sync_completion(void)

void writeback_set_ratelimit(void)
{
+ struct wb_domain *dom = &global_wb_domain;
unsigned long background_thresh;
unsigned long dirty_thresh;
+
global_dirty_limits(&background_thresh, &dirty_thresh);
- global_dirty_limit = dirty_thresh;
+ dom->dirty_limit = dirty_thresh;
ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
if (ratelimit_pages < 16)
ratelimit_pages = 16;
--
2.4.0

2015-05-22 22:28:16

by Tejun Heo

Subject: [PATCH 06/19] writeback: consolidate dirty throttle parameters into dirty_throttle_control

Dirty throttling implemented in balance_dirty_pages() and its
subroutines makes use of a number of parameters which are passed
around individually. This renders these functions somewhat unwieldy
and makes it difficult to add or change the involved parameters. Also,
some functions use different or conflicting naming schemes for the
same parameters, making the code confusing to follow.

This patch consolidates the main parameters into struct
dirty_throttle_control so that they can be passed around easily and
adding new parameters isn't painful. This also unifies how a given
parameter is named and accessed. The drawback of using this type of
control structure rather than explicit parameters is that it isn't
immediately obvious which function accesses and modifies what;
however, it's fairly clear that the benefits outweigh the cost in this
case.

The GDTC_INIT() macro is provided to ease initializing
dirty_throttle_control for the global_wb_domain, and
balance_dirty_pages() uses a separate pointer to point to its global
dirty_throttle_control. This is to make it uniform with the memcg
domain handling which will be added later.

This patch doesn't introduce any behavioral changes.
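
For illustration, the control-structure pattern with a
designated-initializer macro can be sketched in standalone C as below; the
fields mirror dirty_throttle_control but the helper and its caller are made
up:

/* Sketch of passing one control struct instead of many scalar parameters. */
#include <stdio.h>

struct dirty_throttle_control_sketch {
	unsigned long dirty;
	unsigned long thresh;
	unsigned long bg_thresh;
	unsigned long wb_dirty;
	unsigned long wb_thresh;
};

/* Designated-initializer macro, in the spirit of GDTC_INIT(). */
#define SKETCH_INIT(__thresh)	.thresh = (__thresh)

static int dirty_exceeded(const struct dirty_throttle_control_sketch *dtc)
{
	/* One pointer gives access to every parameter, named consistently. */
	return dtc->wb_dirty > dtc->wb_thresh && dtc->dirty > dtc->thresh;
}

int main(void)
{
	struct dirty_throttle_control_sketch gdtc = { SKETCH_INIT(1000) };

	gdtc.dirty = 1200;
	gdtc.wb_thresh = 300;
	gdtc.wb_dirty = 450;
	printf("dirty_exceeded=%d\n", dirty_exceeded(&gdtc));
	return 0;
}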

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 212 +++++++++++++++++++++++++---------------------------
1 file changed, 101 insertions(+), 111 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 27e60ba..126e3c8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -124,6 +124,20 @@ EXPORT_SYMBOL(laptop_mode);

struct wb_domain global_wb_domain;

+/* consolidated parameters for balance_dirty_pages() and its subroutines */
+struct dirty_throttle_control {
+ struct bdi_writeback *wb;
+
+ unsigned long dirty; /* file_dirty + write + nfs */
+ unsigned long thresh; /* dirty threshold */
+ unsigned long bg_thresh; /* dirty background threshold */
+
+ unsigned long wb_dirty; /* per-wb counterparts */
+ unsigned long wb_thresh;
+};
+
+#define GDTC_INIT(__wb) .wb = (__wb)
+
/*
* Length of period for aging writeout fractions of bdis. This is an
* arbitrarily chosen number. The longer the period, the slower fractions will
@@ -695,16 +709,13 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* card's wb_dirty may rush to many times higher than wb_setpoint.
* - the wb dirty thresh drops quickly due to change of JBOD workload
*/
-static unsigned long wb_position_ratio(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long wb_thresh,
- unsigned long wb_dirty)
+static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
{
+ struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
- unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
- unsigned long limit = hard_dirty_limit(thresh);
+ unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
+ unsigned long limit = hard_dirty_limit(dtc->thresh);
+ unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
unsigned long wb_setpoint;
@@ -712,7 +723,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
long long pos_ratio; /* for scaling up/down the rate limit */
long x;

- if (unlikely(dirty >= limit))
+ if (unlikely(dtc->dirty >= limit))
return 0;

/*
@@ -721,7 +732,7 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* See comment for pos_ratio_polynom().
*/
setpoint = (freerun + limit) / 2;
- pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
+ pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit);

/*
* The strictlimit feature is a tool preventing mistrusted filesystems
@@ -752,20 +763,21 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
long long wb_pos_ratio;
unsigned long wb_bg_thresh;

- if (wb_dirty < 8)
+ if (dtc->wb_dirty < 8)
return min_t(long long, pos_ratio * 2,
2 << RATELIMIT_CALC_SHIFT);

- if (wb_dirty >= wb_thresh)
+ if (dtc->wb_dirty >= wb_thresh)
return 0;

- wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh);
+ wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
+ dtc->thresh);
wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);

if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
return 0;

- wb_pos_ratio = pos_ratio_polynom(wb_setpoint, wb_dirty,
+ wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
wb_thresh);

/*
@@ -823,8 +835,8 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* own size, so move the slope over accordingly and choose a slope that
* yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh.
*/
- if (unlikely(wb_thresh > thresh))
- wb_thresh = thresh;
+ if (unlikely(wb_thresh > dtc->thresh))
+ wb_thresh = dtc->thresh;
/*
* It's very possible that wb_thresh is close to 0 not because the
* device is slow, but that it has remained inactive for long time.
@@ -832,12 +844,12 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* threshold, so that the occasional writes won't be blocked and active
* writes can rampup the threshold quickly.
*/
- wb_thresh = max(wb_thresh, (limit - dirty) / 8);
+ wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8);
/*
* scale global setpoint to wb's:
* wb_setpoint = setpoint * wb_thresh / thresh
*/
- x = div_u64((u64)wb_thresh << 16, thresh + 1);
+ x = div_u64((u64)wb_thresh << 16, dtc->thresh + 1);
wb_setpoint = setpoint * (u64)x >> 16;
/*
* Use span=(8*write_bw) in single wb case as indicated by
@@ -847,12 +859,12 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* span = --------- * (8 * write_bw) + ------------------ * wb_thresh
* thresh thresh
*/
- span = (thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16;
+ span = (dtc->thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16;
x_intercept = wb_setpoint + span;

- if (wb_dirty < x_intercept - span / 4) {
- pos_ratio = div64_u64(pos_ratio * (x_intercept - wb_dirty),
- x_intercept - wb_setpoint + 1);
+ if (dtc->wb_dirty < x_intercept - span / 4) {
+ pos_ratio = div64_u64(pos_ratio * (x_intercept - dtc->wb_dirty),
+ x_intercept - wb_setpoint + 1);
} else
pos_ratio /= 4;

@@ -862,9 +874,10 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
* than setpoint.
*/
x_intercept = wb_thresh / 2;
- if (wb_dirty < x_intercept) {
- if (wb_dirty > x_intercept / 8)
- pos_ratio = div_u64(pos_ratio * x_intercept, wb_dirty);
+ if (dtc->wb_dirty < x_intercept) {
+ if (dtc->wb_dirty > x_intercept / 8)
+ pos_ratio = div_u64(pos_ratio * x_intercept,
+ dtc->wb_dirty);
else
pos_ratio *= 8;
}
@@ -922,9 +935,10 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
wb->avg_write_bandwidth = avg;
}

-static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
+static void update_dirty_limit(struct dirty_throttle_control *dtc)
{
struct wb_domain *dom = &global_wb_domain;
+ unsigned long thresh = dtc->thresh;
unsigned long limit = dom->dirty_limit;

/*
@@ -940,7 +954,7 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
* may drop below dirty. This is exactly the reason to introduce
* dom->dirty_limit which is guaranteed to lie above the dirty pages.
*/
- thresh = max(thresh, dirty);
+ thresh = max(thresh, dtc->dirty);
if (limit > thresh) {
limit -= (limit - thresh) >> 5;
goto update;
@@ -950,8 +964,7 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
dom->dirty_limit = limit;
}

-static void global_update_bandwidth(unsigned long thresh,
- unsigned long dirty,
+static void global_update_bandwidth(struct dirty_throttle_control *dtc,
unsigned long now)
{
struct wb_domain *dom = &global_wb_domain;
@@ -964,7 +977,7 @@ static void global_update_bandwidth(unsigned long thresh,

spin_lock(&dom->lock);
if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) {
- update_dirty_limit(thresh, dirty);
+ update_dirty_limit(dtc);
dom->dirty_limit_tstamp = now;
}
spin_unlock(&dom->lock);
@@ -976,17 +989,14 @@ static void global_update_bandwidth(unsigned long thresh,
* Normal wb tasks will be curbed at or below it in long term.
* Obviously it should be around (write_bw / N) when there are N dd tasks.
*/
-static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long wb_thresh,
- unsigned long wb_dirty,
+static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
unsigned long dirtied,
unsigned long elapsed)
{
- unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
- unsigned long limit = hard_dirty_limit(thresh);
+ struct bdi_writeback *wb = dtc->wb;
+ unsigned long dirty = dtc->dirty;
+ unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
+ unsigned long limit = hard_dirty_limit(dtc->thresh);
unsigned long setpoint = (freerun + limit) / 2;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1003,8 +1013,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
*/
dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

- pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty,
- wb_thresh, wb_dirty);
+ pos_ratio = wb_position_ratio(dtc);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -1098,12 +1107,12 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
* of backing device (see the implementation of wb_calc_thresh()).
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
- dirty = wb_dirty;
- if (wb_dirty < 8)
- setpoint = wb_dirty + 1;
+ dirty = dtc->wb_dirty;
+ if (dtc->wb_dirty < 8)
+ setpoint = dtc->wb_dirty + 1;
else
- setpoint = (wb_thresh +
- wb_calc_thresh(wb, bg_thresh)) / 2;
+ setpoint = (dtc->wb_thresh +
+ wb_calc_thresh(wb, dtc->bg_thresh)) / 2;
}

if (dirty < setpoint) {
@@ -1140,15 +1149,11 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
}

-static void __wb_update_bandwidth(struct bdi_writeback *wb,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long wb_thresh,
- unsigned long wb_dirty,
+static void __wb_update_bandwidth(struct dirty_throttle_control *dtc,
unsigned long start_time,
bool update_ratelimit)
{
+ struct bdi_writeback *wb = dtc->wb;
unsigned long now = jiffies;
unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
@@ -1173,10 +1178,8 @@ static void __wb_update_bandwidth(struct bdi_writeback *wb,
goto snapshot;

if (update_ratelimit) {
- global_update_bandwidth(thresh, dirty, now);
- wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
- wb_thresh, wb_dirty,
- dirtied, elapsed);
+ global_update_bandwidth(dtc, now);
+ wb_update_dirty_ratelimit(dtc, dirtied, elapsed);
}
wb_update_write_bandwidth(wb, elapsed, written);

@@ -1188,7 +1191,9 @@ static void __wb_update_bandwidth(struct bdi_writeback *wb,

void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
{
- __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time, false);
+ struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };
+
+ __wb_update_bandwidth(&gdtc, start_time, false);
}

/*
@@ -1302,13 +1307,10 @@ static long wb_min_pause(struct bdi_writeback *wb,
return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
}

-static inline void wb_dirty_limits(struct bdi_writeback *wb,
- unsigned long dirty_thresh,
- unsigned long background_thresh,
- unsigned long *wb_dirty,
- unsigned long *wb_thresh,
+static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
unsigned long *wb_bg_thresh)
{
+ struct bdi_writeback *wb = dtc->wb;
unsigned long wb_reclaimable;

/*
@@ -1324,12 +1326,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
* wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
- *wb_thresh = wb_calc_thresh(wb, dirty_thresh);
+ dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh);

if (wb_bg_thresh)
- *wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh *
- background_thresh,
- dirty_thresh) : 0;
+ *wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh *
+ dtc->bg_thresh,
+ dtc->thresh) : 0;

/*
* In order to avoid the stacked BDI deadlock we need
@@ -1341,12 +1343,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
- if (*wb_thresh < 2 * wb_stat_error(wb)) {
+ if (dtc->wb_thresh < 2 * wb_stat_error(wb)) {
wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
- *wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
+ dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
} else {
wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
- *wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
+ dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
}
}

@@ -1361,10 +1363,9 @@ static void balance_dirty_pages(struct address_space *mapping,
struct bdi_writeback *wb,
unsigned long pages_dirtied)
{
+ struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+ struct dirty_throttle_control * const gdtc = &gdtc_stor;
unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
- unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */
- unsigned long background_thresh;
- unsigned long dirty_thresh;
long period;
long pause;
long max_pause;
@@ -1380,11 +1381,7 @@ static void balance_dirty_pages(struct address_space *mapping,

for (;;) {
unsigned long now = jiffies;
- unsigned long uninitialized_var(wb_thresh);
- unsigned long thresh;
- unsigned long uninitialized_var(wb_dirty);
- unsigned long dirty;
- unsigned long bg_thresh;
+ unsigned long dirty, thresh, bg_thresh;

/*
* Unstable writes are a feature of certain networked
@@ -1394,20 +1391,19 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
- nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);
+ gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);

- global_dirty_limits(&background_thresh, &dirty_thresh);
+ global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);

if (unlikely(strictlimit)) {
- wb_dirty_limits(wb, dirty_thresh, background_thresh,
- &wb_dirty, &wb_thresh, &bg_thresh);
+ wb_dirty_limits(gdtc, &bg_thresh);

- dirty = wb_dirty;
- thresh = wb_thresh;
+ dirty = gdtc->wb_dirty;
+ thresh = gdtc->wb_thresh;
} else {
- dirty = nr_dirty;
- thresh = dirty_thresh;
- bg_thresh = background_thresh;
+ dirty = gdtc->dirty;
+ thresh = gdtc->thresh;
+ bg_thresh = gdtc->bg_thresh;
}

/*
@@ -1431,31 +1427,25 @@ static void balance_dirty_pages(struct address_space *mapping,
wb_start_background_writeback(wb);

if (!strictlimit)
- wb_dirty_limits(wb, dirty_thresh, background_thresh,
- &wb_dirty, &wb_thresh, NULL);
+ wb_dirty_limits(gdtc, NULL);

- dirty_exceeded = (wb_dirty > wb_thresh) &&
- ((nr_dirty > dirty_thresh) || strictlimit);
+ dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
+ ((gdtc->dirty > gdtc->thresh) || strictlimit);
if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;

if (time_is_before_jiffies(wb->bw_time_stamp +
BANDWIDTH_INTERVAL)) {
spin_lock(&wb->list_lock);
- __wb_update_bandwidth(wb, dirty_thresh,
- background_thresh, nr_dirty,
- wb_thresh, wb_dirty, start_time,
- true);
+ __wb_update_bandwidth(gdtc, start_time, true);
spin_unlock(&wb->list_lock);
}

dirty_ratelimit = wb->dirty_ratelimit;
- pos_ratio = wb_position_ratio(wb, dirty_thresh,
- background_thresh, nr_dirty,
- wb_thresh, wb_dirty);
+ pos_ratio = wb_position_ratio(gdtc);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
- max_pause = wb_max_pause(wb, wb_dirty);
+ max_pause = wb_max_pause(wb, gdtc->wb_dirty);
min_pause = wb_min_pause(wb, max_pause,
task_ratelimit, dirty_ratelimit,
&nr_dirtied_pause);
@@ -1478,11 +1468,11 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
if (pause < min_pause) {
trace_balance_dirty_pages(bdi,
- dirty_thresh,
- background_thresh,
- nr_dirty,
- wb_thresh,
- wb_dirty,
+ gdtc->thresh,
+ gdtc->bg_thresh,
+ gdtc->dirty,
+ gdtc->wb_thresh,
+ gdtc->wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1507,11 +1497,11 @@ static void balance_dirty_pages(struct address_space *mapping,

pause:
trace_balance_dirty_pages(bdi,
- dirty_thresh,
- background_thresh,
- nr_dirty,
- wb_thresh,
- wb_dirty,
+ gdtc->thresh,
+ gdtc->bg_thresh,
+ gdtc->dirty,
+ gdtc->wb_thresh,
+ gdtc->wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1526,8 +1516,8 @@ static void balance_dirty_pages(struct address_space *mapping,
current->nr_dirtied_pause = nr_dirtied_pause;

/*
- * This is typically equal to (nr_dirty < dirty_thresh) and can
- * also keep "1000+ dd on a slow USB stick" under control.
+ * This is typically equal to (dirty < thresh) and can also
+ * keep "1000+ dd on a slow USB stick" under control.
*/
if (task_ratelimit)
break;
@@ -1542,7 +1532,7 @@ static void balance_dirty_pages(struct address_space *mapping,
* more page. However wb_dirty has accounting errors. So use
* the larger and more IO friendly wb_stat_error.
*/
- if (wb_dirty <= wb_stat_error(wb))
+ if (gdtc->wb_dirty <= wb_stat_error(wb))
break;

if (fatal_signal_pending(current))
@@ -1566,7 +1556,7 @@ static void balance_dirty_pages(struct address_space *mapping,
if (laptop_mode)
return;

- if (nr_reclaimable > background_thresh)
+ if (nr_reclaimable > gdtc->bg_thresh)
wb_start_background_writeback(wb);
}

--
2.4.0

2015-05-22 22:23:59

by Tejun Heo

Subject: [PATCH 07/19] writeback: add dirty_throttle_control->wb_bg_thresh

wb_bg_thresh is currently treated as a second-class citizen. It's
only used when BDI_CAP_STRICTLIMIT is set and balance_dirty_pages()
doesn't calculate it unless the cap is set. When the cap is set, the
calculated value is not passed around but instead recalculated
whenever it's used.

wb_position_ratio() calculates it by scaling wb_thresh proportionally
to bg_thresh / thresh. wb_update_dirty_ratelimit() uses
wb_calc_thresh() on bg_thresh, which should generally lead to a
similar result as the proportional scaling but can also be way off in
the presence of max/min_ratio settings.

Avoiding the wb_bg_thresh calculation saves us one u64 multiplication
and division when BDI_CAP_STRICTLIMIT is not set. Given that
balance_dirty_pages() is already ratelimited, this doesn't justify the
incurred extra complexity.

This patch adds wb_bg_thresh to dirty_throttle_control, makes
wb_dirty_limits() always calculate it, and updates the users to use the
pre-calculated value.
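
The proportional scaling that wb_dirty_limits() now always performs is, in
isolation, just wb_thresh * bg_thresh / thresh with a zero guard. A hedged
userspace sketch (plain 64-bit division standing in for div_u64(); the
sample values are made up):

/* Sketch of the wb_bg_thresh scaling: wb_thresh * bg_thresh / thresh. */
#include <stdint.h>
#include <stdio.h>

static unsigned long calc_wb_bg_thresh(unsigned long wb_thresh,
					unsigned long bg_thresh,
					unsigned long thresh)
{
	/* Guard against thresh == 0, as the patch does. */
	if (!thresh)
		return 0;
	return (unsigned long)(((uint64_t)wb_thresh * bg_thresh) / thresh);
}

int main(void)
{
	/* e.g. wb gets 1/4 of the global threshold, bg is half of thresh */
	printf("wb_bg_thresh = %lu\n", calc_wb_bg_thresh(250, 500, 1000));
	return 0;
}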

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 27 +++++++++++----------------
1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 126e3c8..3ec9223 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -134,6 +134,7 @@ struct dirty_throttle_control {

unsigned long wb_dirty; /* per-wb counterparts */
unsigned long wb_thresh;
+ unsigned long wb_bg_thresh;
};

#define GDTC_INIT(__wb) .wb = (__wb)
@@ -761,7 +762,6 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long wb_pos_ratio;
- unsigned long wb_bg_thresh;

if (dtc->wb_dirty < 8)
return min_t(long long, pos_ratio * 2,
@@ -770,9 +770,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
if (dtc->wb_dirty >= wb_thresh)
return 0;

- wb_bg_thresh = div_u64((u64)wb_thresh * dtc->bg_thresh,
- dtc->thresh);
- wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);
+ wb_setpoint = dirty_freerun_ceiling(wb_thresh,
+ dtc->wb_bg_thresh);

if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
return 0;
@@ -1104,15 +1103,14 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
*
* We rampup dirty_ratelimit forcibly if wb_dirty is low because
* it's possible that wb_thresh is close to zero due to inactivity
- * of backing device (see the implementation of wb_calc_thresh()).
+ * of backing device.
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = dtc->wb_dirty;
if (dtc->wb_dirty < 8)
setpoint = dtc->wb_dirty + 1;
else
- setpoint = (dtc->wb_thresh +
- wb_calc_thresh(wb, dtc->bg_thresh)) / 2;
+ setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2;
}

if (dirty < setpoint) {
@@ -1307,8 +1305,7 @@ static long wb_min_pause(struct bdi_writeback *wb,
return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
}

-static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
- unsigned long *wb_bg_thresh)
+static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
{
struct bdi_writeback *wb = dtc->wb;
unsigned long wb_reclaimable;
@@ -1327,11 +1324,8 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc,
* at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh);
-
- if (wb_bg_thresh)
- *wb_bg_thresh = dtc->thresh ? div_u64((u64)dtc->wb_thresh *
- dtc->bg_thresh,
- dtc->thresh) : 0;
+ dtc->wb_bg_thresh = dtc->thresh ?
+ div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

/*
* In order to avoid the stacked BDI deadlock we need
@@ -1396,10 +1390,11 @@ static void balance_dirty_pages(struct address_space *mapping,
global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);

if (unlikely(strictlimit)) {
- wb_dirty_limits(gdtc, &bg_thresh);
+ wb_dirty_limits(gdtc);

dirty = gdtc->wb_dirty;
thresh = gdtc->wb_thresh;
+ bg_thresh = gdtc->wb_bg_thresh;
} else {
dirty = gdtc->dirty;
thresh = gdtc->thresh;
@@ -1427,7 +1422,7 @@ static void balance_dirty_pages(struct address_space *mapping,
wb_start_background_writeback(wb);

if (!strictlimit)
- wb_dirty_limits(gdtc, NULL);
+ wb_dirty_limits(gdtc);

dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
((gdtc->dirty > gdtc->thresh) || strictlimit);
--
2.4.0

2015-05-22 22:28:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/19] writeback: make __wb_calc_thresh() take dirty_throttle_control

wb_calc_thresh() calculates wb_thresh by scaling thresh according to
the wb's portion in the system-wide write bandwidth. cgroup writeback
support would need to calculate wb_thresh against memcg domain too.
This patch renames wb_calc_thresh() to __wb_calc_thresh() and makes it
take dirty_throttle_control so that the function can later be updated
to calculate against different domains according to
dirty_throttle_control.

wb_calc_thresh() is now a thin wrapper around __wb_calc_thresh().

v2: The original version was incorrectly scaling dtc->dirty instead of
dtc->thresh. This was due to the extremely confusing function and
variable names. Added a rename patch and fixed this one.
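
As a rough illustration of the scaling __wb_calc_thresh() performs,
here is a stand-alone sketch (plain C): the numerator/denominator pair
stands in for fprop_fraction_percpu() and the min/max ratios for the
per-bdi knobs; the global bdi_min_ratio reservation is omitted and all
numbers are made up.

#include <stdint.h>
#include <stdio.h>

/* thresh scaled by the wb's fraction of domain writeout completions,
 * then clamped by the bdi's min/max ratio settings */
static unsigned long calc_wb_thresh(unsigned long thresh,
				    long numerator, long denominator,
				    unsigned long min_ratio,
				    unsigned long max_ratio)
{
	uint64_t wb_thresh = ((uint64_t)thresh * numerator) / denominator;

	wb_thresh += (thresh * min_ratio) / 100;	/* guaranteed floor */
	if (wb_thresh > (thresh * max_ratio) / 100)	/* hard ceiling */
		wb_thresh = (thresh * max_ratio) / 100;
	return (unsigned long)wb_thresh;
}

int main(void)
{
	/* wb did 1/4 of recent writeout, no min_ratio, max_ratio 50% */
	printf("%lu\n", calc_wb_thresh(1000, 1, 4, 0, 50));	/* -> 250 */
	return 0;
}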

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 3ec9223..2352c69 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -557,9 +557,8 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
}

/**
- * wb_calc_thresh - @wb's share of dirty throttling threshold
- * @wb: bdi_writeback to query
- * @dirty: global dirty limit in pages
+ * __wb_calc_thresh - @wb's share of dirty throttling threshold
+ * @dtc: dirty_throttle_context of interest
*
* Returns @wb's dirty limit in pages. The term "dirty" in the context of
* dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
@@ -578,9 +577,10 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
* The wb's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*/
-unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
+static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
struct wb_domain *dom = &global_wb_domain;
+ unsigned long thresh = dtc->thresh;
u64 wb_thresh;
long numerator, denominator;
unsigned long wb_min_ratio, wb_max_ratio;
@@ -588,14 +588,14 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
/*
* Calculate this BDI's share of the thresh ratio.
*/
- fprop_fraction_percpu(&dom->completions, &wb->completions,
+ fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
&numerator, &denominator);

wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
wb_thresh *= numerator;
do_div(wb_thresh, denominator);

- wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+ wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio);

wb_thresh += (thresh * wb_min_ratio) / 100;
if (wb_thresh > (thresh * wb_max_ratio) / 100)
@@ -604,6 +604,13 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
return wb_thresh;
}

+unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
+{
+ struct dirty_throttle_control gdtc = { GDTC_INIT(wb),
+ .thresh = thresh };
+ return __wb_calc_thresh(&gdtc);
+}
+
/*
* setpoint - dirty 3
* f(dirty) := 1.0 + (----------------)
@@ -1323,7 +1330,7 @@ static inline void wb_dirty_limits(struct dirty_throttle_control *dtc)
* wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
- dtc->wb_thresh = wb_calc_thresh(dtc->wb, dtc->thresh);
+ dtc->wb_thresh = __wb_calc_thresh(dtc);
dtc->wb_bg_thresh = dtc->thresh ?
div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0;

--
2.4.0

2015-05-22 22:24:11

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/19] writeback: add dirty_throttle_control->pos_ratio

wb_position_ratio() is used to calculate pos_ratio, which is used for
two purposes. wb_update_dirty_ratelimit() uses it to adjust
wb->[balanced_]dirty_ratelimit gradually and balance_dirty_pages() to
immediately adjust dirty_ratelimit right before applying it to
determine pause duration.

While wb_update_dirty_ratelimit() is separately rate limited from
balance_dirty_pages(), on the run where the ratelimit is updated, we
end up calculating pos_ratio twice with the same parameters.

This patch adds dirty_throttle_control->pos_ratio.
balance_dirty_pages() calculates it once per run and
wb_update_dirty_ratelimit() uses the value stored in
dirty_throttle_control.

This removes the duplicate calculation and also will help implementing
memcg wb_domain.
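
A minimal sketch of the caching (stand-alone C; all names and numbers
are hypothetical and the falloff curve is a toy linear one rather than
the real control-line polynomial): pos_ratio is computed once per
iteration, stored in the control structure, and both consumers read
the stored value.

#include <stdint.h>
#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

struct throttle_ctl {
	unsigned long dirty, setpoint, limit;
	unsigned long pos_ratio;		/* cached result */
};

/* toy stand-in for wb_position_ratio(): linear falloff past the setpoint */
static void position_ratio(struct throttle_ctl *c)
{
	if (c->dirty >= c->limit) {
		c->pos_ratio = 0;
		return;
	}
	c->pos_ratio = ((c->limit - c->dirty) << RATELIMIT_CALC_SHIFT) /
		       (c->limit - c->setpoint);
}

int main(void)
{
	struct throttle_ctl c = { .dirty = 900, .setpoint = 500, .limit = 1000 };
	unsigned long dirty_ratelimit = 4096;	/* pages/sec, made up */
	unsigned long task_rl;

	position_ratio(&c);			/* computed once per iteration */

	/* consumer 1: per-task ratelimit for pause calculation */
	task_rl = ((uint64_t)dirty_ratelimit * c.pos_ratio) >>
		  RATELIMIT_CALC_SHIFT;
	/* consumer 2: the bandwidth update would read c.pos_ratio as well */
	printf("pos_ratio=%lu task_ratelimit=%lu\n", c.pos_ratio, task_rl);
	return 0;
}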

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 36 +++++++++++++++++++++---------------
1 file changed, 21 insertions(+), 15 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 2352c69..fcebae7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -135,6 +135,8 @@ struct dirty_throttle_control {
unsigned long wb_dirty; /* per-wb counterparts */
unsigned long wb_thresh;
unsigned long wb_bg_thresh;
+
+ unsigned long pos_ratio;
};

#define GDTC_INIT(__wb) .wb = (__wb)
@@ -717,7 +719,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* card's wb_dirty may rush to many times higher than wb_setpoint.
* - the wb dirty thresh drops quickly due to change of JBOD workload
*/
-static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
+static void wb_position_ratio(struct dirty_throttle_control *dtc)
{
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
@@ -731,8 +733,10 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
long long pos_ratio; /* for scaling up/down the rate limit */
long x;

+ dtc->pos_ratio = 0;
+
if (unlikely(dtc->dirty >= limit))
- return 0;
+ return;

/*
* global setpoint
@@ -770,18 +774,20 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long wb_pos_ratio;

- if (dtc->wb_dirty < 8)
- return min_t(long long, pos_ratio * 2,
- 2 << RATELIMIT_CALC_SHIFT);
+ if (dtc->wb_dirty < 8) {
+ dtc->pos_ratio = min_t(long long, pos_ratio * 2,
+ 2 << RATELIMIT_CALC_SHIFT);
+ return;
+ }

if (dtc->wb_dirty >= wb_thresh)
- return 0;
+ return;

wb_setpoint = dirty_freerun_ceiling(wb_thresh,
dtc->wb_bg_thresh);

if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
- return 0;
+ return;

wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty,
wb_thresh);
@@ -807,7 +813,8 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
* is 2. We might want to tweak this if we observe the control
* system is too slow to adapt.
*/
- return min(pos_ratio, wb_pos_ratio);
+ dtc->pos_ratio = min(pos_ratio, wb_pos_ratio);
+ return;
}

/*
@@ -888,7 +895,7 @@ static unsigned long wb_position_ratio(struct dirty_throttle_control *dtc)
pos_ratio *= 8;
}

- return pos_ratio;
+ dtc->pos_ratio = pos_ratio;
}

static void wb_update_write_bandwidth(struct bdi_writeback *wb,
@@ -1009,7 +1016,6 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
unsigned long dirty_rate;
unsigned long task_ratelimit;
unsigned long balanced_dirty_ratelimit;
- unsigned long pos_ratio;
unsigned long step;
unsigned long x;

@@ -1019,12 +1025,11 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
*/
dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

- pos_ratio = wb_position_ratio(dtc);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
task_ratelimit = (u64)dirty_ratelimit *
- pos_ratio >> RATELIMIT_CALC_SHIFT;
+ dtc->pos_ratio >> RATELIMIT_CALC_SHIFT;
task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */

/*
@@ -1375,7 +1380,6 @@ static void balance_dirty_pages(struct address_space *mapping,
bool dirty_exceeded = false;
unsigned long task_ratelimit;
unsigned long dirty_ratelimit;
- unsigned long pos_ratio;
struct backing_dev_info *bdi = wb->bdi;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;
@@ -1433,6 +1437,9 @@ static void balance_dirty_pages(struct address_space *mapping,

dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) &&
((gdtc->dirty > gdtc->thresh) || strictlimit);
+
+ wb_position_ratio(gdtc);
+
if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;

@@ -1444,8 +1451,7 @@ static void balance_dirty_pages(struct address_space *mapping,
}

dirty_ratelimit = wb->dirty_ratelimit;
- pos_ratio = wb_position_ratio(gdtc);
- task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
+ task_ratelimit = ((u64)dirty_ratelimit * gdtc->pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = wb_max_pause(wb, gdtc->wb_dirty);
min_pause = wb_min_pause(wb, max_pause,
--
2.4.0

2015-05-22 22:24:06

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/19] writeback: add dirty_throttle_control->wb_completions

wb->completions measures the wb's proportional write bandwidth in
global_wb_domain and thus is naturally tied to the wb_domain. This
patch adds dirty_throttle_control->wb_completions which is initialized
to wb->completions by GDTC_INIT() and updates __wb_calc_thresh() to
use it instead of dereferencing wb->completions directly.

This will allow dirty_throttle_control to represent different
wb_domains and the matching wb completions.

This patch doesn't introduce any behavioral changes.
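
A minimal sketch of the indirection (stand-alone C, hypothetical
names): the control structure carries a pointer to whichever
completion counter belongs to its domain, and a designated-initializer
macro mirrors what GDTC_INIT() does, so a memcg variant can later
simply point at a different counter.

#include <stdio.h>

/* userspace stand-ins for fprop_local_percpu and bdi_writeback */
struct completions { unsigned long events; };

struct writeback {
	struct completions completions;		/* global-domain counter */
	struct completions memcg_completions;	/* future memcg counter */
};

struct throttle_ctl {
	struct writeback *wb;
	struct completions *wb_completions;	/* which counter to consult */
};

#define GDTC_INIT(__wb)	{ .wb = (__wb), \
			  .wb_completions = &(__wb)->completions }

int main(void)
{
	struct writeback wb = { .completions = { 42 } };
	struct throttle_ctl gdtc = GDTC_INIT(&wb);

	/* callers no longer dereference wb->completions directly */
	printf("%lu\n", gdtc.wb_completions->events);
	return 0;
}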

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index fcebae7..5b439fc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -127,6 +127,7 @@ struct wb_domain global_wb_domain;
/* consolidated parameters for balance_dirty_pages() and its subroutines */
struct dirty_throttle_control {
struct bdi_writeback *wb;
+ struct fprop_local_percpu *wb_completions;

unsigned long dirty; /* file_dirty + write + nfs */
unsigned long thresh; /* dirty threshold */
@@ -139,7 +140,8 @@ struct dirty_throttle_control {
unsigned long pos_ratio;
};

-#define GDTC_INIT(__wb) .wb = (__wb)
+#define GDTC_INIT(__wb) .wb = (__wb), \
+ .wb_completions = &(__wb)->completions

/*
* Length of period for aging writeout fractions of bdis. This is an
@@ -590,7 +592,7 @@ static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
/*
* Calculate this BDI's share of the thresh ratio.
*/
- fprop_fraction_percpu(&dom->completions, &dtc->wb->completions,
+ fprop_fraction_percpu(&dom->completions, dtc->wb_completions,
&numerator, &denominator);

wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100;
--
2.4.0

2015-05-22 22:27:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/19] writeback: add dirty_throttle_control->dom

Currently all dirty throttle operations use global_wb_domain; however,
cgroup writeback support requires considering per-memcg wb_domain too.
This patch adds dirty_throttle_control->dom and updates functions
which are directly using global_wb_domain to use it instead.

As this makes global_update_bandwidth() a misnomer, the function is
renamed to domain_update_bandwidth().

This patch doesn't introduce any behavioral changes.
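
The accessor follows the usual compile-time specialization pattern:
when cgroup writeback is configured out, ->dom doesn't exist and the
helper collapses to the global domain, which the compiler folds to a
constant. A stand-alone sketch (the SKETCH_CGROUP_WRITEBACK macro is
only a stand-in for the real config option):

#include <stdio.h>

struct wb_domain { const char *name; };

static struct wb_domain global_wb_domain = { "global" };

struct throttle_ctl {
#ifdef SKETCH_CGROUP_WRITEBACK
	struct wb_domain *dom;
#endif
	unsigned long thresh;
};

#ifdef SKETCH_CGROUP_WRITEBACK
#define CTL_INIT	{ .dom = &global_wb_domain, .thresh = 100 }
#else
#define CTL_INIT	{ .thresh = 100 }
#endif

static struct wb_domain *ctl_dom(struct throttle_ctl *ctl)
{
#ifdef SKETCH_CGROUP_WRITEBACK
	return ctl->dom;		/* per-control domain (global or memcg) */
#else
	(void)ctl;
	return &global_wb_domain;	/* only one possible domain */
#endif
}

int main(void)
{
	struct throttle_ctl ctl = CTL_INIT;

	printf("domain: %s\n", ctl_dom(&ctl)->name);
	return 0;
}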

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5b439fc..38d45d8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -126,6 +126,9 @@ struct wb_domain global_wb_domain;

/* consolidated parameters for balance_dirty_pages() and its subroutines */
struct dirty_throttle_control {
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct wb_domain *dom;
+#endif
struct bdi_writeback *wb;
struct fprop_local_percpu *wb_completions;

@@ -140,7 +143,7 @@ struct dirty_throttle_control {
unsigned long pos_ratio;
};

-#define GDTC_INIT(__wb) .wb = (__wb), \
+#define DTC_INIT_COMMON(__wb) .wb = (__wb), \
.wb_completions = &(__wb)->completions

/*
@@ -152,6 +155,14 @@ struct dirty_throttle_control {

#ifdef CONFIG_CGROUP_WRITEBACK

+#define GDTC_INIT(__wb) .dom = &global_wb_domain, \
+ DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+ return dtc->dom;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -181,6 +192,13 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,

#else /* CONFIG_CGROUP_WRITEBACK */

+#define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb)
+
+static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
+{
+ return &global_wb_domain;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -583,7 +601,7 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
*/
static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
{
- struct wb_domain *dom = &global_wb_domain;
+ struct wb_domain *dom = dtc_dom(dtc);
unsigned long thresh = dtc->thresh;
u64 wb_thresh;
long numerator, denominator;
@@ -952,7 +970,7 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,

static void update_dirty_limit(struct dirty_throttle_control *dtc)
{
- struct wb_domain *dom = &global_wb_domain;
+ struct wb_domain *dom = dtc_dom(dtc);
unsigned long thresh = dtc->thresh;
unsigned long limit = dom->dirty_limit;

@@ -979,10 +997,10 @@ static void update_dirty_limit(struct dirty_throttle_control *dtc)
dom->dirty_limit = limit;
}

-static void global_update_bandwidth(struct dirty_throttle_control *dtc,
+static void domain_update_bandwidth(struct dirty_throttle_control *dtc,
unsigned long now)
{
- struct wb_domain *dom = &global_wb_domain;
+ struct wb_domain *dom = dtc_dom(dtc);

/*
* check locklessly first to optimize away locking for the most time
@@ -1190,7 +1208,7 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *dtc,
goto snapshot;

if (update_ratelimit) {
- global_update_bandwidth(dtc, now);
+ domain_update_bandwidth(dtc, now);
wb_update_dirty_ratelimit(dtc, dirtied, elapsed);
}
wb_update_write_bandwidth(wb, elapsed, written);
--
2.4.0

2015-05-22 22:27:26

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/19] writeback: make __wb_writeout_inc() and hard_dirty_limit() take wb_domain as a parameter

Currently __wb_writeout_inc() and hard_dirty_limit() assume
global_wb_domain; however, cgroup writeback support requires
considering per-memcg wb_domain too.

This patch separates out domain-specific part of __wb_writeout_inc()
into wb_domain_writeout_inc() which takes wb_domain as a parameter and
adds the parameter to hard_dirty_limit(). This will allow these two
functions to handle per-memcg wb_domains.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 38d45d8..a4d0cee 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -445,17 +445,12 @@ static unsigned long wp_next_time(unsigned long cur_time)
return cur_time;
}

-/*
- * Increment the wb's writeout completion count and the global writeout
- * completion count. Called from test_clear_page_writeback().
- */
-static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+static void wb_domain_writeout_inc(struct wb_domain *dom,
+ struct fprop_local_percpu *completions,
+ unsigned int max_prop_frac)
{
- struct wb_domain *dom = &global_wb_domain;
-
- __inc_wb_stat(wb, WB_WRITTEN);
- __fprop_inc_percpu_max(&dom->completions, &wb->completions,
- wb->bdi->max_prop_frac);
+ __fprop_inc_percpu_max(&dom->completions, completions,
+ max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(dom->period_time)) {
/*
@@ -469,6 +464,17 @@ static inline void __wb_writeout_inc(struct bdi_writeback *wb)
}
}

+/*
+ * Increment @wb's writeout completion count and the global writeout
+ * completion count. Called from test_clear_page_writeback().
+ */
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
+{
+ __inc_wb_stat(wb, WB_WRITTEN);
+ wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
+ wb->bdi->max_prop_frac);
+}
+
void wb_writeout_inc(struct bdi_writeback *wb)
{
unsigned long flags;
@@ -571,10 +577,9 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh,
return (thresh + bg_thresh) / 2;
}

-static unsigned long hard_dirty_limit(unsigned long thresh)
+static unsigned long hard_dirty_limit(struct wb_domain *dom,
+ unsigned long thresh)
{
- struct wb_domain *dom = &global_wb_domain;
-
return max(thresh, dom->dirty_limit);
}

@@ -744,7 +749,7 @@ static void wb_position_ratio(struct dirty_throttle_control *dtc)
struct bdi_writeback *wb = dtc->wb;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = hard_dirty_limit(dtc->thresh);
+ unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
unsigned long wb_thresh = dtc->wb_thresh;
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
@@ -1029,7 +1034,7 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
struct bdi_writeback *wb = dtc->wb;
unsigned long dirty = dtc->dirty;
unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh);
- unsigned long limit = hard_dirty_limit(dtc->thresh);
+ unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh);
unsigned long setpoint = (freerun + limit) / 2;
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long dirty_ratelimit = wb->dirty_ratelimit;
@@ -1681,7 +1686,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)

for ( ; ; ) {
global_dirty_limits(&background_thresh, &dirty_thresh);
- dirty_thresh = hard_dirty_limit(dirty_thresh);
+ dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);

/*
* Boost the allowable dirty threshold a bit for page
--
2.4.0

2015-05-22 22:27:03

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/19] writeback: separate out domain_dirty_limits()

global_dirty_limits() calculates thresh and bg_thresh (confusingly
called *pdirty and *pbackground in the function) assuming
global_wb_domain; however, cgroup writeback support requires
considering per-memcg wb_domain too.

This patch separates out domain_dirty_limits(), which takes
dirty_throttle_control, from global_dirty_limits(). As the thresh and
bg_thresh calculation needs the amount of dirtyable memory in the
domain, dirty_throttle_control->avail is added. The new function
calculates the two thresholds and stores them directly in the
dirty_throttle_control.

Also, memcg domains can't follow vm_dirty_bytes and
dirty_background_bytes settings directly. If those are set and
domain_dirty_limits() is invoked for a !global domain, the settings
are translated to ratios by scaling them against globally available
memory. dirty_throttle_control->gdtc is added to enable this when
CONFIG_CGROUP_WRITEBACK is enabled.

global_dirty_limits() is now a thin wrapper around
domain_dirty_limits() and balance_dirty_pages() is updated to use the
new function too.

This patch doesn't introduce any behavioral changes.
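
The byte-to-ratio translation can be sketched stand-alone as follows
(plain C; PAGE_SIZE, the rounding, and the clamp at 100% mirror the
hunk below, while the example numbers are made up):

#include <stdio.h>

#define PAGE_SIZE		4096UL
#define DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/* translate a vm_dirty_bytes-style setting into a ratio usable by a
 * memcg domain, by scaling against globally available memory (pages)
 * and clamping at 100% */
static unsigned long bytes_to_ratio(unsigned long bytes,
				    unsigned long global_avail_pages)
{
	unsigned long ratio;

	ratio = DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 / global_avail_pages;
	return ratio < 100 ? ratio : 100;
}

int main(void)
{
	unsigned long global_avail = 4UL << 20;	/* 16G of dirtyable memory, in pages */

	/* e.g. vm.dirty_bytes = 2G -> 12% of the memcg domain */
	printf("%lu%%\n", bytes_to_ratio(2UL << 30, global_avail));
	return 0;
}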

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 111 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 86 insertions(+), 25 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a4d0cee..c8ac8ce 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -128,10 +128,12 @@ struct wb_domain global_wb_domain;
struct dirty_throttle_control {
#ifdef CONFIG_CGROUP_WRITEBACK
struct wb_domain *dom;
+ struct dirty_throttle_control *gdtc; /* only set in memcg dtc's */
#endif
struct bdi_writeback *wb;
struct fprop_local_percpu *wb_completions;

+ unsigned long avail; /* dirtyable */
unsigned long dirty; /* file_dirty + write + nfs */
unsigned long thresh; /* dirty threshold */
unsigned long bg_thresh; /* dirty background threshold */
@@ -157,12 +159,18 @@ struct dirty_throttle_control {

#define GDTC_INIT(__wb) .dom = &global_wb_domain, \
DTC_INIT_COMMON(__wb)
+#define GDTC_INIT_NO_WB .dom = &global_wb_domain

static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
{
return dtc->dom;
}

+static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc)
+{
+ return mdtc->gdtc;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -193,12 +201,18 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,
#else /* CONFIG_CGROUP_WRITEBACK */

#define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb)
+#define GDTC_INIT_NO_WB

static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
{
return &global_wb_domain;
}

+static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc)
+{
+ return NULL;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -303,42 +317,88 @@ static unsigned long global_dirtyable_memory(void)
return x + 1; /* Ensure that we never return 0 */
}

-/*
- * global_dirty_limits - background-writeback and dirty-throttling thresholds
+/**
+ * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain
+ * @dtc: dirty_throttle_control of interest
*
- * Calculate the dirty thresholds based on sysctl parameters
- * - vm.dirty_background_ratio or vm.dirty_background_bytes
- * - vm.dirty_ratio or vm.dirty_bytes
- * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
+ * Calculate @dtc->thresh and ->bg_thresh considering
+ * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller
+ * must ensure that @dtc->avail is set before calling this function. The
+ * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and
* real-time tasks.
*/
-void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
-{
- const unsigned long available_memory = global_dirtyable_memory();
- unsigned long background;
- unsigned long dirty;
+static void domain_dirty_limits(struct dirty_throttle_control *dtc)
+{
+ const unsigned long available_memory = dtc->avail;
+ struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc);
+ unsigned long bytes = vm_dirty_bytes;
+ unsigned long bg_bytes = dirty_background_bytes;
+ unsigned long ratio = vm_dirty_ratio;
+ unsigned long bg_ratio = dirty_background_ratio;
+ unsigned long thresh;
+ unsigned long bg_thresh;
struct task_struct *tsk;

- if (vm_dirty_bytes)
- dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
+ /* gdtc is !NULL iff @dtc is for memcg domain */
+ if (gdtc) {
+ unsigned long global_avail = gdtc->avail;
+
+ /*
+ * The byte settings can't be applied directly to memcg
+ * domains. Convert them to ratios by scaling against
+ * globally available memory.
+ */
+ if (bytes)
+ ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 /
+ global_avail, 100UL);
+ if (bg_bytes)
+ bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 /
+ global_avail, 100UL);
+ bytes = bg_bytes = 0;
+ }
+
+ if (bytes)
+ thresh = DIV_ROUND_UP(bytes, PAGE_SIZE);
else
- dirty = (vm_dirty_ratio * available_memory) / 100;
+ thresh = (ratio * available_memory) / 100;

- if (dirty_background_bytes)
- background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
+ if (bg_bytes)
+ bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE);
else
- background = (dirty_background_ratio * available_memory) / 100;
+ bg_thresh = (bg_ratio * available_memory) / 100;

- if (background >= dirty)
- background = dirty / 2;
+ if (bg_thresh >= thresh)
+ bg_thresh = thresh / 2;
tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
- background += background / 4;
- dirty += dirty / 4;
+ bg_thresh += bg_thresh / 4;
+ thresh += thresh / 4;
}
- *pbackground = background;
- *pdirty = dirty;
- trace_global_dirty_state(background, dirty);
+ dtc->thresh = thresh;
+ dtc->bg_thresh = bg_thresh;
+
+ /* we should eventually report the domain in the TP */
+ if (!gdtc)
+ trace_global_dirty_state(bg_thresh, thresh);
+}
+
+/**
+ * global_dirty_limits - background-writeback and dirty-throttling thresholds
+ * @pbackground: out parameter for bg_thresh
+ * @pdirty: out parameter for thresh
+ *
+ * Calculate bg_thresh and thresh for global_wb_domain. See
+ * domain_dirty_limits() for details.
+ */
+void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
+{
+ struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
+
+ gdtc.avail = global_dirtyable_memory();
+ domain_dirty_limits(&gdtc);
+
+ *pbackground = gdtc.bg_thresh;
+ *pdirty = gdtc.thresh;
}

/**
@@ -1421,9 +1481,10 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
+ gdtc->avail = global_dirtyable_memory();
gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK);

- global_dirty_limits(&gdtc->bg_thresh, &gdtc->thresh);
+ domain_dirty_limits(gdtc);

if (unlikely(strictlimit)) {
wb_dirty_limits(gdtc);
--
2.4.0

2015-05-22 22:26:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/19] writeback: move over_bground_thresh() to mm/page-writeback.c

and rename it to wb_over_bg_thresh(). The function is closely tied to
the dirty throttling mechanism implemented in page-writeback.c. This
relocation will allow future updates necessary for cgroup writeback
support.

While at it, add function comment.

This is pure reorganization and doesn't introduce any behavioral
changes.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
fs/fs-writeback.c | 20 ++------------------
include/linux/writeback.h | 1 +
mm/page-writeback.c | 23 +++++++++++++++++++++++
3 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 51c8a5b..da35587 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1071,22 +1071,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
return nr_pages - work.nr_pages;
}

-static bool over_bground_thresh(struct bdi_writeback *wb)
-{
- unsigned long background_thresh, dirty_thresh;
-
- global_dirty_limits(&background_thresh, &dirty_thresh);
-
- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh)
- return true;
-
- if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh))
- return true;
-
- return false;
-}
-
/*
* Explicit flushing or periodic writeback of "old" data.
*
@@ -1136,7 +1120,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh(wb))
+ if (work->for_background && !wb_over_bg_thresh(wb))
break;

/*
@@ -1227,7 +1211,7 @@ static unsigned long get_nr_dirty_pages(void)

static long wb_check_background_flush(struct bdi_writeback *wb)
{
- if (over_bground_thresh(wb)) {
+ if (wb_over_bg_thresh(wb)) {

struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5fdd4e1..b57c2786 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -207,6 +207,7 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh);
void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time);
void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
+bool wb_over_bg_thresh(struct bdi_writeback *wb);

typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
void *data);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c8ac8ce..9d9a896 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1740,6 +1740,29 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);

+/**
+ * wb_over_bg_thresh - does @wb need to be written back?
+ * @wb: bdi_writeback of interest
+ *
+ * Determines whether background writeback should keep writing @wb or it's
+ * clean enough. Returns %true if writeback should continue.
+ */
+bool wb_over_bg_thresh(struct bdi_writeback *wb)
+{
+ unsigned long background_thresh, dirty_thresh;
+
+ global_dirty_limits(&background_thresh, &dirty_thresh);
+
+ if (global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+ return true;
+
+ if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh))
+ return true;
+
+ return false;
+}
+
void throttle_vm_writeout(gfp_t gfp_mask)
{
unsigned long background_thresh;
--
2.4.0

2015-05-22 22:24:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/19] writeback: update wb_over_bg_thresh() to use wb_domain aware operations

wb_over_bg_thresh() currently uses global_dirty_limits() and
wb_dirty_limit() both of which are wrappers around operations which
take dirty_throttle_control. For cgroup writeback support, the
function will be updated to also consider memcg wb_domains which
requires the context information carried in dirty_throttle_control.

This patch updates wb_over_bg_thresh() so that it uses the underlying
wb_domain aware operations directly and builds the global
dirty_throttle_control in the process.

This patch doesn't introduce any behavioral changes.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
mm/page-writeback.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9d9a896..a7ba5ce 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1749,15 +1749,22 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
*/
bool wb_over_bg_thresh(struct bdi_writeback *wb)
{
- unsigned long background_thresh, dirty_thresh;
+ struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+ struct dirty_throttle_control * const gdtc = &gdtc_stor;

- global_dirty_limits(&background_thresh, &dirty_thresh);
+ /*
+ * Similar to balance_dirty_pages() but ignores pages being written
+ * as we're trying to decide whether to put more under writeback.
+ */
+ gdtc->avail = global_dirtyable_memory();
+ gdtc->dirty = global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS);
+ domain_dirty_limits(gdtc);

- if (global_page_state(NR_FILE_DIRTY) +
- global_page_state(NR_UNSTABLE_NFS) > background_thresh)
+ if (gdtc->dirty > gdtc->bg_thresh)
return true;

- if (wb_stat(wb, WB_RECLAIMABLE) > wb_calc_thresh(wb, background_thresh))
+ if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc))
return true;

return false;
--
2.4.0

2015-05-22 22:26:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/19] writeback: implement memcg wb_domain

Dirtyable memory is distributed to a wb (bdi_writeback) according to
the relative bandwidth the wb is writing out in the whole system.
This distribution is global - each wb is measured against all other
wb's and gets the proportionately sized portion of the memory in the
whole system.

For cgroup writeback, the amount of dirtyable memory is scoped by
memcg and thus each wb would need to be measured and controlled in its
memcg. IOW, a wb will belong to two writeback domains - the global
and memcg domains.

The previous patches laid the groundwork to support the two wb_domains
and this patch implements memcg wb_domain. memcg->cgwb_domain is
initialized on css online and destroyed on css release,
wb->memcg_completions is added, and __wb_writeout_inc() is updated to
increment completions against both global and memcg wb_domains.

The following patches will update balance_dirty_pages() and its
subroutines to actually consider memcg wb_domain for throttling.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
include/linux/backing-dev-defs.h | 1 +
include/linux/memcontrol.h | 12 +++++++++++-
include/linux/writeback.h | 3 +++
mm/backing-dev.c | 9 ++++++++-
mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++++++++
mm/page-writeback.c | 25 +++++++++++++++++++++++++
6 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 97a92fa..8d470b7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -118,6 +118,7 @@ struct bdi_writeback {

#ifdef CONFIG_CGROUP_WRITEBACK
struct percpu_ref refcnt; /* used only for !root wb's */
+ struct fprop_local_percpu memcg_completions;
struct cgroup_subsys_state *memcg_css; /* the associated memcg */
struct cgroup_subsys_state *blkcg_css; /* and blkcg */
struct list_head memcg_node; /* anchored at memcg->cgwb_list */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 662a953..e3177be 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -389,8 +389,18 @@ enum {
};

#ifdef CONFIG_CGROUP_WRITEBACK
+
struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
-#endif
+struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
+{
+ return NULL;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */

struct sock;
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b57c2786..04a3786 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -167,6 +167,9 @@ static inline void laptop_sync_completion(void) { }
void throttle_vm_writeout(gfp_t gfp_mask);
bool zone_dirty_ok(struct zone *zone);
int wb_domain_init(struct wb_domain *dom, gfp_t gfp);
+#ifdef CONFIG_CGROUP_WRITEBACK
+void wb_domain_exit(struct wb_domain *dom);
+#endif

extern struct wb_domain global_wb_domain;

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9c8b7b5..84ebf7c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -482,6 +482,7 @@ static void cgwb_release_workfn(struct work_struct *work)
css_put(wb->blkcg_css);
wb_congested_put(wb->congested);

+ fprop_local_destroy_percpu(&wb->memcg_completions);
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
kfree_rcu(wb, rcu);
@@ -548,9 +549,13 @@ static int cgwb_create(struct backing_dev_info *bdi,
if (ret)
goto err_wb_exit;

+ ret = fprop_local_init_percpu(&wb->memcg_completions, gfp);
+ if (ret)
+ goto err_ref_exit;
+
wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
if (!wb->congested)
- goto err_ref_exit;
+ goto err_fprop_exit;

wb->memcg_css = memcg_css;
wb->blkcg_css = blkcg_css;
@@ -587,6 +592,8 @@ static int cgwb_create(struct backing_dev_info *bdi,

err_put_congested:
wb_congested_put(wb->congested);
+err_fprop_exit:
+ fprop_local_destroy_percpu(&wb->memcg_completions);
err_ref_exit:
percpu_ref_exit(&wb->refcnt);
err_wb_exit:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d7d270a..436fbc2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -345,6 +345,7 @@ struct mem_cgroup {

#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
+ struct wb_domain cgwb_domain;
#endif

/* List of events which userspace want to receive */
@@ -3975,6 +3976,37 @@ struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
return &memcg->cgwb_list;
}

+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+ return wb_domain_init(&memcg->cgwb_domain, gfp);
+}
+
+static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
+{
+ wb_domain_exit(&memcg->cgwb_domain);
+}
+
+struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+
+ if (!memcg->css.parent)
+ return NULL;
+
+ return &memcg->cgwb_domain;
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
+{
+ return 0;
+}
+
+static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
+{
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

/*
@@ -4361,9 +4393,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
if (!memcg->stat)
goto out_free;
+
+ if (memcg_wb_domain_init(memcg, GFP_KERNEL))
+ goto out_free_stat;
+
spin_lock_init(&memcg->pcp_counter_lock);
return memcg;

+out_free_stat:
+ free_percpu(memcg->stat);
out_free:
kfree(memcg);
return NULL;
@@ -4390,6 +4428,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
free_mem_cgroup_per_zone_info(memcg, node);

free_percpu(memcg->stat);
+ memcg_wb_domain_exit(memcg);
kfree(memcg);
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a7ba5ce..a146e33 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -171,6 +171,11 @@ static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *m
return mdtc->gdtc;
}

+static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb)
+{
+ return &wb->memcg_completions;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -213,6 +218,11 @@ static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *m
return NULL;
}

+static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb)
+{
+ return NULL;
+}
+
static void wb_min_max_ratio(struct bdi_writeback *wb,
unsigned long *minp, unsigned long *maxp)
{
@@ -530,9 +540,16 @@ static void wb_domain_writeout_inc(struct wb_domain *dom,
*/
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
+ struct wb_domain *cgdom;
+
__inc_wb_stat(wb, WB_WRITTEN);
wb_domain_writeout_inc(&global_wb_domain, &wb->completions,
wb->bdi->max_prop_frac);
+
+ cgdom = mem_cgroup_wb_domain(wb);
+ if (cgdom)
+ wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb),
+ wb->bdi->max_prop_frac);
}

void wb_writeout_inc(struct bdi_writeback *wb)
@@ -583,6 +600,14 @@ int wb_domain_init(struct wb_domain *dom, gfp_t gfp)
return fprop_global_init(&dom->completions, gfp);
}

+#ifdef CONFIG_CGROUP_WRITEBACK
+void wb_domain_exit(struct wb_domain *dom)
+{
+ del_timer_sync(&dom->period_timer);
+ fprop_global_destroy(&dom->completions);
+}
+#endif
+
/*
* bdi_min_ratio keeps the sum of the minimum dirty shares of all
* registered backing devices, which, for obvious reasons, can not
--
2.4.0

2015-05-22 22:24:23

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/19] writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes

The amount of memory available to a memcg wb_domain can change as the
memcg configuration changes. A domain's ->dirty_limit exists to
smooth out sudden drops in dirty threshold; however, when a domain's
size actually drops significantly, it hinders the dirty throttling
from adjusting to the new configuration, leading to unexpected
behaviors including unnecessary OOM kills.

This patch resolves the issue by adding wb_domain_size_changed() which
resets ->dirty_limit[_tstmp] and making memcg call it on configuration
changes.
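
The effect can be illustrated with a toy stand-alone model (plain C;
the decay rate and all numbers are made up, and the real
update_dirty_limit() also factors in write bandwidth): without the
reset, hard_dirty_limit() keeps returning a limit sized for the old,
larger domain.

#include <stdio.h>

struct domain { unsigned long dirty_limit; };

/* toy version of update_dirty_limit(): follow thresh up immediately,
 * decay towards a lower thresh only gradually */
static void update_dirty_limit(struct domain *dom, unsigned long thresh)
{
	if (thresh > dom->dirty_limit)
		dom->dirty_limit = thresh;
	else
		dom->dirty_limit -= (dom->dirty_limit - thresh) / 8;
}

static unsigned long hard_dirty_limit(struct domain *dom, unsigned long thresh)
{
	return thresh > dom->dirty_limit ? thresh : dom->dirty_limit;
}

int main(void)
{
	struct domain dom = { .dirty_limit = 100000 };	/* sized for a big memcg */
	unsigned long new_thresh = 1000;		/* memcg limit was lowered */

	update_dirty_limit(&dom, new_thresh);
	printf("stale limit: %lu\n", hard_dirty_limit(&dom, new_thresh));

	dom.dirty_limit = 0;	/* what wb_domain_size_changed() effectively does */
	update_dirty_limit(&dom, new_thresh);
	printf("after reset: %lu\n", hard_dirty_limit(&dom, new_thresh));
	return 0;
}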

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
include/linux/writeback.h | 20 ++++++++++++++++++++
mm/memcontrol.c | 12 ++++++++++++
2 files changed, 32 insertions(+)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 04a3786..3b73e97 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -132,6 +132,26 @@ struct wb_domain {
unsigned long dirty_limit;
};

+/**
+ * wb_domain_size_changed - memory available to a wb_domain has changed
+ * @dom: wb_domain of interest
+ *
+ * This function should be called when the amount of memory available to
+ * @dom has changed. It resets @dom's dirty limit parameters to prevent
+ * the past values which don't match the current configuration from skewing
+ * dirty throttling. Without this, when memory size of a wb_domain is
+ * greatly reduced, the dirty throttling logic may allow too many pages to
+ * be dirtied leading to consecutive unnecessary OOMs and may get stuck in
+ * that situation.
+ */
+static inline void wb_domain_size_changed(struct wb_domain *dom)
+{
+ spin_lock(&dom->lock);
+ dom->dirty_limit_tstamp = jiffies;
+ dom->dirty_limit = 0;
+ spin_unlock(&dom->lock);
+}
+
/*
* fs/fs-writeback.c
*/
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 436fbc2..8fbd501 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3986,6 +3986,11 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
wb_domain_exit(&memcg->cgwb_domain);
}

+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+ wb_domain_size_changed(&memcg->cgwb_domain);
+}
+
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
@@ -4007,6 +4012,10 @@ static void memcg_wb_domain_exit(struct mem_cgroup *memcg)
{
}

+static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
+{
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

/*
@@ -4605,6 +4614,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
memcg->low = 0;
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
+ memcg_wb_domain_size_changed(memcg);
}

#ifdef CONFIG_MMU
@@ -5342,6 +5352,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,

memcg->high = high;

+ memcg_wb_domain_size_changed(memcg);
return nbytes;
}

@@ -5374,6 +5385,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (err)
return err;

+ memcg_wb_domain_size_changed(memcg);
return nbytes;
}

--
2.4.0

2015-05-22 22:25:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/19] writeback: implement memcg writeback domain based throttling

While cgroup writeback support now connects memcg and blkcg so that
writeback IOs are properly attributed and controlled, the IO back
pressure propagation mechanism implemented in balance_dirty_pages()
and its subroutines wasn't aware of cgroup writeback.

Processes belonging to a memcg may have access to only a subset of the
total memory available in the system. Not factoring this into dirty
throttling rendered it completely ineffective for processes under
memcg limits, and memcg ended up building a separate ad-hoc degenerate
mechanism directly into vmscan code to limit page dirtying.

The previous patches updated balance_dirty_pages() and its subroutines
so that they can deal with multiple wb_domain's (writeback domains)
and defined per-memcg wb_domain. Processes belonging to a non-root
memcg are bound to two wb_domains, global wb_domain and memcg
wb_domain, and should be throttled according to IO pressures from both
domains. This patch updates dirty throttling code so that it repeats
similar calculations for the two domains - the differences between the
two are few and minor - and applies the lower of the two sets of
resulting constraints.

wb_over_bg_thresh(), which controls when background writeback
terminates, is also updated to consider both global and memcg
wb_domains. It returns true if dirty is over bg_thresh for either
domain.

This makes the dirty throttling mechanism operational for memcg
domains including writeback-bandwidth-proportional dirty page
distribution inside them but the ad-hoc memcg throttling mechanism in
vmscan is still in place. The next patch will rip it out.
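
The "lower of the two sets of constraints" boils down to throttling
against whichever domain reports the smaller pos_ratio. A stand-alone
sketch of that selection (plain C, hypothetical names and numbers):

#include <stdint.h>
#include <stdio.h>

#define RATELIMIT_CALC_SHIFT	10

struct dom_ctl {
	const char *name;
	unsigned long pos_ratio;	/* filled in by a position-ratio pass */
};

int main(void)
{
	struct dom_ctl gdtc = { "global", 900 };	/* plenty of global headroom */
	struct dom_ctl mdtc = { "memcg",  300 };	/* memcg domain nearly full */
	struct dom_ctl *sdtc = &gdtc;
	unsigned long dirty_ratelimit = 8192;		/* pages/sec, made up */
	unsigned long task_rl;

	/* pick the more restrictive domain, as the patch does for @sdtc */
	if (mdtc.pos_ratio < gdtc.pos_ratio)
		sdtc = &mdtc;

	task_rl = ((uint64_t)dirty_ratelimit * sdtc->pos_ratio) >>
		  RATELIMIT_CALC_SHIFT;
	printf("throttling against %s domain, task_ratelimit=%lu\n",
	       sdtc->name, task_rl);
	return 0;
}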

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
---
include/linux/memcontrol.h | 9 +++
mm/memcontrol.c | 43 ++++++++++++
mm/page-writeback.c | 158 ++++++++++++++++++++++++++++++++++++++-------
3 files changed, 188 insertions(+), 22 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e3177be..c3eb19e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -392,6 +392,8 @@ enum {

struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb);
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
+ unsigned long *pdirty, unsigned long *pwriteback);

#else /* CONFIG_CGROUP_WRITEBACK */

@@ -400,6 +402,13 @@ static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
return NULL;
}

+static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb,
+ unsigned long *pavail,
+ unsigned long *pdirty,
+ unsigned long *pwriteback)
+{
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

struct sock;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8fbd501..7bde293 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4001,6 +4001,49 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
return &memcg->cgwb_domain;
}

+/**
+ * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
+ * @wb: bdi_writeback in question
+ * @pavail: out parameter for number of available pages
+ * @pdirty: out parameter for number of dirty pages
+ * @pwriteback: out parameter for number of pages under writeback
+ *
+ * Determine the numbers of available, dirty, and writeback pages in @wb's
+ * memcg. Dirty and writeback are self-explanatory. Available is a bit
+ * more involved.
+ *
+ * A memcg's headroom is "min(max, high) - used". The available memory is
+ * calculated as the lowest headroom of itself and the ancestors plus the
+ * number of pages already being used for file pages. Note that this
+ * doesn't consider the actual amount of available memory in the system.
+ * The caller should further cap *@pavail accordingly.
+ */
+void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail,
+ unsigned long *pdirty, unsigned long *pwriteback)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
+ struct mem_cgroup *parent;
+ unsigned long head_room = PAGE_COUNTER_MAX;
+ unsigned long file_pages;
+
+ *pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY);
+
+ /* this should eventually include NR_UNSTABLE_NFS */
+ *pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK);
+
+ file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
+ (1 << LRU_ACTIVE_FILE));
+ while ((parent = parent_mem_cgroup(memcg))) {
+ unsigned long ceiling = min(memcg->memory.limit, memcg->high);
+ unsigned long used = page_counter_read(&memcg->memory);
+
+ head_room = min(head_room, ceiling - min(ceiling, used));
+ memcg = parent;
+ }
+
+ *pavail = file_pages + head_room;
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */

static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a146e33..e890335 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -160,6 +160,14 @@ struct dirty_throttle_control {
#define GDTC_INIT(__wb) .dom = &global_wb_domain, \
DTC_INIT_COMMON(__wb)
#define GDTC_INIT_NO_WB .dom = &global_wb_domain
+#define MDTC_INIT(__wb, __gdtc) .dom = mem_cgroup_wb_domain(__wb), \
+ .gdtc = __gdtc, \
+ DTC_INIT_COMMON(__wb)
+
+static bool mdtc_valid(struct dirty_throttle_control *dtc)
+{
+ return dtc->dom;
+}

static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
{
@@ -207,6 +215,12 @@ static void wb_min_max_ratio(struct bdi_writeback *wb,

#define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb)
#define GDTC_INIT_NO_WB
+#define MDTC_INIT(__wb, __gdtc)
+
+static bool mdtc_valid(struct dirty_throttle_control *dtc)
+{
+ return false;
+}

static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc)
{
@@ -668,6 +682,15 @@ static unsigned long hard_dirty_limit(struct wb_domain *dom,
return max(thresh, dom->dirty_limit);
}

+/* memory available to a memcg domain is capped by system-wide clean memory */
+static void mdtc_cap_avail(struct dirty_throttle_control *mdtc)
+{
+ struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc);
+ unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty);
+
+ mdtc->avail = min(mdtc->avail, clean);
+}
+
/**
* __wb_calc_thresh - @wb's share of dirty throttling threshold
* @dtc: dirty_throttle_context of interest
@@ -1269,11 +1292,12 @@ static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc,
trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
}

-static void __wb_update_bandwidth(struct dirty_throttle_control *dtc,
+static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
+ struct dirty_throttle_control *mdtc,
unsigned long start_time,
bool update_ratelimit)
{
- struct bdi_writeback *wb = dtc->wb;
+ struct bdi_writeback *wb = gdtc->wb;
unsigned long now = jiffies;
unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
@@ -1298,8 +1322,17 @@ static void __wb_update_bandwidth(struct dirty_throttle_control *dtc,
goto snapshot;

if (update_ratelimit) {
- domain_update_bandwidth(dtc, now);
- wb_update_dirty_ratelimit(dtc, dirtied, elapsed);
+ domain_update_bandwidth(gdtc, now);
+ wb_update_dirty_ratelimit(gdtc, dirtied, elapsed);
+
+ /*
+ * @mdtc is always NULL if !CGROUP_WRITEBACK but the
+ * compiler has no way to figure that out. Help it.
+ */
+ if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {
+ domain_update_bandwidth(mdtc, now);
+ wb_update_dirty_ratelimit(mdtc, dirtied, elapsed);
+ }
}
wb_update_write_bandwidth(wb, elapsed, written);

@@ -1313,7 +1346,7 @@ void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time)
{
struct dirty_throttle_control gdtc = { GDTC_INIT(wb) };

- __wb_update_bandwidth(&gdtc, start_time, false);
+ __wb_update_bandwidth(&gdtc, NULL, start_time, false);
}

/*
@@ -1480,7 +1513,11 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long pages_dirtied)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+ struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
struct dirty_throttle_control * const gdtc = &gdtc_stor;
+ struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
+ &mdtc_stor : NULL;
+ struct dirty_throttle_control *sdtc;
unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
long period;
long pause;
@@ -1497,6 +1534,7 @@ static void balance_dirty_pages(struct address_space *mapping,
for (;;) {
unsigned long now = jiffies;
unsigned long dirty, thresh, bg_thresh;
+ unsigned long m_dirty, m_thresh, m_bg_thresh;

/*
* Unstable writes are a feature of certain networked
@@ -1523,6 +1561,32 @@ static void balance_dirty_pages(struct address_space *mapping,
bg_thresh = gdtc->bg_thresh;
}

+ if (mdtc) {
+ unsigned long writeback;
+
+ /*
+ * If @wb belongs to !root memcg, repeat the same
+ * basic calculations for the memcg domain.
+ */
+ mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty,
+ &writeback);
+ mdtc_cap_avail(mdtc);
+ mdtc->dirty += writeback;
+
+ domain_dirty_limits(mdtc);
+
+ if (unlikely(strictlimit)) {
+ wb_dirty_limits(mdtc);
+ m_dirty = mdtc->wb_dirty;
+ m_thresh = mdtc->wb_thresh;
+ m_bg_thresh = mdtc->wb_bg_thresh;
+ } else {
+ m_dirty = mdtc->dirty;
+ m_thresh = mdtc->thresh;
+ m_bg_thresh = mdtc->bg_thresh;
+ }
+ }
+
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
@@ -1531,18 +1595,31 @@ static void balance_dirty_pages(struct address_space *mapping,
* In strictlimit case make decision based on the wb counters
* and limits. Small writeouts when the wb limits are ramping
* up are the price we consciously pay for strictlimit-ing.
+ *
+ * If memcg domain is in effect, @dirty should be under
+ * both global and memcg freerun ceilings.
*/
- if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {
+ if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
+ (!mdtc ||
+ m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
+ unsigned long intv = dirty_poll_interval(dirty, thresh);
+ unsigned long m_intv = ULONG_MAX;
+
current->dirty_paused_when = now;
current->nr_dirtied = 0;
- current->nr_dirtied_pause =
- dirty_poll_interval(dirty, thresh);
+ if (mdtc)
+ m_intv = dirty_poll_interval(m_dirty, m_thresh);
+ current->nr_dirtied_pause = min(intv, m_intv);
break;
}

if (unlikely(!writeback_in_progress(wb)))
wb_start_background_writeback(wb);

+ /*
+ * Calculate global domain's pos_ratio and select the
+ * global dtc by default.
+ */
if (!strictlimit)
wb_dirty_limits(gdtc);

@@ -1550,6 +1627,25 @@ static void balance_dirty_pages(struct address_space *mapping,
((gdtc->dirty > gdtc->thresh) || strictlimit);

wb_position_ratio(gdtc);
+ sdtc = gdtc;
+
+ if (mdtc) {
+ /*
+ * If memcg domain is in effect, calculate its
+ * pos_ratio. @wb should satisfy constraints from
+ * both global and memcg domains. Choose the one
+ * w/ lower pos_ratio.
+ */
+ if (!strictlimit)
+ wb_dirty_limits(mdtc);
+
+ dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) &&
+ ((mdtc->dirty > mdtc->thresh) || strictlimit);
+
+ wb_position_ratio(mdtc);
+ if (mdtc->pos_ratio < gdtc->pos_ratio)
+ sdtc = mdtc;
+ }

if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;
@@ -1557,14 +1653,15 @@ static void balance_dirty_pages(struct address_space *mapping,
if (time_is_before_jiffies(wb->bw_time_stamp +
BANDWIDTH_INTERVAL)) {
spin_lock(&wb->list_lock);
- __wb_update_bandwidth(gdtc, start_time, true);
+ __wb_update_bandwidth(gdtc, mdtc, start_time, true);
spin_unlock(&wb->list_lock);
}

+ /* throttle according to the chosen dtc */
dirty_ratelimit = wb->dirty_ratelimit;
- task_ratelimit = ((u64)dirty_ratelimit * gdtc->pos_ratio) >>
+ task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >>
RATELIMIT_CALC_SHIFT;
- max_pause = wb_max_pause(wb, gdtc->wb_dirty);
+ max_pause = wb_max_pause(wb, sdtc->wb_dirty);
min_pause = wb_min_pause(wb, max_pause,
task_ratelimit, dirty_ratelimit,
&nr_dirtied_pause);
@@ -1587,11 +1684,11 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
if (pause < min_pause) {
trace_balance_dirty_pages(bdi,
- gdtc->thresh,
- gdtc->bg_thresh,
- gdtc->dirty,
- gdtc->wb_thresh,
- gdtc->wb_dirty,
+ sdtc->thresh,
+ sdtc->bg_thresh,
+ sdtc->dirty,
+ sdtc->wb_thresh,
+ sdtc->wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1616,11 +1713,11 @@ static void balance_dirty_pages(struct address_space *mapping,

pause:
trace_balance_dirty_pages(bdi,
- gdtc->thresh,
- gdtc->bg_thresh,
- gdtc->dirty,
- gdtc->wb_thresh,
- gdtc->wb_dirty,
+ sdtc->thresh,
+ sdtc->bg_thresh,
+ sdtc->dirty,
+ sdtc->wb_thresh,
+ sdtc->wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1651,7 +1748,7 @@ static void balance_dirty_pages(struct address_space *mapping,
* more page. However wb_dirty has accounting errors. So use
* the larger and more IO friendly wb_stat_error.
*/
- if (gdtc->wb_dirty <= wb_stat_error(wb))
+ if (sdtc->wb_dirty <= wb_stat_error(wb))
break;

if (fatal_signal_pending(current))
@@ -1775,7 +1872,10 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
bool wb_over_bg_thresh(struct bdi_writeback *wb)
{
struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) };
+ struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) };
struct dirty_throttle_control * const gdtc = &gdtc_stor;
+ struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
+ &mdtc_stor : NULL;

/*
* Similar to balance_dirty_pages() but ignores pages being written
@@ -1792,6 +1892,20 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc))
return true;

+ if (mdtc) {
+ unsigned long writeback;
+
+ mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback);
+ mdtc_cap_avail(mdtc);
+ domain_dirty_limits(mdtc); /* ditto, ignore writeback */
+
+ if (mdtc->dirty > mdtc->bg_thresh)
+ return true;
+
+ if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(mdtc))
+ return true;
+ }
+
return false;
}

--
2.4.0
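
As a side note on the hunks above: the decision reduces to "skip
throttling only while both domains are in their freerun zone,
otherwise throttle against whichever domain is more restrictive".
The standalone sketch below illustrates that selection with
hypothetical, simplified stand-ins (struct dom, freerun()); it is not
the kernel's dirty_throttle_control machinery, and a midpoint of the
background and hard thresholds merely stands in for the freerun
ceiling.

/*
 * Standalone, simplified sketch of the two-domain decision in
 * balance_dirty_pages().  'struct dom' and freerun() are hypothetical
 * stand-ins, not the kernel's dirty_throttle_control API.
 */
#include <stdio.h>

struct dom {
	unsigned long dirty;		/* dirty pages in the domain */
	unsigned long thresh;		/* hard dirty threshold */
	unsigned long bg_thresh;	/* background writeback threshold */
	unsigned long pos_ratio;	/* position ratio, lower = more throttled */
};

/* freerun ceiling: midpoint of the background and hard thresholds */
static unsigned long freerun(const struct dom *d)
{
	return (d->thresh + d->bg_thresh) / 2;
}

int main(void)
{
	struct dom gdtc = { 900, 1000, 500, 700 };	/* global domain */
	struct dom mdtc = { 240,  250, 125, 300 };	/* memcg domain */
	const struct dom *sdtc;

	/* Skip throttling only while *both* domains are in freerun. */
	if (gdtc.dirty <= freerun(&gdtc) && mdtc.dirty <= freerun(&mdtc)) {
		printf("freerun: no throttling\n");
		return 0;
	}

	/*
	 * Otherwise throttle against the more restrictive domain,
	 * i.e. the one with the lower pos_ratio.
	 */
	sdtc = mdtc.pos_ratio < gdtc.pos_ratio ? &mdtc : &gdtc;
	printf("throttling against the %s domain (pos_ratio=%lu)\n",
	       sdtc == &mdtc ? "memcg" : "global", sdtc->pos_ratio);
	return 0;
}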

2015-05-22 22:24:45

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/19] mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use

Because writeback wasn't cgroup aware before, the usual dirty
throttling mechanism in balance_dirty_pages() didn't work for
processes under a memcg limit. The writeback path didn't know how
much memory was available or how fast dirty pages were being written
out for a given memcg, and balance_dirty_pages() had no measure of IO
back pressure for the memcg.

To work around the issue, memcg implemented an ad-hoc dirty throttling
mechanism in the direct reclaim path by stalling on pages under
writeback that are encountered during the direct reclaim scan. This is
rather ugly and crude: it has none of the configurability, fairness, or
bandwidth-proportional distribution of the normal path.

The previous patches implemented proper memcg-aware dirty throttling
when cgroup writeback is in use, making the ad-hoc mechanism
unnecessary. This patch disables direct reclaim stalling in that
case.

Note: I disabled the parts which seemed obvious and it behaved fine
in testing, but my understanding of this code path is rudimentary
and it's quite possible that I got something wrong. Please let me
know if I got something wrong or if more global_reclaim() sites
should be updated.

v2: The original patch removed the direct stalling mechanism, which
breaks legacy hierarchies. Conditionalize instead of removing.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Vladimir Davydov <[email protected]>
---
mm/vmscan.c | 51 +++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f463398..8cb16eb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -154,11 +154,42 @@ static bool global_reclaim(struct scan_control *sc)
{
return !sc->target_mem_cgroup;
}
+
+/**
+ * sane_reclaim - is the usual dirty throttling mechanism operational?
+ * @sc: scan_control in question
+ *
+ * The normal page dirty throttling mechanism in balance_dirty_pages() is
+ * completely broken with the legacy memcg and direct stalling in
+ * shrink_page_list() is used for throttling instead, which lacks all the
+ * niceties such as fairness, adaptive pausing, bandwidth proportional
+ * allocation and configurability.
+ *
+ * This function tests whether the vmscan currently in progress can assume
+ * that the normal dirty throttling mechanism is operational.
+ */
+static bool sane_reclaim(struct scan_control *sc)
+{
+ struct mem_cgroup *memcg = sc->target_mem_cgroup;
+
+ if (!memcg)
+ return true;
+#ifdef CONFIG_CGROUP_WRITEBACK
+ if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup))
+ return true;
+#endif
+ return false;
+}
#else
static bool global_reclaim(struct scan_control *sc)
{
return true;
}
+
+static bool sane_reclaim(struct scan_control *sc)
+{
+ return true;
+}
#endif

static unsigned long zone_reclaimable_pages(struct zone *zone)
@@ -941,10 +972,10 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* note that the LRU is being scanned too quickly and the
* caller can stall after page list has been processed.
*
- * 2) Global reclaim encounters a page, memcg encounters a
- * page that is not marked for immediate reclaim or
- * the caller does not have __GFP_IO. In this case mark
- * the page for immediate reclaim and continue scanning.
+ * 2) Global or new memcg reclaim encounters a page that is
+ * not marked for immediate reclaim or the caller does not
+ * have __GFP_IO. In this case mark the page for immediate
+ * reclaim and continue scanning.
*
* __GFP_IO is checked because a loop driver thread might
* enter reclaim, and deadlock if it waits on a page for
@@ -958,7 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so testing
* may_enter_fs here is liable to OOM on them.
*
- * 3) memcg encounters a page that is not already marked
+ * 3) Legacy memcg encounters a page that is not already marked
* PageReclaim. memcg does not have any dirty pages
* throttling so we could easily OOM just because too many
* pages are in writeback and there is nothing else to
@@ -973,7 +1004,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;

/* Case 2 above */
- } else if (global_reclaim(sc) ||
+ } else if (sane_reclaim(sc) ||
!PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
/*
* This is slightly racy - end_page_writeback()
@@ -1422,7 +1453,7 @@ static int too_many_isolated(struct zone *zone, int file,
if (current_is_kswapd())
return 0;

- if (!global_reclaim(sc))
+ if (!sane_reclaim(sc))
return 0;

if (file) {
@@ -1614,10 +1645,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
set_bit(ZONE_WRITEBACK, &zone->flags);

/*
- * memcg will stall in page writeback so only consider forcibly
- * stalling for global reclaim
+ * Legacy memcg will stall in page writeback so avoid forcibly
+ * stalling here.
*/
- if (global_reclaim(sc)) {
+ if (sane_reclaim(sc)) {
/*
* Tag a zone as congested if all the dirty pages scanned were
* backed by a congested BDI and wait_iff_congested will stall.
--
2.4.0

2015-05-22 23:13:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 01/19] memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online

On Fri, May 22, 2015 at 06:23:18PM -0400, Tejun Heo wrote:
> cpu_possible_mask represents the CPUs which are actually possible
> during that boot instance. For systems which don't support CPU
> hotplug, this will match cpu_online_mask exactly in most cases. Even
> for systems which support CPU hotplug, the number of possible CPU
> slots is highly unlikely to diverge greatly from the number of online
> CPUs. The only cases where the difference between possible and online
> caused problems were when the boot code failed to initialize the
> possible mask and left it fully set at NR_CPUS - 1.
>
> As such, most per-cpu constructs allocate for all possible CPUs and
> often iterate over the possibles, which also has the benefit of
> avoiding the blocking CPU hotplug synchronization.
>
> memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
> mem_cgroup_read_events(), which iterates over online CPUs and handles
> CPU hotplug operations explicitly. This complexity doesn't actually
> buy anything. Switch to iterating over the possibles and drop the
> explicit CPU hotplug handling.
>
> Eventually, we want to convert memcg to use percpu_counter instead of
> its own custom implementation which also benefits from quick access
> w/o summing for cases where larger error margin is acceptable.
>
> This will allow mem_cgroup_read_stat() to be called from non-sleepable
> contexts which will be used by cgroup writeback.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>

Acked-by: Johannes Weiner <[email protected]>
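
On the percpu_counter direction mentioned in the quoted message: the
generic counter keeps a per-CPU delta that is folded into a shared
count in batches, so percpu_counter_read() returns a cheap approximate
value without iterating CPUs, while percpu_counter_sum() still walks
the per-CPU deltas for an exact total. A minimal usage sketch follows
(a toy module, not part of this series; the names are made up):

#include <linux/module.h>
#include <linux/percpu_counter.h>

/* hypothetical counter, for illustration only */
static struct percpu_counter demo_counter;

static int __init percpu_counter_demo_init(void)
{
	int err;

	err = percpu_counter_init(&demo_counter, 0, GFP_KERNEL);
	if (err)
		return err;

	/* hot path: cheap, mostly CPU-local update */
	percpu_counter_add(&demo_counter, 1);

	/* fast approximate read; error roughly bounded by batch * nr_cpus */
	pr_info("approx: %lld\n", percpu_counter_read(&demo_counter));

	/* exact read; sums every CPU's pending delta */
	pr_info("exact: %lld\n", percpu_counter_sum(&demo_counter));

	percpu_counter_destroy(&demo_counter);
	return 0;
}
module_init(percpu_counter_demo_init);

MODULE_LICENSE("GPL");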

2015-06-17 14:41:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 01/19] memcg: make mem_cgroup_read_{stat|event}() iterate possible cpus instead of online

On Fri 22-05-15 18:23:18, Tejun Heo wrote:
> cpu_possible_mask represents the CPUs which are actually possible
> during that boot instance. For systems which don't support CPU
> hotplug, this will match cpu_online_mask exactly in most cases. Even
> for systems which support CPU hotplug, the number of possible CPU
> slots is highly unlikely to diverge greatly from the number of online
> CPUs. The only cases where the difference between possible and online
> caused problems were when the boot code failed to initialize the
> possible mask and left it fully set at NR_CPUS - 1.
>
> As such, most per-cpu constructs allocate for all possible CPUs and
> often iterate over the possibles, which also has the benefit of
> avoiding the blocking CPU hotplug synchronization.
>
> memcg open codes per-cpu stat counting for mem_cgroup_read_stat() and
> mem_cgroup_read_events(), which iterates over online CPUs and handles
> CPU hotplug operations explicitly. This complexity doesn't actually
> buy anything. Switch to iterating over the possibles and drop the
> explicit CPU hotplug handling.
>
> Eventually, we want to convert memcg to use percpu_counter instead of
> its own custom implementation which also benefits from quick access
> w/o summing for cases where larger error margin is acceptable.
>
> This will allow mem_cgroup_read_stat() to be called from non-sleepable
> contexts which will be used by cgroup writeback.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Michal Hocko <[email protected]>

I am sorry for being late in this thread.

I have seen systems where the number of possible CPUs was really high
wrt. the online ones, but I wouldn't worry about them. The change is
an overall improvement for usual configurations though.

Acked-by: Michal Hocko <[email protected]>
> ---
> mm/memcontrol.c | 51 ++-------------------------------------------------
> 1 file changed, 2 insertions(+), 49 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6732c2c..d7d270a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -324,11 +324,6 @@ struct mem_cgroup {
> * percpu counter.
> */
> struct mem_cgroup_stat_cpu __percpu *stat;
> - /*
> - * used when a cpu is offlined or other synchronizations
> - * See mem_cgroup_read_stat().
> - */
> - struct mem_cgroup_stat_cpu nocpu_base;
> spinlock_t pcp_counter_lock;
>
> #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET)
> @@ -815,15 +810,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg,
> long val = 0;
> int cpu;
>
> - get_online_cpus();
> - for_each_online_cpu(cpu)
> + for_each_possible_cpu(cpu)
> val += per_cpu(memcg->stat->count[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> - spin_lock(&memcg->pcp_counter_lock);
> - val += memcg->nocpu_base.count[idx];
> - spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> - put_online_cpus();
> return val;
> }
>
> @@ -833,15 +821,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg,
> unsigned long val = 0;
> int cpu;
>
> - get_online_cpus();
> - for_each_online_cpu(cpu)
> + for_each_possible_cpu(cpu)
> val += per_cpu(memcg->stat->events[idx], cpu);
> -#ifdef CONFIG_HOTPLUG_CPU
> - spin_lock(&memcg->pcp_counter_lock);
> - val += memcg->nocpu_base.events[idx];
> - spin_unlock(&memcg->pcp_counter_lock);
> -#endif
> - put_online_cpus();
> return val;
> }
>
> @@ -2191,37 +2172,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
> mutex_unlock(&percpu_charge_mutex);
> }
>
> -/*
> - * This function drains percpu counter value from DEAD cpu and
> - * move it to local cpu. Note that this function can be preempted.
> - */
> -static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu)
> -{
> - int i;
> -
> - spin_lock(&memcg->pcp_counter_lock);
> - for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> - long x = per_cpu(memcg->stat->count[i], cpu);
> -
> - per_cpu(memcg->stat->count[i], cpu) = 0;
> - memcg->nocpu_base.count[i] += x;
> - }
> - for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) {
> - unsigned long x = per_cpu(memcg->stat->events[i], cpu);
> -
> - per_cpu(memcg->stat->events[i], cpu) = 0;
> - memcg->nocpu_base.events[i] += x;
> - }
> - spin_unlock(&memcg->pcp_counter_lock);
> -}
> -
> static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
> unsigned long action,
> void *hcpu)
> {
> int cpu = (unsigned long)hcpu;
> struct memcg_stock_pcp *stock;
> - struct mem_cgroup *iter;
>
> if (action == CPU_ONLINE)
> return NOTIFY_OK;
> @@ -2229,9 +2185,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb,
> if (action != CPU_DEAD && action != CPU_DEAD_FROZEN)
> return NOTIFY_OK;
>
> - for_each_mem_cgroup(iter)
> - mem_cgroup_drain_pcp_counter(iter, cpu);
> -
> stock = &per_cpu(memcg_stock, cpu);
> drain_stock(stock);
> return NOTIFY_OK;
> --
> 2.4.0
>

--
Michal Hocko
SUSE Labs