2015-05-22 21:14:17

by Tejun Heo

Subject: [PATCHSET 1/3 v4 block/for-4.2/core] writeback: cgroup writeback support

Hello,

This is v4 of the cgroup writeback support patchset. Changes from the
last take[L] are:

* b9ea25152e56 ("page_writeback: clean up mess around
cancel_dirty_page()") replaced cancel_dirty_page() with
account_page_cleaned(), which pushed clearing the dirty flag to the
caller; however, changes in this patchset and the following ones
require synchronization between dirty clearing and stat updates,
which is a lot easier with a helper that does both operations.

0001-page_writeback-revive-cancel_dirty_page-in-a-restric.patch is
added to resurrect cancel_dirty_page() in a more restricted form.

* Recent dirtytime changes added wakeup_dirtytime_writeback() which
needs to be updated to walk through all wb's.
0042-writeback-make-wakeup_dirtytime_writeback-handle-mul.patch
added.

* Rebased on top of the current block/for-4.2/core.

blkio cgroup (blkcg) is severely crippled in that it can only control
read and direct write IOs. blkcg can't tell which cgroup should be
held responsible for a given writeback IO and charges all of them to
the root cgroup - all normal write traffic ends up in the root cgroup.
Although the problem was identified years ago, it hasn't been solved
yet, mainly because it interacts with so many subsystems.

This patchset finally implements cgroup writeback support so that the
writeback of a page is attributed to the blkcg corresponding to the
memcg that the page belongs to.

Overall design
--------------

* This requires cooperation between memcg and blkcg. Each inode is
assigned to the blkcg mapped to the memcg that dirties it.

* struct bdi_writeback (wb) was always embedded in struct
backing_dev_info (bdi) and the distinction between the two wasn't
clear. This patchset makes wb operate as an independent writeback
execution domain. bdi->wb is still embedded and serves the root
cgroup but there can be other wb's for other cgroups.

* Each wb is associated with a memcg. As memcg is implicitly enabled by
blkcg on the unified hierarchy, this gives a unique wb for each
memcg-blkcg combination. When memcg-blkcg mapping changes, a new wb
is created and the existing wb is unlinked and drained.

* An inode is associated with the matching wb when it gets dirtied for
the first time and is then written back by that wb. A later patchset
will implement dynamic wb switching.

* All writeback operations are made per-wb instead of per-bdi.
bdi-wide operations are split across all member wb's. If some
finite amount needs to be distributed, be it the number of pages to
write back or bdi->min/max_ratio, it's distributed according to the
bandwidth proportion a wb has in the bdi.

* cgroup writeback support adds one pointer to struct inode.

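To make the relationships above concrete, here is a minimal C sketch.
The *_sketch types and wb_share() are illustrative placeholders, not
definitions added by the patches; the real fields live in struct
bdi_writeback and struct backing_dev_info (see patches 0012-0014 and
0032 in the series below).

struct memcg_sketch;			/* stand-in for struct mem_cgroup */
struct blkcg_sketch;			/* stand-in for struct blkcg */

struct wb_sketch {			/* one writeback domain (a "wb") */
	struct memcg_sketch *memcg;	/* memcg this wb writes back for */
	struct blkcg_sketch *blkcg;	/* blkcg mapped to that memcg */
	unsigned long avg_write_bandwidth;
};

struct bdi_sketch {
	struct wb_sketch root_wb;	/* embedded, serves the root cgroup */
	struct wb_sketch **cgroup_wbs;	/* one wb per memcg-blkcg pair */
	unsigned long tot_write_bandwidth;	/* sum over member wb's */
};

struct inode_sketch {
	struct wb_sketch *i_wb;		/* set on first dirtying; the one
					 * extra pointer mentioned above */
};

/*
 * bdi-wide amounts (pages to write back, min/max_ratio) are split
 * across member wb's in proportion to their write bandwidth.
 */
static unsigned long wb_share(const struct bdi_sketch *bdi,
			      const struct wb_sketch *wb,
			      unsigned long total)
{
	if (!bdi->tot_write_bandwidth)
		return total;
	return total * wb->avg_write_bandwidth / bdi->tot_write_bandwidth;
}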

Missing pieces
--------------

* It requires some cooperation from the filesystem and currently only
works with ext2. The changes necessary on the filesystem side are
almost trivial. I'll write up documentation for it.

* blk-throttle works but cfq-iosched isn't ready for writebacks coming
down with different cgroups. cfq-iosched should be updated to have
a writeback ioc per cgroup and route writeback IOs through it.


How to test
-----------

* Boot with kernel option "cgroup__DEVEL__legacy_files_on_dfl".

* umount /sys/fs/cgroup/memory
umount /sys/fs/cgroup/blkio
mkdir /sys/fs/cgroup/unified
mount -t cgroup -o __DEVEL__sane_behavior cgroup /sys/fs/cgroup/unified
echo +blkio > /sys/fs/cgroup/unified/cgroup.subtree_control

* Build the cgroup hierarchy (don't forget to enable blkio using
subtree_control), put processes in cgroups, and run tests on ext2
filesystems using the blkio.throttle.* knobs.

This patchset contains the following 51 patches.

0001-page_writeback-revive-cancel_dirty_page-in-a-restric.patch
0002-memcg-add-per-cgroup-dirty-page-accounting.patch
0003-blkcg-move-block-blk-cgroup.h-to-include-linux-blk-c.patch
0004-update-CONFIG_BLK_CGROUP-dummies-in-include-linux-bl.patch
0005-blkcg-always-create-the-blkcg_gq-for-the-root-blkcg.patch
0006-memcg-add-mem_cgroup_root_css.patch
0007-blkcg-add-blkcg_root_css.patch
0008-cgroup-block-implement-task_get_css-and-use-it-in-bi.patch
0009-blkcg-implement-task_get_blkcg_css.patch
0010-blkcg-implement-bio_associate_blkcg.patch
0011-memcg-implement-mem_cgroup_css_from_page.patch
0012-writeback-move-backing_dev_info-state-into-bdi_write.patch
0013-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch
0014-writeback-move-bandwidth-related-fields-from-backing.patch
0015-writeback-s-bdi-wb-in-mm-page-writeback.c.patch
0016-writeback-move-backing_dev_info-wb_lock-and-worklist.patch
0017-writeback-reorganize-mm-backing-dev.c.patch
0018-writeback-separate-out-include-linux-backing-dev-def.patch
0019-bdi-make-inode_to_bdi-inline.patch
0020-writeback-add-gfp-to-wb_init.patch
0021-bdi-separate-out-congested-state-into-a-separate-str.patch
0022-writeback-add-CONFIG-BDI_CAP-FS-_CGROUP_WRITEBACK.patch
0023-writeback-make-backing_dev_info-host-cgroup-specific.patch
0024-writeback-blkcg-associate-each-blkcg_gq-with-the-cor.patch
0025-writeback-attribute-stats-to-the-matching-per-cgroup.patch
0026-writeback-let-balance_dirty_pages-work-on-the-matchi.patch
0027-writeback-make-congestion-functions-per-bdi_writebac.patch
0028-writeback-blkcg-restructure-blk_-set-clear-_queue_co.patch
0029-writeback-blkcg-propagate-non-root-blkcg-congestion-.patch
0030-writeback-implement-and-use-inode_congested.patch
0031-writeback-implement-WB_has_dirty_io-wb_state-flag.patch
0032-writeback-implement-backing_dev_info-tot_write_bandw.patch
0033-writeback-make-bdi_has_dirty_io-take-multiple-bdi_wr.patch
0034-writeback-don-t-issue-wb_writeback_work-if-clean.patch
0035-writeback-make-bdi-min-max_ratio-handling-cgroup-wri.patch
0036-writeback-implement-bdi_for_each_wb.patch
0037-writeback-remove-bdi_start_writeback.patch
0038-writeback-make-laptop_mode_timer_fn-handle-multiple-.patch
0039-writeback-make-writeback_in_progress-take-bdi_writeb.patch
0040-writeback-make-bdi_start_background_writeback-take-b.patch
0041-writeback-make-wakeup_flusher_threads-handle-multipl.patch
0042-writeback-make-wakeup_dirtytime_writeback-handle-mul.patch
0043-writeback-add-wb_writeback_work-auto_free.patch
0044-writeback-implement-bdi_wait_for_completion.patch
0045-writeback-implement-wb_wait_for_single_work.patch
0046-writeback-restructure-try_writeback_inodes_sb-_nr.patch
0047-writeback-make-writeback-initiation-functions-handle.patch
0048-writeback-dirty-inodes-against-their-matching-cgroup.patch
0049-buffer-writeback-make-__block_write_full_page-honor-.patch
0050-mpage-make-__mpage_writepage-honor-cgroup-writeback.patch
0051-ext2-enable-cgroup-writeback-support.patch

git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-writeback-20150522

diffstat follows. Thanks.

Documentation/cgroups/memory.txt | 1
block/bio.c | 35
block/blk-cgroup.c | 124 -
block/blk-cgroup.h | 603 --------
block/blk-core.c | 70 -
block/blk-integrity.c | 1
block/blk-sysfs.c | 3
block/blk-throttle.c | 2
block/bounce.c | 1
block/cfq-iosched.c | 2
block/elevator.c | 2
block/genhd.c | 1
drivers/block/drbd/drbd_int.h | 1
drivers/block/drbd/drbd_main.c | 10
drivers/block/pktcdvd.c | 1
drivers/char/raw.c | 1
drivers/md/bcache/request.c | 1
drivers/md/dm.c | 2
drivers/md/dm.h | 1
drivers/md/md.h | 1
drivers/md/raid1.c | 4
drivers/md/raid10.c | 2
drivers/mtd/devices/block2mtd.c | 1
drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h | 4
fs/block_dev.c | 9
fs/buffer.c | 64
fs/ext2/super.c | 2
fs/ext4/extents.c | 1
fs/ext4/mballoc.c | 1
fs/ext4/super.c | 1
fs/f2fs/node.c | 4
fs/f2fs/segment.h | 3
fs/fat/file.c | 1
fs/fat/inode.c | 1
fs/fs-writeback.c | 619 ++++++--
fs/fuse/file.c | 12
fs/gfs2/super.c | 2
fs/hfs/super.c | 1
fs/hfsplus/super.c | 1
fs/inode.c | 1
fs/mpage.c | 2
fs/nfs/filelayout/filelayout.c | 1
fs/nfs/internal.h | 2
fs/nfs/write.c | 3
fs/ocfs2/file.c | 1
fs/reiserfs/super.c | 1
fs/ufs/super.c | 1
fs/xfs/xfs_aops.c | 12
fs/xfs/xfs_file.c | 1
include/linux/backing-dev-defs.h | 188 ++
include/linux/backing-dev.h | 567 +++++---
include/linux/bio.h | 3
include/linux/blk-cgroup.h | 631 +++++++++
include/linux/blkdev.h | 21
include/linux/cgroup.h | 25
include/linux/fs.h | 13
include/linux/memcontrol.h | 10
include/linux/mm.h | 7
include/linux/pagemap.h | 3
include/linux/writeback.h | 25
include/trace/events/writeback.h | 8
init/Kconfig | 5
mm/backing-dev.c | 666 +++++++--
mm/fadvise.c | 2
mm/filemap.c | 31
mm/madvise.c | 1
mm/memcontrol.c | 59
mm/page-writeback.c | 696 +++++-----
mm/readahead.c | 2
mm/rmap.c | 2
mm/truncate.c | 18
mm/vmscan.c | 28
72 files changed, 3054 insertions(+), 1578 deletions(-)

--
tejun

[L] http://lkml.kernel.org/g/[email protected]


2015-05-22 21:33:11

by Tejun Heo

Subject: [PATCH 01/51] page_writeback: revive cancel_dirty_page() in a restricted form

cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
clean up mess around cancel_dirty_page()") replaced it with
account_page_cleaned() which makes the caller responsible for clearing
the dirty bit; unfortunately, the planned changes for cgroup writeback
support require synchronization between dirty bit manipulation and
stat updates. While we can open-code such synchronization in each
account_page_cleaned() callsite, that's gonna be unnecessarily awkward
and verbose.

This patch revives cancel_dirty_page() but in a more restricted form.
All it does is TestClearPageDirty() followed by account_page_cleaned()
invocation if the page was dirty. This helper covers all
account_page_cleaned() usages except for __delete_from_page_cache()
which is a special case anyway and left alone. As this leaves no
module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
from it.

This patch just revives cancel_dirty_page() as a trivial wrapper to
replace equivalent usages and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Konstantin Khlebnikov <[email protected]>
---
.../lustre/include/linux/lustre_patchless_compat.h | 4 +---
fs/buffer.c | 4 ++--
include/linux/mm.h | 1 +
mm/page-writeback.c | 27 ++++++++++++++++------
mm/truncate.c | 4 +---
5 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
index d726058..1456278 100644
--- a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
+++ b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h
@@ -55,9 +55,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (PagePrivate(page))
page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE);

- if (TestClearPageDirty(page))
- account_page_cleaned(page, mapping);
-
+ cancel_dirty_page(page);
ClearPageMappedToDisk(page);
ll_delete_from_page_cache(page);
}
diff --git a/fs/buffer.c b/fs/buffer.c
index efd85e0..e776bec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3233,8 +3233,8 @@ int try_to_free_buffers(struct page *page)
* to synchronise against __set_page_dirty_buffers and prevent the
* dirty bit from being lost.
*/
- if (ret && TestClearPageDirty(page))
- account_page_cleaned(page, mapping);
+ if (ret)
+ cancel_dirty_page(page);
spin_unlock(&mapping->private_lock);
out:
if (buffers_to_free) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0755b9f..a83cf3a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1215,6 +1215,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping);
void account_page_cleaned(struct page *page, struct address_space *mapping);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
+void cancel_dirty_page(struct page *page);
int clear_page_dirty_for_io(struct page *page);

int get_cmdline(struct task_struct *task, char *buffer, int buflen);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 5daf556..227b867 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2112,12 +2112,6 @@ EXPORT_SYMBOL(account_page_dirtied);

/*
* Helper function for deaccounting dirty page without writeback.
- *
- * Doing this should *normally* only ever be done when a page
- * is truncated, and is not actually mapped anywhere at all. However,
- * fs/buffer.c does this when it notices that somebody has cleaned
- * out all the buffers on a page without actually doing it through
- * the VM. Can you say "ext3 is horribly ugly"? Thought you could.
*/
void account_page_cleaned(struct page *page, struct address_space *mapping)
{
@@ -2127,7 +2121,6 @@ void account_page_cleaned(struct page *page, struct address_space *mapping)
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
}
}
-EXPORT_SYMBOL(account_page_cleaned);

/*
* For address_spaces which do not use buffers. Just tag the page as dirty in
@@ -2266,6 +2259,26 @@ int set_page_dirty_lock(struct page *page)
EXPORT_SYMBOL(set_page_dirty_lock);

/*
+ * This cancels just the dirty bit on the kernel page itself, it does NOT
+ * actually remove dirty bits on any mmap's that may be around. It also
+ * leaves the page tagged dirty, so any sync activity will still find it on
+ * the dirty lists, and in particular, clear_page_dirty_for_io() will still
+ * look at the dirty bits in the VM.
+ *
+ * Doing this should *normally* only ever be done when a page is truncated,
+ * and is not actually mapped anywhere at all. However, fs/buffer.c does
+ * this when it notices that somebody has cleaned out all the buffers on a
+ * page without actually doing it through the VM. Can you say "ext3 is
+ * horribly ugly"? Thought you could.
+ */
+void cancel_dirty_page(struct page *page)
+{
+ if (TestClearPageDirty(page))
+ account_page_cleaned(page, page_mapping(page));
+}
+EXPORT_SYMBOL(cancel_dirty_page);
+
+/*
* Clear a page's dirty flag, while caring for dirty memory accounting.
* Returns true if the page was previously dirty.
*
diff --git a/mm/truncate.c b/mm/truncate.c
index 66af903..0c36025 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -116,9 +116,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
* the VM has canceled the dirty bit (eg ext3 journaling).
* Hence dirty accounting check is placed after invalidation.
*/
- if (TestClearPageDirty(page))
- account_page_cleaned(page, mapping);
-
+ cancel_dirty_page(page);
ClearPageMappedToDisk(page);
delete_from_page_cache(page);
return 0;
--
2.4.0

2015-05-22 21:14:27

by Tejun Heo

Subject: [PATCH 02/51] memcg: add per cgroup dirty page accounting

From: Greg Thelen <[email protected]>

When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcgs
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
	memcg = mem_cgroup_begin_page_stat(page)
	if (TestSetPageDirty()) {
		[...]
		mem_cgroup_update_page_stat(memcg)
	}
	mem_cgroup_end_page_stat(memcg)

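Fleshed out with the tree_lock and radix-tree tagging, the same pattern
looks roughly like the sketch below. It is a simplified rendering of
__set_page_dirty_nobuffers() as modified by this patch (BUG_ON/WARN_ON
checks omitted); see the mm/page-writeback.c hunk further down for the
authoritative version.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/memcontrol.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>

static int sketch_set_page_dirty(struct page *page)
{
	struct mem_cgroup *memcg;

	/* Pins the page's memcg; takes memcg->move_lock only while tasks
	 * are being moved between memcgs, rcu_read_lock() otherwise. */
	memcg = mem_cgroup_begin_page_stat(page);
	if (!TestSetPageDirty(page)) {
		struct address_space *mapping = page_mapping(page);
		unsigned long flags;

		if (mapping) {
			spin_lock_irqsave(&mapping->tree_lock, flags);
			/* Updates NR_FILE_DIRTY, BDI_RECLAIMABLE and the new
			 * per-memcg MEM_CGROUP_STAT_DIRTY counter together. */
			account_page_dirtied(page, mapping, memcg);
			radix_tree_tag_set(&mapping->page_tree,
					   page_index(page),
					   PAGECACHE_TAG_DIRTY);
			spin_unlock_irqrestore(&mapping->tree_lock, flags);
		}
		mem_cgroup_end_page_stat(memcg);
		if (mapping && mapping->host)
			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
		return 1;
	}
	mem_cgroup_end_page_stat(memcg);
	return 0;
}
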
Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat(), which returns the memcg later
needed by mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
__mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
__delete_from_page_cache(), replace_page_cache_page(),
invalidate_complete_page2(), and __remove_mapping().

text data bss dec hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
+192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
+773 text bytes

Performance tests were run on v4.0-rc1-36-g4f671fe2f952. Lower is better
for all metrics; they're all wall clock or cycle counts. The read and
write fault benchmarks just measure fault time; they do not include I/O
time.

* CONFIG_MEMCG not set:
baseline patched
kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
baseline patched
kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
baseline patched
kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju <[email protected]>
Signed-off-by: Greg Thelen <[email protected]>
Signed-off-by: Tejun Heo <[email protected]>
---
Documentation/cgroups/memory.txt | 1 +
fs/buffer.c | 34 +++++++++++++++++++++------
fs/xfs/xfs_aops.c | 12 ++++++++--
include/linux/memcontrol.h | 1 +
include/linux/mm.h | 6 +++--
include/linux/pagemap.h | 3 ++-
mm/filemap.c | 31 +++++++++++++++++--------
mm/memcontrol.c | 24 ++++++++++++++++++-
mm/page-writeback.c | 50 +++++++++++++++++++++++++++++++++-------
mm/rmap.c | 2 ++
mm/truncate.c | 14 +++++++----
mm/vmscan.c | 17 ++++++++++----
12 files changed, 156 insertions(+), 39 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index f456b43..ff71e16 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -493,6 +493,7 @@ pgpgin - # of charging events to the memory cgroup. The charging
pgpgout - # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
writeback - # of bytes of file/anon cache that are queued for syncing to
disk.
inactive_anon - # of bytes of anonymous and swap cache memory on inactive
diff --git a/fs/buffer.c b/fs/buffer.c
index e776bec..c8aecf5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -623,21 +623,22 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
*
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
+ *
+ * The caller must hold mem_cgroup_begin_page_stat() lock.
*/
-static void __set_page_dirty(struct page *page,
- struct address_space *mapping, int warn)
+static void __set_page_dirty(struct page *page, struct address_space *mapping,
+ struct mem_cgroup *memcg, int warn)
{
unsigned long flags;

spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
- account_page_dirtied(page, mapping);
+ account_page_dirtied(page, mapping, memcg);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}

/*
@@ -668,6 +669,7 @@ static void __set_page_dirty(struct page *page,
int __set_page_dirty_buffers(struct page *page)
{
int newly_dirty;
+ struct mem_cgroup *memcg;
struct address_space *mapping = page_mapping(page);

if (unlikely(!mapping))
@@ -683,11 +685,22 @@ int __set_page_dirty_buffers(struct page *page)
bh = bh->b_this_page;
} while (bh != head);
}
+ /*
+ * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
+ * per-memcg dirty page counters.
+ */
+ memcg = mem_cgroup_begin_page_stat(page);
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);

if (newly_dirty)
- __set_page_dirty(page, mapping, 1);
+ __set_page_dirty(page, mapping, memcg, 1);
+
+ mem_cgroup_end_page_stat(memcg);
+
+ if (newly_dirty)
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
return newly_dirty;
}
EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1158,11 +1171,18 @@ void mark_buffer_dirty(struct buffer_head *bh)

if (!test_set_buffer_dirty(bh)) {
struct page *page = bh->b_page;
+ struct address_space *mapping = NULL;
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_begin_page_stat(page);
if (!TestSetPageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
+ mapping = page_mapping(page);
if (mapping)
- __set_page_dirty(page, mapping, 0);
+ __set_page_dirty(page, mapping, memcg, 0);
}
+ mem_cgroup_end_page_stat(memcg);
+ if (mapping)
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
}
EXPORT_SYMBOL(mark_buffer_dirty);
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 095f94c..e5099f2 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1873,6 +1873,7 @@ xfs_vm_set_page_dirty(
loff_t end_offset;
loff_t offset;
int newly_dirty;
+ struct mem_cgroup *memcg;

if (unlikely(!mapping))
return !TestSetPageDirty(page);
@@ -1892,6 +1893,11 @@ xfs_vm_set_page_dirty(
offset += 1 << inode->i_blkbits;
} while (bh != head);
}
+ /*
+ * Use mem_group_begin_page_stat() to keep PageDirty synchronized with
+ * per-memcg dirty page counters.
+ */
+ memcg = mem_cgroup_begin_page_stat(page);
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);

@@ -1902,13 +1908,15 @@ xfs_vm_set_page_dirty(
spin_lock_irqsave(&mapping->tree_lock, flags);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(!PageUptodate(page));
- account_page_dirtied(page, mapping);
+ account_page_dirtied(page, mapping, memcg);
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ mem_cgroup_end_page_stat(memcg);
+ if (newly_dirty)
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
return newly_dirty;
}

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 72dff5f..5fe6411 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -41,6 +41,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
MEM_CGROUP_STAT_NSTATS,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a83cf3a..f48d979 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1211,8 +1211,10 @@ int __set_page_dirty_nobuffers(struct page *page);
int __set_page_dirty_no_writeback(struct page *page);
int redirty_page_for_writepage(struct writeback_control *wbc,
struct page *page);
-void account_page_dirtied(struct page *page, struct address_space *mapping);
-void account_page_cleaned(struct page *page, struct address_space *mapping);
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+ struct mem_cgroup *memcg);
+void account_page_cleaned(struct page *page, struct address_space *mapping,
+ struct mem_cgroup *memcg);
int set_page_dirty(struct page *page);
int set_page_dirty_lock(struct page *page);
void cancel_dirty_page(struct page *page);
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 4b3736f..fb0814c 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -651,7 +651,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
int add_to_page_cache_lru(struct page *page, struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern void delete_from_page_cache(struct page *page);
-extern void __delete_from_page_cache(struct page *page, void *shadow);
+extern void __delete_from_page_cache(struct page *page, void *shadow,
+ struct mem_cgroup *memcg);
int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);

/*
diff --git a/mm/filemap.c b/mm/filemap.c
index 6bf5e42..7b1443d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -100,6 +100,7 @@
* ->tree_lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
+ * ->memcg->move_lock (page_remove_rmap->mem_cgroup_begin_page_stat)
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
@@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping,
/*
* Delete a page from the page cache and free it. Caller has to make
* sure the page is locked and that nobody else uses it - or that usage
- * is safe. The caller must hold the mapping's tree_lock.
+ * is safe. The caller must hold the mapping's tree_lock and
+ * mem_cgroup_begin_page_stat().
*/
-void __delete_from_page_cache(struct page *page, void *shadow)
+void __delete_from_page_cache(struct page *page, void *shadow,
+ struct mem_cgroup *memcg)
{
struct address_space *mapping = page->mapping;

@@ -210,7 +213,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
* anyway will be cleared before returning page into buddy allocator.
*/
if (WARN_ON_ONCE(PageDirty(page)))
- account_page_cleaned(page, mapping);
+ account_page_cleaned(page, mapping, memcg);
}

/**
@@ -224,14 +227,20 @@ void __delete_from_page_cache(struct page *page, void *shadow)
void delete_from_page_cache(struct page *page)
{
struct address_space *mapping = page->mapping;
+ struct mem_cgroup *memcg;
+ unsigned long flags;
+
void (*freepage)(struct page *);

BUG_ON(!PageLocked(page));

freepage = mapping->a_ops->freepage;
- spin_lock_irq(&mapping->tree_lock);
- __delete_from_page_cache(page, NULL);
- spin_unlock_irq(&mapping->tree_lock);
+
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ __delete_from_page_cache(page, NULL, memcg);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);

if (freepage)
freepage(page);
@@ -470,6 +479,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
+ struct mem_cgroup *memcg;
+ unsigned long flags;

pgoff_t offset = old->index;
freepage = mapping->a_ops->freepage;
@@ -478,15 +489,17 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
new->mapping = mapping;
new->index = offset;

- spin_lock_irq(&mapping->tree_lock);
- __delete_from_page_cache(old, NULL);
+ memcg = mem_cgroup_begin_page_stat(old);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ __delete_from_page_cache(old, NULL, memcg);
error = radix_tree_insert(&mapping->page_tree, offset, new);
BUG_ON(error);
mapping->nrpages++;
__inc_zone_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(new))
__inc_zone_page_state(new, NR_SHMEM);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
mem_cgroup_migrate(old, new, true);
radix_tree_preload_end();
if (freepage)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f20..c23c1a3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -90,6 +90,7 @@ static const char * const mem_cgroup_stat_names[] = {
"rss",
"rss_huge",
"mapped_file",
+ "dirty",
"writeback",
"swap",
};
@@ -2011,6 +2012,7 @@ struct mem_cgroup *mem_cgroup_begin_page_stat(struct page *page)

return memcg;
}
+EXPORT_SYMBOL(mem_cgroup_begin_page_stat);

/**
* mem_cgroup_end_page_stat - finish a page state statistics transaction
@@ -2029,6 +2031,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg)

rcu_read_unlock();
}
+EXPORT_SYMBOL(mem_cgroup_end_page_stat);

/**
* mem_cgroup_update_page_stat - update page state statistics
@@ -4746,6 +4749,7 @@ static int mem_cgroup_move_account(struct page *page,
{
unsigned long flags;
int ret;
+ bool anon;

VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -4771,15 +4775,33 @@ static int mem_cgroup_move_account(struct page *page,
if (page->mem_cgroup != from)
goto out_unlock;

+ anon = PageAnon(page);
+
spin_lock_irqsave(&from->move_lock, flags);

- if (!PageAnon(page) && page_mapped(page)) {
+ if (!anon && page_mapped(page)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
nr_pages);
}

+ /*
+ * move_lock grabbed above and caller set from->moving_account, so
+ * mem_cgroup_update_page_stat() will serialize updates to PageDirty.
+ * So mapping should be stable for dirty pages.
+ */
+ if (!anon && PageDirty(page)) {
+ struct address_space *mapping = page_mapping(page);
+
+ if (mapping_cap_account_dirty(mapping)) {
+ __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
+ nr_pages);
+ __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
+ nr_pages);
+ }
+ }
+
if (PageWriteback(page)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
nr_pages);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 227b867..bdeecad 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2090,15 +2090,20 @@ int __set_page_dirty_no_writeback(struct page *page)

/*
* Helper function for set_page_dirty family.
+ *
+ * Caller must hold mem_cgroup_begin_page_stat().
+ *
* NOTE: This relies on being atomic wrt interrupts.
*/
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+ struct mem_cgroup *memcg)
{
trace_writeback_dirty_page(page, mapping);

if (mapping_cap_account_dirty(mapping)) {
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);

+ mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(bdi, BDI_RECLAIMABLE);
@@ -2112,10 +2117,14 @@ EXPORT_SYMBOL(account_page_dirtied);

/*
* Helper function for deaccounting dirty page without writeback.
+ *
+ * Caller must hold mem_cgroup_begin_page_stat().
*/
-void account_page_cleaned(struct page *page, struct address_space *mapping)
+void account_page_cleaned(struct page *page, struct address_space *mapping,
+ struct mem_cgroup *memcg)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE);
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
@@ -2136,26 +2145,34 @@ void account_page_cleaned(struct page *page, struct address_space *mapping)
*/
int __set_page_dirty_nobuffers(struct page *page)
{
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_begin_page_stat(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
unsigned long flags;

- if (!mapping)
+ if (!mapping) {
+ mem_cgroup_end_page_stat(memcg);
return 1;
+ }

spin_lock_irqsave(&mapping->tree_lock, flags);
BUG_ON(page_mapping(page) != mapping);
WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
- account_page_dirtied(page, mapping);
+ account_page_dirtied(page, mapping, memcg);
radix_tree_tag_set(&mapping->page_tree, page_index(page),
PAGECACHE_TAG_DIRTY);
spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
+
if (mapping->host) {
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
return 1;
}
+ mem_cgroup_end_page_stat(memcg);
return 0;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2273,8 +2290,20 @@ EXPORT_SYMBOL(set_page_dirty_lock);
*/
void cancel_dirty_page(struct page *page)
{
- if (TestClearPageDirty(page))
- account_page_cleaned(page, page_mapping(page));
+ struct address_space *mapping = page_mapping(page);
+
+ if (mapping_cap_account_dirty(mapping)) {
+ struct mem_cgroup *memcg;
+
+ memcg = mem_cgroup_begin_page_stat(page);
+
+ if (TestClearPageDirty(page))
+ account_page_cleaned(page, mapping, memcg);
+
+ mem_cgroup_end_page_stat(memcg);
+ } else {
+ ClearPageDirty(page);
+ }
}
EXPORT_SYMBOL(cancel_dirty_page);

@@ -2295,6 +2324,8 @@ EXPORT_SYMBOL(cancel_dirty_page);
int clear_page_dirty_for_io(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ struct mem_cgroup *memcg;
+ int ret = 0;

BUG_ON(!PageLocked(page));

@@ -2334,13 +2365,16 @@ int clear_page_dirty_for_io(struct page *page)
* always locked coming in here, so we get the desired
* exclusion.
*/
+ memcg = mem_cgroup_begin_page_stat(page);
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(inode_to_bdi(mapping->host),
BDI_RECLAIMABLE);
- return 1;
+ ret = 1;
}
- return 0;
+ mem_cgroup_end_page_stat(memcg);
+ return ret;
}
return TestClearPageDirty(page);
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 24dd3f9..8fc556c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -30,6 +30,8 @@
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
+ * mem_cgroup_{begin,end}_page_stat (memcg->move_lock)
+ * mapping->tree_lock (widely used)
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
* bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
* sb_lock (within inode_lock in fs/fs-writeback.c)
diff --git a/mm/truncate.c b/mm/truncate.c
index 0c36025..76e35ad 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -510,19 +510,24 @@ EXPORT_SYMBOL(invalidate_mapping_pages);
static int
invalidate_complete_page2(struct address_space *mapping, struct page *page)
{
+ struct mem_cgroup *memcg;
+ unsigned long flags;
+
if (page->mapping != mapping)
return 0;

if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL))
return 0;

- spin_lock_irq(&mapping->tree_lock);
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
if (PageDirty(page))
goto failed;

BUG_ON(page_has_private(page));
- __delete_from_page_cache(page, NULL);
- spin_unlock_irq(&mapping->tree_lock);
+ __delete_from_page_cache(page, NULL, memcg);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);

if (mapping->a_ops->freepage)
mapping->a_ops->freepage(page);
@@ -530,7 +535,8 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page)
page_cache_release(page); /* pagecache ref */
return 1;
failed:
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
return 0;
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e8eadd..7582f9f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -579,10 +579,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
+ unsigned long flags;
+ struct mem_cgroup *memcg;
+
BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));

- spin_lock_irq(&mapping->tree_lock);
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
*
@@ -620,7 +624,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
__delete_from_swap_cache(page);
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
swapcache_free(swap);
} else {
void (*freepage)(struct page *);
@@ -640,8 +645,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping))
shadow = workingset_eviction(mapping, page);
- __delete_from_page_cache(page, shadow);
- spin_unlock_irq(&mapping->tree_lock);
+ __delete_from_page_cache(page, shadow, memcg);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);

if (freepage != NULL)
freepage(page);
@@ -650,7 +656,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;

cannot_free:
- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
return 0;
}

--
2.4.0

2015-05-22 21:14:31

by Tejun Heo

Subject: [PATCH 03/51] blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h

cgroup-aware writeback support will require exposing some blkcg
details. In preparation, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is a pure file move.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 2 +-
block/blk-cgroup.h | 603 ---------------------------------------------
block/blk-core.c | 2 +-
block/blk-sysfs.c | 2 +-
block/blk-throttle.c | 2 +-
block/cfq-iosched.c | 2 +-
block/elevator.c | 2 +-
include/linux/blk-cgroup.h | 603 +++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 609 insertions(+), 609 deletions(-)
delete mode 100644 block/blk-cgroup.h
create mode 100644 include/linux/blk-cgroup.h

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 0ac817b..c3226ce 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -19,7 +19,7 @@
#include <linux/genhd.h>
#include <linux/delay.h>
#include <linux/atomic.h>
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
#include "blk.h"

#define MAX_KEY_LEN 100
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
deleted file mode 100644
index c567865..0000000
--- a/block/blk-cgroup.h
+++ /dev/null
@@ -1,603 +0,0 @@
-#ifndef _BLK_CGROUP_H
-#define _BLK_CGROUP_H
-/*
- * Common Block IO controller cgroup interface
- *
- * Based on ideas and code from CFQ, CFS and BFQ:
- * Copyright (C) 2003 Jens Axboe <[email protected]>
- *
- * Copyright (C) 2008 Fabio Checconi <[email protected]>
- * Paolo Valente <[email protected]>
- *
- * Copyright (C) 2009 Vivek Goyal <[email protected]>
- * Nauman Rafique <[email protected]>
- */
-
-#include <linux/cgroup.h>
-#include <linux/u64_stats_sync.h>
-#include <linux/seq_file.h>
-#include <linux/radix-tree.h>
-#include <linux/blkdev.h>
-#include <linux/atomic.h>
-
-/* Max limits for throttle policy */
-#define THROTL_IOPS_MAX UINT_MAX
-
-/* CFQ specific, out here for blkcg->cfq_weight */
-#define CFQ_WEIGHT_MIN 10
-#define CFQ_WEIGHT_MAX 1000
-#define CFQ_WEIGHT_DEFAULT 500
-
-#ifdef CONFIG_BLK_CGROUP
-
-enum blkg_rwstat_type {
- BLKG_RWSTAT_READ,
- BLKG_RWSTAT_WRITE,
- BLKG_RWSTAT_SYNC,
- BLKG_RWSTAT_ASYNC,
-
- BLKG_RWSTAT_NR,
- BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
-};
-
-struct blkcg_gq;
-
-struct blkcg {
- struct cgroup_subsys_state css;
- spinlock_t lock;
-
- struct radix_tree_root blkg_tree;
- struct blkcg_gq *blkg_hint;
- struct hlist_head blkg_list;
-
- /* TODO: per-policy storage in blkcg */
- unsigned int cfq_weight; /* belongs to cfq */
- unsigned int cfq_leaf_weight;
-};
-
-struct blkg_stat {
- struct u64_stats_sync syncp;
- uint64_t cnt;
-};
-
-struct blkg_rwstat {
- struct u64_stats_sync syncp;
- uint64_t cnt[BLKG_RWSTAT_NR];
-};
-
-/*
- * A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
- * request_queue (q). This is used by blkcg policies which need to track
- * information per blkcg - q pair.
- *
- * There can be multiple active blkcg policies and each has its private
- * data on each blkg, the size of which is determined by
- * blkcg_policy->pd_size. blkcg core allocates and frees such areas
- * together with blkg and invokes pd_init/exit_fn() methods.
- *
- * Such private data must embed struct blkg_policy_data (pd) at the
- * beginning and pd_size can't be smaller than pd.
- */
-struct blkg_policy_data {
- /* the blkg and policy id this per-policy data belongs to */
- struct blkcg_gq *blkg;
- int plid;
-
- /* used during policy activation */
- struct list_head alloc_node;
-};
-
-/* association between a blk cgroup and a request queue */
-struct blkcg_gq {
- /* Pointer to the associated request_queue */
- struct request_queue *q;
- struct list_head q_node;
- struct hlist_node blkcg_node;
- struct blkcg *blkcg;
-
- /* all non-root blkcg_gq's are guaranteed to have access to parent */
- struct blkcg_gq *parent;
-
- /* request allocation list for this blkcg-q pair */
- struct request_list rl;
-
- /* reference count */
- atomic_t refcnt;
-
- /* is this blkg online? protected by both blkcg and q locks */
- bool online;
-
- struct blkg_policy_data *pd[BLKCG_MAX_POLS];
-
- struct rcu_head rcu_head;
-};
-
-typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
-typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
-
-struct blkcg_policy {
- int plid;
- /* policy specific private data size */
- size_t pd_size;
- /* cgroup files for the policy */
- struct cftype *cftypes;
-
- /* operations */
- blkcg_pol_init_pd_fn *pd_init_fn;
- blkcg_pol_online_pd_fn *pd_online_fn;
- blkcg_pol_offline_pd_fn *pd_offline_fn;
- blkcg_pol_exit_pd_fn *pd_exit_fn;
- blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
-};
-
-extern struct blkcg blkcg_root;
-
-struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
-struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
- struct request_queue *q);
-int blkcg_init_queue(struct request_queue *q);
-void blkcg_drain_queue(struct request_queue *q);
-void blkcg_exit_queue(struct request_queue *q);
-
-/* Blkio controller policy registration */
-int blkcg_policy_register(struct blkcg_policy *pol);
-void blkcg_policy_unregister(struct blkcg_policy *pol);
-int blkcg_activate_policy(struct request_queue *q,
- const struct blkcg_policy *pol);
-void blkcg_deactivate_policy(struct request_queue *q,
- const struct blkcg_policy *pol);
-
-void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
- u64 (*prfill)(struct seq_file *,
- struct blkg_policy_data *, int),
- const struct blkcg_policy *pol, int data,
- bool show_total);
-u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v);
-u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
- const struct blkg_rwstat *rwstat);
-u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
-u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
- int off);
-
-u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
-struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
- int off);
-
-struct blkg_conf_ctx {
- struct gendisk *disk;
- struct blkcg_gq *blkg;
- u64 v;
-};
-
-int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
- const char *input, struct blkg_conf_ctx *ctx);
-void blkg_conf_finish(struct blkg_conf_ctx *ctx);
-
-
-static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
-{
- return css ? container_of(css, struct blkcg, css) : NULL;
-}
-
-static inline struct blkcg *task_blkcg(struct task_struct *tsk)
-{
- return css_to_blkcg(task_css(tsk, blkio_cgrp_id));
-}
-
-static inline struct blkcg *bio_blkcg(struct bio *bio)
-{
- if (bio && bio->bi_css)
- return css_to_blkcg(bio->bi_css);
- return task_blkcg(current);
-}
-
-/**
- * blkcg_parent - get the parent of a blkcg
- * @blkcg: blkcg of interest
- *
- * Return the parent blkcg of @blkcg. Can be called anytime.
- */
-static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
-{
- return css_to_blkcg(blkcg->css.parent);
-}
-
-/**
- * blkg_to_pdata - get policy private data
- * @blkg: blkg of interest
- * @pol: policy of interest
- *
- * Return pointer to private data associated with the @blkg-@pol pair.
- */
-static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
- struct blkcg_policy *pol)
-{
- return blkg ? blkg->pd[pol->plid] : NULL;
-}
-
-/**
- * pdata_to_blkg - get blkg associated with policy private data
- * @pd: policy private data of interest
- *
- * @pd is policy private data. Determine the blkg it's associated with.
- */
-static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd)
-{
- return pd ? pd->blkg : NULL;
-}
-
-/**
- * blkg_path - format cgroup path of blkg
- * @blkg: blkg of interest
- * @buf: target buffer
- * @buflen: target buffer length
- *
- * Format the path of the cgroup of @blkg into @buf.
- */
-static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
-{
- char *p;
-
- p = cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
- if (!p) {
- strncpy(buf, "<unavailable>", buflen);
- return -ENAMETOOLONG;
- }
-
- memmove(buf, p, buf + buflen - p);
- return 0;
-}
-
-/**
- * blkg_get - get a blkg reference
- * @blkg: blkg to get
- *
- * The caller should be holding an existing reference.
- */
-static inline void blkg_get(struct blkcg_gq *blkg)
-{
- WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
- atomic_inc(&blkg->refcnt);
-}
-
-void __blkg_release_rcu(struct rcu_head *rcu);
-
-/**
- * blkg_put - put a blkg reference
- * @blkg: blkg to put
- */
-static inline void blkg_put(struct blkcg_gq *blkg)
-{
- WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
- if (atomic_dec_and_test(&blkg->refcnt))
- call_rcu(&blkg->rcu_head, __blkg_release_rcu);
-}
-
-struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
- bool update_hint);
-
-/**
- * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
- * @d_blkg: loop cursor pointing to the current descendant
- * @pos_css: used for iteration
- * @p_blkg: target blkg to walk descendants of
- *
- * Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU
- * read locked. If called under either blkcg or queue lock, the iteration
- * is guaranteed to include all and only online blkgs. The caller may
- * update @pos_css by calling css_rightmost_descendant() to skip subtree.
- * @p_blkg is included in the iteration and the first node to be visited.
- */
-#define blkg_for_each_descendant_pre(d_blkg, pos_css, p_blkg) \
- css_for_each_descendant_pre((pos_css), &(p_blkg)->blkcg->css) \
- if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
- (p_blkg)->q, false)))
-
-/**
- * blkg_for_each_descendant_post - post-order walk of a blkg's descendants
- * @d_blkg: loop cursor pointing to the current descendant
- * @pos_css: used for iteration
- * @p_blkg: target blkg to walk descendants of
- *
- * Similar to blkg_for_each_descendant_pre() but performs post-order
- * traversal instead. Synchronization rules are the same. @p_blkg is
- * included in the iteration and the last node to be visited.
- */
-#define blkg_for_each_descendant_post(d_blkg, pos_css, p_blkg) \
- css_for_each_descendant_post((pos_css), &(p_blkg)->blkcg->css) \
- if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
- (p_blkg)->q, false)))
-
-/**
- * blk_get_rl - get request_list to use
- * @q: request_queue of interest
- * @bio: bio which will be attached to the allocated request (may be %NULL)
- *
- * The caller wants to allocate a request from @q to use for @bio. Find
- * the request_list to use and obtain a reference on it. Should be called
- * under queue_lock. This function is guaranteed to return non-%NULL
- * request_list.
- */
-static inline struct request_list *blk_get_rl(struct request_queue *q,
- struct bio *bio)
-{
- struct blkcg *blkcg;
- struct blkcg_gq *blkg;
-
- rcu_read_lock();
-
- blkcg = bio_blkcg(bio);
-
- /* bypass blkg lookup and use @q->root_rl directly for root */
- if (blkcg == &blkcg_root)
- goto root_rl;
-
- /*
- * Try to use blkg->rl. blkg lookup may fail under memory pressure
- * or if either the blkcg or queue is going away. Fall back to
- * root_rl in such cases.
- */
- blkg = blkg_lookup_create(blkcg, q);
- if (unlikely(IS_ERR(blkg)))
- goto root_rl;
-
- blkg_get(blkg);
- rcu_read_unlock();
- return &blkg->rl;
-root_rl:
- rcu_read_unlock();
- return &q->root_rl;
-}
-
-/**
- * blk_put_rl - put request_list
- * @rl: request_list to put
- *
- * Put the reference acquired by blk_get_rl(). Should be called under
- * queue_lock.
- */
-static inline void blk_put_rl(struct request_list *rl)
-{
- /* root_rl may not have blkg set */
- if (rl->blkg && rl->blkg->blkcg != &blkcg_root)
- blkg_put(rl->blkg);
-}
-
-/**
- * blk_rq_set_rl - associate a request with a request_list
- * @rq: request of interest
- * @rl: target request_list
- *
- * Associate @rq with @rl so that accounting and freeing can know the
- * request_list @rq came from.
- */
-static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
-{
- rq->rl = rl;
-}
-
-/**
- * blk_rq_rl - return the request_list a request came from
- * @rq: request of interest
- *
- * Return the request_list @rq is allocated from.
- */
-static inline struct request_list *blk_rq_rl(struct request *rq)
-{
- return rq->rl;
-}
-
-struct request_list *__blk_queue_next_rl(struct request_list *rl,
- struct request_queue *q);
-/**
- * blk_queue_for_each_rl - iterate through all request_lists of a request_queue
- *
- * Should be used under queue_lock.
- */
-#define blk_queue_for_each_rl(rl, q) \
- for ((rl) = &(q)->root_rl; (rl); (rl) = __blk_queue_next_rl((rl), (q)))
-
-static inline void blkg_stat_init(struct blkg_stat *stat)
-{
- u64_stats_init(&stat->syncp);
-}
-
-/**
- * blkg_stat_add - add a value to a blkg_stat
- * @stat: target blkg_stat
- * @val: value to add
- *
- * Add @val to @stat. The caller is responsible for synchronizing calls to
- * this function.
- */
-static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
-{
- u64_stats_update_begin(&stat->syncp);
- stat->cnt += val;
- u64_stats_update_end(&stat->syncp);
-}
-
-/**
- * blkg_stat_read - read the current value of a blkg_stat
- * @stat: blkg_stat to read
- *
- * Read the current value of @stat. This function can be called without
- * synchroniztion and takes care of u64 atomicity.
- */
-static inline uint64_t blkg_stat_read(struct blkg_stat *stat)
-{
- unsigned int start;
- uint64_t v;
-
- do {
- start = u64_stats_fetch_begin_irq(&stat->syncp);
- v = stat->cnt;
- } while (u64_stats_fetch_retry_irq(&stat->syncp, start));
-
- return v;
-}
-
-/**
- * blkg_stat_reset - reset a blkg_stat
- * @stat: blkg_stat to reset
- */
-static inline void blkg_stat_reset(struct blkg_stat *stat)
-{
- stat->cnt = 0;
-}
-
-/**
- * blkg_stat_merge - merge a blkg_stat into another
- * @to: the destination blkg_stat
- * @from: the source
- *
- * Add @from's count to @to.
- */
-static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
-{
- blkg_stat_add(to, blkg_stat_read(from));
-}
-
-static inline void blkg_rwstat_init(struct blkg_rwstat *rwstat)
-{
- u64_stats_init(&rwstat->syncp);
-}
-
-/**
- * blkg_rwstat_add - add a value to a blkg_rwstat
- * @rwstat: target blkg_rwstat
- * @rw: mask of REQ_{WRITE|SYNC}
- * @val: value to add
- *
- * Add @val to @rwstat. The counters are chosen according to @rw. The
- * caller is responsible for synchronizing calls to this function.
- */
-static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
- int rw, uint64_t val)
-{
- u64_stats_update_begin(&rwstat->syncp);
-
- if (rw & REQ_WRITE)
- rwstat->cnt[BLKG_RWSTAT_WRITE] += val;
- else
- rwstat->cnt[BLKG_RWSTAT_READ] += val;
- if (rw & REQ_SYNC)
- rwstat->cnt[BLKG_RWSTAT_SYNC] += val;
- else
- rwstat->cnt[BLKG_RWSTAT_ASYNC] += val;
-
- u64_stats_update_end(&rwstat->syncp);
-}
-
-/**
- * blkg_rwstat_read - read the current values of a blkg_rwstat
- * @rwstat: blkg_rwstat to read
- *
- * Read the current snapshot of @rwstat and return it as the return value.
- * This function can be called without synchronization and takes care of
- * u64 atomicity.
- */
-static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
-{
- unsigned int start;
- struct blkg_rwstat tmp;
-
- do {
- start = u64_stats_fetch_begin_irq(&rwstat->syncp);
- tmp = *rwstat;
- } while (u64_stats_fetch_retry_irq(&rwstat->syncp, start));
-
- return tmp;
-}
-
-/**
- * blkg_rwstat_total - read the total count of a blkg_rwstat
- * @rwstat: blkg_rwstat to read
- *
- * Return the total count of @rwstat regardless of the IO direction. This
- * function can be called without synchronization and takes care of u64
- * atomicity.
- */
-static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
-{
- struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
-
- return tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE];
-}
-
-/**
- * blkg_rwstat_reset - reset a blkg_rwstat
- * @rwstat: blkg_rwstat to reset
- */
-static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
-{
- memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
-}
-
-/**
- * blkg_rwstat_merge - merge a blkg_rwstat into another
- * @to: the destination blkg_rwstat
- * @from: the source
- *
- * Add @from's counts to @to.
- */
-static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
- struct blkg_rwstat *from)
-{
- struct blkg_rwstat v = blkg_rwstat_read(from);
- int i;
-
- u64_stats_update_begin(&to->syncp);
- for (i = 0; i < BLKG_RWSTAT_NR; i++)
- to->cnt[i] += v.cnt[i];
- u64_stats_update_end(&to->syncp);
-}
-
-#else /* CONFIG_BLK_CGROUP */
-
-struct cgroup;
-struct blkcg;
-
-struct blkg_policy_data {
-};
-
-struct blkcg_gq {
-};
-
-struct blkcg_policy {
-};
-
-static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
-static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
-static inline void blkcg_drain_queue(struct request_queue *q) { }
-static inline void blkcg_exit_queue(struct request_queue *q) { }
-static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
-static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
-static inline int blkcg_activate_policy(struct request_queue *q,
- const struct blkcg_policy *pol) { return 0; }
-static inline void blkcg_deactivate_policy(struct request_queue *q,
- const struct blkcg_policy *pol) { }
-
-static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
-
-static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
- struct blkcg_policy *pol) { return NULL; }
-static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
-static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
-static inline void blkg_get(struct blkcg_gq *blkg) { }
-static inline void blkg_put(struct blkcg_gq *blkg) { }
-
-static inline struct request_list *blk_get_rl(struct request_queue *q,
- struct bio *bio) { return &q->root_rl; }
-static inline void blk_put_rl(struct request_list *rl) { }
-static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { }
-static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
-
-#define blk_queue_for_each_rl(rl, q) \
- for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
-
-#endif /* CONFIG_BLK_CGROUP */
-#endif /* _BLK_CGROUP_H */
diff --git a/block/blk-core.c b/block/blk-core.c
index de474b5d..ed2427f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -32,12 +32,12 @@
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
+#include <linux/blk-cgroup.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>

#include "blk.h"
-#include "blk-cgroup.h"
#include "blk-mq.h"

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index faaf36a..5677eb7 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -8,9 +8,9 @@
#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
#include <linux/blk-mq.h>
+#include <linux/blk-cgroup.h>

#include "blk.h"
-#include "blk-cgroup.h"
#include "blk-mq.h"

struct queue_sysfs_entry {
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 5b9c6d5..b231935 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -9,7 +9,7 @@
#include <linux/blkdev.h>
#include <linux/bio.h>
#include <linux/blktrace_api.h>
-#include "blk-cgroup.h"
+#include <linux/blk-cgroup.h>
#include "blk.h"

/* Max dispatch from a group in 1 round */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index 5da8e6e..bc8f429 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -14,8 +14,8 @@
#include <linux/rbtree.h>
#include <linux/ioprio.h>
#include <linux/blktrace_api.h>
+#include <linux/blk-cgroup.h>
#include "blk.h"
-#include "blk-cgroup.h"

/*
* tunables
diff --git a/block/elevator.c b/block/elevator.c
index 59794d0..3bbb48f 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -35,11 +35,11 @@
#include <linux/hash.h>
#include <linux/uaccess.h>
#include <linux/pm_runtime.h>
+#include <linux/blk-cgroup.h>

#include <trace/events/block.h>

#include "blk.h"
-#include "blk-cgroup.h"

static DEFINE_SPINLOCK(elv_list_lock);
static LIST_HEAD(elv_list);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
new file mode 100644
index 0000000..c567865
--- /dev/null
+++ b/include/linux/blk-cgroup.h
@@ -0,0 +1,603 @@
+#ifndef _BLK_CGROUP_H
+#define _BLK_CGROUP_H
+/*
+ * Common Block IO controller cgroup interface
+ *
+ * Based on ideas and code from CFQ, CFS and BFQ:
+ * Copyright (C) 2003 Jens Axboe <[email protected]>
+ *
+ * Copyright (C) 2008 Fabio Checconi <[email protected]>
+ * Paolo Valente <[email protected]>
+ *
+ * Copyright (C) 2009 Vivek Goyal <[email protected]>
+ * Nauman Rafique <[email protected]>
+ */
+
+#include <linux/cgroup.h>
+#include <linux/u64_stats_sync.h>
+#include <linux/seq_file.h>
+#include <linux/radix-tree.h>
+#include <linux/blkdev.h>
+#include <linux/atomic.h>
+
+/* Max limits for throttle policy */
+#define THROTL_IOPS_MAX UINT_MAX
+
+/* CFQ specific, out here for blkcg->cfq_weight */
+#define CFQ_WEIGHT_MIN 10
+#define CFQ_WEIGHT_MAX 1000
+#define CFQ_WEIGHT_DEFAULT 500
+
+#ifdef CONFIG_BLK_CGROUP
+
+enum blkg_rwstat_type {
+ BLKG_RWSTAT_READ,
+ BLKG_RWSTAT_WRITE,
+ BLKG_RWSTAT_SYNC,
+ BLKG_RWSTAT_ASYNC,
+
+ BLKG_RWSTAT_NR,
+ BLKG_RWSTAT_TOTAL = BLKG_RWSTAT_NR,
+};
+
+struct blkcg_gq;
+
+struct blkcg {
+ struct cgroup_subsys_state css;
+ spinlock_t lock;
+
+ struct radix_tree_root blkg_tree;
+ struct blkcg_gq *blkg_hint;
+ struct hlist_head blkg_list;
+
+ /* TODO: per-policy storage in blkcg */
+ unsigned int cfq_weight; /* belongs to cfq */
+ unsigned int cfq_leaf_weight;
+};
+
+struct blkg_stat {
+ struct u64_stats_sync syncp;
+ uint64_t cnt;
+};
+
+struct blkg_rwstat {
+ struct u64_stats_sync syncp;
+ uint64_t cnt[BLKG_RWSTAT_NR];
+};
+
+/*
+ * A blkcg_gq (blkg) is association between a block cgroup (blkcg) and a
+ * request_queue (q). This is used by blkcg policies which need to track
+ * information per blkcg - q pair.
+ *
+ * There can be multiple active blkcg policies and each has its private
+ * data on each blkg, the size of which is determined by
+ * blkcg_policy->pd_size. blkcg core allocates and frees such areas
+ * together with blkg and invokes pd_init/exit_fn() methods.
+ *
+ * Such private data must embed struct blkg_policy_data (pd) at the
+ * beginning and pd_size can't be smaller than pd.
+ */
+struct blkg_policy_data {
+ /* the blkg and policy id this per-policy data belongs to */
+ struct blkcg_gq *blkg;
+ int plid;
+
+ /* used during policy activation */
+ struct list_head alloc_node;
+};
+
+/* association between a blk cgroup and a request queue */
+struct blkcg_gq {
+ /* Pointer to the associated request_queue */
+ struct request_queue *q;
+ struct list_head q_node;
+ struct hlist_node blkcg_node;
+ struct blkcg *blkcg;
+
+ /* all non-root blkcg_gq's are guaranteed to have access to parent */
+ struct blkcg_gq *parent;
+
+ /* request allocation list for this blkcg-q pair */
+ struct request_list rl;
+
+ /* reference count */
+ atomic_t refcnt;
+
+ /* is this blkg online? protected by both blkcg and q locks */
+ bool online;
+
+ struct blkg_policy_data *pd[BLKCG_MAX_POLS];
+
+ struct rcu_head rcu_head;
+};
+
+typedef void (blkcg_pol_init_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_online_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_offline_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_exit_pd_fn)(struct blkcg_gq *blkg);
+typedef void (blkcg_pol_reset_pd_stats_fn)(struct blkcg_gq *blkg);
+
+struct blkcg_policy {
+ int plid;
+ /* policy specific private data size */
+ size_t pd_size;
+ /* cgroup files for the policy */
+ struct cftype *cftypes;
+
+ /* operations */
+ blkcg_pol_init_pd_fn *pd_init_fn;
+ blkcg_pol_online_pd_fn *pd_online_fn;
+ blkcg_pol_offline_pd_fn *pd_offline_fn;
+ blkcg_pol_exit_pd_fn *pd_exit_fn;
+ blkcg_pol_reset_pd_stats_fn *pd_reset_stats_fn;
+};
+
+extern struct blkcg blkcg_root;
+
+struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
+struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
+ struct request_queue *q);
+int blkcg_init_queue(struct request_queue *q);
+void blkcg_drain_queue(struct request_queue *q);
+void blkcg_exit_queue(struct request_queue *q);
+
+/* Blkio controller policy registration */
+int blkcg_policy_register(struct blkcg_policy *pol);
+void blkcg_policy_unregister(struct blkcg_policy *pol);
+int blkcg_activate_policy(struct request_queue *q,
+ const struct blkcg_policy *pol);
+void blkcg_deactivate_policy(struct request_queue *q,
+ const struct blkcg_policy *pol);
+
+void blkcg_print_blkgs(struct seq_file *sf, struct blkcg *blkcg,
+ u64 (*prfill)(struct seq_file *,
+ struct blkg_policy_data *, int),
+ const struct blkcg_policy *pol, int data,
+ bool show_total);
+u64 __blkg_prfill_u64(struct seq_file *sf, struct blkg_policy_data *pd, u64 v);
+u64 __blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
+ const struct blkg_rwstat *rwstat);
+u64 blkg_prfill_stat(struct seq_file *sf, struct blkg_policy_data *pd, int off);
+u64 blkg_prfill_rwstat(struct seq_file *sf, struct blkg_policy_data *pd,
+ int off);
+
+u64 blkg_stat_recursive_sum(struct blkg_policy_data *pd, int off);
+struct blkg_rwstat blkg_rwstat_recursive_sum(struct blkg_policy_data *pd,
+ int off);
+
+struct blkg_conf_ctx {
+ struct gendisk *disk;
+ struct blkcg_gq *blkg;
+ u64 v;
+};
+
+int blkg_conf_prep(struct blkcg *blkcg, const struct blkcg_policy *pol,
+ const char *input, struct blkg_conf_ctx *ctx);
+void blkg_conf_finish(struct blkg_conf_ctx *ctx);
+
+
+static inline struct blkcg *css_to_blkcg(struct cgroup_subsys_state *css)
+{
+ return css ? container_of(css, struct blkcg, css) : NULL;
+}
+
+static inline struct blkcg *task_blkcg(struct task_struct *tsk)
+{
+ return css_to_blkcg(task_css(tsk, blkio_cgrp_id));
+}
+
+static inline struct blkcg *bio_blkcg(struct bio *bio)
+{
+ if (bio && bio->bi_css)
+ return css_to_blkcg(bio->bi_css);
+ return task_blkcg(current);
+}
+
+/**
+ * blkcg_parent - get the parent of a blkcg
+ * @blkcg: blkcg of interest
+ *
+ * Return the parent blkcg of @blkcg. Can be called anytime.
+ */
+static inline struct blkcg *blkcg_parent(struct blkcg *blkcg)
+{
+ return css_to_blkcg(blkcg->css.parent);
+}
+
+/**
+ * blkg_to_pd - get policy private data
+ * @blkg: blkg of interest
+ * @pol: policy of interest
+ *
+ * Return pointer to private data associated with the @blkg-@pol pair.
+ */
+static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
+ struct blkcg_policy *pol)
+{
+ return blkg ? blkg->pd[pol->plid] : NULL;
+}
+
+/**
+ * pd_to_blkg - get blkg associated with policy private data
+ * @pd: policy private data of interest
+ *
+ * @pd is policy private data. Determine the blkg it's associated with.
+ */
+static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd)
+{
+ return pd ? pd->blkg : NULL;
+}
+
+/**
+ * blkg_path - format cgroup path of blkg
+ * @blkg: blkg of interest
+ * @buf: target buffer
+ * @buflen: target buffer length
+ *
+ * Format the path of the cgroup of @blkg into @buf.
+ */
+static inline int blkg_path(struct blkcg_gq *blkg, char *buf, int buflen)
+{
+ char *p;
+
+ p = cgroup_path(blkg->blkcg->css.cgroup, buf, buflen);
+ if (!p) {
+ strncpy(buf, "<unavailable>", buflen);
+ return -ENAMETOOLONG;
+ }
+
+ memmove(buf, p, buf + buflen - p);
+ return 0;
+}
+
+/**
+ * blkg_get - get a blkg reference
+ * @blkg: blkg to get
+ *
+ * The caller should be holding an existing reference.
+ */
+static inline void blkg_get(struct blkcg_gq *blkg)
+{
+ WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
+ atomic_inc(&blkg->refcnt);
+}
+
+void __blkg_release_rcu(struct rcu_head *rcu);
+
+/**
+ * blkg_put - put a blkg reference
+ * @blkg: blkg to put
+ */
+static inline void blkg_put(struct blkcg_gq *blkg)
+{
+ WARN_ON_ONCE(atomic_read(&blkg->refcnt) <= 0);
+ if (atomic_dec_and_test(&blkg->refcnt))
+ call_rcu(&blkg->rcu_head, __blkg_release_rcu);
+}
+
+struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q,
+ bool update_hint);
+
+/**
+ * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_css: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Walk @d_blkg through the descendants of @p_blkg. Must be used with RCU
+ * read locked. If called under either blkcg or queue lock, the iteration
+ * is guaranteed to include all and only online blkgs. The caller may
+ * update @pos_css by calling css_rightmost_descendant() to skip subtree.
+ * @p_blkg is included in the iteration and the first node to be visited.
+ */
+#define blkg_for_each_descendant_pre(d_blkg, pos_css, p_blkg) \
+ css_for_each_descendant_pre((pos_css), &(p_blkg)->blkcg->css) \
+ if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
+ (p_blkg)->q, false)))
+
+/**
+ * blkg_for_each_descendant_post - post-order walk of a blkg's descendants
+ * @d_blkg: loop cursor pointing to the current descendant
+ * @pos_css: used for iteration
+ * @p_blkg: target blkg to walk descendants of
+ *
+ * Similar to blkg_for_each_descendant_pre() but performs post-order
+ * traversal instead. Synchronization rules are the same. @p_blkg is
+ * included in the iteration and the last node to be visited.
+ */
+#define blkg_for_each_descendant_post(d_blkg, pos_css, p_blkg) \
+ css_for_each_descendant_post((pos_css), &(p_blkg)->blkcg->css) \
+ if (((d_blkg) = __blkg_lookup(css_to_blkcg(pos_css), \
+ (p_blkg)->q, false)))
+
+/**
+ * blk_get_rl - get request_list to use
+ * @q: request_queue of interest
+ * @bio: bio which will be attached to the allocated request (may be %NULL)
+ *
+ * The caller wants to allocate a request from @q to use for @bio. Find
+ * the request_list to use and obtain a reference on it. Should be called
+ * under queue_lock. This function is guaranteed to return non-%NULL
+ * request_list.
+ */
+static inline struct request_list *blk_get_rl(struct request_queue *q,
+ struct bio *bio)
+{
+ struct blkcg *blkcg;
+ struct blkcg_gq *blkg;
+
+ rcu_read_lock();
+
+ blkcg = bio_blkcg(bio);
+
+ /* bypass blkg lookup and use @q->root_rl directly for root */
+ if (blkcg == &blkcg_root)
+ goto root_rl;
+
+ /*
+ * Try to use blkg->rl. blkg lookup may fail under memory pressure
+ * or if either the blkcg or queue is going away. Fall back to
+ * root_rl in such cases.
+ */
+ blkg = blkg_lookup_create(blkcg, q);
+ if (unlikely(IS_ERR(blkg)))
+ goto root_rl;
+
+ blkg_get(blkg);
+ rcu_read_unlock();
+ return &blkg->rl;
+root_rl:
+ rcu_read_unlock();
+ return &q->root_rl;
+}
+
+/**
+ * blk_put_rl - put request_list
+ * @rl: request_list to put
+ *
+ * Put the reference acquired by blk_get_rl(). Should be called under
+ * queue_lock.
+ */
+static inline void blk_put_rl(struct request_list *rl)
+{
+ /* root_rl may not have blkg set */
+ if (rl->blkg && rl->blkg->blkcg != &blkcg_root)
+ blkg_put(rl->blkg);
+}
+
+/**
+ * blk_rq_set_rl - associate a request with a request_list
+ * @rq: request of interest
+ * @rl: target request_list
+ *
+ * Associate @rq with @rl so that accounting and freeing can know the
+ * request_list @rq came from.
+ */
+static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl)
+{
+ rq->rl = rl;
+}
+
+/**
+ * blk_rq_rl - return the request_list a request came from
+ * @rq: request of interest
+ *
+ * Return the request_list @rq is allocated from.
+ */
+static inline struct request_list *blk_rq_rl(struct request *rq)
+{
+ return rq->rl;
+}
+
+struct request_list *__blk_queue_next_rl(struct request_list *rl,
+ struct request_queue *q);
+/**
+ * blk_queue_for_each_rl - iterate through all request_lists of a request_queue
+ *
+ * Should be used under queue_lock.
+ */
+#define blk_queue_for_each_rl(rl, q) \
+ for ((rl) = &(q)->root_rl; (rl); (rl) = __blk_queue_next_rl((rl), (q)))
+
+static inline void blkg_stat_init(struct blkg_stat *stat)
+{
+ u64_stats_init(&stat->syncp);
+}
+
+/**
+ * blkg_stat_add - add a value to a blkg_stat
+ * @stat: target blkg_stat
+ * @val: value to add
+ *
+ * Add @val to @stat. The caller is responsible for synchronizing calls to
+ * this function.
+ */
+static inline void blkg_stat_add(struct blkg_stat *stat, uint64_t val)
+{
+ u64_stats_update_begin(&stat->syncp);
+ stat->cnt += val;
+ u64_stats_update_end(&stat->syncp);
+}
+
+/**
+ * blkg_stat_read - read the current value of a blkg_stat
+ * @stat: blkg_stat to read
+ *
+ * Read the current value of @stat. This function can be called without
+ * synchronization and takes care of u64 atomicity.
+ */
+static inline uint64_t blkg_stat_read(struct blkg_stat *stat)
+{
+ unsigned int start;
+ uint64_t v;
+
+ do {
+ start = u64_stats_fetch_begin_irq(&stat->syncp);
+ v = stat->cnt;
+ } while (u64_stats_fetch_retry_irq(&stat->syncp, start));
+
+ return v;
+}
+
+/**
+ * blkg_stat_reset - reset a blkg_stat
+ * @stat: blkg_stat to reset
+ */
+static inline void blkg_stat_reset(struct blkg_stat *stat)
+{
+ stat->cnt = 0;
+}
+
+/**
+ * blkg_stat_merge - merge a blkg_stat into another
+ * @to: the destination blkg_stat
+ * @from: the source
+ *
+ * Add @from's count to @to.
+ */
+static inline void blkg_stat_merge(struct blkg_stat *to, struct blkg_stat *from)
+{
+ blkg_stat_add(to, blkg_stat_read(from));
+}
+
+static inline void blkg_rwstat_init(struct blkg_rwstat *rwstat)
+{
+ u64_stats_init(&rwstat->syncp);
+}
+
+/**
+ * blkg_rwstat_add - add a value to a blkg_rwstat
+ * @rwstat: target blkg_rwstat
+ * @rw: mask of REQ_{WRITE|SYNC}
+ * @val: value to add
+ *
+ * Add @val to @rwstat. The counters are chosen according to @rw. The
+ * caller is responsible for synchronizing calls to this function.
+ */
+static inline void blkg_rwstat_add(struct blkg_rwstat *rwstat,
+ int rw, uint64_t val)
+{
+ u64_stats_update_begin(&rwstat->syncp);
+
+ if (rw & REQ_WRITE)
+ rwstat->cnt[BLKG_RWSTAT_WRITE] += val;
+ else
+ rwstat->cnt[BLKG_RWSTAT_READ] += val;
+ if (rw & REQ_SYNC)
+ rwstat->cnt[BLKG_RWSTAT_SYNC] += val;
+ else
+ rwstat->cnt[BLKG_RWSTAT_ASYNC] += val;
+
+ u64_stats_update_end(&rwstat->syncp);
+}
+
+/**
+ * blkg_rwstat_read - read the current values of a blkg_rwstat
+ * @rwstat: blkg_rwstat to read
+ *
+ * Read the current snapshot of @rwstat and return it as the return value.
+ * This function can be called without synchronization and takes care of
+ * u64 atomicity.
+ */
+static inline struct blkg_rwstat blkg_rwstat_read(struct blkg_rwstat *rwstat)
+{
+ unsigned int start;
+ struct blkg_rwstat tmp;
+
+ do {
+ start = u64_stats_fetch_begin_irq(&rwstat->syncp);
+ tmp = *rwstat;
+ } while (u64_stats_fetch_retry_irq(&rwstat->syncp, start));
+
+ return tmp;
+}
+
+/**
+ * blkg_rwstat_total - read the total count of a blkg_rwstat
+ * @rwstat: blkg_rwstat to read
+ *
+ * Return the total count of @rwstat regardless of the IO direction. This
+ * function can be called without synchronization and takes care of u64
+ * atomicity.
+ */
+static inline uint64_t blkg_rwstat_total(struct blkg_rwstat *rwstat)
+{
+ struct blkg_rwstat tmp = blkg_rwstat_read(rwstat);
+
+ return tmp.cnt[BLKG_RWSTAT_READ] + tmp.cnt[BLKG_RWSTAT_WRITE];
+}
+
+/**
+ * blkg_rwstat_reset - reset a blkg_rwstat
+ * @rwstat: blkg_rwstat to reset
+ */
+static inline void blkg_rwstat_reset(struct blkg_rwstat *rwstat)
+{
+ memset(rwstat->cnt, 0, sizeof(rwstat->cnt));
+}
+
+/**
+ * blkg_rwstat_merge - merge a blkg_rwstat into another
+ * @to: the destination blkg_rwstat
+ * @from: the source
+ *
+ * Add @from's counts to @to.
+ */
+static inline void blkg_rwstat_merge(struct blkg_rwstat *to,
+ struct blkg_rwstat *from)
+{
+ struct blkg_rwstat v = blkg_rwstat_read(from);
+ int i;
+
+ u64_stats_update_begin(&to->syncp);
+ for (i = 0; i < BLKG_RWSTAT_NR; i++)
+ to->cnt[i] += v.cnt[i];
+ u64_stats_update_end(&to->syncp);
+}
+
+#else /* CONFIG_BLK_CGROUP */
+
+struct cgroup;
+struct blkcg;
+
+struct blkg_policy_data {
+};
+
+struct blkcg_gq {
+};
+
+struct blkcg_policy {
+};
+
+static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
+static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
+static inline void blkcg_drain_queue(struct request_queue *q) { }
+static inline void blkcg_exit_queue(struct request_queue *q) { }
+static inline int blkcg_policy_register(struct blkcg_policy *pol) { return 0; }
+static inline void blkcg_policy_unregister(struct blkcg_policy *pol) { }
+static inline int blkcg_activate_policy(struct request_queue *q,
+ const struct blkcg_policy *pol) { return 0; }
+static inline void blkcg_deactivate_policy(struct request_queue *q,
+ const struct blkcg_policy *pol) { }
+
+static inline struct blkcg *bio_blkcg(struct bio *bio) { return NULL; }
+
+static inline struct blkg_policy_data *blkg_to_pd(struct blkcg_gq *blkg,
+ struct blkcg_policy *pol) { return NULL; }
+static inline struct blkcg_gq *pd_to_blkg(struct blkg_policy_data *pd) { return NULL; }
+static inline char *blkg_path(struct blkcg_gq *blkg) { return NULL; }
+static inline void blkg_get(struct blkcg_gq *blkg) { }
+static inline void blkg_put(struct blkcg_gq *blkg) { }
+
+static inline struct request_list *blk_get_rl(struct request_queue *q,
+ struct bio *bio) { return &q->root_rl; }
+static inline void blk_put_rl(struct request_list *rl) { }
+static inline void blk_rq_set_rl(struct request *rq, struct request_list *rl) { }
+static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q->root_rl; }
+
+#define blk_queue_for_each_rl(rl, q) \
+ for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)
+
+#endif /* CONFIG_BLK_CGROUP */
+#endif /* _BLK_CGROUP_H */
--
2.4.0

2015-05-22 21:32:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/51] update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h

The header file will be used more widely with the pending cgroup
writeback support, and the current set of dummy declarations isn't
enough to handle different config combinations. Update as follows.

* Drop the struct cgroup declaration. None of the dummy defs need it.

* Define blkcg as an empty struct instead of just declaring it.

* Wrap dummy function defs in CONFIG_BLOCK. Some functions use block
data types and none of them are to be used w/o block enabled.
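
The resulting structure of the dummy section looks roughly like the
following (sketch only; the exact content is in the diff below):

	#else	/* CONFIG_BLK_CGROUP */

	struct blkcg {
	};

	/* ... other empty struct definitions ... */

	#ifdef CONFIG_BLOCK

	/* dummy functions which reference block data types */

	#endif	/* CONFIG_BLOCK */
	#endif	/* CONFIG_BLK_CGROUP */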

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/blk-cgroup.h | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index c567865..51f95b3 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -558,8 +558,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to,

#else /* CONFIG_BLK_CGROUP */

-struct cgroup;
-struct blkcg;
+struct blkcg {
+};

struct blkg_policy_data {
};
@@ -570,6 +570,8 @@ struct blkcg_gq {
struct blkcg_policy {
};

+#ifdef CONFIG_BLOCK
+
static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
static inline void blkcg_drain_queue(struct request_queue *q) { }
@@ -599,5 +601,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q
#define blk_queue_for_each_rl(rl, q) \
for ((rl) = &(q)->root_rl; (rl); (rl) = NULL)

+#endif /* CONFIG_BLOCK */
#endif /* CONFIG_BLK_CGROUP */
#endif /* _BLK_CGROUP_H */
--
2.4.0

2015-05-22 21:31:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/51] blkcg: always create the blkcg_gq for the root blkcg

Currently, blkcg does a minor optimization where the root blkcg_gq is
created when the first blkcg policy is activated on a queue and
destroyed on the deactivation of the last. On systems where blkcg is
configured but not used, this saves one blkcg_gq struct per queue. On
systems where blkcg is actually used, there's no difference. The only
case where this can lead to any meaningful, albeit still minute, saving
in memory consumption is when all blkcg policies are deactivated after
being widely used in the system, which is a highly unlikely scenario.

The conditional existence of root blkcg_gq has already created several
bugs in blkcg and became an issue once again for the new per-cgroup
wb_congested mechanism for cgroup writeback support, leading to a NULL
dereference when no blkcg policy is active. This is really not worth
bothering with. This patch makes blkcg always allocate and link the
root blkcg_gq and release it only on queue destruction.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Fengguang Wu <[email protected]>
---
block/blk-cgroup.c | 96 +++++++++++++++++++++++-------------------------------
1 file changed, 41 insertions(+), 55 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index c3226ce..2a4f77f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -235,13 +235,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg->online = true;
spin_unlock(&blkcg->lock);

- if (!ret) {
- if (blkcg == &blkcg_root) {
- q->root_blkg = blkg;
- q->root_rl.blkg = blkg;
- }
+ if (!ret)
return blkg;
- }

/* @blkg failed fully initialized, use the usual release path */
blkg_put(blkg);
@@ -340,15 +335,6 @@ static void blkg_destroy(struct blkcg_gq *blkg)
rcu_assign_pointer(blkcg->blkg_hint, NULL);

/*
- * If root blkg is destroyed. Just clear the pointer since root_rl
- * does not take reference on root blkg.
- */
- if (blkcg == &blkcg_root) {
- blkg->q->root_blkg = NULL;
- blkg->q->root_rl.blkg = NULL;
- }
-
- /*
* Put the reference taken at the time of creation so that when all
* queues are gone, group can be destroyed.
*/
@@ -855,9 +841,45 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
*/
int blkcg_init_queue(struct request_queue *q)
{
- might_sleep();
+ struct blkcg_gq *new_blkg, *blkg;
+ bool preloaded;
+ int ret;
+
+ new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
+ if (!new_blkg)
+ return -ENOMEM;
+
+ preloaded = !radix_tree_preload(GFP_KERNEL);

- return blk_throtl_init(q);
+ /*
+ * Make sure the root blkg exists and count the existing blkgs. As
+ * @q is bypassing at this point, blkg_lookup_create() can't be
+ * used. Open code insertion.
+ */
+ rcu_read_lock();
+ spin_lock_irq(q->queue_lock);
+ blkg = blkg_create(&blkcg_root, q, new_blkg);
+ spin_unlock_irq(q->queue_lock);
+ rcu_read_unlock();
+
+ if (preloaded)
+ radix_tree_preload_end();
+
+ if (IS_ERR(blkg)) {
+ kfree(new_blkg);
+ return PTR_ERR(blkg);
+ }
+
+ q->root_blkg = blkg;
+ q->root_rl.blkg = blkg;
+
+ ret = blk_throtl_init(q);
+ if (ret) {
+ spin_lock_irq(q->queue_lock);
+ blkg_destroy_all(q);
+ spin_unlock_irq(q->queue_lock);
+ }
+ return ret;
}

/**
@@ -958,52 +980,20 @@ int blkcg_activate_policy(struct request_queue *q,
const struct blkcg_policy *pol)
{
LIST_HEAD(pds);
- struct blkcg_gq *blkg, *new_blkg;
+ struct blkcg_gq *blkg;
struct blkg_policy_data *pd, *n;
int cnt = 0, ret;
- bool preloaded;

if (blkcg_policy_enabled(q, pol))
return 0;

- /* preallocations for root blkg */
- new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL);
- if (!new_blkg)
- return -ENOMEM;
-
+ /* count and allocate policy_data for all existing blkgs */
blk_queue_bypass_start(q);
-
- preloaded = !radix_tree_preload(GFP_KERNEL);
-
- /*
- * Make sure the root blkg exists and count the existing blkgs. As
- * @q is bypassing at this point, blkg_lookup_create() can't be
- * used. Open code it.
- */
spin_lock_irq(q->queue_lock);
-
- rcu_read_lock();
- blkg = __blkg_lookup(&blkcg_root, q, false);
- if (blkg)
- blkg_free(new_blkg);
- else
- blkg = blkg_create(&blkcg_root, q, new_blkg);
- rcu_read_unlock();
-
- if (preloaded)
- radix_tree_preload_end();
-
- if (IS_ERR(blkg)) {
- ret = PTR_ERR(blkg);
- goto out_unlock;
- }
-
list_for_each_entry(blkg, &q->blkg_list, q_node)
cnt++;
-
spin_unlock_irq(q->queue_lock);

- /* allocate policy_data for all existing blkgs */
while (cnt--) {
pd = kzalloc_node(pol->pd_size, GFP_KERNEL, q->node);
if (!pd) {
@@ -1072,10 +1062,6 @@ void blkcg_deactivate_policy(struct request_queue *q,

__clear_bit(pol->plid, q->blkcg_pols);

- /* if no policy is left, no need for blkgs - shoot them down */
- if (bitmap_empty(q->blkcg_pols, BLKCG_MAX_POLS))
- blkg_destroy_all(q);
-
list_for_each_entry(blkg, &q->blkg_list, q_node) {
/* grab blkcg lock too while removing @pd from @blkg */
spin_lock(&blkg->blkcg->lock);
--
2.4.0

2015-05-22 21:31:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/51] memcg: add mem_cgroup_root_css

Add global mem_cgroup_root_css which points to the root memcg css.
This will be used by cgroup writeback support. If memcg is disabled,
it's defined as ERR_PTR(-EINVAL).
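
As a rough usage sketch (the caller below is hypothetical and only
illustrates that the pointer must not be dereferenced blindly when
memcg may be compiled out):

	/* ERR_PTR(-EINVAL) when !CONFIG_MEMCG */
	if (!IS_ERR(mem_cgroup_root_css))
		use_memcg_css(mem_cgroup_root_css);	/* hypothetical user */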

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 4 ++++
mm/memcontrol.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5fe6411..294498f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -68,6 +68,8 @@ enum mem_cgroup_events_index {
};

#ifdef CONFIG_MEMCG
+extern struct cgroup_subsys_state *mem_cgroup_root_css;
+
void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr);
@@ -196,6 +198,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
#else /* CONFIG_MEMCG */
struct mem_cgroup;

+#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
static inline void mem_cgroup_events(struct mem_cgroup *memcg,
enum mem_cgroup_events_index idx,
unsigned int nr)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c23c1a3..b22a92b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);

#define MEM_CGROUP_RECLAIM_RETRIES 5
static struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;

/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
@@ -4441,6 +4442,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
/* root ? */
if (parent_css == NULL) {
root_mem_cgroup = memcg;
+ mem_cgroup_root_css = &memcg->css;
page_counter_init(&memcg->memory, NULL);
memcg->high = PAGE_COUNTER_MAX;
memcg->soft_limit = PAGE_COUNTER_MAX;
--
2.4.0

2015-05-22 21:30:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/51] blkcg: add blkcg_root_css

Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support. If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).

v2: The declarations moved to include/linux/blk-cgroup.h as suggested
by Vivek.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-cgroup.c | 2 ++
include/linux/blk-cgroup.h | 3 +++
2 files changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 2a4f77f..54ec172 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -30,6 +30,8 @@ struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT,
.cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, };
EXPORT_SYMBOL_GPL(blkcg_root);

+struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css;
+
static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS];

static bool blkcg_policy_enabled(struct request_queue *q,
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 51f95b3..65f0c17 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -134,6 +134,7 @@ struct blkcg_policy {
};

extern struct blkcg blkcg_root;
+extern struct cgroup_subsys_state * const blkcg_root_css;

struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg,
@@ -570,6 +571,8 @@ struct blkcg_gq {
struct blkcg_policy {
};

+#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
+
#ifdef CONFIG_BLOCK

static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
--
2.4.0

2015-05-22 21:14:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/51] cgroup, block: implement task_get_css() and use it in bio_associate_current()

bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css. Abstract it
into task_get_css() which is implemented from cgroup side. As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.

This is a cleanup and shouldn't lead to noticeable behavior changes.
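
For reference, the resulting caller-side pattern is (sketch; css_put()
is the usual way to drop the reference afterwards):

	struct cgroup_subsys_state *css;

	css = task_get_css(current, blkio_cgrp_id);	/* returns a pinned css */
	/* ... use css ... */
	css_put(css);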

Signed-off-by: Tejun Heo <[email protected]>
Cc: Li Zefan <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/bio.c | 11 +----------
include/linux/cgroup.h | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index c2ff8a8..cb7faac 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -2011,7 +2011,6 @@ EXPORT_SYMBOL(bioset_create_nobvec);
int bio_associate_current(struct bio *bio)
{
struct io_context *ioc;
- struct cgroup_subsys_state *css;

if (bio->bi_ioc)
return -EBUSY;
@@ -2020,17 +2019,9 @@ int bio_associate_current(struct bio *bio)
if (!ioc)
return -ENOENT;

- /* acquire active ref on @ioc and associate */
get_io_context_active(ioc);
bio->bi_ioc = ioc;
-
- /* associate blkcg if exists */
- rcu_read_lock();
- css = task_css(current, blkio_cgrp_id);
- if (css && css_tryget_online(css))
- bio->bi_css = css;
- rcu_read_unlock();
-
+ bio->bi_css = task_get_css(current, blkio_cgrp_id);
return 0;
}

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index b9cb94c..e7da0aa 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -774,6 +774,31 @@ static inline struct cgroup_subsys_state *task_css(struct task_struct *task,
}

/**
+ * task_get_css - find and get the css for (task, subsys)
+ * @task: the target task
+ * @subsys_id: the target subsystem ID
+ *
+ * Find the css for the (@task, @subsys_id) combination, increment a
+ * reference on and return it. This function is guaranteed to return a
+ * valid css.
+ */
+static inline struct cgroup_subsys_state *
+task_get_css(struct task_struct *task, int subsys_id)
+{
+ struct cgroup_subsys_state *css;
+
+ rcu_read_lock();
+ while (true) {
+ css = task_css(task, subsys_id);
+ if (likely(css_tryget_online(css)))
+ break;
+ cpu_relax();
+ }
+ rcu_read_unlock();
+ return css;
+}
+
+/**
* task_css_is_root - test whether a task belongs to the root css
* @task: the target task
* @subsys_id: the target subsystem ID
--
2.4.0

2015-05-22 21:14:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/51] blkcg: implement task_get_blkcg_css()

Implement a wrapper around task_get_css() to acquire the blkcg css for
a given task. The wrapper is necessary for cgroup writeback support
as there will be places outside blkcg proper trying to acquire
blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/blk-cgroup.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 65f0c17..4dc643f 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -195,6 +195,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio)
return task_blkcg(current);
}

+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+ return task_get_css(task, blkio_cgrp_id);
+}
+
/**
* blkcg_parent - get the parent of a blkcg
* @blkcg: blkcg of interest
@@ -573,6 +579,12 @@ struct blkcg_policy {

#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))

+static inline struct cgroup_subsys_state *
+task_get_blkcg_css(struct task_struct *task)
+{
+ return NULL;
+}
+
#ifdef CONFIG_BLOCK

static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; }
--
2.4.0

2015-05-22 21:30:31

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/51] blkcg: implement bio_associate_blkcg()

Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current(). This is too restrictive
for cgroup writeback support. Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.

bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.
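
A minimal usage sketch follows; the caller, @blkcg_css and nr_vecs
shown here are hypothetical and only illustrate the intended calling
convention:

	struct bio *bio = bio_alloc(GFP_NOFS, nr_vecs);

	if (bio) {
		bio_associate_blkcg(bio, blkcg_css);	/* grabs an extra css ref */
		/* ... fill in the bio ... */
		submit_bio(WRITE, bio);	/* ref is put when the bio is released */
	}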

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/bio.c | 24 +++++++++++++++++++++++-
include/linux/bio.h | 3 +++
2 files changed, 26 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index cb7faac..494ffdb 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1995,6 +1995,28 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_
EXPORT_SYMBOL(bioset_create_nobvec);

#ifdef CONFIG_BLK_CGROUP
+
+/**
+ * bio_associate_blkcg - associate a bio with the specified blkcg
+ * @bio: target bio
+ * @blkcg_css: css of the blkcg to associate
+ *
+ * Associate @bio with the blkcg specified by @blkcg_css. Block layer will
+ * treat @bio as if it were issued by a task which belongs to the blkcg.
+ *
+ * This function takes an extra reference of @blkcg_css which will be put
+ * when @bio is released. The caller must own @bio and is responsible for
+ * synchronizing calls to this function.
+ */
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css)
+{
+ if (unlikely(bio->bi_css))
+ return -EBUSY;
+ css_get(blkcg_css);
+ bio->bi_css = blkcg_css;
+ return 0;
+}
+
/**
* bio_associate_current - associate a bio with %current
* @bio: target bio
@@ -2012,7 +2034,7 @@ int bio_associate_current(struct bio *bio)
{
struct io_context *ioc;

- if (bio->bi_ioc)
+ if (bio->bi_css)
return -EBUSY;

ioc = current->io_context;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 7486ea1..14260d1 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -483,9 +483,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int);
extern unsigned int bvec_nr_vecs(unsigned short idx);

#ifdef CONFIG_BLK_CGROUP
+int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css);
int bio_associate_current(struct bio *bio);
void bio_disassociate_task(struct bio *bio);
#else /* CONFIG_BLK_CGROUP */
+static inline int bio_associate_blkcg(struct bio *bio,
+ struct cgroup_subsys_state *blkcg_css) { return 0; }
static inline int bio_associate_current(struct bio *bio) { return -ENOENT; }
static inline void bio_disassociate_task(struct bio *bio) { }
#endif /* CONFIG_BLK_CGROUP */
--
2.4.0

2015-05-22 21:14:47

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/51] memcg: implement mem_cgroup_css_from_page()

Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page. This
will be used by cgroup writeback support.
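
Usage sketch (the helper invoked at the end is hypothetical; it only
shows where the returned css is meant to flow):

	struct cgroup_subsys_state *memcg_css = mem_cgroup_css_from_page(page);

	/* valid as long as @page is pinned */
	wb = lookup_wb_for_css(memcg_css);	/* hypothetical writeback-side helper */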

Signed-off-by: Tejun Heo <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 14 ++++++++++++++
2 files changed, 15 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 294498f..637ef62 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -115,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
}

extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg);
+extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);

struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup *,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b22a92b..763f8f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -598,6 +598,20 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg)
return &memcg->css;
}

+/**
+ * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * @page: page of interest
+ *
+ * This function is guaranteed to return a valid cgroup_subsys_state and
+ * the returned css remains accessible until @page is released.
+ */
+struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+{
+ if (page->mem_cgroup)
+ return &page->mem_cgroup->css;
+ return &root_mem_cgroup->css;
+}
+
static struct mem_cgroup_per_zone *
mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page)
{
--
2.4.0

2015-05-22 21:29:50

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/51] writeback: move backing_dev_info->state into bdi_writeback

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeroing of
wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
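
For example, a test of the registered bit changes as follows (sketch
of the mechanical conversion; both forms refer to the embedded wb):

	/* before */
	if (test_bit(BDI_registered, &bdi->state))

	/* after */
	if (test_bit(WB_registered, &bdi->wb.state))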

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: [email protected]
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
---
block/blk-core.c | 1 -
drivers/block/drbd/drbd_main.c | 10 +++++-----
drivers/md/dm.c | 2 +-
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 2 +-
fs/fs-writeback.c | 14 +++++++-------
include/linux/backing-dev.h | 24 ++++++++++++------------
mm/backing-dev.c | 20 ++++++++++----------
8 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index ed2427f..f46688f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -620,7 +620,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
- q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = 0;
q->backing_dev_info.name = "block";
q->node = node_id;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 81fde9e..a151853 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2359,7 +2359,7 @@ static void drbd_cleanup(void)
* @congested_data: User data
* @bdi_bits: Bits the BDI flusher thread is currently interested in
*
- * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
+ * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
*/
static int drbd_congested(void *congested_data, int bdi_bits)
{
@@ -2376,14 +2376,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
}

if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
- r |= (1 << BDI_async_congested);
+ r |= (1 << WB_async_congested);
/* Without good local data, we would need to read from remote,
* and that would need the worker thread as well, which is
* currently blocked waiting for that usermode helper to
* finish.
*/
if (!get_ldev_if_state(device, D_UP_TO_DATE))
- r |= (1 << BDI_sync_congested);
+ r |= (1 << WB_sync_congested);
else
put_ldev(device);
r &= bdi_bits;
@@ -2399,9 +2399,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
reason = 'b';
}

- if (bdi_bits & (1 << BDI_async_congested) &&
+ if (bdi_bits & (1 << WB_async_congested) &&
test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
- r |= (1 << BDI_async_congested);
+ r |= (1 << WB_async_congested);
reason = reason == 'b' ? 'a' : 'n';
}

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index a930b72..081fb1e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2162,7 +2162,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
* the query about congestion status of request_queue
*/
if (dm_request_based(md))
- r = md->queue->backing_dev_info.state &
+ r = md->queue->backing_dev_info.wb.state &
bdi_bits;
else
r = dm_table_any_congested(map, bdi_bits);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9157a29..f80f1af 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -745,7 +745,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
struct r1conf *conf = mddev->private;
int i, ret = 0;

- if ((bits & (1 << BDI_async_congested)) &&
+ if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;

@@ -760,7 +760,7 @@ static int raid1_congested(struct mddev *mddev, int bits)
/* Note the '|| 1' - when read_balance prefers
* non-congested targets, it can be removed
*/
- if ((bits & (1<<BDI_async_congested)) || 1)
+ if ((bits & (1 << WB_async_congested)) || 1)
ret |= bdi_congested(&q->backing_dev_info, bits);
else
ret &= bdi_congested(&q->backing_dev_info, bits);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index e793ab6..fca8257 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -914,7 +914,7 @@ static int raid10_congested(struct mddev *mddev, int bits)
struct r10conf *conf = mddev->private;
int i, ret = 0;

- if ((bits & (1 << BDI_async_congested)) &&
+ if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 32a8bbd..983312c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -74,7 +74,7 @@ unsigned int dirtytime_expire_interval = 12 * 60 * 60;
*/
int writeback_in_progress(struct backing_dev_info *bdi)
{
- return test_bit(BDI_writeback_running, &bdi->state);
+ return test_bit(WB_writeback_running, &bdi->wb.state);
}
EXPORT_SYMBOL(writeback_in_progress);

@@ -112,7 +112,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
static void bdi_wakeup_thread(struct backing_dev_info *bdi)
{
spin_lock_bh(&bdi->wb_lock);
- if (test_bit(BDI_registered, &bdi->state))
+ if (test_bit(WB_registered, &bdi->wb.state))
mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
spin_unlock_bh(&bdi->wb_lock);
}
@@ -123,7 +123,7 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
trace_writeback_queue(bdi, work);

spin_lock_bh(&bdi->wb_lock);
- if (!test_bit(BDI_registered, &bdi->state)) {
+ if (!test_bit(WB_registered, &bdi->wb.state)) {
if (work->done)
complete(work->done);
goto out_unlock;
@@ -1057,7 +1057,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
struct wb_writeback_work *work;
long wrote = 0;

- set_bit(BDI_writeback_running, &wb->bdi->state);
+ set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(bdi)) != NULL) {

trace_writeback_exec(bdi, work);
@@ -1079,7 +1079,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
*/
wrote += wb_check_old_data_flush(wb);
wrote += wb_check_background_flush(wb);
- clear_bit(BDI_writeback_running, &wb->bdi->state);
+ clear_bit(WB_writeback_running, &wb->state);

return wrote;
}
@@ -1099,7 +1099,7 @@ void bdi_writeback_workfn(struct work_struct *work)
current->flags |= PF_SWAPWRITE;

if (likely(!current_is_workqueue_rescuer() ||
- !test_bit(BDI_registered, &bdi->state))) {
+ !test_bit(WB_registered, &wb->state))) {
/*
* The normal path. Keep writing back @bdi until its
* work_list is empty. Note that this path is also taken
@@ -1323,7 +1323,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.list_lock);
if (bdi_cap_writeback_dirty(bdi)) {
- WARN(!test_bit(BDI_registered, &bdi->state),
+ WARN(!test_bit(WB_registered, &bdi->wb.state),
"bdi-%s not registered\n", bdi->name);

/*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index aff923a..eb14f98 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -25,13 +25,13 @@ struct device;
struct dentry;

/*
- * Bits in backing_dev_info.state
+ * Bits in bdi_writeback.state
*/
-enum bdi_state {
- BDI_async_congested, /* The async (write) queue is getting full */
- BDI_sync_congested, /* The sync queue is getting full */
- BDI_registered, /* bdi_register() was done */
- BDI_writeback_running, /* Writeback is in progress */
+enum wb_state {
+ WB_async_congested, /* The async (write) queue is getting full */
+ WB_sync_congested, /* The sync queue is getting full */
+ WB_registered, /* bdi_register() was done */
+ WB_writeback_running, /* Writeback is in progress */
};

typedef int (congested_fn)(void *, int);
@@ -49,6 +49,7 @@ enum bdi_stat_item {
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */

+ unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */

struct delayed_work dwork; /* work item used for writeback */
@@ -62,7 +63,6 @@ struct bdi_writeback {
struct backing_dev_info {
struct list_head bdi_list;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data; /* Pointer to aux data for congested func */
@@ -250,23 +250,23 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
return bdi->congested_fn(bdi->congested_data, bdi_bits);
- return (bdi->state & bdi_bits);
+ return (bdi->wb.state & bdi_bits);
}

static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_sync_congested);
+ return bdi_congested(bdi, 1 << WB_sync_congested);
}

static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_async_congested);
+ return bdi_congested(bdi, 1 << WB_async_congested);
}

static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, (1 << BDI_sync_congested) |
- (1 << BDI_async_congested));
+ return bdi_congested(bdi, (1 << WB_sync_congested) |
+ (1 << WB_async_congested));
}

enum {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6dc4580..b23cf0e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -96,7 +96,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_io,
nr_more_io,
nr_dirty_time,
- !list_empty(&bdi->bdi_list), bdi->state);
+ !list_empty(&bdi->bdi_list), bdi->wb.state);
#undef K

return 0;
@@ -280,7 +280,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)

timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
spin_lock_bh(&bdi->wb_lock);
- if (test_bit(BDI_registered, &bdi->state))
+ if (test_bit(WB_registered, &bdi->wb.state))
queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
spin_unlock_bh(&bdi->wb_lock);
}
@@ -315,7 +315,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
bdi->dev = dev;

bdi_debug_register(bdi, dev_name(dev));
- set_bit(BDI_registered, &bdi->state);
+ set_bit(WB_registered, &bdi->wb.state);

spin_lock_bh(&bdi_lock);
list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
@@ -339,7 +339,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
{
/* Make sure nobody queues further work */
spin_lock_bh(&bdi->wb_lock);
- if (!test_and_clear_bit(BDI_registered, &bdi->state)) {
+ if (!test_and_clear_bit(WB_registered, &bdi->wb.state)) {
spin_unlock_bh(&bdi->wb_lock);
return;
}
@@ -492,11 +492,11 @@ static atomic_t nr_bdi_congested[2];

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
- enum bdi_state bit;
+ enum wb_state bit;
wait_queue_head_t *wqh = &congestion_wqh[sync];

- bit = sync ? BDI_sync_congested : BDI_async_congested;
- if (test_and_clear_bit(bit, &bdi->state))
+ bit = sync ? WB_sync_congested : WB_async_congested;
+ if (test_and_clear_bit(bit, &bdi->wb.state))
atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
@@ -506,10 +506,10 @@ EXPORT_SYMBOL(clear_bdi_congested);

void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
- enum bdi_state bit;
+ enum wb_state bit;

- bit = sync ? BDI_sync_congested : BDI_async_congested;
- if (!test_and_set_bit(bit, &bdi->state))
+ bit = sync ? WB_sync_congested : WB_async_congested;
+ if (!test_and_set_bit(bit, &bdi->wb.state))
atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
--
2.4.0

2015-05-22 21:28:53

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 13/51] writeback: move backing_dev_info->bdi_stat[] into bdi_writeback

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec}_bdi_stat(bdi, ...) -> [__]{add|inc|dec}_wb_stat(wb, ...),
[__]bdi_stat_sum() -> [__]wb_stat_sum()

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat init is moved to bdi_wb_init(), and bdi_wb_exit() is added to
free the stat counters.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Trond Myklebust <[email protected]>
---
fs/fs-writeback.c | 2 +-
fs/fuse/file.c | 12 ++++----
fs/nfs/internal.h | 2 +-
fs/nfs/write.c | 3 +-
include/linux/backing-dev.h | 68 +++++++++++++++++++++------------------------
mm/backing-dev.c | 60 ++++++++++++++++++++++++---------------
mm/page-writeback.c | 55 ++++++++++++++++++------------------
7 files changed, 106 insertions(+), 96 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 983312c..8873ecd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -840,7 +840,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

- if (bdi_stat(bdi, BDI_RECLAIMABLE) >
+ if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
bdi_dirty_limit(bdi, background_thresh))
return true;

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5ef05b5..8c5e2fa 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1445,9 +1445,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)

list_del(&req->writepages_entry);
for (i = 0; i < req->num_pages; i++) {
- dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
- bdi_writeout_inc(bdi);
+ wb_writeout_inc(&bdi->wb);
}
wake_up(&fi->page_waitq);
}
@@ -1634,7 +1634,7 @@ static int fuse_writepage_locked(struct page *page)
req->end = fuse_writepage_end;
req->inode = inode;

- inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+ inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

spin_lock(&fc->lock);
@@ -1749,9 +1749,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
copy_highpage(old_req->pages[0], page);
spin_unlock(&fc->lock);

- dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
- bdi_writeout_inc(bdi);
+ wb_writeout_inc(&bdi->wb);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
@@ -1848,7 +1848,7 @@ static int fuse_writepages_fill(struct page *page,
req->page_descs[req->num_pages].offset = 0;
req->page_descs[req->num_pages].length = PAGE_SIZE;

- inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK);
+ inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

err = 0;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index 9e6475b..7e3c460 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page)
struct inode *inode = page_file_mapping(page)->host;

inc_zone_page_state(page, NR_UNSTABLE_NFS);
- inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE);
+ inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
}

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index d12a4be..94c7ce0 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -853,7 +853,8 @@ static void
nfs_clear_page_commit(struct page *page)
{
dec_zone_page_state(page, NR_UNSTABLE_NFS);
- dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE);
+ dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
+ WB_RECLAIMABLE);
}

/* Called holding inode (/cinfo) lock */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index eb14f98..fe7a907 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -36,15 +36,15 @@ enum wb_state {

typedef int (congested_fn)(void *, int);

-enum bdi_stat_item {
- BDI_RECLAIMABLE,
- BDI_WRITEBACK,
- BDI_DIRTIED,
- BDI_WRITTEN,
- NR_BDI_STAT_ITEMS
+enum wb_stat_item {
+ WB_RECLAIMABLE,
+ WB_WRITEBACK,
+ WB_DIRTIED,
+ WB_WRITTEN,
+ NR_WB_STAT_ITEMS
};

-#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */
@@ -58,6 +58,8 @@ struct bdi_writeback {
struct list_head b_more_io; /* parked for more writeback */
struct list_head b_dirty_time; /* time stamps are dirty */
spinlock_t list_lock; /* protects the b_* lists */
+
+ struct percpu_counter stat[NR_WB_STAT_ITEMS];
};

struct backing_dev_info {
@@ -69,8 +71,6 @@ struct backing_dev_info {

char *name;

- struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
-
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
@@ -137,78 +137,74 @@ static inline int wb_has_dirty_io(struct bdi_writeback *wb)
!list_empty(&wb->b_more_io);
}

-static inline void __add_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item, s64 amount)
+static inline void __add_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item, s64 amount)
{
- __percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+ __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
}

-static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void __inc_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- __add_bdi_stat(bdi, item, 1);
+ __add_wb_stat(wb, item, 1);
}

-static inline void inc_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;

local_irq_save(flags);
- __inc_bdi_stat(bdi, item);
+ __inc_wb_stat(wb, item);
local_irq_restore(flags);
}

-static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void __dec_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- __add_bdi_stat(bdi, item, -1);
+ __add_wb_stat(wb, item, -1);
}

-static inline void dec_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;

local_irq_save(flags);
- __dec_bdi_stat(bdi, item);
+ __dec_wb_stat(wb, item);
local_irq_restore(flags);
}

-static inline s64 bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+ return percpu_counter_read_positive(&wb->stat[item]);
}

-static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+ return percpu_counter_sum_positive(&wb->stat[item]);
}

-static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
{
s64 sum;
unsigned long flags;

local_irq_save(flags);
- sum = __bdi_stat_sum(bdi, item);
+ sum = __wb_stat_sum(wb, item);
local_irq_restore(flags);

return sum;
}

-extern void bdi_writeout_inc(struct backing_dev_info *bdi);
+extern void wb_writeout_inc(struct bdi_writeback *wb);

/*
* maximal error of a stat counter.
*/
-static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+static inline unsigned long wb_stat_error(struct bdi_writeback *wb)
{
#ifdef CONFIG_SMP
- return nr_cpu_ids * BDI_STAT_BATCH;
+ return nr_cpu_ids * WB_STAT_BATCH;
#else
return 1;
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b23cf0e..7b1d191 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -84,13 +84,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"b_dirty_time: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
- (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
- (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+ (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
+ (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
K(bdi_thresh),
K(dirty_thresh),
K(background_thresh),
- (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
- (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
+ (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
(unsigned long) K(bdi->write_bandwidth),
nr_dirty,
nr_io,
@@ -376,8 +376,10 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_unregister);

-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
+ int i, err;
+
memset(wb, 0, sizeof(*wb));

wb->bdi = bdi;
@@ -388,6 +390,27 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_dirty_time);
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
+
+ for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
+ err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+ if (err) {
+ while (--i)
+ percpu_counter_destroy(&wb->stat[i]);
+ return err;
+ }
+ }
+
+ return 0;
+}
+
+static void bdi_wb_exit(struct bdi_writeback *wb)
+{
+ int i;
+
+ WARN_ON(delayed_work_pending(&wb->dwork));
+
+ for (i = 0; i < NR_WB_STAT_ITEMS; i++)
+ percpu_counter_destroy(&wb->stat[i]);
}

/*
@@ -397,7 +420,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)

int bdi_init(struct backing_dev_info *bdi)
{
- int i, err;
+ int err;

bdi->dev = NULL;

@@ -408,13 +431,9 @@ int bdi_init(struct backing_dev_info *bdi)
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->work_list);

- bdi_wb_init(&bdi->wb, bdi);
-
- for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
- err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
- if (err)
- goto err;
- }
+ err = bdi_wb_init(&bdi->wb, bdi);
+ if (err)
+ return err;

bdi->dirty_exceeded = 0;

@@ -427,25 +446,20 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->avg_write_bandwidth = INIT_BW;

err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
-
if (err) {
-err:
- while (i--)
- percpu_counter_destroy(&bdi->bdi_stat[i]);
+ bdi_wb_exit(&bdi->wb);
+ return err;
}

- return err;
+ return 0;
}
EXPORT_SYMBOL(bdi_init);

void bdi_destroy(struct backing_dev_info *bdi)
{
- int i;
-
bdi_wb_shutdown(bdi);

WARN_ON(!list_empty(&bdi->work_list));
- WARN_ON(delayed_work_pending(&bdi->wb.dwork));

if (bdi->dev) {
bdi_debug_unregister(bdi);
@@ -453,8 +467,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
bdi->dev = NULL;
}

- for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
- percpu_counter_destroy(&bdi->bdi_stat[i]);
+ bdi_wb_exit(&bdi->wb);
+
fprop_local_destroy_percpu(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bdeecad..dc673a0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -396,11 +396,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
* Increment the BDI's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
*/
-static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
- __inc_bdi_stat(bdi, BDI_WRITTEN);
- __fprop_inc_percpu_max(&writeout_completions, &bdi->completions,
- bdi->max_prop_frac);
+ __inc_wb_stat(wb, WB_WRITTEN);
+ __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
+ wb->bdi->max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(writeout_period_time)) {
/*
@@ -414,15 +414,15 @@ static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
}
}

-void bdi_writeout_inc(struct backing_dev_info *bdi)
+void wb_writeout_inc(struct bdi_writeback *wb)
{
unsigned long flags;

local_irq_save(flags);
- __bdi_writeout_inc(bdi);
+ __wb_writeout_inc(wb);
local_irq_restore(flags);
}
-EXPORT_SYMBOL_GPL(bdi_writeout_inc);
+EXPORT_SYMBOL_GPL(wb_writeout_inc);

/*
* Obtain an accurate fraction of the BDI's portion.
@@ -1130,8 +1130,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed < BANDWIDTH_INTERVAL)
return;

- dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
- written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
+ written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
@@ -1288,7 +1288,8 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
unsigned long *bdi_thresh,
unsigned long *bdi_bg_thresh)
{
- unsigned long bdi_reclaimable;
+ struct bdi_writeback *wb = &bdi->wb;
+ unsigned long wb_reclaimable;

/*
* bdi_thresh is not treated as some limiting factor as
@@ -1320,14 +1321,12 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
- if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
- bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- *bdi_dirty = bdi_reclaimable +
- bdi_stat_sum(bdi, BDI_WRITEBACK);
+ if (*bdi_thresh < 2 * wb_stat_error(wb)) {
+ wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
+ *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
} else {
- bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- *bdi_dirty = bdi_reclaimable +
- bdi_stat(bdi, BDI_WRITEBACK);
+ wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
+ *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
}
}

@@ -1514,9 +1513,9 @@ static void balance_dirty_pages(struct address_space *mapping,
* In theory 1 page is enough to keep the comsumer-producer
* pipe going: the flusher cleans 1 page => the task dirties 1
* more page. However bdi_dirty has accounting errors. So use
- * the larger and more IO friendly bdi_stat_error.
+ * the larger and more IO friendly wb_stat_error.
*/
- if (bdi_dirty <= bdi_stat_error(bdi))
+ if (bdi_dirty <= wb_stat_error(&bdi->wb))
break;

if (fatal_signal_pending(current))
@@ -2106,8 +2105,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping,
mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
- __inc_bdi_stat(bdi, BDI_RECLAIMABLE);
- __inc_bdi_stat(bdi, BDI_DIRTIED);
+ __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE);
+ __inc_wb_stat(&bdi->wb, WB_DIRTIED);
task_io_account_write(PAGE_CACHE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
@@ -2126,7 +2125,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
if (mapping_cap_account_dirty(mapping)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE);
+ dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE);
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
}
}
@@ -2190,7 +2189,7 @@ void account_page_redirty(struct page *page)
if (mapping && mapping_cap_account_dirty(mapping)) {
current->nr_dirtied--;
dec_zone_page_state(page, NR_DIRTIED);
- dec_bdi_stat(inode_to_bdi(mapping->host), BDI_DIRTIED);
+ dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED);
}
}
EXPORT_SYMBOL(account_page_redirty);
@@ -2369,8 +2368,8 @@ int clear_page_dirty_for_io(struct page *page)
if (TestClearPageDirty(page)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(inode_to_bdi(mapping->host),
- BDI_RECLAIMABLE);
+ dec_wb_stat(&inode_to_bdi(mapping->host)->wb,
+ WB_RECLAIMABLE);
ret = 1;
}
mem_cgroup_end_page_stat(memcg);
@@ -2398,8 +2397,8 @@ int test_clear_page_writeback(struct page *page)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
- __dec_bdi_stat(bdi, BDI_WRITEBACK);
- __bdi_writeout_inc(bdi);
+ __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+ __wb_writeout_inc(&bdi->wb);
}
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2433,7 +2432,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
- __inc_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
--
2.4.0
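
As a reading aid for the patch above, here is a minimal usage sketch of the
new per-wb stat helpers. Only inc_wb_stat(), dec_wb_stat(), WB_WRITEBACK and
inode_to_bdi() come from the patch; the wrapper function and its name are
hypothetical and exist purely for illustration.

#include <linux/fs.h>
#include <linux/backing-dev.h>

/*
 * Hypothetical helper, not part of the patch: charge one page entering
 * or leaving writeback to the bdi's (still single) embedded wb.
 * inc_wb_stat()/dec_wb_stat() disable interrupts around a batched
 * percpu_counter update, exactly like the old inc/dec_bdi_stat() did.
 */
static void example_account_writeback(struct inode *inode, bool start)
{
        struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;

        if (start)
                inc_wb_stat(wb, WB_WRITEBACK);
        else
                dec_wb_stat(wb, WB_WRITEBACK);
}

The helpers keep the old batched percpu_counter scheme; only the object they
charge has changed from the bdi itself to its embedded wb.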

2015-05-22 21:14:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 14/51] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback

Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.

* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.

* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in their entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
users of the moved fields and the renamed functions are converted
mechanically to go through bdi->wb, introducing no behavior changes.

v2: Typo in description fixed as suggested by Jan.
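
To make the relocation concrete for callers, a small illustrative sketch
follows. It mirrors the new over_bground_thresh() from the diff below; only
wb_stat(), WB_RECLAIMABLE and wb_dirty_limit() are taken from the patch --
the wrapper function and its name are made up.

#include <linux/backing-dev.h>
#include <linux/writeback.h>

/*
 * Illustrative only -- not part of the patch.  Both the reclaimable
 * page count and its per-wb limit are now derived from the
 * bdi_writeback itself rather than from the backing_dev_info.
 */
static bool example_wb_over_background(struct bdi_writeback *wb,
                                       unsigned long background_thresh)
{
        /* both the stat and the per-wb limit now come from @wb */
        return wb_stat(wb, WB_RECLAIMABLE) >
               wb_dirty_limit(wb, background_thresh);
}

Code that still needs bdi-wide settings such as min_ratio/max_ratio reaches
them through wb->bdi, as wb_dirty_limit() itself does.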

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Steven Whitehouse <[email protected]>
---
fs/f2fs/node.c | 4 +-
fs/f2fs/segment.h | 2 +-
fs/fs-writeback.c | 17 ++-
fs/gfs2/super.c | 2 +-
include/linux/backing-dev.h | 20 +--
include/linux/writeback.h | 19 ++-
include/trace/events/writeback.h | 8 +-
mm/backing-dev.c | 45 +++----
mm/page-writeback.c | 262 ++++++++++++++++++++-------------------
9 files changed, 187 insertions(+), 192 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 8ab0cf1..d211602 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -53,7 +53,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2);
} else if (type == DIRTY_DENTS) {
- if (sbi->sb->s_bdi->dirty_exceeded)
+ if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
@@ -70,7 +70,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT;
res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1);
} else {
- if (sbi->sb->s_bdi->dirty_exceeded)
+ if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
}
return res;
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 85d7fa7..6408989 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -713,7 +713,7 @@ static inline unsigned int max_hw_blocks(struct f2fs_sb_info *sbi)
*/
static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type)
{
- if (sbi->sb->s_bdi->dirty_exceeded)
+ if (sbi->sb->s_bdi->wb.dirty_exceeded)
return 0;

if (type == DATA)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8873ecd..1945cb9 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -624,7 +624,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
return ret;
}

-static long writeback_chunk_size(struct backing_dev_info *bdi,
+static long writeback_chunk_size(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
long pages;
@@ -645,7 +645,7 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
pages = LONG_MAX;
else {
- pages = min(bdi->avg_write_bandwidth / 2,
+ pages = min(wb->avg_write_bandwidth / 2,
global_dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
@@ -743,7 +743,7 @@ static long writeback_sb_inodes(struct super_block *sb,
inode->i_state |= I_SYNC;
spin_unlock(&inode->i_lock);

- write_chunk = writeback_chunk_size(wb->bdi, work);
+ write_chunk = writeback_chunk_size(wb, work);
wbc.nr_to_write = write_chunk;
wbc.pages_skipped = 0;

@@ -830,7 +830,7 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
return nr_pages - work.nr_pages;
}

-static bool over_bground_thresh(struct backing_dev_info *bdi)
+static bool over_bground_thresh(struct bdi_writeback *wb)
{
unsigned long background_thresh, dirty_thresh;

@@ -840,8 +840,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

- if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
- bdi_dirty_limit(bdi, background_thresh))
+ if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
return true;

return false;
@@ -854,7 +853,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
static void wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long start_time)
{
- __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+ __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
}

/*
@@ -906,7 +905,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh(wb->bdi))
+ if (work->for_background && !over_bground_thresh(wb))
break;

/*
@@ -998,7 +997,7 @@ static unsigned long get_nr_dirty_pages(void)

static long wb_check_background_flush(struct bdi_writeback *wb)
{
- if (over_bground_thresh(wb->bdi)) {
+ if (over_bground_thresh(wb)) {

struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 859c6ed..2982445 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -748,7 +748,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)

if (wbc->sync_mode == WB_SYNC_ALL)
gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH);
- if (bdi->dirty_exceeded)
+ if (bdi->wb.dirty_exceeded)
gfs2_ail1_flush(sdp, wbc);
else
filemap_fdatawrite(metamapping);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index fe7a907..2ab0604 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -60,16 +60,6 @@ struct bdi_writeback {
spinlock_t list_lock; /* protects the b_* lists */

struct percpu_counter stat[NR_WB_STAT_ITEMS];
-};
-
-struct backing_dev_info {
- struct list_head bdi_list;
- unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned int capabilities; /* Device capabilities */
- congested_fn *congested_fn; /* Function pointer if device is md/dm */
- void *congested_data; /* Pointer to aux data for congested func */
-
- char *name;

unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
@@ -88,6 +78,16 @@ struct backing_dev_info {

struct fprop_local_percpu completions;
int dirty_exceeded;
+};
+
+struct backing_dev_info {
+ struct list_head bdi_list;
+ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+ unsigned int capabilities; /* Device capabilities */
+ congested_fn *congested_fn; /* Function pointer if device is md/dm */
+ void *congested_data; /* Pointer to aux data for congested func */
+
+ char *name;

unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index b2dd371e..a6b9db7 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -155,16 +155,15 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
- unsigned long dirty);
-
-void __bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time);
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
+
+void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time);

void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 880dd74..9b876f6 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -400,13 +400,13 @@ TRACE_EVENT(bdi_dirty_ratelimit,

TP_fast_assign(
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
- __entry->write_bw = KBps(bdi->write_bandwidth);
- __entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
+ __entry->write_bw = KBps(bdi->wb.write_bandwidth);
+ __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
__entry->dirty_rate = KBps(dirty_rate);
- __entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
+ __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
__entry->task_ratelimit = KBps(task_ratelimit);
__entry->balanced_dirty_ratelimit =
- KBps(bdi->balanced_dirty_ratelimit);
+ KBps(bdi->wb.balanced_dirty_ratelimit);
),

TP_printk("bdi %s: "
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7b1d191..9a6c472 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -66,7 +66,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
spin_unlock(&wb->list_lock);

global_dirty_limits(&background_thresh, &dirty_thresh);
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ bdi_thresh = wb_dirty_limit(wb, dirty_thresh);

#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
@@ -91,7 +91,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
K(background_thresh),
(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
(unsigned long) K(wb_stat(wb, WB_WRITTEN)),
- (unsigned long) K(bdi->write_bandwidth),
+ (unsigned long) K(wb->write_bandwidth),
nr_dirty,
nr_io,
nr_more_io,
@@ -376,6 +376,11 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_unregister);

+/*
+ * Initial write bandwidth: 100 MB/s
+ */
+#define INIT_BW (100 << (20 - PAGE_SHIFT))
+
static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
int i, err;
@@ -391,11 +396,22 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

+ wb->bw_time_stamp = jiffies;
+ wb->balanced_dirty_ratelimit = INIT_BW;
+ wb->dirty_ratelimit = INIT_BW;
+ wb->write_bandwidth = INIT_BW;
+ wb->avg_write_bandwidth = INIT_BW;
+
+ err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+ if (err)
+ return err;
+
for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
if (err) {
while (--i)
percpu_counter_destroy(&wb->stat[i]);
+ fprop_local_destroy_percpu(&wb->completions);
return err;
}
}
@@ -411,12 +427,9 @@ static void bdi_wb_exit(struct bdi_writeback *wb)

for (i = 0; i < NR_WB_STAT_ITEMS; i++)
percpu_counter_destroy(&wb->stat[i]);
-}

-/*
- * Initial write bandwidth: 100 MB/s
- */
-#define INIT_BW (100 << (20 - PAGE_SHIFT))
+ fprop_local_destroy_percpu(&wb->completions);
+}

int bdi_init(struct backing_dev_info *bdi)
{
@@ -435,22 +448,6 @@ int bdi_init(struct backing_dev_info *bdi)
if (err)
return err;

- bdi->dirty_exceeded = 0;
-
- bdi->bw_time_stamp = jiffies;
- bdi->written_stamp = 0;
-
- bdi->balanced_dirty_ratelimit = INIT_BW;
- bdi->dirty_ratelimit = INIT_BW;
- bdi->write_bandwidth = INIT_BW;
- bdi->avg_write_bandwidth = INIT_BW;
-
- err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
- if (err) {
- bdi_wb_exit(&bdi->wb);
- return err;
- }
-
return 0;
}
EXPORT_SYMBOL(bdi_init);
@@ -468,8 +465,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
}

bdi_wb_exit(&bdi->wb);
-
- fprop_local_destroy_percpu(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index dc673a0..cd39ee9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -399,7 +399,7 @@ static unsigned long wp_next_time(unsigned long cur_time)
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
__inc_wb_stat(wb, WB_WRITTEN);
- __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
+ __fprop_inc_percpu_max(&writeout_completions, &wb->completions,
wb->bdi->max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(writeout_period_time)) {
@@ -427,10 +427,10 @@ EXPORT_SYMBOL_GPL(wb_writeout_inc);
/*
* Obtain an accurate fraction of the BDI's portion.
*/
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
- long *numerator, long *denominator)
+static void wb_writeout_fraction(struct bdi_writeback *wb,
+ long *numerator, long *denominator)
{
- fprop_fraction_percpu(&writeout_completions, &bdi->completions,
+ fprop_fraction_percpu(&writeout_completions, &wb->completions,
numerator, denominator);
}

@@ -516,11 +516,11 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
}

/**
- * bdi_dirty_limit - @bdi's share of dirty throttling threshold
- * @bdi: the backing_dev_info to query
+ * wb_dirty_limit - @wb's share of dirty throttling threshold
+ * @wb: bdi_writeback to query
* @dirty: global dirty limit in pages
*
- * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * Returns @wb's dirty limit in pages. The term "dirty" in the context of
* dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
*
* Note that balance_dirty_pages() will only seriously take it as a hard limit
@@ -528,34 +528,35 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
* control. For example, when the device is completely stalled due to some error
* conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
* In the other normal situations, it acts more gently by throttling the tasks
- * more (rather than completely block them) when the bdi dirty pages go high.
+ * more (rather than completely block them) when the wb dirty pages go high.
*
* It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*
- * The bdi's share of dirty limit will be adapting to its throughput and
+ * The wb's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*/
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
{
- u64 bdi_dirty;
+ struct backing_dev_info *bdi = wb->bdi;
+ u64 wb_dirty;
long numerator, denominator;

/*
* Calculate this BDI's share of the dirty ratio.
*/
- bdi_writeout_fraction(bdi, &numerator, &denominator);
+ wb_writeout_fraction(wb, &numerator, &denominator);

- bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
- bdi_dirty *= numerator;
- do_div(bdi_dirty, denominator);
+ wb_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
+ wb_dirty *= numerator;
+ do_div(wb_dirty, denominator);

- bdi_dirty += (dirty * bdi->min_ratio) / 100;
- if (bdi_dirty > (dirty * bdi->max_ratio) / 100)
- bdi_dirty = dirty * bdi->max_ratio / 100;
+ wb_dirty += (dirty * bdi->min_ratio) / 100;
+ if (wb_dirty > (dirty * bdi->max_ratio) / 100)
+ wb_dirty = dirty * bdi->max_ratio / 100;

- return bdi_dirty;
+ return wb_dirty;
}

/*
@@ -664,14 +665,14 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* card's bdi_dirty may rush to many times higher than bdi_setpoint.
* - the bdi dirty thresh drops quickly due to change of JBOD workload
*/
-static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty)
+static unsigned long wb_position_ratio(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty)
{
- unsigned long write_bw = bdi->avg_write_bandwidth;
+ unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long x_intercept;
@@ -702,12 +703,12 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
* consume arbitrary amount of RAM because it is accounted in
* NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
*
- * Here, in bdi_position_ratio(), we calculate pos_ratio based on
+ * Here, in wb_position_ratio(), we calculate pos_ratio based on
* two values: bdi_dirty and bdi_thresh. Let's consider an example:
* total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
* limits are set by default to 10% and 20% (background and throttle).
* Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
- * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
+ * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is
* about ~6K pages (as the average of background and throttle bdi
* limits). The 3rd order polynomial will provide positive feedback if
* bdi_dirty is under bdi_setpoint and vice versa.
@@ -717,7 +718,7 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
* much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
* in the example above).
*/
- if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long bdi_pos_ratio;
unsigned long bdi_bg_thresh;

@@ -842,13 +843,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
return pos_ratio;
}

-static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
- unsigned long elapsed,
- unsigned long written)
+static void wb_update_write_bandwidth(struct bdi_writeback *wb,
+ unsigned long elapsed,
+ unsigned long written)
{
const unsigned long period = roundup_pow_of_two(3 * HZ);
- unsigned long avg = bdi->avg_write_bandwidth;
- unsigned long old = bdi->write_bandwidth;
+ unsigned long avg = wb->avg_write_bandwidth;
+ unsigned long old = wb->write_bandwidth;
u64 bw;

/*
@@ -861,14 +862,14 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
* @written may have decreased due to account_page_redirty().
* Avoid underflowing @bw calculation.
*/
- bw = written - min(written, bdi->written_stamp);
+ bw = written - min(written, wb->written_stamp);
bw *= HZ;
if (unlikely(elapsed > period)) {
do_div(bw, elapsed);
avg = bw;
goto out;
}
- bw += (u64)bdi->write_bandwidth * (period - elapsed);
+ bw += (u64)wb->write_bandwidth * (period - elapsed);
bw >>= ilog2(period);

/*
@@ -881,8 +882,8 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
avg += (old - avg) >> 3;

out:
- bdi->write_bandwidth = bw;
- bdi->avg_write_bandwidth = avg;
+ wb->write_bandwidth = bw;
+ wb->avg_write_bandwidth = avg;
}

/*
@@ -947,20 +948,20 @@ static void global_update_bandwidth(unsigned long thresh,
* Normal bdi tasks will be curbed at or below it in long term.
* Obviously it should be around (write_bw / N) when there are N dd tasks.
*/
-static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long dirtied,
- unsigned long elapsed)
+static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long dirtied,
+ unsigned long elapsed)
{
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long setpoint = (freerun + limit) / 2;
- unsigned long write_bw = bdi->avg_write_bandwidth;
- unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+ unsigned long write_bw = wb->avg_write_bandwidth;
+ unsigned long dirty_ratelimit = wb->dirty_ratelimit;
unsigned long dirty_rate;
unsigned long task_ratelimit;
unsigned long balanced_dirty_ratelimit;
@@ -972,10 +973,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* The dirty rate will match the writeout rate in long term, except
* when dirty pages are truncated by userspace or re-dirtied by FS.
*/
- dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+ dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

- pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -1059,31 +1060,31 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,

/*
* For strictlimit case, calculations above were based on bdi counters
- * and limits (starting from pos_ratio = bdi_position_ratio() and up to
+ * and limits (starting from pos_ratio = wb_position_ratio() and up to
* balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
* Hence, to calculate "step" properly, we have to use bdi_dirty as
* "dirty" and bdi_setpoint as "setpoint".
*
* We rampup dirty_ratelimit forcibly if bdi_dirty is low because
* it's possible that bdi_thresh is close to zero due to inactivity
- * of backing device (see the implementation of bdi_dirty_limit()).
+ * of backing device (see the implementation of wb_dirty_limit()).
*/
- if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = bdi_dirty;
if (bdi_dirty < 8)
setpoint = bdi_dirty + 1;
else
setpoint = (bdi_thresh +
- bdi_dirty_limit(bdi, bg_thresh)) / 2;
+ wb_dirty_limit(wb, bg_thresh)) / 2;
}

if (dirty < setpoint) {
- x = min3(bdi->balanced_dirty_ratelimit,
+ x = min3(wb->balanced_dirty_ratelimit,
balanced_dirty_ratelimit, task_ratelimit);
if (dirty_ratelimit < x)
step = x - dirty_ratelimit;
} else {
- x = max3(bdi->balanced_dirty_ratelimit,
+ x = max3(wb->balanced_dirty_ratelimit,
balanced_dirty_ratelimit, task_ratelimit);
if (dirty_ratelimit > x)
step = dirty_ratelimit - x;
@@ -1105,22 +1106,22 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
else
dirty_ratelimit -= step;

- bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
- bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
+ wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
+ wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;

- trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit);
+ trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
}

-void __bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time)
+void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time)
{
unsigned long now = jiffies;
- unsigned long elapsed = now - bdi->bw_time_stamp;
+ unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
unsigned long written;

@@ -1130,44 +1131,44 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed < BANDWIDTH_INTERVAL)
return;

- dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
- written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);
+ dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
+ written = percpu_counter_read(&wb->stat[WB_WRITTEN]);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
* (at least 1s idle time between two flusher runs)
*/
- if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
+ if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
goto snapshot;

if (thresh) {
global_update_bandwidth(thresh, dirty, now);
- bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty,
- dirtied, elapsed);
+ wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty,
+ dirtied, elapsed);
}
- bdi_update_write_bandwidth(bdi, elapsed, written);
+ wb_update_write_bandwidth(wb, elapsed, written);

snapshot:
- bdi->dirtied_stamp = dirtied;
- bdi->written_stamp = written;
- bdi->bw_time_stamp = now;
+ wb->dirtied_stamp = dirtied;
+ wb->written_stamp = written;
+ wb->bw_time_stamp = now;
}

-static void bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time)
+static void wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time)
{
- if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
+ if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
- spin_lock(&bdi->wb.list_lock);
- __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
- spin_unlock(&bdi->wb.list_lock);
+ spin_lock(&wb->list_lock);
+ __wb_update_bandwidth(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty, start_time);
+ spin_unlock(&wb->list_lock);
}

/*
@@ -1187,10 +1188,10 @@ static unsigned long dirty_poll_interval(unsigned long dirty,
return 1;
}

-static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
- unsigned long bdi_dirty)
+static unsigned long wb_max_pause(struct bdi_writeback *wb,
+ unsigned long bdi_dirty)
{
- unsigned long bw = bdi->avg_write_bandwidth;
+ unsigned long bw = wb->avg_write_bandwidth;
unsigned long t;

/*
@@ -1206,14 +1207,14 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
return min_t(unsigned long, t, MAX_PAUSE);
}

-static long bdi_min_pause(struct backing_dev_info *bdi,
- long max_pause,
- unsigned long task_ratelimit,
- unsigned long dirty_ratelimit,
- int *nr_dirtied_pause)
+static long wb_min_pause(struct bdi_writeback *wb,
+ long max_pause,
+ unsigned long task_ratelimit,
+ unsigned long dirty_ratelimit,
+ int *nr_dirtied_pause)
{
- long hi = ilog2(bdi->avg_write_bandwidth);
- long lo = ilog2(bdi->dirty_ratelimit);
+ long hi = ilog2(wb->avg_write_bandwidth);
+ long lo = ilog2(wb->dirty_ratelimit);
long t; /* target pause */
long pause; /* estimated next pause */
int pages; /* target nr_dirtied_pause */
@@ -1281,14 +1282,13 @@ static long bdi_min_pause(struct backing_dev_info *bdi,
return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
}

-static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
- unsigned long dirty_thresh,
- unsigned long background_thresh,
- unsigned long *bdi_dirty,
- unsigned long *bdi_thresh,
- unsigned long *bdi_bg_thresh)
+static inline void wb_dirty_limits(struct bdi_writeback *wb,
+ unsigned long dirty_thresh,
+ unsigned long background_thresh,
+ unsigned long *bdi_dirty,
+ unsigned long *bdi_thresh,
+ unsigned long *bdi_bg_thresh)
{
- struct bdi_writeback *wb = &bdi->wb;
unsigned long wb_reclaimable;

/*
@@ -1301,10 +1301,10 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
* In this case we don't want to hard throttle the USB key
* dirtiers for 100 seconds until bdi_dirty drops under
* bdi_thresh. Instead the auxiliary bdi control line in
- * bdi_position_ratio() will let the dirtier task progress
+ * wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down bdi_dirty.
*/
- *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ *bdi_thresh = wb_dirty_limit(wb, dirty_thresh);

if (bdi_bg_thresh)
*bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
@@ -1354,6 +1354,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+ struct bdi_writeback *wb = &bdi->wb;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;

@@ -1378,8 +1379,8 @@ static void balance_dirty_pages(struct address_space *mapping,
global_dirty_limits(&background_thresh, &dirty_thresh);

if (unlikely(strictlimit)) {
- bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, &bg_thresh);
+ wb_dirty_limits(wb, dirty_thresh, background_thresh,
+ &bdi_dirty, &bdi_thresh, &bg_thresh);

dirty = bdi_dirty;
thresh = bdi_thresh;
@@ -1410,28 +1411,28 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_start_background_writeback(bdi);

if (!strictlimit)
- bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, NULL);
+ wb_dirty_limits(wb, dirty_thresh, background_thresh,
+ &bdi_dirty, &bdi_thresh, NULL);

dirty_exceeded = (bdi_dirty > bdi_thresh) &&
((nr_dirty > dirty_thresh) || strictlimit);
- if (dirty_exceeded && !bdi->dirty_exceeded)
- bdi->dirty_exceeded = 1;
+ if (dirty_exceeded && !wb->dirty_exceeded)
+ wb->dirty_exceeded = 1;

- bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
- nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ wb_update_bandwidth(wb, dirty_thresh, background_thresh,
+ nr_dirty, bdi_thresh, bdi_dirty,
+ start_time);

- dirty_ratelimit = bdi->dirty_ratelimit;
- pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
- background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ dirty_ratelimit = wb->dirty_ratelimit;
+ pos_ratio = wb_position_ratio(wb, dirty_thresh,
+ background_thresh, nr_dirty,
+ bdi_thresh, bdi_dirty);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
- max_pause = bdi_max_pause(bdi, bdi_dirty);
- min_pause = bdi_min_pause(bdi, max_pause,
- task_ratelimit, dirty_ratelimit,
- &nr_dirtied_pause);
+ max_pause = wb_max_pause(wb, bdi_dirty);
+ min_pause = wb_min_pause(wb, max_pause,
+ task_ratelimit, dirty_ratelimit,
+ &nr_dirtied_pause);

if (unlikely(task_ratelimit == 0)) {
period = max_pause;
@@ -1515,15 +1516,15 @@ static void balance_dirty_pages(struct address_space *mapping,
* more page. However bdi_dirty has accounting errors. So use
* the larger and more IO friendly wb_stat_error.
*/
- if (bdi_dirty <= wb_stat_error(&bdi->wb))
+ if (bdi_dirty <= wb_stat_error(wb))
break;

if (fatal_signal_pending(current))
break;
}

- if (!dirty_exceeded && bdi->dirty_exceeded)
- bdi->dirty_exceeded = 0;
+ if (!dirty_exceeded && wb->dirty_exceeded)
+ wb->dirty_exceeded = 0;

if (writeback_in_progress(bdi))
return;
@@ -1577,6 +1578,7 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+ struct bdi_writeback *wb = &bdi->wb;
int ratelimit;
int *p;

@@ -1584,7 +1586,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
return;

ratelimit = current->nr_dirtied_pause;
- if (bdi->dirty_exceeded)
+ if (wb->dirty_exceeded)
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

preempt_disable();
--
2.4.0

2015-05-22 21:28:28

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 15/51] writeback: s/bdi/wb/ in mm/page-writeback.c

Writeback operations will now be per wb (bdi_writeback) instead of
bdi. Replace the relevant bdi references in symbol names and comments
with wb. This patch is purely cosmetic and doesn't make any
functional changes.
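
As a reading aid only: the "wb control line" described in the renamed
comments below can be restated as the simplified, non-kernel sketch that
follows. The function name is made up, the arithmetic is floating point
rather than the fixed-point used in wb_position_ratio(), and it computes
just the wb-side factor that scales the global pos_ratio, assuming the
single-wb case where span = 8 * write_bw and x_intercept = wb_setpoint + span.

/*
 * Simplified sketch (not kernel code) of the wb control line: the
 * factor is ~1.0 at wb_setpoint, falls off linearly, and is clamped
 * to roughly 1/4 once wb_dirty reaches x_intercept - span/4.
 */
static double wb_control_line(unsigned long wb_dirty,
                              unsigned long wb_setpoint,
                              unsigned long write_bw)
{
        unsigned long span = 8 * write_bw;
        unsigned long x_intercept = wb_setpoint + span;

        if (wb_dirty < x_intercept - span / 4)
                return (double)(x_intercept - wb_dirty) /
                       (x_intercept - wb_setpoint + 1);
        return 0.25;
}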

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Jens Axboe <[email protected]>
---
mm/page-writeback.c | 270 ++++++++++++++++++++++++++--------------------------
1 file changed, 134 insertions(+), 136 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index cd39ee9..78ef551 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -595,7 +595,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
*
* (o) global/bdi setpoints
*
- * We want the dirty pages be balanced around the global/bdi setpoints.
+ * We want the dirty pages be balanced around the global/wb setpoints.
* When the number of dirty pages is higher/lower than the setpoint, the
* dirty position control ratio (and hence task dirty ratelimit) will be
* decreased/increased to bring the dirty pages back to the setpoint.
@@ -605,8 +605,8 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* if (dirty < setpoint) scale up pos_ratio
* if (dirty > setpoint) scale down pos_ratio
*
- * if (bdi_dirty < bdi_setpoint) scale up pos_ratio
- * if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ * if (wb_dirty < wb_setpoint) scale up pos_ratio
+ * if (wb_dirty > wb_setpoint) scale down pos_ratio
*
* task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT
*
@@ -631,7 +631,7 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* 0 +------------.------------------.----------------------*------------->
* freerun^ setpoint^ limit^ dirty pages
*
- * (o) bdi control line
+ * (o) wb control line
*
* ^ pos_ratio
* |
@@ -657,27 +657,27 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* | . .
* | . .
* 0 +----------------------.-------------------------------.------------->
- * bdi_setpoint^ x_intercept^
+ * wb_setpoint^ x_intercept^
*
- * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can
+ * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can
* be smoothly throttled down to normal if it starts high in situations like
* - start writing to a slow SD card and a fast disk at the same time. The SD
- * card's bdi_dirty may rush to many times higher than bdi_setpoint.
- * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ * card's wb_dirty may rush to many times higher than wb_setpoint.
+ * - the wb dirty thresh drops quickly due to change of JBOD workload
*/
static unsigned long wb_position_ratio(struct bdi_writeback *wb,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty)
+ unsigned long wb_thresh,
+ unsigned long wb_dirty)
{
unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long x_intercept;
unsigned long setpoint; /* dirty pages' target balance point */
- unsigned long bdi_setpoint;
+ unsigned long wb_setpoint;
unsigned long span;
long long pos_ratio; /* for scaling up/down the rate limit */
long x;
@@ -696,146 +696,145 @@ static unsigned long wb_position_ratio(struct bdi_writeback *wb,
/*
* The strictlimit feature is a tool preventing mistrusted filesystems
* from growing a large number of dirty pages before throttling. For
- * such filesystems balance_dirty_pages always checks bdi counters
- * against bdi limits. Even if global "nr_dirty" is under "freerun".
+ * such filesystems balance_dirty_pages always checks wb counters
+ * against wb limits. Even if global "nr_dirty" is under "freerun".
* This is especially important for fuse which sets bdi->max_ratio to
* 1% by default. Without strictlimit feature, fuse writeback may
* consume arbitrary amount of RAM because it is accounted in
* NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
*
* Here, in wb_position_ratio(), we calculate pos_ratio based on
- * two values: bdi_dirty and bdi_thresh. Let's consider an example:
+ * two values: wb_dirty and wb_thresh. Let's consider an example:
* total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
* limits are set by default to 10% and 20% (background and throttle).
- * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
- * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is
- * about ~6K pages (as the average of background and throttle bdi
+ * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
+ * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. wb_setpoint is
+ * about ~6K pages (as the average of background and throttle wb
* limits). The 3rd order polynomial will provide positive feedback if
- * bdi_dirty is under bdi_setpoint and vice versa.
+ * wb_dirty is under wb_setpoint and vice versa.
*
* Note, that we cannot use global counters in these calculations
- * because we want to throttle process writing to a strictlimit BDI
+ * because we want to throttle process writing to a strictlimit wb
* much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
* in the example above).
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
- long long bdi_pos_ratio;
- unsigned long bdi_bg_thresh;
+ long long wb_pos_ratio;
+ unsigned long wb_bg_thresh;

- if (bdi_dirty < 8)
+ if (wb_dirty < 8)
return min_t(long long, pos_ratio * 2,
2 << RATELIMIT_CALC_SHIFT);

- if (bdi_dirty >= bdi_thresh)
+ if (wb_dirty >= wb_thresh)
return 0;

- bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh);
- bdi_setpoint = dirty_freerun_ceiling(bdi_thresh,
- bdi_bg_thresh);
+ wb_bg_thresh = div_u64((u64)wb_thresh * bg_thresh, thresh);
+ wb_setpoint = dirty_freerun_ceiling(wb_thresh, wb_bg_thresh);

- if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+ if (wb_setpoint == 0 || wb_setpoint == wb_thresh)
return 0;

- bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
- bdi_thresh);
+ wb_pos_ratio = pos_ratio_polynom(wb_setpoint, wb_dirty,
+ wb_thresh);

/*
- * Typically, for strictlimit case, bdi_setpoint << setpoint
- * and pos_ratio >> bdi_pos_ratio. In the other words global
+ * Typically, for strictlimit case, wb_setpoint << setpoint
+ * and pos_ratio >> wb_pos_ratio. In the other words global
* state ("dirty") is not limiting factor and we have to
- * make decision based on bdi counters. But there is an
+ * make decision based on wb counters. But there is an
* important case when global pos_ratio should get precedence:
* global limits are exceeded (e.g. due to activities on other
- * BDIs) while given strictlimit BDI is below limit.
+ * wb's) while given strictlimit wb is below limit.
*
- * "pos_ratio * bdi_pos_ratio" would work for the case above,
+ * "pos_ratio * wb_pos_ratio" would work for the case above,
* but it would look too non-natural for the case of all
- * activity in the system coming from a single strictlimit BDI
+ * activity in the system coming from a single strictlimit wb
* with bdi->max_ratio == 100%.
*
* Note that min() below somewhat changes the dynamics of the
* control system. Normally, pos_ratio value can be well over 3
- * (when globally we are at freerun and bdi is well below bdi
+ * (when globally we are at freerun and wb is well below wb
* setpoint). Now the maximum pos_ratio in the same situation
* is 2. We might want to tweak this if we observe the control
* system is too slow to adapt.
*/
- return min(pos_ratio, bdi_pos_ratio);
+ return min(pos_ratio, wb_pos_ratio);
}

/*
* We have computed basic pos_ratio above based on global situation. If
- * the bdi is over/under its share of dirty pages, we want to scale
+ * the wb is over/under its share of dirty pages, we want to scale
* pos_ratio further down/up. That is done by the following mechanism.
*/

/*
- * bdi setpoint
+ * wb setpoint
*
- * f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+ * f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint)
*
- * x_intercept - bdi_dirty
+ * x_intercept - wb_dirty
* := --------------------------
- * x_intercept - bdi_setpoint
+ * x_intercept - wb_setpoint
*
- * The main bdi control line is a linear function that subjects to
+ * The main wb control line is a linear function that subjects to
*
- * (1) f(bdi_setpoint) = 1.0
- * (2) k = - 1 / (8 * write_bw) (in single bdi case)
- * or equally: x_intercept = bdi_setpoint + 8 * write_bw
+ * (1) f(wb_setpoint) = 1.0
+ * (2) k = - 1 / (8 * write_bw) (in single wb case)
+ * or equally: x_intercept = wb_setpoint + 8 * write_bw
*
- * For single bdi case, the dirty pages are observed to fluctuate
+ * For single wb case, the dirty pages are observed to fluctuate
* regularly within range
- * [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+ * [wb_setpoint - write_bw/2, wb_setpoint + write_bw/2]
* for various filesystems, where (2) can yield in a reasonable 12.5%
* fluctuation range for pos_ratio.
*
- * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+ * For JBOD case, wb_thresh (not wb_dirty!) could fluctuate up to its
* own size, so move the slope over accordingly and choose a slope that
- * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh.
+ * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh.
*/
- if (unlikely(bdi_thresh > thresh))
- bdi_thresh = thresh;
+ if (unlikely(wb_thresh > thresh))
+ wb_thresh = thresh;
/*
- * It's very possible that bdi_thresh is close to 0 not because the
+ * It's very possible that wb_thresh is close to 0 not because the
* device is slow, but that it has remained inactive for long time.
* Honour such devices a reasonable good (hopefully IO efficient)
* threshold, so that the occasional writes won't be blocked and active
* writes can rampup the threshold quickly.
*/
- bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
+ wb_thresh = max(wb_thresh, (limit - dirty) / 8);
/*
- * scale global setpoint to bdi's:
- * bdi_setpoint = setpoint * bdi_thresh / thresh
+ * scale global setpoint to wb's:
+ * wb_setpoint = setpoint * wb_thresh / thresh
*/
- x = div_u64((u64)bdi_thresh << 16, thresh + 1);
- bdi_setpoint = setpoint * (u64)x >> 16;
+ x = div_u64((u64)wb_thresh << 16, thresh + 1);
+ wb_setpoint = setpoint * (u64)x >> 16;
/*
- * Use span=(8*write_bw) in single bdi case as indicated by
- * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+ * Use span=(8*write_bw) in single wb case as indicated by
+ * (thresh - wb_thresh ~= 0) and transit to wb_thresh in JBOD case.
*
- * bdi_thresh thresh - bdi_thresh
- * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh
- * thresh thresh
+ * wb_thresh thresh - wb_thresh
+ * span = --------- * (8 * write_bw) + ------------------ * wb_thresh
+ * thresh thresh
*/
- span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16;
- x_intercept = bdi_setpoint + span;
+ span = (thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16;
+ x_intercept = wb_setpoint + span;

- if (bdi_dirty < x_intercept - span / 4) {
- pos_ratio = div64_u64(pos_ratio * (x_intercept - bdi_dirty),
- x_intercept - bdi_setpoint + 1);
+ if (wb_dirty < x_intercept - span / 4) {
+ pos_ratio = div64_u64(pos_ratio * (x_intercept - wb_dirty),
+ x_intercept - wb_setpoint + 1);
} else
pos_ratio /= 4;

/*
- * bdi reserve area, safeguard against dirty pool underrun and disk idle
+ * wb reserve area, safeguard against dirty pool underrun and disk idle
* It may push the desired control point of global dirty pages higher
* than setpoint.
*/
- x_intercept = bdi_thresh / 2;
- if (bdi_dirty < x_intercept) {
- if (bdi_dirty > x_intercept / 8)
- pos_ratio = div_u64(pos_ratio * x_intercept, bdi_dirty);
+ x_intercept = wb_thresh / 2;
+ if (wb_dirty < x_intercept) {
+ if (wb_dirty > x_intercept / 8)
+ pos_ratio = div_u64(pos_ratio * x_intercept, wb_dirty);
else
pos_ratio *= 8;
}
@@ -943,17 +942,17 @@ static void global_update_bandwidth(unsigned long thresh,
}

/*
- * Maintain bdi->dirty_ratelimit, the base dirty throttle rate.
+ * Maintain wb->dirty_ratelimit, the base dirty throttle rate.
*
- * Normal bdi tasks will be curbed at or below it in long term.
+ * Normal wb tasks will be curbed at or below it in long term.
* Obviously it should be around (write_bw / N) when there are N dd tasks.
*/
static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
+ unsigned long wb_thresh,
+ unsigned long wb_dirty,
unsigned long dirtied,
unsigned long elapsed)
{
@@ -976,7 +975,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ wb_thresh, wb_dirty);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -986,7 +985,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,

/*
* A linear estimation of the "balanced" throttle rate. The theory is,
- * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
+ * if there are N dd tasks, each throttled at task_ratelimit, the wb's
* dirty_rate will be measured to be (N * task_ratelimit). So the below
* formula will yield the balanced rate limit (write_bw / N).
*
@@ -1025,7 +1024,7 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
/*
* We could safely do this and return immediately:
*
- * bdi->dirty_ratelimit = balanced_dirty_ratelimit;
+ * wb->dirty_ratelimit = balanced_dirty_ratelimit;
*
* However to get a more stable dirty_ratelimit, the below elaborated
* code makes use of task_ratelimit to filter out singular points and
@@ -1059,22 +1058,22 @@ static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
step = 0;

/*
- * For strictlimit case, calculations above were based on bdi counters
+ * For strictlimit case, calculations above were based on wb counters
* and limits (starting from pos_ratio = wb_position_ratio() and up to
* balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
- * Hence, to calculate "step" properly, we have to use bdi_dirty as
- * "dirty" and bdi_setpoint as "setpoint".
+ * Hence, to calculate "step" properly, we have to use wb_dirty as
+ * "dirty" and wb_setpoint as "setpoint".
*
- * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
- * it's possible that bdi_thresh is close to zero due to inactivity
+ * We rampup dirty_ratelimit forcibly if wb_dirty is low because
+ * it's possible that wb_thresh is close to zero due to inactivity
* of backing device (see the implementation of wb_dirty_limit()).
*/
if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
- dirty = bdi_dirty;
- if (bdi_dirty < 8)
- setpoint = bdi_dirty + 1;
+ dirty = wb_dirty;
+ if (wb_dirty < 8)
+ setpoint = wb_dirty + 1;
else
- setpoint = (bdi_thresh +
+ setpoint = (wb_thresh +
wb_dirty_limit(wb, bg_thresh)) / 2;
}

@@ -1116,8 +1115,8 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
+ unsigned long wb_thresh,
+ unsigned long wb_dirty,
unsigned long start_time)
{
unsigned long now = jiffies;
@@ -1144,7 +1143,7 @@ void __wb_update_bandwidth(struct bdi_writeback *wb,
if (thresh) {
global_update_bandwidth(thresh, dirty, now);
wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty,
+ wb_thresh, wb_dirty,
dirtied, elapsed);
}
wb_update_write_bandwidth(wb, elapsed, written);
@@ -1159,15 +1158,15 @@ static void wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
+ unsigned long wb_thresh,
+ unsigned long wb_dirty,
unsigned long start_time)
{
if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
spin_lock(&wb->list_lock);
__wb_update_bandwidth(wb, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
+ wb_thresh, wb_dirty, start_time);
spin_unlock(&wb->list_lock);
}

@@ -1189,7 +1188,7 @@ static unsigned long dirty_poll_interval(unsigned long dirty,
}

static unsigned long wb_max_pause(struct bdi_writeback *wb,
- unsigned long bdi_dirty)
+ unsigned long wb_dirty)
{
unsigned long bw = wb->avg_write_bandwidth;
unsigned long t;
@@ -1201,7 +1200,7 @@ static unsigned long wb_max_pause(struct bdi_writeback *wb,
*
* 8 serves as the safety ratio.
*/
- t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
+ t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));
t++;

return min_t(unsigned long, t, MAX_PAUSE);
@@ -1285,31 +1284,31 @@ static long wb_min_pause(struct bdi_writeback *wb,
static inline void wb_dirty_limits(struct bdi_writeback *wb,
unsigned long dirty_thresh,
unsigned long background_thresh,
- unsigned long *bdi_dirty,
- unsigned long *bdi_thresh,
- unsigned long *bdi_bg_thresh)
+ unsigned long *wb_dirty,
+ unsigned long *wb_thresh,
+ unsigned long *wb_bg_thresh)
{
unsigned long wb_reclaimable;

/*
- * bdi_thresh is not treated as some limiting factor as
+ * wb_thresh is not treated as some limiting factor as
* dirty_thresh, due to reasons
- * - in JBOD setup, bdi_thresh can fluctuate a lot
+ * - in JBOD setup, wb_thresh can fluctuate a lot
* - in a system with HDD and USB key, the USB key may somehow
- * go into state (bdi_dirty >> bdi_thresh) either because
- * bdi_dirty starts high, or because bdi_thresh drops low.
+ * go into state (wb_dirty >> wb_thresh) either because
+ * wb_dirty starts high, or because wb_thresh drops low.
* In this case we don't want to hard throttle the USB key
- * dirtiers for 100 seconds until bdi_dirty drops under
- * bdi_thresh. Instead the auxiliary bdi control line in
+ * dirtiers for 100 seconds until wb_dirty drops under
+ * wb_thresh. Instead the auxiliary wb control line in
* wb_position_ratio() will let the dirtier task progress
- * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+ * at some rate <= (write_bw / 2) for bringing down wb_dirty.
*/
- *bdi_thresh = wb_dirty_limit(wb, dirty_thresh);
+ *wb_thresh = wb_dirty_limit(wb, dirty_thresh);

- if (bdi_bg_thresh)
- *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
- background_thresh,
- dirty_thresh) : 0;
+ if (wb_bg_thresh)
+ *wb_bg_thresh = dirty_thresh ? div_u64((u64)*wb_thresh *
+ background_thresh,
+ dirty_thresh) : 0;

/*
* In order to avoid the stacked BDI deadlock we need
@@ -1321,12 +1320,12 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
- if (*bdi_thresh < 2 * wb_stat_error(wb)) {
+ if (*wb_thresh < 2 * wb_stat_error(wb)) {
wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
- *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
+ *wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
} else {
wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
- *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
+ *wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
}
}

@@ -1360,9 +1359,9 @@ static void balance_dirty_pages(struct address_space *mapping,

for (;;) {
unsigned long now = jiffies;
- unsigned long uninitialized_var(bdi_thresh);
+ unsigned long uninitialized_var(wb_thresh);
unsigned long thresh;
- unsigned long uninitialized_var(bdi_dirty);
+ unsigned long uninitialized_var(wb_dirty);
unsigned long dirty;
unsigned long bg_thresh;

@@ -1380,10 +1379,10 @@ static void balance_dirty_pages(struct address_space *mapping,

if (unlikely(strictlimit)) {
wb_dirty_limits(wb, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, &bg_thresh);
+ &wb_dirty, &wb_thresh, &bg_thresh);

- dirty = bdi_dirty;
- thresh = bdi_thresh;
+ dirty = wb_dirty;
+ thresh = wb_thresh;
} else {
dirty = nr_dirty;
thresh = dirty_thresh;
@@ -1393,10 +1392,10 @@ static void balance_dirty_pages(struct address_space *mapping,
/*
* Throttle it only when the background writeback cannot
* catch-up. This avoids (excessively) small writeouts
- * when the bdi limits are ramping up in case of !strictlimit.
+ * when the wb limits are ramping up in case of !strictlimit.
*
- * In strictlimit case make decision based on the bdi counters
- * and limits. Small writeouts when the bdi limits are ramping
+ * In strictlimit case make decision based on the wb counters
+ * and limits. Small writeouts when the wb limits are ramping
* up are the price we consciously pay for strictlimit-ing.
*/
if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {
@@ -1412,24 +1411,23 @@ static void balance_dirty_pages(struct address_space *mapping,

if (!strictlimit)
wb_dirty_limits(wb, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, NULL);
+ &wb_dirty, &wb_thresh, NULL);

- dirty_exceeded = (bdi_dirty > bdi_thresh) &&
+ dirty_exceeded = (wb_dirty > wb_thresh) &&
((nr_dirty > dirty_thresh) || strictlimit);
if (dirty_exceeded && !wb->dirty_exceeded)
wb->dirty_exceeded = 1;

wb_update_bandwidth(wb, dirty_thresh, background_thresh,
- nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ nr_dirty, wb_thresh, wb_dirty, start_time);

dirty_ratelimit = wb->dirty_ratelimit;
pos_ratio = wb_position_ratio(wb, dirty_thresh,
background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ wb_thresh, wb_dirty);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
- max_pause = wb_max_pause(wb, bdi_dirty);
+ max_pause = wb_max_pause(wb, wb_dirty);
min_pause = wb_min_pause(wb, max_pause,
task_ratelimit, dirty_ratelimit,
&nr_dirtied_pause);
@@ -1455,8 +1453,8 @@ static void balance_dirty_pages(struct address_space *mapping,
dirty_thresh,
background_thresh,
nr_dirty,
- bdi_thresh,
- bdi_dirty,
+ wb_thresh,
+ wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1484,8 +1482,8 @@ static void balance_dirty_pages(struct address_space *mapping,
dirty_thresh,
background_thresh,
nr_dirty,
- bdi_thresh,
- bdi_dirty,
+ wb_thresh,
+ wb_dirty,
dirty_ratelimit,
task_ratelimit,
pages_dirtied,
@@ -1508,15 +1506,15 @@ static void balance_dirty_pages(struct address_space *mapping,

/*
* In the case of an unresponding NFS server and the NFS dirty
- * pages exceeds dirty_thresh, give the other good bdi's a pipe
+ * pages exceeds dirty_thresh, give the other good wb's a pipe
* to go through, so that tasks on them still remain responsive.
*
* In theory 1 page is enough to keep the comsumer-producer
* pipe going: the flusher cleans 1 page => the task dirties 1
- * more page. However bdi_dirty has accounting errors. So use
+ * more page. However wb_dirty has accounting errors. So use
* the larger and more IO friendly wb_stat_error.
*/
- if (bdi_dirty <= wb_stat_error(wb))
+ if (wb_dirty <= wb_stat_error(wb))
break;

if (fatal_signal_pending(current))
--
2.4.0

2015-05-22 21:27:45

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 16/51] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->wb_lock and ->worklist into wb.

* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.

* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)

* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The bdi_remove_from_list() invocation, which belongs to the
containing bdi rather than the wb itself, is moved out to
bdi_destroy().

* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.
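
As a reference for reviewers, here is a minimal sketch (not part of
the patch; the example_* name is made up) of the queueing rule after
the move: wb->work_lock guards both wb->work_list and wb->dwork
scheduling, and a cleared WB_registered bit means a submitted work is
completed immediately instead of being queued. The real
wb_queue_work() in the hunks below follows the same pattern.

    static void example_wb_queue_work(struct bdi_writeback *wb,
                                      struct wb_writeback_work *work)
    {
            spin_lock_bh(&wb->work_lock);
            if (test_bit(WB_registered, &wb->state)) {
                    /* normal case: queue and kick the per-wb worker */
                    list_add_tail(&work->list, &wb->work_list);
                    mod_delayed_work(bdi_wq, &wb->dwork, 0);
            } else if (work->done) {
                    /* wb is shutting down: don't queue, just complete */
                    complete(work->done);
            }
            spin_unlock_bh(&wb->work_lock);
    }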

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
fs/fs-writeback.c | 86 +++++++++++++++++++++------------------------
include/linux/backing-dev.h | 12 +++----
mm/backing-dev.c | 59 +++++++++++++++----------------
3 files changed, 75 insertions(+), 82 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1945cb9..a69d2e1 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -109,34 +109,33 @@ static inline struct inode *wb_inode(struct list_head *head)

EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);

-static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+static void wb_wakeup(struct bdi_writeback *wb)
{
- spin_lock_bh(&bdi->wb_lock);
- if (test_bit(WB_registered, &bdi->wb.state))
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ if (test_bit(WB_registered, &wb->state))
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ spin_unlock_bh(&wb->work_lock);
}

-static void bdi_queue_work(struct backing_dev_info *bdi,
- struct wb_writeback_work *work)
+static void wb_queue_work(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
- trace_writeback_queue(bdi, work);
+ trace_writeback_queue(wb->bdi, work);

- spin_lock_bh(&bdi->wb_lock);
- if (!test_bit(WB_registered, &bdi->wb.state)) {
+ spin_lock_bh(&wb->work_lock);
+ if (!test_bit(WB_registered, &wb->state)) {
if (work->done)
complete(work->done);
goto out_unlock;
}
- list_add_tail(&work->list, &bdi->work_list);
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+ list_add_tail(&work->list, &wb->work_list);
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
out_unlock:
- spin_unlock_bh(&bdi->wb_lock);
+ spin_unlock_bh(&wb->work_lock);
}

-static void
-__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
{
struct wb_writeback_work *work;

@@ -146,8 +145,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
*/
work = kzalloc(sizeof(*work), GFP_ATOMIC);
if (!work) {
- trace_writeback_nowork(bdi);
- bdi_wakeup_thread(bdi);
+ trace_writeback_nowork(wb->bdi);
+ wb_wakeup(wb);
return;
}

@@ -156,7 +155,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
work->range_cyclic = range_cyclic;
work->reason = reason;

- bdi_queue_work(bdi, work);
+ wb_queue_work(wb, work);
}

/**
@@ -174,7 +173,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason)
{
- __bdi_start_writeback(bdi, nr_pages, true, reason);
+ __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
}

/**
@@ -194,7 +193,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
* writeback as soon as there is no other work to do.
*/
trace_writeback_wake_background(bdi);
- bdi_wakeup_thread(bdi);
+ wb_wakeup(&bdi->wb);
}

/*
@@ -898,7 +897,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* after the other works are all done.
*/
if ((work->for_background || work->for_kupdate) &&
- !list_empty(&wb->bdi->work_list))
+ !list_empty(&wb->work_list))
break;

/*
@@ -969,18 +968,17 @@ static long wb_writeback(struct bdi_writeback *wb,
/*
* Return the next wb_writeback_work struct that hasn't been processed yet.
*/
-static struct wb_writeback_work *
-get_next_work_item(struct backing_dev_info *bdi)
+static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
{
struct wb_writeback_work *work = NULL;

- spin_lock_bh(&bdi->wb_lock);
- if (!list_empty(&bdi->work_list)) {
- work = list_entry(bdi->work_list.next,
+ spin_lock_bh(&wb->work_lock);
+ if (!list_empty(&wb->work_list)) {
+ work = list_entry(wb->work_list.next,
struct wb_writeback_work, list);
list_del_init(&work->list);
}
- spin_unlock_bh(&bdi->wb_lock);
+ spin_unlock_bh(&wb->work_lock);
return work;
}

@@ -1052,14 +1050,13 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
*/
static long wb_do_writeback(struct bdi_writeback *wb)
{
- struct backing_dev_info *bdi = wb->bdi;
struct wb_writeback_work *work;
long wrote = 0;

set_bit(WB_writeback_running, &wb->state);
- while ((work = get_next_work_item(bdi)) != NULL) {
+ while ((work = get_next_work_item(wb)) != NULL) {

- trace_writeback_exec(bdi, work);
+ trace_writeback_exec(wb->bdi, work);

wrote += wb_writeback(wb, work);

@@ -1087,43 +1084,42 @@ static long wb_do_writeback(struct bdi_writeback *wb)
* Handle writeback of dirty data for the device backed by this bdi. Also
* reschedules periodically and does kupdated style flushing.
*/
-void bdi_writeback_workfn(struct work_struct *work)
+void wb_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(to_delayed_work(work),
struct bdi_writeback, dwork);
- struct backing_dev_info *bdi = wb->bdi;
long pages_written;

- set_worker_desc("flush-%s", dev_name(bdi->dev));
+ set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
current->flags |= PF_SWAPWRITE;

if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
/*
- * The normal path. Keep writing back @bdi until its
+ * The normal path. Keep writing back @wb until its
* work_list is empty. Note that this path is also taken
- * if @bdi is shutting down even when we're running off the
+ * if @wb is shutting down even when we're running off the
* rescuer as work_list needs to be drained.
*/
do {
pages_written = wb_do_writeback(wb);
trace_writeback_pages_written(pages_written);
- } while (!list_empty(&bdi->work_list));
+ } while (!list_empty(&wb->work_list));
} else {
/*
* bdi_wq can't get enough workers and we're running off
* the emergency worker. Don't hog it. Hopefully, 1024 is
* enough for efficient IO.
*/
- pages_written = writeback_inodes_wb(&bdi->wb, 1024,
+ pages_written = writeback_inodes_wb(wb, 1024,
WB_REASON_FORKER_THREAD);
trace_writeback_pages_written(pages_written);
}

- if (!list_empty(&bdi->work_list))
+ if (!list_empty(&wb->work_list))
mod_delayed_work(bdi_wq, &wb->dwork, 0);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
- bdi_wakeup_thread_delayed(bdi);
+ wb_wakeup_delayed(wb);

current->flags &= ~PF_SWAPWRITE;
}
@@ -1143,7 +1139,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
continue;
- __bdi_start_writeback(bdi, nr_pages, false, reason);
+ __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
}
rcu_read_unlock();
}
@@ -1174,7 +1170,7 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (list_empty(&bdi->wb.b_dirty_time))
continue;
- bdi_wakeup_thread(bdi);
+ wb_wakeup(&bdi->wb);
}
rcu_read_unlock();
schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
@@ -1347,7 +1343,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
trace_writeback_dirty_inode_enqueue(inode);

if (wakeup_bdi)
- bdi_wakeup_thread_delayed(bdi);
+ wb_wakeup_delayed(&bdi->wb);
return;
}
}
@@ -1437,7 +1433,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
if (sb->s_bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- bdi_queue_work(sb->s_bdi, &work);
+ wb_queue_work(&sb->s_bdi->wb, &work);
wait_for_completion(&done);
}
EXPORT_SYMBOL(writeback_inodes_sb_nr);
@@ -1521,7 +1517,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- bdi_queue_work(sb->s_bdi, &work);
+ wb_queue_work(&sb->s_bdi->wb, &work);
wait_for_completion(&done);

wait_sb_inodes(sb);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2ab0604..d796f49 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -52,7 +52,6 @@ struct bdi_writeback {
unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */

- struct delayed_work dwork; /* work item used for writeback */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
@@ -78,6 +77,10 @@ struct bdi_writeback {

struct fprop_local_percpu completions;
int dirty_exceeded;
+
+ spinlock_t work_lock; /* protects work_list & dwork scheduling */
+ struct list_head work_list;
+ struct delayed_work dwork; /* work item used for writeback */
};

struct backing_dev_info {
@@ -93,9 +96,6 @@ struct backing_dev_info {
unsigned int max_ratio, max_prop_frac;

struct bdi_writeback wb; /* default writeback info for this bdi */
- spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */
-
- struct list_head work_list;

struct device *dev;

@@ -121,9 +121,9 @@ int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
-void bdi_writeback_workfn(struct work_struct *work);
+void wb_workfn(struct work_struct *work);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
-void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void wb_wakeup_delayed(struct bdi_writeback *wb);

extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 9a6c472..597f0ce 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -261,7 +261,7 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
}

/*
- * This function is used when the first inode for this bdi is marked dirty. It
+ * This function is used when the first inode for this wb is marked dirty. It
* wakes-up the corresponding bdi thread which should then take care of the
* periodic background write-out of dirty inodes. Since the write-out would
* starts only 'dirty_writeback_interval' centisecs from now anyway, we just
@@ -274,15 +274,15 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
* We have to be careful not to postpone flush work if it is scheduled for
* earlier. Thus we use queue_delayed_work().
*/
-void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
+void wb_wakeup_delayed(struct bdi_writeback *wb)
{
unsigned long timeout;

timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
- spin_lock_bh(&bdi->wb_lock);
- if (test_bit(WB_registered, &bdi->wb.state))
- queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ if (test_bit(WB_registered, &wb->state))
+ queue_delayed_work(bdi_wq, &wb->dwork, timeout);
+ spin_unlock_bh(&wb->work_lock);
}

/*
@@ -335,28 +335,24 @@ EXPORT_SYMBOL(bdi_register_dev);
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
-static void bdi_wb_shutdown(struct backing_dev_info *bdi)
+static void wb_shutdown(struct bdi_writeback *wb)
{
/* Make sure nobody queues further work */
- spin_lock_bh(&bdi->wb_lock);
- if (!test_and_clear_bit(WB_registered, &bdi->wb.state)) {
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ if (!test_and_clear_bit(WB_registered, &wb->state)) {
+ spin_unlock_bh(&wb->work_lock);
return;
}
- spin_unlock_bh(&bdi->wb_lock);
+ spin_unlock_bh(&wb->work_lock);

/*
- * Make sure nobody finds us on the bdi_list anymore
+ * Drain work list and shutdown the delayed_work. !WB_registered
+ * tells wb_workfn() that @wb is dying and its work_list needs to
+ * be drained no matter what.
*/
- bdi_remove_from_list(bdi);
-
- /*
- * Drain work list and shutdown the delayed_work. At this point,
- * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
- * is dying and its work_list needs to be drained no matter what.
- */
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
- flush_delayed_work(&bdi->wb.dwork);
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ flush_delayed_work(&wb->dwork);
+ WARN_ON(!list_empty(&wb->work_list));
}

/*
@@ -381,7 +377,7 @@ EXPORT_SYMBOL(bdi_unregister);
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))

-static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
int i, err;

@@ -394,7 +390,6 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_more_io);
INIT_LIST_HEAD(&wb->b_dirty_time);
spin_lock_init(&wb->list_lock);
- INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

wb->bw_time_stamp = jiffies;
wb->balanced_dirty_ratelimit = INIT_BW;
@@ -402,6 +397,10 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
wb->write_bandwidth = INIT_BW;
wb->avg_write_bandwidth = INIT_BW;

+ spin_lock_init(&wb->work_lock);
+ INIT_LIST_HEAD(&wb->work_list);
+ INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+
err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
if (err)
return err;
@@ -419,7 +418,7 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
return 0;
}

-static void bdi_wb_exit(struct bdi_writeback *wb)
+static void wb_exit(struct bdi_writeback *wb)
{
int i;

@@ -440,11 +439,9 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
- spin_lock_init(&bdi->wb_lock);
INIT_LIST_HEAD(&bdi->bdi_list);
- INIT_LIST_HEAD(&bdi->work_list);

- err = bdi_wb_init(&bdi->wb, bdi);
+ err = wb_init(&bdi->wb, bdi);
if (err)
return err;

@@ -454,9 +451,9 @@ EXPORT_SYMBOL(bdi_init);

void bdi_destroy(struct backing_dev_info *bdi)
{
- bdi_wb_shutdown(bdi);
-
- WARN_ON(!list_empty(&bdi->work_list));
+ /* make sure nobody finds us on the bdi_list anymore */
+ bdi_remove_from_list(bdi);
+ wb_shutdown(&bdi->wb);

if (bdi->dev) {
bdi_debug_unregister(bdi);
@@ -464,7 +461,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
bdi->dev = NULL;
}

- bdi_wb_exit(&bdi->wb);
+ wb_exit(&bdi->wb);
}
EXPORT_SYMBOL(bdi_destroy);

--
2.4.0

2015-05-22 21:27:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 17/51] writeback: reorganize mm/backing-dev.c

Move wb_shutdown(), bdi_register(), bdi_register_dev(),
bdi_remove_from_list() and bdi_unregister() so that init / exit
functions are grouped together. This will make updating init / exit
paths for cgroup writeback support easier.

This is pure source file reorganization.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
mm/backing-dev.c | 174 +++++++++++++++++++++++++++----------------------------
1 file changed, 87 insertions(+), 87 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 597f0ce..ff85ecb 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -286,93 +286,6 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
}

/*
- * Remove bdi from bdi_list, and ensure that it is no longer visible
- */
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
-{
- spin_lock_bh(&bdi_lock);
- list_del_rcu(&bdi->bdi_list);
- spin_unlock_bh(&bdi_lock);
-
- synchronize_rcu_expedited();
-}
-
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
- const char *fmt, ...)
-{
- va_list args;
- struct device *dev;
-
- if (bdi->dev) /* The driver needs to use separate queues per device */
- return 0;
-
- va_start(args, fmt);
- dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
- va_end(args);
- if (IS_ERR(dev))
- return PTR_ERR(dev);
-
- bdi->dev = dev;
-
- bdi_debug_register(bdi, dev_name(dev));
- set_bit(WB_registered, &bdi->wb.state);
-
- spin_lock_bh(&bdi_lock);
- list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
- spin_unlock_bh(&bdi_lock);
-
- trace_writeback_bdi_register(bdi);
- return 0;
-}
-EXPORT_SYMBOL(bdi_register);
-
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-{
- return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
-}
-EXPORT_SYMBOL(bdi_register_dev);
-
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
-{
- /* Make sure nobody queues further work */
- spin_lock_bh(&wb->work_lock);
- if (!test_and_clear_bit(WB_registered, &wb->state)) {
- spin_unlock_bh(&wb->work_lock);
- return;
- }
- spin_unlock_bh(&wb->work_lock);
-
- /*
- * Drain work list and shutdown the delayed_work. !WB_registered
- * tells wb_workfn() that @wb is dying and its work_list needs to
- * be drained no matter what.
- */
- mod_delayed_work(bdi_wq, &wb->dwork, 0);
- flush_delayed_work(&wb->dwork);
- WARN_ON(!list_empty(&wb->work_list));
-}
-
-/*
- * Called when the device behind @bdi has been removed or ejected.
- *
- * We can't really do much here except for reducing the dirty ratio at
- * the moment. In the future we should be able to set a flag so that
- * the filesystem can handle errors at mark_inode_dirty time instead
- * of only at writeback time.
- */
-void bdi_unregister(struct backing_dev_info *bdi)
-{
- if (WARN_ON_ONCE(!bdi->dev))
- return;
-
- bdi_set_min_ratio(bdi, 0);
-}
-EXPORT_SYMBOL(bdi_unregister);
-
-/*
* Initial write bandwidth: 100 MB/s
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))
@@ -418,6 +331,29 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
return 0;
}

+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb)
+{
+ /* Make sure nobody queues further work */
+ spin_lock_bh(&wb->work_lock);
+ if (!test_and_clear_bit(WB_registered, &wb->state)) {
+ spin_unlock_bh(&wb->work_lock);
+ return;
+ }
+ spin_unlock_bh(&wb->work_lock);
+
+ /*
+ * Drain work list and shutdown the delayed_work. !WB_registered
+ * tells wb_workfn() that @wb is dying and its work_list needs to
+ * be drained no matter what.
+ */
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ flush_delayed_work(&wb->dwork);
+ WARN_ON(!list_empty(&wb->work_list));
+}
+
static void wb_exit(struct bdi_writeback *wb)
{
int i;
@@ -449,6 +385,70 @@ int bdi_init(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_init);

+int bdi_register(struct backing_dev_info *bdi, struct device *parent,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct device *dev;
+
+ if (bdi->dev) /* The driver needs to use separate queues per device */
+ return 0;
+
+ va_start(args, fmt);
+ dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
+ va_end(args);
+ if (IS_ERR(dev))
+ return PTR_ERR(dev);
+
+ bdi->dev = dev;
+
+ bdi_debug_register(bdi, dev_name(dev));
+ set_bit(WB_registered, &bdi->wb.state);
+
+ spin_lock_bh(&bdi_lock);
+ list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
+ trace_writeback_bdi_register(bdi);
+ return 0;
+}
+EXPORT_SYMBOL(bdi_register);
+
+int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
+{
+ return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
+}
+EXPORT_SYMBOL(bdi_register_dev);
+
+/*
+ * Remove bdi from bdi_list, and ensure that it is no longer visible
+ */
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+ spin_lock_bh(&bdi_lock);
+ list_del_rcu(&bdi->bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
+ synchronize_rcu_expedited();
+}
+
+/*
+ * Called when the device behind @bdi has been removed or ejected.
+ *
+ * We can't really do much here except for reducing the dirty ratio at
+ * the moment. In the future we should be able to set a flag so that
+ * the filesystem can handle errors at mark_inode_dirty time instead
+ * of only at writeback time.
+ */
+void bdi_unregister(struct backing_dev_info *bdi)
+{
+ if (WARN_ON_ONCE(!bdi->dev))
+ return;
+
+ bdi_set_min_ratio(bdi, 0);
+}
+EXPORT_SYMBOL(bdi_unregister);
+
void bdi_destroy(struct backing_dev_info *bdi)
{
/* make sure nobody finds us on the bdi_list anymore */
--
2.4.0

2015-05-22 21:26:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 18/51] writeback: separate out include/linux/backing-dev-defs.h

With the planned cgroup writeback support, backing-dev related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes a cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h which only has the
essential definitions and updates blkdev.h to include it. .c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain, making it a lot easier to use it across
block and cgroup.
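
As an illustration (the paths come from the diffstat below; the
comments are mine), the include rule after the split is roughly:

    /* include/linux/blkdev.h - only needs the type definitions */
    #include <linux/backing-dev-defs.h>

    /* e.g. fs/ext4/super.c - also calls backing-dev helpers */
    #include <linux/blkdev.h>
    #include <linux/backing-dev.h>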

v2: fs/fat build failure fixed.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-integrity.c | 1 +
block/blk-sysfs.c | 1 +
block/bounce.c | 1 +
block/genhd.c | 1 +
drivers/block/drbd/drbd_int.h | 1 +
drivers/block/pktcdvd.c | 1 +
drivers/char/raw.c | 1 +
drivers/md/bcache/request.c | 1 +
drivers/md/dm.h | 1 +
drivers/md/md.h | 1 +
drivers/mtd/devices/block2mtd.c | 1 +
fs/block_dev.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/mballoc.c | 1 +
fs/ext4/super.c | 1 +
fs/f2fs/segment.h | 1 +
fs/fat/file.c | 1 +
fs/fat/inode.c | 1 +
fs/hfs/super.c | 1 +
fs/hfsplus/super.c | 1 +
fs/nfs/filelayout/filelayout.c | 1 +
fs/ocfs2/file.c | 1 +
fs/reiserfs/super.c | 1 +
fs/ufs/super.c | 1 +
fs/xfs/xfs_file.c | 1 +
include/linux/backing-dev-defs.h | 106 +++++++++++++++++++++++++++++++++++++++
include/linux/backing-dev.h | 102 +------------------------------------
include/linux/blkdev.h | 2 +-
mm/madvise.c | 1 +
29 files changed, 134 insertions(+), 102 deletions(-)
create mode 100644 include/linux/backing-dev-defs.h

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 79ffb48..f548b64 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -21,6 +21,7 @@
*/

#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/mempool.h>
#include <linux/bio.h>
#include <linux/scatterlist.h>
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5677eb7..1b60941 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -6,6 +6,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/blktrace_api.h>
#include <linux/blk-mq.h>
#include <linux/blk-cgroup.h>
diff --git a/block/bounce.c b/block/bounce.c
index 4bac725..072280b 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
diff --git a/block/genhd.c b/block/genhd.c
index 0a536dc..d46ba56 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -8,6 +8,7 @@
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..efd19c2 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -38,6 +38,7 @@
#include <linux/mutex.h>
#include <linux/major.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/genhd.h>
#include <linux/idr.h>
#include <net/tcp.h>
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..4c20c22 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -61,6 +61,7 @@
#include <linux/freezer.h>
#include <linux/mutex.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/scsi.h>
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 5fc291c..60316fb 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -12,6 +12,7 @@
#include <linux/fs.h>
#include <linux/major.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/raw.h>
#include <linux/capability.h>
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 1616f66..4afb2d2 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/hash.h>
#include <linux/random.h>
+#include <linux/backing-dev.h>

#include <trace/events/bcache.h>

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6123c2b..4e98499 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -14,6 +14,7 @@
#include <linux/device-mapper.h>
#include <linux/list.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 4046a6c..7da6e9c 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -16,6 +16,7 @@
#define _MD_MD_H

#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mm.h>
diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index b16f3cd..e2c0057 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -20,6 +20,7 @@
#include <linux/delay.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/list.h>
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c7e4163..e545cbf 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -14,6 +14,7 @@
#include <linux/device_cgroup.h>
#include <linux/highmem.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/magic.h>
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d74e0802..e8b5866 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -39,6 +39,7 @@
#include <linux/slab.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
+#include <linux/backing-dev.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
#include "xattr.h"
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8d1e602..440987c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -26,6 +26,7 @@
#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#include <trace/events/ext4.h>

#ifdef CONFIG_EXT4_DEBUG
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f06d058..56b8bb7 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -24,6 +24,7 @@
#include <linux/slab.h>
#include <linux/init.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 6408989..aba72f7 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -9,6 +9,7 @@
* published by the Free Software Foundation.
*/
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>

/* constant macro */
#define NULL_SEGNO ((unsigned int)(~0))
diff --git a/fs/fat/file.c b/fs/fat/file.c
index 442d50a..a08f103 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -11,6 +11,7 @@
#include <linux/compat.h>
#include <linux/mount.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/fsnotify.h>
#include <linux/security.h>
#include "fat.h"
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index c067746..509411d 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -18,6 +18,7 @@
#include <linux/parser.h>
#include <linux/uio.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <asm/unaligned.h>
#include "fat.h"

diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index eee7206..55c03b9 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -14,6 +14,7 @@

#include <linux/module.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/mount.h>
#include <linux/init.h>
#include <linux/nls.h>
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 593af2f..7302d96 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/vfs.h>
diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index a46bf6d..b34f2e2 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -32,6 +32,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
#include <linux/module.h>
+#include <linux/backing-dev.h>

#include <linux/sunrpc/metrics.h>

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index d8b670c..8f1feca 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -37,6 +37,7 @@
#include <linux/falloc.h>
#include <linux/quotaops.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>

#include <cluster/masklog.h>

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index 0111ad0..3e0af31 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -21,6 +21,7 @@
#include "xattr.h"
#include <linux/init.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/quotaops.h>
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index b3bc3e7..098508a 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -80,6 +80,7 @@
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 8121e75..4e00b38 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -41,6 +41,7 @@
#include <linux/dcache.h>
#include <linux/falloc.h>
#include <linux/pagevec.h>
+#include <linux/backing-dev.h>

static const struct vm_operations_struct xfs_file_vm_ops;

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
new file mode 100644
index 0000000..aa18c4b
--- /dev/null
+++ b/include/linux/backing-dev-defs.h
@@ -0,0 +1,106 @@
+#ifndef __LINUX_BACKING_DEV_DEFS_H
+#define __LINUX_BACKING_DEV_DEFS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/percpu_counter.h>
+#include <linux/flex_proportions.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+
+struct page;
+struct device;
+struct dentry;
+
+/*
+ * Bits in bdi_writeback.state
+ */
+enum wb_state {
+ WB_async_congested, /* The async (write) queue is getting full */
+ WB_sync_congested, /* The sync queue is getting full */
+ WB_registered, /* bdi_register() was done */
+ WB_writeback_running, /* Writeback is in progress */
+};
+
+typedef int (congested_fn)(void *, int);
+
+enum wb_stat_item {
+ WB_RECLAIMABLE,
+ WB_WRITEBACK,
+ WB_DIRTIED,
+ WB_WRITTEN,
+ NR_WB_STAT_ITEMS
+};
+
+#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
+struct bdi_writeback {
+ struct backing_dev_info *bdi; /* our parent bdi */
+
+ unsigned long state; /* Always use atomic bitops on this */
+ unsigned long last_old_flush; /* last old data flush */
+
+ struct list_head b_dirty; /* dirty inodes */
+ struct list_head b_io; /* parked for writeback */
+ struct list_head b_more_io; /* parked for more writeback */
+ struct list_head b_dirty_time; /* time stamps are dirty */
+ spinlock_t list_lock; /* protects the b_* lists */
+
+ struct percpu_counter stat[NR_WB_STAT_ITEMS];
+
+ unsigned long bw_time_stamp; /* last time write bw is updated */
+ unsigned long dirtied_stamp;
+ unsigned long written_stamp; /* pages written at bw_time_stamp */
+ unsigned long write_bandwidth; /* the estimated write bandwidth */
+ unsigned long avg_write_bandwidth; /* further smoothed write bw */
+
+ /*
+ * The base dirty throttle rate, re-calculated on every 200ms.
+ * All the bdi tasks' dirty rate will be curbed under it.
+ * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
+ * in small steps and is much more smooth/stable than the latter.
+ */
+ unsigned long dirty_ratelimit;
+ unsigned long balanced_dirty_ratelimit;
+
+ struct fprop_local_percpu completions;
+ int dirty_exceeded;
+
+ spinlock_t work_lock; /* protects work_list & dwork scheduling */
+ struct list_head work_list;
+ struct delayed_work dwork; /* work item used for writeback */
+};
+
+struct backing_dev_info {
+ struct list_head bdi_list;
+ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+ unsigned int capabilities; /* Device capabilities */
+ congested_fn *congested_fn; /* Function pointer if device is md/dm */
+ void *congested_data; /* Pointer to aux data for congested func */
+
+ char *name;
+
+ unsigned int min_ratio;
+ unsigned int max_ratio, max_prop_frac;
+
+ struct bdi_writeback wb; /* default writeback info for this bdi */
+
+ struct device *dev;
+
+ struct timer_list laptop_mode_wb_timer;
+
+#ifdef CONFIG_DEBUG_FS
+ struct dentry *debug_dir;
+ struct dentry *debug_stats;
+#endif
+};
+
+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
+void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
+void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+
+#endif /* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index d796f49..5e39f7a 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -8,104 +8,11 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H

-#include <linux/percpu_counter.h>
-#include <linux/log2.h>
-#include <linux/flex_proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
-#include <linux/timer.h>
#include <linux/writeback.h>
-#include <linux/atomic.h>
-#include <linux/sysctl.h>
-#include <linux/workqueue.h>
-
-struct page;
-struct device;
-struct dentry;
-
-/*
- * Bits in bdi_writeback.state
- */
-enum wb_state {
- WB_async_congested, /* The async (write) queue is getting full */
- WB_sync_congested, /* The sync queue is getting full */
- WB_registered, /* bdi_register() was done */
- WB_writeback_running, /* Writeback is in progress */
-};
-
-typedef int (congested_fn)(void *, int);
-
-enum wb_stat_item {
- WB_RECLAIMABLE,
- WB_WRITEBACK,
- WB_DIRTIED,
- WB_WRITTEN,
- NR_WB_STAT_ITEMS
-};
-
-#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
-
-struct bdi_writeback {
- struct backing_dev_info *bdi; /* our parent bdi */
-
- unsigned long state; /* Always use atomic bitops on this */
- unsigned long last_old_flush; /* last old data flush */
-
- struct list_head b_dirty; /* dirty inodes */
- struct list_head b_io; /* parked for writeback */
- struct list_head b_more_io; /* parked for more writeback */
- struct list_head b_dirty_time; /* time stamps are dirty */
- spinlock_t list_lock; /* protects the b_* lists */
-
- struct percpu_counter stat[NR_WB_STAT_ITEMS];
-
- unsigned long bw_time_stamp; /* last time write bw is updated */
- unsigned long dirtied_stamp;
- unsigned long written_stamp; /* pages written at bw_time_stamp */
- unsigned long write_bandwidth; /* the estimated write bandwidth */
- unsigned long avg_write_bandwidth; /* further smoothed write bw */
-
- /*
- * The base dirty throttle rate, re-calculated on every 200ms.
- * All the bdi tasks' dirty rate will be curbed under it.
- * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
- * in small steps and is much more smooth/stable than the latter.
- */
- unsigned long dirty_ratelimit;
- unsigned long balanced_dirty_ratelimit;
-
- struct fprop_local_percpu completions;
- int dirty_exceeded;
-
- spinlock_t work_lock; /* protects work_list & dwork scheduling */
- struct list_head work_list;
- struct delayed_work dwork; /* work item used for writeback */
-};
-
-struct backing_dev_info {
- struct list_head bdi_list;
- unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned int capabilities; /* Device capabilities */
- congested_fn *congested_fn; /* Function pointer if device is md/dm */
- void *congested_data; /* Pointer to aux data for congested func */
-
- char *name;
-
- unsigned int min_ratio;
- unsigned int max_ratio, max_prop_frac;
-
- struct bdi_writeback wb; /* default writeback info for this bdi */
-
- struct device *dev;
-
- struct timer_list laptop_mode_wb_timer;
-
-#ifdef CONFIG_DEBUG_FS
- struct dentry *debug_dir;
- struct dentry *debug_stats;
-#endif
-};
+#include <linux/backing-dev-defs.h>

struct backing_dev_info *inode_to_bdi(struct inode *inode);

@@ -265,13 +172,6 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << WB_async_congested));
}

-enum {
- BLK_RW_ASYNC = 0,
- BLK_RW_SYNC = 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
long wait_iff_congested(struct zone *zone, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index bc91795..89bdef0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -12,7 +12,7 @@
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
-#include <linux/backing-dev.h>
+#include <linux/backing-dev-defs.h>
#include <linux/wait.h>
#include <linux/mempool.h>
#include <linux/bio.h>
diff --git a/mm/madvise.c b/mm/madvise.c
index d551475..64bb8a2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,7 @@
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>

--
2.4.0

2015-05-22 21:14:59

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 19/51] bdi: make inode_to_bdi() inline

Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing a circular include dependency. The function
gets called from hot paths and is fairly trivial.

This patch makes inode_to_bdi(), and the sb_is_blkdev_sb() helper it
calls, inline. blockdev_superblock and noop_backing_dev_info are
EXPORT_SYMBOL_GPL'd so that the inline functions can be used from
modules.

While at it, make sb_is_blkdev_sb() return bool instead of int.
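
A hypothetical caller (sketch only, not from the patch) showing what
the inlining buys on a hot path - the bdi is resolved without an
out-of-line call into fs-writeback.c:

    /* assumes <linux/backing-dev.h> is included */
    static bool example_inode_stable_writes(struct inode *inode)
    {
            /* inode_to_bdi() now expands inline at the call site */
            return inode_to_bdi(inode)->capabilities & BDI_CAP_STABLE_WRITES;
    }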

v2: Fixed typo in description as suggested by Jan.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jens Axboe <[email protected]>
Cc: Christoph Hellwig <[email protected]>
---
fs/block_dev.c | 8 ++------
fs/fs-writeback.c | 16 ----------------
include/linux/backing-dev.h | 18 ++++++++++++++++--
include/linux/fs.h | 8 +++++++-
mm/backing-dev.c | 1 +
5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index e545cbf..f04c873 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -547,7 +547,8 @@ static struct file_system_type bd_type = {
.kill_sb = kill_anon_super,
};

-static struct super_block *blockdev_superblock __read_mostly;
+struct super_block *blockdev_superblock __read_mostly;
+EXPORT_SYMBOL_GPL(blockdev_superblock);

void __init bdev_cache_init(void)
{
@@ -688,11 +689,6 @@ static struct block_device *bd_acquire(struct inode *inode)
return bdev;
}

-int sb_is_blkdev_sb(struct super_block *sb)
-{
- return sb == blockdev_superblock;
-}
-
/* Call when you free inode */

void bd_forget(struct inode *inode)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a69d2e1..34d1cb8 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -78,22 +78,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(writeback_in_progress);

-struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
- struct super_block *sb;
-
- if (!inode)
- return &noop_backing_dev_info;
-
- sb = inode->i_sb;
-#ifdef CONFIG_BLOCK
- if (sb_is_blkdev_sb(sb))
- return blk_get_backing_dev_info(I_BDEV(inode));
-#endif
- return sb->s_bdi;
-}
-EXPORT_SYMBOL_GPL(inode_to_bdi);
-
static inline struct inode *wb_inode(struct list_head *head)
{
return list_entry(head, struct inode, i_wb_list);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5e39f7a..7857820 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -11,11 +11,10 @@
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
+#include <linux/blkdev.h>
#include <linux/writeback.h>
#include <linux/backing-dev-defs.h>

-struct backing_dev_info *inode_to_bdi(struct inode *inode);
-
int __must_check bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);

@@ -149,6 +148,21 @@ extern struct backing_dev_info noop_backing_dev_info;

int writeback_in_progress(struct backing_dev_info *bdi);

+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+ struct super_block *sb;
+
+ if (!inode)
+ return &noop_backing_dev_info;
+
+ sb = inode->i_sb;
+#ifdef CONFIG_BLOCK
+ if (sb_is_blkdev_sb(sb))
+ return blk_get_backing_dev_info(I_BDEV(inode));
+#endif
+ return sb->s_bdi;
+}
+
static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1ef6390..ce100b87 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2240,7 +2240,13 @@ extern struct super_block *freeze_bdev(struct block_device *);
extern void emergency_thaw_all(void);
extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
extern int fsync_bdev(struct block_device *);
-extern int sb_is_blkdev_sb(struct super_block *sb);
+
+extern struct super_block *blockdev_superblock;
+
+static inline bool sb_is_blkdev_sb(struct super_block *sb)
+{
+ return sb == blockdev_superblock;
+}
#else
static inline void bd_forget(struct inode *inode) {}
static inline int sync_blockdev(struct block_device *bdev) { return 0; }
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ff85ecb..b0707d1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
.name = "noop",
.capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
};
+EXPORT_SYMBOL_GPL(noop_backing_dev_info);

static struct class *bdi_class;

--
2.4.0

2015-05-22 21:26:37

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 20/51] writeback: add @gfp to wb_init()

wb_init() currently always uses GFP_KERNEL but the planned cgroup
writeback support needs to use other allocation masks. Add @gfp to
wb_init().

This patch doesn't introduce any behavior changes.
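
The default path keeps passing GFP_KERNEL (see the bdi_init() hunk
below); the second call is a purely hypothetical sketch of how a
later per-cgroup wb, set up in a more constrained context, might use
the new argument (the cgwb container is made up):

    /* existing default wb, init path, may sleep */
    err = wb_init(&bdi->wb, bdi, GFP_KERNEL);

    /* hypothetical future per-cgroup wb under allocation constraints */
    err = wb_init(&cgwb->wb, bdi, GFP_NOWAIT | __GFP_NOWARN);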

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
---
mm/backing-dev.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b0707d1..805b287 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -291,7 +291,8 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))

-static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
+ gfp_t gfp)
{
int i, err;

@@ -315,12 +316,12 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);

- err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+ err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
return err;

for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
- err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+ err = percpu_counter_init(&wb->stat[i], 0, gfp);
if (err) {
while (--i)
percpu_counter_destroy(&wb->stat[i]);
@@ -378,7 +379,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_prop_frac = FPROP_FRAC_BASE;
INIT_LIST_HEAD(&bdi->bdi_list);

- err = wb_init(&bdi->wb, bdi);
+ err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
if (err)
return err;

--
2.4.0

2015-05-22 21:26:15

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 21/51] bdi: separate out congested state into a separate struct

Currently, a wb's (bdi_writeback) congestion state is carried in its
->state field; however, cgroup writeback support will require multiple
wb's sharing the same congestion state. This patch separates out
congestion state into its own struct - struct bdi_writeback_congested.
A new wb field, ->congested, points to its associated congested
struct. The default wb, bdi->wb, always points to bdi->wb_congested.

While this patch adds a layer of indirection, it doesn't introduce any
behavior changes.
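
A minimal sketch of the extra hop (it mirrors the bdi_congested()
hunk below; the example_* wrapper is mine):

    static int example_wb_congested(struct bdi_writeback *wb, int bdi_bits)
    {
            /* the congestion bits now live behind wb->congested */
            return wb->congested->state & bdi_bits;
    }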

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/backing-dev-defs.h | 14 ++++++++++++--
include/linux/backing-dev.h | 2 +-
mm/backing-dev.c | 7 +++++--
3 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index aa18c4b..9e9eafa 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -16,12 +16,15 @@ struct dentry;
* Bits in bdi_writeback.state
*/
enum wb_state {
- WB_async_congested, /* The async (write) queue is getting full */
- WB_sync_congested, /* The sync queue is getting full */
WB_registered, /* bdi_register() was done */
WB_writeback_running, /* Writeback is in progress */
};

+enum wb_congested_state {
+ WB_async_congested, /* The async (write) queue is getting full */
+ WB_sync_congested, /* The sync queue is getting full */
+};
+
typedef int (congested_fn)(void *, int);

enum wb_stat_item {
@@ -34,6 +37,10 @@ enum wb_stat_item {

#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

+struct bdi_writeback_congested {
+ unsigned long state; /* WB_[a]sync_congested flags */
+};
+
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */

@@ -48,6 +55,8 @@ struct bdi_writeback {

struct percpu_counter stat[NR_WB_STAT_ITEMS];

+ struct bdi_writeback_congested *congested;
+
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
@@ -84,6 +93,7 @@ struct backing_dev_info {
unsigned int max_ratio, max_prop_frac;

struct bdi_writeback wb; /* default writeback info for this bdi */
+ struct bdi_writeback_congested wb_congested;

struct device *dev;

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 7857820..bfdaa18 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,7 +167,7 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
return bdi->congested_fn(bdi->congested_data, bdi_bits);
- return (bdi->wb.state & bdi_bits);
+ return (bdi->wb.congested->state & bdi_bits);
}

static inline int bdi_read_congested(struct backing_dev_info *bdi)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 805b287..5ec7658 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -383,6 +383,9 @@ int bdi_init(struct backing_dev_info *bdi)
if (err)
return err;

+ bdi->wb_congested.state = 0;
+ bdi->wb.congested = &bdi->wb_congested;
+
return 0;
}
EXPORT_SYMBOL(bdi_init);
@@ -504,7 +507,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
wait_queue_head_t *wqh = &congestion_wqh[sync];

bit = sync ? WB_sync_congested : WB_async_congested;
- if (test_and_clear_bit(bit, &bdi->wb.state))
+ if (test_and_clear_bit(bit, &bdi->wb.congested->state))
atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
@@ -517,7 +520,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
enum wb_state bit;

bit = sync ? WB_sync_congested : WB_async_congested;
- if (!test_and_set_bit(bit, &bdi->wb.state))
+ if (!test_and_set_bit(bit, &bdi->wb.congested->state))
atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
--
2.4.0

2015-05-22 21:25:56

by Tejun Heo

Subject: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled(), which determines whether both the bdi and the
filesystem of a given inode support cgroup writeback, is added.
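
As a sketch of the filesystem side opt-in (not part of this patch;
"examplefs" is hypothetical), a filesystem would advertise support by
setting FS_CGROUP_WRITEBACK in its file_system_type flags:

/* Hypothetical filesystem opting in (other fields omitted for brevity). */
static struct file_system_type examplefs_fs_type = {
        .owner          = THIS_MODULE,
        .name           = "examplefs",
        .fs_flags       = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
        /* .mount, .kill_sb, etc. as usual for the filesystem */
};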

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
block/blk-core.c | 2 +-
include/linux/backing-dev.h | 32 +++++++++++++++++++++++++++++++-
include/linux/fs.h | 1 +
init/Kconfig | 5 +++++
4 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f46688f..e0f726f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -620,7 +620,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
- q->backing_dev_info.capabilities = 0;
+ q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
q->backing_dev_info.name = "block";
q->node = node_id;

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index bfdaa18..6bb3123 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -134,12 +134,15 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
* BDI_CAP_NO_WRITEBACK: Don't write pages back
* BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
* BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
+ *
+ * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
*/
#define BDI_CAP_NO_ACCT_DIRTY 0x00000001
#define BDI_CAP_NO_WRITEBACK 0x00000002
#define BDI_CAP_NO_ACCT_WB 0x00000004
#define BDI_CAP_STABLE_WRITES 0x00000008
#define BDI_CAP_STRICTLIMIT 0x00000010
+#define BDI_CAP_CGROUP_WRITEBACK 0x00000020

#define BDI_CAP_NO_ACCT_AND_WRITEBACK \
(BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
@@ -229,4 +232,31 @@ static inline int bdi_sched_wait(void *word)
return 0;
}

-#endif /* _LINUX_BACKING_DEV_H */
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
+ * @inode: inode of interest
+ *
+ * cgroup writeback requires support from both the bdi and filesystem.
+ * Test whether @inode has both.
+ */
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+ return bdi_cap_account_dirty(bdi) &&
+ (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
+ (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static inline bool inode_cgwb_enabled(struct inode *inode)
+{
+ return false;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
+#endif /* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ce100b87..74e0ae0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1897,6 +1897,7 @@ struct file_system_type {
#define FS_HAS_SUBTYPE 4
#define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
#define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */
+#define FS_CGROUP_WRITEBACK 32 /* Supports cgroup-aware writeback */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
struct dentry *(*mount) (struct file_system_type *, int,
const char *, void *);
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..d4f7633 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1141,6 +1141,11 @@ config DEBUG_BLK_CGROUP
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.

+config CGROUP_WRITEBACK
+ bool
+ depends on MEMCG && BLK_CGROUP
+ default y
+
endif # CGROUPS

config CHECKPOINT_RESTORE
--
2.4.0

2015-05-22 21:15:08

by Tejun Heo

Subject: [PATCH 23/51] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback). This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg. This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg. This means that there may be multiple
wb's of a bdi mapped to the same blkcg. As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state. This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
by its memcg id. Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.
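
To illustrate how the association is meant to be used (a sketch only,
with an invented helper name; the real accounting sites are converted
in later patches), an inode is attached to a cgwb when it is first
dirtied and everything afterwards just reads the association back:

/* Sketch: attach on first dirtying, then account against the cgwb. */
static void note_page_dirtied(struct inode *inode, struct page *page)
{
        struct bdi_writeback *wb;

        inode_attach_wb(inode, page);   /* noop if already attached */
        wb = inode_to_wb(inode);
        inc_wb_stat(wb, WB_DIRTIED);
}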

v3: inode_attach_wb() in account_page_dirtied() moved inside
mapping_cap_account_dirty() block where it's known to be !NULL.
Also, an unnecessary NULL check before kfree() removed. Both
detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.

Signed-off-by: Tejun Heo <[email protected]>
Cc: kbuild test robot <[email protected]>
Cc: Dan Carpenter <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
block/blk-cgroup.c | 7 +-
fs/fs-writeback.c | 8 +-
fs/inode.c | 1 +
include/linux/backing-dev-defs.h | 59 +++++-
include/linux/backing-dev.h | 195 +++++++++++++++++++
include/linux/blk-cgroup.h | 4 +
include/linux/fs.h | 4 +
include/linux/memcontrol.h | 4 +
mm/backing-dev.c | 397 +++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 19 +-
mm/page-writeback.c | 11 +-
11 files changed, 698 insertions(+), 11 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 54ec172..979cfdb 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/err.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/slab.h>
#include <linux/genhd.h>
#include <linux/delay.h>
@@ -797,6 +798,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css)
}

spin_unlock_irq(&blkcg->lock);
+
+ wb_blkcg_offline(blkcg);
}

static void blkcg_css_free(struct cgroup_subsys_state *css)
@@ -827,7 +830,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
spin_lock_init(&blkcg->lock);
INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC);
INIT_HLIST_HEAD(&blkcg->blkg_list);
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+ INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
return &blkcg->css;
}

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 34d1cb8..99a2440 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -185,11 +185,11 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
*/
void inode_wb_list_del(struct inode *inode)
{
- struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct bdi_writeback *wb = inode_to_wb(inode);

- spin_lock(&bdi->wb.list_lock);
+ spin_lock(&wb->list_lock);
list_del_init(&inode->i_wb_list);
- spin_unlock(&bdi->wb.list_lock);
+ spin_unlock(&wb->list_lock);
}

/*
@@ -1268,6 +1268,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
if ((inode->i_state & flags) != flags) {
const int was_dirty = inode->i_state & I_DIRTY;

+ inode_attach_wb(inode, NULL);
+
if (flags & I_DIRTY_INODE)
inode->i_state &= ~I_DIRTY_TIME;
inode->i_state |= flags;
diff --git a/fs/inode.c b/fs/inode.c
index ea37cd1..efc9eda 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu);
void __destroy_inode(struct inode *inode)
{
BUG_ON(inode_has_buffers(inode));
+ inode_detach_wb(inode);
security_inode_free(inode);
fsnotify_inode_delete(inode);
locks_free_lock_context(inode->i_flctx);
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 9e9eafa..a1e9c40 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -2,8 +2,11 @@
#define __LINUX_BACKING_DEV_DEFS_H

#include <linux/list.h>
+#include <linux/radix-tree.h>
+#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/percpu_counter.h>
+#include <linux/percpu-refcount.h>
#include <linux/flex_proportions.h>
#include <linux/timer.h>
#include <linux/workqueue.h>
@@ -37,10 +40,43 @@ enum wb_stat_item {

#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

+/*
+ * For cgroup writeback, multiple wb's may map to the same blkcg. Those
+ * wb's can operate mostly independently but should share the congested
+ * state. To facilitate such sharing, the congested state is tracked using
+ * the following struct which is created on demand, indexed by blkcg ID on
+ * its bdi, and refcounted.
+ */
struct bdi_writeback_congested {
unsigned long state; /* WB_[a]sync_congested flags */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct backing_dev_info *bdi; /* the associated bdi */
+ atomic_t refcnt; /* nr of attached wb's and blkg */
+ int blkcg_id; /* ID of the associated blkcg */
+ struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */
+#endif
};

+/*
+ * Each wb (bdi_writeback) can perform writeback operations, is measured
+ * and throttled, independently. Without cgroup writeback, each bdi
+ * (bdi_writeback) is served by its embedded bdi->wb.
+ *
+ * On the default hierarchy, blkcg implicitly enables memcg. This allows
+ * using memcg's page ownership for attributing writeback IOs, and every
+ * memcg - blkcg combination can be served by its own wb by assigning a
+ * dedicated wb to each memcg, which enables isolation across different
+ * cgroups and propagation of IO back pressure down from the IO layer upto
+ * the tasks which are generating the dirty pages to be written back.
+ *
+ * A cgroup wb is indexed on its bdi by the ID of the associated memcg,
+ * refcounted with the number of inodes attached to it, and pins the memcg
+ * and the corresponding blkcg. As the corresponding blkcg for a memcg may
+ * change as blkcg is disabled and enabled higher up in the hierarchy, a wb
+ * is tested for blkcg after lookup and removed from index on mismatch so
+ * that a new wb for the combination can be created.
+ */
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */

@@ -78,6 +114,19 @@ struct bdi_writeback {
spinlock_t work_lock; /* protects work_list & dwork scheduling */
struct list_head work_list;
struct delayed_work dwork; /* work item used for writeback */
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct percpu_ref refcnt; /* used only for !root wb's */
+ struct cgroup_subsys_state *memcg_css; /* the associated memcg */
+ struct cgroup_subsys_state *blkcg_css; /* and blkcg */
+ struct list_head memcg_node; /* anchored at memcg->cgwb_list */
+ struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
+
+ union {
+ struct work_struct release_work;
+ struct rcu_head rcu;
+ };
+#endif
};

struct backing_dev_info {
@@ -92,9 +141,13 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;

- struct bdi_writeback wb; /* default writeback info for this bdi */
- struct bdi_writeback_congested wb_congested;
-
+ struct bdi_writeback wb; /* the root writeback info for this bdi */
+ struct bdi_writeback_congested wb_congested; /* its congested state */
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */
+ struct rb_root cgwb_congested_tree; /* their congested states */
+ atomic_t usage_cnt; /* counts both cgwbs and cgwb_congested's */
+#endif
struct device *dev;

struct timer_list laptop_mode_wb_timer;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6bb3123..8ae59df 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
#include <linux/sched.h>
#include <linux/blkdev.h>
#include <linux/writeback.h>
+#include <linux/blk-cgroup.h>
#include <linux/backing-dev-defs.h>

int __must_check bdi_init(struct backing_dev_info *bdi);
@@ -234,6 +235,16 @@ static inline int bdi_sched_wait(void *word)

#ifdef CONFIG_CGROUP_WRITEBACK

+struct bdi_writeback_congested *
+wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp);
+void wb_congested_put(struct bdi_writeback_congested *congested);
+struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
+ struct cgroup_subsys_state *memcg_css,
+ gfp_t gfp);
+void __inode_attach_wb(struct inode *inode, struct page *page);
+void wb_memcg_offline(struct mem_cgroup *memcg);
+void wb_blkcg_offline(struct blkcg *blkcg);
+
/**
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
* @inode: inode of interest
@@ -250,6 +261,135 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
(inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
}

+/**
+ * wb_tryget - try to increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+ if (wb != &wb->bdi->wb)
+ return percpu_ref_tryget(&wb->refcnt);
+ return true;
+}
+
+/**
+ * wb_get - increment a wb's refcount
+ * @wb: bdi_writeback to get
+ */
+static inline void wb_get(struct bdi_writeback *wb)
+{
+ if (wb != &wb->bdi->wb)
+ percpu_ref_get(&wb->refcnt);
+}
+
+/**
+ * wb_put - decrement a wb's refcount
+ * @wb: bdi_writeback to put
+ */
+static inline void wb_put(struct bdi_writeback *wb)
+{
+ if (wb != &wb->bdi->wb)
+ percpu_ref_put(&wb->refcnt);
+}
+
+/**
+ * wb_find_current - find wb for %current on a bdi
+ * @bdi: bdi of interest
+ *
+ * Find the wb of @bdi which matches both the memcg and blkcg of %current.
+ * Must be called under rcu_read_lock() which protects the returned wb.
+ * NULL if not found.
+ */
+static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
+{
+ struct cgroup_subsys_state *memcg_css;
+ struct bdi_writeback *wb;
+
+ memcg_css = task_css(current, memory_cgrp_id);
+ if (!memcg_css->parent)
+ return &bdi->wb;
+
+ wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+
+ /*
+ * %current's blkcg equals the effective blkcg of its memcg. No
+ * need to use the relatively expensive cgroup_get_e_css().
+ */
+ if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id)))
+ return wb;
+ return NULL;
+}
+
+/**
+ * wb_get_create_current - get or create wb for %current on a bdi
+ * @bdi: bdi of interest
+ * @gfp: allocation mask
+ *
+ * Equivalent to wb_get_create() on %current's memcg. This function is
+ * called from a relatively hot path and optimizes the common cases using
+ * wb_find_current().
+ */
+static inline struct bdi_writeback *
+wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
+{
+ struct bdi_writeback *wb;
+
+ rcu_read_lock();
+ wb = wb_find_current(bdi);
+ if (wb && unlikely(!wb_tryget(wb)))
+ wb = NULL;
+ rcu_read_unlock();
+
+ if (unlikely(!wb)) {
+ struct cgroup_subsys_state *memcg_css;
+
+ memcg_css = task_get_css(current, memory_cgrp_id);
+ wb = wb_get_create(bdi, memcg_css, gfp);
+ css_put(memcg_css);
+ }
+ return wb;
+}
+
+/**
+ * inode_attach_wb - associate an inode with its wb
+ * @inode: inode of interest
+ * @page: page being dirtied (may be NULL)
+ *
+ * If @inode doesn't have its wb, associate it with the wb matching the
+ * memcg of @page or, if @page is NULL, %current. May be called w/ or w/o
+ * @inode->i_lock.
+ */
+static inline void inode_attach_wb(struct inode *inode, struct page *page)
+{
+ if (!inode->i_wb)
+ __inode_attach_wb(inode, page);
+}
+
+/**
+ * inode_detach_wb - disassociate an inode from its wb
+ * @inode: inode of interest
+ *
+ * @inode is being freed. Detach from its wb.
+ */
+static inline void inode_detach_wb(struct inode *inode)
+{
+ if (inode->i_wb) {
+ wb_put(inode->i_wb);
+ inode->i_wb = NULL;
+ }
+}
+
+/**
+ * inode_to_wb - determine the wb of an inode
+ * @inode: inode of interest
+ *
+ * Returns the wb @inode is currently associated with.
+ */
+static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
+{
+ return inode->i_wb;
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */

static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -257,6 +397,61 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
return false;
}

+static inline struct bdi_writeback_congested *
+wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
+{
+ return bdi->wb.congested;
+}
+
+static inline void wb_congested_put(struct bdi_writeback_congested *congested)
+{
+}
+
+static inline bool wb_tryget(struct bdi_writeback *wb)
+{
+ return true;
+}
+
+static inline void wb_get(struct bdi_writeback *wb)
+{
+}
+
+static inline void wb_put(struct bdi_writeback *wb)
+{
+}
+
+static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
+{
+ return &bdi->wb;
+}
+
+static inline struct bdi_writeback *
+wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
+{
+ return &bdi->wb;
+}
+
+static inline void inode_attach_wb(struct inode *inode, struct page *page)
+{
+}
+
+static inline void inode_detach_wb(struct inode *inode)
+{
+}
+
+static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
+{
+ return &inode_to_bdi(inode)->wb;
+}
+
+static inline void wb_memcg_offline(struct mem_cgroup *memcg)
+{
+}
+
+static inline void wb_blkcg_offline(struct blkcg *blkcg)
+{
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

#endif /* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 4dc643f..3033eb1 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -53,6 +53,10 @@ struct blkcg {
/* TODO: per-policy storage in blkcg */
unsigned int cfq_weight; /* belongs to cfq */
unsigned int cfq_leaf_weight;
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct list_head cgwb_list;
+#endif
};

struct blkg_stat {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 74e0ae0..67a42ec 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -35,6 +35,7 @@
#include <uapi/linux/fs.h>

struct backing_dev_info;
+struct bdi_writeback;
struct export_operations;
struct hd_geometry;
struct iovec;
@@ -635,6 +636,9 @@ struct inode {

struct hlist_node i_hash;
struct list_head i_wb_list; /* backing dev IO list */
+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct bdi_writeback *i_wb; /* the associated cgroup wb */
+#endif
struct list_head i_lru; /* inode LRU list */
struct list_head i_sb_list;
union {
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 637ef62..662a953 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -388,6 +388,10 @@ enum {
OVER_LIMIT,
};

+#ifdef CONFIG_CGROUP_WRITEBACK
+struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg);
+#endif
+
struct sock;
#if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM)
void sock_update_memcg(struct sock *sk);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5ec7658..4c9386c 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -368,6 +368,401 @@ static void wb_exit(struct bdi_writeback *wb)
fprop_local_destroy_percpu(&wb->completions);
}

+#ifdef CONFIG_CGROUP_WRITEBACK
+
+#include <linux/memcontrol.h>
+
+/*
+ * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree,
+ * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU
+ * protected. cgwb_release_wait is used to wait for the completion of cgwb
+ * releases from bdi destruction path.
+ */
+static DEFINE_SPINLOCK(cgwb_lock);
+static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait);
+
+/**
+ * wb_congested_get_create - get or create a wb_congested
+ * @bdi: associated bdi
+ * @blkcg_id: ID of the associated blkcg
+ * @gfp: allocation mask
+ *
+ * Look up the wb_congested for @blkcg_id on @bdi. If missing, create one.
+ * The returned wb_congested has its reference count incremented. Returns
+ * NULL on failure.
+ */
+struct bdi_writeback_congested *
+wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp)
+{
+ struct bdi_writeback_congested *new_congested = NULL, *congested;
+ struct rb_node **node, *parent;
+ unsigned long flags;
+
+ if (blkcg_id == 1)
+ return &bdi->wb_congested;
+retry:
+ spin_lock_irqsave(&cgwb_lock, flags);
+
+ node = &bdi->cgwb_congested_tree.rb_node;
+ parent = NULL;
+
+ while (*node != NULL) {
+ parent = *node;
+ congested = container_of(parent, struct bdi_writeback_congested,
+ rb_node);
+ if (congested->blkcg_id < blkcg_id)
+ node = &parent->rb_left;
+ else if (congested->blkcg_id > blkcg_id)
+ node = &parent->rb_right;
+ else
+ goto found;
+ }
+
+ if (new_congested) {
+ /* !found and storage for new one already allocated, insert */
+ congested = new_congested;
+ new_congested = NULL;
+ rb_link_node(&congested->rb_node, parent, node);
+ rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree);
+ atomic_inc(&bdi->usage_cnt);
+ goto found;
+ }
+
+ spin_unlock_irqrestore(&cgwb_lock, flags);
+
+ /* allocate storage for new one and retry */
+ new_congested = kzalloc(sizeof(*new_congested), gfp);
+ if (!new_congested)
+ return NULL;
+
+ atomic_set(&new_congested->refcnt, 0);
+ new_congested->bdi = bdi;
+ new_congested->blkcg_id = blkcg_id;
+ goto retry;
+
+found:
+ atomic_inc(&congested->refcnt);
+ spin_unlock_irqrestore(&cgwb_lock, flags);
+ kfree(new_congested);
+ return congested;
+}
+
+/**
+ * wb_congested_put - put a wb_congested
+ * @congested: wb_congested to put
+ *
+ * Put @congested and destroy it if the refcnt reaches zero.
+ */
+void wb_congested_put(struct bdi_writeback_congested *congested)
+{
+ struct backing_dev_info *bdi = congested->bdi;
+ unsigned long flags;
+
+ if (congested->blkcg_id == 1)
+ return;
+
+ local_irq_save(flags);
+ if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) {
+ local_irq_restore(flags);
+ return;
+ }
+
+ rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree);
+ spin_unlock_irqrestore(&cgwb_lock, flags);
+ kfree(congested);
+
+ if (atomic_dec_and_test(&bdi->usage_cnt))
+ wake_up_all(&cgwb_release_wait);
+}
+
+static void cgwb_release_workfn(struct work_struct *work)
+{
+ struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
+ release_work);
+ struct backing_dev_info *bdi = wb->bdi;
+
+ wb_shutdown(wb);
+
+ css_put(wb->memcg_css);
+ css_put(wb->blkcg_css);
+ wb_congested_put(wb->congested);
+
+ percpu_ref_exit(&wb->refcnt);
+ wb_exit(wb);
+ kfree_rcu(wb, rcu);
+
+ if (atomic_dec_and_test(&bdi->usage_cnt))
+ wake_up_all(&cgwb_release_wait);
+}
+
+static void cgwb_release(struct percpu_ref *refcnt)
+{
+ struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback,
+ refcnt);
+ schedule_work(&wb->release_work);
+}
+
+static void cgwb_kill(struct bdi_writeback *wb)
+{
+ lockdep_assert_held(&cgwb_lock);
+
+ WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
+ list_del(&wb->memcg_node);
+ list_del(&wb->blkcg_node);
+ percpu_ref_kill(&wb->refcnt);
+}
+
+static int cgwb_create(struct backing_dev_info *bdi,
+ struct cgroup_subsys_state *memcg_css, gfp_t gfp)
+{
+ struct mem_cgroup *memcg;
+ struct cgroup_subsys_state *blkcg_css;
+ struct blkcg *blkcg;
+ struct list_head *memcg_cgwb_list, *blkcg_cgwb_list;
+ struct bdi_writeback *wb;
+ unsigned long flags;
+ int ret = 0;
+
+ memcg = mem_cgroup_from_css(memcg_css);
+ blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys);
+ blkcg = css_to_blkcg(blkcg_css);
+ memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
+ blkcg_cgwb_list = &blkcg->cgwb_list;
+
+ /* look up again under lock and discard on blkcg mismatch */
+ spin_lock_irqsave(&cgwb_lock, flags);
+ wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+ if (wb && wb->blkcg_css != blkcg_css) {
+ cgwb_kill(wb);
+ wb = NULL;
+ }
+ spin_unlock_irqrestore(&cgwb_lock, flags);
+ if (wb)
+ goto out_put;
+
+ /* need to create a new one */
+ wb = kmalloc(sizeof(*wb), gfp);
+ if (!wb)
+ return -ENOMEM;
+
+ ret = wb_init(wb, bdi, gfp);
+ if (ret)
+ goto err_free;
+
+ ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp);
+ if (ret)
+ goto err_wb_exit;
+
+ wb->congested = wb_congested_get_create(bdi, blkcg_css->id, gfp);
+ if (!wb->congested)
+ goto err_ref_exit;
+
+ wb->memcg_css = memcg_css;
+ wb->blkcg_css = blkcg_css;
+ INIT_WORK(&wb->release_work, cgwb_release_workfn);
+ set_bit(WB_registered, &wb->state);
+
+ /*
+ * The root wb determines the registered state of the whole bdi and
+ * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate
+ * whether they're still online. Don't link @wb if any is dead.
+ * See wb_memcg_offline() and wb_blkcg_offline().
+ */
+ ret = -ENODEV;
+ spin_lock_irqsave(&cgwb_lock, flags);
+ if (test_bit(WB_registered, &bdi->wb.state) &&
+ blkcg_cgwb_list->next && memcg_cgwb_list->next) {
+ /* we might have raced another instance of this function */
+ ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb);
+ if (!ret) {
+ atomic_inc(&bdi->usage_cnt);
+ list_add(&wb->memcg_node, memcg_cgwb_list);
+ list_add(&wb->blkcg_node, blkcg_cgwb_list);
+ css_get(memcg_css);
+ css_get(blkcg_css);
+ }
+ }
+ spin_unlock_irqrestore(&cgwb_lock, flags);
+ if (ret) {
+ if (ret == -EEXIST)
+ ret = 0;
+ goto err_put_congested;
+ }
+ goto out_put;
+
+err_put_congested:
+ wb_congested_put(wb->congested);
+err_ref_exit:
+ percpu_ref_exit(&wb->refcnt);
+err_wb_exit:
+ wb_exit(wb);
+err_free:
+ kfree(wb);
+out_put:
+ css_put(blkcg_css);
+ return ret;
+}
+
+/**
+ * wb_get_create - get wb for a given memcg, create if necessary
+ * @bdi: target bdi
+ * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref)
+ * @gfp: allocation mask to use
+ *
+ * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to
+ * create one. The returned wb has its refcount incremented.
+ *
+ * This function uses css_get() on @memcg_css and thus expects its refcnt
+ * to be positive on invocation. IOW, rcu_read_lock() protection on
+ * @memcg_css isn't enough. try_get it before calling this function.
+ *
+ * A wb is keyed by its associated memcg. As blkcg implicitly enables
+ * memcg on the default hierarchy, memcg association is guaranteed to be
+ * more specific (equal or descendant to the associated blkcg) and thus can
+ * identify both the memcg and blkcg associations.
+ *
+ * Because the blkcg associated with a memcg may change as blkcg is enabled
+ * and disabled closer to root in the hierarchy, each wb keeps track of
+ * both the memcg and blkcg associated with it and verifies the blkcg on
+ * each lookup. On mismatch, the existing wb is discarded and a new one is
+ * created.
+ */
+struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
+ struct cgroup_subsys_state *memcg_css,
+ gfp_t gfp)
+{
+ struct bdi_writeback *wb;
+
+ might_sleep_if(gfp & __GFP_WAIT);
+
+ if (!memcg_css->parent)
+ return &bdi->wb;
+
+ do {
+ rcu_read_lock();
+ wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
+ if (wb) {
+ struct cgroup_subsys_state *blkcg_css;
+
+ /* see whether the blkcg association has changed */
+ blkcg_css = cgroup_get_e_css(memcg_css->cgroup,
+ &blkio_cgrp_subsys);
+ if (unlikely(wb->blkcg_css != blkcg_css ||
+ !wb_tryget(wb)))
+ wb = NULL;
+ css_put(blkcg_css);
+ }
+ rcu_read_unlock();
+ } while (!wb && !cgwb_create(bdi, memcg_css, gfp));
+
+ return wb;
+}
+
+void __inode_attach_wb(struct inode *inode, struct page *page)
+{
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct bdi_writeback *wb = NULL;
+
+ if (inode_cgwb_enabled(inode)) {
+ struct cgroup_subsys_state *memcg_css;
+
+ if (page) {
+ memcg_css = mem_cgroup_css_from_page(page);
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ } else {
+ /* must pin memcg_css, see wb_get_create() */
+ memcg_css = task_get_css(current, memory_cgrp_id);
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ css_put(memcg_css);
+ }
+ }
+
+ if (!wb)
+ wb = &bdi->wb;
+
+ /*
+ * There may be multiple instances of this function racing to
+ * update the same inode. Use cmpxchg() to tell the winner.
+ */
+ if (unlikely(cmpxchg(&inode->i_wb, NULL, wb)))
+ wb_put(wb);
+}
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi)
+{
+ bdi->wb.memcg_css = mem_cgroup_root_css;
+ bdi->wb.blkcg_css = blkcg_root_css;
+ bdi->wb_congested.blkcg_id = 1;
+ INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC);
+ bdi->cgwb_congested_tree = RB_ROOT;
+ atomic_set(&bdi->usage_cnt, 1);
+}
+
+static void cgwb_bdi_destroy(struct backing_dev_info *bdi)
+{
+ struct radix_tree_iter iter;
+ void **slot;
+
+ WARN_ON(test_bit(WB_registered, &bdi->wb.state));
+
+ spin_lock_irq(&cgwb_lock);
+ radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0)
+ cgwb_kill(*slot);
+ spin_unlock_irq(&cgwb_lock);
+
+ /*
+ * All cgwb's and their congested states must be shutdown and
+ * released before returning. Drain the usage counter to wait for
+ * all cgwb's and cgwb_congested's ever created on @bdi.
+ */
+ atomic_dec(&bdi->usage_cnt);
+ wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt));
+}
+
+/**
+ * wb_memcg_offline - kill all wb's associated with a memcg being offlined
+ * @memcg: memcg being offlined
+ *
+ * Also prevents creation of any new wb's associated with @memcg.
+ */
+void wb_memcg_offline(struct mem_cgroup *memcg)
+{
+ LIST_HEAD(to_destroy);
+ struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg);
+ struct bdi_writeback *wb, *next;
+
+ spin_lock_irq(&cgwb_lock);
+ list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
+ cgwb_kill(wb);
+ memcg_cgwb_list->next = NULL; /* prevent new wb's */
+ spin_unlock_irq(&cgwb_lock);
+}
+
+/**
+ * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined
+ * @blkcg: blkcg being offlined
+ *
+ * Also prevents creation of any new wb's associated with @blkcg.
+ */
+void wb_blkcg_offline(struct blkcg *blkcg)
+{
+ LIST_HEAD(to_destroy);
+ struct bdi_writeback *wb, *next;
+
+ spin_lock_irq(&cgwb_lock);
+ list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node)
+ cgwb_kill(wb);
+ blkcg->cgwb_list.next = NULL; /* prevent new wb's */
+ spin_unlock_irq(&cgwb_lock);
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static void cgwb_bdi_init(struct backing_dev_info *bdi) { }
+static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { }
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
int bdi_init(struct backing_dev_info *bdi)
{
int err;
@@ -386,6 +781,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->wb_congested.state = 0;
bdi->wb.congested = &bdi->wb_congested;

+ cgwb_bdi_init(bdi);
return 0;
}
EXPORT_SYMBOL(bdi_init);
@@ -459,6 +855,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
/* make sure nobody finds us on the bdi_list anymore */
bdi_remove_from_list(bdi);
wb_shutdown(&bdi->wb);
+ cgwb_bdi_destroy(bdi);

if (bdi->dev) {
bdi_debug_unregister(bdi);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 763f8f3..6732c2c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -348,6 +348,10 @@ struct mem_cgroup {
atomic_t numainfo_updating;
#endif

+#ifdef CONFIG_CGROUP_WRITEBACK
+ struct list_head cgwb_list;
+#endif
+
/* List of events which userspace want to receive */
struct list_head event_list;
spinlock_t event_list_lock;
@@ -4011,6 +4015,15 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
}
#endif

+#ifdef CONFIG_CGROUP_WRITEBACK
+
+struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
+{
+ return &memcg->cgwb_list;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
/*
* DO NOT USE IN NEW FILES.
*
@@ -4475,7 +4488,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
#endif
-
+#ifdef CONFIG_CGROUP_WRITEBACK
+ INIT_LIST_HEAD(&memcg->cgwb_list);
+#endif
return &memcg->css;

free_out:
@@ -4563,6 +4578,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
vmpressure_cleanup(&memcg->vmpressure);

memcg_deactivate_kmem(memcg);
+
+ wb_memcg_offline(memcg);
}

static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 78ef551..9b95cf8 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2097,16 +2097,21 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping,
struct mem_cgroup *memcg)
{
+ struct inode *inode = mapping->host;
+
trace_writeback_dirty_page(page, mapping);

if (mapping_cap_account_dirty(mapping)) {
- struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+ struct bdi_writeback *wb;
+
+ inode_attach_wb(inode, page);
+ wb = inode_to_wb(inode);

mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
- __inc_wb_stat(&bdi->wb, WB_RECLAIMABLE);
- __inc_wb_stat(&bdi->wb, WB_DIRTIED);
+ __inc_wb_stat(wb, WB_RECLAIMABLE);
+ __inc_wb_stat(wb, WB_DIRTIED);
task_io_account_write(PAGE_CACHE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
--
2.4.0

2015-05-22 21:15:14

by Tejun Heo

Subject: [PATCH 24/51] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested

A blkg (blkcg_gq) can be congested and decongested independently from
other blkgs on the same request_queue. Accordingly, for cgroup
writeback support, the congestion status at the bdi (backing_dev_info)
should be split and updated separately for each matching blkg.

This patch prepares by adding blkg->wb_congested and associating a
blkg with its matching per-blkcg bdi_writeback_congested on creation.
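
As a rough sketch of where this is heading (not part of this patch; the
real propagation helpers arrive in a later patch and also maintain a
global congested counter), keeping the pointer on the blkg lets the
request layer flip congestion per blkg:

/* Hypothetical illustration: mark a single blkg's writeback congested. */
static void blkg_mark_async_congested(struct blkcg_gq *blkg)
{
        set_bit(WB_async_congested, &blkg->wb_congested->state);
}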

v2: Updated to associate bdi_writeback_congested instead of
bdi_writeback.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/blk-cgroup.c | 17 +++++++++++++++--
include/linux/blk-cgroup.h | 6 ++++++
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 979cfdb..31610ae 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -182,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
struct blkcg_gq *new_blkg)
{
struct blkcg_gq *blkg;
+ struct bdi_writeback_congested *wb_congested;
int i, ret;

WARN_ON_ONCE(!rcu_read_lock_held());
@@ -193,22 +194,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
goto err_free_blkg;
}

+ wb_congested = wb_congested_get_create(&q->backing_dev_info,
+ blkcg->css.id, GFP_ATOMIC);
+ if (!wb_congested) {
+ ret = -ENOMEM;
+ goto err_put_css;
+ }
+
/* allocate */
if (!new_blkg) {
new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
if (unlikely(!new_blkg)) {
ret = -ENOMEM;
- goto err_put_css;
+ goto err_put_congested;
}
}
blkg = new_blkg;
+ blkg->wb_congested = wb_congested;

/* link parent */
if (blkcg_parent(blkcg)) {
blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
if (WARN_ON_ONCE(!blkg->parent)) {
ret = -EINVAL;
- goto err_put_css;
+ goto err_put_congested;
}
blkg_get(blkg->parent);
}
@@ -245,6 +254,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
blkg_put(blkg);
return ERR_PTR(ret);

+err_put_congested:
+ wb_congested_put(wb_congested);
err_put_css:
css_put(&blkcg->css);
err_free_blkg:
@@ -391,6 +402,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
if (blkg->parent)
blkg_put(blkg->parent);

+ wb_congested_put(blkg->wb_congested);
+
blkg_free(blkg);
}
EXPORT_SYMBOL_GPL(__blkg_release_rcu);
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 3033eb1..07a32b8 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -99,6 +99,12 @@ struct blkcg_gq {
struct hlist_node blkcg_node;
struct blkcg *blkcg;

+ /*
+ * Each blkg gets congested separately and the congestion state is
+ * propagated to the matching bdi_writeback_congested.
+ */
+ struct bdi_writeback_congested *wb_congested;
+
/* all non-root blkcg_gq's are guaranteed to have access to parent */
struct blkcg_gq *parent;

--
2.4.0

2015-05-22 21:15:19

by Tejun Heo

Subject: [PATCH 25/51] writeback: attribute stats to the matching per-cgroup bdi_writeback

Until now, all WB_* stats were accounted against the root wb
(bdi_writeback). Now that multiple wb support is in place, let's
attribute the stats to the respective per-cgroup wb's.
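
Once stats live on the per-cgroup wb's, per-cgroup questions can be
answered directly; a minimal sketch (invented helper, not part of this
patch):

/* Sketch: how much reclaimable dirty memory a given cgwb carries. */
static s64 cgwb_reclaimable_pages(struct bdi_writeback *wb)
{
        return wb_stat(wb, WB_RECLAIMABLE);
}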

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
mm/page-writeback.c | 24 +++++++++++++++---------
1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9b95cf8..4d0a9da 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2130,7 +2130,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
if (mapping_cap_account_dirty(mapping)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE);
+ dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
}
}
@@ -2191,10 +2191,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
void account_page_redirty(struct page *page)
{
struct address_space *mapping = page->mapping;
+
if (mapping && mapping_cap_account_dirty(mapping)) {
+ struct bdi_writeback *wb = inode_to_wb(mapping->host);
+
current->nr_dirtied--;
dec_zone_page_state(page, NR_DIRTIED);
- dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED);
+ dec_wb_stat(wb, WB_DIRTIED);
}
}
EXPORT_SYMBOL(account_page_redirty);
@@ -2373,8 +2376,7 @@ int clear_page_dirty_for_io(struct page *page)
if (TestClearPageDirty(page)) {
mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_wb_stat(&inode_to_bdi(mapping->host)->wb,
- WB_RECLAIMABLE);
+ dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
ret = 1;
}
mem_cgroup_end_page_stat(memcg);
@@ -2392,7 +2394,8 @@ int test_clear_page_writeback(struct page *page)

memcg = mem_cgroup_begin_page_stat(page);
if (mapping) {
- struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+ struct inode *inode = mapping->host;
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;

spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2402,8 +2405,10 @@ int test_clear_page_writeback(struct page *page)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
- __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
- __wb_writeout_inc(&bdi->wb);
+ struct bdi_writeback *wb = inode_to_wb(inode);
+
+ __dec_wb_stat(wb, WB_WRITEBACK);
+ __wb_writeout_inc(wb);
}
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2427,7 +2432,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)

memcg = mem_cgroup_begin_page_stat(page);
if (mapping) {
- struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
+ struct inode *inode = mapping->host;
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
unsigned long flags;

spin_lock_irqsave(&mapping->tree_lock, flags);
@@ -2437,7 +2443,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
- __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
+ __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
--
2.4.0

2015-05-22 21:25:29

by Tejun Heo

Subject: [PATCH 26/51] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback

Currently, balance_dirty_pages() always works on bdi->wb. This patch
updates it to work on the wb (bdi_writeback) matching the memcg and
blkcg of the current task, as that's what the inode is being dirtied
against.

balance_dirty_pages_ratelimited() now pins the current wb and passes
it to balance_dirty_pages().
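
The pin-use-put pattern looks roughly like this (a simplified sketch
with an invented function name, not the actual code; the real path also
handles ratelimiting and the !cgwb case):

/* Sketch of the wb handling added to balance_dirty_pages_ratelimited(). */
static void throttle_current_dirtier(struct backing_dev_info *bdi)
{
        struct bdi_writeback *wb = wb_get_create_current(bdi, GFP_KERNEL);

        if (!wb)
                wb = &bdi->wb;  /* allocation failure: fall back to root wb */

        /* ... throttle against @wb's dirty state and bandwidth ... */

        wb_put(wb);             /* noop for the root wb */
}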

As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
visible behavior differences.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
mm/page-writeback.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4d0a9da..e31dea9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1337,6 +1337,7 @@ static inline void wb_dirty_limits(struct bdi_writeback *wb,
* perform some writeout.
*/
static void balance_dirty_pages(struct address_space *mapping,
+ struct bdi_writeback *wb,
unsigned long pages_dirtied)
{
unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */
@@ -1352,8 +1353,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long task_ratelimit;
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
- struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
- struct bdi_writeback *wb = &bdi->wb;
+ struct backing_dev_info *bdi = wb->bdi;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;

@@ -1575,14 +1575,20 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
*/
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
- struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
- struct bdi_writeback *wb = &bdi->wb;
+ struct inode *inode = mapping->host;
+ struct backing_dev_info *bdi = inode_to_bdi(inode);
+ struct bdi_writeback *wb = NULL;
int ratelimit;
int *p;

if (!bdi_cap_account_dirty(bdi))
return;

+ if (inode_cgwb_enabled(inode))
+ wb = wb_get_create_current(bdi, GFP_KERNEL);
+ if (!wb)
+ wb = &bdi->wb;
+
ratelimit = current->nr_dirtied_pause;
if (wb->dirty_exceeded)
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
@@ -1616,7 +1622,9 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
preempt_enable();

if (unlikely(current->nr_dirtied >= ratelimit))
- balance_dirty_pages(mapping, current->nr_dirtied);
+ balance_dirty_pages(mapping, wb, current->nr_dirtied);
+
+ wb_put(wb);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);

--
2.4.0

2015-05-22 21:24:35

by Tejun Heo

Subject: [PATCH 27/51] writeback: make congestion functions per bdi_writeback

Currently, all congestion functions take bdi (backing_dev_info) and
always operate on the root wb (bdi->wb) and the congestion state from
the block layer is propagated only for the root blkcg. This patch
introduces {set|clear}_wb_congested() and wb_congested() which take a
bdi_writeback_congested and bdi_writeback respectively. The bdi
counterparts are now wrappers invoking the wb based functions on
@bdi->wb.

While converting clear_bdi_congested() to clear_wb_congested(), the
local variable declaration order between @wqh and @bit is swapped for
cosmetic reasons.

This patch just adds the new wb based functions. The following
patches will apply them.
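
A converted caller would look something like the following sketch
(hypothetical helper; the actual conversions are done by the later
patches):

/* Sketch: per-wb congestion query once callers are converted. */
static bool wb_should_back_off(struct bdi_writeback *wb)
{
        return wb_congested(wb, 1 << WB_async_congested);
}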

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Cc: Jens Axboe <[email protected]>
---
include/linux/backing-dev-defs.h | 14 +++++++++++--
include/linux/backing-dev.h | 45 +++++++++++++++++++++++-----------------
mm/backing-dev.c | 22 ++++++++++----------
3 files changed, 49 insertions(+), 32 deletions(-)

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index a1e9c40..eb38676 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -163,7 +163,17 @@ enum {
BLK_RW_SYNC = 1,
};

-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
+
+static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+ clear_wb_congested(bdi->wb.congested, sync);
+}
+
+static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+{
+ set_wb_congested(bdi->wb.congested, sync);
+}

#endif /* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8ae59df..2c498a2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -167,27 +167,13 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
return sb->s_bdi;
}

-static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
+static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
{
- if (bdi->congested_fn)
- return bdi->congested_fn(bdi->congested_data, bdi_bits);
- return (bdi->wb.congested->state & bdi_bits);
-}
-
-static inline int bdi_read_congested(struct backing_dev_info *bdi)
-{
- return bdi_congested(bdi, 1 << WB_sync_congested);
-}
-
-static inline int bdi_write_congested(struct backing_dev_info *bdi)
-{
- return bdi_congested(bdi, 1 << WB_async_congested);
-}
+ struct backing_dev_info *bdi = wb->bdi;

-static inline int bdi_rw_congested(struct backing_dev_info *bdi)
-{
- return bdi_congested(bdi, (1 << WB_sync_congested) |
- (1 << WB_async_congested));
+ if (bdi->congested_fn)
+ return bdi->congested_fn(bdi->congested_data, cong_bits);
+ return wb->congested->state & cong_bits;
}

long congestion_wait(int sync, long timeout);
@@ -454,4 +440,25 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)

#endif /* CONFIG_CGROUP_WRITEBACK */

+static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
+{
+ return wb_congested(&bdi->wb, cong_bits);
+}
+
+static inline int bdi_read_congested(struct backing_dev_info *bdi)
+{
+ return bdi_congested(bdi, 1 << WB_sync_congested);
+}
+
+static inline int bdi_write_congested(struct backing_dev_info *bdi)
+{
+ return bdi_congested(bdi, 1 << WB_async_congested);
+}
+
+static inline int bdi_rw_congested(struct backing_dev_info *bdi)
+{
+ return bdi_congested(bdi, (1 << WB_sync_congested) |
+ (1 << WB_async_congested));
+}
+
#endif /* _LINUX_BACKING_DEV_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4c9386c..5029c4a 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -896,31 +896,31 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
-static atomic_t nr_bdi_congested[2];
+static atomic_t nr_wb_congested[2];

-void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
+void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
{
- enum wb_state bit;
wait_queue_head_t *wqh = &congestion_wqh[sync];
+ enum wb_state bit;

bit = sync ? WB_sync_congested : WB_async_congested;
- if (test_and_clear_bit(bit, &bdi->wb.congested->state))
- atomic_dec(&nr_bdi_congested[sync]);
+ if (test_and_clear_bit(bit, &congested->state))
+ atomic_dec(&nr_wb_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
wake_up(wqh);
}
-EXPORT_SYMBOL(clear_bdi_congested);
+EXPORT_SYMBOL(clear_wb_congested);

-void set_bdi_congested(struct backing_dev_info *bdi, int sync)
+void set_wb_congested(struct bdi_writeback_congested *congested, int sync)
{
enum wb_state bit;

bit = sync ? WB_sync_congested : WB_async_congested;
- if (!test_and_set_bit(bit, &bdi->wb.congested->state))
- atomic_inc(&nr_bdi_congested[sync]);
+ if (!test_and_set_bit(bit, &congested->state))
+ atomic_inc(&nr_wb_congested[sync]);
}
-EXPORT_SYMBOL(set_bdi_congested);
+EXPORT_SYMBOL(set_wb_congested);

/**
* congestion_wait - wait for a backing_dev to become uncongested
@@ -979,7 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
* encountered in the current zone, yield if necessary instead
* of sleeping on the congestion queue
*/
- if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
+ if (atomic_read(&nr_wb_congested[sync]) == 0 ||
!test_bit(ZONE_CONGESTED, &zone->flags)) {
cond_resched();

--
2.4.0

2015-05-22 21:24:31

by Tejun Heo

Subject: [PATCH 28/51] writeback, blkcg: restructure blk_{set|clear}_queue_congested()

blk_{set|clear}_queue_congested() take @q and set or clear,
respectively, the congestion state of its bdi's root wb. Because bdi
used to be able to handle congestion state only on the root wb, the
callers of those functions tested whether the congestion is on the
root blkcg and skipped if not.

This is cumbersome and makes implementation of per cgroup
bdi_writeback congestion state propagation difficult. This patch
renames blk_{set|clear}_queue_congested() to
blk_{set|clear}_congested(), and makes them take request_list instead
of request_queue and test whether the specified request_list is the
root one before updating bdi_writeback congestion state. This makes
the tests in the callers unnecessary and simplifies them.

As there are no external users of these functions, the definitions are
moved from include/linux/blkdev.h to block/blk-core.c.

This patch doesn't introduce any noticeable behavior difference.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/blk-core.c | 62 ++++++++++++++++++++++++++++++--------------------
include/linux/blkdev.h | 19 ----------------
2 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e0f726f..b457c4f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -63,6 +63,28 @@ struct kmem_cache *blk_requestq_cachep;
*/
static struct workqueue_struct *kblockd_workqueue;

+static void blk_clear_congested(struct request_list *rl, int sync)
+{
+ if (rl != &rl->q->root_rl)
+ return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+ clear_wb_congested(rl->blkg->wb_congested, sync);
+#else
+ clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
+
+static void blk_set_congested(struct request_list *rl, int sync)
+{
+ if (rl != &rl->q->root_rl)
+ return;
+#ifdef CONFIG_CGROUP_WRITEBACK
+ set_wb_congested(rl->blkg->wb_congested, sync);
+#else
+ set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+#endif
+}
+
void blk_queue_congestion_threshold(struct request_queue *q)
{
int nr;
@@ -841,13 +863,8 @@ static void __freed_request(struct request_list *rl, int sync)
{
struct request_queue *q = rl->q;

- /*
- * bdi isn't aware of blkcg yet. As all async IOs end up root
- * blkcg anyway, just use root blkcg state.
- */
- if (rl == &q->root_rl &&
- rl->count[sync] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, sync);
+ if (rl->count[sync] < queue_congestion_off_threshold(q))
+ blk_clear_congested(rl, sync);

if (rl->count[sync] + 1 <= q->nr_requests) {
if (waitqueue_active(&rl->wait[sync]))
@@ -880,25 +897,25 @@ static void freed_request(struct request_list *rl, unsigned int flags)
int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
{
struct request_list *rl;
+ int on_thresh, off_thresh;

spin_lock_irq(q->queue_lock);
q->nr_requests = nr;
blk_queue_congestion_threshold(q);
+ on_thresh = queue_congestion_on_threshold(q);
+ off_thresh = queue_congestion_off_threshold(q);

- /* congestion isn't cgroup aware and follows root blkcg for now */
- rl = &q->root_rl;
-
- if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_SYNC);
- else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_SYNC);
+ blk_queue_for_each_rl(rl, q) {
+ if (rl->count[BLK_RW_SYNC] >= on_thresh)
+ blk_set_congested(rl, BLK_RW_SYNC);
+ else if (rl->count[BLK_RW_SYNC] < off_thresh)
+ blk_clear_congested(rl, BLK_RW_SYNC);

- if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
- blk_set_queue_congested(q, BLK_RW_ASYNC);
- else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
- blk_clear_queue_congested(q, BLK_RW_ASYNC);
+ if (rl->count[BLK_RW_ASYNC] >= on_thresh)
+ blk_set_congested(rl, BLK_RW_ASYNC);
+ else if (rl->count[BLK_RW_ASYNC] < off_thresh)
+ blk_clear_congested(rl, BLK_RW_ASYNC);

- blk_queue_for_each_rl(rl, q) {
if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
blk_set_rl_full(rl, BLK_RW_SYNC);
} else {
@@ -1008,12 +1025,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
}
}
}
- /*
- * bdi isn't aware of blkcg yet. As all async IOs end up
- * root blkcg anyway, just use root blkcg state.
- */
- if (rl == &q->root_rl)
- blk_set_queue_congested(q, is_sync);
+ blk_set_congested(rl, is_sync);
}

/*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 89bdef0..3d1065c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -794,25 +794,6 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,

extern void blk_queue_bio(struct request_queue *q, struct bio *bio);

-/*
- * A queue has just exitted congestion. Note this in the global counter of
- * congested queues, and wake up anyone who was waiting for requests to be
- * put back.
- */
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
-{
- clear_bdi_congested(&q->backing_dev_info, sync);
-}
-
-/*
- * A queue has just entered congestion. Flag that in the queue's VM-visible
- * state flags and increment the global gounter of congested queues.
- */
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
-{
- set_bdi_congested(&q->backing_dev_info, sync);
-}
-
extern void blk_start_queue(struct request_queue *q);
extern void blk_stop_queue(struct request_queue *q);
extern void blk_sync_queue(struct request_queue *q);
--
2.4.0

2015-05-22 21:24:05

by Tejun Heo

Subject: [PATCH 29/51] writeback, blkcg: propagate non-root blkcg congestion state

Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
blk_{set|clear}_congested() can propagate non-root blkcg congestion
state to them.

This can be easily achieved by disabling the root_rl tests in
blk_{set|clear}_congested(). Note that we still need those tests when
!CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
wb's congestion state for events happening on other blkcgs.

v2: Updated for bdi_writeback_congested.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
block/blk-core.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b457c4f..cf6974e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -65,23 +65,26 @@ static struct workqueue_struct *kblockd_workqueue;

static void blk_clear_congested(struct request_list *rl, int sync)
{
- if (rl != &rl->q->root_rl)
- return;
#ifdef CONFIG_CGROUP_WRITEBACK
clear_wb_congested(rl->blkg->wb_congested, sync);
#else
- clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+ /*
+ * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
+ * flip its congestion state for events on other blkcgs.
+ */
+ if (rl == &rl->q->root_rl)
+ clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}

static void blk_set_congested(struct request_list *rl, int sync)
{
- if (rl != &rl->q->root_rl)
- return;
#ifdef CONFIG_CGROUP_WRITEBACK
set_wb_congested(rl->blkg->wb_congested, sync);
#else
- set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
+ /* see blk_clear_congested() */
+ if (rl == &rl->q->root_rl)
+ set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
#endif
}

--
2.4.0

2015-05-22 21:15:24

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 30/51] writeback: implement and use inode_congested()

In several places, bdi_congested() and its wrappers are used to
determine whether more IOs should be issued. With cgroup writeback
support, this question can't be answered solely based on the bdi
(backing_dev_info). It depends on whether the filesystem and bdi
support cgroup writeback and on the blkcg the inode is associated with.

This patch implements inode_congested() and its wrappers, which take
@inode and determine the congestion state considering cgroup
writeback. The new functions replace bdi_*congested() calls in places
where the query is about a specific inode and task.

There are several filesystem users which also fit these criteria, but
they should be updated when each filesystem implements cgroup
writeback support.

v2: Now that a given inode is associated with only one wb, congestion
state can be determined independently of the asking task. Drop
@task. Spotted by Vivek. Also, converted to take @inode instead
of @mapping and renamed to inode_congested().
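
As an illustration only (not part of the patch), here is a minimal
sketch of how a filesystem's WB_SYNC_NONE writeback path might consult
the new wrappers; example_should_defer_writepages() is a hypothetical
helper, not an existing kernel function:

#include <linux/fs.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>

/* Hypothetical helper: back off opportunistic writeback when the wb
 * serving @inode is congested for async (write) traffic. */
static bool example_should_defer_writepages(struct inode *inode,
					    struct writeback_control *wbc)
{
	/* WB_SYNC_ALL writeback must proceed regardless of congestion. */
	if (wbc->sync_mode == WB_SYNC_ALL)
		return false;

	/*
	 * inode_write_congested() resolves to the cgwb serving @inode
	 * when cgroup writeback is enabled, and to the root wb otherwise.
	 */
	return inode_write_congested(inode);
}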

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Vivek Goyal <[email protected]>
---
fs/fs-writeback.c | 29 +++++++++++++++++++++++++++++
include/linux/backing-dev.h | 22 ++++++++++++++++++++++
mm/fadvise.c | 2 +-
mm/readahead.c | 2 +-
mm/vmscan.c | 11 +++++------
5 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 99a2440..7ec491b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -142,6 +142,35 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
wb_queue_work(wb, work);
}

+#ifdef CONFIG_CGROUP_WRITEBACK
+
+/**
+ * inode_congested - test whether an inode is congested
+ * @inode: inode to test for congestion
+ * @cong_bits: mask of WB_[a]sync_congested bits to test
+ *
+ * Tests whether @inode is congested. @cong_bits is the mask of congestion
+ * bits to test and the return value is the mask of set bits.
+ *
+ * If cgroup writeback is enabled for @inode, the congestion state is
+ * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
+ * associated with @inode is congested; otherwise, the root wb's congestion
+ * state is used.
+ */
+int inode_congested(struct inode *inode, int cong_bits)
+{
+ if (inode) {
+ struct bdi_writeback *wb = inode_to_wb(inode);
+ if (wb)
+ return wb_congested(wb, cong_bits);
+ }
+
+ return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
+}
+EXPORT_SYMBOL_GPL(inode_congested);
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
/**
* bdi_start_writeback - start writeback
* @bdi: the backing device to write from
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2c498a2..6f08821 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -230,6 +230,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
void __inode_attach_wb(struct inode *inode, struct page *page);
void wb_memcg_offline(struct mem_cgroup *memcg);
void wb_blkcg_offline(struct blkcg *blkcg);
+int inode_congested(struct inode *inode, int cong_bits);

/**
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
@@ -438,8 +439,29 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
{
}

+static inline int inode_congested(struct inode *inode, int cong_bits)
+{
+ return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

+static inline int inode_read_congested(struct inode *inode)
+{
+ return inode_congested(inode, 1 << WB_sync_congested);
+}
+
+static inline int inode_write_congested(struct inode *inode)
+{
+ return inode_congested(inode, 1 << WB_async_congested);
+}
+
+static inline int inode_rw_congested(struct inode *inode)
+{
+ return inode_congested(inode, (1 << WB_sync_congested) |
+ (1 << WB_async_congested));
+}
+
static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
{
return wb_congested(&bdi->wb, cong_bits);
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 4a3907c..b8a5bc6 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice)
case POSIX_FADV_NOREUSE:
break;
case POSIX_FADV_DONTNEED:
- if (!bdi_write_congested(bdi))
+ if (!inode_write_congested(mapping->host))
__filemap_fdatawrite_range(mapping, offset, endbyte,
WB_SYNC_NONE);

diff --git a/mm/readahead.c b/mm/readahead.c
index 9356758..60cd846 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping,
/*
* Defer asynchronous read-ahead on IO congestion.
*/
- if (bdi_read_congested(inode_to_bdi(mapping->host)))
+ if (inode_read_congested(mapping->host))
return;

/* do read-ahead */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7582f9f..f463398 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -452,14 +452,13 @@ static inline int is_page_cache_freeable(struct page *page)
return page_count(page) - page_has_private(page) == 2;
}

-static int may_write_to_queue(struct backing_dev_info *bdi,
- struct scan_control *sc)
+static int may_write_to_inode(struct inode *inode, struct scan_control *sc)
{
if (current->flags & PF_SWAPWRITE)
return 1;
- if (!bdi_write_congested(bdi))
+ if (!inode_write_congested(inode))
return 1;
- if (bdi == current->backing_dev_info)
+ if (inode_to_bdi(inode) == current->backing_dev_info)
return 1;
return 0;
}
@@ -538,7 +537,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
}
if (mapping->a_ops->writepage == NULL)
return PAGE_ACTIVATE;
- if (!may_write_to_queue(inode_to_bdi(mapping->host), sc))
+ if (!may_write_to_inode(mapping->host, sc))
return PAGE_KEEP;

if (clear_page_dirty_for_io(page)) {
@@ -924,7 +923,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
mapping = page_mapping(page);
if (((dirty || writeback) && mapping &&
- bdi_write_congested(inode_to_bdi(mapping->host))) ||
+ inode_write_congested(mapping->host)) ||
(writeback && PageReclaim(page)))
nr_congested++;

--
2.4.0

2015-05-22 21:23:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 31/51] writeback: implement WB_has_dirty_io wb_state flag

Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
has any dirty inode by testing all three IO lists on each invocation
without actively keeping track. For cgroup writeback support, a
single bdi will host multiple wb's, each hosting dirty inodes
separately, and bdi_has_dirty_io(), which currently only represents
the root wb, will need to aggregate has_dirty_io from all member
wb's. That requires tracking has_dirty_io state transitions on each
wb.

This patch introduces inode_wb_list_{move|del}_locked() to consolidate
IO list operations leaving queue_io() the only other function which
directly manipulates IO lists (via move_expired_inodes()). All three
functions are updated to call wb_io_lists_[de]populated() which keep
track of whether the wb has dirty inodes or not and record it using
the new WB_has_dirty_io flag. inode_wb_list_move_locked()'s return
value indicates whether the wb had no dirty inodes before.

__mark_inode_dirty() is restructured so that the return value of
inode_wb_list_move_locked() can be used for deciding whether to wake
up the wb.

While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
These functions were returning 0 and 1 before. Also, add a comment
explaining the synchronization of wb_state flags.

v2: Updated to accommodate b_dirty_time.
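
For illustration only (not part of the patch), the pattern the return
value enables looks roughly like the following; example_queue_dirty_inode()
is hypothetical and assumes fs/fs-writeback.c context where
inode_wb_list_move_locked() is visible:

static void example_queue_dirty_inode(struct inode *inode,
				      struct bdi_writeback *wb)
{
	bool was_clean;

	spin_lock(&wb->list_lock);
	/* true iff @wb had no dirty inodes on b_dirty/b_io/b_more_io before */
	was_clean = inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
	spin_unlock(&wb->list_lock);

	/* wake the flusher only on the clean -> dirty transition */
	if (was_clean)
		wb_wakeup_delayed(wb);
}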

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 110 ++++++++++++++++++++++++++++++---------
include/linux/backing-dev-defs.h | 1 +
include/linux/backing-dev.h | 8 ++-
mm/backing-dev.c | 2 +-
4 files changed, 91 insertions(+), 30 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7ec491b..0a90dc55 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -93,6 +93,66 @@ static inline struct inode *wb_inode(struct list_head *head)

EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);

+static bool wb_io_lists_populated(struct bdi_writeback *wb)
+{
+ if (wb_has_dirty_io(wb)) {
+ return false;
+ } else {
+ set_bit(WB_has_dirty_io, &wb->state);
+ return true;
+ }
+}
+
+static void wb_io_lists_depopulated(struct bdi_writeback *wb)
+{
+ if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
+ list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+ clear_bit(WB_has_dirty_io, &wb->state);
+}
+
+/**
+ * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list
+ * @inode: inode to be moved
+ * @wb: target bdi_writeback
+ * @head: one of @wb->b_{dirty|io|more_io}
+ *
+ * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io.
+ * Returns %true if @inode is the first occupant of the !dirty_time IO
+ * lists; otherwise, %false.
+ */
+static bool inode_wb_list_move_locked(struct inode *inode,
+ struct bdi_writeback *wb,
+ struct list_head *head)
+{
+ assert_spin_locked(&wb->list_lock);
+
+ list_move(&inode->i_wb_list, head);
+
+ /* dirty_time doesn't count as dirty_io until expiration */
+ if (head != &wb->b_dirty_time)
+ return wb_io_lists_populated(wb);
+
+ wb_io_lists_depopulated(wb);
+ return false;
+}
+
+/**
+ * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list
+ * @inode: inode to be removed
+ * @wb: bdi_writeback @inode is being removed from
+ *
+ * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
+ * clear %WB_has_dirty_io if all are empty afterwards.
+ */
+static void inode_wb_list_del_locked(struct inode *inode,
+ struct bdi_writeback *wb)
+{
+ assert_spin_locked(&wb->list_lock);
+
+ list_del_init(&inode->i_wb_list);
+ wb_io_lists_depopulated(wb);
+}
+
static void wb_wakeup(struct bdi_writeback *wb)
{
spin_lock_bh(&wb->work_lock);
@@ -217,7 +277,7 @@ void inode_wb_list_del(struct inode *inode)
struct bdi_writeback *wb = inode_to_wb(inode);

spin_lock(&wb->list_lock);
- list_del_init(&inode->i_wb_list);
+ inode_wb_list_del_locked(inode, wb);
spin_unlock(&wb->list_lock);
}

@@ -232,7 +292,6 @@ void inode_wb_list_del(struct inode *inode)
*/
static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
- assert_spin_locked(&wb->list_lock);
if (!list_empty(&wb->b_dirty)) {
struct inode *tail;

@@ -240,7 +299,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
if (time_before(inode->dirtied_when, tail->dirtied_when))
inode->dirtied_when = jiffies;
}
- list_move(&inode->i_wb_list, &wb->b_dirty);
+ inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
}

/*
@@ -248,8 +307,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
*/
static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
- assert_spin_locked(&wb->list_lock);
- list_move(&inode->i_wb_list, &wb->b_more_io);
+ inode_wb_list_move_locked(inode, wb, &wb->b_more_io);
}

static void inode_sync_complete(struct inode *inode)
@@ -358,6 +416,8 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
EXPIRE_DIRTY_ATIME, work);
+ if (moved)
+ wb_io_lists_populated(wb);
trace_writeback_queue_io(wb, work, moved);
}

@@ -483,10 +543,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
redirty_tail(inode, wb);
} else if (inode->i_state & I_DIRTY_TIME) {
inode->dirtied_when = jiffies;
- list_move(&inode->i_wb_list, &wb->b_dirty_time);
+ inode_wb_list_move_locked(inode, wb, &wb->b_dirty_time);
} else {
/* The inode is clean. Remove from writeback lists. */
- list_del_init(&inode->i_wb_list);
+ inode_wb_list_del_locked(inode, wb);
}
}

@@ -628,7 +688,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
* touch it. See comment above for explanation.
*/
if (!(inode->i_state & I_DIRTY_ALL))
- list_del_init(&inode->i_wb_list);
+ inode_wb_list_del_locked(inode, wb);
spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
out:
@@ -1327,37 +1387,39 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
+ struct list_head *dirty_list;
bool wakeup_bdi = false;
bdi = inode_to_bdi(inode);

spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.list_lock);
- if (bdi_cap_writeback_dirty(bdi)) {
- WARN(!test_bit(WB_registered, &bdi->wb.state),
- "bdi-%s not registered\n", bdi->name);

- /*
- * If this is the first dirty inode for this
- * bdi, we have to wake-up the corresponding
- * bdi thread to make sure background
- * write-back happens later.
- */
- if (!wb_has_dirty_io(&bdi->wb))
- wakeup_bdi = true;
- }
+ WARN(bdi_cap_writeback_dirty(bdi) &&
+ !test_bit(WB_registered, &bdi->wb.state),
+ "bdi-%s not registered\n", bdi->name);

inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;
+
if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
- list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+ dirty_list = &bdi->wb.b_dirty;
else
- list_move(&inode->i_wb_list,
- &bdi->wb.b_dirty_time);
+ dirty_list = &bdi->wb.b_dirty_time;
+
+ wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
+ dirty_list);
+
spin_unlock(&bdi->wb.list_lock);
trace_writeback_dirty_inode_enqueue(inode);

- if (wakeup_bdi)
+ /*
+ * If this is the first dirty inode for this bdi,
+ * we have to wake-up the corresponding bdi thread
+ * to make sure background write-back happens
+ * later.
+ */
+ if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
wb_wakeup_delayed(&bdi->wb);
return;
}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index eb38676..7a94b78 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -21,6 +21,7 @@ struct dentry;
enum wb_state {
WB_registered, /* bdi_register() was done */
WB_writeback_running, /* Writeback is in progress */
+ WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
};

enum wb_congested_state {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6f08821..3c8403c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,7 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
void wb_workfn(struct work_struct *work);
-int bdi_has_dirty_io(struct backing_dev_info *bdi);
+bool bdi_has_dirty_io(struct backing_dev_info *bdi);
void wb_wakeup_delayed(struct bdi_writeback *wb);

extern spinlock_t bdi_lock;
@@ -37,11 +37,9 @@ extern struct list_head bdi_list;

extern struct workqueue_struct *bdi_wq;

-static inline int wb_has_dirty_io(struct bdi_writeback *wb)
+static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
{
- return !list_empty(&wb->b_dirty) ||
- !list_empty(&wb->b_io) ||
- !list_empty(&wb->b_more_io);
+ return test_bit(WB_has_dirty_io, &wb->state);
}

static inline void __add_wb_stat(struct bdi_writeback *wb,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5029c4a..161ddf1 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -256,7 +256,7 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);

-int bdi_has_dirty_io(struct backing_dev_info *bdi)
+bool bdi_has_dirty_io(struct backing_dev_info *bdi)
{
return wb_has_dirty_io(&bdi->wb);
}
--
2.4.0

2015-05-22 21:22:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 32/51] writeback: implement backing_dev_info->tot_write_bandwidth

cgroup writeback support needs to keep track of the sum of
avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
distribute write workload. This patch adds bdi->tot_write_bandwidth
and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
and wb_update_write_bandwidth() to adjust it as wb's gain and lose
dirty inodes and their avg_write_bandwidth gets updated.

As the update events are not synchronized with each other,
bdi->tot_write_bandwidth is an atomic_long_t.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 7 ++++++-
include/linux/backing-dev-defs.h | 2 ++
mm/page-writeback.c | 3 +++
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0a90dc55..bbccf68 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,6 +99,8 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
return false;
} else {
set_bit(WB_has_dirty_io, &wb->state);
+ atomic_long_add(wb->avg_write_bandwidth,
+ &wb->bdi->tot_write_bandwidth);
return true;
}
}
@@ -106,8 +108,11 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
static void wb_io_lists_depopulated(struct bdi_writeback *wb)
{
if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
- list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
+ list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
clear_bit(WB_has_dirty_io, &wb->state);
+ atomic_long_sub(wb->avg_write_bandwidth,
+ &wb->bdi->tot_write_bandwidth);
+ }
}

/**
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 7a94b78..d631a61 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -142,6 +142,8 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;

+ atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+
struct bdi_writeback wb; /* the root writeback info for this bdi */
struct bdi_writeback_congested wb_congested; /* its congested state */
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e31dea9..c95eb24 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
avg += (old - avg) >> 3;

out:
+ if (wb_has_dirty_io(wb))
+ atomic_long_add(avg - wb->avg_write_bandwidth,
+ &wb->bdi->tot_write_bandwidth);
wb->write_bandwidth = bw;
wb->avg_write_bandwidth = avg;
}
--
2.4.0

2015-05-22 21:22:33

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 33/51] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account

bdi_has_dirty_io() used to only reflect whether the root wb
(bdi_writeback) has dirty inodes. For cgroup writeback support, it
needs to take all active wb's into account. If any wb on the bdi has
dirty inodes, bdi_has_dirty_io() should return true.

To achieve that, as inode_wb_list_{move|del}_locked() now keep track
of each wb's dirty state transitions, the number of dirty wbs could be
counted in the bdi; however, the bdi already aggregates
wb->avg_write_bandwidth, and that sum can easily be guaranteed to be
> 0 whenever there are any dirty inodes by ensuring
wb->avg_write_bandwidth never dips below 1. bdi_has_dirty_io() can
then simply test whether bdi->tot_write_bandwidth is zero or not.

While this bumps the value of wb->avg_write_bandwidth to one when it
used to be zero, this shouldn't cause any meaningful behavior
difference.

bdi_has_dirty_io() is made an inline function which tests whether
->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its
value are added to inode_wb_list_{move|del}_locked().

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 5 +++--
include/linux/backing-dev-defs.h | 8 ++++++--
include/linux/backing-dev.h | 10 +++++++++-
mm/backing-dev.c | 5 -----
mm/page-writeback.c | 10 +++++++---
5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bbccf68..c98d392 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -99,6 +99,7 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
return false;
} else {
set_bit(WB_has_dirty_io, &wb->state);
+ WARN_ON_ONCE(!wb->avg_write_bandwidth);
atomic_long_add(wb->avg_write_bandwidth,
&wb->bdi->tot_write_bandwidth);
return true;
@@ -110,8 +111,8 @@ static void wb_io_lists_depopulated(struct bdi_writeback *wb)
if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
clear_bit(WB_has_dirty_io, &wb->state);
- atomic_long_sub(wb->avg_write_bandwidth,
- &wb->bdi->tot_write_bandwidth);
+ WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
+ &wb->bdi->tot_write_bandwidth) < 0);
}
}

diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index d631a61..8c857d7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -98,7 +98,7 @@ struct bdi_writeback {
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
- unsigned long avg_write_bandwidth; /* further smoothed write bw */
+ unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */

/*
* The base dirty throttle rate, re-calculated on every 200ms.
@@ -142,7 +142,11 @@ struct backing_dev_info {
unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;

- atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
+ /*
+ * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are
+ * any dirty wbs, which is depended upon by bdi_has_dirty_io().
+ */
+ atomic_long_t tot_write_bandwidth;

struct bdi_writeback wb; /* the root writeback info for this bdi */
struct bdi_writeback_congested wb_congested; /* its congested state */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 3c8403c..0839e44 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,7 +29,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
void wb_workfn(struct work_struct *work);
-bool bdi_has_dirty_io(struct backing_dev_info *bdi);
void wb_wakeup_delayed(struct bdi_writeback *wb);

extern spinlock_t bdi_lock;
@@ -42,6 +41,15 @@ static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
return test_bit(WB_has_dirty_io, &wb->state);
}

+static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+ /*
+ * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
+ * any dirty wbs. See wb_update_write_bandwidth().
+ */
+ return atomic_long_read(&bdi->tot_write_bandwidth);
+}
+
static inline void __add_wb_stat(struct bdi_writeback *wb,
enum wb_stat_item item, s64 amount)
{
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 161ddf1..d2f16fc9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -256,11 +256,6 @@ static int __init default_bdi_init(void)
}
subsys_initcall(default_bdi_init);

-bool bdi_has_dirty_io(struct backing_dev_info *bdi)
-{
- return wb_has_dirty_io(&bdi->wb);
-}
-
/*
* This function is used when the first inode for this wb is marked dirty. It
* wakes-up the corresponding bdi thread which should then take care of the
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index c95eb24..99b8846 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -881,9 +881,13 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
avg += (old - avg) >> 3;

out:
- if (wb_has_dirty_io(wb))
- atomic_long_add(avg - wb->avg_write_bandwidth,
- &wb->bdi->tot_write_bandwidth);
+ /* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */
+ avg = max(avg, 1LU);
+ if (wb_has_dirty_io(wb)) {
+ long delta = avg - wb->avg_write_bandwidth;
+ WARN_ON_ONCE(atomic_long_add_return(delta,
+ &wb->bdi->tot_write_bandwidth) <= 0);
+ }
wb->write_bandwidth = bw;
wb->avg_write_bandwidth = avg;
}
--
2.4.0

2015-05-22 21:22:09

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 34/51] writeback: don't issue wb_writeback_work if clean

There are several places in fs/fs-writeback.c which queue
wb_writeback_work without checking whether the target wb
(bdi_writeback) has dirty inodes or not. The only thing a
wb_writeback_work does is write back the dirty inodes of the target
wb, so queueing a work item for a clean wb is essentially a noop.
There are some side effects such as bandwidth stats being updated and
tracepoints being triggered, but these don't affect the operation in
any meaningful way.

This patch makes writeback_inodes_sb_nr() and sync_inodes_sb() skip
wb_queue_work() if the target bdi is clean. Also, it moves the
dirtiness check from wakeup_flusher_threads() to
__wb_start_writeback() so that all its callers benefit from the check.

While the overhead incurred by scheduling a noop work isn't currently
significant, the overhead may be higher with cgroup writeback support
as we may end up issuing noop work items to a lot of clean wb's.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c98d392..921a9e4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -189,6 +189,9 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
{
struct wb_writeback_work *work;

+ if (!wb_has_dirty_io(wb))
+ return;
+
/*
* This is WB_SYNC_NONE writeback, so if allocation fails just
* wakeup the thread for old dirty data writeback
@@ -1215,11 +1218,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
nr_pages = get_nr_dirty_pages();

rcu_read_lock();
- list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
- if (!bdi_has_dirty_io(bdi))
- continue;
+ list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
__wb_start_writeback(&bdi->wb, nr_pages, false, reason);
- }
rcu_read_unlock();
}

@@ -1512,11 +1512,12 @@ void writeback_inodes_sb_nr(struct super_block *sb,
.nr_pages = nr,
.reason = reason,
};
+ struct backing_dev_info *bdi = sb->s_bdi;

- if (sb->s_bdi == &noop_backing_dev_info)
+ if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- wb_queue_work(&sb->s_bdi->wb, &work);
+ wb_queue_work(&bdi->wb, &work);
wait_for_completion(&done);
}
EXPORT_SYMBOL(writeback_inodes_sb_nr);
@@ -1594,13 +1595,14 @@ void sync_inodes_sb(struct super_block *sb)
.reason = WB_REASON_SYNC,
.for_sync = 1,
};
+ struct backing_dev_info *bdi = sb->s_bdi;

/* Nothing to do? */
- if (sb->s_bdi == &noop_backing_dev_info)
+ if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- wb_queue_work(&sb->s_bdi->wb, &work);
+ wb_queue_work(&bdi->wb, &work);
wait_for_completion(&done);

wait_sb_inodes(sb);
--
2.4.0

2015-05-22 21:15:32

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 35/51] writeback: make bdi->min/max_ratio handling cgroup writeback aware

bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
the dirty limit of each bdi. For cgroup writeback, they need to be
further distributed across wb's (bdi_writeback's) belonging to the
configured bdi.

This patch introduces wb_min_max_ratio() which distributes
bdi->min/max_ratio according to a wb's proportion in the total active
bandwidth of its bdi.

v2: Update wb_min_max_ratio() to fix a bug where both min and max were
assigned the min value and avoid calculations when possible.
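
As an illustrative example (numbers made up): with bdi->min_ratio = 10
and bdi->max_ratio = 60, a wb contributing 50 MB/s out of the bdi's
200 MB/s total active write bandwidth would get min = 10 * 50 / 200 = 2
and max = 60 * 50 / 200 = 15, while a wb owning all of the active
bandwidth keeps the full 10 and 60.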

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
mm/page-writeback.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 46 insertions(+), 4 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 99b8846..9b55f12 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -155,6 +155,46 @@ static unsigned long writeout_period_time = 0;
*/
#define VM_COMPLETIONS_PERIOD_LEN (3*HZ)

+#ifdef CONFIG_CGROUP_WRITEBACK
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+ unsigned long *minp, unsigned long *maxp)
+{
+ unsigned long this_bw = wb->avg_write_bandwidth;
+ unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+ unsigned long long min = wb->bdi->min_ratio;
+ unsigned long long max = wb->bdi->max_ratio;
+
+ /*
+ * @wb may already be clean by the time control reaches here and
+ * the total may not include its bw.
+ */
+ if (this_bw < tot_bw) {
+ if (min) {
+ min *= this_bw;
+ do_div(min, tot_bw);
+ }
+ if (max < 100) {
+ max *= this_bw;
+ do_div(max, tot_bw);
+ }
+ }
+
+ *minp = min;
+ *maxp = max;
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static void wb_min_max_ratio(struct bdi_writeback *wb,
+ unsigned long *minp, unsigned long *maxp)
+{
+ *minp = wb->bdi->min_ratio;
+ *maxp = wb->bdi->max_ratio;
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+
/*
* In a memory zone, there is a certain amount of pages we consider
* available for the page cache, which is essentially the number of
@@ -539,9 +579,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
*/
unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
{
- struct backing_dev_info *bdi = wb->bdi;
u64 wb_dirty;
long numerator, denominator;
+ unsigned long wb_min_ratio, wb_max_ratio;

/*
* Calculate this BDI's share of the dirty ratio.
@@ -552,9 +592,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
wb_dirty *= numerator;
do_div(wb_dirty, denominator);

- wb_dirty += (dirty * bdi->min_ratio) / 100;
- if (wb_dirty > (dirty * bdi->max_ratio) / 100)
- wb_dirty = dirty * bdi->max_ratio / 100;
+ wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
+
+ wb_dirty += (dirty * wb_min_ratio) / 100;
+ if (wb_dirty > (dirty * wb_max_ratio) / 100)
+ wb_dirty = dirty * wb_max_ratio / 100;

return wb_dirty;
}
--
2.4.0

2015-05-22 21:15:36

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 36/51] writeback: implement bdi_for_each_wb()

This will be used to implement bdi-wide operations which should be
distributed across all its cgroup bdi_writebacks.
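
For illustration only (not part of the patch), a minimal sketch of the
intended usage; example_walk_wbs() is a hypothetical function:

static void example_walk_wbs(struct backing_dev_info *bdi)
{
	struct bdi_writeback *wb;
	struct wb_iter iter;

	rcu_read_lock();
	/* start_blkcg_id == 0 starts the walk from the root wb */
	bdi_for_each_wb(wb, bdi, &iter, 0) {
		if (wb_has_dirty_io(wb))
			pr_debug("wb %p has dirty inodes\n", wb);
	}
	rcu_read_unlock();
}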

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
include/linux/backing-dev.h | 63 +++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 63 insertions(+)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0839e44..c797980 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -383,6 +383,61 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
return inode->i_wb;
}

+struct wb_iter {
+ int start_blkcg_id;
+ struct radix_tree_iter tree_iter;
+ void **slot;
+};
+
+static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter,
+ struct backing_dev_info *bdi)
+{
+ struct radix_tree_iter *titer = &iter->tree_iter;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ if (iter->start_blkcg_id >= 0) {
+ iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id);
+ iter->start_blkcg_id = -1;
+ } else {
+ iter->slot = radix_tree_next_slot(iter->slot, titer, 0);
+ }
+
+ if (!iter->slot)
+ iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0);
+ if (iter->slot)
+ return *iter->slot;
+ return NULL;
+}
+
+static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter,
+ struct backing_dev_info *bdi,
+ int start_blkcg_id)
+{
+ iter->start_blkcg_id = start_blkcg_id;
+
+ if (start_blkcg_id)
+ return __wb_iter_next(iter, bdi);
+ else
+ return &bdi->wb;
+}
+
+/**
+ * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order
+ * @wb_cur: cursor struct bdi_writeback pointer
+ * @bdi: bdi to walk wb's of
+ * @iter: pointer to struct wb_iter to be used as iteration buffer
+ * @start_blkcg_id: blkcg ID to start iteration from
+ *
+ * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending
+ * blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter
+ * to be used as temp storage during iteration. rcu_read_lock() must be
+ * held throughout iteration.
+ */
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
+ for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \
+ (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi))
+
#else /* CONFIG_CGROUP_WRITEBACK */

static inline bool inode_cgwb_enabled(struct inode *inode)
@@ -445,6 +500,14 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
{
}

+struct wb_iter {
+ int next_id;
+};
+
+#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
+ for ((iter)->next_id = (start_blkcg_id); \
+ ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
+
static inline int inode_congested(struct inode *inode, int cong_bits)
{
return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
--
2.4.0

2015-05-22 21:21:45

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 37/51] writeback: remove bdi_start_writeback()

bdi_start_writeback() is a thin wrapper on top of
__wb_start_writeback() and is used only by laptop_mode_timer_fn().
This patch removes bdi_start_writeback(), renames
__wb_start_writeback() to wb_start_writeback() and makes
laptop_mode_timer_fn() use it instead.

This doesn't cause any functional difference and will ease making
laptop_mode_timer_fn() cgroup writeback aware.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 68 +++++++++++++++++----------------------------
include/linux/backing-dev.h | 4 +--
mm/page-writeback.c | 4 +--
3 files changed, 29 insertions(+), 47 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 921a9e4..79f11af 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -184,33 +184,6 @@ static void wb_queue_work(struct bdi_writeback *wb,
spin_unlock_bh(&wb->work_lock);
}

-static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
-{
- struct wb_writeback_work *work;
-
- if (!wb_has_dirty_io(wb))
- return;
-
- /*
- * This is WB_SYNC_NONE writeback, so if allocation fails just
- * wakeup the thread for old dirty data writeback
- */
- work = kzalloc(sizeof(*work), GFP_ATOMIC);
- if (!work) {
- trace_writeback_nowork(wb->bdi);
- wb_wakeup(wb);
- return;
- }
-
- work->sync_mode = WB_SYNC_NONE;
- work->nr_pages = nr_pages;
- work->range_cyclic = range_cyclic;
- work->reason = reason;
-
- wb_queue_work(wb, work);
-}
-
#ifdef CONFIG_CGROUP_WRITEBACK

/**
@@ -240,22 +213,31 @@ EXPORT_SYMBOL_GPL(inode_congested);

#endif /* CONFIG_CGROUP_WRITEBACK */

-/**
- * bdi_start_writeback - start writeback
- * @bdi: the backing device to write from
- * @nr_pages: the number of pages to write
- * @reason: reason why some writeback work was initiated
- *
- * Description:
- * This does WB_SYNC_NONE opportunistic writeback. The IO is only
- * started when this function returns, we make no guarantees on
- * completion. Caller need not hold sb s_umount semaphore.
- *
- */
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- enum wb_reason reason)
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
{
- __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
+ struct wb_writeback_work *work;
+
+ if (!wb_has_dirty_io(wb))
+ return;
+
+ /*
+ * This is WB_SYNC_NONE writeback, so if allocation fails just
+ * wakeup the thread for old dirty data writeback
+ */
+ work = kzalloc(sizeof(*work), GFP_ATOMIC);
+ if (!work) {
+ trace_writeback_nowork(wb->bdi);
+ wb_wakeup(wb);
+ return;
+ }
+
+ work->sync_mode = WB_SYNC_NONE;
+ work->nr_pages = nr_pages;
+ work->range_cyclic = range_cyclic;
+ work->reason = reason;
+
+ wb_queue_work(wb, work);
}

/**
@@ -1219,7 +1201,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)

rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
- __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+ wb_start_writeback(&bdi->wb, nr_pages, false, reason);
rcu_read_unlock();
}

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c797980..0ff40c2 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -25,8 +25,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
void bdi_unregister(struct backing_dev_info *bdi);
int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
-void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- enum wb_reason reason);
+void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 9b55f12..6301af2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1729,8 +1729,8 @@ void laptop_mode_timer_fn(unsigned long data)
* threshold
*/
if (bdi_has_dirty_io(&q->backing_dev_info))
- bdi_start_writeback(&q->backing_dev_info, nr_pages,
- WB_REASON_LAPTOP_TIMER);
+ wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
+ WB_REASON_LAPTOP_TIMER);
}

/*
--
2.4.0

2015-05-22 21:21:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 38/51] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's

For cgroup writeback support, all bdi-wide operations should be
distributed to all its wb's (bdi_writeback's).

This patch updates laptop_mode_timer_fn() so that it invokes
wb_start_writeback() on all wb's rather than just the root one. As
the intent is writing out all dirty data, there's no reason to split
the number of pages to write.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
mm/page-writeback.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6301af2..682e3a6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1723,14 +1723,20 @@ void laptop_mode_timer_fn(unsigned long data)
struct request_queue *q = (struct request_queue *)data;
int nr_pages = global_page_state(NR_FILE_DIRTY) +
global_page_state(NR_UNSTABLE_NFS);
+ struct bdi_writeback *wb;
+ struct wb_iter iter;

/*
* We want to write everything out, not just down to the dirty
* threshold
*/
- if (bdi_has_dirty_io(&q->backing_dev_info))
- wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
- WB_REASON_LAPTOP_TIMER);
+ if (!bdi_has_dirty_io(&q->backing_dev_info))
+ return;
+
+ bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
+ if (wb_has_dirty_io(wb))
+ wb_start_writeback(wb, nr_pages, true,
+ WB_REASON_LAPTOP_TIMER);
}

/*
--
2.4.0

2015-05-22 21:15:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 39/51] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info

writeback_in_progress() currently takes @bdi and returns whether
writeback is in progress on its root wb (bdi_writeback). In
preparation for cgroup writeback support, make it take wb instead.
While at it, make it an inline function.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 15 +--------------
include/linux/backing-dev.h | 12 +++++++++++-
mm/page-writeback.c | 4 ++--
3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 79f11af..45baf6c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -65,19 +65,6 @@ struct wb_writeback_work {
*/
unsigned int dirtytime_expire_interval = 12 * 60 * 60;

-/**
- * writeback_in_progress - determine whether there is writeback in progress
- * @bdi: the device's backing_dev_info structure.
- *
- * Determine whether there is writeback waiting to be handled against a
- * backing device.
- */
-int writeback_in_progress(struct backing_dev_info *bdi)
-{
- return test_bit(WB_writeback_running, &bdi->wb.state);
-}
-EXPORT_SYMBOL(writeback_in_progress);
-
static inline struct inode *wb_inode(struct list_head *head)
{
return list_entry(head, struct inode, i_wb_list);
@@ -1532,7 +1519,7 @@ int try_to_writeback_inodes_sb_nr(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
{
- if (writeback_in_progress(sb->s_bdi))
+ if (writeback_in_progress(&sb->s_bdi->wb))
return 1;

if (!down_read_trylock(&sb->s_umount))
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ff40c2..f04956c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -156,7 +156,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);

extern struct backing_dev_info noop_backing_dev_info;

-int writeback_in_progress(struct backing_dev_info *bdi);
+/**
+ * writeback_in_progress - determine whether there is writeback in progress
+ * @wb: bdi_writeback of interest
+ *
+ * Determine whether there is writeback waiting to be handled against a
+ * bdi_writeback.
+ */
+static inline bool writeback_in_progress(struct bdi_writeback *wb)
+{
+ return test_bit(WB_writeback_running, &wb->state);
+}

static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
{
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 682e3a6..e3b5c1d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1455,7 +1455,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break;
}

- if (unlikely(!writeback_in_progress(bdi)))
+ if (unlikely(!writeback_in_progress(wb)))
bdi_start_background_writeback(bdi);

if (!strictlimit)
@@ -1573,7 +1573,7 @@ static void balance_dirty_pages(struct address_space *mapping,
if (!dirty_exceeded && wb->dirty_exceeded)
wb->dirty_exceeded = 0;

- if (writeback_in_progress(bdi))
+ if (writeback_in_progress(wb))
return;

/*
--
2.4.0

2015-05-22 21:15:46

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 40/51] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info

bdi_start_background_writeback() currently takes @bdi and kicks the
root wb (bdi_writeback). In preparation for cgroup writeback support,
make it take wb instead.

This patch doesn't make any functional difference.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 12 ++++++------
include/linux/backing-dev.h | 2 +-
mm/page-writeback.c | 4 ++--
3 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 45baf6c..92aaf64 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -228,23 +228,23 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
}

/**
- * bdi_start_background_writeback - start background writeback
- * @bdi: the backing device to write from
+ * wb_start_background_writeback - start background writeback
+ * @wb: bdi_writback to write from
*
* Description:
* This makes sure WB_SYNC_NONE background writeback happens. When
- * this function returns, it is only guaranteed that for given BDI
+ * this function returns, it is only guaranteed that for given wb
* some IO is happening if we are over background dirty threshold.
* Caller need not hold sb s_umount semaphore.
*/
-void bdi_start_background_writeback(struct backing_dev_info *bdi)
+void wb_start_background_writeback(struct bdi_writeback *wb)
{
/*
* We just wake up the flusher thread. It will perform background
* writeback as soon as there is no other work to do.
*/
- trace_writeback_wake_background(bdi);
- wb_wakeup(&bdi->wb);
+ trace_writeback_wake_background(wb->bdi);
+ wb_wakeup(wb);
}

/*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f04956c..9cc11e5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -27,7 +27,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
bool range_cyclic, enum wb_reason reason);
-void bdi_start_background_writeback(struct backing_dev_info *bdi);
+void wb_start_background_writeback(struct bdi_writeback *wb);
void wb_workfn(struct work_struct *work);
void wb_wakeup_delayed(struct bdi_writeback *wb);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e3b5c1d..70cf98d 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1456,7 +1456,7 @@ static void balance_dirty_pages(struct address_space *mapping,
}

if (unlikely(!writeback_in_progress(wb)))
- bdi_start_background_writeback(bdi);
+ wb_start_background_writeback(wb);

if (!strictlimit)
wb_dirty_limits(wb, dirty_thresh, background_thresh,
@@ -1588,7 +1588,7 @@ static void balance_dirty_pages(struct address_space *mapping,
return;

if (nr_reclaimable > background_thresh)
- bdi_start_background_writeback(bdi);
+ wb_start_background_writeback(wb);
}

static DEFINE_PER_CPU(int, bdp_ratelimits);
--
2.4.0

2015-05-22 21:15:51

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

wakeup_flusher_threads() currently only starts writeback on the root
wb (bdi_writeback). For cgroup writeback support, update the function
to wake up all wbs and distribute the number of pages to write
according to the proportion of each wb's write bandwidth, which is
implemented in wb_split_bdi_pages().
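
For example (illustrative numbers), a wakeup_flusher_threads() call for
1200 pages against a bdi whose dirty wbs have avg_write_bandwidth of
60, 30 and 30 (tot_write_bandwidth 120) would queue roughly 600, 300
and 300 pages respectively; a wb on a clean bdi (tot_bw == 0) or one
contributing at least the whole total simply gets the full nr_pages,
and LONG_MAX is passed through unchanged.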

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 46 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 92aaf64..508e10c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -198,6 +198,41 @@ int inode_congested(struct inode *inode, int cong_bits)
}
EXPORT_SYMBOL_GPL(inode_congested);

+/**
+ * wb_split_bdi_pages - split nr_pages to write according to bandwidth
+ * @wb: target bdi_writeback to split @nr_pages to
+ * @nr_pages: number of pages to write for the whole bdi
+ *
+ * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
+ * relation to the total write bandwidth of all wb's w/ dirty inodes on
+ * @wb->bdi.
+ */
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+ unsigned long this_bw = wb->avg_write_bandwidth;
+ unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
+
+ if (nr_pages == LONG_MAX)
+ return LONG_MAX;
+
+ /*
+ * This may be called on clean wb's and proportional distribution
+ * may not make sense, just use the original @nr_pages in those
+ * cases. In general, we wanna err on the side of writing more.
+ */
+ if (!tot_bw || this_bw >= tot_bw)
+ return nr_pages;
+ else
+ return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
+{
+ return nr_pages;
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -1187,8 +1222,17 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
nr_pages = get_nr_dirty_pages();

rcu_read_lock();
- list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
- wb_start_writeback(&bdi->wb, nr_pages, false, reason);
+ list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
+ struct bdi_writeback *wb;
+ struct wb_iter iter;
+
+ if (!bdi_has_dirty_io(bdi))
+ continue;
+
+ bdi_for_each_wb(wb, bdi, &iter, 0)
+ wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
+ false, reason);
+ }
rcu_read_unlock();
}

--
2.4.0

2015-05-22 21:20:43

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 42/51] writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's

wakeup_dirtytime_writeback() currently only starts writeback on the
root wb (bdi_writeback). For cgroup writeback support, update the
function to check all wbs.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Theodore Ts'o <[email protected]>
---
fs/fs-writeback.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 508e10c..8ae212e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1260,9 +1260,12 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)

rcu_read_lock();
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
- if (list_empty(&bdi->wb.b_dirty_time))
- continue;
- wb_wakeup(&bdi->wb);
+ struct bdi_writeback *wb;
+ struct wb_iter iter;
+
+ bdi_for_each_wb(wb, bdi, &iter, 0)
+ if (!list_empty(&wb->b_dirty_time))
+ wb_wakeup(wb);
}
rcu_read_unlock();
schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
--
2.4.0

2015-05-22 21:20:08

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 43/51] writeback: add wb_writeback_work->auto_free

Currently, a wb_writeback_work is freed automatically on completion if
it doesn't have ->done set. Add wb_writeback_work->auto_free to make
the free-on-completion behavior explicit. This will help cgroup
writeback support, where waiting for completion and freeing
automatically don't necessarily go together.
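
For illustration only (not part of the patch, assumes fs/fs-writeback.c
context), the two usage patterns the flag distinguishes;
example_queue_async() is hypothetical:

/* fire-and-forget: heap-allocated, freed by wb_do_writeback() */
static void example_queue_async(struct bdi_writeback *wb, long nr_pages)
{
	struct wb_writeback_work *work;

	work = kzalloc(sizeof(*work), GFP_ATOMIC);
	if (!work)
		return;

	work->sync_mode = WB_SYNC_NONE;
	work->nr_pages  = nr_pages;
	work->reason    = WB_REASON_BACKGROUND;
	work->auto_free = 1;		/* kfree()d by wb_do_writeback() */
	wb_queue_work(wb, work);
}

/* a waited-for work item instead lives on the caller's stack, leaves
 * auto_free clear and blocks on ->done before going out of scope */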

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 8ae212e..22f1def 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -47,6 +47,7 @@ struct wb_writeback_work {
unsigned int range_cyclic:1;
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
+ unsigned int auto_free:1; /* free on completion */
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -258,6 +259,7 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
work->nr_pages = nr_pages;
work->range_cyclic = range_cyclic;
work->reason = reason;
+ work->auto_free = 1;

wb_queue_work(wb, work);
}
@@ -1141,19 +1143,16 @@ static long wb_do_writeback(struct bdi_writeback *wb)

set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
+ struct completion *done = work->done;

trace_writeback_exec(wb->bdi, work);

wrote += wb_writeback(wb, work);

- /*
- * Notify the caller of completion if this is a synchronous
- * work item, otherwise just free it.
- */
- if (work->done)
- complete(work->done);
- else
+ if (work->auto_free)
kfree(work);
+ if (done)
+ complete(done);
}

/*
--
2.4.0

2015-05-22 21:19:49

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

The completion of a wb_writeback_work can be waited upon by setting
its ->done to a struct completion and waiting on it; however, for
cgroup writeback support, it's necessary to issue multiple work items
to multiple bdi_writebacks and wait for the completion of all of them.

This patch implements wb_completion which can wait for multiple work
items and replaces the struct completion with it. It can be defined
using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
waited for by wb_wait_for_completion().

Nobody currently issues multiple work items and this patch doesn't
introduce any behavior changes.
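
For illustration only (not part of the patch, assumes fs/fs-writeback.c
context plus the bdi_for_each_wb() iterator added earlier in the
series), a sketch of issuing work to several wbs and waiting for all of
them; example_writeback_all_wbs() is hypothetical:

static void example_writeback_all_wbs(struct backing_dev_info *bdi)
{
	DEFINE_WB_COMPLETION_ONSTACK(done);
	struct bdi_writeback *wb;
	struct wb_iter iter;

	rcu_read_lock();
	bdi_for_each_wb(wb, bdi, &iter, 0) {
		struct wb_writeback_work *work;

		work = kzalloc(sizeof(*work), GFP_ATOMIC);
		if (!work)
			continue;	/* real callers need a fallback, see the next patch */

		work->sync_mode = WB_SYNC_NONE;
		work->nr_pages  = LONG_MAX;
		work->reason    = WB_REASON_SYNC;
		work->auto_free = 1;
		work->done      = &done;	/* wb_queue_work() bumps done.cnt */
		wb_queue_work(wb, work);
		/* a real caller would also need to pin each wb; elided here */
	}
	rcu_read_unlock();

	/* returns once every queued work has dropped its done.cnt reference */
	wb_wait_for_completion(bdi, &done);
}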

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++---------
include/linux/backing-dev-defs.h | 2 ++
mm/backing-dev.c | 1 +
3 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 22f1def..d7d4a1b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,6 +34,10 @@
*/
#define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10))

+struct wb_completion {
+ atomic_t cnt;
+};
+
/*
* Passed into wb_writeback(), essentially a subset of writeback_control
*/
@@ -51,10 +55,23 @@ struct wb_writeback_work {
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
- struct completion *done; /* set if the caller waits */
+ struct wb_completion *done; /* set if the caller waits */
};

/*
+ * If one wants to wait for one or more wb_writeback_works, each work's
+ * ->done should be set to a wb_completion defined using the following
+ * macro. Once all work items are issued with wb_queue_work(), the caller
+ * can wait for the completion of all using wb_wait_for_completion(). Work
+ * items which are waited upon aren't freed automatically on completion.
+ */
+#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \
+ struct wb_completion cmpl = { \
+ .cnt = ATOMIC_INIT(1), \
+ }
+
+
+/*
* If an inode is constantly having its pages dirtied, but then the
* updates stop dirtytime_expire_interval seconds in the past, it's
* possible for the worst case time between when an inode has its
@@ -161,17 +178,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
trace_writeback_queue(wb->bdi, work);

spin_lock_bh(&wb->work_lock);
- if (!test_bit(WB_registered, &wb->state)) {
- if (work->done)
- complete(work->done);
+ if (!test_bit(WB_registered, &wb->state))
goto out_unlock;
- }
+ if (work->done)
+ atomic_inc(&work->done->cnt);
list_add_tail(&work->list, &wb->work_list);
mod_delayed_work(bdi_wq, &wb->dwork, 0);
out_unlock:
spin_unlock_bh(&wb->work_lock);
}

+/**
+ * wb_wait_for_completion - wait for completion of bdi_writeback_works
+ * @bdi: bdi work items were issued to
+ * @done: target wb_completion
+ *
+ * Wait for one or more work items issued to @bdi with their ->done field
+ * set to @done, which should have been defined with
+ * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such
+ * work items are completed. Work items which are waited upon aren't freed
+ * automatically on completion.
+ */
+static void wb_wait_for_completion(struct backing_dev_info *bdi,
+ struct wb_completion *done)
+{
+ atomic_dec(&done->cnt); /* put down the initial count */
+ wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
+}
+
#ifdef CONFIG_CGROUP_WRITEBACK

/**
@@ -1143,7 +1177,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)

set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
- struct completion *done = work->done;
+ struct wb_completion *done = work->done;

trace_writeback_exec(wb->bdi, work);

@@ -1151,8 +1185,8 @@ static long wb_do_writeback(struct bdi_writeback *wb)

if (work->auto_free)
kfree(work);
- if (done)
- complete(done);
+ if (done && atomic_dec_and_test(&done->cnt))
+ wake_up_all(&wb->bdi->wb_waitq);
}

/*
@@ -1518,7 +1552,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
unsigned long nr,
enum wb_reason reason)
{
- DECLARE_COMPLETION_ONSTACK(done);
+ DEFINE_WB_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
.sb = sb,
.sync_mode = WB_SYNC_NONE,
@@ -1533,7 +1567,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
wb_queue_work(&bdi->wb, &work);
- wait_for_completion(&done);
+ wb_wait_for_completion(bdi, &done);
}
EXPORT_SYMBOL(writeback_inodes_sb_nr);

@@ -1600,7 +1634,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
*/
void sync_inodes_sb(struct super_block *sb)
{
- DECLARE_COMPLETION_ONSTACK(done);
+ DEFINE_WB_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
.sb = sb,
.sync_mode = WB_SYNC_ALL,
@@ -1618,7 +1652,7 @@ void sync_inodes_sb(struct super_block *sb)
WARN_ON(!rwsem_is_locked(&sb->s_umount));

wb_queue_work(&bdi->wb, &work);
- wait_for_completion(&done);
+ wb_wait_for_completion(bdi, &done);

wait_sb_inodes(sb);
}
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 8c857d7..97a92fa 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -155,6 +155,8 @@ struct backing_dev_info {
struct rb_root cgwb_congested_tree; /* their congested states */
atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
#endif
+ wait_queue_head_t wb_waitq;
+
struct device *dev;

struct timer_list laptop_mode_wb_timer;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d2f16fc9..ad5608d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -768,6 +768,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
INIT_LIST_HEAD(&bdi->bdi_list);
+ init_waitqueue_head(&bdi->wb_waitq);

err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
if (err)
--
2.4.0

2015-05-22 21:19:04

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 45/51] writeback: implement wb_wait_for_single_work()

For cgroup writeback, multiple wb_writeback_work items may need to be
issued to accomplish a single task. The previous patch updated the
waiting mechanism such that wb_wait_for_completion() can wait for
multiple work items.

Issuing multiple work items involves memory allocation, which may fail.
As most writeback operations can't fail or block on memory allocation,
in such cases we'll fall back to issuing a single on-stack work item to
one wb at a time, which then has to be waited upon before moving on to
the next.

This patch implements wb_wait_for_single_work(), which waits for a
single work item independently of wb_completion waiting, so that such a
fallback mechanism can be used without getting tangled with the usual
issuing / completion operation.
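
A rough sketch of the caller-side pattern this enables (it mirrors the
clone-and-queue helper added two patches later; the surrounding
variables are assumed to already exist):

  work = kmalloc(sizeof(*work), GFP_ATOMIC);
  if (work) {
          /* clone issued asynchronously, freed on completion */
          *work = *base_work;
          work->auto_free = 1;
          work->single_wait = 0;
  } else {
          /* allocation failed: issue @base_work itself and ... */
          work = base_work;
          work->auto_free = 0;
          work->single_wait = 1;
  }
  work->single_done = 0;
  wb_queue_work(wb, work);
  if (work == base_work)
          /* ... wait for it before reusing it for the next wb */
          wb_wait_for_single_work(bdi, base_work);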

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d7d4a1b..093b959 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -52,6 +52,8 @@ struct wb_writeback_work {
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
+ unsigned int single_wait:1;
+ unsigned int single_done:1;
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -178,8 +180,11 @@ static void wb_queue_work(struct bdi_writeback *wb,
trace_writeback_queue(wb->bdi, work);

spin_lock_bh(&wb->work_lock);
- if (!test_bit(WB_registered, &wb->state))
+ if (!test_bit(WB_registered, &wb->state)) {
+ if (work->single_wait)
+ work->single_done = 1;
goto out_unlock;
+ }
if (work->done)
atomic_inc(&work->done->cnt);
list_add_tail(&work->list, &wb->work_list);
@@ -234,6 +239,32 @@ int inode_congested(struct inode *inode, int cong_bits)
EXPORT_SYMBOL_GPL(inode_congested);

/**
+ * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
+ * @bdi: bdi the work item was issued to
+ * @work: work item to wait for
+ *
+ * Wait for the completion of @work which was issued to one of @bdi's
+ * bdi_writeback's. The caller must have set @work->single_wait before
+ * issuing it. This wait operates independently of
+ * wb_wait_for_completion() and also disables automatic freeing of @work.
+ */
+static void wb_wait_for_single_work(struct backing_dev_info *bdi,
+ struct wb_writeback_work *work)
+{
+ if (WARN_ON_ONCE(!work->single_wait))
+ return;
+
+ wait_event(bdi->wb_waitq, work->single_done);
+
+ /*
+ * Paired with smp_wmb() in wb_do_writeback() and ensures that all
+ * modifications to @work prior to the assertion of ->single_done are
+ * visible to the caller once this function returns.
+ */
+ smp_rmb();
+}
+
+/**
* wb_split_bdi_pages - split nr_pages to write according to bandwidth
* @wb: target bdi_writeback to split @nr_pages to
* @nr_pages: number of pages to write for the whole bdi
@@ -1178,14 +1209,26 @@ static long wb_do_writeback(struct bdi_writeback *wb)
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
struct wb_completion *done = work->done;
+ bool need_wake_up = false;

trace_writeback_exec(wb->bdi, work);

wrote += wb_writeback(wb, work);

- if (work->auto_free)
+ if (work->single_wait) {
+ WARN_ON_ONCE(work->auto_free);
+ /* paired w/ rmb in wb_wait_for_single_work() */
+ smp_wmb();
+ work->single_done = 1;
+ need_wake_up = true;
+ } else if (work->auto_free) {
kfree(work);
+ }
+
if (done && atomic_dec_and_test(&done->cnt))
+ need_wake_up = true;
+
+ if (need_wake_up)
wake_up_all(&wb->bdi->wb_waitq);
}

--
2.4.0

2015-05-22 21:18:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 46/51] writeback: restructure try_writeback_inodes_sb[_nr]()

try_to_writeback_inodes_sb_nr() wraps writeback_inodes_sb_nr() so that
it handles s_umount locking and skips if writeback is already in
progress. The in-progress test is performed on the root wb
(bdi_writeback), which isn't sufficient for cgroup writeback support.
The test must be done per-wb.

To prepare for the change, this patch factors out
__writeback_inodes_sb_nr() from writeback_inodes_sb_nr(), adds
@skip_if_busy, and moves the in-progress test right before queueing the
wb_writeback_work. try_to_writeback_inodes_sb_nr() now just grabs
s_umount and invokes __writeback_inodes_sb_nr() with @skip_if_busy set.
This way, the later addition of multiple wb handling can skip only the
wb's which already have writeback in progress.

This swaps the order of the in-progress test and the s_umount test,
which can flip the return value when writeback is in progress while
s_umount is held by someone else, but that shouldn't cause any
meaningful difference. It's a fringe condition and the return value is
an unsynchronized hint anyway.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 52 ++++++++++++++++++++++++++---------------------
include/linux/writeback.h | 6 +++---
2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 093b959..0039c58 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1581,19 +1581,8 @@ static void wait_sb_inodes(struct super_block *sb)
iput(old_inode);
}

-/**
- * writeback_inodes_sb_nr - writeback dirty inodes from given super_block
- * @sb: the superblock
- * @nr: the number of pages to write
- * @reason: reason why some writeback work initiated
- *
- * Start writeback on some inodes on this super_block. No guarantees are made
- * on how many (if any) will be written, and this function does not wait
- * for IO completion of submitted IO.
- */
-void writeback_inodes_sb_nr(struct super_block *sb,
- unsigned long nr,
- enum wb_reason reason)
+static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+ enum wb_reason reason, bool skip_if_busy)
{
DEFINE_WB_COMPLETION_ONSTACK(done);
struct wb_writeback_work work = {
@@ -1609,9 +1598,30 @@ void writeback_inodes_sb_nr(struct super_block *sb,
if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
+
+ if (skip_if_busy && writeback_in_progress(&bdi->wb))
+ return;
+
wb_queue_work(&bdi->wb, &work);
wb_wait_for_completion(bdi, &done);
}
+
+/**
+ * writeback_inodes_sb_nr - writeback dirty inodes from given super_block
+ * @sb: the superblock
+ * @nr: the number of pages to write
+ * @reason: reason why some writeback work initiated
+ *
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO.
+ */
+void writeback_inodes_sb_nr(struct super_block *sb,
+ unsigned long nr,
+ enum wb_reason reason)
+{
+ __writeback_inodes_sb_nr(sb, nr, reason, false);
+}
EXPORT_SYMBOL(writeback_inodes_sb_nr);

/**
@@ -1638,19 +1648,15 @@ EXPORT_SYMBOL(writeback_inodes_sb);
* Invoke writeback_inodes_sb_nr if no writeback is currently underway.
* Returns 1 if writeback was started, 0 if not.
*/
-int try_to_writeback_inodes_sb_nr(struct super_block *sb,
- unsigned long nr,
- enum wb_reason reason)
+bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
+ enum wb_reason reason)
{
- if (writeback_in_progress(&sb->s_bdi->wb))
- return 1;
-
if (!down_read_trylock(&sb->s_umount))
- return 0;
+ return false;

- writeback_inodes_sb_nr(sb, nr, reason);
+ __writeback_inodes_sb_nr(sb, nr, reason, true);
up_read(&sb->s_umount);
- return 1;
+ return true;
}
EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);

@@ -1662,7 +1668,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr);
* Implement by try_to_writeback_inodes_sb_nr()
* Returns 1 if writeback was started, 0 if not.
*/
-int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
+bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason)
{
return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason);
}
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a6b9db7..23af355 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -93,9 +93,9 @@ struct bdi_writeback;
void writeback_inodes_sb(struct super_block *, enum wb_reason reason);
void writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
enum wb_reason reason);
-int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
-int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
- enum wb_reason reason);
+bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason);
+bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr,
+ enum wb_reason reason);
void sync_inodes_sb(struct super_block *);
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
void inode_wait_for_writeback(struct inode *inode);
--
2.4.0

2015-05-22 21:15:55

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 47/51] writeback: make writeback initiation functions handle multiple bdi_writeback's

[try_]writeback_inodes_sb[_nr]() and sync_inodes_sb() currently only
handle dirty inodes on the root wb (bdi_writeback) of the target bdi.
This patch implements bdi_split_work_to_wbs() and uses it to make these
functions handle multiple wb's.

bdi_split_work_to_wbs() takes a base wb_writeback_work, creates clones
of it, and issues them to the wb's of the target bdi. The base work's
nr_pages is distributed using wb_split_bdi_pages() - i.e. according to
each wb's proportion of the bdi's total write bandwidth.

Cloning a wb_writeback_work involves memory allocation, which may fail.
In such cases, bdi_split_work_to_wbs() issues the base work directly and
waits for its completion before proceeding to the next wb to guarantee
forward progress and correctness under memory pressure.
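
As a rough illustration of the nr_pages distribution (a condensed
sketch of what wb_split_bdi_pages() does; the function name and the
bandwidth numbers below are made up):

  static long split_sketch(long nr_pages, u64 this_bw, u64 tot_bw)
  {
          if (nr_pages == LONG_MAX)       /* "write everything" isn't split */
                  return LONG_MAX;
          if (!tot_bw || this_bw >= tot_bw)
                  return nr_pages;
          return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
  }

  /* e.g. 1200 pages split across wb's with write bandwidths 100, 200
   * and 300 yields shares of 200, 400 and 600 pages respectively */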

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 96 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 91 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0039c58..59d76f6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -292,6 +292,80 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
}

+/**
+ * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb
+ * @wb: target bdi_writeback
+ * @base_work: source wb_writeback_work
+ *
+ * Try to make a clone of @base_work and issue it to @wb. If cloning
+ * succeeds, %true is returned; otherwise, @base_work is issued directly
+ * and %false is returned. In the latter case, the caller is required to
+ * wait for @base_work's completion using wb_wait_for_single_work().
+ *
+ * A clone is auto-freed on completion. @base_work never is.
+ */
+static bool wb_clone_and_queue_work(struct bdi_writeback *wb,
+ struct wb_writeback_work *base_work)
+{
+ struct wb_writeback_work *work;
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (work) {
+ *work = *base_work;
+ work->auto_free = 1;
+ work->single_wait = 0;
+ } else {
+ work = base_work;
+ work->auto_free = 0;
+ work->single_wait = 1;
+ }
+ work->single_done = 0;
+ wb_queue_work(wb, work);
+ return work != base_work;
+}
+
+/**
+ * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi
+ * @bdi: target backing_dev_info
+ * @base_work: wb_writeback_work to issue
+ * @skip_if_busy: skip wb's which already have writeback in progress
+ *
+ * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which
+ * have dirty inodes. If @base_work->nr_pages isn't %LONG_MAX, it's
+ * distributed to the busy wbs according to each wb's proportion in the
+ * total active write bandwidth of @bdi.
+ */
+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+ struct wb_writeback_work *base_work,
+ bool skip_if_busy)
+{
+ long nr_pages = base_work->nr_pages;
+ int next_blkcg_id = 0;
+ struct bdi_writeback *wb;
+ struct wb_iter iter;
+
+ might_sleep();
+
+ if (!bdi_has_dirty_io(bdi))
+ return;
+restart:
+ rcu_read_lock();
+ bdi_for_each_wb(wb, bdi, &iter, next_blkcg_id) {
+ if (!wb_has_dirty_io(wb) ||
+ (skip_if_busy && writeback_in_progress(wb)))
+ continue;
+
+ base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages);
+ if (!wb_clone_and_queue_work(wb, base_work)) {
+ next_blkcg_id = wb->blkcg_css->id + 1;
+ rcu_read_unlock();
+ wb_wait_for_single_work(bdi, base_work);
+ goto restart;
+ }
+ }
+ rcu_read_unlock();
+}
+
#else /* CONFIG_CGROUP_WRITEBACK */

static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
@@ -299,6 +373,21 @@ static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
return nr_pages;
}

+static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
+ struct wb_writeback_work *base_work,
+ bool skip_if_busy)
+{
+ might_sleep();
+
+ if (bdi_has_dirty_io(bdi) &&
+ (!skip_if_busy || !writeback_in_progress(&bdi->wb))) {
+ base_work->auto_free = 0;
+ base_work->single_wait = 0;
+ base_work->single_done = 0;
+ wb_queue_work(&bdi->wb, base_work);
+ }
+}
+
#endif /* CONFIG_CGROUP_WRITEBACK */

void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
@@ -1599,10 +1688,7 @@ static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr,
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- if (skip_if_busy && writeback_in_progress(&bdi->wb))
- return;
-
- wb_queue_work(&bdi->wb, &work);
+ bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy);
wb_wait_for_completion(bdi, &done);
}

@@ -1700,7 +1786,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- wb_queue_work(&bdi->wb, &work);
+ bdi_split_work_to_wbs(bdi, &work, false);
wb_wait_for_completion(bdi, &done);

wait_sb_inodes(sb);
--
2.4.0

2015-05-22 21:16:00

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 48/51] writeback: dirty inodes against their matching cgroup bdi_writeback's

__mark_inode_dirty() always dirtied the inode against the root wb
(bdi_writeback). The previous patches added all the infrastructure
necessary to attribute an inode against the wb of the dirtying cgroup.

This patch updates __mark_inode_dirty() so that it uses the wb
associated with the inode instead of unconditionally using the root
one.

Currently, none of the filesystems sets FS_CGROUP_WRITEBACK, so all
pages will keep being dirtied against the root wb.

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 23 +++++++++++------------
1 file changed, 11 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 59d76f6..881ea5d 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1504,7 +1504,6 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
void __mark_inode_dirty(struct inode *inode, int flags)
{
struct super_block *sb = inode->i_sb;
- struct backing_dev_info *bdi = NULL;
int dirtytime;

trace_writeback_mark_inode_dirty(inode, flags);
@@ -1574,30 +1573,30 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* reposition it (that would break b_dirty time-ordering).
*/
if (!was_dirty) {
+ struct bdi_writeback *wb = inode_to_wb(inode);
struct list_head *dirty_list;
bool wakeup_bdi = false;
- bdi = inode_to_bdi(inode);

spin_unlock(&inode->i_lock);
- spin_lock(&bdi->wb.list_lock);
+ spin_lock(&wb->list_lock);

- WARN(bdi_cap_writeback_dirty(bdi) &&
- !test_bit(WB_registered, &bdi->wb.state),
- "bdi-%s not registered\n", bdi->name);
+ WARN(bdi_cap_writeback_dirty(wb->bdi) &&
+ !test_bit(WB_registered, &wb->state),
+ "bdi-%s not registered\n", wb->bdi->name);

inode->dirtied_when = jiffies;
if (dirtytime)
inode->dirtied_time_when = jiffies;

if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
- dirty_list = &bdi->wb.b_dirty;
+ dirty_list = &wb->b_dirty;
else
- dirty_list = &bdi->wb.b_dirty_time;
+ dirty_list = &wb->b_dirty_time;

- wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
+ wakeup_bdi = inode_wb_list_move_locked(inode, wb,
dirty_list);

- spin_unlock(&bdi->wb.list_lock);
+ spin_unlock(&wb->list_lock);
trace_writeback_dirty_inode_enqueue(inode);

/*
@@ -1606,8 +1605,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
* to make sure background write-back happens
* later.
*/
- if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
- wb_wakeup_delayed(&bdi->wb);
+ if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
+ wb_wakeup_delayed(wb);
return;
}
}
--
2.4.0

2015-05-22 21:16:07

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 49/51] buffer, writeback: make __block_write_full_page() honor cgroup writeback

[__]block_write_full_page() is used to implement ->writepage in
various filesystems. All writeback logic is now updated to handle
cgroup writeback and the block cgroup to issue IOs for is encoded in
writeback_control and can be retrieved from the inode; however,
[__]block_write_full_page() currently ignores the blkcg indicated by
the inode and issues all bio's without explicit blkcg association.

This patch adds submit_bh_blkcg() which associates the bio with the
specified blkio cgroup before issuing and uses it in
__block_write_full_page() so that the issued bio's are associated with
inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.
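
A condensed sketch of the resulting submission step inside
__block_write_full_page() (the surrounding buffer loop and error
handling are omitted):

  struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode);
  int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);

  /* every async-write buffer of the page now carries the inode's blkcg */
  submit_bh_blkcg(write_op, bh, 0, blkcg_css);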

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>
---
fs/buffer.c | 26 ++++++++++++++++++++------
include/linux/backing-dev.h | 12 ++++++++++++
2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index c8aecf5..18cd378 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -30,6 +30,7 @@
#include <linux/quotaops.h>
#include <linux/highmem.h>
#include <linux/export.h>
+#include <linux/backing-dev.h>
#include <linux/writeback.h>
#include <linux/hash.h>
#include <linux/suspend.h>
@@ -44,6 +45,9 @@
#include <trace/events/block.h>

static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+ unsigned long bio_flags,
+ struct cgroup_subsys_state *blkcg_css);

#define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers)

@@ -1704,8 +1708,8 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *bh, *head;
unsigned int blocksize, bbits;
int nr_underway = 0;
- int write_op = (wbc->sync_mode == WB_SYNC_ALL ?
- WRITE_SYNC : WRITE);
+ int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
+ struct cgroup_subsys_state *blkcg_css = inode_to_wb_blkcg_css(inode);

head = create_page_buffers(page, inode,
(1 << BH_Dirty)|(1 << BH_Uptodate));
@@ -1794,7 +1798,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
do {
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
- submit_bh(write_op, bh);
+ submit_bh_blkcg(write_op, bh, 0, blkcg_css);
nr_underway++;
}
bh = next;
@@ -1848,7 +1852,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page,
struct buffer_head *next = bh->b_this_page;
if (buffer_async_write(bh)) {
clear_buffer_dirty(bh);
- submit_bh(write_op, bh);
+ submit_bh_blkcg(write_op, bh, 0, blkcg_css);
nr_underway++;
}
bh = next;
@@ -3013,7 +3017,9 @@ void guard_bio_eod(int rw, struct bio *bio)
}
}

-int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+static int submit_bh_blkcg(int rw, struct buffer_head *bh,
+ unsigned long bio_flags,
+ struct cgroup_subsys_state *blkcg_css)
{
struct bio *bio;
int ret = 0;
@@ -3036,6 +3042,9 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
*/
bio = bio_alloc(GFP_NOIO, 1);

+ if (blkcg_css)
+ bio_associate_blkcg(bio, blkcg_css);
+
bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9);
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
@@ -3060,11 +3069,16 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
submit_bio(rw, bio);
return ret;
}
+
+int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags)
+{
+ return submit_bh_blkcg(rw, bh, bio_flags, NULL);
+}
EXPORT_SYMBOL_GPL(_submit_bh);

int submit_bh(int rw, struct buffer_head *bh)
{
- return _submit_bh(rw, bh, 0);
+ return submit_bh_blkcg(rw, bh, 0, NULL);
}
EXPORT_SYMBOL(submit_bh);

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 9cc11e5..e9d7373 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -393,6 +393,12 @@ static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
return inode->i_wb;
}

+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+ return inode_to_wb(inode)->blkcg_css;
+}
+
struct wb_iter {
int start_blkcg_id;
struct radix_tree_iter tree_iter;
@@ -510,6 +516,12 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
{
}

+static inline struct cgroup_subsys_state *
+inode_to_wb_blkcg_css(struct inode *inode)
+{
+ return blkcg_root_css;
+}
+
struct wb_iter {
int next_id;
};
--
2.4.0

2015-05-22 21:17:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 50/51] mpage: make __mpage_writepage() honor cgroup writeback

__mpage_writepage() is used to implement mpage_writepages() which in
turn is used for ->writepages() of various filesystems. All writeback
logic is now updated to handle cgroup writeback and the block cgroup
to issue IOs for is encoded in writeback_control and can be retrieved
from the inode; however, __mpage_writepage() currently ignores the
blkcg indicated by the inode and issues all bio's without explicit
blkcg association.

This patch updates __mpage_writepage() so that the issued bio's are
associated with inode_to_wb_blkcg_css(inode).

v2: Updated for per-inode wb association.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Alexander Viro <[email protected]>
---
fs/mpage.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/mpage.c b/fs/mpage.c
index 3e79220..a3ccb0b 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -605,6 +605,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
if (bio == NULL)
goto confused;
+
+ bio_associate_blkcg(bio, inode_to_wb_blkcg_css(inode));
}

/*
--
2.4.0

2015-05-22 21:16:30

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 51/51] ext2: enable cgroup writeback support

Writeback now supports cgroup writeback and the generic writeback,
buffer, libfs, and mpage helpers that ext2 uses are all updated to
work with cgroup writeback.

This patch enables cgroup writeback for ext2 by adding
FS_CGROUP_WRITEBACK to its ->fs_flags.
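
Any other filesystem whose writeback goes through the updated generic
helpers would opt in the same way (sketch; "examplefs" is made up):

  static struct file_system_type examplefs_fs_type = {
          .owner          = THIS_MODULE,
          .name           = "examplefs",
          .mount          = examplefs_mount,
          .kill_sb        = kill_block_super,
          .fs_flags       = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
  };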

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: [email protected]
---
fs/ext2/super.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index d0e746e..549219d 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
.name = "ext2",
.mount = ext2_mount,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
};
MODULE_ALIAS_FS("ext2");

--
2.4.0

2015-05-22 23:28:49

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 11/51] memcg: implement mem_cgroup_css_from_page()

On Fri, May 22, 2015 at 05:13:25PM -0400, Tejun Heo wrote:
> +/**
> + * mem_cgroup_css_from_page - css of the memcg associated with a page
> + * @page: page of interest
> + *
> + * This function is guaranteed to return a valid cgroup_subsys_state and
> + * the returned css remains accessible until @page is released.
> + */
> +struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
> +{
> + if (page->mem_cgroup)
> + return &page->mem_cgroup->css;
> + return &root_mem_cgroup->css;
> +}

replace_page_cache() can clear page->mem_cgroup even when the page
still has references, so unfortunately you must hold the page lock
when calling this function.

I haven't checked how you use this - chances are you always have the
page locked anyways - but it probably needs a comment.
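
IOW, a caller that doesn't already hold the lock would need something
like the following (sketch only):

  lock_page(page);
  css = mem_cgroup_css_from_page(page);
  /* use css; the page lock keeps replace_page_cache() from clearing
   * page->mem_cgroup underneath us */
  unlock_page(page);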

2015-05-24 21:24:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/51] memcg: implement mem_cgroup_css_from_page()

Hello,

On Fri, May 22, 2015 at 07:28:31PM -0400, Johannes Weiner wrote:
> replace_page_cache() can clear page->mem_cgroup even when the page
> still has references, so unfortunately you must hold the page lock
> when calling this function.
>
> I haven't checked how you use this - chances are you always have the
> page locked anyways - but it probably needs a comment.

Hmmm... replace_page_cache_page() is used only by fuse, and fuse's
bdi doesn't go through the usual writeback accounting which is
necessary for cgroup writeback support anyway, so I don't think this
presents an actual problem. I'll add a warning in
replace_page_cache_page() which triggers when it's used on a bdi which
has cgroup writeback enabled, and add comments explaining what's going
on.
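
Something along these lines is what I have in mind (just a sketch; the
exact placement and condition are still to be decided):

  /* in replace_page_cache_page(), @old being the page getting replaced */
  WARN_ON_ONCE(inode_cgwb_enabled(old->mapping->host));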

Thanks.

--
tejun

2015-06-07 00:53:12

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 16/51] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

On 05/22/2015 05:13 PM, Tejun Heo wrote:
> Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
> and the role of the separation is unclear. For cgroup support for
> writeback IOs, a bdi will be updated to host multiple wb's where each
> wb serves writeback IOs of a different cgroup on the bdi. To achieve
> that, a wb should carry all states necessary for servicing writeback
> IOs for a cgroup independently.
>
> This patch moves bdi->wb_lock and ->worklist into wb.
>
> * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
> moving, rename it to wb->work_lock as wb->wb_lock is confusing.
> Also, move wb->dwork downwards so that it's colocated with the new
> ->work_lock and ->work_list fields.
>
> * bdi_writeback_workfn() -> wb_workfn()
> bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
> bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
> bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
> __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
> get_next_work_item(bdi) -> get_next_work_item(wb)
>
> * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
> The function contained parts which belong to the containing bdi
> rather than the wb itself - testing cap_writeback_dirty and
> bdi_remove_from_list() invocation. Those are moved to
> bdi_unregister().
>
> * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
> Initializations of the moved bdi->wb_lock and ->work_list are
> relocated from bdi_init() to wb_init().
>
> * As there's still only one bdi_writeback per backing_dev_info, all
> uses of bdi->state are mechanically replaced with bdi->wb.state
> introducing no behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Wu Fengguang <[email protected]>

Hi Tejun,

I'm now seeing:

[619070.603554] WARNING: CPU: 10 PID: 8316 at lib/list_debug.c:56 __list_del_entry+0x104/0x1a0()
[619070.604573] list_del corruption, ffff880540ad6fb8->prev is LIST_POISON2 (dead000000200200)
[619070.605501] Modules linked in:
[619070.606103] CPU: 10 PID: 8316 Comm: mount Not tainted 4.1.0-rc6-next-20150604-sasha-00039-g07bbbaf #2
268
[619070.607386] ffff8800c9aeb000 0000000061727ceb ffff8800c9387a38 ffffffffa3a02988
[619070.608791] 0000000000000000 ffff8800c9387ab8 ffff8800c9387a88 ffffffff9a1e5336
[619070.610029] ffff8802b6008680 ffffffff9bdae994 ffff8800c9387a68 ffffed0019270f53
[619070.610978] Call Trace:
[619070.611357] dump_stack (lib/dump_stack.c:52)
[619070.612019] warn_slowpath_common (kernel/panic.c:448)
[619070.612666] ? __list_del_entry (lib/list_debug.c:54 (discriminator 1))
[619070.613435] warn_slowpath_fmt (kernel/panic.c:454)
[619070.614102] ? warn_slowpath_common (kernel/panic.c:454)
[619070.614900] ? lock_acquired (kernel/locking/lockdep.c:3890)
[619070.615642] __list_del_entry (lib/list_debug.c:54 (discriminator 1))
[619070.616474] ? bdi_destroy (include/linux/rculist.h:131 mm/backing-dev.c:803 mm/backing-dev.c:812)
[619070.617273] bdi_destroy (include/linux/rculist.h:132 mm/backing-dev.c:803 mm/backing-dev.c:812)
[619070.618261] v9fs_session_close (include/linux/spinlock.h:312 fs/9p/v9fs.c:455)
[619070.619121] v9fs_mount (fs/9p/vfs_super.c:200)
[619070.619785] ? lockdep_init_map (kernel/locking/lockdep.c:3055)
[619070.620507] mount_fs (fs/super.c:1109)
[619070.621153] vfs_kern_mount (fs/namespace.c:948)
[619070.621867] ? get_fs_type (fs/filesystems.c:278 (discriminator 2))
[619070.622412] do_mount (fs/namespace.c:2385 fs/namespace.c:2701)
[619070.623035] ? copy_mount_string (fs/namespace.c:2634)
[619070.623610] ? __might_fault (mm/memory.c:3775 (discriminator 1))
[619070.624389] ? __might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3773)
[619070.625029] ? memdup_user (./arch/x86/include/asm/uaccess.h:718)
[619070.625833] SyS_mount (fs/namespace.c:2894 fs/namespace.c:2869)
[619070.626442] ? copy_mnt_ns (fs/namespace.c:2869)
[619070.627131] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[619070.628088] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[619070.629095] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:39)
[619070.629847] system_call_fastpath (arch/x86/kernel/entry_64.S:195)


Thanks,
Sasha

2015-06-08 05:57:58

by Tejun Heo

[permalink] [raw]
Subject: [PATCH block/for-4.2-writeback] v9fs: fix error handling in v9fs_session_init()

On failure, v9fs_session_init() returns with the v9fs_session_info
struct partially initialized and expects the caller to invoke
v9fs_session_close() to clean it up; however, it doesn't track whether
the bdi is initialized or not and curiously invokes bdi_destroy() in
v9fs_session_init()'s own failure path too.

A. If v9fs_session_init() fails before the bdi is initialized, the
follow-up v9fs_session_close() will invoke bdi_destroy() on an
uninitialized bdi.

B. If v9fs_session_init() fails after the bdi is initialized,
bdi_destroy() will be called twice on the same bdi - once in the
failure path of v9fs_session_init() and then by
v9fs_session_close().

A is broken no matter what. B used to be okay because bdi_destroy()
allowed being invoked multiple times on the same bdi, which BTW was
broken in its own way - if bdi_destroy() was invoked on an initialized
but !registered bdi, it'd fail to free percpu counters. Since
f0054bb1e1f3 ("writeback: move backing_dev_info->wb_lock and
->worklist into bdi_writeback"), this no longer works - bdi_destroy()
on an initialized but not registered bdi works correctly, but multiple
invocations of bdi_destroy() are no longer allowed.

The obvious culprit here is v9fs_session_init()'s odd and broken error
behavior. It should simply clean up after itself on failures. This
patch makes the following updates to v9fs_session_init().

* @rc -> @retval error return propagation removed. It didn't serve
any purpose. Just use @rc.

* Move addition to v9fs_sessionlist to the end of the function so that
incomplete sessions are not put on the list or iterated and error
path doesn't have to worry about it.

* Update error handling so that it cleans up after itself.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Sasha Levin <[email protected]>
---
fs/9p/v9fs.c | 50 ++++++++++++++++++++++----------------------------
fs/9p/vfs_super.c | 8 ++------
2 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/fs/9p/v9fs.c b/fs/9p/v9fs.c
index 620d934..8aa56bb 100644
--- a/fs/9p/v9fs.c
+++ b/fs/9p/v9fs.c
@@ -320,31 +320,21 @@ static int v9fs_parse_options(struct v9fs_session_info *v9ses, char *opts)
struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
const char *dev_name, char *data)
{
- int retval = -EINVAL;
struct p9_fid *fid;
- int rc;
+ int rc = -ENOMEM;

v9ses->uname = kstrdup(V9FS_DEFUSER, GFP_KERNEL);
if (!v9ses->uname)
- return ERR_PTR(-ENOMEM);
+ goto err_names;

v9ses->aname = kstrdup(V9FS_DEFANAME, GFP_KERNEL);
- if (!v9ses->aname) {
- kfree(v9ses->uname);
- return ERR_PTR(-ENOMEM);
- }
+ if (!v9ses->aname)
+ goto err_names;
init_rwsem(&v9ses->rename_sem);

rc = bdi_setup_and_register(&v9ses->bdi, "9p");
- if (rc) {
- kfree(v9ses->aname);
- kfree(v9ses->uname);
- return ERR_PTR(rc);
- }
-
- spin_lock(&v9fs_sessionlist_lock);
- list_add(&v9ses->slist, &v9fs_sessionlist);
- spin_unlock(&v9fs_sessionlist_lock);
+ if (rc)
+ goto err_names;

v9ses->uid = INVALID_UID;
v9ses->dfltuid = V9FS_DEFUID;
@@ -352,10 +342,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,

v9ses->clnt = p9_client_create(dev_name, data);
if (IS_ERR(v9ses->clnt)) {
- retval = PTR_ERR(v9ses->clnt);
- v9ses->clnt = NULL;
+ rc = PTR_ERR(v9ses->clnt);
p9_debug(P9_DEBUG_ERROR, "problem initializing 9p client\n");
- goto error;
+ goto err_bdi;
}

v9ses->flags = V9FS_ACCESS_USER;
@@ -368,10 +357,8 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
}

rc = v9fs_parse_options(v9ses, data);
- if (rc < 0) {
- retval = rc;
- goto error;
- }
+ if (rc < 0)
+ goto err_clnt;

v9ses->maxdata = v9ses->clnt->msize - P9_IOHDRSZ;

@@ -405,10 +392,9 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
fid = p9_client_attach(v9ses->clnt, NULL, v9ses->uname, INVALID_UID,
v9ses->aname);
if (IS_ERR(fid)) {
- retval = PTR_ERR(fid);
- fid = NULL;
+ rc = PTR_ERR(fid);
p9_debug(P9_DEBUG_ERROR, "cannot attach\n");
- goto error;
+ goto err_clnt;
}

if ((v9ses->flags & V9FS_ACCESS_MASK) == V9FS_ACCESS_SINGLE)
@@ -420,12 +406,20 @@ struct p9_fid *v9fs_session_init(struct v9fs_session_info *v9ses,
/* register the session for caching */
v9fs_cache_session_get_cookie(v9ses);
#endif
+ spin_lock(&v9fs_sessionlist_lock);
+ list_add(&v9ses->slist, &v9fs_sessionlist);
+ spin_unlock(&v9fs_sessionlist_lock);

return fid;

-error:
+err_clnt:
+ p9_client_destroy(v9ses->clnt);
+err_bdi:
bdi_destroy(&v9ses->bdi);
- return ERR_PTR(retval);
+err_names:
+ kfree(v9ses->uname);
+ kfree(v9ses->aname);
+ return ERR_PTR(rc);
}

/**
diff --git a/fs/9p/vfs_super.c b/fs/9p/vfs_super.c
index e99a338..bf495ce 100644
--- a/fs/9p/vfs_super.c
+++ b/fs/9p/vfs_super.c
@@ -130,11 +130,7 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,
fid = v9fs_session_init(v9ses, dev_name, data);
if (IS_ERR(fid)) {
retval = PTR_ERR(fid);
- /*
- * we need to call session_close to tear down some
- * of the data structure setup by session_init
- */
- goto close_session;
+ goto free_session;
}

sb = sget(fs_type, NULL, v9fs_set_super, flags, v9ses);
@@ -195,8 +191,8 @@ static struct dentry *v9fs_mount(struct file_system_type *fs_type, int flags,

clunk_fid:
p9_client_clunk(fid);
-close_session:
v9fs_session_close(v9ses);
+free_session:
kfree(v9ses);
return ERR_PTR(retval);

2015-06-08 15:10:36

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH block/for-4.2-writeback] v9fs: fix error handling in v9fs_session_init()

On 06/07/2015 11:57 PM, Tejun Heo wrote:
> On failure, v9fs_session_init() returns with the v9fs_session_info
> struct partially initialized and expects the caller to invoke
> v9fs_session_close() to clean it up; however, it doesn't track whether
> the bdi is initialized or not and curiously invokes bdi_destroy() in
> both vfs_session_init() failure path too.
>
> A. If v9fs_session_init() fails before the bdi is initialized, the
> follow-up v9fs_session_close() will invoke bdi_destroy() on an
> uninitialized bdi.
>
> B. If v9fs_session_init() fails after the bdi is initialized,
> bdi_destroy() will be called twice on the same bdi - once in the
> failure path of v9fs_session_init() and then by
> v9fs_session_close().
>
> A is broken no matter what. B used to be okay because bdi_destroy()
> allowed being invoked multiple times on the same bdi, which BTW was
> broken in its own way - if bdi_destroy() was invoked on an initialiezd
> but !registered bdi, it'd fail to free percpu counters. Since
> f0054bb1e1f3 ("writeback: move backing_dev_info->wb_lock and
> ->worklist into bdi_writeback"), this no longer work - bdi_destroy()
> on an initialized but not registered bdi works correctly but multiple
> invocations of bdi_destroy() is no longer allowed.
>
> The obvious culprit here is v9fs_session_init()'s odd and broken error
> behavior. It should simply clean up after itself on failures. This
> patch makes the following updates to v9fs_session_init().
>
> * @rc -> @retval error return propagation removed. It didn't serve
> any purpose. Just use @rc.
>
> * Move addition to v9fs_sessionlist to the end of the function so that
> incomplete sessions are not put on the list or iterated and error
> path doesn't have to worry about it.
>
> * Update error handling so that it cleans up after itself.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Sasha Levin <[email protected]>

Added to for-4.2/writeback, thanks.

--
Jens Axboe

2015-06-17 14:56:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

On Fri 22-05-15 17:13:20, Tejun Heo wrote:
> Add global mem_cgroup_root_css which points to the root memcg css.

Is there any reason to use css rather than mem_cgroup other than that
the structure is not visible outside of memcontrol.c? Because I have a
patchset which exports it. It is not merged yet, so a move to
mem_cgroup could be done later. I am just interested in whether there
is a stronger reason.

> This will be used by cgroup writeback support. If memcg is disabled,
> it's defined as ERR_PTR(-EINVAL).

Hmm. Why EINVAL? I can see only mm/backing-dev.c (in
review-cgroup-writeback-switch-20150528 branch) which uses it and that
shouldn't even try to compile if !CONFIG_MEMCG no? Otherwise we would
simply blow up.

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> aCc: Michal Hocko <[email protected]>
> ---
> include/linux/memcontrol.h | 4 ++++
> mm/memcontrol.c | 2 ++
> 2 files changed, 6 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 5fe6411..294498f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -68,6 +68,8 @@ enum mem_cgroup_events_index {
> };
>
> #ifdef CONFIG_MEMCG
> +extern struct cgroup_subsys_state *mem_cgroup_root_css;
> +
> void mem_cgroup_events(struct mem_cgroup *memcg,
> enum mem_cgroup_events_index idx,
> unsigned int nr);
> @@ -196,6 +198,8 @@ void mem_cgroup_split_huge_fixup(struct page *head);
> #else /* CONFIG_MEMCG */
> struct mem_cgroup;
>
> +#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL))
> +
> static inline void mem_cgroup_events(struct mem_cgroup *memcg,
> enum mem_cgroup_events_index idx,
> unsigned int nr)
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c23c1a3..b22a92b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
>
> #define MEM_CGROUP_RECLAIM_RETRIES 5
> static struct mem_cgroup *root_mem_cgroup __read_mostly;
> +struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly;
>
> /* Whether the swap controller is active */
> #ifdef CONFIG_MEMCG_SWAP
> @@ -4441,6 +4442,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> /* root ? */
> if (parent_css == NULL) {
> root_mem_cgroup = memcg;
> + mem_cgroup_root_css = &memcg->css;
> page_counter_init(&memcg->memory, NULL);
> memcg->high = PAGE_COUNTER_MAX;
> memcg->soft_limit = PAGE_COUNTER_MAX;
> --
> 2.4.0
>

--
Michal Hocko
SUSE Labs

2015-06-17 18:25:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

Hey, Michal.

On Wed, Jun 17, 2015 at 04:56:42PM +0200, Michal Hocko wrote:
> On Fri 22-05-15 17:13:20, Tejun Heo wrote:
> > Add global mem_cgroup_root_css which points to the root memcg css.
>
> Is there any reason to using css rather than mem_cgroup other than the
> structure is not visible outside of memcontrol.c? Because I have a
> patchset which exports it. It is not merged yet so a move to mem_cgroup
> could be done later. I am just interested whether there is a stronger
> reason.

It doesn't really matter either way but I think it makes a bit more
sense to use css as the common type when external code interacts with
cgroup controllers. e.g. cgroup writeback interacts with both memcg
and blkcg and in most cases it doesn't know or care about their
internal states. Most of what it wants is tracking them and doing
some common css operations (refcnting, printing and so on) on them.

> > This will be used by cgroup writeback support. If memcg is disabled,
> > it's defined as ERR_PTR(-EINVAL).
>
> Hmm. Why EINVAL? I can see only mm/backing-dev.c (in
> review-cgroup-writeback-switch-20150528 branch) which uses it and that
> shouldn't even try to compile if !CONFIG_MEMCG no? Otherwise we would
> simply blow up.

Hmm... the code may have changed in between, but there was something
which depended on the root css being defined when
!CONFIG_CGROUP_WRITEBACK - or maybe it was blkcg_root_css and the memcg
side was added for consistency. An ERR_PTR value is non-zero, which is
an invariant that is often depended upon, while guaranteeing an oops
when deref'd.
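
IOW, the property being relied upon is (a sketch, not code from the
series):

  BUG_ON(!mem_cgroup_root_css);           /* never NULL, even w/o memcg */
  WARN_ON(IS_ERR(mem_cgroup_root_css));   /* fires only if !CONFIG_MEMCG */
  css_get(mem_cgroup_root_css);           /* oopses loudly if memcg is off */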

Thanks.

--
tejun

2015-06-18 11:12:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

On Wed 17-06-15 14:25:00, Tejun Heo wrote:
> Hey, Michal.
>
> On Wed, Jun 17, 2015 at 04:56:42PM +0200, Michal Hocko wrote:
> > On Fri 22-05-15 17:13:20, Tejun Heo wrote:
> > > Add global mem_cgroup_root_css which points to the root memcg css.
> >
> > Is there any reason to using css rather than mem_cgroup other than the
> > structure is not visible outside of memcontrol.c? Because I have a
> > patchset which exports it. It is not merged yet so a move to mem_cgroup
> > could be done later. I am just interested whether there is a stronger
> > reason.
>
> It doesn't really matter either way but I think it makes a bit more
> sense to use css as the common type when external code interacts with
> cgroup controllers. e.g. cgroup writeback interacts with both memcg
> and blkcg and in most cases it doesn't know or care about their
> internal states. Most of what it wants is tracking them and doing
> some common css operations (refcnting, printing and so on) on them.

I see and yes, it makes some sense. I just think we can get rid of the
accessor functions when the struct mem_cgroup is visible and the code
can simply do &{page->}mem_cgroup->css.

> > > This will be used by cgroup writeback support. If memcg is disabled,
> > > it's defined as ERR_PTR(-EINVAL).
> >
> > Hmm. Why EINVAL? I can see only mm/backing-dev.c (in
> > review-cgroup-writeback-switch-20150528 branch) which uses it and that
> > shouldn't even try to compile if !CONFIG_MEMCG no? Otherwise we would
> > simply blow up.
>
> Hmm... the code maybe has changed inbetween but there was something
> which depended on the root css being defined when
> !CONFIG_CGROUP_WRITEBACK or maybe it was on blkcg_root_css and memcg
> side was added for consistency.

I have tried to compile with !CONFIG_MEMCG and !CONFIG_CGROUP_WRITEBACK
without mem_cgroup_root_css defined for this configuration, and
mm/backing-dev.c compiles just fine. So maybe we should get rid of it
rather than have potentially tricky code?

> An ERR_PTR value is non-zero, which
> is an invariant which is often depended upon, while guaranteeing oops
> when deref'd.

Yeah, but css_{get,put} and other consumers of the pointer are not
checking for ERR_PTR. So I think this is really misleading.

--
Michal Hocko
SUSE Labs

2015-06-18 17:49:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

Hello, Michal.

On Thu, Jun 18, 2015 at 01:12:27PM +0200, Michal Hocko wrote:
...
> I see and yes, it makes some sense. I just think we can get rid of the
> accessor functions when the struct mem_cgroup is visible and the code
> can simply do &{page->}mem_cgroup->css.

As long as the accessors are inline, I think it should be fine.

> I have tried to compile with !CONFIG_MEMCG and !CONFIG_CGROUP_WRITEBACK
> without mem_cgroup_root_css defined for this configuration and
> mm/backing-dev.c compiles just fine. So maybe we should get rid of it
> rather than have a potentially tricky code?

Yeah, please feel free to queue a patch to remove it if it doesn't
break anything.

Thanks.

--
tejun

2015-06-19 09:19:04

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

On Thu 18-06-15 13:49:30, Tejun Heo wrote:
[...]
> > I have tried to compile with !CONFIG_MEMCG and !CONFIG_CGROUP_WRITEBACK
> > without mem_cgroup_root_css defined for this configuration and
> > mm/backing-dev.c compiles just fine. So maybe we should get rid of it
> > rather than have a potentially tricky code?
>
> Yeah, please feel free to queue a patch to remove it if doesn't break
> anything.

Against which branch should I generate the patch?
--
Michal Hocko
SUSE Labs

2015-06-19 15:17:31

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/51] memcg: add mem_cgroup_root_css

On Fri, Jun 19, 2015 at 11:18:48AM +0200, Michal Hocko wrote:
> On Thu 18-06-15 13:49:30, Tejun Heo wrote:
> [...]
> > > I have tried to compile with !CONFIG_MEMCG and !CONFIG_CGROUP_WRITEBACK
> > > without mem_cgroup_root_css defined for this configuration and
> > > mm/backing-dev.c compiles just fine. So maybe we should get rid of it
> > > rather than have a potentially tricky code?
> >
> > Yeah, please feel free to queue a patch to remove it if doesn't break
> > anything.
>
> Against which branch should a I generate the patch?

It's in the for-4.2/writeback branch of the block tree; however, a
patch against -mm should work, right?

Thanks.

--
tejun

2015-06-30 06:47:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 19/51] bdi: make inode_to_bdi() inline

On Fri 22-05-15 17:13:33, Tejun Heo wrote:
> Now that bdi definitions are moved to backing-dev-defs.h,
> backing-dev.h can include blkdev.h and inline inode_to_bdi() without
> worrying about introducing circular include dependency. The function
> gets called from hot paths and fairly trivial.
>
> This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
> function calls inline. blockdev_superblock and noop_backing_dev_info
> are EXPORT_GPL'd to allow the inline functions to be used from
> modules.
>
> While at it, make sb_is_blkdev_sb() return bool instead of int.
>
> v2: Fixed typo in description as suggested by Jan.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: Jens Axboe <[email protected]>
> Cc: Christoph Hellwig <[email protected]>

Looks good. Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/block_dev.c | 8 ++------
> fs/fs-writeback.c | 16 ----------------
> include/linux/backing-dev.h | 18 ++++++++++++++++--
> include/linux/fs.h | 8 +++++++-
> mm/backing-dev.c | 1 +
> 5 files changed, 26 insertions(+), 25 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index e545cbf..f04c873 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -547,7 +547,8 @@ static struct file_system_type bd_type = {
> .kill_sb = kill_anon_super,
> };
>
> -static struct super_block *blockdev_superblock __read_mostly;
> +struct super_block *blockdev_superblock __read_mostly;
> +EXPORT_SYMBOL_GPL(blockdev_superblock);
>
> void __init bdev_cache_init(void)
> {
> @@ -688,11 +689,6 @@ static struct block_device *bd_acquire(struct inode *inode)
> return bdev;
> }
>
> -int sb_is_blkdev_sb(struct super_block *sb)
> -{
> - return sb == blockdev_superblock;
> -}
> -
> /* Call when you free inode */
>
> void bd_forget(struct inode *inode)
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a69d2e1..34d1cb8 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -78,22 +78,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
> }
> EXPORT_SYMBOL(writeback_in_progress);
>
> -struct backing_dev_info *inode_to_bdi(struct inode *inode)
> -{
> - struct super_block *sb;
> -
> - if (!inode)
> - return &noop_backing_dev_info;
> -
> - sb = inode->i_sb;
> -#ifdef CONFIG_BLOCK
> - if (sb_is_blkdev_sb(sb))
> - return blk_get_backing_dev_info(I_BDEV(inode));
> -#endif
> - return sb->s_bdi;
> -}
> -EXPORT_SYMBOL_GPL(inode_to_bdi);
> -
> static inline struct inode *wb_inode(struct list_head *head)
> {
> return list_entry(head, struct inode, i_wb_list);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 5e39f7a..7857820 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -11,11 +11,10 @@
> #include <linux/kernel.h>
> #include <linux/fs.h>
> #include <linux/sched.h>
> +#include <linux/blkdev.h>
> #include <linux/writeback.h>
> #include <linux/backing-dev-defs.h>
>
> -struct backing_dev_info *inode_to_bdi(struct inode *inode);
> -
> int __must_check bdi_init(struct backing_dev_info *bdi);
> void bdi_destroy(struct backing_dev_info *bdi);
>
> @@ -149,6 +148,21 @@ extern struct backing_dev_info noop_backing_dev_info;
>
> int writeback_in_progress(struct backing_dev_info *bdi);
>
> +static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> +{
> + struct super_block *sb;
> +
> + if (!inode)
> + return &noop_backing_dev_info;
> +
> + sb = inode->i_sb;
> +#ifdef CONFIG_BLOCK
> + if (sb_is_blkdev_sb(sb))
> + return blk_get_backing_dev_info(I_BDEV(inode));
> +#endif
> + return sb->s_bdi;
> +}
> +
> static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
> {
> if (bdi->congested_fn)
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 1ef6390..ce100b87 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2240,7 +2240,13 @@ extern struct super_block *freeze_bdev(struct block_device *);
> extern void emergency_thaw_all(void);
> extern int thaw_bdev(struct block_device *bdev, struct super_block *sb);
> extern int fsync_bdev(struct block_device *);
> -extern int sb_is_blkdev_sb(struct super_block *sb);
> +
> +extern struct super_block *blockdev_superblock;
> +
> +static inline bool sb_is_blkdev_sb(struct super_block *sb)
> +{
> + return sb == blockdev_superblock;
> +}
> #else
> static inline void bd_forget(struct inode *inode) {}
> static inline int sync_blockdev(struct block_device *bdev) { return 0; }
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index ff85ecb..b0707d1 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = {
> .name = "noop",
> .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK,
> };
> +EXPORT_SYMBOL_GPL(noop_backing_dev_info);
>
> static struct class *bdi_class;
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 09:08:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 24/51] writeback, blkcg: associate each blkcg_gq with the corresponding bdi_writeback_congested

On Fri 22-05-15 17:13:38, Tejun Heo wrote:
> A blkg (blkcg_gq) can be congested and decongested independently from
> other blkgs on the same request_queue. Accordingly, for cgroup
> writeback support, the congestion status at bdi (backing_dev_info)
> should be split and updated separately from matching blkg's.
>
> This patch prepares by adding blkg->wb_congested and associating a
> blkg with its matching per-blkcg bdi_writeback_congested on creation.
>
> v2: Updated to associate bdi_writeback_congested instead of
> bdi_writeback.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Vivek Goyal <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

> ---
> block/blk-cgroup.c | 17 +++++++++++++++--
> include/linux/blk-cgroup.h | 6 ++++++
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 979cfdb..31610ae 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -182,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> struct blkcg_gq *new_blkg)
> {
> struct blkcg_gq *blkg;
> + struct bdi_writeback_congested *wb_congested;
> int i, ret;
>
> WARN_ON_ONCE(!rcu_read_lock_held());
> @@ -193,22 +194,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> goto err_free_blkg;
> }
>
> + wb_congested = wb_congested_get_create(&q->backing_dev_info,
> + blkcg->css.id, GFP_ATOMIC);
> + if (!wb_congested) {
> + ret = -ENOMEM;
> + goto err_put_css;
> + }
> +
> /* allocate */
> if (!new_blkg) {
> new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC);
> if (unlikely(!new_blkg)) {
> ret = -ENOMEM;
> - goto err_put_css;
> + goto err_put_congested;
> }
> }
> blkg = new_blkg;
> + blkg->wb_congested = wb_congested;
>
> /* link parent */
> if (blkcg_parent(blkcg)) {
> blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false);
> if (WARN_ON_ONCE(!blkg->parent)) {
> ret = -EINVAL;
> - goto err_put_css;
> + goto err_put_congested;
> }
> blkg_get(blkg->parent);
> }
> @@ -245,6 +254,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
> blkg_put(blkg);
> return ERR_PTR(ret);
>
> +err_put_congested:
> + wb_congested_put(wb_congested);
> err_put_css:
> css_put(&blkcg->css);
> err_free_blkg:
> @@ -391,6 +402,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head)
> if (blkg->parent)
> blkg_put(blkg->parent);
>
> + wb_congested_put(blkg->wb_congested);
> +
> blkg_free(blkg);
> }
> EXPORT_SYMBOL_GPL(__blkg_release_rcu);
> diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
> index 3033eb1..07a32b8 100644
> --- a/include/linux/blk-cgroup.h
> +++ b/include/linux/blk-cgroup.h
> @@ -99,6 +99,12 @@ struct blkcg_gq {
> struct hlist_node blkcg_node;
> struct blkcg *blkcg;
>
> + /*
> + * Each blkg gets congested separately and the congestion state is
> + * propagated to the matching bdi_writeback_congested.
> + */
> + struct bdi_writeback_congested *wb_congested;
> +
> /* all non-root blkcg_gq's are guaranteed to have access to parent */
> struct blkcg_gq *parent;
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 09:21:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 21/51] bdi: separate out congested state into a separate struct

On Fri 22-05-15 17:13:35, Tejun Heo wrote:
> Currently, a wb's (bdi_writeback) congestion state is carried in its
> ->state field; however, cgroup writeback support will require multiple
> wb's sharing the same congestion state. This patch separates out
> congestion state into its own struct - struct bdi_writeback_congested.
> A new field wb field, wb_congested, points to its associated congested
> struct. The default wb, bdi->wb, always points to bdi->wb_congested.
>
> While this patch adds a layer of indirection, it doesn't introduce any
> behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> include/linux/backing-dev-defs.h | 14 ++++++++++++--
> include/linux/backing-dev.h | 2 +-
> mm/backing-dev.c | 7 +++++--
> 3 files changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index aa18c4b..9e9eafa 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -16,12 +16,15 @@ struct dentry;
> * Bits in bdi_writeback.state
> */
> enum wb_state {
> - WB_async_congested, /* The async (write) queue is getting full */
> - WB_sync_congested, /* The sync queue is getting full */
> WB_registered, /* bdi_register() was done */
> WB_writeback_running, /* Writeback is in progress */
> };
>
> +enum wb_congested_state {
> + WB_async_congested, /* The async (write) queue is getting full */
> + WB_sync_congested, /* The sync queue is getting full */
> +};
> +
> typedef int (congested_fn)(void *, int);
>
> enum wb_stat_item {
> @@ -34,6 +37,10 @@ enum wb_stat_item {
>
> #define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
>
> +struct bdi_writeback_congested {
> + unsigned long state; /* WB_[a]sync_congested flags */
> +};
> +
> struct bdi_writeback {
> struct backing_dev_info *bdi; /* our parent bdi */
>
> @@ -48,6 +55,8 @@ struct bdi_writeback {
>
> struct percpu_counter stat[NR_WB_STAT_ITEMS];
>
> + struct bdi_writeback_congested *congested;
> +
> unsigned long bw_time_stamp; /* last time write bw is updated */
> unsigned long dirtied_stamp;
> unsigned long written_stamp; /* pages written at bw_time_stamp */
> @@ -84,6 +93,7 @@ struct backing_dev_info {
> unsigned int max_ratio, max_prop_frac;
>
> struct bdi_writeback wb; /* default writeback info for this bdi */
> + struct bdi_writeback_congested wb_congested;
>
> struct device *dev;
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 7857820..bfdaa18 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -167,7 +167,7 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
> {
> if (bdi->congested_fn)
> return bdi->congested_fn(bdi->congested_data, bdi_bits);
> - return (bdi->wb.state & bdi_bits);
> + return (bdi->wb.congested->state & bdi_bits);
> }
>
> static inline int bdi_read_congested(struct backing_dev_info *bdi)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 805b287..5ec7658 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -383,6 +383,9 @@ int bdi_init(struct backing_dev_info *bdi)
> if (err)
> return err;
>
> + bdi->wb_congested.state = 0;
> + bdi->wb.congested = &bdi->wb_congested;
> +
> return 0;
> }
> EXPORT_SYMBOL(bdi_init);
> @@ -504,7 +507,7 @@ void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> bit = sync ? WB_sync_congested : WB_async_congested;
> - if (test_and_clear_bit(bit, &bdi->wb.state))
> + if (test_and_clear_bit(bit, &bdi->wb.congested->state))
> atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_atomic();
> if (waitqueue_active(wqh))
> @@ -517,7 +520,7 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> enum wb_state bit;
>
> bit = sync ? WB_sync_congested : WB_async_congested;
> - if (!test_and_set_bit(bit, &bdi->wb.state))
> + if (!test_and_set_bit(bit, &bdi->wb.congested->state))
> atomic_inc(&nr_bdi_congested[sync]);
> }
> EXPORT_SYMBOL(set_bdi_congested);
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 09:38:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

On Fri 22-05-15 17:13:36, Tejun Heo wrote:
> cgroup writeback requires support from both bdi and filesystem sides.
> Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
> support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
> default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
> both MEMCG and BLK_CGROUP are enabled.
>
> inode_cgwb_enabled(), which determines whether both the bdi and the fs
> of a given inode support cgroup writeback, is added.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Hum, you later changed this to use a per-sb flag instead of a per-fs-type
flag, right? We could do it that way here as well, but OK.

One more question - what prevents us from supporting CGROUP_WRITEBACK for
all bdis capable of writeback? I guess the reason is that blkcgs are
currently bound to a request_queue and we have to have blkcg(s) for
CGROUP_WRITEBACK to work, am I right? But in principle, tracking writeback
state and doing writeback per memcg doesn't seem to be bound to any device
properties, so we could do that, right?

Anyway, this patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> block/blk-core.c | 2 +-
> include/linux/backing-dev.h | 32 +++++++++++++++++++++++++++++++-
> include/linux/fs.h | 1 +
> init/Kconfig | 5 +++++
> 4 files changed, 38 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index f46688f..e0f726f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -620,7 +620,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>
> q->backing_dev_info.ra_pages =
> (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> - q->backing_dev_info.capabilities = 0;
> + q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK;
> q->backing_dev_info.name = "block";
> q->node = node_id;
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index bfdaa18..6bb3123 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -134,12 +134,15 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
> * BDI_CAP_NO_WRITEBACK: Don't write pages back
> * BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages
> * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold.
> + *
> + * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback.
> */
> #define BDI_CAP_NO_ACCT_DIRTY 0x00000001
> #define BDI_CAP_NO_WRITEBACK 0x00000002
> #define BDI_CAP_NO_ACCT_WB 0x00000004
> #define BDI_CAP_STABLE_WRITES 0x00000008
> #define BDI_CAP_STRICTLIMIT 0x00000010
> +#define BDI_CAP_CGROUP_WRITEBACK 0x00000020
>
> #define BDI_CAP_NO_ACCT_AND_WRITEBACK \
> (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB)
> @@ -229,4 +232,31 @@ static inline int bdi_sched_wait(void *word)
> return 0;
> }
>
> -#endif /* _LINUX_BACKING_DEV_H */
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +/**
> + * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
> + * @inode: inode of interest
> + *
> + * cgroup writeback requires support from both the bdi and filesystem.
> + * Test whether @inode has both.
> + */
> +static inline bool inode_cgwb_enabled(struct inode *inode)
> +{
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> + return bdi_cap_account_dirty(bdi) &&
> + (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) &&
> + (inode->i_sb->s_type->fs_flags & FS_CGROUP_WRITEBACK);
> +}
> +
> +#else /* CONFIG_CGROUP_WRITEBACK */
> +
> +static inline bool inode_cgwb_enabled(struct inode *inode)
> +{
> + return false;
> +}
> +
> +#endif /* CONFIG_CGROUP_WRITEBACK */
> +
> +#endif /* _LINUX_BACKING_DEV_H */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ce100b87..74e0ae0 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1897,6 +1897,7 @@ struct file_system_type {
> #define FS_HAS_SUBTYPE 4
> #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */
> #define FS_USERNS_DEV_MOUNT 16 /* A userns mount does not imply MNT_NODEV */
> +#define FS_CGROUP_WRITEBACK 32 /* Supports cgroup-aware writeback */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> struct dentry *(*mount) (struct file_system_type *, int,
> const char *, void *);
> diff --git a/init/Kconfig b/init/Kconfig
> index dc24dec..d4f7633 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1141,6 +1141,11 @@ config DEBUG_BLK_CGROUP
> Enable some debugging help. Currently it exports additional stat
> files in a cgroup which can be useful for debugging.
>
> +config CGROUP_WRITEBACK
> + bool
> + depends on MEMCG && BLK_CGROUP
> + default y
> +
> endif # CGROUPS
>
> config CHECKPOINT_RESTORE
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 10:15:06

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 23/51] writeback: make backing_dev_info host cgroup-specific bdi_writebacks

On Fri 22-05-15 17:13:37, Tejun Heo wrote:
> For the planned cgroup writeback support, on each bdi
> (backing_dev_info), each memcg will be served by a separate wb
> (bdi_writeback). This patch updates bdi so that a bdi can host
> multiple wbs (bdi_writebacks).
>
> On the default hierarchy, blkcg implicitly enables memcg. This allows
> using memcg's page ownership for attributing writeback IOs, and every
> memcg - blkcg combination can be served by its own wb by assigning a
> dedicated wb to each memcg. This means that there may be multiple
> wb's of a bdi mapped to the same blkcg. As congested state is per
> blkcg - bdi combination, those wb's should share the same congested
> state. This is achieved by tracking congested state via
> bdi_writeback_congested structs which are keyed by blkcg.
>
> bdi->wb remains unchanged and will keep serving the root cgroup.
> cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
> looked up while dirtying an inode according to the memcg of the page
> being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
> by its memcg id. Once an inode is associated with its wb, it can be
> retrieved using inode_to_wb().
>
> Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
> pages will keep being associated with bdi->wb.
>
> v3: inode_attach_wb() in account_page_dirtied() moved inside
> mapping_cap_account_dirty() block where it's known to be !NULL.
> Also, an unnecessary NULL check before kfree() removed. Both
> detected by the kbuild bot.
>
> v2: Updated so that wb association is per inode and wb is per memcg
> rather than blkcg.

It may be a good place to explain in this changelog (and to add that
explanation to a comment before the definition of struct bdi_writeback)
why the writeback structures are per memcg and not per the coarser blkcg.
I was pondering this for a while before I realized that the amount of
available memory, and thus the dirty limits, are a memcg property, so we
have to be able to write back just a specific memcg. It would be nice if
one didn't have to figure this out on one's own (although it's kind of
obvious once you realize that ;).
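
For illustration, a comment along those lines might read as follows
(wording is mine, not taken from the patchset):

	/*
	 * A bdi_writeback serves exactly one memcg.  Dirty limits and the
	 * amount of dirtyable memory are memcg properties, so writeback
	 * must be targetable at a single memcg; a coarser per-blkcg
	 * structure wouldn't allow flushing just the memcg that exceeded
	 * its limit.  Multiple wb's of a bdi may map to the same blkcg and
	 * then share one bdi_writeback_congested.
	 */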

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <[email protected]>

A few nits below.

> +/**
> + * wb_find_current - find wb for %current on a bdi
> + * @bdi: bdi of interest
> + *
> + * Find the wb of @bdi which matches both the memcg and blkcg of %current.
> + * Must be called under rcu_read_lock() which protects the returend wb.
^^ returned

> + * NULL if not found.
> + */
> +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
> +{
> + struct cgroup_subsys_state *memcg_css;
> + struct bdi_writeback *wb;
> +
> + memcg_css = task_css(current, memory_cgrp_id);
> + if (!memcg_css->parent)
> + return &bdi->wb;
> +
> + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id);
> +
> + /*
> + * %current's blkcg equals the effective blkcg of its memcg. No
> + * need to use the relatively expensive cgroup_get_e_css().
> + */
> + if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id)))
> + return wb;

So this check will miss only in the case where the memcg has moved to a
different blkcg, right? Just want to make sure I understand things
correctly...

...
> +/**
> + * wb_congested_put - put a wb_congested
> + * @congested: wb_congested to put
> + *
> + * Put @congested and destroy it if the refcnt reaches zero.
> + */
> +void wb_congested_put(struct bdi_writeback_congested *congested)
> +{
> + struct backing_dev_info *bdi = congested->bdi;
> + unsigned long flags;
> +
> + if (congested->blkcg_id == 1)
> + return;
> +
> + local_irq_save(flags);
> + if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) {
> + local_irq_restore(flags);
> + return;
> + }
> +
> + rb_erase(&congested->rb_node, &congested->bdi->cgwb_congested_tree);
> + spin_unlock_irqrestore(&cgwb_lock, flags);
> + kfree(congested);
> +
> + if (atomic_dec_and_test(&bdi->usage_cnt))
> + wake_up_all(&cgwb_release_wait);

Maybe we could have a small wrapper for dropping bdi->usage_cnt? If
someone forgets to wake up cgwb_release_wait after dropping the refcount,
that call site will be somewhat difficult to chase down...
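
A minimal sketch of such a wrapper, based on the code quoted above (the
helper name is made up, not from the patchset):

	/* drop a ref on bdi->usage_cnt and wake up the release waiters */
	static void cgwb_bdi_usage_put(struct backing_dev_info *bdi)
	{
		if (atomic_dec_and_test(&bdi->usage_cnt))
			wake_up_all(&cgwb_release_wait);
	}

wb_congested_put() would then end with a single cgwb_bdi_usage_put(bdi)
call, and every other place dropping usage_cnt would go through the same
path.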

...
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg)
> +{
> + return &memcg->cgwb_list;
> +}
> +
> +#endif /* CONFIG_CGROUP_WRITEBACK */
> +

What is the reason for this wrapper? It doesn't seem particularly useful...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 14:17:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 25/51] writeback: attribute stats to the matching per-cgroup bdi_writeback

On Fri 22-05-15 17:13:39, Tejun Heo wrote:
> Until now, all WB_* stats were accounted against the root wb
> (bdi_writeback). Now that multiple wb (bdi_writeback) support is in
> place, let's attribute the stats to the respective per-cgroup wb's.
>
> As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
> visible behavior differences.
>
> v2: Updated for per-inode wb association.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/page-writeback.c | 24 +++++++++++++++---------
> 1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 9b95cf8..4d0a9da 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2130,7 +2130,7 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
> if (mapping_cap_account_dirty(mapping)) {
> mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_RECLAIMABLE);
> + dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
> task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> }
> }
> @@ -2191,10 +2191,13 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers);
> void account_page_redirty(struct page *page)
> {
> struct address_space *mapping = page->mapping;
> +
> if (mapping && mapping_cap_account_dirty(mapping)) {
> + struct bdi_writeback *wb = inode_to_wb(mapping->host);
> +
> current->nr_dirtied--;
> dec_zone_page_state(page, NR_DIRTIED);
> - dec_wb_stat(&inode_to_bdi(mapping->host)->wb, WB_DIRTIED);
> + dec_wb_stat(wb, WB_DIRTIED);
> }
> }
> EXPORT_SYMBOL(account_page_redirty);
> @@ -2373,8 +2376,7 @@ int clear_page_dirty_for_io(struct page *page)
> if (TestClearPageDirty(page)) {
> mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_wb_stat(&inode_to_bdi(mapping->host)->wb,
> - WB_RECLAIMABLE);
> + dec_wb_stat(inode_to_wb(mapping->host), WB_RECLAIMABLE);
> ret = 1;
> }
> mem_cgroup_end_page_stat(memcg);
> @@ -2392,7 +2394,8 @@ int test_clear_page_writeback(struct page *page)
>
> memcg = mem_cgroup_begin_page_stat(page);
> if (mapping) {
> - struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + struct inode *inode = mapping->host;
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> unsigned long flags;
>
> spin_lock_irqsave(&mapping->tree_lock, flags);
> @@ -2402,8 +2405,10 @@ int test_clear_page_writeback(struct page *page)
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi)) {
> - __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> - __wb_writeout_inc(&bdi->wb);
> + struct bdi_writeback *wb = inode_to_wb(inode);
> +
> + __dec_wb_stat(wb, WB_WRITEBACK);
> + __wb_writeout_inc(wb);
> }
> }
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> @@ -2427,7 +2432,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
>
> memcg = mem_cgroup_begin_page_stat(page);
> if (mapping) {
> - struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> + struct inode *inode = mapping->host;
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> unsigned long flags;
>
> spin_lock_irqsave(&mapping->tree_lock, flags);
> @@ -2437,7 +2443,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> - __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
> + __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
> }
> if (!PageDirty(page))
> radix_tree_tag_clear(&mapping->page_tree,
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 14:31:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 26/51] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback

On Fri 22-05-15 17:13:40, Tejun Heo wrote:
> Currently, balance_dirty_pages() always works on bdi->wb. This patch
> updates it to work on the wb (bdi_writeback) matching the memcg and blkcg
> of the current task, as that's what the inode is being dirtied against.
>
> balance_dirty_pages_ratelimited() now pins the current wb and passes
> it to balance_dirty_pages().
>
> As no filesystem has FS_CGROUP_WRITEBACK yet, this doesn't lead to
> visible behavior differences.
...
> void balance_dirty_pages_ratelimited(struct address_space *mapping)
> {
> - struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
> - struct bdi_writeback *wb = &bdi->wb;
> + struct inode *inode = mapping->host;
> + struct backing_dev_info *bdi = inode_to_bdi(inode);
> + struct bdi_writeback *wb = NULL;
> int ratelimit;
> int *p;
>
> if (!bdi_cap_account_dirty(bdi))
> return;
>
> + if (inode_cgwb_enabled(inode))
> + wb = wb_get_create_current(bdi, GFP_KERNEL);
> + if (!wb)
> + wb = &bdi->wb;
> +

So this effectively adds a radix tree lookup (of the wb belonging to the
memcg) for every set_page_dirty() call. That seems relatively costly to
me, and all that just to check wb->dirty_exceeded. Can't we just use
inode_to_wb() instead? I understand the results may differ if multiple
memcgs share an inode, and that's why you use wb_get_create_current(),
right? But for the dirty_exceeded check it may be good enough?
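
For illustration, a sketch of that alternative for the ratelimit check
only (not code from the series; how inodes that aren't attached to a wb
yet and wb lifetime/pinning are handled is glossed over here):

	struct inode *inode = mapping->host;
	struct backing_dev_info *bdi = inode_to_bdi(inode);
	struct bdi_writeback *wb;
	int ratelimit;

	if (!bdi_cap_account_dirty(bdi))
		return;

	/* reuse the wb the inode is already associated with, no lookup */
	wb = inode_to_wb(inode);
	if (!wb)
		wb = &bdi->wb;

	ratelimit = current->nr_dirtied_pause;
	if (wb->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));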

Honza

> ratelimit = current->nr_dirtied_pause;
> if (wb->dirty_exceeded)
> ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
> @@ -1616,7 +1622,9 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
> preempt_enable();
>
> if (unlikely(current->nr_dirtied >= ratelimit))
> - balance_dirty_pages(mapping, current->nr_dirtied);
> + balance_dirty_pages(mapping, wb, current->nr_dirtied);
> +
> + wb_put(wb);
> }
> EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 14:50:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 27/51] writeback: make congestion functions per bdi_writeback

On Fri 22-05-15 17:13:41, Tejun Heo wrote:
> Currently, all congestion functions take bdi (backing_dev_info) and
> always operate on the root wb (bdi->wb) and the congestion state from
> the block layer is propagated only for the root blkcg. This patch
> introduces {set|clear}_wb_congested() and wb_congested() which take a
> bdi_writeback_congested and bdi_writeback respectively. The bdi
> counterparts are now wrappers invoking the wb based functions on
> @bdi->wb.
>
> While converting clear_bdi_congested() to clear_wb_congested(), the
> local variable declaration order between @wqh and @bit is swapped for
> cosmetic reason.
>
> This patch just adds the new wb based functions. The following
> patches will apply them.

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> v2: Updated for bdi_writeback_congested.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Cc: Jens Axboe <[email protected]>
> ---
> include/linux/backing-dev-defs.h | 14 +++++++++++--
> include/linux/backing-dev.h | 45 +++++++++++++++++++++++-----------------
> mm/backing-dev.c | 22 ++++++++++----------
> 3 files changed, 49 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index a1e9c40..eb38676 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -163,7 +163,17 @@ enum {
> BLK_RW_SYNC = 1,
> };
>
> -void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> -void set_bdi_congested(struct backing_dev_info *bdi, int sync);
> +void clear_wb_congested(struct bdi_writeback_congested *congested, int sync);
> +void set_wb_congested(struct bdi_writeback_congested *congested, int sync);
> +
> +static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> +{
> + clear_wb_congested(bdi->wb.congested, sync);
> +}
> +
> +static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> +{
> + set_wb_congested(bdi->wb.congested, sync);
> +}
>
> #endif /* __LINUX_BACKING_DEV_DEFS_H */
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 8ae59df..2c498a2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -167,27 +167,13 @@ static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> return sb->s_bdi;
> }
>
> -static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
> +static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
> {
> - if (bdi->congested_fn)
> - return bdi->congested_fn(bdi->congested_data, bdi_bits);
> - return (bdi->wb.congested->state & bdi_bits);
> -}
> -
> -static inline int bdi_read_congested(struct backing_dev_info *bdi)
> -{
> - return bdi_congested(bdi, 1 << WB_sync_congested);
> -}
> -
> -static inline int bdi_write_congested(struct backing_dev_info *bdi)
> -{
> - return bdi_congested(bdi, 1 << WB_async_congested);
> -}
> + struct backing_dev_info *bdi = wb->bdi;
>
> -static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> -{
> - return bdi_congested(bdi, (1 << WB_sync_congested) |
> - (1 << WB_async_congested));
> + if (bdi->congested_fn)
> + return bdi->congested_fn(bdi->congested_data, cong_bits);
> + return wb->congested->state & cong_bits;
> }
>
> long congestion_wait(int sync, long timeout);
> @@ -454,4 +440,25 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
>
> #endif /* CONFIG_CGROUP_WRITEBACK */
>
> +static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits)
> +{
> + return wb_congested(&bdi->wb, cong_bits);
> +}
> +
> +static inline int bdi_read_congested(struct backing_dev_info *bdi)
> +{
> + return bdi_congested(bdi, 1 << WB_sync_congested);
> +}
> +
> +static inline int bdi_write_congested(struct backing_dev_info *bdi)
> +{
> + return bdi_congested(bdi, 1 << WB_async_congested);
> +}
> +
> +static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> +{
> + return bdi_congested(bdi, (1 << WB_sync_congested) |
> + (1 << WB_async_congested));
> +}
> +
> #endif /* _LINUX_BACKING_DEV_H */
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 4c9386c..5029c4a 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -896,31 +896,31 @@ static wait_queue_head_t congestion_wqh[2] = {
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
> __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
> };
> -static atomic_t nr_bdi_congested[2];
> +static atomic_t nr_wb_congested[2];
>
> -void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> +void clear_wb_congested(struct bdi_writeback_congested *congested, int sync)
> {
> - enum wb_state bit;
> wait_queue_head_t *wqh = &congestion_wqh[sync];
> + enum wb_state bit;
>
> bit = sync ? WB_sync_congested : WB_async_congested;
> - if (test_and_clear_bit(bit, &bdi->wb.congested->state))
> - atomic_dec(&nr_bdi_congested[sync]);
> + if (test_and_clear_bit(bit, &congested->state))
> + atomic_dec(&nr_wb_congested[sync]);
> smp_mb__after_atomic();
> if (waitqueue_active(wqh))
> wake_up(wqh);
> }
> -EXPORT_SYMBOL(clear_bdi_congested);
> +EXPORT_SYMBOL(clear_wb_congested);
>
> -void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> +void set_wb_congested(struct bdi_writeback_congested *congested, int sync)
> {
> enum wb_state bit;
>
> bit = sync ? WB_sync_congested : WB_async_congested;
> - if (!test_and_set_bit(bit, &bdi->wb.congested->state))
> - atomic_inc(&nr_bdi_congested[sync]);
> + if (!test_and_set_bit(bit, &congested->state))
> + atomic_inc(&nr_wb_congested[sync]);
> }
> -EXPORT_SYMBOL(set_bdi_congested);
> +EXPORT_SYMBOL(set_wb_congested);
>
> /**
> * congestion_wait - wait for a backing_dev to become uncongested
> @@ -979,7 +979,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
> * encountered in the current zone, yield if necessary instead
> * of sleeping on the congestion queue
> */
> - if (atomic_read(&nr_bdi_congested[sync]) == 0 ||
> + if (atomic_read(&nr_wb_congested[sync]) == 0 ||
> !test_bit(ZONE_CONGESTED, &zone->flags)) {
> cond_resched();
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 15:03:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 28/51] writeback, blkcg: restructure blk_{set|clear}_queue_congested()

On Fri 22-05-15 17:13:42, Tejun Heo wrote:
> blk_{set|clear}_queue_congested() take @q and set or clear,
> respectively, the congestion state of its bdi's root wb. Because bdi
> used to be able to handle congestion state only on the root wb, the
> callers of those functions tested whether the congestion is on the
> root blkcg and skipped if not.
>
> This is cumbersome and makes implementation of per cgroup
> bdi_writeback congestion state propagation difficult. This patch
> renames blk_{set|clear}_queue_congested() to
> blk_{set|clear}_congested(), and makes them take request_list instead
> of request_queue and test whether the specified request_list is the
> root one before updating bdi_writeback congestion state. This makes
> the tests in the callers unnecessary and simplifies them.
>
> As there are no external users of these functions, the definitions are
> moved from include/linux/blkdev.h to block/blk-core.c.
>
> This patch doesn't introduce any noticeable behavior difference.

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

BTW, I'd prefer it if this were merged with the following patch. I
wondered for a while about the condition at the beginning of
blk_clear_congested(), only to find that the following patch changes it
into the one I'd expect :)

Honza

>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> ---
> block/blk-core.c | 62 ++++++++++++++++++++++++++++++--------------------
> include/linux/blkdev.h | 19 ----------------
> 2 files changed, 37 insertions(+), 44 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index e0f726f..b457c4f 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -63,6 +63,28 @@ struct kmem_cache *blk_requestq_cachep;
> */
> static struct workqueue_struct *kblockd_workqueue;
>
> +static void blk_clear_congested(struct request_list *rl, int sync)
> +{
> + if (rl != &rl->q->root_rl)
> + return;
> +#ifdef CONFIG_CGROUP_WRITEBACK
> + clear_wb_congested(rl->blkg->wb_congested, sync);
> +#else
> + clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> +#endif
> +}
> +
> +static void blk_set_congested(struct request_list *rl, int sync)
> +{
> + if (rl != &rl->q->root_rl)
> + return;
> +#ifdef CONFIG_CGROUP_WRITEBACK
> + set_wb_congested(rl->blkg->wb_congested, sync);
> +#else
> + set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> +#endif
> +}
> +
> void blk_queue_congestion_threshold(struct request_queue *q)
> {
> int nr;
> @@ -841,13 +863,8 @@ static void __freed_request(struct request_list *rl, int sync)
> {
> struct request_queue *q = rl->q;
>
> - /*
> - * bdi isn't aware of blkcg yet. As all async IOs end up root
> - * blkcg anyway, just use root blkcg state.
> - */
> - if (rl == &q->root_rl &&
> - rl->count[sync] < queue_congestion_off_threshold(q))
> - blk_clear_queue_congested(q, sync);
> + if (rl->count[sync] < queue_congestion_off_threshold(q))
> + blk_clear_congested(rl, sync);
>
> if (rl->count[sync] + 1 <= q->nr_requests) {
> if (waitqueue_active(&rl->wait[sync]))
> @@ -880,25 +897,25 @@ static void freed_request(struct request_list *rl, unsigned int flags)
> int blk_update_nr_requests(struct request_queue *q, unsigned int nr)
> {
> struct request_list *rl;
> + int on_thresh, off_thresh;
>
> spin_lock_irq(q->queue_lock);
> q->nr_requests = nr;
> blk_queue_congestion_threshold(q);
> + on_thresh = queue_congestion_on_threshold(q);
> + off_thresh = queue_congestion_off_threshold(q);
>
> - /* congestion isn't cgroup aware and follows root blkcg for now */
> - rl = &q->root_rl;
> -
> - if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q))
> - blk_set_queue_congested(q, BLK_RW_SYNC);
> - else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q))
> - blk_clear_queue_congested(q, BLK_RW_SYNC);
> + blk_queue_for_each_rl(rl, q) {
> + if (rl->count[BLK_RW_SYNC] >= on_thresh)
> + blk_set_congested(rl, BLK_RW_SYNC);
> + else if (rl->count[BLK_RW_SYNC] < off_thresh)
> + blk_clear_congested(rl, BLK_RW_SYNC);
>
> - if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q))
> - blk_set_queue_congested(q, BLK_RW_ASYNC);
> - else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q))
> - blk_clear_queue_congested(q, BLK_RW_ASYNC);
> + if (rl->count[BLK_RW_ASYNC] >= on_thresh)
> + blk_set_congested(rl, BLK_RW_ASYNC);
> + else if (rl->count[BLK_RW_ASYNC] < off_thresh)
> + blk_clear_congested(rl, BLK_RW_ASYNC);
>
> - blk_queue_for_each_rl(rl, q) {
> if (rl->count[BLK_RW_SYNC] >= q->nr_requests) {
> blk_set_rl_full(rl, BLK_RW_SYNC);
> } else {
> @@ -1008,12 +1025,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags,
> }
> }
> }
> - /*
> - * bdi isn't aware of blkcg yet. As all async IOs end up
> - * root blkcg anyway, just use root blkcg state.
> - */
> - if (rl == &q->root_rl)
> - blk_set_queue_congested(q, is_sync);
> + blk_set_congested(rl, is_sync);
> }
>
> /*
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 89bdef0..3d1065c 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -794,25 +794,6 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
>
> extern void blk_queue_bio(struct request_queue *q, struct bio *bio);
>
> -/*
> - * A queue has just exitted congestion. Note this in the global counter of
> - * congested queues, and wake up anyone who was waiting for requests to be
> - * put back.
> - */
> -static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
> -{
> - clear_bdi_congested(&q->backing_dev_info, sync);
> -}
> -
> -/*
> - * A queue has just entered congestion. Flag that in the queue's VM-visible
> - * state flags and increment the global gounter of congested queues.
> - */
> -static inline void blk_set_queue_congested(struct request_queue *q, int sync)
> -{
> - set_bdi_congested(&q->backing_dev_info, sync);
> -}
> -
> extern void blk_start_queue(struct request_queue *q);
> extern void blk_stop_queue(struct request_queue *q);
> extern void blk_sync_queue(struct request_queue *q);
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 15:03:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 29/51] writeback, blkcg: propagate non-root blkcg congestion state

On Fri 22-05-15 17:13:43, Tejun Heo wrote:
> Now that the bdi layer can handle per-blkcg bdi_writeback_congested state,
> blk_{set|clear}_congested() can propagate non-root blkcg congestion
> state to them.
>
> This can be easily achieved by disabling the root_rl tests in
> blk_{set|clear}_congested(). Note that we still need those tests when
> !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
> wb's congestion state for events happening on other blkcgs.
>
> v2: Updated for bdi_writeback_congested.

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> ---
> block/blk-core.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index b457c4f..cf6974e 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -65,23 +65,26 @@ static struct workqueue_struct *kblockd_workqueue;
>
> static void blk_clear_congested(struct request_list *rl, int sync)
> {
> - if (rl != &rl->q->root_rl)
> - return;
> #ifdef CONFIG_CGROUP_WRITEBACK
> clear_wb_congested(rl->blkg->wb_congested, sync);
> #else
> - clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> + /*
> + * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't
> + * flip its congestion state for events on other blkcgs.
> + */
> + if (rl == &rl->q->root_rl)
> + clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> #endif
> }
>
> static void blk_set_congested(struct request_list *rl, int sync)
> {
> - if (rl != &rl->q->root_rl)
> - return;
> #ifdef CONFIG_CGROUP_WRITEBACK
> set_wb_congested(rl->blkg->wb_congested, sync);
> #else
> - set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> + /* see blk_clear_congested() */
> + if (rl == &rl->q->root_rl)
> + set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
> #endif
> }
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 15:21:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 30/51] writeback: implement and use inode_congested()

On Fri 22-05-15 17:13:44, Tejun Heo wrote:
> In several places, bdi_congested() and its wrappers are used to
> determine whether more IOs should be issued. With cgroup writeback
> support, this question can't be answered solely based on the bdi
> (backing_dev_info). It's dependent on whether the filesystem and bdi
> support cgroup writeback and the blkcg the inode is associated with.
>
> This patch implements inode_congested() and its wrappers which take
> @inode and determine the congestion state considering cgroup
> writeback. The new functions replace bdi_*congested() calls in places
> where the query is about a specific inode and task.
>
> There are several filesystem users which also fit this criteria but
> they should be updated when each filesystem implements cgroup
> writeback support.
>
> v2: Now that a given inode is associated with only one wb, congestion
> state can be determined independent from the asking task. Drop
> @task. Spotted by Vivek. Also, converted to take @inode instead
> of @mapping and renamed to inode_congested().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Vivek Goyal <[email protected]>
> ---
> fs/fs-writeback.c | 29 +++++++++++++++++++++++++++++
> include/linux/backing-dev.h | 22 ++++++++++++++++++++++
> mm/fadvise.c | 2 +-
> mm/readahead.c | 2 +-
> mm/vmscan.c | 11 +++++------
> 5 files changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 99a2440..7ec491b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -142,6 +142,35 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> wb_queue_work(wb, work);
> }
>
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +/**
> + * inode_congested - test whether an inode is congested
> + * @inode: inode to test for congestion
> + * @cong_bits: mask of WB_[a]sync_congested bits to test
> + *
> + * Tests whether @inode is congested. @cong_bits is the mask of congestion
> + * bits to test and the return value is the mask of set bits.
> + *
> + * If cgroup writeback is enabled for @inode, the congestion state is
> + * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
> + * associated with @inode is congested; otherwise, the root wb's congestion
> + * state is used.
> + */
> +int inode_congested(struct inode *inode, int cong_bits)
> +{
> + if (inode) {

Hum, is there any point in supporting a NULL inode in inode_congested()?
That would look more like a programming bug than anything else...
Otherwise the patch looks good to me, so you can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 15:42:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 31/51] writeback: implement WB_has_dirty_io wb_state flag

On Fri 22-05-15 17:13:45, Tejun Heo wrote:
> Currently, wb_has_dirty_io() determines whether a wb (bdi_writeback)
> has any dirty inode by testing all three IO lists on each invocation
> without actively keeping track. For cgroup writeback support, a
> single bdi will host multiple wb's each of which will host dirty
> inodes separately and we'll need to make bdi_has_dirty_io(), which
> currently only represents the root wb, aggregate has_dirty_io from all
> member wb's, which requires tracking transitions in has_dirty_io state
> on each wb.
>
> This patch introduces inode_wb_list_{move|del}_locked() to consolidate
> IO list operations leaving queue_io() the only other function which
> directly manipulates IO lists (via move_expired_inodes()). All three
> functions are updated to call wb_io_lists_[de]populated() which keep
> track of whether the wb has dirty inodes or not and record it using
> the new WB_has_dirty_io flag. inode_wb_list_move_locked()'s return
> value indicates whether the wb had no dirty inodes before.
>
> mark_inode_dirty() is restructured so that the return value of
> inode_wb_list_move_locked() can be used for deciding whether to wake
> up the wb.
>
> While at it, change {bdi|wb}_has_dirty_io()'s return values to bool.
> These functions were returning 0 and 1 before. Also, add a comment
> explaining the synchronization of wb_state flags.

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> v2: Updated to accommodate b_dirty_time.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 110 ++++++++++++++++++++++++++++++---------
> include/linux/backing-dev-defs.h | 1 +
> include/linux/backing-dev.h | 8 ++-
> mm/backing-dev.c | 2 +-
> 4 files changed, 91 insertions(+), 30 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 7ec491b..0a90dc55 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -93,6 +93,66 @@ static inline struct inode *wb_inode(struct list_head *head)
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
>
> +static bool wb_io_lists_populated(struct bdi_writeback *wb)
> +{
> + if (wb_has_dirty_io(wb)) {
> + return false;
> + } else {
> + set_bit(WB_has_dirty_io, &wb->state);
> + return true;
> + }
> +}
> +
> +static void wb_io_lists_depopulated(struct bdi_writeback *wb)
> +{
> + if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
> + list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
> + clear_bit(WB_has_dirty_io, &wb->state);
> +}
> +
> +/**
> + * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list
> + * @inode: inode to be moved
> + * @wb: target bdi_writeback
> + * @head: one of @wb->b_{dirty|io|more_io}
> + *
> + * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io.
> + * Returns %true if @inode is the first occupant of the !dirty_time IO
> + * lists; otherwise, %false.
> + */
> +static bool inode_wb_list_move_locked(struct inode *inode,
> + struct bdi_writeback *wb,
> + struct list_head *head)
> +{
> + assert_spin_locked(&wb->list_lock);
> +
> + list_move(&inode->i_wb_list, head);
> +
> + /* dirty_time doesn't count as dirty_io until expiration */
> + if (head != &wb->b_dirty_time)
> + return wb_io_lists_populated(wb);
> +
> + wb_io_lists_depopulated(wb);
> + return false;
> +}
> +
> +/**
> + * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list
> + * @inode: inode to be removed
> + * @wb: bdi_writeback @inode is being removed from
> + *
> + * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
> + * clear %WB_has_dirty_io if all are empty afterwards.
> + */
> +static void inode_wb_list_del_locked(struct inode *inode,
> + struct bdi_writeback *wb)
> +{
> + assert_spin_locked(&wb->list_lock);
> +
> + list_del_init(&inode->i_wb_list);
> + wb_io_lists_depopulated(wb);
> +}
> +
> static void wb_wakeup(struct bdi_writeback *wb)
> {
> spin_lock_bh(&wb->work_lock);
> @@ -217,7 +277,7 @@ void inode_wb_list_del(struct inode *inode)
> struct bdi_writeback *wb = inode_to_wb(inode);
>
> spin_lock(&wb->list_lock);
> - list_del_init(&inode->i_wb_list);
> + inode_wb_list_del_locked(inode, wb);
> spin_unlock(&wb->list_lock);
> }
>
> @@ -232,7 +292,6 @@ void inode_wb_list_del(struct inode *inode)
> */
> static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> {
> - assert_spin_locked(&wb->list_lock);
> if (!list_empty(&wb->b_dirty)) {
> struct inode *tail;
>
> @@ -240,7 +299,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> if (time_before(inode->dirtied_when, tail->dirtied_when))
> inode->dirtied_when = jiffies;
> }
> - list_move(&inode->i_wb_list, &wb->b_dirty);
> + inode_wb_list_move_locked(inode, wb, &wb->b_dirty);
> }
>
> /*
> @@ -248,8 +307,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> */
> static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
> {
> - assert_spin_locked(&wb->list_lock);
> - list_move(&inode->i_wb_list, &wb->b_more_io);
> + inode_wb_list_move_locked(inode, wb, &wb->b_more_io);
> }
>
> static void inode_sync_complete(struct inode *inode)
> @@ -358,6 +416,8 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work)
> moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work);
> moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io,
> EXPIRE_DIRTY_ATIME, work);
> + if (moved)
> + wb_io_lists_populated(wb);
> trace_writeback_queue_io(wb, work, moved);
> }
>
> @@ -483,10 +543,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
> redirty_tail(inode, wb);
> } else if (inode->i_state & I_DIRTY_TIME) {
> inode->dirtied_when = jiffies;
> - list_move(&inode->i_wb_list, &wb->b_dirty_time);
> + inode_wb_list_move_locked(inode, wb, &wb->b_dirty_time);
> } else {
> /* The inode is clean. Remove from writeback lists. */
> - list_del_init(&inode->i_wb_list);
> + inode_wb_list_del_locked(inode, wb);
> }
> }
>
> @@ -628,7 +688,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> * touch it. See comment above for explanation.
> */
> if (!(inode->i_state & I_DIRTY_ALL))
> - list_del_init(&inode->i_wb_list);
> + inode_wb_list_del_locked(inode, wb);
> spin_unlock(&wb->list_lock);
> inode_sync_complete(inode);
> out:
> @@ -1327,37 +1387,39 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> * reposition it (that would break b_dirty time-ordering).
> */
> if (!was_dirty) {
> + struct list_head *dirty_list;
> bool wakeup_bdi = false;
> bdi = inode_to_bdi(inode);
>
> spin_unlock(&inode->i_lock);
> spin_lock(&bdi->wb.list_lock);
> - if (bdi_cap_writeback_dirty(bdi)) {
> - WARN(!test_bit(WB_registered, &bdi->wb.state),
> - "bdi-%s not registered\n", bdi->name);
>
> - /*
> - * If this is the first dirty inode for this
> - * bdi, we have to wake-up the corresponding
> - * bdi thread to make sure background
> - * write-back happens later.
> - */
> - if (!wb_has_dirty_io(&bdi->wb))
> - wakeup_bdi = true;
> - }
> + WARN(bdi_cap_writeback_dirty(bdi) &&
> + !test_bit(WB_registered, &bdi->wb.state),
> + "bdi-%s not registered\n", bdi->name);
>
> inode->dirtied_when = jiffies;
> if (dirtytime)
> inode->dirtied_time_when = jiffies;
> +
> if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
> - list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> + dirty_list = &bdi->wb.b_dirty;
> else
> - list_move(&inode->i_wb_list,
> - &bdi->wb.b_dirty_time);
> + dirty_list = &bdi->wb.b_dirty_time;
> +
> + wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
> + dirty_list);
> +
> spin_unlock(&bdi->wb.list_lock);
> trace_writeback_dirty_inode_enqueue(inode);
>
> - if (wakeup_bdi)
> + /*
> + * If this is the first dirty inode for this bdi,
> + * we have to wake-up the corresponding bdi thread
> + * to make sure background write-back happens
> + * later.
> + */
> + if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
> wb_wakeup_delayed(&bdi->wb);
> return;
> }
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index eb38676..7a94b78 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -21,6 +21,7 @@ struct dentry;
> enum wb_state {
> WB_registered, /* bdi_register() was done */
> WB_writeback_running, /* Writeback is in progress */
> + WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */
> };
>
> enum wb_congested_state {
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 6f08821..3c8403c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -29,7 +29,7 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> enum wb_reason reason);
> void bdi_start_background_writeback(struct backing_dev_info *bdi);
> void wb_workfn(struct work_struct *work);
> -int bdi_has_dirty_io(struct backing_dev_info *bdi);
> +bool bdi_has_dirty_io(struct backing_dev_info *bdi);
> void wb_wakeup_delayed(struct bdi_writeback *wb);
>
> extern spinlock_t bdi_lock;
> @@ -37,11 +37,9 @@ extern struct list_head bdi_list;
>
> extern struct workqueue_struct *bdi_wq;
>
> -static inline int wb_has_dirty_io(struct bdi_writeback *wb)
> +static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
> {
> - return !list_empty(&wb->b_dirty) ||
> - !list_empty(&wb->b_io) ||
> - !list_empty(&wb->b_more_io);
> + return test_bit(WB_has_dirty_io, &wb->state);
> }
>
> static inline void __add_wb_stat(struct bdi_writeback *wb,
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 5029c4a..161ddf1 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -256,7 +256,7 @@ static int __init default_bdi_init(void)
> }
> subsys_initcall(default_bdi_init);
>
> -int bdi_has_dirty_io(struct backing_dev_info *bdi)
> +bool bdi_has_dirty_io(struct backing_dev_info *bdi)
> {
> return wb_has_dirty_io(&bdi->wb);
> }
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 16:15:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 32/51] writeback: implement backing_dev_info->tot_write_bandwidth

On Fri 22-05-15 17:13:46, Tejun Heo wrote:
> cgroup writeback support needs to keep track of the sum of
> avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
> distribute write workload. This patch adds bdi->tot_write_bandwidth
> and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
> and wb_update_write_bandwidth() to adjust it as wb's gain and lose
> dirty inodes and its avg_write_bandwidth gets updated.
>
> As the update events are not synchronized with each other,
> bdi->tot_write_bandwidth is an atomic_long_t.

So I was looking into what tot_write_bandwidth is used for, and if I read
the code correctly, it is used for bdi_has_dirty_io() and for distributing
dirty pages when writeback is started against the whole bdi.

Now, neither of these cases seems to be really performance critical (in
all of them we iterate the list of all wbs of the bdi anyway), so why
don't we just compute the total write bandwidth when needed, instead of
maintaining it all the time?
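
For illustration, the on-demand computation could look roughly like this
(assuming some way to iterate all wbs of a bdi; the list and member names
below are illustrative, not taken from the series):

	/* sum avg_write_bandwidth over all wbs that currently have dirty IO */
	static unsigned long bdi_tot_write_bandwidth(struct backing_dev_info *bdi)
	{
		struct bdi_writeback *wb;
		unsigned long sum = 0;

		rcu_read_lock();
		list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node)
			if (wb_has_dirty_io(wb))
				sum += wb->avg_write_bandwidth;
		rcu_read_unlock();

		return sum;
	}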

Honza

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 7 ++++++-
> include/linux/backing-dev-defs.h | 2 ++
> mm/page-writeback.c | 3 +++
> 3 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 0a90dc55..bbccf68 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -99,6 +99,8 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
> return false;
> } else {
> set_bit(WB_has_dirty_io, &wb->state);
> + atomic_long_add(wb->avg_write_bandwidth,
> + &wb->bdi->tot_write_bandwidth);
> return true;
> }
> }
> @@ -106,8 +108,11 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
> static void wb_io_lists_depopulated(struct bdi_writeback *wb)
> {
> if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
> - list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
> + list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
> clear_bit(WB_has_dirty_io, &wb->state);
> + atomic_long_sub(wb->avg_write_bandwidth,
> + &wb->bdi->tot_write_bandwidth);
> + }
> }
>
> /**
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 7a94b78..d631a61 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -142,6 +142,8 @@ struct backing_dev_info {
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
>
> + atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
> +
> struct bdi_writeback wb; /* the root writeback info for this bdi */
> struct bdi_writeback_congested wb_congested; /* its congested state */
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e31dea9..c95eb24 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
> avg += (old - avg) >> 3;
>
> out:
> + if (wb_has_dirty_io(wb))
> + atomic_long_add(avg - wb->avg_write_bandwidth,
> + &wb->bdi->tot_write_bandwidth);
> wb->write_bandwidth = bw;
> wb->avg_write_bandwidth = avg;
> }
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 16:18:21

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 34/51] writeback: don't issue wb_writeback_work if clean

On Fri 22-05-15 17:13:48, Tejun Heo wrote:
> There are several places in fs/fs-writeback.c which queue
> wb_writeback_work without checking whether the target wb
> (bdi_writeback) has any dirty inodes. The only thing wb_writeback_work
> does is write back the dirty inodes of the target wb, so queueing a
> work item for a clean wb is essentially a noop. There are some side
> effects, such as bandwidth stats being updated and tracepoints being
> triggered, but these don't affect the operation in any meaningful way.
>
> This patch makes all writeback_inodes_sb_nr() and sync_inodes_sb()
> skip wb_queue_work() if the target bdi is clean. Also, it moves
> dirtiness check from wakeup_flusher_threads() to
> __wb_start_writeback() so that all its callers benefit from the check.
>
> While the overhead incurred by scheduling a noop work isn't currently
> significant, the overhead may be higher with cgroup writeback support
> as we may end up issuing noop work items to a lot of clean wb's.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 18 ++++++++++--------
> 1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index c98d392..921a9e4 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -189,6 +189,9 @@ static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> {
> struct wb_writeback_work *work;
>
> + if (!wb_has_dirty_io(wb))
> + return;
> +
> /*
> * This is WB_SYNC_NONE writeback, so if allocation fails just
> * wakeup the thread for old dirty data writeback
> @@ -1215,11 +1218,8 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
> nr_pages = get_nr_dirty_pages();
>
> rcu_read_lock();
> - list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> - if (!bdi_has_dirty_io(bdi))
> - continue;
> + list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> - }
> rcu_read_unlock();
> }
>
> @@ -1512,11 +1512,12 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> .nr_pages = nr,
> .reason = reason,
> };
> + struct backing_dev_info *bdi = sb->s_bdi;
>
> - if (sb->s_bdi == &noop_backing_dev_info)
> + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
> - wb_queue_work(&sb->s_bdi->wb, &work);
> + wb_queue_work(&bdi->wb, &work);
> wait_for_completion(&done);
> }
> EXPORT_SYMBOL(writeback_inodes_sb_nr);
> @@ -1594,13 +1595,14 @@ void sync_inodes_sb(struct super_block *sb)
> .reason = WB_REASON_SYNC,
> .for_sync = 1,
> };
> + struct backing_dev_info *bdi = sb->s_bdi;
>
> /* Nothing to do? */
> - if (sb->s_bdi == &noop_backing_dev_info)
> + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info)
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - wb_queue_work(&sb->s_bdi->wb, &work);
> + wb_queue_work(&bdi->wb, &work);
> wait_for_completion(&done);
>
> wait_sb_inodes(sb);
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 16:42:26

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 32/51] writeback: implement backing_dev_info->tot_write_bandwidth

On Tue 30-06-15 18:14:58, Jan Kara wrote:
> On Fri 22-05-15 17:13:46, Tejun Heo wrote:
> > cgroup writeback support needs to keep track of the sum of
> > avg_write_bandwidth of all wb's (bdi_writeback's) with dirty inodes to
> > distribute write workload. This patch adds bdi->tot_write_bandwidth
> > and updates inode_wb_list_move_locked(), inode_wb_list_del_locked()
> > and wb_update_write_bandwidth() to adjust it as wb's gain and lose
> > dirty inodes and its avg_write_bandwidth gets updated.
> >
> > As the update events are not synchronized with each other,
> > bdi->tot_write_bandwidth is an atomic_long_t.
>
> So I was looking into what tot_write_bandwidth is used for and if I look
> right it is used for bdi_has_dirty_io() and for distribution of dirty pages
> when writeback is started against the whole bdi.
>
> Now neither of these cases seem to be really performance critical (in all
> the cases we iterate the list of all wbs of the bdi anyway) so why don't we
> just compute the total write bandwidth when needed, instead of maintaining
> it all the time?

OK, now I realized that tot_write_bandwidth is also used in the
computation of the dirty limit for a memcg, and that gets called pretty
often, so maintaining the total bandwidth probably pays off.

I was also wondering whether it wouldn't be better to maintain writeout
fractions per wb instead of per bdi, since summing average writeback
bandwidths seems somewhat hacky, but what you do seems good enough for
now. We can always improve on that later once we see how things work in
practice.

You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
>
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > ---
> > fs/fs-writeback.c | 7 ++++++-
> > include/linux/backing-dev-defs.h | 2 ++
> > mm/page-writeback.c | 3 +++
> > 3 files changed, 11 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 0a90dc55..bbccf68 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -99,6 +99,8 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
> > return false;
> > } else {
> > set_bit(WB_has_dirty_io, &wb->state);
> > + atomic_long_add(wb->avg_write_bandwidth,
> > + &wb->bdi->tot_write_bandwidth);
> > return true;
> > }
> > }
> > @@ -106,8 +108,11 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
> > static void wb_io_lists_depopulated(struct bdi_writeback *wb)
> > {
> > if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
> > - list_empty(&wb->b_io) && list_empty(&wb->b_more_io))
> > + list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
> > clear_bit(WB_has_dirty_io, &wb->state);
> > + atomic_long_sub(wb->avg_write_bandwidth,
> > + &wb->bdi->tot_write_bandwidth);
> > + }
> > }
> >
> > /**
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index 7a94b78..d631a61 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -142,6 +142,8 @@ struct backing_dev_info {
> > unsigned int min_ratio;
> > unsigned int max_ratio, max_prop_frac;
> >
> > + atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
> > +
> > struct bdi_writeback wb; /* the root writeback info for this bdi */
> > struct bdi_writeback_congested wb_congested; /* its congested state */
> > #ifdef CONFIG_CGROUP_WRITEBACK
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index e31dea9..c95eb24 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -881,6 +881,9 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
> > avg += (old - avg) >> 3;
> >
> > out:
> > + if (wb_has_dirty_io(wb))
> > + atomic_long_add(avg - wb->avg_write_bandwidth,
> > + &wb->bdi->tot_write_bandwidth);
> > wb->write_bandwidth = bw;
> > wb->avg_write_bandwidth = avg;
> > }
> > --
> > 2.4.0
> >
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-06-30 16:48:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 33/51] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account

On Fri 22-05-15 17:13:47, Tejun Heo wrote:
> bdi_has_dirty_io() used to only reflect whether the root wb
> (bdi_writeback) has dirty inodes. For cgroup writeback support, it
> needs to take all active wb's into account. If any wb on the bdi has
> dirty inodes, bdi_has_dirty_io() should return true.
>
> To achieve that, as inode_wb_list_{move|del}_locked() now keep track
> of the dirty state transition of each wb, the number of dirty wbs can
> be counted in the bdi; however, bdi is already aggregating
> wb->avg_write_bandwidth which can easily be guaranteed to be > 0 when
> there are any dirty inodes by ensuring wb->avg_write_bandwidth can't
> dip below 1. bdi_has_dirty_io() can simply test whether
> bdi->tot_write_bandwidth is zero or not.
>
> While this bumps the value of wb->avg_write_bandwidth to one when it
> used to be zero, this shouldn't cause any meaningful behavior
> difference.
>
> bdi_has_dirty_io() is made an inline function which tests whether
> ->tot_write_bandwidth is non-zero. Also, WARN_ON_ONCE()'s on its
> value are added to inode_wb_list_{move|del}_locked().

It looks OK although I find using total write bandwidth to detect whether
any wb has any dirty IO rather hacky. Frankly I'd prefer to just iterate
all wbs from bdi_has_dirty_io() since that isn't performance critical
and we iterate all wbs in those paths anyway... Hmm?
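
For concreteness, the iteration-based variant suggested above could look
roughly like the sketch below (it assumes the bdi_for_each_wb() helper added
later in this series and is not the version that was merged):

	static bool bdi_has_dirty_io(struct backing_dev_info *bdi)
	{
		struct bdi_writeback *wb;
		struct wb_iter iter;

		/* walk every wb on the bdi and report dirty IO if any has it */
		bdi_for_each_wb(wb, bdi, &iter, 0)
			if (wb_has_dirty_io(wb))
				return true;
		return false;
	}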

Honza

> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 5 +++--
> include/linux/backing-dev-defs.h | 8 ++++++--
> include/linux/backing-dev.h | 10 +++++++++-
> mm/backing-dev.c | 5 -----
> mm/page-writeback.c | 10 +++++++---
> 5 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bbccf68..c98d392 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -99,6 +99,7 @@ static bool wb_io_lists_populated(struct bdi_writeback *wb)
> return false;
> } else {
> set_bit(WB_has_dirty_io, &wb->state);
> + WARN_ON_ONCE(!wb->avg_write_bandwidth);
> atomic_long_add(wb->avg_write_bandwidth,
> &wb->bdi->tot_write_bandwidth);
> return true;
> @@ -110,8 +111,8 @@ static void wb_io_lists_depopulated(struct bdi_writeback *wb)
> if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) &&
> list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) {
> clear_bit(WB_has_dirty_io, &wb->state);
> - atomic_long_sub(wb->avg_write_bandwidth,
> - &wb->bdi->tot_write_bandwidth);
> + WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth,
> + &wb->bdi->tot_write_bandwidth) < 0);
> }
> }
>
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index d631a61..8c857d7 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -98,7 +98,7 @@ struct bdi_writeback {
> unsigned long dirtied_stamp;
> unsigned long written_stamp; /* pages written at bw_time_stamp */
> unsigned long write_bandwidth; /* the estimated write bandwidth */
> - unsigned long avg_write_bandwidth; /* further smoothed write bw */
> + unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */
>
> /*
> * The base dirty throttle rate, re-calculated on every 200ms.
> @@ -142,7 +142,11 @@ struct backing_dev_info {
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
>
> - atomic_long_t tot_write_bandwidth; /* sum of active avg_write_bw */
> + /*
> + * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are
> + * any dirty wbs, which is depended upon by bdi_has_dirty().
> + */
> + atomic_long_t tot_write_bandwidth;
>
> struct bdi_writeback wb; /* the root writeback info for this bdi */
> struct bdi_writeback_congested wb_congested; /* its congested state */
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 3c8403c..0839e44 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -29,7 +29,6 @@ void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> enum wb_reason reason);
> void bdi_start_background_writeback(struct backing_dev_info *bdi);
> void wb_workfn(struct work_struct *work);
> -bool bdi_has_dirty_io(struct backing_dev_info *bdi);
> void wb_wakeup_delayed(struct bdi_writeback *wb);
>
> extern spinlock_t bdi_lock;
> @@ -42,6 +41,15 @@ static inline bool wb_has_dirty_io(struct bdi_writeback *wb)
> return test_bit(WB_has_dirty_io, &wb->state);
> }
>
> +static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi)
> +{
> + /*
> + * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are
> + * any dirty wbs. See wb_update_write_bandwidth().
> + */
> + return atomic_long_read(&bdi->tot_write_bandwidth);
> +}
> +
> static inline void __add_wb_stat(struct bdi_writeback *wb,
> enum wb_stat_item item, s64 amount)
> {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 161ddf1..d2f16fc9 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -256,11 +256,6 @@ static int __init default_bdi_init(void)
> }
> subsys_initcall(default_bdi_init);
>
> -bool bdi_has_dirty_io(struct backing_dev_info *bdi)
> -{
> - return wb_has_dirty_io(&bdi->wb);
> -}
> -
> /*
> * This function is used when the first inode for this wb is marked dirty. It
> * wakes-up the corresponding bdi thread which should then take care of the
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index c95eb24..99b8846 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -881,9 +881,13 @@ static void wb_update_write_bandwidth(struct bdi_writeback *wb,
> avg += (old - avg) >> 3;
>
> out:
> - if (wb_has_dirty_io(wb))
> - atomic_long_add(avg - wb->avg_write_bandwidth,
> - &wb->bdi->tot_write_bandwidth);
> + /* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */
> + avg = max(avg, 1LU);
> + if (wb_has_dirty_io(wb)) {
> + long delta = avg - wb->avg_write_bandwidth;
> + WARN_ON_ONCE(atomic_long_add_return(delta,
> + &wb->bdi->tot_write_bandwidth) <= 0);
> + }
> wb->write_bandwidth = bw;
> wb->avg_write_bandwidth = avg;
> }
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 07:00:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 35/51] writeback: make bdi->min/max_ratio handling cgroup writeback aware

On Fri 22-05-15 17:13:49, Tejun Heo wrote:
> bdi->min/max_ratio are user-configurable per-bdi knobs which regulate
> dirty limit of each bdi. For cgroup writeback, they need to be
> further distributed across wb's (bdi_writeback's) belonging to the
> configured bdi.
>
> This patch introduces wb_min_max_ratio() which distributes
> bdi->min/max_ratio according to a wb's proportion in the total active
> bandwidth of its bdi.
>
> v2: Update wb_min_max_ratio() to fix a bug where both min and max were
> assigned the min value and avoid calculations when possible.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/page-writeback.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 46 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 99b8846..9b55f12 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -155,6 +155,46 @@ static unsigned long writeout_period_time = 0;
> */
> #define VM_COMPLETIONS_PERIOD_LEN (3*HZ)
>
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +static void wb_min_max_ratio(struct bdi_writeback *wb,
> + unsigned long *minp, unsigned long *maxp)
> +{
> + unsigned long this_bw = wb->avg_write_bandwidth;
> + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
> + unsigned long long min = wb->bdi->min_ratio;
> + unsigned long long max = wb->bdi->max_ratio;
> +
> + /*
> + * @wb may already be clean by the time control reaches here and
> + * the total may not include its bw.
> + */
> + if (this_bw < tot_bw) {
> + if (min) {
> + min *= this_bw;
> + do_div(min, tot_bw);
> + }
> + if (max < 100) {
> + max *= this_bw;
> + do_div(max, tot_bw);
> + }
> + }
> +
> + *minp = min;
> + *maxp = max;
> +}
> +
> +#else /* CONFIG_CGROUP_WRITEBACK */
> +
> +static void wb_min_max_ratio(struct bdi_writeback *wb,
> + unsigned long *minp, unsigned long *maxp)
> +{
> + *minp = wb->bdi->min_ratio;
> + *maxp = wb->bdi->max_ratio;
> +}
> +
> +#endif /* CONFIG_CGROUP_WRITEBACK */
> +
> /*
> * In a memory zone, there is a certain amount of pages we consider
> * available for the page cache, which is essentially the number of
> @@ -539,9 +579,9 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> */
> unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
> {
> - struct backing_dev_info *bdi = wb->bdi;
> u64 wb_dirty;
> long numerator, denominator;
> + unsigned long wb_min_ratio, wb_max_ratio;
>
> /*
> * Calculate this BDI's share of the dirty ratio.
> @@ -552,9 +592,11 @@ unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
> wb_dirty *= numerator;
> do_div(wb_dirty, denominator);
>
> - wb_dirty += (dirty * bdi->min_ratio) / 100;
> - if (wb_dirty > (dirty * bdi->max_ratio) / 100)
> - wb_dirty = dirty * bdi->max_ratio / 100;
> + wb_min_max_ratio(wb, &wb_min_ratio, &wb_max_ratio);
> +
> + wb_dirty += (dirty * wb_min_ratio) / 100;
> + if (wb_dirty > (dirty * wb_max_ratio) / 100)
> + wb_dirty = dirty * wb_max_ratio / 100;
>
> return wb_dirty;
> }
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
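
A quick worked example for wb_min_max_ratio() above (illustrative numbers
only, not taken from the patch): with bdi->min_ratio = 10, bdi->max_ratio = 40
and a wb whose avg_write_bandwidth is a quarter of the bdi's
tot_write_bandwidth, the scaled results are min = 10 * this_bw / tot_bw = 2
and max = 40 * this_bw / tot_bw = 10; a wb holding all of the active bandwidth
(this_bw >= tot_bw) keeps the unscaled 10 and 40, as does max when it is
already 100.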

2015-07-01 07:28:13

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 36/51] writeback: implement bdi_for_each_wb()

On Fri 22-05-15 17:13:50, Tejun Heo wrote:
> This will be used to implement bdi-wide operations which should be
> distributed across all its cgroup bdi_writebacks.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

One comment below.

> @@ -445,6 +500,14 @@ static inline void wb_blkcg_offline(struct blkcg *blkcg)
> {
> }
>
> +struct wb_iter {
> + int next_id;
> +};
> +
> +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
> + for ((iter)->next_id = (start_blkcg_id); \
> + ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
> +

This looks quite confusing. Won't it be easier to understand as:

struct wb_iter {
} __attribute__ ((unused));

#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
if (((wb_cur) = (!start_blkcg_id ? &(bdi)->wb : NULL)))

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 07:30:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 37/51] writeback: remove bdi_start_writeback()

On Fri 22-05-15 17:13:51, Tejun Heo wrote:
> bdi_start_writeback() is a thin wrapper on top of
> __wb_start_writeback() which is used only by laptop_mode_timer_fn().
> This patches removes bdi_start_writeback(), renames
> __wb_start_writeback() to wb_start_writeback() and makes
> laptop_mode_timer_fn() use it instead.
>
> This doesn't cause any functional difference and will ease making
> laptop_mode_timer_fn() cgroup writeback aware.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 68 +++++++++++++++++----------------------------
> include/linux/backing-dev.h | 4 +--
> mm/page-writeback.c | 4 +--
> 3 files changed, 29 insertions(+), 47 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 921a9e4..79f11af 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -184,33 +184,6 @@ static void wb_queue_work(struct bdi_writeback *wb,
> spin_unlock_bh(&wb->work_lock);
> }
>
> -static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> - bool range_cyclic, enum wb_reason reason)
> -{
> - struct wb_writeback_work *work;
> -
> - if (!wb_has_dirty_io(wb))
> - return;
> -
> - /*
> - * This is WB_SYNC_NONE writeback, so if allocation fails just
> - * wakeup the thread for old dirty data writeback
> - */
> - work = kzalloc(sizeof(*work), GFP_ATOMIC);
> - if (!work) {
> - trace_writeback_nowork(wb->bdi);
> - wb_wakeup(wb);
> - return;
> - }
> -
> - work->sync_mode = WB_SYNC_NONE;
> - work->nr_pages = nr_pages;
> - work->range_cyclic = range_cyclic;
> - work->reason = reason;
> -
> - wb_queue_work(wb, work);
> -}
> -
> #ifdef CONFIG_CGROUP_WRITEBACK
>
> /**
> @@ -240,22 +213,31 @@ EXPORT_SYMBOL_GPL(inode_congested);
>
> #endif /* CONFIG_CGROUP_WRITEBACK */
>
> -/**
> - * bdi_start_writeback - start writeback
> - * @bdi: the backing device to write from
> - * @nr_pages: the number of pages to write
> - * @reason: reason why some writeback work was initiated
> - *
> - * Description:
> - * This does WB_SYNC_NONE opportunistic writeback. The IO is only
> - * started when this function returns, we make no guarantees on
> - * completion. Caller need not hold sb s_umount semaphore.
> - *
> - */
> -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> - enum wb_reason reason)
> +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> + bool range_cyclic, enum wb_reason reason)
> {
> - __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
> + struct wb_writeback_work *work;
> +
> + if (!wb_has_dirty_io(wb))
> + return;
> +
> + /*
> + * This is WB_SYNC_NONE writeback, so if allocation fails just
> + * wakeup the thread for old dirty data writeback
> + */
> + work = kzalloc(sizeof(*work), GFP_ATOMIC);
> + if (!work) {
> + trace_writeback_nowork(wb->bdi);
> + wb_wakeup(wb);
> + return;
> + }
> +
> + work->sync_mode = WB_SYNC_NONE;
> + work->nr_pages = nr_pages;
> + work->range_cyclic = range_cyclic;
> + work->reason = reason;
> +
> + wb_queue_work(wb, work);
> }
>
> /**
> @@ -1219,7 +1201,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
>
> rcu_read_lock();
> list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> - __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> + wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> rcu_read_unlock();
> }
>
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c797980..0ff40c2 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -25,8 +25,8 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
> int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
> void bdi_unregister(struct backing_dev_info *bdi);
> int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
> -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> - enum wb_reason reason);
> +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> + bool range_cyclic, enum wb_reason reason);
> void bdi_start_background_writeback(struct backing_dev_info *bdi);
> void wb_workfn(struct work_struct *work);
> void wb_wakeup_delayed(struct bdi_writeback *wb);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 9b55f12..6301af2 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1729,8 +1729,8 @@ void laptop_mode_timer_fn(unsigned long data)
> * threshold
> */
> if (bdi_has_dirty_io(&q->backing_dev_info))
> - bdi_start_writeback(&q->backing_dev_info, nr_pages,
> - WB_REASON_LAPTOP_TIMER);
> + wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
> + WB_REASON_LAPTOP_TIMER);
> }
>
> /*
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 07:32:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 38/51] writeback: make laptop_mode_timer_fn() handle multiple bdi_writeback's

On Fri 22-05-15 17:13:52, Tejun Heo wrote:
> For cgroup writeback support, all bdi-wide operations should be
> distributed to all its wb's (bdi_writeback's).
>
> This patch updates laptop_mode_timer_fn() so that it invokes
> wb_start_writeback() on all wb's rather than just the root one. As
> the intent is writing out all dirty data, there's no reason to split
> the number of pages to write.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> mm/page-writeback.c | 12 +++++++++---
> 1 file changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 6301af2..682e3a6 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1723,14 +1723,20 @@ void laptop_mode_timer_fn(unsigned long data)
> struct request_queue *q = (struct request_queue *)data;
> int nr_pages = global_page_state(NR_FILE_DIRTY) +
> global_page_state(NR_UNSTABLE_NFS);
> + struct bdi_writeback *wb;
> + struct wb_iter iter;
>
> /*
> * We want to write everything out, not just down to the dirty
> * threshold
> */
> - if (bdi_has_dirty_io(&q->backing_dev_info))
> - wb_start_writeback(&q->backing_dev_info.wb, nr_pages, true,
> - WB_REASON_LAPTOP_TIMER);
> + if (!bdi_has_dirty_io(&q->backing_dev_info))
> + return;
> +
> + bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0)
> + if (wb_has_dirty_io(wb))
> + wb_start_writeback(wb, nr_pages, true,
> + WB_REASON_LAPTOP_TIMER);
> }
>
> /*
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 07:47:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 39/51] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info

On Fri 22-05-15 17:13:53, Tejun Heo wrote:
> writeback_in_progress() currently takes @bdi and returns whether
> writeback is in progress on its root wb (bdi_writeback). In
> preparation for cgroup writeback support, make it take wb instead.
> While at it, make it an inline function.
>
> This patch doesn't make any functional difference.

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

BTW: It would have been easier for me to review this if e.g. the move from
the bdi to the wb parameter had been split across fewer patches. The
intermediate state where some functions call partly bdi and partly wb
functions is strange and
it always makes me go search in the series whether the other part of the
function gets converted and whether they play well together...

Honza

>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 15 +--------------
> include/linux/backing-dev.h | 12 +++++++++++-
> mm/page-writeback.c | 4 ++--
> 3 files changed, 14 insertions(+), 17 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 79f11af..45baf6c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -65,19 +65,6 @@ struct wb_writeback_work {
> */
> unsigned int dirtytime_expire_interval = 12 * 60 * 60;
>
> -/**
> - * writeback_in_progress - determine whether there is writeback in progress
> - * @bdi: the device's backing_dev_info structure.
> - *
> - * Determine whether there is writeback waiting to be handled against a
> - * backing device.
> - */
> -int writeback_in_progress(struct backing_dev_info *bdi)
> -{
> - return test_bit(WB_writeback_running, &bdi->wb.state);
> -}
> -EXPORT_SYMBOL(writeback_in_progress);
> -
> static inline struct inode *wb_inode(struct list_head *head)
> {
> return list_entry(head, struct inode, i_wb_list);
> @@ -1532,7 +1519,7 @@ int try_to_writeback_inodes_sb_nr(struct super_block *sb,
> unsigned long nr,
> enum wb_reason reason)
> {
> - if (writeback_in_progress(sb->s_bdi))
> + if (writeback_in_progress(&sb->s_bdi->wb))
> return 1;
>
> if (!down_read_trylock(&sb->s_umount))
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 0ff40c2..f04956c 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -156,7 +156,17 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
>
> extern struct backing_dev_info noop_backing_dev_info;
>
> -int writeback_in_progress(struct backing_dev_info *bdi);
> +/**
> + * writeback_in_progress - determine whether there is writeback in progress
> + * @wb: bdi_writeback of interest
> + *
> + * Determine whether there is writeback waiting to be handled against a
> + * bdi_writeback.
> + */
> +static inline bool writeback_in_progress(struct bdi_writeback *wb)
> +{
> + return test_bit(WB_writeback_running, &wb->state);
> +}
>
> static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
> {
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 682e3a6..e3b5c1d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1455,7 +1455,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> break;
> }
>
> - if (unlikely(!writeback_in_progress(bdi)))
> + if (unlikely(!writeback_in_progress(wb)))
> bdi_start_background_writeback(bdi);
>
> if (!strictlimit)
> @@ -1573,7 +1573,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> if (!dirty_exceeded && wb->dirty_exceeded)
> wb->dirty_exceeded = 0;
>
> - if (writeback_in_progress(bdi))
> + if (writeback_in_progress(wb))
> return;
>
> /*
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 07:50:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 40/51] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info

On Fri 22-05-15 17:13:54, Tejun Heo wrote:
> bdi_start_background_writeback() currently takes @bdi and kicks the
> root wb (bdi_writeback). In preparation for cgroup writeback support,
> make it take wb instead.
>
> This patch doesn't make any functional difference.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 12 ++++++------
> include/linux/backing-dev.h | 2 +-
> mm/page-writeback.c | 4 ++--
> 3 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 45baf6c..92aaf64 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -228,23 +228,23 @@ void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> }
>
> /**
> - * bdi_start_background_writeback - start background writeback
> - * @bdi: the backing device to write from
> + * wb_start_background_writeback - start background writeback
> + * @wb: bdi_writback to write from
> *
> * Description:
> * This makes sure WB_SYNC_NONE background writeback happens. When
> - * this function returns, it is only guaranteed that for given BDI
> + * this function returns, it is only guaranteed that for given wb
> * some IO is happening if we are over background dirty threshold.
> * Caller need not hold sb s_umount semaphore.
> */
> -void bdi_start_background_writeback(struct backing_dev_info *bdi)
> +void wb_start_background_writeback(struct bdi_writeback *wb)
> {
> /*
> * We just wake up the flusher thread. It will perform background
> * writeback as soon as there is no other work to do.
> */
> - trace_writeback_wake_background(bdi);
> - wb_wakeup(&bdi->wb);
> + trace_writeback_wake_background(wb->bdi);
> + wb_wakeup(wb);

Can we add a memcg id of the wb to the tracepoint please? Because just bdi
needn't be enough when debugging stuff...

Otherwise the patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
> }
>
> /*
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index f04956c..9cc11e5 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -27,7 +27,7 @@ void bdi_unregister(struct backing_dev_info *bdi);
> int __must_check bdi_setup_and_register(struct backing_dev_info *, char *);
> void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> bool range_cyclic, enum wb_reason reason);
> -void bdi_start_background_writeback(struct backing_dev_info *bdi);
> +void wb_start_background_writeback(struct bdi_writeback *wb);
> void wb_workfn(struct work_struct *work);
> void wb_wakeup_delayed(struct bdi_writeback *wb);
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e3b5c1d..70cf98d 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1456,7 +1456,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> }
>
> if (unlikely(!writeback_in_progress(wb)))
> - bdi_start_background_writeback(bdi);
> + wb_start_background_writeback(wb);
>
> if (!strictlimit)
> wb_dirty_limits(wb, dirty_thresh, background_thresh,
> @@ -1588,7 +1588,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> return;
>
> if (nr_reclaimable > background_thresh)
> - bdi_start_background_writeback(bdi);
> + wb_start_background_writeback(wb);
> }
>
> static DEFINE_PER_CPU(int, bdp_ratelimits);
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 08:15:45

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

On Fri 22-05-15 17:13:55, Tejun Heo wrote:
> wakeup_flusher_threads() currently only starts writeback on the root
> wb (bdi_writeback). For cgroup writeback support, update the function
> to wake up all wbs and distribute the number of pages to write
> according to the proportion of each wb's write bandwidth, which is
> implemented in wb_split_bdi_pages().
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

I was looking at who uses wakeup_flusher_threads(). There are two usecases:

1) sync() - we want to writeback everything
2) We want to relieve memory pressure by cleaning and subsequently
reclaiming pages.

Neither of these cares about the number of pages too much if you write enough.
So, similarly to how we don't split the passed nr_pages argument among bdis, I
wouldn't split the nr_pages among wbs. Just pass the nr_pages to each wb
and be done with that...
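
For illustration, the simpler variant being suggested would boil down to
something like the following sketch (not actual patch code; identifiers are
the ones used elsewhere in the series):

	bdi_for_each_wb(wb, bdi, &iter, 0)
		wb_start_writeback(wb, nr_pages, false, reason);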

Honza

> ---
> fs/fs-writeback.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 46 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 92aaf64..508e10c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -198,6 +198,41 @@ int inode_congested(struct inode *inode, int cong_bits)
> }
> EXPORT_SYMBOL_GPL(inode_congested);
>
> +/**
> + * wb_split_bdi_pages - split nr_pages to write according to bandwidth
> + * @wb: target bdi_writeback to split @nr_pages to
> + * @nr_pages: number of pages to write for the whole bdi
> + *
> + * Split @wb's portion of @nr_pages according to @wb's write bandwidth in
> + * relation to the total write bandwidth of all wb's w/ dirty inodes on
> + * @wb->bdi.
> + */
> +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
> +{
> + unsigned long this_bw = wb->avg_write_bandwidth;
> + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);
> +
> + if (nr_pages == LONG_MAX)
> + return LONG_MAX;
> +
> + /*
> + * This may be called on clean wb's and proportional distribution
> + * may not make sense, just use the original @nr_pages in those
> + * cases. In general, we wanna err on the side of writing more.
> + */
> + if (!tot_bw || this_bw >= tot_bw)
> + return nr_pages;
> + else
> + return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
> +}
> +
> +#else /* CONFIG_CGROUP_WRITEBACK */
> +
> +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
> +{
> + return nr_pages;
> +}
> +
> #endif /* CONFIG_CGROUP_WRITEBACK */
>
> void wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> @@ -1187,8 +1222,17 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
> nr_pages = get_nr_dirty_pages();
>
> rcu_read_lock();
> - list_for_each_entry_rcu(bdi, &bdi_list, bdi_list)
> - wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> + list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> + struct bdi_writeback *wb;
> + struct wb_iter iter;
> +
> + if (!bdi_has_dirty_io(bdi))
> + continue;
> +
> + bdi_for_each_wb(wb, bdi, &iter, 0)
> + wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages),
> + false, reason);
> + }
> rcu_read_unlock();
> }
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 08:20:37

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 42/51] writeback: make wakeup_dirtytime_writeback() handle multiple bdi_writeback's

On Fri 22-05-15 17:13:56, Tejun Heo wrote:
> wakeup_dirtytime_writeback() currently only starts writeback on the
> root wb (bdi_writeback). For cgroup writeback support, update the
> function to check all wbs.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Theodore Ts'o <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 508e10c..8ae212e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1260,9 +1260,12 @@ static void wakeup_dirtytime_writeback(struct work_struct *w)
>
> rcu_read_lock();
> list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> - if (list_empty(&bdi->wb.b_dirty_time))
> - continue;
> - wb_wakeup(&bdi->wb);
> + struct bdi_writeback *wb;
> + struct wb_iter iter;
> +
> + bdi_for_each_wb(wb, bdi, &iter, 0)
> + if (!list_empty(&bdi->wb.b_dirty_time))
> + wb_wakeup(&bdi->wb);
> }
> rcu_read_unlock();
> schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ);
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 16:04:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

On Fri 22-05-15 17:13:58, Tejun Heo wrote:
> The completion of a wb_writeback_work can be waited upon by setting
> its ->done to a struct completion and waiting on it; however, for
> cgroup writeback support, it's necessary to issue multiple work items
> to multiple bdi_writebacks and wait for the completion of all.
>
> This patch implements wb_completion which can wait for multiple work
> items and replaces the struct completion with it. It can be defined
> using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
> waited for by wb_wait_for_completion().
>
> Nobody currently issues multiple work items and this patch doesn't
> introduce any behavior changes.

I'd find it better to extend completions to allow doing what you need. It
isn't that special. It seems it would be enough to implement

void wait_for_completions(struct completion *x, int n);

where @n is the number of completions to wait for. And the implementation
can stay as is, only in do_wait_for_common() we change checks for x->done ==
0 to "x->done < n". That's about it...

Honza


>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 58 +++++++++++++++++++++++++++++++---------
> include/linux/backing-dev-defs.h | 2 ++
> mm/backing-dev.c | 1 +
> 3 files changed, 49 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 22f1def..d7d4a1b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -34,6 +34,10 @@
> */
> #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10))
>
> +struct wb_completion {
> + atomic_t cnt;
> +};
> +
> /*
> * Passed into wb_writeback(), essentially a subset of writeback_control
> */
> @@ -51,10 +55,23 @@ struct wb_writeback_work {
> enum wb_reason reason; /* why was writeback initiated? */
>
> struct list_head list; /* pending work list */
> - struct completion *done; /* set if the caller waits */
> + struct wb_completion *done; /* set if the caller waits */
> };
>
> /*
> + * If one wants to wait for one or more wb_writeback_works, each work's
> + * ->done should be set to a wb_completion defined using the following
> + * macro. Once all work items are issued with wb_queue_work(), the caller
> + * can wait for the completion of all using wb_wait_for_completion(). Work
> + * items which are waited upon aren't freed automatically on completion.
> + */
> +#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \
> + struct wb_completion cmpl = { \
> + .cnt = ATOMIC_INIT(1), \
> + }
> +
> +
> +/*
> * If an inode is constantly having its pages dirtied, but then the
> * updates stop dirtytime_expire_interval seconds in the past, it's
> * possible for the worst case time between when an inode has its
> @@ -161,17 +178,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
> trace_writeback_queue(wb->bdi, work);
>
> spin_lock_bh(&wb->work_lock);
> - if (!test_bit(WB_registered, &wb->state)) {
> - if (work->done)
> - complete(work->done);
> + if (!test_bit(WB_registered, &wb->state))
> goto out_unlock;
> - }
> + if (work->done)
> + atomic_inc(&work->done->cnt);
> list_add_tail(&work->list, &wb->work_list);
> mod_delayed_work(bdi_wq, &wb->dwork, 0);
> out_unlock:
> spin_unlock_bh(&wb->work_lock);
> }
>
> +/**
> + * wb_wait_for_completion - wait for completion of bdi_writeback_works
> + * @bdi: bdi work items were issued to
> + * @done: target wb_completion
> + *
> + * Wait for one or more work items issued to @bdi with their ->done field
> + * set to @done, which should have been defined with
> + * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such
> + * work items are completed. Work items which are waited upon aren't freed
> + * automatically on completion.
> + */
> +static void wb_wait_for_completion(struct backing_dev_info *bdi,
> + struct wb_completion *done)
> +{
> + atomic_dec(&done->cnt); /* put down the initial count */
> + wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
> +}
> +
> #ifdef CONFIG_CGROUP_WRITEBACK
>
> /**
> @@ -1143,7 +1177,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>
> set_bit(WB_writeback_running, &wb->state);
> while ((work = get_next_work_item(wb)) != NULL) {
> - struct completion *done = work->done;
> + struct wb_completion *done = work->done;
>
> trace_writeback_exec(wb->bdi, work);
>
> @@ -1151,8 +1185,8 @@ static long wb_do_writeback(struct bdi_writeback *wb)
>
> if (work->auto_free)
> kfree(work);
> - if (done)
> - complete(done);
> + if (done && atomic_dec_and_test(&done->cnt))
> + wake_up_all(&wb->bdi->wb_waitq);
> }
>
> /*
> @@ -1518,7 +1552,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> unsigned long nr,
> enum wb_reason reason)
> {
> - DECLARE_COMPLETION_ONSTACK(done);
> + DEFINE_WB_COMPLETION_ONSTACK(done);
> struct wb_writeback_work work = {
> .sb = sb,
> .sync_mode = WB_SYNC_NONE,
> @@ -1533,7 +1567,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
> wb_queue_work(&bdi->wb, &work);
> - wait_for_completion(&done);
> + wb_wait_for_completion(bdi, &done);
> }
> EXPORT_SYMBOL(writeback_inodes_sb_nr);
>
> @@ -1600,7 +1634,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb);
> */
> void sync_inodes_sb(struct super_block *sb)
> {
> - DECLARE_COMPLETION_ONSTACK(done);
> + DEFINE_WB_COMPLETION_ONSTACK(done);
> struct wb_writeback_work work = {
> .sb = sb,
> .sync_mode = WB_SYNC_ALL,
> @@ -1618,7 +1652,7 @@ void sync_inodes_sb(struct super_block *sb)
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> wb_queue_work(&bdi->wb, &work);
> - wait_for_completion(&done);
> + wb_wait_for_completion(bdi, &done);
>
> wait_sb_inodes(sb);
> }
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index 8c857d7..97a92fa 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -155,6 +155,8 @@ struct backing_dev_info {
> struct rb_root cgwb_congested_tree; /* their congested states */
> atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */
> #endif
> + wait_queue_head_t wb_waitq;
> +
> struct device *dev;
>
> struct timer_list laptop_mode_wb_timer;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d2f16fc9..ad5608d 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -768,6 +768,7 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->max_ratio = 100;
> bdi->max_prop_frac = FPROP_FRAC_BASE;
> INIT_LIST_HEAD(&bdi->bdi_list);
> + init_waitqueue_head(&bdi->wb_waitq);
>
> err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
> if (err)
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 16:09:32

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

On Fri 22-05-15 17:13:58, Tejun Heo wrote:
> The completion of a wb_writeback_work can be waited upon by setting
> its ->done to a struct completion and waiting on it; however, for
> cgroup writeback support, it's necessary to issue multiple work items
> to multiple bdi_writebacks and wait for the completion of all.
>
> This patch implements wb_completion which can wait for multiple work
> items and replaces the struct completion with it. It can be defined
> using DEFINE_WB_COMPLETION_ONSTACK(), used for multiple work items and
> waited for by wb_wait_for_completion().
>
> Nobody currently issues multiple work items and this patch doesn't
> introduce any behavior changes.

One more thing...

> @@ -161,17 +178,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
> trace_writeback_queue(wb->bdi, work);
>
> spin_lock_bh(&wb->work_lock);
> - if (!test_bit(WB_registered, &wb->state)) {
> - if (work->done)
> - complete(work->done);
> + if (!test_bit(WB_registered, &wb->state))
> goto out_unlock;

This seems like a change in behavior. Previously unregistered wbs just
completed the work->done; now you don't complete them. Is that intentional?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:07:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 45/51] writeback: implement wb_wait_for_single_work()

On Fri 22-05-15 17:13:59, Tejun Heo wrote:
> For cgroup writeback, multiple wb_writeback_work items may need to be
> issued to accomplish a single task. The previous patch updated the
> waiting mechanism such that wb_wait_for_completion() can wait for
> multiple work items.
>
> Issuing multiple work items involves memory allocation which may fail.
> As most writeback operations can't fail or block on memory
> allocation, in such cases we'll fall back to sequential issuing of an
> on-stack work item, which would need to be waited upon sequentially.
>
> This patch implements wb_wait_for_single_work() which waits for a
> single work item independently from wb_completion waiting so that such
> fallback mechanism can be used without getting tangled with the usual
> issuing / completion operation.

I don't understand why the special handling with single_wait /
single_done is necessary. When we fail to allocate work and thus use the
base_work for submission, we can still use the standard completion mechanism
to wait for work to finish, can't we?
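
For illustration, such a fallback could reuse the wb_completion mechanism
from the previous patch roughly as follows (a sketch only, not the code that
was merged; base_work is assumed to be the caller's on-stack work item):

	/* issue the on-stack work and wait for just this submission */
	DEFINE_WB_COMPLETION_ONSTACK(fallback_done);

	base_work->auto_free = 0;
	base_work->done = &fallback_done;
	wb_queue_work(wb, base_work);
	wb_wait_for_completion(bdi, &fallback_done);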

BTW: Again it would be easier for me to review this if the implementation
of this function was in one patch with the use of it so that one can see
how it gets used...

Honza
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> fs/fs-writeback.c | 47 +++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d7d4a1b..093b959 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -52,6 +52,8 @@ struct wb_writeback_work {
> unsigned int for_background:1;
> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> unsigned int auto_free:1; /* free on completion */
> + unsigned int single_wait:1;
> + unsigned int single_done:1;
> enum wb_reason reason; /* why was writeback initiated? */
>
> struct list_head list; /* pending work list */
> @@ -178,8 +180,11 @@ static void wb_queue_work(struct bdi_writeback *wb,
> trace_writeback_queue(wb->bdi, work);
>
> spin_lock_bh(&wb->work_lock);
> - if (!test_bit(WB_registered, &wb->state))
> + if (!test_bit(WB_registered, &wb->state)) {
> + if (work->single_wait)
> + work->single_done = 1;
> goto out_unlock;
> + }
> if (work->done)
> atomic_inc(&work->done->cnt);
> list_add_tail(&work->list, &wb->work_list);
> @@ -234,6 +239,32 @@ int inode_congested(struct inode *inode, int cong_bits)
> EXPORT_SYMBOL_GPL(inode_congested);
>
> /**
> + * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
> + * @bdi: bdi the work item was issued to
> + * @work: work item to wait for
> + *
> + * Wait for the completion of @work which was issued to one of @bdi's
> + * bdi_writeback's. The caller must have set @work->single_wait before
> + * issuing it. This wait operates independently of
> + * wb_wait_for_completion() and also disables automatic freeing of @work.
> + */
> +static void wb_wait_for_single_work(struct backing_dev_info *bdi,
> + struct wb_writeback_work *work)
> +{
> + if (WARN_ON_ONCE(!work->single_wait))
> + return;
> +
> + wait_event(bdi->wb_waitq, work->single_done);
> +
> + /*
> + * Paired with smp_wmb() in wb_do_writeback() and ensures that all
> + * modifications to @work prior to assertion of ->single_done is
> + * visible to the caller once this function returns.
> + */
> + smp_rmb();
> +}
> +
> +/**
> * wb_split_bdi_pages - split nr_pages to write according to bandwidth
> * @wb: target bdi_writeback to split @nr_pages to
> * @nr_pages: number of pages to write for the whole bdi
> @@ -1178,14 +1209,26 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> set_bit(WB_writeback_running, &wb->state);
> while ((work = get_next_work_item(wb)) != NULL) {
> struct wb_completion *done = work->done;
> + bool need_wake_up = false;
>
> trace_writeback_exec(wb->bdi, work);
>
> wrote += wb_writeback(wb, work);
>
> - if (work->auto_free)
> + if (work->single_wait) {
> + WARN_ON_ONCE(work->auto_free);
> + /* paired w/ rmb in wb_wait_for_single_work() */
> + smp_wmb();
> + work->single_done = 1;
> + need_wake_up = true;
> + } else if (work->auto_free) {
> kfree(work);
> + }
> +
> if (done && atomic_dec_and_test(&done->cnt))
> + need_wake_up = true;
> +
> + if (need_wake_up)
> wake_up_all(&wb->bdi->wb_waitq);
> }
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:17:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 48/51] writeback: dirty inodes against their matching cgroup bdi_writeback's

On Fri 22-05-15 17:14:02, Tejun Heo wrote:
> __mark_inode_dirty() always dirtied the inode against the root wb
> (bdi_writeback). The previous patches added all the infrastructure
> necessary to attribute an inode against the wb of the dirtying cgroup.
>
> This patch updates __mark_inode_dirty() so that it uses the wb
> associated with the inode instead of unconditionally using the root
> one.
>
> Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
> pages will keep being dirtied against the root wb.
>
> v2: Updated for per-inode wb association.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 23 +++++++++++------------
> 1 file changed, 11 insertions(+), 12 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 59d76f6..881ea5d 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1504,7 +1504,6 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode)
> void __mark_inode_dirty(struct inode *inode, int flags)
> {
> struct super_block *sb = inode->i_sb;
> - struct backing_dev_info *bdi = NULL;
> int dirtytime;
>
> trace_writeback_mark_inode_dirty(inode, flags);
> @@ -1574,30 +1573,30 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> * reposition it (that would break b_dirty time-ordering).
> */
> if (!was_dirty) {
> + struct bdi_writeback *wb = inode_to_wb(inode);
> struct list_head *dirty_list;
> bool wakeup_bdi = false;
> - bdi = inode_to_bdi(inode);
>
> spin_unlock(&inode->i_lock);
> - spin_lock(&bdi->wb.list_lock);
> + spin_lock(&wb->list_lock);
>
> - WARN(bdi_cap_writeback_dirty(bdi) &&
> - !test_bit(WB_registered, &bdi->wb.state),
> - "bdi-%s not registered\n", bdi->name);
> + WARN(bdi_cap_writeback_dirty(wb->bdi) &&
> + !test_bit(WB_registered, &wb->state),
> + "bdi-%s not registered\n", wb->bdi->name);
>
> inode->dirtied_when = jiffies;
> if (dirtytime)
> inode->dirtied_time_when = jiffies;
>
> if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES))
> - dirty_list = &bdi->wb.b_dirty;
> + dirty_list = &wb->b_dirty;
> else
> - dirty_list = &bdi->wb.b_dirty_time;
> + dirty_list = &wb->b_dirty_time;
>
> - wakeup_bdi = inode_wb_list_move_locked(inode, &bdi->wb,
> + wakeup_bdi = inode_wb_list_move_locked(inode, wb,
> dirty_list);
>
> - spin_unlock(&bdi->wb.list_lock);
> + spin_unlock(&wb->list_lock);
> trace_writeback_dirty_inode_enqueue(inode);
>
> /*
> @@ -1606,8 +1605,8 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> * to make sure background write-back happens
> * later.
> */
> - if (bdi_cap_writeback_dirty(bdi) && wakeup_bdi)
> - wb_wakeup_delayed(&bdi->wb);
> + if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi)
> + wb_wakeup_delayed(wb);
> return;
> }
> }
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:21:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 49/51] buffer, writeback: make __block_write_full_page() honor cgroup writeback

On Fri 22-05-15 17:14:03, Tejun Heo wrote:
> [__]block_write_full_page() is used to implement ->writepage in
> various filesystems. All writeback logic is now updated to handle
> cgroup writeback and the block cgroup to issue IOs for is encoded in
> writeback_control and can be retrieved from the inode; however,
> [__]block_write_full_page() currently ignores the blkcg indicated by
> inode and issues all bio's without explicit blkcg association.
>
> This patch adds submit_bh_blkcg() which associates the bio with the
> specified blkio cgroup before issuing and uses it in
> __block_write_full_page() so that the issued bio's are associated with
> inode_to_wb_blkcg_css(inode).

One comment below...

> @@ -44,6 +45,9 @@
> #include <trace/events/block.h>
>
> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
> +static int submit_bh_blkcg(int rw, struct buffer_head *bh,
> + unsigned long bio_flags,

The argument bio_flags is unused. What is it good for?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:26:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 50/51] mpage: make __mpage_writepage() honor cgroup writeback

On Fri 22-05-15 17:14:04, Tejun Heo wrote:
> __mpage_writepage() is used to implement mpage_writepages() which in
> turn is used for ->writepages() of various filesystems. All writeback
> logic is now updated to handle cgroup writeback and the block cgroup
> to issue IOs for is encoded in writeback_control and can be retrieved
> from the inode; however, __mpage_writepage() currently ignores the
> blkcg indicated by the inode and issues all bio's without explicit
> blkcg association.
>
> This patch updates __mpage_writepage() so that the issued bio's are
> associated with inode_to_writeback_blkcg_css(inode).
>
> v2: Updated for per-inode wb association.

Looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Alexander Viro <[email protected]>
> ---
> fs/mpage.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/fs/mpage.c b/fs/mpage.c
> index 3e79220..a3ccb0b 100644
> --- a/fs/mpage.c
> +++ b/fs/mpage.c
> @@ -605,6 +605,8 @@ static int __mpage_writepage(struct page *page, struct writeback_control *wbc,
> bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH);
> if (bio == NULL)
> goto confused;
> +
> + bio_associate_blkcg(bio, inode_to_wb_blkcg_css(inode));
> }
>
> /*
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:28:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 49/51] buffer, writeback: make __block_write_full_page() honor cgroup writeback

On Wed 01-07-15 21:21:02, Jan Kara wrote:
> On Fri 22-05-15 17:14:03, Tejun Heo wrote:
> > [__]block_write_full_page() is used to implement ->writepage in
> > various filesystems. All writeback logic is now updated to handle
> > cgroup writeback and the block cgroup to issue IOs for is encoded in
> > writeback_control and can be retrieved from the inode; however,
> > [__]block_write_full_page() currently ignores the blkcg indicated by
> > inode and issues all bio's without explicit blkcg association.
> >
> > This patch adds submit_bh_blkcg() which associates the bio with the
> > specified blkio cgroup before issuing and uses it in
> > __block_write_full_page() so that the issued bio's are associated with
> > inode_to_wb_blkcg_css(inode).
>
> One comment below...
>
> > @@ -44,6 +45,9 @@
> > #include <trace/events/block.h>
> >
> > static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
> > +static int submit_bh_blkcg(int rw, struct buffer_head *bh,
> > + unsigned long bio_flags,
>
> The argument bio_flags is unused. What is it good for?

Ah, sorry, I guess I'm too tired. I now see how bio_flags are used. The
patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-01 19:29:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 51/51] ext2: enable cgroup writeback support

On Fri 22-05-15 17:14:05, Tejun Heo wrote:
> Writeback now supports cgroup writeback and the generic writeback,
> buffer, libfs, and mpage helpers that ext2 uses are all updated to
> work with cgroup writeback.
>
> This patch enables cgroup writeback for ext2 by adding
> FS_CGROUP_WRITEBACK to its ->fs_flags.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: [email protected]

Hallelujah!

Reviewed-by: Jan Kara <[email protected]>

> ---
> fs/ext2/super.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index d0e746e..549219d 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -1543,7 +1543,7 @@ static struct file_system_type ext2_fs_type = {
> .name = "ext2",
> .mount = ext2_mount,
> .kill_sb = kill_block_super,
> - .fs_flags = FS_REQUIRES_DEV,
> + .fs_flags = FS_REQUIRES_DEV | FS_CGROUP_WRITEBACK,
> };
> MODULE_ALIAS_FS("ext2");
>
> --
> 2.4.0
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-02 01:11:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

Hello, Jan.

On Tue, Jun 30, 2015 at 11:37:51AM +0200, Jan Kara wrote:
> Hum, you later changed this to use a per-sb flag instead of a per-fs-type
> flag, right? We could do it as well here but OK.

The commits were already in stable branch at that point and landed in
mainline during this merge window, so I'm afraid the review points
will have to be addressed as additional patches.

> One more question - what does prevent us from supporting CGROUP_WRITEBACK
> for all bdis capable of writeback? I guess the reason is that currently
> blkcgs are bound to request_queue and we have to have blkcg(s) for
> CGROUP_WRITEBACK to work, am I right? But in principle tracking writeback
> state and doing writeback per memcg doesn't seem to be bound to any device
> properties so we could do that right?

The main issue is that cgroup should somehow know how the processes
are mapped to the underlying IO layer - the IO domain should somehow
be defined. We can introduce an intermediate abstraction which maps
to blkcg and whatever other cgroup controllers which may define cgroup
IO domains but given that such cases would be fairly niche, I think
we'd be better off making those corner cases represent themselves
using blkcg rather than introducing an additional layer.

Thanks.

--
tejun

2015-07-02 01:27:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 26/51] writeback: let balance_dirty_pages() work on the matching cgroup bdi_writeback

Hello, Jan.

On Tue, Jun 30, 2015 at 04:31:00PM +0200, Jan Kara wrote:
...
> > + if (inode_cgwb_enabled(inode))
> > + wb = wb_get_create_current(bdi, GFP_KERNEL);
> > + if (!wb)
> > + wb = &bdi->wb;
> > +
>
> So this effectively adds a radix tree lookup (of wb belonging to memcg) for
> every set_page_dirty() call. That seems relatively costly to me. And all

Hmmm... idk, a radix tree lookup should be cheap, especially when the
tree is shallow, even in a path like set_page_dirty(). It's glorified
array indexing. If not, we should really be improving the radix tree
implementation. That
said,

> that just to check wb->dirty_exceeded. Cannot we just use inode_to_wb()
> instead? I understand results may be different if multiple memcgs share an
> inode and that's the reason why you use wb_get_create_current(), right?
> But for dirty_exceeded check it may be good enough?

Yeah, that probably should work. I'll think more about it.
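
For reference, the suggested variant would look roughly like this in the
ratelimit path (a sketch; inode_to_wb() comes from earlier in the series and
the clamp value shown mirrors the pre-existing code, included only for
context):

	/* use the inode's associated wb instead of looking up / creating
	 * the current task's wb just for the dirty_exceeded test */
	struct bdi_writeback *wb = inode_to_wb(inode);

	if (wb->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_CACHE_SHIFT - 10));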

Thanks.

--
tejun

2015-07-02 01:38:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 28/51] writeback, blkcg: restructure blk_{set|clear}_queue_congested()

Hello, Jan.

On Tue, Jun 30, 2015 at 05:02:54PM +0200, Jan Kara wrote:
> BTW, I'd prefer if this was merged with the following patch. I was
> wondering for a while about the condition at the beginning of
> blk_clear_congested() only to learn it gets modified to the one I'd expect
> in the following patch :)

The patches are already merged, so it's a bit too late to discuss, but I
usually try to keep each step quite granular. e.g. I try hard to
avoid combining code relocation / restructuring with actual functional
changes, so that the code changes A -> B -> C, where B is functionally
identical to A and C differs from B only where the actual functional
changes occur.

I think your argument is that as C is the final form, introducing B is
actually harder for reviewing. I have to disagree with that pretty
strongly. When you only think about the functional transformations, A
-> C might seem easier, but given that we also want to verify the
changes - both during development and review - it's far more
beneficial to go through the intermediate stage as that isolates
functional changes from mere code transformation.

Another thing to consider is that there's a difference between when one
is reviewing a patch series as a whole, tracking the development of the
big picture, and later when somebody tries to debug or bisect a bug the
patchset introduces. At that point, the general larger flow isn't
really in the picture and combining structural and functional changes
may make understanding what's going on significantly harder in
addition to making such errors more likely and less detectable in the
first place.

Thanks.

--
tejun

2015-07-02 01:46:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/51] writeback: implement and use inode_congested()

Hello,

On Tue, Jun 30, 2015 at 05:21:05PM +0200, Jan Kara wrote:
> Hum, is there any point in supporting NULL inode with inode_congested()?
> That would look more like a programming bug than anything... Otherwise the
> patch looks good to me so you can add:

Those are inherited from the existing usages and all for swapper
space. I think we should have a dummy inode instead of scattering
NULL mapping->host tests all over the place, but that's for another day.

Thanks.

--
tejun

2015-07-02 02:01:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 33/51] writeback: make bdi_has_dirty_io() take multiple bdi_writeback's into account

Hello, Jan.

On Tue, Jun 30, 2015 at 06:48:24PM +0200, Jan Kara wrote:
> It looks OK although I find using total write bandwidth to detect whether
> any wb has any dirty IO rather hacky. Frankly I'd prefer to just iterate
> all wbs from bdi_has_dirty_io() since that isn't performance critical
> and we iterate all wbs in those paths anyway... Hmm?

When there are wb's to write out, maybe walking them twice isn't too
bad; however, the problem, I think, is when there's nothing to do.
When there is a large enough number of devices and cgroups, we end up
making what used to be a trivial operation into something which can be
computationally significant, i.e. userland behaviors which used to be
completely fine because things are very cheap when there's nothing to
do can become scalability liabilities.

I don't think it's highly likely that this would become a visible
issue but I feel pretty uneasy about making O(1) noops O(N),
especially given that we need to maintain per-bdi fraction anyway.

Thanks.

--
tejun

2015-07-02 02:22:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 36/51] writeback: implement bdi_for_each_wb()

On Wed, Jul 01, 2015 at 09:27:57AM +0200, Jan Kara wrote:
> > +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
> > + for ((iter)->next_id = (start_blkcg_id); \
> > + ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
> > +
>
> This looks quite confusing. Won't it be easier to understand as:
>
> struct wb_iter {
> } __attribute__ ((unused));
>
> #define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
> if (((wb_cur) = (!start_blkcg_id ? &(bdi)->wb : NULL)))

But then break or continue wouldn't work as expected. It can get
really confusing when it's wrapped by an outer loop.
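
A hypothetical caller illustrates the point: with an if-based macro the body
is not a loop of its own, so "continue" would advance the *outer* loop (the
next bdi, not the next wb) and "break" would terminate the outer loop
entirely:

	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		bdi_for_each_wb(wb, bdi, &iter, 0) {
			if (!wb_has_dirty_io(wb))
				continue;	/* with "if": skips to the next bdi */
			wb_start_writeback(wb, nr_pages, false, reason);
		}
	}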

Thanks.

--
tejun

2015-07-02 02:28:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 39/51] writeback: make writeback_in_progress() take bdi_writeback instead of backing_dev_info

Hello, Jan.

On Wed, Jul 01, 2015 at 09:47:08AM +0200, Jan Kara wrote:
> BTW: It would have been easier for me to review this if e.g. a move from
> bdi to wb parameter was split among less patches. The intermediate state
> where some functions call partly bdi and party wb functions is strange and
> it always makes me go search in the series whether the other part of the
> function gets converted and whether they play well together...

Similar argument. When reviewing big picture transitions, it *could*
be easier to have larger lumps but I believe that's not necessarily
because reviewing itself becomes easier but more because it becomes
easier to skip what's uninteresting like actually verifying each
change. Another aspect is that some of the changes are spread out.
When each patch modifies one part, it's clear that all changes in the
patch belong to that specific part; however, in larger lumps, there
usually are a number of stragglers across the changes and associating
them with other parts isn't necessarily trivial. This happens with
the patch description too. It becomes easier to slip in, intentionally or
by mistake, unrelated changes without explaining what's going on.

Thanks.

--
tejun

2015-07-02 02:29:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 40/51] writeback: make bdi_start_background_writeback() take bdi_writeback instead of backing_dev_info

On Wed, Jul 01, 2015 at 09:50:09AM +0200, Jan Kara wrote:
> Can we add a memcg id of the wb to the tracepoint please? Because just bdi
> needn't be enough when debugging stuff...

Sure, will add cgroup path to identify the actual wb. css IDs aren't
visible to userland.

Thanks.

--
tejun

2015-07-02 02:37:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

Hello,

On Wed, Jul 01, 2015 at 10:15:28AM +0200, Jan Kara wrote:
> I was looking at who uses wakeup_flusher_threads(). There are two usecases:
>
> 1) sync() - we want to writeback everything
> 2) We want to relieve memory pressure by cleaning and subsequently
> reclaiming pages.
>
> Neither of these cares about number of pages too much if you write enough.

What's enough tho? Saying "yeah let's try about 1000 pages" is one
thing and "let's try about 1000 pages on each of 100 cgroups" is a
quite different operation. Given the nature of "let's try to write
some", I'd venture to say that writing somewhat less is an a lot
better behavior than possibly trying to write out possibly huge amount
given that the amount of fluctuation such behaviors may cause
system-wide and how non-obvious the reasons for such fluctuations
would be.

> So similarly as we don't split the passed nr_pages argument among bdis, I

bdi's are bound by actual hardware. wb's aren't. This is a purely
logical construct and there can be a lot of them. Again, trying to
write 1024 pages on each of 100 devices and trying to write 1024 * 100
pages to a single device are quite different.
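For reference, a simplified sketch of the bandwidth-proportional split
the series uses (wb_split_bdi_pages() in the patches; the rounding and
corner cases here are illustrative):

	static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages)
	{
		unsigned long this_bw = wb->avg_write_bandwidth;
		unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth);

		if (nr_pages == LONG_MAX)	/* "write everything" isn't split */
			return LONG_MAX;
		if (!tot_bw || this_bw >= tot_bw)
			return nr_pages;	/* err on the side of writing more */
		return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw);
	}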

Thanks.

--
tejun

2015-07-02 03:02:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

On Wed, Jul 01, 2015 at 06:09:18PM +0200, Jan Kara wrote:
> > @@ -161,17 +178,34 @@ static void wb_queue_work(struct bdi_writeback *wb,
> > trace_writeback_queue(wb->bdi, work);
> >
> > spin_lock_bh(&wb->work_lock);
> > - if (!test_bit(WB_registered, &wb->state)) {
> > - if (work->done)
> > - complete(work->done);
> > + if (!test_bit(WB_registered, &wb->state))
> > goto out_unlock;
>
> This seems like a change in behavior. Previously unregistered wbs just
> completed the work->done, now you don't complete them. Is that intentional?

If nothing is queued, the cnt is never increased and the wait becomes
a noop. The default states are different between completion and
wb_completion. There's no need to do anything to indicate that
nothing needs to be waited for.
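Schematically (a simplified sketch of the scheme; the queueing and
completion sides are visible in the patch being discussed):

	struct wb_completion {
		atomic_t	cnt;	/* starts at 1 */
	};

	/* wb_queue_work():   atomic_inc(&work->done->cnt);
	 * work finishes:     if (atomic_dec_and_test(&done->cnt))
	 *                             wake_up_all(&bdi->wb_waitq);
	 */

	static void wb_wait_for_completion(struct backing_dev_info *bdi,
					   struct wb_completion *done)
	{
		atomic_dec(&done->cnt);	/* drop the initial count */
		wait_event(bdi->wb_waitq, !atomic_read(&done->cnt));
	}

With nothing queued, cnt simply goes 1 -> 0 and the wait returns
immediately, which is why no explicit completion is needed for the
unregistered case.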

Thanks.

--
tejun

2015-07-02 03:06:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

Hello, Jan.

On Wed, Jul 01, 2015 at 06:04:37PM +0200, Jan Kara wrote:
> I'd find it better to extend completions to allow doing what you need. It
> isn't that special. It seems it would be enough to implement
>
> void wait_for_completions(struct completion *x, int n);
>
> where @n is the number of completions to wait for. And the implementation
> can stay as is, only in do_wait_for_common() we change checks for x->done ==
> 0 to "x->done < n". That's about it...

I don't know. While I agree that it'd be nice to have a generic event
count & trigger mechanism in the kernel, I don't think extending
completion is a good idea - the count then works both ways as the
event counter && listener counter and effectively becomes a semaphore
which usually doesn't end well. There are very few cases where we
want the counter to work both ways and I personally think we'd be far
better served if those rare cases implement something custom rather
than a generic mechanism becoming cryptic trying to cover everything.

Thanks.

--
tejun

2015-07-02 03:07:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 45/51] writeback: implement wb_wait_for_single_work()

Hello,

On Wed, Jul 01, 2015 at 09:07:35PM +0200, Jan Kara wrote:
> I don't understand, why is the special handling with single_wait,
> single_done necessary. When we fail to allocate work and thus use the
> base_work for submission, we can still use the standard completion mechanism
> to wait for work to finish, can't we?

Indeed. I'm not sure why I didn't do that. I'll try.

> BTW: Again it would be easier for me to review this if the implementation
> of this function was in one patch with the use of it so that one can see
> how it gets used...

Same point on this one as before.

Thanks.

--
tejun

2015-07-02 03:08:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 51/51] ext2: enable cgroup writeback support

On Wed, Jul 01, 2015 at 09:29:12PM +0200, Jan Kara wrote:
> On Fri 22-05-15 17:14:05, Tejun Heo wrote:
> > Writeback now supports cgroup writeback and the generic writeback,
> > buffer, libfs, and mpage helpers that ext2 uses are all updated to
> > work with cgroup writeback.
> >
> > This patch enables cgroup writeback for ext2 by adding
> > FS_CGROUP_WRITEBACK to its ->fs_flags.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: [email protected]
>
> Hallelujah!
>
> Reviewed-by: Jan Kara <[email protected]>

Hooray! Thanks a lot for going through all the patches! :)

--
tejun

2015-07-03 10:50:11

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

On Wed 01-07-15 21:10:56, Tejun Heo wrote:
> Hello, Jan.
>
> On Tue, Jun 30, 2015 at 11:37:51AM +0200, Jan Kara wrote:
> > Hum, you later changed this to use a per-sb flag instead of a per-fs-type
> > flag, right? We could do it as well here but OK.
>
> The commits were already in stable branch at that point and landed in
> mainline during this merge window, so I'm afraid the review points
> will have to be addressed as additional patches.

Yeah, I know but I just didn't get to the series earlier. Anyway, I didn't
find fundamental issues so it's easy to change things in followup patches.

> > One more question - what does prevent us from supporting CGROUP_WRITEBACK
> > for all bdis capable of writeback? I guess the reason is that currently
> > blkcgs are bound to request_queue and we have to have blkcg(s) for
> > CGROUP_WRITEBACK to work, am I right? But in principle tracking writeback
> > state and doing writeback per memcg doesn't seem to be bound to any device
> > properties so we could do that right?
>
> The main issue is that cgroup should somehow know how the processes
> are mapped to the underlying IO layer - the IO domain should somehow
> be defined. We can introduce an intermediate abstraction which maps
> to blkcg and whatever other cgroup controllers which may define cgroup
> IO domains but given that such cases would be fairly niche, I think
> we'd be better off making those corner cases represent themselves
> using blkcg rather than introducing an additional layer.

Well, unless there is some specific mapping for the device, we could just
fall back to attributing everything to the root cgroup. We would still
account dirty pages in memcg, throttle writers in memcg when there are too
many dirty pages, issue writeback for inodes in memcg with enough dirty
pages etc. Just all IO from different memcgs would be equal so no
separation would be there. But it would still seem better than just
ignoring the split of dirty pages among memcgs as we do now... Thoughts?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 12:16:28

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 28/51] writeback, blkcg: restructure blk_{set|clear}_queue_congested()

On Wed 01-07-15 21:38:15, Tejun Heo wrote:
> Hello, Jan.
>
> On Tue, Jun 30, 2015 at 05:02:54PM +0200, Jan Kara wrote:
> > BTW, I'd prefer if this was merged with the following patch. I was
> > wondering for a while about the condition at the beginning of
> > blk_clear_congested() only to learn it gets modified to the one I'd expect
> > in the following patch :)
>
> The patches are already merged, it's a bit too late to discuss but I
> usually try to keep each step quite granular. e.g. I try hard to
> avoid combining code relocation / restructuring with actual functional
> changes so that the code change A -> B -> C where B is functionally
> identical to A and C is different from B only where the actual
> functional changes occur.

Yeah, I didn't mean this comment as something you should change even if the
series wasn't applied yet (it isn't that bad). I meant it more as
feedback for the future.

> I think your argument is that as C is the final form, introducing B is
> actually harder for reviewing. I have to disagree with that pretty
> strongly. When you only think about the functional transformations,
> A -> C might seem easier but given that we also want to verify the
> changes - both during development and review - it's far more
> beneficial to go through the intermediate stage as that isolates
> functional changes from mere code transformation.
>
> Another thing to consider is that there's a difference when one is
> reviewing a patch series as a whole, tracking the development of the
> big picture, and later when somebody tries to debug or bisect a bug the
> patchset introduces. At that point, the general larger flow isn't
> really in the picture and combining structural and functional changes
> may make understanding what's going on significantly harder in
> addition to making such errors more likely and less detectable in the
> first place.

In general I agree with you - separating refactoring from functional
changes is useful. I just think you took it a bit to the extreme in this
series :) When I'm reviewing patches, I'm also checking whether the
function does what it's "supposed" to do. So in case of splitting patches
like this I have to go through the series and verify that in the end we end
up with what one would expect. And sometimes the correctness is so much
easier to verify when changes are split that the extra patch chasing is
worth it. But in simple cases like this one, a merged patch would have been
easier for me. I guess it's a matter of taste...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 12:17:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 30/51] writeback: implement and use inode_congested()

On Wed 01-07-15 21:46:34, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 30, 2015 at 05:21:05PM +0200, Jan Kara wrote:
> > Hum, is there any point in supporting NULL inode with inode_congested()?
> > That would look more like a programming bug than anything... Otherwise the
> > patch looks good to me so you can add:
>
> Those are inherited from the existing usages and all for swapper
> space. I think we should have a dummy inode instead of scattering
> NULL mapping->host test all over the place but that's for another day.

Ah, OK. A comment about this would be nice.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 12:26:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 36/51] writeback: implement bdi_for_each_wb()

On Wed 01-07-15 22:22:26, Tejun Heo wrote:
> On Wed, Jul 01, 2015 at 09:27:57AM +0200, Jan Kara wrote:
> > > +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
> > > + for ((iter)->next_id = (start_blkcg_id); \
> > > + ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); )
> > > +
> >
> > This looks quite confusing. Won't it be easier to understand as:
> >
> > struct wb_iter {
> > } __attribute__ ((unused));
> >
> > #define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \
> > if (((wb_cur) = (!start_blkcg_id ? &(bdi)->wb : NULL)))
>
> But then break or continue wouldn't work as expected. It can get
> really confusing when it's wrapped by an outer loop.

That's a good point. Thanks for the explanation. Maybe add a comment like:
/*
* We use this seemingly complicated 'for' loop so that 'break' and
* 'continue' continue to work as expected.
*/

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 12:36:51

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

On Wed 01-07-15 23:06:24, Tejun Heo wrote:
> Hello, Jan.
>
> On Wed, Jul 01, 2015 at 06:04:37PM +0200, Jan Kara wrote:
> > I'd find it better to extend completions to allow doing what you need. It
> > isn't that special. It seems it would be enough to implement
> >
> > void wait_for_completions(struct completion *x, int n);
> >
> > where @n is the number of completions to wait for. And the implementation
> > can stay as is, only in do_wait_for_common() we change checks for x->done ==
> > 0 to "x->done < n". That's about it...
>
> I don't know. While I agree that it'd be nice to have a generic event
> count & trigger mechanism in the kernel, I don't think extending
> completion is a good idea - the count then works both ways as the
> event counter && listener counter and effectively becomes a semaphore
> which usually doesn't end well. There are very few cases where we
> want the counter to work both ways and I personally think we'd be far
> better served if those rare cases implement something custom rather
> than a generic mechanism becoming cryptic trying to cover everything.

Let me phrase my objection this differently: Instead of implementing custom
synchronization mechanism, you could as well do:

int count_submitted; /* Number of submitted works we want to wait for */
struct completion done;
...
submit works with 'done' as completion.
...
while (count_submitted--)
	wait_for_completion(&done);

And we could also easily optimize that loop and put it in
kernel/sched/completion.c. The fewer synchronization mechanisms we have the
better, I'd think...
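Expressed as the helper being proposed, built purely on the existing
completion API (a sketch; an optimized version would live in
kernel/sched/completion.c):

	/* wait until @x has been completed @n times */
	static void wait_for_completions(struct completion *x, int n)
	{
		while (n-- > 0)
			wait_for_completion(x);
	}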

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 13:02:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

On Wed 01-07-15 22:37:06, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 01, 2015 at 10:15:28AM +0200, Jan Kara wrote:
> > I was looking at who uses wakeup_flusher_threads(). There are two usecases:
> >
> > 1) sync() - we want to writeback everything
> > 2) We want to relieve memory pressure by cleaning and subsequently
> > reclaiming pages.
> >
> > Neither of these cares about number of pages too much if you write enough.
>
> What's enough tho? Saying "yeah let's try about 1000 pages" is one
> thing and "let's try about 1000 pages on each of 100 cgroups" is a
> quite different operation. Given the nature of "let's try to write
> some", I'd venture to say that writing somewhat less is an a lot
> better behavior than possibly trying to write out possibly huge amount
> given that the amount of fluctuation such behaviors may cause
> system-wide and how non-obvious the reasons for such fluctuations
> would be.
>
> > So similarly as we don't split the passed nr_pages argument among bdis, I
>
> bdi's are bound by actual hardware. wb's aren't. This is a purely
> logical construct and there can be a lot of them. Again, trying to
> write 1024 pages on each of 100 devices and trying to write 1024 * 100
> pages to a single device are quite different.

OK, I agree with your device vs logical construct argument. However when
splitting pages based on avg throughput each cgroup generates, we know
nothing about the actual amount of dirty pages in each cgroup so we may end up
writing many fewer pages than we originally wanted since a cgroup which was
assigned a big chunk needn't have many pages available. So your algorithm
is basically bound to undershoot the requested number of pages in some
cases...

Another concern is that if we have two cgroups with the same amount of dirty
pages but cgroup A has them randomly scattered (and thus has much lower
bandwidth) and cgroup B has them in a sequential fashion (thus with higher
bandwidth), you end up cleaning (and subsequently reclaiming) more from
cgroup B. That may be good for immediate memory pressure but could be
considered unfair by the cgroup owner.

Maybe it would be better to split number of pages to write based on
fraction of dirty pages each cgroup has in the bdi?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-03 16:33:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 41/51] writeback: make wakeup_flusher_threads() handle multiple bdi_writeback's

Hello,

On Fri, Jul 03, 2015 at 03:02:13PM +0200, Jan Kara wrote:
...
> OK, I agree with your device vs logical construct argument. However when
> splitting pages based on avg throughput each cgroup generates, we know
> nothing about the actual amount of dirty pages in each cgroup so we may end up
> writing many fewer pages than we originally wanted since a cgroup which was
> assigned a big chunk needn't have many pages available. So your algorithm
> is basically bound to undershoot the requested number of pages in some
> cases...

Sure, but the nr_to_write has never been a strict thing except when
we're writing out everything. We don't overshoot them but writing out
less than requested is not unusual. Also, note that write bandwidth
is the primary measure that we base the distribution of dirty pages
on. Sure, there can be cases where the two deviate but this is the
better measure to use than, say, number of currently dirty pages.

> Another concern is that if we have two cgroups with the same amount of dirty
> pages but cgroup A has them randomly scattered (and thus has much lower
> bandwidth) and cgroup B has them in a sequential fashion (thus with higher
> bandwidth), you end up cleaning (and subsequently reclaiming) more from
> cgroup B. That may be good for immediate memory pressure but could be
> considered unfair by the cgroup owner.
>
> Maybe it would be better to split number of pages to write based on
> fraction of dirty pages each cgroup has in the bdi?

The dirty pages are already distributed according to write bandwidth.
Write bandwidth is the de-facto currency of dirty page distribution.
If it can be shown that some other measure is better for this purpose,
sure, but I don't see why we'd deviate just based on a vague feeling
that something else might be better, and given how these mechanisms are
used I don't think going either way would matter much.

Thanks.

--
tejun

2015-07-03 17:02:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 44/51] writeback: implement bdi_wait_for_completion()

Hello,

On Fri, Jul 03, 2015 at 02:36:42PM +0200, Jan Kara wrote:
> Let me phrase my objection this differently: Instead of implementing custom
> synchronization mechanism, you could as well do:
>
> int count_submitted; /* Number of submitted works we want to wait for */
> struct completion done;
> ...
> submit works with 'done' as completion.
> ...
> while (count_submitted--)
> 	wait_for_completion(&done);
>
> And we could also easily optimize that loop and put it in
> kernel/sched/completion.c. The fewer synchronization mechanisms we have the
> better, I'd think...

And what I'm trying to say is that we most likely don't want to build
it around completions. We really don't want to roll "event count" and
"wakeup count" into the same mechanism. There's nothing completion
provides that such an event counting mechanism needs or wants. It isn't
that attractive from the completion side either. The main reason we
have completions is for stupid simple synchronizations and we wanna
keep it simple.

I do agree that we might want a generic "event count" mechanism but at
the same time combining a counter and wait_event is usually pretty
trivial. Maybe atomic_t + waitqueue is a useful enough abstraction
but then we would eventually end up having to deal with all the
different types of waits and timeouts. We might end up with a lot of
thin wrappers which really don't do much of anything.

If you can think of a good way to abstract this, please go ahead.

Thanks.

--
tejun

2015-07-03 17:06:57

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 36/51] writeback: implement bdi_for_each_wb()

On Fri, Jul 03, 2015 at 02:26:27PM +0200, Jan Kara wrote:
> That's a good point. Thanks for the explanation. Maybe add a comment like:
> /*
> * We use this seemingly complicated 'for' loop so that 'break' and
> * 'continue' continue to work as expected.
> */

This kinda feels superfluous to me. This is something true for all
iteration wrappers and falls within the area of well-established
convention, I think. If it's doing something weird like combining an
if-else clause to do post-conditional processing, sure, but this is
really kinda standard.

Thanks.

--
tejun

2015-07-03 17:07:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 30/51] writeback: implement and use inode_congested()

On Fri, Jul 03, 2015 at 02:17:21PM +0200, Jan Kara wrote:
> On Wed 01-07-15 21:46:34, Tejun Heo wrote:
> > Hello,
> >
> > On Tue, Jun 30, 2015 at 05:21:05PM +0200, Jan Kara wrote:
> > > Hum, is there any point in supporting NULL inode with inode_congested()?
> > > That would look more like a programming bug than anything... Otherwise the
> > > patch looks good to me so you can add:
> >
> > Those are inherited from the existing usages and all for swapper
> > space. I think we should have a dummy inode instead of scattering
> > NULL mapping->host test all over the place but that's for another day.
>
> Ah, OK. A comment about this would be nice.

Will add.

Thanks!

--
tejun

2015-07-03 17:14:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 22/51] writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK

Hello,

On Fri, Jul 03, 2015 at 12:49:57PM +0200, Jan Kara wrote:
> Well, unless there is some specific mapping for the device, we could just
> fall back to attributing everything to the root cgroup. We would still
> account dirty pages in memcg, throttle writers in memcg when there are too
> many dirty pages, issue writeback for inodes in memcg with enough dirty
> pages etc. Just all IO from different memcgs would be equal so no
> separation would be there. But it would still seem better that just
> ignoring the split of dirty pages among memcgs as we do now... Thoughts?

Sure, if you mark a bdi as capable of supporting cgroup writeback
without enforcing any IO isolation, the above would be what's
happening. I'm not convinced this would be something actually useful
tho. Sure, it changes the behavior but it's still gonna be a crapshoot.

Thanks.

--
tejun

2015-07-03 22:12:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH block/for-4.3] writeback: remove wb_writeback_work->single_wait/done

Hello, Jan.

So, something like the following. It depends on other changes so
won't apply as-is. I'll repost it as part of a patchset once -rc1
drops.

Thanks!

------ 8< ------
wb_writeback_work->single_wait/done are used for the wait mechanism
for synchronous wb_work (wb_writeback_work) items which are issued
when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
wb_work items; however, there's no reason to use a separate wait
mechanism for this. bdi_split_work_to_wbs() can simply use an on-stack
fallback wb_work item and a separate wb_completion to wait for it.

This patch removes wb_work->single_wait/done and the related code and
makes bdi_split_work_to_wbs() use an on-stack fallback wb_work and
wb_completion instead.

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 116 +++++++++++++-----------------------------------------
1 file changed, 30 insertions(+), 86 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -53,8 +53,6 @@ struct wb_writeback_work {
unsigned int for_background:1;
unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
unsigned int auto_free:1; /* free on completion */
- unsigned int single_wait:1;
- unsigned int single_done:1;
enum wb_reason reason; /* why was writeback initiated? */

struct list_head list; /* pending work list */
@@ -181,11 +179,8 @@ static void wb_queue_work(struct bdi_wri
trace_writeback_queue(wb->bdi, work);

spin_lock_bh(&wb->work_lock);
- if (!test_bit(WB_registered, &wb->state)) {
- if (work->single_wait)
- work->single_done = 1;
+ if (!test_bit(WB_registered, &wb->state))
goto out_unlock;
- }
if (work->done)
atomic_inc(&work->done->cnt);
list_add_tail(&work->list, &wb->work_list);
@@ -737,32 +732,6 @@ int inode_congested(struct inode *inode,
EXPORT_SYMBOL_GPL(inode_congested);

/**
- * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
- * @bdi: bdi the work item was issued to
- * @work: work item to wait for
- *
- * Wait for the completion of @work which was issued to one of @bdi's
- * bdi_writeback's. The caller must have set @work->single_wait before
- * issuing it. This wait operates independently fo
- * wb_wait_for_completion() and also disables automatic freeing of @work.
- */
-static void wb_wait_for_single_work(struct backing_dev_info *bdi,
- struct wb_writeback_work *work)
-{
- if (WARN_ON_ONCE(!work->single_wait))
- return;
-
- wait_event(bdi->wb_waitq, work->single_done);
-
- /*
- * Paired with smp_wmb() in wb_do_writeback() and ensures that all
- * modifications to @work prior to assertion of ->single_done is
- * visible to the caller once this function returns.
- */
- smp_rmb();
-}
-
-/**
* wb_split_bdi_pages - split nr_pages to write according to bandwidth
* @wb: target bdi_writeback to split @nr_pages to
* @nr_pages: number of pages to write for the whole bdi
@@ -791,38 +760,6 @@ static long wb_split_bdi_pages(struct bd
}

/**
- * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb
- * @wb: target bdi_writeback
- * @base_work: source wb_writeback_work
- *
- * Try to make a clone of @base_work and issue it to @wb. If cloning
- * succeeds, %true is returned; otherwise, @base_work is issued directly
- * and %false is returned. In the latter case, the caller is required to
- * wait for @base_work's completion using wb_wait_for_single_work().
- *
- * A clone is auto-freed on completion. @base_work never is.
- */
-static bool wb_clone_and_queue_work(struct bdi_writeback *wb,
- struct wb_writeback_work *base_work)
-{
- struct wb_writeback_work *work;
-
- work = kmalloc(sizeof(*work), GFP_ATOMIC);
- if (work) {
- *work = *base_work;
- work->auto_free = 1;
- work->single_wait = 0;
- } else {
- work = base_work;
- work->auto_free = 0;
- work->single_wait = 1;
- }
- work->single_done = 0;
- wb_queue_work(wb, work);
- return work != base_work;
-}
-
-/**
* bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi
* @bdi: target backing_dev_info
* @base_work: wb_writeback_work to issue
@@ -837,7 +774,6 @@ static void bdi_split_work_to_wbs(struct
struct wb_writeback_work *base_work,
bool skip_if_busy)
{
- long nr_pages = base_work->nr_pages;
int next_memcg_id = 0;
struct bdi_writeback *wb;
struct wb_iter iter;
@@ -849,17 +785,39 @@ static void bdi_split_work_to_wbs(struct
restart:
rcu_read_lock();
bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
+ DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
+ struct wb_writeback_work fallback_work;
+ struct wb_writeback_work *work;
+ long nr_pages;
+
if (!wb_has_dirty_io(wb) ||
(skip_if_busy && writeback_in_progress(wb)))
continue;

- base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages);
- if (!wb_clone_and_queue_work(wb, base_work)) {
- next_memcg_id = wb->memcg_css->id + 1;
- rcu_read_unlock();
- wb_wait_for_single_work(bdi, base_work);
- goto restart;
+ nr_pages = wb_split_bdi_pages(wb, base_work->nr_pages);
+
+ work = kmalloc(sizeof(*work), GFP_ATOMIC);
+ if (work) {
+ *work = *base_work;
+ work->nr_pages = nr_pages;
+ work->auto_free = 1;
+ wb_queue_work(wb, work);
+ continue;
}
+
+ /* alloc failed, execute synchronously using on-stack fallback */
+ work = &fallback_work;
+ *work = *base_work;
+ work->nr_pages = nr_pages;
+ work->auto_free = 0;
+ work->done = &fallback_work_done;
+
+ wb_queue_work(wb, work);
+
+ next_memcg_id = wb->memcg_css->id + 1;
+ rcu_read_unlock();
+ wb_wait_for_completion(bdi, &fallback_work_done);
+ goto restart;
}
rcu_read_unlock();
}
@@ -901,8 +859,6 @@ static void bdi_split_work_to_wbs(struct
if (bdi_has_dirty_io(bdi) &&
(!skip_if_busy || !writeback_in_progress(&bdi->wb))) {
base_work->auto_free = 0;
- base_work->single_wait = 0;
- base_work->single_done = 0;
wb_queue_work(&bdi->wb, base_work);
}
}
@@ -1793,26 +1749,14 @@ static long wb_do_writeback(struct bdi_w
set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(wb)) != NULL) {
struct wb_completion *done = work->done;
- bool need_wake_up = false;

trace_writeback_exec(wb->bdi, work);

wrote += wb_writeback(wb, work);

- if (work->single_wait) {
- WARN_ON_ONCE(work->auto_free);
- /* paired w/ rmb in wb_wait_for_single_work() */
- smp_wmb();
- work->single_done = 1;
- need_wake_up = true;
- } else if (work->auto_free) {
+ if (work->auto_free)
kfree(work);
- }
-
if (done && atomic_dec_and_test(&done->cnt))
- need_wake_up = true;
-
- if (need_wake_up)
wake_up_all(&wb->bdi->wb_waitq);
}

2015-07-05 10:30:34

by Tejun Heo

[permalink] [raw]
Subject: [PATCH block/for-4.3] writeback: explain why @inode is allowed to be NULL for inode_congested()

Signed-off-by: Tejun Heo <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
Hello,

So, something like this. I'll resend this patch as part of a patch
series once -rc1 drops.

Thanks.

fs/fs-writeback.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -700,7 +700,7 @@ void wbc_account_io(struct writeback_con

/**
* inode_congested - test whether an inode is congested
- * @inode: inode to test for congestion
+ * @inode: inode to test for congestion (may be NULL)
* @cong_bits: mask of WB_[a]sync_congested bits to test
*
* Tests whether @inode is congested. @cong_bits is the mask of congestion
@@ -710,6 +710,9 @@ void wbc_account_io(struct writeback_con
* determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
* associated with @inode is congested; otherwise, the root wb's congestion
* state is used.
+ *
+ * @inode is allowed to be NULL as this function is often called on
+ * mapping->host which is NULL for the swapper space.
*/
int inode_congested(struct inode *inode, int cong_bits)
{

2015-07-06 19:36:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH block/for-4.3] writeback: update writeback tracepoints to report cgroup

The following tracepoints are updated to report the cgroup used during
cgroup writeback.

* writeback_write_inode[_start]
* writeback_queue
* writeback_exec
* writeback_start
* writeback_written
* writeback_wait
* writeback_nowork
* writeback_wake_background
* wbc_writepage
* writeback_queue_io
* bdi_dirty_ratelimit
* balance_dirty_pages
* writeback_sb_inodes_requeue
* writeback_single_inode[_start]

Note that writeback_bdi_register is separated out from writeback_class
as reporting cgroup doesn't make sense to it. Tracepoints which take
bdi are updated to take bdi_writeback instead.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jan Kara <[email protected]>
---
Hello,

Will soon post this as part of a patch series of cgroup writeback
updates.

Thanks.

fs/fs-writeback.c | 14 +--
include/trace/events/writeback.h | 180 ++++++++++++++++++++++++++++++---------
mm/page-writeback.c | 6 -
3 files changed, 151 insertions(+), 49 deletions(-)

--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -176,7 +176,7 @@ static void wb_wakeup(struct bdi_writeba
static void wb_queue_work(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
- trace_writeback_queue(wb->bdi, work);
+ trace_writeback_queue(wb, work);

spin_lock_bh(&wb->work_lock);
if (!test_bit(WB_registered, &wb->state))
@@ -882,7 +882,7 @@ void wb_start_writeback(struct bdi_write
*/
work = kzalloc(sizeof(*work), GFP_ATOMIC);
if (!work) {
- trace_writeback_nowork(wb->bdi);
+ trace_writeback_nowork(wb);
wb_wakeup(wb);
return;
}
@@ -912,7 +912,7 @@ void wb_start_background_writeback(struc
* We just wake up the flusher thread. It will perform background
* writeback as soon as there is no other work to do.
*/
- trace_writeback_wake_background(wb->bdi);
+ trace_writeback_wake_background(wb);
wb_wakeup(wb);
}

@@ -1615,14 +1615,14 @@ static long wb_writeback(struct bdi_writ
} else if (work->for_background)
oldest_jif = jiffies;

- trace_writeback_start(wb->bdi, work);
+ trace_writeback_start(wb, work);
if (list_empty(&wb->b_io))
queue_io(wb, work);
if (work->sb)
progress = writeback_sb_inodes(work->sb, wb, work);
else
progress = __writeback_inodes_wb(wb, work);
- trace_writeback_written(wb->bdi, work);
+ trace_writeback_written(wb, work);

wb_update_bandwidth(wb, wb_start);

@@ -1647,7 +1647,7 @@ static long wb_writeback(struct bdi_writ
* we'll just busyloop.
*/
if (!list_empty(&wb->b_more_io)) {
- trace_writeback_wait(wb->bdi, work);
+ trace_writeback_wait(wb, work);
inode = wb_inode(wb->b_more_io.prev);
spin_lock(&inode->i_lock);
spin_unlock(&wb->list_lock);
@@ -1753,7 +1753,7 @@ static long wb_do_writeback(struct bdi_w
while ((work = get_next_work_item(wb)) != NULL) {
struct wb_completion *done = work->done;

- trace_writeback_exec(wb->bdi, work);
+ trace_writeback_exec(wb, work);

wrote += wb_writeback(wb, work);

--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -131,6 +131,66 @@ DEFINE_EVENT(writeback_dirty_inode_templ
TP_ARGS(inode, flags)
);

+#ifdef CREATE_TRACE_POINTS
+#ifdef CONFIG_CGROUP_WRITEBACK
+
+static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
+{
+ return kernfs_path_len(wb->memcg_css->cgroup->kn) + 1;
+}
+
+static inline void __trace_wb_assign_cgroup(char *buf, struct bdi_writeback *wb)
+{
+ struct cgroup *cgrp = wb->memcg_css->cgroup;
+ char *path;
+
+ path = cgroup_path(cgrp, buf, kernfs_path_len(cgrp->kn) + 1);
+ WARN_ON_ONCE(path != buf);
+}
+
+static inline size_t __trace_wbc_cgroup_size(struct writeback_control *wbc)
+{
+ if (wbc->wb)
+ return __trace_wb_cgroup_size(wbc->wb);
+ else
+ return 2;
+}
+
+static inline void __trace_wbc_assign_cgroup(char *buf,
+ struct writeback_control *wbc)
+{
+ if (wbc->wb)
+ __trace_wb_assign_cgroup(buf, wbc->wb);
+ else
+ strcpy(buf, "/");
+}
+
+#else /* CONFIG_CGROUP_WRITEBACK */
+
+static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
+{
+ return 2;
+}
+
+static inline void __trace_wb_assign_cgroup(char *buf, struct bdi_writeback *wb)
+{
+ strcpy(buf, "/");
+}
+
+static inline size_t __trace_wbc_cgroup_size(struct writeback_control *wbc)
+{
+ return 2;
+}
+
+static inline void __trace_wbc_assign_cgroup(char *buf,
+ struct writeback_control *wbc)
+{
+ strcpy(buf, "/");
+}
+
+#endif /* CONFIG_CGROUP_WRITEBACK */
+#endif /* CREATE_TRACE_POINTS */
+
DECLARE_EVENT_CLASS(writeback_write_inode_template,

TP_PROTO(struct inode *inode, struct writeback_control *wbc),
@@ -141,6 +201,7 @@ DECLARE_EVENT_CLASS(writeback_write_inod
__array(char, name, 32)
__field(unsigned long, ino)
__field(int, sync_mode)
+ __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
),

TP_fast_assign(
@@ -148,12 +209,14 @@ DECLARE_EVENT_CLASS(writeback_write_inod
dev_name(inode_to_bdi(inode)->dev), 32);
__entry->ino = inode->i_ino;
__entry->sync_mode = wbc->sync_mode;
+ __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
),

- TP_printk("bdi %s: ino=%lu sync_mode=%d",
+ TP_printk("bdi %s: ino=%lu sync_mode=%d cgroup=%s",
__entry->name,
__entry->ino,
- __entry->sync_mode
+ __entry->sync_mode,
+ __get_str(cgroup)
)
);

@@ -172,8 +235,8 @@ DEFINE_EVENT(writeback_write_inode_templ
);

DECLARE_EVENT_CLASS(writeback_work_class,
- TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
- TP_ARGS(bdi, work),
+ TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
+ TP_ARGS(wb, work),
TP_STRUCT__entry(
__array(char, name, 32)
__field(long, nr_pages)
@@ -183,10 +246,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
__field(int, range_cyclic)
__field(int, for_background)
__field(int, reason)
+ __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
),
TP_fast_assign(
strncpy(__entry->name,
- bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
+ wb->bdi->dev ? dev_name(wb->bdi->dev) : "(unknown)", 32);
__entry->nr_pages = work->nr_pages;
__entry->sb_dev = work->sb ? work->sb->s_dev : 0;
__entry->sync_mode = work->sync_mode;
@@ -194,9 +258,10 @@ DECLARE_EVENT_CLASS(writeback_work_class
__entry->range_cyclic = work->range_cyclic;
__entry->for_background = work->for_background;
__entry->reason = work->reason;
+ __trace_wb_assign_cgroup(__get_str(cgroup), wb);
),
TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
- "kupdate=%d range_cyclic=%d background=%d reason=%s",
+ "kupdate=%d range_cyclic=%d background=%d reason=%s cgroup=%s",
__entry->name,
MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
__entry->nr_pages,
@@ -204,13 +269,14 @@ DECLARE_EVENT_CLASS(writeback_work_class
__entry->for_kupdate,
__entry->range_cyclic,
__entry->for_background,
- __print_symbolic(__entry->reason, WB_WORK_REASON)
+ __print_symbolic(__entry->reason, WB_WORK_REASON),
+ __get_str(cgroup)
)
);
#define DEFINE_WRITEBACK_WORK_EVENT(name) \
DEFINE_EVENT(writeback_work_class, name, \
- TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work), \
- TP_ARGS(bdi, work))
+ TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), \
+ TP_ARGS(wb, work))
DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
@@ -230,26 +296,42 @@ TRACE_EVENT(writeback_pages_written,
);

DECLARE_EVENT_CLASS(writeback_class,
- TP_PROTO(struct backing_dev_info *bdi),
- TP_ARGS(bdi),
+ TP_PROTO(struct bdi_writeback *wb),
+ TP_ARGS(wb),
TP_STRUCT__entry(
__array(char, name, 32)
+ __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
),
TP_fast_assign(
- strncpy(__entry->name, dev_name(bdi->dev), 32);
+ strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
+ __trace_wb_assign_cgroup(__get_str(cgroup), wb);
),
- TP_printk("bdi %s",
- __entry->name
+ TP_printk("bdi %s: cgroup=%s",
+ __entry->name,
+ __get_str(cgroup)
)
);
#define DEFINE_WRITEBACK_EVENT(name) \
DEFINE_EVENT(writeback_class, name, \
- TP_PROTO(struct backing_dev_info *bdi), \
- TP_ARGS(bdi))
+ TP_PROTO(struct bdi_writeback *wb), \
+ TP_ARGS(wb))

DEFINE_WRITEBACK_EVENT(writeback_nowork);
DEFINE_WRITEBACK_EVENT(writeback_wake_background);
-DEFINE_WRITEBACK_EVENT(writeback_bdi_register);
+
+TRACE_EVENT(writeback_bdi_register,
+ TP_PROTO(struct backing_dev_info *bdi),
+ TP_ARGS(bdi),
+ TP_STRUCT__entry(
+ __array(char, name, 32)
+ ),
+ TP_fast_assign(
+ strncpy(__entry->name, dev_name(bdi->dev), 32);
+ ),
+ TP_printk("bdi %s",
+ __entry->name
+ )
+);

DECLARE_EVENT_CLASS(wbc_class,
TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
@@ -265,6 +347,7 @@ DECLARE_EVENT_CLASS(wbc_class,
__field(int, range_cyclic)
__field(long, range_start)
__field(long, range_end)
+ __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
),

TP_fast_assign(
@@ -278,11 +361,12 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->range_cyclic = wbc->range_cyclic;
__entry->range_start = (long)wbc->range_start;
__entry->range_end = (long)wbc->range_end;
+ __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
),

TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
"bgrd=%d reclm=%d cyclic=%d "
- "start=0x%lx end=0x%lx",
+ "start=0x%lx end=0x%lx cgroup=%s",
__entry->name,
__entry->nr_to_write,
__entry->pages_skipped,
@@ -292,7 +376,9 @@ DECLARE_EVENT_CLASS(wbc_class,
__entry->for_reclaim,
__entry->range_cyclic,
__entry->range_start,
- __entry->range_end)
+ __entry->range_end,
+ __get_str(cgroup)
+ )
)

#define DEFINE_WBC_EVENT(name) \
@@ -312,6 +398,7 @@ TRACE_EVENT(writeback_queue_io,
__field(long, age)
__field(int, moved)
__field(int, reason)
+ __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
),
TP_fast_assign(
unsigned long *older_than_this = work->older_than_this;
@@ -321,13 +408,15 @@ TRACE_EVENT(writeback_queue_io,
(jiffies - *older_than_this) * 1000 / HZ : -1;
__entry->moved = moved;
__entry->reason = work->reason;
+ __trace_wb_assign_cgroup(__get_str(cgroup), wb);
),
- TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s",
+ TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s cgroup=%s",
__entry->name,
__entry->older, /* older_than_this in jiffies */
__entry->age, /* older_than_this in relative milliseconds */
__entry->moved,
- __print_symbolic(__entry->reason, WB_WORK_REASON)
+ __print_symbolic(__entry->reason, WB_WORK_REASON),
+ __get_str(cgroup)
)
);

@@ -381,11 +470,11 @@ TRACE_EVENT(global_dirty_state,

TRACE_EVENT(bdi_dirty_ratelimit,

- TP_PROTO(struct backing_dev_info *bdi,
+ TP_PROTO(struct bdi_writeback *wb,
unsigned long dirty_rate,
unsigned long task_ratelimit),

- TP_ARGS(bdi, dirty_rate, task_ratelimit),
+ TP_ARGS(wb, dirty_rate, task_ratelimit),

TP_STRUCT__entry(
__array(char, bdi, 32)
@@ -395,36 +484,39 @@ TRACE_EVENT(bdi_dirty_ratelimit,
__field(unsigned long, dirty_ratelimit)
__field(unsigned long, task_ratelimit)
__field(unsigned long, balanced_dirty_ratelimit)
+ __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
),

TP_fast_assign(
- strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
- __entry->write_bw = KBps(bdi->wb.write_bandwidth);
- __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
+ strlcpy(__entry->bdi, dev_name(wb->bdi->dev), 32);
+ __entry->write_bw = KBps(wb->write_bandwidth);
+ __entry->avg_write_bw = KBps(wb->avg_write_bandwidth);
__entry->dirty_rate = KBps(dirty_rate);
- __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
+ __entry->dirty_ratelimit = KBps(wb->dirty_ratelimit);
__entry->task_ratelimit = KBps(task_ratelimit);
__entry->balanced_dirty_ratelimit =
- KBps(bdi->wb.balanced_dirty_ratelimit);
+ KBps(wb->balanced_dirty_ratelimit);
+ __trace_wb_assign_cgroup(__get_str(cgroup), wb);
),

TP_printk("bdi %s: "
"write_bw=%lu awrite_bw=%lu dirty_rate=%lu "
"dirty_ratelimit=%lu task_ratelimit=%lu "
- "balanced_dirty_ratelimit=%lu",
+ "balanced_dirty_ratelimit=%lu cgroup=%s",
__entry->bdi,
__entry->write_bw, /* write bandwidth */
__entry->avg_write_bw, /* avg write bandwidth */
__entry->dirty_rate, /* bdi dirty rate */
__entry->dirty_ratelimit, /* base ratelimit */
__entry->task_ratelimit, /* ratelimit with position control */
- __entry->balanced_dirty_ratelimit /* the balanced ratelimit */
+ __entry->balanced_dirty_ratelimit, /* the balanced ratelimit */
+ __get_str(cgroup)
)
);

TRACE_EVENT(balance_dirty_pages,

- TP_PROTO(struct backing_dev_info *bdi,
+ TP_PROTO(struct bdi_writeback *wb,
unsigned long thresh,
unsigned long bg_thresh,
unsigned long dirty,
@@ -437,7 +529,7 @@ TRACE_EVENT(balance_dirty_pages,
long pause,
unsigned long start_time),

- TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
+ TP_ARGS(wb, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
dirty_ratelimit, task_ratelimit,
dirtied, period, pause, start_time),

@@ -456,11 +548,12 @@ TRACE_EVENT(balance_dirty_pages,
__field( long, pause)
__field(unsigned long, period)
__field( long, think)
+ __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
),

TP_fast_assign(
unsigned long freerun = (thresh + bg_thresh) / 2;
- strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
+ strlcpy(__entry->bdi, dev_name(wb->bdi->dev), 32);

__entry->limit = global_wb_domain.dirty_limit;
__entry->setpoint = (global_wb_domain.dirty_limit +
@@ -478,6 +571,7 @@ TRACE_EVENT(balance_dirty_pages,
__entry->period = period * 1000 / HZ;
__entry->pause = pause * 1000 / HZ;
__entry->paused = (jiffies - start_time) * 1000 / HZ;
+ __trace_wb_assign_cgroup(__get_str(cgroup), wb);
),


@@ -486,7 +580,7 @@ TRACE_EVENT(balance_dirty_pages,
"bdi_setpoint=%lu bdi_dirty=%lu "
"dirty_ratelimit=%lu task_ratelimit=%lu "
"dirtied=%u dirtied_pause=%u "
- "paused=%lu pause=%ld period=%lu think=%ld",
+ "paused=%lu pause=%ld period=%lu think=%ld cgroup=%s",
__entry->bdi,
__entry->limit,
__entry->setpoint,
@@ -500,7 +594,8 @@ TRACE_EVENT(balance_dirty_pages,
__entry->paused, /* ms */
__entry->pause, /* ms */
__entry->period, /* ms */
- __entry->think /* ms */
+ __entry->think, /* ms */
+ __get_str(cgroup)
)
);

@@ -514,6 +609,8 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
__field(unsigned long, ino)
__field(unsigned long, state)
__field(unsigned long, dirtied_when)
+ __dynamic_array(char, cgroup,
+ __trace_wb_cgroup_size(inode_to_wb(inode)))
),

TP_fast_assign(
@@ -522,14 +619,16 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
__entry->ino = inode->i_ino;
__entry->state = inode->i_state;
__entry->dirtied_when = inode->dirtied_when;
+ __trace_wb_assign_cgroup(__get_str(cgroup), inode_to_wb(inode));
),

- TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu",
+ TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu cgroup=%s",
__entry->name,
__entry->ino,
show_inode_state(__entry->state),
__entry->dirtied_when,
- (jiffies - __entry->dirtied_when) / HZ
+ (jiffies - __entry->dirtied_when) / HZ,
+ __get_str(cgroup)
)
);

@@ -585,6 +684,7 @@ DECLARE_EVENT_CLASS(writeback_single_ino
__field(unsigned long, writeback_index)
__field(long, nr_to_write)
__field(unsigned long, wrote)
+ __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
),

TP_fast_assign(
@@ -596,10 +696,11 @@ DECLARE_EVENT_CLASS(writeback_single_ino
__entry->writeback_index = inode->i_mapping->writeback_index;
__entry->nr_to_write = nr_to_write;
__entry->wrote = nr_to_write - wbc->nr_to_write;
+ __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
),

TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu "
- "index=%lu to_write=%ld wrote=%lu",
+ "index=%lu to_write=%ld wrote=%lu cgroup=%s",
__entry->name,
__entry->ino,
show_inode_state(__entry->state),
@@ -607,7 +708,8 @@ DECLARE_EVENT_CLASS(writeback_single_ino
(jiffies - __entry->dirtied_when) / HZ,
__entry->writeback_index,
__entry->nr_to_write,
- __entry->wrote
+ __entry->wrote,
+ __get_str(cgroup)
)
);

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1289,7 +1289,7 @@ static void wb_update_dirty_ratelimit(st
wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;

- trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
+ trace_bdi_dirty_ratelimit(wb, dirty_rate, task_ratelimit);
}

static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
@@ -1683,7 +1683,7 @@ static void balance_dirty_pages(struct a
* do a reset, as it may be a light dirtier.
*/
if (pause < min_pause) {
- trace_balance_dirty_pages(bdi,
+ trace_balance_dirty_pages(wb,
sdtc->thresh,
sdtc->bg_thresh,
sdtc->dirty,
@@ -1712,7 +1712,7 @@ static void balance_dirty_pages(struct a
}

pause:
- trace_balance_dirty_pages(bdi,
+ trace_balance_dirty_pages(wb,
sdtc->thresh,
sdtc->bg_thresh,
sdtc->dirty,

2015-07-08 08:13:55

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH block/for-4.3] writeback: explain why @inode is allowed to be NULL for inode_congested()

On Sat 04-07-15 11:12:00, Tejun Heo wrote:
> Signed-off-by: Tejun Heo <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
> ---
> Hello,
>
> So, something like this. I'll resend this patch as part of a patch
> series once -rc1 drops.
Looks good. Thanks!

Honza

> fs/fs-writeback.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -700,7 +700,7 @@ void wbc_account_io(struct writeback_con
>
> /**
> * inode_congested - test whether an inode is congested
> - * @inode: inode to test for congestion
> + * @inode: inode to test for congestion (may be NULL)
> * @cong_bits: mask of WB_[a]sync_congested bits to test
> *
> * Tests whether @inode is congested. @cong_bits is the mask of congestion
> @@ -710,6 +710,9 @@ void wbc_account_io(struct writeback_con
> * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg
> * associated with @inode is congested; otherwise, the root wb's congestion
> * state is used.
> + *
> + * @inode is allowed to be NULL as this function is often called on
> + * mapping->host which is NULL for the swapper space.
> */
> int inode_congested(struct inode *inode, int cong_bits)
> {
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-08 08:20:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH block/for-4.3] writeback: update writeback tracepoints to report cgroup

On Mon 06-07-15 15:36:42, Tejun Heo wrote:
> The following tracepoints are updated to report the cgroup used during
> cgroup writeback.
>
> * writeback_write_inode[_start]
> * writeback_queue
> * writeback_exec
> * writeback_start
> * writeback_written
> * writeback_wait
> * writeback_nowork
> * writeback_wake_background
> * wbc_writepage
> * writeback_queue_io
> * bdi_dirty_ratelimit
> * balance_dirty_pages
> * writeback_sb_inodes_requeue
> * writeback_single_inode[_start]
>
> Note that writeback_bdi_register is separated out from writeback_class
> as reporting cgroup doesn't make sense to it. Tracepoints which take
> bdi are updated to take bdi_writeback instead.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jan Kara <[email protected]>
> ---
> Hello,
>
> Will soon post this as part of a patch series of cgroup writeback
> updates.

Thanks. The patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza


> fs/fs-writeback.c | 14 +--
> include/trace/events/writeback.h | 180 ++++++++++++++++++++++++++++++---------
> mm/page-writeback.c | 6 -
> 3 files changed, 151 insertions(+), 49 deletions(-)
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -176,7 +176,7 @@ static void wb_wakeup(struct bdi_writeba
> static void wb_queue_work(struct bdi_writeback *wb,
> struct wb_writeback_work *work)
> {
> - trace_writeback_queue(wb->bdi, work);
> + trace_writeback_queue(wb, work);
>
> spin_lock_bh(&wb->work_lock);
> if (!test_bit(WB_registered, &wb->state))
> @@ -882,7 +882,7 @@ void wb_start_writeback(struct bdi_write
> */
> work = kzalloc(sizeof(*work), GFP_ATOMIC);
> if (!work) {
> - trace_writeback_nowork(wb->bdi);
> + trace_writeback_nowork(wb);
> wb_wakeup(wb);
> return;
> }
> @@ -912,7 +912,7 @@ void wb_start_background_writeback(struc
> * We just wake up the flusher thread. It will perform background
> * writeback as soon as there is no other work to do.
> */
> - trace_writeback_wake_background(wb->bdi);
> + trace_writeback_wake_background(wb);
> wb_wakeup(wb);
> }
>
> @@ -1615,14 +1615,14 @@ static long wb_writeback(struct bdi_writ
> } else if (work->for_background)
> oldest_jif = jiffies;
>
> - trace_writeback_start(wb->bdi, work);
> + trace_writeback_start(wb, work);
> if (list_empty(&wb->b_io))
> queue_io(wb, work);
> if (work->sb)
> progress = writeback_sb_inodes(work->sb, wb, work);
> else
> progress = __writeback_inodes_wb(wb, work);
> - trace_writeback_written(wb->bdi, work);
> + trace_writeback_written(wb, work);
>
> wb_update_bandwidth(wb, wb_start);
>
> @@ -1647,7 +1647,7 @@ static long wb_writeback(struct bdi_writ
> * we'll just busyloop.
> */
> if (!list_empty(&wb->b_more_io)) {
> - trace_writeback_wait(wb->bdi, work);
> + trace_writeback_wait(wb, work);
> inode = wb_inode(wb->b_more_io.prev);
> spin_lock(&inode->i_lock);
> spin_unlock(&wb->list_lock);
> @@ -1753,7 +1753,7 @@ static long wb_do_writeback(struct bdi_w
> while ((work = get_next_work_item(wb)) != NULL) {
> struct wb_completion *done = work->done;
>
> - trace_writeback_exec(wb->bdi, work);
> + trace_writeback_exec(wb, work);
>
> wrote += wb_writeback(wb, work);
>
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -131,6 +131,66 @@ DEFINE_EVENT(writeback_dirty_inode_templ
> TP_ARGS(inode, flags)
> );
>
> +#ifdef CREATE_TRACE_POINTS
> +#ifdef CONFIG_CGROUP_WRITEBACK
> +
> +static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
> +{
> + return kernfs_path_len(wb->memcg_css->cgroup->kn) + 1;
> +}
> +
> +static inline void __trace_wb_assign_cgroup(char *buf, struct bdi_writeback *wb)
> +{
> + struct cgroup *cgrp = wb->memcg_css->cgroup;
> + char *path;
> +
> + path = cgroup_path(cgrp, buf, kernfs_path_len(cgrp->kn) + 1);
> + WARN_ON_ONCE(path != buf);
> +}
> +
> +static inline size_t __trace_wbc_cgroup_size(struct writeback_control *wbc)
> +{
> + if (wbc->wb)
> + return __trace_wb_cgroup_size(wbc->wb);
> + else
> + return 2;
> +}
> +
> +static inline void __trace_wbc_assign_cgroup(char *buf,
> + struct writeback_control *wbc)
> +{
> + if (wbc->wb)
> + __trace_wb_assign_cgroup(buf, wbc->wb);
> + else
> + strcpy(buf, "/");
> +}
> +
> +#else /* CONFIG_CGROUP_WRITEBACK */
> +
> +static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
> +{
> + return 2;
> +}
> +
> +static inline void __trace_wb_assign_cgroup(char *buf, struct bdi_writeback *wb)
> +{
> + strcpy(buf, "/");
> +}
> +
> +static inline size_t __trace_wbc_cgroup_size(struct writeback_control *wbc)
> +{
> + return 2;
> +}
> +
> +static inline void __trace_wbc_assign_cgroup(char *buf,
> + struct writeback_control *wbc)
> +{
> + strcpy(buf, "/");
> +}
> +
> +#endif /* CONFIG_CGROUP_WRITEBACK */
> +#endif /* CREATE_TRACE_POINTS */
> +
> DECLARE_EVENT_CLASS(writeback_write_inode_template,
>
> TP_PROTO(struct inode *inode, struct writeback_control *wbc),
> @@ -141,6 +201,7 @@ DECLARE_EVENT_CLASS(writeback_write_inod
> __array(char, name, 32)
> __field(unsigned long, ino)
> __field(int, sync_mode)
> + __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
> ),
>
> TP_fast_assign(
> @@ -148,12 +209,14 @@ DECLARE_EVENT_CLASS(writeback_write_inod
> dev_name(inode_to_bdi(inode)->dev), 32);
> __entry->ino = inode->i_ino;
> __entry->sync_mode = wbc->sync_mode;
> + __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
> ),
>
> - TP_printk("bdi %s: ino=%lu sync_mode=%d",
> + TP_printk("bdi %s: ino=%lu sync_mode=%d cgroup=%s",
> __entry->name,
> __entry->ino,
> - __entry->sync_mode
> + __entry->sync_mode,
> + __get_str(cgroup)
> )
> );
>
> @@ -172,8 +235,8 @@ DEFINE_EVENT(writeback_write_inode_templ
> );
>
> DECLARE_EVENT_CLASS(writeback_work_class,
> - TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work),
> - TP_ARGS(bdi, work),
> + TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
> + TP_ARGS(wb, work),
> TP_STRUCT__entry(
> __array(char, name, 32)
> __field(long, nr_pages)
> @@ -183,10 +246,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
> __field(int, range_cyclic)
> __field(int, for_background)
> __field(int, reason)
> + __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> ),
> TP_fast_assign(
> strncpy(__entry->name,
> - bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
> + wb->bdi->dev ? dev_name(wb->bdi->dev) : "(unknown)", 32);
> __entry->nr_pages = work->nr_pages;
> __entry->sb_dev = work->sb ? work->sb->s_dev : 0;
> __entry->sync_mode = work->sync_mode;
> @@ -194,9 +258,10 @@ DECLARE_EVENT_CLASS(writeback_work_class
> __entry->range_cyclic = work->range_cyclic;
> __entry->for_background = work->for_background;
> __entry->reason = work->reason;
> + __trace_wb_assign_cgroup(__get_str(cgroup), wb);
> ),
> TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
> - "kupdate=%d range_cyclic=%d background=%d reason=%s",
> + "kupdate=%d range_cyclic=%d background=%d reason=%s cgroup=%s",
> __entry->name,
> MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
> __entry->nr_pages,
> @@ -204,13 +269,14 @@ DECLARE_EVENT_CLASS(writeback_work_class
> __entry->for_kupdate,
> __entry->range_cyclic,
> __entry->for_background,
> - __print_symbolic(__entry->reason, WB_WORK_REASON)
> + __print_symbolic(__entry->reason, WB_WORK_REASON),
> + __get_str(cgroup)
> )
> );
> #define DEFINE_WRITEBACK_WORK_EVENT(name) \
> DEFINE_EVENT(writeback_work_class, name, \
> - TP_PROTO(struct backing_dev_info *bdi, struct wb_writeback_work *work), \
> - TP_ARGS(bdi, work))
> + TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work), \
> + TP_ARGS(wb, work))
> DEFINE_WRITEBACK_WORK_EVENT(writeback_queue);
> DEFINE_WRITEBACK_WORK_EVENT(writeback_exec);
> DEFINE_WRITEBACK_WORK_EVENT(writeback_start);
> @@ -230,26 +296,42 @@ TRACE_EVENT(writeback_pages_written,
> );
>
> DECLARE_EVENT_CLASS(writeback_class,
> - TP_PROTO(struct backing_dev_info *bdi),
> - TP_ARGS(bdi),
> + TP_PROTO(struct bdi_writeback *wb),
> + TP_ARGS(wb),
> TP_STRUCT__entry(
> __array(char, name, 32)
> + __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> ),
> TP_fast_assign(
> - strncpy(__entry->name, dev_name(bdi->dev), 32);
> + strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
> + __trace_wb_assign_cgroup(__get_str(cgroup), wb);
> ),
> - TP_printk("bdi %s",
> - __entry->name
> + TP_printk("bdi %s: cgroup=%s",
> + __entry->name,
> + __get_str(cgroup)
> )
> );
> #define DEFINE_WRITEBACK_EVENT(name) \
> DEFINE_EVENT(writeback_class, name, \
> - TP_PROTO(struct backing_dev_info *bdi), \
> - TP_ARGS(bdi))
> + TP_PROTO(struct bdi_writeback *wb), \
> + TP_ARGS(wb))
>
> DEFINE_WRITEBACK_EVENT(writeback_nowork);
> DEFINE_WRITEBACK_EVENT(writeback_wake_background);
> -DEFINE_WRITEBACK_EVENT(writeback_bdi_register);
> +
> +TRACE_EVENT(writeback_bdi_register,
> + TP_PROTO(struct backing_dev_info *bdi),
> + TP_ARGS(bdi),
> + TP_STRUCT__entry(
> + __array(char, name, 32)
> + ),
> + TP_fast_assign(
> + strncpy(__entry->name, dev_name(bdi->dev), 32);
> + ),
> + TP_printk("bdi %s",
> + __entry->name
> + )
> +);
>
> DECLARE_EVENT_CLASS(wbc_class,
> TP_PROTO(struct writeback_control *wbc, struct backing_dev_info *bdi),
> @@ -265,6 +347,7 @@ DECLARE_EVENT_CLASS(wbc_class,
> __field(int, range_cyclic)
> __field(long, range_start)
> __field(long, range_end)
> + __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
> ),
>
> TP_fast_assign(
> @@ -278,11 +361,12 @@ DECLARE_EVENT_CLASS(wbc_class,
> __entry->range_cyclic = wbc->range_cyclic;
> __entry->range_start = (long)wbc->range_start;
> __entry->range_end = (long)wbc->range_end;
> + __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
> ),
>
> TP_printk("bdi %s: towrt=%ld skip=%ld mode=%d kupd=%d "
> "bgrd=%d reclm=%d cyclic=%d "
> - "start=0x%lx end=0x%lx",
> + "start=0x%lx end=0x%lx cgroup=%s",
> __entry->name,
> __entry->nr_to_write,
> __entry->pages_skipped,
> @@ -292,7 +376,9 @@ DECLARE_EVENT_CLASS(wbc_class,
> __entry->for_reclaim,
> __entry->range_cyclic,
> __entry->range_start,
> - __entry->range_end)
> + __entry->range_end,
> + __get_str(cgroup)
> + )
> )
>
> #define DEFINE_WBC_EVENT(name) \
> @@ -312,6 +398,7 @@ TRACE_EVENT(writeback_queue_io,
> __field(long, age)
> __field(int, moved)
> __field(int, reason)
> + __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> ),
> TP_fast_assign(
> unsigned long *older_than_this = work->older_than_this;
> @@ -321,13 +408,15 @@ TRACE_EVENT(writeback_queue_io,
> (jiffies - *older_than_this) * 1000 / HZ : -1;
> __entry->moved = moved;
> __entry->reason = work->reason;
> + __trace_wb_assign_cgroup(__get_str(cgroup), wb);
> ),
> - TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s",
> + TP_printk("bdi %s: older=%lu age=%ld enqueue=%d reason=%s cgroup=%s",
> __entry->name,
> __entry->older, /* older_than_this in jiffies */
> __entry->age, /* older_than_this in relative milliseconds */
> __entry->moved,
> - __print_symbolic(__entry->reason, WB_WORK_REASON)
> + __print_symbolic(__entry->reason, WB_WORK_REASON),
> + __get_str(cgroup)
> )
> );
>
> @@ -381,11 +470,11 @@ TRACE_EVENT(global_dirty_state,
>
> TRACE_EVENT(bdi_dirty_ratelimit,
>
> - TP_PROTO(struct backing_dev_info *bdi,
> + TP_PROTO(struct bdi_writeback *wb,
> unsigned long dirty_rate,
> unsigned long task_ratelimit),
>
> - TP_ARGS(bdi, dirty_rate, task_ratelimit),
> + TP_ARGS(wb, dirty_rate, task_ratelimit),
>
> TP_STRUCT__entry(
> __array(char, bdi, 32)
> @@ -395,36 +484,39 @@ TRACE_EVENT(bdi_dirty_ratelimit,
> __field(unsigned long, dirty_ratelimit)
> __field(unsigned long, task_ratelimit)
> __field(unsigned long, balanced_dirty_ratelimit)
> + __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> ),
>
> TP_fast_assign(
> - strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
> - __entry->write_bw = KBps(bdi->wb.write_bandwidth);
> - __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
> + strlcpy(__entry->bdi, dev_name(wb->bdi->dev), 32);
> + __entry->write_bw = KBps(wb->write_bandwidth);
> + __entry->avg_write_bw = KBps(wb->avg_write_bandwidth);
> __entry->dirty_rate = KBps(dirty_rate);
> - __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
> + __entry->dirty_ratelimit = KBps(wb->dirty_ratelimit);
> __entry->task_ratelimit = KBps(task_ratelimit);
> __entry->balanced_dirty_ratelimit =
> - KBps(bdi->wb.balanced_dirty_ratelimit);
> + KBps(wb->balanced_dirty_ratelimit);
> + __trace_wb_assign_cgroup(__get_str(cgroup), wb);
> ),
>
> TP_printk("bdi %s: "
> "write_bw=%lu awrite_bw=%lu dirty_rate=%lu "
> "dirty_ratelimit=%lu task_ratelimit=%lu "
> - "balanced_dirty_ratelimit=%lu",
> + "balanced_dirty_ratelimit=%lu cgroup=%s",
> __entry->bdi,
> __entry->write_bw, /* write bandwidth */
> __entry->avg_write_bw, /* avg write bandwidth */
> __entry->dirty_rate, /* bdi dirty rate */
> __entry->dirty_ratelimit, /* base ratelimit */
> __entry->task_ratelimit, /* ratelimit with position control */
> - __entry->balanced_dirty_ratelimit /* the balanced ratelimit */
> + __entry->balanced_dirty_ratelimit, /* the balanced ratelimit */
> + __get_str(cgroup)
> )
> );
>
> TRACE_EVENT(balance_dirty_pages,
>
> - TP_PROTO(struct backing_dev_info *bdi,
> + TP_PROTO(struct bdi_writeback *wb,
> unsigned long thresh,
> unsigned long bg_thresh,
> unsigned long dirty,
> @@ -437,7 +529,7 @@ TRACE_EVENT(balance_dirty_pages,
> long pause,
> unsigned long start_time),
>
> - TP_ARGS(bdi, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
> + TP_ARGS(wb, thresh, bg_thresh, dirty, bdi_thresh, bdi_dirty,
> dirty_ratelimit, task_ratelimit,
> dirtied, period, pause, start_time),
>
> @@ -456,11 +548,12 @@ TRACE_EVENT(balance_dirty_pages,
> __field( long, pause)
> __field(unsigned long, period)
> __field( long, think)
> + __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
> ),
>
> TP_fast_assign(
> unsigned long freerun = (thresh + bg_thresh) / 2;
> - strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
> + strlcpy(__entry->bdi, dev_name(wb->bdi->dev), 32);
>
> __entry->limit = global_wb_domain.dirty_limit;
> __entry->setpoint = (global_wb_domain.dirty_limit +
> @@ -478,6 +571,7 @@ TRACE_EVENT(balance_dirty_pages,
> __entry->period = period * 1000 / HZ;
> __entry->pause = pause * 1000 / HZ;
> __entry->paused = (jiffies - start_time) * 1000 / HZ;
> + __trace_wb_assign_cgroup(__get_str(cgroup), wb);
> ),
>
>
> @@ -486,7 +580,7 @@ TRACE_EVENT(balance_dirty_pages,
> "bdi_setpoint=%lu bdi_dirty=%lu "
> "dirty_ratelimit=%lu task_ratelimit=%lu "
> "dirtied=%u dirtied_pause=%u "
> - "paused=%lu pause=%ld period=%lu think=%ld",
> + "paused=%lu pause=%ld period=%lu think=%ld cgroup=%s",
> __entry->bdi,
> __entry->limit,
> __entry->setpoint,
> @@ -500,7 +594,8 @@ TRACE_EVENT(balance_dirty_pages,
> __entry->paused, /* ms */
> __entry->pause, /* ms */
> __entry->period, /* ms */
> - __entry->think /* ms */
> + __entry->think, /* ms */
> + __get_str(cgroup)
> )
> );
>
> @@ -514,6 +609,8 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
> __field(unsigned long, ino)
> __field(unsigned long, state)
> __field(unsigned long, dirtied_when)
> + __dynamic_array(char, cgroup,
> + __trace_wb_cgroup_size(inode_to_wb(inode)))
> ),
>
> TP_fast_assign(
> @@ -522,14 +619,16 @@ TRACE_EVENT(writeback_sb_inodes_requeue,
> __entry->ino = inode->i_ino;
> __entry->state = inode->i_state;
> __entry->dirtied_when = inode->dirtied_when;
> + __trace_wb_assign_cgroup(__get_str(cgroup), inode_to_wb(inode));
> ),
>
> - TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu",
> + TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu cgroup=%s",
> __entry->name,
> __entry->ino,
> show_inode_state(__entry->state),
> __entry->dirtied_when,
> - (jiffies - __entry->dirtied_when) / HZ
> + (jiffies - __entry->dirtied_when) / HZ,
> + __get_str(cgroup)
> )
> );
>
> @@ -585,6 +684,7 @@ DECLARE_EVENT_CLASS(writeback_single_ino
> __field(unsigned long, writeback_index)
> __field(long, nr_to_write)
> __field(unsigned long, wrote)
> + __dynamic_array(char, cgroup, __trace_wbc_cgroup_size(wbc))
> ),
>
> TP_fast_assign(
> @@ -596,10 +696,11 @@ DECLARE_EVENT_CLASS(writeback_single_ino
> __entry->writeback_index = inode->i_mapping->writeback_index;
> __entry->nr_to_write = nr_to_write;
> __entry->wrote = nr_to_write - wbc->nr_to_write;
> + __trace_wbc_assign_cgroup(__get_str(cgroup), wbc);
> ),
>
> TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu "
> - "index=%lu to_write=%ld wrote=%lu",
> + "index=%lu to_write=%ld wrote=%lu cgroup=%s",
> __entry->name,
> __entry->ino,
> show_inode_state(__entry->state),
> @@ -607,7 +708,8 @@ DECLARE_EVENT_CLASS(writeback_single_ino
> (jiffies - __entry->dirtied_when) / HZ,
> __entry->writeback_index,
> __entry->nr_to_write,
> - __entry->wrote
> + __entry->wrote,
> + __get_str(cgroup)
> )
> );
>
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1289,7 +1289,7 @@ static void wb_update_dirty_ratelimit(st
> wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
> wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
>
> - trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
> + trace_bdi_dirty_ratelimit(wb, dirty_rate, task_ratelimit);
> }
>
> static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc,
> @@ -1683,7 +1683,7 @@ static void balance_dirty_pages(struct a
> * do a reset, as it may be a light dirtier.
> */
> if (pause < min_pause) {
> - trace_balance_dirty_pages(bdi,
> + trace_balance_dirty_pages(wb,
> sdtc->thresh,
> sdtc->bg_thresh,
> sdtc->dirty,
> @@ -1712,7 +1712,7 @@ static void balance_dirty_pages(struct a
> }
>
> pause:
> - trace_balance_dirty_pages(bdi,
> + trace_balance_dirty_pages(wb,
> sdtc->thresh,
> sdtc->bg_thresh,
> sdtc->dirty,
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-07-08 08:25:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH block/for-4.3] writeback: remove wb_writeback_work->single_wait/done

On Fri 03-07-15 18:12:23, Tejun Heo wrote:
> Hello, Jan.
>
> So, something like the following. It depends on other changes, so it
> won't apply as-is. I'll repost it as part of a patchset once -rc1
> drops.
>
> Thanks!
>
> ------ 8< ------
> wb_writeback_work->single_wait/done are used for the wait mechanism
> for synchronous wb_work (wb_writeback_work) items, which are issued
> when bdi_split_work_to_wbs() fails to allocate memory for asynchronous
> wb_work items; however, there's no reason to use a separate wait
> mechanism for this. bdi_split_work_to_wbs() can simply use an
> on-stack fallback wb_work item and a separate wb_completion to wait
> for it.
>
> This patch removes wb_work->single_wait/done and the related code and
> makes bdi_split_work_to_wbs() use an on-stack fallback wb_work and
> wb_completion instead.
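>
> In short, for each wb the splitting loop now either queues a
> kmalloc()'d copy of the work item or, when the allocation fails,
> falls back to something along the lines of the following sketch
> (simplified; the actual code below also drops the RCU read lock
> before sleeping and restarts the memcg iteration afterwards):
>
> 	DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
> 	struct wb_writeback_work fallback_work = *base_work;
>
> 	/* on-stack item: never auto-freed, waited for explicitly */
> 	fallback_work.nr_pages = wb_split_bdi_pages(wb, base_work->nr_pages);
> 	fallback_work.auto_free = 0;
> 	fallback_work.done = &fallback_work_done;
>
> 	wb_queue_work(wb, &fallback_work);
> 	wb_wait_for_completion(bdi, &fallback_work_done);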
>
> Signed-off-by: Tejun Heo <[email protected]>
> Suggested-by: Jan Kara <[email protected]>

Thanks! The patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/fs-writeback.c | 116 +++++++++++++-----------------------------------------
> 1 file changed, 30 insertions(+), 86 deletions(-)
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -53,8 +53,6 @@ struct wb_writeback_work {
> unsigned int for_background:1;
> unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */
> unsigned int auto_free:1; /* free on completion */
> - unsigned int single_wait:1;
> - unsigned int single_done:1;
> enum wb_reason reason; /* why was writeback initiated? */
>
> struct list_head list; /* pending work list */
> @@ -181,11 +179,8 @@ static void wb_queue_work(struct bdi_wri
> trace_writeback_queue(wb->bdi, work);
>
> spin_lock_bh(&wb->work_lock);
> - if (!test_bit(WB_registered, &wb->state)) {
> - if (work->single_wait)
> - work->single_done = 1;
> + if (!test_bit(WB_registered, &wb->state))
> goto out_unlock;
> - }
> if (work->done)
> atomic_inc(&work->done->cnt);
> list_add_tail(&work->list, &wb->work_list);
> @@ -737,32 +732,6 @@ int inode_congested(struct inode *inode,
> EXPORT_SYMBOL_GPL(inode_congested);
>
> /**
> - * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work
> - * @bdi: bdi the work item was issued to
> - * @work: work item to wait for
> - *
> - * Wait for the completion of @work which was issued to one of @bdi's
> - * bdi_writeback's. The caller must have set @work->single_wait before
> - * issuing it. This wait operates independently fo
> - * wb_wait_for_completion() and also disables automatic freeing of @work.
> - */
> -static void wb_wait_for_single_work(struct backing_dev_info *bdi,
> - struct wb_writeback_work *work)
> -{
> - if (WARN_ON_ONCE(!work->single_wait))
> - return;
> -
> - wait_event(bdi->wb_waitq, work->single_done);
> -
> - /*
> - * Paired with smp_wmb() in wb_do_writeback() and ensures that all
> - * modifications to @work prior to assertion of ->single_done is
> - * visible to the caller once this function returns.
> - */
> - smp_rmb();
> -}
> -
> -/**
> * wb_split_bdi_pages - split nr_pages to write according to bandwidth
> * @wb: target bdi_writeback to split @nr_pages to
> * @nr_pages: number of pages to write for the whole bdi
> @@ -791,38 +760,6 @@ static long wb_split_bdi_pages(struct bd
> }
>
> /**
> - * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb
> - * @wb: target bdi_writeback
> - * @base_work: source wb_writeback_work
> - *
> - * Try to make a clone of @base_work and issue it to @wb. If cloning
> - * succeeds, %true is returned; otherwise, @base_work is issued directly
> - * and %false is returned. In the latter case, the caller is required to
> - * wait for @base_work's completion using wb_wait_for_single_work().
> - *
> - * A clone is auto-freed on completion. @base_work never is.
> - */
> -static bool wb_clone_and_queue_work(struct bdi_writeback *wb,
> - struct wb_writeback_work *base_work)
> -{
> - struct wb_writeback_work *work;
> -
> - work = kmalloc(sizeof(*work), GFP_ATOMIC);
> - if (work) {
> - *work = *base_work;
> - work->auto_free = 1;
> - work->single_wait = 0;
> - } else {
> - work = base_work;
> - work->auto_free = 0;
> - work->single_wait = 1;
> - }
> - work->single_done = 0;
> - wb_queue_work(wb, work);
> - return work != base_work;
> -}
> -
> -/**
> * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi
> * @bdi: target backing_dev_info
> * @base_work: wb_writeback_work to issue
> @@ -837,7 +774,6 @@ static void bdi_split_work_to_wbs(struct
> struct wb_writeback_work *base_work,
> bool skip_if_busy)
> {
> - long nr_pages = base_work->nr_pages;
> int next_memcg_id = 0;
> struct bdi_writeback *wb;
> struct wb_iter iter;
> @@ -849,17 +785,39 @@ static void bdi_split_work_to_wbs(struct
> restart:
> rcu_read_lock();
> bdi_for_each_wb(wb, bdi, &iter, next_memcg_id) {
> + DEFINE_WB_COMPLETION_ONSTACK(fallback_work_done);
> + struct wb_writeback_work fallback_work;
> + struct wb_writeback_work *work;
> + long nr_pages;
> +
> if (!wb_has_dirty_io(wb) ||
> (skip_if_busy && writeback_in_progress(wb)))
> continue;
>
> - base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages);
> - if (!wb_clone_and_queue_work(wb, base_work)) {
> - next_memcg_id = wb->memcg_css->id + 1;
> - rcu_read_unlock();
> - wb_wait_for_single_work(bdi, base_work);
> - goto restart;
> + nr_pages = wb_split_bdi_pages(wb, base_work->nr_pages);
> +
> + work = kmalloc(sizeof(*work), GFP_ATOMIC);
> + if (work) {
> + *work = *base_work;
> + work->nr_pages = nr_pages;
> + work->auto_free = 1;
> + wb_queue_work(wb, work);
> + continue;
> }
> +
> + /* alloc failed, execute synchronously using on-stack fallback */
> + work = &fallback_work;
> + *work = *base_work;
> + work->nr_pages = nr_pages;
> + work->auto_free = 0;
> + work->done = &fallback_work_done;
> +
> + wb_queue_work(wb, work);
> +
> + next_memcg_id = wb->memcg_css->id + 1;
> + rcu_read_unlock();
> + wb_wait_for_completion(bdi, &fallback_work_done);
> + goto restart;
> }
> rcu_read_unlock();
> }
> @@ -901,8 +859,6 @@ static void bdi_split_work_to_wbs(struct
> if (bdi_has_dirty_io(bdi) &&
> (!skip_if_busy || !writeback_in_progress(&bdi->wb))) {
> base_work->auto_free = 0;
> - base_work->single_wait = 0;
> - base_work->single_done = 0;
> wb_queue_work(&bdi->wb, base_work);
> }
> }
> @@ -1793,26 +1749,14 @@ static long wb_do_writeback(struct bdi_w
> set_bit(WB_writeback_running, &wb->state);
> while ((work = get_next_work_item(wb)) != NULL) {
> struct wb_completion *done = work->done;
> - bool need_wake_up = false;
>
> trace_writeback_exec(wb->bdi, work);
>
> wrote += wb_writeback(wb, work);
>
> - if (work->single_wait) {
> - WARN_ON_ONCE(work->auto_free);
> - /* paired w/ rmb in wb_wait_for_single_work() */
> - smp_wmb();
> - work->single_done = 1;
> - need_wake_up = true;
> - } else if (work->auto_free) {
> + if (work->auto_free)
> kfree(work);
> - }
> -
> if (done && atomic_dec_and_test(&done->cnt))
> - need_wake_up = true;
> -
> - if (need_wake_up)
> wake_up_all(&wb->bdi->wb_waitq);
> }
>
--
Jan Kara <[email protected]>
SUSE Labs, CR