2014-11-18 08:37:37

by Tejun Heo

Subject: [PATCHSET block/for-next] writeback: prepare for cgroup writeback support

Hello,

This patchset contains the following 10 preparatory patches for cgroup
writeback support. None of these patches introduces behavior changes.

0001-writeback-move-backing_dev_info-state-into-bdi_write.patch
0002-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch
0003-writeback-move-bandwidth-related-fields-from-backing.patch
0004-writeback-move-backing_dev_info-wb_lock-and-worklist.patch
0005-writeback-move-lingering-dirty-IO-lists-transfer-fro.patch
0006-writeback-reorganize-mm-backing-dev.c.patch
0007-writeback-separate-out-include-linux-backing-dev-def.patch
0008-writeback-cosmetic-change-in-account_page_dirtied.patch
0009-writeback-add-gfp-to-wb_init.patch
0010-writeback-move-inode_to_bdi-to-include-linux-backing.patch

0001-0005 move writeback-related fields from bdi (backing_dev_info) to
wb (bdi_writeback). Currently, one bdi embeds one wb and the
separation between the two is blurry. bdi's lock protects wb's fields,
and fields which are closely related are scattered across the two.
These five patches move all fields which are used during writeback
into wb.
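
To illustrate the direction (a simplified sketch, not the actual
structures; see the patches for the real definitions):

  /* fields used during writeback live in the wb ... */
  struct bdi_writeback {
          struct backing_dev_info *bdi;   /* our parent bdi */
          unsigned long state;            /* WB_* bits */
          struct list_head b_dirty, b_io, b_more_io; /* dirty inode lists */
          struct percpu_counter stat[NR_WB_STAT_ITEMS]; /* wb stats */
          struct list_head work_list;     /* writeback work items */
          struct delayed_work dwork;      /* work item used for writeback */
  };

  /* ... while per-device properties stay in the bdi */
  struct backing_dev_info {
          struct bdi_writeback wb;        /* embedded default wb */
          unsigned long ra_pages;         /* readahead window */
          unsigned int capabilities;      /* device capabilities */
          char *name;
  };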

0006-0010 are misc prep patches. They're all rather trivial and each
is self-explanatory.

This patchset is on top of the current block/for-next eb494facbee2
and is available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-cgroup-writeback-wb-prep

diffstat follows. Thanks.

block/blk-core.c | 1
block/blk-integrity.c | 1
block/blk-sysfs.c | 1
block/bounce.c | 1
block/genhd.c | 1
drivers/block/drbd/drbd_int.h | 1
drivers/block/drbd/drbd_main.c | 10 -
drivers/block/pktcdvd.c | 1
drivers/char/raw.c | 1
drivers/md/bcache/request.c | 1
drivers/md/dm.c | 2
drivers/md/dm.h | 1
drivers/md/md.h | 1
drivers/md/raid1.c | 4
drivers/md/raid10.c | 2
drivers/mtd/devices/block2mtd.c | 1
fs/block_dev.c | 1
fs/ext4/extents.c | 1
fs/ext4/mballoc.c | 1
fs/f2fs/node.c | 2
fs/f2fs/segment.h | 1
fs/fs-writeback.c | 121 ++++++---------
fs/fuse/file.c | 12 -
fs/gfs2/super.c | 2
fs/hfs/super.c | 1
fs/hfsplus/super.c | 1
fs/nfs/filelayout/filelayout.c | 5
fs/nfs/write.c | 11 -
fs/reiserfs/super.c | 1
fs/ufs/super.c | 1
include/linux/backing-dev-defs.h | 105 +++++++++++++
include/linux/backing-dev.h | 174 +++++-----------------
include/linux/blkdev.h | 2
include/linux/writeback.h | 19 +-
include/trace/events/writeback.h | 8 -
mm/backing-dev.c | 306 +++++++++++++++++++--------------------
mm/filemap.c | 2
mm/madvise.c | 1
mm/page-writeback.c | 304 +++++++++++++++++++-------------------
mm/truncate.c | 4
40 files changed, 570 insertions(+), 546 deletions(-)

--
tejun


2014-11-18 08:37:39

by Tejun Heo

Subject: [PATCH 01/10] writeback: move backing_dev_info->state into bdi_writeback

Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeroing of
wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state,
introducing no behavior changes.
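
For illustration, a typical conversion in this patch looks like the
following (taken from the diff below):

  /* before */
  if (test_bit(BDI_registered, &bdi->state))
          mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);

  /* after */
  if (test_bit(WB_registered, &bdi->wb.state))
          mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);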

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: [email protected]
Cc: Neil Brown <[email protected]>
Cc: Alasdair Kergon <[email protected]>
Cc: Mike Snitzer <[email protected]>
---
block/blk-core.c | 1 -
drivers/block/drbd/drbd_main.c | 10 +++++-----
drivers/md/dm.c | 2 +-
drivers/md/raid1.c | 4 ++--
drivers/md/raid10.c | 2 +-
fs/fs-writeback.c | 14 +++++++-------
include/linux/backing-dev.h | 24 ++++++++++++------------
mm/backing-dev.c | 21 ++++++++++-----------
8 files changed, 38 insertions(+), 40 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0421b53..8801682 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -584,7 +584,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)

q->backing_dev_info.ra_pages =
(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
- q->backing_dev_info.state = 0;
q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
q->backing_dev_info.name = "block";
q->node = node_id;
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 1fc8342..61b00aa 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2360,7 +2360,7 @@ static void drbd_cleanup(void)
* @congested_data: User data
* @bdi_bits: Bits the BDI flusher thread is currently interested in
*
- * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
+ * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
*/
static int drbd_congested(void *congested_data, int bdi_bits)
{
@@ -2377,14 +2377,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
}

if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
- r |= (1 << BDI_async_congested);
+ r |= (1 << WB_async_congested);
/* Without good local data, we would need to read from remote,
* and that would need the worker thread as well, which is
* currently blocked waiting for that usermode helper to
* finish.
*/
if (!get_ldev_if_state(device, D_UP_TO_DATE))
- r |= (1 << BDI_sync_congested);
+ r |= (1 << WB_sync_congested);
else
put_ldev(device);
r &= bdi_bits;
@@ -2400,9 +2400,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
reason = 'b';
}

- if (bdi_bits & (1 << BDI_async_congested) &&
+ if (bdi_bits & (1 << WB_async_congested) &&
test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
- r |= (1 << BDI_async_congested);
+ r |= (1 << WB_async_congested);
reason = reason == 'b' ? 'a' : 'n';
}

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 58f3927..c4c53af 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1950,7 +1950,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
* the query about congestion status of request_queue
*/
if (dm_request_based(md))
- r = md->queue->backing_dev_info.state &
+ r = md->queue->backing_dev_info.wb.state &
bdi_bits;
else
r = dm_table_any_congested(map, bdi_bits);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 40b35be..aad1482 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -739,7 +739,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
struct r1conf *conf = mddev->private;
int i, ret = 0;

- if ((bits & (1 << BDI_async_congested)) &&
+ if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;

@@ -754,7 +754,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
/* Note the '|| 1' - when read_balance prefers
* non-congested targets, it can be removed
*/
- if ((bits & (1<<BDI_async_congested)) || 1)
+ if ((bits & (1<<WB_async_congested)) || 1)
ret |= bdi_congested(&q->backing_dev_info, bits);
else
ret &= bdi_congested(&q->backing_dev_info, bits);
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 32e282f..5180e75 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -915,7 +915,7 @@ int md_raid10_congested(struct mddev *mddev, int bits)
struct r10conf *conf = mddev->private;
int i, ret = 0;

- if ((bits & (1 << BDI_async_congested)) &&
+ if ((bits & (1 << WB_async_congested)) &&
conf->pending_count >= max_queued_requests)
return 1;

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2d609a5..a797bda 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -62,7 +62,7 @@ struct wb_writeback_work {
*/
int writeback_in_progress(struct backing_dev_info *bdi)
{
- return test_bit(BDI_writeback_running, &bdi->state);
+ return test_bit(WB_writeback_running, &bdi->wb.state);
}
EXPORT_SYMBOL(writeback_in_progress);

@@ -94,7 +94,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
static void bdi_wakeup_thread(struct backing_dev_info *bdi)
{
spin_lock_bh(&bdi->wb_lock);
- if (test_bit(BDI_registered, &bdi->state))
+ if (test_bit(WB_registered, &bdi->wb.state))
mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
spin_unlock_bh(&bdi->wb_lock);
}
@@ -105,7 +105,7 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
trace_writeback_queue(bdi, work);

spin_lock_bh(&bdi->wb_lock);
- if (!test_bit(BDI_registered, &bdi->state)) {
+ if (!test_bit(WB_registered, &bdi->wb.state)) {
if (work->done)
complete(work->done);
goto out_unlock;
@@ -1007,7 +1007,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
struct wb_writeback_work *work;
long wrote = 0;

- set_bit(BDI_writeback_running, &wb->bdi->state);
+ set_bit(WB_writeback_running, &wb->state);
while ((work = get_next_work_item(bdi)) != NULL) {

trace_writeback_exec(bdi, work);
@@ -1029,7 +1029,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
*/
wrote += wb_check_old_data_flush(wb);
wrote += wb_check_background_flush(wb);
- clear_bit(BDI_writeback_running, &wb->bdi->state);
+ clear_bit(WB_writeback_running, &wb->state);

return wrote;
}
@@ -1049,7 +1049,7 @@ void bdi_writeback_workfn(struct work_struct *work)
current->flags |= PF_SWAPWRITE;

if (likely(!current_is_workqueue_rescuer() ||
- !test_bit(BDI_registered, &bdi->state))) {
+ !test_bit(WB_registered, &wb->state))) {
/*
* The normal path. Keep writing back @bdi until its
* work_list is empty. Note that this path is also taken
@@ -1211,7 +1211,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_unlock(&inode->i_lock);
spin_lock(&bdi->wb.list_lock);
if (bdi_cap_writeback_dirty(bdi)) {
- WARN(!test_bit(BDI_registered, &bdi->state),
+ WARN(!test_bit(WB_registered, &bdi->wb.state),
"bdi-%s not registered\n", bdi->name);

/*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 5da6012..a356ccd 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -25,13 +25,13 @@ struct device;
struct dentry;

/*
- * Bits in backing_dev_info.state
+ * Bits in bdi_writeback.state
*/
-enum bdi_state {
- BDI_async_congested, /* The async (write) queue is getting full */
- BDI_sync_congested, /* The sync queue is getting full */
- BDI_registered, /* bdi_register() was done */
- BDI_writeback_running, /* Writeback is in progress */
+enum wb_state {
+ WB_async_congested, /* The async (write) queue is getting full */
+ WB_sync_congested, /* The sync queue is getting full */
+ WB_registered, /* bdi_register() was done */
+ WB_writeback_running, /* Writeback is in progress */
};

typedef int (congested_fn)(void *, int);
@@ -49,6 +49,7 @@ enum bdi_stat_item {
struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */

+ unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */

struct delayed_work dwork; /* work item used for writeback */
@@ -61,7 +62,6 @@ struct bdi_writeback {
struct backing_dev_info {
struct list_head bdi_list;
unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned long state; /* Always use atomic bitops on this */
unsigned int capabilities; /* Device capabilities */
congested_fn *congested_fn; /* Function pointer if device is md/dm */
void *congested_data; /* Pointer to aux data for congested func */
@@ -276,23 +276,23 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
{
if (bdi->congested_fn)
return bdi->congested_fn(bdi->congested_data, bdi_bits);
- return (bdi->state & bdi_bits);
+ return (bdi->wb.state & bdi_bits);
}

static inline int bdi_read_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_sync_congested);
+ return bdi_congested(bdi, 1 << WB_sync_congested);
}

static inline int bdi_write_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, 1 << BDI_async_congested);
+ return bdi_congested(bdi, 1 << WB_async_congested);
}

static inline int bdi_rw_congested(struct backing_dev_info *bdi)
{
- return bdi_congested(bdi, (1 << BDI_sync_congested) |
- (1 << BDI_async_congested));
+ return bdi_congested(bdi, (1 << WB_sync_congested) |
+ (1 << WB_async_congested));
}

enum {
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 0ae0df5..62f3b33 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -17,7 +17,6 @@ static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
struct backing_dev_info default_backing_dev_info = {
.name = "default",
.ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
- .state = 0,
.capabilities = BDI_CAP_MAP_COPY,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);
@@ -111,7 +110,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_dirty,
nr_io,
nr_more_io,
- !list_empty(&bdi->bdi_list), bdi->state);
+ !list_empty(&bdi->bdi_list), bdi->wb.state);
#undef K

return 0;
@@ -298,7 +297,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)

timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
spin_lock_bh(&bdi->wb_lock);
- if (test_bit(BDI_registered, &bdi->state))
+ if (test_bit(WB_registered, &bdi->wb.state))
queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
spin_unlock_bh(&bdi->wb_lock);
}
@@ -333,7 +332,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
bdi->dev = dev;

bdi_debug_register(bdi, dev_name(dev));
- set_bit(BDI_registered, &bdi->state);
+ set_bit(WB_registered, &bdi->wb.state);

spin_lock_bh(&bdi_lock);
list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
@@ -365,7 +364,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)

/* Make sure nobody queues further work */
spin_lock_bh(&bdi->wb_lock);
- clear_bit(BDI_registered, &bdi->state);
+ clear_bit(WB_registered, &bdi->wb.state);
spin_unlock_bh(&bdi->wb_lock);

/*
@@ -543,11 +542,11 @@ static atomic_t nr_bdi_congested[2];

void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
- enum bdi_state bit;
+ enum wb_state bit;
wait_queue_head_t *wqh = &congestion_wqh[sync];

- bit = sync ? BDI_sync_congested : BDI_async_congested;
- if (test_and_clear_bit(bit, &bdi->state))
+ bit = sync ? WB_sync_congested : WB_async_congested;
+ if (test_and_clear_bit(bit, &bdi->wb.state))
atomic_dec(&nr_bdi_congested[sync]);
smp_mb__after_atomic();
if (waitqueue_active(wqh))
@@ -557,10 +556,10 @@ EXPORT_SYMBOL(clear_bdi_congested);

void set_bdi_congested(struct backing_dev_info *bdi, int sync)
{
- enum bdi_state bit;
+ enum wb_state bit;

- bit = sync ? BDI_sync_congested : BDI_async_congested;
- if (!test_and_set_bit(bit, &bdi->state))
+ bit = sync ? WB_sync_congested : WB_async_congested;
+ if (!test_and_set_bit(bit, &bdi->wb.state))
atomic_inc(&nr_bdi_congested[sync]);
}
EXPORT_SYMBOL(set_bdi_congested);
--
1.9.3

2014-11-18 08:38:42

by Tejun Heo

Subject: [PATCH 10/10] writeback: move inode_to_bdi() to include/linux/backing-dev.h

inode_to_bdi() will be used by inline functions for the planned cgroup
writeback support. Move it to include/linux/backing-dev.h.

This patch doesn't introduce any behavior changes.
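
With the helper visible in the header, inline code elsewhere can use it
directly; e.g. a hypothetical helper (for illustration only, not part
of this patch) could be:

  static inline bool inode_write_congested(struct inode *inode)
  {
          /* inode_to_bdi() resolves the bdi, then the congestion test */
          return bdi_write_congested(inode_to_bdi(inode));
  }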

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
fs/fs-writeback.c | 10 ----------
include/linux/backing-dev.h | 10 ++++++++++
2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 41c9f1e..5130895 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -66,16 +66,6 @@ int writeback_in_progress(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(writeback_in_progress);

-static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
-{
- struct super_block *sb = inode->i_sb;
-
- if (sb_is_blkdev_sb(sb))
- return inode->i_mapping->backing_dev_info;
-
- return sb->s_bdi;
-}
-
static inline struct inode *wb_inode(struct list_head *head)
{
return list_entry(head, struct inode, i_wb_list);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 918f5c9..3c6fd34 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,4 +253,14 @@ static inline int bdi_sched_wait(void *word)
return 0;
}

+static inline struct backing_dev_info *inode_to_bdi(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (sb_is_blkdev_sb(sb))
+ return inode->i_mapping->backing_dev_info;
+
+ return sb->s_bdi;
+}
+
#endif /* _LINUX_BACKING_DEV_H */
--
1.9.3

2014-11-18 08:39:02

by Tejun Heo

Subject: [PATCH 08/10] writeback: cosmetic change in account_page_dirtied()

Flip the polarity of the mapping_cap_account_dirty() test so that the
body of page accounting can be moved outside the if () block. This
will help when adding cgroup writeback support.

This causes no logic changes.
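
The transformation is the usual early-return refactor, in general
terms:

  if (cond) {
          do_a();
          do_b();
  }

becomes

  if (!cond)
          return;

  do_a();
  do_b();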

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
mm/page-writeback.c | 19 ++++++++++---------
1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c721b4..29d5bd2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2104,15 +2104,16 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
{
trace_writeback_dirty_page(page, mapping);

- if (mapping_cap_account_dirty(mapping)) {
- __inc_zone_page_state(page, NR_FILE_DIRTY);
- __inc_zone_page_state(page, NR_DIRTIED);
- __inc_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
- __inc_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
- task_io_account_write(PAGE_CACHE_SIZE);
- current->nr_dirtied++;
- this_cpu_inc(bdp_ratelimits);
- }
+ if (!mapping_cap_account_dirty(mapping))
+ return;
+
+ __inc_zone_page_state(page, NR_FILE_DIRTY);
+ __inc_zone_page_state(page, NR_DIRTIED);
+ __inc_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
+ __inc_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
+ task_io_account_write(PAGE_CACHE_SIZE);
+ current->nr_dirtied++;
+ this_cpu_inc(bdp_ratelimits);
}
EXPORT_SYMBOL(account_page_dirtied);

--
1.9.3

2014-11-18 08:39:00

by Tejun Heo

Subject: [PATCH 09/10] writeback: add @gfp to wb_init()

wb_init() currently always uses GFP_KERNEL, but the planned cgroup
writeback support will need to use other allocation masks. Add @gfp
to wb_init().

This patch doesn't introduce any behavior changes.
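
A later per-cgroup wb, which may have to be set up from contexts where
GFP_KERNEL isn't usable, could then pass a stricter mask, e.g.
(hypothetical, not part of this series):

  err = wb_init(wb, bdi, GFP_NOWAIT);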

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
---
mm/backing-dev.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4d87957..876bf4f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -307,7 +307,8 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))

-static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
+ gfp_t gfp)
{
int i, err;

@@ -330,12 +331,12 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->work_list);
INIT_DELAYED_WORK(&wb->dwork, wb_workfn);

- err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+ err = fprop_local_init_percpu(&wb->completions, gfp);
if (err)
return err;

for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
- err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+ err = percpu_counter_init(&wb->stat[i], 0, gfp);
if (err) {
while (--i)
percpu_counter_destroy(&wb->stat[i]);
@@ -415,7 +416,7 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->max_prop_frac = FPROP_FRAC_BASE;
INIT_LIST_HEAD(&bdi->bdi_list);

- err = wb_init(&bdi->wb, bdi);
+ err = wb_init(&bdi->wb, bdi, GFP_KERNEL);
if (err)
return err;

--
1.9.3

2014-11-18 08:39:45

by Tejun Heo

Subject: [PATCH 06/10] writeback: reorganize mm/backing-dev.c

Move wb_shutdown(), bdi_register(), bdi_register_dev(),
bdi_prune_sb(), bdi_remove_from_list() and bdi_unregister() so that
init / exit functions are grouped together. This will make updating
init / exit paths for cgroup writeback support easier.

This is pure source file reorganization.
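
After the move, mm/backing-dev.c is grouped roughly as follows:

  wb_wakeup_delayed()           /* wb helpers */
  wb_init()                     /* wb init / shutdown / exit */
  wb_shutdown()
  wb_exit()
  bdi_init()                    /* bdi init / register / unregister */
  bdi_register()
  bdi_register_dev()
  bdi_prune_sb()
  bdi_remove_from_list()
  bdi_unregister()
  bdi_destroy()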

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
mm/backing-dev.c | 206 +++++++++++++++++++++++++++----------------------------
1 file changed, 103 insertions(+), 103 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 18a4c32..4d87957 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -303,109 +303,6 @@ void wb_wakeup_delayed(struct bdi_writeback *wb)
}

/*
- * Remove bdi from bdi_list, and ensure that it is no longer visible
- */
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
-{
- spin_lock_bh(&bdi_lock);
- list_del_rcu(&bdi->bdi_list);
- spin_unlock_bh(&bdi_lock);
-
- synchronize_rcu_expedited();
-}
-
-int bdi_register(struct backing_dev_info *bdi, struct device *parent,
- const char *fmt, ...)
-{
- va_list args;
- struct device *dev;
-
- if (bdi->dev) /* The driver needs to use separate queues per device */
- return 0;
-
- va_start(args, fmt);
- dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
- va_end(args);
- if (IS_ERR(dev))
- return PTR_ERR(dev);
-
- bdi->dev = dev;
-
- bdi_debug_register(bdi, dev_name(dev));
- set_bit(WB_registered, &bdi->wb.state);
-
- spin_lock_bh(&bdi_lock);
- list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
- spin_unlock_bh(&bdi_lock);
-
- trace_writeback_bdi_register(bdi);
- return 0;
-}
-EXPORT_SYMBOL(bdi_register);
-
-int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
-{
- return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
-}
-EXPORT_SYMBOL(bdi_register_dev);
-
-/*
- * Remove bdi from the global list and shutdown any threads we have running
- */
-static void wb_shutdown(struct bdi_writeback *wb)
-{
- /* Make sure nobody queues further work */
- spin_lock_bh(&wb->work_lock);
- clear_bit(WB_registered, &wb->state);
- spin_unlock_bh(&wb->work_lock);
-
- /*
- * Drain work list and shutdown the delayed_work. !WB_registered
- * tells wb_workfn() that @wb is dying and its work_list needs to
- * be drained no matter what.
- */
- mod_delayed_work(bdi_wq, &wb->dwork, 0);
- flush_delayed_work(&wb->dwork);
- WARN_ON(!list_empty(&wb->work_list));
- WARN_ON(delayed_work_pending(&wb->dwork));
-}
-
-/*
- * This bdi is going away now, make sure that no super_blocks point to it
- */
-static void bdi_prune_sb(struct backing_dev_info *bdi)
-{
- struct super_block *sb;
-
- spin_lock(&sb_lock);
- list_for_each_entry(sb, &super_blocks, s_list) {
- if (sb->s_bdi == bdi)
- sb->s_bdi = &default_backing_dev_info;
- }
- spin_unlock(&sb_lock);
-}
-
-void bdi_unregister(struct backing_dev_info *bdi)
-{
- if (bdi->dev) {
- bdi_set_min_ratio(bdi, 0);
- trace_writeback_bdi_unregister(bdi);
- bdi_prune_sb(bdi);
-
- if (bdi_cap_writeback_dirty(bdi)) {
- /* make sure nobody finds us on the bdi_list anymore */
- bdi_remove_from_list(bdi);
- wb_shutdown(&bdi->wb);
- }
-
- bdi_debug_unregister(bdi);
- device_unregister(bdi->dev);
- bdi->dev = NULL;
- }
-}
-EXPORT_SYMBOL(bdi_unregister);
-
-/*
* Initial write bandwidth: 100 MB/s
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))
@@ -450,6 +347,27 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
return 0;
}

+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void wb_shutdown(struct bdi_writeback *wb)
+{
+ /* Make sure nobody queues further work */
+ spin_lock_bh(&wb->work_lock);
+ clear_bit(WB_registered, &wb->state);
+ spin_unlock_bh(&wb->work_lock);
+
+ /*
+ * Drain work list and shutdown the delayed_work. !WB_registered
+ * tells wb_workfn() that @wb is dying and its work_list needs to
+ * be drained no matter what.
+ */
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ flush_delayed_work(&wb->dwork);
+ WARN_ON(!list_empty(&wb->work_list));
+ WARN_ON(delayed_work_pending(&wb->dwork));
+}
+
static void wb_exit(struct bdi_writeback *wb)
{
int i;
@@ -505,6 +423,88 @@ int bdi_init(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_init);

+int bdi_register(struct backing_dev_info *bdi, struct device *parent,
+ const char *fmt, ...)
+{
+ va_list args;
+ struct device *dev;
+
+ if (bdi->dev) /* The driver needs to use separate queues per device */
+ return 0;
+
+ va_start(args, fmt);
+ dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
+ va_end(args);
+ if (IS_ERR(dev))
+ return PTR_ERR(dev);
+
+ bdi->dev = dev;
+
+ bdi_debug_register(bdi, dev_name(dev));
+ set_bit(WB_registered, &bdi->wb.state);
+
+ spin_lock_bh(&bdi_lock);
+ list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
+ trace_writeback_bdi_register(bdi);
+ return 0;
+}
+EXPORT_SYMBOL(bdi_register);
+
+int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
+{
+ return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev));
+}
+EXPORT_SYMBOL(bdi_register_dev);
+
+/*
+ * This bdi is going away now, make sure that no super_blocks point to it
+ */
+static void bdi_prune_sb(struct backing_dev_info *bdi)
+{
+ struct super_block *sb;
+
+ spin_lock(&sb_lock);
+ list_for_each_entry(sb, &super_blocks, s_list) {
+ if (sb->s_bdi == bdi)
+ sb->s_bdi = &default_backing_dev_info;
+ }
+ spin_unlock(&sb_lock);
+}
+
+/*
+ * Remove bdi from bdi_list, and ensure that it is no longer visible
+ */
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+ spin_lock_bh(&bdi_lock);
+ list_del_rcu(&bdi->bdi_list);
+ spin_unlock_bh(&bdi_lock);
+
+ synchronize_rcu_expedited();
+}
+
+void bdi_unregister(struct backing_dev_info *bdi)
+{
+ if (bdi->dev) {
+ bdi_set_min_ratio(bdi, 0);
+ trace_writeback_bdi_unregister(bdi);
+ bdi_prune_sb(bdi);
+
+ if (bdi_cap_writeback_dirty(bdi)) {
+ /* make sure nobody finds us on the bdi_list anymore */
+ bdi_remove_from_list(bdi);
+ wb_shutdown(&bdi->wb);
+ }
+
+ bdi_debug_unregister(bdi);
+ device_unregister(bdi->dev);
+ bdi->dev = NULL;
+ }
+}
+EXPORT_SYMBOL(bdi_unregister);
+
void bdi_destroy(struct backing_dev_info *bdi)
{
bdi_unregister(bdi);
--
1.9.3

2014-11-18 08:39:47

by Tejun Heo

Subject: [PATCH 05/10] writeback: move lingering dirty IO lists transfer from bdi_destroy() to wb_exit()

If a bdi still has dirty IOs on destruction, bdi_destroy() transfers
them to the default bdi; however, dirty IO lists belong to wb
(bdi_writeback), not bdi (backing_dev_info), and after the recent
changes we now have wb_exit(), which handles destruction of a wb.
Move the transfer logic to wb_exit().

This patch is pure reorganization.
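
After this change, the teardown path is simply (as in the diff below):

  void bdi_destroy(struct backing_dev_info *bdi)
  {
          bdi_unregister(bdi);
          wb_exit(&bdi->wb);      /* splices any lingering dirty IO */
  }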

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
mm/backing-dev.c | 48 ++++++++++++++++++++++++------------------------
1 file changed, 24 insertions(+), 24 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4904456..18a4c32 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -456,6 +456,30 @@ static void wb_exit(struct bdi_writeback *wb)

WARN_ON(delayed_work_pending(&wb->dwork));

+ /*
+ * Splice our entries to the default_backing_dev_info. This
+ * condition shouldn't happen. @wb must be empty at this point and
+ * dirty inodes on it might cause other issues. This workaround is
+ * added by ce5f8e779519 ("writeback: splice dirty inode entries to
+ * default bdi on bdi_destroy()") without root-causing the issue.
+ *
+ * http://lkml.kernel.org/g/[email protected]
+ * http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350
+ *
+ * We should probably add WARN_ON() to find out whether it still
+ * happens and track it down if so.
+ */
+ if (wb_has_dirty_io(wb)) {
+ struct bdi_writeback *dst = &default_backing_dev_info.wb;
+
+ bdi_lock_two(wb, dst);
+ list_splice(&wb->b_dirty, &dst->b_dirty);
+ list_splice(&wb->b_io, &dst->b_io);
+ list_splice(&wb->b_more_io, &dst->b_more_io);
+ spin_unlock(&wb->list_lock);
+ spin_unlock(&dst->list_lock);
+ }
+
for (i = 0; i < NR_WB_STAT_ITEMS; i++)
percpu_counter_destroy(&wb->stat[i]);

@@ -483,30 +507,6 @@ EXPORT_SYMBOL(bdi_init);

void bdi_destroy(struct backing_dev_info *bdi)
{
- /*
- * Splice our entries to the default_backing_dev_info. This
- * condition shouldn't happen. @wb must be empty at this point and
- * dirty inodes on it might cause other issues. This workaround is
- * added by ce5f8e779519 ("writeback: splice dirty inode entries to
- * default bdi on bdi_destroy()") without root-causing the issue.
- *
- * http://lkml.kernel.org/g/[email protected]
- * http://thread.gmane.org/gmane.linux.file-systems/35341/focus=35350
- *
- * We should probably add WARN_ON() to find out whether it still
- * happens and track it down if so.
- */
- if (bdi_has_dirty_io(bdi)) {
- struct bdi_writeback *dst = &default_backing_dev_info.wb;
-
- bdi_lock_two(&bdi->wb, dst);
- list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
- list_splice(&bdi->wb.b_io, &dst->b_io);
- list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
- spin_unlock(&bdi->wb.list_lock);
- spin_unlock(&dst->list_lock);
- }
-
bdi_unregister(bdi);
wb_exit(&bdi->wb);
}
--
1.9.3

2014-11-18 08:39:43

by Tejun Heo

Subject: [PATCH 07/10] writeback: separate out include/linux/backing-dev-defs.h

With the planned cgroup writeback support, backing-dev-related
declarations will be more widely used across block and cgroup;
unfortunately, including backing-dev.h from include/linux/blkdev.h
makes a cyclic include dependency quite likely.

This patch separates out backing-dev-defs.h, which contains only the
essential definitions, and updates blkdev.h to include it. .c files
which need access to more backing-dev details now include
backing-dev.h directly. This takes backing-dev.h off the common
include dependency chain, making it a lot easier to use across block
and cgroup.
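
The resulting include relationships look like this:

  /* include/linux/blkdev.h */
  #include <linux/backing-dev-defs.h>  /* struct / enum definitions only */

  /* a .c file that needs the full backing-dev API */
  #include <linux/backing-dev.h>       /* includes backing-dev-defs.h */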

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
---
block/blk-integrity.c | 1 +
block/blk-sysfs.c | 1 +
block/bounce.c | 1 +
block/genhd.c | 1 +
drivers/block/drbd/drbd_int.h | 1 +
drivers/block/pktcdvd.c | 1 +
drivers/char/raw.c | 1 +
drivers/md/bcache/request.c | 1 +
drivers/md/dm.h | 1 +
drivers/md/md.h | 1 +
drivers/mtd/devices/block2mtd.c | 1 +
fs/block_dev.c | 1 +
fs/ext4/extents.c | 1 +
fs/ext4/mballoc.c | 1 +
fs/f2fs/segment.h | 1 +
fs/hfs/super.c | 1 +
fs/hfsplus/super.c | 1 +
fs/nfs/filelayout/filelayout.c | 1 +
fs/reiserfs/super.c | 1 +
fs/ufs/super.c | 1 +
include/linux/backing-dev-defs.h | 105 +++++++++++++++++++++++++++++++++++++++
include/linux/backing-dev.h | 100 +------------------------------------
include/linux/blkdev.h | 2 +-
mm/madvise.c | 1 +
24 files changed, 128 insertions(+), 100 deletions(-)
create mode 100644 include/linux/backing-dev-defs.h

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 79ffb48..f548b64 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -21,6 +21,7 @@
*/

#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/mempool.h>
#include <linux/bio.h>
#include <linux/scatterlist.h>
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 1fac434..70f4fc6 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -6,6 +6,7 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/blktrace_api.h>
#include <linux/blk-mq.h>

diff --git a/block/bounce.c b/block/bounce.c
index ab21ba2..c616a60 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -13,6 +13,7 @@
#include <linux/pagemap.h>
#include <linux/mempool.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/hash.h>
#include <linux/highmem.h>
diff --git a/block/genhd.c b/block/genhd.c
index bd30606..e0ba5ce 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -8,6 +8,7 @@
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/spinlock.h>
#include <linux/proc_fs.h>
diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index b905e98..efd19c2 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -38,6 +38,7 @@
#include <linux/mutex.h>
#include <linux/major.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/genhd.h>
#include <linux/idr.h>
#include <net/tcp.h>
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 09e628da..4c20c22 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -61,6 +61,7 @@
#include <linux/freezer.h>
#include <linux/mutex.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_ioctl.h>
#include <scsi/scsi.h>
diff --git a/drivers/char/raw.c b/drivers/char/raw.c
index 0102dc7..ca49f11 100644
--- a/drivers/char/raw.c
+++ b/drivers/char/raw.c
@@ -12,6 +12,7 @@
#include <linux/fs.h>
#include <linux/major.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/raw.h>
#include <linux/capability.h>
diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c
index 62e6e98..502b4ed 100644
--- a/drivers/md/bcache/request.c
+++ b/drivers/md/bcache/request.c
@@ -15,6 +15,7 @@
#include <linux/module.h>
#include <linux/hash.h>
#include <linux/random.h>
+#include <linux/backing-dev.h>

#include <trace/events/bcache.h>

diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 988c7fb..c7d087b 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -14,6 +14,7 @@
#include <linux/device-mapper.h>
#include <linux/list.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/hdreg.h>
#include <linux/completion.h>
#include <linux/kobject.h>
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 03cec5b..684b7ff 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -16,6 +16,7 @@
#define _MD_MD_H

#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/kobject.h>
#include <linux/list.h>
#include <linux/mm.h>
diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c
index 66f0405..e22e40f 100644
--- a/drivers/mtd/devices/block2mtd.c
+++ b/drivers/mtd/devices/block2mtd.c
@@ -12,6 +12,7 @@
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/bio.h>
#include <linux/pagemap.h>
#include <linux/list.h>
diff --git a/fs/block_dev.c b/fs/block_dev.c
index cc9d411..0bfe096 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -14,6 +14,7 @@
#include <linux/device_cgroup.h>
#include <linux/highmem.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/module.h>
#include <linux/blkpg.h>
#include <linux/magic.h>
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 37043d0..38d1c16 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -39,6 +39,7 @@
#include <linux/slab.h>
#include <asm/uaccess.h>
#include <linux/fiemap.h>
+#include <linux/backing-dev.h>
#include "ext4_jbd2.h"
#include "ext4_extents.h"
#include "xattr.h"
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index dbfe15c..4fcc4a0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -26,6 +26,7 @@
#include <linux/log2.h>
#include <linux/module.h>
#include <linux/slab.h>
+#include <linux/backing-dev.h>
#include <trace/events/ext4.h>

#ifdef CONFIG_EXT4_DEBUG
diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h
index 2495bec..230dd48 100644
--- a/fs/f2fs/segment.h
+++ b/fs/f2fs/segment.h
@@ -9,6 +9,7 @@
* published by the Free Software Foundation.
*/
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>

/* constant macro */
#define NULL_SEGNO ((unsigned int)(~0))
diff --git a/fs/hfs/super.c b/fs/hfs/super.c
index eee7206..55c03b9 100644
--- a/fs/hfs/super.c
+++ b/fs/hfs/super.c
@@ -14,6 +14,7 @@

#include <linux/module.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/mount.h>
#include <linux/init.h>
#include <linux/nls.h>
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index 4cf2024..01b9a23 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -11,6 +11,7 @@
#include <linux/init.h>
#include <linux/pagemap.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/vfs.h>
diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index 0554e3c..4a8d758 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -32,6 +32,7 @@
#include <linux/nfs_fs.h>
#include <linux/nfs_page.h>
#include <linux/module.h>
+#include <linux/backing-dev.h>

#include <linux/sunrpc/metrics.h>

diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c
index f1376c9..1929dd9 100644
--- a/fs/reiserfs/super.c
+++ b/fs/reiserfs/super.c
@@ -21,6 +21,7 @@
#include "xattr.h"
#include <linux/init.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/buffer_head.h>
#include <linux/exportfs.h>
#include <linux/quotaops.h>
diff --git a/fs/ufs/super.c b/fs/ufs/super.c
index da73801..c6b9466 100644
--- a/fs/ufs/super.c
+++ b/fs/ufs/super.c
@@ -80,6 +80,7 @@
#include <linux/stat.h>
#include <linux/string.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/init.h>
#include <linux/parser.h>
#include <linux/buffer_head.h>
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
new file mode 100644
index 0000000..2874d83
--- /dev/null
+++ b/include/linux/backing-dev-defs.h
@@ -0,0 +1,105 @@
+#ifndef __LINUX_BACKING_DEV_DEFS_H
+#define __LINUX_BACKING_DEV_DEFS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/percpu_counter.h>
+#include <linux/flex_proportions.h>
+#include <linux/timer.h>
+#include <linux/workqueue.h>
+
+struct page;
+struct device;
+struct dentry;
+
+/*
+ * Bits in bdi_writeback.state
+ */
+enum wb_state {
+ WB_async_congested, /* The async (write) queue is getting full */
+ WB_sync_congested, /* The sync queue is getting full */
+ WB_registered, /* bdi_register() was done */
+ WB_writeback_running, /* Writeback is in progress */
+};
+
+typedef int (congested_fn)(void *, int);
+
+enum wb_stat_item {
+ WB_RECLAIMABLE,
+ WB_WRITEBACK,
+ WB_DIRTIED,
+ WB_WRITTEN,
+ NR_WB_STAT_ITEMS
+};
+
+#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+
+struct bdi_writeback {
+ struct backing_dev_info *bdi; /* our parent bdi */
+
+ unsigned long state; /* Always use atomic bitops on this */
+ unsigned long last_old_flush; /* last old data flush */
+
+ struct list_head b_dirty; /* dirty inodes */
+ struct list_head b_io; /* parked for writeback */
+ struct list_head b_more_io; /* parked for more writeback */
+ spinlock_t list_lock; /* protects the b_* lists */
+
+ struct percpu_counter stat[NR_WB_STAT_ITEMS];
+
+ unsigned long bw_time_stamp; /* last time write bw is updated */
+ unsigned long dirtied_stamp;
+ unsigned long written_stamp; /* pages written at bw_time_stamp */
+ unsigned long write_bandwidth; /* the estimated write bandwidth */
+ unsigned long avg_write_bandwidth; /* further smoothed write bw */
+
+ /*
+ * The base dirty throttle rate, re-calculated on every 200ms.
+ * All the bdi tasks' dirty rate will be curbed under it.
+ * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
+ * in small steps and is much more smooth/stable than the latter.
+ */
+ unsigned long dirty_ratelimit;
+ unsigned long balanced_dirty_ratelimit;
+
+ struct fprop_local_percpu completions;
+ int dirty_exceeded;
+
+ spinlock_t work_lock; /* protects work_list & dwork scheduling */
+ struct list_head work_list;
+ struct delayed_work dwork; /* work item used for writeback */
+};
+
+struct backing_dev_info {
+ struct list_head bdi_list;
+ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+ unsigned int capabilities; /* Device capabilities */
+ congested_fn *congested_fn; /* Function pointer if device is md/dm */
+ void *congested_data; /* Pointer to aux data for congested func */
+
+ char *name;
+
+ unsigned int min_ratio;
+ unsigned int max_ratio, max_prop_frac;
+
+ struct bdi_writeback wb; /* default writeback info for this bdi */
+
+ struct device *dev;
+
+ struct timer_list laptop_mode_wb_timer;
+
+#ifdef CONFIG_DEBUG_FS
+ struct dentry *debug_dir;
+ struct dentry *debug_stats;
+#endif
+};
+
+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
+void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
+void set_bdi_congested(struct backing_dev_info *bdi, int sync);
+
+#endif /* __LINUX_BACKING_DEV_DEFS_H */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 6aba0d3..918f5c9 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -8,103 +8,12 @@
#ifndef _LINUX_BACKING_DEV_H
#define _LINUX_BACKING_DEV_H

-#include <linux/percpu_counter.h>
-#include <linux/log2.h>
-#include <linux/flex_proportions.h>
#include <linux/kernel.h>
#include <linux/fs.h>
#include <linux/sched.h>
-#include <linux/timer.h>
#include <linux/writeback.h>
-#include <linux/atomic.h>
-#include <linux/sysctl.h>
-#include <linux/workqueue.h>

-struct page;
-struct device;
-struct dentry;
-
-/*
- * Bits in bdi_writeback.state
- */
-enum wb_state {
- WB_async_congested, /* The async (write) queue is getting full */
- WB_sync_congested, /* The sync queue is getting full */
- WB_registered, /* bdi_register() was done */
- WB_writeback_running, /* Writeback is in progress */
-};
-
-typedef int (congested_fn)(void *, int);
-
-enum wb_stat_item {
- WB_RECLAIMABLE,
- WB_WRITEBACK,
- WB_DIRTIED,
- WB_WRITTEN,
- NR_WB_STAT_ITEMS
-};
-
-#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
-
-struct bdi_writeback {
- struct backing_dev_info *bdi; /* our parent bdi */
-
- unsigned long state; /* Always use atomic bitops on this */
- unsigned long last_old_flush; /* last old data flush */
-
- struct list_head b_dirty; /* dirty inodes */
- struct list_head b_io; /* parked for writeback */
- struct list_head b_more_io; /* parked for more writeback */
- spinlock_t list_lock; /* protects the b_* lists */
-
- struct percpu_counter stat[NR_WB_STAT_ITEMS];
-
- unsigned long bw_time_stamp; /* last time write bw is updated */
- unsigned long dirtied_stamp;
- unsigned long written_stamp; /* pages written at bw_time_stamp */
- unsigned long write_bandwidth; /* the estimated write bandwidth */
- unsigned long avg_write_bandwidth; /* further smoothed write bw */
-
- /*
- * The base dirty throttle rate, re-calculated on every 200ms.
- * All the bdi tasks' dirty rate will be curbed under it.
- * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit
- * in small steps and is much more smooth/stable than the latter.
- */
- unsigned long dirty_ratelimit;
- unsigned long balanced_dirty_ratelimit;
-
- struct fprop_local_percpu completions;
- int dirty_exceeded;
-
- spinlock_t work_lock; /* protects work_list & dwork scheduling */
- struct list_head work_list;
- struct delayed_work dwork; /* work item used for writeback */
-};
-
-struct backing_dev_info {
- struct list_head bdi_list;
- unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned int capabilities; /* Device capabilities */
- congested_fn *congested_fn; /* Function pointer if device is md/dm */
- void *congested_data; /* Pointer to aux data for congested func */
-
- char *name;
-
- unsigned int min_ratio;
- unsigned int max_ratio, max_prop_frac;
-
- struct bdi_writeback wb; /* default writeback info for this bdi */
-
- struct device *dev;
-
- struct timer_list laptop_mode_wb_timer;
-
-#ifdef CONFIG_DEBUG_FS
- struct dentry *debug_dir;
- struct dentry *debug_stats;
-#endif
-};
+#include <linux/backing-dev-defs.h>

int __must_check bdi_init(struct backing_dev_info *bdi);
void bdi_destroy(struct backing_dev_info *bdi);
@@ -291,13 +200,6 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << WB_async_congested));
}

-enum {
- BLK_RW_ASYNC = 0,
- BLK_RW_SYNC = 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
long congestion_wait(int sync, long timeout);
long wait_iff_congested(struct zone *zone, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 77db6dc..de56016 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -12,7 +12,7 @@
#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/pagemap.h>
-#include <linux/backing-dev.h>
+#include <linux/backing-dev-defs.h>
#include <linux/wait.h>
#include <linux/mempool.h>
#include <linux/bio.h>
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30..5793169 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -17,6 +17,7 @@
#include <linux/fs.h>
#include <linux/file.h>
#include <linux/blkdev.h>
+#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>

--
1.9.3

2014-11-18 08:40:58

by Tejun Heo

Subject: [PATCH 04/10] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->wb_lock and ->worklist into wb.

* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.

* bdi_writeback_workfn()          -> wb_workfn()
  bdi_wakeup_thread_delayed(bdi)  -> wb_wakeup_delayed(wb)
  bdi_wakeup_thread(bdi)          -> wb_wakeup(wb)
  bdi_queue_work(bdi, ...)        -> wb_queue_work(wb, ...)
  __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
  get_next_work_item(bdi)         -> get_next_work_item(wb)

* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().

* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state,
introducing no behavior changes.
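
For example, waking up a wb now locks and tests the wb itself instead
of reaching through the bdi (from the diff below):

  static void wb_wakeup(struct bdi_writeback *wb)
  {
          spin_lock_bh(&wb->work_lock);
          if (test_bit(WB_registered, &wb->state))
                  mod_delayed_work(bdi_wq, &wb->dwork, 0);
          spin_unlock_bh(&wb->work_lock);
  }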

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
---
fs/fs-writeback.c | 84 +++++++++++++++++++++------------------------
include/linux/backing-dev.h | 12 +++----
mm/backing-dev.c | 64 +++++++++++++++++-----------------
3 files changed, 77 insertions(+), 83 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index daa91ae..41c9f1e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -91,34 +91,33 @@ static inline struct inode *wb_inode(struct list_head *head)

EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);

-static void bdi_wakeup_thread(struct backing_dev_info *bdi)
+static void wb_wakeup(struct bdi_writeback *wb)
{
- spin_lock_bh(&bdi->wb_lock);
- if (test_bit(WB_registered, &bdi->wb.state))
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ if (test_bit(WB_registered, &wb->state))
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ spin_unlock_bh(&wb->work_lock);
}

-static void bdi_queue_work(struct backing_dev_info *bdi,
- struct wb_writeback_work *work)
+static void wb_queue_work(struct bdi_writeback *wb,
+ struct wb_writeback_work *work)
{
- trace_writeback_queue(bdi, work);
+ trace_writeback_queue(wb->bdi, work);

- spin_lock_bh(&bdi->wb_lock);
- if (!test_bit(WB_registered, &bdi->wb.state)) {
+ spin_lock_bh(&wb->work_lock);
+ if (!test_bit(WB_registered, &wb->state)) {
if (work->done)
complete(work->done);
goto out_unlock;
}
- list_add_tail(&work->list, &bdi->work_list);
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
+ list_add_tail(&work->list, &wb->work_list);
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
out_unlock:
- spin_unlock_bh(&bdi->wb_lock);
+ spin_unlock_bh(&wb->work_lock);
}

-static void
-__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
- bool range_cyclic, enum wb_reason reason)
+static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
+ bool range_cyclic, enum wb_reason reason)
{
struct wb_writeback_work *work;

@@ -128,8 +127,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
*/
work = kzalloc(sizeof(*work), GFP_ATOMIC);
if (!work) {
- trace_writeback_nowork(bdi);
- bdi_wakeup_thread(bdi);
+ trace_writeback_nowork(wb->bdi);
+ wb_wakeup(wb);
return;
}

@@ -138,7 +137,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
work->range_cyclic = range_cyclic;
work->reason = reason;

- bdi_queue_work(bdi, work);
+ wb_queue_work(wb, work);
}

/**
@@ -156,7 +155,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason)
{
- __bdi_start_writeback(bdi, nr_pages, true, reason);
+ __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
}

/**
@@ -176,7 +175,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
* writeback as soon as there is no other work to do.
*/
trace_writeback_wake_background(bdi);
- bdi_wakeup_thread(bdi);
+ wb_wakeup(&bdi->wb);
}

/*
@@ -848,7 +847,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* after the other works are all done.
*/
if ((work->for_background || work->for_kupdate) &&
- !list_empty(&wb->bdi->work_list))
+ !list_empty(&wb->work_list))
break;

/*
@@ -919,18 +918,17 @@ static long wb_writeback(struct bdi_writeback *wb,
/*
* Return the next wb_writeback_work struct that hasn't been processed yet.
*/
-static struct wb_writeback_work *
-get_next_work_item(struct backing_dev_info *bdi)
+static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
{
struct wb_writeback_work *work = NULL;

- spin_lock_bh(&bdi->wb_lock);
- if (!list_empty(&bdi->work_list)) {
- work = list_entry(bdi->work_list.next,
+ spin_lock_bh(&wb->work_lock);
+ if (!list_empty(&wb->work_list)) {
+ work = list_entry(wb->work_list.next,
struct wb_writeback_work, list);
list_del_init(&work->list);
}
- spin_unlock_bh(&bdi->wb_lock);
+ spin_unlock_bh(&wb->work_lock);
return work;
}

@@ -1002,14 +1000,13 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
*/
static long wb_do_writeback(struct bdi_writeback *wb)
{
- struct backing_dev_info *bdi = wb->bdi;
struct wb_writeback_work *work;
long wrote = 0;

set_bit(WB_writeback_running, &wb->state);
- while ((work = get_next_work_item(bdi)) != NULL) {
+ while ((work = get_next_work_item(wb)) != NULL) {

- trace_writeback_exec(bdi, work);
+ trace_writeback_exec(wb->bdi, work);

wrote += wb_writeback(wb, work);

@@ -1037,43 +1034,42 @@ static long wb_do_writeback(struct bdi_writeback *wb)
* Handle writeback of dirty data for the device backed by this bdi. Also
* reschedules periodically and does kupdated style flushing.
*/
-void bdi_writeback_workfn(struct work_struct *work)
+void wb_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(to_delayed_work(work),
struct bdi_writeback, dwork);
- struct backing_dev_info *bdi = wb->bdi;
long pages_written;

- set_worker_desc("flush-%s", dev_name(bdi->dev));
+ set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
current->flags |= PF_SWAPWRITE;

if (likely(!current_is_workqueue_rescuer() ||
!test_bit(WB_registered, &wb->state))) {
/*
- * The normal path. Keep writing back @bdi until its
+ * The normal path. Keep writing back @wb until its
* work_list is empty. Note that this path is also taken
- * if @bdi is shutting down even when we're running off the
+ * if @wb is shutting down even when we're running off the
* rescuer as work_list needs to be drained.
*/
do {
pages_written = wb_do_writeback(wb);
trace_writeback_pages_written(pages_written);
- } while (!list_empty(&bdi->work_list));
+ } while (!list_empty(&wb->work_list));
} else {
/*
* bdi_wq can't get enough workers and we're running off
* the emergency worker. Don't hog it. Hopefully, 1024 is
* enough for efficient IO.
*/
- pages_written = writeback_inodes_wb(&bdi->wb, 1024,
+ pages_written = writeback_inodes_wb(wb, 1024,
WB_REASON_FORKER_THREAD);
trace_writeback_pages_written(pages_written);
}

- if (!list_empty(&bdi->work_list))
+ if (!list_empty(&wb->work_list))
mod_delayed_work(bdi_wq, &wb->dwork, 0);
else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
- bdi_wakeup_thread_delayed(bdi);
+ wb_wakeup_delayed(wb);

current->flags &= ~PF_SWAPWRITE;
}
@@ -1093,7 +1089,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
if (!bdi_has_dirty_io(bdi))
continue;
- __bdi_start_writeback(bdi, nr_pages, false, reason);
+ __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
}
rcu_read_unlock();
}
@@ -1228,7 +1224,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
spin_unlock(&bdi->wb.list_lock);

if (wakeup_bdi)
- bdi_wakeup_thread_delayed(bdi);
+ wb_wakeup_delayed(&bdi->wb);
return;
}
}
@@ -1318,7 +1314,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
if (sb->s_bdi == &noop_backing_dev_info)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));
- bdi_queue_work(sb->s_bdi, &work);
+ wb_queue_work(&sb->s_bdi->wb, &work);
wait_for_completion(&done);
}
EXPORT_SYMBOL(writeback_inodes_sb_nr);
@@ -1402,7 +1398,7 @@ void sync_inodes_sb(struct super_block *sb)
return;
WARN_ON(!rwsem_is_locked(&sb->s_umount));

- bdi_queue_work(sb->s_bdi, &work);
+ wb_queue_work(&sb->s_bdi->wb, &work);
wait_for_completion(&done);

wait_sb_inodes(sb);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index a077a8d..6aba0d3 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -52,7 +52,6 @@ struct bdi_writeback {
unsigned long state; /* Always use atomic bitops on this */
unsigned long last_old_flush; /* last old data flush */

- struct delayed_work dwork; /* work item used for writeback */
struct list_head b_dirty; /* dirty inodes */
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
@@ -77,6 +76,10 @@ struct bdi_writeback {

struct fprop_local_percpu completions;
int dirty_exceeded;
+
+ spinlock_t work_lock; /* protects work_list & dwork scheduling */
+ struct list_head work_list;
+ struct delayed_work dwork; /* work item used for writeback */
};

struct backing_dev_info {
@@ -92,9 +95,6 @@ struct backing_dev_info {
unsigned int max_ratio, max_prop_frac;

struct bdi_writeback wb; /* default writeback info for this bdi */
- spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */
-
- struct list_head work_list;

struct device *dev;

@@ -118,9 +118,9 @@ int __must_check bdi_setup_and_register(struct backing_dev_info *, char *, unsig
void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
enum wb_reason reason);
void bdi_start_background_writeback(struct backing_dev_info *bdi);
-void bdi_writeback_workfn(struct work_struct *work);
+void wb_workfn(struct work_struct *work);
int bdi_has_dirty_io(struct backing_dev_info *bdi);
-void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
+void wb_wakeup_delayed(struct bdi_writeback *wb);

extern spinlock_t bdi_lock;
extern struct list_head bdi_list;
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7b9b10e..4904456 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -278,7 +278,7 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
}

/*
- * This function is used when the first inode for this bdi is marked dirty. It
+ * This function is used when the first inode for this wb is marked dirty. It
* wakes-up the corresponding bdi thread which should then take care of the
* periodic background write-out of dirty inodes. Since the write-out would
* starts only 'dirty_writeback_interval' centisecs from now anyway, we just
@@ -291,15 +291,15 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
* We have to be careful not to postpone flush work if it is scheduled for
* earlier. Thus we use queue_delayed_work().
*/
-void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
+void wb_wakeup_delayed(struct bdi_writeback *wb)
{
unsigned long timeout;

timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
- spin_lock_bh(&bdi->wb_lock);
- if (test_bit(WB_registered, &bdi->wb.state))
- queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ if (test_bit(WB_registered, &wb->state))
+ queue_delayed_work(bdi_wq, &wb->dwork, timeout);
+ spin_unlock_bh(&wb->work_lock);
}

/*
@@ -352,30 +352,22 @@ EXPORT_SYMBOL(bdi_register_dev);
/*
* Remove bdi from the global list and shutdown any threads we have running
*/
-static void bdi_wb_shutdown(struct backing_dev_info *bdi)
+static void wb_shutdown(struct bdi_writeback *wb)
{
- if (!bdi_cap_writeback_dirty(bdi))
- return;
-
- /*
- * Make sure nobody finds us on the bdi_list anymore
- */
- bdi_remove_from_list(bdi);
-
/* Make sure nobody queues further work */
- spin_lock_bh(&bdi->wb_lock);
- clear_bit(WB_registered, &bdi->wb.state);
- spin_unlock_bh(&bdi->wb_lock);
+ spin_lock_bh(&wb->work_lock);
+ clear_bit(WB_registered, &wb->state);
+ spin_unlock_bh(&wb->work_lock);

/*
- * Drain work list and shutdown the delayed_work. At this point,
- * @bdi->bdi_list is empty telling bdi_writeback_workfn() that @bdi
- * is dying and its work_list needs to be drained no matter what.
+ * Drain work list and shutdown the delayed_work. !WB_registered
+ * tells wb_workfn() that @wb is dying and its work_list needs to
+ * be drained no matter what.
*/
- mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
- flush_delayed_work(&bdi->wb.dwork);
- WARN_ON(!list_empty(&bdi->work_list));
- WARN_ON(delayed_work_pending(&bdi->wb.dwork));
+ mod_delayed_work(bdi_wq, &wb->dwork, 0);
+ flush_delayed_work(&wb->dwork);
+ WARN_ON(!list_empty(&wb->work_list));
+ WARN_ON(delayed_work_pending(&wb->dwork));
}

/*
@@ -400,7 +392,12 @@ void bdi_unregister(struct backing_dev_info *bdi)
trace_writeback_bdi_unregister(bdi);
bdi_prune_sb(bdi);

- bdi_wb_shutdown(bdi);
+ if (bdi_cap_writeback_dirty(bdi)) {
+ /* make sure nobody finds us on the bdi_list anymore */
+ bdi_remove_from_list(bdi);
+ wb_shutdown(&bdi->wb);
+ }
+
bdi_debug_unregister(bdi);
device_unregister(bdi->dev);
bdi->dev = NULL;
@@ -413,7 +410,7 @@ EXPORT_SYMBOL(bdi_unregister);
*/
#define INIT_BW (100 << (20 - PAGE_SHIFT))

-static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
int i, err;

@@ -425,7 +422,6 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
spin_lock_init(&wb->list_lock);
- INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

wb->bw_time_stamp = jiffies;
wb->balanced_dirty_ratelimit = INIT_BW;
@@ -433,6 +429,10 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
wb->write_bandwidth = INIT_BW;
wb->avg_write_bandwidth = INIT_BW;

+ spin_lock_init(&wb->work_lock);
+ INIT_LIST_HEAD(&wb->work_list);
+ INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
+
err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
if (err)
return err;
@@ -450,7 +450,7 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
return 0;
}

-static void bdi_wb_exit(struct bdi_writeback *wb)
+static void wb_exit(struct bdi_writeback *wb)
{
int i;

@@ -471,11 +471,9 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->min_ratio = 0;
bdi->max_ratio = 100;
bdi->max_prop_frac = FPROP_FRAC_BASE;
- spin_lock_init(&bdi->wb_lock);
INIT_LIST_HEAD(&bdi->bdi_list);
- INIT_LIST_HEAD(&bdi->work_list);

- err = bdi_wb_init(&bdi->wb, bdi);
+ err = wb_init(&bdi->wb, bdi);
if (err)
return err;

@@ -510,7 +508,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
}

bdi_unregister(bdi);
- bdi_wb_exit(&bdi->wb);
+ wb_exit(&bdi->wb);
}
EXPORT_SYMBOL(bdi_destroy);

--
1.9.3

2014-11-18 08:41:41

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/10] writeback: move backing_dev_info->bdi_stat[] into bdi_writeback

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat initialization is moved into bdi_wb_init() and a matching
bdi_wb_exit() is added to free the stat counters.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->bdi_stat[] are mechanically replaced with bdi->wb.stat[],
introducing no behavior changes (see the sketch below).
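
As a condensed illustration of that mechanical replacement (this is
only a sketch distilled from the fuse hunk below, not new code in this
patch), a typical call site changes as follows:

	/* before: the counters hang off the bdi itself */
	inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);

	/* after: the same counters live in the embedded wb */
	inc_wb_stat(&mapping->backing_dev_info->wb, WB_WRITEBACK);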

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Trond Myklebust <[email protected]>
---
fs/fs-writeback.c | 2 +-
fs/fuse/file.c | 12 ++++----
fs/nfs/filelayout/filelayout.c | 4 +--
fs/nfs/write.c | 11 +++----
include/linux/backing-dev.h | 68 ++++++++++++++++++++----------------------
mm/backing-dev.c | 61 +++++++++++++++++++++----------------
mm/filemap.c | 2 +-
mm/page-writeback.c | 53 ++++++++++++++++----------------
mm/truncate.c | 4 +--
9 files changed, 112 insertions(+), 105 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a797bda..f5ca16e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -790,7 +790,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

- if (bdi_stat(bdi, BDI_RECLAIMABLE) >
+ if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
bdi_dirty_limit(bdi, background_thresh))
return true;

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index caa8d95..1199471 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1514,9 +1514,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)

list_del(&req->writepages_entry);
for (i = 0; i < req->num_pages; i++) {
- dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
- bdi_writeout_inc(bdi);
+ wb_writeout_inc(&bdi->wb);
}
wake_up(&fi->page_waitq);
}
@@ -1703,7 +1703,7 @@ static int fuse_writepage_locked(struct page *page)
req->end = fuse_writepage_end;
req->inode = inode;

- inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_wb_stat(&mapping->backing_dev_info->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

spin_lock(&fc->lock);
@@ -1818,9 +1818,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
copy_highpage(old_req->pages[0], page);
spin_unlock(&fc->lock);

- dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_wb_stat(&bdi->wb, WB_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
- bdi_writeout_inc(bdi);
+ wb_writeout_inc(&bdi->wb);
fuse_writepage_free(fc, new_req);
fuse_request_free(new_req);
goto out;
@@ -1917,7 +1917,7 @@ static int fuse_writepages_fill(struct page *page,
req->page_descs[req->num_pages].offset = 0;
req->page_descs[req->num_pages].length = PAGE_SIZE;

- inc_bdi_stat(page->mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_wb_stat(&page->mapping->backing_dev_info->wb, WB_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);

err = 0;
diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
index 46fab1cb..0554e3c 100644
--- a/fs/nfs/filelayout/filelayout.c
+++ b/fs/nfs/filelayout/filelayout.c
@@ -1084,8 +1084,8 @@ mds_commit:
spin_unlock(cinfo->lock);
if (!cinfo->dreq) {
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
- BDI_RECLAIMABLE);
+ inc_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
+ WB_RECLAIMABLE);
__mark_inode_dirty(req->wb_context->dentry->d_inode,
I_DIRTY_DATASYNC);
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1249384..943ddab 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -781,8 +781,8 @@ nfs_request_add_commit_list(struct nfs_page *req, struct list_head *dst,
spin_unlock(cinfo->lock);
if (!cinfo->dreq) {
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
- BDI_RECLAIMABLE);
+ inc_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
+ WB_RECLAIMABLE);
__mark_inode_dirty(req->wb_context->dentry->d_inode,
I_DIRTY_DATASYNC);
}
@@ -848,7 +848,8 @@ static void
nfs_clear_page_commit(struct page *page)
{
dec_zone_page_state(page, NR_UNSTABLE_NFS);
- dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
+ dec_wb_stat(&page_file_mapping(page)->backing_dev_info->wb,
+ WB_RECLAIMABLE);
}

/* Called holding inode (/cinfo) lock */
@@ -1559,8 +1560,8 @@ void nfs_retry_commit(struct list_head *page_list,
nfs_mark_request_commit(req, lseg, cinfo);
if (!cinfo->dreq) {
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
- dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
+ WB_RECLAIMABLE);
}
nfs_unlock_and_release_request(req);
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index a356ccd..92fed42 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -36,15 +36,15 @@ enum wb_state {

typedef int (congested_fn)(void *, int);

-enum bdi_stat_item {
- BDI_RECLAIMABLE,
- BDI_WRITEBACK,
- BDI_DIRTIED,
- BDI_WRITTEN,
- NR_BDI_STAT_ITEMS
+enum wb_stat_item {
+ WB_RECLAIMABLE,
+ WB_WRITEBACK,
+ WB_DIRTIED,
+ WB_WRITTEN,
+ NR_WB_STAT_ITEMS
};

-#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
+#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))

struct bdi_writeback {
struct backing_dev_info *bdi; /* our parent bdi */
@@ -57,6 +57,8 @@ struct bdi_writeback {
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
spinlock_t list_lock; /* protects the b_* lists */
+
+ struct percpu_counter stat[NR_WB_STAT_ITEMS];
};

struct backing_dev_info {
@@ -68,8 +70,6 @@ struct backing_dev_info {

char *name;

- struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
-
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
@@ -134,78 +134,74 @@ static inline int wb_has_dirty_io(struct bdi_writeback *wb)
!list_empty(&wb->b_more_io);
}

-static inline void __add_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item, s64 amount)
+static inline void __add_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item, s64 amount)
{
- __percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
+ __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
}

-static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void __inc_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- __add_bdi_stat(bdi, item, 1);
+ __add_wb_stat(wb, item, 1);
}

-static inline void inc_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;

local_irq_save(flags);
- __inc_bdi_stat(bdi, item);
+ __inc_wb_stat(wb, item);
local_irq_restore(flags);
}

-static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void __dec_wb_stat(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- __add_bdi_stat(bdi, item, -1);
+ __add_wb_stat(wb, item, -1);
}

-static inline void dec_bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
unsigned long flags;

local_irq_save(flags);
- __dec_bdi_stat(bdi, item);
+ __dec_wb_stat(wb, item);
local_irq_restore(flags);
}

-static inline s64 bdi_stat(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
{
- return percpu_counter_read_positive(&bdi->bdi_stat[item]);
+ return percpu_counter_read_positive(&wb->stat[item]);
}

-static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
+ enum wb_stat_item item)
{
- return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
+ return percpu_counter_sum_positive(&wb->stat[item]);
}

-static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
- enum bdi_stat_item item)
+static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
{
s64 sum;
unsigned long flags;

local_irq_save(flags);
- sum = __bdi_stat_sum(bdi, item);
+ sum = __wb_stat_sum(wb, item);
local_irq_restore(flags);

return sum;
}

-extern void bdi_writeout_inc(struct backing_dev_info *bdi);
+extern void wb_writeout_inc(struct bdi_writeback *wb);

/*
* maximal error of a stat counter.
*/
-static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
+static inline unsigned long wb_stat_error(struct bdi_writeback *wb)
{
#ifdef CONFIG_SMP
- return nr_cpu_ids * BDI_STAT_BATCH;
+ return nr_cpu_ids * WB_STAT_BATCH;
#else
return 1;
#endif
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 62f3b33..4b6f650 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -99,13 +99,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"b_more_io: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
- (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
- (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
+ (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
+ (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
K(bdi_thresh),
K(dirty_thresh),
K(background_thresh),
- (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
- (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
+ (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
(unsigned long) K(bdi->write_bandwidth),
nr_dirty,
nr_io,
@@ -408,8 +408,10 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_unregister);

-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
+ int i, err;
+
memset(wb, 0, sizeof(*wb));

wb->bdi = bdi;
@@ -419,6 +421,27 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
INIT_LIST_HEAD(&wb->b_more_io);
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
+
+ for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
+ err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
+ if (err) {
+ while (i--)
+ percpu_counter_destroy(&wb->stat[i]);
+ return err;
+ }
+ }
+
+ return 0;
+}
+
+static void bdi_wb_exit(struct bdi_writeback *wb)
+{
+ int i;
+
+ WARN_ON(delayed_work_pending(&wb->dwork));
+
+ for (i = 0; i < NR_WB_STAT_ITEMS; i++)
+ percpu_counter_destroy(&wb->stat[i]);
}

/*
@@ -428,7 +451,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)

int bdi_init(struct backing_dev_info *bdi)
{
- int i, err;
+ int err;

bdi->dev = NULL;

@@ -439,13 +462,9 @@ int bdi_init(struct backing_dev_info *bdi)
INIT_LIST_HEAD(&bdi->bdi_list);
INIT_LIST_HEAD(&bdi->work_list);

- bdi_wb_init(&bdi->wb, bdi);
-
- for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
- err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
- if (err)
- goto err;
- }
+ err = bdi_wb_init(&bdi->wb, bdi);
+ if (err)
+ return err;

bdi->dirty_exceeded = 0;

@@ -458,21 +477,17 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->avg_write_bandwidth = INIT_BW;

err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
-
if (err) {
-err:
- while (i--)
- percpu_counter_destroy(&bdi->bdi_stat[i]);
+ bdi_wb_exit(&bdi->wb);
+ return err;
}

- return err;
+ return 0;
}
EXPORT_SYMBOL(bdi_init);

void bdi_destroy(struct backing_dev_info *bdi)
{
- int i;
-
/*
* Splice our entries to the default_backing_dev_info. This
* condition shouldn't happen. @wb must be empty at this point and
@@ -498,11 +513,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
}

bdi_unregister(bdi);
-
- WARN_ON(delayed_work_pending(&bdi->wb.dwork));
-
- for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
- percpu_counter_destroy(&bdi->bdi_stat[i]);
+ bdi_wb_exit(&bdi->wb);

fprop_local_destroy_percpu(&bdi->completions);
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 14b4642..1405fc5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -211,7 +211,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ dec_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
}
}

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 19ceae8..68fd72a 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -396,11 +396,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
* Increment the BDI's writeout completion count and the global writeout
* completion count. Called from test_clear_page_writeback().
*/
-static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
+static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
- __inc_bdi_stat(bdi, BDI_WRITTEN);
- __fprop_inc_percpu_max(&writeout_completions, &bdi->completions,
- bdi->max_prop_frac);
+ __inc_wb_stat(wb, WB_WRITTEN);
+ __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
+ wb->bdi->max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(writeout_period_time)) {
/*
@@ -414,15 +414,15 @@ static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
}
}

-void bdi_writeout_inc(struct backing_dev_info *bdi)
+void wb_writeout_inc(struct bdi_writeback *wb)
{
unsigned long flags;

local_irq_save(flags);
- __bdi_writeout_inc(bdi);
+ __wb_writeout_inc(wb);
local_irq_restore(flags);
}
-EXPORT_SYMBOL_GPL(bdi_writeout_inc);
+EXPORT_SYMBOL_GPL(wb_writeout_inc);

/*
* Obtain an accurate fraction of the BDI's portion.
@@ -1127,8 +1127,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed < BANDWIDTH_INTERVAL)
return;

- dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
- written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
+ written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
@@ -1285,7 +1285,8 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
unsigned long *bdi_thresh,
unsigned long *bdi_bg_thresh)
{
- unsigned long bdi_reclaimable;
+ struct bdi_writeback *wb = &bdi->wb;
+ unsigned long wb_reclaimable;

/*
* bdi_thresh is not treated as some limiting factor as
@@ -1317,14 +1318,12 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
* actually dirty; with m+n sitting in the percpu
* deltas.
*/
- if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
- bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
- *bdi_dirty = bdi_reclaimable +
- bdi_stat_sum(bdi, BDI_WRITEBACK);
+ if (*bdi_thresh < 2 * wb_stat_error(wb)) {
+ wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
+ *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
} else {
- bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
- *bdi_dirty = bdi_reclaimable +
- bdi_stat(bdi, BDI_WRITEBACK);
+ wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
+ *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
}
}

@@ -1511,9 +1510,9 @@ pause:
* In theory 1 page is enough to keep the comsumer-producer
* pipe going: the flusher cleans 1 page => the task dirties 1
* more page. However bdi_dirty has accounting errors. So use
- * the larger and more IO friendly bdi_stat_error.
+ * the larger and more IO friendly wb_stat_error.
*/
- if (bdi_dirty <= bdi_stat_error(bdi))
+ if (bdi_dirty <= wb_stat_error(&bdi->wb))
break;

if (fatal_signal_pending(current))
@@ -2106,8 +2105,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
if (mapping_cap_account_dirty(mapping)) {
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
- __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
- __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+ __inc_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
+ __inc_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
task_io_account_write(PAGE_CACHE_SIZE);
current->nr_dirtied++;
this_cpu_inc(bdp_ratelimits);
@@ -2173,7 +2172,7 @@ void account_page_redirty(struct page *page)
if (mapping && mapping_cap_account_dirty(mapping)) {
current->nr_dirtied--;
dec_zone_page_state(page, NR_DIRTIED);
- dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
+ dec_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
}
}
EXPORT_SYMBOL(account_page_redirty);
@@ -2314,8 +2313,8 @@ int clear_page_dirty_for_io(struct page *page)
*/
if (TestClearPageDirty(page)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_wb_stat(&mapping->backing_dev_info->wb,
+ WB_RECLAIMABLE);
return 1;
}
return 0;
@@ -2344,8 +2343,8 @@ int test_clear_page_writeback(struct page *page)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi)) {
- __dec_bdi_stat(bdi, BDI_WRITEBACK);
- __bdi_writeout_inc(bdi);
+ __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
+ __wb_writeout_inc(&bdi->wb);
}
}
spin_unlock_irqrestore(&mapping->tree_lock, flags);
@@ -2381,7 +2380,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
page_index(page),
PAGECACHE_TAG_WRITEBACK);
if (bdi_cap_account_writeback(bdi))
- __inc_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,
diff --git a/mm/truncate.c b/mm/truncate.c
index 261eaf6..623319c 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -112,8 +112,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
dec_zone_page_state(page, NR_FILE_DIRTY);
- dec_bdi_stat(mapping->backing_dev_info,
- BDI_RECLAIMABLE);
+ dec_wb_stat(&mapping->backing_dev_info->wb,
+ WB_RECLAIMABLE);
if (account_size)
task_io_account_cancelled_write(account_size);
}
--
1.9.3

2014-11-18 08:41:40

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/10] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.

* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb instead
of @bdi.

* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

* Init/exit of the relocated fields is moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process,
as wb's are cleared in their entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of the moved fields are mechanically replaced with their
bdi->wb.* counterparts, introducing no behavior changes (see the
sketch below).
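
As a condensed illustration (a sketch distilled from the f2fs and
balance_dirty_pages_ratelimited() hunks below, not new code in this
patch), a typical access changes as follows:

	/* before: bandwidth state sits directly in the bdi */
	if (bdi->dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

	/* after: the same state, reached through the embedded wb */
	if (bdi->wb.dirty_exceeded)
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));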

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Wu Fengguang <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Steven Whitehouse <[email protected]>
---
fs/f2fs/node.c | 2 +-
fs/fs-writeback.c | 17 ++-
fs/gfs2/super.c | 2 +-
include/linux/backing-dev.h | 20 ++--
include/linux/writeback.h | 19 ++-
include/trace/events/writeback.h | 8 +-
mm/backing-dev.c | 45 ++++---
mm/page-writeback.c | 246 ++++++++++++++++++++-------------------
8 files changed, 177 insertions(+), 182 deletions(-)

diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 44b8afe..c53d94b 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -43,7 +43,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
mem_size = (nm_i->nat_cnt * sizeof(struct nat_entry)) >> 12;
res = mem_size < ((val.totalram * nm_i->ram_thresh / 100) >> 2);
} else if (type == DIRTY_DENTS) {
- if (sbi->sb->s_bdi->dirty_exceeded)
+ if (sbi->sb->s_bdi->wb.dirty_exceeded)
return false;
mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
res = mem_size < ((val.totalram * nm_i->ram_thresh / 100) >> 1);
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f5ca16e..daa91ae 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -574,7 +574,7 @@ out:
return ret;
}

-static long writeback_chunk_size(struct backing_dev_info *bdi,
+static long writeback_chunk_size(struct bdi_writeback *wb,
struct wb_writeback_work *work)
{
long pages;
@@ -595,7 +595,7 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
pages = LONG_MAX;
else {
- pages = min(bdi->avg_write_bandwidth / 2,
+ pages = min(wb->avg_write_bandwidth / 2,
global_dirty_limit / DIRTY_SCOPE);
pages = min(pages, work->nr_pages);
pages = round_down(pages + MIN_WRITEBACK_PAGES,
@@ -693,7 +693,7 @@ static long writeback_sb_inodes(struct super_block *sb,
inode->i_state |= I_SYNC;
spin_unlock(&inode->i_lock);

- write_chunk = writeback_chunk_size(wb->bdi, work);
+ write_chunk = writeback_chunk_size(wb, work);
wbc.nr_to_write = write_chunk;
wbc.pages_skipped = 0;

@@ -780,7 +780,7 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
return nr_pages - work.nr_pages;
}

-static bool over_bground_thresh(struct backing_dev_info *bdi)
+static bool over_bground_thresh(struct bdi_writeback *wb)
{
unsigned long background_thresh, dirty_thresh;

@@ -790,8 +790,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
global_page_state(NR_UNSTABLE_NFS) > background_thresh)
return true;

- if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
- bdi_dirty_limit(bdi, background_thresh))
+ if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
return true;

return false;
@@ -804,7 +803,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
static void wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long start_time)
{
- __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+ __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
}

/*
@@ -856,7 +855,7 @@ static long wb_writeback(struct bdi_writeback *wb,
* For background writeout, stop when we are below the
* background dirty threshold
*/
- if (work->for_background && !over_bground_thresh(wb->bdi))
+ if (work->for_background && !over_bground_thresh(wb))
break;

/*
@@ -948,7 +947,7 @@ static unsigned long get_nr_dirty_pages(void)

static long wb_check_background_flush(struct bdi_writeback *wb)
{
- if (over_bground_thresh(wb->bdi)) {
+ if (over_bground_thresh(wb)) {

struct wb_writeback_work work = {
.nr_pages = LONG_MAX,
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index a346f56..4566c89 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -755,7 +755,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)

if (wbc->sync_mode == WB_SYNC_ALL)
gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH);
- if (bdi->dirty_exceeded)
+ if (bdi->wb.dirty_exceeded)
gfs2_ail1_flush(sdp, wbc);
else
filemap_fdatawrite(metamapping);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 92fed42..a077a8d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,16 +59,6 @@ struct bdi_writeback {
spinlock_t list_lock; /* protects the b_* lists */

struct percpu_counter stat[NR_WB_STAT_ITEMS];
-};
-
-struct backing_dev_info {
- struct list_head bdi_list;
- unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
- unsigned int capabilities; /* Device capabilities */
- congested_fn *congested_fn; /* Function pointer if device is md/dm */
- void *congested_data; /* Pointer to aux data for congested func */
-
- char *name;

unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
@@ -87,6 +77,16 @@ struct backing_dev_info {

struct fprop_local_percpu completions;
int dirty_exceeded;
+};
+
+struct backing_dev_info {
+ struct list_head bdi_list;
+ unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
+ unsigned int capabilities; /* Device capabilities */
+ congested_fn *congested_fn; /* Function pointer if device is md/dm */
+ void *congested_data; /* Pointer to aux data for congested func */
+
+ char *name;

unsigned int min_ratio;
unsigned int max_ratio, max_prop_frac;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a219be9..6887eb5 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -152,16 +152,15 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
- unsigned long dirty);
-
-void __bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time);
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
+
+void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time);

void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index cee02d6..8622b5b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -373,13 +373,13 @@ TRACE_EVENT(bdi_dirty_ratelimit,

TP_fast_assign(
strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
- __entry->write_bw = KBps(bdi->write_bandwidth);
- __entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
+ __entry->write_bw = KBps(bdi->wb.write_bandwidth);
+ __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
__entry->dirty_rate = KBps(dirty_rate);
- __entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
+ __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
__entry->task_ratelimit = KBps(task_ratelimit);
__entry->balanced_dirty_ratelimit =
- KBps(bdi->balanced_dirty_ratelimit);
+ KBps(bdi->wb.balanced_dirty_ratelimit);
),

TP_printk("bdi %s: "
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4b6f650..7b9b10e 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -82,7 +82,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
spin_unlock(&wb->list_lock);

global_dirty_limits(&background_thresh, &dirty_thresh);
- bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ bdi_thresh = wb_dirty_limit(wb, dirty_thresh);

#define K(x) ((x) << (PAGE_SHIFT - 10))
seq_printf(m,
@@ -106,7 +106,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
K(background_thresh),
(unsigned long) K(wb_stat(wb, WB_DIRTIED)),
(unsigned long) K(wb_stat(wb, WB_WRITTEN)),
- (unsigned long) K(bdi->write_bandwidth),
+ (unsigned long) K(wb->write_bandwidth),
nr_dirty,
nr_io,
nr_more_io,
@@ -408,6 +408,11 @@ void bdi_unregister(struct backing_dev_info *bdi)
}
EXPORT_SYMBOL(bdi_unregister);

+/*
+ * Initial write bandwidth: 100 MB/s
+ */
+#define INIT_BW (100 << (20 - PAGE_SHIFT))
+
static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
int i, err;
@@ -422,11 +427,22 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);

+ wb->bw_time_stamp = jiffies;
+ wb->balanced_dirty_ratelimit = INIT_BW;
+ wb->dirty_ratelimit = INIT_BW;
+ wb->write_bandwidth = INIT_BW;
+ wb->avg_write_bandwidth = INIT_BW;
+
+ err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
+ if (err)
+ return err;
+
for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
if (err) {
while (i--)
percpu_counter_destroy(&wb->stat[i]);
+ fprop_local_destroy_percpu(&wb->completions);
return err;
}
}
@@ -442,12 +458,9 @@ static void bdi_wb_exit(struct bdi_writeback *wb)

for (i = 0; i < NR_WB_STAT_ITEMS; i++)
percpu_counter_destroy(&wb->stat[i]);
-}

-/*
- * Initial write bandwidth: 100 MB/s
- */
-#define INIT_BW (100 << (20 - PAGE_SHIFT))
+ fprop_local_destroy_percpu(&wb->completions);
+}

int bdi_init(struct backing_dev_info *bdi)
{
@@ -466,22 +479,6 @@ int bdi_init(struct backing_dev_info *bdi)
if (err)
return err;

- bdi->dirty_exceeded = 0;
-
- bdi->bw_time_stamp = jiffies;
- bdi->written_stamp = 0;
-
- bdi->balanced_dirty_ratelimit = INIT_BW;
- bdi->dirty_ratelimit = INIT_BW;
- bdi->write_bandwidth = INIT_BW;
- bdi->avg_write_bandwidth = INIT_BW;
-
- err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
- if (err) {
- bdi_wb_exit(&bdi->wb);
- return err;
- }
-
return 0;
}
EXPORT_SYMBOL(bdi_init);
@@ -514,8 +511,6 @@ void bdi_destroy(struct backing_dev_info *bdi)

bdi_unregister(bdi);
bdi_wb_exit(&bdi->wb);
-
- fprop_local_destroy_percpu(&bdi->completions);
}
EXPORT_SYMBOL(bdi_destroy);

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 68fd72a..7c721b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -399,7 +399,7 @@ static unsigned long wp_next_time(unsigned long cur_time)
static inline void __wb_writeout_inc(struct bdi_writeback *wb)
{
__inc_wb_stat(wb, WB_WRITTEN);
- __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
+ __fprop_inc_percpu_max(&writeout_completions, &wb->completions,
wb->bdi->max_prop_frac);
/* First event after period switching was turned off? */
if (!unlikely(writeout_period_time)) {
@@ -427,10 +427,10 @@ EXPORT_SYMBOL_GPL(wb_writeout_inc);
/*
* Obtain an accurate fraction of the BDI's portion.
*/
-static void bdi_writeout_fraction(struct backing_dev_info *bdi,
- long *numerator, long *denominator)
+static void wb_writeout_fraction(struct bdi_writeback *wb,
+ long *numerator, long *denominator)
{
- fprop_fraction_percpu(&writeout_completions, &bdi->completions,
+ fprop_fraction_percpu(&writeout_completions, &wb->completions,
numerator, denominator);
}

@@ -516,11 +516,11 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
}

/**
- * bdi_dirty_limit - @bdi's share of dirty throttling threshold
- * @bdi: the backing_dev_info to query
+ * wb_dirty_limit - @wb's share of dirty throttling threshold
+ * @wb: the bdi_writeback to query
* @dirty: global dirty limit in pages
*
- * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
+ * Returns @wb's dirty limit in pages. The term "dirty" in the context of
* dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
*
* Note that balance_dirty_pages() will only seriously take it as a hard limit
@@ -528,24 +528,25 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
* control. For example, when the device is completely stalled due to some error
* conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
* In the other normal situations, it acts more gently by throttling the tasks
- * more (rather than completely block them) when the bdi dirty pages go high.
+ * more (rather than completely block them) when the wb dirty pages go high.
*
* It allocates high/low dirty limits to fast/slow devices, in order to prevent
* - starving fast devices
* - piling up dirty pages (that will take long time to sync) on slow devices
*
- * The bdi's share of dirty limit will be adapting to its throughput and
+ * The wb's share of dirty limit will be adapting to its throughput and
* bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
*/
-unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
+unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
{
+ struct backing_dev_info *bdi = wb->bdi;
u64 bdi_dirty;
long numerator, denominator;

/*
* Calculate this BDI's share of the dirty ratio.
*/
- bdi_writeout_fraction(bdi, &numerator, &denominator);
+ wb_writeout_fraction(wb, &numerator, &denominator);

bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
bdi_dirty *= numerator;
@@ -664,14 +665,14 @@ static long long pos_ratio_polynom(unsigned long setpoint,
* card's bdi_dirty may rush to many times higher than bdi_setpoint.
* - the bdi dirty thresh drops quickly due to change of JBOD workload
*/
-static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty)
+static unsigned long wb_position_ratio(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty)
{
- unsigned long write_bw = bdi->avg_write_bandwidth;
+ unsigned long write_bw = wb->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long x_intercept;
@@ -702,12 +703,12 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
* consume arbitrary amount of RAM because it is accounted in
* NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
*
- * Here, in bdi_position_ratio(), we calculate pos_ratio based on
+ * Here, in wb_position_ratio(), we calculate pos_ratio based on
* two values: bdi_dirty and bdi_thresh. Let's consider an example:
* total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
* limits are set by default to 10% and 20% (background and throttle).
* Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
- * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
+ * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is
* about ~6K pages (as the average of background and throttle bdi
* limits). The 3rd order polynomial will provide positive feedback if
* bdi_dirty is under bdi_setpoint and vice versa.
@@ -717,7 +718,7 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
* much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
* in the example above).
*/
- if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
long long bdi_pos_ratio;
unsigned long bdi_bg_thresh;

@@ -842,13 +843,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
return pos_ratio;
}

-static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
- unsigned long elapsed,
- unsigned long written)
+static void wb_update_write_bandwidth(struct bdi_writeback *wb,
+ unsigned long elapsed,
+ unsigned long written)
{
const unsigned long period = roundup_pow_of_two(3 * HZ);
- unsigned long avg = bdi->avg_write_bandwidth;
- unsigned long old = bdi->write_bandwidth;
+ unsigned long avg = wb->avg_write_bandwidth;
+ unsigned long old = wb->write_bandwidth;
u64 bw;

/*
@@ -858,14 +859,14 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
* write_bandwidth = ---------------------------------------------------
* period
*/
- bw = written - bdi->written_stamp;
+ bw = written - wb->written_stamp;
bw *= HZ;
if (unlikely(elapsed > period)) {
do_div(bw, elapsed);
avg = bw;
goto out;
}
- bw += (u64)bdi->write_bandwidth * (period - elapsed);
+ bw += (u64)wb->write_bandwidth * (period - elapsed);
bw >>= ilog2(period);

/*
@@ -878,8 +879,8 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
avg += (old - avg) >> 3;

out:
- bdi->write_bandwidth = bw;
- bdi->avg_write_bandwidth = avg;
+ wb->write_bandwidth = bw;
+ wb->avg_write_bandwidth = avg;
}

/*
@@ -944,20 +945,20 @@ static void global_update_bandwidth(unsigned long thresh,
* Normal bdi tasks will be curbed at or below it in long term.
* Obviously it should be around (write_bw / N) when there are N dd tasks.
*/
-static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long dirtied,
- unsigned long elapsed)
+static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long dirtied,
+ unsigned long elapsed)
{
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
unsigned long setpoint = (freerun + limit) / 2;
- unsigned long write_bw = bdi->avg_write_bandwidth;
- unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
+ unsigned long write_bw = wb->avg_write_bandwidth;
+ unsigned long dirty_ratelimit = wb->dirty_ratelimit;
unsigned long dirty_rate;
unsigned long task_ratelimit;
unsigned long balanced_dirty_ratelimit;
@@ -969,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* The dirty rate will match the writeout rate in long term, except
* when dirty pages are truncated by userspace or re-dirtied by FS.
*/
- dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+ dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;

- pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -1056,31 +1057,31 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,

/*
* For strictlimit case, calculations above were based on bdi counters
- * and limits (starting from pos_ratio = bdi_position_ratio() and up to
+ * and limits (starting from pos_ratio = wb_position_ratio() and up to
* balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
* Hence, to calculate "step" properly, we have to use bdi_dirty as
* "dirty" and bdi_setpoint as "setpoint".
*
* We rampup dirty_ratelimit forcibly if bdi_dirty is low because
* it's possible that bdi_thresh is close to zero due to inactivity
- * of backing device (see the implementation of bdi_dirty_limit()).
+ * of backing device (see the implementation of wb_dirty_limit()).
*/
- if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
+ if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
dirty = bdi_dirty;
if (bdi_dirty < 8)
setpoint = bdi_dirty + 1;
else
setpoint = (bdi_thresh +
- bdi_dirty_limit(bdi, bg_thresh)) / 2;
+ wb_dirty_limit(wb, bg_thresh)) / 2;
}

if (dirty < setpoint) {
- x = min3(bdi->balanced_dirty_ratelimit,
+ x = min3(wb->balanced_dirty_ratelimit,
balanced_dirty_ratelimit, task_ratelimit);
if (dirty_ratelimit < x)
step = x - dirty_ratelimit;
} else {
- x = max3(bdi->balanced_dirty_ratelimit,
+ x = max3(wb->balanced_dirty_ratelimit,
balanced_dirty_ratelimit, task_ratelimit);
if (dirty_ratelimit > x)
step = dirty_ratelimit - x;
@@ -1102,22 +1103,22 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
else
dirty_ratelimit -= step;

- bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
- bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
+ wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
+ wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;

- trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit);
+ trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
}

-void __bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time)
+void __wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time)
{
unsigned long now = jiffies;
- unsigned long elapsed = now - bdi->bw_time_stamp;
+ unsigned long elapsed = now - wb->bw_time_stamp;
unsigned long dirtied;
unsigned long written;

@@ -1127,44 +1128,44 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed < BANDWIDTH_INTERVAL)
return;

- dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
- written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);
+ dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
+ written = percpu_counter_read(&wb->stat[WB_WRITTEN]);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
* (at least 1s idle time between two flusher runs)
*/
- if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
+ if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
goto snapshot;

if (thresh) {
global_update_bandwidth(thresh, dirty, now);
- bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty,
- dirtied, elapsed);
+ wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty,
+ dirtied, elapsed);
}
- bdi_update_write_bandwidth(bdi, elapsed, written);
+ wb_update_write_bandwidth(wb, elapsed, written);

snapshot:
- bdi->dirtied_stamp = dirtied;
- bdi->written_stamp = written;
- bdi->bw_time_stamp = now;
+ wb->dirtied_stamp = dirtied;
+ wb->written_stamp = written;
+ wb->bw_time_stamp = now;
}

-static void bdi_update_bandwidth(struct backing_dev_info *bdi,
- unsigned long thresh,
- unsigned long bg_thresh,
- unsigned long dirty,
- unsigned long bdi_thresh,
- unsigned long bdi_dirty,
- unsigned long start_time)
+static void wb_update_bandwidth(struct bdi_writeback *wb,
+ unsigned long thresh,
+ unsigned long bg_thresh,
+ unsigned long dirty,
+ unsigned long bdi_thresh,
+ unsigned long bdi_dirty,
+ unsigned long start_time)
{
- if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
+ if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
- spin_lock(&bdi->wb.list_lock);
- __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
- spin_unlock(&bdi->wb.list_lock);
+ spin_lock(&wb->list_lock);
+ __wb_update_bandwidth(wb, thresh, bg_thresh, dirty,
+ bdi_thresh, bdi_dirty, start_time);
+ spin_unlock(&wb->list_lock);
}

/*
@@ -1184,10 +1185,10 @@ static unsigned long dirty_poll_interval(unsigned long dirty,
return 1;
}

-static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
- unsigned long bdi_dirty)
+static unsigned long wb_max_pause(struct bdi_writeback *wb,
+ unsigned long bdi_dirty)
{
- unsigned long bw = bdi->avg_write_bandwidth;
+ unsigned long bw = wb->avg_write_bandwidth;
unsigned long t;

/*
@@ -1203,14 +1204,14 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
return min_t(unsigned long, t, MAX_PAUSE);
}

-static long bdi_min_pause(struct backing_dev_info *bdi,
- long max_pause,
- unsigned long task_ratelimit,
- unsigned long dirty_ratelimit,
- int *nr_dirtied_pause)
+static long wb_min_pause(struct bdi_writeback *wb,
+ long max_pause,
+ unsigned long task_ratelimit,
+ unsigned long dirty_ratelimit,
+ int *nr_dirtied_pause)
{
- long hi = ilog2(bdi->avg_write_bandwidth);
- long lo = ilog2(bdi->dirty_ratelimit);
+ long hi = ilog2(wb->avg_write_bandwidth);
+ long lo = ilog2(wb->dirty_ratelimit);
long t; /* target pause */
long pause; /* estimated next pause */
int pages; /* target nr_dirtied_pause */
@@ -1278,14 +1279,13 @@ static long bdi_min_pause(struct backing_dev_info *bdi,
return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
}

-static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
- unsigned long dirty_thresh,
- unsigned long background_thresh,
- unsigned long *bdi_dirty,
- unsigned long *bdi_thresh,
- unsigned long *bdi_bg_thresh)
+static inline void wb_dirty_limits(struct bdi_writeback *wb,
+ unsigned long dirty_thresh,
+ unsigned long background_thresh,
+ unsigned long *bdi_dirty,
+ unsigned long *bdi_thresh,
+ unsigned long *bdi_bg_thresh)
{
- struct bdi_writeback *wb = &bdi->wb;
unsigned long wb_reclaimable;

/*
@@ -1298,10 +1298,10 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
* In this case we don't want to hard throttle the USB key
* dirtiers for 100 seconds until bdi_dirty drops under
* bdi_thresh. Instead the auxiliary bdi control line in
- * bdi_position_ratio() will let the dirtier task progress
+ * wb_position_ratio() will let the dirtier task progress
* at some rate <= (write_bw / 2) for bringing down bdi_dirty.
*/
- *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
+ *bdi_thresh = wb_dirty_limit(wb, dirty_thresh);

if (bdi_bg_thresh)
*bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
@@ -1351,6 +1351,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct bdi_writeback *wb = &bdi->wb;
bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
unsigned long start_time = jiffies;

@@ -1375,8 +1376,8 @@ static void balance_dirty_pages(struct address_space *mapping,
global_dirty_limits(&background_thresh, &dirty_thresh);

if (unlikely(strictlimit)) {
- bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, &bg_thresh);
+ wb_dirty_limits(wb, dirty_thresh, background_thresh,
+ &bdi_dirty, &bdi_thresh, &bg_thresh);

dirty = bdi_dirty;
thresh = bdi_thresh;
@@ -1407,28 +1408,28 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_start_background_writeback(bdi);

if (!strictlimit)
- bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
- &bdi_dirty, &bdi_thresh, NULL);
+ wb_dirty_limits(wb, dirty_thresh, background_thresh,
+ &bdi_dirty, &bdi_thresh, NULL);

dirty_exceeded = (bdi_dirty > bdi_thresh) &&
((nr_dirty > dirty_thresh) || strictlimit);
- if (dirty_exceeded && !bdi->dirty_exceeded)
- bdi->dirty_exceeded = 1;
+ if (dirty_exceeded && !wb->dirty_exceeded)
+ wb->dirty_exceeded = 1;

- bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
- nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ wb_update_bandwidth(wb, dirty_thresh, background_thresh,
+ nr_dirty, bdi_thresh, bdi_dirty,
+ start_time);

- dirty_ratelimit = bdi->dirty_ratelimit;
- pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
- background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ dirty_ratelimit = wb->dirty_ratelimit;
+ pos_ratio = wb_position_ratio(wb, dirty_thresh,
+ background_thresh, nr_dirty,
+ bdi_thresh, bdi_dirty);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
- max_pause = bdi_max_pause(bdi, bdi_dirty);
- min_pause = bdi_min_pause(bdi, max_pause,
- task_ratelimit, dirty_ratelimit,
- &nr_dirtied_pause);
+ max_pause = wb_max_pause(wb, bdi_dirty);
+ min_pause = wb_min_pause(wb, max_pause,
+ task_ratelimit, dirty_ratelimit,
+ &nr_dirtied_pause);

if (unlikely(task_ratelimit == 0)) {
period = max_pause;
@@ -1512,15 +1513,15 @@ pause:
* more page. However bdi_dirty has accounting errors. So use
* the larger and more IO friendly wb_stat_error.
*/
- if (bdi_dirty <= wb_stat_error(&bdi->wb))
+ if (bdi_dirty <= wb_stat_error(wb))
break;

if (fatal_signal_pending(current))
break;
}

- if (!dirty_exceeded && bdi->dirty_exceeded)
- bdi->dirty_exceeded = 0;
+ if (!dirty_exceeded && wb->dirty_exceeded)
+ wb->dirty_exceeded = 0;

if (writeback_in_progress(bdi))
return;
@@ -1584,6 +1585,7 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ struct bdi_writeback *wb = &bdi->wb;
int ratelimit;
int *p;

@@ -1591,7 +1593,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
return;

ratelimit = current->nr_dirtied_pause;
- if (bdi->dirty_exceeded)
+ if (wb->dirty_exceeded)
ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));

preempt_disable();
--
1.9.3

2014-11-20 15:13:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCHSET block/for-next] writeback: prepare for cgroup writeback support

On Tue 18-11-14 03:37:18, Tejun Heo wrote:
> Hello,
>
> This patchset contains the following 10 prepatory patches for cgroup
> writeback support. None of these patches introduces behavior changes.
>
> 0001-writeback-move-backing_dev_info-state-into-bdi_write.patch
> 0002-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch
> 0003-writeback-move-bandwidth-related-fields-from-backing.patch
> 0004-writeback-move-backing_dev_info-wb_lock-and-worklist.patch
> 0005-writeback-move-lingering-dirty-IO-lists-transfer-fro.patch
> 0006-writeback-reorganize-mm-backing-dev.c.patch
> 0007-writeback-separate-out-include-linux-backing-dev-def.patch
> 0008-writeback-cosmetic-change-in-account_page_dirtied.patch
> 0009-writeback-add-gfp-to-wb_init.patch
> 0010-writeback-move-inode_to_bdi-to-include-linux-backing.patch
>
> 0001-0005 move writeback related fields from bdi (backing_dev_info) to
> wb (bdi_writeback). Currently, one bdi embeds one wb and the
> separation between the two is blurry. bdi's lock protects wb's fields
> and fields which are closely related are scattered across the two.
> These five patches move all fields which are used during writeback
> into wb.
>
> 0006-0010 are misc prep patches. They're all rather trivial and each
> is self-explanatory.
>
> This patchset is on top of the current block/for-next eb494facbee2
> ("5748c0fce0fd40c87d164d6bee61") and is available in the following git
> branch.
I have no problem with these patches in principle (I'll check the
individual patches in detail) but do you have some higher-level design
for where exactly you are going?

Honza

PS: I've added a CC to linux-fsdevel since there's a high chance people
would miss these patches on lkml...

> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-cgroup-writeback-wb-prep
>
> diffstat follows. Thanks.
>
> block/blk-core.c | 1
> block/blk-integrity.c | 1
> block/blk-sysfs.c | 1
> block/bounce.c | 1
> block/genhd.c | 1
> drivers/block/drbd/drbd_int.h | 1
> drivers/block/drbd/drbd_main.c | 10 -
> drivers/block/pktcdvd.c | 1
> drivers/char/raw.c | 1
> drivers/md/bcache/request.c | 1
> drivers/md/dm.c | 2
> drivers/md/dm.h | 1
> drivers/md/md.h | 1
> drivers/md/raid1.c | 4
> drivers/md/raid10.c | 2
> drivers/mtd/devices/block2mtd.c | 1
> fs/block_dev.c | 1
> fs/ext4/extents.c | 1
> fs/ext4/mballoc.c | 1
> fs/f2fs/node.c | 2
> fs/f2fs/segment.h | 1
> fs/fs-writeback.c | 121 ++++++---------
> fs/fuse/file.c | 12 -
> fs/gfs2/super.c | 2
> fs/hfs/super.c | 1
> fs/hfsplus/super.c | 1
> fs/nfs/filelayout/filelayout.c | 5
> fs/nfs/write.c | 11 -
> fs/reiserfs/super.c | 1
> fs/ufs/super.c | 1
> include/linux/backing-dev-defs.h | 105 +++++++++++++
> include/linux/backing-dev.h | 174 +++++-----------------
> include/linux/blkdev.h | 2
> include/linux/writeback.h | 19 +-
> include/trace/events/writeback.h | 8 -
> mm/backing-dev.c | 306 +++++++++++++++++++--------------------
> mm/filemap.c | 2
> mm/madvise.c | 1
> mm/page-writeback.c | 304 +++++++++++++++++++-------------------
> mm/truncate.c | 4
> 40 files changed, 570 insertions(+), 546 deletions(-)
>
> --
> tejun
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 15:15:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block/for-next] writeback: prepare for cgroup writeback support

On Thu, Nov 20, 2014 at 04:13:11PM +0100, Jan Kara wrote:
> On Tue 18-11-14 03:37:18, Tejun Heo wrote:
> > Hello,
> >
> > This patchset contains the following 10 prepatory patches for cgroup
> > writeback support. None of these patches introduces behavior changes.
> >
> > 0001-writeback-move-backing_dev_info-state-into-bdi_write.patch
> > 0002-writeback-move-backing_dev_info-bdi_stat-into-bdi_wr.patch
> > 0003-writeback-move-bandwidth-related-fields-from-backing.patch
> > 0004-writeback-move-backing_dev_info-wb_lock-and-worklist.patch
> > 0005-writeback-move-lingering-dirty-IO-lists-transfer-fro.patch
> > 0006-writeback-reorganize-mm-backing-dev.c.patch
> > 0007-writeback-separate-out-include-linux-backing-dev-def.patch
> > 0008-writeback-cosmetic-change-in-account_page_dirtied.patch
> > 0009-writeback-add-gfp-to-wb_init.patch
> > 0010-writeback-move-inode_to_bdi-to-include-linux-backing.patch
> >
> > 0001-0005 move writeback-related fields from bdi (backing_dev_info) to
> > wb (bdi_writeback). Currently, one bdi embeds one wb and the
> > separation between the two is blurry. bdi's lock protects wb's fields,
> > and closely related fields are scattered across the two structures.
> > These five patches move all fields which are used during writeback
> > into wb.
> >
> > 0006-0010 are misc prep patches. They're all rather trivial and each
> > is self-explanatory.
> >
> > This patchset is on top of the current block/for-next eb494facbee2
> > ("5748c0fce0fd40c87d164d6bee61") and is available in the following git
> > branch.
> I have no problem with these patches in principle (I'll check the individual
> patches in detail), but do you have some higher-level design for where
> exactly you are going?

Yeah, I'm prepping the actual patchset and it'll go out with a high-level
description. I just wanted to send out the prep ones separately to keep
the patchset at a manageable size.

> PS: I've added a CC to linux-fsdevel since there's a high chance people will
> miss these patches on lkml...

Will do so when posting the actual series.

Thanks.

--
tejun

2014-11-20 15:27:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 01/10] writeback: move backing_dev_info->state into bdi_writeback

On Tue 18-11-14 03:37:19, Tejun Heo wrote:
> Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
> and the role of the separation is unclear. For cgroup support for
> writeback IOs, a bdi will be updated to host multiple wb's where each
> wb serves writeback IOs of a different cgroup on the bdi. To achieve
> that, a wb should carry all states necessary for servicing writeback
> IOs for a cgroup independently.
>
> This patch moves bdi->state into wb.
>
> * enum bdi_state is renamed to wb_state and the prefix of all enums is
> changed from BDI_ to WB_.
>
> * Explicit zeroing of bdi->state is removed without adding zeroing of
> wb->state as the whole data structure is zeroed on init anyway.
>
> * As there's still only one bdi_writeback per backing_dev_info, all
> uses of bdi->state are mechanically replaced with bdi->wb.state
> introducing no behavior changes.
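
A minimal before/after sketch of the mechanical conversion described in
the last point, distilled from the test_bit() hunks quoted below (no
additional change is implied):

	- if (test_bit(BDI_registered, &bdi->state))
	+ if (test_bit(WB_registered, &bdi->wb.state))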
Hum, does it make sense to convert BDI_sync_congested and
BDI_async_congested? They carry information on whether the *device* is
congested and cannot take more work. I understand that in a cgroup world
you want to throttle IO from a cgroup to a device, so when you make
bdi_writeback a per-cgroup structure you want some indication there
that a particular cgroup cannot push more to the device. But does
e.g. mdraid care about a cgroup rather than about the device?

Honza
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: [email protected]
> Cc: Neil Brown <[email protected]>
> Cc: Alasdair Kergon <[email protected]>
> Cc: Mike Snitzer <[email protected]>
> ---
> block/blk-core.c | 1 -
> drivers/block/drbd/drbd_main.c | 10 +++++-----
> drivers/md/dm.c | 2 +-
> drivers/md/raid1.c | 4 ++--
> drivers/md/raid10.c | 2 +-
> fs/fs-writeback.c | 14 +++++++-------
> include/linux/backing-dev.h | 24 ++++++++++++------------
> mm/backing-dev.c | 21 ++++++++++-----------
> 8 files changed, 38 insertions(+), 40 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 0421b53..8801682 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -584,7 +584,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
>
> q->backing_dev_info.ra_pages =
> (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> - q->backing_dev_info.state = 0;
> q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
> q->backing_dev_info.name = "block";
> q->node = node_id;
> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> index 1fc8342..61b00aa 100644
> --- a/drivers/block/drbd/drbd_main.c
> +++ b/drivers/block/drbd/drbd_main.c
> @@ -2360,7 +2360,7 @@ static void drbd_cleanup(void)
> * @congested_data: User data
> * @bdi_bits: Bits the BDI flusher thread is currently interested in
> *
> - * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
> + * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
> */
> static int drbd_congested(void *congested_data, int bdi_bits)
> {
> @@ -2377,14 +2377,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
> }
>
> if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
> - r |= (1 << BDI_async_congested);
> + r |= (1 << WB_async_congested);
> /* Without good local data, we would need to read from remote,
> * and that would need the worker thread as well, which is
> * currently blocked waiting for that usermode helper to
> * finish.
> */
> if (!get_ldev_if_state(device, D_UP_TO_DATE))
> - r |= (1 << BDI_sync_congested);
> + r |= (1 << WB_sync_congested);
> else
> put_ldev(device);
> r &= bdi_bits;
> @@ -2400,9 +2400,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
> reason = 'b';
> }
>
> - if (bdi_bits & (1 << BDI_async_congested) &&
> + if (bdi_bits & (1 << WB_async_congested) &&
> test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
> - r |= (1 << BDI_async_congested);
> + r |= (1 << WB_async_congested);
> reason = reason == 'b' ? 'a' : 'n';
> }
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 58f3927..c4c53af 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1950,7 +1950,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
> * the query about congestion status of request_queue
> */
> if (dm_request_based(md))
> - r = md->queue->backing_dev_info.state &
> + r = md->queue->backing_dev_info.wb.state &
> bdi_bits;
> else
> r = dm_table_any_congested(map, bdi_bits);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 40b35be..aad1482 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -739,7 +739,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
> struct r1conf *conf = mddev->private;
> int i, ret = 0;
>
> - if ((bits & (1 << BDI_async_congested)) &&
> + if ((bits & (1 << WB_async_congested)) &&
> conf->pending_count >= max_queued_requests)
> return 1;
>
> @@ -754,7 +754,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
> /* Note the '|| 1' - when read_balance prefers
> * non-congested targets, it can be removed
> */
> - if ((bits & (1<<BDI_async_congested)) || 1)
> + if ((bits & (1<<WB_async_congested)) || 1)
> ret |= bdi_congested(&q->backing_dev_info, bits);
> else
> ret &= bdi_congested(&q->backing_dev_info, bits);
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 32e282f..5180e75 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -915,7 +915,7 @@ int md_raid10_congested(struct mddev *mddev, int bits)
> struct r10conf *conf = mddev->private;
> int i, ret = 0;
>
> - if ((bits & (1 << BDI_async_congested)) &&
> + if ((bits & (1 << WB_async_congested)) &&
> conf->pending_count >= max_queued_requests)
> return 1;
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 2d609a5..a797bda 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -62,7 +62,7 @@ struct wb_writeback_work {
> */
> int writeback_in_progress(struct backing_dev_info *bdi)
> {
> - return test_bit(BDI_writeback_running, &bdi->state);
> + return test_bit(WB_writeback_running, &bdi->wb.state);
> }
> EXPORT_SYMBOL(writeback_in_progress);
>
> @@ -94,7 +94,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
> static void bdi_wakeup_thread(struct backing_dev_info *bdi)
> {
> spin_lock_bh(&bdi->wb_lock);
> - if (test_bit(BDI_registered, &bdi->state))
> + if (test_bit(WB_registered, &bdi->wb.state))
> mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
> spin_unlock_bh(&bdi->wb_lock);
> }
> @@ -105,7 +105,7 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
> trace_writeback_queue(bdi, work);
>
> spin_lock_bh(&bdi->wb_lock);
> - if (!test_bit(BDI_registered, &bdi->state)) {
> + if (!test_bit(WB_registered, &bdi->wb.state)) {
> if (work->done)
> complete(work->done);
> goto out_unlock;
> @@ -1007,7 +1007,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> struct wb_writeback_work *work;
> long wrote = 0;
>
> - set_bit(BDI_writeback_running, &wb->bdi->state);
> + set_bit(WB_writeback_running, &wb->state);
> while ((work = get_next_work_item(bdi)) != NULL) {
>
> trace_writeback_exec(bdi, work);
> @@ -1029,7 +1029,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> */
> wrote += wb_check_old_data_flush(wb);
> wrote += wb_check_background_flush(wb);
> - clear_bit(BDI_writeback_running, &wb->bdi->state);
> + clear_bit(WB_writeback_running, &wb->state);
>
> return wrote;
> }
> @@ -1049,7 +1049,7 @@ void bdi_writeback_workfn(struct work_struct *work)
> current->flags |= PF_SWAPWRITE;
>
> if (likely(!current_is_workqueue_rescuer() ||
> - !test_bit(BDI_registered, &bdi->state))) {
> + !test_bit(WB_registered, &wb->state))) {
> /*
> * The normal path. Keep writing back @bdi until its
> * work_list is empty. Note that this path is also taken
> @@ -1211,7 +1211,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> spin_unlock(&inode->i_lock);
> spin_lock(&bdi->wb.list_lock);
> if (bdi_cap_writeback_dirty(bdi)) {
> - WARN(!test_bit(BDI_registered, &bdi->state),
> + WARN(!test_bit(WB_registered, &bdi->wb.state),
> "bdi-%s not registered\n", bdi->name);
>
> /*
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 5da6012..a356ccd 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -25,13 +25,13 @@ struct device;
> struct dentry;
>
> /*
> - * Bits in backing_dev_info.state
> + * Bits in bdi_writeback.state
> */
> -enum bdi_state {
> - BDI_async_congested, /* The async (write) queue is getting full */
> - BDI_sync_congested, /* The sync queue is getting full */
> - BDI_registered, /* bdi_register() was done */
> - BDI_writeback_running, /* Writeback is in progress */
> +enum wb_state {
> + WB_async_congested, /* The async (write) queue is getting full */
> + WB_sync_congested, /* The sync queue is getting full */
> + WB_registered, /* bdi_register() was done */
> + WB_writeback_running, /* Writeback is in progress */
> };
>
> typedef int (congested_fn)(void *, int);
> @@ -49,6 +49,7 @@ enum bdi_stat_item {
> struct bdi_writeback {
> struct backing_dev_info *bdi; /* our parent bdi */
>
> + unsigned long state; /* Always use atomic bitops on this */
> unsigned long last_old_flush; /* last old data flush */
>
> struct delayed_work dwork; /* work item used for writeback */
> @@ -61,7 +62,6 @@ struct bdi_writeback {
> struct backing_dev_info {
> struct list_head bdi_list;
> unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
> - unsigned long state; /* Always use atomic bitops on this */
> unsigned int capabilities; /* Device capabilities */
> congested_fn *congested_fn; /* Function pointer if device is md/dm */
> void *congested_data; /* Pointer to aux data for congested func */
> @@ -276,23 +276,23 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
> {
> if (bdi->congested_fn)
> return bdi->congested_fn(bdi->congested_data, bdi_bits);
> - return (bdi->state & bdi_bits);
> + return (bdi->wb.state & bdi_bits);
> }
>
> static inline int bdi_read_congested(struct backing_dev_info *bdi)
> {
> - return bdi_congested(bdi, 1 << BDI_sync_congested);
> + return bdi_congested(bdi, 1 << WB_sync_congested);
> }
>
> static inline int bdi_write_congested(struct backing_dev_info *bdi)
> {
> - return bdi_congested(bdi, 1 << BDI_async_congested);
> + return bdi_congested(bdi, 1 << WB_async_congested);
> }
>
> static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> {
> - return bdi_congested(bdi, (1 << BDI_sync_congested) |
> - (1 << BDI_async_congested));
> + return bdi_congested(bdi, (1 << WB_sync_congested) |
> + (1 << WB_async_congested));
> }
>
> enum {
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 0ae0df5..62f3b33 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -17,7 +17,6 @@ static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
> struct backing_dev_info default_backing_dev_info = {
> .name = "default",
> .ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
> - .state = 0,
> .capabilities = BDI_CAP_MAP_COPY,
> };
> EXPORT_SYMBOL_GPL(default_backing_dev_info);
> @@ -111,7 +110,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> nr_dirty,
> nr_io,
> nr_more_io,
> - !list_empty(&bdi->bdi_list), bdi->state);
> + !list_empty(&bdi->bdi_list), bdi->wb.state);
> #undef K
>
> return 0;
> @@ -298,7 +297,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
>
> timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
> spin_lock_bh(&bdi->wb_lock);
> - if (test_bit(BDI_registered, &bdi->state))
> + if (test_bit(WB_registered, &bdi->wb.state))
> queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
> spin_unlock_bh(&bdi->wb_lock);
> }
> @@ -333,7 +332,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
> bdi->dev = dev;
>
> bdi_debug_register(bdi, dev_name(dev));
> - set_bit(BDI_registered, &bdi->state);
> + set_bit(WB_registered, &bdi->wb.state);
>
> spin_lock_bh(&bdi_lock);
> list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
> @@ -365,7 +364,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
>
> /* Make sure nobody queues further work */
> spin_lock_bh(&bdi->wb_lock);
> - clear_bit(BDI_registered, &bdi->state);
> + clear_bit(WB_registered, &bdi->wb.state);
> spin_unlock_bh(&bdi->wb_lock);
>
> /*
> @@ -543,11 +542,11 @@ static atomic_t nr_bdi_congested[2];
>
> void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> - enum bdi_state bit;
> + enum wb_state bit;
> wait_queue_head_t *wqh = &congestion_wqh[sync];
>
> - bit = sync ? BDI_sync_congested : BDI_async_congested;
> - if (test_and_clear_bit(bit, &bdi->state))
> + bit = sync ? WB_sync_congested : WB_async_congested;
> + if (test_and_clear_bit(bit, &bdi->wb.state))
> atomic_dec(&nr_bdi_congested[sync]);
> smp_mb__after_atomic();
> if (waitqueue_active(wqh))
> @@ -557,10 +556,10 @@ EXPORT_SYMBOL(clear_bdi_congested);
>
> void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> {
> - enum bdi_state bit;
> + enum wb_state bit;
>
> - bit = sync ? BDI_sync_congested : BDI_async_congested;
> - if (!test_and_set_bit(bit, &bdi->state))
> + bit = sync ? WB_sync_congested : WB_async_congested;
> + if (!test_and_set_bit(bit, &bdi->wb.state))
> atomic_inc(&nr_bdi_congested[sync]);
> }
> EXPORT_SYMBOL(set_bdi_congested);
> --
> 1.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 15:31:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 02/10] writeback: move backing_dev_info->bdi_stat[] into bdi_writeback

On Tue 18-11-14 03:37:20, Tejun Heo wrote:
> Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
> and the role of the separation is unclear. For cgroup support for
> writeback IOs, a bdi will be updated to host multiple wb's where each
> wb serves writeback IOs of a different cgroup on the bdi. To achieve
> that, a wb should carry all states necessary for servicing writeback
> IOs for a cgroup independently.
>
> This patch moves bdi->bdi_stat[] into wb.
>
> * enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
> enums is changed from BDI_ to WB_.
>
> * BDI_STAT_BATCH() -> WB_STAT_BATCH()
>
> * [__]{add|inc|dec|sum}_bdi_stat(bdi, ...) -> [__]{add|inc|dec|sum}_wb_stat(wb, ...)
>
> * bdi_stat[_error]() -> wb_stat[_error]()
>
> * bdi_writeout_inc() -> wb_writeout_inc()
>
> * stat init is moved to bdi_wb_init(), and bdi_wb_exit() is added to
> free the stat counters.
>
> * As there's still only one bdi_writeback per backing_dev_info, all
> uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
> introducing no behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Trond Myklebust <[email protected]>
Looks good. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza
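
For context, a minimal usage sketch of the per-wb stat API acked above,
using only helpers introduced in this patch (the threshold parameter is
illustrative; the real background check is reworked in patch 03):

	/* Sketch: does @wb hold more reclaimable pages than @thresh?
	 * wb_stat() reads the percpu counter approximately; see
	 * wb_stat_error() for the maximum error. */
	static bool wb_over_thresh(struct bdi_writeback *wb,
				   unsigned long thresh)
	{
		return wb_stat(wb, WB_RECLAIMABLE) > thresh;
	}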
> ---
> fs/fs-writeback.c | 2 +-
> fs/fuse/file.c | 12 ++++----
> fs/nfs/filelayout/filelayout.c | 4 +--
> fs/nfs/write.c | 11 +++----
> include/linux/backing-dev.h | 68 ++++++++++++++++++++----------------------
> mm/backing-dev.c | 61 +++++++++++++++++++++----------------
> mm/filemap.c | 2 +-
> mm/page-writeback.c | 53 ++++++++++++++++----------------
> mm/truncate.c | 4 +--
> 9 files changed, 112 insertions(+), 105 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index a797bda..f5ca16e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -790,7 +790,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
> global_page_state(NR_UNSTABLE_NFS) > background_thresh)
> return true;
>
> - if (bdi_stat(bdi, BDI_RECLAIMABLE) >
> + if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
> bdi_dirty_limit(bdi, background_thresh))
> return true;
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index caa8d95..1199471 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1514,9 +1514,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
>
> list_del(&req->writepages_entry);
> for (i = 0; i < req->num_pages; i++) {
> - dec_bdi_stat(bdi, BDI_WRITEBACK);
> + dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
> - bdi_writeout_inc(bdi);
> + wb_writeout_inc(&bdi->wb);
> }
> wake_up(&fi->page_waitq);
> }
> @@ -1703,7 +1703,7 @@ static int fuse_writepage_locked(struct page *page)
> req->end = fuse_writepage_end;
> req->inode = inode;
>
> - inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> + inc_wb_stat(&mapping->backing_dev_info->wb, WB_WRITEBACK);
> inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>
> spin_lock(&fc->lock);
> @@ -1818,9 +1818,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req,
> copy_highpage(old_req->pages[0], page);
> spin_unlock(&fc->lock);
>
> - dec_bdi_stat(bdi, BDI_WRITEBACK);
> + dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> dec_zone_page_state(page, NR_WRITEBACK_TEMP);
> - bdi_writeout_inc(bdi);
> + wb_writeout_inc(&bdi->wb);
> fuse_writepage_free(fc, new_req);
> fuse_request_free(new_req);
> goto out;
> @@ -1917,7 +1917,7 @@ static int fuse_writepages_fill(struct page *page,
> req->page_descs[req->num_pages].offset = 0;
> req->page_descs[req->num_pages].length = PAGE_SIZE;
>
> - inc_bdi_stat(page->mapping->backing_dev_info, BDI_WRITEBACK);
> + inc_wb_stat(&page->mapping->backing_dev_info->wb, WB_WRITEBACK);
> inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
>
> err = 0;
> diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c
> index 46fab1cb..0554e3c 100644
> --- a/fs/nfs/filelayout/filelayout.c
> +++ b/fs/nfs/filelayout/filelayout.c
> @@ -1084,8 +1084,8 @@ mds_commit:
> spin_unlock(cinfo->lock);
> if (!cinfo->dreq) {
> inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> - inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
> - BDI_RECLAIMABLE);
> + inc_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> __mark_inode_dirty(req->wb_context->dentry->d_inode,
> I_DIRTY_DATASYNC);
> }
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 1249384..943ddab 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -781,8 +781,8 @@ nfs_request_add_commit_list(struct nfs_page *req, struct list_head *dst,
> spin_unlock(cinfo->lock);
> if (!cinfo->dreq) {
> inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> - inc_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
> - BDI_RECLAIMABLE);
> + inc_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> __mark_inode_dirty(req->wb_context->dentry->d_inode,
> I_DIRTY_DATASYNC);
> }
> @@ -848,7 +848,8 @@ static void
> nfs_clear_page_commit(struct page *page)
> {
> dec_zone_page_state(page, NR_UNSTABLE_NFS);
> - dec_bdi_stat(page_file_mapping(page)->backing_dev_info, BDI_RECLAIMABLE);
> + dec_wb_stat(&page_file_mapping(page)->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> }
>
> /* Called holding inode (/cinfo) lock */
> @@ -1559,8 +1560,8 @@ void nfs_retry_commit(struct list_head *page_list,
> nfs_mark_request_commit(req, lseg, cinfo);
> if (!cinfo->dreq) {
> dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
> - dec_bdi_stat(page_file_mapping(req->wb_page)->backing_dev_info,
> - BDI_RECLAIMABLE);
> + dec_wb_stat(&page_file_mapping(req->wb_page)->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> }
> nfs_unlock_and_release_request(req);
> }
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index a356ccd..92fed42 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -36,15 +36,15 @@ enum wb_state {
>
> typedef int (congested_fn)(void *, int);
>
> -enum bdi_stat_item {
> - BDI_RECLAIMABLE,
> - BDI_WRITEBACK,
> - BDI_DIRTIED,
> - BDI_WRITTEN,
> - NR_BDI_STAT_ITEMS
> +enum wb_stat_item {
> + WB_RECLAIMABLE,
> + WB_WRITEBACK,
> + WB_DIRTIED,
> + WB_WRITTEN,
> + NR_WB_STAT_ITEMS
> };
>
> -#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
> +#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
>
> struct bdi_writeback {
> struct backing_dev_info *bdi; /* our parent bdi */
> @@ -57,6 +57,8 @@ struct bdi_writeback {
> struct list_head b_io; /* parked for writeback */
> struct list_head b_more_io; /* parked for more writeback */
> spinlock_t list_lock; /* protects the b_* lists */
> +
> + struct percpu_counter stat[NR_WB_STAT_ITEMS];
> };
>
> struct backing_dev_info {
> @@ -68,8 +70,6 @@ struct backing_dev_info {
>
> char *name;
>
> - struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
> -
> unsigned long bw_time_stamp; /* last time write bw is updated */
> unsigned long dirtied_stamp;
> unsigned long written_stamp; /* pages written at bw_time_stamp */
> @@ -134,78 +134,74 @@ static inline int wb_has_dirty_io(struct bdi_writeback *wb)
> !list_empty(&wb->b_more_io);
> }
>
> -static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item, s64 amount)
> +static inline void __add_wb_stat(struct bdi_writeback *wb,
> + enum wb_stat_item item, s64 amount)
> {
> - __percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH);
> + __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH);
> }
>
> -static inline void __inc_bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline void __inc_wb_stat(struct bdi_writeback *wb,
> + enum wb_stat_item item)
> {
> - __add_bdi_stat(bdi, item, 1);
> + __add_wb_stat(wb, item, 1);
> }
>
> -static inline void inc_bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> unsigned long flags;
>
> local_irq_save(flags);
> - __inc_bdi_stat(bdi, item);
> + __inc_wb_stat(wb, item);
> local_irq_restore(flags);
> }
>
> -static inline void __dec_bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline void __dec_wb_stat(struct bdi_writeback *wb,
> + enum wb_stat_item item)
> {
> - __add_bdi_stat(bdi, item, -1);
> + __add_wb_stat(wb, item, -1);
> }
>
> -static inline void dec_bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> unsigned long flags;
>
> local_irq_save(flags);
> - __dec_bdi_stat(bdi, item);
> + __dec_wb_stat(wb, item);
> local_irq_restore(flags);
> }
>
> -static inline s64 bdi_stat(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> - return percpu_counter_read_positive(&bdi->bdi_stat[item]);
> + return percpu_counter_read_positive(&wb->stat[item]);
> }
>
> -static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline s64 __wb_stat_sum(struct bdi_writeback *wb,
> + enum wb_stat_item item)
> {
> - return percpu_counter_sum_positive(&bdi->bdi_stat[item]);
> + return percpu_counter_sum_positive(&wb->stat[item]);
> }
>
> -static inline s64 bdi_stat_sum(struct backing_dev_info *bdi,
> - enum bdi_stat_item item)
> +static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item)
> {
> s64 sum;
> unsigned long flags;
>
> local_irq_save(flags);
> - sum = __bdi_stat_sum(bdi, item);
> + sum = __wb_stat_sum(wb, item);
> local_irq_restore(flags);
>
> return sum;
> }
>
> -extern void bdi_writeout_inc(struct backing_dev_info *bdi);
> +extern void wb_writeout_inc(struct bdi_writeback *wb);
>
> /*
> * maximal error of a stat counter.
> */
> -static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi)
> +static inline unsigned long wb_stat_error(struct bdi_writeback *wb)
> {
> #ifdef CONFIG_SMP
> - return nr_cpu_ids * BDI_STAT_BATCH;
> + return nr_cpu_ids * WB_STAT_BATCH;
> #else
> return 1;
> #endif
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 62f3b33..4b6f650 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -99,13 +99,13 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> "b_more_io: %10lu\n"
> "bdi_list: %10u\n"
> "state: %10lx\n",
> - (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
> + (unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
> + (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)),
> K(bdi_thresh),
> K(dirty_thresh),
> K(background_thresh),
> - (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
> - (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
> + (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> + (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> (unsigned long) K(bdi->write_bandwidth),
> nr_dirty,
> nr_io,
> @@ -408,8 +408,10 @@ void bdi_unregister(struct backing_dev_info *bdi)
> }
> EXPORT_SYMBOL(bdi_unregister);
>
> -static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> +static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> {
> + int i, err;
> +
> memset(wb, 0, sizeof(*wb));
>
> wb->bdi = bdi;
> @@ -419,6 +421,27 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> INIT_LIST_HEAD(&wb->b_more_io);
> spin_lock_init(&wb->list_lock);
> INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
> +
> + for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
> + err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
> + if (err) {
> + while (i--)
> + percpu_counter_destroy(&wb->stat[i]);
> + return err;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static void bdi_wb_exit(struct bdi_writeback *wb)
> +{
> + int i;
> +
> + WARN_ON(delayed_work_pending(&wb->dwork));
> +
> + for (i = 0; i < NR_WB_STAT_ITEMS; i++)
> + percpu_counter_destroy(&wb->stat[i]);
> }
>
> /*
> @@ -428,7 +451,7 @@ static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
>
> int bdi_init(struct backing_dev_info *bdi)
> {
> - int i, err;
> + int err;
>
> bdi->dev = NULL;
>
> @@ -439,13 +462,9 @@ int bdi_init(struct backing_dev_info *bdi)
> INIT_LIST_HEAD(&bdi->bdi_list);
> INIT_LIST_HEAD(&bdi->work_list);
>
> - bdi_wb_init(&bdi->wb, bdi);
> -
> - for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
> - err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL);
> - if (err)
> - goto err;
> - }
> + err = bdi_wb_init(&bdi->wb, bdi);
> + if (err)
> + return err;
>
> bdi->dirty_exceeded = 0;
>
> @@ -458,21 +477,17 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->avg_write_bandwidth = INIT_BW;
>
> err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
> -
> if (err) {
> -err:
> - while (i--)
> - percpu_counter_destroy(&bdi->bdi_stat[i]);
> + bdi_wb_exit(&bdi->wb);
> + return err;
> }
>
> - return err;
> + return 0;
> }
> EXPORT_SYMBOL(bdi_init);
>
> void bdi_destroy(struct backing_dev_info *bdi)
> {
> - int i;
> -
> /*
> * Splice our entries to the default_backing_dev_info. This
> * condition shouldn't happen. @wb must be empty at this point and
> @@ -498,11 +513,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
> }
>
> bdi_unregister(bdi);
> -
> - WARN_ON(delayed_work_pending(&bdi->wb.dwork));
> -
> - for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
> - percpu_counter_destroy(&bdi->bdi_stat[i]);
> + bdi_wb_exit(&bdi->wb);
>
> fprop_local_destroy_percpu(&bdi->completions);
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 14b4642..1405fc5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -211,7 +211,7 @@ void __delete_from_page_cache(struct page *page, void *shadow)
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> + dec_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
> }
> }
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 19ceae8..68fd72a 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -396,11 +396,11 @@ static unsigned long wp_next_time(unsigned long cur_time)
> * Increment the BDI's writeout completion count and the global writeout
> * completion count. Called from test_clear_page_writeback().
> */
> -static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
> +static inline void __wb_writeout_inc(struct bdi_writeback *wb)
> {
> - __inc_bdi_stat(bdi, BDI_WRITTEN);
> - __fprop_inc_percpu_max(&writeout_completions, &bdi->completions,
> - bdi->max_prop_frac);
> + __inc_wb_stat(wb, WB_WRITTEN);
> + __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
> + wb->bdi->max_prop_frac);
> /* First event after period switching was turned off? */
> if (!unlikely(writeout_period_time)) {
> /*
> @@ -414,15 +414,15 @@ static inline void __bdi_writeout_inc(struct backing_dev_info *bdi)
> }
> }
>
> -void bdi_writeout_inc(struct backing_dev_info *bdi)
> +void wb_writeout_inc(struct bdi_writeback *wb)
> {
> unsigned long flags;
>
> local_irq_save(flags);
> - __bdi_writeout_inc(bdi);
> + __wb_writeout_inc(wb);
> local_irq_restore(flags);
> }
> -EXPORT_SYMBOL_GPL(bdi_writeout_inc);
> +EXPORT_SYMBOL_GPL(wb_writeout_inc);
>
> /*
> * Obtain an accurate fraction of the BDI's portion.
> @@ -1127,8 +1127,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> if (elapsed < BANDWIDTH_INTERVAL)
> return;
>
> - dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
> - written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> + dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
> + written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);
>
> /*
> * Skip quiet periods when disk bandwidth is under-utilized.
> @@ -1285,7 +1285,8 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
> unsigned long *bdi_thresh,
> unsigned long *bdi_bg_thresh)
> {
> - unsigned long bdi_reclaimable;
> + struct bdi_writeback *wb = &bdi->wb;
> + unsigned long wb_reclaimable;
>
> /*
> * bdi_thresh is not treated as some limiting factor as
> @@ -1317,14 +1318,12 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
> * actually dirty; with m+n sitting in the percpu
> * deltas.
> */
> - if (*bdi_thresh < 2 * bdi_stat_error(bdi)) {
> - bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE);
> - *bdi_dirty = bdi_reclaimable +
> - bdi_stat_sum(bdi, BDI_WRITEBACK);
> + if (*bdi_thresh < 2 * wb_stat_error(wb)) {
> + wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE);
> + *bdi_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK);
> } else {
> - bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE);
> - *bdi_dirty = bdi_reclaimable +
> - bdi_stat(bdi, BDI_WRITEBACK);
> + wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE);
> + *bdi_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK);
> }
> }
>
> @@ -1511,9 +1510,9 @@ pause:
> * In theory 1 page is enough to keep the consumer-producer
> * pipe going: the flusher cleans 1 page => the task dirties 1
> * more page. However bdi_dirty has accounting errors. So use
> - * the larger and more IO friendly bdi_stat_error.
> + * the larger and more IO friendly wb_stat_error.
> */
> - if (bdi_dirty <= bdi_stat_error(bdi))
> + if (bdi_dirty <= wb_stat_error(&bdi->wb))
> break;
>
> if (fatal_signal_pending(current))
> @@ -2106,8 +2105,8 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
> if (mapping_cap_account_dirty(mapping)) {
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> - __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> - __inc_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> + __inc_wb_stat(&mapping->backing_dev_info->wb, WB_RECLAIMABLE);
> + __inc_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
> task_io_account_write(PAGE_CACHE_SIZE);
> current->nr_dirtied++;
> this_cpu_inc(bdp_ratelimits);
> @@ -2173,7 +2172,7 @@ void account_page_redirty(struct page *page)
> if (mapping && mapping_cap_account_dirty(mapping)) {
> current->nr_dirtied--;
> dec_zone_page_state(page, NR_DIRTIED);
> - dec_bdi_stat(mapping->backing_dev_info, BDI_DIRTIED);
> + dec_wb_stat(&mapping->backing_dev_info->wb, WB_DIRTIED);
> }
> }
> EXPORT_SYMBOL(account_page_redirty);
> @@ -2314,8 +2313,8 @@ int clear_page_dirty_for_io(struct page *page)
> */
> if (TestClearPageDirty(page)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> + dec_wb_stat(&mapping->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> return 1;
> }
> return 0;
> @@ -2344,8 +2343,8 @@ int test_clear_page_writeback(struct page *page)
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi)) {
> - __dec_bdi_stat(bdi, BDI_WRITEBACK);
> - __bdi_writeout_inc(bdi);
> + __dec_wb_stat(&bdi->wb, WB_WRITEBACK);
> + __wb_writeout_inc(&bdi->wb);
> }
> }
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> @@ -2381,7 +2380,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> - __inc_bdi_stat(bdi, BDI_WRITEBACK);
> + __inc_wb_stat(&bdi->wb, WB_WRITEBACK);
> }
> if (!PageDirty(page))
> radix_tree_tag_clear(&mapping->page_tree,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 261eaf6..623319c 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -112,8 +112,8 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
> struct address_space *mapping = page->mapping;
> if (mapping && mapping_cap_account_dirty(mapping)) {
> dec_zone_page_state(page, NR_FILE_DIRTY);
> - dec_bdi_stat(mapping->backing_dev_info,
> - BDI_RECLAIMABLE);
> + dec_wb_stat(&mapping->backing_dev_info->wb,
> + WB_RECLAIMABLE);
> if (account_size)
> task_io_account_cancelled_write(account_size);
> }
> --
> 1.9.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 15:38:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 01/10] writeback: move backing_dev_info->state into bdi_writeback

Hello, Jan.

On Thu, Nov 20, 2014 at 04:27:02PM +0100, Jan Kara wrote:
> Hum, does it make sense to convert BDI_sync_congested and
> BDI_async_congested? They carry information on whether the *device* is
> congested and cannot take more work. I understand that in a cgroup world

Yeah, I mean, with cgroup writeback, the device itself doesn't matter.
The only thing writeback sees is that cgroup's slice of the device,
whose congestion status can be independent from that of the other
slices of the device.

> you want to throttle IO from a cgroup to a device, so when you make
> bdi_writeback a per-cgroup structure you want some indication there
> that a particular cgroup cannot push more to the device. But does
> e.g. mdraid care about a cgroup rather than about the device?

I haven't updated mdraid to support cgroup writeback yet, but it depends
on how it's implemented. If it just transmits back the pressure from
the individual underlying cgroup-split devices, it's the same. If we
want to put blkcg splitting in front of mdraid and keep the backend side
clear of cgroup splitting, it'd just send everything down as belonging
to the root cgroup.

Thanks.

--
tejun
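
To make the "transmits back the pressure" case above concrete, a
hypothetical stacking driver's congested_fn could simply forward the
query to its underlying queue; the function and its wiring here are
illustrative, not part of the patchset:

	/* Hypothetical: report the underlying device's congestion as ours. */
	static int stacked_congested(void *congested_data, int bdi_bits)
	{
		struct request_queue *q = congested_data;

		return bdi_congested(&q->backing_dev_info, bdi_bits);
	}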

2014-11-20 15:45:04

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCHSET block/for-next] writeback: prepare for cgroup writeback support

On Thu, Nov 20, 2014 at 10:14:56AM -0500, Tejun Heo wrote:
> > PS: I've added a CC to linux-fsdevel since there's a high chance people will
> > miss these patches on lkml...
>
> Will do so when posting the actual series.

Please send the prep patches to fsdevel and linux-mm. Without that I'll
auto-NAK them :)

2014-11-20 15:53:01

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 03/10] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback

On Tue 18-11-14 03:37:21, Tejun Heo wrote:
> Currently, a bdi (backing_dev_info) embeds a single wb (bdi_writeback)
> and the role of the separation is unclear. For cgroup support for
> writeback IOs, a bdi will be updated to host multiple wb's where each
> wb serves writeback IOs of a different cgroup on the bdi. To achieve
> that, a wb should carry all states necessary for servicing writeback
> IOs for a cgroup independently.
>
> This patch moves bandwidth related fields from backing_dev_info into
> bdi_writeback.
>
> * The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
> write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
> balanced_dirty_ratelimit, completions and dirty_exceeded.
>
> * writeback_chunk_size() and over_bground_thresh() now take @wb instead
> of @bdi.
>
> * bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
> bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
> bdi_position_ratio(bdi, ...) -> wb_position_ratio(wb, ...)
> bdi_update_write_bandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
> [__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
> bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
> bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)
>
> * Init/exits of the relocated fields are moved to bdi_wb_init/exit()
> respectively. Note that explicit zeroing is dropped in the process
> as wb's are cleared in their entirety anyway.
>
> * As there's still only one bdi_writeback per backing_dev_info, all
> uses of the moved fields are mechanically replaced with their
> bdi->wb.* counterparts, introducing no behavior changes.
Ok, no problem with this patch, but I wonder: when you are moving all the
dirty limiting logic to bdi_writeback, how do you plan to interpret
min/max_ratio in the presence of several bdi_writeback structures?

Honza
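
The min/max_ratio knobs enter through the computation that becomes
wb_dirty_limit() below; a condensed sketch of that function follows (the
min_ratio/max_ratio clamping tail is not in the quoted hunk, so take it
as illustrative of mainline at the time):

	long numerator, denominator;
	u64 bdi_dirty;

	wb_writeout_fraction(wb, &numerator, &denominator);
	bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
	bdi_dirty *= numerator;
	do_div(bdi_dirty, denominator);
	bdi_dirty += (dirty * bdi->min_ratio) / 100;	/* per-bdi floor */
	if (bdi_dirty > (dirty * bdi->max_ratio) / 100)	/* per-bdi cap */
		bdi_dirty = dirty * bdi->max_ratio / 100;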
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Wu Fengguang <[email protected]>
> Cc: Jaegeuk Kim <[email protected]>
> Cc: Steven Whitehouse <[email protected]>
> ---
> fs/f2fs/node.c | 2 +-
> fs/fs-writeback.c | 17 ++-
> fs/gfs2/super.c | 2 +-
> include/linux/backing-dev.h | 20 ++--
> include/linux/writeback.h | 19 ++-
> include/trace/events/writeback.h | 8 +-
> mm/backing-dev.c | 45 ++++---
> mm/page-writeback.c | 246 ++++++++++++++++++++-------------------
> 8 files changed, 177 insertions(+), 182 deletions(-)
>
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 44b8afe..c53d94b 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -43,7 +43,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type)
> mem_size = (nm_i->nat_cnt * sizeof(struct nat_entry)) >> 12;
> res = mem_size < ((val.totalram * nm_i->ram_thresh / 100) >> 2);
> } else if (type == DIRTY_DENTS) {
> - if (sbi->sb->s_bdi->dirty_exceeded)
> + if (sbi->sb->s_bdi->wb.dirty_exceeded)
> return false;
> mem_size = get_pages(sbi, F2FS_DIRTY_DENTS);
> res = mem_size < ((val.totalram * nm_i->ram_thresh / 100) >> 1);
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index f5ca16e..daa91ae 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -574,7 +574,7 @@ out:
> return ret;
> }
>
> -static long writeback_chunk_size(struct backing_dev_info *bdi,
> +static long writeback_chunk_size(struct bdi_writeback *wb,
> struct wb_writeback_work *work)
> {
> long pages;
> @@ -595,7 +595,7 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
> if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages)
> pages = LONG_MAX;
> else {
> - pages = min(bdi->avg_write_bandwidth / 2,
> + pages = min(wb->avg_write_bandwidth / 2,
> global_dirty_limit / DIRTY_SCOPE);
> pages = min(pages, work->nr_pages);
> pages = round_down(pages + MIN_WRITEBACK_PAGES,
> @@ -693,7 +693,7 @@ static long writeback_sb_inodes(struct super_block *sb,
> inode->i_state |= I_SYNC;
> spin_unlock(&inode->i_lock);
>
> - write_chunk = writeback_chunk_size(wb->bdi, work);
> + write_chunk = writeback_chunk_size(wb, work);
> wbc.nr_to_write = write_chunk;
> wbc.pages_skipped = 0;
>
> @@ -780,7 +780,7 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
> return nr_pages - work.nr_pages;
> }
>
> -static bool over_bground_thresh(struct backing_dev_info *bdi)
> +static bool over_bground_thresh(struct bdi_writeback *wb)
> {
> unsigned long background_thresh, dirty_thresh;
>
> @@ -790,8 +790,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
> global_page_state(NR_UNSTABLE_NFS) > background_thresh)
> return true;
>
> - if (wb_stat(&bdi->wb, WB_RECLAIMABLE) >
> - bdi_dirty_limit(bdi, background_thresh))
> + if (wb_stat(wb, WB_RECLAIMABLE) > wb_dirty_limit(wb, background_thresh))
> return true;
>
> return false;
> @@ -804,7 +803,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
> static void wb_update_bandwidth(struct bdi_writeback *wb,
> unsigned long start_time)
> {
> - __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
> + __wb_update_bandwidth(wb, 0, 0, 0, 0, 0, start_time);
> }
>
> /*
> @@ -856,7 +855,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> * For background writeout, stop when we are below the
> * background dirty threshold
> */
> - if (work->for_background && !over_bground_thresh(wb->bdi))
> + if (work->for_background && !over_bground_thresh(wb))
> break;
>
> /*
> @@ -948,7 +947,7 @@ static unsigned long get_nr_dirty_pages(void)
>
> static long wb_check_background_flush(struct bdi_writeback *wb)
> {
> - if (over_bground_thresh(wb->bdi)) {
> + if (over_bground_thresh(wb)) {
>
> struct wb_writeback_work work = {
> .nr_pages = LONG_MAX,
> diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
> index a346f56..4566c89 100644
> --- a/fs/gfs2/super.c
> +++ b/fs/gfs2/super.c
> @@ -755,7 +755,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc)
>
> if (wbc->sync_mode == WB_SYNC_ALL)
> gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH);
> - if (bdi->dirty_exceeded)
> + if (bdi->wb.dirty_exceeded)
> gfs2_ail1_flush(sdp, wbc);
> else
> filemap_fdatawrite(metamapping);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 92fed42..a077a8d 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -59,16 +59,6 @@ struct bdi_writeback {
> spinlock_t list_lock; /* protects the b_* lists */
>
> struct percpu_counter stat[NR_WB_STAT_ITEMS];
> -};
> -
> -struct backing_dev_info {
> - struct list_head bdi_list;
> - unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
> - unsigned int capabilities; /* Device capabilities */
> - congested_fn *congested_fn; /* Function pointer if device is md/dm */
> - void *congested_data; /* Pointer to aux data for congested func */
> -
> - char *name;
>
> unsigned long bw_time_stamp; /* last time write bw is updated */
> unsigned long dirtied_stamp;
> @@ -87,6 +77,16 @@ struct backing_dev_info {
>
> struct fprop_local_percpu completions;
> int dirty_exceeded;
> +};
> +
> +struct backing_dev_info {
> + struct list_head bdi_list;
> + unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
> + unsigned int capabilities; /* Device capabilities */
> + congested_fn *congested_fn; /* Function pointer if device is md/dm */
> + void *congested_data; /* Pointer to aux data for congested func */
> +
> + char *name;
>
> unsigned int min_ratio;
> unsigned int max_ratio, max_prop_frac;
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index a219be9..6887eb5 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -152,16 +152,15 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int,
> void __user *, size_t *, loff_t *);
>
> void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty);
> -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi,
> - unsigned long dirty);
> -
> -void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> - unsigned long thresh,
> - unsigned long bg_thresh,
> - unsigned long dirty,
> - unsigned long bdi_thresh,
> - unsigned long bdi_dirty,
> - unsigned long start_time);
> +unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty);
> +
> +void __wb_update_bandwidth(struct bdi_writeback *wb,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty,
> + unsigned long start_time);
>
> void page_writeback_init(void);
> void balance_dirty_pages_ratelimited(struct address_space *mapping);
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index cee02d6..8622b5b 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -373,13 +373,13 @@ TRACE_EVENT(bdi_dirty_ratelimit,
>
> TP_fast_assign(
> strlcpy(__entry->bdi, dev_name(bdi->dev), 32);
> - __entry->write_bw = KBps(bdi->write_bandwidth);
> - __entry->avg_write_bw = KBps(bdi->avg_write_bandwidth);
> + __entry->write_bw = KBps(bdi->wb.write_bandwidth);
> + __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth);
> __entry->dirty_rate = KBps(dirty_rate);
> - __entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit);
> + __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit);
> __entry->task_ratelimit = KBps(task_ratelimit);
> __entry->balanced_dirty_ratelimit =
> - KBps(bdi->balanced_dirty_ratelimit);
> + KBps(bdi->wb.balanced_dirty_ratelimit);
> ),
>
> TP_printk("bdi %s: "
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 4b6f650..7b9b10e 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -82,7 +82,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> spin_unlock(&wb->list_lock);
>
> global_dirty_limits(&background_thresh, &dirty_thresh);
> - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> + bdi_thresh = wb_dirty_limit(wb, dirty_thresh);
>
> #define K(x) ((x) << (PAGE_SHIFT - 10))
> seq_printf(m,
> @@ -106,7 +106,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> K(background_thresh),
> (unsigned long) K(wb_stat(wb, WB_DIRTIED)),
> (unsigned long) K(wb_stat(wb, WB_WRITTEN)),
> - (unsigned long) K(bdi->write_bandwidth),
> + (unsigned long) K(wb->write_bandwidth),
> nr_dirty,
> nr_io,
> nr_more_io,
> @@ -408,6 +408,11 @@ void bdi_unregister(struct backing_dev_info *bdi)
> }
> EXPORT_SYMBOL(bdi_unregister);
>
> +/*
> + * Initial write bandwidth: 100 MB/s
> + */
> +#define INIT_BW (100 << (20 - PAGE_SHIFT))
> +
> static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> {
> int i, err;
> @@ -422,11 +427,22 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> spin_lock_init(&wb->list_lock);
> INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>
> + wb->bw_time_stamp = jiffies;
> + wb->balanced_dirty_ratelimit = INIT_BW;
> + wb->dirty_ratelimit = INIT_BW;
> + wb->write_bandwidth = INIT_BW;
> + wb->avg_write_bandwidth = INIT_BW;
> +
> + err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
> + if (err)
> + return err;
> +
> for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
> err = percpu_counter_init(&wb->stat[i], 0, GFP_KERNEL);
> if (err) {
> while (i--)
> percpu_counter_destroy(&wb->stat[i]);
> + fprop_local_destroy_percpu(&wb->completions);
> return err;
> }
> }
> @@ -442,12 +458,9 @@ static void bdi_wb_exit(struct bdi_writeback *wb)
>
> for (i = 0; i < NR_WB_STAT_ITEMS; i++)
> percpu_counter_destroy(&wb->stat[i]);
> -}
>
> -/*
> - * Initial write bandwidth: 100 MB/s
> - */
> -#define INIT_BW (100 << (20 - PAGE_SHIFT))
> + fprop_local_destroy_percpu(&wb->completions);
> +}
>
> int bdi_init(struct backing_dev_info *bdi)
> {
> @@ -466,22 +479,6 @@ int bdi_init(struct backing_dev_info *bdi)
> if (err)
> return err;
>
> - bdi->dirty_exceeded = 0;
> -
> - bdi->bw_time_stamp = jiffies;
> - bdi->written_stamp = 0;
> -
> - bdi->balanced_dirty_ratelimit = INIT_BW;
> - bdi->dirty_ratelimit = INIT_BW;
> - bdi->write_bandwidth = INIT_BW;
> - bdi->avg_write_bandwidth = INIT_BW;
> -
> - err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL);
> - if (err) {
> - bdi_wb_exit(&bdi->wb);
> - return err;
> - }
> -
> return 0;
> }
> EXPORT_SYMBOL(bdi_init);
> @@ -514,8 +511,6 @@ void bdi_destroy(struct backing_dev_info *bdi)
>
> bdi_unregister(bdi);
> bdi_wb_exit(&bdi->wb);
> -
> - fprop_local_destroy_percpu(&bdi->completions);
> }
> EXPORT_SYMBOL(bdi_destroy);
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 68fd72a..7c721b4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -399,7 +399,7 @@ static unsigned long wp_next_time(unsigned long cur_time)
> static inline void __wb_writeout_inc(struct bdi_writeback *wb)
> {
> __inc_wb_stat(wb, WB_WRITTEN);
> - __fprop_inc_percpu_max(&writeout_completions, &wb->bdi->completions,
> + __fprop_inc_percpu_max(&writeout_completions, &wb->completions,
> wb->bdi->max_prop_frac);
> /* First event after period switching was turned off? */
> if (!unlikely(writeout_period_time)) {
> @@ -427,10 +427,10 @@ EXPORT_SYMBOL_GPL(wb_writeout_inc);
> /*
> * Obtain an accurate fraction of the BDI's portion.
> */
> -static void bdi_writeout_fraction(struct backing_dev_info *bdi,
> - long *numerator, long *denominator)
> +static void wb_writeout_fraction(struct bdi_writeback *wb,
> + long *numerator, long *denominator)
> {
> - fprop_fraction_percpu(&writeout_completions, &bdi->completions,
> + fprop_fraction_percpu(&writeout_completions, &wb->completions,
> numerator, denominator);
> }
>
> @@ -516,11 +516,11 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> }
>
> /**
> - * bdi_dirty_limit - @bdi's share of dirty throttling threshold
> - * @bdi: the backing_dev_info to query
> + * wb_dirty_limit - @wb's share of dirty throttling threshold
> + * @wb: the bdi_writeback to query
> * @dirty: global dirty limit in pages
> *
> - * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
> + * Returns @wb's dirty limit in pages. The term "dirty" in the context of
> * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
> *
> * Note that balance_dirty_pages() will only seriously take it as a hard limit
> @@ -528,24 +528,25 @@ static unsigned long hard_dirty_limit(unsigned long thresh)
> * control. For example, when the device is completely stalled due to some error
> * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
> * In the other normal situations, it acts more gently by throttling the tasks
> - * more (rather than completely block them) when the bdi dirty pages go high.
> + * more (rather than completely block them) when the wb dirty pages go high.
> *
> * It allocates high/low dirty limits to fast/slow devices, in order to prevent
> * - starving fast devices
> * - piling up dirty pages (that will take long time to sync) on slow devices
> *
> - * The bdi's share of dirty limit will be adapting to its throughput and
> + * The wb's share of dirty limit will be adapting to its throughput and
> * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
> */
> -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
> +unsigned long wb_dirty_limit(struct bdi_writeback *wb, unsigned long dirty)
> {
> + struct backing_dev_info *bdi = wb->bdi;
> u64 bdi_dirty;
> long numerator, denominator;
>
> /*
> * Calculate this BDI's share of the dirty ratio.
> */
> - bdi_writeout_fraction(bdi, &numerator, &denominator);
> + wb_writeout_fraction(wb, &numerator, &denominator);
>
> bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100;
> bdi_dirty *= numerator;
> @@ -664,14 +665,14 @@ static long long pos_ratio_polynom(unsigned long setpoint,
> * card's bdi_dirty may rush to many times higher than bdi_setpoint.
> * - the bdi dirty thresh drops quickly due to change of JBOD workload
> */
> -static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> - unsigned long thresh,
> - unsigned long bg_thresh,
> - unsigned long dirty,
> - unsigned long bdi_thresh,
> - unsigned long bdi_dirty)
> +static unsigned long wb_position_ratio(struct bdi_writeback *wb,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty)
> {
> - unsigned long write_bw = bdi->avg_write_bandwidth;
> + unsigned long write_bw = wb->avg_write_bandwidth;
> unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> unsigned long limit = hard_dirty_limit(thresh);
> unsigned long x_intercept;
> @@ -702,12 +703,12 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> * consume arbitrary amount of RAM because it is accounted in
> * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
> *
> - * Here, in bdi_position_ratio(), we calculate pos_ratio based on
> + * Here, in wb_position_ratio(), we calculate pos_ratio based on
> * two values: bdi_dirty and bdi_thresh. Let's consider an example:
> * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
> * limits are set by default to 10% and 20% (background and throttle).
> * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
> - * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
> + * wb_dirty_limit(wb, bg_thresh) is about ~4K pages. bdi_setpoint is
> * about ~6K pages (as the average of background and throttle bdi
> * limits). The 3rd order polynomial will provide positive feedback if
> * bdi_dirty is under bdi_setpoint and vice versa.
> @@ -717,7 +718,7 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB
> * in the example above).
> */
> - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> long long bdi_pos_ratio;
> unsigned long bdi_bg_thresh;
>
> @@ -842,13 +843,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> return pos_ratio;
> }
>
> -static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> - unsigned long elapsed,
> - unsigned long written)
> +static void wb_update_write_bandwidth(struct bdi_writeback *wb,
> + unsigned long elapsed,
> + unsigned long written)
> {
> const unsigned long period = roundup_pow_of_two(3 * HZ);
> - unsigned long avg = bdi->avg_write_bandwidth;
> - unsigned long old = bdi->write_bandwidth;
> + unsigned long avg = wb->avg_write_bandwidth;
> + unsigned long old = wb->write_bandwidth;
> u64 bw;
>
> /*
> @@ -858,14 +859,14 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> * write_bandwidth = ---------------------------------------------------
> * period
> */
> - bw = written - bdi->written_stamp;
> + bw = written - wb->written_stamp;
> bw *= HZ;
> if (unlikely(elapsed > period)) {
> do_div(bw, elapsed);
> avg = bw;
> goto out;
> }
> - bw += (u64)bdi->write_bandwidth * (period - elapsed);
> + bw += (u64)wb->write_bandwidth * (period - elapsed);
> bw >>= ilog2(period);
>
> /*
> @@ -878,8 +879,8 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
> avg += (old - avg) >> 3;
>
> out:
> - bdi->write_bandwidth = bw;
> - bdi->avg_write_bandwidth = avg;
> + wb->write_bandwidth = bw;
> + wb->avg_write_bandwidth = avg;
> }
>
> /*
> @@ -944,20 +945,20 @@ static void global_update_bandwidth(unsigned long thresh,
> * Normal bdi tasks will be curbed at or below it in long term.
> * Obviously it should be around (write_bw / N) when there are N dd tasks.
> */
> -static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> - unsigned long thresh,
> - unsigned long bg_thresh,
> - unsigned long dirty,
> - unsigned long bdi_thresh,
> - unsigned long bdi_dirty,
> - unsigned long dirtied,
> - unsigned long elapsed)
> +static void wb_update_dirty_ratelimit(struct bdi_writeback *wb,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty,
> + unsigned long dirtied,
> + unsigned long elapsed)
> {
> unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> unsigned long limit = hard_dirty_limit(thresh);
> unsigned long setpoint = (freerun + limit) / 2;
> - unsigned long write_bw = bdi->avg_write_bandwidth;
> - unsigned long dirty_ratelimit = bdi->dirty_ratelimit;
> + unsigned long write_bw = wb->avg_write_bandwidth;
> + unsigned long dirty_ratelimit = wb->dirty_ratelimit;
> unsigned long dirty_rate;
> unsigned long task_ratelimit;
> unsigned long balanced_dirty_ratelimit;
> @@ -969,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> * The dirty rate will match the writeout rate in long term, except
> * when dirty pages are truncated by userspace or re-dirtied by FS.
> */
> - dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> + dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed;
>
> - pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
> - bdi_thresh, bdi_dirty);
> + pos_ratio = wb_position_ratio(wb, thresh, bg_thresh, dirty,
> + bdi_thresh, bdi_dirty);
> /*
> * task_ratelimit reflects each dd's dirty rate for the past 200ms.
> */
> @@ -1056,31 +1057,31 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>
> /*
> * For strictlimit case, calculations above were based on bdi counters
> - * and limits (starting from pos_ratio = bdi_position_ratio() and up to
> + * and limits (starting from pos_ratio = wb_position_ratio() and up to
> * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate).
> * Hence, to calculate "step" properly, we have to use bdi_dirty as
> * "dirty" and bdi_setpoint as "setpoint".
> *
> * We ramp up dirty_ratelimit forcibly if bdi_dirty is low because
> * it's possible that bdi_thresh is close to zero due to inactivity
> - * of backing device (see the implementation of bdi_dirty_limit()).
> + * of backing device (see the implementation of wb_dirty_limit()).
> */
> - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> dirty = bdi_dirty;
> if (bdi_dirty < 8)
> setpoint = bdi_dirty + 1;
> else
> setpoint = (bdi_thresh +
> - bdi_dirty_limit(bdi, bg_thresh)) / 2;
> + wb_dirty_limit(wb, bg_thresh)) / 2;
> }
>
> if (dirty < setpoint) {
> - x = min3(bdi->balanced_dirty_ratelimit,
> + x = min3(wb->balanced_dirty_ratelimit,
> balanced_dirty_ratelimit, task_ratelimit);
> if (dirty_ratelimit < x)
> step = x - dirty_ratelimit;
> } else {
> - x = max3(bdi->balanced_dirty_ratelimit,
> + x = max3(wb->balanced_dirty_ratelimit,
> balanced_dirty_ratelimit, task_ratelimit);
> if (dirty_ratelimit > x)
> step = dirty_ratelimit - x;
> @@ -1102,22 +1103,22 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> else
> dirty_ratelimit -= step;
>
> - bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL);
> - bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
> + wb->dirty_ratelimit = max(dirty_ratelimit, 1UL);
> + wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit;
>
> - trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit);
> + trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit);
> }
>
> -void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> - unsigned long thresh,
> - unsigned long bg_thresh,
> - unsigned long dirty,
> - unsigned long bdi_thresh,
> - unsigned long bdi_dirty,
> - unsigned long start_time)
> +void __wb_update_bandwidth(struct bdi_writeback *wb,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty,
> + unsigned long start_time)
> {
> unsigned long now = jiffies;
> - unsigned long elapsed = now - bdi->bw_time_stamp;
> + unsigned long elapsed = now - wb->bw_time_stamp;
> unsigned long dirtied;
> unsigned long written;
>
> @@ -1127,44 +1128,44 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> if (elapsed < BANDWIDTH_INTERVAL)
> return;
>
> - dirtied = percpu_counter_read(&bdi->wb.stat[WB_DIRTIED]);
> - written = percpu_counter_read(&bdi->wb.stat[WB_WRITTEN]);
> + dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]);
> + written = percpu_counter_read(&wb->stat[WB_WRITTEN]);
>
> /*
> * Skip quiet periods when disk bandwidth is under-utilized.
> * (at least 1s idle time between two flusher runs)
> */
> - if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> + if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time))
> goto snapshot;
>
> if (thresh) {
> global_update_bandwidth(thresh, dirty, now);
> - bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
> - bdi_thresh, bdi_dirty,
> - dirtied, elapsed);
> + wb_update_dirty_ratelimit(wb, thresh, bg_thresh, dirty,
> + bdi_thresh, bdi_dirty,
> + dirtied, elapsed);
> }
> - bdi_update_write_bandwidth(bdi, elapsed, written);
> + wb_update_write_bandwidth(wb, elapsed, written);
>
> snapshot:
> - bdi->dirtied_stamp = dirtied;
> - bdi->written_stamp = written;
> - bdi->bw_time_stamp = now;
> + wb->dirtied_stamp = dirtied;
> + wb->written_stamp = written;
> + wb->bw_time_stamp = now;
> }
>
> -static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> - unsigned long thresh,
> - unsigned long bg_thresh,
> - unsigned long dirty,
> - unsigned long bdi_thresh,
> - unsigned long bdi_dirty,
> - unsigned long start_time)
> +static void wb_update_bandwidth(struct bdi_writeback *wb,
> + unsigned long thresh,
> + unsigned long bg_thresh,
> + unsigned long dirty,
> + unsigned long bdi_thresh,
> + unsigned long bdi_dirty,
> + unsigned long start_time)
> {
> - if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
> + if (time_is_after_eq_jiffies(wb->bw_time_stamp + BANDWIDTH_INTERVAL))
> return;
> - spin_lock(&bdi->wb.list_lock);
> - __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> - bdi_thresh, bdi_dirty, start_time);
> - spin_unlock(&bdi->wb.list_lock);
> + spin_lock(&wb->list_lock);
> + __wb_update_bandwidth(wb, thresh, bg_thresh, dirty,
> + bdi_thresh, bdi_dirty, start_time);
> + spin_unlock(&wb->list_lock);
> }
>
> /*
> @@ -1184,10 +1185,10 @@ static unsigned long dirty_poll_interval(unsigned long dirty,
> return 1;
> }
>
> -static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
> - unsigned long bdi_dirty)
> +static unsigned long wb_max_pause(struct bdi_writeback *wb,
> + unsigned long bdi_dirty)
> {
> - unsigned long bw = bdi->avg_write_bandwidth;
> + unsigned long bw = wb->avg_write_bandwidth;
> unsigned long t;
>
> /*
> @@ -1203,14 +1204,14 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
> return min_t(unsigned long, t, MAX_PAUSE);
> }
>
> -static long bdi_min_pause(struct backing_dev_info *bdi,
> - long max_pause,
> - unsigned long task_ratelimit,
> - unsigned long dirty_ratelimit,
> - int *nr_dirtied_pause)
> +static long wb_min_pause(struct bdi_writeback *wb,
> + long max_pause,
> + unsigned long task_ratelimit,
> + unsigned long dirty_ratelimit,
> + int *nr_dirtied_pause)
> {
> - long hi = ilog2(bdi->avg_write_bandwidth);
> - long lo = ilog2(bdi->dirty_ratelimit);
> + long hi = ilog2(wb->avg_write_bandwidth);
> + long lo = ilog2(wb->dirty_ratelimit);
> long t; /* target pause */
> long pause; /* estimated next pause */
> int pages; /* target nr_dirtied_pause */
> @@ -1278,14 +1279,13 @@ static long bdi_min_pause(struct backing_dev_info *bdi,
> return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t;
> }
>
> -static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
> - unsigned long dirty_thresh,
> - unsigned long background_thresh,
> - unsigned long *bdi_dirty,
> - unsigned long *bdi_thresh,
> - unsigned long *bdi_bg_thresh)
> +static inline void wb_dirty_limits(struct bdi_writeback *wb,
> + unsigned long dirty_thresh,
> + unsigned long background_thresh,
> + unsigned long *bdi_dirty,
> + unsigned long *bdi_thresh,
> + unsigned long *bdi_bg_thresh)
> {
> - struct bdi_writeback *wb = &bdi->wb;
> unsigned long wb_reclaimable;
>
> /*
> @@ -1298,10 +1298,10 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi,
> * In this case we don't want to hard throttle the USB key
> * dirtiers for 100 seconds until bdi_dirty drops under
> * bdi_thresh. Instead the auxiliary bdi control line in
> - * bdi_position_ratio() will let the dirtier task progress
> + * wb_position_ratio() will let the dirtier task progress
> * at some rate <= (write_bw / 2) for bringing down bdi_dirty.
> */
> - *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> + *bdi_thresh = wb_dirty_limit(wb, dirty_thresh);
>
> if (bdi_bg_thresh)
> *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh *
> @@ -1351,6 +1351,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> unsigned long dirty_ratelimit;
> unsigned long pos_ratio;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct bdi_writeback *wb = &bdi->wb;
> bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT;
> unsigned long start_time = jiffies;
>
> @@ -1375,8 +1376,8 @@ static void balance_dirty_pages(struct address_space *mapping,
> global_dirty_limits(&background_thresh, &dirty_thresh);
>
> if (unlikely(strictlimit)) {
> - bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
> - &bdi_dirty, &bdi_thresh, &bg_thresh);
> + wb_dirty_limits(wb, dirty_thresh, background_thresh,
> + &bdi_dirty, &bdi_thresh, &bg_thresh);
>
> dirty = bdi_dirty;
> thresh = bdi_thresh;
> @@ -1407,28 +1408,28 @@ static void balance_dirty_pages(struct address_space *mapping,
> bdi_start_background_writeback(bdi);
>
> if (!strictlimit)
> - bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
> - &bdi_dirty, &bdi_thresh, NULL);
> + wb_dirty_limits(wb, dirty_thresh, background_thresh,
> + &bdi_dirty, &bdi_thresh, NULL);
>
> dirty_exceeded = (bdi_dirty > bdi_thresh) &&
> ((nr_dirty > dirty_thresh) || strictlimit);
> - if (dirty_exceeded && !bdi->dirty_exceeded)
> - bdi->dirty_exceeded = 1;
> + if (dirty_exceeded && !wb->dirty_exceeded)
> + wb->dirty_exceeded = 1;
>
> - bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> - nr_dirty, bdi_thresh, bdi_dirty,
> - start_time);
> + wb_update_bandwidth(wb, dirty_thresh, background_thresh,
> + nr_dirty, bdi_thresh, bdi_dirty,
> + start_time);
>
> - dirty_ratelimit = bdi->dirty_ratelimit;
> - pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> - background_thresh, nr_dirty,
> - bdi_thresh, bdi_dirty);
> + dirty_ratelimit = wb->dirty_ratelimit;
> + pos_ratio = wb_position_ratio(wb, dirty_thresh,
> + background_thresh, nr_dirty,
> + bdi_thresh, bdi_dirty);
> task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
> RATELIMIT_CALC_SHIFT;
> - max_pause = bdi_max_pause(bdi, bdi_dirty);
> - min_pause = bdi_min_pause(bdi, max_pause,
> - task_ratelimit, dirty_ratelimit,
> - &nr_dirtied_pause);
> + max_pause = wb_max_pause(wb, bdi_dirty);
> + min_pause = wb_min_pause(wb, max_pause,
> + task_ratelimit, dirty_ratelimit,
> + &nr_dirtied_pause);
>
> if (unlikely(task_ratelimit == 0)) {
> period = max_pause;
> @@ -1512,15 +1513,15 @@ pause:
> * more page. However bdi_dirty has accounting errors. So use
> * the larger and more IO friendly wb_stat_error.
> */
> - if (bdi_dirty <= wb_stat_error(&bdi->wb))
> + if (bdi_dirty <= wb_stat_error(wb))
> break;
>
> if (fatal_signal_pending(current))
> break;
> }
>
> - if (!dirty_exceeded && bdi->dirty_exceeded)
> - bdi->dirty_exceeded = 0;
> + if (!dirty_exceeded && wb->dirty_exceeded)
> + wb->dirty_exceeded = 0;
>
> if (writeback_in_progress(bdi))
> return;
> @@ -1584,6 +1585,7 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0;
> void balance_dirty_pages_ratelimited(struct address_space *mapping)
> {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> + struct bdi_writeback *wb = &bdi->wb;
> int ratelimit;
> int *p;
>
> @@ -1591,7 +1593,7 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping)
> return;
>
> ratelimit = current->nr_dirtied_pause;
> - if (bdi->dirty_exceeded)
> + if (wb->dirty_exceeded)
> ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));
>
> preempt_disable();
> --
> 1.9.3
>
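As an aside for anyone checking the math: the estimator itself is
untouched by the rename. A standalone C sketch of the period-weighted
average it computes (HZ, the names and the sample values here are
illustrative, not from the patch):

	#include <stdio.h>

	#define HZ	100
	#define PERIOD	512	/* roundup_pow_of_two(3 * HZ) */

	/* pages written since the last snapshot, elapsed in jiffies */
	static unsigned long est_write_bw(unsigned long old_bw,
					  unsigned long written,
					  unsigned long elapsed)
	{
		unsigned long long bw = (unsigned long long)written * HZ;

		if (elapsed > PERIOD)	/* long quiet period: plain average */
			return bw / elapsed;

		/* blend the sample with the old estimate, weighted by time */
		bw += (unsigned long long)old_bw * (PERIOD - elapsed);
		return bw / PERIOD;	/* the kernel shifts by ilog2(period) */
	}

	int main(void)
	{
		/* old estimate 25600 pages/s, 1000 pages in 200 jiffies */
		printf("%lu pages/s\n", est_write_bw(25600, 1000, 200));
		return 0;
	}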
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 16:01:34

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 04/10] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

On Tue 18-11-14 03:37:22, Tejun Heo wrote:
> Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
> and the role of the separation is unclear. For cgroup support for
> writeback IOs, a bdi will be updated to host multiple wb's where each
> wb serves writeback IOs of a different cgroup on the bdi. To achieve
> that, a wb should carry all states necessary for servicing writeback
> IOs for a cgroup independently.
>
> This patch moves bdi->wb_lock and ->worklist into wb.
>
> * The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
> moving, rename it to wb->work_lock as wb->wb_lock is confusing.
> Also, move wb->dwork downwards so that it's colocated with the new
> ->work_lock and ->work_list fields.
>
> * bdi_writeback_workfn() -> wb_workfn()
> bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
> bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
> bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
> __bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
> get_next_work_item(bdi) -> get_next_work_item(wb)
>
> * bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
> The function contained parts which belong to the containing bdi
> rather than the wb itself - testing cap_writeback_dirty and
> bdi_remove_from_list() invocation. Those are moved to
> bdi_unregister().
>
> * bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
> Initializations of the moved bdi->wb_lock and ->work_list are
> relocated from bdi_init() to wb_init().
>
> * As there's still only one bdi_writeback per backing_dev_info, all
> uses of bdi->state are mechanically replaced with bdi->wb.state
> introducing no behavior changes.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Wu Fengguang <[email protected]>
Does this mean you want to have per-device, per-cgroup flusher workqueues?
Otherwise this change doesn't make sense...

Honza

> ---
> fs/fs-writeback.c | 84 +++++++++++++++++++++------------------------
> include/linux/backing-dev.h | 12 +++----
> mm/backing-dev.c | 64 +++++++++++++++++-----------------
> 3 files changed, 77 insertions(+), 83 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index daa91ae..41c9f1e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -91,34 +91,33 @@ static inline struct inode *wb_inode(struct list_head *head)
>
> EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
>
> -static void bdi_wakeup_thread(struct backing_dev_info *bdi)
> +static void wb_wakeup(struct bdi_writeback *wb)
> {
> - spin_lock_bh(&bdi->wb_lock);
> - if (test_bit(WB_registered, &bdi->wb.state))
> - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
> - spin_unlock_bh(&bdi->wb_lock);
> + spin_lock_bh(&wb->work_lock);
> + if (test_bit(WB_registered, &wb->state))
> + mod_delayed_work(bdi_wq, &wb->dwork, 0);
> + spin_unlock_bh(&wb->work_lock);
> }
> -static void bdi_queue_work(struct backing_dev_info *bdi,
> - struct wb_writeback_work *work)
> +static void wb_queue_work(struct bdi_writeback *wb,
> + struct wb_writeback_work *work)
> {
> - trace_writeback_queue(bdi, work);
> + trace_writeback_queue(wb->bdi, work);
>
> - spin_lock_bh(&bdi->wb_lock);
> - if (!test_bit(WB_registered, &bdi->wb.state)) {
> + spin_lock_bh(&wb->work_lock);
> + if (!test_bit(WB_registered, &wb->state)) {
> if (work->done)
> complete(work->done);
> goto out_unlock;
> }
> - list_add_tail(&work->list, &bdi->work_list);
> - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
> + list_add_tail(&work->list, &wb->work_list);
> + mod_delayed_work(bdi_wq, &wb->dwork, 0);
> out_unlock:
> - spin_unlock_bh(&bdi->wb_lock);
> + spin_unlock_bh(&wb->work_lock);
> }
>
> -static void
> -__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> - bool range_cyclic, enum wb_reason reason)
> +static void __wb_start_writeback(struct bdi_writeback *wb, long nr_pages,
> + bool range_cyclic, enum wb_reason reason)
> {
> struct wb_writeback_work *work;
>
> @@ -128,8 +127,8 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> */
> work = kzalloc(sizeof(*work), GFP_ATOMIC);
> if (!work) {
> - trace_writeback_nowork(bdi);
> - bdi_wakeup_thread(bdi);
> + trace_writeback_nowork(wb->bdi);
> + wb_wakeup(wb);
> return;
> }
>
> @@ -138,7 +137,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> work->range_cyclic = range_cyclic;
> work->reason = reason;
>
> - bdi_queue_work(bdi, work);
> + wb_queue_work(wb, work);
> }
>
> /**
> @@ -156,7 +155,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> enum wb_reason reason)
> {
> - __bdi_start_writeback(bdi, nr_pages, true, reason);
> + __wb_start_writeback(&bdi->wb, nr_pages, true, reason);
> }
>
> /**
> @@ -176,7 +175,7 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi)
> * writeback as soon as there is no other work to do.
> */
> trace_writeback_wake_background(bdi);
> - bdi_wakeup_thread(bdi);
> + wb_wakeup(&bdi->wb);
> }
>
> /*
> @@ -848,7 +847,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> * after the other works are all done.
> */
> if ((work->for_background || work->for_kupdate) &&
> - !list_empty(&wb->bdi->work_list))
> + !list_empty(&wb->work_list))
> break;
>
> /*
> @@ -919,18 +918,17 @@ static long wb_writeback(struct bdi_writeback *wb,
> /*
> * Return the next wb_writeback_work struct that hasn't been processed yet.
> */
> -static struct wb_writeback_work *
> -get_next_work_item(struct backing_dev_info *bdi)
> +static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb)
> {
> struct wb_writeback_work *work = NULL;
>
> - spin_lock_bh(&bdi->wb_lock);
> - if (!list_empty(&bdi->work_list)) {
> - work = list_entry(bdi->work_list.next,
> + spin_lock_bh(&wb->work_lock);
> + if (!list_empty(&wb->work_list)) {
> + work = list_entry(wb->work_list.next,
> struct wb_writeback_work, list);
> list_del_init(&work->list);
> }
> - spin_unlock_bh(&bdi->wb_lock);
> + spin_unlock_bh(&wb->work_lock);
> return work;
> }
>
> @@ -1002,14 +1000,13 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb)
> */
> static long wb_do_writeback(struct bdi_writeback *wb)
> {
> - struct backing_dev_info *bdi = wb->bdi;
> struct wb_writeback_work *work;
> long wrote = 0;
>
> set_bit(WB_writeback_running, &wb->state);
> - while ((work = get_next_work_item(bdi)) != NULL) {
> + while ((work = get_next_work_item(wb)) != NULL) {
>
> - trace_writeback_exec(bdi, work);
> + trace_writeback_exec(wb->bdi, work);
>
> wrote += wb_writeback(wb, work);
>
> @@ -1037,43 +1034,42 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> * Handle writeback of dirty data for the device backed by this bdi. Also
> * reschedules periodically and does kupdated style flushing.
> */
> -void bdi_writeback_workfn(struct work_struct *work)
> +void wb_workfn(struct work_struct *work)
> {
> struct bdi_writeback *wb = container_of(to_delayed_work(work),
> struct bdi_writeback, dwork);
> - struct backing_dev_info *bdi = wb->bdi;
> long pages_written;
>
> - set_worker_desc("flush-%s", dev_name(bdi->dev));
> + set_worker_desc("flush-%s", dev_name(wb->bdi->dev));
> current->flags |= PF_SWAPWRITE;
>
> if (likely(!current_is_workqueue_rescuer() ||
> !test_bit(WB_registered, &wb->state))) {
> /*
> - * The normal path. Keep writing back @bdi until its
> + * The normal path. Keep writing back @wb until its
> * work_list is empty. Note that this path is also taken
> - * if @bdi is shutting down even when we're running off the
> + * if @wb is shutting down even when we're running off the
> * rescuer as work_list needs to be drained.
> */
> do {
> pages_written = wb_do_writeback(wb);
> trace_writeback_pages_written(pages_written);
> - } while (!list_empty(&bdi->work_list));
> + } while (!list_empty(&wb->work_list));
> } else {
> /*
> * bdi_wq can't get enough workers and we're running off
> * the emergency worker. Don't hog it. Hopefully, 1024 is
> * enough for efficient IO.
> */
> - pages_written = writeback_inodes_wb(&bdi->wb, 1024,
> + pages_written = writeback_inodes_wb(wb, 1024,
> WB_REASON_FORKER_THREAD);
> trace_writeback_pages_written(pages_written);
> }
>
> - if (!list_empty(&bdi->work_list))
> + if (!list_empty(&wb->work_list))
> mod_delayed_work(bdi_wq, &wb->dwork, 0);
> else if (wb_has_dirty_io(wb) && dirty_writeback_interval)
> - bdi_wakeup_thread_delayed(bdi);
> + wb_wakeup_delayed(wb);
>
> current->flags &= ~PF_SWAPWRITE;
> }
> @@ -1093,7 +1089,7 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
> list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
> if (!bdi_has_dirty_io(bdi))
> continue;
> - __bdi_start_writeback(bdi, nr_pages, false, reason);
> + __wb_start_writeback(&bdi->wb, nr_pages, false, reason);
> }
> rcu_read_unlock();
> }
> @@ -1228,7 +1224,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> spin_unlock(&bdi->wb.list_lock);
>
> if (wakeup_bdi)
> - bdi_wakeup_thread_delayed(bdi);
> + wb_wakeup_delayed(&bdi->wb);
> return;
> }
> }
> @@ -1318,7 +1314,7 @@ void writeback_inodes_sb_nr(struct super_block *sb,
> if (sb->s_bdi == &noop_backing_dev_info)
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
> - bdi_queue_work(sb->s_bdi, &work);
> + wb_queue_work(&sb->s_bdi->wb, &work);
> wait_for_completion(&done);
> }
> EXPORT_SYMBOL(writeback_inodes_sb_nr);
> @@ -1402,7 +1398,7 @@ void sync_inodes_sb(struct super_block *sb)
> return;
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - bdi_queue_work(sb->s_bdi, &work);
> + wb_queue_work(&sb->s_bdi->wb, &work);
> wait_for_completion(&done);
>
> wait_sb_inodes(sb);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index a077a8d..6aba0d3 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -52,7 +52,6 @@ struct bdi_writeback {
> unsigned long state; /* Always use atomic bitops on this */
> unsigned long last_old_flush; /* last old data flush */
>
> - struct delayed_work dwork; /* work item used for writeback */
> struct list_head b_dirty; /* dirty inodes */
> struct list_head b_io; /* parked for writeback */
> struct list_head b_more_io; /* parked for more writeback */
> @@ -77,6 +76,10 @@ struct bdi_writeback {
>
> struct fprop_local_percpu completions;
> int dirty_exceeded;
> +
> + spinlock_t work_lock; /* protects work_list & dwork scheduling */
> + struct list_head work_list;
> + struct delayed_work dwork; /* work item used for writeback */
> };
>
> struct backing_dev_info {
> @@ -92,9 +95,6 @@ struct backing_dev_info {
> unsigned int max_ratio, max_prop_frac;
>
> struct bdi_writeback wb; /* default writeback info for this bdi */
> - spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */
> -
> - struct list_head work_list;
>
> struct device *dev;
>
> @@ -118,9 +118,9 @@ int __must_check bdi_setup_and_register(struct backing_dev_info *, char *, unsig
> void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
> enum wb_reason reason);
> void bdi_start_background_writeback(struct backing_dev_info *bdi);
> -void bdi_writeback_workfn(struct work_struct *work);
> +void wb_workfn(struct work_struct *work);
> int bdi_has_dirty_io(struct backing_dev_info *bdi);
> -void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> +void wb_wakeup_delayed(struct bdi_writeback *wb);
>
> extern spinlock_t bdi_lock;
> extern struct list_head bdi_list;
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7b9b10e..4904456 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -278,7 +278,7 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
> }
>
> /*
> - * This function is used when the first inode for this bdi is marked dirty. It
> + * This function is used when the first inode for this wb is marked dirty. It
> * wakes up the corresponding bdi thread which should then take care of the
> * periodic background write-out of dirty inodes. Since the write-out would
> * start only 'dirty_writeback_interval' centisecs from now anyway, we just
> @@ -291,15 +291,15 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi)
> * We have to be careful not to postpone flush work if it is scheduled for
> * earlier. Thus we use queue_delayed_work().
> */
> -void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
> +void wb_wakeup_delayed(struct bdi_writeback *wb)
> {
> unsigned long timeout;
>
> timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
> - spin_lock_bh(&bdi->wb_lock);
> - if (test_bit(WB_registered, &bdi->wb.state))
> - queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
> - spin_unlock_bh(&bdi->wb_lock);
> + spin_lock_bh(&wb->work_lock);
> + if (test_bit(WB_registered, &wb->state))
> + queue_delayed_work(bdi_wq, &wb->dwork, timeout);
> + spin_unlock_bh(&wb->work_lock);
> }
>
> /*
> @@ -352,30 +352,22 @@ EXPORT_SYMBOL(bdi_register_dev);
> /*
> * Remove bdi from the global list and shutdown any threads we have running
> */
> -static void bdi_wb_shutdown(struct backing_dev_info *bdi)
> +static void wb_shutdown(struct bdi_writeback *wb)
> {
> - if (!bdi_cap_writeback_dirty(bdi))
> - return;
> -
> - /*
> - * Make sure nobody finds us on the bdi_list anymore
> - */
> - bdi_remove_from_list(bdi);
> -
> /* Make sure nobody queues further work */
> - spin_lock_bh(&bdi->wb_lock);
> - clear_bit(WB_registered, &bdi->wb.state);
> - spin_unlock_bh(&bdi->wb_lock);
> + spin_lock_bh(&wb->work_lock);
> + clear_bit(WB_registered, &wb->state);
> + spin_unlock_bh(&wb->work_lock);
>
> /*
> - * Drain work list and shutdown the delayed_work. At this point,
> - * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi
> - * is dying and its work_list needs to be drained no matter what.
> + * Drain work list and shutdown the delayed_work. !WB_registered
> + * tells wb_workfn() that @wb is dying and its work_list needs to
> + * be drained no matter what.
> */
> - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
> - flush_delayed_work(&bdi->wb.dwork);
> - WARN_ON(!list_empty(&bdi->work_list));
> - WARN_ON(delayed_work_pending(&bdi->wb.dwork));
> + mod_delayed_work(bdi_wq, &wb->dwork, 0);
> + flush_delayed_work(&wb->dwork);
> + WARN_ON(!list_empty(&wb->work_list));
> + WARN_ON(delayed_work_pending(&wb->dwork));
> }
>
> /*
> @@ -400,7 +392,12 @@ void bdi_unregister(struct backing_dev_info *bdi)
> trace_writeback_bdi_unregister(bdi);
> bdi_prune_sb(bdi);
>
> - bdi_wb_shutdown(bdi);
> + if (bdi_cap_writeback_dirty(bdi)) {
> + /* make sure nobody finds us on the bdi_list anymore */
> + bdi_remove_from_list(bdi);
> + wb_shutdown(&bdi->wb);
> + }
> +
> bdi_debug_unregister(bdi);
> device_unregister(bdi->dev);
> bdi->dev = NULL;
> @@ -413,7 +410,7 @@ EXPORT_SYMBOL(bdi_unregister);
> */
> #define INIT_BW (100 << (20 - PAGE_SHIFT))
>
> -static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> +static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> {
> int i, err;
>
> @@ -425,7 +422,6 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> INIT_LIST_HEAD(&wb->b_io);
> INIT_LIST_HEAD(&wb->b_more_io);
> spin_lock_init(&wb->list_lock);
> - INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
>
> wb->bw_time_stamp = jiffies;
> wb->balanced_dirty_ratelimit = INIT_BW;
> @@ -433,6 +429,10 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> wb->write_bandwidth = INIT_BW;
> wb->avg_write_bandwidth = INIT_BW;
>
> + spin_lock_init(&wb->work_lock);
> + INIT_LIST_HEAD(&wb->work_list);
> + INIT_DELAYED_WORK(&wb->dwork, wb_workfn);
> +
> err = fprop_local_init_percpu(&wb->completions, GFP_KERNEL);
> if (err)
> return err;
> @@ -450,7 +450,7 @@ static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
> return 0;
> }
>
> -static void bdi_wb_exit(struct bdi_writeback *wb)
> +static void wb_exit(struct bdi_writeback *wb)
> {
> int i;
>
> @@ -471,11 +471,9 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->min_ratio = 0;
> bdi->max_ratio = 100;
> bdi->max_prop_frac = FPROP_FRAC_BASE;
> - spin_lock_init(&bdi->wb_lock);
> INIT_LIST_HEAD(&bdi->bdi_list);
> - INIT_LIST_HEAD(&bdi->work_list);
>
> - err = bdi_wb_init(&bdi->wb, bdi);
> + err = wb_init(&bdi->wb, bdi);
> if (err)
> return err;
>
> @@ -510,7 +508,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
> }
>
> bdi_unregister(bdi);
> - bdi_wb_exit(&bdi->wb);
> + wb_exit(&bdi->wb);
> }
> EXPORT_SYMBOL(bdi_destroy);
>
> --
> 1.9.3
>
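One thing worth calling out for reviewers: the WB_registered handshake
survives the move unchanged, which is what makes the shutdown safe. A
toy pthread rendition of the protocol (names invented, obviously not
kernel code):

	#include <pthread.h>
	#include <stdbool.h>

	struct toy_wb {			/* stand-in for bdi_writeback */
		pthread_mutex_t work_lock; /* PTHREAD_MUTEX_INITIALIZER */
		bool registered;
		int queued;		/* stand-in for work_list */
	};

	/* wb_queue_work(): queue only while registered, under the lock */
	static bool toy_queue_work(struct toy_wb *wb)
	{
		bool ok;

		pthread_mutex_lock(&wb->work_lock);
		ok = wb->registered;
		if (ok)
			wb->queued++;
		pthread_mutex_unlock(&wb->work_lock);
		return ok;	/* on false the caller completes the work */
	}

	/* wb_shutdown(): clear the flag under the same lock, then drain */
	static void toy_shutdown(struct toy_wb *wb)
	{
		pthread_mutex_lock(&wb->work_lock);
		wb->registered = false;	/* no new work past this point */
		pthread_mutex_unlock(&wb->work_lock);

		pthread_mutex_lock(&wb->work_lock);
		wb->queued = 0;		/* models flush_delayed_work() */
		pthread_mutex_unlock(&wb->work_lock);
	}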
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 16:10:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 04/10] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

On Thu 20-11-14 11:02:49, Tejun Heo wrote:
> On Thu, Nov 20, 2014 at 05:01:28PM +0100, Jan Kara wrote:
> ...
> > Does this mean you want to have per-device, per-cgroup flusher workqueues?
> > Otherwise this change doesn't make sense...
>
> Not workqueues but yes separate work items per device-cgroup combo.
> There's no way around it.
Ah, right. I was confused. So you can add:
Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-11-20 16:02:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 03/10] writeback: move bandwidth related fields from backing_dev_info into bdi_writeback

On Thu, Nov 20, 2014 at 04:52:53PM +0100, Jan Kara wrote:
> On Tue 18-11-14 03:37:21, Tejun Heo wrote:
...
> Ok, no problem with this patch but I wonder: when you are moving all the
> dirty limiting logic to bdi_writeback, how do you plan to interpret
> min/max_ratio in the presence of several bdi_writeback structures?

I think the current code is botching it, but the basic idea is to
distribute all bw controls according to the portion that the specific
wb takes out of the whole bdi, i.e. keep a sum of avg_bw across all
active wbs and distribute the bw params according to each wb's share
of it.
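
Something like the following completely untested sketch; wb_share()
and the tot_avg_bw bookkeeping are made up here, nothing in this
series implements it yet:

	/* scale a bdi-wide param by this wb's share of the bdi's
	 * total average write bandwidth (div64_u64 from math64.h) */
	static unsigned long wb_share(unsigned long bdi_param,
				      unsigned long wb_avg_bw,
				      unsigned long tot_avg_bw)
	{
		if (!tot_avg_bw)
			return bdi_param; /* lone/idle wb keeps it all */
		return div64_u64((u64)bdi_param * wb_avg_bw, tot_avg_bw);
	}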

Thanks.

--
tejun

2014-11-20 16:02:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 04/10] writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback

On Thu, Nov 20, 2014 at 05:01:28PM +0100, Jan Kara wrote:
...
> Does this mean you want to have per-device, per-cgroup flusher workqueues?
> Otherwise this change doesn't make sense...

Not workqueues but yes separate work items per device-cgroup combo.
There's no way around it.
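IOW, one shared bdi_wq but one dwork per wb, so waking up a whole bdi
would end up looking roughly like this (wb_list and bdi_node are
hypothetical, nothing like this exists yet):

	static void bdi_wakeup_all_wbs(struct backing_dev_info *bdi)
	{
		struct bdi_writeback *wb;

		/* walk every wb hosted by this bdi, one work item each */
		list_for_each_entry(wb, &bdi->wb_list, bdi_node)
			mod_delayed_work(bdi_wq, &wb->dwork, 0);
	}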

Thanks.

--
tejun

2014-11-20 16:21:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCHSET block/for-next] writeback: prepare for cgroup writeback support

On Thu, Nov 20, 2014 at 07:44:58AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 20, 2014 at 10:14:56AM -0500, Tejun Heo wrote:
> > > PS: I've added CC to linux-fsdevel since there's high chance people miss
> > > these patches in lkml...
> >
> > Will do so when posting the actual series.
>
> Please send the prep patches to fsdevel and linux-mm. Without that I'll
> auto-NAK them :)

Alright, I'll re-send these w/ Jan's acks added and fsdevel and
linux-mm cc'd when posting the actual patchset.

Thanks.

--
tejun

2014-11-25 01:20:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 01/10] writeback: move backing_dev_info->state into bdi_writeback

On Thu, 20 Nov 2014 16:27:02 +0100 Jan Kara <[email protected]> wrote:

> On Tue 18-11-14 03:37:19, Tejun Heo wrote:
> > Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
> > and the role of the separation is unclear. For cgroup support for
> > writeback IOs, a bdi will be updated to host multiple wb's where each
> > wb serves writeback IOs of a different cgroup on the bdi. To achieve
> > that, a wb should carry all states necessary for servicing writeback
> > IOs for a cgroup independently.
> >
> > This patch moves bdi->state into wb.
> >
> > * enum bdi_state is renamed to wb_state and the prefix of all enums is
> > changed from BDI_ to WB_.
> >
> > * Explicit zeroing of bdi->state is removed without adding zeroing of
> > wb->state as the whole data structure is zeroed on init anyway.
> >
> > * As there's still only one bdi_writeback per backing_dev_info, all
> > uses of bdi->state are mechanically replaced with bdi->wb.state
> > introducing no behavior changes.
> Hum, does it make sense to convert BDI_sync_congested and
> BDI_async_congested? They carry information about whether the *device* is
> congested and cannot take more work.

I think the "congested" concept really applies to a "queue", more than a
"device" ... though devices often have internal and external queues so the
concepts blur.

I think the best operational definition would be:
"congested" - an attempt to add a request to the queue might block
"not congested" - an attempt to add a request to the queue will not block

Given that, I would much rather do away with the "congested" flag (in the
interface) and allow submit_bio() to be either blocking or non-blocking.
That way you avoid races ("Why did you block, you weren't congested when I
checked moments ago!!??). But that is getting a bit off-topic.
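
Before dropping it though, to make the race concrete: today's callers
are stuck with polling, something like the function below (the
function itself is made up, though bdi_write_congested(),
congestion_wait() and submit_bio() are the real interfaces):

	static void submit_when_uncongested(struct backing_dev_info *bdi,
					    struct bio *bio)
	{
		while (bdi_write_congested(bdi))
			congestion_wait(BLK_RW_ASYNC, HZ / 10);

		/* window: the queue may congest again before this call */
		submit_bio(WRITE, bio);
	}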

As 'struct bdi_writeback' contains some lists of inodes which need to be
written, it is a bit like a queue and so could reasonably be "contested" or
not.

Certainly different cgroups would expect different block/don't block
behaviours for the same bdi, so keeping this flag per-cgroup instead of
per-device makes some sense.

NeilBrown



> I understand that in a cgroup world
> you want to throttle IO from a cgroup to a device, so when you make
> bdi_writeback a per-cgroup structure you want some indication there
> that a particular cgroup cannot push more to the device. But is it the
> case that e.g. mdraid cares about a cgroup and not about the device?
>
> Honza
> >
> > Signed-off-by: Tejun Heo <[email protected]>
> > Cc: Jens Axboe <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Cc: Wu Fengguang <[email protected]>
> > Cc: [email protected]
> > Cc: Neil Brown <[email protected]>
> > Cc: Alasdair Kergon <[email protected]>
> > Cc: Mike Snitzer <[email protected]>
> > ---
> > block/blk-core.c | 1 -
> > drivers/block/drbd/drbd_main.c | 10 +++++-----
> > drivers/md/dm.c | 2 +-
> > drivers/md/raid1.c | 4 ++--
> > drivers/md/raid10.c | 2 +-
> > fs/fs-writeback.c | 14 +++++++-------
> > include/linux/backing-dev.h | 24 ++++++++++++------------
> > mm/backing-dev.c | 21 ++++++++++-----------
> > 8 files changed, 38 insertions(+), 40 deletions(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 0421b53..8801682 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -584,7 +584,6 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
> >
> > q->backing_dev_info.ra_pages =
> > (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
> > - q->backing_dev_info.state = 0;
> > q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
> > q->backing_dev_info.name = "block";
> > q->node = node_id;
> > diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
> > index 1fc8342..61b00aa 100644
> > --- a/drivers/block/drbd/drbd_main.c
> > +++ b/drivers/block/drbd/drbd_main.c
> > @@ -2360,7 +2360,7 @@ static void drbd_cleanup(void)
> > * @congested_data: User data
> > * @bdi_bits: Bits the BDI flusher thread is currently interested in
> > *
> > - * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested.
> > + * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested.
> > */
> > static int drbd_congested(void *congested_data, int bdi_bits)
> > {
> > @@ -2377,14 +2377,14 @@ static int drbd_congested(void *congested_data, int bdi_bits)
> > }
> >
> > if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) {
> > - r |= (1 << BDI_async_congested);
> > + r |= (1 << WB_async_congested);
> > /* Without good local data, we would need to read from remote,
> > * and that would need the worker thread as well, which is
> > * currently blocked waiting for that usermode helper to
> > * finish.
> > */
> > if (!get_ldev_if_state(device, D_UP_TO_DATE))
> > - r |= (1 << BDI_sync_congested);
> > + r |= (1 << WB_sync_congested);
> > else
> > put_ldev(device);
> > r &= bdi_bits;
> > @@ -2400,9 +2400,9 @@ static int drbd_congested(void *congested_data, int bdi_bits)
> > reason = 'b';
> > }
> >
> > - if (bdi_bits & (1 << BDI_async_congested) &&
> > + if (bdi_bits & (1 << WB_async_congested) &&
> > test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) {
> > - r |= (1 << BDI_async_congested);
> > + r |= (1 << WB_async_congested);
> > reason = reason == 'b' ? 'a' : 'n';
> > }
> >
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 58f3927..c4c53af 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -1950,7 +1950,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits)
> > * the query about congestion status of request_queue
> > */
> > if (dm_request_based(md))
> > - r = md->queue->backing_dev_info.state &
> > + r = md->queue->backing_dev_info.wb.state &
> > bdi_bits;
> > else
> > r = dm_table_any_congested(map, bdi_bits);
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index 40b35be..aad1482 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -739,7 +739,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
> > struct r1conf *conf = mddev->private;
> > int i, ret = 0;
> >
> > - if ((bits & (1 << BDI_async_congested)) &&
> > + if ((bits & (1 << WB_async_congested)) &&
> > conf->pending_count >= max_queued_requests)
> > return 1;
> >
> > @@ -754,7 +754,7 @@ int md_raid1_congested(struct mddev *mddev, int bits)
> > /* Note the '|| 1' - when read_balance prefers
> > * non-congested targets, it can be removed
> > */
> > - if ((bits & (1<<BDI_async_congested)) || 1)
> > + if ((bits & (1<<WB_async_congested)) || 1)
> > ret |= bdi_congested(&q->backing_dev_info, bits);
> > else
> > ret &= bdi_congested(&q->backing_dev_info, bits);
> > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > index 32e282f..5180e75 100644
> > --- a/drivers/md/raid10.c
> > +++ b/drivers/md/raid10.c
> > @@ -915,7 +915,7 @@ int md_raid10_congested(struct mddev *mddev, int bits)
> > struct r10conf *conf = mddev->private;
> > int i, ret = 0;
> >
> > - if ((bits & (1 << BDI_async_congested)) &&
> > + if ((bits & (1 << WB_async_congested)) &&
> > conf->pending_count >= max_queued_requests)
> > return 1;
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 2d609a5..a797bda 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -62,7 +62,7 @@ struct wb_writeback_work {
> > */
> > int writeback_in_progress(struct backing_dev_info *bdi)
> > {
> > - return test_bit(BDI_writeback_running, &bdi->state);
> > + return test_bit(WB_writeback_running, &bdi->wb.state);
> > }
> > EXPORT_SYMBOL(writeback_in_progress);
> >
> > @@ -94,7 +94,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage);
> > static void bdi_wakeup_thread(struct backing_dev_info *bdi)
> > {
> > spin_lock_bh(&bdi->wb_lock);
> > - if (test_bit(BDI_registered, &bdi->state))
> > + if (test_bit(WB_registered, &bdi->wb.state))
> > mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0);
> > spin_unlock_bh(&bdi->wb_lock);
> > }
> > @@ -105,7 +105,7 @@ static void bdi_queue_work(struct backing_dev_info *bdi,
> > trace_writeback_queue(bdi, work);
> >
> > spin_lock_bh(&bdi->wb_lock);
> > - if (!test_bit(BDI_registered, &bdi->state)) {
> > + if (!test_bit(WB_registered, &bdi->wb.state)) {
> > if (work->done)
> > complete(work->done);
> > goto out_unlock;
> > @@ -1007,7 +1007,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> > struct wb_writeback_work *work;
> > long wrote = 0;
> >
> > - set_bit(BDI_writeback_running, &wb->bdi->state);
> > + set_bit(WB_writeback_running, &wb->state);
> > while ((work = get_next_work_item(bdi)) != NULL) {
> >
> > trace_writeback_exec(bdi, work);
> > @@ -1029,7 +1029,7 @@ static long wb_do_writeback(struct bdi_writeback *wb)
> > */
> > wrote += wb_check_old_data_flush(wb);
> > wrote += wb_check_background_flush(wb);
> > - clear_bit(BDI_writeback_running, &wb->bdi->state);
> > + clear_bit(WB_writeback_running, &wb->state);
> >
> > return wrote;
> > }
> > @@ -1049,7 +1049,7 @@ void bdi_writeback_workfn(struct work_struct *work)
> > current->flags |= PF_SWAPWRITE;
> >
> > if (likely(!current_is_workqueue_rescuer() ||
> > - !test_bit(BDI_registered, &bdi->state))) {
> > + !test_bit(WB_registered, &wb->state))) {
> > /*
> > * The normal path. Keep writing back @bdi until its
> > * work_list is empty. Note that this path is also taken
> > @@ -1211,7 +1211,7 @@ void __mark_inode_dirty(struct inode *inode, int flags)
> > spin_unlock(&inode->i_lock);
> > spin_lock(&bdi->wb.list_lock);
> > if (bdi_cap_writeback_dirty(bdi)) {
> > - WARN(!test_bit(BDI_registered, &bdi->state),
> > + WARN(!test_bit(WB_registered, &bdi->wb.state),
> > "bdi-%s not registered\n", bdi->name);
> >
> > /*
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 5da6012..a356ccd 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -25,13 +25,13 @@ struct device;
> > struct dentry;
> >
> > /*
> > - * Bits in backing_dev_info.state
> > + * Bits in bdi_writeback.state
> > */
> > -enum bdi_state {
> > - BDI_async_congested, /* The async (write) queue is getting full */
> > - BDI_sync_congested, /* The sync queue is getting full */
> > - BDI_registered, /* bdi_register() was done */
> > - BDI_writeback_running, /* Writeback is in progress */
> > +enum wb_state {
> > + WB_async_congested, /* The async (write) queue is getting full */
> > + WB_sync_congested, /* The sync queue is getting full */
> > + WB_registered, /* bdi_register() was done */
> > + WB_writeback_running, /* Writeback is in progress */
> > };
> >
> > typedef int (congested_fn)(void *, int);
> > @@ -49,6 +49,7 @@ enum bdi_stat_item {
> > struct bdi_writeback {
> > struct backing_dev_info *bdi; /* our parent bdi */
> >
> > + unsigned long state; /* Always use atomic bitops on this */
> > unsigned long last_old_flush; /* last old data flush */
> >
> > struct delayed_work dwork; /* work item used for writeback */
> > @@ -61,7 +62,6 @@ struct bdi_writeback {
> > struct backing_dev_info {
> > struct list_head bdi_list;
> > unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */
> > - unsigned long state; /* Always use atomic bitops on this */
> > unsigned int capabilities; /* Device capabilities */
> > congested_fn *congested_fn; /* Function pointer if device is md/dm */
> > void *congested_data; /* Pointer to aux data for congested func */
> > @@ -276,23 +276,23 @@ static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
> > {
> > if (bdi->congested_fn)
> > return bdi->congested_fn(bdi->congested_data, bdi_bits);
> > - return (bdi->state & bdi_bits);
> > + return (bdi->wb.state & bdi_bits);
> > }
> >
> > static inline int bdi_read_congested(struct backing_dev_info *bdi)
> > {
> > - return bdi_congested(bdi, 1 << BDI_sync_congested);
> > + return bdi_congested(bdi, 1 << WB_sync_congested);
> > }
> >
> > static inline int bdi_write_congested(struct backing_dev_info *bdi)
> > {
> > - return bdi_congested(bdi, 1 << BDI_async_congested);
> > + return bdi_congested(bdi, 1 << WB_async_congested);
> > }
> >
> > static inline int bdi_rw_congested(struct backing_dev_info *bdi)
> > {
> > - return bdi_congested(bdi, (1 << BDI_sync_congested) |
> > - (1 << BDI_async_congested));
> > + return bdi_congested(bdi, (1 << WB_sync_congested) |
> > + (1 << WB_async_congested));
> > }
> >
> > enum {
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 0ae0df5..62f3b33 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -17,7 +17,6 @@ static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
> > struct backing_dev_info default_backing_dev_info = {
> > .name = "default",
> > .ra_pages = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
> > - .state = 0,
> > .capabilities = BDI_CAP_MAP_COPY,
> > };
> > EXPORT_SYMBOL_GPL(default_backing_dev_info);
> > @@ -111,7 +110,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> > nr_dirty,
> > nr_io,
> > nr_more_io,
> > - !list_empty(&bdi->bdi_list), bdi->state);
> > + !list_empty(&bdi->bdi_list), bdi->wb.state);
> > #undef K
> >
> > return 0;
> > @@ -298,7 +297,7 @@ void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi)
> >
> > timeout = msecs_to_jiffies(dirty_writeback_interval * 10);
> > spin_lock_bh(&bdi->wb_lock);
> > - if (test_bit(BDI_registered, &bdi->state))
> > + if (test_bit(WB_registered, &bdi->wb.state))
> > queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout);
> > spin_unlock_bh(&bdi->wb_lock);
> > }
> > @@ -333,7 +332,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
> > bdi->dev = dev;
> >
> > bdi_debug_register(bdi, dev_name(dev));
> > - set_bit(BDI_registered, &bdi->state);
> > + set_bit(WB_registered, &bdi->wb.state);
> >
> > spin_lock_bh(&bdi_lock);
> > list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
> > @@ -365,7 +364,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
> >
> > /* Make sure nobody queues further work */
> > spin_lock_bh(&bdi->wb_lock);
> > - clear_bit(BDI_registered, &bdi->state);
> > + clear_bit(WB_registered, &bdi->wb.state);
> > spin_unlock_bh(&bdi->wb_lock);
> >
> > /*
> > @@ -543,11 +542,11 @@ static atomic_t nr_bdi_congested[2];
> >
> > void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > - enum bdi_state bit;
> > + enum wb_state bit;
> > wait_queue_head_t *wqh = &congestion_wqh[sync];
> >
> > - bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - if (test_and_clear_bit(bit, &bdi->state))
> > + bit = sync ? WB_sync_congested : WB_async_congested;
> > + if (test_and_clear_bit(bit, &bdi->wb.state))
> > atomic_dec(&nr_bdi_congested[sync]);
> > smp_mb__after_atomic();
> > if (waitqueue_active(wqh))
> > @@ -557,10 +556,10 @@ EXPORT_SYMBOL(clear_bdi_congested);
> >
> > void set_bdi_congested(struct backing_dev_info *bdi, int sync)
> > {
> > - enum bdi_state bit;
> > + enum wb_state bit;
> >
> > - bit = sync ? BDI_sync_congested : BDI_async_congested;
> > - if (!test_and_set_bit(bit, &bdi->state))
> > + bit = sync ? WB_sync_congested : WB_async_congested;
> > + if (!test_and_set_bit(bit, &bdi->wb.state))
> > atomic_inc(&nr_bdi_congested[sync]);
> > }
> > EXPORT_SYMBOL(set_bdi_congested);
> > --
> > 1.9.3
> >

