2021-05-27 01:00:20

by Roman Gushchin

Subject: [PATCH v5 0/2] cgroup, blkcg: prevent dirty inodes from pinning dying memory cgroups

When an inode is getting dirty for the first time it's associated
with a wb structure (see __inode_attach_wb()). It can later be
switched to another wb (if e.g. some other cgroup is writing a lot of
data to the same inode), but otherwise stays attached to the original
wb until being reclaimed.

The problem is that the wb structure holds a reference to the original
memory and blkcg cgroups. So if an inode was dirtied once and is later
actively used in read-only mode, it has a good chance of pinning the
original memory and blkcg cgroups forever. This is often the case with
services that bring in data for other services, e.g. updating some rpm
packages.

In real life this becomes a problem because of the large size of the
memcg structure, which can easily be 1000x larger than an inode. Also,
a really large number of dying cgroups can raise various scalability
issues, e.g. making memory reclaim costly and less effective.

To solve the problem, inodes should eventually be detached from the
corresponding writeback structure. Doing so after every writeback
completion would be inefficient. Instead, it can be done whenever the
original memory cgroup is offlined and the writeback structure is
getting killed. Scanning a (potentially long) list of inodes and
detaching them from the writeback structure can take quite some time.
To avoid scanning all inodes, attached inodes are kept on a new list
(b_attached). To make the scanning less noticeable to the user, it is
performed from a work context.

Big thanks to Jan Kara and Dennis Zhou for their ideas and
contribution to the previous iterations of this patch.

v5:
- switch inodes to bdi->wb instead of zeroing inode->i_wb
- split the single patch into two
- only cgwbs maintain lists of attached inodes
- added cond_resched()
- fixed !CONFIG_CGROUP_WRITEBACK handling
- extended the list of prohibited inode flags
- other small fixes


Roman Gushchin (2):
writeback, cgroup: keep list of inodes attached to bdi_writeback
writeback, cgroup: release dying cgwbs by switching attached inodes

fs/fs-writeback.c | 101 +++++++++++++++++++++++++------
include/linux/backing-dev-defs.h | 2 +
include/linux/backing-dev.h | 7 +++
include/linux/writeback.h | 2 +
mm/backing-dev.c | 63 ++++++++++++++++++-
5 files changed, 156 insertions(+), 19 deletions(-)

--
2.31.1


2021-05-27 02:48:00

by Roman Gushchin

Subject: [PATCH v5 1/2] writeback, cgroup: keep list of inodes attached to bdi_writeback

Currently there is no way to iterate over inodes attached to a
specific cgwb structure. It limits the ability to efficiently
reclaim the writeback structure itself and associated memory and
block cgroup structures without scanning all inodes belonging to a sb,
which can be prohibitively expensive.

While dirty or in active writeback, an inode belongs to one of the
bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.
Once cleaned up, it's removed from all io lists. So inode->i_io_list
can be reused to maintain the list of inodes attached to a
bdi_writeback structure.

This patch introduces a new wb->b_attached list, which contains all
inodes that were dirty at least once and are attached to the given
cgwb. Inodes attached to the root bdi_writeback structure are never
placed on this list. The following patch will use this list to
release cgwb structures more efficiently.

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Roman Gushchin <[email protected]>
---
fs/fs-writeback.c | 66 ++++++++++++++++++++++++--------
include/linux/backing-dev-defs.h | 1 +
include/linux/backing-dev.h | 7 ++++
include/linux/writeback.h | 1 +
mm/backing-dev.c | 2 +
5 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..631ef6366293 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -135,18 +135,23 @@ static bool inode_io_list_move_locked(struct inode *inode,
* inode_io_list_del_locked - remove an inode from its bdi_writeback IO list
* @inode: inode to be removed
* @wb: bdi_writeback @inode is being removed from
+ * @final: inode is going to be freed and can't reappear on any IO list
*
* Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
* clear %WB_has_dirty_io if all are empty afterwards.
*/
static void inode_io_list_del_locked(struct inode *inode,
- struct bdi_writeback *wb)
+ struct bdi_writeback *wb,
+ bool final)
{
assert_spin_locked(&wb->list_lock);
assert_spin_locked(&inode->i_lock);

inode->i_state &= ~I_SYNC_QUEUED;
- list_del_init(&inode->i_io_list);
+ if (final)
+ list_del_init(&inode->i_io_list);
+ else
+ inode_cgwb_move_to_attached(inode, wb);
wb_io_lists_depopulated(wb);
}

@@ -278,6 +283,25 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
}
EXPORT_SYMBOL_GPL(__inode_attach_wb);

+/**
+ * inode_cgwb_move_to_attached - put the inode onto wb->b_attached list
+ * @inode: inode of interest with i_lock held
+ * @wb: target bdi_writeback
+ *
+ * Remove the inode from wb's io lists and if necessary put it onto the b_attached
+ * list. Only inodes attached to cgwb's are kept on this list.
+ */
+void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb)
+{
+ assert_spin_locked(&wb->list_lock);
+ assert_spin_locked(&inode->i_lock);
+
+ if (wb != &wb->bdi->wb)
+ list_move(&inode->i_io_list, &wb->b_attached);
+ else
+ list_del_init(&inode->i_io_list);
+}
+
/**
* locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
* @inode: inode of interest with i_lock held
@@ -419,21 +443,29 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
wb_get(new_wb);

/*
- * Transfer to @new_wb's IO list if necessary. The specific list
- * @inode was on is ignored and the inode is put on ->b_dirty which
- * is always correct including from ->b_dirty_time. The transfer
- * preserves @inode->dirtied_when ordering.
+ * Transfer to @new_wb's IO list if necessary. If the @inode is dirty,
+ * the specific list @inode was on is ignored and the @inode is put on
+ * ->b_dirty which is always correct including from ->b_dirty_time.
+ * The transfer preserves @inode->dirtied_when ordering. If the @inode
+ * was clean, it means it was on the b_attached list, so move it onto
+ * the b_attached list of @new_wb.
*/
if (!list_empty(&inode->i_io_list)) {
- struct inode *pos;
-
- inode_io_list_del_locked(inode, old_wb);
+ inode_io_list_del_locked(inode, old_wb, true);
inode->i_wb = new_wb;
- list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
- if (time_after_eq(inode->dirtied_when,
- pos->dirtied_when))
- break;
- inode_io_list_move_locked(inode, new_wb, pos->i_io_list.prev);
+
+ if (inode->i_state & I_DIRTY_ALL) {
+ struct inode *pos;
+
+ list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
+ if (time_after_eq(inode->dirtied_when,
+ pos->dirtied_when))
+ break;
+ inode_io_list_move_locked(inode, new_wb,
+ pos->i_io_list.prev);
+ } else {
+ inode_cgwb_move_to_attached(inode, new_wb);
+ }
} else {
inode->i_wb = new_wb;
}
@@ -1124,7 +1156,7 @@ void inode_io_list_del(struct inode *inode)

wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
- inode_io_list_del_locked(inode, wb);
+ inode_io_list_del_locked(inode, wb, true);
spin_unlock(&inode->i_lock);
spin_unlock(&wb->list_lock);
}
@@ -1437,7 +1469,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
inode->i_state &= ~I_SYNC_QUEUED;
} else {
/* The inode is clean. Remove from writeback lists. */
- inode_io_list_del_locked(inode, wb);
+ inode_io_list_del_locked(inode, wb, false);
}
}

@@ -1589,7 +1621,7 @@ static int writeback_single_inode(struct inode *inode,
* responsible for the writeback lists.
*/
if (!(inode->i_state & I_DIRTY_ALL))
- inode_io_list_del_locked(inode, wb);
+ inode_io_list_del_locked(inode, wb, false);
spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
out:
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index fff9367a6348..e5dc238ebe4f 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -154,6 +154,7 @@ struct bdi_writeback {
struct cgroup_subsys_state *blkcg_css; /* and blkcg */
struct list_head memcg_node; /* anchored at memcg->cgwb_list */
struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
+ struct list_head b_attached; /* attached inodes, protected by list_lock */

union {
struct work_struct release_work;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 44df4fcef65c..4256e66802e6 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -177,6 +177,7 @@ struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi,
void wb_memcg_offline(struct mem_cgroup *memcg);
void wb_blkcg_offline(struct blkcg *blkcg);
int inode_congested(struct inode *inode, int cong_bits);
+void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb);

/**
* inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode
@@ -345,6 +346,12 @@ static inline bool inode_cgwb_enabled(struct inode *inode)
return false;
}

+static inline void inode_cgwb_move_to_attached(struct inode *inode,
+ struct bdi_writeback *wb)
+{
+ list_del_init(&inode->i_io_list);
+}
+
static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi)
{
return &bdi->wb;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e5c5bb16e2d..572a13c40c90 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -212,6 +212,7 @@ static inline void wait_on_inode(struct inode *inode)
#include <linux/bio.h>

void __inode_attach_wb(struct inode *inode, struct page *page);
+void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb);
void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
struct inode *inode)
__releases(&inode->i_lock);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 576220acd686..54c5dc4b8c24 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -396,6 +396,7 @@ static void cgwb_release_workfn(struct work_struct *work)
fprop_local_destroy_percpu(&wb->memcg_completions);
percpu_ref_exit(&wb->refcnt);
wb_exit(wb);
+ WARN_ON_ONCE(!list_empty(&wb->b_attached));
kfree_rcu(wb, rcu);
}

@@ -472,6 +473,7 @@ static int cgwb_create(struct backing_dev_info *bdi,

wb->memcg_css = memcg_css;
wb->blkcg_css = blkcg_css;
+ INIT_LIST_HEAD(&wb->b_attached);
INIT_WORK(&wb->release_work, cgwb_release_workfn);
set_bit(WB_registered, &wb->state);

--
2.31.1

2021-05-27 02:48:34

by Roman Gushchin

Subject: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

Asynchronously try to release dying cgwbs by switching clean attached
inodes to the bdi's wb. This helps to get rid of the per-cgroup
writeback structures themselves and of the pinned memory and block
cgroups, which are far larger structures (mostly due to large per-cpu
statistics data). It thus prevents memory waste and various
scalability problems caused by large piles of dying cgroups.

A cgwb cleanup operation can fail for various reasons (e.g. the cgwb
has in-flight/pending io, an attached inode is locked or isn't clean,
etc.). In this case the next scheduled cleanup will make a new
attempt. An attempt is made each time a new cgwb is offlined (in other
words, each time a memcg and/or a blkcg is deleted by the user). In
the future an additional attempt scheduled by a timer can be
implemented.

Signed-off-by: Roman Gushchin <[email protected]>
---
fs/fs-writeback.c | 35 ++++++++++++++++++
include/linux/backing-dev-defs.h | 1 +
include/linux/writeback.h | 1 +
mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
4 files changed, 96 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 631ef6366293..8fbcd50844f0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
kfree(isw);
}

+/**
+ * cleanup_offline_wb - detach associated clean inodes
+ * @wb: target wb
+ *
+ * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
+ * drop the corresponding per-cgroup wb's reference. Skip inodes which are
+ * dirty, freeing, in the active writeback process or are in any way busy.
+ */
+void cleanup_offline_wb(struct bdi_writeback *wb)
+{
+ struct inode *inode, *tmp;
+
+ spin_lock(&wb->list_lock);
+restart:
+ list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
+ if (!spin_trylock(&inode->i_lock))
+ continue;
+ xa_lock_irq(&inode->i_mapping->i_pages);
+ if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
+ struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
+
+ WARN_ON_ONCE(inode->i_wb != wb);
+
+ inode->i_wb = bdi_wb;
+ list_del_init(&inode->i_io_list);
+ wb_put(wb);
+ }
+ xa_unlock_irq(&inode->i_mapping->i_pages);
+ spin_unlock(&inode->i_lock);
+ if (cond_resched_lock(&wb->list_lock))
+ goto restart;
+ }
+ spin_unlock(&wb->list_lock);
+}
+
/**
* wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
* @wbc: writeback_control of interest
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index e5dc238ebe4f..07d6b6d6dbdf 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -155,6 +155,7 @@ struct bdi_writeback {
struct list_head memcg_node; /* anchored at memcg->cgwb_list */
struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
struct list_head b_attached; /* attached inodes, protected by list_lock */
+ struct list_head offline_node; /* anchored at offline_cgwbs */

union {
struct work_struct release_work;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 572a13c40c90..922f15fe6ad4 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
enum wb_reason reason, struct wb_completion *done);
void cgroup_writeback_umount(void);
+void cleanup_offline_wb(struct bdi_writeback *wb);

/**
* inode_attach_wb - associate an inode with its wb
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 54c5dc4b8c24..92a00bcaa504 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
#include <linux/memcontrol.h>

/*
- * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
- * bdi->cgwb_tree is also RCU protected.
+ * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
+ * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
static struct workqueue_struct *cgwb_release_wq;

+static LIST_HEAD(offline_cgwbs);
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
+static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
+
static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
@@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)

fprop_local_destroy_percpu(&wb->memcg_completions);
percpu_ref_exit(&wb->refcnt);
+ WARN_ON(!list_empty(&wb->offline_node));
wb_exit(wb);
WARN_ON_ONCE(!list_empty(&wb->b_attached));
kfree_rcu(wb, rcu);
@@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
list_del(&wb->memcg_node);
list_del(&wb->blkcg_node);
+ if (!list_empty(&wb->b_attached))
+ list_add(&wb->offline_node, &offline_cgwbs);
+ else
+ INIT_LIST_HEAD(&wb->offline_node);
percpu_ref_kill(&wb->refcnt);
}

@@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
mutex_unlock(&bdi->cgwb_release_mutex);
}

+/**
+ * cleanup_offline_cgwbs - try to release dying cgwbs
+ *
+ * Try to release dying cgwbs by switching attached inodes to the wb
+ * belonging to the root memory cgroup. Processed wbs are placed at the
+ * end of the list to guarantee the forward progress.
+ *
+ * Should be called with the acquired cgwb_lock lock, which might
+ * be released and re-acquired in the process.
+ */
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
+{
+ struct bdi_writeback *wb;
+ LIST_HEAD(processed);
+
+ spin_lock_irq(&cgwb_lock);
+
+ while (!list_empty(&offline_cgwbs)) {
+ wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
+ offline_node);
+ list_move(&wb->offline_node, &processed);
+
+ if (wb_has_dirty_io(wb))
+ continue;
+
+ if (!percpu_ref_tryget(&wb->refcnt))
+ continue;
+
+ spin_unlock_irq(&cgwb_lock);
+ cleanup_offline_wb(wb);
+ spin_lock_irq(&cgwb_lock);
+
+ if (list_empty(&wb->b_attached))
+ list_del_init(&wb->offline_node);
+
+ wb_put(wb);
+ }
+
+ if (!list_empty(&processed))
+ list_splice_tail(&processed, &offline_cgwbs);
+
+ spin_unlock_irq(&cgwb_lock);
+}
+
/**
* wb_memcg_offline - kill all wb's associated with a memcg being offlined
* @memcg: memcg being offlined
@@ -650,6 +703,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
cgwb_kill(wb);
memcg_cgwb_list->next = NULL; /* prevent new wb's */
+
+ if (!list_empty(&offline_cgwbs))
+ schedule_work(&cleanup_offline_cgwbs_work);
+
spin_unlock_irq(&cgwb_lock);
}

--
2.31.1

2021-05-27 10:37:07

by Jan Kara

Subject: Re: [PATCH v5 1/2] writeback, cgroup: keep list of inodes attached to bdi_writeback

On Wed 26-05-21 15:25:56, Roman Gushchin wrote:
> Currently there is no way to iterate over inodes attached to a
> specific cgwb structure. It limits the ability to efficiently
> reclaim the writeback structure itself and associated memory and
> block cgroup structures without scanning all inodes belonging to a sb,
> which can be prohibitively expensive.
>
> While dirty/in-active-writeback an inode belongs to one of the
> bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.
> Once cleaned up, it's removed from all io lists. So the
> inode->i_io_list can be reused to maintain the list of inodes,
> attached to a bdi_writeback structure.
>
> This patch introduces a new wb->b_attached list, which contains all
> inodes which were dirty at least once and are attached to the given
> cgwb. Inodes attached to the root bdi_writeback structures are never
> placed on such list. The following patch will use this list to try to
> release cgwbs structures more efficiently.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Roman Gushchin <[email protected]>

Looks good. Just some minor nits below:

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e91980f49388..631ef6366293 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -135,18 +135,23 @@ static bool inode_io_list_move_locked(struct inode *inode,
> * inode_io_list_del_locked - remove an inode from its bdi_writeback IO list
> * @inode: inode to be removed
> * @wb: bdi_writeback @inode is being removed from
> + * @final: inode is going to be freed and can't reappear on any IO list
> *
> * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
> * clear %WB_has_dirty_io if all are empty afterwards.
> */
> static void inode_io_list_del_locked(struct inode *inode,
> - struct bdi_writeback *wb)
> + struct bdi_writeback *wb,
> + bool final)
> {
> assert_spin_locked(&wb->list_lock);
> assert_spin_locked(&inode->i_lock);
>
> inode->i_state &= ~I_SYNC_QUEUED;
> - list_del_init(&inode->i_io_list);
> + if (final)
> + list_del_init(&inode->i_io_list);
> + else
> + inode_cgwb_move_to_attached(inode, wb);
> wb_io_lists_depopulated(wb);
> }

With these changes the naming is actually somewhat confusing and the bool
argument makes it even worse. Looking into the code I'd just fold
inode_io_list_del_locked() into inode_io_list_del() and make it really
delete inode from all IO lists. There are currently three other
inode_io_list_del_locked() users:

requeue_inode(), writeback_single_inode() - these should just call
inode_cgwb_move_to_attached() unconditionally
(inode_cgwb_move_to_attached() just needs to clear I_SYNC_QUEUED and call
wb_io_lists_depopulated() in addition to what it currently does).

inode_switch_wbs_work_fn() - I don't think it needs
inode_io_list_del_locked() at all. See below...
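(For illustration, the fold suggested here might look roughly like the
following untested, kernel-style sketch; it is not part of the actual
patch series, and the exact accounting calls are an assumption:)

```c
/* Untested sketch: inode_io_list_del() deletes from all io lists
 * itself, while inode_cgwb_move_to_attached() picks up the
 * I_SYNC_QUEUED clearing and the depopulated accounting, so the
 * boolean 'final' argument disappears entirely. */
static void inode_cgwb_move_to_attached(struct inode *inode,
					struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	assert_spin_locked(&inode->i_lock);

	inode->i_state &= ~I_SYNC_QUEUED;
	if (wb != &wb->bdi->wb)
		list_move(&inode->i_io_list, &wb->b_attached);
	else
		list_del_init(&inode->i_io_list);
	wb_io_lists_depopulated(wb);
}

void inode_io_list_del(struct inode *inode)
{
	struct bdi_writeback *wb;

	wb = inode_to_wb_and_lock_list(inode);
	spin_lock(&inode->i_lock);
	inode->i_state &= ~I_SYNC_QUEUED;
	list_del_init(&inode->i_io_list);
	wb_io_lists_depopulated(wb);
	spin_unlock(&inode->i_lock);
	spin_unlock(&wb->list_lock);
}
```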

> @@ -278,6 +283,25 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
> }
> EXPORT_SYMBOL_GPL(__inode_attach_wb);
>
> +/**
> + * inode_cgwb_move_to_attached - put the inode onto wb->b_attached list
> + * @inode: inode of interest with i_lock held
> + * @wb: target bdi_writeback
> + *
> + * Remove the inode from wb's io lists and if necessary put it onto the b_attached
> + * list. Only inodes attached to cgwb's are kept on this list.
> + */
> +void inode_cgwb_move_to_attached(struct inode *inode, struct bdi_writeback *wb)
> +{
> + assert_spin_locked(&wb->list_lock);
> + assert_spin_locked(&inode->i_lock);
> +
> + if (wb != &wb->bdi->wb)
> + list_move(&inode->i_io_list, &wb->b_attached);
> + else
> + list_del_init(&inode->i_io_list);
> +}

I think this can be static and you can drop the declarations from header
files below. At least I wasn't able to find where this would be used
outside of fs/writeback.c.

> /**
> * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it
> * @inode: inode of interest with i_lock held
> @@ -419,21 +443,29 @@ static void inode_switch_wbs_work_fn(struct work_struct *work)
> wb_get(new_wb);
>
> /*
> - * Transfer to @new_wb's IO list if necessary. The specific list
> - * @inode was on is ignored and the inode is put on ->b_dirty which
> - * is always correct including from ->b_dirty_time. The transfer
> - * preserves @inode->dirtied_when ordering.
> + * Transfer to @new_wb's IO list if necessary. If the @inode is dirty,
> + * the specific list @inode was on is ignored and the @inode is put on
> + * ->b_dirty which is always correct including from ->b_dirty_time.
> + * The transfer preserves @inode->dirtied_when ordering. If the @inode
> + * was clean, it means it was on the b_attached list, so move it onto
> + * the b_attached list of @new_wb.
> */
> if (!list_empty(&inode->i_io_list)) {
> - struct inode *pos;
> -
> - inode_io_list_del_locked(inode, old_wb);
> + inode_io_list_del_locked(inode, old_wb, true);

Do we need inode_io_list_del_locked() here at all? Below we are careful
enough to always use list_move() which does the deletion for us anyway.

> inode->i_wb = new_wb;
> - list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
> - if (time_after_eq(inode->dirtied_when,
> - pos->dirtied_when))
> - break;
> - inode_io_list_move_locked(inode, new_wb, pos->i_io_list.prev);
> +
> + if (inode->i_state & I_DIRTY_ALL) {
> + struct inode *pos;
> +
> + list_for_each_entry(pos, &new_wb->b_dirty, i_io_list)
> + if (time_after_eq(inode->dirtied_when,
> + pos->dirtied_when))
> + break;
> + inode_io_list_move_locked(inode, new_wb,
> + pos->i_io_list.prev);
> + } else {
> + inode_cgwb_move_to_attached(inode, new_wb);
> + }

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-27 15:48:14

by Jan Kara

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> Asynchronously try to release dying cgwbs by switching clean attached
> inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> structures themselves and of pinned memory and block cgroups, which
> are way larger structures (mostly due to large per-cpu statistics
> data). It helps to prevent memory waste and different scalability
> problems caused by large piles of dying cgroups.
>
> A cgwb cleanup operation can fail due to different reasons (e.g. the
> cgwb has in-flight/pending io, an attached inode is locked or isn't
> clean, etc). In this case the next scheduled cleanup will make a new
> attempt. An attempt is made each time a new cgwb is offlined (in other
> words a memcg and/or a blkcg is deleted by a user). In the future an
> additional attempt scheduled by a timer can be implemented.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> fs/fs-writeback.c | 35 ++++++++++++++++++
> include/linux/backing-dev-defs.h | 1 +
> include/linux/writeback.h | 1 +
> mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> 4 files changed, 96 insertions(+), 2 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 631ef6366293..8fbcd50844f0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> kfree(isw);
> }
>
> +/**
> + * cleanup_offline_wb - detach associated clean inodes
> + * @wb: target wb
> + *
> + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> + * dirty, freeing, in the active writeback process or are in any way busy.

I think the comment doesn't match the function anymore.

> + */
> +void cleanup_offline_wb(struct bdi_writeback *wb)
> +{
> + struct inode *inode, *tmp;
> +
> + spin_lock(&wb->list_lock);
> +restart:
> + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> + if (!spin_trylock(&inode->i_lock))
> + continue;
> + xa_lock_irq(&inode->i_mapping->i_pages);
> + if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {

Why the I_REFERENCED check here? That's just the inode aging bit and I
have a hard time seeing how it relates to whether the inode should
switch wbs...

> + struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> +
> + WARN_ON_ONCE(inode->i_wb != wb);
> +
> + inode->i_wb = bdi_wb;
> + list_del_init(&inode->i_io_list);
> + wb_put(wb);

I was kind of hoping you'd use some variant of inode_switch_wbs() here.
That way we'd have a single function handling all the subtleties of
switching inode->i_wb of an active inode. Maybe it isn't strictly needed
here because you detach only from the b_attached list and move to bdi_wb,
so things are indeed simpler. But you definitely miss transferring the
WB_WRITEBACK stat and I'd also like to have a comment here explaining why
this cannot race with other writeback handling or wb switching in a
harmful way.

> + }
> + xa_unlock_irq(&inode->i_mapping->i_pages);
> + spin_unlock(&inode->i_lock);
> + if (cond_resched_lock(&wb->list_lock))
> + goto restart;
> + }
> + spin_unlock(&wb->list_lock);
> +}
> +
> /**
> * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
> * @wbc: writeback_control of interest
> diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> index e5dc238ebe4f..07d6b6d6dbdf 100644
> --- a/include/linux/backing-dev-defs.h
> +++ b/include/linux/backing-dev-defs.h
> @@ -155,6 +155,7 @@ struct bdi_writeback {
> struct list_head memcg_node; /* anchored at memcg->cgwb_list */
> struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
> struct list_head b_attached; /* attached inodes, protected by list_lock */
> + struct list_head offline_node; /* anchored at offline_cgwbs */
>
> union {
> struct work_struct release_work;
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 572a13c40c90..922f15fe6ad4 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
> enum wb_reason reason, struct wb_completion *done);
> void cgroup_writeback_umount(void);
> +void cleanup_offline_wb(struct bdi_writeback *wb);
>
> /**
> * inode_attach_wb - associate an inode with its wb
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 54c5dc4b8c24..92a00bcaa504 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
> #include <linux/memcontrol.h>
>
> /*
> - * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
> - * bdi->cgwb_tree is also RCU protected.
> + * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
> + * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
> */
> static DEFINE_SPINLOCK(cgwb_lock);
> static struct workqueue_struct *cgwb_release_wq;
>
> +static LIST_HEAD(offline_cgwbs);
> +static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
> +static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
> +
> static void cgwb_release_workfn(struct work_struct *work)
> {
> struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
> @@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)
>
> fprop_local_destroy_percpu(&wb->memcg_completions);
> percpu_ref_exit(&wb->refcnt);
> + WARN_ON(!list_empty(&wb->offline_node));

Hum, cannot this happen when the wb originally had some attached inodes,
we added it to offline_cgwbs, but then normal inode reclaim cleaned all
the inodes (and thus all wb refs were dropped) before
cleanup_offline_cgwbs_workfn() was executed? So either the offline_cgwbs
list has to hold its own wb ref, or we have to remove the cgwb from the
list in cgwb_release_workfn().

> wb_exit(wb);
> WARN_ON_ONCE(!list_empty(&wb->b_attached));
> kfree_rcu(wb, rcu);
> @@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
> WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
> list_del(&wb->memcg_node);
> list_del(&wb->blkcg_node);
> + if (!list_empty(&wb->b_attached))
> + list_add(&wb->offline_node, &offline_cgwbs);
> + else
> + INIT_LIST_HEAD(&wb->offline_node);
> percpu_ref_kill(&wb->refcnt);
> }
>
> @@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> mutex_unlock(&bdi->cgwb_release_mutex);
> }
>
> +/**
> + * cleanup_offline_cgwbs - try to release dying cgwbs
> + *
> + * Try to release dying cgwbs by switching attached inodes to the wb
> + * belonging to the root memory cgroup. Processed wbs are placed at the
> + * end of the list to guarantee the forward progress.
> + *
> + * Should be called with the acquired cgwb_lock lock, which might
> + * be released and re-acquired in the process.
> + */
> +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> +{
> + struct bdi_writeback *wb;
> + LIST_HEAD(processed);
> +
> + spin_lock_irq(&cgwb_lock);
> +
> + while (!list_empty(&offline_cgwbs)) {
> + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> + offline_node);
> + list_move(&wb->offline_node, &processed);
> +
> + if (wb_has_dirty_io(wb))
> + continue;
> +
> + if (!percpu_ref_tryget(&wb->refcnt))
> + continue;
> +
> + spin_unlock_irq(&cgwb_lock);
> + cleanup_offline_wb(wb);
> + spin_lock_irq(&cgwb_lock);
> +
> + if (list_empty(&wb->b_attached))
> + list_del_init(&wb->offline_node);

But the cgwb can still have inodes on its dirty lists which will
eventually move to b_attached. So you can delete the cgwb here
prematurely, can't you?

> +
> + wb_put(wb);
> + }
> +
> + if (!list_empty(&processed))
> + list_splice_tail(&processed, &offline_cgwbs);
> +
> + spin_unlock_irq(&cgwb_lock);
> +}
> +
> /**
> * wb_memcg_offline - kill all wb's associated with a memcg being offlined
> * @memcg: memcg being offlined
> @@ -650,6 +703,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
> list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
> cgwb_kill(wb);
> memcg_cgwb_list->next = NULL; /* prevent new wb's */
> +
> + if (!list_empty(&offline_cgwbs))
> + schedule_work(&cleanup_offline_cgwbs_work);
> +
> spin_unlock_irq(&cgwb_lock);

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-27 19:47:23

by Roman Gushchin

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Thu, May 27, 2021 at 10:48:59AM -0700, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > >
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 35 ++++++++++++++++++
> > > include/linux/backing-dev-defs.h | 1 +
> > > include/linux/writeback.h | 1 +
> > > mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> > > 4 files changed, 96 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > > kfree(isw);
> > > }
> > >
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> >
> > I think the comment doesn't match the function anymore.
> >
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > + struct inode *inode, *tmp;
> > > +
> > > + spin_lock(&wb->list_lock);
> > > +restart:
> > > + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > + if (!spin_trylock(&inode->i_lock))
> > > + continue;
> > > + xa_lock_irq(&inode->i_mapping->i_pages);
> > > + if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> >
> > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > hard time seeing how it would relate to whether inode should switch wbs...
>
> What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> flag here. So there must be
> if ((inode->i_state | I_REFERENCED) != I_REFERENCED)

Sorry, I'm wrong. Must be:

if ((inode->i_state | I_REFERENCED) == I_REFERENCED) {
...
}

or even simpler:

if (!(inode->i_state & ~I_REFERENCED)) {
...
}

2021-05-27 23:23:39

by Roman Gushchin

Subject: Re: [PATCH v5 1/2] writeback, cgroup: keep list of inodes attached to bdi_writeback

On Thu, May 27, 2021 at 12:35:17PM +0200, Jan Kara wrote:
> On Wed 26-05-21 15:25:56, Roman Gushchin wrote:
> > Currently there is no way to iterate over inodes attached to a
> > specific cgwb structure. It limits the ability to efficiently
> > reclaim the writeback structure itself and associated memory and
> > block cgroup structures without scanning all inodes belonging to a sb,
> > which can be prohibitively expensive.
> >
> > While dirty/in-active-writeback an inode belongs to one of the
> > bdi_writeback's io lists: b_dirty, b_io, b_more_io and b_dirty_time.
> > Once cleaned up, it's removed from all io lists. So the
> > inode->i_io_list can be reused to maintain the list of inodes,
> > attached to a bdi_writeback structure.
> >
> > This patch introduces a new wb->b_attached list, which contains all
> > inodes which were dirty at least once and are attached to the given
> > cgwb. Inodes attached to the root bdi_writeback structures are never
> > placed on such list. The following patch will use this list to try to
> > release cgwbs structures more efficiently.
> >
> > Suggested-by: Jan Kara <[email protected]>
> > Signed-off-by: Roman Gushchin <[email protected]>
>
> Looks good. Just some minor nits below:
>
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index e91980f49388..631ef6366293 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -135,18 +135,23 @@ static bool inode_io_list_move_locked(struct inode *inode,
> > * inode_io_list_del_locked - remove an inode from its bdi_writeback IO list
> > * @inode: inode to be removed
> > * @wb: bdi_writeback @inode is being removed from
> > + * @final: inode is going to be freed and can't reappear on any IO list
> > *
> > * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and
> > * clear %WB_has_dirty_io if all are empty afterwards.
> > */
> > static void inode_io_list_del_locked(struct inode *inode,
> > - struct bdi_writeback *wb)
> > + struct bdi_writeback *wb,
> > + bool final)
> > {
> > assert_spin_locked(&wb->list_lock);
> > assert_spin_locked(&inode->i_lock);
> >
> > inode->i_state &= ~I_SYNC_QUEUED;
> > - list_del_init(&inode->i_io_list);
> > + if (final)
> > + list_del_init(&inode->i_io_list);
> > + else
> > + inode_cgwb_move_to_attached(inode, wb);
> > wb_io_lists_depopulated(wb);
> > }
>
> With these changes the naming is actually somewhat confusing and the bool
> argument makes it even worse. Looking into the code I'd just fold
> inode_io_list_del_locked() into inode_io_list_del() and make it really
> delete inode from all IO lists. There are currently three other
> inode_io_list_del_locked() users:

Yeah, good idea. Will do in the next version. Thanks!

2021-05-27 23:24:10

by Roman Gushchin

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Thu, May 27, 2021 at 11:24:59AM +0800, Hillf Danton wrote:
> On Wed, 26 May 2021 15:25:57 -0700 Roman Gushchin wrote:
> >+
> >+ if (!list_empty(&offline_cgwbs))
> >+ schedule_work(&cleanup_offline_cgwbs_work);
> >+
>
> Good work overall.

Thanks!

>
> Nit, given cond_resched_lock() in cleanup_offline_wb(), what you need instead
> is
> queue_work(system_unbound_wq, &cleanup_offline_cgwbs_work);

Neat, will do in the next version.

Thanks!

2021-05-27 23:50:17

by Roman Gushchin

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > Asynchronously try to release dying cgwbs by switching clean attached
> > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > structures themselves and of pinned memory and block cgroups, which
> > are way larger structures (mostly due to large per-cpu statistics
> > data). It helps to prevent memory waste and different scalability
> > problems caused by large piles of dying cgroups.
> >
> > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > clean, etc). In this case the next scheduled cleanup will make a new
> > attempt. An attempt is made each time a new cgwb is offlined (in other
> > words a memcg and/or a blkcg is deleted by a user). In the future an
> > additional attempt scheduled by a timer can be implemented.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > ---
> > fs/fs-writeback.c | 35 ++++++++++++++++++
> > include/linux/backing-dev-defs.h | 1 +
> > include/linux/writeback.h | 1 +
> > mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> > 4 files changed, 96 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 631ef6366293..8fbcd50844f0 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > kfree(isw);
> > }
> >
> > +/**
> > + * cleanup_offline_wb - detach associated clean inodes
> > + * @wb: target wb
> > + *
> > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > + * dirty, freeing, in the active writeback process or are in any way busy.
>
> I think the comment doesn't match the function anymore.
>
> > + */
> > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > +{
> > + struct inode *inode, *tmp;
> > +
> > + spin_lock(&wb->list_lock);
> > +restart:
> > + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > + if (!spin_trylock(&inode->i_lock))
> > + continue;
> > + xa_lock_irq(&inode->i_mapping->i_pages);
> > + if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
>
> Why the I_REFERENCED check here? That's just inode aging bit and I have
> hard time seeing how it would relate to whether inode should switch wbs...

What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
flag here. So there must be
if ((inode->i_state | I_REFERENCED) != I_REFERENCED)

Does this look good, or am I wrong and there are other flags acceptable here?

>
> > + struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > +
> > + WARN_ON_ONCE(inode->i_wb != wb);
> > +
> > + inode->i_wb = bdi_wb;
> > + list_del_init(&inode->i_io_list);
> > + wb_put(wb);
>
> I was kind of hoping you'll use some variant of inode_switch_wbs() here.

My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
while in the cleanup case we can deal only with clean inodes and clean wb's.
Hopefully this can make the whole procedure simpler/cheaper. Also, the number
of simultaneous switches is limited and I don't think cleanups should share
this limit.
However I agree that it would be nice to share at least some code.

> That way we have single function handling all the subtleties of switching
> inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> you detach only from b_attached list and move to bdi_wb so things are
> indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> and I'd also like to have a comment here explaining why this cannot race
> with other writeback handling or wb switching in a harmful way.

If we check under wb->list_lock that the wb has no inodes on any writeback lists
(excluding b_attached), doesn't it mean that WB_WRITEBACK must be 0?

Re racing: my logic here was that we're taking all possible locks before doing
anything and then we check that the inode is entirely clean, so this must be
safe:
spin_lock(&wb->list_lock);
spin_trylock(&inode->i_lock);
xa_lock_irq(&inode->i_mapping->i_pages);
...

But now I see that the unlocked inode's wb access mechanism
(unlocked_inode_to_wb_begin()/end()) probably requires additional care.
Repeating the mechanism with scheduling the switching of each inode separately
after an RCU grace period looks too slow. Maybe we can mark all inodes at once
and then switch them all at once, all in two steps. I need to think more.
Do you have any ideas/suggestions here?

>
> > + }
> > + xa_unlock_irq(&inode->i_mapping->i_pages);
> > + spin_unlock(&inode->i_lock);
> > + if (cond_resched_lock(&wb->list_lock))
> > + goto restart;
> > + }
> > + spin_unlock(&wb->list_lock);
> > +}
> > +
> > /**
> > * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
> > * @wbc: writeback_control of interest
> > diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
> > index e5dc238ebe4f..07d6b6d6dbdf 100644
> > --- a/include/linux/backing-dev-defs.h
> > +++ b/include/linux/backing-dev-defs.h
> > @@ -155,6 +155,7 @@ struct bdi_writeback {
> > struct list_head memcg_node; /* anchored at memcg->cgwb_list */
> > struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
> > struct list_head b_attached; /* attached inodes, protected by list_lock */
> > + struct list_head offline_node; /* anchored at offline_cgwbs */
> >
> > union {
> > struct work_struct release_work;
> > diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> > index 572a13c40c90..922f15fe6ad4 100644
> > --- a/include/linux/writeback.h
> > +++ b/include/linux/writeback.h
> > @@ -222,6 +222,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
> > int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
> > enum wb_reason reason, struct wb_completion *done);
> > void cgroup_writeback_umount(void);
> > +void cleanup_offline_wb(struct bdi_writeback *wb);
> >
> > /**
> > * inode_attach_wb - associate an inode with its wb
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index 54c5dc4b8c24..92a00bcaa504 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -371,12 +371,16 @@ static void wb_exit(struct bdi_writeback *wb)
> > #include <linux/memcontrol.h>
> >
> > /*
> > - * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
> > - * bdi->cgwb_tree is also RCU protected.
> > + * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
> > + * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
> > */
> > static DEFINE_SPINLOCK(cgwb_lock);
> > static struct workqueue_struct *cgwb_release_wq;
> >
> > +static LIST_HEAD(offline_cgwbs);
> > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
> > +static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
> > +
> > static void cgwb_release_workfn(struct work_struct *work)
> > {
> > struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
> > @@ -395,6 +399,7 @@ static void cgwb_release_workfn(struct work_struct *work)
> >
> > fprop_local_destroy_percpu(&wb->memcg_completions);
> > percpu_ref_exit(&wb->refcnt);
> > + WARN_ON(!list_empty(&wb->offline_node));
>
> Hum, can't this happen when the wb originally had some attached inodes,
> we added it to offline_cgwbs but then normal inode reclaim cleaned all the
> inodes (and thus all wb refs were dropped) before
> cleanup_offline_cgwbs_workfn() was executed? So either the offline_cgwbs
> list has to hold its own wb ref or we have to remove cgwb from the list
> in cgwb_release_workfn().

Yes, clearly a bug, thanks for catching!

>
> > wb_exit(wb);
> > WARN_ON_ONCE(!list_empty(&wb->b_attached));
> > kfree_rcu(wb, rcu);
> > @@ -414,6 +419,10 @@ static void cgwb_kill(struct bdi_writeback *wb)
> > WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
> > list_del(&wb->memcg_node);
> > list_del(&wb->blkcg_node);
> > + if (!list_empty(&wb->b_attached))
> > + list_add(&wb->offline_node, &offline_cgwbs);
> > + else
> > + INIT_LIST_HEAD(&wb->offline_node);
> > percpu_ref_kill(&wb->refcnt);
> > }
> >
> > @@ -635,6 +644,50 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > mutex_unlock(&bdi->cgwb_release_mutex);
> > }
> >
> > +/**
> > > + * cleanup_offline_cgwbs_workfn - try to release dying cgwbs
> > > + *
> > > + * Try to release dying cgwbs by switching attached inodes to the wb
> > > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > > + * end of the list to guarantee forward progress.
> > > + *
> > > + * Must be called with cgwb_lock held, which might be released
> > > + * and re-acquired in the process.
> > + */
> > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > +{
> > + struct bdi_writeback *wb;
> > + LIST_HEAD(processed);
> > +
> > + spin_lock_irq(&cgwb_lock);
> > +
> > + while (!list_empty(&offline_cgwbs)) {
> > + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > + offline_node);
> > + list_move(&wb->offline_node, &processed);
> > +
> > + if (wb_has_dirty_io(wb))
> > + continue;
> > +
> > + if (!percpu_ref_tryget(&wb->refcnt))
> > + continue;
> > +
> > + spin_unlock_irq(&cgwb_lock);
> > + cleanup_offline_wb(wb);
> > + spin_lock_irq(&cgwb_lock);
> > +
> > + if (list_empty(&wb->b_attached))
> > + list_del_init(&wb->offline_node);
>
> But the cgwb can still have inodes in its dirty lists which will eventually
> move to b_attached. So you can delete the cgwb here prematurely, can't you?

Hm, I thought that in this case the wb_has_dirty_io() check above would fail.
But you're right, nothing really protects wb from being re-dirtied after
the check.
At least we need to hold wb->list_lock for wb_has_dirty_io(wb) and
list_empty(&wb->b_attached) checks...

Will fix it.

Thank you, really appreciate your feedback!

2021-05-28 06:07:55

by Ming Lei

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Wed, May 26, 2021 at 03:25:57PM -0700, Roman Gushchin wrote:
> Asynchronously try to release dying cgwbs by switching clean attached
> inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> structures themselves and of pinned memory and block cgroups, which
> are way larger structures (mostly due to large per-cpu statistics
> data). It helps to prevent memory waste and different scalability
> problems caused by large piles of dying cgroups.
>
> A cgwb cleanup operation can fail due to different reasons (e.g. the
> cgwb has in-flight/pending io, an attached inode is locked or isn't
> clean, etc). In this case the next scheduled cleanup will make a new
> attempt. An attempt is made each time a new cgwb is offlined (in other
> words a memcg and/or a blkcg is deleted by a user). In the future an
> additional attempt scheduled by a timer can be implemented.
>
> Signed-off-by: Roman Gushchin <[email protected]>
> ---
> fs/fs-writeback.c | 35 ++++++++++++++++++
> include/linux/backing-dev-defs.h | 1 +
> include/linux/writeback.h | 1 +
> mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> 4 files changed, 96 insertions(+), 2 deletions(-)
>

Hello Roman,

The following kernel panic is triggered by this patch:

[[email protected] xfstests-dev]# ./check generic/563
[ 47.186811] SGI XFS with ACLs, security attributes, realtime, verbose warnings, quota, no debug enabled
[ 47.190152] XFS (sdb): Mounting V5 Filesystem
[ 47.201551] XFS (sdb): Ending clean mount
[ 47.205501] xfs filesystem being mounted at /mnt/test supports timestamps until 2038 (0x7fffffff)
FSTYP -- xfs (non-debug)
PLATFORM -- Linux/x86_64 ktest-01 5.13.0-rc3+ #294 SMP Fri May 28 10:51:02 CST 2021
MKFS_OPTIONS -- -f -bsize=4096 /dev/sda
MOUNT_OPTIONS -- /dev/sda /mnt/scratch

[ 47.431775] XFS (sda): Mounting V5 Filesystem
[ 47.441731] XFS (sda): Ending clean mount
[ 47.445080] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 47.449189] XFS (sda): Unmounting Filesystem
[ 47.473863] XFS (sdb): Unmounting Filesystem
[ 47.614561] XFS (sdb): Mounting V5 Filesystem
[ 47.628670] XFS (sdb): Ending clean mount
[ 47.631904] xfs filesystem being mounted at /mnt/test supports timestamps until 2038 (0x7fffffff)
generic/563 1s ... [ 47.661393] run fstests generic/563 at 2021-05-28 02:54:59
[ 47.947414] loop0: detected capacity change from 0 to 16777216
[ 48.034564] XFS (loop0): Mounting V5 Filesystem
[ 48.069959] XFS (loop0): Ending clean mount
[ 48.070726] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.132314] XFS (loop0): Unmounting Filesystem
[ 48.204548] XFS (loop0): Mounting V5 Filesystem
[ 48.215500] XFS (loop0): Ending clean mount
[ 48.219223] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.534420] XFS (loop0): Unmounting Filesystem
[ 48.535142] ------------[ cut here ]------------
[ 48.535921] WARNING: CPU: 3 PID: 114 at mm/backing-dev.c:402 cgwb_release_workfn+0xa4/0xd8
[ 48.537461] Modules linked in: xfs libcrc32c iTCO_wdt i2c_i801 iTCO_vendor_support nvme i2c_smbus lpc_ich usb_storage i2c_s
[ 48.540613] CPU: 3 PID: 114 Comm: kworker/3:1 Not tainted 5.13.0-rc3+ #294
[ 48.541927] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 48.543439] Workqueue: cgwb_release cgwb_release_workfn
[ 48.544365] RIP: 0010:cgwb_release_workfn+0xa4/0xd8
[ 48.545185] Code: 00 00 00 48 85 db 75 d5 48 8d 7d 80 e8 98 71 20 00 48 8d bd 70 ff ff ff e8 36 7b 1d 00 48 8b 55 f0 48 8d4
[ 48.548935] RSP: 0018:ffffc90001f47e88 EFLAGS: 00010202
[ 48.549844] RAX: ffff88810321d280 RBX: ffffffff82f69ac0 RCX: 0000000080400011
[ 48.552645] RDX: ffffffff82669f00 RSI: 0000000000210d00 RDI: ffff888100042500
[ 48.553935] RBP: ffff88810321d290 R08: 0000000000000001 R09: ffffffff811c8754
[ 48.555054] R10: 000000000000005e R11: 0000000000000046 R12: ffff88810321d000
[ 48.556183] R13: ffff88815c4f2300 R14: 0000000000000000 R15: 0000000000000000
[ 48.557116] FS: 0000000000000000(0000) GS:ffff88815c4c0000(0000) knlGS:0000000000000000
[ 48.558131] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.558818] CR2: 00007f3aba17f9c0 CR3: 0000000108e86004 CR4: 0000000000370ee0
[ 48.559950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.561057] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.562075] Call Trace:
[ 48.562421] elfcorehdr_read+0xf/0xf
[ 48.562920] ? worker_thread+0x117/0x1b9
[ 48.563443] ? rescuer_thread+0x291/0x291
[ 48.564001] ? kthread+0xec/0xf4
[ 48.564411] ? kthread_create_worker_on_cpu+0x65/0x65
[ 48.565086] ? ret_from_fork+0x1f/0x30
[ 48.565594] ---[ end trace bdeef00aa75cca5c ]---
[ 48.601694] XFS (loop0): Mounting V5 Filesystem
[ 48.605863] XFS (loop0): Ending clean mount
[ 48.607129] xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
[ 48.830734] general protection fault, probably for non-canonical address 0xffff11033f71f000: 0000 [#1] SMP NOPTI
[ 48.832720] CPU: 10 PID: 234 Comm: kworker/10:1 Tainted: G W 5.13.0-rc3+ #294
[ 48.833932] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1.fc33 04/01/2014
[ 48.835146] Workqueue: events cleanup_offline_cgwbs_workfn
[ 48.835952] RIP: 0010:percpu_ref_tryget_many.constprop.0+0x12/0x43
[ 48.836849] Code: 48 8b 47 08 48 8b 40 08 ff d0 0f 1f 00 eb 04 65 48 ff 08 e9 25 fd ff ff 41 54 48 8b 07 a8 03 74 09 48 8b4
[ 48.839494] RSP: 0018:ffffc90001383e58 EFLAGS: 00010046
[ 48.840246] RAX: ffff88810321f000 RBX: ffff88810321d280 RCX: ffff8881016f9770
[ 48.841165] RDX: ffffffff82669f00 RSI: 0000000000000280 RDI: ffff88810321d200
[ 48.842050] RBP: ffffc90001383e68 R08: ffff88810006c8b0 R09: 000073746e657665
[ 48.843224] R10: 8080808080808080 R11: fefefefefefefeff R12: ffff88810321d200
[ 48.844133] R13: ffff88810321d000 R14: 0000000000000000 R15: 0000000000000000
[ 48.845022] FS: 0000000000000000(0000) GS:ffff88823c500000(0000) knlGS:0000000000000000
[ 48.845903] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 48.846495] CR2: 00007fb630166198 CR3: 0000000179a1e006 CR4: 0000000000370ee0
[ 48.847414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 48.848405] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 48.849126] Call Trace:
[ 48.849388] cleanup_offline_cgwbs_workfn+0x8a/0x14c
[ 48.849906] process_one_work+0x15c/0x234
[ 48.850346] worker_thread+0x117/0x1b9
[ 48.850706] ? rescuer_thread+0x291/0x291
[ 48.851065] kthread+0xec/0xf4
[ 48.851346] ? kthread_create_worker_on_cpu+0x65/0x65
[ 48.851815] ret_from_fork+0x1f/0x30
[ 48.852151] Modules linked in: xfs libcrc32c iTCO_wdt i2c_i801 iTCO_vendor_support nvme i2c_smbus lpc_ich usb_storage i2c_s
[ 48.854166] Dumping ftrace buffer:
[ 48.854546] (ftrace buffer empty)
[ 48.854909] ---[ end trace bdeef00aa75cca5d ]---



Thanks,
Ming

2021-05-28 14:10:11

by Jan Kara

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > Asynchronously try to release dying cgwbs by switching clean attached
> > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > structures themselves and of pinned memory and block cgroups, which
> > > are way larger structures (mostly due to large per-cpu statistics
> > > data). It helps to prevent memory waste and different scalability
> > > problems caused by large piles of dying cgroups.
> > >
> > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > clean, etc). In this case the next scheduled cleanup will make a new
> > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > additional attempt scheduled by a timer can be implemented.
> > >
> > > Signed-off-by: Roman Gushchin <[email protected]>
> > > ---
> > > fs/fs-writeback.c | 35 ++++++++++++++++++
> > > include/linux/backing-dev-defs.h | 1 +
> > > include/linux/writeback.h | 1 +
> > > mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> > > 4 files changed, 96 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > index 631ef6366293..8fbcd50844f0 100644
> > > --- a/fs/fs-writeback.c
> > > +++ b/fs/fs-writeback.c
> > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > > kfree(isw);
> > > }
> > >
> > > +/**
> > > + * cleanup_offline_wb - detach associated clean inodes
> > > + * @wb: target wb
> > > + *
> > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > + * dirty, freeing, in the active writeback process or are in any way busy.
> >
> > I think the comment doesn't match the function anymore.
> >
> > > + */
> > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > +{
> > > + struct inode *inode, *tmp;
> > > +
> > > + spin_lock(&wb->list_lock);
> > > +restart:
> > > + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > + if (!spin_trylock(&inode->i_lock))
> > > + continue;
> > > + xa_lock_irq(&inode->i_mapping->i_pages);
> > > + if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> >
> > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > hard time seeing how it would relate to whether inode should switch wbs...
>
> What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> flag here. So there must be
> if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
>
> Does this look good, or am I wrong and there are other flags acceptable here?

Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
and I_OVL_INUSE but that's not that important.

> > > + struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > +
> > > + WARN_ON_ONCE(inode->i_wb != wb);
> > > +
> > > + inode->i_wb = bdi_wb;
> > > + list_del_init(&inode->i_io_list);
> > > + wb_put(wb);
> >
> > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
>
> My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> while in the cleanup case we can deal only with clean inodes and clean wb's.
> Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> of simultaneous switches is limited and I don't think cleanups should share
> this limit.
> However I agree that it would be nice to share at least some code.

I agree limits on parallel switches should not apply. Otherwise I agree
some bits of inode_switch_wbs_work_fn() should not be strictly necessary
but they should be pretty cheap anyway.

> > That way we have single function handling all the subtleties of switching
> > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > you detach only from b_attached list and move to bdi_wb so things are
> > indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> > and I'd also like to have a comment here explaining why this cannot race
> > with other writeback handling or wb switching in a harmful way.
>
> If we'll check under wb->list_lock that wb has no inodes on any writeback
> lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be
> 0?

No, pages under writeback are not reflected in inode->i_state in any way.
You would need to check mapping_tagged(inode->i_mapping,
PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
switching inodes as it can handle alive inodes just fine...

> Re racing: my logic here was that we're taking all possible locks before doing
> anything and then we check that the inode is entirely clean, so this must be
> safe:
> spin_lock(&wb->list_lock);
> spin_trylock(&inode->i_lock);
> xa_lock_irq(&inode->i_mapping->i_pages);
> ...
>
> But now I see that the unlocked inode's wb access mechanism
> (unlocked_inode_to_wb_begin()/end()) probably requires additional care.

Yeah, exactly corner case like this were not quite clear to me whether you
have them correct or not.

> Repeating the mechanism with scheduling the switching of each inode separately
> after an rcu grace period looks too slow. Maybe we can mark all inodes at once
> and then switch them all at once, all in two steps. I need to think more.
> Do you have any ideas/suggestions here?

Nothing really bright. As you say I'd do this in batches - i.e., tag all
inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
inode_switch_wbs_work_fn() for each inode (or probably some helper function
that has guts of inode_switch_wbs_work_fn() as we probably don't want to
acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-28 17:16:58

by Roman Gushchin

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Fri, May 28, 2021 at 10:58:04AM +0800, Ming Lei wrote:
> On Wed, May 26, 2021 at 03:25:57PM -0700, Roman Gushchin wrote:
> > Asynchronously try to release dying cgwbs by switching clean attached
> > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > structures themselves and of pinned memory and block cgroups, which
> > are way larger structures (mostly due to large per-cpu statistics
> > data). It helps to prevent memory waste and different scalability
> > problems caused by large piles of dying cgroups.
> >
> > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > clean, etc). In this case the next scheduled cleanup will make a new
> > attempt. An attempt is made each time a new cgwb is offlined (in other
> > words a memcg and/or a blkcg is deleted by a user). In the future an
> > additional attempt scheduled by a timer can be implemented.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
> > ---
> > fs/fs-writeback.c | 35 ++++++++++++++++++
> > include/linux/backing-dev-defs.h | 1 +
> > include/linux/writeback.h | 1 +
> > mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> > 4 files changed, 96 insertions(+), 2 deletions(-)
> >
>
> Hello Roman,
>
> The following kernel panic is triggered by this patch:

Hello Ming!

Thank you for the report and for trying my patches!
I think I know what it is and will fix in the next version.

Thanks!

2021-05-28 17:19:54

by Roman Gushchin

Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes

On Fri, May 28, 2021 at 03:05:43PM +0200, Jan Kara wrote:
> On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> > On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > > Asynchronously try to release dying cgwbs by switching clean attached
> > > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > > structures themselves and of pinned memory and block cgroups, which
> > > > are way larger structures (mostly due to large per-cpu statistics
> > > > data). It helps to prevent memory waste and different scalability
> > > > problems caused by large piles of dying cgroups.
> > > >
> > > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > > cgwb has in-flight/pending io, an attached inode is locked or isn't
> > > > clean, etc). In this case the next scheduled cleanup will make a new
> > > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > > additional attempt scheduled by a timer can be implemented.
> > > >
> > > > Signed-off-by: Roman Gushchin <[email protected]>
> > > > ---
> > > > fs/fs-writeback.c | 35 ++++++++++++++++++
> > > > include/linux/backing-dev-defs.h | 1 +
> > > > include/linux/writeback.h | 1 +
> > > > mm/backing-dev.c | 61 ++++++++++++++++++++++++++++++--
> > > > 4 files changed, 96 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 631ef6366293..8fbcd50844f0 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > > > kfree(isw);
> > > > }
> > > >
> > > > +/**
> > > > + * cleanup_offline_wb - detach associated clean inodes
> > > > + * @wb: target wb
> > > > + *
> > > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > > + * dirty, freeing, in the active writeback process or are in any way busy.
> > >
> > > I think the comment doesn't match the function anymore.
> > >
> > > > + */
> > > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > > +{
> > > > + struct inode *inode, *tmp;
> > > > +
> > > > + spin_lock(&wb->list_lock);
> > > > +restart:
> > > > + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > > + if (!spin_trylock(&inode->i_lock))
> > > > + continue;
> > > > + xa_lock_irq(&inode->i_mapping->i_pages);
> > > > + if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> > >
> > > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > > hard time seeing how it would relate to whether inode should switch wbs...
> >
> > What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> > flag here. So there must be
> > if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
> >
> > Does this look good, or am I wrong and there are other flags acceptable here?
>
> Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
> and I_OVL_INUSE but that's not that important.
>
> > > > + struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > > +
> > > > + WARN_ON_ONCE(inode->i_wb != wb);
> > > > +
> > > > + inode->i_wb = bdi_wb;
> > > > + list_del_init(&inode->i_io_list);
> > > > + wb_put(wb);
> > >
> > > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
> >
> > My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> > while in the cleanup case we can deal only with clean inodes and clean wb's.
> > Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> > of simultaneous switches is limited and I don't think cleanups should share
> > this limit.
> > However I agree that it would be nice to share at least some code.
>
> I agree limits on parallel switches should not apply. Otherwise I agree
> some bits of inode_switch_wbs_work_fn() should not be strictly necessary
> but they should be pretty cheap anyway.
>
> > > That way we have single function handling all the subtleties of switching
> > > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > > you detach only from b_attached list and move to bdi_wb so things are
> > > indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> > > and I'd also like to have a comment here explaining why this cannot race
> > > with other writeback handling or wb switching in a harmful way.
> >
> > If we'll check under wb->list_lock that wb has no inodes on any writeback
> > lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be
> > 0?
>
> No, pages under writeback are not reflected in inode->i_state in any way.
> You would need to check mapping_tagged(inode->i_mapping,
> PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
> inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
> switching inodes as it can handle alive inodes just fine...

I see...

>
> > Re racing: my logic here was that we're taking all possible locks before doing
> > anything and then we check that the inode is entirely clean, so this must be
> > safe:
> > spin_lock(&wb->list_lock);
> > spin_trylock(&inode->i_lock);
> > xa_lock_irq(&inode->i_mapping->i_pages);
> > ...
> >
> > But now I see that the unlocked inode's wb access mechanism
> > (unlocked_inode_to_wb_begin()/end()) probably requires additional care.
>
> Yeah, exactly corner case like this were not quite clear to me whether you
> have them correct or not.
>
> > Repeating the mechanism with scheduling the switching of each inode separately
> > after an rcu grace period looks too slow. Maybe we can mark all inodes at once
> > and then switch them all at once, all in two steps. I need to think more.
> > Do you have any ideas/suggestions here?
>
> Nothing really bright. As you say I'd do this in batches - i.e., tag all
> inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
> inode_switch_wbs_work_fn() for each inode (or probably some helper function
> that has guts of inode_switch_wbs_work_fn() as we probably don't want to
> acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

Ok, sounds good to me. I'm a bit worried about the possible CPU overhead,
but hopefully we can switch inodes slowly enough that the impact on the
system will be acceptable.

Thanks!