LinuxLists.cc - [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

2021-05-13 00:48:25

Subject: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

When an inode is getting dirty for the first time it's associated
with a wb structure (see __inode_attach_wb()). It can later be
switched to another wb (if e.g. some other cgroup is writing a lot of
data to the same inode), but otherwise stays attached to the original
wb until being reclaimed.

The problem is that the wb structure holds a reference to the original
memory and blkcg cgroups. So if an inode has been dirty once and later
is actively used in read-only mode, it has a good chance to pin down
the original memory and blkcg cgroups. This is often the case with
services bringing data for other services, e.g. updating some rpm
packages.

In the real life it becomes a problem due to a large size of the memcg
structure, which can easily be 1000x larger than an inode. Also a
really large number of dying cgroups can raise different scalability
issues, e.g. making the memory reclaim costly and less effective.

To solve the problem inodes should be eventually detached from the
corresponding writeback structure. It's inefficient to do it after
every writeback completion. Instead it can be done whenever the
original memory cgroup is offlined and writeback structure is getting
killed. Scanning over a (potentially long) list of inodes and detach
them from the writeback structure can take quite some time. To avoid
scanning all inodes, attached inodes are kept on a new list (b_attached).
To make it less noticeable to a user, the scanning is performed from a
work context.

Big thanks to Jan Kara and Dennis Zhou for their ideas and
contribution to the previous iterations of this patch.

Signed-off-by: Roman Gushchin <[email protected]>
---
fs/fs-writeback.c | 84 ++++++++++++++++++++++++--------
include/linux/backing-dev-defs.h | 4 ++
include/linux/backing-dev.h | 17 -------
include/linux/writeback.h | 1 +
mm/backing-dev.c | 68 ++++++++++++++++++++++++--
5 files changed, 132 insertions(+), 42 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..3deba686d3d4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -109,10 +109,10 @@ static void wb_io_lists_depopulated(struct bdi_writeback *wb)
* inode_io_list_move_locked - move an inode onto a bdi_writeback IO list
* @inode: inode to be moved
* @wb: target bdi_writeback
- * @head: one of @wb->b_{dirty|io|more_io|dirty_time}
+ * @head: one of @wb->b_{dirty|io|more_io|dirty_time|attached}
*
* Move @inode->i_io_list to @list of @wb and set %WB_has_dirty_io.
- * Returns %true if @inode is the first occupant of the !dirty_time IO
+ * Returns %true if @inode is the first occupant of the dirty, io or more_io
* lists; otherwise, %false.
*/
static bool inode_io_list_move_locked(struct inode *inode,
@@ -123,12 +123,17 @@ static bool inode_io_list_move_locked(struct inode *inode,

list_move(&inode->i_io_list, head);

- /* dirty_time doesn't count as dirty_io until expiration */
- if (head != &wb->b_dirty_time)
- return wb_io_lists_populated(wb);
+ if (head == &wb->b_dirty_time || head == &wb->b_attached) {
+ /*
+ * dirty_time doesn't count as dirty_io until expiration,
+ * attached list keeps a list of clean inodes, which are
+ * attached to wb.
+ */
+ wb_io_lists_depopulated(wb);
+ return false;
+ }

- wb_io_lists_depopulated(wb);
- return false;
+ return wb_io_lists_populated(wb);
}

/**
@@ -545,6 +550,37 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
kfree(isw);
}

+/**
+ * cleanup_offline_wb - detach attached clean inodes
+ * @wb: target wb
+ *
+ * Clear the ->i_wb pointer of the attached inodes and drop
+ * the corresponding wb reference. Skip inodes which are dirty,
+ * freeing, switching or in the active writeback process.
+ *
+ */
+void cleanup_offline_wb(struct bdi_writeback *wb)
+{
+ struct inode *inode, *tmp;
+
+ spin_lock(&wb->list_lock);
+ list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
+ if (!spin_trylock(&inode->i_lock))
+ continue;
+ xa_lock_irq(&inode->i_mapping->i_pages);
+ if (!(inode->i_state &
+ (I_FREEING | I_CLEAR | I_SYNC | I_DIRTY | I_WB_SWITCH))) {
+ WARN_ON_ONCE(inode->i_wb != wb);
+ inode->i_wb = NULL;
+ wb_put(wb);
+ list_del_init(&inode->i_io_list);
+ }
+ xa_unlock_irq(&inode->i_mapping->i_pages);
+ spin_unlock(&inode->i_lock);
+ }
+ spin_unlock(&wb->list_lock);
+}
+
/**
* wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it
* @wbc: writeback_control of interest
@@ -779,19 +815,18 @@ EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
*/
int inode_congested(struct inode *inode, int cong_bits)
{
- /*
- * Once set, ->i_wb never becomes NULL while the inode is alive.
- * Start transaction iff ->i_wb is visible.
- */
- if (inode && inode_to_wb_is_valid(inode)) {
+ if (inode) {
struct bdi_writeback *wb;
struct wb_lock_cookie lock_cookie = {};
bool congested;

wb = unlocked_inode_to_wb_begin(inode, &lock_cookie);
- congested = wb_congested(wb, cong_bits);
+ if (wb) {
+ congested = wb_congested(wb, cong_bits);
+ unlocked_inode_to_wb_end(inode, &lock_cookie);
+ return congested;
+ }
unlocked_inode_to_wb_end(inode, &lock_cookie);
- return congested;
}

return wb_congested(&inode_to_bdi(inode)->wb, cong_bits);
@@ -1436,8 +1471,13 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
inode_io_list_move_locked(inode, wb, &wb->b_dirty_time);
inode->i_state &= ~I_SYNC_QUEUED;
} else {
- /* The inode is clean. Remove from writeback lists. */
- inode_io_list_del_locked(inode, wb);
+ /*
+ * The inode is clean. Remove from writeback lists and
+ * move it to the attached list, because the inode is
+ * still attached to wb.
+ */
+ inode_io_list_move_locked(inode, wb, &wb->b_attached);
+ inode->i_state &= ~I_SYNC_QUEUED;
}
}

@@ -1584,12 +1624,14 @@ static int writeback_single_inode(struct inode *inode,
wb = inode_to_wb_and_lock_list(inode);
spin_lock(&inode->i_lock);
/*
- * If the inode is now fully clean, then it can be safely removed from
- * its writeback list (if any). Otherwise the flusher threads are
- * responsible for the writeback lists.
+ * If inode is clean, remove it from writeback lists and put into
+ * the attached list. Otherwise don't touch it. See comment above
+ * for explanation.
*/
- if (!(inode->i_state & I_DIRTY_ALL))
- inode_io_list_del_locked(inode, wb);
+ if (!(inode->i_state & I_DIRTY_ALL)) {
+ inode_io_list_move_locked(inode, wb, &wb->b_attached);
+ inode->i_state &= ~I_SYNC_QUEUED;
+ }
spin_unlock(&wb->list_lock);
inode_sync_complete(inode);
out:
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index fff9367a6348..fb49434be4eb 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -114,6 +114,9 @@ struct bdi_writeback {
struct list_head b_io; /* parked for writeback */
struct list_head b_more_io; /* parked for more writeback */
struct list_head b_dirty_time; /* time stamps are dirty */
+ struct list_head b_attached; /* clean inodes pinning this wb
+ * though inode->i_wb
+ */
spinlock_t list_lock; /* protects the b_* lists */

struct percpu_counter stat[NR_WB_STAT_ITEMS];
@@ -154,6 +157,7 @@ struct bdi_writeback {
struct cgroup_subsys_state *blkcg_css; /* and blkcg */
struct list_head memcg_node; /* anchored at memcg->cgwb_list */
struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */
+ struct list_head offline_node;

union {
struct work_struct release_work;
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 44df4fcef65c..cca7eb0e602d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -257,18 +257,6 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
return wb;
}

-/**
- * inode_to_wb_is_valid - test whether an inode has a wb associated
- * @inode: inode of interest
- *
- * Returns %true if @inode has a wb associated. May be called without any
- * locking.
- */
-static inline bool inode_to_wb_is_valid(struct inode *inode)
-{
- return inode->i_wb;
-}
-
/**
* inode_to_wb - determine the wb of an inode
* @inode: inode of interest
@@ -356,11 +344,6 @@ wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp)
return &bdi->wb;
}

-static inline bool inode_to_wb_is_valid(struct inode *inode)
-{
- return true;
-}
-
static inline struct bdi_writeback *inode_to_wb(struct inode *inode)
{
return &inode_to_bdi(inode)->wb;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 8e5c5bb16e2d..8ed76e7d54db 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -221,6 +221,7 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
int cgroup_writeback_by_id(u64 bdi_id, int memcg_id, unsigned long nr_pages,
enum wb_reason reason, struct wb_completion *done);
void cgroup_writeback_umount(void);
+void cleanup_offline_wb(struct bdi_writeback *wb);

/**
* inode_attach_wb - associate an inode with its wb
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 576220acd686..2176c5199c0d 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -53,10 +53,10 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
unsigned long background_thresh;
unsigned long dirty_thresh;
unsigned long wb_thresh;
- unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time;
+ unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time, nr_attached;
struct inode *inode;

- nr_dirty = nr_io = nr_more_io = nr_dirty_time = 0;
+ nr_dirty = nr_io = nr_more_io = nr_dirty_time = nr_attached = 0;
spin_lock(&wb->list_lock);
list_for_each_entry(inode, &wb->b_dirty, i_io_list)
nr_dirty++;
@@ -67,6 +67,8 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
list_for_each_entry(inode, &wb->b_dirty_time, i_io_list)
if (inode->i_state & I_DIRTY_TIME)
nr_dirty_time++;
+ list_for_each_entry(inode, &wb->b_attached, i_io_list)
+ nr_attached++;
spin_unlock(&wb->list_lock);

global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -85,6 +87,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"b_io: %10lu\n"
"b_more_io: %10lu\n"
"b_dirty_time: %10lu\n"
+ "b_attached: %10lu\n"
"bdi_list: %10u\n"
"state: %10lx\n",
(unsigned long) K(wb_stat(wb, WB_WRITEBACK)),
@@ -99,6 +102,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
nr_io,
nr_more_io,
nr_dirty_time,
+ nr_attached,
!list_empty(&bdi->bdi_list), bdi->wb.state);

return 0;
@@ -291,6 +295,7 @@ static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi,
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
INIT_LIST_HEAD(&wb->b_dirty_time);
+ INIT_LIST_HEAD(&wb->b_attached);
spin_lock_init(&wb->list_lock);

wb->bw_time_stamp = jiffies;
@@ -371,12 +376,16 @@ static void wb_exit(struct bdi_writeback *wb)
#include <linux/memcontrol.h>

/*
- * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, and memcg->cgwb_list.
- * bdi->cgwb_tree is also RCU protected.
+ * cgwb_lock protects bdi->cgwb_tree, blkcg->cgwb_list, offline_cgwbs and
+ * memcg->cgwb_list. bdi->cgwb_tree is also RCU protected.
*/
static DEFINE_SPINLOCK(cgwb_lock);
static struct workqueue_struct *cgwb_release_wq;

+static LIST_HEAD(offline_cgwbs);
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work);
+static DECLARE_WORK(cleanup_offline_cgwbs_work, cleanup_offline_cgwbs_workfn);
+
static void cgwb_release_workfn(struct work_struct *work)
{
struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
@@ -386,6 +395,10 @@ static void cgwb_release_workfn(struct work_struct *work)
mutex_lock(&wb->bdi->cgwb_release_mutex);
wb_shutdown(wb);

+ spin_lock_irq(&cgwb_lock);
+ list_del(&wb->offline_node);
+ spin_unlock_irq(&cgwb_lock);
+
css_put(wb->memcg_css);
css_put(wb->blkcg_css);
mutex_unlock(&wb->bdi->cgwb_release_mutex);
@@ -413,6 +426,7 @@ static void cgwb_kill(struct bdi_writeback *wb)
WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
list_del(&wb->memcg_node);
list_del(&wb->blkcg_node);
+ list_add(&wb->offline_node, &offline_cgwbs);
percpu_ref_kill(&wb->refcnt);
}

@@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
mutex_unlock(&bdi->cgwb_release_mutex);
}

+/**
+ * cleanup_offline_cgwbs - try to release dying cgwbs
+ *
+ * Try to release dying cgwbs by switching attached inodes to the wb
+ * belonging to the root memory cgroup. Processed wbs are placed at the
+ * end of the list to guarantee the forward progress.
+ *
+ * Should be called with the acquired cgwb_lock lock, which might
+ * be released and re-acquired in the process.
+ */
+static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
+{
+ struct bdi_writeback *wb;
+ LIST_HEAD(processed);
+
+ spin_lock_irq(&cgwb_lock);
+
+ while (!list_empty(&offline_cgwbs)) {
+ wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
+ offline_node);
+
+ list_move(&wb->offline_node, &processed);
+
+ if (wb_has_dirty_io(wb))
+ continue;
+
+ if (!percpu_ref_tryget(&wb->refcnt))
+ continue;
+
+ spin_unlock_irq(&cgwb_lock);
+ cleanup_offline_wb(wb);
+ spin_lock_irq(&cgwb_lock);
+
+ wb_put(wb);
+ }
+
+ if (!list_empty(&processed))
+ list_splice_tail(&processed, &offline_cgwbs);
+
+ spin_unlock_irq(&cgwb_lock);
+}
+
/**
* wb_memcg_offline - kill all wb's associated with a memcg being offlined
* @memcg: memcg being offlined
@@ -648,6 +704,10 @@ void wb_memcg_offline(struct mem_cgroup *memcg)
list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node)
cgwb_kill(wb);
memcg_cgwb_list->next = NULL; /* prevent new wb's */
+
+ if (!list_empty(&offline_cgwbs))
+ schedule_work(&cleanup_offline_cgwbs_work);
+
spin_unlock_irq(&cgwb_lock);
}

--
2.30.2

2021-05-13 11:32:39

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> When an inode is getting dirty for the first time it's associated
> with a wb structure (see __inode_attach_wb()). It can later be
> switched to another wb (if e.g. some other cgroup is writing a lot of
> data to the same inode), but otherwise stays attached to the original
> wb until being reclaimed.
>
> The problem is that the wb structure holds a reference to the original
> memory and blkcg cgroups. So if an inode has been dirty once and later
> is actively used in read-only mode, it has a good chance to pin down
> the original memory and blkcg cgroups. This is often the case with
> services bringing data for other services, e.g. updating some rpm
> packages.
>
> In the real life it becomes a problem due to a large size of the memcg
> structure, which can easily be 1000x larger than an inode. Also a
> really large number of dying cgroups can raise different scalability
> issues, e.g. making the memory reclaim costly and less effective.
>
> To solve the problem inodes should be eventually detached from the
> corresponding writeback structure. It's inefficient to do it after
> every writeback completion. Instead it can be done whenever the
> original memory cgroup is offlined and writeback structure is getting
> killed. Scanning over a (potentially long) list of inodes and detach
> them from the writeback structure can take quite some time. To avoid
> scanning all inodes, attached inodes are kept on a new list (b_attached).
> To make it less noticeable to a user, the scanning is performed from a
> work context.
>
> Big thanks to Jan Kara and Dennis Zhou for their ideas and
> contribution to the previous iterations of this patch.
>
> Signed-off-by: Roman Gushchin <[email protected]>

Thanks for the patch! On a general note maybe it would be better to split
this patch into two - the first one which introduces b_attached list and
its handling and the second one which uses it to detach inodes from
bdi_writeback structures. Some more comments below.

> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index e91980f49388..3deba686d3d4 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -123,12 +123,17 @@ static bool inode_io_list_move_locked(struct inode *inode,
>
> list_move(&inode->i_io_list, head);
>
> - /* dirty_time doesn't count as dirty_io until expiration */
> - if (head != &wb->b_dirty_time)
> - return wb_io_lists_populated(wb);
> + if (head == &wb->b_dirty_time || head == &wb->b_attached) {
> + /*
> + * dirty_time doesn't count as dirty_io until expiration,
> + * attached list keeps a list of clean inodes, which are
> + * attached to wb.
> + */
> + wb_io_lists_depopulated(wb);
> + return false;
> + }
>
> - wb_io_lists_depopulated(wb);
> - return false;
> + return wb_io_lists_populated(wb);
> }
>
> /**
> @@ -545,6 +550,37 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> kfree(isw);
> }

I suppose the list_empty(&inode->i_io_list) case in
inode_switch_wbs_work_fn() is impossible with your changes now? Can you
perhaps add a WARN_ON_ONCE there for this? Also I don't think we want to
move clean inodes to dirty list so perhaps we need to be more careful about
the selection of target writeback list in that function?

> +/**
> + * cleanup_offline_wb - detach attached clean inodes
> + * @wb: target wb
> + *
> + * Clear the ->i_wb pointer of the attached inodes and drop
> + * the corresponding wb reference. Skip inodes which are dirty,
> + * freeing, switching or in the active writeback process.
> + *
> + */
> +void cleanup_offline_wb(struct bdi_writeback *wb)
> +{
> + struct inode *inode, *tmp;
> +
> + spin_lock(&wb->list_lock);
> + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> + if (!spin_trylock(&inode->i_lock))
> + continue;
> + xa_lock_irq(&inode->i_mapping->i_pages);
> + if (!(inode->i_state &
> + (I_FREEING | I_CLEAR | I_SYNC | I_DIRTY | I_WB_SWITCH))) {

Use I_DIRTY_ALL here instead of I_DIRTY? We don't want to touch
I_DIRTY_TIME inodes either I'd say... Also I think you don't want to touch
I_WILL_FREE inodes either.

> + WARN_ON_ONCE(inode->i_wb != wb);
> + inode->i_wb = NULL;
> + wb_put(wb);
> + list_del_init(&inode->i_io_list);

So I was thinking about this and I'm still a bit nervous that setting i_wb
to NULL is going to cause subtle crashes somewhere. Granted you are very
careful when not to touch the inode but still, even stuff like
inode_to_bdi() is not safe to call with inode->i_wb is NULL. So I'm afraid
that some place in the writeback code will be looking at i_wb without
having any of those bits set and boom. E.g. inode_to_wb() call in
test_clear_page_writeback() - what protects that one?

I forgot what possibilities did we already discuss in the past but cannot
we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., root cgroup
writeback structure)? That would be certainly safer...

> + }
> + xa_unlock_irq(&inode->i_mapping->i_pages);
> + spin_unlock(&inode->i_lock);
> + }
> + spin_unlock(&wb->list_lock);
> +}
> +
...
> @@ -386,6 +395,10 @@ static void cgwb_release_workfn(struct work_struct *work)
> mutex_lock(&wb->bdi->cgwb_release_mutex);
> wb_shutdown(wb);
>
> + spin_lock_irq(&cgwb_lock);
> + list_del(&wb->offline_node);
> + spin_unlock_irq(&cgwb_lock);
> +
> css_put(wb->memcg_css);
> css_put(wb->blkcg_css);
> mutex_unlock(&wb->bdi->cgwb_release_mutex);
> @@ -413,6 +426,7 @@ static void cgwb_kill(struct bdi_writeback *wb)
> WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
> list_del(&wb->memcg_node);
> list_del(&wb->blkcg_node);
> + list_add(&wb->offline_node, &offline_cgwbs);
> percpu_ref_kill(&wb->refcnt);
> }

I think you need to be a bit more careful with the wb->offline_node.
cgwb_create() can end up destroying half-created bdi_writeback structure on
error and then you'd see cgwb_release_workfn() called without cgwb_kill()
called and you'd likely crash or corrupt memory.

>
> @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> mutex_unlock(&bdi->cgwb_release_mutex);
> }
>
> +/**
> + * cleanup_offline_cgwbs - try to release dying cgwbs
> + *
> + * Try to release dying cgwbs by switching attached inodes to the wb
> + * belonging to the root memory cgroup. Processed wbs are placed at the
> + * end of the list to guarantee the forward progress.
> + *
> + * Should be called with the acquired cgwb_lock lock, which might
> + * be released and re-acquired in the process.
> + */
> +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> +{
> + struct bdi_writeback *wb;
> + LIST_HEAD(processed);
> +
> + spin_lock_irq(&cgwb_lock);
> +
> + while (!list_empty(&offline_cgwbs)) {
> + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> + offline_node);
> +
> + list_move(&wb->offline_node, &processed);
> +
> + if (wb_has_dirty_io(wb))
> + continue;
> +
> + if (!percpu_ref_tryget(&wb->refcnt))
> + continue;
> +
> + spin_unlock_irq(&cgwb_lock);
> + cleanup_offline_wb(wb);
> + spin_lock_irq(&cgwb_lock);
> +
> + wb_put(wb);
> + }
> +
> + if (!list_empty(&processed))
> + list_splice_tail(&processed, &offline_cgwbs);
> +
> + spin_unlock_irq(&cgwb_lock);

Shouldn't we reschedule this work with some delay if offline_cgwbs is
non-empty? Otherwise we can end up with non-empty &offline_cgwbs and no
cleaning scheduled...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-14 06:10:38

by Roman Gushchin

[permalink] [raw]

Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

On Thu, May 13, 2021 at 12:12:39PM +0200, Jan Kara wrote:
> On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> > When an inode is getting dirty for the first time it's associated
> > with a wb structure (see __inode_attach_wb()). It can later be
> > switched to another wb (if e.g. some other cgroup is writing a lot of
> > data to the same inode), but otherwise stays attached to the original
> > wb until being reclaimed.
> >
> > The problem is that the wb structure holds a reference to the original
> > memory and blkcg cgroups. So if an inode has been dirty once and later
> > is actively used in read-only mode, it has a good chance to pin down
> > the original memory and blkcg cgroups. This is often the case with
> > services bringing data for other services, e.g. updating some rpm
> > packages.
> >
> > In the real life it becomes a problem due to a large size of the memcg
> > structure, which can easily be 1000x larger than an inode. Also a
> > really large number of dying cgroups can raise different scalability
> > issues, e.g. making the memory reclaim costly and less effective.
> >
> > To solve the problem inodes should be eventually detached from the
> > corresponding writeback structure. It's inefficient to do it after
> > every writeback completion. Instead it can be done whenever the
> > original memory cgroup is offlined and writeback structure is getting
> > killed. Scanning over a (potentially long) list of inodes and detach
> > them from the writeback structure can take quite some time. To avoid
> > scanning all inodes, attached inodes are kept on a new list (b_attached).
> > To make it less noticeable to a user, the scanning is performed from a
> > work context.
> >
> > Big thanks to Jan Kara and Dennis Zhou for their ideas and
> > contribution to the previous iterations of this patch.
> >
> > Signed-off-by: Roman Gushchin <[email protected]>
>
> Thanks for the patch! On a general note maybe it would be better to split
> this patch into two - the first one which introduces b_attached list and
> its handling and the second one which uses it to detach inodes from
> bdi_writeback structures. Some more comments below.

Good idea, will do in the next version. And thank you for taking a look!

>
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index e91980f49388..3deba686d3d4 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -123,12 +123,17 @@ static bool inode_io_list_move_locked(struct inode *inode,
> >
> > list_move(&inode->i_io_list, head);
> >
> > - /* dirty_time doesn't count as dirty_io until expiration */
> > - if (head != &wb->b_dirty_time)
> > - return wb_io_lists_populated(wb);
> > + if (head == &wb->b_dirty_time || head == &wb->b_attached) {
> > + /*
> > + * dirty_time doesn't count as dirty_io until expiration,
> > + * attached list keeps a list of clean inodes, which are
> > + * attached to wb.
> > + */
> > + wb_io_lists_depopulated(wb);
> > + return false;
> > + }
> >
> > - wb_io_lists_depopulated(wb);
> > - return false;
> > + return wb_io_lists_populated(wb);
> > }
> >
> > /**
> > @@ -545,6 +550,37 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > kfree(isw);
> > }
>
> I suppose the list_empty(&inode->i_io_list) case in
> inode_switch_wbs_work_fn() is impossible with your changes now? Can you
> perhaps add a WARN_ON_ONCE there for this? Also I don't think we want to
> move clean inodes to dirty list so perhaps we need to be more careful about
> the selection of target writeback list in that function?

Good point, will add in the next version and double check the clean inode case.

>
> > +/**
> > + * cleanup_offline_wb - detach attached clean inodes
> > + * @wb: target wb
> > + *
> > + * Clear the ->i_wb pointer of the attached inodes and drop
> > + * the corresponding wb reference. Skip inodes which are dirty,
> > + * freeing, switching or in the active writeback process.
> > + *
> > + */
> > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > +{
> > + struct inode *inode, *tmp;
> > +
> > + spin_lock(&wb->list_lock);
> > + list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > + if (!spin_trylock(&inode->i_lock))
> > + continue;
> > + xa_lock_irq(&inode->i_mapping->i_pages);
> > + if (!(inode->i_state &
> > + (I_FREEING | I_CLEAR | I_SYNC | I_DIRTY | I_WB_SWITCH))) {
>
> Use I_DIRTY_ALL here instead of I_DIRTY? We don't want to touch
> I_DIRTY_TIME inodes either I'd say... Also I think you don't want to touch
> I_WILL_FREE inodes either.

You're right, will fix it.

>
> > + WARN_ON_ONCE(inode->i_wb != wb);
> > + inode->i_wb = NULL;
> > + wb_put(wb);
> > + list_del_init(&inode->i_io_list);
>
> So I was thinking about this and I'm still a bit nervous that setting i_wb
> to NULL is going to cause subtle crashes somewhere. Granted you are very
> careful when not to touch the inode but still, even stuff like
> inode_to_bdi() is not safe to call with inode->i_wb is NULL. So I'm afraid
> that some place in the writeback code will be looking at i_wb without
> having any of those bits set and boom. E.g. inode_to_wb() call in
> test_clear_page_writeback() - what protects that one?

I assume that if the page is dirty/under the writeback, the inode must be dirty
too, so we can't zero inode->i_wb.

>
> I forgot what possibilities did we already discuss in the past but cannot
> we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., root cgroup
> writeback structure)? That would be certainly safer...

I am/was nervous too. I had several BUG_ON()'s in all such places and ran
the test kernel for about a day on my dev desktop (even updated to Fedora
34 lol), and haven't seen any panics. And certainly I can give it a
comprehensive test in a more production environment.

Re switching to the root wb: it's certainly a possibility too, however
zeroing has an advantage: the next potential writeback will be accounted
to the right cgroup without a need in an additional switch.
I'd try to go with zeroing if it's possible, and keep the switching to the
root wb as plab B.

>
> > + }
> > + xa_unlock_irq(&inode->i_mapping->i_pages);
> > + spin_unlock(&inode->i_lock);
> > + }
> > + spin_unlock(&wb->list_lock);
> > +}
> > +
> ...
> > @@ -386,6 +395,10 @@ static void cgwb_release_workfn(struct work_struct *work)
> > mutex_lock(&wb->bdi->cgwb_release_mutex);
> > wb_shutdown(wb);
> >
> > + spin_lock_irq(&cgwb_lock);
> > + list_del(&wb->offline_node);
> > + spin_unlock_irq(&cgwb_lock);
> > +
> > css_put(wb->memcg_css);
> > css_put(wb->blkcg_css);
> > mutex_unlock(&wb->bdi->cgwb_release_mutex);
> > @@ -413,6 +426,7 @@ static void cgwb_kill(struct bdi_writeback *wb)
> > WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id));
> > list_del(&wb->memcg_node);
> > list_del(&wb->blkcg_node);
> > + list_add(&wb->offline_node, &offline_cgwbs);
> > percpu_ref_kill(&wb->refcnt);
> > }
>
> I think you need to be a bit more careful with the wb->offline_node.
> cgwb_create() can end up destroying half-created bdi_writeback structure on
> error and then you'd see cgwb_release_workfn() called without cgwb_kill()
> called and you'd likely crash or corrupt memory.

Good point, will check it.

>
> >
> > @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > mutex_unlock(&bdi->cgwb_release_mutex);
> > }
> >
> > +/**
> > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > + *
> > + * Try to release dying cgwbs by switching attached inodes to the wb
> > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > + * end of the list to guarantee the forward progress.
> > + *
> > + * Should be called with the acquired cgwb_lock lock, which might
> > + * be released and re-acquired in the process.
> > + */
> > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > +{
> > + struct bdi_writeback *wb;
> > + LIST_HEAD(processed);
> > +
> > + spin_lock_irq(&cgwb_lock);
> > +
> > + while (!list_empty(&offline_cgwbs)) {
> > + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > + offline_node);
> > +
> > + list_move(&wb->offline_node, &processed);
> > +
> > + if (wb_has_dirty_io(wb))
> > + continue;
> > +
> > + if (!percpu_ref_tryget(&wb->refcnt))
> > + continue;
> > +
> > + spin_unlock_irq(&cgwb_lock);
> > + cleanup_offline_wb(wb);
> > + spin_lock_irq(&cgwb_lock);
> > +
> > + wb_put(wb);
> > + }
> > +
> > + if (!list_empty(&processed))
> > + list_splice_tail(&processed, &offline_cgwbs);
> > +
> > + spin_unlock_irq(&cgwb_lock);
>
> Shouldn't we reschedule this work with some delay if offline_cgwbs is
> non-empty? Otherwise we can end up with non-empty &offline_cgwbs and no
> cleaning scheduled...

I'm not sure. In general it's not a big problem to have few outstanding
wb structures, we just need to make sure we don't pile them.
I'm scheduling the cleanup from the memcg offlining path, so if new cgroups
are regularly created and destroyed, some pressure will be applied.

To reschedule based on time we need to come up with some heuristic how to
calculate the required delay and I don't have any specific ideas. If you do,
I'm totally fine to incorporate them.

Thanks!

2021-05-14 11:25:11

by Jan Kara

[permalink] [raw]

Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

On Thu 13-05-21 11:12:16, Roman Gushchin wrote:
> On Thu, May 13, 2021 at 12:12:39PM +0200, Jan Kara wrote:
> > On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> > > + WARN_ON_ONCE(inode->i_wb != wb);
> > > + inode->i_wb = NULL;
> > > + wb_put(wb);
> > > + list_del_init(&inode->i_io_list);
> >
> > So I was thinking about this and I'm still a bit nervous that setting i_wb
> > to NULL is going to cause subtle crashes somewhere. Granted you are very
> > careful when not to touch the inode but still, even stuff like
> > inode_to_bdi() is not safe to call with inode->i_wb is NULL. So I'm afraid
> > that some place in the writeback code will be looking at i_wb without
> > having any of those bits set and boom. E.g. inode_to_wb() call in
> > test_clear_page_writeback() - what protects that one?
>
> I assume that if the page is dirty/under the writeback, the inode must be
> dirty too, so we can't zero inode->i_wb.

If page is under writeback, the inode can be already clean. You could check
!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)
but see how fragile it is... For every place using inode_to_wb() you have
to come up with a test excluding it...

> > I forgot what possibilities did we already discuss in the past but cannot
> > we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., root cgroup
> > writeback structure)? That would be certainly safer...
>
> I am/was nervous too. I had several BUG_ON()'s in all such places and ran
> the test kernel for about a day on my dev desktop (even updated to Fedora
> 34 lol), and haven't seen any panics. And certainly I can give it a
> comprehensive test in a more production environment.

I appreciate the testing but it is also about how likely this is to break
sometime in future because someone unaware of this corner-case will add new
inode_to_wb() call not excluded by one of your conditions.

> Re switching to the root wb: it's certainly a possibility too, however
> zeroing has an advantage: the next potential writeback will be accounted
> to the right cgroup without a need in an additional switch.
> I'd try to go with zeroing if it's possible, and keep the switching to the
> root wb as plab B.

Yes, there will be the cost of an additional switch. But inodes attached to
dying cgroups shouldn't be the fast path anyway so does it matter?

> > > @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > > mutex_unlock(&bdi->cgwb_release_mutex);
> > > }
> > >
> > > +/**
> > > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > > + *
> > > + * Try to release dying cgwbs by switching attached inodes to the wb
> > > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > > + * end of the list to guarantee the forward progress.
> > > + *
> > > + * Should be called with the acquired cgwb_lock lock, which might
> > > + * be released and re-acquired in the process.
> > > + */
> > > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > > +{
> > > + struct bdi_writeback *wb;
> > > + LIST_HEAD(processed);
> > > +
> > > + spin_lock_irq(&cgwb_lock);
> > > +
> > > + while (!list_empty(&offline_cgwbs)) {
> > > + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > > + offline_node);
> > > +
> > > + list_move(&wb->offline_node, &processed);
> > > +
> > > + if (wb_has_dirty_io(wb))
> > > + continue;
> > > +
> > > + if (!percpu_ref_tryget(&wb->refcnt))
> > > + continue;
> > > +
> > > + spin_unlock_irq(&cgwb_lock);
> > > + cleanup_offline_wb(wb);
> > > + spin_lock_irq(&cgwb_lock);
> > > +
> > > + wb_put(wb);
> > > + }
> > > +
> > > + if (!list_empty(&processed))
> > > + list_splice_tail(&processed, &offline_cgwbs);
> > > +
> > > + spin_unlock_irq(&cgwb_lock);
> >
> > Shouldn't we reschedule this work with some delay if offline_cgwbs is
> > non-empty? Otherwise we can end up with non-empty &offline_cgwbs and no
> > cleaning scheduled...
>
> I'm not sure. In general it's not a big problem to have few outstanding
> wb structures, we just need to make sure we don't pile them.
> I'm scheduling the cleanup from the memcg offlining path, so if new cgroups
> are regularly created and destroyed, some pressure will be applied.
>
> To reschedule based on time we need to come up with some heuristic how to
> calculate the required delay and I don't have any specific ideas. If you do,
> I'm totally fine to incorporate them.

Well, I'd pick e.g. dirty_expire_interval (30s by default) as a time after
which we try again. Because after that time writeback has likely already
happened. But I don't insist here. If you think leaving inodes attached to
dead cgroups for potentially long time in some cases isn't a problem, then
we can leave this as is. If we find it's a problem in the future, we can
always add the time-based retry.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-05-14 21:35:52

by Roman Gushchin

[permalink] [raw]

Subject: Re: [PATCH v4] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups

On Fri, May 14, 2021 at 01:23:20PM +0200, Jan Kara wrote:
> On Thu 13-05-21 11:12:16, Roman Gushchin wrote:
> > On Thu, May 13, 2021 at 12:12:39PM +0200, Jan Kara wrote:
> > > On Wed 12-05-21 17:42:58, Roman Gushchin wrote:
> > > > + WARN_ON_ONCE(inode->i_wb != wb);
> > > > + inode->i_wb = NULL;
> > > > + wb_put(wb);
> > > > + list_del_init(&inode->i_io_list);
> > >
> > > So I was thinking about this and I'm still a bit nervous that setting i_wb
> > > to NULL is going to cause subtle crashes somewhere. Granted you are very
> > > careful when not to touch the inode but still, even stuff like
> > > inode_to_bdi() is not safe to call with inode->i_wb is NULL. So I'm afraid
> > > that some place in the writeback code will be looking at i_wb without
> > > having any of those bits set and boom. E.g. inode_to_wb() call in
> > > test_clear_page_writeback() - what protects that one?
> >
> > I assume that if the page is dirty/under the writeback, the inode must be
> > dirty too, so we can't zero inode->i_wb.
>
> If page is under writeback, the inode can be already clean. You could check
> !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)
> but see how fragile it is... For every place using inode_to_wb() you have
> to come up with a test excluding it...
>
> > > I forgot what possibilities did we already discuss in the past but cannot
> > > we just switch inode->i_wb to inode_to_bdi(inode)->wb (i.e., root cgroup
> > > writeback structure)? That would be certainly safer...
> >
> > I am/was nervous too. I had several BUG_ON()'s in all such places and ran
> > the test kernel for about a day on my dev desktop (even updated to Fedora
> > 34 lol), and haven't seen any panics. And certainly I can give it a
> > comprehensive test in a more production environment.
>
> I appreciate the testing but it is also about how likely this is to break
> sometime in future because someone unaware of this corner-case will add new
> inode_to_wb() call not excluded by one of your conditions.

Ok, you convinced me, will change in the next version.

>
> > Re switching to the root wb: it's certainly a possibility too, however
> > zeroing has an advantage: the next potential writeback will be accounted
> > to the right cgroup without a need in an additional switch.
> > I'd try to go with zeroing if it's possible, and keep the switching to the
> > root wb as plab B.
>
> Yes, there will be the cost of an additional switch. But inodes attached to
> dying cgroups shouldn't be the fast path anyway so does it matter?
>
> > > > @@ -633,6 +647,48 @@ static void cgwb_bdi_unregister(struct backing_dev_info *bdi)
> > > > mutex_unlock(&bdi->cgwb_release_mutex);
> > > > }
> > > >
> > > > +/**
> > > > + * cleanup_offline_cgwbs - try to release dying cgwbs
> > > > + *
> > > > + * Try to release dying cgwbs by switching attached inodes to the wb
> > > > + * belonging to the root memory cgroup. Processed wbs are placed at the
> > > > + * end of the list to guarantee the forward progress.
> > > > + *
> > > > + * Should be called with the acquired cgwb_lock lock, which might
> > > > + * be released and re-acquired in the process.
> > > > + */
> > > > +static void cleanup_offline_cgwbs_workfn(struct work_struct *work)
> > > > +{
> > > > + struct bdi_writeback *wb;
> > > > + LIST_HEAD(processed);
> > > > +
> > > > + spin_lock_irq(&cgwb_lock);
> > > > +
> > > > + while (!list_empty(&offline_cgwbs)) {
> > > > + wb = list_first_entry(&offline_cgwbs, struct bdi_writeback,
> > > > + offline_node);
> > > > +
> > > > + list_move(&wb->offline_node, &processed);
> > > > +
> > > > + if (wb_has_dirty_io(wb))
> > > > + continue;
> > > > +
> > > > + if (!percpu_ref_tryget(&wb->refcnt))
> > > > + continue;
> > > > +
> > > > + spin_unlock_irq(&cgwb_lock);
> > > > + cleanup_offline_wb(wb);
> > > > + spin_lock_irq(&cgwb_lock);
> > > > +
> > > > + wb_put(wb);
> > > > + }
> > > > +
> > > > + if (!list_empty(&processed))
> > > > + list_splice_tail(&processed, &offline_cgwbs);
> > > > +
> > > > + spin_unlock_irq(&cgwb_lock);
> > >
> > > Shouldn't we reschedule this work with some delay if offline_cgwbs is
> > > non-empty? Otherwise we can end up with non-empty &offline_cgwbs and no
> > > cleaning scheduled...
> >
> > I'm not sure. In general it's not a big problem to have few outstanding
> > wb structures, we just need to make sure we don't pile them.
> > I'm scheduling the cleanup from the memcg offlining path, so if new cgroups
> > are regularly created and destroyed, some pressure will be applied.
> >
> > To reschedule based on time we need to come up with some heuristic how to
> > calculate the required delay and I don't have any specific ideas. If you do,
> > I'm totally fine to incorporate them.
>
> Well, I'd pick e.g. dirty_expire_interval (30s by default) as a time after
> which we try again. Because after that time writeback has likely already
> happened. But I don't insist here. If you think leaving inodes attached to
> dead cgroups for potentially long time in some cases isn't a problem, then
> we can leave this as is. If we find it's a problem in the future, we can
> always add the time-based retry.

Yes, I agree, we can add it later.

Thank you!

Roman