2017-12-04 16:13:04

by Kirill Tkhai

Subject: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hi,

this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
It may be used to limit the number of aio requests available to a cgroup,
and could be useful for containers.

The accounting is hierarchical, and aio contexts allocated in a child cgroup
are accounted in its parent cgroups too.

Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
aio request limit, and to show the current limit and the number of currently
occupied requests.
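
For illustration, a minimal userspace sketch (hypothetical usage, not part
of the patches; error handling trimmed). With a low limit written to the
caller's cgroup, io_setup() is expected to fail with EAGAIN once the
requested events can no longer be charged:

	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/aio_abi.h>

	int main(void)
	{
		aio_context_t ctx = 0;

		/* Request 128 events; charged against the cgroup's limit. */
		if (syscall(__NR_io_setup, 128, &ctx) < 0) {
			perror("io_setup");	/* EAGAIN when over the limit */
			return 1;
		}
		syscall(__NR_io_destroy, ctx);
		return 0;
	}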

Patches 1-3 are refactoring.
Patch 4 is where the accounting is actually introduced.
Patch 5 adds "io.aio_nr" file.

---

Kirill Tkhai (5):
aio: Move aio_nr increment to separate function
aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h
blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr
blkcg: Charge aio requests in blkio cgroup hierarchy
blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr


block/blk-cgroup.c | 88 +++++++++++++++++++++++++-
fs/aio.c | 151 ++++++++++++++++++++++++++++++++++++++++----
include/linux/aio.h | 21 ++++++
include/linux/blk-cgroup.h | 4 +
4 files changed, 247 insertions(+), 17 deletions(-)

--
Signed-off-by: Kirill Tkhai <[email protected]>


2017-12-04 16:13:15

by Kirill Tkhai

Subject: [PATCH 1/5] aio: Move aio_nr increment to separate function

There are no functional changes, only preparation
for the next patches.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/aio.c | 44 ++++++++++++++++++++++++++++++++------------
1 file changed, 32 insertions(+), 12 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index e6de7715228c..04209c0561b2 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -694,13 +694,39 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
}
}

-static void aio_nr_sub(unsigned nr)
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+ if (aio_nr + nr > aio_max_nr ||
+ aio_nr + nr < aio_nr)
+ return false;
+
+ aio_nr += nr;
+ return true;
+}
+
+static void __uncharge_aio_nr(unsigned nr)
{
- spin_lock(&aio_nr_lock);
if (WARN_ON(aio_nr - nr > aio_nr))
aio_nr = 0;
else
aio_nr -= nr;
+}
+
+static bool try_to_charge_aio_nr(unsigned nr)
+{
+ bool ret;
+
+ spin_lock(&aio_nr_lock);
+ ret = __try_to_charge_aio_nr(nr);
+ spin_unlock(&aio_nr_lock);
+
+ return ret;
+}
+
+static void uncharge_aio_nr(unsigned nr)
+{
+ spin_lock(&aio_nr_lock);
+ __uncharge_aio_nr(nr);
spin_unlock(&aio_nr_lock);
}

@@ -776,15 +802,9 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
ctx->req_batch = 1;

/* limit the number of system wide aios */
- spin_lock(&aio_nr_lock);
- if (aio_nr + ctx->max_reqs > aio_max_nr ||
- aio_nr + ctx->max_reqs < aio_nr) {
- spin_unlock(&aio_nr_lock);
- err = -EAGAIN;
+ err = -EAGAIN;
+ if (!try_to_charge_aio_nr(ctx->max_reqs))
goto err_ctx;
- }
- aio_nr += ctx->max_reqs;
- spin_unlock(&aio_nr_lock);

percpu_ref_get(&ctx->users); /* io_setup() will drop this ref */
percpu_ref_get(&ctx->reqs); /* free_ioctx_users() will drop this */
@@ -801,7 +821,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
return ctx;

err_cleanup:
- aio_nr_sub(ctx->max_reqs);
+ uncharge_aio_nr(ctx->max_reqs);
err_ctx:
atomic_set(&ctx->dead, 1);
if (ctx->mmap_size)
@@ -848,7 +868,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
* -EAGAIN with no ioctxs actually in use (as far as userspace
* could tell).
*/
- aio_nr_sub(ctx->max_reqs);
+ uncharge_aio_nr(ctx->max_reqs);

if (ctx->mmap_size)
vm_munmap(ctx->mmap_base, ctx->mmap_size);

2017-12-04 16:13:22

by Kirill Tkhai

Subject: [PATCH 2/5] aio: Export aio_nr_lock and aio_max_nr initial value to include/linux/aio.h

The next patch will use these values in more files, so let's make them externally visible.

Signed-off-by: Kirill Tkhai <[email protected]>
---
fs/aio.c | 4 ++--
include/linux/aio.h | 4 ++++
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index 04209c0561b2..9dc98a29077c 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -188,10 +188,10 @@ struct aio_kiocb {
struct eventfd_ctx *ki_eventfd;
};

+DEFINE_SPINLOCK(aio_nr_lock);
/*------ sysctl variables----*/
-static DEFINE_SPINLOCK(aio_nr_lock);
unsigned long aio_nr; /* current system wide number of aio requests */
-unsigned long aio_max_nr = 0x10000; /* system wide maximum number of aio requests */
+unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
/*----end sysctl variables---*/

static struct kmem_cache *kiocb_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 9d8aabecfe2d..5dda2663802f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -10,6 +10,10 @@ struct mm_struct;

#define KIOCB_KEY 0

+#define AIO_NR_INF (-1UL)
+#define AIO_NR_DEF 0x10000
+
+extern spinlock_t aio_nr_lock;
typedef int (kiocb_cancel_fn)(struct kiocb *);

/* prototypes */

2017-12-04 16:13:32

by Kirill Tkhai

Subject: [PATCH 3/5] blkcg: Add blkcg::blkg_aio_nr and blkcg::blkg_aio_max_nr

This adds new members to struct blkcg, which will
be used to account the number of a cgroup's aio requests.

Also, blkcg_root is used to store the sysctl variables
aio_nr and aio_max_nr.

Signed-off-by: Kirill Tkhai <[email protected]>
---
block/blk-cgroup.c | 4 ++++
fs/aio.c | 2 ++
include/linux/aio.h | 6 ++++++
include/linux/blk-cgroup.h | 4 ++++
4 files changed, 16 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index d3f56baee936..774560469b01 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -27,6 +27,7 @@
#include <linux/atomic.h>
#include <linux/ctype.h>
#include <linux/blk-cgroup.h>
+#include <linux/aio.h>
#include "blk.h"

#define MAX_KEY_LEN 100
@@ -1101,6 +1102,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
INIT_HLIST_HEAD(&blkcg->blkg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&blkcg->cgwb_list);
+#endif
+#ifdef CONFIG_AIO
+ blkcg->blkg_aio_max_nr = parent_css ? AIO_NR_INF : AIO_NR_DEF;
#endif
list_add_tail(&blkcg->all_blkcgs_node, &all_blkcgs);

diff --git a/fs/aio.c b/fs/aio.c
index 9dc98a29077c..755f97a42ebe 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -189,10 +189,12 @@ struct aio_kiocb {
};

DEFINE_SPINLOCK(aio_nr_lock);
+#ifndef CONFIG_BLK_CGROUP
/*------ sysctl variables----*/
unsigned long aio_nr; /* current system wide number of aio requests */
unsigned long aio_max_nr = AIO_NR_DEF; /* system wide maximum number of aio requests */
/*----end sysctl variables---*/
+#endif

static struct kmem_cache *kiocb_cachep;
static struct kmem_cache *kioctx_cachep;
diff --git a/include/linux/aio.h b/include/linux/aio.h
index 5dda2663802f..de929a8c9c59 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -2,6 +2,7 @@
#ifndef __LINUX__AIO_H
#define __LINUX__AIO_H

+#include <linux/blk-cgroup.h>
#include <linux/aio_abi.h>

struct kioctx;
@@ -26,8 +27,13 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
kiocb_cancel_fn *cancel) { }
#endif /* CONFIG_AIO */

+#if !defined(CONFIG_BLK_CGROUP) || !defined(CONFIG_AIO)
/* for sysctl: */
extern unsigned long aio_nr;
extern unsigned long aio_max_nr;
+#else
+#define aio_nr blkcg_root.blkg_aio_nr
+#define aio_max_nr blkcg_root.blkg_aio_max_nr
+#endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */

#endif /* __LINUX__AIO_H */
diff --git a/include/linux/blk-cgroup.h b/include/linux/blk-cgroup.h
index 8bbc3716507a..3d9c176fc173 100644
--- a/include/linux/blk-cgroup.h
+++ b/include/linux/blk-cgroup.h
@@ -55,6 +55,10 @@ struct blkcg {
#ifdef CONFIG_CGROUP_WRITEBACK
struct list_head cgwb_list;
#endif
+#ifdef CONFIG_AIO
+ unsigned long blkg_aio_nr;
+ unsigned long blkg_aio_max_nr;
+#endif
};

/*

2017-12-04 16:13:41

by Kirill Tkhai

Subject: [PATCH 4/5] blkcg: Charge aio requests in blkio cgroup hierarchy

This patch adds accounting of the number of requests of allocated aio
contexts per blkio cgroup, and aggregates child cgroups' requests
up the hierarchy. This may be used to limit the aio requests
available to containers.

By default, a newly allocated blkcg::blkg_aio_max_nr is set
to the "unlimited" value (see blkcg_css_alloc() in the previous patch).
This guarantees that applications which do not know about
blkcg::blkg_aio_max_nr will be able to run as they used to,
without configuring the child cgroup's blkg_aio_max_nr.

To protect "task attach" against "io context create/destroy",
a read-locked cgroup_threadgroup_rwsem is used. We take it
via the cgroup_threadgroup_change_*() interfaces, which are used
around the places where we charge kioctx::max_reqs and link a ctx
to mm_struct::ioctx_table.

Single allocations are protected by aio_nr_lock, as before.

Signed-off-by: Kirill Tkhai <[email protected]>
---
block/blk-cgroup.c | 44 +++++++++++++++++++++-
fs/aio.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/aio.h | 11 ++++++
3 files changed, 153 insertions(+), 3 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 774560469b01..9cc6e9574946 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1217,8 +1217,8 @@ void blkcg_exit_queue(struct request_queue *q)
*/
static int blkcg_can_attach(struct cgroup_taskset *tset)
{
- struct task_struct *task;
- struct cgroup_subsys_state *dst_css;
+ struct cgroup_subsys_state *dst_css, *old_css;
+ struct task_struct *task, *p;
struct io_context *ioc;
int ret = 0;

@@ -1230,11 +1230,46 @@ static int blkcg_can_attach(struct cgroup_taskset *tset)
ret = -EINVAL;
task_unlock(task);
if (ret)
- break;
+ goto err;
+ if (!thread_group_leader(task))
+ continue;
+ ret = charge_task_aio_nr(task, css_to_blkcg(dst_css));
+ if (ret)
+ goto err;
+ old_css = task_css(task, io_cgrp_id);
+ uncharge_task_aio_nr(task, css_to_blkcg(old_css));
+ }
+err:
+ if (ret) {
+ cgroup_taskset_for_each(p, dst_css, tset) {
+ if (p == task)
+ break;
+ if (!thread_group_leader(p))
+ continue;
+ uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+ old_css = task_css(p, io_cgrp_id);
+ WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css)));
+ }
}
return ret;
}

+#ifdef CONFIG_AIO
+static void blkcg_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css, *old_css;
+ struct task_struct *p;
+
+ cgroup_taskset_for_each(p, dst_css, tset) {
+ if (!thread_group_leader(p))
+ continue;
+ uncharge_task_aio_nr(p, css_to_blkcg(dst_css));
+ old_css = task_css(p, io_cgrp_id);
+ WARN_ON_ONCE(charge_task_aio_nr(p, css_to_blkcg(old_css)));
+ }
+}
+#endif
+
static void blkcg_bind(struct cgroup_subsys_state *root_css)
{
int i;
@@ -1260,6 +1295,9 @@ struct cgroup_subsys io_cgrp_subsys = {
.css_offline = blkcg_css_offline,
.css_free = blkcg_css_free,
.can_attach = blkcg_can_attach,
+#ifdef CONFIG_AIO
+ .cancel_attach = blkcg_cancel_attach,
+#endif
.bind = blkcg_bind,
.dfl_cftypes = blkcg_files,
.legacy_cftypes = blkcg_legacy_files,
diff --git a/fs/aio.c b/fs/aio.c
index 755f97a42ebe..2e63f5c582c0 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -40,6 +40,7 @@
#include <linux/ramfs.h>
#include <linux/percpu-refcount.h>
#include <linux/mount.h>
+#include <linux/cgroup-defs.h>

#include <asm/kmap_types.h>
#include <linux/uaccess.h>
@@ -696,6 +697,97 @@ static int ioctx_add_table(struct kioctx *ctx, struct mm_struct *mm)
}
}

+#ifdef CONFIG_BLK_CGROUP
+static bool try_to_charge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+ struct blkcg *tmp = blkg;
+
+ while (blkg) {
+ if (nr + blkg->blkg_aio_nr > blkg->blkg_aio_max_nr ||
+ nr + blkg->blkg_aio_nr < nr)
+ goto fail;
+
+ blkg->blkg_aio_nr += nr;
+ blkg = blkcg_parent(blkg);
+ }
+
+ return true;
+fail:
+ while (tmp != blkg) {
+ tmp->blkg_aio_nr -= nr;
+ tmp = blkcg_parent(tmp);
+ }
+ return false;
+}
+
+
+static void uncharge_blkcg(unsigned long nr, struct blkcg *blkg)
+{
+ while (blkg) {
+ blkg->blkg_aio_nr -= nr;
+ blkg = blkcg_parent(blkg);
+ }
+}
+
+static bool __try_to_charge_aio_nr(unsigned nr)
+{
+ struct blkcg *blkg;
+
+ percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem);
+ blkg = container_of(task_css_check(current, io_cgrp_id, true),
+ struct blkcg, css);
+ return try_to_charge_blkcg(nr, blkg);
+}
+
+static void __uncharge_aio_nr(unsigned nr)
+{
+ struct blkcg *blkg;
+
+ percpu_rwsem_assert_held(&cgroup_threadgroup_rwsem);
+ blkg = container_of(task_css_check(current, io_cgrp_id, true),
+ struct blkcg, css);
+ uncharge_blkcg(nr, blkg);
+}
+
+static unsigned long get_task_max_reqs(struct task_struct *p)
+{
+ struct kioctx_table *tbl;
+ unsigned long nr = 0;
+ struct kioctx *ctx;
+ int i;
+
+ if (p->flags & PF_KTHREAD)
+ return 0;
+ /* rwsem must be write locked */
+ tbl = rcu_dereference_protected(p->mm->ioctx_table,
+ percpu_rwsem_is_held(&cgroup_threadgroup_rwsem));
+ if (!tbl)
+ return 0;
+ for (i = 0; i < tbl->nr; i++) {
+ ctx = tbl->table[i];
+ if (!ctx)
+ continue;
+ nr += ctx->max_reqs;
+ }
+ return nr;
+}
+
+int charge_task_aio_nr(struct task_struct *p, struct blkcg *blkg)
+{
+ unsigned long nr = get_task_max_reqs(p);
+
+ if (!nr || try_to_charge_blkcg(nr, blkg))
+ return 0;
+ return -ENOMEM;
+}
+
+void uncharge_task_aio_nr(struct task_struct *p, struct blkcg *blkg)
+{
+ unsigned long nr = get_task_max_reqs(p);
+ if (nr)
+ uncharge_blkcg(nr, blkg);
+}
+#else
static bool __try_to_charge_aio_nr(unsigned nr)
{
if (aio_nr + nr > aio_max_nr ||
@@ -713,6 +805,7 @@ static void __uncharge_aio_nr(unsigned nr)
else
aio_nr -= nr;
}
+#endif /* CONFIG_BLK_CGROUP */

static bool try_to_charge_aio_nr(unsigned nr)
{
@@ -803,6 +896,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (ctx->req_batch < 1)
ctx->req_batch = 1;

+ cgroup_threadgroup_change_begin(current);
+
/* limit the number of system wide aios */
err = -EAGAIN;
if (!try_to_charge_aio_nr(ctx->max_reqs))
@@ -815,6 +910,8 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
if (err)
goto err_cleanup;

+ cgroup_threadgroup_change_end(current);
+
/* Release the ring_lock mutex now that all setup is complete. */
mutex_unlock(&ctx->ring_lock);

@@ -825,6 +922,7 @@ static struct kioctx *ioctx_alloc(unsigned nr_events)
err_cleanup:
uncharge_aio_nr(ctx->max_reqs);
err_ctx:
+ cgroup_threadgroup_change_end(current);
atomic_set(&ctx->dead, 1);
if (ctx->mmap_size)
vm_munmap(ctx->mmap_base, ctx->mmap_size);
@@ -849,9 +947,11 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
{
struct kioctx_table *table;

+ cgroup_threadgroup_change_begin(current);
spin_lock(&mm->ioctx_lock);
if (atomic_xchg(&ctx->dead, 1)) {
spin_unlock(&mm->ioctx_lock);
+ cgroup_threadgroup_change_end(current);
return -EINVAL;
}

@@ -871,6 +971,7 @@ static int kill_ioctx(struct mm_struct *mm, struct kioctx *ctx,
* could tell).
*/
uncharge_aio_nr(ctx->max_reqs);
+ cgroup_threadgroup_change_end(current);

if (ctx->mmap_size)
vm_munmap(ctx->mmap_base, ctx->mmap_size);
diff --git a/include/linux/aio.h b/include/linux/aio.h
index de929a8c9c59..bf442e562a8f 100644
--- a/include/linux/aio.h
+++ b/include/linux/aio.h
@@ -31,9 +31,20 @@ static inline void kiocb_set_cancel_fn(struct kiocb *req,
/* for sysctl: */
extern unsigned long aio_nr;
extern unsigned long aio_max_nr;
+
+static inline int charge_task_aio_nr(struct task_struct *p, struct blkcg *g)
+{
+ return 0;
+}
+static inline void uncharge_task_aio_nr(struct task_struct *p, struct blkcg *g)
+{
+}
#else
#define aio_nr blkcg_root.blkg_aio_nr
#define aio_max_nr blkcg_root.blkg_aio_max_nr
+
+extern int charge_task_aio_nr(struct task_struct *, struct blkcg *);
+extern void uncharge_task_aio_nr(struct task_struct *, struct blkcg *);
#endif /* !CONFIG_BLK_CGROUP || !CONFIG_AIO */

#endif /* __LINUX__AIO_H */

2017-12-04 16:13:58

by Kirill Tkhai

Subject: [PATCH 5/5] blkcg: Add cgroup file to configure blkcg::blkg_aio_max_nr

Add a file to configure the per-cgroup maximum number of available
aio requests.

Writing a value smaller than the number of currently allocated
requests is rejected with -EBUSY.

Reading the file shows the numbers of currently allocated and maximum
available requests.
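
A minimal usage sketch in C (illustrative only; the cgroup2 mount point
and the group name "ct1" are assumptions, not part of this patch):

	#include <stdio.h>

	int main(void)
	{
		const char *path = "/sys/fs/cgroup/ct1/io.aio_nr";
		char buf[64];
		FILE *f;

		f = fopen(path, "w");		/* set the cgroup's limit */
		if (f) {
			fprintf(f, "1024\n");
			fclose(f);
		}

		f = fopen(path, "r");		/* prints "used=<nr>, max=<max>" */
		if (f && fgets(buf, sizeof(buf), f))
			fputs(buf, stdout);
		if (f)
			fclose(f);
		return 0;
	}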

Signed-off-by: Kirill Tkhai <[email protected]>
---
block/blk-cgroup.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 9cc6e9574946..dc5600ef4523 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -981,12 +981,52 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
return 0;
}

+#ifdef CONFIG_AIO
+static int blkcg_write_aio_max_nr(struct cgroup_subsys_state *css,
+ struct cftype *cftype, u64 val)
+{
+ struct blkcg *blkg = css_to_blkcg(css);
+ int ret = 0;
+
+ percpu_down_read(&cgroup_threadgroup_rwsem);
+ spin_lock(&aio_nr_lock);
+ if (val >= blkg->blkg_aio_nr)
+ blkg->blkg_aio_max_nr = val;
+ else
+ ret = -EBUSY;
+ spin_unlock(&aio_nr_lock);
+ percpu_up_read(&cgroup_threadgroup_rwsem);
+ return ret;
+}
+
+static int blkcg_show_aio_nrs(struct seq_file *sf, void *v)
+{
+ struct blkcg *blkg = css_to_blkcg(seq_css(sf));
+ unsigned long max_nr, nr;
+
+ spin_lock(&aio_nr_lock);
+ max_nr = blkg->blkg_aio_max_nr;
+ nr = blkg->blkg_aio_nr;
+ spin_unlock(&aio_nr_lock);
+
+ seq_printf(sf, "used=%lu, max=%lu\n", nr, max_nr);
+ return 0;
+}
+#endif /* CONFIG_AIO */
+
static struct cftype blkcg_files[] = {
{
.name = "stat",
.flags = CFTYPE_NOT_ON_ROOT,
.seq_show = blkcg_print_stat,
},
+#ifdef CONFIG_AIO
+ {
+ .name = "aio_nr",
+ .write_u64 = blkcg_write_aio_max_nr,
+ .seq_show = blkcg_show_aio_nrs,
+ },
+#endif
{ } /* terminate */
};


2017-12-04 17:13:09

by Benjamin LaHaise

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hi Kirill,

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> Hi,
>
> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit the number of aio requests available to a cgroup,
> and could be useful for containers.
>
> The accounting is hierarchical, and aio contexts allocated in a child cgroup
> are accounted in its parent cgroups too.
>
> Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
> aio request limit, and to show the current limit and the number of currently
> occupied requests.

Where are your test cases to check this functionality in the libaio test
suite?

-ben


--
"Thought is the essence of where you are now."

2017-12-04 20:08:09

by Tejun Heo

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hello, Kirill.

On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
> It may be used to limit the number of aio requests available to a cgroup,
> and could be useful for containers.

Can you please explain how this is a fundamental resource which can't
be controlled otherwise?

Thanks.

--
tejun

2017-12-04 21:27:52

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hi, Benjamin,

On 04.12.2017 19:52, Benjamin LaHaise wrote:
> Hi Kirill,
>
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> Hi,
>>
>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
>> It may be used to limit the number of aio requests available to a cgroup,
>> and could be useful for containers.
>>
>> The accounting is hierarchical, and aio contexts allocated in a child cgroup
>> are accounted in its parent cgroups too.
>>
>> Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
>> aio request limit, and to show the current limit and the number of currently
>> occupied requests.
>
> Where are your test cases to check this functionality in the libaio test
> suite?

I tried to find the actual libaio test suite repository URL via Google,
but there is no definite answer. Also, there is no information anywhere
in the kernel tree (I hope I grepped right).

Could you please provide the URL where the actual upstream libaio can be obtained?

Kirill


2017-12-04 21:35:33

by Jeff Moyer

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Kirill Tkhai <[email protected]> writes:

> Hi, Benjamin,
>
> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>> Hi Kirill,
>>
>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>>> Hi,
>>>
>>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
>>> It may be used to limit the number of aio requests available to a cgroup,
>>> and could be useful for containers.
>>>
>>> The accounting is hierarchical, and aio contexts allocated in a child cgroup
>>> are accounted in its parent cgroups too.
>>>
>>> Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
>>> aio request limit, and to show the current limit and the number of currently
>>> occupied requests.
>>
>> Where are your test cases to check this functionality in the libaio test
>> suite?
>
> I tried to find the actual libaio test suite repository URL via Google,
> but there is no definite answer. Also, there is no information anywhere
> in the kernel tree (I hope I grepped right).
>
> Could you please provide the URL where the actual upstream libaio can be obtained?

https://pagure.io/libaio

Patches can be sent to this list (linux-aio).

Cheers,
Jeff


2017-12-04 21:44:14

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hello, Tejun,

On 04.12.2017 23:07, Tejun Heo wrote:
> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
>> It may be used to limit the number of aio requests available to a cgroup,
>> and could be useful for containers.
>
> Can you please explain how this is a fundamental resource which can't
> be controlled otherwise?

Currently, aio_nr and aio_max_nr are global. In the case of containers, this
means that a single container may occupy all aio requests available in
the system and deprive the others of the possibility to use aio at all.
This may happen because of the container user's evil intentions or because
of a program error made accidentally.

My patch set makes it possible to guarantee that every container or cgroup
has its own number of allowed aios, which nobody can steal, and therefore
nobody can slow down other containers or force their programs to use
direct io.

AIO gives certain advantages to its users, so this patch set simply does
not allow those advantages to be robbed with no possibility of protection.

This could be used by LXC or for starting some critical microservices,
for example.

Kirill

2017-12-04 21:48:34

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 05.12.2017 00:35, Jeff Moyer wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> Hi, Benjamin,
>>
>> On 04.12.2017 19:52, Benjamin LaHaise wrote:
>>> Hi Kirill,
>>>
>>> On Mon, Dec 04, 2017 at 07:12:51PM +0300, Kirill Tkhai wrote:
>>>> Hi,
>>>>
>>>> this patch set introduces accounting of aio_nr and aio_max_nr per blkio cgroup.
>>>> It may be used to limit the number of aio requests available to a cgroup,
>>>> and could be useful for containers.
>>>>
>>>> The accounting is hierarchical, and aio contexts allocated in a child cgroup
>>>> are accounted in its parent cgroups too.
>>>>
>>>> Also, a new cgroup file "io.aio_nr" is introduced. It is used to set a cgroup's
>>>> aio request limit, and to show the current limit and the number of currently
>>>> occupied requests.
>>>
>>> Where are your test cases to check this functionality in the libaio test
>>> suite?
>>
>> I tried to find the actual libaio test suite repository URL via Google,
>> but there is no definite answer. Also, there is no information anywhere
>> in the kernel tree (I hope I grepped right).
>>
>> Could you please provide the URL where the actual upstream libaio can be obtained?
>
> https://pagure.io/libaio
>
> Patches can be sent to this list (linux-aio).
>

Thank you, Jeff.

Kirill


2017-12-04 21:52:41

by Tejun Heo

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hello, Kirill.

On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
> > Can you please explain how this is a fundamental resource which can't
> > be controlled otherwise?
>
> Currently, aio_nr and aio_max_nr are global. In the case of containers, this
> means that a single container may occupy all aio requests available in
> the system and deprive the others of the possibility to use aio at all.
> This may happen because of the container user's evil intentions or because
> of a program error made accidentally.

Hmm... I see. It feels really wrong to me to make this a first class
resource because there is a system wide limit. The only reason I can
think of for the system wide limit is to prevent too much kernel
memory being consumed by creating a lot of aios, but that squarely falls
inside cgroup memory controller protection. If there are other
reasons why the number of aios should be limited system-wide, please
bring them up.

If the only reason is kernel memory consumption protection, the only
thing we need to do is make sure that memory used for aio commands
is accounted against cgroup kernel memory consumption, and relax or
remove the system wide limit.

Thanks.

--
tejun

2017-12-04 22:49:52

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 05.12.2017 00:52, Tejun Heo wrote:
> Hello, Kirill.
>
> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>> Can you please explain how this is a fundamental resource which can't
>>> be controlled otherwise?
>>
>> Currently, aio_nr and aio_max_nr are global. In the case of containers, this
>> means that a single container may occupy all aio requests available in
>> the system and deprive the others of the possibility to use aio at all.
>> This may happen because of the container user's evil intentions or because
>> of a program error made accidentally.
>
> Hmm... I see. It feels really wrong to me to make this a first class
> resource because there is a system wide limit. The only reason I can
> think of for the system wide limit is to prevent too much kernel
> memory being consumed by creating a lot of aios, but that squarely falls
> inside cgroup memory controller protection. If there are other
> reasons why the number of aios should be limited system-wide, please
> bring them up.
>
> If the only reason is kernel memory consumption protection, the only
> thing we need to do is make sure that memory used for aio commands
> is accounted against cgroup kernel memory consumption, and relax or
> remove the system wide limit.

So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
structures and pages, and all the memory will be accounted in kmem and
limited by memcg. Looks very good.
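
For example, a rough sketch of what that could look like for the kiocb
caches created in fs/aio.c's aio_setup() (just an illustration, not a
tested patch):

	kiocb_cachep = KMEM_CACHE(aio_kiocb,
				  SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
	kioctx_cachep = KMEM_CACHE(kioctx,
				   SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);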

One detail about memory consumption: io_submit() calls the
file_operations::write_iter and ::read_iter primitives. It's not clear to me
whether they consume the same memory as if the writev() or readv() system
calls were used instead. writev() may delay the actual write until the
dirty pages limit is reached, so it seems the accounting logic should be
the same. So aio mustn't use more unaccounted system memory in filesystem
internals than simple writev() does.

Could you please say whether you have thoughts about this?

Kirill

2017-12-04 22:59:28

by Jeff Moyer

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Kirill Tkhai <[email protected]> writes:

> On 05.12.2017 00:52, Tejun Heo wrote:
>> Hello, Kirill.
>>
>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>> Can you please explain how this is a fundamental resource which can't
>>>> be controlled otherwise?
>>>
>>> Currently, aio_nr and aio_max_nr are global. In the case of containers, this
>>> means that a single container may occupy all aio requests available in
>>> the system and deprive the others of the possibility to use aio at all.
>>> This may happen because of the container user's evil intentions or because
>>> of a program error made accidentally.
>>
>> Hmm... I see. It feels really wrong to me to make this a first class
>> resource because there is a system wide limit. The only reason I can
>> think of for the system wide limit is to prevent too much kernel
>> memory being consumed by creating a lot of aios, but that squarely falls
>> inside cgroup memory controller protection. If there are other
>> reasons why the number of aios should be limited system-wide, please
>> bring them up.
>>
>> If the only reason is kernel memory consumption protection, the only
>> thing we need to do is make sure that memory used for aio commands
>> is accounted against cgroup kernel memory consumption, and relax or
>> remove the system wide limit.
>
> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.
>
> One detail about memory consumption: io_submit() calls the
> file_operations::write_iter and ::read_iter primitives. It's not clear to me
> whether they consume the same memory as if the writev() or readv() system
> calls were used instead. writev() may delay the actual write until the
> dirty pages limit is reached, so it seems the accounting logic should be
> the same. So aio mustn't use more unaccounted system memory in filesystem
> internals than simple writev() does.
>
> Could you please say whether you have thoughts about this?

I think you just need to account the completion ring.
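
For instance (a rough, untested sketch against aio_setup_ring() in
fs/aio.c), the ring pages could be charged to the caller's memcg when
they are created:

	page = find_or_create_page(file->f_mapping, i,
				   GFP_HIGHUSER | __GFP_ZERO | __GFP_ACCOUNT);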

Cheers,
Jeff

2017-12-04 23:02:40

by Tejun Heo

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hello, Kirill.

On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
> > If the only reason is kernel memory consumption protection, the only
> > thing we need to do is make sure that memory used for aio commands
> > is accounted against cgroup kernel memory consumption, and relax or
> > remove the system wide limit.
>
> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
> structures and pages, and all the memory will be accounted in kmem and
> limited by memcg. Looks very good.

Yeah.

> One detail about memory consumption: io_submit() calls the
> file_operations::write_iter and ::read_iter primitives. It's not clear to me
> whether they consume the same memory as if the writev() or readv() system
> calls were used instead. writev() may delay the actual write until the
> dirty pages limit is reached, so it seems the accounting logic should be
> the same. So aio mustn't use more unaccounted system memory in filesystem
> internals than simple writev() does.
>
> Could you please say whether you have thoughts about this?

I'm not too familiar with vfs / filesystems but I don't think there's
gonna be significant unaccounted memory consumption. It shouldn't be
too difficult to find out with experiments too.

Thanks.

--
tejun

2017-12-04 23:05:41

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 05.12.2017 02:02, Tejun Heo wrote:
> Hello, Kirill.
>
> On Tue, Dec 05, 2017 at 01:49:42AM +0300, Kirill Tkhai wrote:
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is make sure that memory used for aio commands
>>> is accounted against cgroup kernel memory consumption, and relax or
>>> remove the system wide limit.
>>
>> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
>
> Yeah.
>
>> One detail about memory consumption: io_submit() calls the
>> file_operations::write_iter and ::read_iter primitives. It's not clear to me
>> whether they consume the same memory as if the writev() or readv() system
>> calls were used instead. writev() may delay the actual write until the
>> dirty pages limit is reached, so it seems the accounting logic should be
>> the same. So aio mustn't use more unaccounted system memory in filesystem
>> internals than simple writev() does.
>>
>> Could you please say whether you have thoughts about this?
>
> I'm not too familiar with vfs / filesystems but I don't think there's
> gonna be significant unaccounted memory consumption. It shouldn't be
> too difficult to find out with experiments too.
>
> Thanks.

Thanks, Tejun!

Kirill

2017-12-04 23:15:04

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 05.12.2017 01:59, Jeff Moyer wrote:
> Kirill Tkhai <[email protected]> writes:
>
>> On 05.12.2017 00:52, Tejun Heo wrote:
>>> Hello, Kirill.
>>>
>>> On Tue, Dec 05, 2017 at 12:44:00AM +0300, Kirill Tkhai wrote:
>>>>> Can you please explain how this is a fundamental resource which can't
>>>>> be controlled otherwise?
>>>>
>>>> Currently, aio_nr and aio_max_nr are global. In the case of containers, this
>>>> means that a single container may occupy all aio requests available in
>>>> the system and deprive the others of the possibility to use aio at all.
>>>> This may happen because of the container user's evil intentions or because
>>>> of a program error made accidentally.
>>>
>>> Hmm... I see. It feels really wrong to me to make this a first class
>>> resource because there is a system wide limit. The only reason I can
>>> think of for the system wide limit is to prevent too much kernel
>>> memory being consumed by creating a lot of aios, but that squarely falls
>>> inside cgroup memory controller protection. If there are other
>>> reasons why the number of aios should be limited system-wide, please
>>> bring them up.
>>>
>>> If the only reason is kernel memory consumption protection, the only
>>> thing we need to do is make sure that memory used for aio commands
>>> is accounted against cgroup kernel memory consumption, and relax or
>>> remove the system wide limit.
>>
>> So, we just use the GFP_KERNEL_ACCOUNT flag for allocation of internal aio
>> structures and pages, and all the memory will be accounted in kmem and
>> limited by memcg. Looks very good.
>>
>> One detail about memory consumption: io_submit() calls the
>> file_operations::write_iter and ::read_iter primitives. It's not clear to me
>> whether they consume the same memory as if the writev() or readv() system
>> calls were used instead. writev() may delay the actual write until the
>> dirty pages limit is reached, so it seems the accounting logic should be
>> the same. So aio mustn't use more unaccounted system memory in filesystem
>> internals than simple writev() does.
>>
>> Could you please say whether you have thoughts about this?
>
> I think you just need to account the completion ring.

A request of struct aio_kiocb type consumes much more memory than
a struct io_event does. Shouldn't we account it too?

Kirill

2017-12-05 15:20:01

by Oleg Nesterov

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 12/05, Kirill Tkhai wrote:
>
> Currently, aio_nr and aio_max_nr are global.

Yeah, I too tried to complain 2 years ago...

> In the case of containers, this
> means that a single container may occupy all aio requests available in
> the system,

and memory. let me quote my old emails...


This is off-topic, but the whole "vm" logic in aio_setup_ring()
looks sub-optimal. I do not mean the code; it just seems to me it
is pointless to pollute the page cache and expose the pages we
cannot swap/free to the lru. Afaics we _only_ need this for migration.

This memory lives in the page cache/lru; it is visible to the shrinker,
which will unmap these pages for no reason on memory shortage. IOW, aio
fools the kernel: this memory looks reclaimable but it is not. And we
only do this for migration.

Even if this is not a problem, this does not look right. So perhaps at
least mapping_set_unevictable() makes sense. But I simply do not know
if migration will work with this change.



Perhaps I missed something, doesn't matter. But this means that
this memory is not accounted, so if I increase aio-max-nr then
this test-case

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_io_setup 206	/* x86_64 */

int main(void)
{
	int nr;

	for (nr = 0; ; ++nr) {
		void *ctx = NULL;
		int ret = syscall(__NR_io_setup, 1, &ctx);
		if (ret) {
			printf("failed %d %m: ", nr);
			getchar();
		}
	}

	return 0;
}

triggers OOM-killer which kills sshd and other daemons on my machine.
These pages were not even faulted in (or the shrinker can unmap them),
the kernel can not know who should be blamed.

Shouldn't we account aio events/pages somehow, say per-user, or in
mm->pinned_vm ?

I do not think this is unknown, and probably this is all fine. IOW,
this is just a question, not a bug report or something like this.

And of course, this is not exploitable because aio-max-nr limits
the number of pages you can steal.

But otoh, aio_max_nr is system-wide, so an unprivileged user can
ddos (say) mysqld. And this leads to the same question: shouldn't
we account nr_events at least?

Oleg.

2017-12-05 15:35:07

by Benjamin LaHaise

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On Tue, Dec 05, 2017 at 04:19:56PM +0100, Oleg Nesterov wrote:
> On 12/05, Kirill Tkhai wrote:
> >
> > Currently, aio_nr and aio_max_nr are global.
>
> Yeah, I too tried to complain 2 years ago...
>
> > In the case of containers, this
> > means that a single container may occupy all aio requests available in
> > the system,
>
> and memory. let me quote my old emails...
>
>
> This is off-topic, but the whole "vm" logic in aio_setup_ring()
> looks sub-optimal. I do not mean the code; it just seems to me it
> is pointless to pollute the page cache and expose the pages we
> cannot swap/free to the lru. Afaics we _only_ need this for migration.

It is needed for migration, which is needed for hot unplug of memory.
There is no way around this.

> This memory lives in the page cache/lru; it is visible to the shrinker,
> which will unmap these pages for no reason on memory shortage. IOW, aio
> fools the kernel: this memory looks reclaimable but it is not. And we
> only do this for migration.

It's the same as any other memory that's mlock()ed into RAM.

> Even if this is not a problem, this does not look right. So perhaps at
> least mapping_set_unevictable() makes sense. But I simply do not know
> if migration will work with this change.
>
>
>
> Perhaps I missed something, doesn't matter. But this means that
> this memory is not accounted, so if I increase aio-max-nr then
> this test-case
>
> [...]
>
> triggers OOM-killer which kills sshd and other daemons on my machine.
> These pages were not even faulted in (or the shrinker can unmap them),
> the kernel can not know who should be blamed.

The OOM-killer killed the wrong process: News at 11. This is not new
behaviour, and has always been an issue. If it bothered to kill the
process that was doing the allocations, the ioctxes would be freed and all
would be well. Doesn't the OOM killer take into account which process is
allocating memory? Seems like a pretty major deficiency if it isn't.

> Shouldn't we account aio events/pages somehow, say per-user, or in
> mm->pinned_vm ?

Sure, I have no objection to that. Please send a patch!

> I do not think this is unknown, and probably this is all fine. IOW,
> this is just a question, not a bug report or something like this.
>
> And of course, this is not exploitable because aio-max-nr limits
> the number of pages you can steal.

Which is why the aio-max-nr limit exists! That it is imprecise and
imperfect is not being denied.

> But otoh, aio_max_nr is system-wide, so an unprivileged user can
> ddos (say) mysqld. And this leads to the same question: shouldn't
> we account nr_events at least?

Anyone can DDoS the local system - this is nothing new and has always been
the case. I'm not opposed to improvements in this area. The only issue I
have with Kirill's changes is that we need to have test cases for this new
functionality if the code is to be merged.

TBH, using memory accounting to limit things may be a better approach, as
that at least doesn't require changes to container implementations to
obtain a benefit from bounding aio's memory usage.

-ben

> Oleg.
>

--
"Thought is the essence of where you are now."

2017-12-05 15:41:18

by Jeff Moyer

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Kirill Tkhai <[email protected]> writes:

>> I think you just need to account the completion ring.
>
> A request of struct aio_kiocb type consumes much more memory than
> a struct io_event does. Shouldn't we account it too?

Not in my opinion. The completion ring is the part that gets pinned for
long periods of time.

Just be sure to document this where appropriate. Users/admins should
know that the aio completion ring now contributes to their memory
budget.

Cheers,
Jeff

2017-12-05 15:51:33

by Tejun Heo

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

Hello, Jeff.

On Tue, Dec 05, 2017 at 10:41:11AM -0500, Jeff Moyer wrote:
> Kirill Tkhai <[email protected]> writes:
>
> >> I think you just need to account the completion ring.
> >
> > A request of struct aio_kiocb type consumes much more memory than
> > a struct io_event does. Shouldn't we account it too?
>
> Not in my opinion. The completion ring is the part that gets pinned for
> long periods of time.

For memcg, it should account all potentially significant memory
consumption, whether long-term or transitional. Otherwise, isolation
doesn't really work that well.

> Just be sure to document this where appropriate. Users/admins should
> know that the aio completion ring now contributes to their memory
> budget.

Yeah, the memory controller documentation in cgroup-v2.txt does have a
section for this.

Thanks.

--
tejun

2017-12-06 17:33:02

by Oleg Nesterov

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 12/05, Benjamin LaHaise wrote:
>
> On Tue, Dec 05, 2017 at 04:19:56PM +0100, Oleg Nesterov wrote:
> >
> > and memory. let me quote my old emails...
> >
> >
> > This is off-topic, but the whole "vm" logic in aio_setup_ring()
> > looks sub-optimal. I do not mean the code; it just seems to me it
> > is pointless to pollute the page cache and expose the pages we
> > cannot swap/free to the lru. Afaics we _only_ need this for migration.
>
> It is needed for migration, which is needed for hot unplug of memory.
> There is no way around this.

I know, and I even mentioned this above.


> > This memory lives in the page cache/lru; it is visible to the shrinker,
> > which will unmap these pages for no reason on memory shortage. IOW, aio
> > fools the kernel: this memory looks reclaimable but it is not. And we
> > only do this for migration.
>
> It's the same as any other memory that's mlock()ed into RAM.

No. Again, this memory is not properly accounted, and unlike mlock()ed
memory it is visible to the shrinker, which will do unnecessary work on
memory shortage, which in turn will lead to unnecessary page faults.

So let me repeat: shouldn't we at least do mapping_set_unevictable() in
aio_private_file()?
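
Something like this, right after the anon inode's mapping is set up in
aio_private_file() (an untested sketch based on fs/aio.c; I do not know
how it interacts with page migration):

	inode->i_mapping->a_ops = &aio_ctx_aops;
	inode->i_mapping->private_data = ctx;
	inode->i_size = PAGE_SIZE * nr_pages;
	mapping_set_unevictable(inode->i_mapping);	/* proposed addition */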


> > triggers OOM-killer which kills sshd and other daemons on my machine.
> > These pages were not even faulted in (or the shrinker can unmap them),
> > the kernel can not know who should be blamed.
>
> The OOM-killer killed the wrong process: News at 11.

Well. I do not think we should blame OOM-killer in this case. But as I
said this is not a bug-report or something like this, I agree this is
a minor issue.

Oleg.

2017-12-06 17:44:51

by Benjamin LaHaise

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> > > This memory lives in the page cache/lru; it is visible to the shrinker,
> > > which will unmap these pages for no reason on memory shortage. IOW, aio
> > > fools the kernel: this memory looks reclaimable but it is not. And we
> > > only do this for migration.
> >
> > It's the same as any other memory that's mlock()ed into RAM.
>
> No. Again, this memory is not properly accounted, and unlike mlock()ed
> memory it is visible to the shrinker, which will do unnecessary work on
> memory shortage, which in turn will lead to unnecessary page faults.
>
> So let me repeat: shouldn't we at least do mapping_set_unevictable() in
> aio_private_file()?

Send a patch then! I don't know why you're asking rather than sending a
patch to do this if you think it is needed.

> > > triggers OOM-killer which kills sshd and other daemons on my machine.
> > > These pages were not even faulted in (or the shrinker can unmap them),
> > > the kernel can not know who should be blamed.
> >
> > The OOM-killer killed the wrong process: News at 11.
>
> Well. I do not think we should blame OOM-killer in this case. But as I
> said this is not a bug-report or something like this, I agree this is
> a minor issue.

I do think the OOM-killer is doing the wrong thing here. If process X is
the only one that is allocating gobs of memory, why kill process Y that
hasn't allocated memory in minutes or hours just because it is bigger?
We're not perfect at tracking who owns memory allocations, so why not
factor in memory allocation rate when deciding which process to kill? We
keep throwing bandaids on the OOM-killer by annotating allocations, and we
keep missing the annotation of allocations. Doesn't sound like a real fix
for the underlying problem to me when a better heuristic would solve the
current problem and probably several other future instances of the same
problem.

-ben
--
"Thought is the essence of where you are now."

2017-12-06 18:19:19

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 06.12.2017 20:44, Benjamin LaHaise wrote:
> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
>>>> This memory lives in the page cache/lru; it is visible to the shrinker,
>>>> which will unmap these pages for no reason on memory shortage. IOW, aio
>>>> fools the kernel: this memory looks reclaimable but it is not. And we
>>>> only do this for migration.
>>>
>>> It's the same as any other memory that's mlock()ed into RAM.
>>
>> No. Again, this memory is not properly accounted, and unlike mlock()ed
>> memory it is visible to the shrinker, which will do unnecessary work on
>> memory shortage, which in turn will lead to unnecessary page faults.
>>
>> So let me repeat: shouldn't we at least do mapping_set_unevictable() in
>> aio_private_file()?
>
> Send a patch then! I don't know why you're asking rather than sending a
> patch to do this if you think it is needed.
>
>>>> triggers OOM-killer which kills sshd and other daemons on my machine.
>>>> These pages were not even faulted in (or the shrinker can unmap them),
>>>> the kernel can not know who should be blamed.
>>>
>>> The OOM-killer killed the wrong process: News at 11.
>>
>> Well. I do not think we should blame OOM-killer in this case. But as I
>> said this is not a bug-report or something like this, I agree this is
>> a minor issue.
>
> I do think the OOM-killer is doing the wrong thing here. If process X is
> the only one that is allocating gobs of memory, why kill process Y that
> hasn't allocated memory in minutes or hours just because it is bigger?

I assume that if a process hasn't allocated memory in minutes or hours,
then most probably all of its evictable memory has already been
moved to swap, as its pages were the last in the LRU list.

> We're not perfect at tracking who owns memory allocations, so why not
> factor in memory allocation rate when deciding which process to kill? We
> keep throwing bandaids on the OOM-killer by annotating allocations, and we
> keep missing the annotation of allocations. Doesn't sound like a real fix
> for the underlying problem to me when a better heuristic would solve the
> current problem and probably several other future instances of the same
> problem.

Kirill

2017-12-06 18:30:39

by Benjamin LaHaise

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On Wed, Dec 06, 2017 at 09:19:09PM +0300, Kirill Tkhai wrote:
> On 06.12.2017 20:44, Benjamin LaHaise wrote:
> > On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> >>>> This memory lives in the page cache/lru; it is visible to the shrinker,
> >>>> which will unmap these pages for no reason on memory shortage. IOW, aio
> >>>> fools the kernel: this memory looks reclaimable but it is not. And we
> >>>> only do this for migration.
> >>>
> >>> It's the same as any other memory that's mlock()ed into RAM.
> >>
> >> No. Again, this memory is not properly accounted, and unlike mlock()ed
> >> memory it is visible to the shrinker, which will do unnecessary work on
> >> memory shortage, which in turn will lead to unnecessary page faults.
> >>
> >> So let me repeat: shouldn't we at least do mapping_set_unevictable() in
> >> aio_private_file()?
> >
> > Send a patch then! I don't know why you're asking rather than sending a
> > patch to do this if you think it is needed.
> >
> >>>> triggers OOM-killer which kills sshd and other daemons on my machine.
> >>>> These pages were not even faulted in (or the shrinker can unmap them),
> >>>> the kernel can not know who should be blamed.
> >>>
> >>> The OOM-killer killed the wrong process: News at 11.
> >>
> >> Well. I do not think we should blame OOM-killer in this case. But as I
> >> said this is not a bug-report or something like this, I agree this is
> >> a minor issue.
> >
> > I do think the OOM-killer is doing the wrong thing here. If process X is
> > the only one that is allocating gobs of memory, why kill process Y that
> > hasn't allocated memory in minutes or hours just because it is bigger?
>
> I assume that if a process hasn't allocated memory in minutes or hours,
> then most probably all of its evictable memory has already been
> moved to swap, as its pages were the last in the LRU list.

That assumption would be incorrect. Just because memory isn't being
allocated doesn't mean that it isn't being used - the two are very
different things. Most of the embedded systems I work on allocate memory
at startup and then use it until the system gets restarted or rebooted.
By only doing memory allocations at startup, code is simplified,
performance is more predictable and failure modes are constrained.

-ben
--
"Thought is the essence of where you are now."

2017-12-06 19:38:03

by Kirill Tkhai

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 06.12.2017 21:30, Benjamin LaHaise wrote:
> On Wed, Dec 06, 2017 at 09:19:09PM +0300, Kirill Tkhai wrote:
>> On 06.12.2017 20:44, Benjamin LaHaise wrote:
>>> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
>>>>>> This memory lives in the page cache/lru; it is visible to the shrinker,
>>>>>> which will unmap these pages for no reason on memory shortage. IOW, aio
>>>>>> fools the kernel: this memory looks reclaimable but it is not. And we
>>>>>> only do this for migration.
>>>>>
>>>>> It's the same as any other memory that's mlock()ed into RAM.
>>>>
>>>> No. Again, this memory is not properly accounted, and unlike mlock()ed
>>>> memory it is visible to the shrinker, which will do unnecessary work on
>>>> memory shortage, which in turn will lead to unnecessary page faults.
>>>>
>>>> So let me repeat: shouldn't we at least do mapping_set_unevictable() in
>>>> aio_private_file()?
>>>
>>> Send a patch then! I don't know why you're asking rather than sending a
>>> patch to do this if you think it is needed.
>>>
>>>>>> triggers OOM-killer which kills sshd and other daemons on my machine.
>>>>>> These pages were not even faulted in (or the shrinker can unmap them),
>>>>>> the kernel can not know who should be blamed.
>>>>>
>>>>> The OOM-killer killed the wrong process: News at 11.
>>>>
>>>> Well. I do not think we should blame OOM-killer in this case. But as I
>>>> said this is not a bug-report or something like this, I agree this is
>>>> a minor issue.
>>>
>>> I do think the OOM-killer is doing the wrong thing here. If process X is
>>> the only one that is allocating gobs of memory, why kill process Y that
>>> hasn't allocated memory in minutes or hours just because it is bigger?
>>
>> I assume that if a process hasn't allocated memory in minutes or hours,
>> then most probably all of its evictable memory has already been
>> moved to swap, as its pages were the last in the LRU list.
>
> That assumption would be incorrect. Just because memory isn't being
> allocated doesn't mean that it isn't being used - the two are very
> different things.

You are right: allocation and use are different things. But the fact that
memory is being used does not mean it can't be swapped to disk. And as
the memory is still being used, the swapped pages are read back into RAM
when page faults happen. And as page faults happen, the task allocates
new pages to place the swapped-in data in RAM. So memory allocations do
happen, and honestly this does not seem to be the case we started to discuss.

Of course, maybe a tracker of used memory has been implemented, which
prevents used pages from being swapped to disk. But when I last dived
in there, there was only the PROT_NONE-based tracker used for NUMA
balancing. Maybe you or someone else has newer information about this.

> Most of the embedded systems I work on allocate memory
> at startup and then use it until the system gets restarted or rebooted.
> By only doing memory allocations at startup, code is simplified,
> performance is more predictable and failure modes are constrained.

Yeah, and many realtime systems use the same model. But, strange as it
may sound, aio is also popular in many other configurations :)

Kirill

2017-12-07 13:44:52

by Oleg Nesterov

Subject: Re: [PATCH 0/5] blkcg: Limit maximum number of aio requests available for cgroup

On 12/06, Benjamin LaHaise wrote:
>
> On Wed, Dec 06, 2017 at 06:32:56PM +0100, Oleg Nesterov wrote:
> >
> > No. Again, this memory is not properly accounted, and unlike mlock()ed
> > memory it is visible to the shrinker, which will do unnecessary work on
> > memory shortage, which in turn will lead to unnecessary page faults.
> >
> > So let me repeat: shouldn't we at least do mapping_set_unevictable() in
> > aio_private_file()?

... and probably account this memory in ->pinned_vm

> Send a patch then!

I have no idea how to test this change, and personally I don't really care
about aio.

> I don't know why you're asking rather than sending a
> patch to do this if you think it is needed.

Because you are the maintainer, and I naively thought it is always fine to
ask the maintainer when you think the code is incorrect or sub-optimal.
Sorry for bothering you.

> > > > triggers OOM-killer which kills sshd and other daemons on my machine.
> > > > These pages were not even faulted in (or the shrinker can unmap them),
> > > > the kernel can not know who should be blamed.
> > >
> > > The OOM-killer killed the wrong process: News at 11.
> >
> > Well. I do not think we should blame OOM-killer in this case. But as I
> > said this is not a bug-report or something like this, I agree this is
> > a minor issue.
>
> I do think the OOM-killer is doing the wrong thing here. If process X is
> the only one that is allocating gobs of memory,

aio_setup_ring() does find_or_create_page(file->f_mapping, ...); this adds
the page to the page cache. Again, this memory looks _reclaimable_ but it
is not, because ctx->ring_pages holds a reference.

I do not understand how we can blame the OOM-killer; it should not kill the
task which blows up the page cache, and that is how io_setup() looks to the vm.

Quite possibly I missed something, please correct me.

Oleg.