2016-10-07 10:46:19

by Angel Shtilianov

Subject: [PATCH] inotify: Convert to using per-namespace limits

This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user (or application) to
exhaust the limits of all the others.

Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace.

Signed-off-by: Nikolay Borisov <[email protected]>
---
Hello Eric,

I saw you've finally sent your pull request for 4.9 and it
includes your implementation of the ucount infrastructure. So
here is my respin of the inotify patches using that.
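For reference, a minimal sketch of the charge/release pattern the patch
below builds on (inc_ucount/dec_ucount as introduced by the ucount
infrastructure; the sketch is illustrative, not part of the patch):

    struct ucounts *uc;

    /* Charge one inotify instance against (namespace, euid).
     * inc_ucount() returns NULL once the per-namespace limit
     * ns->ucount_max[UCOUNT_INOTIFY_INSTANCES] would be exceeded. */
    uc = inc_ucount(current_user_ns(), current_euid(),
                    UCOUNT_INOTIFY_INSTANCES);
    if (!uc)
            return ERR_PTR(-EMFILE);    /* limit hit in this userns */

    /* ... the inotify instance lives here ... */

    /* Undo the charge when the instance is torn down. */
    dec_ucount(uc, UCOUNT_INOTIFY_INSTANCES);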

 fs/notify/inotify/inotify.h          | 17 +++++++++++++
 fs/notify/inotify/inotify_fsnotify.c |  6 ++---
 fs/notify/inotify/inotify_user.c     | 46 ++++++++++++------------------------
 include/linux/fsnotify_backend.h     |  3 ++-
 include/linux/sched.h                |  4 ----
 include/linux/user_namespace.h       |  4 ++++
 kernel/ucount.c                      |  6 ++++-
 7 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index ed855ef6f077..b5536f8ad3e0 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
const unsigned char *file_name, u32 cookie);

extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+#ifdef CONFIG_INOTIFY_USER
+static void dec_inotify_instances(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
+}
+
+static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
+{
+ return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
+}
+
+static void dec_inotify_watches(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
+}
+#endif
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 2cd900c2c737..1e6b3cfc2cfd 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
/* ideally the idr is empty and we won't hit the BUG in the callback */
idr_for_each(&group->inotify_data.idr, idr_callback, group);
idr_destroy(&group->inotify_data.idr);
- if (group->inotify_data.user) {
- atomic_dec(&group->inotify_data.user->inotify_devs);
- free_uid(group->inotify_data.user);
- }
+ if (group->inotify_data.ucounts)
+ dec_inotify_instances(group->inotify_data.ucounts);
}

static void inotify_free_event(struct fsnotify_event *fsn_event)
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b8d08d0d0a4d..cfe48bb511c6 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -44,10 +44,8 @@

#include <asm/ioctls.h>

-/* these are configurable via /proc/sys/fs/inotify/ */
-static int inotify_max_user_instances __read_mostly;
+/* configurable via /proc/sys/fs/inotify/ */
static int inotify_max_queued_events __read_mostly;
-static int inotify_max_user_watches __read_mostly;

static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;

@@ -59,22 +57,6 @@ static int zero;

struct ctl_table inotify_table[] = {
{
- .procname = "max_user_instances",
- .data = &inotify_max_user_instances,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &zero,
- },
- {
- .procname = "max_user_watches",
- .data = &inotify_max_user_watches,
- .maxlen = sizeof(int),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = &zero,
- },
- {
.procname = "max_queued_events",
.data = &inotify_max_queued_events,
.maxlen = sizeof(int),
@@ -500,7 +482,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
/* remove this mark from the idr */
inotify_remove_from_idr(group, i_mark);

- atomic_dec(&group->inotify_data.user->inotify_watches);
+ dec_inotify_watches(group->inotify_data.ucounts);
}

/* ding dong the mark is dead */
@@ -584,14 +566,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
tmp_i_mark->fsn_mark.mask = mask;
tmp_i_mark->wd = -1;

- ret = -ENOSPC;
- if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
- goto out_err;
-
ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
if (ret)
goto out_err;

+ /* increment the number of watches the user has */
+ if (!inc_inotify_watches(group->inotify_data.ucounts)) {
+ inotify_remove_from_idr(group, tmp_i_mark);
+ ret = -ENOSPC;
+ goto out_err;
+ }
+
/* we are on the idr, now get on the inode */
ret = fsnotify_add_mark_locked(&tmp_i_mark->fsn_mark, group, inode,
NULL, 0);
@@ -601,8 +586,6 @@ static int inotify_new_watch(struct fsnotify_group *group,
goto out_err;
}

- /* increment the number of watches the user has */
- atomic_inc(&group->inotify_data.user->inotify_watches);

/* return the watch descriptor for this new mark */
ret = tmp_i_mark->wd;
@@ -653,10 +636,11 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)

spin_lock_init(&group->inotify_data.idr_lock);
idr_init(&group->inotify_data.idr);
- group->inotify_data.user = get_current_user();
+ group->inotify_data.ucounts = inc_ucount(current_user_ns(),
+ current_euid(),
+ UCOUNT_INOTIFY_INSTANCES);

- if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
- inotify_max_user_instances) {
+ if (!group->inotify_data.ucounts) {
fsnotify_destroy_group(group);
return ERR_PTR(-EMFILE);
}
@@ -819,8 +803,8 @@ static int __init inotify_user_setup(void)
inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);

inotify_max_queued_events = 16384;
- inotify_max_user_instances = 128;
- inotify_max_user_watches = 8192;
+ init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
+ init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;

return 0;
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 58205f33af02..892959de1162 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -16,6 +16,7 @@
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/atomic.h>
+#include <linux/user_namespace.h>

/*
* IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
@@ -169,7 +170,7 @@ struct fsnotify_group {
struct inotify_group_private_data {
spinlock_t idr_lock;
struct idr idr;
- struct user_struct *user;
+ struct ucounts *ucounts;
} inotify_data;
#endif
#ifdef CONFIG_FANOTIFY
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..35230a2c73ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -840,10 +840,6 @@ struct user_struct {
atomic_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
-#ifdef CONFIG_INOTIFY_USER
- atomic_t inotify_watches; /* How many inotify watches does this user have? */
- atomic_t inotify_devs; /* How many inotify devs does this user have opened? */
-#endif
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
#endif
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb209d4523f5..cbcac7a5ec41 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -32,6 +32,10 @@ enum ucount_type {
UCOUNT_NET_NAMESPACES,
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
+#ifdef CONFIG_INOTIFY_USER
+ UCOUNT_INOTIFY_INSTANCES,
+ UCOUNT_INOTIFY_WATCHES,
+#endif
UCOUNT_COUNTS,
};

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 9d20d5dd298a..a6bcea47306b 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -57,7 +57,7 @@ static struct ctl_table_root set_root = {

static int zero = 0;
static int int_max = INT_MAX;
-#define UCOUNT_ENTRY(name) \
+#define UCOUNT_ENTRY(name) \
{ \
.procname = name, \
.maxlen = sizeof(int), \
@@ -74,6 +74,10 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_net_namespaces"),
UCOUNT_ENTRY("max_mnt_namespaces"),
UCOUNT_ENTRY("max_cgroup_namespaces"),
+#ifdef CONFIG_INOTIFY_USER
+ UCOUNT_ENTRY("max_inotify_instances"),
+ UCOUNT_ENTRY("max_inotify_watches"),
+#endif
{ }
};
#endif /* CONFIG_SYSCTL */
--
2.5.0


2016-10-07 18:16:28

by Eric W. Biederman

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> This patchset converts inotify to using the newly introduced
> per-userns sysctl infrastructure.
>
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. pointing to the same user_struct) the inotify limits
> are going to be shared as well, allowing one user (or application) to
> exhaust the limits of all the others.
>
> Fix this by switching the inotify sysctls to using the
> per-namespace/per-user limits. This will allow the server admin to
> set sensible global limits, which can further be tuned inside every
> individual user namespace.
>
> Signed-off-by: Nikolay Borisov <[email protected]>
> ---
> Hello Eric,
>
> I saw you've finally sent your pull request for 4.9 and it
> includes your implementation of the ucount infrastructure. So
> here is my respin of the inotify patches using that.

Thanks. I will take a good hard look at this after -rc1 when things are
stable enough that I can start a new development branch.

I am a little concerned that the old sysctls have gone away. If no one
cares it is fine, but if someone depends on them existing that may count
as an unnecessary userspace regression. But otherwise skimming through
this code it looks good.

Eric


> fs/notify/inotify/inotify.h          | 17 +++++++++++++
> fs/notify/inotify/inotify_fsnotify.c |  6 ++---
> fs/notify/inotify/inotify_user.c     | 46 ++++++++++++------------------------
> include/linux/fsnotify_backend.h     |  3 ++-
> include/linux/sched.h                |  4 ----
> include/linux/user_namespace.h       |  4 ++++
> kernel/ucount.c                      |  6 ++++-
> 7 files changed, 45 insertions(+), 41 deletions(-)
>
> diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
> index ed855ef6f077..b5536f8ad3e0 100644
> --- a/fs/notify/inotify/inotify.h
> +++ b/fs/notify/inotify/inotify.h
> @@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
> const unsigned char *file_name, u32 cookie);
>
> extern const struct fsnotify_ops inotify_fsnotify_ops;
> +
> +#ifdef CONFIG_INOTIFY_USER
> +static void dec_inotify_instances(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
> +}
> +
> +static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
> +{
> + return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
> +}
> +
> +static void dec_inotify_watches(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
> +}
> +#endif
> diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
> index 2cd900c2c737..1e6b3cfc2cfd 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
> /* ideally the idr is empty and we won't hit the BUG in the callback */
> idr_for_each(&group->inotify_data.idr, idr_callback, group);
> idr_destroy(&group->inotify_data.idr);
> - if (group->inotify_data.user) {
> - atomic_dec(&group->inotify_data.user->inotify_devs);
> - free_uid(group->inotify_data.user);
> - }
> + if (group->inotify_data.ucounts)
> + dec_inotify_instances(group->inotify_data.ucounts);
> }
>
> static void inotify_free_event(struct fsnotify_event *fsn_event)
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index b8d08d0d0a4d..cfe48bb511c6 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -44,10 +44,8 @@
>
> #include <asm/ioctls.h>
>
> -/* these are configurable via /proc/sys/fs/inotify/ */
> -static int inotify_max_user_instances __read_mostly;
> +/* configurable via /proc/sys/fs/inotify/ */
> static int inotify_max_queued_events __read_mostly;
> -static int inotify_max_user_watches __read_mostly;
>
> static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
>
> @@ -59,22 +57,6 @@ static int zero;
>
> struct ctl_table inotify_table[] = {
> {
> - .procname = "max_user_instances",
> - .data = &inotify_max_user_instances,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec_minmax,
> - .extra1 = &zero,
> - },
> - {
> - .procname = "max_user_watches",
> - .data = &inotify_max_user_watches,
> - .maxlen = sizeof(int),
> - .mode = 0644,
> - .proc_handler = proc_dointvec_minmax,
> - .extra1 = &zero,
> - },
> - {
> .procname = "max_queued_events",
> .data = &inotify_max_queued_events,
> .maxlen = sizeof(int),
> @@ -500,7 +482,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
> /* remove this mark from the idr */
> inotify_remove_from_idr(group, i_mark);
>
> - atomic_dec(&group->inotify_data.user->inotify_watches);
> + dec_inotify_watches(group->inotify_data.ucounts);
> }
>
> /* ding dong the mark is dead */
> @@ -584,14 +566,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
> tmp_i_mark->fsn_mark.mask = mask;
> tmp_i_mark->wd = -1;
>
> - ret = -ENOSPC;
> - if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
> - goto out_err;
> -
> ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
> if (ret)
> goto out_err;
>
> + /* increment the number of watches the user has */
> + if (!inc_inotify_watches(group->inotify_data.ucounts)) {
> + inotify_remove_from_idr(group, tmp_i_mark);
> + ret = -ENOSPC;
> + goto out_err;
> + }
> +
> /* we are on the idr, now get on the inode */
> ret = fsnotify_add_mark_locked(&tmp_i_mark->fsn_mark, group, inode,
> NULL, 0);
> @@ -601,8 +586,6 @@ static int inotify_new_watch(struct fsnotify_group *group,
> goto out_err;
> }
>
> - /* increment the number of watches the user has */
> - atomic_inc(&group->inotify_data.user->inotify_watches);
>
> /* return the watch descriptor for this new mark */
> ret = tmp_i_mark->wd;
> @@ -653,10 +636,11 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
>
> spin_lock_init(&group->inotify_data.idr_lock);
> idr_init(&group->inotify_data.idr);
> - group->inotify_data.user = get_current_user();
> + group->inotify_data.ucounts = inc_ucount(current_user_ns(),
> + current_euid(),
> + UCOUNT_INOTIFY_INSTANCES);
>
> - if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
> - inotify_max_user_instances) {
> + if (!group->inotify_data.ucounts) {
> fsnotify_destroy_group(group);
> return ERR_PTR(-EMFILE);
> }
> @@ -819,8 +803,8 @@ static int __init inotify_user_setup(void)
> inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
>
> inotify_max_queued_events = 16384;
> - inotify_max_user_instances = 128;
> - inotify_max_user_watches = 8192;
> + init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
> + init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;
>
> return 0;
> }
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 58205f33af02..892959de1162 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -16,6 +16,7 @@
> #include <linux/spinlock.h>
> #include <linux/types.h>
> #include <linux/atomic.h>
> +#include <linux/user_namespace.h>
>
> /*
> * IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
> @@ -169,7 +170,7 @@ struct fsnotify_group {
> struct inotify_group_private_data {
> spinlock_t idr_lock;
> struct idr idr;
> - struct user_struct *user;
> + struct ucounts *ucounts;
> } inotify_data;
> #endif
> #ifdef CONFIG_FANOTIFY
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 62c68e513e39..35230a2c73ac 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -840,10 +840,6 @@ struct user_struct {
> atomic_t __count; /* reference count */
> atomic_t processes; /* How many processes does this user have? */
> atomic_t sigpending; /* How many pending signals does this user have? */
> -#ifdef CONFIG_INOTIFY_USER
> - atomic_t inotify_watches; /* How many inotify watches does this user have? */
> - atomic_t inotify_devs; /* How many inotify devs does this user have opened? */
> -#endif
> #ifdef CONFIG_FANOTIFY
> atomic_t fanotify_listeners;
> #endif
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb209d4523f5..cbcac7a5ec41 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -32,6 +32,10 @@ enum ucount_type {
> UCOUNT_NET_NAMESPACES,
> UCOUNT_MNT_NAMESPACES,
> UCOUNT_CGROUP_NAMESPACES,
> +#ifdef CONFIG_INOTIFY_USER
> + UCOUNT_INOTIFY_INSTANCES,
> + UCOUNT_INOTIFY_WATCHES,
> +#endif
> UCOUNT_COUNTS,
> };
>
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 9d20d5dd298a..a6bcea47306b 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -57,7 +57,7 @@ static struct ctl_table_root set_root = {
>
> static int zero = 0;
> static int int_max = INT_MAX;
> -#define UCOUNT_ENTRY(name) \
> +#define UCOUNT_ENTRY(name) \
> { \
> .procname = name, \
> .maxlen = sizeof(int), \
> @@ -74,6 +74,10 @@ static struct ctl_table user_table[] = {
> UCOUNT_ENTRY("max_net_namespaces"),
> UCOUNT_ENTRY("max_mnt_namespaces"),
> UCOUNT_ENTRY("max_cgroup_namespaces"),
> +#ifdef CONFIG_INOTIFY_USER
> + UCOUNT_ENTRY("max_inotify_instances"),
> + UCOUNT_ENTRY("max_inotify_watches"),
> +#endif
> { }
> };
> #endif /* CONFIG_SYSCTL */

2016-10-09 05:55:52

by kernel test robot

Subject: [lkp] [inotify] 1109954e99: BUG kmalloc-512 (Not tainted): Freepointer corrupt


FYI, we noticed the following commit:

https://github.com/0day-ci/linux Nikolay-Borisov/inotify-Convert-to-using-per-namespace-limits/20161007-184900
commit 1109954e99c57a13814a9c1ebb3f01c53b48091f ("inotify: Convert to using per-namespace limits")

in testcase: trinity
with following parameters:

runtime: 300s


Trinity is a Linux system call fuzz tester.


on test machine: qemu-system-x86_64 -enable-kvm -cpu IvyBridge -m 360M

caused below changes:


+----------------------------------------------------------------------------+------------+------------+
| | 3477d168ba | 1109954e99 |
+----------------------------------------------------------------------------+------------+------------+
| boot_successes | 19 | 5 |
| boot_failures | 11 | 29 |
| invoked_oom-killer:gfp_mask=0x | 8 | 3 |
| Mem-Info | 8 | 3 |
| BUG:kernel_reboot-without-warning_in_test_stage | 1 | 3 |
| Kernel_panic-not_syncing:VFS:Unable_to_mount_root_fs_on_unknown-block(#,#) | 2 | 2 |
| calltrace:prepare_namespace | 2 | 2 |
| BUG_kmalloc-#(Not_tainted):Freepointer_corrupt | 0 | 18 |
| INFO:Allocated_in_setup_userns_sysctls_age=#cpu=#pid= | 0 | 18 |
| INFO:Freed_in_qlist_free_all_age=#cpu=#pid= | 0 | 15 |
| INFO:Slab#objects=#used=#fp=#flags= | 0 | 14 |
| INFO:Object#@offset=#fp= | 0 | 18 |
| calltrace:SyS_lgetxattr | 0 | 1 |
| RIP:__kmalloc | 0 | 1 |
| calltrace:virtio_pci_driver_init | 0 | 4 |
| Kernel_panic-not_syncing:softlockup:hung_tasks | 0 | 4 |
| calltrace:SyS_clone | 0 | 11 |
| calltrace:SyS_listxattr | 0 | 1 |
| BUG_kmalloc-#(Tainted:G_B):Freepointer_corrupt | 0 | 2 |
| INFO:Slab#objects=#used=#fp=0x(null)flags= | 0 | 4 |
| RIP:memcmp | 0 | 1 |
| RIP:unwind_get_return_address | 0 | 1 |
| RIP:_raw_spin_unlock_irqrestore | 0 | 1 |
| calltrace:SyS_add_key | 0 | 1 |
| calltrace:SyS_fchownat | 0 | 1 |
| calltrace:SyS_chown | 0 | 1 |
| calltrace:SyS_chown16 | 0 | 1 |
| calltrace:SyS_setfsgid | 0 | 1 |
| calltrace:SyS_setfsgid16 | 0 | 1 |
| calltrace:SyS_fgetxattr | 0 | 1 |
| calltrace:SyS_setgid | 0 | 1 |
+----------------------------------------------------------------------------+------------+------------+



[ 35.734332] VFS: Warning: trinity-c0 using old stat() call. Recompile your binary.
[ 35.757516] VFS: Warning: trinity-c2 using old stat() call. Recompile your binary.
[ 39.409080] =============================================================================
[ 39.411116] BUG kmalloc-512 (Not tainted): Freepointer corrupt
[ 39.414680] -----------------------------------------------------------------------------
[ 39.414680]
[ 39.417417] Disabling lock debugging due to kernel taint
[ 39.418853] INFO: Allocated in setup_userns_sysctls+0x43/0xac age=25 cpu=0 pid=1716
[ 39.431035] INFO: Freed in qlist_free_all+0x7e/0xca age=36 cpu=0 pid=1719
[ 39.448221] INFO: Slab 0xffffea00002e0a00 objects=9 used=7 fp=0xffff88000b829b08 flags=0x4000000000004081
[ 39.450623] INFO: Object 0xffff88000b8286c8 @offset=1736 fp=0xffff88000c3781b0
[ 39.450623]
[ 39.453102] Redzone ffff88000b8286c0: cc cc cc cc cc cc cc cc ........
[ 39.474115] Object ffff88000b8286c8: 08 80 37 0c 00 88 ff ff 90 81 37 0c 00 88 ff ff ..7.......7.....
[ 39.476523] Object ffff88000b8286d8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.478798] Object ffff88000b8286e8: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.481183] Object ffff88000b8286f8: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.483548] Object ffff88000b828708: 6e 28 40 9b ff ff ff ff 94 81 37 0c 00 88 ff ff n(@.......7.....
[ 39.485836] Object ffff88000b828718: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.488232] Object ffff88000b828728: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.490612] Object ffff88000b828738: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.493044] Object ffff88000b828748: 81 28 40 9b ff ff ff ff 98 81 37 0c 00 88 ff ff .(@.......7.....
[ 39.495350] Object ffff88000b828758: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.497721] Object ffff88000b828768: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.500034] Object ffff88000b828778: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.502325] Object ffff88000b828788: 94 28 40 9b ff ff ff ff 9c 81 37 0c 00 88 ff ff .(@.......7.....
[ 39.504549] Object ffff88000b828798: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.506834] Object ffff88000b8287a8: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.509108] Object ffff88000b8287b8: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.511379] Object ffff88000b8287c8: a7 28 40 9b ff ff ff ff a0 81 37 0c 00 88 ff ff .(@.......7.....
[ 39.513665] Object ffff88000b8287d8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.515964] Object ffff88000b8287e8: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.518230] Object ffff88000b8287f8: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.520508] Object ffff88000b828808: ba 28 40 9b ff ff ff ff a4 81 37 0c 00 88 ff ff .(@.......7.....
[ 39.522820] Object ffff88000b828818: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.525094] Object ffff88000b828828: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.527390] Object ffff88000b828838: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.529689] Object ffff88000b828848: cd 28 40 9b ff ff ff ff a8 81 37 0c 00 88 ff ff .(@.......7.....
[ 39.531969] Object ffff88000b828858: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 39.534248] Object ffff88000b828868: ff 5d c9 9a ff ff ff ff 00 00 00 00 00 00 00 00 .]..............
[ 39.536541] Object ffff88000b828878: 30 ae 79 9b ff ff ff ff 70 b6 64 9b ff ff ff ff 0.y.....p.d.....
[ 39.538845] Object ffff88000b828888: 00 00 00 00 00 00 00 00 ac 81 37 0c 00 88 ff ff ..........7.....
[ 39.541123] Object ffff88000b828898: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 39.543355] Object ffff88000b8288a8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 39.545624] Object ffff88000b8288b8: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 39.547908] Redzone ffff88000b8288c8: cc cc cc cc cc cc cc cc ........
[ 39.550043] Padding ffff88000b828a18: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[ 39.552214] CPU: 0 PID: 1717 Comm: trinity-c1 Tainted: G B 4.8.0-09432-g1109954 #1
[ 39.554401] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 39.556611] ffff88000b1377b8 ffffffff9af67c6e ffff88000b1377e8 ffffffff9ad48ae9
[ 39.559019] ffff880010402cc0 ffffea00002e0a00 ffff88000b8286c8 0000000000000350
[ 39.561466] ffff88000b137818 ffffffff9ad48c30 ffff88000b8286c8 ffff880010402cc0
[ 39.563880] Call Trace:
[ 39.564673] [<ffffffff9af67c6e>] dump_stack+0x19/0x1b
[ 39.565973] [<ffffffff9ad48ae9>] print_trailer+0x175/0x17e
[ 39.567355] [<ffffffff9ad48c30>] object_err+0x35/0x3d
[ 39.568656] [<ffffffff9ad48fe1>] check_object+0x1db/0x1ff
[ 39.570038] [<ffffffff9ad48c82>] ? on_freelist+0x4a/0x1ce
[ 39.571401] [<ffffffff9ad4e6ca>] ? qlist_free_all+0x7e/0xca
[ 39.572785] [<ffffffff9ad4e6ca>] ? qlist_free_all+0x7e/0xca
[ 39.574180] [<ffffffff9ad4adb3>] free_debug_processing+0xbf/0x1ef
[ 39.575681] [<ffffffff9ad4af1d>] __slab_free+0x3a/0x27f
[ 39.577023] [<ffffffff9ad4b62b>] ___cache_free+0x9c/0xa3
[ 39.578497] [<ffffffff9ad4e6e7>] qlist_free_all+0x9b/0xca
[ 39.579854] [<ffffffff9ad4ea00>] quarantine_reduce+0x214/0x226
[ 39.581239] [<ffffffff9ad4896c>] ? init_object+0x73/0x7b
[ 39.582570] [<ffffffff9ad4acaf>] ? alloc_debug_processing+0xb6/0xfb
[ 39.584085] [<ffffffff9ad4d493>] kasan_kmalloc+0x2b/0xac
[ 39.585427] [<ffffffff9ad4d523>] kasan_slab_alloc+0xf/0x11
[ 39.586799] [<ffffffff9ad49c22>] slab_post_alloc_hook+0x38/0x4a
[ 39.588251] [<ffffffff9ac8ee0e>] ? copy_process+0x12a/0x14ae
[ 39.589643] [<ffffffff9ad4bf7c>] kmem_cache_alloc+0xc4/0xd5
[ 39.591023] [<ffffffff9ac8ee0e>] copy_process+0x12a/0x14ae
[ 39.592402] [<ffffffff9ac8eb1a>] ? __mmdrop+0xc4/0xd1
[ 39.593718] [<ffffffff9ad2c65c>] ? wp_page_reuse+0x54/0xbf
[ 39.595093] [<ffffffff9ad2ea24>] ? do_wp_page+0x2a4/0x413


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml



Thanks,
Xiaolong


Attachments:
(No filename) (10.62 kB)
config-4.8.0-09432-g1109954 (86.50 kB)
job-script (3.67 kB)
dmesg.xz (10.88 kB)
job.yaml (2.88 kB)

2016-10-10 06:44:52

by Angel Shtilianov

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits



On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
> Nikolay Borisov <[email protected]> writes:
>
>> This patchset converts inotify to using the newly introduced
>> per-userns sysctl infrastructure.
>>
>> Currently the inotify instances/watches are being accounted in the
>> user_struct structure. This means that in setups where multiple
>> users in unprivileged containers map to the same underlying
>> real user (i.e. pointing to the same user_struct) the inotify limits
>> are going to be shared as well, allowing one user (or application) to
>> exhaust the limits of all the others.
>>
>> Fix this by switching the inotify sysctls to using the
>> per-namespace/per-user limits. This will allow the server admin to
>> set sensible global limits, which can further be tuned inside every
>> individual user namespace.
>>
>> Signed-off-by: Nikolay Borisov <[email protected]>
>> ---
>> Hello Eric,
>>
>> I saw you've finally sent your pull request for 4.9 and it
>> includes your implementation of the ucount infrastructure. So
>> here is my respin of the inotify patches using that.
>
> Thanks. I will take a good hard look at this after -rc1 when things are
> stable enough that I can start a new development branch.
>
> I am a little concerned that the old sysctls have gone away. If no one
> cares it is fine, but if someone depends on them existing that may count
> as an unnecessary userspace regression. But otherwise skimming through
> this code it looks good.

So indeed this is a real issue and I meant to write something about
it. Anyway, in order to preserve those sysctls, what can be done is to
hook them up with a custom sysctl handler taking the ns from the proc
mount and the euid of current? I think this is a good approach, but
let's wait and see if anyone will have objections to completely
eliminating those sysctls.
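A rough sketch of that idea (handler name hypothetical, and simplified
to resolving the namespace via current_user_ns() rather than the proc
mount): copy the ctl_table and redirect .data to the caller's own limit
slot before handing off to the stock handler.

    static int inotify_dointvec_userns(struct ctl_table *table, int write,
                                       void __user *buffer, size_t *lenp,
                                       loff_t *ppos)
    {
            struct ctl_table t = *table;

            /* Point .data at the limit slot of the caller's user
             * namespace instead of a single global integer, then let
             * the stock handler do parsing and bounds checking. */
            t.data = &current_user_ns()->ucount_max[UCOUNT_INOTIFY_INSTANCES];
            return proc_dointvec_minmax(&t, write, buffer, lenp, ppos);
    }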


>
[SNIP]

2016-10-10 16:40:50

by Jan Kara

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits

On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
> > Nikolay Borisov <[email protected]> writes:
> >
> >> This patchset converts inotify to using the newly introduced
> >> per-userns sysctl infrastructure.
> >>
> >> Currently the inotify instances/watches are being accounted in the
> >> user_struct structure. This means that in setups where multiple
> >> users in unprivileged containers map to the same underlying
> >> real user (i.e. pointing to the same user_struct) the inotify limits
> >> are going to be shared as well, allowing one user (or application) to
> >> exhaust the limits of all the others.
> >>
> >> Fix this by switching the inotify sysctls to using the
> >> per-namespace/per-user limits. This will allow the server admin to
> >> set sensible global limits, which can further be tuned inside every
> >> individual user namespace.
> >>
> >> Signed-off-by: Nikolay Borisov <[email protected]>
> >> ---
> >> Hello Eric,
> >>
> >> I saw you've finally sent your pull request for 4.9 and it
> >> includes your implementatino of the ucount infrastructure. So
> >> here is my respin of the inotify patches using that.
> >
> > Thanks. I will take a good hard look at this after -rc1 when things are
> > stable enough that I can start a new development branch.
> >
> > I am a little concerned that the old sysctls have gone away. If no one
> > cares it is fine, but if someone depends on them existing that may count
> > as an unnecessary userspace regression. But otherwise skimming through
> > this code it looks good.
>
> So indeed this is a real issue and I meant to write something about
> it. Anyway, in order to preserve those sysctls, what can be done is to
> hook them up with a custom sysctl handler taking the ns from the proc
> mount and the euid of current? I think this is a good approach, but
> let's wait and see if anyone will have objections to completely
> eliminating those sysctls.

Well, I believe just discarding those sysctls is not an option - I'm pretty
sure there are scripts out there which tune these sysctls and those would
stop working. IMO not acceptable regression.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-10 20:51:21

by Eric W. Biederman

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits

Jan Kara <[email protected]> writes:

> On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
>> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
>> > Nikolay Borisov <[email protected]> writes:
>> >
>> >> This patchset converts inotify to using the newly introduced
>> >> per-userns sysctl infrastructure.
>> >>
>> >> Currently the inotify instances/watches are being accounted in the
>> >> user_struct structure. This means that in setups where multiple
>> >> users in unprivileged containers map to the same underlying
>> >> real user (i.e. pointing to the same user_struct) the inotify limits
>> >> are going to be shared as well, allowing one user (or application) to
>> >> exhaust the limits of all the others.
>> >>
>> >> Fix this by switching the inotify sysctls to using the
>> >> per-namespace/per-user limits. This will allow the server admin to
>> >> set sensible global limits, which can further be tuned inside every
>> >> individual user namespace.
>> >>
>> >> Signed-off-by: Nikolay Borisov <[email protected]>
>> >> ---
>> >> Hello Eric,
>> >>
>> >> I saw you've finally sent your pull request for 4.9 and it
>> >> includes your implementation of the ucount infrastructure. So
>> >> here is my respin of the inotify patches using that.
>> >
>> > Thanks. I will take a good hard look at this after -rc1 when things are
>> > stable enough that I can start a new development branch.
>> >
>> > I am a little concerned that the old sysctls have gone away. If no one
>> > cares it is fine, but if someone depends on them existing that may count
>> > as an unnecessary userspace regression. But otherwise skimming through
>> > this code it looks good.
>>
>> So indeed this is a real issue and I meant to write something about
>> it. Anyway, in order to preserve those sysctls, what can be done is to
>> hook them up with a custom sysctl handler taking the ns from the proc
>> mount and the euid of current? I think this is a good approach, but
>> let's wait and see if anyone will have objections to completely
>> eliminating those sysctls.
>
> Well, I believe just discarding those sysctls is not an option - I'm pretty
> sure there are scripts out there which tune these sysctls and those would
> stop working. IMO not acceptable regression.

Nikolay there is your objection.

So since it should be straightforward, let's preserve the existing
sysctls. Then this change doesn't need to prove there are no scripts
that tweak those sysctls.

We are just talking changing the values in the initial user namespace so
it should be completely compatible and straightforward to implement
unless I am missing something.
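Concretely, preserving a sysctl then just means pointing its .data at
the init namespace's limit slot, e.g. (this is the shape the v2 patch
below ends up with):

    {
            .procname     = "max_user_instances",
            .data         = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
            .maxlen       = sizeof(int),
            .mode         = 0644,
            .proc_handler = proc_dointvec_minmax,
            .extra1       = &zero,
    },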

Eric

2016-10-10 21:54:09

by Angel Shtilianov

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits

On Mon, Oct 10, 2016 at 11:49 PM, Eric W. Biederman
<[email protected]> wrote:
> Jan Kara <[email protected]> writes:
>
>> On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
>>> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
>>> > Nikolay Borisov <[email protected]> writes:
>>> >
>>> >> This patchset converts inotify to using the newly introduced
>>> >> per-userns sysctl infrastructure.
>>> >>
>>> >> Currently the inotify instances/watches are being accounted in the
>>> >> user_struct structure. This means that in setups where multiple
>>> >> users in unprivileged containers map to the same underlying
>>> >> real user (i.e. pointing to the same user_struct) the inotify limits
>>> >> are going to be shared as well, allowing one user (or application) to
>>> >> exhaust the limits of all the others.
>>> >>
>>> >> Fix this by switching the inotify sysctls to using the
>>> >> per-namespace/per-user limits. This will allow the server admin to
>>> >> set sensible global limits, which can further be tuned inside every
>>> >> individual user namespace.
>>> >>
>>> >> Signed-off-by: Nikolay Borisov <[email protected]>
>>> >> ---
>>> >> Hello Eric,
>>> >>
>>> >> I saw you've finally sent your pull request for 4.9 and it
>>> >> includes your implementation of the ucount infrastructure. So
>>> >> here is my respin of the inotify patches using that.
>>> >
>>> > Thanks. I will take a good hard look at this after -rc1 when things are
>>> > stable enough that I can start a new development branch.
>>> >
>>> > I am a little concerned that the old sysctls have gone away. If no one
>>> > cares it is fine, but if someone depends on them existing that may count
>>> > as an unnecessary userspace regression. But otherwise skimming through
>>> > this code it looks good.
>>>
>>> So indeed this is a real issue and I meant to write something about
>>> it. Anyway, in order to preserve those sysctls, what can be done is to
>>> hook them up with a custom sysctl handler taking the ns from the proc
>>> mount and the euid of current? I think this is a good approach, but
>>> let's wait and see if anyone will have objections to completely
>>> eliminating those sysctls.
>>
>> Well, I believe just discarding those sysctls is not an option - I'm pretty
>> sure there are scripts out there which tune these sysctls and those would
>> stop working. IMO not acceptable regression.
>
> Nikolay there is your objection.
>
> So since it should be straightforward, let's preserve the existing
> sysctls. Then this change doesn't need to prove there are no scripts
> that tweak those sysctls.
>
> We are just talking changing the values in the initial user namespace so
> it should be completely compatible and straightforward to implement
> unless I am missing something.

Well, I'm not so sure about this. Let's say those sysctls are going to
modify the ucount values in the init_user_ns. That's fine; however, for
which particular user should they do this? Should it be hardcoded for
kuid 0, or current_euid? I personally think they should be changing
the values for the current_euid.

>
> Eric

2016-10-10 23:03:32

by Eric W. Biederman

Subject: Re: [PATCH] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> On Mon, Oct 10, 2016 at 11:49 PM, Eric W. Biederman
> <[email protected]> wrote:
>> Jan Kara <[email protected]> writes:
>>
>>> On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
>>>> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
>>>> > Nikolay Borisov <[email protected]> writes:
>>>> >
>>>> >> This patchset converts inotify to using the newly introduced
>>>> >> per-userns sysctl infrastructure.
>>>> >>
>>>> >> Currently the inotify instances/watches are being accounted in the
>>>> >> user_struct structure. This means that in setups where multiple
>>>> >> users in unprivileged containers map to the same underlying
>>>> >> real user (i.e. pointing to the same user_struct) the inotify limits
>>>> >> are going to be shared as well, allowing one user (or application) to
>>>> >> exhaust the limits of all the others.
>>>> >>
>>>> >> Fix this by switching the inotify sysctls to using the
>>>> >> per-namespace/per-user limits. This will allow the server admin to
>>>> >> set sensible global limits, which can further be tuned inside every
>>>> >> individual user namespace.
>>>> >>
>>>> >> Signed-off-by: Nikolay Borisov <[email protected]>
>>>> >> ---
>>>> >> Hello Eric,
>>>> >>
>>>> >> I saw you've finally sent your pull request for 4.9 and it
>>>> >> includes your implementation of the ucount infrastructure. So
>>>> >> here is my respin of the inotify patches using that.
>>>> >
>>>> > Thanks. I will take a good hard look at this after -rc1 when things are
>>>> > stable enough that I can start a new development branch.
>>>> >
>>>> > I am a little concerned that the old sysctls have gone away. If no one
>>>> > cares it is fine, but if someone depends on them existing that may count
>>>> > as an unnecessary userspace regression. But otherwise skimming through
>>>> > this code it looks good.
>>>>
>>>> So indeed this is a real issue and I meant to write something about
>>>> it. Anyway, in order to preserve those sysctls, what can be done is to
>>>> hook them up with a custom sysctl handler taking the ns from the proc
>>>> mount and the euid of current? I think this is a good approach, but
>>>> let's wait and see if anyone will have objections to completely
>>>> eliminating those sysctls.
>>>
>>> Well, I believe just discarding those sysctls is not an option - I'm pretty
>>> sure there are scripts out there which tune these sysctls and those would
>>> stop working. IMO not acceptable regression.
>>
>> Nikolay there is your objection.
>>
>> So since it should be straightforward, let's preserve the existing
>> sysctls. Then this change doesn't need to prove there are no scripts
>> that tweak those sysctls.
>>
>> We are just talking changing the values in the initial user namespace so
>> it should be completely compatible and straightforward to implement
>> unless I am missing something.
>
> Well, I'm not so sure about this. Let's say those sysctls are going to
> modify the ucount values in the init_user_ns. That's fine; however, for
> which particular user should they do this? Should it be hardcoded for
> kuid 0, or current_euid? I personally think they should be changing
> the values for the current_euid.

Unless I have missed something, the limits are per user namespace. The
counts are per user in that namespace. Certainly that is what the rest
of the ucount infrastructure is doing.

At which point having the existing sysctls simply update the limit in
the initial user namespace should result in no change.
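For reference, the split Eric describes, as it appears in the 4.9-era
structures (simplified, unrelated fields omitted):

    struct user_namespace {
            /* ... */
            int ucount_max[UCOUNT_COUNTS];      /* per-namespace limits */
    };

    struct ucounts {
            struct user_namespace *ns;          /* which namespace */
            kuid_t uid;                         /* which user within it */
            atomic_t ucount[UCOUNT_COUNTS];     /* per-(ns, user) counts */
    };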

Eric

2016-10-11 07:36:38

by Angel Shtilianov

Subject: [PATCH v2] inotify: Convert to using per-namespace limits

This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user (or application) to
exhaust the limits of all the others.

Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI, make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.

Signed-off-by: Nikolay Borisov <[email protected]>
---

So here is a revised version which retains the existing sysctls,
and hooks them to the init_user_ns values.

 fs/notify/inotify/inotify.h          | 17 +++++++++++++++++
 fs/notify/inotify/inotify_fsnotify.c |  6 ++----
 fs/notify/inotify/inotify_user.c     | 34 +++++++++++++++++-----------------
 include/linux/fsnotify_backend.h     |  3 ++-
 include/linux/sched.h                |  4 ----
 include/linux/user_namespace.h       |  4 ++++
 kernel/ucount.c                      |  6 +++++-
 7 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index ed855ef6f077..b5536f8ad3e0 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
const unsigned char *file_name, u32 cookie);

extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+#ifdef CONFIG_INOTIFY_USER
+static void dec_inotify_instances(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
+}
+
+static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
+{
+ return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
+}
+
+static void dec_inotify_watches(struct ucounts *ucounts)
+{
+ dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
+}
+#endif
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 2cd900c2c737..1e6b3cfc2cfd 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
/* ideally the idr is empty and we won't hit the BUG in the callback */
idr_for_each(&group->inotify_data.idr, idr_callback, group);
idr_destroy(&group->inotify_data.idr);
- if (group->inotify_data.user) {
- atomic_dec(&group->inotify_data.user->inotify_devs);
- free_uid(group->inotify_data.user);
- }
+ if (group->inotify_data.ucounts)
+ dec_inotify_instances(group->inotify_data.ucounts);
}

static void inotify_free_event(struct fsnotify_event *fsn_event)
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b8d08d0d0a4d..7d769a824742 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -44,10 +44,8 @@

#include <asm/ioctls.h>

-/* these are configurable via /proc/sys/fs/inotify/ */
-static int inotify_max_user_instances __read_mostly;
+/* configurable via /proc/sys/fs/inotify/ */
static int inotify_max_queued_events __read_mostly;
-static int inotify_max_user_watches __read_mostly;

static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;

@@ -60,7 +58,7 @@ static int zero;
struct ctl_table inotify_table[] = {
{
.procname = "max_user_instances",
- .data = &inotify_max_user_instances,
+ .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
@@ -68,7 +66,7 @@ struct ctl_table inotify_table[] = {
},
{
.procname = "max_user_watches",
- .data = &inotify_max_user_watches,
+ .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
.maxlen = sizeof(int),
.mode = 0644,
.proc_handler = proc_dointvec_minmax,
@@ -500,7 +498,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
/* remove this mark from the idr */
inotify_remove_from_idr(group, i_mark);

- atomic_dec(&group->inotify_data.user->inotify_watches);
+ dec_inotify_watches(group->inotify_data.ucounts);
}

/* ding dong the mark is dead */
@@ -584,14 +582,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
tmp_i_mark->fsn_mark.mask = mask;
tmp_i_mark->wd = -1;

- ret = -ENOSPC;
- if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
- goto out_err;
-
ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
if (ret)
goto out_err;

+ /* increment the number of watches the user has */
+ if (!inc_inotify_watches(group->inotify_data.ucounts)) {
+ inotify_remove_from_idr(group, tmp_i_mark);
+ ret = -ENOSPC;
+ goto out_err;
+ }
+
/* we are on the idr, now get on the inode */
ret = fsnotify_add_mark_locked(&tmp_i_mark->fsn_mark, group, inode,
NULL, 0);
@@ -601,8 +602,6 @@ static int inotify_new_watch(struct fsnotify_group *group,
goto out_err;
}

- /* increment the number of watches the user has */
- atomic_inc(&group->inotify_data.user->inotify_watches);

/* return the watch descriptor for this new mark */
ret = tmp_i_mark->wd;
@@ -653,10 +652,11 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)

spin_lock_init(&group->inotify_data.idr_lock);
idr_init(&group->inotify_data.idr);
- group->inotify_data.user = get_current_user();
+ group->inotify_data.ucounts = inc_ucount(current_user_ns(),
+ current_euid(),
+ UCOUNT_INOTIFY_INSTANCES);

- if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
- inotify_max_user_instances) {
+ if (!group->inotify_data.ucounts) {
fsnotify_destroy_group(group);
return ERR_PTR(-EMFILE);
}
@@ -819,8 +819,8 @@ static int __init inotify_user_setup(void)
inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);

inotify_max_queued_events = 16384;
- inotify_max_user_instances = 128;
- inotify_max_user_watches = 8192;
+ init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
+ init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;

return 0;
}
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index 58205f33af02..892959de1162 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -16,6 +16,7 @@
#include <linux/spinlock.h>
#include <linux/types.h>
#include <linux/atomic.h>
+#include <linux/user_namespace.h>

/*
* IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
@@ -169,7 +170,7 @@ struct fsnotify_group {
struct inotify_group_private_data {
spinlock_t idr_lock;
struct idr idr;
- struct user_struct *user;
+ struct ucounts *ucounts;
} inotify_data;
#endif
#ifdef CONFIG_FANOTIFY
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e513e39..35230a2c73ac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -840,10 +840,6 @@ struct user_struct {
atomic_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
-#ifdef CONFIG_INOTIFY_USER
- atomic_t inotify_watches; /* How many inotify watches does this user have? */
- atomic_t inotify_devs; /* How many inotify devs does this user have opened? */
-#endif
#ifdef CONFIG_FANOTIFY
atomic_t fanotify_listeners;
#endif
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb209d4523f5..cbcac7a5ec41 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -32,6 +32,10 @@ enum ucount_type {
UCOUNT_NET_NAMESPACES,
UCOUNT_MNT_NAMESPACES,
UCOUNT_CGROUP_NAMESPACES,
+#ifdef CONFIG_INOTIFY_USER
+ UCOUNT_INOTIFY_INSTANCES,
+ UCOUNT_INOTIFY_WATCHES,
+#endif
UCOUNT_COUNTS,
};

diff --git a/kernel/ucount.c b/kernel/ucount.c
index 9d20d5dd298a..a6bcea47306b 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -57,7 +57,7 @@ static struct ctl_table_root set_root = {

static int zero = 0;
static int int_max = INT_MAX;
-#define UCOUNT_ENTRY(name) \
+#define UCOUNT_ENTRY(name) \
{ \
.procname = name, \
.maxlen = sizeof(int), \
@@ -74,6 +74,10 @@ static struct ctl_table user_table[] = {
UCOUNT_ENTRY("max_net_namespaces"),
UCOUNT_ENTRY("max_mnt_namespaces"),
UCOUNT_ENTRY("max_cgroup_namespaces"),
+#ifdef CONFIG_INOTIFY_USER
+ UCOUNT_ENTRY("max_inotify_instances"),
+ UCOUNT_ENTRY("max_inotify_watches"),
+#endif
{ }
};
#endif /* CONFIG_SYSCTL */
--
2.5.0

2016-10-14 02:32:35

by kernel test robot

Subject: [lkp] [inotify] 464e1236c3: BUG kmalloc-512 (Not tainted): Freepointer corrupt


FYI, we noticed the following commit:

https://github.com/0day-ci/linux Nikolay-Borisov/inotify-Convert-to-using-per-namespace-limits/20161011-153830
commit 464e1236c367919e405c8d248d6a4118fdc4a2c1 ("inotify: Convert to using per-namespace limits")

in testcase: trinity
with following parameters:

runtime: 300s


Trinity is a Linux system call fuzz tester.


on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 320M

caused below changes:


+-------------------------------------------------------+------------+------------+
| | 101105b171 | 464e1236c3 |
+-------------------------------------------------------+------------+------------+
| boot_successes | 20 | 62 |
| boot_failures | 14 | 94 |
| invoked_oom-killer:gfp_mask=0x | 14 | 10 |
| Mem-Info | 14 | 10 |
| page_allocation_failure:order:#,mode:#(GFP_USER) | 1 | |
| BUG_kmalloc-#(Not_tainted):Freepointer_corrupt | 0 | 46 |
| INFO:Allocated_in_setup_userns_sysctls_age=#cpu=#pid= | 0 | 46 |
| INFO:Freed_in_free_ctx_age=#cpu=#pid= | 0 | 8 |
| INFO:Slab#objects=#used=#fp=#flags= | 0 | 45 |
| INFO:Object#@offset=#fp= | 0 | 46 |
| calltrace:free_user_ns | 0 | 46 |
| BUG_kmalloc-#(Tainted:G_B):Freepointer_corrupt | 0 | 3 |
| INFO:Freed_in_kernfs_fop_release_age=#cpu=#pid= | 0 | 8 |
| BUG:kernel_reboot-without-warning_in_test_stage | 0 | 38 |
| INFO:Slab#objects=#used=#fp=0x(null)flags= | 0 | 1 |
| BUG:unable_to_handle_kernel | 0 | 1 |
| Oops | 0 | 1 |
| RIP:copy_process | 0 | 1 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 1 |
| INFO:Freed_in_skb_free_head_age=#cpu=#pid= | 0 | 3 |
| INFO:Freed_in_kvfree_age=#cpu=#pid= | 0 | 2 |
| INFO:Freed_in_ep_free_age=#cpu=#pid= | 0 | 1 |
| INFO:Freed_in_free_pipe_info_age=#cpu=#pid= | 0 | 3 |
+-------------------------------------------------------+------------+------------+



[ 64.996369] genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000080 (goldfish_pdev_bus)
[ 65.007839] genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000080 (goldfish_pdev_bus)
[ 65.519812] =============================================================================
[ 65.521973] BUG kmalloc-512 (Not tainted): Freepointer corrupt
[ 65.523368] -----------------------------------------------------------------------------
[ 65.523368]
[ 65.525977] Disabling lock debugging due to kernel taint
[ 65.527277] INFO: Allocated in setup_userns_sysctls+0x3f/0xa6 age=5 cpu=1 pid=418
[ 65.558397] INFO: Freed in free_ctx+0x1d/0x20 age=6 cpu=0 pid=19
[ 65.566491] INFO: Slab 0xffff88000f147700 objects=19 used=15 fp=0xffff8800070de7c8 flags=0x200004081
[ 65.568956] INFO: Object 0xffff8800070dee68 @offset=11880 fp=0xffff880007030288
[ 65.568956]
[ 65.574100] Redzone ffff8800070dee60: cc cc cc cc cc cc cc cc ........
[ 65.576524] Object ffff8800070dee68: 90 d1 fd 81 ff ff ff ff 68 02 03 07 00 88 ff ff ........h.......
[ 65.579009] Object ffff8800070dee78: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.581691] Object ffff8800070dee88: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.584222] Object ffff8800070dee98: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.586768] Object ffff8800070deea8: a4 d1 fd 81 ff ff ff ff 6c 02 03 07 00 88 ff ff ........l.......
[ 65.589412] Object ffff8800070deeb8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.591971] Object ffff8800070deec8: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.594469] Object ffff8800070deed8: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.596977] Object ffff8800070deee8: b7 d1 fd 81 ff ff ff ff 70 02 03 07 00 88 ff ff ........p.......
[ 65.599617] Object ffff8800070deef8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.602173] Object ffff8800070def08: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.604667] Object ffff8800070def18: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.607358] Object ffff8800070def28: ca d1 fd 81 ff ff ff ff 74 02 03 07 00 88 ff ff ........t.......
[ 65.609905] Object ffff8800070def38: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.612456] Object ffff8800070def48: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.614946] Object ffff8800070def58: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.617618] Object ffff8800070def68: dd d1 fd 81 ff ff ff ff 78 02 03 07 00 88 ff ff ........x.......
[ 65.620145] Object ffff8800070def78: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.622607] Object ffff8800070def88: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.625270] Object ffff8800070def98: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.627773] Object ffff8800070defa8: f0 d1 fd 81 ff ff ff ff 7c 02 03 07 00 88 ff ff ........|.......
[ 65.630300] Object ffff8800070defb8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.632804] Object ffff8800070defc8: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.635477] Object ffff8800070defd8: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.637983] Object ffff8800070defe8: 03 d2 fd 81 ff ff ff ff 80 02 03 07 00 88 ff ff ................
[ 65.640507] Object ffff8800070deff8: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00 ................
[ 65.642994] Object ffff8800070df008: 59 02 0c 81 ff ff ff ff 00 00 00 00 00 00 00 00 Y...............
[ 65.645711] Object ffff8800070df018: e0 4d 4a 83 ff ff ff ff 40 17 26 82 ff ff ff ff .MJ.....@.&.....
[ 65.648170] Object ffff8800070df028: 00 00 00 00 00 00 00 00 84 02 03 07 00 88 ff ff ................
[ 65.650683] Object ffff8800070df038: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 65.653395] Object ffff8800070df048: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 65.655876] Object ffff8800070df058: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 65.658394] Redzone ffff8800070df068: cc cc cc cc cc cc cc cc ........
[ 65.660854] Padding ffff8800070df1a8: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ
[ 65.663396] CPU: 0 PID: 35 Comm: kworker/0:1 Tainted: G B 4.8.0-11826-g464e123 #1
[ 65.665746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 65.668185] Workqueue: events free_user_ns
[ 65.669571] ffffc90000187ad8 ffffffff8148d545 ffff88000e804e00 ffff8800070dee68
[ 65.672224] ffffc90000187b08 ffffffff811a74a0 ffff88000e804e00 ffff88000f147700
[ 65.674863] ffff8800070dee68 00000000000000cc ffffc90000187b30 ffffffff811a8088
[ 65.677604] Call Trace:
[ 65.678412] [<ffffffff8148d545>] dump_stack+0x86/0xc0
[ 65.679908] [<ffffffff811a74a0>] print_trailer+0x178/0x181
[ 65.681439] [<ffffffff811a8088>] object_err+0x2f/0x36
[ 65.682835] [<ffffffff811a82f4>] check_object+0x265/0x282
[ 65.684336] [<ffffffff811a9e1b>] free_debug_processing+0xc1/0x35c
[ 65.686049] [<ffffffff810d8a3f>] ? retire_userns_sysctls+0x2e/0x33
[ 65.687714] [<ffffffff810d8a3f>] ? retire_userns_sysctls+0x2e/0x33
[ 65.689398] [<ffffffff811aa125>] __slab_free+0x6f/0x426
[ 65.690840] [<ffffffff81037aea>] ? kvm_clock_read+0x25/0x2e
[ 65.692350] [<ffffffff81037b07>] ? kvm_sched_clock_read+0x9/0x12
[ 65.694056] [<ffffffff8101c211>] ? sched_clock+0x9/0xd
[ 65.695552] [<ffffffff810fa12d>] ? mark_held_locks+0x5e/0x74
[ 65.697043] [<ffffffff811abab5>] ? kfree+0xfe/0x170
[ 65.698430] [<ffffffff810d8a3f>] ? retire_userns_sysctls+0x2e/0x33
[ 65.700159] [<ffffffff811abb1c>] kfree+0x165/0x170
[ 65.701540] [<ffffffff811abb1c>] ? kfree+0x165/0x170
[ 65.702885] [<ffffffff810d8a3f>] retire_userns_sysctls+0x2e/0x33
[ 65.704553] [<ffffffff81137c4c>] free_user_ns+0x26/0x6b
[ 65.706069] [<ffffffff810cf1a6>] process_one_work+0x208/0x3a5
[ 65.707635] [<ffffffff810cf143>] ? process_one_work+0x1a5/0x3a5
[ 65.729991] [<ffffffff810cf5bb>] worker_thread+0x24a/0x380
[ 65.731583] [<ffffffff810cf371>] ? process_scheduled_works+0x2e/0x2e
[ 65.733274] [<ffffffff810d546c>] kthread+0x106/0x10e
[ 65.734628] [<ffffffff810d5366>] ? __kthread_parkme+0x81/0x81
[ 65.736286] [<ffffffff81b60bea>] ret_from_fork+0x2a/0x40
[ 65.737828] FIX kmalloc-512: Object at 0xffff8800070dee68 not freed
[ 65.887942] genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000080 (goldfish_pdev_bus)
[ 66.042944] genirq: Flags mismatch irq 4. 00000000 (serial) vs. 00000080 (goldfish_pdev_bus)


To reproduce:

git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml



Thanks,
Xiaolong


Attachments:
config-4.8.0-11826-g464e123 (110.13 kB)
job-script (3.52 kB)
dmesg.xz (17.93 kB)
job.yaml (2.75 kB)

2016-10-24 06:47:50

by Angel Shtilianov

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits



On 10/11/2016 10:36 AM, Nikolay Borisov wrote:
> This patchset converts inotify to using the newly introduced
> per-userns sysctl infrastructure.
>
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. pointing to the same user_struct) the inotify limits
> are going to be shared as well, allowing one user(or application) to exhaust
> all others limits.
>
> Fix this by switching the inotify sysctls to using the
> per-namespace/per-user limits. This will allow the server admin to
> set sensible global limits, which can further be tuned inside every
> individual user namespace. Additionally, in order to preserve the
> sysctl ABI make the existing inotify instances/watches sysctls
> modify the values of the initial user namespace.
>
> Signed-off-by: Nikolay Borisov <[email protected]>
> ---
>
> So here is a revised version which retains the existing sysctls,
> and hooks them to the init_user_ns values.

Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
hopefully resolved.

>
> fs/notify/inotify/inotify.h | 17 +++++++++++++++++
> fs/notify/inotify/inotify_fsnotify.c | 6 ++----
> fs/notify/inotify/inotify_user.c | 34 +++++++++++++++++-----------------
> include/linux/fsnotify_backend.h | 3 ++-
> include/linux/sched.h | 4 ----
> include/linux/user_namespace.h | 4 ++++
> kernel/ucount.c | 6 +++++-
> 7 files changed, 47 insertions(+), 27 deletions(-)
>
> diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
> index ed855ef6f077..b5536f8ad3e0 100644
> --- a/fs/notify/inotify/inotify.h
> +++ b/fs/notify/inotify/inotify.h
> @@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
> const unsigned char *file_name, u32 cookie);
>
> extern const struct fsnotify_ops inotify_fsnotify_ops;
> +
> +#ifdef CONFIG_INOTIFY_USER
> +static void dec_inotify_instances(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
> +}
> +
> +static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
> +{
> + return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
> +}
> +
> +static void dec_inotify_watches(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
> +}
> +#endif
> diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
> index 2cd900c2c737..1e6b3cfc2cfd 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
> /* ideally the idr is empty and we won't hit the BUG in the callback */
> idr_for_each(&group->inotify_data.idr, idr_callback, group);
> idr_destroy(&group->inotify_data.idr);
> - if (group->inotify_data.user) {
> - atomic_dec(&group->inotify_data.user->inotify_devs);
> - free_uid(group->inotify_data.user);
> - }
> + if (group->inotify_data.ucounts)
> + dec_inotify_instances(group->inotify_data.ucounts);
> }
>
> static void inotify_free_event(struct fsnotify_event *fsn_event)
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index b8d08d0d0a4d..7d769a824742 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -44,10 +44,8 @@
>
> #include <asm/ioctls.h>
>
> -/* these are configurable via /proc/sys/fs/inotify/ */
> -static int inotify_max_user_instances __read_mostly;
> +/* configurable via /proc/sys/fs/inotify/ */
> static int inotify_max_queued_events __read_mostly;
> -static int inotify_max_user_watches __read_mostly;
>
> static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
>
> @@ -60,7 +58,7 @@ static int zero;
> struct ctl_table inotify_table[] = {
> {
> .procname = "max_user_instances",
> - .data = &inotify_max_user_instances,
> + .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
> .maxlen = sizeof(int),
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> @@ -68,7 +66,7 @@ struct ctl_table inotify_table[] = {
> },
> {
> .procname = "max_user_watches",
> - .data = &inotify_max_user_watches,
> + .data = &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
> .maxlen = sizeof(int),
> .mode = 0644,
> .proc_handler = proc_dointvec_minmax,
> @@ -500,7 +498,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
> /* remove this mark from the idr */
> inotify_remove_from_idr(group, i_mark);
>
> - atomic_dec(&group->inotify_data.user->inotify_watches);
> + dec_inotify_watches(group->inotify_data.ucounts);
> }
>
> /* ding dong the mark is dead */
> @@ -584,14 +582,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
> tmp_i_mark->fsn_mark.mask = mask;
> tmp_i_mark->wd = -1;
>
> - ret = -ENOSPC;
> - if (atomic_read(&group->inotify_data.user->inotify_watches) >= inotify_max_user_watches)
> - goto out_err;
> -
> ret = inotify_add_to_idr(idr, idr_lock, tmp_i_mark);
> if (ret)
> goto out_err;
>
> + /* increment the number of watches the user has */
> + if (!inc_inotify_watches(group->inotify_data.ucounts)) {
> + inotify_remove_from_idr(group, tmp_i_mark);
> + ret = -ENOSPC;
> + goto out_err;
> + }
> +
> /* we are on the idr, now get on the inode */
> ret = fsnotify_add_mark_locked(&tmp_i_mark->fsn_mark, group, inode,
> NULL, 0);
> @@ -601,8 +602,6 @@ static int inotify_new_watch(struct fsnotify_group *group,
> goto out_err;
> }
>
> - /* increment the number of watches the user has */
> - atomic_inc(&group->inotify_data.user->inotify_watches);
>
> /* return the watch descriptor for this new mark */
> ret = tmp_i_mark->wd;
> @@ -653,10 +652,11 @@ static struct fsnotify_group *inotify_new_group(unsigned int max_events)
>
> spin_lock_init(&group->inotify_data.idr_lock);
> idr_init(&group->inotify_data.idr);
> - group->inotify_data.user = get_current_user();
> + group->inotify_data.ucounts = inc_ucount(current_user_ns(),
> + current_euid(),
> + UCOUNT_INOTIFY_INSTANCES);
>
> - if (atomic_inc_return(&group->inotify_data.user->inotify_devs) >
> - inotify_max_user_instances) {
> + if (!group->inotify_data.ucounts) {
> fsnotify_destroy_group(group);
> return ERR_PTR(-EMFILE);
> }
> @@ -819,8 +819,8 @@ static int __init inotify_user_setup(void)
> inotify_inode_mark_cachep = KMEM_CACHE(inotify_inode_mark, SLAB_PANIC);
>
> inotify_max_queued_events = 16384;
> - inotify_max_user_instances = 128;
> - inotify_max_user_watches = 8192;
> + init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
> + init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = 8192;
>
> return 0;
> }
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index 58205f33af02..892959de1162 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -16,6 +16,7 @@
> #include <linux/spinlock.h>
> #include <linux/types.h>
> #include <linux/atomic.h>
> +#include <linux/user_namespace.h>
>
> /*
> * IN_* from inotfy.h lines up EXACTLY with FS_*, this is so we can easily
> @@ -169,7 +170,7 @@ struct fsnotify_group {
> struct inotify_group_private_data {
> spinlock_t idr_lock;
> struct idr idr;
> - struct user_struct *user;
> + struct ucounts *ucounts;
> } inotify_data;
> #endif
> #ifdef CONFIG_FANOTIFY
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 62c68e513e39..35230a2c73ac 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -840,10 +840,6 @@ struct user_struct {
> atomic_t __count; /* reference count */
> atomic_t processes; /* How many processes does this user have? */
> atomic_t sigpending; /* How many pending signals does this user have? */
> -#ifdef CONFIG_INOTIFY_USER
> - atomic_t inotify_watches; /* How many inotify watches does this user have? */
> - atomic_t inotify_devs; /* How many inotify devs does this user have opened? */
> -#endif
> #ifdef CONFIG_FANOTIFY
> atomic_t fanotify_listeners;
> #endif
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb209d4523f5..cbcac7a5ec41 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -32,6 +32,10 @@ enum ucount_type {
> UCOUNT_NET_NAMESPACES,
> UCOUNT_MNT_NAMESPACES,
> UCOUNT_CGROUP_NAMESPACES,
> +#ifdef CONFIG_INOTIFY_USER
> + UCOUNT_INOTIFY_INSTANCES,
> + UCOUNT_INOTIFY_WATCHES,
> +#endif
> UCOUNT_COUNTS,
> };
>
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 9d20d5dd298a..a6bcea47306b 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -57,7 +57,7 @@ static struct ctl_table_root set_root = {
>
> static int zero = 0;
> static int int_max = INT_MAX;
> -#define UCOUNT_ENTRY(name) \
> +#define UCOUNT_ENTRY(name) \
> { \
> .procname = name, \
> .maxlen = sizeof(int), \
> @@ -74,6 +74,10 @@ static struct ctl_table user_table[] = {
> UCOUNT_ENTRY("max_net_namespaces"),
> UCOUNT_ENTRY("max_mnt_namespaces"),
> UCOUNT_ENTRY("max_cgroup_namespaces"),
> +#ifdef CONFIG_INOTIFY_USER
> + UCOUNT_ENTRY("max_inotify_instances"),
> + UCOUNT_ENTRY("max_inotify_watches"),
> +#endif
> { }
> };
> #endif /* CONFIG_SYSCTL */
>
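
The user-visible failure mode is unchanged by the conversion: once
inc_ucount() refuses a new instance, inotify_init(2) keeps failing with
EMFILE, just as it did when user_struct->inotify_devs crossed
inotify_max_user_instances. Below is a minimal, hypothetical userspace
sketch that probes the instance budget; it assumes the default limit of
128 set in inotify_user_setup() and an open-file rlimit high enough that
the inotify limit trips first.

/*
 * Hypothetical probe of the per-user inotify instance limit.
 * The descriptors are deliberately left open: the limit itself
 * is what we are measuring.
 */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/inotify.h>

int main(void)
{
	int n;

	for (n = 0; ; n++) {
		if (inotify_init() < 0) {
			int err = errno;	/* printf() may clobber errno */

			printf("instance %d failed: %s\n", n, strerror(err));
			return err == EMFILE ? 0 : 1;
		}
	}
}

With the patch applied, running such a probe inside a newly created user
namespace charges that namespace's own budget (and, because ucounts are
accounted hierarchically, its ancestors'), so a container can no longer
exhaust the allowance of every other user mapping to the same underlying
uid; the ceiling inside the namespace can then be tuned via the new
max_inotify_instances entry added to user_table.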

2016-10-24 07:48:24

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

On Tue 11-10-16 10:36:22, Nikolay Borisov wrote:
> This patchset converts inotify to using the newly introduced
> per-userns sysctl infrastructure.
>
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. pointing to the same user_struct) the inotify limits
> are going to be shared as well, allowing one user(or application) to exhaust
> all others limits.
>
> Fix this by switching the inotify sysctls to using the
> per-namespace/per-user limits. This will allow the server admin to
> set sensible global limits, which can further be tuned inside every
> individual user namespace. Additionally, in order to preserve the
> sysctl ABI make the existing inotify instances/watches sysctls
> modify the values of the initial user namespace.
>
> Signed-off-by: Nikolay Borisov <[email protected]>

I cannot really comment on the user-namespace sanity of this, but from
the fs notification framework POV the patch looks fine, so feel free to add:

Acked-by: Jan Kara <[email protected]>

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-27 15:48:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> On 10/11/2016 10:36 AM, Nikolay Borisov wrote:
>> This patchset converts inotify to using the newly introduced
>> per-userns sysctl infrastructure.
>>
>> Currently the inotify instances/watches are being accounted in the
>> user_struct structure. This means that in setups where multiple
>> users in unprivileged containers map to the same underlying
>> real user (i.e. pointing to the same user_struct) the inotify limits
>> are going to be shared as well, allowing one user(or application) to exhaust
>> all others limits.
>>
>> Fix this by switching the inotify sysctls to using the
>> per-namespace/per-user limits. This will allow the server admin to
>> set sensible global limits, which can further be tuned inside every
>> individual user namespace. Additionally, in order to preserve the
>> sysctl ABI make the existing inotify instances/watches sysctls
>> modify the values of the initial user namespace.
>>
>> Signed-off-by: Nikolay Borisov <[email protected]>
>> ---
>>
>> So here is a revised version which retains the existing sysctls,
>> and hooks them to the init_user_ns values.
>
> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
> hopefully resolved.

I plan to give this a once-over and merge it. I have a very funny
pile of serious bug fixes that jumped on me at the beginning of this
development cycle, and a cold, so I am getting to this more slowly than I
would like.

Eric

2016-11-14 06:04:09

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

On Tue, Oct 11, 2016 at 10:36:22AM +0300, Nikolay Borisov wrote:
> This patchset converts inotify to using the newly introduced
> per-userns sysctl infrastructure.
>
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. pointing to the same user_struct) the inotify limits
> are going to be shared as well, allowing one user(or application) to exhaust
> all others limits.
>
> Fix this by switching the inotify sysctls to using the
> per-namespace/per-user limits. This will allow the server admin to
> set sensible global limits, which can further be tuned inside every
> individual user namespace. Additionally, in order to preserve the
> sysctl ABI make the existing inotify instances/watches sysctls
> modify the values of the initial user namespace.
>
> Signed-off-by: Nikolay Borisov <[email protected]>

If I'm reading the existing ucounts code correctly, this looks
good, thanks.

Acked-by: Serge Hallyn <[email protected]>



2016-12-08 01:44:02

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
> hopefully resolved.

After getting slowed down by some fixes I am now taking a hard look at
your patch in the hopes of merging it.

Did you happen to see the kbuild test robot boot failures, and did you
happen to look into what caused them? I have just skimmed them and they
appear to be related to your patch.

Eric

2016-12-08 06:58:37

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits



On 8.12.2016 03:40, Eric W. Biederman wrote:
> Nikolay Borisov <[email protected]> writes:
>
>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
>> hopefully resolved.
>
> After getting slowed down by some fixes I am now taking a hard look at
> your patch in the hopes of merging it.
>
> Did you happen to see the kbuild test robot boot failures, and did you
> happen to look into what caused them? I have just skimmed them and they
> appear to be related to your patch.

I saw them in the beginning, but they looked like generic memory
corruption, and I believe that at the time those patches were submitted
there was a lingering memory corruption hitting various patches. Thus I
didn't think it was related to my patches. I've since left my job, so
I've been taking a bit of time off and haven't looked really hard, which
is why those patches have been kind of lingering.


But now that you mention it, I will try to take a second look to see
what might be causing the memory corruption. Is there a way to force
0day to re-run the tests, to see whether the failures were indeed
caused by my patches or were intermittent?

Regards,
Nikolay


>
> Eric
>

2016-12-08 08:15:26

by Nikolay Borisov

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits



On 8.12.2016 08:58, Nikolay Borisov wrote:
>
>
> On 8.12.2016 03:40, Eric W. Biederman wrote:
>> Nikolay Borisov <[email protected]> writes:
>>
>>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
>>> hopefully resolved.
>>
>> After getting slowed down by some fixes I am now taking a hard look at
>> your patch in the hopes of merging it.
>>
>> Did you happen to see the kbuild test robot boot failures, and did you
>> happen to look into what caused them? I have just skimmed them and they
>> appear to be related to your patch.
>
> I saw them in the beginning, but they looked like generic memory
> corruption, and I believe that at the time those patches were submitted
> there was a lingering memory corruption hitting various patches. Thus I
> didn't think it was related to my patches. I've since left my job, so
> I've been taking a bit of time off and haven't looked really hard, which
> is why those patches have been kind of lingering.
>
>
> But now that you mention it, I will try to take a second look to see
> what might be causing the memory corruption. Is there a way to force
> 0day to re-run the tests, to see whether the failures were indeed
> caused by my patches or were intermittent?

OK, I took another look at the report; bear in mind that the corruption
indeed happened in retire_userns_sysctls. But this line in the report
leads me to believe it's not my patch that's the culprit:

[ 65.527277] INFO: Allocated in setup_userns_sysctls+0x3f/0xa6 age=5
cpu=1 pid=418
[ 65.558397] INFO: Freed in free_ctx+0x1d/0x20 age=6 cpu=0 pid=19


So the free_ctx function freed the object originally, likely causing the
corruption, and no such function is involved in the code I'm touching.
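
For context on where SLUB points the finger: setup_userns_sysctls() hands
every new user namespace a kmemdup()'d copy of user_table, and
retire_userns_sysctls() kfree()s that copy when the namespace is torn
down. SLUB validates an object's redzone only at allocation and free
time, so the kfree() site named in a report is where damage is detected,
not necessarily where the stray write happened. A simplified sketch of
the 4.9-era pairing in kernel/ucount.c (paraphrased, not a verbatim copy):

bool setup_userns_sysctls(struct user_namespace *ns)
{
	struct ctl_table *tbl;
	int i;

	setup_sysctl_set(&ns->set, &set_root, set_is_seen);
	/* private per-namespace copy of the limits table */
	tbl = kmemdup(user_table, sizeof(user_table), GFP_KERNEL);
	if (!tbl) {
		retire_sysctl_set(&ns->set);
		return false;
	}
	/*
	 * Point each entry at this namespace's own limit. This relies
	 * on user_table carrying one named entry per ucount_type.
	 */
	for (i = 0; i < UCOUNT_COUNTS; i++)
		tbl[i].data = &ns->ucount_max[i];
	ns->sysctls = __register_sysctl_table(&ns->set, "user", tbl);
	if (!ns->sysctls) {
		kfree(tbl);
		retire_sysctl_set(&ns->set);
		return false;
	}
	return true;
}

void retire_userns_sysctls(struct user_namespace *ns)
{
	struct ctl_table *tbl = ns->sysctls->ctl_table_arg;

	unregister_sysctl_table(ns->sysctls);
	retire_sysctl_set(&ns->set);
	kfree(tbl);	/* SLUB checks the object's redzone here */
}

The #ifdef guards in the patch keep enum ucount_type and user_table in
sync under the same config option, which is exactly the invariant the
.data rewiring loop above depends on.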
>
> Regards,
> Nikolay
>
>
>>
>> Eric
>>

2016-12-09 02:53:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> On 8.12.2016 03:40, Eric W. Biederman wrote:
>> Nikolay Borisov <[email protected]> writes:
>>
>>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
>>> hopefully resolved.
>>
>> After getting slowed down by some fixes I am now taking a hard look at
>> your patch in the hopes of merging it.
>>
>> Did you happen to see the kbuild test robot boot failures, and did you
>> happen to look into what caused them? I have just skimmed them and they
>> appear to be related to your patch.
>
> I saw them in the beginning, but they looked like generic memory
> corruption, and I believe that at the time those patches were submitted
> there was a lingering memory corruption hitting various patches. Thus I
> didn't think it was related to my patches. I've since left my job, so
> I've been taking a bit of time off and haven't looked really hard, which
> is why those patches have been kind of lingering.

Fair enough. I generally give the kbuild folks the benefit of the doubt
as they try hard to avoid false positives.

> But now that you mention it, I will try to take a second look to see
> what might be causing the memory corruption. Is there a way to force
> 0day to re-run the tests, to see whether the failures were indeed
> caused by my patches or were intermittent?

Good question. I will push the patch to my for-testing branch and see
if any problems show up.

That plus I will read through your patch and make certain I can
understand what is going on.

Eric

2016-12-09 05:42:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v2] inotify: Convert to using per-namespace limits

Nikolay Borisov <[email protected]> writes:

> On 8.12.2016 08:58, Nikolay Borisov wrote:
>>
>>
>> On 8.12.2016 03:40, Eric W. Biederman wrote:
>>> Nikolay Borisov <[email protected]> writes:
>>>
>>>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
>>>> hopefully resolved.
>>>
>>> After getting slowed down by some fixes I am now taking a hard look at
>>> your patch in the hopes of merging it.
>>>
>>> Did you happen to see the kbuild test robot boot failures, and did you
>>> happen to look into what caused them? I have just skimmed them and they
>>> appear to be related to your patch.
>>
>> I saw them in the beginning, but they looked like generic memory
>> corruption, and I believe that at the time those patches were submitted
>> there was a lingering memory corruption hitting various patches. Thus I
>> didn't think it was related to my patches. I've since left my job, so
>> I've been taking a bit of time off and haven't looked really hard, which
>> is why those patches have been kind of lingering.
>>
>>
>> But now that you mention it, I will try to take a second look to see
>> what might be causing the memory corruption. Is there a way to force
>> 0day to re-run the tests, to see whether the failures were indeed
>> caused by my patches or were intermittent?
>
> OK, I took another look at the report; bear in mind that the corruption
> indeed happened in retire_userns_sysctls. But this line in the report
> leads me to believe it's not my patch that's the culprit:
>
> [ 65.527277] INFO: Allocated in setup_userns_sysctls+0x3f/0xa6 age=5
> cpu=1 pid=418
> [ 65.558397] INFO: Freed in free_ctx+0x1d/0x20 age=6 cpu=0 pid=19
>
>
> So the free_ctx function freed the object originally, likely causing the
> corruption, and no such function is involved in the code I'm touching.

Yes. I read through your patch carefully and it doesn't look like it
could possibly cause that kind of corruption; the code is just too
simple.

So I have (belatedly) placed this change in linux-next.

Eric