2018-04-24 02:22:51

by Jeff Mahoney

Subject: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

From: Jeff Mahoney <[email protected]>

Hi all -

I recently encountered a customer issue where, on a machine with many TiB
of memory and a few hundred cores, after a task with a few thousand threads
and hundreds of files open exited, the system would softlockup. That
issue was (is still) being addressed by Nik Borisov's patch to add a
cond_resched call to shrink_dentry_list. The underlying issue is still
there, though. We just don't complain as loudly. When a huge task
exits, now the system is more or less unresponsive for about eight
minutes. All CPUs are pinned and every one of them is going through
dentry and inode eviction for the procfs files associated with each
thread. It's made worse by every CPU contending on the super's
inode list lock.

The numbers get big. My test case was 4096 threads with 16384 files
open. It's a contrived example, but not that far off from the actual
customer case. In this case, a simple "find /proc" would create around
300 million dentry/inode pairs. More practically, lsof(1) does it too;
it just takes longer. On smaller systems, memory pressure starts pushing
them out. Memory pressure isn't really an issue on this machine, so we
end up using well over 100GB for proc files. It's the combination of
the wasted CPU cycles in teardown and the wasted memory at runtime that
pushed me to take this approach.
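
As a rough reproducer for the test case above (a sketch, not the actual
customer workload; RLIMIT_NOFILE has to be raised for the opens to
succeed):

/* gcc -pthread -o proc-storm proc-storm.c */
#include <fcntl.h>
#include <pthread.h>
#include <sys/resource.h>
#include <unistd.h>

#define NR_THREADS	4096
#define NR_FILES	16384

static void *idle_thread(void *arg)
{
	pause();	/* park until the process exits */
	return NULL;
}

int main(void)
{
	struct rlimit rl = {
		.rlim_cur = NR_FILES + 64,
		.rlim_max = NR_FILES + 64,
	};
	pthread_t tid;
	int i;

	/* Needs privilege or a sufficiently high hard limit. */
	setrlimit(RLIMIT_NOFILE, &rl);

	for (i = 0; i < NR_FILES; i++)
		open("/dev/null", O_RDONLY);

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&tid, NULL, idle_thread, NULL);

	/* Walk /proc from another shell now, then watch the exit. */
	sleep(600);
	return 0;	/* exit tears down every thread's proc entries */
}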

The biggest culprits are the "fd" and "fdinfo" directories, but those are
made worse by there being multiple copies of them even for the same
task, without threads getting involved:

- /proc/pid/fd and /proc/pid/task/pid/fd are identical but share no
resources.

- Every /proc/pid/task/*/fd directory in a thread group has identical
contents (unless unshare(CLONE_FILES) was called), but they share no
resources.

- If we do a lookup like /proc/pid/fd on a non-leader member of a thread
group, we'll get a valid directory. Inside, there will be a complete
copy of /proc/pid/task/* just like in /proc/tgid/task. Again,
nothing is shared.

This patch set reduces some (most) of the duplication by conditionally
replacing some of the directories with symbolic links to copies that are
identical.

1) Eliminate the duplication of the task directories between threads.
The task directory belongs to the thread leader and the threads
link to it: e.g. /proc/915/task -> ../910/task. This mainly
reduces duplication when individual threads are looked up directly
at the tgid level. The impact varies based on the number of threads.
The user has to go out of their way to mess up their system
in this way, but if they were so inclined, they could create ~550
billion inodes and dentries using the test case: 4096 threads, each
with a private copy of task/ holding all 4096 siblings, each of
those with 16384-entry fd and fdinfo directories, which works out
to 4096 * 4096 * 32768 entries.

2) Eliminate the duplication of directories that are created identically
between the tgid-level pid directory and its task directory: fd,
fdinfo, ns, net, attr. There is obviously more duplication between
the two directories, but replacing a file with a symbolic link
doesn't get us anything. This reduces the number of files associated
with fd and fdinfo by half if threads aren't involved.

3) Eliminate the duplication of fd and fdinfo directories among threads
that share a files_struct. We check at directory creation time if
the task is a group leader and if not, whether it shares ->files with
the group leader. If so, we create a symbolic link to ../tgid/fd*.
We use a d_revalidate callback to check whether the thread has called
unshare(CLONE_FILES) and, if so, fail the revalidation for the symlink.
Upon re-lookup, a directory will be created in its place. This is
pretty simple, so if the thread group leader calls unshare, all threads
get directories.
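
To make the resulting layout concrete (sketched from the patches; thread
915 with group leader 910 as in the example above):

  /proc/915/task        -> ../910/task    (from 1)
  /proc/910/fd          -> task/910/fd    (from 2)
  /proc/910/task/915/fd -> ../910/fd      (from 3)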

With these patches applied, running the same testcase, the proc_inode
cache only gets to about 600k objects, which is about 99.7% fewer. I
get that procfs isn't supposed to be scalable, but this is kind of
extreme. :)

Finally, I'm not a procfs expert. I'm posting this as an RFC for folks
with more knowledge of the details to pick it apart. The biggest question
is whether any tools depend on any of these things being directories
instead of symlinks. I'd hope not, but I don't have the answer. I'm
sure there are corner cases I'm missing. Hopefully, it's not just flat
out broken, since this is a problem that does need solving.

Now I'll go put on the fireproof suit.

Thanks,

-Jeff

---

Jeff Mahoney (5):
procfs: factor out a few helpers
procfs: factor out inode revalidation work from pid_revalidation
procfs: use symlinks for /proc/<pid>/task when not thread group leader
procfs: share common directories between /proc/tgid and
/proc/tgid/task/tgid
procfs: share fd/fdinfo with thread group leader when files are shared

fs/proc/base.c | 487 +++++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 437 insertions(+), 50 deletions(-)

--
2.12.3



2018-04-24 02:22:59

by Jeff Mahoney

Subject: [PATCH 1/5] procfs: factor out a few helpers

From: Jeff Mahoney <[email protected]>

In order to make the following series easier to review, this patch
factors out a few helpers:
- proc_fill_cache_entry: proc_fill_cache that takes an entry instead of
a name, a name length, and an entry pointer
- proc_pident_lookup_task: proc_pident_lookup that takes a task, allowing
the caller to do the task validation
- pid_entry_match_dentry: the comparison between a dentry's name and
a pid_entry

Signed-off-by: Jeff Mahoney <[email protected]>
---
fs/proc/base.c | 60 ++++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 40 insertions(+), 20 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9298324325ed..e9876a89a5ad 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2453,19 +2453,28 @@ static int proc_pident_instantiate(struct inode *dir,
return -ENOENT;
}

-static struct dentry *proc_pident_lookup(struct inode *dir,
- struct dentry *dentry,
- const struct pid_entry *ents,
- unsigned int nents)
+static bool proc_fill_cache_entry(struct file *file, struct dir_context *ctx,
+ const struct pid_entry *entry,
+ struct task_struct *task)
{
- int error;
- struct task_struct *task = get_proc_task(dir);
- const struct pid_entry *p, *last;
+ return proc_fill_cache(file, ctx, entry->name, entry->len,
+ proc_pident_instantiate, task, entry);
+}

- error = -ENOENT;
+static bool pid_entry_match_dentry(const struct pid_entry *entry,
+ const struct dentry *dentry)
+{
+ if (entry->len != dentry->d_name.len)
+ return false;
+ return !memcmp(dentry->d_name.name, entry->name, entry->len);
+}

- if (!task)
- goto out_no_task;
+static int proc_pident_lookup_task(struct inode *dir, struct dentry *dentry,
+ const struct pid_entry *ents,
+ unsigned int nents,
+ struct task_struct *task)
+{
+ const struct pid_entry *p, *last;

/*
* Yes, it does not scale. And it should not. Don't add
@@ -2473,18 +2482,30 @@ static struct dentry *proc_pident_lookup(struct inode *dir,
*/
last = &ents[nents];
for (p = ents; p < last; p++) {
- if (p->len != dentry->d_name.len)
- continue;
- if (!memcmp(dentry->d_name.name, p->name, p->len))
+ if (pid_entry_match_dentry(p, dentry))
break;
}
if (p >= last)
- goto out;
+ return -ENOENT;
+
+ return proc_pident_instantiate(dir, dentry, task, p);
+}
+
+static struct dentry *proc_pident_lookup(struct inode *dir,
+ struct dentry *dentry,
+ const struct pid_entry *ents,
+ unsigned int nents)
+{
+ struct task_struct *task;
+ int error = -ENOENT;
+
+ task = get_proc_task(dir);
+ if (task) {
+ error = proc_pident_lookup_task(dir, dentry, ents,
+ nents, task);
+ put_task_struct(task);
+ }

- error = proc_pident_instantiate(dir, dentry, task, p);
-out:
- put_task_struct(task);
-out_no_task:
return ERR_PTR(error);
}

@@ -2504,8 +2525,7 @@ static int proc_pident_readdir(struct file *file, struct dir_context *ctx,
goto out;

for (p = ents + (ctx->pos - 2); p < ents + nents; p++) {
- if (!proc_fill_cache(file, ctx, p->name, p->len,
- proc_pident_instantiate, task, p))
+ if (!proc_fill_cache_entry(file, ctx, p, task))
break;
ctx->pos++;
}
--
2.12.3


2018-04-24 02:23:19

by Jeff Mahoney

Subject: [PATCH 2/5] procfs: factor out inode revalidation work from pid_revalidation

From: Jeff Mahoney <[email protected]>

Subsequent patches will introduce their own d_revalidate callbacks and we
won't want to get the task multiple times.

Signed-off-by: Jeff Mahoney <[email protected]>
---
fs/proc/base.c | 21 ++++++++++++++-------
1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e9876a89a5ad..e7ca45504a5f 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1802,14 +1802,24 @@ int pid_getattr(const struct path *path, struct kstat *stat,

/* dentry stuff */

+
+/*
+ * Rewrite the inode's ownerships here because the owning task may have
+ * performed a setuid(), etc.
+ */
+static void pid_revalidate_inode(struct inode *inode, struct task_struct *task)
+{
+ task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
+
+ inode->i_mode &= ~(S_ISUID | S_ISGID);
+ security_task_to_inode(task, inode);
+}
+
/*
* Exceptional case: normally we are not allowed to unhash a busy
* directory. In this case, however, we can do it - no aliasing problems
* due to the way we treat inodes.
*
- * Rewrite the inode's ownerships here because the owning task may have
- * performed a setuid(), etc.
- *
*/
int pid_revalidate(struct dentry *dentry, unsigned int flags)
{
@@ -1823,10 +1833,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags)
task = get_proc_task(inode);

if (task) {
- task_dump_owner(task, inode->i_mode, &inode->i_uid, &inode->i_gid);
-
- inode->i_mode &= ~(S_ISUID | S_ISGID);
- security_task_to_inode(task, inode);
+ pid_revalidate_inode(inode, task);
put_task_struct(task);
return 1;
}
--
2.12.3


2018-04-24 02:23:43

by Jeff Mahoney

Subject: [PATCH 4/5] procfs: share common directories between /proc/tgid and /proc/tgid/task/tgid

From: Jeff Mahoney <[email protected]>

The contents of /proc/tgid and /proc/tgid/task/tgid are nearly, but not
completely, identical. There are files (and directories) contained in
only one or the other, so we can't replace one with a symbolic link to
the other. Creating links for the common files doesn't gain us anything,
since we'd still need the dentries and inodes for them. What we can do
is create links for the common subdirectories so the files below them
are shared.

This patch converts the fd, fdinfo, ns, net, and attr directories at the
/proc/tgid level to symbolic links that point to the directories in
/proc/tgid/task/tgid.

Signed-off-by: Jeff Mahoney <[email protected]>
---
fs/proc/base.c | 44 +++++++++++++++++++++++++++++++++++++++-----
1 file changed, 39 insertions(+), 5 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index de12bd2137ac..005b4f8a19c2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -140,6 +140,9 @@ struct pid_entry {
NOD(NAME, (S_IFLNK|S_IRWXUGO), \
&proc_pid_link_inode_operations, NULL, \
{ .proc_get_link = get_link } )
+#define TASKLNK(NAME) \
+ NOD(NAME, (S_IFLNK|S_IRWXUGO), \
+ &proc_tgid_to_pid_link_inode_operations, NULL, {})
#define REG(NAME, MODE, fops) \
NOD(NAME, (S_IFREG|(MODE)), NULL, &fops, {})
#define ONE(NAME, MODE, show) \
@@ -2941,6 +2944,37 @@ static int proc_pid_patch_state(struct seq_file *m, struct pid_namespace *ns,
}
#endif /* CONFIG_LIVEPATCH */

+static const char *proc_tgid_to_pid_symlink_get_link(struct dentry *dentry,
+ struct inode *inode,
+ struct delayed_call *done)
+{
+ struct task_struct *task;
+ char *link = ERR_PTR(-ENOENT);
+
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
+ task = get_proc_task(inode);
+ if (task) {
+ struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+ link = kasprintf(GFP_KERNEL, "task/%u/%.*s",
+ pid_nr_ns(PROC_I(inode)->pid, ns),
+ dentry->d_name.len, dentry->d_name.name);
+ if (link)
+ set_delayed_call(done, kfree_link, link);
+ else
+ link = ERR_PTR(-ENOMEM);
+ put_task_struct(task);
+ }
+ return link;
+}
+
+static const struct inode_operations proc_tgid_to_pid_link_inode_operations = {
+ .get_link = proc_tgid_to_pid_symlink_get_link,
+ .setattr = proc_setattr,
+};
+
/*
* Thread groups
*/
@@ -2948,12 +2982,12 @@ static const struct file_operations proc_task_operations;
static const struct inode_operations proc_task_inode_operations;

static const struct pid_entry tgid_base_stuff[] = {
- DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
+ TASKLNK("fd"),
DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
- DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
- DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
+ TASKLNK("fdinfo"),
+ TASKLNK("ns"),
#ifdef CONFIG_NET
- DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
+ TASKLNK("net"),
#endif
REG("environ", S_IRUSR, proc_environ_operations),
REG("auxv", S_IRUSR, proc_auxv_operations),
@@ -2991,7 +3025,7 @@ static const struct pid_entry tgid_base_stuff[] = {
REG("pagemap", S_IRUSR, proc_pagemap_operations),
#endif
#ifdef CONFIG_SECURITY
- DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
+ TASKLNK("attr"),
#endif
#ifdef CONFIG_KALLSYMS
ONE("wchan", S_IRUGO, proc_pid_wchan),
--
2.12.3


2018-04-24 02:24:07

by Jeff Mahoney

Subject: [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared

From: Jeff Mahoney <[email protected]>

When we have a single task with e.g. 4096 threads and 16k files open,
we can create over 134 million inode and dentry pairs just to back the fd
and fdinfo directories. On smaller systems, memory pressure keeps the
number relatively contained. On huge systems, all of these can fit
in memory. The wasted memory is a problem, but the real problem is
what happens when that task exits. Every task attempts to free its
own proc files, and we end up with a system that becomes unresponsive
for several minutes due to contention on the super's inode list lock.

The thing is, except for threads that have called unshare(CLONE_FILES),
the contents of every one of these directories is identical and comes
from the same files_struct.

This patch uses symbolic links to the thread group leader's fd and
fdinfo directories if the thread and the group leader have the same
files_struct. If the thread calls unshare(CLONE_FILES), the
d_revalidate callback will bounce the symlink and the lookup will
create a directory. If it's the thread group leader that calls
unshare, no symlinks will be used.

In the 4096 threads * 16k files case, the total procfs load is
about 600k files instead of 134M.
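
For illustration, the flip can be observed from userspace along these
lines (a sketch only; the "expect" comments assume this patch is
applied):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

static void show_fd_type(const char *when)
{
	char path[64];
	struct stat st;

	snprintf(path, sizeof(path), "/proc/self/task/%ld/fd",
		 (long)syscall(SYS_gettid));
	if (!lstat(path, &st))
		printf("%s: %s is a %s\n", when, path,
		       S_ISLNK(st.st_mode) ? "symlink" : "directory");
}

static void *worker(void *arg)
{
	show_fd_type("before unshare");	/* expect: symlink */
	unshare(CLONE_FILES);		/* stop sharing ->files */
	show_fd_type("after unshare");	/* expect: directory */
	return NULL;
}

int main(void)
{
	pthread_t t;

	pthread_create(&t, NULL, worker, NULL);
	pthread_join(t, NULL);
	return 0;
}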

Signed-off-by: Jeff Mahoney <[email protected]>
---
fs/proc/base.c | 242 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 231 insertions(+), 11 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 005b4f8a19c2..dbdc2f9b2c58 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -122,6 +122,7 @@ struct pid_entry {
umode_t mode;
const struct inode_operations *iop;
const struct file_operations *fop;
+ const struct dentry_operations *dop;
union proc_op op;
};

@@ -2438,6 +2439,7 @@ static const struct file_operations proc_pid_set_timerslack_ns_operations = {
static int proc_pident_instantiate(struct inode *dir,
struct dentry *dentry, struct task_struct *task, const void *ptr)
{
+ const struct dentry_operations *dops = &pid_dentry_operations;
const struct pid_entry *p = ptr;
struct inode *inode;
struct proc_inode *ei;
@@ -2454,7 +2456,9 @@ static int proc_pident_instantiate(struct inode *dir,
if (p->fop)
inode->i_fop = p->fop;
ei->op = p->op;
- d_set_d_op(dentry, &pid_dentry_operations);
+ if (p->dop)
+ dops = p->dop;
+ d_set_d_op(dentry, dops);
d_add(dentry, inode);
/* Close the race of the process dying before we return the dentry */
if (pid_revalidate(dentry, 0))
@@ -3482,12 +3486,136 @@ static const struct inode_operations proc_tid_comm_inode_operations = {
.permission = proc_tid_comm_permission,
};

+static const char *proc_pid_sibling_get_link(struct dentry *dentry,
+ struct inode *inode,
+ struct delayed_call *done)
+{
+ struct task_struct *task;
+ char *link = ERR_PTR(-ENOENT);
+
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
+ task = get_proc_task(inode);
+ if (task) {
+ struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+ link = kasprintf(GFP_KERNEL, "../%u/%.*s",
+ pid_nr_ns(task_tgid(task), ns),
+ dentry->d_name.len,
+ dentry->d_name.name);
+ if (link)
+ set_delayed_call(done, kfree_link, link);
+ else
+ link = ERR_PTR(-ENOMEM);
+
+ put_task_struct(task);
+ }
+
+ return link;
+}
+
+static const struct inode_operations proc_pid_sibling_symlink_inode_operations = {
+ .get_link = proc_pid_sibling_get_link,
+ .setattr = proc_setattr,
+};
+
+static bool tasks_share_files(const struct task_struct *task)
+{
+ return task->files == task->group_leader->files;
+}
+
+static int proc_pid_files_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct task_struct *task;
+ struct inode *inode;
+ int ret = 1;
+
+ if (flags & LOOKUP_RCU)
+ return -ECHILD;
+
+ inode = d_inode(dentry);
+ task = get_proc_task(inode);
+ if (!task)
+ return -ENOENT;
+
+ pid_revalidate_inode(inode, task);
+
+ /*
+ * This thread called unshare(CLONE_FILES).
+ * We need to turn it into a directory.
+ */
+ if (!thread_group_leader(task) && (inode->i_mode & S_IFLNK) &&
+ !tasks_share_files(task))
+ ret = 0;
+
+ put_task_struct(task);
+ return ret;
+}
+
+/*
+ * This only gets used with the symbolic links. Once converted to a
+ * directory, there's no more work to do beyond pid_revalidate_inode, so
+ * we just use the regular pid_dentry_operations.
+ */
+const struct dentry_operations proc_pid_files_link_dentry_operations = {
+ .d_revalidate = proc_pid_files_revalidate,
+ .d_delete = pid_delete_dentry,
+};
+
+static const struct pid_entry proc_pid_fd_dir_entry = {
+ .name = "fd",
+ .len = sizeof("fd") - 1,
+ .mode = S_IFDIR|S_IRUSR|S_IXUSR,
+ .iop = &proc_fd_inode_operations,
+ .fop = &proc_fd_operations,
+};
+
+static const struct pid_entry proc_pid_fd_link_entry = {
+ .name = "fd",
+ .len = sizeof("fd") - 1,
+ .mode = S_IFLNK|S_IRWXUGO,
+ .iop = &proc_pid_sibling_symlink_inode_operations,
+ .dop = &proc_pid_files_link_dentry_operations
+};
+
+static const struct pid_entry *proc_pid_fd_pid_entry(struct task_struct *task)
+{
+ if (thread_group_leader(task) || !tasks_share_files(task))
+ return &proc_pid_fd_dir_entry;
+ else
+ return &proc_pid_fd_link_entry;
+}
+
+static const struct pid_entry proc_pid_fdinfo_dir_entry = {
+ .name = "fdinfo",
+ .len = sizeof("fdinfo") - 1,
+ .mode = S_IFDIR|S_IRUSR|S_IXUSR,
+ .iop = &proc_fdinfo_inode_operations,
+ .fop = &proc_fdinfo_operations,
+};
+
+static const struct pid_entry proc_pid_fdinfo_link_entry = {
+ .name = "fdinfo",
+ .len = sizeof("fdinfo") - 1,
+ .mode = S_IFLNK|S_IRWXUGO,
+ .iop = &proc_pid_sibling_symlink_inode_operations,
+ .dop = &proc_pid_files_link_dentry_operations
+};
+
+static const struct pid_entry *proc_pid_fdinfo_pid_entry(
+ struct task_struct *task)
+{
+ if (thread_group_leader(task) || !tasks_share_files(task))
+ return &proc_pid_fdinfo_dir_entry;
+ else
+ return &proc_pid_fdinfo_link_entry;
+}
+
/*
* Tasks
*/
static const struct pid_entry tid_base_stuff[] = {
- DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
- DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
DIR("ns", S_IRUSR|S_IXUGO, proc_ns_dir_inode_operations, proc_ns_dir_operations),
#ifdef CONFIG_NET
DIR("net", S_IRUGO|S_IXUGO, proc_net_inode_operations, proc_net_operations),
@@ -3579,14 +3707,71 @@ static const struct pid_entry tid_base_stuff[] = {

static int proc_tid_base_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
- tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
+ const struct pid_entry *entry;
+ struct task_struct *task = get_proc_task(file_inode(file));
+ int i;
+
+ if (!task)
+ return -ENOENT;
+
+ if (!dir_emit_dots(file, ctx))
+ goto out;
+
+ if (ctx->pos == 2) {
+ entry = proc_pid_fd_pid_entry(task);
+
+ if (!proc_fill_cache_entry(file, ctx, entry, task))
+ goto out;
+ ctx->pos++;
+ }
+
+ if (ctx->pos == 3) {
+ entry = proc_pid_fdinfo_pid_entry(task);
+
+ if (!proc_fill_cache_entry(file, ctx, entry, task))
+ goto out;
+ ctx->pos++;
+ }
+
+ for (i = ctx->pos - 4; i < ARRAY_SIZE(tid_base_stuff); i++) {
+ entry = &tid_base_stuff[i];
+
+ if (!proc_fill_cache_entry(file, ctx, entry, task))
+ goto out;
+ ctx->pos++;
+ }
+
+out:
+ put_task_struct(task);
+ return 0;
}

-static struct dentry *proc_tid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+static struct dentry *proc_tid_base_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
{
- return proc_pident_lookup(dir, dentry,
- tid_base_stuff, ARRAY_SIZE(tid_base_stuff));
+ struct task_struct *task;
+ int error;
+
+ task = get_proc_task(dir);
+ if (!task)
+ return ERR_PTR(-ENOENT);
+
+ /* /proc/pid/task/pid/fd */
+ if (pid_entry_match_dentry(&proc_pid_fd_dir_entry, dentry))
+ error = proc_pident_instantiate(dir, dentry, task,
+ proc_pid_fd_pid_entry(task));
+ /* /proc/pid/task/pid/fdinfo */
+ else if (pid_entry_match_dentry(&proc_pid_fdinfo_dir_entry, dentry))
+ error = proc_pident_instantiate(dir, dentry, task,
+ proc_pid_fdinfo_pid_entry(task));
+ else
+ error = proc_pident_lookup_task(dir, dentry, tid_base_stuff,
+ ARRAY_SIZE(tid_base_stuff),
+ task);
+
+ put_task_struct(task);
+ return ERR_PTR(error);
}

static const struct file_operations proc_tid_base_operations = {
@@ -3601,6 +3786,42 @@ static const struct inode_operations proc_tid_base_inode_operations = {
.setattr = proc_setattr,
};

+static int proc_task_count_links(struct task_struct *task)
+{
+ int nlinks = nlink_tid;
+
+ /* Shared files: symlinks for fd and fdinfo */
+ if (!thread_group_leader(task) && tasks_share_files(task))
+ nlinks++;
+
+ return nlinks;
+}
+
+static int tid_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct inode *inode;
+ struct task_struct *task;
+
+ if (flags & LOOKUP_RCU)
+ return -ECHILD;
+
+ inode = d_inode(dentry);
+ task = get_proc_task(inode);
+
+ if (task) {
+ pid_revalidate_inode(inode, task);
+ set_nlink(inode, proc_task_count_links(task));
+ put_task_struct(task);
+ return 1;
+ }
+ return 0;
+}
+
+static const struct dentry_operations proc_tid_dentry_operations = {
+ .d_revalidate = tid_revalidate,
+ .d_delete = pid_delete_dentry,
+};
+
static int proc_task_instantiate(struct inode *dir,
struct dentry *dentry, struct task_struct *task, const void *ptr)
{
@@ -3613,9 +3834,8 @@ static int proc_task_instantiate(struct inode *dir,
inode->i_fop = &proc_tid_base_operations;
inode->i_flags|=S_IMMUTABLE;

- set_nlink(inode, nlink_tid);
-
- d_set_d_op(dentry, &pid_dentry_operations);
+ set_nlink(inode, proc_task_count_links(task));
+ d_set_d_op(dentry, &proc_tid_dentry_operations);

d_add(dentry, inode);
/* Close the race of the process dying before we return the dentry */
--
2.12.3


2018-04-24 02:24:45

by Jeff Mahoney

Subject: [PATCH 3/5] procfs: use symlinks for /proc/<pid>/task when not thread group leader

From: Jeff Mahoney <[email protected]>

Although readdir only lists thread group leaders at the tgid level of
/proc, it's possible to do a lookup to get individual threads back. The
resulting thread directory contains all of the usual tgid-level files
and directories, including task. The task directory contains directories
for every sibling thread, populated with the usual complement of files,
all of which are identical to the files contained under the tgid's own
task directory. If every thread is looked up, we'll create n^2
directories with no sharing among them. For a 3000-thread task, that's
9 million directories.

This patch avoids the duplication by retaining the tgid's copy of
the task directory and converting the other threads' task directory
to a symbolic link to the tgid's copy.

Signed-off-by: Jeff Mahoney <[email protected]>
---
fs/proc/base.c | 123 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 116 insertions(+), 7 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index e7ca45504a5f..de12bd2137ac 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2948,7 +2948,6 @@ static const struct file_operations proc_task_operations;
static const struct inode_operations proc_task_inode_operations;

static const struct pid_entry tgid_base_stuff[] = {
- DIR("task", S_IRUGO|S_IXUGO, proc_task_inode_operations, proc_task_operations),
DIR("fd", S_IRUSR|S_IXUSR, proc_fd_inode_operations, proc_fd_operations),
DIR("map_files", S_IRUSR|S_IXUSR, proc_map_files_inode_operations, proc_map_files_operations),
DIR("fdinfo", S_IRUSR|S_IXUSR, proc_fdinfo_inode_operations, proc_fdinfo_operations),
@@ -3047,10 +3046,96 @@ static const struct pid_entry tgid_base_stuff[] = {
#endif
};

+/*
+ * Don't instantiate a full duplicate of the thread leader's task
+ * directory for every member of the task group. Just symlink to the
+ * thread leader's copy.
+ */
+static const char *proc_tgid_task_symlink_get_link(struct dentry *dentry,
+ struct inode *inode,
+ struct delayed_call *done)
+{
+ struct task_struct *task;
+ char *link = ERR_PTR(-ENOENT);
+
+ if (!dentry)
+ return ERR_PTR(-ECHILD);
+
+ task = get_proc_task(inode);
+ if (task) {
+ struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+ link = kasprintf(GFP_KERNEL, "../%u/task",
+ pid_nr_ns(task_tgid(task), ns));
+ if (link)
+ set_delayed_call(done, kfree_link, link);
+ else
+ link = ERR_PTR(-ENOMEM);
+ put_task_struct(task);
+ }
+ return link;
+}
+
+static const struct inode_operations proc_task_symlink_inode_operations = {
+ .get_link = proc_tgid_task_symlink_get_link,
+ .setattr = proc_setattr,
+};
+
+static const struct pid_entry proc_tgid_task_symlink_entry = {
+ .name = "task",
+ .len = sizeof("task") - 1,
+ .mode = S_IFLNK|S_IRWXUGO,
+ .iop = &proc_task_symlink_inode_operations,
+};
+
+static const struct pid_entry proc_tgid_task_dir_entry = {
+ .name = "task",
+ .len = sizeof("task") - 1,
+ .mode = S_IFDIR|S_IRUGO|S_IXUGO,
+ .iop = &proc_task_inode_operations,
+ .fop = &proc_task_operations,
+};
+
+static const struct pid_entry *proc_tgid_task_entry(struct task_struct *task)
+{
+ if (thread_group_leader(task))
+ return &proc_tgid_task_dir_entry;
+ else
+ return &proc_tgid_task_symlink_entry;
+}
+
static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx)
{
- return proc_pident_readdir(file, ctx,
- tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
+ const struct pid_entry *entry;
+ struct task_struct *task;
+ int i;
+
+ task = get_proc_task(file_inode(file));
+ if (!task)
+ return -ENOENT;
+
+ if (!dir_emit_dots(file, ctx))
+ goto out;
+
+ /* Add /proc/pid/task entry */
+ if (ctx->pos == 2) {
+ entry = proc_tgid_task_entry(task);
+
+ if (!proc_fill_cache_entry(file, ctx, entry, task))
+ goto out;
+ ctx->pos++;
+ }
+
+ for (i = ctx->pos - 3; i < ARRAY_SIZE(tgid_base_stuff); i++) {
+ entry = &tgid_base_stuff[i];
+
+ if (!proc_fill_cache_entry(file, ctx, entry, task))
+ goto out;
+ ctx->pos++;
+ }
+out:
+ put_task_struct(task);
+ return 0;
}

static const struct file_operations proc_tgid_base_operations = {
@@ -3059,10 +3144,29 @@ static const struct file_operations proc_tgid_base_operations = {
.llseek = generic_file_llseek,
};

-static struct dentry *proc_tgid_base_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
+
+static struct dentry *proc_tgid_base_lookup(struct inode *dir,
+ struct dentry *dentry,
+ unsigned int flags)
{
- return proc_pident_lookup(dir, dentry,
- tgid_base_stuff, ARRAY_SIZE(tgid_base_stuff));
+ struct task_struct *task;
+ int error = -ENOENT;
+
+ task = get_proc_task(dir);
+ if (!task)
+ goto out;
+
+ /* Handle /proc/pid/task separately */
+ if (pid_entry_match_dentry(&proc_tgid_task_dir_entry, dentry))
+ error = proc_pident_instantiate(dir, dentry, task,
+ proc_tgid_task_entry(task));
+ else
+ error = proc_pident_lookup_task(dir, dentry, tgid_base_stuff,
+ ARRAY_SIZE(tgid_base_stuff),
+ task);
+ put_task_struct(task);
+out:
+ return ERR_PTR(error);
}

static const struct inode_operations proc_tgid_base_inode_operations = {
@@ -3163,6 +3267,7 @@ static int proc_pid_instantiate(struct inode *dir,
struct task_struct *task, const void *ptr)
{
struct inode *inode;
+ int nlinks = nlink_tgid;

inode = proc_pid_make_inode(dir->i_sb, task, S_IFDIR | S_IRUGO | S_IXUGO);
if (!inode)
@@ -3172,7 +3277,11 @@ static int proc_pid_instantiate(struct inode *dir,
inode->i_fop = &proc_tgid_base_operations;
inode->i_flags|=S_IMMUTABLE;

- set_nlink(inode, nlink_tgid);
+ /* The group leader has a directory */
+ if (thread_group_leader(task))
+ nlinks++;
+
+ set_nlink(inode, nlinks);

d_set_d_op(dentry, &pid_dentry_operations);

--
2.12.3


2018-04-24 06:28:57

by Alexey Dobriyan

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

On Mon, Apr 23, 2018 at 10:21:01PM -0400, [email protected] wrote:
> Memory pressure isn't really an issue on this machine, so we
> end up using well over 100GB for proc files.

Text files at scale!

> With these patches applied, running the same testcase, the proc_inode
> cache only gets to about 600k objects, which is about 99.7% fewer. I
> get that procfs isn't supposed to be scalable, but this is kind of
> extreme. :)

Easy stuff:
* all ->get_link hooks are broken in RCU lookup (they use GFP_KERNEL),
* "%.*s" for dentry names is probably unnecessary,
they're always NUL terminated
* kasprintf() does printing twice; since we kind of care about /proc
performance, allocate for the worst case.

* "int nlinks = nlink_tgid;"
Unsigned police.

* (inode->i_mode & S_IFLNK)
this is sketchy, S_ISLNK exists.

2018-04-24 14:17:00

by Eric W. Biederman

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

[email protected] writes:

> [...]
>
> Now I'll go put on the fireproof suit.

This needs to be tested against at least apparmor to see if this breaks
common policies. Changing files to symlinks in proc has a bad habit of
either breaking apparmor policies or userspace assumptions. Symbolic
links are unfortunately visible to userspace.

Further the proc structure is tgid/task/tid where the leaf directories
are per thread.

We more likely could get away with some magic symlinks (that would not
be user visible) rather than actual symlinks.

So I think you are probably on the right track to reduce the memory
usage but I think some more work will be needed to make it transparently
backwards compatible.

Eric

2018-04-24 15:45:06

by kernel test robot

Subject: [RFC PATCH] procfs: proc_pid_files_link_dentry_operations can be static


Fixes: fc3babee0341 ("procfs: share fd/fdinfo with thread group leader when files are shared")
Signed-off-by: Fengguang Wu <[email protected]>
---
base.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 98a847b..deb0950 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3565,7 +3565,7 @@ static int proc_pid_files_revalidate(struct dentry *dentry, unsigned int flags)
* directory, there's no more work to do beyond pid_revalidate_inode, so
* we just use the regular pid_dentry_operations.
*/
-const struct dentry_operations proc_pid_files_link_dentry_operations = {
+static const struct dentry_operations proc_pid_files_link_dentry_operations = {
.d_revalidate = proc_pid_files_revalidate,
.d_delete = pid_delete_dentry,
};

2018-04-24 15:45:06

by kernel test robot

Subject: Re: [PATCH 5/5] procfs: share fd/fdinfo with thread group leader when files are shared

Hi Jeff,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v4.17-rc2 next-20180424]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/jeffm-suse-com/procfs-reduce-duplication-by-using-symlinks/20180424-203620
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:285:34: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:331:42: sparse: expression using sizeof(void)
fs/proc/base.c:834:32: sparse: expression using sizeof(void)
fs/proc/base.c:950:27: sparse: expression using sizeof(void)
fs/proc/base.c:951:28: sparse: expression using sizeof(void)
fs/proc/base.c:951:28: sparse: expression using sizeof(void)
fs/proc/base.c:2085:25: sparse: cast to restricted fmode_t
fs/proc/base.c:2140:42: sparse: cast from restricted fmode_t
fs/proc/base.c:2243:48: sparse: cast from restricted fmode_t
>> fs/proc/base.c:3568:32: sparse: symbol 'proc_pid_files_link_dentry_operations' was not declared. Should it be static?
fs/proc/base.c:1078:36: sparse: context imbalance in '__set_oom_adj' - unexpected unlock
fs/proc/base.c:2271:13: sparse: context imbalance in 'timers_start' - wrong count at exit
fs/proc/base.c:2297:36: sparse: context imbalance in 'timers_stop' - unexpected unlock

Please review and possibly fold the followup patch.

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

2018-04-25 18:05:57

by Jeff Mahoney

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

On 4/24/18 2:17 AM, Alexey Dobriyan wrote:
> On Mon, Apr 23, 2018 at 10:21:01PM -0400, [email protected] wrote:
>> Memory pressure isn't really an issue on this machine, so we
>> end up using well over 100GB for proc files.
>
> Text files at scale!
>
>> With these patches applied, running the same testcase, the proc_inode
>> cache only gets to about 600k objects, which is about 99.7% fewer. I
>> get that procfs isn't supposed to be scalable, but this is kind of
>> extreme. :)
>
> Easy stuff:
> * all ->get_link hooks are broken in RCU lookup (they use GFP_KERNEL),

It's a pretty common pattern in the kernel, but it's just as easy to set
inode->i_link during instantiation and keep RCU lookup. There aren't
enough of these for it to be a real burden on memory.
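
Something along these lines at instantiate time (a sketch only, reusing
the task symlink from patch 3; task, inode, and ns as in that patch, and
freeing i_link at inode eviction not shown):

	char *link;

	/* We can sleep here, so GFP_KERNEL is fine outside RCU walk. */
	/* Worst case: "../" + 10-digit pid + "/task" + NUL */
	link = kmalloc(3 + 10 + sizeof("/task"), GFP_KERNEL);
	if (!link)
		return -ENOMEM;
	sprintf(link, "../%u/task", pid_nr_ns(task_tgid(task), ns));

	inode->i_link = link;	/* RCU path walk reads i_link directly */
	inode->i_op = &simple_symlink_inode_operations;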

> * "%.*s" for dentry names is probably unnecessary,
> they're always NUL terminated

Ack.

> * kasprintf() does printing twice; since we kind of care about /proc
> performance, allocate for the worst case.

Ack, integrated with ->get_link fix.

> * "int nlinks = nlink_tgid;"
> Unsigned police.

Ack. nlink_t{,g}id are both u8, but it's easy to make it consistent.

> * (inode->i_mode & S_IFLNK)
> this is sketchy, S_ISLNK exists.
>

Ack.

Notes of my own:

proc_task_count_links also had the logic backward. It would add an
extra link to the count for the symlink rather than the dir.
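
Roughly this (an untested sketch of the corrected helper, assuming both
the fd and fdinfo directories contribute to the parent's link count when
they aren't symlinks):

static unsigned int proc_task_count_links(struct task_struct *task)
{
	unsigned int nlinks = nlink_tid;

	/* fd and fdinfo are subdirectories only in the non-shared case */
	if (thread_group_leader(task) || !tasks_share_files(task))
		nlinks += 2;

	return nlinks;
}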

proc_pid_files_revalidate only needs to check if the tasks share files
since it won't be called if it's not a symlink.

Thanks for the review,

-Jeff

--
Jeff Mahoney
SUSE Labs

2018-04-26 21:05:12

by Jeff Mahoney

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

On 4/24/18 10:14 AM, Eric W. Biederman wrote:
> [email protected] writes:
>
>> [...]
>>
>> Now I'll go put on the fireproof suit.
>
> This needs to be tested against at least apparmor to see if this breaks
> common policies. Changing files to symlinks in proc has a bad habit of
> either breaking apparmor policies or userspace assumptions. Symbolic
> links are unfortunately visible to userspace.

That's my biggest concern as well.

> Further the proc structure is tgid/task/tid where the leaf directories
> are per thread.
>
> We more likely could get away with some magic symlinks (that would not
> be user visible) rather than actual symlinks.

I'd considered that, but we'll need to instantiate other portions of the
tree in order to use those dentries as magic symlink targets. That
seemed like it might not fly, so I went with the regular symlink approach.

> So I think you are probably on the right track to reduce the memory
> usage but I think some more work will be needed to make it transparently
> backwards compatible.

Thanks for the review,

-Jeff

--
Jeff Mahoney
SUSE Labs

2019-03-21 18:31:24

by Jeff Mahoney

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

On 4/24/18 10:14 AM, Eric W. Biederman wrote:
> [email protected] writes:
>
>> [...]
>>
>> Now I'll go put on the fireproof suit.

Thanks for your comments. This ended up having to get back-burnered but
I've finally found some time to get back to it. I have new patches that
don't treat each entry as a special case and make more sense, IMO.
They're not worth posting yet since some of the issues below remain.

> This needs to be tested against at least apparmor to see if this breaks
> common policies. Changing files to symlinks in proc has a bad habit of
> either breaking apparmor policies or userspace assumptions. Symbolic
> links are unfortunately visible to userspace.

AppArmor uses the @{pids} var in profiles that translates to a numeric
regex. That means that /proc/pid/task -> /proc/tgid/task won't break
profiles but /proc/pid/fdinfo -> /proc/pid/task/tgid/fdinfo will break.
Apparmor doesn't have a follow_link hook at all, so all that matters is
the final path. SELinux does have a follow_link hook, but I'm not
familiar enough with it to know whether introducing a symlink in proc
will make a difference.
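
For example, a hypothetical rule like

  owner /proc/@{pids}/fdinfo/** r,

matches today, but with fdinfo as a symlink the access would resolve to
/proc/@{pids}/task/@{pids}/fdinfo/** and the rule would no longer apply.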

I've dropped the /proc/pid/{dirs} -> /proc/pid/task/pid/{dirs} part
since that clearly won't work.

> Further the proc structure is tgid/task/tid where the leaf directories
> are per thread.

Yes, but threads are still in /proc for lookup at the tgid level even if
they don't show up in readdir.

> We more likely could get away with some magic symlinks (that would not
> be user visible) rather than actual symlinks.

I think I'm missing something here. Aren't magic symlinks still
represented to the user as symlinks?

> So I think you are probably on the right track to reduce the memory
> usage but I think some more work will be needed to make it transparently
> backwards compatible.

Yeah, that's going to be the big hiccup. I think I've resolved the
biggest issue with AppArmor, but I don't think the problem is solvable
without introducing symlinks.

-Jeff

--
Jeff Mahoney
SUSE Labs



2019-03-23 15:57:36

by Eric W. Biederman

Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

Jeff Mahoney <[email protected]> writes:

> On 4/24/18 10:14 AM, Eric W. Biederman wrote:
>> [email protected] writes:
>>
>>> [...]
>>>
>>> Now I'll go put on the fireproof suit.
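(To make item 3 above concrete: a minimal sketch of what such a
d_revalidate callback could look like, reconstructed from the
description rather than taken from the posted patches. get_proc_task()
is the existing procfs helper; locking around the group leader's
->files is glossed over here.)

/* Valid only while the thread still shares ->files with its group
 * leader; after unshare(CLONE_FILES) the dentry is invalidated and
 * the next lookup instantiates a real directory in its place. */
static int proc_fd_link_revalidate(struct dentry *dentry, unsigned int flags)
{
	struct task_struct *task;
	int shared = 0;

	if (flags & LOOKUP_RCU)
		return -ECHILD;

	task = get_proc_task(d_inode(dentry));
	if (!task)
		return 0;	/* task is gone; drop the dentry */

	task_lock(task);
	shared = (task->files == task->group_leader->files);
	task_unlock(task);
	put_task_struct(task);

	return shared;	/* 0 => invalid, next lookup creates a directory */
}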
>
> Thanks for your comments. This ended up having to get back-burnered, but
> I've finally found some time to get back to it. I have new patches that
> don't treat each entry as a special case and make more sense, IMO.
> They're not worth posting yet since some of the issues below remain.
>
>> This needs to be tested against at least apparmor to see if this breaks
>> common policies. Changing files to symlinks in proc has a bad habit of
>> either breaking apparmor policies or userspace assumptions. Symbolic
>> links are unfortunately visible to userspace.
>
> AppArmor profiles use the @{pids} variable, which translates to a
> numeric regex. That means /proc/pid/task -> /proc/tgid/task won't break
> profiles, but /proc/pid/fdinfo -> /proc/pid/task/tgid/fdinfo will.
> AppArmor doesn't have a follow_link hook at all, so all that matters is
> the final path. SELinux does have a follow_link hook, but I'm not
> familiar enough with it to know whether introducing a symlink in proc
> will make a difference.
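(A toy illustration of that matching logic, using POSIX regexes in a
standalone program -- this is not AppArmor's actual matcher, and the
rule string is only a stand-in for what a /proc/@{pids}/fdinfo/* rule
roughly expands to:)

#include <regex.h>
#include <stdio.h>

int main(void)
{
	/* stand-in for the profile rule after @{pids} expansion */
	const char *rule = "^/proc/[0-9]+/fdinfo/[0-9]+$";
	const char *paths[] = {
		"/proc/915/fdinfo/3",		/* still matches */
		"/proc/915/task/910/fdinfo/3",	/* resolved symlink: no match */
	};
	regex_t re;

	if (regcomp(&re, rule, REG_EXTENDED | REG_NOSUB))
		return 1;
	for (int i = 0; i < 2; i++)
		printf("%-30s %s\n", paths[i],
		       regexec(&re, paths[i], 0, NULL, 0) ? "no match" : "match");
	regfree(&re);
	return 0;
}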
>
> I've dropped the /proc/pid/{dirs} -> /proc/pid/task/pid/{dirs} part
> since that clearly won't work.
>
>> Further the proc structure is tgid/task/tid where the leaf directories
>> are per thread.
>
> Yes, but threads are still in /proc for lookup at the tgid level even if
> they don't show up in readdir.
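(A standalone sketch demonstrating this; pass it the tid of a
non-leader thread, e.g. one reported by ps -eLf:)

#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
	char path[64];
	struct stat st;
	struct dirent *de;
	DIR *d;
	int listed = 0;

	if (argc < 2 || !(d = opendir("/proc")))
		return 1;

	/* the tid never shows up in a readdir of /proc ... */
	while ((de = readdir(d)))
		if (!strcmp(de->d_name, argv[1]))
			listed = 1;
	closedir(d);

	/* ... but a direct lookup of /proc/<tid> still succeeds */
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	printf("listed: %d  direct lookup: %d\n",
	       listed, stat(path, &st) == 0);
	return 0;
}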
>
>> We more likely could get away with some magic symlinks (that would not
>> be user visible) rather than actual symlinks.
>
> I think I'm missing something here. Aren't magic symlinks still
> represented to the user as symlinks?
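(For reference, /proc/self already behaves this way: its target is
computed per-reader, yet userspace still sees an ordinary symlink,
which is the crux of the question. A minimal check:)

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	struct stat st;
	char buf[64];
	ssize_t n;

	/* lstat() reports S_IFLNK even though the target is computed */
	if (lstat("/proc/self", &st) == 0)
		printf("S_ISLNK: %d\n", S_ISLNK(st.st_mode));

	n = readlink("/proc/self", buf, sizeof(buf) - 1);
	if (n >= 0) {
		buf[n] = '\0';
		printf("/proc/self -> %s\n", buf);
	}
	return 0;
}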
>
>> So I think you are probably on the right track to reduce the memory
>> usage but I think some more work will be needed to make it transparently
>> backwards compatible.
>
> Yeah, that's going to be the big hiccup. I think I've resolved the
> biggest issue with AppArmor, but I don't think the problem is solvable
> without introducing symlinks.

Has anyone looked at making the fd and fdinfo files hard links?

Alternatively, it may make sense to see whether there is something we
can do with the locking to reduce the thundering herd problem that is
being seen.

Eric


2019-03-24 03:03:37

by Jeff Mahoney

[permalink] [raw]
Subject: Re: [RFC] [PATCH 0/5] procfs: reduce duplication by using symlinks

On 3/23/19 11:56 AM, Eric W. Biederman wrote:
> Jeff Mahoney <[email protected]> writes:
>
> [... full quote of the April 2018 exchange snipped; see the previous
> message in the thread ...]
>
> Has anyone looked at making the fd and fdinfo files hard links?

That could work to a certain degree. It would certainly reduce the
inode count, but it would still create all the dentries. That's still
an n^2 problem, where n is the number of threads in the group.
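(Back-of-the-envelope, using the cover letter's 4096-thread /
16384-file test case: 4096 * 16384 ~= 67 million dentries per fd view,
and with fd plus fdinfo at both the tgid and task level that's ~4
views, i.e. around the ~300 million pairs originally measured, whether
or not the inodes behind them are shared.)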

> Alternatively, it may make sense to see whether there is something we
> can do with the locking to reduce the thundering herd problem that is
> being seen.

Yeah, that could still use some attention. The thundering herd problem
is more of a tap when you reduce the contention by 99%, though.

-Jeff

--
Jeff Mahoney
SUSE Labs

