2022-09-17 00:08:05

by Ivan Babrou

Subject: [RFC] proc: report open files as size in stat() for /proc/pid/fd

Many monitoring tools include open file count as a metric. Currently
the only way to get this number is to enumerate the files in /proc/pid/fd.

The problem with the current approach is that it does a lot of work
that callers don't need when all they want is a single number for
a metric. In our tests with cadvisor, which reports open file counts
per cgroup, we observed that reading the number of open files is slow.
Of the 35.23% of CPU time spent in `proc_readfd_common`, 29.43% is
spent in `proc_fill_cache`, which is responsible for filling in dentry
info. Some of this extra time is spinlock contention, but it is
contention on a lock we don't want to take to begin with.

We considered putting the number of open files in /proc/pid/stat.
Unfortunately, counting the number of fds involves iterating the fdtable,
which means that it might slow down /proc/pid/stat for processes
with many open files. Instead we opted to put this info in /proc/pid/fd
as a size member of the stat syscall result. Previously the reported
number was zero, so there's very little risk of breaking anything,
while still providing a somewhat logical way to count the open files.

Previously:

```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 0 Blocks: 0 IO Block: 1024 directory
```

With this patch:

```
$ sudo stat /proc/1/fd | head -n2
File: /proc/1/fd
Size: 65 Blocks: 0 IO Block: 1024 directory
```

Correctness check:

```
$ sudo ls /proc/1/fd | wc -l
65
```

There are two alternatives to this approach that I can see:

* Expose /proc/pid/fd_count with a count there
* Make fd count access O(1) and expose it in /proc/pid/status

I can probably figure out how to do the former, but the latter
will require somebody with more experience in file code than myself.

Signed-off-by: Ivan Babrou <[email protected]>
---
fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 47 insertions(+)

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index 913bef0d2a36..c7ac142500a8 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
return 0;
}

+static int proc_readfd_count(struct inode *inode)
+{
+ struct task_struct *p = get_proc_task(inode);
+ unsigned int fd = 0, count = 0;
+
+ if (!p)
+ return -ENOENT;
+
+ rcu_read_lock();
+ while (task_lookup_next_fd_rcu(p, &fd)) {
+ rcu_read_unlock();
+
+ count++;
+ fd++;
+
+ cond_resched();
+ rcu_read_lock();
+ }
+ rcu_read_unlock();
+ put_task_struct(p);
+ return count;
+}
+
static int proc_readfd(struct file *file, struct dir_context *ctx)
{
return proc_readfd_common(file, ctx, proc_fd_instantiate);
@@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
return rv;
}

+int proc_fd_getattr(struct user_namespace *mnt_userns,
+ const struct path *path, struct kstat *stat,
+ u32 request_mask, unsigned int query_flags)
+{
+ struct inode *inode = d_inode(path->dentry);
+ struct proc_dir_entry *de = PDE(inode);
+
+ if (de) {
+ nlink_t nlink = READ_ONCE(de->nlink);
+
+ if (nlink > 0)
+ set_nlink(inode, nlink);
+ }
+
+ generic_fillattr(&init_user_ns, inode, stat);
+
+ /* If it's a directory, put the number of open fds there */
+ if (S_ISDIR(inode->i_mode))
+ stat->size = proc_readfd_count(inode);
+
+ return 0;
+}
+
const struct inode_operations proc_fd_inode_operations = {
.lookup = proc_lookupfd,
.permission = proc_fd_permission,
+ .getattr = proc_fd_getattr,
.setattr = proc_setattr,
};

--
2.37.2


2022-09-17 01:00:31

by Andrew Morton

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

(cc's added)

On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <[email protected]> wrote:

> Many monitoring tools include open file count as a metric. Currently
> the only way to get this number is to enumerate the files in /proc/pid/fd.
>
> The problem with the current approach is that it does a lot of work
> that callers don't need when all they want is a single number for
> a metric. In our tests with cadvisor, which reports open file counts
> per cgroup, we observed that reading the number of open files is slow.
> Of the 35.23% of CPU time spent in `proc_readfd_common`, 29.43% is
> spent in `proc_fill_cache`, which is responsible for filling in dentry
> info. Some of this extra time is spinlock contention, but it is
> contention on a lock we don't want to take to begin with.
>
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.

Documentation/filesystems/proc.rst would be an appropriate place to
document this ;)

> Previously:
>
> ```
> $ sudo stat /proc/1/fd | head -n2
> File: /proc/1/fd
> Size: 0 Blocks: 0 IO Block: 1024 directory
> ```
>
> With this patch:
>
> ```
> $ sudo stat /proc/1/fd | head -n2
> File: /proc/1/fd
> Size: 65 Blocks: 0 IO Block: 1024 directory
> ```
>
> Correctness check:
>
> ```
> $ sudo ls /proc/1/fd | wc -l
> 65
> ```
>
> There are two alternatives to this approach that I can see:
>
> * Expose /proc/pid/fd_count with a count there
> * Make fd count access O(1) and expose it in /proc/pid/status
>
> I can probably figure out how to do the former, but the latter
> will require somebody with more experience in file code than myself.
>
> Signed-off-by: Ivan Babrou <[email protected]>
> ---
> fs/proc/fd.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 47 insertions(+)
>
> diff --git a/fs/proc/fd.c b/fs/proc/fd.c
> index 913bef0d2a36..c7ac142500a8 100644
> --- a/fs/proc/fd.c
> +++ b/fs/proc/fd.c
> @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> return 0;
> }
>
> +static int proc_readfd_count(struct inode *inode)
> +{
> + struct task_struct *p = get_proc_task(inode);
> + unsigned int fd = 0, count = 0;
> +
> + if (!p)
> + return -ENOENT;
> +
> + rcu_read_lock();
> + while (task_lookup_next_fd_rcu(p, &fd)) {
> + rcu_read_unlock();
> +
> + count++;
> + fd++;
> +
> + cond_resched();
> + rcu_read_lock();
> + }
> + rcu_read_unlock();
> + put_task_struct(p);
> + return count;
> +}
> +
> static int proc_readfd(struct file *file, struct dir_context *ctx)
> {
> return proc_readfd_common(file, ctx, proc_fd_instantiate);
> @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> return rv;
> }
>
> +int proc_fd_getattr(struct user_namespace *mnt_userns,
> + const struct path *path, struct kstat *stat,
> + u32 request_mask, unsigned int query_flags)
> +{
> + struct inode *inode = d_inode(path->dentry);
> + struct proc_dir_entry *de = PDE(inode);
> +
> + if (de) {
> + nlink_t nlink = READ_ONCE(de->nlink);
> +
> + if (nlink > 0)
> + set_nlink(inode, nlink);
> + }
> +
> + generic_fillattr(&init_user_ns, inode, stat);
> +
> + /* If it's a directory, put the number of open fds there */
> + if (S_ISDIR(inode->i_mode))
> + stat->size = proc_readfd_count(inode);
> +
> + return 0;
> +}
> +
> const struct inode_operations proc_fd_inode_operations = {
> .lookup = proc_lookupfd,
> .permission = proc_fd_permission,
> + .getattr = proc_fd_getattr,
> .setattr = proc_setattr,
> };
>
> --
> 2.37.2

2022-09-17 01:03:53

by Ivan Babrou

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Fri, Sep 16, 2022 at 5:01 PM Andrew Morton <[email protected]> wrote:
>
> (cc's added)
>
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <[email protected]> wrote:
>
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> >
> > The problem with the current approach is that it does a lot of work
> > that callers don't need when all they want is a single number for
> > a metric. In our tests with cadvisor, which reports open file counts
> > per cgroup, we observed that reading the number of open files is slow.
> > Of the 35.23% of CPU time spent in `proc_readfd_common`, 29.43% is
> > spent in `proc_fill_cache`, which is responsible for filling in dentry
> > info. Some of this extra time is spinlock contention, but it is
> > contention on a lock we don't want to take to begin with.
> >
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)

I am more than happy to add the docs once there's confirmation that
this is an appropriate way to expose this information. I probably
should've mentioned this explicitly; that's on me. There are two
alternative approaches at the bottom of my original email that might
be considered.

2022-09-17 07:41:34

by Alexey Dobriyan

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> (cc's added)
>
> On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <[email protected]> wrote:
>
> > Many monitoring tools include open file count as a metric. Currently
> > the only way to get this number is to enumerate the files in /proc/pid/fd.
> >
> > The problem with the current approach is that it does a lot of work
> > that callers don't need when all they want is a single number for
> > a metric. In our tests with cadvisor, which reports open file counts
> > per cgroup, we observed that reading the number of open files is slow.
> > Of the 35.23% of CPU time spent in `proc_readfd_common`, 29.43% is
> > spent in `proc_fill_cache`, which is responsible for filling in dentry
> > info. Some of this extra time is spinlock contention, but it is
> > contention on a lock we don't want to take to begin with.
> >
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Documentation/filesystems/proc.rst would be an appropriate place to
> document this ;)
>
> > Previously:
> >
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> > File: /proc/1/fd
> > Size: 0 Blocks: 0 IO Block: 1024 directory
> > ```
> >
> > With this patch:
> >
> > ```
> > $ sudo stat /proc/1/fd | head -n2
> > File: /proc/1/fd
> > Size: 65 Blocks: 0 IO Block: 1024 directory

Yes. This is the natural place.

> > ```
> >
> > Correctness check:
> >
> > ```
> > $ sudo ls /proc/1/fd | wc -l
> > 65
> > ```
> >
> > There are two alternatives to this approach that I can see:
> >
> > * Expose /proc/pid/fd_count with a count there

> > * Make fd count access O(1) and expose it in /proc/pid/status

This is doable, next to FDSize.

Below is doable too.

> > --- a/fs/proc/fd.c
> > +++ b/fs/proc/fd.c
> > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> > return 0;
> > }
> >
> > +static int proc_readfd_count(struct inode *inode)
> > +{
> > + struct task_struct *p = get_proc_task(inode);
> > + unsigned int fd = 0, count = 0;
> > +
> > + if (!p)
> > + return -ENOENT;
> > +
> > + rcu_read_lock();
> > + while (task_lookup_next_fd_rcu(p, &fd)) {
> > + rcu_read_unlock();
> > +
> > + count++;
> > + fd++;
> > +
> > + cond_resched();
> > + rcu_read_lock();
> > + }
> > + rcu_read_unlock();
> > + put_task_struct(p);
> > + return count;
> > +}
> > +
> > static int proc_readfd(struct file *file, struct dir_context *ctx)
> > {
> > return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> > return rv;
> > }
> >
> > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > + const struct path *path, struct kstat *stat,
> > + u32 request_mask, unsigned int query_flags)
> > +{
> > + struct inode *inode = d_inode(path->dentry);
> > + struct proc_dir_entry *de = PDE(inode);
> > +
> > + if (de) {
> > + nlink_t nlink = READ_ONCE(de->nlink);
> > +
> > + if (nlink > 0)
> > + set_nlink(inode, nlink);
> > + }
> > +
> > + generic_fillattr(&init_user_ns, inode, stat);
^^^^^^^^^^^^^

Is this correct? I'm not userns guy at all.

> > +
> > + /* If it's a directory, put the number of open fds there */
> > + if (S_ISDIR(inode->i_mode))
> > + stat->size = proc_readfd_count(inode);

-ENOENT can get there. In principle this is OK; userspace can live with it.

> > const struct inode_operations proc_fd_inode_operations = {
> > .lookup = proc_lookupfd,
> > .permission = proc_fd_permission,
> > + .getattr = proc_fd_getattr,
> > .setattr = proc_setattr,

2022-09-17 16:15:13

by Theodore Ts'o

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> We considered putting the number of open files in /proc/pid/stat.
> Unfortunately, counting the number of fds involves iterating the fdtable,
> which means that it might slow down /proc/pid/stat for processes
> with many open files. Instead we opted to put this info in /proc/pid/fd
> as a size member of the stat syscall result. Previously the reported
> number was zero, so there's very little risk of breaking anything,
> while still providing a somewhat logical way to count the open files.

Instead of using the st_size of /proc/<pid>/fd, why not return that
value in st_nlink? /proc/<pid>/fd is a directory, so having st_nlink
return the number of fds plus 2 (for . and ..) would be much more natural.

Cheers,

- Ted

2022-09-17 19:43:58

by Ivan Babrou

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

> > > * Make fd count access O(1) and expose it in /proc/pid/status
>
> This is doable, next to FDSize.

It feels like a better solution, but maybe I'm missing some context
here. Let me know whether this is preferred.

That said, I've tried doing it, but failed. There's a noticeable
mismatch in the numbers:

* systemd:

ivan@vm:~$ sudo ls -l /proc/1/fd | wc -l
66
ivan@vm:~$ cat /proc/1/status | fgrep FD
FDSize: 256
FDUsed: 71

* journald:

ivan@vm:~$ sudo ls -l /proc/803/fd | wc -l
29
ivan@vm:~$ cat /proc/803/status | fgrep FD
FDSize: 128
FDUsed: 37

I'll see if I can make it work next week. I'm happy to receive tips as well.

Below is my attempt (link in case gmail breaks patch formatting):

* https://gist.githubusercontent.com/bobrik/acce40881d629d8cce2e55966b31a0a2/raw/716eb4724a8fe3afeeb76fd2a7a47ee13790a9e9/fdused.patch

diff --git a/fs/file.c b/fs/file.c
index 3bcc1ecc314a..8bc0741cabf1 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -85,6 +85,8 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
memset((char *)nfdt->fd + cpy, 0, set);

copy_fd_bitmaps(nfdt, ofdt, ofdt->max_fds);
+
+ atomic_set(&nfdt->count, atomic_read(&ofdt->count));
}

/*
@@ -105,6 +107,7 @@ static void copy_fdtable(struct fdtable *nfdt, struct fdtable *ofdt)
static struct fdtable * alloc_fdtable(unsigned int nr)
{
struct fdtable *fdt;
+ atomic_t count = ATOMIC_INIT(0);
void *data;

/*
@@ -148,6 +151,7 @@ static struct fdtable * alloc_fdtable(unsigned int nr)
fdt->close_on_exec = data;
data += nr / BITS_PER_BYTE;
fdt->full_fds_bits = data;
+ fdt->count = count;

return fdt;

@@ -399,6 +403,8 @@ struct files_struct *dup_fd(struct files_struct *oldf, unsigned int max_fds, int
/* clear the remainder */
memset(new_fds, 0, (new_fdt->max_fds - open_files) * sizeof(struct file *));

+ atomic_set(&new_fdt->count, atomic_read(&old_fdt->count));
+
rcu_assign_pointer(newf->fdt, new_fdt);

return newf;
@@ -474,6 +480,7 @@ struct files_struct init_files = {
.close_on_exec = init_files.close_on_exec_init,
.open_fds = init_files.open_fds_init,
.full_fds_bits = init_files.full_fds_bits_init,
+ .count = ATOMIC_INIT(0),
},
.file_lock = __SPIN_LOCK_UNLOCKED(init_files.file_lock),
.resize_wait = __WAIT_QUEUE_HEAD_INITIALIZER(init_files.resize_wait),
@@ -613,6 +620,7 @@ void fd_install(unsigned int fd, struct file *file)
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);
return;
}
/* coupled with smp_wmb() in expand_fdtable() */
@@ -621,6 +629,7 @@ void fd_install(unsigned int fd, struct file *file)
BUG_ON(fdt->fd[fd] != NULL);
rcu_assign_pointer(fdt->fd[fd], file);
rcu_read_unlock_sched();
+ atomic_inc(&fdt->count);
}

EXPORT_SYMBOL(fd_install);
@@ -646,6 +655,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd)
if (file) {
rcu_assign_pointer(fdt->fd[fd], NULL);
__put_unused_fd(files, fd);
+ atomic_dec(&fdt->count);
}
return file;
}
@@ -844,6 +854,7 @@ void do_close_on_exec(struct files_struct *files)
filp_close(file, files);
cond_resched();
spin_lock(&files->file_lock);
+ atomic_dec(&fdt->count);
}

}
@@ -1108,6 +1119,7 @@ __releases(&files->file_lock)
else
__clear_close_on_exec(fd, fdt);
spin_unlock(&files->file_lock);
+ atomic_inc(&fdt->count);

if (tofree)
filp_close(tofree, files);
diff --git a/fs/proc/array.c b/fs/proc/array.c
index 99fcbfda8e25..5847f077bfc3 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -153,7 +153,8 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
struct task_struct *tracer;
const struct cred *cred;
pid_t ppid, tpid = 0, tgid, ngid;
- unsigned int max_fds = 0;
+ struct fdtable *fdt;
+ unsigned int max_fds = 0, open_fds = 0;

rcu_read_lock();
ppid = pid_alive(p) ?
@@ -170,8 +171,11 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
task_lock(p);
if (p->fs)
umask = p->fs->umask;
- if (p->files)
- max_fds = files_fdtable(p->files)->max_fds;
+ if (p->files) {
+ fdt = files_fdtable(p->files);
+ max_fds = fdt->max_fds;
+ open_fds = atomic_read(&fdt->count);
+ }
task_unlock(p);
rcu_read_unlock();

@@ -194,6 +198,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->sgid));
seq_put_decimal_ull(m, "\t", from_kgid_munged(user_ns, cred->fsgid));
seq_put_decimal_ull(m, "\nFDSize:\t", max_fds);
+ seq_put_decimal_ull(m, "\nFDUsed:\t", open_fds);

seq_puts(m, "\nGroups:\t");
group_info = cred->group_info;
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..59aceb1e4bc6 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -31,6 +31,7 @@ struct fdtable {
unsigned long *open_fds;
unsigned long *full_fds_bits;
struct rcu_head rcu;
+ atomic_t count;
};

static inline bool close_on_exec(unsigned int fd, const struct fdtable *fdt)


> > > +
> > > + generic_fillattr(&init_user_ns, inode, stat);
> ^^^^^^^^^^^^^
>
> Is this correct? I'm not userns guy at all.

I mostly copied from here:

* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/generic.c#L150

Maybe it can be simplified even further to match this one:

* https://elixir.bootlin.com/linux/v6.0-rc5/source/fs/proc/root.c#L317

2022-09-17 20:18:56

by Ivan Babrou

[permalink] [raw]
Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Sat, Sep 17, 2022 at 8:31 AM Theodore Ts'o <[email protected]> wrote:
>
> On Fri, Sep 16, 2022 at 04:08:52PM -0700, Ivan Babrou wrote:
> > We considered putting the number of open files in /proc/pid/stat.
> > Unfortunately, counting the number of fds involves iterating the fdtable,
> > which means that it might slow down /proc/pid/stat for processes
> > with many open files. Instead we opted to put this info in /proc/pid/fd
> > as a size member of the stat syscall result. Previously the reported
> > number was zero, so there's very little risk of breaking anything,
> > while still providing a somewhat logical way to count the open files.
>
> Instead of using the st_size of /proc/<pid>/fd, why not return that
> value in st_nlink? /proc/<pid>/fd is a directory, so having st_nlink
> return the number of fds plus 2 (for . and ..) would be much more natural.

From what I see, st_nlink counts subdirectories and doesn't include
regular files. In /proc/pid/fd we only have files (well, symlinks
really). I'm still happy to use that instead if that's preferred.

2022-09-19 09:14:08

by Christian Brauner

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Sat, Sep 17, 2022 at 10:15:13AM +0300, Alexey Dobriyan wrote:
> On Fri, Sep 16, 2022 at 05:01:15PM -0700, Andrew Morton wrote:
> > (cc's added)
> >
> > On Fri, 16 Sep 2022 16:08:52 -0700 Ivan Babrou <[email protected]> wrote:
> >
> > > Many monitoring tools include open file count as a metric. Currently
> > > the only way to get this number is to enumerate the files in /proc/pid/fd.
> > >
> > > The problem with the current approach is that it does a lot of work
> > > that callers don't need when all they want is a single number for
> > > a metric. In our tests with cadvisor, which reports open file counts
> > > per cgroup, we observed that reading the number of open files is slow.
> > > Of the 35.23% of CPU time spent in `proc_readfd_common`, 29.43% is
> > > spent in `proc_fill_cache`, which is responsible for filling in dentry
> > > info. Some of this extra time is spinlock contention, but it is
> > > contention on a lock we don't want to take to begin with.
> > >
> > > We considered putting the number of open files in /proc/pid/stat.
> > > Unfortunately, counting the number of fds involves iterating the fdtable,
> > > which means that it might slow down /proc/pid/stat for processes
> > > with many open files. Instead we opted to put this info in /proc/pid/fd
> > > as a size member of the stat syscall result. Previously the reported
> > > number was zero, so there's very little risk of breaking anything,
> > > while still providing a somewhat logical way to count the open files.
> >
> > Documentation/filesystems/proc.rst would be an appropriate place to
> > document this ;)
> >
> > > Previously:
> > >
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > > File: /proc/1/fd
> > > Size: 0 Blocks: 0 IO Block: 1024 directory
> > > ```
> > >
> > > With this patch:
> > >
> > > ```
> > > $ sudo stat /proc/1/fd | head -n2
> > > File: /proc/1/fd
> > > Size: 65 Blocks: 0 IO Block: 1024 directory
>
> Yes. This is the natural place.
>
> > > ```
> > >
> > > Correctness check:
> > >
> > > ```
> > > $ sudo ls /proc/1/fd | wc -l
> > > 65
> > > ```
> > >
> > > There are two alternatives to this approach that I can see:
> > >
> > > * Expose /proc/pid/fd_count with a count there
>
> > > * Make fd count access O(1) and expose it in /proc/pid/status
>
> This is doable, next to FDSize.
>
> Below is doable too.
>
> > > --- a/fs/proc/fd.c
> > > +++ b/fs/proc/fd.c
> > > @@ -279,6 +279,29 @@ static int proc_readfd_common(struct file *file, struct dir_context *ctx,
> > > return 0;
> > > }
> > >
> > > +static int proc_readfd_count(struct inode *inode)
> > > +{
> > > + struct task_struct *p = get_proc_task(inode);
> > > + unsigned int fd = 0, count = 0;
> > > +
> > > + if (!p)
> > > + return -ENOENT;
> > > +
> > > + rcu_read_lock();
> > > + while (task_lookup_next_fd_rcu(p, &fd)) {
> > > + rcu_read_unlock();
> > > +
> > > + count++;
> > > + fd++;
> > > +
> > > + cond_resched();
> > > + rcu_read_lock();
> > > + }
> > > + rcu_read_unlock();
> > > + put_task_struct(p);
> > > + return count;
> > > +}
> > > +
> > > static int proc_readfd(struct file *file, struct dir_context *ctx)
> > > {
> > > return proc_readfd_common(file, ctx, proc_fd_instantiate);
> > > @@ -319,9 +342,33 @@ int proc_fd_permission(struct user_namespace *mnt_userns,
> > > return rv;
> > > }
> > >
> > > +int proc_fd_getattr(struct user_namespace *mnt_userns,
> > > + const struct path *path, struct kstat *stat,
> > > + u32 request_mask, unsigned int query_flags)
> > > +{
> > > + struct inode *inode = d_inode(path->dentry);
> > > + struct proc_dir_entry *de = PDE(inode);
> > > +
> > > + if (de) {
> > > + nlink_t nlink = READ_ONCE(de->nlink);
> > > +
> > > + if (nlink > 0)
> > > + set_nlink(inode, nlink);
> > > + }
> > > +
> > > + generic_fillattr(&init_user_ns, inode, stat);
> ^^^^^^^^^^^^^
>
> Is this correct? I'm not userns guy at all.

This is correct. :)

2022-09-19 13:05:00

by Alexey Dobriyan

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > * Make fd count access O(1) and expose it in /proc/pid/status
> >
> > This is doable, next to FDSize.
>
> It feels like a better solution, but maybe I'm missing some context
> here. Let me know whether this is preferred.

I don't know. I'd put it in st_size as you did initially.
/proc/*/status should be slow.

2022-09-19 22:17:54

by Ivan Babrou

Subject: Re: [RFC] proc: report open files as size in stat() for /proc/pid/fd

On Mon, Sep 19, 2022 at 5:30 AM Alexey Dobriyan <[email protected]> wrote:
>
> On Sat, Sep 17, 2022 at 11:32:02AM -0700, Ivan Babrou wrote:
> > > > > * Make fd count access O(1) and expose it in /proc/pid/status
> > >
> > > This is doable, next to FDSize.
> >
> > It feels like a better solution, but maybe I'm missing some context
> > here. Let me know whether this is preferred.
>
> I don't know. I'd put it in st_size as you did initially.
> /proc/*/status should be slow.

Could you elaborate on what you mean?

* Are you saying that having FDUsed in /proc/*/status _would_ be slow?
I would imagine that adding atomic_read() there shouldn't slow things
down too much.
* Are you saying that reading /proc/*/status is already slow and
reading the number of open files from there would be inefficient?