2023-09-22 07:55:40

by Max Kellermann

[permalink] [raw]
Subject: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

I'd like to be able to run metrics collector processes without special
privileges

In the kernel, there is a mix of debugfs files being world-readable
and not world-readable is; with a naive "git grep", I found 723
world-readable debugfs_create_file() calls and 582 calls which were
only accessible to privileged processe.

From the code, I cannot derive a consistent policy for that, but the
ceph statistics seem harmless (and useful) enough.

Signed-off-by: Max Kellermann <[email protected]>
---
fs/ceph/debugfs.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 3904333fa6c3..2abee7e18144 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
name);

fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
- 0400,
+ 0444,
fsc->client->debugfs_dir,
fsc,
&mdsmap_fops);

fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
- 0400,
+ 0444,
fsc->client->debugfs_dir,
fsc,
&mds_sessions_fops);

fsc->debugfs_mdsc = debugfs_create_file("mdsc",
- 0400,
+ 0444,
fsc->client->debugfs_dir,
fsc,
&mdsc_fops);

fsc->debugfs_caps = debugfs_create_file("caps",
- 0400,
+ 0444,
fsc->client->debugfs_dir,
fsc,
&caps_fops);

fsc->debugfs_status = debugfs_create_file("status",
- 0400,
+ 0444,
fsc->client->debugfs_dir,
fsc,
&status_fops);
@@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
fsc->client->debugfs_dir);

- debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_file_fops);
- debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_latency_fops);
- debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_size_fops);
- debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
+ debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_caps_fops);
}

--
2.39.2


2023-09-22 15:27:47

by Max Kellermann

[permalink] [raw]
Subject: [PATCH 2/2] fs/ceph/debugfs: expose raw metric counters

To enable userspace to calculate the current latency, not just the
average latency since the filesystem was mounted.

We have been running this patch for a while on our servers and our
Prometheus exporter collects these statistics:

https://github.com/CM4all/Prometheus-Exporters/
https://github.com/CM4all/Prometheus-Exporters/blob/master/src/KernelExporter.cxx

Signed-off-by: Max Kellermann <[email protected]>
---
fs/ceph/debugfs.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
index 2abee7e18144..d13a1ab8822a 100644
--- a/fs/ceph/debugfs.c
+++ b/fs/ceph/debugfs.c
@@ -170,6 +170,30 @@ static const char * const metric_str[] = {
"metadata",
"copyfrom"
};
+
+static int metrics_counters_show(struct seq_file *s, void *p)
+{
+ struct ceph_fs_client *fsc = s->private;
+ struct ceph_client_metric *cm = &fsc->mdsc->metric;
+ u64 count, size_bytes, wait_ns;
+
+ seq_printf(s, "item count size_bytes wait_ns\n");
+
+ for (unsigned i = 0; i < METRIC_MAX; i++) {
+ struct ceph_metric *m = &cm->metric[i];
+ spin_lock(&m->lock);
+ count = m->total;
+ size_bytes = m->size_sum;
+ wait_ns = ktime_to_ns(m->latency_sum);
+ spin_unlock(&m->lock);
+
+ seq_printf(s, "%s %llu %llu %llu\n",
+ metric_str[i], count, size_bytes, wait_ns);
+ }
+
+ return 0;
+}
+
static int metrics_latency_show(struct seq_file *s, void *p)
{
struct ceph_fs_client *fsc = s->private;
@@ -368,6 +392,7 @@ DEFINE_SHOW_ATTRIBUTE(caps);
DEFINE_SHOW_ATTRIBUTE(mds_sessions);
DEFINE_SHOW_ATTRIBUTE(status);
DEFINE_SHOW_ATTRIBUTE(metrics_file);
+DEFINE_SHOW_ATTRIBUTE(metrics_counters);
DEFINE_SHOW_ATTRIBUTE(metrics_latency);
DEFINE_SHOW_ATTRIBUTE(metrics_size);
DEFINE_SHOW_ATTRIBUTE(metrics_caps);
@@ -463,6 +488,8 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)

debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_file_fops);
+ debugfs_create_file("counters", 0444, fsc->debugfs_metrics_dir, fsc,
+ &metrics_counters_fops);
debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
&metrics_latency_fops);
debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
--
2.39.2

2023-09-25 10:32:13

by Xiubo Li

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable


On 9/22/23 14:25, Max Kellermann wrote:
> I'd like to be able to run metrics collector processes without special
> privileges
>
> In the kernel, there is a mix of debugfs files being world-readable
> and not world-readable is; with a naive "git grep", I found 723
> world-readable debugfs_create_file() calls and 582 calls which were
> only accessible to privileged processe.
>
> From the code, I cannot derive a consistent policy for that, but the
> ceph statistics seem harmless (and useful) enough.

I am not sure whether will this make sense. Because the 'debug' under
'/sys/kernel/' is also only accessible by privileged process.

Ilya, Jeff

Any idea ?

Thanks

- Xiubo

> Signed-off-by: Max Kellermann <[email protected]>
> ---
> fs/ceph/debugfs.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> index 3904333fa6c3..2abee7e18144 100644
> --- a/fs/ceph/debugfs.c
> +++ b/fs/ceph/debugfs.c
> @@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> name);
>
> fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
> - 0400,
> + 0444,
> fsc->client->debugfs_dir,
> fsc,
> &mdsmap_fops);
>
> fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
> - 0400,
> + 0444,
> fsc->client->debugfs_dir,
> fsc,
> &mds_sessions_fops);
>
> fsc->debugfs_mdsc = debugfs_create_file("mdsc",
> - 0400,
> + 0444,
> fsc->client->debugfs_dir,
> fsc,
> &mdsc_fops);
>
> fsc->debugfs_caps = debugfs_create_file("caps",
> - 0400,
> + 0444,
> fsc->client->debugfs_dir,
> fsc,
> &caps_fops);
>
> fsc->debugfs_status = debugfs_create_file("status",
> - 0400,
> + 0444,
> fsc->client->debugfs_dir,
> fsc,
> &status_fops);
> @@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
> fsc->client->debugfs_dir);
>
> - debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
> + debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
> &metrics_file_fops);
> - debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
> + debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
> &metrics_latency_fops);
> - debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
> + debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
> &metrics_size_fops);
> - debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
> + debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
> &metrics_caps_fops);
> }
>

2023-09-25 10:59:08

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Mon, 2023-09-25 at 13:18 +0800, Xiubo Li wrote:
> On 9/22/23 14:25, Max Kellermann wrote:
> > I'd like to be able to run metrics collector processes without special
> > privileges
> >
> > In the kernel, there is a mix of debugfs files being world-readable
> > and not world-readable is; with a naive "git grep", I found 723
> > world-readable debugfs_create_file() calls and 582 calls which were
> > only accessible to privileged processe.
> >
> > From the code, I cannot derive a consistent policy for that, but the
> > ceph statistics seem harmless (and useful) enough.
>
> I am not sure whether will this make sense. Because the 'debug' under
> '/sys/kernel/' is also only accessible by privileged process.
>
> Ilya, Jeff
>
> Any idea ?
>

Yeah, I don't think this makes much sense. At least on my machine:

# stat -c '%A' /sys/kernel/debug
drwx------

Without at least x permissions, an unprivileged user can't pathwalk
through there. Max, how are you testing this?


>
> > Signed-off-by: Max Kellermann <[email protected]>
> > ---
> > fs/ceph/debugfs.c | 18 +++++++++---------
> > 1 file changed, 9 insertions(+), 9 deletions(-)
> >
> > diff --git a/fs/ceph/debugfs.c b/fs/ceph/debugfs.c
> > index 3904333fa6c3..2abee7e18144 100644
> > --- a/fs/ceph/debugfs.c
> > +++ b/fs/ceph/debugfs.c
> > @@ -429,31 +429,31 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> > name);
> >
> > fsc->debugfs_mdsmap = debugfs_create_file("mdsmap",
> > - 0400,
> > + 0444,
> > fsc->client->debugfs_dir,
> > fsc,
> > &mdsmap_fops);
> >
> > fsc->debugfs_mds_sessions = debugfs_create_file("mds_sessions",
> > - 0400,
> > + 0444,
> > fsc->client->debugfs_dir,
> > fsc,
> > &mds_sessions_fops);
> >
> > fsc->debugfs_mdsc = debugfs_create_file("mdsc",
> > - 0400,
> > + 0444,
> > fsc->client->debugfs_dir,
> > fsc,
> > &mdsc_fops);
> >
> > fsc->debugfs_caps = debugfs_create_file("caps",
> > - 0400,
> > + 0444,
> > fsc->client->debugfs_dir,
> > fsc,
> > &caps_fops);
> >
> > fsc->debugfs_status = debugfs_create_file("status",
> > - 0400,
> > + 0444,
> > fsc->client->debugfs_dir,
> > fsc,
> > &status_fops);
> > @@ -461,13 +461,13 @@ void ceph_fs_debugfs_init(struct ceph_fs_client *fsc)
> > fsc->debugfs_metrics_dir = debugfs_create_dir("metrics",
> > fsc->client->debugfs_dir);
> >
> > - debugfs_create_file("file", 0400, fsc->debugfs_metrics_dir, fsc,
> > + debugfs_create_file("file", 0444, fsc->debugfs_metrics_dir, fsc,
> > &metrics_file_fops);
> > - debugfs_create_file("latency", 0400, fsc->debugfs_metrics_dir, fsc,
> > + debugfs_create_file("latency", 0444, fsc->debugfs_metrics_dir, fsc,
> > &metrics_latency_fops);
> > - debugfs_create_file("size", 0400, fsc->debugfs_metrics_dir, fsc,
> > + debugfs_create_file("size", 0444, fsc->debugfs_metrics_dir, fsc,
> > &metrics_size_fops);
> > - debugfs_create_file("caps", 0400, fsc->debugfs_metrics_dir, fsc,
> > + debugfs_create_file("caps", 0444, fsc->debugfs_metrics_dir, fsc,
> > &metrics_caps_fops);
> > }
> >
>

--
Jeff Layton <[email protected]>

2023-09-25 11:42:07

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Mon, 2023-09-25 at 12:41 +0200, Ilya Dryomov wrote:
> On Fri, Sep 22, 2023 at 8:26 AM Max Kellermann <[email protected]> wrote:
> >
> > I'd like to be able to run metrics collector processes without special
> > privileges
>
> Hi Max,
>
> A word of caution about building metrics collectors based on debugfs
> output: there are no stability guarantees. While the format won't be
> changed just for the sake of change of course, expect zero effort to
> preserve backwards compatibility.
>
> The latency metrics in particular are sent to the MDS in binary form
> and are intended to be consumed through commands like "ceph fs top".
> debugfs stuff is there just for an occasional sneak peek (apart from
> actual debugging).
>

FWIW, I wish we had gone with netlink for this functionality instead of
a seqfile. Lorenzo has been working with netlink for some similar
functionality with nfsd[1], and it's much nicer for this sort of thing.

[1]: https://lore.kernel.org/linux-nfs/ZQTM6l7NrsVHFoR5@lore-desk/T/#t

--
Jeff Layton <[email protected]>

2023-09-25 18:03:32

by Ilya Dryomov

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Fri, Sep 22, 2023 at 8:26 AM Max Kellermann <[email protected]> wrote:
>
> I'd like to be able to run metrics collector processes without special
> privileges

Hi Max,

A word of caution about building metrics collectors based on debugfs
output: there are no stability guarantees. While the format won't be
changed just for the sake of change of course, expect zero effort to
preserve backwards compatibility.

The latency metrics in particular are sent to the MDS in binary form
and are intended to be consumed through commands like "ceph fs top".
debugfs stuff is there just for an occasional sneak peek (apart from
actual debugging).

Thanks,

Ilya

2023-09-26 09:29:59

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Mon, Sep 25, 2023 at 7:18 AM Xiubo Li <[email protected]> wrote:
> I am not sure whether will this make sense. Because the 'debug' under
> '/sys/kernel/' is also only accessible by privileged process.

Not exactly correct. It is by default accessible to processes who have
CAP_DAC_OVERRIDE and additionally it is accessible to (unprivileged)
processes running as uid=0 (those two traits usually overlap).
But we don't want to run kernel-exporter as uid=0 and neither do we
want to give it CAP_DAC_OVERRIDE; both would be too much, it would
affect much more than just (read) access to debugfs.
Instead, we mount debugfs with "gid=X,mode=0710". That way, we can
give (unprivileged) processes which are member of a certain group
access to debugfs, and we put our kernel-exporter process in that
group.

We can use these mount options to change debugfs defaults, but if a
debugfs implementor (such as cephfs) decides to override these global
debugfs settings by passing stricter file permissions, we can't easily
override that. And that is what my patch is about: restore the ability
to override debugfs permissions with a mount option, as debugfs was
designed.

Max

2023-09-26 12:57:29

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Mon, Sep 25, 2023 at 12:42 PM Ilya Dryomov <[email protected]> wrote:
> A word of caution about building metrics collectors based on debugfs
> output: there are no stability guarantees. While the format won't be
> changed just for the sake of change of course, expect zero effort to
> preserve backwards compatibility.

Agree, but there's nothing else. We have been using my patch for quite
some time, and it has been very useful.

Maybe we can discuss promoting these statistics to sysfs/proc? (the
raw numbers, not the existing aggregates which are useless for any
practical purpose)

> The latency metrics in particular are sent to the MDS in binary form
> and are intended to be consumed through commands like "ceph fs top".
> debugfs stuff is there just for an occasional sneak peek (apart from
> actual debugging).

I don't know the whole Ceph ecosystem so well, but "ceph" is a command
that is supposed to run on a Ceph server, and not on a machine that
mounts a cephfs, right? If that's right, then this command is useless
for me.

Max

2023-09-26 13:11:26

by Ilya Dryomov

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Tue, Sep 26, 2023 at 8:16 AM Max Kellermann <[email protected]> wrote:
>
> On Mon, Sep 25, 2023 at 12:42 PM Ilya Dryomov <[email protected]> wrote:
> > A word of caution about building metrics collectors based on debugfs
> > output: there are no stability guarantees. While the format won't be
> > changed just for the sake of change of course, expect zero effort to
> > preserve backwards compatibility.
>
> Agree, but there's nothing else. We have been using my patch for quite
> some time, and it has been very useful.
>
> Maybe we can discuss promoting these statistics to sysfs/proc? (the
> raw numbers, not the existing aggregates which are useless for any
> practical purpose)
>
> > The latency metrics in particular are sent to the MDS in binary form
> > and are intended to be consumed through commands like "ceph fs top".
> > debugfs stuff is there just for an occasional sneak peek (apart from
> > actual debugging).
>
> I don't know the whole Ceph ecosystem so well, but "ceph" is a command
> that is supposed to run on a Ceph server, and not on a machine that
> mounts a cephfs, right? If that's right, then this command is useless
> for me.

No, "ceph" command (as well as "rbd", "rados", etc) can be run from
anywhere -- it's just a matter of installing a package which is likely
already installed unless you are mounting CephFS manually without using
/sbin/mount.ceph mount helper.

Thanks,

Ilya

2023-09-26 13:26:45

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Tue, Sep 26, 2023 at 10:46 AM Ilya Dryomov <[email protected]> wrote:
> No, "ceph" command (as well as "rbd", "rados", etc) can be run from
> anywhere -- it's just a matter of installing a package which is likely
> already installed unless you are mounting CephFS manually without using
> /sbin/mount.ceph mount helper.

I have never heard of that helper, so no, we're not using it - should we?

This "ceph" tool requires installing 90 MB of additional Debian
packages, which I just tried on a test cluster, and "ceph fs top"
fails with "Error initializing cluster client: ObjectNotFound('RADOS
object not found (error calling conf_read_file)')". Okay, so I have to
configure something.... but .... I don't get why I would want to do
that, when I can get the same information from the kernel without
installing or configuring anything. This sounds like overcomplexifying
the thing for no reason.

Max

2023-09-27 14:35:05

by Ilya Dryomov

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Tue, Sep 26, 2023 at 11:09 AM Max Kellermann
<[email protected]> wrote:
>
> On Tue, Sep 26, 2023 at 10:46 AM Ilya Dryomov <[email protected]> wrote:
> > No, "ceph" command (as well as "rbd", "rados", etc) can be run from
> > anywhere -- it's just a matter of installing a package which is likely
> > already installed unless you are mounting CephFS manually without using
> > /sbin/mount.ceph mount helper.
>
> I have never heard of that helper, so no, we're not using it - should we?

If you have figured out the right mount options, you might as well not.
The helper does things like determine whether v1 or v2 addresses should
be used, fetch the key and pass it via the kernel keyring (whereas you
are probably passing it verbatim on the command line), etc. It's the
same syscall in the end, so the helper is certainly not required.

>
> This "ceph" tool requires installing 90 MB of additional Debian
> packages, which I just tried on a test cluster, and "ceph fs top"
> fails with "Error initializing cluster client: ObjectNotFound('RADOS
> object not found (error calling conf_read_file)')". Okay, so I have to
> configure something.... but .... I don't get why I would want to do
> that, when I can get the same information from the kernel without
> installing or configuring anything. This sounds like overcomplexifying
> the thing for no reason.

I have relayed my understanding of this feature (or rather how it was
presented to me). I see where you are coming from, so adding more
CephFS folks to chime in.

Thanks,

Ilya

2023-09-27 15:13:23

by Max Kellermann

[permalink] [raw]
Subject: Re: [PATCH 1/2] fs/ceph/debugfs: make all files world-readable

On Wed, Sep 27, 2023 at 12:53 PM Ilya Dryomov <[email protected]> wrote:
> > This "ceph" tool requires installing 90 MB of additional Debian
> > packages, which I just tried on a test cluster, and "ceph fs top"
> > fails with "Error initializing cluster client: ObjectNotFound('RADOS
> > object not found (error calling conf_read_file)')". Okay, so I have to
> > configure something.... but .... I don't get why I would want to do
> > that, when I can get the same information from the kernel without
> > installing or configuring anything. This sounds like overcomplexifying
> > the thing for no reason.
>
> I have relayed my understanding of this feature (or rather how it was
> presented to me). I see where you are coming from, so adding more
> CephFS folks to chime in.

Let me show these folks how badly "ceph fs stats" performs:

# time ceph fs perf stats
{"version": 2, "global_counters": ["cap_hit", "read_latency",
"write_latency"[...]
real 0m0.502s
user 0m0.393s
sys 0m0.053s

Now my debugfs-based solution:

# time cat /sys/kernel/debug/ceph/*/metrics/latency
item total avg_lat(us) min_lat(us) max_lat(us)
stdev(us)
[...]
real 0m0.002s
user 0m0.002s
sys 0m0.001s

debugfs is more than 200 times faster. It is so fast, it can hardly be
measured by "time" - and most of these 2ms is the overhead for
executing /bin/cat, not for actually reading the debugfs file.
Our kernel-exporter is a daemon process, it only needs a single
pread() system call in each iteration, it has even less overhead.
Integrating the "ceph" tool instead would require forking the process
each time, starting a new Python VM, and so on...
For obtaining real-time latency statistics, the "ceph" script is the
wrong tool for the job.

Max