2024-05-20 22:30:33

by Tejun Heo

[permalink] [raw]
Subject: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

Currently, worker ID formatting is open coded in create_worker(),
init_rescuer() and worker_thread() (for %WORKER_DIE case). The formatted ID
is saved into task->comm and wq_worker_comm() uses it as the base name to
append extra information to when generating the name to be shown as.

However, TASK_COMM_LEN is only 16 leading to badly truncated names for
rescuers. For example, the rescuer for the inet_frag_wq workqueue becomes:

$ ps -ef | grep '[k]worker/R-inet'
root 483 2 0 Apr26 ? 00:00:00 [kworker/R-inet_]

Even for non-rescue workers, it's easy to run over 15 characters on
moderately large machines.

Fit it by consolidating worker ID formatting into a new helper
format_worker_id() and calling it from wq_worker_comm() to obtain the
untruncated worker ID.

$ ps -ef | grep '[k]worker/R-inet'
root 60 2 0 12:10 ? 00:00:00 [kworker/R-inet_frag_wq]

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Jan Engelhardt <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
---
(cc'ing LKML and Lai)

On Mon, May 20, 2024 at 10:17:00AM -0700, Linus Torvalds wrote:
> Oh well. I suspect this would be trivial to fix. I think the
> get_kthread_comm() should use the full name, and only *then* attach
> the extra worker pool information if it exists..
>
> Tejun?

Yeah, looks like even the unbound worker IDs are getting truncated on larger
machines. This patch should fix it. I'll apply it to wq/for-6.10-fixes soon.

> Also, independently of the WQ worker issue, I do think we could
> possibly expand TASK_COMM_LEN a bit more. 16 bytes is too short for
> user-level names too, and while it's seldom used for 'ps', it's
> visible in things like oops messages etc where it gets truncated.

That'd be great. ISTR this being discussed a while ago. Am I imagining that
we were going to raise this to 32?

Thanks.

kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++-----------------
1 file changed, 34 insertions(+), 17 deletions(-)

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -125,6 +125,7 @@ enum wq_internal_consts {
HIGHPRI_NICE_LEVEL = MIN_NICE,

WQ_NAME_LEN = 32,
+ WORKER_ID_LEN = 10 + WQ_NAME_LEN, /* "kworker/R-" + WQ_NAME_LEN */
};

/*
@@ -2742,6 +2743,26 @@ static void worker_detach_from_pool(stru
complete(detach_completion);
}

+static int format_worker_id(char *buf, size_t size, struct worker *worker,
+ struct worker_pool *pool)
+{
+ if (worker->rescue_wq)
+ return scnprintf(buf, size, "kworker/R-%s",
+ worker->rescue_wq->name);
+
+ if (pool) {
+ if (pool->cpu >= 0)
+ return scnprintf(buf, size, "kworker/%d:%d%s",
+ pool->cpu, worker->id,
+ pool->attrs->nice < 0 ? "H" : "");
+ else
+ return scnprintf(buf, size, "kworker/u%d:%d",
+ pool->id, worker->id);
+ } else {
+ return scnprintf(buf, size, "kworker/dying");
+ }
+}
+
/**
* create_worker - create a new workqueue worker
* @pool: pool the new worker will belong to
@@ -2758,7 +2779,6 @@ static struct worker *create_worker(stru
{
struct worker *worker;
int id;
- char id_buf[23];

/* ID is needed to determine kthread name */
id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
@@ -2777,17 +2797,14 @@ static struct worker *create_worker(stru
worker->id = id;

if (!(pool->flags & POOL_BH)) {
- if (pool->cpu >= 0)
- snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
- pool->attrs->nice < 0 ? "H" : "");
- else
- snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
+ char id_buf[WORKER_ID_LEN];

+ format_worker_id(id_buf, sizeof(id_buf), worker, pool);
worker->task = kthread_create_on_node(worker_thread, worker,
- pool->node, "kworker/%s", id_buf);
+ pool->node, "%s", id_buf);
if (IS_ERR(worker->task)) {
if (PTR_ERR(worker->task) == -EINTR) {
- pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
+ pr_err("workqueue: Interrupted when creating a worker thread \"%s\"\n",
id_buf);
} else {
pr_err_once("workqueue: Failed to create a worker thread: %pe",
@@ -3350,7 +3367,6 @@ woke_up:
raw_spin_unlock_irq(&pool->lock);
set_pf_worker(false);

- set_task_comm(worker->task, "kworker/dying");
ida_free(&pool->worker_ida, worker->id);
worker_detach_from_pool(worker);
WARN_ON_ONCE(!list_empty(&worker->entry));
@@ -5542,6 +5558,7 @@ static int wq_clamp_max_active(int max_a
static int init_rescuer(struct workqueue_struct *wq)
{
struct worker *rescuer;
+ char id_buf[WORKER_ID_LEN];
int ret;

if (!(wq->flags & WQ_MEM_RECLAIM))
@@ -5555,7 +5572,9 @@ static int init_rescuer(struct workqueue
}

rescuer->rescue_wq = wq;
- rescuer->task = kthread_create(rescuer_thread, rescuer, "kworker/R-%s", wq->name);
+ format_worker_id(id_buf, sizeof(id_buf), rescuer, NULL);
+
+ rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", id_buf);
if (IS_ERR(rescuer->task)) {
ret = PTR_ERR(rescuer->task);
pr_err("workqueue: Failed to create a rescuer kthread for wq \"%s\": %pe",
@@ -6384,19 +6403,15 @@ void show_freezable_workqueues(void)
/* used to show worker information through /proc/PID/{comm,stat,status} */
void wq_worker_comm(char *buf, size_t size, struct task_struct *task)
{
- int off;
-
- /* always show the actual comm */
- off = strscpy(buf, task->comm, size);
- if (off < 0)
- return;
-
/* stabilize PF_WQ_WORKER and worker pool association */
mutex_lock(&wq_pool_attach_mutex);

if (task->flags & PF_WQ_WORKER) {
struct worker *worker = kthread_data(task);
struct worker_pool *pool = worker->pool;
+ int off;
+
+ off = format_worker_id(buf, size, worker, pool);

if (pool) {
raw_spin_lock_irq(&pool->lock);
@@ -6415,6 +6430,8 @@ void wq_worker_comm(char *buf, size_t si
}
raw_spin_unlock_irq(&pool->lock);
}
+ } else {
+ strscpy(buf, task->comm, size);
}

mutex_unlock(&wq_pool_attach_mutex);


2024-05-20 23:27:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

Hello,

On Tue, May 21, 2024 at 01:26:28AM +0200, Jan Engelhardt wrote:
> Tested-by: Jan Engelhardt <[email protected]>

Thanks for testing. I'll apply the patch to wq/for-6.10-fixes with minor
description fixes.

Thanks.

--
tejun

2024-05-20 23:36:27

by Jan Engelhardt

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string


On Tuesday 2024-05-21 00:30, Tejun Heo wrote:

>Currently, worker ID formatting is open coded in create_worker(),
>init_rescuer() and worker_thread() (for %WORKER_DIE case). The formatted ID
>is saved into task->comm and wq_worker_comm() uses it as the base name to
>append extra information to when generating the name to be shown as.
>
>However, TASK_COMM_LEN is only 16 leading to badly truncated names for
>rescuers. For example, the rescuer for the inet_frag_wq workqueue becomes:
>
> $ ps -ef | grep '[k]worker/R-inet'
> root 483 2 0 Apr26 ? 00:00:00 [kworker/R-inet_]
>[...]

This patch works satisfactorily for me.


Tested-by: Jan Engelhardt <[email protected]>

2024-05-21 02:34:37

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Tue, May 21, 2024 at 6:30 AM Tejun Heo <[email protected]> wrote:
>
> Currently, worker ID formatting is open coded in create_worker(),
> init_rescuer() and worker_thread() (for %WORKER_DIE case). The formatted ID
> is saved into task->comm and wq_worker_comm() uses it as the base name to
> append extra information to when generating the name to be shown as.
>
> However, TASK_COMM_LEN is only 16 leading to badly truncated names for
> rescuers. For example, the rescuer for the inet_frag_wq workqueue becomes:
>
> $ ps -ef | grep '[k]worker/R-inet'
> root 483 2 0 Apr26 ? 00:00:00 [kworker/R-inet_]
>
> Even for non-rescue workers, it's easy to run over 15 characters on
> moderately large machines.
>
> Fit it by consolidating worker ID formatting into a new helper
> format_worker_id() and calling it from wq_worker_comm() to obtain the
> untruncated worker ID.
>
> $ ps -ef | grep '[k]worker/R-inet'
> root 60 2 0 12:10 ? 00:00:00 [kworker/R-inet_frag_wq]
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Jan Engelhardt <[email protected]>
> Suggested-by: Linus Torvalds <[email protected]>
> ---
> (cc'ing LKML and Lai)
>
> On Mon, May 20, 2024 at 10:17:00AM -0700, Linus Torvalds wrote:
> > Oh well. I suspect this would be trivial to fix. I think the
> > get_kthread_comm() should use the full name, and only *then* attach
> > the extra worker pool information if it exists..
> >
> > Tejun?
>
> Yeah, looks like even the unbound worker IDs are getting truncated on larger
> machines. This patch should fix it. I'll apply it to wq/for-6.10-fixes soon.
>
> > Also, independently of the WQ worker issue, I do think we could
> > possibly expand TASK_COMM_LEN a bit more. 16 bytes is too short for
> > user-level names too, and while it's seldom used for 'ps', it's
> > visible in things like oops messages etc where it gets truncated.
>
> That'd be great. ISTR this being discussed a while ago. Am I imagining that
> we were going to raise this to 32?

We discussed extending it to 24 characters several years ago [0], but
some userspace tools might break. Therefore, we ultimately decided to
dynamically allocate the kthread's full name instead[1].

[0] https://lore.kernel.org/linux-mm/20211101060419.4682-1-laoar.shao@gmailcom/
[1] https://lore.kernel.org/linux-mm/[email protected]/

2024-05-21 18:06:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Mon, 20 May 2024 at 19:34, Yafang Shao <[email protected]> wrote:
>
> We discussed extending it to 24 characters several years ago [0], but
> some userspace tools might break.

Well, the fact that we already expose names longer than 16 bytes in
/proc means that at least *that* side of it could use an extended
comm[] array.

Yes, some other interfaces might want to still use a 16-byte limit as
the length for the buffers they use (tracing?) but I suspect we could
make the comm[] array easily bigger.

But what I suspect we should do *first* is to try to get rid of a lot
of the "current->comm" users. One of the most common uses is purely
for printing, and we could actually just add a new '%p' pointer for
printing the current name. That would allow our vsprintf() code to not
just use tsk->comm, but to use the full_name for threads etc.

So instead of

printf("%s ..", tsk->comm..);

we could have something like

printf("%pc ..", tsk);

to print the name of the task.

That would get rid of a lot of the bare ->comm[] uses, and then the
rest should probably use proper wrappers for copying the data (ie
using 'get_task_comm()' etc).

That would not only pick up the better names for printk and oopses, it
would also make future cleanups simpler (for example, I'd love to get
rid of the 'comm' name entirely, and replace it with 'exe_name[24]'
and have the compiler just notice when somebody is trying to access
'comm' directly).

Linus

2024-05-23 02:39:06

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Wed, May 22, 2024 at 2:06 AM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, 20 May 2024 at 19:34, Yafang Shao <[email protected]> wrote:
> >
> > We discussed extending it to 24 characters several years ago [0], but
> > some userspace tools might break.
>
> Well, the fact that we already expose names longer than 16 bytes in
> /proc means that at least *that* side of it could use an extended
> comm[] array.
>
> Yes, some other interfaces might want to still use a 16-byte limit as
> the length for the buffers they use (tracing?) but I suspect we could
> make the comm[] array easily bigger.

Indeed, the 16-byte limit is hard-coded in certain BPF code:

$ grep -r "comm\[" tools/testing/selftests/bpf/
tools/testing/selftests/bpf//prog_tests/ringbuf_multi.c: char comm[16];
tools/testing/selftests/bpf//prog_tests/sk_storage_tracing.c: char comm[16];
tools/testing/selftests/bpf//prog_tests/test_overhead.c: char comm[16] = {};
tools/testing/selftests/bpf//prog_tests/ringbuf.c: char comm[16];
tools/testing/selftests/bpf//progs/pyperf.h: char comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/dynptr_success.c: char comm[16];
tools/testing/selftests/bpf//progs/test_ringbuf.c: char comm[16];
tools/testing/selftests/bpf//progs/test_ringbuf_n.c: char comm[16];
tools/testing/selftests/bpf//progs/task_kfunc_success.c:
bpf_strncmp(&task->comm[8], 4, "foo");
tools/testing/selftests/bpf//progs/user_ringbuf_fail.c: char comm[16];
tools/testing/selftests/bpf//progs/test_ringbuf_map_key.c: char comm[16];
tools/testing/selftests/bpf//progs/test_core_reloc_kernel.c: char
comm[sizeof("test_progs")];
tools/testing/selftests/bpf//progs/test_core_reloc_kernel.c: char comm[16];
tools/testing/selftests/bpf//progs/dynptr_fail.c: char comm[16];
tools/testing/selftests/bpf//progs/strobemeta.h: char comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/core_reloc_types.h: char
comm[sizeof("test_progs")];
tools/testing/selftests/bpf//progs/core_reloc_types.h: char
comm[sizeof("test_progs")];
tools/testing/selftests/bpf//progs/test_skb_helpers.c: char comm[TEST_COMM_LEN];
tools/testing/selftests/bpf//progs/test_tracepoint.c: char
prev_comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/test_tracepoint.c: char
next_comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/test_ringbuf_multi.c: char comm[16];
tools/testing/selftests/bpf//progs/test_user_ringbuf.h: char comm[16];
tools/testing/selftests/bpf//progs/test_core_reloc_module.c: char
comm[sizeof("test_progs")];
tools/testing/selftests/bpf//progs/test_stacktrace_map.c: char
prev_comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/test_stacktrace_map.c: char
next_comm[TASK_COMM_LEN];
tools/testing/selftests/bpf//progs/test_sk_storage_tracing.c: char comm[16];
tools/testing/selftests/bpf//progs/test_sk_storage_tracing.c:char
task_comm[16] = "";

>
> But what I suspect we should do *first* is to try to get rid of a lot
> of the "current->comm" users. One of the most common uses is purely
> for printing, and we could actually just add a new '%p' pointer for
> printing the current name. That would allow our vsprintf() code to not
> just use tsk->comm, but to use the full_name for threads etc.
>
> So instead of
>
> printf("%s ..", tsk->comm..);
>
> we could have something like
>
> printf("%pc ..", tsk);
>
> to print the name of the task.

I believe it's a good start.

>
> That would get rid of a lot of the bare ->comm[] uses, and then the
> rest should probably use proper wrappers for copying the data (ie
> using 'get_task_comm()' etc).
>
> That would not only pick up the better names for printk and oopses, it
> would also make future cleanups simpler (for example, I'd love to get
> rid of the 'comm' name entirely, and replace it with 'exe_name[24]'
> and have the compiler just notice when somebody is trying to access
> 'comm' directly).

Some tools may flag the naming change. Below is a simple grep from
bcc-tools and bpftrace.

bcc $ grep -r "\->comm" tools/
tools//wakeuptime.py: bpf_probe_read_kernel(&key.target,
sizeof(key.target), p->comm);
tools//bitesize.py: bpf_probe_read_kernel(&key.name,
sizeof(key.name), args->comm);
tools//tcptracer.py: evt4.comm[i] = p->comm[i];
tools//tcptracer.py: evt6.comm[i] = p->comm[i];
tools//old/wakeuptime.py: bpf_probe_read(&key.target,
sizeof(key.target), p->comm);
tools//old/oomkill.py: bpf_probe_read(&data.tcomm,
sizeof(data.tcomm), p->comm);
tools//oomkill.py: bpf_probe_read_kernel(&data.tcomm,
sizeof(data.tcomm), p->comm);
tools//runqslower.py: bpf_probe_read_kernel_str(&data.prev_task,
sizeof(data.prev_task), prev->comm);
tools//runqslower.py: bpf_probe_read_kernel_str(&data.task,
sizeof(data.task), next->comm);
tools//runqslower.py: bpf_probe_read_kernel_str(&data.prev_task,
sizeof(data.prev_task), prev->comm);
tools//shmsnoop.py: if (bpf_get_current_comm(&val->comm,
sizeof(val->comm)) != 0)
tools//sslsniff.py: bpf_get_current_comm(&data->comm,
sizeof(data->comm));
tools//sslsniff.py: bpf_get_current_comm(&data->comm,
sizeof(data->comm));
tools//fileslower.py: bpf_probe_read_kernel(&data.comm,
sizeof(data.comm), valp->comm);
tools//mountsnoop.py: bpf_probe_read_kernel_str(&event.enter.pcomm,
TASK_COMM_LEN, task->real_parent->comm);
tools//mountsnoop.py: bpf_probe_read_kernel_str(&event.enter.pcomm,
TASK_COMM_LEN, task->real_parent->comm);
tools//gethostlatency.py: bpf_probe_read_kernel(&data.comm,
sizeof(data.comm), valp->comm);
tools//opensnoop.py: bpf_probe_read_kernel(&data.comm,
sizeof(data.comm), valp->comm);
tools//killsnoop.py: bpf_probe_read_kernel(&data.comm,
sizeof(data.comm), valp->comm);

bpftrace $ grep -r "\->comm" tools/
tools//naptime.bt: $task->real_parent->comm, pid, comm,
tools//oomkill.bt: $oc->chosen->pid, $oc->chosen->comm, $oc->totalpages);


--
Regards
Yafang

2024-05-23 04:32:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Wed, 22 May 2024 at 19:38, Yafang Shao <[email protected]> wrote:
>
> Indeed, the 16-byte limit is hard-coded in certain BPF code:

It's worse than that.

We have code like this:

memcpy(__entry->comm, t->comm, TASK_COMM_LEN);

and it looks like this code not only has a fixed-size target buffer of
TASK_COMM_LEN, it also just uses "memcpy()" instead of "strscpy()",
knowing that the source has the NUL byte in it.

If it wasn't for that memcpy() pattern, I think this trivial patch
would "JustWork(tm)"

diff --git a/fs/exec.c b/fs/exec.c
index 2d7dd0e39034..5829912a2fa0 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1239,7 +1239,7 @@ char *__get_task_comm(char *buf, size_t
buf_size, struct task_struct *tsk)
{
task_lock(tsk);
/* Always NUL terminated and zero-padded */
- strscpy_pad(buf, tsk->comm, buf_size);
+ strscpy_pad(buf, tsk->real_comm, buf_size);
task_unlock(tsk);
return buf;
}
@@ -1254,7 +1254,7 @@ void __set_task_comm(struct task_struct *tsk,
const char *buf, bool exec)
{
task_lock(tsk);
trace_task_rename(tsk, buf);
- strscpy_pad(tsk->comm, buf, sizeof(tsk->comm));
+ strscpy_pad(tsk->real_comm, buf, sizeof(tsk->real_comm));
task_unlock(tsk);
perf_event_comm(tsk, exec);
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 61591ac6eab6..948220958548 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -299,6 +299,7 @@ struct user_event_mm;
*/
enum {
TASK_COMM_LEN = 16,
+ REAL_TASK_COMM_LEN = 24,
};

extern void sched_tick(void);
@@ -1090,7 +1091,10 @@ struct task_struct {
* - access it with [gs]et_task_comm()
* - lock it with task_lock()
*/
- char comm[TASK_COMM_LEN];
+ union {
+ char comm[TASK_COMM_LEN];
+ char real_comm[REAL_TASK_COMM_LEN];
+ };

struct nameidata *nameidata;

and the old common pattern of just printing with '%s' and tsk->comm
would just continue to work:

pr_alert("BUG: Bad page state in process %s pfn:%05lx\n",
current->comm, page_to_pfn(page));

but will get a longer max string.

Of course, we have code like this in security/selinux/selinuxfs.c that
is literally written so that it won't work:

if (new_value) {
char comm[sizeof(current->comm)];

memcpy(comm, current->comm, sizeof(comm));
pr_err("SELinux: %s (%d) set checkreqprot to 1. This
is no longer supported.\n",
comm, current->pid);
}

which copies to a temporary buffer (which now does *NOT* have a
closing NUL character), and then prints from that. The intent is to at
least have a stable buffer, but it basically relies on the source of
the memcpy() being stable enough anyway.

That said, a simple grep like this:

git grep 'memcpy.*->comm\>'

more than likely finds all relevant cases. Not *that* many, and just
changing the 'memcpy()' to 'copy_comm()' should fix them all.

The "copy_comm()" would trivially look something like this:

memcpy(dst, src, TASK_COMM_LEN);
dst[TASK_COMM_LEN-1] = 0;

and the people who want that old TASK_COMM_LEN behavior will get it,
and the people who just print out ->comm as a string will magically
get the longer new "real comm".

And people who do "sizeof(->comm)" will continue to get the old value
because of the hacky union. FWIW.

Anybody want to polish up the above turd? It doesn't look all that
hard unless I'm missing something, but needs some testing and care.

Linus

2024-05-23 07:17:12

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On 21/05/2024 20.05, Linus Torvalds wrote:
> On Mon, 20 May 2024 at 19:34, Yafang Shao <[email protected]> wrote:
>>
>> We discussed extending it to 24 characters several years ago [0], but
>> some userspace tools might break.
>
> Well, the fact that we already expose names longer than 16 bytes in
> /proc means that at least *that* side of it could use an extended
> comm[] array.
>
> Yes, some other interfaces might want to still use a 16-byte limit as
> the length for the buffers they use (tracing?) but I suspect we could
> make the comm[] array easily bigger.
>
> But what I suspect we should do *first* is to try to get rid of a lot
> of the "current->comm" users. One of the most common uses is purely
> for printing, and we could actually just add a new '%p' pointer for
> printing the current name. That would allow our vsprintf() code to not
> just use tsk->comm, but to use the full_name for threads etc.
>
> So instead of
>
> printf("%s ..", tsk->comm..);
>
> we could have something like
>
> printf("%pc ..", tsk);

The alphabet is running out of letters, so if somebody goes and
implement this, I'd suggest making this %pc thing take extra modifiers,
so one could also get the pid or whatever else one might want to print
from a task. I.e. %pcN, %pcP, ... We could of course have %pc with no
following letter mean ->comm (or "real name" for kernel threads) if that
is the most common thing, which it very well might be.

Rasmus


2024-05-23 13:04:24

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Thu, May 23, 2024 at 12:32 PM Linus Torvalds
<[email protected]> wrote:
>
> On Wed, 22 May 2024 at 19:38, Yafang Shao <[email protected]> wrote:
> >
> > Indeed, the 16-byte limit is hard-coded in certain BPF code:
>
> It's worse than that.
>
> We have code like this:
>
> memcpy(__entry->comm, t->comm, TASK_COMM_LEN);
>
> and it looks like this code not only has a fixed-size target buffer of
> TASK_COMM_LEN, it also just uses "memcpy()" instead of "strscpy()",
> knowing that the source has the NUL byte in it.
>
> If it wasn't for that memcpy() pattern, I think this trivial patch
> would "JustWork(tm)"
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 2d7dd0e39034..5829912a2fa0 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1239,7 +1239,7 @@ char *__get_task_comm(char *buf, size_t
> buf_size, struct task_struct *tsk)
> {
> task_lock(tsk);
> /* Always NUL terminated and zero-padded */
> - strscpy_pad(buf, tsk->comm, buf_size);
> + strscpy_pad(buf, tsk->real_comm, buf_size);
> task_unlock(tsk);
> return buf;
> }
> @@ -1254,7 +1254,7 @@ void __set_task_comm(struct task_struct *tsk,
> const char *buf, bool exec)
> {
> task_lock(tsk);
> trace_task_rename(tsk, buf);
> - strscpy_pad(tsk->comm, buf, sizeof(tsk->comm));
> + strscpy_pad(tsk->real_comm, buf, sizeof(tsk->real_comm));
> task_unlock(tsk);
> perf_event_comm(tsk, exec);
> }
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 61591ac6eab6..948220958548 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -299,6 +299,7 @@ struct user_event_mm;
> */
> enum {
> TASK_COMM_LEN = 16,
> + REAL_TASK_COMM_LEN = 24,
> };
>
> extern void sched_tick(void);
> @@ -1090,7 +1091,10 @@ struct task_struct {
> * - access it with [gs]et_task_comm()
> * - lock it with task_lock()
> */
> - char comm[TASK_COMM_LEN];
> + union {
> + char comm[TASK_COMM_LEN];
> + char real_comm[REAL_TASK_COMM_LEN];
> + };
>
> struct nameidata *nameidata;
>
> and the old common pattern of just printing with '%s' and tsk->comm
> would just continue to work:
>
> pr_alert("BUG: Bad page state in process %s pfn:%05lx\n",
> current->comm, page_to_pfn(page));
>
> but will get a longer max string.
>
> Of course, we have code like this in security/selinux/selinuxfs.c that
> is literally written so that it won't work:
>
> if (new_value) {
> char comm[sizeof(current->comm)];
>
> memcpy(comm, current->comm, sizeof(comm));
> pr_err("SELinux: %s (%d) set checkreqprot to 1. This
> is no longer supported.\n",
> comm, current->pid);
> }
>
> which copies to a temporary buffer (which now does *NOT* have a
> closing NUL character), and then prints from that. The intent is to at
> least have a stable buffer, but it basically relies on the source of
> the memcpy() being stable enough anyway.
>
> That said, a simple grep like this:
>
> git grep 'memcpy.*->comm\>'
>
> more than likely finds all relevant cases. Not *that* many, and just
> changing the 'memcpy()' to 'copy_comm()' should fix them all.
>
> The "copy_comm()" would trivially look something like this:
>
> memcpy(dst, src, TASK_COMM_LEN);
> dst[TASK_COMM_LEN-1] = 0;
>
> and the people who want that old TASK_COMM_LEN behavior will get it,
> and the people who just print out ->comm as a string will magically
> get the longer new "real comm".
>
> And people who do "sizeof(->comm)" will continue to get the old value
> because of the hacky union. FWIW.
>
> Anybody want to polish up the above turd? It doesn't look all that
> hard unless I'm missing something, but needs some testing and care.

If it's not urgent and no one else will handle it, I'll take care of
it. However, I might not be able to complete it quickly.

--
Regards
Yafang

2024-05-23 15:56:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Thu, 23 May 2024 at 06:04, Yafang Shao <[email protected]> wrote:
>
> If it's not urgent and no one else will handle it, I'll take care of
> it. However, I might not be able to complete it quickly.

It's not urgent. In fact, I'm not convinced we need to even increase
the current comm[] size, since for normal user programs the main way
'ps' and friends get it is by just reading the full command line etc.

But I think it would be good to at least do the cleanup and walk away
from the bare hardcoded memcpy() so that we can move in that
direction.

Linus

2024-05-23 17:25:43

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Wed, May 22, 2024 at 09:32:03PM -0700, Linus Torvalds wrote:
> On Wed, 22 May 2024 at 19:38, Yafang Shao <[email protected]> wrote:
> >
> > Indeed, the 16-byte limit is hard-coded in certain BPF code:
>
> It's worse than that.
>
> We have code like this:
>
> memcpy(__entry->comm, t->comm, TASK_COMM_LEN);

FYI, I would be happy to convert the tracing events over to dynamic strings.
It takes a 4 byte meta data that holds the offset and size of the string, then
the string itself (appended at the end of the event buffer) as well as the
space. The sched_switch and sched_waking events were created before the
dynamic string code was added, hence the hard coded size.

I'm not sure what fallout that would have for user space tools, as some
tooling does hardcode the parsing of the sched event. But I'm sure I can work
to fix those tools.

-- Steve


2024-05-23 19:05:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Thu, 23 May 2024 at 10:25, Steven Rostedt <[email protected]> wrote:
>
> FYI, I would be happy to convert the tracing events over to dynamic strings.

I doubt it is worth it.

The reason I would want to clean up the existing random memcpy is not
so much because 15 characters wouldn't be enough for tracing, but
because it is just ugly how we have this bad hardcoded interface
without proper abstractions, and it would keep us from changing it.

Linus

2024-05-24 07:43:27

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Thu, May 23, 2024 at 11:55 PM Linus Torvalds
<[email protected]> wrote:
>
> On Thu, 23 May 2024 at 06:04, Yafang Shao <[email protected]> wrote:
> >
> > If it's not urgent and no one else will handle it, I'll take care of
> > it. However, I might not be able to complete it quickly.
>
> It's not urgent. In fact, I'm not convinced we need to even increase
> the current comm[] size, since for normal user programs the main way
> 'ps' and friends get it is by just reading the full command line etc.
>
> But I think it would be good to at least do the cleanup and walk away
> from the bare hardcoded memcpy() so that we can move in that
> direction.

Certainly, let's start with the cleanup.

Actually, there are already helpers for this: get_task_comm() and
__get_task_comm(). We can simply replace the memcpy() with one of
these. If the task_lock() in __get_task_comm() is a concern, we could
consider adding a new __get_current_comm().

It's important to note that people may continue to directly access
task->comm in new code, even if we've added a comment to avoid that:

struct task_struct {
...
/*
* executable name, excluding path.
*
* - normally initialized setup_new_exec()
* - access it with [gs]et_task_comm()
* - lock it with task_lock()
*/
char comm[TASK_COMM_LEN];
...
}

We might add a rule in checkpatch.pl to warn against this, but that’s
not an ideal solution.

--
Regards
Yafang

2024-05-24 14:58:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Fri, 24 May 2024 at 00:43, Yafang Shao <[email protected]> wrote:
>
> Actually, there are already helpers for this: get_task_comm() and
> __get_task_comm(). We can simply replace the memcpy() with one of
> these

No. We should get rid of those horrendous helpers.

> If the task_lock() in __get_task_comm() is a concern, we could
> consider adding a new __get_current_comm().

The task_lock is indeed the problem - it generates locking problems
and basically means that most places cannot use them. Certainly not
things like tracing etc.

The locking is also entirely pointless\, since absolutely nobody
cares. If somebody is changing the name at the same time - which
doesn't happen in practice - getting some halfway result is fine as
long as you get a proper NUL terminated result.

Even for non-current, they are largely useless. They were a mistake.

So those functions should never be used for any normal thing. Instead
of locking, the function should literally just do a "copy a couple of
words and make sure the end result still has a NUL at the end".

That's literally what selinuxfs.c wants, for example - it copies the
thing to a local buffer not because it cares about some locking issue,
but because it wants one stable value. But by using 'memcpy()' and
that fixed size, it means that we can't sanely extend the source size
because now it wouldn't be NUL-terminated. But selinux never wanted a
lock, and never wanted any kind of *consistent* result, it just wanted
a *stable* result.

Since user space can randomly change their names anyway, using locking
was always wrong for readers (for writers it probably does make sense
to have some lock - although practically speaking nobody cares there
either, but at least for a writer some kind of race could have
long-term mixed results)

Oh well.

Linus

2024-05-26 02:34:58

by Yafang Shao

[permalink] [raw]
Subject: Re: [PATCH workqueue/for-6.10-fixes] workqueue: Refactor worker ID formatting and make wq_worker_comm() use full ID string

On Fri, May 24, 2024 at 10:58 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, 24 May 2024 at 00:43, Yafang Shao <[email protected]> wrote:
> >
> > Actually, there are already helpers for this: get_task_comm() and
> > __get_task_comm(). We can simply replace the memcpy() with one of
> > these
>
> No. We should get rid of those horrendous helpers.
>
> > If the task_lock() in __get_task_comm() is a concern, we could
> > consider adding a new __get_current_comm().
>
> The task_lock is indeed the problem - it generates locking problems
> and basically means that most places cannot use them. Certainly not
> things like tracing etc.
>
> The locking is also entirely pointless\, since absolutely nobody
> cares. If somebody is changing the name at the same time - which
> doesn't happen in practice - getting some halfway result is fine as
> long as you get a proper NUL terminated result.
>
> Even for non-current, they are largely useless. They were a mistake.
>
> So those functions should never be used for any normal thing. Instead
> of locking, the function should literally just do a "copy a couple of
> words and make sure the end result still has a NUL at the end".
>
> That's literally what selinuxfs.c wants, for example - it copies the
> thing to a local buffer not because it cares about some locking issue,
> but because it wants one stable value. But by using 'memcpy()' and
> that fixed size, it means that we can't sanely extend the source size
> because now it wouldn't be NUL-terminated. But selinux never wanted a
> lock, and never wanted any kind of *consistent* result, it just wanted
> a *stable* result.
>
> Since user space can randomly change their names anyway, using locking
> was always wrong for readers (for writers it probably does make sense
> to have some lock - although practically speaking nobody cares there
> either, but at least for a writer some kind of race could have
> long-term mixed results)

Thanks for your explanation.
Let's proceed by removing the task_lock() inside __get_task_comm().

--
Regards
Yafang