2022-10-19 14:45:10

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()


Hi,

A question regarding a7c01fa93aeb ("signal: break out of wait loops on
kthread_stop()") if I may.

We have a bunch code in i915, possibly limited to self tests (ie debug
builds) but still important for our flows, which spawn kernel threads
and exercises parts of the driver.

Problem we are hitting with this patch is that code did not really need
to be signal aware until now. Well to say that more accurately - we were
able to test the code which is normally executed from userspace, so is
signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
cases itself.

For example threads which exercise an internal API for a while until the
parent calls kthread_stop. Now those tests can hit unexpected errors.

Question is how to best approach working around this change. It is of
course technically possible to rework our code in more than one way,
although with some cost and impact already felt due reduced pass rates
in our automated test suites.

Maybe an opt out kthread flag from this new behavior? Would that be
acceptable as a quick fix? Or any other comments?

Regards,

Tvrtko


2022-10-19 16:10:26

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
<[email protected]> wrote:
>
>
> Hi,
>
> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> kthread_stop()") if I may.
>
> We have a bunch code in i915, possibly limited to self tests (ie debug
> builds) but still important for our flows, which spawn kernel threads
> and exercises parts of the driver.
>
> Problem we are hitting with this patch is that code did not really need
> to be signal aware until now. Well to say that more accurately - we were
> able to test the code which is normally executed from userspace, so is
> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> cases itself.
>
> For example threads which exercise an internal API for a while until the
> parent calls kthread_stop. Now those tests can hit unexpected errors.
>
> Question is how to best approach working around this change. It is of
> course technically possible to rework our code in more than one way,
> although with some cost and impact already felt due reduced pass rates
> in our automated test suites.
>
> Maybe an opt out kthread flag from this new behavior? Would that be
> acceptable as a quick fix? Or any other comments?

You can opt out by running `clear_tsk_thread_flag(current,
TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
fix your code instead. Were I your reviewer, I wouldn't merge code
that took the lazy path like that. However, that should work, if you
do opt for the quick fix.

Jason

2022-10-19 16:48:14

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 10:00 AM Jason A. Donenfeld <[email protected]> wrote:
>
> On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> <[email protected]> wrote:
> >
> >
> > Hi,
> >
> > A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> > kthread_stop()") if I may.
> >
> > We have a bunch code in i915, possibly limited to self tests (ie debug
> > builds) but still important for our flows, which spawn kernel threads
> > and exercises parts of the driver.
> >
> > Problem we are hitting with this patch is that code did not really need
> > to be signal aware until now. Well to say that more accurately - we were
> > able to test the code which is normally executed from userspace, so is
> > signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> > cases itself.
> >
> > For example threads which exercise an internal API for a while until the
> > parent calls kthread_stop. Now those tests can hit unexpected errors.
> >
> > Question is how to best approach working around this change. It is of
> > course technically possible to rework our code in more than one way,
> > although with some cost and impact already felt due reduced pass rates
> > in our automated test suites.
> >
> > Maybe an opt out kthread flag from this new behavior? Would that be
> > acceptable as a quick fix? Or any other comments?
>
> You can opt out by running `clear_tsk_thread_flag(current,
> TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> fix your code instead. Were I your reviewer, I wouldn't merge code
> that took the lazy path like that. However, that should work, if you
> do opt for the quick fix.

Oh my, I haven't had my coffee yet and sent that too fast without
thinking straight. That certainly won't work as intended. Sorry for
the noise.

Jason

2022-10-19 18:37:19

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()


On 19/10/2022 17:00, Jason A. Donenfeld wrote:
> On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> <[email protected]> wrote:
>>
>>
>> Hi,
>>
>> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
>> kthread_stop()") if I may.
>>
>> We have a bunch code in i915, possibly limited to self tests (ie debug
>> builds) but still important for our flows, which spawn kernel threads
>> and exercises parts of the driver.
>>
>> Problem we are hitting with this patch is that code did not really need
>> to be signal aware until now. Well to say that more accurately - we were
>> able to test the code which is normally executed from userspace, so is
>> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
>> cases itself.
>>
>> For example threads which exercise an internal API for a while until the
>> parent calls kthread_stop. Now those tests can hit unexpected errors.
>>
>> Question is how to best approach working around this change. It is of
>> course technically possible to rework our code in more than one way,
>> although with some cost and impact already felt due reduced pass rates
>> in our automated test suites.
>>
>> Maybe an opt out kthread flag from this new behavior? Would that be
>> acceptable as a quick fix? Or any other comments?
>
> You can opt out by running `clear_tsk_thread_flag(current,
> TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> fix your code instead. Were I your reviewer, I wouldn't merge code
> that took the lazy path like that. However, that should work, if you
> do opt for the quick fix.

Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has
propagated to our development tree on Monday, our automated testing
started failing significantly, which prevents us merging new work until
resolved. So a quick fix trumps the ideal road in the short term. Just
because it is quick.

Also, are you confident that the change will not catch anyone else by
surprise? In the original thread I did not spot any concerns about the
kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS
from random call chains.

Regards,

Tvrtko

2022-10-19 18:52:22

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:
>
> On 19/10/2022 17:00, Jason A. Donenfeld wrote:
> > On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> > <[email protected]> wrote:
> >>
> >>
> >> Hi,
> >>
> >> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> >> kthread_stop()") if I may.
> >>
> >> We have a bunch code in i915, possibly limited to self tests (ie debug
> >> builds) but still important for our flows, which spawn kernel threads
> >> and exercises parts of the driver.
> >>
> >> Problem we are hitting with this patch is that code did not really need
> >> to be signal aware until now. Well to say that more accurately - we were
> >> able to test the code which is normally executed from userspace, so is
> >> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> >> cases itself.
> >>
> >> For example threads which exercise an internal API for a while until the
> >> parent calls kthread_stop. Now those tests can hit unexpected errors.
> >>
> >> Question is how to best approach working around this change. It is of
> >> course technically possible to rework our code in more than one way,
> >> although with some cost and impact already felt due reduced pass rates
> >> in our automated test suites.
> >>
> >> Maybe an opt out kthread flag from this new behavior? Would that be
> >> acceptable as a quick fix? Or any other comments?
> >
> > You can opt out by running `clear_tsk_thread_flag(current,
> > TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> > fix your code instead. Were I your reviewer, I wouldn't merge code
> > that took the lazy path like that. However, that should work, if you
> > do opt for the quick fix.
>
> Also, are you confident that the change will not catch anyone else by
> surprise? In the original thread I did not spot any concerns about the
> kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS
> from random call chains.

Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
abusing the API by calling kthread_run() followed by kthread_stop().

As evidence of how broken your code actually is, the kthread_stop()
function has a comment that makes it clear, "This can also be called
after kthread_create() instead of calling wake_up_process(): the thread
will exit without calling threadfn()," yet i915 has attempted to hack
around it with ridiculous yields and msleeps. For example:

threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
&t, "igt/%d", n);
...

yield(); /* start all threads before we begin */
msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
...
err = kthread_stop(threads[n]);


Or here's another one:

tsk = kthread_run(fn, &thread[i], "igt-%d", i);
...
msleep(10); /* start all threads before we kthread_stop() */
...
status = kthread_stop(tsk);

I mean come on.

This is brittle and bad and kind of ridiculous that it shipped this way.
Now you're asking to extend your brittleness, so that you can avoid the
work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
only 5, as far as I can tell.


> Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has
> propagated to our development tree on Monday, our automated testing
> started failing significantly, which prevents us merging new work until
> resolved. So a quick fix trumps the ideal road in the short term. Just
> because it is quick.

"Short term" -- somehow I can imagine the short term hack will turn into
a long term one.

Anyway, what I suspect you might actually want as a bandaid is a
"kthread_wait()"-like function, that doesn't try to stop the thread with
KTHREAD_SHOULD_STOP and such, but just waits for the completion:

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 30e5bec81d2b..2699cc45ad15 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
int kthread_stop(struct task_struct *k);
+int kthread_wait(struct task_struct *k);
bool kthread_should_stop(void);
bool kthread_should_park(void);
bool __kthread_should_park(struct task_struct *k);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index f97fd01a2932..d581d78a3a26 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
}
EXPORT_SYMBOL(kthread_stop);

+int kthread_wait(struct task_struct *k)
+{
+ struct kthread *kthread;
+ int ret;
+
+ get_task_struct(k);
+ kthread = to_kthread(k);
+ wake_up_process(k);
+ wait_for_completion(&kthread->exited);
+ ret = kthread->result;
+ put_task_struct(k);
+
+ return ret;
+}
+EXPORT_SYMBOL(kthread_stop);
+
int kthreadd(void *unused)
{
struct task_struct *tsk = current;

2022-10-19 19:14:42

by Sultan Alsawaf

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 12:16:22PM -0600, Jason A. Donenfeld wrote:
> On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:
> >
> > On 19/10/2022 17:00, Jason A. Donenfeld wrote:
> > > On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
> > > <[email protected]> wrote:
> > >>
> > >>
> > >> Hi,
> > >>
> > >> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
> > >> kthread_stop()") if I may.
> > >>
> > >> We have a bunch code in i915, possibly limited to self tests (ie debug
> > >> builds) but still important for our flows, which spawn kernel threads
> > >> and exercises parts of the driver.
> > >>
> > >> Problem we are hitting with this patch is that code did not really need
> > >> to be signal aware until now. Well to say that more accurately - we were
> > >> able to test the code which is normally executed from userspace, so is
> > >> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
> > >> cases itself.
> > >>
> > >> For example threads which exercise an internal API for a while until the
> > >> parent calls kthread_stop. Now those tests can hit unexpected errors.
> > >>
> > >> Question is how to best approach working around this change. It is of
> > >> course technically possible to rework our code in more than one way,
> > >> although with some cost and impact already felt due reduced pass rates
> > >> in our automated test suites.
> > >>
> > >> Maybe an opt out kthread flag from this new behavior? Would that be
> > >> acceptable as a quick fix? Or any other comments?
> > >
> > > You can opt out by running `clear_tsk_thread_flag(current,
> > > TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
> > > fix your code instead. Were I your reviewer, I wouldn't merge code
> > > that took the lazy path like that. However, that should work, if you
> > > do opt for the quick fix.
> >
> > Also, are you confident that the change will not catch anyone else by
> > surprise? In the original thread I did not spot any concerns about the
> > kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS
> > from random call chains.
>
> Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
> that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
> abusing the API by calling kthread_run() followed by kthread_stop().
>
> As evidence of how broken your code actually is, the kthread_stop()
> function has a comment that makes it clear, "This can also be called
> after kthread_create() instead of calling wake_up_process(): the thread
> will exit without calling threadfn()," yet i915 has attempted to hack
> around it with ridiculous yields and msleeps. For example:
>
> threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
> &t, "igt/%d", n);
> ...
>
> yield(); /* start all threads before we begin */
> msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
> ...
> err = kthread_stop(threads[n]);
>
>
> Or here's another one:
>
> tsk = kthread_run(fn, &thread[i], "igt-%d", i);
> ...
> msleep(10); /* start all threads before we kthread_stop() */
> ...
> status = kthread_stop(tsk);
>
> I mean come on.
>
> This is brittle and bad and kind of ridiculous that it shipped this way.
> Now you're asking to extend your brittleness, so that you can avoid the
> work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
> only 5, as far as I can tell.
>
>
> > Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has
> > propagated to our development tree on Monday, our automated testing
> > started failing significantly, which prevents us merging new work until
> > resolved. So a quick fix trumps the ideal road in the short term. Just
> > because it is quick.
>
> "Short term" -- somehow I can imagine the short term hack will turn into
> a long term one.
>
> Anyway, what I suspect you might actually want as a bandaid is a
> "kthread_wait()"-like function, that doesn't try to stop the thread with
> KTHREAD_SHOULD_STOP and such, but just waits for the completion:
>
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 30e5bec81d2b..2699cc45ad15 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
> void kthread_bind(struct task_struct *k, unsigned int cpu);
> void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
> int kthread_stop(struct task_struct *k);
> +int kthread_wait(struct task_struct *k);
> bool kthread_should_stop(void);
> bool kthread_should_park(void);
> bool __kthread_should_park(struct task_struct *k);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index f97fd01a2932..d581d78a3a26 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
> }
> EXPORT_SYMBOL(kthread_stop);
>
> +int kthread_wait(struct task_struct *k)
> +{
> + struct kthread *kthread;
> + int ret;
> +
> + get_task_struct(k);
> + kthread = to_kthread(k);
> + wake_up_process(k);
> + wait_for_completion(&kthread->exited);
> + ret = kthread->result;
> + put_task_struct(k);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(kthread_stop);
> +
> int kthreadd(void *unused)
> {
> struct task_struct *tsk = current;
>

Hi,

Given the need to ensure the kthreads are started and then synchronously
flushed, this seems like a good fit for the kthread_work API, which provides all
of the necessary plumbing for this usecase. No need for the eldritch kthread API
abuse and flimsy yield+msleep.

Sultan

2022-10-19 21:16:29

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: signal: break out of wait loops on kthread_stop()


On 19/10/2022 19:16, Jason A. Donenfeld wrote:
> On Wed, Oct 19, 2022 at 06:57:38PM +0100, Tvrtko Ursulin wrote:
>>
>> On 19/10/2022 17:00, Jason A. Donenfeld wrote:
>>> On Wed, Oct 19, 2022 at 7:31 AM Tvrtko Ursulin
>>> <[email protected]> wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> A question regarding a7c01fa93aeb ("signal: break out of wait loops on
>>>> kthread_stop()") if I may.
>>>>
>>>> We have a bunch code in i915, possibly limited to self tests (ie debug
>>>> builds) but still important for our flows, which spawn kernel threads
>>>> and exercises parts of the driver.
>>>>
>>>> Problem we are hitting with this patch is that code did not really need
>>>> to be signal aware until now. Well to say that more accurately - we were
>>>> able to test the code which is normally executed from userspace, so is
>>>> signal aware, but not worry about -ERESTARTSYS or -EINTR within the test
>>>> cases itself.
>>>>
>>>> For example threads which exercise an internal API for a while until the
>>>> parent calls kthread_stop. Now those tests can hit unexpected errors.
>>>>
>>>> Question is how to best approach working around this change. It is of
>>>> course technically possible to rework our code in more than one way,
>>>> although with some cost and impact already felt due reduced pass rates
>>>> in our automated test suites.
>>>>
>>>> Maybe an opt out kthread flag from this new behavior? Would that be
>>>> acceptable as a quick fix? Or any other comments?
>>>
>>> You can opt out by running `clear_tsk_thread_flag(current,
>>> TIF_NOTIFY_SIGNAL);` at the top of your kthread. But you should really
>>> fix your code instead. Were I your reviewer, I wouldn't merge code
>>> that took the lazy path like that. However, that should work, if you
>>> do opt for the quick fix.
>>
>> Also, are you confident that the change will not catch anyone else by
>> surprise? In the original thread I did not spot any concerns about the
>> kthreads being generally unprepared to start receiving EINTR/ERESTARTSYS
>> from random call chains.
>
> Pretty sure, yea. i915 is unique in its abuse of the API. Keep in mind
> that kthread_stop() also sets KTHREAD_SHOULD_STOP and such. Your code is
> abusing the API by calling kthread_run() followed by kthread_stop().

Hm why is kthread_stop() after kthread_run() abuse? I don't see it in
kerneldoc that it must not be used for stopping threads.

> As evidence of how broken your code actually is, the kthread_stop()
> function has a comment that makes it clear, "This can also be called
> after kthread_create() instead of calling wake_up_process(): the thread
> will exit without calling threadfn()," yet i915 has attempted to hack
> around it with ridiculous yields and msleeps. For example:
>
> threads[n] = kthread_run(__igt_breadcrumbs_smoketest,
> &t, "igt/%d", n);
> ...
>
> yield(); /* start all threads before we begin */
> msleep(jiffies_to_msecs(i915_selftest.timeout_jiffies));
> ...
> err = kthread_stop(threads[n]);
>
>
> Or here's another one:
>
> tsk = kthread_run(fn, &thread[i], "igt-%d", i);
> ...
> msleep(10); /* start all threads before we kthread_stop() */
> ...
> status = kthread_stop(tsk);
>
> I mean come on.
>
> This is brittle and bad and kind of ridiculous that it shipped this way.
> Now you're asking to extend your brittleness, so that you can avoid the
> work of cleaning up 5 call sites. Just clean up those 5 call sites. It's
> only 5, as far as I can tell.

Yep the yields and sleeps are horrible and will go. But they are also
not relevant for the topic at hand. Issue is signal_pending() in the
thread which just happens to now let kthread_stop() exit the thread
before the work it used to do. And lack of consistent EINTR/ERESTARTSYS
handling throughout.

Luckily I am almost sure this hasn't "shipped" anywhere real, in the
sense it is debug only part of the driver.

Never mind, I was not looking for anything more than a suggestion on how
to maybe work around it in piece as someone is dealing with the affected
call sites.

kthread_wait below is perhaps a bit too indirect, since overall
refactoring of the approach will be needed, but thanks anyway.

Thanks,

Tvrtko

>> Right, but our hand is a bit forced at the moment. Since 6.1-rc1 has
>> propagated to our development tree on Monday, our automated testing
>> started failing significantly, which prevents us merging new work until
>> resolved. So a quick fix trumps the ideal road in the short term. Just
>> because it is quick.
>
> "Short term" -- somehow I can imagine the short term hack will turn into
> a long term one.
>
> Anyway, what I suspect you might actually want as a bandaid is a
> "kthread_wait()"-like function, that doesn't try to stop the thread with
> KTHREAD_SHOULD_STOP and such, but just waits for the completion:
>
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 30e5bec81d2b..2699cc45ad15 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -86,6 +86,7 @@ void free_kthread_struct(struct task_struct *k);
> void kthread_bind(struct task_struct *k, unsigned int cpu);
> void kthread_bind_mask(struct task_struct *k, const struct cpumask *mask);
> int kthread_stop(struct task_struct *k);
> +int kthread_wait(struct task_struct *k);
> bool kthread_should_stop(void);
> bool kthread_should_park(void);
> bool __kthread_should_park(struct task_struct *k);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index f97fd01a2932..d581d78a3a26 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -715,6 +715,22 @@ int kthread_stop(struct task_struct *k)
> }
> EXPORT_SYMBOL(kthread_stop);
>
> +int kthread_wait(struct task_struct *k)
> +{
> + struct kthread *kthread;
> + int ret;
> +
> + get_task_struct(k);
> + kthread = to_kthread(k);
> + wake_up_process(k);
> + wait_for_completion(&kthread->exited);
> + ret = kthread->result;
> + put_task_struct(k);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(kthread_stop);
> +
> int kthreadd(void *unused)
> {
> struct task_struct *tsk = current;
>

2022-10-19 21:17:20

by Jason A. Donenfeld

[permalink] [raw]
Subject: Re: [Intel-gfx] signal: break out of wait loops on kthread_stop()

On Wed, Oct 19, 2022 at 09:09:28PM +0100, Tvrtko Ursulin wrote:
> Hm why is kthread_stop() after kthread_run() abuse? I don't see it in
> kerneldoc that it must not be used for stopping threads.

Because you don't want it to stop. You want to wait until it's done. If
you call stop right after run, it will even stop it before it even
begins to run. That's why you wind up sprinkling your msleeps
everywhere, indicating that clearly this is not meant to work that way.

> Yep the yields and sleeps are horrible and will go. But they are also
> not relevant for the topic at hand.

Except they very much are. The reason you need these is because you're
using kthread_stop() for something it's not meant to do.

> Never mind, I was not looking for anything more than a suggestion on how
> to maybe work around it in piece as someone is dealing with the affected
> call sites.

Sultan's kthread_work idea is probably the right direction. This would
seem to have what you need.

Jason

2022-10-20 14:03:33

by Tvrtko Ursulin

[permalink] [raw]
Subject: Re: [Intel-gfx] signal: break out of wait loops on kthread_stop()


On 19/10/2022 21:19, Jason A. Donenfeld wrote:
> On Wed, Oct 19, 2022 at 09:09:28PM +0100, Tvrtko Ursulin wrote:
>> Hm why is kthread_stop() after kthread_run() abuse? I don't see it in
>> kerneldoc that it must not be used for stopping threads.
>
> Because you don't want it to stop. You want to wait until it's done. If
> you call stop right after run, it will even stop it before it even
> begins to run. That's why you wind up sprinkling your msleeps
> everywhere, indicating that clearly this is not meant to work that way.
Not after kthread_run which wakes it up already. If the kerneldoc for
kthread_stop() is correct at least... In which case I really do think
that the yields are pointless/red herring. Perhaps they predate
kthread_run and then they were even wrong.

>> Yep the yields and sleeps are horrible and will go. But they are also
>> not relevant for the topic at hand.
>
> Except they very much are. The reason you need these is because you're
> using kthread_stop() for something it's not meant to do.

It is supposed to assert kthread_should_stop() which thread can look at
as when to exit. Except that now it can fail to get to that controlled
exit point. Granted that argument is moot since it implies incomplete
error handling in the thread anyway.

Btw there are actually two use cases in our code base. One is thread
controls the exit, second is caller controls the exit. Anyway...

>> Never mind, I was not looking for anything more than a suggestion on how
>> to maybe work around it in piece as someone is dealing with the affected
>> call sites.
>
> Sultan's kthread_work idea is probably the right direction. This would
> seem to have what you need.

... yes, it can be converted. Even though for one of the two use cases
we need explicit signalling. There now isn't anything which would assert
kthread_should_stop() without also asserting the signal, right?. Neither
I found that the thread work API can do it.

Fingers crossed we were the only "abusers" of the API. There's a quite a
number of kthread_stop callers and it would be a large job to audit them
all.

Regards,

Tvrtko

2022-11-28 19:02:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Intel-gfx] signal: break out of wait loops on kthread_stop()

Tvrtko Ursulin <[email protected]> writes:

> On 19/10/2022 21:19, Jason A. Donenfeld wrote:
>> On Wed, Oct 19, 2022 at 09:09:28PM +0100, Tvrtko Ursulin wrote:
>>> Hm why is kthread_stop() after kthread_run() abuse? I don't see it in
>>> kerneldoc that it must not be used for stopping threads.
>> Because you don't want it to stop. You want to wait until it's done. If
>> you call stop right after run, it will even stop it before it even
>> begins to run. That's why you wind up sprinkling your msleeps
>> everywhere, indicating that clearly this is not meant to work that way.
> Not after kthread_run which wakes it up already. If the kerneldoc for
> kthread_stop() is correct at least... In which case I really do think
> that the yields are pointless/red herring. Perhaps they predate kthread_run and
> then they were even wrong.
>
>>> Yep the yields and sleeps are horrible and will go. But they are also
>>> not relevant for the topic at hand.
>> Except they very much are. The reason you need these is because you're
>> using kthread_stop() for something it's not meant to do.
>
> It is supposed to assert kthread_should_stop() which thread can look at as when
> to exit. Except that now it can fail to get to that controlled exit
> point. Granted that argument is moot since it implies incomplete error handling
> in the thread anyway.
>
> Btw there are actually two use cases in our code base. One is thread controls
> the exit, second is caller controls the exit. Anyway...
>
>>> Never mind, I was not looking for anything more than a suggestion on how
>>> to maybe work around it in piece as someone is dealing with the affected
>>> call sites.
>> Sultan's kthread_work idea is probably the right direction. This would
>> seem to have what you need.
>
> ... yes, it can be converted. Even though for one of the two use cases we need
> explicit signalling. There now isn't anything which would assert
> kthread_should_stop() without also asserting the signal, right?. Neither
> I found that the thread work API can do it.
>
> Fingers crossed we were the only "abusers" of the API. There's a quite a number
> of kthread_stop callers and it would be a large job to audit them all.


I have been out and am coming to this late. Did this get resolved?


I really don't expect this affected much of anything else as the code
sat in linux-next for an entire development cycle before being merged.

But I would like to make certain problems with this change were resolved.

Thank you,
Eric