2015-06-05 15:01:48

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

Kthreads are implemented as an infinite loop. They include check points
for termination, freezer, parking, and even signal handling.

We need to touch all kthreads every time we want to add or
modify the behavior of such checkpoints. It is not easy because
there are several hundreds of kthreads and each of them is
implemented in a slightly different way.

This anarchy brings potentially broken or non-standard behavior.
For example, few kthreads already handle signals a strange way.


This patchset is a _proof-of-concept_ how to improve the situation.


The goal is:


+ enforce cleaner and better maintainable kthreads implementation
using a new API

+ standardize signal handling in kthreads

+ hopefully solve some existing problems, e.g. with suspend


Why new API?


First, I do not want to add yet another API that would need
to be supported. The aim is to _replace_ the current API.
Well, the old API would need to stay around for some time until
all kthreads are converted.

Second, there are two more existing alternatives. They fulfill
the needs and can be used for some conversions. But IMHO, they
are not well usable in all cases. Let's talk more about them.


Workqueue


Workqueues are quite popular and many kthreads have already been
converted into them.

Work queues allow to split the function into even more pieces and
reach the common check point more often. It is especially useful
when a kthread handles more tasks and is woken when some work
is needed. Then we could queue the appropriate work instead
of waking the whole kthread and checking what exactly needs
to be done.

But there are many kthreads that need to cycle many times
until some work is finished, e.g. khugepaged, virtio_balloon,
jffs2_garbage_collect_thread. They would need to queue the
work item repeatedly from the same work item or between
more work items. It would be a strange semantic.

Work queues allow to share the same kthread between more users.
It helps to reduce the number of running kthreads. It is especially
useful if you would need a kthread for each CPU.

But this might also be a disadvantage. Just look into the output
of the command "ps" and see the many [kworker*] processes. One
might see this a black hole. If a kworker makes the system busy,
it is less obvious what the problem is in compare with the old
"simple" and dedicated kthreads.

Yes, we could add some debugging tools for work queues but
it would be another non-standard thing that developers and
system administrators would need to understand.

Another thing is that work queues have their own scheduler. If we
move even more tasks there it might need even more love. Anyway,
the extra scheduler adds another level of complexity when
debugging problems.


kthread_worker


kthread_worker is similar to workqueues in many ways. You need to

+ define work functions
+ define and initialize work structs
+ queue work items (structs pointing to the functions and data)

We could repeat the paragraphs about splitting the work
and sharing the kthread between more users here.

Well, the kthread_worker implementation is much simpler than
the one for workqueues. It is more similar to a simple
kthread. Especially, it uses the system scheduler.
But it is still more complex that the simple kthread.

One interesting thing is that kthread_workers add internal work
items into the queue. They typically use a completion. An example
is the flush work. see flush_kthread_work(). It is a nice trick
but you need to be careful. For example, if you would want to
terminate the kthread, you might want to remove some work item
from the queue, especially if you need to break a work item that
is called in a cycle (queues itself). The question is what to do
with the internal tasks. If you keep them, they might wake up
sleepers when the work was not really completed. If you remove
them, the counter part might sleep forever.


Conclusion


I think that we still want some rather simple API for kthreads
but it need to be more enforcing that the current simple one.


Content


This patchset is split the following way:

+ 2nd patch: defines basic structure of a new kthread API that
allows to get most of the checks into a single place

+ 6th patch: proposal of signal handling in kthreads

+ 7th patch: makes kthreads using the new API freezable by default

+ 9th, 16th patches: proposal how to maintain sleeping between
kthread iterations on a single place

+ 10th, 11th, 12th, 17th, 18th patches: show how the new API
could be used in some kthreads and hopefully clean them
a bit

+ the other patches add some helper functions or do some
related clean up


The patchset touches many areas: kthreads, scheduler, signal handling,
freezer, parking, many subsystems and drivers are using kthreads. This
is why I added so many people into CC.

The patch set can be applied against current Linus' tree for 4.1.0-rc6.


Petr Mladek (18):
kthread: Allow to call __kthread_create_on_node() with va_list args
kthread: Add API for iterant kthreads
kthread: Add kthread_stop_current()
signal: Rename kernel_sigaction() to kthread_sigaction() and clean it
up
freezer/scheduler: Add freezable_cond_resched()
signal/kthread: Initial implementation of kthread signal handling
kthread: Make iterant kthreads freezable by default
kthread: Allow to get struct kthread_iterant from task_struct
kthread: Make it easier to correctly sleep in iterant kthreads
jffs2: Remove forward definition of jffs2_garbage_collect_thread()
jffs2: Convert jffs2_gcd_mtd kthread into the iterant API
lockd: Convert the central lockd service to kthread_iterant API
ring_buffer: Use iterant kthreads API in the ring buffer benchmark
ring_buffer: Allow to cleanly freeze the ring buffer benchmark
kthreads
ring_buffer: Allow to exit the ring buffer benchmark immediately
kthread: Support interruptible sleep with a timeout by iterant
kthreads
ring_buffer: Use the new API for a sleep with a timeout in the
benchmark
jffs2: Use the new API for a sleep with a timeout

fs/jffs2/background.c | 178 ++++++++++------------
fs/lockd/svc.c | 80 +++++-----
include/linux/freezer.h | 8 +
include/linux/kthread.h | 67 ++++++++
include/linux/signal.h | 24 ++-
include/linux/sunrpc/svc.h | 2 +
kernel/kmod.c | 2 +-
kernel/kthread.c | 286 +++++++++++++++++++++++++++++++----
kernel/signal.c | 84 +++++++++-
kernel/trace/ring_buffer_benchmark.c | 110 +++++++-------
10 files changed, 611 insertions(+), 230 deletions(-)

--
1.8.5.6



2015-06-05 15:01:52

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 02/18] kthread: Add API for iterant kthreads

Kthreads are implemented as an infinite loop. They include check points
for termination, freezer, parking, and even signal handling.

In many cases there are some operations done before and after
the main cycle.

We need to touch all kthreads everytime we want to add or
modify the behavior of such checkpoints. It is not easy because
there are several hundreds of kthreads and each of them is
implemented slightly different way.

This anarchy brings potentially broken or non-standard behavior.
For example, few kthreads associated strange behavior to signals.

This patch defines new API that will help to standardize
kthreads and move all checkpoints to a single location.

The initial implementation allows to define functions that are
called before, in, and after the main cycle. It defines the common
checkpoint for freezing and ktherad stopping. The support for
parking and signal handling will be added later.

It will be sufficient for many kthreads. And it is ready for more
features that will be needed for the rest. For example, some kthreads
might want to call special functions after a return from the fridge.

Why new API?

First, we do not want to add yet another API that would need
to be supported. The aim is to _replace_ the current API,
split the monolithic threadfn, and define a standard structure
of the main cycle and standardize the related check points.
Well, the old API would need to stay around for some time until
all kthreads are converted.

Second, there are two more existing alternatives. They fulfill
the needs, will be used for some conversions, but they are not
well usable in all cases. Let's talk more about them.

Workqueue

Workqueues are quite popular and many kthreads have already been
converted into them.

Work queues allow to split the function into even more pieces and
reach the common check point more often. It is especially useful
when a kthread handles more tasks and is waken when some work
is needed. Then we could queue the appropriate work instead
of waking the whole kthread and checking what exactly needs
to be done.

But there are many kthreads that need to cycle many times
until some work is finished, e.g. khugepaged, virtio_balloon,
jffs2_garbage_collect_thread. They would need to queue
the work repeatedly or between more work items. It would be
a strange semantic. Note that we need to queue the work
repeatedly when it needs to be called in a cycle.

Work queues allow to share the same kthread between more users.
It helps to reduce the number of running kthreads. It is especially
useful if you would need a kthread for each CPU.

But this might also be a disadvantage. Just look into the output
of the command "ps" and see the many [kworker*] processes. One
might see this a black hole. If a kworker makes the system busy,
it is less obvious what the problem is in compare with the old
"simple" and dedicated kthreads.

Yes, we could add some debugging tools for work queues but
it would be another non-standard thing that developers and
system administrators would need to understand.

Another thing is that work queues have their own scheduler. If we
move even more tasks there it might need even more love. Anyway,
the extra scheduler adds another level of complexity when
debugging problems.

kthread_worker

kthread_worker is similar to workqueues in many ways. You need to

+ define work functions
+ define and initialize work structs
+ queue work items (structs pointing to the functions and data)

We could repeat here the paragraphs about splitting the work
and sharing the kthread between more users.

The kthread_worker implementation is much simpler than
the one for workqueues. It is more similar to a simple
kthread. Especially, it uses the system scheduler.
But it is still more complex that the simple kthread.

One interesting thing is that kthread_workers add internal work
items into the queue. They typically use a completion. An example
is the flush work. It is a nice trick but you need to be careful.
For example, if you would want to terminate the kthread, you might
want remove some work item from the queue, especially if you need
to break a work item that is called in a cycle (queues itself).
The question is what to do with the internal tasks. If you keep
them, they might wake up sleepers when the work was not really
completed. If you remove them, the counter part might sleep
forever.

Conclusion

The aim of the new API is to provide a simple and straightforward API
to create dedicated kthreads. It will split the current monolithic
function into few well defined pieces. This might look as a small
complication. But it will also move the typical checks into a common
place. This will reduce the complexity of the kthreads and will
allow to do the checks properly and maintain them more easily.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/kthread.h | 48 ++++++++++++++++++++++++++++++
kernel/kthread.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 126 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 13d55206ccf6..06fe9ad192dd 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -4,6 +4,9 @@
#include <linux/err.h>
#include <linux/sched.h>

+/*
+ * Old API that defines kthreads using a single function.
+ */
__printf(4, 5)
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data,
@@ -37,6 +40,51 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
__k; \
})

+/*
+ * New API that allows to create kthreads with a clear semantic. It defines
+ * functions that are called inside the kthread. The functions must return
+ * in a consistent state so that the kthread can get frozen, parked,
+ * and even signals processed between the calls. It means that all locks
+ * must be released and non-interruptible work finished.
+ */
+
+/**
+ * struct kthread_iterant - structure describing the function of the kthread
+ * @data: pointer to a data passed to the functions.
+ * @init: function called when the kthread is created.
+ * @func: function called in the main cycle until the kthread is terminated.
+ * @destroy: function called when the kthread is being terminated.
+ */
+struct kthread_iterant {
+ void *data;
+ void (*init)(void *data);
+ void (*func)(void *data);
+ void (*destroy)(void *data);
+};
+
+__printf(3, 4)
+struct task_struct *
+kthread_iterant_create_on_node(struct kthread_iterant *kti,
+ int node,
+ const char namefmt[], ...);
+
+#define kthread_iterant_create(kti, namefmt, arg...) \
+ kthread_iterant_create_on_node(kti, -1, namefmt, ##arg)
+
+struct task_struct *
+kthread_iterant_create_on_cpu(struct kthread_iterant *kti,
+ unsigned int cpu,
+ const char *namefmt);
+
+#define kthread_iterant_run(kti, namefmt, ...) \
+({ \
+ struct task_struct *__k \
+ = kthread_iterant_create(kti, namefmt, ## __VA_ARGS__); \
+ if (!IS_ERR(__k)) \
+ wake_up_process(__k); \
+ __k; \
+})
+
void kthread_bind(struct task_struct *k, unsigned int cpu);
int kthread_stop(struct task_struct *k);
bool kthread_should_stop(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fca7cd124512..4b2698bcc622 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -393,6 +393,84 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
return p;
}

+/**
+ * kthread_iterant_fn - kthread function to process an iterant kthread
+ * @kti_ptr: pointer to an initialized struct kthread_iterant.
+ *
+ * This function defines the work flow of iterant kthreads. It calls
+ * the given callbacks. Also it defines and implements a common
+ * check point for freezing and stopping.
+ */
+static int kthread_iterant_fn(void *kti_ptr)
+{
+ struct kthread_iterant *kti = kti_ptr;
+ void *data = kti->data;
+
+ if (kti->init)
+ kti->init(data);
+
+ do {
+ if (kti->func)
+ kti->func(data);
+
+ /* this check is safe also for non-freezable kthreads */
+ } while (!kthread_freezable_should_stop(NULL));
+
+ if (kti->destroy)
+ kti->destroy(data);
+
+ return 0;
+}
+
+/**
+ * kthread_iterant_create_on_node - create an iterant kthread.
+ * @kti: structure describing the thread function.
+ * @node: memory node number.
+ * @namefmt: printf-style name for the thread.
+ *
+ * Description: This helper function creates and names an iterant kernel
+ * thread. The thread will be stopped: use wake_up_process() to start
+ * it. See also kthread_iterant_run().
+ *
+ * It behaves the same as kthread_create_on_node(). The only difference
+ * is that the function is described using struct kthread_iterant.
+ */
+struct task_struct *
+kthread_iterant_create_on_node(struct kthread_iterant *kti,
+ int node,
+ const char namefmt[], ...)
+{
+ struct task_struct *task;
+ va_list args;
+
+ va_start(args, namefmt);
+ task = __kthread_create_on_node(kthread_iterant_fn, kti, node,
+ namefmt, args);
+ va_end(args);
+
+ return task;
+}
+EXPORT_SYMBOL(kthread_iterant_create_on_node);
+
+/**
+ * kthread_iterant_create_on_cpu - Create a cpu bound iterant kthread
+ * @kti: structure describing the function.
+ * @cpu: The cpu on which the thread should be bound,
+ * @namefmt: printf-style name for the thread. Format is restricted
+ * to "name.*%u". Code fills in cpu number.
+ *
+ * Description: This helper function creates and names an iterant kernel
+ * thread. The thread will be woken and put into park mode.
+ */
+struct task_struct *
+kthread_iterant_create_on_cpu(struct kthread_iterant *kti,
+ unsigned int cpu,
+ const char *namefmt)
+{
+ return kthread_create_on_cpu(kthread_iterant_fn, kti,
+ cpu, namefmt);
+}
+
static void __kthread_unpark(struct task_struct *k, struct kthread *kthread)
{
clear_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
--
1.8.5.6


2015-06-05 15:02:05

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 12/18] lockd: Convert the central lockd service to kthread_iterant API

The new iterant kthread API allows to define a common checkpoint for
freezing, parking, termination, and even signal handling. It will allow
to maintain kthreads more easily and make the operations more reliable.

The kthread function is split into optional init(), func(), destroy() parts
where func() is called in a cycle. The common check point is after
each func() function finishes. See kthread_iterant_fn() for more details.

This patch moves the action associated with the signal into a proper
signal handler.

It removes the obsolete set_freezable() call because iterant kthreads
are freezable by default.

struct kthread_iterant is stored into struct svc_rqst. We have already
stored there the pointer to the related task_struct.

The rest is just shuffling the code from the while cycle into _func().

Signed-off-by: Petr Mladek <[email protected]>
---
fs/lockd/svc.c | 80 ++++++++++++++++++++++++----------------------
include/linux/sunrpc/svc.h | 2 ++
2 files changed, 44 insertions(+), 38 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 55505cbe11af..5b1efe509fcc 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -122,58 +122,55 @@ static void restart_grace(void)
}

/*
- * This is the lockd kernel thread
+ * Lockd kernel thread implementation using the iterant API
+ * We don't terminate until the last NFS mount or NFS daemon
+ * has gone away.
*/
-static int
-lockd(void *vrqstp)
+static void lockd_sigkill(int sig)
{
- int err = 0;
- struct svc_rqst *rqstp = vrqstp;
-
- /* try_to_freeze() is called from svc_recv() */
- set_freezable();
+ restart_grace();
+}

+static void lockd_init(void *vrqstp)
+{
/* Allow SIGKILL to tell lockd to drop all of its locks */
- allow_signal(SIGKILL);
+ kthread_sigaction(SIGKILL, lockd_sigkill);

dprintk("NFS locking service started (ver " LOCKD_VERSION ").\n");
+}

- /*
- * The main request loop. We don't terminate until the last
- * NFS mount or NFS daemon has gone away.
- */
- while (!kthread_should_stop()) {
- long timeout = MAX_SCHEDULE_TIMEOUT;
- RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
+static void lockd_func(void *vrqstp)
+{
+ int err = 0;
+ struct svc_rqst *rqstp = vrqstp;

- /* update sv_maxconn if it has changed */
- rqstp->rq_server->sv_maxconn = nlm_max_connections;
+ long timeout = MAX_SCHEDULE_TIMEOUT;
+ RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);

- if (signalled()) {
- flush_signals(current);
- restart_grace();
- continue;
- }
+ /* update sv_maxconn if it has changed */
+ rqstp->rq_server->sv_maxconn = nlm_max_connections;

- timeout = nlmsvc_retry_blocked();
+ timeout = nlmsvc_retry_blocked();

- /*
- * Find a socket with data available and call its
- * recvfrom routine.
- */
- err = svc_recv(rqstp, timeout);
- if (err == -EAGAIN || err == -EINTR)
- continue;
- dprintk("lockd: request from %s\n",
- svc_print_addr(rqstp, buf, sizeof(buf)));
+ /*
+ * Find a socket with data available and call its
+ * recvfrom routine.
+ */
+ err = svc_recv(rqstp, timeout);
+ if (err == -EAGAIN || err == -EINTR)
+ return;

- svc_process(rqstp);
- }
- flush_signals(current);
+ dprintk("lockd: request from %s\n",
+ svc_print_addr(rqstp, buf, sizeof(buf)));
+
+ svc_process(rqstp);
+}
+
+static void lockd_destroy(void *vrqstp)
+{
if (nlmsvc_ops)
nlmsvc_invalidate_all();
nlm_shutdown_hosts();
- return 0;
}

static int create_lockd_listener(struct svc_serv *serv, const char *name,
@@ -301,7 +298,14 @@ static int lockd_start_svc(struct svc_serv *serv)
svc_sock_update_bufs(serv);
serv->sv_maxconn = nlm_max_connections;

- nlmsvc_task = kthread_create(lockd, nlmsvc_rqst, "%s", serv->sv_name);
+ nlmsvc_rqst->rq_kti.type = 0;
+ nlmsvc_rqst->rq_kti.data = nlmsvc_rqst;
+ nlmsvc_rqst->rq_kti.init = lockd_init;
+ nlmsvc_rqst->rq_kti.func = lockd_func;
+ nlmsvc_rqst->rq_kti.destroy = lockd_destroy;
+
+ nlmsvc_task = kthread_iterant_create(&nlmsvc_rqst->rq_kti,
+ "%s", serv->sv_name);
if (IS_ERR(nlmsvc_task)) {
error = PTR_ERR(nlmsvc_task);
printk(KERN_WARNING
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index fae6fb947fc8..6275e9b9df9b 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -16,6 +16,7 @@
#include <linux/sunrpc/xdr.h>
#include <linux/sunrpc/auth.h>
#include <linux/sunrpc/svcauth.h>
+#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/mm.h>

@@ -283,6 +284,7 @@ struct svc_rqst {
struct auth_domain * rq_client; /* RPC peer info */
struct auth_domain * rq_gssclient; /* "gss/"-style peer info */
struct svc_cacherep * rq_cacherep; /* cache info */
+ struct kthread_iterant rq_kti; /* info for iterant kthread */
struct task_struct *rq_task; /* service thread */
spinlock_t rq_lock; /* per-request lock */
};
--
1.8.5.6


2015-06-05 15:02:13

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 18/18] jffs2: Use the new API for a sleep with a timeout

It allows to maintain the checks for freezing, parking, and signal
handling on a single place.

Note that the sleep with the timeout is used here to slow down
the progress a bit. It does not matter if we do it before or after
calling jffs2_garbage_collect_pass().

Signed-off-by: Petr Mladek <[email protected]>
---
fs/jffs2/background.c | 29 +++++++++++------------------
1 file changed, 11 insertions(+), 18 deletions(-)

diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 50c16048ba2d..2d38d40e0ad1 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -62,6 +62,16 @@ static void jffs2_garbage_collect_thread_func(void *_c)
}
spin_unlock(&c->erase_completion_lock);

+ /* We don't want SIGHUP to interrupt us. STOP and KILL are OK though. */
+ sigprocmask(SIG_BLOCK, &hupmask, NULL);
+
+ jffs2_dbg(1, "%s(): pass\n", __func__);
+ if (jffs2_garbage_collect_pass(c) == -ENOSPC) {
+ pr_notice("No space for garbage collection. Aborting GC thread\n");
+ kthread_stop_current();
+ }
+ sigprocmask(SIG_UNBLOCK, &hupmask, NULL);
+
/* Problem - immediately after bootup, the GCD spends a lot
* of time in places like jffs2_kill_fragtree(); so much so
* that userspace processes (like gdm and X) are starved
@@ -73,24 +83,7 @@ static void jffs2_garbage_collect_thread_func(void *_c)
* inode in with read_inode() is much preferable to having
* the GC thread get there first.
*/
- schedule_timeout_interruptible(msecs_to_jiffies(50));
-
- if (kthread_should_stop()) {
- jffs2_dbg(1, "%s(): kthread_stop() called\n", __func__);
- return;
- }
-
- try_to_freeze();
-
- /* We don't want SIGHUP to interrupt us. STOP and KILL are OK though. */
- sigprocmask(SIG_BLOCK, &hupmask, NULL);
-
- jffs2_dbg(1, "%s(): pass\n", __func__);
- if (jffs2_garbage_collect_pass(c) == -ENOSPC) {
- pr_notice("No space for garbage collection. Aborting GC thread\n");
- kthread_stop_current();
- }
- sigprocmask(SIG_UNBLOCK, &hupmask, NULL);
+ set_kthread_iterant_int_sleep_timeout(msecs_to_jiffies(50));
}

static void jffs2_garbage_collect_thread_destroy(void *_c)
--
1.8.5.6


2015-06-05 15:02:11

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 17/18] ring_buffer: Use the new API for a sleep with a timeout in the benchmark

It will allow to maintain the checks for freezing, parking, and signal
handling on a single place.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/trace/ring_buffer_benchmark.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 4ecf7c4567e2..ae6295d6e63f 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -398,8 +398,7 @@ static void ring_buffer_producer_thread_func(void *arg)
goto test_killed;

trace_printk("Sleeping for 10 secs\n");
- set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(HZ * SLEEP_TIME);
+ set_kthread_iterant_int_sleep_timeout(HZ * SLEEP_TIME);
return;

test_killed:
--
1.8.5.6


2015-06-05 15:02:06

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 13/18] ring_buffer: Use iterant kthreads API in the ring buffer benchmark

The new iterant kthread API allows to define a common checkpoint for
freezing, parking, termination, and even signal handling. It will allow
to maintain kthreads more easily and make the operations more reliable.

The kthread function is split into optional init(), func(), destroy() parts
where func() is called in a cycle. The common check point is after
each func() function finishes. See kthread_iterant_fn() for more details.

This patch removes the wait_to_die() cycle. Instead it uses
the main cycle and func() does nothing in the kill_test state.

The threads are not safe for freezing because there is used schedule()
without try_to_freeze() inside the kthread. Let's fix it in separate
patch. In the meantime, we need to disable it explicitly in the init()
function.

Consumer does not need to call schedule in func(). Instead, it sets
KTI_INT_SLEEP type and lets the kthread iterant framework to do the sleep
between iterations.

On the other hand, the producer still need to implement the sleeping
in func() because the iterant kthread framework does not support
this type of sleep. But the plan is to make it possible later.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/trace/ring_buffer_benchmark.c | 81 +++++++++++++++++-------------------
1 file changed, 39 insertions(+), 42 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 13d945c0d03f..164f3762cc82 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -30,6 +30,8 @@ static struct completion read_done;
static struct ring_buffer *buffer;
static struct task_struct *producer;
static struct task_struct *consumer;
+static struct kthread_iterant producer_kti;
+static struct kthread_iterant consumer_kti;
static unsigned long read;

static int disable_reader;
@@ -354,61 +356,51 @@ static void ring_buffer_producer(void)
}
}

-static void wait_to_die(void)
+static void ring_buffer_consumer_thread_init(void *arg)
{
- set_current_state(TASK_INTERRUPTIBLE);
- while (!kthread_should_stop()) {
- schedule();
- set_current_state(TASK_INTERRUPTIBLE);
- }
- __set_current_state(TASK_RUNNING);
+ /* it does not matter where we freeze */
+ current->flags |= PF_NOFREEZE;
}

-static int ring_buffer_consumer_thread(void *arg)
+static void ring_buffer_consumer_thread_func(void *arg)
{
- while (!kthread_should_stop() && !kill_test) {
- complete(&read_start);
-
- ring_buffer_consumer();
-
- set_current_state(TASK_INTERRUPTIBLE);
- if (kthread_should_stop() || kill_test)
- break;
-
- schedule();
- }
- __set_current_state(TASK_RUNNING);
-
if (kill_test)
- wait_to_die();
+ return;

- return 0;
+ complete(&read_start);
+
+ ring_buffer_consumer();
}

-static int ring_buffer_producer_thread(void *arg)
+static void ring_buffer_producer_thread_init(void *arg)
{
init_completion(&read_start);

- while (!kthread_should_stop() && !kill_test) {
- ring_buffer_reset(buffer);
+ /* it does not matter where we freeze */
+ current->flags |= PF_NOFREEZE;
+}

- if (consumer) {
- smp_wmb();
- wake_up_process(consumer);
- wait_for_completion(&read_start);
- }
+static void ring_buffer_producer_thread_func(void *arg)
+{
+ if (kill_test) {
+ set_kthread_iterant_int_sleep();
+ return;
+ }

- ring_buffer_producer();
+ ring_buffer_reset(buffer);

- trace_printk("Sleeping for 10 secs\n");
- set_current_state(TASK_INTERRUPTIBLE);
- schedule_timeout(HZ * SLEEP_TIME);
+ if (consumer) {
+ /* reset ring buffer before waking up the consumer */
+ smp_wmb();
+ wake_up_process(consumer);
+ wait_for_completion(&read_start);
}

- if (kill_test)
- wait_to_die();
+ ring_buffer_producer();

- return 0;
+ trace_printk("Sleeping for 10 secs\n");
+ set_current_state(TASK_INTERRUPTIBLE);
+ schedule_timeout(HZ * SLEEP_TIME);
}

static int __init ring_buffer_benchmark_init(void)
@@ -420,16 +412,21 @@ static int __init ring_buffer_benchmark_init(void)
if (!buffer)
return -ENOMEM;

+ consumer_kti.type = KTI_INT_SLEEP;
+ consumer_kti.func = ring_buffer_consumer_thread_init;
+ consumer_kti.func = ring_buffer_consumer_thread_func;
+
if (!disable_reader) {
- consumer = kthread_create(ring_buffer_consumer_thread,
- NULL, "rb_consumer");
+ consumer = kthread_iterant_create(&consumer_kti, "rb_consumer");
ret = PTR_ERR(consumer);
if (IS_ERR(consumer))
goto out_fail;
}

- producer = kthread_run(ring_buffer_producer_thread,
- NULL, "rb_producer");
+ producer_kti.init = ring_buffer_producer_thread_init;
+ producer_kti.func = ring_buffer_producer_thread_func;
+
+ producer = kthread_iterant_run(&producer_kti, "rb_producer");
ret = PTR_ERR(producer);

if (IS_ERR(producer))
--
1.8.5.6


2015-06-05 15:02:11

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 16/18] kthread: Support interruptible sleep with a timeout by iterant kthreads

The aim of kthread iterant API is to maintain some non-trivial operations
on the single place. One of this operations is sleeping between iterations
because we need to handle properly system freezing, parking, and
kthread stopping.

This patch adds support for interruptible sleep with a timeout. It defines
the type, function to set and do the sleep. The timeout value is added
into the struct kthread_iterant.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/kthread.h | 5 +++++
kernel/kthread.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 415178c20cde..e8f7c0106cfd 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -53,13 +53,16 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
#define KTI_PAUSE_ONCE 0x00000001
/* Do interruptible sleep between iterations. */
#define KTI_INT_SLEEP 0x00000002
+#define KTI_INT_SLEEP_TIMEOUT 0x00000004

#define KTI_PAUSE_MASK (KTI_INT_SLEEP | \
+ KTI_INT_SLEEP_TIMEOUT | \
KTI_PAUSE_ONCE)

/**
* struct kthread_iterant - structure describing the function of the kthread
* @type: modifies the kthread behavior using extra flags.
+ * @timeout: timeout value in jiffies.
* @data: pointer to a data passed to the functions.
* @init: function called when the kthread is created.
* @func: function called in the main cycle until the kthread is terminated.
@@ -67,6 +70,7 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
*/
struct kthread_iterant {
unsigned int type;
+ signed long timeout;
void *data;
void (*init)(void *data);
void (*func)(void *data);
@@ -97,6 +101,7 @@ kthread_iterant_create_on_cpu(struct kthread_iterant *kti,
})

void set_kthread_iterant_int_sleep(void);
+void set_kthread_iterant_int_sleep_timeout(signed long timeout);

void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_stop_current(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index fa40fb549e22..9a5845674419 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -455,6 +455,30 @@ void set_kthread_iterant_int_sleep(void)
EXPORT_SYMBOL(set_kthread_iterant_int_sleep);

/**
+ * set_kthread_iterant_int_sleep_timeout - do interuptible sleep
+ * with a timeout before the next iteration
+ * @timeout: timeout value in jiffies
+ *
+ * This function is typically called under a lock, when the main func()
+ * checks for the pending work.
+ *
+ * Kthreads should pause between iterations only when there is no work,
+ * signal pending, freezing, parking, or termination; we want to do most
+ * of these checks properly on a single place: kthread_iterant_fn().
+ * Only the check for pending work need to be done in the main func()
+ * because it is not generic.
+ */
+void set_kthread_iterant_int_sleep_timeout(signed long timeout)
+{
+ struct kthread_iterant *kti = to_kthread_iterant(current);
+
+ set_kthread_iterant_pause_type(kti, KTI_INT_SLEEP_TIMEOUT);
+ kti->timeout = timeout;
+ set_current_state(TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL(set_kthread_iterant_int_sleep_timeout);
+
+/**
* do_kthread_iterant_pause - do the selected pause before next iteration
*
* Most kthreads sleep or wait between iterations. This function
@@ -470,7 +494,9 @@ static void do_kthread_iterant_pause(struct kthread_iterant *kti)
* Explicitly set the task state when it was not set by
* set_kthread_iterant_int_sleep*() functions.
*/
- if (!(type & KTI_PAUSE_ONCE) && (type & KTI_INT_SLEEP))
+ if (!(type & KTI_PAUSE_ONCE) &&
+ ((type & KTI_INT_SLEEP) ||
+ (type & KTI_INT_SLEEP_TIMEOUT)))
set_current_state(TASK_INTERRUPTIBLE);

if (kthread_freezable_should_stop(NULL)) {
@@ -480,6 +506,8 @@ static void do_kthread_iterant_pause(struct kthread_iterant *kti)

if (type & KTI_INT_SLEEP)
freezable_schedule();
+ else if (type & KTI_INT_SLEEP_TIMEOUT)
+ freezable_schedule_timeout(kti->timeout);
else
freezable_cond_resched();

--
1.8.5.6


2015-06-05 15:02:08

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 15/18] ring_buffer: Allow to exit the ring buffer benchmark immediately

It takes a while until the ring_buffer_benchmark module is removed
when the ring buffer hammer is running. It is because it takes
few seconds and kthread_should_terminate() is not being checked.

This patch adds the check for kthread termination into the producer.
It uses the existing kill_test flag to finish the kthreads as
cleanly as possible.

It disables printing the "ERROR" message when the kthread is going.

Also it makes sure that producer does not go into the 10sec sleep
when it is being killed.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/trace/ring_buffer_benchmark.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 2d276b892aea..4ecf7c4567e2 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -270,6 +270,9 @@ static void ring_buffer_producer(void)
try_to_freeze();
}

+ if (kthread_should_stop())
+ kill_test = 1;
+
} while (ktime_before(end_time, timeout) && !kill_test);
trace_printk("End ring buffer hammer\n");

@@ -291,7 +294,7 @@ static void ring_buffer_producer(void)
entries = ring_buffer_entries(buffer);
overruns = ring_buffer_overruns(buffer);

- if (kill_test)
+ if (kill_test && !kthread_should_stop())
trace_printk("ERROR!\n");

if (!disable_reader) {
@@ -377,10 +380,8 @@ static void ring_buffer_producer_thread_init(void *arg)

static void ring_buffer_producer_thread_func(void *arg)
{
- if (kill_test) {
- set_kthread_iterant_int_sleep();
- return;
- }
+ if (kill_test)
+ goto test_killed;

ring_buffer_reset(buffer);

@@ -393,9 +394,16 @@ static void ring_buffer_producer_thread_func(void *arg)

ring_buffer_producer();

+ if (kill_test)
+ goto test_killed;
+
trace_printk("Sleeping for 10 secs\n");
set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(HZ * SLEEP_TIME);
+ return;
+
+test_killed:
+ set_kthread_iterant_int_sleep();
}

static int __init ring_buffer_benchmark_init(void)
--
1.8.5.6


2015-06-05 15:02:08

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 14/18] ring_buffer: Allow to cleanly freeze the ring buffer benchmark kthreads

One target of the switch to the kthread iterant API is to make
most kthreads cleanly freezable.

This patch adds try_to_freeze() where appropriate and enables
freezing.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/trace/ring_buffer_benchmark.c | 34 ++++++++++++++--------------------
1 file changed, 14 insertions(+), 20 deletions(-)

diff --git a/kernel/trace/ring_buffer_benchmark.c b/kernel/trace/ring_buffer_benchmark.c
index 164f3762cc82..2d276b892aea 100644
--- a/kernel/trace/ring_buffer_benchmark.c
+++ b/kernel/trace/ring_buffer_benchmark.c
@@ -6,6 +6,7 @@
#include <linux/ring_buffer.h>
#include <linux/completion.h>
#include <linux/kthread.h>
+#include <linux/freezer.h>
#include <linux/module.h>
#include <linux/ktime.h>
#include <asm/local.h>
@@ -207,6 +208,7 @@ static void ring_buffer_consumer(void)
break;

schedule();
+ try_to_freeze();
}
reader_finish = 0;
complete(&read_done);
@@ -252,19 +254,21 @@ static void ring_buffer_producer(void)
if (consumer && !(cnt % wakeup_interval))
wake_up_process(consumer);

+ if (cnt % wakeup_interval) {
#ifndef CONFIG_PREEMPT
- /*
- * If we are a non preempt kernel, the 10 second run will
- * stop everything while it runs. Instead, we will call
- * cond_resched and also add any time that was lost by a
- * rescedule.
- *
- * Do a cond resched at the same frequency we would wake up
- * the reader.
- */
- if (cnt % wakeup_interval)
+ /*
+ * If we are a non preempt kernel, the 10 second run
+ * will stop everything while it runs. Instead, we will
+ * call cond_resched and also add any time that was
+ * lost by a rescedule.
+ *
+ * Do a cond resched at the same frequency we would
+ * wake up the reader.
+ */
cond_resched();
#endif
+ try_to_freeze();
+ }

} while (ktime_before(end_time, timeout) && !kill_test);
trace_printk("End ring buffer hammer\n");
@@ -356,12 +360,6 @@ static void ring_buffer_producer(void)
}
}

-static void ring_buffer_consumer_thread_init(void *arg)
-{
- /* it does not matter where we freeze */
- current->flags |= PF_NOFREEZE;
-}
-
static void ring_buffer_consumer_thread_func(void *arg)
{
if (kill_test)
@@ -375,9 +373,6 @@ static void ring_buffer_consumer_thread_func(void *arg)
static void ring_buffer_producer_thread_init(void *arg)
{
init_completion(&read_start);
-
- /* it does not matter where we freeze */
- current->flags |= PF_NOFREEZE;
}

static void ring_buffer_producer_thread_func(void *arg)
@@ -413,7 +408,6 @@ static int __init ring_buffer_benchmark_init(void)
return -ENOMEM;

consumer_kti.type = KTI_INT_SLEEP;
- consumer_kti.func = ring_buffer_consumer_thread_init;
consumer_kti.func = ring_buffer_consumer_thread_func;

if (!disable_reader) {
--
1.8.5.6


2015-06-05 15:02:04

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

The new iterant kthread API allows to define a common checkpoint for
freezing, parking, termination, and even signal handling. It will allow
to maintain kthreads more easily and make the operations more reliable[*].

The kthread function is split into optional init(), func(), destroy() parts
where func() is called in a cycle. The common check point is after
each func() finishes. See kthread_iterant_fn() for more details.

func() does not need to call schedule(). Instead, we call
set_kthread_iterant_int_sleep() and let the generic code
to handle the sleeping properly.

There are few small functional changes:

+ SIGSTOP uses the standard handler and could not print debug
messages any longer

+ there is no need to enable or define sighandler for SIGCONT;
it is handled by prepare_signal() and could not get disabled

+ the debug message for default signal handler was removed; It is
handled in kernel_do_signal() a standard way. In reality, this
never happens because all signals are disabled except for the
four ones enabled by kthread_sigaction();

+ we call complete() instead of complete_and_exit(); do_exit() is
called in kthread() anyway after the function returns.

[*] In fact, there was a bug in the original code. It tried to process
a non-existing signal when the system was freezing. See the common
check for pending signal and freezing.

Signed-off-by: Petr Mladek <[email protected]>
---
fs/jffs2/background.c | 156 ++++++++++++++++++++++++--------------------------
1 file changed, 74 insertions(+), 82 deletions(-)

diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index 6af076b8f60f..50c16048ba2d 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -21,99 +21,86 @@
#include <linux/kthread.h>
#include "nodelist.h"

-static int jffs2_garbage_collect_thread(void *_c)
+static void jffs2_garbage_collect_thread_sighup(int sig)
+{
+ jffs2_dbg(1, "%s(): SIGHUP received\n", __func__);
+}
+
+static void jffs2_garbage_collect_thread_sigkill(int sig)
+{
+ jffs2_dbg(1, "%s(): SIGKILL received\n", __func__);
+ kthread_stop_current();
+}
+
+static void jffs2_garbage_collect_thread_init(void *_c)
{
struct jffs2_sb_info *c = _c;
- sigset_t hupmask;

- siginitset(&hupmask, sigmask(SIGHUP));
- allow_signal(SIGKILL);
- allow_signal(SIGSTOP);
- allow_signal(SIGCONT);
- allow_signal(SIGHUP);
+ kthread_sigaction(SIGKILL, jffs2_garbage_collect_thread_sigkill);
+ kthread_sigaction(SIGHUP, jffs2_garbage_collect_thread_sighup);
+ kthread_sigaction(SIGSTOP, KTHREAD_SIG_DFL);

c->gc_task = current;
complete(&c->gc_thread_start);

set_user_nice(current, 10);
+};
+
+static void jffs2_garbage_collect_thread_func(void *_c)
+{
+ struct jffs2_sb_info *c = _c;
+ sigset_t hupmask;
+
+ siginitset(&hupmask, sigmask(SIGHUP));

- set_freezable();
- for (;;) {
- sigprocmask(SIG_UNBLOCK, &hupmask, NULL);
- again:
- spin_lock(&c->erase_completion_lock);
- if (!jffs2_thread_should_wake(c)) {
- set_current_state (TASK_INTERRUPTIBLE);
- spin_unlock(&c->erase_completion_lock);
- jffs2_dbg(1, "%s(): sleeping...\n", __func__);
- schedule();
- } else {
- spin_unlock(&c->erase_completion_lock);
- }
- /* Problem - immediately after bootup, the GCD spends a lot
- * of time in places like jffs2_kill_fragtree(); so much so
- * that userspace processes (like gdm and X) are starved
- * despite plenty of cond_resched()s and renicing. Yield()
- * doesn't help, either (presumably because userspace and GCD
- * are generally competing for a higher latency resource -
- * disk).
- * This forces the GCD to slow the hell down. Pulling an
- * inode in with read_inode() is much preferable to having
- * the GC thread get there first. */
- schedule_timeout_interruptible(msecs_to_jiffies(50));
-
- if (kthread_should_stop()) {
- jffs2_dbg(1, "%s(): kthread_stop() called\n", __func__);
- goto die;
- }
-
- /* Put_super will send a SIGKILL and then wait on the sem.
- */
- while (signal_pending(current) || freezing(current)) {
- siginfo_t info;
- unsigned long signr;
-
- if (try_to_freeze())
- goto again;
-
- signr = dequeue_signal_lock(current, &current->blocked, &info);
-
- switch(signr) {
- case SIGSTOP:
- jffs2_dbg(1, "%s(): SIGSTOP received\n",
- __func__);
- set_current_state(TASK_STOPPED);
- schedule();
- break;
-
- case SIGKILL:
- jffs2_dbg(1, "%s(): SIGKILL received\n",
- __func__);
- goto die;
-
- case SIGHUP:
- jffs2_dbg(1, "%s(): SIGHUP received\n",
- __func__);
- break;
- default:
- jffs2_dbg(1, "%s(): signal %ld received\n",
- __func__, signr);
- }
- }
- /* We don't want SIGHUP to interrupt us. STOP and KILL are OK though. */
- sigprocmask(SIG_BLOCK, &hupmask, NULL);
-
- jffs2_dbg(1, "%s(): pass\n", __func__);
- if (jffs2_garbage_collect_pass(c) == -ENOSPC) {
- pr_notice("No space for garbage collection. Aborting GC thread\n");
- goto die;
- }
+ spin_lock(&c->erase_completion_lock);
+ if (!jffs2_thread_should_wake(c)) {
+ set_kthread_iterant_int_sleep();
+ spin_unlock(&c->erase_completion_lock);
+ jffs2_dbg(1, "%s(): sleeping...\n", __func__);
+ return;
+ }
+ spin_unlock(&c->erase_completion_lock);
+
+ /* Problem - immediately after bootup, the GCD spends a lot
+ * of time in places like jffs2_kill_fragtree(); so much so
+ * that userspace processes (like gdm and X) are starved
+ * despite plenty of cond_resched()s and renicing. Yield()
+ * doesn't help, either (presumably because userspace and GCD
+ * are generally competing for a higher latency resource -
+ * disk).
+ * This forces the GCD to slow the hell down. Pulling an
+ * inode in with read_inode() is much preferable to having
+ * the GC thread get there first.
+ */
+ schedule_timeout_interruptible(msecs_to_jiffies(50));
+
+ if (kthread_should_stop()) {
+ jffs2_dbg(1, "%s(): kthread_stop() called\n", __func__);
+ return;
}
- die:
+
+ try_to_freeze();
+
+ /* We don't want SIGHUP to interrupt us. STOP and KILL are OK though. */
+ sigprocmask(SIG_BLOCK, &hupmask, NULL);
+
+ jffs2_dbg(1, "%s(): pass\n", __func__);
+ if (jffs2_garbage_collect_pass(c) == -ENOSPC) {
+ pr_notice("No space for garbage collection. Aborting GC thread\n");
+ kthread_stop_current();
+ }
+ sigprocmask(SIG_UNBLOCK, &hupmask, NULL);
+}
+
+static void jffs2_garbage_collect_thread_destroy(void *_c)
+{
+ struct jffs2_sb_info *c = _c;
+
spin_lock(&c->erase_completion_lock);
c->gc_task = NULL;
spin_unlock(&c->erase_completion_lock);
- complete_and_exit(&c->gc_thread_exit, 0);
+ complete(&c->gc_thread_exit);
}

void jffs2_garbage_collect_trigger(struct jffs2_sb_info *c)
@@ -126,16 +113,21 @@ void jffs2_garbage_collect_trigger(struct jffs2_sb_info *c)
/* This must only ever be called when no GC thread is currently running */
int jffs2_start_garbage_collect_thread(struct jffs2_sb_info *c)
{
+ static struct kthread_iterant kti = {
+ .init = jffs2_garbage_collect_thread_init,
+ .func = jffs2_garbage_collect_thread_func,
+ .destroy = jffs2_garbage_collect_thread_destroy,
+ };
struct task_struct *tsk;
int ret = 0;

BUG_ON(c->gc_task);

+ kti.data = c;
init_completion(&c->gc_thread_start);
init_completion(&c->gc_thread_exit);

- tsk = kthread_run(jffs2_garbage_collect_thread, c,
- "jffs2_gcd_mtd%d", c->mtd->index);
+ tsk = kthread_iterant_run(&kti, "jffs2_gcd_mtd%d", c->mtd->index);
if (IS_ERR(tsk)) {
pr_warn("fork failed for JFFS2 garbage collect thread: %ld\n",
-PTR_ERR(tsk));
--
1.8.5.6


2015-06-05 15:01:58

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

Many kthreads already calls set_freezable() before they enter the main
cycle. One of the reasons for creating iterant kthreads is to create
a safe point for freezing and make even more kthreads properly
freezable. Therefore it would make sense to set all iterant
kthreads freezable by default.

However some kthreads might be hard to make properly freezable.
For example, if they do non-interruptible sleeps. They would
need to explicitly clear PF_NOFREEZE flag in the init() call.
But it should be avoided whenever possible.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/kthread.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 185902d0e293..e34f67cb8ecf 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -412,13 +412,20 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
*
* This function defines the work flow of iterant kthreads. It calls
* the given callbacks. Also it defines and implements a common
- * check point for freezing and stopping.
+ * check point for freezing, stopping, and signal handling.
+ *
+ * IMPORTANT: Iterant kthreads are freezable by default. The functions
+ * defined in the given struct kthread_iterant still have to explicitly
+ * call try_to_freeze() after each interruptible sleep. They must avoid
+ * non-interruptible sleep if freezing is allowed.
*/
static int kthread_iterant_fn(void *kti_ptr)
{
struct kthread_iterant *kti = kti_ptr;
void *data = kti->data;

+ set_freezable();
+
if (kti->init)
kti->init(data);

--
1.8.5.6


2015-06-05 15:02:02

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 10/18] jffs2: Remove forward definition of jffs2_garbage_collect_thread()

This commit just moves the function definition and fixes few coding
style problems reported by checkpatch.pl. There are no changes in
the functionality.

The change will be useful when switching to the new iterant kthread API.

Signed-off-by: Petr Mladek <[email protected]>
---
fs/jffs2/background.c | 101 +++++++++++++++++++++++++-------------------------
1 file changed, 50 insertions(+), 51 deletions(-)

diff --git a/fs/jffs2/background.c b/fs/jffs2/background.c
index bb9cebc9ca8a..6af076b8f60f 100644
--- a/fs/jffs2/background.c
+++ b/fs/jffs2/background.c
@@ -21,57 +21,6 @@
#include <linux/kthread.h>
#include "nodelist.h"

-
-static int jffs2_garbage_collect_thread(void *);
-
-void jffs2_garbage_collect_trigger(struct jffs2_sb_info *c)
-{
- assert_spin_locked(&c->erase_completion_lock);
- if (c->gc_task && jffs2_thread_should_wake(c))
- send_sig(SIGHUP, c->gc_task, 1);
-}
-
-/* This must only ever be called when no GC thread is currently running */
-int jffs2_start_garbage_collect_thread(struct jffs2_sb_info *c)
-{
- struct task_struct *tsk;
- int ret = 0;
-
- BUG_ON(c->gc_task);
-
- init_completion(&c->gc_thread_start);
- init_completion(&c->gc_thread_exit);
-
- tsk = kthread_run(jffs2_garbage_collect_thread, c, "jffs2_gcd_mtd%d", c->mtd->index);
- if (IS_ERR(tsk)) {
- pr_warn("fork failed for JFFS2 garbage collect thread: %ld\n",
- -PTR_ERR(tsk));
- complete(&c->gc_thread_exit);
- ret = PTR_ERR(tsk);
- } else {
- /* Wait for it... */
- jffs2_dbg(1, "Garbage collect thread is pid %d\n", tsk->pid);
- wait_for_completion(&c->gc_thread_start);
- ret = tsk->pid;
- }
-
- return ret;
-}
-
-void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c)
-{
- int wait = 0;
- spin_lock(&c->erase_completion_lock);
- if (c->gc_task) {
- jffs2_dbg(1, "Killing GC task %d\n", c->gc_task->pid);
- send_sig(SIGKILL, c->gc_task, 1);
- wait = 1;
- }
- spin_unlock(&c->erase_completion_lock);
- if (wait)
- wait_for_completion(&c->gc_thread_exit);
-}
-
static int jffs2_garbage_collect_thread(void *_c)
{
struct jffs2_sb_info *c = _c;
@@ -166,3 +115,53 @@ static int jffs2_garbage_collect_thread(void *_c)
spin_unlock(&c->erase_completion_lock);
complete_and_exit(&c->gc_thread_exit, 0);
}
+
+void jffs2_garbage_collect_trigger(struct jffs2_sb_info *c)
+{
+ assert_spin_locked(&c->erase_completion_lock);
+ if (c->gc_task && jffs2_thread_should_wake(c))
+ send_sig(SIGHUP, c->gc_task, 1);
+}
+
+/* This must only ever be called when no GC thread is currently running */
+int jffs2_start_garbage_collect_thread(struct jffs2_sb_info *c)
+{
+ struct task_struct *tsk;
+ int ret = 0;
+
+ BUG_ON(c->gc_task);
+
+ init_completion(&c->gc_thread_start);
+ init_completion(&c->gc_thread_exit);
+
+ tsk = kthread_run(jffs2_garbage_collect_thread, c,
+ "jffs2_gcd_mtd%d", c->mtd->index);
+ if (IS_ERR(tsk)) {
+ pr_warn("fork failed for JFFS2 garbage collect thread: %ld\n",
+ -PTR_ERR(tsk));
+ complete(&c->gc_thread_exit);
+ ret = PTR_ERR(tsk);
+ } else {
+ /* Wait for it... */
+ jffs2_dbg(1, "Garbage collect thread is pid %d\n", tsk->pid);
+ wait_for_completion(&c->gc_thread_start);
+ ret = tsk->pid;
+ }
+
+ return ret;
+}
+
+void jffs2_stop_garbage_collect_thread(struct jffs2_sb_info *c)
+{
+ int wait = 0;
+
+ spin_lock(&c->erase_completion_lock);
+ if (c->gc_task) {
+ jffs2_dbg(1, "Killing GC task %d\n", c->gc_task->pid);
+ send_sig(SIGKILL, c->gc_task, 1);
+ wait = 1;
+ }
+ spin_unlock(&c->erase_completion_lock);
+ if (wait)
+ wait_for_completion(&c->gc_thread_exit);
+}
--
1.8.5.6


2015-06-05 15:02:01

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

Many kthreads go into an interruptible sleep when there is nothing
to do. They should check if anyone did not requested the kthread
to terminate, freeze, or park in the meantime. It is easy to do
it a wrong way.

This patch adds an API to do the sleep easier and correct way.
The big advantage is that we will have all the checks on a single
place and will be able to fix all broken threads easily.

We need two steps. set_current_state(TASK_INTERRUPTIBLE)
is typically called under some lock together with a check for
a pending work. While schedule() has to be called outside
the lock.

We use the freezable variants of kthread_should_stop(), schedule(),
and cond_resched(). They are needed in freezable kthreads and they
do the right job also in non-freezable ones.

The API is ready to support more sleeping variants, e.g.
wait_event_freezable(), wait_event_freezable_timeout(). Well,
we will need to add some more items into the struct kthread_iterant
for them.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/kthread.h | 13 +++++++++
kernel/kthread.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 100c1e006729..415178c20cde 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -48,14 +48,25 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
* must be released and non-interruptible work finished.
*/

+/* Flags that modify the default behavior of iterant kthreads. */
+/* Reset pause flags for the next iteration. */
+#define KTI_PAUSE_ONCE 0x00000001
+/* Do interruptible sleep between iterations. */
+#define KTI_INT_SLEEP 0x00000002
+
+#define KTI_PAUSE_MASK (KTI_INT_SLEEP | \
+ KTI_PAUSE_ONCE)
+
/**
* struct kthread_iterant - structure describing the function of the kthread
+ * @type: modifies the kthread behavior using extra flags.
* @data: pointer to a data passed to the functions.
* @init: function called when the kthread is created.
* @func: function called in the main cycle until the kthread is terminated.
* @destroy: function called when the kthread is being terminated.
*/
struct kthread_iterant {
+ unsigned int type;
void *data;
void (*init)(void *data);
void (*func)(void *data);
@@ -85,6 +96,8 @@ kthread_iterant_create_on_cpu(struct kthread_iterant *kti,
__k; \
})

+void set_kthread_iterant_int_sleep(void);
+
void kthread_bind(struct task_struct *k, unsigned int cpu);
void kthread_stop_current(void);
int kthread_stop(struct task_struct *k);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 41fb6a43a1f1..fa40fb549e22 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -414,6 +414,80 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
}

/**
+ * set_kthread_iterant_pause_type - set the type of the pause before
+ * the next iteration
+ * @type: flags that define the type of the pause; they are the same
+ * as the flags used for @type in struct kthread_iterant
+ */
+static void set_kthread_iterant_pause_type(struct kthread_iterant *kti,
+ unsigned int type)
+{
+ WARN_ONCE(type & ~KTI_PAUSE_MASK,
+ "unknown pause type: 0x%x\n", type);
+ type &= KTI_PAUSE_MASK;
+
+ kti->type &= ~KTI_PAUSE_MASK;
+ kti->type |= type;
+ /* this function set the pause only for the next iteration */
+ kti->type |= KTI_PAUSE_ONCE;
+}
+
+/**
+ * set_kthread_iterant_int_sleep - do interruptible sleep before
+ * the next iteration
+ *
+ * This function is typically called under a lock when the main func()
+ * checks for the pending work.
+ *
+ * Kthreads should pause between iterations only when there is no work,
+ * signal pending, freezing, parking, or termination. We want to do most
+ * of these checks properly on a single place: kthread_iterant_fn().
+ * Only the checks for pending work need to be done in the main func()
+ * because they are not generic.
+ */
+void set_kthread_iterant_int_sleep(void)
+{
+ struct kthread_iterant *kti = to_kthread_iterant(current);
+
+ set_kthread_iterant_pause_type(kti, KTI_INT_SLEEP);
+ set_current_state(TASK_INTERRUPTIBLE);
+}
+EXPORT_SYMBOL(set_kthread_iterant_int_sleep);
+
+/**
+ * do_kthread_iterant_pause - do the selected pause before next iteration
+ *
+ * Most kthreads sleep or wait between iterations. This function
+ * allows to do it the right way. The particular pause variant
+ * is defined by @type flags in struct kthread_iterant. Anyway,
+ * this is a safe place to call cond_resched().
+ */
+static void do_kthread_iterant_pause(struct kthread_iterant *kti)
+{
+ unsigned int type = kti->type;
+
+ /*
+ * Explicitly set the task state when it was not set by
+ * set_kthread_iterant_int_sleep*() functions.
+ */
+ if (!(type & KTI_PAUSE_ONCE) && (type & KTI_INT_SLEEP))
+ set_current_state(TASK_INTERRUPTIBLE);
+
+ if (kthread_freezable_should_stop(NULL)) {
+ set_current_state(TASK_RUNNING);
+ return;
+ }
+
+ if (type & KTI_INT_SLEEP)
+ freezable_schedule();
+ else
+ freezable_cond_resched();
+
+ if (type & KTI_PAUSE_ONCE)
+ kti->type &= ~KTI_PAUSE_MASK;
+}
+
+/**
* kthread_iterant_fn - kthread function to process an iterant kthread
* @kti_ptr: pointer to an initialized struct kthread_iterant.
*
@@ -443,6 +517,8 @@ static int kthread_iterant_fn(void *kti_ptr)
if (kti->func)
kti->func(data);

+ do_kthread_iterant_pause(kti);
+
if (signal_pending(current))
kthread_do_signal();

--
1.8.5.6


2015-06-05 15:02:00

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 08/18] kthread: Allow to get struct kthread_iterant from task_struct

It is easy to make a mistake when one implements sleeping between kthread
iterations. We will want to do it a more uniform way. For this we will
want to set some flags in struct kthread_iterant from the current
task (kthread). This patch adds the basic infrastructure to make
it possible.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/kthread.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index e34f67cb8ecf..41fb6a43a1f1 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -44,6 +44,7 @@ struct kthread {
void *data;
struct completion parked;
struct completion exited;
+ struct kthread_iterant *kti;
};

enum KTHREAD_BITS {
@@ -69,6 +70,11 @@ static struct kthread *to_live_kthread(struct task_struct *k)
return NULL;
}

+static struct kthread_iterant *to_kthread_iterant(struct task_struct *k)
+{
+ return __to_kthread(k->vfork_done)->kti;
+}
+
/**
* kthread_stop_current - make the current kthread to terminate a safe way
*
@@ -199,6 +205,7 @@ static int kthread(void *_create)
self.data = data;
init_completion(&self.exited);
init_completion(&self.parked);
+ self.kti = NULL;
current->vfork_done = &self.exited;

/* If user was SIGKILLed, I release the structure. */
@@ -421,9 +428,12 @@ struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
*/
static int kthread_iterant_fn(void *kti_ptr)
{
+ struct kthread *kt = to_kthread(current);
struct kthread_iterant *kti = kti_ptr;
void *data = kti->data;

+ kt->kti = kti;
+
set_freezable();

if (kti->init)
--
1.8.5.6


2015-06-05 15:01:57

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Several kthreads already handle signals some way. They do so
in different ways, search for allow_signal().

This patch allows to handle the signals in a more uniform way using
kthread_do_signal().

The main question is how much it should follow POSIX and the signal
handling of user space processes. On one hand, we want to be as close
as possible. On the other hand, we need to be very careful. All signals
were ignored until now, except for the few kthreads that somehow
handled them.

POSIX behavior might cause some surprises and crazy bug reports.
For example, imagine a home user stopping or killing kswapd because
it makes the system too busy. It is like shooting in its own leg.

Also the signals are already used a strange way. For example, klockd
drops all locks when it gets SIGKILL. This was added before initial
git commit, so the archaeology was not easy. The most likely explanation
is that it is used during the system shutdown. It would make sense
to drop all locks when user space processes are killed and before
filesystems get unmounted.

Now, the patch does the following compromise:

+ defines default actions for SIGSTOP, SIGCONT, and fatal signals
+ custom actions can be defined by kthread_sigaction()
+ all signals are still ignored by default

Well, fatal signals do not terminate the kthread immediately. They just
set a flag to terminate the kthread, flush all other pending signals,
and return to the main kthread cycle. The kthread is supposed to check
the flag, leave the main cycle, do some cleanup actions when necessary
and return from the kthread.

Note that kthreads are running in the kernel address space. It makes
the life much easier. The signal handler can be called directly and
we do not need any arch-specific code to process it in the userspace.
Also we do not care about ptrace.

Finally, kthread_do_signal() is called on a safe place in the main
iterant kthread cycle. Then we will not need any special code for
signals when using this kthread API.

This change does not affect any existing kthread. The plan is to
use them only in safe kthreads that can be converted into
the iterant kthread API.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/signal.h | 1 +
kernel/kthread.c | 3 ++
kernel/signal.c | 76 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 80 insertions(+)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 06e54762c281..41e30310bc3b 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -294,6 +294,7 @@ extern int get_signal(struct ksignal *ksig);
extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
extern void exit_signals(struct task_struct *tsk);
extern void kthread_sigaction(int, __kthread_sighandler_t);
+extern int kthread_do_signal(void);

static inline void allow_signal(int sig)
{
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 688bb4cfd807..185902d0e293 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -426,6 +426,9 @@ static int kthread_iterant_fn(void *kti_ptr)
if (kti->func)
kti->func(data);

+ if (signal_pending(current))
+ kthread_do_signal();
+
/* this check is safe also for non-freezable kthreads */
} while (!kthread_freezable_should_stop(NULL));

diff --git a/kernel/signal.c b/kernel/signal.c
index ca6bdb411449..cdca53965c3d 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -23,6 +23,7 @@
#include <linux/ptrace.h>
#include <linux/signal.h>
#include <linux/signalfd.h>
+#include <linux/kthread.h>
#include <linux/ratelimit.h>
#include <linux/tracehook.h>
#include <linux/capability.h>
@@ -2365,6 +2366,81 @@ relock:
}

/**
+ * kthread_do_signal - process signals delivered to kernel threads
+ *
+ * Kthreads are running in the kernel address space. It makes the life
+ * much easier. The signal handler can be called directly and we do not
+ * need any arch-specific code to process it in the userspace.
+ * Also we do not care about ptrace.
+ */
+int kthread_do_signal(void)
+{
+ struct ksignal ksig;
+ struct sighand_struct *sighand = current->sighand;
+ struct signal_struct *signal = current->signal;
+ unsigned long flags;
+ int signr;
+
+ try_to_freeze();
+relock:
+ spin_lock_irqsave(&sighand->siglock, flags);
+
+ if (unlikely(signal->flags & SIGNAL_CLD_MASK)) {
+ WARN(1, "there are no parents for kernel threads\n");
+ signal->flags &= ~SIGNAL_CLD_MASK;
+ }
+
+ for (;;) {
+ struct k_sigaction *ka;
+
+ signr = dequeue_signal(current, &current->blocked, &ksig.info);
+
+ if (!signr)
+ break;
+
+ ka = &sighand->action[signr-1];
+
+ /* Do nothing for ignored signals */
+ if (ka->sa.kthread_sa_handler == KTHREAD_SIG_IGN)
+ continue;
+
+ /* Run the custom handler if any */
+ if (ka->sa.kthread_sa_handler != KTHREAD_SIG_DFL) {
+ ksig.ka = *ka;
+
+ if (ka->sa.sa_flags & SA_ONESHOT)
+ ka->sa.kthread_sa_handler = KTHREAD_SIG_DFL;
+
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ /* could run directly for kthreads */
+ ksig.ka.sa.kthread_sa_handler(signr);
+ freezable_cond_resched();
+ goto relock;
+ }
+
+ /* Default actions */
+ if (sig_kernel_ignore(signr))
+ continue;
+
+ if (sig_kernel_stop(signr)) {
+ __set_current_state(TASK_STOPPED);
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+ /* Don't run again until woken by SIGCONT or SIGKILL */
+ freezable_schedule();
+ goto relock;
+ }
+
+ /* Death signals, but try to terminate cleanly */
+ kthread_stop_current();
+ __flush_signals(current);
+ break;
+ }
+ spin_unlock_irqrestore(&sighand->siglock, flags);
+
+ return 0;
+}
+
+/**
* signal_delivered -
* @ksig: kernel signal struct
* @stepping: nonzero if debugger single-step or block-step in use
--
1.8.5.6


2015-06-05 15:01:56

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 05/18] freezer/scheduler: Add freezable_cond_resched()

It makes sense to call cond_resched() in freezable kthreads.
But there is the chicken&egg problem whether to call try_to_freeze()
before or after the schedule.

This patch adds freezable_cond_resched(). It uses the same trick as
freezable_schedule(). It tells the freezer that it might ignore
the sleeping process.

Note that it might be used only on locations where the freezing
is safe. It means that no lock is taken and we are ready to sleep
for a very long time.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/freezer.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index 6b7fd9cf5ea2..1446f210dfc6 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -172,6 +172,14 @@ static inline void freezable_schedule(void)
freezer_count();
}

+/* Like cond_resched(), but should not block the freezer. */
+static inline void freezable_cond_resched(void)
+{
+ freezer_do_not_count();
+ cond_resched();
+ freezer_count();
+}
+
/* DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION */
static inline void freezable_schedule_unsafe(void)
{
--
1.8.5.6


2015-06-05 15:01:53

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 03/18] kthread: Add kthread_stop_current()

For example, jffs2_gcd_mtd kthread can be stopped by SIGKILL.
The signal is handled inside the main function.

We would like to convert such kthreads to the iterant API
and use proper signal handlers.

The new functions will allow to pass the information
between the signal handler and the main kthread functions.

kthread_stop_current() allows to quit the kthread correctly.
It means to leave the main cycle on a safe point and call
clean up actions via post() function.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 13 +++++++++++++
2 files changed, 14 insertions(+)

diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 06fe9ad192dd..100c1e006729 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -86,6 +86,7 @@ kthread_iterant_create_on_cpu(struct kthread_iterant *kti,
})

void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_stop_current(void);
int kthread_stop(struct task_struct *k);
bool kthread_should_stop(void);
bool kthread_should_park(void);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 4b2698bcc622..688bb4cfd807 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -70,6 +70,19 @@ static struct kthread *to_live_kthread(struct task_struct *k)
}

/**
+ * kthread_stop_current - make the current kthread to terminate a safe way
+ *
+ * This function sets KTHREAD_SHOULD_STOP flags. It makes kthread to break
+ * the main loop and do some clean up actions before the main kthread
+ * function finishes. It is the standard behavior for SIGTERM signal.
+ */
+void kthread_stop_current(void)
+{
+ set_bit(KTHREAD_SHOULD_STOP, &to_kthread(current)->flags);
+}
+EXPORT_SYMBOL(kthread_stop_current);
+
+/**
* kthread_should_stop - should this kthread return now?
*
* When someone calls kthread_stop() on your kthread, it will be woken
--
1.8.5.6


2015-06-05 15:01:55

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 04/18] signal: Rename kernel_sigaction() to kthread_sigaction() and clean it up

This patch defines __kthread_sighandler_t so that we do not need to force
a cast from __sighandler_t when passing an address from kernel space.

At the same time, it defines constants for the ignore, default, and
error sigactions that could be used within the kernel space. The __force
is a bit ugly but it is needed because of the ordering of includes.

Finally, the patch renames kernel_sigaction() to kthread_sigaction().
We are going to add proper signal handling to kthreads and this will
help to distinguish the related functions from the user-space
orientated ones. Note that we will just improve the existing
basic support for signals in kthreads.

The patch does not have any runtime effects. It is just a cleanup.

Note that it does not break the compilation with sparse. It warns
when a wrong part or the union is used in a wrong context.

Signed-off-by: Petr Mladek <[email protected]>
---
include/linux/signal.h | 23 ++++++++++++++++++-----
kernel/kmod.c | 2 +-
kernel/signal.c | 8 ++++----
3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index ab1e0392b5ac..06e54762c281 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -241,13 +241,26 @@ extern void __set_current_blocked(const sigset_t *);
extern int show_unhandled_signals;
extern int sigsuspend(sigset_t *);

+typedef __signalfn_t *__kthread_sighandler_t;
+
+/* use the default signal handler for the kthread */
+#define KTHREAD_SIG_DFL ((__force __kthread_sighandler_t)SIG_DFL)
+/* ignore the signal in the kthread */
+#define KTHREAD_SIG_IGN ((__force __kthread_sighandler_t)SIG_IGN)
+
struct sigaction {
#ifndef __ARCH_HAS_IRIX_SIGACTION
- __sighandler_t sa_handler;
+ union {
+ __sighandler_t sa_handler;
+ __kthread_sighandler_t kthread_sa_handler;
+ };
unsigned long sa_flags;
#else
unsigned int sa_flags;
- __sighandler_t sa_handler;
+ union {
+ __sighandler_t sa_handler;
+ __kthread_sighandler_t kthread_sa_handler;
+ };
#endif
#ifdef __ARCH_HAS_SA_RESTORER
__sigrestore_t sa_restorer;
@@ -280,7 +293,7 @@ struct ksignal {
extern int get_signal(struct ksignal *ksig);
extern void signal_setup_done(int failed, struct ksignal *ksig, int stepping);
extern void exit_signals(struct task_struct *tsk);
-extern void kernel_sigaction(int, __sighandler_t);
+extern void kthread_sigaction(int, __kthread_sighandler_t);

static inline void allow_signal(int sig)
{
@@ -289,12 +302,12 @@ static inline void allow_signal(int sig)
* know it'll be handled, so that they don't get converted to
* SIGKILL or just silently dropped.
*/
- kernel_sigaction(sig, (__force __sighandler_t)2);
+ kthread_sigaction(sig, (__kthread_sighandler_t)2);
}

static inline void disallow_signal(int sig)
{
- kernel_sigaction(sig, SIG_IGN);
+ kthread_sigaction(sig, KTHREAD_SIG_IGN);
}

extern struct kmem_cache *sighand_cachep;
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 2777f40a9c7b..d45966319bcc 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -273,7 +273,7 @@ static int wait_for_helper(void *data)
pid_t pid;

/* If SIGCLD is ignored sys_wait4 won't populate the status. */
- kernel_sigaction(SIGCHLD, SIG_DFL);
+ kthread_sigaction(SIGCHLD, KTHREAD_SIG_DFL);
pid = kernel_thread(____call_usermodehelper, sub_info, SIGCHLD);
if (pid < 0) {
sub_info->retval = pid;
diff --git a/kernel/signal.c b/kernel/signal.c
index d51c5ddd855c..ca6bdb411449 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3077,11 +3077,11 @@ COMPAT_SYSCALL_DEFINE4(rt_tgsigqueueinfo,
/*
* For kthreads only, must not be used if cloned with CLONE_SIGHAND
*/
-void kernel_sigaction(int sig, __sighandler_t action)
+void kthread_sigaction(int sig, __kthread_sighandler_t action)
{
spin_lock_irq(&current->sighand->siglock);
- current->sighand->action[sig - 1].sa.sa_handler = action;
- if (action == SIG_IGN) {
+ current->sighand->action[sig - 1].sa.kthread_sa_handler = action;
+ if (action == KTHREAD_SIG_IGN) {
sigset_t mask;

sigemptyset(&mask);
@@ -3093,7 +3093,7 @@ void kernel_sigaction(int sig, __sighandler_t action)
}
spin_unlock_irq(&current->sighand->siglock);
}
-EXPORT_SYMBOL(kernel_sigaction);
+EXPORT_SYMBOL(kthread_sigaction);

int do_sigaction(int sig, struct k_sigaction *act, struct k_sigaction *oact)
{
--
1.8.5.6


2015-06-05 15:01:50

by Petr Mladek

[permalink] [raw]
Subject: [RFC PATCH 01/18] kthread: Allow to call __kthread_create_on_node() with va_list args

kthread_create_on_node() implements a bunch of logic to create
the kthread. It is already called by kthread_create_on_cpu().

We are going to add a new API that will allow to standardize kthreads
and define safe points for termination, freezing, parking, and even
signal handling. It will want to call kthread_create_on_node()
with va_list args.

This patch does only a refactoring and does not modify the existing
behavior.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/kthread.c | 71 +++++++++++++++++++++++++++++++++-----------------------
1 file changed, 42 insertions(+), 29 deletions(-)

diff --git a/kernel/kthread.c b/kernel/kthread.c
index 10e489c448fe..fca7cd124512 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -242,32 +242,10 @@ static void create_kthread(struct kthread_create_info *create)
}
}

-/**
- * kthread_create_on_node - create a kthread.
- * @threadfn: the function to run until signal_pending(current).
- * @data: data ptr for @threadfn.
- * @node: memory node number.
- * @namefmt: printf-style name for the thread.
- *
- * Description: This helper function creates and names a kernel
- * thread. The thread will be stopped: use wake_up_process() to start
- * it. See also kthread_run().
- *
- * If thread is going to be bound on a particular cpu, give its node
- * in @node, to get NUMA affinity for kthread stack, or else give -1.
- * When woken, the thread will run @threadfn() with @data as its
- * argument. @threadfn() can either call do_exit() directly if it is a
- * standalone thread for which no one will call kthread_stop(), or
- * return when 'kthread_should_stop()' is true (which means
- * kthread_stop() has been called). The return value should be zero
- * or a negative error number; it will be passed to kthread_stop().
- *
- * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
- */
-struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
- void *data, int node,
- const char namefmt[],
- ...)
+static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
+ void *data, int node,
+ const char namefmt[],
+ va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
struct task_struct *task;
@@ -308,11 +286,8 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
task = create->result;
if (!IS_ERR(task)) {
static const struct sched_param param = { .sched_priority = 0 };
- va_list args;

- va_start(args, namefmt);
vsnprintf(task->comm, sizeof(task->comm), namefmt, args);
- va_end(args);
/*
* root may have changed our (kthreadd's) priority or CPU mask.
* The kernel thread should not inherit these properties.
@@ -323,6 +298,44 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
kfree(create);
return task;
}
+
+
+/**
+ * kthread_create_on_node - create a kthread.
+ * @threadfn: the function to run until signal_pending(current).
+ * @data: data ptr for @threadfn.
+ * @node: memory node number.
+ * @namefmt: printf-style name for the thread.
+ *
+ * Description: This helper function creates and names a kernel
+ * thread. The thread will be stopped: use wake_up_process() to start
+ * it. See also kthread_run().
+ *
+ * If thread is going to be bound on a particular cpu, give its node
+ * in @node, to get NUMA affinity for kthread stack, or else give -1.
+ * When woken, the thread will run @threadfn() with @data as its
+ * argument. @threadfn() can either call do_exit() directly if it is a
+ * standalone thread for which no one will call kthread_stop(), or
+ * return when 'kthread_should_stop()' is true (which means
+ * kthread_stop() has been called). The return value should be zero
+ * or a negative error number; it will be passed to kthread_stop().
+ *
+ * Returns a task_struct or ERR_PTR(-ENOMEM) or ERR_PTR(-EINTR).
+ */
+struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
+ void *data, int node,
+ const char namefmt[],
+ ...)
+{
+ struct task_struct *task;
+ va_list args;
+
+ va_start(args, namefmt);
+ task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
+ va_end(args);
+
+ return task;
+}
EXPORT_SYMBOL(kthread_create_on_node);

static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
--
1.8.5.6


2015-06-05 16:10:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Fri, Jun 05, 2015 at 05:01:08PM +0200, Petr Mladek wrote:
> Many kthreads go into an interruptible sleep when there is nothing
> to do. They should check if anyone did not requested the kthread
> to terminate, freeze, or park in the meantime. It is easy to do
> it a wrong way.

INTERRUPTIBLE is the wrong state to idle in for kthreads, use
TASK_IDLE.

---

commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf
Author: Peter Zijlstra <[email protected]>
Date: Fri May 8 14:23:45 2015 +0200

sched/wait: Introduce TASK_NOLOAD and TASK_IDLE

Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
all idle kthreads contribute to the loadavg is somewhat silly.

Now mostly this works OK, because kthreads have all their signals
masked. However there's a few sites where this is causing problems and
TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.

This patch adds TASK_NOLOAD which, when combined with
TASK_UNINTERRUPTIBLE avoids the loadavg accounting.

As most of imagined usage sites are loops where a thread wants to
idle, waiting for work, a helper TASK_IDLE is introduced.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Julian Anastasov <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: NeilBrown <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dd07ac03f82a..7de815c6fa78 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -218,9 +218,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
#define TASK_WAKEKILL 128
#define TASK_WAKING 256
#define TASK_PARKED 512
-#define TASK_STATE_MAX 1024
+#define TASK_NOLOAD 1024
+#define TASK_STATE_MAX 2048

-#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
+#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"

extern char ___assert_task_state[1 - 2*!!(
sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
@@ -230,6 +231,8 @@ extern char ___assert_task_state[1 - 2*!!(
#define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
#define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)

+#define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
+
/* Convenience macros for the sake of wake_up */
#define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
#define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
@@ -245,7 +248,8 @@ extern char ___assert_task_state[1 - 2*!!(
((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
#define task_contributes_to_load(task) \
((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
- (task->flags & PF_FROZEN) == 0)
+ (task->flags & PF_FROZEN) == 0 && \
+ (task->state & TASK_NOLOAD) == 0)

#ifdef CONFIG_DEBUG_ATOMIC_SLEEP

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 30fedaf3e56a..d57a575fe31f 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -147,7 +147,8 @@ TRACE_EVENT(sched_switch,
__print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
{ 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
{ 16, "Z" }, { 32, "X" }, { 64, "x" },
- { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
+ { 128, "K" }, { 256, "W" }, { 512, "P" },
+ { 1024, "N" }) : "R",
__entry->prev_state & TASK_STATE_MAX ? "+" : "",
__entry->next_comm, __entry->next_pid, __entry->next_prio)
);

2015-06-05 16:22:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

On Fri, Jun 05, 2015 at 05:00:59PM +0200, Petr Mladek wrote:
> Workqueue
>
>
> Workqueues are quite popular and many kthreads have already been
> converted into them.
>
> Work queues allow to split the function into even more pieces and
> reach the common check point more often. It is especially useful
> when a kthread handles more tasks and is woken when some work
> is needed. Then we could queue the appropriate work instead
> of waking the whole kthread and checking what exactly needs
> to be done.
>
> But there are many kthreads that need to cycle many times
> until some work is finished, e.g. khugepaged, virtio_balloon,
> jffs2_garbage_collect_thread. They would need to queue the
> work item repeatedly from the same work item or between
> more work items. It would be a strange semantic.
>
> Work queues allow to share the same kthread between more users.
> It helps to reduce the number of running kthreads. It is especially
> useful if you would need a kthread for each CPU.
>
> But this might also be a disadvantage. Just look into the output
> of the command "ps" and see the many [kworker*] processes. One
> might see this a black hole. If a kworker makes the system busy,
> it is less obvious what the problem is in compare with the old
> "simple" and dedicated kthreads.
>
> Yes, we could add some debugging tools for work queues but
> it would be another non-standard thing that developers and
> system administrators would need to understand.
>
> Another thing is that work queues have their own scheduler. If we
> move even more tasks there it might need even more love. Anyway,
> the extra scheduler adds another level of complexity when
> debugging problems.

There's a lot more problems with workqueues:

- they're not regular tasks and all the task controls don't work on
them. This means all things scheduler, like cpu-affinity, nice, and
RT/deadline scheduling policies. Instead there is some half baked
secondary interface for some of these.

But this also very much includes things like cgroups, which brings me
to the second point.

- its oblivious to cgroups (as it is to RT priority for example) both
leading to priority inversion. A work enqueued from a deep/limited
cgroup does not inherit the task's cgroup. Instead this work is ran
from the root cgroup.

This breaks cgroup isolation, more significantly so when a large part
of the actual work is done from workqueues (as some workloads end up
being). Instead of being able to control the work, it all ends up in
the root cgroup outside of control.



2015-06-06 21:17:49

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

On 06/05, Petr Mladek wrote:
>
> [*] In fact, there was a bug in the original code. It tried to process
> a non-existing signal when the system was freezing. See the common
> check for pending signal and freezing.

And another bug afaics:

> - case SIGSTOP:
> - jffs2_dbg(1, "%s(): SIGSTOP received\n",
> - __func__);
> - set_current_state(TASK_STOPPED);
> - schedule();
> - break;

This is obviously racy, we can miss SIGCONT.

Still I personally dislike the new kthread_sigaction() API. I agree,
a couple if signal helpers for kthreads make sense. Say,

void kthread_do_signal_stop(void)
{
spin_lock_irq(&curtent->sighand->siglock);
if (current->jobctl & JOBCTL_STOP_DEQUEUED)
__set_current_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);

schedule();
}

and probably even "int kthread_signal_deque(void)".

But personally I do not think kthread_do_signal() makes a lot of sense...

Oleg.


2015-06-06 21:32:44

by Jiri Kosina

[permalink] [raw]
Subject: Re: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

On Sat, 6 Jun 2015, Oleg Nesterov wrote:

> Still I personally dislike the new kthread_sigaction() API. I agree,
> a couple if signal helpers for kthreads make sense. Say,
>
> void kthread_do_signal_stop(void)
> {
> spin_lock_irq(&curtent->sighand->siglock);
> if (current->jobctl & JOBCTL_STOP_DEQUEUED)
> __set_current_state(TASK_STOPPED);
> spin_unlock_irq(&current->sighand->siglock);
>
> schedule();
> }

... not to mention the fact that 'STOP' keyword in relation to kthreads
has completely different meaning today, which just contributes to overall
confusion; but that's an independent story.

>
> and probably even "int kthread_signal_deque(void)".
>
> But personally I do not think kthread_do_signal() makes a lot of sense...

Would it be possible for you to elaborate a little bit more why you think
so ... ?

I personally don't see a huge principal difference between
"kthread_signal_dequeue() + kthread_do_signal_{stop,...}" vs. generic
"kthread_do_signal()" that's just basically completely general and takes
care of 'everything necessary'. That being said, my relationship to signal
handling code is of course much less intimate compared to yours, so I am
really curious what particular objections to that interface have.

Thanks a lot,

--
Jiri Kosina
SUSE Labs

2015-06-06 21:59:17

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

On 06/05, Petr Mladek wrote:
>
> The main question is how much it should follow POSIX and the signal
> handling of user space processes. On one hand, we want to be as close
> as possible.

Why? Let the kthread decide what it should if it gets, say, SIGSTOP.

> Finally, kthread_do_signal() is called on a safe place in the main
> iterant kthread cycle. Then we will not need any special code for
> signals when using this kthread API.

OK, I will not comment other parts of iterant API in this thread.

But as for signal handling, to me a single kthread_iterant->do_signal()
callback looks better. Rather than multiple callbacks passed as
->kthread_sa_handler.

That single callback can deque a signal and decide what it should do.

> + spin_lock_irqsave(&sighand->siglock, flags);
> +
> + if (unlikely(signal->flags & SIGNAL_CLD_MASK)) {
> + WARN(1, "there are no parents for kernel threads\n");
> + signal->flags &= ~SIGNAL_CLD_MASK;
> + }
> +
> + for (;;) {
> + struct k_sigaction *ka;
> +
> + signr = dequeue_signal(current, &current->blocked, &ksig.info);
> +
> + if (!signr)
> + break;
> +
> + ka = &sighand->action[signr-1];
> +
> + /* Do nothing for ignored signals */
> + if (ka->sa.kthread_sa_handler == KTHREAD_SIG_IGN)
> + continue;

Again, I agree something like the simple kthread_dequeue_signal() makes
sense. Say, to drop the ignore signal like this code does. Although I
do not think this is really important, SIG_IGN is only possible if this
kthread does something strange. Say, blocks/unblocs the ignored signal.

> +
> + /* Run the custom handler if any */
> + if (ka->sa.kthread_sa_handler != KTHREAD_SIG_DFL) {
> + ksig.ka = *ka;
> +
> + if (ka->sa.sa_flags & SA_ONESHOT)
> + ka->sa.kthread_sa_handler = KTHREAD_SIG_DFL;
> +
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> + /* could run directly for kthreads */
> + ksig.ka.sa.kthread_sa_handler(signr);
> + freezable_cond_resched();
> + goto relock;

Well. But for what? A simple "switch (signr)" after kthread_dequeue_signal()
can do the same. Or, speaking of kthread_iterant_fn() it can even dequeue the
signal and pass it to kti->whatever(signr).

> + if (sig_kernel_ignore(signr))
> + continue;

For what? Why a kthread should unignore (say) SIGWINCH if it is not going
to react?

> + if (sig_kernel_stop(signr)) {
> + __set_current_state(TASK_STOPPED);
> + spin_unlock_irqrestore(&sighand->siglock, flags);
> + /* Don't run again until woken by SIGCONT or SIGKILL */
> + freezable_schedule();
> + goto relock;

Yes this avoids the race with SIGCONT. But as I said we can add another
trivial helper which checks JOBCTL_STOP_DEQUEUED. So a kthread can do
this itself.

To me, SIG_DFL behaviour just makes makes no sense when it comes to
kthreads. I do not even think this can simplify the code. Unlike user-
space task, kthread can happily dequeue SIGSTOP, so why should we mimic
the userspace SIG_DFL logic.


> + /* Death signals, but try to terminate cleanly */
> + kthread_stop_current();
> + __flush_signals(current);
> + break;

The same.

Oleg.


2015-06-06 22:31:02

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

On 06/06, Jiri Kosina wrote:
>
> On Sat, 6 Jun 2015, Oleg Nesterov wrote:
>
> > Still I personally dislike the new kthread_sigaction() API. I agree,
> > a couple if signal helpers for kthreads make sense. Say,
> >
> > void kthread_do_signal_stop(void)
> > {
> > spin_lock_irq(&curtent->sighand->siglock);
> > if (current->jobctl & JOBCTL_STOP_DEQUEUED)
> > __set_current_state(TASK_STOPPED);
> > spin_unlock_irq(&current->sighand->siglock);
> >
> > schedule();
> > }
>
> ... not to mention the fact that 'STOP' keyword in relation to kthreads
> has completely different meaning today, which just contributes to overall
> confusion; but that's an independent story.

Yes, agreed.

> > But personally I do not think kthread_do_signal() makes a lot of sense...
>
> Would it be possible for you to elaborate a little bit more why you think
> so ... ?

Please see another email I sent in reply to 06/18.

> I personally don't see a huge principal difference between
> "kthread_signal_dequeue() + kthread_do_signal_{stop,...}" vs. generic
> "kthread_do_signal()" that's just basically completely general and takes
> care of 'everything necessary'.

Then why do we need the new API ?

And I do see the difference. Rightly or not I belive that this API buys
nothing but makes the kthread && signal interaction more complex and
confusing. For no reason.

But!

> That being said, my relationship to signal
> handling code is of course much less intimate compared to yours,

No, no, no, this doesn't matter at all ;)

Yes I do dislike this API. So what? I can be wrong. So if other reviewers
like it I will hate them all ^W^W^W not argure. So please comment. I never
trust myself unless I can technically (try to) prove I am right. In this
case I can't, this is only my feeling.

Oleg.


2015-06-06 22:44:21

by Jiri Kosina

[permalink] [raw]
Subject: Re: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

On Sun, 7 Jun 2015, Oleg Nesterov wrote:

> > > Still I personally dislike the new kthread_sigaction() API. I agree,
> > > a couple if signal helpers for kthreads make sense. Say,
> > >
> > > void kthread_do_signal_stop(void)
> > > {
> > > spin_lock_irq(&curtent->sighand->siglock);
> > > if (current->jobctl & JOBCTL_STOP_DEQUEUED)
> > > __set_current_state(TASK_STOPPED);
> > > spin_unlock_irq(&current->sighand->siglock);
> > >
> > > schedule();
> > > }
> >
> > ... not to mention the fact that 'STOP' keyword in relation to kthreads
> > has completely different meaning today, which just contributes to overall
> > confusion; but that's an independent story.
>
> Yes, agreed.

I BTW think this really needs to be fixed. We have kthread parking and
kthread stopping at least (and that doesn't include kthreads that actually
*do* handle SIGSTOP), and the naming is unfortunate and confusing.

> > > But personally I do not think kthread_do_signal() makes a lot of sense...
> >
> > Would it be possible for you to elaborate a little bit more why you think
> > so ... ?
>
> Please see another email I sent in reply to 06/18.

Yeah, let's just continue more detailed discussion there. I saw your reply
only after I answered to your original mail.

> > I personally don't see a huge principal difference between
> > "kthread_signal_dequeue() + kthread_do_signal_{stop,...}" vs. generic
> > "kthread_do_signal()" that's just basically completely general and
> > takes care of 'everything necessary'.
>
> Then why do we need the new API ?

Well, in a nutshell, because of the "it's general and takes care of
everything" part.

> And I do see the difference. Rightly or not I belive that this API buys
> nothing but makes the kthread && signal interaction more complex and
> confusing. For no reason.

Current situation with kthrads is a mess. Everyone and his grand-son needs
to put explicit try_to_freeze(), cond_resched() and whatever else checks
in his private kthreads main loop.

The fact that kthreads are more and more prone to making use of signal
handling makes this even more complicated, because everyone is reinventing
his own wheel for this.

It can't be really properly reviewed and - more importantly - it makes the
whole "how would this particular kthread react to this particular signal?"
very unpredictable and hard to answer question.

IMO Petr's patchset basically brings some kind order to this all -- it
"batches" the kthread executions to individual iterations of the main
loop, and allows to perform actions on the border of the iteration (such
as, but not limited to, handling of the signals). Signal handling is just
one of the piggy-backers on top of this general cleanup.

Thanks,

--
Jiri Kosina
SUSE Labs

2015-06-06 22:59:54

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 11/18] jffs2: Convert jffs2_gcd_mtd kthread into the iterant API

On 06/07, Jiri Kosina wrote:
>
> On Sun, 7 Jun 2015, Oleg Nesterov wrote:
>
> > > I personally don't see a huge principal difference between
> > > "kthread_signal_dequeue() + kthread_do_signal_{stop,...}" vs. generic
> > > "kthread_do_signal()" that's just basically completely general and
> > > takes care of 'everything necessary'.
> >
> > Then why do we need the new API ?
>
> Well, in a nutshell, because of the "it's general and takes care of
> everything" part.

...

> Signal handling is just
> one of the piggy-backers on top of this general cleanup.

And to avoid the confusion: so far I only argued with the signal
handling part of this API. Namely with kthread_do_signal(), especially
with the SIG_DFL logic.

If we want somthing like kthread_iterant agree it should probably help to
handle the signals too. But afaics kthread_do_signal() doesn't really help
and certainly it is not strictly necessary.

Oleg.


2015-06-08 10:01:14

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Fri 2015-06-05 18:10:21, Peter Zijlstra wrote:
> On Fri, Jun 05, 2015 at 05:01:08PM +0200, Petr Mladek wrote:
> > Many kthreads go into an interruptible sleep when there is nothing
> > to do. They should check if anyone did not requested the kthread
> > to terminate, freeze, or park in the meantime. It is easy to do
> > it a wrong way.
>
> INTERRUPTIBLE is the wrong state to idle in for kthreads, use
> TASK_IDLE.
>
> ---
>
> commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf
> Author: Peter Zijlstra <[email protected]>
> Date: Fri May 8 14:23:45 2015 +0200
>
> sched/wait: Introduce TASK_NOLOAD and TASK_IDLE
>
> Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> all idle kthreads contribute to the loadavg is somewhat silly.
>
> Now mostly this works OK, because kthreads have all their signals
> masked. However there's a few sites where this is causing problems and
> TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
>
> This patch adds TASK_NOLOAD which, when combined with
> TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
>
> As most of imagined usage sites are loops where a thread wants to
> idle, waiting for work, a helper TASK_IDLE is introduced.

Just to be sure. Do you suggest to use TASK_IDLE everywhere in
kthreads or only when the uninterruptible sleep is really needed?

IMHO, we should not use TASK_IDLE in freezable kthreads because
it would break freezing. Well, we could freezable_schedule() but only
on locations where it is safe to get freezed. Anyway, we need to
be careful here.

BTW: What is the preferred way of freezing, please? Is it better
to end up in the fridge or is it fine to call freezer_do_not_count();
or set PF_NOFREEZE when it is safe?

The fridge looks more clean to me but in this case we should avoid
uninterruptible sleep as much as possible.


Best Regards,
Petr

> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Cc: Julian Anastasov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: NeilBrown <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dd07ac03f82a..7de815c6fa78 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -218,9 +218,10 @@ print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq);
> #define TASK_WAKEKILL 128
> #define TASK_WAKING 256
> #define TASK_PARKED 512
> -#define TASK_STATE_MAX 1024
> +#define TASK_NOLOAD 1024
> +#define TASK_STATE_MAX 2048
>
> -#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWP"
> +#define TASK_STATE_TO_CHAR_STR "RSDTtXZxKWPN"
>
> extern char ___assert_task_state[1 - 2*!!(
> sizeof(TASK_STATE_TO_CHAR_STR)-1 != ilog2(TASK_STATE_MAX)+1)];
> @@ -230,6 +231,8 @@ extern char ___assert_task_state[1 - 2*!!(
> #define TASK_STOPPED (TASK_WAKEKILL | __TASK_STOPPED)
> #define TASK_TRACED (TASK_WAKEKILL | __TASK_TRACED)
>
> +#define TASK_IDLE (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
> +
> /* Convenience macros for the sake of wake_up */
> #define TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)
> #define TASK_ALL (TASK_NORMAL | __TASK_STOPPED | __TASK_TRACED)
> @@ -245,7 +248,8 @@ extern char ___assert_task_state[1 - 2*!!(
> ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
> #define task_contributes_to_load(task) \
> ((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
> - (task->flags & PF_FROZEN) == 0)
> + (task->flags & PF_FROZEN) == 0 && \
> + (task->state & TASK_NOLOAD) == 0)
>
> #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index 30fedaf3e56a..d57a575fe31f 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -147,7 +147,8 @@ TRACE_EVENT(sched_switch,
> __print_flags(__entry->prev_state & (TASK_STATE_MAX-1), "|",
> { 1, "S"} , { 2, "D" }, { 4, "T" }, { 8, "t" },
> { 16, "Z" }, { 32, "X" }, { 64, "x" },
> - { 128, "K" }, { 256, "W" }, { 512, "P" }) : "R",
> + { 128, "K" }, { 256, "W" }, { 512, "P" },
> + { 1024, "N" }) : "R",
> __entry->prev_state & TASK_STATE_MAX ? "+" : "",
> __entry->next_comm, __entry->next_pid, __entry->next_prio)
> );

2015-06-08 11:40:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Mon, 2015-06-08 at 12:01 +0200, Petr Mladek wrote:

> Just to be sure. Do you suggest to use TASK_IDLE everywhere in
> kthreads or only when the uninterruptible sleep is really needed?

Always, only use INTERRUPTIBLE when you're actually interruptible, that
is you want signals or such muck to terminate your wait.

> IMHO, we should not use TASK_IDLE in freezable kthreads because
> it would break freezing.

How so? The task is IDLE, its not doing anything.

> Well, we could freezable_schedule() but only
> on locations where it is safe to get freezed. Anyway, we need to
> be careful here.

s/freezed/frozen/

Bah, you made me look at the freezer code, karma reduction for you.

And this is the arch typical freeze point if ever there was one, you're
checking kthread_stop, if we can terminate the kthread, we can certainly
get frozen.

> BTW: What is the preferred way of freezing, please? Is it better
> to end up in the fridge or is it fine to call freezer_do_not_count();
> or set PF_NOFREEZE when it is safe?

freezable_schedule() is fine in this case.

2015-06-08 13:51:11

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

On Sat 2015-06-06 23:58:16, Oleg Nesterov wrote:
> On 06/05, Petr Mladek wrote:
> >
> > The main question is how much it should follow POSIX and the signal
> > handling of user space processes. On one hand, we want to be as close
> > as possible.
>
> Why? Let the kthread decide what it should if it gets, say, SIGSTOP.

I agree that it does not make sense to follow POSIX too much for
kthreads. They are special. Signals can be send from userspace and
userspace should not be able to affect the core kernel services, for
example, to stop or kill some kthreas.

On the other hand, I was surprised how signals were handled in the few
kthreads. It looked like a mess to me. For example, lockd dropped locks
when it got SIGKILL, r8712_cmd_thread() allowed SIGTERM and then just
flushed the signal.

To be honest, this patch set does _not_ make any big change. All
signals are still ignored by default. There is default action
only for SIGSTOP and fatal signals.

On the other hand, it should motivate developers to use the signals
a reasonable way. And if nothing, it will help to handle the signals
correctly without races.


> > Finally, kthread_do_signal() is called on a safe place in the main
> > iterant kthread cycle. Then we will not need any special code for
> > signals when using this kthread API.
>
> OK, I will not comment other parts of iterant API in this thread.
>
> But as for signal handling, to me a single kthread_iterant->do_signal()
> callback looks better. Rather than multiple callbacks passed as
> ->kthread_sa_handler.

I think that we should make it independent on the iterant kthread API.
I thought about handling signals also in the kthread worker and
workqueues.

One of the side plans here is to make freezing more reliable. Freezer
already pokes normal processes using a fake signal. The same technique
might be usable also for kthreads. A signal could wake them and
navigate to a safe place (fridge).

Another example is kthread_stop(). It sets some flag, wakes the
kthread, and waits for completion. IMHO, it would be more elegant to
send a signal that would wake the kthread and push it to the safe
place where signals might be handled and where the kthread might
get terminated.

The advantage is that a pending signal will not allow to sleep on
a location that is not safe for the given action (freezing, stopping)
and might skip even more sleeps on the way.


> That single callback can deque a signal and decide what it should do.
>
> > + spin_lock_irqsave(&sighand->siglock, flags);
> > +
> > + if (unlikely(signal->flags & SIGNAL_CLD_MASK)) {
> > + WARN(1, "there are no parents for kernel threads\n");
> > + signal->flags &= ~SIGNAL_CLD_MASK;
> > + }
> > +
> > + for (;;) {
> > + struct k_sigaction *ka;
> > +
> > + signr = dequeue_signal(current, &current->blocked, &ksig.info);
> > +
> > + if (!signr)
> > + break;
> > +
> > + ka = &sighand->action[signr-1];
> > +
> > + /* Do nothing for ignored signals */
> > + if (ka->sa.kthread_sa_handler == KTHREAD_SIG_IGN)
> > + continue;
>
> Again, I agree something like the simple kthread_dequeue_signal() makes
> sense. Say, to drop the ignore signal like this code does. Although I
> do not think this is really important, SIG_IGN is only possible if this
> kthread does something strange. Say, blocks/unblocs the ignored signal.

Yup, I can't find any usecase for this at the moment. I do not resist
on SIG_IGN handling.

I did this only for completness. I thought that any similarity with
the normal userspace handling would be useful: the-least-surprise
approach.

Well, note that allow_signal() sets some "crazy" value (2) for the
signal handler. IMHO, we should check for these values and handle
them reasonably even in kthreads. It will make the code more secure.


> > +
> > + /* Run the custom handler if any */
> > + if (ka->sa.kthread_sa_handler != KTHREAD_SIG_DFL) {
> > + ksig.ka = *ka;
> > +
> > + if (ka->sa.sa_flags & SA_ONESHOT)
> > + ka->sa.kthread_sa_handler = KTHREAD_SIG_DFL;
> > +
> > + spin_unlock_irqrestore(&sighand->siglock, flags);
> > + /* could run directly for kthreads */
> > + ksig.ka.sa.kthread_sa_handler(signr);
> > + freezable_cond_resched();
> > + goto relock;
>
> Well. But for what? A simple "switch (signr)" after kthread_dequeue_signal()
> can do the same. Or, speaking of kthread_iterant_fn() it can even dequeue the
> signal and pass it to kti->whatever(signr).

I wanted to make it independent on the iterant API. Also if you want to
handle more signals, you need even more code, e.g. the cycle,
cond_resched(). So, I think that some generic helper is useful.



> > + if (sig_kernel_ignore(signr))
> > + continue;
>
> For what? Why a kthread should unignore (say) SIGWINCH if it is not going
> to react?

I do not resist on this.


> > + if (sig_kernel_stop(signr)) {
> > + __set_current_state(TASK_STOPPED);
> > + spin_unlock_irqrestore(&sighand->siglock, flags);
> > + /* Don't run again until woken by SIGCONT or SIGKILL */
> > + freezable_schedule();
> > + goto relock;
>
> Yes this avoids the race with SIGCONT. But as I said we can add another
> trivial helper which checks JOBCTL_STOP_DEQUEUED. So a kthread can do
> this itself.

Hmm, the helper would have a strange semantic. You need to take
sighand->siglock, dequeue the signal (SIGSTOP), and call
__set_current_state(TASK_STOPPED) before you release the lock.
But what would happen if the dequeued signal is _not_ SIGSTOP?

I think that we should support only the standard handling of
SIGSTOP. It is closely related with SIGCONT. Both are manipulated
in prepare_signal(). I think that it is nice to share the code here
and dangerous to use it othrewise.


> To me, SIG_DFL behaviour just makes makes no sense when it comes to
> kthreads. I do not even think this can simplify the code. Unlike user-
> space task, kthread can happily dequeue SIGSTOP, so why should we mimic
> the userspace SIG_DFL logic.

Maybe, we should handle only SIGSTOP here and do not mimic the
standard behavior for the other sig_kernel_stop(),
sig_kernel_ignore(), and death signals here. I think
about printing some warning in case that the handler
is not defined.

Best Regards,
Petr


> > + /* Death signals, but try to terminate cleanly */
> > + kthread_stop_current();
> > + __flush_signals(current);
> > + break;
>
> The same.
>
> Oleg.
>

2015-06-08 17:53:56

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 15/18] ring_buffer: Allow to exit the ring buffer benchmark immediately

On Fri, 5 Jun 2015 17:01:14 +0200
Petr Mladek <[email protected]> wrote:

> It takes a while until the ring_buffer_benchmark module is removed
> when the ring buffer hammer is running. It is because it takes
> few seconds and kthread_should_terminate() is not being checked.
>
> This patch adds the check for kthread termination into the producer.
> It uses the existing kill_test flag to finish the kthreads as
> cleanly as possible.
>
> It disables printing the "ERROR" message when the kthread is going.
>
> Also it makes sure that producer does not go into the 10sec sleep
> when it is being killed.

This patch looks like something I may take regardless of the other
patches (if it applies cleanly).

As for the other patches, the ring buffer benchmark is just that, a
benchmark that I use when making changes to the ring buffer. It's not
something for production systems.

What about just adding a depend on !LIVE_PATCHING to
RING_BUFFER_BENCHMARK, or force it to shut down during patching.
There's no reason to make it safe to be running when you patch the
kernel. Just adds complexity to some simple code.

-- Steve

2015-06-08 17:55:23

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Fri, 5 Jun 2015 18:10:21 +0200
Peter Zijlstra <[email protected]> wrote:

> On Fri, Jun 05, 2015 at 05:01:08PM +0200, Petr Mladek wrote:
> > Many kthreads go into an interruptible sleep when there is nothing
> > to do. They should check if anyone did not requested the kthread
> > to terminate, freeze, or park in the meantime. It is easy to do
> > it a wrong way.
>
> INTERRUPTIBLE is the wrong state to idle in for kthreads, use
> TASK_IDLE.
>
> ---
>
> commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf
> Author: Peter Zijlstra <[email protected]>
> Date: Fri May 8 14:23:45 2015 +0200
>
> sched/wait: Introduce TASK_NOLOAD and TASK_IDLE
>
> Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> all idle kthreads contribute to the loadavg is somewhat silly.

Not to mention, tasks in TASK_UNINTERRUPTIBLE state for too long will
trigger hung task detection.


>
> Now mostly this works OK, because kthreads have all their signals
> masked. However there's a few sites where this is causing problems and
> TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
>
> This patch adds TASK_NOLOAD which, when combined with
> TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
>
> As most of imagined usage sites are loops where a thread wants to
> idle, waiting for work, a helper TASK_IDLE is introduced.
>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>

Acked-by: Steven Rostedt <[email protected]>

-- Steve

> Cc: Julian Anastasov <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: NeilBrown <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h

2015-06-08 21:14:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Let me first repeat that I agree that everything is subjective ;)

On 06/08, Petr Mladek wrote:
>
> To be honest, this patch set does _not_ make any big change.

But to me it does because (again, imo) it adds the a) unnecessary
and b) wrong interface.

But yes, yes, I agree that most (all?) of kthread/signal (ab)users
need cleanups. And fixes.

> I think that we should make it independent on the iterant kthread API.

Yes! please. Then we can discuss this again and perhaps reconsider
this API.

So I am going to ignore some parts of your email. I am sleeping,
please let me know if I missed something important ;)

> Well, note that allow_signal() sets some "crazy" value (2) for the
> signal handler. IMHO, we should check for these values and handle
> them reasonably even in kthreads. It will make the code more secure.

Not sure I understand. The crazy "2" value just means that kthread
wants to recieve and dequeue this signal. I agree with the good name
for this hard-coded number in advance.

> > > +
> > > + /* Run the custom handler if any */
> > > + if (ka->sa.kthread_sa_handler != KTHREAD_SIG_DFL) {
> > > + ksig.ka = *ka;
> > > +
> > > + if (ka->sa.sa_flags & SA_ONESHOT)
> > > + ka->sa.kthread_sa_handler = KTHREAD_SIG_DFL;
> > > +
> > > + spin_unlock_irqrestore(&sighand->siglock, flags);
> > > + /* could run directly for kthreads */
> > > + ksig.ka.sa.kthread_sa_handler(signr);
> > > + freezable_cond_resched();
> > > + goto relock;
> >
> > Well. But for what? A simple "switch (signr)" after kthread_dequeue_signal()
> > can do the same. Or, speaking of kthread_iterant_fn() it can even dequeue the
> > signal and pass it to kti->whatever(signr).
>
> I wanted to make it independent on the iterant API. Also if you want to
> handle more signals, you need even more code, e.g. the cycle,
> cond_resched(). So, I think that some generic helper is useful.

I do not. Contrary, I think this needs more code in the likely case.
Anyway, this API won't have too many users, so I don't even this this
is that important.

> > > + if (sig_kernel_stop(signr)) {
> > > + __set_current_state(TASK_STOPPED);
> > > + spin_unlock_irqrestore(&sighand->siglock, flags);
> > > + /* Don't run again until woken by SIGCONT or SIGKILL */
> > > + freezable_schedule();
> > > + goto relock;
> >
> > Yes this avoids the race with SIGCONT. But as I said we can add another
> > trivial helper which checks JOBCTL_STOP_DEQUEUED. So a kthread can do
> > this itself.
>
> Hmm, the helper would have a strange semantic. You need to take
> sighand->siglock, dequeue the signal (SIGSTOP), and call
> __set_current_state(TASK_STOPPED) before you release the lock.
> But what would happen if the dequeued signal is _not_ SIGSTOP?

Perhaps I missed your point, but no. If you want to handle SIGSTOP
you can do

signr = kthread_signal_dequeue();
switch (signr) {
case SIGSTOP:
something_else();
kthread_do_signal_stop();
...
}


> I think that we should support only the standard handling of
> SIGSTOP. It is closely related with SIGCONT.

Agreed. If kthread wants to actually sleep in TASK_STOPPED state then
it should know about SIGCONT.

> > To me, SIG_DFL behaviour just makes makes no sense when it comes to
> > kthreads. I do not even think this can simplify the code. Unlike user-
> > space task, kthread can happily dequeue SIGSTOP, so why should we mimic
> > the userspace SIG_DFL logic.
>
> Maybe, we should handle only SIGSTOP

So far I even disagree with SIGSTOP "default" semantics. I simply see
no value.

Oleg.


2015-06-09 06:10:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

Hello,

On Fri, Jun 05, 2015 at 05:00:59PM +0200, Petr Mladek wrote:
> Workqueue
...
> But there are many kthreads that need to cycle many times
> until some work is finished, e.g. khugepaged, virtio_balloon,
> jffs2_garbage_collect_thread. They would need to queue the
> work item repeatedly from the same work item or between
> more work items. It would be a strange semantic.

Hmm... took a look at balloon and jffs2_garbage_collect_thread and I'm
not not sure about this part of argument. What's wrong with doing a
single round and requeueing if more rounds are necessary? There are
quite a few which already behave that way. It sure adds queueing /
dispatch overhead but I'm skeptical that's gonna be anything
noticeable in vast majority of cases.

> Work queues allow to share the same kthread between more users.
> It helps to reduce the number of running kthreads. It is especially
> useful if you would need a kthread for each CPU.

Unbound workqueues also do NUMA node affinities automatically which
usually is a pretty big win for jobs which process lots of data.
Managing concurrency level for these cases is still cumbersome but
there have been some ideas making that part mostly automatic too.

> But this might also be a disadvantage. Just look into the output
> of the command "ps" and see the many [kworker*] processes. One
> might see this a black hole. If a kworker makes the system busy,
> it is less obvious what the problem is in compare with the old
> "simple" and dedicated kthreads.
>
> Yes, we could add some debugging tools for work queues but
> it would be another non-standard thing that developers and
> system administrators would need to understand.

Yeah, that's an almost inherent drawback of pooling anything. It'd be
nice to improve visibility into what workers are doing. sysrq-t has
recently been extended to help debugging but it'd nice to have
visibility into who's consuming how much.

> Another thing is that work queues have their own scheduler. If we
> move even more tasks there it might need even more love. Anyway,
> the extra scheduler adds another level of complexity when
> debugging problems.

That's only true for the concurrency managed percpu workqueues which
shouldn't be used for anything CPU intensive anyway. I don't see that
getting much more sophiscated than now. For unbound workqueues, once
a work item starts executing, the worker thread is under the control
of the scheduler.

> kthread_worker
>
>
> kthread_worker is similar to workqueues in many ways. You need to
>
> + define work functions
> + define and initialize work structs
> + queue work items (structs pointing to the functions and data)
>
> We could repeat the paragraphs about splitting the work
> and sharing the kthread between more users here.
>
> Well, the kthread_worker implementation is much simpler than
> the one for workqueues. It is more similar to a simple
> kthread. Especially, it uses the system scheduler.
> But it is still more complex that the simple kthread.
>
> One interesting thing is that kthread_workers add internal work
> items into the queue. They typically use a completion. An example
> is the flush work. see flush_kthread_work(). It is a nice trick
> but you need to be careful. For example, if you would want to
> terminate the kthread, you might want to remove some work item
> from the queue, especially if you need to break a work item that
> is called in a cycle (queues itself). The question is what to do
> with the internal tasks. If you keep them, they might wake up
> sleepers when the work was not really completed. If you remove
> them, the counter part might sleep forever.

You can pretty much align this part of interface with regular
workqueues which has cancel_work_sync(). The only reason it doesn't
have that is because nobody needed it yet.

That said, there sure are cases where dedicated kthreads are necessary
and kthread_worker was added to help those cases to be more
structured. I think it's nice to keep the API aligned with workqueue
so that users can easily be converted back and forth but if something
which is more aligned to the current kthread usages eases replacing
the original kthread API completely, I'm all for that.

Also, if this one takes off, I don't think we have any need for
kthread_worker. It should be easy to replace the few kthread_worker
usages to the new API.

Thanks.

--
tejun

2015-06-09 06:14:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

Hey, Peter.

On Fri, Jun 05, 2015 at 06:22:16PM +0200, Peter Zijlstra wrote:
> There's a lot more problems with workqueues:
>
> - they're not regular tasks and all the task controls don't work on
> them. This means all things scheduler, like cpu-affinity, nice, and
> RT/deadline scheduling policies. Instead there is some half baked
> secondary interface for some of these.

Because there's a pool of them and the workers come and go
dynamically. There's no way around it. The attributes just have to
be per-pool.

> But this also very much includes things like cgroups, which brings me
> to the second point.
>
> - its oblivious to cgroups (as it is to RT priority for example) both
> leading to priority inversion. A work enqueued from a deep/limited
> cgroup does not inherit the task's cgroup. Instead this work is ran
> from the root cgroup.
>
> This breaks cgroup isolation, more significantly so when a large part
> of the actual work is done from workqueues (as some workloads end up
> being). Instead of being able to control the work, it all ends up in
> the root cgroup outside of control.

cgroup support will surely be added but I'm not sure we can or should
do inheritance automatically. Using a different API doesn't solve the
problem automatically either. A lot of kthreads are shared
system-wide after all. We'll need an abstraction layer to deal with
that no matter where we do it.

Thanks.

--
tejun

2015-06-09 06:24:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 02/18] kthread: Add API for iterant kthreads

Hello, Petr.

On Fri, Jun 05, 2015 at 05:01:01PM +0200, Petr Mladek wrote:
> +static int kthread_iterant_fn(void *kti_ptr)
> +{
> + struct kthread_iterant *kti = kti_ptr;
> + void *data = kti->data;
> +
> + if (kti->init)
> + kti->init(data);
> +
> + do {
> + if (kti->func)
> + kti->func(data);

Is supporting kthread_iterant w/o the body function intentional? If
so, did you have anything specific on mind for it? I don't think it
matters either way. Just curious how this came to be.

Thanks.

--
tejun

2015-06-09 07:10:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Hello, Petr.

On Fri, Jun 05, 2015 at 05:01:05PM +0200, Petr Mladek wrote:
> Several kthreads already handle signals some way. They do so
> in different ways, search for allow_signal().
>
> This patch allows to handle the signals in a more uniform way using
> kthread_do_signal().
>
> The main question is how much it should follow POSIX and the signal
> handling of user space processes. On one hand, we want to be as close
> as possible. On the other hand, we need to be very careful. All signals
> were ignored until now, except for the few kthreads that somehow
> handled them.
>
> POSIX behavior might cause some surprises and crazy bug reports.
> For example, imagine a home user stopping or killing kswapd because
> it makes the system too busy. It is like shooting in its own leg.
>
> Also the signals are already used a strange way. For example, klockd
> drops all locks when it gets SIGKILL. This was added before initial
> git commit, so the archaeology was not easy. The most likely explanation
> is that it is used during the system shutdown. It would make sense
> to drop all locks when user space processes are killed and before
> filesystems get unmounted.
>
> Now, the patch does the following compromise:
>
> + defines default actions for SIGSTOP, SIGCONT, and fatal signals
> + custom actions can be defined by kthread_sigaction()
> + all signals are still ignored by default
>
> Well, fatal signals do not terminate the kthread immediately. They just
> set a flag to terminate the kthread, flush all other pending signals,
> and return to the main kthread cycle. The kthread is supposed to check
> the flag, leave the main cycle, do some cleanup actions when necessary
> and return from the kthread.
>
> Note that kthreads are running in the kernel address space. It makes
> the life much easier. The signal handler can be called directly and
> we do not need any arch-specific code to process it in the userspace.
> Also we do not care about ptrace.
>
> Finally, kthread_do_signal() is called on a safe place in the main
> iterant kthread cycle. Then we will not need any special code for
> signals when using this kthread API.
>
> This change does not affect any existing kthread. The plan is to
> use them only in safe kthreads that can be converted into
> the iterant kthread API.

While I agree that it'd be great to consolidate the existing kthread
signal users, I feel quite uncomfortable with doing full-fledged
signal handling for kthreads which try to mimic the userland behavior.
Most of the default behaviors don't make sense for kthreads and it'd
be awful to have different kthreads interpreting arbitrary signals for
differing custom behaviors, which often happens when there's no
default.

While we do have several kthread signal users, they aren't too many
and majority of them use it to allow userland to tell it to shutdown
and there seem to be a couple which use HUP/USR1 to cancel whatever
it's processing / waiting on at the moment. Hmmm... jffs uses
STOP/CONT too.

I don't know how this should be done but let's please try to

1. Encourage uniform behaviors across the signals.

2. Ultimately discourage usage of signals on kthreads as this closely
ties implementation detail (use of single kthread) to the userland
visible interface in a way where we can't easily get back out of.
For example, what if jffs needs its gc workers to be multi-threaded
and per-NUMA for high-iops devices later on?

Thanks.

--
tejun

2015-06-09 07:20:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

Hello, Petr.

On Fri, Jun 05, 2015 at 05:01:06PM +0200, Petr Mladek wrote:
> Many kthreads already calls set_freezable() before they enter the main
> cycle. One of the reasons for creating iterant kthreads is to create
> a safe point for freezing and make even more kthreads properly
> freezable. Therefore it would make sense to set all iterant
> kthreads freezable by default.

Actually, for most cases, making kthreads freezable is unnecessary and
often indicative of something going wrong. This is a crude mechanism
which goes along the line of "if all threads are stopped, the machine
should be safe to be put into whatever state", which isn't true at all
as there usually are a lot of stuff going on asynchronously especially
when interacting with hardware.

In most cases, we want to implement proper power management callbacks
which plug new issuance of whatever work-unit the code is dealing with
and drain in-flight ones. Whether the worker threads are frozen or
not doesn't matter once that's implemented.

It seems that people have been marking kthreads freezable w/o really
thinking about it - some of them are subtly broken due to missing
drainage of in-flight things while others simply don't need freezing
for correctness.

We do want to clean up freezer usage in the kernel but definitely do
not want to make kthreads freezable by default especially given that
the freezer essentially is one giant lockdep-less system-wide lock.

> However some kthreads might be hard to make properly freezable.
> For example, if they do non-interruptible sleeps. They would
> need to explicitly clear PF_NOFREEZE flag in the init() call.
> But it should be avoided whenever possible.

So, big no here.

Thanks.

--
tejun

2015-06-09 07:33:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

Hello,

On Mon, Jun 08, 2015 at 12:01:07PM +0200, Petr Mladek wrote:
> BTW: What is the preferred way of freezing, please? Is it better
> to end up in the fridge or is it fine to call freezer_do_not_count();
> or set PF_NOFREEZE when it is safe?

There's no one good answer. The closest would be "don't use freezer
on kthreads". As Peter said, exit points are always safe freezing
points and it's generally a good idea to avoid adding one anywhere
else.

Thanks.

--
tejun

2015-06-09 07:58:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

Hello, Petr.

I've skimmed through the patchset and I'm not quite sure.
kthread_iterant seems to map almost one to one to kthread_worker
interface. One calls a predefined callback repeatedly while the other
queues work items which contain callback. One does nasty plumbing
tasks inbetween interations, the other does inbetween work items. One
has sleep helper to sleep "safely", the other can use delayed work
items and flushing cancel. In fact, I'm pretty sure it'd be trivial
to convert between the two sets of API.

If so, is it really worthwhile to introduce the new API?
kthread_iterant is closer to raw kthread but not quite. It shouldn't
be difficult to apply the bulk of kthread_iterant's features to
kthread_worker. Wouldn't it be more beneficial to have async
execution mechanisms more closely aligned?

Thanks a lot.

--
tejun

2015-06-09 12:15:28

by Jiri Kosina

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

On Tue, 9 Jun 2015, Tejun Heo wrote:

> While I agree that it'd be great to consolidate the existing kthread
> signal users, I feel quite uncomfortable with doing full-fledged
> signal handling for kthreads which try to mimic the userland behavior.
> Most of the default behaviors don't make sense for kthreads and it'd
> be awful to have different kthreads interpreting arbitrary signals for
> differing custom behaviors, which often happens when there's no
> default.

I don't think the ultimate goal is to mimic what userspace does;
especially when it comes to default actions.

To me, the ultimate goal (*) is to make it possible for kthread to be able
to decide whether it wants "some kind of default behavior" (however that'd
be defined), or "ignore all", or "just handle this set of signals with
these handlers", and make API for this that would avoid every kthread
implementing its own complete signal handling.

(*) Well, the ultimate goal really is to bring sanity to how kthreads are
handling their main loop. Trying to bring some consistency to how
kthreads are handling signals is just an (optional) added value on
top of that.

> While we do have several kthread signal users, they aren't too many and
> majority of them use it to allow userland to tell it to shutdown and

Yeah. Via SIGKILL. Or SIGTERM. Or SIGINT. Or SIGQUIT. Not really
consistent either.

> there seem to be a couple which use HUP/USR1 to cancel whatever it's
> processing / waiting on at the moment. Hmmm... jffs uses STOP/CONT too.

> I don't know how this should be done but let's please try to
>
> 1. Encourage uniform behaviors across the signals.

Fully agreed.

> 2. Ultimately discourage usage of signals on kthreads as this closely
> ties implementation detail (use of single kthread) to the userland
> visible interface in a way where we can't easily get back out of.
> For example, what if jffs needs its gc workers to be multi-threaded
> and per-NUMA for high-iops devices later on?

What kind of multi-threading kthreads are you referring to here? Something
more sophisticated than simply spawning several per-CPU (or
per-whatever-resource) full-fledged kthreads?

--
Jiri Kosina
SUSE Labs

2015-06-09 15:25:31

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Mon 2015-06-08 13:39:55, Peter Zijlstra wrote:
> On Mon, 2015-06-08 at 12:01 +0200, Petr Mladek wrote:
>
> > Just to be sure. Do you suggest to use TASK_IDLE everywhere in
> > kthreads or only when the uninterruptible sleep is really needed?
>
> Always, only use INTERRUPTIBLE when you're actually interruptible, that
> is you want signals or such muck to terminate your wait.

I like that TASK_IDLE clearly describes that the state of the task.
I am just curious. Is there any particular advantage of using
uninterruptible sleep over the interruptible one, please?

I ask because making freezable kthreads is quite tricky. You need to
call try_to_freeze() after each schedule or call freezable_* variants
of schedule(). I think that it is easy to make a mistake. I wonder if
it might be more elegant to use interruptible sleep whenever possible,
send the fake signal also to kthreads and force them moving into some
location where the freeze is safe and handled.


> > IMHO, we should not use TASK_IDLE in freezable kthreads because
> > it would break freezing.
>
> How so? The task is IDLE, its not doing anything.

Well, it might cause the freeze. I have just double checked this
with ubi_thread(). It calls set_freezable().

I did the following change:

diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
index 16214d3d57a4..d528fa5e93ba 100644
--- a/drivers/mtd/ubi/wl.c
+++ b/drivers/mtd/ubi/wl.c
@@ -1428,7 +1428,7 @@ int ubi_thread(void *u)
spin_lock(&ubi->wl_lock);
if (list_empty(&ubi->works) || ubi->ro_mode ||
!ubi->thread_enabled || ubi_dbg_is_bgt_disabled(ubi)) {
- set_current_state(TASK_INTERRUPTIBLE);
+ set_current_state(TASK_IDLE);
spin_unlock(&ubi->wl_lock);
schedule();
continue;


enabled this stuff:

$> grep UBI .config
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_CUBIC=y
CONFIG_MTD_UBI=y
CONFIG_MTD_UBI_WL_THRESHOLD=4096
CONFIG_MTD_UBI_BEB_LIMIT=20
# CONFIG_MTD_UBI_FASTMAP is not set
CONFIG_MTD_UBI_GLUEBI=y
CONFIG_MTD_UBI_BLOCK=y
# CONFIG_JFFS2_RUBIN is not set
CONFIG_UBIFS_FS=y
# CONFIG_UBIFS_FS_ADVANCED_COMPR is not set
CONFIG_UBIFS_FS_LZO=y
CONFIG_UBIFS_FS_ZLIB=y
# CONFIG_CRYPTO_ANUBIS is not set

called on the running system:

$> ubiattach /dev/ubi_ctrl -m 0

to launch "ubi_bgt0d"

and it started to block freezer:

$> echo freezer >/sys/power/pm_test
$> echo reboot >/sys/power/disk
$> echo disk >/sys/power/state
-bash: echo: write error: Device or resource busy

$dmesg
[ 658.874518] Freezing of tasks failed after 20.008 seconds (1 tasks refusing to freeze, wq_busy=0):
[ 658.874527] ubi_bgt0d D ffff880136e2be38 0 3107 2 0x00000000
[ 658.874530] ffff880136e2be38 ffff88013a456410 ffff880133808210 ffff8800b3f5cd00
[ 658.874532] ffff880136e2c000 ffff880133808210 ffff8800ba1fd15c ffff8800ba1fd1d8
[ 658.874533] ffff8800ba1fd1fc ffff880136e2be58 ffffffff81905737 ffff8800ba1fd15c
[ 658.874535] Call Trace:
[ 658.874540] [<ffffffff81905737>] schedule+0x37/0x90
[ 658.874543] [<ffffffff816b3065>] ubi_thread+0xd5/0x1f0
[ 658.874545] [<ffffffff816b2f90>] ? ubi_wl_flush+0x1f0/0x1f0
[ 658.874547] [<ffffffff81085219>] kthread+0xc9/0xe0
[ 658.874549] [<ffffffff81085150>] ? kthread_create_on_node+0x180/0x180
[ 658.874551] [<ffffffff81909962>] ret_from_fork+0x42/0x70
[ 658.874552] [<ffffffff81085150>] ? kthread_create_on_node+0x180/0x180

[ 658.874554] Restarting kernel threads ... done.
[ 658.892995] PM: Basic memory bitmaps freed
[ 658.892999] Restarting tasks ... done.


It is because freeze_task() tries to wake up inly kthreads in
interruptible state. I does:

wake_up_state(p, TASK_INTERRUPTIBLE);


Solutions would be to try in freeze_task() also

wake_up_state(p, TASK_IDLE);

or use in ubi_thread()

freezable_schedule()

or always ignore freezing when the task sets TASK_IDLE.


> > Well, we could freezable_schedule() but only
> > on locations where it is safe to get freezed. Anyway, we need to
> > be careful here.
>
> Bah, you made me look at the freezer code, karma reduction for you.

I feel like it has happened.


> And this is the arch typical freeze point if ever there was one, you're
> checking kthread_stop, if we can terminate the kthread, we can certainly
> get frozen.

It makes sense. The tasks should be in some sane state when it goes
into the idle state. I hope that people will not misuse it too much.

Best Regards,
Petr

2015-06-09 15:53:17

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

On Tue 2015-06-09 16:20:03, Tejun Heo wrote:
> Hello, Petr.
>
> On Fri, Jun 05, 2015 at 05:01:06PM +0200, Petr Mladek wrote:
> > Many kthreads already calls set_freezable() before they enter the main
> > cycle. One of the reasons for creating iterant kthreads is to create
> > a safe point for freezing and make even more kthreads properly
> > freezable. Therefore it would make sense to set all iterant
> > kthreads freezable by default.
>
> Actually, for most cases, making kthreads freezable is unnecessary and
> often indicative of something going wrong. This is a crude mechanism
> which goes along the line of "if all threads are stopped, the machine
> should be safe to be put into whatever state", which isn't true at all
> as there usually are a lot of stuff going on asynchronously especially
> when interacting with hardware.

I think that the interaction with the hardware should be the reason to
make them properly freezable. In the current state they are stopped at
some random place, they might either miss some event from the hardware
or the hardware might get resumed into another state and the kthread
might wait forever.

Also I think that freezing kthreads on some well defined location
should help with reproducing and fixing problems.

Note that _only_ kthreads using the _new API_ should be freezable by
default. The move need to be done carefully. It is chance to review
and clean up the freezer stuff.

Of course, I might be too optimistic here.

> In most cases, we want to implement proper power management callbacks
> which plug new issuance of whatever work-unit the code is dealing with
> and drain in-flight ones. Whether the worker threads are frozen or
> not doesn't matter once that's implemented.

I see. The power management is important here.


> It seems that people have been marking kthreads freezable w/o really
> thinking about it - some of them are subtly broken due to missing
> drainage of in-flight things while others simply don't need freezing
> for correctness.

I have similar feeling.

> We do want to clean up freezer usage in the kernel but definitely do
> not want to make kthreads freezable by default especially given that
> the freezer essentially is one giant lockdep-less system-wide lock.

I think that we both want to clean up freezing. I would like to make
it more deterministic and you suggest to make it more relaxed.
Do I understand it correctly?

Best Regards,
Petr


PS: Thanks a lot everyone for feedback. I try hard to get it
somehow sorted and more confident about the conclusions.

2015-06-10 03:13:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Hello, Jiri.

On Tue, Jun 09, 2015 at 02:15:24PM +0200, Jiri Kosina wrote:
> To me, the ultimate goal (*) is to make it possible for kthread to be able
> to decide whether it wants "some kind of default behavior" (however that'd
> be defined), or "ignore all", or "just handle this set of signals with
> these handlers", and make API for this that would avoid every kthread
> implementing its own complete signal handling.

Yes, cleaning up the current usages like above would be great but one
concern is that the above might inadvertantly encourage usage of
signals in kthreads which I don't think is a good idea. It'd be great
if we can consolidate the current users while also making it clear
that we shouldn't be adding new ones.

> > While we do have several kthread signal users, they aren't too many and
> > majority of them use it to allow userland to tell it to shutdown and
>
> Yeah. Via SIGKILL. Or SIGTERM. Or SIGINT. Or SIGQUIT. Not really
> consistent either.

Exactly, and it's pretty confusing from userland. Why do some
kthreads take SIGTERM but not SIGKILL while others do the other way
and yet others ignore them all? This is too low level for a userland
visible interface which is tied too closely to the implementation
detail (usage of one kthread) and often unclear in terms of the
meaning of the action.

> > there seem to be a couple which use HUP/USR1 to cancel whatever it's
> > processing / waiting on at the moment. Hmmm... jffs uses STOP/CONT too.
>
> > I don't know how this should be done but let's please try to
> >
> > 1. Encourage uniform behaviors across the signals.
>
> Fully agreed.
>
> > 2. Ultimately discourage usage of signals on kthreads as this closely
> > ties implementation detail (use of single kthread) to the userland
> > visible interface in a way where we can't easily get back out of.
> > For example, what if jffs needs its gc workers to be multi-threaded
> > and per-NUMA for high-iops devices later on?
>
> What kind of multi-threading kthreads are you referring to here? Something
> more sophisticated than simply spawning several per-CPU (or
> per-whatever-resource) full-fledged kthreads?

Becoming simple multi-threaded or per-cpu are good examples but these
things not too rarely develop into something which needs more
sophiscation. e.g. jobs which process considerable amount data are
usually best served by NUMA-node-affine workers roughly matching the
number of CPUs in each node. workqueue is half way there but still
lacking an automatic way to regulate concurrency in those cases.

For certain use cases, you really can't avoid employing a pool of
workers and once things get there, having tied userland interface to a
single kthread becomes pretty awkward. It sure works for certain use
cases and sending signals to kthreads *might* be considered a "softer"
interface where userland is meddling with kernel implementation
details that it can be changed later on but it can still be painful
for users depending on it.

Thanks.

--
tejun

2015-06-10 04:32:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

Hello, Petr.

On Tue, Jun 09, 2015 at 05:53:13PM +0200, Petr Mladek wrote:
> I think that the interaction with the hardware should be the reason to
> make them properly freezable. In the current state they are stopped at
> some random place, they might either miss some event from the hardware
> or the hardware might get resumed into another state and the kthread
> might wait forever.

Yes, IIUC, there are two classes of use cases where freezing kthreads
makes sense.

* While hibernating, freezing writeback workers and whoever else which
might issue IOs. This is because we have to thaw devices to issue
IOs to write out the prepared hibernation image. If new IOs are
issued from page cache or whereever when the devices are thawed for
writing out the hibernation image, the hibernation image and the
data on the disk deviate.

Note that this only works because currently none of the block
drivers which may be used to write out hibernation images depend on
freezable kthreads and hopefully nobody issues IOs from unfreezable
kthreads or bh or softirq, which, I think, can happen with
preallocated requests or bio based devices.

This is a very brittle approach. Instead of controlling things
where it actually can be controlled - the IO ingress point - we're
trying to control all its users with a pretty blunt tool while at
the same time depending on the fact that some of the low level
subsystems don't happen to make use of the said blunt tool.

I think what we should do is simply blocking all non-image IOs from
the block layer while writing out hibernation image. It'd be
simpler and a lot more reliable.

* Device drivers which interact with their devices in a fully
synchronous manner so that blocking the kthread always reliably
quiesces the device. For this to be true, the device shouldn't
carry over any state across the freezing points. There are drivers
which are actually like this and it works for them.

However, my impression is that people don't really scrutinize why
freezer works for a specific case and tend to spray them all over
and develop a fuzzy sense that freezing will somehow magically make
the driver ready for PM operatoins. It doesn't help that freezer is
such a blunt tool and our historical usage has always been fuzzy -
in the earlier days, we depended on freezer even when we knew it
wasn't sufficient probably because updating all subsystems and
drivers were such a huge undertaking and freezer kinda sorta works
in many cases.

IMHO we'd be a lot better served by blocking the kthread from PM
callbacks explicitly for these drivers to unify control points for
PM operations and make what's going on a lot more explicit. This
will surely involve a bit more boilerplate code but with the right
helpers it should be mostly trivial and I believe that it's likely
to encourage higher quality PM operations why getting rid of this
fuzzy feeling around the freezer.

For both cases, I don't really think kthread freezer is a good
solution. This is a blunt enough tool to hide problems in most but
not all cases while unnecessarily obscuring what's going on from
developers. I personally strongly think that we eventually need to
get rid of the kernel part of freezer.

> Also I think that freezing kthreads on some well defined location
> should help with reproducing and fixing problems.
>
> Note that _only_ kthreads using the _new API_ should be freezable by
> default. The move need to be done carefully. It is chance to review
> and clean up the freezer stuff.
>
> Of course, I might be too optimistic here.

I'm strongly against this. We really don't want to make it even
fuzzier. There are finite things which need to be controlled for PM
operations and I believe what we need to do is identifying them and
implementing explicit and clear control mechanisms for them, not
spreading this feel-good mechanism even more, further obscuring where
those points are.

This becomes the game of "is it frozen ENOUGH yet?" and that "enough"
is extremely difficult to determine as we're not looking at the choke
spots at all and freezable kthreads only cover part of kernel
activity. The fuzzy enoughness also actually inhibits development of
proper mechanisms - "I believe this is frozen ENOUGH at this point so
it is probably safe to assume that X, Y and Z aren't happening
anymore" and it usually is except when it's not.

Let's please not spread this even more.

> > In most cases, we want to implement proper power management callbacks
> > which plug new issuance of whatever work-unit the code is dealing with
> > and drain in-flight ones. Whether the worker threads are frozen or
> > not doesn't matter once that's implemented.
>
> I see. The power management is important here.

That's the only reason we have freezer at all.

> > It seems that people have been marking kthreads freezable w/o really
> > thinking about it - some of them are subtly broken due to missing
> > drainage of in-flight things while others simply don't need freezing
> > for correctness.
>
> I have similar feeling.
>
> > We do want to clean up freezer usage in the kernel but definitely do
> > not want to make kthreads freezable by default especially given that
> > the freezer essentially is one giant lockdep-less system-wide lock.
>
> I think that we both want to clean up freezing. I would like to make
> it more deterministic and you suggest to make it more relaxed.
> Do I understand it correctly?

I'm not sure I'm trying to make it more relaxed. I just don't want it
to spread uncontrolled.

Thanks a lot.

--
tejun

2015-06-10 09:06:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Tue, Jun 09, 2015 at 05:25:26PM +0200, Petr Mladek wrote:
> On Mon 2015-06-08 13:39:55, Peter Zijlstra wrote:
> > On Mon, 2015-06-08 at 12:01 +0200, Petr Mladek wrote:
> >
> > > Just to be sure. Do you suggest to use TASK_IDLE everywhere in
> > > kthreads or only when the uninterruptible sleep is really needed?
> >
> > Always, only use INTERRUPTIBLE when you're actually interruptible, that
> > is you want signals or such muck to terminate your wait.
>
> I like that TASK_IDLE clearly describes that the state of the task.
> I am just curious. Is there any particular advantage of using
> uninterruptible sleep over the interruptible one, please?

I think, given how all schedule() calls should be inside a loop testing
their sleep condition, and one has to assume spurious wakeups anyhow,
one can make an argument for removing the distinction.

That said, typically INTERRUPTIBLE means 'capable of handling signals
while waiting for $foo', and as said elsewhere in this thread, kthreads
should not really be having signals.

In that spirit, I think UNINTERRUPTIBLE is the right sleep.

> I ask because making freezable kthreads is quite tricky. You need to
> call try_to_freeze() after each schedule or call freezable_* variants
> of schedule(). I think that it is easy to make a mistake. I wonder if
> it might be more elegant to use interruptible sleep whenever possible,
> send the fake signal also to kthreads and force them moving into some
> location where the freeze is safe and handled.

I don't think that's really a concern here, you have an absolutely
perfect freeze point and freezable_schedule() is fine.
>
> > > IMHO, we should not use TASK_IDLE in freezable kthreads because
> > > it would break freezing.
> >
> > How so? The task is IDLE, its not doing anything.
>
> Well, it might cause the freeze. I have just double checked this
> with ubi_thread(). It calls set_freezable().

> I did the following change:
>
> diff --git a/drivers/mtd/ubi/wl.c b/drivers/mtd/ubi/wl.c
> index 16214d3d57a4..d528fa5e93ba 100644
> --- a/drivers/mtd/ubi/wl.c
> +++ b/drivers/mtd/ubi/wl.c
> @@ -1428,7 +1428,7 @@ int ubi_thread(void *u)
> spin_lock(&ubi->wl_lock);
> if (list_empty(&ubi->works) || ubi->ro_mode ||
> !ubi->thread_enabled || ubi_dbg_is_bgt_disabled(ubi)) {
> - set_current_state(TASK_INTERRUPTIBLE);
> + set_current_state(TASK_IDLE);
> spin_unlock(&ubi->wl_lock);
> schedule();
> continue;


> or use in ubi_thread()
>
> freezable_schedule()

This

> or always ignore freezing when the task sets TASK_IDLE.

Can't, because they might get woken up, in which case they need to end
up in the fridge.

> > And this is the arch typical freeze point if ever there was one, you're
> > checking kthread_stop, if we can terminate the kthread, we can certainly
> > get frozen.
>
> It makes sense. The tasks should be in some sane state when it goes
> into the idle state. I hope that people will not misuse it too much.

Do your utmost bestest to put in as many assertions as you can to avoid
abuse.

2015-06-10 09:07:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Mon, Jun 08, 2015 at 01:48:10PM -0400, Steven Rostedt wrote:
> > commit 80ed87c8a9ca0cad7ca66cf3bbdfb17559a66dcf
> > Author: Peter Zijlstra <[email protected]>
> > Date: Fri May 8 14:23:45 2015 +0200
> >
> > sched/wait: Introduce TASK_NOLOAD and TASK_IDLE
> >
> > Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
> > 'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
> > all idle kthreads contribute to the loadavg is somewhat silly.
>
> Not to mention, tasks in TASK_UNINTERRUPTIBLE state for too long will
> trigger hung task detection.

Right, and I had not considered that, but it turns out the hung_task
detector checks p->state == TASK_UNINTERRUPTIBLE, so TASK_IDLE is indeed
safe from that.

2015-06-10 10:41:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

On Tue, Jun 09, 2015 at 03:14:46PM +0900, Tejun Heo wrote:
> Hey, Peter.
>
> On Fri, Jun 05, 2015 at 06:22:16PM +0200, Peter Zijlstra wrote:
> > There's a lot more problems with workqueues:
> >
> > - they're not regular tasks and all the task controls don't work on
> > them. This means all things scheduler, like cpu-affinity, nice, and
> > RT/deadline scheduling policies. Instead there is some half baked
> > secondary interface for some of these.
>
> Because there's a pool of them and the workers come and go
> dynamically. There's no way around it. The attributes just have to
> be per-pool.

Sure, but there's a few possible ways to still make that work with the
regular syscall interfaces.

1) propagate the change to any one worker to all workers of the same
pool

2) have a common ancestor task for each pool, and allow changing that.
You can combine that with either the propagation like above, or a
rule that workers kill themselves if they observe their parent
changed (eg. check a attribute sequence count after each work).

> > But this also very much includes things like cgroups, which brings me
> > to the second point.
> >
> > - its oblivious to cgroups (as it is to RT priority for example) both
> > leading to priority inversion. A work enqueued from a deep/limited
> > cgroup does not inherit the task's cgroup. Instead this work is ran
> > from the root cgroup.
> >
> > This breaks cgroup isolation, more significantly so when a large part
> > of the actual work is done from workqueues (as some workloads end up
> > being). Instead of being able to control the work, it all ends up in
> > the root cgroup outside of control.
>
> cgroup support will surely be added but I'm not sure we can or should
> do inheritance automatically.

I think its a good default to inherit stuff from the task that queued
it.

> Using a different API doesn't solve the
> problem automatically either. A lot of kthreads are shared
> system-wide after all. We'll need an abstraction layer to deal with
> that no matter where we do it.

Yes, hardware threads are global, but so is the hardware. Those are not
a problem provided the thread map 1:1 with the actual devices and do not
service multiple devices from a single thread.

Once you start combining things you start to get all the above problems
all over again.

2015-06-10 14:07:35

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Wed, 10 Jun 2015 11:07:24 +0200
Peter Zijlstra <[email protected]> wrote:

> > Not to mention, tasks in TASK_UNINTERRUPTIBLE state for too long will
> > trigger hung task detection.
>
> Right, and I had not considered that, but it turns out the hung_task
> detector checks p->state == TASK_UNINTERRUPTIBLE, so TASK_IDLE is indeed
> safe from that.

Also, I would assume that TASK_IDLE only makes sense for kernel
threads, I wonder if we should add an assertion in schedule that
triggers if a task is scheduling with TASK_IDLE and is not a kernel
thread (has its own mm?)

-- Steve

2015-06-11 04:28:32

by Jiri Kosina

[permalink] [raw]
Subject: Re: [RFC PATCH 09/18] kthread: Make it easier to correctly sleep in iterant kthreads

On Wed, 10 Jun 2015, Steven Rostedt wrote:

> > Right, and I had not considered that, but it turns out the hung_task
> > detector checks p->state == TASK_UNINTERRUPTIBLE, so TASK_IDLE is indeed
> > safe from that.
>
> Also, I would assume that TASK_IDLE only makes sense for kernel
> threads, I wonder if we should add an assertion in schedule that
> triggers if a task is scheduling with TASK_IDLE and is not a kernel
> thread (has its own mm?)

For the sake of completnes -- testing for !task_struct->mm is not a
correct test to find out whether given entity is a kernel thread; kernel
threads are free to temporarily adopt user struct mm via use_mm() (usually
for handling AIO on behalf of a particular struct mm).

The correct check is to look at PF_KTHREAD flag in task_struct->flags.

--
Jiri Kosina
SUSE Labs

2015-06-11 22:02:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

Hello, Peter.

On Wed, Jun 10, 2015 at 12:40:57PM +0200, Peter Zijlstra wrote:
> > Because there's a pool of them and the workers come and go
> > dynamically. There's no way around it. The attributes just have to
> > be per-pool.
>
> Sure, but there's a few possible ways to still make that work with the
> regular syscall interfaces.
>
> 1) propagate the change to any one worker to all workers of the same
> pool
>
> 2) have a common ancestor task for each pool, and allow changing that.
> You can combine that with either the propagation like above, or a
> rule that workers kill themselves if they observe their parent
> changed (eg. check a attribute sequence count after each work).

Sure, we can build the interface in different ways but that doesn't
really change the backend much which is where bulk of work lies.

I'm not sure having a proxy task is even a better interface. It is
better in that we'd be able to reuse task based interface but then
we'd end up with the "proxy" tasks, hooking up notifiers from a number
of essentially unrelated input points into the worker pool mechanism
and what's supported and what's not wouldn't be clear either as the
support for various attributes gradually grow.

More importantly, not all pool attributes will be translatable to task
attributes. There's no way to map things like CPU or NUMA affinity,
concurrency level or mode of concurrency to attributes of a task
without involving a convoluted mapping or an extra side-band
interface. Given that that's the case in the other direction too (a
lot of task attributes won't translate to pool attributes), I'm not
doubtful there's a lot of benefit to gain from trying to reuse task
interface for pools.

> > cgroup support will surely be added but I'm not sure we can or should
> > do inheritance automatically.
>
> I think its a good default to inherit stuff from the task that queued
> it.

While I agree that it'd make sense for certain use cases, I'm not sure
making that a default. At least for workqueue, a lot of use cases
don't even register in terms of resource usage and they're just
punting to be in the right execution context. I'm not sure what we'd
be gaining by going full-on w/ inheritance, which will inevitably
involve a fairly large amount of complexity and overhead as it's
likely to reduce the amount of sharing considerably.

Also, a lot of asynchronous executions share some resources - the
execution context itself, synchronization construct and so on. While
we do cause priority inversion by putting them all into the same
bucket right now, priority inversions caused by blindly putting all
such async executions into separate buckets are likely to be a lot
worse by blocking higher priority executions behind an extremely
resource constrained instance.

> > Using a different API doesn't solve the
> > problem automatically either. A lot of kthreads are shared
> > system-wide after all. We'll need an abstraction layer to deal with
> > that no matter where we do it.
>
> Yes, hardware threads are global, but so is the hardware. Those are not
> a problem provided the thread map 1:1 with the actual devices and do not
> service multiple devices from a single thread.

I'm not sure why hardware is relevant here (especially given that a
lot of devices which matter in terms of performance are heavily
asynchronous), but if you're saying that certain things would be
simpler if we don't pool anything, that is true but I'm quite doubtful
that we can afford dedicated kthreads for every possible purpose at
this point.

> Once you start combining things you start to get all the above problems
> all over again.

Yes, again, the cost of having pools at all. I'm not disagreeing that
it adds a layer of abstraction and complexity. I'm saying this is the
cost we need to pay.

Thanks.

--
tejun

2015-06-12 13:24:44

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

On Wed 2015-06-10 13:31:54, Tejun Heo wrote:
> Hello, Petr.
>
> On Tue, Jun 09, 2015 at 05:53:13PM +0200, Petr Mladek wrote:
> > I think that the interaction with the hardware should be the reason to
> > make them properly freezable. In the current state they are stopped at
> > some random place, they might either miss some event from the hardware
> > or the hardware might get resumed into another state and the kthread
> > might wait forever.
>
> Yes, IIUC, there are two classes of use cases where freezing kthreads
> makes sense.

First, thanks a lot for detailed explanation.

> * While hibernating, freezing writeback workers and whoever else which
> might issue IOs. This is because we have to thaw devices to issue
> IOs to write out the prepared hibernation image. If new IOs are
> issued from page cache or whereever when the devices are thawed for
> writing out the hibernation image, the hibernation image and the
> data on the disk deviate.

Just an idea. I do not say that it is a good solution. If we already
thaw devices needed to write the data. It should be possible to thaw
also kthreads needed to write the data.


> Note that this only works because currently none of the block
> drivers which may be used to write out hibernation images depend on
> freezable kthreads and hopefully nobody issues IOs from unfreezable
> kthreads or bh or softirq, which, I think, can happen with
> preallocated requests or bio based devices.

It means that some kthreads must be stopped, some must be available,
and we do not mind about the rest. I wonder which group is bigger.
It might help to decide about the more safe default. It is not easy
for me to decide :-/


> This is a very brittle approach. Instead of controlling things
> where it actually can be controlled - the IO ingress point - we're
> trying to control all its users with a pretty blunt tool while at
> the same time depending on the fact that some of the low level
> subsystems don't happen to make use of the said blunt tool.
>
> I think what we should do is simply blocking all non-image IOs from
> the block layer while writing out hibernation image. It'd be
> simpler and a lot more reliable.

This sounds like an interesting idea to me. Do you already have some
more precise plan in mind?

Unfortunately, for me it is not easy to judge it. I wonder how many
interfaces would need to get hacked to make this working.
Also I wonder if they would be considered as a hack or a clean
solution by the affected parties. By other words, I wonder if it is
realistic.


> * Device drivers which interact with their devices in a fully
> synchronous manner so that blocking the kthread always reliably
> quiesces the device. For this to be true, the device shouldn't
> carry over any state across the freezing points. There are drivers
> which are actually like this and it works for them.

I am trying to sort it in my head. In each case, we need to stop these
kthreads in some well defined state. Also the kthread (or device)
must be able to block the freezing if they are not in the state yet.

The stop points are defined by try_to_freeze() now. And freezable
kthreads block the freezer until they reach these points.


> However, my impression is that people don't really scrutinize why
> freezer works for a specific case and tend to spray them all over
> and develop a fuzzy sense that freezing will somehow magically make
> the driver ready for PM operatoins. It doesn't help that freezer is
> such a blunt tool and our historical usage has always been fuzzy -
> in the earlier days, we depended on freezer even when we knew it
> wasn't sufficient probably because updating all subsystems and
> drivers were such a huge undertaking and freezer kinda sorta works
> in many cases.

I try to better understand why freezer is considered to be a blunt
tool. Is it because it is a generic API, try_to_freeze() is put on
"random" locations, so that it does not define the safe point
precisely enough?


> IMHO we'd be a lot better served by blocking the kthread from PM
> callbacks explicitly for these drivers to unify control points for
> PM operations and make what's going on a lot more explicit. This
> will surely involve a bit more boilerplate code but with the right
> helpers it should be mostly trivial and I believe that it's likely
> to encourage higher quality PM operations why getting rid of this
> fuzzy feeling around the freezer.

I agree. There is needed some synchronization between the device
drivers and kthread to make sure that they are on the same note.

If I get this correctly. It works the following way now. PM notifies
both kthreads (via freezer) and device drivers (via callbacks) about
that the suspend is in progress. They are supposed to go into a sane
state. But there is no explicit communication between the kthread
and the driver.

What do you exactly mean with the callbacks? Should they set some
flags that would navigate the kthread into some state?


> I'm strongly against this. We really don't want to make it even
> fuzzier. There are finite things which need to be controlled for PM
> operations and I believe what we need to do is identifying them and
> implementing explicit and clear control mechanisms for them, not
> spreading this feel-good mechanism even more, further obscuring where
> those points are.

I see. This explains why you are so strongly against changing the
default. I see the following in the kernel sources:

61 set_freezable()
97 kthread_create()
259 kthread_run()

I means that only about 17% kthreads are freezable these days.

> This becomes the game of "is it frozen ENOUGH yet?" and that "enough"
> is extremely difficult to determine as we're not looking at the choke
> spots at all and freezable kthreads only cover part of kernel
> activity. The fuzzy enoughness also actually inhibits development of
> proper mechanisms - "I believe this is frozen ENOUGH at this point so
> it is probably safe to assume that X, Y and Z aren't happening
> anymore" and it usually is except when it's not.

I am not sure what you mean by frozen enough? I think that a tasks
is either frozen or not. Do I miss something, please?

Best Regards,
Petr

2015-06-13 23:22:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

Hey, Petr.

On Fri, Jun 12, 2015 at 03:24:40PM +0200, Petr Mladek wrote:
> > * While hibernating, freezing writeback workers and whoever else which
> > might issue IOs. This is because we have to thaw devices to issue
> > IOs to write out the prepared hibernation image. If new IOs are
> > issued from page cache or whereever when the devices are thawed for
> > writing out the hibernation image, the hibernation image and the
> > data on the disk deviate.
>
> Just an idea. I do not say that it is a good solution. If we already
> thaw devices needed to write the data. It should be possible to thaw
> also kthreads needed to write the data.

Sure, we'd end up having to mark the involved kthreads and whatever
they may depend on with "freeze for snapshotting but not for writing
out images".

> > Note that this only works because currently none of the block
> > drivers which may be used to write out hibernation images depend on
> > freezable kthreads and hopefully nobody issues IOs from unfreezable
> > kthreads or bh or softirq, which, I think, can happen with
> > preallocated requests or bio based devices.
>
> It means that some kthreads must be stopped, some must be available,
> and we do not mind about the rest. I wonder which group is bigger.
> It might help to decide about the more safe default. It is not easy
> for me to decide :-/

Please do not spread freezer more than necessary. It's not about
which part is larger but about not completely losing track of what's
going on. At least now we can say "this needs freezer because XYZ"
and that XYZ actually points to something which needs to be
synchronized for PM operations. If we flip the polarity, all we're
left with is "this can't be frozen because it deadlocks with XYZ"
which is a lot less information, both in terms of relevance and
amount, than the other way around.

> > This is a very brittle approach. Instead of controlling things
> > where it actually can be controlled - the IO ingress point - we're
> > trying to control all its users with a pretty blunt tool while at
> > the same time depending on the fact that some of the low level
> > subsystems don't happen to make use of the said blunt tool.
> >
> > I think what we should do is simply blocking all non-image IOs from
> > the block layer while writing out hibernation image. It'd be
> > simpler and a lot more reliable.
>
> This sounds like an interesting idea to me. Do you already have some
> more precise plan in mind?

Unfortunately, no. No concrete plans or patches yet but even then I'm
pretty sure we don't wanna head the other direction. It's just wrong.

> Unfortunately, for me it is not easy to judge it. I wonder how many
> interfaces would need to get hacked to make this working.

Not much really. We just need a way to flag IOs to write image - be
that a bio / request flag or task flag and then a way to tell the
block layer to block everything else.

> Also I wonder if they would be considered as a hack or a clean
> solution by the affected parties. By other words, I wonder if it is
> realistic.

I'd say it's pretty realistic. This is synchronizing where the
operations need to be synchronized. What part would be hackish?

> > * Device drivers which interact with their devices in a fully
> > synchronous manner so that blocking the kthread always reliably
> > quiesces the device. For this to be true, the device shouldn't
> > carry over any state across the freezing points. There are drivers
> > which are actually like this and it works for them.
>
> I am trying to sort it in my head. In each case, we need to stop these
> kthreads in some well defined state. Also the kthread (or device)
> must be able to block the freezing if they are not in the state yet.

Whether kthreads need to be stopped or not is periphery once you have
a proper blocking / draining mechanism. The thing is most of these
kthreads are blocked on requests coming down from upper layers anyway.
Once you plug queueing of new requests and drain whatever is in
flight, the kthread isn't gonna do anything anyway.

> The stop points are defined by try_to_freeze() now. And freezable
> kthreads block the freezer until they reach these points.

I'm repeating myself but that only works for fully synchronous devices
(ie. kthread processing one command at a time). You need to drain
whatever is in flight too. You can't do that with simple freezing
points. There's a reason why a lot of drivers can't rely on freezer
for PM operations. Drivers which can fully depend on freezer would be
minority.

> > However, my impression is that people don't really scrutinize why
> > freezer works for a specific case and tend to spray them all over
> > and develop a fuzzy sense that freezing will somehow magically make
> > the driver ready for PM operatoins. It doesn't help that freezer is
> > such a blunt tool and our historical usage has always been fuzzy -
> > in the earlier days, we depended on freezer even when we knew it
> > wasn't sufficient probably because updating all subsystems and
> > drivers were such a huge undertaking and freezer kinda sorta works
> > in many cases.
>
> I try to better understand why freezer is considered to be a blunt
> tool. Is it because it is a generic API, try_to_freeze() is put on
> "random" locations, so that it does not define the safe point
> precisely enough?

Not that. I don't know how to explain it better. Hmmm... okay, let's
say there's a shared queue Q and N users o fit. If you wanna make Q
empty and keep it that way for a while, the right thing to do is
blocking new queueing and then wait till Q drains - you choke the
entity that you wanna control.

Instead of that, freezer is trying to block the "N" users part. In
majority of cases, it blocks enough but it's pretty difficult to be
sure whether you actually got all N of them (as some of them may not
involve kthreads at all or unfreezable kthreads might end up doing
those operations too on corner cases) and it's also not that clear
whether blocking the N users actually make Q empty. Maybe there are
things which can be in flight asynchronously on Q even all its N users
are blocked. This is inherently finicky.

> > IMHO we'd be a lot better served by blocking the kthread from PM
> > callbacks explicitly for these drivers to unify control points for
> > PM operations and make what's going on a lot more explicit. This
> > will surely involve a bit more boilerplate code but with the right
> > helpers it should be mostly trivial and I believe that it's likely
> > to encourage higher quality PM operations why getting rid of this
> > fuzzy feeling around the freezer.
>
> I agree. There is needed some synchronization between the device
> drivers and kthread to make sure that they are on the same note.

No, not kthreads. They should be synchronizing on the actual entities
that they wanna keep empty. Where the kthreads are don't matter at
all. It's the wrong thing to control in most cases. In the minority
of cases where blocking kthread is enough, we can do that explicitly
from the usual PM callbacks.

> If I get this correctly. It works the following way now. PM notifies
> both kthreads (via freezer) and device drivers (via callbacks) about
> that the suspend is in progress. They are supposed to go into a sane
> state. But there is no explicit communication between the kthread
> and the driver.
>
> What do you exactly mean with the callbacks? Should they set some
> flags that would navigate the kthread into some state?

The driver PM callbacks.

> > I'm strongly against this. We really don't want to make it even
> > fuzzier. There are finite things which need to be controlled for PM
> > operations and I believe what we need to do is identifying them and
> > implementing explicit and clear control mechanisms for them, not
> > spreading this feel-good mechanism even more, further obscuring where
> > those points are.
>
> I see. This explains why you are so strongly against changing the
> default. I see the following in the kernel sources:
>
> 61 set_freezable()
> 97 kthread_create()
> 259 kthread_run()
>
> I means that only about 17% kthreads are freezable these days.

And some of them probably don't even need it.

> > This becomes the game of "is it frozen ENOUGH yet?" and that "enough"
> > is extremely difficult to determine as we're not looking at the choke
> > spots at all and freezable kthreads only cover part of kernel
> > activity. The fuzzy enoughness also actually inhibits development of
> > proper mechanisms - "I believe this is frozen ENOUGH at this point so
> > it is probably safe to assume that X, Y and Z aren't happening
> > anymore" and it usually is except when it's not.
>
> I am not sure what you mean by frozen enough? I think that a tasks
> is either frozen or not. Do I miss something, please?

Please refer back to the above Q and N users example. Sure, each
kthread is either frozen or thawed; however, whether you got all of N
or not is pretty difficult to tell reliably and I'd say a lot of the
currently existing freezer usage is pretty fuzzy on that respect.

Thanks.

--
tejun

2015-06-15 09:28:42

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 07/18] kthread: Make iterant kthreads freezable by default

On Sat 2015-06-13 18:22:22, Tejun Heo wrote:
> > I try to better understand why freezer is considered to be a blunt
> > tool. Is it because it is a generic API, try_to_freeze() is put on
> > "random" locations, so that it does not define the safe point
> > precisely enough?
>
> Not that. I don't know how to explain it better. Hmmm... okay, let's
> say there's a shared queue Q and N users o fit. If you wanna make Q
> empty and keep it that way for a while, the right thing to do is
> blocking new queueing and then wait till Q drains - you choke the
> entity that you wanna control.
>
> Instead of that, freezer is trying to block the "N" users part. In
> majority of cases, it blocks enough but it's pretty difficult to be
> sure whether you actually got all N of them (as some of them may not
> involve kthreads at all or unfreezable kthreads might end up doing
> those operations too on corner cases) and it's also not that clear
> whether blocking the N users actually make Q empty. Maybe there are
> things which can be in flight asynchronously on Q even all its N users
> are blocked. This is inherently finicky.

I feel convinced that it does not make sense to make kthreads
freezable by default and that we should not use it when not
necessary.

Thanks a lot for patience and so detailed explanation.

Best Regards,
Petr

2015-06-15 12:46:47

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 02/18] kthread: Add API for iterant kthreads

On Tue 2015-06-09 15:23:49, Tejun Heo wrote:
> Hello, Petr.
>
> On Fri, Jun 05, 2015 at 05:01:01PM +0200, Petr Mladek wrote:
> > +static int kthread_iterant_fn(void *kti_ptr)
> > +{
> > + struct kthread_iterant *kti = kti_ptr;
> > + void *data = kti->data;
> > +
> > + if (kti->init)
> > + kti->init(data);
> > +
> > + do {
> > + if (kti->func)
> > + kti->func(data);
>
> Is supporting kthread_iterant w/o the body function intentional? If
> so, did you have anything specific on mind for it? I don't think it
> matters either way. Just curious how this came to be.

Good question. It might make sense to add a warning for kthreads
with empty main function.

Best Regards,
Petr

2015-06-15 13:13:44

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Hi Oleg,

I am sorry for the late reply. I wanted to think more before answering
all the mails.

On Mon 2015-06-08 23:13:36, Oleg Nesterov wrote:
> I do not. Contrary, I think this needs more code in the likely case.
> Anyway, this API won't have too many users, so I don't even this this
> is that important.
>
> > > > + if (sig_kernel_stop(signr)) {
> > > > + __set_current_state(TASK_STOPPED);
> > > > + spin_unlock_irqrestore(&sighand->siglock, flags);
> > > > + /* Don't run again until woken by SIGCONT or SIGKILL */
> > > > + freezable_schedule();
> > > > + goto relock;
> > >
> > > Yes this avoids the race with SIGCONT. But as I said we can add another
> > > trivial helper which checks JOBCTL_STOP_DEQUEUED. So a kthread can do
> > > this itself.
> >
> > Hmm, the helper would have a strange semantic. You need to take
> > sighand->siglock, dequeue the signal (SIGSTOP), and call
> > __set_current_state(TASK_STOPPED) before you release the lock.
> > But what would happen if the dequeued signal is _not_ SIGSTOP?
>
> Perhaps I missed your point, but no. If you want to handle SIGSTOP
> you can do
>

I think that we need to add:

spin_lock_irq(&sighand->siglock);

> signr = kthread_signal_dequeue();
> switch (signr) {
> case SIGSTOP:
> something_else();
> kthread_do_signal_stop();
> ...
> }

And if we want to avoid any race, kthread_do_signal_stop() should look like:

void kthread_do_signal_stop(unsigned long flags)
{
struct sighand_struct *sighand = current->sighand;

__set_current_state(TASK_STOPPED);
spin_unlock_irqrestore(&sighand->siglock, flags);
/* Don't run again until woken by SIGCONT or SIGKILL */
freezable_schedule();
}

It means that we will have spin_lock() in one function and
spin_unlock() in another one. This is what I meant with
the strange semantic. This is why I think that it might be
cleaner to implement some generic kthread_do_signal() or so
and allow to (re)define/add sigactions via callbacks.

Note that I am not aware of any kthread that would use SIGSTOP
non-standard way.

Anyway, I am going to concentrate on the main structure of the kthread
API and will put the controversial signal handling a side for now.
I will get back to it when converting the few kthreads that use
signals. I will think more about your feedback in the meantime.

Best Regards,
Petr

2015-06-15 15:23:17

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 15/18] ring_buffer: Allow to exit the ring buffer benchmark immediately

On Mon 2015-06-08 13:44:17, Steven Rostedt wrote:
> On Fri, 5 Jun 2015 17:01:14 +0200
> Petr Mladek <[email protected]> wrote:
>
> > It takes a while until the ring_buffer_benchmark module is removed
> > when the ring buffer hammer is running. It is because it takes
> > few seconds and kthread_should_terminate() is not being checked.
> >
> > This patch adds the check for kthread termination into the producer.
> > It uses the existing kill_test flag to finish the kthreads as
> > cleanly as possible.
> >
> > It disables printing the "ERROR" message when the kthread is going.
> >
> > Also it makes sure that producer does not go into the 10sec sleep
> > when it is being killed.
>
> This patch looks like something I may take regardless of the other
> patches (if it applies cleanly).

Please, find below a version of the patch that can be applied against
current Linus tree and also against your for-next branch.

> As for the other patches, the ring buffer benchmark is just that, a
> benchmark that I use when making changes to the ring buffer. It's not
> something for production systems.

I see.

> What about just adding a depend on !LIVE_PATCHING to
> RING_BUFFER_BENCHMARK, or force it to shut down during patching.
> There's no reason to make it safe to be running when you patch the
> kernel. Just adds complexity to some simple code.

I would like to convert all kthreads into some sane API. I hope
that it will make them rather easier. Let's see how it will look
like in the end.

Anyway, the two kthreads in this benchmark are in the "easy" group
regarding patching because they are sleeping only limited time.

2015-06-15 15:33:48

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC PATCH 15/18] ring_buffer: Allow to exit the ring buffer benchmark immediately

On Mon, 15 Jun 2015 17:23:13 +0200
Petr Mladek <[email protected]> wrote:


> Please, find below a version of the patch that can be applied against
> current Linus tree and also against your for-next branch.
>

Thanks, just on Friday, I was testing ring buffer changes with the
ringbuffer benchmark, and was bitching about how I had to wait the 10
seconds before the module would unload ;-)

-- Steve

2015-06-15 15:54:33

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 15/18] ring_buffer: Allow to exit the ring buffer benchmark immediately

On Mon 2015-06-15 11:33:43, Steven Rostedt wrote:
> On Mon, 15 Jun 2015 17:23:13 +0200
> Petr Mladek <[email protected]> wrote:
>
>
> > Please, find below a version of the patch that can be applied against
> > current Linus tree and also against your for-next branch.
> >
>
> Thanks, just on Friday, I was testing ring buffer changes with the
> ringbuffer benchmark, and was bitching about how I had to wait the 10
> seconds before the module would unload ;-)

I am glad that it will be useful.

Please, find below a version with a small typo fix. Heh, I spent a lot
of time with doulbe checking the patch and then decided to do a last
minute change. It produced "only" compilation warning that I missed.
It was not visible with my testing :-(

2015-06-15 19:15:40

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

Hi Petr,

On 06/15, Petr Mladek wrote:
>
> I am sorry for the late reply. I wanted to think more before answering
> all the mails.

Don't worry I am always late ;)

> On Mon 2015-06-08 23:13:36, Oleg Nesterov wrote:
> > >
> > > Hmm, the helper would have a strange semantic. You need to take
> > > sighand->siglock, dequeue the signal (SIGSTOP), and call
> > > __set_current_state(TASK_STOPPED) before you release the lock.
> > > But what would happen if the dequeued signal is _not_ SIGSTOP?
> >
> > Perhaps I missed your point, but no. If you want to handle SIGSTOP
> > you can do
> >
>
> I think that we need to add:
>
> spin_lock_irq(&sighand->siglock);
>
> > signr = kthread_signal_dequeue();
> > switch (signr) {
> > case SIGSTOP:
> > something_else();
> > kthread_do_signal_stop();
> > ...
> > }
>
> And if we want to avoid any race, kthread_do_signal_stop() should look like:
>
> void kthread_do_signal_stop(unsigned long flags)
> {
> struct sighand_struct *sighand = current->sighand;
>
> __set_current_state(TASK_STOPPED);
> spin_unlock_irqrestore(&sighand->siglock, flags);
> /* Don't run again until woken by SIGCONT or SIGKILL */
> freezable_schedule();
> }

Ah, understand. You think that we need to take ->siglock in advance
to avoid the race with SIGCONT?

No, we don't. Let me show you the code I suggested again:

void kthread_do_signal_stop(void)
{
spin_lock_irq(&curtent->sighand->siglock);
if (current->jobctl & JOBCTL_STOP_DEQUEUED)
__set_current_state(TASK_STOPPED);
spin_unlock_irq(&current->sighand->siglock);

schedule();
}

so you can dequeue_signal() and call kthread_do_signal_stop() without
holding ->siglock. We can rely on JOBCTL_STOP_DEQUEUED bit. SIGCONT
clears it, so kthread_do_signal_stop() can't race.

Oleg.


2015-06-16 07:54:18

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] signal/kthread: Initial implementation of kthread signal handling

On Mon 2015-06-15 21:14:29, Oleg Nesterov wrote:
> Ah, understand. You think that we need to take ->siglock in advance
> to avoid the race with SIGCONT?

exactly

> No, we don't. Let me show you the code I suggested again:
>
> void kthread_do_signal_stop(void)
> {
> spin_lock_irq(&curtent->sighand->siglock);
> if (current->jobctl & JOBCTL_STOP_DEQUEUED)
> __set_current_state(TASK_STOPPED);
> spin_unlock_irq(&current->sighand->siglock);
>
> schedule();
> }
>
> so you can dequeue_signal() and call kthread_do_signal_stop() without
> holding ->siglock. We can rely on JOBCTL_STOP_DEQUEUED bit. SIGCONT
> clears it, so kthread_do_signal_stop() can't race.

Heureka, I have got it. I have previously missed the meaning of the
JOBCTL_STOP_DEQUEUED bit. Thanks for explanation.

Best Regards,
Petr

2015-06-17 11:35:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] kthreads/signal: Safer kthread API and signal handling

On Fri, Jun 05, 2015 at 05:00:59PM +0200, Petr Mladek wrote:
> Kthreads are implemented as an infinite loop. They include check points
> for termination, freezer, parking, and even signal handling.
>
> We need to touch all kthreads every time we want to add or
> modify the behavior of such checkpoints. It is not easy because
> there are several hundreds of kthreads and each of them is
> implemented in a slightly different way.
>
> This anarchy brings potentially broken or non-standard behavior.
> For example, few kthreads already handle signals a strange way.
>
>
> This patchset is a _proof-of-concept_ how to improve the situation.

Maybe it might be a better idea to audit the existing callers of
signals for kthreads and first check if we can get rid of them. Then
as a second step if any are left convert them to signalfd-style
operation where signals aren't checked synchronously but as something
the threads can consume on a as needed basis.