2019-03-21 21:47:50

by Waiman Long

Subject: [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()

It was found that if a process has accumulated a sufficient number of
pending signals, the exit of that process may cause a hard lockup in its
parent when running on a debug kernel with a slow memory-freeing path
(such as with KASAN enabled).

The problematic call chain is release_task() => flush_sigqueue().

The lockup condition can be reproduced on a large system with a lot of
memory and relatively slow CPUs running LTP's sigqueue_9-1 test on a
debug kernel.

This patchset tries to mitigate the problem by introducing a new kernel
memory freeing queue mechanism, modelled after the wake_q mechanism used
for waking up tasks. flush_sigqueue() and release_task() are then modified
to use the freeing queue to defer the actual freeing of memory objects
until after the tasklist_lock has been released and irqs re-enabled.
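
For illustration, the resulting release_task() pattern looks roughly like
this (a condensed sketch of patches 1 and 2):

        DEFINE_KMEM_FREE_Q(free_q);

        write_lock_irq(&tasklist_lock);
        ptrace_release_task(p);
        __exit_signal(p, &free_q);      /* sigqueue/user_struct objects are queued, not freed */
        write_unlock_irq(&tasklist_lock);

        kmem_free_up_q(&free_q);        /* the real kmem_cache_free()/kfree() calls, with irqs on */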

With the patchset applied, the hard lockup problem was no longer
reproducible on the debug kernel.

Waiman Long (4):
mm: Implement kmem objects freeing queue
signal: Make flush_sigqueue() use free_q to release memory
signal: Add free_uid_to_q()
mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()

include/linux/sched/user.h | 3 +++
include/linux/signal.h | 4 ++-
include/linux/slab.h | 28 +++++++++++++++++++++
kernel/exit.c | 12 ++++++---
kernel/signal.c | 29 +++++++++++++---------
kernel/user.c | 17 ++++++++++---
mm/slab_common.c | 50 ++++++++++++++++++++++++++++++++++++++
security/selinux/hooks.c | 8 ++++--
8 files changed, 128 insertions(+), 23 deletions(-)

--
2.18.1



2019-03-21 21:46:41

by Waiman Long

Subject: [PATCH 1/4] mm: Implement kmem objects freeing queue

When releasing kernel data structures, freeing up the memory
occupied by those objects is usually the last step. To avoid races,
the release operation is commonly done with a lock held. However, the
freeing operations do not need to happen under the lock, although in
many cases they do.

In some complex cases where the locks protect many different memory
objects, that can be a problem especially if some memory debugging
features like KASAN are enabled. In those cases, freeing memory objects
under lock can greatly lengthen the lock hold time. This can even lead
to soft/hard lockups in some extreme cases.

To make it easier to defer freeing memory objects until after unlock,
a kernel memory freeing queue mechanism is now added. It is modelled
after the wake_q mechanism for waking up tasks without holding a lock.

Now kmem_free_q_add() can be called to add memory objects into a freeing
queue. Later on, kmem_free_up_q() can be called to free all the memory
objects in the freeing queue after releasing the lock.
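
As a rough usage sketch (the lock, list and cache names below are just
placeholders, not part of this patch):

        DEFINE_KMEM_FREE_Q(free_q);
        unsigned long flags;

        spin_lock_irqsave(&my_lock, flags);
        list_for_each_entry_safe(obj, tmp, &my_list, node) {
                list_del(&obj->node);
                /* defer the free instead of calling kmem_cache_free() under the lock */
                kmem_free_q_add(&free_q, my_cachep, obj);
        }
        spin_unlock_irqrestore(&my_lock, flags);

        /* the actual kmem_cache_free()/kfree() calls happen here */
        kmem_free_up_q(&free_q);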

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/slab.h | 28 ++++++++++++++++++++++++++++
mm/slab_common.c | 41 +++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 11b45f7ae405..6116fcecbd8f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -762,4 +762,32 @@ int slab_dead_cpu(unsigned int cpu);
#define slab_dead_cpu NULL
#endif

+/*
+ * Freeing queue node for freeing kmem_cache slab objects later.
+ * The node is put at the beginning of the memory object and so the object
+ * size cannot be smaller than sizeof(kmem_free_q_node).
+ */
+struct kmem_free_q_node {
+ struct kmem_free_q_node *next;
+ struct kmem_cache *cachep; /* NULL if alloc'ed by kmalloc */
+};
+
+struct kmem_free_q_head {
+ struct kmem_free_q_node *first;
+ struct kmem_free_q_node **lastp;
+};
+
+#define DEFINE_KMEM_FREE_Q(name) \
+ struct kmem_free_q_head name = { NULL, &name.first }
+
+static inline void kmem_free_q_init(struct kmem_free_q_head *head)
+{
+ head->first = NULL;
+ head->lastp = &head->first;
+}
+
+extern void kmem_free_q_add(struct kmem_free_q_head *head,
+ struct kmem_cache *cachep, void *object);
+extern void kmem_free_up_q(struct kmem_free_q_head *head);
+
#endif /* _LINUX_SLAB_H */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 03eeb8b7b4b1..dba20b4208f1 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1597,6 +1597,47 @@ void kzfree(const void *p)
}
EXPORT_SYMBOL(kzfree);

+/**
+ * kmem_free_q_add - add a kmem object to a freeing queue
+ * @head: freeing queue head
+ * @cachep: kmem_cache pointer (NULL for kmalloc'ed objects)
+ * @object: kmem object to be put into the queue and freed later
+ *
+ * Put a kmem object into the freeing queue to be freed later.
+ */
+void kmem_free_q_add(struct kmem_free_q_head *head, struct kmem_cache *cachep,
+ void *object)
+{
+ struct kmem_free_q_node *node = object;
+
+ WARN_ON_ONCE(cachep && cachep->object_size < sizeof(*node));
+ node->next = NULL;
+ node->cachep = cachep;
+ *(head->lastp) = node;
+ head->lastp = &node->next;
+}
+EXPORT_SYMBOL_GPL(kmem_free_q_add);
+
+/**
+ * kmem_free_up_q - free all the objects in the freeing queue
+ * @head: freeing queue head
+ *
+ * Free all the objects in the freeing queue.
+ */
+void kmem_free_up_q(struct kmem_free_q_head *head)
+{
+ struct kmem_free_q_node *node, *next;
+
+ for (node = head->first; node; node = next) {
+ next = node->next;
+ if (node->cachep)
+ kmem_cache_free(node->cachep, node);
+ else
+ kfree(node);
+ }
+}
+EXPORT_SYMBOL_GPL(kmem_free_up_q);
+
/* Tracepoints definitions. */
EXPORT_TRACEPOINT_SYMBOL(kmalloc);
EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);
--
2.18.1


2019-03-21 21:47:00

by Waiman Long

Subject: [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()

If the freeing queue has many objects, freeing all of them consecutively
may cause a soft lockup, especially on a debug kernel. So kmem_free_up_q()
is modified to call cond_resched() periodically when running in process
context.

Signed-off-by: Waiman Long <[email protected]>
---
mm/slab_common.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index dba20b4208f1..633a1d0f6d20 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
* kmem_free_up_q - free all the objects in the freeing queue
* @head: freeing queue head
*
- * Free all the objects in the freeing queue.
+ * Free all the objects in the freeing queue. The caller cannot hold any
+ * non-sleeping locks.
*/
void kmem_free_up_q(struct kmem_free_q_head *head)
{
struct kmem_free_q_node *node, *next;
+ bool do_resched = !in_irq();
+ int cnt = 0;

for (node = head->first; node; node = next) {
next = node->next;
@@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
kmem_cache_free(node->cachep, node);
else
kfree(node);
+ /*
+ * Call cond_resched() every 256 objects freed when in
+ * process context.
+ */
+ if (do_resched && !(++cnt & 0xff))
+ cond_resched();
}
}
EXPORT_SYMBOL_GPL(kmem_free_up_q);
--
2.18.1


2019-03-21 21:47:29

by Waiman Long

Subject: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

It was found that if a process had many pending signals (e.g. millions),
the act of exiting that process might cause a hard lockup in its parent,
especially on a debug kernel with features like KASAN enabled. This is
because flush_sigqueue() is called in release_task() with the
tasklist_lock held and irqs disabled.

[ 3133.105601] NMI watchdog: Watchdog detected hard LOCKUP on cpu 37
:
[ 3133.105709] CPU: 37 PID: 11200 Comm: bash Kdump: loaded Not tainted 4.18.0-80.el8.x86_64+debug #1
:
[ 3133.105750] slab_free_freelist_hook+0xa0/0x120
[ 3133.105758] kmem_cache_free+0x9d/0x310
[ 3133.105762] flush_sigqueue+0x12b/0x1d0
[ 3133.105766] release_task.part.14+0xaf7/0x1520
[ 3133.105784] wait_consider_task+0x28da/0x3350
[ 3133.105804] do_wait+0x3eb/0x8c0
[ 3133.105819] kernel_wait4+0xe4/0x1b0
[ 3133.105834] __do_sys_wait4+0xe0/0xf0
[ 3133.105864] do_syscall_64+0xa5/0x4a0
[ 3133.105868] entry_SYSCALL_64_after_hwframe+0x6a/0xdf

[ All the "?" stack trace entries were removed from above. ]

To avoid this dire condition and reduce the lock hold time of the
tasklist_lock, flush_sigqueue() is modified to take a freeing queue
pointer so that the actual freeing of memory objects can be deferred
until after the tasklist_lock is released and irqs are re-enabled.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/signal.h | 4 +++-
kernel/exit.c | 12 ++++++++----
kernel/signal.c | 27 ++++++++++++++++-----------
security/selinux/hooks.c | 8 ++++++--
4 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 9702016734b1..a9562e502122 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -5,6 +5,7 @@
#include <linux/bug.h>
#include <linux/signal_types.h>
#include <linux/string.h>
+#include <linux/slab.h>

struct task_struct;

@@ -254,7 +255,8 @@ static inline void init_sigpending(struct sigpending *sig)
INIT_LIST_HEAD(&sig->list);
}

-extern void flush_sigqueue(struct sigpending *queue);
+extern void flush_sigqueue(struct sigpending *queue,
+ struct kmem_free_q_head *head);

/* Test if 'sig' is valid signal. Use this instead of testing _NSIG directly */
static inline int valid_signal(unsigned long sig)
diff --git a/kernel/exit.c b/kernel/exit.c
index 2166c2d92ddc..ee707a63edfd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -88,7 +88,8 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
/*
* This function expects the tasklist_lock write-locked.
*/
-static void __exit_signal(struct task_struct *tsk)
+static void __exit_signal(struct task_struct *tsk,
+ struct kmem_free_q_head *free_q)
{
struct signal_struct *sig = tsk->signal;
bool group_dead = thread_group_leader(tsk);
@@ -160,14 +161,14 @@ static void __exit_signal(struct task_struct *tsk)
* Do this under ->siglock, we can race with another thread
* doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
*/
- flush_sigqueue(&tsk->pending);
+ flush_sigqueue(&tsk->pending, free_q);
tsk->sighand = NULL;
spin_unlock(&sighand->siglock);

__cleanup_sighand(sighand);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
if (group_dead) {
- flush_sigqueue(&sig->shared_pending);
+ flush_sigqueue(&sig->shared_pending, free_q);
tty_kref_put(tty);
}
}
@@ -186,6 +187,8 @@ void release_task(struct task_struct *p)
{
struct task_struct *leader;
int zap_leader;
+ DEFINE_KMEM_FREE_Q(free_q);
+
repeat:
/* don't need to get the RCU readlock here - the process is dead and
* can't be modifying its own credentials. But shut RCU-lockdep up */
@@ -197,7 +200,7 @@ void release_task(struct task_struct *p)

write_lock_irq(&tasklist_lock);
ptrace_release_task(p);
- __exit_signal(p);
+ __exit_signal(p, &free_q);

/*
* If we are the last non-leader member of the thread
@@ -219,6 +222,7 @@ void release_task(struct task_struct *p)
}

write_unlock_irq(&tasklist_lock);
+ kmem_free_up_q(&free_q);
cgroup_release(p);
release_thread(p);
call_rcu(&p->rcu, delayed_put_task_struct);
diff --git a/kernel/signal.c b/kernel/signal.c
index b7953934aa99..04fb202c16bd 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -435,16 +435,19 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
return q;
}

-static void __sigqueue_free(struct sigqueue *q)
+static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
{
if (q->flags & SIGQUEUE_PREALLOC)
return;
atomic_dec(&q->user->sigpending);
free_uid(q->user);
- kmem_cache_free(sigqueue_cachep, q);
+ if (free_q)
+ kmem_free_q_add(free_q, sigqueue_cachep, q);
+ else
+ kmem_cache_free(sigqueue_cachep, q);
}

-void flush_sigqueue(struct sigpending *queue)
+void flush_sigqueue(struct sigpending *queue, struct kmem_free_q_head *free_q)
{
struct sigqueue *q;

@@ -452,7 +455,7 @@ void flush_sigqueue(struct sigpending *queue)
while (!list_empty(&queue->list)) {
q = list_entry(queue->list.next, struct sigqueue , list);
list_del_init(&q->list);
- __sigqueue_free(q);
+ __sigqueue_free(q, free_q);
}
}

@@ -462,12 +465,14 @@ void flush_sigqueue(struct sigpending *queue)
void flush_signals(struct task_struct *t)
{
unsigned long flags;
+ DEFINE_KMEM_FREE_Q(free_q);

spin_lock_irqsave(&t->sighand->siglock, flags);
clear_tsk_thread_flag(t, TIF_SIGPENDING);
- flush_sigqueue(&t->pending);
- flush_sigqueue(&t->signal->shared_pending);
+ flush_sigqueue(&t->pending, &free_q);
+ flush_sigqueue(&t->signal->shared_pending, &free_q);
spin_unlock_irqrestore(&t->sighand->siglock, flags);
+ kmem_free_up_q(&free_q);
}
EXPORT_SYMBOL(flush_signals);

@@ -488,7 +493,7 @@ static void __flush_itimer_signals(struct sigpending *pending)
} else {
sigdelset(&signal, sig);
list_del_init(&q->list);
- __sigqueue_free(q);
+ __sigqueue_free(q, NULL);
}
}

@@ -580,7 +585,7 @@ static void collect_signal(int sig, struct sigpending *list, kernel_siginfo_t *i
(info->si_code == SI_TIMER) &&
(info->si_sys_private);

- __sigqueue_free(first);
+ __sigqueue_free(first, NULL);
} else {
/*
* Ok, it wasn't in the queue. This must be
@@ -728,7 +733,7 @@ static int dequeue_synchronous_signal(kernel_siginfo_t *info)
still_pending:
list_del_init(&sync->list);
copy_siginfo(info, &sync->info);
- __sigqueue_free(sync);
+ __sigqueue_free(sync, NULL);
return info->si_signo;
}

@@ -776,7 +781,7 @@ static void flush_sigqueue_mask(sigset_t *mask, struct sigpending *s)
list_for_each_entry_safe(q, n, &s->list, list) {
if (sigismember(mask, q->info.si_signo)) {
list_del_init(&q->list);
- __sigqueue_free(q);
+ __sigqueue_free(q, NULL);
}
}
}
@@ -1749,7 +1754,7 @@ void sigqueue_free(struct sigqueue *q)
spin_unlock_irqrestore(lock, flags);

if (q)
- __sigqueue_free(q);
+ __sigqueue_free(q, NULL);
}

int send_sigqueue(struct sigqueue *q, struct pid *pid, enum pid_type type)
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 1d0b37af2444..8ca571a0b2ac 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2548,6 +2548,8 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
rc = avc_has_perm(&selinux_state,
osid, sid, SECCLASS_PROCESS, PROCESS__SIGINH, NULL);
if (rc) {
+ DEFINE_KMEM_FREE_Q(free_q);
+
if (IS_ENABLED(CONFIG_POSIX_TIMERS)) {
memset(&itimer, 0, sizeof itimer);
for (i = 0; i < 3; i++)
@@ -2555,13 +2557,15 @@ static void selinux_bprm_committed_creds(struct linux_binprm *bprm)
}
spin_lock_irq(&current->sighand->siglock);
if (!fatal_signal_pending(current)) {
- flush_sigqueue(&current->pending);
- flush_sigqueue(&current->signal->shared_pending);
+ flush_sigqueue(&current->pending, &free_q);
+ flush_sigqueue(&current->signal->shared_pending,
+ &free_q);
flush_signal_handlers(current, 1);
sigemptyset(&current->blocked);
recalc_sigpending();
}
spin_unlock_irq(&current->sighand->siglock);
+ kmem_free_up_q(&free_q);
}

/* Wake up the parent if it is waiting so that it can recheck
--
2.18.1


2019-03-21 21:48:33

by Waiman Long

Subject: [PATCH 3/4] signal: Add free_uid_to_q()

Add a new free_uid_to_q() function to put the user structure on a
freeing queue instead of freeing it directly. That new function is then
called from __sigqueue_free() with a free_q parameter.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched/user.h | 3 +++
kernel/signal.c | 2 +-
kernel/user.c | 17 +++++++++++++----
3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index c7b5f86b91a1..77f28d5cb940 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -63,6 +63,9 @@ static inline struct user_struct *get_uid(struct user_struct *u)
refcount_inc(&u->__count);
return u;
}
+
+struct kmem_free_q_head;
extern void free_uid(struct user_struct *);
+extern void free_uid_to_q(struct user_struct *u, struct kmem_free_q_head *q);

#endif /* _LINUX_SCHED_USER_H */
diff --git a/kernel/signal.c b/kernel/signal.c
index 04fb202c16bd..2ecb23b540eb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -440,7 +440,7 @@ static void __sigqueue_free(struct sigqueue *q, struct kmem_free_q_head *free_q)
if (q->flags & SIGQUEUE_PREALLOC)
return;
atomic_dec(&q->user->sigpending);
- free_uid(q->user);
+ free_uid_to_q(q->user, free_q);
if (free_q)
kmem_free_q_add(free_q, sigqueue_cachep, q);
else
diff --git a/kernel/user.c b/kernel/user.c
index 0df9b1640b2a..d92629bae546 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -135,14 +135,18 @@ static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
* IRQ state (as stored in flags) is restored and uidhash_lock released
* upon function exit.
*/
-static void free_user(struct user_struct *up, unsigned long flags)
+static void free_user(struct user_struct *up, unsigned long flags,
+ struct kmem_free_q_head *free_q)
__releases(&uidhash_lock)
{
uid_hash_remove(up);
spin_unlock_irqrestore(&uidhash_lock, flags);
key_put(up->uid_keyring);
key_put(up->session_keyring);
- kmem_cache_free(uid_cachep, up);
+ if (free_q)
+ kmem_free_q_add(free_q, uid_cachep, up);
+ else
+ kmem_cache_free(uid_cachep, up);
}

/*
@@ -162,7 +166,7 @@ struct user_struct *find_user(kuid_t uid)
return ret;
}

-void free_uid(struct user_struct *up)
+void free_uid_to_q(struct user_struct *up, struct kmem_free_q_head *free_q)
{
unsigned long flags;

@@ -170,7 +174,12 @@ void free_uid(struct user_struct *up)
return;

if (refcount_dec_and_lock_irqsave(&up->__count, &uidhash_lock, &flags))
- free_user(up, flags);
+ free_user(up, flags, free_q);
+}
+
+void free_uid(struct user_struct *up)
+{
+ free_uid_to_q(up, NULL);
}

struct user_struct *alloc_uid(kuid_t uid)
--
2.18.1


2019-03-21 22:02:17

by Peter Zijlstra

Subject: Re: [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()

On Thu, Mar 21, 2019 at 05:45:12PM -0400, Waiman Long wrote:
> If the freeing queue has many objects, freeing all of them consecutively
> may cause soft lockup especially on a debug kernel. So kmem_free_up_q()
> is modified to call cond_resched() if running in the process context.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> mm/slab_common.c | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index dba20b4208f1..633a1d0f6d20 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
> * kmem_free_up_q - free all the objects in the freeing queue
> * @head: freeing queue head
> *
> - * Free all the objects in the freeing queue.
> + * Free all the objects in the freeing queue. The caller cannot hold any
> + * non-sleeping locks.
> */
> void kmem_free_up_q(struct kmem_free_q_head *head)
> {
> struct kmem_free_q_node *node, *next;
> + bool do_resched = !in_irq();
> + int cnt = 0;
>
> for (node = head->first; node; node = next) {
> next = node->next;
> @@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
> kmem_cache_free(node->cachep, node);
> else
> kfree(node);
> + /*
> + * Call cond_resched() every 256 objects freed when in
> + * process context.
> + */
> + if (do_resched && !(++cnt & 0xff))
> + cond_resched();

Why not just: cond_resched() ?

> }
> }
> EXPORT_SYMBOL_GPL(kmem_free_up_q);
> --
> 2.18.1
>

2019-03-22 01:54:08

by Matthew Wilcox

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
> It was found that if a process had many pending signals (e.g. millions),
> the act of exiting that process might cause its parent to have a hard
> lockup especially on a debug kernel with features like KASAN enabled.
> It was because the flush_sigqueue() was called in release_task() with
> tasklist_lock held and irq disabled.

This rather apocalyptic language is a bit uncalled for. I appreciate the
warning is scary, but all that's really happening is that the debug code
is taking too long to execute.

> To avoid this dire condition and reduce lock hold time of tasklist_lock,
> flush_sigqueue() is modified to pass in a freeing queue pointer so that
> the actual freeing of memory objects can be deferred until after the
> tasklist_lock is released and irq re-enabled.

I think this is a really bad solution. It looks kind of generic,
but isn't. It's terribly inefficient, and all it's really doing is
deferring the debugging code until we've re-enabled interrupts.

We'd be much better off just having a list_head in the caller
and list_splice() the queue->list onto that caller. Then call
__sigqueue_free() for each signal on the queue.
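
Roughly (a sketch only; see Oleg's reply below for why this is racy as
written):

        struct sigqueue *q, *n;
        LIST_HEAD(to_free);

        /* with the locks still held: detach the entries instead of freeing them */
        sigemptyset(&queue->signal);
        list_splice_init(&queue->list, &to_free);

        /* ... drop the locks, re-enable irqs ... */

        list_for_each_entry_safe(q, n, &to_free, list) {
                list_del_init(&q->list);
                __sigqueue_free(q);
        }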

2019-03-22 10:17:36

by Matthew Wilcox

Subject: Re: [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()

On Thu, Mar 21, 2019 at 05:45:08PM -0400, Waiman Long wrote:
> It was found that if a process has accumulated sufficient number of
> pending signals, the exiting of that process may cause its parent to
> have hard lockup when running on a debug kernel with a slow memory
> freeing path (like with KASAN enabled).

I appreciate these are "reliable" signals, but why do we accumulate so
many signals to a task which will never receive them? Can we detect at
signal delivery time that the task is going to die and avoid queueing
them in the first place?

2019-03-22 11:17:40

by Oleg Nesterov

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On 03/21, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>
> > To avoid this dire condition and reduce lock hold time of tasklist_lock,
> > flush_sigqueue() is modified to pass in a freeing queue pointer so that
> > the actual freeing of memory objects can be deferred until after the
> > tasklist_lock is released and irq re-enabled.
>
> I think this is a really bad solution. It looks kind of generic,
> but isn't. It's terribly inefficient, and all it's really doing is
> deferring the debugging code until we've re-enabled interrupts.

Agreed.

> We'd be much better off just having a list_head in the caller
> and list_splice() the queue->list onto that caller. Then call
> __sigqueue_free() for each signal on the queue.

This won't work; note the comment which explains the race with sigqueue_free().

Let me think about it... at least we can do something like

close_the_race_with_sigqueue_free(struct sigpending *queue)
{
        struct sigqueue *q, *t;

        list_for_each_entry_safe(q, t, ...) {
                if (q->flags & SIGQUEUE_PREALLOC)
                        list_del_init(&q->list);
        }
}

called with ->siglock held, tasklist_lock is not needed.

After that flush_sigqueue() can be called lockless in release_task().

I'll try to make the patch tomorrow.

Oleg.


2019-03-22 12:58:08

by Oleg Nesterov

Subject: Re: [PATCH 0/4] Signal: Fix hard lockup problem in flush_sigqueue()

On 03/22, Matthew Wilcox wrote:
>
> On Thu, Mar 21, 2019 at 05:45:08PM -0400, Waiman Long wrote:
> > It was found that if a process has accumulated sufficient number of
> > pending signals, the exiting of that process may cause its parent to
> > have hard lockup when running on a debug kernel with a slow memory
> > freeing path (like with KASAN enabled).
>
> I appreciate these are "reliable" signals, but why do we accumulate so
> many signals to a task which will never receive them? Can we detect at
> signal delivery time that the task is going to die and avoid queueing
> them in the first place?

A task can block the signal and accumulate up to RLIMIT_SIGPENDING signals,
then it can exit.
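
A minimal userspace sketch of that scenario (error handling omitted; a
second process is assumed to loop on sigqueue() as the sender):

        sigset_t set;

        /* block a realtime signal so queued instances just accumulate */
        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);

        /* sender: while (sigqueue(pid, SIGRTMIN, val) == 0) ; */

        sleep(60);
        _exit(0);       /* the pending queue is only flushed when the parent reaps the task */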

Oleg.


2019-03-22 14:36:45

by Waiman Long

Subject: Re: [PATCH 4/4] mm: Do periodic rescheduling when freeing objects in kmem_free_up_q()

On 03/21/2019 06:00 PM, Peter Zijlstra wrote:
> On Thu, Mar 21, 2019 at 05:45:12PM -0400, Waiman Long wrote:
>> If the freeing queue has many objects, freeing all of them consecutively
>> may cause soft lockup especially on a debug kernel. So kmem_free_up_q()
>> is modified to call cond_resched() if running in the process context.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> mm/slab_common.c | 11 ++++++++++-
>> 1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>> index dba20b4208f1..633a1d0f6d20 100644
>> --- a/mm/slab_common.c
>> +++ b/mm/slab_common.c
>> @@ -1622,11 +1622,14 @@ EXPORT_SYMBOL_GPL(kmem_free_q_add);
>> * kmem_free_up_q - free all the objects in the freeing queue
>> * @head: freeing queue head
>> *
>> - * Free all the objects in the freeing queue.
>> + * Free all the objects in the freeing queue. The caller cannot hold any
>> + * non-sleeping locks.
>> */
>> void kmem_free_up_q(struct kmem_free_q_head *head)
>> {
>> struct kmem_free_q_node *node, *next;
>> + bool do_resched = !in_irq();
>> + int cnt = 0;
>>
>> for (node = head->first; node; node = next) {
>> next = node->next;
>> @@ -1634,6 +1637,12 @@ void kmem_free_up_q(struct kmem_free_q_head *head)
>> kmem_cache_free(node->cachep, node);
>> else
>> kfree(node);
>> + /*
>> + * Call cond_resched() every 256 objects freed when in
>> + * process context.
>> + */
>> + if (do_resched && !(++cnt & 0xff))
>> + cond_resched();
> Why not just: cond_resched() ?

cond_resched() calls ___might_sleep(). So it is prudent to check for
process context first to avoid an erroneous message. Yes, I can call
cond_resched() after every free. I added the count just so that it is not
called too frequently.

Cheers,
Longman

2019-03-22 16:11:26

by Waiman Long

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On 03/22/2019 07:16 AM, Oleg Nesterov wrote:
> On 03/21, Matthew Wilcox wrote:
>> On Thu, Mar 21, 2019 at 05:45:10PM -0400, Waiman Long wrote:
>>
>>> To avoid this dire condition and reduce lock hold time of tasklist_lock,
>>> flush_sigqueue() is modified to pass in a freeing queue pointer so that
>>> the actual freeing of memory objects can be deferred until after the
>>> tasklist_lock is released and irq re-enabled.
>> I think this is a really bad solution. It looks kind of generic,
>> but isn't. It's terribly inefficient, and all it's really doing is
>> deferring the debugging code until we've re-enabled interrupts.
> Agreed.

Thanks for looking into that. As I am not knowledgeable enough about the
signal handling code path, I chose the lowest-risk approach of deferring
the memory deallocation until after the tasklist_lock is released, without
trying to change the code flow.

>> We'd be much better off just having a list_head in the caller
>> and list_splice() the queue->list onto that caller. Then call
>> __sigqueue_free() for each signal on the queue.
> This won't work, note the comment which explains the race with sigqueue_free().
>
> Let me think about it... at least we can do something like
>
> close_the_race_with_sigqueue_free(struct sigpending *queue)
> {
> struct sigqueue *q, *t;
>
> list_for_each_entry_safe(q, t, ...) {
> if (q->flags & SIGQUEUE_PREALLOC)
> list_del_init(&q->list);
> }
>
> called with ->siglock held, tasklist_lock is not needed.
>
> After that flush_sigqueue() can be called lockless in release_task() release_task.
>
> I'll try to make the patch tomorrow.
>
> Oleg.
>
I am looking forward to it.

Thanks,
Longman



Subject: Re: [PATCH 1/4] mm: Implement kmem objects freeing queue

On Thu, 21 Mar 2019, Waiman Long wrote:

> When releasing kernel data structures, freeing up the memory
> occupied by those objects is usually the last step. To avoid races,
> the release operation is commonly done with a lock held. However, the
> freeing operations do not need to be under lock, but are in many cases.
>
> In some complex cases where the locks protect many different memory
> objects, that can be a problem especially if some memory debugging
> features like KASAN are enabled. In those cases, freeing memory objects
> under lock can greatly lengthen the lock hold time. This can even lead
> to soft/hard lockups in some extreme cases.
>
> To make it easer to defer freeing memory objects until after unlock,
> a kernel memory freeing queue mechanism is now added. It is modelled
> after the wake_q mechanism for waking up tasks without holding a lock.

It is already pretty easy. You just store the pointer to the slab object
in a local variable, finish all the unlocks and then free the objects.
This is done in numerous places in the kernel.

I fear that the automated mechanism will make the code more difficult to
read and result in a loss of clarity of the sequencing of events in
releasing locks and objects.

Also there is already kfree_rcu(), which does a similar thing to what you
are proposing here and is used in numerous places.


by Christopher Lameter

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Fri, 22 Mar 2019, Waiman Long wrote:

> I am looking forward to it.

There is also already RCU being used in these paths. Would kfree_rcu()
not be enough? It is an established mechanism that is mature and well
understood.
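
For reference, using kfree_rcu() here would look something like the sketch
below; note that struct sigqueue would first have to grow an rcu_head
member, which it does not have today:

        struct sigqueue {
                struct list_head list;
                int flags;
                kernel_siginfo_t info;
                struct user_struct *user;
                struct rcu_head rcu;    /* new: needed for kfree_rcu() */
        };

        /* in __sigqueue_free(), instead of kmem_cache_free(sigqueue_cachep, q): */
        kfree_rcu(q, rcu);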


2019-03-22 18:13:32

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On 03/22/2019 01:50 PM, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
>> I am looking forward to it.
> There is also alrady rcu being used in these paths. kfree_rcu() would not
> be enough? It is an estalished mechanism that is mature and well
> understood.
>
In this case, the memory objects are from kmem caches, so they can't be
freed using kfree_rcu().

There is certainly overhead in using kfree_rcu(), or a kfree_rcu()-like
mechanism. Also, I think the actual freeing is done in SoftIRQ context,
which can be a problem if there are too many memory objects to free.

I think what Oleg is trying to do is probably the most efficient way.

Cheers,
Longman



by Christopher Lameter

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Fri, 22 Mar 2019, Waiman Long wrote:

> >
> >> I am looking forward to it.
> > There is also alrady rcu being used in these paths. kfree_rcu() would not
> > be enough? It is an estalished mechanism that is mature and well
> > understood.
> >
> In this case, the memory objects are from kmem caches, so they can't
> freed using kfree_rcu().

Oh they can. kfree() can free memory from any slab cache.

> There are certainly overhead using the kfree_rcu(), or a
> kfree_rcu()-like mechanism. Also I think the actual freeing is done at
> SoftIRQ context which can be a problem if there are too many memory
> objects to free.

No, there is none that I am aware of. And this has survived testing of HPC
loads with gazillions of objects that have to be freed from multiple
processors. We really should not rearchitect this stuff... It took us
quite a long time to get this to scale well under all loads.

Please use kfree_rcu().



2019-03-22 20:02:26

by Matthew Wilcox

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
> > >
> > >> I am looking forward to it.
> > > There is also alrady rcu being used in these paths. kfree_rcu() would not
> > > be enough? It is an estalished mechanism that is mature and well
> > > understood.
> > >
> > In this case, the memory objects are from kmem caches, so they can't
> > freed using kfree_rcu().
>
> Oh they can. kfree() can free memory from any slab cache.

Only for SLAB and SLUB. SLOB requires that you pass a pointer to the
slab cache; it has no way to look up the slab cache from the object.


by Christopher Lameter

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Fri, 22 Mar 2019, Matthew Wilcox wrote:

> On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > On Fri, 22 Mar 2019, Waiman Long wrote:
> >
> > > >
> > > >> I am looking forward to it.
> > > > There is also alrady rcu being used in these paths. kfree_rcu() would not
> > > > be enough? It is an estalished mechanism that is mature and well
> > > > understood.
> > > >
> > > In this case, the memory objects are from kmem caches, so they can't
> > > freed using kfree_rcu().
> >
> > Oh they can. kfree() can free memory from any slab cache.
>
> Only for SLAB and SLUB. SLOB requires that you pass a pointer to the
> slab cache; it has no way to look up the slab cache from the object.

Well then we could either fix SLOB to conform to the others or add a
kmem_cache_free_rcu() variant.


2019-03-25 15:27:46

by Matthew Wilcox

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Mon, Mar 25, 2019 at 02:15:25PM +0000, Christopher Lameter wrote:
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
>
> > On Fri, Mar 22, 2019 at 07:39:31PM +0000, Christopher Lameter wrote:
> > > On Fri, 22 Mar 2019, Waiman Long wrote:
> > >
> > > > >
> > > > >> I am looking forward to it.
> > > > > There is also alrady rcu being used in these paths. kfree_rcu() would not
> > > > > be enough? It is an estalished mechanism that is mature and well
> > > > > understood.
> > > > >
> > > > In this case, the memory objects are from kmem caches, so they can't
> > > > freed using kfree_rcu().
> > >
> > > Oh they can. kfree() can free memory from any slab cache.
> >
> > Only for SLAB and SLUB. SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

The problem with a kmem_cache_free_rcu() variant is that we now have
three pointers to store -- the object pointer, the slab pointer and the
rcu next pointer.

I spent some time looking at how SLOB might be fixed, and I didn't come
up with a good idea. Currently SLOB stores the size of the object in the
four bytes before the object, unless the object is "allocated from a slab
cache", in which case the size is taken from the cache pointer instead.
So calling kfree() on a pointer allocated using kmem_cache_alloc() will
cause corruption as it attempts to determine the length of the object.

Options:

1. Dispense with this optimisation and always store the size of the
object before the object.

2. Add a kmem_cache flag that says whether objects in this cache may be
freed with kfree(). Only dispense with this optimisation for slabs
with this flag set.

3. Change SLOB to segregate objects by size. If someone has gone to
the trouble of creating a kmem_cache, this is a pretty good hint that
there will be a lot of objects of this size allocated, so this could
help SLOB fight fragmentation.

Any other bright ideas?

by Christopher Lameter

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On Mon, 25 Mar 2019, Matthew Wilcox wrote:

> Options:
>
> 1. Dispense with this optimisation and always store the size of the
> object before the object.

I think that's how SLOB handled it at some point in the past. Let's go back
to that setup so it's compatible with the other allocators?

2019-03-26 13:31:07

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

Sorry, I am sick and can't work, hopefully I'll return tomorrow.

On 03/22, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Waiman Long wrote:
>
> > I am looking forward to it.
>
> There is also alrady rcu being used in these paths. kfree_rcu() would not
> be enough? It is an estalished mechanism that is mature and well
> understood.

But why do we want to increase the number of rcu callbacks in flight?

For the moment, let's discuss the exiting tasks only. The only reason why
flush_sigqueue(&tsk->pending) needs spin_lock_irq() is the race with
release_posix_timer()->sigqueue_free() from another thread, which can remove
a SIGQUEUE_PREALLOC'ed sigqueue from the list. With the simple patch below,
flush_sigqueue() can be called lockless with irqs enabled.

However, this change is not enough, we need to do something similar with
do_sigaction()->flush_sigqueue_mask(), and this is less simple.

So I won't really argue against kfree_rcu(), but I am not sure it is the
best option.

Oleg.


--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -85,6 +85,17 @@ static void __unhash_process(struct task_struct *p, bool group_dead)
list_del_rcu(&p->thread_node);
}

+// Rename me and move into signal.c
+void remove_prealloced(struct sigpending *queue)
+{
+ struct sigqueue *q, *t;
+
+ list_for_each_entry_safe(q, t, &queue->list, list) {
+ if (q->flags & SIGQUEUE_PREALLOC)
+ list_del_init(&q->list);
+ }
+}
+
/*
* This function expects the tasklist_lock write-locked.
*/
@@ -160,16 +171,15 @@ static void __exit_signal(struct task_struct *tsk)
* Do this under ->siglock, we can race with another thread
* doing sigqueue_free() if we have SIGQUEUE_PREALLOC signals.
*/
- flush_sigqueue(&tsk->pending);
+ if (!group_dead)
+ remove_prealloced(&tsk->pending);
tsk->sighand = NULL;
spin_unlock(&sighand->siglock);

__cleanup_sighand(sighand);
clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
- if (group_dead) {
- flush_sigqueue(&sig->shared_pending);
+ if (group_dead)
tty_kref_put(tty);
- }
}

static void delayed_put_task_struct(struct rcu_head *rhp)
@@ -221,6 +231,11 @@ void release_task(struct task_struct *p)
write_unlock_irq(&tasklist_lock);
cgroup_release(p);
release_thread(p);
+
+ flush_sigqueue(&p->pending);
+ if (thread_group_leader(p))
+ flush_sigqueue(&p->signal->shared_pending);
+
call_rcu(&p->rcu, delayed_put_task_struct);

p = leader;


2019-03-26 13:38:01

by Oleg Nesterov

Subject: Re: [PATCH 2/4] signal: Make flush_sigqueue() use free_q to release memory

On 03/25, Christopher Lameter wrote:
>
> On Fri, 22 Mar 2019, Matthew Wilcox wrote:
>
> > Only for SLAB and SLUB. SLOB requires that you pass a pointer to the
> > slab cache; it has no way to look up the slab cache from the object.
>
> Well then we could either fix SLOB to conform to the others or add a
> kmem_cache_free_rcu() variant.

Speaking of struct sigqueue, we can simply use call_rcu(), but see my
previous email; I am not sure this is the best option.

Oleg.